Aug 24, 2008 5:45:44 PM
Although MySQL's fulltext index functionality allows the relevance of search results to be determined easily, it is often the case that relevance is dependent on non-textual columns, or on business logic that is not easily modeled in a relational database. There are ways, luckily, to score results on these types of factors, but the implementation is not trivial.
The Problem
This is especially true when scoring is not absolute. For instance, consider a database table containing a list of people and some personal information about each:
CREATE TABLE people (
`id` INT(11) NOT NULL AUTO_INCREMENT PRIMARY KEY,
`name` VARCHAR(255) NOT NULL,
`height` INT(4) NOT NULL, # inches
`gender` ENUM('M', 'F') NOT NULL,
`interests` VARCHAR(255) NOT NULL
);
INSERT INTO `people` VALUES (NULL, 'Jerry', 71, 'M', 'comedy restaurants sports');
INSERT INTO `people` VALUES (NULL, 'Elayne', 66, 'F', 'restaurants reading writing');
INSERT INTO `people` VALUES (NULL, 'George', 67, 'M', 'sports');
INSERT INTO `people` VALUES (NULL, 'Kramer', 76, 'M', 'karate sports');
CREATE FULLTEXT INDEX `interests` ON `people` (`interests`);
Now consider an algorithm which determines how good a romantic match two entries in `people` are. Since scoring is relative, it's impossible to precalculate a person's score. Determining a person's best match, therefore, is an O(N) algorithm which requires one iteration through each row in the table.
So the two questions are:
- How do we calculate a score (in this case, romantic compatibility) across multiple columns?
- How do we do it efficiently?
Weighted scoring
One simple solution to this problem is to assign weighted scores to each column. Back to our example, for instance, let's say:
- Imperative: A match must be of the opposite sex (as the Seinfeld characters I've used are all heterosexual)
- Important: A match should have as many interests in common.
- Unimportant: A match should be as close in height as possible.
We can now assign arbitrary scorings based on this information. Let's assume that we're trying to find the best match for Elayne:
SELECT `name`, (
ROUND(MATCH(`interests`) AGAINST ('restaurants reading' IN BOOLEAN MODE)) * 20 + # interests
IF(`height` = 66, 10, ROUND(1 / ABS(`height` - 66) * 10)) # height difference - if used to avoid division by 0
) AS `score` FROM `people` WHERE `gender` != 'F' ORDER BY `score` DESC;
+--------+-------+
| name | score |
+--------+-------+
| Jerry | 22 |
| George | 10 |
| Kramer | 1 |
+--------+-------+
3 rows in set (0.00 sec)
What we've done is give height different a maximum score of 10 (where the smaller the difference the higher the score), and interests a maximum score of 20 * the number of Elayne's interests (due to the way MySQL scores fulltext results, the maximum value is the maximum number of occurrences of the query's terms in the test value). According to this algorithm, Jerry is the best romantic match for Elayne. As you'd predict, if you remove the gender requirement from the SQL statement, Elayne gets the highest score of 70 (10 + 20 * 3).
Normalization
This method can be extended by adding more columns to our `people` table, and by adding those columns to the calculation. The problem, though, is that as you add columns, the individual score on its own becomes meaningless. For instance, it's meaningless to say 'Jerry is 22 compatible with Elayne'. Far more useful is to calculate 22's percentage of the perfect score (70), and to say that 'Jerry is 31.4% compatible with Elayne'.
There are two ways to achieve this. The first is two run two queries: one to calculate the perfect score (by running the algorithm against the test subject), and the second to fetch all the values, dividing `score` by the maximum score. The other is to retrieve all the rows in one query and use a scripting language to determine the normalized value for each row. Zend Framework allows you to create custom classes for your result rowsets, so the normalization functionality can just be added to that class.
Cleaning up
As you can see, even with a simple algorithm the SQL can get messy. The quickest way to clean up is to stick the SELECT into a MySQL stored function. You can do this as follows:
DELIMITER |
CREATE FUNCTION compatibility (person INT, test INT) RETURNS INT
BEGIN
DECLARE x ENUM('M', 'F');
DECLARE y INT(4);
DECLARE z VARCHAR(255);
DECLARE s INT;
SELECT `gender`, `height`, `interests` INTO x, y, z FROM `people` WHERE `id` = person;
SELECT IF(`gender` = x, 0, (ROUND(MATCH(`interests`) AGAINST (z IN BOOLEAN MODE)) * 20 + IF(`height` = y, 10, ROUND(1 / ABS(`height` - y) * 10)))) INTO s FROM `people` WHERE `id` = test;
RETURN s;
END |
DELIMITER ;
SELECT name, compatibility(2, id) FROM people;
Discussion
Subscribe to an RSS feed of these commentsQuinton
Aug 25, 2008 8:30:44 AM
Heh. Been trying out MySQL 5's stored procs myself lately. Very cool stuffI might be wrong but i see a potential bug. Given the tables in use are MYISAM, due to implicit locking, you'll need to LOCK TABLES before the SELECT INTO and release after the SELECT that calculates the MATCH/relevancy because a concurrent call to compatibility() can also be in effect by another client. Thus one client could be inadvertently reading another client's compatibility calculation. Yikes.
Also i imagine problems scaling/paginating a large result set, in particular ordering the "score" column will be nasty. And surely managing the matching algorithm isn't going to be easy via a stored proc.
May i suggest another solution... create a sphinx index on the people table. Then to perform relevancy matching its a matter of SELECTing the user record that's looking for a match from the people table. Then using that row data to search against the sphinx index (excluding the user looking for a match of course).
Sphinx is faster than myisam's full-text (it'll take the search load off the db too) and then ur also free to use any storage engine of your choice for your people table. Naturally, sphinx wil present a matching score
http://www.sphinxsearch.com/
And no, i'm not a sphinx promoter! hah
Quinton
Aug 25, 2008 8:42:12 AM
Wait oops. I was wrong about the table locking thing. I didnt read the query right the first time. I thought it was updating the people table. doh! forgive me :-)So ignore the rubbish about the table-locking requirement.
SELECT name, compatibility(2, id) FROM people;
You'll still have a serious problem paginating large results though. With a function in the SELECT query cache is negated and if you intend using an ORDER BY and/or LIMIT full-table scans will be in effect
Neil Garb
Aug 25, 2008 8:44:41 AM
@Quinton: thanks a stack for your input. Are you saying that stored procs share local variables between calls?Will check out Sphinx -- never heard of it :)
Neil Garb
Aug 25, 2008 8:48:26 AM
@Quinton: ah missed your second comment.Re: query cache. Will that be the case even with the 'READS SQL' characteristic?
Quinton
Aug 25, 2008 11:30:51 AM
Yeah... mysql states any user function used in a SELECT negates the qcacheTo get a bit of a taste of sphinx in action check out iol motoring's search... apparently Cillier Burger implemented it recently and is going to roll it out to iol.co.za search as well
Your comment