One thing that annoys me about some websites is search functionality, and the lack of relevant results that they return. In theory a search feature is a relatively simple concept, you type a keyword for what you are looking for, and a list of matching results are provided. While this looks pretty simple on the front end, the back-end of search queries is a complicated process.
For me, when creating a new search system, it’s always best to start it out pretty simple. A simple database query to find a simple match in a simple database field. Let’s use a recent project I’ve been working on as an example.
Let’s assume our table looks like this.
+-------+------+--------+----------------+-------+---------------+--------------+------------+ | entry | item | system | title | extra | description | tags | date | +-------+------+--------+----------------+-------+---------------+--------------+------------+ | 11 | 47 | 2 | Anthonys Video | 23412 | anthony sho.. |anthony,kinson| 1346172183 |
SELECT * FROM tag_items WHERE title='Anthonys Video'
To simple right? The reason for this is that I want to just return something, anything at all, so that I can start programming and styling the front end to start receiving data for output. At this point there is no sense in me creating complex queries because I’m not absolutely certain that the table or front end will continue to work the way it is at this point. However, once I have completed my front end and have it all functioning we knock things up a step, and this next step is usually where most websites leave it.
SELECT * FROM tag_items WHERE title LIKE '%Ant%' ORDER BY date DESC LIMIT 0,10
So as you can tell from this query, we’re searching the table for items which have a “Title” like “Ant” (where Ant is the search term) and orders them so the most recent entries will show first. Some websites will give various options for this type of search, such as a drop-down offering to search by Title, Description or tag words. This is all very well for a lot of sites as they don’t need anything more expansive than this, especially when the user knows exactly what they are looking for.
The problem with this is that with today’s internet, search is used a lot as a discovery portal, a way for users to find and discover new things which are of more relevance to what they are looking for. To do this, we need to offer a search that doesn’t just deliver content to a user based on a single factor such as a title.
At Duxter, to tackle this problem I devised a simple relevance and weighting system, we take each possible search field, and we assign it a value. In this case we’ll use the following where a higher number is more relevant.
Exact “title” Match = 100 points
Partial “title” Match = 10 points
Partial “tags” Match = 7 points
Partial “extra” match = 5 points
Partial “description” match = 2 points
Let me explain these values. If you search for “Big Foot Spotted” then chances are you know what you’re looking for. If an entry is found with the exact title “Big Foot Spotted”, this will be highly relevant so we give it an excessively high score so that it’s listed above all other results which may only of had partial matches. For example, if we had an entry called “Big Spotted Foot” it is relevant, but probably not what the user was looking for. For this reason we give partial matches of a title 10 points. We then take into account other factors, such as what tags the item has, extra data supplied, and the description.
Tags are usually a good way to find content based on keywords, but tags get reused a lot and often assigned to things with little relevance, so we consider these less important than the title of the item. Extra, for us, stores extra identifying information, such as video ID’s and object keys, we only use this in our search queries so that you can search for items based off of a video id . Lastly is our lowest priority, the description. The reason for this is because descriptions often contain a lot of words so there’s more chance of matching something irrelevant.
The way we handle this with an SQL statement is as follows.
SELECT *, sum(relevance) FROM ( SELECT *, 100 AS relevance FROM tag_items WHERE title='Anthony' UNION SELECT *, 10 AS relevance FROM tag_items WHERE title like '%Anthony%' UNION SELECT *, 7 AS relevance FROM tag_items WHERE tags like '%Anthony%' UNION SELECT *, 5 AS relevance FROM tag_items WHERE extra like '%Anthony%' UNION SELECT *, 2 AS relevance FROM tag_items WHERE description like '%Anthony%' ) results GROUP BY entry ORDER BY sum(relevance) desc
So here I have created a number of select statements that will grab items where a field matches the search term. Depending on the match that is found a relevance value is given and all the results get grouped together by their entry id. We then sort them by which item has the highest relevance count.
an example of a returned result would be
+-------+------+--------+----------------+-------+---------------+--------------+----------------+ | entry | item | system | title | extra | description | tags | sum(relevance) | +-------+------+--------+----------------+-------+---------------+--------------+----------------+ | 11 | 47 | 1 | Anthony | 23412 | Anthony sho.. |anthony,kinson| 109 | | 25 | 22 | 4 | Anthonys video | 23876 | This is ant.. |anthony,video,| 19 | | 83 | 63 | 4 | Last Nights g. | 23747 | Anthony loo.. |anthony,game,l| 9 | | 22 | 13 | 3 | Seattle even.. | 92834 | Our event in. |seatle,anthony| 7 |
So as you can see above, each item was given a combined relevance score, the higher the score the more relevant the item was to the search query. Searching for Anthony in this case we found an exact match in system 1 (which is a user). If you searched Anthony, chances are that the user with the name Anthony is who you are looking for and will be the most relevant result to show you, we then found a video (system 4) which had a partial title match, a tag and a description match, and so on and so forth.
Now, the system I have set up for Duxter is much more complicated than this, I’ve tried to simplify it as much as possible to show how it works. We in fact have a separate table with indexed tag words and keys and we return the top 5 results from each “system” in our quick ajax search feature. things start getting real complicated at this point, so we’ll save that for another day. for now though, I hope this look into making searches a bit more relevant comes in handy for you.
Until next time.