Have you ever looked at Google's “Did you mean” field and wondered about the magical sorcery their servers use to extract some kind of sensible meaning from your autocorrect gibberish? Well, at least I have, but sorry to disappoint you: I don't have a single clue how they do it. Too bad, I guess. But I had to solve a very similar problem recently and eventually did, and although I am not entirely sure why it works, I kinda like my solution.
To give you a general idea of what my goal was, this is what I tried to achieve: I have a list of song titles that I want to search, but with a catch. The search query should match even if it is off by quite a bit. Wrong word order, missing characters, and capitalization are not that hard to test for, but typos, misspellings, or touch-keyboard jumble are way harder. So much so that I was pretty much at a loss when I first started looking into it. I knew I wanted some way of calculating a similarity score for my database records, which could be used to rank the first few best matches. This is called “fuzzy matching” and seems to be a pain in the backside.
The algorithm
Of course, my first thought was “Is there anything premade?”. Scouring the internet, mostly StackOverflow, I had to realize that there aren't that many options for MariaDB, which is the SQL database I use. Turns out there is something called an “ngram parser” that gave the impression of doing mostly what I needed, judging from the little bit of testing that I did. However, it is not supported by MariaDB, so I had to do a little more digging.
People kept recommending building a set of LIKE statements for the search query, but if the user input is longer than a word, the possible permutations just get ridiculous. I also read that you could build an index that maps each text record to a phonetic representation via a simple algorithm called “Soundex”. This is great for misspelled queries, where the user might know the correct pronunciation but not the correct spelling. As far as I could tell, however, this approach struggles with longer strings made up of multiple words. Moreover, it does not take different word orders into account, and most problematically, the user might input additional or too few characters, so that the phonetic representation changes too much to be recognized correctly.
So, I tried to find out a bit more about this “ngram” thing that seemed to do the job quite well. After reading multiple Wikipedia articles whose content I could barely understand, I at least had something to start from. An N-gram is a slice of a string with a length of N characters. With N = 3 and a $ character padding the word boundaries, the word hello can be deconstructed into $$h, $he, hel, …, o$$. By converting all records in your database like this, an index can be built, which should be easier to search.
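For illustration, a minimal JS sketch of such a gram function could look like this (the $ padding and N = 3 match the hello example above; the names are mine, not from any particular library):

// Split a word into n-grams of length n, padding the word
// boundaries with '$' as in the "hello" example above.
function ngrams(word, n = 3) {
  const padded = '$'.repeat(n - 1) + word + '$'.repeat(n - 1);
  const grams = [];
  for (let i = 0; i <= padded.length - n; i++) {
    grams.push(padded.slice(i, i + n));
  }
  return grams;
}

console.log(ngrams('hello'));
// [ '$$h', '$he', 'hel', 'ell', 'llo', 'lo$', 'o$$' ]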
Now the task of creating a similarity score for all entries compared to a user input has to be handled. The Wikipedia article describes the Dice coefficient, which calculates a value between zero and one for how similar two data sets are. I was not sure how good of a fit this would be for my kind of application, as in some cases a very specific word should match a record that is way longer. Therefore, I settled on computing the sum of absolute differences (SAD) of the occurrences of grams. But to remove the length mismatch as a factor, grams that appear only in the record are ignored. My prototype implementation is written in JS and uses a hashmap as an index instead of an actual DB. The index uses the grams as keys and maps to a list of songs containing the gram.
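To make the shape of that index concrete, here is a rough sketch of how it could be built (this is not the gist's actual code, and it already uses the unpadded bigrams described in the next paragraph):

// Cut a word into overlapping two-character grams, without padding:
// "hello" -> he, el, ll, lo
function bigrams(word) {
  const grams = [];
  for (let i = 0; i < word.length - 1; i++) {
    grams.push(word.slice(i, i + 2));
  }
  return grams;
}

// Index: gram -> list of ids of songs whose title contains that gram.
function buildIndex(songs) {
  const index = new Map();
  for (const song of songs) {
    const words = song.title.toLowerCase()
      .replace(/[^\w\s]/g, '') // drop non-word characters, keep whitespace
      .split(/\s+/)
      .filter(Boolean);
    for (const word of words) {
      for (const gram of bigrams(word)) {
        if (!index.has(gram)) index.set(gram, []);
        const ids = index.get(gram);
        if (!ids.includes(song.id)) ids.push(song.id);
      }
    }
  }
  return index;
}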
When a user inputs some text as a search query, it is first sanitized of any non-word characters except whitespace, split into words, and each word converted into grams. This is based on the assumption that users might enter jumbled-up words but by and large split their words correctly. As the bigrams always overlap, they automatically convey positional information about where they will most likely appear. Therefore, I don't want to create grams that overlap word boundaries, as word order should not matter. This also has the side effect that each gram has to contain exactly two characters. Hence, the word “I” will actually be removed from the search query altogether. (To me as a German speaker it's pretty odd to see a word made up of a single letter.) The list of grams is then run through a hashmap to count the occurrences of each unique gram type. This list is then used to get all possible candidates from the index that contain at least one matching gram.
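Sketched in JS, the query side could look roughly like this (again an illustration rather than the exact gist code; it reuses the bigrams helper from above):

// Sanitize the query, split it into words, and count the occurrences
// of each unique bigram. One-letter words like "I" produce no bigrams
// and silently disappear.
function queryGrams(query) {
  const counts = new Map();
  const words = query.toLowerCase()
    .replace(/[^\w\s]/g, '')
    .split(/\s+/)
    .filter(Boolean);
  for (const word of words) {
    for (const gram of bigrams(word)) {
      counts.set(gram, (counts.get(gram) || 0) + 1);
    }
  }
  return counts;
}

queryGrams('beeth'); // Map { 'be' => 1, 'ee' => 1, 'et' => 1, 'th' => 1 }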
Each song also has a list of its counted grams with additional information about which positions they occupy; this will come in handy later on. A loop iterates over all candidate records to compute a SAD score of gram occurrences (gsad). If a song does not contain a certain type of query gram, the whole number of them is added to the gsad. If the song includes the gram type but has fewer than the number in the query, the difference is added. On the other hand, if the song has more of the gram type than are present in the query, the gsad remains unchanged. This decision is based on the assumption that the input is shorter than most records, and a good match should not depend on the length of the query.
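In code, that gsad rule boils down to just a few lines (a sketch under the same assumptions as before):

// gsad: for every query gram, add however many occurrences the song
// is missing; surplus occurrences in the song are ignored.
function gsad(queryCounts, songCounts) {
  let sad = 0;
  for (const [gram, qCnt] of queryCounts) {
    const sCnt = songCounts.get(gram) || 0;
    if (sCnt < qCnt) sad += qCnt - sCnt;
  }
  return sad;
}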
Based on this, the algorithm creates a list of song titles ordered by their gsad score, the best ones having a low score. When testing it out, I was surprised how well it worked at finding the correct song title, but the following few results were usually a bit wacky. The main problem was that the correct match most commonly had a score of zero or one, but then a whole bunch of candidates followed with the same gsad of two. As there was no ordering inside the same gsad group, the results from the second or third position onwards didn't make much sense. To combat this, I wanted to prefer the entries in each group which respect the correct order of grams.
Again, I decided to compute a SAD value, but this time based on the positions of occurring grams (psad). If a query's grams match the positions in a certain record, it will be ranked higher. This should take into account that the grams might cluster around a particular area, which might not be the beginning. For example, if the user searches with a word that matches a DB entry very well, meaning the correct type and number of grams exist (gsad) and the grams accumulate in a very similar order (psad), it shouldn't matter that the word happens to be in the center of the song title.
Calculating the psad is done by running two nested loops that try to find the closest matching gram in the record for each gram in the query. The absolute differences are then again accumulated. From this, an offset value is subtracted, which is computed in a loop beforehand that finds the first query gram type the record contains and where its first occurrence is. This positional offset compensates for the position of the cluster area inside the record.
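This is my reading of that procedure as a JS sketch (the names are hypothetical; songPositions maps each gram of a song to the sorted list of positions where it occurs):

// Offset: position of the first occurrence of the first query gram
// type that the song contains at all.
function clusterOffset(queryGramList, songPositions) {
  for (const gram of queryGramList) {
    if (songPositions.has(gram)) return songPositions.get(gram)[0];
  }
  return -1;
}

// psad: for each query gram, accumulate the distance to its closest
// occurrence in the song, then subtract the cluster offset once per
// matched gram.
function psad(queryGramList, songPositions) {
  const offset = clusterOffset(queryGramList, songPositions);
  if (offset < 0) return Infinity; // no query gram matches at all
  let sum = 0;
  let matched = 0;
  for (let qPos = 0; qPos < queryGramList.length; qPos++) {
    const positions = songPositions.get(queryGramList[qPos]);
    if (!positions) continue;
    let best = Infinity;
    for (const pos of positions) {
      best = Math.min(best, Math.abs(pos - qPos));
    }
    sum += best;
    matched++;
  }
  return Math.abs(sum - matched * offset);
}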
Prototyping
For quite some time now, my favorite language for quick and dirty prototyping has been JS: it's dynamic, fast, has great dev tools and a REPL, building a testing UI is super easy in the browser, development is rapid with an auto-reload plugin in Atom, and I'm very familiar with the language. I grabbed a list of the top 100 songs from Wikipedia and started playing around until I ended up with the algorithm described above. When writing this dummy implementation, the focus was on testing the principle, not on code quality or performance. If you still want to take a look at it, I made a gist.
I made a very basic UI consisting of an input field and a button. When searching, a ranked list is printed with the gsad and psad scores for each result. This made testing very easy, and it was instantly obvious when a result ended up in a completely wrong position.
MySQL
Even though my database currently holds only around 100 records, it might get pretty large over time. I want searching thousands of entries to stay as snappy as possible. The problem is that just matching all records that contain at least one of the grams can yield around 30% of the whole DB in some cases I tested. Hence, simply detecting whether a gram is present and then sending the records over to the server is a no-no. If the server has to do all the ranking, it has to parse, hash, and evaluate thousands of records for each query. Just the time required for serialization and deserialization can get pretty substantial at that point. However, sending fewer records doesn't make any sense if they are not ordered beforehand. Therefore, all the ranking, ordering, and finally the reduction of records has to be done by the database.
This meant implementing a multistage mathematical algorithm in MySQL. The problem is, I'm not a MySQL expert. The most complex things I had done up until now were some simple joins and grouped sums for this website. So, I had to do a lot of reading, trial, error, and staring at the screen to get this working, but I eventually did. DB-Fiddle is a great tool for running MySQL in the browser on a custom test DB.
/* Get song title by id */
select sng.title, qry.* from
(
/* Calculate position sad */
select gsr.songId, gsr.gsad, abs(psr.psum - psr.pcnt * gsr.poffset) as psad from
(
/* sum sads and sort results */
select gs.songId, sum(gs.subgram) as gsad, max(gs.poffset) as poffset from
(
/* Subtract occurrence values */
select c.*, case
when c.gram = "be" then if( c.cnt <= 1, abs( c.cnt - 1 ), 0 )
when c.gram = "ee" then if( c.cnt <= 1, abs( c.cnt - 1 ), 0 )
when c.gram = "et" then if( c.cnt <= 1, abs( c.cnt - 1 ), 0 )
when c.gram = "th" then if( c.cnt <= 1, abs( c.cnt - 1 ), 0 )
else 0
end as subgram from
(
/* Join sums and songs/offsets */
select g.*, j.poffset from
(
/* Sum occurrences of grams */
select d.songId, d.gram, sum(d.cnt) as cnt from
(
/* Count grams of each song and ensure that keyword grams exist for each song */
select songId, gram, count(gram) as cnt from songIdx group by songId, gram
union all
select distinct songId, "be" as gram, 0 as cnt from songIdx
union all
select distinct songId, "ee" as gram, 0 as cnt from songIdx
union all
select distinct songId, "et" as gram, 0 as cnt from songIdx
union all
select distinct songId, "th" as gram, 0 as cnt from songIdx
) d group by d.songId, d.gram
) g inner join
(
/* Get position/offset of first occurring keyword gram */
select distinct x.songId, min(x.minpos) as poffset from
(
/* Get pos of first occurrence of each gram */
select songId, gram, min(pos) as minpos from songIdx
where gram in ("be", "ee", "et", "th")
group by songId, gram
) x group by x.songId
) j on g.songId = j.songId
) c
) gs group by songId asc
) gsr inner join
(
/* Sum all diffs (if they exist) and count the number of existing diffs */
select p.songId, sum(p.sub) as psum, count(*) as pcnt from
(
/* Calculate position diff for all songs */
select *, case
when (gram = "be" and rep = 0) then abs( pos - 0 )
when (gram = "ee" and rep = 0) then abs( pos - 1 )
when (gram = "et" and rep = 0) then abs( pos - 2 )
when (gram = "th" and rep = 0) then abs( pos - 3 )
else -1
end as sub from songIdx
) p where p.sub >= 0 group by songId
) psr on gsr.songId = psr.songId order by gsad asc, psad asc
) qry inner join
(
select * from songs
) sng on sng.songId = qry.songId order by gsad asc, psad asc;
This is how the query looks for the search term “beeth”. As you can see, it's massive, at least by my standards. If you know more about MySQL than I do, you could probably figure out a smarter and more compact approach. To make it more digestible to read, but mostly for me to reason about, I structured it in many layers of subqueries.
First, I count the grams of each type for all songs and ensure via multiple unions that every song has an entry for each of the grams in the query, with a default count of zero. The grams in this union are then summed again, removing the zero entry from any record that contains the gram type. This is then inner joined with another subquery, which selects all songs that actually contain one of the grams, along with the position where the first matching gram occurs. At this point, we have a table with each gram count of all songs that contain at least one of the query grams, the song id, and the psad offset.
With these values, the gsad values can be computed for each gram in a custom case statement and then summed up in the parent query. In another subquery, the psad values for each gram are calculated for all songs and then summed and counted. Both queries are then joined again, and the final psad is created by subtracting the positional offset times the number of matched grams. The very last join fetches the actual titles based on the song ids and orders the records by gsad and then psad.
I also made a little script to create the queries automatically, which you can find here.
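The core of it is plain string templating; a stripped-down sketch of the idea (not the actual script) might generate the repetitive union and case fragments like this:

// Build the repetitive UNION ALL and CASE fragments of the ranking
// query from the counted query grams. Real code would additionally
// have to escape the gram strings before splicing them in.
function buildFragments(queryCounts) {
  const unions = [];
  const cases = [];
  for (const [gram, cnt] of queryCounts) {
    unions.push(`select distinct songId, "${gram}" as gram, 0 as cnt from songIdx`);
    cases.push(`when c.gram = "${gram}" then if( c.cnt <= ${cnt}, abs( c.cnt - ${cnt} ), 0 )`);
  }
  return {
    unions: unions.join('\nunion all\n'),
    cases: cases.join('\n'),
  };
}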
Conclusion
Although I had no idea how to realize the algorithm in MySQL, I felt like it should at least be possible. I'm happy that I tried, even though the result is far from perfect. Many of the shortcomings stem from the algorithm itself. It only considers full grams inside a word, instead of also creating shorter boundary grams with only a single letter. Furthermore, it doesn't consider special characters at all, which, depending on the record, might result in it not being indexable. Still, I don't really care that only a few of the results make sense and that even these results might be crappy. If you also keep in mind that I need to handle messed-up user input and that I assume said input is always pretty short, you might be able to guess what I want to use this database for. Any idea? 🤔
The answer is suggestions. My application sends requests to the Spotify Web API when a user does a search query. But to prevent the API from rate-limiting me immediately, I want to cache as many results from Spotify as possible. For the suggestions in the search box, only the cache DB will be used, while the final search is handed over to Spotify. Therefore, I don't need the most accurate results, as the worst-case scenario is wacky suggestions.
Getting it to work was surprisingly easy, was a lot of fun, and only took two short evenings. However, as I'm currently very busy with uni, the project as a whole took about a week. The algorithm I used probably already exists, but as I want to name it anyway, I'm thinking of “g/psad fuzzy search”.