PitchBook June 22, 2015
In an effort to shed light on the interesting and cutting-edge ways our research, data science and product teams are leveraging advanced computing and technology to hone our platform, we will be producing an ongoing series of more technical posts. Please enjoy the first of many more to come.
Keyword Recommender
A primary use case of the PitchBook Platform is discovery, whether for deal sourcing, comps analysis, industry research, sales prospecting…the list goes on. Keywords, typically related to a company’s business, industry or products, are the primary way people conduct this discovery, so we are building a next-generation keyword recommender that gives users an improved keyword search experience and also recommends keywords that relate to matching or similar companies.
One of the first issues that arises is matching the user’s query with the list of keywords in our database. For example, if the user misspells “healthcare” and searches “healcare” instead, we want to be able to match it to “healthcare” and return the correct results. We call this search term correction.
Search Term Correction
Correcting search queries can be important for catching misspellings. One way to do this is through Levenshtein distance. Levenshtein distance is a measurement of the number of edits between two words. For example (from Wikipedia), the Levenshtein distance between “kitten” and “sitting” is three since you can change one word into the other with the following edits:
You can see how this might be useful for correcting our misspelled “healcare” search. Unfortunately, Levenshtein distance is expensive to calculate, so if you wanted to check “healcare” against a dataset of 200,000 words, this would take awhile. We can, however, create a hashing algorithm that hashes terms with high Jaccard similarity into the same hashing bucket called MinHash.
Before we go any further, here is a quick introduction to Jaccard similarity. Jaccard similarity is a simple calculation that allows you to measure the similarity between sets. Say we have two sets, a blue set with elements {a, b, c, d} and a red set with elements {c, d, e, f}. The Jaccard similarity is the size of their intersection divided by the size of their union.
In this case, there are two elements in their intersection, {c, d}, and six elements in their union, {a, b, c, d, e, f}. So the Jaccard similarity is 2/6, or 1/3.
MinHashing is a technique to hash sets such that sets with high Jaccard similarity are likely to be hashed into the same bucket. Say we have the following two sets:
A = {a,b,c}, B = {b,c,d}
And say we have some hash function h that hashes elements of the sets. To get the MinHash, we apply h to each value of the set and take the minimum value:
hmin(A) = min{h(a),h(b),h(c)} = min {564,231,632} = 231
hmin(B) = min{h(b),h(c),h(d)} = min {231,632,875} = 231
You can see that both sets will be hashed to the same bucket. In fact, we can obtain some nice theoretical guarantees about when sets will be hashed to the same bucket:
p(hmin(A) = hmin(B)) = Jaccard (A,B)
The probability that the MinHash of both sets equals each other (they’re hashed into the same bucket) is equal to the Jaccard similarity between the sets. We can use this to narrow the number of terms to calculate Levenshtein distance against if our sets represent words. For example, “apple” could be turned into the following 2-grams {“ap”, “pp”, “pl”, “le”} and then we can use MinHash on this set. Going back to our healthcare example, we would turn “healthcare” into the following set {“he”, “ea”, “al”, “lt”, “th”, “hc”, “ca”, “ar”, “re”} and then use MinHash on this set. You can see how this would overlap with the misspelled search “healcare” so the probability of these landing in the same hash bucket is high.
Implementation
We can implement a MinHash table fairly easily in Python with three functions: one for creating the ngram, one for the hash function and one for actually constructing the hash table. We can create an ngram with the following code:
Note here that if the length of the word is smaller than n, it will just return the word. We can use this to help create the hash function:
And finally we can use both of these functions to create the hash table:
We’ve defaulted to using 2-grams here, which can cause a few problems. If we enter in a single-character word, there’s a higher chance that it will not collide with any hash value in the MinHash table. Also, if someone searches for a strange term, there’s a chance it will not collide with any hash value. For these cases, you can always default to doing the Levenshtein distance against the entire dataset or use alternative hash tables using different ngrams.
To query the hash table, we can use the following code:
Here we are using the library fuzzywuzzy to calculate the Levenshtein distance. This library can be found here:https://github.com/seatgeek/fuzzywuzzy.
+1 (206) 623.1986
901 Fifth Avenue
Suite 1200
Seattle, WA 98164
+44 (0)207.190.9809
1 Oliver’s Yard
55-71 City Road
London EC1Y 1HQ
United Kingdom
+1 (206) 623.1986
155 Fifth Avenue
Suite 500
New York, NY 10010
© 2017 PitchBook Data. All rights reserved. Venture capital, private equity and M&A data and technology provider.