Building Search Engines with Gensim

tripy

  • PhD candidate @ University of Alabama
  • Roll tide 🌀🌊🌀🐘🏈💯
  • Search applications for software maintenance tasks
    • Code search
    • Change request triage
  • Currently visiting researcher @ ABB
    • Looking for new adventures in 2016!
  • I have a Twitter ✌️problem✌️
    • I tweet really dumb things sometimes
  • I am looking for a job

*CLARIFICATION: I do not "have a drinking problem", as my mother likes to claim. I am probably actually quite sober right now!

Let's make a search engine

search("drunk") => [tweets containing the word 'drunk']

search("drunk") => [tweets *related* to the word 'drunk']

In [3]:
import gensim

I'm a gensim contributor

gensim's ✌️design✌️ is not always great, and better tools certainly exist that can supplement or replace parts of this talk, e.g.:

  1. whoosh
  2. nltk
  3. scikit-learn

grep is great, but

  • must search through all documents
  • not aware of language-specific semantics
  • does not rank by relevance

Basics

  1. Prepare the corpus
  2. Model the corpus
  3. Index the corpus
  4. Query fun-time!
In [4]:
cat_phrases = ["meow",
               "meow meow meow hiss",
               "meow hiss hiss",
               "hiss"
              ]
In [5]:
cat_lang = ["meow", "hiss"]

cat_phrase_vectors = [[1, 0],
                      [3, 1],
                      [1, 2],
                      [0, 1]]
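
plot_vecs is a little plotting helper defined in a hidden notebook cell. A minimal 2D sketch of what it might look like, assuming matplotlib (the real helper also handles the 3D call later; this stand-in is purely hypothetical):

import matplotlib.pyplot as plt

def plot_vecs(vecs, query=None, labels=None):
    # hypothetical stand-in: draw each phrase vector from the origin
    ax = plt.axes()
    for x, y in vecs:
        ax.quiver(0, 0, x, y, angles='xy', scale_units='xy', scale=1)
    if query is not None:  # highlight the query vector
        ax.quiver(0, 0, query[0], query[1], angles='xy',
                  scale_units='xy', scale=1, color='red')
    if labels is not None:
        ax.set_xlabel(labels[0])
        ax.set_ylabel(labels[1])
    ax.set_xlim(-1, 4)
    ax.set_ylim(-1, 4)
    plt.show()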
In [6]:
plot_vecs(cat_phrase_vectors, labels=cat_lang)
In [7]:
cat_lang = ["meow", "hiss"]

cat_phrase_vectors = [[1, 0],
                      [3, 1],
                      [1, 2],
                      [0, 1]]
other_cat = [1, 1]

plot_vecs(cat_phrase_vectors, other_cat, labels=cat_lang)
In [8]:
import scipy.spatial.distance

for cat in cat_phrase_vectors:
    # scipy gives cosine *distance*; similarity is 1 minus that
    print("%s =>  %.3f" % (cat, 1 - scipy.spatial.distance.cosine(cat, other_cat)))
[1, 0] =>  0.707
[3, 1] =>  0.894
[1, 2] =>  0.949
[0, 1] =>  0.707
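
Those scores are cosine similarity: the dot product of the two vectors divided by the product of their lengths. A quick check with numpy (cosine_sim is a throwaway helper for this slide, not a gensim function):

import numpy

def cosine_sim(a, b):
    a = numpy.asarray(a, dtype=float)
    b = numpy.asarray(b, dtype=float)
    # dot product, normalized by both vector lengths
    return a.dot(b) / (numpy.linalg.norm(a) * numpy.linalg.norm(b))

cosine_sim([3, 1], other_cat)  # ~0.894, matching scipy above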
In [9]:
cat_lang3 = ["meow", "hiss", "purr"]

cat_phrase_vectors3 = [[1, 0, 3],
                       [3, 1, 1],
                       [1, 2, 0],
                       [0, 1, 3]]

other_cat3 = [1, 1, 2]

plot_vecs(cat_phrase_vectors3, other_cat3, labels=cat_lang3)  # matplotlib whyyy

Preparing the corpus

  1. Tokenizing
In [10]:
text = "Jupyter/IPython notebook export to a reveal.js slideshow is pretty cool!"
text.split()
Out[10]:
['Jupyter/IPython',
 'notebook',
 'export',
 'to',
 'a',
 'reveal.js',
 'slideshow',
 'is',
 'pretty',
 'cool!']
In [11]:
list(gensim.utils.tokenize(text))
Out[11]:
[u'Jupyter',
 u'IPython',
 u'notebook',
 u'export',
 u'to',
 u'a',
 u'reveal',
 u'js',
 u'slideshow',
 u'is',
 u'pretty',
 u'cool']

When do we want to tokenize?

In [12]:
list(gensim.utils.tokenize("Jupyter/IPython"))
Out[12]:
[u'Jupyter', u'IPython']
In [13]:
list(gensim.utils.tokenize("AC/DC"))
Out[13]:
[u'AC', u'DC']
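
Splitting on punctuation isn't always right: "AC/DC" is one band, not two tokens. One workaround is a whitelist of tokens to protect before falling back to a default split; a sketch (PROTECTED and tokenize_protecting are made up for illustration):

import re

PROTECTED = {"AC/DC"}  # hypothetical exception list

def tokenize_protecting(text):
    for chunk in text.split():
        if chunk in PROTECTED:
            yield chunk
        else:
            for token in re.findall(r"\w+", chunk):
                yield token

list(tokenize_protecting("AC/DC covered by Jupyter/IPython"))
# ['AC/DC', 'covered', 'by', 'Jupyter', 'IPython']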

Preparing the corpus

  1. Tokenizing
  2. Normalizing
In [14]:
text.lower()
Out[14]:
'jupyter/ipython notebook export to a reveal.js slideshow is pretty cool!'
  • "ipython" => "ipython"
  • "IPython" => "ipython"
  • "iPython" => "ipython"

When do we want to normalize?

In [15]:
"TriPython".lower()
Out[15]:
'tripython'
In [16]:
"Apple".lower()
Out[16]:
'apple'

Preparing the corpus

  1. Tokenizing
  2. Normalizing
  3. Stopword removal
In [17]:
' '.join(sorted(gensim.parsing.STOPWORDS))
Out[17]:
'a about above across after afterwards again against all almost alone along already also although always am among amongst amoungst amount an and another any anyhow anyone anything anyway anywhere are around as at back be became because become becomes becoming been before beforehand behind being below beside besides between beyond bill both bottom but by call can cannot cant co computer con could couldnt cry de describe detail did didn do does doesn doing don done down due during each eg eight either eleven else elsewhere empty enough etc even ever every everyone everything everywhere except few fifteen fify fill find fire first five for former formerly forty found four from front full further get give go had has hasnt have he hence her here hereafter hereby herein hereupon hers herself him himself his how however hundred i ie if in inc indeed interest into is it its itself just keep kg km last latter latterly least less ltd made make many may me meanwhile might mill mine more moreover most mostly move much must my myself name namely neither never nevertheless next nine no nobody none noone nor not nothing now nowhere of off often on once one only onto or other others otherwise our ours ourselves out over own part per perhaps please put quite rather re really regarding same say see seem seemed seeming seems serious several she should show side since sincere six sixty so some somehow someone something sometime sometimes somewhere still such system take ten than that the their them themselves then thence there thereafter thereby therefore therein thereupon these they thick thin third this those though three through throughout thru thus to together too top toward towards twelve twenty two un under unless until up upon us used using various very via was we well were what whatever when whence whenever where whereafter whereas whereby wherein whereupon wherever whether which while whither who whoever whole whom whose why will with within without would yet you your yours yourself yourselves'
In [18]:
list(gensim.utils.tokenize(text.lower()))
Out[18]:
[u'jupyter',
 u'ipython',
 u'notebook',
 u'export',
 u'to',
 u'a',
 u'reveal',
 u'js',
 u'slideshow',
 u'is',
 u'pretty',
 u'cool']
In [19]:
[word for word in gensim.utils.tokenize(text.lower()) if word not in gensim.parsing.STOPWORDS]
Out[19]:
[u'jupyter',
 u'ipython',
 u'notebook',
 u'export',
 u'reveal',
 u'js',
 u'slideshow',
 u'pretty',
 u'cool']
In [20]:
import nltk
nltk.download('stopwords')
' '.join(sorted(nltk.corpus.stopwords.words('english')))
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/cscorley/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
Out[20]:
u'a about above after again against all am an and any are as at be because been before being below between both but by can did do does doing don down during each few for from further had has have having he her here hers herself him himself his how i if in into is it its itself just me more most my myself no nor not now of off on once only or other our ours ourselves out over own s same she should so some such t than that the their theirs them themselves then there these they this those through to too under until up very was we were what when where which while who whom why will with you your yours yourself yourselves'

When do we want to remove stop words?

In [21]:
[word for word in "the cat sat on the keyboard".split() if word not in gensim.parsing.STOPWORDS]
Out[21]:
['cat', 'sat', 'keyboard']
In [22]:
[word for word in "the beatles were overrated".split() if word not in gensim.parsing.STOPWORDS]
Out[22]:
['beatles', 'overrated']

Preparing the corpus

  1. Tokenizing
  2. Normalizing
  3. Stopword removal
    • Short/Long word removal
    • Words that appear in too many or too few documents (sketch below)
In [23]:
[word for word in "the cat sat on the keyboard lllllbbbnnnnnbbbbbbbbbbbbbbbbb".split()
      if len(word) > 2 and len(word) < 30]
Out[23]:
['the', 'cat', 'sat', 'the', 'keyboard']
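
For the second bullet (words in too many or too few documents), gensim's Dictionary can filter by document frequency; a sketch on the cat corpus, with made-up thresholds:

cat_dictionary = gensim.corpora.Dictionary(phrase.split() for phrase in cat_phrases)

# drop words in fewer than 2 documents, or in more than 75% of them
cat_dictionary.filter_extremes(no_below=2, no_above=0.75)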

Preparing the corpus

  1. Tokenizing
  2. Normalizing
  3. Stopword removal
  4. Stemming
In [24]:
for word in "the runner ran on the running trail for their runs".split():
    
    stemmed_word = gensim.parsing.stem(word)
    
    if stemmed_word != word:
        print(word, "=>", stemmed_word)
running => run
runs => run

When do we want to stem words?

In [25]:
gensim.parsing.stem("running")
Out[25]:
u'run'
In [26]:
gensim.parsing.stem("jupyter")
Out[26]:
u'jupyt'

Stemming alternative: lemmatization!

In [27]:
import re

gensim.utils.lemmatize("the runner ran on the running trail for their runs",
                       allowed_tags=re.compile(".*"))
Out[27]:
['the/DT',
 'runner/NN',
 'run/VB',
 'on/IN',
 'the/DT',
 'run/VB',
 'trail/NN',
 'for/IN',
 'their/PR',
 'run/NN']
In [28]:
gensim.utils.lemmatize("the runner ran on the running trail for their runs")
Out[28]:
['runner/NN', 'run/VB', 'run/VB', 'trail/NN', 'run/NN']
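
By default, lemmatize keeps only nouns, verbs, adjectives, and adverbs, so the DT/IN/PR tokens from the first call get dropped here.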

Putting it all together

In [29]:
def preprocess(text):
    for word in gensim.utils.tokenize(text.lower()):
        if word not in gensim.parsing.STOPWORDS and len(word) > 1:
            yield word
            
list(preprocess("Jupyter/IPython notebook export to a reveal.js slideshow is pretty cool!"))
Out[29]:
[u'jupyter',
 u'ipython',
 u'notebook',
 u'export',
 u'reveal',
 u'js',
 u'slideshow',
 u'pretty',
 u'cool']
In [30]:
gensim.parsing.preprocess_string("Jupyter/IPython notebook export to a reveal.js slideshow is pretty cool!")
Out[30]:
[u'jupyt',
 u'ipython',
 u'notebook',
 u'export',
 u'reveal',
 u'slideshow',
 u'pretti',
 u'cool']
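
preprocess_string runs a default chain of filters (strip punctuation, remove stopwords, stem, etc.), which is why "jupyter" became "jupyt". It also takes a custom filters list; a sketch that skips the stemmer, assuming the filter functions exposed by gensim.parsing:

from gensim.parsing import strip_punctuation, remove_stopwords

gensim.parsing.preprocess_string(text,
                                 filters=[lambda s: s.lower(),
                                          strip_punctuation,
                                          remove_stopwords])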

The corpus

My twitter archive

In [31]:
!head -1 tweets.csv
"tweet_id","in_reply_to_status_id","in_reply_to_user_id","timestamp","source","text","retweeted_status_id","retweeted_status_user_id","retweeted_status_timestamp","expanded_urls"
In [32]:
import csv

class TwitterArchiveCorpus():
    def __init__(self, archive_path):
        self.path = archive_path
        
    def iter_texts(self):
        with open(self.path) as f:
            for row in csv.DictReader(f):
                if (row["retweeted_status_id"] == "" and   # filter retweets
                    not row["text"].startswith("RT @") and
                    row["in_reply_to_status_id"] == "" and # filter replies
                    not row["text"].startswith("@")):
                    
                    yield preprocess(row["text"])

tweets = TwitterArchiveCorpus("tweets.csv")
In [33]:
texts = tweets.iter_texts()

print(list(next(texts)))
print(list(next(texts)))
[u'roll', u'tide', u'roll']
[u'yes', u'yes', u'yes']
In [34]:
!head -3 tweets.csv | awk -F '","' '{print $6}'
text
roll tide roll, y'all.
yes yes yes

Basics

  1. Prepare the corpus
  2. Model the corpus
  3. Index the corpus
  4. Query fun-time!
In [35]:
plot_vecs(cat_phrase_vectors, other_cat, labels=cat_lang)
In [36]:
class TwitterArchiveCorpus():
    def __init__(self, archive_path):
        self.path = archive_path
        self.dictionary = gensim.corpora.Dictionary(self.iter_texts())
        
    def iter_texts(self):
        with open(self.path) as f:
            for row in csv.DictReader(f):
                if (row["retweeted_status_id"] == "" and    # filter retweets
                    not row["text"].startswith("RT @") and
                    row["in_reply_to_status_id"] == "" and  # filter replies
                    not row["text"].startswith("@")):
                    
                    yield preprocess(row["text"])
                
    def __iter__(self):
        for document in self.iter_texts():
            yield self.dictionary.doc2bow(document)

    def __len__(self):
        return self.dictionary.num_docs
    
    def get_original(self, key):
        pass  # let's not look at this :-)

tweets = TwitterArchiveCorpus("tweets.csv")
In [37]:
{"meow": 0,
 "hiss": 1,
 "purr": 2}
Out[37]:
{'hiss': 1, 'meow': 0, 'purr': 2}
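
gensim's Dictionary builds exactly this word-to-id mapping for us, and doc2bow turns a token list into sparse (id, count) pairs. A quick sketch on the cat corpus (the exact ids depend on insertion order):

cat_dict = gensim.corpora.Dictionary(phrase.split() for phrase in cat_phrases)

cat_dict.doc2bow("meow meow hiss".split())
# e.g. [(0, 2), (1, 1)]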
In [39]:
vecs = iter(tweets)

print(next(vecs))
print(next(vecs))
[(0, 1), (1, 2)]
[(2, 3)]
In [40]:
texts = tweets.iter_texts()

print(list(next(texts)))
print(list(next(texts)))
[u'roll', u'tide', u'roll']
[u'yes', u'yes', u'yes']
In [41]:
print(tweets.get_original(0))
print(tweets.get_original(1))
roll tide roll, y'all.
yes yes yes
In [42]:
len(tweets), len(tweets.dictionary)
Out[42]:
(4411, 9294)
In [43]:
(len(tweets) * len(tweets.dictionary)) / (1024 ** 2)  # millions of cells a dense document-term matrix would need
Out[43]:
39.0966739654541

Basics

  1. Prepare the corpus
  2. Model the corpus
  3. Index the corpus
  4. Query fun-time!
In [44]:
plot_vecs(cat_phrase_vectors, other_cat, labels=cat_lang)
In [45]:
index = gensim.similarities.Similarity('/tmp/tweets',
                                       tweets, 
                                       num_features=len(tweets.dictionary), 
                                       num_best=15)
In [46]:
!ls /tmp/tweets*
/tmp/tweets.0
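
Similarity shards the index to disk under that prefix (hence /tmp/tweets.0), so the index can grow well beyond available RAM.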

Basics

  1. Prepare the corpus
  2. Model the corpus
  3. Index the corpus
  4. Query fun-time!
In [47]:
plot_vecs(cat_phrase_vectors, other_cat, labels=cat_lang)
In [48]:
query_bow = tweets.dictionary.doc2bow(preprocess("drunk"))
query_bow
Out[48]:
[(1193, 1)]
In [49]:
index[query_bow]
Out[49]:
[(275, 0.57735025882720947),
 (2463, 0.57735025882720947),
 (420, 0.44721359014511108),
 (3170, 0.44721359014511108),
 (2530, 0.3333333432674408),
 (3903, 0.31622776389122009),
 (3290, 0.30151134729385376),
 (3928, 0.28867512941360474)]
In [50]:
def search(query):
    
    query_bow = tweets.dictionary.doc2bow(preprocess(query))
    
    for doc, percent in index[query_bow]:
        print("%.3f" % percent, "=>", tweets.get_original(doc), "\n")
In [51]:
search("drunk")
0.577 => IEEEtrans.cls is drunk 

0.577 => The university's DNS is drunk 

0.447 => coffee: it's like getting reverse drunk 

0.447 => OH: "Drunk screaming hotdog trivia." 

0.333 => Damn phone typoing names. BTW I AM DRUNK AT A BOY BAND CONCERT 

0.316 => Greatest party theme of the day: Black out in solidarity with internet freedom. #sopa #drunk 

0.302 => The chemistry department that we share a building with has alcohol at events all the time. Why aren't I allowed to get drunk in the lab? 

0.289 => I was apparently spotted on the local news being a drunk Alabama fan that flooded the streets after the win last night. Whoops. 

In [52]:
search("bar")
0.516 => the sports bar I went to watch the game at doesn't even get SEC network what kind of garbage "sports bar" is this 

0.500 => Cashing your check at the bar. #classy 

0.378 => guy at bar next to me just told someone  he once worked 175 hours in one week. lol. 

0.354 => Summertime translation: candy bar &lt;-&gt; corn ear 

0.354 => Current status: Spongebob and Patrick have just entered the Thug Tug bar... 

0.354 => *puts pants on, goes to hotel bar to drink away this game* 

0.354 => Dude next to me at the airport bar is chewing with his mouth open and I just want to scream. 

0.333 => Just saw a bar touchscreen game get rebooted. Apparently this one runs Linux. TIL. 

0.333 => im trying to have a proper gameday in Canada and nobody in this bar knows how to make bloody marys. 

0.302 => Looks like my laundromat/bar crossover idea has already been implemented: http://t.co/x35btduo /cc @o_kimly 

0.289 => Trying to convince myself to go to a health bar and have some health beers so I can watch the hockey game in HD 

0.277 => Here's to hoping I have decent cell phone service in the small town of Gordo, AL. Was informed there won't be a bar at this wedding. 

0.277 => Robotics Competition volunteer lunch. Yum! (@ Kobe Japanese Steakhouse &amp; Sushi Bar) http://t.co/BttOMWCo 

0.258 => Coffee shop meets laundromat. Not as good as my bar laundromat idea, but much better than regular laundromats. Call it Clean Beans. 

0.250 => Favorite thing about Firefox is that the address bar doesn't do a search for "nerds.jpg", but assumes ".jpg" is a TLD &amp; attempts connection 

A slightly more advanced model

In [54]:
lsi = gensim.models.LsiModel(tweets,
                             num_topics=100,
                             power_iters=10,
                             id2word=tweets.dictionary)
In [55]:
len(tweets), len(tweets.dictionary)
Out[55]:
(4411, 9294)
In [56]:
len(tweets), lsi.num_topics
Out[56]:
(4411, 100)
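
To peek at what those 100 dimensions mean, show_topics lists the highest-weighted words per topic (output omitted here; it depends entirely on the corpus):

lsi.show_topics(num_topics=3, num_words=5)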
In [57]:
lsi_index = gensim.similarities.Similarity('/tmp/lsitweets',
                                           lsi[tweets],
                                           num_features=lsi.num_topics,
                                           num_best=15)
In [58]:
def lsi_search(query):
    
    query_bow = tweets.dictionary.doc2bow(preprocess(query))
    
    for doc, percent in lsi_index[lsi[query_bow]]:
        print("%.3f" % percent, "=>", tweets.get_original(doc), "\n")
In [59]:
search("pigpen") # non-LSI search
0.378 => my cats names are bubbie and pigpen. why haven't I been signing off emails with 👨👵🐷? 

0.378 => pigpen really enjoys rolling around on the ground. http://t.co/1se5FPsXnM 

In [60]:
lsi_search("pigpen")
0.975 => my cats names are bubbie and pigpen. why haven't I been signing off emails with 👨👵🐷? 

0.937 => i only interact with #brands if they ask me about my cats first 

0.933 => my cats are currently flipping out about #meowthejewels 

0.923 => the travel tranquilizers I gave my cats made one of them cross-eyed 

0.917 => one of my cats definitely chases her tail 

0.891 => Snapchat videos of my cats meowing as apologies as a service 

0.798 => So I played my unsent Snapchat of my cats being cute as hell during a midterm when I went to view one sent to me 

0.785 => It's always so difficult to sleep in unfamiliar places. But at least there are two cats to keep me company on this couch 

0.768 => update: cats still playing with bug. it's not as dead as I originally thought 

0.746 => Yelling at my cats to go back to bed, it's too early, and take off all that swag you look ridiculous 

0.743 => Instead of screaming "rolllllllllllll tide roll" we pounce on the string and go "..." because cats don't yell while killing, dummy 

0.712 => I should invest in one of those networked webcams so I can look at my cats while I'm away on trips 

0.664 => The cats and I decided we would have our own party. It involves running around the apartment with string, because f the police 

0.644 => Spiders will never ever be cute 

0.626 => MARITAL STATUS:
☐ Single
☐ Married
☐ Divorced
☑ Single, but happily with cats 

WE NEED TO GO DEEPER BRRRRRRRAAAAAWWWWRWRRRMRMRMMMMM

Deeper

A deep-learning approach

Warning: here be Gensim dragons

gif credit: http://prostheticknowledge.tumblr.com/post/128044261341/implementation-of-a-neural-algorithm-of-artistic

In [61]:
w2v = gensim.models.Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz',
                                                  binary=True)
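
A quick sanity check that the vectors loaded: most_similar returns the nearest words by cosine similarity (output omitted; it depends on the GoogleNews model):

w2v.most_similar('kitten', topn=3)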
In [62]:
import numpy

def generate_vector_sums():
    for doc in tweets.iter_texts():  # remember me?
        yield gensim.matutils.unitvec(  # DRAGON: hack to make Similarity happy with fullvecs
            sum(
                (w2v[word] for word in doc if word in w2v),
                numpy.zeros(w2v.vector_size)
            )
        )
    
w2v_index = gensim.similarities.Similarity('/tmp/w2v_tweets',
                                           generate_vector_sums(),
                                           num_features=w2v.vector_size,
                                           num_best=15)
In [63]:
def search_w2v(query):
    query_tokens = preprocess(query)
    query_vec = gensim.matutils.unitvec(
        sum((w2v[word] for word in query_tokens if word in w2v),
            numpy.zeros(w2v.vector_size)))
    
    for doc, percent in w2v_index[query_vec]:
        print("%.3f" % percent, "=>", tweets.get_original(doc), "\n")
In [64]:
search_w2v("drunk")
0.740 => IEEEtrans.cls is drunk 

0.606 => The university's DNS is drunk 

0.594 => OH: "Drunk screaming hotdog trivia." 

0.542 => coffee: it's like getting reverse drunk 

0.521 => Damn phone typoing names. BTW I AM DRUNK AT A BOY BAND CONCERT 

0.492 => I was apparently spotted on the local news being a drunk Alabama fan that flooded the streets after the win last night. Whoops. 

0.481 => i should be reading papers but instead i am drinking coffee and being pumped about drinking more coffee #coffee 

0.474 => Vim is old enough to drink, y'all. https://t.co/qfZqzlx7Qm 

0.469 => Hey there random anxiety, what an appropriate time to show up. Another beer please, bartender! 

0.469 => Just accidentally killed a job that has been running since April.

Welp, time to drink. 

0.466 => how many red bulls can one drink before 7am?

a lot 

0.463 => Day made by a cyclist cussing out a bro in his truck 

0.452 => damn, someone already beat me to that mango drink name http://t.co/hWzbo3j6de 

0.437 => This is how gangsters drink coffee while reviewing papers. http://t.co/7e5l5RHr 

0.436 => The chemistry department that we share a building with has alcohol at events all the time. Why aren't I allowed to get drunk in the lab? 

In [65]:
search_w2v("kittens")
0.577 => Cat cuddles http://t.co/zPNgmwTqz7 

0.549 => Just saw a bunny while I was walking the dog! Bunnies :) 

0.547 => Snapchat videos of my cats meowing as apologies as a service 

0.518 => Picked my kitty up from being spayed today, and Vet tells me she's not allowed to run or jump. Vet has never had a kitten before, apparently 

0.518 => lap cat is best cat http://t.co/x7ZYkVqbC2 

0.511 => The cat met her after school to walk her home. I wish my cats did that. 

0.511 => I don't see giant roaches often, but when I do, it's because the cats are already on the case #leakyboat http://t.co/QvItKpQFz7 

0.510 => "You are a kitten in a catnip forest" ok http://t.co/XyFNkoW80b 

0.502 => ahh, my kitties! 👨👵🐷💕 

0.500 => A large pool filled with kittens is also acceptable 

0.494 => my cats are currently flipping out about #meowthejewels 

0.489 => actually there are like 5 kittens around that lot 🐱 🐱 🐱 🐱 🐱 

0.488 => Cats really like my feet http://t.co/LuAclUIcTb 

0.488 => someone come over and throw cat treats across the room. i am stuck under this sleeping cat 

0.484 => things i am thankful for: lap cats http://t.co/iwyid8jx8l 

Thanks for listening! 😻🍻😸

(Is it beer-time yet?)

In [66]:
plot_vecs(cat_phrase_vectors, other_cat, labels=["meow", "hiss"])

No tweets were harmed as a result of this talk.
I am a person; people make mistakes and that's okay.

Short-circuiting a full search: inverted index!

In [67]:
cat_lang3 = ["meow", "hiss", "purr"]

cat_phrase_vectors3 = [[1, 0, 3],
                       [3, 1, 1],
                       [1, 2, 0],
                       [0, 1, 3]]

from collections import defaultdict

inv_index = defaultdict(set)

for doc_id, doc in enumerate(cat_phrase_vectors3):
    for word_id, freq in enumerate(doc):
        if freq:
            inv_index[word_id].add(doc_id)
In [68]:
inv_index[0]
Out[68]:
{0, 1, 2}
In [69]:
inv_index[0].intersection(inv_index[1])
Out[69]:
{1, 2}
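
A sketch of the short-circuit itself: union the posting sets for the query's words, then rank only those candidate documents instead of the whole corpus (candidates is made up for this slide):

def candidates(query_word_ids):
    # every doc containing at least one query word
    docs = set()
    for word_id in query_word_ids:
        docs |= inv_index[word_id]
    return docs

candidates([0, 1])  # "meow" or "hiss" => {0, 1, 2, 3}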