Collocations are two or more words that tend to appear together frequently, such as United States. Many other words can follow United, as in United Kingdom and United Airlines. As with many aspects of natural language processing, context is very important, and for collocations, context is everything.
In the case of collocations, the context is a document in the form of a list of words. Discovering collocations in this list of words means finding common phrases that occur frequently throughout the text.
Data : Monty Python and the Holy Grail script
Code #1 : Loading Libraries
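The listing for this step is not shown, so the following is a minimal sketch of the imports a collocation search with NLTK typically needs. It assumes NLTK is installed; the corpus readers are only needed if the text is loaded from NLTK's downloadable corpora.

```python
# Corpus readers (lazy: the data is only touched when a corpus is actually read).
from nltk.corpus import webtext, stopwords

# Finders build the word and n-gram counts from a list of words.
from nltk.collocations import BigramCollocationFinder, TrigramCollocationFinder

# Association measures supply the scoring functions used to rank n-grams.
from nltk.metrics import BigramAssocMeasures, TrigramAssocMeasures
```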
Code #2 : Let’s find the collocations
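The listing for this step is not shown either. As a hedged sketch, the search could look like the code below. The `tokens` list is a small made-up sample rather than the real script, and ranking by likelihood ratio is an assumed choice of scoring function, so this run will not reproduce the output listed in the article, which came from the full script.

```python
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

# Toy stand-in for the script's tokens (hypothetical sample, not the real data).
# With the webtext corpus downloaded, webtext.words('grail.txt') from
# nltk.corpus could supply the full script instead.
tokens = ("ARTHUR : It is the Black Knight ! "
          "KNIGHT : None shall pass . "
          "ARTHUR : I have no quarrel with you , good Black Knight .").split()

# Lowercase so 'Black Knight' and 'black knight' count as the same bigram.
words = [w.lower() for w in tokens]

# Build the word and bigram frequency distributions from the word list.
finder = BigramCollocationFinder.from_words(words)

# Rank bigrams by an association score and keep the five best.
top_bigrams = finder.nbest(BigramAssocMeasures.likelihood_ratio, 5)
print(top_bigrams)
```

Because nothing is filtered yet, punctuation tokens such as ':' and '.' are scored like any other word.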
[("'", 's'), ('arthur', ':'), ('#', '1'), ("'", 't'), ('villager', '#'), ('#', '2'), (']', '['), ('1', ':'), ('oh', ', '), ('black', 'knight'), ('ha', 'ha'), (':', 'oh'), ("'", 're'), ('galahad', ':'), ('well', ', ')]
As the output above shows, finding collocations this way is not very useful: most of the top pairs are punctuation marks and fragments of contractions. The next step refines the search by adding a word filter that removes punctuation and stopwords.
Code #3 :
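A minimal sketch of the filtering step, under the same assumptions as before: a made-up token sample and likelihood ratio as the scoring function. The `stopset` here is a small hand-picked set for illustration; `nltk.corpus.stopwords.words('english')` provides a fuller list once the stopwords corpus is downloaded. The name `filter_stops` is hypothetical.

```python
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

# Toy stand-in for the script's tokens (hypothetical sample, not the real data).
tokens = ("ARTHUR : It is the Black Knight ! "
          "KNIGHT : None shall pass . "
          "ARTHUR : I have no quarrel with you , good Black Knight .").split()
words = [w.lower() for w in tokens]

# Hand-picked stopwords for this sample (assumption, not NLTK's list).
stopset = {"it", "is", "the", "i", "have", "no", "with", "you", "shall"}

def filter_stops(word):
    """Return True for words to drop: shorter than 3 characters, or stopwords."""
    return len(word) < 3 or word in stopset

finder = BigramCollocationFinder.from_words(words)

# Remove every bigram in which at least one word is rejected by the filter.
finder.apply_word_filter(filter_stops)

filtered_bigrams = finder.nbest(BigramAssocMeasures.likelihood_ratio, 5)
print(filtered_bigrams)
```

The length check is what removes punctuation here, since single punctuation tokens are shorter than three characters.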
[('black', 'knight'), ('clop', 'clop'), ('head', 'knight'), ('mumble', 'mumble'), ('squeak', 'squeak'), ('saw', 'saw'), ('holy', 'grail'), ('run', 'away'), ('french', 'guard'), ('cartoon', 'character'), ('iesu', 'domine'), ('pie', 'iesu'), ('round', 'table'), ('sir', 'robin'), ('clap', 'clap')]
How does the code work?
- BigramCollocationFinder constructs two frequency distributions:
  - one for individual words
  - another for bigrams.
- A frequency distribution is basically an enhanced Python dictionary where the keys are what’s being counted, and the values are the counts.
- Any filtering function reduces the number of candidate bigrams by eliminating those that contain a word that doesn’t pass the filter.
- Using a filtering function to eliminate all words that are one or two characters long, and all English stopwords, gives a much cleaner result.
- After filtering, the collocation finder is ready for finding collocations.
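The points above can be illustrated without NLTK. The sketch below uses `collections.Counter` as the "enhanced dictionary"; it is a conceptual illustration of the two distributions and of word filtering, not NLTK's actual implementation. The word list and the `is_rejected` rule are made up for the example.

```python
from collections import Counter

words = ["black", "knight", "said", "black", "knight", "ran", "away"]

# First distribution: how often each word occurs.
word_fd = Counter(words)

# Second distribution: how often each pair of adjacent words occurs.
bigram_fd = Counter(zip(words, words[1:]))

def is_rejected(word):
    """Hypothetical filter rule: reject words shorter than 4 characters."""
    return len(word) < 4

# Filtering keeps only the bigrams in which every word passes the filter.
kept = Counter({
    bigram: count
    for bigram, count in bigram_fd.items()
    if not any(is_rejected(w) for w in bigram)
})

print(word_fd["black"])                  # 2
print(bigram_fd[("black", "knight")])    # 2
print(len(bigram_fd), len(kept))         # 5 3
```

Scoring functions such as likelihood ratio then combine the bigram count with the counts of its individual words, which is why both distributions are kept.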
Code #4 : Working on triplets instead of pairs.
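A hedged sketch of the same approach for three-word phrases, using `TrigramCollocationFinder`. As before, the tokens are a made-up sample, `stopset` and `filter_stops` are hypothetical, and likelihood ratio is an assumed scoring function. The frequency filter is an optional extra assumed here to discard trigrams seen only once.

```python
from nltk.collocations import TrigramCollocationFinder
from nltk.metrics import TrigramAssocMeasures

# Toy stand-in for the script's tokens (hypothetical sample, not the real data).
tokens = ("clop clop clop ARTHUR : Whoa there ! clop clop clop "
          "Brave Sir Robin ran away . Brave Sir Robin .").split()
words = [w.lower() for w in tokens]

# Hand-picked stopwords for this sample (assumption, not NLTK's list).
stopset = {"there"}

def filter_stops(word):
    """Return True for words to drop: shorter than 3 characters, or stopwords."""
    return len(word) < 3 or word in stopset

finder = TrigramCollocationFinder.from_words(words)

# Drop trigrams containing a rejected word, then those seen fewer than 2 times.
finder.apply_word_filter(filter_stops)
finder.apply_freq_filter(2)

top_trigrams = finder.nbest(TrigramAssocMeasures.likelihood_ratio, 5)
print(top_trigrams)
```

On this sample only the two repeated phrases survive both filters.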
[('clop', 'clop', 'clop'), ('mumble', 'mumble', 'mumble'), ('squeak', 'squeak', 'squeak'), ('saw', 'saw', 'saw'), ('pie', 'iesu', 'domine'), ('clap', 'clap', 'clap'), ('dona', 'eis', 'requiem'), ('brave', 'sir', 'robin'), ('heh', 'heh', 'heh'), ('king', 'arthur', 'music'), ('hee', 'hee', 'hee'), ('holy', 'hand', 'grenade'), ('boom', 'boom', 'boom'), ('...', 'dona', 'eis'), ('already', 'got', 'one')]