NLTK: NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text-processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.
- corpus(): a corpus is a collection of documents.
- re(): to clean the text of any HTML tags and punctuation using regular expressions.
- lower(): to lower-case the documents.
- tokenization: to split a sentence into tokens (e.g. with word_tokenize()).
- stop words: stop words are considered useless data for a search engine. Words such as articles and prepositions are treated as stop words and are removed.
- PorterStemmer(): to stem the tokens to their root form. Examples: running → run, cookies → cooki, flying → fli.
- tf_idf(): converts a collection of raw documents to a matrix of TF-IDF features.
- count_vect(): transforms text into a sparse matrix of n-gram counts.
- cosine_similarity(): computes the cosine similarity between vectors and builds a similarity matrix.
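The steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the sample documents and the small stop-word set are made up (NLTK's full English list via nltk.corpus.stopwords would need a one-time nltk.download("stopwords")), tokenization is done with a simple whitespace split, and scikit-learn provides TfidfVectorizer and cosine_similarity.

```python
import re
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up sample documents for illustration
docs = [
    "<p>The cat sat on the mat!</p>",
    "A cat was sitting on a mat.",
    "Stock markets were rising sharply today.",
]

# Tiny stand-in for NLTK's stop-word list
stop_words = {"the", "a", "an", "on", "was", "were", "is"}
stemmer = PorterStemmer()

def preprocess(doc):
    doc = re.sub(r"<[^>]+>", " ", doc)           # strip HTML tags
    doc = re.sub(r"[^a-z\s]", " ", doc.lower())  # lower-case, drop punctuation
    tokens = doc.split()                         # simple whitespace tokenization
    return " ".join(stemmer.stem(t) for t in tokens if t not in stop_words)

cleaned = [preprocess(d) for d in docs]

# TF-IDF matrix (documents x features) and pairwise cosine similarities
tfidf = TfidfVectorizer().fit_transform(cleaned)
sim = cosine_similarity(tfidf)
print(sim.round(2))  # 3x3 matrix; docs 0 and 1 are the most similar pair
```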
Then we find the best three matches for each document in this similarity matrix.
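The best-three lookup can be sketched with NumPy: exclude the diagonal (every document matches itself perfectly) and take the three highest-scoring columns in each row. The 5x5 similarity matrix below is a made-up stand-in for the real one.

```python
import numpy as np

# Made-up cosine-similarity matrix (symmetric, ones on the diagonal)
sim = np.array([
    [1.0, 0.8, 0.1, 0.4, 0.3],
    [0.8, 1.0, 0.2, 0.5, 0.1],
    [0.1, 0.2, 1.0, 0.3, 0.9],
    [0.4, 0.5, 0.3, 1.0, 0.2],
    [0.3, 0.1, 0.9, 0.2, 1.0],
])

masked = sim.copy()
np.fill_diagonal(masked, -1.0)  # exclude self-matches

# Indices of the top-3 matches for each document, best match first
top3 = np.argsort(-masked, axis=1)[:, :3]
print(top3)  # row 0 -> [1 3 4]: document 0's best matches are 1, then 3, then 4
```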
R Programming Language
In R, there are multiple packages available to perform these tasks, such as openNLP, tm, Quanteda, and tidytext.
Among these, we used the Quanteda package for its simplicity, better performance, and speed. The approach in all these packages is more or less similar, apart from some differences in functionality.
To achieve the tasks mentioned above, we used the following functions from the Quanteda package:
- corpus(): to build a corpus from the texts and their corresponding ID values.
- tokens(): to tokenize the sentences into words.
- tokens_wordstem(): to stem the tokens to their root form.
- dfm(): to build a document-feature matrix, with one document per row and one word/feature per column. The result is then converted to simple triplet matrix format so that its cross product can be computed efficiently.
- tcrossprod_simple_triplet_matrix(): to compute that cross product, which is how the cosine similarity matrix is built.
Then we find the best three matches for each document in this cross-product matrix.
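The cross-product trick used here can be sketched in Python with SciPy sparse matrices playing the role of the simple triplet format: L2-normalize each row of the document-feature matrix, and the product of the matrix with its own transpose is then exactly the cosine-similarity matrix. The term counts below are made up for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import normalize

# Made-up document-feature matrix: 3 documents x 4 features (term counts)
dfm = csr_matrix(np.array([
    [2, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 3, 1],
]))

# L2-normalize the rows; the cross product A @ A.T of the normalized
# matrix is then the cosine-similarity matrix
norm = normalize(dfm, norm="l2", axis=1)
sim = (norm @ norm.T).toarray()
print(sim.round(3))  # symmetric 3x3 matrix with ones on the diagonal
```

Computing similarities this way on a sparse matrix avoids materializing dense document vectors, which is why the simple-triplet cross product is fast in R as well.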