In this practical, we are first going to learn some pre-processing methods for text: methods that help us clean, normalize, and structure raw text data into a format suitable for analysis or for input to NLP models.
Pre-processing simple texts
1. Text is also known as a string variable, or as an array of characters. Create a variable `a` with the text value of `"Hello @Text Mining World! I'm here to learn, right?"`, and then print it!
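A minimal sketch of a possible solution:

```python
# create the string variable and print it
a = "Hello @Text Mining World! I'm here to learn, right?"
print(a)
```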
2. Import the `nltk` package and use the string method `lower()` to convert the characters in string `a` to their lowercase form, saving the result into a new variable `b`.
NB: `nltk` comes with many corpora, toy grammars, trained models, etc. A complete list is posted at https://www.nltk.org/nltk_data/. To install the data, after installing `nltk`, you can use the `nltk.download()` data downloader. We will make use of this in Question 8.
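For Question 2, one possible sketch (note that `lower()` is a built-in string method, so `nltk` itself is not needed for this particular step):

```python
import nltk  # imported here; we will rely on it throughout the practical

b = a.lower()  # lower() is a method of Python's str type
print(b)
```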
3. Use the `string` package to print the list of punctuation characters.
Punctuation marks can separate characters, words, phrases, or sentences. In some applications they are very important to the task at hand; in others they are redundant and should be removed!
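A quick way to see them:

```python
import string

print(string.punctuation)  # the standard ASCII punctuation characters
```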
4. Use the punctuation list to remove the punctuation characters from `b`, the lowercase form of our example string `a`. Name your variable `c`.
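One way to do this, reusing `string.punctuation`:

```python
# keep only the characters of b that are not punctuation
c = "".join(ch for ch in b if ch not in string.punctuation)
print(c)
```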
5. Use the `word_tokenize()` function from `nltk` to tokenize string `b`. Compare that with the tokenization of string `c`.
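A sketch of the comparison:

```python
from nltk.tokenize import word_tokenize

nltk.download("punkt")  # tokenizer models ("punkt_tab" on newer nltk versions)

print(word_tokenize(b))  # punctuation marks become separate tokens
print(word_tokenize(c))  # "i'm" has already been collapsed into "im"
```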
6. Use `RegexpTokenizer()` from `nltk` to tokenize the string `b` whilst removing punctuation. This way you will avoid unnecessary concatenations.
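A sketch using a simple word pattern:

```python
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r"\w+")  # match runs of word characters, dropping punctuation
print(tokenizer.tokenize(b))  # "i'm" becomes ["i", "m"] rather than the concatenation "im"
```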
7. Use the `sent_tokenize()` function from the `nltk` package to split the string `b` into sentences. Compare that with the sentence tokenization of string `c`.
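A sketch of the comparison:

```python
from nltk.tokenize import sent_tokenize

print(sent_tokenize(b))  # two sentences, split at "!" and "?"
print(sent_tokenize(c))  # one "sentence": without punctuation there is nothing to split on
```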
Pre-processing a text corpus (dataset)
Pre-processing a dataset is similar to pre-processing simple text strings. First, we need to get some data. For this, we can use our own dataset, scrape data from the web, or use social media APIs. There are also websites with publicly available datasets.
Here, we want to analyze and pre-process the Taylor Swift song lyrics data from all her albums. The dataset can be downloaded from the course website or alternatively from Kaggle.
Upload `taylor_swift_lyrics.csv` to Google Colab. You can do this by clicking on the Files button on the far left side of Colab and dragging and dropping the file there, or by clicking the upload button. Alternatively, you can mount Google Drive and upload the dataset there.
8. Read the `taylor_swift_lyrics.csv` dataset. Check the dataframe using the `head()` and `tail()` functions.
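A minimal sketch using pandas:

```python
import pandas as pd

df = pd.read_csv("taylor_swift_lyrics.csv")  # add encoding="latin-1" if you hit a UnicodeDecodeError
print(df.head())
print(df.tail())
```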
9. Add a new column to the dataframe and name it `Preprocessed Lyrics`, then fill the column with the preprocessed text, applying the steps in this and the following questions. First, replace the `\n` sequences with a space character.
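A sketch, assuming the raw lyrics live in a column called `Lyrics` (an assumption; adjust the name to your file):

```python
# replace newlines with spaces; regex=False treats "\n" as a literal substring
df["Preprocessed Lyrics"] = df["Lyrics"].str.replace("\n", " ", regex=False)
```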
10. Write another custom function to remove the punctuation. You can use the previous method or make use of the `maketrans()` function from the `string` package.
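A sketch using `maketrans()`:

```python
import string

def remove_punctuation(text):
    # str.maketrans with three arguments maps each punctuation character to None
    return text.translate(str.maketrans("", "", string.punctuation))

df["Preprocessed Lyrics"] = df["Preprocessed Lyrics"].apply(remove_punctuation)
```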
11. Convert all the characters to their lowercase forms. Think about why and when we need this step in our analysis.
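For example:

```python
df["Preprocessed Lyrics"] = df["Preprocessed Lyrics"].str.lower()
```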
12. List the 20 most frequent terms in this dataframe.
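One possible approach with `collections.Counter`:

```python
from collections import Counter

# concatenate all preprocessed lyrics and count the whitespace-separated tokens
all_tokens = " ".join(df["Preprocessed Lyrics"]).split()
print(Counter(all_tokens).most_common(20))
```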
13. Plot a word cloud with at most 50 words using the `WordCloud()` function from the `wordcloud` package. Use the command `?WordCloud` to check the help for this function.
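A sketch of the plot:

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

wc = WordCloud(max_words=50, background_color="white")
wc.generate(" ".join(df["Preprocessed Lyrics"]))

plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```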
14. Use the English stop word list from the `nltk` package to remove the stop words. Check the stop words and extend them with your own optional list of words, for example: "im", "youre", "id", "dont", "cant", "didnt", "ive", "ill", "hasnt". Show the 20 most frequent terms and plot the word cloud of 50 words again.
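A sketch of the stop word removal (repeat the frequency list and word cloud from Questions 12 and 13 afterwards):

```python
from nltk.corpus import stopwords

nltk.download("stopwords")
stop_words = set(stopwords.words("english"))
stop_words.update(["im", "youre", "id", "dont", "cant", "didnt", "ive", "ill", "hasnt"])

df["Preprocessed Lyrics"] = df["Preprocessed Lyrics"].apply(
    lambda text: " ".join(w for w in text.split() if w not in stop_words)
)
```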
15. We can apply stemming or lemmatization to our text data. Apply a lemmatizer from `nltk` and save the results.
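A sketch with the WordNet lemmatizer:

```python
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet")
lemmatizer = WordNetLemmatizer()

df["Preprocessed Lyrics"] = df["Preprocessed Lyrics"].apply(
    lambda text: " ".join(lemmatizer.lemmatize(w) for w in text.split())
)
```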
Text representation with Vector Space Model
16. Use `CountVectorizer()` from the `sklearn` package and build a bag-of-words model on `Preprocessed Lyrics` based on term frequency. Check the shape of the output matrix.
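A minimal sketch:

```python
from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer()
dtm = count_vectorizer.fit_transform(df["Preprocessed Lyrics"])
print(dtm.shape)  # (number of songs, vocabulary size)
```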
17. Inspect the first 100 terms in the vocabulary.
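For example:

```python
# use get_feature_names() instead on scikit-learn versions older than 1.0
print(count_vectorizer.get_feature_names_out()[:100])
```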
18. Using `TfidfVectorizer()`, you can create a model based on tf-idf. Apply this vectorizer to your text data. Does the shape of the output matrix differ from that of the dtm?
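A sketch; note that only the cell weights change, not the matrix dimensions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()
tfidf_dtm = tfidf_vectorizer.fit_transform(df["Preprocessed Lyrics"])
print(tfidf_dtm.shape)  # same shape as dtm; tf-idf only reweights the counts
```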
19. Use the `TfidfVectorizer()` to create an n-gram based model with n = 1 and 2. Use the `ngram_range` argument to determine the lower and upper boundary of the range of n-values for the different n-grams to be extracted. (Tip: use `?TfidfVectorizer`.)
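For example:

```python
ngram_vectorizer = TfidfVectorizer(ngram_range=(1, 2))  # extract unigrams and bigrams
ngram_dtm = ngram_vectorizer.fit_transform(df["Preprocessed Lyrics"])
print(ngram_dtm.shape)  # the vocabulary grows because it now also contains bigrams
```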
20. We want to compare the lyrics of the Friends theme song with the lyrics of Taylor Swift's songs and find the most similar one. Use the string below. First apply the pre-processing steps, and then transform the text into count and tf-idf vectors. Do the two bag-of-words models agree on the song most similar to the Friends theme song?
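One way to sketch the comparison, reusing the objects from the earlier sketches and assuming the theme-song string is stored in a variable `friends_lyrics` and the song titles live in a column called `Title` (both are assumptions; adjust them to your data):

```python
from sklearn.metrics.pairwise import cosine_similarity

# apply the same pre-processing pipeline to the query string
query_text = remove_punctuation(friends_lyrics.replace("\n", " ").lower())
query_text = " ".join(
    lemmatizer.lemmatize(w) for w in query_text.split() if w not in stop_words
)

for name, vectorizer, matrix in [("count", count_vectorizer, dtm),
                                 ("tfidf", tfidf_vectorizer, tfidf_dtm)]:
    query = vectorizer.transform([query_text])  # reuse the fitted vocabulary
    sims = cosine_similarity(query, matrix).ravel()
    print(name, df.iloc[sims.argmax()]["Title"])  # "Title" is a hypothetical column name
```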