Practical 2: Text Pre-processing


Text Mining, Transforming Text into Knowledge (202400006)

In this practical, we are first going to learn some pre-processing methods for text. These methods help us clean, normalize, and structure raw text data into a format suitable for analysis or as input to NLP models.

Pre-processing simple texts

1. In code, text is stored in a string variable, which can also be viewed as an array of characters. Create a variable a with the text value of "Hello @Text Mining World! I'm here to learn, right?", and then print it!
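A minimal sketch (double quotes avoid having to escape the apostrophe):

    a = "Hello @Text Mining World! I'm here to learn, right?"
    print(a)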

2. Import the nltk package and use the string method lower() to convert the characters in string a to their lowercase form, saving the result in a new variable b.
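For example:

    import nltk

    # lower() is a built-in string method, so nltk is not strictly needed here,
    # but we import it now because the following exercises rely on it.
    b = a.lower()
    print(b)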

NB: nltk comes with many corpora, toy grammars, trained models, etc. A complete list is posted at: https://www.nltk.org/nltk_data/

To install the data, after installing nltk, you can use the nltk.download() data downloader. We will make use of this in the questions below.

3. Use the string module to print the list of punctuation characters.

Punctuation can separate characters, words, phrases, or sentences. In some applications punctuation marks are very important to the task at hand; in others they are redundant and should be removed!
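One way to print them is via the string.punctuation constant:

    import string

    # The ASCII punctuation characters.
    print(string.punctuation)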

4. Use the punctuation list to remove the punctuation marks from b, the lowercase form of our example string a. Name your variable c.
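A possible sketch, keeping only the characters that are not punctuation:

    c = "".join(ch for ch in b if ch not in string.punctuation)
    print(c)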

5. Use the word_tokenize() function from nltk to tokenize string b. Compare the result with the tokenization of string c.
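A sketch; the punkt tokenizer models need to be downloaded once:

    from nltk.tokenize import word_tokenize

    nltk.download('punkt')   # tokenizer models, needed only once
    print(word_tokenize(b))  # punctuation marks become separate tokens
    print(word_tokenize(c))  # "i'm" was collapsed to "im" when we stripped punctuation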

6. Use the RegexpTokenizer() class from nltk to tokenize string b while removing punctuation. This way you will avoid unnecessary concatenations such as "im".
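A sketch with a simple word-character pattern:

    from nltk.tokenize import RegexpTokenizer

    # \w+ keeps runs of word characters and drops punctuation during tokenization,
    # so "i'm" yields ["i", "m"] rather than the concatenation "im".
    tokenizer = RegexpTokenizer(r"\w+")
    print(tokenizer.tokenize(b))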

7. Use the sent_tokenize() function from the nltk package to split string b into sentences. Compare the result with the sentence tokenization of string c.
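For example:

    from nltk.tokenize import sent_tokenize

    print(sent_tokenize(b))  # "!" and "?" split b into two sentences
    print(sent_tokenize(c))  # c has no punctuation left, so it stays one sentence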

Pre-processing a text corpus (dataset)

Pre-processing a dataset is similar to pre-processing simple text strings. First, we need to get some data. For this, we can use our own dataset, scrape data from the web, or use social media APIs. There are also some websites with publicly available datasets:

  • CLARIN Resource Families
  • UCI Machine Learning Repository
  • Kaggle

Here, we want to analyze and pre-process the Taylor Swift song lyrics data from all her albums. The dataset can be downloaded from the course website or alternatively from Kaggle.

Upload taylor_swift_lyrics.csv to Google Colab. You can do this by clicking the Files button on the far left side of Colab and dragging and dropping the file there, or by clicking the upload button. Alternatively, you can mount your Google Drive and read the dataset from there.

8. Read the taylor_swift_lyrics.csv dataset. Check the dataframe using the head() and tail() functions.
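A sketch with pandas; the column names depend on the file, so inspect them with head():

    import pandas as pd

    df = pd.read_csv("taylor_swift_lyrics.csv")
    print(df.head())  # first five rows
    print(df.tail())  # last five rows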

9. Add a new column to the dataframe and name it Preprocessed Lyrics, then fill the column with the preprocessed text, applying the steps in this and the following questions. First, replace the \n sequences with a space character.
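A sketch, assuming the raw lyrics live in a column named Lyrics; adjust the name to match your dataframe:

    # Use "\\n" instead if the newlines are stored as a literal backslash-n sequence.
    df["Preprocessed Lyrics"] = df["Lyrics"].str.replace("\n", " ", regex=False)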

10. Write another custom function to remove the punctuation. You can use the previous method or make use of str.maketrans() together with the translate() string method.
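A sketch using a translation table:

    import string

    def remove_punctuation(text):
        # str.maketrans builds a table that maps each punctuation
        # character to None; translate() applies it in one pass.
        return text.translate(str.maketrans("", "", string.punctuation))

    df["Preprocessed Lyrics"] = df["Preprocessed Lyrics"].apply(remove_punctuation)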

11. Convert all the characters to their lowercase form. Think about why and when we need this step in our analysis.
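For example:

    # Lowercasing merges terms such as "Love" and "love" into one,
    # which matters for frequency counts and vector space models.
    df["Preprocessed Lyrics"] = df["Preprocessed Lyrics"].str.lower()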

12. List the 20 most frequent terms in the preprocessed lyrics.
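One way to count terms, with collections.Counter:

    from collections import Counter

    # Join all songs into one string, split on whitespace, and count.
    all_tokens = " ".join(df["Preprocessed Lyrics"]).split()
    print(Counter(all_tokens).most_common(20))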

13. Plot a word cloud with at most 50 words using the WordCloud() function from the wordcloud package. Use the command ?WordCloud to check the help for this function.
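A minimal sketch:

    import matplotlib.pyplot as plt
    from wordcloud import WordCloud

    wc = WordCloud(max_words=50, background_color="white")
    wc.generate(" ".join(df["Preprocessed Lyrics"]))

    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.show()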

14. Use the English stop word list from the nltk package to remove the stop words. Check the stop words and extend the list with your own words, for example: "im", "youre", "id", "dont", "cant", "didnt", "ive", "ill", "hasnt". Show the 20 most frequent terms and plot the word cloud of 50 words again.
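A sketch; the stopwords corpus needs to be downloaded once:

    from nltk.corpus import stopwords

    nltk.download('stopwords')
    stop_words = set(stopwords.words("english"))
    # Contractions lost their apostrophes when we removed punctuation,
    # so add the resulting forms as extra stop words.
    stop_words.update(["im", "youre", "id", "dont", "cant",
                       "didnt", "ive", "ill", "hasnt"])

    df["Preprocessed Lyrics"] = df["Preprocessed Lyrics"].apply(
        lambda text: " ".join(w for w in text.split() if w not in stop_words)
    )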

15. We can apply stemming or lemmatization to our text data. Apply a lemmatizer from nltk and save the results.
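A sketch with the WordNet lemmatizer:

    from nltk.stem import WordNetLemmatizer

    nltk.download('wordnet')  # lexical database used by the lemmatizer
    lemmatizer = WordNetLemmatizer()
    df["Preprocessed Lyrics"] = df["Preprocessed Lyrics"].apply(
        lambda text: " ".join(lemmatizer.lemmatize(w) for w in text.split())
    )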

Text representation with Vector Space Model

16. Use CountVectorizer() from the sklearn package to build a bag-of-words model of Preprocessed Lyrics based on term frequency. Check the shape of the output matrix.
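A sketch:

    from sklearn.feature_extraction.text import CountVectorizer

    vectorizer = CountVectorizer()
    dtm = vectorizer.fit_transform(df["Preprocessed Lyrics"])
    print(dtm.shape)  # (number of songs, vocabulary size)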

17. Inspect the first 100 terms in the vocabulary.
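For example, continuing with the vectorizer above:

    # get_feature_names_out() lists the vocabulary in column order
    # (older sklearn versions use get_feature_names() instead).
    print(vectorizer.get_feature_names_out()[:100])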

18. Using TfidfVectorizer(), you can create a model based on tf-idf. Apply this vectorizer to your text data. Does the shape of the output matrix differ from that of the document-term matrix dtm?
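A sketch:

    from sklearn.feature_extraction.text import TfidfVectorizer

    tfidf_vectorizer = TfidfVectorizer()
    tfidf = tfidf_vectorizer.fit_transform(df["Preprocessed Lyrics"])
    print(tfidf.shape)  # same shape as dtm: only the weights differ, not the dimensions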

19. Use TfidfVectorizer() to create an n-gram-based model with n = 1 and 2. Use the ngram_range argument to set the lower and upper boundary of the range of n-values for the n-grams to be extracted. (tip: use ?TfidfVectorizer)
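For example:

    # ngram_range=(1, 2) extracts both unigrams and bigrams.
    ngram_vectorizer = TfidfVectorizer(ngram_range=(1, 2))
    ngrams = ngram_vectorizer.fit_transform(df["Preprocessed Lyrics"])
    print(ngrams.shape)  # many more columns, since every bigram adds a feature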

20. We want to compare the lyrics of the Friends theme song with the lyrics of Taylor Swift's songs and find the most similar one. Use the string below. First, apply the pre-processing steps, and then transform the text into count and tf-idf vectors. Do the two bag-of-words models agree on the song most similar to the Friends theme?
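A sketch using cosine similarity from sklearn; friends_clean is a placeholder for the theme-song string after the same pre-processing steps as above:

    from sklearn.metrics.pairwise import cosine_similarity

    friends_clean = "..."  # placeholder: the pre-processed theme-song lyrics

    count_sims = cosine_similarity(vectorizer.transform([friends_clean]), dtm)
    tfidf_sims = cosine_similarity(tfidf_vectorizer.transform([friends_clean]), tfidf)

    # Most similar song under each model; compare the two answers.
    print(df.iloc[count_sims.argmax()])
    print(df.iloc[tfidf_sims.argmax()])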