In this practical, we will first refresh our knowledge of (or get acquainted with) Python in Google Colab, and then continue with text mining and regular expressions! Are you looking for Python documentation to refresh your knowledge of programming? If so, you can check https://docs.python.org/3/reference/
Google Colab
Google Colaboratory, or "Colab" for short, allows you to write and execute Python in your browser, with:
- Zero configuration required
- Free access to GPUs and more
- Easy sharing
Colab notebooks are Jupyter notebooks that are hosted by Colab. You can find a more detailed introduction to Colab here, but we will also cover the basics below.
Simple text processing
1. Open Colab and create a new empty notebook to work with Python 3!
Go to https://colab.research.google.com/ and login with your account. Then click on "File $\rightarrow$ New notebook".
If you want to insert a new code chunk below the cell you are currently in, press Alt + Enter (Option + Enter on Mac).
If you want to stop your code from running in Colab:
- Interrupt execution by pressing Ctrl + M I, or simply click the stop button.
- Or: press Ctrl + A to select all the code of that particular cell, then Ctrl + X to cut the entire cell code. Now the cell is empty and can be deleted using Ctrl + M D or by pressing the delete button. You can paste your code into a new code chunk and adjust it.

NB: On Mac, use Cmd instead of Ctrl in shortcuts.
2. Text is also known as a string variable, or as an array of characters. Create a variable a with the text value "Hello @Text Mining World! I'm here to learn, right?", and then print it!
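A minimal sketch of one possible answer:

```python
# Store the example sentence in a string variable and print it.
a = "Hello @Text Mining World! I'm here to learn, right?"
print(a)
```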
3. Print the first and last character of your variable.
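One way to do this uses zero-based indexing, where negative indices count from the end of the string:

```python
a = "Hello @Text Mining World! I'm here to learn, right?"
# Index 0 is the first character; index -1 is the last one.
print(a[0])   # 'H'
print(a[-1])  # '?'
```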
4. Use the !pip install command to install the packages numpy, nltk, gensim, and spacy.
NB: The re package (for regular expressions) is part of Python's standard library and comes pre-installed. You don't need to run !pip install re; you can simply import it and use it directly in your code.
NB: Generally, you only need to install a package once on your computer and then simply load it; in Colab, however, you may need to reinstall a package whenever you reconnect to the runtime.
5. Import (load) the nltk package and use the function lower() to convert the characters in string a to their lowercase form, saving the result in a new variable b.
NB: nltk comes with many corpora, toy grammars, trained models, etc. A complete list is posted at https://www.nltk.org/nltk_data/. To install the data, after installing nltk, you can use the nltk.download() data downloader. We will make use of this in Question 8.
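A sketch of the lowercasing step; note that lower() is a built-in method on Python strings, so nltk itself is not strictly needed for this particular operation:

```python
a = "Hello @Text Mining World! I'm here to learn, right?"
# lower() returns a new string with all cased characters converted to lowercase.
b = a.lower()
print(b)
```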
6. Use the string package to print the list of punctuation characters.
Punctuation can separate characters, words, phrases, or sentences. In some applications punctuation marks are very important to the task at hand; in others they are redundant and should be removed! We will learn more about this in text pre-processing.
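The standard library exposes this list as a string constant:

```python
import string

# string.punctuation holds the ASCII punctuation characters as one string.
print(string.punctuation)
```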
7. Use the punctuation list to remove the punctuation marks from the lowercase form of our example string a. Name your variable c.
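One possible approach uses str.translate with a mapping table built from the punctuation list; a loop over the characters would work just as well:

```python
import string

a = "Hello @Text Mining World! I'm here to learn, right?"
b = a.lower()
# str.maketrans("", "", string.punctuation) builds a table that deletes
# every punctuation character in a single pass over the string.
c = b.translate(str.maketrans("", "", string.punctuation))
print(c)
```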
8. Use the function RegexpTokenizer() from nltk to tokenize the string b while removing punctuation (tokenization is the process of splitting text into smaller units, such as words, sentences, or subwords; we'll talk more about this next week).
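A sketch with a simple word-character pattern; other patterns are possible depending on how you want to treat contractions like "i'm":

```python
from nltk.tokenize import RegexpTokenizer

b = "hello @text mining world! i'm here to learn, right?"
# The pattern \w+ keeps runs of word characters, so punctuation is dropped
# during tokenization (note it also splits "i'm" into "i" and "m").
tokenizer = RegexpTokenizer(r"\w+")
tokens = tokenizer.tokenize(b)
print(tokens)
```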
Working with text datasets
Working with a text dataset is similar to simple text processing. Here are some websites with publicly available datasets:
We want to analyze the Taylor Swift song lyrics data from all her albums. Download the dataset from the course website or alternatively from Kaggle.
Upload taylor_swift_lyrics.csv to Google Colab. You can do this by clicking the Files button on the far left side of Colab and dragging and dropping the file there, or by clicking the upload button. Alternatively, you can mount Google Drive and upload the dataset there.
Taylor Swift Lyrics dataset
9. Read the taylor_swift_lyrics.csv dataset. Check the dataframe using the head() and tail() functions and the iloc attribute.
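A sketch of these inspection calls; in Colab you would read the uploaded file with pd.read_csv("taylor_swift_lyrics.csv"), but to keep this example self-contained we build a tiny stand-in frame (the column names here are assumptions, not the real dataset schema):

```python
import pandas as pd

# Stand-in for pd.read_csv("taylor_swift_lyrics.csv"); placeholder text only.
df = pd.DataFrame({
    "album": ["Album A", "Album B"],
    "lyric": ["love love love", "dancing in the rain"],
})
print(df.head(1))   # first n rows (default 5)
print(df.tail(1))   # last n rows
print(df.iloc[0])   # select a row by integer position
```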
10. Use the str.contains function and write a regex that finds rows where the lyrics contain a specific word, such as "love".
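A sketch on a toy frame (placeholder lyrics): the word-boundary anchors \b make sure "love" is matched as a whole word, so "lovely" does not count.

```python
import pandas as pd

df = pd.DataFrame({"lyric": ["all you need is love", "shake it off", "lovely day"]})
# str.contains returns a boolean mask; use it to filter the rows.
mask = df["lyric"].str.contains(r"\blove\b", case=False, regex=True)
print(df[mask])
```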
11. Use the str.count function and write a regex that counts how many times the word "love" appears in each lyric.
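A sketch on placeholder text:

```python
import pandas as pd

df = pd.DataFrame({"lyric": ["love me love me", "no matches here"]})
# str.count applies the regex to each row and returns the number of matches.
df["love_count"] = df["lyric"].str.count(r"\blove\b")
print(df)
```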
12. Write a regex that extracts all words that are exactly 4 characters long in each lyric.
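One way to sketch this: \b\w{4}\b matches exactly four word characters between two word boundaries, and str.findall collects all matches per row.

```python
import pandas as pd

df = pd.DataFrame({"lyric": ["they said love wins", "hi"]})
# findall returns, for each row, the list of all non-overlapping matches.
df["four_letter"] = df["lyric"].str.findall(r"\b\w{4}\b")
print(df["four_letter"].tolist())
```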
13. Write a regex that finds rows where the lyrics contain any numeric characters.
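A sketch: \d matches any digit, so a str.contains mask flags rows with numeric characters.

```python
import pandas as pd

df = pd.DataFrame({"lyric": ["22 was a good year", "no numbers at all"]})
# True for any row whose text contains at least one digit.
mask = df["lyric"].str.contains(r"\d")
print(df[mask])
```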
Computer review dataset
The Computer Review Dataset is an annotated dataset for aspect-based sentiment analysis. The data originates from https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html, and you can download a version of it from the course website (https://textminingcourse.nl/labs/week_1/data.zip).
14. Use the readlines function to read the data from the computer.txt file. Convert the data to a dataframe and name it computer_531.
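A sketch of the reading step. To stay self-contained this example first writes a one-line stand-in file in the annotation format described below; in the lab you would open the provided computer.txt instead.

```python
import pandas as pd

# Stand-in for the real computer.txt (one annotated review per line).
sample = "screen[-1], picture quality[-1]##the screen is dim and grainy\n"
with open("computer.txt", "w") as f:
    f.write(sample)

with open("computer.txt") as f:
    lines = f.readlines()  # one review per list element

computer_531 = pd.DataFrame(lines, columns=["review"])
print(computer_531.head())
```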
15. In this dataset, each line represents a review along with its annotated aspects and sentiments. For example, line 3 is "screen[-1], picture quality[-1] ## review text": examining this line shows that the annotator thinks this review has two aspects/features, screen and picture quality, and that both are associated with a sentiment score of negative one. The annotation is followed by the characters ## and then the actual review text. What we want to do now is write regular expressions to add some structure to our data: 1) extract all the aspects and put them into a column, 2) put the review text in a column, and 3) sum the sentiment scores and, in another column, give the whole review a positive, negative, or neutral sentiment label based on the sign of the summed value. Write the regular expressions step by step with decent code documentation.
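The steps above can be sketched as follows, under the annotation format just described ("aspect[+n], other aspect[-n]##review text"); the sample lines here are illustrative, not taken from the real file:

```python
import re
import pandas as pd

# Illustrative stand-in lines in the annotated format.
lines = [
    "screen[-1], picture quality[-1]##the screen and picture quality are poor",
    "battery[+2]##battery life is great",
    "##just a plain comment with no aspects",
]

rows = []
for line in lines:
    # Split the annotation part from the review text at "##".
    annotation, _, text = line.partition("##")
    # Each aspect looks like "name[+n]" or "name[-n]";
    # capture the name and the signed score.
    pairs = re.findall(r"([^,\[\]]+)\[([+-]?\d+)\]", annotation)
    aspects = [name.strip() for name, _ in pairs]
    total = sum(int(score) for _, score in pairs)
    # Label the whole review by the sign of the summed sentiment scores.
    label = "positive" if total > 0 else "negative" if total < 0 else "neutral"
    rows.append({"aspects": aspects, "review": text.strip(),
                 "score": total, "label": label})

computer_531 = pd.DataFrame(rows)
print(computer_531)
```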
16. Save the final computer_531 dataframe to a CSV file. We will be using it in later parts of the course.
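A sketch of the saving step (the toy frame here stands in for your final result):

```python
import pandas as pd

# Stand-in for the final dataframe built in Question 15.
computer_531 = pd.DataFrame({"review": ["battery life is great"],
                             "label": ["positive"]})
# index=False keeps the CSV free of the pandas row-index column.
computer_531.to_csv("computer_531.csv", index=False)
print(pd.read_csv("computer_531.csv").shape)
```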