Practical 4: Feature Selection¶


Text Mining, Transforming Text into Knowledge (202400006)¶

In this practical, we are going to learn about feature selection methods for text data.

We will use the following libraries, mainly from sklearn. Make sure you have them installed!

In [1]:
from sklearn.datasets import load_files
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import RegexpTokenizer
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.feature_selection import mutual_info_classif
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn import metrics
from sklearn.naive_bayes import MultinomialNB
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
import matplotlib.pyplot as plt

Let's get started!¶

1. Here we are going to use a news article dataset originating from the BBC news website, provided for benchmarking machine learning algorithms. The BBC dataset consists of 2,225 documents and 5 categories: business, entertainment, politics, sport, and tech. Download the data.zip file and extract it into your data folder. Then use the code below to convert the resulting object into a dataframe.

In [ ]:
# Load the dataset, handling encoding errors gracefully
data = load_files('data/bbcsport-fulltext/bbcsport', encoding='utf-8', decode_error='replace')

# Convert the data into a pandas DataFrame
df = pd.DataFrame(list(zip(data['data'], data['target'])), columns=['text', 'label'])

# Display the first few rows
print(df.head())

2. Print the unique target names in your data and check the number of articles in each category. Then split your data into training (80%) and test (20%) sets.
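
One way to approach this (a sketch; the variable names and random_state are our own choices, not prescribed by the practical):

In [ ]:
# Unique category names and the number of articles per category
print(data['target_names'])
print(df['label'].map(lambda i: data['target_names'][i]).value_counts())

# 80/20 train/test split on the raw texts and integer labels
X_train, X_test, y_train, y_test = train_test_split(
    df['text'], df['label'], test_size=0.2, random_state=42
)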

3. Use the CountVectorizer from sklearn and convert the text data into a document-term matrix. What is the difference between CountVectorizer and TfidfVectorizer(use_idf=False)?

In practice, CountVectorizer() returns integer counts while TfidfVectorizer(use_idf=False) returns floats. There is a second difference: TfidfVectorizer normalizes each document row by default (norm='l2'), so its values are normalized term frequencies rather than raw counts. Only when norm=None is passed as well do the two vectorizers produce identical matrices, up to the integer/float dtype.
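
A quick check, sketched below (X_train comes from the split in Question 2; get_feature_names_out requires scikit-learn 1.0 or newer):

In [ ]:
# Document-term matrix of raw counts from the training texts
cv = CountVectorizer()
X_train_counts = cv.fit_transform(X_train)      # sparse matrix of ints
vocab = np.array(cv.get_feature_names_out())    # column index -> term

# With use_idf=False and norm=None, TfidfVectorizer reproduces the same
# matrix with a float dtype; with the default norm='l2' the rows would be
# length-normalized instead.
tf = TfidfVectorizer(use_idf=False, norm=None)
X_train_tf = tf.fit_transform(X_train)
print(X_train_counts.dtype, X_train_tf.dtype)
print((X_train_counts != X_train_tf).nnz)       # 0 -> identical values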

4. Print the top 20 most frequent words in the training set.
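
One way to do this with the count matrix from Question 3 (a sketch; cv, vocab, and X_train_counts are carried over from the previous cell):

In [ ]:
# Sum the counts over all training documents and sort in descending order
word_counts = np.asarray(X_train_counts.sum(axis=0)).ravel()
top20 = np.argsort(word_counts)[::-1][:20]
for word, count in zip(vocab[top20], word_counts[top20]):
    print(word, count)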

Filter-based feature selection¶

5. From the feature selection library in sklearn, load the SelectKBest function and apply it to the BBC dataset using the chi-squared method. Extract the top 20 features.
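
A possible sketch, reusing X_train_counts, vocab, and y_train from above:

In [ ]:
# Score every term with the chi-squared statistic against the class labels
# and keep the 20 highest-scoring terms
selector_chi2 = SelectKBest(chi2, k=20)
X_train_chi2 = selector_chi2.fit_transform(X_train_counts, y_train)
print(vocab[selector_chi2.get_support()])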

6. Repeat the analysis in Question 5 with the mutual information feature selection method. Do you get the same list of words as with the chi-squared method?
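
The same selection with a different scoring function (a sketch; note that mutual_info_classif is considerably slower than chi2 on a large vocabulary):

In [ ]:
selector_mi = SelectKBest(mutual_info_classif, k=20)
X_train_mi = selector_mi.fit_transform(X_train_counts, y_train)
print(vocab[selector_mi.get_support()])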

Now you can build a classifier and train it using the output of these feature selection techniques. We are not going to do this right now, but if you are interested, you can transform your training and test sets using the selected features and continue with your classifier. Here are some tips, sketched below with the objects fitted in the previous answers (cv from Question 3 and selector_chi2 from Question 5):
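
In [ ]:
# Fit the vectorizer and the selector on the training data only, then apply
# the same fitted transformations to the test data -- never re-fit on test!
X_test_counts = cv.transform(X_test)
X_test_chi2 = selector_chi2.transform(X_test_counts)

# Any classifier can now be trained on the reduced matrix, e.g. naive Bayes
clf = MultinomialNB().fit(X_train_chi2, y_train)
print(metrics.classification_report(y_test, clf.predict(X_test_chi2)))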

Embedded feature selection¶

7. One of the functions for embedded feature selection is the SelectFromModel function in sklearn. Use this function with an SVM with L1 penalty and check how many non-zero coefficients are left in the model.
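
A possible sketch (LinearSVC with penalty='l1' requires dual=False; max_iter=5000 is an arbitrary choice to help convergence):

In [ ]:
# The L1 penalty drives many SVM coefficients to exactly zero;
# SelectFromModel keeps only the features with non-zero weights
svm_l1 = LinearSVC(penalty='l1', dual=False, max_iter=5000)
model = SelectFromModel(svm_l1)
X_train_l1 = model.fit_transform(X_train_counts, y_train)
print('features kept:', X_train_l1.shape[1], 'out of', X_train_counts.shape[1])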

8. What are the top features according to the SVM model? Tip: Use the function model.get_support() to find these features.
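
Following the tip (model is the fitted SelectFromModel from Question 7):

In [ ]:
# get_support() returns a boolean mask over the vocabulary columns
mask = model.get_support()
print(vocab[mask][:20])    # a sample of the features the SVM kept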

Model comparison¶

9. Create a pipeline with the tfidf representation and a random forest classifier.
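
One way to set this up (a sketch; the step names 'vect' and 'clf' are arbitrary labels):

In [ ]:
clf1 = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', RandomForestClassifier(random_state=42)),
])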

10. Fit the pipeline on the training set.

11. Use the pipeline to predict the outcome variable on your test set. Evaluate the performance of the pipeline using the classification_report function on the test subset. How do you interpret your results?
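
A sketch covering Questions 10 and 11; the raw texts go in directly, since the pipeline handles the vectorization internally:

In [ ]:
clf1.fit(X_train, y_train)
y_pred = clf1.predict(X_test)
print(metrics.classification_report(y_test, y_pred,
                                    target_names=data['target_names']))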

12. Create your second pipeline with the tfidf representation, an embedded feature selection step using the SVM classification method with L1 penalty, and a random forest classifier. Fit the pipeline on your training set and test it on the test set. How does the performance change?
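
A possible layout for this pipeline (same hedges as before; the selector sits between the vectorizer and the classifier):

In [ ]:
clf2 = Pipeline([
    ('vect', TfidfVectorizer()),
    ('select', SelectFromModel(LinearSVC(penalty='l1', dual=False,
                                         max_iter=5000))),
    ('clf', RandomForestClassifier(random_state=42)),
])
clf2.fit(X_train, y_train)
print(metrics.classification_report(y_test, clf2.predict(X_test)))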

13. Create your third and fourth pipelines with the tfidf representation, a chi2 feature selection (with 20 and 200 features for clf3 and clf4, respectively), and a random forest classifier.
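
A sketch of both pipelines; only the k of SelectKBest differs:

In [ ]:
clf3 = Pipeline([
    ('vect', TfidfVectorizer()),
    ('select', SelectKBest(chi2, k=20)),
    ('clf', RandomForestClassifier(random_state=42)),
])
clf4 = Pipeline([
    ('vect', TfidfVectorizer()),
    ('select', SelectKBest(chi2, k=200)),
    ('clf', RandomForestClassifier(random_state=42)),
])
for name, clf in [('clf3 (k=20)', clf3), ('clf4 (k=200)', clf4)]:
    clf.fit(X_train, y_train)
    print(name)
    print(metrics.classification_report(y_test, clf.predict(X_test)))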

14. We can change the learner by simply plugging a different classifier object into our pipeline. Create your fifth pipeline with the L1-penalized SVM as the feature selection method and naive Bayes as the classifier. Compare your results on the test set with the previous pipelines.
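
A sketch of the fifth pipeline; swapping the learner only means changing the last step:

In [ ]:
clf5 = Pipeline([
    ('vect', TfidfVectorizer()),
    ('select', SelectFromModel(LinearSVC(penalty='l1', dual=False,
                                         max_iter=5000))),
    ('clf', MultinomialNB()),
])
clf5.fit(X_train, y_train)
print(metrics.classification_report(y_test, clf5.predict(X_test)))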