Contents covered in this practical WILL NOT be part of the exam!
In this optional practical you will be answering a research question or solving a real problem. For that you will create a pipeline for classification or clustering.
All the data is processed and can be found here.
Here are some proposed research questions:
Classification
Example problem 1: Identification of fake news, hate speech or spam + Interpretability of results:
Data:
https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset or
https://github.com/aitor-garcia-p/hate-speech-dataset (https://paperswithcode.com/dataset/hate-speech) or
https://archive.ics.uci.edu/ml/datasets/YouTube+Spam+Collection
Goal: Evaluate performance of different methods and interpret the results using LIME
Example problem 2: Evaluate the importance of metadata. Create a classification system to identify the movie genre with and without metadata:
Data: https://www.kaggle.com/datasets/jrobischon/wikipedia-movie-plots
Options:
Create two classification systems, one using only metadata, one using only text. Stack them to create the best model: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.StackingClassifier.html
Use the functional API of Keras to create one model that handles both types of inputs: https://pyimagesearch.com/2019/02/04/keras-multiple-inputs-and-mixed-data/
Goal: Evaluate performance and interpret the results using LIME
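The stacking option could look roughly like the sketch below. It is a minimal illustration on a made-up toy table, not the practical's solution: the column names `plot`, `origin` and `genre`, the helper `column_pipeline`, and the choice of logistic regression for every stage are assumptions made only for this example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import StackingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the movie data; column names and values are hypothetical.
toy = pd.DataFrame({
    "plot": [
        "a detective hunts a killer", "two friends rob a bank",
        "a wizard fights a dragon", "elves defend a magic forest",
        "a cop chases a thief", "a gang plans a heist",
        "a young witch learns magic", "a knight seeks a magic sword",
    ],
    "origin": ["American", "British", "American", "British"] * 2,
    "genre": ["crime", "crime", "fantasy", "fantasy"] * 2,
})


def column_pipeline(column, transformer):
    """Select one column, transform it, and classify with logistic regression."""
    select = ColumnTransformer([("select", transformer, column)])
    return Pipeline([("features", select), ("clf", LogisticRegression(max_iter=1000))])


# Text-only model: a string column name passes a 1-D column to TfidfVectorizer.
text_model = column_pipeline("plot", TfidfVectorizer())
# Metadata-only model: a list of column names passes a 2-D frame to OneHotEncoder.
meta_model = column_pipeline(["origin"], OneHotEncoder(handle_unknown="ignore"))

# The final estimator learns how to combine the two models' predicted probabilities.
stack = StackingClassifier(
    estimators=[("text", text_model), ("meta", meta_model)],
    final_estimator=LogisticRegression(),
    cv=2,  # small only because the toy table has 8 rows
)
stack.fit(toy[["plot", "origin"]], toy["genre"])
pred = stack.predict(toy[["plot", "origin"]])
print(pred)
```

Fitting `text_model` and `meta_model` on their own as well makes it possible to compare text only, metadata only, and the stacked model on the same test split.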
Clustering:
Example problem 3: Create a recommendation system for movies based on their plot:
Data: https://www.kaggle.com/datasets/jrobischon/wikipedia-movie-plots
Output: What are the closest movies to "The Shawshank Redemption", "Goodfellas", and "Harry Potter and the Sorcerer's Stone"?
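One simple baseline for this question is to represent each plot as a TF-IDF vector and look up nearest neighbours by cosine distance. The sketch below uses four invented movies; the column names `Title` and `Plot`, the helper `closest_movies`, and the choice of TF-IDF (rather than embeddings or topic models) are assumptions for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Invented titles and plots; the real data would come from the movie plots file.
movies = pd.DataFrame({
    "Title": ["Prison Break-Out", "Jailhouse Friends", "Mob Life", "Wizard School"],
    "Plot": [
        "a banker is sent to prison and plans an escape from prison",
        "two inmates become friends in prison and dream of freedom",
        "a young man rises through the ranks of the mob",
        "a boy discovers he is a wizard and attends a school of magic",
    ],
})

# Vectorize the plots and index them for cosine-distance lookups.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(movies["Plot"])
nn = NearestNeighbors(metric="cosine").fit(X)


def closest_movies(title, k=2):
    """Return the k titles whose plots are closest to the given title's plot."""
    idx = movies.index[movies["Title"] == title][0]
    # Ask for one extra neighbour because the query movie matches itself.
    _, neighbours = nn.kneighbors(X[idx], n_neighbors=k + 1)
    return [movies["Title"].iloc[j] for j in neighbours[0] if j != idx][:k]


print(closest_movies("Prison Break-Out", k=1))
```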
Example problem 4: Cluster headlines using word embeddings:
Data: https://www.ims.uni-stuttgart.de/en/research/resources/corpora/goodnewseveryone/ (https://aclanthology.org/2020.lrec-1.194.pdf)
Goal: Do the clusters correlate with emotions or media sources?
You can come up with your own research question using any dataset on text analysis, e.g. from:
UCI repository: https://archive.ics.uci.edu/ml/datasets.php?format=&task=&att=&area=&numAtt=&numIns=&type=text&sort=nameUp&view=table
Papers with code repository: https://paperswithcode.com/datasets?mod=texts&page=1
Kaggle (code examples are often included): https://www.kaggle.com/datasets?tags=13204-NLP (but given the time restrictions, choosing one of the above is recommended)
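For Example problem 4 above, one way to check whether clusters line up with emotions or media sources is a contingency table plus an agreement score. In the sketch below the "headline embeddings" are synthetic vectors generated with NumPy, and the `emotion` labels are invented; in practice the vectors would come from whichever embedding model you choose.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Synthetic placeholder for headline embeddings: two well-separated groups of vectors.
rng = np.random.default_rng(0)
embeddings = np.vstack([
    rng.normal(0.0, 0.1, size=(5, 8)),
    rng.normal(3.0, 0.1, size=(5, 8)),
])
emotion = ["joy"] * 5 + ["anger"] * 5  # invented labels, one per headline

# Cluster the vectors without looking at the labels.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
clusters = kmeans.fit_predict(embeddings)

# Contingency table: rows are clusters, columns are emotions.
table = pd.crosstab(pd.Series(clusters, name="cluster"), pd.Series(emotion, name="emotion"))
print(table)

# Adjusted Rand index: 1 means identical groupings, values near 0 mean chance-level agreement.
ari = adjusted_rand_score(emotion, clusters)
print("Adjusted Rand index:", ari)
```

The same table can be built against the media source column to compare which of the two the clusters follow more closely.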
# path to the data
path_data = "./data"
# Data wrangling
import pandas as pd
import numpy as np
# Machine learning tools
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.util import ngrams
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
# Interpretable AI
#!pip install lime ipython
from lime.lime_text import LimeTextExplainer
from IPython.display import HTML, display
# data_rq1_fake = pd.read_csv("rq1_fake_news.csv.gzip",sep="\t",compression="gzip")
# data_rq1_hate_speech = pd.read_csv("rq1_hate_speech.csv.gzip",sep="\t",compression="gzip")
# data_rq1_youtube = pd.read_csv("rq1_youtube.csv.gzip",sep="\t",compression="gzip")
# data_rq2_3 = pd.read_csv("rq2_3_wiki_movie_plots.csv.gzip",sep="\t",compression="gzip")
# data_rq4 = pd.read_csv("rq4_gne-release-v1.0.csv.gzip",sep="\t",compression="gzip")
# data_rq1_fake.shape, data_rq1_hate_speech.shape, data_rq1_youtube.shape, data_rq2_3.shape, data_rq4.shape
# Download NLTK data
nltk.download("punkt")
nltk.download("punkt_tab")  # tokenizer data required by word_tokenize in recent NLTK releases
nltk.download("stopwords")
stop_words = set(stopwords.words("english"))
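# Custom NLTK analyzer with n-gram support
def nltk_ngram_analyzer(doc, ngram_range=(1, 1)):
    # Lowercase, tokenize, and keep alphabetic tokens longer than one character that are not stop words
    tokens = word_tokenize(doc.lower())
    tokens = [t for t in tokens if t.isalpha() and len(t) > 1 and t not in stop_words]
    # Build every n-gram for n between ngram_range[0] and ngram_range[1] (inclusive)
    all_ngrams = []
    for n in range(ngram_range[0], ngram_range[1] + 1):
        all_ngrams.extend([" ".join(gram) for gram in ngrams(tokens, n)])
    return all_ngrams


def make_analyzer(ngram_range):
    # Wrap the analyzer so that TfidfVectorizer can call it with a single argument
    return lambda doc: nltk_ngram_analyzer(doc, ngram_range=ngram_range)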
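To see what the n-gram loop in the analyzer produces, here is a small stand-in that skips tokenization and stop-word filtering so it runs without any NLTK data. The function `simple_ngrams` and the token list are made up for this illustration; it mirrors the loop above but is not the analyzer itself.

```python
def simple_ngrams(tokens, ngram_range=(1, 2)):
    """Return all n-grams (joined by spaces) for n in the inclusive range."""
    low, high = ngram_range
    out = []
    for n in range(low, high + 1):
        # Slide a window of length n over the token list.
        out.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return out


tokens = ["free", "speech", "forum"]
print(simple_ngrams(tokens, (1, 1)))
# ['free', 'speech', 'forum']
print(simple_ngrams(tokens, (1, 2)))
# ['free', 'speech', 'forum', 'free speech', 'speech forum']
```

With `ngram_range=(1, 2)` the vocabulary therefore contains both single words and adjacent word pairs, which is what the grid search later compares against unigrams only.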
Example problem 1: Identification of hate speech
Data on hate speech: https://github.com/aitor-garcia-p/hate-speech-dataset (https://paperswithcode.com/dataset/hate-speech)
Data on fake vs. real news: https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset
Data on YouTube spam messages: https://archive.ics.uci.edu/ml/datasets/YouTube+Spam+Collection
We provide code for the first dataset. Your goal is to improve the classifier by using a more advanced method.
Data: A dataset of hate speech annotated at sentence level on English-language Internet forum posts. The source forum is Stormfront, a large online community of white nationalists. A total of 10,568 sentences have been extracted from Stormfront and classified as conveying hate speech or not.
Step 1: Read data and create train-test split
df = pd.read_csv(
    f"{path_data}/rq1_hate_speech.csv.gzip", sep="\t", compression="gzip", index_col=0
)
df["label"] = df["label"].map({"hate": 1, "noHate": 0})
df = df[["text", "label"]]
df = df.dropna()
print(df.shape)
df.head()
# split into train and test
X_train, X_test, y_train, y_test = train_test_split(
    df["text"].values, df["label"].values, test_size=0.33, random_state=42
)
Step 2: Create pipeline and hyperparameter tuning
Create a pipeline that vectorizes the text, transforms it using TF-IDF, and classifies the forum sentences using LogisticRegression.
# Pipeline
pipe = Pipeline([
    (
        "vectorizer",
        TfidfVectorizer(analyzer=make_analyzer((1, 1))),  # initial n-gram range, will be tuned
    ),
    (
        "clf",
        LogisticRegression(
            solver="saga",  # saga supports l1_ratio
            max_iter=10000,  # do NOT set penalty here!
        ),
    ),
])
# Grid search
param_grid = {
    "vectorizer__analyzer": [
        make_analyzer(n) for n in [(1, 1), (1, 2)]
    ],
    "vectorizer__min_df": [1, 2, 5],
    "clf__C": [0.1, 1, 10, 100],
    "clf__l1_ratio": [0, 1],  # 0 = pure L2 penalty, 1 = pure L1 penalty
}
grid_search = GridSearchCV(
    pipe, param_grid=param_grid, verbose=2, n_jobs=-1, refit=True
)
# Run the search on the first 1,000 training examples only
grid_search.fit(X_train[:1000], y_train[:1000])
print("Best parameters:", grid_search.best_params_)
print("Best CV score:", grid_search.best_score_)
# Show the ten best parameter combinations
results = pd.DataFrame(grid_search.cv_results_)
results.sort_values(by="mean_test_score", ascending=False).head(10)
# Use the best parameters in the pipe and fit on the entire training set
pipe = pipe.set_params(**grid_search.best_params_)
clf_best = pipe.fit(X_train, y_train)
# print vocabulary size
print(len(clf_best["vectorizer"].get_feature_names_out()))
# vocabulary
# clf_best["vectorizer"].vocabulary_
# accuracy on the training set
print(clf_best.score(X_train, y_train))
# accuracy on the test set
print(clf_best.score(X_test, y_test))
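The `score` method of a classifier pipeline reports accuracy. When one class is much rarer than the other, accuracy can look good even if most examples of the rare class are missed, so per-class precision and recall are worth checking as well. The sketch below uses invented labels purely to show the mechanics; they are not results from the hate speech data.

```python
from sklearn.metrics import classification_report, confusion_matrix, recall_score

# Invented labels: 8 "noHate" (0) and 2 "hate" (1) examples, one hate example missed.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Accuracy is 0.9 here, but recall for the "hate" class is only 0.5.
hate_recall = recall_score(y_true, y_pred, pos_label=1)
print("Recall for hate:", hate_recall)

print(classification_report(y_true, y_pred, target_names=["noHate", "hate"]))
```

With the fitted pipeline, the same functions would be called with `y_test` and `clf_best.predict(X_test)`.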
# Add predictions to the dataframe
df["predicted"] = clf_best.predict(df["text"])
df["predicted_prob_hate"] = clf_best.predict_proba(df["text"])[:, 1]
df
Step 3: Interpretation of results
Interpretation of coefficients in the linear model
We can use the coefficients of the logistic regression to see which n-grams push a prediction towards hate (positive coefficients) or towards non-hate (negative coefficients).
# Extract the coefficients from the model
coefs = pd.DataFrame(
    {
        "gram": clf_best["vectorizer"].get_feature_names_out(),
        "coef": clf_best["clf"].coef_[0],
    }
)
# top words influencing hate
display(coefs.sort_values(by="coef", ascending=False).head(10))
# top words influencing non-hate
display(coefs.sort_values(by="coef", ascending=True).head(10))
Interpretation of coefficients using LIME (Local Interpretable Model-Agnostic Explanations)
LIME perturbs the text (for example, by removing words), observes how the predictions change, and uses this to estimate the impact of each word on the prediction.
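The idea can be sketched in a few lines. This is a simplified, LIME-style illustration and not the `lime` library's actual implementation: the classifier `toy_predict_proba` is hypothetical, and the proximity weights (fraction of words kept) are a simplification of the distance kernel used by the library.

```python
import numpy as np
from sklearn.linear_model import Ridge


def toy_predict_proba(texts):
    """Hypothetical black-box classifier: the word 'awful' raises the positive-class probability."""
    p = np.array([0.9 if "awful" in t.split() else 0.1 for t in texts])
    return np.column_stack([1 - p, p])


def lime_like_weights(text, predict_proba, num_samples=500, seed=0):
    """Estimate per-word influence with random word removal and a weighted linear surrogate."""
    rng = np.random.default_rng(seed)
    words = text.split()
    # 1) Random binary masks: 1 keeps a word, 0 removes it. Row 0 is the unmodified text.
    masks = rng.integers(0, 2, size=(num_samples, len(words)))
    masks[0] = 1
    perturbed = [" ".join(w for w, keep in zip(words, m) if keep) for m in masks]
    # 2) Query the black-box model on every perturbed text.
    probs = predict_proba(perturbed)[:, 1]
    # 3) Weight each sample by how similar it is to the original (fraction of words kept).
    weights = masks.mean(axis=1)
    # 4) Fit a weighted linear model; its coefficients are the per-word explanations.
    surrogate = Ridge(alpha=1.0).fit(masks, probs, sample_weight=weights)
    return sorted(zip(words, surrogate.coef_), key=lambda wc: -abs(wc[1]))


explanation = lime_like_weights("the food was awful today", toy_predict_proba)
for word, weight in explanation:
    print(f"{word:>6}: {weight:+.3f}")
```

Because the toy classifier only reacts to one word, that word receives by far the largest weight, while the others stay close to zero.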
# Find some extreme examples among the misclassified sentences
df_confused = df.loc[df["label"] != df["predicted"]]
# Labeled not hateful, but predicted as hate with the highest probability
pred_hate_not_hate = (
    df_confused.loc[df_confused["label"] == 0]
    .sort_values(by="predicted_prob_hate")["text"]
    .iloc[-1]
)
# Labeled hateful, but predicted as hate with the lowest probability
pred_not_hate_hate = (
    df_confused.loc[df_confused["label"] == 1]
    .sort_values(by="predicted_prob_hate")["text"]
    .iloc[0]
)
# Sentences with the lowest and highest predicted probability of hate overall
less_hate = df.sort_values(by="predicted_prob_hate")["text"].iloc[0]
most_hate = df.sort_values(by="predicted_prob_hate")["text"].iloc[-1]
# Hard-coded sentence used as a borderline (around 50/50) example
pred_50_50 = "She says the class is out of control and the kids are unteachable , and the black administration does not support her "
print("Least hate: ", less_hate)
print("Most hate: ", most_hate)
print("Predicted hateful but labeled not hateful: ", pred_hate_not_hate)
print("Predicted innocuous but labeled hateful: ", pred_not_hate_hate)
print("Predicted 50/50: ", pred_50_50)
# start the explainer
explainer = LimeTextExplainer(class_names=["Innocuous", "Hateful"], bow=False)
# shows the explanation for our example instances
for text in [less_hate, most_hate, pred_hate_not_hate, pred_not_hate_hate, pred_50_50]:
    exp = explainer.explain_instance(
        text, clf_best.predict_proba, num_features=10, num_samples=1000
    )
    # text=True includes the highlighted input text in the HTML output
    exp.save_to_file("./lime_explainer_1.html", text=True)
    display(HTML(filename="./lime_explainer_1.html"))
    print(exp.as_list())
    print("-" * 100)
# explain a new, made-up sentence
exp = explainer.explain_instance(
    "I believe Dutch people have inferior food and they should be colonized by Belgium",
    clf_best.predict_proba,
    num_features=10,
    num_samples=1000,
)
exp.save_to_file("./lime_explainer_2.html", text=True)
display(HTML(filename="./lime_explainer_2.html"))
print(exp.as_list())
print("-" * 100)
Now it's your turn.
Either:
Adapt RQ1 using different models (e.g. a CNN, as shown below) or data (either the ones described under RQ1, or any other)
Or start on a different RQ
import os
os.environ["KERAS_BACKEND"] = "tensorflow"
import random
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from keras import layers, utils, Sequential
import keras_tuner as kt
random_state = 321
seed = 137
random.seed(seed)
np.random.seed(seed)
tf.random.set_seed(seed)
plt.style.use('ggplot')
def plot_history(history, val=0):
    # Training curves; validation curves are added when val is truthy
    acc = history.history['accuracy']
    if val == 1:
        val_acc = history.history['val_accuracy']  # we can add a validation set in our fit function with nn
    loss = history.history['loss']
    if val == 1:
        val_loss = history.history['val_loss']
    x = range(1, len(acc) + 1)
    plt.figure(figsize=(12, 5))
    # Left panel: accuracy per epoch
    plt.subplot(1, 2, 1)
    plt.plot(x, acc, 'b', label='Training accuracy')
    if val == 1:
        plt.plot(x, val_acc, 'r', label='Validation accuracy')
    plt.ylabel('accuracy')
    plt.xlabel('epoch')
    plt.title('Accuracy')
    plt.legend()
    # Right panel: loss per epoch
    plt.subplot(1, 2, 2)
    plt.plot(x, loss, 'b', label='Training loss')
    if val == 1:
        plt.plot(x, val_loss, 'r', label='Validation loss')
    plt.ylabel('loss')
    plt.xlabel('epoch')
    plt.title('Loss')
    plt.legend()
## PROCESS DATA
X_train, X_test, y_train, y_test = train_test_split(
    df["text"].values,
    df["label"].values,
    test_size=0.33,
    random_state=42,
)
# One-hot encode the labels for the softmax output layer
y_train = utils.to_categorical(y_train)
y_test = utils.to_categorical(y_test)
num_classes = y_train.shape[1]
# Text vectorization
max_words = 20000
sequence_length = 100
vectorizer = layers.TextVectorization(
    max_tokens=max_words,
    output_mode="int",
    output_sequence_length=sequence_length,
)
vectorizer.adapt(X_train)
vocab_size = len(vectorizer.get_vocabulary())
print("Vocab size:", vocab_size)
def build_model(hp):
    model = Sequential([
        vectorizer,  # text → integer sequences
        layers.Embedding(
            input_dim=vocab_size,
            output_dim=hp.Choice("embedding_dim", [50, 100]),
        ),
        layers.Conv1D(
            filters=hp.Choice("num_filters", [32, 64, 128]),
            kernel_size=hp.Choice("kernel_size", [3, 5, 7]),
            activation="relu",
        ),
        layers.GlobalMaxPooling1D(),
        layers.Dense(10, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(
        optimizer="adam",
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model
# Random search over the hyperparameters defined in build_model
tuner = kt.RandomSearch(
    build_model,
    objective="val_accuracy",
    max_trials=10,
    executions_per_trial=1,
    directory="cnn_tuning",
    project_name="text_cnn_vectorized",
)
tuner.search(
    X_train,
    y_train,
    epochs=15,
    batch_size=64,
    validation_data=(X_test, y_test),
    verbose=1,
)
best_model = tuner.get_best_models(num_models=1)[0]
best_model.summary()
# Rebuild the model with the best hyperparameters and train it from scratch
best_hp = tuner.get_best_hyperparameters(1)[0]
model = build_model(best_hp)
history = model.fit(
    X_train,
    y_train,
    validation_data=(X_test, y_test),
    epochs=15,
    batch_size=64,
    verbose=1,
)
plot_history(history, val=True)
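In the cells above the test set also serves as validation data during tuning, so the reported validation accuracy may be somewhat optimistic. A common alternative is to carve a separate validation split out of the training data and keep the test set for the final evaluation only. The sketch below shows such a three-way split on made-up sentences; the variable names and split sizes are assumptions for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data standing in for the text and label arrays.
texts = np.array([f"sentence {i}" for i in range(100)], dtype=object)
labels = np.array([0] * 80 + [1] * 20)

# First split off the test set, stratified so class proportions are preserved.
texts_trainval, texts_test, labels_trainval, labels_test = train_test_split(
    texts, labels, test_size=0.33, random_state=42, stratify=labels
)
# Then split the remainder into training and validation sets.
texts_train, texts_val, labels_train, labels_val = train_test_split(
    texts_trainval, labels_trainval, test_size=0.2, random_state=42, stratify=labels_trainval
)
print(len(texts_train), len(texts_val), len(texts_test))
```

The validation arrays would then be passed as `validation_data` to `tuner.search` and `model.fit`, with the test arrays used once at the end via `model.evaluate`.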
# Example sentences to explain (hard-coded)
less_hate = "- YouTube"
most_hate = "Look what happens when Whites leave black Countries alone to do what they do naturally The blacks in White Countries today should be on their knees thanking Whites for trying to civilize them"
pred_hate_not_hate = (
"Too many whites think they deserve what negroes dish out because of guilt ."
)
pred_not_hate_hate = "https://www.stormfront.org/forum/t1020784/ https : //www.stormfront.org/forum/t102 ... ghlight = sweden https : //www.stormfront.org/forum/t102 ... ghlight = sweden https : //www.stormfront.org/forum/t101 ... ghlight = sweden https : //www.stormfront.org/forum/t101 ... ghlight = sweden https : //www.stormfront.org/forum/t100 ... ghlight = sweden https : //www.stormfront.org/forum/t100 ... ghlight = sweden https : //www.stormfront.org/forum/t100 ... ghlight = sweden God save them ....."
pred_50_50 = "She says the class is out of control and the kids are unteachable , and the black administration does not support her "
print("Least hate: ", less_hate)
print("Most hate: ", most_hate)
print("Predicted hateful but labeled not hateful: ", pred_hate_not_hate)
print("Predicted innocuous but labeled hateful: ", pred_not_hate_hate)
print("Predicted 50/50: ", pred_50_50)
explainer = LimeTextExplainer(
    class_names=["Innocuous", "Hate"],
    bow=False,
)


def predict_proba(texts):
    # LIME passes a list of strings; convert it to an array before calling the Keras model
    texts = np.array(texts, dtype=object)
    return model.predict(texts)


# shows the explanation for our example instances
for text in [less_hate, most_hate, pred_hate_not_hate, pred_not_hate_hate, pred_50_50]:
    exp = explainer.explain_instance(
        text,
        predict_proba,
        num_features=10,
        num_samples=1000,
    )
    # text=True includes the highlighted input text in the HTML output
    exp.save_to_file("lime_explainer.html", text=True)
    display(HTML(filename="lime_explainer.html"))
    print(exp.as_list())
    print("-" * 100)