Assignment 2: Embedding Quality

Introduction

This assignment explores text representations through clustering and topic modeling techniques. You will apply unsupervised learning techniques to explore the structure of textual data. You will experiment with word and document embeddings, combined with clustering or topic modeling, to analyze patterns and uncover latent topics in a dataset. The goal is to compare how well these methods capture meaningful representations of text.

We will be working with news articles, scientific abstracts, or product reviews (you will choose one dataset).

You will:

1- Preprocess the data.
2- Generate word or document embeddings using TF-IDF, Word2Vec, or BERT.
3- Apply clustering (e.g., K-means) or topic modeling (LDA) to group similar texts.
4- Analyze the quality and interpretability of the results.

Step 1: Data preparation

Select a dataset of text documents. You may use an existing dataset from:
- the course lab practicals
- the Kaggle website
- or the sklearn package

Then perform basic text preprocessing on the selected dataset.
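
As an illustration of what basic preprocessing could look like, the sketch below lowercases a text, tokenizes it with a regular expression, and removes stop words. The function name `preprocess` and the tiny `STOP_WORDS` set are assumptions made for this example. They are not prescribed by the assignment, and you are free to choose other steps or libraries.

```python
import re

# Hypothetical, deliberately tiny stop word list; a real project would
# likely use a fuller list from an NLP library.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "the", "to"}


def preprocess(text):
    """Lowercase the text, keep runs of letters as tokens, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS]


print(preprocess("The cats are sleeping on the sofa!"))
# prints ['cats', 'sleeping', 'sofa']
```

Note that this simple tokenizer discards digits and non-ASCII letters, which may or may not suit your dataset.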

Step 2: Embedding generation

Choose two of the following embedding methods:
- TF-IDF (for simple document-level representations)
- Word2Vec (pre-trained or trained on your dataset)
- BERT embeddings (sentence-transformers or another implementation)

Compute document-level embeddings by averaging word vectors or using sentence embeddings.
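
The averaging approach can be sketched as follows. The three-dimensional `word_vectors` dictionary holds made-up values that stand in for vectors from a real Word2Vec model, and `document_embedding` is a hypothetical helper name. Treat this as a minimal sketch under those assumptions, not a required implementation.

```python
import numpy as np

# Toy 3-dimensional word vectors (invented values, for illustration only).
word_vectors = {
    "cat": np.array([1.0, 0.0, 0.0]),
    "dog": np.array([0.0, 1.0, 0.0]),
    "pet": np.array([1.0, 1.0, 0.0]),
}


def document_embedding(tokens, vectors, dim=3):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[tok] for tok in tokens if tok in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


# "unknown" has no vector and is skipped, so the result is the
# mean of the "cat" and "dog" vectors.
doc_vec = document_embedding(["cat", "dog", "unknown"], word_vectors)
print(doc_vec)
```

Out-of-vocabulary tokens are ignored here. How you handle them (skip, zero vector, or something else) is a design choice worth mentioning in your report.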

Step 3: Clustering or topic modeling

Choose one of the two approaches:
- Option 1: Clustering. Apply K-means or hierarchical clustering to the document embeddings. Find an optimal number of clusters using the elbow method, the silhouette score, or other techniques.
- Option 2: Topic modeling. Apply LDA to the documents. Find an optimal number of topics using the elbow method, the silhouette score, or other techniques.
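
For Option 1, one possible way to choose the number of clusters with the silhouette score is sketched below using scikit-learn. The data `X` is synthetic (three well-separated groups of 2-D points) and merely stands in for real document embeddings. The helper name `silhouette_by_k` and the range of k values are assumptions made for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for document embeddings: three well-separated groups.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(size=(30, 2)) for c in centers])


def silhouette_by_k(X, k_values, seed=0):
    """Fit K-means for each k and return a dict {k: silhouette score}."""
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed)
        labels = model.fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return scores


scores = silhouette_by_k(X, range(2, 7))
best_k = max(scores, key=scores.get)  # k with the highest silhouette score
print(best_k, round(scores[best_k], 3))
```

On real text embeddings the scores are usually much less clear-cut than on this toy data, so it can help to inspect the whole curve of scores rather than only the maximum.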

Step 4: Evaluation and interpretation

Compare and discuss your clustering or topic modeling results across the different embedding methods.
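
One option for quantifying how similarly two embedding methods group the same documents is the adjusted Rand index from scikit-learn, sketched below. The label lists are invented for illustration, and `clustering_agreement` is a hypothetical helper name. This is only one of several reasonable comparison measures, and it does not replace a qualitative discussion of interpretability.

```python
from sklearn.metrics import adjusted_rand_score

# Hypothetical cluster labels for the same six documents, obtained from
# two different embedding methods (values invented for illustration).
labels_tfidf = [0, 0, 0, 1, 1, 1]
labels_bert = [1, 1, 0, 0, 0, 0]


def clustering_agreement(labels_a, labels_b):
    """Adjusted Rand index between two partitions of the same documents.

    The score is 1.0 for identical partitions (cluster names may differ)
    and close to 0.0 for chance-level agreement.
    """
    return adjusted_rand_score(labels_a, labels_b)


print(clustering_agreement(labels_tfidf, labels_bert))
```

Because the index ignores the cluster names themselves, two methods that produce the same grouping with different label numbers still score 1.0.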

AI tools / ChatGPT

The skills assessed in this assignment include your capability to write code and a report, to communicate clearly about this code and its results, and to understand the components that go into model comparison.

You are not permitted to use AI tools such as ChatGPT / DeepSeek for any part of this assignment. Also do not use AutoML systems.

Written report

Your written report should contain the following elements.

- Data description & exploration. Describe the data and use a visualization to support your story. (approx. one or two paragraphs)
- Describe any additional text preprocessing steps you used (e.g., stemming, stop word removal).
- Briefly describe which models you selected to generate embeddings. (approx. two or three paragraphs)
- Briefly describe which clustering / topic modeling method you selected. (approx. one or two paragraphs)
- Describe how you evaluated and compared the text representations, and why. (approx. two or three paragraphs)
- Write down what each team member contributed to the assignment.

Hand-in

Create a folder with:

- Your report (.html or .pdf)
- The source file that generates your report (Word, LaTeX, qmd, and / or notebook file)
- All the resources needed (such as data) to generate the report from the source file
- The Python code file or Jupyter notebook file you used

Zip this folder and upload it to Blackboard.

Computational reproducibility

This folder needs to be computationally reproducible. Make sure to check this (i.e., try it on different computers).
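
One common ingredient of reproducibility is fixing random seeds so that repeated runs give the same results. The sketch below seeds Python's and NumPy's global random number generators. The helper name `set_seed` and the seed value 42 are arbitrary choices for this example.

```python
import random

import numpy as np


def set_seed(seed=42):
    """Seed Python's and NumPy's global random number generators."""
    random.seed(seed)
    np.random.seed(seed)


# Re-seeding with the same value reproduces the same random draws.
set_seed(42)
first = np.random.rand(3)
set_seed(42)
second = np.random.rand(3)
print(np.array_equal(first, second))
# prints True
```

Many scikit-learn estimators also accept a `random_state` argument, which is often a more explicit way to fix randomness than relying on global seeds. Seeds alone do not guarantee identical results across machines, so recording package versions is worthwhile too.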

Grading

Your grade will be based on the following components:

Code quality: Including legibility, consistent style, computational reproducibility, comments. (20%)
Content & Analysis: Whether all required components of the written report are complete, correct, and concise. (80%)