Assignment 1: Text classification

Introduction

In this assignment, you will classify the sentiment labels of computer reviews based on their review text. It is a follow-up to the last question in Practical 1, which concerned the analysis of the Computer Review dataset.

Computer review dataset

The Computer Review dataset is an annotated dataset for aspect-based sentiment analysis. The data originates from this website.

You can either:

  • Download a version of the dataset via the link on the course website.
  • Or download the latest version of the dataset you created in Practical 1.

Your task is to predict the sentiment labels of the computer reviews in the test data from their review text.

AI tools / ChatGPT

The skills assessed in this assignment include your ability to write code and a report, to communicate clearly about that code and its results, and to understand the components that go into a model comparison.

You are not permitted to use AI tools such as ChatGPT for any part of this assignment. Also do not use AutoML systems.

The assignment

  • Select three or more suitable classification methods (e.g. logistic regression, KNN, Naive Bayes, SVM) to predict the sentiment labels for the review texts. To do this, you will need to follow the steps of a text mining pipeline: text pre-processing, text representation, and classification. You may choose any methods you like; you don’t have to limit yourself to those taught directly in this course.
  • Train and evaluate these models for their predictive ability. Split your data into training and test sets with an appropriate validation split, and apply the techniques taught in the course.
  • Produce a model comparison study using the predictions on your test dataset and appropriate evaluation measures, such as the F1 score and the confusion matrix.
  • Produce a written report to communicate your work.
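The pipeline steps above (pre-processing, representation, classification, evaluation) could be sketched as follows. This is a minimal illustration, assuming scikit-learn is available; the toy reviews and labels are made up and merely stand in for the actual dataset.

```python
# Minimal sketch of one candidate model in the comparison study.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical stand-ins for the review texts and sentiment labels.
texts = [
    "great laptop, fast and silent",
    "excellent screen and battery life",
    "love the keyboard, very responsive",
    "terrible battery, died in an hour",
    "awful screen with dead pixels",
    "slow, noisy and overpriced",
] * 5
labels = ["positive", "positive", "positive",
          "negative", "negative", "negative"] * 5

# Split into training and test sets (stratified on the label).
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels)

# Text representation (TF-IDF) and classifier combined in one pipeline.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(X_train, y_train)

# Evaluate on the held-out test set.
preds = model.predict(X_test)
print("macro F1:", f1_score(y_test, preds, average="macro"))
print(confusion_matrix(y_test, preds))
```

Swapping `LogisticRegression` for another estimator (e.g. `MultinomialNB` or `LinearSVC`) while keeping the rest of the pipeline fixed is one straightforward way to set up a fair comparison between methods.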

Written report

Your written report should contain the following elements.

  • Data description & exploration. Describe the data and use a visualization to support your story. (approx. one or two paragraphs)
  • Briefly describe which models you selected to perform text classification. (approx. two or three paragraphs)
  • Describe any additional text pre-processing steps you used (e.g., stemming, stop word removal).
  • Describe how you compare the methods and why. (approx. two or three paragraphs)
  • Show which method is best and why. (approx. one paragraph)
  • Write down what each team member contributed to the assignment.
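To illustrate the pre-processing point above, here is a minimal sketch of stop word removal and stemming. It assumes scikit-learn's built-in English stop word list and uses a deliberately naive, hypothetical suffix stripper for illustration; in practice you would use a proper stemmer (e.g. NLTK's PorterStemmer).

```python
# Minimal pre-processing sketch: lowercasing, stop word removal, stemming.
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

def naive_stem(token):
    # Illustrative only: strip a few common English suffixes.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    tokens = text.lower().split()
    tokens = [t for t in tokens if t not in ENGLISH_STOP_WORDS]
    return [naive_stem(t) for t in tokens]

print(preprocess("The laptops arrived quickly and I loved the screen"))
```

Whatever steps you choose, report them explicitly, since they can noticeably affect the comparison between classifiers.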

Hand-in

Create a folder with:

  • Your report (.html or .pdf)
  • The source file that generates your report (Word, LaTeX, qmd, and / or notebook file)
  • All the resources needed (such as data) to generate the report from the source file
  • The Python / code files or Jupyter notebooks you have used

Zip this folder and upload it to Blackboard.

Computational reproducibility

This folder needs to be computationally reproducible. Make sure to check this, for example by regenerating the report from scratch on a different computer.
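One habit that helps with reproducibility is fixing random seeds, so that data splits and model fits give the same results on every machine. A minimal sketch, assuming NumPy; extend it with seeds for any other libraries you use.

```python
# Fix random seeds once, near the top of your script or notebook.
import random
import numpy as np

SEED = 42
random.seed(SEED)
np.random.seed(SEED)

# Pass the same seed to anything with a random_state parameter, e.g.:
# train_test_split(X, y, random_state=SEED)

# Re-seeding reproduces the same random draws:
a = np.random.rand(3)
np.random.seed(SEED)
b = np.random.rand(3)
print(a, b)  # identical draws after re-seeding
```

Listing your package versions (e.g. in a requirements.txt) alongside the seed makes it far more likely that the report regenerates identically elsewhere.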

Grading

Your grade will be based on the following components:

  • Code quality: Including legibility, consistent style, computational reproducibility, comments. (20%)
  • Content & Analysis: Whether all required components of the written report are complete, correct, and concise. (70%)
  • Performance: Your best classifier should outperform a trivial baseline. If it is only as good as a trivial baseline, you get half the points for this component; if it is on a par with a “good” model, you get all the points for this component. (10%)
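One common way to obtain such a trivial baseline is scikit-learn's DummyClassifier, which here always predicts the majority class. A minimal sketch with hypothetical toy labels standing in for the sentiment annotations:

```python
# Majority-class baseline: the bar your best model should clearly beat.
from sklearn.dummy import DummyClassifier

# Hypothetical toy labels standing in for the sentiment annotations.
y_train = ["positive"] * 7 + ["negative"] * 3
y_test = ["positive", "negative", "positive", "negative"]

baseline = DummyClassifier(strategy="most_frequent")
# DummyClassifier ignores the features, so placeholders suffice here.
X_train = [[0]] * len(y_train)
X_test = [[0]] * len(y_test)
baseline.fit(X_train, y_train)
print(baseline.score(X_test, y_test))  # accuracy of always guessing "positive"
```

Comparing every candidate model against this baseline, using the same test set and the same evaluation measures, makes the comparison study easy to interpret.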