Text Mining, Transforming Text into Knowledge: Course Syllabus 2025

Author

Ayoub Bagheri

Published

January 9, 2025

1 Introduction

With the rapid growth of digital textual data in many areas of science, there is a growing need for automated tools that can analyse, classify, and interpret this type of data. Text mining techniques can be used to create a structured representation of text, making its content more accessible to users and researchers. Text mining applications are everywhere: social media, web search, advertising, email, customer service, healthcare, marketing, etc. During the course, students will actively learn how to apply text mining methods to data analysis and how to use them together with natural language processing and machine learning techniques on real data problems. The course has a strong practical focus: students will gain hands-on experience in Python by applying the methods to real data during the course and interpreting the results.

Prerequisites

We assume that students who will join the course will have basic knowledge and / or motivation in programming (Python) and data science.

Objectives

The aim of this course is to provide students with an understanding of the principles, problems, techniques, and solutions associated with text mining and to enable them to gain knowledge of how recent advances in text mining relate to innovative approaches to organising, characterizing, finding and exploiting large amounts of textual information in the search for new knowledge. On completion of the course, students should be able to:

  • Explain and use text preprocessing and representation techniques.
  • Describe a text analysis system and its components, both optional and mandatory.
  • Define a text mining pipeline given a practical data science problem.
  • Implement all steps of a text mining pipeline: feature extraction, model learning, model evaluation.
  • Analyse and reflect on the different techniques used in text mining, the parameters required, and the problem solved.
  • Understand and apply some of the state-of-the-art methods in text mining and natural language processing.
  • Plan and carry out a text analysis experiment.

2 Course Policy

This course is worth 7.5 ECTS, which means it is designed to give one lecture and on lab per week.

Weekly course flow

A regular week in this course consists of one lecture (Monday at 15:15-17:00) and one lab session (Monday at 17:15-19:00). The material is introduced on a theoretical level in the lectures and then put into practice in the lab sessions. The practical work done in these labs is drawn from real life situations that allow the students to experience how to solve text mining and NLP taks in data science problems.

In addition, students will spend time during the course on two take-home group assignments.

  • The lectures are in-person. The required readings should be read before the lecture. These are not optional.
  • The lab sessions are in-person interactive sessions in which you apply the methods you learn about in the lectures. The answers to the exercises in the labs are discussed at the end of each session.
  • The skills acquired in the lectures and the labs provide the basis for doing the take-home assignment. This assignment is made in groups of 3-4 students and handed in via Brightspace.

Synchronous course policy

  • 202400006 is an offline-first course, with in-person lectures and lab sessions.
  • We find it important for interactive and collaborative learning that the course is offline-first.
  • If you miss a session, e.g., due to sickness, you should catch up in the regular way:
    • Read the readings
    • Go through the lecture slides
    • Do the practicals
    • Ask your peers if you have questions
    • (after the above) ask the lab teacher and the lecturer for further explanation

Who to ask what

If you have questions, first ensure the answer isn’t in this syllabus and then follow the table below:

Question type How to ask
Course proceedings Email course coordinator (a.bagheri@uu.nl)
Content - general Email / ask lecturer
Practical content Email / ask lab teacher (t.h.vanderkuil@uu.nl)
Assignment content Email / ask lab teacher
Lecture content Email Lecturer

Grading policy

Your final grade in the course consists of the following grading components:

  • Assignments (10%): There are two group assignments. Each assignment is graded and worth 5% of the final grade.
  • Final exam (90%): At the end of the course, there is a final exam. The exam consists of TRUE/FALSE, multiple-choice, and open questions.

To pass the course:

  • The weighted final grade of all grading components should be greater than or equal to 5.5. There is a minimum required grade of 5.5 for each of the above grading components.

Resit:

  • You can only retake one resit and only for the exam.

3 Course materials

Required Software

In this course, we will use Python. Try to install both on your computer by the start of the course.

Installing Python & Jupyter The best choice is to install Python and Jupyter on your computer, and for the easiest complete way could be to install [Anaconda] (https://www.anaconda.com/download). Otherwise you can use Google Colab, which is an interactive online notebook environment; this means no installation is necessary! However, you do need a Google account, so make sure you have one (or make one specifically for the course).

Required readings

Freely available sections from the following books:

Book Title (Authors) URL
SLP3 Speech and Language Processing, third edition (Jurafsky & Martin) link
NLPE Natural Language Processing (Eisenstein) link
ISLR Introduction to Statistical Learning (James et al.) link
IVFS Introduction to Variable and Feature Selection (Guyon & Elisseeff) link
PTMs Probabilistic Topic Models (Blei) link

And some other freely available articles & chapters

4 Class Schedule

You can find the up-to-date class schedule with locations on mytimetable.uu.nl.

Week Date Topic Type Reading
1 2025-02-03 Intro to text mining & regular expressions Lecture Syllabus
1 2025-02-03 Intro to text mining & regular expressions Lab NLPE 1, SLP3 2.1
2 2025-02-10 Text preprocessing Lecture SLP3 2.2, 2.3, 2.4, 2.5, 2.6, 2.7
2 2025-02-10 Text preprocessing Lab
3 2025-02-17 Text classification Lecture SLP3 4.1, 4.2, 4.3, 4.7, 4.8, NLPE 4.4
3 2025-02-17 Text classification Lab
4 2025-02-24 10:00 AM assignment 1 Deadline
4 2025-02-24 Feature selection Lecture IVSF
4 2025-02-24 Feature selection Lab
5 2025-03-03 Clustering & topic modeling Lecture ISLR 12.4, PTMs
5 2025-03-03 Clustering & topic modeling Lab
6 2025-03-10 Word embedding Lecture SLP3 6.3, 6.8, 6.9, 6.10, NLPE 14.5
6 2025-03-10 Word embedding Lab
7 2025-03-17 Deep learning & LLMs Lecture SLP3 7.1, 7.3, 8.1, 8.2
7 2025-03-17 Deep learning & LLMs Lab
8 2025-03-24 10:00 AM assignment 2 Deadline
8 2025-03-24 Sentiment analysis Lecture NLPE 4.1, SLP3 4.4
8 2025-03-24 Sentiment analysis Lab
9 2025-03-31 Responsible text mining & applications Lecture NLPE 14.6, SLP3 6.11
9 2025-03-31 Responsible text mining & applications Lab
10 2025-04-10 Final exam Exam
2025-04-22 Inspecting the final exam Exam Inspection
10 2025-05-14 Resit exam Exam