Special Topics In Text Mining

Instructor:

Dr. Manuel Montes y Gómez
mmontesg(at)inaoep(dot)mx
Office: CH 126
Office hours: 9:00 to 11:00
Lecture sections: Monday and Wednesday; CH 145; 16:00-17:15

Course Description:

This course provides an advanced overview of selected Text Mining tasks. It focuses mainly on state-of-the-art approaches to text classification and document clustering. In addition, it introduces some important subtasks and applications, such as authorship attribution, sentiment classification, plagiarism detection, text summarization, and visualization of search results. The course is intended for students who already have some background in data mining or machine learning.

The syllabus of the course can be downloaded here.

Topics:

Text classification

  1. Introduction to the task
    • Feature selection
    • Classification algorithms
    • Evaluation
  2. Semi-supervised methods for text classification
    • Self-training
    • Co-training
  3. Problems of non-thematic text classification
    • Sentiment classification
    • Authorship attribution
    • Plagiarism detection
  4. Crosslingual text classification

Document clustering

  1. Introduction to the task
    • Clustering methods
    • Clustering evaluation
  2. Improving clustering performance
    • Cluster ensembles
    • Unsupervised feature selection
    • Clustering short texts
  3. Applications of document clustering
    • Using clustering in information retrieval
    • Using clustering in document summarization
    • Using clustering in text classification
  4. Methods for multilingual clustering

Slides for the course:

Assignments

Each assignment consists of writing a brief report (1-2 pages) on a selected research paper. Written reports must describe the problem, the proposed solution, and the results achieved, and offer some ideas for improving the work.

Assignment 1: Read a paper about weighting schemes or feature selection for text classification. Due date: 01/17.

Assignment 2: Read a paper about centroid-based text classification, or using kNN for text classification. Due date: 02/02.

Assignment 3: Read a paper about the application of self-training or co-training in text classification. Due date: 02/16.

Project proposal report/presentations: 02/28 and 03/02

Assignment 4: Read a paper about cross-language text classification. Due date: 03/07.

Assignment 5: Read a paper about authorship attribution or sentiment classification (opinion mining). Due date: 03/23.

Assignment 6: Read a paper about document clustering or clustering evaluation. Due date: 04/06.

Assignment 7: Read a paper about cluster ensembles, feature selection for document clustering, or multilingual document clustering. Due date: 04/20.

Project final presentations: 04/25 and 04/27

Project final reports: 05/02