Novel Representations And Methods In Text Classification


Dr. Manuel Montes y Gómez
Dr. Hugo Jair Escalante

Course Motivation and Description:

Text classification, the automatic assignment of documents to one or more predefined categories, is a task highly relevant to information retrieval and has been studied extensively. Most text classification solutions adopt supervised approaches and focus on organizing documents from static digital libraries according to their topics (e.g., sports vs. politics). Two core components of any classification system are the representation adopted for documents and the classification model itself. This tutorial deals with recent advances in both components.

The default representation for documents in text classification is the bag-of-words (BOW), in which weighting schemes similar to those used in information retrieval are adopted. Although this representation has proven very helpful for thematic text classification, in novel, non-thematic text classification problems (e.g., authorship attribution, sentiment analysis, and opinion mining) the standard BOW can be outperformed by more advanced representations. In this course we focus on four document representations that have proved useful for capturing more information than the raw occurrence of terms recorded by BOW: locally weighted BOW, distributional term representations, concise representations, and graph-based representations.

Likewise, the tutorial covers recent developments in building classification models. Specifically, we consider contextual classification techniques and full model selection methods. The former focus on designing classifiers that consider the neighborhood of a document in order to make better predictions. The latter focus on developing automatic methods for building classification systems, that is, black-box tools that receive a data set as input and return an effective classification model.
Keywords: Text classification, document representations, locally-weighted bag-of-words, distributional representations, concise representations, graph-based representations, full model selection, contextual classification.


Session 1: Introduction to text classification
This session presents the fundamentals of text classification, both thematic and non-thematic. It introduces the machine-learning approach to text classification, describes common representations and methods, and discusses issues concerning the evaluation of text classification methods.
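
As a point of reference for the later sessions, the standard BOW representation with an information-retrieval style weighting can be sketched as follows. This is only a minimal illustration, assuming whitespace tokenization and a plain tf × log(N/df) weighting; the function names tokenize and tfidf_bow are hypothetical and are not part of the course material.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def tfidf_bow(documents):
    """Return (vocabulary, vectors) with one TF-IDF weighted BOW vector per document."""
    tokenized = [tokenize(d) for d in documents]
    vocabulary = sorted({t for doc in tokenized for t in doc})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(t for doc in tokenized for t in set(doc))
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)  # raw term frequency; absent terms count as 0
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocabulary])
    return vocabulary, vectors


docs = ["the team won the match", "the senate passed the bill"]
vocab, vecs = tfidf_bow(docs)
```

In this toy example the term "the" occurs in every document, so its weight is zero, whereas topical terms such as "team" or "senate" keep a positive weight. Word order is discarded entirely, which is the limitation that the representations in Sessions 2 and 3 address.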

Session 2: Concept-based representations
This session elaborates on concept-based representations of documents. On the one hand, we review distributional representations that capture second- and higher-order relationships between terms. We present the application of these representations to short-text categorization. We also outline novel techniques for incorporating syntactic information into distributional representations. On the other hand, we introduce the concise document representation and describe its use in authorship attribution and author profiling.
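
To make the idea of a distributional term representation concrete, the sketch below represents each term by the documents in which it occurs and then represents a (possibly very short) text as the average of its term vectors. It is a simplified illustration under our own assumptions (raw counts, no weighting or normalization), not the exact formulation presented in the session; term_vectors and document_vector are hypothetical names.

```python
from collections import Counter


def term_vectors(tokenized_docs):
    """Map each term to a vector of its counts across the corpus documents."""
    vocab = sorted({t for doc in tokenized_docs for t in doc})
    vectors = {t: [0.0] * len(tokenized_docs) for t in vocab}
    for j, doc in enumerate(tokenized_docs):
        for term, count in Counter(doc).items():
            vectors[term][j] = float(count)
    return vectors


def document_vector(tokens, vectors, dim):
    """Represent a text as the average of the vectors of its known terms."""
    known = [t for t in tokens if t in vectors]
    total = [0.0] * dim
    for t in known:
        for j, value in enumerate(vectors[t]):
            total[j] += value
    # Texts with no known term are mapped to the zero vector.
    return [x / len(known) for x in total] if known else total


corpus = [
    ["cheap", "flight", "deal"],
    ["flight", "delay", "airport"],
    ["stock", "market", "deal"],
]
tv = term_vectors(corpus)
rep = document_vector(["airport", "deal"], tv, len(corpus))
```

Because a text is described through the contexts of its terms rather than through the terms themselves, two short texts that share no words can still obtain similar representations, which is what makes this family of representations attractive for short-text categorization.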

Session 3: Modeling sequential and syntactic information
This session describes alternatives that extend the BOW approach by including sequential or syntactic information. First, we describe the locally weighted bag-of-words (LOWBOW) framework, a novel representation that captures the sequential information of documents. We describe the use of this representation in text classification and authorship attribution and outline other suitable applications. We also present some new representations that capture the syntactic information of documents, together with their application to non-thematic classification tasks.
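
The core idea behind a locally weighted bag-of-words can be sketched as a set of term histograms, each computed around a different location in the document with a smoothing kernel. The code below is a simplified illustration under our own assumptions (a Gaussian kernel over normalized token positions, a fixed bandwidth sigma, and no smoothing of the term distribution); it is not the full LOWBOW framework, and the function name lowbow and its parameters are hypothetical.

```python
import math


def lowbow(tokens, vocabulary, locations, sigma=0.2):
    """Return one kernel-weighted term histogram per location in [0, 1]."""
    n = len(tokens)
    index = {t: i for i, t in enumerate(vocabulary)}
    # Normalized position of each token in the document, in (0, 1).
    positions = [(i + 0.5) / n for i in range(n)]
    histograms = []
    for mu in locations:
        # Tokens close to location mu receive larger weights.
        weights = [math.exp(-((p - mu) ** 2) / (2 * sigma ** 2)) for p in positions]
        total = sum(weights)
        hist = [0.0] * len(vocabulary)
        for token, w in zip(tokens, weights):
            i = index.get(token)
            if i is not None:  # out-of-vocabulary tokens are ignored
                hist[i] += w / total
        histograms.append(hist)
    return histograms


tokens = "a a a a b b b b".split()
vocab = ["a", "b"]
start, end = lowbow(tokens, vocab, locations=[0.0, 1.0])
```

Here the histogram at the start of the document is dominated by "a" and the one at the end by "b". A plain BOW would assign the same vector to this document and to its reversal, whereas the local histograms differ, which is the sequential information the representation is meant to preserve.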

Session 4: Non-conventional classification methods
This session considers text classification techniques especially suited to low-quality training sets. We discuss techniques such as self-training, PU-learning (learning from positive and unlabeled examples), and consensus classification in the context of text classification. We outline novel applications of these methods, including short-text classification, cross-lingual text classification, and opinion spam detection.
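
As an illustration of one of these techniques, self-training can be sketched as a loop that trains a classifier on the labeled documents, labels the unlabeled documents it is confident about, and retrains. The sketch below assumes documents are already represented as numeric vectors and uses a nearest-centroid classifier with a distance margin as the confidence measure; both choices, and all function names and parameters, are our own assumptions made for brevity.

```python
def fit(labeled):
    """Compute one centroid per class from (vector, label) pairs."""
    by_class = {}
    for x, y in labeled:
        by_class.setdefault(y, []).append(x)
    return {
        y: [sum(v[i] for v in xs) / len(xs) for i in range(len(xs[0]))]
        for y, xs in by_class.items()
    }


def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


def predict(centroids, x):
    """Return (label, margin): nearest class and its lead over the runner-up."""
    ranked = sorted((sq_dist(c, x), label) for label, c in centroids.items())
    margin = ranked[1][0] - ranked[0][0] if len(ranked) > 1 else float("inf")
    return ranked[0][1], margin


def self_train(labeled, unlabeled, min_margin=1.0, max_rounds=10):
    """Iteratively move confidently predicted documents into the training set."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        centroids = fit(labeled)
        confident, rest = [], []
        for x in pool:
            label, margin = predict(centroids, x)
            if margin >= min_margin:
                confident.append((x, label))
            else:
                rest.append(x)
        if not confident:
            break  # nothing new to learn from; stop early
        labeled.extend(confident)
        pool = rest
    return fit(labeled), labeled


seed = [([0.0, 0.0], "neg"), ([4.0, 4.0], "pos")]
unlabeled = [[0.5, 0.0], [3.5, 4.0], [2.0, 2.0]]
model, final_labeled = self_train(seed, unlabeled)
```

In this toy run the two unlabeled points that lie close to a seed example are added to the training set, while the ambiguous point halfway between the classes is never labeled. The margin threshold controls the usual trade-off in self-training between using more unlabeled data and propagating labeling errors.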

Session 5: Automatic construction of classification models
This session elaborates on methods for the automated construction of classification models in the context of text classification. We review the full model selection formulation, describe a solution to this problem, and present applications in authorship attribution and text classification.