Topic labelling

Workflow

  • convert input corpus into raw text format
  • build/train topic labelling models
  • evaluate and compare model performance

Reuters dataset

As a simple example, we'll use the Reuters corpus (available through the nltk library). The corpus consists of ~10k news articles (70/30 train/test split) with ~90 different topic labels. Each document may be assigned multiple topic labels, and the topic distribution is highly non-uniform.
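If the Reuters corpus isn't already installed locally, it can be fetched with nltk's downloader (a one-off step):

import nltk

# Download the Reuters corpus into the local nltk data directory.
nltk.download("reuters")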

Go to a fresh working directory and issue:

reuters_to_txt.py

This will create the following files:

  • data_train.txt
  • data_test.txt
  • labels_train.txt
  • labels_test.txt

The data files contain the raw text from the news articles (one document per line). The labels files contain the topic labels in a one-hot format (an MxN boolean matrix, where M is the number of documents and N is the number of topics, indicating which labels are assigned to each document).

This is the standard format used by the library. If you convert other corpora into this format, you can apply the atnlp topic modelling tools to them directly.
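For orientation, here is a minimal sketch of such a conversion. The toy documents and topic assignments are made up for illustration, and the assumption that the labels file holds rows of space-separated 0/1 values should be checked against the files produced by reuters_to_txt.py:

import numpy as np

# Toy corpus: parallel lists of documents and their topic names
# (illustrative only -- substitute your own corpus here).
documents = ["oil prices rose sharply", "wheat and grain exports fell"]
doc_topics = [["crude"], ["wheat", "grain"]]
topics = sorted({t for ts in doc_topics for t in ts})

# Data file: one document per line (strip embedded newlines).
with open("data_train.txt", "w") as f:
    for doc in documents:
        f.write(doc.replace("\n", " ") + "\n")

# Labels file: M x N boolean matrix (documents x topics) in one-hot
# format. Layout assumed to be space-separated 0/1 values per row.
onehot = np.zeros((len(documents), len(topics)), dtype=int)
for i, ts in enumerate(doc_topics):
    for t in ts:
        onehot[i, topics.index(t)] = 1
np.savetxt("labels_train.txt", onehot, fmt="%d")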

Model building

atnlp comes with a number of preconfigured topic labelling models. The models take raw text as input and produce labels in one-hot format. The text parsing and modelling are typically split into a sequence of steps (called a pipeline). Blueprints for some basic pipelines are provided at Models. User-defined pipelines can be used by placing the pipeline definition (a python file) in your local working directory; a sketch of such a file is given below.
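To give a feel for what a pipeline definition might contain, here is a minimal sketch built on scikit-learn. The module-level name pipeline and the choice of estimators are assumptions, not the atnlp convention; consult the bundled blueprints for the real structure:

# my_pipeline.py -- hypothetical user-defined pipeline definition.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Raw text in, one-hot labels out: vectorise the documents, then fit
# one binary SVM per topic (multi-label via one-vs-rest).
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("svm", OneVsRestClassifier(LinearSVC())),
])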

Each algorithm in a pipeline typically has a number of hyperparameters, which the user will want to tune. Pipelines are configured using yaml. Some basic configurations for the atnlp pipelines can be found at Configs. The config first defines the blueprint for the pipeline. Pipelines already in the share/models path can be referenced by filename without the .py extension, while for user-defined models an absolute/relative path including the .py extension must be given.

User-defined configs can also be written; they should be placed in the working directory.
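As a purely illustrative sketch, a user-defined config might look like the following. The key names here are guesses at the schema, not the actual atnlp convention; copy one of the bundled configs from Configs as a real starting point:

# my_svm.yml -- illustrative only; key names are assumptions.
model: ./my_pipeline.py    # user-defined blueprint (path with .py)
params:                    # hyperparameters forwarded to the pipeline
  tfidf__max_features: 50000
  svm__estimator__C: 1.0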

Train model

Let’s train the default svm model on the training dataset:

train.py data_train.txt labels_train.txt -m svm

The model is saved to svm.pkl, which can then be used to provide predictions.

To train a user-defined model, provide the full path to your model config (including the .yml extension).
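The saved model can also be loaded directly in Python. A minimal sketch, assuming svm.pkl is a plain pickle of a fitted scikit-learn-style estimator (if atnlp writes it with joblib instead, use joblib.load):

import pickle

# Load the trained pipeline and label some unseen text.
with open("svm.pkl", "rb") as f:
    model = pickle.load(f)

docs = ["oil prices rose sharply in early trading"]
y_pred = model.predict(docs)  # one one-hot row per document
print(y_pred.shape)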

Evaluate model

A quick evaluation of the topic modelling performance can be obtained via:

evaluate.py data_test.txt labels_test.txt svm.pkl

Note: multiple models can be passed for comparison

This will generate an HTML report.

For more detailed investigations we suggest using a Jupyter notebook, where you can take advantage of the functionality provided in atnlp.eval.
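As a starting point for such a notebook, the sketch below computes per-topic metrics with plain scikit-learn rather than atnlp.eval (whose API we don't reproduce here). It assumes the labels files hold rows of space-separated 0/1 values, as above:

import pickle
import numpy as np
from sklearn.metrics import classification_report

# Test documents (one per line) and the one-hot truth matrix.
with open("data_test.txt") as f:
    docs = f.read().splitlines()
y_true = np.loadtxt("labels_test.txt", dtype=int)

# Trained model produced by train.py.
with open("svm.pkl", "rb") as f:
    model = pickle.load(f)
y_pred = model.predict(docs)

# Per-topic precision/recall/F1; zero_division=0 silences warnings
# for topics that never occur in the test set.
print(classification_report(y_true, y_pred, zero_division=0))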

Model predictions

Model predictions in txt format can be obtained using:

predict.py data_test.txt svm.pkl
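If you want topic names rather than one-hot rows, a small post-processing sketch follows. The output filename and the topic list are assumptions: the columns of the prediction matrix must be mapped using the same topic ordering that was used to build the label files (for the Reuters example, nltk's reuters.categories() is the natural candidate, but verify the ordering):

import numpy as np

# Ordered topic list -- must match the label-matrix column order.
# Illustrative subset only; derive the real list from your corpus.
topics = ["acq", "crude", "earn", "grain", "wheat"]

# Filename assumed -- check what predict.py actually writes.
y_pred = np.loadtxt("predictions.txt", dtype=int)
for row in np.atleast_2d(y_pred):
    print([topics[j] for j in np.flatnonzero(row)])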