Topic labelling¶
Workflow¶
- convert input corpus into raw text format
- build/train topic labelling models
- evaluate and compare model performance
Reuters dataset¶
As a simple example we’ll use the Reuters corpus (available through the nltk library). The corpus consists of ~10k news articles (70/30 train/test split) with ~90 different topic labels. Each document can carry multiple topic labels, and the topics are very unevenly distributed across documents.
In a new working directory, run:
reuters_to_txt.py
This will create the following files:
- data_train.txt
- data_test.txt
- labels_train.txt
- labels_test.txt
The data files contain the raw text of the news articles (one document per line). The labels files contain the topic labels in a one-hot format: an MxN boolean matrix, where M is the number of documents and N is the number of topics, in which entry (i, j) indicates whether topic j is attributed to document i. Because a document can carry several topics, a row may contain more than one true entry.
This is the standard format used by the library. If you convert other corpora into this format, then you will easily be able to apply the atnlp topic modelling tools to them.
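As an illustration of the layout described above, the following is a minimal sketch of writing a small custom corpus as a data file and a labels file. The function names (`to_one_hot`, `write_corpus`) are hypothetical and not part of atnlp, and the on-disk encoding of the label matrix (space-separated 0/1 values, one row per document) is an assumption; check the output of reuters_to_txt.py for the exact layout the library expects.

```python
from pathlib import Path


def to_one_hot(doc_topics, topics):
    """Return an M x N matrix of 0/1 flags (M documents, N topics)."""
    index = {topic: col for col, topic in enumerate(topics)}
    matrix = []
    for assigned in doc_topics:
        row = [0] * len(topics)
        for topic in assigned:
            row[index[topic]] = 1
        matrix.append(row)
    return matrix


def write_corpus(docs, doc_topics, topics, data_path, labels_path):
    """Write one document per line and one label row per line."""
    # Collapse internal whitespace: a newline inside a document would
    # break the one-document-per-line layout.
    lines = [" ".join(doc.split()) for doc in docs]
    Path(data_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    # ASSUMPTION: label rows are stored as space-separated 0/1 values.
    rows = [" ".join(map(str, row)) for row in to_one_hot(doc_topics, topics)]
    Path(labels_path).write_text("\n".join(rows) + "\n", encoding="utf-8")


# Toy corpus: two documents, two topics (column order follows `topics`).
docs = ["Wheat exports rose\nsharply this year.", "Crude oil and wheat prices fell."]
doc_topics = [["wheat"], ["crude", "wheat"]]
topics = ["crude", "wheat"]
matrix = to_one_hot(doc_topics, topics)
```

The second row of `matrix` has two entries set, which is how a document with multiple topic labels appears in this format.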
Model building¶
atnlp comes with a number of preconfigured topic labelling models. The models take in raw text and produce labels in a one-hot format. Text parsing and modelling are typically split into a few steps (together called a pipeline). Blueprints for some basic pipelines are provided at Models. User-defined pipelines can be used by placing the pipeline definition (a python file) in your local working directory.
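To make the idea of a pipeline concrete, here is a minimal sketch written directly with scikit-learn: a TF-IDF text vectoriser followed by one linear SVM per topic. This is an assumption about what such a pipeline could look like, not a copy of any atnlp blueprint; the function name `build_pipeline`, the step names and the toy data are all hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


def build_pipeline():
    """Raw text in, one-hot label rows out."""
    return Pipeline([
        # Step 1: turn each document into a TF-IDF weighted bag of words.
        ("tfidf", TfidfVectorizer(stop_words="english")),
        # Step 2: train one binary linear SVM per topic column.
        ("clf", OneVsRestClassifier(LinearSVC(C=1.0))),
    ])


# Toy training data; label columns are [crude, wheat].
docs = [
    "wheat harvest grain exports",
    "crude oil barrel prices",
    "wheat grain and crude oil",
    "grain wheat shipment",
]
labels = np.array([[0, 1], [1, 0], [1, 1], [0, 1]])

model = build_pipeline()
# Hyperparameters of any step can be set by name: <step>__<parameter>.
model.set_params(clf__estimator__C=0.5)
model.fit(docs, labels)
pred = model.predict(["crude oil prices"])  # one row of 0/1 flags per document
```

Each step exposes its own hyperparameters (here, for example, the SVM regularisation strength `C`), which is why a pipeline is usually paired with a configuration that sets them.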
Each algorithm in a pipeline typically has a number of hyperparameters, which the user will usually want to tune. Pipelines are configured using YAML. Some basic configurations for the atnlp pipelines can be found at Configs. A config first defines the blueprint for the pipeline. Pipelines already in the share/models path can be referenced by filename without the .py extension, while for user-defined models an absolute or relative path including the .py extension must be given.
User-defined configs can also be made, and should be placed in the working directory.
Train model¶
Let’s train the default svm model on the training dataset:
train.py data_train.txt labels_train.txt -m svm
The model is saved to svm.pkl, which can then be used to provide predictions.
To train a user-defined model, provide the full path to your model config (including the .yml extension).
Evaluate model¶
A quick evaluation of the topic modelling performance can be obtained via:
evaluate.py data_test.txt labels_test.txt svm.pkl
This will generate an HTML report.
Note: multiple models can be passed for comparison.
For more detailed investigations we suggest using a Jupyter notebook. You may want to take advantage of the functionality provided in atnlp.eval.
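When exploring results in a notebook, keep in mind that with very unevenly distributed topics a single aggregate score can hide poor performance on rare topics. The sketch below computes micro- and macro-averaged F1 from two one-hot label matrices using only the standard library. It is an independent illustration, not the atnlp.eval API; the function name `f1_scores` is hypothetical.

```python
def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for two M x N matrices of 0/1 flags."""
    n_topics = len(y_true[0])
    tp = [0] * n_topics  # true positives per topic
    fp = [0] * n_topics  # false positives per topic
    fn = [0] * n_topics  # false negatives per topic
    for true_row, pred_row in zip(y_true, y_pred):
        for j, (t, p) in enumerate(zip(true_row, pred_row)):
            if t and p:
                tp[j] += 1
            elif p:
                fp[j] += 1
            elif t:
                fn[j] += 1

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro: pool the counts over all topics, so frequent topics dominate.
    micro = f1(sum(tp), sum(fp), sum(fn))
    # Macro: average the per-topic scores, so every topic counts equally.
    macro = sum(f1(*counts) for counts in zip(tp, fp, fn)) / n_topics
    return micro, macro


# Topic 0 is frequent and always predicted correctly;
# topic 1 is rare and missed entirely.
y_true = [[1, 0], [1, 0], [1, 0], [0, 1]]
y_pred = [[1, 0], [1, 0], [1, 0], [0, 0]]
micro, macro = f1_scores(y_true, y_pred)
```

In this toy case the micro average stays high while the macro average drops, because the missed rare topic contributes a per-topic score of zero.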
Model predictions¶
Model predictions in txt format can be obtained using:
predict.py data_test.txt svm.pkl