Topic labelling
===============

Workflow
--------

- convert the input corpus into raw text format
- build/train topic labelling models
- evaluate and compare model performance

Reuters dataset
---------------

As a simple example we'll use the Reuters corpus (available through the nltk
library). The corpus consists of ~10k news articles (70/30 train/test split)
with ~90 different topic labels. Each document can carry multiple topic
labels, and the topics have a highly non-uniform distribution.

Go to a new work area and issue::

    reuters_to_txt.py

This will create the following files:

- data_train.txt
- data_test.txt
- labels_train.txt
- labels_test.txt

The data files contain the raw text from the news articles (one document per
line). The labels files contain the topic labels in a one-hot format: an MxN
boolean matrix, where M is the number of documents and N is the number of
topics, indicating which labels are attributed to each document. This is the
standard format used by the library. If you convert other corpora into this
format, you can apply the atnlp topic labelling tools to them directly.

Model building
--------------

atnlp comes with a number of preconfigured topic labelling models. The models
take in raw text and produce labels in one-hot format. The text parsing and
modelling is typically split into a few steps, called a pipeline. Blueprints
for some basic pipelines are provided at :ref:`models` (an illustrative
sketch is also given at the end of this page). User-defined pipelines can be
used by placing the pipeline definition (a python file) in your local working
directory.

Each algorithm in a pipeline typically has a number of hyperparameters, which
you will usually want to tune. Pipelines are configured using yaml. Some
basic configurations for the atnlp pipelines can be found at :ref:`configs`.
The config first defines the blueprint for the pipeline. Pipelines already in
the ``share/models`` path can be referenced by filename without the ``.py``
extension, while for user-defined models an absolute/relative path including
the ``.py`` extension must be given. User-defined configs can also be made,
and should be placed in the working directory.

Train model
-----------

Let's train the default svm model on the training dataset::

    train.py data_train.txt labels_train.txt -m svm

The model is saved to ``svm.pkl``, which can then be used to provide
predictions. To train a user-defined model, provide the full path to your
model config (including the ``.yml`` extension).

Evaluate model
--------------

A quick evaluation of the topic labelling performance can be obtained via::

    evaluate.py data_test.txt labels_test.txt svm.pkl

Note: multiple models can be passed for comparison.

This will generate an HTML report. For more detailed investigations we
suggest using a jupyter notebook (see the evaluation sketch at the end of
this page). You may want to take advantage of the functionality provided in
:mod:`atnlp.eval`.

Model predictions
-----------------

Model predictions in txt format can be obtained using::

    predict.py data_test.txt svm.pkl
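
Illustrative Python sketches
----------------------------

To give a feel for what a pipeline blueprint contains, here is a minimal
sketch in the scikit-learn style. This is an illustration only, not the
actual atnlp svm blueprint: the file name, the ``build`` function, and all
hyperparameter values are assumptions::

    # sketch_pipeline.py -- illustrative only, not the actual atnlp blueprint.
    # Raw text goes in, a one-hot label matrix comes out.
    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.svm import LinearSVC

    def build():
        """Return an untrained topic labelling pipeline."""
        return Pipeline([
            # turn raw documents into TF-IDF feature vectors
            ("tfidf", TfidfVectorizer(sublinear_tf=True, max_df=0.5)),
            # one binary SVM per topic, so a document can receive
            # multiple labels (as in Reuters)
            ("svm", OneVsRestClassifier(LinearSVC(C=1.0))),
        ])

The one-vs-rest wrapper is what makes the model multi-label: each topic gets
its own binary classifier, so per-topic predictions are independent.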
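
Used outside the atnlp command-line tools, such a pipeline trains like any
scikit-learn estimator. The sketch below assumes the labels files store the
one-hot matrix as whitespace-separated 0/1 values, one document per row;
check the files produced by ``reuters_to_txt.py`` before relying on this::

    import numpy as np
    from sketch_pipeline import build  # the hypothetical blueprint above

    # one raw-text document per line, as produced by reuters_to_txt.py
    with open("data_train.txt") as f:
        train_docs = f.read().splitlines()

    # assumed layout: M rows (documents) x N columns (topics) of 0/1 flags
    y_train = np.loadtxt("labels_train.txt").astype(bool)

    model = build()
    model.fit(train_docs, y_train)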
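
Similarly, predictions can be scored in a notebook with standard
scikit-learn metrics. The prediction file name below is hypothetical (use
whatever ``predict.py`` actually writes), and the same 0/1 matrix layout is
assumed::

    import numpy as np
    from sklearn.metrics import f1_score

    y_true = np.loadtxt("labels_test.txt").astype(bool)
    y_pred = np.loadtxt("predictions.txt").astype(bool)  # hypothetical file name

    # micro-averaging gives a sensible headline number for a highly
    # non-uniform label distribution like the Reuters topics
    print("micro F1:", f1_score(y_true, y_pred, average="micro"))
    print("macro F1:", f1_score(y_true, y_pred, average="macro"))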