atnlp.eval

Model evaluation

html.py

Classes for rendering documents as html.

class atnlp.eval.html.Report[source]

Simple html report class based on bootstrap

Use the interface to add elements (titles, figures, tables, text, etc.), then use write to dump the rendered html to file.

add_figure(cap='')[source]

Add figure to document

Call this directly after creating a figure with matplotlib. The figure will be embedded into the html document.

Parameters:
  • cap – figure caption (optional)
add_section(title)[source]

Add section to document

Parameters:
  • title – section title
add_styled_table(tab, cap='')[source]

Add styled table to document

Note: full control over html style is given to the Styler and bootstrap css is not used, so it can be difficult to get something that actually looks good.

Parameters:
  • tab – table (pandas Styler)
  • cap – caption (optional)
add_table(tab, cap='')[source]

Add table to document

Parameters:
  • tab – table (pandas DataFrame)
  • cap – caption (optional)
add_text(text)[source]

Add paragraph text to document

Parameters:
  • text – text string
add_title(title, par=None)[source]

Add title to document

Parameters:
  • title – title string
  • par – paragraph to go with title (optional)
write(filename)[source]

Write rendered html to file

Parameters:
  • filename – path to output file
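
A minimal usage sketch (the constructor signature is not shown here, so a no-argument constructor is assumed; the figure data and file names are placeholders):

import matplotlib.pyplot as plt
import pandas as pd

from atnlp.eval.html import Report

report = Report()
report.add_title("Topic labelling evaluation", par="Results on the held-out test set.")
report.add_section("Score distribution")

# draw a figure with matplotlib, then embed it straight away
plt.figure()
plt.hist([0.1, 0.4, 0.35, 0.8], bins=4)
report.add_figure(cap="Distribution of predicted scores")

# tables are passed as pandas DataFrames
df = pd.DataFrame({"topic": ["earn", "acq"], "f1": [0.95, 0.91]})
report.add_table(df, cap="Per-topic F1 scores")

report.add_text("All metrics are computed on the test split.")
report.write("report.html")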

metrics.py

Functionality for computing performance metrics, typically custom metrics not provided by sklearn.

atnlp.eval.metrics.flpd_score(Y_true, Y_pred)[source]

Return ‘false labels per document’ score

‘False labels per document’ is defined as:

score := total number of false labels / number of examples
Parameters:
  • Y_true – ground truth topic labels (one-hot format)
  • Y_pred – topic predictions (one-hot format)
Returns:

false labels per document score

atnlp.eval.metrics.mlpd_score(Y_true, Y_pred)[source]

Return ‘missing labels per document’ score

‘Missing labels per document’ is defined as:

score := total number of missing labels / number of examples
Parameters:
  • Y_true – ground truth topic labels (one-hot format)
  • Y_pred – topic predictions (one-hot format)
Returns:

missing labels per document score

atnlp.eval.metrics.recall_all_score(Y_true, Y_pred)[source]

Return the ‘recall all’ score

‘Recall all’ is defined as:

score := number of examples with all labels correct / number of examples
Parameters:
  • Y_true – ground truth topic labels (one-hot format)
  • Y_pred – topic predictions (one-hot format)
Returns:

recall all score
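
The three definitions above can be written out directly for one-hot numpy arrays; the following is an illustrative sketch of those formulas, not the library implementation:

import numpy as np

Y_true = np.array([[1, 0, 0],
                   [0, 1, 1],
                   [1, 0, 1]])
Y_pred = np.array([[1, 1, 0],
                   [0, 1, 0],
                   [1, 0, 1]])

n_examples = Y_true.shape[0]

# false label: predicted 1 where the ground truth is 0
flpd = ((Y_pred == 1) & (Y_true == 0)).sum() / n_examples

# missing label: predicted 0 where the ground truth is 1
mlpd = ((Y_pred == 0) & (Y_true == 1)).sum() / n_examples

# recall all: fraction of examples with every label correct
recall_all = (Y_pred == Y_true).all(axis=1).mean()

print(flpd, mlpd, recall_all)  # 0.333..., 0.333..., 0.333...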

plot.py

Functionality for creating performance summary plots.

atnlp.eval.plot.background_composition_pie(Y_true, Y_score, topic, threshold, min_topic_frac=0.05)[source]

Create a pie chart illustrating the major background contributions for given label

Background topics contributing less than min_topic_frac will be merged into a single contribution called “Other”.

A bar chart is also included illustrating the overall topic composition.

Parameters:
  • Y_true – ground truth topic labels (one-hot format)
  • Y_score – topic probability predictions (shape: samples x topics)
  • topic – name of topic to investigate
  • threshold – threshold above which to investigate background contributions
  • min_topic_frac – minimum background sample fraction
Returns:

tuple (figure, list of axes)
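
A hedged usage sketch; Y_true and Y_score are assumed to share the same topic (column) ordering, and the topic name and file name below are placeholders:

from atnlp.eval.plot import background_composition_pie

fig, axes = background_composition_pie(
    Y_true, Y_score, topic="earn", threshold=0.5, min_topic_frac=0.05
)
fig.savefig("earn_background.png")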

atnlp.eval.plot.binary_classification_accuracy_overlays(classifiers, X_train, y_train, X_test, y_test)[source]

Create overlays of binary classification accuracy for multiple classifiers

Parameters:
  • classifiers – list of tuples (name, classifier)
  • X_train – training data
  • y_train – binary training labels
  • X_test – testing data
  • y_test – binary testing labels
Returns:

tuple (figure, axis)
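
A hedged sketch comparing two sklearn-style classifiers on the same binary task (X_train, y_train, X_test and y_test are assumed to exist in the caller's scope):

from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

from atnlp.eval.plot import binary_classification_accuracy_overlays

classifiers = [
    ("logreg", LogisticRegression()),
    ("naive bayes", MultinomialNB()),
]
fig, ax = binary_classification_accuracy_overlays(
    classifiers, X_train, y_train, X_test, y_test
)
fig.savefig("accuracy_overlays.png")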

atnlp.eval.plot.create_awesome_plot_grid(nminor, ncol=5, maj_h=2, maj_w=3, min_xlabel=None, min_ylabel=None, maj_xlabel=None, maj_ylabel=None, grid=True)[source]

Returns an awesome plot grid

The grid includes a specified number (nminor) of minor plots (unit size in the grid) and a single major plot whose size can be specified in grid units (maj_h and maj_w).

The major plot is located top-right. If either dimension is 0 the major plot is omitted.

The minor plots are tiled from left-to-right, top-to-bottom on a grid of width ncol and will be spaced around the major plot.

The grid will look something like this:

#----#----#----#---------#
|    |    |    |         |
|    |    |    |         |
#----#----#----#         |
|    |    |    |         |
|    |    |    |         |
#----#----#----#----#----#
|    |    |    |    |    |
|    |    |    |    |    |
#----#----#----#----#----#
|    |    |
|    |    | -->
#----#----#
Parameters:
  • nminor – number of minor plots
  • ncol – width of grid (in grid units)
  • maj_h – height of major plot (in grid units)
  • maj_w – width of major plot (in grid units)
  • min_xlabel – x-axis label of minor plots
  • min_ylabel – y-axis label of minor plots
  • maj_xlabel – x-axis label of major plot
  • maj_ylabel – y-axis label of major plot
  • grid – draw grid lines (if True)
Returns:

tuple (figure, major axis, minor axes (flat list), minor axes (2D list))
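
For example, a grid of 13 minor plots with the default 5-column layout and a 2x3 major plot in the top-right corner could be built as follows (the axis labels and plotted values are placeholders):

from atnlp.eval.plot import create_awesome_plot_grid

fig, ax_major, axes_flat, axes_2d = create_awesome_plot_grid(
    nminor=13, ncol=5, maj_h=2, maj_w=3,
    min_xlabel="threshold", min_ylabel="precision",
    maj_xlabel="threshold", maj_ylabel="precision",
)
# fill the minor plots (e.g. one per topic) and the major plot (e.g. the average)
for ax in axes_flat:
    ax.plot([0.0, 0.5, 1.0], [1.0, 0.8, 0.3])
ax_major.plot([0.0, 0.5, 1.0], [1.0, 0.85, 0.4])
fig.savefig("grid.png")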

atnlp.eval.plot.false_labels_matrix(Y_true, Y_pred)[source]

Create MxM false labels matrix for M topics

Each column represents a given ground truth topic label. Each row represents the absolute number of false predicted labels.

Parameters:
  • Y_true – ground truth topic labels (one-hot format)
  • Y_pred – topic predictions (one-hot format)
Returns:

tuple (figure, axis)

atnlp.eval.plot.get_multimodel_sample_size_dependence(models, datasets, labels, sample_fracs, scoring=None, cat_scoring=None)[source]

Return performance metrics vs training sample size

Fractions of data (sample_fracs) are randomly sampled from the training dataset and used to train the models, which are always evaluated on the full testing datasets.

Parameters:
  • models – list of topic labelling models
  • datasets – list of input data for models (each is (training, testing) tuple)
  • labels – tuple (train, test) of ground truth topic labels (one-hot format)
  • sample_fracs – list of sample fractions to scan
  • scoring – sklearn scorer or scoring name for topic averaged metric
  • cat_scoring – sklearn scorer or scoring name for individual topic metric
Returns:

tuple (entries per step, averaged model scores for each step, model scores for each topic for each step)

atnlp.eval.plot.keras_train_history_graph(history, metrics)[source]

Plot selected performance metrics as a function of training epoch.

Parameters:
  • history – keras training history
  • metrics – list of metric names to plot
Returns:

tuple (figure, list of axes)
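
A hedged sketch; history is the object returned by keras Model.fit, and the metric names must match keys of history.history (the model and data names below are placeholders):

from atnlp.eval.plot import keras_train_history_graph

history = model.fit(
    X_train, Y_train,
    validation_data=(X_test, Y_test),
    epochs=10,
)
fig, axes = keras_train_history_graph(history, metrics=["loss", "val_loss"])
fig.savefig("training_history.png")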

atnlp.eval.plot.multimodel_sample_size_dependence_graph(models, model_names, datasets, labels, sample_fracs, scoring=None, cat_scoring=None)[source]

Create graph of performance metric vs training sample size

Fractions of data (sample_fracs) are randomly sampled from the training dataset and used to train the models, which are always evaluated on the full testing datasets.

Parameters:
  • models – list of topic labelling models
  • model_names – list of model names
  • datasets – list of input data for models (each is (training, testing) tuple)
  • labels – tuple (train, test) of ground truth topic labels (one-hot format)
  • sample_fracs – list of sample fractions to scan
  • scoring – sklearn scorer or scoring name for topic averaged metric
  • cat_scoring – sklearn scorer or scoring name for individual topic metric
Returns:

tuple (figure, major axis, minor axes (flat list), minor axes (2D list))
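
A hedged usage sketch for two models; the model, dataset and scorer names are placeholders, and each dataset is a (training, testing) tuple as described above:

from atnlp.eval.plot import multimodel_sample_size_dependence_graph

fig, ax_major, axes_flat, axes_2d = multimodel_sample_size_dependence_graph(
    models=[model_a, model_b],
    model_names=["tfidf-svm", "cnn"],
    datasets=[(X_train_a, X_test_a), (X_train_b, X_test_b)],
    labels=(Y_train, Y_test),
    sample_fracs=[0.1, 0.25, 0.5, 1.0],
    scoring="f1_macro",
)
fig.savefig("sample_size_dependence.png")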

atnlp.eval.plot.topic_correlation_matrix(Y)[source]

Create MxM correlation matrix for M topics

Each column represents a given ground truth topic label. Each row represents the relative frequency with which other ground truth labels co-occur.

Parameters:
  • Y – ground truth topic labels (one-hot format)
Returns:

tuple (figure, axis)

atnlp.eval.plot.topic_labelling_barchart(Y_true, Y_preds, model_names)[source]

Create topic labelling barchart

The figure includes a 1x4 grid of bar charts, illustrating the number of samples, precision, recall and f1 scores for each topic. The scores are overlaid for each model.

Parameters:
  • Y_true – ground truth topic labels (one-hot format)
  • Y_preds – topic predictions for each model (list of one-hot formats)
  • model_names – topic labelling model names
Returns:

tuple (figure, list of axes)
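
A hedged sketch comparing two models (the model names and prediction arrays are placeholders, all in one-hot format):

from atnlp.eval.plot import topic_labelling_barchart

fig, axes = topic_labelling_barchart(
    Y_true, [Y_pred_a, Y_pred_b], model_names=["tfidf-svm", "cnn"]
)
fig.savefig("topic_barchart.png")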

atnlp.eval.plot.topic_labelling_barchart_cv(models, model_names, model_inputs, Y, cv=10)[source]

Create topic labelling barchart with k-fold cross-validation

Figure layout is the same as in topic_labelling_barchart().

K-fold cross-validation is used to estimate uncertainties on the metrics.

Parameters:
  • models – list of topic labelling models
  • model_names – list of model names
  • model_inputs – list of input data for models
  • Y – ground truth topic labels (one-hot format)
  • cv – number of folds for cross-validation
Returns:

tuple (figure, list of axes)

atnlp.eval.plot.topic_labelling_scatter_plots(Y_true, Y_pred, sample_min=None, thresholds=None)[source]

Create scatter plots comparing precision, recall and number of samples

Parameters:
  • Y_true – ground truth topic labels (one-hot format)
  • Y_pred – topic predictions (one-hot format)
  • sample_min – minimum number of examples per topic
  • thresholds – list of thresholds per category (optional)
Returns:

tuple (figure, list of axes)

atnlp.eval.plot.topic_migration_matrix(Y_true, Y_pred)[source]

Create MxM migration matrix for M topics

Each column represents a given ground truth topic label. Each row represents the relative frequency with which predicted labels are assigned.

Parameters:
  • Y_true – ground truth topic labels (one-hot format)
  • Y_pred – topic predictions (one-hot format)
Returns:

tuple (figure, axis)
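
A hedged sketch producing both the label co-occurrence and migration matrices for one set of one-hot labels and predictions (file names are placeholders):

from atnlp.eval.plot import topic_correlation_matrix, topic_migration_matrix

fig_corr, ax_corr = topic_correlation_matrix(Y_true)
fig_mig, ax_mig = topic_migration_matrix(Y_true, Y_pred)
fig_corr.savefig("topic_correlation.png")
fig_mig.savefig("topic_migration.png")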

table.py

Functionality for creating performance summary tables.

atnlp.eval.table.multimodel_topic_labelling_summary_tables(Y_true, Y_preds, model_names, sample_min=None, thresholds=None)[source]

Return dictionary of topic labelling summary tables for multiple model predictions

The dictionary includes a single table for each of the metrics included in topic_labelling_summary_table(), where the key is the metric name.

An overall summary table (with key summary) is also provided.

In each table, metrics are given for each of the models.

If sample_min is specified, topics with fewer examples will be omitted.

thresholds is a list of one threshold per category per model; if specified, the thresholds are applied to Y_preds to generate class predictions. In this case each element of Y_preds is assumed to be a matrix of class probability scores rather than predictions.

Parameters:
  • Y_true – ground truth topic labels (one-hot format)
  • Y_preds – list of topic predictions for each model (one-hot format)
  • model_names – name of each model
  • sample_min – minimum number of examples per topic
  • thresholds – list of thresholds per category (optional)
Returns:

dict of summary tables (pandas DataFrames)
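
A hedged sketch for two models (the model names and prediction arrays are placeholders); the returned dictionary is keyed by metric name, plus the overall summary table:

from atnlp.eval.table import multimodel_topic_labelling_summary_tables

tables = multimodel_topic_labelling_summary_tables(
    Y_true, [Y_pred_a, Y_pred_b],
    model_names=["tfidf-svm", "cnn"],
    sample_min=20,
)
print(tables.keys())       # one table per metric, plus 'summary'
print(tables["summary"])   # overall model comparison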

atnlp.eval.table.topic_labelling_summary_table(Y_true, Y_pred, sample_min=None, thresholds=None)[source]

Return topic labelling summary table for single model predictions

Contents of the table includes the following entries per topic:

  • samples: total number of examples
  • standard metrics: precision, recall, f1
  • fl: total number of false labels (for topic)
  • flps: false labels for topic / topic samples
  • flpd: false labels for topic / total documents
  • ml: total number of missing labels (for topic)
  • mlps: missing labels for topic / topic samples
  • mlpd: missing labels for topic / total documents

If sample_min is specified, topics with fewer examples will be omitted.

thresholds is a list of one threshold per category; if specified, the thresholds are applied to Y_pred to generate class predictions. In this case Y_pred is assumed to be a matrix of class probability scores rather than predictions.

Parameters:
  • Y_true – ground truth topic labels (one-hot format)
  • Y_pred – topic predictions (one-hot format)
  • sample_min – minimum number of examples per topic
  • thresholds – list of thresholds per category (optional)
Returns:

summary table (pandas DataFrame)
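
A hedged sketch that builds the per-topic table and renders it through the Report class from html.py (the sample_min value and file name are placeholders):

from atnlp.eval.html import Report
from atnlp.eval.table import topic_labelling_summary_table

table = topic_labelling_summary_table(Y_true, Y_pred, sample_min=20)

report = Report()
report.add_title("Topic labelling summary")
report.add_table(table, cap="Per-topic metrics (topics with fewer than 20 examples omitted)")
report.write("summary.html")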