atnlp.eval

Model evaluation.
html.py

Classes for rendering documents as HTML.
class atnlp.eval.html.Report

Simple HTML report class based on Bootstrap.

Use the interface to add elements (title, figures, tables, text, etc.), then use write to dump the rendered HTML to file.
add_figure(cap='')

Add a figure to the document.

Call this directly after creating a figure with matplotlib; the figure will be embedded into the HTML document.

Parameters: cap – figure caption (optional)
add_styled_table(tab, cap='')

Add a styled table to the document.

Note: full control over the HTML style is given to the Styler and the Bootstrap CSS is not used, so it can be difficult to get something that actually looks good.

Parameters:
- tab – table (pandas Styler)
- cap – caption (optional)
add_table(tab, cap='')

Add a table to the document.

Parameters:
- tab – table (pandas DataFrame)
- cap – caption (optional)
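A minimal usage sketch; add_figure and add_table are as documented above, while the filename argument to write is an assumption:

import matplotlib.pyplot as plt
import pandas as pd
from atnlp.eval.html import Report

report = Report()

# figures are embedded when add_figure is called right after drawing
plt.plot([0, 1, 2], [1, 4, 9])
report.add_figure(cap='A simple curve')

# tables are added from pandas DataFrames
df = pd.DataFrame({'topic': ['a', 'b'], 'f1': [0.91, 0.85]})
report.add_table(df, cap='Per-topic scores')

# dump the rendered html to file (filename argument assumed)
report.write('report.html')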
metrics.py

Functionality for computing performance metrics, typically custom metrics not provided by sklearn.
atnlp.eval.metrics.flpd_score(Y_true, Y_pred)

Return the ‘false labels per document’ score.

‘False labels per document’ is defined as:

score := total number of false labels / number of examples

Parameters:
- Y_true – ground truth topic labels (one-hot format)
- Y_pred – topic predictions (one-hot format)

Returns: false labels per document score
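As a worked example of the definition, here is an equivalent computation sketched with numpy (not the library source); a false label is a predicted topic absent from the ground truth:

import numpy as np

Y_true = np.array([[1, 0, 1],
                   [0, 1, 0]])
Y_pred = np.array([[1, 1, 1],
                   [0, 1, 0]])

# a false label is a 1 in Y_pred where Y_true has a 0
n_false = ((Y_pred == 1) & (Y_true == 0)).sum()  # 1 false label (doc 0, topic 1)
flpd = n_false / Y_true.shape[0]                 # 1 / 2 = 0.5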
atnlp.eval.metrics.mlpd_score(Y_true, Y_pred)

Return the ‘missing labels per document’ score.

‘Missing labels per document’ is defined as:

score := total number of missing labels / number of examples

Parameters:
- Y_true – ground truth topic labels (one-hot format)
- Y_pred – topic predictions (one-hot format)

Returns: missing labels per document score
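Correspondingly, a missing label is a ground-truth topic the model failed to predict; an equivalent numpy sketch (not the library source):

import numpy as np

Y_true = np.array([[1, 0, 1],
                   [0, 1, 0]])
Y_pred = np.array([[1, 0, 0],
                   [0, 1, 0]])

# a missing label is a 1 in Y_true with a 0 in Y_pred
n_missing = ((Y_true == 1) & (Y_pred == 0)).sum()  # 1 missing label (doc 0, topic 2)
mlpd = n_missing / Y_true.shape[0]                 # 1 / 2 = 0.5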
atnlp.eval.metrics.recall_all_score(Y_true, Y_pred)

Return the ‘recall all’ score.

‘Recall all’ is defined as:

score := number of examples with all labels correct / number of examples

Parameters:
- Y_true – ground truth topic labels (one-hot format)
- Y_pred – topic predictions (one-hot format)

Returns: recall all score
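Under this definition the score is the multilabel exact-match ratio; an equivalent numpy sketch (not the library source):

import numpy as np

Y_true = np.array([[1, 0, 1],
                   [0, 1, 0]])
Y_pred = np.array([[1, 0, 0],
                   [0, 1, 0]])

# an example counts only if every one of its labels matches
n_all_correct = (Y_true == Y_pred).all(axis=1).sum()  # only doc 1 is fully correct
recall_all = n_all_correct / Y_true.shape[0]          # 1 / 2 = 0.5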
plot.py

Functionality for creating performance summary plots.
atnlp.eval.plot.background_composition_pie(Y_true, Y_score, topic, threshold, min_topic_frac=0.05)

Create a pie chart illustrating the major background contributions for a given label.

Background topics contributing less than min_topic_frac will be merged into a single contribution called “Other”. A bar chart is also included illustrating the overall topic composition.

Parameters:
- Y_true – ground truth topic labels (one-hot format)
- Y_score – topic probability predictions (shape: samples x topics)
- topic – name of topic to investigate
- threshold – threshold above which to investigate background contributions
- min_topic_frac – minimum background sample fraction

Returns: tuple (figure, list of axes)
atnlp.eval.plot.binary_classification_accuracy_overlays(classifiers, X_train, y_train, X_test, y_test)

Create overlays of binary classification accuracy for multiple classifiers.

Parameters:
- classifiers – list of tuples (name, classifier)
- X_train – training data
- y_train – binary training labels
- X_test – testing data
- y_test – binary testing labels

Returns: tuple (figure, axis)
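A usage sketch with synthetic data; it assumes the classifiers follow the sklearn estimator interface, and that the function handles fitting of the (unfitted) estimators it is given:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from atnlp.eval.plot import binary_classification_accuracy_overlays

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# (name, classifier) tuples as documented above
classifiers = [('logreg', LogisticRegression()),
               ('naive bayes', GaussianNB())]
fig, ax = binary_classification_accuracy_overlays(
    classifiers, X_train, y_train, X_test, y_test)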
atnlp.eval.plot.create_awesome_plot_grid(nminor, ncol=5, maj_h=2, maj_w=3, min_xlabel=None, min_ylabel=None, maj_xlabel=None, maj_ylabel=None, grid=True)

Return an awesome plot grid.

The grid includes a specified number (nminor) of minor plots (unit size in the grid) and a single major plot whose size can be specified in grid units (maj_h and maj_w). The major plot is located top-right; if either dimension is 0 the major plot is omitted.

The minor plots are tiled from left-to-right, top-to-bottom on a grid of width ncol and will be spaced around the major plot.

With the defaults (ncol=5, maj_h=2, maj_w=3) the grid will look something like this:

#----#----#--------------#
|    |    |              |
#----#----#    (major)   |
|    |    |              |
#----#----#----#----#----#
|    |    |    |    |    |
#----#----#----#----#----#
Parameters:
- nminor – number of minor plots
- ncol – width of grid (in grid units)
- maj_h – height of major plot (in grid units)
- maj_w – width of major plot (in grid units)
- min_xlabel – x-axis label of minor plots
- min_ylabel – y-axis label of minor plots
- maj_xlabel – x-axis label of major plot
- maj_ylabel – y-axis label of major plot
- grid – draw grid lines (if True)

Returns: tuple (figure, major axis, minor axes (flat list), minor axes (2D list))
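For example, a grid of 7 minor plots alongside the default 2x3 major plot; the unpacking follows the documented return tuple, and the axis labels are illustrative:

from atnlp.eval.plot import create_awesome_plot_grid

fig, ax_major, axes_flat, axes_2d = create_awesome_plot_grid(
    7, ncol=5, maj_h=2, maj_w=3,
    min_xlabel='threshold', min_ylabel='precision',
    maj_xlabel='threshold', maj_ylabel='precision')

# draw into the major and minor axes as usual
ax_major.plot([0, 1], [0, 1])
for ax in axes_flat:
    ax.plot([0, 1], [1, 0])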
atnlp.eval.plot.false_labels_matrix(Y_true, Y_pred)

Create an MxM false labels matrix for M topics.

Each column represents a given ground truth topic label. Each row represents the absolute number of false predicted labels.

Parameters:
- Y_true – ground truth topic labels (one-hot format)
- Y_pred – topic predictions (one-hot format)

Returns: tuple (figure, axis)
atnlp.eval.plot.get_multimodel_sample_size_dependence(models, datasets, labels, sample_fracs, scoring=None, cat_scoring=None)

Return performance metrics vs training sample size.

Fractions of the data (sample_fracs) are randomly sampled from the training dataset and used to train the models, which are always evaluated on the full testing datasets.

Parameters:
- models – list of topic labelling models
- datasets – list of input data for models (each is a (training, testing) tuple)
- labels – tuple (train, test) of ground truth topic labels (one-hot format)
- sample_fracs – list of sample fractions to scan
- scoring – sklearn scorer or scoring name for topic-averaged metric
- cat_scoring – sklearn scorer or scoring name for individual topic metric

Returns: tuple (entries per step, averaged model scores for each step, model scores for each topic for each step)
atnlp.eval.plot.keras_train_history_graph(history, metrics)

Plot selected performance metrics as a function of training epoch.

Parameters:
- history – keras training history
- metrics – list of metric names to plot

Returns: tuple (figure, list of axes)
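A usage sketch; it assumes history is the object returned by Keras model.fit and that the metric names match the keys Keras records (these vary across Keras versions). The model and data are illustrative:

import numpy as np
from tensorflow import keras
from atnlp.eval.plot import keras_train_history_graph

# tiny illustrative model and data
X = np.random.rand(100, 8)
y = (X.sum(axis=1) > 4).astype(int)
model = keras.Sequential([
    keras.layers.Dense(4, activation='relu', input_shape=(8,)),
    keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])
history = model.fit(X, y, epochs=5, verbose=0)

fig, axes = keras_train_history_graph(history, ['loss', 'accuracy'])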
atnlp.eval.plot.multimodel_sample_size_dependence_graph(models, model_names, datasets, labels, sample_fracs, scoring=None, cat_scoring=None)

Create a graph of performance metric vs training sample size.

Fractions of the data (sample_fracs) are randomly sampled from the training dataset and used to train the models, which are always evaluated on the full testing datasets.

Parameters:
- models – list of topic labelling models
- model_names – list of model names
- datasets – list of input data for models (each is a (training, testing) tuple)
- labels – tuple (train, test) of ground truth topic labels (one-hot format)
- sample_fracs – list of sample fractions to scan
- scoring – sklearn scorer or scoring name for topic-averaged metric
- cat_scoring – sklearn scorer or scoring name for individual topic metric

Returns: tuple (figure, major axis, minor axes (flat list), minor axes (2D list))
atnlp.eval.plot.topic_correlation_matrix(Y)

Create an MxM correlation matrix for M topics.

Each column represents a given ground truth topic label. Each row represents the relative frequency with which other ground truth labels co-occur.

Parameters: Y – ground truth topic labels (one-hot format)

Returns: tuple (figure, axis)
atnlp.eval.plot.topic_labelling_barchart(Y_true, Y_preds, model_names)

Create a topic labelling barchart.

The figure includes a 1x4 grid of bar charts, illustrating the number of samples, precision, recall and f1 scores for each topic. The scores are overlaid for each model.

Parameters:
- Y_true – ground truth topic labels (one-hot format)
- Y_preds – topic predictions for each model (list of one-hot formats)
- model_names – topic labelling model names

Returns: tuple (figure, list of axes)
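A usage sketch comparing two hypothetical models on three documents and two topics:

import numpy as np
from atnlp.eval.plot import topic_labelling_barchart

Y_true = np.array([[1, 0], [0, 1], [1, 1]])
Y_pred_a = np.array([[1, 0], [0, 1], [1, 0]])  # model A misses one label
Y_pred_b = np.array([[1, 1], [0, 1], [1, 1]])  # model B adds a false label

fig, axes = topic_labelling_barchart(
    Y_true, [Y_pred_a, Y_pred_b], ['model A', 'model B'])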
atnlp.eval.plot.topic_labelling_barchart_cv(models, model_names, model_inputs, Y, cv=10)

Create a topic labelling barchart with k-fold cross-validation.

Figure layout is the same as in topic_labelling_barchart(). K-fold cross-validation is used to estimate uncertainties on the metrics.

Parameters:
- models – list of topic labelling models
- model_names – list of model names
- model_inputs – list of input data for models
- Y – ground truth topic labels (one-hot format)
- cv – number of folds for cross-validation

Returns: tuple (figure, list of axes)
atnlp.eval.plot.topic_labelling_scatter_plots(Y_true, Y_pred, sample_min=None, thresholds=None)

Create scatter plots comparing precision, recall and number of samples.

Parameters:
- Y_true – ground truth topic labels (one-hot format)
- Y_pred – topic predictions (one-hot format)
- sample_min – minimum number of examples per topic
- thresholds – list of thresholds per category (optional)

Returns: tuple (figure, list of axes)
atnlp.eval.plot.topic_migration_matrix(Y_true, Y_pred)

Create an MxM migration matrix for M topics.

Each column represents a given ground truth topic label. Each row represents the relative frequency with which predicted labels are assigned.

Parameters:
- Y_true – ground truth topic labels (one-hot format)
- Y_pred – topic predictions (one-hot format)

Returns: tuple (figure, axis)
table.py

Functionality for creating performance summary tables.
atnlp.eval.table.multimodel_topic_labelling_summary_tables(Y_true, Y_preds, model_names, sample_min=None, thresholds=None)

Return a dictionary of topic labelling summary tables for multiple model predictions.

The dictionary includes a single table for each of the metrics included in topic_labelling_summary_table(), where the key is the metric name. An overall summary table (with key summary) is also provided, including the following metrics:
- pre_mic, rec_mic, f1_mic: precision, recall and f1 scores using ‘micro’ averaging over topics
- recall_all: recall calculated requiring all labels in a document to be correct (see atnlp.eval.metrics.recall_all_score())
- flpd, mlpd: false/missing labels per document (see atnlp.eval.metrics.flpd_score(), atnlp.eval.metrics.mlpd_score())

In each table, metrics are provided for each of the models.

If sample_min is specified, topics with fewer examples will be omitted.

thresholds is a list of one threshold per category per model which, if specified, will be applied to Y_preds to generate class predictions. In this case Y_preds is assumed to contain matrices of class probability scores rather than predictions.

Parameters:
- Y_true – ground truth topic labels (one-hot format)
- Y_preds – list of topic predictions for each model (one-hot format)
- model_names – name of each model
- sample_min – minimum number of examples per topic
- thresholds – list of thresholds per category (optional)

Returns: dict of summary tables (pandas DataFrames)
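A usage sketch for two hypothetical models; the summary key is documented above, the remaining keys are the per-metric table names:

import numpy as np
from atnlp.eval.table import multimodel_topic_labelling_summary_tables

Y_true = np.array([[1, 0], [0, 1], [1, 1]])
Y_pred_a = np.array([[1, 0], [0, 1], [1, 0]])
Y_pred_b = np.array([[1, 1], [0, 1], [1, 1]])

tables = multimodel_topic_labelling_summary_tables(
    Y_true, [Y_pred_a, Y_pred_b], ['model A', 'model B'])
print(tables['summary'])  # overall micro-averaged metrics per model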
atnlp.eval.table.topic_labelling_summary_table(Y_true, Y_pred, sample_min=None, thresholds=None)

Return a topic labelling summary table for single model predictions.

The table includes the following entries per topic:
- samples: total number of examples
- standard metrics: precision, recall, f1
- fl: total number of false labels (for topic)
- flps: false labels for topic / topic samples
- flpd: false labels for topic / total documents
- ml: total number of missing labels (for topic)
- mlps: missing labels for topic / topic samples
- mlpd: missing labels for topic / total documents

If sample_min is specified, topics with fewer examples will be omitted.

thresholds is a list of one threshold per category which, if specified, will be applied to Y_pred to generate class predictions. In this case Y_pred is assumed to be a matrix of class probability scores rather than predictions.

Parameters:
- Y_true – ground truth topic labels (one-hot format)
- Y_pred – topic predictions (one-hot format)
- sample_min – minimum number of examples per topic
- thresholds – list of thresholds per category (optional)

Returns: summary table (pandas DataFrame)
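A minimal usage sketch for a single model (illustrative one-hot arrays):

import numpy as np
from atnlp.eval.table import topic_labelling_summary_table

Y_true = np.array([[1, 0], [0, 1], [1, 1]])
Y_pred = np.array([[1, 0], [0, 1], [1, 0]])

table = topic_labelling_summary_table(Y_true, Y_pred)
print(table)  # per-topic samples, precision/recall/f1 and fl/ml rates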