atnlp.model

Model building, training, tuning

grid.py

Functionality for creating hyperparameter grid scans

atnlp.model.grid.create(params)

Return parameter grid (cross-product of individual parameter scans)

The scan is defined by the params dictionary, in which each parameter to be scanned is given as a list of values, i.e. in the format:

{'par1': [val1, val2],
 'par2': [val1, val2, val3],
 'par3': [val1]
 ...
 }

Parameters:	params – params dictionary (with lists for parameters to be scanned)
Returns:	model params for each scan point (list of dicts)
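
The cross-product described above can be sketched with itertools.product. This is a minimal illustration of the idea, not the atnlp implementation; the function name create_grid and the example parameter values are hypothetical.

```python
from itertools import product


def create_grid(params):
    """Expand a dict of value lists into a list of dicts, one per scan point."""
    names = list(params)
    # product(*lists) yields every combination of one value per parameter
    combos = product(*(params[name] for name in names))
    return [dict(zip(names, values)) for values in combos]


grid = create_grid({"par1": [1, 2], "par2": ["a", "b", "c"], "par3": [True]})
print(len(grid))  # 2 * 3 * 1 = 6 scan points
print(grid[0])
```

A parameter that should stay fixed is still wrapped in a one-element list, as with par3 here.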

atnlp.model.grid.tofile(grid, counter, exec, filename='grid.sh', shutdown=True)

Write parameter grid as sequence of commands in bash script

exec is the command-line executable invoked for each scan point.

counter is a flag passed to exec that receives the command index as a string format argument. It can be used to number the output models, e.g. -o model.{}.h5

If shutdown is true, a shutdown command is appended to the end of the script (useful for shutting down virtual instances once the scan completes, e.g. on Google Cloud).

Parameters:	grid – parameter grid
	counter – command line increment flag (string)
	exec – command line exec (string)
	filename – output file name
	shutdown – include shutdown command
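
The following sketch shows how such a script might be assembled. It is not the atnlp implementation: the function name write_grid_script, the "--name value" option convention, the "sudo shutdown -h now" line and the train.py command are all assumptions made for illustration.

```python
import os
import tempfile


def write_grid_script(grid, counter, exec_cmd, filename="grid.sh", shutdown=True):
    """Write one command line per scan point to a bash script (illustrative only)."""
    lines = ["#!/bin/bash"]
    for i, point in enumerate(grid):
        # Assumed convention: each parameter becomes a "--name value" option
        opts = " ".join(f"--{name} {value}" for name, value in point.items())
        # The counter flag receives the command index via str.format
        lines.append(f"{exec_cmd} {counter.format(i)} {opts}")
    if shutdown:
        # Assumed shutdown command; the one atnlp writes is not documented here
        lines.append("sudo shutdown -h now")
    with open(filename, "w") as f:
        f.write("\n".join(lines) + "\n")
    return lines


grid = [{"lr": 0.1, "depth": 3}, {"lr": 0.01, "depth": 3}]
path = os.path.join(tempfile.mkdtemp(), "grid.sh")
script = write_grid_script(grid, "-o model.{}.h5", "train.py", filename=path)
print("\n".join(script))
```

Because the index is substituted into the counter flag, each command writes to a distinct output file (model.0.h5, model.1.h5, and so on).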

io.py

Functionality for reading and writing models

tune.py

Functionality for tuning models

atnlp.model.tune.find_threshold(y_true, y_score, target_contamination=0.01, show=True)

Return binary classification probability threshold that yields contamination closest to target

Parameters:	y_true – ground truth labels
	y_score – predicted class scores
	target_contamination – target level of contamination (1 - precision)
	show – make plots
Returns:	optimal threshold
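
The threshold selection can be sketched in plain Python as below. This is an assumption about the approach rather than the atnlp code: the function name closest_contamination_threshold is hypothetical, every distinct score is tried as a candidate threshold, and the plotting controlled by show is omitted.

```python
def closest_contamination_threshold(y_true, y_score, target_contamination=0.01):
    """Return the score threshold whose contamination (1 - precision) is closest to target."""
    best_threshold, best_gap = None, float("inf")
    # Every distinct score is a candidate threshold
    for threshold in sorted(set(y_score)):
        selected = [t for t, s in zip(y_true, y_score) if s >= threshold]
        # Contamination is the fraction of selected examples that are truly negative
        contamination = 1.0 - sum(selected) / len(selected)
        gap = abs(contamination - target_contamination)
        if gap < best_gap:
            best_threshold, best_gap = threshold, gap
    return best_threshold


y_true = [0, 0, 1, 0, 1, 1, 1, 1]
y_score = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
threshold = closest_contamination_threshold(y_true, y_score, target_contamination=0.2)
print(threshold)
```

In this toy data, selecting scores of 0.4 and above keeps five examples of which one is negative, giving a contamination of 0.2.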

atnlp.model.tune.fit_xgb_model(alg, X, y, X_test, y_test, useTrainCV=True, cv_folds=5, early_stopping_rounds=50)

Fit xgboost model

Parameters:	alg – XGBClassifier (sklearn api class)
	X – training data
	y – training labels
	X_test – testing data
	y_test – testing labels
	useTrainCV – use cross validation
	cv_folds – number of folds for cross-validation
	early_stopping_rounds – minimum number of rounds before early stopping

atnlp.model.tune.grid_search(X_train, y_train, model, pname, pvals, scoring=None)

Perform 1D model hyper-parameter scan using 5-fold cross-validation

The sklearn grid is returned and a plot of the performance is made.

Parameters:	X_train – training data
	y_train – ground truth labels
	model – model
	pname – model hyperparameter name
	pvals – model hyperparameter values
	scoring – sklearn performance metric (optional)
Returns:	sklearn GridSearchCV
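
A one-dimensional scan of this kind can be sketched with scikit-learn's GridSearchCV. The wrapper name scan_1d, the synthetic dataset and the LogisticRegression model are assumptions for illustration, and the performance plot is left out.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV


def scan_1d(X_train, y_train, model, pname, pvals, scoring=None):
    """Scan a single hyperparameter with 5-fold cross-validation (plotting omitted)."""
    # A one-entry param_grid restricts the search to a single hyperparameter
    grid = GridSearchCV(model, param_grid={pname: pvals}, cv=5, scoring=scoring)
    grid.fit(X_train, y_train)
    return grid


X, y = make_classification(n_samples=200, n_features=10, random_state=0)
grid = scan_1d(X, y, LogisticRegression(max_iter=1000), "C", [0.01, 0.1, 1.0, 10.0])
print(grid.best_params_)
print(grid.cv_results_["mean_test_score"])
```

The returned object exposes the mean cross-validated score per scanned value in cv_results_, which is what a performance plot would typically draw.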

wordmatch.py

Implements a keyword-based topic labelling classifier

class atnlp.model.wordmatch.WordMatchClassifier(df_threshold=0.1)

Keyword-based binary topic classifier

The classifier works by assigning the positive class to any example that contains any of the given keywords. It is assumed the input data is in the bag-of-words format.

The set of keywords is fit to the training data greedily. First, the single keyword that maximises $rp^4$ is selected, where $r$ is recall and $p$ is precision; the fourth power of $p$ is chosen to strongly penalise false positives. Further keywords are then added one at a time in the same way, until adding another keyword would decrease the metric.

TODO: allow predefinition of keywords, that would then be built on?

TODO: include max number of words as hyperparameter

Parameters:	df_threshold – minimum document frequency for keywords
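
The greedy fit and the any-keyword prediction rule can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the class's actual code: documents are represented as sets of words rather than a bag-of-words matrix, df_threshold is read as the minimum fraction of documents containing a word, ties are broken alphabetically, and the search stops as soon as no candidate improves the metric. The function names are hypothetical.

```python
def score(pred, y):
    """Return r * p**4 for binary predictions, where r is recall and p is precision."""
    tp = sum(1 for p, t in zip(pred, y) if p and t)
    n_pred, n_true = sum(pred), sum(y)
    if tp == 0 or n_pred == 0 or n_true == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_true
    return recall * precision ** 4


def fit_keywords(docs, y, df_threshold=0.1):
    """Greedily add keywords while r * p**4 keeps improving (illustrative sketch)."""
    vocab = set().union(*docs)
    # Assumption: df_threshold is the minimum fraction of documents containing a word
    candidates = {w for w in vocab if sum(w in d for d in docs) / len(docs) >= df_threshold}
    keywords, best = [], 0.0
    while candidates:
        # Score each candidate when added to the current keyword set
        trials = {}
        for w in candidates:
            chosen = set(keywords) | {w}
            trials[w] = score([bool(chosen & d) for d in docs], y)
        # Sorting first makes tie-breaking deterministic (alphabetical)
        word = max(sorted(trials), key=trials.get)
        if trials[word] <= best:
            break  # no candidate improves the metric, so stop
        keywords.append(word)
        best = trials[word]
        candidates.discard(word)
    return keywords


def predict(docs, keywords):
    """Assign the positive class to any document containing at least one keyword."""
    return [int(bool(set(keywords) & d)) for d in docs]


docs = [{"goal", "match"}, {"goal", "team"}, {"vote", "party"}, {"team", "coach"}, {"vote", "tax"}]
y = [1, 1, 0, 1, 0]
keywords = fit_keywords(docs, y)
print(keywords)
print(predict(docs, keywords))
```

Because precision enters to the fourth power, a word such as "vote" that matches only negative documents scores zero and is never selected.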

fit(X, y)

Fit model to data

Parameters:	X – data (bag-of-words format)
	y – binary classification labels

predict(X)

Parameters:	X – data (bag-of-words format)
Returns:	predicted binary class labels

atnlp.model.wordmatch.display_keywords(model, topic_names, vocab)

Print keywords for WordMatchClassifier instances in OneVsRestClassifier

Parameters:	model – OneVsRestClassifier containing WordMatchClassifier instances
	topic_names – topic for each model instance in OneVsRest
	vocab – id-to-word dictionary for bag-of-words input data

atnlp.model.wordmatch.get_keyword_dataframe(model, topic_names, vocab)

Return pandas DataFrame with keywords for WordMatchClassifier instances in OneVsRestClassifier

Parameters:	model – OneVsRestClassifier containing WordMatchClassifier instances
	topic_names – topic for each model instance in OneVsRest
	vocab – id-to-word dictionary for bag-of-words input data
Returns:	pandas DataFrame