atnlp.model

Model building, training, tuning

grid.py

Functionality for creating hyperparameter grid scans

atnlp.model.grid.create(params)[source]

Return parameter grid (cross-product of individual parameter scans)

The scan is defined by the params dictionary, in which each parameter to
be scanned is given as a list of values, i.e. it should be in the format:
{'par1': [val1, val2],
 'par2': [val1, val2, val3],
 'par3': [val1],
 ...
 }
Parameters:
  • params – parameter dictionary (with lists for the parameters to be scanned)
Returns:

model parameters for each scan point (list of dicts)
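
As an illustration of the cross-product described above, here is a minimal sketch in plain Python. It is not the atnlp implementation: the helper name make_grid is hypothetical, and it assumes every value in the dictionary is a list.

```python
from itertools import product


def make_grid(params):
    """Expand a dict of value lists into one dict per scan point.

    Hypothetical helper for illustration; assumes every value is a list.
    """
    names = list(params)
    # product() yields every combination, taking one value per parameter
    combinations = product(*(params[name] for name in names))
    return [dict(zip(names, values)) for values in combinations]


grid = make_grid({'par1': [1, 2], 'par2': ['a', 'b', 'c'], 'par3': [0.5]})
print(len(grid))  # 2 * 3 * 1 = 6 scan points
print(grid[0])    # {'par1': 1, 'par2': 'a', 'par3': 0.5}
```

A parameter that should stay fixed is given as a one-element list, as with par3 here, so it does not multiply the number of scan points.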
atnlp.model.grid.tofile(grid, counter, exec, filename='grid.sh', shutdown=True)[source]

Write parameter grid as sequence of commands in bash script

exec is the command line function

counter is a flag passed to the command line function that receives the command index as a string format argument. It can be used to increment the output model number, e.g. -o model.{}.h5

If shutdown is true, a shutdown command is included at the end of the script (useful for shutting down virtual instances after the scan completes, e.g. on Google Cloud).

Parameters:
  • grid – parameter grid
  • counter – command line increment flag (string)
  • exec – command line exec (string)
  • filename – output file name
  • shutdown – include shutdown command
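
To make the role of counter concrete, the sketch below builds the text of such a script. It is only a guess at the general shape, not the output of tofile itself: the helper render_script, the "--name value" flag layout, the zero-based index, and the exact shutdown command are all assumptions made for illustration.

```python
def render_script(grid, counter, command, shutdown=True):
    """Return the text of a bash script with one command per scan point.

    Hypothetical helper; the flag layout and shutdown line are assumptions.
    """
    lines = ["#!/bin/bash"]
    for index, point in enumerate(grid):
        # Assumption: each parameter is passed as "--name value"
        flags = " ".join(
            "--{} {}".format(name, value) for name, value in point.items()
        )
        # The counter flag receives the command index via str.format
        lines.append("{} {} {}".format(command, flags, counter.format(index)))
    if shutdown:
        # Assumption: the instance is halted with a plain shutdown call
        lines.append("sudo shutdown -h now")
    return "\n".join(lines) + "\n"


script = render_script(
    [{"lr": 0.1}, {"lr": 0.01}],
    counter="-o model.{}.h5",
    command="train.py",
)
print(script)
```

With these inputs, each command ends in a different output file (model.0.h5, model.1.h5), which is the use of counter that the description above suggests.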

io.py

Functionality for reading and writing models

tune.py

Functionality for tuning models

atnlp.model.tune.find_threshold(y_true, y_score, target_contamination=0.01, show=True)[source]

Return binary classification probability threshold that yields contamination closest to target

Parameters:
  • y_true – ground truth labels
  • y_score – predicted class scores
  • target_contamination – target level of contamination (1-precision)
  • show – make plots
Returns:

optimal threshold
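
The selection criterion can be sketched in plain Python as follows. This is an illustrative reimplementation under simple assumptions, not the code behind find_threshold: the function name is hypothetical, every distinct score is treated as a candidate threshold, an example is predicted positive when its score is at or above the threshold, and no plots are made.

```python
def closest_contamination_threshold(y_true, y_score, target=0.01):
    """Pick the threshold whose contamination (1 - precision) is nearest target.

    Hypothetical helper for illustration only.
    """
    best_threshold, best_gap = None, float("inf")
    # Every distinct score is a candidate threshold, scanned in ascending order
    for threshold in sorted(set(y_score)):
        selected = [score >= threshold for score in y_score]
        n_selected = sum(selected)
        false_pos = sum(1 for s, t in zip(selected, y_true) if s and not t)
        # Contamination is the fraction of selected examples that are negative
        contamination = false_pos / n_selected
        gap = abs(contamination - target)
        if gap < best_gap:  # ties keep the lowest threshold
            best_threshold, best_gap = threshold, gap
    return best_threshold


y_true = [0, 0, 1, 0, 1, 1, 1, 1]
y_score = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
print(closest_contamination_threshold(y_true, y_score, target=0.2))  # 0.4
```

At a threshold of 0.4, five examples are selected and one of them is negative, giving a contamination of exactly 0.2.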

atnlp.model.tune.fit_xgb_model(alg, X, y, X_test, y_test, useTrainCV=True, cv_folds=5, early_stopping_rounds=50)[source]

Fit xgboost model

Parameters:
  • alg – XGBClassifier (sklearn api class)
  • X – training data
  • y – training labels
  • X_test – testing data
  • y_test – testing labels
  • useTrainCV – use cross validation
  • cv_folds – number of folds for cross-validation
  • early_stopping_rounds – number of rounds without improvement after which training stops early

Perform 1D model hyper-parameter scan using 5-fold cross-validation

The sklearn grid is returned and a plot of the performance is made.

Parameters:
  • X_train – training data
  • y_train – ground truth labels
  • model – model
  • pname – model hyperparameter name
  • pvals – model hyperparameter values
  • scoring – sklearn performance metric (optional)
Returns:

sklearn GridSearchCV

wordmatch.py

Implements a keyword-based topic-labelling classifier

class atnlp.model.wordmatch.WordMatchClassifier(df_threshold=0.1)[source]

Keyword-based binary topic classifier

The classifier works by assigning the positive class to any example that contains any of the given keywords. It is assumed the input data is in the bag-of-words format.

The set of keywords is fit to the training data using the following algorithm. Initially, a single keyword that maximises $rp^4$ is selected, where $r$ is recall and $p$ is precision. The choice $p^4$ is made to strongly penalise false positives. This process is repeated, adding keywords until the metric decreases.
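
The greedy selection described above can be sketched as follows. This is a simplified illustration rather than the classifier's actual code: documents are represented as Python sets of words instead of a bag-of-words matrix, the df_threshold filter is omitted, the function names are hypothetical, and the loop stops as soon as the metric fails to improve (an assumption about how ties are handled).

```python
def rp4_score(keywords, docs, labels):
    """Return r * p**4 for the rule 'positive if the doc contains any keyword'."""
    predicted = [bool(keywords & doc) for doc in docs]
    true_pos = sum(1 for p, y in zip(predicted, labels) if p and y)
    if true_pos == 0:
        return 0.0
    recall = true_pos / sum(labels)
    precision = true_pos / sum(predicted)
    return recall * precision ** 4


def select_keywords(docs, labels):
    """Greedily add the keyword that most improves r * p**4."""
    vocab = set().union(*docs)
    keywords, best = set(), 0.0
    while vocab - keywords:
        # Sorting makes the choice deterministic when several words tie
        word = max(
            sorted(vocab - keywords),
            key=lambda w: rp4_score(keywords | {w}, docs, labels),
        )
        new = rp4_score(keywords | {word}, docs, labels)
        if new <= best:  # assumption: stop when the metric does not improve
            break
        keywords.add(word)
        best = new
    return keywords


docs = [
    {"goal", "match"},
    {"goal", "team"},
    {"striker", "team"},
    {"vote", "team"},
    {"vote", "party"},
]
labels = [1, 1, 1, 0, 0]
print(select_keywords(docs, labels))
```

In this toy data, "goal" is chosen first (recall 2/3, precision 1), then "striker" lifts recall to 1 without adding false positives. "team" is never chosen because it also matches a negative document, and the fourth power of precision penalises that heavily.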

TODO: allow predefinition of keywords, which would then be built on?

TODO: include max number of words as hyperparameter

Parameters:
  • df_threshold – minimum document frequency for keywords
fit(X, y)[source]

Fit model to data

Parameters:
  • X – data (bag-of-words format)
  • y – binary classification labels
predict(X)[source]
Parameters:X
Returns:
atnlp.model.wordmatch.display_keywords(model, topic_names, vocab)[source]

Print keywords for WordMatchClassifier instances in OneVsRestClassifier

Parameters:
  • model – OneVsRestClassifier containing WordMatchClassifier instances
  • topic_names – topic for each model instance in OneVsRest
  • vocab – id-to-word dictionary for bag-of-words input data
atnlp.model.wordmatch.get_keyword_dataframe(model, topic_names, vocab)[source]

Return pandas DataFrame with keywords for WordMatchClassifier instances in OneVsRestClassifier

Parameters:
  • model – OneVsRestClassifier containing WordMatchClassifier instances
  • topic_names – topic for each model instance in OneVsRest
  • vocab – id-to-word dictionary for bag-of-words input data
Returns:

pandas DataFrame