atnlp.model
Model building, training, tuning
grid.py
Functionality for creating hyperparameter grid scans
atnlp.model.grid.create(params)
Return parameter grid (cross-product of individual parameter scans)
The scan is defined by the params dictionary, in which each parameter to be scanned is given as a list of values, i.e. in the format:
{'par1': [val1, val2], 'par2': [val1, val2, val3], 'par3': [val1] ... }
Parameters: params – params dictionary (with lists for parameters to be scanned)
Returns: model params for each scan point (list of dicts)
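The cross-product described above can be sketched with itertools.product. This is a minimal illustration, not the atnlp source: the name create_grid is hypothetical, and treating a non-list value as a fixed one-element scan is an assumption.

```python
from itertools import product


def create_grid(params):
    """Return one dict of model params per scan point (hypothetical helper)."""
    names = list(params)
    # Assumption: a value that is not a list is a fixed setting,
    # so it is treated as a scan with a single value.
    value_lists = [v if isinstance(v, list) else [v] for v in params.values()]
    # product() yields every combination of one value per parameter.
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


grid = create_grid({"depth": [3, 5], "eta": [0.1, 0.2, 0.3], "n": [100]})
```

With two, three and one values respectively, the grid holds 2 x 3 x 1 = 6 scan points.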
atnlp.model.grid.tofile(grid, counter, exec, filename='grid.sh', shutdown=True)
Write parameter grid as sequence of commands in bash script
exec is the command-line executable to run.
counter is a flag passed to that command. It is a format string that receives the command index, so it can be used to increment the output model number, e.g. -o model.{}.h5
If shutdown is true, a shutdown command is included at the end of the script (useful for shutting down virtual instances after the scan completes, e.g. in Google Cloud).
Parameters:
- grid – parameter grid
- counter – command line increment flag (string)
- exec – command line exec (string)
- filename – output file name
- shutdown – include shutdown command
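A rough sketch of how such a script could be generated is shown below. It is not the atnlp implementation: the function name write_grid_script, the `--name value` argument style, the zero-based command index, the `sudo shutdown -h now` line and the train.py command are all assumptions made for illustration.

```python
import os
import tempfile


def write_grid_script(grid, counter, exec_cmd, filename="grid.sh", shutdown=True):
    """Write one command per scan point to a bash script (hypothetical helper)."""
    lines = ["#!/bin/bash"]
    for i, point in enumerate(grid):
        # Assumption: each parameter is passed as "--name value".
        args = " ".join("--{} {}".format(k, v) for k, v in point.items())
        # The counter flag receives the command index via str.format.
        lines.append("{} {} {}".format(exec_cmd, args, counter.format(i)))
    if shutdown:
        # Assumption: the exact shutdown command is not given in the docs.
        lines.append("sudo shutdown -h now")
    with open(filename, "w") as f:
        f.write("\n".join(lines) + "\n")


# Usage: "train.py" is a placeholder command, not part of atnlp.
path = os.path.join(tempfile.mkdtemp(), "grid.sh")
write_grid_script([{"depth": 3}, {"depth": 5}], "-o model.{}.h5", "train.py",
                  filename=path, shutdown=False)
with open(path) as f:
    script = f.read()
```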
io.py
Functionality for reading and writing models
tune.py
Functionality for tuning models
atnlp.model.tune.find_threshold(y_true, y_score, target_contamination=0.01, show=True)
Return binary classification probability threshold that yields contamination closest to target
Parameters:
- y_true – ground truth labels
- y_score – predicted class scores
- target_contamination – target level of contamination (1-precision)
- show – make plots
Returns: optimal threshold
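The threshold search can be illustrated in plain Python as below. This is a sketch under stated assumptions, not the atnlp code: the name find_threshold_sketch is hypothetical, candidate thresholds are taken from the observed scores, examples with score >= threshold are counted as positive, and plotting is omitted.

```python
def find_threshold_sketch(y_true, y_score, target_contamination=0.01):
    """Return the score threshold whose contamination is closest to the target."""
    best_thr, best_gap = None, float("inf")
    for thr in sorted(set(y_score)):
        # Labels of the examples predicted positive at this threshold.
        selected = [y for y, s in zip(y_true, y_score) if s >= thr]
        false_pos = sum(1 for y in selected if not y)
        # Contamination is 1 - precision, i.e. FP / (TP + FP).
        contamination = false_pos / len(selected)
        gap = abs(contamination - target_contamination)
        if gap < best_gap:  # ties keep the lowest threshold
            best_thr, best_gap = thr, gap
    return best_thr


y_true = [0, 0, 1, 0, 1, 1]
y_score = [0.1, 0.2, 0.3, 0.4, 0.8, 0.9]
thr = find_threshold_sketch(y_true, y_score, target_contamination=0.25)
```

In this toy data, a threshold of 0.3 selects four examples of which one is a false positive, giving a contamination of exactly 0.25.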
atnlp.model.tune.fit_xgb_model(alg, X, y, X_test, y_test, useTrainCV=True, cv_folds=5, early_stopping_rounds=50)
Fit xgboost model
Parameters:
- alg – XGBClassifier (sklearn api class)
- X – training data
- y – training labels
- X_test – testing data
- y_test – testing labels
- useTrainCV – use cross validation
- cv_folds – number of folds for cross-validation
- early_stopping_rounds – number of consecutive rounds without improvement before training stops early
atnlp.model.tune.grid_search(X_train, y_train, model, pname, pvals, scoring=None)
Perform 1D model hyper-parameter scan using 5-fold cross-validation
The sklearn grid is returned and a plot of the performance is made.
Parameters:
- X_train – training data
- y_train – ground truth labels
- model – model
- pname – model hyperparameter name
- pvals – model hyperparameter values
- scoring – sklearn performance metric (optional)
Returns: sklearn GridSearchCV
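A 1D scan of this kind can be expressed directly with scikit-learn's GridSearchCV, as in the sketch below. The wrapper name grid_search_sketch, the LogisticRegression model and the synthetic data are assumptions chosen for illustration, and the performance plot is omitted.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV


def grid_search_sketch(X_train, y_train, model, pname, pvals, scoring=None):
    """Scan a single hyperparameter with 5-fold cross-validation."""
    # A one-entry param_grid makes the scan one-dimensional.
    grid = GridSearchCV(model, param_grid={pname: list(pvals)},
                        scoring=scoring, cv=5)
    grid.fit(X_train, y_train)
    return grid


# Synthetic data and model, used only to exercise the sketch.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
pvals = [0.01, 0.1, 1.0]
grid = grid_search_sketch(X, y, LogisticRegression(max_iter=1000), "C", pvals)
scores = grid.cv_results_["mean_test_score"]  # one mean CV score per value
```

The returned object exposes best_params_ and cv_results_, which is what a performance plot would be drawn from.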
wordmatch.py
Implements keyword-based topic-labelling classifier
class atnlp.model.wordmatch.WordMatchClassifier(df_threshold=0.1)
Keyword-based binary topic classifier
The classifier works by assigning the positive class to any example that contains any of the given keywords. It is assumed the input data is in the bag-of-words format.
The set of keywords is fit to the training data using the following algorithm. Initially, a single keyword that maximises $rp^4$ is selected, where $r$ is recall and $p$ is precision. The choice $p^4$ is made to strongly penalise false positives. This process is repeated, adding keywords until the metric decreases.
TODO: allow predefinition of keywords, that would then be built on?
TODO: include max number of words as hyperparameter
Parameters: df_threshold – minimum document frequency for keywords
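The greedy keyword selection described above can be sketched in plain Python. This is an illustration under assumptions, not the atnlp implementation: documents are represented as sets of words rather than a bag-of-words matrix, df_threshold is read as a fraction of all training documents, selection stops when no remaining keyword strictly improves the metric, and the names score and fit_keywords are hypothetical.

```python
def score(selected, docs, labels):
    """Return r * p**4 for the rule 'positive if any selected keyword is present'."""
    pred = [bool(selected & doc) for doc in docs]
    tp = sum(1 for p, y in zip(pred, labels) if p and y)
    fp = sum(1 for p, y in zip(pred, labels) if p and not y)
    fn = sum(1 for p, y in zip(pred, labels) if not p and y)
    if tp == 0:
        return 0.0
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return recall * precision ** 4


def fit_keywords(docs, labels, df_threshold=0.1):
    """Greedily add keywords while the metric keeps improving."""
    n = len(docs)
    vocab = set().union(*docs)
    # Assumption: keep words appearing in at least df_threshold of all documents.
    candidates = {w for w in vocab if sum(w in d for d in docs) / n >= df_threshold}
    selected, best = set(), 0.0
    while candidates - selected:
        # Try each remaining word (sorted for a deterministic tie-break).
        word, new = max(
            ((w, score(selected | {w}, docs, labels))
             for w in sorted(candidates - selected)),
            key=lambda t: t[1],
        )
        if new <= best:  # assumption: stop unless the metric strictly improves
            break
        selected.add(word)
        best = new
    return selected, best


docs = [{"goal", "match"}, {"goal", "team"}, {"vote", "team"},
        {"vote", "election"}, {"match", "score"}]
labels = [1, 1, 0, 0, 1]
keywords, metric = fit_keywords(docs, labels)
```

Here "goal" is chosen first (recall 2/3, precision 1), and adding "match" then covers all positive documents without any false positives.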
atnlp.model.wordmatch.display_keywords(model, topic_names, vocab)
Print keywords for WordMatchClassifier instances in OneVsRestClassifier
Parameters:
- model – OneVsRestClassifier containing WordMatchClassifier instances
- topic_names – topic for each model instance in OneVsRest
- vocab – id-to-word dictionary for bag-of-words input data
atnlp.model.wordmatch.get_keyword_dataframe(model, topic_names, vocab)
Return pandas DataFrame with keywords for WordMatchClassifier instances in OneVsRestClassifier
Parameters:
- model – OneVsRestClassifier containing WordMatchClassifier instances
- topic_names – topic for each model instance in OneVsRest
- vocab – id-to-word dictionary for bag-of-words input data
Returns: pandas DataFrame