atnlp.data¶
Data I/O and parsing.
io.py¶
Functionality for reading and writing datasets.
-
atnlp.data.io.read_one_hot_labels(filename)[source]¶
Read topic labels from a file in one-hot form.
Parameters: filename – name of input file
Returns: topic labels (one-hot DataFrame, M x N)
-
atnlp.data.io.read_raw(filename)[source]¶
Read raw text data from a file.
Parameters: filename – name of input file
Returns: list of strings
parse.py¶
Functionality for parsing text inputs.
-
atnlp.data.parse.build_vocab(raw_data, max_size=None)[source]¶
Return a word-to-id dict built from raw text data.
If max_size is specified, the vocab is truncated to the max_size highest-frequency words.
Parameters: - raw_data – list of strings
- max_size – maximum size of vocab
Returns: word-to-id dict
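A minimal sketch of what a build_vocab-style function might do, assuming whitespace tokenization (the actual implementation may tokenize differently): count word frequencies across all documents, keep the max_size most frequent words, and assign each a unique integer id.

```python
from collections import Counter

def build_vocab_sketch(raw_data, max_size=None):
    """Hypothetical stand-in for build_vocab: frequency-ranked word-to-id dict."""
    counts = Counter(word for doc in raw_data for word in doc.split())
    most_common = counts.most_common(max_size)  # all words if max_size is None
    return {word: i for i, (word, _) in enumerate(most_common)}

vocab = build_vocab_sketch(["the cat sat", "the dog sat"], max_size=2)
# "the" and "sat" occur twice each, so only they survive the truncation
```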
-
atnlp.data.parse.filter1(word)[source]¶
Return True if the word passes the filter.
Parameters: word – string
Returns: True or False
-
atnlp.data.parse.process_text(text, tokenize=<function tokenize1>, filter=<function filter1>, stem=None, lower=True)[source]¶
Return a processed list of words from raw text input.
Note: in practice, sklearn's CountVectorizer and keras' text_to_word_sequence are currently used in place of this function.
Parameters: - text – raw text input (string)
- tokenize – tokenizing function
- filter – filter function
- stem – stemming function
- lower – convert input text to lowercase if True
Returns: list of strings
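The pipeline behind process_text can be sketched as: lowercase, tokenize, filter, then optionally stem. The tokenize and filter functions below are hypothetical stand-ins for the library's defaults (tokenize1, filter1), whose exact behavior is not shown in this page.

```python
def tokenize_sketch(text):
    """Stand-in tokenizer: split on whitespace."""
    return text.split()

def filter_sketch(word):
    """Stand-in filter: keep purely alphabetic words (an assumption)."""
    return word.isalpha()

def process_text_sketch(text, tokenize=tokenize_sketch,
                        filter=filter_sketch, stem=None, lower=True):
    if lower:
        text = text.lower()
    words = [w for w in tokenize(text) if filter(w)]
    if stem is not None:
        words = [stem(w) for w in words]
    return words

words = process_text_sketch("The cat sat on 3 mats!")
# "3" and "mats!" fail the alphabetic filter
```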
-
atnlp.data.parse.raw_to_ids(raw_data, word_to_id)[source]¶
Convert raw text data into integer ids.
Parameters: - raw_data – raw text data (list of strings)
- word_to_id – word-to-id dict
Returns: list of lists of integer ids
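A minimal sketch of the raw_to_ids conversion, assuming whitespace tokenization and that out-of-vocabulary words are skipped (the real implementation might instead map them to an unknown-word id):

```python
def raw_to_ids_sketch(raw_data, word_to_id):
    """Hypothetical stand-in: map each word of each document to its id."""
    return [[word_to_id[w] for w in doc.split() if w in word_to_id]
            for doc in raw_data]

ids = raw_to_ids_sketch(["the cat sat", "the dog"],
                        {"the": 0, "cat": 1, "sat": 2})
# ids == [[0, 1, 2], [0]]  ("dog" is out of vocabulary)
```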
reuters.py¶
Functionality to read in the Reuters corpus using the nltk module.
-
class atnlp.data.reuters.ReutersIter(files, tokenize=None)[source]¶
Reuters dataset iterator.
Implemented as a generator rather than reading the full dataset into memory. This is not strictly necessary, since the dataset is small and in most cases the iterator is converted to a list anyway.
Parameters: - files – list of files to iterate over
- tokenize – tokenization function (optional)
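The generator pattern described above can be sketched as follows. This is not the library's actual class; the file handling is illustrative, showing only the lazy one-document-at-a-time iteration with an optional tokenizer.

```python
import os
import tempfile

class LazyFileIter:
    """Hypothetical ReutersIter-style iterator: yields one document per file."""
    def __init__(self, files, tokenize=None):
        self.files = files
        self.tokenize = tokenize

    def __iter__(self):
        for path in self.files:
            with open(path) as f:
                text = f.read()
            yield self.tokenize(text) if self.tokenize else text

# usage with two throwaway files
tmpdir = tempfile.mkdtemp()
paths = []
for i, content in enumerate(["grain prices rose", "trade deficit fell"]):
    p = os.path.join(tmpdir, f"doc{i}.txt")
    with open(p, "w") as f:
        f.write(content)
    paths.append(p)

docs = list(LazyFileIter(paths, tokenize=str.split))
```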
-
atnlp.data.reuters.get_data(cats=None, tokenize=None)[source]¶
Return raw text data from the Reuters corpus as a (train, test) tuple.
If cats is specified, the data is filtered to contain only documents from the specified categories.
If tokenize is specified, the data is tokenized.
Parameters: - cats – categories
- tokenize – tokenization function
Returns: tuple of (train, test) data (each a list of strings)
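In the nltk Reuters corpus, document fileids are prefixed with "training/" or "test/", which is presumably how get_data derives its (train, test) split. A self-contained sketch of that split logic, using hypothetical fileids rather than the real corpus:

```python
def split_fileids(fileids):
    """Split Reuters-style fileids into (train, test) by their prefix."""
    train = [f for f in fileids if f.startswith("training/")]
    test = [f for f in fileids if f.startswith("test/")]
    return train, test

train, test = split_fileids(["training/1", "test/2", "training/3"])
```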
-
atnlp.data.reuters.get_data_test(cats=None, tokenize=None)[source]¶
Return raw text testing data (cf. get_data).
Parameters: - cats – categories
- tokenize – tokenization function
Returns: test data (list of strings)
-
atnlp.data.reuters.get_data_train(cats=None, tokenize=None)[source]¶
Return raw text training data (cf. get_data).
Parameters: - cats – categories
- tokenize – tokenization function
Returns: train data (list of strings)
-
atnlp.data.reuters.get_labels(cats=None)[source]¶
Return topic labels (one-hot format) from the Reuters corpus as a (train, test) tuple.
Parameters: cats – categories
Returns: tuple of (train, test) topic labels (one-hot format)
-
atnlp.data.reuters.get_labels_test(cats=None)[source]¶
Return testing set topic labels (one-hot format) from the Reuters corpus (cf. get_labels).
Parameters: cats – categories
Returns: test topic labels (one-hot format)
-
atnlp.data.reuters.get_labels_train(cats=None)[source]¶
Return training set topic labels (one-hot format) from the Reuters corpus (cf. get_labels).
Parameters: cats – categories
Returns: train topic labels (one-hot format)
-
atnlp.data.reuters.get_topics(min_samples=None)[source]¶
Return the set of topics from the Reuters corpus.
If min_samples is specified, only topics with at least that many examples are included.
Parameters: min_samples – minimum number of examples per topic
Returns: list of topics
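The min_samples filter described above can be sketched as follows. The function below is hypothetical, operating on a plain list of per-document topic lists rather than the actual nltk corpus:

```python
from collections import Counter

def topics_with_min_samples(doc_topics, min_samples=None):
    """Keep only topics that appear in at least min_samples documents."""
    counts = Counter(t for topics in doc_topics for t in topics)
    if min_samples is None:
        return sorted(counts)
    return sorted(t for t, n in counts.items() if n >= min_samples)

doc_topics = [["grain"], ["grain", "wheat"], ["trade"]]
topics = topics_with_min_samples(doc_topics, min_samples=2)
# only "grain" has two or more example documents
```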
-
atnlp.data.reuters.labels(filenames, cats=None)[source]¶
Return topic labels (one-hot format) for the given files.
Parameters: - filenames – selected files from the Reuters dataset
- cats – categories to filter (optional)
Returns: topic labels (one-hot format)