atnlp.data

Data I/O and parsing

io.py

Functionality for reading and writing datasets

atnlp.data.io.read_one_hot_labels(filename)[source]

Read topic labels from file in one-hot form

Parameters: filename – name of input file
Returns: topic labels (one-hot DataFrame, M x N)
atnlp.data.io.read_raw(filename)[source]

Read raw text data from file

Parameters: filename – name of input file
Returns: list of strings
atnlp.data.io.write_one_hot_labels(Y, filename)[source]

Write topic labels to file in one-hot form

Parameters:
  • Y – topic labels (one-hot DataFrame, M x N)
  • filename – name of output file
atnlp.data.io.write_raw(X, filename)[source]

Write raw text data to file

Parameters:
  • X – list of strings
  • filename – name of output file
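The read/write pairs above are intended to round-trip a dataset through a file. A minimal sketch of the raw-text pair, assuming a plain one-document-per-line on-disk format (the actual format used by atnlp is not documented here, so `write_raw_sketch`/`read_raw_sketch` are illustrative stand-ins):

```python
import os
import tempfile

def write_raw_sketch(X, filename):
    """Write a list of documents, one per line (assumed format)."""
    with open(filename, "w", encoding="utf-8") as f:
        for doc in X:
            f.write(doc + "\n")

def read_raw_sketch(filename):
    """Read documents back as a list of strings."""
    with open(filename, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

docs = ["grain prices rose", "oil output fell"]
path = os.path.join(tempfile.mkdtemp(), "raw.txt")
write_raw_sketch(docs, path)
assert read_raw_sketch(path) == docs
```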

parse.py

Functionality for parsing text inputs.

atnlp.data.parse.build_vocab(raw_data, max_size=None)[source]

Return a word-to-id dict from raw text data

If max_size is specified, the vocabulary is truncated to the max_size highest-frequency words.

Parameters:
  • raw_data – list of strings
  • max_size – maximum size of vocab
Returns:

word-to-id dict
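A minimal sketch of the vocabulary-building logic, assuming whitespace tokenization and ids assigned in descending frequency order (both assumptions; the real function's tokenization and id ordering are not documented here):

```python
from collections import Counter

def build_vocab_sketch(raw_data, max_size=None):
    """Map each word to an integer id, most frequent words first.

    Truncates to the max_size most frequent words when given.
    """
    counts = Counter(word for doc in raw_data for word in doc.split())
    words = [w for w, _ in counts.most_common(max_size)]
    return {w: i for i, w in enumerate(words)}

vocab = build_vocab_sketch(["the cat sat", "the cat ran"], max_size=2)
# 'the' and 'cat' are the two most frequent words, so only they survive
assert set(vocab) == {"the", "cat"}
```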

atnlp.data.parse.filter1(word)[source]

Return True if word passes filter

Parameters: word – string
Returns: True or False
atnlp.data.parse.process_text(text, tokenize=<function tokenize1>, filter=<function filter1>, stem=None, lower=True)[source]

Return processed list of words from raw text input

Note: in practice, sklearn's CountVectorizer and keras's text_to_word_sequence are currently used instead of this function.

Parameters:
  • text – raw text input (string)
  • tokenize – tokenizing function
  • filter – filter function
  • stem – stemming function
  • lower – convert input text to lowercase if True
Returns:

list of strings
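The processing pipeline can be sketched as lowercase → tokenize → filter → stem. The defaults below (`str.split`, `str.isalpha`) are stand-ins for the real tokenize1/filter1 defaults, and the filter argument is named `filt` to avoid shadowing the builtin:

```python
def process_text_sketch(text, tokenize=str.split, filt=str.isalpha,
                        stem=None, lower=True):
    """Lowercase, tokenize, filter, and optionally stem raw text."""
    if lower:
        text = text.lower()
    words = [w for w in tokenize(text) if filt(w)]
    if stem is not None:
        words = [stem(w) for w in words]
    return words

# '2' fails the alphabetic filter and is dropped
assert process_text_sketch("The cat sat on 2 mats") == ["the", "cat", "sat", "on", "mats"]
```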

atnlp.data.parse.raw_to_ids(raw_data, word_to_id)[source]

Convert raw text data into integer ids

Parameters:
  • raw_data – raw text data (list of strings)
  • word_to_id – word-to-id dict
Returns:

list of list of integer ids
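A sketch of the id conversion, assuming whitespace tokenization and that out-of-vocabulary words are silently dropped (the OOV behavior is an assumption, not documented in the source):

```python
def raw_to_ids_sketch(raw_data, word_to_id):
    """Replace each known word with its integer id per document."""
    return [[word_to_id[w] for w in doc.split() if w in word_to_id]
            for doc in raw_data]

word_to_id = {"the": 0, "cat": 1, "sat": 2}
# 'dog' is out of vocabulary, so the second document loses a token
assert raw_to_ids_sketch(["the cat sat", "the dog sat"], word_to_id) == [[0, 1, 2], [0, 2]]
```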

atnlp.data.parse.tokenize1(text)[source]

Return tokenized list of strings from raw text input

Parameters: text – raw text (string)
Returns: list of tokens (strings)
atnlp.data.parse.tokenize_keras(raw_data)[source]

Return tokenized list of strings from raw text input using keras functionality

Parameters: raw_data – raw text (string)
Returns: list of tokens (strings)

reuters.py

Functionality for reading the Reuters corpus using the nltk module

class atnlp.data.reuters.ReutersIter(files, tokenize=None)[source]

Reuters dataset iterator

Implements a generator rather than reading the full dataset into memory. Note: this is not strictly necessary, since the dataset is small and a list is usually built from the iterator anyway.

Parameters:
  • files – list of files to iterate over
  • tokenize – tokenization function (optional)
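The lazy-iteration pattern can be sketched as follows; here the files are plain text files on disk, whereas the real class presumably reads nltk corpus fileids:

```python
import os
import tempfile

class ReutersIterSketch:
    """Yield one (optionally tokenized) document per file, lazily."""

    def __init__(self, files, tokenize=None):
        self.files = files
        self.tokenize = tokenize

    def __iter__(self):
        for name in self.files:
            with open(name, encoding="utf-8") as f:
                text = f.read()
            yield self.tokenize(text) if self.tokenize else text

# Build two small documents on disk to iterate over
tmp = tempfile.mkdtemp()
paths = []
for i, text in enumerate(["grain up", "oil down"]):
    p = os.path.join(tmp, f"doc{i}.txt")
    with open(p, "w", encoding="utf-8") as f:
        f.write(text)
    paths.append(p)

assert list(ReutersIterSketch(paths, tokenize=str.split)) == [["grain", "up"], ["oil", "down"]]
```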
atnlp.data.reuters.get_data(cats=None, tokenize=None)[source]

Return raw text data from Reuters corpus in (train, test) tuple

If cats is specified, data is filtered to only contain documents from the specified categories.

If tokenize is specified, data is tokenized.

Parameters:
  • cats – categories
  • tokenize – tokenization function
Returns:

tuple of (train, test) data (each is list of strings)
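The (train, test) split presumably follows the standard nltk Reuters convention, where fileids carry a "training/" or "test/" prefix. A sketch of that split logic (the helper name is hypothetical):

```python
def split_train_test(fileids):
    """Split Reuters fileids by their 'training/' / 'test/' prefix."""
    train = [f for f in fileids if f.startswith("training/")]
    test = [f for f in fileids if f.startswith("test/")]
    return train, test

fileids = ["training/1", "test/2", "training/3"]
assert split_train_test(fileids) == (["training/1", "training/3"], ["test/2"])
```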

atnlp.data.reuters.get_data_test(cats=None, tokenize=None)[source]

Return raw text testing data (cf get_data)

Parameters:
  • cats – categories
  • tokenize – tokenization function
Returns:

test data (list of strings)

atnlp.data.reuters.get_data_train(cats=None, tokenize=None)[source]

Return raw text training data (cf get_data)

Parameters:
  • cats – categories
  • tokenize – tokenization function
Returns:

train data (list of strings)

atnlp.data.reuters.get_labels(cats=None)[source]

Return topic labels (one-hot format) from Reuters corpus in (train, test) tuple

Parameters: cats – categories
Returns: tuple of (train, test) topic labels (one-hot format)
atnlp.data.reuters.get_labels_test(cats=None)[source]

Return testing set topic labels (one-hot format) from Reuters corpus (cf get_labels)

Parameters: cats – categories
Returns: test topic labels (one-hot format)
atnlp.data.reuters.get_labels_train(cats=None)[source]

Return training set topic labels (one-hot format) from Reuters corpus (cf get_labels)

Parameters: cats – categories
Returns: train topic labels (one-hot format)
atnlp.data.reuters.get_topics(min_samples=None)[source]

Return set of topics from Reuters corpus

If min_samples is specified, only topics with at least that many examples are included.

Parameters: min_samples – minimum number of examples per topic
Returns: list of topics
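The min_samples filter can be sketched by counting how many documents carry each topic and keeping only the sufficiently frequent ones (the sorted output order is an assumption):

```python
from collections import Counter

def topics_with_min_samples(doc_topics, min_samples=None):
    """Keep topics that label at least min_samples documents."""
    counts = Counter(t for topics in doc_topics for t in topics)
    if min_samples is None:
        return sorted(counts)
    return sorted(t for t, n in counts.items() if n >= min_samples)

docs = [["grain"], ["grain", "corn"], ["oil"]]
# only 'grain' appears in at least 2 documents
assert topics_with_min_samples(docs, min_samples=2) == ["grain"]
```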
atnlp.data.reuters.labels(filenames, cats=None)[source]

Return topic labels (one-hot format) for given files

Parameters:
  • filenames – selected files from Reuters dataset
  • cats – categories to filter (optional)
Returns:

topic labels (one-hot format)
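The one-hot label construction can be sketched as an M x N DataFrame with one row per file and one column per topic. Here `file_topics` is a hypothetical dict standing in for nltk's per-file category lookup:

```python
import pandas as pd

def one_hot_labels_sketch(filenames, file_topics, cats=None):
    """Build an M x N one-hot DataFrame: rows are files, columns topics."""
    topics = cats if cats is not None else sorted(
        {t for f in filenames for t in file_topics[f]})
    data = [[int(t in file_topics[f]) for t in topics] for f in filenames]
    return pd.DataFrame(data, index=filenames, columns=topics)

file_topics = {"test/1": ["grain"], "test/2": ["grain", "oil"]}
Y = one_hot_labels_sketch(["test/1", "test/2"], file_topics)
assert list(Y.columns) == ["grain", "oil"]
assert Y.loc["test/2", "oil"] == 1
assert Y.loc["test/1", "oil"] == 0
```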

atnlp.data.reuters.test_filenames(cats=None)[source]

Return filenames of testing examples

If cats is specified, filenames are filtered to only contain documents from the specified categories.

Parameters: cats – categories
Returns: list of filenames
atnlp.data.reuters.train_filenames(cats=None)[source]

Return filenames of training examples

If cats is specified, filenames are filtered to only contain documents from the specified categories.

Parameters: cats – categories
Returns: list of filenames