core.dict

Contains code for parsing and building a dictionary from text.

parlai.core.dict.escape(s)

Replace potential special characters with their escaped version. For example, newline => \n and tab => \t

parlai.core.dict.unescape(s)

Revert escaped characters back to their special version. For example, \n => newline and \t => tab
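A minimal sketch of this escape/unescape pair (a simplified stand-in, not the actual ParlAI implementation, which may handle additional characters such as carriage returns):

```python
def escape(s):
    # newline -> the two characters '\' 'n'; tab -> '\' 't'
    return s.replace('\n', '\\n').replace('\t', '\\t')

def unescape(s):
    # reverse of escape: restore the real control characters
    return s.replace('\\n', '\n').replace('\\t', '\t')
```

With these definitions, `unescape(escape(s))` returns the original string for any `s` containing only these special characters.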

parlai.core.dict.find_ngrams(token_dict, text, n)

Breaks text into ngrams that appear in token_dict.
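One way to implement this behavior is a greedy longest-match scan over a token list; the sketch below illustrates the idea but is not the ParlAI implementation (in particular, the exact matching strategy and input format are assumptions):

```python
def find_ngrams(token_dict, tokens, n):
    """Greedily merge consecutive tokens into the longest n-gram
    (up to length n) that appears in token_dict."""
    out = []
    i = 0
    while i < len(tokens):
        for length in range(min(n, len(tokens) - i), 0, -1):
            gram = ' '.join(tokens[i:i + length])
            # single tokens always pass through; longer grams must be known
            if length == 1 or gram in token_dict:
                out.append(gram)
                i += length
                break
    return out
```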

class parlai.core.dict.DictionaryAgent(opt, shared=None)

Builds and/or loads a dictionary.

The dictionary provides access to the frequency of each token, functions to translate sentences from tokens to their vectors (list of ints, each int is the index of a token in the dictionary) and back from vectors to tokenized text.
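The token-to-index mapping can be pictured with the toy stand-in below (`'__unk__'` is a placeholder unknown-token name for illustration, not necessarily ParlAI's; this is not the DictionaryAgent itself):

```python
# Two mirrored mappings: token -> index and index -> token.
tok2ind = {'__unk__': 0, 'hello': 1, 'world': 2}
ind2tok = {index: token for token, index in tok2ind.items()}

def txt2vec(text):
    # unknown words map to the unknown token's index
    return [tok2ind.get(tok, tok2ind['__unk__']) for tok in text.split()]

def vec2txt(vec, delimiter=' '):
    # map each index back to its token and join
    return delimiter.join(ind2tok[i] for i in vec)
```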

__contains__(key)

If key is an int, returns whether the key is in the indices. If key is a str, returns whether the token is in the dictionary of tokens.

__getitem__(key)

If key is an int, returns the corresponding token; if that index does not exist, returns the unknown token. If key is a str, returns the token’s index; if the token is not in the dictionary, returns the index of the unknown token, or None if there is no unknown token.

__setitem__(key, value)

If the key is not in the dictionary, add it to the dictionary and set its frequency to value.
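The three special methods above can be modeled with a toy class like the following (a sketch of the described lookup semantics under an assumed `'__unk__'` unknown token, not the real class):

```python
class MiniDict:
    """Toy model of the lookup semantics described above."""

    def __init__(self, unk='__unk__'):
        self.unk = unk
        self.freq = {unk: 0}
        self.tok2ind = {unk: 0}
        self.ind2tok = {0: unk}

    def __contains__(self, key):
        if isinstance(key, int):
            return key in self.ind2tok
        return key in self.tok2ind

    def __getitem__(self, key):
        if isinstance(key, int):
            # unknown index falls back to the unknown token
            return self.ind2tok.get(key, self.unk)
        # unknown string falls back to the unknown token's index
        return self.tok2ind.get(key, self.tok2ind.get(self.unk))

    def __setitem__(self, key, value):
        # record the frequency; add the token if it is new
        self.freq[key] = value
        if key not in self.tok2ind:
            idx = len(self.tok2ind)
            self.tok2ind[key] = idx
            self.ind2tok[idx] = key
```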

tokenize(text, building=False)

Returns a sequence of tokens extracted from the given text.

add_to_dict(tokens)

Builds dictionary from the list of provided tokens.
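The tokenize/add_to_dict flow amounts to counting tokens and assigning each new token the next free index. A self-contained sketch (whitespace tokenization stands in for ParlAI's configurable tokenizers):

```python
from collections import defaultdict

freq = defaultdict(int)
tok2ind, ind2tok = {}, {}

def tokenize(text):
    # simple lowercase + whitespace split as a stand-in tokenizer
    return text.lower().split()

def add_to_dict(tokens):
    # count every token; assign a new index the first time a token is seen
    for tok in tokens:
        freq[tok] += 1
        if tok not in tok2ind:
            idx = len(tok2ind)
            tok2ind[tok] = idx
            ind2tok[idx] = tok
```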

load(filename)

Load a pre-existing dictionary in ‘token[<TAB>count]’ format. Counts are initialized from the file, or set to 0 if they aren’t included.
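Parsing that format reduces to splitting each line on the first tab and defaulting missing counts to 0. A sketch of the idea (not the ParlAI implementation):

```python
def load_dict(lines):
    # each line is 'token' or 'token<TAB>count'; missing counts default to 0
    freq = {}
    for line in lines:
        line = line.rstrip('\n')
        if not line:
            continue
        token, _, count = line.partition('\t')
        freq[token] = int(count) if count else 0
    return freq
```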

save(filename=None, append=False, sort=True)

Save dictionary to file. Format is ‘token<TAB>count’ for every token in the dictionary, sorted by count with the most frequent words first.

If append (default False) is set to True, appends instead of overwriting.

If sort (default True), then first sort the dictionary before saving.
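The save format described above can be sketched as follows (writing to any file-like object; the alphabetical tie-break is an assumption for determinism):

```python
import io

def save_dict(freq, f, sort=True):
    # one 'token<TAB>count' line per token, most frequent first
    items = freq.items()
    if sort:
        items = sorted(items, key=lambda kv: (-kv[1], kv[0]))
    for token, count in items:
        f.write('{}\t{}\n'.format(token, count))

buf = io.StringIO()
save_dict({'a': 3, 'b': 5}, buf)
```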

sort()

Sorts the dictionary, so that the elements with the lowest index have the highest counts. This reindexes the dictionary according to the sorted frequencies, breaking ties alphabetically by token.
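The reindexing step can be sketched as: sort tokens by descending count (ties broken alphabetically), then assign indices in that order:

```python
def sort_dict(freq):
    # most frequent first; ties broken alphabetically by token
    items = sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return {token: index for index, (token, _) in enumerate(items)}
```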

parse(txt_or_vec, vec_type=<class 'list'>)

Convenience function for parsing either text or vectors of indices.

vec_type is the type of the returned vector if the input is a string.

txt2vec(text, vec_type=<class 'list'>)

Converts a string to a vector (list of ints).

First runs a sentence tokenizer, then a word tokenizer.

vec_type is the type of the returned vector if the input is a string.

vec2txt(vector, delimiter=' ')

Converts a vector (iterable of ints) into a string, with each token separated by the delimiter (default ' ').

act()

Add any words passed in the ‘text’ field of the observation to this dictionary.
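Conceptually, the agent inspects the observation for a ‘text’ field and feeds its words into the dictionary. A hedged sketch (the `add_words` callback and the return value are illustrative assumptions, not ParlAI's exact API):

```python
def act(observation, add_words):
    # pull the 'text' field (if present) and hand its words to the
    # dictionary; add_words stands in for the agent's own
    # tokenize + add_to_dict pipeline
    text = observation.get('text')
    if text:
        add_words(text.split())
    return {'id': 'Dictionary'}
```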

shutdown()

Save on shutdown if save_path is set.