loaders.load_20News

pai4sk.simsearch.loaders.load_20News(directory_path, vocabulary_file_path='vocabulary.txt', norm='l1', max_histogram_size=500, stop_word_threshold=900)

Loads the 20News dataset into the arrays X, labels, and ids.

Parameters

directory_path (string) – Source location for the data
vocabulary_file_path (string) – Vocabulary that is used in tokenization and vectorization of text
norm (string (default='l1')) – Norm used to normalize the data. Pass 'l2' to use l2 normalization instead of the default 'l1'.
max_histogram_size (integer (default=500)) – Each document histogram stores only the top max_histogram_size most-frequent words of the respective document.
stop_word_threshold (integer (default=900)) – The first stop_word_threshold words of the vocabulary are treated as stop words, and omitted in the histograms.
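
The interaction of norm, max_histogram_size, and stop_word_threshold may be easier to see in code. The following is a minimal sketch only, not the pai4sk implementation: the helper build_histogram, its dict-based output, and the sample vocabulary are hypothetical, and it assumes the vocabulary is ordered so that its first stop_word_threshold entries are the stop words.

```python
import math
from collections import Counter


def build_histogram(tokens, vocabulary, norm="l1",
                    max_histogram_size=500, stop_word_threshold=900):
    """Hypothetical helper: build one document histogram.

    Returns a dict mapping vocabulary index -> normalized weight.
    This illustrates the parameter semantics described above; it is
    not the code that load_20News actually runs.
    """
    # Assumption: position in the vocabulary list is the word's index.
    index_of = {word: i for i, word in enumerate(vocabulary)}

    # Count in-vocabulary tokens, skipping the first
    # stop_word_threshold vocabulary entries (treated as stop words).
    counts = Counter(
        index_of[t] for t in tokens
        if t in index_of and index_of[t] >= stop_word_threshold
    )

    # Keep only the most frequent words of this document.
    top = counts.most_common(max_histogram_size)

    # Normalize the retained counts with the requested norm.
    if norm == "l1":
        scale = sum(c for _, c in top)
    elif norm == "l2":
        scale = math.sqrt(sum(c * c for _, c in top))
    else:
        raise ValueError("norm must be 'l1' or 'l2'")

    if scale == 0:
        return {}
    return {i: c / scale for i, c in top}


# Toy data: the first two vocabulary entries play the role of stop words.
vocabulary = ["the", "a", "gpu", "search", "index"]
tokens = ["the", "gpu", "gpu", "search", "a", "index", "gpu", "search", "unknown"]

hist = build_histogram(tokens, vocabulary, norm="l1",
                       max_histogram_size=2, stop_word_threshold=2)
```

In this toy run, "the" and "a" are dropped as stop words, "unknown" is dropped because it is not in the vocabulary, and "index" is dropped because only the two most frequent remaining words are kept.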

Returns

X (array-like, sparse_matrix, shape (n_samples, n_features)) – Feature vectors
labels (array-like, shape (n_samples,)) – Class labels of the samples in X
ids (array-like, shape (n_samples,)) – Ids of the samples in X