loaders.load_20News

pai4sk.simsearch.loaders.load_20News(directory_path, vocabulary_file_path='vocabulary.txt', norm='l1', max_histogram_size=500, stop_word_threshold=900)

Loads the 20News dataset into the arrays X, labels, and ids.

Parameters
  • directory_path (string) – Source location for the data

  • vocabulary_file_path (string (default='vocabulary.txt')) – Path to the vocabulary file used to tokenize and vectorize the text

  • norm (string (default='l1')) – Norm used to normalize the data. Set to 'l2' for l2 normalization; the default 'l1' applies l1 normalization.

  • max_histogram_size (integer (default=500)) – Each document histogram stores only the top max_histogram_size most-frequent words of the respective document.

  • stop_word_threshold (integer (default=900)) – The first stop_word_threshold words of the vocabulary are treated as stop words and omitted from the histograms.
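
The interaction of norm, max_histogram_size, and stop_word_threshold can be illustrated with a small sketch. This is not the library's implementation: build_histogram is a hypothetical helper, and it assumes that documents are lists of 0-based vocabulary indices, that the "first" words of the vocabulary are the lowest indices, and that normalization is applied after truncation.

```python
import math
from collections import Counter


def build_histogram(word_ids, norm="l1", max_histogram_size=500,
                    stop_word_threshold=900):
    """Hypothetical helper: map one document to {word_id: weight}.

    word_ids is a list of 0-based vocabulary indices (an assumption
    made for this sketch, not something the loader documents).
    """
    # Treat the first stop_word_threshold vocabulary entries as stop words.
    counts = Counter(w for w in word_ids if w >= stop_word_threshold)

    # Keep only the max_histogram_size most frequent remaining words.
    top = counts.most_common(max_histogram_size)
    if not top:
        return {}

    # Normalize the surviving counts with the requested norm.
    if norm == "l1":
        scale = sum(c for _, c in top)
    elif norm == "l2":
        scale = math.sqrt(sum(c * c for _, c in top))
    else:
        raise ValueError("norm must be 'l1' or 'l2'")

    return {w: c / scale for w, c in top}


# Tiny example: indices 0 and 1 are stop words, and only the two most
# frequent of the remaining words are kept.
doc = [0, 1, 1, 5, 5, 5, 7, 7, 9]
hist_l1 = build_histogram(doc, norm="l1", max_histogram_size=2,
                          stop_word_threshold=2)
hist_l2 = build_histogram([2, 2, 2, 3, 3, 3, 3], norm="l2",
                          max_histogram_size=2, stop_word_threshold=0)
```

With l1 normalization the kept weights sum to 1; with l2 normalization their squares sum to 1.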

Returns

  • X (array-like, sparse_matrix, shape (n_samples, n_features)) – Feature vectors

  • labels (array-like, shape (n_samples,)) – Class labels of the samples in X

  • ids (array-like, shape (n_samples,)) – Ids of the samples in X
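
A minimal usage sketch follows. It assumes that the three return values come back as a tuple in the order listed above and that X exposes a shape attribute; the directory path is a placeholder, and summarize and load_and_summarize are hypothetical helpers introduced only for this example.

```python
def summarize(X, labels, ids):
    """Hypothetical helper: verify that the outputs are aligned.

    Returns (n_samples, n_features) if labels and ids have one entry
    per row of X, and raises ValueError otherwise.
    """
    n_samples, n_features = X.shape
    if len(labels) != n_samples or len(ids) != n_samples:
        raise ValueError("X, labels and ids must share n_samples")
    return n_samples, n_features


def load_and_summarize(directory_path):
    """Load the dataset and report its dimensions."""
    # Imported here so the sketch can be read without pai4sk installed.
    from pai4sk.simsearch.loaders import load_20News

    # Assumption: the loader returns (X, labels, ids) as a tuple.
    X, labels, ids = load_20News(
        directory_path,
        vocabulary_file_path="vocabulary.txt",
        norm="l2",
        max_histogram_size=500,
        stop_word_threshold=900,
    )
    return summarize(X, labels, ids)


# Example call (the path is a placeholder, not a real location):
# n_samples, n_features = load_and_summarize("path/to/20news")
```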