loaders.load_20News

pai4sk.simsearch.loaders.load_20News(directory_path, vocabulary_file_path='vocabulary.txt', norm='l1')

Loads the 20News dataset into the arrays X, labels, and ids.

Parameters:
  • directory_path (string) – Source location for the data
  • vocabulary_file_path (string (default='vocabulary.txt')) – Path to the vocabulary file used to tokenize and vectorize the text
  • norm (string (default='l1')) – Normalization applied to the data. Pass 'l2' to normalize using l2 normalization; the default is 'l1'
Returns:
  • X (array-like, sparse_matrix, shape (n_samples, n_features)) – Feature vectors
  • labels (array-like, shape (n_samples,)) – Class labels of the samples in X
  • ids (array-like, shape (n_samples,)) – Identifiers of the samples in X
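
The following is a minimal sketch of what the norm parameter selects and how the loader might be called. It assumes that normalization is applied per sample (per row of X), as in scikit-learn's normalize; the loader's actual implementation is not shown in this reference. The helper names normalize_row and load_example are hypothetical and exist only for illustration.

```python
import math


def normalize_row(row, norm="l1"):
    """Scale one feature vector to unit l1 or l2 norm.

    Illustrative only: assumes per-sample normalization. All-zero rows
    are returned unchanged to avoid division by zero.
    """
    if norm == "l1":
        scale = sum(abs(v) for v in row)
    elif norm == "l2":
        scale = math.sqrt(sum(v * v for v in row))
    else:
        raise ValueError("norm must be 'l1' or 'l2'")
    return list(row) if scale == 0 else [v / scale for v in row]


def load_example(directory_path, vocabulary_file_path="vocabulary.txt"):
    """Hypothetical wrapper that requests l2 normalization from the loader.

    Requires pai4sk to be installed; the import is deferred so the rest
    of this sketch runs without it.
    """
    from pai4sk.simsearch.loaders import load_20News

    X, labels, ids = load_20News(
        directory_path,
        vocabulary_file_path=vocabulary_file_path,
        norm="l2",
    )
    return X, labels, ids


# Toy term counts for a single document.
counts = [3.0, 4.0, 0.0]
print(normalize_row(counts, "l1"))  # entries sum to 1
print(normalize_row(counts, "l2"))  # entries have unit Euclidean length
```

With 'l1' the absolute values of each sample sum to 1, which suits frequency-style comparisons; with 'l2' each sample has unit Euclidean length, which makes dot products behave like cosine similarity.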