load_svmlight_file¶

pai4sk.datasets.load_svmlight_file(f, n_features=None, dtype=<class 'numpy.float64'>, multilabel=False, zero_based='auto', query_id=False, offset=0, length=-1)¶

Load datasets in the svmlight / libsvm format into sparse CSR matrix

This format is a text-based format, with one sample per line. It does not store zero valued features hence is suitable for sparse dataset.

The first element of each line can be used to store a target variable to predict.

This format is used as the default format for both svmlight and the libsvm command line programs.

Parsing a text based source can be expensive. When working on repeatedly on the same dataset, it is recommended to wrap this loader with joblib.Memory.cache to store a memmapped backup of the CSR results of the first call and benefit from the near instantaneous loading of memmapped structures for the subsequent calls.

In case the file contains a pairwise preference constraint (known as “qid” in the svmlight format) these are ignored unless the query_id parameter is set to True. These pairwise preference constraints can be used to constraint the combination of samples when using pairwise loss functions (as is the case in some learning to rank problems) so that only pairs with the same query_id value are considered.

This implementation is written in Cython and is reasonably fast. However, a faster API-compatible loader is also available at:

https://github.com/mblondel/svmlight-loader

Parameters:

f ({str, file-like, int}) – (Path to) a file to load. If a path ends in “.gz” or “.bz2”, it will be uncompressed on the fly. If an integer is passed, it is assumed to be a file descriptor. A file-like or file descriptor will not be closed by this function. A file-like object must be opened in binary mode.
n_features (int or None) – The number of features to use. If None, it will be inferred. This argument is useful to load several files that are subsets of a bigger sliced dataset: each subset might not have examples of every feature, hence the inferred shape might vary from one slice to another. n_features is only required if offset or length are passed a non-default value.
dtype (numpy data type, default np.float64) – Data type of dataset to be loaded. This will be the data type of the output numpy arrays X and y.
multilabel (boolean, optional, default False) – Samples may have several labels each (see https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multilabel.html)
zero_based (boolean or "auto", optional, default "auto") – Whether column indices in f are zero-based (True) or one-based (False). If column indices are one-based, they are transformed to zero-based to match Python/NumPy conventions. If set to “auto”, a heuristic check is applied to determine this from the file contents. Both kinds of files occur “in the wild”, but they are unfortunately not self-identifying. Using “auto” or True should always be safe when no offset or length is passed. If offset or length are passed, the “auto” mode falls back to zero_based=True to avoid having the heuristic check yield inconsistent results on different segments of the file.
query_id (boolean, default False) – If True, will return the query_id array for each file.
offset (integer, optional, default 0) – Ignore the offset first bytes by seeking forward, then discarding the following bytes up until the next new line character.
length (integer, optional, default -1) – If strictly positive, stop reading any new line of data once the position in the file has reached the (offset + length) bytes threshold.

Returns:

X (scipy.sparse matrix of shape (n_samples, n_features))
y (ndarray of shape (n_samples,), or, in the multilabel a list of) – tuples of length n_samples.
query_id (array of shape (n_samples,)) – query_id for each sample. Only returned when query_id is set to True.