pai4sk API

class pai4sk.linear_model.Ridge(alpha=1.0, fit_intercept=True, normalize=False, copy_X=True, max_iter=None, tol=0.001, solver='auto', random_state=None, dual=False, verbose=0, use_gpu=True, device_ids=[], return_training_history=None, privacy=False, eta=0.3, batch_size=100, privacy_epsilon=10, grad_clip=1, num_threads=1)

Linear least squares with l2 regularization.

Minimizes the objective function:

||y - Xw||^2_2 + alpha * ||w||^2_2

This model solves a regression model where the loss function is the linear least squares function and regularization is given by the l2-norm. Also known as Ridge Regression or Tikhonov regularization. This estimator has built-in support for multi-variate regression (i.e., when y is a 2d-array of shape [n_samples, n_targets]).
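
As a rough illustration of the objective above, the following NumPy sketch computes its minimizer in closed form for the no-intercept case, using w = (X^T X + alpha * I)^-1 X^T y. It is not a description of how pai4sk solves the problem internally; the helper names ridge_objective and ridge_closed_form are made up for this example.

```python
import numpy as np

def ridge_objective(X, y, w, alpha):
    """Evaluate ||y - Xw||^2_2 + alpha * ||w||^2_2."""
    residual = y - X @ w
    return residual @ residual + alpha * (w @ w)

def ridge_closed_form(X, y, alpha):
    """Minimize the objective (no intercept) via the regularized normal equations."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

rng = np.random.RandomState(0)
X = rng.randn(10, 5)
y = rng.randn(10)
alpha = 1.0

w = ridge_closed_form(X, y, alpha)

# At the minimizer the gradient 2 * X^T (Xw - y) + 2 * alpha * w vanishes.
gradient = 2 * X.T @ (X @ w - y) + 2 * alpha * w
print(np.allclose(gradient, 0.0))
```

Perturbing w in any direction should only increase the value returned by ridge_objective, which is a quick way to sanity-check a solver's output.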

Read more in the User Guide.

For the SnapML solver, this estimator supports both local and distributed (MPI) execution.

Parameters:

alpha ({float, array-like}, shape (n_targets)) – Regularization strength; must be a positive float. Regularization improves the conditioning of the problem and reduces the variance of the estimates. Larger values specify stronger regularization. Alpha corresponds to C^-1 in other linear models such as LogisticRegression or LinearSVC. If an array is passed, penalties are assumed to be specific to the targets. Hence they must correspond in number.
fit_intercept (boolean) – Whether to calculate the intercept for this model. If set to false, no intercept will be used in calculations (e.g. data is expected to be already centered).
normalize (boolean, optional, default False) – This parameter is ignored when fit_intercept is set to False. If True, the regressors X will be normalized before regression by subtracting the mean and dividing by the l2-norm. If you wish to standardize, please use pai4sk.preprocessing.StandardScaler before calling fit on an estimator with normalize=False.
copy_X (boolean, optional, default True) – If True, X will be copied; else, it may be overwritten.
max_iter (int, optional) – Maximum number of iterations for conjugate gradient solver. For ‘sparse_cg’ and ‘lsqr’ solvers, the default value is determined by scipy.sparse.linalg. For ‘sag’ solver, the default value is 1000.
tol (float) – Precision of the solution.
regularizer (float, default : 1.0) – Regularization strength. It must be a positive float. Larger regularization values imply stronger regularization.
use_gpu (bool, default : True) – Flag for indicating the hardware platform used for training. If True, the training is performed using the GPU. If False, the training is performed using the CPU. Unless set explicitly, the value of this parameter is subject to change based on the training data. Applicable only for the snapml solver.
device_ids (array-like of int, default : []) – If use_gpu is True, it indicates the IDs of the GPUs used for training. For single GPU training, set device_ids to the GPU ID to be used for training, e.g., [0]. For multi-GPU training, set device_ids to a list of GPU IDs to be used for training, e.g., [0, 1]. Applicable only for snapml solver
num_threads (int, default : 1) – The number of threads used for running the training. The value of this parameter should be a multiple of 32 if the training is performed on GPU (use_gpu=True) (default value for GPU is 256). Applicable only for snapml solver
return_training_history (str or None, default : None) – How much information about the training should be collected and returned by the fit function. By default no information is returned (None), but this parameter can be set to “summary”, to obtain summary statistics at the end of training, or “full” to obtain a complete set of statistics for the entire training procedure. Note, enabling either option will result in slower training. Applicable only for snapml solver
solver ({'auto', 'svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga', 'snapml'}) –
Solver to use in the computational routines:
- ’auto’ chooses the solver automatically based on the type of data.
- ’svd’ uses a Singular Value Decomposition of X to compute the Ridge coefficients. More stable for singular matrices than ‘cholesky’.
- ’cholesky’ uses the standard scipy.linalg.solve function to obtain a closed-form solution.
- ’sparse_cg’ uses the conjugate gradient solver as found in scipy.sparse.linalg.cg. As an iterative algorithm, this solver is more appropriate than ‘cholesky’ for large-scale data (possibility to set tol and max_iter).
- ’lsqr’ uses the dedicated regularized least-squares routine scipy.sparse.linalg.lsqr. It is the fastest and uses an iterative procedure.
- ’sag’ uses a Stochastic Average Gradient descent, and ‘saga’ uses its improved, unbiased version named SAGA. Both methods also use an iterative procedure, and are often faster than other solvers when both n_samples and n_features are large. Note that ‘sag’ and ‘saga’ fast convergence is only guaranteed on features with approximately the same scale. You can preprocess the data with a scaler from pai4sk.preprocessing.
The last five solvers all support both dense and sparse data. However, only ‘sag’ and ‘saga’ support sparse input when fit_intercept is True.

New in version 0.17: Stochastic Average Gradient descent solver.

New in version 0.19: SAGA solver.
random_state (int, RandomState instance or None, optional, default None) –
The seed of the pseudo random number generator to use when shuffling the data. If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random. Used when solver == ‘sag’.

New in version 0.17: random_state to support Stochastic Average Gradient.
privacy (bool, default : False) – Train the model using a differentially private algorithm.
eta (float, default : 0.3) – Learning rate for the differentially private training algorithm.
batch_size (int, default : 100) – Mini-batch size for the differentially private training algorithm.
privacy_epsilon (float, default : 10.0) – Target privacy guarantee. The learned model will be (privacy_epsilon, 0.01)-private.
grad_clip (float, default: 1.0) – Gradient clipping parameter for the differentially private training algorithm
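
The normalize entry above distinguishes dividing by the l2-norm from standardizing. A minimal NumPy sketch of that difference, on a made-up 3 x 2 matrix, could look like this (it mimics the two transformations by hand and does not call pai4sk):

```python
import numpy as np

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 60.0]])

X_centered = X - X.mean(axis=0)

# normalize=True as described: divide each centered column by its l2-norm.
X_normalized = X_centered / np.linalg.norm(X_centered, axis=0)

# Standardization: divide each centered column by its standard deviation.
X_standardized = X_centered / X.std(axis=0)

print(np.linalg.norm(X_normalized, axis=0))   # unit l2-norm per column
print(X_standardized.std(axis=0))             # unit standard deviation per column
```

With the population standard deviation used here, the two results differ only by a factor of sqrt(n_samples) per column.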

Variables:

coef_ (array, shape (n_features,) or (n_targets, n_features)) – Weight vector(s).
intercept_ (float | array, shape = (n_targets,)) – Independent term in decision function. Set to 0.0 if fit_intercept = False.
n_iter_ (array or None, shape (n_targets,)) – Actual number of iterations for each target. Available only for sag and lsqr solvers. Other solvers will return None.
training_history (dict) –
It returns a dictionary with the following keys : ‘epochs’, ‘t_elap_sec’, ‘train_obj’. If ‘return_training_history’ is set to “summary”, ‘epochs’ contains the total number of epochs performed, ‘t_elap_sec’ contains the total time for completing all of those epochs. If ‘return_training_history’ is set to “full”, ‘epochs’ indicates the number of epochs that have elapsed so far, and ‘t_elap_sec’ contains the time to do those epochs. ‘train_obj’ is the training loss. Applicable only for snapml solver.

New in version 0.17.

See also

RidgeClassifier: Ridge classifier
RidgeCV: Ridge regression with built-in cross validation
pai4sk.kernel_ridge.KernelRidge: Kernel ridge regression combines ridge regression with the kernel trick

Examples

>>> from pai4sk.linear_model import Ridge
>>> import numpy as np
>>> n_samples, n_features = 10, 5
>>> np.random.seed(0)
>>> y = np.random.randn(n_samples)
>>> X = np.random.randn(n_samples, n_features)
>>> clf = Ridge(alpha=1.0)
>>> clf.fit(X, y) # doctest: +NORMALIZE_WHITESPACE
Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
      normalize=False, random_state=None, solver='auto', tol=0.001)

fit(X, y, sample_weight=None)

Fit Ridge regression model

Parameters:

X ({array-like, sparse matrix}, shape = [n_samples, n_features]) – Training data. For the SnapML solver, input of types SnapML data partition and DeviceNDArray is also supported.
y (array-like, shape = [n_samples] or [n_samples, n_targets]) – Target values.
sample_weight (float or numpy array of shape [n_samples]) – Individual weights for each sample.

Returns:	self
Return type:	returns an instance of self.

predict(X, num_threads=0)

Class predictions. Returns the class estimates.

Parameters:

X (sparse matrix (csr_matrix) or dense matrix (ndarray)) – Dataset used for predicting class estimates. For the SnapML solver, input of type SnapML data partition is also supported.
num_threads (int, default : 0) – Number of threads used to run inference. By default, inference runs with the maximum number of available threads.

Returns:	proba – The predicted class of the sample.
Return type:	array-like, shape = (n_samples,)

class pai4sk.linear_model.Lasso(alpha=1.0, fit_intercept=True, normalize=False, precompute=False, copy_X=True, max_iter=1000, tol=0.0001, warm_start=False, positive=False, random_state=None, selection='cyclic', verbose=0, use_gpu=True, device_ids=[], return_training_history=None, privacy=False, eta=0.3, batch_size=100, privacy_epsilon=10, grad_clip=1, num_threads=1)

Linear Model trained with L1 prior as regularizer (aka the Lasso)

The optimization objective for Lasso is:

(1 / (2 * n_samples)) * ||y - Xw||^2_2 + alpha * ||w||_1

Technically the Lasso model is optimizing the same objective function as the Elastic Net with l1_ratio=1.0 (no L2 penalty).
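
To make the relationship with the Elastic Net concrete, the sketch below evaluates both objectives on a toy problem. The Elastic Net form used here is the scikit-learn parametrization (an assumption about what pai4sk follows), and the function names are illustrative only.

```python
import numpy as np

def lasso_objective(X, y, w, alpha):
    """(1 / (2 * n_samples)) * ||y - Xw||^2_2 + alpha * ||w||_1"""
    n_samples = X.shape[0]
    residual = y - X @ w
    return residual @ residual / (2 * n_samples) + alpha * np.abs(w).sum()

def elastic_net_objective(X, y, w, alpha, l1_ratio):
    """Elastic Net objective in the scikit-learn parametrization (assumed)."""
    n_samples = X.shape[0]
    residual = y - X @ w
    return (residual @ residual / (2 * n_samples)
            + alpha * l1_ratio * np.abs(w).sum()
            + 0.5 * alpha * (1 - l1_ratio) * (w @ w))

X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
y = np.array([0.0, 1.0, 2.0])
w = np.array([0.85, 0.0])

lasso_value = lasso_objective(X, y, w, alpha=0.1)
enet_value = elastic_net_objective(X, y, w, alpha=0.1, l1_ratio=1.0)
print(np.isclose(lasso_value, enet_value))
```

With l1_ratio=1.0 the L2 term is multiplied by zero, so the two values coincide for any w.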

Read more in the User Guide.

For the SnapML solver, this estimator supports both local and distributed (MPI) execution.

Parameters:

alpha (float, optional) – Constant that multiplies the L1 term. Defaults to 1.0. alpha = 0 is equivalent to an ordinary least square, solved by the LinearRegression object. For numerical reasons, using alpha = 0 with the Lasso object is not advised. Given this, you should use the LinearRegression object.
fit_intercept (boolean, optional, default True) – Whether to calculate the intercept for this model. If set to False, no intercept will be used in calculations (e.g. data is expected to be already centered).
normalize (boolean, optional, default False) – This parameter is ignored when fit_intercept is set to False. If True, the regressors X will be normalized before regression by subtracting the mean and dividing by the l2-norm. If you wish to standardize, please use pai4sk.preprocessing.StandardScaler before calling fit on an estimator with normalize=False.
precompute (True | False | array-like, default=False) – Whether to use a precomputed Gram matrix to speed up calculations. If set to 'auto' let us decide. The Gram matrix can also be passed as argument. For sparse input this option is always True to preserve sparsity.
copy_X (boolean, optional, default True) – If True, X will be copied; else, it may be overwritten.
max_iter (int, optional) – The maximum number of iterations
tol (float, optional) – The tolerance for the optimization: if the updates are smaller than tol, the optimization code checks the dual gap for optimality and continues until it is smaller than tol.
warm_start (bool, optional) – When set to True, reuse the solution of the previous call to fit as initialization, otherwise, just erase the previous solution. See the Glossary.
positive (bool, optional) – When set to True, forces the coefficients to be positive.
random_state (int, RandomState instance or None, optional, default None) – The seed of the pseudo random number generator that selects a random feature to update. If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random. Used when selection == ‘random’.
selection (str, default 'cyclic') – If set to ‘random’, a random coefficient is updated every iteration rather than looping over features sequentially by default. This (setting to ‘random’) often leads to significantly faster convergence especially when tol is higher than 1e-4.
verbose (bool, default : False) – If True, it prints the training cost, one per iteration. Warning: this will increase the training time. For performance evaluation, use verbose=False.
use_gpu (bool, default : True) – Flag for indicating the hardware platform used for training. If True, the training is performed using the GPU. If False, the training is performed using the CPU. Unless set explicitly, the value of this parameter is subject to change based on the training data. Applicable only for the snapml solver.
device_ids (array-like of int, default : []) – If use_gpu is True, it indicates the IDs of the GPUs used for training. For single GPU training, set device_ids to the GPU ID to be used for training, e.g., [0]. For multi-GPU training, set device_ids to a list of GPU IDs to be used for training, e.g., [0, 1]. Applicable only for snapml solver
num_threads (int, default : 1) – The number of threads used for running the training. The value of this parameter should be a multiple of 32 if the training is performed on GPU (use_gpu=True) (default value for GPU is 256). Applicable only for snapml solver
return_training_history (str or None, default : None) – How much information about the training should be collected and returned by the fit function. By default no information is returned (None), but this parameter can be set to “summary”, to obtain summary statistics at the end of training, or “full” to obtain a complete set of statistics for the entire training procedure. Note, enabling either option will result in slower training. Applicable only for snapml solver
privacy (bool, default : False) – Train the model using a differentially private algorithm.
eta (float, default : 0.3) – Learning rate for the differentially private training algorithm.
batch_size (int, default : 100) – Mini-batch size for the differentially private training algorithm.
privacy_epsilon (float, default : 10.0) – Target privacy guarantee. The learned model will be (privacy_epsilon, 0.01)-private.
grad_clip (float, default: 1.0) – Gradient clipping parameter for the differentially private training algorithm
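
The parameter list above does not spell out the differentially private training algorithm. Assuming grad_clip acts as the l2-norm bound commonly applied to gradients in differentially private SGD (an assumption, not something this page confirms), the clipping step alone might be sketched as follows. The function clip_gradient is hypothetical and noise addition is deliberately left out.

```python
import numpy as np

def clip_gradient(grad, grad_clip=1.0):
    """Rescale grad so that its l2-norm is at most grad_clip (assumed behaviour)."""
    norm = np.linalg.norm(grad)
    if norm > grad_clip:
        return grad * (grad_clip / norm)
    return grad

g_large = np.array([3.0, 4.0])   # l2-norm 5.0, gets rescaled
g_small = np.array([0.3, 0.4])   # l2-norm 0.5, left unchanged

print(np.linalg.norm(clip_gradient(g_large)))
print(np.linalg.norm(clip_gradient(g_small)))
```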

Variables:

coef_ (array, shape (n_features,) | (n_targets, n_features)) – parameter vector (w in the cost function formula)
sparse_coef_ (scipy.sparse matrix, shape (n_features, 1) | (n_targets, n_features)) – sparse_coef_ is a readonly property derived from coef_
intercept_ (float | array, shape (n_targets,)) – independent term in decision function.
n_iter_ (int | array-like, shape (n_targets,)) – number of iterations run by the coordinate descent solver to reach the specified tolerance.
training_history (dict) – It returns a dictionary with the following keys : ‘epochs’, ‘t_elap_sec’, ‘train_obj’. If ‘return_training_history’ is set to “summary”, ‘epochs’ contains the total number of epochs performed, ‘t_elap_sec’ contains the total time for completing all of those epochs. If ‘return_training_history’ is set to “full”, ‘epochs’ indicates the number of epochs that have elapsed so far, and ‘t_elap_sec’ contains the time to do those epochs. ‘train_obj’ is the training loss. Applicable only for snapml solver.
support (array-like) – Indices of the features that lie in the support and contribute to the decision.
model_sparsity (float) – Fraction of non-zeros in the model parameters.
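
The last two entries can be read as simple functions of the coefficient vector. The snippet below shows that reading on a made-up coefficient array; it assumes the attributes are derived in the obvious way and does not inspect the library itself.

```python
import numpy as np

coef = np.array([0.85, 0.0, 0.0, -0.2])   # hypothetical fitted coefficients

# Indices of the features with non-zero weight.
support = np.flatnonzero(coef)

# Fraction of non-zero entries among all model parameters.
model_sparsity = np.count_nonzero(coef) / coef.size

print(support)
print(model_sparsity)
```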

Examples

>>> from pai4sk import linear_model
>>> clf = linear_model.Lasso(alpha=0.1)
>>> clf.fit([[0,0], [1, 1], [2, 2]], [0, 1, 2])
Lasso(alpha=0.1, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)
>>> print(clf.coef_)
[0.85 0.  ]
>>> print(clf.intercept_)  # doctest: +ELLIPSIS
0.15...

See also

lars_path, lasso_path, LassoLars, LassoCV, LassoLarsCV, pai4sk.decomposition.sparse_encode

Notes

The algorithm used to fit the model is coordinate descent.

To avoid unnecessary memory duplication the X argument of the fit method should be directly passed as a Fortran-contiguous numpy array.
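
For readers unfamiliar with coordinate descent, here is a compact NumPy sketch of the cyclic variant applied to the Lasso objective. It is a teaching aid under simplifying assumptions (dense input, a plain update-size stopping rule instead of a dual-gap check), not the pai4sk implementation; both function names are invented for this example.

```python
import numpy as np

def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def lasso_coordinate_descent(X, y, alpha, max_iter=1000, tol=1e-4):
    """Cyclic coordinate descent for the Lasso objective, with an intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_samples, n_features = X.shape
    X_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - X_mean, y - y_mean          # centering handles the intercept
    w = np.zeros(n_features)
    residual = yc.copy()                     # always equals yc - Xc @ w
    col_sq_norms = (Xc ** 2).sum(axis=0)
    for _ in range(max_iter):
        max_update = 0.0
        for j in range(n_features):
            if col_sq_norms[j] == 0.0:
                continue                     # constant column, weight stays 0
            # Correlation of column j with the residual that excludes w[j].
            rho = Xc[:, j] @ residual + col_sq_norms[j] * w[j]
            w_new = soft_threshold(rho, alpha * n_samples) / col_sq_norms[j]
            residual += Xc[:, j] * (w[j] - w_new)
            max_update = max(max_update, abs(w_new - w[j]))
            w[j] = w_new
        if max_update < tol:
            break
    intercept = y_mean - X_mean @ w
    return w, intercept

coef, intercept = lasso_coordinate_descent(
    [[0, 0], [1, 1], [2, 2]], [0, 1, 2], alpha=0.1)
print(coef, intercept)
```

On the toy data from the Examples section this sketch arrives at coefficients close to [0.85, 0] and an intercept close to 0.15, in line with the documented output.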

fit(X, y, check_input=True)

Fit model with coordinate descent.

Parameters:

X (ndarray or scipy.sparse matrix, (n_samples, n_features)) – Data. For the SnapML solver, input of types SnapML data partition and DeviceNDArray is also supported.
y (ndarray, shape (n_samples,) or (n_samples, n_targets)) – Target. Will be cast to X’s dtype if necessary
check_input (boolean, (default=True)) – Allows bypassing several input checks. Don’t use this parameter unless you know what you are doing.

Notes

Coordinate descent is an algorithm that considers one column of data at a time, so the X input is automatically converted to a Fortran-contiguous numpy array if necessary.

To avoid memory re-allocation it is advised to allocate the initial data in memory directly using that format.
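
A short NumPy illustration of the advice in these notes: arrays are C-ordered (row-major) by default, and np.asfortranarray produces a column-major copy that can be checked through the array flags.

```python
import numpy as np

# NumPy arrays are C-ordered (row-major) unless requested otherwise.
X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])

# Column-major copy holding the same values.
X_fortran = np.asfortranarray(X)

# Alternatively, allocate in Fortran order from the start.
X_direct = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]], order='F')

print(X.flags['F_CONTIGUOUS'])
print(X_fortran.flags['F_CONTIGUOUS'])
print(X_direct.flags['F_CONTIGUOUS'])
```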

predict(X, num_threads=0)

Class predictions. Returns the class estimates.

Parameters:

X (sparse matrix (csr_matrix) or dense matrix (ndarray)) – Dataset used for predicting class estimates. For the SnapML solver, input of type SnapML data partition is also supported.
num_threads (int, default : 0) – Number of threads used to run inference. By default, inference runs with the maximum number of available threads.

Returns:	proba – The predicted class of the sample.
Return type:	array-like, shape = (n_samples,)

class pai4sk.linear_model.LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='warn', max_iter=100, multi_class='warn', verbose=0, warm_start=False, n_jobs=None, use_gpu=True, device_ids=[], return_training_history=None, privacy=False, eta=0.3, batch_size=100, privacy_epsilon=10, grad_clip=1, num_threads=1)

Logistic Regression (aka logit, MaxEnt) classifier.

In the multiclass case, the training algorithm uses the one-vs-rest (OvR) scheme if the ‘multi_class’ option is set to ‘ovr’, and uses the cross-entropy loss if the ‘multi_class’ option is set to ‘multinomial’. (Currently the ‘multinomial’ option is supported only by the ‘lbfgs’, ‘sag’ and ‘newton-cg’ solvers.)

This class implements regularized logistic regression using the ‘liblinear’ library, ‘newton-cg’, ‘sag’ and ‘lbfgs’ solvers. It can handle both dense and sparse input. Use C-ordered arrays or CSR matrices containing 64-bit floats for optimal performance; any other input format will be converted (and copied).

The ‘newton-cg’, ‘sag’, and ‘lbfgs’ solvers support only L2 regularization with primal formulation. The ‘liblinear’ solver supports both L1 and L2 regularization, with a dual formulation only for the L2 penalty.

Read more in the User Guide.

For the SnapML solver, this estimator supports both local and distributed (MPI) execution.

Parameters:

penalty (str, 'l1' or 'l2', default: 'l2') –
Used to specify the norm used in the penalization. The ‘newton-cg’, ‘sag’ and ‘lbfgs’ solvers support only l2 penalties.

New in version 0.19: l1 penalty with SAGA solver (allowing ‘multinomial’ + L1)
dual (bool, default: False) – Dual or primal formulation. Dual formulation is only implemented for l2 penalty with liblinear solver. Prefer dual=False when n_samples > n_features.
tol (float, default: 1e-4) – Tolerance for stopping criteria.
C (float, default: 1.0) – Inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization.
fit_intercept (bool, default: True) – Specifies if a constant (a.k.a. bias or intercept) should be added to the decision function.
intercept_scaling (float, default 1.) –
Useful only when the solver ‘liblinear’ is used and self.fit_intercept is set to True. In this case, x becomes [x, self.intercept_scaling], i.e. a “synthetic” feature with constant value equal to intercept_scaling is appended to the instance vector. The intercept becomes intercept_scaling * synthetic_feature_weight.

Note! the synthetic feature weight is subject to l1/l2 regularization as all other features. To lessen the effect of regularization on synthetic feature weight (and therefore on the intercept) intercept_scaling has to be increased.
class_weight (dict or 'balanced', default: None) –
Weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to have weight one.

The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)).

Note that these weights will be multiplied with sample_weight (passed through the fit method) if sample_weight is specified.

New in version 0.17: class_weight=’balanced’
random_state (int, RandomState instance or None, optional, default: None) – The seed of the pseudo random number generator to use when shuffling the data. If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random. Used when solver == ‘sag’ or ‘liblinear’.
solver (str, {'newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga', 'snapml'}, default: 'snapml' if the 'snap_ml' library is in PYTHONPATH, else 'liblinear') –

Algorithm to use in the optimization problem.
- For small datasets, ‘liblinear’ is a good choice, whereas ‘sag’ and ‘saga’ are faster for large ones.
- For multiclass problems, only ‘newton-cg’, ‘sag’, ‘saga’ and ‘lbfgs’ handle multinomial loss; ‘liblinear’ is limited to one-versus-rest schemes.
- ’newton-cg’, ‘lbfgs’ and ‘sag’ only handle L2 penalty, whereas ‘liblinear’ and ‘saga’ handle L1 penalty.
Note that ‘sag’ and ‘saga’ fast convergence is only guaranteed on features with approximately the same scale. You can preprocess the data with a scaler from pai4sk.preprocessing.

New in version 0.17: Stochastic Average Gradient descent solver.

New in version 0.19: SAGA solver.

Changed in version 0.20: Default will change from ‘liblinear’ to ‘lbfgs’ in 0.22.
max_iter (int, default: 100) – Useful only for the newton-cg, sag and lbfgs solvers. Maximum number of iterations taken for the solvers to converge.
multi_class (str, {'ovr', 'multinomial', 'auto'}, default: 'ovr') –
If the option chosen is ‘ovr’, then a binary problem is fit for each label. For ‘multinomial’ the loss minimised is the multinomial loss fit across the entire probability distribution, even when the data is binary. ‘multinomial’ is unavailable when solver=’liblinear’. ‘auto’ selects ‘ovr’ if the data is binary, or if solver=’liblinear’, and otherwise selects ‘multinomial’.

New in version 0.18: Stochastic Average Gradient descent solver for ‘multinomial’ case.

Changed in version 0.20: Default will change from ‘ovr’ to ‘auto’ in 0.22.
verbose (int, default: 0) – For the liblinear and lbfgs solvers set verbose to any positive number for verbosity.
warm_start (bool, default: False) –
When set to True, reuse the solution of the previous call to fit as initialization, otherwise, just erase the previous solution. Useless for liblinear solver. See the Glossary.

New in version 0.17: warm_start to support lbfgs, newton-cg, sag, saga solvers.
n_jobs (int or None, optional (default=None)) – Number of CPU cores used when parallelizing over classes if multi_class=’ovr’. This parameter is ignored when the solver is set to ‘liblinear’ regardless of whether ‘multi_class’ is specified or not. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.
use_gpu (bool, default : True) – Flag for indicating the hardware platform used for training. If True, the training is performed using the GPU. If False, the training is performed using the CPU. Unless set explicitly, the value of this parameter is subject to change based on the training data. Applicable only for the snapml solver.
device_ids (array-like of int, default : []) – If use_gpu is True, it indicates the IDs of the GPUs used for training. For single-GPU training, set device_ids to the GPU ID to be used for training, e.g., [0]. For multi-GPU training, set device_ids to a list of GPU IDs to be used for training, e.g., [0, 1]. Applicable only for snapml solver
num_threads (int, default : 1) – The number of threads used for running the training. The value of this parameter should be a multiple of 32 if the training is performed on GPU (use_gpu=True) (default value for GPU is 256). Applicable only for snapml solver
return_training_history (str or None, default : None) – How much information about the training should be collected and returned by the fit function. By default no information is returned (None), but this parameter can be set to “summary”, to obtain summary statistics at the end of training, or “full” to obtain a complete set of statistics for the entire training procedure. Note, enabling either option will result in slower training. Applicable only for snapml solver
privacy (bool, default : False) – Train the model using a differentially private algorithm.
eta (float, default : 0.3) – Learning rate for the differentially private training algorithm.
batch_size (int, default : 100) – Mini-batch size for the differentially private training algorithm.
privacy_epsilon (float, default : 10.0) – Target privacy guarantee. The learned model will be (privacy_epsilon, 0.01)-private.
grad_clip (float, default: 1.0) – Gradient clipping parameter for the differentially private training algorithm
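
The “balanced” formula quoted in the class_weight entry, n_samples / (n_classes * np.bincount(y)), can be evaluated directly. The label vector below is made up, and the use of np.bincount presumes integer labels 0, 1, ..., n_classes - 1.

```python
import numpy as np

y = np.array([0, 0, 0, 0, 1, 1, 2])   # imbalanced toy labels

n_samples = y.shape[0]
n_classes = np.unique(y).shape[0]

# Rare classes receive proportionally larger weights.
balanced_weights = n_samples / (n_classes * np.bincount(y))

print(balanced_weights)
```

A useful property of this weighting is that the per-sample weights balanced_weights[y] sum to n_samples, so the overall scale of the loss is preserved.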

Variables:

coef_ (array, shape (1, n_features) or (n_classes, n_features)) –
Coefficient of the features in the decision function.

coef_ is of shape (1, n_features) when the given problem is binary. In particular, when multi_class=’multinomial’, coef_ corresponds to outcome 1 (True) and -coef_ corresponds to outcome 0 (False).
intercept_ (array, shape (1,) or (n_classes,)) –
Intercept (a.k.a. bias) added to the decision function.

If fit_intercept is set to False, the intercept is set to zero. intercept_ is of shape (1,) when the given problem is binary. In particular, when multi_class=’multinomial’, intercept_ corresponds to outcome 1 (True) and -intercept_ corresponds to outcome 0 (False).
n_iter_ (array, shape (n_classes,) or (1, )) – Actual number of iterations for all classes. If binary or multinomial, it returns only 1 element. For liblinear solver, only the maximum number of iterations across all classes is given.
training_history (dict) – It returns a dictionary with the following keys : ‘epochs’, ‘t_elap_sec’, ‘train_obj’. If ‘return_training_history’ is set to “summary”, ‘epochs’ contains the total number of epochs performed, ‘t_elap_sec’ contains the total time for completing all of those epochs. If ‘return_training_history’ is set to “full”, ‘epochs’ indicates the number of epochs that have elapsed so far, and ‘t_elap_sec’ contains the time to do those epochs. ‘train_obj’ is the training loss. Applicable only for snapml solver.
support (array-like) – Indices of the features that contribute to the decision. (only available for L1)
model_sparsity (float) –
Fraction of non-zeros in the model parameters. (only available for L1)

Changed in version 0.20: In SciPy <= 1.0.0 the number of lbfgs iterations may exceed max_iter. n_iter_ will now report at most max_iter.
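
One way to read the remark that, for a binary problem with multi_class=’multinomial’, coef_ corresponds to outcome 1 and -coef_ to outcome 0: the class probabilities are a softmax over the score pair (-z, z), which equals a logistic function of 2z. The sketch below checks that identity numerically with hypothetical coefficient values; it mirrors scikit-learn's behaviour and is only assumed to carry over to pai4sk.

```python
import numpy as np

def softmax(scores):
    """Softmax of a 1-D score vector."""
    shifted = scores - scores.max()      # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

coef = np.array([0.5, -1.0])      # hypothetical coef_[0]
intercept = 0.2                   # hypothetical intercept_[0]
x = np.array([1.0, 0.3])          # one sample

z = x @ coef + intercept
proba = softmax(np.array([-z, z]))   # [P(outcome 0), P(outcome 1)]

sigmoid_2z = 1.0 / (1.0 + np.exp(-2.0 * z))
print(np.isclose(proba[1], sigmoid_2z))
```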

Examples

>>> from pai4sk.datasets import load_iris
>>> from pai4sk.linear_model import LogisticRegression
>>> X, y = load_iris(return_X_y=True)
>>> clf = LogisticRegression(random_state=0, solver='lbfgs',
...                          multi_class='multinomial').fit(X, y)
>>> clf.predict(X[:2, :])
array([0, 0])
>>> clf.predict_proba(X[:2, :]) # doctest: +ELLIPSIS
array([[9.8...e-01, 1.8...e-02, 1.4...e-08],
       [9.7...e-01, 2.8...e-02, ...e-08]])
>>> clf.score(X, y)
0.97...

See also

SGDClassifier: incrementally trained logistic regression (when given the parameter loss="log").
LogisticRegressionCV: Logistic regression with built-in cross validation

Notes

The underlying C implementation uses a random number generator to select features when fitting the model. It is thus not uncommon to have slightly different results for the same input data. If that happens, try with a smaller tol parameter.

Predict output may not match that of standalone liblinear in certain cases. See differences from liblinear in the narrative documentation.

References

LIBLINEAR – A Library for Large Linear Classification: http://www.csie.ntu.edu.tw/~cjlin/liblinear/
SAG – Mark Schmidt, Nicolas Le Roux, and Francis Bach: Minimizing Finite Sums with the Stochastic Average Gradient https://hal.inria.fr/hal-00860051/document
SAGA – Defazio, A., Bach F. & Lacoste-Julien S. (2014).: SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives https://arxiv.org/abs/1407.0202
Hsiang-Fu Yu, Fang-Lan Huang, Chih-Jen Lin (2011). Dual coordinate descent: methods for logistic regression and maximum entropy models. Machine Learning 85(1-2):41-75. http://www.csie.ntu.edu.tw/~cjlin/papers/maxent_dual.pdf

fit(X, y, sample_weight=None)

Fit the model according to the given training data.

Parameters:

X – Training vector, where n_samples is the number of samples and n_features is the number of features. For the SnapML solver, input of types SnapML data partition and DeviceNDArray is also supported.
y (array-like, shape (n_samples,) or (n_samples, n_targets)) – Target vector relative to X.
sample_weight (array-like, shape (n_samples,), optional) – Array of weights that are assigned to individual samples. If not provided, then each sample is given unit weight.

New in version 0.17: sample_weight support to LogisticRegression.

Returns:	self
Return type:	object

predict(X, num_threads=0)

Class predictions. Returns the class estimates.

Parameters:

X – Dataset used for predicting class estimates. For the SnapML solver, input of type SnapML data partition is also supported.
num_threads (int, default : 0) – Number of threads used to run inference. By default, inference runs with the maximum number of available threads.

Returns:	proba – The predicted class of the sample.
Return type:	array-like, shape = (n_samples,)

predict_log_proba(X)

Log of probability estimates. The returned estimates for all classes are ordered by the label of classes.

Parameters:	X (array-like, shape = [n_samples, n_features]) – For the SnapML solver, input of type SnapML data partition is also supported.

Returns:	T – Returns the log-probability of the sample for each class in the model, where classes are ordered as they are in `self.classes_`.
Return type:	array-like, shape = [n_samples, n_classes]

predict_proba(X, num_threads=0)

Probability estimates. The returned estimates for all classes are ordered by the label of classes. For a multi_class problem, if multi_class is set to “multinomial”, the softmax function is used to find the predicted probability of each class. Otherwise a one-vs-rest approach is used, i.e. the probability of each class is calculated assuming it to be positive using the logistic function, and these values are normalized across all the classes.

Parameters:

X (array-like, shape = [n_samples, n_features]) – For the SnapML solver, input of type SnapML data partition is also supported.
num_threads (int, default : 0) – Number of threads used to run inference. By default, inference runs with the maximum number of available threads.

Returns:	T – Returns the probability of the sample for each class in the model, where classes are ordered as they are in `self.classes_`.
Return type:	array-like, shape = [n_samples, n_classes]
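
The two ways of turning decision scores into probabilities described for predict_proba can be compared side by side. The following NumPy sketch uses made-up decision scores and hypothetical helper names; it illustrates the formulas only and makes no claim about the library's internals.

```python
import numpy as np

def softmax_proba(scores):
    """Multinomial case: softmax over the per-class decision scores."""
    shifted = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

def ovr_proba(scores):
    """One-vs-rest case: logistic function per class, then normalize each row."""
    per_class = 1.0 / (1.0 + np.exp(-scores))
    return per_class / per_class.sum(axis=1, keepdims=True)

# Hypothetical decision scores for 2 samples and 3 classes.
scores = np.array([[2.0, 0.5, -1.0],
                   [-0.5, 0.0, 1.5]])

p_multinomial = softmax_proba(scores)
p_ovr = ovr_proba(scores)

print(p_multinomial)
print(p_ovr)
print(np.log(p_multinomial))   # what a log-probability output would contain
```

Both variants yield rows that sum to one and rank the classes identically, but the actual probability values generally differ.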

class pai4sk.svm.LinearSVC(penalty='l2', loss='squared_hinge', dual=True, tol=0.0001, C=1.0, multi_class='ovr', fit_intercept=True, intercept_scaling=1, class_weight=None, verbose=0, random_state=None, max_iter=1000, use_gpu=True, device_ids=[], num_threads=1, return_training_history=None)

Linear Support Vector Classification.

Similar to SVC with parameter kernel=’linear’, but implemented in terms of liblinear rather than libsvm, so it has more flexibility in the choice of penalties and loss functions and should scale better to large numbers of samples.

This class supports both dense and sparse input and the multiclass support is handled according to a one-vs-the-rest scheme.

Read more in the User Guide.

For SnapML solver this supports both local and distributed(MPI) method of execution.

Parameters:

penalty (string, 'l1' or 'l2' (default='l2')) – Specifies the norm used in the penalization. The ‘l2’ penalty is the standard used in SVC. The ‘l1’ leads to coef_ vectors that are sparse.
loss (string, 'hinge' or 'squared_hinge' (default='squared_hinge')) – Specifies the loss function. ‘hinge’ is the standard SVM loss (used e.g. by the SVC class) while ‘squared_hinge’ is the square of the hinge loss.
dual (bool, (default=True)) – Select the algorithm to either solve the dual or primal optimization problem. Prefer dual=False when n_samples > n_features.
tol (float, optional (default=1e-4)) – Tolerance for stopping criteria.
C (float, optional (default=1.0)) – Penalty parameter C of the error term.
multi_class (string, 'ovr' or 'crammer_singer' (default='ovr')) – Determines the multi-class strategy if y contains more than two classes. "ovr" trains n_classes one-vs-rest classifiers, while "crammer_singer" optimizes a joint objective over all classes. While crammer_singer is interesting from a theoretical perspective as it is consistent, it is seldom used in practice as it rarely leads to better accuracy and is more expensive to compute. If "crammer_singer" is chosen, the options loss, penalty and dual will be ignored.
fit_intercept (boolean, optional (default=True)) – Whether to calculate the intercept for this model. If set to false, no intercept will be used in calculations (i.e. data is expected to be already centered).
intercept_scaling (float, optional (default=1)) – When self.fit_intercept is True, instance vector x becomes [x, self.intercept_scaling], i.e. a “synthetic” feature with constant value equal to intercept_scaling is appended to the instance vector. The intercept becomes intercept_scaling * synthetic feature weight. Note that the synthetic feature weight is subject to l1/l2 regularization like all other features. To lessen the effect of regularization on the synthetic feature weight (and therefore on the intercept), intercept_scaling has to be increased.
class_weight ({dict, 'balanced'}, optional) – Set the parameter C of class i to class_weight[i]*C for SVC. If not given, all classes are supposed to have weight one. The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y))
verbose (int, (default=0)) – Enable verbose output. Note that this setting takes advantage of a per-process runtime setting in liblinear that, if enabled, may not work properly in a multithreaded context.
random_state (int, RandomState instance or None, optional (default=None)) – The seed of the pseudo random number generator to use when shuffling the data for the dual coordinate descent (if dual=True). When dual=False the underlying implementation of LinearSVC is not random and random_state has no effect on the results. If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.
max_iter (int, (default=1000)) – The maximum number of iterations to be run.
use_gpu (bool, default : True) – Flag for indicating the hardware platform used for training. If True, the training is performed using the GPU. If False, the training is performed using the CPU. The value of this parameter is subject to change based on the training data unless set explicitly. Applicable only for snapml solver.
device_ids (array-like of int, default : []) – If use_gpu is True, it indicates the IDs of the GPUs used for training. For single GPU training, set device_ids to the GPU ID to be used for training, e.g., [0]. For multi-GPU training, set device_ids to a list of GPU IDs to be used for training, e.g., [0, 1]. Applicable only for snapml solver
num_threads (int, default : 1) – The number of threads used for running the training. The value of this parameter should be a multiple of 32 if the training is performed on GPU (use_gpu=True) (default value for GPU is 256). Applicable only for snapml solver
return_training_history (str or None, default : None) – How much information about the training should be collected and returned by the fit function. By default no information is returned (None), but this parameter can be set to “summary”, to obtain summary statistics at the end of training, or “full” to obtain a complete set of statistics for the entire training procedure. Note, enabling either option will result in slower training. Applicable only for snapml solver
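
The “balanced” class_weight formula quoted above, n_samples / (n_classes * np.bincount(y)), can be reproduced with the standard library alone. The sketch below assumes a small hypothetical label vector; balanced_class_weights is an illustrative helper, not part of pai4sk.

```python
from collections import Counter

def balanced_class_weights(y):
    """Return {label: n_samples / (n_classes * count(label))}."""
    counts = Counter(y)
    n_samples = len(y)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}

y = [0, 0, 0, 1]  # hypothetical imbalanced labels
weights = balanced_class_weights(y)

# Per the class_weight description, class i is then trained with
# class_weight[i] * C as its penalty parameter.
C = 1.0
effective_C = {label: w * C for label, w in weights.items()}
```

The rarer class receives the larger weight, so errors on it are penalized more strongly.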

Variables:

coef (array, shape = [n_features] if n_classes == 2 else [n_classes, n_features]) –
Weights assigned to the features (coefficients in the primal problem). This is only available in the case of a linear kernel.

coef_ is a readonly property derived from raw_coef_ that follows the internal memory layout of liblinear.
intercept (array, shape = [1] if n_classes == 2 else [n_classes]) – Constants in decision function.
training_history (dict) – It returns a dictionary with the following keys : ‘epochs’, ‘t_elap_sec’, ‘train_obj’. If ‘return_training_history’ is set to “summary”, ‘epochs’ contains the total number of epochs performed, ‘t_elap_sec’ contains the total time for completing all of those epochs. If ‘return_training_history’ is set to “full”, ‘epochs’ indicates the number of epochs that have elapsed so far, and ‘t_elap_sec’ contains the time to do those epochs. ‘train_obj’ is the training loss. Applicable only for snapml solver.
support (array-like, shape (n_SV)) – Indices of the support vectors.
n_support (int) – Number of support vectors.
n_iter (array, shape (n_classes,) or (1, )) – Actual number of iterations for all classes to reach the specified tolerance. If binary or multinomial, it returns only 1 element.

Examples

>>> from pai4sk.svm import LinearSVC
>>> from pai4sk.datasets import make_classification
>>> X, y = make_classification(n_features=4, random_state=0)
>>> clf = LinearSVC(random_state=0, tol=1e-5)
>>> clf.fit(X, y)
LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=0, tol=1e-05, verbose=0)
>>> print(clf.coef_)
[[0.085... 0.394... 0.498... 0.375...]]
>>> print(clf.intercept_)
[0.284...]
>>> print(clf.predict([[0, 0, 0, 0]]))
[1]

Notes

The underlying C implementation uses a random number generator to select features when fitting the model. It is thus not uncommon to have slightly different results for the same input data. If that happens, try with a smaller tol parameter.

The underlying implementation, liblinear, uses a sparse internal representation for the data that will incur a memory copy.

Predict output may not match that of standalone liblinear in certain cases. See differences from liblinear in the narrative documentation.

References

LIBLINEAR: A Library for Large Linear Classification

See also

SVC: Implementation of Support Vector Machine classifier using libsvm: the kernel can be non-linear but its SMO algorithm does not scale to large number of samples as LinearSVC does. Furthermore SVC multi-class mode is implemented using one vs one scheme while LinearSVC uses one vs the rest. It is possible to implement one vs the rest with SVC by using the pai4sk.multiclass.OneVsRestClassifier wrapper. Finally SVC can fit dense data without memory copy if the input is C-contiguous. Sparse data will still incur memory copy though.
pai4sk.linear_model.SGDClassifier: SGDClassifier can optimize the same cost function as LinearSVC by adjusting the penalty and loss parameters. In addition it requires less memory, allows incremental (online) learning, and implements various loss functions and regularization regimes.

decision_function(X, num_threads=0)¶

Predicts confidence scores.

The confidence score of a sample is the signed distance of that sample to the decision boundary.

Parameters:	X (sparse matrix (csr_matrix) or dense matrix (ndarray)) – Dataset used for predicting distances to the decision boundary. For SnapML solver it also supports input of type SnapML data partition. num_threads (int, default : 0) – Number of threads used to run inference. By default inference runs with maximum number of available threads.
Returns:	proba – Returns the distance to the decision boundary of the samples in X.
Return type:	array-like, shape = (n_samples,) or (n_samples, n_classes)
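
As a rough illustration of what a confidence score looks like for a binary linear model, the sketch below computes the raw score w·x + b (the quantity scikit-learn's linear models return from decision_function) and the geometric signed distance obtained by dividing it by the norm of w. Whether pai4sk returns exactly the raw score for every solver is an assumption here; the function names and the coefficient values are hypothetical.

```python
import math

def decision_score(w, b, x):
    """Raw confidence score w.x + b for one sample."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def geometric_distance(w, b, x):
    """Score divided by ||w||: signed Euclidean distance to the hyperplane."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return decision_score(w, b, x) / norm

w, b = [3.0, 4.0], -5.0  # hypothetical coefficients and intercept
x = [3.0, 4.0]

score = decision_score(w, b, x)         # 9 + 16 - 5
distance = geometric_distance(w, b, x)  # score / 5
predicted_positive = score > 0          # the sign decides the class
```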

fit(X, y, sample_weight=None)¶

Fit the model according to the given training data.

Parameters:	X – Training vector, where n_samples is the number of samples and n_features is the number of features. For SnapML solver it also supports input of types SnapML data partition and DeviceNDArray. y (array-like, shape = [n_samples]) – Target vector relative to X. sample_weight (array-like, shape = [n_samples], optional) – Array of weights that are assigned to individual samples. If not provided, then each sample is given unit weight.
Returns:	self
Return type:	object

predict(X, num_threads=0)¶

Class predictions.

Parameters:	X (sparse matrix (csr_matrix) or dense matrix (ndarray)) – Dataset used for predicting class estimates. For SnapML solver it also supports input of type SnapML data partition. num_threads (int, default : 0) – Number of threads used to run inference. By default inference runs with maximum number of available threads.
Returns:	proba – Returns the predicted class of the sample.
Return type:	array-like, shape = (n_samples,)

class pai4sk.cluster.KMeans(n_clusters=8, max_iter=300, tol=0.0001, verbose=0, random_state=1, precompute_distances='auto', init='k-means++', n_init=1, algorithm='auto', copy_x=True, n_jobs=None, use_gpu=True)¶

K-Means clustering.

If cudf dataframe is passed as input, then pai4sk will try to use the accelerated KMeans algorithm from cuML. Otherwise, scikit-learn’s KMeans algorithm will be used.

cuML in pai4sk is currently supported only

(a) with python 3.6 and
(b) without MPI.

If KMeans from cuML is run, then the return values from the APIs will be cudf dataframe and cudf Series objects instead of the return types of scikit-learn API.

Parameters:

n_clusters (int, optional, default: 8) – The number of clusters to form as well as the number of centroids to generate.
init ({'k-means++', 'random' or an ndarray}) –
Method for initialization, defaults to ‘k-means++’:

’k-means++’ : selects initial cluster centers for k-means clustering in a smart way to speed up convergence. See section Notes in k_init for more details.

’random’: choose k observations (rows) at random from data for the initial centroids.

If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.
n_init (int, default: 1) – Number of times the k-means algorithm will be run with different centroid seeds. The final result will be the best output of n_init consecutive runs in terms of inertia.
max_iter (int, default: 300) – Maximum number of iterations of the k-means algorithm for a single run.
tol (float, default: 1e-4) – Relative tolerance with regards to inertia to declare convergence
precompute_distances ({'auto', True, False}) –
Precompute distances (faster but takes more memory).

’auto’ : do not precompute distances if n_samples * n_clusters > 12 million. This corresponds to about 100MB overhead per job using double precision.

True : always precompute distances

False : never precompute distances
verbose (int, default 0) – Verbosity mode.
random_state (int, RandomState instance or None (default)) – Determines random number generation for centroid initialization. Use an int to make the randomness deterministic. See Glossary.
copy_x (boolean, optional) – When pre-computing distances it is more numerically accurate to center the data first. If copy_x is True (default), then the original data is not modified, ensuring X is C-contiguous. If False, the original data is modified, and put back before the function returns, but small numerical differences may be introduced by subtracting and then adding the data mean, in this case it will also not ensure that data is C-contiguous which may cause a significant slowdown.
n_jobs (int or None, optional (default=None)) –
The number of jobs to use for the computation. This works by computing each of the n_init runs in parallel.

None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.
algorithm ("auto", "full" or "elkan", "cuml", default="auto") –
K-means algorithm to use. The classical EM-style algorithm is “full”. The “elkan” variation is more efficient by using the triangle inequality, but currently doesn’t support sparse data. “auto” chooses “elkan” for dense data and “full” for sparse data.

If cudf dataframe is passed as input, then if either

(1) algorithm is set to “cuml” or

(2) algorithm is “auto”,

then pai4sk will try to use kmeans algorithm from RAPIDS cuML.

cuML in pai4sk is currently supported only

(a) with python 3.6 and

(b) without MPI.

If KMeans from cuML is run, then the return values of the APIs will be cudf dataframe and cudf Series objects instead of the return types of scikit-learn API.

Variables:

cluster_centers (array, [n_clusters, n_features] or cudf dataframe) – Coordinates of cluster centers. If the algorithm stops before fully converging (see tol and max_iter), these will not be consistent with labels_. If KMeans from cuML is run, then the return values of some of the APIs will be cudf dataframe and cudf Series objects instead of the return types of scikit-learn API.
labels (array or cudf Series) – Labels of each point
inertia (float) – Sum of squared distances of samples to their closest cluster center.
n_iter (int) – Number of iterations run.
use_gpu (boolean, Default is True) – If True, cuML will use all GPUs. Applicable only for cuML.

Examples

>>> from pai4sk.cluster import KMeans
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [4, 2], [4, 4], [4, 0]])
>>> kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
>>> kmeans.labels_
array([0, 0, 0, 1, 1, 1], dtype=int32)
>>> kmeans.predict([[0, 0], [4, 4]])
array([0, 1], dtype=int32)
>>> kmeans.cluster_centers_
array([[1., 2.],
       [4., 2.]])

See also

MiniBatchKMeans: Alternative online implementation that does incremental updates of the centers positions using mini-batches. For large scale learning (say n_samples > 10k) MiniBatchKMeans is probably much faster than the default batch implementation.

Notes

The k-means problem is solved using either Lloyd’s or Elkan’s algorithm.

The average complexity is given by O(k n T), where k is the number of clusters, n is the number of samples and T is the number of iterations.

The worst case complexity is given by O(n^(k+2/p)) with n = n_samples, p = n_features. (D. Arthur and S. Vassilvitskii, ‘How slow is the k-means method?’ SoCG2006)

In practice, the k-means algorithm is very fast (one of the fastest clustering algorithms available), but it can get stuck in local minima. That’s why it can be useful to restart it several times.

If the algorithm stops before fully converging (because of tol or max_iter), labels_ and cluster_centers_ will not be consistent, i.e. the cluster_centers_ will not be the means of the points in each cluster. Also, the estimator will reassign labels_ after the last iteration to make labels_ consistent with predict on the training set.
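
To make the notes above concrete, here is a minimal pure-Python sketch of Lloyd's iterations (assign each point to its nearest center, then recompute each center as the mean of its points). It is not the pai4sk, scikit-learn or cuML implementation; the lloyd helper and the two hand-picked initializations are hypothetical and chosen only to show how the result depends on the starting centers.

```python
def lloyd(points, centers, max_iter=300):
    """Plain Lloyd iterations; returns (labels, centers, inertia)."""
    def sq_dist(p, c):
        return sum((pi - ci) ** 2 for pi, ci in zip(p, c))

    centers = [list(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest center for every point.
        new_labels = [min(range(len(centers)),
                          key=lambda k: sq_dist(p, centers[k]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing: converged
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for k in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == k]
            if members:  # keep the old center if a cluster is empty
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    inertia = sum(sq_dist(p, centers[l]) for p, l in zip(points, labels))
    return labels, centers, inertia

X = [[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]]

# Same data, two different (hypothetical) initializations.
good_labels, good_centers, good_inertia = lloyd(X, [[1, 2], [4, 2]])
bad_labels, bad_centers, bad_inertia = lloyd(X, [[1, 0], [4, 4]])
```

Both runs converge, but the second one stops at a solution with higher inertia, which is the reason for restarting with several seeds (n_init) and keeping the best run.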

fit(X, y=None, sample_weight=None)¶

Fit the model according to the given training data.

Parameters:	X ({array-like, sparse matrix} of shape (n_samples, n_features) or cudf dataframe) – Training vector, where n_samples is the number of samples and n_features is the number of features. y (array-like, shape (n_samples,) or (n_samples, n_targets)) – Target vector relative to X. sample_weight (array-like, shape (n_samples,) optional) – Array of weights that are assigned to individual samples. If not provided, then each sample is given unit weight. New in version 0.17: sample_weight support to KMeans.
Returns:	self – If KMeans from cuML is run then this fit method saves the cluster centers and labels as cudf dataframe and cudf Series objects instead of the return types of scikit-learn API.
Return type:	object

fit_predict(X, y=None, sample_weight=None)¶

Compute cluster centers and predict cluster index for each sample.

Convenience method; equivalent to calling fit(X) followed by predict(X).

Parameters:	X ({array-like, sparse matrix}, shape = [n_samples, n_features]) – cuDF dataframe if cuml is being used. New data to transform. y (Ignored) – Not used, present here for API consistency by convention. sample_weight (array-like, shape (n_samples,), optional) – The weights for each observation in X. If None, all observations are assigned equal weight (default: None).
Returns:	labels – Index of the cluster each sample belongs to. If KMeans from cuML is run, then this method saves the cluster centers and labels as cudf dataframe and cudf Series objects instead of the return types of scikit-learn API, and returns a cudf Series object.
Return type:	array, shape [n_samples,] or cudf Series object

fit_transform(X, y=None, sample_weight=None)¶

Compute clustering and transform X to cluster-distance space.

Equivalent to fit(X).transform(X), but more efficiently implemented.

Parameters:	X ({array-like, sparse matrix}, shape = [n_samples, n_features]) – cuDF dataframe if cuml is being used. New data to transform. y (Ignored) – Not used, present here for API consistency by convention. sample_weight (array-like, shape (n_samples,), optional) – The weights for each observation in X. If None, all observations are assigned equal weight (default: None).
Returns:	X_new – X transformed in the new space. If KMeans from cuML is run, then this method saves the cluster centers and labels as cudf dataframe and cudf Series objects instead of the return types of scikit-learn API.
Return type:	array, shape [n_samples, k] or cudf dataframe

predict(X, sample_weight=None)¶

Predict the closest cluster each sample in X belongs to.

In the vector quantization literature, cluster_centers_ is called the code book and each value returned by predict is the index of the closest code in the code book.

Parameters:	X – New data to predict. cuDF dataframe if cuml is being used. sample_weight (array-like, shape (n_samples,), optional) – The weights for each observation in X. If None, all observations are assigned equal weight (default: None).
Returns:	labels – Index of the cluster each sample belongs to. If KMeans from cuML is run, then this method returns cudf Series object instead of the return types of scikit-learn API.
Return type:	array, shape [n_samples,] or cudf Series object

transform(X)¶

Transform X to a cluster-distance space.

In the new space, each dimension is the distance to the cluster centers. Note that even if X is sparse, the array returned by transform will typically be dense.

Parameters:	X ({array-like, sparse matrix}, shape = [n_samples, n_features]) – cuDF dataframe if cuml is being used. New data to transform. If KMeans from cuML is run and if the input data is a cudf dataframe, then this method returns cudf dataframe instead of array.
Returns:	X_new – X transformed in the new space. If KMeans from cuML is run, then this method returns cudf dataframe instead of the return types of scikit-learn API.
Return type:	array, shape [n_samples, k] or cudf dataframe
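
The cluster-distance space can be illustrated without any library: each sample is mapped to its Euclidean distances to the cluster centers, and the predicted label is the index of the smallest distance. The sketch below reuses the centers from the class example; the standalone transform and predict functions are hypothetical stand-ins for the estimator methods and assume Python 3.8+ for math.dist.

```python
import math

def transform(points, centers):
    """One row per sample, one column per center (Euclidean distance)."""
    return [[math.dist(p, c) for c in centers] for p in points]

def predict(points, centers):
    """Index of the closest center for every sample."""
    return [row.index(min(row)) for row in transform(points, centers)]

centers = [[1.0, 2.0], [4.0, 2.0]]  # cluster centers from the class example
samples = [[0, 0], [4, 4]]

X_new = transform(samples, centers)  # shape [n_samples, n_clusters]
labels = predict(samples, centers)
```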

class pai4sk.cluster.DBSCAN(eps=0.5, min_samples=5, metric='euclidean', metric_params=None, algorithm='auto', leaf_size=30, p=None, n_jobs=None, use_gpu=True)¶

Perform DBSCAN clustering from vector array or distance matrix.

DBSCAN - Density-Based Spatial Clustering of Applications with Noise. Finds core samples of high density and expands clusters from them. Good for data which contains clusters of similar density.
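
The notion of a core sample can be sketched in a few lines: a point is a core sample when at least min_samples points, itself included, lie within distance eps of it (eps and min_samples are the constructor parameters of this class). The brute-force helper below is hypothetical and only illustrates the definition with the Euclidean metric; it is not how pai4sk, scikit-learn or cuML search for neighbors. It assumes Python 3.8+ for math.dist.

```python
import math

def core_sample_indices(points, eps=0.5, min_samples=5):
    """Indices of points with at least min_samples neighbors within eps."""
    core = []
    for i, p in enumerate(points):
        # The point itself counts as one of its own neighbors.
        neighbors = sum(1 for q in points if math.dist(p, q) <= eps)
        if neighbors >= min_samples:
            core.append(i)
    return core

# Three nearby points and one isolated point (hypothetical data).
points = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]]
core = core_sample_indices(points, eps=0.2, min_samples=3)
```

Here the first three points are core samples, while the isolated point has too few neighbors. It is also not within eps of any core sample, so a full DBSCAN run would label it as noise (-1).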

If the input data is cudf dataframe and if possible, then the accelerated DBSCAN algorithm from cuML will be used. Otherwise, scikit-learn’s DBSCAN algorithm will be used.

cuML in pai4sk is currently supported only

(a) with python 3.6 and
(b) without MPI.

If DBSCAN from cuML is run, then the return values from the APIs will be cudf dataframe and cudf Series objects instead of the return types of scikit-learn API.

Parameters:

eps (float, optional) – The maximum distance between two samples for them to be considered as in the same neighborhood.
min_samples (int, optional) – The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself.
metric (string, or callable) –
The metric to use when calculating distance between instances in a feature array. If metric is a string or callable, it must be one of the options allowed by pai4sk.metrics.pairwise_distances for its metric parameter. If metric is ‘precomputed’, X is assumed to be a distance matrix and must be square. X may be a sparse matrix, in which case only nonzero elements may be considered neighbors for DBSCAN.

New in version 0.17: metric precomputed to accept precomputed sparse matrix.
metric_params (dict, optional) –
Additional keyword arguments for the metric function.

New in version 0.19.
algorithm ({‘auto’, ‘ball_tree’, ‘kd_tree’, ‘brute’, ‘cuml’}, optional) –
The algorithm to be used by the NearestNeighbors module to compute pointwise distances and find nearest neighbors. See NearestNeighbors module documentation for details.

If cudf dataframe is given as input, if either

(1) algorithm is set to “cuml” or

(2) algorithm is “auto”,

then pai4sk will try to use DBSCAN algorithm from RAPIDS cuML if possible.

cuML in pai4sk is currently supported only

(a) with python 3.6 and

(b) without MPI.
leaf_size (int, optional (default = 30)) – Leaf size passed to BallTree or cKDTree. This can affect the speed of the construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem.
p (float, optional) – The power of the Minkowski metric to be used to calculate distance between points.
n_jobs (int or None, optional (default=None)) – The number of parallel jobs to run. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.
use_gpu (boolean, Default is True) – If True, cuML will use GPU 0. Applicable only for cuML.

Variables:

core_sample_indices_ (array, shape = [n_core_samples]) – Indices of core samples.
components_ (array, shape = [n_core_samples, n_features]) – Copy of each core sample found by training.
labels_ (array, shape = [n_samples]) – Cluster labels for each point in the dataset given to fit(). Noisy samples are given the label -1.

fit(X, y=None, sample_weight=None)¶

Perform DBSCAN clustering from features or distance matrix.

Parameters:	X (array or sparse (CSR) matrix of shape (n_samples, n_features), or array of shape (n_samples, n_samples) or cuDF dataframe) – A feature array, or array of distances between samples if metric=’precomputed’. sample_weight (array, shape (n_samples,), optional) – Weight of each sample, such that a sample with a weight of at least min_samples is by itself a core sample; a sample with negative weight may inhibit its eps-neighbor from being core. Note that weights are absolute, and default to 1. y (Ignored) –
Returns:	self – If DBSCAN from cuML is run, then this fit method saves the computed labels as cudf Series object instead of array.
Return type:	object

fit_predict(X, y=None, sample_weight=None)¶

Performs clustering on X and returns cluster labels.

Parameters:	X (array or sparse (CSR) matrix of shape (n_samples, n_features), or array of shape (n_samples, n_samples) or cudf dataframe) – A feature array, or array of distances between samples if metric=’precomputed’. sample_weight (array, shape (n_samples,), optional) – Weight of each sample, such that a sample with a weight of at least min_samples is by itself a core sample; a sample with negative weight may inhibit its eps-neighbor from being core. Note that weights are absolute, and default to 1. y (Ignored) –
Returns:	y – Cluster labels. If DBSCAN from cuML is run, then this method returns the computed labels as cudf Series object instead of ndarray.
Return type:	ndarray, shape (n_samples,) or cudf Series

class pai4sk.decomposition.PCA(n_components=None, copy=True, whiten=False, svd_solver='auto', tol=0.0, iterated_power='auto', random_state=None, use_gpu=True)¶

Principal component analysis (PCA)

Linear dimensionality reduction using Singular Value Decomposition of the data to project it to a lower dimensional space.

It uses the LAPACK implementation of the full SVD or a randomized truncated SVD by the method of Halko et al. 2009, depending on the shape of the input data and the number of components to extract.

It can also use the scipy.sparse.linalg ARPACK implementation of the truncated SVD.

If the input data is cudf dataframe, then pai4sk will try to use the accelerated PCA algorithm from cuML. Otherwise, scikit-learn’s PCA algorithm will be used.

cuML in pai4sk is currently supported only

(a) with python 3.6 and
(b) without MPI.

If PCA from cuML is run, then the return values from the APIs will be cudf dataframe and cudf Series objects instead of the return types of scikit-learn API.

Notice that this class does not support sparse input. See TruncatedSVD for an alternative with sparse data.

Read more in the User Guide.

Parameters:

n_components (int, float, None or string) –
Number of components to keep. If n_components is not set, all components are kept:
```
n_components == min(n_samples, n_features)
```
If n_components == 'mle' and svd_solver == 'full', Minka’s MLE is used to guess the dimension. Use of n_components == 'mle' will interpret svd_solver == 'auto' as svd_solver == 'full'.

If 0 < n_components < 1 and svd_solver == 'full', select the number of components such that the amount of variance that needs to be explained is greater than the percentage specified by n_components.

If svd_solver == 'arpack', the number of components must be strictly less than the minimum of n_features and n_samples.

Hence, the None case results in:
```
n_components == min(n_samples, n_features) - 1
```
copy (bool (default True)) – If False, data passed to fit are overwritten and running fit(X).transform(X) will not yield the expected results, use fit_transform(X) instead.
whiten (bool, optional (default False)) –
When True (False by default) the components_ vectors are multiplied by the square root of n_samples and then divided by the singular values to ensure uncorrelated outputs with unit component-wise variances.

Whitening will remove some information from the transformed signal (the relative variance scales of the components) but can sometimes improve the predictive accuracy of the downstream estimators by making their data respect some hard-wired assumptions.
svd_solver (string {'auto', 'full', 'arpack', 'randomized', 'cuml', 'jacobi'}) –

auto :

when cuml is not used, the solver is selected by a default policy based on X.shape and n_components: if the input data is larger than 500x500 and the number of components to extract is lower than 80% of the smallest dimension of the data, then the more efficient ‘randomized’ method is enabled. Otherwise the exact full SVD is computed and optionally truncated afterwards. If cuml is used, then the default algorithm ‘full’ will be used when the svd_solver is ‘auto’ or ‘cuml’.
If cudf dataframe is given as input, if either

(1) svd_solver is set to “cuml” or

(2) svd_solver is “auto”,

then pai4sk will try to use PCA algorithm from RAPIDS cuML if possible.

cuML in pai4sk is currently supported only

(a) with python 3.6 and

(b) without MPI.

full :

run exact full SVD calling the standard LAPACK solver via scipy.linalg.svd and select the components by postprocessing

arpack :

run SVD truncated to n_components calling ARPACK solver via scipy.sparse.linalg.svds. It requires strictly 0 < n_components < min(X.shape)

randomized :

run randomized SVD by the method of Halko et al.

New in version 0.18.0.
tol (float >= 0, optional (default .0)) –
Tolerance for singular values computed by svd_solver == ‘arpack’.

New in version 0.18.0.
iterated_power (int >= 0, or 'auto', (default 'auto')) –
Number of iterations for the power method computed by svd_solver == ‘randomized’. Note : cuML for pai4sk only supports integer values for this parameter.

New in version 0.18.0.
random_state (int, RandomState instance or None, optional (default None)) –
If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random. Used when svd_solver == ‘arpack’ or ‘randomized’.

New in version 0.18.0.
use_gpu (boolean, Default is True) – If True, cuML will use GPU 0. Applicable only for cuML.
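
For the fractional case 0 < n_components < 1 described above, the number of components is chosen from the cumulative explained-variance ratio. The sketch below shows one way to express that rule; the helper components_for_variance and the ratio values are hypothetical, and the exact handling of ties inside the library is an assumption.

```python
def components_for_variance(ratios, threshold):
    """Smallest k whose cumulative explained-variance ratio exceeds threshold.

    `ratios` must be sorted in decreasing order, as explained_variance_ratio
    is for a fitted PCA.
    """
    cumulative = 0.0
    for k, ratio in enumerate(ratios, start=1):
        cumulative += ratio
        if cumulative > threshold:
            return k
    return len(ratios)  # threshold not exceeded: keep every component

ratios = [0.6, 0.25, 0.1, 0.05]  # hypothetical explained-variance ratios
k = components_for_variance(ratios, 0.8)
```

With these ratios, one component explains 60% of the variance and two explain 85%, so a threshold of 0.8 selects two components.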

Variables:

components (array of shape (n_components, n_features) or cudf dataframe) – Principal axes in feature space, representing the directions of maximum variance in the data. The components are sorted by explained_variance_.
explained_variance (array of shape (n_components,) or cudf Series) –
The amount of variance explained by each of the selected components.

Equal to n_components largest eigenvalues of the covariance matrix of X.

New in version 0.18.
explained_variance_ratio (array of shape (n_components,) or cudf Series) –
Percentage of variance explained by each of the selected components.

If n_components is not set then all components are stored and the sum of the ratios is equal to 1.0.
singular_values (array of shape (n_components,) or cudf Series) – The singular values corresponding to each of the selected components. The singular values are equal to the 2-norms of the n_components variables in the lower-dimensional space.
mean (array, shape (n_features,)) –
Per-feature empirical mean, estimated from the training set.

Equal to X.mean(axis=0).
n_components (int) – The estimated number of components. When n_components is set to ‘mle’ or a number between 0 and 1 (with svd_solver == ‘full’) this number is estimated from input data. Otherwise it equals the parameter n_components, or the lesser value of n_features and n_samples if n_components is None.
noise_variance (float or cudf Series) –
The estimated noise covariance following the Probabilistic PCA model from Tipping and Bishop 1999. See “Pattern Recognition and Machine Learning” by C. Bishop, 12.2.1 p. 574 or http://www.miketipping.com/papers/met-mppca.pdf. It is required to compute the estimated data covariance and score samples.

Equal to the average of (min(n_features, n_samples) - n_components) smallest eigenvalues of the covariance matrix of X.
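
The relationships between these attributes can be checked by hand on the small two-feature data set used in the Examples section. The sketch below is a pure-Python illustration, not the library's SVD-based code: it builds the sample covariance matrix (with n_samples - 1 in the denominator, the convention scikit-learn uses) and takes the closed-form eigenvalues of the resulting symmetric 2x2 matrix. The local variable names mirror the attributes above.

```python
import math

X = [[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]]
n = len(X)

# Per-feature empirical mean, then centered data.
mean = [sum(col) / n for col in zip(*X)]
centered = [[x - mean[0], y - mean[1]] for x, y in X]

# Entries of the 2x2 sample covariance matrix.
sxx = sum(x * x for x, _ in centered) / (n - 1)
syy = sum(y * y for _, y in centered) / (n - 1)
sxy = sum(x * y for x, y in centered) / (n - 1)

# Closed-form eigenvalues of a symmetric 2x2 matrix, largest first.
half_trace = (sxx + syy) / 2
root = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
explained_variance = [half_trace + root, half_trace - root]

total_variance = sxx + syy
explained_variance_ratio = [v / total_variance for v in explained_variance]

# Singular values of the centered data relate to the eigenvalues by
# explained_variance = singular_value ** 2 / (n - 1).
singular_values = [math.sqrt(v * (n - 1)) for v in explained_variance]

# With n_components=1, the noise variance is the average of the discarded
# eigenvalues; only one eigenvalue is discarded here.
noise_variance = explained_variance[1]
```

The ratios and singular values computed this way agree with the values printed in the Examples section (about 0.9924 and 0.0075, and about 6.3006 and 0.5498).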

References

For n_components == ‘mle’, this class uses the method of Minka, T. P. “Automatic choice of dimensionality for PCA”. In NIPS, pp. 598-604

Implements the probabilistic PCA model from: Tipping, M. E., and Bishop, C. M. (1999). “Probabilistic principal component analysis”. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3), 611-622, via the score and score_samples methods. See http://www.miketipping.com/papers/met-mppca.pdf

For svd_solver == ‘arpack’, refer to scipy.sparse.linalg.svds.

For svd_solver == ‘randomized’, see: Halko, N., Martinsson, P. G., and Tropp, J. A. (2011). “Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions”. SIAM review, 53(2), 217-288. and also Martinsson, P. G., Rokhlin, V., and Tygert, M. (2011). “A randomized algorithm for the decomposition of matrices”. Applied and Computational Harmonic Analysis, 30(1), 47-68.

Examples

>>> import numpy as np
>>> from pai4sk.decomposition import PCA
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> pca = PCA(n_components=2)
>>> pca.fit(X)
PCA(copy=True, iterated_power='auto', n_components=2, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)
>>> print(pca.explained_variance_ratio_)  # doctest: +ELLIPSIS
[0.9924... 0.0075...]
>>> print(pca.singular_values_)  # doctest: +ELLIPSIS
[6.30061... 0.54980...]

>>> pca = PCA(n_components=2, svd_solver='full')
>>> pca.fit(X)                 # doctest: +ELLIPSIS +NORMALIZE_WHITESPACE
PCA(copy=True, iterated_power='auto', n_components=2, random_state=None,
  svd_solver='full', tol=0.0, whiten=False)
>>> print(pca.explained_variance_ratio_)  # doctest: +ELLIPSIS
[0.9924... 0.00755...]
>>> print(pca.singular_values_)  # doctest: +ELLIPSIS
[6.30061... 0.54980...]

>>> pca = PCA(n_components=1, svd_solver='arpack')
>>> pca.fit(X)
PCA(copy=True, iterated_power='auto', n_components=1, random_state=None,
  svd_solver='arpack', tol=0.0, whiten=False)
>>> print(pca.explained_variance_ratio_)  # doctest: +ELLIPSIS
[0.99244...]
>>> print(pca.singular_values_)  # doctest: +ELLIPSIS
[6.30061...]

See also

KernelPCA, SparsePCA, TruncatedSVD, IncrementalPCA

fit(X, y=None, _transform=True)¶

Fit the model with X.

Parameters:	X (array-like of shape (n_samples, n_features) or cudf dataframe) – Training data, where n_samples is the number of samples and n_features is the number of features. y (Ignored) –
Returns:	self – Returns the instance itself. If PCA from cuML is run, then this fit method saves the computed values as cudf dataframes and cudf Series objects instead of the results’ types seen from scikit-learn’s fit method.
Return type:	object

fit_transform(X, y=None)¶

Fit the model with X and apply the dimensionality reduction on X.

Parameters:	X (array-like of shape (n_samples, n_features) or cudf dataframe) – Training data, where n_samples is the number of samples and n_features is the number of features. y (Ignored) –
Returns:	X_new – If PCA from cuML is run, then this method saves the computed values as cudf dataframes and cudf Series objects instead of the results’ types seen from scikit-learn’s fit_transform method.
Return type:	array-like of shape (n_samples, n_components) or cudf dataframe

inverse_transform(X)¶

Transform data back to its original space.

In other words, return an input X_original whose transform would be X.

Parameters:	X (array-like of shape (n_samples, n_components) or cudf dataframe) – New data, where n_samples is the number of samples and n_components is the number of components.
Returns:	X_original – If PCA from cuML is run, then this method returns cudf dataframe instead of the results’ types seen from scikit-learn’s inverse_transform method.
Return type:	array-like of shape (n_samples, n_features) or cudf dataframe

Notes

If whitening is enabled, inverse_transform will compute the exact inverse operation, which includes reversing whitening.
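
As an illustration of why the inverse stays exact under whitening, here is a minimal NumPy sketch (not the pai4sk implementation). It assumes whitening divides each projected component by the square root of its explained variance, as in scikit-learn, and keeps all components so that no information is discarded.

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated toy data with three features.
mixing = np.array([[2.0, 0.0, 0.0], [0.5, 1.0, 0.0], [0.0, 0.3, 0.2]])
X = rng.normal(size=(20, 3)) @ mixing

mean = X.mean(axis=0)
U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
explained_variance = S ** 2 / (X.shape[0] - 1)
scale = np.sqrt(explained_variance)

# Forward transform with whitening: project, then scale to unit variance.
X_white = (X - mean) @ Vt.T / scale

# Inverse: undo the scaling first, then rotate back and add the mean.
X_back = (X_white * scale) @ Vt + mean

print(np.allclose(X_back, X))
```

If fewer components are kept, the same two steps give the closest reconstruction within the retained subspace rather than the original data.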

score(X, y=None)¶

Return the average log-likelihood of all samples.

See “Pattern Recognition and Machine Learning” by C. Bishop, 12.2.1 p. 574 or http://www.miketipping.com/papers/met-mppca.pdf

Parameters:	X (array, shape(n_samples, n_features)) – The data. y (Ignored) –
Returns:	ll – Average log-likelihood of the samples under the current model
Return type:	float
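
For orientation, the average log-likelihood under the probabilistic PCA model can be sketched in plain NumPy. This is an assumption-laden illustration, not the pai4sk code path: it builds the model covariance from the retained components plus isotropic noise and evaluates a Gaussian log-density for each sample. All names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3)) * np.array([3.0, 1.0, 0.3])
n_samples, n_features = X.shape
n_components = 2

mean = X.mean(axis=0)
Xc = X - mean
_, S, Vt = np.linalg.svd(Xc, full_matrices=False)
eigenvalues = S ** 2 / (n_samples - 1)
noise_variance = eigenvalues[n_components:].mean()

# Model covariance: retained directions carry (eigenvalue - noise) of signal,
# and every direction carries the isotropic noise variance.
V = Vt[:n_components]
signal = np.maximum(eigenvalues[:n_components] - noise_variance, 0.0)
covariance = (V.T * signal) @ V + noise_variance * np.eye(n_features)

# Gaussian log-density per sample, then the average over samples.
precision = np.linalg.inv(covariance)
_, logdet = np.linalg.slogdet(covariance)
mahalanobis = np.einsum("ij,jk,ik->i", Xc, precision, Xc)
log_likelihood = -0.5 * (n_features * np.log(2 * np.pi) + logdet + mahalanobis)
average_ll = log_likelihood.mean()

print(average_ll)
```

Because only one eigenvalue is discarded here, the model covariance coincides with the sample covariance; with more discarded directions the noise term averages them.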

transform(X)¶

Apply dimensionality reduction to X.

X is projected on the first principal components previously extracted from a training set.

Parameters:	X (array-like of shape (n_samples, n_features) or cudf dataframe) – New data, where n_samples is the number of samples and n_features is the number of features.
Returns:	X_new – If PCA from cuML is run, then this method returns a cudf dataframe instead of the result type returned by scikit-learn’s transform method.
Return type:	array-like of shape (n_samples, n_components) or cudf dataframe
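
The projection described above amounts to subtracting the training mean and multiplying by the transposed components. A minimal NumPy sketch follows, assuming no whitening; X_test is a made-up input for illustration.

```python
import numpy as np

X_train = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]], dtype=float)
X_test = np.array([[0.5, 0.5], [4.0, 3.0]])  # hypothetical new samples

# "Fit": the mean and principal axes come from the training set only.
mean = X_train.mean(axis=0)
_, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
components = Vt[:1]  # keep the first principal component

# "Transform": center new data with the training mean, then project.
X_new = (X_test - mean) @ components.T

print(X_new.shape)
```

Note that the training mean, not the mean of the new data, is subtracted before projecting.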

Examples

>>> import numpy as np
>>> from pai4sk.decomposition import IncrementalPCA
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> ipca = IncrementalPCA(n_components=2, batch_size=3)
>>> ipca.fit(X)
IncrementalPCA(batch_size=3, copy=True, n_components=2, whiten=False)
>>> ipca.transform(X) # doctest: +SKIP

class pai4sk.decomposition.TruncatedSVD(n_components=2, algorithm='auto', n_iter=5, random_state=None, tol=0.0, use_gpu=True)¶

Dimensionality reduction using truncated SVD (aka LSA).

This transformer performs linear dimensionality reduction by means of truncated singular value decomposition (SVD). Contrary to PCA, this estimator does not center the data before computing the singular value decomposition. This means it can work with scipy.sparse matrices efficiently.
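
To see why skipping the centering step matters for sparse input, consider this small NumPy sketch. It uses a dense array holding mostly zeros as a stand-in for a scipy.sparse matrix; the data is made up for illustration.

```python
import numpy as np

# Mostly-zero matrix, standing in for a sparse term-count matrix.
X = np.zeros((6, 4))
X[0, 1] = 3.0
X[2, 0] = 1.0
X[4, 3] = 2.0

# Truncated SVD factors X directly, so the zeros stay zeros.
nonzeros_raw = np.count_nonzero(X)

# PCA-style centering subtracts each column mean, which fills in
# every entry of any column that contains a nonzero value.
X_centered = X - X.mean(axis=0)
nonzeros_centered = np.count_nonzero(X_centered)

print(nonzeros_raw, nonzeros_centered)
```

On a large sparse matrix the centered copy would have to be stored densely, which is the cost this estimator avoids.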

In particular, truncated SVD works on term count/tf-idf matrices as returned by the vectorizers in pai4sk.feature_extraction.text. In that context, it is known as latent semantic analysis (LSA).

This estimator supports two algorithms: a fast randomized SVD solver, and a “naive” algorithm that uses ARPACK as an eigensolver on (X * X.T) or (X.T * X), whichever is more efficient.

If the input data is a cudf dataframe, the accelerated TruncatedSVD algorithm from cuML is used where possible. Otherwise, scikit-learn’s TruncatedSVD algorithm is used.

cuML in pai4sk is currently supported only (a) with Python 3.6 and (b) without MPI.

If TruncatedSVD from cuML is run, then the return values from the APIs will be cudf dataframe and cudf Series objects instead of the return types of the scikit-learn API.

Read more in the User Guide.

Parameters:

n_components (int, default = 2) – Desired dimensionality of output data. Must be strictly less than the number of features. The default value is useful for visualisation. For LSA, a value of 100 is recommended.
algorithm (string, "arpack", "randomized", "cuml", "auto", "full" or "jacobi". default = "auto".) –
SVD solver to use. Either “arpack” for the ARPACK wrapper in SciPy (scipy.sparse.linalg.svds), or “randomized” for the randomized algorithm due to Halko (2009) if cuml can’t be used.

“auto” will become “full” if the arguments pass the validations required for using cuml, and “randomized” if cuml is not used. To use cuml, algorithm should be one of “auto”, “cuml”, “full” and “jacobi”.
n_iter (int, optional (default 5)) – Number of iterations for randomized SVD solver. Not used by ARPACK. The default is larger than the default in randomized_svd to handle sparse matrices that may have large slowly decaying spectrum.
random_state (int, RandomState instance or None, optional, default = None) – If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.
tol (float, optional) – Tolerance for ARPACK. 0 means machine precision. Ignored by randomized SVD solver.
use_gpu (boolean, Default is True) – If True, cuML will use GPU 0. Applicable only for cuML.

Variables:

components (array of shape (n_components, n_features) or cudf dataframe) –
explained_variance (array of shape (n_components,) or cudf Series object) – The variance of the training samples transformed by a projection to each component.
explained_variance_ratio (array of shape (n_components,) or cudf Series object) – Percentage of variance explained by each of the selected components.
singular_values (array of shape (n_components,) or cudf Series object) – The singular values corresponding to each of the selected components. The singular values are equal to the 2-norms of the n_components variables in the lower-dimensional space.

Examples

>>> from pai4sk.decomposition import TruncatedSVD
>>> from pai4sk.random_projection import sparse_random_matrix
>>> X = sparse_random_matrix(100, 100, density=0.01, random_state=42)
>>> svd = TruncatedSVD(n_components=5, n_iter=7, random_state=42)
>>> svd.fit(X)  # doctest: +NORMALIZE_WHITESPACE
TruncatedSVD(algorithm='randomized', n_components=5, n_iter=7,
        random_state=42, tol=0.0)
>>> print(svd.explained_variance_ratio_)  # doctest: +ELLIPSIS
[0.0606... 0.0584... 0.0497... 0.0434... 0.0372...]
>>> print(svd.explained_variance_ratio_.sum())  # doctest: +ELLIPSIS
0.249...
>>> print(svd.singular_values_)  # doctest: +ELLIPSIS
[2.5841... 2.5245... 2.3201... 2.1753... 2.0443...]

See also

PCA

References

Finding structure with randomness: Stochastic algorithms for constructing approximate matrix decompositions. Halko, et al., 2009 (arXiv:0909.4061) https://arxiv.org/pdf/0909.4061.pdf

Notes

SVD suffers from a problem called “sign indeterminacy”, which means the sign of the components_ and the output from transform depend on the algorithm and random state. To work around this, fit instances of this class to data once, then keep the instance around to do transformations.
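
The sign indeterminacy can be reproduced with a few lines of NumPy. The sketch below is illustrative only: it flips the sign of one singular vector pair by hand and shows that the factorization is equally valid while the transformed data changes sign.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
U, S, Vt = np.linalg.svd(X, full_matrices=False)

# Flip the sign of the first left/right singular vector pair.
signs = np.array([-1.0, 1.0, 1.0, 1.0])
U_flipped = U * signs
Vt_flipped = Vt * signs[:, None]

# Both factorizations reconstruct X equally well...
same_reconstruction = np.allclose(U_flipped @ np.diag(S) @ Vt_flipped, X)

# ...but the transformed data differs in the sign of its first column.
X_new = X @ Vt.T
X_new_flipped = X @ Vt_flipped.T

print(same_reconstruction, np.allclose(X_new_flipped[:, 0], -X_new[:, 0]))
```

This is why reusing one fitted instance keeps the transformed outputs consistent across calls.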

fit(X, y=None)¶

Fit LSI model on training data X.

Parameters:	X ({array-like, sparse matrix} of shape (n_samples, n_features) or cudf dataframe) – Training data. y (Ignored) –
Returns:	self – Returns the transformer object. If TruncatedSVD from cuML is run, then this fit method saves the computed values as cudf dataframes and cudf Series objects instead of the results’ types seen from scikit-learn’s fit method.
Return type:	object

fit_transform(X, y=None)¶

Fit LSI model to X and perform dimensionality reduction on X.

Parameters:	X ({array-like, sparse matrix} of shape (n_samples, n_features) or cudf dataframe) – Training data. y (Ignored) –
Returns:	X_new – Reduced version of X. This will always be a dense array. If TruncatedSVD from cuML is run, then this method saves the computed values as cudf dataframes and cudf Series objects instead of the results’ types seen from scikit-learn’s API.
Return type:	array of shape (n_samples, n_components) or cudf dataframe

inverse_transform(X)¶

Transform X back to its original space.

Returns an array or cudf dataframe X_original whose transform would be X.

Parameters:	X (array-like of shape (n_samples, n_components) or cudf dataframe) – New data.
Returns:	X_original – Note that this is always dense. If TruncatedSVD from cuML is run, then this method returns cudf dataframe instead of the results’ types seen from scikit-learn’s inverse_transform method.
Return type:	array of shape (n_samples, n_features) or cudf dataframe

transform(X)¶

Perform dimensionality reduction on X.

Parameters:	X ({array-like, sparse matrix} of shape (n_samples, n_features) or cudf dataframe) – New data.
Returns:	X_new – Reduced version of X. This will always be dense. If TruncatedSVD from cuML is run, then this method returns cudf dataframe instead of the results’ types seen from scikit-learn’s transform method.
Return type:	array of shape (n_samples, n_components) or cudf dataframe
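
As a rough picture of what transform and inverse_transform compute, the sketch below uses an exact NumPy SVD in place of the randomized, ARPACK, or cuML solvers. The value k and the variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(10, 6))
k = 3  # number of components to keep (hypothetical choice)

U, S, Vt = np.linalg.svd(X, full_matrices=False)
components = Vt[:k]

# transform: project onto the top-k right singular vectors (no centering).
X_new = X @ components.T          # shape (n_samples, k)

# inverse_transform: map back, giving a rank-k approximation of X.
X_approx = X_new @ components     # shape (n_samples, n_features)

# The Frobenius error equals the norm of the discarded singular values.
error = np.linalg.norm(X - X_approx)

print(X_new.shape, X_approx.shape, error)
```

The round trip is therefore lossy unless the discarded singular values are all zero.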