pai4sk API

class pai4sk.linear_model.Ridge(alpha=1.0, fit_intercept=True, normalize=False, copy_X=True, max_iter=None, tol=0.001, solver='auto', random_state=None, dual=False, verbose=0, use_gpu=True, device_ids=[], return_training_history=None, privacy=False, eta=0.3, batch_size=100, privacy_epsilon=10, grad_clip=1, num_threads=1)
Linear least squares with l2 regularization.
Minimizes the objective function:
||y - Xw||^2_2 + alpha * ||w||^2_2
This model solves a regression model where the loss function is the linear least squares function and regularization is given by the l2-norm. Also known as Ridge Regression or Tikhonov regularization. This estimator has built-in support for multivariate regression (i.e., when y is a 2d-array of shape [n_samples, n_targets]).
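As a rough illustration of the objective above, the following sketch solves the single-feature, no-intercept case in plain Python. With one feature the minimizer has the closed form w = sum(x*y) / (sum(x*x) + alpha). The helper names ridge_objective and ridge_1d are made up for this example and are not part of pai4sk.

```python
def ridge_objective(w, x, y, alpha):
    """||y - x*w||^2_2 + alpha * w^2 for a single feature (no intercept)."""
    residual = sum((yi - xi * w) ** 2 for xi, yi in zip(x, y))
    return residual + alpha * w ** 2


def ridge_1d(x, y, alpha):
    """Closed-form minimizer of ridge_objective for one feature."""
    numerator = sum(xi * yi for xi, yi in zip(x, y))
    denominator = sum(xi * xi for xi in x) + alpha
    return numerator / denominator


x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]  # exactly y = 2 * x

w_small = ridge_1d(x, y, alpha=0.1)
w_large = ridge_1d(x, y, alpha=10.0)
# A larger alpha shrinks the weight further towards zero.
print(w_small, w_large)
```

Both weights stay below the unregularized slope of 2, and the one fitted with the larger alpha is the smaller of the two.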
Read more in the User Guide.
With the SnapML solver, both local and distributed (MPI) execution are supported.
Parameters:  alpha ({float, array-like}, shape (n_targets)) – Regularization strength; must be a positive float. Regularization improves the conditioning of the problem and reduces the variance of the estimates. Larger values specify stronger regularization. Alpha corresponds to C^-1 in other linear models such as LogisticRegression or LinearSVC. If an array is passed, penalties are assumed to be specific to the targets. Hence they must correspond in number.
 fit_intercept (boolean) – Whether to calculate the intercept for this model. If set to False, no intercept will be used in calculations (e.g. data is expected to be already centered).
 normalize (boolean, optional, default False) – This parameter is ignored when fit_intercept is set to False. If True, the regressors X will be normalized before regression by subtracting the mean and dividing by the l2-norm. If you wish to standardize, please use pai4sk.preprocessing.StandardScaler before calling fit on an estimator with normalize=False.
 copy_X (boolean, optional, default True) – If True, X will be copied; else, it may be overwritten.
 max_iter (int, optional) – Maximum number of iterations for conjugate gradient solver. For ‘sparse_cg’ and ‘lsqr’ solvers, the default value is determined by scipy.sparse.linalg. For ‘sag’ solver, the default value is 1000.
 tol (float) – Precision of the solution.
 regularizer (float, default : 1.0) – Regularization strength. It must be a positive float. Larger regularization values imply stronger regularization.
 use_gpu (bool, default : True) – Flag indicating the hardware platform used for training. If True, training is performed on the GPU. If False, training is performed on the CPU. Unless set explicitly, the value of this parameter is subject to change based on the training data. Applicable only for the snapml solver.
 device_ids (array-like of int, default : []) – If use_gpu is True, it indicates the IDs of the GPUs used for training. For single-GPU training, set device_ids to the GPU ID to be used for training, e.g., [0]. For multi-GPU training, set device_ids to a list of GPU IDs to be used for training, e.g., [0, 1]. Applicable only for the snapml solver.
 num_threads (int, default : 1) – The number of threads used for running the training. The value of this parameter should be a multiple of 32 if the training is performed on GPU (use_gpu=True) (default value for GPU is 256). Applicable only for the snapml solver.
 return_training_history (str or None, default : None) – How much information about the training should be collected and returned by the fit function. By default no information is returned (None), but this parameter can be set to “summary” to obtain summary statistics at the end of training, or to “full” to obtain a complete set of statistics for the entire training procedure. Note that enabling either option will result in slower training. Applicable only for the snapml solver.
 solver ({'auto', 'svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga', 'snapml'}) –
Solver to use in the computational routines:
 ‘auto’ chooses the solver automatically based on the type of data.
 ‘svd’ uses a Singular Value Decomposition of X to compute the Ridge coefficients. More stable for singular matrices than ‘cholesky’.
 ‘cholesky’ uses the standard scipy.linalg.solve function to obtain a closed-form solution.
 ‘sparse_cg’ uses the conjugate gradient solver as found in scipy.sparse.linalg.cg. As an iterative algorithm, this solver is more appropriate than ‘cholesky’ for large-scale data (possibility to set tol and max_iter).
 ‘lsqr’ uses the dedicated regularized least-squares routine scipy.sparse.linalg.lsqr. It is the fastest and uses an iterative procedure.
 ‘sag’ uses a Stochastic Average Gradient descent, and ‘saga’ uses its improved, unbiased version named SAGA. Both methods also use an iterative procedure, and are often faster than other solvers when both n_samples and n_features are large. Note that the fast convergence of ‘sag’ and ‘saga’ is only guaranteed on features with approximately the same scale. You can preprocess the data with a scaler from pai4sk.preprocessing.
All last five solvers support both dense and sparse data. However, only ‘sag’ and ‘saga’ support sparse input when fit_intercept is True.
New in version 0.17: Stochastic Average Gradient descent solver.
New in version 0.19: SAGA solver.
 random_state (int, RandomState instance or None, optional, default None) –
The seed of the pseudo random number generator to use when shuffling the data. If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random. Used when solver == ‘sag’.
New in version 0.17: random_state to support Stochastic Average Gradient.
 privacy (bool, default : False) – Train the model using a differentially private algorithm.
 eta (float, default : 0.3) – Learning rate for the differentially private training algorithm.
 batch_size (int, default : 100) – Minibatch size for the differentially private training algorithm.
 privacy_epsilon (float, default : 10.0) – Target privacy guarantee. The learned model will be (privacy_epsilon, 0.01)-private.
 grad_clip (float, default: 1.0) – Gradient clipping parameter for the differentially private training algorithm.
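The documentation does not spell out the differentially private training algorithm. Purely as an illustration of what a gradient clipping parameter such as grad_clip usually means, the sketch below rescales a gradient so that its l2-norm does not exceed a threshold. The function name clip_gradient is hypothetical, and pai4sk may implement clipping differently.

```python
import math


def clip_gradient(grad, grad_clip=1.0):
    """Rescale grad so that its l2-norm is at most grad_clip."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= grad_clip:
        return list(grad)  # already within the bound
    scale = grad_clip / norm
    return [g * scale for g in grad]


clipped = clip_gradient([3.0, 4.0], grad_clip=1.0)    # norm 5 is scaled down to 1
unchanged = clip_gradient([0.3, 0.4], grad_clip=1.0)  # norm 0.5 is left alone
```

Bounding each gradient this way limits how much any single sample can influence an update, which is why clipping commonly appears alongside a privacy budget such as privacy_epsilon.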
Variables:  coef_ (array, shape (n_features,) or (n_targets, n_features)) – Weight vector(s).
 intercept_ (float | array, shape = (n_targets,)) – Independent term in decision function. Set to 0.0 if fit_intercept = False.
 n_iter_ (array or None, shape (n_targets,)) – Actual number of iterations for each target. Available only for sag and lsqr solvers. Other solvers will return None.
 training_history_ (dict) – A dictionary with the following keys: ‘epochs’, ‘t_elap_sec’, ‘train_obj’. If ‘return_training_history’ is set to “summary”, ‘epochs’ contains the total number of epochs performed and ‘t_elap_sec’ contains the total time for completing all of those epochs. If ‘return_training_history’ is set to “full”, ‘epochs’ indicates the number of epochs that have elapsed so far, and ‘t_elap_sec’ contains the time taken for those epochs. ‘train_obj’ is the training loss. Applicable only for the snapml solver.
New in version 0.17.
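To make the two history modes concrete, here is a small sketch of how such a dictionary might be consumed. The dictionaries are hand-written stand-ins with invented numbers, and the assumption that the “full” mode stores one list entry per recorded point is mine; the text above does not specify the exact value types.

```python
# Hypothetical stand-ins for the dictionary described above (values invented).
summary_history = {"epochs": 12, "t_elap_sec": 0.84, "train_obj": 0.031}
full_history = {
    "epochs": [1, 2, 3],
    "t_elap_sec": [0.07, 0.14, 0.21],
    "train_obj": [0.52, 0.11, 0.04],
}


def seconds_per_epoch(history):
    """Average wall-clock time per epoch, for either assumed layout."""
    epochs, elapsed = history["epochs"], history["t_elap_sec"]
    if isinstance(epochs, list):  # assumed 'full' layout: use the last entry
        epochs, elapsed = epochs[-1], elapsed[-1]
    return elapsed / epochs


print(seconds_per_epoch(summary_history))
print(seconds_per_epoch(full_history))
```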
See also
RidgeClassifier
 Ridge classifier
RidgeCV
 Ridge regression with built-in cross validation
pai4sk.kernel_ridge.KernelRidge
 Kernel ridge regression combines ridge regression with the kernel trick
Examples
>>> from pai4sk.linear_model import Ridge
>>> import numpy as np
>>> n_samples, n_features = 10, 5
>>> np.random.seed(0)
>>> y = np.random.randn(n_samples)
>>> X = np.random.randn(n_samples, n_features)
>>> clf = Ridge(alpha=1.0)
>>> clf.fit(X, y) # doctest: +NORMALIZE_WHITESPACE
Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
      normalize=False, random_state=None, solver='auto', tol=0.001)

fit(X, y, sample_weight=None)
Fit Ridge regression model.
Parameters:  X ({array-like, sparse matrix}, shape = [n_samples, n_features]) – Training data. For the SnapML solver, input of types SnapML data partition and DeviceNDArray is also supported.
 y (array-like, shape = [n_samples] or [n_samples, n_targets]) – Target values
 sample_weight (float or numpy array of shape [n_samples]) – Individual weights for each sample
Returns: self
Return type: returns an instance of self.

predict(X, num_threads=0)
Class predictions. The returned class estimates.
Parameters:  X (sparse matrix (csr_matrix) or dense matrix (ndarray)) – Dataset used for predicting class estimates. For the SnapML solver, input of type SnapML data partition is also supported.
 num_threads (int, default : 0) – Number of threads used to run inference. By default inference runs with the maximum number of available threads.
Returns: proba – Returns the predicted class of the sample.
Return type: array-like, shape = (n_samples,)

class pai4sk.linear_model.Lasso(alpha=1.0, fit_intercept=True, normalize=False, precompute=False, copy_X=True, max_iter=1000, tol=0.0001, warm_start=False, positive=False, random_state=None, selection='cyclic', verbose=0, use_gpu=True, device_ids=[], return_training_history=None, privacy=False, eta=0.3, batch_size=100, privacy_epsilon=10, grad_clip=1, num_threads=1)
Linear Model trained with L1 prior as regularizer (aka the Lasso).
The optimization objective for Lasso is:
(1 / (2 * n_samples)) * ||y - Xw||^2_2 + alpha * ||w||_1
Technically the Lasso model is optimizing the same objective function as the Elastic Net with l1_ratio=1.0 (no L2 penalty).
Read more in the User Guide.
With the SnapML solver, both local and distributed (MPI) execution are supported.
Parameters:  alpha (float, optional) – Constant that multiplies the L1 term. Defaults to 1.0. alpha = 0 is equivalent to an ordinary least square, solved by the LinearRegression object. For numerical reasons, using alpha = 0 with the Lasso object is not advised. Given this, you should use the LinearRegression object.
 fit_intercept (boolean, optional, default True) – Whether to calculate the intercept for this model. If set to False, no intercept will be used in calculations (e.g. data is expected to be already centered).
 normalize (boolean, optional, default False) – This parameter is ignored when fit_intercept is set to False. If True, the regressors X will be normalized before regression by subtracting the mean and dividing by the l2-norm. If you wish to standardize, please use pai4sk.preprocessing.StandardScaler before calling fit on an estimator with normalize=False.
 precompute (True | False | array-like, default=False) – Whether to use a precomputed Gram matrix to speed up calculations. If set to 'auto' let us decide. The Gram matrix can also be passed as argument. For sparse input this option is always True to preserve sparsity.
 copy_X (boolean, optional, default True) – If True, X will be copied; else, it may be overwritten.
 max_iter (int, optional) – The maximum number of iterations
 tol (float, optional) – The tolerance for the optimization: if the updates are smaller than tol, the optimization code checks the dual gap for optimality and continues until it is smaller than tol.
 warm_start (bool, optional) – When set to True, reuse the solution of the previous call to fit as initialization, otherwise, just erase the previous solution. See the Glossary.
 positive (bool, optional) – When set to True, forces the coefficients to be positive.
 random_state (int, RandomState instance or None, optional, default None) – The seed of the pseudo random number generator that selects a random feature to update. If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random. Used when selection == ‘random’.
 selection (str, default 'cyclic') – If set to ‘random’, a random coefficient is updated every iteration rather than looping over features sequentially by default. This (setting to ‘random’) often leads to significantly faster convergence especially when tol is higher than 1e-4.
 verbose (bool, default : False) – If True, it prints the training cost, one per iteration. Warning: this will increase the training time. For performance evaluation, use verbose=False.
 use_gpu (bool, default : True) – Flag indicating the hardware platform used for training. If True, training is performed on the GPU. If False, training is performed on the CPU. Unless set explicitly, the value of this parameter is subject to change based on the training data. Applicable only for the snapml solver.
 device_ids (array-like of int, default : []) – If use_gpu is True, it indicates the IDs of the GPUs used for training. For single-GPU training, set device_ids to the GPU ID to be used for training, e.g., [0]. For multi-GPU training, set device_ids to a list of GPU IDs to be used for training, e.g., [0, 1]. Applicable only for the snapml solver.
 num_threads (int, default : 1) – The number of threads used for running the training. The value of this parameter should be a multiple of 32 if the training is performed on GPU (use_gpu=True) (default value for GPU is 256). Applicable only for the snapml solver.
 return_training_history (str or None, default : None) – How much information about the training should be collected and returned by the fit function. By default no information is returned (None), but this parameter can be set to “summary” to obtain summary statistics at the end of training, or to “full” to obtain a complete set of statistics for the entire training procedure. Note that enabling either option will result in slower training. Applicable only for the snapml solver.
 privacy (bool, default : False) – Train the model using a differentially private algorithm.
 eta (float, default : 0.3) – Learning rate for the differentially private training algorithm.
 batch_size (int, default : 100) – Minibatch size for the differentially private training algorithm.
 privacy_epsilon (float, default : 10.0) – Target privacy guarantee. The learned model will be (privacy_epsilon, 0.01)-private.
 grad_clip (float, default: 1.0) – Gradient clipping parameter for the differentially private training algorithm.
Variables:  coef_ (array, shape (n_features,) | (n_targets, n_features)) – parameter vector (w in the cost function formula)
 sparse_coef_ (scipy.sparse matrix, shape (n_features, 1) | (n_targets, n_features)) – sparse_coef_ is a readonly property derived from coef_
 intercept_ (float | array, shape (n_targets,)) – independent term in decision function.
 n_iter_ (int | array-like, shape (n_targets,)) – number of iterations run by the coordinate descent solver to reach the specified tolerance.
 training_history_ (dict) – A dictionary with the following keys: ‘epochs’, ‘t_elap_sec’, ‘train_obj’. If ‘return_training_history’ is set to “summary”, ‘epochs’ contains the total number of epochs performed and ‘t_elap_sec’ contains the total time for completing all of those epochs. If ‘return_training_history’ is set to “full”, ‘epochs’ indicates the number of epochs that have elapsed so far, and ‘t_elap_sec’ contains the time taken for those epochs. ‘train_obj’ is the training loss. Applicable only for the snapml solver.
 support_ (array-like) – Indices of the features that lie in the support and contribute to the decision.
 model_sparsity_ (float) – Fraction of non-zeros in the model parameters.
Examples
>>> from pai4sk import linear_model
>>> clf = linear_model.Lasso(alpha=0.1)
>>> clf.fit([[0,0], [1, 1], [2, 2]], [0, 1, 2])
Lasso(alpha=0.1, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)
>>> print(clf.coef_)
[0.85 0.  ]
>>> print(clf.intercept_) # doctest: +ELLIPSIS
0.15...
See also
lars_path, lasso_path, LassoLars, LassoCV, LassoLarsCV, pai4sk.decomposition.sparse_encode
Notes
The algorithm used to fit the model is coordinate descent.
To avoid unnecessary memory duplication the X argument of the fit method should be directly passed as a Fortran-contiguous numpy array.
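The coordinate descent idea can be sketched in a few lines of plain Python. This is a textbook cyclic variant with soft-thresholding, written here for illustration only; it is not pai4sk's implementation, and the names soft_threshold and lasso_cd are made up. On the toy data from the Examples section it arrives at approximately the same coefficients (0.85 and 0) and intercept (0.15).

```python
def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold (proximal operator of the l1 norm)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_cd(X, y, alpha, n_sweeps=100):
    """Cyclic coordinate descent for (1/(2n)) * ||y - Xw||^2_2 + alpha * ||w||_1."""
    n, p = len(X), len(X[0])
    # Centre the data so that the intercept can be recovered afterwards.
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    yc = [yi - y_mean for yi in y]

    w = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            # Correlation of feature j with the residual that ignores feature j.
            rho = sum(
                Xc[i][j] * (yc[i] - sum(Xc[i][k] * w[k] for k in range(p) if k != j))
                for i in range(n)
            ) / n
            z = sum(Xc[i][j] ** 2 for i in range(n)) / n
            w[j] = soft_threshold(rho, alpha) / z if z > 0 else 0.0
    intercept = y_mean - sum(x_mean[j] * w[j] for j in range(p))
    return w, intercept


w, intercept = lasso_cd([[0, 0], [1, 1], [2, 2]], [0, 1, 2], alpha=0.1)
print(w, intercept)
```

Because each update touches a single column of X, column-major (Fortran-contiguous) storage keeps those accesses contiguous in memory.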

fit(X, y, check_input=True)
Fit model with coordinate descent.
Parameters:  X (ndarray or scipy.sparse matrix, (n_samples, n_features)) – Data. For the SnapML solver, input of types SnapML data partition and DeviceNDArray is also supported.
 y (ndarray, shape (n_samples,) or (n_samples, n_targets)) – Target. Will be cast to X’s dtype if necessary
 check_input (boolean, (default=True)) – Allows bypassing several input checks. Don’t use this parameter unless you know what you are doing.
Notes
Coordinate descent is an algorithm that considers one column of data at a time; hence it will automatically convert the X input to a Fortran-contiguous numpy array if necessary.
To avoid memory reallocation it is advised to allocate the initial data in memory directly using that format.

predict(X, num_threads=0)
Class predictions. The returned class estimates.
Parameters:  X (sparse matrix (csr_matrix) or dense matrix (ndarray)) – Dataset used for predicting class estimates. For the SnapML solver, input of type SnapML data partition is also supported.
 num_threads (int, default : 0) – Number of threads used to run inference. By default inference runs with the maximum number of available threads.
Returns: proba – Returns the predicted class of the sample.
Return type: array-like, shape = (n_samples,)

class pai4sk.linear_model.LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='warn', max_iter=100, multi_class='warn', verbose=0, warm_start=False, n_jobs=None, use_gpu=True, device_ids=[], return_training_history=None, privacy=False, eta=0.3, batch_size=100, privacy_epsilon=10, grad_clip=1, num_threads=1)
Logistic Regression (aka logit, MaxEnt) classifier.
In the multiclass case, the training algorithm uses the one-vs-rest (OvR) scheme if the ‘multi_class’ option is set to ‘ovr’, and uses the cross-entropy loss if the ‘multi_class’ option is set to ‘multinomial’. (Currently the ‘multinomial’ option is supported only by the ‘lbfgs’, ‘sag’ and ‘newton-cg’ solvers.)
This class implements regularized logistic regression using the ‘liblinear’ library, ‘newton-cg’, ‘sag’ and ‘lbfgs’ solvers. It can handle both dense and sparse input. Use C-ordered arrays or CSR matrices containing 64-bit floats for optimal performance; any other input format will be converted (and copied).
The ‘newton-cg’, ‘sag’, and ‘lbfgs’ solvers support only L2 regularization with primal formulation. The ‘liblinear’ solver supports both L1 and L2 regularization, with a dual formulation only for the L2 penalty.
Read more in the User Guide.
With the SnapML solver, both local and distributed (MPI) execution are supported.
Parameters:  penalty (str, 'l1' or 'l2', default: 'l2') –
Used to specify the norm used in the penalization. The ‘newton-cg’, ‘sag’ and ‘lbfgs’ solvers support only l2 penalties.
New in version 0.19: l1 penalty with SAGA solver (allowing ‘multinomial’ + L1)
 dual (bool, default: False) – Dual or primal formulation. Dual formulation is only implemented for l2 penalty with liblinear solver. Prefer dual=False when n_samples > n_features.
 tol (float, default: 1e-4) – Tolerance for stopping criteria.
 C (float, default: 1.0) – Inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization.
 fit_intercept (bool, default: True) – Specifies if a constant (a.k.a. bias or intercept) should be added to the decision function.
 intercept_scaling (float, default 1.) –
Useful only when the solver ‘liblinear’ is used and self.fit_intercept is set to True. In this case, x becomes [x, self.intercept_scaling], i.e. a “synthetic” feature with constant value equal to intercept_scaling is appended to the instance vector. The intercept becomes intercept_scaling * synthetic_feature_weight.
Note! the synthetic feature weight is subject to l1/l2 regularization as all other features. To lessen the effect of regularization on synthetic feature weight (and therefore on the intercept) intercept_scaling has to be increased.
 class_weight (dict or 'balanced', default: None) –
Weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to have weight one.
The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)).
Note that these weights will be multiplied with sample_weight (passed through the fit method) if sample_weight is specified.
New in version 0.17: class_weight=’balanced’
 random_state (int, RandomState instance or None, optional, default: None) – The seed of the pseudo random number generator to use when shuffling the data. If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random. Used when solver == ‘sag’ or ‘liblinear’.
 solver (str, {'newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga', 'snapml'}, default: 'snapml' if the 'snap_ml' library is in PYTHONPATH, else 'liblinear') –
Algorithm to use in the optimization problem.
 For small datasets, ‘liblinear’ is a good choice, whereas ‘sag’ and ‘saga’ are faster for large ones.
 For multiclass problems, only ‘newton-cg’, ‘sag’, ‘saga’ and ‘lbfgs’ handle multinomial loss; ‘liblinear’ is limited to one-versus-rest schemes.
 ‘newton-cg’, ‘lbfgs’ and ‘sag’ only handle L2 penalty, whereas ‘liblinear’ and ‘saga’ handle L1 penalty.
Note that the fast convergence of ‘sag’ and ‘saga’ is only guaranteed on features with approximately the same scale. You can preprocess the data with a scaler from pai4sk.preprocessing.
New in version 0.17: Stochastic Average Gradient descent solver.
New in version 0.19: SAGA solver.
Changed in version 0.20: Default will change from ‘liblinear’ to ‘lbfgs’ in 0.22.
 max_iter (int, default: 100) – Useful only for the newton-cg, sag and lbfgs solvers. Maximum number of iterations taken for the solvers to converge.
 multi_class (str, {'ovr', 'multinomial', 'auto'}, default: 'ovr') –
If the option chosen is ‘ovr’, then a binary problem is fit for each label. For ‘multinomial’ the loss minimised is the multinomial loss fit across the entire probability distribution, even when the data is binary. ‘multinomial’ is unavailable when solver=’liblinear’. ‘auto’ selects ‘ovr’ if the data is binary, or if solver=’liblinear’, and otherwise selects ‘multinomial’.
New in version 0.18: Stochastic Average Gradient descent solver for ‘multinomial’ case.
Changed in version 0.20: Default will change from ‘ovr’ to ‘auto’ in 0.22.
 verbose (int, default: 0) – For the liblinear and lbfgs solvers set verbose to any positive number for verbosity.
 warm_start (bool, default: False) –
When set to True, reuse the solution of the previous call to fit as initialization, otherwise, just erase the previous solution. Useless for liblinear solver. See the Glossary.
New in version 0.17: warm_start to support lbfgs, newton-cg, sag, saga solvers.
 n_jobs (int or None, optional (default=None)) – Number of CPU cores used when parallelizing over classes if multi_class=’ovr’. This parameter is ignored when the solver is set to ‘liblinear’ regardless of whether ‘multi_class’ is specified or not. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.
 use_gpu (bool, default : True) – Flag indicating the hardware platform used for training. If True, training is performed on the GPU. If False, training is performed on the CPU. Unless set explicitly, the value of this parameter is subject to change based on the training data. Applicable only for the snapml solver.
 device_ids (array-like of int, default : []) – If use_gpu is True, it indicates the IDs of the GPUs used for training. For single-GPU training, set device_ids to the GPU ID to be used for training, e.g., [0]. For multi-GPU training, set device_ids to a list of GPU IDs to be used for training, e.g., [0, 1]. Applicable only for the snapml solver.
 num_threads (int, default : 1) – The number of threads used for running the training. The value of this parameter should be a multiple of 32 if the training is performed on GPU (use_gpu=True) (default value for GPU is 256). Applicable only for the snapml solver.
 return_training_history (str or None, default : None) – How much information about the training should be collected and returned by the fit function. By default no information is returned (None), but this parameter can be set to “summary” to obtain summary statistics at the end of training, or to “full” to obtain a complete set of statistics for the entire training procedure. Note that enabling either option will result in slower training. Applicable only for the snapml solver.
 privacy (bool, default : False) – Train the model using a differentially private algorithm.
 eta (float, default : 0.3) – Learning rate for the differentially private training algorithm.
 batch_size (int, default : 100) – Minibatch size for the differentially private training algorithm.
 privacy_epsilon (float, default : 10.0) – Target privacy guarantee. The learned model will be (privacy_epsilon, 0.01)-private.
 grad_clip (float, default: 1.0) – Gradient clipping parameter for the differentially private training algorithm.
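The “balanced” heuristic for class_weight is simple enough to reproduce by hand. The sketch below applies the formula n_samples / (n_classes * np.bincount(y)) using only the standard library; balanced_class_weights is a hypothetical helper name, not a pai4sk function.

```python
from collections import Counter


def balanced_class_weights(y):
    """n_samples / (n_classes * count(label)) for every label in y."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


weights = balanced_class_weights([0, 0, 0, 1])
# The rare class 1 receives a larger weight than the frequent class 0.
print(weights)
```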
Variables:  coef_ (array, shape (1, n_features) or (n_classes, n_features)) –
Coefficient of the features in the decision function.
coef_ is of shape (1, n_features) when the given problem is binary. In particular, when multi_class=’multinomial’, coef_ corresponds to outcome 1 (True) and -coef_ corresponds to outcome 0 (False).
 intercept_ (array, shape (1,) or (n_classes,)) –
Intercept (a.k.a. bias) added to the decision function.
If fit_intercept is set to False, the intercept is set to zero. intercept_ is of shape (1,) when the given problem is binary. In particular, when multi_class=’multinomial’, intercept_ corresponds to outcome 1 (True) and -intercept_ corresponds to outcome 0 (False).
 n_iter_ (array, shape (n_classes,) or (1, )) – Actual number of iterations for all classes. If binary or multinomial, it returns only 1 element. For liblinear solver, only the maximum number of iterations across all classes is given.
 training_history_ (dict) – A dictionary with the following keys: ‘epochs’, ‘t_elap_sec’, ‘train_obj’. If ‘return_training_history’ is set to “summary”, ‘epochs’ contains the total number of epochs performed and ‘t_elap_sec’ contains the total time for completing all of those epochs. If ‘return_training_history’ is set to “full”, ‘epochs’ indicates the number of epochs that have elapsed so far, and ‘t_elap_sec’ contains the time taken for those epochs. ‘train_obj’ is the training loss. Applicable only for the snapml solver.
 support_ (array-like) – Indices of the features that contribute to the decision. (only available for L1)
 model_sparsity_ (float) –
Fraction of non-zeros in the model parameters. (only available for L1)
Changed in version 0.20: In SciPy <= 1.0.0 the number of lbfgs iterations may exceed max_iter. n_iter_ will now report at most max_iter.
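One way to read the statement about binary problems under multi_class=’multinomial’ is that outcome 1 is scored with the stored parameters and outcome 0 with their negation, and the two scores are then passed through a softmax. The sketch below follows that reading; it mirrors how scikit-learn handles this case, but whether pai4sk's snapml solver does the same is an assumption, and binary_multinomial_probabilities is a hypothetical name.

```python
import math


def binary_multinomial_probabilities(x, coef, intercept):
    """Return [P(outcome 0), P(outcome 1)] for one sample of a binary model."""
    score = sum(c * xi for c, xi in zip(coef, x)) + intercept
    scores = [-score, score]  # outcome 0 uses the negated parameters
    m = max(scores)           # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


p0, p1 = binary_multinomial_probabilities(x=[1.0, -2.0], coef=[0.5, 0.25], intercept=0.1)
print(p0, p1)
```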
Examples
>>> from pai4sk.datasets import load_iris
>>> from pai4sk.linear_model import LogisticRegression
>>> X, y = load_iris(return_X_y=True)
>>> clf = LogisticRegression(random_state=0, solver='lbfgs',
...                          multi_class='multinomial').fit(X, y)
>>> clf.predict(X[:2, :])
array([0, 0])
>>> clf.predict_proba(X[:2, :]) # doctest: +ELLIPSIS
array([[9.8...e-01, 1.8...e-02, 1.4...e-08],
       [9.7...e-01, 2.8...e-02, ...e-08]])
>>> clf.score(X, y)
0.97...
See also
SGDClassifier
 incrementally trained logistic regression (when given the parameter loss="log").
LogisticRegressionCV
 Logistic regression with built-in cross validation
Notes
The underlying C implementation uses a random number generator to select features when fitting the model. It is thus not uncommon to have slightly different results for the same input data. If that happens, try with a smaller tol parameter.
Predict output may not match that of standalone liblinear in certain cases. See differences from liblinear in the narrative documentation.
References
 LIBLINEAR – A Library for Large Linear Classification
 http://www.csie.ntu.edu.tw/~cjlin/liblinear/
 SAG – Mark Schmidt, Nicolas Le Roux, and Francis Bach
 Minimizing Finite Sums with the Stochastic Average Gradient https://hal.inria.fr/hal-00860051/document
 SAGA – Defazio, A., Bach F. & Lacoste-Julien S. (2014).
 SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives https://arxiv.org/abs/1407.0202
 Hsiang-Fu Yu, Fang-Lan Huang, Chih-Jen Lin (2011). Dual coordinate descent methods for logistic regression and maximum entropy models. Machine Learning 85(1-2):41-75. http://www.csie.ntu.edu.tw/~cjlin/papers/maxent_dual.pdf

fit(X, y, sample_weight=None)
Fit the model according to the given training data.
Parameters:  X – Training vector, where n_samples is the number of samples and n_features is the number of features. For the SnapML solver, input of types SnapML data partition and DeviceNDArray is also supported.
 y (array-like, shape (n_samples,) or (n_samples, n_targets)) – Target vector relative to X.
 sample_weight (array-like, shape (n_samples,) optional) –
Array of weights that are assigned to individual samples. If not provided, then each sample is given unit weight.
New in version 0.17: sample_weight support to LogisticRegression.
Returns: self

predict(X, num_threads=0)
Class predictions. The returned class estimates.
Parameters:  X – Dataset used for predicting class estimates. For the SnapML solver, input of type SnapML data partition is also supported.
 num_threads (int, default : 0) – Number of threads used to run inference. By default inference runs with the maximum number of available threads.
Returns: proba – Returns the predicted class of the sample.
Return type: array-like, shape = (n_samples,)

predict_log_proba(X)
Log of probability estimates. The returned estimates for all classes are ordered by the label of classes.
Parameters:  X (array-like, shape = [n_samples, n_features]) – For the SnapML solver, input of type SnapML data partition is also supported.
Returns: T – Returns the log-probability of the sample for each class in the model, where classes are ordered as they are in self.classes_.
Return type: array-like, shape = [n_samples, n_classes]

predict_proba(X, num_threads=0)
Probability estimates. The returned estimates for all classes are ordered by the label of classes.
For a multi_class problem, if multi_class is set to “multinomial”, the softmax function is used to find the predicted probability of each class. Otherwise a one-vs-rest approach is used, i.e. the probability of each class is calculated assuming it to be positive using the logistic function, and these values are normalized across all the classes.
Parameters:  X (array-like, shape = [n_samples, n_features]) – For the SnapML solver, input of type SnapML data partition is also supported.
 num_threads – Number of threads used to run inference. By default inference runs with the maximum number of available threads.
Returns: T – Returns the probability of the sample for each class in the model, where classes are ordered as they are in self.classes_.
Return type: array-like, shape = [n_samples, n_classes]
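The difference between the two probability rules can be seen on a single vector of decision scores. The sketch below is a plain-Python illustration of the two formulas described for predict_proba; the scores are invented and the function names are hypothetical, not part of pai4sk.

```python
import math


def softmax(scores):
    """Multinomial case: exponentiate and normalize the decision scores."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def ovr_probabilities(scores):
    """One-vs-rest case: logistic function per class, then normalize."""
    per_class = [1.0 / (1.0 + math.exp(-s)) for s in scores]
    total = sum(per_class)
    return [p / total for p in per_class]


scores = [2.0, 0.5, -1.0]  # hypothetical decision scores for 3 classes
p_multinomial = softmax(scores)
p_ovr = ovr_probabilities(scores)
print(p_multinomial)
print(p_ovr)
```

Both rules return a distribution that sums to one and rank the classes in the same order here, but the individual probabilities differ.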

class pai4sk.svm.LinearSVC(penalty='l2', loss='squared_hinge', dual=True, tol=0.0001, C=1.0, multi_class='ovr', fit_intercept=True, intercept_scaling=1, class_weight=None, verbose=0, random_state=None, max_iter=1000, use_gpu=True, device_ids=[], num_threads=1, return_training_history=None)
Linear Support Vector Classification.
Similar to SVC with parameter kernel=’linear’, but implemented in terms of liblinear rather than libsvm, so it has more flexibility in the choice of penalties and loss functions and should scale better to large numbers of samples.
This class supports both dense and sparse input and the multiclass support is handled according to a one-vs-the-rest scheme.
Read more in the User Guide.
For SnapML solver this supports both local and distributed(MPI) method of execution.
Parameters:  penalty (string, 'l1' or 'l2' (default='l2')) – Specifies the norm used in the penalization. The ‘l2’ penalty is the standard used in SVC. The ‘l1’ leads to coef_ vectors that are sparse.
 loss (string, 'hinge' or 'squared_hinge' (default='squared_hinge')) – Specifies the loss function. ‘hinge’ is the standard SVM loss (used e.g. by the SVC class) while ‘squared_hinge’ is the square of the hinge loss.
 dual (bool, (default=True)) – Select the algorithm to either solve the dual or primal optimization problem. Prefer dual=False when n_samples > n_features.
 tol (float, optional (default=1e-4)) – Tolerance for stopping criteria.
 C (float, optional (default=1.0)) – Penalty parameter C of the error term.
 multi_class (string, 'ovr' or 'crammer_singer' (default='ovr')) – Determines the multi-class strategy if y contains more than two classes. "ovr" trains n_classes one-vs-rest classifiers, while "crammer_singer" optimizes a joint objective over all classes. While crammer_singer is interesting from a theoretical perspective as it is consistent, it is seldom used in practice as it rarely leads to better accuracy and is more expensive to compute. If "crammer_singer" is chosen, the options loss, penalty and dual will be ignored.
 fit_intercept (boolean, optional (default=True)) – Whether to calculate the intercept for this model. If set to false, no intercept will be used in calculations (i.e. data is expected to be already centered).
 intercept_scaling (float, optional (default=1)) – When self.fit_intercept is True, instance vector x becomes [x, self.intercept_scaling], i.e. a “synthetic” feature with constant value equal to intercept_scaling is appended to the instance vector. The intercept becomes intercept_scaling * synthetic feature weight. Note that the synthetic feature weight is subject to l1/l2 regularization like all other features. To lessen the effect of regularization on the synthetic feature weight (and therefore on the intercept), intercept_scaling has to be increased.
 class_weight ({dict, 'balanced'}, optional) – Set the parameter C of class i to class_weight[i]*C for SVC. If not given, all classes are supposed to have weight one. The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y))
 verbose (int, (default=0)) – Enable verbose output. Note that this setting takes advantage of a per-process runtime setting in liblinear that, if enabled, may not work properly in a multithreaded context.
 random_state (int, RandomState instance or None, optional (default=None)) – The seed of the pseudo random number generator to use when shuffling the data for the dual coordinate descent (if dual=True). When dual=False the underlying implementation of LinearSVC is not random and random_state has no effect on the results. If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.
 max_iter (int, (default=1000)) – The maximum number of iterations to be run.
 use_gpu (bool, default : True) – Flag for indicating the hardware platform used for training. If True, the training is performed using the GPU. If False, the training is performed using the CPU. The value of this parameter is subject to change based on the training data unless set explicitly. Applicable only for snapml solver
 device_ids (array-like of int, default : []) – If use_gpu is True, it indicates the IDs of the GPUs used for training. For single GPU training, set device_ids to the GPU ID to be used for training, e.g., [0]. For multi-GPU training, set device_ids to a list of GPU IDs to be used for training, e.g., [0, 1]. Applicable only for snapml solver
 num_threads (int, default : 1) – The number of threads used for running the training. The value of this parameter should be a multiple of 32 if the training is performed on GPU (use_gpu=True) (default value for GPU is 256). Applicable only for snapml solver
 return_training_history (str or None, default : None) – How much information about the training should be collected and returned by the fit function. By default no information is returned (None), but this parameter can be set to “summary”, to obtain summary statistics at the end of training, or “full” to obtain a complete set of statistics for the entire training procedure. Note, enabling either option will result in slower training. Applicable only for snapml solver
Variables:  coef (array, shape = [n_features] if n_classes == 2 else [n_classes, n_features]) – Weights assigned to the features (coefficients in the primal problem). This is only available in the case of a linear kernel. coef_ is a read-only property derived from raw_coef_ that follows the internal memory layout of liblinear.
 intercept (array, shape = [1] if n_classes == 2 else [n_classes]) – Constants in decision function.
 training_history (dict) – It returns a dictionary with the following keys : ‘epochs’, ‘t_elap_sec’, ‘train_obj’. If ‘return_training_history’ is set to “summary”, ‘epochs’ contains the total number of epochs performed, ‘t_elap_sec’ contains the total time for completing all of those epochs. If ‘return_training_history’ is set to “full”, ‘epochs’ indicates the number of epochs that have elapsed so far, and ‘t_elap_sec’ contains the time to do those epochs. ‘train_obj’ is the training loss. Applicable only for snapml solver.
 support (array-like, shape (n_SV)) – Indices of the support vectors.
 n_support (int) – Number of support vectors.
 n_iter (array, shape (n_classes,) or (1, )) – Actual number of iterations for all classes to reach the specified tolerance. If binary or multinomial, it returns only 1 element.
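The “balanced” class_weight formula quoted above, n_samples / (n_classes * np.bincount(y)), can be worked through by hand. The sketch below uses only the standard library; the helper name and the label vector are hypothetical, and it illustrates the formula rather than the library's internal code.

```python
from collections import Counter

def balanced_class_weights(y):
    """Weight of each class: n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n_samples = len(y)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}

y = [0, 0, 0, 0, 0, 0, 1, 1]  # hypothetical imbalanced labels (6 vs 2)
weights = balanced_class_weights(y)

# Per the class_weight description, class i is then trained with class_weight[i] * C.
C = 1.0
effective_C = {label: w * C for label, w in weights.items()}
```

Here the majority class gets weight 8 / (2 * 6) and the minority class 8 / (2 * 2), so errors on the rarer class are penalized three times as heavily.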
Examples
>>> from pai4sk.svm import LinearSVC
>>> from pai4sk.datasets import make_classification
>>> X, y = make_classification(n_features=4, random_state=0)
>>> clf = LinearSVC(random_state=0, tol=1e-5)
>>> clf.fit(X, y)
LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=0, tol=1e-05, verbose=0)
>>> print(clf.coef_)
[[0.085... 0.394... 0.498... 0.375...]]
>>> print(clf.intercept_)
[0.284...]
>>> print(clf.predict([[0, 0, 0, 0]]))
[1]
Notes
The underlying C implementation uses a random number generator to select features when fitting the model. It is thus not uncommon to have slightly different results for the same input data. If that happens, try with a smaller tol parameter.
The underlying implementation, liblinear, uses a sparse internal representation for the data that will incur a memory copy.
Predict output may not match that of standalone liblinear in certain cases. See differences from liblinear in the narrative documentation.
References
LIBLINEAR: A Library for Large Linear Classification
See also
SVC
 Implementation of Support Vector Machine classifier using libsvm: the kernel can be non-linear but its SMO algorithm does not scale to large number of samples as LinearSVC does. Furthermore SVC multi-class mode is implemented using one vs one scheme while LinearSVC uses one vs the rest. It is possible to implement one vs the rest with SVC by using the pai4sk.multiclass.OneVsRestClassifier wrapper. Finally SVC can fit dense data without memory copy if the input is C-contiguous. Sparse data will still incur memory copy though.
pai4sk.linear_model.SGDClassifier
 SGDClassifier can optimize the same cost function as LinearSVC by adjusting the penalty and loss parameters. In addition it requires less memory, allows incremental (online) learning, and implements various loss functions and regularization regimes.

decision_function
(X, num_threads=0)¶ Predicts confidence scores.
The confidence score of a sample is the signed distance of that sample to the decision boundary.
Parameters:  X (sparse matrix (csr_matrix) or dense matrix (ndarray)) – Dataset used for predicting distances to the decision boundary. For SnapML solver it also supports input of type SnapML data partition.
 num_threads (int, default : 0) – Number of threads used to run inference. By default inference runs with maximum number of available threads.
Returns: proba – Returns the distance to the decision boundary of the samples in X.
Return type: array-like, shape = (n_samples,) or (n_samples, n_classes)
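For a linear model, the confidence score returned by decision_function is the linear function of the input defined by coef_ and intercept_. The NumPy sketch below assumes hypothetical fitted values for a binary problem; it shows the arithmetic only and is not the library's code.

```python
import numpy as np

# Hypothetical fitted parameters for a binary problem with two features.
coef = np.array([[1.0, -2.0]])   # shape (1, n_features)
intercept = np.array([0.5])      # shape (1,)

X = np.array([[3.0, 1.0], [0.0, 1.0]])

# Linear score per sample; raveled to shape (n_samples,) in the binary case.
scores = (X @ coef.T + intercept).ravel()

# A positive score selects the second class, otherwise the first.
classes = np.array([0, 1])
predicted = classes[(scores > 0).astype(int)]
```

The score is proportional to the signed distance to the decision boundary; dividing by the Euclidean norm of the coefficient vector would give the geometric distance.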

fit
(X, y, sample_weight=None)¶ Fit the model according to the given training data.
Parameters:  X – Training vector, where n_samples is the number of samples and n_features is the number of features. For SnapML solver it also supports input of types SnapML data partition and DeviceNDArray.
 y (array-like, shape = [n_samples]) – Target vector relative to X
 sample_weight (array-like, shape = [n_samples], optional) – Array of weights that are assigned to individual samples. If not provided, then each sample is given unit weight.
Returns: self
Return type:

predict
(X, num_threads=0)¶ Class predictions. The returned class estimates.
Parameters:  X (sparse matrix (csr_matrix) or dense matrix (ndarray)) – Dataset used for predicting class estimates. For SnapML solver it also supports input of type SnapML data partition.
 num_threads (int, default : 0) – Number of threads used to run inference. By default inference runs with maximum number of available threads.
Returns: proba – Returns the predicted class of the sample.
Return type: array-like, shape = (n_samples,)

class
pai4sk.cluster.
KMeans
(n_clusters=8, max_iter=300, tol=0.0001, verbose=0, random_state=1, precompute_distances='auto', init='k-means++', n_init=1, algorithm='auto', copy_x=True, n_jobs=None, use_gpu=True)¶ K-Means clustering.
If a cudf dataframe is passed as input, then pai4sk will try to use the accelerated KMeans algorithm from cuML. Otherwise, scikit-learn’s KMeans algorithm will be used.
cuML in pai4sk is currently supported only (a) with Python 3.6 and (b) without MPI.
If KMeans from cuML is run, then the return values from the APIs will be cudf dataframe and cudf Series objects instead of the return types of the scikit-learn API.
Parameters:  n_clusters (int, optional, default: 8) – The number of clusters to form as well as the number of centroids to generate.
 init ({'k-means++', 'random' or an ndarray}) –
Method for initialization, defaults to ‘k-means++’:
’k-means++’ : selects initial cluster centers for k-means clustering in a smart way to speed up convergence. See section Notes in k_init for more details.
’random’: choose k observations (rows) at random from data for the initial centroids.
If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.
 n_init (int, default: 10) – Number of times the k-means algorithm will be run with different centroid seeds. The final results will be the best output of n_init consecutive runs in terms of inertia.
 max_iter (int, default: 300) – Maximum number of iterations of the k-means algorithm for a single run.
 tol (float, default: 1e-4) – Relative tolerance with regards to inertia to declare convergence
 precompute_distances ({'auto', True, False}) –
Precompute distances (faster but takes more memory).
’auto’ : do not precompute distances if n_samples * n_clusters > 12 million. This corresponds to about 100MB overhead per job using double precision.
True : always precompute distances
False : never precompute distances
 verbose (int, default 0) – Verbosity mode.
 random_state (int, RandomState instance or None (default)) – Determines random number generation for centroid initialization. Use an int to make the randomness deterministic. See Glossary.
 copy_x (boolean, optional) – When pre-computing distances it is more numerically accurate to center the data first. If copy_x is True (default), then the original data is not modified, ensuring X is C-contiguous. If False, the original data is modified, and put back before the function returns, but small numerical differences may be introduced by subtracting and then adding the data mean. In this case it will also not ensure that data is C-contiguous, which may cause a significant slowdown.
 n_jobs (int or None, optional (default=None)) –
The number of jobs to use for the computation. This works by computing each of the n_init runs in parallel.
None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.
 algorithm ("auto", "full", "elkan" or "cuml", default="auto") –
K-means algorithm to use. The classical EM-style algorithm is “full”. The “elkan” variation is more efficient by using the triangle inequality, but currently doesn’t support sparse data. “auto” chooses “elkan” for dense data and “full” for sparse data.
If a cudf dataframe is passed as input and either (1) algorithm is set to “cuml” or (2) algorithm is “auto”, then pai4sk will try to use the k-means algorithm from RAPIDS cuML.
cuML in pai4sk is currently supported only (a) with Python 3.6 and (b) without MPI.
If KMeans from cuML is run, then the return values of the APIs will be cudf dataframe and cudf Series objects instead of the return types of the scikit-learn API.
Variables:  cluster_centers (array, [n_clusters, n_features] or cudf dataframe) – Coordinates of cluster centers. If the algorithm stops before fully converging (see tol and max_iter), these will not be consistent with labels_. If KMeans from cuML is run, then the return values of some of the APIs will be cudf dataframe and cudf Series objects instead of the return types of the scikit-learn API.
 labels (array or cudf Series) – Labels of each point
 inertia (float) – Sum of squared distances of samples to their closest cluster center.
 n_iter (int) – Number of iterations run.
 use_gpu (boolean, Default is True) – If True, cuML will use all GPUs. Applicable only for cuML.
Examples
>>> from pai4sk.cluster import KMeans
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [4, 2], [4, 4], [4, 0]])
>>> kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
>>> kmeans.labels_
array([0, 0, 0, 1, 1, 1], dtype=int32)
>>> kmeans.predict([[0, 0], [4, 4]])
array([0, 1], dtype=int32)
>>> kmeans.cluster_centers_
array([[1., 2.],
       [4., 2.]])
See also
MiniBatchKMeans
 Alternative online implementation that does incremental updates of the centers' positions using mini-batches. For large scale learning (say n_samples > 10k) MiniBatchKMeans is probably much faster than the default batch implementation.
Notes
The k-means problem is solved using either Lloyd’s or Elkan’s algorithm.
The average complexity is given by O(k n T), where n is the number of samples and T is the number of iterations.
The worst case complexity is given by O(n^(k+2/p)) with n = n_samples, p = n_features. (D. Arthur and S. Vassilvitskii, ‘How slow is the k-means method?’ SoCG2006)
In practice, the k-means algorithm is very fast (one of the fastest clustering algorithms available), but it falls in local minima. That’s why it can be useful to restart it several times.
If the algorithm stops before fully converging (because of tol or max_iter), labels_ and cluster_centers_ will not be consistent, i.e. the cluster_centers_ will not be the means of the points in each cluster. Also, the estimator will reassign labels_ after the last iteration to make labels_ consistent with predict on the training set.
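The Lloyd iteration mentioned in the notes can be sketched in a few lines of NumPy. This is a simplified illustration, not the pai4sk, scikit-learn or cuML implementation: the function name is hypothetical, the stopping rule compares the raw squared center shift against tol (the library scales its tolerance differently), and the initial centers are supplied by hand rather than chosen by k-means++.

```python
import numpy as np

def lloyd_kmeans(X, centers, max_iter=300, tol=1e-4):
    """Plain Lloyd iterations: assign points, then recompute the means."""
    centers = centers.astype(float)
    for n_iter in range(1, max_iter + 1):
        # Squared distance from every sample to every center: (n_samples, k).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # New center = mean of assigned points (keep the old one if empty).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(len(centers))
        ])
        shift = ((new_centers - centers) ** 2).sum()
        centers = new_centers
        if shift <= tol:
            break
    # Final assignment so that labels agree with the returned centers.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    inertia = d2[np.arange(len(X)), labels].sum()
    return centers, labels, inertia, n_iter

X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]], dtype=float)
init = X[[0, 3]]  # hypothetical initial centers: first and fourth rows
centers, labels, inertia, n_iter = lloyd_kmeans(X, init)
```

The final reassignment step mirrors the note above: labels are recomputed against the returned centers so that they match what a nearest-center prediction on the training set would give.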
fit
(X, y=None, sample_weight=None)¶ Fit the model according to the given training data.
Parameters:  X ({array-like, sparse matrix} of shape (n_samples, n_features) or cudf dataframe) – Training vector, where n_samples is the number of samples and n_features is the number of features.
 y (array-like, shape (n_samples,) or (n_samples, n_targets)) – Target vector relative to X.
 sample_weight (array-like, shape (n_samples,), optional) –
Array of weights that are assigned to individual samples. If not provided, then each sample is given unit weight.
New in version 0.17: sample_weight support to KMeans.
Returns: self – If KMeans from cuML is run then this fit method saves the cluster centers and labels as cudf dataframe and cudf Series objects instead of the return types of the scikit-learn API.
Return type:

fit_predict
(X, y=None, sample_weight=None)¶ Compute cluster centers and predict cluster index for each sample.
Convenience method; equivalent to calling fit(X) followed by predict(X).
Parameters:  X ({array-like, sparse matrix}, shape = [n_samples, n_features]) – New data to transform. cuDF dataframe if cuml is being used.
 y (Ignored) – Not used, present here for API consistency by convention.
 sample_weight (array-like, shape (n_samples,), optional) – The weights for each observation in X. If None, all observations are assigned equal weight (default: None)
Returns: labels – Index of the cluster each sample belongs to. If KMeans from cuML is run, then this method saves the cluster centers and labels as cudf dataframe and cudf Series objects instead of the return types of the scikit-learn API, and returns a cudf Series object.
Return type: array, shape [n_samples,] or cudf Series object

fit_transform
(X, y=None, sample_weight=None)¶ Compute clustering and transform X to cluster-distance space.
Equivalent to fit(X).transform(X), but more efficiently implemented.
Parameters:  X ({array-like, sparse matrix}, shape = [n_samples, n_features]) – New data to transform. cuDF dataframe if cuml is being used.
 y (Ignored) – Not used, present here for API consistency by convention.
 sample_weight (array-like, shape (n_samples,), optional) – The weights for each observation in X. If None, all observations are assigned equal weight (default: None)
Returns: X_new – X transformed in the new space. If KMeans from cuML is run, then this method saves the cluster centers and labels as cudf dataframe and cudf Series objects instead of the return types of the scikit-learn API.
Return type: array, shape [n_samples, k] or cudf dataframe

predict
(X, sample_weight=None)¶ Predict the closest cluster each sample in X belongs to.
In the vector quantization literature, cluster_centers_ is called the code book and each value returned by predict is the index of the closest code in the code book.
Parameters:  X – New data to predict. cuDF dataframe if cuml is being used.
 sample_weight (array-like, shape (n_samples,), optional) – The weights for each observation in X. If None, all observations are assigned equal weight (default: None)
Returns: labels – Index of the cluster each sample belongs to. If KMeans from cuML is run, then this method returns a cudf Series object instead of the return types of the scikit-learn API.
Return type: array, shape [n_samples,] or cudf Series object

transform
(X)¶ Transform X to a cluster-distance space.
In the new space, each dimension is the distance to the cluster centers. Note that even if X is sparse, the array returned by transform will typically be dense.
Parameters: X ({array-like, sparse matrix}, shape = [n_samples, n_features]) – New data to transform. cuDF dataframe if cuml is being used. If KMeans from cuML is run and the input data is a cudf dataframe, then this method returns a cudf dataframe instead of an array.
Returns: X_new – X transformed in the new space. If KMeans from cuML is run, then this method returns a cudf dataframe instead of the return types of the scikit-learn API.
Return type: array, shape [n_samples, k] or cudf dataframe
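The cluster-distance space can be illustrated directly with NumPy: each output column holds the Euclidean distance to one cluster center. The sketch below reuses the centers and query points from the KMeans example; it is an illustration of the computation under that assumption, not the library's code.

```python
import numpy as np

centers = np.array([[1.0, 2.0], [4.0, 2.0]])  # cluster centers from the KMeans example
X_new = np.array([[0.0, 0.0], [4.0, 4.0]])

# Euclidean distance from each sample to each center: (n_samples, n_clusters).
distances = np.linalg.norm(X_new[:, None, :] - centers[None, :, :], axis=2)

# The predicted cluster is the column with the smallest distance.
labels = distances.argmin(axis=1)
```

Taking the argmin over each row reproduces the cluster indices that predict returns for the same two points in the example.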

class
pai4sk.cluster.
DBSCAN
(eps=0.5, min_samples=5, metric='euclidean', metric_params=None, algorithm='auto', leaf_size=30, p=None, n_jobs=None, use_gpu=True)¶ Perform DBSCAN clustering from vector array or distance matrix.
DBSCAN - Density-Based Spatial Clustering of Applications with Noise. Finds core samples of high density and expands clusters from them. Good for data which contains clusters of similar density.
If the input data is a cudf dataframe and if possible, then the accelerated DBSCAN algorithm from cuML will be used. Otherwise, scikit-learn’s DBSCAN algorithm will be used.
cuML in pai4sk is currently supported only (a) with Python 3.6 and (b) without MPI.
If DBSCAN from cuML is run, then the return values from the APIs will be cudf dataframe and cudf Series objects instead of the return types of the scikit-learn API.
 eps : float, optional
 The maximum distance between two samples for them to be considered as in the same neighborhood.
 min_samples : int, optional
 The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself.
 metric : string, or callable
 The metric to use when calculating distance between instances in a feature array. If metric is a string or callable, it must be one of the options allowed by pai4sk.metrics.pairwise_distances for its metric parameter. If metric is ‘precomputed’, X is assumed to be a distance matrix and must be square. X may be a sparse matrix, in which case only “nonzero” elements may be considered neighbors for DBSCAN.
 New in version 0.17: metric precomputed to accept precomputed sparse matrix.
 metric_params : dict, optional
 Additional keyword arguments for the metric function. New in version 0.19.
 algorithm : {‘auto’, ‘ball_tree’, ‘kd_tree’, ‘brute’, ‘cuml’}, optional
 The algorithm to be used by the NearestNeighbors module to compute pointwise distances and find nearest neighbors. See NearestNeighbors module documentation for details.
 If a cudf dataframe is given as input and either (1) algorithm is set to “cuml” or (2) algorithm is “auto”, then pai4sk will try to use the DBSCAN algorithm from RAPIDS cuML if possible.
 cuML in pai4sk is currently supported only (a) with Python 3.6 and (b) without MPI.
 leaf_size : int, optional (default = 30)
 Leaf size passed to BallTree or cKDTree. This can affect the speed of the construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem.
 p : float, optional
 The power of the Minkowski metric to be used to calculate distance between points.
 n_jobs : int or None, optional (default=None)
 The number of parallel jobs to run. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.
 use_gpu : boolean, Default is True
 If True, cuML will use GPU 0. Applicable only for cuML.
Attributes: core_sample_indices_ : array, shape = [n_core_samples]
 Indices of core samples.
 components_ : array, shape = [n_core_samples, n_features]
 Copy of each core sample found by training.
 labels_ : array, shape = [n_samples]
 Cluster labels for each point in the dataset given to fit(). Noisy samples are given the label -1.
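The core-sample criterion behind core_sample_indices_ follows from the eps and min_samples definitions: a point is a core sample when at least min_samples points, itself included, lie within eps of it. A brute-force NumPy sketch under those definitions follows; the helper name and data are hypothetical, and this is not the library's neighbor search.

```python
import numpy as np

def core_sample_indices(X, eps=0.5, min_samples=5):
    """Indices of points with at least min_samples neighbors within eps."""
    # Pairwise Euclidean distances, shape (n_samples, n_samples).
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Neighborhood counts include the point itself (distance 0 <= eps).
    neighbor_counts = (d <= eps).sum(axis=1)
    return np.flatnonzero(neighbor_counts >= min_samples)

# Three nearby points and one isolated point (hypothetical data).
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])
core = core_sample_indices(X, eps=0.5, min_samples=3)
```

The isolated point has only itself in its neighborhood, so it is not a core sample; since no core sample is within eps of it either, a full DBSCAN run would label it as noise.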

fit
(X, y=None, sample_weight=None)¶ Perform DBSCAN clustering from features or distance matrix.
Parameters: X : array or sparse (CSR) matrix of shape (n_samples, n_features), or array of shape (n_samples, n_samples) or cuDF dataframe
 A feature array, or array of distances between samples if metric=’precomputed’.
 sample_weight : array, shape (n_samples,), optional
 Weight of each sample, such that a sample with a weight of at least min_samples is by itself a core sample; a sample with negative weight may inhibit its eps-neighbor from being core. Note that weights are absolute, and default to 1.
 y : Ignored
Returns: self : object
 If DBSCAN from cuML is run, then this fit method saves the computed labels as a cudf Series object instead of an array.

fit_predict
(X, y=None, sample_weight=None)¶ Performs clustering on X and returns cluster labels.
Parameters: X : array or sparse (CSR) matrix of shape (n_samples, n_features), or array of shape (n_samples, n_samples) or cudf dataframe
 A feature array, or array of distances between samples if metric=’precomputed’.
 sample_weight : array, shape (n_samples,), optional
 Weight of each sample, such that a sample with a weight of at least min_samples is by itself a core sample; a sample with negative weight may inhibit its eps-neighbor from being core. Note that weights are absolute, and default to 1.
 y : Ignored
Returns: y : ndarray, shape (n_samples,) or cudf Series
 If DBSCAN from cuML is run, then this method returns the computed labels as a cudf Series object instead of an ndarray.

class
pai4sk.decomposition.
PCA
(n_components=None, copy=True, whiten=False, svd_solver='auto', tol=0.0, iterated_power='auto', random_state=None, use_gpu=True)¶ Principal component analysis (PCA)
Linear dimensionality reduction using Singular Value Decomposition of the data to project it to a lower dimensional space.
It uses the LAPACK implementation of the full SVD or a randomized truncated SVD by the method of Halko et al. 2009, depending on the shape of the input data and the number of components to extract.
It can also use the scipy.sparse.linalg ARPACK implementation of the truncated SVD.
If the input data is a cudf dataframe, then pai4sk will try to use the accelerated PCA algorithm from cuML. Otherwise, scikit-learn’s PCA algorithm will be used.
cuML in pai4sk is currently supported only (a) with Python 3.6 and (b) without MPI.
If PCA from cuML is run, then the return values from the APIs will be cudf dataframe and cudf Series objects instead of the return types of the scikit-learn API.
Notice that this class does not support sparse input. See TruncatedSVD for an alternative with sparse data.
Read more in the User Guide.
Parameters:  n_components (int, float, None or string) –
Number of components to keep. If n_components is not set all components are kept:
n_components == min(n_samples, n_features)
If n_components == 'mle' and svd_solver == 'full', Minka’s MLE is used to guess the dimension. Use of n_components == 'mle' will interpret svd_solver == 'auto' as svd_solver == 'full'.
If 0 < n_components < 1 and svd_solver == 'full', select the number of components such that the amount of variance that needs to be explained is greater than the percentage specified by n_components.
If svd_solver == 'arpack', the number of components must be strictly less than the minimum of n_features and n_samples.
Hence, the None case results in:
n_components == min(n_samples, n_features) - 1
 copy (bool (default True)) – If False, data passed to fit are overwritten and running fit(X).transform(X) will not yield the expected results, use fit_transform(X) instead.
 whiten (bool, optional (default False)) –
When True (False by default) the components_ vectors are multiplied by the square root of n_samples and then divided by the singular values to ensure uncorrelated outputs with unit component-wise variances.
Whitening will remove some information from the transformed signal (the relative variance scales of the components) but can sometimes improve the predictive accuracy of the downstream estimators by making their data respect some hard-wired assumptions.
 svd_solver (string {'auto', 'full', 'arpack', 'randomized', 'cuml', 'jacobi'}) –
 auto :
 when cuml is not used, the solver is selected by a default policy based on X.shape and n_components: if the input data is larger than 500x500 and the number of components to extract is lower than 80% of the smallest dimension of the data, then the more efficient ‘randomized’ method is enabled. Otherwise the exact full SVD is computed and optionally truncated afterwards. If cuml is used, then the default algorithm ‘full’ will be used when the svd_solver is ‘auto’ or ‘cuml’.
If a cudf dataframe is given as input and either (1) svd_solver is set to “cuml” or (2) svd_solver is “auto”, then pai4sk will try to use the PCA algorithm from RAPIDS cuML if possible.
cuML in pai4sk is currently supported only (a) with Python 3.6 and (b) without MPI.
 full :
 run exact full SVD calling the standard LAPACK solver via scipy.linalg.svd and select the components by postprocessing
 arpack :
 run SVD truncated to n_components calling ARPACK solver via scipy.sparse.linalg.svds. It requires strictly 0 < n_components < min(X.shape)
 randomized :
 run randomized SVD by the method of Halko et al.
New in version 0.18.0.
 tol (float >= 0, optional (default .0)) –
Tolerance for singular values computed by svd_solver == ‘arpack’.
New in version 0.18.0.
 iterated_power (int >= 0, or 'auto', (default 'auto')) –
Number of iterations for the power method computed by svd_solver == ‘randomized’. Note : cuML for pai4sk only supports integer values for this parameter.
New in version 0.18.0.
 random_state (int, RandomState instance or None, optional (default None)) –
If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random. Used when svd_solver == ‘arpack’ or ‘randomized’.
New in version 0.18.0.
 use_gpu (boolean, Default is True) – If True, cuML will use GPU 0. Applicable only for cuML.
Variables:  components (array of shape (n_components, n_features) or cudf dataframe) – Principal axes in feature space, representing the directions of maximum variance in the data. The components are sorted by explained_variance_.
 explained_variance (array of shape (n_components,) or cudf Series) –
The amount of variance explained by each of the selected components.
Equal to n_components largest eigenvalues of the covariance matrix of X.
New in version 0.18.
 explained_variance_ratio (array of shape (n_components,) or cudf Series) –
Percentage of variance explained by each of the selected components.
If n_components is not set then all components are stored and the sum of the ratios is equal to 1.0.
 singular_values (array of shape (n_components,) or cudf Series) – The singular values corresponding to each of the selected components. The singular values are equal to the 2-norms of the n_components variables in the lower-dimensional space.
 mean (array, shape (n_features,)) –
Per-feature empirical mean, estimated from the training set.
Equal to X.mean(axis=0).
 n_components (int) – The estimated number of components. When n_components is set to ‘mle’ or a number between 0 and 1 (with svd_solver == ‘full’) this number is estimated from input data. Otherwise it equals the parameter n_components, or the lesser value of n_features and n_samples if n_components is None.
 noise_variance (float or cudf Series) –
The estimated noise covariance following the Probabilistic PCA model from Tipping and Bishop 1999. See “Pattern Recognition and Machine Learning” by C. Bishop, 12.2.1 p. 574 or http://www.miketipping.com/papers/metmppca.pdf. It is required to compute the estimated data covariance and score samples.
Equal to the average of (min(n_features, n_samples) - n_components) smallest eigenvalues of the covariance matrix of X.
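When 0 < n_components < 1, the number of components kept is the smallest count whose cumulative explained variance ratio exceeds the requested fraction. The NumPy sketch below illustrates that rule; the helper name and the ratios are hypothetical, and the exact tie-breaking used by the library is an assumption here.

```python
import numpy as np

def components_for_variance(explained_variance_ratio, threshold):
    """Smallest number of components whose cumulative ratio exceeds threshold."""
    cumulative = np.cumsum(explained_variance_ratio)
    # Position of the first cumulative value strictly greater than threshold,
    # converted from a zero-based index to a count.
    return int(np.searchsorted(cumulative, threshold, side="right")) + 1

ratios = np.array([0.6, 0.25, 0.1, 0.05])  # hypothetical, sorted decreasingly
k = components_for_variance(ratios, 0.8)
```

With these ratios the cumulative sums are 0.6, 0.85, 0.95 and 1.0, so a request for 80% of the variance keeps two components.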
References
For n_components == ‘mle’, this class uses the method of Minka, T. P. “Automatic choice of dimensionality for PCA”. In NIPS, pp. 598-604
Implements the probabilistic PCA model from: Tipping, M. E., and Bishop, C. M. (1999). “Probabilistic principal component analysis”. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3), 611-622, via the score and score_samples methods. See http://www.miketipping.com/papers/metmppca.pdf
For svd_solver == ‘arpack’, refer to scipy.sparse.linalg.svds.
For svd_solver == ‘randomized’, see: Halko, N., Martinsson, P. G., and Tropp, J. A. (2011). “Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions”. SIAM review, 53(2), 217-288. and also Martinsson, P. G., Rokhlin, V., and Tygert, M. (2011). “A randomized algorithm for the decomposition of matrices”. Applied and Computational Harmonic Analysis, 30(1), 47-68.
Examples
>>> import numpy as np
>>> from pai4sk.decomposition import PCA
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> pca = PCA(n_components=2)
>>> pca.fit(X)
PCA(copy=True, iterated_power='auto', n_components=2, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)
>>> print(pca.explained_variance_ratio_) # doctest: +ELLIPSIS
[0.9924... 0.0075...]
>>> print(pca.singular_values_) # doctest: +ELLIPSIS
[6.30061... 0.54980...]

>>> pca = PCA(n_components=2, svd_solver='full')
>>> pca.fit(X) # doctest: +ELLIPSIS +NORMALIZE_WHITESPACE
PCA(copy=True, iterated_power='auto', n_components=2, random_state=None,
  svd_solver='full', tol=0.0, whiten=False)
>>> print(pca.explained_variance_ratio_) # doctest: +ELLIPSIS
[0.9924... 0.00755...]
>>> print(pca.singular_values_) # doctest: +ELLIPSIS
[6.30061... 0.54980...]

>>> pca = PCA(n_components=1, svd_solver='arpack')
>>> pca.fit(X)
PCA(copy=True, iterated_power='auto', n_components=1, random_state=None,
  svd_solver='arpack', tol=0.0, whiten=False)
>>> print(pca.explained_variance_ratio_) # doctest: +ELLIPSIS
[0.99244...]
>>> print(pca.singular_values_) # doctest: +ELLIPSIS
[6.30061...]
See also
KernelPCA, SparsePCA, TruncatedSVD, IncrementalPCA

fit(X, y=None, _transform=True)
Fit the model with X.
Parameters:  X (array-like of shape (n_samples, n_features) or cudf dataframe) – Training data, where n_samples is the number of samples and n_features is the number of features.
 y (Ignored) –
Returns: self – Returns the instance itself. If PCA from cuML is run, this method stores the computed values as cudf dataframes and cudf Series objects rather than the result types produced by scikit-learn’s fit method.
Return type:

fit_transform(X, y=None)
Fit the model with X and apply the dimensionality reduction on X.
Parameters:  X (array-like of shape (n_samples, n_features) or cudf dataframe) – Training data, where n_samples is the number of samples and n_features is the number of features.
 y (Ignored) –
Returns: X_new – If PCA from cuML is run, this method stores the computed values as cudf dataframes and cudf Series objects rather than the result types produced by scikit-learn’s fit_transform method.
Return type: array-like of shape (n_samples, n_components) or cudf dataframe

inverse_transform(X)
Transform data back to its original space.
In other words, return an input X_original whose transform would be X.
Parameters: X (array-like of shape (n_samples, n_components) or cudf dataframe) – New data, where n_samples is the number of samples and n_components is the number of components.
Returns: X_original – If PCA from cuML is run, this method returns a cudf dataframe rather than the result type produced by scikit-learn’s inverse_transform method.
Return type: array-like of shape (n_samples, n_features) or cudf dataframe
Notes
If whitening is enabled, inverse_transform will compute the exact inverse operation, which includes reversing whitening.
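As an illustration of what reversing whitening means, here is a minimal NumPy sketch of a transform / inverse_transform pair. It is written under the assumption that whitening divides each projected column by the square root of its explained variance; the function names mirror the methods above, but this is not the pai4sk code.

```python
import numpy as np

rng = np.random.RandomState(1)
X = rng.normal(size=(8, 3))            # arbitrary toy data

mean = X.mean(axis=0)
U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
components = Vt                        # all 3 components are kept here
explained_variance = S ** 2 / (X.shape[0] - 1)


def transform(X, whiten=False):
    """Project centered data onto the components; optionally scale to unit variance."""
    X_new = (X - mean) @ components.T
    if whiten:
        X_new = X_new / np.sqrt(explained_variance)
    return X_new


def inverse_transform(X_new, whiten=False):
    """Map projected data back to feature space, undoing the whitening scale first."""
    if whiten:
        X_new = X_new * np.sqrt(explained_variance)
    return X_new @ components + mean


X_white = transform(X, whiten=True)
X_back = inverse_transform(X_white, whiten=True)
```

In this sketch the round trip recovers X exactly because all components are kept; with fewer components, the result is the projection of X onto the retained subspace.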

score(X, y=None)
Return the average log-likelihood of all samples.
See “Pattern Recognition and Machine Learning” by C. Bishop, 12.2.1 p. 574 or http://www.miketipping.com/papers/metmppca.pdf
Parameters:  X (array, shape (n_samples, n_features)) – The data.
 y (Ignored) –
Returns: ll – Average log-likelihood of the samples under the current model.
Return type:
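The quantity returned by score can be sketched directly from the probabilistic PCA model: the retained directions keep their variance, every remaining direction is assigned the noise variance, and each sample is scored under the resulting Gaussian. The NumPy code below is an illustrative reconstruction under those assumptions, with arbitrary toy data; it is not taken from the pai4sk source.

```python
import numpy as np

rng = np.random.RandomState(2)
X = rng.normal(size=(50, 4)) * np.array([4.0, 2.0, 0.5, 0.3])   # toy data

n_samples, n_features = X.shape
k = 2                                    # retained components

mean = X.mean(axis=0)
Xc = X - mean
_, S, Vt = np.linalg.svd(Xc, full_matrices=False)
eigvals = S ** 2 / (n_samples - 1)       # covariance eigenvalues, descending
sigma2 = eigvals[k:].mean()              # noise variance

# Model covariance: retained directions keep their variance, the rest get sigma2.
W = Vt[:k]
C = W.T @ np.diag(eigvals[:k] - sigma2) @ W + sigma2 * np.eye(n_features)

# Gaussian log-density of each sample under N(mean, C).
_, logdet = np.linalg.slogdet(C)
mahalanobis = np.einsum("ij,jk,ik->i", Xc, np.linalg.inv(C), Xc)
log_like = -0.5 * (mahalanobis + n_features * np.log(2 * np.pi) + logdet)

average_log_likelihood = log_like.mean()
```

Here log_like plays the role of the per-sample scores and its mean is the value that score would report.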

transform(X)
Apply dimensionality reduction to X.
X is projected on the first principal components previously extracted from a training set.
Parameters: X (array-like of shape (n_samples, n_features) or cudf dataframe) – New data, where n_samples is the number of samples and n_features is the number of features.
Returns: X_new – If PCA from cuML is run, this method returns a cudf dataframe rather than the result type produced by scikit-learn’s transform method.
Return type: array-like of shape (n_samples, n_components) or cudf dataframe
Examples
>>> import numpy as np
>>> from pai4sk.decomposition import IncrementalPCA
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> ipca = IncrementalPCA(n_components=2, batch_size=3)
>>> ipca.fit(X)
IncrementalPCA(batch_size=3, copy=True, n_components=2, whiten=False)
>>> ipca.transform(X) # doctest: +SKIP
 n_components (int, float, None or string) –

class pai4sk.decomposition.TruncatedSVD(n_components=2, algorithm='auto', n_iter=5, random_state=None, tol=0.0, use_gpu=True)
Dimensionality reduction using truncated SVD (aka LSA).
This transformer performs linear dimensionality reduction by means of truncated singular value decomposition (SVD). Contrary to PCA, this estimator does not center the data before computing the singular value decomposition. This means it can work with scipy.sparse matrices efficiently.
In particular, truncated SVD works on term count/tf-idf matrices as returned by the vectorizers in pai4sk.feature_extraction.text. In that context, it is known as latent semantic analysis (LSA).
This estimator supports two algorithms: a fast randomized SVD solver, and a “naive” algorithm that uses ARPACK as an eigensolver on (X * X.T) or (X.T * X), whichever is more efficient.
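The difference from PCA, and the link between the SVD and the eigenproblem on X.T * X, can be seen in a small dense example. This is a NumPy sketch on made-up count data; it uses a full dense SVD for clarity rather than the randomized or ARPACK solvers described above.

```python
import numpy as np

rng = np.random.RandomState(3)
# Non-negative toy "term count" matrix with many zeros (6 documents, 5 terms).
X = rng.poisson(0.6, size=(6, 5)).astype(float)
k = 2

# Truncated SVD: decompose X itself, with no centering step.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
X_tsvd = X @ Vt[:k].T

# PCA: subtract the column means first, then decompose.
Xc = X - X.mean(axis=0)
_, S_pca, Vt_pca = np.linalg.svd(Xc, full_matrices=False)
X_pca = Xc @ Vt_pca[:k].T

# Eigensolver view: the squared singular values of X are the eigenvalues
# of X.T @ X (and the non-zero eigenvalues of X @ X.T).
eigvals = np.linalg.eigvalsh(X.T @ X)[::-1]
```

Subtracting the column means generally turns zero entries into non-zero ones, which is why skipping the centering step is what keeps sparse input usable.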
If the input data is a cudf dataframe and cuML can be used, the accelerated TruncatedSVD algorithm from cuML is used. Otherwise, scikit-learn’s TruncatedSVD algorithm is used.
cuML in pai4sk is currently supported only (a) with Python 3.6 and (b) without MPI.
If TruncatedSVD from cuML is run, the return values from the APIs will be cudf dataframe and cudf Series objects instead of the return types of the scikit-learn API.
Read more in the User Guide.
Parameters:  n_components (int, default = 2) – Desired dimensionality of output data. Must be strictly less than the number of features. The default value is useful for visualisation. For LSA, a value of 100 is recommended.
 algorithm (string, one of "arpack", "randomized", "cuml", "auto", "full" or "jacobi"; default "auto") – SVD solver to use. When cuML cannot be used, either “arpack” for the ARPACK wrapper in SciPy (scipy.sparse.linalg.svds), or “randomized” for the randomized algorithm due to Halko (2009). “auto” becomes “full” if the arguments pass the validations required for using cuML, and “randomized” if cuML is not used. To use cuML, algorithm must be one of “auto”, “cuml”, “full” or “jacobi”.
 n_iter (int, optional (default 5)) – Number of iterations for randomized SVD solver. Not used by ARPACK. The default is larger than the default in randomized_svd to handle sparse matrices that may have a large, slowly decaying spectrum.
 random_state (int, RandomState instance or None, optional, default = None) – If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.
 tol (float, optional) – Tolerance for ARPACK. 0 means machine precision. Ignored by randomized SVD solver.
 use_gpu (boolean, default True) – If True, cuML will use GPU 0. Applicable only for cuML.
Variables:  components (array of shape (n_components, n_features) or cudf dataframe) –
 explained_variance (array of shape (n_components,) or cudf Series object) – The variance of the training samples transformed by a projection to each component.
 explained_variance_ratio (array of shape (n_components,) or cudf Series object) – Percentage of variance explained by each of the selected components.
 singular_values (array of shape (n_components,) or cudf Series object) – The singular values corresponding to each of the selected components. The singular values are equal to the 2-norms of the n_components variables in the lower-dimensional space.
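These attribute definitions can be reproduced with a few lines of NumPy. The sketch below assumes the population variance (ddof = 0) and a dense SVD on arbitrary toy data; it illustrates the definitions and is not the pai4sk or cuML implementation.

```python
import numpy as np

rng = np.random.RandomState(4)
X = rng.poisson(1.0, size=(10, 6)).astype(float)   # toy data, not centered
k = 3

U, S, Vt = np.linalg.svd(X, full_matrices=False)
components = Vt[:k]                    # shape (n_components, n_features)
X_transformed = X @ components.T       # shape (n_samples, n_components)

# Variance of the transformed training samples, one value per component.
explained_variance = X_transformed.var(axis=0)

# Fraction of the total per-feature variance of X captured by each component.
explained_variance_ratio = explained_variance / X.var(axis=0).sum()

# Singular values: 2-norms of the transformed columns.
singular_values = np.linalg.norm(X_transformed, axis=0)
```

Because the data are not centered, the variance ratios need not be sorted in decreasing order even though the singular values are.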
Examples
>>> from pai4sk.decomposition import TruncatedSVD
>>> from pai4sk.random_projection import sparse_random_matrix
>>> X = sparse_random_matrix(100, 100, density=0.01, random_state=42)
>>> svd = TruncatedSVD(n_components=5, n_iter=7, random_state=42)
>>> svd.fit(X) # doctest: +NORMALIZE_WHITESPACE
TruncatedSVD(algorithm='randomized', n_components=5, n_iter=7,
        random_state=42, tol=0.0)
>>> print(svd.explained_variance_ratio_) # doctest: +ELLIPSIS
[0.0606... 0.0584... 0.0497... 0.0434... 0.0372...]
>>> print(svd.explained_variance_ratio_.sum()) # doctest: +ELLIPSIS
0.249...
>>> print(svd.singular_values_) # doctest: +ELLIPSIS
[2.5841... 2.5245... 2.3201... 2.1753... 2.0443...]
See also
References
Halko, et al., 2009. “Finding structure with randomness: Stochastic algorithms for constructing approximate matrix decompositions” (arXiv:0909.4061). https://arxiv.org/pdf/0909.4061.pdf
Notes
SVD suffers from a problem called “sign indeterminacy”, which means the sign of the components_ and the output from transform depend on the algorithm and random state. To work around this, fit instances of this class to data once, then keep the instance around to do transformations.
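Sign indeterminacy is easy to demonstrate: flipping the sign of a matching pair of left and right singular vectors leaves the factorization unchanged but negates the corresponding transformed column. The NumPy sketch below shows this on arbitrary data; fix_signs is a hypothetical convention added for illustration, not something this class is documented to apply.

```python
import numpy as np

rng = np.random.RandomState(5)
X = rng.normal(size=(6, 4))
k = 2

U, S, Vt = np.linalg.svd(X, full_matrices=False)
U, S, Vt = U[:, :k], S[:k], Vt[:k]

# Flip the sign of the first left/right singular vector pair.
flip = np.array([-1.0, 1.0])
U_flipped, Vt_flipped = U * flip, Vt * flip[:, None]

# Both factorizations give the same rank-k reconstruction ...
same_reconstruction = np.allclose((U * S) @ Vt, (U_flipped * S) @ Vt_flipped)
# ... but the transformed data differ in sign along the flipped component.
X_t, X_t_flipped = X @ Vt.T, X @ Vt_flipped.T


def fix_signs(Vt):
    """Hypothetical convention: make the largest-magnitude entry of each row positive."""
    idx = np.argmax(np.abs(Vt), axis=1)
    signs = np.sign(Vt[np.arange(Vt.shape[0]), idx])
    return Vt * signs[:, None]
```

Keeping one fitted instance, as recommended above, avoids the issue because the same components are reused for every transformation.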
fit(X, y=None)
Fit LSI model on training data X.
Parameters:  X ({array-like, sparse matrix} of shape (n_samples, n_features) or cudf dataframe) – Training data.
 y (Ignored) –
Returns: self – Returns the transformer object. If TruncatedSVD from cuML is run, this method stores the computed values as cudf dataframes and cudf Series objects rather than the result types produced by scikit-learn’s fit method.
Return type:

fit_transform(X, y=None)
Fit LSI model to X and perform dimensionality reduction on X.
Parameters:  X ({array-like, sparse matrix} of shape (n_samples, n_features) or cudf dataframe) – Training data. If TruncatedSVD from cuML is run, this method stores the computed values as cudf dataframes and cudf Series objects rather than the result types produced by scikit-learn’s API.
 y (Ignored) –
Returns: X_new – Reduced version of X. This will always be a dense array.
Return type: array of shape (n_samples, n_components) or cudf dataframe

inverse_transform(X)
Transform X back to its original space.
Returns an array or cudf dataframe X_original whose transform would be X.
Parameters: X (array-like of shape (n_samples, n_components) or cudf dataframe) – New data.
Returns: X_original – Note that this is always dense. If TruncatedSVD from cuML is run, this method returns a cudf dataframe rather than the result type produced by scikit-learn’s transform method.
Return type: array of shape (n_samples, n_features) or cudf dataframe

transform(X)
Perform dimensionality reduction on X.
Parameters: X ({array-like, sparse matrix} of shape (n_samples, n_features) or cudf dataframe) – New data.
Returns: X_new – Reduced version of X. This will always be dense. If TruncatedSVD from cuML is run, this method returns a cudf dataframe rather than the result type produced by scikit-learn’s transform method.
Return type: array of shape (n_samples, n_components) or cudf dataframe
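For orientation, transform and inverse_transform amount to multiplying by the component matrix and its transpose. The following NumPy sketch assumes dense input and a full SVD on arbitrary data; it shows the shapes involved and the reconstruction error of the round trip, and is not the pai4sk or cuML code path.

```python
import numpy as np

rng = np.random.RandomState(6)
X = rng.normal(size=(7, 5))
k = 3

_, S, Vt = np.linalg.svd(X, full_matrices=False)
components = Vt[:k]                  # shape (n_components, n_features)

X_new = X @ components.T             # transform: (n_samples, n_components)
X_original = X_new @ components      # inverse_transform: (n_samples, n_features)

# The round trip is the best rank-k approximation of X; its Frobenius error
# equals the root-sum-square of the discarded singular values.
error = np.linalg.norm(X - X_original)
expected_error = np.sqrt(np.sum(S[k:] ** 2))
```

The round trip therefore reproduces X only up to the part of the data that lies outside the retained components.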