snap-ml API

class pai4sk.LinearRegression(max_iter=1000, regularizer=1.0, device_ids=[], verbose=False, use_gpu=False, dual=True, num_threads=1, penalty='l2', tol=0.001, return_training_history=None, privacy=False, eta=0.3, batch_size=100, privacy_epsilon=10, grad_clip=1, fit_intercept=False, intercept_scaling=1.0)

Linear Regression

This class implements regularized linear regression using the IBM Snap ML solver. It supports both the local and the distributed (MPI) methods of the Snap ML solver. It handles both dense and sparse matrix inputs. Use csr, csc, ndarray, DeviceNDArray or SnapML data partition format for training and csr, ndarray or SnapML data partition format for prediction. The DeviceNDArray input data format is currently not supported for training with the MPI implementation.

Parameters
  • max_iter (int, default : 1000) – Maximum number of iterations used by the solver to converge.

  • regularizer (float, default : 1.0) – Regularization strength. It must be a positive float. Larger regularization values imply stronger regularization.

  • use_gpu (bool, default : False) – Flag for indicating the hardware platform used for training. If True, the training is performed using the GPU. If False, the training is performed using the CPU.

  • device_ids (array-like of int, default : []) – If use_gpu is True, it indicates the IDs of the GPUs used for training. For single GPU training, set device_ids to the GPU ID to be used for training, e.g., [0]. For multi-GPU training, set device_ids to a list of GPU IDs to be used for training, e.g., [0, 1].

  • dual (bool, default : True) – Dual or primal formulation. Recommendation: if n_samples > n_features use dual=True.

  • verbose (bool, default : False) – If True, it prints the training cost, one per iteration. Warning: this will increase the training time. For performance evaluation, use verbose=False.

  • num_threads (int, default : 1) – The number of threads used for running the training. The value of this parameter should be a multiple of 32 if the training is performed on GPU (use_gpu=True).

  • penalty (str, default : "l2") – The regularization / penalty type. Possible values are “l2” for L2 regularization (RidgeRegression) or “l1” for L1 regularization (LassoRegression). L1 regularization is possible only for the primal optimization problem (dual=False).

  • tol (float, default : 0.001) – The tolerance parameter. Training will finish when maximum change in model coefficients is less than tol.

  • return_training_history (str or None, default : None) – How much information about the training should be collected and returned by the fit function. By default no information is returned (None), but this parameter can be set to “summary”, to obtain summary statistics at the end of training, or “full” to obtain a complete set of statistics for the entire training procedure. Note, enabling either option will result in slower training. return_training_history is not supported for DeviceNDArray input format.

  • privacy (bool, default : False) – Train the model using a differentially private algorithm. Currently not supported for MPI implementation.

  • eta (float, default : 0.3) – Learning rate for the differentially private training algorithm. Currently not supported for MPI implementation.

  • batch_size (int, default : 100) – Mini-batch size for the differentially private training algorithm. Currently not supported for MPI implementation.

  • privacy_epsilon (float, default : 10.0) – Target privacy guarantee. The learned model will be (privacy_epsilon, 0.01)-private. Currently not supported for MPI implementation.

  • grad_clip (float, default: 1.0) – Gradient clipping parameter for the differentially private training algorithm. Currently not supported for MPI implementation.

  • fit_intercept (bool, default : False) – Add bias term – note, may affect speed of convergence, especially for sparse datasets.

  • intercept_scaling (float, default : 1.0) – Scaling of bias term. The inclusion of a bias term is implemented by appending an additional feature to the dataset. This feature has a constant value, that can be set using this parameter.
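The bias-term mechanism described for fit_intercept and intercept_scaling can be sketched as follows. This is a minimal illustration only: the helper name append_bias_feature is hypothetical and not part of pai4sk, plain Python lists stand in for the matrix formats, and the library performs the equivalent step internally.

```python
def append_bias_feature(rows, intercept_scaling=1.0):
    """Return a copy of `rows` with one constant feature appended to each sample.

    Hypothetical helper for illustration; not a pai4sk function.
    """
    return [list(row) + [float(intercept_scaling)] for row in rows]


X = [[0.5, 1.0], [2.0, -1.0]]
X_aug = append_bias_feature(X, intercept_scaling=1.0)
# Each sample now carries an extra constant column whose learned
# coefficient plays the role of the bias term.
print(X_aug)
```

Because every sample gains one non-zero entry, a sparse dataset becomes slightly denser, which is consistent with the note that fit_intercept may affect convergence speed on sparse data.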

Variables
  • coef_ (array-like, shape (n_features,)) – Coefficients of the features in the trained model.

  • support_ (array-like) – Indices of the features that lie in the support and contribute to the decision (only available for L1). Currently not supported for MPI implementation.

  • model_sparsity_ (float) – fraction of non-zeros in the model parameters. (only available for L1). Currently not supported for MPI implementation.
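As a rough illustration of what support_ and model_sparsity_ describe, the sketch below derives both from a coefficient vector. It assumes that a feature "contributes" exactly when its coefficient is non-zero; the helper name is hypothetical, and the threshold the library actually applies is not stated here.

```python
def support_and_sparsity(coef):
    """Return (indices of non-zero coefficients, fraction of non-zeros).

    Hypothetical helper; assumes an exact-zero test for "not in the support".
    """
    support = [i for i, w in enumerate(coef) if w != 0.0]
    sparsity = len(support) / len(coef) if coef else 0.0
    return support, sparsity


support, sparsity = support_and_sparsity([0.0, 1.5, 0.0, -2.0])
print(support, sparsity)
```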

fit(X_train, y_train=None)

Fit the model according to the given train dataset.

Parameters
  • X_train (Train dataset. Supports the following input data-types :) –

    1. Sparse matrix (csr_matrix, csc_matrix) or dense matrix (ndarray)

    2. DeviceNDArray. Not supported for MPI execution.

    3. SnapML data partition of type DensePartition, SparsePartition or ConstantValueSparsePartition

  • y_train (The target corresponding to X_train.) – If X_train is a sparse matrix or dense matrix, y_train should be array-like of shape = (n_samples,). In the case of DeviceNDArray, y_train should be array-like of shape = (n_samples, 1). If X_train is a SnapML data partition type, then y_train is not required (i.e. None).

Returns

training_history – If the return_training_history parameter is set to “summary” or “full”, a list with statistics for the training procedure is returned. With the default setting (None), no information is returned (None).

Return type

list or None

get_params()

Get the values of the model parameters.

Returns

params

Return type

dict

predict(X, num_threads=0)

Predictions

Returns the predicted estimates for the samples in X.

Parameters
  • X (Dataset used for predicting estimates. Supports the following input data-types :) –

    1. Sparse matrix (csr_matrix, csc_matrix) or dense matrix (ndarray)

    2. SnapML data partition of type DensePartition, SparsePartition or ConstantValueSparsePartition

  • num_threads (int, default : 0) – Number of threads used to run inference. By default inference runs with maximum number of available threads.

Returns

proba – Returns the predicted estimate of the sample.

Return type

array-like, shape = (n_samples,)
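The fit and predict calls documented above might be combined as in the following sketch. To keep it runnable without the library installed, the estimator is passed in as an argument; the helper name train_and_predict is hypothetical, and the commented-out usage relies only on the constructor arguments listed in this section.

```python
def train_and_predict(model, X_train, y_train, X_test, num_threads=0):
    """Fit `model`, then predict on `X_test`.

    `model` is any object exposing fit(X_train, y_train) and
    predict(X, num_threads=...), as documented above.
    """
    # fit returns a list of statistics, or None when
    # return_training_history is left at its default.
    history = model.fit(X_train, y_train)
    predictions = model.predict(X_test, num_threads=num_threads)
    return predictions, history


# With the library installed, usage might look like this (not executed here):
#   from pai4sk import LinearRegression
#   model = LinearRegression(max_iter=1000, regularizer=1.0, dual=True,
#                            return_training_history="summary")
#   y_pred, history = train_and_predict(model, X_train, y_train, X_test)
```

Leaving num_threads at 0 keeps the documented default, under which inference runs with the maximum number of available threads.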

class pai4sk.LogisticRegression(max_iter=1000, regularizer=1.0, device_ids=[], verbose=False, use_gpu=False, class_weight=None, dual=True, num_threads=1, penalty='l2', tol=0.001, return_training_history=None, privacy=False, eta=0.3, batch_size=100, privacy_epsilon=10, grad_clip=1, fit_intercept=False, intercept_scaling=1.0)

Logistic Regression classifier

This class implements regularized logistic regression using the IBM Snap ML solver. It supports both the local and the distributed (MPI) methods of the Snap ML solver. It can be used for both binary and multi-class classification problems. For multi-class classification it predicts only classes (no probabilities). It handles both dense and sparse matrix inputs. Use csr, csc, ndarray, DeviceNDArray or SnapML data partition format for training and csr, ndarray or SnapML data partition format for prediction. The DeviceNDArray input data format is currently not supported for training with the MPI implementation. We recommend normalizing the input values first.

Parameters
  • max_iter (int, default : 1000) – Maximum number of iterations used by the solver to converge.

  • regularizer (float, default : 1.0) – Regularization strength. It must be a positive float. Larger regularization values imply stronger regularization.

  • use_gpu (bool, default : False) – Flag for indicating the hardware platform used for training. If True, the training is performed using the GPU. If False, the training is performed using the CPU.

  • device_ids (array-like of int, default : []) – If use_gpu is True, it indicates the IDs of the GPUs used for training. For single-GPU training, set device_ids to the GPU ID to be used for training, e.g., [0]. For multi-GPU training, set device_ids to a list of GPU IDs to be used for training, e.g., [0, 1].

  • class_weight ('balanced' or None, optional) – If set to ‘None’, all classes will have weight 1.

  • dual (bool, default : True) – Dual or primal formulation. Recommendation: if n_samples > n_features use dual=True.

  • verbose (bool, default : False) – If True, it prints the training cost, one per iteration. Warning: this will increase the training time. For performance evaluation, use verbose=False.

  • num_threads (int, default : 1) – The number of threads used for running the training. The value of this parameter should be a multiple of 32 if the training is performed on GPU (use_gpu=True).

  • penalty (str, default : "l2") – The regularization / penalty type. Possible values are “l2” for L2 regularization (LogisticRegression) or “l1” for L1 regularization (SparseLogisticRegression). L1 regularization is possible only for the primal optimization problem (dual=False).

  • tol (float, default : 0.001) – The tolerance parameter. Training will finish when maximum change in model coefficients is less than tol.

  • return_training_history (str or None, default : None) – How much information about the training should be collected and returned by the fit function. By default no information is returned (None), but this parameter can be set to “summary”, to obtain summary statistics at the end of training, or “full” to obtain a complete set of statistics for the entire training procedure. Note, enabling either option will result in slower training. return_training_history is not supported for DeviceNDArray input format.

  • privacy (bool, default : False) – Train the model using a differentially private algorithm. Currently not supported for MPI implementation.

  • eta (float, default : 0.3) – Learning rate for the differentially private training algorithm. Currently not supported for MPI implementation.

  • batch_size (int, default : 100) – Mini-batch size for the differentially private training algorithm. Currently not supported for MPI implementation.

  • privacy_epsilon (float, default : 10.0) – Target privacy guarantee. The learned model will be (privacy_epsilon, 0.01)-private. Currently not supported for MPI implementation.

  • grad_clip (float, default: 1.0) – Gradient clipping parameter for the differentially private training algorithm. Currently not supported for MPI implementation.

  • fit_intercept (bool, default : False) – Add bias term – note, may affect speed of convergence, especially for sparse datasets.

  • intercept_scaling (float, default : 1.0) – Scaling of bias term. The inclusion of a bias term is implemented by appending an additional feature to the dataset. This feature has a constant value, that can be set using this parameter.
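Several of the parameters above constrain one another. The sketch below collects the constraints stated in this list into a small pre-flight check; check_params is a hypothetical helper, not a pai4sk function, and the library may report these conditions differently or not at all.

```python
def check_params(penalty="l2", dual=True, regularizer=1.0,
                 use_gpu=False, num_threads=1):
    """Return a list of human-readable problems; empty if none are found.

    Hypothetical helper encoding only the constraints documented above.
    """
    problems = []
    if penalty not in ("l1", "l2"):
        problems.append("penalty must be 'l1' or 'l2'")
    if penalty == "l1" and dual:
        # L1 is documented as available only for the primal problem.
        problems.append("L1 regularization requires dual=False")
    if regularizer <= 0:
        problems.append("regularizer must be a positive float")
    if use_gpu and num_threads % 32 != 0:
        problems.append("num_threads should be a multiple of 32 when use_gpu=True")
    return problems


print(check_params(penalty="l1", dual=True))
print(check_params(use_gpu=True, num_threads=64))
```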

Variables
  • coef_ (array-like, shape (n_features, 1) for binary classification or (n_features, n_classes) for multi-class classification) – Coefficients of the features in the trained model.

  • support_ (array-like) – Indices of the features that contribute to the decision. (only available for L1) Currently not supported for MPI implementation.

  • model_sparsity_ (float) – fraction of non-zeros in the model parameters. (only available for L1) Currently not supported for MPI implementation.

fit(X_train, y_train=None)

Fit the model according to the given train data.

Parameters
  • X_train (Train dataset. Supports the following input data-types :) –

    1. Sparse matrix (csr_matrix, csc_matrix) or dense matrix (ndarray)

    2. DeviceNDArray. Not supported for MPI execution.

    3. SnapML data partition of type DensePartition, SparsePartition or ConstantValueSparsePartition

  • y_train (The target corresponding to X_train.) – If X_train is a sparse matrix or dense matrix, y_train should be array-like of shape = (n_samples,). In the case of DeviceNDArray, y_train should be array-like of shape = (n_samples, 1). For binary classification the labels should be {-1, 1} or {0, 1}. If X_train is a SnapML data partition type, then y_train is not required (i.e. None).
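When the raw labels of a binary problem are not already in {-1, 1} or {0, 1}, they need to be mapped before calling fit. The following is one possible way to do that; to_binary_labels is a hypothetical helper, and plain lists are used here in place of an ndarray.

```python
def to_binary_labels(labels, positive_label, negative_value=0):
    """Map two arbitrary label values to {0, 1} or {-1, 1}.

    Hypothetical helper; `negative_value` selects which encoding is produced.
    """
    if negative_value not in (0, -1):
        raise ValueError("negative_value must be 0 or -1")
    if len(set(labels)) > 2:
        raise ValueError("expected at most two distinct labels")
    return [1 if label == positive_label else negative_value for label in labels]


print(to_binary_labels(["spam", "ham", "spam"], positive_label="spam"))
print(to_binary_labels(["spam", "ham", "spam"], "spam", negative_value=-1))
```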

Returns

training_history – If the return_training_history parameter is set to “summary” or “full”, a list with statistics for the training procedure is returned. With the default setting (None), no information is returned (None).

Return type

list or None

get_params()

Get the values of the model parameters.

Returns

params

Return type

dict

predict(X, num_threads=0)

Class predictions

The returned class estimates.

Parameters
  • X (Dataset used for predicting class estimates. Supports the following input data-types :) –

    1. Sparse matrix (csr_matrix, csc_matrix) or dense matrix (ndarray)

    2. SnapML data partition of type DensePartition, SparsePartition or ConstantValueSparsePartition

  • num_threads (int, default : 0) – Number of threads used to run inference. By default inference runs with maximum number of available threads.

Returns

proba – Returns the predicted class of the sample.

Return type

array-like, shape = (n_samples,)

predict_log_proba(X, num_threads=0)

Log of probability estimates

The returned log-probability estimates for the two classes. Only for binary classification.

Parameters
  • X (Dataset used for predicting log-probability estimates. Supports the following input data-types :) –

    1. Sparse matrix (csr_matrix, csc_matrix) or dense matrix (ndarray)

    2. SnapML data partition of type DensePartition, SparsePartition or ConstantValueSparsePartition

  • num_threads (int, default : 0) – Number of threads used to run inference. By default inference runs with maximum number of available threads.

Returns

proba – For the local implementation, the log-probability of the sample for each of the two classes. For the MPI implementation, the log-probability of the sample being a positive example.

Return type

array-like, shape = (n_samples, 2) for the local implementation or (n_samples,) for the MPI implementation

predict_proba(X, num_threads=0)

Probability estimates

The returned probability estimates for the two classes. Only for binary classification.

Parameters
  • X (Dataset used for predicting probability estimates. Supports the following input data-types :) –

    1. Sparse matrix (csr_matrix, csc_matrix) or dense matrix (ndarray)

    2. SnapML data partition of type DensePartition, SparsePartition or ConstantValueSparsePartition

  • num_threads (int, default : 0) – Number of threads used to run inference. By default inference runs with maximum number of available threads.

Returns

proba – For the local implementation, the probability of the sample for each of the two classes. For the MPI implementation, the probability of the sample being a positive example.

Return type

array-like, shape = (n_samples, 2) for the local implementation or (n_samples,) for the MPI implementation
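Because two return shapes are documented, (n_samples, 2) and (n_samples,), calling code that must work with both may want to normalize the result. The sketch below is one way to do so. The helper name is hypothetical, and it assumes that in the two-column form the second column holds the positive class; that column order is an assumption and should be verified against the installed version.

```python
def positive_class_proba(proba):
    """Return one positive-class probability per sample.

    Accepts either a flat sequence (n_samples,) or rows of two values
    (n_samples, 2). ASSUMPTION: column 1 is the positive class.
    """
    result = []
    for row in proba:
        if hasattr(row, "__len__"):
            result.append(float(row[1]))   # two-column form
        else:
            result.append(float(row))      # already one value per sample
    return result


print(positive_class_proba([[0.2, 0.8], [0.9, 0.1]]))
print(positive_class_proba([0.8, 0.1]))
```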

class pai4sk.SupportVectorMachine(max_iter=1000, regularizer=1.0, device_ids=[], verbose=False, use_gpu=False, class_weight=None, num_threads=1, tol=0.001, return_training_history=None, fit_intercept=False, intercept_scaling=1.0)

Support Vector Machine classifier

This class implements a regularized support vector machine using the IBM Snap ML solver. It supports both the local and the distributed (MPI) methods of the Snap ML solver. It can be used for both binary and multi-class classification problems. For multi-class classification it predicts classes or the decision function for each class in the model. It handles both dense and sparse matrix inputs. Use csr, ndarray, DeviceNDArray or SnapML data partition format for both training and prediction. The DeviceNDArray input data format is currently not supported for training with the MPI implementation. The training uses the dual formulation. We recommend normalizing the input values.

Parameters
  • max_iter (int, default : 1000) – Maximum number of iterations used by the solver to converge.

  • regularizer (float, default : 1.0) – Regularization strength. It must be a positive float. Larger regularization values imply stronger regularization.

  • use_gpu (bool, default : False) – Flag for indicating the hardware platform used for training. If True, the training is performed using the GPU. If False, the training is performed using the CPU.

  • device_ids (array-like of int, default : []) – If use_gpu is True, it indicates the IDs of the GPUs used for training. For single GPU training, set device_ids to the GPU ID to be used for training, e.g., [0]. For multi-GPU training, set device_ids to a list of GPU IDs to be used for training, e.g., [0, 1].

  • class_weight ('balanced' or None, optional) – If set to ‘None’, all classes will have weight 1.

  • verbose (bool, default : False) – If True, it prints the training cost, one per iteration. Warning: this will increase the training time. For performance evaluation, use verbose=False.

  • num_threads (int, default : 1) – The number of threads used for running the training. The value of this parameter should be a multiple of 32 if the training is performed on GPU (use_gpu=True).

  • tol (float, default : 0.001) – The tolerance parameter. Training will finish when maximum change in model coefficients is less than tol.

  • return_training_history (str or None, default : None) – How much information about the training should be collected and returned by the fit function. By default no information is returned (None), but this parameter can be set to “summary”, to obtain summary statistics at the end of training, or “full” to obtain a complete set of statistics for the entire training procedure. Note, enabling either option will result in slower training. return_training_history is not supported for DeviceNDArray input format.

  • fit_intercept (bool, default : False) – Add bias term – note, may affect speed of convergence, especially for sparse datasets.

  • intercept_scaling (float, default : 1.0) – Scaling of bias term. The inclusion of a bias term is implemented by appending an additional feature to the dataset. This feature has a constant value, that can be set using this parameter.
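The stopping rule described for tol can be sketched as below. This is an illustration of the stated criterion only: has_converged is a hypothetical helper, and it assumes the change is measured as the absolute difference between consecutive coefficient vectors, which is not spelled out above.

```python
def has_converged(old_coef, new_coef, tol=0.001):
    """True when the largest coefficient change between iterations is below tol.

    Hypothetical helper; assumes an absolute (not relative) change.
    """
    max_change = max(
        (abs(new - old) for old, new in zip(old_coef, new_coef)),
        default=0.0,
    )
    return max_change < tol


print(has_converged([1.0, 2.0], [1.0005, 2.0]))  # small change
print(has_converged([1.0, 2.0], [1.0, 2.01]))    # larger change
```

Training also ends once max_iter iterations have run, so a tight tol combined with a small max_iter may stop before this criterion is met.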

Variables
  • coef_ (array-like, shape (n_features,) for binary classification or (n_features, n_classes) for multi-class classification) – Coefficients of the features in the trained model.

  • support_ (array-like, shape (n_SV)) – Indices of the support vectors. Currently not supported for MPI implementation.

  • n_support_ (int) – Number of support vectors. Currently not supported for MPI implementation.

decision_function(X, num_threads=0)

Predicts confidence scores.

The confidence score of a sample is the signed distance of that sample to the decision boundary.

Parameters
  • X (Dataset used for predicting distances to the decision boundary. Supports the following input data-types :) –

    1. Sparse matrix (csr_matrix, csc_matrix) or dense matrix (ndarray)

    2. SnapML data partition of type DensePartition, SparsePartition or ConstantValueSparsePartition

  • num_threads (int, default : 0) – Number of threads used to run inference. By default inference runs with maximum number of available threads.

Returns

proba – Returns the distance to the decision boundary of the samples in X.

Return type

array-like, shape = (n_samples,) or (n_samples, n_classes)
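For a linear binary model with coefficient vector w, the raw score w·x has the same sign as the signed distance to the decision boundary and equals it up to a factor of the norm of w. The sketch below computes both quantities. The helper names are hypothetical, and whether decision_function returns the raw score or the normalized distance is not stated here, so treat this as an illustration of the geometry rather than of the library's output.

```python
import math


def linear_score(coef, sample, intercept=0.0):
    """Raw linear score w.x + b for one sample."""
    return sum(w * x for w, x in zip(coef, sample)) + intercept


def signed_distance(coef, sample, intercept=0.0):
    """Score divided by ||w||: the geometric signed distance to the hyperplane."""
    norm = math.sqrt(sum(w * w for w in coef))
    return linear_score(coef, sample, intercept) / norm


coef = [3.0, 4.0]
sample = [1.0, 1.0]
print(linear_score(coef, sample))
print(signed_distance(coef, sample))
```

A positive value places the sample on the positive side of the boundary, a negative value on the other side.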

fit(X_train, y_train=None)

Fit the model according to the given train dataset.

Parameters
  • X_train (Train dataset. Supports the following input data-types :) –

    1. Sparse matrix (csr_matrix, csc_matrix) or dense matrix (ndarray)

    2. DeviceNDArray. Not supported for MPI execution.

    3. SnapML data partition of type DensePartition, SparsePartition or ConstantValueSparsePartition

  • y_train (The target corresponding to X_train.) – If X_train is a sparse matrix or dense matrix, y_train should be array-like of shape = (n_samples,). In the case of DeviceNDArray, y_train should be array-like of shape = (n_samples, 1). For binary classification the labels should be {-1, 1} or {0, 1}. If X_train is a SnapML data partition type, then y_train is not required (i.e. None).

Returns

training_history – If the return_training_history parameter is set to “summary” or “full”, a list with statistics for the training procedure is returned. With the default setting (None), no information is returned (None).

Return type

list or None

get_params()

Get the values of the model parameters.

Returns

params

Return type

dict

predict(X, num_threads=0)

Class predictions

The returned class estimates.

Parameters
  • X (Dataset used for predicting class estimates. Supports the following input data-types :) –

    1. Sparse matrix (csr_matrix) or dense matrix (ndarray)

    2. SnapML data partition of type DensePartition, SparsePartition or ConstantValueSparsePartition

  • num_threads (int, default : 0) – Number of threads used to run inference. By default inference runs with maximum number of available threads.

Returns

proba – Returns the predicted class of the samples in X.

Return type

array-like, shape = (n_samples,)

class pai4sk.DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=None, min_samples_leaf=1, max_features=None, random_state=None, n_threads=1, use_histograms=False, hist_nbins=256, use_gpu=False, gpu_id=0, verbose=False)

Decision Tree Classifier

This class implements a decision tree classifier using the IBM Snap ML library. It can be used for binary classification problems.

Parameters
  • criterion (string, optional, default : "gini") – This function measures the quality of a split. Possible values: “gini” and “entropy” for information gain. “entropy” is currently not supported.

  • splitter (string, optional, default : "best") – This parameter defines the strategy used to choose the split at each node. Possible values: “best” and “random”. “random” is currently not supported.

  • max_depth (int or None, optional, default : None) – The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_leaf samples.

  • min_samples_leaf (int or float, optional, default : 1) –

    The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it generates at least min_samples_leaf training samples in each of the left and right branches:
    • If int, then consider min_samples_leaf as the minimum number.

    • If float, then consider ceil(min_samples_leaf * n_samples) as the minimum number.

  • max_features (int, float, string or None, optional, default : None) –

    The number of features to consider when looking for the best split:
    • If int, then consider max_features features at each split.

    • If float, then consider int(max_features * n_features) features at each split.

    • If “auto”, then max_features=sqrt(n_features).

    • If “sqrt”, then max_features=sqrt(n_features).

    • If “log2”, then max_features=log2(n_features).

    • If None, then max_features=n_features.

  • random_state (int, or None, optional, default : None) – If int, random_state is the seed used by the random number generator; If None, the random number generator is the RandomState instance used by np.random.

  • n_threads (integer, optional, default : 1) – The number of CPU threads to use.

  • use_histograms (boolean, default : False) – Use histogram-based splits rather than exact splits.

  • hist_nbins (int, default : 256) – Number of histogram bins.

  • use_gpu (boolean, default : False) – Use GPU acceleration (only supported for histogram-based splits).

  • gpu_id (int, default: 0) – Device ID of the GPU which will be used when GPU acceleration is enabled.

  • verbose (bool, default : False) – If True, it prints debugging information while training. Warning: this will increase the training time. For performance evaluation, use verbose=False.
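The rules given for max_features and min_samples_leaf can be written out as two small functions. Both names are hypothetical and not part of pai4sk. Truncating sqrt and log2 results to an integer is an assumption of this sketch, since the list above states only max_features=sqrt(n_features) and max_features=log2(n_features).

```python
import math


def resolve_max_features(max_features, n_features):
    """Number of features considered per split, following the rules above."""
    if max_features is None:
        return n_features
    if isinstance(max_features, str):
        if max_features in ("auto", "sqrt"):
            return int(math.sqrt(n_features))   # truncation is an assumption
        if max_features == "log2":
            return int(math.log2(n_features))   # truncation is an assumption
        raise ValueError("unknown max_features: %r" % max_features)
    if isinstance(max_features, int):
        return max_features
    return int(max_features * n_features)       # float: fraction of features


def resolve_min_samples_leaf(min_samples_leaf, n_samples):
    """Minimum number of samples per leaf, following the rules above."""
    if isinstance(min_samples_leaf, int):
        return min_samples_leaf
    return math.ceil(min_samples_leaf * n_samples)  # float: fraction of samples


print(resolve_max_features("sqrt", 16), resolve_max_features(0.5, 16))
print(resolve_min_samples_leaf(0.25, 10))
```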

Variables
  • classes_ (array of shape = [n_classes]) – The class labels (single output problem)

  • n_classes_ (int) – The number of classes (for single output problems)

fit(X_train, y_train, sample_weight=None)

Fit the model according to the given train data.

Parameters
  • X_train (dense matrix (ndarray)) – Train dataset

  • y_train (array-like, shape = (n_samples,)) – The target vector corresponding to X_train.

  • sample_weight (array-like, shape = [n_samples] or None) – Sample weights. If None, then samples are equally weighted. TODO: Splits that would create child nodes with net zero or negative weight are ignored while searching for a split in each node.

Returns

Return type

None

get_params()

Get the values of the model parameters.

Returns

params

Return type

dict

predict(X, num_threads=0)

Class predictions

The returned class estimates.

Parameters
  • X (dense matrix (ndarray)) – Dataset used for predicting class estimates.

  • num_threads (int, default : 0) – Number of threads used to run inference. By default inference runs with maximum number of available threads.

Returns

pred – Returns the predicted class of the sample.

Return type

array-like, shape = (n_samples,)

predict_log_proba(X, num_threads=0)

Log of probability estimates

The returned log-probability estimates for the two classes. Only for binary classification.

Parameters
  • X (dense matrix (ndarray)) – Dataset used for predicting log-probability estimates.

  • num_threads (int, default : 0) – Number of threads used to run inference. By default inference runs with maximum number of available threads.

Returns

Return type

None

predict_proba(X, num_threads=0)

Probability estimates

The returned probability estimates for the two classes. Only for binary classification.

Parameters
  • X (dense matrix (ndarray)) – Dataset used for predicting probability estimates.

  • num_threads (int, default : 0) – Number of threads used to run inference. By default inference runs with maximum number of available threads.

Returns

Return type

None

class pai4sk.RandomForestClassifier(n_estimators=10, criterion='gini', max_depth=None, min_samples_leaf=1, max_features='auto', bootstrap=True, n_jobs=None, random_state=None, verbose=False, use_histograms=False, hist_nbins=256, use_gpu=False, gpu_ids=[0])

Random Forest Classifier

This class implements a random forest classifier using the IBM Snap ML library. It can be used for binary classification problems.

Parameters
  • n_estimators (integer, optional, default : 10) – This parameter defines the number of trees in the forest.

  • criterion (string, optional, default : "gini") – This function measures the quality of a split. The currently supported criterion is “gini”.

  • max_depth (integer or None, optional, default : None) – The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_leaf samples.

  • min_samples_leaf (int or float, optional, default : 1) –

    The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches:
    • If int, then consider min_samples_leaf as the minimum number.

    • If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) is the minimum number of samples for each node.

  • max_features (int, float, string or None, optional, default : 'auto') –

    The number of features to consider when looking for the best split:
    • If int, then consider max_features features at each split.

    • If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split.

    • If “auto”, then max_features=sqrt(n_features).

    • If “sqrt”, then max_features=sqrt(n_features).

    • If “log2”, then max_features=log2(n_features).

    • If None, then max_features=n_features.

  • bootstrap (boolean, optional, default : True) – This parameter determines whether bootstrap samples are used when building trees.

  • n_jobs (integer or None, optional, default : None) – The number of jobs to run in parallel for the fit function. None means 1 process.

  • random_state (integer, or None, optional, default : None) – If integer, random_state is the seed used by the random number generator. If None, the random number generator is the RandomState instance used by np.random.

  • verbose (boolean, default : False) – If True, it prints debugging information while training. Warning: this will increase the training time. For performance evaluation, use verbose=False.

  • use_histograms (boolean, default : False) – Use histogram-based splits rather than exact splits.

  • hist_nbins (int, default : 256) – Number of histogram bins.

  • use_gpu (boolean, default : False) – Use GPU acceleration (only supported for histogram-based splits).

  • gpu_ids (array-like of int, default: [0]) – Device IDs of the GPUs which will be used when GPU acceleration is enabled.
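The “gini” criterion refers to Gini impurity, which scores a candidate split by how mixed the class labels are on each side. The sketch below uses the standard textbook definition; the function names are hypothetical, and the library's internal computation (in particular with histograms or sample weights) is not shown in this reference.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity 1 - sum(p_k^2) over the class proportions p_k."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def split_impurity(left, right):
    """Size-weighted impurity of the two child nodes; lower is better."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)


print(gini_impurity([0, 0, 1, 1]))       # evenly mixed node
print(split_impurity([0, 0], [1, 1]))    # split that separates the classes
print(split_impurity([0, 1], [0, 1]))    # split that separates nothing
```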

fit(X_train, y_train, sample_weight=None)

Fit the model according to the given train data.

Parameters
  • X_train (dense matrix (ndarray)) – Train dataset

  • y_train (array-like, shape = (n_samples,)) – The target vector corresponding to X_train.

  • sample_weight (array-like, shape = [n_samples] or None) – Sample weights. If None, then samples are equally weighted. TODO: Splits that would create child nodes with net zero or negative weight are ignored while searching for a split in each node.

Returns

self

Return type

object

get_params()

Get the values of the model parameters.

Returns

params

Return type

dict

predict(X, num_threads=0)

Class predictions

The returned class estimates.

Parameters
  • X (dense matrix (ndarray)) – Dataset used for predicting class estimates.

  • num_threads (int, default : 0) – Number of threads used to run inference. By default inference runs with maximum number of available threads.

Returns

proba – Returns the predicted class of the sample.

Return type

array-like, shape = (n_samples,)