Manual¶
The Snap Machine Learning (Snap ML) Library is designed to offer fast training of generalized linear models and tree based models. The library is under development and currently supports the following machine learning models:
Linear Regression¶
LinearRegression
fits a linear model with coefficients to minimize the residual sum
of squares between the predicted responses and the true labels of the training data.
In order to prevent the model from overfitting you have the option to impose an or an norm penalty on the size of the coefficients.
Ridge Regression adds an -regularization term to the least-squares loss. Mathematically it solves the following optimization problem:
Lasso adds an -regularization term to the least-squares loss. Mathematically it solves the following optimization problem:
In both cases denotes the training data matrix with samples in its columns and are the corresponding labels. The regularization strength is controlled by the regularization parameter ; the larger the more robust the model becomes to overfitting.
Note
(Regularization Parameter) In order to find an appropriate regularization parameter we recommend to perform cross validation.
Note
(Feature Selection) As Lasso regression yields sparse models it can be used to perform feature selection. The sparsity can be controlled by - a larger regularization parameter encourages more sparsity.
Snap ML implements different variants of stochastic coordinate descent [SCD] and stochastic dual coordinate ascent [SDCA] as an algorithm to fit the model parameters . In order to optimally support GPUs for training Snap ML implements a parallel asynchronous version of these solvers especially designed to leverage the massive parallelism of modern GPUs [TPASCD].
To train the LinearRegression
model the fit
method is used; it takes the training data and the labels as input and stores the learnt coefficients of the model in its coef_
member function.
The regularization type can be specified during initialization using the penalty
argument.
The trained model can then be used to make predictions by calling the predict
method on unlabelled data.
>>> from pai4sk import LinearRegression
>>> import numpy as np
>>> reg = LinearRegression(max_iter = 100, regularizer = 0.1, penalty = 'l2')
>>> X_train = np.array([[0, 0], [1, 1], [2, 2]])
>>> y_train = np.array([0, 1, 2])
>>> reg.fit (X_train, y_train)
>>> reg.coef_
[ 0.495, 0.495]
>>> X_test = np.array([[3, 3], [0, 1]])
>>> reg.predict(X_test)
[2.97, 0.495]
For more details about the API we refer to the snap-ml API.
Support Vector Machine¶
Support Vector Machine (SVM) is a supervised learning method which can be applied for regression as well as classification.
Currently Snap ML implements SupportVectorMachine
(SVMs) with a linear kernel function and offers
regularization to prevent the model from overfitting.
Mathematically it solves the following optimization problem:
where denotes the hinge loss with being the training samples and the corresponding labels. The regularization strength can be controlled by the user through the regularizer
parameter.
The larger the more robust the model becomes to overfitting.
Snap ML implements stochastic dual coordinate [SDCA] and the GPU optimized [TPASCD] as an algorithm to train the SVM classifier. SDCA runs on the equivalent SVM dual problem formulation:
with the constraint .
To train the model the fit
method is used; it takes the training data and the labels as input and stores the learnt coefficients of the model in its coef_
member function. The trained model can then be used to make predictions by calling the predict
method on unlabelled data.
>>> from pai4sk import SupportVectorMachine
>>> import numpy as np
>>> reg = SupportVectorMachine(max_iter = 100, regularizer = 1.0)
>>> X_train = np.array([[0, 0], [1, 1], [2, 2]])
>>> y_train = np.array([-1,-1,1])
>>> reg.fit (X_train, y_train)
>>> reg.coef_
[ 0.25, 0.25]
>>> X_test = np.array([[3, 3], [0, 1]])
>>> reg.predict(X_test)
[1,-1]
A full example of training a SupportVectorMachine
model in a real application can be found in the PowerAI distribution under /opt/DL/snap-ml-local/examples. For more details about the API we refer to the snap-ml API.
Logistic Regression¶
LogisticRegression
is a linear model for classification. A logistic model is used to estimate the probability of an outcome based on the features of the input data. In order to prevent the model from overfitting or regularization can be used.
Mathematically -regularized Logistic Regression solves the following optimization problem composing of the logistic loss and an regularization term:
Similarly, -regularized Logistic Regression solves the following optimization problem:
where is the training data matrix with samples in its columns and denote the corresponding labels.
The regularization strength is controlled by the regularization parameter ; the larger the more robust the model becomes to overfitting. can be specified by the user through the regularizer
input parameter.
Snap ML implements stochastic coordinate descent [SCD] and stochastic dual coordinate ascent [SDCA] as an algorithm to fit the model parameters . In order to support GPU acceleration Snap ML implements a parallel asynchronous version of these solvers especially designed to leverage the massive parallelism of moderne GPUs [TPASCD].
The model can be trained using the fit
method which takes the training data and the labels as input and stores the coefficients of the learnt model in its coef_
attribute. This model can then be used to make predictions by calling the predict
method on unlabelled data. The regularization type can be specified at initialization using the penalty
argument.
>>> from pai4sk import LogisticRegression
>>> import numpy as np
>>> lr = LogisticRegression(max_iter = 100, regularizer = 0.01, penalty = 'l2')
>>> X_train = np.array([[0, 0], [1, 1], [2, 2]])
>>> y_train = np.array([-1,-1,1])
>>> lr.fit (X_train, y_train)
>>> lr.coef_
[0.145, 0.145]
>>> X_test = np.array([[3,3],[-2,1]])
>>> lr.predict(X_test)
[1,-1]
>>> lr.predict_proba(X_test)
[[0.295, 0.705]
[0.536, 0.464]]
A full example of training a LogisticRegression
model in Snap ML can be found in the PowerAI distribution under /opt/DL/snap-ml-local/examples. For more details about the API we refer to the snap-ml API.
Decision Tree¶
Snap ML offers a DecisionTreeClassifier
for training a binary decision tree. The training routine is single-threaded and executed on the CPU. Check the snap-ml API for details about the available options or check out the Tutorials for an application example.
Random Forest¶
With the RandomForestClassifier
you can train an ensemble of trees. This is usually preferred over a single decision tree as it improves the generalization performance of the model. Snap ML implements a multi-threaded training routine that is executed on the CPU. Check the snap-ml API for details about the available options or check out the Tutorials for an application example.
References¶
[SCD]: | Y. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 2012. |
---|---|
[SDCA]: | Shai Shalev-Shwartz and Tong Zhang. Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization. Journal of Machine Learning Research, 2013. |
[TPASCD]: | Thomas Parnell, Celestine Dünner, Kubilay Atasu, Manolis Sifalakis and Haris Pozidis. Tera-Scale Coordinate Descent on GPUs. Journal on Future Generation Computer Systems, 2018. |