snap-ml-spark API

class snap_ml_spark.LinearRegression.LinearRegression(max_iter=1000, dual=True, regularizer=1.0, verbose=False, use_gpu=False, class_weights=None, gpu_mem_limit=0, n_threads=-1, penalty='l2', tol=0.001, return_training_history=None, labelColIndex=1)

Linear Regression model

This class implements regularized Linear Regression using the IBM Snap ML distributed solver. It can handle sparse and dense dataset formats. Use the libsvm, snap, or csv format for the dual algorithm, or the snap.t (transposed) format for the primal algorithm.

Parameters:
  • max_iter (int, default : 1000) – Maximum number of iterations used by the solver to converge.
  • dual (bool, default : True) – Dual or primal formulation. Recommendation: if n_samples > n_features use dual=True, else dual=False.
  • regularizer (float, default : 1.0) – Regularization strength. It must be a positive float. Larger regularization values imply stronger regularization.
  • verbose (boolean, default : False) – Flag for indicating if the training loss will be printed at each epoch.
  • use_gpu (bool, default : False) – Flag for indicating the hardware platform used for training. If True, the training is performed using the GPU. If False, the training is performed using the CPU.
  • class_weights ('balanced'/True or None/False, optional) – If set to ‘None’, all classes will have weight = 1.
  • gpu_mem_limit (int, default : 0) – Limit of the GPU memory. If set to the default value 0, the maximum possible memory is used.
  • n_threads (int, default : -1 meaning that n_threads=256 if GPU is enabled, else 1) – Number of threads to be used.
  • penalty (str, default : "l2") – The regularization / penalty type. Possible values are “l2” for L2 regularization (RidgeRegression) or “l1” for L1 regularization (LassoRegression). L1 regularization is possible only for the primal optimization problem (dual=False).
  • tol (float, default : 0.001) – The tolerance parameter. Training will finish when maximum change in model coefficients is less than tol.
  • return_training_history (str or None, default : None) – How much information about the training should be collected and returned by the fit function. By default no information is returned (None), but this parameter can be set to “summary”, to obtain summary statistics at the end of training, or “full” to obtain a complete set of statistics for the entire training procedure. Note, enabling either option will result in slower training.
  • labelColIndex (int, default : 1) – Denotes which column is to be used as the label. Applicable only when a DataFrame is passed as input and it holds features in individual columns rather than in a DenseVector or SparseVector.
Variables:
  • coef (ndarray, shape (n_features,)) – Coefficients of the features in the trained model.
  • pred_array (ndarray, shape(number_of_test_examples,)) – linear predictions written by the predict() function of this class
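The regularizer parameter plays the role of the usual ridge/lasso penalty strength. As a self-contained illustration of the statement that larger values imply stronger regularization, here is a closed-form single-feature ridge fit in plain python (a didactic sketch only, not the Snap ML distributed solver):

```python
# Minimal single-feature L2-regularized least squares (no intercept):
# w = (sum x*y) / (sum x*x + lam). Didactic sketch only.

def ridge_coef(x, y, lam):
    """Closed-form ridge coefficient for one feature; lam is the regularization strength."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]              # exact relation y = 2x

w_weak = ridge_coef(x, y, lam=0.0)     # no regularization: recovers w = 2.0
w_strong = ridge_coef(x, y, lam=30.0)  # stronger regularization shrinks w toward 0
print(w_weak, w_strong)                # 2.0 1.0
```

The stronger penalty shrinks the learned coefficient toward zero, which is what "larger regularization values imply stronger regularization" means in practice.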
fit(data)

learn model

Parameters:data (pyspark.sql.DataFrame or py4j.java_gateway.JavaObject; a JavaObject is a handle to data stored in JVM memory and cannot be accessed from python as a python array.) – data to fit the model
Returns:double – training loss at the final epoch
get_params()
Returns:all the initialized parameters of the Linear Regression model as a python dictionary
predict(data, num_threads=0)

Predict regression values

Parameters:
  • data (pyspark.sql.DataFrame or py4j.java_gateway.JavaObject; a JavaObject is a handle to data stored in JVM memory and cannot be accessed from python as an array, only passed to this function to obtain predictions.) – data to make predictions on
  • num_threads – the number of threads to use for inference (default 0 means use all available threads)
Returns:

a handle to a com.ibm.snap.ml.DatasetWithPredictions java object. The handle itself cannot be accessed from python, but the predictions can be read from the pred_array_ field, which is a python array.

class snap_ml_spark.LogisticRegression.LogisticRegression(max_iter=1000, dual=True, regularizer=1.0, verbose=False, use_gpu=False, class_weights=None, gpu_mem_limit=0, n_threads=-1, penalty='l2', tol=0.001, return_training_history=None, labelColIndex=1)

Logistic Regression classifier

This class implements regularized Logistic Regression using the IBM Snap ML solver. It can handle sparse and dense dataset formats. Use the libsvm, snap, or csv format for the dual algorithm, or the snap.t (transposed) format for the primal algorithm.

Parameters:
  • max_iter (int, default : 1000) – Maximum number of iterations used by the solver to converge.
  • dual (bool, default : True) – Dual or primal formulation. Recommendation: if n_samples > n_features use dual=True, else dual=False.
  • regularizer (float, default : 1.0) – Regularization strength. It must be a positive float. Larger regularization values imply stronger regularization.
  • verbose (boolean, default : False) – Flag for indicating if the training loss will be printed at each epoch.
  • use_gpu (bool, default : False) – Flag for indicating the hardware platform used for training. If True, the training is performed using the GPU. If False, the training is performed using the CPU.
  • class_weights ('balanced'/True or None/False, optional) – If set to ‘None’, all classes will have weight 1.
  • gpu_mem_limit (int, default : 0) – Limit of the GPU memory. If set to the default value 0, the maximum possible memory is used.
  • n_threads (int, default : -1 meaning that n_threads=256 if GPU is enabled, else 1) – Number of threads to be used.
  • penalty (str, default : "l2") – The regularization / penalty type. Possible values are “l2” for L2 regularization or “l1” for L1 regularization. L1 regularization is possible only for the primal optimization problem (dual=False).
  • tol (float, default : 0.001) – The tolerance parameter. Training will finish when maximum change in model coefficients is less than tol.
  • return_training_history (str or None, default : None) – How much information about the training should be collected and returned by the fit function. By default no information is returned (None), but this parameter can be set to “summary”, to obtain summary statistics at the end of training, or “full” to obtain a complete set of statistics for the entire training procedure. Note, enabling either option will result in slower training.
  • labelColIndex (int, default : 1) – Denotes which column is to be used as the label. Applicable only when a DataFrame is passed as input and it holds features in individual columns rather than in a DenseVector or SparseVector.
Variables:
  • coef (ndarray, shape (n_features,)) – Coefficients of the features in the trained model.
  • pred_array (ndarray, shape(number_of_test_examples,)) – binary predictions written by the predict() function
  • proba_array (ndarray, shape(number_of_test_examples,)) – predicted probabilities written by the predict_proba() function
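The relationship between predict() and predict_proba() is the standard one for logistic regression: the probability is a sigmoid of the linear score, and the binary prediction thresholds it. A minimal sketch of that relationship in plain python (illustrative only; the actual scoring happens inside the JVM solver):

```python
import math

def sigmoid(score):
    """Map a linear score w.x to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))

def predict_label(score, threshold=0.5):
    """Binary prediction obtained by thresholding the predicted probability."""
    return 1 if sigmoid(score) >= threshold else 0

print(sigmoid(0.0))         # 0.5: a zero score is maximally uncertain
print(predict_label(2.0))   # 1
print(predict_label(-2.0))  # 0
```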
fit(data)

learn model

Parameters:data (pyspark.sql.DataFrame or py4j.java_gateway.JavaObject; a JavaObject is a handle to data stored in JVM memory and cannot be accessed from python as a python array.) – data to fit the model
Returns:double – training loss at the final epoch
get_params()
Returns:all the initialized parameters of the Logistic Regression model as a python dictionary
predict(data, num_threads=0)

Predict label

Parameters:
  • data (pyspark.sql.DataFrame or py4j.java_gateway.JavaObject; a JavaObject is a handle to data stored in JVM memory and cannot be accessed from python as an array, only passed to this function to obtain predictions.) – data to make predictions on
  • num_threads – the number of threads to use for inference (default 0 means use all available threads)
Returns:

a handle to a com.ibm.snap.ml.DatasetWithPredictions java object. The handle itself cannot be accessed from python, but the predictions can be read from the pred_array_ field, which is a python array.

predict_proba(data, num_threads=0)

Predict probabilities

Parameters:
  • data (pyspark.sql.DataFrame or py4j.java_gateway.JavaObject; a JavaObject is a handle to data stored in JVM memory and cannot be accessed from python as an array, only passed to this function to obtain predictions.) – data to make predictions on
  • num_threads – the number of threads to use for inference (default 0 means use all available threads)
Returns:

a handle to a com.ibm.snap.ml.DatasetWithPredictions java object. The handle itself cannot be accessed from python, but the predicted probabilities can be read from the proba_array_ field, which is a python array.

class snap_ml_spark.SupportVectorMachine.SupportVectorMachine(max_iter=1000, regularizer=1.0, verbose=False, use_gpu=False, class_weights=None, gpu_mem_limit=0, n_threads=-1, tol=0.001, return_training_history=None, labelColIndex=1)

Support Vector Machine classifier

This class implements a regularized Support Vector Machine using the IBM Snap ML solver. It can handle sparse and dense dataset formats. Use the libsvm, snap, or csv format for the dual algorithm, or the snap.t (transposed) format for the primal algorithm.

Parameters:
  • max_iter (int, default : 1000) – Maximum number of iterations used by the solver to converge.
  • regularizer (float, default : 1.0) – Regularization strength. It must be a positive float. Larger regularization values imply stronger regularization.
  • verbose (boolean, default : False) – Flag for indicating if the training loss will be printed at each epoch.
  • use_gpu (bool, default : False) – Flag for indicating the hardware platform used for training. If True, the training is performed using the GPU. If False, the training is performed using the CPU.
  • class_weights ('balanced'/True or None/False, optional) – If set to ‘None’, all classes will have weight 1.
  • gpu_mem_limit (int, default : 0) – Limit of the GPU memory. If set to the default value 0, the maximum possible memory is used.
  • n_threads (int, default : -1 meaning that n_threads=256 if GPU is enabled, else 1) – Number of threads to be used.
  • tol (float, default : 0.001) – The tolerance parameter. Training will finish when maximum change in model coefficients is less than tol.
  • return_training_history (str or None, default : None) – How much information about the training should be collected and returned by the fit function. By default no information is returned (None), but this parameter can be set to “summary”, to obtain summary statistics at the end of training, or “full” to obtain a complete set of statistics for the entire training procedure. Note, enabling either option will result in slower training.
  • labelColIndex (int, default : 1) – Denotes which column is to be used as the label. Applicable only when a DataFrame is passed as input and it holds features in individual columns rather than in a DenseVector or SparseVector.
Variables:
  • coef (ndarray, shape (n_features,)) – Coefficients of the features in the trained model.
  • pred_array (ndarray, shape(number_of_test_examples,)) – binary predictions written by the predict() function
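For reference, the decision rule and the hinge loss that a regularized linear SVM optimizes can be sketched in a few lines of plain python (labels in {-1, +1}; illustrative only, not the Snap ML solver):

```python
def svm_predict(w, x):
    """Linear SVM decision rule: sign of the score w.x (labels in {-1, +1})."""
    score = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if score >= 0 else -1

def hinge_loss(w, x, y):
    """Hinge loss max(0, 1 - y * w.x) penalized by the regularized SVM objective."""
    score = sum(wi * xi for wi, xi in zip(w, x))
    return max(0.0, 1.0 - y * score)

w = [1.0, -1.0]
print(svm_predict(w, [2.0, 0.5]))    # 1   (score = 1.5)
print(hinge_loss(w, [2.0, 0.5], 1))  # 0.0: correct with margin >= 1
print(hinge_loss(w, [0.2, 0.0], 1))  # 0.8: correct but inside the margin
```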
fit(data)

learn model

Parameters:data (pyspark.sql.DataFrame or py4j.java_gateway.JavaObject; a JavaObject is a handle to data stored in JVM memory and cannot be accessed from python as a python array.) – data to fit the model
Returns:double – training loss at the final epoch
get_params()
Returns:all the initialized parameters of the Support Vector Machine Model as a python dictionary
predict(data, num_threads=0)

Predict label

Parameters:
  • data (pyspark.sql.DataFrame or py4j.java_gateway.JavaObject; a JavaObject is a handle to data stored in JVM memory and cannot be accessed from python as an array, only passed to this function to obtain predictions.) – data to make predictions on
  • num_threads – the number of threads to use for inference (default 0 means use all available threads)
Returns:

a handle to a com.ibm.snap.ml.DatasetWithPredictions java object. The handle itself cannot be accessed from python, but the predictions can be read from the pred_array_ field, which is a python array.

class snap_ml_spark.DatasetReader.DatasetReader

Load distributed dataset from file.

load(file)

Load the training data into memory

Parameters:file (string) – filename
setFormat(format)

Specify the data format of the file. Supported values: “snap”, “libsvm”, or “csv”

Parameters:format (string) – data format
setNumFt(x)

Set the number of features

Parameters:x (int) – number of features
takeRange(idx_start, idx_end)

Specify a start and end index if only part of the dataset should be loaded.

Parameters:
  • idx_start (int) – first sample to load
  • idx_end (int) – last sample to load
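Of the formats accepted by setFormat(), libsvm is a plain-text sparse format with one sample per line: a label followed by 1-based index:value pairs. A minimal parser in plain python, to illustrate the format only (the actual reader is implemented in the JVM):

```python
def parse_libsvm_line(line, num_ft):
    """Parse one libsvm-format sample: 'label idx:val idx:val ...' (1-based indices)."""
    parts = line.split()
    label = float(parts[0])
    features = [0.0] * num_ft            # dense expansion of the sparse row
    for tok in parts[1:]:
        idx, val = tok.split(":")
        features[int(idx) - 1] = float(val)
    return label, features

label, feats = parse_libsvm_line("1 1:0.5 3:2.0", num_ft=4)
print(label)  # 1.0
print(feats)  # [0.5, 0.0, 2.0, 0.0]
```

Features absent from a line are implicitly zero, which is why the number of features (setNumFt) must be supplied separately.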
snap_ml_spark.Metrics.accuracy(dataWithPredictions)
Parameters:dataWithPredictions – binary predictions computed by the LogisticRegression or SupportVectorMachine predict() function
Returns:accuracy computed based on the binary predictions of a classifier (LogisticRegression, SupportVectorMachines)
Return type:double
snap_ml_spark.Metrics.f1score(dataWithPredictions)
Parameters:dataWithPredictions – binary predictions computed by the LogisticRegression or SupportVectorMachine predict() function
Returns:f1score metric (2*(precision*recall)/(precision+recall)), computed based on the binary predictions of a classifier (LogisticRegression, SupportVectorMachines)
Return type:double
snap_ml_spark.Metrics.logisticLoss(dataWithPredictions)
Parameters:dataWithPredictions – probabilities computed by the LogisticRegression predict_proba() function
Returns:logistic loss computed by the logistic regression predicted probabilities
Return type:double
snap_ml_spark.Metrics.meanSquaredError(dataWithPredictions)
Parameters:dataWithPredictions – linear regression predictions computed by the LinearRegression predict() function
Returns:mean squared error computed based on the provided dataWithPredictions parameter
Return type:double
snap_ml_spark.Metrics.precision(dataWithPredictions)
Parameters:dataWithPredictions – binary predictions computed by the LogisticRegression or SupportVectorMachine predict() function
Returns:precision metric (TP/(TP+FP)), computed based on the binary predictions of a classifier (LogisticRegression, SupportVectorMachines)
Return type:double
snap_ml_spark.Metrics.recall(dataWithPredictions)
Parameters:dataWithPredictions – binary predictions computed by the LogisticRegression or SupportVectorMachine predict() function
Returns:recall metric (TP/(TP+FN)), computed based on the binary predictions of a classifier (LogisticRegression, SupportVectorMachines)
Return type:double
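All of the binary-classification metrics above reduce to counting true/false positives and negatives, and meanSquaredError is the usual average squared residual. A plain-python sketch of the stated formulas, assuming 0/1 labels (illustrative only; the library computes these on the distributed DatasetWithPredictions object):

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision (TP/(TP+FP)), recall (TP/(TP+FN)), and f1 for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

def mean_squared_error(y_true, y_pred):
    """Average squared residual over all samples."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

print(binary_metrics([1, 1, 0, 0], [1, 0, 1, 0]))   # (0.5, 0.5, 0.5, 0.5)
print(mean_squared_error([1.0, 2.0], [1.0, 4.0]))   # 2.0
```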
snap_ml_spark.Utils.dump_to_snap_format(X, y, filename, transpose=False, implicit_vals=False)

Write a non-distributed dataset to snap format

Parameters:
  • X (numpy array or sparse matrix) – The data used for training or inference.
  • y (numpy array) – The labels of the samples in X.
  • filename (str) – The file where X and y will be stored in snap format.
  • transpose (bool, default : False) – If transpose is True, X will be stored in transposed format.
snap_ml_spark.Utils.read_from_snap_format(filename)

Load a non-distributed dataset from snap format

Parameters:filename (str) – The file where the data resides.
Returns:X, y – X: the data used for training or inference; y: the labels of the samples in X.
Return type:numpy array or sparse matrix, numpy array
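The transpose flag stores X feature-major (one row per feature) rather than sample-major, which is the layout the primal algorithm's snap.t format expects. A toy sketch of the layout change only; the real snap format is a binary on-disk format:

```python
def transpose_rows(X):
    """Row-major (one sample per row) matrix -> feature-major layout,
    as transpose=True conceptually stores it."""
    return [list(col) for col in zip(*X)]

X = [[1.0, 2.0],   # sample 0
     [3.0, 4.0],   # sample 1
     [5.0, 6.0]]   # sample 2
print(transpose_rows(X))  # [[1.0, 3.0, 5.0], [2.0, 4.0, 6.0]] -- one row per feature
```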