DecisionTree

class pai4sk.DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=None, min_samples_leaf=1, max_features=None, random_state=None, n_threads=1, use_histograms=False, hist_nbins=256, use_gpu=False, gpu_id=0, verbose=False)

Decision Tree Classifier

This class implements a decision tree classifier using the IBM Snap ML library. It can be used for binary classification problems.

Parameters
  • criterion (string, optional, default : "gini") – The function used to measure the quality of a split. Possible values: “gini” for Gini impurity and “entropy” for information gain. “entropy” is currently not supported.

  • splitter (string, optional, default : "best") – This parameter defines the strategy used to choose the split at each node. Possible values: “best” and “random”. “random” is currently not supported.

  • max_depth (int or None, optional, default : None) – The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_leaf samples.

  • min_samples_leaf (int or float, optional, default : 1) – The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it generates at least min_samples_leaf training samples in each of the left and right branches.
    • If int, then consider min_samples_leaf as the minimum number.

    • If float, then consider ceil(min_samples_leaf * n_samples) as the minimum number.

  • max_features (int, float, string or None, optional, default : None) –

    The number of features to consider when looking for the best split:
    • If int, then consider max_features features at each split.

    • If float, then consider int(max_features * n_features) features at each split.

    • If “auto”, then max_features=sqrt(n_features).

    • If “sqrt”, then max_features=sqrt(n_features).

    • If “log2”, then max_features=log2(n_features).

    • If None, then max_features=n_features.

  • random_state (int or None, optional, default : None) – If int, random_state is the seed used by the random number generator. If None, the random number generator is the RandomState instance used by np.random.

  • n_threads (integer, optional, default : 1) – The number of CPU threads to use.

  • use_histograms (boolean, default : False) – Use histogram-based splits rather than exact splits.

  • hist_nbins (int, default : 256) – Number of histogram bins.

  • use_gpu (boolean, default : False) – Use GPU acceleration (only supported for histogram-based splits).

  • gpu_id (int, default: 0) – Device ID of the GPU which will be used when GPU acceleration is enabled.

  • verbose (bool, default : False) – If True, it prints debugging information while training. Warning: this will increase the training time. For performance evaluation, use verbose=False.
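The int, float and string forms of min_samples_leaf and max_features described above can be resolved to absolute counts as in the following sketch. The helper names resolve_min_samples_leaf and resolve_max_features, and the auto_is_sqrt flag, are hypothetical and not part of pai4sk. Truncating the square root and logarithm to an integer with a floor of one feature is an assumption, since the text above only states the formulas.

```python
import math


def resolve_min_samples_leaf(min_samples_leaf, n_samples):
    """Turn the documented int-or-float setting into an absolute sample count."""
    if isinstance(min_samples_leaf, int):
        return min_samples_leaf
    # Float: a fraction of the training set, rounded up.
    return math.ceil(min_samples_leaf * n_samples)


def resolve_max_features(max_features, n_features, auto_is_sqrt=True):
    """Turn the documented max_features setting into a feature count.

    auto_is_sqrt=True follows the classifier, where "auto" means sqrt(n_features).
    """
    if max_features is None:
        return n_features
    if isinstance(max_features, str):
        if max_features == "sqrt" or (max_features == "auto" and auto_is_sqrt):
            value = math.sqrt(n_features)
        elif max_features == "auto":
            value = n_features
        elif max_features == "log2":
            value = math.log2(n_features)
        else:
            raise ValueError("unknown max_features: %r" % max_features)
        # Assumption: truncate to an integer and keep at least one feature.
        return max(1, int(value))
    if isinstance(max_features, int):
        return max_features
    # Float: a fraction of the available features, truncated.
    return int(max_features * n_features)


print(resolve_min_samples_leaf(0.01, 250))  # fraction of 250 samples, rounded up
print(resolve_max_features("auto", 100))    # classifier meaning of "auto"
```

Passing auto_is_sqrt=False reproduces the case where “auto” is documented as n_features.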

Variables
  • classes_ (array of shape = [n_classes]) – The class labels (single output problem)

  • n_classes_ (int) – The number of classes (for single output problems)

fit(X_train, y_train, sample_weight=None)

Fit the model according to the given training data.

Parameters
  • X_train (dense matrix (ndarray)) – Training dataset

  • y_train (array-like, shape = (n_samples,)) – The target vector corresponding to X_train.

  • sample_weight (array-like, shape = [n_samples] or None) – Sample weights. If None, then samples are equally weighted. TODO: Splits that would create child nodes with net zero or negative weight are ignored while searching for a split in each node.

Returns

Return type

None

get_params()

Get the values of the model parameters.

Returns

params

Return type

dict

predict(X, num_threads=0)

Class predictions

Returns the predicted class for each sample in X.

Parameters
  • X (dense matrix (ndarray)) – Dataset used for predicting class estimates.

  • num_threads (int, default : 0) – Number of threads used to run inference. With the default of 0, inference runs with the maximum number of available threads.

Returns

pred – The predicted class of each sample.

Return type

array-like, shape = (n_samples,)
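A minimal usage sketch for fit and predict follows. It uses only the constructor and method names documented above. The synthetic data, the float32 dtype and the 0/1 label encoding are assumptions made for illustration, and the helper train_and_predict is hypothetical. The imports are deferred so the snippet still loads on a machine without pai4sk.

```python
# Constructor arguments use only names from the documented signature.
clf_params = {
    "criterion": "gini",
    "max_depth": 4,
    "min_samples_leaf": 1,
    "n_threads": 2,
    "random_state": 42,
}


def train_and_predict(params):
    """Fit a classifier on synthetic data and return its predictions."""
    # Imported lazily so this file still loads where pai4sk is absent.
    import numpy as np
    from pai4sk import DecisionTreeClassifier

    rng = np.random.RandomState(0)
    # Dense ndarray input, as fit() and predict() require.
    X = rng.rand(200, 5).astype(np.float32)  # dtype is an assumption
    # Two classes only: the classifier is documented for binary problems.
    y = (X[:, 0] > 0.5).astype(np.float32)   # 0/1 encoding is an assumption

    clf = DecisionTreeClassifier(**params)
    clf.fit(X, y)
    # num_threads=0 (the default) uses the maximum number of available threads.
    return clf.predict(X, num_threads=0)


if __name__ == "__main__":
    try:
        pred = train_and_predict(clf_params)
        print(len(pred))
    except ImportError:
        print("pai4sk or numpy is not installed; nothing to run")
```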

predict_log_proba(X, num_threads=0)

Log of probability estimates

Computes the log-probability estimates for the two classes. Only for binary classification.

Parameters
  • X (dense matrix (ndarray)) – Dataset used for predicting log-probability estimates.

  • num_threads (int, default : 0) – Number of threads used to run inference. With the default of 0, inference runs with the maximum number of available threads.

Returns

Return type

None

predict_proba(X, num_threads=0)

Probability estimates

Computes the probability estimates for the two classes. Only for binary classification.

Parameters
  • X (dense matrix (ndarray)) – Dataset used for predicting probability estimates.

  • num_threads (int, default : 0) – Number of threads used to run inference. With the default of 0, inference runs with the maximum number of available threads.

Returns

Return type

None

class pai4sk.DecisionTreeRegressor(criterion='mse', splitter='best', max_depth=None, min_samples_leaf=1, max_features=None, random_state=None, n_threads=1, use_histograms=False, hist_nbins=256, use_gpu=False, gpu_id=0, verbose=False)

Decision Tree Regressor

This class implements a decision tree regressor using the IBM Snap ML library. It can be used for regression tasks.

Parameters
  • criterion (string, optional, default : "mse") – The function used to measure the quality of a split. Possible values: “mse” for mean squared error. “friedman_mse” and “mae” are currently not supported.

  • splitter (string, optional, default : "best") – This parameter defines the strategy used to choose the split at each node. Possible values: “best” and “random”. “random” is currently not supported.

  • max_depth (int or None, optional, default : None) – The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_leaf samples.

  • min_samples_leaf (int or float, optional, default : 1) – The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it generates at least min_samples_leaf training samples in each of the left and right branches.
    • If int, then consider min_samples_leaf as the minimum number.

    • If float, then consider ceil(min_samples_leaf * n_samples) as the minimum number.

  • max_features (int, float, string or None, optional, default : None) –

    The number of features to consider when looking for the best split:
    • If int, then consider max_features features at each split.

    • If float, then consider int(max_features * n_features) features at each split.

    • If “auto”, then max_features=n_features.

    • If “sqrt”, then max_features=sqrt(n_features).

    • If “log2”, then max_features=log2(n_features).

    • If None, then max_features=n_features.

  • random_state (int or None, optional, default : None) – If int, random_state is the seed used by the random number generator. If None, the random number generator is the RandomState instance used by np.random.

  • n_threads (integer, optional, default : 1) – The number of CPU threads to use.

  • use_histograms (boolean, default : False) – Use histogram-based splits rather than exact splits.

  • hist_nbins (int, default : 256) – Number of histogram bins.

  • use_gpu (boolean, default : False) – Use GPU acceleration (only supported for histogram-based splits).

  • gpu_id (int, default: 0) – Device ID of the GPU which will be used when GPU acceleration is enabled.

  • verbose (bool, default : False) – If True, it prints debugging information while training. Warning: this will increase the training time. For performance evaluation, use verbose=False.
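To illustrate how the “mse” criterion, the “best” splitter and min_samples_leaf interact, the sketch below searches one feature for the exact split with the lowest weighted mean squared error. It is a textbook illustration in plain Python, not the Snap ML implementation, and the function names mse and best_split_mse are hypothetical.

```python
def mse(values):
    """Mean squared error of values around their own mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def best_split_mse(x, y, min_samples_leaf=1):
    """Return (threshold, weighted_mse) for the best split on one feature.

    Returns None when no split leaves min_samples_leaf samples on each side.
    """
    order = sorted(range(len(x)), key=lambda i: x[i])
    xs = [x[i] for i in order]
    ys = [y[i] for i in order]
    n = len(xs)
    best = None
    # i is the size of the left branch; both branches must be large enough.
    for i in range(min_samples_leaf, n - min_samples_leaf + 1):
        if xs[i - 1] == xs[i]:
            continue  # equal feature values cannot be separated by a threshold
        left, right = ys[:i], ys[i:]
        score = (len(left) * mse(left) + len(right) * mse(right)) / n
        if best is None or score < best[1]:
            # Place the threshold midway between the two neighbouring values.
            best = ((xs[i - 1] + xs[i]) / 2, score)
    return best


print(best_split_mse([1, 2, 3, 4], [1.0, 1.0, 5.0, 5.0]))
```

Raising min_samples_leaf shrinks the set of candidate split points, so a node with too few samples is left as a leaf.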

fit(X_train, y_train, sample_weight=None)

Fit the model according to the given training data.

Parameters
  • X_train (dense matrix (ndarray)) – Training dataset

  • y_train (array-like, shape = (n_samples,)) – The target vector corresponding to X_train.

  • sample_weight (array-like, shape = [n_samples] or None) – Sample weights. If None, then samples are equally weighted. TODO: Splits that would create child nodes with net zero or negative weight are ignored while searching for a split in each node.

Returns

Return type

None

get_params()

Get the values of the model parameters.

Returns

params

Return type

dict

predict(X, num_threads=0)

Regression predictions

Returns the regression estimate for each sample in X.

Parameters
  • X (dense matrix (ndarray)) – Dataset used for predicting regression estimates.

  • num_threads (int, default : 0) – Number of threads used to run inference. With the default of 0, inference runs with the maximum number of available threads.

Returns

pred – The predicted value of each sample.

Return type

array-like, shape = (n_samples,)
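A minimal usage sketch for the regressor follows, using histogram-based splits. The check_params helper is hypothetical: it encodes the documented constraint that GPU acceleration is only supported together with histogram-based splits, and whether pai4sk itself raises an error for that combination is not stated here. The synthetic data and the float32 dtype are also assumptions.

```python
# Constructor arguments use only names from the documented signature.
reg_params = {
    "criterion": "mse",
    "max_depth": 6,
    "use_histograms": True,
    "hist_nbins": 64,
    "use_gpu": False,
    "gpu_id": 0,
}


def check_params(params):
    """Reject use_gpu=True without use_histograms=True (our own pre-check)."""
    if params.get("use_gpu", False) and not params.get("use_histograms", False):
        raise ValueError("use_gpu=True requires use_histograms=True")
    return params


def train_and_predict(params):
    """Fit a regressor on synthetic data and return its predictions."""
    # Imported lazily so this file still loads where pai4sk is absent.
    import numpy as np
    from pai4sk import DecisionTreeRegressor

    rng = np.random.RandomState(0)
    # Dense ndarray input, as fit() and predict() require.
    X = rng.rand(200, 5).astype(np.float32)  # dtype is an assumption
    y = (3.0 * X[:, 0] + X[:, 1]).astype(np.float32)

    reg = DecisionTreeRegressor(**check_params(params))
    reg.fit(X, y)
    # Limit inference to two threads instead of the default maximum.
    return reg.predict(X, num_threads=2)


if __name__ == "__main__":
    try:
        pred = train_and_predict(reg_params)
        print(len(pred))
    except ImportError:
        print("pai4sk or numpy is not installed; nothing to run")
```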