DecisionTree

class pai4sk.DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=None, min_samples_leaf=1, max_features=None, random_state=None, verbose=False, use_gpu=False, use_histograms=False, hist_nbins=64)

Decision Tree Classifier

This class implements a decision tree classifier using the IBM Snap ML library. It can be used for binary classification problems.

Parameters
  • criterion (string, optional, default : "gini") – The function used to measure the quality of a split. Possible values: “gini” for Gini impurity and “entropy” for information gain. “entropy” is currently not supported.

  • splitter (string, optional, default : "best") – This parameter defines the strategy used to choose the split at each node. Possible values: “best” and “random”. “random” is currently not supported.

  • max_depth (int or None, optional, default : None) – The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_leaf samples.

  • min_samples_leaf (int or float, optional, default : 1) – The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it generates at least min_samples_leaf training samples in each of the left and right branches.

    • If int, then consider min_samples_leaf as the minimum number.

    • If float, then consider ceil(min_samples_leaf * n_samples) as the minimum number.

  • max_features (int, float, string or None, optional, default : None) –

    The number of features to consider when looking for the best split:
    • If int, then consider max_features features at each split.

    • If float, then consider int(max_features * n_features) features at each split.

    • If “auto”, then max_features=sqrt(n_features).

    • If “sqrt”, then max_features=sqrt(n_features).

    • If “log2”, then max_features=log2(n_features).

    • If None, then max_features=n_features.

  • random_state (int or None, optional, default : None) – If int, random_state is the seed used by the random number generator; if None, the random number generator is the RandomState instance used by np.random.

  • use_gpu (bool, default : False) – Flag that indicates the hardware platform used for training. If True, the training is performed using the GPU. If False, the training is performed using the CPU. Currently only CPU training is supported.

  • use_histograms (bool, default : False) – Use histogram-based splits rather than exact splits.

  • hist_nbins (int, default : 64) – Number of histogram bins.

  • verbose (bool, default : False) – If True, it prints debugging information while training. Warning: this will increase the training time. For performance evaluation, use verbose=False.
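The rules above for min_samples_leaf and max_features can be sketched in plain Python. This is only an illustration of the documented arithmetic, not the library's code: the function names are hypothetical, and the truncation of sqrt and log2 to an integer and the lower bound of 1 are assumptions that the text above does not state.

```python
import math


def resolve_min_samples_leaf(min_samples_leaf, n_samples):
    """Return the minimum leaf size as an absolute number of samples."""
    if isinstance(min_samples_leaf, float):
        # Float: a fraction of the training set, rounded up.
        return math.ceil(min_samples_leaf * n_samples)
    # Int: already an absolute count.
    return min_samples_leaf


def resolve_max_features(max_features, n_features):
    """Return the number of features examined at each split (classifier rules)."""
    if max_features is None:
        return n_features
    if max_features in ("auto", "sqrt"):
        # Assumption: truncate to an integer and keep at least one feature.
        return max(1, int(math.sqrt(n_features)))
    if max_features == "log2":
        return max(1, int(math.log2(n_features)))
    if isinstance(max_features, float):
        # Float: a fraction of the available features.
        return max(1, int(max_features * n_features))
    # Int: already an absolute count.
    return max_features


print(resolve_min_samples_leaf(0.01, 250))  # fraction of 250 samples, rounded up
print(resolve_max_features("sqrt", 100))    # square root of 100 features
```

For the regressor, “auto” maps to n_features rather than sqrt(n_features), so the "auto" branch of this sketch would need to change accordingly.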

Variables
  • classes_ (array of shape = [n_classes]) – The class labels (single output problem)

  • n_classes_ (int) – The number of classes (for single output problems)

fit(X_train, y_train, sample_weight=None)

Fit the model according to the given training data.

Parameters
  • X_train (dense matrix (ndarray)) – Training dataset

  • y_train (array-like, shape = (n_samples,)) – The target vector corresponding to X_train.

  • sample_weight (array-like, shape = [n_samples] or None) – Sample weights. If None, then samples are equally weighted. TODO: Splits that would create child nodes with net zero or negative weight are ignored while searching for a split in each node.

Returns

Return type

None
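To make the roles of the “gini” criterion, sample_weight, and min_samples_leaf more concrete, here is a minimal pure-Python sketch of how a single candidate split could be scored. It is an illustration under stated assumptions, not Snap ML's implementation: the function names are hypothetical, weights are assumed to be positive, and the split is assumed to send samples with value <= threshold to the left child.

```python
def gini(labels, weights):
    """Weighted Gini impurity of one node: 1 - sum_k p_k^2."""
    total = sum(weights)
    mass = {}
    for label, w in zip(labels, weights):
        mass[label] = mass.get(label, 0.0) + w
    return 1.0 - sum((m / total) ** 2 for m in mass.values())


def split_impurity(values, labels, threshold, weights=None, min_samples_leaf=1):
    """Weighted average Gini impurity of the two children of a split.

    Returns None when either child would hold fewer than
    min_samples_leaf samples, i.e. the split is not considered.
    """
    if weights is None:
        # No sample weights: every sample counts equally.
        weights = [1.0] * len(values)
    left = [i for i, v in enumerate(values) if v <= threshold]
    right = [i for i, v in enumerate(values) if v > threshold]
    if len(left) < min_samples_leaf or len(right) < min_samples_leaf:
        return None
    total = sum(weights)
    result = 0.0
    for side in (left, right):
        side_w = [weights[i] for i in side]
        side_y = [labels[i] for i in side]
        # Each child contributes in proportion to its share of the weight.
        result += (sum(side_w) / total) * gini(side_y, side_w)
    return result


values = [1.0, 2.0, 3.0, 4.0]
labels = [0, 0, 1, 1]
print(split_impurity(values, labels, threshold=2.0))                      # pure children
print(split_impurity(values, labels, threshold=1.0, min_samples_leaf=2))  # rejected split
```

A lower value indicates a better split; a “best” splitter would keep the threshold with the lowest value among those that are not rejected.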

get_params()

Get the values of the model parameters.

Returns

params

Return type

dict

predict(X, num_threads=0)

Class predictions

Predicts the class of each sample in X.

Parameters
  • X (dense matrix (ndarray)) – Dataset used for predicting class estimates.

  • num_threads (int, default : 0) – Number of threads used to run inference. By default, inference runs with the maximum number of available threads.

Returns

pred – Returns the predicted class of the sample.

Return type

array-like, shape = (n_samples,)

predict_log_proba(X, num_threads=0)

Log of probability estimates

Computes log-probability estimates for the two classes. Supported only for binary classification.

Parameters
  • X (dense matrix (ndarray)) – Dataset used for predicting log-probability estimates.

  • num_threads (int, default : 0) – Number of threads used to run inference. By default, inference runs with the maximum number of available threads.

Returns

Return type

None

predict_proba(X, num_threads=0)

Probability estimates

Computes probability estimates for the two classes. Supported only for binary classification.

Parameters
  • X (dense matrix (ndarray)) – Dataset used for predicting probability estimates.

  • num_threads (int, default : 0) – Number of threads used to run inference. By default, inference runs with the maximum number of available threads.

Returns

Return type

None
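The relationship between probability estimates, log-probability estimates, and class predictions in the binary case can be sketched as follows. This is a hedged illustration only: it assumes the scikit-learn convention of one row [p0, p1] per sample and natural logarithms, neither of which is confirmed by the text above, and the helper names are hypothetical.

```python
import math


def log_proba(proba_rows):
    """Element-wise natural log of per-sample probabilities [p0, p1]."""
    return [
        [math.log(p) if p > 0 else float("-inf") for p in row]
        for row in proba_rows
    ]


def predict_from_proba(proba_rows, classes=(0, 1)):
    """Pick, for each sample, the class with the larger probability."""
    return [classes[row.index(max(row))] for row in proba_rows]


proba = [[0.2, 0.8], [0.9, 0.1], [1.0, 0.0]]
print(log_proba(proba))           # zero probability maps to -inf
print(predict_from_proba(proba))  # class index with the larger probability
```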

class pai4sk.DecisionTreeRegressor(criterion='mse', splitter='best', max_depth=None, min_samples_leaf=1, max_features=None, random_state=None, verbose=False, use_gpu=False, use_histograms=False, hist_nbins=64)

Decision Tree Regressor

This class implements a decision tree regressor using the IBM Snap ML library. It can be used for regression tasks.

Parameters
  • criterion (string, optional, default : "mse") – The function used to measure the quality of a split. Possible values: “mse” for mean squared error. “friedman_mse” and “mae” are currently not supported.

  • splitter (string, optional, default : "best") – This parameter defines the strategy used to choose the split at each node. Possible values: “best” and “random”. “random” is currently not supported.

  • max_depth (int or None, optional, default : None) – The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_leaf samples.

  • min_samples_leaf (int or float, optional, default : 1) – The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it generates at least min_samples_leaf training samples in each of the left and right branches.

    • If int, then consider min_samples_leaf as the minimum number.

    • If float, then consider ceil(min_samples_leaf * n_samples) as the minimum number.

  • max_features (int, float, string or None, optional, default : None) –

    The number of features to consider when looking for the best split:
    • If int, then consider max_features features at each split.

    • If float, then consider int(max_features * n_features) features at each split.

    • If “auto”, then max_features=n_features.

    • If “sqrt”, then max_features=sqrt(n_features).

    • If “log2”, then max_features=log2(n_features).

    • If None, then max_features=n_features.

  • random_state (int or None, optional, default : None) – If int, random_state is the seed used by the random number generator; if None, the random number generator is the RandomState instance used by np.random.

  • use_gpu (bool, default : False) – Flag that indicates the hardware platform used for training. If True, the training is performed using the GPU. If False, the training is performed using the CPU. Currently only CPU training is supported.

  • use_histograms (bool, default : False) – Use histogram-based splits rather than exact splits.

  • hist_nbins (int, default : 64) – Number of histogram bins.

  • verbose (bool, default : False) – If True, it prints debugging information while training. Warning: this will increase the training time. For performance evaluation, use verbose=False.
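The difference between exact and histogram-based splits (use_histograms, hist_nbins) under the “mse” criterion can be illustrated with a small single-feature sketch. It is not Snap ML's algorithm: the equal-width binning, the midpoint thresholds in exact mode, and all function names are assumptions chosen to keep the example short.

```python
def sse(ys):
    """Sum of squared errors around the mean (n times the MSE)."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)


def candidate_thresholds(xs, use_histograms, hist_nbins=64):
    """Exact mode: midpoints between sorted distinct values.
    Histogram mode: interior edges of hist_nbins equal-width bins."""
    lo, hi = min(xs), max(xs)
    if use_histograms:
        width = (hi - lo) / hist_nbins
        return [lo + width * b for b in range(1, hist_nbins)]
    distinct = sorted(set(xs))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]


def best_split(xs, ys, use_histograms=False, hist_nbins=64, min_samples_leaf=1):
    """Return (threshold, cost) with the lowest total squared error, or None."""
    best = None
    for t in candidate_thresholds(xs, use_histograms, hist_nbins):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        # Skip splits that would leave a child with too few samples.
        if len(left) < min_samples_leaf or len(right) < min_samples_leaf:
            continue
        cost = sse(left) + sse(right)
        if best is None or cost < best[1]:
            best = (t, cost)
    return best


xs = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
ys = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
print(best_split(xs, ys))                                    # exact candidates
print(best_split(xs, ys, use_histograms=True, hist_nbins=4))  # bin-edge candidates
```

In this sketch the exact mode evaluates one threshold per pair of adjacent distinct values, while the histogram mode evaluates at most hist_nbins - 1 thresholds regardless of the number of samples, which is the usual motivation for histogram-based training.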

fit(X_train, y_train, sample_weight=None)

Fit the model according to the given training data.

Parameters
  • X_train (dense matrix (ndarray)) – Training dataset

  • y_train (array-like, shape = (n_samples,)) – The target vector corresponding to X_train.

  • sample_weight (array-like, shape = [n_samples] or None) – Sample weights. If None, then samples are equally weighted. TODO: Splits that would create child nodes with net zero or negative weight are ignored while searching for a split in each node.

Returns

Return type

None

get_params()

Get the values of the model parameters.

Returns

params

Return type

dict

predict(X, num_threads=0)

Regression predictions

Predicts the target value of each sample in X.

Parameters
  • X (dense matrix (ndarray)) – Dataset used for predicting regression estimates.

  • num_threads (int, default : 0) – Number of threads used to run inference. By default, inference runs with the maximum number of available threads.

Returns

pred – Returns the predicted values of the samples.

Return type

array-like, shape = (n_samples,)