cluster.DBSCAN (uses cuML)

class pai4sk.cluster.DBSCAN(eps=0.5, min_samples=5, metric='euclidean', metric_params=None, algorithm='auto', leaf_size=30, p=None, n_jobs=None, use_gpu=True, verbose=False, max_mbytes_per_batch=None, handle=None)

Perform DBSCAN clustering from vector array or distance matrix.

DBSCAN - Density-Based Spatial Clustering of Applications with Noise. Finds core samples of high density and expands clusters from them. Good for data which contains clusters of similar density.

If the input data is a cudf dataframe, the accelerated DBSCAN algorithm from cuML will be used where possible. Otherwise, scikit-learn’s DBSCAN algorithm will be used.

cuML in pai4sk is currently supported only without MPI.

If DBSCAN from cuML is run, the APIs return cudf dataframe and cudf Series objects instead of the return types of the scikit-learn API.
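As a minimal usage sketch of the interface described above: the example below passes a NumPy array, so by the rule stated here the scikit-learn code path would be taken. The fallback import of sklearn.cluster.DBSCAN is an assumption made only so the snippet can run where pai4sk is not installed; the sample data are made up for illustration.

```python
import numpy as np

try:
    from pai4sk.cluster import DBSCAN  # the class documented on this page
except ImportError:
    # Assumption: scikit-learn's DBSCAN exposes the same basic interface.
    from sklearn.cluster import DBSCAN

# Two small dense groups and one far-away point (illustrative data).
X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]], dtype=float)

model = DBSCAN(eps=3, min_samples=2).fit(X)

# Convert to plain ints for printing; noise points carry the label -1.
labels = [int(v) for v in model.labels_]
print(labels)
```

With a cudf dataframe as input, the labels would instead be a cudf Series, as noted above.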

eps : float, optional

The maximum distance between two samples for them to be considered in the same neighborhood.

min_samples : int, optional

The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself.
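The interplay of eps and min_samples can be sketched in plain Python. This is an illustrative brute-force check, not the library's implementation; the helper name core_sample_indices and the use of "distance <= eps" as the neighborhood test (following scikit-learn's convention) are assumptions.

```python
import math

def core_sample_indices(points, eps, min_samples):
    """Return indices of points whose eps-neighborhood holds >= min_samples points."""
    core = []
    for i, p in enumerate(points):
        # The count includes the point itself, since its distance to itself is 0.
        n_neighbors = sum(1 for q in points if math.dist(p, q) <= eps)
        if n_neighbors >= min_samples:
            core.append(i)
    return core

# Three points close together and one isolated point (illustrative data).
points = [(0.0, 0.0), (0.0, 0.3), (0.3, 0.0), (5.0, 5.0)]
core = core_sample_indices(points, eps=0.5, min_samples=3)
print(core)
```

Raising min_samples to 4 in this sketch leaves no core samples, because no neighborhood contains four points.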

metric : string, or callable

The metric to use when calculating distance between instances in a feature array. If metric is a string or callable, it must be one of the options allowed by pai4sk.metrics.pairwise_distances for its metric parameter. If metric is ‘precomputed’, X is assumed to be a distance matrix and must be square. X may be a sparse matrix, in which case only nonzero elements may be considered neighbors for DBSCAN. New in version 0.17: metric precomputed to accept precomputed sparse matrix.
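To illustrate what a ‘precomputed’ input looks like, the sketch below builds a square Euclidean distance matrix in plain Python. The helper name pairwise_distance_matrix is hypothetical; a matrix of this shape is what would be passed as X together with metric='precomputed'.

```python
import math

def pairwise_distance_matrix(points):
    """Build the full n x n Euclidean distance matrix as a list of rows."""
    return [[math.dist(p, q) for q in points] for p in points]

points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
D = pairwise_distance_matrix(points)

# A precomputed input must be square: one row and one column per sample.
is_square = all(len(row) == len(D) for row in D)

# Distances are symmetric, and the diagonal is zero.
is_symmetric = all(
    math.isclose(D[i][j], D[j][i]) for i in range(len(D)) for j in range(len(D))
)
```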

metric_params : dict, optional

Additional keyword arguments for the metric function. New in version 0.19.

algorithm : {‘auto’, ‘ball_tree’, ‘kd_tree’, ‘brute’, ‘cuml’}, optional

The algorithm to be used by the NearestNeighbors module to compute pointwise distances and find nearest neighbors. See NearestNeighbors module documentation for details.

If a cudf dataframe is given as input and either

(1) algorithm is set to “cuml” or
(2) algorithm is “auto”,
then pai4sk will try to use the DBSCAN algorithm from RAPIDS cuML if possible.
cuML in pai4sk is currently supported only without MPI.
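The selection rule above can be written as a small predicate. This is only a sketch of the stated conditions: the function name may_try_cuml is hypothetical, and a True result means cuML is attempted, not that it is guaranteed to run (the text says "if possible").

```python
def may_try_cuml(input_is_cudf, algorithm="auto"):
    """Return True when the documented preconditions for trying cuML hold."""
    # Both conditions are required: cudf input AND algorithm in {"cuml", "auto"}.
    return bool(input_is_cudf) and algorithm in ("cuml", "auto")

print(may_try_cuml(True, "auto"))     # cudf input, default algorithm
print(may_try_cuml(True, "kd_tree"))  # cudf input, but a scikit-learn algorithm
print(may_try_cuml(False, "cuml"))    # non-cudf input
```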

leaf_size : int, optional (default = 30)

Leaf size passed to BallTree or cKDTree. This can affect the speed of the construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem.

p : float, optional

The power of the Minkowski metric to be used to calculate distance between points.
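For reference, the Minkowski distance with power p between two points a and b is (sum_i |a_i - b_i|^p)^(1/p). The sketch below is a plain-Python illustration of that formula; the function name minkowski_distance is chosen here for the example.

```python
def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two equal-length sequences."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

a, b = (0.0, 0.0), (3.0, 4.0)

manhattan = minkowski_distance(a, b, p=1)  # p = 1 gives the Manhattan distance
euclidean = minkowski_distance(a, b, p=2)  # p = 2 gives the Euclidean distance
```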

n_jobs : int or None, optional (default=None)

The number of parallel jobs to run. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.

use_gpu : boolean, Default is True

If True, cuML will use GPU 0. Applicable only for cuML.

handle : cuml.Handle, Default is None

The cumlHandle resources to use. If it is None, a new one is created just for this class. Applicable only for cuML.

verbose : bool, Default is False

Whether to print debug output. Applicable only for cuML.

max_mbytes_per_batch : int64, optional, Default is 0

Calculate batch size using no more than this number of megabytes for the pairwise distance computation. This enables a trade-off between runtime and memory usage, making the N^2 pairwise distance computations more tractable for large numbers of samples. If you are experiencing out-of-memory errors when running DBSCAN, you can set this value based on the memory size of your device. Note: this option does not set the maximum total memory used in the DBSCAN computation, so it cannot be set to the total memory available on the device. Applicable only for cuML.
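The arithmetic behind this trade-off can be sketched as follows. This is back-of-the-envelope reasoning, not cuML's actual batching formula: it assumes 4-byte (float32) distance values, 1 megabyte = 1,000,000 bytes, and that each batch computes distances from a block of rows to all N samples. The helper names are hypothetical.

```python
import math

def rows_per_batch(n_samples, max_mbytes_per_batch, bytes_per_value=4):
    """Estimate how many rows of the N x N distance matrix fit in the budget."""
    budget_bytes = max_mbytes_per_batch * 1_000_000   # assumed MB definition
    bytes_per_row = n_samples * bytes_per_value       # one row = N distances
    # At least one row per batch, and never more rows than samples.
    return max(1, min(n_samples, budget_bytes // bytes_per_row))

def n_batches(n_samples, rows):
    """Number of batches needed to cover all rows."""
    return math.ceil(n_samples / rows)

n = 100_000
full_matrix_gb = n * n * 4 / 1e9          # memory for the whole matrix at once
rows = rows_per_batch(n, max_mbytes_per_batch=1000)
batches = n_batches(n, rows)
print(full_matrix_gb, rows, batches)
```

Under these assumptions, a smaller budget means fewer rows per batch and therefore more batches, which is the runtime versus memory trade-off the parameter controls.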

Attributes:

core_sample_indices_ : array, shape = [n_core_samples]

Indices of core samples.

components_ : array, shape = [n_core_samples, n_features]

Copy of each core sample found by training.

labels_ : array, shape = [n_samples]

Cluster labels for each point in the dataset given to fit(). Noisy samples are given the label -1.
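A common follow-up step is to summarize the labels. The sketch below works on a plain Python list with made-up values; a cudf Series returned by the cuML path would first need converting to a host-side sequence.

```python
from collections import Counter

# Illustrative labels as they might appear in labels_ (-1 marks noise).
labels = [0, 0, 1, -1, 1, 0, -1]

# Noise is not a cluster, so exclude -1 when counting clusters.
n_clusters = len(set(labels) - {-1})
n_noise = labels.count(-1)

# Number of samples assigned to each cluster.
cluster_sizes = Counter(label for label in labels if label != -1)
print(n_clusters, n_noise, dict(cluster_sizes))
```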

fit(X, y=None, sample_weight=None)

Perform DBSCAN clustering from features or distance matrix.

Parameters:

X : array or sparse (CSR) matrix of shape (n_samples, n_features), or array of shape (n_samples, n_samples), or cudf dataframe

A feature array, or array of distances between samples if metric=’precomputed’.

sample_weight : array, shape (n_samples,), optional

Weight of each sample, such that a sample with a weight of at least min_samples is by itself a core sample; a sample with negative weight may inhibit its eps-neighbor from being core. Note that weights are absolute, and default to 1.

y : Ignored

Returns:

self : object

If DBSCAN from cuML is run, then this fit method saves the computed labels as a cudf Series object instead of an array.
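The effect of sample_weight on core status, as described for fit above, can be sketched in plain Python. This is an illustration of the stated rule (total neighborhood weight compared against min_samples), not the library's code; the helper name weighted_core_samples and the "distance <= eps" test are assumptions.

```python
import math

def weighted_core_samples(points, weights, eps, min_samples):
    """Indices of points whose eps-neighborhood has total weight >= min_samples."""
    core = []
    for i, p in enumerate(points):
        # Sum the weights of all points within eps, including the point itself.
        total = sum(w for q, w in zip(points, weights) if math.dist(p, q) <= eps)
        if total >= min_samples:
            core.append(i)
    return core

# One isolated point and a pair of close neighbors (illustrative data).
points = [(0.0, 0.0), (10.0, 10.0), (10.2, 10.0)]

default = weighted_core_samples(points, [1, 1, 1], eps=0.5, min_samples=3)
heavy = weighted_core_samples(points, [3, 1, 1], eps=0.5, min_samples=3)
boosted = weighted_core_samples(points, [1, 3, 1], eps=0.5, min_samples=3)
inhibited = weighted_core_samples(points, [1, 3, -1], eps=0.5, min_samples=3)
```

In this sketch, a weight of 3 makes the isolated point a core sample by itself, while the negative weight in the last call keeps its neighbor from reaching the threshold.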

fit_predict(X, y=None, sample_weight=None)

Performs clustering on X and returns cluster labels.

Parameters:

X : array or sparse (CSR) matrix of shape (n_samples, n_features), or array of shape (n_samples, n_samples), or cudf dataframe

A feature array, or array of distances between samples if metric=’precomputed’.

sample_weight : array, shape (n_samples,), optional

Weight of each sample, such that a sample with a weight of at least min_samples is by itself a core sample; a sample with negative weight may inhibit its eps-neighbor from being core. Note that weights are absolute, and default to 1.

y : Ignored

Returns:

y : ndarray, shape (n_samples,) or cudf Series

Cluster labels. If DBSCAN from cuML is run, then this method returns the computed labels as a cudf Series object instead of an ndarray.

get_params(deep=True)

Get parameters for this estimator.

Parameters: deep (boolean, optional) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: params – Parameter names mapped to their values.
Return type: mapping of string to any
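A short sketch of get_params in use. The fallback import of sklearn.cluster.DBSCAN is an assumption made so the snippet can run without pai4sk; the parameter values are arbitrary.

```python
try:
    from pai4sk.cluster import DBSCAN  # the class documented on this page
except ImportError:
    # Assumption: scikit-learn's DBSCAN behaves the same for get_params.
    from sklearn.cluster import DBSCAN

model = DBSCAN(eps=0.3, min_samples=10)

# get_params returns a mapping from constructor parameter names to values.
params = model.get_params()
print(params["eps"], params["min_samples"])
```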