cluster.DBSCAN

class pai4sk.cluster.DBSCAN(eps=0.5, min_samples=5, metric='euclidean', metric_params=None, algorithm='auto', leaf_size=30, p=None, n_jobs=None, use_gpu=True)

Perform DBSCAN clustering from vector array or distance matrix.

DBSCAN - Density-Based Spatial Clustering of Applications with Noise. Finds core samples of high density and expands clusters from them. Good for data which contains clusters of similar density.
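The core-sample and expansion idea described above can be sketched in plain Python. This is a minimal illustration, not the pai4sk, scikit-learn, or cuML implementation; the names `dbscan_sketch`, `region_query`, and `euclidean` are hypothetical helpers introduced only for this example, and the brute-force neighbor search is an assumption made for brevity.

```python
NOISE = -1  # label used for noisy samples, matching the labels_ convention


def euclidean(a, b):
    """Euclidean distance between two equal-length coordinate tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def region_query(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if euclidean(points[i], q) <= eps]


def dbscan_sketch(points, eps=0.5, min_samples=5):
    """Illustrative DBSCAN: find core samples, then expand clusters from them."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_samples:
            labels[i] = NOISE  # may still be claimed later as a border point
            continue
        labels[i] = cluster_id  # i is a core sample and starts a new cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core sample
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is also core, so keep expanding
        cluster_id += 1
    return labels


points = [(0, 0), (0, 0.3), (0.3, 0), (5, 5), (5, 5.3), (5.3, 5), (10, 10)]
labels = dbscan_sketch(points, eps=0.5, min_samples=3)
```

With this toy data, the two tight groups form separate clusters and the isolated point at (10, 10) is labeled as noise.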

If the input data is a cuDF DataFrame and cuML is usable, the accelerated DBSCAN algorithm from cuML is used. Otherwise, scikit-learn’s DBSCAN algorithm is used.

cuML in pai4sk is currently supported only

(a) with Python 3.6 and
(b) without MPI.

If DBSCAN from cuML is run, the APIs return cuDF DataFrame and cuDF Series objects instead of the return types of the scikit-learn API.

eps : float, optional

The maximum distance between two samples for them to be considered as in the same neighborhood.

min_samples : int, optional

The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself.

metric : string, or callable

The metric to use when calculating distance between instances in a feature array. If metric is a string or callable, it must be one of the options allowed by pai4sk.metrics.pairwise_distances for its metric parameter.

If metric is ‘precomputed’, X is assumed to be a distance matrix and must be square. X may be a sparse matrix, in which case only nonzero elements may be considered neighbors for DBSCAN.

New in version 0.17: metric ‘precomputed’ accepts a precomputed sparse matrix.
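To make the ‘precomputed’ case concrete, the sketch below builds a square distance matrix from a few points and reads eps-neighborhoods directly from its rows. It uses only the standard library; `pairwise_euclidean`, `check_square`, and `eps_neighbors` are hypothetical helper names for illustration and are not part of pai4sk.

```python
import math


def pairwise_euclidean(points):
    """Dense n x n matrix of Euclidean distances between all pairs of points."""
    return [[math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b))) for b in points]
            for a in points]


def check_square(D):
    """Raise if D is not square, as required of a precomputed distance matrix."""
    n = len(D)
    if any(len(row) != n for row in D):
        raise ValueError("a precomputed distance matrix must be square")
    return n


def eps_neighbors(D, i, eps):
    """Indices whose precomputed distance to sample i is at most eps."""
    return [j for j, d in enumerate(D[i]) if d <= eps]


D = pairwise_euclidean([(0, 0), (0, 1), (3, 4)])
n_samples = check_square(D)
near_first = eps_neighbors(D, 0, eps=1.0)
```

A matrix like `D` is the kind of input that would be passed as X when metric is ‘precomputed’; no feature vectors are needed once the distances are known.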

metric_params : dict, optional

Additional keyword arguments for the metric function. New in version 0.19.

algorithm : {‘auto’, ‘ball_tree’, ‘kd_tree’, ‘brute’, ‘cuml’}, optional

The algorithm to be used by the NearestNeighbors module to compute pointwise distances and find nearest neighbors. See NearestNeighbors module documentation for details.

If a cuDF DataFrame is given as input and algorithm is either

(1) “cuml” or
(2) “auto”,

then pai4sk will try to use the DBSCAN algorithm from RAPIDS cuML if possible. cuML in pai4sk is currently supported only

(a) with Python 3.6 and
(b) without MPI.
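The selection rule above can be restated as a small predicate. This is only a hypothetical restatement of the documented conditions, not pai4sk's dispatch code: the function name `would_try_cuml` and all of its parameters are assumptions made for illustration, and a True result means only that the documented preconditions hold, since the text says cuML is used "if possible".

```python
def would_try_cuml(is_cudf_input, algorithm="auto",
                   python_version=(3, 6), using_mpi=False):
    """Hypothetical check of the documented preconditions for the cuML path."""
    # The algorithm must be left at 'auto' or set explicitly to 'cuml'.
    requested = algorithm in ("auto", "cuml")
    # Documented support limits: Python 3.6 only, and no MPI.
    supported = tuple(python_version[:2]) == (3, 6) and not using_mpi
    return bool(is_cudf_input and requested and supported)


cases = {
    "cudf + auto": would_try_cuml(True, "auto"),
    "cudf + kd_tree": would_try_cuml(True, "kd_tree"),
    "array + cuml": would_try_cuml(False, "cuml"),
    "cudf + cuml under MPI": would_try_cuml(True, "cuml", using_mpi=True),
}
```

Under these assumptions, only the first case qualifies for the cuML path; the others would fall back to scikit-learn's DBSCAN as described at the top of this page.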

leaf_size : int, optional (default = 30)

Leaf size passed to BallTree or cKDTree. This can affect the speed of the construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem.

p : float, optional

The power of the Minkowski metric to be used to calculate distance between points.
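As a reminder of what p controls, here is a minimal sketch of the Minkowski distance in plain Python. The function `minkowski` is a hypothetical helper written for this example, not a pai4sk call; p=1 gives the Manhattan distance and p=2 the Euclidean distance.

```python
def minkowski(a, b, p=2.0):
    """Minkowski distance of order p between two equal-length sequences."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


d_manhattan = minkowski((0, 0), (3, 4), p=1)  # |3| + |4|
d_euclidean = minkowski((0, 0), (3, 4), p=2)  # sqrt(3**2 + 4**2)
```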

n_jobs : int or None, optional (default=None)

The number of parallel jobs to run. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.

use_gpu : boolean, optional (default = True)

If True, cuML will use GPU 0. Applicable only for cuML.

Attributes:

core_sample_indices_ : array, shape = [n_core_samples]

Indices of core samples.

components_ : array, shape = [n_core_samples, n_features]

Copy of each core sample found by training.

labels_ : array, shape = [n_samples]

Cluster labels for each point in the dataset given to fit(). Noisy samples are given the label -1.
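Because noise is encoded as -1 in labels_, counting clusters requires excluding that value. The sketch below assumes the labels are already available as a plain Python sequence of integers (for the cuML path, the cuDF Series would first need to be converted to a host sequence); `summarize_labels` is a hypothetical helper, not part of pai4sk.

```python
def summarize_labels(labels):
    """Return (number of clusters, number of noisy samples) for a label sequence."""
    clusters = set(labels) - {-1}  # -1 marks noise, so it is not a cluster
    n_noise = sum(1 for label in labels if label == -1)
    return len(clusters), n_noise


n_clusters, n_noise = summarize_labels([0, 0, 1, -1, 1, -1, 2])
```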

fit(X, y=None, sample_weight=None)

Perform DBSCAN clustering from features or distance matrix.

Parameters:

X : array or sparse (CSR) matrix of shape (n_samples, n_features), or array of shape (n_samples, n_samples), or cuDF DataFrame

A feature array, or array of distances between samples if metric=’precomputed’.

sample_weight : array, shape (n_samples,), optional

Weight of each sample, such that a sample with a weight of at least min_samples is by itself a core sample; a sample with negative weight may inhibit its eps-neighbor from being core. Note that weights are absolute, and default to 1.

y : Ignored

Returns:

self : object

If DBSCAN from cuML is run, this method stores the computed labels as a cuDF Series object instead of an array.
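The effect of sample_weight on the core-sample test described for fit can be sketched as follows. This is an illustration under the assumption that a sample is core when the summed weight of its eps-neighborhood (itself included) reaches min_samples; `is_core` and `euclidean` are hypothetical helpers, not pai4sk functions.

```python
def euclidean(a, b):
    """Euclidean distance between two equal-length coordinate tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def is_core(i, points, weights, eps, min_samples):
    """True if the total weight within eps of points[i] is at least min_samples."""
    total = sum(w for q, w in zip(points, weights)
                if euclidean(points[i], q) <= eps)
    return total >= min_samples


points = [(0, 0), (0, 0.1), (9, 9)]

# An isolated sample whose own weight reaches min_samples is core by itself.
heavy_alone = is_core(2, points, [1, 1, 5], eps=0.5, min_samples=5)

# A negatively weighted neighbor can keep an otherwise qualifying sample
# from being core: 3 + (-1) = 2, which is below min_samples = 3.
inhibited = is_core(0, points, [3, -1, 1], eps=0.5, min_samples=3)
```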

fit_predict(X, y=None, sample_weight=None)

Performs clustering on X and returns cluster labels.

Parameters:

X : array or sparse (CSR) matrix of shape (n_samples, n_features), or array of shape (n_samples, n_samples), or cuDF DataFrame

A feature array, or array of distances between samples if metric=’precomputed’.

sample_weight : array, shape (n_samples,), optional

Weight of each sample, such that a sample with a weight of at least min_samples is by itself a core sample; a sample with negative weight may inhibit its eps-neighbor from being core. Note that weights are absolute, and default to 1.

y : Ignored

Returns:

y : ndarray, shape (n_samples,) or cuDF Series

Cluster labels. If DBSCAN from cuML is run, this method returns the computed labels as a cuDF Series object instead of an ndarray.