DES Clustering

class deslib.des.des_clustering.DESClustering(pool_classifiers=None, clustering=None, pct_accuracy=0.5, voting='hard', pct_diversity=0.33, more_diverse=True, metric_diversity='DF', metric_performance='accuracy_score', n_clusters=5, random_state=None, DSEL_perc=0.5, n_jobs=-1)[source]

Dynamic ensemble selection-Clustering (DES-Clustering).

This method selects an ensemble of classifiers taking into account the accuracy and diversity of the base classifiers. The K-means algorithm is used to define the region of competence. For each cluster, the N most accurate classifiers are selected first. Then, the J most diverse classifiers among those N are selected to compose the ensemble.

Parameters:
pool_classifiers : list of classifiers (Default = None)

The pool of classifiers trained for the corresponding classification problem. Each base classifier should support the method “predict”. If None, then the pool of classifiers is a bagging classifier.

clustering : sklearn.cluster (Default = None)

The clustering model used to estimate the region of competence. If None, a KMeans with K = 5 is used.

pct_accuracy : float (Default = 0.5)

Percentage of base classifiers selected based on accuracy.

pct_diversity : float (Default = 0.33)

Percentage of base classifiers selected based on diversity.

more_diverse : Boolean (Default = True)

Whether to select the most or the least diverse classifiers to add to the pre-selected ensemble.

metric_diversity : String (Default = ‘DF’)

Metric used to estimate the diversity of the base classifiers. Can be the double fault (DF), the Q-statistic (Q), or error correlation.

metric_performance : String (Default = ‘accuracy_score’)

Metric used to estimate the performance of a base classifier on a cluster. Can be any metric from sklearn.metrics.

random_state : int, RandomState instance or None, optional (default=None)

If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

DSEL_perc : float (Default = 0.5)

Percentage of the input data used to fit DSEL. Note: This parameter is only used if the pool of classifiers is None or unfitted.

voting : {‘hard’, ‘soft’}, default=‘hard’

If ‘hard’, uses predicted class labels for majority rule voting. Else if ‘soft’, predicts the class label based on the argmax of the sums of the predicted probabilities, which is recommended for an ensemble of well-calibrated classifiers.

n_jobs : int, default=-1

The number of parallel jobs to run. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. Doesn’t affect fit method.
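The two `voting` modes described above can disagree on the same set of base classifier outputs. The following is a minimal NumPy sketch of both rules, not DESlib's own code; the helper names `hard_vote` and `soft_vote` and the probability values are made up for illustration.

```python
import numpy as np


def hard_vote(labels):
    """Majority vote over integer class labels of shape (n_classifiers,)."""
    return int(np.bincount(labels).argmax())


def soft_vote(probas):
    """Argmax of the summed probabilities, shape (n_classifiers, n_classes)."""
    return int(np.sum(probas, axis=0).argmax())


# Illustrative outputs of three base classifiers for one query sample.
probas = np.array([
    [0.9, 0.1],  # confident in class 0
    [0.4, 0.6],  # weakly prefers class 1
    [0.4, 0.6],  # weakly prefers class 1
])
labels = probas.argmax(axis=1)  # hard labels: [0, 1, 1]

hard = hard_vote(labels)  # two of three labels are class 1
soft = soft_vote(probas)  # summed probabilities are about [1.7, 1.3]
```

Here the majority of labels favours class 1 while the summed probabilities favour class 0, which is why soft voting is only advisable when the base classifiers produce well-calibrated probabilities.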

References

Soares, R. G., Santana, A., Canuto, A. M., & de Souto, M. C. P. “Using accuracy and diversity to select classifiers to build ensembles.” International Joint Conference on Neural Networks (IJCNN), 2006.

Britto, Alceu S., Robert Sabourin, and Luiz ES Oliveira. “Dynamic selection of classifiers—a comprehensive review.” Pattern Recognition 47.11 (2014): 3665-3680.

R. M. O. Cruz, R. Sabourin, and G. D. Cavalcanti, “Dynamic classifier selection: Recent advances and perspectives,” Information Fusion, vol. 41, pp. 195-216, 2018.

estimate_competence(competence_region, distances=None, predictions=None)[source]

Get the competence estimates of each base classifier \(c_{i}\) for the classification of the query sample.

In this case, the competences were already pre-calculated for each cluster. So this method finds the nearest cluster and gets the pre-calculated competences of the base classifiers for that cluster.
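The lookup described above can be sketched with plain NumPy. This is an illustration under stated assumptions rather than the library's implementation: the centroids, the per-cluster competence table, and the helper `lookup_competences` are all hypothetical, and Euclidean distance to the centroid is assumed.

```python
import numpy as np

# Hypothetical cluster centroids found at fit time (2 clusters, 2 features).
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])

# Hypothetical pre-calculated table: row k holds the performance of each of
# the 3 base classifiers on the samples that fell into cluster k.
cluster_competence = np.array([
    [0.90, 0.60, 0.70],
    [0.50, 0.80, 0.95],
])


def lookup_competences(X, centroids, table):
    """Return the nearest cluster index and its competence row per sample."""
    # Pairwise Euclidean distances, shape (n_samples, n_clusters).
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)
    return nearest, table[nearest]


queries = np.array([[1.0, 0.5], [4.0, 6.0]])
nearest, competences = lookup_competences(queries, centroids, cluster_competence)
# competences has shape (n_samples, n_classifiers)
```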

Parameters:
predictions : array of shape (n_samples, n_classifiers)

Predictions of the base classifiers for all test examples.

Returns:
competences : array of shape (n_samples, n_classifiers)

The competence level estimated for each base classifier.

fit(X, y)[source]

Train the DS model by setting the Clustering algorithm and pre-processing the information required to apply the DS methods.

First, the data is divided into K clusters. Then, for each cluster, the N most accurate classifiers are selected, and from those the J most diverse classifiers are chosen to compose the ensemble of that cluster. As a result, an ensemble of classifiers is assigned to each of the K clusters.
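The per-cluster selection step can be sketched as follows. This is a simplified illustration, not DESlib's implementation: it assumes the double fault as the diversity measure, scores each candidate by its mean double fault against the other pre-selected classifiers, and treats a lower double fault as more diverse. The function names and the correctness matrix are hypothetical.

```python
import numpy as np


def double_fault(correct_a, correct_b):
    """Fraction of samples that both classifiers misclassify."""
    return float(np.mean(~correct_a & ~correct_b))


def select_for_cluster(correct, n_accurate, n_diverse):
    """Pick classifier indices for one cluster.

    correct: boolean array of shape (n_samples_in_cluster, n_classifiers),
    True where a classifier labelled the sample correctly.
    """
    accuracy = correct.mean(axis=0)
    # Step 1: keep the N most accurate classifiers (stable sort for ties).
    top_n = np.argsort(-accuracy, kind="stable")[:n_accurate]
    if len(top_n) < 2:
        return top_n[:n_diverse]

    # Step 2: mean double fault of each survivor against the other survivors.
    scores = []
    for i in top_n:
        pairs = [double_fault(correct[:, i], correct[:, j])
                 for j in top_n if j != i]
        scores.append(np.mean(pairs))

    # Lowest mean double fault first, i.e. most diverse first.
    order = np.argsort(scores, kind="stable")[:n_diverse]
    return top_n[order]


# Illustrative correctness matrix: 6 samples in the cluster, 4 classifiers.
correct = np.array([
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 1, 1],
    [1, 0, 0, 1],
    [0, 1, 0, 1],
], dtype=bool)

ensemble = select_for_cluster(correct, n_accurate=3, n_diverse=2)
```

In this toy data, classifier 3 is dropped in the accuracy step, and classifier 2 is dropped in the diversity step because it shares an error with each of classifiers 0 and 1, while those two never fail on the same sample.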

Parameters:
X : array of shape (n_samples, n_features)

Data used to fit the model.

y : array of shape (n_samples)

Class labels of each example in X.

Returns:
self

predict(X)[source]

Predict the class label for each sample in X.

Parameters:
X : array of shape (n_samples, n_features)

The input data.

Returns:
predicted_labels : array of shape (n_samples)

Predicted class label for each sample in X.

predict_proba(X)[source]

Estimate the posterior probabilities for each sample in X.

Parameters:
X : array of shape (n_samples, n_features)

The input data.

Returns:
predicted_proba : array of shape (n_samples, n_classes)

Probability estimates for each sample in X.

score(X, y, sample_weight=None)[source]

Return the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.

Parameters:
X : array-like of shape (n_samples, n_features)

Test samples.

y : array-like of shape (n_samples,) or (n_samples, n_outputs)

True labels for X.

sample_weight : array-like of shape (n_samples,), default=None

Sample weights.

Returns:
score : float

Mean accuracy of self.predict(X) wrt. y.
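For reference, the mean accuracy with optional sample weights amounts to a weighted average of per-sample matches. The helper `mean_accuracy` below is a hypothetical NumPy sketch of that computation for single-label targets, not the method's actual code.

```python
import numpy as np


def mean_accuracy(y_true, y_pred, sample_weight=None):
    """Weighted fraction of predictions that equal the true labels."""
    matches = (np.asarray(y_true) == np.asarray(y_pred)).astype(float)
    return float(np.average(matches, weights=sample_weight))


y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]  # the third prediction is wrong

plain = mean_accuracy(y_true, y_pred)
# Doubling the weight of the misclassified sample lowers the score.
weighted = mean_accuracy(y_true, y_pred, sample_weight=[1, 1, 2, 1])
```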

select(competences)[source]

Select an ensemble with the most accurate and most diverse classifiers for the classification of the query.

The ensemble for each cluster was already pre-calculated in the fit method. So, this method takes the closest cluster and returns the ensemble associated with that cluster.

Parameters:
competences : array of shape (n_samples)

Array containing closest cluster index.

Returns:
selected_classifiers : array of shape (n_samples, self.k)

Indices of the selected base classifiers for each test example.
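Because the ensembles are fixed per cluster, selection reduces to an index lookup. A minimal sketch, assuming the per-cluster ensembles are stored as rows of an integer array (the name `cluster_ensembles` and its contents are hypothetical):

```python
import numpy as np

# Hypothetical ensembles stored at fit time: row k lists the indices of the
# base classifiers assigned to cluster k (3 clusters, 2 classifiers each).
cluster_ensembles = np.array([
    [0, 2],
    [1, 3],
    [0, 3],
])

# Closest cluster index for each of 4 query samples.
nearest_cluster = np.array([2, 0, 0, 1])

# Fancy indexing returns one row of classifier indices per query sample.
selected = cluster_ensembles[nearest_cluster]
```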