DES-KNN

class deslib.des.des_knn.DESKNN(pool_classifiers=None, k=7, DFP=False, with_IH=False, safe_k=None, IH_rate=0.3, pct_accuracy=0.5, pct_diversity=0.3, more_diverse=True, metric='DF', random_state=None, knn_classifier='knn', knn_metric='minkowski', knne=False, DSEL_perc=0.5, n_jobs=-1, voting='hard')[source]

Dynamic Ensemble Selection KNN (DES-KNN).

This method selects an ensemble of classifiers taking into account the accuracy and diversity of the base classifiers. The k-NN algorithm is used to define the region of competence. The N most accurate classifiers in the region of competence are first selected. Then, the J most diverse classifiers among those N are selected to compose the ensemble.

Parameters:
pool_classifiers : list of classifiers (Default = None)

The generated pool of classifiers trained for the corresponding classification problem. Each base classifier should support the method “predict”. If None, the pool of classifiers is a bagging classifier.

k : int (Default = 7)

Number of neighbors used to estimate the competence of the base classifiers.

DFP : Boolean (Default = False)

Determines if the dynamic frienemy pruning is applied.

with_IH : Boolean (Default = False)

Whether the hardness level of the region of competence is used to decide between using the DS algorithm or the KNN for classification of a given query sample.

safe_k : int (default = None)

The size of the indecision region.

IH_rate : float (default = 0.3)

Hardness threshold. If the hardness level of the competence region is lower than the IH_rate the KNN classifier is used. Otherwise, the DS algorithm is used for classification.

pct_accuracy : float (Default = 0.5)

Percentage of base classifiers selected based on accuracy.

pct_diversity : float (Default = 0.3)

Percentage of base classifiers selected based on diversity.

more_diverse : Boolean (Default = True)

Whether to select the most or the least diverse classifiers to add to the pre-selected ensemble.

metric : String (Default = ‘DF’)

Metric used to estimate the diversity of the base classifiers. Can be either the double fault (‘DF’), the Q-statistics (‘Q’), or the error correlation (‘ratio’).

random_state : int, RandomState instance or None, optional (default=None)

If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

knn_classifier : {‘knn’, ‘faiss’, None} (Default = ‘knn’)

The algorithm used to estimate the region of competence:

  • ‘knn’ will use KNeighborsClassifier from sklearn.
  • ‘faiss’ will use Facebook’s Faiss similarity search through the class FaissKNNClassifier.
  • None will use the sklearn KNeighborsClassifier.
knn_metric : {‘minkowski’, ‘cosine’, ‘mahalanobis’} (Default = ‘minkowski’)

The metric used by the k-NN classifier to estimate distances.

  • ‘minkowski’ will use the Minkowski distance.
  • ‘cosine’ will use the cosine distance.
  • ‘mahalanobis’ will use the Mahalanobis distance.
knne : bool (Default=False)

Whether to use K-Nearest Neighbor Equality (KNNE) for the region of competence estimation (an implementation is available in deslib.util.knne).

DSEL_perc : float (Default = 0.5)

Percentage of the input data used to fit DSEL. Note: This parameter is only used if the pool of classifiers is None or unfitted.

voting : {‘hard’, ‘soft’}, default=’hard’

If ‘hard’, uses predicted class labels for majority rule voting. If ‘soft’, predicts the class label based on the argmax of the sums of the predicted probabilities, which is recommended for an ensemble of well-calibrated classifiers. A small illustration of the two rules follows this parameter list.

n_jobs : int, default=-1

The number of parallel jobs to run. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. Doesn’t affect fit method.
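The difference between the two voting rules can be seen in a small standalone sketch (illustrative NumPy only, not DESlib internals; the arrays below are made-up predictions of three selected base classifiers for one query sample):

import numpy as np

labels = np.array([0, 1, 1])              # predicted class labels (input to 'hard' voting)
probas = np.array([[0.90, 0.10],
                   [0.40, 0.60],
                   [0.45, 0.55]])         # predicted probabilities (input to 'soft' voting)

hard_vote = np.bincount(labels).argmax()  # majority rule -> class 1
soft_vote = probas.sum(axis=0).argmax()   # summed probabilities [1.75, 1.25] -> class 0

Here the two rules disagree: two classifiers weakly favor class 1, but the third is very confident in class 0, which dominates the probability sum.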

References

Soares, R. G., Santana, A., Canuto, A. M., & de Souto, M. C. P. “Using accuracy and diversity to select classifiers to build ensembles.” International Joint Conference on Neural Networks (IJCNN), 2006.

Britto, Alceu S., Robert Sabourin, and Luiz ES Oliveira. “Dynamic selection of classifiers—a comprehensive review.” Pattern Recognition 47.11 (2014): 3665-3680.

R. M. O. Cruz, R. Sabourin, and G. D. Cavalcanti, “Dynamic classifier selection: Recent advances and perspectives,” Information Fusion, vol. 41, pp. 195–216, 2018.
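Examples

A minimal usage sketch (illustrative, not part of the original reference; the dataset, pool, and hyperparameter choices are arbitrary assumptions): generate a bagging pool, fit DES-KNN on a held-out dynamic selection dataset (DSEL), and evaluate on test data.

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from deslib.des.des_knn import DESKNN

X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
# Hold out half of the training data as the dynamic selection dataset (DSEL).
X_train, X_dsel, y_train, y_dsel = train_test_split(X_train, y_train, test_size=0.5, random_state=42)

pool = BaggingClassifier(DecisionTreeClassifier(max_depth=3), n_estimators=30, random_state=42)
pool.fit(X_train, y_train)

desknn = DESKNN(pool_classifiers=pool, k=7, pct_accuracy=0.5, pct_diversity=0.3)
desknn.fit(X_dsel, y_dsel)            # DSEL defines the regions of competence
print(desknn.score(X_test, y_test))   # mean accuracy on the test set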

estimate_competence(competence_region, distances=None, predictions=None)[source]

Estimate the competence level of each base classifier \(c_{i}\) for the classification of the query sample.

The competence is estimated using the accuracy and diversity criteria. First, the classification accuracy of the base classifiers in the region of competence is estimated. Then, the diversity of the base classifiers is estimated.

The method returns two arrays: one containing the accuracy and the other containing the diversity of each base classifier.

Parameters:
competence_region : array of shape (n_samples, n_neighbors)

Indices of the k nearest neighbors for each test sample.

distances : array of shape (n_samples, n_neighbors)

Distances from the k nearest neighbors to the query.

predictions : array of shape (n_samples, n_classifiers)

Predictions of the base classifiers for all test examples.

Returns:
accuracy : array of shape (n_samples, n_classifiers)

Local Accuracy estimates (competences) of the base classifiers for all query samples.

diversity : array of shape (n_samples, n_classifiers)

Average pairwise diversity of each base classifier for all test examples.

Notes

This technique uses both accuracy and diversity information to perform dynamic selection. For this reason the method returns two arrays containing these two criteria, instead of a single ndarray of competence level estimates for each base classifier.
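For intuition, here is a standalone sketch of the double-fault (‘DF’) diversity measure referenced by the metric parameter (illustrative code, not the library’s internal implementation; DESlib ships its own diversity measures in deslib.util.diversity):

import numpy as np

def double_fault(y_true, pred_a, pred_b):
    # Fraction of samples misclassified by BOTH classifiers; lower values
    # mean the pair fails together less often, i.e. is more diverse.
    both_wrong = np.logical_and(pred_a != y_true, pred_b != y_true)
    return both_wrong.mean()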

fit(X, y)[source]

Prepare the DS model by setting the KNN algorithm and pre-processing the information required to apply the DS method.

Parameters:
X : array of shape (n_samples, n_features)

Data used to fit the model.

y : array of shape (n_samples)

Class labels of each example in X.

Returns:
self
predict(X)[source]

Predict the class label for each sample in X.

Parameters:
X : array of shape (n_samples, n_features)

The input data.

Returns:
predicted_labels : array of shape (n_samples)

Predicted class label for each sample in X.

predict_proba(X)[source]

Estimates the posterior probabilities for each sample in X.

Parameters:
X : array of shape (n_samples, n_features)

The input data.

Returns:
predicted_proba : array of shape (n_samples, n_classes)

Probability estimates for each sample in X.
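Continuing the usage sketch from the Examples section above (desknn and X_test are the hypothetical names introduced there); note that the base classifiers in the pool must themselves implement predict_proba:

import numpy as np

proba = desknn.predict_proba(X_test)        # shape (n_samples, n_classes)
assert np.allclose(proba.sum(axis=1), 1.0)  # each row is a probability distribution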

score(X, y, sample_weight=None)[source]

Return the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.

Parameters:
X : array-like of shape (n_samples, n_features)

Test samples.

y : array-like of shape (n_samples,) or (n_samples, n_outputs)

True labels for X.

sample_weight : array-like of shape (n_samples,), default=None

Sample weights.

Returns:
score : float

Mean accuracy of self.predict(X) wrt. y.

select(accuracy, diversity)[source]

Select an ensemble containing the N most accurate and the J most diverse classifiers for the classification of the query sample.

Parameters:
accuracy : array of shape (n_samples, n_classifiers)

Local Accuracy estimates (competences) of each base classifier.

diversity : array of shape (n_samples, n_classifiers)

Average pairwise diversity of each base classifier.

Returns:
selected_classifiers : array of shape (n_samples, self.J)

Array containing the indices of the J selected base classifiers for each test example.
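A hedged sketch of this selection rule for a single query sample follows (the function name and the exact rounding used to derive N and J are assumptions for illustration, not the library’s internals):

import numpy as np

def select_one(accuracy, diversity, pct_accuracy=0.5, pct_diversity=0.3, more_diverse=True):
    # accuracy, diversity: arrays of shape (n_classifiers,) for one query sample.
    n_classifiers = accuracy.shape[0]
    N = int(n_classifiers * pct_accuracy)            # size of the accuracy pre-selection
    J = int(np.ceil(n_classifiers * pct_diversity))  # size of the final ensemble
    most_accurate = np.argsort(accuracy)[::-1][:N]   # N most accurate classifiers
    order = np.argsort(diversity[most_accurate])     # least diverse first
    if more_diverse:
        order = order[::-1]                          # flip: most diverse first
    return most_accurate[order[:J]]                  # indices of the J selected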