How to Use Forest-Guided Clustering
Basic Usage
To explain a Random Forest model with Forest-Guided Clustering (FGC), follow a three-step workflow: compute the forest-guided clusters, evaluate feature importance, and visualize the results.
# choose the distance metric that matches your task (define only one of the two)
# for classification tasks
clustering_distance_metric = DistanceRandomForestProximity()
# for regression tasks
# clustering_distance_metric = DistanceRandomForestLCA()

# compute the forest-guided clusters
fgc = forest_guided_clustering(
    estimator=model,
    X=X,
    y=y,
    clustering_distance_metric=clustering_distance_metric,
    clustering_strategy=ClusteringKMedoids(),
)

# evaluate feature importance for best k
fgc_fi = forest_guided_feature_importance(
    X=X,
    y=y,
    y_pred=model.predict(X),
    cluster_labels=fgc.cluster_labels[fgc.best_k],
)

# visualize the results
plot_forest_guided_clustering(
    ks=fgc.ks,
    scores=fgc.scores,
    mean_ji=fgc.mean_ji,
    cluster_jis=fgc.cluster_jis,
    best_k=fgc.best_k,
)
plot_forest_guided_feature_importance(
    feature_importance_local=fgc_fi.feature_importance_local,
    feature_importance_global=fgc_fi.feature_importance_global,
)
plot_forest_guided_decision_paths(
    data_clustering=fgc_fi.data_clustering,
    feature_importance_global=fgc_fi.feature_importance_global,
    feature_importance_local=fgc_fi.feature_importance_local,
    model_type=fgc.model_type,
)
where
- estimator is the trained Random Forest model
- X is the feature matrix
- y is the target variable
- clustering_distance_metric defines how similarity between samples is measured based on the Random Forest structure
- clustering_strategy determines how the distance-based clustering is performed
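To make these inputs concrete, the following sketch trains a small scikit-learn Random Forest on a toy dataset and computes one common forest-based distance: one minus the fraction of trees in which two samples land in the same leaf. The dataset, the hyperparameters, and the helper proximity_distance are assumptions made for illustration; the helper is not part of FGC, and DistanceRandomForestProximity() may compute its distances differently.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Toy inputs: X is the feature matrix, y is the target variable.
X, y = load_iris(return_X_y=True)

# estimator: a trained Random Forest model (hyperparameters chosen arbitrarily).
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
model.fit(X, y)


def proximity_distance(forest, X):
    """Hypothetical helper: 1 - fraction of trees in which two samples share a leaf."""
    leaves = forest.apply(X)  # shape (n_samples, n_trees): leaf index per tree
    n_samples, n_trees = leaves.shape
    proximity = np.zeros((n_samples, n_samples))
    for t in range(n_trees):
        column = leaves[:, t]
        # Add 1 for every pair of samples that ends up in the same leaf of tree t.
        proximity += column[:, None] == column[None, :]
    return 1.0 - proximity / n_trees


distance = proximity_distance(model, X)
print(distance.shape)
```

A distance matrix of this kind is what a distance-based clustering strategy such as K-Medoids operates on: samples that the forest routes through the same leaves are close, regardless of how far apart they are in raw feature space.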
For a detailed walkthrough, refer to the Introduction to FGC: Simple Use Cases notebook.
Using FGC on Large Datasets
For datasets with a large number of samples, FGC offers several strategies to reduce runtime and improve scalability:
- Parallelize Cluster Optimization: Leverage multiple CPU cores by setting the n_jobs parameter to a value greater than 1 in the forest_guided_clustering() function. This parallelizes the bootstrapping process used to evaluate cluster stability.
- Use a Faster Clustering Algorithm: Improve the efficiency of the K-Medoids clustering step by using the optimized "fasterpam" algorithm. Set the method parameter of your clustering strategy (e.g., ClusteringKMedoids(method="fasterpam")) to activate this faster implementation.
- Enable Subsampling with CLARA: For extremely large datasets, consider using the CLARA (Clustering Large Applications) variant by choosing ClusteringClara() as your clustering strategy. CLARA performs clustering on smaller random subsamples, making it suitable for high-volume data.
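The subsampling idea behind CLARA can be sketched in a few lines of NumPy. This is a simplified illustration, not the implementation behind ClusteringClara(): it uses Euclidean distances on synthetic 2D points instead of a forest-based distance, a basic alternating k-medoids routine instead of an optimized solver, and the function names and parameter values (k_medoids, clara, n_subsamples, subsample_size) are assumptions chosen for the example.

```python
import numpy as np


def k_medoids(dist, k, rng, n_iter=20):
    """Basic alternating k-medoids on a precomputed distance matrix."""
    medoids = rng.choice(dist.shape[0], size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(dist[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size:
                # New medoid: the member with the smallest total distance to its cluster.
                within = dist[np.ix_(members, members)].sum(axis=1)
                new_medoids[c] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return medoids


def clara(points, k, n_subsamples=5, subsample_size=40, seed=0):
    """Cluster several random subsamples and keep the medoids with the lowest full-data cost."""
    rng = np.random.default_rng(seed)
    best_cost, best_medoids = np.inf, None
    for _ in range(n_subsamples):
        idx = rng.choice(len(points), size=subsample_size, replace=False)
        sub = points[idx]
        # Pairwise distances are only computed within the small subsample.
        sub_dist = np.linalg.norm(sub[:, None, :] - sub[None, :, :], axis=2)
        medoids = idx[k_medoids(sub_dist, k, rng)]
        # Score the candidate medoids on the full dataset (n x k distances only).
        to_medoids = np.linalg.norm(points[:, None, :] - points[medoids][None, :, :], axis=2)
        cost = to_medoids.min(axis=1).sum()
        if cost < best_cost:
            best_cost, best_medoids = cost, medoids
    to_best = np.linalg.norm(points[:, None, :] - points[best_medoids][None, :, :], axis=2)
    return best_medoids, np.argmin(to_best, axis=1), best_cost


# Synthetic data: three groups of 200 points each.
data_rng = np.random.default_rng(1)
points = np.vstack([data_rng.normal(loc=c, scale=0.3, size=(200, 2)) for c in (0.0, 5.0, 10.0)])
medoids, labels, cost = clara(points, k=3)
print(medoids, cost)
```

The saving comes from never building the full n x n distance matrix: each round needs only a subsample-sized matrix plus n x k distances to the candidate medoids, at the price of medoids that are only as good as the best subsample drawn.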
For a detailed example, please refer to the notebook Special Case: FGC for Big Datasets.