How to Use Forest-Guided Clustering

Basic Usage

To explain a Random Forest model with Forest-Guided Clustering (FGC), follow a three-step workflow: compute the forest-guided clusters, evaluate feature importance, and visualize the results.
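
The workflow takes a fitted model together with X and y as its starting point. A minimal sketch of preparing such inputs with scikit-learn is given here; the synthetic dataset and the hyperparameters are assumptions chosen for illustration only, and the container type FGC expects (NumPy array versus pandas DataFrame) is not covered by this sketch.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Illustrative synthetic data (an assumption, not part of FGC):
# 300 samples, 8 features, binary target.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

# Fit the Random Forest that is to be explained afterwards.
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
model.fit(X, y)

print(f"Training accuracy: {model.score(X, y):.3f}")
```

For a regression task, RandomForestRegressor would be fitted in the same way.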

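# Assumes `model` is a fitted Random Forest and that the fgclustering
# functions and classes used below have already been imported.

# Choose ONE distance metric, depending on the task.
# For classification tasks:
clustering_distance_metric = DistanceRandomForestProximity()
# For regression tasks, use this instead:
# clustering_distance_metric = DistanceRandomForestLCA()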

# compute the forest-guided clusters
fgc = forest_guided_clustering(
    estimator=model,
    X=X,
    y=y,
    clustering_distance_metric=clustering_distance_metric,
    clustering_strategy=ClusteringKMedoids(),
)

# evaluate feature importance for best k
fgc_fi = forest_guided_feature_importance(
    X=X,
    y=y,
    y_pred=model.predict(X),
    cluster_labels=fgc.cluster_labels[fgc.best_k],
)

# visualize the results
plot_forest_guided_clustering(
    ks=fgc.ks,
    scores=fgc.scores,
    mean_ji=fgc.mean_ji,
    cluster_jis=fgc.cluster_jis,
    best_k=fgc.best_k,
)

plot_forest_guided_feature_importance(
    feature_importance_local=fgc_fi.feature_importance_local,
    feature_importance_global=fgc_fi.feature_importance_global,
)

plot_forest_guided_decision_paths(
    data_clustering=fgc_fi.data_clustering,
    feature_importance_global=fgc_fi.feature_importance_global,
    feature_importance_local=fgc_fi.feature_importance_local,
    model_type=fgc.model_type,
)

The arguments passed to forest_guided_clustering() are:

  • estimator is the trained Random Forest model

  • X is the feature matrix

  • y is the target variable

  • clustering_distance_metric defines how similarity between samples is measured based on the Random Forest structure

  • clustering_strategy determines how the distance-based clustering is performed
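
To make the idea of a forest-based distance concrete, the sketch below computes the classic Random Forest proximity (the fraction of trees in which two samples end up in the same leaf) and turns it into a distance as 1 - proximity. This is a generic illustration built on scikit-learn's apply() method; the helper name rf_proximity_distance is hypothetical, and the code is not necessarily how DistanceRandomForestProximity is implemented.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def rf_proximity_distance(forest, X):
    """Return an (n, n) matrix of 1 - proximity, where proximity is the
    fraction of trees in which two samples fall into the same leaf."""
    leaves = forest.apply(X)  # shape (n_samples, n_trees): leaf index per tree
    n_samples, n_trees = leaves.shape
    proximity = np.zeros((n_samples, n_samples))
    for t in range(n_trees):
        col = leaves[:, t]
        # Add 1 for every pair of samples sharing a leaf in tree t.
        proximity += col[:, None] == col[None, :]
    proximity /= n_trees
    return 1.0 - proximity


# Illustrative data and model (assumptions for this sketch only).
X_demo, y_demo = make_classification(n_samples=60, n_features=6, random_state=0)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

D = rf_proximity_distance(forest, X_demo)
print(D.shape)
```

A matrix like D is the kind of input a distance-based method such as K-Medoids can work on, since it needs only pairwise distances and no coordinates.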

For a detailed walkthrough, refer to the Introduction to FGC: Simple Use Cases notebook.

Using FGC on Large Datasets

For datasets with a large number of samples, Forest-Guided Clustering (FGC) offers several ways to keep runtime and memory use manageable:

  • Parallelize Cluster Optimization: Leverage multiple CPU cores by setting the n_jobs parameter to a value greater than 1 in the forest_guided_clustering() function. This will parallelize the bootstrapping process for evaluating cluster stability.

  • Use a Faster Clustering Algorithm: Improve the efficiency of the K-Medoids clustering step by using the optimized "fasterpam" algorithm. Set the method parameter of your clustering strategy (e.g., ClusteringKMedoids(method="fasterpam")) to activate this faster implementation.

  • Enable Subsampling with CLARA: For extremely large datasets, consider using the CLARA (Clustering Large Applications) variant by choosing ClusteringClara() as your clustering strategy. CLARA performs clustering on smaller random subsamples, making it suitable for high-volume data.
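
The subsampling idea behind CLARA can be sketched in a few lines of NumPy: cluster several small random subsamples, assign all points to the medoids found, and keep the medoid set with the lowest total cost. The code below is an illustration under stated assumptions and not the ClusteringClara implementation: it uses Euclidean distance on raw points, a simple alternating k-medoids as a stand-in for PAM or FasterPAM, and hypothetical names (pairwise_dist, k_medoids, clara, n_draws, sample_size).

```python
import numpy as np


def pairwise_dist(A, B):
    """Euclidean distances between the rows of A and the rows of B."""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)


def k_medoids(D, k, rng, max_iter=50):
    """Simple alternating k-medoids on a precomputed distance matrix D."""
    n = D.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(max_iter):
        labels = np.argmin(D[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size:  # keep the old medoid if a cluster is empty
                within = D[np.ix_(members, members)].sum(axis=0)
                new_medoids[c] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return medoids


def clara(X, k, n_draws=5, sample_size=40, seed=0):
    """Run k-medoids on random subsamples and keep the cheapest medoid set."""
    rng = np.random.default_rng(seed)
    best_cost, best_medoids = np.inf, None
    for _ in range(n_draws):
        idx = rng.choice(len(X), size=min(sample_size, len(X)), replace=False)
        local = k_medoids(pairwise_dist(X[idx], X[idx]), k, rng)
        medoids = idx[local]  # map subsample positions back to rows of X
        # Cost on the FULL dataset: distance of each point to its nearest medoid.
        cost = pairwise_dist(X, X[medoids]).min(axis=1).sum()
        if cost < best_cost:
            best_cost, best_medoids = cost, medoids
    labels = pairwise_dist(X, X[best_medoids]).argmin(axis=1)
    return best_medoids, labels, best_cost


# Illustrative data: three Gaussian blobs with 200 points each.
rng_demo = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X_demo = np.vstack([c + rng_demo.normal(scale=0.5, size=(200, 2)) for c in centers])

medoids, labels, cost = clara(X_demo, k=3)
print(medoids, labels.shape)
```

The full pairwise distance matrix is only ever built for the 40-point subsamples; the complete dataset is touched only through its distances to the k candidate medoids, which is what makes the approach attractive when the number of samples is large.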

For a detailed example, please refer to the notebook Special Case: FGC for Big Datasets.