Special Case: Running Forest-Guided Clustering on Large Datasets#
π In this tutorial, we demonstrate how to apply Forest-Guided Clustering (FGC) efficiently on large datasets. FGC relies on pairwise sample similarities derived from a trained Random Forest model. Depending on the chosen distance metric, these similarities are computed either from shared terminal leaves (proximity-based distance) or from shared decision-path structure (LCA-based distance). While these approaches enable highly model-specific and interpretable clustering, they can become computationally demanding as the number of samples increases. Several steps in the FGC pipeline may contribute to high runtime and memory usage on large datasets:
Computing large pairwise distance matrices (size (N :nbsphinx-math:`times `N))
Running bootstrapped Jaccard Index evaluations to estimate clustering stability
Repeated K-Medoids clustering, especially during optimization of the number of clusters
For example, if k=None (default), FGC will evaluate clustering stability for each k from 2 to 6 (5 values). With JI_bootstrap_iter=30 bootstrap iterations, this leads to 155 K-Medoids runs (5 without bootstrapping + 5 Γ 30 bootstrapped). Each run operates on the full N Γ N proximity matrix, which can quickly exhaust memory and CPU resources on large datasets. In this tutorial, weβll walk through practical strategies to improve the runtime and scalability of FGC, so you can still uncover actionable patterns from large data:
Parallelize operations
Use faster K-Medoids optimization methods
Apply subsampling to reduce computational load
Store distance matrices as disk-backed memory-mapped arrays instead of fully in RAM
π¦ Installation: To get started, you need to install the fgclustering package. Please follow the instructions on the official installation guide.
π§ Note: For a general introduction to FGC, please refer to our Introduction Notebook.
β οΈ Note: Total runtime depends on your systemβs resources. Timing results shown here may vary slightly, but relative performance gains should remain consistent.
Imports:
[ ]:
## Import the Forest-Guided Clustering package
from fgclustering import (
forest_guided_clustering,
DistanceRandomForestLCA,
ClusteringKMedoids,
ClusteringClara
)
## Imports for datasets
from sklearn.datasets import fetch_california_housing
## Additional imports for use-cases
import time
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
π The California Housing Dataset#
For this special case, weβll once again use the California Housing dataset, which contains 20,640 samples, a size large enough to noticeably impact the runtime of Forest-Guided Clustering (FGC). This makes it a suitable candidate for exploring how FGC scales and how we can improve its performance. This dataset was also used in Use Case 3, please refer there for a full description. To benchmark different runtime optimization strategies, we will train a Random Forest model on the full dataset and compare the runtime and memory usage across various configurations.
[2]:
data_housing = fetch_california_housing(as_frame=True)
data_housing = data_housing.frame
data_housing.rename(columns={'MedHouseVal':'target'}, inplace=True)
X_housing = data_housing.loc[:, data_housing.columns != 'target']
y_housing = data_housing.target
print(f"Traing model for {X_housing.shape[0]} house blocks.")
Traing model for 20640 house blocks.
[3]:
model = RandomForestRegressor(max_samples=0.8, max_depth=20, max_features='log2', n_estimators=100, bootstrap=True, oob_score=True, random_state=42)
model.fit(X_housing, y_housing)
print(f'OOB accuracy of prediction model: {round(model.oob_score_,3)}')
OOB accuracy of prediction model: 0.819
Benchmarking FGC Runtime on Large Datasets#
To ensure that the runtime comparison remains manageable, we will use only the first 10,000 samples of the dataset. This subset is still large enough to demonstrate the impact of various performance optimization strategies, while keeping execution times reasonable.
[4]:
X_housing_subset = X_housing.iloc[:10000].reset_index(drop=True)
y_housing_subet = y_housing.iloc[:10000].reset_index(drop=True)
β±οΈ Baseline Runtime#
To evaluate how Forest-Guided Clustering (FGC) performs on large datasets, we first define a baseline configuration. This will serve as the reference point for evaluating performance improvements.
[5]:
clustering_strategy = ClusteringKMedoids(method="pam")
n_jobs = 1
start = time.perf_counter()
fgc = forest_guided_clustering(
estimator=model,
X=X_housing_subset,
y=y_housing_subet,
clustering_distance_metric=DistanceRandomForestLCA(),
clustering_strategy=clustering_strategy,
n_jobs=n_jobs,
)
end = time.perf_counter()
minutes = (end - start)
print(f"Baseline Runtime: β±οΈ {minutes:.2f} seconds")
Using a sample size of 10.00% of the input data for Jaccard Index computation.
Using range k = (2, 6) to optimize k.
Optimizing k: 100%|ββββββββββ| 5/5 [02:20<00:00, 28.18s/it]
Optimal number of clusters k = 6
Clustering Evaluation Summary:
k Score Stable Mean_JI Cluster_JI
2 0.797185 True 0.927 {1: 0.92, 2: 0.933}
3 0.901761 True 0.735 {1: 0.683, 2: 0.68, 3: 0.842}
4 0.555171 True 0.785 {1: 0.793, 2: 0.79, 3: 0.811, 4: 0.744}
5 0.459049 True 0.758 {1: 0.827, 2: 0.838, 3: 0.697, 4: 0.539, 5: 0.891}
6 0.454685 True 0.813 {1: 0.873, 2: 0.85, 3: 0.686, 4: 0.756, 5: 0.822, 6: 0.892}
Baseline Runtime: β±οΈ 145.88 seconds
π Parallelizing Bootstrap Evaluation#
FGC calculates the Jaccard Index over multiple bootstrapped subsets, which can be computationally intensive. You can parallelize this process using the n_jobs parameter:
n_jobs=1: default, no parallelismn_jobs=-1: use all available CPU coresn_jobs=n: use n parallel workers
We now re-run FGC with n_jobs=5, enabling concurrent execution of the Jaccard bootstrap process.
[6]:
clustering_strategy = ClusteringKMedoids(method="pam")
n_jobs = 5
start = time.perf_counter()
fgc = forest_guided_clustering(
estimator=model,
X=X_housing_subset,
y=y_housing_subet,
clustering_distance_metric=DistanceRandomForestLCA(),
clustering_strategy=clustering_strategy,
n_jobs=n_jobs,
)
end = time.perf_counter()
minutes = (end - start)
print(f"Runtime: β±οΈ {minutes:.2f} seconds")
Using a sample size of 10.00% of the input data for Jaccard Index computation.
Using range k = (2, 6) to optimize k.
Optimizing k: 100%|ββββββββββ| 5/5 [02:06<00:00, 25.32s/it]
Optimal number of clusters k = 6
Clustering Evaluation Summary:
k Score Stable Mean_JI Cluster_JI
2 0.797185 True 0.927 {1: 0.92, 2: 0.933}
3 0.901761 True 0.732 {1: 0.689, 2: 0.669, 3: 0.837}
4 0.555171 True 0.792 {1: 0.818, 2: 0.777, 3: 0.816, 4: 0.759}
5 0.459049 True 0.752 {1: 0.837, 2: 0.833, 3: 0.688, 4: 0.542, 5: 0.86}
6 0.454685 True 0.800 {1: 0.878, 2: 0.835, 3: 0.654, 4: 0.737, 5: 0.803, 6: 0.891}
Runtime: β±οΈ 131.61 seconds
By parallelizing the bootstrap phase, we already observe a noticeable speedup.
β‘ Speeding Up Clustering#
FGC uses K-Medoids to cluster the proximity matrix. By default, it uses the classic PAM (Partitioning Around Medoids) algorithm, which is accurate but computationally expensive. The fasterpam algorithm is an optimized variant that reduces redundant distance calculations and improves convergence speed while maintaining clustering quality. You can activate it in FGC by setting: ClusteringKMedoids(method="fasterpam").
[7]:
clustering_strategy = ClusteringKMedoids(method="fasterpam")
n_jobs = 1
start = time.perf_counter()
fgc = forest_guided_clustering(
estimator=model,
X=X_housing_subset,
y=y_housing_subet,
clustering_distance_metric=DistanceRandomForestLCA(),
clustering_strategy=clustering_strategy,
n_jobs=n_jobs,
)
end = time.perf_counter()
minutes = (end - start)
print(f"Runtime: β±οΈ {minutes:.2f} seconds")
Using a sample size of 10.00% of the input data for Jaccard Index computation.
Using range k = (2, 6) to optimize k.
Optimizing k: 100%|ββββββββββ| 5/5 [01:45<00:00, 21.03s/it]
Optimal number of clusters k = 6
Clustering Evaluation Summary:
k Score Stable Mean_JI Cluster_JI
2 0.797185 True 0.936 {1: 0.93, 2: 0.942}
3 0.901761 True 0.698 {1: 0.648, 2: 0.628, 3: 0.817}
4 0.555171 True 0.769 {1: 0.803, 2: 0.777, 3: 0.79, 4: 0.704}
5 0.510650 True 0.729 {1: 0.787, 2: 0.822, 3: 0.551, 4: 0.606, 5: 0.879}
6 0.454685 True 0.845 {1: 0.885, 2: 0.874, 3: 0.782, 4: 0.801, 5: 0.822, 6: 0.907}
Runtime: β±οΈ 110.16 seconds
This small change leads to substantial gains in clustering speed, especially when repeated across many bootstrap iterations.
π₯ Combining Both Strategies#
The best performance is achieved when we combine both techniques:
Set
n_jobs > 1to parallelize bootstrap computationSet
method_clustering="fasterpam"to accelerate clustering
[8]:
clustering_strategy = ClusteringKMedoids(method="fasterpam")
n_jobs = 5
start = time.perf_counter()
fgc = forest_guided_clustering(
estimator=model,
X=X_housing_subset,
y=y_housing_subet,
clustering_distance_metric=DistanceRandomForestLCA(),
clustering_strategy=clustering_strategy,
n_jobs=n_jobs,
)
end = time.perf_counter()
minutes = (end - start)
print(f"Baseline Runtime: β±οΈ {minutes:.2f} seconds")
Using a sample size of 10.00% of the input data for Jaccard Index computation.
Using range k = (2, 6) to optimize k.
Optimizing k: 100%|ββββββββββ| 5/5 [01:24<00:00, 17.00s/it]
Optimal number of clusters k = 6
Clustering Evaluation Summary:
k Score Stable Mean_JI Cluster_JI
2 0.797185 True 0.926 {1: 0.919, 2: 0.932}
3 0.901761 True 0.725 {1: 0.684, 2: 0.658, 3: 0.833}
4 0.555171 True 0.807 {1: 0.799, 2: 0.799, 3: 0.831, 4: 0.798}
5 0.510650 True 0.724 {1: 0.792, 2: 0.831, 3: 0.498, 4: 0.619, 5: 0.882}
6 0.454685 True 0.836 {1: 0.883, 2: 0.868, 3: 0.764, 4: 0.791, 5: 0.819, 6: 0.894}
Baseline Runtime: β±οΈ 89.95 seconds
This combined strategy significantly reduces FGCβs runtimeβoften by a factor of 2β5, depending on hardware, and scales even better on larger datasets.
π Summary#
Strategy |
Avg. Runtime |
|---|---|
Baseline |
145 sec |
Parallelizing Bootstrap Evaluation |
131 sec |
Speeding Up Clustering |
110 sec |
Both combined |
89 sec |
These runtime reductions make FGC practical even for larger datasets, without compromising its ability to generate insightful explanations.
Clustering at Scale: Strategies for Large Datasets#
If the previously mentioned performance enhancements (e.g. parallelization or faster K-Medoids) are still not sufficient, perhaps due to very large datasets or limited computational resources, then the memory-efficient proximity matrix and subsampling becomes a practical solution.
πΎ Memory-Efficient Proximity Matrix#
When working with large datasets like the California Housing dataset (20,640 samples), the proximity matrix becomes very largeβspecifically, 20,640 Γ 20,640, which translates to about 3.4 GB of memory if stored as float32. To avoid memory bottlenecks or system swapping, we enable FGCβs memory-efficient mode:
memory_efficient=True: stores the proximity matrix on disk rather than in RAMoutput_dir: directory where the matrix is saved
Internally, np.memmap is used, a NumPy feature that enables reading/writing parts of large arrays from disk without loading the entire matrix into memory. This is especially useful for large-scale computations, as it prevents RAM overload and allows for seamless streaming access. This memory-efficient mode is highly recommended for datasets with more than ~10,000 samples.
[9]:
clustering_distance_metric = DistanceRandomForestLCA(memory_efficient=True, dir_distance_matrix="./")
π Clustering Large Applications#
One effective approach for this is the CLARA algorithm (Clustering Large Applications). CLARA works by applying K-Medoids clustering not on the full dataset but on multiple small, randomly selected subsamples. For each subsample, it performs clustering and then evaluates the resulting medoids across the entire dataset. The configuration that minimizes the overall clustering cost is selected as the best. This method allows us to find a representative clustering structure without computing the full distance matrix or running K-Medoids over the entire dataset.
π‘ While CLARA trades off some precision for performance, it often yields a very good approximation of the optimal clustering in a fraction of the time and memory.
Weβll now apply CLARA-based clustering on the California Housing dataset used earlier. This lets us benchmark performance and evaluate clustering quality when using subsampling instead of full-matrix proximity calculations.
β οΈ Note: The choice of sub_sample_size and sampling_iter is crucial. If the subsample size is too small or the number of iterations too low, CLARA might miss relevant patterns in the data. A good rule of thumb is to use a moderate fraction of the full dataset for subsampling (e.g. 10-60%, depending on the JI bootstrap size) and enough iterations (e.g. 5β20) to ensure robustness and reproducibility of the results.
[10]:
clustering_distance_metric = DistanceRandomForestLCA(memory_efficient=True, dir_distance_matrix="./")
clustering_strategy = ClusteringClara(sub_sample_size=0.6, sampling_iter=5, method="fasterpam")
JI_bootstrap_iter = 10
JI_bootstrap_sample_size = 0.7
n_jobs = 5
start = time.perf_counter()
fgc = forest_guided_clustering(
estimator=model,
X=X_housing,
y=y_housing,
clustering_distance_metric=clustering_distance_metric,
clustering_strategy=clustering_strategy,
JI_bootstrap_sample_size=JI_bootstrap_sample_size,
JI_bootstrap_iter=JI_bootstrap_iter,
n_jobs=n_jobs,
)
end = time.perf_counter()
minutes = (end - start) / 60
print(f"Runtime: β±οΈ {minutes:.2f} minutes")
Using range k = (2, 6) to optimize k.
Optimizing k: 100%|ββββββββββ| 5/5 [31:49<00:00, 381.91s/it]
Optimal number of clusters k = 6
Clustering Evaluation Summary:
k Score Stable Mean_JI Cluster_JI
2 0.961174 True 0.909 {1: 0.906, 2: 0.911}
3 0.601106 True 0.911 {1: 0.914, 2: 0.929, 3: 0.89}
4 0.582561 True 0.864 {1: 0.853, 2: 0.755, 3: 0.905, 4: 0.944}
5 0.539363 True 0.931 {1: 0.923, 2: 0.898, 3: 0.935, 4: 0.934, 5: 0.967}
6 0.530107 True 0.867 {1: 0.833, 2: 0.864, 3: 0.838, 4: 0.836, 5: 0.889, 6: 0.943}
Runtime: β±οΈ 31.98 minutes
As we can see above, using the CLARA algorithm does not necessarily reduce runtime, but it makes FGC feasible on very large datasets, where even memory-efficient techniques like storing the distance matrix on disk become impractical. For instance, a full proximity matrix for 200,000 samples would require roughly 160 GB of memory if stored as 4-byte float32 values. This far exceeds the RAM capacity of most machines and could also strain disk I/O when using memory mapping.
In conclusion, CLARA offers a robust workaround by working on smaller, representative subsamples, allowing FGC to scale to massive datasets. While runtime gains may vary, its real strength lies in making interpretation possible when traditional distance-based clustering would otherwise be infeasible.