Special Case: Running Forest-Guided Clustering on Large Datasets#

πŸ“š In this tutorial, we demonstrate how to apply Forest-Guided Clustering (FGC) efficiently on large datasets. FGC relies on pairwise sample similarities derived from a trained Random Forest model. Depending on the chosen distance metric, these similarities are computed either from shared terminal leaves (proximity-based distance) or from shared decision-path structure (LCA-based distance). While these approaches enable highly model-specific and interpretable clustering, they can become computationally demanding as the number of samples increases. Several steps in the FGC pipeline may contribute to high runtime and memory usage on large datasets:

  • Computing large pairwise distance matrices (size (N :nbsphinx-math:`times `N))

  • Running bootstrapped Jaccard Index evaluations to estimate clustering stability

  • Repeated K-Medoids clustering, especially during optimization of the number of clusters

For example, if k=None (default), FGC will evaluate clustering stability for each k from 2 to 6 (5 values). With JI_bootstrap_iter=30 bootstrap iterations, this leads to 155 K-Medoids runs (5 without bootstrapping + 5 Γ— 30 bootstrapped). Each run operates on the full N Γ— N proximity matrix, which can quickly exhaust memory and CPU resources on large datasets. In this tutorial, we’ll walk through practical strategies to improve the runtime and scalability of FGC, so you can still uncover actionable patterns from large data:

  • Parallelize operations

  • Use faster K-Medoids optimization methods

  • Apply subsampling to reduce computational load

  • Store distance matrices as disk-backed memory-mapped arrays instead of fully in RAM

πŸ“¦ Installation: To get started, you need to install the fgclustering package. Please follow the instructions on the official installation guide.

🚧 Note: For a general introduction to FGC, please refer to our Introduction Notebook.

⚠️ Note: Total runtime depends on your system’s resources. Timing results shown here may vary slightly, but relative performance gains should remain consistent.

Imports:

[ ]:
## Import the Forest-Guided Clustering package
from fgclustering import (
    forest_guided_clustering,
    DistanceRandomForestLCA,
    ClusteringKMedoids,
    ClusteringClara
)

## Imports for datasets
from sklearn.datasets import fetch_california_housing

## Additional imports for use-cases
import time
import pandas as pd

from sklearn.ensemble import RandomForestRegressor

🏠 The California Housing Dataset#

For this special case, we’ll once again use the California Housing dataset, which contains 20,640 samples, a size large enough to noticeably impact the runtime of Forest-Guided Clustering (FGC). This makes it a suitable candidate for exploring how FGC scales and how we can improve its performance. This dataset was also used in Use Case 3, please refer there for a full description. To benchmark different runtime optimization strategies, we will train a Random Forest model on the full dataset and compare the runtime and memory usage across various configurations.

[2]:
data_housing = fetch_california_housing(as_frame=True)
data_housing = data_housing.frame
data_housing.rename(columns={'MedHouseVal':'target'}, inplace=True)

X_housing = data_housing.loc[:, data_housing.columns != 'target']
y_housing = data_housing.target

print(f"Traing model for {X_housing.shape[0]} house blocks.")
Traing model for 20640 house blocks.
[3]:
model = RandomForestRegressor(max_samples=0.8, max_depth=20, max_features='log2', n_estimators=100, bootstrap=True, oob_score=True, random_state=42)
model.fit(X_housing, y_housing)

print(f'OOB accuracy of prediction model: {round(model.oob_score_,3)}')
OOB accuracy of prediction model: 0.819

Benchmarking FGC Runtime on Large Datasets#

To ensure that the runtime comparison remains manageable, we will use only the first 10,000 samples of the dataset. This subset is still large enough to demonstrate the impact of various performance optimization strategies, while keeping execution times reasonable.

[4]:
X_housing_subset = X_housing.iloc[:10000].reset_index(drop=True)
y_housing_subet = y_housing.iloc[:10000].reset_index(drop=True)

⏱️ Baseline Runtime#

To evaluate how Forest-Guided Clustering (FGC) performs on large datasets, we first define a baseline configuration. This will serve as the reference point for evaluating performance improvements.

[5]:
clustering_strategy = ClusteringKMedoids(method="pam")
n_jobs = 1

start = time.perf_counter()
fgc = forest_guided_clustering(
    estimator=model,
    X=X_housing_subset,
    y=y_housing_subet,
    clustering_distance_metric=DistanceRandomForestLCA(),
    clustering_strategy=clustering_strategy,
    n_jobs=n_jobs,
)
end = time.perf_counter()

minutes = (end - start)
print(f"Baseline Runtime: ⏱️ {minutes:.2f} seconds")
Using a sample size of 10.00% of the input data for Jaccard Index computation.
Using range k = (2, 6) to optimize k.
Optimizing k: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 5/5 [02:20<00:00, 28.18s/it]

Optimal number of clusters k = 6

Clustering Evaluation Summary:
 k    Score  Stable  Mean_JI                                                  Cluster_JI
 2 0.797185    True    0.927                                         {1: 0.92, 2: 0.933}
 3 0.901761    True    0.735                               {1: 0.683, 2: 0.68, 3: 0.842}
 4 0.555171    True    0.785                     {1: 0.793, 2: 0.79, 3: 0.811, 4: 0.744}
 5 0.459049    True    0.758          {1: 0.827, 2: 0.838, 3: 0.697, 4: 0.539, 5: 0.891}
 6 0.454685    True    0.813 {1: 0.873, 2: 0.85, 3: 0.686, 4: 0.756, 5: 0.822, 6: 0.892}
Baseline Runtime: ⏱️ 145.88 seconds

πŸ”€ Parallelizing Bootstrap Evaluation#

FGC calculates the Jaccard Index over multiple bootstrapped subsets, which can be computationally intensive. You can parallelize this process using the n_jobs parameter:

  • n_jobs=1: default, no parallelism

  • n_jobs=-1: use all available CPU cores

  • n_jobs=n: use n parallel workers

We now re-run FGC with n_jobs=5, enabling concurrent execution of the Jaccard bootstrap process.

[6]:
clustering_strategy = ClusteringKMedoids(method="pam")
n_jobs = 5

start = time.perf_counter()
fgc = forest_guided_clustering(
    estimator=model,
    X=X_housing_subset,
    y=y_housing_subet,
    clustering_distance_metric=DistanceRandomForestLCA(),
    clustering_strategy=clustering_strategy,
    n_jobs=n_jobs,
)
end = time.perf_counter()

minutes = (end - start)
print(f"Runtime: ⏱️ {minutes:.2f} seconds")
Using a sample size of 10.00% of the input data for Jaccard Index computation.
Using range k = (2, 6) to optimize k.
Optimizing k: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 5/5 [02:06<00:00, 25.32s/it]

Optimal number of clusters k = 6

Clustering Evaluation Summary:
 k    Score  Stable  Mean_JI                                                   Cluster_JI
 2 0.797185    True    0.927                                          {1: 0.92, 2: 0.933}
 3 0.901761    True    0.732                               {1: 0.689, 2: 0.669, 3: 0.837}
 4 0.555171    True    0.792                     {1: 0.818, 2: 0.777, 3: 0.816, 4: 0.759}
 5 0.459049    True    0.752            {1: 0.837, 2: 0.833, 3: 0.688, 4: 0.542, 5: 0.86}
 6 0.454685    True    0.800 {1: 0.878, 2: 0.835, 3: 0.654, 4: 0.737, 5: 0.803, 6: 0.891}
Runtime: ⏱️ 131.61 seconds

By parallelizing the bootstrap phase, we already observe a noticeable speedup.

⚑ Speeding Up Clustering#

FGC uses K-Medoids to cluster the proximity matrix. By default, it uses the classic PAM (Partitioning Around Medoids) algorithm, which is accurate but computationally expensive. The fasterpam algorithm is an optimized variant that reduces redundant distance calculations and improves convergence speed while maintaining clustering quality. You can activate it in FGC by setting: ClusteringKMedoids(method="fasterpam").

[7]:
clustering_strategy = ClusteringKMedoids(method="fasterpam")
n_jobs = 1

start = time.perf_counter()
fgc = forest_guided_clustering(
    estimator=model,
    X=X_housing_subset,
    y=y_housing_subet,
    clustering_distance_metric=DistanceRandomForestLCA(),
    clustering_strategy=clustering_strategy,
    n_jobs=n_jobs,
)
end = time.perf_counter()

minutes = (end - start)
print(f"Runtime: ⏱️ {minutes:.2f} seconds")
Using a sample size of 10.00% of the input data for Jaccard Index computation.
Using range k = (2, 6) to optimize k.
Optimizing k: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 5/5 [01:45<00:00, 21.03s/it]

Optimal number of clusters k = 6

Clustering Evaluation Summary:
 k    Score  Stable  Mean_JI                                                   Cluster_JI
 2 0.797185    True    0.936                                          {1: 0.93, 2: 0.942}
 3 0.901761    True    0.698                               {1: 0.648, 2: 0.628, 3: 0.817}
 4 0.555171    True    0.769                      {1: 0.803, 2: 0.777, 3: 0.79, 4: 0.704}
 5 0.510650    True    0.729           {1: 0.787, 2: 0.822, 3: 0.551, 4: 0.606, 5: 0.879}
 6 0.454685    True    0.845 {1: 0.885, 2: 0.874, 3: 0.782, 4: 0.801, 5: 0.822, 6: 0.907}
Runtime: ⏱️ 110.16 seconds

This small change leads to substantial gains in clustering speed, especially when repeated across many bootstrap iterations.

πŸ₯‡ Combining Both Strategies#

The best performance is achieved when we combine both techniques:

  • Set n_jobs > 1 to parallelize bootstrap computation

  • Set method_clustering="fasterpam" to accelerate clustering

[8]:
clustering_strategy = ClusteringKMedoids(method="fasterpam")
n_jobs = 5

start = time.perf_counter()
fgc = forest_guided_clustering(
    estimator=model,
    X=X_housing_subset,
    y=y_housing_subet,
    clustering_distance_metric=DistanceRandomForestLCA(),
    clustering_strategy=clustering_strategy,
    n_jobs=n_jobs,
)
end = time.perf_counter()

minutes = (end - start)
print(f"Baseline Runtime: ⏱️ {minutes:.2f} seconds")
Using a sample size of 10.00% of the input data for Jaccard Index computation.
Using range k = (2, 6) to optimize k.
Optimizing k: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 5/5 [01:24<00:00, 17.00s/it]

Optimal number of clusters k = 6

Clustering Evaluation Summary:
 k    Score  Stable  Mean_JI                                                   Cluster_JI
 2 0.797185    True    0.926                                         {1: 0.919, 2: 0.932}
 3 0.901761    True    0.725                               {1: 0.684, 2: 0.658, 3: 0.833}
 4 0.555171    True    0.807                     {1: 0.799, 2: 0.799, 3: 0.831, 4: 0.798}
 5 0.510650    True    0.724           {1: 0.792, 2: 0.831, 3: 0.498, 4: 0.619, 5: 0.882}
 6 0.454685    True    0.836 {1: 0.883, 2: 0.868, 3: 0.764, 4: 0.791, 5: 0.819, 6: 0.894}
Baseline Runtime: ⏱️ 89.95 seconds

This combined strategy significantly reduces FGC’s runtimeβ€”often by a factor of 2–5, depending on hardware, and scales even better on larger datasets.

πŸ“‹ Summary#

Strategy

Avg. Runtime

Baseline

145 sec

Parallelizing Bootstrap Evaluation

131 sec

Speeding Up Clustering

110 sec

Both combined

89 sec

These runtime reductions make FGC practical even for larger datasets, without compromising its ability to generate insightful explanations.

Clustering at Scale: Strategies for Large Datasets#

If the previously mentioned performance enhancements (e.g. parallelization or faster K-Medoids) are still not sufficient, perhaps due to very large datasets or limited computational resources, then the memory-efficient proximity matrix and subsampling becomes a practical solution.

πŸ’Ύ Memory-Efficient Proximity Matrix#

When working with large datasets like the California Housing dataset (20,640 samples), the proximity matrix becomes very largeβ€”specifically, 20,640 Γ— 20,640, which translates to about 3.4 GB of memory if stored as float32. To avoid memory bottlenecks or system swapping, we enable FGC’s memory-efficient mode:

  • memory_efficient=True: stores the proximity matrix on disk rather than in RAM

  • output_dir: directory where the matrix is saved

Internally, np.memmap is used, a NumPy feature that enables reading/writing parts of large arrays from disk without loading the entire matrix into memory. This is especially useful for large-scale computations, as it prevents RAM overload and allows for seamless streaming access. This memory-efficient mode is highly recommended for datasets with more than ~10,000 samples.

[9]:
clustering_distance_metric = DistanceRandomForestLCA(memory_efficient=True, dir_distance_matrix="./")

πŸ” Clustering Large Applications#

One effective approach for this is the CLARA algorithm (Clustering Large Applications). CLARA works by applying K-Medoids clustering not on the full dataset but on multiple small, randomly selected subsamples. For each subsample, it performs clustering and then evaluates the resulting medoids across the entire dataset. The configuration that minimizes the overall clustering cost is selected as the best. This method allows us to find a representative clustering structure without computing the full distance matrix or running K-Medoids over the entire dataset.

πŸ’‘ While CLARA trades off some precision for performance, it often yields a very good approximation of the optimal clustering in a fraction of the time and memory.

We’ll now apply CLARA-based clustering on the California Housing dataset used earlier. This lets us benchmark performance and evaluate clustering quality when using subsampling instead of full-matrix proximity calculations.

⚠️ Note: The choice of sub_sample_size and sampling_iter is crucial. If the subsample size is too small or the number of iterations too low, CLARA might miss relevant patterns in the data. A good rule of thumb is to use a moderate fraction of the full dataset for subsampling (e.g. 10-60%, depending on the JI bootstrap size) and enough iterations (e.g. 5–20) to ensure robustness and reproducibility of the results.

[10]:
clustering_distance_metric = DistanceRandomForestLCA(memory_efficient=True, dir_distance_matrix="./")
clustering_strategy = ClusteringClara(sub_sample_size=0.6, sampling_iter=5, method="fasterpam")
JI_bootstrap_iter = 10
JI_bootstrap_sample_size = 0.7
n_jobs = 5

start = time.perf_counter()
fgc = forest_guided_clustering(
    estimator=model,
    X=X_housing,
    y=y_housing,
    clustering_distance_metric=clustering_distance_metric,
    clustering_strategy=clustering_strategy,
    JI_bootstrap_sample_size=JI_bootstrap_sample_size,
    JI_bootstrap_iter=JI_bootstrap_iter,
    n_jobs=n_jobs,
)
end = time.perf_counter()

minutes = (end - start) / 60
print(f"Runtime: ⏱️ {minutes:.2f} minutes")
Using range k = (2, 6) to optimize k.
Optimizing k: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 5/5 [31:49<00:00, 381.91s/it]

Optimal number of clusters k = 6

Clustering Evaluation Summary:
 k    Score  Stable  Mean_JI                                                   Cluster_JI
 2 0.961174    True    0.909                                         {1: 0.906, 2: 0.911}
 3 0.601106    True    0.911                                {1: 0.914, 2: 0.929, 3: 0.89}
 4 0.582561    True    0.864                     {1: 0.853, 2: 0.755, 3: 0.905, 4: 0.944}
 5 0.539363    True    0.931           {1: 0.923, 2: 0.898, 3: 0.935, 4: 0.934, 5: 0.967}
 6 0.530107    True    0.867 {1: 0.833, 2: 0.864, 3: 0.838, 4: 0.836, 5: 0.889, 6: 0.943}
Runtime: ⏱️ 31.98 minutes

As we can see above, using the CLARA algorithm does not necessarily reduce runtime, but it makes FGC feasible on very large datasets, where even memory-efficient techniques like storing the distance matrix on disk become impractical. For instance, a full proximity matrix for 200,000 samples would require roughly 160 GB of memory if stored as 4-byte float32 values. This far exceeds the RAM capacity of most machines and could also strain disk I/O when using memory mapping.

In conclusion, CLARA offers a robust workaround by working on smaller, representative subsamples, allowing FGC to scale to massive datasets. While runtime gains may vary, its real strength lies in making interpretation possible when traditional distance-based clustering would otherwise be infeasible.