Special Case: Impact of Model Complexity on FGC

When training a Random Forest model, we usually tune its hyperparameters by optimizing a chosen scoring function, e.g. R^2 or accuracy. If we optimize only for that metric, we may end up with a highly complex Random Forest whose trees are grown very deep to fit the training data as closely as possible. A model that is too complex starts to learn irrelevant information (“noise”) in the dataset, and we run into the problem of overfitting: the model no longer performs accurately on unseen data, which defeats its purpose.

This problem also carries over to the explanations we retrieve from FGC. FGC uncovers stable patterns in the data by using the structure of a Random Forest model. If the model is too complex, e.g. has deeply grown trees, it learns patterns that are specific to individual instances in the training set rather than generalizable patterns, and the explanations derived from it generalize just as poorly.

Note: For installation instructions and a general introduction to FGC, please see Read the Docs - Installation and the Introduction Notebook.

Imports:

[1]:
## Import the Forest-Guided Clustering package
from fgclustering import FgClustering

## Imports for datasets
from sklearn.datasets import fetch_california_housing

## Additional imports for use-cases

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

To showcase how model complexity impacts FGC, we again use the California Housing dataset (for a description of the dataset, please see Use Case 3). We take the first 1000 samples of the dataset as training data and train a Random Forest Regressor, tuning max_depth with 5-fold cross-validation.

[2]:
# Load the dataset as a single DataFrame (features plus target column)
data_housing = fetch_california_housing(as_frame=True).frame

# Use the first 1000 samples for training and split off the target
data_housing_train = data_housing.iloc[:1000]
X_housing_train = data_housing_train.loc[:, data_housing_train.columns != 'MedHouseVal']
y_housing_train = data_housing_train.MedHouseVal

regressor = RandomForestRegressor(max_features='log2', max_samples=0.8, bootstrap=True, oob_score=True, random_state=42)

# Tune only max_depth with 5-fold cross-validation (default scorer of the regressor: R^2)
grid = {'max_depth':[2, 5, 10, 20, 30]}
grid_regressor = GridSearchCV(regressor, grid, cv=5)
grid_regressor.fit(X_housing_train, y_housing_train)
rf_housing = grid_regressor.best_estimator_

params = grid_regressor.cv_results_['params']
score = grid_regressor.cv_results_['mean_test_score']
print(f'Parameter Grid: {params}')
print(f'Test R^2 score: {score}')

print(f'Parameters of best prediction model: {grid_regressor.best_params_}')
Parameter Grid: [{'max_depth': 2}, {'max_depth': 5}, {'max_depth': 10}, {'max_depth': 20}, {'max_depth': 30}]
Test R^2 score: [0.21068248 0.43956159 0.51229987 0.51506051 0.51523013]
Parameters of best prediction model: {'max_depth': 30}

The results above show that optimizing only for the metric yields a highly complex model with a maximum tree depth of 30, although the score barely changes from max_depth=10 upwards (R^2 of about 0.512 at max_depth=10 versus 0.515 at max_depth=30). We now apply Forest-Guided Clustering to the trained model and dataset to see whether we can retrieve any stable pattern from this Random Forest model.

[3]:
fgc = FgClustering(model=rf_housing, data=data_housing_train, target_column='MedHouseVal')
fgc.run()
Interpreting RandomForestRegressor
 17%|█▋        | 1/6 [00:02<00:13,  2.68s/it]
For number of cluster 2 the Jaccard Index is 0.04151478360120275
Clustering is instable, no score computed!
 33%|███▎      | 2/6 [00:07<00:14,  3.68s/it]
For number of cluster 3 the Jaccard Index is 0.0639124681474813
Clustering is instable, no score computed!
 50%|█████     | 3/6 [00:14<00:15,  5.30s/it]
For number of cluster 4 the Jaccard Index is 0.06234212556086201
Clustering is instable, no score computed!
 67%|██████▋   | 4/6 [00:24<00:14,  7.16s/it]
For number of cluster 5 the Jaccard Index is 0.08500454028798372
Clustering is instable, no score computed!
 83%|████████▎ | 5/6 [00:37<00:09,  9.47s/it]
For number of cluster 6 the Jaccard Index is 0.10481112006920552
Clustering is instable, no score computed!
100%|██████████| 6/6 [00:55<00:00,  9.25s/it]
For number of cluster 7 the Jaccard Index is 0.0807985607763865
Clustering is instable, no score computed!

/Users/lisa.barros/tools/anaconda3/envs/FGC/lib/python3.10/site-packages/fgclustering/forest_guided_clustering.py:109: UserWarning: No stable clusters were found!
  warnings.warn("No stable clusters were found!")
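
The Jaccard Index reported in the log above is a set-overlap measure: the size of the intersection of two sets divided by the size of their union. How FGC computes it internally (the resampling scheme and the stability cutoff) is not shown in this notebook, so the following is only a minimal sketch of the measure itself. It assumes clusters are compared as sets of sample indices, and the helper name and cluster memberships are made up for illustration.

```python
def jaccard_index(cluster_a, cluster_b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two clusters given as iterables of sample ids."""
    a, b = set(cluster_a), set(cluster_b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(a & b) / len(a | b)


# Hypothetical cluster memberships (sample indices), for illustration only.
original = [0, 1, 2, 3, 4, 5, 6, 7]
resampled_similar = [0, 1, 2, 3, 4, 5, 8]      # mostly the same samples
resampled_different = [0, 9, 10, 11, 12, 13]   # almost no overlap

print(jaccard_index(original, resampled_similar))    # 6 shared samples out of 9 in the union
print(jaccard_index(original, resampled_different))  # 1 shared sample out of 13 in the union
```

Values close to 0, like those in the log above, mean that the clusters found on the data share almost no samples with the clusters they are compared against.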

As we can see from the results above, FGC does not find any stable clustering, which means that we do not find any generalizable pattern in the data. But how is that possible, given that in Use Case 3 we use the same data (to train the model and run FGC) and find clear and stable patterns with FGC?

The reason is that we optimized our model only for metric performance and not for explainability! We saw above that optimizing for the R^2 score led to a highly complex model with a maximum tree depth of 30, although its performance is not significantly better than that of a Random Forest model with a maximum tree depth of 10. A high tree depth produces trees with many leaves that contain only a few samples. The deeper we go in a tree, the higher the chance that a split is based only on properties specific to the training samples, i.e. we start fitting the “noise” in our training data.

Let’s now see what happens if we apply FGC to the same Random Forest model trained with max_depth=10.

[4]:
regressor = RandomForestRegressor(max_depth=10, max_features='log2', max_samples=0.8, bootstrap=True, oob_score=True, random_state=42)
regressor.fit(X_housing_train, y_housing_train)

fgc = FgClustering(model=regressor, data=data_housing_train, target_column='MedHouseVal')
fgc.run()

Interpreting RandomForestRegressor
 17%|█▋        | 1/6 [00:02<00:12,  2.52s/it]
For number of cluster 2 the Jaccard Index is 0.7204314803463794
For number of cluster 2 the score is 776.0190906201934
 33%|███▎      | 2/6 [00:07<00:15,  3.87s/it]
For number of cluster 3 the Jaccard Index is 0.343363692706108
Clustering is instable, no score computed!
 50%|█████     | 3/6 [00:14<00:16,  5.52s/it]
For number of cluster 4 the Jaccard Index is 0.4464751826329628
Clustering is instable, no score computed!
 67%|██████▋   | 4/6 [00:25<00:15,  7.67s/it]
For number of cluster 5 the Jaccard Index is 0.5457086314072344
Clustering is instable, no score computed!
 83%|████████▎ | 5/6 [00:40<00:10, 10.18s/it]
For number of cluster 6 the Jaccard Index is 0.5388533537672784
Clustering is instable, no score computed!
100%|██████████| 6/6 [00:59<00:00,  9.99s/it]
For number of cluster 7 the Jaccard Index is 0.3574391861664191
Clustering is instable, no score computed!
Optimal number of cluster is: 2
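
The earlier claim that deep trees end up with many leaves holding only a few samples can be checked directly on a fitted forest. The sketch below is illustrative only: it uses synthetic data from `make_regression` instead of the housing data so that it stays self-contained and fast, and `mean_leaf_size` is a hypothetical helper, not part of FGC or scikit-learn. It reads the fitted tree arrays that scikit-learn exposes via `tree_`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor


def mean_leaf_size(forest):
    """Average number of in-bag training samples per leaf, over all trees of the forest."""
    sizes = []
    for estimator in forest.estimators_:
        tree = estimator.tree_
        is_leaf = tree.children_left == -1  # scikit-learn marks leaves with child index -1
        sizes.extend(tree.n_node_samples[is_leaf])
    return float(np.mean(sizes))


# Synthetic noisy regression data, used only to keep the sketch self-contained.
X, y = make_regression(n_samples=500, n_features=8, noise=20.0, random_state=0)

leaf_sizes = {}
for depth in (3, 10, 30):
    forest = RandomForestRegressor(n_estimators=20, max_depth=depth, random_state=0)
    forest.fit(X, y)
    leaf_sizes[depth] = mean_leaf_size(forest)
    print(f"max_depth={depth}: mean samples per leaf = {leaf_sizes[depth]:.1f}")
```

On such data the shallow forest keeps many samples together in each leaf, while the deeper forests split them into much smaller groups. The exact numbers depend on the data and the scikit-learn version.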

With a Random Forest model trained with max_depth=10, which achieves practically the same performance as the model with max_depth=30, FGC indeed finds a stable clustering with k=2. This shows that the performance metric should not be the only optimization target when we train a Random Forest model that we want to interpret!
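
One possible way to act on this is to prefer the least complex model whose score is close enough to the best one, instead of taking the top-scoring model outright. The following is a minimal sketch of that idea, not something FGC or scikit-learn does by default: `simplest_within_tolerance` is a hypothetical helper, and the tolerance of 0.01 is an arbitrary assumption that would need to be chosen per use case. The scores are copied from the grid search output above.

```python
def simplest_within_tolerance(depths, scores, tolerance=0.01):
    """Return the smallest depth whose score is within `tolerance` of the best score."""
    best = max(scores)
    candidates = [d for d, s in zip(depths, scores) if best - s <= tolerance]
    return min(candidates)


# Mean cross-validated R^2 scores copied from the grid search output above.
depths = [2, 5, 10, 20, 30]
scores = [0.21068248, 0.43956159, 0.51229987, 0.51506051, 0.51523013]

chosen = simplest_within_tolerance(depths, scores)
print(chosen)  # 10
```

With these scores, max_depth=10, 20 and 30 all lie within 0.01 of the best score, so the rule selects max_depth=10. The same logic could be passed to GridSearchCV through its `refit` parameter, which accepts a callable that receives `cv_results_` and returns the index of the chosen candidate.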