fgclustering package

Module contents

Forest-Guided Clustering (FGC) is an explainability method for Random Forest models that addresses one of the key limitations of many standard XAI techniques: the inability to effectively handle correlated features and complex decision patterns.

forest_guided_clustering(estimator: RandomForestClassifier | RandomForestRegressor, X: DataFrame, y: str | Series, clustering_distance_metric: DistanceRandomForestBase, clustering_strategy: ClusteringKMedoids | ClusteringClara, k: int | tuple[int, int] | None = None, JI_bootstrap_iter: int = 100, JI_bootstrap_sample_size: int | float | None = None, JI_discart_value: float = 0.6, n_jobs: int = 1, random_state: int | None = None, verbose: int = 1) → Bunch

Run forest-guided clustering with Random-Forest-derived distances.

The fitted Random Forest is first encoded by clustering_distance_metric. The selected clustering strategy then groups samples using pairwise distances derived from this encoding. Supported distance metrics include terminal-node proximity and LCA-based decision-path distances.

The number of clusters can be fixed or optimized over a range. When a range is provided, candidate values are evaluated by clustering stability and task-specific cluster quality. Stability is estimated with bootstrapped Jaccard indices. Cluster quality is measured by balanced impurity for classification models and within-cluster variation for regression models.

Parameters:

estimator (RandomForestClassifier | RandomForestRegressor) – Fitted Random Forest estimator used to derive the sample encoding.
X (pd.DataFrame) – Input feature matrix.
y (str | pd.Series) – Target variable, either as a target vector or as the name of a column in X.
clustering_distance_metric (DistanceRandomForestBase) – Forest-derived distance metric used to encode samples and compute pairwise distances.
clustering_strategy (ClusteringKMedoids | ClusteringClara) – Clustering algorithm used with the computed distances.
k (int | tuple[int, int] | None) – Fixed number of clusters, inclusive optimization range (min_k, max_k), or None to use the default range.
JI_bootstrap_iter (int) – Number of bootstrap iterations for Jaccard stability estimation.
JI_bootstrap_sample_size (int | float | None) – Number or fraction of samples drawn in each Jaccard bootstrap iteration. If None, an adaptive size is selected.
JI_discart_value (float) – Minimum mean Jaccard index required for a solution to be marked as stable.
n_jobs (int) – Number of parallel jobs used during cluster-number optimization.
random_state (int | None) – Random seed used for reproducible clustering and subsampling.
verbose (int) – Verbosity level for progress output.

Returns:

Results containing best_k, evaluated ks, mean Jaccard indices, quality scores, stability mask, per-cluster Jaccard values, cluster labels for each evaluated k, and model_type.

Return type:

Bunch

forest_guided_feature_importance(X: DataFrame, y: str | Series, cluster_labels: ndarray, y_pred: ndarray | Series | None = None, feature_importance_distance_metric: str = 'wasserstein', verbose: int = 1) → Bunch

Compute cluster-wise and global forest-guided feature importance.

For each feature and cluster, the feature distribution inside the cluster is compared with the background distribution across all samples. Local feature importance contains the resulting cluster-specific distances. Global feature importance is computed by aggregating local values across clusters.

Supported distance metrics are "wasserstein" and "jensenshannon".

Parameters:

X (pd.DataFrame) – Input feature matrix.
y (str | pd.Series) – Target variable, either as a target vector or as the name of a column in X.
cluster_labels (np.ndarray) – Cluster labels aligned with X.
y_pred (np.ndarray | pd.Series | None) – Optional predicted target values aligned with X.
feature_importance_distance_metric (str) – Distance metric used to compare cluster and background feature distributions. Must be "wasserstein" or "jensenshannon".
verbose (int) – Verbosity level for progress output.

Raises:

ValueError – If feature_importance_distance_metric is not supported.

Returns:

Results containing local feature importance, global feature importance, and the clustering table used for downstream visualization.

Return type:

Bunch

plot_forest_guided_clustering(ks: Sequence[int] | ndarray, scores: Sequence[float] | ndarray, mean_ji: Sequence[float] | ndarray, cluster_jis: dict[int, dict[int, float]], best_k: int | None = None, JI_discart_value: float | None = None, color_spec: dict[str, Any] | None = None, show: bool = True, save: str | None = None) → tuple[matplotlib.figure.Figure, matplotlib.axes._axes.Axes] | None

Plot clustering quality and stability across evaluated cluster numbers.

The plot shows the task-specific clustering score, mean Jaccard stability, and per-cluster Jaccard stability for each evaluated k. Optionally, the selected best_k and the Jaccard stability threshold are highlighted.

Parameters:

ks (Sequence[int] | np.ndarray) – Evaluated numbers of clusters.
scores (Sequence[float] | np.ndarray) – Clustering quality score for each evaluated k.
mean_ji (Sequence[float] | np.ndarray) – Mean Jaccard stability for each evaluated k.
cluster_jis (dict[int, dict[int, float]]) – Per-cluster Jaccard stability values keyed by k.
best_k (int | None) – Optional selected number of clusters to highlight.
JI_discart_value (float | None) – Optional Jaccard stability threshold to draw.
color_spec (dict[str, Any] | None) – Optional overrides for DEFAULT_COLOR_SPEC.
show (bool) – If True, display the figure. If False, return it.
save (str | None) – Optional file path for saving the figure.

Returns:

Figure and primary axes when show=False; otherwise None.

Return type:

tuple[Figure, Axes] | None

plot_forest_guided_feature_importance(feature_importance_local: DataFrame, feature_importance_global: Series, top_n: int | None = None, num_cols: int = 4, color_spec: dict | None = None, reorder: bool = False, recolor: bool = False, show: bool = True, save: str | None = None) → tuple[matplotlib.figure.Figure, list[matplotlib.axes._axes.Axes]] | None

Plot global and cluster-specific feature importance.

The global panel summarizes feature importance across clusters. The local panels show cluster-specific feature importance values. Features can be limited to the top-ranked entries and optionally reordered or recolored according to the global ranking.

Parameters:

feature_importance_local (pd.DataFrame) – Local feature importance values with features as rows and clusters as columns.
feature_importance_global (pd.Series) – Global feature importance values indexed by feature name.
top_n (int | None) – Number of top-ranked features to display, or None to display all.
num_cols (int) – Maximum number of subplot columns.
color_spec (dict | None) – Optional overrides for DEFAULT_COLOR_SPEC.
reorder (bool) – If True, order local feature panels by global feature ranking.
recolor (bool) – If True, color local bars by global feature ranking.
show (bool) – If True, display the figure. If False, return it.
save (str | None) – Optional file path for saving the figure.

Returns:

Figure and axes when show=False; otherwise None.

Return type:

tuple[Figure, list[Axes]] | None

plot_forest_guided_decision_paths(data_clustering: DataFrame, feature_importance_global: Series, feature_importance_local: DataFrame, model_type: type[sklearn.ensemble._forest.RandomForestClassifier] | type[sklearn.ensemble._forest.RandomForestRegressor], draw_distributions: bool = True, draw_dotplot: bool = True, draw_heatmap: bool = True, heatmap_type: str = 'static', top_n: int | None = None, num_cols: int = 6, color_spec: dict | None = None, show: bool = True, save: str | None = None) → tuple[tuple[matplotlib.figure.Figure, list[matplotlib.axes._axes.Axes]] | None, tuple[matplotlib.figure.Figure, list[matplotlib.axes._axes.Axes]] | None, tuple[matplotlib.figure.Figure, list[matplotlib.axes._axes.Axes]] | plotly.graph_objs._figure.Figure | None] | None

Plot cluster-specific decision patterns for important features.

Features are ranked by global feature importance, optionally restricted to top_n, and visualized with up to three complementary plots: feature distributions, a dot plot, and a heatmap. The heatmap variant is selected from model_type to support both regression and classification outputs.

Parameters:

data_clustering (pd.DataFrame) – Clustering table containing cluster, target, optional predicted_target, and feature columns.
feature_importance_global (pd.Series) – Global feature importance values used to rank features.
feature_importance_local (pd.DataFrame) – Local feature importance values used for the dot plot.
model_type (type[RandomForestClassifier] | type[RandomForestRegressor]) – Random Forest estimator class used to select the heatmap variant.
draw_distributions (bool) – If True, generate feature distribution plots.
draw_dotplot (bool) – If True, generate the local-importance dot plot.
draw_heatmap (bool) – If True, generate the cluster-wise heatmap.
heatmap_type (str) – Heatmap rendering mode, for example "static" or "interactive".
top_n (int | None) – Number of top-ranked features to plot, or None to plot all.
num_cols (int) – Maximum number of columns for distribution subplots.
color_spec (dict | None) – Optional overrides for DEFAULT_COLOR_SPEC.
show (bool) – If True, display the generated figures. If False, return them.
save (str | None) – Optional file path or prefix for saving generated figures.

Raises:

ValueError – If model_type is not a Random Forest classifier or regressor.

Returns:

Distribution, dot-plot, and heatmap outputs when show=False. Disabled plots are returned as None. Returns None when show=True.

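Return type:

tuple[tuple[Figure, list[Axes]] | None, tuple[Figure, list[Axes]] | None, tuple[Figure, list[Axes]] | plotly.graph_objs._figure.Figure | None] | None

class ClusteringKMedoids(method: str = 'fasterpam', init: str = 'random', max_iter: int = 100, random_state: int = 42)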

Bases: object

K-Medoids clustering with precomputed Random-Forest-derived distances.

This class clusters selected samples using a distance matrix computed on demand by a DistanceRandomForestBase instance. The distance metric must already contain its forest encoding, such as terminal-node assignments for proximity distances or decision paths for LCA distances.

Parameters:

method (str) – Optimization method passed to kmedoids.KMedoids.
init (str) – Initialization strategy passed to kmedoids.KMedoids.
max_iter (int) – Maximum number of K-Medoids iterations.
random_state (int) – Random seed used for reproducible initialization.

run_clustering(k: int, distance_metric: DistanceRandomForestBase, sample_indices: ndarray, random_state_subsampling: int | None, verbose: int) → ndarray

Cluster selected samples with K-Medoids.

A pairwise distance matrix is computed for sample_indices using distance_metric and passed to kmedoids.KMedoids with metric="precomputed". After fitting, the temporary distance matrix is released. Returned labels use one-based indexing.

Parameters:

k (int) – Number of clusters.
distance_metric (DistanceRandomForestBase) – Distance metric with a precomputed forest encoding.
sample_indices (np.ndarray) – Indices of samples to cluster.
random_state_subsampling (int | None) – Optional subsampling seed. Not used by this implementation.
verbose (int) – Verbosity level. Not used by this implementation.

Returns:

One-based cluster labels for sample_indices.

Return type:

np.ndarray

class ClusteringClara(sub_sample_size: int | float | None = None, sampling_iter: int | None = None, sampling_target: list | None = None, method: str = 'fasterpam', init: str = 'random', max_iter: int = 100, random_state: int = 42)

Bases: object

CLARA clustering with precomputed Random-Forest-derived distances.

CLARA approximates K-Medoids by repeatedly clustering subsamples, evaluating each candidate medoid set on the full selected sample set, and retaining the medoids with the lowest inertia. Final labels are assigned by the nearest retained medoid.

Distances are computed on demand by a DistanceRandomForestBase instance, which allows the same clustering logic to work with proximity-based and LCA-based Random Forest distances.

Parameters:

sub_sample_size (int | float | None) – Number of samples, fraction of samples, or None for an adaptive CLARA subsample size.
sampling_iter (int | None) – Number of CLARA subsampling iterations, or None to choose a default based on the sample size.
sampling_target (list | None) – Optional labels used for stratified subsampling.
method (str) – Optimization method passed to kmedoids.KMedoids.
init (str) – Initialization strategy passed to kmedoids.KMedoids.
max_iter (int) – Maximum number of K-Medoids iterations per subsample.
random_state (int) – Random seed used for reproducible subsampling and initialization.

run_clustering(k: int, distance_metric: DistanceRandomForestBase, sample_indices: ndarray, random_state_subsampling: int | None, verbose: int) → ndarray

Cluster selected samples with the CLARA algorithm.

In each CLARA iteration, a subsample of sample_indices is selected and clustered with K-Medoids using a precomputed distance matrix. The resulting medoids are scored by computing their inertia over the full selected sample set. After all iterations, labels are assigned to all selected samples using the best medoid set.

If sampling_target is provided, subsamples are drawn with stratification over the target values corresponding to sample_indices. Otherwise, samples are drawn uniformly without replacement.

Returned labels use one-based indexing.
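The CLARA loop described above can be sketched in plain Python. This is an illustration under stated assumptions, not the package's implementation: it works on a small precomputed distance matrix `dist`, replaces kmedoids.KMedoids with a brute-force search over medoid sets (the hypothetical helper `exhaustive_k_medoids`), and draws subsamples uniformly without stratification.

```python
import itertools
import random


def inertia(dist, samples, medoids):
    """Sum of distances from each sample to its nearest medoid."""
    return sum(min(dist[i][m] for m in medoids) for i in samples)


def exhaustive_k_medoids(dist, samples, k):
    """Brute-force stand-in for K-Medoids; only viable for tiny subsamples."""
    return min(
        itertools.combinations(samples, k),
        key=lambda medoids: inertia(dist, samples, medoids),
    )


def clara(dist, sample_indices, k, sub_sample_size, sampling_iter, seed):
    rng = random.Random(seed)
    best_medoids, best_inertia = None, float("inf")
    for _ in range(sampling_iter):
        # Uniform subsample without replacement (no stratification here).
        subsample = rng.sample(sample_indices, sub_sample_size)
        medoids = exhaustive_k_medoids(dist, subsample, k)
        # Score the candidate medoids on ALL selected samples.
        score = inertia(dist, sample_indices, medoids)
        if score < best_inertia:
            best_medoids, best_inertia = medoids, score
    # Nearest-medoid assignment, shifted to one-based labels.
    labels = []
    for i in sample_indices:
        distances = [dist[i][m] for m in best_medoids]
        labels.append(distances.index(min(distances)) + 1)
    return labels, best_medoids


# Six points on a line forming two well-separated groups.
positions = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in positions] for a in positions]
labels, medoids = clara(dist, list(range(6)), k=2, sub_sample_size=4,
                        sampling_iter=5, seed=0)
print(labels, medoids)
```

Which group receives label 1 depends on the order of the retained medoids, so only the grouping itself is stable across seeds in this toy example.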

Parameters:

k (int) – Number of clusters.
distance_metric (DistanceRandomForestBase) – Distance metric with a precomputed forest encoding.
sample_indices (np.ndarray) – Indices of samples to cluster.
random_state_subsampling (int | None) – Optional random seed for CLARA subsampling. If None, the instance-level random_state is used.
verbose (int) – Verbosity level forwarded to subsample-size validation.

Returns:

One-based cluster labels for sample_indices.

Return type:

np.ndarray

class DistanceRandomForestBase(memory_efficient: bool = False, dir_distance_matrix: str | None = None)

Bases: ABC

Base class for Random-Forest-based distance metrics used in forest-guided clustering.

This class defines the shared interface for metrics that derive pairwise sample distances from the structure of a fitted random forest. It provides utilities for allocating distance matrices either in memory or as disk-backed memmap arrays, and for safely removing memmap-backed matrices after use.

Subclasses are responsible for building the forest-derived sample encoding and for implementing the actual distance computation. Typical implementations include proximity-based distances using terminal leaf agreement and LCA-based distances using root-to-leaf path information.

This class is abstract and must not be instantiated directly. Use a concrete subclass such as DistanceRandomForestProximity or DistanceRandomForestLCA.

Parameters:

memory_efficient (bool) – If True, store distance matrices as disk-backed memmap arrays instead of dense arrays in memory.
dir_distance_matrix (str | None) – Directory used for memmap-backed distance matrices. Required when memory_efficient=True.

abstract calculate_forest_encoding(estimator: RandomForestClassifier | RandomForestRegressor, X: DataFrame) → None

Build the subclass-specific forest encoding for the input samples.

The encoding describes how each sample traverses the fitted random forest and is stored on the instance for later distance computations. For example, proximity-based metrics may store terminal leaf indices, while LCA-based metrics may store root-to-leaf paths and path lengths.

This method must be called before calculate_distance_matrix(), compute_inertia(), or assign_labels().

Parameters:

estimator (RandomForestClassifier | RandomForestRegressor) – Fitted random forest estimator.
X (pd.DataFrame) – Feature matrix to encode. Columns must be compatible with the fitted estimator.

Returns:

None.

Return type:

None

abstract calculate_distance_matrix(sample_indices: ndarray | None) → tuple[numpy.ndarray | numpy.memmap, str | None]

Compute the pairwise distance matrix for the selected samples.

The distance computation uses the subclass-specific forest encoding generated by calculate_forest_encoding(). Subclasses define both the interpretation of the forest structure and the numerical kernel used to compute distances.

Parameters:

sample_indices (np.ndarray | None) – Indices of samples for which pairwise distances are computed. If None, distances are computed for all encoded samples.

Returns:

Tuple containing the distance matrix and the backing memmap file path. The path is None when the matrix is allocated in memory.

Return type:

tuple[np.ndarray | np.memmap, str | None]

abstract compute_inertia(sample_idx: ndarray, medoids_idx: ndarray) → float

Compute the medoid inertia for a set of samples.

The inertia is the sum, over all samples in sample_idx, of the distance to the nearest medoid in medoids_idx. This method allows medoid-based algorithms such as PAM or CLARA to optimize the same metric used for the full distance matrix.
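The definition above can be written out as a minimal sketch. It assumes a small precomputed distance matrix `dist`, which is an illustration-only argument: the concrete subclasses compute distances from their stored forest encoding instead. The nearest-medoid lookup that produces zero-based labels is shown alongside because it reuses the same distances.

```python
def compute_inertia(dist, sample_idx, medoids_idx):
    """Sum, over all samples, of the distance to the nearest medoid."""
    return sum(min(dist[i][m] for m in medoids_idx) for i in sample_idx)


def assign_labels(dist, sample_idx, medoids_idx):
    """Zero-based position in medoids_idx of each sample's nearest medoid."""
    labels = []
    for i in sample_idx:
        distances = [dist[i][m] for m in medoids_idx]
        labels.append(distances.index(min(distances)))
    return labels


# Four points on a line; distance is the absolute difference.
positions = [0, 1, 5, 6]
dist = [[abs(a - b) for b in positions] for a in positions]

print(compute_inertia(dist, [0, 1, 2, 3], [0, 3]))  # 2
print(assign_labels(dist, [0, 1, 2, 3], [0, 3]))    # [0, 0, 1, 1]
```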

Parameters:

sample_idx (np.ndarray) – Indices of samples included in the inertia calculation.
medoids_idx (np.ndarray) – Indices of candidate medoids.

Returns:

Sum of distances from each sample to its nearest medoid.

Return type:

float

abstract assign_labels(sample_idx: ndarray, medoids_idx: ndarray) → ndarray

Assign each sample to its nearest medoid.

Cluster labels are zero-based and correspond to the position of the selected medoid in medoids_idx.

Parameters:

sample_idx (np.ndarray) – Indices of samples to assign to clusters.
medoids_idx (np.ndarray) – Indices of medoids defining the clusters.

Returns:

Cluster label for each sample in sample_idx.

Return type:

np.ndarray

remove_distance_matrix(distance_matrix: ndarray | memmap, file_distance_matrix: str | None) → None

Release a distance matrix and remove its backing file if present.

If distance_matrix is a memmap, the data are flushed before references are released. Garbage collection is triggered to help close file handles, and the backing file is removed when file_distance_matrix points to an existing file.

Parameters:

distance_matrix (np.ndarray | np.memmap) – Distance matrix returned by _allocate_distance_matrix().
file_distance_matrix (str | None) – Path to the memmap backing file, or None for in-memory matrices.

Returns:

None.

Return type:

None

class DistanceRandomForestProximity(memory_efficient: bool = False, dir_distance_matrix: str | None = None, min_samples_in_node: int | None = None, max_depth_for_proximity: int | None = None, min_variance_in_node: float | None = None)

Bases: DistanceRandomForestBase

Proximity-based Random Forest distance metric.

This class computes sample distances from Random Forest node assignments. By default, proximity is defined as the fraction of trees in which two samples reach the same terminal leaf. Distance is then computed as 1 - proximity.

Optionally, terminal leaves can be collapsed to coarser ancestor nodes before computing proximity. This can reduce sparsity in deep forests, especially for regression models.

At most one ancestor-collapse strategy may be enabled at a time:

min_samples_in_node: Use the nearest ancestor containing at least this many training samples.
max_depth_for_proximity: Use the nearest ancestor whose depth is less than or equal to this threshold.
min_variance_in_node: Regression-only. Use the nearest ancestor whose impurity is greater than or equal to this variance threshold.
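As a rough sketch of the min_samples_in_node strategy, the walk from a terminal leaf to its nearest qualifying ancestor might look as follows. The `parent` and `n_node_samples` lists and the helper name `collapse_to_ancestor` are assumptions made for illustration, and the sketch assumes a leaf that already meets the threshold is kept unchanged.

```python
def collapse_to_ancestor(leaf, parent, n_node_samples, min_samples):
    """Walk from `leaf` toward the root until a node holds enough samples."""
    node = leaf
    # The root has parent -1, so the walk always stops there at the latest.
    while n_node_samples[node] < min_samples and parent[node] != -1:
        node = parent[node]
    return node


# Toy tree: node 0 is the root with children 1 and 2;
# nodes 3 and 4 are the children of node 1.
parent = [-1, 0, 0, 1, 1]
n_node_samples = [100, 60, 40, 5, 55]

print(collapse_to_ancestor(3, parent, n_node_samples, min_samples=10))  # 1
print(collapse_to_ancestor(4, parent, n_node_samples, min_samples=10))  # 4
print(collapse_to_ancestor(3, parent, n_node_samples, min_samples=80))  # 0
```

Leaves 3 and 4 share no terminal node, but with min_samples=80 both collapse to the root and would count as co-located in this tree, which is how collapsing reduces sparsity.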

Distance matrices can be allocated in memory or as disk-backed memmap arrays.

Parameters:

memory_efficient (bool) – If True, store distance matrices as disk-backed memmap arrays.
dir_distance_matrix (str | None) – Directory used for memmap-backed distance matrices. Required when memory_efficient=True.
min_samples_in_node (int | None) – Minimum number of training samples required for an ancestor node to be used as the effective leaf.
max_depth_for_proximity (int | None) – Maximum allowed depth for an ancestor node to be used as the effective leaf. 0 collapses all leaves to the root.
min_variance_in_node (float | None) – Minimum impurity threshold for an ancestor node to be used as the effective leaf. Requires a RandomForestRegressor trained with criterion="squared_error" or criterion="friedman_mse". 0 is treated as a no-op (identical to None) and skips ancestor collapse.

calculate_forest_encoding(estimator: RandomForestClassifier | RandomForestRegressor, X: DataFrame) → None

Encode samples by their effective node assignment in each tree.

The initial encoding is obtained from estimator.apply(X), which returns the terminal leaf reached by each sample in each tree. If an ancestor-collapse option is configured, each terminal leaf is replaced by the nearest ancestor satisfying the selected criterion.

The resulting array is stored in self.terminals with shape (n_samples, n_estimators) and dtype int32.

Parameters:

estimator (RandomForestClassifier | RandomForestRegressor) – Fitted Random Forest estimator used to encode the samples.
X (pd.DataFrame) – Feature matrix to encode. Columns must be compatible with the fitted estimator.

Raises:

ValueError – If min_variance_in_node is used with a classifier.
ValueError – If min_variance_in_node is used with an unsupported regression criterion.

Returns:

None.

Return type:

None

calculate_distance_matrix(sample_indices: ndarray | None) → tuple[numpy.ndarray | numpy.memmap, str | None]

Compute the proximity-based pairwise distance matrix.

For two samples, proximity is the fraction of trees in which both samples have the same effective node assignment. The returned distance is 1 - proximity.

If sample_indices is provided, only the selected rows of self.terminals are used. Otherwise, the full encoded sample set is used.
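To make the definition concrete, here is a minimal pure-Python sketch of 1 - proximity. The nested list `terminals` stands in for the (n_samples, n_estimators) array of effective node assignments; the function name is hypothetical and the loop is written for readability rather than speed.

```python
def proximity_distance_matrix(terminals):
    """Return 1 - (fraction of trees in which two samples share a node)."""
    n_samples = len(terminals)
    n_trees = len(terminals[0])
    dist = [[0.0] * n_samples for _ in range(n_samples)]
    for i in range(n_samples):
        for j in range(i + 1, n_samples):
            shared = sum(a == b for a, b in zip(terminals[i], terminals[j]))
            dist[i][j] = dist[j][i] = 1.0 - shared / n_trees
    return dist


# Three samples, four trees: each row lists the node reached in every tree.
terminals = [
    [3, 7, 2, 5],
    [3, 7, 4, 5],
    [6, 1, 4, 9],
]
dist = proximity_distance_matrix(terminals)
print(dist[0][1])  # 0.25 (same node in 3 of 4 trees)
print(dist[0][2])  # 1.0  (no tree in common)
print(dist[1][2])  # 0.75 (same node in 1 of 4 trees)
```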

Parameters:

sample_indices (np.ndarray | None) – Indices of samples for which pairwise distances are computed. If None, all encoded samples are used.

Raises:

ValueError – If calculate_forest_encoding() has not been called.
MemoryError – If memmap allocation is requested but there is insufficient disk space.

Returns:

Tuple containing the distance matrix and the backing memmap file path. The path is None for in-memory matrices.

Return type:

tuple[np.ndarray | np.memmap, str | None]

compute_inertia(sample_idx: ndarray, medoids_idx: ndarray) → float

Compute medoid inertia using the proximity-based distance.

The inertia is the sum, over all samples in sample_idx, of the distance to the closest medoid in medoids_idx.

Parameters:

sample_idx (np.ndarray) – Indices of samples included in the inertia calculation.
medoids_idx (np.ndarray) – Indices of medoid samples.

Raises:

ValueError – If calculate_forest_encoding() has not been called.

Returns:

Sum of distances from each sample to its closest medoid.

Return type:

float

assign_labels(sample_idx: ndarray, medoids_idx: ndarray) → ndarray

Assign samples to their nearest medoid using the proximity-based distance.

Cluster labels are zero-based and correspond to the position of the nearest medoid in medoids_idx.

Parameters:

sample_idx (np.ndarray) – Indices of samples to assign to clusters.
medoids_idx (np.ndarray) – Indices of medoids defining the clusters.

Raises:

ValueError – If calculate_forest_encoding() has not been called.

Returns:

Cluster label for each sample in sample_idx.

Return type:

np.ndarray

class DistanceRandomForestLCA(memory_efficient: bool = False, dir_distance_matrix: str | None = None)

Bases: DistanceRandomForestBase

LCA-based Random Forest distance metric.

This class computes sample distances from root-to-leaf decision paths in a fitted Random Forest. For each pair of samples and each tree, similarity is defined as the depth of the least common ancestor (LCA), normalized by the longer of the two root-to-leaf path lengths. Distances are computed as 1 - mean_similarity across trees.

Unlike terminal-node proximity, this metric can assign non-zero similarity to samples that end in different leaves if they share part of the same decision path.

Parameters:

memory_efficient (bool) – If True, store distance matrices as disk-backed memmap arrays.
dir_distance_matrix (str | None) – Directory used for memmap-backed distance matrices. Required when memory_efficient=True.

calculate_forest_encoding(estimator: RandomForestClassifier | RandomForestRegressor, X: DataFrame) → None

Encode samples by their root-to-leaf decision paths in each tree.

Leaf assignments are obtained from estimator.apply(X). For each tree, this method reconstructs the full root-to-leaf node-id path for every sample by walking from each terminal leaf to the root through the parent array.

Paths are stored in self.paths as a padded int32 tensor with shape (n_samples, n_estimators, max_path_len). Unused positions are filled with -1. Effective path lengths are stored in self.path_lens with shape (n_samples, n_estimators).
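The path reconstruction can be illustrated with plain lists. The sketch below assumes the tree is described by `children_left` and `children_right` arrays that use -1 for leaves (the convention of scikit-learn's tree structure) and derives a parent array from them; the helper names are hypothetical and the real method stores its result as padded int32 arrays rather than lists.

```python
def build_parent_array(children_left, children_right):
    """Invert the child arrays; the root keeps parent -1."""
    parent = [-1] * len(children_left)
    for node, (left, right) in enumerate(zip(children_left, children_right)):
        if left != -1:
            parent[left] = node
        if right != -1:
            parent[right] = node
    return parent


def root_to_leaf_path(leaf, parent):
    """Walk from the leaf up to the root, then reverse the node ids."""
    path = [leaf]
    while parent[path[-1]] != -1:
        path.append(parent[path[-1]])
    path.reverse()
    return path


def pad_paths(paths, fill=-1):
    """Pad all paths to the longest one and report their true lengths."""
    max_len = max(len(p) for p in paths)
    padded = [p + [fill] * (max_len - len(p)) for p in paths]
    return padded, [len(p) for p in paths]


# Toy tree: 0 -> (1, 2), 1 -> (3, 4); nodes 2, 3 and 4 are leaves.
children_left = [1, 3, -1, -1, -1]
children_right = [2, 4, -1, -1, -1]
parent = build_parent_array(children_left, children_right)

paths = [root_to_leaf_path(leaf, parent) for leaf in (3, 2)]
padded, lengths = pad_paths(paths)
print(padded)   # [[0, 1, 3], [0, 2, -1]]
print(lengths)  # [3, 2]
```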

Parameters:

estimator (RandomForestClassifier | RandomForestRegressor) – Fitted Random Forest estimator used to encode the samples.
X (pd.DataFrame) – Feature matrix to encode. Columns must be compatible with the fitted estimator.

Returns:

None.

Return type:

None

calculate_distance_matrix(sample_indices: ndarray | None) → tuple[numpy.ndarray | numpy.memmap, str | None]

Compute the LCA-based pairwise distance matrix.

For each tree, two samples are compared by the deepest shared node along their root-to-leaf paths. The depth of this LCA is normalized by the longer path length of the two samples. Per-tree similarities are averaged, and distance is computed as 1 - mean_similarity.

If sample_indices is provided, only the selected rows of self.paths and self.path_lens are used. Otherwise, all encoded samples are used.
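A small worked sketch of the per-tree similarity and the resulting distance follows. How depth and path length are counted is not specified here, so the sketch assumes both are measured in edges (the root has depth 0), which makes samples that split at the root fully dissimilar in that tree. The function names are hypothetical.

```python
def lca_similarity(path_a, path_b):
    """LCA depth divided by the longer path length, both counted in edges."""
    shared_nodes = 0
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        shared_nodes += 1
    lca_depth = shared_nodes - 1              # both paths start at the root
    longest = max(len(path_a), len(path_b)) - 1
    if longest == 0:                          # single-node tree
        return 1.0
    return lca_depth / longest


def lca_distance(paths_a, paths_b):
    """1 - mean per-tree similarity; one path per tree for each sample."""
    sims = [lca_similarity(a, b) for a, b in zip(paths_a, paths_b)]
    return 1.0 - sum(sims) / len(sims)


# Two trees. In tree 0 the samples split below node 1; in tree 1 they
# reach the same leaf.
sample_a = [[0, 1, 3], [0, 2]]
sample_b = [[0, 1, 4], [0, 2]]

print(lca_distance(sample_a, sample_b))  # 0.25
print(lca_distance(sample_a, sample_a))  # 0.0
```

Under terminal-node proximity the same pair would agree in only one of two trees (distance 0.5); the shared prefix in tree 0 is what lowers the LCA-based distance to 0.25.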

Parameters:

sample_indices (np.ndarray | None) – Indices of samples for which pairwise distances are computed. If None, all encoded samples are used.

Raises:

ValueError – If calculate_forest_encoding() has not been called.
MemoryError – If memmap allocation is requested but there is insufficient disk space.

Returns:

Tuple containing the distance matrix and the backing memmap file path. The path is None for in-memory matrices.

Return type:

tuple[np.ndarray | np.memmap, str | None]

compute_inertia(sample_idx: ndarray, medoids_idx: ndarray) → float

Compute medoid inertia using the LCA-based distance.

The inertia is the sum, over all samples in sample_idx, of the LCA-based distance to the closest medoid in medoids_idx.

Parameters:

sample_idx (np.ndarray) – Indices of samples included in the inertia calculation.
medoids_idx (np.ndarray) – Indices of medoid samples.

Raises:

ValueError – If calculate_forest_encoding() has not been called.

Returns:

Sum of distances from each sample to its closest medoid.

Return type:

float

assign_labels(sample_idx: ndarray, medoids_idx: ndarray) → ndarray

Assign samples to their nearest medoid using the LCA-based distance.

Cluster labels are zero-based and correspond to the position of the nearest medoid in medoids_idx.

Parameters:

sample_idx (np.ndarray) – Indices of samples to assign to clusters.
medoids_idx (np.ndarray) – Indices of medoids defining the clusters.

Raises:

ValueError – If calculate_forest_encoding() has not been called.

Returns:

Cluster label for each sample in sample_idx.

Return type:

np.ndarray

class DistanceJensenShannon(scale_features: bool)

Bases: object

Jensen-Shannon distance between cluster-specific and background feature distributions.

This metric compares the distribution of a feature within a cluster against its distribution in the full dataset. Both numeric and categorical features are supported.

Categorical features are compared using normalized category frequencies. Numeric features are compared using histogram-based approximations of their empirical distributions.

Optionally, numeric features can be scaled before distance computation.

Parameters:

scale_features (bool) – If True, scale numeric features before computing distances.

run_scale_features(X: DataFrame) → DataFrame

Scale numeric feature columns using standard scaling without mean centering.

Only numeric columns are transformed. Non-numeric columns are returned unchanged.
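Scaling without mean centering divides each numeric column by its standard deviation and leaves the mean where the division puts it. The sketch below assumes the population standard deviation and leaves constant columns untouched; both choices are assumptions for illustration, and the helper name is hypothetical.

```python
import statistics


def scale_without_centering(column):
    """Divide by the population standard deviation; do not subtract the mean."""
    std = statistics.pstdev(column)
    if std == 0:
        return list(column)  # constant column: nothing to rescale
    return [x / std for x in column]


values = [2, 4, 4, 4, 5, 5, 7, 9]        # mean 5, standard deviation 2
scaled = scale_without_centering(values)
print(scaled)                   # [1.0, 2.0, 2.0, 2.0, 2.5, 2.5, 3.5, 4.5]
print(statistics.mean(scaled))  # 2.5 (not centered at zero)
```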

Parameters:

X (pd.DataFrame) – Input feature matrix.

Returns:

Feature matrix with scaled numeric columns.

Return type:

pd.DataFrame

calculate_distance_cluster_vs_background(values_background: Series, values_cluster: Series, is_categorical: bool) → float

Compute the Jensen-Shannon distance between cluster and background distributions.

For categorical features, normalized category frequencies are computed over the categories present in the background data.

For numeric features, histogram-based probability distributions are constructed using bin edges derived from the background values. The number of bins is estimated using the Freedman-Diaconis rule with additional lower and upper bounds for numerical stability.
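The two ingredients above can be sketched with the standard library. The Jensen-Shannon distance is the square root of the Jensen-Shannon divergence; the sketch uses base-2 logarithms so that the result lies in [0, 1], which is an assumption about the logarithm base. The bin-count helper applies the Freedman-Diaconis width 2 * IQR / n^(1/3) and clamps the result; the bounds `min_bins` and `max_bins` are hypothetical placeholders, not the package's values.

```python
import math
import statistics
from collections import Counter


def category_frequencies(values, categories):
    """Normalized frequency of each category, in the given order."""
    counts = Counter(values)
    return [counts[c] / len(values) for c in categories]


def jensen_shannon_distance(p, q, base=2.0):
    """Square root of the Jensen-Shannon divergence between p and q."""
    def kl(a, b):
        return sum(x * math.log(x / y, base) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return math.sqrt(max(0.0, (kl(p, m) + kl(q, m)) / 2))


def freedman_diaconis_bins(values, min_bins=5, max_bins=50):
    """Bin count from the Freedman-Diaconis rule, clamped to the bounds."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    span = max(values) - min(values)
    if q3 == q1 or span == 0:
        return min_bins
    width = 2 * (q3 - q1) / len(values) ** (1 / 3)
    return max(min_bins, min(max_bins, math.ceil(span / width)))


background = ["a", "a", "b", "c"]
cluster = ["a", "a"]
categories = sorted(set(background))  # categories present in the background
p_bg = category_frequencies(background, categories)
p_cl = category_frequencies(cluster, categories)
print(p_bg, p_cl)  # [0.5, 0.25, 0.25] [1.0, 0.0, 0.0]
print(jensen_shannon_distance(p_bg, p_cl))
print(freedman_diaconis_bins(list(range(100))))
```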

Parameters:

values_background (pd.Series) – Feature values from the full dataset.
values_cluster (pd.Series) – Feature values from the cluster being evaluated.
is_categorical (bool) – Whether the feature should be treated as categorical.

Returns:

Jensen-Shannon distance between the cluster and background distributions.

Return type:

float

class DistanceWasserstein(scale_features: bool)

Bases: object

Wasserstein distance between cluster-specific and background feature distributions.

This metric compares the distribution of a feature within a cluster against its distribution in the full dataset. Both numeric and categorical features are supported.

Numeric features are compared directly using the first Wasserstein distance. Categorical features are dummy-encoded and compared independently per category, with the maximum category-wise Wasserstein distance returned as the final score.

Optionally, numeric features can be scaled before distance computation.

Parameters:

scale_features (bool) – If True, scale numeric features before computing distances.

run_scale_features(X: DataFrame) → DataFrame

Scale numeric feature columns using standard scaling without mean centering.

Only numeric columns are transformed. Non-numeric columns are returned unchanged.

Parameters:

X (pd.DataFrame) – Input feature matrix.

Returns:

Feature matrix with scaled numeric columns.

Return type:

pd.DataFrame

calculate_distance_cluster_vs_background(values_background: Series, values_cluster: Series, is_categorical: bool) → float

Compute the Wasserstein distance between cluster and background distributions.

For categorical features, values are one-hot encoded and the Wasserstein distance is computed independently for each category indicator variable. The maximum category-wise distance is returned.

For numeric features, the raw feature values are compared directly using the first Wasserstein distance.
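For one-dimensional samples, the first Wasserstein distance equals the area between the two empirical cumulative distribution functions. The sketch below computes that area directly in pure Python; the function names are hypothetical and it is not the routine the class calls. For a category indicator (0/1) variable the distance reduces to the absolute difference in category frequency, so the categorical score is the largest frequency shift among the categories.

```python
from bisect import bisect_right


def wasserstein_1d(u, v):
    """Area between the empirical CDFs of two one-dimensional samples."""
    u_sorted, v_sorted = sorted(u), sorted(v)
    points = sorted(u_sorted + v_sorted)
    total = 0.0
    for left, right in zip(points, points[1:]):
        cdf_u = bisect_right(u_sorted, left) / len(u_sorted)
        cdf_v = bisect_right(v_sorted, left) / len(v_sorted)
        total += abs(cdf_u - cdf_v) * (right - left)
    return total


def categorical_wasserstein(background, cluster):
    """Maximum distance over one-hot indicators of the background categories."""
    return max(
        wasserstein_1d(
            [int(x == category) for x in background],
            [int(x == category) for x in cluster],
        )
        for category in set(background)
    )


# Numeric feature: the cluster sits 5 units above the comparison sample.
print(wasserstein_1d([0, 1, 3], [5, 6, 8]))  # approximately 5

# Categorical feature: "a" makes up 50% of the background, 100% of the cluster.
print(categorical_wasserstein(["a", "a", "b", "b"], ["a", "a"]))  # 0.5
```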

Parameters:

values_background (pd.Series) – Feature values from the full dataset.
values_cluster (pd.Series) – Feature values from the cluster being evaluated.
is_categorical (bool) – Whether the feature should be treated as categorical.

Returns:

Wasserstein distance between the cluster and background distributions.

Return type:

float

class Optimizer(distance_metric: DistanceRandomForestBase, clustering_strategy: ClusteringKMedoids | ClusteringClara, random_state: int | None)

Bases: object

Optimize the number of clusters using stability and target-based quality.

For each candidate k, samples are clustered with the configured forest-derived distance metric and clustering strategy. The resulting clustering is evaluated by bootstrap-based Jaccard stability and by a task-specific quality score.

Classification models are scored with balanced average Gini impurity. Regression models are scored with normalized within-cluster variation. The selected solution is the stable clustering with the lowest quality score.
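The selection rule can be sketched as a small helper. The function name and argument names are hypothetical, the threshold is treated as inclusive (an assumption, since the exact comparison is not spelled out), and ties in the score are broken toward the smaller k.

```python
def select_best_k(ks, mean_ji, scores, ji_threshold):
    """Return the stable k with the lowest score, or None if none is stable."""
    stable = [
        (score, k)
        for k, ji, score in zip(ks, mean_ji, scores)
        if ji >= ji_threshold
    ]
    if not stable:
        return None
    return min(stable)[1]


ks = [2, 3, 4]
mean_ji = [0.9, 0.7, 0.4]
scores = [0.30, 0.20, 0.10]

# k=4 has the lowest score but is unstable, so k=3 wins.
print(select_best_k(ks, mean_ji, scores, ji_threshold=0.6))   # 3
print(select_best_k(ks, mean_ji, scores, ji_threshold=0.95))  # None
```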

Parameters:

distance_metric (DistanceRandomForestBase) – Forest-derived distance metric used for clustering.
clustering_strategy (ClusteringKMedoids | ClusteringClara) – Clustering strategy used to produce cluster assignments.
random_state (int | None) – Random seed used for reproducible bootstrap sampling.
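To make the bootstrap-based Jaccard stability more concrete, the following sketch scores each full-data cluster by its best-matching bootstrap cluster. It is an illustration only: the function names and toy labels are hypothetical, and restricting each full-data cluster to the sampled points before matching is an assumption rather than a confirmed detail of the Optimizer.

```python
import numpy as np


def jaccard_index(a: set, b: set) -> float:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def bootstrap_cluster_stability(full_labels, boot_indices, boot_labels) -> dict:
    """Hypothetical sketch: best Jaccard match per full-data cluster."""
    sampled = {int(i) for i in boot_indices}
    # Group the sampled indices by the label they received in the bootstrap run.
    boot_clusters: dict = {}
    for idx, label in zip(boot_indices, boot_labels):
        boot_clusters.setdefault(int(label), set()).add(int(idx))
    scores = {}
    for cluster in np.unique(full_labels):
        # Assumption: compare only the members that were actually sampled.
        members = {int(i) for i in np.flatnonzero(full_labels == cluster)} & sampled
        scores[int(cluster)] = max(
            jaccard_index(members, boot_members) for boot_members in boot_clusters.values()
        )
    return scores


# Toy example: sample 5 is assigned to a different cluster in the bootstrap run.
full_labels = np.array([0, 0, 0, 1, 1, 1])
stability = bootstrap_cluster_stability(
    full_labels, boot_indices=[0, 1, 3, 4, 5], boot_labels=[0, 0, 1, 1, 0]
)
```

Averaging such scores over many bootstrap iterations gives a per-cluster stability estimate that can be compared against a threshold.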

optimizeK(y: Series, k_range: tuple[int, int], JI_bootstrap_iter: int, JI_bootstrap_sample_size: int | float, JI_discart_value: float, model_type: type[sklearn.ensemble._forest.RandomForestClassifier] | type[sklearn.ensemble._forest.RandomForestRegressor], n_jobs: int, verbose: int) → tuple[list[dict], int | None]#

Evaluate candidate cluster numbers and select the best stable solution.

For each value of k in the inclusive k_range, the full dataset is clustered. Stability is estimated by repeated subsampling, reclustering each subsample, and matching bootstrap clusters to full-data clusters with the Jaccard index. Quality is evaluated with balanced average impurity for classification targets or normalized within-cluster variation for regression targets.

Cluster labels are reordered by increasing mean target value before results are stored. The selected best_k is the stable candidate with the lowest quality score. If no candidate's mean Jaccard index exceeds JI_discart_value, best_k is None.

Parameters:

y (pd.Series) – Target values aligned with the encoded samples.
k_range (tuple[int, int]) – Inclusive range of cluster counts to evaluate as (min_k, max_k).
JI_bootstrap_iter (int) – Number of bootstrap iterations used for stability estimation.
JI_bootstrap_sample_size (int | float) – Number of samples drawn in each bootstrap iteration.
JI_discart_value (float) – Minimum mean Jaccard index required for a clustering to be considered stable.
model_type (type[RandomForestClassifier] | type[RandomForestRegressor]) – Random Forest estimator class used to select classification or regression scoring.
n_jobs (int) – Number of parallel jobs used for bootstrap stability computation.
verbose (int) – Verbosity level for progress bars and printed summaries.

Raises:

ValueError – If model_type is neither a Random Forest classifier nor a Random Forest regressor.

Returns:

Tuple containing all per-k result dictionaries and the selected best_k.

Return type:

tuple[list[dict], int | None]
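The selection rule described for optimizeK can be sketched as follows. The dictionary keys "k", "mean_JI", and "score" and the helper select_best_k are hypothetical. The actual structure of the per-k result dictionaries is not documented here.

```python
def select_best_k(results: list[dict], ji_threshold: float):
    """Hypothetical sketch: lowest quality score among stable candidates."""
    # Keep only candidates whose mean Jaccard index exceeds the threshold.
    stable = [r for r in results if r["mean_JI"] > ji_threshold]
    if not stable:
        return None  # no stable clustering was found
    # Lower scores mean purer (classification) or tighter (regression) clusters.
    return min(stable, key=lambda r: r["score"])["k"]


# Hypothetical per-k results.
candidates = [
    {"k": 2, "mean_JI": 0.95, "score": 0.40},
    {"k": 3, "mean_JI": 0.80, "score": 0.25},
    {"k": 4, "mean_JI": 0.50, "score": 0.10},
]
best_k = select_best_k(candidates, ji_threshold=0.6)
no_stable_k = select_best_k(candidates, ji_threshold=0.99)
```

In this toy example k=4 has the lowest score but is discarded as unstable, so k=3 is selected. With a threshold that no candidate exceeds, the result is None.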

class FeatureImportance(distance_metric: DistanceJensenShannon | DistanceWasserstein)#

Bases: object

Compute feature importance from cluster-vs-background distribution shifts.

For each feature and cluster, the configured distance metric compares the feature distribution within the cluster to the feature distribution across the full dataset. These cluster-specific scores form the local feature importance matrix.

Local scores are normalized within each cluster so that the largest feature distance is 1. Global feature importance is computed as the mean local importance across clusters and is used to rank feature columns in the returned clustering table.

Parameters:

distance_metric (DistanceJensenShannon | DistanceWasserstein) – Distance metric used to compare cluster and background feature distributions.

calculate_feature_importance(X: DataFrame, y: Series, y_pred: Series | None, cluster_labels: ndarray, verbose: int) → tuple[pandas.core.frame.DataFrame, pandas.core.series.Series, pandas.core.frame.DataFrame]#

Calculate local and global feature importance.

The feature matrix, target values, optional predictions, and cluster labels are combined into a clustering table. Local feature importance is computed as normalized cluster-vs-background distances for each feature and cluster. Global feature importance is the mean local importance across clusters.

The returned clustering table is sorted by cluster, target, and, if present, predicted_target. Feature columns are ordered by descending global importance.

Parameters:

X (pd.DataFrame) – Feature matrix with one row per sample.
y (pd.Series) – Target values aligned with X.
y_pred (pd.Series | None) – Optional predicted target values aligned with X.
cluster_labels (np.ndarray) – Cluster labels aligned with X.
verbose (int) – Verbosity level for progress output.

Returns:

Tuple containing local feature importance, global feature importance, and the ranked clustering table.

Return type:

tuple[pd.DataFrame, pd.Series, pd.DataFrame]
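The normalization and aggregation steps can be illustrated with a small pandas sketch. The raw distance values, feature names, and cluster labels below are made up for illustration. The layout with features as rows and clusters as columns is an assumption about the local importance matrix.

```python
import pandas as pd

# Hypothetical raw cluster-vs-background distances (rows: features, columns: clusters).
raw_distances = pd.DataFrame(
    {1: [0.2, 0.8, 0.4], 2: [0.5, 0.1, 0.25]},
    index=["feat_a", "feat_b", "feat_c"],
)

# Local importance: scale each cluster column so its largest distance is 1.
local_importance = raw_distances / raw_distances.max(axis=0)

# Global importance: mean local importance across clusters, highest first.
global_importance = local_importance.mean(axis=1).sort_values(ascending=False)

# Feature order that would be used to arrange the clustering table columns.
ranked_features = list(global_importance.index)
```

Normalizing within each cluster makes the scores comparable across clusters, so a feature ranks highly overall only if it separates several clusters from the background.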
