fgclustering package#

Submodules#

fgclustering.forest_guided_clustering module#

class fgclustering.forest_guided_clustering.FgClustering(model, data, target_column, random_state=42)#

Bases: object

Forest-Guided Clustering.

Computes feature importance based on subgroups of instances that follow similar decision rules within the Random Forest model.

Parameters
  • model (sklearn.ensemble) – Trained Random Forest model.

  • data (pandas.DataFrame) – Input data with feature matrix. If target_column is a string, it must be a column in the data.

  • target_column (str or numpy.ndarray) – Name of target column or target values as numpy array.

  • random_state (int, optional) – Seed for the random number generator, defaults to 42

Raises

ValueError – raised if the Random Forest model is not a sklearn.ensemble.RandomForestClassifier or sklearn.ensemble.RandomForestRegressor object

run(number_of_clusters=None, max_K=8, method_clustering='pam', init_clustering='random', max_iter_clustering=100, discart_value_JI=0.6, bootstraps_JI=100, bootstraps_p_value=100, n_jobs=1, verbose=1)#

Runs the forest-guided clustering model. The optimal number of clusters for the k-medoids clustering is computed based on the distance matrix derived from the Random Forest proximity matrix.

Parameters
  • number_of_clusters (int, optional) – Number of clusters for the k-medoids clustering. Leave None if number of clusters should be optimized, defaults to None

  • max_K (int, optional) – Maximum number of clusters for cluster score computation, defaults to 8

  • method_clustering ({'fasterpam', 'fastpam1', 'pam', 'alternate', 'fastermsc', 'fastmsc', 'pamsil', 'pammedsil'}, optional) – Which algorithm to use. 'alternate' is faster while 'pam' is more accurate, defaults to 'pam'. Use 'fasterpam' for big datasets. See the python kmedoids documentation for the other implemented methods.

  • init_clustering ({'random', 'first', 'build'}, optional) – Specify medoid initialization method. See python kmedoids documentation for parameter description, defaults to ‘random’

  • max_iter_clustering (int, optional) – Number of iterations for k-medoids clustering, defaults to 100

  • discart_value_JI (float, optional) – Minimum Jaccard Index for cluster stability, defaults to 0.6

  • bootstraps_JI (int, optional) – Number of bootstraps to compute the Jaccard Index, defaults to 100

  • bootstraps_p_value (int, optional) – Number of bootstraps to compute the p-value of feature importance, defaults to 100

  • n_jobs (int, optional) – Maximum number of jobs to run in parallel when creating bootstraps to compute the Jaccard Index. n_jobs=1 means no parallel computing is used, defaults to 1

  • verbose ({0,1}, optional) – print the output of fgc cluster optimization process (the Jaccard index and score for each cluster number); defaults to 1 (printing). Set to 0 for no outputs.

calculate_statistics(data, target_column, bootstraps_p_value=100)#

Recalculates p-values for each feature (over all clusters and per cluster) based on the new feature matrix. This impacts all plotting functions. Note: the new feature matrix must have the same number of samples and the same ordering of samples as the original feature matrix.

Parameters
  • data (pandas.DataFrame) – Input data with feature matrix. If target_column is a string, it must be a column in the data.

  • target_column (str or numpy.ndarray) – Name of target column or target values as numpy array.

  • bootstraps_p_value (int, optional) – Number of bootstraps to compute the p-value of feature importance, defaults to 100

plot_global_feature_importance(save=None)#

Plot global feature importance. The p-values are computed using an ANOVA (for continuous variables) or a chi-square (for categorical variables) test, and feature importance is defined as 1 - p-value.

Parameters

save (str, optional) – Filename to save plot, if None the figure is not saved, defaults to None

plot_local_feature_importance(thr_pvalue=1, num_cols=4, save=None)#

Plot local feature importance to show the importance of each feature for each cluster, measured by variance and impurity of the feature within the cluster, i.e. the higher the feature importance, the lower the feature variance / impurity within the cluster.

Parameters
  • thr_pvalue (float, optional) – P-value threshold for feature filtering, defaults to 1

  • save (str, optional) – Filename to save plot, if None the figure is not saved, defaults to None

  • num_cols (int, optional) – Number of plots in one row, defaults to 4.

plot_decision_paths(distributions=True, heatmap=True, thr_pvalue=1, num_cols=6, save=None)#

Plot decision paths of the Random Forest model. If distributions = True, feature distributions per cluster are plotted as boxplots (for continuous features) or barplots (for categorical features). If heatmap = True, feature values are plotted in a heatmap sorted by clusters. For both plots, features are filtered and ranked by p-values of a statistical test (ANOVA for continuous features, chi-square for categorical features).

Parameters
  • distributions (boolean, optional) – Plot feature distributions, defaults to True

  • heatmap (boolean, optional) – Plot feature heatmap, defaults to True

  • thr_pvalue (float, optional) – P-value threshold for feature filtering, defaults to 1

  • save (str, optional) – Filename to save plot, if None the figure is not saved, defaults to None

  • num_cols (int, optional) – Number of plots in one row for the distributions plot, defaults to 6.

fgclustering.optimizer module#

fgclustering.optimizer.optimizeK(distance_matrix, y, model_type, max_K, method_clustering, init_clustering, max_iter_clustering, discart_value_JI, bootstraps_JI, random_state, n_jobs, verbose)#

Compute the optimal number of clusters for k-medoids clustering (trade-off between cluster purity and cluster stability).

Parameters
  • distance_matrix (pandas.DataFrame) – Proximity matrix of Random Forest model.

  • y (pandas.Series) – Target column.

  • model_type (str) – Model type of Random Forest model: classifier or regression.

  • max_K (int) – Maximum number of clusters for cluster score computation.

  • method_clustering (str) – Which algorithm to use. 'alternate' is faster while 'pam' is more accurate. See the python kmedoids documentation for the other implemented methods.

  • init_clustering (str) – Specify medoid initialization method. See the python kmedoids documentation for parameter description.

  • max_iter_clustering (int) – Number of iterations for k-medoids clustering.

  • discart_value_JI (float) – Minimum Jaccard Index for cluster stability.

  • bootstraps_JI (int) – Number of bootstraps to compute the Jaccard Index.

  • random_state (int) – Seed number for random state.

  • n_jobs (int) – Number of jobs to run in parallel when computing the cluster stability. n_jobs=1 means no parallel computing is used.

  • verbose ({0,1}) – Print the output of the fgc cluster optimization process (the Jaccard Index and score for each cluster number). Set to 0 for no outputs.

Returns

Optimal number of clusters.

Return type

int
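To make the stability criterion concrete, here is a small illustration of the Jaccard Index between an original cluster and its best-matching bootstrap cluster; the helper function below is illustrative, not part of the package:

```python
def jaccard_index(cluster_a, cluster_b):
    # similarity of two clusters, each given as a collection of sample indices
    a, b = set(cluster_a), set(cluster_b)
    return len(a & b) / len(a | b)

# stability of one original cluster: best Jaccard match among bootstrap clusters
original = [0, 1, 2, 3]
bootstrap_clusters = [[0, 1, 2], [4, 5]]
stability = max(jaccard_index(original, c) for c in bootstrap_clusters)
print(stability)  # 0.75
```

A clustering is considered stable if every cluster's average stability over all bootstraps exceeds discart_value_JI.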

fgclustering.plotting module#

fgclustering.statistics module#

fgclustering.statistics.compute_balanced_average_impurity(categorical_values, cluster_labels, rescaling_factor=None)#

Compute balanced average impurity as a score for categorical values in a clustering. The impurity score is the Gini coefficient of the classes within each cluster. The class sizes are balanced by rescaling with the inverse size of the class in the overall dataset.

Parameters
  • categorical_values (pandas.Series) – Values of categorical feature / target.

  • cluster_labels (numpy.ndarray) – Cluster labels for each value.

  • rescaling_factor (dict) – Dictionary with rescaling factor for each class / unique feature value. If parameter is set to None, the rescaling factor will be computed from the input data categorical_values, defaults to None

Returns

Impurity score.

Return type

float
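A simplified numpy sketch of this computation; the function name and details are illustrative, not the package implementation:

```python
import numpy as np

def balanced_average_impurity(values, labels):
    # inverse class frequency in the full dataset balances class sizes
    classes, counts = np.unique(values, return_counts=True)
    weight = dict(zip(classes, 1.0 / counts))
    scores = []
    for k in np.unique(labels):
        v = values[labels == k]
        # class proportions within the cluster, rescaled by class weight
        w = np.array([weight[c] * np.sum(v == c) for c in classes])
        p = w / w.sum()
        scores.append(1.0 - np.sum(p**2))  # Gini impurity of the cluster
    return float(np.mean(scores))

# perfectly pure clusters give impurity 0
print(balanced_average_impurity(np.array([0, 0, 1, 1]), np.array([0, 0, 1, 1])))  # 0.0
```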

fgclustering.statistics.compute_total_within_cluster_variation(continuous_values, cluster_labels)#

Compute total within cluster variation as score for continuous values in a clustering.

Parameters
  • continuous_values (pandas.Series) – Values of continuous feature / target.

  • cluster_labels (numpy.ndarray) – Cluster labels for each value.

Returns

Within cluster variation score.

Return type

float
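The score amounts to the sum of squared deviations from each cluster mean; a minimal numpy sketch (the helper name is illustrative):

```python
import numpy as np

def total_within_cluster_variation(values, labels):
    # sum over clusters of squared deviations from the cluster mean
    return float(sum(
        np.sum((values[labels == k] - values[labels == k].mean()) ** 2)
        for k in np.unique(labels)
    ))

# one cluster containing 1.0 and 3.0: mean 2.0, variation (1)^2 + (1)^2 = 2.0
print(total_within_cluster_variation(np.array([1.0, 3.0]), np.array([0, 0])))  # 2.0
```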

fgclustering.statistics.calculate_global_feature_importance(X, y, cluster_labels, model_type)#

Calculate global feature importance for each feature. The higher the importance of a feature, the lower the p-value obtained by an ANOVA (continuous feature) or chi-square (categorical feature) test. The p-values are returned; feature importance is defined as 1 - p-value.

Parameters
  • X (pandas.DataFrame) – Feature matrix.

  • y (pandas.Series) – Target column.

  • cluster_labels (numpy.ndarray) – Clustering labels.

  • model_type (str) – Model type of Random Forest model: classifier or regression.

Returns

DataFrame including features, target and cluster numbers, ranked by p-value of the statistical test, and a dictionary with the computed p-values of all features.

Return type

pandas.DataFrame and dict
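The per-feature test can be sketched with scipy; the helper below is a simplified illustration, not the package implementation:

```python
import numpy as np
from scipy.stats import chi2_contingency, f_oneway

def feature_p_value(values, cluster_labels, categorical=False):
    # split the feature values by cluster
    groups = [values[cluster_labels == k] for k in np.unique(cluster_labels)]
    if categorical:
        # chi-square test on the feature-value-by-cluster contingency table
        cats = np.unique(values)
        table = np.array([[np.sum(g == c) for c in cats] for g in groups])
        return chi2_contingency(table)[1]
    return f_oneway(*groups).pvalue  # ANOVA across clusters

# a feature that separates the clusters well yields a tiny p-value
values = np.array([1.0, 1.1, 0.9, 10.0, 10.1, 9.9])
labels = np.array([0, 0, 0, 1, 1, 1])
p = feature_p_value(values, labels)
```

Its importance, 1 - p, is then close to 1 for such a well-separating feature.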

fgclustering.statistics.calculate_local_feature_importance(data_clustering_ranked, bootstraps_p_value)#

Calculate the local importance of each feature within each cluster. The higher the importance of a feature, the lower the variance (continuous feature) or impurity (categorical feature) of that feature within the cluster. The p-values are returned; feature importance is defined as 1 - p-value.

Parameters
  • data_clustering_ranked (pandas.DataFrame) – Filtered and ranked data frame including features, target and cluster numbers.

  • bootstraps_p_value (int) – Number of bootstraps to be drawn for computation of p-value.

Returns

p-value matrix of all features per cluster.

Return type

pandas.DataFrame

fgclustering.utils module#

fgclustering.utils.scale_standard(X)#

Feature Scaling with StandardScaler.

Parameters

X (pandas.DataFrame) – Feature matrix.

Returns

Standardized feature matrix.

Return type

pandas.DataFrame

fgclustering.utils.scale_minmax(X)#

Feature Scaling with MinMaxScaler.

Parameters

X (pandas.DataFrame) – Feature matrix.

Returns

Min-max scaled feature matrix.

Return type

pandas.DataFrame
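Min-max scaling maps each feature to [0, 1]; a short sketch of the equivalent sklearn call (the example DataFrame is illustrative):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

X = pd.DataFrame({"a": [0.0, 5.0, 10.0]})
# (x - min) / (max - min), applied column-wise
X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X), columns=X.columns)
print(X_scaled["a"].tolist())  # [0.0, 0.5, 1.0]
```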

fgclustering.utils.proximityMatrix(model, X, normalize=True)#

Calculate proximity matrix of Random Forest model.

Parameters
  • model (sklearn.ensemble) – Trained Random Forest model.

  • X (pandas.DataFrame) – Feature matrix.

  • normalize (bool, optional) – Normalize proximity matrix by number of trees in the Random Forest, defaults to True.

Returns

Proximity matrix of Random Forest model.

Return type

numpy.ndarray
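A common way to compute such a proximity matrix is via model.apply, which returns the leaf index of every sample in every tree; samples that land in the same leaf often are "proximate". The helper name below is illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

def proximity_matrix(model, X, normalize=True):
    # leaf index of every sample in every tree: shape (n_samples, n_trees)
    leaves = model.apply(X)
    n_samples, n_trees = leaves.shape
    prox = np.zeros((n_samples, n_samples))
    for t in range(n_trees):
        # count pairs of samples sharing a leaf in tree t
        prox += leaves[:, t][:, None] == leaves[:, t][None, :]
    return prox / n_trees if normalize else prox

X, y = make_classification(n_samples=30, n_features=5, random_state=0)
rf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)
prox = proximity_matrix(rf, X)
```

The normalized matrix is symmetric with ones on the diagonal, since every sample always shares a leaf with itself.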

Module contents#