fgclustering package#
Submodules#
fgclustering.forest_guided_clustering module#
- class fgclustering.forest_guided_clustering.FgClustering(model, data, target_column, random_state=42)#
Bases: object
Forest-Guided Clustering.
Computes a feature importance based on subgroups of instances that follow similar decision rules within the Random Forest model.
- Parameters
model (sklearn.ensemble) – Trained Random Forest model.
data (pandas.DataFrame) – Input data with feature matrix. If target_column is a string it has to be a column in the data.
target_column (str or numpy.ndarray) – Name of target column or target values as numpy array.
random_state (int, optional) – seed for random number generator, defaults to 42
- Raises
ValueError – error raised if Random Forest model is not a sklearn.ensemble.RandomForestClassifier or sklearn.ensemble.RandomForestRegressor object
- run(number_of_clusters=None, max_K=8, method_clustering='pam', init_clustering='random', max_iter_clustering=100, discart_value_JI=0.6, bootstraps_JI=100, bootstraps_p_value=100, n_jobs=1, verbose=1)#
Runs the forest-guided clustering model. The optimal number of clusters for a k-medoids clustering is computed, based on the distance matrix computed from the Random Forest proximity matrix.
- Parameters
number_of_clusters (int, optional) – Number of clusters for the k-medoids clustering. Leave None if number of clusters should be optimized, defaults to None
max_K (int, optional) – Maximum number of clusters for cluster score computation, defaults to 8
method_clustering ({'fasterpam', 'fastpam1', 'pam', 'alternate', 'fastermsc', 'fastmsc', 'pamsil', 'pammedsil'}, optional) – Which algorithm to use. ‘alternate’ is faster while ‘pam’ is more accurate; use ‘fasterpam’ for big datasets. See python kmedoids documentation for the other implemented methods, defaults to ‘pam’
init_clustering ({'random', 'first', 'build'}, optional) – Specify medoid initialization method. See python kmedoids documentation for parameter description, defaults to ‘random’
max_iter_clustering (int, optional) – Number of iterations for k-medoids clustering, defaults to 100
discart_value_JI (float, optional) – Minimum Jaccard Index for cluster stability, defaults to 0.6
bootstraps_JI (int, optional) – Number of bootstraps to compute the Jaccard Index, defaults to 100
bootstraps_p_value (int, optional) – Number of bootstraps to compute the p-value of feature importance, defaults to 100
n_jobs (int, optional) – maximum number of jobs to run in parallel when creating bootstraps to compute the Jaccard index. n_jobs=1 means no parallel computing is used, defaults to 1
verbose ({0,1}, optional) – print the output of fgc cluster optimization process (the Jaccard index and score for each cluster number); defaults to 1 (printing). Set to 0 for no outputs.
- calculate_statistics(data, target_column, bootstraps_p_value=100)#
Recalculates p-values for each feature (over all clusters and per cluster) based on the new feature matrix. This impacts all plotting functions. Note: the new feature matrix must have the same number of samples and the same ordering of samples as the original feature matrix.
- Parameters
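data (pandas.DataFrame) – New feature matrix. If target_column is a string it has to be a column in the data.
target_column (str or numpy.ndarray) – Name of target column or target values as numpy array.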
bootstraps_p_value (int, optional) – Number of bootstraps to compute the p-value of feature importance, defaults to 100
- plot_global_feature_importance(save=None)#
Plot global feature importance based on the p-values of each feature. The p-values are computed using an ANOVA (for continuous features) or a chi-square (for categorical features) test. The feature importance is defined as 1 - p_value.
- Parameters
save (str, optional) – Filename to save plot, if None the figure is not saved, defaults to None
- plot_local_feature_importance(thr_pvalue=1, num_cols=4, save=None)#
Plot local feature importance to show the importance of each feature for each cluster, measured by variance and impurity of the feature within the cluster, i.e. the higher the feature importance, the lower the feature variance / impurity within the cluster.
- Parameters
thr_pvalue (float, optional) – P-value threshold for feature filtering, defaults to 1
save (str, optional) – Filename to save plot, if None the figure is not saved, defaults to None
num_cols (int, optional) – Number of plots in one row, defaults to 4.
- plot_decision_paths(distributions=True, heatmap=True, thr_pvalue=1, num_cols=6, save=None)#
Plot decision paths of the Random Forest model. If distributions = True, feature distributions per cluster are plotted as boxplots (for continuous features) or barplots (for categorical features). If heatmap = True, feature values are plotted in a heatmap sorted by clusters. For both plots, features are filtered and ranked by p-values of a statistical test (ANOVA for continuous features, chi-square for categorical features).
- Parameters
distributions (boolean, optional) – Plot feature distributions, defaults to True
heatmap (boolean, optional) – Plot feature heatmap, defaults to True
thr_pvalue (float, optional) – P-value threshold for feature filtering, defaults to 1
save (str, optional) – Filename to save plot, if None the figure is not saved, defaults to None
num_cols (int, optional) – Number of plots in one row for the distributions plot, defaults to 6.
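A minimal usage sketch of the class documented above, assuming scikit-learn and fgclustering are installed. The Iris dataset, the forest hyperparameters, the chosen thresholds, and the wrapper name run_fgc_example are illustrative assumptions and not part of the package; the calls themselves use only the signatures listed in this section.

```python
def run_fgc_example():
    """Train a Random Forest on Iris and explain it with FgClustering."""
    # Imports live inside the function so that defining it has no side effects.
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from fgclustering.forest_guided_clustering import FgClustering

    # DataFrame holding the four features plus a "target" column.
    data = load_iris(as_frame=True).frame
    features = data.drop(columns="target")

    # FgClustering expects an already trained Random Forest model.
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(features, data["target"])

    # target_column is given as a string, so it must be a column of `data`.
    fgc = FgClustering(model=model, data=data, target_column="target", random_state=42)

    # number_of_clusters is left at None, so the number of clusters is optimized.
    fgc.run(max_K=6, method_clustering="pam", n_jobs=1, verbose=1)

    fgc.plot_global_feature_importance()
    fgc.plot_local_feature_importance(thr_pvalue=0.05)
    fgc.plot_decision_paths(distributions=True, heatmap=True, thr_pvalue=0.05)
    return fgc
```

Calling run_fgc_example() runs the full pipeline; for large datasets the run documentation above suggests method_clustering='fasterpam'.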
fgclustering.optimizer module#
- fgclustering.optimizer.optimizeK(distance_matrix, y, model_type, max_K, method_clustering, init_clustering, max_iter_clustering, discart_value_JI, bootstraps_JI, random_state, n_jobs, verbose)#
Compute the optimal number of clusters for k-medoids clustering (trade-off between cluster purity and cluster stability).
- Parameters
distance_matrix (pandas.DataFrame) – Distance matrix computed from the proximity matrix of the Random Forest model.
y (pandas.Series) – Target column.
model_type (str) – Model type of Random Forest model: classifier or regression.
max_K (int) – Maximum number of clusters for cluster score computation.
method_clustering ({'alternate', 'pam'}) – Which algorithm to use. ‘alternate’ is faster while ‘pam’ is more accurate.
init_clustering ({'random', 'heuristic', 'k-medoids++', 'build'}) – Specify medoid initialization method. To speed up computation for large datasets use ‘random’. See sklearn documentation for parameter description.
max_iter_clustering (int) – Number of iterations for k-medoids clustering.
discart_value_JI (float) – Minimum Jaccard Index for cluster stability.
bootstraps_JI (int) – Number of bootstraps to compute the Jaccard Index.
random_state (int) – Seed number for random state.
n_jobs (int, optional) – number of jobs to run in parallel when computing the cluster stability. n_jobs=1 means no parallel computing is used, defaults to 1
verbose ({0,1}, optional) – print the output of fgc cluster optimization process (the Jaccard index and score for each cluster number); defaults to 1 (printing). Set to 0 for no outputs.
- Returns
Optimal number of clusters.
- Return type
int
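The stability criterion can be illustrated with a small, dependency-free sketch. It assumes that each original cluster is compared to its best-matching cluster in every bootstrapped clustering and that the resulting Jaccard indices are averaged; this matching scheme and the helper names are assumptions for illustration, not the package's confirmed implementation.

```python
def jaccard_index(cluster_a, cluster_b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two collections of sample indices."""
    a, b = set(cluster_a), set(cluster_b)
    union = a | b
    if not union:
        return 1.0  # two empty clusters are treated as identical
    return len(a & b) / len(union)


def mean_jaccard_per_cluster(original, bootstrapped_clusterings):
    """Average best-match Jaccard index of each original cluster.

    original: dict mapping cluster id -> set of sample indices
    bootstrapped_clusterings: list of dicts with the same layout
    """
    scores = {}
    for cluster_id, members in original.items():
        best_matches = [
            max(jaccard_index(members, candidate) for candidate in clustering.values())
            for clustering in bootstrapped_clusterings
        ]
        scores[cluster_id] = sum(best_matches) / len(best_matches)
    return scores


def stable_cluster_ids(scores, discart_value_JI=0.6):
    """Ids of clusters whose mean Jaccard index exceeds the threshold."""
    return [cid for cid, score in scores.items() if score > discart_value_JI]


original = {0: {0, 1, 2, 3}, 1: {4, 5, 6, 7}}
bootstraps = [
    {0: {0, 1, 2, 3}, 1: {4, 5, 6, 7}},
    {0: {0, 1, 2}, 1: {3, 4, 5, 6, 7}},
]
scores = mean_jaccard_per_cluster(original, bootstraps)
```

Under this assumption, a candidate number of clusters would only be scored if all of its clusters pass the discart_value_JI threshold.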
fgclustering.plotting module#
fgclustering.statistics module#
- fgclustering.statistics.compute_balanced_average_impurity(categorical_values, cluster_labels, rescaling_factor=None)#
Compute balanced average impurity as score for categorical values in a clustering. The impurity score is a Gini coefficient of the classes within each cluster. The class sizes are balanced by rescaling with the inverse size of the class in the overall dataset.
- Parameters
categorical_values (pandas.Series) – Values of categorical feature / target.
cluster_labels (numpy.ndarray) – Cluster labels for each value.
rescaling_factor (dict) – Dictionary with rescaling factor for each class / unique feature value. If parameter is set to None, the rescaling factor will be computed from the input data categorical_values, defaults to None
- Returns
Impurity score.
- Return type
float
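The description above can be sketched in plain Python. The sketch assumes the per-cluster Gini impurity 1 - sum(p_k^2) on rescaled class counts and an unweighted mean over clusters; how the package aggregates the per-cluster scores is not stated here, so the averaging step is an assumption.

```python
from collections import Counter


def balanced_average_impurity(categorical_values, cluster_labels, rescaling_factor=None):
    """Mean Gini impurity per cluster, with class counts rescaled by 1 / class size."""
    if rescaling_factor is None:
        # Inverse size of each class in the overall dataset.
        class_sizes = Counter(categorical_values)
        rescaling_factor = {cls: 1.0 / n for cls, n in class_sizes.items()}

    impurities = []
    for cluster in sorted(set(cluster_labels)):
        members = [v for v, c in zip(categorical_values, cluster_labels) if c == cluster]
        # Rescaled counts so that large classes do not dominate the score.
        weighted = {cls: n * rescaling_factor[cls] for cls, n in Counter(members).items()}
        total = sum(weighted.values())
        gini = 1.0 - sum((w / total) ** 2 for w in weighted.values())
        impurities.append(gini)

    # Assumption: unweighted mean over clusters.
    return sum(impurities) / len(impurities)


pure = balanced_average_impurity(["a", "a", "b", "b"], [0, 0, 1, 1])
mixed = balanced_average_impurity(["a", "a", "b", "b"], [0, 1, 0, 1])
imbalanced = balanced_average_impurity(["a", "a", "a", "b"], [0, 0, 0, 0])
```

In the imbalanced case the single cluster holds three "a" and one "b", yet after rescaling both classes contribute equally, so the impurity matches that of a perfectly mixed cluster.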
- fgclustering.statistics.compute_total_within_cluster_variation(continuous_values, cluster_labels)#
Compute total within cluster variation as score for continuous values in a clustering.
- Parameters
continuous_values (pandas.Series) – Values of continuous feature / target.
cluster_labels (numpy.ndarray) – Cluster labels for each value.
- Returns
Within cluster variation score.
- Return type
float
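A minimal sketch of such a score, assuming it is the sum of squared deviations from each cluster mean (the usual k-means style definition). The exact formula used by the package, for example whether variances are normalized by cluster size, is not given here, so treat this as one plausible reading.

```python
from collections import defaultdict


def total_within_cluster_variation(continuous_values, cluster_labels):
    """Sum over clusters of squared deviations from the cluster mean."""
    groups = defaultdict(list)
    for value, label in zip(continuous_values, cluster_labels):
        groups[label].append(value)

    total = 0.0
    for members in groups.values():
        mean = sum(members) / len(members)
        total += sum((v - mean) ** 2 for v in members)
    return total


values = [1, 3, 10, 14]
good_split = total_within_cluster_variation(values, [0, 0, 1, 1])
no_split = total_within_cluster_variation(values, [0, 0, 0, 0])
```

A clustering that groups similar values together yields a lower score than one that mixes them.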
- fgclustering.statistics.calculate_global_feature_importance(X, y, cluster_labels, model_type)#
Calculate global feature importance for each feature. The higher the importance for a feature, the lower the p-value obtained by an ANOVA (continuous feature) or chi-square (categorical feature) test. Returned as p-value, hence importance is 1-p-value.
- Parameters
X (pandas.DataFrame) – Feature matrix.
y (pandas.Series) – Target column.
cluster_labels (numpy.ndarray) – Clustering labels.
model_type (str) – Model type of Random Forest model: classifier or regression.
- Returns
DataFrame including features, target, and cluster numbers, ranked by the p-value of the statistical test, and a dictionary with the computed p-values of all features.
- Return type
pandas.DataFrame and dict
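The relation between p-values, ranking, and importance can be sketched as follows. The feature names and p-values are made up for illustration, and the helper rank_features_by_p_value, including its inclusive threshold comparison, is hypothetical rather than part of the package.

```python
def rank_features_by_p_value(p_values, thr_pvalue=1.0):
    """Return (feature, importance) pairs, most important first.

    Features with a p-value above thr_pvalue are dropped; importance is 1 - p-value.
    """
    kept = {feature: p for feature, p in p_values.items() if p <= thr_pvalue}
    ranked = sorted(kept, key=kept.get)  # ascending p-value
    return [(feature, 1.0 - kept[feature]) for feature in ranked]


# Hypothetical p-values, e.g. from an ANOVA or chi-square test per feature.
p_values = {"age": 0.2, "income": 0.001, "region": 0.04}

all_features = rank_features_by_p_value(p_values)
significant = rank_features_by_p_value(p_values, thr_pvalue=0.05)
```

With thr_pvalue left at 1 every feature is kept, which mirrors the default of the plotting functions.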
- fgclustering.statistics.calculate_local_feature_importance(data_clustering_ranked, bootstraps_p_value)#
Calculate local importance of each feature within each cluster. The higher the importance for a feature, the lower the variance (continuous feature) or impurity (categorical feature) of that feature within the cluster. Returned as p-value, hence importance is 1-p-value.
- Parameters
data_clustering_ranked (pandas.DataFrame) – Filtered and ranked data frame including features, target and cluster numbers.
bootstraps_p_value (int) – Number of bootstraps to be drawn for computation of p-value.
- Returns
p-value matrix of all features per cluster.
- Return type
pandas.DataFrame
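One way to read the bootstrapped p-value for a continuous feature is sketched below. It assumes that the p-value is the fraction of random samples of cluster size, drawn with replacement from the whole dataset, whose variance is at most the variance observed inside the cluster. The sampling scheme and the function name are assumptions; the package may differ in detail.

```python
import random
import statistics


def bootstrap_variance_p_value(all_values, cluster_values, bootstraps_p_value=100, random_state=42):
    """Fraction of bootstrap samples that are at least as homogeneous as the cluster."""
    rng = random.Random(random_state)
    observed = statistics.pvariance(cluster_values)

    hits = 0
    for _ in range(bootstraps_p_value):
        # Draw a random "cluster" of the same size from the full dataset.
        sample = rng.choices(all_values, k=len(cluster_values))
        if statistics.pvariance(sample) <= observed:
            hits += 1
    return hits / bootstraps_p_value


all_values = list(range(100))
tight_cluster = [50, 51, 52, 53, 54]

p_tight = bootstrap_variance_p_value(all_values, tight_cluster, bootstraps_p_value=200)
p_whole = bootstrap_variance_p_value(all_values, all_values, bootstraps_p_value=200)
```

A cluster whose values are far more homogeneous than a random draw gets a small p-value and therefore a high local importance (1 - p-value); a "cluster" that is the whole dataset does not.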
fgclustering.utils module#
- fgclustering.utils.scale_standard(X)#
Feature Scaling with StandardScaler.
- Parameters
X (pandas.DataFrame) – Feature matrix.
- Returns
Standardized feature matrix.
- Return type
pandas.DataFrame
- fgclustering.utils.scale_minmax(X)#
Feature Scaling with MinMaxScaler.
- Parameters
X (pandas.DataFrame) – Feature matrix.
- Returns
Min-max scaled feature matrix.
- Return type
pandas.DataFrame
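For reference, the two transformations applied by the scalers can be written out for a single column in plain Python. This is a sketch of the underlying formulas, not the package's code, which delegates to scikit-learn's StandardScaler and MinMaxScaler; the function names are illustrative.

```python
import statistics


def standard_scale(column):
    """z = (x - mean) / std, using the population standard deviation."""
    mean = statistics.mean(column)
    std = statistics.pstdev(column) or 1.0  # constant columns are left unscaled
    return [(x - mean) / std for x in column]


def minmax_scale(column):
    """x' = (x - min) / (max - min), mapping the column onto [0, 1]."""
    lo, hi = min(column), max(column)
    span = (hi - lo) or 1.0  # avoid division by zero for constant columns
    return [(x - lo) / span for x in column]


z_scores = standard_scale([2, 4, 6])
unit_range = minmax_scale([2, 4, 6])
```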
- fgclustering.utils.proximityMatrix(model, X, normalize=True)#
Calculate proximity matrix of Random Forest model.
- Parameters
model (sklearn.ensemble) – Trained Random Forest model.
X (pandas.DataFrame) – Feature matrix.
normalize (bool, optional) – Normalize proximity matrix by number of trees in the Random Forest, defaults to True.
- Returns
Proximity matrix of Random Forest model.
- Return type
numpy.ndarray
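The usual definition of a Random Forest proximity matrix can be sketched without any dependencies: two samples are close if they end up in the same leaf in many trees. The sketch takes the leaf indices as input (in scikit-learn these come from model.apply(X)). The helper names and the conversion distance = 1 - proximity are assumptions for illustration, not taken from the package.

```python
def proximity_matrix(leaf_indices, normalize=True):
    """Count, for each pair of samples, the trees in which they share a leaf.

    leaf_indices: one row per sample, one leaf id per tree.
    """
    n_samples = len(leaf_indices)
    n_trees = len(leaf_indices[0])
    proximity = [[0.0] * n_samples for _ in range(n_samples)]
    for i in range(n_samples):
        for j in range(i, n_samples):
            shared = sum(a == b for a, b in zip(leaf_indices[i], leaf_indices[j]))
            value = shared / n_trees if normalize else float(shared)
            proximity[i][j] = proximity[j][i] = value  # matrix is symmetric
    return proximity


def distance_matrix(proximity):
    """Assumed conversion for a normalized proximity matrix: 1 - proximity."""
    return [[1.0 - p for p in row] for row in proximity]


# Three samples, two trees: samples 0 and 1 share a leaf in tree 0,
# samples 1 and 2 share a leaf in tree 1.
leaves = [[1, 2], [1, 3], [4, 3]]
prox = proximity_matrix(leaves)
dist = distance_matrix(prox)
```

The nested loop is quadratic in the number of samples, which is fine for a sketch; a vectorized implementation would be preferable for real datasets.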