bm.feature package

Feature selection

bm.feature.selector_fin module

bm.feature.selector_fin.drop_by_corr(dataframe, target, threshold=0.9, ref='iv')

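Drop features according to the specified correlation threshold.

Parameters

dataframe (pd.DataFrame) – the data table
target (pd.Series) – the target column
threshold (float) – correlation threshold
ref – reference metric used to decide which of the strongly correlated features is dropped

Returns

The data table after correlation-based feature removal

Return type

pd.DataFrame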
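
The role of ref can be illustrated with a minimal sketch. It is not the bm implementation: it assumes the reference metric (for example IV) has already been computed per feature and is passed in as a hypothetical ref_scores Series, and it assumes that the feature with the lower score in a strongly correlated pair is the one dropped.

```python
import pandas as pd


def drop_by_corr_sketch(dataframe: pd.DataFrame, ref_scores: pd.Series,
                        threshold: float = 0.9) -> pd.DataFrame:
    """Illustrative only: for each pair with |corr| > threshold, drop the
    feature whose reference score (hypothetical `ref_scores`) is lower."""
    corr = dataframe.corr().abs()
    cols = list(corr.columns)
    dropped = set()
    for i, left in enumerate(cols):
        for right in cols[i + 1:]:
            # Skip pairs in which one feature has already been dropped.
            if left in dropped or right in dropped:
                continue
            if corr.loc[left, right] > threshold:
                weaker = left if ref_scores[left] < ref_scores[right] else right
                dropped.add(weaker)
    return dataframe.drop(columns=sorted(dropped))


df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "x2": [2, 4, 6, 8, 11],   # almost a linear function of x
    "z": [5, 1, 4, 2, 3],
})
ref_scores = pd.Series({"x": 0.30, "x2": 0.10, "z": 0.05})  # made-up scores
result = drop_by_corr_sketch(df, ref_scores, threshold=0.9)
```

Here x and x2 are strongly correlated, and x2 is removed because its made-up reference score is lower.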

bm.feature.selector_fin.drop_by_cv(dataframe, threshold=0.5)

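Drop features according to the specified coefficient-of-variation threshold.

Parameters

dataframe (pd.DataFrame) – the data table to operate on
threshold (float) – coefficient-of-variation threshold

Returns

The data table after feature removal based on the coefficient of variation

Return type

pd.DataFrame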
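
As a rough sketch of the idea, the coefficient of variation of a column is its standard deviation divided by the absolute value of its mean. The code below assumes that features whose coefficient falls below the threshold are the ones dropped; the source does not state the direction, so treat that as an assumption.

```python
import pandas as pd


def drop_by_cv_sketch(dataframe: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    """Illustrative only: drop numeric columns whose coefficient of
    variation (std / |mean|) is below `threshold` (assumed direction)."""
    numeric = dataframe.select_dtypes(include="number")
    # A zero mean yields inf or NaN, which never compares below the threshold,
    # so such columns are kept by this sketch.
    cv = numeric.std() / numeric.mean().abs()
    to_drop = cv[cv < threshold].index
    return dataframe.drop(columns=to_drop)


df = pd.DataFrame({
    "stable": [10.0, 10.1, 9.9, 10.0],   # barely varies relative to its mean
    "spread": [1.0, 5.0, 10.0, 20.0],
})
result = drop_by_cv_sketch(df, threshold=0.5)
```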

bm.feature.selector_fin.drop_by_feature_importance(dataframe, target, threshold=0.95)

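Drop features according to the specified feature-importance ranking threshold.

Parameters

dataframe (pd.DataFrame) – the data table
target (pd.Series) – the target column
threshold (float) – threshold: the top xx% of the feature-importance ranking

Returns

The data table after feature removal based on the feature-importance ranking

Return type

pd.DataFrame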
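
One possible reading of the threshold, sketched under stated assumptions: the importances come from some previously fitted model and are passed in as a hypothetical importances Series, and the threshold is the fraction of features (by rank) that is kept. The source does not say which model is used or whether the percentage is by rank or by cumulative importance.

```python
import math

import pandas as pd


def drop_by_importance_rank_sketch(dataframe: pd.DataFrame, importances: pd.Series,
                                   threshold: float = 0.95) -> pd.DataFrame:
    """Illustrative only: keep the top `threshold` fraction of features,
    ranked by a precomputed (hypothetical) importance score."""
    ranked = importances.sort_values(ascending=False)
    n_keep = math.ceil(threshold * len(ranked))
    keep = set(ranked.index[:n_keep])
    # Preserve the original column order of the data table.
    return dataframe[[col for col in dataframe.columns if col in keep]]


df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6], "d": [7, 8]})
importances = pd.Series({"a": 0.5, "b": 0.1, "c": 0.3, "d": 0.1})  # made-up values
result = drop_by_importance_rank_sketch(df, importances, threshold=0.5)
```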

bm.feature.selector_fin.drop_by_iv(dataframe, target, threshold=0.02)

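Drop features according to the specified IV (information value) threshold.

Parameters

dataframe (pd.DataFrame) – the data table
target (pd.Series) – the target column
threshold (float) – IV threshold, default 0.02

Returns

The data table after IV-based feature removal

Return type

pd.DataFrame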
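
For orientation, IV is commonly computed per feature as the sum over bins of (good% - bad%) * ln(good% / bad%). The sketch below is not the bm implementation: it assumes a binary 0/1 target, equal-frequency binning into a hypothetical number of bins, a small eps to avoid log(0), and that features with IV below the threshold are dropped.

```python
import numpy as np
import pandas as pd


def information_value(feature: pd.Series, target: pd.Series,
                      bins: int = 5, eps: float = 1e-6) -> float:
    """IV of one numeric feature against a binary 0/1 target (sketch)."""
    binned = pd.qcut(feature, q=bins, duplicates="drop")
    counts = pd.crosstab(binned, target)
    # Share of all 0-labelled and 1-labelled rows that fall into each bin.
    good = (counts[0] / counts[0].sum()).clip(lower=eps)
    bad = (counts[1] / counts[1].sum()).clip(lower=eps)
    woe = np.log(good / bad)
    return float(((good - bad) * woe).sum())


def drop_by_iv_sketch(dataframe: pd.DataFrame, target: pd.Series,
                      threshold: float = 0.02) -> pd.DataFrame:
    """Illustrative only: drop features whose IV is below `threshold`."""
    iv = pd.Series({col: information_value(dataframe[col], target)
                    for col in dataframe.columns})
    return dataframe.drop(columns=iv[iv < threshold].index)


target = pd.Series([0] * 10 + [1] * 10)
df = pd.DataFrame({
    "signal": list(range(20)),      # separates the two classes
    "noise": list(range(10)) * 2,   # identical distribution in both classes
})
result = drop_by_iv_sketch(df, target, threshold=0.02)
```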

bm.feature.selector_fin.drop_by_missing_rate(dataframe, threshold=0.9)

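Drop features according to the specified missing-rate threshold.

Parameters

dataframe (pd.DataFrame) – the data table to operate on
threshold (float) – missing-rate threshold

Returns

The data table after feature removal based on the missing rate

Return type

pd.DataFrame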
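
A minimal sketch of the idea in plain pandas, assuming that a column is dropped when its fraction of missing values is strictly greater than the threshold (the source does not say whether the comparison is strict):

```python
import numpy as np
import pandas as pd


def drop_by_missing_rate_sketch(dataframe: pd.DataFrame,
                                threshold: float = 0.9) -> pd.DataFrame:
    """Illustrative only: drop columns whose missing rate exceeds `threshold`."""
    # isna().mean() is the fraction of missing values per column.
    missing_rate = dataframe.isna().mean()
    to_drop = missing_rate[missing_rate > threshold].index
    return dataframe.drop(columns=to_drop)


df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, np.nan],   # entirely missing
    "c": [1.0, np.nan, 3.0, 4.0],            # 25% missing
})
result = drop_by_missing_rate_sketch(df, threshold=0.9)
```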

bm.feature.selector_fin.drop_by_psi(expected_df, actual_df, method='quantile', threshold=0.1)

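Drop features according to the specified PSI (population stability index) threshold.

Parameters

expected_df (pd.DataFrame) – the baseline data table
actual_df (pd.DataFrame) – the comparison data table
method (str) – binning method, equal-frequency by default
threshold (float) – PSI threshold

Returns

The data table after PSI-based feature removal

Return type

pd.DataFrame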
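
PSI is usually defined per feature as the sum over bins of (actual% - expected%) * ln(actual% / expected%). The following sketch rests on several assumptions that the source does not confirm: bin edges are taken from quantiles of the baseline sample, a small eps guards against empty bins, features with PSI above the threshold are dropped, and the baseline table is the one returned.

```python
import numpy as np
import pandas as pd


def psi(expected: pd.Series, actual: pd.Series,
        bins: int = 10, eps: float = 1e-6) -> float:
    """PSI of one numeric feature with equal-frequency bins (sketch)."""
    expected = expected.dropna()
    actual = actual.dropna()
    edges = np.unique(np.quantile(expected, np.linspace(0, 1, bins + 1)))
    if len(edges) < 2:
        return 0.0  # a constant baseline cannot be binned; this sketch reports 0
    # Open the outer bins so that out-of-range values in `actual` are counted.
    edges[0], edges[-1] = -np.inf, np.inf
    expected_pct = pd.cut(expected, bins=edges).value_counts(normalize=True).sort_index()
    actual_pct = pd.cut(actual, bins=edges).value_counts(normalize=True).sort_index()
    e = expected_pct.clip(lower=eps).to_numpy()
    a = actual_pct.clip(lower=eps).to_numpy()
    return float(np.sum((a - e) * np.log(a / e)))


def drop_by_psi_sketch(expected_df: pd.DataFrame, actual_df: pd.DataFrame,
                       threshold: float = 0.1) -> pd.DataFrame:
    """Illustrative only: drop features whose PSI exceeds `threshold`."""
    scores = pd.Series({col: psi(expected_df[col], actual_df[col])
                        for col in expected_df.columns})
    return expected_df.drop(columns=scores[scores > threshold].index)


expected_df = pd.DataFrame({"stable": range(100), "shifted": range(100)})
actual_df = pd.DataFrame({"stable": range(100), "shifted": range(50, 150)})
result = drop_by_psi_sketch(expected_df, actual_df, threshold=0.1)
```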

bm.feature.selector_fin.drop_by_stepwise(dataframe, target, direction='forward', threshold=0.95)

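Drop features according to the specified stepwise-regression confidence threshold.

Parameters

dataframe (pd.DataFrame) – the data table
target (pd.Series) – the target column
direction (str) – direction – forward: forward selection – backward: backward elimination
threshold (float) – confidence threshold

Returns

The data table after feature removal based on stepwise-regression confidence

Return type

pd.DataFrame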

bm.feature.selector_fin.drop_by_vif(dataframe, threshold=10)

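Drop features according to the specified multicollinearity threshold (VIF, variance inflation factor, going by the function name).

Parameters

dataframe (pd.DataFrame) – the data table
threshold (float) – multicollinearity threshold

Returns

The data table after multicollinearity-based feature removal

Return type

pd.DataFrame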
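
The variance inflation factor of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the others. The sketch below uses ordinary least squares from NumPy and assumes an iterative scheme that removes the feature with the highest VIF until every remaining VIF is at or below the threshold; whether bm removes features one at a time is not stated in the source.

```python
import numpy as np
import pandas as pd


def vif(dataframe: pd.DataFrame) -> pd.Series:
    """VIF of every column, from an OLS fit on the remaining columns (sketch)."""
    values = {}
    for col in dataframe.columns:
        y = dataframe[col].to_numpy(dtype=float)
        others = dataframe.drop(columns=col).to_numpy(dtype=float)
        X = np.column_stack([np.ones(len(y)), others])  # add an intercept
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        residual = y - X @ coef
        r2 = 1.0 - float(residual @ residual) / float(((y - y.mean()) ** 2).sum())
        # A (numerically) perfect fit means perfect collinearity.
        values[col] = np.inf if 1.0 - r2 < 1e-12 else 1.0 / (1.0 - r2)
    return pd.Series(values)


def drop_by_vif_sketch(dataframe: pd.DataFrame, threshold: float = 10) -> pd.DataFrame:
    """Illustrative only: repeatedly drop the feature with the largest VIF."""
    remaining = dataframe.copy()
    while remaining.shape[1] > 1:
        scores = vif(remaining)
        if scores.max() <= threshold:
            break
        remaining = remaining.drop(columns=scores.idxmax())
    return remaining


a = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
b = np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0])
df = pd.DataFrame({"a": a, "b": b, "c": 2 * a + 0.1 * b})  # c is a linear mix of a and b
result = drop_by_vif_sketch(df, threshold=10)
```

Removing a single feature is enough here, because the two that remain are only weakly correlated.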

bm.feature.selector_fin.drop_columns(dataframe, drop_cols, indicator, threshold)

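Drop the specified columns from the data table.

Parameters

dataframe (pd.DataFrame) – the data table to operate on
drop_cols (pd.Series) – the columns to drop
indicator (str) – the metric on which the column removal is based
threshold (float) – the threshold of that metric

Returns

The data table after the specified columns are dropped

Return type

pd.DataFrame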

bm.feature.selector_market module

class bm.feature.selector_market.FeatureSelector(data, target, base_features, category_features)

Bases: object

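Automated feature screening

Variables

labels – Series: labels of the dataset
data – DataFrame: the dataset
na_threshold – float: missing-rate threshold
correlation_threshold – float: correlation-coefficient threshold
importance_threshold – float: feature-importance threshold
categorical_feature – list: categorical features
one_hot_features – list: features that require one-hot encoding
useful_feats – list: features used in the model
drop_feats – list: features that were filtered out
record_na – DataFrame: missing-rate results
record_single_value – DataFrame: single-value results
record_collinear – DataFrame: correlation-coefficient results
record_feats_importance – DataFrame: feature-importance results
record_cv_result – DataFrame: evaluation results of model cross-validation
record_increase_feats – DataFrame: features chosen by the CV-based automatic feature selection
record_increase_result – DataFrame: cross-validation result at each step of the CV-based automatic feature selection
corr_matrix – DataFrame: correlation-coefficient matrix computed with the Pearson method
ops – dict: summary of the computed results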

filter_results()

Return the screening results.

Returns: None

identify_all(config=None)

Screen the features.

Parameters: config (dict) – parameters read from the configuration file
Returns: None

identify_collinear(correlation_threshold=1)

Screen out highly correlated (collinear) features.

Parameters: correlation_threshold (float) – the correlation threshold
Returns: None
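
A sketch of how collinear pairs might be identified from a Pearson correlation matrix, in the spirit of corr_matrix and record_collinear. It is a standalone function rather than the class method, the output column names are hypothetical, and it assumes that the later feature of each pair is the one marked for removal.

```python
import numpy as np
import pandas as pd


def identify_collinear_sketch(data: pd.DataFrame,
                              correlation_threshold: float) -> pd.DataFrame:
    """Illustrative only: list feature pairs with |Pearson r| above the threshold."""
    corr_matrix = data.corr(method="pearson")
    # Keep the strict upper triangle so that each pair appears exactly once.
    mask = np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1)
    upper = corr_matrix.where(mask)
    pairs = (upper.stack()
                  .rename("corr_value")
                  .rename_axis(["corr_feature", "drop_feature"])
                  .reset_index())
    # NaN entries (masked cells) never pass the comparison.
    return pairs[pairs["corr_value"].abs() > correlation_threshold].reset_index(drop=True)


data = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "x2": [2, 4, 6, 8, 11],   # almost a linear function of x
    "z": [5, 1, 4, 2, 3],
})
record = identify_collinear_sketch(data, correlation_threshold=0.9)
```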

identify_cv_importance(params, kfold, groups=None, n_splits=5)

Compute feature importance through cross-validation.

Parameters

params (dict) – the parameters of the estimator
kfold (str) – the cross-validation scheme used to split the data
groups (ndarray or Series) –
n_splits (int) –

Returns

None

identify_increase_cv_feats(total_iter, step, auc_interval, incre_params=None)

Automatically select features, moving in the direction that improves the validation-set AUC.

Parameters

total_iter (int) – the maximum number of training rounds
step (int) – the number of features added per round
auc_interval (int) – the minimum AUC increase
incre_params (dict) – the parameters of the incremental model

Returns

None

identify_na(na_threshold=1)

Screen out features by missing rate.

Parameters: na_threshold (float) – the threshold of missing rate
Returns: None

identify_single_value()

Screen out features that have a single unique value.

Returns: None
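
As an illustration of the single-value check, the standalone sketch below counts distinct non-missing values per column. The output layout is hypothetical and only loosely modelled on record_single_value; whether missing values count as a distinct value in bm is not stated, so ignoring them is an assumption.

```python
import numpy as np
import pandas as pd


def identify_single_value_sketch(data: pd.DataFrame) -> pd.DataFrame:
    """Illustrative only: list columns with exactly one distinct non-missing value."""
    unique_counts = data.nunique(dropna=True)
    return (unique_counts[unique_counts == 1]
            .rename("nunique")
            .rename_axis("feature")
            .reset_index())


data = pd.DataFrame({
    "constant": [7, 7, 7],
    "constant_with_nan": [1.0, np.nan, 1.0],
    "varied": [1, 2, 3],
})
record = identify_single_value_sketch(data)
```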

plot_feature_importance(n=50, figsize=(20, 12))

Plot feature importance.

Parameters

n (int) – the number of top features to plot
figsize (tuple) – the figure size of the plot

Returns

None

result_save(output=None)

Save the results to a file.

Parameters: output (string) – the path where the results are saved
Returns: None