bm.feature.metrics package

Feature selection: filter methods

bm.feature.metrics.cal_iv_psi module

bm.feature.metrics.cal_iv_psi.cal_total_var_iv(data, numeric_feats, category_feats, target, max_interval=10, numeric_method='chi', category_method='default', BRM=False, special_dict=None, tree_params=None)

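Compute the IV of every feature.

Parameters

data (pd.DataFrame) – the dataset
numeric_feats (list or ndarray) – numeric features in the dataset
category_feats (list or ndarray) – categorical features in the dataset
target (str) – name of the target variable
max_interval (int) – maximum number of intervals (bins) per variable
numeric_method (str) – binning method for numeric features; 'chi' by default, per the signature
category_method (str) – binning method for categorical features; 'default' by default, per the signature
BRM (bool) – whether to merge bins so that they are monotonic
special_dict (dict) – special values to treat separately when binning
tree_params (dict) – decision-tree parameters; only used when the binning method is 'tree'

Returns

IV details for all variables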

Return type

pd.DataFrame

bm.feature.metrics.cal_iv_psi.category_var_binning(data, var, target, bins, delta=1e-06)

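IV details for a categorical feature.

Parameters

data (pd.DataFrame) – the dataset
var (str) – name of the variable in the DataFrame
target (str) – name of the target variable in the DataFrame
bins (dict) – bin dictionary for the variable
delta (float) – small offset that keeps the divisor away from zero

Returns

pd.DataFrame: per-bin details
float: IV value
list: bins of the variable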
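
The per-bin detail and IV described above can be sketched in plain Python. This is an illustration, not the library's implementation: the helper name `category_iv_sketch`, the WOE sign convention (log of the positive share over the negative share), and the exact places where `delta` is added are all assumptions.

```python
import math
from collections import defaultdict


def category_iv_sketch(values, targets, bins, delta=1e-6):
    """Hypothetical helper: per-bin WOE and IV for a categorical variable.

    values  -- iterable of category labels
    targets -- iterable of 0/1 labels (1 = positive class)
    bins    -- dict mapping each category to a bin name
    delta   -- small offset that keeps ratios and logs finite
    """
    pos = defaultdict(int)
    neg = defaultdict(int)
    for v, t in zip(values, targets):
        if t == 1:
            pos[bins[v]] += 1
        else:
            neg[bins[v]] += 1
    total_pos = sum(pos.values())
    total_neg = sum(neg.values())

    detail = []
    iv = 0.0
    for b in sorted(set(bins.values())):
        pos_rate = pos[b] / (total_pos + delta)
        neg_rate = neg[b] / (total_neg + delta)
        woe = math.log((pos_rate + delta) / (neg_rate + delta))
        bin_iv = (pos_rate - neg_rate) * woe
        detail.append({"bin": b, "pos": pos[b], "neg": neg[b],
                       "woe": woe, "iv": bin_iv})
        iv += bin_iv
    return detail, iv


values = ["a", "a", "b", "b", "c", "c", "c", "c"]
targets = [1, 0, 1, 1, 0, 0, 0, 1]
bins = {"a": "low", "b": "high", "c": "low"}  # categories a and c share a bin
detail, iv = category_iv_sketch(values, targets, bins)
```

In this toy data the "high" bin has no negative samples, which is the situation `delta` guards against: without the offset the log ratio would be undefined.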

bm.feature.metrics.cal_iv_psi.category_var_bins_merge(data, var, target, max_interval=10, method='default', special_attributes=None, tree_params=None)

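Bin a categorical feature.

Parameters

data (pd.DataFrame) – the dataset
var (str) – name of the variable in the DataFrame
target (str) – name of the target variable in the DataFrame
max_interval (int) – maximum number of intervals (bins) for the variable
method (str) – binning method
special_attributes (list or tuple) – special values to treat separately
tree_params (dict) – decision-tree parameters; only used when the method is tree-based

Returns

bin dictionary for the variable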

Return type

dict

bm.feature.metrics.cal_iv_psi.category_var_cal_iv(data, var, target, max_interval=10, method='default', special_attributes=None, tree_params=None, delta=1e-06)

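Compute the IV of a categorical feature.

Parameters

data (pd.DataFrame) – the dataset
var (str) – name of the variable in the DataFrame
target (str) – name of the target variable in the DataFrame
max_interval (int) – maximum number of intervals (bins) for the variable
method (str) – binning method
special_attributes (list or tuple) – special values to treat separately
tree_params (dict) – decision-tree parameters
delta (float) – small offset that keeps the divisor away from zero

Returns

pd.DataFrame: per-bin details
float: IV value
dict: bins of the variable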

bm.feature.metrics.cal_iv_psi.category_var_woe_transform(data, var, bins_df, bins)

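WOE transform for a categorical variable.

Parameters

data (pd.DataFrame) – the dataset
var (str) – name of the variable in the DataFrame
bins_df (pd.DataFrame) – per-bin details
bins (dict) – bin dictionary for the variable

Returns

the WOE array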

Return type

numpy.array

bm.feature.metrics.cal_iv_psi.numeric_var_binning(data, var, target, bins, delta=1e-06)

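Compute the IV of a continuous variable.

Parameters

data (pd.DataFrame) – the dataset
var (str) – name of the variable in the DataFrame
target (str) – name of the target variable in the DataFrame
bins (list) – bins of the variable
delta (float) – small offset that keeps the divisor away from zero

Returns

pandas.DataFrame: per-bin details
float: IV value
list: bins of the variable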

bm.feature.metrics.cal_iv_psi.numeric_var_cal_iv(data, var, target, max_interval=10, method='tree', BRM=False, special_attributes=None, tree_params=None, delta=1e-06)

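Bin a numeric variable and compute its IV.

Parameters

data (pd.DataFrame) – the original dataset
var (str) – name of the variable in the DataFrame
target (str) – name of the target variable in the DataFrame
max_interval (int) – maximum number of intervals (bins)
method (string) – method used to choose the cut points
BRM (bool) – whether to merge bins so that they are monotonic
special_attributes (list or tuple) – special values in the DataFrame to treat separately
tree_params (dict) – decision-tree parameters
delta (float) – small offset that keeps the divisor away from zero

Returns

pd.DataFrame: per-bin details
float: IV value
list: bins of the variable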

bm.feature.metrics.cal_iv_psi.numeric_var_cal_psi(expected_array, actual_array, bins=10, bucket_type='bins', detail=False, log=True)

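Compute the PSI of a numeric variable.

Parameters

expected_array (numpy.array) – numpy array of the original (baseline) values
actual_array (numpy.array) – numpy array of the new values, same size as expected_array
bins (float) – number of percentile ranges to bucket the values into
bucket_type (string) – whether to bucket by bins or by quantiles
detail (bool) – whether to return the detail table
log (bool) – whether to print log messages

Returns

PSI value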

Return type

float

bm.feature.metrics.cal_iv_psi.numeric_var_woe_transform(data, var, bins_df, bins)

WOE transform for a numeric variable.

Parameters

data (pd.DataFrame) – the dataset
var (str) – name of the variable in the DataFrame
bins_df (pd.DataFrame) – per-bin details
bins (list) – bins of the variable

Returns

the WOE array

Return type

numpy.array
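
As a rough illustration of a WOE transform, each raw value can be replaced by the WOE of the bin it falls into. The sketch below is not the library's code: the helper name, the representation of `bins` as a sorted list of interior cut points, the left-closed intervals, and the plain list of WOE values standing in for `bins_df` are all assumptions.

```python
from bisect import bisect_right


def numeric_woe_transform_sketch(values, cut_points, bin_woe):
    """Hypothetical helper: map each value to the WOE of its bin.

    cut_points -- sorted interior cut points, e.g. [10, 20] gives 3 bins:
                  (-inf, 10), [10, 20), [20, +inf)
    bin_woe    -- WOE per bin, in bin order (len(cut_points) + 1 entries)
    """
    # bisect_right counts the cut points at or below the value,
    # which is exactly the index of the left-closed bin.
    return [bin_woe[bisect_right(cut_points, v)] for v in values]


woe_values = numeric_woe_transform_sketch(
    [5, 10, 15, 25], cut_points=[10, 20], bin_woe=[-0.4, 0.1, 0.7]
)
```

A value equal to a cut point lands in the bin to its right here; a library that uses right-closed intervals would need `bisect_left` instead.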

bm.feature.metrics.coefficient_of_variance module

bm.feature.metrics.coefficient_of_variance.calc_cv(dataframe)

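Compute the coefficient of variation (CV).

Parameters

dataframe (pandas.DataFrame) – the data table to compute on

Returns

pandas.DataFrame: table of CV values with two columns (feature name, feature CV)
str: name of the column that holds the CV values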
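
The coefficient of variation is the standard deviation divided by the mean. A minimal stdlib sketch follows; the helper name, the use of the sample standard deviation, the absolute value of the mean, and the handling of a zero mean are assumptions rather than documented behaviour of `calc_cv`.

```python
import statistics


def calc_cv_sketch(columns):
    """Hypothetical helper: coefficient of variation (std / |mean|) per feature.

    columns -- dict mapping feature name to a list of numbers
    Returns a list of (feature, cv) pairs; cv is None when the mean is zero.
    """
    rows = []
    for name, vals in columns.items():
        mean = statistics.fmean(vals)
        std = statistics.stdev(vals)  # sample standard deviation (n - 1)
        rows.append((name, std / abs(mean) if mean != 0 else None))
    return rows


cv_rows = calc_cv_sketch({"age": [20, 30, 40], "flag": [1, 1, 1]})
```

A constant column such as `flag` gets a CV of zero, which is why a low CV is a common filter criterion for uninformative features.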

bm.feature.metrics.correlation module

bm.feature.metrics.correlation.calc_corr(dataframe)

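Compute Pearson correlation coefficients.

Parameters

dataframe (pandas.DataFrame) – the data table to compute correlations on
threshold (float) – correlation threshold (listed in the docstring, although the signature above takes only dataframe)

Returns

correlation of each feature pair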

Return type

pd.DataFrame

bm.feature.metrics.correlation.filter_corr(dataframe, threshold)

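Locate the feature pairs whose Pearson correlation coefficient exceeds the given threshold.

Parameters

dataframe (pandas.DataFrame) – the data table to compute correlations on
threshold (float) – correlation threshold

Returns

indices of the correlated feature pairs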

Return type

numpy.array
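
The filtering step can be sketched without pandas as follows. The helper names are hypothetical, and comparing the absolute value of the coefficient against the threshold is an assumption; the documentation does not say whether `filter_corr` treats strong negative correlation as a hit.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def filter_corr_sketch(columns, threshold):
    """Hypothetical helper: feature pairs with |r| above the threshold."""
    pairs = []
    for (n1, c1), (n2, c2) in combinations(columns.items(), 2):
        r = pearson(c1, c2)
        if abs(r) > threshold:
            pairs.append((n1, n2, r))
    return pairs


cols = {"x": [1, 2, 3, 4], "y": [2, 4, 6, 8.5], "z": [4, 1, 3, 2]}
high = filter_corr_sketch(cols, 0.9)  # only the (x, y) pair qualifies
```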

bm.feature.metrics.iv module

bm.feature.metrics.iv.calc_iv(series, target)

Compute the IV value.

Parameters

series (numpy.array) – the data column to compute IV for
target (numpy.array) – positive/negative target value of each sample

Returns

float: IV value
pandas.Series: the bins and their IV values

bm.feature.metrics.ks module

bm.feature.metrics.ks.calc_ks(train_dict, test_dict)

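Compute the KS (Kolmogorov-Smirnov) statistic.

Reference: Hodges, J.L. Jr., “The Significance Probability of the Smirnov Two-Sample Test,” Arkiv för Matematik, 3, No. 43 (1958), 469-86.

Parameters

train_dict (dict) – y values and predicted probabilities of the training set
test_dict (dict) – y values and predicted probabilities of the test set

Returns

ks_train (KstestResult): KS result for the training set
ks_test (KstestResult): KS result for the test set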

Examples

import numpy as np
from scipy import stats

rng = np.random.default_rng()  # unseeded, so the result varies between runs
sample1 = stats.uniform.rvs(size=100, random_state=rng)  # uniform sample
sample2 = stats.norm.rvs(size=110, random_state=rng)  # normal sample
stats.ks_2samp(sample1, sample2)
# Output from one run:
# KstestResult(statistic=0.5454545454545454, pvalue=7.37417839555191e-15)
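
For intuition about what the two-sample test measures, the KS statistic itself (not the p-value) can be computed by hand as the largest gap between two empirical CDFs. The sketch below is a stdlib illustration with a hypothetical helper name; splitting predicted probabilities by label, as done in the usage lines, is one common way to score a classifier and is an assumption about how `calc_ks` uses its inputs.

```python
def ks_statistic_sketch(sample1, sample2):
    """Two-sample KS statistic: the largest gap between the empirical CDFs."""
    s1, s2 = sorted(sample1), sorted(sample2)
    i = j = 0
    d = 0.0
    while i < len(s1) and j < len(s2):
        x = min(s1[i], s2[j])
        # Advance both pointers past every observation <= x,
        # so i / len(s1) and j / len(s2) are the CDF values at x.
        while i < len(s1) and s1[i] <= x:
            i += 1
        while j < len(s2) and s2[j] <= x:
            j += 1
        d = max(d, abs(i / len(s1) - j / len(s2)))
    return d


y_true = [0, 0, 0, 1, 1, 1]
y_prob = [0.1, 0.2, 0.6, 0.4, 0.8, 0.9]
pos_scores = [p for p, y in zip(y_prob, y_true) if y == 1]
neg_scores = [p for p, y in zip(y_prob, y_true) if y == 0]
ks = ks_statistic_sketch(pos_scores, neg_scores)
```

The loop can stop once one sample is exhausted, because from that point its CDF stays at 1 and the gap can only shrink.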

bm.feature.metrics.ks.get_dir()

bm.feature.metrics.ks.get_export_path(base_name)

Get the export path.

Parameters: base_name (str) – base name describing the file type, used when building the file name
Returns: full path of the export file
Return type: str

bm.feature.metrics.psi module

bm.feature.metrics.psi.calc_psi(expected, actual, target=None, n_bins=10, method='quantile')

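Compute the PSI (Population Stability Index).

Parameters

expected – the baseline feature column
actual – the feature column whose distribution is compared against expected
target – the model training target column; required by the decision-tree binning method
n_bins – number of bins
method – binning strategy: step (equal width), quantile (equal frequency, the default), dt (decision tree)

Returns

psi: PSI value
statistic_df: data table with the following information for each bin
    value range
    sample count (expected and actual)
    sample share (expected and actual)
    PSI value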
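
A minimal sketch of the quantile variant, under stated assumptions: bin edges are cut on the expected column only, empty bins are floored at a small `eps`, and each bin contributes (actual share - expected share) * ln(actual share / expected share). The helper name, the `eps` parameter and the row layout are hypothetical and do not describe `calc_psi` itself.

```python
import math


def psi_sketch(expected, actual, n_bins=10, eps=1e-6):
    """Hypothetical helper: PSI with equal-frequency bins cut on `expected`."""
    exp_sorted = sorted(expected)
    n = len(exp_sorted)
    # Interior cut points at evenly spaced ranks of the expected sample;
    # the set removes duplicate edges caused by ties.
    edges = sorted({exp_sorted[(n * k) // n_bins] for k in range(1, n_bins)})

    def shares(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            idx = sum(1 for e in edges if v >= e)  # edges at or below v
            counts[idx] += 1
        return [c / len(values) for c in counts]

    rows = []
    psi = 0.0
    for b, (e, a) in enumerate(zip(shares(expected), shares(actual))):
        e, a = max(e, eps), max(a, eps)  # floor empty bins so the log is finite
        contrib = (a - e) * math.log(a / e)
        rows.append({"bin": b, "expected_pct": e, "actual_pct": a, "psi": contrib})
        psi += contrib
    return psi, rows


expected = list(range(100))
same_psi, _ = psi_sketch(expected, expected)
shifted_psi, rows = psi_sketch(expected, [v + 30 for v in expected])
```

Identical distributions give a PSI of zero, while the shifted sample piles up in the top bin and leaves the lowest bins empty, so its PSI is large.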

bm.feature.metrics.vif module

bm.feature.metrics.vif.calc_vif(dataframe)

Compute the variance inflation factor (VIF) using least squares.

Parameters: dataframe (pd.DataFrame) – the data table to compute VIF on
Returns: data table with each feature's VIF and tolerance
Return type: pd.DataFrame

bm.feature.metrics.vif.calc_vif_reg(dataframe)

Compute the variance inflation factor (VIF) using regression.

Parameters: dataframe (pd.DataFrame) – the data table to compute VIF on
Returns: data table with each feature's VIF and tolerance
Return type: pd.DataFrame

bm.feature.metrics.vif.filter_vif(dataframe, threshold=10)

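Select the features whose variance inflation factor (VIF) exceeds the threshold.

Parameters

dataframe (pd.DataFrame) – the data table to compute VIF on
threshold (float) – VIF threshold

Returns

list of the features above the threshold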

Return type

pd.DataFrame
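
The VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the others, and the tolerance is its reciprocal, 1 - R². The stdlib sketch below illustrates that definition and a threshold filter; it is not the code behind `calc_vif`, `calc_vif_reg` or `filter_vif`, and every helper name in it is hypothetical.

```python
def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def dot(u, v):
    return sum(p * q for p, q in zip(u, v))


def vif_sketch(columns):
    """Hypothetical helper: {feature: (vif, tolerance)}, one regression each."""
    centered = {}
    for name, vals in columns.items():
        mean = sum(vals) / len(vals)
        centered[name] = [v - mean for v in vals]  # centering absorbs the intercept
    result = {}
    for name, y in centered.items():
        others = [col for other, col in centered.items() if other != name]
        gram = [[dot(u, v) for v in others] for u in others]
        beta = solve_linear(gram, [dot(u, y) for u in others])  # normal equations
        fitted = [sum(b * col[i] for b, col in zip(beta, others))
                  for i in range(len(y))]
        ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
        tolerance = ss_res / dot(y, y)  # equals 1 - R^2
        result[name] = (1.0 / tolerance, tolerance)
    return result


features = {"a": [1, 2, 3, 4], "b": [1, 3, 2, 4], "c": [1, -1, -1, 1]}
vifs = vif_sketch(features)
high_vif = [name for name, (vif, _) in vifs.items() if vif > 10]
```

Here `a` and `b` have a correlation of 0.8 and `c` is orthogonal to both, so none of them crosses a threshold of 10. Perfectly collinear or constant columns would make the tolerance zero and need separate handling, which this sketch omits.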

bm.feature.metrics.woe module

bm.feature.metrics.woe.calc_woe(t_prob, f_prob)

Compute the WOE value.

Parameters

t_prob (float) – probability of positive samples
f_prob (float) – probability of negative samples

Returns

WOE value

Return type

float