bm.prep.binning package

Binning

bm.prep.binning.bin_utils module

class bm.prep.binning.bin_utils.CalculationMetrics(data, preds=None, y_labels=None)

Bases: object

Methods for computing KS, IV, WOE, PSI and related metrics.

Variables

data – DataFrame: input data
preds – series: predicted labels
y_labels – series: true labels

calculate_best_ks(): compute the bin boundaries used by best_ks binning
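The class description mentions WOE and IV without stating how they are defined. The sketch below shows one common convention for a binary target (1 = bad, 0 = good); it is not taken from this package. The function name woe_iv_sketch, its arguments and the eps floor are hypothetical, and some libraries use the opposite sign (bad over good).

```python
import numpy as np
import pandas as pd

def woe_iv_sketch(bins, y, eps=1e-6):
    """Per-bin WOE and total IV for a binary target (hypothetical helper)."""
    df = pd.DataFrame({"bin": bins, "y": y})
    # Count bad (y == 1) and total rows in every bin.
    grouped = df.groupby("bin")["y"].agg(bad="sum", total="count")
    grouped["good"] = grouped["total"] - grouped["bad"]
    # Share of all bad / all good samples that falls into each bin.
    # The eps floor avoids log(0) for bins with no good or no bad samples.
    bad_dist = (grouped["bad"] / grouped["bad"].sum()).clip(lower=eps)
    good_dist = (grouped["good"] / grouped["good"].sum()).clip(lower=eps)
    # Assumed convention: WOE = ln(good share / bad share).
    woe = np.log(good_dist / bad_dist)
    # IV sums (good share - bad share) * WOE over all bins.
    iv = float(((good_dist - bad_dist) * woe).sum())
    return woe, iv

bins = ["a", "a", "a", "a", "b", "b", "b", "b"]
y = [1, 0, 0, 0, 1, 1, 1, 0]
woe, iv = woe_iv_sketch(bins, y)
```

With this convention a bin with a positive WOE holds proportionally more good samples than bad ones.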

bm.prep.binning.bin_utils.check_single_value(series)

Check for a single unique value.

Parameters: series (pd.Series) –
Returns: bool

bm.prep.binning.bin_utils.check_type(data, special_attributes)

Check the data types.

Parameters

data (pd.DataFrame) –
special_attributes (list or tuple) –

Returns

None

bm.prep.binning.bin_utils.check_unique(x)

Check whether the value is unique.

Parameters: x (list or array or series) –
Returns: bool

bm.prep.binning.bin_utils.dict_reverse(d)

Swap the keys and values of a dictionary.

Parameters: d (dict) –
Returns: dict
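A minimal sketch of swapping keys and values, assuming all values are hashable. The name reverse_dict is hypothetical and the package's own handling of duplicate values is not documented here.

```python
def reverse_dict(d):
    """Return a new dict with keys and values swapped (hypothetical helper)."""
    # Values must be hashable; if a value repeats, the last key seen wins.
    return {value: key for key, value in d.items()}

bin_labels = {0: "low", 1: "mid", 2: "high"}
label_to_bin = reverse_dict(bin_labels)
```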

bm.prep.binning.bin_utils.generate_counts(arr, breakpoints)

Generates counts for each bucket by using the bucket values

Parameters

arr (ndarray of actual values) –
breakpoints (list of bucket values) –

Returns

counts of elements in each bucket (length of the breakpoints array minus one); list: the cut bins

Return type

numpy.ndarray
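One way to obtain per-bucket counts from a list of breakpoints is numpy.histogram, sketched below. Whether generate_counts uses it internally is an assumption, and count_per_bucket is a hypothetical name.

```python
import numpy as np

def count_per_bucket(arr, breakpoints):
    """Count how many values of arr fall into each bucket (hypothetical helper)."""
    # np.histogram treats breakpoints as bin edges: n edges give n - 1 buckets.
    # Every bucket is half-open [left, right) except the last, which is closed.
    counts, edges = np.histogram(arr, bins=breakpoints)
    return counts, edges

values = np.array([1, 2, 2, 5, 7, 9, 10])
counts, edges = count_per_bucket(values, [0, 3, 6, 10])
```

Values outside the first and last breakpoint are dropped by np.histogram, so the breakpoints should span the data.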

bm.prep.binning.bin_utils.is_monotonic(x)

Check whether x is monotonic.

Parameters: x (list or array or series) –
Returns: bool
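A monotonicity check can be sketched with numpy.diff. This version treats ties as allowed (non-strict monotonicity), which is an assumption; is_monotonic_sketch is a hypothetical name.

```python
import numpy as np

def is_monotonic_sketch(x):
    """True if x is non-decreasing or non-increasing (hypothetical helper)."""
    # Differences between consecutive elements.
    diffs = np.diff(np.asarray(x, dtype=float))
    # All steps up (or flat), or all steps down (or flat).
    return bool(np.all(diffs >= 0) or np.all(diffs <= 0))

rising = is_monotonic_sketch([0.1, 0.2, 0.2, 0.5])
zigzag = is_monotonic_sketch([0.1, 0.3, 0.2])
```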

bm.prep.binning.bin_utils.sub_psi(e_perc, a_perc)

Calculate the PSI value.

Parameters

e_perc (array) – the expected array
a_perc (array) – the actual array

Returns

the psi value

Return type

float
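The population stability index is commonly defined as the sum over buckets of (a_i - e_i) * ln(a_i / e_i), where e_i and a_i are the expected and actual bucket shares. The sketch below follows that definition; the eps floor for empty buckets and the name psi_sketch are assumptions, not this package's behavior.

```python
import numpy as np

def psi_sketch(e_perc, a_perc, eps=1e-4):
    """Population stability index between two share vectors (hypothetical helper)."""
    # Floor the shares so that empty buckets do not produce log(0) or x / 0.
    e = np.clip(np.asarray(e_perc, dtype=float), eps, None)
    a = np.clip(np.asarray(a_perc, dtype=float), eps, None)
    # Each bucket contributes (actual - expected) * ln(actual / expected).
    return float(np.sum((a - e) * np.log(a / e)))

same = psi_sketch([0.5, 0.5], [0.5, 0.5])
shifted = psi_sketch([0.5, 0.5], [0.4, 0.6])
```

Every term is non-negative, so the index is zero only when the two distributions agree bucket by bucket.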

bm.prep.binning.bins module

class bm.prep.binning.bins.FeaturesBinning(data, column, target=None, target_value=None, num_clusters=5, max_interval=10, special_attributes=None, tree_params=None)

Bases: object

Feature binning utilities.

Variables

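data – DataFrame: input data
column – str: feature column
target – str: column name
target_value – any: value of the target column
num_clusters – int: number of clusters
max_interval – int: maximum number of bin intervals, default 10
special_attributes – str: special attributes, default None
tree_params – dict: tree parameters, default None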

Examples

feature_bin = FeaturesBinning(data, column, target, max_interval=10, special_attributes=None, tree_params=None)
results = feature_bin.chi_square_binning()

best_ks_binning(): Best-KS binning

chi_square_binning(): chi-square binning

decision_tree_binning(): decision-tree binning

distance_binning(): equal-width binning

interpolate_binning(): interpolation binning

kmeans_binning(): K-means clustering binning

ks_init(datas)

mixed_binning(): mixed binning

quantile_binning(): equal-frequency binning
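The difference between equal-width and equal-frequency binning can be illustrated with pandas alone. The sketch below uses pd.cut and pd.qcut directly and does not call FeaturesBinning; the sample data is made up for illustration.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Equal-width: 3 intervals of equal length over the observed range.
# The outlier 100 stretches the range, so most rows land in the first bin.
width_bins, width_edges = pd.cut(values, bins=3, retbins=True)
width_counts = width_bins.value_counts(sort=False).tolist()

# Equal-frequency: 2 intervals holding about the same number of rows each.
freq_bins, freq_edges = pd.qcut(values, q=2, retbins=True)
freq_counts = freq_bins.value_counts(sort=False).tolist()
```

Equal-width bins are sensitive to outliers, while equal-frequency bins keep the row counts balanced at the cost of uneven interval lengths.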

bm.prep.binning.chi_merge module

bm.prep.binning.chi_merge.assign_bin(x, cutOffPoints, special_attributes)

Map a value to its bin.

Parameters

x (float) –
cutOffPoints (list) –
special_attributes (list) –

Returns

str
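Mapping a value to a bin label from a sorted list of cut-off points can be sketched with the standard bisect module. The label format, the treatment of special values, and the right-closed intervals are all assumptions here, and assign_bin_sketch is a hypothetical name.

```python
import bisect

def assign_bin_sketch(x, cut_points, special_values=()):
    """Return a string label for the bin containing x (hypothetical helper)."""
    # Special values get their own label instead of a regular bin (assumed).
    if x in special_values:
        return f"Bin {x}"
    # bisect_left yields right-closed intervals:
    # (-inf, c0], (c0, c1], ..., (c_last, +inf)
    index = bisect.bisect_left(cut_points, x)
    return f"Bin {index}"

labels = [assign_bin_sketch(v, [10, 20], special_values=(-1,)) for v in (-1, 5, 10, 15, 25)]
```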

bm.prep.binning.chi_merge.assign_group(x, bins)

Map a value to its interval.

Parameters

x (float or int) –
bins (list or tuple) – the cut bins

Returns

the value of

Return type

float or int

bm.prep.binning.chi_merge.bad_rate_merge(df, col, cutOffPoints, target, special_attributes=None)

Merge bin cut-off points where the bad rate is 0 or 1.

Parameters

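df (pd.DataFrame) – DataFrame containing the total sample counts and the bad-sample counts
col (str) – column for which the good/bad sample proportions are computed
cutOffPoints (list) – bin cut-off points
target (str) – target variable
special_attributes (list) – special values to exclude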

Returns

the bin cut-off points

Return type

list

bm.prep.binning.chi_merge.bad_rate_monotone(df, sortByVar, target, special_attributes=None)

Check whether the bad rate of a variable is monotonic.

Parameters

df (pd.DataFrame) –
sortByVar (str) –
target (str) – the target of dataframe
special_attributes (list or tuple) –

Returns

bool

bm.prep.binning.chi_merge.bin_bad_rate(df, col, target, grantRateIndicator=0)

Compute the proportion of bad samples for each value in col.

Parameters

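df (dataframe) – dataset for which the good/bad ratio is computed
col (list) – feature for which the good/bad ratio is computed
target – the good/bad label
grantRateIndicator (int) – 1 to also return the overall bad rate, 0 otherwise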

Returns

the bad rate of each bin, as a dictionary; dataframe: the bad rate of each bin, as a DataFrame; float: the overall bad rate (only when grantRateIndicator == 1)

Return type

dict
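The per-value bad rate described above can be sketched with a pandas groupby, assuming the target column is coded 1 for bad and 0 for good. The name bad_rate_sketch and the output column names are hypothetical.

```python
import pandas as pd

def bad_rate_sketch(df, col, target):
    """Bad rate per distinct value of col, plus the overall bad rate (hypothetical helper)."""
    # Total rows and bad rows (target == 1) for each distinct value of col.
    grouped = df.groupby(col)[target].agg(total="count", bad="sum").reset_index()
    grouped["bad_rate"] = grouped["bad"] / grouped["total"]
    # Dictionary view: value of col -> bad rate.
    rate_by_value = dict(zip(grouped[col], grouped["bad_rate"]))
    # Overall bad rate across the whole dataset.
    overall = df[target].sum() / len(df)
    return rate_by_value, grouped, overall

sample = pd.DataFrame({"grade": ["A", "A", "B", "B", "B"], "y": [0, 1, 1, 1, 0]})
rates, table, overall = bad_rate_sketch(sample, "grade", "y")
```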

bm.prep.binning.chi_merge.cal_chi2(df, total_col, bad_col)

Compute the chi-square value.

Parameters

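df (pd.DataFrame) – DataFrame containing the total sample counts and the bad-sample counts
total_col (str) – column with the total number of samples
bad_col (str) – column with the number of bad samples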

Returns

the chi2 value

Return type

float
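For a table of per-bin totals and bad counts, the Pearson chi-square statistic compares observed counts with the counts expected under the overall bad rate. The sketch below sums over both the bad and the good cells. Whether cal_chi2 includes the good cells is an assumption, and chi2_sketch is a hypothetical name.

```python
import pandas as pd

def chi2_sketch(df, total_col, bad_col):
    """Pearson chi-square over bins from total and bad counts (hypothetical helper)."""
    total = df[total_col].astype(float)
    bad = df[bad_col].astype(float)
    good = total - bad
    # Overall bad rate; it must lie strictly between 0 and 1,
    # otherwise an expected count is zero and the statistic is undefined.
    bad_rate = bad.sum() / total.sum()
    expected_bad = total * bad_rate
    expected_good = total * (1 - bad_rate)
    # Sum of (observed - expected)^2 / expected over every cell.
    bad_part = ((bad - expected_bad) ** 2 / expected_bad).sum()
    good_part = ((good - expected_good) ** 2 / expected_good).sum()
    return float(bad_part + good_part)

counts = pd.DataFrame({"total": [10, 10], "bad": [2, 8]})
chi2 = chi2_sketch(counts, "total", "bad")
```

In chi-merge style binning, adjacent bins with a small statistic have similar bad rates and are candidates for merging.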

bm.prep.binning.chi_merge.cal_chi_merge(df, col, target, max_interval, min_bin_pct=0.01, special_attributes=None)

Compute the chi-square binning cut-off points.

Parameters

df (pd.DataFrame) –
col (str) – the variable of dataframe
target (str) – the target of dataframe
max_interval (int) – the max interval of variable
min_bin_pct (float) – the min percent of bin
special_attributes (list or tuple) – remove values of Series

Returns

the cutPoints of variable

Return type

list

bm.prep.binning.chi_merge.cutpoint_brm(data, var, label, cp, special_attributes=None)

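Check whether the bad-sample distribution across bins is monotonic; if it is not, merge each offending bin with its upper or lower neighbor.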

Parameters

data (pd.DataFrame) –
var (str) – the variable of dataframe
label (str) – the label of dataframe
cp (list) – the cut bins of variable
special_attributes (list or tuple) –

Returns

list

bm.prep.binning.chi_merge.feature_monotone(x)

Find the indices of the values that violate monotonicity.

Parameters: x (list) – the cut of bins
Returns: dict

bm.prep.binning.chi_merge.monotone_merge(df, target, col)

bm.prep.binning.chi_merge.split_data(df, col, num_split, special_attributes=None)

Convert fine-grained values into coarse-grained groups.

Parameters

df (dataframe) – dataset sorted by col
col (list) – variable to be binned
num_split (int) – number of groups to split into
special_attributes (list) – special values to exclude when splitting the dataset

Returns

the initial binning cut-off points

Return type

list
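One way to turn many distinct values into a small set of initial cut-off points is to pick values at evenly spaced positions in the sorted data, as sketched below. This is an illustration of the idea under that assumption, not the package's algorithm; initial_split_points and its excluded argument are hypothetical.

```python
def initial_split_points(values, num_split, excluded=()):
    """Pick up to num_split - 1 cut-off points from sorted values (hypothetical helper)."""
    # Drop special values before splitting, then sort.
    ordered = sorted(v for v in values if v not in excluded)
    if not ordered or num_split < 2:
        return []
    # Rows per group; may be fractional when the data does not divide evenly.
    step = len(ordered) / num_split
    # Take the value at the start of the 2nd, 3rd, ... group.
    points = [ordered[int(i * step)] for i in range(1, num_split)]
    # Repeated values can yield duplicate points; keep each one once.
    return sorted(set(points))

points = initial_split_points(list(range(1, 11)) + [-999], num_split=5, excluded=(-999,))
```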