bm.prep.binning package
Binning
bm.prep.binning.bin_utils module
- class bm.prep.binning.bin_utils.CalculationMetrics(data, preds=None, y_labels=None)
Bases: object
Methods for computing KS, IV, WOE, PSI, and related metrics.
- Variables
data – DataFrame: input data
preds – Series: predicted labels
y_label – Series: true labels
- calculate_best_ks()
Computes the bin boundaries used for best-KS binning.
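As a rough illustration of what a best-KS boundary search involves, the sketch below scans the sorted values and returns the point where the gap between the cumulative bad share and the cumulative good share is largest. It is not the code behind calculate_best_ks: the name best_ks_split, the list-based inputs, and the convention that label 1 means "bad" are all assumptions made for this example.

```python
def best_ks_split(values, labels):
    """Return (split_value, ks) for the largest cumulative bad/good gap.

    Hypothetical helper: labels are assumed to be 0 (good) or 1 (bad).
    """
    pairs = sorted(zip(values, labels))
    total_bad = sum(labels)
    total_good = len(labels) - total_bad
    if total_bad == 0 or total_good == 0:
        raise ValueError("both classes must be present")

    best_value, best_ks = None, -1.0
    cum_bad = cum_good = 0
    for i, (value, label) in enumerate(pairs):
        cum_bad += label
        cum_good += 1 - label
        # Only evaluate a candidate split after the last copy of a value.
        if i + 1 < len(pairs) and pairs[i + 1][0] == value:
            continue
        ks = abs(cum_bad / total_bad - cum_good / total_good)
        if ks > best_ks:
            best_value, best_ks = value, ks
    return best_value, best_ks
```

Applying such a search recursively to the two halves on either side of the split is one common way to obtain several boundaries.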
- bm.prep.binning.bin_utils.check_single_value(series)
Checks whether the series contains only a single unique value.
- Parameters
series (pd.Series) –
- Returns
bool
- bm.prep.binning.bin_utils.check_type(data, special_attributes)
Checks the data types.
- Parameters
data (pd.DataFrame) –
special_attributes (list or tuple) –
- Returns
None
- bm.prep.binning.bin_utils.check_unique(x)
Checks whether x holds a unique value.
- Parameters
x (list or array or series) –
- Returns
bool
- bm.prep.binning.bin_utils.dict_reverse(d)
Swaps the keys and values of a dictionary.
- Parameters
d (dict) –
- Returns
dict
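A minimal sketch of a key/value swap, assuming the values are hashable. The name dict_reverse_sketch is hypothetical, and how dict_reverse itself handles duplicate values is not documented here; in this sketch the last key seen wins.

```python
def dict_reverse_sketch(d):
    """Return a new dict mapping each value of d back to its key."""
    # If two keys share a value, the later key overwrites the earlier one.
    return {value: key for key, value in d.items()}
```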
- bm.prep.binning.bin_utils.generate_counts(arr, breakpoints)
Generates counts for each bucket by using the bucket values
- Parameters
arr (ndarray of actual values) –
breakpoints (list of bucket values) –
- Returns
Counts of the elements in each bucket; the result has one entry fewer than the breakpoints array. list: the cut bins
- Return type
numpy.ndarray
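The counting step can be sketched with the standard library as follows. This is an illustration only: the name generate_counts_sketch is hypothetical, breakpoints are assumed to be sorted, and the edge convention (each bucket includes its left edge, and the last bucket also includes its right edge) is an assumption rather than documented behavior of generate_counts.

```python
from bisect import bisect_right


def generate_counts_sketch(arr, breakpoints):
    """Count how many values of arr fall into each bucket."""
    counts = [0] * (len(breakpoints) - 1)
    for value in arr:
        if value < breakpoints[0] or value > breakpoints[-1]:
            continue  # value lies outside every bucket
        index = bisect_right(breakpoints, value) - 1
        # The right-most edge belongs to the last bucket.
        index = min(index, len(counts) - 1)
        counts[index] += 1
    return counts
```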
- bm.prep.binning.bin_utils.is_monotonic(x)
Checks whether the values are monotonic.
- Parameters
x (list or array or series) –
- Returns
bool
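One way such a check might look is shown below. The name is_monotonic_sketch is hypothetical, and treating ties as allowed (non-strict monotonicity) is an assumption; the documentation does not say whether is_monotonic is strict.

```python
def is_monotonic_sketch(x):
    """Return True if x is entirely non-decreasing or non-increasing."""
    values = list(x)
    pairs = list(zip(values, values[1:]))
    non_decreasing = all(a <= b for a, b in pairs)
    non_increasing = all(a >= b for a, b in pairs)
    return non_decreasing or non_increasing
```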
- bm.prep.binning.bin_utils.sub_psi(e_perc, a_perc)
Calculates the PSI value.
- Parameters
e_perc (array) – the expected array
a_perc (array) – the actual array
- Returns
the psi value
- Return type
float
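For reference, the population stability index is conventionally defined as the sum over buckets of (a - e) * ln(a / e), where e and a are the expected and actual bucket proportions. The sketch below implements that formula. The name psi_sketch and the eps floor that guards against empty buckets are assumptions for this example and may differ from what sub_psi does.

```python
import math


def psi_sketch(e_perc, a_perc, eps=1e-4):
    """Sum (a - e) * ln(a / e) over paired bucket proportions."""
    total = 0.0
    for expected, actual in zip(e_perc, a_perc):
        # Floor the proportions so empty buckets do not produce log(0).
        expected = max(expected, eps)
        actual = max(actual, eps)
        total += (actual - expected) * math.log(actual / expected)
    return total
```

Identical distributions give a PSI of zero, and the value grows as the actual proportions drift away from the expected ones.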
bm.prep.binning.bins module
- class bm.prep.binning.bins.FeaturesBinning(data, column, target=None, target_value=None, num_clusters=5, max_interval=10, special_attributes=None, tree_params=None)
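Bases: object
Feature binning utilities.
- Variables
data – DataFrame: input data
column – str: feature column
target – str: name of the target column
target_value – any: value of the target column
num_clusters – int: number of clusters
max_interval – int: maximum number of bin intervals, default 10
special_attributes – str: special attribute values, default None
tree_params – dict: tree parameters, default None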
Examples
feature_bin = FeaturesBinning(data, column, target, max_interval=10, special_attributes=None, tree_params=None)
results = feature_bin.chi_square_binning()
- best_ks_binning()
Best-KS binning
- chi_square_binning()
Chi-square binning
- decision_tree_binning()
Decision tree binning
- distance_binning()
Equal-width binning
- interpolate_binning()
Interpolation binning
- kmeans_binning()
K-means clustering binning
- ks_init(datas)
- mixed_binning()
Mixed binning
- quantile_binning()
Equal-frequency binning
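To make the difference between equal-width and equal-frequency binning concrete, the sketch below derives both kinds of bin edges from a plain list. The helper names are hypothetical and the code is independent of distance_binning and quantile_binning; with pandas, pandas.cut and pandas.qcut are the usual tools for the same two ideas.

```python
def equal_width_edges(values, n_bins):
    """Split the range [min, max] into n_bins intervals of equal width."""
    low, high = min(values), max(values)
    step = (high - low) / n_bins
    return [low + i * step for i in range(n_bins + 1)]


def equal_frequency_edges(values, n_bins):
    """Pick edges so each bin holds roughly the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    edges = [ordered[min(i * n // n_bins, n - 1)] for i in range(n_bins + 1)]
    # Repeated values can yield duplicate edges, so deduplicate them.
    return sorted(set(edges))
```

With a single outlier in the data, the equal-width edges are stretched by the outlier while the equal-frequency edges follow the bulk of the values.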
bm.prep.binning.chi_merge module
- bm.prep.binning.chi_merge.assign_bin(x, cutOffPoints, special_attributes)
Maps a value to its bin.
- Parameters
x (float) –
cutOffPoints (list) –
special_attributes (list) –
- Returns
str
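A mapping of this kind could be sketched as below. Everything specific here is an assumption: the name assign_bin_sketch, the label format ("Bin 0", "Special -1"), and the rule that a value equal to a cut-off point falls into the lower bin. The actual labels returned by assign_bin are not documented in this section.

```python
from bisect import bisect_left


def assign_bin_sketch(x, cut_off_points, special_attributes=None):
    """Return a string label for the bin that contains x."""
    # Special values get their own label instead of a numeric bin.
    if special_attributes and x in special_attributes:
        return "Special {}".format(x)
    # Count how many cut-off points lie strictly below x.
    index = bisect_left(sorted(cut_off_points), x)
    return "Bin {}".format(index)
```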
- bm.prep.binning.chi_merge.assign_group(x, bins)
Maps a value to its interval.
- Parameters
x (float or int) –
bins (list or tuple) – the cut bins
- Returns
the value of
- Return type
float or int
- bm.prep.binning.chi_merge.bad_rate_merge(df, col, cutOffPoints, target, special_attributes=None)
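Merges bins whose bad rate is 0 or 1 by adjusting the bin cut-off points.
- Parameters
df (pd.DataFrame) – data frame containing the total sample counts and the bad sample counts
col (str) – column for which the good/bad sample proportions are computed
cutOffPoints (list) – bin cut-off points
target (str) – target variable
special_attributes (list) – special values to exclude
- Returns
The bin cut-off points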
- Return type
list
- bm.prep.binning.chi_merge.bad_rate_monotone(df, sortByVar, target, special_attributes=None)
Checks whether the bad rate of a variable is monotonic.
- Parameters
df (pd.DataFrame) –
sortByVar (str) –
target (str) – the target of dataframe
special_attributes (list or tuple) –
- Returns
bool
- bm.prep.binning.chi_merge.bin_bad_rate(df, col, target, grantRateIndicator=0)
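Computes the proportion of bad samples for each value in col.
- Parameters
df (dataframe) – data set for which the bad rates are computed
col (list) – feature for which the bad rates are computed
target – good/bad label
grantRateIndicator (int) – 1 also returns the overall bad rate; 0 does not
- Returns
The bad rate of each bin, as a dict. dataframe: the bad rate of each bin, as a DataFrame. float: the overall bad rate (only when grantRateIndicator == 1)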
- Return type
dict
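The per-value bad rate amounts to a grouped ratio of bad samples to all samples. The sketch below shows the arithmetic on plain lists instead of a DataFrame. The name bad_rate_by_value and the convention that label 1 marks a bad sample are assumptions for this example.

```python
from collections import defaultdict


def bad_rate_by_value(values, labels):
    """Return ({value: bad_rate}, overall_bad_rate) for 0/1 labels."""
    total = defaultdict(int)
    bad = defaultdict(int)
    for value, label in zip(values, labels):
        total[value] += 1
        bad[value] += label  # label 1 is assumed to mean "bad"
    rates = {value: bad[value] / total[value] for value in total}
    overall = sum(bad.values()) / sum(total.values())
    return rates, overall
```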
- bm.prep.binning.chi_merge.cal_chi2(df, total_col, bad_col)
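Computes the chi-square value.
- Parameters
df (pd.DataFrame) – data frame containing the total sample counts and the bad sample counts
total_col (str) – column with the total sample count
bad_col (str) – column with the bad sample count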
- Returns
the chi2 value
- Return type
float
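As an illustration of the statistic, the sketch below computes a Pearson chi-square value from per-bin total and bad counts, taking the pooled bad rate as the expectation for every bin. The name chi2_from_counts and the list inputs are assumptions; cal_chi2 reads the same quantities from DataFrame columns and may differ in detail.

```python
def chi2_from_counts(totals, bads):
    """Pearson chi-square of bad/good counts against the pooled bad rate."""
    all_total = sum(totals)
    all_bad = sum(bads)
    # With only one class present, every bin matches the expectation.
    if all_bad == 0 or all_bad == all_total:
        return 0.0
    bad_rate = all_bad / all_total
    chi2 = 0.0
    for n, k in zip(totals, bads):
        if n == 0:
            continue  # an empty bin contributes nothing
        expected_bad = n * bad_rate
        expected_good = n * (1 - bad_rate)
        chi2 += (k - expected_bad) ** 2 / expected_bad
        chi2 += ((n - k) - expected_good) ** 2 / expected_good
    return chi2
```

A small value means the bins have similar bad rates, which is why chi-square binning merges the adjacent pair with the lowest value first.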
- bm.prep.binning.chi_merge.cal_chi_merge(df, col, target, max_interval, min_bin_pct=0.01, special_attributes=None)
Computes the cut-off points for chi-square binning.
- Parameters
df (pd.DataFrame) –
col (str) – the variable of dataframe
target (str) – the target of dataframe
max_interval (int) – the max interval of variable
min_bin_pct (float) – the min percent of bin
special_attributes (list or tuple) – remove values of Series
- Returns
the cutPoints of variable
- Return type
list
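The general ChiMerge idea can be sketched as follows: start with one bin per distinct value, then repeatedly merge the adjacent pair with the lowest chi-square value until at most max_interval bins remain. This is a simplified sketch under stated assumptions, not the implementation of cal_chi_merge: the helper names are hypothetical, labels are assumed to be 0/1 with 1 meaning "bad", and the min_bin_pct and special_attributes handling is left out.

```python
def chi2_pair(a, b):
    """Chi-square of two adjacent bins given as (total, bad) pairs."""
    total = a[0] + b[0]
    bad = a[1] + b[1]
    if bad == 0 or bad == total:
        return 0.0
    rate = bad / total
    chi2 = 0.0
    for n, k in (a, b):
        chi2 += (k - n * rate) ** 2 / (n * rate)
        chi2 += ((n - k) - n * (1 - rate)) ** 2 / (n * (1 - rate))
    return chi2


def chi_merge_sketch(values, labels, max_interval):
    """Return cut points after merging down to max_interval bins."""
    stats = {}
    for value, label in zip(values, labels):
        n, k = stats.get(value, (0, 0))
        stats[value] = (n + 1, k + label)
    # Each bin is [upper_value, total_count, bad_count].
    bins = [[v, n, k] for v, (n, k) in sorted(stats.items())]
    while len(bins) > max_interval:
        scores = [chi2_pair(bins[i][1:], bins[i + 1][1:])
                  for i in range(len(bins) - 1)]
        i = scores.index(min(scores))  # most similar adjacent pair
        upper, n2, k2 = bins[i + 1]
        bins[i] = [upper, bins[i][1] + n2, bins[i][2] + k2]
        del bins[i + 1]
    # The upper value of every bin except the last is a cut point.
    return [b[0] for b in bins[:-1]]
```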
- bm.prep.binning.chi_merge.cutpoint_brm(data, var, label, cp, special_attributes=None)
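Checks whether the bad-sample distribution is monotonic across the bins; if it is not, adjacent bins are merged upward or downward.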
- Parameters
data (pd.DataFrame) –
var (str) – the variable of dataframe
label (str) – the label of dataframe
cp (list) – the cut bins of variable
special_attributes (list or tuple) –
- Returns
list
- bm.prep.binning.chi_merge.feature_monotone(x)
Finds the indices of the values that break monotonicity.
- Parameters
x (list) – the cut of bins
- Returns
dict
- bm.prep.binning.chi_merge.monotone_merge(df, target, col)
- bm.prep.binning.chi_merge.split_data(df, col, num_split, special_attributes=None)
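Converts fine-grained values into coarse-grained groups.
- Parameters
df (dataframe) – data set sorted by col
col (list) – variable to be binned
num_split (int) – number of groups to split into
special_attributes (list) – special values to exclude when splitting the data set
- Returns
Initial cut-off points for binning the data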
- Return type
list
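One plausible way to coarsen a fine-grained variable is to sort the values and take a cut point at every (n / num_split)-th position. The sketch below follows that approach on a plain list. The name split_points_sketch and the exact indexing are assumptions and are not taken from split_data.

```python
def split_points_sketch(values, num_split, special_attributes=None):
    """Return sorted, de-duplicated initial cut points for values."""
    excluded = set(special_attributes or [])
    ordered = sorted(v for v in values if v not in excluded)
    step = len(ordered) // num_split
    if step == 0:
        # Fewer values than groups: every distinct value is a cut point.
        return sorted(set(ordered))
    points = {ordered[i * step] for i in range(1, num_split)}
    return sorted(points)
```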