bm.prep.binning package

Binning

bm.prep.binning.bin_utils module

class bm.prep.binning.bin_utils.CalculationMetrics(data, preds=None, y_labels=None)

Bases: object

Methods for computing KS, IV, WOE, PSI, and related metrics.

Variables
  • data – DataFrame: input data

  • preds – series: predicted labels

  • y_labels – series: true labels
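The class description names WOE and IV without defining them. The following is a minimal sketch of the usual definitions, not the library's implementation: the function name `woe_iv`, the `(good, bad)` input layout, the sign convention `ln(bad% / good%)`, and the 0.5 smoothing of empty cells are all assumptions made for illustration.

```python
import math


def woe_iv(bins):
    """Return (woe_per_bin, iv) for a list of (good_count, bad_count) pairs."""
    total_good = sum(good for good, _ in bins)
    total_bad = sum(bad for _, bad in bins)
    if total_good == 0 or total_bad == 0:
        raise ValueError("both good and bad samples are required")

    woes = []
    iv = 0.0
    for good, bad in bins:
        # Smooth empty cells so the logarithm stays finite (assumption of this sketch).
        good_pct = max(good, 0.5) / total_good
        bad_pct = max(bad, 0.5) / total_bad
        woe = math.log(bad_pct / good_pct)
        woes.append(woe)
        # IV sums the WOE of each bin weighted by the distribution gap.
        iv += (bad_pct - good_pct) * woe
    return woes, iv
```

Flipping the sign convention to `ln(good% / bad%)` negates every WOE value but leaves IV unchanged.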

calculate_best_ks()

Computes the bin boundaries used by best-KS binning.
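To illustrate the idea behind a best-KS boundary, here is a minimal sketch that finds the single split value maximizing the KS statistic, the largest gap between the cumulative bad and cumulative good distributions. The function name `best_ks_split` and its list-based inputs are hypothetical; how `calculate_best_ks()` works internally is not documented here.

```python
def best_ks_split(values, labels):
    """Return (split_value, ks); rows with value <= split_value fall on the left.

    labels are 0 (good) or 1 (bad).
    """
    pairs = sorted(zip(values, labels))
    total_bad = sum(labels)
    total_good = len(labels) - total_bad
    if total_bad == 0 or total_good == 0:
        raise ValueError("both classes are required")

    best_value, best_ks = None, 0.0
    cum_bad = cum_good = 0
    for i, (value, label) in enumerate(pairs):
        cum_bad += label
        cum_good += 1 - label
        # Evaluate only at the last occurrence of a tied value.
        if i + 1 < len(pairs) and pairs[i + 1][0] == value:
            continue
        ks = abs(cum_bad / total_bad - cum_good / total_good)
        if ks > best_ks:
            best_value, best_ks = value, ks
    return best_value, best_ks
```

A full best-KS binning would typically apply such a split recursively to each side until a bin limit is reached.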

bm.prep.binning.bin_utils.check_single_value(series)

Checks for a single unique value.

Parameters

series (pd.Series) –

Returns

bool

bm.prep.binning.bin_utils.check_type(data, special_attributes)

Checks the data types.

Parameters
  • data (pd.DataFrame) –

  • special_attributes (list or tuple) –

Returns

None

bm.prep.binning.bin_utils.check_unique(x)

Checks whether the values are unique.

Parameters

x (list or array or series) –

Returns

bool

bm.prep.binning.bin_utils.dict_reverse(d)

Swaps the keys and values of a dictionary.

Parameters

d (dict) –

Returns

dict
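A minimal sketch of key/value reversal follows. The name `reverse_dict` is hypothetical, and raising on duplicate values is a choice made here; how `dict_reverse` treats duplicates is not documented.

```python
def reverse_dict(d):
    """Return a new dict mapping each value of d back to its key."""
    reversed_d = {}
    for key, value in d.items():
        if value in reversed_d:
            # Two keys share this value, so the reversal would lose data.
            raise ValueError(f"duplicate value {value!r}")
        reversed_d[value] = key
    return reversed_d
```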

bm.prep.binning.bin_utils.generate_counts(arr, breakpoints)

Generates counts for each bucket by using the bucket values

Parameters
  • arr (ndarray of actual values) –

  • breakpoints (list of bucket values) –

Returns

Counts of elements in each bucket (length of the breakpoints array minus one); list: the cut bins

Return type

numpy.ndarray
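The bucket-counting step can be sketched in plain Python as below. This is an illustration under assumptions, not the library code: the name `bucket_counts` is hypothetical, it returns a list rather than a `numpy.ndarray`, and it assumes left-closed buckets with a closed last bucket.

```python
from bisect import bisect_right


def bucket_counts(arr, breakpoints):
    """Count values per bucket; returns len(breakpoints) - 1 counts.

    Buckets are [b0, b1), [b1, b2), ..., [b_{n-1}, b_n]; breakpoints must be sorted.
    """
    counts = [0] * (len(breakpoints) - 1)
    for x in arr:
        if x < breakpoints[0] or x > breakpoints[-1]:
            continue  # value lies outside every bucket
        # Clamp so that x == breakpoints[-1] lands in the last bucket.
        idx = min(bisect_right(breakpoints, x) - 1, len(counts) - 1)
        counts[idx] += 1
    return counts
```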

bm.prep.binning.bin_utils.is_monotonic(x)

Checks whether the values are monotonic.

Parameters

x (list or array or series) –

Returns

bool
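A minimal sketch of a monotonicity check follows. The name `is_monotonic_sketch` is hypothetical, and treating ties as monotonic (non-strict comparison) is an assumption; the documented function may be stricter.

```python
def is_monotonic_sketch(x):
    """Return True if x is entirely non-decreasing or entirely non-increasing."""
    values = list(x)
    pairs = list(zip(values, values[1:]))
    non_decreasing = all(a <= b for a, b in pairs)
    non_increasing = all(a >= b for a, b in pairs)
    return non_decreasing or non_increasing
```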

bm.prep.binning.bin_utils.sub_psi(e_perc, a_perc)

Calculates the PSI value.

Parameters
  • e_perc (array) – the expected array

  • a_perc (array) – the actual array

Returns

the psi value

Return type

float
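For reference, PSI is conventionally the sum over buckets of `(actual - expected) * ln(actual / expected)`. The sketch below assumes the inputs are per-bucket proportions and clips zeros to a small `eps`; the name `psi` and the `eps` parameter are hypothetical and not part of the documented signature.

```python
import math


def psi(expected, actual, eps=1e-4):
    """Population stability index for two sequences of bucket proportions."""
    total = 0.0
    for e, a in zip(expected, actual):
        # Clip zeros so the ratio and the logarithm stay defined.
        e = max(e, eps)
        a = max(a, eps)
        total += (a - e) * math.log(a / e)
    return total
```

Identical distributions give a PSI of 0, and the value grows as the actual distribution drifts from the expected one.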

bm.prep.binning.bins module

class bm.prep.binning.bins.FeaturesBinning(data, column, target=None, target_value=None, num_clusters=5, max_interval=10, special_attributes=None, tree_params=None)

Bases: object

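Feature binning utilities.

Variables
  • data – DataFrame: input data

  • column – str: feature column

  • target – str: name of the target column

  • target_value – any: value of the target column

  • num_clusters – int: number of clusters

  • max_interval – int: maximum number of binning intervals, default 10

  • special_attributes – str: special attributes, default None

  • tree_params – dict: tree parameters, default None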

Examples

feature_bin = FeaturesBinning(data, column, target, max_interval=10, special_attributes=None, tree_params=None)
results = feature_bin.chi_square_binning()

best_ks_binning()

Best-KS binning.

chi_square_binning()

Chi-square binning.

decision_tree_binning()

Decision tree binning.

distance_binning()

Equal-width binning.
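Equal-width binning divides the value range into intervals of the same width. A minimal sketch of computing the bin edges, with the hypothetical name `equal_width_edges` and plain lists instead of the class's DataFrame input:

```python
def equal_width_edges(values, n_bins):
    """Return n_bins + 1 edges spanning [min(values), max(values)] evenly."""
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("all values are identical; no range to split")
    width = (hi - lo) / n_bins
    # Use hi itself as the last edge to avoid floating-point drift.
    return [lo + i * width for i in range(n_bins)] + [hi]
```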

interpolate_binning()

Interpolation binning.

kmeans_binning()

K-means clustering binning.

ks_init(datas)

mixed_binning()

Mixed binning.

quantile_binning()

Equal-frequency binning.
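Equal-frequency binning places cut points at quantiles so that each bin holds roughly the same number of rows. The sketch below uses the standard library's `statistics.quantiles`; the name `equal_frequency_cuts` is hypothetical, and dropping duplicate cut points is a choice made here, not documented behavior.

```python
import statistics


def equal_frequency_cuts(values, n_bins):
    """Return up to n_bins - 1 interior cut points taken at evenly spaced quantiles."""
    cuts = statistics.quantiles(values, n=n_bins, method="inclusive")
    # Heavily tied data can yield repeated quantiles; keep each cut point once.
    return sorted(set(cuts))
```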

bm.prep.binning.chi_merge module

bm.prep.binning.chi_merge.assign_bin(x, cutOffPoints, special_attributes)

Maps a value to its bin.

Parameters
  • x (float) –

  • cutOffPoints (list) –

  • special_attributes (list) –

Returns

str
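One way such a mapping could work is sketched below. Everything specific here is an assumption: the name `assign_bin_sketch`, the `"Bin N"` label format, the `"Bin special"` label, and right-closed intervals are illustrative and may differ from what `assign_bin` returns.

```python
from bisect import bisect_left


def assign_bin_sketch(x, cut_off_points, special_attributes=()):
    """Return a string label for the bin containing x.

    cut_off_points must be sorted; bin i covers (cut[i-1], cut[i]].
    """
    if x in special_attributes:
        # Special values get their own label instead of a numeric bin.
        return "Bin special"
    return f"Bin {bisect_left(cut_off_points, x)}"
```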

bm.prep.binning.chi_merge.assign_group(x, bins)

Maps a value to its interval.

Parameters
  • x (float or int) –

  • bins (list or tuple) – the cut bins

Returns

the value of

Return type

float or int

bm.prep.binning.chi_merge.bad_rate_merge(df, col, cutOffPoints, target, special_attributes=None)

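Merges the cut points of bins whose bad rate is 0 or 1.

Parameters
  • df (pd.DataFrame) – DataFrame containing the total sample count and the bad sample count

  • col (str) – column for which the good/bad sample proportions are computed

  • cutOffPoints (list) – cut points

  • target (str) – target variable

  • special_attributes (list) – special values to exclude

Returns

Cut points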

Return type

list

bm.prep.binning.chi_merge.bad_rate_monotone(df, sortByVar, target, special_attributes=None)

Checks whether the bad rate of a variable is monotonic.

Parameters
  • df (pd.DataFrame) –

  • sortByVar (str) –

  • target (str) – the target of dataframe

  • special_attributes (list or tuple) –

Returns

bool

bm.prep.binning.chi_merge.bin_bad_rate(df, col, target, grantRateIndicator=0)

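Computes the share of bad samples for each value in col.

Parameters
  • df (dataframe) – dataset for which the bad rate is computed

  • col (list) – feature for which the bad rate is computed

  • target – good/bad label

  • grantRateIndicator (int) – 1 returns the overall bad rate, 0 does not

Returns

Bad rate of each bin, as a dict; dataframe: bad rate of each bin, as a DataFrame; float: overall bad rate (when grantRateIndicator == 1)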

Return type

dict
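The per-value bad rate can be sketched without pandas as follows. The name `bad_rate_by_value`, the `(value, label)` row layout, and the `overall` flag (standing in for `grantRateIndicator`) are hypothetical; the documented function additionally returns a DataFrame.

```python
from collections import defaultdict


def bad_rate_by_value(rows, overall=False):
    """rows: iterable of (value, label) pairs, label 1 = bad, 0 = good.

    Returns {value: bad_rate}, plus the overall bad rate when overall=True.
    """
    total = defaultdict(int)
    bad = defaultdict(int)
    for value, label in rows:
        total[value] += 1
        bad[value] += label
    rates = {value: bad[value] / total[value] for value in total}
    if overall:
        return rates, sum(bad.values()) / sum(total.values())
    return rates
```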

bm.prep.binning.chi_merge.cal_chi2(df, total_col, bad_col)

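Computes the chi-square value.

Parameters
  • df (pd.DataFrame) – DataFrame containing the total sample count and the bad sample count

  • total_col (str) – number of all samples

  • bad_col (str) – number of bad samples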

Returns

the chi2 value

Return type

float
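As a worked illustration, the Pearson chi-square statistic can be computed from per-bin total and bad counts as below. This is a sketch of the textbook formula, assuming both the bad and the good cells contribute; the name `chi2_from_counts` and the list-of-tuples input are hypothetical, and `cal_chi2` may differ in detail.

```python
def chi2_from_counts(bins):
    """Chi-square statistic for a list of (total, bad) counts, one pair per bin."""
    all_total = sum(total for total, _ in bins)
    all_bad = sum(bad for _, bad in bins)
    bad_rate = all_bad / all_total
    if bad_rate in (0, 1):
        return 0.0  # only one class present, so the bins cannot differ

    chi2 = 0.0
    for total, bad in bins:
        if total == 0:
            continue  # an empty bin contributes nothing
        # Expected counts if this bin shared the overall bad rate.
        expected_bad = total * bad_rate
        expected_good = total - expected_bad
        good = total - bad
        chi2 += (bad - expected_bad) ** 2 / expected_bad
        chi2 += (good - expected_good) ** 2 / expected_good
    return chi2
```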

bm.prep.binning.chi_merge.cal_chi_merge(df, col, target, max_interval, min_bin_pct=0.01, special_attributes=None)

Computes the cut points for chi-square binning.

Parameters
  • df (pd.DataFrame) –

  • col (str) – the variable of dataframe

  • target (str) – the target of dataframe

  • max_interval (int) – the max interval of variable

  • min_bin_pct (float) – the min percent of bin

  • special_attributes (list or tuple) – remove values of Series

Returns

the cutPoints of variable

Return type

list
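The general ChiMerge procedure can be sketched as: start with one bin per distinct value, then repeatedly merge the adjacent pair with the smallest chi-square until at most `max_interval` bins remain. The code below is a simplified illustration with hypothetical names (`pair_chi2`, `chi_merge`); it omits `min_bin_pct` and `special_attributes` handling and is not the library's implementation.

```python
def pair_chi2(a, b):
    """Chi-square statistic for two adjacent bins, each given as (total, bad)."""
    total = a[0] + b[0]
    bad = a[1] + b[1]
    good = total - bad
    if bad == 0 or good == 0:
        return 0.0  # a single class cannot separate the two bins
    chi2 = 0.0
    for n, n_bad in (a, b):
        expected_bad = n * bad / total
        expected_good = n * good / total
        chi2 += (n_bad - expected_bad) ** 2 / expected_bad
        chi2 += ((n - n_bad) - expected_good) ** 2 / expected_good
    return chi2


def chi_merge(values, labels, max_interval):
    """Return interior cut points (upper bounds of all bins except the last)."""
    stats = {}
    for value, label in zip(values, labels):
        total, bad = stats.get(value, (0, 0))
        stats[value] = (total + 1, bad + label)
    # One bin per distinct value: [upper_bound, total, bad].
    bins = [[value, total, bad] for value, (total, bad) in sorted(stats.items())]

    while len(bins) > max_interval:
        scores = [pair_chi2(bins[i][1:], bins[i + 1][1:]) for i in range(len(bins) - 1)]
        i = scores.index(min(scores))  # most similar adjacent pair
        upper, total, bad = bins.pop(i + 1)
        bins[i] = [upper, bins[i][1] + total, bins[i][2] + bad]
    return [b[0] for b in bins[:-1]]
```

With six values whose labels flip from good to bad after 3, merging down to two bins leaves the single cut point 3.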

bm.prep.binning.chi_merge.cutpoint_brm(data, var, label, cp, special_attributes=None)

Checks whether the bad-sample distribution across bins is monotonic; if it is not, adjacent bins must be merged.

Parameters
  • data (pd.DataFrame) –

  • var (str) – the variable of dataframe

  • label (str) – the label of dataframe

  • cp (list) – the cut bins of variable

  • special_attributes (list or tuple) –

Returns

list

bm.prep.binning.chi_merge.feature_monotone(x)

Finds the indices of the values that break monotonicity.

Parameters

x (list) – the cut of bins

Returns

dict

bm.prep.binning.chi_merge.monotone_merge(df, target, col)

bm.prep.binning.chi_merge.split_data(df, col, num_split, special_attributes=None)

Converts fine-grained values into coarse-grained groups.

Parameters
  • df (dataframe) – dataset sorted by col

  • col (list) – variable to bin

  • num_split (int) – number of groups to split into

  • special_attributes (list) – special values to exclude when splitting the dataset

Returns

Initial cut points for binning

Return type

list
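One plausible way to derive such initial cut points is to take values at evenly spaced positions of the sorted data, as sketched below. The name `initial_split_points` and the exact index arithmetic are assumptions for illustration; the documented function may choose its positions differently.

```python
def initial_split_points(values, num_split, special_attributes=()):
    """Return sorted, de-duplicated cut points taken at evenly spaced ranks."""
    kept = sorted(v for v in values if v not in special_attributes)
    if not kept:
        return []
    group_size = len(kept) // num_split
    # One candidate at the start of every group after the first.
    indices = [i * group_size for i in range(1, num_split)]
    return sorted({kept[i] for i in indices})
```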