bm.eda package

Data exploration module

bm.eda.data_detection module

class bm.eda.data_detection.DetectDF(df)

Bases: object

This class implements data profiling.

Variables: df (DataFrame) – the raw dataset as a pandas DataFrame

Examples

base_dir = os.getcwd()
credit_data = pd.read_csv(os.path.join(base_dir, r'data/germancredit.csv'))
dec = data_detection.DetectDF(credit_data)

detect(special_value_dict=None, savefolder=None)

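Profile a data table.

For categorical features, the following are computed:

Data type
Row count
Share of missing values
Number of unique values
Five most frequent values, with their shares (top1 to top5)
Five least frequent values, with their shares (bottom5 to bottom1)

For numeric features, the following are computed:

Data type
Row count
Share of missing values
Number of unique values
Mean
Standard deviation
Minimum
1%, 10%, 50%, 75%, 90%, and 99% quantiles
Maximum

Sample output: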

	type	size	missing	unique	mean_or_top1	std_or_top2	min_or_top3	1%_or_top4	10%_or_top5	50%_or_bottom5	75%_or_bottom4	90%_or_bottoms3	99%_or_bottom2	max_or_bottom1
purpose	object	1000	0.00%	10	radio/television:28.00%	car (new):23.40%	furniture/equipment:18.10%	car (used):10.30%	business:9.70%	education:5.00%	repairs:2.20%	domestic appliances:1.20%	others:1.20%	retraining:0.90%
duration.in.month	int64	1000	0.00%	33	20.903	12.0588144527563	4	6	9	18	24	36	60	72
…	…	…	…	…	…	…	…	…	…	…	…	…	…	…
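The statistics listed above can be approximated with plain pandas. The following is a minimal sketch of that idea, not the code behind `detect`; the helper name `profile_column`, the output keys, and the toy data are assumptions made for illustration.

```python
import pandas as pd


def profile_column(s: pd.Series) -> dict:
    """Summarize one column (hypothetical helper, not part of bm.eda)."""
    out = {
        "type": str(s.dtype),
        "size": len(s),
        "missing": s.isna().mean(),   # share of missing values
        "unique": s.nunique(),        # distinct non-missing values
    }
    if pd.api.types.is_numeric_dtype(s):
        out["mean"] = s.mean()
        out["std"] = s.std()
        out["min"] = s.min()
        for pct in (1, 10, 50, 75, 90, 99):
            out[f"{pct}%"] = s.quantile(pct / 100)
        out["max"] = s.max()
    else:
        # Shares of each category, most frequent first.
        shares = s.value_counts(normalize=True)
        out["top"] = shares.head(5).to_dict()
        out["bottom"] = shares.tail(5).to_dict()
    return out


# Toy data, invented for this example.
df = pd.DataFrame({
    "purpose": ["car", "car", "radio", None],
    "duration": [6, 12, 24, 48],
})
summary = {col: profile_column(df[col]) for col in df.columns}
```

Branching on `is_numeric_dtype` mirrors the two lists above: numeric columns get moments and quantiles, while other columns get frequency shares.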

Parameters

special_value_dict (dict) – dictionary of special values
savefolder (str) – path where the result file is saved
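The value `{-999: np.nan}` used in the example for this method suggests that the dictionary maps each sentinel value to its replacement. Assuming that interpretation, which this reference does not confirm, the effect would resemble `DataFrame.replace`. The column names and data below are invented.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, -999, 40], "income": [-999.0, 3000.0, 5200.0]})

# Assumed semantics: each key is a sentinel, each value is its replacement.
special_value_dict = {-999: np.nan}
cleaned = df.replace(special_value_dict)

# After replacement the sentinels count as missing values.
missing_share = cleaned.isna().mean()
```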

Returns

pd.DataFrame

Examples

base_dir = os.getcwd()
credit_data = pd.read_csv(os.path.join(base_dir, r'data/germancredit.csv'))
dec = data_detection.DetectDF(credit_data)
df_des = dec.detect(special_value_dict={-999:np.nan},
            savefolder=os.path.join(base_dir, "result"))

bm.eda.plotting module

bm.eda.plotting.plot_bin_woe(binx, savefolder, title=None, display_iv=False, figsize=(8, 6))

Plot the WOE (weight of evidence) of each bin.

Parameters

binx (pd.DataFrame) – binning details
title (str) – figure title
display_iv (bool) – whether to display the IV value
figsize (tuple) – figure size

Returns

None
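For context, WOE and IV (information value) are usually derived per bin from the counts of good and bad samples. The sketch below uses one common sign convention; the column names `good` and `bad` and the bin labels are assumptions, since the layout of `binx` is not specified here.

```python
import numpy as np
import pandas as pd

# Hypothetical per-bin counts; the real `binx` columns may differ.
bins = pd.DataFrame(
    {"good": [400, 200, 100], "bad": [20, 30, 50]},
    index=["<12", "12-24", ">=24"],
)

good_dist = bins["good"] / bins["good"].sum()
bad_dist = bins["bad"] / bins["bad"].sum()

# One common convention: WOE = ln(share of goods / share of bads).
# Some libraries use the inverse ratio, which flips the sign.
bins["woe"] = np.log(good_dist / bad_dist)

# Each bin's IV contribution is non-negative under either convention.
bins["bin_iv"] = (good_dist - bad_dist) * bins["woe"]
total_iv = bins["bin_iv"].sum()
```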

bm.eda.plotting.plot_category_countplot(df, feats, savefolder, label='flag', sub_col=2, figsize=(12, 8))

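Plot the distribution of categorical features.

Parameters

df (DataFrame) – dataset
feats (list or ndarray) – list of feature names
label (str) – name of the label column in df
sub_col (int) – number of subplot rows in the figure; the default is 2
figsize (tuple) – figure size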

Returns

None
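The counts behind such a plot can also be inspected directly. A minimal sketch with `pandas.crosstab` follows, using invented data and assuming that `label` names a binary target column.

```python
import pandas as pd

# Toy data; "housing" stands in for any categorical feature.
df = pd.DataFrame({
    "housing": ["own", "rent", "own", "free", "rent", "own"],
    "flag": [0, 1, 0, 0, 1, 1],
})

# Rows are categories, columns are label values, cells are sample counts.
counts = pd.crosstab(df["housing"], df["flag"])
```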

bm.eda.plotting.plot_corr(df, feats, savefolder, figsize=(12, 12), mask=False)

Plot the correlation matrix.

Parameters

df (DataFrame) – dataset
feats (list or ndarray) – list of feature names
figsize (tuple) – figure size
mask (bool) – whether to mask the redundant (mirrored) half of the matrix

Returns

None
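A correlation matrix is symmetric, so one triangle repeats the other. The sketch below shows one way to blank the redundant half with NumPy. It is an assumption about what `mask=True` achieves, not a description of the library's code, and the data is random.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(50, 3)), columns=["a", "b", "c"])
feats = ["a", "b", "c"]

corr = df[feats].corr()

# True marks the upper triangle (diagonal included) that would be hidden.
hide = np.triu(np.ones(corr.shape, dtype=bool))

# Hidden cells become NaN; the lower triangle keeps its coefficients.
lower_only = corr.mask(hide)
```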

bm.eda.plotting.plot_feature_boxplot(df, feats, savefolder, sub_col=2, figsize=(40, 12))

Plot box plots.

Parameters

df (DataFrame) – dataset
feats (list or ndarray) – list of feature names
figsize (tuple) – figure size
sub_col (int) – number of subplot rows in the figure; the default is 2

Returns

None
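A box plot summarizes a feature by its quartiles and flags points beyond the whiskers. The sketch below computes those quantities with the conventional 1.5 × IQR rule; whether this function uses that rule is an assumption, and the values are invented.

```python
import pandas as pd

s = pd.Series([4, 6, 9, 18, 24, 36, 60, 72, 200], name="duration")

# Quartiles define the box; the median is the line inside it.
q1, median, q3 = s.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Conventional whisker limits: 1.5 * IQR beyond the box.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = s[(s < lower_fence) | (s > upper_fence)]
```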

bm.eda.plotting.plot_feature_distribution(df, feats, savefolder, label='flag', sub_col=5, figsize=(35, 20))

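Plot feature distributions.

Parameters

df (DataFrame) – dataset
feats (list or ndarray) – list of feature names
label (str) – name of the label column in df
sub_col (int) – number of subplot rows in the figure; the default is 5
figsize (tuple) – figure size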

Returns

None