Principle and implementation of chi-square binning (ChiMerge algorithm)
1. Chi-square distribution
Definition of chi-square distribution:
If k independent random variables Z_1, Z_2, ..., Z_k each follow the standard normal distribution N(0, 1), then the sum of the squares of these k random variables,
$$X = \sum_{i=1}^{k} Z_i^2$$
follows the chi-square distribution with k degrees of freedom, written as:
$$X \sim \chi^2(k) \quad \text{or} \quad X \sim \chi^2_k$$
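The definition can be checked numerically. The following is a minimal sketch, assuming NumPy and SciPy are available; the seed, the choice k = 4, and the sample size are arbitrary illustrative values. It sums k squared standard normal draws and compares the sample mean and variance with the theoretical values k and 2k reported by `scipy.stats.chi2`.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)  # fixed seed, arbitrary choice
k = 4                           # degrees of freedom (illustrative)
n_samples = 200_000             # number of simulated values of X

# Each row holds k independent N(0, 1) draws; X is the sum of their squares.
z = rng.standard_normal(size=(n_samples, k))
x = (z ** 2).sum(axis=1)

sample_mean = x.mean()
sample_var = x.var()

# Theoretical mean and variance of chi-square(k): k and 2k.
theory_mean, theory_var = chi2.stats(k, moments="mv")

print("mean:", sample_mean, "vs", float(theory_mean))
print("var: ", sample_var, "vs", float(theory_var))
```

The simulated moments should land close to the theoretical ones, with small differences due to sampling noise.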
2. Chi-square test
The chi-square test is a hypothesis test based on the chi-square distribution. It is mainly used to test the independence of categorical variables.
Fundamental idea: infer from the sample data whether the population distribution differs significantly from the expected distribution, or whether two categorical variables are associated or independent.
The null hypothesis is usually that there is no difference between the observed and expected frequencies, or that the two variables are independent of each other.
The chi-square value measures the degree of deviation between the observed values and the theoretical values.
The formula for calculating the chi-square value is:
$$\chi^2_k = \sum \frac{(A - E)^2}{E}$$
A is the actual frequency and E is the expected frequency.
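As a worked example of the formula, the sketch below computes the chi-square value of a 2 x 2 table by hand and cross-checks it against `scipy.stats.chi2_contingency` with the continuity correction turned off. The counts are made up for illustration, and the expected frequencies are taken to be row total times column total divided by the grand total, which is the usual choice under the independence hypothesis.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Made-up observed frequencies A (rows: groups, columns: classes).
observed = np.array([[20.0, 30.0],
                     [30.0, 20.0]])

row_sum = observed.sum(axis=1, keepdims=True)  # shape (2, 1)
col_sum = observed.sum(axis=0, keepdims=True)  # shape (1, 2)
total = observed.sum()

# Expected frequencies E under independence.
expected = row_sum * col_sum / total

# Chi-square value: sum over all cells of (A - E)^2 / E.
chi_sq = ((observed - expected) ** 2 / expected).sum()

# Cross-check with SciPy (correction=False disables the Yates correction).
stat, p, dof, scipy_expected = chi2_contingency(observed, correction=False)

print("manual:", chi_sq, "scipy:", stat, "dof:", dof)
```

Here every expected cell is 25 and every deviation is 5, so each cell contributes 25 / 25 = 1 and the total is 4.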
The chi-square value carries the following two pieces of information:
1. The absolute deviation between the actual and theoretical values
2. The size of that deviation relative to the theoretical value
Given the chi-square statistic and the degrees of freedom, the chi-square distribution gives the probability p of obtaining the current statistic, or a more extreme one, when the null hypothesis is true. If p is small, the observed values deviate too much from the theoretical values to be explained by chance, and the null hypothesis is rejected.
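The p value can be obtained from the survival function of the chi-square distribution. The sketch below assumes SciPy; the statistic 4.0, the single degree of freedom, and the 0.05 significance level are illustrative values, not results from any data set in this article.

```python
from scipy.stats import chi2

chi_sq = 4.0   # illustrative chi-square statistic (assumed, not from real data)
dof = 1        # degrees of freedom of the test (assumed)
alpha = 0.05   # conventional significance level (assumed)

# sf(x) = P(X >= x): probability of a statistic at least this extreme
# when the null hypothesis is true.
p_value = chi2.sf(chi_sq, df=dof)
reject = p_value < alpha

print(f"p = {p_value:.4f}, reject null hypothesis: {reject}")
```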
Chi-square distribution table
The table lists p along the horizontal axis and the degrees of freedom along the vertical axis.
Degrees of freedom: k = (rows - 1) * (columns - 1)
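A few entries of such a table can be regenerated with `scipy.stats.chi2.isf`, the inverse survival function. This is only a sketch: the p levels and degrees of freedom shown are a small arbitrary selection, and `degrees_of_freedom` is a hypothetical helper name introduced here for the formula above.

```python
import numpy as np
from scipy.stats import chi2

p_levels = np.array([0.10, 0.05, 0.01])  # columns of the table (illustrative)
dofs = np.array([1, 2, 3])               # rows of the table (illustrative)

# Broadcasting yields one critical value per (degrees of freedom, p) pair.
table = chi2.isf(p_levels[np.newaxis, :], dofs[:, np.newaxis])

for k, row in zip(dofs, table):
    print(f"k={k}:", np.round(row, 3))


def degrees_of_freedom(n_rows, n_cols):
    """Hypothetical helper: k = (rows - 1) * (columns - 1)."""
    return (n_rows - 1) * (n_cols - 1)


# Two adjacent groups with a binary class label form a 2 x 2 table.
print(degrees_of_freedom(2, 2))
```

For two adjacent groups and c classes the table is 2 x c, so k = c - 1.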
3. Chi-square binning algorithm
The algorithm consists of two main phases: an initialization phase and a bottom-up merge phase.
1. Initialization phase: sort the data by attribute value (a non-numeric feature must first be converted to a number, for example its bad debt ratio, and then sorted), then place each distinct attribute value in its own group.
2. Merge phase:
(1) Calculate the chi-square value of each pair of adjacent groups.
(2) Merge the pair of adjacent groups with the smallest chi-square value into one group.
(3) Repeat (1) and (2) until the smallest chi-square value is no lower than the preset threshold, or the number of groups has dropped to the specified maximum.
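A single pass of the merge phase can be sketched on a toy table. The counts below are made up, and `pair_chi2` is a hypothetical helper written only for this illustration; it is not the implementation used later in this article.

```python
import numpy as np


def pair_chi2(pair):
    """Chi-square value of a 2 x c table of class counts (hypothetical helper)."""
    pair = np.asarray(pair, dtype=float)
    expected = np.outer(pair.sum(axis=1), pair.sum(axis=0)) / pair.sum()
    mask = expected > 0  # cells with zero expected frequency contribute nothing
    return (((pair - expected) ** 2)[mask] / expected[mask]).sum()


# Rows: sorted attribute values, one group each. Columns: class counts.
counts = np.array([[10, 0],
                   [9, 1],
                   [2, 8],
                   [0, 10]])

# Step (1): chi-square value of every pair of adjacent groups.
chis = [pair_chi2(counts[i:i + 2]) for i in range(len(counts) - 1)]

# Step (2): merge the adjacent pair with the smallest chi-square value.
idx = int(np.argmin(chis))
merged = np.vstack([counts[:idx],
                    counts[idx] + counts[idx + 1],
                    counts[idx + 2:]])

print("chi-square per pair:", np.round(chis, 3))
print("merged pair index:", idx)
print(merged)
```

The first two groups have nearly identical class distributions, so their chi-square value is the smallest and they are merged first.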
    import numpy as np
    import pandas as pd
    from scipy.stats import chi2


    def chi_value(data):
        """Calculate the chi-square value of a 2-D contingency table."""
        assert isinstance(data, np.ndarray) and data.ndim == 2
        # Row totals, column totals and grand total
        col_sum = data.sum(axis=0)
        row_sum = data.sum(axis=1)
        total = data.sum()
        # Expected frequency: E[i, j] = row_sum[i] * col_sum[j] / total
        e = np.outer(row_sum, col_sum) / total
        # Cells with zero expected frequency contribute nothing
        with np.errstate(divide="ignore", invalid="ignore"):
            square = (data - e) ** 2 / e
        square[e == 0] = 0
        return square.sum()


    def chi_merge(df, col, target, max_group=None, threshold=None):
        """
        ChiMerge chi-square binning.

        df: DataFrame
        col: name of the variable to be binned
        target: name of the class label column
        max_group: maximum number of groups
        threshold: chi-square threshold
        returns:
            cutoffs: array holding the lower bound of each resulting bin
        """
        assert isinstance(df, pd.DataFrame)
        cross_tab = pd.crosstab(df[col], df[target])
        # Copy so that in-place merging never writes into the crosstab itself
        cross_stat = cross_tab.values.copy()
        # Bins are left-closed and right-open: cutoffs = [1, 2, 3] denotes
        # the intervals [1, 2), [2, 3) and [3, +inf).
        cutoffs = cross_tab.index.values
        print("sorted cutoffs list:", cutoffs, "len cutoffs", len(cutoffs))

        # If neither stopping criterion is given, use the critical value at
        # p = 0.05 with (number of classes - 1) degrees of freedom.
        if max_group is None and threshold is None:
            cls_num = cross_stat.shape[-1]
            threshold = chi2.isf(0.05, df=cls_num - 1)

        # A single remaining group has no neighbour to merge with
        while len(cross_stat) > 1:
            min_value = None
            minidx = None
            # Calculate chi-square for each pair of adjacent groups
            for i in range(len(cross_stat) - 1):
                chi_val = chi_value(cross_stat[i: i + 2])
                print(chi_val)
                if min_value is None or min_value > chi_val:
                    min_value = chi_val
                    minidx = i

            too_many_groups = max_group is not None and max_group < len(cross_stat)
            below_threshold = threshold is not None and min_value < threshold
            if not (too_many_groups or below_threshold):
                break

            # Merge the pair with the smallest chi-square value
            cross_stat[minidx] = cross_stat[minidx] + cross_stat[minidx + 1]
            # Delete along axis 0 (rows); the original passed minidx as the axis
            cross_stat = np.delete(cross_stat, minidx + 1, axis=0)
            cutoffs = np.delete(cutoffs, minidx + 1, axis=0)
            print("---", cutoffs)
        return cutoffs


    if __name__ == "__main__":
        df = pd.read_csv(r"D:\PycharmProjects\test\toadtest\germancredit.csv")
        df.columns = [i.lower() for i in df.columns]
        print(chi_merge(df, target="creditability",
                        col="duration of credit (month)",
                        max_group=5, threshold=None))
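Once the cutoffs are known, they can be used to assign each raw value to a bin. The sketch below assumes the cutoffs are the lower bounds of left-closed, right-open intervals, as in the implementation's comment. The cutoff and sample values are hypothetical, and `np.digitize` is one possible way to do the lookup rather than part of the original code.

```python
import numpy as np

# Hypothetical cutoffs as chi_merge might return them (lower bound of each bin):
# [6, 12), [12, 24), [24, +inf)
cutoffs = np.array([6, 12, 24])

# Hypothetical raw values of the binned variable.
values = np.array([6, 11, 12, 30, 24])

# np.digitize returns i such that cutoffs[i-1] <= x < cutoffs[i];
# subtracting 1 gives a 0-based bin index. Values below the first
# cutoff would get index -1.
bin_index = np.digitize(values, cutoffs) - 1

for v, b in zip(values, bin_index):
    print(f"value {v} -> bin {b}")
```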