Chi-square binning (ChiMerge algorithm)

Keywords: Attribute

Principle and implementation of chi-square binning (ChiMerge algorithm)

1. Chi-square distribution

Definition of chi-square distribution:

If k independent random variables Z_1, Z_2, ..., Z_k each follow the standard normal distribution N(0, 1), then the sum of the squares of these k random variables,
$$X = \sum_{i=1}^{k} Z_i^2$$
follows the chi-square distribution with k degrees of freedom, written
$$X \sim \chi^2(k) \quad \text{or} \quad X \sim \chi^2_k$$
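
As a quick numerical check of this definition, the sketch below repeatedly draws k standard normal values, sums their squares, and compares the empirical mean and variance with the known values k and 2k. The seed, the sample size, and k = 3 are arbitrary choices made for illustration.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)   # fixed seed, chosen arbitrarily
k = 3                            # degrees of freedom (illustrative)
n_samples = 200_000

# Each row holds k independent N(0, 1) draws; X is the sum of their squares.
z = rng.standard_normal(size=(n_samples, k))
x = (z ** 2).sum(axis=1)

emp_mean, emp_var = x.mean(), x.var()
theo_mean, theo_var = chi2.stats(k, moments="mv")

print("empirical mean/var  :", emp_mean, emp_var)
print("theoretical mean/var:", float(theo_mean), float(theo_var))
```

The empirical values should land close to 3 and 6, up to sampling noise.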

2. Chi-square test

The chi-square test is a hypothesis test based on the chi-square distribution. It is mainly used to test the independence of categorical variables.

Fundamental idea: infer from sample data whether the population distribution differs significantly from the expected distribution, or whether two categorical variables are associated or independent.

The null hypothesis is generally that there is no difference between the observed and expected frequencies, or that the two variables are independent of each other.

Chi-square value: a measure of how far the observed values deviate from the theoretical (expected) values.

The formula for calculating the chi-square value is:
$$\chi^2 = \sum \frac{(A - E)^2}{E}$$
A is the actual frequency and E is the expected frequency.
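
To make the formula concrete, here is a small worked example on a made-up 2 x 2 contingency table (the counts are invented for illustration). Expected frequencies are computed under independence as row total times column total divided by the grand total, and the result is cross-checked against `scipy.stats.chi2_contingency` with the continuity correction turned off.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical observed counts A: rows = two groups, columns = two classes.
A = np.array([[30, 10],
              [20, 40]], dtype=float)

row_sum = A.sum(axis=1)
col_sum = A.sum(axis=0)
total = A.sum()

# Expected counts E under independence.
E = np.outer(row_sum, col_sum) / total

chi_sq = ((A - E) ** 2 / E).sum()

# Cross-check with SciPy (correction=False disables the Yates correction).
ref_stat, ref_p, ref_dof, ref_E = chi2_contingency(A, correction=False)

print("expected counts:\n", E)
print("chi-square by hand:", chi_sq)
print("chi-square (SciPy):", ref_stat)
```

Here E is [[20, 20], [30, 30]], so the sum is 100/20 + 100/20 + 100/30 + 100/30, which is about 16.67.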

The chi-square value captures two pieces of information:

1. The absolute deviation between the actual and theoretical values

2. The size of that deviation relative to the theoretical value

Given the chi-square statistic and the degrees of freedom, the chi-square distribution gives the probability p of obtaining the current statistic, or a more extreme one, when the null hypothesis is true. If p is small, the observed values deviate from the theoretical values by more than chance alone would explain, and the null hypothesis is rejected.
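
A minimal sketch of this step, assuming SciPy is available: `chi2.sf` returns the upper-tail probability, that is, the chance of seeing a statistic at least as large as the observed one when the null hypothesis holds. The statistic 16.67, the single degree of freedom, and the 0.05 significance level are illustrative values, not results from any data in this article.

```python
from scipy.stats import chi2

chi_sq = 16.67   # illustrative test statistic
dof = 1          # illustrative degrees of freedom
alpha = 0.05     # conventional significance level (an assumption)

# Upper-tail probability P(X >= chi_sq) for X ~ chi-square(dof).
p_value = chi2.sf(chi_sq, dof)
reject_null = p_value < alpha

print("p-value:", p_value)
print("reject null hypothesis:", reject_null)
```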

Chi-square distribution table

The table lists p along the horizontal axis and the degrees of freedom along the vertical axis.

Degrees of freedom: k = (rows - 1) * (columns - 1)
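
When no printed table is at hand, the same critical values can be computed. The sketch below assumes SciPy and uses `chi2.isf` (the inverse survival function) to rebuild a few table entries. The helper `degrees_of_freedom` and the chosen p levels are introduced here only for illustration.

```python
from scipy.stats import chi2

def degrees_of_freedom(n_rows, n_cols):
    """Degrees of freedom of an n_rows x n_cols contingency table."""
    return (n_rows - 1) * (n_cols - 1)

p_levels = [0.10, 0.05, 0.01]

# One printed row per degree of freedom; one column per upper-tail probability p.
for k in range(1, 4):
    row = [round(float(chi2.isf(p, k)), 3) for p in p_levels]
    print(k, row)

# A 2 x 2 table (two adjacent groups, two classes) has 1 degree of freedom.
critical_2x2 = chi2.isf(0.05, degrees_of_freedom(2, 2))
print("critical value at p = 0.05, k = 1:", critical_2x2)
```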


3. Chi-square binning algorithm

The algorithm consists of two main phases: an initialization phase and a bottom-up merge phase.
 1. Initialization phase: sort the data by attribute value (a non-continuous feature must first be converted to a numeric value, such as the bad-debt rate of each category, and then sorted), then place each distinct attribute value in its own group.
 2. Merge phase:  
	(1) Calculate the chi-square value for each pair of adjacent groups  
	(2) Merge the adjacent pair with the smallest chi-square value into one group  
	(3) Repeat (1) and (2) until the smallest chi-square value is no lower than the preset threshold, or the number of groups has dropped to the specified maximum
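
To illustrate a single pass of the merge phase, the sketch below uses a made-up table of class counts per attribute value (all numbers are hypothetical). It computes the chi-square value for every adjacent pair and merges the pair with the smallest value. The helper name `pair_chi2` is introduced here for illustration only.

```python
import numpy as np

def pair_chi2(block):
    """Chi-square value of a small contingency table (rows = groups)."""
    block = np.asarray(block, dtype=float)
    expected = np.outer(block.sum(axis=1), block.sum(axis=0)) / block.sum()
    diff_sq = (block - expected) ** 2
    # Cells with zero expected count contribute nothing.
    terms = np.divide(diff_sq, expected,
                      out=np.zeros_like(diff_sq), where=expected > 0)
    return terms.sum()

# Hypothetical counts: one row per sorted attribute value,
# columns = (class 0 count, class 1 count).
values = np.array([1, 2, 3, 4])
counts = np.array([[8, 2],
                   [7, 3],
                   [2, 8],
                   [1, 9]])

# Step (1): chi-square for each pair of adjacent groups.
chis = [pair_chi2(counts[i:i + 2]) for i in range(len(counts) - 1)]

# Step (2): merge the adjacent pair with the smallest chi-square value.
i = int(np.argmin(chis))
counts[i] += counts[i + 1]
counts = np.delete(counts, i + 1, axis=0)
values = np.delete(values, i + 1)   # keep the left edge of the merged group

print("chi-square per pair:", chis)
print("left edges after merge:", values)
print("counts after merge:\n", counts)
```

In this toy table the first two values have nearly identical class proportions, so they are the pair that gets merged.
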
import numpy as np
import pandas as pd

from scipy.stats import chi2
# Calculate chi-square values
def chi_value(data):
    """
    Calculate the chi-square value of a 2-D contingency table
    (rows = groups, columns = classes).
    """
    assert isinstance(data, np.ndarray) and data.ndim == 2
    # Row totals, column totals, and grand total
    col_sum = data.sum(axis=0)
    row_sum = data.sum(axis=1)
    total = data.sum()

    # Expected frequency under independence: row total * column total / grand total
    e = np.outer(row_sum, col_sum) / total

    # Cells whose expected frequency is 0 contribute nothing to the sum
    diff_sq = (data - e) ** 2
    square = np.divide(diff_sq, e, out=np.zeros_like(diff_sq), where=e != 0)

    return square.sum()

def chi_merge(df, col, target, max_group, threshold):
    """
    ChiMerge chi-square binning
    df: DataFrame
    col: variable to be binned
    target: class label
    max_group: maximum number of groups (or None)
    threshold: chi-square threshold (or None)
    returns:
        cutoffs: array of left bin edges

    """
    assert isinstance(df, pd.DataFrame)

    cross_tab = pd.crosstab(df[col], df[target])
    # Work on a copy so merging never writes back into cross_tab
    cross_stat = cross_tab.values.copy()

    # Bins are closed on the left and open on the right:
    # cutoffs = [1, 2, 3] means [1, 2), [2, 3), [3, +inf).
    cutoffs = cross_tab.index.values

    # With neither limit given, default to the critical value at p = 0.05
    if max_group is None and threshold is None:
        cls_num = cross_stat.shape[-1]
        threshold = chi2.isf(0.05, df=cls_num - 1)

    # At least two groups are needed to form an adjacent pair
    while len(cross_stat) > 1:
        min_value = None
        minidx = None

        # Calculate chi-square for each pair of adjacent groups and track the smallest
        for i in range(len(cross_stat) - 1):
            chi_val = chi_value(cross_stat[i: i + 2])
            if min_value is None or min_value > chi_val:
                min_value = chi_val
                minidx = i

        too_many_groups = max_group is not None and max_group < len(cross_stat)
        below_threshold = threshold is not None and min_value < threshold
        if not (too_many_groups or below_threshold):
            break

        # Merge the pair with the smallest chi-square value into the left group
        cross_stat[minidx] = cross_stat[minidx] + cross_stat[minidx + 1]
        cross_stat = np.delete(cross_stat, minidx + 1, axis=0)
        cutoffs = np.delete(cutoffs, minidx + 1, axis=0)

    return cutoffs


if __name__ == "__main__":
    df = pd.read_csv(r"D:\PycharmProjects\test\toadtest\germancredit.csv")
    df.columns = [i.lower() for i in df.columns]
    # target = "creditability" col = "duration of credit (month)"
    print(chi_merge(df, target="creditability", col='duration of credit (month)', max_group=5, threshold=None))



Posted by b2k on Sat, 18 Jan 2020 20:29:56 -0800