Two factor analysis of variance without repetition in python

Keywords: Lambda

Case study:

Analyze whether brands and regions have a significant impact on sales volume()

Make assumptions:

In order to test the influence of the two factors, we need to put forward the following assumptions for the two factors.

The assumptions for the line factors are as follows:

Since there are four levels of variable brands, namely brand 1, brand 2, brand 3 and brand 4, in order to test whether the mean value of these four levels (each level represents a whole) is equal.

Brand has no significant impact on sales volume

Incomplete equal brand has significant influence on sales volume

The assumptions put forward for the following factors are as follows:

Since there are five levels in the variable region, namely, region 1, region 2, region 3, region 4 and region 5, in order to test whether the mean values of these five levels (each level represents a population) are equal.

Region has no significant impact on sales volume

Different regions have significant influence on sales volume

 

Since variable brands (with 4 levels) and variable regions (with 5 levels), respectively, are retail, tourism, airlines and home appliance manufacturing, in order to test whether the mean values of these four levels (each level represents a whole) are equal, the following assumptions need to be proposed:

# Import related packages
import pandas as pd
import numpy as np
import scipy

# Custom function

def level_avg(data, x_name, y_name):
    df = data.groupby([x_name]).agg(['mean'])
    df = df[y_name]
    dict_ = dict(df["mean"])
    return dict_

def SST(Y):
    sst = sum(np.power(Y - np.mean(Y), 2))
    return sst

def SSA(data, x_name, y_name):
    total_avg = np.mean(data[y_name])
    df = data.groupby([x_name]).agg(['mean', 'count'])
    df = df[y_name]
    ssa = sum(df["count"]*(np.power(df["mean"] - total_avg, 2)))
    return ssa

def SSE(data, y_name):
    
    data_ = data.copy()
    total_avg = np.mean(data[y_name])
    x_var = set(list(data.columns))-set([y_name])
   
    cnt=1
    for i in x_var:
        dict_ = level_avg(data, i, y_name)
        var_name = 'v_avg_{}'.format(cnt)
        data_[var_name] = data_[i].map(lambda x: dict_[x])
        cnt += 1
   
    sse = sum(np.power(data_[y_name] - data_["v_avg_1"] - data_["v_avg_2"] + total_avg, 2))
    return sse

def two_way_anova(data, row_name, col_name, y_name, alpha=0.05):
    """Two factor ANOVA without repetition"""
    
    n = len(data)                       # Total observations
    k = len(data[row_name].unique())    # Number of horizontal row variables
    r = len(data[col_name].unique())    # Number of horizontal column variables
    
    sst = SST(data[y_name])             # Total square sum
    ssr = SSA(data, row_name, y_name)   # Sum of squares of row variables
    ssc = SSA(data, col_name, y_name)   # Sum of squares of column variables
    sse = SSE(data, y_name)             # Sum of squares of errors
    
    msr = ssr / (k-1)
    msc = ssc / (r-1)
    mse = sse / ((k-1)*(r-1))
    
    Fr = msr / mse  # Row variable statistics F
    Fc = msc / mse  # Column variable statistics F
    pfr = scipy.stats.f.sf(Fr, k-1, (k-1)*(r-1))  # P-value of row variable statistic F
    pfc = scipy.stats.f.sf(Fc, r-1, (k-1)*(r-1))  # P value of column variable statistic F
    
    Far = scipy.stats.f.isf(alpha, dfn=k-1, dfd=(k-1)*(r-1))   #Line F threshold
    Fac = scipy.stats.f.isf(alpha, dfn=r-1, dfd=(k-1)*(r-1))   #Critical value of column F
    
    r_square = (ssr+ssc) / sst      # Combined effect / total effect
    
    table = pd.DataFrame({'Difference source':[row_name, col_name, 'error', 'Total'],
                          'Sum of squares SS':[ssr, ssc, sse, sst],
                          'Freedom df':[k-1, r-1, (k-1)*(r-1), k*r-1],
                          'mean square MS':[msr, msc, mse, '_'],
                          'F value':[Fr, Fc, '_', '_'],
                          'P value':[pfr, pfc, '_', '_'],
                          'F critical value':[Far, Fac, '_', '_'],
                          'R^2':[r_square, '_', '_', '_']})
    
    return table
# Import data
df = pd.read_excel("E:\\xx Business data.xlsx", sheet_name='source_03')

# Output ANOVA results
two_way_anova(df, 'brand', 'region', 'Sales volume', alpha=0.05)

According to the above results of ANOVA, it is explained as follows:

(1) Brand: p-value = 9.45615e-05 <(or F value = 18.1078 > F critical value = 3.49029), reject the original hypothesis. It shows that competitive brands have a significant impact on sales volume.

(2) Region: P-value=0.143665 >(or F value = 2.10085 < f critical value = 3.25917), do not reject the original assumption. There is no evidence that regions have a significant impact on sales.

Published 19 original articles, won praise 5, visited 512
Private letter follow

Posted by trevHCS on Sat, 08 Feb 2020 01:35:54 -0800