# Two-factor analysis of variance without replication in Python

Keywords: Lambda

Case study:

Analyze whether brand and region have a significant impact on sales volume (α = 0.05).

State the hypotheses:

To test the influence of the two factors, we need to state a pair of hypotheses for each factor.

The hypotheses for the row factor (brand) are as follows:

The variable brand has four levels: brand 1, brand 2, brand 3 and brand 4. We test whether the mean sales volumes $\mu_1, \mu_2, \mu_3, \mu_4$ of these four levels (each level represents a population) are equal.

$H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4$ (brand has no significant impact on sales volume).

$H_1: \mu_1, \mu_2, \mu_3, \mu_4$ are not all equal (brand has a significant impact on sales volume).

The hypotheses for the column factor (region) are as follows:

The variable region has five levels: region 1, region 2, region 3, region 4 and region 5. We test whether the mean sales volumes $\mu_1, \dots, \mu_5$ of these five levels (each level represents a population) are equal.

$H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4 = \mu_5$ (region has no significant impact on sales volume).

$H_1: \mu_1, \dots, \mu_5$ are not all equal (region has a significant impact on sales volume).
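The hypotheses are tested by splitting the total variation in sales volume into a row part, a column part and an error part. As a sketch in notation that is assumed here rather than taken from the original post, let $x_{ij}$ be the sales volume of brand $i$ in region $j$, with $k$ brands and $r$ regions, $\bar{x}_{i\cdot}$ the mean of brand $i$, $\bar{x}_{\cdot j}$ the mean of region $j$, and $\bar{\bar{x}}$ the grand mean:

$$
\begin{aligned}
SST &= \sum_{i=1}^{k}\sum_{j=1}^{r} \left(x_{ij} - \bar{\bar{x}}\right)^2 \\
SSR &= r \sum_{i=1}^{k} \left(\bar{x}_{i\cdot} - \bar{\bar{x}}\right)^2 \\
SSC &= k \sum_{j=1}^{r} \left(\bar{x}_{\cdot j} - \bar{\bar{x}}\right)^2 \\
SSE &= \sum_{i=1}^{k}\sum_{j=1}^{r} \left(x_{ij} - \bar{x}_{i\cdot} - \bar{x}_{\cdot j} + \bar{\bar{x}}\right)^2 \\
SST &= SSR + SSC + SSE
\end{aligned}
$$

$$
F_R = \frac{SSR/(k-1)}{SSE/\big((k-1)(r-1)\big)}, \qquad
F_C = \frac{SSC/(r-1)}{SSE/\big((k-1)(r-1)\big)}
$$

Under $H_0$, $F_R$ follows an F distribution with $k-1$ and $(k-1)(r-1)$ degrees of freedom, and $F_C$ one with $r-1$ and $(k-1)(r-1)$ degrees of freedom.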

With the hypotheses for brand (4 levels) and region (5 levels) stated, the following code computes the two-factor ANOVA table:

```python
# Import related packages
import pandas as pd
import numpy as np
from scipy import stats

# Custom functions

def level_avg(data, x_name, y_name):
    """Map each level of x_name to the mean of y_name at that level."""
    return data.groupby(x_name)[y_name].mean().to_dict()

def SST(Y):
    """Total sum of squares of Y around its mean."""
    sst = np.sum(np.power(Y - np.mean(Y), 2))
    return sst

def SSA(data, x_name, y_name):
    """Sum of squares explained by the factor x_name."""
    total_avg = np.mean(data[y_name])
    # Select y_name before aggregating so non-numeric columns are not averaged
    df = data.groupby(x_name)[y_name].agg(['mean', 'count'])
    ssa = np.sum(df["count"] * np.power(df["mean"] - total_avg, 2))
    return ssa

def SSE(data, y_name):
    """Error sum of squares.

    Every column other than y_name is treated as a factor,
    so data must contain exactly two factor columns.
    """
    data_ = data.copy()
    total_avg = np.mean(data[y_name])
    x_var = [c for c in data.columns if c != y_name]

    cnt = 1
    for i in x_var:
        dict_ = level_avg(data, i, y_name)
        var_name = 'v_avg_{}'.format(cnt)
        # Attach the level mean of factor i to every observation
        data_[var_name] = data_[i].map(lambda x: dict_[x])
        cnt += 1

    sse = np.sum(np.power(data_[y_name] - data_["v_avg_1"] - data_["v_avg_2"] + total_avg, 2))
    return sse

def two_way_anova(data, row_name, col_name, y_name, alpha=0.05):
    """Two-factor ANOVA without replication"""

    n = len(data)                       # Total number of observations
    k = len(data[row_name].unique())    # Number of levels of the row factor
    r = len(data[col_name].unique())    # Number of levels of the column factor

    sst = SST(data[y_name])             # Total sum of squares
    ssr = SSA(data, row_name, y_name)   # Sum of squares of the row factor
    ssc = SSA(data, col_name, y_name)   # Sum of squares of the column factor
    # Pass only the two factors and the response, since SSE uses every other column
    sse = SSE(data[[row_name, col_name, y_name]], y_name)  # Error sum of squares

    msr = ssr / (k-1)
    msc = ssc / (r-1)
    mse = sse / ((k-1)*(r-1))

    Fr = msr / mse  # F statistic of the row factor
    Fc = msc / mse  # F statistic of the column factor
    pfr = stats.f.sf(Fr, k-1, (k-1)*(r-1))  # P-value of the row F statistic
    pfc = stats.f.sf(Fc, r-1, (k-1)*(r-1))  # P-value of the column F statistic

    Far = stats.f.isf(alpha, dfn=k-1, dfd=(k-1)*(r-1))   # Critical F value for the row factor
    Fac = stats.f.isf(alpha, dfn=r-1, dfd=(k-1)*(r-1))   # Critical F value for the column factor

    r_square = (ssr+ssc) / sst      # Share of total variation explained by the two factors

    table = pd.DataFrame({'Difference source':[row_name, col_name, 'error', 'Total'],
                          'Sum of squares SS':[ssr, ssc, sse, sst],
                          'Freedom df':[k-1, r-1, (k-1)*(r-1), k*r-1],
                          'mean square MS':[msr, msc, mse, '_'],
                          'F value':[Fr, Fc, '_', '_'],
                          'P value':[pfr, pfc, '_', '_'],
                          'F critical value':[Far, Fac, '_', '_'],
                          'R^2':[r_square, '_', '_', '_']})

    return table
```
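As a quick sanity check on the sums of squares, the decomposition SST = SSR + SSC + SSE can be verified on a small table using only the standard library. This is a minimal sketch: the 4 x 5 numbers below are made up for illustration and are not the article's data, and `sums_of_squares` is a hypothetical helper, not part of the code above.

```python
# Made-up 4 x 5 table: rows stand for brands, columns for regions.
# These values are illustrative only, not the article's data set.
table = [
    [36.0, 35.0, 37.0, 28.0, 32.0],
    [34.0, 36.0, 35.0, 30.0, 29.0],
    [35.0, 30.0, 31.0, 27.0, 26.0],
    [29.0, 28.0, 30.0, 25.0, 24.0],
]

def sums_of_squares(table):
    """Return (SST, SSR, SSC, SSE) for a complete table with one value per cell."""
    k = len(table)        # number of row levels
    r = len(table[0])     # number of column levels
    grand = sum(sum(row) for row in table) / (k * r)
    row_means = [sum(row) / r for row in table]
    col_means = [sum(table[i][j] for i in range(k)) / k for j in range(r)]

    sst = sum((table[i][j] - grand) ** 2 for i in range(k) for j in range(r))
    ssr = r * sum((m - grand) ** 2 for m in row_means)
    ssc = k * sum((m - grand) ** 2 for m in col_means)
    sse = sum(
        (table[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(k)
        for j in range(r)
    )
    return sst, ssr, ssc, sse

sst, ssr, ssc, sse = sums_of_squares(table)
k, r = len(table), len(table[0])
f_row = (ssr / (k - 1)) / (sse / ((k - 1) * (r - 1)))  # F statistic of the row factor
f_col = (ssc / (r - 1)) / (sse / ((k - 1) * (r - 1)))  # F statistic of the column factor
print(abs(sst - (ssr + ssc + sse)) < 1e-9)  # the three parts add up to the total
```

If the parts do not add up on your own data, the table is probably incomplete or has more than one observation per cell, in which case the no-replication model does not apply.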
```python
# Import data: df must hold the columns 'brand', 'region' and 'Sales volume'

# Output ANOVA results
two_way_anova(df, 'brand', 'region', 'Sales volume', alpha=0.05)
```

The ANOVA results are interpreted as follows:

(1) Brand: p-value = 9.45615e-05 < α = 0.05 (or F value = 18.1078 > F critical value = 3.49029), so we reject the null hypothesis. Brand has a significant impact on sales volume.

(2) Region: p-value = 0.143665 > α = 0.05 (or F value = 2.10085 < F critical value = 3.25917), so we do not reject the null hypothesis. There is no evidence that region has a significant impact on sales volume.
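The two decision criteria used above (p-value against α, or F value against the critical F value) always lead to the same conclusion. A minimal sketch of the rule, using the figures reported above; the helper `decide` is hypothetical and only illustrates the comparison:

```python
def decide(p_value, f_value, f_critical, alpha=0.05):
    """Apply the ANOVA decision rule and check that both criteria agree."""
    reject_by_p = p_value < alpha
    reject_by_f = f_value > f_critical
    if reject_by_p != reject_by_f:
        # Both tests use the same F distribution, so disagreement signals bad input
        raise ValueError("p-value and critical-value criteria disagree")
    return "reject H0" if reject_by_p else "do not reject H0"

# Figures reported in the interpretation above
print("brand: ", decide(9.45615e-05, 18.1078, 3.49029))
print("region:", decide(0.143665, 2.10085, 3.25917))
```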

Posted by trevHCS on Sat, 08 Feb 2020 01:35:54 -0800