# Comparison of G-matrix results of various software / packages

Feige's note: This article was written by my colleague Su Hui. The content is more comprehensive and the code is more complete. My last article Why is the PCA calculated by plink different from that calculated by GCTA? Is an introduction, and when this article gives Plink -- the number of PCA samples,

, you can calculate the percentage of each PCA, just as I thought. In addition, I found that GCTA is really a treasure house,

It can also calculate heritability, genetic correlation, GBLUP and other parameters. The next step is to learn!

The following is the text

The story begins with PCA. To calculate the variance ratio explained by the first three PCs, however, goose Plink does not output the eigen values of all PCs, so GCTA is used to calculate the eigen values of PCA and the first three PCs. GCTA to calculate PCA first needs to construct kinship matrix, that is, G matrix, and then use kinship matrix to calculate PCA. After calculating the PCA, I found that the PCA result calculated by GCTA was actually different from Plink. Then I wanted to know why it was different. Then I began to study the similarities and differences of algorithms and results based on various software / packages to build G matrix.

There are two methods for GCTA to build GRM (Genetic Relationship Matrix, G matrix) - - make GRM alg 0 is based on the method proposed by Yang et al. 2010, - make GRM alg 1 is based on the method proposed by Van Raden 2008, and the default is 0.

# GRM 0 Yang sum{[(xij- 2pi)*(xik- 2pi)] / [2pi(1-pi)]} (Note: This is the formula given on GCTA's official website, but it should be wrong, the formula is not complete, and there is a * 1/N behind it, and the formula in GCTA's paper is correct) # GRM 1 Van Raden sum[(xij- 2pi)(xik- 2pi)] / sum[2pi(1-pi)]

The pca and Plink results obtained by constructing GRM with Van Raden method are inconsistent, and the pca results obtained by Yang method are consistent with Plink. This is very reasonable, because although it is not mentioned in pca, the official document of Plink in make rel says that Yang's method is used. Therefore, although Plink outputs pca results by directly inputting genotype data, it should first construct G-matrix, and the method of constructing G-matrix is to use Yang's method, and then calculate pca based on G-matrix, but Plink directly helps me in this process.

```#  GCTA GRM 0
gcta --bfile plink9996loci --autosome-num 29 --make-grm --make-grm-alg 0 --out kinship0
gcta --grm kinship --pca 3  --out gcta0

0 sample1 0.0650311 0.0628848 -0.0760136
0 sample2 0.0804442 -0.0591764 0.0106865
0 sample3 -0.0243089 -0.00532782 0.0797794
0 sample4 0.0904922 0.0129799 -0.0417386
0 sample5 0.0458302 -0.0731895 0.0243814

#  GCTA GRM 1
gcta --bfile plink9996loci --autosome-num 29 --make-grm --make-grm-alg 1 --out kinship
gcta --grm kinship --pca 3  --out gcta

0 sample1 -0.0349996 0.132768 0.158889
0 sample2 -0.0876233 -0.0556155 0.0798619
0 sample3 0.055747 -0.104418 0.00153495
0 sample4 -0.0998956 0.0625678 0.107681
0 sample5 -0.0652175 -0.0814918 -0.0759696

FID IID PC1 PC2 PC3
0 sample1 -0.0650311 -0.0628848 -0.0760136
0 sample2 -0.0804442 0.0591764 0.0106865
0 sample3 0.0243089 0.00532782 0.0797794
0 sample4 -0.0904922 -0.0129799 -0.0417386```

The foregoing is not rigorous. In fact, using Plink is not an unexplained variance ratio, so it is a little troublesome. As long as you specify to output the original value of all PC s, the result is the same as that of GCTA:

```system("plink --bfile plink9996loci --chr-set 29 --pca 150 header --out plink_pc150")

var1 = round(val[1]/sum(val),4)
var2 = round(val[2]/sum(val),4)
var3 = round(val[3]/sum(val),4)

[1] 14.51180 10.94780  9.39109  8.68224  7.26250  7.01008
> sum(val)
[1] 179.1556
> var1
[1] 0.081
> var2
[1] 0.0611
> var3
[1] 0.0524

val <- scan("gcta0.eigenval")
var1 = round(val[1]/sum(val),4)
var2 = round(val[2]/sum(val),4)
var3 = round(val[3]/sum(val),4)

[1] 14.51180 10.94780  9.39109  8.68224  7.26250  7.01008
> sum(val)
[1] 179.1556
> var1
[1] 0.081
> var2
[1] 0.0611
> var3
[1] 0.0524```

The above is the result of PCA comparison. It's over. Next is the G matrix:

```### Plink structure G Array results ###

sample1   sample2    sample3    sample4    sample5
sample1  0.8703640 0.0272451 -0.1577940  0.1667670 -0.1117380
sample2  0.0272451 0.4955400  0.0446270  0.0691964  0.1130490
sample3 -0.1577940 0.0446270  1.0501800 -0.0994707  0.0471838
sample4  0.1667670 0.0691964 -0.0994707  0.6174680 -0.0740188
sample5 -0.1117380 0.1130490  0.0471838 -0.0740188  1.0315400

### see GCTA structure G Array results ###

# R script to read the GRM binary file | from GCTA official website
sum_i=function(i){
return(sum(1:i))
}
BinFileName=paste(prefix,".grm.bin",sep="")
NFileName=paste(prefix,".grm.N.bin",sep="")
IDFileName=paste(prefix,".grm.id",sep="")
n=dim(id)[1]
BinFile=file(BinFileName, "rb");
NFile=file(NFileName, "rb");
if(AllN==T){
}
i=sapply(1:n, sum_i)
return(list(diag=grm[i], off=grm[-i], id=id, N=N))
}

GCTA GRM 0
[1]  0.02724512 -0.15779422  0.04462700  0.16676699  0.06919638 -0.09947072

GCTA GRM 1
[1]  0.01600686 -0.32093960  0.08792991  0.29738194  0.11218226 -0.22020614
# The result is different from plink

### Based on two G Array construction method: write code and calculate manually ###

# FID, IID and genotype data were retained
g012 = m012[,-c(3:6)]
dim(g012)
fid = g012\$FID
iid = g012\$IID

# Sorting format
setDF(g012)
rownames(g012) = g012\$IID
g012\$IID = NULL
g012\$FID = NULL
g012=as.matrix(g012)

> g012[1:10,1:10]
1_5802_G 1_5823_C 1_5952_T 1_6121_T 1_6357_C 1_6597_T 1_7182_G 1_7934_A 1_8061_A 1_8282_T
sample1         0        0        0        0        0        0        0        0        0        0
sample2         0        0        0        0        0        0        0        0        1        1
sample3         0        0        0        0        0        0        0        0        0        0
sample4         0        0        0        0        0        0        0        0        0        0
sample5         0        2        0        0        0        0        1        0        1        1
sample6         0        2        1        0        0        1        1        0        1        1
sample7         2        2        0        0        0        0        0        0        1        0
sample8         1        1        0        0        0        0        0        0        1        1
sample9         0        2        0        0        0        1        0        0        1        1
sample10        0        0        0        0        1        0        0        0        0        0

# Calculate MAF

# Take the G-matrix values of samples 1 and 2 as an example
# grm 0
> (sum(((g012[1,]-2*maf\$MAF)*(g012[2,]-2*maf\$MAF))/(2*maf\$MAF*(1-maf\$MAF))))/nrow(maf)
[1] 0.02724698 # Consistent with GCTA GRM 0

# grm 1
> sum((g012[1,]-2*maf\$MAF)*(g012[2,]-2*maf\$MAF))/sum(2*maf\$MAF*(1-maf\$MAF))
[1] 0.01600781 # Consistent with GCTA GRM 1

### use DMU of Gmatrix modular ###

# par file without any QC and scale:
\$MINMAF
0
\$FREQMETHOD
1
\$SCALEMETHOD
1
0
0
\$DIAG_ONE
0
\$AGSCALE
0
\$PROP_A_to_G
0
\$CAL_DET
0
\$OUT_GMATRIX
1
\$OUT_IGMATRIX
0
\$OUT_AMATRIX
0

V1 V2          V3
1  1  1  1.23386414
2  1  2  0.01600686
3  1  3 -0.32093966
4  1  4  0.29738199
5  1  5 -0.22977475
6  1  6  0.10775250
# The Van Raden 2008 method used is consistent with the results of GCTA GRM 1

### use sommer package ###

# FID, IID and genotype data were retained
g012 = m012[,-c(3:6)]
dim(g012)
fid = g012\$FID
iid = g012\$IID
library(sommer)

# Sort out the format and calculate the G matrix
setDF(g012)
rownames(g012) = g012\$IID
g012\$IID = NULL
g012\$FID = NULL
g012=as.matrix(g012)
Gmat = A.mat(g012-1)

> Gmat[1:5,1:5]
sample1    sample2     sample3    sample4    sample5
sample1  1.23386414 0.01600686 -0.32093966  0.2973820 -0.2297748
sample2  0.01600686 0.79154839  0.08792993  0.1121823  0.2459466
sample3 -0.32093966 0.08792993  1.14676773 -0.2202062  0.1277878
sample4  0.29738199 0.11218228 -0.22020617  0.9891461 -0.1771543
sample5 -0.22977475 0.24594663  0.12778778 -0.1771543  1.2713619
# The Van Raden method used is consistent with the results of DMU gmatrix and GCTA GRM 1

### use AGHmatrix package ###

library(AGHmatrix)

gmat_yang=Gmatrix(g012,method = "Yang")

> gmat_van[1:5,1:5]
sample1    sample2     sample3    sample4    sample5
sample1  1.23386414 0.01600686 -0.32093966  0.2973820 -0.2297748
sample2  0.01600686 0.79154839  0.08792993  0.1121823  0.2459466
sample3 -0.32093966 0.08792993  1.14676773 -0.2202062  0.1277878
sample4  0.29738199 0.11218228 -0.22020617  0.9891461 -0.1771543
sample5 -0.22977475 0.24594663  0.12778778 -0.1771543  1.2713619
# The Van Raden method used is consistent with the results of DMU gmatrix, GCTA GRM 1 and sommer::A.mat

> gmat_yang[1:5,1:5]
sample1    sample2     sample3     sample4     sample5
sample1  1.19944230 0.02724512 -0.15779421  0.16676698 -0.11173845
sample2  0.02724512 1.04046826  0.04462700  0.06919638  0.11304946
sample3 -0.15779421 0.04462700  1.10626955 -0.09947072  0.04718378
sample4  0.16676698 0.06919638 -0.09947072  1.15327070 -0.07401880
sample5 -0.11173845 0.11304946  0.04718378 -0.07401880  1.26881311
# The Yang method used is consistent with Plink and GCTA GRM 0```

Therefore, the results of the two methods are quite different. Is the correlation coefficient of the two methods high?

```# Entire matrix
> cor(as.vector(gmat_van),as.vector(gmat_yang))
[1] 0.9099529

# Matrix diagonal
> cor(gmat_gcta0\$diag,gmat_gcta1\$diag)
[1] 0.8409531

# Matrix non diagonal
> cor(gmat_gcta0\$off,gmat_gcta1\$off)
[1] 0.8960708```

The correlation coefficient is not very high. So now I'm confused. Which of the two methods is right to build a G-matrix? It seems that most of the GWAS software uses Yang's method, and most of the GS software / packages use Van Raden's method... why？？ Both authors are great gods, and the citations of the two literatures are very high, so which is right???

There is an assumption to verify. Through the PCA results, take a group with clear clustering information as the true value, and use the matrices constructed by the two methods to do PCA respectively to see which method gets the PCA closer to the real clustering situation of the group. However, I don't know if it can work. I guess it is likely that for the group with clear clustering information, which method can get the correct clustering results, For populations with unclear clustering or uneven genetic background, the results obtained by the two methods will be quite different. To be verified...