Feige's note: This article was written by my colleague Su Hui. The content is more comprehensive and the code is more complete. My last article Why is the PCA calculated by plink different from that calculated by GCTA? Is an introduction, and when this article gives Plink -- the number of PCA samples,
, you can calculate the percentage of each PCA, just as I thought. In addition, I found that GCTA is really a treasure house,
It can also calculate heritability, genetic correlation, GBLUP and other parameters. The next step is to learn!
The following is the text
The story begins with PCA. To calculate the variance ratio explained by the first three PCs, however, goose Plink does not output the eigen values of all PCs, so GCTA is used to calculate the eigen values of PCA and the first three PCs. GCTA to calculate PCA first needs to construct kinship matrix, that is, G matrix, and then use kinship matrix to calculate PCA. After calculating the PCA, I found that the PCA result calculated by GCTA was actually different from Plink. Then I wanted to know why it was different. Then I began to study the similarities and differences of algorithms and results based on various software / packages to build G matrix.
There are two methods for GCTA to build GRM (Genetic Relationship Matrix, G matrix) - - make GRM alg 0 is based on the method proposed by Yang et al. 2010, - make GRM alg 1 is based on the method proposed by Van Raden 2008, and the default is 0.
# GRM 0 Yang sum{[(xij- 2pi)*(xik- 2pi)] / [2pi(1-pi)]} (Note: This is the formula given on GCTA's official website, but it should be wrong, the formula is not complete, and there is a * 1/N behind it, and the formula in GCTA's paper is correct) # GRM 1 Van Raden sum[(xij- 2pi)(xik- 2pi)] / sum[2pi(1-pi)]
The pca and Plink results obtained by constructing GRM with Van Raden method are inconsistent, and the pca results obtained by Yang method are consistent with Plink. This is very reasonable, because although it is not mentioned in pca, the official document of Plink in make rel says that Yang's method is used. Therefore, although Plink outputs pca results by directly inputting genotype data, it should first construct G-matrix, and the method of constructing G-matrix is to use Yang's method, and then calculate pca based on G-matrix, but Plink directly helps me in this process.
# GCTA GRM 0 gcta --bfile plink9996loci --autosome-num 29 --make-grm --make-grm-alg 0 --out kinship0 gcta --grm kinship --pca 3 --out gcta0 head -n 5 gcta0.eigenvec 0 sample1 0.0650311 0.0628848 -0.0760136 0 sample2 0.0804442 -0.0591764 0.0106865 0 sample3 -0.0243089 -0.00532782 0.0797794 0 sample4 0.0904922 0.0129799 -0.0417386 0 sample5 0.0458302 -0.0731895 0.0243814 # GCTA GRM 1 gcta --bfile plink9996loci --autosome-num 29 --make-grm --make-grm-alg 1 --out kinship gcta --grm kinship --pca 3 --out gcta head -n 5 gcta.eigenvec 0 sample1 -0.0349996 0.132768 0.158889 0 sample2 -0.0876233 -0.0556155 0.0798619 0 sample3 0.055747 -0.104418 0.00153495 0 sample4 -0.0998956 0.0625678 0.107681 0 sample5 -0.0652175 -0.0814918 -0.0759696 # Plink plink --bfile plink9996loci --chr-set 29 --pca 3 header suhui@suhui-MS-7B23:pca_testGCTA$ head -n 5 plink.eigenvec FID IID PC1 PC2 PC3 0 sample1 -0.0650311 -0.0628848 -0.0760136 0 sample2 -0.0804442 0.0591764 0.0106865 0 sample3 0.0243089 0.00532782 0.0797794 0 sample4 -0.0904922 -0.0129799 -0.0417386
The foregoing is not rigorous. In fact, using Plink is not an unexplained variance ratio, so it is a little troublesome. As long as you specify to output the original value of all PC s, the result is the same as that of GCTA:
system("plink --bfile plink9996loci --chr-set 29 --pca 150 header --out plink_pc150") val <- scan("plink_pc150.eigenval") var1 = round(val[1]/sum(val),4) var2 = round(val[2]/sum(val),4) var3 = round(val[3]/sum(val),4) > head(val) [1] 14.51180 10.94780 9.39109 8.68224 7.26250 7.01008 > sum(val) [1] 179.1556 > var1 [1] 0.081 > var2 [1] 0.0611 > var3 [1] 0.0524 val <- scan("gcta0.eigenval") var1 = round(val[1]/sum(val),4) var2 = round(val[2]/sum(val),4) var3 = round(val[3]/sum(val),4) > head(val) [1] 14.51180 10.94780 9.39109 8.68224 7.26250 7.01008 > sum(val) [1] 179.1556 > var1 [1] 0.081 > var2 [1] 0.0611 > var3 [1] 0.0524
The above is the result of PCA comparison. It's over. Next is the G matrix:
### Plink structure G Array results ### system("plink --bfile plink9996loci --chr-set 29 --make-rel square") gmat_plink=read.table(file="plink.rel") id=read.table(file="plink.rel.id") colnames(gmat_plink)=id$V2 rownames(gmat_plink)=id$V2 > gmat_plink[1:5,1:5] sample1 sample2 sample3 sample4 sample5 sample1 0.8703640 0.0272451 -0.1577940 0.1667670 -0.1117380 sample2 0.0272451 0.4955400 0.0446270 0.0691964 0.1130490 sample3 -0.1577940 0.0446270 1.0501800 -0.0994707 0.0471838 sample4 0.1667670 0.0691964 -0.0994707 0.6174680 -0.0740188 sample5 -0.1117380 0.1130490 0.0471838 -0.0740188 1.0315400 ### see GCTA structure G Array results ### # R script to read the GRM binary file | from GCTA official website ReadGRMBin=function(prefix, AllN=F, size=4){ sum_i=function(i){ return(sum(1:i)) } BinFileName=paste(prefix,".grm.bin",sep="") NFileName=paste(prefix,".grm.N.bin",sep="") IDFileName=paste(prefix,".grm.id",sep="") id = read.table(IDFileName) n=dim(id)[1] BinFile=file(BinFileName, "rb"); grm=readBin(BinFile, n=n*(n+1)/2, what=numeric(0), size=size) NFile=file(NFileName, "rb"); if(AllN==T){ N=readBin(NFile, n=n*(n+1)/2, what=numeric(0), size=size) } else N=readBin(NFile, n=1, what=numeric(0), size=size) i=sapply(1:n, sum_i) return(list(diag=grm[i], off=grm[-i], id=id, N=N)) } GCTA GRM 0 > gmat_gcta0=ReadGRMBin("kinship0") > head(gmat_gcta0$off) [1] 0.02724512 -0.15779422 0.04462700 0.16676699 0.06919638 -0.09947072 # Consistent with Plink's results GCTA GRM 1 > gmat_gcta1=ReadGRMBin("kinship") > head(gmat_gcta1$off) [1] 0.01600686 -0.32093960 0.08792991 0.29738194 0.11218226 -0.22020614 # The result is different from plink ### Based on two G Array construction method: write code and calculate manually ### system("plink --bfile plink9996loci --chr-set 29 --recodeA --out plink9996loci") # Read data m012 = fread("plink9996loci.raw") # FID, IID and genotype data were retained g012 = m012[,-c(3:6)] dim(g012) fid = g012$FID iid = g012$IID # Sorting format setDF(g012) rownames(g012) = g012$IID g012$IID = NULL g012$FID = NULL g012=as.matrix(g012) > g012[1:10,1:10] 1_5802_G 1_5823_C 1_5952_T 1_6121_T 1_6357_C 1_6597_T 1_7182_G 1_7934_A 1_8061_A 1_8282_T sample1 0 0 0 0 0 0 0 0 0 0 sample2 0 0 0 0 0 0 0 0 1 1 sample3 0 0 0 0 0 0 0 0 0 0 sample4 0 0 0 0 0 0 0 0 0 0 sample5 0 2 0 0 0 0 1 0 1 1 sample6 0 2 1 0 0 1 1 0 1 1 sample7 2 2 0 0 0 0 0 0 1 0 sample8 1 1 0 0 0 0 0 0 1 1 sample9 0 2 0 0 0 1 0 0 1 1 sample10 0 0 0 0 1 0 0 0 0 0 # Calculate MAF system("plink --bfile plink9996loci --chr-set 29 --freq --out maf") maf=read.table(file="maf.frq",header = T) # Take the G-matrix values of samples 1 and 2 as an example # grm 0 > (sum(((g012[1,]-2*maf$MAF)*(g012[2,]-2*maf$MAF))/(2*maf$MAF*(1-maf$MAF))))/nrow(maf) [1] 0.02724698 # Consistent with GCTA GRM 0 # grm 1 > sum((g012[1,]-2*maf$MAF)*(g012[2,]-2*maf$MAF))/sum(2*maf$MAF*(1-maf$MAF)) [1] 0.01600781 # Consistent with GCTA GRM 1 ### use DMU of Gmatrix modular ### # par file without any QC and scale: $MINMAF 0 $FREQMETHOD 1 $SCALEMETHOD 1 $DIAG_ADD 0 $G_ADD 0 $DIAG_ONE 0 $AGSCALE 0 $PROP_A_to_G 0 $CAL_DET 0 $OUT_GMATRIX 1 $OUT_IGMATRIX 0 $OUT_AMATRIX 0 gmat_dum=read.table(file="test.gmat") > head(gmat_dum) V1 V2 V3 1 1 1 1.23386414 2 1 2 0.01600686 3 1 3 -0.32093966 4 1 4 0.29738199 5 1 5 -0.22977475 6 1 6 0.10775250 # The Van Raden 2008 method used is consistent with the results of GCTA GRM 1 ### use sommer package ### system("plink --bfile plink9996loci --chr-set 29 --recodeA --out plink9996loci") # Read data m012 = fread("plink9996loci.raw") # FID, IID and genotype data were retained g012 = m012[,-c(3:6)] dim(g012) fid = g012$FID iid = g012$IID library(sommer) # Sort out the format and calculate the G matrix setDF(g012) rownames(g012) = g012$IID g012$IID = NULL g012$FID = NULL g012=as.matrix(g012) Gmat = A.mat(g012-1) > Gmat[1:5,1:5] sample1 sample2 sample3 sample4 sample5 sample1 1.23386414 0.01600686 -0.32093966 0.2973820 -0.2297748 sample2 0.01600686 0.79154839 0.08792993 0.1121823 0.2459466 sample3 -0.32093966 0.08792993 1.14676773 -0.2202062 0.1277878 sample4 0.29738199 0.11218228 -0.22020617 0.9891461 -0.1771543 sample5 -0.22977475 0.24594663 0.12778778 -0.1771543 1.2713619 # The Van Raden method used is consistent with the results of DMU gmatrix and GCTA GRM 1 ### use AGHmatrix package ### library(AGHmatrix) gmat_van=Gmatrix(g012,method = "VanRaden") gmat_yang=Gmatrix(g012,method = "Yang") > gmat_van[1:5,1:5] sample1 sample2 sample3 sample4 sample5 sample1 1.23386414 0.01600686 -0.32093966 0.2973820 -0.2297748 sample2 0.01600686 0.79154839 0.08792993 0.1121823 0.2459466 sample3 -0.32093966 0.08792993 1.14676773 -0.2202062 0.1277878 sample4 0.29738199 0.11218228 -0.22020617 0.9891461 -0.1771543 sample5 -0.22977475 0.24594663 0.12778778 -0.1771543 1.2713619 # The Van Raden method used is consistent with the results of DMU gmatrix, GCTA GRM 1 and sommer::A.mat > gmat_yang[1:5,1:5] sample1 sample2 sample3 sample4 sample5 sample1 1.19944230 0.02724512 -0.15779421 0.16676698 -0.11173845 sample2 0.02724512 1.04046826 0.04462700 0.06919638 0.11304946 sample3 -0.15779421 0.04462700 1.10626955 -0.09947072 0.04718378 sample4 0.16676698 0.06919638 -0.09947072 1.15327070 -0.07401880 sample5 -0.11173845 0.11304946 0.04718378 -0.07401880 1.26881311 # The Yang method used is consistent with Plink and GCTA GRM 0
Therefore, the results of the two methods are quite different. Is the correlation coefficient of the two methods high?
# Entire matrix > cor(as.vector(gmat_van),as.vector(gmat_yang)) [1] 0.9099529 # Matrix diagonal > cor(gmat_gcta0$diag,gmat_gcta1$diag) [1] 0.8409531 # Matrix non diagonal > cor(gmat_gcta0$off,gmat_gcta1$off) [1] 0.8960708
The correlation coefficient is not very high. So now I'm confused. Which of the two methods is right to build a G-matrix? It seems that most of the GWAS software uses Yang's method, and most of the GS software / packages use Van Raden's method... why?? Both authors are great gods, and the citations of the two literatures are very high, so which is right???
There is an assumption to verify. Through the PCA results, take a group with clear clustering information as the true value, and use the matrices constructed by the two methods to do PCA respectively to see which method gets the PCA closer to the real clustering situation of the group. However, I don't know if it can work. I guess it is likely that for the group with clear clustering information, which method can get the correct clustering results, For populations with unclear clustering or uneven genetic background, the results obtained by the two methods will be quite different. To be verified...
Read the original text and check the author's home page.