Details of VCF file format

VCF stands for Variant Call Format. A VCF file records the variants found in a genome and is usually produced by tools such as GATK or Samtools.
A VCF file can be roughly divided into two parts: the header lines and the variant records that follow them.

1. Header information, in lines starting with ## (the final header line, which names the columns, starts with a single #)

##fileformat=VCFv4.2
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##GATKCommandLine.HaplotypeCaller=<ID=HaplotypeCaller,Version=3.5-0-g36282e4,Date="Tue Apr 03 19:35:05 CST 2018",Epoch=1522755305379,CommandLineOptions="analysis_type=HaplotypeCaller input_file=[/opt/NfsDir/UserDir/wujh/Project/PrecisionFDA/Data_AshkenazimTrio//son/son.recal.bam] showFullBamList=false read_buffer_size=null phone_home=AWS gatk_key=null tag=NA read_filter=[] disable_read_filter=[] intervals=[/opt/NfsDir/UserDir/wujh/Project/PrecisionFDA/Data_AshkenazimTrio/ccds.interval.list] excludeIntervals=null interval_set_rule=UNION interval_merging=ALL interval_padding=0 reference_sequence=/opt/NfsDir/PublicDir/reference/ucsc.hg19.fasta nonDeterministicRandomSeed=false disableDithering=false maxRuntime=-1 maxRuntimeUnits=MINUTES downsampling_type=BY_SAMPLE downsample_to_fraction=null downsample_to_coverage=500 baq=OFF baqGapOpenPenalty=40.0 refactor_NDN_cigar_string=false fix_misencoded_quality_scores=false allow_potentially_misencoded_quality_scores=false useOriginalQualities=false defaultBaseQualities=-1 performanceLog=null BQSR=null quantize_quals=0 static_quantized_quals=null round_down_quantized=false disable_indel_quals=false emit_original_quals=false preserve_qscores_less_than=6 globalQScorePrior=-1.0 validation_strictness=SILENT remove_program_records=false keep_program_records=false sample_rename_mapping_file=null unsafe=null disable_auto_index_creation_and_locking_when_reading_rods=false no_cmdline_in_header=false sites_only=false never_trim_vcf_format_field=false bcf=false bam_compression=null simplifyBAM=false disable_bam_indexing=false generate_md5=false num_threads=1 num_cpu_threads_per_data_thread=1 num_io_threads=0 monitorThreadEfficiency=false num_bam_file_handles=null read_group_black_list=null pedigree=[] pedigreeString=[] pedigreeValidationType=STRICT allow_intervals_with_unindexed_bam=false generateShadowBCF=false variant_index_type=DYNAMIC_SEEK variant_index_parameter=-1 reference_window_stop=0 logging_level=INFO log_to_file=null 
help=false version=false out=/opt/NfsDir/UserDir/wujh/Project/PrecisionFDA/Data_AshkenazimTrio/son/son.raw.vcf likelihoodCalculationEngine=PairHMM heterogeneousKmerSizeResolution=COMBO_MIN dbsnp=(RodBinding name= source=UNBOUND) dontTrimActiveRegions=false maxDiscARExtension=25 maxGGAARExtension=300 paddingAroundIndels=150 paddingAroundSNPs=20 comp=[] annotation=[RMSMappingQuality, BaseCounts] excludeAnnotation=[] group=[Standard, StandardHCAnnotation] debug=false useFilteredReadsForAnnotations=false emitRefConfidence=NONE bamOutput=null bamWriterType=CALLED_HAPLOTYPES disableOptimizations=false annotateNDA=false heterozygosity=0.001 indel_heterozygosity=1.25E-4 standard_min_confidence_threshold_for_calling=50.0 standard_min_confidence_threshold_for_emitting=10.0 max_alternate_alleles=6 input_prior=[] sample_ploidy=2 genotyping_mode=DISCOVERY alleles=(RodBinding name= source=UNBOUND) contamination_fraction_to_filter=0.0 contamination_fraction_per_sample_file=null p_nonref_model=null exactcallslog=null output_mode=EMIT_VARIANTS_ONLY allSitePLs=false gcpHMM=10 pair_hmm_implementation=VECTOR_LOGLESS_CACHING pair_hmm_sub_implementation=ENABLE_ALL always_load_vector_logless_PairHMM_lib=false phredScaledGlobalReadMismappingRate=45 noFpga=false sample_name=null kmerSize=[10, 25] dontIncreaseKmerSizesForCycles=false allowNonUniqueKmersInRef=false numPruningSamples=1 recoverDanglingHeads=false doNotRecoverDanglingBranches=false minDanglingBranchLength=4 consensus=false maxNumHaplotypesInPopulation=128 errorCorrectKmers=false minPruning=2 debugGraphTransformations=false allowCyclesInKmerGraphToGeneratePaths=false graphOutput=null kmerLengthForReadErrorCorrection=25 minObservationsForKmerToBeSolid=20 GVCFGQBands=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 70, 80, 90, 99] 
indelSizeToEliminateInRefModel=10 min_base_quality_score=10 includeUmappedReads=false useAllelesTrigger=false doNotRunPhysicalPhasing=true keepRG=null justDetermineActiveRegions=false dontGenotype=false dontUseSoftClippedBases=false captureAssemblyFailureBAM=false errorCorrectReads=false pcr_indel_model=CONSERVATIVE maxReadsInRegionPerSample=10000 minReadsPerAlignmentStart=10 mergeVariantsViaLD=false activityProfileOut=null activeRegionOut=null activeRegionIn=null activeRegionExtension=null forceActive=false activeRegionMaxSize=null bandPassSigma=null maxProbPropagationDistance=50 activeProbabilityThreshold=0.002 min_mapping_quality_score=20 filter_reads_with_N_cigar=false filter_mismatching_base_and_quals=false filter_bases_not_stored=false">
##GATKCommandLine.SelectVariants=<ID=SelectVariants,Version=3.5-0-g36282e4,Date="Wed Jun 06 09:33:03 CST 2018",Epoch=1528248783862,CommandLineOptions="analysis_type=SelectVariants input_file=[] showFullBamList=false read_buffer_size=null phone_home=AWS gatk_key=null tag=NA read_filter=[] disable_read_filter=[] intervals=null excludeIntervals=null interval_set_rule=UNION interval_merging=ALL interval_padding=0 reference_sequence=/opt/NfsDir/PublicDir/reference/ucsc.hg19.fasta nonDeterministicRandomSeed=false disableDithering=false maxRuntime=-1 maxRuntimeUnits=MINUTES downsampling_type=BY_SAMPLE downsample_to_fraction=null downsample_to_coverage=1000 baq=OFF baqGapOpenPenalty=40.0 refactor_NDN_cigar_string=false fix_misencoded_quality_scores=false allow_potentially_misencoded_quality_scores=false useOriginalQualities=false defaultBaseQualities=-1 performanceLog=null BQSR=null quantize_quals=0 static_quantized_quals=null round_down_quantized=false disable_indel_quals=false emit_original_quals=false preserve_qscores_less_than=6 globalQScorePrior=-1.0 validation_strictness=SILENT remove_program_records=false keep_program_records=false sample_rename_mapping_file=null unsafe=null disable_auto_index_creation_and_locking_when_reading_rods=false no_cmdline_in_header=false sites_only=false never_trim_vcf_format_field=false bcf=false bam_compression=null simplifyBAM=false disable_bam_indexing=false generate_md5=false num_threads=1 num_cpu_threads_per_data_thread=1 num_io_threads=0 monitorThreadEfficiency=false num_bam_file_handles=null read_group_black_list=null pedigree=[] pedigreeString=[] pedigreeValidationType=STRICT allow_intervals_with_unindexed_bam=false generateShadowBCF=false variant_index_type=DYNAMIC_SEEK variant_index_parameter=-1 reference_window_stop=0 logging_level=INFO log_to_file=null help=false version=false variant=(RodBinding name=variant source=/opt/NfsDir/UserDir/wujh/Project/PrecisionFDA/Hardfilter_optimize/son.raw.vcf) discordance=(RodBinding name= 
source=UNBOUND) concordance=(RodBinding name= source=UNBOUND) out=/opt/NfsDir/UserDir/wujh/Project/PrecisionFDA/Hardfilter_optimize/SNP/HG002_SNP.vcf sample_name=[] sample_expressions=null sample_file=null exclude_sample_name=[] exclude_sample_file=[] exclude_sample_expressions=[] selectexpressions=[] invertselect=false excludeNonVariants=false excludeFiltered=false preserveAlleles=false removeUnusedAlternates=false restrictAllelesTo=ALL keepOriginalAC=false keepOriginalDP=false mendelianViolation=false invertMendelianViolation=false mendelianViolationQualThreshold=0.0 select_random_fraction=0.0 remove_fraction_genotypes=0.0 selectTypeToInclude=[SNP] selectTypeToExclude=[] keepIDs=null excludeIDs=null fullyDecode=false justRead=false maxIndelSize=2147483647 minIndelSize=0 maxFilteredGenotypes=2147483647 minFilteredGenotypes=0 maxFractionFilteredGenotypes=1.0 minFractionFilteredGenotypes=0.0 setFilteredGtToNocall=false ALLOW_NONOVERLAPPING_COMMAND_LINE_SAMPLES=false forceValidOutput=false filter_reads_with_N_cigar=false filter_mismatching_base_and_quals=false filter_bases_not_stored=false">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
......
##contig=<ID=chrUn_gl000248,length=39786,assembly=hg19>
##contig=<ID=chrUn_gl000249,length=38502,assembly=hg19>
##reference=file:///opt/NfsDir/PublicDir/reference/ucsc.hg19.fasta
##source=SelectVariants
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  son

The header mainly records the VCF version, the FORMAT and INFO field definitions, the reference genome, the programs that were run, and similar information.
The columns named in the last header line (#CHROM ...) have the following meanings:
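
As a rough illustration of the header layout shown above, the following Python sketch separates the ## lines from the #CHROM column line and lists the IDs declared by INFO or FORMAT definitions. The function names split_vcf_header and structured_ids are made up for this example, and the column line is split on any whitespace so that it works both for real tab-separated files and for the space-separated listing in this post.

```python
import re


def split_vcf_header(lines):
    """Separate '##' meta lines from the single '#CHROM' column line."""
    meta, columns = [], None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            meta.append(line[2:])          # drop the leading '##'
        elif line.startswith("#"):
            columns = line[1:].split()     # drop '#', split on tabs or spaces
            break                          # variant records start after this line
    return meta, columns


def structured_ids(meta, kind):
    """Return the IDs declared by lines such as INFO=<ID=AC,...>."""
    pattern = re.compile(r"^%s=<ID=([^,>]+)" % re.escape(kind))
    return [m.group(1) for m in map(pattern.match, meta) if m]


# A few lines taken from the example header above.
example = [
    "##fileformat=VCFv4.2",
    '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
    '##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">',
    "#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  son",
]

meta, columns = split_vcf_header(example)
print(columns)
print(structured_ids(meta, "INFO"))
print(structured_ids(meta, "FORMAT"))
```

Every column after FORMAT is a sample name, so in this file the only sample is son.
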

1. CHROM - chromosome: the chromosome (or contig) on which the variant lies
2. POS - position: the 1-based position of the variant on the reference genome
3. ID - identifier: the ID of the variant. For example, if the SNP has an ID in dbSNP, it is given in this column; if there is no ID, the column holds '.', which marks a novel variant.
4. REF - reference base(s): the base(s) on the reference chromosome at this position; each must be one of A, C, G, T or N, where N represents an undetermined base
5. ALT - alternate base(s): the base(s) that differ from the reference sequence
6. QUAL - quality: a Phred-scaled quality value indicating how likely it is that a variant exists at this site; the higher the value,
           the more likely the variant is real. It is calculated as Phred = -10 * log10(1 - p), where p is the probability that the variant exists.
           For example, a value of 10 corresponds to an error probability of 0.1, i.e. a 90% probability that the variant exists.
7. FILTER - filter status: filtering on the QUAL value alone is not enough, and GATK can apply other filters as well. If the site passes all filters, the value is "PASS"; if the variant is not reliable, the value names the filter(s) it failed; "." means that no filtering has been applied.
8. INFO - additional information: this column holds the details of the variant, as follows:
  #DP: read depth, the read coverage of the samples at this site, counted after some reads have been filtered out. DP4: the number of high-quality bases supporting REF on the forward strand, REF on the reverse strand, ALT on the forward strand and ALT on the reverse strand
  #QD: variant call confidence normalized by the depth of the sample reads supporting the variant, i.e. the reliability of a variant judged relative to depth
  #MQ: RMS Mapping Quality, the root mean square of the mapping qualities of the reads covering the site
  #FQ: Phred-scaled probability that all samples are the same
  #AC, AF and AN: AC (Allele Count) is the number of ALT alleles observed; AF (Allele Frequency) is the frequency of the ALT allele; AN (Allele Number) is the total number of alleles in called genotypes.
      For a diploid sample: genotype 0/1 means that the sample is heterozygous, so the allele count is 1 (only one of the sample's two alleles carries the variant at this site),
       the allele frequency is 0.5 (only 50% of the sample's alleles carry the variant at this site), and the allele number is 2; genotype 1/1 means that the sample is homozygous for the variant, so the allele count is 2, the allele frequency is 1, and the allele number is 2.
  #MLEAC: Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed
  #MLEAF: Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed
  #BaseQRankSum: compares the base qualities of the bases supporting the variant with those of the bases supporting the reference allele. A negative value indicates that the bases supporting the variant have lower quality than those supporting the reference allele;
       a positive value indicates the opposite. 0 means there is no significant difference between the two.
  #FS: uses Fisher's exact test to detect strand bias in the sequencing reads. Strand bias may lead to errors when calling the variant allele. The value is a Phred-scaled p-value; the larger it is, the more likely strand bias is present.
  #InbreedingCoeff: uses a likelihood-based method to estimate the inbreeding coefficient among the samples. The higher the value, the more likely inbreeding is.
  #MQRankSum: compares the mapping qualities of the reads supporting the variant with those of the reads supporting the reference allele; it is computed only for heterozygous sites. A negative value means that the reads supporting the variant have lower mapping quality than those supporting the reference allele;
       a positive value means the opposite. 0 means there is no significant difference between the two. In practice, sites with strongly negative values are usually filtered out.
  #BaseCounts: the counts of A, C, G and T bases across all samples at the variant site
  #ClippingRankSum: similar to the two rank-sum tests above. A negative value indicates that the reads supporting the variant have more hard-clipped bases, and a positive value indicates that the reads supporting the reference allele have more hard-clipped bases. 0 is best; both positive and negative values suggest a possible artifact.
  #ReadPosRankSum: tests whether the variant shows a position bias within reads (whether it tends to lie near the ends of reads, which are error-prone). The best value is 0, indicating that the variant is independent of its position in the read. A negative value indicates that the variant allele tends to occur near the ends of reads, and a positive value indicates that the reference allele tends to occur near the ends.
  #ExcessHet: tests for excess heterozygosity among the samples, related to InbreedingCoeff; the larger the value, the more likely the site is an error.
  #LikelihoodRankSum: evaluates how well the reads supporting the variant and the reads supporting REF match their best haplotype, with 0 as the best value. A negative value indicates that the reads supporting the variant match less well than those supporting REF, while a positive value indicates the opposite. A higher value suggests an error.
  #HaplotypeScore: the higher the score, the more likely the call is an error. Higher scores are indicative of regions with bad alignments, typically leading to artifactual SNP and indel calls.
  #SOR: StrandOddsRatio, another annotation that evaluates whether there is strand bias in the data; it can be seen as an upgrade of FS. It is an updated form of the Fisher Strand Test that is better at taking into account large amounts of data in high coverage situations. It is used to determine if there is strand bias between forward and reverse strands for the reference or alternate allele. The reported value is ln-scaled.
  #IS: maximum number of reads supporting an indel, and the fraction of indel reads
  #G3: maximum likelihood (ML) estimate of the genotype frequencies
  #HWE: p-value of the chi^2 test for Hardy-Weinberg equilibrium (HWE), based on G3
  #CLR: log ratio of the genotype likelihoods with and without the constraint
  #UGT: the most probable unconstrained genotype configuration in the trio
  #CGT: the most probable constrained genotype configuration in the trio
  #PV4: four p-values, for strand bias, baseQ bias, mapQ bias and tail distance bias
  #INDEL: indicates that the variant at this position is an insertion or deletion
  #PC2: Phred-scaled probability that the non-reference allele frequency differs between two groups of samples
  #PCHI2: posterior weighted chi^2 p-value for testing the association between two groups of samples
  #QCHI2: Phred-scaled PCHI2
  #PR: number of permutations that yield a smaller PCHI2
  #QBD: Quality by Depth, the quality relative to the sequencing depth
  #RPB: Read Position Bias
  #MDV: the maximum number of high-quality non-reference reads in the samples
  #VDB: Variant Distance Bias, used to filter out splice-site artifacts in RNA-seq data
  

9. FORMAT - describes the fields given in the sample column(s) that follow it
  #AD and DP: AD (Allele Depth) is the read coverage of each allele in the sample; for a diploid sample with one ALT allele it is two values separated by a comma,
      the first for the REF allele and the second for the ALT allele. DP (Depth) is the read coverage of this site in the sample.
  #GT: the genotype of the sample. For a diploid sample it is two numbers separated by '/'. 0 means the sample carries the REF allele;
       1 means the first ALT allele and 2 means the second ALT allele. Therefore, 0/0 means the site is homozygous and identical to the reference; 0/1 means the site is heterozygous, with one REF allele and one ALT allele; 1/1 means the site is homozygous for the variant.
  #GQ: Genotype Quality, i.e. the PL value of the second most likely genotype, measured against the PL of the most likely genotype (PL = 0). Values above 99 carry little extra information, so they are all capped at 99. A very small GQ means that there is little difference between the second most likely genotype and the most likely one.
  #GL: likelihoods of the three genotypes (RR, RA, AA), where R is the reference base and A is the variant base
  #DV: number of high-quality non-reference bases
  #SP: Phred-scaled strand bias p-value
  #PL: provides the likelihoods of the given genotypes. For a site with one ALT allele the three genotypes are (0/0, 0/1, 1/1), and the values are normalized so that the most likely genotype has PL = 0.
       Unlike the previous values, the higher the value, the less likely the genotype. Phred = -10 * log10(p), where p is the probability of the genotype.
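
To make the column layout more concrete, here is a minimal Python sketch that splits one data line into the fixed columns, turns INFO into a dictionary, and pairs the FORMAT keys with the sample values. It also recomputes AC, AF and AN from GT strings, which reproduces the 0/1 and 1/1 cases described under INFO. The record and the names parse_record and allele_counts are invented for illustration and are not taken from the file above; real data is better handled by a dedicated VCF library.

```python
def parse_record(line, columns):
    """Split one tab-separated VCF data line into fixed fields, INFO and samples."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(columns[:8], fields[:8]))

    # INFO is a ';'-separated list of KEY=VALUE pairs or bare flags; '.' means empty.
    info = {}
    for entry in record["INFO"].split(";"):
        if entry == ".":
            continue
        key, sep, value = entry.partition("=")
        info[key] = value if sep else True
    record["INFO"] = info

    # FORMAT keys pair positionally with the ':'-separated values of each sample.
    keys = fields[8].split(":") if len(fields) > 8 else []
    record["samples"] = {
        name: dict(zip(keys, values.split(":")))
        for name, values in zip(columns[9:], fields[9:])
    }
    return record


def allele_counts(genotypes, alt_index=1):
    """Recompute (AC, AF, AN) for one ALT allele from a list of GT strings."""
    alleles = []
    for gt in genotypes:
        # Treat phased ('|') and unphased ('/') genotypes alike; skip no-calls.
        alleles.extend(a for a in gt.replace("|", "/").split("/") if a != ".")
    an = len(alleles)
    ac = alleles.count(str(alt_index))
    return ac, (ac / an if an else 0.0), an


# Hypothetical record, invented for illustration only.
columns = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT", "son"]
line = "chr1\t12345\t.\tA\tG\t50.0\tPASS\tAC=1;AF=0.500;AN=2;DP=20\tGT:AD:DP:GQ:PL\t0/1:10,10:20:99:200,0,200"

record = parse_record(line, columns)
print(record["INFO"])
print(record["samples"]["son"])
print(allele_counts(["0/1"]))   # heterozygous sample
print(allele_counts(["1/1"]))   # homozygous variant sample
```

All values are kept as strings here; converting them according to the Type declared in the header is left out to keep the sketch short.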

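The Phred-scaled values described above (QUAL, GQ, PL) can be converted with a few lines of arithmetic. The Python sketch below assumes base-10 logarithms and, for GQ, the rule described under #GQ: the gap between the two smallest PL values, capped at 99. The function names are hypothetical.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred-scaled value to the probability it encodes."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p_error):
    """Convert an error probability to its Phred-scaled value."""
    return -10 * math.log10(p_error)


def pl_to_relative_likelihoods(pl):
    """Turn PL values into likelihoods relative to the best genotype (PL = 0)."""
    return [10 ** (-value / 10) for value in pl]


def gq_from_pl(pl, cap=99):
    """GQ as the gap between the two smallest PL values, capped at 99."""
    lowest, second = sorted(pl)[:2]
    return min(second - lowest, cap)


print(phred_to_error_prob(10))                     # QUAL = 10
print(error_prob_to_phred(0.1))
print(pl_to_relative_likelihoods([200, 0, 200]))   # PL for 0/0, 0/1, 1/1
print(gq_from_pl([200, 0, 200]))
print(gq_from_pl([30, 0, 45]))
```

With PL = 200,0,200 the heterozygous genotype 0/1 is the most likely one, and the two homozygous genotypes are far less likely.
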
Posted by kla0005 on Wed, 11 Dec 2019 00:31:01 -0800