Data seems intuitive, yet the word itself is surprisingly hard to define.
Think about it: what is data?
A definition of data has two aspects: the symbols that carry the information, and the design of that information.
The design of information, that is, the data format, determines how easily readers can extract useful information.
One fact people often overlook: the format of data is as important as the data itself.
Data in Bioinformatics
Traditional biologists may think of bioinformatics as software that turns data into results.
In fact, bioinformatics only converts data in one format into data in another format.
Such format conversion often synthesizes and refines the information along the way.
Data formats
Several common data formats in bioinformatics:
- GenBank
- Fasta
- FastQ
- BED/GFF/GTF
- SAM/BAM
1. GenBank
The file extension is .gb or .genbank. The format is laid out to be easy for people to read.
GenBank sample file
Data source: https://www.ncbi.nlm.nih.gov/nuccore/NC_045512.2/
    LOCUS       NC_045512              29903 bp ss-RNA     linear   VRL 18-JUL-2020
    DEFINITION  Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1,
                complete genome.
    ACCESSION   NC_045512
    VERSION     NC_045512.2  GI:1798174254
    DBLINK      BioProject: PRJNA485481
    KEYWORDS    RefSeq.
    SOURCE      Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)
      ORGANISM  Severe acute respiratory syndrome coronavirus 2
                Viruses; Riboviria; Orthornavirae; Pisuviricota; Pisoniviricetes;
                Nidovirales; Cornidovirineae; Coronaviridae; Orthocoronavirinae;
                Betacoronavirus; Sarbecovirus.
    REFERENCE   1  (bases 1 to 29903)
      AUTHORS   Wu,F., Zhao,S., Yu,B., Chen,Y.M., Wang,W., Song,Z.G., Hu,Y.,
                Tao,Z.W., Tian,J.H., Pei,Y.Y., Yuan,M.L., Zhang,Y.L., Dai,F.H.,
                Liu,Y., Wang,Q.M., Zheng,J.J., Xu,L., Holmes,E.C. and Zhang,Y.Z.
      TITLE     A new coronavirus associated with human respiratory disease in China
      JOURNAL   Nature 579 (7798), 265-269 (2020)
       PUBMED   32015508
      REMARK    Erratum:[Nature. 2020 Apr;580(7803):E7. PMID: 32296181]
    ...
         gene            21563..25384
                         /gene="S"
                         /locus_tag="GU280_gp02"
                         /gene_synonym="spike glycoprotein"
                         /db_xref="GeneID:43740568"
         CDS             21563..25384
                         /gene="S"
                         /locus_tag="GU280_gp02"
                         /gene_synonym="spike glycoprotein"
                         /note="structural protein; spike protein"
                         /codon_start=1
                         /product="surface glycoprotein"
                         /protein_id="YP_009724390.1"
                         /db_xref="GI:1796318598"
                         /db_xref="GeneID:43740568"
                         /translation="MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFR
                         SSVLHSTQDLFLPFFSNVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIR
                         GWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESEFRVY
                         SSANNCTFEYVSQPFLMDLEGKQGNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPQ
                         GFSALEPLVDLPIGINITRFQTLLALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFL
                         LKYNENGTITDAVDCALDPLSETKCTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITN
                         LCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLCF
                         TNVYADSFVIRGDEVRQIAPGQTGKIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYN
                         YLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVEGFNCYFPLQSYGFQPTNGVGYQPY
                         RVVVLSFELLHAPATVCGPKKSTNLVKNKCVNFNFNGLTGTGVLTESNKKFLPFQQFG
                         RDIADTTDAVRDPQTLEILDITPCSFGGVSVITPGTNTSNQVAVLYQDVNCTEVPVAI
                         HADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSYECDIPIGAGICASYQTQTNSPR
                         RARSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTISVTTEILPVSMTKTSVDCTM
                         YICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQEVFAQVKQIYKTPPIKDFG
                         GFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDCLGDIAARDLICAQKFN
                         GLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAMQMAYRFNGIGVTQN
                         VLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNQNAQALNTLVKQLSSNFGA
                         ISSVLNDILSRLDKVEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIRASANLAATKMS
                         ECVLGQSKRVDFCGKGYHLMSFPQSAPHGVVFLHVTYVPAQEKNFTTAPAICHDGKAH
                         FPREGVFVSNGTHWFVTQRNFYEPQIITTDNTFVSGNCDVVIGIVNNTVYDPLQPELD
                         SFKEELDKYFKNHTSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESLIDLQELG
                         KYEQYIKWPWYIWLGFIAGLIAIVMVTIMLCCMTSCCSCLKGCCSCGSCCKFDEDDSE
                         PVLKGVKLHYT"
    ...

    # The first line of the file, LOCUS, contains several data elements:
    # Name (NC_045512)
    # Sequence length (29903 bp)
    # Molecule type (ss-RNA, single-stranded RNA)
    # Molecule topology (linear)
    # GenBank division abbreviation (VRL, viral sequences)
    # Last modification date (18-JUL-2020)
    LOCUS       NC_045512              29903 bp ss-RNA     linear   VRL 18-JUL-2020
GenBank division abbreviations
Abbreviation | Full name | Abbreviation | Full name |
---|---|---|---|
PRI | primate sequences | ROD | rodent sequences |
MAM | other mammalian sequences | VRT | other vertebrate sequences |
INV | invertebrate sequences | PLN | plant, fungal, and algal sequences |
BCT | bacterial sequences | VRL | viral sequences |
PHG | bacteriophage sequences | SYN | synthetic sequences |
UNA | unannotated sequences | EST | EST sequences (expressed sequence tags) |
PAT | patent sequences | STS | STS sequences (sequence tagged sites) |
GSS | GSS sequences (genome survey sequences) | HTG | HTG sequences (high-throughput genomic sequences) |
HTC | unfinished high-throughput cDNA sequencing | ENV | environmental sampling sequences |
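The LOCUS fields and division codes described above can also be read programmatically. The following is a minimal sketch, not an official parser: `parse_locus` and the `DIVISIONS` dictionary are hypothetical names introduced here, the dictionary holds only a subset of the table, and the code assumes the whitespace-separated eight-token layout of the sample LOCUS line (other records may lay the line out differently).

```python
# Hypothetical helper: split a GenBank LOCUS line into named fields.
# Assumes exactly the eight whitespace-separated tokens seen in the sample.

DIVISIONS = {  # subset of the division table, for illustration only
    "PRI": "primate sequences",
    "BCT": "bacterial sequences",
    "VRL": "viral sequences",
    "PLN": "plant, fungal, and algal sequences",
}


def parse_locus(line):
    """Return the LOCUS fields as a dictionary."""
    tokens = line.split()
    if len(tokens) != 8 or tokens[0] != "LOCUS":
        raise ValueError("unexpected LOCUS layout: " + line)
    _, name, length, unit, molecule, topology, division, modified = tokens
    return {
        "name": name,
        "length": int(length),
        "unit": unit,
        "molecule_type": molecule,
        "topology": topology,
        "division": division,
        "division_name": DIVISIONS.get(division, "unknown"),
        "modified": modified,
    }


record = parse_locus("LOCUS NC_045512 29903 bp ss-RNA linear VRL 18-JUL-2020")
print(record["name"], record["length"], record["division_name"])
```

For real projects an established library is usually a better choice than hand-written splitting, since GenBank records contain many optional sections.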
Sharp-eyed readers will have noticed at a glance that the GenBank sample file holds the genome of SARS-CoV-2, the virus that causes COVID-19 and is wreaking havoc all over the world.
Schematic diagram of the SARS-CoV-2 structure
Source: Alissa Eckert, MSMI; Dan Higgins, MAMS (CDC)
If you are familiar with how COVID-19 subunit vaccines are developed, you probably know that the receptor-binding domain (RBD) of the spike glycoprotein shown above contains multiple B-cell and T-cell epitopes, which makes it an ideal target antigen.
However, recombinant target proteins are often poorly immunogenic and usually need to be optimized before they can stimulate the body to produce enough antibodies.
By combining dimerized RBD fragments with immune adjuvants, the team of Academician Gao Fu compensated for the poor immunogenicity of the recombinant protein and successfully induced mice to produce large amounts of neutralizing antibodies [1].
In addition, the Delta variant of SARS-CoV-2 now circulating worldwide is caused by mutations at amino acid sites of the S protein [2].
As this shows, GenBank is a complex storage format that holds rich biological information.
2. Fasta
The file extension is usually .fa, .fasta, .fna, or .seq. A Fasta file records sequence information similar to that in GenBank.
Fasta sample file: gene sequence of the SARS-CoV-2 M protein

    >NC_045512.2:26523-27191 M [organism=Severe acute respiratory syndrome coronavirus 2] [GeneID=43740571] [chromosome=]
    ATGGCAGATTCCAACGGTACTATTACCGTTGAAGAGCTTAAAAAGCTCCTTGAACAATGGAACCTAGTAA
    TAGGTTTCCTATTCCTTACATGGATTTGTCTTCTACAATTTGCCTATGCCAACAGGAATAGGTTTTTGTA
    TATAATTAAGTTAATTTTCCTCTGGCTGTTATGGCCAGTAACTTTAGCTTGTTTTGTGCTTGCTGCTGTT
    TACAGAATAAATTGGATCACCGGTGGAATTGCTATCGCAATGGCTTGTCTTGTAGGCTTGATGTGGCTCA
    GCTACTTCATTGCTTCTTTCAGACTGTTTGCGCGTACGCGTTCCATGTGGTCATTCAATCCAGAAACTAA
    CATTCTTCTCAACGTGCCACTCCATGGCACTATTCTGACCAGACCGCTTCTAGAAAGTGAACTCGTAATC
    GGAGCTGTGATCCTTCGTGGACATCTTCGTATTGCTGGACACCATCTAGGACGCTGTGACATCAAGGACC
    TGCCTAAAGAAATCACTGTTGCTACATCACGAACGCTTTCTTATTACAAATTGGGAGCTTCGCAGCGTGT
    AGCAGGTGACTCAGGTTTTGCTGCATACAGTCGCTACAGGATTGGCAACTATAAATTAAACACAGACCAT
    TCCAGTAGCAGTGACAATATTGCTTTGCTTGTACAGTAA
A Fasta file consists of a description line followed by the base sequence lines:

    # Description line of the sequence, starting with a greater-than sign (>)
    >NC_045512.2:26523-27191 M [organism=Severe acute respiratory syndrome coronavirus 2] [GeneID=43740571] [chromosome=]
    # Base sequence
    ATGGCAGATTCCAACGGTACTATTACCGTTGAAGAGCTTAAAAAGCTCCTTGAACAATGGAACCTAGTAA
    TAGGTTTCCTATTCCTTACATGGATTTGTCTTCTACAATTTGCCTATGCCAACAGGAATAGGTTTTTGTA
    TATAATTAAGTTAATTTTCCTCTGGCTGTTATGGCCAGTAACTTTAGCTTGTTTTGTGCTTGCTGCTGTT
    TACAGAATAAATTGGATCACCGGTGGAATTGCTATCGCAATGGCTTGTCTTGTAGGCTTGATGTGGCTCA
    GCTACTTCATTGCTTCTTTCAGACTGTTTGCGCGTACGCGTTCCATGTGGTCATTCAATCCAGAAACTAA
    CATTCTTCTCAACGTGCCACTCCATGGCACTATTCTGACCAGACCGCTTCTAGAAAGTGAACTCGTAATC
    GGAGCTGTGATCCTTCGTGGACATCTTCGTATTGCTGGACACCATCTAGGACGCTGTGACATCAAGGACC
    TGCCTAAAGAAATCACTGTTGCTACATCACGAACGCTTTCTTATTACAAATTGGGAGCTTCGCAGCGTGT
    AGCAGGTGACTCAGGTTTTGCTGCATACAGTCGCTACAGGATTGGCAACTATAAATTAAACACAGACCAT
    TCCAGTAGCAGTGACAATATTGCTTTGCTTGTACAGTAA
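The two-part structure described above (a line starting with ">" followed by sequence lines) is simple enough to read with a few lines of code. This is a minimal sketch under stated assumptions: `parse_fasta` is a hypothetical helper, and the `example` string is a shortened excerpt of the M gene record with an abbreviated description line, not the full sequence.

```python
def parse_fasta(text):
    """Map each description line (without '>') to its joined sequence."""
    chunks = {}
    header = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header = line[1:]
            chunks[header] = []
        elif header is None:
            raise ValueError("sequence data found before the first '>' line")
        else:
            chunks[header].append(line)
    # Sequence lines are wrapped for display only, so join them back together.
    return {name: "".join(parts) for name, parts in chunks.items()}


# Shortened excerpt of the sample record (illustration only).
example = """>NC_045512.2:26523-27191 M
ATGGCAGATTCCAACGGTACTATTACCGTTGAAGAGCTTAAAAAGCTCCTTGAACAATGGAACCTAGTAA
TAGGTTTCCTATTCC
"""

sequences = parse_fasta(example)
for name, seq in sequences.items():
    print(name, len(seq))
```

Joining the wrapped lines matters because the line breaks inside a Fasta record carry no biological meaning.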
3. FastQ
The file extension is .fq or .fastq. The format stores the bases read by a sequencer during a sequencing run and can be regarded as a Fasta file with base quality scores.
Each read in a FastQ file occupies 4 lines:
- Line 1: @ followed by the read ID
- Line 2: the base sequence
- Line 3: a plus sign (+), optionally followed by the read ID
- Line 4: the Phred quality score of each base, one character per base
FastQ sample file
    @SRR16911464.1 1 length=35
    GGCTGCTTATGTAGACAATTTTAGTCTTACTATTA
    +SRR16911464.1 1 length=35
    BBBBBBFFFFFFGGGGGGGGGGHHHHGHGHHHHHH
    @SRR16911464.2 2 length=36
    GACAATGCTCAGGTGTTACTTTCCAAAGTGCAGTGA
    +SRR16911464.2 2 length=36
    AAABBFFFFFBBFGCGEGGGGGFFFFGFHHFHH5FG
    @SRR16911464.3 3 length=37
    CTATGTAATCATCAGATTCAACTTGCATGGCATTGTT
    +SRR16911464.3 3 length=37
    CCDEDFFFFFFFGGGGGGGGGGHHHHHHHHHHHHHHH
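The four-line record layout, and the way the fourth line encodes quality, can be illustrated with a short sketch. `parse_fastq` and `error_probability` are hypothetical helpers written for this example; the default offset of 33 assumes Phred+33 encoding, which is an assumption about the sample data rather than something stated in the file itself.

```python
def parse_fastq(text, offset=33):
    """Read 4-line FastQ records; decode quality characters to Phred scores."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("FastQ records must occupy exactly 4 lines each")
    records = []
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed record starting at line %d" % (i + 1))
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        # Each quality character stores (Phred score + offset) as ASCII.
        scores = [ord(ch) - offset for ch in qual]
        records.append({"id": header[1:], "seq": seq, "phred": scores})
    return records


def error_probability(phred):
    """Convert a Phred score Q to an error probability, P = 10^(-Q/10)."""
    return 10 ** (-phred / 10)


sample = """@SRR16911464.1 1 length=35
GGCTGCTTATGTAGACAATTTTAGTCTTACTATTA
+SRR16911464.1 1 length=35
BBBBBBFFFFFFGGGGGGGGGGHHHHGHGHHHHHH
"""

reads = parse_fastq(sample)
print(reads[0]["id"], reads[0]["phred"][:3])
```

Reading in fixed groups of four lines, rather than searching for "@", avoids misreading a quality line that happens to begin with the "@" character.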
4. BED/GFF/GTF
These formats record the coordinates of specific intervals in the genome, such as genes, coding sequences (CDS), and untranslated regions (UTR). Columns are separated by tabs.
BED
A 3-column BED file contains the chromosome, start position, and end position

    chr7 127471196 127472363
    chr7 127472363 127473530
    chr7 127473530 127474697

A 6-column BED file adds a name, a score, and the strand

    chr7 127471196 127472363 Pos1 0 +
    chr7 127472363 127473530 Pos2 0 +
    chr7 127473530 127474697 Pos3 0 +
GFF/GTF
Files typically contain 9 columns separated by tabs.
P.S. Coordinates in BED files start at 0 (0-based), whereas coordinates in GFF/GTF files start at 1 (1-based).
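The coordinate difference noted above is a common source of off-by-one errors, so a small conversion sketch may help. It assumes the usual conventions: BED intervals are 0-based and half-open (the end position is excluded), while GFF/GTF intervals are 1-based and closed (the end position is included). The function names are hypothetical.

```python
def bed_to_gff(start, end):
    """Convert a 0-based half-open BED interval to 1-based closed GFF."""
    return start + 1, end


def gff_to_bed(start, end):
    """Convert a 1-based closed GFF interval to 0-based half-open BED."""
    return start - 1, end


def bed_length(start, end):
    """Number of bases covered by a BED interval."""
    return end - start


def gff_length(start, end):
    """Number of bases covered by a GFF/GTF interval."""
    return end - start + 1


# First interval of the BED sample above.
bed_interval = (127471196, 127472363)
gff_interval = bed_to_gff(*bed_interval)
print(gff_interval, bed_length(*bed_interval), gff_length(*gff_interval))
```

Only the start coordinate shifts during conversion; the end coordinate is numerically the same in both systems, and the interval length is preserved.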
GFF sample file
    chr1 . mRNA 1300 9000 . + . ID=mrna0001;Name=sonichedgehog
    chr1 . exon 1300 1500 . + . ID=exon00001;Parent=mrna0001
    chr1 . exon 1050 1500 . + . ID=exon00002;Parent=mrna0001
GTF and GFF files differ in column 9: to be valid, column 9 of a GTF file must include gene_id and transcript_id.
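To make the column 9 difference concrete, here is a minimal sketch that parses both attribute styles. It assumes GFF attributes are written as `key=value` pairs separated by semicolons (as in the sample above) and GTF attributes as `key "value";` pairs. The function names and the GTF example strings are hypothetical, and unquoted GTF values are deliberately ignored to keep the sketch short.

```python
import re


def parse_gff_attributes(field):
    """Parse 'ID=exon00001;Parent=mrna0001' into a dictionary."""
    attributes = {}
    for item in field.strip().split(";"):
        if not item.strip():
            continue  # tolerate a trailing semicolon
        key, _, value = item.partition("=")
        attributes[key.strip()] = value.strip()
    return attributes


# A key made of word characters, whitespace, then a double-quoted value.
_GTF_PAIR = re.compile(r'([A-Za-z_][\w.]*)\s+"([^"]*)"')


def parse_gtf_attributes(field):
    """Parse 'gene_id "g1"; transcript_id "t1";' into a dictionary."""
    return dict(_GTF_PAIR.findall(field))


def has_required_gtf_attributes(field):
    """Check that both gene_id and transcript_id are present."""
    attributes = parse_gtf_attributes(field)
    return "gene_id" in attributes and "transcript_id" in attributes


gff_attrs = parse_gff_attributes("ID=exon00001;Parent=mrna0001")
gtf_ok = has_required_gtf_attributes('gene_id "g1"; transcript_id "t1";')
gtf_bad = has_required_gtf_attributes('gene_id "g1";')
print(gff_attrs, gtf_ok, gtf_bad)
```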
5. SAM/BAM
A BAM file is the binary form of a SAM file. Both store the alignment of reads (FastQ) to a reference genome (Fasta).
Each alignment record generally contains the following 11 columns, as described in: https://samtools.github.io/hts-specs/SAMv1.pdf
Col | Field | Brief description |
---|---|---|
1 | QNAME | Query template NAME |
2 | FLAG | bitwise FLAG |
3 | RNAME | Reference sequence NAME |
4 | POS | 1-based leftmost mapping POSition |
5 | MAPQ | MAPping Quality |
6 | CIGAR | CIGAR string |
7 | RNEXT | Reference name of the mate/next read |
8 | PNEXT | Position of the mate/next read |
9 | TLEN | observed Template LENgth |
10 | SEQ | segment SEQuence |
11 | QUAL | ASCII of Phred-scaled base QUALity+33(or 64) |
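The 11 columns in the table can be unpacked from a tab-separated alignment line, and the bitwise FLAG and CIGAR string can be decoded into readable parts. The sketch below is illustrative only: the helper names and the short descriptive labels in `FLAG_BITS` are assumptions made here (the bit values follow the SAM specification linked above), and the example line is a made-up, shortened record rather than real output.

```python
import re

SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN")

# Bit values from the SAM specification; the labels are informal.
FLAG_BITS = {
    0x1: "paired", 0x2: "proper_pair", 0x4: "unmapped",
    0x8: "mate_unmapped", 0x10: "reverse", 0x20: "mate_reverse",
    0x40: "first_in_pair", 0x80: "second_in_pair", 0x100: "secondary",
    0x200: "qc_fail", 0x400: "duplicate", 0x800: "supplementary",
}


def parse_sam_line(line):
    """Split one alignment line into the 11 mandatory fields plus tags."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < 11:
        raise ValueError("an alignment line needs at least 11 columns")
    record = dict(zip(SAM_FIELDS, columns[:11]))
    for name in INT_FIELDS:
        record[name] = int(record[name])
    record["TAGS"] = columns[11:]  # optional fields such as NM:i:0
    return record


def decode_flag(flag):
    """List the labels of all bits that are set in FLAG."""
    return [label for bit, label in FLAG_BITS.items() if flag & bit]


def parse_cigar(cigar):
    """Turn '150M' into [(150, 'M')]."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


# Made-up short record for illustration.
line = "read1\t99\tChromosome1\t1\t30\t5M\t=\t101\t250\tATGGA\teeeda\tNM:i:0"
record = parse_sam_line(line)
print(decode_flag(record["FLAG"]), parse_cigar(record["CIGAR"]))
```

A FLAG of 99 is the sum 1 + 2 + 32 + 64, which is why it decodes to four labels.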
Install samtools on Linux and use it to view a BAM file
    samtools view -h demo.bam | less -S
    @HD VN:1.5 SO:coordinate
    @SQ SN:Chromosome1 LN:3942983
    V300035025L4C001R0081179505 99 Chromosome1 1 30 150M = 101 250 ATGGAGAATATATTGGATCTTTGGAATCAAGCCTTAGCTCAAATTGAGAAAAAGCTAAGCAAACCGAGCTTCGAAACTTGGATGAAGTCGACGAAAGCCCATTCGCTGCAAGGAGATACCTTAACCATCACCGCTCCCAATGAATTTGCC eeedaZeeefeeeeecdeeeeeecZcefeZ_eefefedecdeefeededeee_eeeUceeeeeeeeeedeabYedeeeeeedfeeeedfde^e_e`de_eeeefeedeeeeeeeeee`fefceecee]eeeffeceefebeeY]fcfaed NM:i:0 MD:Z:150
    V300035025L4C006R0370133480 99 Chromosome1 1 30 150M = 238 387 ATGGAGAATATATTGGATCTTTGGAATCAAGCCTTAGCTCAAATTGAGAAAAAGCTAAGCAAACCGAGCTTCGAAACTTGGATGAAGTCGACGAAAGCCCATTCGCTGCAAGGAGATACCTTAACCATCACCGCTCCCAATGAATTTGCC eeXdeedeeeeeeeebceeeee_Ucee_deZeeeeeeddeeeXedabedeecdaeeeaWeeeedeeaeeeeeecbeee]eed^YeeeceQae`ae]eebeeeeeY`e^edbeNcceeeeeceeee]e\eedebaWc_fe_dbeeeed]be NM:i:0 MD:Z:150
If you have a SAM file, you can first convert it to BAM with samtools view

    $ samtools view
    Usage: samtools view [options] <in.bam>|<in.sam>|<in.cram> [region ...]
    Options:
      -b       output BAM
      -o FILE  output file name [stdout]
Some operations require a BAM index file. To create one, first sort the BAM file with the samtools sort command, then run samtools index on the sorted file.
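The convert, sort, and index steps described above can be chained from a script. The following is a rough sketch rather than a recommended pipeline: the helper names are hypothetical, the file names `demo.sam` and `demo.sorted.bam` are made up for illustration, and `run_pipeline` assumes samtools is installed and on the PATH. The example only prints the commands and does not execute them.

```python
import subprocess


def bam_pipeline_commands(sam_path, bam_path, sorted_path):
    """Build the samtools commands for convert -> sort -> index."""
    return [
        # Convert SAM to BAM (-b), writing to bam_path (-o).
        ["samtools", "view", "-b", "-o", bam_path, sam_path],
        # Sort alignments by coordinate, which indexing requires.
        ["samtools", "sort", "-o", sorted_path, bam_path],
        # Build the index for the sorted BAM file.
        ["samtools", "index", sorted_path],
    ]


def run_pipeline(commands):
    """Run each command in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


commands = bam_pipeline_commands("demo.sam", "demo.bam", "demo.sorted.bam")
for command in commands:
    print(" ".join(command))
```

Passing each command as a list of arguments avoids shell quoting problems with unusual file names.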
References
[1] Dai L, Zheng T, Xu K, et al. A Universal Design of Betacoronavirus Vaccines against COVID-19, MERS, and SARS. Cell. 2020;182(3):722-733.e11. doi:10.1016/j.cell.2020.06.035
[2] Korber B, Fischer WM, Gnanakaran S, et al. Tracking Changes in SARS-CoV-2 Spike: Evidence that D614G Increases Infectivity of the COVID-19 Virus. Cell. 2020;182(4):812-827.e19. doi:10.1016/j.cell.2020.06.043