BGI College introduction to bioinformatics 2 - what is data?

Intuitively, data seems easy to understand, but it is surprisingly hard to define the word precisely.

Think about it: what is data?

The definition of data actually covers two aspects: the symbols that carry information, and the design of how that information is laid out.

The design of information, that is, the format of the data, determines how easily a reader can extract useful information.

One fact people often overlook: the format of data is as important as the data itself.

Data in Bioinformatics

Traditional biologists may think that bioinformatics is software that converts data into results.

In fact, bioinformatics only converts data in one format into data in another format.

These format conversions usually consolidate and refine the information along the way.

Data formats

Several common data formats in bioinformatics:

  1. GenBank
  2. Fasta
  3. FastQ
  4. BED/GFF/GTF
  5. SAM/BAM

1. GenBank

The file suffix is .gb/.genbank. It is a data format laid out to be easy for people to read.

GenBank sample file

Data source: https://www.ncbi.nlm.nih.gov/nuccore/NC_045512.2/

LOCUS       NC_045512              29903 bp ss-RNA     linear   VRL 18-JUL-2020
DEFINITION  Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1,
            complete genome.
ACCESSION   NC_045512
VERSION     NC_045512.2  GI:1798174254
DBLINK      BioProject: PRJNA485481
KEYWORDS    RefSeq.
SOURCE      Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)
  ORGANISM  Severe acute respiratory syndrome coronavirus 2
            Viruses; Riboviria; Orthornavirae; Pisuviricota; Pisoniviricetes;
            Nidovirales; Cornidovirineae; Coronaviridae; Orthocoronavirinae;
            Betacoronavirus; Sarbecovirus.
REFERENCE   1  (bases 1 to 29903)
  AUTHORS   Wu,F., Zhao,S., Yu,B., Chen,Y.M., Wang,W., Song,Z.G., Hu,Y.,
            Tao,Z.W., Tian,J.H., Pei,Y.Y., Yuan,M.L., Zhang,Y.L., Dai,F.H.,
            Liu,Y., Wang,Q.M., Zheng,J.J., Xu,L., Holmes,E.C. and Zhang,Y.Z.
  TITLE     A new coronavirus associated with human respiratory disease in
            China
  JOURNAL   Nature 579 (7798), 265-269 (2020)
   PUBMED   32015508
  REMARK    Erratum:[Nature. 2020 Apr;580(7803):E7. PMID: 32296181]
...
     gene            21563..25384
                     /gene="S"
                     /locus_tag="GU280_gp02"
                     /gene_synonym="spike glycoprotein"
                     /db_xref="GeneID:43740568"
     CDS             21563..25384
                     /gene="S"
                     /locus_tag="GU280_gp02"
                     /gene_synonym="spike glycoprotein"
                     /note="structural protein; spike protein"
                     /codon_start=1
                     /product="surface glycoprotein"
                     /protein_id="YP_009724390.1"
                     /db_xref="GI:1796318598"
                     /db_xref="GeneID:43740568"
                     /translation="MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFR
                     SSVLHSTQDLFLPFFSNVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIR
                     GWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESEFRVY
                     SSANNCTFEYVSQPFLMDLEGKQGNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPQ
                     GFSALEPLVDLPIGINITRFQTLLALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFL
                     LKYNENGTITDAVDCALDPLSETKCTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITN
                     LCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLCF
                     TNVYADSFVIRGDEVRQIAPGQTGKIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYN
                     YLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVEGFNCYFPLQSYGFQPTNGVGYQPY
                     RVVVLSFELLHAPATVCGPKKSTNLVKNKCVNFNFNGLTGTGVLTESNKKFLPFQQFG
                     RDIADTTDAVRDPQTLEILDITPCSFGGVSVITPGTNTSNQVAVLYQDVNCTEVPVAI
                     HADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSYECDIPIGAGICASYQTQTNSPR
                     RARSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTISVTTEILPVSMTKTSVDCTM
                     YICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQEVFAQVKQIYKTPPIKDFG
                     GFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDCLGDIAARDLICAQKFN
                     GLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAMQMAYRFNGIGVTQN
                     VLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNQNAQALNTLVKQLSSNFGA
                     ISSVLNDILSRLDKVEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIRASANLAATKMS
                     ECVLGQSKRVDFCGKGYHLMSFPQSAPHGVVFLHVTYVPAQEKNFTTAPAICHDGKAH
                     FPREGVFVSNGTHWFVTQRNFYEPQIITTDNTFVSGNCDVVIGIVNNTVYDPLQPELD
                     SFKEELDKYFKNHTSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESLIDLQELG
                     KYEQYIKWPWYIWLGFIAGLIAIVMVTIMLCCMTSCCSCLKGCCSCGSCCKFDEDDSE
                     PVLKGVKLHYT"
...
#The first line of the file, LOCUS, packs in several data elements:
#Name (NC_045512)
#Sequence length (29903 bp)
#Molecule type (ss-RNA, single-stranded RNA)
#Molecule topology (linear)
#GenBank classification abbreviation (VRL, viral sequences)
#Last modified date (18-JUL-2020)
LOCUS       NC_045512              29903 bp ss-RNA     linear   VRL 18-JUL-2020
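To make the breakdown above concrete, here is a minimal Python sketch that splits the LOCUS line into named fields. It assumes the whitespace-separated layout of this particular record (name, length, "bp", molecule type, topology, classification, date). The function name `parse_locus` and the dictionary keys are made up for illustration, and other records may not follow exactly the same layout.

```python
# Minimal sketch: split a GenBank LOCUS line into named fields.
# Assumption: the line has exactly the 8 whitespace-separated tokens seen in
# the sample record; LOCUS lines with a different layout are rejected.

def parse_locus(line: str) -> dict:
    parts = line.split()
    if len(parts) != 8 or parts[0] != "LOCUS" or parts[3] != "bp":
        raise ValueError(f"unexpected LOCUS layout: {line!r}")
    return {
        "name": parts[1],
        "length_bp": int(parts[2]),
        "molecule_type": parts[4],
        "topology": parts[5],
        "division": parts[6],   # classification abbreviation, e.g. VRL
        "modified": parts[7],
    }

locus = parse_locus(
    "LOCUS       NC_045512              29903 bp ss-RNA     linear   VRL 18-JUL-2020"
)
print(locus)
```

For real work, a dedicated parser is a safer choice than splitting on whitespace, since the format has more corner cases than this one line shows.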

 

GenBank classification abbreviation

Abbreviation  Full name
PRI           primate sequences
ROD           rodent sequences
MAM           other mammalian sequences
VRT           other vertebrate sequences
INV           invertebrate sequences
PLN           plant, fungal, and algal sequences
BCT           bacterial sequences
VRL           viral sequences
PHG           bacteriophage sequences
SYN           synthetic sequences
UNA           unannotated sequences
EST           EST sequences (expressed sequence tags)
PAT           patent sequences
STS           STS sequences (sequence tagged sites)
GSS           GSS sequences (genome survey sequences)
HTG           HTG sequences (high-throughput genomic sequences)
HTC           unfinished high-throughput cDNA sequencing
ENV           environmental sampling sequences

Sharp-eyed readers will have spotted it at a glance: the GenBank sample file shows the genome of SARS-CoV-2, the virus behind COVID-19, which is wreaking havoc all over the world.

 

Schematic diagram of the SARS-CoV-2 structure

Source: Alissa Eckert, MS; Dan Higgins, MAM CDC

If you are familiar with how COVID-19 subunit vaccines are developed, you probably know that the receptor-binding domain (RBD) of the spike glycoprotein shown above contains multiple B-cell and T-cell epitopes, which makes it an ideal target antigen.

However, recombinant target proteins tend to be poorly immunogenic, and they often need to be optimized before they stimulate the body to produce enough antibodies.

By combining dimerized RBD fragments with immune adjuvants, Academician Gao Fu's team compensated for the poor immunogenicity of the recombinant protein and successfully induced high levels of neutralizing antibodies in mice [1].

In addition, the Delta variant of SARS-CoV-2 currently circulating worldwide arose from mutations at amino acid sites in the S protein [2].

As this shows, GenBank is a fairly complex storage format that holds rich biological information.
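As a small illustration of how much can be read out of a single annotation line, the sketch below parses the `CDS             21563..25384` feature line from the sample and works out the length of the coding region. It only handles the simplest location form, `start..end`; the function name `parse_feature_line` is hypothetical, and real GenBank files also use operators such as `complement(...)` and `join(...)` that are not covered here.

```python
import re

# Only the plain "start..end" location form is supported in this sketch.
SIMPLE_LOCATION = re.compile(r"^(\d+)\.\.(\d+)$")

def parse_feature_line(line: str):
    """Return (feature_type, start, end) for a feature key line."""
    feature_type, location = line.split()
    match = SIMPLE_LOCATION.match(location)
    if match is None:
        raise ValueError(f"unsupported location: {location!r}")
    return feature_type, int(match.group(1)), int(match.group(2))

feature_type, start, end = parse_feature_line("     CDS             21563..25384")

# GenBank coordinates are 1-based and inclusive, hence the +1.
length_nt = end - start + 1
codons = length_nt // 3
print(feature_type, length_nt, codons)
```

For the S gene this gives 3822 nucleotides, or 1274 codons. If the last codon is a stop codon, that corresponds to a 1273-residue protein.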

2. Fasta

The file suffix is usually .fa/.fasta/.fna/.seq. A Fasta file records sequence information similar to that found in a GenBank file.

Fasta sample file: gene sequence of the SARS-CoV-2 M protein

>NC_045512.2:26523-27191 M [organism=Severe acute respiratory syndrome coronavirus 2] [GeneID=43740571] [chromosome=]
ATGGCAGATTCCAACGGTACTATTACCGTTGAAGAGCTTAAAAAGCTCCTTGAACAATGGAACCTAGTAA
TAGGTTTCCTATTCCTTACATGGATTTGTCTTCTACAATTTGCCTATGCCAACAGGAATAGGTTTTTGTA
TATAATTAAGTTAATTTTCCTCTGGCTGTTATGGCCAGTAACTTTAGCTTGTTTTGTGCTTGCTGCTGTT
TACAGAATAAATTGGATCACCGGTGGAATTGCTATCGCAATGGCTTGTCTTGTAGGCTTGATGTGGCTCA
GCTACTTCATTGCTTCTTTCAGACTGTTTGCGCGTACGCGTTCCATGTGGTCATTCAATCCAGAAACTAA
CATTCTTCTCAACGTGCCACTCCATGGCACTATTCTGACCAGACCGCTTCTAGAAAGTGAACTCGTAATC
GGAGCTGTGATCCTTCGTGGACATCTTCGTATTGCTGGACACCATCTAGGACGCTGTGACATCAAGGACC
TGCCTAAAGAAATCACTGTTGCTACATCACGAACGCTTTCTTATTACAAATTGGGAGCTTCGCAGCGTGT
AGCAGGTGACTCAGGTTTTGCTGCATACAGTCGCTACAGGATTGGCAACTATAAATTAAACACAGACCAT
TCCAGTAGCAGTGACAATATTGCTTTGCTTGTACAGTAA
 

A Fasta record consists of a description line followed by one or more lines of base sequence

# Description line of the sequence; it starts with a greater-than sign (>)
>NC_045512.2:26523-27191 M [organism=Severe acute respiratory syndrome coronavirus 2] [GeneID=43740571] [chromosome=]
# Base sequence
ATGGCAGATTCCAACGGTACTATTACCGTTGAAGAGCTTAAAAAGCTCCTTGAACAATGGAACCTAGTAA
TAGGTTTCCTATTCCTTACATGGATTTGTCTTCTACAATTTGCCTATGCCAACAGGAATAGGTTTTTGTA
TATAATTAAGTTAATTTTCCTCTGGCTGTTATGGCCAGTAACTTTAGCTTGTTTTGTGCTTGCTGCTGTT
TACAGAATAAATTGGATCACCGGTGGAATTGCTATCGCAATGGCTTGTCTTGTAGGCTTGATGTGGCTCA
GCTACTTCATTGCTTCTTTCAGACTGTTTGCGCGTACGCGTTCCATGTGGTCATTCAATCCAGAAACTAA
CATTCTTCTCAACGTGCCACTCCATGGCACTATTCTGACCAGACCGCTTCTAGAAAGTGAACTCGTAATC
GGAGCTGTGATCCTTCGTGGACATCTTCGTATTGCTGGACACCATCTAGGACGCTGTGACATCAAGGACC
TGCCTAAAGAAATCACTGTTGCTACATCACGAACGCTTTCTTATTACAAATTGGGAGCTTCGCAGCGTGT
AGCAGGTGACTCAGGTTTTGCTGCATACAGTCGCTACAGGATTGGCAACTATAAATTAAACACAGACCAT
TCCAGTAGCAGTGACAATATTGCTTTGCTTGTACAGTAA
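Because the layout is so simple, a Fasta reader fits in a few lines. The following is a minimal sketch rather than a full parser: the function name `parse_fasta` and the short `demo_record` input are made up for illustration, and it loads everything into memory, which would not suit very large files.

```python
def parse_fasta(text: str) -> dict:
    """Map each description line (without the leading '>') to its sequence."""
    records = {}
    header = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is None:
            raise ValueError("sequence data found before the first '>' line")
        else:
            records[header].append(line)  # sequence may span several lines
    return {h: "".join(chunks) for h, chunks in records.items()}

# Hypothetical two-line record, just to exercise the function.
example = """>demo_record hypothetical example
ATGGCAGATTCC
AACGGTACTATT
"""
records = parse_fasta(example)
for header, seq in records.items():
    seq_id = header.split()[0]            # first word is usually the ID
    print(seq_id, len(seq))
```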

3. FastQ

The file suffix is .fq/.fastq. FastQ stores the bases read by the sequencer during a sequencing experiment, and can be regarded as a Fasta file with a quality score attached to every base.

In a FastQ file, every 4 lines make up one sequencing read record

#Line 1: @ followed by the read ID
#Line 2: base sequence
#Line 3: plus sign (+), optionally followed by the read ID again
#Line 4: Phred quality scores, one character per base
 

FastQ sample file

@SRR16911464.1 1 length=35
GGCTGCTTATGTAGACAATTTTAGTCTTACTATTA
+SRR16911464.1 1 length=35
BBBBBBFFFFFFGGGGGGGGGGHHHHGHGHHHHHH
@SRR16911464.2 2 length=36
GACAATGCTCAGGTGTTACTTTCCAAAGTGCAGTGA
+SRR16911464.2 2 length=36
AAABBFFFFFBBFGCGEGGGGGFFFFGFHHFHH5FG
@SRR16911464.3 3 length=37
CTATGTAATCATCAGATTCAACTTGCATGGCATTGTT
+SRR16911464.3 3 length=37
CCDEDFFFFFFFGGGGGGGGGGHHHHHHHHHHHHHHH
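The quality line becomes easier to read once the characters are turned back into numbers. Below is a minimal sketch that walks through 4-line records and decodes the scores. It assumes the Phred+33 encoding (score = ASCII code minus 33), which matches the characters in the sample above but is not guaranteed for older data; the function name `parse_fastq` is hypothetical.

```python
def parse_fastq(text: str):
    """Yield (read_id, sequence, scores) from 4-line FastQ records."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("FastQ input must contain a multiple of 4 lines")
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record number {i // 4 + 1}")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        # Assumed Phred+33 encoding: subtract 33 from each ASCII code.
        scores = [ord(ch) - 33 for ch in qual]
        yield header[1:].split()[0], seq, scores

sample = """@SRR16911464.1 1 length=35
GGCTGCTTATGTAGACAATTTTAGTCTTACTATTA
+SRR16911464.1 1 length=35
BBBBBBFFFFFFGGGGGGGGGGHHHHGHGHHHHHH
"""
reads = list(parse_fastq(sample))
read_id, seq, scores = reads[0]
print(read_id, len(seq), min(scores), max(scores))
```

Under that assumption, the first read in the sample has scores between 33 (`B`) and 39 (`H`).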
 

4. BED/GFF/GTF

These formats are mainly used to record the coordinates of specific intervals in the genome, such as genes, coding sequences (CDS) and untranslated regions (UTR). The columns are separated by tabs.

BED

A 3-column BED file contains the chromosome, start position and end position

chr7  127471196  127472363
chr7  127472363  127473530
chr7  127473530  127474697
 

A 6-column BED file adds a name, a score and the strand

chr7  127471196  127472363  Pos1  0  +
chr7  127472363  127473530  Pos2  0  +
chr7  127473530  127474697  Pos3  0  +
 

GFF/GTF

GFF/GTF files typically contain 9 columns separated by tabs.

P.S.   BED coordinates start at 0 and the end position is exclusive, whereas GFF/GTF coordinates start at 1 and the end position is inclusive
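This off-by-one difference is easy to get wrong when moving intervals between the two families of formats. As a minimal sketch (the function names are made up for illustration), converting in either direction only touches the start coordinate:

```python
def bed_to_gff_coords(bed_start: int, bed_end: int):
    """Convert a 0-based, end-exclusive BED interval to 1-based, end-inclusive."""
    if bed_start < 0 or bed_end <= bed_start:
        raise ValueError("invalid or empty BED interval")
    return bed_start + 1, bed_end

def gff_to_bed_coords(gff_start: int, gff_end: int):
    """Convert a 1-based, end-inclusive interval to 0-based, end-exclusive."""
    if gff_start < 1 or gff_end < gff_start:
        raise ValueError("invalid GFF interval")
    return gff_start - 1, gff_end

# First interval of the 3-column BED sample above.
bed_start, bed_end = 127471196, 127472363
gff_start, gff_end = bed_to_gff_coords(bed_start, bed_end)
print(gff_start, gff_end)

# The interval length is the same in both conventions.
bed_length = bed_end - bed_start
gff_length = gff_end - gff_start + 1
print(bed_length, gff_length)
```

The end coordinate stays the same because an exclusive end in a 0-based system and an inclusive end in a 1-based system point at the same base.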

GFF sample file

chr1  .  mRNA  1300  9000  .  +  .  ID=mrna0001;Name=sonichedgehog
chr1  .  exon  1300  1500  .  +  .  ID=exon00001;Parent=mrna0001
chr1  .  exon  1050  1500  .  +  .  ID=exon00002;Parent=mrna0001
 

The difference between GTF and GFF files lies in column 9: for a GTF file to be valid, column 9 must include gene_id and transcript_id.

5. SAM/BAM

A BAM file is the binary form of a SAM file. Both store the alignments of reads (FastQ) to a reference genome (Fasta).
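To show what that looks like in practice, here is a minimal sketch that parses column 9 in both styles. GFF3 writes attributes as `key=value` pairs separated by semicolons, as in the sample above, while GTF writes them as `key "value";` pairs. The function names and the GTF strings are hypothetical examples; values that themselves contain semicolons and repeated keys are not handled.

```python
def parse_gff3_attributes(column9: str) -> dict:
    """Parse GFF3-style 'key=value;key=value' attributes."""
    attrs = {}
    for field in column9.strip().split(";"):
        if field:
            key, _, value = field.partition("=")
            attrs[key.strip()] = value.strip()
    return attrs

def parse_gtf_attributes(column9: str) -> dict:
    """Parse GTF-style 'key "value"; key "value";' attributes."""
    attrs = {}
    for field in column9.strip().split(";"):
        field = field.strip()
        if field:
            key, _, value = field.partition(" ")
            attrs[key] = value.strip().strip('"')
    return attrs

def has_required_gtf_ids(column9: str) -> bool:
    """Check the rule stated above: gene_id and transcript_id must be present."""
    attrs = parse_gtf_attributes(column9)
    return "gene_id" in attrs and "transcript_id" in attrs

gff_attrs = parse_gff3_attributes("ID=exon00001;Parent=mrna0001")
print(gff_attrs)

# Hypothetical GTF attribute strings for illustration.
print(has_required_gtf_ids('gene_id "gene0001"; transcript_id "mrna0001";'))
print(has_required_gtf_ids('gene_id "gene0001";'))
```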

Each alignment record contains the following 11 columns of information, as described in: https://samtools.github.io/hts-specs/SAMv1.pdf

Col  Field  Brief description
1    QNAME  Query template NAME
2    FLAG   bitwise FLAG
3    RNAME  Reference sequence NAME
4    POS    1-based leftmost mapping POSition
5    MAPQ   MAPping Quality
6    CIGAR  CIGAR string
7    RNEXT  Reference name of the mate/next read
8    PNEXT  Position of the mate/next read
9    TLEN   observed Template LENgth
10   SEQ    segment SEQuence
11   QUAL   ASCII of Phred-scaled base QUALity+33 (or 64)

With samtools installed on Linux, you can view a BAM file like this

samtools view -h demo.bam | less -S
@HD     VN:1.5  SO:coordinate
@SQ     SN:Chromosome1  LN:3942983
V300035025L4C001R0081179505     99      Chromosome1     1       30      150M    =       101     250     ATGGAGAATATATTGGATCTTTGGAATCAAGCCTTAGCTCAAATTGAGAAAAAGCTAAGCAAACCGAGCTTCGAAACTTGGATGAAGTCGACGAAAGCCCATTCGCTGCAAGGAGATACCTTAACCATCACCGCTCCCAATGAATTTGCC        eeedaZeeefeeeeecdeeeeeecZcefeZ_eefefedecdeefeededeee_eeeUceeeeeeeeeedeabYedeeeeeedfeeeedfde^e_e`de_eeeefeedeeeeeeeeee`fefceecee]eeeffeceefebeeY]fcfaed  NM:i:0  MD:Z:150
V300035025L4C006R0370133480     99      Chromosome1     1       30      150M    =       238     387     ATGGAGAATATATTGGATCTTTGGAATCAAGCCTTAGCTCAAATTGAGAAAAAGCTAAGCAAACCGAGCTTCGAAACTTGGATGAAGTCGACGAAAGCCCATTCGCTGCAAGGAGATACCTTAACCATCACCGCTCCCAATGAATTTGCC        eeXdeedeeeeeeeebceeeee_Ucee_deZeeeeeeddeeeXedabedeecdaeeeaWeeeedeeaeeeeeecbeee]eed^YeeeceQae`ae]eebeeeeeY`e^edbeNcceeeeeceeee]e\eedebaWc_fe_dbeeeed]be  NM:i:0  MD:Z:150
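Two of the columns in these records are compact encodings: FLAG (99 here) is a bit field, and CIGAR (150M here) describes how the read lines up against the reference. The sketch below decodes both. The bit values follow the SAM specification linked above, but the short labels in `FLAG_BITS` and the function names are made up for illustration.

```python
import re

# Bit values from the SAM specification; the labels are informal shorthand.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}

def decode_flag(flag: int) -> list:
    """List the labels of all bits set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]

def parse_cigar(cigar: str) -> list:
    """Split a CIGAR string into (length, operation) pairs."""
    ops = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    if "".join(n + op for n, op in ops) != cigar:
        raise ValueError(f"cannot parse CIGAR: {cigar!r}")
    return [(int(n), op) for n, op in ops]

def reference_end(pos: int, cigar: str) -> int:
    """1-based inclusive end of the alignment on the reference."""
    # Only M, D, N, = and X operations advance along the reference.
    consumed = sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")
    return pos + consumed - 1

flags = decode_flag(99)
print(flags)
print(parse_cigar("150M"), reference_end(1, "150M"))
```

So FLAG 99 (1 + 2 + 32 + 64) describes the first read of a properly paired pair whose mate maps to the reverse strand, and a 150M alignment starting at position 1 covers reference positions 1 to 150. A CIGAR of `*` (alignment unavailable) is rejected by this sketch.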
 

If you have a SAM file instead, you can first convert it to BAM with samtools view

$ samtools view

Usage: samtools view [options] <in.bam>|<in.sam>|<in.cram> [region ...]

Options:
  -b       output BAM
  -o FILE  output file name [stdout]
 

Some tools also need a BAM index file. To create one, first sort the BAM file with the samtools sort command, and then run samtools index on the sorted file.

References
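Put together, the convert, sort and index steps can be scripted. The following is a minimal sketch that only builds and prints the commands; `run_pipeline` executes them and requires samtools on the PATH. The file names (`demo.sam`, `demo.bam`, `demo.sorted.bam`) and function names are hypothetical, and the only options used are `-b` and `-o` for `samtools view` (shown in the usage text above) and `-o` for `samtools sort`.

```python
import shutil
import subprocess

def build_commands(in_sam: str, prefix: str) -> list:
    """Return the samtools command lines for convert -> sort -> index."""
    bam = f"{prefix}.bam"
    sorted_bam = f"{prefix}.sorted.bam"
    return [
        ["samtools", "view", "-b", "-o", bam, in_sam],   # SAM -> BAM
        ["samtools", "sort", "-o", sorted_bam, bam],     # sort the BAM file
        ["samtools", "index", sorted_bam],               # write the index
    ]

def run_pipeline(in_sam: str, prefix: str) -> None:
    """Run the three steps in order, stopping at the first failure."""
    if shutil.which("samtools") is None:
        raise RuntimeError("samtools was not found on PATH")
    for cmd in build_commands(in_sam, prefix):
        subprocess.run(cmd, check=True)

# Only print the commands here; call run_pipeline() to actually execute them.
commands = build_commands("demo.sam", "demo")
for cmd in commands:
    print(" ".join(cmd))
```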
[1] Dai L, Zheng T, Xu K, et al. A Universal Design of Betacoronavirus Vaccines against COVID-19, MERS, and SARS. Cell. 2020;182(3):722-733.e11. doi:10.1016/j.cell.2020.06.035
[2] Korber B, Fischer WM, Gnanakaran S, et al. Tracking Changes in SARS-CoV-2 Spike: Evidence that D614G Increases Infectivity of the COVID-19 Virus. Cell. 2020;182(4):812-827.e19. doi:10.1016/j.cell.2020.06.043

Posted by hank__22 on Sun, 28 Nov 2021 20:03:15 -0800