Getting Started with Transcriptomics (4): Understanding Reference Genomes and Gene Annotations

Keywords: Linux ftp Database github

Task list
  • 1. Download the hg19 reference genome from UCSC;
  • 2. Download the gene annotation file from the GENCODE database and use IGV to view the structure of genes of interest, such as TP53, KRAS and EGFR;
  • 3. Take IGV screenshots showing the structure of several genes;
  • 4. Download the Ensembl and NCBI annotation files, load them into IGV as well, and screenshot the gene structure;
  • 5. Learn the basics of using IGV.
Download the hg19 reference genome from UCSC
hg19, GRCh37 and Ensembl 75 are the three genome version names you will see most often. They refer to essentially the same FASTA sequence, as released by UCSC, NCBI and Ensembl respectively; the most visible practical difference is chromosome naming (for example chr1 in UCSC versus 1 in Ensembl). There are also some niche reference genomes that store different sequences, such as the Yanhuang genome produced by BGI, the genome of Watson (one of the proposers of the DNA double helix structure), and the Korean genome published in Nature in 2016, touted as the most complete. We will not consider these niche genomes for now and will mainly download hg19 and hg38, both provided by UCSC. hg38 improves on hg19 in many ways, but so far a lot of annotation information still uses the hg19 coordinate system, so we download both and explore them ourselves. While we are at it, we also download the latest mouse reference genome. The alignment can run while we sleep anyway, and afterwards we can check the results to see whether the mapping rate is unusually low.
mkdir -p rna_seq/data/reference && cd rna_seq/data/reference
mkdir -p genome/hg19 && cd genome/hg19
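# Alternative download with wget:
# nohup wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz &
# nohup keeps the command running after you log out (it ignores the hangup signal); & runs it in the background.
# "nohup COMMAND &" therefore lets a long job keep running in the background after the terminal is closed.
nohup axel http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz &
# Wait until the download has finished before extracting
tar zxvf chromFa.tar.gz
# Merge the per-chromosome files into a single FASTA, then delete them
cat chr*.fa > hg19.fa
rm chr*.fa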
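After merging, it is worth checking that the combined file really contains one header line per sequence. The following is only a minimal sketch on toy files: the temporary directory, the three tiny sequences and the variable name n_seqs are made up for illustration. On the real data, the two grep commands would simply be run on hg19.fa.

```shell
# Toy stand-ins for the per-chromosome FASTA files (not real sequence)
workdir=$(mktemp -d)
printf '>chr1\nACGTACGT\n' > "$workdir/chr1.fa"
printf '>chr2\nGGCCTTAA\n' > "$workdir/chr2.fa"
printf '>chrM\nGATCACAG\n' > "$workdir/chrM.fa"

# Merge step, as in the text
cat "$workdir"/chr*.fa > "$workdir/hg19.fa"

# Every sequence starts with a ">" header line, so counting headers counts sequences
n_seqs=$(grep -c '^>' "$workdir/hg19.fa")
echo "sequences in merged file: $n_seqs"

# List the sequence names to confirm nothing is missing
grep '^>' "$workdir/hg19.fa" | sed 's/^>//'
```

If the count is lower than the number of files that were extracted, the merge was probably run before the download or extraction had finished.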
Download the gene annotation file from the GENCODE database and use IGV to view the structure of genes of interest
Download Gene Annotation Files
Official website: http://www.gencodegenes.org/releases/26lift37.html
wget ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_26/GRCh37_mapping/gencode.v26lift37.annotation.gtf.gz
gzip -d gencode.v26lift37.annotation.gtf.gz
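Before loading the annotation into IGV, a quick look at its layout helps. A GTF file is tab-separated with nine columns (sequence name, source, feature, start, end, score, strand, frame, attributes). The sketch below counts the feature types in a toy file; the file name toy.gtf and its records are invented for illustration, and on real data the same pipeline would be pointed at gencode.v26lift37.annotation.gtf.

```shell
workdir=$(mktemp -d)
gtf="$workdir/toy.gtf"

# Toy GTF records (coordinates are illustrative, not real annotation)
{
  printf '##description: toy annotation\n'
  printf 'chr17\tHAVANA\tgene\t100\t900\t.\t-\t.\tgene_id "G1"; gene_name "TP53";\n'
  printf 'chr17\tHAVANA\ttranscript\t100\t900\t.\t-\t.\tgene_id "G1"; gene_name "TP53";\n'
  printf 'chr17\tHAVANA\texon\t100\t300\t.\t-\t.\tgene_id "G1"; gene_name "TP53";\n'
  printf 'chr17\tHAVANA\texon\t600\t900\t.\t-\t.\tgene_id "G1"; gene_name "TP53";\n'
} > "$gtf"

# Skip "#" header lines, take column 3 (feature type) and count each value
feature_counts=$(grep -v '^#' "$gtf" | cut -f 3 | sort | uniq -c | awk '{print $2 "=" $1}')
echo "$feature_counts"
```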
 
Download and install IGV and bedtools
Official website: http://software.broadinstitute.org/software/igv/download (download Binary Distribution version)
wget https://github.com/arq5x/bedtools2/releases/download/v2.26.0/bedtools-2.26.0.tar.gz
tar -zxvf bedtools-2.26.0.tar.gz
cd bedtools2
make
 
IGV screenshots of the structure of several genes
Batch screenshots: TP53, KRAS, EGFR
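# Keep only "gene" records, then match the exact gene_name attribute
# (a plain grep -w 'EGFR' would also pick up names such as EGFR-AS1)
grep -w 'gene' gencode.v26lift37.annotation.gtf | grep 'gene_name "TP53";' | cut -f 1,4,5 >> gene.bed
grep -w 'gene' gencode.v26lift37.annotation.gtf | grep 'gene_name "KRAS";' | cut -f 1,4,5 >> gene.bed
grep -w 'gene' gencode.v26lift37.annotation.gtf | grep 'gene_name "EGFR";' | cut -f 1,4,5 >> gene.bed
# Turn the regions in gene.bed into an IGV batch script
~/biosoft/bedtools2/bin/bedtools igv -i gene.bed > Batch_snapshot.txt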
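The extraction step can also be sketched with awk, which makes two details explicit: the gene name is matched exactly, and the start coordinate is shifted by one because GTF is 1-based while BED is 0-based. The second half writes a small IGV batch script by hand, to show the kind of goto/snapshot commands such a script contains (IGV can run one through Tools > Run Batch Script). Everything here is an assumption for illustration: the toy GTF records, the file names and the loop over two genes. It is not meant to reproduce the exact output of bedtools igv.

```shell
workdir=$(mktemp -d)
gtf="$workdir/toy.gtf"

# Toy records; coordinates are illustrative only
{
  printf 'chr7\tHAVANA\tgene\t1000\t5000\t.\t+\t.\tgene_id "G1"; gene_name "EGFR";\n'
  printf 'chr7\tHAVANA\tgene\t4000\t6000\t.\t-\t.\tgene_id "G2"; gene_name "EGFR-AS1";\n'
  printf 'chr7\tHAVANA\texon\t1000\t1200\t.\t+\t.\tgene_id "G1"; gene_name "EGFR";\n'
  printf 'chr17\tHAVANA\tgene\t7000\t9000\t.\t-\t.\tgene_id "G3"; gene_name "TP53";\n'
} > "$gtf"

bed="$workdir/gene.bed"
: > "$bed"   # start from an empty file so reruns do not append duplicates

for gene in TP53 EGFR; do
  # Keep "gene" features whose gene_name attribute is exactly $gene;
  # BED starts are 0-based, so subtract 1 from the GTF start
  awk -F '\t' -v g="$gene" '
    $3 == "gene" && index($9, "gene_name \"" g "\";") {
      print $1 "\t" ($4 - 1) "\t" $5 "\t" g
    }' "$gtf" >> "$bed"
done
cat "$bed"

# Hand-written IGV batch script: one goto/snapshot pair per BED region
batch="$workdir/igv_batch.txt"
{
  echo "snapshotDirectory $workdir"
  awk -F '\t' '{
    print "goto " $1 ":" ($2 + 1) "-" $3
    print "snapshot " $4 ".png"
  }' "$bed"
} > "$batch"
cat "$batch"
```

Note that the EGFR-AS1 record is skipped here, whereas a whole-word grep for EGFR would keep it, because "-" is not a word character.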
grep is a versatile text search tool that is used constantly on Linux. The search pattern can be a fixed string or a shell variable. There are two basic uses:
  • 1. If the pattern contains no spaces, you can pass it to grep directly, e.g. grep pass a.txt, which prints the lines of a.txt that contain pass.
  • 2. If the pattern contains a space, enclose it in single or double quotes, e.g. grep "hello all" a.txt or grep 'hello all' a.txt. Without the quotes, grep searches for hello in two files named all and a.txt, and reports an error if the file all does not exist, which is not what was intended.
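grep -w PATTERN FILE: whole-word search. A line is reported only when the match forms a complete word, i.e. it is not directly preceded or followed by a letter, digit or underscore. For example, grep -w gene a.txt matches gene but not genes or gene_id. Note that -w does not make the pattern literal: regular-expression characters such as * keep their special meaning (use -F for a literal match), and a pattern containing * should be quoted so that the shell does not expand it first.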
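A small sketch of these grep behaviours on a throwaway file; the file contents and variable names are invented for the demonstration:

```shell
workdir=$(mktemp -d)
f="$workdir/a.txt"
printf 'gene\ngene_id\ngenes\nhello all\nhello\n' > "$f"

# Plain grep matches "gene" anywhere in the line: gene, gene_id and genes
n_any=$(grep -c 'gene' "$f")

# -w only matches whole words; "_" counts as a word character,
# so gene_id and genes are not reported
n_word=$(grep -c -w 'gene' "$f")

# A pattern containing a space must be quoted to stay a single argument
n_phrase=$(grep -c 'hello all' "$f")

# -F treats the pattern as a literal string, so "*" is just a character
n_literal=$(printf 'b*\nbbb\n' | grep -c -F 'b*')

echo "any=$n_any word=$n_word phrase=$n_phrase literal=$n_literal"
```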
Pipe operator: "|". It passes the standard output of the command on its left to the command on its right as standard input. Only standard output goes through the pipe; standard error is not passed on unless it is redirected explicitly.
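The difference between standard output and standard error in a pipe can be seen with a tiny helper function. The function name emit is hypothetical and exists only for this demonstration:

```shell
# A command that writes one line to stdout and one line to stderr
emit() {
  echo "to stdout"
  echo "to stderr" >&2
}

# Only stdout travels through the pipe, so wc counts a single line
# (the stderr line bypasses the pipe and is printed directly)
n_stdout=$(emit | wc -l | tr -d ' ')

# 2>&1 merges stderr into stdout before the pipe, so wc counts both lines
n_both=$(emit 2>&1 | wc -l | tr -d ' ')

echo "stdout only: $n_stdout, with stderr merged: $n_both"
```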
The cut command extracts bytes, characters or fields from each line of a file and writes them to standard output. If no file is given, cut reads standard input. One of the -b, -c or -f options must be specified; -f extracts the listed fields, which are tab-separated by default.
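As a quick illustration of cut on one GTF-style line (the record and its coordinates are toy values, not real annotation):

```shell
# One tab-separated, GTF-style record held in a variable
line=$(printf 'chr17\tHAVANA\tgene\t100\t900\t.\t-\t.\tgene_name "TP53";')

# -f selects tab-separated fields: 1 = sequence name, 4 = start, 5 = end
fields=$(printf '%s\n' "$line" | cut -f 1,4,5)

# -c selects characters by position: here the first five characters
first_chars=$(printf '%s\n' "$line" | cut -c 1-5)

printf '%s\n' "$fields"
printf '%s\n' "$first_chars"
```

This is the same field selection (columns 1, 4 and 5) used when building gene.bed.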
Download the Ensembl and NCBI annotation files (GTF/GFF3)
axel ftp://ftp.ensembl.org/pub/grch37/release-89/gtf/homo_sapiens/Homo_sapiens.GRCh37.87.gtf.gz
axel ftp://ftp.ensembl.org/pub/grch37/release-89/gtf/homo_sapiens/Homo_sapiens.GRCh37.87.chr.gtf.gz
axel ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/ARCHIVE/ANNOTATION_RELEASE.105/GFF/ref_GRCh37.p13_top_level.gff3.gz
axel ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/ARCHIVE/ANNOTATION_RELEASE.105/GFF/ref_GRCh37.p13_scaffolds.gff3.gz
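The sources do not name sequences the same way: Ensembl files use names like 1 and MT, UCSC and GENCODE use chr1 and chrM, and NCBI GFF3 files use RefSeq accessions. It can therefore help to list the distinct names in column 1 before comparing tracks in IGV. The sketch below does this on two toy gzipped files; the file names, records and the seqnames helper are hypothetical, and on real data the helper would be given the downloaded .gtf.gz or .gff3.gz files.

```shell
workdir=$(mktemp -d)

# Toy compressed annotation files imitating the two naming styles
{
  printf '1\tensembl\tgene\t100\t900\t.\t+\t.\tgene_id "E1";\n'
  printf 'MT\tensembl\tgene\t10\t90\t.\t+\t.\tgene_id "E2";\n'
} | gzip > "$workdir/ensembl_toy.gtf.gz"
printf 'chr1\tHAVANA\tgene\t100\t900\t.\t+\t.\tgene_id "G1";\n' \
  | gzip > "$workdir/gencode_toy.gtf.gz"

# Print the distinct sequence names (column 1) of a gzipped annotation file
seqnames() {
  gzip -dc "$1" | grep -v '^#' | cut -f 1 | sort -u | tr '\n' ' '
}

ensembl_names=$(seqnames "$workdir/ensembl_toy.gtf.gz")
gencode_names=$(seqnames "$workdir/gencode_toy.gtf.gz")
echo "Ensembl style: $ensembl_names"
echo "GENCODE style: $gencode_names"
```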

Posted by ident on Wed, 05 Jun 2019 11:44:22 -0700