Exploration of processing uBiome-like data with DADA2 (6)

Keywords: Database Python REST

Next, let's process the reference database to see whether reprocessing its sequences can improve the accuracy of species annotation. My prediction is that the classification results will not improve much, and the reason is probably the inherent limitation of sequence length: no matter how much we polish the workflow, we cannot solve the fundamental problem. A 250 bp read, compared with the roughly 1500 bp full-length 16S gene, is simply too short for accurate classification. To do better, we can only look to full-length 16S from PacBio or Oxford Nanopore, or to 10x linked reads and similar technologies such as BGI's stLFR, which effectively extend read length. A more radical option is to wait until sequencing is cheap enough and move up to metagenomes and metatranscriptomes.

Anyway, here is a record of my exploration.

1. Use QIIME 2 to extract the V4 region

In fact, I could have written my own script, but I know my own limits: when a mature tool from the experts is available, use it. It has rich experience behind it and solves the primer-matching problem properly, for example tolerating a couple of mismatched bases to simulate the mismatches that occur in real PCR. If I did it myself, I could only handle exact matching with regular expressions, which is bound to miss some sequences and hurt accuracy, especially if the missed sequences happen to be exactly the ones you need. Here are my steps:
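To illustrate what mismatch-tolerant primer matching involves, here is a minimal toy sketch of my own (not the algorithm QIIME 2 actually uses): it slides the primer along the sequence, honours IUPAC degenerate bases, and tolerates up to a fixed number of mismatches. The function name `find_primer` and the `max_mismatch` parameter are names I made up for this example.

```python
# IUPAC degenerate-base lookup: which real bases each primer character accepts
IUPAC = {
    'A': 'A', 'C': 'C', 'G': 'G', 'T': 'T',
    'M': 'AC', 'R': 'AG', 'W': 'AT', 'S': 'CG',
    'Y': 'CT', 'K': 'GT', 'V': 'ACG', 'N': 'ACGT',
}

def find_primer(seq, primer, max_mismatch=2):
    """Slide the primer along seq; return the first start position where
    at most max_mismatch bases disagree, or -1 if there is no such hit."""
    n, m = len(seq), len(primer)
    for i in range(n - m + 1):
        mismatches = 0
        for s, p in zip(seq[i:i + m], primer):
            if s not in IUPAC.get(p, ''):
                mismatches += 1
                if mismatches > max_mismatch:
                    break
        else:  # inner loop finished without exceeding the budget
            return i
    return -1
```

With the 515F primer, `find_primer(seq, 'GTGYCAGCMGCCGCGGTAA')` finds the primer even when one or two bases differ, which an exact regular expression would miss.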

#Activate conda and enter the qiime2 environment; if you still need to set up conda, see [1]
source ~/Miniconda3/bin/activate
conda activate qiime2-2019.10
#Import the raw sequences from the Silva 132 database (99% similarity clustering)
#Download from https://www.arb-silva.de/fileadmin/silva_databases/qiime/Silva_132_release.zip
qiime tools import \
  --type 'FeatureData[Sequence]' \
  --input-path SILVA_132_QIIME_release/rep_set/rep_set_16S_only/silva_132_99_16S.fna \
  --output-path s99_otus.qza
#Cut out the V4 region with 515F/806R, the latest V4 primers from the EMP 16S protocol [2]
time qiime feature-classifier extract-reads \
  --i-sequences s99_otus.qza \
  --p-f-primer GTGYCAGCMGCCGCGGTAA \
  --p-r-primer GGACTACNVGGGTWTCTAAT \
  --o-reads sref-seqs.qza

2. Sequence exploration and processing

First, extract the sequences. A QIIME 2 .qza file is really just a zip archive: rename sref-seqs.qza, unzip it, and find the sequence file dna-sequences.fasta in the data subfolder; it contains 368,240 sequences.

I did a quick count of the sequence length distribution and searched for the primers. It turns out that only a few sequences still contain a primer, and generally only at one end. Presumably the software trims the primers off directly, leaving only the real V4-region sequence, and the few sequences where a primer can still be found are probably spurious matches to primer-like stretches elsewhere in the sequence. There was a small episode along the way: with the usual parse-on-header FASTA loop, the last sequence is always lost unless you remember to flush it after the loop. Having not used Python for several months, my algorithm skills were a bit rusty, and at first I could only work around it by appending a dummy sequence name to the end of the file. The following script handles the processing:

def deal_with_seq(seq, seq_name):
    '''
    Use a few bases unique to each primer to check whether the primer is
    present and where it sits in the sequence, then trim the sequence
    down to 120 + 10 + 120 bp.
    '''
    primer_F = 'GCCGCGGTAA'  # tail of the 515F forward primer (no degenerate bases)
    primer_R = 'ATTAGA'      # start of the reverse complement of the 806R primer
    seq_F = ''
    seq_R = ''
    if primer_F in seq and primer_R not in seq:  # only the forward primer found
        seq_F = seq.split(primer_F)[1][:120]   # 120 bp after the forward primer, matching the earlier sequencing reads
        seq_R = seq.split(primer_F)[1][-120:]  # ditto
        print(len(seq.split(primer_F)[1]))     # length of the sequence without the primer
        seq = seq_F + "NNNNNNNNNN" + seq_R     # join with 10 Ns, matching the earlier data processing
    elif primer_R in seq and primer_F not in seq:  # only the reverse primer found
        seq_R = seq.split(primer_R)[0][-120:]
        seq_F = seq.split(primer_R)[0][:120]
        # print(seq_F, seq_R)
        print(len(seq.split(primer_R)[0]))
        seq = seq_F + "NNNNNNNNNN" + seq_R
    elif primer_F in seq and primer_R in seq:  # both primers found
        print(len(seq.strip()))
        seq_F = seq.split(primer_F)[1][:120]
        seq_R = seq.split(primer_R)[0][-120:]
        seq = seq_F + "NNNNNNNNNN" + seq_R
    else:  # otherwise assume the sequence is already V4-only
        print(len(seq.strip()))
        seq = seq[:120] + "NNNNNNNNNN" + seq[-120:]
    return seq


def get_250_otus():
    '''
    Read the sequences, process each one, and store the results.
    '''
    fout = open('250_otus.fasta', 'w')
    f_length = open('250_otus_length.txt', 'w')
    f_undealed = open('undealed.fasta', 'w')
    dic_length = {}

    def write_record(seq_name, seq):
        # seq_name still ends with '\n', so it doubles as the header line
        dic_length[seq_name] = len(seq)
        f_undealed.write(seq_name + seq + '\n')  # keep the raw, unprocessed sequence
        seq = deal_with_seq(seq, seq_name)
        fout.write(seq_name + seq + '\n')
        f_length.write(seq_name.strip() + '\t' + str(dic_length[seq_name]) + '\n')

    with open('dna-sequences.fasta') as f:
        seq_name = ''
        seq = ''
        for line in f:
            if line.startswith('>'):
                if seq_name != '':
                    write_record(seq_name, seq)
                seq_name = line
                seq = ''
            elif line.strip() != '':
                seq += line.strip()
        if seq_name != '':  # flush the last record, otherwise it is lost
            write_record(seq_name, seq)
    fout.close()
    f_length.close()
    f_undealed.close()


if __name__ == '__main__':
    get_250_otus()

The following sequence length distribution is calculated:

The histogram shows that the vast majority of sequences are 251-260 bp long, and the rest can be ignored, so this result is fairly reliable.

(Figure: sequence length distribution histogram — https://jian.zd200572.com/wp-content/uploads/2020/02/250otu.PNG)

3. Train the model and test the species annotation

Putting aside the little wheels I reinvented myself, we return to the QIIME 2 workflow.

#Re-import the sequences
qiime tools import \
  --type 'FeatureData[Sequence]' \
  --input-path /Users/zd200572/Biodata/Refrence/8c74b73b-c49b-44f1-bfba-54df23b1c8d3/data/250_otus.fasta \
  --output-path s250_otus.qza
#Training the model is time-consuming, especially since my laptop is nothing special: an i5-class CPU with 16 GB of memory
time qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads s250_otus.qza \
  --i-reference-taxonomy sref-taxonomy.qza \
  --o-classifier classifier_silva_V4.qza
#More than two hours: 8140.54s user 1138.65s system 93% cpu 2:45:32.42 total
#Species annotation
time qiime feature-classifier classify-sklearn \
  --i-classifier classifier_silva_V4.qza \
  --i-reads rep-seqs.qza \
  --o-classification taxonomy250.qza
#754.47s user 57.61s system 80% cpu 16:47.93 total

4. Comparison with the previous model

There is no obvious difference at the other classification levels, but at the genus level there is a significant increase, from 194 to 283 classified genera. After looking through the clearly classified genera, all this work seems worth it, although I still need to check for false positives. Species-level classification remains too painful to look at: basically nothing is classified to species.

(Figure: comparison after optimization — https://jian.zd200572.com/wp-content/uploads/2020/02/youhuahou.jpg)

Looking at the genus-level bar plots after optimization, the classification is indeed better:

(Figure: genus-level bar plots — https://jian.zd200572.com/wp-content/uploads/2020/02/bartlots.jpg)
Previous exploration
1. Data analysis process learning notes (1)

2. Exploration of processing uBiome-like data with DADA2 (2)

3. Exploration of processing uBiome-like data with DADA2 (3)

4. Exploration of processing uBiome-like data with DADA2 (5)

Reference resources:

1. Conda for managing bioinformatics software: one article is enough

2. 16S Illumina Amplicon Protocol: Earth Microbiome Project


Posted by DeX on Sat, 01 Feb 2020 22:25:02 -0800