Why is copykat not intuitive to inforcnv

In fact, copykat is only unsatisfactory when judging by the algorithm, but the visualization is still that the naked eye can clearly distinguish diploid normal cells and aneuploid cancer cells, so we want to see what specific improvements can be made to bypass this bug. The first choice is that we split all epithelial cells according to patients, Get the separate folder for each patient as shown below and the expFile.txt under each folder!

   81M 12  2 10:09 S11/expFile.txt
  123M 12  2 10:22 S21/expFile.txt
   84M 12  2 10:41 S47/expFile.txt
   75M 12  2 10:54 S50/expFile.txt
   78M 12  2 11:05 S53/expFile.txt
   95M 12  2 11:20 S56/expFile.txt
   95M 12  2 11:34 S57/expFile.txt
   78M 12  2 11:49 S58/expFile.txt
  107M 12  2 12:01 S63/expFile.txt
   77M 12  2 12:17 S65/expFile.txt
   87M 12  2 12:29 S66/expFile.txt
   87M 12  2 12:42 S69/expFile.txt
   79M 12  2 12:56 S71/expFile.txt
   75M 12  2 13:08 S72/expFile.txt
   76M 12  2 13:19 S74/expFile.txt
   78M 12  2 13:34 S78/expFile.txt
   77M 12  2 13:45 S79/expFile.txt
   77M 12  2 13:57 S81/expFile.txt
   84M 12  2 14:09 S82/expFile.txt

In this way, I only need to write a simple script Step4 run copykat. R, which can save the country by parallel curves in the way of all samples, which can also be regarded as parallel computing.

This simple script Step4 run copykat. R is the following shell script:

ls -d S*|while read id;do
echo $id;
cd $id;
nohup Rscript.exe ../scripts/step4-run-copykat.R &
cd ../
done

It is easy to see the single cell level cell expression matrix of each patient. Running this copykat process takes time:

S11/nohup.out:Time difference of 54.33337 mins
S21/nohup.out:Time difference of 1.20906 hours
S47/nohup.out:Time difference of 54.88378 mins
S50/nohup.out:Time difference of 43.44914 mins
S53/nohup.out:Time difference of 51.72247 mins
S56/nohup.out:Time difference of 1.020941 hours
S57/nohup.out:Time difference of 1.001784 hours
S58/nohup.out:Time difference of 52.24233 mins
S63/nohup.out:Time difference of 1.096883 hours
S65/nohup.out:Time difference of 45.49075 mins
S66/nohup.out:Time difference of 56.40917 mins
S69/nohup.out:Time difference of 56.40654 mins
S71/nohup.out:Time difference of 52.58897 mins
S72/nohup.out:Time difference of 42.06179 mins
S74/nohup.out:Time difference of 47.09917 mins
S78/nohup.out:Time difference of 49.3231 mins
S79/nohup.out:Time difference of 50.86545 mins
S81/nohup.out:Time difference of 44.29561 mins
S82/nohup.out:Time difference of 45.59241 mins

Moreover, I briefly looked at the discrimination difference between diploid normal cells and aneuploid cells in these patients:

tail -n  6  S*/nohup.out
==> S11/nohup.out <==
               aneuploid diploid
  epi                177       0
  ref-Bcells           7     389
  ref-Tcells           0     478
  spike-Bcells         6     235
  spike-Tcells         0     289

==> S21/nohup.out <==
               aneuploid diploid
  epi                935      23
  ref-Bcells           9     393
  ref-Tcells           0     480
  spike-Bcells         4     241
  spike-Tcells         0     289

==> S47/nohup.out <==
               aneuploid diploid
  epi                195      13
  ref-Bcells           9     385
  ref-Tcells           0     477
  spike-Bcells         6     232
  spike-Tcells         0     289

==> S50/nohup.out <==
               aneuploid diploid
  epi                 27       8
  ref-Bcells         316      77
  ref-Tcells          20     457
  spike-Bcells       184      55
  spike-Tcells        24     265

==> S53/nohup.out <==
               aneuploid diploid
  epi                 74      27
  ref-Bcells           0     393
  ref-Tcells           0     477
  spike-Bcells         1     238
  spike-Tcells         0     289

==> S56/nohup.out <==
               aneuploid diploid
  epi                348     114
  ref-Bcells           0     399
  ref-Tcells           0     477
  spike-Bcells         0     241
  spike-Tcells         0     289

==> S57/nohup.out <==
               aneuploid diploid
  epi                104     356
  ref-Bcells           0     395
  ref-Tcells           0     477
  spike-Bcells         0     239
  spike-Tcells         0     289

==> S58/nohup.out <==
               aneuploid diploid
  epi                 96       2
  ref-Bcells         311      82
  ref-Tcells          16     461
  spike-Bcells       180      60
  spike-Tcells        18     271

==> S63/nohup.out <==
               aneuploid diploid
  epi                605      80
  ref-Bcells           0     400
  ref-Tcells           1     477
  spike-Bcells         0     243
  spike-Tcells         0     289

==> S65/nohup.out <==
               aneuploid diploid
  epi                 56      28
  ref-Bcells         307      84
  ref-Tcells          32     445
  spike-Bcells       190      45
  spike-Tcells        29     257

==> S66/nohup.out <==
               aneuploid diploid
  epi                266      19
  ref-Bcells          16     379
  ref-Tcells           0     478
  spike-Bcells         9     234
  spike-Tcells         2     287

==> S69/nohup.out <==
               aneuploid diploid
  epi                290       3
  ref-Bcells           7     391
  ref-Tcells           0     479
  spike-Bcells         5     238
  spike-Tcells         0     289

==> S71/nohup.out <==
               aneuploid diploid
  epi                 71      52
  ref-Bcells           0     393
  ref-Tcells           0     477
  spike-Bcells         0     240
  spike-Tcells         0     289

==> S72/nohup.out <==
               aneuploid diploid
  epi                 28       2
  ref-Bcells         317      75
  ref-Tcells          29     448
  spike-Bcells       188      47
  spike-Tcells        32     254

==> S74/nohup.out <==
               aneuploid diploid
  epi                 48      25
  ref-Bcells           0     393
  ref-Tcells           0     477
  spike-Bcells         1     238
  spike-Tcells         0     289

==> S78/nohup.out <==
               aneuploid diploid
  epi                 94       8
  ref-Bcells         317      75
  ref-Tcells          35     442
  spike-Bcells       192      44
  spike-Tcells        31     255

==> S79/nohup.out <==
               aneuploid diploid
  epi                 46      53
  ref-Bcells         313      80
  ref-Tcells          28     449
  spike-Bcells       183      57
  spike-Tcells        31     258

==> S81/nohup.out <==
               aneuploid diploid
  epi                 75       3
  ref-Bcells         309      83
  ref-Tcells          21     456
  spike-Bcells       187      49
  spike-Tcells        20     266

==> S82/nohup.out <==
               aneuploid diploid
  epi                 70      15
  ref-Bcells          83     308
  ref-Tcells         455      22
  spike-Bcells        49     186
  spike-Tcells       259      27

This is very interesting. It can be seen that there are very strange discrimination of single cells in half of the patients, such as S82/nohup.out, in which a large number of ref and spike are judged as malignant cells. Let's look at its chart:

test_copykat_heatmap

It's really a little strange. At this time, the naked eye is actually the same as the software! And I went to see its inforcnv results, as shown below:

infercnv

It can be seen that copykat is just not intuitive, but in such patient data with a low proportion of malignant cells, there is no much difference in the effect, which is consistent with the conclusion of judging whether the cells are malignant or not by the naked eye! Although copykat mistakenly identified a large number of ref and spike as malignant cells, it is obvious that we will turn to the previous malignant epithelial cells as normal cells. At this time, there is some suspicion of idealism.

Although copykat is just not intuitive as inforcnv, copykat at least gives the judgment of aneuploid and diploid. The result file given by inforcnv still needs to be read by yourself, calculate the cnv score by yourself, and give a threshold to distinguish whether the cell is malignant or not. This step is also a test of your programming skills. You need to familiarize yourself with the following 5 tutorials:

If you also have your own copy number variation analysis needs of tumor related single-cell data, and you always report errors in many attempts, please leave a message and entrust us. The copy number variation single analysis of tumor related single-cell data is 2400. If the previous Dimension Reduction Clustering clustering is required, an additional 800 is required to provide an independent copykat and inforcnv process for distinguishing samples, as well as all the results and code folders of the combined copykat and inforcnv processes.

It should be noted that we will not keep the txt and other files in the middle, because it consumes too much disk space. For example, the project we demonstrated has less than 7000 epithelial cells from more than 20 patients. By following the copykat and inforcnv processes, we obtained nearly 100G files:

 11G ./4-3-only-Epi
478M ./Rdata-celltype
 81G ./epi-one-by-one

Posted by steve@MRS on Fri, 03 Dec 2021 21:21:02 -0800