Viral Assembly Pipeline
VrAP is based on the genome assembler SPAdes, combined with an additional read correction step and several filter steps. The pipeline classifies contigs to distinguish host from viral sequences using annotation and ORF density scores. The new ORF density method can identify viruses that have no sequence homology to known references. We tested VrAP on real datasets generated with different sequencing technologies and identified new viruses representing new species and even new genera and families.
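This README does not define how the ORF density score is computed. Purely as an illustration of the idea, the sketch below uses one plausible definition: the fraction of a contig covered by its longest stop-free codon run across all six reading frames. This definition and the function names (`reverse_complement`, `longest_open_stretch`, `orf_density`) are assumptions made for the example and are not taken from VrAP itself.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an A/C/G/T sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def longest_open_stretch(seq: str, frame: int) -> int:
    """Length in nucleotides of the longest stop-free codon run in one frame."""
    best = current = 0
    for i in range(frame, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            current = 0  # a stop codon ends the current open stretch
        else:
            current += 3
            best = max(best, current)
    return best


def orf_density(contig: str) -> float:
    """Assumed definition: longest open stretch (six frames) / contig length."""
    seq = contig.upper()
    if not seq:
        return 0.0
    strands = (seq, reverse_complement(seq))
    longest = max(longest_open_stretch(s, f) for s in strands for f in range(3))
    return longest / len(seq)
```

Under this assumed definition the score lies between 0 and 1, and contigs that are densely coding in some frame score close to 1.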
Please contact me if you have any problems with the tool.
I welcome any feedback.
Dr. Markus Fricke
RNA Bioinformatics and High Throughput Analysis Jena
Friedrich Schiller University Jena
Faculty of Mathematics and Computer Science
VrAP has a small dependency list:
All other tools are already included. The pipeline expects a 64-bit Linux system; on other systems, the included tools need to be recompiled.
It is recommended to set a global PATH variable:
To show the help page:
The simplest run is:
vrap.py -o output/dir -1 pair_1.fq
or, if you have paired-end reads:
vrap.py -o output/dir -1 pair_1.fq -2 pair_2.fq
VrAP only accepts FASTQ files or packed FASTQ files. I also recommend using the -o option to set an output directory.
If you have a genome of your host, you can add its FASTA file for host read depletion with -r.
vrap.py -o output/dir -1 pair_1.fq -2 pair_2.fq -r reference.fasta
If you want to extend the viral search database, you can add a FASTA file with -v. However, the pipeline already includes a comprehensive viral database, which is downloaded on the first run.
vrap.py -o output/dir -1 pair_1.fq -2 pair_2.fq -v viral_db.fasta
If you want to update the included viral database you can use the -u option.
You can also add other search databases as BLAST databases with -n (nucleotide database, e.g. nt) or -a (protein database, e.g. nr).
vrap.py -o output/dir -1 pair_1.fq -2 pair_2.fq -n nt_blast_db -a aa_blast_db
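When several samples need to be processed, the command lines above can be assembled programmatically. The following is a minimal sketch that assumes vrap.py is on your PATH. The helper names `build_vrap_command` and `run_vrap` are hypothetical and not part of VrAP; only the options described above (-o, -1, -2, -r, -v, -n, -a) are used.

```python
import subprocess
from typing import List, Optional


def build_vrap_command(
    out_dir: str,
    reads_1: str,
    reads_2: Optional[str] = None,
    host_fasta: Optional[str] = None,
    viral_fasta: Optional[str] = None,
    nt_db: Optional[str] = None,
    aa_db: Optional[str] = None,
) -> List[str]:
    """Assemble a vrap.py argument list from the options described above."""
    cmd = ["vrap.py", "-o", out_dir, "-1", reads_1]
    optional = [
        ("-2", reads_2),      # second read file for paired-end data
        ("-r", host_fasta),   # host genome for read depletion
        ("-v", viral_fasta),  # extra viral sequences
        ("-n", nt_db),        # nucleotide BLAST database
        ("-a", aa_db),        # protein BLAST database
    ]
    for flag, value in optional:
        if value is not None:
            cmd += [flag, value]
    return cmd


def run_vrap(**kwargs) -> None:
    """Run VrAP; raises subprocess.CalledProcessError on a non-zero exit code."""
    subprocess.run(build_vrap_command(**kwargs), check=True)
```

For example, `run_vrap(out_dir="output/dir", reads_1="pair_1.fq", reads_2="pair_2.fq", host_fasta="reference.fasta")` corresponds to the host depletion call shown earlier.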
VrAP writes vrap_contig.fasta with the assembled contigs and vrap_summary.csv with possible annotations for all contigs.
qleng = contig length
orf_d = ORF density
ident = sequence identity to target/reference
qcov = contig coverage by BLAST target/reference
tcov = target/reference coverage by contig
tlength = target/reference length
tid = target/reference id
tname = target/reference name
mean_eval = mean BLAST e-value of all hits to the corresponding target/reference
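The summary can be post-processed with a few lines of Python, for example to rank contigs by ORF density. This is a minimal sketch that assumes vrap_summary.csv is comma-separated and has a header row using the field names listed above; neither assumption is confirmed here, so check your own output first. The function names and the 1000 nt length cutoff are arbitrary choices for illustration.

```python
import csv
from typing import Dict, List, Optional


def to_float(value: Optional[str]) -> Optional[float]:
    """Convert a CSV cell to float; return None for empty or non-numeric cells."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def load_summary(path: str) -> List[Dict[str, str]]:
    """Read the summary into a list of dicts keyed by the header names."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))


def rank_by_orf_density(
    rows: List[Dict[str, str]], min_length: float = 1000.0
) -> List[Dict[str, str]]:
    """Keep contigs of at least min_length and sort by orf_d, highest first."""
    scored = []
    for row in rows:
        length = to_float(row.get("qleng"))
        density = to_float(row.get("orf_d"))
        if length is None or density is None or length < min_length:
            continue  # skip short contigs and rows without usable numbers
        scored.append((density, row))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [row for _, row in scored]
```

A typical call would be `rank_by_orf_density(load_summary("output/dir/vrap_summary.csv"))`, where the path depends on the directory given with -o.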