Preparing BAM file for webserver

The input should be a BAM file produced by aligning reads with BWA against a customized reference that includes both the human reference genome (hg19) and the virus genomes of interest. Additional processing steps are needed in the cases described below.

Step 1 (optional): Regenerate the BAM file if reads were not aligned to the virus genome
If your BAM file was aligned against the human reference genome only, the following processing is needed:
a) Use samtools to extract unmapped reads:
    samtools fastq -f 4 [human.bam] | gzip > [unmapped.fastq.gz]
b) Align unmapped reads to the virus genome of interest:
    bwa mem [virus.fasta] [unmapped.fastq.gz] | samtools view -b -f 4 -S -t [virus.fasta.fai] - > [virus.bam]
c) Merge the two BAM files:
    samtools merge [merged.bam] [virus.bam] [human.bam]
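
For readers unfamiliar with the `-f 4` filter used in step a): it keeps reads whose SAM FLAG has the 0x4 ("read unmapped") bit set. A minimal Python sketch of that bit test, for illustration only; the helper name `is_unmapped` is hypothetical and not part of samtools:

```python
# Bit 0x4 of the SAM FLAG field means "read unmapped" (SAM specification).
UNMAPPED = 0x4


def is_unmapped(flag: int) -> bool:
    """Return True if the SAM FLAG marks the read as unmapped."""
    return bool(flag & UNMAPPED)


# 77 = paired (1) + unmapped (4) + mate unmapped (8) + first in pair (64)
# 99 = paired (1) + proper pair (2) + mate reverse (32) + first in pair (64)
for flag in (77, 99):
    print(flag, is_unmapped(flag))
```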

Step 2 (optional): Reduce the size of the input BAM file for uploading to the web server
If your BAM file is larger than 50 MB, please download VirusFaster_preprocess.jar, a preprocessing program we developed, and use it to filter the BAM file down to a much smaller size before uploading it to the server. Usage:
     java -jar /program/VirusFaster_preprocess.jar \
     -b [merged.bam] \
     -g [human.fasta] \
     -v [virus.fasta] \
     -o [out_dir] \
     -s [prefix]
merged.bam: the input BAM file aligned to both the human and virus genomes (see Step 1)
human.fasta: the human reference genome (hg19) file in FASTA format
virus.fasta: the virus genome sequence in FASTA format
out_dir: the path of the output directory
prefix: the prefix of the output files
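
Whether Step 2 applies depends only on the size of the BAM file. A small Python sketch of that check; the helper name `needs_preprocessing` is hypothetical, and reading "50 MB" as 50 x 1024 x 1024 bytes is an assumption:

```python
import os

# Assumption: the 50 MB upload limit is interpreted as 50 * 1024 * 1024 bytes.
LIMIT_BYTES = 50 * 1024 * 1024


def needs_preprocessing(bam_path: str, limit: int = LIMIT_BYTES) -> bool:
    """Return True if the file at bam_path is larger than the upload limit."""
    return os.path.getsize(bam_path) > limit
```

Calling `needs_preprocessing("merged.bam")` on your own file indicates whether it should be filtered with VirusFaster_preprocess.jar before uploading.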

Input Parameters

Minimum soft-clip count:
The minimum number of soft-clipped reads required to support a candidate viral integration event during preliminary filtering. We recommend the default value of 3.
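
To illustrate the idea of a soft-clip count threshold, here is a minimal Python sketch that counts reads whose CIGAR string contains a soft clip (`S`). It assumes the CIGAR strings passed in belong to reads already grouped at one candidate site; how VirusFaster groups reads is not described here, and the function names are hypothetical:

```python
import re

# One CIGAR operation: a length followed by an operation code (SAM specification).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def has_soft_clip(cigar: str) -> bool:
    """Return True if the CIGAR string contains at least one soft clip (S)."""
    return any(op == "S" for _, op in CIGAR_OP.findall(cigar))


def passes_soft_clip_filter(cigars, min_count=3):
    """Return True if at least min_count reads are soft-clipped."""
    return sum(has_soft_clip(c) for c in cigars) >= min_count


reads = ["60M40S", "35S65M", "100M", "58M42S"]
print(passes_soft_clip_filter(reads))      # True: three soft-clipped reads
print(passes_soft_clip_filter(reads[:3]))  # False: only two
```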
Similarity threshold:
VirusFaster uses the Smith-Waterman algorithm to align soft-clipped sequences to the reference. Only reads whose alignment similarity score is above the threshold are retained. We strongly recommend keeping this parameter at its default value (0.95).
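
The following Python sketch shows how a Smith-Waterman score can be turned into a similarity value and compared with the threshold. The scoring scheme (match +1, mismatch -1, gap -1) and the definition of similarity as the best local score divided by the query length are assumptions made for illustration; the scheme VirusFaster actually uses is not stated here:

```python
def smith_waterman_score(query, ref, match=1, mismatch=-1, gap=-1):
    """Best local alignment score between query and ref (linear gap penalty)."""
    prev = [0] * (len(ref) + 1)
    best = 0
    for i in range(1, len(query) + 1):
        curr = [0] * (len(ref) + 1)
        for j in range(1, len(ref) + 1):
            step = match if query[i - 1] == ref[j - 1] else mismatch
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, prev[j - 1] + step, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def similarity(query, ref):
    """Assumed similarity: best local score divided by the query length."""
    return smith_waterman_score(query, ref) / len(query)


THRESHOLD = 0.95  # default recommended in the text above

clip = "ACGTACGTAC"
print(similarity(clip, "TT" + clip + "TT") >= THRESHOLD)  # exact match inside the reference
```

With a match score of 1, the similarity is at most 1.0, which is reached only when the whole soft-clipped sequence aligns without mismatches or gaps.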
Insert Size:
The insert size can either be specified by the user or estimated automatically by VirusFaster, which assumes that the insert sizes of the paired-end (PE) reads in the input BAM follow a Gaussian distribution.
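
As a rough illustration of a Gaussian estimate, the Python sketch below computes the mean and standard deviation of observed template lengths. It assumes the values come from the TLEN column of the SAM records and that implausibly large values are discarded with a hypothetical cutoff (`max_abs`); VirusFaster's own estimation procedure may differ:

```python
from statistics import mean, stdev


def estimate_insert_size(tlens, max_abs=1000):
    """Return (mean, standard deviation) of plausible absolute template lengths.

    Zero values (no template length available) and values larger than the
    hypothetical cutoff max_abs are ignored.
    """
    sizes = [abs(t) for t in tlens if 0 < abs(t) <= max_abs]
    return mean(sizes), stdev(sizes)


# Toy values: two proper pairs, two more first mates, one missing, one outlier.
mu, sigma = estimate_insert_size([300, -300, 310, -290, 0, 250000])
print(mu, round(sigma, 2))
```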
Read Length:
The read length can either be specified by the user or calculated automatically by VirusFaster.
Options for running mode:
Strict mode: In strict mode, VirusFaster applies steps 1-5 (Figure 1) to detect virus integration breakpoints, which yields results with higher confidence. Breakpoints in both the human genome and the virus genome are confirmed by soft-clipped reads at single-base resolution.
Loose mode: For low-depth NGS data in which no breakpoints are detected, we recommend trying loose mode, which applies only steps 1-3 (Figure 1) to detect virus integration sites. This mode detects more breakpoints than strict mode, possibly at the cost of lower specificity.

Results (Circos plot)

  • The red block represents the virus sequence (magnified 100,000 times).
  • The links connecting the HBV sequence and the human chromosomes represent virus integration events.
  • The human genes nearest to the integration sites are labeled outside the outer ring.

Results (Breakpoint alignment)

  • The first two rows show information about the virus integration event.
  • The + or - sign at the left indicates the alignment orientation of each soft-clipped read.
  • Mismatches to the reference genome are indicated in brown.
  • Soft-clipped reads are displayed in two parts:
    1. The part that matches the reference genome is displayed in black, bold letters;
    2. The soft-clipped part is displayed in blue, italic letters.
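
The two-part display described above can be derived from a read's sequence and CIGAR string. A minimal Python sketch, assuming standard SAM CIGAR semantics; the function name `split_soft_clip` is hypothetical and this is not the web server's rendering code:

```python
import re

# One CIGAR operation: a length followed by an operation code (SAM specification).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def split_soft_clip(seq, cigar):
    """Split a read into (left clip, aligned part, right clip)."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Hard-clipped bases (H) are not stored in the read sequence, so skip them.
    ops = [(n, op) for n, op in ops if op != "H"]
    left = ops[0][0] if ops and ops[0][1] == "S" else 0
    right = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    end = len(seq) - right
    return seq[:left], seq[left:end], seq[end:]


print(split_soft_clip("AAAACCCCCC", "4S6M"))  # ('AAAA', 'CCCCCC', '')
print(split_soft_clip("CCCCCCGGG", "6M3S"))   # ('', 'CCCCCC', 'GGG')
```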

Results (Published studies of detected virus events)

  • VirusFaster uses the Dr.Vis v2.0 database to annotate its output and point users to related published studies.

Copyright© 2016-2017, All Rights Reserved.
Center for Cancer Bioinformatics, Peking Cancer Hospital