Data inputs

This document describes the input format required by SVA version 1.1 and later. For older versions of SVA, which supported the pileup format, see here.

SVA users need to prepare three types of input files for an SVA project:

  1. A list of identified variants, including single nucleotide variants (SNVs) and insertions/deletions (INDELs), in a specific VCF format (the detailed description of the VCF format can be found here) - text file with file name extension .vcf;
  2. (Optional) A list of structural variations (SVs) in HMMCNV output format - text file with file name extension .events;
  3. A chromosome-wise coverage and quality control data file, generated from SAMtools mpileup output - binary file with file name extension .bco.

 

In addition, there is an optional pedinf file for an SVA project. This file lists the subjects in a linkage format. This file is not necessary for SVA annotation tasks, but is necessary for some SVA analysis and exporting functions.

Optional pedinf file:

The pedinf file lists the subjects in linkage format, using six columns separated by spaces or tabs:

Family ID, Individual ID, Father ID, Mother ID, Gender (1=male, 2=female), Affected status (1=control, 2=case, -9=unknown)

Here is an example for this file.
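As a rough illustration of the six-column layout, the following Python sketch parses and checks one pedinf line. The function and class names are hypothetical, the sample row in the usage note is made up, and the strict rejection of other gender or affection codes is an assumption of this sketch; SVA itself may be more lenient.

```python
from typing import NamedTuple


class Subject(NamedTuple):
    family_id: str
    individual_id: str
    father_id: str
    mother_id: str
    gender: int      # 1=male, 2=female
    affected: int    # 1=control, 2=case, -9=unknown


VALID_GENDER = {1, 2}
VALID_AFFECTED = {1, 2, -9}


def parse_pedinf_line(line: str) -> Subject:
    """Parse one pedinf line; columns may be separated by spaces or tabs."""
    fields = line.split()  # splits on any run of spaces and/or tabs
    if len(fields) != 6:
        raise ValueError(f"expected 6 columns, found {len(fields)}: {line!r}")
    family, individual, father, mother, gender, affected = fields
    gender_code = int(gender)
    affected_code = int(affected)
    # Assumption of this sketch: codes outside the documented sets are errors.
    if gender_code not in VALID_GENDER:
        raise ValueError(f"gender must be 1 or 2, found {gender_code}")
    if affected_code not in VALID_AFFECTED:
        raise ValueError(f"affected status must be 1, 2 or -9, found {affected_code}")
    return Subject(family, individual, father, mother, gender_code, affected_code)
```

For example, `parse_pedinf_line("FAM1 S001 0 0 1 2")` describes a male case in family FAM1. The `0` used here for the parent IDs is only a placeholder chosen for this illustration.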

I will assume that SVA users are already familiar with next-generation sequencing data pipelines, particularly those using BWA/SAMtools. The file name extensions listed above only serve to let SVA conveniently recognize the corresponding format. Although we use BWA/SAMtools ourselves, the file extensions do not imply that SVA only takes output from SAMtools. SVA does not distinguish which software generated the alignment results, as long as the format matches the description below.

Important note:

The build of the human reference genome that you use for alignment must be the same as the build that you use for SVA annotation. For instructions on how to update SVA annotation databases, see here.

The basic data generation workflow described below is based on our own experience and is provided for your reference. You may choose to use different parameter settings.

Step 1. Generating mpileup file

We used SAMtools to generate the mpileup file:

[YOUR SAMTOOLS DIR]/samtools mpileup -d 500 -uf [YOUR REFERENCE FASTA FILE] [YOUR ALIGNMENT .bam file] > [YOUR mpileup file]

There is an important note regarding the chromosome designations, which affect the data generation steps that follow.

Step 2. Generating variant file in vcf format

We used SAMtools/bcftools to generate the variant file (Please note this is a basic example. Your actual parameters may vary.):

[YOUR SAMTOOLS DIR]/bcftool/bcftools view -bvcg [YOUR mpileup file] > [YOUR bcf file]

[YOUR SAMTOOLS DIR]/bcftool/bcftools view [YOUR bcf file] > [YOUR raw vcf file]

[YOUR SAMTOOLS DIR]/bcftool/vcfutils.pl varFilter -D 500 [YOUR raw vcf file] > [YOUR filtered variant vcf file]
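If you prefer to script steps 1 and 2, the commands above can be assembled programmatically. The Python sketch below only rebuilds the command lines shown in this document, with the same flags and the same directory layout. The function names are hypothetical, and reusing one `max_depth` value for both `-d` and `-D` simply mirrors the value 500 in the examples rather than stating a requirement.

```python
import subprocess
from typing import List, Tuple

Command = Tuple[List[str], str]  # (argument list, file that receives stdout)


def step1_command(samtools_dir: str, reference_fasta: str, bam: str,
                  mpileup_file: str, max_depth: int = 500) -> Command:
    """Build the mpileup command from step 1."""
    argv = [f"{samtools_dir}/samtools", "mpileup", "-d", str(max_depth),
            "-uf", reference_fasta, bam]
    return argv, mpileup_file


def step2_commands(samtools_dir: str, mpileup_file: str, bcf_file: str,
                   raw_vcf: str, filtered_vcf: str,
                   max_depth: int = 500) -> List[Command]:
    """Build the three variant-calling commands from step 2, in order."""
    bcftools = f"{samtools_dir}/bcftool/bcftools"
    vcfutils = f"{samtools_dir}/bcftool/vcfutils.pl"
    return [
        ([bcftools, "view", "-bvcg", mpileup_file], bcf_file),
        ([bcftools, "view", bcf_file], raw_vcf),
        ([vcfutils, "varFilter", "-D", str(max_depth), raw_vcf], filtered_vcf),
    ]


def run_to_file(command: Command) -> None:
    """Run one command and write its standard output to the target file."""
    argv, out_path = command
    with open(out_path, "wb") as out:
        subprocess.run(argv, stdout=out, check=True)
```

Calling `run_to_file` on the result of `step1_command`, then on each entry of `step2_commands`, reproduces the sequence of redirections shown above; `check=True` stops the run at the first command that fails.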

(Optional) Step 3. Generating SV file .events

We used a separate program (ERDS) to generate the SV file. Please refer to its web pages for the user guide.

Here is an example of the generated .events file:

X 2130001 2206000 76000 2 213.2
X 2206001 2208000 2000 0 0.7

The columns are: chromosome name, start coordinate, end coordinate, length (in the example above, end - start + 1), SV status (diploid=2), LOD score.
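To make the layout concrete, here is a small Python sketch that reads one .events row. It assumes six whitespace-separated fields, as in the two example rows, and treats the fourth field as the region length (end - start + 1), which is what the example values suggest. The names `SVEvent` and `parse_event_line` are hypothetical.

```python
from typing import NamedTuple


class SVEvent(NamedTuple):
    chrom: str
    start: int
    end: int
    length: int   # assumed: end - start + 1, as in the example rows
    status: int   # SV status; 2 means diploid
    lod: float


def parse_event_line(line: str) -> SVEvent:
    """Parse one whitespace-separated row of an .events file."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 columns, found {len(fields)}: {line!r}")
    chrom, start, end, length, status, lod = fields
    event = SVEvent(chrom, int(start), int(end), int(length), int(status), float(lod))
    # Sanity check based on the example rows, not on a published specification.
    if event.length != event.end - event.start + 1:
        raise ValueError(f"length does not match coordinates: {line!r}")
    return event
```

Applied to the first example row, this yields an event on chromosome X with status 2 and LOD score 213.2.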

Step 4. Generating coverage and quality score file .bco

We used a simple Java program, vcf2bco.jar (download it here), to generate the chromosome-wise .bco file from the base-wise VCF file generated using SAMtools/bcftools.

[YOUR SAMTOOLS DIR]/bcftool/bcftools view -gc [YOUR mpileup file] > [YOUR base-wise vcf file]

java -jar [YOUR vcf2bco.jar DIR]/vcf2bco.jar [YOUR base-wise vcf file] [YOUR_BCO_OUTPUTSTEM]

Note: This small Java program (vcf2bco.jar) expects the chromosome designation (column 1) of its input file to be an integer from 1 to 22, or X, Y, or M. For example, vcf2bco accepts "16" but not "chr16".
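If your files use "chr"-prefixed names, one option is to rewrite column 1 before running vcf2bco. The Python sketch below is only an illustration under several assumptions: the input is tab-separated, header lines start with "#" and are passed through unchanged, and any name outside 1-22, X, Y, M is treated as an error. The function names are hypothetical. Aligning against a reference whose sequences are already named without the prefix avoids this step altogether.

```python
# Chromosome names that the note above says vcf2bco accepts.
VALID_CHROMS = {str(n) for n in range(1, 23)} | {"X", "Y", "M"}


def normalize_chrom(name: str) -> str:
    """Strip a leading 'chr' (any case) and check the remaining name."""
    bare = name[3:] if name.lower().startswith("chr") else name
    if bare not in VALID_CHROMS:
        raise ValueError(f"unsupported chromosome designation: {name!r}")
    return bare


def normalize_vcf_line(line: str) -> str:
    """Rewrite column 1 of a tab-separated data line; keep header lines as-is."""
    if line.startswith("#"):
        return line
    chrom, sep, rest = line.partition("\t")
    return normalize_chrom(chrom) + sep + rest
```

With this sketch, `normalize_chrom("chr16")` returns `"16"`, while a name such as `"chrUn"` raises an error instead of being silently passed on.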

The .bco file is in binary format and uses 4 bytes for each base, one byte per score: consensus quality, SNP quality, RMS mapping quality, and read depth. Because each score is stored in a single byte, its upper limit is 255; any score greater than 255 will be trimmed.
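The per-base layout can be pictured with the following Python sketch. It is not the vcf2bco implementation. It assumes that the four bytes appear in the order listed above and that "trimmed" means the value is capped at 255; it says nothing about any file header or about how positions are indexed within the file. The function names are hypothetical.

```python
from typing import Tuple


def clamp_score(value: int) -> int:
    """Cap a score to the range a single unsigned byte can hold (0-255)."""
    return max(0, min(255, value))


def pack_base(consensus_q: int, snp_q: int, rms_mapq: int, depth: int) -> bytes:
    """Pack the four per-base scores into 4 bytes, one byte per score."""
    return bytes(clamp_score(v) for v in (consensus_q, snp_q, rms_mapq, depth))


def unpack_base(record: bytes) -> Tuple[int, int, int, int]:
    """Recover (consensus quality, SNP quality, RMS mapping quality, read depth)."""
    if len(record) != 4:
        raise ValueError(f"expected 4 bytes, found {len(record)}")
    consensus_q, snp_q, rms_mapq, depth = record
    return consensus_q, snp_q, rms_mapq, depth
```

Under these assumptions, a base with read depth 500 is stored with depth 255, so depths above 255 cannot be told apart once the .bco file has been written.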

After you generate these files (step 3 is optional), you may proceed to create your project.

 

© 2011 Dongliang Ge, PhD.