NAME

vcftools - analyse VCF files

SYNOPSIS

vcftools [OPTIONS]

DESCRIPTION

The vcftools program is run from the command line. The interface is inspired by PLINK, and
so should be largely familiar to users of that package. Commands take the following form:

vcftools --vcf file1.vcf --chr 20 --freq

The above command tells vcftools to read in the file file1.vcf, extract sites on
chromosome 20, and calculate the allele frequency at each site. The resulting allele
frequency estimates are stored in the output file, out.frq. As in the above example,
output from vcftools is mainly sent to output files, as opposed to being shown on the
screen.

Note that some commands may only be available in the latest version of vcftools. To obtain
the latest version, you should use SVN to check out the latest code, as described on the
home page.

Also note that polyploid genotypes are not currently supported.

Basic Options
--vcf <filename>
This option defines the VCF file to be processed. The files need to be decompressed
prior to use with vcftools. vcftools expects files in VCF format v4.0, a
specification of which can be found here.

--gzvcf <filename>
This option can be used in place of the --vcf option to read compressed (gzipped)
VCF files directly. Note that this option can be quite slow when used with large
files.

--out <prefix>
This option defines the output filename prefix for all files generated by vcftools.
For example, if <prefix> is set to output_filename, then all output files will be
of the form output_filename.*** . If this option is omitted, all output files will
have the prefix 'out.'.
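
As a rough illustration of the prefix behaviour, the following sketch reads a gzipped VCF
file and writes allele frequencies under a custom prefix. The input name input.vcf.gz and
the prefix chr20_freq are placeholders, and the guard is only there so that the snippet
does nothing when vcftools or the input file is missing.

```shell
# Placeholder names; substitute your own input file and prefix.
input=input.vcf.gz
prefix=chr20_freq

# Per the --freq description, the frequency output takes the '.frq' suffix.
expected="${prefix}.frq"
echo "expecting output in ${expected}"

# Only run when vcftools and the input file are both available.
if command -v vcftools >/dev/null 2>&1 && [ -f "$input" ]; then
    vcftools --gzvcf "$input" --chr 20 --freq --out "$prefix"
fi
```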

Site Filter Options
--chr <chromosome>
Only process sites with a chromosome identifier matching <chromosome>.

--from-bp <integer>

--to-bp <integer>
These options define the physical range of sites that will be processed. Sites outside
of this range will be excluded. These options can only be used in conjunction with
--chr.

--snp <string>
Include SNP(s) with matching ID. This command can be used multiple times in order
to include more than one SNP.

--snps <filename>
Include a list of SNPs given in a file. The file should contain a list of SNP IDs,
with one ID per line.

--exclude <filename>
Exclude a list of SNPs given in a file. The file should contain a list of SNP IDs,
with one ID per line.

--positions <filename>
Include a set of sites on the basis of a list of positions. Each line of the input
file should contain a (tab-separated) chromosome and position. The file should
have a header line. Sites not included in the list are excluded.
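
A positions file of the kind described above might be prepared as follows. This is a
minimal sketch: the header names CHROM and POS, the two positions, and the file names are
all made up for illustration (the text only requires that a header line be present).

```shell
# Hypothetical positions file: a header line, then tab-separated chromosome and position.
printf 'CHROM\tPOS\n' > positions.txt
printf '20\t14370\n20\t17330\n' >> positions.txt

# Only run when vcftools and the (placeholder) input file are both available.
if command -v vcftools >/dev/null 2>&1 && [ -f file1.vcf ]; then
    vcftools --vcf file1.vcf --positions positions.txt --freq --out selected_sites
fi
```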

--bed <filename>

--exclude-bed <filename>
Include or exclude a set of sites on the basis of a BED file. Only the first three
columns (chrom, chromStart and chromEnd) are required. The BED file should have a
header line.
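
The BED input can be sketched in the same way. The region coordinates and file names below
are placeholders; the header uses the three column names given above.

```shell
# Hypothetical BED file: header line plus one tab-separated region on chromosome 20.
printf 'chrom\tchromStart\tchromEnd\n' > regions.bed
printf '20\t1000000\t2000000\n' >> regions.bed

# Keep only sites inside the listed regions (use --exclude-bed to drop them instead).
if command -v vcftools >/dev/null 2>&1 && [ -f file1.vcf ]; then
    vcftools --vcf file1.vcf --bed regions.bed --freq --out in_regions
fi
```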

--remove-filtered-all

--remove-filtered <string>

--keep-filtered <string>
These options are used to filter sites on the basis of their FILTER flag. The
first option removes all sites with a FILTER flag. The second option can be used to
exclude sites with a specific filter flag. The third option can be used to select
sites on the basis of specific filter flags. The second and third options can be
used multiple times to specify multiple FILTERs. The --keep-filtered option is
applied before the --remove-filtered option.

--minQ <float>
Include only sites with Quality above this threshold.

--min-meanDP <float>

--max-meanDP <float>
Include sites with mean Depth within the thresholds defined by these options.

--maf <float>

--max-maf <float>
Include only sites with Minor Allele Frequency within the specified range.

--non-ref-af <float>

--max-non-ref-af <float>
Include only sites with Non-Reference Allele Frequency within the specified range.

--hwe <float>
Assesses sites for Hardy-Weinberg Equilibrium using an exact test, as defined by
Wigginton, Cutler and Abecasis (2005). Sites with a p-value below the threshold
defined by this option are taken to be out of HWE, and therefore excluded.

--geno <float>
Exclude sites on the basis of the proportion of missing data (defined to be between
0 and 1).

--min-alleles <int>

--max-alleles <int>
Include only sites with a number of alleles within the specified range. For
example, to include only bi-allelic sites, one could use:

vcftools --vcf file1.vcf --min-alleles 2 --max-alleles 2

--mask <filename>

--invert-mask <filename>

--mask-min <filename>
Include sites on the basis of a FASTA-like file. The provided file contains a
sequence of integer digits (between 0 and 9) for each position on a chromosome that
specify if a site at that position should be filtered or not. An example mask file
would look like:

>1
0000011111222...

In this example, sites in the VCF file located within the first 5 bases of the
start of chromosome 1 would be kept, whereas sites at position 6 onwards would be
filtered out. The threshold integer that determines if sites are filtered or not is
set using the --mask-min option, which defaults to 0. The chromosomes contained in
the mask file must be sorted in the same order as the VCF file. The --mask option
is used to specify the mask file to be used, whereas the --invert-mask option can
be used to specify a mask file that will be inverted before being applied.
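
The mask example above can be reproduced with the sketch below. It assumes, based only on
that example, that a position is kept when its digit is less than or equal to the
--mask-min threshold; the file name mask.fa and the counting step are illustrative and are
not part of vcftools.

```shell
# Write the example mask: one chromosome named '1' with 13 positions.
printf '>1\n0000011111222\n' > mask.fa

# Count positions whose digit is at or below the threshold (assumed reading of --mask-min).
threshold=0
kept=$(sed -n 2p mask.fa | fold -w 1 | awk -v t="$threshold" '$1 <= t' | wc -l | tr -d ' ')
echo "positions kept at threshold ${threshold}: ${kept}"

# Apply the mask only when vcftools and the (placeholder) input file are available.
if command -v vcftools >/dev/null 2>&1 && [ -f file1.vcf ]; then
    vcftools --vcf file1.vcf --mask mask.fa --mask-min "$threshold" --freq --out masked
fi
```

With the default threshold of 0 this counts the first five positions, matching the
description of the example.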

Individual Filters
--indv <string>
Specify an individual to be kept in the analysis. This option can be used multiple
times to specify multiple individuals.

--keep <filename>
Provide a file containing a list of individuals to include in subsequent analysis.
Each individual ID (as defined in the VCF headerline) should be included on a
separate line.

--remove-indv <string>
Specify an individual to be removed from the analysis. This option can be used
multiple times to specify multiple individuals. If the --indv option is also
specified, then the --indv option is executed before the --remove-indv option.

--remove <filename>
Provide a file containing a list of individuals to exclude in subsequent analysis.
Each individual ID (as defined in the VCF headerline) should be included on a
separate line. If both the --keep and the --remove options are used, then the
--keep option is executed before the --remove option.
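
The ordering of --keep and --remove can be pictured with two small list files. The sample
IDs and file names below are hypothetical, and the grep step merely mimics the stated
order (keep first, then remove); it is not how vcftools itself is implemented.

```shell
# Hypothetical individual lists, one ID per line.
printf 'NA00001\nNA00002\nNA00003\n' > keep.txt
printf 'NA00002\n' > remove.txt

# Keep first, then remove: the retained set is keep.txt minus remove.txt.
retained=$(grep -v -x -F -f remove.txt keep.txt | paste -s -d ' ' -)
echo "individuals retained: ${retained}"

# Only run when vcftools and the (placeholder) input file are both available.
if command -v vcftools >/dev/null 2>&1 && [ -f file1.vcf ]; then
    vcftools --vcf file1.vcf --keep keep.txt --remove remove.txt --freq --out subset
fi
```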

--min-indv-meanDP <float>

--max-indv-meanDP <float>
Calculate the mean coverage on a per-individual basis. Only individuals with
coverage within the range specified by these options are included in subsequent
analyses.

--mind <float>
Specify the minimum call rate threshold for each individual.

--phased
First excludes all individuals having all genotypes unphased, and subsequently
excludes all sites with unphased genotypes. The remaining data therefore consists
of phased data only.

Genotype Filters
--remove-filtered-geno-all

--remove-filtered-geno <string>
The first option removes all genotypes with a FILTER flag. The second option can be
used to exclude genotypes with a specific filter flag.

--minGQ <float>
Exclude all genotypes with a quality below the threshold specified by this option
(GQ).

--minDP <float>
Exclude all genotypes with a sequencing depth below that specified by this option
(DP).

Output Statistics
--freq

--counts

--freq2

--counts2
Output per-site frequency information. The --freq option outputs the allele frequency in a
file with the suffix '.frq'. The --counts option outputs a similar file with the
suffix '.frq.count', which contains the raw allele counts at each site. The --freq2
and --counts2 options are used to suppress allele information in the output file. In
this case, the order of the freqs/counts depends on the numbering in the VCF file.

--depth
Generates a file containing the mean depth per individual. This file has the suffix
'.idepth'.

--site-depth

--site-mean-depth
Generates a file containing the depth per site. The --site-depth option outputs the
depth for each site summed across individuals. This file has the suffix '.ldepth'.
Likewise, the --site-mean-depth option outputs the mean depth for each site, and the
output file has the suffix '.ldepth.mean'.

--geno-depth
Generates a (possibly very large) file containing the depth for each genotype in
the VCF file. Missing entries are given the value -1. The file has the suffix
'.gdepth'.

--site-quality
Generates a file containing the per-site SNP quality, as found in the QUAL column
of the VCF file. This file has the suffix '.lqual'.

--het
Calculates a measure of heterozygosity on a per-individual basis. Specifically, the
inbreeding coefficient, F, is estimated for each individual using a method of
moments. The resulting file has the suffix '.het'.

--hardy
Reports a p-value for each site from a Hardy-Weinberg Equilibrium test (as defined
by Wigginton, Cutler and Abecasis (2005)). The resulting file (with suffix '.hwe')
also contains the Observed numbers of Homozygotes and Heterozygotes and the
corresponding Expected numbers under HWE.

--missing
Generates two files reporting the missingness on a per-individual and per-site
basis. The two files have suffixes '.imiss' and '.lmiss' respectively.

--hap-r2

--geno-r2

--ld-window <int>

--ld-window-bp <int>

--min-r2 <float>
These options are used to report Linkage Disequilibrium (LD) statistics as
summarised by the r2 statistic. The --hap-r2 option informs vcftools to output a
file reporting the r2 statistic using phased haplotypes. This is the traditional
measure of LD often reported in the population genetics literature. If phased
haplotypes are unavailable then the --geno-r2 option may be used, which calculates
the squared correlation coefficient between genotypes encoded as 0, 1 and 2 to
represent the number of non-reference alleles in each individual. This is the same
as the LD measure reported by PLINK. The haplotype version outputs a file with the
suffix '.hap.ld', whereas the genotype version outputs a file with the suffix
'.geno.ld'. The haplotype version implies the option --phased.

The --ld-window option defines the maximum SNP separation for the calculation of
LD. Likewise, the --ld-window-bp option can be used to define the maximum physical
separation of SNPs included in the LD calculation. Finally, the --min-r2 option sets a
minimum value for r2 below which the LD statistic is not reported.
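
A genotype-based LD run combining these options might look like the sketch below. The
window size, r2 cut-off, prefix, and input file name are placeholder values chosen only
for illustration.

```shell
# Placeholder prefix; the genotype version writes a file with the '.geno.ld' suffix.
prefix=ld_chr20
ld_out="${prefix}.geno.ld"
echo "expecting LD output in ${ld_out}"

# Report r2 for SNP pairs at most 50000 bp apart, skipping pairs with r2 below 0.2.
if command -v vcftools >/dev/null 2>&1 && [ -f file1.vcf ]; then
    vcftools --vcf file1.vcf --chr 20 --geno-r2 \
        --ld-window-bp 50000 --min-r2 0.2 --out "$prefix"
fi
```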

--SNPdensity <int>
Calculates the number and density of SNPs in bins of size defined by this option.
The resulting output file has the suffix '.snpden'.

--TsTv <int>
Calculates the Transition / Transversion ratio in bins of size defined by this
option. The resulting output file has the suffix '.TsTv'. A summary is also
supplied in a file with the suffix '.TsTv.summary'.

--FILTER-summary
Generates a summary of the number of SNPs and Ts/Tv ratio for each FILTER category.
The output file has the suffix '.FILTER.summary'.

--filtered-sites
Creates two files listing sites that have been kept or removed after filtering. The
first file, with suffix '.kept.sites', lists sites kept by vcftools after filters
have been applied. The second file, with the suffix '.removed.sites', lists sites
removed by the applied filters.

--singletons
This option will generate a file detailing the location of singletons, and the
individual they occur in. The file reports both true singletons, and private
doubletons (i.e. SNPs where the minor allele only occurs in a single individual and
that individual is homozygotic for that allele). The output file has the suffix
'.singletons'.

--site-pi

--window-pi <int>
These options are used to estimate levels of nucleotide diversity. The first option
does this on a per-site basis, and the output file has the suffix '.sites.pi'. The
second option calculates the nucleotide diversity in windows, with the window size
defined in the option argument. Output for this option has the suffix
'.windowed.pi'. The windowed version requires phased data, and hence use of this
option implies the --phased option.

Output in Other Formats
--012
This option outputs the genotypes as a large matrix. Three files are produced. The
first, with suffix '.012', contains the genotypes of each individual on a separate
line. Genotypes are represented as 0, 1 and 2, where the number represents the
number of non-reference alleles. Missing genotypes are represented by -1. The
second file, with suffix '.012.indv' details the individuals included in the main
file. The third file, with suffix '.012.pos' details the site locations included in
the main file.
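
The 0/1/2 encoding can be illustrated with a small helper. The function encode_gt is
hypothetical and is not part of vcftools; it assumes diploid GT strings separated by '/'
or '|' and simply counts the non-reference alleles, returning -1 when an allele is
missing.

```shell
# Hypothetical helper: map a diploid GT string to the 0/1/2 (or -1) encoding.
encode_gt() {
    printf '%s\n' "$1" | awk -F'[/|]' '{
        if ($1 == "." || $2 == ".") { print -1 }                # missing genotype
        else { n = ($1 != "0") + ($2 != "0"); print n }         # count non-reference alleles
    }'
}

# Show the encoding for a few example genotypes.
for gt in '0/0' '0|1' '1/1' './.'; do
    printf '%s -> %s\n' "$gt" "$(encode_gt "$gt")"
done
```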

--IMPUTE
This option outputs phased haplotypes in IMPUTE reference-panel format. As IMPUTE
requires phased data, using this option also implies --phased. Unphased
individuals and genotypes are therefore excluded. Only bi-allelic sites are
included in the output. Using this option generates three files. The IMPUTE
haplotype file has the suffix '.impute.hap', and the IMPUTE legend file has the
suffix '.impute.hap.legend'. The third file, with suffix '.impute.hap.indv',
details the individuals included in the haplotype file, although this file is not
needed by IMPUTE.

--ldhat

--ldhat-geno
These options output data in LDhat format. Use of these options also requires the
--chr option to be used. The --ldhat option outputs phased data only, and therefore
also implies --phased, leading to unphased individuals and genotypes being
excluded. Alternatively, the --ldhat-geno option treats all of the data as
unphased, and therefore outputs LDhat files in genotype/unphased format. In either
case, two files are generated with the suffixes '.ldhat.sites' and '.ldhat.locs',
which correspond to the LDhat 'sites' and 'locs' input files respectively.

--BEAGLE-GL
This option outputs genotype likelihood information for input into the BEAGLE
program. This option requires the VCF file to contain the FORMAT GL tag, which can
generally be output by SNP callers such as the GATK. Use of this option requires a
chromosome to be specified via the --chr option. The resulting output file (with
the suffix '.BEAGLE.GL') contains genotype likelihoods for biallelic sites, and is
suitable for input into BEAGLE via the 'like=' argument.

--plink
This option outputs the genotype data in PLINK PED format. Two files are generated,
with suffixes '.ped' and '.map'. Note that only bi-allelic loci will be output.
Further details of these files can be found in the PLINK documentation.

Note: This option can be very slow on large datasets. Using the --chr option to
divide up the dataset is advised.
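
One way to follow this advice is to loop over chromosomes and write one PED/MAP pair per
chromosome. The chromosome list, input file name, and prefix scheme below are assumptions
made for the sake of the example.

```shell
# Collect the per-chromosome prefixes so the outputs can be located afterwards.
prefixes=""
for chr in 20 21 22; do
    prefix="plink_chr${chr}"
    prefixes="${prefixes}${prefixes:+ }${prefix}"

    # Each run writes ${prefix}.ped and ${prefix}.map for a single chromosome.
    if command -v vcftools >/dev/null 2>&1 && [ -f file1.vcf ]; then
        vcftools --vcf file1.vcf --chr "$chr" --plink --out "$prefix"
    fi
done
echo "output prefixes: ${prefixes}"
```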

--plink-tped
The --plink option above can be extremely slow on large datasets. An alternative
that might be considerably quicker is to output in the PLINK transposed format.
This can be achieved using the --plink-tped option, which produces two files with
suffixes '.tped' and '.tfam'.

--recode
The --recode option is used to generate a VCF file from the input VCF file having
applied the options specified by the user. The output file has the suffix
'.recode.vcf'.

By default, the INFO fields are removed from the output file, as the INFO values
may be invalidated by the recoding (e.g. the total depth may need to be
recalculated if individuals are removed). This default functionality can be
overridden by using the --keep-INFO <string> option, where <string> defines the
INFO key to keep in the output file. The --keep-INFO flag can be used multiple
times. Alternatively, the option --keep-INFO-all can be used to retain all INFO
fields.
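
For instance, a filtered VCF restricted to chromosome 20 that retains only the NS and DB
INFO entries could be produced roughly as follows. The prefix and input file name are
placeholders.

```shell
# Placeholder prefix; the recoded file takes the '.recode.vcf' suffix.
prefix=chr20_filtered
recoded="${prefix}.recode.vcf"

# Keep two named INFO keys; use --keep-INFO-all instead to retain every INFO field.
if command -v vcftools >/dev/null 2>&1 && [ -f file1.vcf ]; then
    vcftools --vcf file1.vcf --chr 20 --recode \
        --keep-INFO NS --keep-INFO DB --out "$prefix"
fi
echo "filtered VCF expected at ${recoded}"
```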

Miscellaneous
--extract-FORMAT-info <string>
Extract information from the genotype fields in the VCF file relating to a specified
FORMAT identifier. For example, using the option '--extract-FORMAT-info GT' would
extract all of the GT (i.e. Genotype) entries. The resulting output file has
the suffix '.<FORMAT_ID>.FORMAT'.

--get-INFO <string>
This option is used to extract information from the INFO field in the VCF file. The
<string> argument specifies the INFO tag to be extracted, and the option can be
used multiple times in order to extract multiple INFO entries. The resulting file,
with suffix '.INFO', contains the required INFO information in a tab-separated
table. For example, to extract the NS and DB flags, one would use the command:

vcftools --vcf file1.vcf --get-INFO NS --get-INFO DB

VCF File Comparison Options
The file comparison options are currently in a state of flux and likely buggy. If you
find a bug, please report it. Note that genotype-level filters are not supported in these
options.

--diff <filename>

--gzdiff <filename>
Select a VCF file for comparison with the file specified by the --vcf option.
Outputs two files describing the sites and individuals common / unique to each
file. These files have the suffixes '.diff.sites_in_files' and
'.diff.indv_in_files' respectively. The --gzdiff version can be used to read
compressed VCF files.

--diff-site-discordance
Used in conjunction with the --diff option to calculate discordance on a site by
site basis. The resulting output file has the suffix '.diff.sites'.

--diff-indv-discordance
Used in conjunction with the --diff option to calculate discordance on a per-
individual basis. The resulting output file has the suffix '.diff.indv'.

--diff-discordance-matrix
Used in conjunction with the --diff option to calculate a discordance matrix. This
option only works with bi-allelic loci with matching alleles that are present in
both files. The resulting output file has the suffix '.diff.discordance.matrix'.

--diff-switch-error
Used in conjunction with the --diff option to calculate phasing errors
(specifically 'switch errors'). This option generates two output files describing
switch errors found between sites, and the average switch error per individual.
These two files have the suffixes '.diff.switch' and '.diff.indv.switch'
respectively.

Options still in development
The following options are yet to be finalised, are likely to contain bugs, and are likely
to change in the future.

--fst <filename>

--gzfst <filename>
Calculate FST for a pair of VCF files, with the second file being specified by this
option. FST is currently calculated using the formula described in the
supplementary material of the Phase I HapMap paper. Currently, only pairwise FST
calculations are supported, although this will likely change in the future. The
--gzfst option can be used to read compressed VCF files.

--LROH
Identify Long Runs of Homozygosity.

--relatedness
Output Individual Relatedness Statistics.
