Manual

Installation

Installing SpeeDB is straightforward. Download the latest version from the home page, extract it into a new directory, and change to that directory. Then compile SpeeDB by running

make clean; make

The speedb tped2dbs command

The SpeeDB package uses an internal data structure to organize the genomic data of a database of individuals. To facilitate its usage, the speedb tped2dbs command converts the input PLINK TPED file into a binary DBS file. The genotype of an individual is created by combining two adjacent haplotypes. If either haplotype of an individual has a missing value at a marker, then the genotype of this individual at that marker is set to 1 (namely, one copy of the minor allele). The output of this command is a file with extension .dbs. This output file is required by all the other commands in the SpeeDB package.
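The genotype rule above can be illustrated with a minimal Python sketch. It is not SpeeDB's own code: the allele encoding (0 for the major allele, 1 for the minor allele, None for a missing value), the function name combine_haplotypes, and the reading of a genotype as a minor-allele count are assumptions made for illustration only.

```python
from typing import List, Optional, Sequence

# Assumed encoding (not taken from SpeeDB): 0 = major allele, 1 = minor allele,
# None = missing value.
Allele = Optional[int]


def combine_haplotypes(hap_a: Sequence[Allele], hap_b: Sequence[Allele]) -> List[int]:
    """Return one genotype per marker, counted as the number of minor alleles."""
    if len(hap_a) != len(hap_b):
        raise ValueError("both haplotypes must cover the same markers")
    genotype = []
    for a, b in zip(hap_a, hap_b):
        if a is None or b is None:
            # A missing value in either haplotype yields genotype 1,
            # as described in the manual.
            genotype.append(1)
        else:
            genotype.append(a + b)
    return genotype


# Two adjacent haplotypes of one individual over four markers.
print(combine_haplotypes([0, 1, 1, None], [0, 0, 1, 1]))
```

Under these assumptions, the last marker is reported as 1 even though only one of its two alleles is known.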

Command line is

speedb tped2dbs <input.tped> <output.dbs>

The speedb dbs2query command

The SpeeDB package provides a command to convert each database individual to a query individual. This enables pairwise IBD inference from the input database.

Command line is

speedb dbs2query <input.dbs> <output.query>

The speedb filter command

Command speedb filter takes a batch of query individuals and a large collection of individuals as inputs. In a single query, SpeeDB compares a query individual against the collection and outputs candidate IBD segments for downstream accurate IBD inference.

Command line is

speedb filter [options] <input.dbs> <input.query> <output.candidate>

Options

-b <float> sets the target IBD block size. The default value is 4 cM.

-s <float> sets the step size, i.e. par_step in the article. The default value is 0.5 cM. The window size par_window is calculated from this value.

-e <float> sets the parameter η_J in the optimization. The default value is 0.9.

-1 <int> sets the Major Filter's threshold p_th as 1e-<int>. The current version supports <int> = {0, 1, 2}. The default is 1.

-2 <int> sets the Minor Filter's threshold p_th* as 1e-<int>. The current version supports <int> = {0, 1, 2}. The default is 1. Users can use different thresholds on these two filters.

-p <string> allows users to specify the Major Filter's pruning criteria by pointing SpeeDB to a file with extension .1. Note that the option value should not include this extension. For example, if the filename is criteria.1, then the option should be -p criteria. The file contains 100 numbers, one per line. The ith number indicates the number of tolerable false opposite homozygous markers when the Major Filter selects i markers in a window. If both -p and -1 are specified, SpeeDB uses -p.

-q <string> works the same way as -p. The only differences are that -q is used to specify Minor Filter's pruning criteria, and that the input file has extension .2.

-w runs pairwise mode. Use this mode only if the query file was generated from the dbs file with the speedb dbs2query command.
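The criteria file format used by -p and -q can be sketched as follows. This is a hedged illustration, not part of SpeeDB: the helper names write_criteria and example_tolerances are hypothetical, the values are assumed to be integers, and the one-tolerated-marker-per-25-selected rule is an arbitrary placeholder rather than a recommended setting.

```python
from pathlib import Path
from typing import List, Sequence

N_ENTRIES = 100  # the manual states that a criteria file holds 100 numbers


def example_tolerances() -> List[int]:
    """Hypothetical rule for illustration only: tolerate one false opposite
    homozygous marker per 25 selected markers."""
    return [i // 25 for i in range(1, N_ENTRIES + 1)]


def write_criteria(prefix: str, tolerances: Sequence[int], filter_id: int) -> Path:
    """Write '<prefix>.1' (Major Filter) or '<prefix>.2' (Minor Filter).

    Line i of the file holds the tolerance used when the filter selects
    i markers in a window.
    """
    if filter_id not in (1, 2):
        raise ValueError("filter_id must be 1 (Major) or 2 (Minor)")
    if len(tolerances) != N_ENTRIES:
        raise ValueError(f"expected {N_ENTRIES} numbers, got {len(tolerances)}")
    path = Path(f"{prefix}.{filter_id}")
    path.write_text("\n".join(str(t) for t in tolerances) + "\n")
    return path
```

Calling write_criteria("criteria", example_tolerances(), 1) would create criteria.1, which is then passed to SpeeDB as -p criteria, without the extension.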

The speedb genQ command

Users can use the speedb genQ command to create simulated query individuals. For each query, this command creates two composite haplotypes from the original haplotypes in the input file as follows. It partitions the genome into a series of mosaics. Each mosaic segment of a composite haplotype is made by copying alleles from a randomly selected original haplotype. Next, it simulates an IBD segment between the query individual and a random database individual at a random location by copying one of the haplotypes of the database individual over one of the haplotypes of the query individual. Finally, errors are introduced into the query individual's haplotypes at a user-defined error rate.
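The simulation steps described above can be sketched in Python. This is a simplified illustration under stated assumptions, not the genQ implementation: haplotypes are assumed to be lists of 0/1 alleles, marker positions are assumed to be given in cM, an error is modeled as an allele flip, and all function names are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Haplotype = List[int]  # assumed encoding: 0 = major allele, 1 = minor allele


def composite_haplotype(haps: Sequence[Haplotype], positions_cm: Sequence[float],
                        mosaic_cm: float, rng: random.Random) -> Haplotype:
    """Copy each mosaic from a randomly selected original haplotype."""
    composite = []
    source = rng.choice(haps)
    current_mosaic = 0
    for m, pos in enumerate(positions_cm):
        mosaic = int((pos - positions_cm[0]) // mosaic_cm)
        if mosaic != current_mosaic:  # entering a new mosaic: pick a new source
            current_mosaic = mosaic
            source = rng.choice(haps)
        composite.append(source[m])
    return composite


def inject_ibd(query_hap: Haplotype, donor_hap: Haplotype,
               positions_cm: Sequence[float], ibd_cm: float,
               rng: random.Random) -> Tuple[float, float]:
    """Overwrite a random stretch of length ibd_cm with the donor's alleles."""
    first = positions_cm[0]
    last_start = max(first, positions_cm[-1] - ibd_cm)
    start = rng.uniform(first, last_start)
    end = start + ibd_cm
    for m, pos in enumerate(positions_cm):
        if start <= pos <= end:
            query_hap[m] = donor_hap[m]
    return start, end


def add_errors(hap: Haplotype, error_rate: float, rng: random.Random) -> None:
    """Flip each allele independently with probability error_rate (assumed model)."""
    for m in range(len(hap)):
        if rng.random() < error_rate:
            hap[m] = 1 - hap[m]


def simulate_query(haps: Sequence[Haplotype], positions_cm: Sequence[float],
                   ibd_cm: float = 4.0, mosaic_cm: float = 0.2,
                   error_rate: float = 0.005, seed: int = 0):
    """Return two query haplotypes and (donor index, IBD start, IBD end)."""
    rng = random.Random(seed)
    query = [composite_haplotype(haps, positions_cm, mosaic_cm, rng) for _ in range(2)]
    donor = rng.randrange(len(haps))  # a random database haplotype
    start, end = inject_ibd(rng.choice(query), haps[donor], positions_cm, ibd_cm, rng)
    for hap in query:
        add_errors(hap, error_rate, rng)
    return query, (donor, start, end)
```

The default arguments mirror the documented defaults of -b, -g and -t. The returned tuple plays the role of the IBD information that genQ records for evaluation, although the real file format is binary and is not reproduced here.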

The outputs of this command are two binary files: output.query and output.query.meta. The former stores the genotypes of the query individuals. The latter stores the IBD information, which is used for evaluating the performance of SpeeDB.

Command line is

speedb genQ [options] <input.dbs> <output.query>

Options

-b <float> sets the artificial IBD segment length. The default value is 4 cM.

-g <float> sets the size of mosaics. The default value is 0.2 cM.

-t <float> sets the injected genotyping error rate. The default value is 0.005.

-c <int> sets the number of query individuals. The default is 1,000.

-N disables the injection of any artificial IBD.

Example

Suppose you have a data set in PLINK TPED format named sample.tped. After you have installed SpeeDB, run the following commands to nominate candidate IBD segments shared by pairs of individuals in the database.

speedb tped2dbs sample.tped sample.dbs

speedb dbs2query sample.dbs sample.query

speedb filter -w -1 0 -2 0 sample.dbs sample.query sample.candidate

You can find the candidates in the file sample.candidate. Note that in this example, because the thresholds of both filters are set to 1e-0, no false opposite homozygous markers are tolerated.
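For scripted use, the three commands of this example and the 1e-<int> threshold convention can be sketched in Python. The helpers threshold and pairwise_pipeline are hypothetical names introduced here. The sketch only builds the argument lists, using the options documented in this manual, and does not run SpeeDB.

```python
from typing import List

SUPPORTED_EXPONENTS = (0, 1, 2)  # values the manual lists for -1 and -2


def threshold(exponent: int) -> float:
    """Translate the <int> given to -1 or -2 into the threshold 1e-<int>."""
    if exponent not in SUPPORTED_EXPONENTS:
        raise ValueError(f"unsupported exponent: {exponent}")
    return 10.0 ** -exponent


def pairwise_pipeline(prefix: str, major: int = 1, minor: int = 1) -> List[List[str]]:
    """Build the tped2dbs, dbs2query and filter commands for pairwise mode."""
    threshold(major)  # validate both exponents before building any command
    threshold(minor)
    dbs = f"{prefix}.dbs"
    query = f"{prefix}.query"
    candidate = f"{prefix}.candidate"
    return [
        ["speedb", "tped2dbs", f"{prefix}.tped", dbs],
        ["speedb", "dbs2query", dbs, query],
        ["speedb", "filter", "-w", "-1", str(major), "-2", str(minor),
         dbs, query, candidate],
    ]


for command in pairwise_pipeline("sample", major=0, minor=0):
    print(" ".join(command))
```

Each list could then be passed to subprocess.run(command, check=True), assuming the speedb executable is on the PATH.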

More examples

Users can create a large set of query individuals with a small genotyping error rate by using

speedb genQ -c 10000 -t 0.0025 sample.dbs sample_large_set_light_error.query

and then compare this query set against the database with default filter thresholds by using

speedb filter -c 10000 sample.dbs sample_large_set_light_error.query sample_large_set_light_error.candidate

Also, one can create a set of query individuals that share 6 cM IBD segments with database individuals by using

speedb genQ -b 6 sample.dbs sample_long_IBD.query

and then import customized pruning criteria into the filtering process. The criteria are stored in files customized_criteria.1 and customized_criteria.2 for the Major Filter and the Minor Filter respectively.

speedb filter -b 6 -p customized_criteria -q customized_criteria sample.dbs sample_long_IBD.query sample_long_IBD.candidate