Alignment of Optical Maps Journal of Computational Biology (2006)
By Anton Valouev, Lei Li, Yu-Chi Liu, David C. Schwartz, Yi Yang,
Yu Zhang, Michael S. Waterman
We introduce a new scoring method for calculation of alignments of optical maps. Missing cuts, false cuts and
sizing errors present in optical maps are addressed by our alignment score through calculation of corresponding
likelihoods. The size error model is derived through the application of the Central Limit Theorem and validated
by residual plots collected from real data. Missing cuts and false cuts are modeled as Bernoulli and Poisson
events respectively as suggested by previous studies. Likelihoods are used to derive an alignment score through
calculation of likelihood ratios for a certain hypothesis test. This achieves maximal discriminative
power for the alignment score. Our scoring method is naturally embedded within a well-known DP framework
for finding optimal alignments.
Fine mapping - 19th Century style BMC Genetics: 2005, in press.
By John Molitor, Keyan Zhao and Paul Marjoram
Background: There is great interest in the use of computationally intensive methods for fine mapping of
marker data. In this paper we develop methods based upon ideas originally proposed 100 years ago in the
context of spatial clustering.
Methods: We use spatial clustering of haplotypes as a low-dimensional surrogate for the unobserved
genealogy underlying a set of genotype data. In doing so we hope to avoid the computational complexity
inherent in explicitly modelling details of the ancestry of the sample, while at the same time capturing the
key correlations induced by that ancestry at a much lower computational cost.
Results: We benchmark our methods using the simulated GAW 14 data, using 100 replicates of 4 phenotypes
to indicate the power of our method. When a functional mutation relating to a trait is actually
present, we find evidence for that mutation in 97 out of 100 replicates, on average.
Conclusions: Results show that our method has the ability to accurately infer the location of functional
mutations from unphased genotype data.
An approximation algorithm for haplotype inference by
maximum parsimony Journal of Computational Biology (2005) 12(10):1261-74
By Yao-Ting Huang, Kun-Mao Chao and Ting Chen
This paper studies haplotype inference by maximum parsimony using population data. We define
the optimal haplotype inference (OHI) problem as follows: given a set of genotypes and a set of related
haplotypes, find a minimum subset of haplotypes that can resolve all the genotypes. We prove that
OHI is NP-hard and can be formulated as an integer quadratic programming (IQP) problem. To
solve the IQP problem, we propose an iterative semi-definite programming based approximation
algorithm (called SDPHapInfer). We show that this algorithm finds a solution within a factor of
O(log n) of the optimal solution, where n is the number of genotypes. This algorithm has been
implemented and tested on a variety of simulated and biological data. In comparison with three
other methods: (1) HAPAR, which was implemented based on the branch-and-bound algorithm,
(2) HAPLOTYPER, which was implemented based on the Expectation-Maximization algorithm,
and (3) PHASE, which combined the Gibbs sampling algorithm with an approximate coalescent
prior, the experimental results indicate that SDPHapInfer and HAPLOTYPER have similar error
rates. In addition, the results generated by PHASE have lower error rates on some data but higher
error rates on others. The error rates of HAPAR are higher than the others on biological data. In
terms of efficiency, SDPHapInfer, HAPLOTYPER, and PHASE output a solution in a stable and
consistent way, and they run much faster than HAPAR when the number of genotypes becomes large.
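The OHI problem statement above can be made concrete with a small sketch. This is not SDPHapInfer: it is an exhaustive search that is only practical for toy inputs, shown to illustrate what "a minimum subset of haplotypes that can resolve all the genotypes" means. The encoding (0 and 1 for homozygous sites, 2 for heterozygous sites) and the function names `resolves` and `min_resolving_subset` are assumptions made for this illustration.

```python
from itertools import combinations


def resolves(h1, h2, genotype):
    """Return True if the haplotype pair (h1, h2) is consistent with the genotype.

    Assumed encoding: haplotype sites are 0 or 1; genotype sites are 0 or 1
    (homozygous for that allele) or 2 (heterozygous).
    """
    for a, b, g in zip(h1, h2, genotype):
        if g == 2:
            # Heterozygous site: the two haplotypes must differ.
            if a == b:
                return False
        elif a != g or b != g:
            # Homozygous site: both haplotypes must carry allele g.
            return False
    return True


def min_resolving_subset(haplotypes, genotypes):
    """Brute-force search for a smallest subset of haplotypes resolving every genotype.

    Exponential in the number of haplotypes; a stand-in for illustration only,
    not the semi-definite programming approximation described in the abstract.
    """
    for k in range(1, len(haplotypes) + 1):
        for subset in combinations(haplotypes, k):
            if all(
                any(resolves(h1, h2, g) for h1 in subset for h2 in subset)
                for g in genotypes
            ):
                return list(subset)
    return None  # no subset of the candidates resolves all genotypes
```

For example, with candidate haplotypes `(0, 0)`, `(1, 1)`, `(0, 1)`, `(1, 0)` and genotypes `(2, 2)` and `(0, 0)`, no single haplotype suffices, and the search returns the pair `(0, 0)`, `(1, 1)`.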
HAPLORE: a program for haplotype
reconstruction in general pedigrees without
recombination Bioinformatics 21: No. 1, 90-103, 2005
By Kui Zhang, Fengzhu Sun and Hongyu Zhao
Motivation: Haplotype reconstruction is an essential step
in genetic linkage and association studies. Although many
methods have been developed to estimate haplotype frequencies
and reconstruct haplotypes for a sample of unrelated
individuals, haplotype reconstruction in large pedigrees with
a large number of genetic markers remains a challenging
problem.
Methods: We have developed an efficient computer program,
HAPLORE (HAPLOtype REconstruction), to identify all haplotype
sets that are compatible with the observed genotypes
in a pedigree for tightly linked genetic markers. HAPLORE
consists of three steps that can serve different needs in applications.
In the first step, a set of logic rules is used to reduce
the number of compatible haplotypes of each individual in
the pedigree as much as possible. After this step, the haplotypes
of all individuals in the pedigree can be completely
or partially determined. These logic rules are applicable to
completely linked markers and they can be used to impute
missing data and check genotyping errors. In the second step,
a haplotype-elimination algorithm similar to the genotype-elimination
algorithms used in linkage analysis is applied to
delete incompatible haplotypes derived from the first step.
All superfluous haplotypes of the pedigree members will be
excluded after this step. In the third step, the expectation-maximization
(EM) algorithm combined with the partition and
ligation technique is used to estimate haplotype frequencies
based on the inferred haplotype configurations through the
first two steps. Only compatible haplotype configurations with
haplotypes having frequencies greater than a threshold are
retained.
Results: We test the effectiveness and the efficiency of
HAPLORE using both simulated and real datasets. Our
results show that the rule-based algorithm is very efficient
for completely genotyped pedigrees. In this case, almost all
of the families have one unique haplotype configuration. In
the presence of missing data, the number of compatible haplotypes
can be substantially reduced by HAPLORE, and the
program will provide all possible haplotype configurations of
a pedigree under different circumstances, if such multiple
configurations exist. These inferred haplotype configurations,
as well as the haplotype frequencies estimated by the EM
algorithm, can be used in genetic linkage and association
studies.
Comparisons of substitution, insertion and deletion
probes for resequencing and mutational analysis
using oligonucleotide microarrays Nucleic Acids Research, 2005, Vol. 33, No. 3 e33
By Mazen W. Karaman, Susan Groshen, Chi-Chiang Lee and Brian L. Pike
Although oligonucleotide probes complementary to
single nucleotide substitutions are commonly used in
microarray-based screens for genetic variation, little
is known about the hybridization properties of probes
complementary to small insertions and deletions. It is
necessary to define the hybridization properties of
these latter probes in order to improve the specificity
and sensitivity of oligonucleotide microarray-based
mutational analysis of disease-related genes. Here,
we compare and contrast the hybridization properties
of oligonucleotide microarrays consisting of 25mer
probes complementary to all possible single nucleotide
substitutions and insertions, and one and two
base deletions in the 9168 bp coding region of the
ATM (ataxia telangiectasia mutated) gene. Over 68
different dye-labeled single-stranded nucleic acid
targets representing all ATM coding exons were
applied to these microarrays. We assess hybridization
specificity by comparing the relative hybridization
signals from probes perfectly matched to ATM
sequences to those containing mismatches. Probes
complementary to two base substitutions displayed
the highest average specificity followed by those
complementary to single base substitutions, single
base deletions and single base insertions. In all the
cases, hybridization specificity was strongly influenced
by sequence context and possible intra- and
intermolecular probe and/or target structure.
Furthermore, single nucleotide substitution probes
displayed the most consistent hybridization specificity
data followed by single base deletions, two base
deletions and single nucleotide insertions. Overall,
these studies provide valuable empirical data that
can be used to more accurately model the hybridization
properties of insertion and deletion probes
and improve the design and interpretation of oligonucleotide
microarray-based resequencing and
mutational analysis.
Identifying Susceptibility Genes by Using Joint Tests of Association and Linkage and Accounting For Epistasis BMC Genetics 2005:under review
By Joshua Millstein, Kimberly D. Siegmund, David V. Conti and W. James Gauderman
Simulated GAW 14 data were analyzed by jointly testing linkage and association and by accounting for epistasis using a candidate gene approach. Our group was unblinded to the “answers.” The 48 SNPs within the six disease loci were analyzed in addition to five SNPs from each of two non-disease-related loci. Affected sib-parent data were extracted from the first 10 replicates for populations Aipotu, Kaarangar, and Danacaa, and analyzed separately for each replicate. We developed a likelihood for testing association and/or linkage using data from affected sib pairs and their parents. Identical by descent (ibd) allele sharing between sibs was explicitly modeled using a conditional logistic regression approach and incorporating a covariate that represents expected ibd allele sharing given the genotypes of the sibs and their parents. Interactions were accounted for by performing likelihood ratio tests in stages determined by the highest order interaction term in the model. In the first stage, main effects were tested independently, and in subsequent stages, multilocus effects were tested conditional on significant marginal effects. A reduction in the number of tests performed was achieved by prescreening gene combinations with a goodness-of-fit chi-square statistic that depended on mating-type frequencies. SNP-specific joint effects of linkage and association were identified for loci D1, D2, D3, and D4 in multiple replicates. The strongest effect was for SNP B03T3056, which had a median p-value of 1.98E-34. No two- or three-locus effects were found in more than one replicate.
Inference of Missing SNPs and Information
Quantity Measurements for Haplotype Blocks Bioinformatics, 2005 May 1;21(9):2001-7.
By Shih-Chieh Su, C.-C. Jay Kuo and Ting Chen
Motivation: Missing data in genotyping single nucleotide
polymorphism (SNP) spots are common. High-throughput
genotyping methods usually have a high rate of missing data.
For example, the published human chromosome 21 data by
Patil et al. (2001) contains about 20% missing SNPs. Inferring
missing SNPs using the haplotype block structure is promising
but difficult because the haplotype block boundaries
are not well-defined. Here we propose a global algorithm to
overcome this difficulty.
Results: First, we propose to use entropy as a measure of
haplotype diversity. We show that the entropy measure combined
with a dynamic programming algorithm produces better
haplotype block partitions than other measures. Second,
based on the entropy measure, we propose a two-step iterative
partition-inference (IPI) algorithm for the inference of
missing SNPs. At the first step, we apply the dynamic programming
algorithm to partition haplotypes into blocks. At
the second step, we use an iterative process similar to
the expectation-maximization (EM) algorithm to infer missing
SNPs in each haplotype block so as to minimize the block
entropy. The algorithm iterates these two steps until the total
block entropy is minimized. We test our algorithm on several
experimental data sets. The results show that the global
approach significantly improves the accuracy of the inference.
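As a rough illustration of using entropy as a measure of haplotype diversity, the sketch below computes the Shannon entropy of the haplotype patterns observed within a single block. The abstract does not give the exact formula, so treating block entropy as the Shannon entropy of pattern frequencies is an assumption, and the function name `block_entropy` is hypothetical.

```python
import math
from collections import Counter


def block_entropy(haplotypes):
    """Shannon entropy (in bits) of the haplotype patterns seen in one block.

    `haplotypes` is a list of strings, one per chromosome, each restricted to
    the SNPs of the block. Assumed definition, not taken from the paper.
    """
    counts = Counter(haplotypes)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Under this definition a block in which every chromosome carries the same pattern has entropy 0, and a block split evenly between two patterns has entropy 1 bit, so lower values correspond to less diverse blocks.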
The Pattern of Polymorphism
in Arabidopsis thaliana PLoS Biology 3: 7, e196, 2005.
By Magnus Nordborg, Tina T. Hu, Yoko Ishino, Jinal Jhaveri, Christopher Toomajian, Honggang Zheng, Erica Bakker, Peter Calabrese, Jean Gladstone, Rana Goyal, Mattias Jakobsson, Sung Kim, Yuri Morozov, Badri Padhukasahasram, Vincent Plagnol, Noah A. Rosenberg, Chitiksha Shah, Jeffrey D. Wall, Jue Wang, Keyan Zhao, Theodore Kalbfleisch, Vincent Schulz, Martin Kreitman and Joy Bergelson
We resequenced 876 short fragments in a sample of 96 individuals of Arabidopsis thaliana that included stock center
accessions as well as a hierarchical sample from natural populations. Although A. thaliana is a selfing weed, the pattern
of polymorphism in general agrees with what is expected for a widely distributed, sexually reproducing species.
Linkage disequilibrium decays rapidly, within 50 kb. Variation is shared worldwide, although population structure and
isolation by distance are evident. The data fail to fit standard neutral models in several ways. There is a genome-wide
excess of rare alleles, at least partially due to selection. There is too much variation between genomic regions in the
level of polymorphism. The local level of polymorphism is negatively correlated with gene density and positively
correlated with segmental duplications. Because the data do not fit theoretical null distributions, attempts to infer
natural selection from polymorphism data will require genome-wide surveys of polymorphism in order to identify
anomalous regions. Despite this, our data support the utility of A. thaliana as a model for evolutionary functional
genomics.
Software for tag single nucleotide
polymorphism selection Human Genetics 145671-17:8, 13/4, 2005.
By Daniel O. Stram
This paper reviews the theoretical basis for single nucleotide polymorphism (SNP) tagging and considers the use of current software made
freely available for this task. A distinction between haplotype block-based and non-block-based approaches yields two classes
of procedures. Analysis of two different sets of SNP genotype data from the HapMap is used to judge the practical aspects of using each of
the programs considered, as well as to make some general observations about the performance of the programs in finding optimal sets of
tagging SNPs. Pairwise R² methods, while the simplest of those considered, do tend to pick more tagging SNPs than are strictly needed to
predict unmeasured (non-tagging) SNPs, since a combination of two or more tagging SNPs can together form a predictor of SNPs that have
no direct (pairwise) surrogate. Block-based methods that exploit the linkage disequilibrium structure within haplotype blocks exploit this
sort of redundancy, but run a risk of over-fitting if used without some care. A compromise approach which eliminates the need first to
analyse block structure, but which still exploits simple relationships between SNPs, appears promising.
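To make the pairwise R² idea concrete, here is a minimal sketch of greedy tag selection on phased 0/1 haplotype data. It does not reproduce any of the programs reviewed in the paper. The function names `r_squared` and `greedy_tags`, the 0.8 threshold, and the restriction of candidate tags to SNPs not yet covered are all assumptions made for this illustration.

```python
def r_squared(a, b):
    """Squared correlation between two biallelic SNPs.

    `a` and `b` are equal-length lists of 0/1 alleles over the same haplotypes.
    Returns 0.0 if either SNP is monomorphic.
    """
    n = len(a)
    pa = sum(a) / n
    pb = sum(b) / n
    pab = sum(x & y for x, y in zip(a, b)) / n
    denom = pa * (1 - pa) * pb * (1 - pb)
    if denom == 0:
        return 0.0
    d = pab - pa * pb  # linkage disequilibrium coefficient D
    return d * d / denom


def greedy_tags(snps, threshold=0.8):
    """Greedily pick tag SNPs until every SNP has a pairwise surrogate.

    `snps` maps SNP name -> 0/1 allele list. At each step, the untagged SNP
    that covers the most untagged SNPs (r^2 >= threshold) becomes a tag.
    Ties are broken alphabetically.
    """
    untagged = set(snps)
    tags = []
    while untagged:
        best = max(
            sorted(untagged),
            key=lambda s: sum(
                r_squared(snps[s], snps[t]) >= threshold for t in untagged
            ),
        )
        tags.append(best)
        untagged -= {
            t for t in untagged if r_squared(snps[best], snps[t]) >= threshold
        }
        untagged.discard(best)  # a monomorphic SNP does not cover itself
    return tags
```

Because each SNP must have a single tag with high pairwise R², a scheme like this cannot exploit cases where two tags jointly predict a third SNP, which is the source of the redundancy the abstract describes.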