Ohana is a suite of software for analyzing population structure and admixture history using unsupervised learning methods. We construct statistical models to infer individual clustering from which we identify outliers for selection analyses.
This project was directed by Dr. Rasmus Nielsen at University of California Berkeley. Jade Cheng was funded by the Bioinformatics, Computer Science Department at Aarhus University and the Natural History Museum of Denmark at Copenhagen University.
About me
I immigrated from China to Hawaii, where I met my husband. We lived in Honolulu for eight years. During that time, I earned two degrees from the University of Hawaiʻi at Mānoa. My MS in CS paved the way to my PhD program in Denmark and later the birth of this project.
Ohana means extended family in Hawaiian, and the word seems fitting to represent this project since it couldn’t exist without the love and support of the friends, family, and colleagues who joined me on this journey many moons ago.
Installation
The Ohana source code is available on github. Building Ohana from source requires a UNIX development environment with Make, a C++ compiler, and BLAS/LAPACK libraries.
On macOS, the BLAS/LAPACK libraries come preinstalled with the operating system through the Accelerate Framework, so no prerequisite steps are required beyond installing Xcode, which provides the necessary tools to build software from the terminal. See Apple’s documentation for more information.
On Linux, the development tools and libraries may need to be installed explicitly. For example, on Debian-based systems, these packages should be installed:
Once the development environment is prepared, Ohana can be downloaded and built by following these steps:
$ git clone https://github.com/jade-cheng/ohana
$ cd ./ohana
$ make
This creates several programs in the ./bin/* directory, which are described in later sections of this documentation.
$ ls ./bin
convert
cpax
filter
nemeco
neoscan
qpas
selscan
Description
convert
To facilitate different stages of the analysis, we provide several conversion subroutines. ped2dgm converts genotype observations from the plink format to feed into qpas. bgl2lgm converts genotype likelihoods from the beagle format to feed into qpas. cov2nwk first converts a covariance matrix to a distance matrix, then it implements the Neighbor Joining algorithm to approximate the distance matrix into a Newick tree. nwk2svg produces a scalar vector graphics representation of the Newick tree. The output can be viewed with web browsers and modified with graphics editors like Inkscape. Finally, if a tree-compatible covariance matrix is desired for selscan, we have nwk2cov to converts a Newick tree to a covariance matrix.
qpas and cpax
Under the assumption of Hardy Weinberg Equilibrium, the likelihood of assigning an observed genotype g in individual i at locus j to ancestral component k is a function of the allelic frequency f_kj of the locus at k and the fraction of the genome of the individual q_ik that comes from that component. We thus consider the likelihood of the ancestral component proportions vector Q and their vector of allele frequencies F. In particular, if we denote K as the number of ancestry components, I as the number of individuals, and J as the number of polymorphic sites among the I individuals, then the log likelihood of observing the genotype is:
To estimate Q and F, we apply a Newton-style optimization method using quadratic programming through the active set algorithm. qpas operates by solving equality-constraint quadratic problems using the Karush-Kuhn-Tucker algorithm, which is a nonlinear programming generalization of the Lagrange multiplier method. cpax operates through complementarity pivoting using Lemke’s algorithm.
Leveraging the block structure of the hessian matrices for Q and F, we decompose the problem into a sequence of small-matrix manipulations rather than managing one large linear system. This allows us to update Q, row after row, and F, column after column. The optimization task, therefore, becomes feasible.
nemeco
We model the joint distribution of allele frequencies across all ancestral components as a multivariate Gaussian distribution. The likelihood denotes the probability of observing the given allele frequencies. The variances and covariances of the multivariate Gaussian are determined by a product term of two factors. The first factor is site-specific, and the second factor records the ancestral components’ variances and covariances relation.
P(f_j | C, u_j) ~ N(u_j * (1 - u_j) * C)
We denote C as the covariance matrix to be inferred. We denote the allele frequency at site j as f_j and average major allele frequency at site j as u_j. We root the covariance matrix to avoid multiple likelihood optima and obtain C'. We formulate the log likelihood analytically as the following:
nemetree is a web service dedicated to delivering clean, accurate, and presentation-ready visualizations of phylogenetic trees. To find an appealing arrangement of a tree, nemetree takes inspiration from an electrostatic field and models tree components as like-signed charged particles with nodes constrained by the branches that connect them. nemetree then utilizes the Nelder-Mead algorithm to minimize the total potential energy of this system and achieve an optimal tree layout. nemetree allows users to specify tree structures in Newick format and adjust the model and rendering configuration through a JSON editor. nemetree animates the progression of the optimization and provides a method to pause and resume the process. All rendered trees may be downloaded as SVG. What you see is what you get at http://www.jade-cheng.com/trees/
selscan
We scan for covariance outliers by applying a likelihood model to each locus, similar to the one used genome-wide but with certain scalar factor variations. This creates a nested likelihood model. Through a likelihood ratio test, it identifies loci in which the variance among components is larger than expected from the genome-wide estimated covariance matrix.
This program takes two input matrices, the c matrix and c-scale matrix. These two matrices provide the minimum and maximum values of the optimization and the interpolation is used to define how to optimize multiple values at the same time using a single parameter. In this way the same framework can be used for both optimization of additive and multiplicative models. If -cs option is not supplied, by default, the programs uses a c-scale matrix that is 10 times of the c matrix.
Workflow
A typical workflow of genetic data analysis using Ohana starts with structure inference with qpas using either genotype observations or genotype likelihoods. For genotype observations, we first prepare the data with Plink including the --recode12 --geno 0.0 --tab flags. We then convert the .ped file to a .dgm file, which is used by qpas.
The standard output from qpas contains four columns: iteration number, time elapsed, log likelihood value, and the difference in log likelihood values between iterations. The -qo output file records the admixture inference, which can be used for interpretation directly.
We approximate the estimated covariance matrix into a component tree using cov2nwk, and produce a visualization of the tree using nwk2svg. For a better visualization of the inferred Newick tree, copy the nwk file contents and visit nemetree at http://www.jade-cheng.com/trees/
With the inferred component covariances and allele frequencies, we scan for covariance outliers. The standard output from selscan contains four columns, the scalar value when local best likelihood is reached, the local log likelihood obeying global covariances, the local optimal log likelihood, and the likelihood ratio of this locus. The total number and order of loci match with the input genotype data.
The workflow described above is for demonstration purposes only. For real data analysis, the structure inference using qpas would include many fewer loci than a selection scan using selscan, which would require multiple millions of loci. What’s shown above, a scan of allele frequencies produced directly from the structure analysis, would be inadequate.
Selection study
If selection study is the goal, the first step should be to obtain a full genome dataset of high quality. Then a subset of this data, ~100 Kbp unlinked markers, should be used for the structure analysis with qpas. After that, the -fq and -qi options should be used for qpas to produce admixture-corrected allele frequencies for the full genome dataset. Finally, it should be possible to proceed to the selection study with full genome allele frequencies.
For faster analysis and smaller memory footprints, we recommend performing the last two steps in parallel. For example, suppose we have the full genome dataset HGDP.ped, we have split it into three pieces, each containing 1/3 of the markers, and the files are named HGDP1.ped, HGDP2.ped, and HGDP3.ped.
We first down-sample the full dataset to perform structure analysis and then infer the component covariances.
The file scan_all.txt contains selection results for the full dataset with markers following the order defined in HGDP.ped.
License
Copyright (c) 2015-2020 Jade Cheng
All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
Neither the name of Jade Cheng nor the names of her contributors may be used to endorse or promote products derived from this software without specific prior written permission.
This software is provided by the copyright holders and contributors “as is” and any express or implied warranties, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose are disclaimed. in no event shall Jade Cheng be liable for any direct, indirect, incidental, special, exemplary, or consequential damages (including, but not limited to, procurement of substitute goods or services; loss of use, data, or profits; or business interruption) however caused and on any theory of liability, whether in contract, strict liability, or tort (including negligence or otherwise) arising in any way out of the use of this software, even if advised of the possibility of such damage.
Ohana
Introduction
Ohanais a suite of software for analyzing population structure and admixture history using unsupervised learning methods. We construct statistical models to infer individual clustering from which we identify outliers for selection analyses.This project was directed by Dr. Rasmus Nielsen at University of California Berkeley. Jade Cheng was funded by the Bioinformatics, Computer Science Department at Aarhus University and the Natural History Museum of Denmark at Copenhagen University.
About me
I immigrated from China to Hawaii, where I met my husband. We lived in Honolulu for eight years. During that time, I earned two degrees from the University of Hawaiʻi at Mānoa. My MS in CS paved the way to my PhD program in Denmark and later the birth of this project.
Ohana means extended family in Hawaiian, and the word seems fitting to represent this project since it couldn’t exist without the love and support of the friends, family, and colleagues who joined me on this journey many moons ago.
Installation
The
Ohanasource code is available on github. BuildingOhanafrom source requires a UNIX development environment with Make, a C++ compiler, and BLAS/LAPACK libraries.On macOS, the BLAS/LAPACK libraries come preinstalled with the operating system through the Accelerate Framework, so no prerequisite steps are required beyond installing Xcode, which provides the necessary tools to build software from the terminal. See Apple’s documentation for more information.
On Linux, the development tools and libraries may need to be installed explicitly. For example, on Debian-based systems, these packages should be installed:
Once the development environment is prepared,
Ohanacan be downloaded and built by following these steps:This creates several programs in the
./bin/*directory, which are described in later sections of this documentation.Description
convert
To facilitate different stages of the analysis, we provide several conversion subroutines.
ped2dgmconverts genotype observations from the plink format to feed intoqpas.bgl2lgmconverts genotype likelihoods from the beagle format to feed intoqpas.cov2nwkfirst converts a covariance matrix to a distance matrix, then it implements the Neighbor Joining algorithm to approximate the distance matrix into a Newick tree.nwk2svgproduces a scalar vector graphics representation of the Newick tree. The output can be viewed with web browsers and modified with graphics editors like Inkscape. Finally, if a tree-compatible covariance matrix is desired forselscan, we havenwk2covto converts a Newick tree to a covariance matrix.qpas and cpax
Under the assumption of Hardy Weinberg Equilibrium, the likelihood of assigning an observed genotype
gin individualiat locusjto ancestral componentkis a function of the allelic frequencyf_kjof the locus atkand the fraction of the genome of the individualq_ikthat comes from that component. We thus consider the likelihood of the ancestral component proportions vectorQand their vector of allele frequenciesF. In particular, if we denoteKas the number of ancestry components,Ias the number of individuals, andJas the number of polymorphic sites among theIindividuals, then the log likelihood of observing the genotype is:To estimate
QandF, we apply a Newton-style optimization method using quadratic programming through the active set algorithm.qpasoperates by solving equality-constraint quadratic problems using the Karush-Kuhn-Tucker algorithm, which is a nonlinear programming generalization of the Lagrange multiplier method.cpaxoperates through complementarity pivoting using Lemke’s algorithm.Leveraging the block structure of the hessian matrices for
QandF, we decompose the problem into a sequence of small-matrix manipulations rather than managing one large linear system. This allows us to updateQ, row after row, andF, column after column. The optimization task, therefore, becomes feasible.nemeco
We model the joint distribution of allele frequencies across all ancestral components as a multivariate Gaussian distribution. The likelihood denotes the probability of observing the given allele frequencies. The variances and covariances of the multivariate Gaussian are determined by a product term of two factors. The first factor is site-specific, and the second factor records the ancestral components’ variances and covariances relation.
We denote
Cas the covariance matrix to be inferred. We denote the allele frequency at sitejasf_jand average major allele frequency at sitejasu_j. We root the covariance matrix to avoid multiple likelihood optima and obtainC'. We formulate the log likelihood analytically as the following:nemetree
nemetreeis a web service dedicated to delivering clean, accurate, and presentation-ready visualizations of phylogenetic trees. To find an appealing arrangement of a tree,nemetreetakes inspiration from an electrostatic field and models tree components as like-signed charged particles with nodes constrained by the branches that connect them.nemetreethen utilizes the Nelder-Mead algorithm to minimize the total potential energy of this system and achieve an optimal tree layout.nemetreeallows users to specify tree structures in Newick format and adjust the model and rendering configuration through a JSON editor.nemetreeanimates the progression of the optimization and provides a method to pause and resume the process. All rendered trees may be downloaded as SVG. What you see is what you get at http://www.jade-cheng.com/trees/selscan
We scan for covariance outliers by applying a likelihood model to each locus, similar to the one used genome-wide but with certain scalar factor variations. This creates a nested likelihood model. Through a likelihood ratio test, it identifies loci in which the variance among components is larger than expected from the genome-wide estimated covariance matrix.
This program takes two input matrices, the
c matrixandc-scale matrix. These two matrices provide the minimum and maximum values of the optimization and the interpolation is used to define how to optimize multiple values at the same time using a single parameter. In this way the same framework can be used for both optimization of additive and multiplicative models. If-csoption is not supplied, by default, the programs uses ac-scale matrixthat is 10 times of thec matrix.Workflow
A typical workflow of genetic data analysis using
Ohanastarts with structure inference withqpasusing either genotype observations or genotype likelihoods. For genotype observations, we first prepare the data with Plink including the--recode12 --geno 0.0 --tabflags. We then convert the .ped file to a .dgm file, which is used byqpas.For genotype likelihoods, first prepare the data in beagle format; then convert the it to an .lgm file.
The standard output from
qpascontains four columns: iteration number, time elapsed, log likelihood value, and the difference in log likelihood values between iterations. The-qooutput file records the admixture inference, which can be used for interpretation directly.The
-fooutput file records the allele frequency inference, which can be used to estimate component covariances withnemeco.We approximate the estimated covariance matrix into a component tree using
cov2nwk, and produce a visualization of the tree usingnwk2svg. For a better visualization of the inferred Newick tree, copy the nwk file contents and visitnemetreeat http://www.jade-cheng.com/trees/If the intermediate Newick file is unneeded, it is possible to pipe output through the conversion.
With the inferred component covariances and allele frequencies, we scan for covariance outliers. The standard output from
selscancontains four columns, the scalar value when local best likelihood is reached, the local log likelihood obeying global covariances, the local optimal log likelihood, and the likelihood ratio of this locus. The total number and order of loci match with the input genotype data.The workflow described above is for demonstration purposes only. For real data analysis, the structure inference using
qpaswould include many fewer loci than a selection scan usingselscan, which would require multiple millions of loci. What’s shown above, a scan of allele frequencies produced directly from the structure analysis, would be inadequate.Selection study
If selection study is the goal, the first step should be to obtain a full genome dataset of high quality. Then a subset of this data, ~100 Kbp unlinked markers, should be used for the structure analysis with
qpas. After that, the-fqand-qioptions should be used forqpasto produce admixture-corrected allele frequencies for the full genome dataset. Finally, it should be possible to proceed to the selection study with full genome allele frequencies.For faster analysis and smaller memory footprints, we recommend performing the last two steps in parallel. For example, suppose we have the full genome dataset
HGDP.ped, we have split it into three pieces, each containing 1/3 of the markers, and the files are namedHGDP1.ped,HGDP2.ped, andHGDP3.ped.We first down-sample the full dataset to perform structure analysis and then infer the component covariances.
We then produce admixture-corrected allele frequencies and perform the selection scan separately, possibly in parallel.
The file
scan_all.txtcontains selection results for the full dataset with markers following the order defined in HGDP.ped.License
Copyright (c) 2015-2020 Jade Cheng
All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
Neither the name of Jade Cheng nor the names of her contributors may be used to endorse or promote products derived from this software without specific prior written permission.
This software is provided by the copyright holders and contributors “as is” and any express or implied warranties, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose are disclaimed. in no event shall Jade Cheng be liable for any direct, indirect, incidental, special, exemplary, or consequential damages (including, but not limited to, procurement of substitute goods or services; loss of use, data, or profits; or business interruption) however caused and on any theory of liability, whether in contract, strict liability, or tort (including negligence or otherwise) arising in any way out of the use of this software, even if advised of the possibility of such damage.