Yleaf-pipelines: Pipeline-optimized Y-chromosomal haplogroup inference from NGS data

Note: This is a fork of the original Yleaf software with customizations to facilitate integration with Nextflow and other bioinformatics pipelines.

Original authors: Arwin Ralf, Diego Montiel Gonzalez, Kaiyin Zhong and Manfred Kayser

Pipeline adaptation: Alaina Hardie (@trianglegrrl, Toronto, ON, Canada)

Department of Genetic Identification

Erasmus MC University Medical Centre Rotterdam, The Netherlands

Pipeline adaptations

This fork of Yleaf includes customizations that make it easier to integrate into Nextflow and other bioinformatics pipelines, including:

Command-line parameters to specify custom reference genomes (-fg, -yr)
Other pipeline-friendly modifications while maintaining the original functionality

These adaptations aim to streamline the integration of Yleaf into automated workflows while preserving the core functionality and accuracy of the original software.

Future plans

In the future, we will attempt to maintain all of the core functionality of the original Yleaf, with optimizations for use in Nextflow pipelines.

We endeavour to stay in release sync with the official Yleaf repo. As such, our first release is 3.3.0, which includes the additional command-line parameters described above that are required for basic smooth operation within pipelines.

Requirements

Operating system: Linux only.
Internet connection: required on the first run to download the reference genome. Alternatively, you can configure your own references.
Data storage: we recommend more than 8 GB of free storage for installation.

Installation

The easiest way to get Yleaf up and running is by using a conda environment.

# first clone this repository to get the environment_yleaf.yaml
git clone https://github.com/genid/Yleaf.git
cd Yleaf
# create the conda environment from the .yaml file; the environment will be called yleaf
conda env create --file environment_yleaf.yaml
# activate the environment
conda activate yleaf
# pip install the cloned yleaf into your environment. Using the -e flag allows you to modify the config file in your cloned folder
pip install -e .

# verify that Yleaf is installed correctly. You can call this command from any directory on your system
Yleaf -h

or manually install everything

# install python and libraries
apt-get install python3.6
pip3 install pandas
pip3 install numpy
# install minimap2 for aligning FASTQ files
sudo apt-get install minimap2
# install SAMtools
wget https://github.com/samtools/samtools/releases/download/1.4.1/samtools-1.4.1.tar.bz2 -O samtools.tar.bz2
tar -xjvf samtools.tar.bz2
cd samtools-1.4.1/
./configure
make
make install
# clone the yleaf repository
git clone https://github.com/genid/Yleaf.git
# pip install the yleaf repository
cd Yleaf
pip install -e .

# verify that Yleaf is installed correctly. You can call this command from any directory on your system
Yleaf -h

After installation you can open the yleaf/config.txt file and add custom paths for the files listed there. Yleaf then uses the files at those paths instead of downloading them on the first run, or downloads them to the location you provide. This allows you to use a custom reference if you want. Please keep in mind that custom reference files might cause other issues or conflict with already existing data files. Positions are based on either hg38 or hg19.

Usage and examples

The following are minimal working examples of how to use Yleaf with different input files. Additional options let you tune how strict Yleaf is, report private mutations, and draw a graph showing the position of the predicted haplogroups of all your samples in the haplogroup tree.

Note: In version 3.0 we switched to using YFull (v10.01) for the underlying tree structure of the haplogroups. This also means that predictions are a bit different compared to earlier versions.

Yleaf: FASTQ (raw reads)

Yleaf -fastq raw_reads.fastq -o fastq_output --reference_genome hg38

Yleaf: BAM or CRAM format

Yleaf -bam file.bam -o bam_output --reference_genome hg19
Yleaf -cram file.cram -o cram_output --reference_genome hg38

Drawing predicted haplogroups in a tree and showing all private mutations

Yleaf -bam file.bam -o bam_output --reference_genome hg19 -dh -p
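
When Yleaf is called from a pipeline step, the input flag has to match the file type. The following is a minimal Python sketch, not part of Yleaf itself, that picks -fastq, -bam, or -cram from the file extension and assembles the argument list from the flags shown above. The function name build_yleaf_command and the extension-to-flag mapping are assumptions made for illustration.

```python
from pathlib import Path

# Assumed mapping from file extension to the Yleaf input flag shown above.
INPUT_FLAGS = {
    ".fastq": "-fastq",
    ".bam": "-bam",
    ".cram": "-cram",
}


def build_yleaf_command(input_file, output_dir, reference_genome,
                        draw_haplogroups=False, private_mutations=False):
    """Return the Yleaf argument list for one sample (hypothetical helper)."""
    if reference_genome not in ("hg19", "hg38"):
        raise ValueError("reference_genome must be 'hg19' or 'hg38'")
    suffix = Path(input_file).suffix.lower()
    if suffix not in INPUT_FLAGS:
        raise ValueError(f"unsupported input type: {input_file}")
    command = ["Yleaf", INPUT_FLAGS[suffix], str(input_file),
               "-o", str(output_dir),
               "--reference_genome", reference_genome]
    if draw_haplogroups:
        command.append("-dh")  # draw predicted haplogroups in the tree
    if private_mutations:
        command.append("-p")   # report private mutations
    return command


if __name__ == "__main__":
    print(" ".join(build_yleaf_command("file.bam", "bam_output", "hg19",
                                       draw_haplogroups=True,
                                       private_mutations=True)))
```

The returned list can be passed directly to subprocess.run(command, check=True), which avoids shell quoting problems with unusual file names.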

Using custom reference genomes

You can specify custom reference genomes instead of using the default downloaded ones:

Yleaf -bam file.bam -o bam_output -rg hg19 -fg /path/to/full_genome.fa -yr /path/to/chrY.fa

Where:

-fg or --full_genome_reference specifies the path to a custom full genome reference file
-yr or --y_chromosome_reference specifies the path to a custom Y chromosome reference file

Both references must be in FASTA format (.fa, .fasta, or .fna).
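
Because both references must be FASTA files, a pipeline can check the paths before launching Yleaf. The following is a minimal sketch built around a hypothetical helper, custom_reference_args; only the -fg and -yr flags and the accepted extensions come from the text above. It checks the file extension only, not the file contents or whether the file exists, and the case-insensitive match is an assumption.

```python
from pathlib import Path

# Extensions listed above as accepted FASTA formats.
FASTA_EXTENSIONS = {".fa", ".fasta", ".fna"}


def custom_reference_args(full_genome, y_chromosome):
    """Return ['-fg', ..., '-yr', ...] after checking the file extensions."""
    args = []
    for flag, path in (("-fg", full_genome), ("-yr", y_chromosome)):
        # Assumption: extensions are compared case-insensitively.
        if Path(path).suffix.lower() not in FASTA_EXTENSIONS:
            raise ValueError(f"{path} does not look like a FASTA file")
        args += [flag, str(path)]
    return args


if __name__ == "__main__":
    base = ["Yleaf", "-bam", "file.bam", "-o", "bam_output", "-rg", "hg19"]
    extra = custom_reference_args("/path/to/full_genome.fa", "/path/to/chrY.fa")
    print(" ".join(base + extra))
```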

Extracting Y chromosome from a reference genome

If you have a full genome reference but need to extract just the Y chromosome, use the included extraction tool:

python -m yleaf.extract_y_chromosome -i /path/to/full_genome.fa -o /path/to/output_chrY.fa
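
To illustrate what such an extraction involves, here is a minimal sketch that copies one record out of a multi-record FASTA file. It is not the implementation of yleaf.extract_y_chromosome; the function name extract_y_record and the assumption that the Y contig is named chrY or Y are illustrative only.

```python
def extract_y_record(fasta_lines, y_names=("chrY", "Y")):
    """Yield the header and sequence lines of the first Y record found."""
    in_y = False
    for line in fasta_lines:
        if line.startswith(">"):
            if in_y:
                break  # the next header marks the end of the Y record
            # The contig name is the first whitespace-delimited token after '>'.
            fields = line[1:].split()
            in_y = bool(fields) and fields[0] in y_names
        if in_y:
            yield line


if __name__ == "__main__":
    demo = [">chrX\n", "ACGT\n",
            ">chrY some description\n", "GGCC\n", "TTAA\n",
            ">chrM\n", "AAAA\n"]
    print("".join(extract_y_record(demo)), end="")
```

Since the function accepts any iterable of lines, an open file handle can be passed in and the result written out with writelines, so the full genome is never held in memory.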

Additional information

For a more comprehensive manual please have a look at the yleaf_manual.

If you have a bug to report or a question about installation, consider sending an email to a.ralf at erasmusmc.nl or creating an issue on GitHub.

References and Supporting Information

A. Ralf, et al., Yleaf: software for human Y-chromosomal haplogroup inference from next generation sequencing data (2018).

https://academic.oup.com/mbe/article/35/5/1291/4922696

Acknowledgments

All credit for the original Yleaf software and methodology goes to Arwin Ralf, Diego Montiel Gonzalez, Kaiyin Zhong, Manfred Kayser and the Department of Genetic Identification at Erasmus MC University Medical Centre Rotterdam. This fork builds upon their excellent work to enhance pipeline compatibility.