But it may be very slow and memory-consuming, for large case data
without control data. It’s OK to use more than one “case” file:
dnarrange will only output groups that include reads from all case
files.
dnarrange tries to flip the reads’ strands so all the reads in a
group are on the same strand. A - at the end of a read name
indicates that it’s flipped, + unflipped.
Each group is given a name, such as group5-28. The first number
(5) is a serial number for each group, and the second number (28)
is the number of reads in the group.
Low coverage
By default, dnarrange finds rearrangements supported by at least 2
reads. To find also rearrangements with just 1 read, use -s1:
dnarrange -s1 case.maf > groups0.maf
You can get an intermediate level of leniency by using -c0 instead
of -s1: this finds groups with at least 2 reads, but does not
require each consecutive pair of rearranged fragments to be supported
by 2 reads. If you use either -s1 or -c0, it may be useful to
apply “strict control filtering” with -f0:
-f0 makes it discard any case read with any two rearranged fragments
in common with any control read. The default is to discard case reads
whose “strongest” rearrangement type is shared with a control read,
where “strength” is defined by: inter-chromosome > inter-strand >
non-colinear > big gap.
Insertions of foreign DNA
You can find insertions of (e.g.) a virus genome into a host genome.
First, align the DNA reads to a combined “genome” with host and virus
chromosomes. Then do:
This will get DNA reads that align partly to the sequence called
VirusSeqName and partly to other sequences.
Step 3: Draw pictures of the groups
Draw a picture of each group, showing the read-to-genome alignments,
in a new directory group-pics:
last-multiplot groups.maf group-pics
last-multiplot has the same
options
as last-dotplot. In fact, last-multiplot uses last-dotplot, so
it requires the latter to be
installed (i.e. in your PATH).
This in turn requires the Python Imaging
Library to be installed.
A useful option is -a, to display files of genes, repeats, etc.
It shows at most 10 reads per group: you can change that with option -m.
Tips for viewing the pictures on a Mac: open the folder in Finder, and
View the pictures with “Cover Flow” (Command+4), or:
Select all the pictures (Command+A), and view them in slideshow
(Option+Spacebar) or “Quick Look” (Spacebar). Use the left and
right arrows to move between pictures. In slideshow, Spacebar
toggles “pause”/“play”.
Try to check for bad groups, e.g. rearrangements that look like
sequencing artifacts. It may be useful to require at least 3 (instead
of 2) reads per group:
dnarrange -s3 groups.maf > strict.maf
If the results are clear enough, you can stop here!
Step 4: Merge each group into a consensus sequence
This uses 3 input files: the read sequences (in fastq or fasta
format), myseq.par from the alignment step, and the groups. It
requires lamassemble to be installed (which in turn requires LAST
and MAFFT to be installed).
Then
re-align
the merged reads to the genome (maybe using e.g. -m20 or -m50 to
make it more sensitive-but-slow):
dnarrange may omit some consensus sequences, if it doesn’t consider
their alignments to be rearranged. If this is a problem, try reducing
dnarrange‘s minimum thresholds for rearrangement, e.g. with option
-r1.
Merging doesn’t always work well, especially if the reads have large
tandem duplications so it’s easy to merge the wrong parts of the
reads. Try to check this by comparing the merged and unmerged
pictures.
Step 5: Find the order and orientation of groups
A large rearrangement might include several groups of rearranged
reads. To understand it, we need to know the order and orientation of
the groups in the rearranged sequence:
dnarrange-link -g3,7,8,12 final.maf > linked.txt
This tells it to use groups 3, 7, 8, and 12. (If you don’t specify
any groups, it will use them all: ideally that would work, but the
groups may not all be perfectly accurate, complete, and
uniquely-linkable.)
You can give it groups before or after merging. It uses only the
topmost read in each group, so we need to ensure that the topmost read
represents the whole group, and covers all the group’s rearrangement
breakpoints. We can check this by looking at the pictures.
If necessary, we can hand-edit the input file. It begins with a
summary of each group, like this:
This shows how each read aligns to the genome. dnarrange-link uses
only the group names and topmost reads from the summary.
dnarrange-link may show a warning message like this:
WARNING: 4 equally-good ways of linking ends in chr8
In this case, it arbitrarily picks one of these ways to link the ends
in chr8 (or you can use -a to get them all).
The output has reconstructed chromosomes with names like der1,
der2. If a reconstructed chromosome has widely-separated
rearrangements, it’s broken into parts like der2a, der2b. You can
control this breaking with option -m (see dnarrange-link --help).
Step 6: Draw pictures of the rearranged chromosomes
last-multiplot linked.txt linked-pics
To make the pictures clearer, you may wish to:
Change the dnarrange-link-m parameter. Large values best show
the “big picture”, and small values magnify small-scale
rearrangements.
Draw a subset of derived sequences with option
-2:
Hand-edit linked.txt, to lengthen the topmost and bottommost
segments of a derived chromosome.
Public control files
You can use these control files
to discard common rearrangements. They were made with human reference
genome version hg38, so you must use the same reference! They were
“shrunk” with commands like:
You can give it genes in these formats: refGene.txt, refFlat.txt,
BED. You can make it find genes up to, say, 5000 base-pairs away
from any breakpoint like this:
-h, --help: show a help message, with default option values, and
exit.
-s N, --min-seqs=N: minimum query sequences per group
(default=2). A value of 0 tells it to not bother grouping: it
will simply find rearranged query sequences.
-c N, --min-cov=N: discard any query sequence that has a pair of
consecutive rearranged fragments shared by < N other queries. The
default depends on the -s option: when s>1, the default is 1, else
the default is 0.
-g BP, --min-gap=BP: minimum forward jump in the reference
sequence counted as a “big gap” (default=10000). The purpose of
this is to exclude small deletions.
-r BP, --min-rev=BP: minimum reverse jump in the reference
sequence counted as “non-colinear” (default=1000). The purpose of
this is to exclude small tandem duplications, which can be
overwhelmingly numerous. To include everything, use -r1.
-f N, --filter=N: discard case reads sharing any (0) or
“strongest” (1) rearrangements with control reads (default=1).
-d BP, --max-diff=BP: maximum query-length difference for shared
rearrangement (default=500).
-m PROB, --max-mismap=PROB: discard any alignment with mismap
probability > PROB (default=1).
--insert=NAME: find insertions of the sequence with this name.
--shrink: write the output in a compact format. This format can
be read by dnarrange.
-v, --verbose: show progress messages.
-w W, --width=W: line-wrap width of group summary lines
(default=79). 0 means no wrapping.
dnarrange-merge options
You can get the rearranged reads, without merging them, like this:
This may be useful if you wish to re-align the rearranged reads to the
genome more slowly-and-carefully.
dnarrange-merge also has options that it passes to lamassemble:
you can see them with dnarrange-merge --help, and they’re described
at the lamassemble site.
dnarrange
This is a method to find rearrangements in “long” DNA reads relative to a genome sequence. It can characterize changes such as chromosome shattering, gene conversion, virus insertion, and processed pseudogene insertion. For more details, please see: A pipeline for complete characterization of complex germline rearrangements from long DNA reads.
You can install dnarrange (together with all other software that it uses) from Bioconda or Debian Med.
Step 1: Align the reads to the genome
You can use this recipe.
Step 2: Find rearrangements
Like this:
The input files may be gzipped (
.gz).It’s OK to not use “control” files, or use them in a separate step:
But it may be very slow and memory-consuming, for large case data without control data. It’s OK to use more than one “case” file:
dnarrangewill only output groups that include reads from all case files.dnarrangetries to flip the reads’ strands so all the reads in a group are on the same strand. A-at the end of a read name indicates that it’s flipped,+unflipped.Each group is given a name, such as
group5-28. The first number (5) is a serial number for each group, and the second number (28) is the number of reads in the group.Low coverage
By default,
dnarrangefinds rearrangements supported by at least 2 reads. To find also rearrangements with just 1 read, use-s1:You can get an intermediate level of leniency by using
-c0instead of-s1: this finds groups with at least 2 reads, but does not require each consecutive pair of rearranged fragments to be supported by 2 reads. If you use either-s1or-c0, it may be useful to apply “strict control filtering” with-f0:-f0makes it discard any case read with any two rearranged fragments in common with any control read. The default is to discard case reads whose “strongest” rearrangement type is shared with a control read, where “strength” is defined by: inter-chromosome > inter-strand > non-colinear > big gap.Insertions of foreign DNA
You can find insertions of (e.g.) a virus genome into a host genome. First, align the DNA reads to a combined “genome” with host and virus chromosomes. Then do:
This will get DNA reads that align partly to the sequence called
VirusSeqNameand partly to other sequences.Step 3: Draw pictures of the groups
Draw a picture of each group, showing the read-to-genome alignments, in a new directory
group-pics:last-multiplothas the same options aslast-dotplot. In fact,last-multiplotuseslast-dotplot, so it requires the latter to be installed (i.e. in your PATH). This in turn requires the Python Imaging Library to be installed.-a, to display files of genes, repeats, etc.-m.Tips for viewing the pictures on a Mac: open the folder in Finder, and
Try to check for bad groups, e.g. rearrangements that look like sequencing artifacts. It may be useful to require at least 3 (instead of 2) reads per group:
If the results are clear enough, you can stop here!
Step 4: Merge each group into a consensus sequence
This uses 3 input files: the read sequences (in fastq or fasta format),
myseq.parfrom the alignment step, and the groups. It requires lamassemble to be installed (which in turn requires LAST and MAFFT to be installed).Then re-align the merged reads to the genome (maybe using e.g.
-m20or-m50to make it more sensitive-but-slow):And draw pictures:
dnarrangemay omit some consensus sequences, if it doesn’t consider their alignments to be rearranged. If this is a problem, try reducingdnarrange‘s minimum thresholds for rearrangement, e.g. with option-r1.Merging doesn’t always work well, especially if the reads have large tandem duplications so it’s easy to merge the wrong parts of the reads. Try to check this by comparing the merged and unmerged pictures.
Step 5: Find the order and orientation of groups
A large rearrangement might include several groups of rearranged reads. To understand it, we need to know the order and orientation of the groups in the rearranged sequence:
This tells it to use groups 3, 7, 8, and 12. (If you don’t specify any groups, it will use them all: ideally that would work, but the groups may not all be perfectly accurate, complete, and uniquely-linkable.)
You can give it groups before or after merging. It uses only the topmost read in each group, so we need to ensure that the topmost read represents the whole group, and covers all the group’s rearrangement breakpoints. We can check this by looking at the pictures.
If necessary, we can hand-edit the input file. It begins with a summary of each group, like this:
This shows how each read aligns to the genome.
dnarrange-linkuses only the group names and topmost reads from the summary.dnarrange-linkmay show a warning message like this:In this case, it arbitrarily picks one of these ways to link the ends in chr8 (or you can use
-ato get them all).The output has reconstructed chromosomes with names like
der1,der2. If a reconstructed chromosome has widely-separated rearrangements, it’s broken into parts likeder2a,der2b. You can control this breaking with option-m(seednarrange-link --help).Step 6: Draw pictures of the rearranged chromosomes
To make the pictures clearer, you may wish to:
Change the
dnarrange-link-mparameter. Large values best show the “big picture”, and small values magnify small-scale rearrangements.Draw a subset of derived sequences with option -2:
Hand-edit
linked.txt, to lengthen the topmost and bottommost segments of a derived chromosome.Public control files
You can use these control files to discard common rearrangements. They were made with human reference genome version
hg38, so you must use the same reference! They were “shrunk” with commands like:They won’t be fully effective if you use them with option
-g< 100 (or non-default-m).dnarrange-genesThis is a simple script to find genes at or near rearrangement breakpoints:
You can give it genes in these formats: refGene.txt, refFlat.txt, BED. You can make it find genes up to, say, 5000 base-pairs away from any breakpoint like this:
You can also get groups near genes, with option
-o1:dnarrangeoptions-h,--help: show a help message, with default option values, and exit.-s N,--min-seqs=N: minimum query sequences per group (default=2). A value of0tells it to not bother grouping: it will simply find rearranged query sequences.-c N,--min-cov=N: discard any query sequence that has a pair of consecutive rearranged fragments shared by < N other queries. The default depends on the-soption: when s>1, the default is 1, else the default is 0.-t LETTERS,--types=LETTERS: rearrangement types: C=inter-chromosome, S=inter-strand, N=non-colinear, G=big gap (default=CSNG).-g BP,--min-gap=BP: minimum forward jump in the reference sequence counted as a “big gap” (default=10000). The purpose of this is to exclude small deletions.-r BP,--min-rev=BP: minimum reverse jump in the reference sequence counted as “non-colinear” (default=1000). The purpose of this is to exclude small tandem duplications, which can be overwhelmingly numerous. To include everything, use-r1.-f N,--filter=N: discard case reads sharing any (0) or “strongest” (1) rearrangements with control reads (default=1).-d BP,--max-diff=BP: maximum query-length difference for shared rearrangement (default=500).-m PROB,--max-mismap=PROB: discard any alignment with mismap probability > PROB (default=1).--insert=NAME: find insertions of the sequence with this name.--shrink: write the output in a compact format. This format can be read bydnarrange.-v,--verbose: show progress messages.-w W,--width=W: line-wrap width of group summary lines (default=79). 0 means no wrapping.dnarrange-mergeoptionsYou can get the rearranged reads, without merging them, like this:
This may be useful if you wish to re-align the rearranged reads to the genome more slowly-and-carefully.
dnarrange-mergealso has options that it passes tolamassemble: you can see them withdnarrange-merge --help, and they’re described at the lamassemble site.