pkgsrc-Changes-HG archive


[pkgsrc/trunk]: pkgsrc/biology/miniasm biology/miniasm: add miniasm 0.3



details:   https://anonhg.NetBSD.org/pkgsrc/rev/f78a4c30663c
branches:  trunk
changeset: 453267:f78a4c30663c
user:      brook <brook%pkgsrc.org@localhost>
date:      Wed May 26 18:44:44 2021 +0000

description:
biology/miniasm: add miniasm 0.3

Miniasm is a very fast OLC-based *de novo* assembler for noisy long
reads. It takes all-vs-all read self-mappings (typically by minimap)
as input and outputs an assembly graph in the GFA format. Unlike
mainstream assemblers, miniasm does not have a consensus step. It
simply concatenates pieces of read sequences to generate the final
unitig sequences. Thus the per-base error rate is similar to the raw
input reads.
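
Because the output is a GFA graph rather than FASTA, a small conversion
step is often needed before other tools can read the unitigs. The
following is a minimal sketch, not part of miniasm or of this package.
It relies only on the GFA convention that segment lines have the form
`S<TAB>name<TAB>sequence`; the function names and the segment names in
the example are hypothetical.

```python
def gfa_to_records(gfa_lines):
    """Yield (name, sequence) for each segment (S) line of a GFA stream.

    Segments whose sequence is "*" (sequence not stored) are skipped.
    """
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        # GFA segment line: S <name> <sequence> [optional tags ...]
        if len(fields) >= 3 and fields[0] == "S" and fields[2] != "*":
            yield fields[1], fields[2]


def format_fasta(records, width=60):
    """Render (name, sequence) pairs as FASTA text wrapped at `width`."""
    out = []
    for name, seq in records:
        out.append(f">{name}")
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Hypothetical three-line GFA fragment, for illustration only.
    example = [
        "S\tutg1\tACGTACGT\tLN:i:8\n",
        "L\tutg1\t+\tutg2\t-\t0M\n",
        "S\tutg2\t*\n",
    ]
    print(format_fasta(gfa_to_records(example)), end="")
```

To use it on a real file, pass an open file handle as `gfa_lines`.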

So far miniasm is at an early stage of development. It has only been
tested on a dozen PacBio and Oxford Nanopore (ONT) bacterial data
sets. Including the mapping step, it takes about 3 minutes to assemble
a bacterial genome. Under the default setting, miniasm assembles 9 out
of 12 PacBio datasets and 3 out of 4 ONT datasets into a single
contig. The 12 PacBio data sets are [PacBio E. coli
sample][PB-151103], [ERS473430][ERS473430], [ERS544009][ERS544009],
[ERS554120][ERS554120], [ERS605484][ERS605484],
[ERS617393][ERS617393], [ERS646601][ERS646601],
[ERS659581][ERS659581], [ERS670327][ERS670327],
[ERS685285][ERS685285], [ERS743109][ERS743109] and a deprecated PacBio
E. coli data set. ONT data were acquired from the Loman Lab.

For a *C. elegans* PacBio data set (only 40X is used, not the whole
dataset), miniasm finishes the assembly, including read overlapping,
in ~10 minutes with 16 CPUs. The total assembly size is 105Mb; the N50
is 1.94Mb. In comparison, HGAP3 produces a 104Mb assembly with an N50
of 1.61Mb. A dotter plot of the miniasm assembly (on the X axis)
against the HGAP3 assembly (on Y) shows that they are broadly
comparable. Of course, the HGAP3 consensus sequences are much more
accurate. In addition, on the whole data set (assembled in ~30 min),
the miniasm N50 is reduced to 1.79Mb. Miniasm still needs
improvements.
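
For readers unfamiliar with the N50 figures quoted above, the statistic
can be computed from a list of contig lengths as sketched below. This
uses the common definition (the length of the contig at which the
running total, taken from longest to shortest, first reaches half of
the assembly size); the function name is hypothetical and the code is
not taken from miniasm.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length
    of the contig at which the cumulative sum first reaches at least
    half of the total assembly size.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contig lengths given")


if __name__ == "__main__":
    # Toy lengths for illustration; total is 20, half is 10.
    print(n50([2, 3, 4, 5, 6]))
```

An assembly split into more, shorter pieces yields a lower N50, which
is why the value serves as a rough measure of contiguity.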

Miniasm confirms that at least for high-coverage bacterial genomes, it
is possible to generate long contigs from raw PacBio or ONT reads
without error correction. It also shows that minimap can be used as a
read overlapper, even though it is probably not as sensitive as the
more sophisticated overlappers such as MHAP and DALIGNER. Coupled with
long-read error correctors and consensus tools, miniasm may also be
useful to produce high-quality assemblies.

## Algorithm Overview

1. Crude read selection. For each read, find the longest contiguous region
   covered by three good mappings. Get an approximate estimate of read
   coverage.

2. Fine read selection. Use the coverage information to find the good regions
   again but with more stringent thresholds. Discard contained reads.

3. Generate a string graph. Prune tips, drop weak overlaps and
   collapse short bubbles. These procedures are similar to those
   implemented in short-read assemblers.

4. Merge unambiguous overlaps to produce unitig sequences.
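
Step 4 can be illustrated with a deliberately simplified sketch. It is
not miniasm's implementation: a real string graph is bidirected and
carries strand and overlap lengths, all of which are ignored here. The
sketch assumes a plain directed graph given as (source, target) pairs
and treats an edge as unambiguous when it is the only edge leaving its
source and the only edge entering its target. The function name and the
read names are hypothetical.

```python
from collections import defaultdict


def unitig_paths(edges):
    """Return maximal non-branching paths of a directed graph.

    `edges` is an iterable of (src, dst) pairs. Each returned path is a
    list of nodes that could be merged into one unitig.
    """
    succ = defaultdict(list)
    pred = defaultdict(list)
    nodes = set()
    for a, b in edges:
        succ[a].append(b)
        pred[b].append(a)
        nodes.update((a, b))

    def unambiguous(a, b):
        # a -> b is mergeable only if it is a's sole out-edge
        # and b's sole in-edge.
        return len(succ[a]) == 1 and len(pred[b]) == 1

    def extend(path, visited):
        # Follow unambiguous edges until a branch or a visited node.
        cur = path[-1]
        while len(succ[cur]) == 1:
            nxt = succ[cur][0]
            if nxt in visited or not unambiguous(cur, nxt):
                break
            path.append(nxt)
            visited.add(nxt)
            cur = nxt
        return path

    paths, visited = [], set()
    # First pass: start at nodes that cannot be merged into a predecessor.
    for node in sorted(nodes):
        if len(pred[node]) == 1 and unambiguous(pred[node][0], node):
            continue
        visited.add(node)
        paths.append(extend([node], visited))
    # Second pass: anything left lies on an isolated cycle.
    for node in sorted(nodes):
        if node not in visited:
            visited.add(node)
            paths.append(extend([node], visited))
    return paths


if __name__ == "__main__":
    # r1 -> r2 -> r3, then r3 branches to r4 and r5.
    print(unitig_paths([("r1", "r2"), ("r2", "r3"),
                        ("r3", "r4"), ("r3", "r5")]))
```

In this toy graph the chain r1, r2, r3 forms one path, while r4 and r5
each remain on their own because the edges out of r3 are ambiguous.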

## Limitations

1. Consensus base quality is similar to that of the input reads (may be fixed with a
   consensus tool).

2. Only tested on a dozen high-coverage PacBio/ONT data sets (more testing
   needed).

3. Prone to collapsing repeats or segmental duplications longer than input reads
   (hard to fix without error correction).

diffstat:

 biology/miniasm/DESCR    |  22 ++++++++++++++++++++++
 biology/miniasm/Makefile |  30 ++++++++++++++++++++++++++++++
 biology/miniasm/PLIST    |   6 ++++++
 biology/miniasm/distinfo |   6 ++++++
 4 files changed, 64 insertions(+), 0 deletions(-)

diffs (80 lines):

diff -r 2e29702512a2 -r f78a4c30663c biology/miniasm/DESCR
--- /dev/null   Thu Jan 01 00:00:00 1970 +0000
+++ b/biology/miniasm/DESCR     Wed May 26 18:44:44 2021 +0000
@@ -0,0 +1,22 @@
+Miniasm is a very fast OLC-based *de novo* assembler for noisy long
+reads. It takes all-vs-all read self-mappings (typically by minimap)
+as input and outputs an assembly graph in the GFA format. Different
+from mainstream assemblers, miniasm does not have a consensus step. It
+simply concatenates pieces of read sequences to generate the final
+unitig sequences. Thus the per-base error rate is similar to the raw
+input reads.
+
+So far miniasm is in early development stage. It has only been tested
+on a dozen of PacBio and Oxford Nanopore (ONT) bacterial data
+sets. Including the mapping step, it takes about 3 minutes to assemble
+a bacterial genome. Under the default setting, miniasm assembles 9 out
+of 12 PacBio datasets and 3 out of 4 ONT datasets into a single
+contig.
+
+Miniasm confirms that at least for high-coverage bacterial genomes, it
+is possible to generate long contigs from raw PacBio or ONT reads
+without error correction. It also shows that minimap can be used as a
+read overlapper, even though it is probably not as sensitive as the
+more sophisticated overlapers such as MHAP and DALIGNER.  Coupled with
+long-read error correctors and consensus tools, miniasm may also be
+useful to produce high-quality assemblies.
diff -r 2e29702512a2 -r f78a4c30663c biology/miniasm/Makefile
--- /dev/null   Thu Jan 01 00:00:00 1970 +0000
+++ b/biology/miniasm/Makefile  Wed May 26 18:44:44 2021 +0000
@@ -0,0 +1,30 @@
+# $NetBSD: Makefile,v 1.1 2021/05/26 18:44:44 brook Exp $
+
+GITHUB_PROJECT=        miniasm
+GITHUB_TAG=    refs/tags/v0.3
+DISTNAME=      v0.3
+PKGNAME=       ${GITHUB_PROJECT}-${DISTNAME:S,^v,,}
+CATEGORIES=    biology
+MASTER_SITES=  ${MASTER_SITE_GITHUB:=lh3/}
+DIST_SUBDIR=   ${GITHUB_PROJECT}
+
+MAINTAINER=    pkgsrc-users%NetBSD.org@localhost
+HOMEPAGE=      https://github.com/lh3/miniasm/
+COMMENT=       OLC-based de novo assembler for long reads
+LICENSE=       mit
+
+WRKSRC=                ${WRKDIR}/miniasm-0.3
+USE_TOOLS+=    gmake
+USE_LANGUAGES+=        c
+
+INSTALLATION_DIRS+=    bin ${PKGMANDIR}/man1 share/doc/miniasm
+
+do-install:
+       ${INSTALL} ${WRKSRC}/miniasm ${DESTDIR}${PREFIX}/bin
+       ${INSTALL} ${WRKSRC}/minidot ${DESTDIR}${PREFIX}/bin
+       ${INSTALL_DATA} ${WRKSRC}/miniasm.1 ${DESTDIR}${PREFIX}/${PKGMANDIR}/man1
+       ${INSTALL_DATA} ${WRKSRC}/PAF.md ${DESTDIR}${PREFIX}/share/doc/miniasm
+       ${INSTALL_DATA} ${WRKSRC}/README.md ${DESTDIR}${PREFIX}/share/doc/miniasm
+
+.include "../../devel/zlib/buildlink3.mk"
+.include "../../mk/bsd.pkg.mk"
diff -r 2e29702512a2 -r f78a4c30663c biology/miniasm/PLIST
--- /dev/null   Thu Jan 01 00:00:00 1970 +0000
+++ b/biology/miniasm/PLIST     Wed May 26 18:44:44 2021 +0000
@@ -0,0 +1,6 @@
+@comment $NetBSD: PLIST,v 1.1 2021/05/26 18:44:44 brook Exp $
+bin/miniasm
+bin/minidot
+man/man1/miniasm.1
+share/doc/miniasm/PAF.md
+share/doc/miniasm/README.md
diff -r 2e29702512a2 -r f78a4c30663c biology/miniasm/distinfo
--- /dev/null   Thu Jan 01 00:00:00 1970 +0000
+++ b/biology/miniasm/distinfo  Wed May 26 18:44:44 2021 +0000
@@ -0,0 +1,6 @@
+$NetBSD: distinfo,v 1.1 2021/05/26 18:44:44 brook Exp $
+
+SHA1 (miniasm/v0.3.tar.gz) = 11aa9dcfdf3d304fbc31d4be0246bb8d2108a70d
+RMD160 (miniasm/v0.3.tar.gz) = f0352ea9952704a1b6df6eda1be5d9631c822a07
+SHA512 (miniasm/v0.3.tar.gz) = e5f622e079283d69bb878cbb9768f7522d279f89fc11f86a2b1fdade9e09e681e742d9e4be46bed4c864e4fef3d0d6760348c85cf1d5e029a36d96c45f885160
+Size (miniasm/v0.3.tar.gz) = 204805 bytes


