Abstract
Conducting genome assembly at scale remains a challenge owing to the intense computational and memory requirements of the problem, coupled with complexities inherent in existing parallel tools: data movement, complex data structures, unstructured memory accesses, and repeated I/O operations. To this end, we have developed PaKman, a fully distributed tool that tackles the assembly of large genomes by combining a novel data structure (PaK-Graph) with algorithmic strategies that simplify the communication and I/O footprint of the assembly process. The algorithm deviates from state-of-the-art de Bruijn graph-based methods and offers a new perspective on the assembly problem by incorporating: i) a novel distributed-memory data structure that enables contig enumeration with minimal coordination; and ii) a novel contig generation algorithm with simplified I/O and communication patterns. Our results demonstrate that PaKman achieves near-linear speedups on up to 16K cores (the largest scale tested) on the NERSC Cori supercomputer; performs better than or comparably to other state-of-the-art distributed-memory and shared-memory tools while delivering comparable (if not better) assembly quality; and significantly reduces time to solution.
Exploratory License
Not eligible for exploratory license
Market Sector
Data Sciences