When fighting the global spread of a pandemic or working to defeat cancer, faster time-to-discovery saves lives. For this critical bioscience work, scientists have employed single-cell RNA sequencing to assemble entire genomes and reveal genetic variants. Modern gene sequencing technology has grown exponentially in the last decade, from studying hundreds of cells to millions of cells. Yong Tian at MemVerge explains how Big Memory technology helps to accelerate single-cell genomic sequencing.
Extract:
‘Accelerating Single-Cell Genomic Sequencing with Big Memory’
The Extra-Exponential Growth of Cell Data
When fighting the global spread of a pandemic or working to defeat cancer, faster time-to-discovery saves lives. For this critical bioscience work, scientists have employed single-cell RNA sequencing to assemble entire genomes and reveal genetic variants. Modern gene sequencing technology has grown exponentially in the last decade, from studying hundreds of cells to millions of cells. At the same time, the modality of data has increased exponentially to improve the profiling of different aspects of a cell, including its genome, transcriptome and epigenome, and the spatial organisation of these -omes. The emergence of multi-modal studies of millions of cells has resulted in the extra-exponential growth of cell data and a DGM (data is greater than memory) problem.
In figure 1 (found within the downloadable PDF), the charts on the top and bottom left show that in 2010, studies were based on one hundred cells, and that by 2020 this had increased by four orders of magnitude to one million cells per study. On the top right are examples of how the modalities of data have increased.
A Problem Emerges: Storage Becomes a Bottleneck When Data is Greater than Memory
For the last 50 years, computing has been dominated by a model that uses storage as ‘virtual’ memory for data that cannot fit into DRAM. The R error message in the lower right-hand corner of figure 1 is an example of how the extra-exponential growth of data is overwhelming this computing model and the tools built on it, such as R.
Single-cell sequencing jobs are multi-stage analytic pipelines using very large matrices that need to be loaded from storage into memory for each stage. With terabyte data sets, loading data from storage into memory takes a long time. And when all of the cell data doesn’t fit in memory, code execution becomes IO-intensive as data is swapped to and from storage.
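To make that stage-boundary IO concrete, here is a minimal Python/NumPy sketch of such a pipeline. The file names, matrix size and the normalise/reduce stages are illustrative assumptions, not any particular sequencing toolchain; the point is that each stage reloads the full cell-by-gene matrix from storage and writes its result back, which is exactly where terabyte data sets make the job IO-bound.

    import numpy as np

    MATRIX_PATH = "cells_by_genes.npy"      # hypothetical on-disk count matrix
    NORMALISED_PATH = "normalised.npy"
    REDUCED_PATH = "reduced.npy"

    # Small stand-in matrix so the sketch runs end to end; real studies reach
    # millions of cells and the matrices reach terabytes.
    np.save(MATRIX_PATH, np.random.poisson(2.0, size=(1000, 200)).astype(float))

    def stage_normalise():
        counts = np.load(MATRIX_PATH)                 # full load from storage
        scaled = counts / (counts.sum(axis=1, keepdims=True) + 1e-9)
        np.save(NORMALISED_PATH, scaled)              # full write back to storage

    def stage_reduce():
        scaled = np.load(NORMALISED_PATH)             # reload the same data again
        top = np.argsort(scaled.var(axis=0))[-50:]    # crude stand-in for feature selection
        np.save(REDUCED_PATH, scaled[:, top])

    stage_normalise()
    stage_reduce()

With a small matrix the np.load and np.save calls are negligible; with terabyte-scale matrices those stage boundaries dominate the runtime of the whole pipeline.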
Even with high-performance solid-state disks, repeatedly loading data from storage and executing application code with IO to storage is 1,000x slower than working in memory. The traditional model and its storage IO have become a bottleneck in many types of multi-stage analytic jobs with massive data sets.
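A quick back-of-envelope calculation in Python illustrates that 1,000x gap; the latency figures and access count below are rough assumptions chosen for illustration, not measurements from the article.

    # Assumed order-of-magnitude access latencies: ~100 nanoseconds for DRAM,
    # ~100 microseconds for an NVMe SSD read, i.e. roughly 1,000x slower.
    DRAM_LATENCY_S = 100e-9
    SSD_LATENCY_S = 100e-6
    ACCESSES = 1_000_000_000    # hypothetical number of out-of-core accesses in a job

    print(f"SSD is ~{SSD_LATENCY_S / DRAM_LATENCY_S:.0f}x slower per access")
    print(f"Serviced from DRAM: ~{ACCESSES * DRAM_LATENCY_S / 60:.1f} minutes")
    print(f"Serviced from SSD:  ~{ACCESSES * SSD_LATENCY_S / 3600:.1f} hours")

Under these assumptions, a job that finishes in under two minutes when its working set stays in memory stretches to more than a day once every access has to be swapped in from storage, which is why keeping whole data sets memory-resident matters.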
Click the download button below to read the complete version of ‘Accelerating Single-Cell Genomic Sequencing with Big Memory’ by Yong Tian at MemVerge