[color-box color="white"] Confluence of Data, Computing, and Storage
Coding for Machine Learning. [/color-box]
In traditional information processing systems, different system components are often developed independently of one another. While this viewpoint offers useful abstractions, the resulting inefficiencies are increasingly costly.
In this research we explore a fundamentally new viewpoint: we design system components (codes and algorithms for storage) in a data-aware way. We are actively exploring several applications of this philosophy, including fault-tolerant memories and context-aware coding for machine learning.
Here is one recent success story.
Overview. Error-correcting codes and system-level fault-tolerance techniques have historically been developed as separate abstractions in the hardware/software stack. Conventional codes are designed to be agnostic to the content of their data payloads, while fault-tolerance techniques do not leverage knowledge of the underlying code construction to recover from uncorrectable errors. We have created software-defined error-correcting codes (SWD-ECC), a new error-correction technique that co-designs the ECC scheme with system-level fault-tolerance mechanisms to enable heuristic recovery from previously uncorrectable errors.
The key idea in SWD-ECC is that side information can be used to heuristically recover from detected-but-uncorrectable errors (DUEs) by estimating the original uncorrupted message. This lets us push past the traditional boundaries of ECCs: when an error is detected but cannot be corrected, SWD-ECC decodes probabilistically using side information. Although our studies are tailored to memory, the ideas can be applied to storage, communications, and information theory as well. The approach can benefit computing platforms ranging from embedded and mobile devices to cloud and supercomputing systems.
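To make the notion of a DUE concrete, here is a minimal sketch in Python. It is not the SWD-ECC implementation: it assumes a toy 8-bit extended Hamming code as the SEC-DED code, and it decodes by brute-force search over the codebook rather than with a syndrome decoder. The names `encode`, `decode`, and `CODEBOOK` are hypothetical. The point it illustrates is that a DUE does not leave the decoder empty-handed; it leaves a short list of equally close candidate messages.

```python
from itertools import product


def encode(data):
    """Encode 4 data bits as an 8-bit extended Hamming (SEC-DED) codeword."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    word = [p1, p2, d1, p3, d2, d3, d4]
    overall = sum(word) % 2  # overall parity bit extends distance from 3 to 4
    return tuple(word + [overall])


# Toy codebook: all 16 codewords mapped back to their 4-bit messages.
CODEBOOK = {encode(d): d for d in product((0, 1), repeat=4)}


def hamming_distance(a, b):
    return sum(x != y for x, y in zip(a, b))


def decode(received):
    """Return (status, candidate messages) by nearest-codeword search.

    status is 'ok' (no error), 'corrected' (single-bit error, unique
    nearest codeword), or 'due' (several codewords tie at distance 2).
    """
    by_distance = {}
    for codeword, message in CODEBOOK.items():
        d = hamming_distance(codeword, received)
        by_distance.setdefault(d, []).append(message)
    nearest = min(by_distance)
    status = {0: "ok", 1: "corrected"}.get(nearest, "due")
    return status, by_distance[nearest]


def flip(word, positions):
    """Return a copy of word with the given bit positions inverted."""
    return tuple(b ^ 1 if i in positions else b for i, b in enumerate(word))


original = (1, 0, 1, 1)
corrupted = flip(encode(original), {0, 5})  # inject a 2-bit error
status, candidates = decode(corrupted)
print(status, candidates)
```

In this toy code a 2-bit error always produces a tie among several candidate messages, one of which is the original. A conventional decoder stops at that point and reports the error; the side information described above is what lets SWD-ECC choose among the candidates.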
Recent Results: Recently, we evaluated our SWD-ECC techniques by heuristically recovering from 2-bit DUEs in MIPS instructions. We performed an offline analysis on SPEC CPU2006 benchmarks using a single-error-correcting, double-error-detecting (SEC-DED) underlying code. We were able to recover from 34% of errors that would previously have been unrecoverable and that often result in catastrophic crashes. The only side information used was the legality and frequency of the instruction bits. We expect other side information, such as data correlation, to yield even better results.
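The selection step based on legality and frequency can be sketched as follows. This is an illustrative Python sketch, not the evaluated implementation: the 4-bit opcode space, the table `TOY_OPCODE_FREQ`, its frequency values, and the function `recover` are all made up for the example, and the real study used statistics of MIPS instructions instead.

```python
# Hypothetical relative frequencies for a toy 4-bit opcode space.
# Any opcode missing from the table is treated as an illegal encoding.
TOY_OPCODE_FREQ = {
    0b0000: 0.30,
    0b0011: 0.25,
    0b0101: 0.20,
    0b1001: 0.15,
    0b1110: 0.10,
}


def recover(candidates, freq=TOY_OPCODE_FREQ):
    """Pick the most plausible candidate message after a DUE.

    Candidates that are not legal encodings are discarded first; among
    the remaining ones, the most frequent wins. Returns None when no
    candidate is legal, so the caller can fall back to its usual
    handling of uncorrectable errors.
    """
    legal = [c for c in candidates if c in freq]
    if not legal:
        return None
    return max(legal, key=freq.get)


# Four equidistant candidates, as a SEC-DED decoder might report for a DUE.
candidates = [0b0110, 0b0011, 0b1001, 0b1111]
print(bin(recover(candidates)))
```

Here two of the four candidates are illegal and are dropped, and the more frequent of the two legal ones is chosen. The choice is a heuristic guess, so a deployment would still need a policy for the cases where the guess is wrong or no candidate is legal.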
Recent Publications and Presentations:
- M. Gottscho, C. Schoeny, L. Dolecek, P. Gupta, “Software-Defined Error-Correcting Codes,” in IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), Toulouse, France, Jun. 28-Jul. 1, 2016, (to appear).
- M. Gottscho, C. Schoeny, L. Dolecek, P. Gupta, “Software-Defined Error-Correcting Codes,” in IEEE Workshop on Silicon Errors in Logic – System Effects (SELSE), Austin, TX, Mar. 29-30, 2016. Best Paper Award (top 3 selected).
- C. Schoeny & M. Gottscho, “Software-Defined Error-Correcting Codes,” Qualcomm Innovation Fellowship, San Diego, CA, Mar. 22-23, 2016. Winner of fellowship (top 8 selected).