A continuing installment of our (Close to the) Edge Computing series.

The main memory in computing devices is an ephemeral repository of information. Processing units (CPUs) may analyze and change the information, but ultimately the results end up back in main memory. Of course, the information may move to nonvolatile memory or disk storage, which provides a more permanent resting place.

Main memory integrity is important. If errors occur in main memory, anything from no effect at all to a full crash of the entire computer is possible. In order to prevent and possibly correct memory errors, Error-Correcting Code (ECC) memory has been developed and deployed in systems where data errors may have harmful consequences--real-time financial systems, for instance. The goal is that data written to memory should be the same when read back in the future.

Data in memory are stored as binary bits--1s and 0s. For a variety of reasons, one or more of these bits can flip (a 1 turns into a 0 or vice versa). Memory errors are usually placed into two categories--hard errors and soft errors. Hard errors, such as a permanently stuck bit, can result from manufacturing or design flaws; in essence, the memory module has broken and is no longer usable. Soft errors are temporary and often result from random bit flips due to cosmic rays. With ECC memory, the flipped bit can be reset to the correct value and computer operation can continue. It should also be noted that cosmic ray issues are related to elevation. Close to sea level, the cosmic ray flux is lower than at higher elevations (or in an airplane) because the atmosphere absorbs much of it.
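
To illustrate how single-bit correction works, the following toy sketch uses a Hamming(7,4) code, the same basic idea behind the wider codes used in real ECC DIMMs (which typically protect 64 data bits with 8 check bits). It is a teaching example in Python, not an implementation of any particular memory controller.

def hamming74_encode(d):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4              # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4              # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4              # parity over codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_correct(c):
    """Locate and fix a single flipped bit, then return the data bits."""
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3    # nonzero syndrome = position of the bad bit
    if syndrome:
        c[syndrome - 1] ^= 1           # flip it back
    return [c[2], c[4], c[5], c[6]]

data = [1, 0, 1, 1]
word = hamming74_encode(data)
word[5] ^= 1                            # simulate a cosmic-ray bit flip
assert hamming74_correct(word) == data  # the original data are recovered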

Systems that run 24/7--financial databases, HPC clusters, web servers, and mainframes--all use ECC memory. Consumer-facing hardware, on the other hand, such as desktops or laptops, often does not use ECC. From a cost perspective, ECC is more expensive, and because consumers have intermittent usage patterns, memory errors may not matter all that much.

For those building Edge-based systems, the use of ECC memory can restrict hardware options. It is, therefore, worthwhile to discuss the non-ECC option a bit further. To be clear, ECC is always preferred; however, there may be design cases where ECC memory is not available.

Purist or Pragmatist

When it comes to ECC memory, there seem to be purists and pragmatists. The purists insist on ECC memory for everything, particularly if they are running any kind of server. This requirement means that they will only purchase and use "enterprise grade" equipment. The "enterprise grade" designation usually means a more robust/powerful hardware specification (better Mean Time Between Failures, MTBF) with better warranties than consumer-level hardware. Taking a conventional purist approach is a wise choice.

The pragmatist will often look at the situation and decide whether the use of ECC memory limits possible solutions and, in the course of that evaluation, consider the risk of not using ECC memory. The question "Do I need ECC?" is important, and yet it implies that using non-ECC memory means living "on the edge" in terms of memory errors. From a practical standpoint, the absence of ECC may not have a huge impact on your applications.

A good example is a Tesla GPU from NVIDIA. This accelerator card comes with ECC memory, which is often a requirement in a modern data center. However, the ECC capability is oftentimes turned off to improve performance. Tests indicate that turning off ECC memory improves performance by 10% for the AMBER molecular dynamics application. The AMBER GPU web page states:

"Extensive testing of AMBER on a wide range of hardware has established that ECC has little to no benefit on the reliability of AMBER simulations."

AMBER performance seems to trump the slim possibility of a memory error (e.g., a bit flip) causing an actual program or system fault. Indeed, the ability of a random bit flip to actually cause a system or program fault is not always certain. Consider the following: in many instances, only a small percentage of memory is actually "in use" at any given time. It may have a future dependency, but the amount of memory "in play" and susceptible to real-time bit errors is usually small compared to the overall amount of memory.
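
To make the argument concrete, here is a purely hypothetical back-of-the-envelope estimate. Every number below (memory size, "live" fraction, and soft-error rate) is an assumption chosen for illustration, not a measured value.

# Purely illustrative; all numbers here are assumptions, not measurements.
total_mem_gib  = 64       # assumed total system memory
live_mem_gib   = 4        # assumed memory whose contents actually affect results
flips_per_year = 10       # assumed soft-error (bit flip) count for the whole system

# If flips land uniformly at random, only those that hit "live" data can
# change program flow or results.
p_hit = live_mem_gib / total_mem_gib
expected_consequential = flips_per_year * p_hit

print(f"Chance a given flip matters: {p_hit:.1%}")
print(f"Expected consequential flips per year: {expected_consequential:.2f}")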

Thus, to cause a fault, a flipped bit must have a future dependency such that the memory value is read in a way that changes program flow or changes data in a significant way. In other words, random bit flips do not necessarily mean a devastating system or program fault. The flip may occur in inconsequential places like unused memory, in a least significant digit of a floating-point number, or in a portion of a program (including the OS) that never gets used. A random bit flip may also cause a "quiet error" where some data get changed, but the change is not noticeable or catastrophic in the results. This type of error is a valid concern for non-ECC computations. A good practice for important applications is to rerun codes and compare results. This applies to both ECC and non-ECC memory systems.
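
A minimal sketch of that rerun-and-compare practice follows. The simulate() function and tolerance are hypothetical stand-ins for a real, deterministic application and its accuracy requirements.

import math
import random

def simulate(seed):
    """Hypothetical stand-in for a deterministic application run."""
    random.seed(seed)                 # same seed -> same answer, barring errors
    return sum(math.sin(random.random()) for _ in range(1_000_000))

# Run the same job twice and compare; a quiet error in live data
# would most likely show up as a mismatch between the two results.
first  = simulate(seed=42)
second = simulate(seed=42)

if math.isclose(first, second, rel_tol=1e-12):
    print("Results agree -- no sign of a quiet error.")
else:
    print(f"Results differ: {first} vs {second} -- investigate before trusting either run.")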

Anecdotally, I have built and deployed many small four- and eight-node HPC desk-side clusters. When designing and testing these systems, I use the NAS Parallel Benchmarks (MPI version). The benchmarks are self-checking and verify that the results are correct. In over a decade of running the NAS tests, the only time the results were wrong was due to a faulty network cable. I have never had a NAS test fail due to a hardware fault in the systems themselves.
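
For context, "self-checking" means each benchmark kernel compares its computed result against a known reference value at the end of the run. The same idea can be sketched in a few lines of Python; the workload and tolerance below are illustrative, not taken from the NAS suite.

import math

# Compute something with a known answer and verify it, in the spirit of
# the NAS verification step. sum(1/k^2) converges to pi^2/6.
N = 10_000_000
partial = sum(1.0 / (k * k) for k in range(1, N + 1))
reference = math.pi ** 2 / 6

# The truncation error is about 1/N, so a 1e-6 tolerance is safe here;
# a silent memory error in the loop data would very likely exceed it.
if abs(partial - reference) < 1e-6:
    print("VERIFICATION SUCCESSFUL")
else:
    print(f"VERIFICATION FAILED: {partial} vs {reference}")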

Another data point comes from running the Hadoop TeraSort benchmark on a small four-node desk-side cluster with non-ECC memory. The TeraSort benchmark has three separate steps. The first step creates a variable-length table filled with random data. The next step sorts the data using Hadoop MapReduce. The final step validates the sort. Similar to the NAS example, this benchmark, using 50 GByte table sizes, has been run countless times and has never failed to validate correctly. Nor have any of the many parallel mappers and reducers failed and restarted.

A final data point is the operating system itself. All the benchmarks were run under Linux on systems that use non-ECC memory. These and many other small clusters I have built (starting back in 2005) have never displayed a system fault or kernel panic due to hardware faults over their lifetimes. Portions of these systems are powered on and run 24x7, and they have uptimes on the order of months (they are often rebooted for kernel updates).

Of course, these three examples do not imply that memory errors don't cause system faults or problems. And it is certainly possible that, in the near future, a flipped bit that ECC memory would have caught and corrected may crash one of the many Linux instances I have running in my office. I'm not too concerned, however, because the historical evidence seems to indicate the likelihood is low. Again, to emphasize the point: if available, ECC is the best choice. If it is not available on a particular platform, non-ECC memory may be a valid solution for your particular problem. Indeed, for some applications like Deep Learning or Genetic Algorithms, a random flipped bit in the data may be totally inconsequential and may even improve results.

