Big Data In Little Spaces: Hadoop And Spark At The Edge
- Written by Administrator

Ever wonder what Edge computing is all about? Data happens and information takes work. Estimates are that by 2020, 1.7 megabytes of new data will be created every second for every person in the world. That is a lot of raw data.
Two questions come to mind: what are we going to do with it, and where are we going to keep it? Big Data is often described by the three Vs (Volume, Velocity, and Variability), and note that not all three need apply. What is missing is the letter "U," which stands for Usability. A Data Scientist will first ask, how much of my data is usable? Data usability can take several forms and includes things like quality (is it noisy, incomplete, accurate?) and pertinence (is there any extraneous information that will not make a difference to my analysis?). There is also the issue of timeliness. Is there a "use by" date for the analysis, or might the data be needed in the future for some as-yet-unknown reason? The usability component is hugely important and often determines the size of any scalable analytics solution. Usable data is not the same as raw data.
Get the full article at The Next Platform. You may recognize the author.
Sledgehammer HPC
- Written by Douglas Eadline
HPC without coding in MPI is possible, but only if your problem fits into one of several high-level frameworks.
[Note: The following updated article was originally published in Linux Magazine in June 2009. The background presented in this article has recently become relevant due to the resurgence of things like genetic algorithms and the rapid growth of MapReduce (Hadoop). It does not cover deep learning.]
Not all HPC applications are created in the same way. There are applications like Gromacs, Amber, OpenFoam, etc. that allow domain specialists to input their problem into an HPC framework. Although there is some work required to "get the problem into the application," these are really application-specific solutions that do not require the end user to write a program. At the other end of the spectrum are the user-written applications. The starting points for these problems include a compiler (C/C++ or Fortran), an MPI library, and other programming tools. The work involved can range from small to large, as the user must concern themselves with the "parallel aspects of the problem." Note: all application software started out at this point some time in the past.
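As a concrete picture of the "user-written" end of the spectrum, here is a minimal sketch of that starting point: a small C program where the programmer, not a framework, handles the message passing directly. The ring-passing example below is an assumed illustration (it is not from the original article), and the file and command names are placeholders.

```c
/* Minimal sketch of the "user-written" end of the HPC spectrum: the
 * programmer manages the parallel details directly with MPI.
 * Build (typical): mpicc ring.c -o ring    Run: mpirun -np 4 ./ring
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, token;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size < 2) {
        if (rank == 0)
            printf("Run with at least two MPI ranks.\n");
        MPI_Finalize();
        return 0;
    }

    if (rank == 0) {
        token = 42;   /* arbitrary payload to pass around the ring */
        MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(&token, 1, MPI_INT, size - 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("Token returned to rank 0 after visiting %d ranks\n", size);
    } else {
        MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Send(&token, 1, MPI_INT, (rank + 1) % size, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```

Even this toy example forces the programmer to think about ranks and message ordering, which is exactly the work the high-level frameworks discussed here try to hide.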
Return of the Free Lunch (sort of)
- Written by Douglas Eadline
From the do-you-want-fries-with-that department
As processor frequencies continue to level off and mainstream processors keep sprouting cores, the end of the "frequency free lunch" has finally arrived. That is, in the good old days each new generation of processor would bring with it a faster system clock that would result in a "free" performance bump for many software applications -- no reprogramming needed. Can we ever get back to the good old days?
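To see why the extra cores are not free the way extra clock cycles were, consider a minimal sketch (an assumed example, not from the article): the reduction loop below runs no faster on a many-core chip until the programmer explicitly parallelizes it, here with a single OpenMP directive.

```c
/* Illustrative sketch: without the pragma (or without building with
 * -fopenmp), this reduction runs on one core no matter how many cores
 * the processor has. The pragma is the "reprogramming" that the old
 * frequency free lunch never required.
 * Build (typical): gcc -O2 -fopenmp sum.c -o sum
 */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const long n = 50000000;               /* ~400 MB of doubles */
    double *a = malloc(n * sizeof(double));
    double sum = 0.0;

    if (a == NULL)
        return 1;

    for (long i = 0; i < n; i++)
        a[i] = 0.5 * i;

    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; i++)
        sum += a[i];

    printf("sum = %g\n", sum);
    free(a);
    return 0;
}
```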
Answering The Nagging Apache Hadoop/Spark Question
- Written by Douglas Eadline
(or How to Avoid the Trough of Disillusionment)
A recent blog post, Why not so Hadoop?, is worth reading if you are interested in big data analytics, Hadoop, Spark, and all that. The article contains the 2015 Gartner Hype Cycle, and the 2016 version is worth examining as well. Some points similar to those in the blog post can be made here:
- Big data was at the "Trough of Disillusionment" stage in 2014 but does not appear in the 2015/2016 Hype Cycles.
- The "Internet of Things" (a technology that is expected to fill the big data pipeline) was at the peak for two years and has now been given "platform status."
The Ignorance is Bliss Approach To Parallel Computing
- Written by Douglas Eadline
from the random thoughts department
[Note: The following updated article was originally published in Linux Magazine in June 2006 and offers some further thoughts on the concept of dynamic execution.]
In a previous article, I talked about programming large numbers of cluster nodes. By large, I mean somewhere around 10,000. To recap quickly, I pointed out that dependence on large numbers of things increases the chance that one of them will fail. I then proposed that it would be cheaper to develop software that can live with failure than to try to engineer hardware redundancy. Finally, I concluded that adapting to failure requires dynamic software. As opposed to statically scheduled programs, dynamic software adapts at run-time. The ultimate goal is to make cluster programming easier: focus more on the problem and less on the minutiae of message passing. (Not that there is anything wrong with message passing or MPI. At some level, messages (memory) need to be transferred between cluster nodes.)
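One common step from statically scheduled programs toward dynamic execution is the master/worker work queue: instead of dividing the work once at startup, tasks are handed out at run-time as ranks finish. The MPI sketch below is an assumed illustration (the integer tasks, the squaring "work," and the tags are placeholders), and it does not attempt the full fault tolerance argued for above.

```c
/* Sketch of dynamic (work-queue) scheduling with MPI, in contrast to a
 * statically partitioned program: the master hands out tasks at run-time
 * as workers become free. Illustration only; no fault tolerance here.
 */
#include <mpi.h>
#include <stdio.h>

#define NTASKS   100
#define TAG_WORK 1
#define TAG_STOP 2

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                        /* master: owns the task queue */
        int next = 0, done = 0, result;
        MPI_Status st;

        /* prime every worker with one task */
        for (int w = 1; w < size && next < NTASKS; w++) {
            MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
            next++;
        }
        /* hand out remaining tasks as results come back */
        while (done < next) {
            MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, TAG_WORK,
                     MPI_COMM_WORLD, &st);
            done++;
            if (next < NTASKS) {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                         MPI_COMM_WORLD);
                next++;
            }
        }
        for (int w = 1; w < size; w++)      /* tell workers to stop */
            MPI_Send(&done, 1, MPI_INT, w, TAG_STOP, MPI_COMM_WORLD);
        printf("Completed %d tasks with %d workers\n", done, size - 1);
    } else {                                /* worker: loop until told to stop */
        int task, result;
        MPI_Status st;

        while (1) {
            MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP)
                break;
            result = task * task;           /* stand-in for real work */
            MPI_Send(&result, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;
}
```

A slow rank simply receives fewer tasks; extending the master to reissue the tasks of a rank that disappears is the harder, failure-tolerant version of the same idea.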