(or How to Avoid the Trough of Disillusionment)

A recent blog post, Why not so Hadoop?, is worth reading if you are interested in big data analytics, Hadoop, Spark, and all that. The article includes the 2015 Gartner Hype Cycle, and the 2016 version is worth examining as well. Several points similar to those in the blog can be made here:

  1. Big data was at the "Trough of Disillusionment" stage in 2014, but does not appear in the 2015/2016 Hype Cycles.
  2. The "Internet of Things" (a technology expected to help fill the big data pipeline) was at the peak for two years and has now been given "platform status."

As with any new technology, high expectations coupled with overselling often lead to the eventual "Trough of Disillusionment," and the big data/Hadoop cycle is no exception. To be clear, the Apache Hadoop umbrella includes tools such as Apache Spark that are not derived from the traditional MapReduce algorithm. Hadoop version 2.0 has taken on the role of a big data management platform (supporting an ecosystem) on which new and existing tools can be created and used.

The big discussions about big data and Apache Hadoop are ongoing. When at the top of the hype curve, Hadoop was destined to replace all modern RDBMSs and create massive profits by mining the gold hidden in the mountains of unused company data. The reality is, of course, much different. Like any new technology, when it works, it works well. When it is pushed into the wrong hole, it fails. Finding those spots where Hadoop/Spark and friends fit is an important question that all companies and organizations need to answer.

In my experience, one of the biggest impediments to answering the Apache Hadoop question is the belief that big data needs a big cluster. Unless your organization is creating large volumes of data on a daily basis, a large cluster is not needed to use Hadoop. Regardless of size, however, standing up and managing a Hadoop cluster is:

  1. Expensive
  2. Hard
  3. Risky

Hadoop is hard to get right out of the box. Achieving the required functionality is not impossible, but it usually represents a significant investment in resources (either local or cloud) and people (both administrators and data scientists). Given this situation, it is easy to see how a promising new technology can slip from the top of the hype cycle and land deep in the "Trough of Disillusionment." If done correctly, however, Hadoop projects can be very successful and skirt the dreaded trough.

Hadoop on a Shoestring

Before a large investment is made in Hadoop, organizations need to ask "How can we cost-effectively determine Hadoop/Spark feasibility?" A first step can be as simple as Hadoop on a laptop (see the Hortonworks sandbox). Tools like Spark and Hive work quite well in this environment (a minimal example is sketched after the list below). As one can imagine, Hadoop on a laptop is limited by the amount of data that can reasonably fit on a laptop (both in terms of size and processing speed). After the sandbox, the next step is usually a cloud cluster or a local cluster resource. While the cloud seems like a natural solution (and it sometimes is), two issues often impede a true cloud solution:

  1. Data Movement and Privacy - moving and managing data to/from the cloud can become a project in and of itself. In addition, anytime data leaves the premises, it becomes more vulnerable.
  2. Unbounded Learning Costs - while cloud proponents point to a lower ongoing operating expense vs. a one-time capital expense, the reality is that the Hadoop learning curve is somewhat unpredictable -- it is feasibility work after all. The cloud meter starts to run at the lowest point in the learning curve and continues to accrue as the project evolves.
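
As promised above, here is a minimal sketch of laptop-scale feasibility work using Spark in local mode. It assumes PySpark is installed (for example, via pip install pyspark); the file name and column name are hypothetical placeholders for a small sample of your own data.

```python
# Minimal local-mode Spark session -- a sketch of laptop-scale feasibility work.
# Assumes PySpark is installed (e.g., "pip install pyspark"); the file name and
# column name below are hypothetical placeholders for your own sample data.
from pyspark.sql import SparkSession

# "local[*]" runs Spark on all laptop cores; no cluster or HDFS is required.
spark = (SparkSession.builder
         .master("local[*]")
         .appName("hadoop-feasibility-sketch")
         .getOrCreate())

# Load a modest sample of real data and try a representative query.
df = spark.read.csv("sample_events.csv", header=True, inferSchema=True)
df.groupBy("event_type").count().orderBy("count", ascending=False).show(10)

spark.stop()
```

The same code can later be submitted to a full YARN-backed cluster with spark-submit, which is part of what makes the laptop sandbox a reasonable first test bed.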

Provisioning and configuring a local resource is also an option, but unlike the cloud it often requires a large infrastructure investment. Typically a Hadoop feasibility cluster needs about 4-8 server-nodes with at least 4-16 TBytes of combined storage. (If you were not aware, the Hadoop file system, HDFS, replicates data, so the usable capacity is often a third of the raw storage; a quick sketch of the arithmetic follows this paragraph.) A Hadoop feasibility cluster, whether it be in the cloud or in the data center, represents a big step when leaving the laptop sandbox. Recently, a third cost-effective intermediate solution has become available.
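
As a quick back-of-the-envelope illustration of the replication point above (the raw storage figures are the range quoted in this article; a replication factor of three is the HDFS default):

```python
# Rough HDFS capacity arithmetic for a feasibility cluster.
# The raw storage figures are the range quoted above; a replication
# factor of 3 is the HDFS default.
replication = 3

for raw_tb in (4, 16):
    usable_tb = raw_tb / replication
    print(f"{raw_tb} TB raw -> roughly {usable_tb:.1f} TB of usable HDFS capacity")

# Output:
# 4 TB raw -> roughly 1.3 TB of usable HDFS capacity
# 16 TB raw -> roughly 5.3 TB of usable HDFS capacity
```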

Introducing the Hadoop Appliance

One way to quickly move up the learning curve is to use a Hadoop/Spark appliance such as those developed by Basement Supercomputing. These systems are Linux-based, turn-key Apache Hadoop/Spark clusters designed for low-cost, immediate installation, with all Apache Hadoop/Spark software pre-installed and ready to run. Base models range from 4-7 server-nodes (16-56 Intel cores) with storage options from 2-18 TBytes of fast SSD storage. Pricing starts at less than $8000. Workable Hadoop applications can be developed and tested with minimal overhead and cost. Each system provides the following benefits:

An appliance can quickly answer the big questions while avoiding broken budgets and a trip to the Trough of Disillusionment. To take a closer look at the Apache Hadoop/Spark appliances, visit Basement Supercomputing.