I was able to attend the fall 2014 Strata/Hadoop World. Held in New York City, the show was active and confirmed that the data science sector is alive and growing. There is a lot to mention, but I found one aspect quite interesting. While walking the exhibit floor I passed a small 10'x10' booth that had nothing but a table. There was no fancy backdrop or exciting demonstration, just a lot of people gathering and a small conference sign that said Databricks.

I knew who it was right away: the Apache Spark crew, now launching a company. I also knew one other thing: the number of people around this bare booth meant Spark was hot and getting hotter. Often labeled as the Hadoop killer, Spark is an in-memory data analytics tool.

Spark was initially developed for applications where keeping data in memory helps performance, such as iterative algorithms, which are common in machine learning, and interactive data mining. Spark differs from classic MapReduce in two important ways. First, Spark holds intermediate results in memory, rather than writing them to disk. And second, Spark supports more than just MapReduce functions, greatly expanding the set of possible analyses that can be executed over HDFS data stores. It also provides APIs in Scala, Java, and Python.

Since 2013, Spark has been running on production YARN clusters at Yahoo!. The advantage of porting and running Spark on top of YARN is the common resource management and a single underlying file system. For more information, see https://spark.apache.org.

One of the most interesting features is the Python API. Unlike the limited Hadoop streaming interface, which also works with Python, the Spark API uses the Spark engine directly. As an example, consider the Spark version of the classic Hadoop wordcount program. (The Hadoop Java version is 62 lines.)

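# "spark" is assumed to be an existing SparkContext
text_file = spark.textFile("hdfs://...")

# The parentheses let the chained calls span several lines
counts = (text_file.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))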
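
For readers without a cluster handy, the three steps in that pipeline can be imitated with ordinary Python lists. This is only a rough single-machine sketch of what flatMap, map, and reduceByKey each contribute. The helper names flat_map, reduce_by_key, and word_count are made up for illustration and are not part of the Spark API, and nothing here is distributed or lazy the way Spark is.

```python
from itertools import chain


def flat_map(func, items):
    """Apply func to every item and flatten the resulting lists into one list."""
    return list(chain.from_iterable(func(item) for item in items))


def reduce_by_key(func, pairs):
    """Combine the values of all (key, value) pairs that share a key using func."""
    totals = {}
    for key, value in pairs:
        totals[key] = func(totals[key], value) if key in totals else value
    return totals


def word_count(lines):
    """Count words in an iterable of text lines, mirroring the three pipeline steps."""
    words = flat_map(lambda line: line.split(), lines)   # lines -> words
    pairs = [(word, 1) for word in words]                # words -> (word, 1)
    return reduce_by_key(lambda a, b: a + b, pairs)      # sum the 1s per word


if __name__ == "__main__":
    sample = ["the quick brown fox", "jumps over the lazy dog", "the end"]
    print(word_count(sample))
```

The shape is the same as the Spark version: split lines into words, pair each word with a 1, then add up the 1s for each distinct word. In Spark the same steps run across the cluster instead of in one Python process.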

The Hadoop Killer moniker is a bit over the top because most Spark installations run as part of a Hadoop cluster, using data from the vast Hadoop data lake.

This work is licensed under CC BY-NC-SA 4.0

©2005-2023 Copyright Seagrove LLC, Some rights reserved. Except where otherwise noted, this site is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International. The Cluster Monkey Logo and Monkey Character are Trademarks of Seagrove LLC.