No GPU, No Problem: Five Open Source Machine Learning Tools

From the "this article was written by a puny human" department

There is a notion floating about that machine learning, and deep learning in particular, is a GPU-focused application. While GPUs excel at deep learning, they are not strictly required to teach your servers some smarts. There are many free and open tools for machine learning that use good old-fashioned CPUs (some can also use GPUs). Before we point out some of the tools, however, a few background comments about machine learning are in order.

Machine learning can take many forms and covers many methods, one of which is deep learning. Back in the 1980s, machine learning was called Artificial Intelligence (AI). As with many technologies, AI was oversold and lost credibility because promised breakthroughs never materialized. Beyond the hype, there has been steady progress with machine learning. In commercial contexts, machine learning methods may be referred to as data science (statistics), predictive analytics, or predictive modeling.

In those early days, there were three major areas of AI research: Expert Systems (a rule-based approach), Neural Networks (modeling the brain), and Genetic Algorithms (mimicking genetic selection). Each had successes and limitations. Expert Systems were successful but often limited in scope, and they failed when pushed to the edge of their knowledge domain. They could, however, provide a very clear reasoning path when asked "why" after producing answers. Neural Networks showed promise but took a long time to train (often measured in weeks), and researchers could not ask "why" when questioning a solution. Genetic Algorithms were good at solving many problems that were not tractable by other means and could find local maxima (or minima), but often could not guarantee the best solution.

The cognitive computing field has changed since the first major go-around with AI. Today, we call it machine learning and have an arsenal of new hardware and software tools that can be applied to many challenging problems. A good example is IBM Watson, which was able to beat the best human Jeopardy! players. Currently, "deep learning" is used to describe neural networks, which, thanks to GPUs and some performance optimizations, have become a very popular method of machine learning.

The following tools use several methods for machine learning. They are by no means the only tools available. The first four were selected because they represent substantial efforts, offer high performance, and are openly available. The final project is a smaller JavaScript application that allows machine learning to run from a web browser.

Microsoft Distributed Machine Learning Toolkit

Developed by Microsoft (well, Microsoft Research), the Distributed Machine Learning Toolkit (DMTK) is now available to the machine learning community. The full source is available on GitHub, and both Windows and Linux binaries are available from the DMTK web site. Training can be done on a single machine or a cluster. Both MPI (MPICH) and ZMQ can be used to perform distributed learning.

Implemented as a standard C++ library, DMTK provides a server-based framework for training machine learning models on big data using multiple machines. The API spares users from system management issues such as distributed model storage and operation, inter-process and inter-thread communication, and multi-threading management. Instead, users are able to focus on the core machine learning logic: data, model, and training. According to the DMTK webpage, the current version includes the following components:

DMTK Framework: a flexible framework that supports a unified interface for data parallelization, a hybrid data structure for big model storage, model scheduling for big model training, and automatic pipelining for high training efficiency.
LightLDA: an extremely fast and scalable topic model algorithm, with an O(1) Gibbs sampler and an efficient distributed implementation.
Distributed (Multisense) Word Embedding: a distributed version of the (multi-sense) word embedding algorithm.
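
To make the data parallelization idea concrete, here is a minimal single-process sketch in Python. It does not use the DMTK API: the names ParameterServer, worker_gradient, and train are hypothetical, and the "workers" are simulated one after another rather than run over MPI or ZMQ. Each worker computes a gradient on its own shard of the data, and the server averages those gradients to update the shared model.

```python
# Toy data-parallel training of y = w * x with a simulated parameter server.
# Not the DMTK API: all names here are hypothetical, and workers run in sequence.


class ParameterServer:
    """Holds the shared model (a single weight) and applies averaged gradients."""

    def __init__(self, weight=0.0, learning_rate=0.01):
        self.weight = weight
        self.learning_rate = learning_rate

    def pull(self):
        # Workers fetch the current model before computing a gradient.
        return self.weight

    def push(self, gradients):
        # Average the per-worker gradients and take one descent step.
        avg = sum(gradients) / len(gradients)
        self.weight -= self.learning_rate * avg


def worker_gradient(weight, shard):
    """Mean squared-error gradient for the model y = weight * x on one shard."""
    return sum(2.0 * (weight * x - y) * x for x, y in shard) / len(shard)


def train(data, num_workers=4, steps=200):
    # Round-robin split so every simulated worker gets its own shard.
    shards = [data[i::num_workers] for i in range(num_workers)]
    server = ParameterServer()
    for _ in range(steps):
        weight = server.pull()
        gradients = [worker_gradient(weight, shard) for shard in shards]
        server.push(gradients)
    return server.weight


# Synthetic data generated from y = 3x, so training should recover a weight near 3.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
learned_weight = train(data)
```

In a real distributed setting, the pull and push steps would be network calls, and the model would typically be far too large to live in a single Python object. Those are the concerns a framework of this kind is meant to handle.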

It is also possible to build distributed machine learning algorithms on top of the DMTK framework with small modifications to existing single-machine algorithms. According to the project site, the developers believe that pushing the frontier of distributed machine learning requires a collective effort from the entire community to foster both machine learning and system innovations. This belief strongly motivated the DMTK team to develop the project in an open fashion.

BVLC: Caffe

As stated on the project website, Caffe is a deep learning framework made with expression, speed, and modularity in mind. It was developed by the Berkeley Vision and Learning Center (BVLC) and by community contributors. Caffe is released under the BSD 2-Clause license.

Caffe is an expressive architecture that encourages application development and innovation. Models and optimization are defined through configuration and do not require hard-coding. Users may switch between CPU and GPU by setting a single flag, which makes it possible to train on a GPU machine and then deploy to commodity clusters or mobile devices. Caffe has shown great success and speed with computer vision. A recent blog post, "Myth Busted: General Purpose CPUs Can't Tackle Deep Neural Network Training," by Pradeep Dubey at Intel demonstrates how Caffe can be optimized for Xeon CPUs. The speed-up is shown in Figure One.

Interestingly, the performance was achieved through the use of standard optimization tools including Intel's Math Kernel Library (Intel MKL), parallelization using OpenMP, aligned data allocation, and the use of new performance primitives such as direct batched convolution, pooling, and normalization. These new primitives are coming to future versions of Intel MKL and Intel Data Analytics Acceleration Library (Intel DAAL). A technical preview package demonstrating achievable performance with Caffe on the AlexNet topology is already available for download.

Figure One: Optimized Performance of Caffe on Xeon processors.
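
The single-flag switch between CPU and GPU mentioned above can be illustrated with Caffe's solver definition, a plain-text configuration file that includes a solver_mode field. The following Python sketch only generates the text of such a file and does not call Caffe itself. The helper name solver_definition, the file name train_val.prototxt, and the hyperparameter values are placeholders chosen for illustration.

```python
# Builds the text of a Caffe solver definition where only solver_mode differs
# between a CPU run and a GPU run. File name and hyperparameters are placeholders.


def solver_definition(mode):
    """Return solver definition text for mode 'CPU' or 'GPU'."""
    if mode not in ("CPU", "GPU"):
        raise ValueError("mode must be 'CPU' or 'GPU'")
    lines = [
        'net: "train_val.prototxt"',  # hypothetical network definition file
        "base_lr: 0.01",              # placeholder learning rate
        'lr_policy: "fixed"',
        "max_iter: 1000",             # placeholder iteration count
        "solver_mode: {}".format(mode),
    ]
    return "\n".join(lines) + "\n"


cpu_solver = solver_definition("CPU")
gpu_solver = solver_definition("GPU")

# Collect the lines that differ between the two configurations.
differing = [
    (c, g)
    for c, g in zip(cpu_solver.splitlines(), gpu_solver.splitlines())
    if c != g
]
```

Here differing holds a single pair of lines, the two solver_mode settings, which is the point: the network and the training settings stay the same while the target hardware changes. Caffe's Python interface offers a comparable switch through caffe.set_mode_cpu() and caffe.set_mode_gpu().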

Apache SystemML

Originally developed by IBM and now an Apache Project, SystemML is a declarative large-scale machine learning (ML) project that provides a flexible specification of ML algorithms and automatic generation of hybrid runtime plans ranging from single-node, in-memory computations to distributed computations on Hadoop MapReduce or Apache Spark. ML algorithms are expressed in an R-like syntax that includes linear algebra primitives, statistical functions, and ML-specific constructs. This high-level language significantly increases the productivity of data scientists as it provides both full flexibility in expressing custom analytics and data independence from the underlying input formats and physical data representations.

All SystemML computations can be executed in a variety of different modes. Users can begin with a standalone mode on a single machine allowing algorithms to be developed locally without need of a distributed cluster. Once ready for scalable operation, algorithms can be automatically distributed across Hadoop MapReduce or Apache Spark. In addition, SystemML can be operated via Java and Scala and features an embedded API for scoring models.

Google TensorFlow

TensorFlow is an open source software library for numerical computation using data flow graphs. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them. The flexible architecture allows you to deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API. TensorFlow was originally developed by researchers and engineers working on the Google Brain Team within Google's Machine Intelligence research organization for the purposes of conducting machine learning and deep neural networks research, but the system is general enough to be applicable in a wide variety of other domains as well.
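
The data flow graph idea can be sketched in a few lines of plain Python. This is an illustration only and does not use the TensorFlow API. The Node class and the helpers constant, add, and dot are hypothetical names, and plain Python lists stand in for tensors. The graph is built first, and nothing is computed until a node is evaluated.

```python
# A toy data flow graph: nodes are operations, edges carry values between them.
# This is an illustration only and does not use the TensorFlow API.


class Node:
    """An operation plus the upstream nodes whose outputs feed it."""

    def __init__(self, op, inputs=()):
        self.op = op
        self.inputs = tuple(inputs)

    def evaluate(self, cache=None):
        # Memoize results so a node shared by several consumers runs once.
        if cache is None:
            cache = {}
        if self not in cache:
            args = [node.evaluate(cache) for node in self.inputs]
            cache[self] = self.op(*args)
        return cache[self]


def constant(value):
    # A source node with no inputs.
    return Node(lambda: value)


def add(a, b):
    # Element-wise addition of two equal-length vectors.
    return Node(lambda x, y: [p + q for p, q in zip(x, y)], (a, b))


def dot(a, b):
    # Inner product of two equal-length vectors.
    return Node(lambda x, y: sum(p * q for p, q in zip(x, y)), (a, b))


# Build the graph first; nothing is computed until evaluate() is called.
x = constant([1.0, 2.0, 3.0])
w = constant([0.5, 0.5, 0.5])
shifted = add(x, w)         # would yield [1.5, 2.5, 3.5]
result = dot(shifted, w)    # 0.75 + 1.25 + 1.75
value = result.evaluate()   # 3.75
```

Separating graph construction from execution is what allows a system of this kind to decide where each operation runs, whether on a CPU core, a GPU, or another machine, without changing the code that describes the computation.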

There are both local and distributed implementations of the TensorFlow interface. The local implementation is used when the client, the master, and the worker all run on a single machine in the context of a single operating system process, and it can use multiple GPU cards or CPU cores. The distributed implementation shares most of the code with the local implementation and includes support for the client, the master, and the workers running as different processes on different machines. In the distributed environment, these different tasks are containers in jobs managed by the Google Borg cluster scheduling system.

Currently, TensorFlow supports Python and C++ APIs. The Python API is the most complete and the easiest to use, but the C++ API may offer some performance advantages in graph execution and supports deployment to small devices such as Android. There are plans to develop front ends for languages like Go, Java, JavaScript, Lua, R, and possibly others.

Installation and download instructions are available for the local (single machine) version. The distributed version is expected once it can be cleanly detached from the Google infrastructure (i.e. rescued from the Borg).

ConvNetJS

The final machine learning tool requires no special hardware or software. The ConvNetJS project is a JavaScript library for training Deep Learning models (mainly Neural Networks) entirely in your browser. Simply open a tab and you're training. There is no need for extra software, compilers, or GPUs. The speed at which your browser learns and classifies will vary, and if you find ML useful, you may want to consider one of the grown-up packages mentioned above.

ML and HPC

One issue that often comes up is the intersection of machine learning (ML) and high performance computing (HPC). Indeed, the question, "Is data science (DS) and/or machine learning a type of high performance computing?" is often asked in HPC circles. Since machine learning, data science, and high performance computing are all rather nebulous and vast topics, the answer, if you care, is yes. At one point HPC (then called supercomputing) was described as any computing system that had at least a six-figure price tag. Cluster computing has certainly changed this situation. A good working definition of HPC is continuous computing that must produce timely solutions with potentially larger data sets than can be handled by widely available desktop computing resources. By that definition, ML, DS, and all they entail seem to fit.