  • If a space is populated (on state):
    • Any cell with one or zero neighbors dies (death from loneliness).
    • Any cell with four or more neighbors dies (death from overpopulation).
    • Any cell with two or three neighbors survives.
  • If a space is unpopulated (off state):
    • Any empty cell with exactly three neighbors becomes populated (birth).
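
To make the rules concrete, below is a minimal Python sketch of one Life generation (it assumes NumPy and a grid that wraps around at the edges; the rules above do not specify edge handling, so that choice is mine):

import numpy as np

def life_step(grid):
    """Apply one generation of the Life rules to a 0/1 grid."""
    # Count the eight neighbors of every cell by summing shifted copies of the grid.
    neighbors = sum(np.roll(np.roll(grid, dy, axis=0), dx, axis=1)
                    for dy in (-1, 0, 1) for dx in (-1, 0, 1) if (dy, dx) != (0, 0))
    survive = (grid == 1) & ((neighbors == 2) | (neighbors == 3))  # two or three neighbors
    birth = (grid == 0) & (neighbors == 3)                         # empty cell, exactly three
    return (survive | birth).astype(int)  # everything else dies (loneliness or overpopulation)

# Example: a small "glider" pattern evolving on a 10x10 grid.
grid = np.zeros((10, 10), dtype=int)
grid[1, 2] = grid[2, 3] = grid[3, 1] = grid[3, 2] = grid[3, 3] = 1
for _ in range(4):
    grid = life_step(grid)
print(grid)

Run it for a few more generations and the glider keeps drifting diagonally across the grid, one cell every four steps.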

If you want to play with Life online, you can go to sites like this one and enter patterns to see what happens. One thing to note is that small variations in patterns can cause huge changes in the outcome.

A popular and controversial book by Stephen Wolfram called A New Kind of Science was mostly about the properties and possibilities of CAs. In general, CAs work well when you want to model complex things with lots of little parts. Of course, there are other ways to model complex things, but CAs have the advantage of starting with "simple rules" and generating complex results. There are CAs for fluid flow, ecosystems, populations, environmental systems, urban growth, and even traffic, among other areas. CAs are often used to study "emergent systems" -- those systems that start out simple but then self-organize.

As you can imagine, a CA can be executed in parallel. There are really no restrictions on how far a neighbor interaction can reach, but in general most interactions are very local, i.e., they take place between close neighbors. This type of interaction keeps the dataflow local and thus makes parallel execution possible: if the grid is broken into parts, only the boundaries need to communicate. Like Genetic Algorithms, CAs are naturally parallel and can take advantage of the cheap and plentiful parallel computing cycles that clusters offer. There are even parallel CA programming languages like CAOS (which stands for Cells, Agents and Observers for Simulation).
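
As a rough illustration of the "only the boundaries need to communicate" point, the sketch below splits the Life grid from the previous example into two strips and advances each strip independently, exchanging only one ghost row per boundary each generation. It is plain Python/NumPy rather than a real message-passing code, so the ghost-row copies stand in for what would be inter-node messages on a cluster:

import numpy as np

def life_step(grid):
    """One Life generation with wrap-around edges (same rules as the sketch above)."""
    neighbors = sum(np.roll(np.roll(grid, dy, axis=0), dx, axis=1)
                    for dy in (-1, 0, 1) for dx in (-1, 0, 1) if (dy, dx) != (0, 0))
    return (((grid == 1) & ((neighbors == 2) | (neighbors == 3))) |
            ((grid == 0) & (neighbors == 3))).astype(int)

# A 10x10 grid with a glider, split into two 5-row strips ("partitions").
grid = np.zeros((10, 10), dtype=int)
grid[1, 2] = grid[2, 3] = grid[3, 1] = grid[3, 2] = grid[3, 3] = 1
serial = grid.copy()
top, bottom = grid[:5].copy(), grid[5:].copy()

for _ in range(4):
    serial = life_step(serial)                      # reference: whole-grid update
    # Exchange boundary (ghost) rows -- on a cluster these would be the only messages.
    top_pad    = np.vstack([bottom[-1:], top, bottom[:1]])
    bottom_pad = np.vstack([top[-1:], bottom, top[:1]])
    # Each strip updates independently; only its own rows are kept.
    top, bottom = life_step(top_pad)[1:-1], life_step(bottom_pad)[1:-1]

print(np.array_equal(np.vstack([top, bottom]), serial))   # True: partitioned run matches serial

The same idea scales to many strips: each partition only ever sees its own cells plus one ghost row from each neighbor, so the communication stays small no matter how big the grid gets.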

There is a certain appeal to using CAs as a tool to simulate large systems. Indeed, CAs can even produce pseudo-randomness as part of their behavior. Cellular automata provide an alternative method for studying the behavior of dynamical systems. Traditionally such systems are studied with differential equations. While differential equations are the workhorse of such systems, difficulties arise from non-linear behavior, round-off errors, and variations in initial conditions. Because CAs are "simple rules," they are easier to analyze than differential equations and can tolerate modifications without affecting the stability of the analysis.
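
As one illustration of the pseudo-randomness claim, the short sketch below runs Wolfram's rule 30, a one-dimensional elementary CA whose center column is irregular enough to be used as a pseudo-random bit source (rule 30 is my choice of example here, not one named above):

def rule30_step(cells):
    """Advance a 1D row of 0/1 cells one generation under rule 30 (wrap-around edges)."""
    n = len(cells)
    return [cells[(i - 1) % n] ^ (cells[i] | cells[(i + 1) % n]) for i in range(n)]

cells = [0] * 101
cells[50] = 1                    # start from a single "on" cell
bits = []
for _ in range(64):
    bits.append(cells[50])       # collect the center column as a bit stream
    cells = rule30_step(cells)
print("".join(map(str, bits)))   # an irregular 0/1 sequence from a very simple rule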

Keep the CA approach in mind when you have some extra cycles and want to play with a new idea. For instance, what if you tried to simulate a cluster queuing algorithm? You never know what may happen. The often-noted problem with CAs, similar to that of Genetic Algorithms, is that you can find something that works but not be sure why. If you want to find more information about CAs, have a look at a 2003 Survey of Cellular Automata or take a gander at the old but still useful Cellular Automata FAQ.

MapReduce

Our last sledgehammer approach to HPC is currently quite popular in big data analytics. If you have searched for anything on the web, you have probably used (in the background) a MapReduce algorithm. The idea is basically the same thing *nix users have been doing for years, but MapReduce in applications like Apache Hadoop can do it on a much larger scale. For those interested in a good starting point, consider the Hadoop 2 Quick-Start Guide. Hadoop version 1 is a free implementation of MapReduce at scale, although MapReduce itself can be implemented in any language. Hadoop version 2 supports MapReduce and many other scalable methods. In particular, Apache Spark is a popular platform for MapReduce and many other analytics operations. In terms of MapReduce, consider the following "piped" set of commands:

cat * | grep "pattern" | sort | uniq -c | cat > file

Nothing fancy there. Now try this on 20 TBytes of data. MapReduce works by distributing the work across multiple servers. Think of it as a partitioned file system where files are broken into large blocks. Each block is often replicated on several data nodes (for reliability). With this arrangement, searching can be done in parallel. Figure One shows the general MapReduce algorithm.

MapReduce is a Single Instruction Multiple Data (SIMD) approach to parallel computing.
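
A toy version of the block idea is sketched below in Python: the "data" is split into fixed-size blocks and a pool of worker processes searches the blocks at the same time. The block size, search term, and data here are made up for the example; a real system would also replicate the blocks across machines:

from concurrent.futures import ProcessPoolExecutor

def count_matches(block, term="error"):
    """Per-block work: count the lines that contain the search term."""
    return sum(1 for line in block.splitlines() if term in line)

def split_into_blocks(text, lines_per_block=1000):
    """Break the input into blocks of lines, like a block-oriented file system."""
    lines = text.splitlines(keepends=True)
    return ["".join(lines[i:i + lines_per_block])
            for i in range(0, len(lines), lines_per_block)]

if __name__ == "__main__":
    text = ("all is well\n" + "error: disk full\n") * 5000   # stand-in for a huge log
    blocks = split_into_blocks(text)
    with ProcessPoolExecutor() as pool:                      # each worker searches a block
        total = sum(pool.map(count_matches, blocks))
    print(total)                                             # 5000 matching lines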

In terms of our command string above, the MapReduce algorithm can be thought of as follows:

cat *  |  grep "pattern"  |  sort  |  uniq -c  |  cat > file
input ->        Map      -> shuffle -> Reduce ->    output
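
Here is a rough, single-machine sketch of those stages in Python, counting matching lines the way grep | sort | uniq -c would; a framework like Hadoop runs the same map, shuffle, and reduce steps spread across many nodes:

from itertools import groupby

lines = ["error disk full", "all is well", "error timeout", "error disk full"]

# Map: emit a (key, value) pair for every line that matches, like grep.
mapped = [(line, 1) for line in lines if "error" in line]

# Shuffle: bring identical keys together, like sort.
shuffled = sorted(mapped)

# Reduce: combine the values for each key, like uniq -c.
reduced = {key: sum(v for _, v in group)
           for key, group in groupby(shuffled, key=lambda kv: kv[0])}

print(reduced)   # {'error disk full': 2, 'error timeout': 1}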

A MapReduce operation works with key/value pairs. It is primarily used for log processing, web search indexing and ranking, and ad-hoc data queries. Because the search is distributed and independent, MapReduce algorithms can have near-linear scalability. As mentioned, the idea is simple: by breaking the data up across servers (hard drives) you can spin many disks at the same time. Another MapReduce trick is to move the processing to the data and not the other way around, which drastically improves performance. Another key aspect of MapReduce is the assumption that the data are static, i.e., they will not change while the search is in progress. Because the data are often redundantly stored, there is also a level of fault tolerance in a MapReduce cluster.

The most popular MapReduce package is the Apache Foundation Hadoop project. (Named for project creator Doug Cutting's child's stuffed elephant.) Hadoop has three major components: the Hadoop Distributed File System (HDFS), a resource scheduler (YARN), and the MapReduce engine framework.

HDFS is built from a cluster of DataNodes and is designed for streaming file access (i.e., not for user home directories). DataNodes can talk to each other to re-balance data, to move copies around, and to keep the replication of data high. There is also a central NameNode that serves as both the directory namespace manager and the "inode table" for HDFS. HDFS offers high availability, federation, snapshots, NFSv3 mounts, and more.

The MapReduce engine sits above YARN and HDFS. Clients submit MapReduce jobs to YARN, which attempts to keep the work as close to the data as possible (HDFS is rather unique in that it is "rack aware"). MapReduce gets its performance from starting multiple "maps" on the DataNodes. If an individual map or reduce fails or times out, that part of the job is rescheduled. Hadoop is an active project and represents a powerful way to search through reams of data quickly.

If I Had A Hammer

While the three methods I mention above are not typical uses for HPC clusters, they are nonetheless capable of throwing large amounts of compute power at an appropriate problem. In each case, the user is also removed from much of the parallel computing minutiae all too familiar to many HPC users. Such an advantage only really works if you swing your hammer and hit a round peg into a round hole. If you have a square peg and a round hole, a sledgehammer may make it fit, but it may not give the best result. Swing carefully.

