Everyone is talking about big data and tools like Hadoop. With the number of discussions floating around it can be tough to make basic sense of terms like MapReduce, Hadoop, Distributed Computing, HDFS and other terms people like to use to describe their work with big data. Here at Think Big Analytics, we specialize in making sense of big data with many different technologies including Hadoop. After reading this blog post you will have a better understanding of some of these terms.
Hadoop is a distributed platform designed to help manage and analyze massive datasets that are too big or costly to put in relational databases (commonly referred to as SQL databases). Rather than storing data in expensive and monolithic highend computers as is very common with relational databases, Hadoop at its core is designed to distribute data across many commodity server class computers in a cluster. By spreading data across many smaller nodes, the storage as well as the computation can be scaled horizontally. In other words, instead of spending lots of money on the most powerful computers on the market to increase your horsepower, all you need to do is add more commodity servers to the cluster.
What makes Hadoop so powerful? Well, let’s take those distributed principles that Hadoop is based upon and use a simple example to think about them. Say you have a one Terabyte file, and assume that if you put that file on a single machine that it can process the file in one hour. If you break that same Terabyte file into even chunks across 10 machines, that means that each machine only has to process 100 Gigabytes of that file. If you do the math, each computer will take approximately 6 minutes to process their portion of the file. Since each computer can do its portion of the work in parallel, you will now have have reduced your processing time from one hour to six minutes. Of course this assumes that you can process your data in parallel, in other words you do not have to wait for the first part of your data to be processed in order to process the rest of the data. Now, you could have just purchased a single computer that has the processing power to process the file in a comparable time. But what if you wanted to double your processing power? Well, with a Hadoop cluster you can just double the amount of nodes you have. Double your nodes, double your power. You already know how much it will cost you for the nodes since you most likely have similar nodes in your cluster, so it makes it easy to estimate scaling costs for the future. On the other hand, super computers might not even have the technology to process files as fast as you want to. And if they do, they are absolutely going to be much more expensive than the commodity servers you will be putting in your cluster. It should be pretty clear at this point that a distributed framework like Hadoop is a very powerful and economic tool.
What about HDFS and MapReduce? How do they fit in with Hadoop? As mentioned earlier, Hadoop is a term used to describe a distributed computing platform. The Hadoop platform contains many different components like HDFS and MapReduce (along with Hive, HBase, and many others) which each serve a special purpose. HDFS stands for the Hadoop Distributed File System, which is exactly what it sounds like, the file system the Hadoop platform uses to store data across many servers in a cluster. In HDFS, files are broken into blocks and spread across nodes in the cluster. In addition, these blocks are replicated on different nodes to maintain data durability. This way, if one of your nodes fails another one of the live nodes still has a copy of the data.
When running Hadoop there are three main processes that manage HDFS, the NameNode, the Secondary NameNode, and the DataNode. The NameNode is the master process that maintains all of the references on how exactly a file is split up in blocks across nodes in the cluster. When reading files in HDFS, a client process will contact the NameNode for this metadata and ask the corresponding DataNodes for those blocks for reading. In the current stable Hadoop release, the NameNode is a single point of failure (SPOF). This can be very dangerous because if the NameNode crashes or fails your cluster will become essentially unusable until your administrator manually recovers the NameNode. However, as Hadoop distributions from MapR, Cloudera, and HortonWorks come out with new features like federation, hot-failover, or completely replacing the NameNode all together, it is becoming less of an issue. Regardless, the Secondary NameNode daemon is an assistant process designed to take snapshots of the NameNode’s metadata that can be used to recover the filesystem if you do lose the NameNode. Keep in mind this kind of recovery does require some manual human intervention.
MapReduce is the data processing framework that allows developers to create applications that can take advantage of files stored in a distributed environment like HDFS. A typical MapReduce application has two functions, a Mapper and a Reducer. Mappers and Reducers will be run as tasks on nodes in the cluster. The Mapper functions organize blocks of data in a way that allows the data to be aggregated and sent to Reducer functions for any kind of aggregate logic. For example, in the WordCount application the Mapper functions read blocks of text and output each word they find. If several Mappers output the same word, each of those outputs are aggregated to a single Reducer which can count the number of times it has received the same word, producing a WordCount.
Similar to the processes that manage HDFS, there are two processes that manage the MapReduce framework, the JobTracker and TaskTracker. The JobTracker is the master process that coordinates the Map and Reduce tasks sent across the cluster. It manages task and node failure to make sure the cluster is being used at its full potential. The TaskTrackers are slave processes that create individual Map and Reduce tasks on a node and keep the JobTracker up to date with its current status.
HDFS and MapReduce make up the very core of the Hadoop platform. After learning about the NameNode, Secondary NameNode, DataNode, JobTracker, and TaskTracker you should be able to make a little more sense out of some of the Hadoop buzz. Please look forward to additional blog posts where we can dive deeper into Hadoop or other interesting technologies like Storm, Hive, and HBase. Also, check out the companion webinar I did on this very same topic.