Machine Learning in Hadoop

Mahout

Mahout’s goal is to build scalable machine-learning libraries. Its core algorithms for clustering, classification, and batch-based collaborative filtering are implemented on top of Apache Hadoop using the MapReduce paradigm. Currently, Mahout supports three common machine-learning use cases:

(1) User-based recommendation mines known user preferences and behaviors to predict new preferences for a user (there is also limited support for the related item-based approach).

(2) Clustering looks for similarities between data points, using a user-specified metric, to identify clusters in the data, that is, groups of points that are more similar to each other than to members of other groups.

(3) Classification applies discrete labels to data, or predicts a continuous value (e.g., a price), based on previously labeled examples of similar data.
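As an illustration of the first use case, the following is a minimal sketch of user-based recommendation with Mahout’s Taste API. It runs on a single machine rather than as a MapReduce job (Mahout’s distributed recommenders are instead launched through Hadoop driver classes); the input file name ratings.csv, the user ID, and the neighborhood size are illustrative assumptions.

    import java.io.File;
    import java.util.List;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class UserRecommenderSketch {
        public static void main(String[] args) throws Exception {
            // Assumed input: ratings.csv with lines of the form userID,itemID,preference
            DataModel model = new FileDataModel(new File("ratings.csv"));
            // Compare users by the Pearson correlation of their rating vectors
            UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
            // Consider the 10 most similar users when predicting preferences
            UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
            Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
            // Top 3 item recommendations for (hypothetical) user 42
            List<RecommendedItem> items = recommender.recommend(42, 3);
            for (RecommendedItem item : items) {
                System.out.println(item.getItemID() + " : " + item.getValue());
            }
        }
    }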

RHadoop

RHadoop is a collection of three R packages that let users manage and analyze data in Hadoop from R: rmr, for writing distributed MapReduce programs in R; rhdfs, for accessing files stored in HDFS from R; and rhbase, for letting R programs interact with HBase.
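As a sketch of what rmr enables (the package is distributed as rmr2 in current RHadoop releases), the toy job below squares the integers 1 to 1000 as a MapReduce computation; it assumes a working Hadoop installation with the rmr2 package configured.

    library(rmr2)

    # Write the input values into HDFS
    small.ints <- to.dfs(1:1000)

    # Run a map-only MapReduce job: each value v is emitted as the pair (v, v^2)
    result <- mapreduce(input = small.ints,
                        map = function(k, v) keyval(v, v^2))

    # Read the resulting key-value pairs back from HDFS into the R session
    from.dfs(result)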