As internet penetration and usage grew, the volume of data captured by Google increased exponentially year on year. To give a sense of the scale, by one estimate Google collected an average of 270 PB of data every month in 2007, and by 2009 that figure had grown to 20,000 PB every day.
Google clearly needed a better platform to process such an enormous volume of data. It implemented a programming model called MapReduce to process data at this scale, and ran these MapReduce operations on a special file system called the Google File System (GFS).
Unfortunately, GFS is not open source. Working from Google's published descriptions of GFS and MapReduce, Doug Cutting and colleagues, later with backing from Yahoo!, built an open-source counterpart, the Hadoop Distributed File System (HDFS). Thus came Hadoop, an open-source Apache project: a framework for performing operations on data in a distributed environment (using HDFS) with a simple programming model called MapReduce. In other words, Hadoop can be thought of as a set of open-source programs and procedures that anyone can use as the “backbone” of their big data operations.
Hadoop lets you store files bigger than what can be stored on any single node or server, so it can hold very large files as well as very large numbers of files. It is also a scalable and fault-tolerant system. In the realm of Big Data, Hadoop falls primarily into the distributed processing category but also has a powerful storage capability.
The core components of Hadoop are:
1. Hadoop YARN – A management and scheduling system that allocates resources on a cluster of machines.
It manages the resources of the systems storing the data and running the analysis.
2. Hadoop MapReduce – MapReduce is named after the two basic operations this module carries out: reading data from storage and putting it into a format suitable for analysis (map), and then performing mathematical operations, such as counts or sums, on the result (reduce). MapReduce provides a programming model that makes combining data from many hard drives a much easier task. The programming model has two parts, the map phase and the reduce phase, and the “combining” of data occurs at the interface between the two. Hadoop distributes the data across multiple servers, and every server can both store and analyze its data locally.
When you run a query on a large dataset, every server in the cluster executes the query on its own local portion of the data. Finally, the results from all the servers are consolidated, and this consolidation step is handled by MapReduce.
3. Hadoop Distributed File System (HDFS) – This is a self-healing, high-bandwidth clustered file store optimized for high-throughput access to data. It can store any type of data, structured or complex, from any number of sources in its original format.
It is a file system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware. By default, Hadoop stores 3 copies of each data block on different nodes of the cluster. Whenever a node holding a given block fails, another copy is created on another node in the cluster, which makes the system fault tolerant. In simpler terms, Hadoop distributes and replicates the dataset across multiple nodes, so that if any node in the cluster fails, the dataset can still be returned intact.
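The replication behaviour described above can be illustrated with a small in-memory simulation. This is only a sketch under simplifying assumptions: it is plain Python, not the HDFS API, the function names `place_blocks` and `fail_node` and the node names are hypothetical, and replicas are placed at random, whereas a real HDFS cluster also takes rack topology into account.

```python
import random

REPLICATION = 3  # default number of copies per block, as stated above


def place_blocks(num_blocks, nodes, replication=REPLICATION, seed=0):
    """Assign each block to `replication` distinct nodes (random placement)."""
    rng = random.Random(seed)
    return {b: set(rng.sample(nodes, replication)) for b in range(num_blocks)}


def fail_node(placement, live_nodes, dead, replication=REPLICATION, seed=1):
    """Remove `dead` from the cluster and re-replicate the blocks it held."""
    rng = random.Random(seed)
    survivors = [n for n in live_nodes if n != dead]
    for holders in placement.values():
        holders.discard(dead)
        # Copy the block to nodes that do not hold it yet, until the
        # target replica count is restored or no candidates remain.
        candidates = [n for n in survivors if n not in holders]
        while len(holders) < replication and candidates:
            target = rng.choice(candidates)
            candidates.remove(target)
            holders.add(target)
    return survivors


# Hypothetical five-node cluster holding eight blocks.
nodes = ["node1", "node2", "node3", "node4", "node5"]
placement = place_blocks(num_blocks=8, nodes=nodes)
nodes = fail_node(placement, nodes, dead="node2")
```

After the simulated failure, every block is again held by three distinct surviving nodes, which is the property that lets the dataset be read back even though one machine is gone.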
KEY CHARACTERISTICS
– High Availability: MapReduce, a YARN-based system, has efficient load balancing. It ensures that jobs run and fail independently, and it restarts jobs automatically on failure.
– Scalability of Storage/Compute: Using the MapReduce model, applications can scale from a single node to hundreds of nodes without having to re-architect the system.
Scalability is built into the model because data is chunked and distributed as independent units of computation.
– Controlling Cost: Adding or retiring nodes as storage and analytic requirements evolve is easy with Hadoop. You don’t have to commit to more storage or processing power ahead of time and can scale only when required, thus controlling your costs.
– Agility and Innovation: Since data is stored in its original format and there is no predefined schema, it is easy to apply new and evolving analytic techniques to this data using MapReduce.
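Both the scalability and the schema-free points follow from the shape of the MapReduce model, which can be sketched in a few lines. The example below is a minimal illustration in plain Python, not the Hadoop API: the three-field log format, the sample data, and the function names `map_phase`, `shuffle`, `reduce_phase` and `run_job` are all assumptions made for the sake of the example. Each inner list stands in for the chunk of raw data held on one node.

```python
from collections import defaultdict


def map_phase(line):
    """Parse one raw line at read time and emit (key, value) pairs."""
    parts = line.split()
    if len(parts) != 3:  # no schema was enforced on write, so skip bad records
        return []
    _timestamp, page, num_bytes = parts
    return [(page, int(num_bytes))]


def shuffle(pairs):
    """Group values by key; this is where data from all chunks is combined."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Aggregate all values seen for one key."""
    return key, sum(values)


def run_job(chunks):
    mapped = []
    for chunk in chunks:  # each chunk is processed independently of the others
        for line in chunk:
            mapped.extend(map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical raw log lines: "<timestamp> <page> <bytes>", split over 3 nodes.
chunks = [
    ["t1 /home 120", "t2 /about 80"],
    ["t3 /home 200", "bad line"],
    ["t4 /about 20"],
]
totals = run_job(chunks)
print(totals)  # {'/home': 320, '/about': 100}
```

Because the map step touches one chunk at a time and the structure of a record is decided only when it is read, adding more chunks (or nodes) or swapping in a different `map_phase` for a new analysis requires no change to the stored data.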