Abstract – A huge amount of data (data measured in exabytes or zettabytes) is called Big Data. Quantifying such a large amount of data and storing it electronically is not easy. The Hadoop system is used to process these large datasets, and MapReduce programs are used to gather the data according to the request.
For achieving greater performance, big data requires proper scheduling. To minimize starvation and maximize resource utilization, scheduling techniques are used to assign jobs to the available resources. Performance can be increased by implementing deadline constraints on jobs. The goal of this research is to study and analyze various scheduling algorithms for better performance.
Index Terms – Big Data, MapReduce, Hadoop, Job Scheduling Algorithms.
I. INTRODUCTION
Currently, the term big data [1] has become very popular in the information technology sector. Big data refers to a broad range of datasets that are hard to manage with conventional applications. Big data can be applied in finance and business, banking, online and onsite purchasing, healthcare, astronomy, oceanography, engineering, and many other fields. These datasets are very difficult to handle and are growing exponentially day by day.
As data increases in volume, variety, and velocity, processing it becomes more complex. Correlating, linking, matching, and transforming such big data is a complex process. Big data, being a developing field, has many research problems and challenges to address. The major research problems in big data are the following:
1) Handling data volume [1], [2]: The large amount of data coming from different fields of science, such as biology, astronomy, and meteorology, makes processing very difficult for scientists.
2) Analysis of big data: It is difficult to analyze big data because of the heterogeneity and incompleteness of the data. Collected data can differ in format, variety, and structure [3].
3) Privacy of data in the context of big data [3]: There is public fear regarding the inappropriate use of personal data, particularly through linking of data from multiple sources. Managing privacy is both a technical and a sociological problem.
4) Storage of huge amounts of data [1], [3]: This is the problem of how to recognize and efficiently store important information extracted from unstructured data.
5) Data visualization [1]: Data processing techniques should be efficient enough to enable real-time visualization.
6) Job scheduling in big data [4]: This problem focuses on efficient scheduling of jobs in a distributed environment.
7) Fault tolerance [5]: This is another issue in the Hadoop framework. In Hadoop, the NameNode is a single point of failure. Replication of blocks is one of the fault tolerance techniques used by Hadoop. Fault tolerance techniques must be efficient enough to handle failures in a distributed environment.
MapReduce [6] provides an ideal framework for processing such large datasets by using parallel and distributed programming approaches.
II. MAPREDUCE
MapReduce operations depend on two functions, Map and Reduce. Both functions are written by the user according to need. The Map function takes an input pair and generates a set of intermediate key/value pairs.
The MapReduce library collects all the intermediate values associated with the same intermediate key and passes them to the Reduce function. The Reduce function accepts an intermediate key together with its set of values and merges those values to form a possibly smaller set of values. Figure 1 shows the overall MapReduce process.
Fig. 1 The Overall MapReduce Word Count Process.
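The word count process of Figure 1 can be illustrated with a minimal single-process sketch. It is not Hadoop code: the names map_fn, reduce_fn and run_mapreduce and the sample input lines are assumptions made only for illustration, and the grouping step stands in for the work the MapReduce library does between the two functions.

```python
from collections import defaultdict


def map_fn(_key, line):
    """Map: emit an intermediate (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: merge all values of one intermediate key into a single count."""
    return word, sum(counts)


def run_mapreduce(records, mapper, reducer):
    """Apply the mapper, group values by intermediate key, then apply the reducer."""
    groups = defaultdict(list)
    for key, value in records:
        for intermediate_key, intermediate_value in mapper(key, value):
            groups[intermediate_key].append(intermediate_value)
    return dict(reducer(k, v) for k, v in sorted(groups.items()))


# Hypothetical input: (line number, line text) pairs.
lines = [(0, "deer bear river"), (1, "car car river"), (2, "deer car bear")]
word_counts = run_mapreduce(lines, map_fn, reduce_fn)
print(word_counts)
```

In a real cluster the map calls, the grouping and the reduce calls run on different nodes; the sketch only shows how the two user-written functions fit together.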
III. HADOOP ARCHITECTURE
Scheduling decisions are taken by the master node, called the Job Tracker, while the worker nodes, called Task Trackers, execute the tasks.
Fig. 2 Hadoop Architecture [11]
A Hadoop cluster includes a single master node and multiple slave nodes. Figure 2 shows the Hadoop architecture. The single master node consists of a Job Tracker, Task Tracker, name node, and data node.
A. Job Tracker
The primary function of the Job Tracker is managing the Task Trackers and tracking resource availability. The Job Tracker is a node that controls the job execution process. It assigns MapReduce tasks to specific nodes in the cluster. Clients submit jobs to the Job Tracker. When the work is completed, the Job Tracker updates its status. Client applications can ask the Job Tracker for information.
B. Task Tracker
The Task Tracker follows the orders of the Job Tracker and periodically updates the Job Tracker with its status. Task Trackers run tasks and send reports to the Job Tracker, which keeps a complete record of each job. Every Task Tracker is configured with a set of slots, which indicates the number of tasks that it can accept.
C. Name Node
The name node maintains two tables: one that maps blocks to their locations, and one that records which blocks are stored on which data node.
Whenever a data node suffers disk corruption of a particular block, the first table is updated, and whenever a data node is detected to be dead because of a network or node failure, both tables are updated. Updates to the tables are triggered only by such failures. They do not depend on any neighboring blocks or block locations to identify their destination. Each block is kept separate, with its own job nodes and allocated processes.
D. Data Node
The node that stores the data in the Hadoop system is known as a data node. All data nodes send a heartbeat message to the name node every three seconds to indicate that they are alive. If the name node does not receive a heartbeat from a particular data node for ten minutes, it considers that data node to be dead or out of service and initiates the process on some other data node.
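The heartbeat rule above can be sketched as simple bookkeeping on the name node side. The class HeartbeatMonitor and its methods are hypothetical names chosen for illustration and are not part of Hadoop; only the three-second interval and the ten-minute limit come from the text.

```python
HEARTBEAT_INTERVAL_S = 3   # data nodes report every three seconds (from the text)
DEAD_AFTER_S = 10 * 60     # ten minutes of silence marks a node as dead (from the text)


class HeartbeatMonitor:
    """Hypothetical record of the last heartbeat time seen for each data node."""

    def __init__(self):
        self.last_seen = {}

    def heartbeat(self, node_id, now):
        """Store the time (in seconds) of the latest heartbeat from node_id."""
        self.last_seen[node_id] = now

    def dead_nodes(self, now):
        """Return the nodes that have been silent for at least DEAD_AFTER_S seconds."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen >= DEAD_AFTER_S
        )


monitor = HeartbeatMonitor()
monitor.heartbeat("dn1", now=0)
monitor.heartbeat("dn2", now=0)
# dn1 keeps reporting every three seconds; dn2 falls silent after its first report.
for t in range(HEARTBEAT_INTERVAL_S, DEAD_AFTER_S + 1, HEARTBEAT_INTERVAL_S):
    monitor.heartbeat("dn1", now=t)
dead = monitor.dead_nodes(now=DEAD_AFTER_S)
print(dead)
```

Time is passed in explicitly so that the sketch stays deterministic; a real monitor would read a clock and would then trigger the reassignment of work to other data nodes.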
The data nodes also update the name node with block information periodically.
IV. JOB SCHEDULING IN BIG DATA
The default scheduling algorithm is based on FIFO, where jobs are executed in the order of their submission. Later, the ability to set the priority of a job was added. Facebook and Yahoo contributed significant work in developing schedulers, namely the Fair Scheduler [8] and the Capacity Scheduler [9] respectively, which were subsequently released to the Hadoop community. This section describes various job scheduling algorithms in big data.
A. Default FIFO Scheduling
The default Hadoop scheduler operates using a FIFO queue. After a job is divided into independent tasks, they are loaded into the queue and assigned to free slots as they become available on Task Tracker nodes. Although there is support for assigning priorities to jobs, this is not turned on by default. Typically each job would use the whole cluster, so jobs had to wait for their turn. Even though a shared cluster offers great potential for offering large resources to many users, the problem of sharing resources fairly between users requires a better scheduler. Production jobs need to complete in a reasonable time.
B. Fair Scheduling
The Fair Scheduler [8] was developed at Facebook to manage access to their Hadoop cluster and was subsequently released to the Hadoop community. The Fair Scheduler aims to give every user a fair share of the cluster capacity over time. Users may assign jobs to pools, with each pool allocated a guaranteed minimum number of Map and Reduce slots. Free slots in idle pools may be allocated to other pools, while excess capacity within a pool is shared among jobs. The Fair Scheduler supports preemption, so if a pool has not received its fair share for a certain period of time, the scheduler will kill tasks in pools running over capacity in order to give the slots to the pool running under capacity. In addition, administrators may enforce priority settings on certain pools. Tasks are therefore scheduled in an interleaved fashion, based on their priority within their pool, and on the cluster capacity and usage of their pool.
As jobs have their tasks assigned to Task Tracker slots for computation, the scheduler tracks the deficit between the amount of compute time actually used and the ideal fair allocation for that job. Over time, this has the effect of ensuring that jobs receive roughly equal amounts of resources. Shorter jobs are allocated sufficient resources to finish quickly. At the same time, longer jobs are guaranteed not to be starved of resources.
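The deficit tracking described above can be sketched for a single slot. The sketch assumes that a free slot always goes to the job with the largest deficit and that every task takes one time unit; pools, minimum shares and preemption are left out. The Job class and run_single_slot are hypothetical names, not the Fair Scheduler's actual interface.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    tasks_left: int
    deficit: float = 0.0  # ideal fair share minus slot time actually received


def run_single_slot(jobs, ticks):
    """Simulate one slot; at each tick the job with the largest deficit runs a task."""
    order = []
    for _ in range(ticks):
        active = [job for job in jobs if job.tasks_left > 0]
        if not active:
            break
        fair_share = 1.0 / len(active)  # ideal share of this tick for each active job
        for job in active:
            job.deficit += fair_share
        chosen = max(active, key=lambda job: job.deficit)  # ties go to the earlier job
        chosen.deficit -= 1.0           # the chosen job used the whole tick
        chosen.tasks_left -= 1
        order.append(chosen.name)
    return order


jobs = [Job("long", tasks_left=6), Job("short", tasks_left=2)]
order = run_single_slot(jobs, ticks=6)
print(order)
```

Under these assumptions the short job finishes after four ticks instead of waiting for the long job to complete, as it would under FIFO, while the long job still receives every tick that the short job does not use.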
C. Capacity Scheduling
The Capacity Scheduler [10], initially developed at Yahoo, addresses a usage scenario where the number of users is large and there is a need to ensure a fair allocation of computation resources among users. The Capacity Scheduler allocates jobs, based on the submitting user, to queues with configurable numbers of Map and Reduce slots. Queues that contain jobs are given their configured capacity, while free capacity in a queue is shared among other queues. Within a queue, scheduling operates on a modified priority queue basis with specific user limits, with priorities adjusted based on the time a job was submitted and the priority setting allocated to that user and class of job. When a Task Tracker slot becomes free, the queue with the lowest load is chosen, from which the oldest remaining job is chosen. A task is then scheduled from that job.
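The selection rule for a free slot can be sketched as follows. The sketch assumes that the load of a queue is its number of running tasks divided by its configured capacity, and it ignores priorities and user limits. The Job and Queue classes, the pick_task function and the sample queues are hypothetical and are not the Capacity Scheduler's actual interface.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    submitted_at: int   # submission time; a smaller value means an older job
    pending_tasks: int


@dataclass
class Queue:
    name: str
    capacity: int       # configured number of slots
    running: int = 0    # slots currently in use
    jobs: list = field(default_factory=list)

    def load(self):
        """Assumed load measure: fraction of the configured capacity in use."""
        return self.running / self.capacity


def pick_task(queues):
    """Pick the least-loaded queue with pending work, then its oldest pending job."""
    candidates = [q for q in queues if any(j.pending_tasks > 0 for j in q.jobs)]
    if not candidates:
        return None
    queue = min(candidates, key=lambda q: q.load())
    job = min(
        (j for j in queue.jobs if j.pending_tasks > 0),
        key=lambda j: j.submitted_at,
    )
    job.pending_tasks -= 1
    queue.running += 1
    return queue.name, job.name


queues = [
    Queue("research", capacity=4, running=3, jobs=[Job("r1", 5, 2)]),
    Queue("production", capacity=6, running=2,
          jobs=[Job("p2", 9, 1), Job("p1", 2, 3)]),
]
choice = pick_task(queues)
print(choice)
```

Because the choice is made per queue before it is made per job, a user who submits many jobs to one queue does not gain slots at the expense of the other queues.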
Overall, this has the effect of enforcing cluster capacity sharing among users, rather than among jobs, as was the case in the Fair Scheduler.
D. Dynamic Proportional Scheduling
As claimed by Sandholm and Lai [12], Dynamic Proportional scheduling provides more job sharing and prioritization, which results in an increased share of cluster resources and more differentiation in the service levels of different jobs. This algorithm improves response time for multi-user Hadoop environments.
E. Resource-Aware Adaptive Scheduling (RAS)
To increase resource utilization among machines while monitoring the completion time of jobs, RAS was proposed by Polo et al. [13] for MapReduce with multi-job workloads. Zhao et al. [14] provide a task scheduling algorithm based on resource attribute selection (RAS) that determines the resources of an execution node by sending a group of test tasks to it before a task is scheduled. It then chooses the optimal node to execute the task according to the resource needs and the fit between the resource node and the task, using historical task information if available.
F. MapReduce task scheduling with deadline constraints (MTSD) algorithm
According to Tang et al. [15], the scheduling algorithm sets two deadlines: map-deadline and reduce-deadline. Reduce-deadline is simply the user's job deadline. Pop et al. [16] present a classical approach for periodic task scheduling by considering a scheduling system with different queues for periodic and aperiodic tasks, with the deadline as the main constraint, and develop a method to estimate the number of resources required to schedule a set of aperiodic tasks, considering both execution and data transfer costs.
Based on a mathematical model, and by using different simulation scenarios, MTSD proved the following statements: (1) various sources of independent aperiodic tasks can be approximated by a single one; (2) when the number of estimated resources exceeds a data center's capacity, migrating tasks between different regional centers is the appropriate solution with respect to the global deadline; and (3) in a heterogeneous data center, a higher number of resources is needed for the same request with respect to the deadline constraints. For MapReduce, Wang and Li [17] detailed task scheduling for distributed data centers on heterogeneous networks through adaptive heartbeats, job deadlines, and data locality. Job deadlines are divided according to the maximum data volume of tasks. With these constraints, task scheduling is formulated as an assignment problem in each heartbeat, in which adaptive heartbeats are calculated from the processing times of tasks, jobs are sequenced in terms of the divided deadlines, and tasks are scheduled by the Hungarian algorithm. On the basis of data transfer and processing times, the most appropriate data center for all mapped jobs is determined in the reduce phase.
G. Delay Scheduling
The objective is to address the conflict between locality and fairness. When a node requests a task, if the head-of-line job cannot launch a local task, the scheduler skips that job and looks at later jobs. If a job has been skipped for long enough, it is allowed to launch non-local tasks to avoid starvation. Delay scheduling temporarily relaxes fairness to achieve higher locality by allowing jobs to wait for scheduling on a node with local data. Song et al. [18] offer a game-theory-based technique to solve scheduling problems by dividing the Hadoop scheduling problem into two levels: job level and task level. For job-level scheduling, a bid model is used to guarantee fairness and reduce the average waiting time. For task-level scheduling, the scheduling problem is transformed into an assignment problem and the Hungarian method is used to optimize it. Wan et al. [19] provide a multi-job scheduling algorithm in MapReduce based on game theory that deals with the competition for resources between many jobs.
H. Multi Objective Scheduling
Nita et al. [20] describe a scheduling algorithm named MOMTH that considers objective functions related to resources and users at the same time, with constraints such as deadline and budget. The execution model assumes that all MapReduce jobs are independent, that there are no node failures before or during the scheduling computation, and that the scheduling decision is taken based only on present knowledge. Bian et al. present a scheduling strategy based on fault tolerance.
According to this scheduling strategy, the cluster detects the speed of the current nodes and creates backups of the intermediate MapReduce data on a high-performance cache server, because the data created by such a node could go wrong shortly. Hence the cluster can quickly resume execution from the previous level if several nodes fail: the reduce nodes read the Map output from the cache server, or from both the cache and the node, and performance stays high.
I. Hybrid Multistage Heuristic Scheduling (HMHS)
Chen et al. [21] elaborate a heuristic scheduling algorithm named HMHS that attempts to solve the scheduling problem by splitting it into two sub-problems: sequencing and dispatching. For sequencing, they use a heuristic based on Pri (the modified Johnson's algorithm). For dispatching, they propose two heuristics, Min-Min and Dynamic Min-Min.
V. TABLE I: COMPARISON OF VARIOUS JOB SCHEDULING ALGORITHMS IN BIG DATA

Scheduling Algorithm: Default FIFO Scheduling [22]
Technology: Schedules jobs based on their priorities in first-in first-out order.
Advantages: 1. Cost of the entire cluster scheduling process is low. 2. Simple to implement and efficient.
Disadvantages: 1. Designed only for a single type of job. 2. Low performance when running multiple types of jobs. 3. Poor response times for short jobs compared to large jobs.

Scheduling Algorithm: Fair Scheduling [8]
Technology: Distributes compute resources equally among the users/jobs in the system.
Advantages: 1. Less complex. 2. Works well with both small and large clusters. 3. Can provide fast response times for small jobs mixed with larger jobs.
Disadvantages: 1. Does not consider the job weight of each node.

Scheduling Algorithm: Capacity Scheduling [10]
Technology: Maximizes resource utilization and throughput in a multi-tenant cluster environment.
Advantages: 1. Ensures guaranteed access, with the potential to reuse unused capacity and prioritize jobs within queues over a large cluster.
Disadvantages: 1. The most complex among the three schedulers.

Scheduling Algorithm: Dynamic Proportional Scheduling [12]
Technology: Planned for data-intensive workloads; tries to maintain data locality during job execution.
Advantages: 1. It is a fast and flexible scheduler. 2. It improves response time for multi-user Hadoop environments.
Disadvantages: If the system eventually crashes, all unfinished low-priority processes are lost.

Scheduling Algorithm: Resource-Aware Adaptive Scheduling (RAS) [13]
Technology: Dynamic free slot advertisement; free slot priorities/filtering.
Advantages: It improves job performance.
Disadvantages: Only takes action on appropriate slow tasks.

Scheduling Algorithm: MapReduce task scheduling with deadline constraints (MTSD) [15]
Technology: Achieves nearly full overlap via the novel idea of including reduce in the overlap.
Advantages: 1. It reduces computation time. 2. It improves performance for the important class of shuffle-heavy MapReduce jobs.
Disadvantages: Works better with small clusters only.

Scheduling Algorithm: Delay Scheduling [18]
Technology: Addresses the conflict between locality and fairness.
Advantages: 1. Simplicity of scheduling.
Disadvantages: No particular disadvantage.

Scheduling Algorithm: Multi Objective Scheduling [20]
Technology: The execution model assumes that all MapReduce jobs are independent, that there are no node failures before or during the scheduling computation, and that the scheduling decision is taken based only on present knowledge.
Advantages: It keeps performance high.
Disadvantages: Execution time is too large.

Scheduling Algorithm: Hybrid Multistage Heuristic Scheduling (HMHS) [21]
Technology: Uses Johnson's algorithm together with the Min-Min and Dynamic Min-Min algorithms.
Advantages: Achieves not only a high data locality rate but also high cluster utilization.
Disadvantages: It does not ensure reliability.

VI. DISCUSSIONS
This paper provides a classification of Hadoop schedulers based on different parameters such as time, priority, and resources.
It discusses how various task scheduling algorithms help in achieving better results in a Hadoop cluster. Furthermore, this paper also discusses the advantages and disadvantages of the various task scheduling algorithms. The comparison shows that each scheduling algorithm has some advantages and disadvantages, so all the algorithms are important in job scheduling.
VII. CONCLUSIONS
This paper gives an overall idea of the different job scheduling algorithms in big data and compares most of the properties of the various task scheduling algorithms. Individual scheduling techniques that are used to improve data locality, efficiency, makespan, fairness, and performance are elaborated and discussed. However, scheduling remains an open area for researchers.