
The first tool that can be used in a given scenario is Apache Sqoop. It is used to import and export data between HDFS and an RDBMS. Sqoop works by running a query on the relational database and exporting the resulting rows into files in one of these formats: text, binary, Avro, or SequenceFile. The files can then be saved on Hadoop's HDFS. Sqoop also works in the opposite direction, letting you import formatted files back into a relational database. An alternative for this use case is Spark.
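A minimal sketch of both directions follows, assuming a running Hadoop cluster and database; the JDBC URL, credentials, table names, and HDFS paths are all placeholders:

```shell
# Import a table from an RDBMS into HDFS (comma-delimited text by default;
# --as-avrodatafile would produce Avro instead). All names are hypothetical.
sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username etl \
  --password-file /user/etl/.db_password \
  --table orders \
  --target-dir /user/etl/orders \
  --num-mappers 4

# Export delimited files from HDFS back into an RDBMS table.
sqoop export \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username etl \
  --password-file /user/etl/.db_password \
  --table orders_restored \
  --export-dir /user/etl/orders
```

Sqoop translates each command into a map-only MapReduce job; `--num-mappers` controls how many parallel slices of the table are read.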

You can create a DataFrame from an RDBMS table and then write it to HDFS. However, Spark does not handle complex data types as well, so I prefer Sqoop for this use case: Sqoop lets you change data types, the number of mappers, and so on.

The second tool for the scenario is Apache Hive. Hive is a data warehouse infrastructure tool for processing structured data in Hadoop.

It resides on top of Hadoop and makes querying and analysis easy. Hive provides a mechanism to project structure onto this data and query it using a SQL-like language called HQL. At the same time, this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express the logic in HiveQL. Cloudera Impala can be an alternative to Apache Hive, but it uses its own execution daemons, which must be installed on every DataNode in the Hadoop cluster.

Following are the steps you would take to ingest the tables into Hadoop:

- Create a dedicated Hadoop user account before you set up the Hadoop architecture. This account separates the Hadoop installation from other services running on the same machine.

- Then set up the Hadoop cluster and make sure all components of Hadoop are working. Components include the NameNode, DataNodes, the Secondary NameNode, etc. In a distributed setup there should be multiple DataNodes.
- Install Sqoop and make sure it is working correctly.

- The sqoop import command is used to create a connection with the database and import all data from a table.
- The imported data is saved in a directory on HDFS named after the table being imported. As with most aspects of Sqoop operation, the user can specify an alternative directory where the files should be populated.
- By default these files contain comma-delimited fields, with newlines separating different records.
- In most cases, importing data into Hive is the same as running the import task and then using Hive to create and load a certain table or partition.
- When you run a Hive import, Sqoop converts the data from the native data types within the external datastore into the corresponding types within Hive. Sqoop automatically chooses the native delimiter set used by Hive.

Comparison of Hortonworks and Cloudera:

Hortonworks was founded in 2011 and quickly emerged as one of the leading vendors of Hadoop distributions. The Hadoop distribution made available by Hortonworks is also an open-source platform based on Apache Hadoop for the analysis, storage, and management of Big Data.
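The Hive-import step described above might be sketched as follows; the connection details and the table schema are hypothetical, and a working Sqoop and Hive installation is assumed:

```shell
# Import a table straight into Hive: Sqoop creates the Hive table,
# converts column types, and picks Hive's native delimiters automatically.
# All connection details and names here are placeholders.
sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username etl \
  --password-file /user/etl/.db_password \
  --table orders \
  --hive-import \
  --hive-table orders

# The imported table can then be queried with HQL, for example:
hive -e "SELECT customer_id, SUM(total) AS spend
         FROM orders
         GROUP BY customer_id;"
```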

Hortonworks is the only vendor to provide a 100% open-source distribution of Apache Hadoop with no proprietary software tagged with it. The Hortonworks distribution, HDP 2.0, can be downloaded from the organization's website for free, and its installation process is also very easy.

Both Cloudera and Hortonworks are built upon the same core of Apache Hadoop, so the two share more similarities than differences. Both are enterprise-ready Hadoop distributions that answer customer requirements with regard to Big Data, and each has passed consumers' tests in the areas of security, stability, and scalability. Both provide paid training and services to help users become familiar with their platforms. Cloudera has announced its long-term ambition to be an enterprise data hub, thus eliminating the need for a data warehouse. Hortonworks, for its part, aims to provide a firm Hadoop distribution by partnering with the data warehousing company Teradata for this purpose.

I prefer Hortonworks for this scenario because it is 100% free, carries a free open-source license, and includes both tools that I suggested for this task, whereas CDH is not completely free.

