
The first tool that can be used in the given scenario is Apache Sqoop. It is used to import and export data between HDFS and an RDBMS.


Sqoop
works by running a query on the relational database and exporting the
resulting rows into files in one of these formats: text, binary,
Avro, or Sequence Files.

The files can then be saved on Hadoop’s HDFS. Sqoop also works in the opposite direction, letting you export delimited files from HDFS back into a relational database.
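
As a rough sketch of both directions (the JDBC URL, credentials, table names, and HDFS paths below are placeholder assumptions, not values from the scenario):

  sqoop import \
    --connect jdbc:mysql://dbhost/salesdb \
    --username dbuser -P \
    --table customers \
    --target-dir /user/hadoop/customers

  sqoop export \
    --connect jdbc:mysql://dbhost/salesdb \
    --username dbuser -P \
    --table customers_copy \
    --export-dir /user/hadoop/customers

The import writes delimited files under the target directory; the export reads such files back into the named table.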
Apache Spark is another option for this use case: you can create a DataFrame from an RDBMS table and then write it to HDFS. However, Spark does not work well for complex data types. I prefer Sqoop for this use case, as one can change data types, the number of mappers, etc. in Sqoop.
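
A small, hedged example of those Sqoop controls (the column-to-type mapping and the mapper count here are illustrative assumptions):

  sqoop import \
    --connect jdbc:mysql://dbhost/salesdb \
    --username dbuser -P \
    --table orders \
    --num-mappers 8 \
    --map-column-java order_total=Double \
    --target-dir /user/hadoop/orders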

Now the second tool for the scenario is Apache Hive. Hive is a data warehouse infrastructure tool for processing structured data in Hadoop. It resides on top of Hadoop and makes querying and analyzing easy.
Hive provides a mechanism to project structure onto the data stored in HDFS and to query it using a SQL-like language called HiveQL (HQL). At the same time, this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express the logic in HiveQL.
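
A short HiveQL sketch of projecting structure onto files already sitting in HDFS (the table, columns, and location are illustrative assumptions):

  CREATE EXTERNAL TABLE customers (
    id INT,
    name STRING,
    city STRING
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION '/user/hadoop/customers';

  SELECT city, COUNT(*) AS num_customers
  FROM customers
  GROUP BY city;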

Cloudera Impala can be an alternative to Apache Hive, but it uses its own execution daemons, which need to be installed on every DataNode in the Hadoop cluster.

Following are the steps that you would expect to take to ingest the tables into Hadoop:

-Create a dedicated Hadoop user account before you set up the Hadoop architecture. This account separates the Hadoop installation from other services that are running on the same machine. Then set up the Hadoop cluster and make sure all components of Hadoop are working. Components include the NameNode, DataNodes, the Secondary NameNode, etc. In the case of a distributed setup there should be multiple DataNodes.

-Install Sqoop and make sure it is working correctly.

-The sqoop import command is used to create a connection with the database and import all data from a table (a command sketch follows this list).

-The imported data is saved in a directory on HDFS based on the table being imported. As is the case with most aspects of Sqoop operation, the user can specify an alternative directory where the files should be placed.

-By default these files contain comma-delimited fields, with new lines separating different records.

-In
most cases, importing data into Hive is the same as running the
import task and then using Hive to create and load a certain table or
partition.

-When
you run a Hive import, Sqoop converts the data from the native
datatypes within the external datastore into the corresponding types
within Hive. Sqoop automatically chooses the native delimiter set
used by Hive.
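
A hedged command sketch of the import step described in the list above; the host, database, table name, and HDFS paths are placeholder assumptions, not values from the scenario:

  # confirm the cluster daemons (NameNode, DataNodes, etc.) are running
  jps

  # import the table straight into Hive; Sqoop converts the column types
  # and uses Hive's default delimiters
  sqoop import \
    --connect jdbc:mysql://dbhost/salesdb \
    --username dbuser -P \
    --table customers \
    --hive-import \
    --hive-table customers

  # the plain HDFS import shown earlier also accepts an alternative directory
  # and delimiter, e.g. --target-dir /user/hadoop/customers --fields-terminated-by ','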

Comparison of Hortonworks and Cloudera:
Hortonworks was founded in 2011 and has since quickly emerged as one of the leading vendors providing Hadoop distributions. The Hadoop distribution made available by Hortonworks is also an open source platform based on Apache Hadoop for analyzing, storing and managing Big Data. Hortonworks is the only vendor to provide a 100% open source distribution of Apache Hadoop with no proprietary software bundled with it. The Hortonworks distribution, HDP 2.0, can be accessed and downloaded from the organization's website for free, and its installation process is also very easy.

Cloudera and Hortonworks are both built upon the same core of Apache Hadoop, so the two share more similarities than differences. Both Cloudera and Hortonworks are enterprise-ready Hadoop distributions that answer customer requirements with regard to Big Data.

Each of these has passed the tests of consumers in the areas of security, stability and scalability. Both provide paid training and services to help users become familiar with their platforms.
Cloudera has announced its long-term ambition to be an enterprise data hub, thus eliminating the need for a data warehouse. Hortonworks, for this same purpose, looks forward to firmly providing its Hadoop distribution in partnership with the data warehousing company Teradata.
I prefer Hortonworks for this scenario because it is 100% free, has an open source license, and includes both tools that I suggested for this task, whereas CDH is not completely free.
