Big data is a huge amount of data that can't be processed by a traditional approach (a single computer system) within a given time frame. How big does this data need to be? There is a common misconception around the term: big data is often described as data measured in gigabytes, terabytes, petabytes, exabytes, or more. This definition is wrong. There is no threshold above which data counts as big data; it depends purely on the context in which the data is being used.
Even a small amount of data can be referred to as big data. For example, you can't attach a 100 MB file to an email, so for the email system this 100 MB is big data.
Some more examples are listed below:
· There are 100 TB of videos to be resized and edited within a given time frame. Using a single storage system, we won't be able to accomplish this task within that time frame.
· Popular social networking sites have a lot of data coming in every minute: Facebook receives about 100 TB of data per day, around 400 million tweets are posted on Twitter every day, and almost 48 hours of video are uploaded to YouTube every minute. This data is very important and needs to be processed within a given time.

Classification
Data is classified into three main categories:
1. Structured data: databases, XLS files, etc.
2. Semi-structured data: email, log files, doc files, etc.
3. Unstructured data: images, videos, music files, etc.
We need new techniques, new tools, and a new architecture to manage this data, that is, to store and process it within a time frame.
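To see why a single system falls short in the 100 TB video example above, a rough estimate helps. The sketch below only counts the time needed to read the data once, and it assumes a sustained read throughput of 100 MB/s per machine; that figure and the function name `read_time_days` are illustrative assumptions, not numbers from any particular system.

```python
# Back-of-the-envelope estimate: how long does it take just to read 100 TB once?
# ASSUMPTION: 100 MB/s sustained read throughput per machine (illustrative only).

TB = 10**12  # bytes in a terabyte (decimal)
MB = 10**6   # bytes in a megabyte (decimal)

def read_time_days(total_bytes, throughput_bytes_per_s, machines=1):
    """Days needed to read total_bytes when the work is split evenly across machines."""
    seconds = total_bytes / (throughput_bytes_per_s * machines)
    return seconds / 86_400  # seconds in a day

single = read_time_days(100 * TB, 100 * MB)
cluster = read_time_days(100 * TB, 100 * MB, machines=100)

print(f"1 machine:    {single:.1f} days")        # 1 machine:    11.6 days
print(f"100 machines: {cluster * 24:.1f} hours")  # 100 machines: 2.8 hours
```

Under these assumptions, one machine needs more than a week before any resizing or editing has even started, while spreading the same data over 100 machines brings the read time down to a few hours. This is the motivation for distributed storage and processing.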
V's of big data
There were originally 3 V's of big data; 4 more have recently been added, making 7 in total.
· Volume: The huge amount of data being created, by sources ranging from social networking sites to banks (accounts, credit and debit cards).
· Variety: The different types of data in use, as discussed above (structured, semi-structured and unstructured).
· Velocity: While data is being processed, more and more keeps coming in, and it has to be processed efficiently and within the time frame. For example, new videos are uploaded to YouTube every minute.
· Veracity: The authenticity of the data. For example, tweets contain hashtags and abbreviations, and the accuracy of all this content has to be checked.
· Visibility: The type of data that is visible.
· Validity: Whether the data is still valid. For example, the kinds of files in use in 1998 were different from those in use now.
· Variability

Enablers for big data
· Increased storage
· Increased processing speed
· Increased data
· Increased network speed
· Increased capital

Tools for big data
a) Hosting: distributed servers/cloud, e.g. Amazon EC2
b) File system: scalable, distributed, e.g. HDFS
c) Programming model/paradigm: Hadoop
d) Database: NoSQL, e.g. HBase, MongoDB, Cassandra
e) Operations: querying, indexing, analytics, e.g. data mining, information retrieval

Hadoop (brief introduction)
An algorithm was developed that allows a large data calculation to be chopped up into smaller chunks, which are then mapped to many computers; after the calculation, the partial results are brought back together to produce the resulting data set. This algorithm is called MapReduce, and Hadoop was built around it.
In this way, data is processed in parallel rather than serially.
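The chop, map and bring-back-together steps described above can be sketched in a few lines of plain Python. This is a minimal single-machine illustration of the MapReduce idea using a word count, not Hadoop's actual API; the function names (`split_into_chunks`, `map_phase`, `shuffle`, `reduce_phase`) and the sample input are made up for the example.

```python
from collections import defaultdict

def split_into_chunks(lines, n_chunks):
    """Chop the input into pieces, one per (hypothetical) worker machine."""
    return [lines[i::n_chunks] for i in range(n_chunks)]

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in this chunk."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]

def shuffle(mapped_chunks):
    """Group the emitted values by key so each word's counts end up together."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for word, count in pairs:
            groups[word].append(count)
    return groups

def reduce_phase(groups):
    """Reduce: combine the values collected for each key into one result."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data is big", "data keeps coming", "big data needs new tools"]
chunks = split_into_chunks(lines, n_chunks=3)

# Each map_phase call depends only on its own chunk, so on a real cluster
# every chunk could be handled by a different machine at the same time.
mapped = [map_phase(chunk) for chunk in chunks]

word_counts = reduce_phase(shuffle(mapped))
print(word_counts["big"], word_counts["data"])  # 3 3
```

Here the map calls run one after another in a simple loop. The point of the design is that nothing in `map_phase` looks outside its own chunk, which is what lets a framework such as Hadoop run the chunks in parallel on many computers and merge the results afterwards.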