The popularity of the term “Data Science” has exploded in technical and business environments and in academia, as indicated by a jump in job openings. However, many critical academics and journalists see no difference between data science and applied statistics.
Dealing with unstructured and structured data, Data Science is a field that encompasses anything related to data cleansing, preparation, and analysis. Data is everywhere and is growing rapidly: more than 2.7 zettabytes of data exist in today’s digital universe, and that is projected to grow to 180 zettabytes in 2025. That’s why more organizations are seeking professionals who can make sense of all the data.
Data science is the future of development and already part of sustainable development today. For the future of data science, Donoho projects an ever-growing environment for open science where data sets used for academic publications are accessible to all researchers. The US National Institutes of Health has already announced plans to enhance reproducibility and transparency of research data.
Data science is a discipline that incorporates varying degrees of Data Engineering, Scientific Method, Statistics, Advanced Computing, Visualization, Hacker mindset, and Domain Expertise. A professional practitioner of Data Science is called a Data Scientist. Data Scientists solve complex data analysis problems. The job title has similarly become very popular. On one heavily used employment site, the number of job postings for “data scientist” increased by more than 10,000 percent between January 2010 and July 2012.
Data science helps companies make stronger and smarter business decisions.
· Netflix mines movie viewing patterns to understand what drives user interest, and uses that to decide which Netflix original series to produce.
· Target identifies the major customer segments within its base and the unique shopping behaviors within those segments, which helps to guide messaging to different market audiences.
· Procter & Gamble utilizes time series models to understand future demand more clearly, which helps plan production levels more optimally.
· Amazon’s recommendation engines suggest items for the user to buy, as determined by their algorithms. Netflix recommends movies to the user. Spotify recommends music to the user.
· Gmail’s spam filter is a data product: an algorithm behind the scenes processes incoming mail, determines whether a message is junk, and handles it accordingly.
· Computer vision used for self-driving cars is also a data product: machine learning algorithms are able to recognize traffic lights, other cars on the road, pedestrians, etc.
The requisites for a professional industrial data scientist are:

A. Mathematics Expertise

At the heart of mining data insight and building data products is the ability to view the data through a quantitative and logical lens. There are textures, dimensions, and correlations in data that can be expressed mathematically. Finding solutions utilizing data becomes a brain teaser of heuristics and quantitative technique. Solutions to many business problems involve building analytic models grounded in the hard math, where being able to understand the underlying mechanics of those models is key to success in building them.

B. Strong Business Acumen

It is important for a data scientist to be a shrewd and tactical business consultant.
Working so closely with data, data scientists are positioned to learn from data in ways no one else can. That creates the responsibility to translate observations into shared knowledge and to contribute to strategy on how to solve core business problems. This means a core competency of data science is using data to tell a story clearly. No data-puking; rather, present a cohesive narrative of problem and solution, using data insights as supporting pillars, that leads to guidance.

C. Technology and Hacking

First, let’s clarify that we are not talking about hacking as in breaking into computers to steal information. We’re referring to the technical coder subculture meaning of hacking, i.e., creativity and ingenuity in using technical skills to build things and find clever solutions to problems, as expressed in Fig. 1.

Fig. 1

I. Pandas

Pandas is a BSD-licensed, open source library providing high-performance, easy-to-use data structures, algorithms and data analysis tools for the Python programming language. Pandas is a NumFOCUS sponsored project. This will help ensure the success of development of the pandas library as a world-class open-source project, and makes it possible to donate to the project. Python has long been great for data manipulation and preparation, but less so for data analysis and modeling.
Pandas helps fill this gap, enabling you to carry out your entire data analysis workflow in Python without having to switch to a more domain-specific language like R. Combined with the excellent IPython toolkit and other libraries, the environment for doing data analysis in Python excels in performance, productivity, and the ability to collaborate. Pandas does not implement significant modeling functionality outside of linear and panel regression; for this, look to statsmodels and scikit-learn. More work is still needed to make Python a first-class statistical modeling environment, but it is well on its way toward that goal.
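To make the idea of a data analysis workflow in Python more concrete, here is a minimal sketch of a few typical pandas steps: filling missing values, joining two tables, aggregating, and label-based alignment. The store names, regions, and revenue figures are invented purely for illustration and do not come from any real data set.

```python
import numpy as np
import pandas as pd

# Hypothetical sales records; one revenue value is missing (NaN).
sales = pd.DataFrame(
    {
        "store": ["A", "A", "B", "B", "C"],
        "month": ["Jan", "Feb", "Jan", "Feb", "Jan"],
        "revenue": [100.0, np.nan, 80.0, 95.0, 60.0],
    }
)

# Hypothetical lookup table mapping each store to a region.
regions = pd.DataFrame(
    {"store": ["A", "B", "C"], "region": ["North", "South", "North"]}
)

# Missing data: fill each gap with the mean revenue of the same store.
store_mean = sales.groupby("store")["revenue"].transform("mean")
sales["revenue"] = sales["revenue"].fillna(store_mean)

# Merging: attach the region to every sales row.
merged = sales.merge(regions, on="store", how="left")

# Aggregation: total revenue per region.
revenue_by_region = merged.groupby("region")["revenue"].sum()

# Label-based alignment: the two Series are matched on their index labels,
# not on position; store C has no February value, so its result is NaN.
jan = pd.Series({"A": 100.0, "B": 80.0, "C": 60.0})
feb = pd.Series({"B": 95.0, "A": 100.0})
change = feb - jan

print(revenue_by_region)
print(change)
```

Filling with the per-store mean is only one of several possible choices; dropping incomplete rows with `dropna()` would be an equally reasonable option depending on the analysis.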
a. Installation

The recommended way to install pandas on a system is with conda:

conda install pandas

It can also be installed from PyPI, where it has been uploaded:

pip install pandas

b. Specifications and library highlights

· Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format.
· Intelligent data alignment and integrated handling of missing data: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form.
· Intelligent label-based slicing, fancy indexing, and subsetting of large data sets.
· High performance merging and joining of data sets.

Python with pandas is in use in a wide variety of academic and commercial domains, including Finance, Neuroscience, Statistics, Economics, Advertising, Web Analytics, and more.

II. seaborn

Seaborn is a Python visualization library based on matplotlib.
It provides a h