Data science has become a hub of opportunities nowadays. It touches every domain of industry, whether IT, electronics, mechanical, medical, or research, and anyone from any background can move into data science today. Data science is a combination of programming and mathematics. You don't need to be an expert in mathematics, but you should know the basics. These are the prerequisites you should cover before going into the details of data science:
- Statistics and Probability
- Calculus (Partial Derivatives)
- Programming (Any One Language)
Mathematics is one of the most beautiful subjects in the world. If you are eager to learn these topics of mathematics, along with a little programming, then you can go for data science. You may have been learning mathematics since your school days, but have you ever thought about the real-world use cases of the topics you learned in math class? In data science you will see the real meaning of that math: how it helps make your software and applications smarter and smarter.
In this blog we are going to look at the life cycle of data science. The data scientist role is considered one of the most highly paid jobs in the industry today, but there are many domains inside the data science life cycle. You could specialize in any one of them, or you could become a full-fledged data scientist. So let's look at the life cycle of data science first:
Here, I am using very basic terms to explain the life cycle of data science. You might find a few different versions of the life cycle on the internet, but the meaning is almost the same everywhere. I have divided data science into five major parts. Let's talk about each part of the life cycle in detail:
This is the first phase of the data science life cycle. Before doing anything else, the first thing you need is data. So how, and from where, will you get a dataset?
Data collection means gathering the data, and as a data science engineer it is your responsibility to gather data from different sources. You should be aware of the different techniques used to gather it.
Data could be available on a website, in a database, in a file, or through an API.
Techniques of data collection:
These are a few of the techniques used to gather data. You cannot rely on a single technique, because the data could be on a website, stored in a database, or provided by a web service.
To learn more about web crawling using Python, refer to this blog.
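As a small illustration of collecting data through an API, here is a minimal sketch that parses a JSON response into Python records. The payload and its field names (`records`, `id`, `price`) are invented for this example; in practice the string would come from an HTTP request to a real service rather than being hard-coded.

```python
import json

# A hypothetical JSON payload, shaped like what a dataset API might return.
api_response = '{"records": [{"id": 1, "price": 9.5}, {"id": 2, "price": 12.0}]}'

# Parse the response and collect the records into a list of dicts.
data = json.loads(api_response)["records"]

print(f"Collected {len(data)} records")
```

The same list-of-dicts shape is easy to hand off to the analysis tools discussed next.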
Here are a few websites you can use to search for datasets:
This is the second phase of the data science life cycle, and it is the most important one, because it covers about 70% of a data scientist's job; everything else makes up the remaining 30%.
What is data analysis?
Data analysis is a process of:
- Cleaning and modeling data
Why do we need data analysis?
- The data you have collected might contain a lot of unwanted information.
- There may be null or missing values in places.
- You might want to perform statistical analysis on your data.
- You might want to visualize the data and plot graphs to get insights from it.
How do we perform data analysis?
There are different tools available for data analysis. Here is a list of the most popular tools and programming languages used to perform it:
- Power BI
- XPlenty, and a lot more…
Data analysis helps a business grow by revealing the company's growth rate and by analyzing its daily business reports. A data analyst is also well versed in statistics, because it takes a good knowledge of statistics to perform data analysis and get deeper insight into the data.
First we need to clean the data; then statistical analysis can be performed. Data cleaning is the process of removing unwanted values and handling missing values. To learn more about what data cleaning is and how it is done, refer to this blog. After cleaning, we can perform statistical analysis.
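As a tiny sketch of the cleaning step, here is how an unwanted column and a missing value might be handled with pandas. The column names and values are made up for illustration:

```python
import pandas as pd

# Toy dataset with an unwanted column and a missing value.
df = pd.DataFrame({
    "age": [25, None, 35],
    "salary": [50000, 60000, 55000],
    "junk_id": ["a1", "a2", "a3"],   # unwanted information
})

# Remove the unwanted column.
df = df.drop(columns=["junk_id"])

# Fill the missing age with the mean of the known ages.
df["age"] = df["age"].fillna(df["age"].mean())
```

Filling with the mean is just one common strategy; dropping the row or filling with the median are equally valid, depending on the data.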
Using statistics we can find the mean, median, mode, variance, and standard deviation. Statistics is a broad field of study, and it is further divided into two categories :
Go through this blog to learn more about statistics.
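The summary statistics mentioned above can be computed directly with Python's built-in `statistics` module; the numbers below are a toy sample chosen so the results come out round:

```python
import statistics

# A small toy sample.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)          # average value
median = statistics.median(data)      # middle value of the sorted sample
mode = statistics.mode(data)          # most frequent value
variance = statistics.pvariance(data) # population variance
std_dev = statistics.pstdev(data)     # population standard deviation
```

For real datasets you would typically use pandas or NumPy, which compute the same quantities over whole columns at once.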
Another part of data analysis is visualizing the data using graphs such as bar plots, pie charts, box plots, line plots, scatter plots, and a few others. Graphs are the best way to present outcomes to an end user. You have probably seen graphs on TV while watching sports or the news; during elections, the results are shown through graphs. So data visualization is a very important part of data analysis. To learn more about data visualization using Python, you can refer to this video.
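As a minimal sketch, here is how a bar plot could be drawn with matplotlib; the month labels and sales figures are invented for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen (no display window needed)
import matplotlib.pyplot as plt

# Toy data: sales per month.
months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 150, 90, 180]

fig, ax = plt.subplots()
ax.bar(months, sales)
ax.set_xlabel("Month")
ax.set_ylabel("Sales")
ax.set_title("Monthly sales")
fig.savefig("sales.png")  # write the chart to an image file
```

Swapping `ax.bar` for `ax.plot`, `ax.scatter`, or `ax.pie` gives the other chart types mentioned above.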
Now you have performed analysis and visualization on the dataset you collected. Next you need to perform data preprocessing. In this step we put the dataset into a form on which we can apply machine learning. Data preprocessing applies a few transformations to your dataset before you implement machine learning: every machine learning algorithm has mathematics behind it, so the data must first be converted into a proper numerical format.
Data preprocessing includes:
- Label Encoding and OneHotEncoding
- Feature Scaling – Standardization and Normalization
- Train Test Split
To learn more about data preprocessing, refer to this blog.
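The three preprocessing steps listed above can be sketched with scikit-learn; the toy feature values and class labels here are invented purely for illustration:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split

# Toy dataset: one numeric feature and a text label.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = ["cat", "dog", "cat", "dog", "cat", "dog"]

# Label encoding: turn text classes into integers.
y_encoded = LabelEncoder().fit_transform(y)

# Feature scaling (standardization): zero mean, unit variance.
X_scaled = StandardScaler().fit_transform(X)

# Train/test split: hold out part of the data for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y_encoded, test_size=0.33, random_state=0
)
```

One-hot encoding (`OneHotEncoder`) and normalization (`MinMaxScaler`) follow the same `fit_transform` pattern.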
Predictive Modeling (Machine Learning)
Finally we are ready to apply machine learning to our dataset. This step is known as predictive modeling because here we train a model on our dataset and then use it to make predictions. The data was first split into training and testing parts; we apply machine learning to the training data and then test the model on the testing data.
Machine learning is a subset of AI in which we train the machine using past experience in the form of data. All the steps we performed above now come together in machine learning.
Machine learning is divided into four categories:
- Supervised Learning
- Unsupervised Learning
- Semi-Supervised Learning
- Reinforcement Learning
Next we perform optimization, to train the model with higher accuracy and lower error. The machine learning model we trained in the previous step might not give good accuracy at first, so we need to optimize it using techniques such as gradient descent. Gradient descent is a technique that uses partial derivatives to find the minimum error for our model.
- Train the model
- Find out the error
- Apply gradient descent to minimize the error
After applying machine learning to our dataset, we first measure the error, or loss: how far our predicted values are from the actual values in the data. Each machine learning model has a loss function (also called a cost function), and we differentiate that loss function in order to minimize the error.
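The train / find the error / apply gradient descent loop above can be sketched in a few lines of pure Python for a one-parameter model y = w * x, minimizing mean squared error. The toy data (where the true w is 2) and the learning rate are chosen just for illustration:

```python
# Toy data generated from y = 2 * x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

w = 0.0    # initial guess for the parameter
lr = 0.01  # learning rate (step size)

for _ in range(500):
    # Partial derivative of the MSE loss with respect to w:
    # d/dw (1/n) * sum((w*x - y)^2) = (2/n) * sum((w*x - y) * x)
    grad = (2 / len(xs)) * sum((w * x - y) * x for x, y in zip(xs, ys))
    w -= lr * grad  # step against the gradient to reduce the error

# Final mean squared error after training.
loss = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
```

Each iteration moves w a little in the direction that decreases the loss, which is exactly what happens (in many dimensions at once) when a real model is trained.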
Now we are ready to deploy the model we have trained. Model deployment means storing the model and loading it, either in the cloud or integrated with your application. Suppose you want to build a movie recommendation system like Netflix: an app that shows and recommends movies to you. Machine learning helps Netflix show better recommendations to its users. Here we integrate the trained machine learning model with our app. Simply applying machine learning and reporting accuracy on a dataset doesn't mean much on its own; you need to deploy your trained models inside applications to show machine learning actually working.
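As a minimal sketch of the store-and-load part of deployment, here is a model persisted with Python's built-in pickle module. Scikit-learn and the iris data are used just for illustration; real deployments often use joblib for saving and serve the loaded model behind an API:

```python
import pickle
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Train a model, then persist it so an application can load it later.
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

# Inside the application (or on a server), load the stored model
# and use it to serve predictions.
with open("model.pkl", "rb") as f:
    loaded = pickle.load(f)

prediction = loaded.predict(X[:1])
```

The loaded model behaves identically to the one that was saved, so the app never needs to retrain anything at serving time.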
Now finally let’s conclude the life cycle of data science that we have learned in this blog. So if you want to become a data scientist then this is the process or life cycle that you have to go through.
Collect the data, clean it, perform data analysis, visualize it, apply machine learning, optimize the model, and deploy it.