Statistics is one of the major branches of mathematics, and it is one of the biggest reasons behind the success of data science: the methods and techniques it provides are useful to data scientists in their day-to-day work. Statistical analysis plays a major role in the data science life cycle.
Definition: Statistics deals with the methods that help us gather, analyze, review, and draw conclusions from data. When a data set is a sample drawn from a larger population, the analyst can develop interpretations about the population based on statistical outcomes from the sample, such as the mean, median, mode and range.
It comes into play when a user wants to gain insight into data or find hidden patterns, and it is used in almost every field and department, for example:
- Weather reports
- In sports, to show player and team performance
- In TV channels, to perform analysis on TRP
- Stock market
- Product-based companies
- Diseases and their impact
From very small to very large, every company needs statistics to evaluate its growth and how its products perform in the market. These statistics are usually shown to us with the help of graphs, because graphs are one of the best ways to present data to users.
Before we talk about statistics, we first need to understand data and its different types.
Data and Its Types
Categorical Or Qualitative :
Categorical data is a type of data that represents categories. The data is divided into two or more categories, such as gender, language, caste or religion. Categories can also be encoded as numbers (Example : 1 for male and 0 for female), but such numbers do not carry any mathematical meaning.
Nominal data is a type of categorical data that has two or more categories without any specific order. Nominal values represent discrete units and are used to label variables that do not have any quantitative value.
- Gender – Male and Female
- Languages – Hindi, English, Chinese
- Exams – Pass, Fail
Ordinal data is a type of categorical data that has two or more categories with a specific order. Ordinal values represent discrete, ordered units, so ordinal data is similar to nominal data except that the order of the categories matters.
- Movie Ratings : Flop, Average, Hit, Superhit
- Scale : Strongly Disagree, Disagree, Neutral, Agree, Strongly Agree
- Grades – A, B, C, D
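The difference between unordered and ordered categories can be sketched in a few lines of Python. This is only an illustration: the variable names and the helper `is_better` are hypothetical, and the integer codes are arbitrary labels, not quantities.

```python
# Nominal: labels with no order. The integer codes are arbitrary identifiers,
# so arithmetic or comparisons on them have no meaning.
languages = ["Hindi", "English", "Chinese", "English"]
language_codes = {label: code for code, label in enumerate(sorted(set(languages)))}
encoded_languages = [language_codes[label] for label in languages]

# Ordinal: labels with a meaningful order, so ranking them makes sense.
rating_order = ["Flop", "Average", "Hit", "Superhit"]
rating_rank = {label: rank for rank, label in enumerate(rating_order)}


def is_better(a: str, b: str) -> bool:
    """Return True if rating `a` ranks above rating `b` (hypothetical helper)."""
    return rating_rank[a] > rating_rank[b]


best_rating = max(["Average", "Superhit", "Flop"], key=rating_rank.get)

print(encoded_languages)
print(is_better("Hit", "Average"))
print(best_rating)
```

Asking whether "Hit" is better than "Average" is a sensible question, while asking whether "Hindi" is greater than "English" is not, even though both are stored as integers here.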
Numerical Or Quantitative :
Numerical data represents quantities: values that have a mathematical meaning and are counted or measured.
Interval data is measured on a scale whose values are equally spaced, so the difference between two values is meaningful. It is similar to ordinal data in that the values are ordered, but here the values can come from any continuous range and the scale has no true zero point.
- Temperature – On the Celsius or Fahrenheit scale, the step from 10 to 20 degrees is the same size as the step from 20 to 30 or from 30 to 40
Ratio data is interval data with a natural zero point: when the value of a variable is 0.0, it means there is none of that quantity. Suppose you are given a temperature of 0 degrees Celsius. That is a valid temperature and does not mean there is no temperature, so temperature in Celsius is interval data. But if I say that a height is 0 ft, it means there is no height at all, so height is ratio data. Distance and speed are ratio data for the same reason.
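A small numeric sketch can make the role of the zero point concrete. It assumes temperatures in degrees Celsius (an interval scale) and converts them to Kelvin, a temperature scale that does have a true zero; the function name and the sample values are made up for illustration.

```python
def celsius_to_kelvin(celsius: float) -> float:
    """Convert degrees Celsius to Kelvin, a scale with a true zero."""
    return celsius + 273.15


cold_c, warm_c = 10.0, 20.0

# Differences are meaningful on an interval scale.
difference_c = warm_c - cold_c

# Ratios are not: 20 degrees Celsius is not "twice as hot" as 10 degrees.
naive_ratio = warm_c / cold_c
kelvin_ratio = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cold_c)

# Height has a natural zero, so ratios are meaningful.
height_a_ft, height_b_ft = 3.0, 6.0
height_ratio = height_b_ft / height_a_ft

print(difference_c, naive_ratio, round(kelvin_ratio, 3), height_ratio)
```

The Celsius ratio comes out as 2.0 only because of where the scale happens to put its zero; on the Kelvin scale the same two temperatures differ by only a few percent.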
Let’s Dig into Statistics
Statistics is divided into two major categories :
- Descriptive Statistics – Presenting, organizing and summarizing data
- Inferential Statistics – Drawing conclusions about a population based on data observed in a sample
Descriptive statistics summarizes the data and gives the values that best describe the data set. It also tells how much the data is spread out around its average (mean) value, and it gives the minimum and maximum of the data.
Descriptive Statistics is broken down into :
- Measure of Central Tendency (Mean, Median, Mode)
- Measure of Variability / Spread (Standard Deviation, Variance, Range), together with measures of shape (Kurtosis, Skewness)
Measure of Central Tendency
Here we describe the whole dataset with a single value that represents the center of its distribution. There are three main measures of central tendency : mean, median and mode.
Most of you are probably already familiar with the simple arithmetic mean, but there are a few more types of mean that you should know about. The different types of mean are :
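As a quick illustration, the three measures can be computed with Python's built-in `statistics` module. The list of scores below is made-up data, chosen so that one unusually large value shows how the mean and the median can differ.

```python
import statistics

# Hypothetical exam scores; 150 is a deliberate outlier.
scores = [50, 60, 60, 70, 80, 90, 150]

mean_score = statistics.mean(scores)      # sum of values / number of values
median_score = statistics.median(scores)  # middle value of the sorted data
mode_score = statistics.mode(scores)      # most frequent value

print(mean_score, median_score, mode_score)
```

Here the single large value pulls the mean above the median, which is one reason the median is often preferred for skewed data.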
- Arithmetic Mean
- Weighted Mean
- Geometric Mean
- Harmonic Mean
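Relationship between AM, GM and HM : for any set of positive numbers, AM ≥ GM ≥ HM, and the three are equal only when all the values are the same.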
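The four means can be sketched with Python's `statistics` module (Python 3.8 or later is assumed for `geometric_mean`). The values and weights below are invented for illustration. For positive numbers the arithmetic mean is never smaller than the geometric mean, which in turn is never smaller than the harmonic mean, and the last line checks this.

```python
import statistics

values = [2.0, 4.0, 8.0]    # hypothetical positive data
weights = [1.0, 2.0, 3.0]   # hypothetical weights, one per value

# Arithmetic mean: sum of the values divided by their count.
arithmetic = statistics.mean(values)

# Weighted mean: each value contributes in proportion to its weight.
weighted = sum(w * v for w, v in zip(weights, values)) / sum(weights)

# Geometric mean: n-th root of the product of n values.
geometric = statistics.geometric_mean(values)

# Harmonic mean: n divided by the sum of the reciprocals.
harmonic = statistics.harmonic_mean(values)

print(arithmetic, weighted, geometric, harmonic)
print(arithmetic >= geometric >= harmonic)
```

The geometric mean is typically used for growth rates and ratios, and the harmonic mean for averaging rates such as speeds over equal distances.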
Measure of Variability
A measure of variability describes how spread out a set of data is. When the measure is large, the data is widely scattered; when it is small, the data is tightly clustered. It tells us how much the data points vary from one another and gives a clear idea of the distribution.
The spread of data is described by a range of descriptive statistics, which includes variance, standard deviation, range and interquartile range. The spread can also be shown in graphs such as box plots, dot plots, and stem-and-leaf plots. In simple terms, a measure of variability tells us how far the data lies from its center point or average value.
Note : We are not going into depth on measures of central tendency or measures of spread here. A separate blog on these topics is coming soon; this blog is only an introduction to statistics.
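The measures of spread named above can be computed with the `statistics` module. This is a minimal sketch on made-up data, and it treats the data as a whole population (`pvariance`, `pstdev`); for a sample drawn from a larger population, `statistics.variance` and `statistics.stdev` would be the usual choice.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations

# Range: distance between the largest and smallest value.
data_range = max(data) - min(data)

# Variance: average squared distance from the mean (population version).
variance = statistics.pvariance(data)

# Standard deviation: square root of the variance, in the data's own units.
std_dev = statistics.pstdev(data)

# Interquartile range: spread of the middle half of the data.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

print(data_range, variance, std_dev, iqr)
```

Note that `statistics.quantiles` supports more than one interpolation method, so the exact quartile values can differ slightly from those given by other tools.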
You have probably heard the term probability many times before and may have studied it in school or college. The common examples used when learning probability are the probability of getting heads or tails when we toss a coin, or the probability of getting a 6 when we roll a die.
Definition : A probability distribution is a mathematical function that gives the probabilities of occurrence of the various possible outcomes of an experiment. There are different types of probability distributions, such as :
- Bernoulli Distribution
- Uniform Distribution
- Binomial Distribution
- Normal Distribution
- Poisson Distribution
- Exponential Distribution
- Chi-Squared Distribution
I have listed only a few of the most popular ones. There are more types of distributions.
Note : We are not going into the details of these distributions right now, because each distribution needs a separate blog. In upcoming blogs we will look at these distributions one by one; this blog is only an introduction to statistics.
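To make the idea of a distribution concrete, here is a small sketch of the binomial distribution applied to the coin-toss example. The function `binomial_pmf` is a hypothetical helper written for this illustration; it gives the probability of exactly k successes in n independent trials that each succeed with probability p. With n = 1 it reduces to the Bernoulli distribution.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Distribution of the number of heads in 10 tosses of a fair coin.
n_tosses = 10
pmf = [binomial_pmf(k, n_tosses, 0.5) for k in range(n_tosses + 1)]

for heads, probability in enumerate(pmf):
    print(f"{heads:2d} heads: {probability:.4f}")

# The probabilities over all possible outcomes add up to 1.
print(round(sum(pmf), 10))
```

Five heads is the single most likely outcome, but it still happens in only about a quarter of all runs.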
Inferential statistics is used to draw conclusions from data. Generally we take a random sample from the population and use it to describe and make inferences about the population.
Inferential statistics is used a lot in the field of data analysis. We conduct different types of tests on random samples from a given set of data to learn about the effect of a product. Inferential statistics uses statistical models to help you compare your sample data to other samples or to previous research. Most research uses a family of statistical models known as the general linear model, which includes :
- Student’s t-test
- ANOVA (Analysis of variance)
- Regression Analysis
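As a rough sketch of what such a test computes, the function below calculates the two-sample Student's t statistic with a pooled variance, which assumes both groups have equal variance. The function name and the data are hypothetical. Turning the statistic into a p-value needs the t distribution, which is not in Python's standard library, so that step is left out here.

```python
import math
import statistics


def student_t_statistic(sample_a, sample_b):
    """Two-sample Student's t statistic, assuming equal variances."""
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)

    # Pooled variance: weighted average of the two sample variances.
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)

    # Standard error of the difference between the two sample means.
    standard_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / standard_error


# Hypothetical measurements for two groups.
group_a = [5.1, 4.9, 5.6, 5.8, 6.0]
group_b = [4.1, 4.5, 4.3, 4.9, 4.7]

t_value = student_t_statistic(group_a, group_b)
print(t_value)
```

A t statistic far from zero suggests that the difference between the group means is large compared with the variation inside the groups.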
Inferential Statistics includes :
- Hypothesis Testing
- Binomial Theorem
- Normal Distributions
- Central Limit Theorem
- Confidence Intervals
- Regression Analysis / Linear Regression
- Comparison of Mean
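Several of these ideas fit together in one small sketch: draw a random sample from a population, then use the central limit theorem to build an approximate confidence interval for the population mean. Everything here is an assumption made for illustration: the population is simulated, `mean_confidence_interval` is a hypothetical helper, and the normal approximation is only reasonable for fairly large samples.

```python
import math
import random
import statistics


def mean_confidence_interval(sample, confidence=0.95):
    """Approximate confidence interval for the mean (normal approximation)."""
    sample_mean = statistics.fmean(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(len(sample))
    # z value that leaves (1 - confidence) / 2 in each tail of the normal curve.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * standard_error
    return sample_mean - margin, sample_mean + margin


random.seed(42)  # fixed seed so the run is repeatable

# Simulated population; in practice the full population is usually unknown.
population = [random.uniform(0, 100) for _ in range(100_000)]
sample = random.sample(population, 200)

low, high = mean_confidence_interval(sample)
print(f"sample mean: {statistics.fmean(sample):.2f}")
print(f"95% interval: ({low:.2f}, {high:.2f})")
print(f"population mean: {statistics.fmean(population):.2f}")
```

A higher confidence level gives a wider interval, and a larger sample gives a narrower one.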
So this was an introduction to statistics and its different types.
Note : In upcoming blogs we will go into depth on these topics as well.