Probability and Statistics, for Data Enthusiasts, Part 1

Nov 9, 2022

This series is an attempt to explain how probability, statistics, and computers work together to create the vast field of Data Science. I will start with the very basics and work my way up. Hopefully, it will help students and professionals who want to switch to a career in Data Science without a prior background in Mathematics to get some insight into probability and how it is used to analyze data. Let’s get started.

What is Statistics and what are these statistical methods?

We have statistics for every area of our life. In the simplest terms, statistics is just numbers, arranged in a way that gives us a bird's-eye view of the subject the numbers describe. It matters so much in data science because statistics is the most reliable way we have, despite the uncertainty involved, to get insights into our data. As a Data Scientist in training, or as a working Data Scientist or Data Analyst, you will work with large amounts of data and use different methods to extract information and gain insights from it.

Example 1:
For the purpose of this example, assume that the University of Data Science (UDS) is the world's top university. You go to this university and conduct a student survey, asking how they think they will perform at university and whether they believe the university lives up to its prestigious reputation and provides real-life career opportunities. You put this to 100 students and ask for a Yes or No answer. 60 students say no, UDS sucks, and 40 say yes, the university doesn't suck and in fact provides tons of opportunities. From these numbers you conclude that UDS is in fact a scam. But before that, did you actually use statistics to make sure you were looking for the right thing in the right places?
If you were to employ a Data Scientist for this job, let us see how they might approach the same problem.
Bonus: You are that Data Scientist!
You will first look at the dataset. You will notice that of the 60 students who said no, 35 have a GPA lower than 2.5 on a scale of 4, while of the 40 students who said yes, 35 have a GPA higher than 3.5. This suggests that for the majority of students who said UDS sucks, the real problem may be their own academic performance rather than the university. This is a basic example of how data science works. It is not just about telling a story from the data; it is about listening to the story the data is telling you. You just have to know what to listen for. That is where statistics comes into the picture.
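The breakdown of the survey can be sketched in a few lines of Python. The counts are the ones from the example; the variable names are only illustrative.

```python
# Survey counts taken from the example: 100 students in total.
responses = {"no": 60, "yes": 40}

# Students in each answer group that fall in the GPA band mentioned in the text.
no_with_low_gpa = 35    # answered "no" and have a GPA below 2.5
yes_with_high_gpa = 35  # answered "yes" and have a GPA above 3.5

# Share of each answer group that belongs to that GPA band.
share_no_low_gpa = no_with_low_gpa / responses["no"]
share_yes_high_gpa = yes_with_high_gpa / responses["yes"]

print(f"'No' answers with GPA < 2.5: {share_no_low_gpa:.1%}")
print(f"'Yes' answers with GPA > 3.5: {share_yes_high_gpa:.1%}")
```

Looking at the shares inside each group, rather than only the raw Yes/No totals, is what reveals the link between the answers and the GPA.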

Population and Sampling

The population is the complete set of data from which you choose samples to analyze. In the example above, the population is every student at UDS and the sample is the 100 students you surveyed. As another example, if we are counting how many green leaves there are in the city of Arlington, the population is all the leaves of all the trees in Arlington (which is a lot!). It is almost impossible for me to actually count all of those leaves and provide an analysis of them. So I will choose a sample: I will count the leaves of only one tree and then multiply that by the number of trees to estimate the total.

Why must samples be unbiased?

So I count the leaves of one tree, analyse them, and conclude that over a period of one day the tree shed over 2,500 leaves. Will this result apply to all the trees in Arlington? Think.

No! Of course not. There are multiple factors to take into consideration: for example, the type of the tree, the age of the tree, the condition of the tree, the condition of the soil, and so on. If I choose only one type of tree, then the analysis will have a very limited ability to predict anything about the others. So how do I fix this and arrive at an unbiased analysis?

I take samples from every type that I know of. For example, if I am aware of 10 different types of trees and 2 different types of soil in Arlington, that gives 10 × 2 = 20 combinations, and I will take 5 samples from each of them, for 100 trees in total. This keeps my data balanced and provides insight that is suitable for all the trees in Arlington city.
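Taking the same number of samples from every combination is usually called stratified sampling. Here is a minimal sketch of the idea in Python. The population is made up (30 trees per combination), and `stratified_sample`, `tree_type`, and `soil` are hypothetical names chosen for this illustration.

```python
import random
from itertools import product


def stratified_sample(population, strata_keys, per_stratum, seed=0):
    """Draw `per_stratum` random items from every stratum of the population.

    population: list of dicts, one per tree (hypothetical records).
    strata_keys: the fields that define a stratum, e.g. ("tree_type", "soil").
    """
    rng = random.Random(seed)

    # Group the records by their combination of stratum values.
    groups = {}
    for item in population:
        key = tuple(item[k] for k in strata_keys)
        groups.setdefault(key, []).append(item)

    sample = []
    for members in groups.values():
        # Never ask for more items than the stratum contains.
        k = min(per_stratum, len(members))
        sample.extend(rng.sample(members, k))
    return sample


# Made-up population: 10 tree types x 2 soil types, 30 trees per combination.
tree_types = [f"type_{i}" for i in range(10)]
soils = ["soil_a", "soil_b"]
population = [
    {"tree_type": t, "soil": s, "tree_id": n}
    for t, s in product(tree_types, soils)
    for n in range(30)
]

sample = stratified_sample(population, ("tree_type", "soil"), per_stratum=5)
print(len(sample))  # 20 combinations x 5 trees each
```

Every combination of tree type and soil contributes exactly 5 trees, so no single kind of tree dominates the analysis.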

This is why having unbiased data is extremely important to make your analysis actually worth applying.

Mean or Average

The mean, or average, of a sample is found by adding up the numerical values in the sample and dividing the sum by the number of values. The mean therefore lets us summarize variable data with a single representative number. It is represented mathematically as follows:

Mean of a sample:
$$\bar{x} = \frac{\text{sum of the values in the sample}}{\text{total number of values}} = \frac{\sum_{i=1}^{n} x_i}{n}$$
Where,
$x_i$ is the ith numerical value
$n$ is the total number of values.

The mean involves the sum of all the numerical values. This means that if we have outliers in our data, i.e., values that are extremely small or extremely large compared to the rest of the data set, they are likely to affect the mean. Thus, the mean is sensitive to outliers.
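To see this sensitivity in action, here is a small Python sketch. The numbers are invented for illustration and are not taken from any real data set.

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by how many there are."""
    if not values:
        raise ValueError("mean of an empty sample is undefined")
    return sum(values) / len(values)


# Hypothetical sample without and with one extreme value.
typical = [10, 12, 11, 13, 14]
with_outlier = typical + [500]

mean_typical = mean(typical)
mean_with_outlier = mean(with_outlier)

print(mean_typical)       # 12.0
print(mean_with_outlier)  # roughly 93.3
```

A single extreme value pulls the mean far away from where most of the data sits.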

Trimmed Mean

To reduce the effect of outliers, so that we can get a better understanding of the central tendency of our data, we use the trimmed mean. A trimmed mean is computed by trimming the ends of the sorted data by a specified percentage and averaging what is left. Suppose we want a 20% trimmed mean. If the 20% is counted in total, we drop the smallest 10% and the largest 10% of the values and take the mean of the middle 80%. Be aware that conventions differ: some textbooks and libraries use the percentage to mean the share removed from each end. Either way, trimming makes the mean less sensitive to outliers.
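Here is a minimal sketch of a trimmed mean in plain Python. The function name and the sample data are made up for illustration. Note the assumption built into it: `proportion_each_end` is the fraction cut from each end, so 0.10 keeps the middle 80% of the values.

```python
def trimmed_mean(values, proportion_each_end):
    """Mean of the values left after cutting a fraction from each end.

    proportion_each_end: fraction (0 <= p < 0.5) of the sorted values that is
    dropped from the low end and again from the high end.
    """
    if not values:
        raise ValueError("trimmed mean of an empty sample is undefined")
    if not 0 <= proportion_each_end < 0.5:
        raise ValueError("proportion_each_end must be in [0, 0.5)")

    ordered = sorted(values)
    k = int(len(ordered) * proportion_each_end)  # values dropped per end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Hypothetical data with one very small and one very large value.
data = [1, 20, 21, 22, 23, 24, 25, 26, 27, 300]

plain = sum(data) / len(data)
trimmed = trimmed_mean(data, 0.10)  # drops 1 and 300, keeps the middle 80%

print(plain)    # 48.9
print(trimmed)  # 23.5
```

With the two extreme values removed, the result lands in the middle of the remaining data instead of being dragged toward 300.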

Weighted Mean

When some entries in our data are repeated, or when entries carry different importance, we can use the weighted mean. For example, suppose you have 3 subjects: English, Physics, and Biology. Because you are doing a degree in Literature Bioinformatics (I don't think such a degree exists, but let us assume UDS invented it!), English carries a weight of 60%, Biology 30%, and Physics 10% in your final grade. You scored 65/100 in English, 78/100 in Biology, and 100/100 in Physics, and your final grade is a mean of your grades in all three subjects. Let us analyze how the mean and the weighted mean give us different information.

If we calculate a simple mean, we get

(65 + 78 + 100) / 3 = 81

Which seems like a good score! But wait, since we have a degree in Literature Bioinformatics, we have to consider the importance of each subject. Otherwise, there is no point in doing this specific degree. So let us calculate the weighted mean. Mathematically, the weighted mean is defined as:

Weighted mean of a sample:
$$\bar{x} = \frac{\text{sum of each value times its weight}}{\text{sum of the weights}} = \frac{\sum_{i=1}^{n} \omega_i x_i}{\sum_{i=1}^{n} \omega_i}$$
Where,
$x_i$ is the ith numerical value
$\omega_i$ is the weight of the ith numerical value.

In our example, it will be,

(60 × 65 + 30 × 78 + 10 × 100) / (60 + 30 + 10) = 72.4

That means you did not score as well in your specific degree as you wanted to! This is why the weighted mean is important: in most data science work, different factors carry different importance.
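The same calculation can be written as a short Python sketch. The scores and weights are the ones from the example; the function and variable names are my own.

```python
def weighted_mean(values, weights):
    """Sum of value * weight, divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


# Scores out of 100 and percentage weights from the example.
scores = {"English": 65, "Biology": 78, "Physics": 100}
weights = {"English": 60, "Biology": 30, "Physics": 10}

subjects = list(scores)
simple = sum(scores.values()) / len(scores)
weighted = weighted_mean(
    [scores[s] for s in subjects],
    [weights[s] for s in subjects],
)

print(simple)    # 81.0
print(weighted)  # 72.4
```

The perfect Physics score lifts the simple mean, but it barely moves the weighted mean because Physics only counts for 10%.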

Median

The median gives you the middle point of your data once the values are sorted. In other words, the median is where the center would lie if the sorted data were divided into two halves with equally many values. It doesn't care if one half is bigger in value than the other. It just divides. Mathematically,

Median of a sample with n values sorted in ascending order, $x_1 \le x_2 \le \dots \le x_n$:

if n is odd
$$\tilde{x} = x_{(n+1)/2}$$
if n is even
$$\tilde{x} = \frac{x_{n/2} + x_{n/2+1}}{2}$$
This means that if n is odd, the median is the single middle term of the sorted sequence.
If n is even, there are two middle terms that split the data into two halves, and the median is the average of these two terms.

As one might expect, the median is largely unaffected by outliers, because it depends only on the position of the middle terms, whereas the mean depends on the value of every term.
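Here is a minimal Python sketch of the median that follows the odd and even cases directly. The data is invented for illustration. In practice, Python's standard library already offers `statistics.median`, so this version is only meant to show the logic.

```python
def median(values):
    """Middle value of the sorted data (average of the two middle values if n is even)."""
    if not values:
        raise ValueError("median of an empty sample is undefined")
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the single middle term (0-based index n // 2).
        return ordered[mid]
    # Even n: the average of the two middle terms.
    return (ordered[mid - 1] + ordered[mid]) / 2


# Hypothetical sample without and with one extreme value.
typical = [10, 12, 11, 13, 14]
with_outlier = typical + [500]

median_typical = median(typical)
median_with_outlier = median(with_outlier)

print(median_typical)       # 12
print(median_with_outlier)  # 12.5
```

Adding the value 500 moves the median only from 12 to 12.5, because only the position of the middle terms changes.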

Out of all the measures of central tendency we have encountered so far, the order of sensitivity to outliers goes like this:

Mean = Weighted Mean > Trimmed Mean > Median.

Meaning, the mean and the weighted mean are the most sensitive to outliers (for the weighted mean, how much depends on the weight the outlier carries), while the median is the least sensitive.

That is all for part 1! As we go further, we will encounter more complex subjects and hopefully simplify them.

Hope you understood everything. If you did not, please let me know in the comments.

Thank you!