Probability and statistics for data enthusiasts, Part 2: Data and Measures of Variability

Yash Tobre
Nov 20, 2022

Broadly speaking, there are two types of data. One is data that is counted, and the other is data that is measured and can fall anywhere within a range. Let’s look at an example.

Photo by Pritesh Sudra on Unsplash

Example 1:

Suppose that at our UDS (University of Data Science) we have a team of data scientists working to improve the performance of our students. There are two tasks, Task A and Task B. You are the data scientist in charge and have been given the following problem statement for the tasks.

We want to know how students are performing in a course that was offered to them in 2022, to determine whether we should keep the course going or discard it. The course is Data Science and Agriculture (DSA). To judge the course, we are using two performance metrics: the time students spent in the lectures and the number of students who completed the assignments given.

Task A: Determine whether the total time students spent sitting in the lectures was comparable to that of the other top-rated courses at UDS.

Task B: Determine whether students did the assignments given to them and enjoyed them.

The team of data scientists thus creates a survey with a few questions and gets some data. You take a glance at the data.

You find that, for Task A, students spent about 22 hours in the lectures on average, out of the 26 hours of total coursework. The team calculated this from the timestamps of students entering and leaving the classroom.

Now, this type of data is continuous data, because time is measured rather than counted. To figure out how long the students attended the lectures, the team did not count each and every second separately. It measured the interval from the beginning of each lecture to the end of it, and that interval can take any value within a range.

For Task B, the team has collected survey responses from all the students, giving feedback on the assignments on a 3-point scale: 0: Hated them, 1: Did them but did not enjoy them, 2: Enjoyed them. You find that 22/80 students hated the assignments, 24/80 did them on time but did not enjoy doing them, and 34/80 enjoyed doing them. To get these numbers, the team counted the number of students in each category.

This type of data is discrete data, because gathering the student feedback involved counting responses in each category rather than measuring something over a period of time.

We can now define both types of data.

Discrete Data: Discrete data consists of values that can be counted, such as whole numbers or distinct categories. Each value is separate, and there is nothing in between two neighbouring values.
Continuous Data: Continuous data is measured rather than counted. It can take any value within a range, so between any two values there is always another possible value.
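
To make the two definitions concrete, here is a minimal Python sketch of how the UDS data might be held in code. The survey counts (22, 24 and 34 out of 80) come from the example above; the per-student attendance hours are hypothetical values made up purely for illustration.

```python
from collections import Counter

# Task B: every survey answer is one of three categories (0, 1 or 2).
# The counts are from the example; the list is rebuilt here for illustration.
responses = [0] * 22 + [1] * 24 + [2] * 34

# Discrete data: we simply count how many answers fall in each category.
counts = Counter(responses)
total = sum(counts.values())

labels = {0: "Hated them", 1: "Did them but did not enjoy", 2: "Enjoyed them"}
for score, label in labels.items():
    print(f"{label}: {counts[score]}/{total}")

# Task A: attendance time is measured, so it can be any value in a range.
# These hours are hypothetical, not taken from the survey.
hours_attended = [21.5, 22.75, 19.2, 24.05]
average_hours = sum(hours_attended) / len(hours_attended)
print(f"Average hours attended: {average_hours:.2f}")
```

Notice that the survey answers can only ever be 0, 1 or 2, while an attendance figure such as 22.75 hours could just as well have been 22.751 hours.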

Measures of Variability

In basic terms, there are three measures of variability that can help us describe most of the data that we will have.

Photo by David Pupaza on Unsplash
1. Range: The range indicates how widely your data is spread out. It is calculated by taking the difference between the maximum value and the minimum value of the data points.

For example, consider two artists, artist A and artist B. Artist A has 2 albums, one folk-pop and the other synth-soft rock. Artist B has 2 albums, one pop and the other rap. Artist A's work can be classified into as many as 4 different genres, while artist B has worked in 2 different genres. You can say that artist A has a wider range than artist B, simply because artist A's work is more spread out across different genres.

2. Variance: Variance helps us see how far the data points are spread out within the group. Mathematically, it is the average squared deviation from the mean. The unit of variance is the square of the unit of the data we have.

Following up on our previous example, you listen to one album from each artist: album A by artist A and album B by artist B. You analyze the albums and realize that album A has a couple of songs that performed well both commercially and critically. Then there is a smaller group of songs that did only okay with critics but were loved by fans. Finally, there are a couple of songs that fans did not listen to repeatedly but that critics loved. On album B, the majority of songs were loved by critics and fans alike and were a major success, while a couple of songs flopped and rarely got any streams after the album's release week. You can see that the songs on album A vary a lot in how they performed, both critically and commercially, while the songs on album B were either one thing or the other: there were clear and distinct boundaries in the data, and most songs did not differ much from one another. This degree of varying is what variance captures in this example.

3. Standard Deviation: Standard deviation describes how far the data points typically lie from the mean of a dataset. It measures the spread of the data, taking the mean as the center. Mathematically, it is the square root of the variance and thus has the same unit as the data points.

In the previous example, the standard deviation would describe how far the performance of each song typically lies from that of a song rated average by both critics and fans.

In both the sample standard deviation and the sample variance, the denominator is n-1. This term is the degrees of freedom, and it represents how many independent pieces of information we have in our data set. We use n-1 because the deviations from the sample mean always sum to zero: once the first n-1 deviations are known, the last one is fixed, so only n-1 of them carry independent information. For the population variance and population standard deviation, we divide by the number of data points N instead. When the number of data points is large, choosing n or n-1 makes little practical difference.
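
The degrees-of-freedom idea is easier to see with numbers. Below is a small sketch in Python, using a hypothetical three-value sample chosen only because the arithmetic is easy to follow. It shows that the deviations from the mean sum to zero, so the last deviation can be recovered from the others, and it compares dividing by n-1 with dividing by n. The `statistics` functions used are from the Python standard library.

```python
import statistics

# Hypothetical sample, picked only to keep the arithmetic simple.
sample = [4.0, 8.0, 6.0]
n = len(sample)

mean = statistics.mean(sample)
deviations = [x - mean for x in sample]

# The deviations from the sample mean always add up to zero...
print("sum of deviations:", sum(deviations))

# ...so the last deviation is fixed once the first n-1 are known.
last_from_others = -sum(deviations[:-1])
print("last deviation recovered:", last_from_others == deviations[-1])

sum_of_squares = sum(d ** 2 for d in deviations)
sample_variance = sum_of_squares / (n - 1)   # divides by n-1
population_variance = sum_of_squares / n     # divides by n

# The standard library uses the same two denominators.
print(sample_variance, statistics.variance(sample))
print(population_variance, statistics.pvariance(sample))
```

With only three data points the two variances differ noticeably; with thousands of data points the gap between dividing by n and by n-1 becomes very small.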

Example 2:

For the following data set let us look at the range, variance, and standard deviation.

2, 5, 7, 8, 21, 3, 1, 11

For the range,

we look at the maximum value, 21, and the minimum value, 1.

Thus the range will be

R = X(max) - X(min) = 21 - 1 = 20

Looking at the range, we see that our data spans an interval of 20, which gives us a mental picture of the band of values we are working with.

Now, let us find out the standard deviation and variance. For that, we need to find out the mean first.

Mean of Data = Sum of all the data points / the total number of data points.

Mean = 58/8 = 7.25

Now we can use the variance equation to find the variance as well as the standard deviation. For demonstration purposes, I am using an online standard deviation calculator (https://www.calculator.net/standard-deviation-calculator.html, cheating but time-saving :P).
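
If you would rather not rely on an online calculator, the same quantities can be reproduced with Python's standard `statistics` module. This is only a sketch of one way to do it. The variable names are my own, and it prints both the sample (n-1) and the population (n) versions.

```python
import statistics

data = [2, 5, 7, 8, 21, 3, 1, 11]

data_range = max(data) - min(data)
mean = statistics.mean(data)

# Sample versions divide the sum of squared deviations by n-1.
sample_variance = statistics.variance(data)
sample_sd = statistics.stdev(data)

# Population versions divide by n.
population_variance = statistics.pvariance(data)
population_sd = statistics.pstdev(data)

print(f"range = {data_range}, mean = {mean}")
print(f"sample:     variance = {sample_variance:.4f}, sd = {sample_sd:.4f}")
print(f"population: variance = {population_variance:.4f}, sd = {population_sd:.4f}")
```

For this data set the sum of squared deviations from the mean is 293.5, so the sample variance is 293.5/7 (about 41.93) and the population variance is 293.5/8 = 36.6875.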

Observing the standard deviation (about 6.48 using n-1, or about 6.06 using n) against a mean of 7.25, I can see that the values typically deviate from the mean by almost as much as the mean itself. This is also evident from the data. Moreover, the higher the variance, the wider the spread of the data.

I wonder what will happen if I replace the 21 with 7. Check it out!
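
If you want to try this experiment yourself, here is a minimal Python sketch. The helper `describe` is a hypothetical name I made up for this illustration, and it uses the sample (n-1) versions of variance and standard deviation from the standard `statistics` module.

```python
import statistics


def describe(values):
    """Return the range, mean, sample variance and sample standard deviation."""
    return {
        "range": max(values) - min(values),
        "mean": statistics.mean(values),
        "variance": statistics.variance(values),
        "std_dev": statistics.stdev(values),
    }


original = [2, 5, 7, 8, 21, 3, 1, 11]
# Swap the largest value, 21, for a 7 and keep everything else unchanged.
modified = [7 if x == 21 else x for x in original]

before = describe(original)
after = describe(modified)

for name in before:
    print(f"{name}: {before[name]:.2f} -> {after[name]:.2f}")
```

Run it and compare the two columns to see how strongly a single extreme value influences each measure.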

Well, that is all for this part. Let me know if you have any questions in the comments. Thank you for taking the time to read this. I hope this article cleared up one or two concepts of yours. Have a great time!
