8 min read•june 18, 2024
Jed Quiaoit
Lusine Ghazaryan
A statistic is a measure calculated from a sample to help us analyze the data. A parameter, by contrast, is a measure that describes the whole population. In inferential statistics, we will use statistics to make inferences about parameters. For now, we'll focus on summary statistics. The mean, median, standard deviation, IQR, and range are all summary statistics for a quantitative variable.
The mean, or average, as you learned before, is easy to calculate: we add up all the values of the variable and divide the sum by the number of values. The formula follows:
x̄ = ∑x / n
x̄ is read as "x bar"; it's the mean of the x values in the data, and n is the number of values. By the way, the variable doesn't need to be x; it can be y as well, in which case the mean is written ȳ. The mean is the best summary measure of center for a symmetric distribution because, as mentioned before, it is the balancing point of the distribution. However, the mean has a few drawbacks. It does not describe how individual values vary (that is why we also need summary measures of spread), and it is not resistant to outliers.
A single unusually high or low value in the data set can pull the mean toward it. If we report the mean when the median would have been the better choice, the distorted center can skew our study results and lead us to wrong decisions.
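To see how one extreme value moves the mean, here is a minimal sketch in Python. The income figures are hypothetical and chosen only for illustration.

```python
from statistics import mean, median

# Hypothetical data: five incomes, in thousands of dollars.
incomes = [40, 45, 50, 55, 60]

# The same data with one very large value added.
incomes_with_outlier = incomes + [1000]

# Without the extreme value, the mean and the median agree.
print(mean(incomes), median(incomes))

# With it, the mean jumps far above most of the data,
# while the median barely moves.
print(mean(incomes_with_outlier), median(incomes_with_outlier))
```

In this made-up example the mean rises from 50 to a little over 208, while the median only shifts from 50 to 52.5.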
The median is the middle value of the ordered data. When the number of values is even, we calculate the median by averaging the two middle values. The median is a good alternative for summarizing the center of a skewed distribution or a distribution with outliers, because it is resistant to outliers. It is not easy to read the exact median from a histogram, but you don't need to.
We only need to find its position in the ordered data. If the number of values n is odd, the median is the value at position (n + 1)/2. If n is even, the median is the average of the values at positions n/2 and n/2 + 1.
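The position rule can be sketched in Python as follows. The function names `median_positions` and `median_of` are made up for this illustration, and positions are counted from 1, as you would count them by hand.

```python
def median_positions(n):
    """Return the 1-based position(s) of the median among n ordered values."""
    if n % 2 == 1:
        # Odd count: a single middle value.
        return [(n + 1) // 2]
    # Even count: the two middle values.
    return [n // 2, n // 2 + 1]


def median_of(values):
    """Average the value(s) found at the median position(s)."""
    ordered = sorted(values)
    middle = [ordered[p - 1] for p in median_positions(len(ordered))]
    return sum(middle) / len(middle)


print(median_positions(9))      # [5]
print(median_positions(10))     # [5, 6]
print(median_of([4, 1, 3, 2]))  # 2.5
```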
In the following section, when we compare two histograms, you will see how to find the median from the histogram.
Rule of thumb time! 👍
If the distribution is symmetric and unimodal, the mean is often the best measure of central tendency because it takes into account all of the values in the dataset and reflects the overall trend in the data.
On the other hand, if the distribution is skewed or has outliers, the median is often a better measure of central tendency because it is resistant to the influence of these values. In right-skewed distributions, the mean is generally higher than the median, while in left-skewed distributions, the mean is generally lower than the median.
It's always a good idea to report both the mean and median when describing the statistical properties of a dataset, and to explain why they are different if they are not close to each other. This can help to provide a more complete picture of the distribution of the data and how it is dispersed around the center.
Likewise, remember to always report the units when describing summary measures of the center, as you would in any math class. This helps to provide context and allows others to interpret the results accurately.
The standard deviation is like the lungs of statistics. You cannot breathe without lungs, and you cannot analyze data without the standard deviation. It shows how far the values typically deviate from the mean. Calculating the standard deviation by hand is lengthy and time-consuming, as you probably know by now. You will mostly rely on your calculator to do it for you, but just in case, here is the formula:
s = √[∑(x − x̄)² / (n − 1)]
You may wonder why we divide by n − 1 instead of n. The deviations are measured from the sample mean x̄ rather than from the unknown population mean, and the data are, on average, closer to their own mean than to the population mean. Dividing by n would therefore tend to underestimate the population variance; dividing by n − 1 corrects for this. The quantity n − 1 is known as the "degrees of freedom": once x̄ is known, only n − 1 of the deviations are free to vary, because they must sum to zero.
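As a rough sketch, the formula can be followed step by step in Python and checked against the standard library's `statistics.stdev`, which also divides by n − 1. The helper name `sample_sd` and the scores are hypothetical.

```python
from math import sqrt
from statistics import stdev


def sample_sd(values):
    """Sample standard deviation, dividing by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    # Squared distance of each value from the sample mean.
    squared_deviations = [(x - x_bar) ** 2 for x in values]
    return sqrt(sum(squared_deviations) / (n - 1))


scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data

print(round(sample_sd(scores), 3))  # 2.138
print(round(stdev(scores), 3))      # 2.138
```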
As you read more units, you will revisit the concept of standard deviation and will understand it more.
Recall that the interquartile range (IQR) is based on the difference between the upper and lower quartiles. It is calculated as the difference between the 75th percentile (Q3) and the 25th percentile (Q1) in a data set.
IQR = upper quartile (Q3) - lower quartile (Q1).
The first quartile, Q1, is the median of the half of the ordered data set from the minimum to the position of the median. The third quartile, Q3, is the median of the half of the ordered data set from the position of the median to the maximum. Q1 and Q3 form the boundaries for the middle 50% of values in an ordered data set.
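Here is a minimal sketch of the "median of each half" idea in Python. One caveat: when the number of values is odd, conventions differ on whether the overall median belongs to both halves or to neither, so the hypothetical `include_median` flag below lets you try both. The function name and the data are made up for illustration.

```python
from statistics import median


def quartiles(values, include_median=False):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    include_median only matters when the number of values is odd:
    it decides whether the overall median is counted in both halves.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    if n % 2 == 1 and include_median:
        lower_half = ordered[: half + 1]
        upper_half = ordered[half:]
    else:
        lower_half = ordered[:half]
        upper_half = ordered[n - half:]
    return median(lower_half), median(upper_half)


data = [2, 4, 6, 8, 10, 12, 14]  # hypothetical data

q1, q3 = quartiles(data)
print(q1, q3, q3 - q1)  # 4 12 8

q1, q3 = quartiles(data, include_median=True)
print(q1, q3, q3 - q1)  # 5.0 11.0 6.0
```

The two conventions usually give similar answers for larger data sets, but it is worth knowing which one your calculator or software uses.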
However, the IQR does not capture the entire distribution of values in the data set and therefore may not fully reflect the variability of the data. Other measures such as the range, standard deviation, and variance can provide a more comprehensive view of the dispersion and variability in a data set. These measures are often used in conjunction with the IQR to provide a more complete understanding of the characteristics of a data set.
For roughly symmetric, bell-shaped distributions without outliers, the IQR is usually larger than the standard deviation (about 1.35 times as large for a normal distribution), although the exact relationship between these measures depends on the shape of the data.
Previously, we've talked about what outliers are, but how do we know a data point is an outlier or not? There are many methods for determining outliers. Two methods frequently used in this course are:
The first method uses the IQR. It involves calculating the IQR for the data set and then using this value to determine which values lie outside the normal range of the data.
Specifically, values that are more than 1.5 × IQR above the third quartile (Q3) or more than 1.5 × IQR below the first quartile (Q1) are considered outliers. This method is based on the assumption that most of the values in the data set should fall within 1.5 × IQR of the quartiles, with only a small number of values falling outside this range.
To determine whether a value is an outlier using the 1.5 × IQR method, you will need to calculate the interquartile range (IQR) for the data set and then compare the value to the upper and lower bounds of the data set. Here is an example of how this might be done:
Suppose you have the following data set: 10, 15, 20, 25, 30, 35, 40, 45, 50
To determine whether any of these values are outliers using the 1.5 × IQR method, you would first need to calculate the IQR. To do this, you would need to find the first quartile (Q1), the median (Q2), and the third quartile (Q3).
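For this data set, the median (Q2) is 30. Taking each half of the ordered data to include the median, the first quartile (Q1) is 20 and the third quartile (Q3) is 40. The IQR is then calculated as the difference between Q3 and Q1, or 40 - 20 = 20. (Some calculators and textbooks leave the median out of both halves, which gives Q1 = 17.5 and Q3 = 42.5 for these data; the conclusions below for the values 100 and 5 are the same either way.)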
Next, you would need to determine the upper and lower bounds of the data set using the IQR. The upper bound is calculated as Q3 + 1.5 × IQR, or 40 + (1.5 × 20) = 70. The lower bound is calculated as Q1 - 1.5 × IQR, or 20 - (1.5 × 20) = -10.
Finally, you would need to compare the value you are interested in to these bounds. If the value is less than the lower bound or greater than the upper bound, it is considered an outlier. For example, if the value you are interested in is 100, it is an outlier because it is greater than the upper bound of 70. If the value you are interested in is 5, it is not an outlier because it is within the bounds of the data set (-10 to 70).
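The worked example above can be reproduced with a short Python sketch. It assumes the `"inclusive"` method of the standard library's `statistics.quantiles`, which gives Q1 = 20 and Q3 = 40 for this data set; other quartile conventions can give slightly different bounds. The helper name `iqr_bounds` is made up for this illustration.

```python
from statistics import quantiles


def iqr_bounds(values):
    """Return the (lower, upper) bounds from the 1.5 x IQR rule."""
    # Assumption: the "inclusive" method, which matches the quartiles
    # used in the worked example (Q1 = 20, Q3 = 40).
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


data = [10, 15, 20, 25, 30, 35, 40, 45, 50]
lower, upper = iqr_bounds(data)
print(lower, upper)  # -10.0 70.0

# Check the two values discussed in the text.
for value in (100, 5):
    is_outlier = value < lower or value > upper
    print(value, is_outlier)
```

The value 100 falls above the upper bound and is flagged, while 5 lies between the bounds and is not.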
The second method uses the standard deviation. It involves calculating the mean and standard deviation for the data set and then flagging values that are more than 2 standard deviations above or below the mean as outliers. This method is based on the assumption that most of the values in the data set should fall within two standard deviations of the mean (about 95% of them when the distribution is approximately normal), with only a small number of values falling outside this range.
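A minimal Python sketch of the 2 standard deviation rule follows. The function name `two_sd_outliers` and the data are hypothetical, and the sample standard deviation (dividing by n − 1) is assumed.

```python
from statistics import mean, stdev


def two_sd_outliers(values):
    """Return the values more than 2 standard deviations from the mean."""
    x_bar = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1)
    lower = x_bar - 2 * s
    upper = x_bar + 2 * s
    return [x for x in values if x < lower or x > upper]


data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 40]  # hypothetical data
print(two_sd_outliers(data))  # [40]
```

Note that the extreme value itself inflates both the mean and the standard deviation, which is one reason this rule can miss outliers in small data sets.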
Both of these methods can be useful for identifying unusual or unexpected values in a data set, but they may not be suitable for all types of data or in all situations. It is important to consider the characteristics of the data set and the goals of the analysis when deciding which method to use to identify outliers.
The mean, standard deviation, and range are considered nonresistant (or non-robust) because they are influenced by outliers. The median and IQR are considered resistant (or robust), because outliers do not greatly (if at all) affect their value.
For these reasons, the median and IQR are often preferred to the mean, standard deviation, and range when working with data sets that may contain outliers. They are more robust and provide a more accurate representation of the center and spread of the data, even in the presence of extreme values.
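To compare the measures of spread, the sketch below summarizes one data set twice: once as given and once with its largest value replaced by an extreme one. The data and the helper name `spread_summary` are hypothetical, and the quartiles assume the `"inclusive"` method of `statistics.quantiles`.

```python
from statistics import quantiles, stdev


def spread_summary(values):
    """Return the standard deviation, range, and IQR of the values."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "sd": stdev(values),
        "range": max(values) - min(values),
        "iqr": q3 - q1,
    }


clean = [10, 15, 20, 25, 30, 35, 40, 45, 50]
with_outlier = [10, 15, 20, 25, 30, 35, 40, 45, 500]

print(spread_summary(clean))
print(spread_summary(with_outlier))
```

In this example the range grows from 40 to 490 and the standard deviation grows sharply, while the IQR stays at 20.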
© 2024 Fiveable Inc. All rights reserved.