Outline:
Introduction
Many statistical procedures, including correlation, regression, t-tests, and analysis of variance, assume that the data are normally distributed. The central limit theorem implies that when a sample contains 100 or more observations, violations of normality are not a major concern, because the sampling distribution of the mean is approximately normal. Even so, the normality assumption should be checked regardless of sample size if the findings are to be meaningful. We report continuous data as a mean if they follow a normal distribution. This mean is also used to compare groups and to calculate the significance level (P value). If the data are not normally distributed, the mean is not a representative value of the data.
Choosing the wrong representative value for a data set, and then calculating the significance level from it, may lead to incorrect interpretation. We must therefore first determine whether the data are normally distributed, and only then decide whether the mean is a representative value. If it is, means are compared using parametric tests; otherwise, nonparametric methods are used to compare the groups using medians.
Because normality is an underlying assumption of parametric tests, assessing it is a prerequisite for many statistical analyses. Normality can be assessed in two ways: graphically and numerically. Statistical tests have the benefit of providing an objective assessment, but they have the drawback of lacking sensitivity at small sample sizes and being unduly sensitive at large sample sizes. Where numerical tests may be overly or insufficiently sensitive, graphical interpretation has the advantage of letting informed judgment assess normality.
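To make the point about representative values concrete, here is a minimal Python sketch using only the standard library. The two samples are simulated purely for illustration (they are not data from this article), and the `summarize` helper is a hypothetical name. It compares the mean and median of a roughly symmetric sample with those of a right-skewed one.

```python
import random
import statistics

random.seed(42)  # fixed seed so the illustration is reproducible

# Hypothetical data: one roughly symmetric sample and one right-skewed sample.
symmetric = [random.gauss(50, 10) for _ in range(1000)]
skewed = [random.expovariate(1 / 50) for _ in range(1000)]

def summarize(name, values):
    """Print the mean and median side by side and return them."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    print(f"{name:>9}: mean={mean:6.2f}  median={median:6.2f}")
    return mean, median

sym_mean, sym_median = summarize("symmetric", symmetric)
skw_mean, skw_median = summarize("skewed", skewed)
```

In the symmetric sample the two summaries nearly coincide, while in the skewed sample the long right tail pulls the mean well above the median, which is why the median is the safer summary when normality fails.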
Graphical Technique
Graphical interpretation lets an experienced analyst judge normality in situations where numerical tests may be too sensitive or not sensitive enough, but it lacks objectivity. If you don't have much experience interpreting normality graphically, you should usually rely on numerical methods.
Here are some of the graphical methods available for checking normality.
>Histogram - A histogram is a data visualization that depicts a variable's distribution. It shows how often values fall into each interval (bin) of the variable's range, which is what distributions are all about.
A histogram approximates the probability distribution of a continuous variable. We can treat the data as approximately normal if the graph is roughly bell-shaped and symmetric about the mean.
Illustration:
Two histograms are displayed below, one for normal data and the other for non-normal data.
Watch this Video: «How to Create a Histogram Using R»
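For readers who want to see what a histogram computes, the following is a rough Python sketch (standard library only) that counts values per equal-width bin and prints a text histogram. The function name `histogram_counts`, the bin count, and the simulated samples are assumptions made for this illustration.

```python
import random
from collections import Counter

random.seed(0)  # reproducible illustration

def histogram_counts(values, bins=10):
    """Count how many values fall into each of `bins` equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = Counter()
    for v in values:
        # The maximum value is placed in the last bin rather than a new one.
        index = min(int((v - lo) / width), bins - 1)
        counts[index] += 1
    return [counts[i] for i in range(bins)]

# Hypothetical samples: one normal, one right-skewed.
normal_data = [random.gauss(0, 1) for _ in range(2000)]
skewed_data = [random.expovariate(1.0) for _ in range(2000)]

for label, data in (("normal", normal_data), ("skewed", skewed_data)):
    print(label)
    for count in histogram_counts(data):
        print("  " + "#" * (count // 20))
```

The normal sample should print a bar pattern that peaks in the middle and tapers on both sides, while the skewed sample piles up in the first bins and trails off to the right.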
>Box Plot - The box plot is a standardized way of visualizing a data distribution using a five-number summary (“minimum”, first quartile (Q1), median, third quartile (Q3), and “maximum”). It can tell you which points are outliers and what their values are. It can also show whether your data are symmetrical, how tightly they are grouped, and whether they are skewed. Although box plots appear unsophisticated compared with a histogram or density plot, they take up less space, which is useful when comparing distributions across multiple groups or datasets.
Illustration of a Box Plot is shown in the image below:
Watch this Video: «Box Plot Using R»
Box plots are valuable because they provide a visual summary that lets researchers quickly identify the median, the dispersion of the data set, and its skewness. Whether a data set is normally distributed or skewed, the shape of the box plot will show it.
The box plot below illustrates how a data set can be checked for normality.
Looking closely at the box plot, the distribution is symmetric when the median is in the centre of the box and the whiskers on both sides are roughly the same length. The distribution is positively skewed (skewed right) when the median is closer to the bottom of the box and the whisker at the lower end is shorter. The distribution is negatively skewed (skewed left) when the median is closer to the top of the box and the whisker at the upper end is shorter.
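The reading rules above can be sketched numerically. This Python example (standard library only) computes the five-number summary and classifies the box by where the median sits inside it. The helper names, the 0.05 tolerance, and the simulated samples are assumptions chosen for this sketch. It also uses the true minimum and maximum, not the 1.5 × IQR whisker rule that many plotting tools apply.

```python
import random
import statistics

random.seed(1)  # reproducible illustration

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return min(values), q1, median, q3, max(values)

def box_shape(values, tolerance=0.05):
    """Classify the box from the median's position inside it.

    `tolerance` is an arbitrary cut-off chosen for this sketch.
    """
    _, q1, median, q3, _ = five_number_summary(values)
    # 0.5 means the median sits exactly in the middle of the box.
    position = (median - q1) / (q3 - q1)
    if position < 0.5 - tolerance:
        return "positively skewed (right)"
    if position > 0.5 + tolerance:
        return "negatively skewed (left)"
    return "roughly symmetric"

# Hypothetical samples for illustration.
normal_data = [random.gauss(0, 1) for _ in range(10000)]
right_skewed = [random.expovariate(1.0) for _ in range(10000)]
left_skewed = [-x for x in right_skewed]

for label, data in (("normal", normal_data),
                    ("right-skewed", right_skewed),
                    ("left-skewed", left_skewed)):
    print(f"{label}: {box_shape(data)}")
```

This only checks the median's position inside the box; a fuller check would also compare whisker lengths, as described above.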
>Q-Q Plot - The Q-Q plot, or quantile-quantile plot, is a graphical tool for judging whether a set of data plausibly came from a theoretical distribution such as the normal or exponential distribution. For instance, if a statistical analysis assumes that the dependent variable is normally distributed, we can check that assumption with a normal Q-Q plot. It is only a visual check, not foolproof evidence, so it is somewhat subjective. However, it lets us see quickly whether the assumption is plausible and, if not, how it is violated and which data points contribute to the violation.
On the x-axis we plot the theoretical quantiles of the standard normal variate (a normal distribution with mean = 0 and standard deviation = 1), and on the y-axis we plot the ordered values of the variable we want to check for normality. If the data are normally distributed, the plotted points form an approximately straight line.
Now we must concentrate on the ends of the line. If the points at the ends do not lie on the straight line but are scattered well away from it, the ordered values do not match the theoretical quantiles, which indicates that the data are not normally distributed, as illustrated in the figure below:
We can reasonably conclude that the data are normally distributed if all of the points lie close to a straight line, because the ordered values then line up with the standard normal quantiles; this is the basic idea behind the Q-Q plot. If the lower end of the Q-Q plot deviates from the straight line but the upper end does not, the distribution has a longer tail to the left and is left-skewed (negatively skewed). If the upper end deviates from the straight line but the lower end does not, the distribution has a longer tail to the right and is right-skewed (positively skewed).
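The construction of a normal Q-Q plot can be sketched in a few lines of Python (standard library only). The plotting positions `(i - 0.5) / n` are one common convention among several, and the `straightness` helper, which summarizes how close the points are to a line with a Pearson correlation, is an addition for this illustration and not a formal test. The samples are simulated.

```python
import math
import random
from statistics import NormalDist, mean

random.seed(2)  # reproducible illustration

def qq_points(values):
    """Return (theoretical_quantile, ordered_value) pairs for a normal Q-Q plot."""
    n = len(values)
    std_normal = NormalDist(0, 1)
    ordered = sorted(values)
    # Plotting positions (i - 0.5) / n keep the probabilities strictly in (0, 1).
    return [(std_normal.inv_cdf((i - 0.5) / n), y)
            for i, y in enumerate(ordered, start=1)]

def straightness(points):
    """Pearson correlation between the two axes; values near 1 suggest a straight line."""
    xs, ys = zip(*points)
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical samples for illustration.
normal_data = [random.gauss(100, 15) for _ in range(500)]
skewed_data = [random.expovariate(1.0) for _ in range(500)]

r_normal = straightness(qq_points(normal_data))
r_skewed = straightness(qq_points(skewed_data))
print(f"normal sample: r = {r_normal:.3f}")
print(f"skewed sample: r = {r_skewed:.3f}")
```

Passing the pairs from `qq_points` to any scatter-plot tool gives the usual Q-Q plot; the correlation is only a rough numerical stand-in for judging the line by eye.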
>P-P Plot - The normal P-P plot (a cumulative probability plot, here applied to residuals) is used to determine whether a variable's distribution is compatible with a given distribution. If the standardized residuals are normally distributed, the points should fall on or very close to the normal distribution line. When the points fall largely on that line, the residuals can be treated as normally distributed.
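As a hedged illustration of how P-P plot points can be computed, the Python sketch below (standard library only) pairs each observation's empirical cumulative probability with the cumulative probability expected under a normal distribution. The residuals are simulated stand-ins, and the names `pp_points` and `max_gap` are hypothetical helpers, not part of any particular package.

```python
import random
from statistics import NormalDist, mean, stdev

random.seed(3)  # reproducible illustration

def pp_points(residuals):
    """Return (empirical_probability, normal_probability) pairs for a normal P-P plot."""
    n = len(residuals)
    m, s = mean(residuals), stdev(residuals)
    std_normal = NormalDist(0, 1)
    points = []
    for i, r in enumerate(sorted(residuals), start=1):
        z = (r - m) / s  # standardized residual
        points.append(((i - 0.5) / n, std_normal.cdf(z)))
    return points

def max_gap(points):
    """Largest vertical distance between the points and the diagonal y = x."""
    return max(abs(expected - observed) for observed, expected in points)

# Hypothetical residuals: one normal set and one right-skewed set.
normal_resid = [random.gauss(0, 2) for _ in range(500)]
skewed_resid = [random.expovariate(1.0) for _ in range(500)]

print(f"normal residuals: max gap = {max_gap(pp_points(normal_resid)):.3f}")
print(f"skewed residuals: max gap = {max_gap(pp_points(skewed_resid)):.3f}")
```

Points that hug the diagonal give a small maximum gap, while a skewed set of residuals bends away from the diagonal and produces a visibly larger one.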
Numerical Technique
When visualization techniques are inconclusive, statistical inference (hypothesis testing) can provide a more objective conclusion as to whether our variable deviates significantly from a normal distribution. The Kolmogorov–Smirnov test and the Shapiro–Wilk test are the two most widely used numerical methods for testing normality.
>Shapiro–Wilk test - The Shapiro–Wilk test is better suited to small sample sizes (fewer than 50), but it can also be used with larger samples. It is generally regarded as the most powerful test for detecting departures from normality. It was designed specifically for the normal distribution and cannot be used to test against other distributions. The null hypothesis is that the data are normally distributed: if the P value of the test is greater than 0.05, we do not reject normality and treat the data as normal; otherwise, we conclude that the distribution is not normal.
>Kolmogorov–Smirnov test - The Kolmogorov–Smirnov (KS) test measures the distances between the empirical and theoretical cumulative distribution functions, and the test statistic is defined as the largest of those distances. This KS statistic follows a Kolmogorov distribution if the null hypothesis is true.
The KS test is well known, but it has relatively low power. This means that a large number of observations is needed before it rejects the null hypothesis. It is also affected by outliers. As with the Shapiro–Wilk test, if the P value of the KS test is greater than 0.05 we do not reject normality and treat the data as normal; otherwise, we conclude that the distribution is not normal.
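The KS statistic itself is simple to compute. Below is a minimal Python sketch (standard library only), assuming the reference distribution is fully specified in advance as a standard normal. The value `1.36 / sqrt(n)` is the usual large-sample approximation to the 5% critical value and is used here in place of an exact P value. The samples and the function name `ks_statistic` are illustrative assumptions.

```python
import math
import random
from statistics import NormalDist

random.seed(4)  # reproducible illustration

def ks_statistic(values, dist):
    """Largest absolute gap between the empirical CDF and dist's CDF."""
    n = len(values)
    d = 0.0
    for i, x in enumerate(sorted(values), start=1):
        f = dist.cdf(x)
        # The empirical CDF jumps from (i - 1) / n to i / n at x, so check both sides.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

n = 1000
reference = NormalDist(0, 1)  # hypothesized distribution, fixed in advance
normal_data = [random.gauss(0, 1) for _ in range(n)]
skewed_data = [random.expovariate(1.0) - 1 for _ in range(n)]  # mean 0, but skewed

critical = 1.36 / math.sqrt(n)  # approximate 5% critical value for large n
for label, data in (("normal", normal_data), ("skewed", skewed_data)):
    d = ks_statistic(data, reference)
    verdict = "reject normality" if d > critical else "no evidence against normality"
    print(f"{label}: D = {d:.3f} (critical {critical:.3f}) -> {verdict}")
```

If the mean and standard deviation are estimated from the same sample rather than fixed beforehand, this critical value no longer applies and a corrected version such as the Lilliefors test is needed.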
In Summary
If you only want to look at one variable, use a Q-Q plot; if you want to compare many, use a box plot. If you need to show your findings to a non-statistical audience, use a histogram.
Use the Shapiro–Wilk test as a statistical test to confirm your visual assessment. It is the most powerful of the tests discussed here and should be the deciding factor. When testing against distributions other than the normal, use the Anderson–Darling test or the KS test instead of Shapiro–Wilk.
Watch this Video: «Test of Normality Using R»