PSM 201
Advanced Statistics
Semester – II
Introduction
Inferential Statistics - Parametric and Nonparametric Tests
Inferential statistics is a branch of statistics that involves using data from a sample to make inferences about a larger population. It is concerned with making predictions, generalizations, and conclusions about a population based on the analysis of a sample of data.
Statistical inference, then, is the branch of statistics concerned with drawing conclusions and/or making decisions about a population based only on sample data.
Apart from inferential statistics, descriptive statistics forms another branch of statistics. Inferential statistics help to draw conclusions about the population while descriptive statistics summarizes the features of the data set.
Parametric and Nonparametric Tests
If you are studying statistics, you will frequently come across two terms – parametric and non-parametric tests. These terms are essential for anyone looking to pursue Statistics and Data Science. However, one seldom appreciates the weight of these terms, especially when aiming at a holistic understanding of statistics and its implementation in data science.
Parametric and non-parametric tests are the two main ways of classifying statistical tests. The interesting and complicated aspect of this classification, particularly regarding non-parametric tests, is that there is no single, definitive definition of what constitutes a non-parametric test.
This makes understanding the differences between the two terms more complicated and requires a more nuanced approach.
One common way is to take examples of parametric tests and then discuss their non-parametric counterparts. This is one of the best methods to understand the differences. In this article, we will take this approach to understand the topic at hand.
What are Parametric Tests?
Parametric tests are the backbone of statistics and are an inseparable aspect of data science. This is simply because to interpret many models, especially the predictive models that employ statistical algorithms such as linear regression and logistic regression, you must know about specific parametric tests.
However, to fully grasp the idea of what a parametric test is, there are several aspects of Statistics that need to be on the tip of your fingers. Before proceeding, let’s brush up on these concepts.
1) Population
Population refers to all individuals or subjects of interest that you want to study. Typically, in statistics, you can never fully collect information on the population because:
- Either the population is too large, causing accessibility issues. For example, suppose you want to know the income of all working Indians. In that case, asking about the income of millions of individuals in the organized and unorganized sectors is almost impossible.
- Or the volume and velocity of the population data are too high, which causes hardware issues (limited memory), making it difficult to process such data. For example, if you want to understand the spending pattern of a major bank’s customers, the sheer number of transactions happening at any given moment can be in the millions. Analyzing a month’s data can be so computationally expensive that it is impossible to use the whole data.
2) Parameter
To answer any question about the population, you need numerical summaries that quantify it. Common quantities include the mean, standard deviation, median, minimum, maximum, inter-quartile range, etc. These values, when they describe the population, are known as ‘parameters’.
3) Sample
As mentioned earlier, it becomes difficult to have complete data of the population in question due to various issues. However, to answer many questions, you need to understand the population. This is where the use of samples comes in handy.
A sample is simply a subset of a population that, when drawn properly, represents that population; the central limit theorem (discussed next) is what allows conclusions drawn from samples to be generalized.
4) Central Limit Theorem
To put it roughly, the Central Limit Theorem (CLT) states:
If you draw sufficiently large samples (a sample size above 30 is the usual rule of thumb), then the mean of the sample means will approximate the mean of the population.
Another aspect is that the distribution of the sample means (known as the sampling distribution of the mean) will be approximately normal (Gaussian) even if the population’s distribution is not normal.
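As a rough, hypothetical illustration (assuming NumPy is installed), the sketch below repeatedly draws samples of size 30 from a skewed exponential population and shows that the sample means centre on the population mean and lose most of the skewness:

```python
import numpy as np

rng = np.random.default_rng(42)

# A skewed "population": exponential distribution with true mean 5.
# Draw 5,000 samples of size 30 and record each sample's mean.
sample_means = np.array(
    [rng.exponential(scale=5.0, size=30).mean() for _ in range(5_000)]
)

print("True population mean:     ", 5.0)
print("Mean of the sample means: ", round(sample_means.mean(), 3))

def skewness(x):
    # Simple moment-based skewness
    return ((x - x.mean()) ** 3).mean() / x.std() ** 3

# The exponential population has skewness 2; the sampling distribution is far less skewed
print("Skewness of the sample means:", round(skewness(sample_means), 3))
```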
5) Distribution
Distribution (commonly called data distribution) is a function that states all the possible values of a dataset along the frequency (count) of all values (or intervals as the values can be binned in groups).
The distribution is often represented using graphs like a histogram and a line chart. Different distributions have peculiar shapes and specific properties that help calculate probabilities.
These probabilities are typically regarding the likelihood of a value occurring in the data that can then be extrapolated to form a larger opinion regarding the sample space and the population from where it has been drawn.
6) Types of Distribution
Distributions can be symmetric or asymmetric.
- Symmetrical distributions are those where the area under the curve to the left of the central point is the same as the area to the right.
- Asymmetric distributions are skewed, either positively or negatively. Common examples include the log-normal distribution.
Another way of understanding symmetrical distribution in terms of shape is that there is no skewness as the right side of the distribution mirrors the left side. Common examples include Gaussian, Cauchy, Logistic, Uniform, etc.
7) Gaussian Distribution and the 3-Sigma Rule
Because of the CLT, the sampling distribution of the mean of a large sample is normal, also known as Gaussian. The normal distribution is a symmetric distribution with a bell-shaped curve where the mean, median, and mode coincide.
Specific distributions have specific properties. One property of the normal distribution is the three-sigma rule, which, in terms of the area under the curve (AUC), states that:
- The AUC between -1 and 1 standard deviation is 68.27%
- The AUC between -2 and 2 standard deviations is 95.45%
- The AUC between -3 and 3 standard deviations is 99.73%
This concept is then expanded to calculate the probability of a value occurring in this distribution, which leads to hypothesis tests like the z-test.
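These areas, and the tail probabilities used by tests such as the z-test, can be checked numerically. The sketch below is only illustrative and assumes SciPy is available:

```python
from scipy.stats import norm

# Area under the standard normal curve between -k and +k standard deviations
for k in (1, 2, 3):
    auc = norm.cdf(k) - norm.cdf(-k)
    print(f"AUC within +/-{k} sigma: {auc:.4%}")

# A z-test style tail probability: chance of a value at least 2 SD above the mean
print("P(Z > 2) =", 1 - norm.cdf(2))
```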
8) Hypothesis Testing
Hypothesis Testing is an essential aspect of inferential statistics. As the name suggests, it is used to check if the hypothesis being made regarding the population is true or not.
This is often done by calculating the probability of a value occurring in a population’s sample given the standard deviation in the data. Such tests help validate whether the statistics found through the sample can be extrapolated to form a particular opinion about the population.
9) Statistic
Certain arithmetic values that help define the population are known as parameters. However, as you often use samples, these values are known as statistics when calculated using a sample.
For example, if you know the income of all the Indians and you calculate the mean income from this population data, then this value will be a parameter.
However, when calculated using a sample of this population, the mean is known as a statistic.
To make sure the sample’s mean is truly indicative of the population mean and is not due to random chance, you use the concept of hypothesis testing.
Parametric Test: Definition
A parametric test in statistics is a sub-type of hypothesis test. Parametric hypothesis testing is the most common type of testing done to understand the characteristics of the population from a sample.
While there are many types of parametric tests, and they have certain differences, a few properties are shared across all of them that make them part of the ‘parametric tests’ family. These properties include-
- When using such tests, there needs to be a deep or proper understanding of the population.
- An extension of the above point is that to use such tests, several assumptions regarding the population must be fulfilled (hence a proper understanding of the population is required). A common assumption is that the population should be normally distributed (at least approximately).
- The outputs from such tests cannot be relied upon if the assumptions regarding the population deviate significantly.
- A large sample size is required to run such tests. Theoretically, the sample size should be more than 30 so that the central limit theorem can come into effect, making the sampling distribution of the mean approximately normal.
- Such tests are more powerful, especially compared to their non-parametric counterparts for the same sample size.
- These tests are only helpful with continuous/quantitative variables.
- Measurement of the central tendency (i.e., the central value of data) is typically done using the mean.
- The output from such tests is easy to interpret; however, it can be challenging to understand their workings.
Now, with an understanding of the properties of parametric tests, let’s now understand what non-parametric tests are all about.
What are Non-Parametric Tests?
Let’s consider a situation.
A problem can be solved by using a parametric hypothesis test. However, you cannot fulfill the necessary assumption required to use the test. This assumption can be, for example, regarding the sample size, and there is nothing much you can do about it now.
So, would that mean you can’t do any inferential analysis using the data? The answer is NO.
In hypothesis testing, the other type apart from parametric is non-parametric. Typically, every parametric test has a non-parametric cousin that can be used when the assumptions of the parametric test cannot be fulfilled.
Non-parametric tests do not need a lot of assumptions regarding the population and are less stringent when it comes to the sample requirements.
However, they are less powerful than their parametric counterparts.
This means that a non-parametric test is less likely to conclude that two attributes are associated with each other even when they, in fact, are. To compensate for this lower power, you need to increase the sample size to obtain the result that the parametric counterpart would have provided.
While it’s helpful in solving certain kinds of problems, it is difficult to interpret the results in many cases.
To put this in context, a parametric test can tell that the blood sugar of patients using the new variant of a drug (to control diabetes) is 40 mg/dL lower than that of those patients who used the previous version.
This interpretation is useful and can be used to form an intuitive understanding of what is happening in the population.
On the other hand, its non-parametric counterpart, because it uses rankings, will express the output as a difference of 40 in the mean ranks of the two groups of patients. This is less intuitive and less helpful in forming a definite opinion regarding the population.
To conclude:
While nonparametric tests have the advantage of providing an alternative when you cannot fulfill the assumptions required to run a parametric test or solve an unconventional problem, they have limitations in terms of capability and interpretability.
Now, to gain a practical understanding, let’s explore different types of parametric and non-parametric tests.
Parametric Tests for Hypothesis Testing
To understand the role of parametric tests in statistics, let’s explore various parametric tests types. The parametric tests examples discussed ahead all solve one of the following problems-
- Using standard deviation, find the confidence interval regarding the population
- Compare the mean of the sample with a hypothesized value (that refers to the population mean in some cases)
- Compare two quantitative measurements (typically means) taken from the same subjects
- Compare two quantitative measurements (typically means) taken from two or more distinct groups of subjects
- Understand the association level between two numerical attributes, i.e., quantitative attributes.
Parametric Hypothesis Testing: Types
Test | Description |
---|---|
One-sample z-test | When you need to compare the sample’s mean with a hypothesized value (which often refers to the population mean), a one-sample z-test is used. The test has major requirements, such as the sample size being more than 30 and the population’s standard deviation being known. |
One-sample t-test | If either of the requirements mentioned above cannot be met, you can use another type of parametric test known as the one-sample t-test. Here, if the sample size is at least 15 and the sample’s standard deviation is known, you can use this test. The sample distribution should be approximately normal. |
Paired t-test | The paired t-test is used when data is collected from the same subjects, typically before and after an event; for example, the weight of a group of 10 sportsmen before and after a diet program. To compare the means of the before and after measurements, you use the paired t-test. The assumptions include the before and after values belonging to the same subjects, the pairs of observations being independent of one another, and the differences between the groups being normally distributed. |
Two-sample t-test | In situations where there are two separate samples, for example, house prices in Mumbai versus house prices in Delhi, and you have to check whether the means of these samples are statistically significantly different or not, a two-sample t-test can be used. It assumes that each sample’s data distribution is roughly normal, values are continuous, the variance is equal in both samples, and the samples are independent of each other. |
One-way ANOVA | An extension of the two-sample t-test is one-way ANOVA, where we compare more than two groups. If someone asks whether ANOVA is a parametric test, the answer is a definitive yes. ANOVA analyses the variance of the groups and requires the population distribution to be normal, the variance to be homogeneous, and the groups to be independent. |
Pearson’s coefficient of correlation | To understand the association between two continuous numeric variables, you can use Pearson’s coefficient of correlation. It produces an ‘r’ value, where a value closer to -1 or 1 indicates a strong negative or positive correlation respectively, and a value close to 0 indicates no major correlation between the variables. Part of its assumptions is that both variables in question should be continuous. |
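As a minimal, hypothetical sketch of how these parametric tests can be run in practice (assuming SciPy is installed and using purely synthetic data), consider:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic, illustrative data (not real measurements)
before = rng.normal(80, 5, size=30)           # e.g., weight before a program
after = before - rng.normal(2, 1, size=30)    # e.g., weight after the program
group_a = rng.normal(100, 10, size=40)        # e.g., prices in city A
group_b = rng.normal(105, 10, size=40)        # e.g., prices in city B
group_c = rng.normal(110, 10, size=40)        # a third group for ANOVA

# One-sample t-test: is the mean of group_a different from a hypothesized value of 98?
print(stats.ttest_1samp(group_a, popmean=98))

# Paired t-test: before vs. after measurements on the same subjects
print(stats.ttest_rel(before, after))

# Two-sample (independent) t-test: group_a vs. group_b
print(stats.ttest_ind(group_a, group_b, equal_var=True))

# One-way ANOVA: comparing more than two groups
print(stats.f_oneway(group_a, group_b, group_c))

# Pearson's correlation between two continuous variables
x = rng.normal(size=50)
y = 2 * x + rng.normal(scale=0.5, size=50)
print(stats.pearsonr(x, y))
```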
Non-Parametric Tests for Hypothesis Testing
In the above section, we talked about several parametric tests that can solve different types of statistical inferential problems. All those tests, however, are of the parametric types and have stringent assumptions to be taken care of, which you may or may not be able to fulfill. This is where non-parametric tests are helpful. Common types of non-parametric tests include-
Test | Description |
---|---|
Wilcoxon signed-rank test (one sample) | It is used as an alternative to the one-sample t-test |
Mann-Whitney U test (Wilcoxon rank-sum test) | It can be used as an alternative to the two-sample t-test |
Kruskal-Wallis test | It is an alternative to the parametric test – one-way ANOVA |
Spearman’s rank correlation | You can use this test as an alternative to Pearson’s correlation coefficient. It is especially important when the data is not continuous but in the form of ranks (ordinal data) |
Wilcoxon signed-rank test (paired) | It is an alternative to the parametric test – paired t-test |
There are alternatives to all the parametric tests. So, if you cannot fulfill any assumptions, you can use their respective non-parametric tests.
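A companion sketch for the non-parametric counterparts, again assuming SciPy is available and using synthetic, deliberately skewed data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.lognormal(mean=3.0, sigma=0.5, size=25)   # skewed data
group_b = rng.lognormal(mean=3.2, sigma=0.5, size=25)
group_c = rng.lognormal(mean=3.4, sigma=0.5, size=25)
before = rng.lognormal(mean=3.0, sigma=0.5, size=25)
after = before * rng.normal(0.95, 0.05, size=25)

# Mann-Whitney U test: alternative to the two-sample t-test
print(stats.mannwhitneyu(group_a, group_b))

# Kruskal-Wallis test: alternative to one-way ANOVA
print(stats.kruskal(group_a, group_b, group_c))

# Wilcoxon signed-rank test: alternative to the paired t-test
print(stats.wilcoxon(before, after))

# Spearman's rank correlation: alternative to Pearson's correlation
x = rng.normal(size=40)
y = np.exp(x) + rng.normal(scale=0.2, size=40)   # monotonic but non-linear relationship
print(stats.spearmanr(x, y))
```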
Parametric vs. Non-Parametric Test
With the exploration of parametric and non-parametric tests complete, it’s time to summarize their differences. The following table can help you understand when and where you should use parametric tests or their non-parametric counterparts, along with their advantages and disadvantages.
Now you have a better understanding of the differences between parametric and non-parametric tests and can use the type of test that suits your needs and can provide you with the best results.
Criterion | Parametric | Non Parametric |
---|---|---|
Population | A proper understanding of the population is available | Not much information about the population is available |
Assumptions | Several assumptions are made regarding the population. Incorrect results are produced if the assumptions are not fulfilled | Few or no assumptions are made regarding the population |
Distribution | The distribution of the population is often required to be normal | Do not require the population to be normal; it can be arbitrary |
Sample Size | Require sample size to be over 30 | Can work with small samples |
Interpretability | Are easy to interpret | Are difficult to interpret |
Implementation | Are difficult to implement | Are easy to implement |
Reliability | The output is more powerful/reliable | Are less powerful/reliable |
Type of variable | Works with continuous/quantitative variables | Works with continuous/quantitative as well as categorical/discrete variables |
Central Tendency | Measurement of the central tendency is typically done using mean | Measurement of the central tendency is generally done using median |
Outliers | Affected by outliers | Less affected by outliers |
Null Hypothesis | More accurate | More likely to fail to reject a false null hypothesis; less accurate |
Examples | z-test, t-test, ANOVA, F-test, Pearson coefficient of correlation | One-sample KS test, Wilcoxon signed-rank test, Mann-Whitney U-test, Wilcoxon rank-sum test, Kruskal-Wallis test, Spearman’s rank correlation, Kuiper’s test, Hosmer-Lemeshow test, Chi-Square test for independence |
Univariate Statistics
Data analytics deploys many techniques, each with its own uses and advantages. These techniques include methods to describe data, understand relationships between variables, and assess how a set of features can have a combined effect on a variable. In this article, we will go over these techniques, especially univariate analysis.
What is Univariate Analysis?
Before we get into what is univariate analysis, let’s first understand the univariate meaning.
While ‘uni’ means one, variate indicates a variable. Therefore, univariate analysis is a form of analysis that only involves a single variable. In a practical setting, a univariate analysis means the analysis of a single variable (or column) in a dataset (data table).
Among all the forms of analytical methods that data analysts practice, univariate analysis is considered one of the basic forms of analysis. It is typically the first step to understanding a dataset. The idea of univariate analysis is to first understand the variables individually. Then, you move into analyzing two or more variables simultaneously. There are specific steps to do this, which are discussed next.
Steps to conduct univariate analysis
There are 4 steps to conducting univariate analysis, as follows:
- Accessing the dataset of interest
- Identifying the variable that needs to be analyzed
- Identifying the questions that need to be answered through the analysis
- Determining the appropriate type of univariate analysis techniques to answer the above-identified questions
Statistical packages such as SPSS, SAS, and R are typically used to deploy the various types of univariate techniques. Univariate analysis is also done through other languages common in data analytics and data science, such as Python. Spreadsheets like MS Excel are commonly used for fundamental univariate analysis involving limited data.
However, before you start using any of these tools, you must understand the basic concepts of what is a dataset, specifically the types of columns that form a dataset. This is essential to know because different univariate analysis techniques are used for different types of variables.
What is Univariate data?
A dataset (also known as a feature set or simply a table) is a multidimensional heterogeneous data structure. It is formed by combining multiple one-dimensional data structures that are homogeneous.
For example, a dataset can have multiple columns such as ‘Employee Number’, ‘Name’, ‘Income’, ‘No. of Family Members’, ‘Date of Birth’, ‘Location’, ‘Designation’. These variables can have different data types such as text, number, logical, date, etc. However, univariate data works differently.
Univariate data is not categorized by data type but rather by the purpose it serves or its nature.
In this sense, univariate data (i.e., a single column) can be divided into ID, Numerical, and Categorical. This classification is essential because different types of univariate analysis are required for each type.
To understand univariate data and its classifications, first, follow the dataset below.
Employee No. | Name | Income | No. of Family Members | DOB | Location | Designation |
---|---|---|---|---|---|---|
1 | Alex | $21,060 | 3 | 12/01/1984 | NY | Sr. Manager |
2 | Mate | $59,879.95 | 2 | 31/08/1990 | LA | CEO |
3 | Philip | $30,126.30 | 1 | 01/07/1985 | LA | Co-founder |
4 | Lucy | $19,898 | 3 | 09/11/1986 | NY | Accountant |
5 | Rez | $47,876 | 3 | 10/10/1990 | NY | Sr. Analyst |
Univariate data classifications are as follows:
- ID: This data has no statistical or aggregative properties, and they are used to identify a subject uniquely. For example, the column ‘Employee Number’.
- Numerical (Quantitative): This data has statistical properties. They can be of two types- Discrete and Continuous.
- Discrete: This dataset has discrete values (i.e., no decimals). For example, ‘No. of Family Members’.
- Continuous: This dataset can have numbers with decimals. For example- ‘Income’.
- Categorical (Qualitative): Categorical data deals with descriptions or categories. They have aggregative properties and are of two types- Ordinal and Nominal. (Note- categorical univariate data can have numeric datatype)
- Ordinal: These categories have an order. For example- ‘Designation’ where the order can be Manager, Sr Manager, CEO and cannot be any other.
- Nominal: These categories do not have any order. For example- ‘Location’ has mutually exclusive categories.
Typically, a univariate is data that belongs to any of the types mentioned above. Now, to analyze such a dataset, different types of univariate analysis techniques are used depending on the type of variable in question.
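As a small, hypothetical sketch (assuming pandas is installed), the example dataset above could be represented with data types that reflect these classifications:

```python
import pandas as pd

# Hypothetical re-creation of the example dataset above
df = pd.DataFrame({
    "Employee No.": [1, 2, 3, 4, 5],                                   # ID
    "Name": ["Alex", "Mate", "Philip", "Lucy", "Rez"],                 # ID-like label
    "Income": [21060.00, 59879.95, 30126.30, 19898.00, 47876.00],      # numerical, continuous
    "No. of Family Members": [3, 2, 1, 3, 3],                          # numerical, discrete
    "DOB": pd.to_datetime(
        ["1984-01-12", "1990-08-31", "1985-07-01", "1986-11-09", "1990-10-10"]),
    "Location": pd.Categorical(["NY", "LA", "LA", "NY", "NY"]),        # categorical, nominal
    "Designation": pd.Categorical(
        ["Sr. Manager", "CEO", "Co-founder", "Accountant", "Sr. Analyst"]),  # categorical, ordinal in nature
})

print(df.dtypes)
```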
Types of Univariate Analysis

The primary purpose of univariate analysis is to describe data. Using different techniques, these descriptions are found. These techniques can be categorized into the following groups:
- Graphical
- Tables
- Descriptive statistics
- Inferential statistics (i.e., hypothesis tests based on sampling distributions)
Each of these techniques provides information about the data in a unique way. Typically, a data analyst uses more than one technique to form their opinion about the data they are dealing with, as this helps them make important decisions related to data preparation. Let’s understand each technique.
Graphical analysis
Various types of graphs can be used to understand data. The standard type of graphs include-
- Histograms: A histogram displays the frequency of each value or group of values (bins) in numerical data. This helps in understanding how the values are distributed.
- Boxplot: A boxplot summarizes several important statistics such as the minimum, maximum, median, and 1st and 3rd quartiles. It is beneficial in identifying outliers in the data.
- Density Curve: The density curve helps in understanding the shape of the data’s distribution. It helps answer questions such as if the data is bimodal, normally distributed, skewed, etc.
- Bar Chart: Bar charts, mainly frequency bar charts, are univariate charts used to find the frequency of the different categories of categorical data.
- Pie Chart: Frequency Pie charts convey similar information to bar charts. The difference is that they have a circular formation with each slice indicating the share of each category in the data.
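A minimal, illustrative sketch of these univariate charts, assuming pandas and Matplotlib are available and using a small made-up dataset:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "Income": [21060.00, 59879.95, 30126.30, 19898.00, 47876.00],
    "Location": ["NY", "LA", "LA", "NY", "NY"],
})

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

# Histogram of a continuous variable
axes[0].hist(df["Income"], bins=5)
axes[0].set_title("Histogram of Income")

# Boxplot of the same variable (shows median, quartiles, potential outliers)
axes[1].boxplot(df["Income"])
axes[1].set_title("Boxplot of Income")

# Frequency bar chart of a categorical variable
df["Location"].value_counts().plot(kind="bar", ax=axes[2], title="Bar chart of Location")

plt.tight_layout()
plt.show()
```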
Univariate tables
Tables help in univariate analysis and are typically used with categorical data or numerical data with limited cardinality. Different types of tables include:
- Frequency Tables: Each unique value and its respective frequency in the data is shown through a table. Thus, it summarizes the frequency the way a histogram, frequency bar, or pie chart does but in a tabular manner.
- Grouped Tables: Rather than finding the count of each unique value, the values are binned or grouped, and the frequency of each group is reflected in the table. It is typically used for numerical data with high cardinality.
- Percentage (Proportion) Tables: Rather than showing the frequency of the unique values (or groups), such a table shows their proportion in the data (in percentage).
- Cumulative Proportion Tables: It is similar to the proportion table, with the difference being that the proportion is shown cumulatively. It is typically used with binned data having a distinct order (or with categorical ordinal data).
In some instances, such univariate tables can be used as an alternative to the more graph-based ways of describing the data.
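A brief, hypothetical sketch of these univariate tables using pandas (the column values are made up for illustration):

```python
import pandas as pd

income = pd.Series([21060.00, 59879.95, 30126.30, 19898.00, 47876.00], name="Income")
location = pd.Series(["NY", "LA", "LA", "NY", "NY"], name="Location")

# Frequency table for a categorical variable
freq = location.value_counts()

# Grouped (binned) table for a numerical variable
grouped = pd.cut(income, bins=3).value_counts().sort_index()

# Percentage (proportion) table
prop = location.value_counts(normalize=True) * 100

# Cumulative proportion table (meaningful for binned or ordinal data)
cum_prop = grouped.cumsum() / grouped.sum() * 100

print(freq, grouped, prop, cum_prop, sep="\n\n")
```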
Univariate Statistics
Univariate analysis can be performed in a statistical setting. Two types of statistics can be used here- Descriptive and Inferential.
Descriptive Statistics
As the name suggests, descriptive statistics are used to describe data. The statistics used here are commonly referred to as summary statistics.
For instance, if you have to describe a cube, you have to ‘measure’ it. By measuring its length, breadth, and height, you can describe it. Similarly, these descriptive or univariate statistics have specific measures that help us in describing the data. These measures are-
- Measure of Central Tendency: Statistics such as mean, median, and mode are considered here. They help in summarizing all the data through a single central value.
- Measure of Variability: Analysts also need to understand how the data varies from the central point. To understand this, specific univariate statistics can be calculated, such as range, interquartile range, variance, standard deviation, etc.
- Measure of Shape: The shape of the data distribution can explain a great deal about the data as the shape can help in identifying the type of distribution followed by the data. Each of these distributions has specific properties that can be used to your advantage. By analyzing the shapes, you will know if the data is symmetrical, non-symmetrical, left or right-skewed, is suffering from positive or negative kurtosis, among other things.
These descriptive statistics can be used for calculating things like missing value proportions, upper and lower limits for outliers, and the level of variability through the coefficient of variation.
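As an illustrative sketch, the common descriptive measures can be computed with pandas on a small made-up income column:

```python
import pandas as pd

income = pd.Series([21060.00, 59879.95, 30126.30, 19898.00, 47876.00])

# Measures of central tendency
print("Mean:  ", income.mean())
print("Median:", income.median())

# Measures of variability
print("Range: ", income.max() - income.min())
print("IQR:   ", income.quantile(0.75) - income.quantile(0.25))
print("SD:    ", income.std())
print("Coefficient of variation:", income.std() / income.mean())

# Measures of shape
print("Skewness:", income.skew())
print("Kurtosis:", income.kurtosis())
```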
Inferential Statistics
Often, the data you are dealing with is a subset (sample) of the complete data (population). Thus, the common question here is –
Can the findings of the sample be extrapolated to the population? i.e., Is the sample representative of the population, or has the population changed? Such questions are answered using specific hypothesis tests designed to deal with such univariate data-based problems.
Hypothesis tests help us answer crucial questions about the data and their relation with the population from where they are drawn. Several hypotheses or univariate testing mechanisms come in handy here, such as-
- Z Test: Used for numerical (quantitative) data where the sample size is greater than 30 and the population’s standard deviation is known.
- One-Sample t-Test: Used for numerical (quantitative) data where the sample size is less than 30 or the population’s standard deviation is unknown.
- Chi-Square Test: The chi-square goodness-of-fit test is used with categorical data to compare the observed category frequencies against the expected frequencies
- Kolmogorov-Smirnov Test: Used with numerical data to compare a sample’s distribution against a reference distribution (for example, to check normality)
All such univariate testing methods generate p-values that can be used to reject, or fail to reject, the null hypothesis.
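A minimal sketch of such univariate tests with SciPy, using synthetic data purely for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
sample = rng.normal(loc=170, scale=8, size=40)   # hypothetical heights (cm)

# One-sample t-test: is the mean height different from a hypothesized 168 cm?
print(stats.ttest_1samp(sample, popmean=168))

# Chi-square goodness-of-fit: do observed category counts match expected counts?
observed = [18, 22, 20]            # e.g., counts in three categories
expected = [20, 20, 20]
print(stats.chisquare(f_obs=observed, f_exp=expected))

# Kolmogorov-Smirnov test: does the sample follow a normal distribution?
print(stats.kstest(sample, "norm", args=(sample.mean(), sample.std(ddof=1))))
```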

In practice, all these techniques are used depending upon the situation, the type of data, and the problem statement.
Univariate Analysis Examples
While there can be hundreds of univariate analysis examples where univariate data analysis is used, some of them are-
- Finding the average height of a country’s men from a sample.
- Calculating how reliable a batsman is by computing the variance of their runs.
- Finding which country most frequently wins an Olympic gold medal by creating a frequency bar chart or frequency table.
- Understanding the income distribution of a country by analyzing the distribution’s shape. A right-skewed distribution can indicate an unequal society.
- Checking whether the price of sugar has risen in a statistically significant way from the generally accepted price by using sample survey data. Hypothesis tests such as the z-test or t-test solve such questions.
- Assessing the predictive capability of a variable by calculating the coefficient of variation.
Bi-variate and Multi-variate Analysis
- Bivariate Analysis: Bivariate analysis is performed when two variables are involved. Here, you typically try to understand how two variables affect each other, how they are related, or how they compare to each other. Like univariate data analysis that is performed through graphs, tables, and statistics, bivariate analysis can also be performed somewhat similarly. For example- scatterplots, bar charts, pie charts, multi-line charts, cross-frequency tables, and tests such as dependent t-test, independent t-test, and one-way ANOVA are used for bivariate analysis.
- Multivariate Analysis: So far we have discussed the analysis of single variables. However, when more than two variables are to be analyzed, such an analysis is called multivariate analysis. Typically, predictive problems are solved under such an analysis where many variables are used to predict another variable. However, non-predictive analysis is also performed by creating a correlation matrix, cross-frequency tables, dodged or stacked bar charts, etc.
Bivariate Statistics
Bivariate analysis allows you to investigate the relationship between two variables. It is useful for determining whether there is a correlation between the variables and, if so, how strong the connection is. For researchers conducting a study, this is incredibly helpful.
Bivariate analysis is a type of statistical analysis in which two variables are observed. One variable is dependent while the other is independent. These variables are usually denoted by X and Y. Here, we analyse the changes that occur between the two variables and to what extent. Apart from bivariate analysis, there are two other types of statistical analysis: univariate (for one variable) and multivariate (for multiple variables).
In statistics, we usually interpret the given set of data and make statements and predictions about it. During research, an analysis attempts to determine cause and effect in order to draw conclusions about the given variables.
What is bivariate analysis?
Bivariate analysis is a statistical method that examines how two different things are related. It aims to determine whether there is a statistical link between the two variables and, if so, how strong that link is and in which direction it runs.
It is a helpful technique for determining how two variables are connected and for finding trends and patterns in the data. In statistical analysis, distinguishing between categorical data and numerical data is essential, as categorical data involves distinct categories or labels, while numerical data consists of measurable quantities.
Recognizing bivariate data is a prerequisite for analysis. Data analytics and data analysis are closely related processes that involve extracting insights from data to make informed decisions. Typically, two measures, X and Y, are included, and the bivariate data can be understood as a pair (X, Y).
Definition of Bivariate Analysis
Bivariate analysis is defined as the analysis of any concurrent relation between two variables or attributes. It explores the relationship between two variables, as well as the strength of this relationship, to figure out whether there are differences between the two variables and what causes those differences. Examples include percentage tables, scatter plots, etc.
Importance of bivariate analysis
Bivariate analysis is an important statistical method because it lets researchers look at the relationship between two variables and determine their relationship. This can be helpful in many different kinds of research, such as social science, medicine, marketing, and more.
Here are some reasons why bivariate analysis is important:
- Bivariate analysis helps identify trends and patterns: It can reveal hidden data trends and patterns by evaluating the relationship between two variables.
- Bivariate analysis helps explore cause-and-effect relationships: It can assess whether two variables are statistically associated, which, together with the study design, assists researchers in investigating whether one variable influences the other.
- It helps researchers make predictions: It allows researchers to predict future results by modeling the link between two variables.
- It helps inform decision-making: Business, public policy, and healthcare decision-making can benefit from bivariate analysis.
The ability to analyze the correlation between two variables is crucial for making sound judgments, and this analysis serves this purpose admirably.
Types of bivariate analysis
Many kinds of bivariate analysis can be used to determine how two variables are related. Here are some of the most common types.
1. Scatterplots
A scatterplot is a graph that shows how two variables are related to each other. It shows the values of one variable on the x-axis and the values of the other variable on the y-axis.
The pattern shows what kind of relationship there is between the two variables and how strong it is.
2. Correlation
Correlation is a statistical measure that shows how strong and in what direction two variables are linked.
A positive correlation means that when one variable goes up, so does the other. A negative correlation shows that when one variable goes up, the other one goes down.
3. Regression
Regression analysis models the relationship between a dependent variable and one or more independent variables, giving you access to a range of tools for identifying potential relationships between your data points.
Regression analysis can also provide the equation for the fitted line or curve, and it may additionally report the correlation coefficient.
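A small, hypothetical sketch of correlation and simple linear regression using SciPy on synthetic education and income data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Hypothetical bivariate data: years of education (X) and income (Y)
education = rng.integers(10, 21, size=100)
income = 2_000 * education + rng.normal(0, 5_000, size=100)

# Correlation: strength and direction of the linear relationship
r, p_value = stats.pearsonr(education, income)
print(f"Pearson r = {r:.3f}, p = {p_value:.4f}")

# Simple linear regression: fitted line y = slope * x + intercept
result = stats.linregress(education, income)
print(f"income ~= {result.slope:.1f} * education + {result.intercept:.1f}")
print(f"R-squared = {result.rvalue ** 2:.3f}")
```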
4. Chi-square test
The chi-square test is a statistical method for identifying disparities in one or more categories between what was expected and what was observed. The test’s primary premise is to compare the observed data values with what would be expected if the null hypothesis were valid.
Researchers use this statistical test to compare categorical variables within the same sample group. It also helps to validate or offer context for frequency counts.
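As an illustrative sketch (assuming SciPy is available), a chi-square test of independence on a made-up contingency table might look like this:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows = gender, columns = preferred product category
observed = np.array([
    [30, 10, 20],
    [20, 25, 15],
])

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}, dof = {dof}")
print("Expected counts under independence:\n", expected)
```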
5. T-test
A t-test is a statistical test that compares the means of two groups to see whether they differ significantly. This analysis is appropriate when comparing the averages of the two categories of a categorical variable.
6. ANOVA (Analysis of Variance)
The ANOVA test determines whether the averages of more than two groups differ from one another statistically. It is appropriate for comparing the averages of a numerical variable across more than two categories of a categorical variable.
Example of bivariate analysis
Some examples of bivariate analysis are listed below:
- Investigating the connection between education and income: In this case, one of the variables could be the level of education (e.g., high school, college, graduate school), and the other could be income. A bivariate analysis could be used to determine if there is a significant relationship between these two variables and, if so, how strong and in what direction that relationship is.
- Investigating the connection between aging and blood pressure: Here, age is one variable and blood pressure (systolic and diastolic) is another. A bivariate analysis, with a test of statistical significance, can be conducted to determine whether and how strongly these two factors are related.
These are just a few ways this analysis can be used to determine how two variables are related. The type of data and the research question will determine which techniques and statistical tests are used in the analysis.
Conclusion
The primary question addressed by bivariate analysis is whether or not the two variables are correlated, and if so, whether the relationship is positive or negative and to what degree. Research that relies on inferential statistics typically analyzes two variables in this way. Numerous scientific and commercial projects focus on understanding the link between two continuous variables.
Multivariate Statistics
Introduction
Multivariate analysis is an essential methodological approach in psychology that allows researchers to examine complex relationships between multiple variables simultaneously. Given the intricate nature of human behavior, cognition, and emotions, psychological phenomena rarely operate in isolation. Instead, they involve interdependent variables, requiring statistical techniques that can handle multiple predictors, outcomes, and mediating relationships. Multivariate analysis encompasses a variety of techniques such as factor analysis, multiple regression, multivariate analysis of variance (MANOVA), structural equation modeling (SEM), and cluster analysis, each serving distinct research purposes. These methods enable psychologists to develop theoretically grounded models, assess latent constructs, and derive meaningful insights that would be impossible with univariate or bivariate approaches.
The importance of multivariate analysis in psychology extends beyond statistical sophistication; it fundamentally reshapes how psychological constructs are measured, analyzed, and interpreted. In clinical psychology, it facilitates the identification of complex symptom patterns and diagnostic classification. In cognitive psychology, it allows for modeling mental processes such as perception, memory, and decision-making. In personality psychology, it helps uncover latent traits and the structure of individual differences. In social psychology, it is instrumental in understanding group dynamics, attitudes, and social behavior. Furthermore, advances in computing power and statistical software have made multivariate methods more accessible, enabling psychologists to conduct large-scale, high-dimensional studies with greater precision.
Theoretical Foundations of Multivariate Analysis in Psychology
The development of multivariate analysis in psychology is rooted in both mathematical statistics and theoretical psychology. One of the earliest motivations for multivariate analysis was the recognition that psychological constructs—such as intelligence, personality, and motivation—are inherently multidimensional. Unlike physical sciences, where variables can often be measured directly, psychology deals with latent constructs, which require multiple indicators for valid assessment.
The application of multivariate methods in psychology can be traced back to the work of Charles Spearman, who introduced factor analysis in the early 20th century to study intelligence. His two-factor theory of intelligence, which distinguished between general intelligence (g-factor) and specific abilities (s-factors), laid the groundwork for multivariate thinking in psychology. Subsequently, Raymond Cattell expanded factor analysis to develop his 16 Personality Factor (16PF) Model, demonstrating how personality traits could be statistically derived from large datasets.
The rise of structural equation modeling (SEM) in the mid-20th century further solidified multivariate approaches in psychology, allowing researchers to test theoretical models with latent variables. The integration of psychometric theory, experimental psychology, and computational statistics led to a growing recognition that psychological phenomena must be studied within an interconnected, multivariate framework.
Multivariate analysis is distinguished by its ability to simultaneously analyze multiple dependent and independent variables, making it particularly valuable for testing complex psychological theories. Unlike univariate methods, which assess relationships between a single predictor and outcome, multivariate techniques allow for the modeling of interdependencies, mediation, moderation, and higher-order interactions.
Key Multivariate Techniques in Psychology
Among the most widely used multivariate techniques in psychology are factor analysis, multiple regression, MANOVA, SEM, cluster analysis, and canonical correlation analysis. Each method serves a specific research purpose and is used in different branches of psychology.
Factor analysis is a cornerstone of psychological research, particularly in personality psychology, cognitive psychology, and psychometrics. It is used to identify latent constructs that underlie observed variables, reducing complex datasets into meaningful dimensions. Exploratory Factor Analysis (EFA) helps discover underlying structures without predefined hypotheses, whereas Confirmatory Factor Analysis (CFA) is used to test theoretically specified models. Factor analysis has been instrumental in the development of intelligence tests, personality inventories, and clinical assessments, such as the Minnesota Multiphasic Personality Inventory (MMPI) and the Big Five Personality Model.
Multiple regression analysis is extensively used in clinical, social, and cognitive psychology to predict psychological outcomes based on multiple predictor variables. It allows researchers to determine the relative contribution of different independent variables while controlling for confounding factors. For example, in clinical psychology, multiple regression is used to predict depression severity based on cognitive distortions, social support, and biological markers.
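A minimal, hypothetical sketch of such a multiple regression, assuming statsmodels is installed and using purely synthetic stand-ins for the predictors mentioned above:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 200

# Synthetic, illustrative predictors (not real clinical data)
cognitive_distortions = rng.normal(size=n)
social_support = rng.normal(size=n)
biomarker = rng.normal(size=n)

# Synthetic outcome: depression severity as a linear combination plus noise
depression = (0.6 * cognitive_distortions
              - 0.4 * social_support
              + 0.2 * biomarker
              + rng.normal(scale=0.8, size=n))

# Ordinary least squares with an intercept term
X = sm.add_constant(np.column_stack([cognitive_distortions, social_support, biomarker]))
model = sm.OLS(depression, X).fit()
print(model.summary())
```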
Multivariate Analysis of Variance (MANOVA) extends traditional ANOVA by allowing for multiple dependent variables, making it particularly useful in experimental psychology and neuroscience. It is applied in cognitive research, where multiple cognitive outcomes (e.g., reaction time, accuracy, and neural activation) need to be analyzed simultaneously.
Structural Equation Modeling (SEM) represents an advanced multivariate approach that combines factor analysis and path analysis, allowing for complex hypothesis testing involving latent variables. SEM is widely used in psychology to test causal models, mediation effects, and longitudinal relationships. One of its greatest advantages is its ability to handle measurement error, making it a preferred method in psychometric validation studies.
Cluster analysis is frequently employed in clinical and health psychology to classify individuals into homogeneous subgroups based on shared psychological characteristics. It has been used to identify subtypes of depression, personality disorders, and cognitive impairment patterns.
Canonical correlation analysis is used to examine relationships between two sets of multiple variables, making it particularly useful in neuropsychology and developmental psychology. It allows researchers to study how cognitive functions (e.g., memory, attention, and executive function) are related to different brain regions or genetic factors.
Applications of Multivariate Analysis in Psychological Research
Multivariate analysis has been widely applied in various branches of psychology, significantly enhancing theoretical understanding and practical assessment.
In clinical psychology, multivariate methods help identify risk factors for mental disorders, evaluate treatment effectiveness, and develop diagnostic models. For example, SEM has been used to test cognitive models of anxiety and depression, examining how negative thinking patterns mediate the relationship between childhood trauma and psychopathology.
In cognitive psychology, multivariate analysis is critical in understanding how different cognitive processes interact. Neuroscientific studies frequently use multivariate pattern analysis (MVPA) to analyze fMRI data, revealing how distributed neural networks encode information.
In social psychology, multivariate approaches allow for the examination of complex social behaviors, attitudes, and group dynamics. Researchers use SEM to study how personality traits, situational factors, and social norms jointly influence decision-making and prejudice formation.
In developmental psychology, multivariate methods facilitate longitudinal studies that track psychological changes over time. Latent growth modeling (LGM), a form of SEM, is used to study how intelligence, language development, and emotional regulation evolve across childhood and adolescence.
In personality psychology, factor analysis has been instrumental in refining trait-based theories, leading to the establishment of the Big Five Personality Model.
Challenges and Future Directions
Despite its advantages, multivariate analysis in psychology presents several challenges. Statistical assumptions such as normality, multicollinearity, and sample size requirements must be carefully addressed. Model complexity in SEM can lead to overfitting and misinterpretation, and interpreting factor loadings and latent constructs remains a subjective process. Furthermore, the rise of machine learning and artificial intelligence introduces both opportunities and challenges, requiring psychologists to integrate advanced computational techniques with traditional statistical methods.
Future directions include the integration of multivariate techniques with neuroimaging, genetic research, and large-scale behavioral datasets, enhancing our ability to understand the biopsychosocial determinants of human behavior.
Conclusion
Multivariate analysis is indispensable in psychology, enabling researchers to analyze complex relationships, develop robust theoretical models, and improve psychological assessment. Its applications span across clinical, cognitive, social, developmental, and personality psychology, providing deeper insights into the intricate mechanisms of human thought and behavior. As computational power increases, multivariate methods will continue to evolve, offering new possibilities for advanced psychological research and real-world applications.
Data Screening for Statistical Analysis
Introduction
Data screening is a crucial preliminary step in statistical analysis, ensuring that datasets meet the necessary conditions for valid and reliable inferences. The integrity of any statistical investigation, whether in psychology, social sciences, medical research, or business analytics, depends on the quality of data used. Even the most sophisticated statistical models are rendered meaningless if they are based on incomplete, inconsistent, or erroneous data. Data screening involves a systematic process of examining datasets for accuracy, completeness, normality, outliers, multicollinearity, and missing values, among other potential issues.
The significance of data screening is often underestimated, yet failing to address data anomalies can lead to biased estimates, invalid conclusions, and misleading research findings. In psychological research, for example, failure to check for outliers or assumption violations in multivariate analyses can distort effect sizes and result in Type I or Type II errors. In medical and health sciences, improperly screened data can lead to incorrect clinical recommendations, jeopardizing patient safety. In machine learning and artificial intelligence, datasets with systematic biases or outliers can negatively impact predictive accuracy, leading to erroneous decision-making.
Conceptual Foundations of Data Screening
Data screening is rooted in classical statistical theory and modern data science methodologies, providing a foundation for rigorous hypothesis testing and model development. The fundamental goal is to ensure that data conform to the assumptions underlying statistical techniques. Many statistical models, including linear regression, ANOVA, and structural equation modeling (SEM), rely on specific assumptions such as normality, homoscedasticity, and independence of observations. Violations of these assumptions can lead to skewed results, incorrect parameter estimates, and inflated standard errors.
The necessity of data screening becomes even more evident in the era of big data and machine learning, where automated algorithms rely on high-dimensional datasets. While traditional statistical models assume well-structured data, modern analytical techniques must contend with heterogeneous, incomplete, and noisy datasets. Consequently, rigorous data screening is essential not only for parametric tests but also for predictive modeling, deep learning, and network analysis.
Psychological research, for instance, often relies on self-reported survey data, which are prone to issues such as social desirability bias, response inconsistency, and missing values. Similarly, biomedical datasets contain sensor-based data, genetic markers, and patient records, all of which require extensive preprocessing to ensure validity. These realities highlight the critical role of data screening as a prerequisite for high-quality research and decision-making.
Key Steps in Data Screening
The data screening process involves a systematic evaluation of various aspects of a dataset, including accuracy, completeness, assumption adherence, and the identification of anomalies. Each step is crucial for ensuring that statistical analyses yield valid and interpretable results.
One of the first steps in data screening is checking for data entry errors. In manually collected datasets, such as surveys and experiments, errors in data entry can introduce significant biases. Common data entry errors include typographical mistakes, duplicate entries, missing labels, and incorrectly coded variables. Modern statistical software such as SPSS, R, SAS, and Python’s pandas library provide tools to detect and correct such anomalies, ensuring that datasets reflect the true underlying values.
Handling missing data is another crucial aspect of data screening. Missing data can occur due to participant non-responsiveness, data corruption, or equipment failure. The presence of missing data can bias statistical estimates and reduce statistical power. Researchers commonly use three primary approaches to deal with missing data: deletion methods, imputation techniques, and model-based corrections.
Listwise deletion, where cases with missing values are removed entirely, is often the simplest approach but can lead to a loss of valuable data and reduced generalizability. Pairwise deletion, which retains partial information, is sometimes preferable but may introduce inconsistencies in sample sizes across analyses. Imputation methods, such as mean imputation, multiple imputation (MI), and expectation-maximization (EM) algorithms, allow for more sophisticated handling of missing data by estimating missing values based on observed data patterns. Modern techniques, such as Bayesian imputation and machine learning-based imputations, offer superior accuracy by leveraging probabilistic models.
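A brief, illustrative sketch of basic missing-data handling with pandas (mean imputation is shown only as the simplest case; multiple imputation is usually preferable):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 29],
    "score": [88, np.nan, 75, 90, np.nan],
})

# How much is missing, per column?
print(df.isna().mean())

# Listwise deletion: drop any row with a missing value
listwise = df.dropna()

# Simple mean imputation (a basic approach, shown for illustration only)
mean_imputed = df.fillna(df.mean(numeric_only=True))

print(listwise, mean_imputed, sep="\n\n")
```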
Outlier detection is another vital step in data screening. Outliers are data points that deviate significantly from the overall distribution, potentially skewing statistical results. Outliers can arise due to measurement errors, data entry mistakes, or genuine variability in the dataset. Detecting outliers requires both visual and statistical approaches. Graphical methods, such as boxplots, histograms, and scatterplots, provide intuitive ways to identify outliers. Statistical methods, including z-scores, Mahalanobis distance, and Cook’s distance, quantify the influence of extreme values on the dataset.
Once detected, outliers must be carefully examined to determine whether they should be retained, transformed, or removed. Simply removing outliers without justification can introduce selection bias, while retaining extreme values without adjustment can distort parametric statistical assumptions. Techniques such as log transformations, winsorization, and robust statistical methods can mitigate the effects of extreme values while preserving data integrity.
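As a rough sketch (using made-up data), outliers can be flagged with z-scores or the IQR rule and capped by winsorization:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
values = pd.Series(np.append(rng.normal(50, 5, size=99), 120))  # one extreme value

# Z-score rule: flag points more than 3 SD from the mean
z_scores = (values - values.mean()) / values.std()
print("Z-score outliers:", values[np.abs(z_scores) > 3].tolist())

# IQR rule (the basis of boxplot whiskers)
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
mask = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
print("IQR outliers:", values[mask].tolist())

# Winsorization: cap extreme values instead of removing them
capped = values.clip(lower=values.quantile(0.01), upper=values.quantile(0.99))
```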
Assessing normality is essential for many statistical procedures that assume a Gaussian distribution of data. Normality checks are particularly relevant for tests such as t-tests, ANOVA, and multiple regression, which assume that residuals follow a normal distribution. Normality can be assessed using histograms, Q-Q plots, skewness and kurtosis statistics, and formal tests such as the Shapiro-Wilk test and Kolmogorov-Smirnov test.
Violations of normality can be addressed using data transformations, such as logarithmic, square root, or Box-Cox transformations. However, in large datasets, normality assumptions become less critical due to the central limit theorem, which states that the sampling distribution of the mean tends to be normal regardless of the shape of the population distribution.
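A small, illustrative sketch of normality checks and a log transformation with SciPy, on deliberately skewed synthetic data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
residuals = rng.lognormal(mean=0, sigma=0.6, size=200)   # deliberately right-skewed

# Formal normality test (null hypothesis: the data are normally distributed)
print("Shapiro-Wilk:", stats.shapiro(residuals))
print("Skewness:", stats.skew(residuals), "Kurtosis:", stats.kurtosis(residuals))

# A log transformation often restores approximate normality for right-skewed data
print("Shapiro-Wilk after log transform:", stats.shapiro(np.log(residuals)))
```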
Checking for multicollinearity is crucial in multivariate analyses such as multiple regression, factor analysis, and SEM, where predictor variables should be independent of each other. Multicollinearity occurs when predictors are highly correlated, leading to unstable regression coefficients and inflated standard errors. It can be detected using variance inflation factor (VIF) scores, tolerance statistics, and correlation matrices. If multicollinearity is detected, solutions include removing redundant variables, combining correlated predictors, or using ridge regression techniques.
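A minimal sketch of VIF-based multicollinearity checks, assuming statsmodels is available and using synthetic, deliberately collinear predictors:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(7)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.1, size=100)   # deliberately almost collinear with x1
x3 = rng.normal(size=100)

X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)   # VIF well above 5-10 for x1 and x2 signals multicollinearity
```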
Testing for linearity and homoscedasticity ensures that the relationships between variables are consistent across different values. Linearity assumes that the relationship between predictors and outcomes follows a straight-line pattern, which can be assessed using scatterplots and residual plots. Homoscedasticity, the assumption that residual variance remains constant across values of an independent variable, can be checked using Levene’s test and Breusch-Pagan test. Violations can be corrected using weighted least squares regression or transforming dependent variables.
Challenges and Future Directions in Data Screening
Despite advancements in statistical methodologies, data screening remains a challenging and resource-intensive process. Large-scale datasets, such as those used in neuroscience, genomics, and artificial intelligence, present unique challenges in terms of missing data, high-dimensionality, and heterogeneity. Automating data screening using machine learning techniques, anomaly detection algorithms, and Bayesian modeling offers promising solutions for handling complex datasets with minimal manual intervention.
The integration of real-time data monitoring and preprocessing pipelines in experimental psychology and clinical trials also enhances data screening efficiency, ensuring that data integrity is maintained throughout the research process.
Conclusion
Data screening is an indispensable process that underpins the validity and reliability of statistical analyses across disciplines. By systematically addressing data entry errors, missing data, outliers, normality violations, multicollinearity, and assumption adherence, researchers can ensure that their findings are robust and generalizable. As datasets grow in complexity, future advancements in automated data preprocessing, AI-driven anomaly detection, and real-time validation techniques will further enhance the efficiency and accuracy of data screening. Rigorous data screening remains the foundation of credible, high-quality scientific research and evidence-based decision-making.
Preparation for Statistical Analysis
Statistical analysis serves as a fundamental pillar in quantitative research, ensuring that empirical data are systematically examined to derive meaningful conclusions. The process of preparing for statistical analysis is an essential phase in research methodology, requiring meticulous planning, data integrity assessment, and appropriate selection of statistical techniques. The importance of this preparatory phase cannot be overstated, as errors at this stage can lead to biased results, incorrect inferences, and ultimately flawed conclusions.
Research Design and Methodological Frameworks
Before any statistical analysis can be conducted, the research design must be rigorously established. The choice of research design—whether experimental, quasi-experimental, longitudinal, cross-sectional, or case-control—directly influences the type of statistical tests applicable. A well-structured research design ensures that the data collected are suitable for inferential or descriptive statistics and minimizes confounding variables, which can otherwise introduce bias.
One of the fundamental aspects of research design is defining the research question and hypothesis. The formulation of hypotheses follows the null hypothesis (H₀) and the alternative hypothesis (H₁) framework, where the null hypothesis represents the absence of an effect or relationship, while the alternative hypothesis suggests the presence of one. The research question determines whether the study will employ parametric or non-parametric tests, whether it will focus on causality or correlation, and the appropriate level of measurement (nominal, ordinal, interval, or ratio).
Moreover, the selection of sampling techniques is pivotal in ensuring representativeness and external validity. Probability sampling techniques such as simple random sampling, stratified sampling, and cluster sampling are preferred in hypothesis-driven research, whereas non-probability sampling methods, such as convenience or purposive sampling, may be used in exploratory research where generalizability is not the primary focus. An a priori power analysis should be conducted to determine the minimum sample size required to detect an effect of a given magnitude with adequate power, thereby controlling the risk of Type II errors while the Type I error rate is held at the chosen significance level.
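As a minimal sketch (assuming a medium standardized effect size of d = 0.5, α = .05, and desired power of .80, all hypothetical inputs), the required per-group sample size for an independent-samples t-test can be computed with statsmodels:

```python
# Illustrative a priori power analysis for an independent-samples t-test
# (assumed inputs: medium effect size d = 0.5, alpha = .05, power = .80).
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80,
                                   alternative='two-sided')
print(f"Required sample size per group: {round(n_per_group)}")   # roughly 64
```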
Data Collection and Instrumentation Validity
The integrity of statistical analysis is contingent on the quality of data collection. Several critical considerations must be accounted for, including instrument reliability, validity, and precision. The instruments used for data collection—such as surveys, psychological scales, physiological measurements, or experimental apparatus—must be validated to ensure they measure what they intend to measure. Construct validity, content validity, and criterion validity are essential measures to establish before data collection begins.
Equally crucial is ensuring instrument reliability, which is commonly assessed through Cronbach’s alpha (for internal consistency), test-retest reliability (for temporal stability), and inter-rater reliability (for observer agreement). Measurement errors, whether systematic or random, must be identified and mitigated before statistical analysis can proceed.
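To make the internal-consistency idea concrete, the following sketch computes Cronbach’s alpha from a simulated respondent-by-item matrix; the data and the five-item scale are hypothetical.

```python
# Minimal sketch of Cronbach's alpha for internal consistency, computed from
# a hypothetical item matrix (rows = respondents, columns = items).
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: 2-D array of shape (n_respondents, n_items)."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

rng = np.random.default_rng(2)
latent = rng.normal(size=(100, 1))
responses = latent + rng.normal(scale=0.8, size=(100, 5))   # 5 correlated items
print(f"Cronbach's alpha: {cronbach_alpha(responses):.2f}")
```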
Additionally, missing data is an inevitable challenge in empirical research. The handling of missing data requires strategic decisions—whether through listwise deletion, pairwise deletion, mean imputation, regression imputation, or multiple imputation methods—depending on the missing data mechanism (MCAR – Missing Completely at Random, MAR – Missing at Random, MNAR – Missing Not at Random). Failing to account for missing data appropriately can introduce significant bias, affecting both the validity and reliability of statistical outcomes.
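The sketch below contrasts simple mean imputation with an iterative, multiple-imputation-style approach from scikit-learn on a hypothetical dataset with values deleted at random; in practice, the choice of method should follow from the assumed missingness mechanism.

```python
# Sketch of two common missing-data strategies on a hypothetical dataset:
# simple mean imputation versus an iterative (MICE-style) approach.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, IterativeImputer

rng = np.random.default_rng(3)
df = pd.DataFrame(rng.normal(size=(200, 3)), columns=["anxiety", "sleep", "stress"])
df.loc[df.sample(frac=0.1, random_state=3).index, "sleep"] = np.nan   # 10% missing

mean_filled = SimpleImputer(strategy="mean").fit_transform(df)
iterative_filled = IterativeImputer(random_state=3).fit_transform(df)
print(pd.DataFrame(iterative_filled, columns=df.columns).describe().round(2))
```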
Data Cleaning and Preparation for Analysis
Once data collection is completed, data cleaning becomes an indispensable process before statistical tests are conducted. Raw data often contain errors, inconsistencies, and potential outliers that can distort analytical results.
Data cleaning involves several key steps, illustrated in the brief sketch that follows this list:
- Checking for duplicates: Ensuring that no duplicate entries exist, particularly in large datasets where human error may introduce redundancy.
- Identifying and handling outliers: Using statistical techniques such as Z-scores, Mahalanobis distance, and Cook’s distance, researchers can detect outliers that may disproportionately influence statistical models.
- Ensuring consistency of variable coding: Recoding categorical variables to ensure consistency in classification is crucial for correct interpretation in software packages such as SPSS, R, Stata, or Python.
- Checking for data normality: Statistical techniques such as the Kolmogorov-Smirnov test, Shapiro-Wilk test, and visual inspection of Q-Q plots and histograms determine whether parametric tests can be applied.
- Addressing multicollinearity: In regression analysis, the presence of high variance inflation factors (VIFs) can indicate multicollinearity, requiring remedial measures such as variable elimination or principal component analysis.
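As referenced above, the following condensed sketch walks through these steps on a simulated dataset: removing duplicates, screening univariate outliers with z-scores, checking normality with the Shapiro-Wilk test, and computing variance inflation factors. All variable names and thresholds are illustrative.

```python
# Condensed data-cleaning sketch on a hypothetical dataset: duplicates,
# z-score outlier screening, Shapiro-Wilk normality check, and VIFs.
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(4)
df = pd.DataFrame(rng.normal(size=(150, 3)), columns=["x1", "x2", "y"])
df["x3"] = df["x1"] * 0.9 + rng.normal(scale=0.1, size=150)   # nearly collinear with x1

# 1. Duplicates
df = df.drop_duplicates()

# 2. Outliers: keep only rows with all |z| < 3
z = np.abs(stats.zscore(df))
df = df[(z < 3).all(axis=1)]

# 3. Normality of the outcome variable
w_stat, p_norm = stats.shapiro(df["y"])
print(f"Shapiro-Wilk p = {p_norm:.3f}")

# 4. Multicollinearity: VIF > 10 is a common warning threshold
X = sm.add_constant(df[["x1", "x2", "x3"]])
vifs = {col: variance_inflation_factor(X.values, i)
        for i, col in enumerate(X.columns) if col != "const"}
print(vifs)
```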
Assessing Statistical Assumptions
Every statistical test relies on assumptions that must be validated before the test is applied. Common assumptions include the following (a short sketch after the list illustrates how two of them can be checked in practice):
- Normality: Many parametric tests assume that the data follows a normal distribution. If violated, transformations such as log, square root, or Box-Cox transformations can be applied.
- Homoscedasticity: Tests such as Levene’s test ensure that variances are equal across groups. If violated, alternative tests such as Welch’s t-test or non-parametric tests should be considered.
- Independence: Observations must be independent of one another, which is essential in experimental designs to prevent pseudoreplication.
- Linearity: In regression models, scatterplots and correlation matrices help identify whether relationships between variables are linear. If not, non-linear transformations or polynomial regression models may be required.
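The short sketch referenced above uses two simulated groups with unequal variances to show how a homoscedasticity check (Levene’s test) can guide the choice between Student’s and Welch’s t-test.

```python
# Sketch of an assumption-driven choice between Student's and Welch's t-test
# on two simulated groups with unequal variances (hypothetical data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
group_a = rng.normal(loc=50, scale=5, size=40)
group_b = rng.normal(loc=54, scale=15, size=40)

# Homoscedasticity check: Levene's test
_, p_levene = stats.levene(group_a, group_b)

# Fall back to Welch's t-test (equal_var=False) when variances differ
equal_var = p_levene >= 0.05
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=equal_var)
test_name = "Student's t" if equal_var else "Welch's t"
print(f"Levene p = {p_levene:.3f}; {test_name}: t = {t_stat:.2f}, p = {p_value:.4f}")
```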
Failure to meet these assumptions can result in misleading p-values and erroneous interpretations of statistical significance.
Exploratory Data Analysis (EDA) and Preliminary Insights
Before conducting hypothesis testing, exploratory data analysis (EDA) provides an initial understanding of the dataset. EDA involves visualizing data through histograms, boxplots, scatterplots, and density plots to detect patterns, trends, and anomalies. Summary statistics such as mean, median, standard deviation, skewness, and kurtosis provide essential descriptive insights that guide subsequent statistical modeling.
Correlation matrices and principal component analysis (PCA) can be employed to assess the relationships between multiple variables, identifying latent structures within the dataset. Such exploratory techniques often reveal unexpected patterns that inform further hypothesis refinement.
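A brief sketch of these exploratory steps on a hypothetical four-variable dataset, combining summary statistics, skewness and kurtosis, a correlation matrix, and a two-component PCA:

```python
# Brief EDA sketch on simulated data: summary statistics (including skewness
# and kurtosis), a correlation matrix, and a two-component PCA.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(6)
df = pd.DataFrame(rng.normal(size=(120, 4)),
                  columns=["memory", "attention", "anxiety", "sleep"])

print(df.describe().round(2))          # mean, sd, quartiles
print(df.skew().round(2), df.kurt().round(2), sep="\n")
print(df.corr().round(2))              # correlation matrix

pca = PCA(n_components=2)
components = pca.fit_transform(StandardScaler().fit_transform(df))
print("Variance explained:", pca.explained_variance_ratio_.round(2))
```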
Software Selection and Computational Considerations
Modern statistical analysis is heavily reliant on computational tools. The choice of software depends on the complexity of the analysis:
- SPSS – Widely used for social sciences research due to its user-friendly interface.
- R and Python – Preferred for high-level statistical modeling, machine learning, and data visualization.
- Stata and SAS – Common in econometrics and medical research due to their powerful data manipulation capabilities.
Researchers must also consider data security, ethical considerations, and reproducibility, ensuring that analytical workflows are transparent and replicable. Open science frameworks encourage researchers to publish code, datasets, and methodologies, facilitating peer verification and research integrity.
Conclusion
Preparation for statistical analysis is an intricate and rigorous process that lays the foundation for valid and reliable research findings. It encompasses careful research design, meticulous data collection, robust data cleaning, validation of statistical assumptions, and exploratory analysis. Any compromise in these preparatory stages can lead to misleading conclusions, diminishing the credibility of the research. By adhering to best practices in statistical preparation, researchers enhance the reliability, validity, and reproducibility of their findings, ensuring that statistical inference is both scientifically sound and methodologically rigorous.
Analysis of Group Differences
Exploring Relationship
Constructing and Testing Models
Non-Parametric Statistics