Linear Regression
Linear regression is one of the most widely known modeling techniques, and it is usually among the first topics people pick up when learning predictive modeling. In this technique, the dependent variable is continuous, the independent variable(s) can be continuous or discrete, and the regression line is linear.
Linear regression establishes a relationship between the dependent variable (Y) and one or more independent variables (X) using a best-fit straight line (also known as the regression line). It is written as Y = a + b*X + e, where a is the intercept, b is the slope of the line, and e is the error term. This equation can be used to predict the value of the target variable based on the given predictor variable(s).
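As a quick illustration, here is a minimal sketch of fitting Y = a + b*X with scikit-learn; the data are simulated, and the true values a = 2 and b = 3 are assumptions chosen so the recovered coefficients can be checked by eye.

```python
# Minimal sketch: fitting Y = a + b*X with scikit-learn on simulated data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))               # one continuous predictor
y = 2.0 + 3.0 * X[:, 0] + rng.normal(0, 1, 100)     # a=2, b=3, plus noise e

model = LinearRegression().fit(X, y)
print("intercept a:", model.intercept_)             # should be close to 2
print("slope b:", model.coef_[0])                   # should be close to 3
print("prediction at X=5:", model.predict([[5.0]])[0])
```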
Logistic Regression
Logistic regression is used to find the probability of event = Success and event = Failure. We should apply logistic regression when the dependent variable is binary (0/1, True/False, Yes/No). Here, the value of Y ranges from 0 to 1, as represented by the equations below.
odds = p / (1 - p) = probability of event occurrence / probability of event not occurring
ln(odds) = ln(p / (1 - p))
logit(p) = ln(p / (1 - p)) = b0 + b1*X1 + b2*X2 + b3*X3 + ... + bk*Xk
Above, p is the probability of the presence of the characteristic of interest. A natural question to ask here is, "Why did we take the log in the equation?" Since the dependent variable follows a binomial distribution, we need a link function suited to that distribution, and the logit is that link function. Also, in the equation above the parameters are chosen to maximize the likelihood of observing the sample values, rather than to minimize the sum of squared errors (as in ordinary regression).
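A minimal sketch, on simulated data, of fitting such a model with scikit-learn's LogisticRegression, which maximizes the likelihood rather than minimizing squared errors; the coefficient values below are assumptions used only to generate the example.

```python
# Minimal sketch: binary outcome modeled with logistic regression;
# coefficients are fit by maximizing the likelihood, not least squares.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                      # two predictors
logit = 0.5 + 1.5 * X[:, 0] - 1.0 * X[:, 1]        # b0 + b1*X1 + b2*X2
p = 1 / (1 + np.exp(-logit))                       # inverse of the logit link
y = rng.binomial(1, p)                             # binary outcome (0/1)

clf = LogisticRegression().fit(X, y)
print("coefficients:", clf.coef_)                  # roughly [1.5, -1.0]
print("P(y=1) for a new point:", clf.predict_proba([[0.2, -0.3]])[0, 1])
```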
Polynomial Regression
If the power of the independent variable is more than 1, the regression equation is a polynomial regression equation. A polynomial equation is represented by the equation below:
y = a + b*x^2
The best fit line in this regression procedure is not a straight line. Rather, it's a curve that fits the data points.
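A minimal sketch of fitting such a curve by expanding the input into polynomial features before an otherwise ordinary linear fit; the degree and the simulated quadratic signal are assumptions for illustration.

```python
# Minimal sketch: fitting a curve y = a + b*x^2 by adding polynomial
# features in front of an ordinary linear regression.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(100, 1))
y = 1.0 + 2.0 * x[:, 0] ** 2 + rng.normal(0, 0.5, 100)   # quadratic signal

model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(x, y)
print("prediction at x=2:", model.predict([[2.0]])[0])   # near 1 + 2*4 = 9
```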
Stepwise Regression
This type of regression is used when dealing with several independent variables. Here, the selection of independent variables is done with an automatic process that requires no human intervention. This is achieved by observing statistical values such as R-square, t-stats, and the AIC metric to identify significant variables. Stepwise regression fits a regression model by adding or dropping covariates one at a time according to a specified criterion. Some of the most commonly used stepwise regression methods are listed below:
Standard stepwise regression does two things: at each step, it adds or removes predictors as needed.
Forward selection starts with the most significant predictor and adds a variable at each step.
Backward elimination starts with all predictors in the model and removes the least significant variable at each step.
The aim of this modeling technique is to maximize prediction power with the smallest possible number of predictor variables. It is one of the methods for handling data sets with higher dimensionality.
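Classical p-value-driven stepwise selection is not built into scikit-learn, but SequentialFeatureSelector is a close, cross-validation-based analogue that adds (forward) or removes (backward) one predictor at a time. The sketch below runs it on synthetic data; the number of features to select is an assumption.

```python
# Forward selection via scikit-learn's SequentialFeatureSelector: one
# predictor is added per step based on cross-validated model score.
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

selector = SequentialFeatureSelector(LinearRegression(),
                                     n_features_to_select=3,
                                     direction="forward")   # or "backward"
selector.fit(X, y)
print("selected predictors:", selector.get_support(indices=True))
```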
Ridge Regression
Ridge regression is a technique for dealing with multicollinear data (independent variables that are highly correlated). Even though the ordinary least squares (OLS) estimates are unbiased under multicollinearity, their variances are large, causing the observed value to diverge significantly from the true value. Ridge regression reduces the standard errors by adding a degree of bias to the regression estimates.
Remember the linear regression equation from earlier? It is written as:
y = a + b*x
This equation also contains an error term. The complete equation is:
y = a + b*x + e, where e is the error term. [The error term is the value needed to account for the prediction error between the observed and predicted values.]
For multiple independent variables: y = a + b1*x1 + b2*x2 + ... + e
In a linear equation, prediction error can be decomposed into two subcomponents: error due to bias and error due to variance. Prediction error can occur because of either of these, or both. Ridge regression targets the error caused by variance: it adds an L2 penalty (the sum of squared coefficients, scaled by a tuning parameter) to the least squares objective, shrinking the estimates and trading a small amount of bias for a large reduction in variance.
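A minimal sketch of this trade-off on simulated, nearly collinear predictors; the alpha value is an illustrative assumption.

```python
# Minimal sketch: ridge adds an L2 penalty (alpha * sum of squared
# coefficients) to the least-squares objective, stabilizing estimates
# when predictors are highly correlated.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(0, 0.01, size=200)      # nearly collinear with x1
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(0, 1, 200)

ridge = Ridge(alpha=1.0).fit(X, y)           # alpha controls the added bias
print("ridge coefficients:", ridge.coef_)    # shrunk, but stable
```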
Lasso Regression
Lasso (Least Absolute Shrinkage and Selection Operator), like ridge regression, penalizes the absolute magnitude of the regression coefficients. It can also reduce variability and improve the accuracy of linear regression models. Lasso regression differs from ridge regression in that its penalty function uses absolute values instead of squares. As a result, the parameter estimates are penalized (or equivalently, the sum of the absolute values of the estimates is constrained) in such a way that some of them turn out to be exactly zero. The larger the penalty applied, the more estimates are shrunk all the way to zero. This results in variable selection out of the given set of n variables.
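A minimal sketch on simulated data where only two of eight candidate predictors actually matter; the alpha value is an assumption chosen for illustration.

```python
# Minimal sketch: the L1 penalty drives some coefficients exactly to zero,
# so lasso doubles as a variable-selection method.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))                # 8 candidate predictors
y = 4.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(0, 1, 200)  # only 2 matter

lasso = Lasso(alpha=0.1).fit(X, y)
print("lasso coefficients:", lasso.coef_)    # most entries are exactly 0.0
```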
ElasticNet Regression
ElasticNet is a hybrid of lasso and ridge regression: it is trained with both the L1 and L2 penalties as regularizers. Elastic-net is useful when there are multiple correlated features. Lasso is likely to pick one of them at random, while elastic-net is likely to pick both.
Trading-off between Lasso and Ridge has the practical benefit of allowing Elastic-Net to inherit some of Ridge's rotational stability.
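A minimal sketch on simulated data with a strongly correlated pair of predictors; the alpha and l1_ratio values are illustrative assumptions.

```python
# Minimal sketch: l1_ratio mixes the two penalties (1.0 = pure lasso,
# 0.0 = pure ridge), so correlated predictors tend to be kept together.
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(0, 0.05, size=200)      # strongly correlated pair
X = np.column_stack([x1, x2, rng.normal(size=(200, 4))])
y = 3.0 * x1 + 3.0 * x2 + rng.normal(0, 1, 200)

enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print("elastic-net coefficients:", enet.coef_)  # both x1 and x2 retained
```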
Panel Data Regression
Traditional linear regression models can produce biased estimators because of unobserved, independent variables that influence the dependent variable. Panel data regression is a powerful technique for controlling these dependencies. It is a modeling technique suited to panel data (also known as longitudinal data), i.e., observations on the same cross-sectional units tracked over time. It is commonly used in econometrics to follow the behavior of statistical units (i.e., panel units) across time; firms, governments, and states are examples of such units. When estimating regression coefficients, panel regression allows you to control for both the panel-unit effect and the time effect.
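As a minimal sketch, one common way to fit a fixed-effects panel regression is to absorb the panel-unit and time effects with dummy variables in statsmodels. The firm/year data frame and column names below are made up for illustration; dedicated packages such as linearmodels offer a PanelOLS estimator for serious work.

```python
# Minimal sketch of a fixed-effects panel regression: entity dummies
# (C(firm)) and time dummies (C(year)) absorb unobserved, time-invariant
# differences between panel units and common shocks across years.
import pandas as pd
import statsmodels.formula.api as smf

panel = pd.DataFrame({
    "firm":    ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "year":    [2019, 2020, 2021] * 3,
    "sales":   [10, 12, 15, 20, 21, 25, 5, 6, 8],
    "adspend": [1.0, 2.0, 3.0, 4.0, 4.0, 6.0, 0.5, 1.0, 2.0],
})

# Firm and year effects are controlled for alongside the predictor.
model = smf.ols("sales ~ adspend + C(firm) + C(year)", data=panel).fit()
print(model.params)
```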
Ordinal Regression
Ordinal regression is a member of the regression analysis family. As a predictive analysis, ordinal regression describes data and explains the relationship between one dependent variable and two or more independent variables. In ordinal regression analysis, the dependent variable is ordinal (polytomous with ordered categories), and the independent variables are ordinal or continuous-level (ratio or interval).
Ordinal regression analysis has three main applications: 1) causal analysis, 2) forecasting an effect, and 3) trend forecasting. Unlike correlation analysis for ordinal variables (e.g., Spearman), which focuses on the strength of the relationship between two or more variables, ordinal regression analysis presupposes a dependence or causal relationship between one or more independent variables and one dependent variable. Furthermore, the effect of one or more covariates can be accounted for.
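A minimal sketch using the OrderedModel class available in recent versions of statsmodels; the three-level outcome, the cut points, and the simulated coefficient are assumptions for illustration.

```python
# Minimal sketch: ordinal (ordered) logistic regression on simulated data
# with a three-level ordered outcome: low < medium < high.
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(0)
x = rng.normal(size=300)
latent = 1.5 * x + rng.logistic(size=300)
# Cut the latent score into ordered categories.
y = pd.Series(pd.cut(latent, bins=[-np.inf, -1, 1, np.inf],
                     labels=["low", "medium", "high"], ordered=True))

model = OrderedModel(y, x.reshape(-1, 1), distr="logit")
result = model.fit(method="bfgs", disp=False)
print(result.summary())
```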
Cox Regression
Cox regression builds a predictive model for time-to-event data. For given values of the predictor variables, the model produces a survival function that gives the probability that the event of interest has not occurred by time t. The shape of the survival function and the regression coefficients for the predictors are estimated from observed subjects; the model can then be applied to new cases that have measurements for the predictor variables. It's worth noting that information from censored subjects, i.e., those who do not experience the event of interest during the observation period, contributes usefully to estimating the model.
Cox proportional hazards regression is one of the most commonly used regression techniques for survival analysis. It is used to relate several risk factors or exposures, considered simultaneously, to survival time. In a Cox proportional hazards regression model, the measure of effect is the hazard rate: the risk of failure (i.e., the risk or probability of experiencing the event of interest), given that the participant has survived up to a given point in time. A probability must lie between 0 and 1; the hazard, however, represents the expected number of events per unit of time, so the hazard in a group can exceed 1.
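A minimal sketch using the lifelines package; the ten subjects and the column names below are made up for illustration.

```python
# Minimal sketch: Cox proportional hazards with lifelines. Each row is a
# subject with a follow-up duration, an event indicator (1 = event
# observed, 0 = censored), and covariates.
import pandas as pd
from lifelines import CoxPHFitter

df = pd.DataFrame({
    "duration": [5, 6, 6, 2, 4, 4, 9, 11, 3, 8],   # time to event/censoring
    "event":    [1, 0, 1, 1, 1, 0, 1, 0, 1, 1],    # 0 = censored subject
    "age":      [50, 60, 45, 70, 55, 52, 48, 66, 71, 58],
    "treated":  [1, 1, 0, 0, 1, 0, 1, 0, 0, 1],
})

cph = CoxPHFitter()
cph.fit(df, duration_col="duration", event_col="event")
cph.print_summary()                 # hazard ratios for age and treatment
```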
Tobit Regression
The tobit model, also known as a censored regression model, is designed to estimate linear relationships between variables when the dependent variable is either left- or right-censored (also known as censoring from below and from above, respectively). Censoring from above occurs when all cases with a value at or above some threshold take on the value of that threshold, so the true value might be equal to the threshold, but it might also be higher. In the case of censoring from below, values at or below some threshold are censored.
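Tobit regression is not part of scikit-learn or core statsmodels, so here is a minimal maximum-likelihood sketch for left-censoring at zero, written directly against scipy; the simulated coefficients (a = 1, b = 2, sigma = 1) are assumptions used only to check the fit.

```python
# Minimal tobit sketch: censored observations contribute P(latent <= 0),
# uncensored observations contribute the normal density of the residual.
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
y_latent = 1.0 + 2.0 * x + rng.normal(0, 1, n)   # true a=1, b=2, sigma=1
y = np.maximum(y_latent, 0.0)                     # left-censored at zero

def neg_loglik(params):
    a, b, log_sigma = params
    sigma = np.exp(log_sigma)                     # keep sigma positive
    mu = a + b * x
    censored = y <= 0
    ll_unc = stats.norm.logpdf(y[~censored], mu[~censored], sigma)
    ll_cen = stats.norm.logcdf(-mu[censored] / sigma)
    return -(ll_unc.sum() + ll_cen.sum())

res = optimize.minimize(neg_loglik, x0=[0.0, 0.0, 0.0], method="BFGS")
a_hat, b_hat, sigma_hat = res.x[0], res.x[1], np.exp(res.x[2])
print(a_hat, b_hat, sigma_hat)                    # should be near 1, 2, 1
```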
Quantile Regression
Unlike standard linear regression, which uses the least squares method to estimate the conditional mean of the target across values of the features, quantile regression estimates the conditional median (or other quantiles) of the target. Quantile regression is an extension of linear regression that is employed when the requirements of linear regression aren't met (i.e., linearity, homoscedasticity, independence, or normality).
Quantile regression models the association between a set of predictor (independent) variables and specific percentiles (or "quantiles") of a target (dependent) variable, most typically the median. It is commonly used for research in fields including ecology, healthcare, and financial economics. Because it allows relationships between variables to be analyzed away from the mean of the data, quantile regression is useful for identifying relationships involving variables that are not normally distributed or that have nonlinear associations with predictor variables.
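A minimal sketch with statsmodels' QuantReg on simulated heteroscedastic data, fitting the median and the 90th percentile; the noise structure is an assumption chosen so the two quantile lines visibly diverge.

```python
# Minimal sketch: quantile regression at q=0.5 (median) and q=0.9 on
# data whose spread grows with x (heteroscedastic noise).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 500)
y = 1.0 + 0.5 * x + rng.normal(0, 0.5 + 0.3 * x)   # spread grows with x

X = sm.add_constant(x)
for q in (0.5, 0.9):
    fit = sm.QuantReg(y, X).fit(q=q)
    print(f"quantile {q}: intercept={fit.params[0]:.2f}, "
          f"slope={fit.params[1]:.2f}")
```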
How to select the right regression model?
Life is usually simple when you know only one or two techniques. One training institute I know of tells its students: if the outcome is continuous, apply linear regression; if the data is binary, use logistic regression! However, the more options available to us, the harder it becomes to select the right one. Regression models are in a similar situation.
Within the various forms of regression models, it is critical to select the most appropriate technique depending on the type of independent and dependent variables, data dimensionality, and other significant data features. The following are the essential factors to consider while choosing a regression model:
Exploration of data is an unavoidable element of developing a prediction model. Identifying the link and influence of variables should be your first step before choosing the proper model.
We can use metrics such as the statistical significance of parameters, R-square, adjusted R-square, AIC, BIC, and the error term to compare the goodness of fit of different models. Mallow's Cp criterion is another: it essentially checks your model for possible bias by comparing it with all possible submodels (or a careful selection of them).
Cross-validation is the most effective method for evaluating prediction models. In this step, you separate your data into two groups (train and validate). A simple mean squared difference between the observed and predicted values gives you a measure of prediction accuracy (a minimal sketch follows this list).
You should not utilize the automatic model selection approach if your data set contains numerous confounding factors because you do not want to include them all in a model at the same time.
It will also be determined by your goal. It's possible that a less powerful model is easier to implement than one with a high statistical significance.
In the event of high dimensionality and multicollinearity among the variables in the data set, regression regularization approaches (Lasso, Ridge, and ElasticNet) function effectively.
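To make the cross-validation point concrete, here is a minimal sketch on synthetic data comparing a few of the models covered above by cross-validated mean squared error; the alpha values are illustrative assumptions.

```python
# Minimal sketch: comparing candidate models by cross-validated mean
# squared error rather than by in-sample fit alone.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=300, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

for name, model in [("linear", LinearRegression()),
                    ("ridge", Ridge(alpha=1.0)),
                    ("lasso", Lasso(alpha=1.0))]:
    scores = cross_val_score(model, X, y, cv=5,
                             scoring="neg_mean_squared_error")
    print(f"{name}: mean CV MSE = {-scores.mean():.1f}")
```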
I hope you've gotten a good understanding of regression by now. These regression techniques should be applied in light of the conditions of the data. One of the best tricks for working out which technique to use is to check the family of the variables, i.e., discrete or continuous.
>"I'm sure I don't have all of the answers or information about types of regression here." I'm hoping you'll share your thoughts with regression in the comments area. In the comments, I'd love to hear your thoughts on this.” You can follow to this blog to receive notifications of new posts.
References:
https://www.statisticssolutions.com/free-resources/directory-of-statistical-analyses/ordinal-regression-2/
https://towardsdatascience.com/panel-data-regression-a-powerful-time-series-modeling-technique-7509ce043fa8
https://towardsdatascience.com/a-guide-to-panel-data-regression-theoretics-and-implementation-with-python-4c84c5055cf8
https://www.statsdirect.com/help/survival_analysis/cox_regression.htm
https://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/BS704_Survival/BS704_Survival6.html
https://www.ibm.com/docs/en/spss-statistics/24.0.0?topic=option-cox-regression-analysis
https://towardsdatascience.com/quantile-regression-ff2343c4a03
https://www.mygreatlearning.com/blog/what-is-quantile-regression/