Simple Linear Regression
Ordinary Least Squares (OLS) is a statistical method that produces a best-fit line between an outcome variable \(Y\) and one or more predictor variables \(X_1, X_2, X_3, ...\), choosing the coefficients that minimize the sum of squared differences between the observed and fitted values of \(Y\). The predictor variables may also be called independent variables or right-hand-side variables.
For more information about OLS, see Wikipedia: Ordinary Least Squares.
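To make the idea of a "best fit" concrete, the sketch below computes OLS coefficients directly with NumPy on simulated data. The data, the true coefficients (1, 2, -3), and all variable names are assumptions made for illustration; none of them come from the datasets used on this page. The fitted coefficients \(\hat{\beta}\) minimize the sum of squared residuals, which means they solve \(X'X\hat{\beta} = X'Y\) and leave residuals that are uncorrelated with every column of \(X\).

```python
import numpy as np

# Hypothetical simulated data: y = 1 + 2*x1 - 3*x2 + noise
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 3.0 * x2 + rng.normal(scale=0.5, size=n)

# Design matrix: a column of ones for the intercept, then the predictors
X = np.column_stack([np.ones(n), x1, x2])

# Least-squares solution: the coefficients that minimize ||y - X b||^2
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Residuals from the fitted line
residuals = y - X @ beta_hat

print("Estimated coefficients:", beta_hat)
print("X'residuals (should be numerically zero):", X.T @ residuals)
```

This only shows what is being computed. The implementations below also report standard errors and other diagnostics, so they are the better choice for real analysis.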
Keep in Mind
- OLS assumes that the model you have specified is the true relationship and that this relationship is linear in its coefficients.
- OLS results are not guaranteed to have a causal interpretation. Just because OLS estimates a positive relationship between \(X_1\) and \(Y\) does not necessarily mean that an increase in \(X_1\) will cause \(Y\) to increase.
- OLS does not require that your variables follow a normal distribution.
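The point about causal interpretation can be illustrated with a small simulation. In the sketch below, a hypothetical confounder `z` drives both `x` and `y`, while `x` has no effect on `y` at all. The data-generating process, the variable names, and the sample size are all assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical data-generating process with a confounder z
rng = np.random.default_rng(42)
n = 5000
z = rng.normal(size=n)            # confounder
x = z + rng.normal(size=n)        # x depends on z
y = 2.0 * z + rng.normal(size=n)  # y depends on z only; the true effect of x is 0

# Regression of y on x alone (confounder omitted)
X_naive = np.column_stack([np.ones(n), x])
b_naive, *_ = np.linalg.lstsq(X_naive, y, rcond=None)

# Regression of y on x and z (confounder included)
X_ctrl = np.column_stack([np.ones(n), x, z])
b_ctrl, *_ = np.linalg.lstsq(X_ctrl, y, rcond=None)

print("Slope on x, z omitted: ", b_naive[1])
print("Slope on x, z included:", b_ctrl[1])
```

With `z` omitted, OLS reports a clearly positive slope on `x` even though changing `x` would not change `y`. Once `z` is included, the estimated slope on `x` is close to zero.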
Also Consider
- The default OLS standard errors assume that the model’s error term is independent and identically distributed (IID), which may not be true. Consider whether your analysis should use heteroskedasticity-robust standard errors or cluster-robust standard errors.
- If your outcome variable is discrete or bounded, then OLS is by nature incorrectly specified. You may want to use probit or logit instead for a binary outcome variable, or ordered probit or ordered logit for an ordinal outcome variable.
- If the goal of your analysis is predicting the outcome variable and you have a very long list of predictor variables, you may want to consider using a method that will select a subset of your predictors. A common way to do this is a penalized regression method like LASSO.
- In many contexts, you may want to include interaction terms or polynomials in your regression equation.
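Two of the points above, robust standard errors and interaction or polynomial terms, can be sketched together in NumPy. The simulated data, coefficient values, and variable names are assumptions for illustration only. The robust standard errors use the HC1 "sandwich" formula, which is one common heteroskedasticity-robust variant among several.

```python
import numpy as np

# Hypothetical data whose error spread grows with x1 (heteroskedasticity)
rng = np.random.default_rng(1)
n = 2000
x1 = rng.uniform(0, 2, size=n)
x2 = rng.normal(size=n)
e = rng.normal(size=n) * (0.5 + x1)
y = 1.0 + 2.0 * x1 - 1.0 * x1**2 + 0.5 * x2 + 1.5 * x1 * x2 + e

# Design matrix with a squared term (x1^2) and an interaction term (x1 * x2)
X = np.column_stack([np.ones(n), x1, x1**2, x2, x1 * x2])
k = X.shape[1]

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
XtX_inv = np.linalg.inv(X.T @ X)

# Classical standard errors, valid when the errors are IID
sigma2 = resid @ resid / (n - k)
se_classical = np.sqrt(np.diag(sigma2 * XtX_inv))

# HC1 robust standard errors: (X'X)^-1 [sum of u_i^2 x_i x_i'] (X'X)^-1,
# scaled by n / (n - k)
meat = X.T @ (X * resid[:, None] ** 2)
cov_hc1 = n / (n - k) * XtX_inv @ meat @ XtX_inv
se_robust = np.sqrt(np.diag(cov_hc1))

print("Coefficients:  ", beta_hat)
print("Classical SEs: ", se_classical)
print("HC1 robust SEs:", se_robust)
```

Most statistical packages offer robust standard errors as an option, so this by-hand version is rarely needed in practice. In statsmodels, for example, passing `cov_type="HC1"` to `.fit()` requests the same kind of standard errors.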
Implementations
Gretl
# Load auto data (the "raw" URL serves the data file itself rather than a GitHub web page)
open https://github.com/LOST-STATS/lost-stats.github.io/raw/master/Data/auto.gdt
# Run OLS using the auto data, with mpg as the outcome variable
# and headroom, trunk, and weight as predictors
ols mpg const headroom trunk weight
Julia
# Uncomment the next line to install all the necessary packages
# import Pkg; Pkg.add(["CSV", "DataFrames", "GLM", "StatsModels"])
# We use the JuliaStats ecosystem for data handling and regression.
# DataFrames provides dataset handling functions,
# StatsModels provides the `@formula` macro for specifying a model concisely and readably,
# and GLM implements fitting and analysis of (generalized) linear models.
# These packages are designed to work together.
using StatsModels, GLM, DataFrames, CSV
# Here we download the data set, parse the file with CSV and load into a DataFrame
mtcars = CSV.read(download("https://github.com/LOST-STATS/lost-stats.github.io/raw/source/Data/mtcars.csv"), DataFrame)
# The following line closely follows the R and Python syntax, thanks to the GLM and StatsModels packages
# Here we specify a linear model and fit it to our data set in one step
ols = lm(@formula(mpg ~ cyl + hp + wt), mtcars)
# This prints a summary of the fitted model, including
# coefficient estimates, standard errors, confidence intervals, and p-values
print(ols)
Matlab
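% Load auto data
% load() reads local MAT-files, so first download the file with websave.
% The "raw" URL serves the file itself rather than a GitHub web page.
websave('auto.mat', 'https://github.com/LOST-STATS/lost-stats.github.io/raw/master/Data/auto.mat');
load('auto.mat')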
% Run OLS using the auto data, with mpg as the outcome variable
% and headroom, trunk, and weight as predictors
intercept = ones(length(headroom),1);
X = [intercept headroom trunk weight];
[b,bint,r,rint,stats] = regress(mpg,X);
Python
# Use 'pip install statsmodels' or 'conda install statsmodels'
# on the command line to install the statsmodels package.
# Import the relevant parts of the package:
import statsmodels.api as sm
import statsmodels.formula.api as smf
# Get the mtcars example dataset (downloaded from the Rdatasets
# repository, so this step needs an internet connection)
mtcars = sm.datasets.get_rdataset("mtcars").data
# Fit OLS regression model to mtcars
ols = smf.ols(formula='mpg ~ cyl + hp + wt', data=mtcars).fit()
# Look at the OLS results
print(ols.summary())
R
# Load Data
# data(mtcars) ## Optional: automatically loaded anyway
# Run OLS using the mtcars data, with mpg as the outcome variable
# and cyl, hp, and wt as predictors
olsmodel <- lm(mpg ~ cyl + hp + wt, data = mtcars)
# Look at the results
summary(olsmodel)
SAS
/* Load Data */
proc import datafile="C:mtcars.dbf"
out=fromr dbms=dbf;
run;
/* OLS regression on the imported data set */
proc reg data=fromr;
model mpg = cyl hp wt;
run;
quit;
Stata
* Load auto data (auto ships with Stata, so sysuse can load it by name)
sysuse auto, clear
* Run OLS using the auto data, with mpg as the outcome variable
* and headroom, trunk, and weight as predictors
regress mpg headroom trunk weight