Regression models generally assume that the outcome variable is a function of an index, a linear combination of the independent variables, as in ordinary least squares:\[Y = \beta_0+\beta_1X_1+\beta_2X_2\]
If the independent variables have a nonlinear effect on the outcome, however, the model will be misspecified unless that nonlinearity is captured by including the appropriate nonlinear terms in the index.
The two most common ways to do this are interaction terms and polynomial terms. With an interaction, the effect of one variable varies with the value of another:\[Y = \beta_0+\beta_1X_1+\beta_2X_2 + \beta_3X_1X_2\]
and with polynomial terms, the effect of one variable on the outcome is allowed to take a nonlinear shape:\[Y = \beta_0+\beta_1X_1+\beta_2X_2 + \beta_3X_2^2 + \beta_4X_2^3\]
- When you have interaction terms or polynomials, the effect of a variable can no longer be described with a single coefficient, and in some sense the individual coefficients lose meaning without the others. You can understand the effect of a single variable by taking the derivative of the index with respect to that variable. For example, in \(Y = \beta_0+\beta_1X_1+\beta_2X_2 + \beta_3X_1X_2\), the effect of \(X_2\) on \(Y\) is \(\partial Y/\partial X_2 = \beta_2 + \beta_3X_1\). You must plug in a value of \(X_1\) to get the effect of \(X_2\). Or in \(Y = \beta_0+\beta_1X_1+\beta_2X_2 + \beta_3X_2^2 + \beta_4X_2^3\), the effect of \(X_2\) is \(\partial Y/\partial X_2 = \beta_2 + 2\beta_3X_2 + 3\beta_4X_2^2\). You must plug in a value of \(X_2\) to get the marginal effect of \(X_2\) at that value.
- In almost all cases, if you are including an interaction term, you should also include each of the interacted variables on their own. Otherwise, the coefficients become very difficult to interpret.
- In almost all cases, if you are including a polynomial, you should include all terms of the polynomial. In other words, include the linear and squared term, not just the squared term.
- Interaction terms tend to have low statistical power. Consider performing a power analysis of interaction terms before running your analysis.
- Polynomials are not the only way to model a nonlinear relationship. You could, for example, run one of many kinds of nonparametric regression.
- You may want to get the average marginal effects or the marginal effects at the mean of your variables after running your model.
- One common way to display the effects of a model with interactions is to graph them; see marginal effects plots for interactions with continuous variables.
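To make the derivative logic above concrete, here is a minimal Python sketch of the interaction case. The coefficient values are made up purely for illustration, not estimates from any real model:

```python
# Sketch: marginal effect of X2 in Y = b0 + b1*X1 + b2*X2 + b3*X1*X2.
# Coefficients below are invented for illustration only.
b2 = 0.5   # coefficient on X2
b3 = -0.2  # coefficient on the X1*X2 interaction

def marginal_effect_x2(x1):
    """dY/dX2 = b2 + b3*X1: the effect of X2 depends on the value of X1."""
    return b2 + b3 * x1

# The effect of X2 differs across values of X1:
print(marginal_effect_x2(0))  # at X1 = 0, the effect is just b2 = 0.5
print(marginal_effect_x2(5))  # at X1 = 5, the effect is 0.5 - 0.2*5 = -0.5
```

The same pattern applies to polynomials: write out the derivative, then plug in the value at which you want the marginal effect.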
Using the statsmodels package, we can use formula syntax similar to the R example below.
```python
# Standard imports
import numpy as np
import pandas as pd
import statsmodels.formula.api as sms
from matplotlib import pyplot as plt

# Load the R mtcars dataset from a URL
df = pd.read_csv('https://raw.githubusercontent.com/LOST-STATS/lost-stats.github.io/source/Data/mtcars.csv')

# Include a linear, squared, and cubic term using the I() function.
# N.B. Python uses ** for exponentiation (^ means bitwise xor)
model1 = sms.ols('mpg ~ hp + I(hp**2) + I(hp**3) + cyl', data=df)
print(model1.fit().summary())

# Include an interaction term and the variables by themselves using *
# The interaction term is represented by hp:cyl
model2 = sms.ols('mpg ~ hp * cyl', data=df)
print(model2.fit().summary())

# Equivalently, you can request all interactions up to second order by doing
model3 = sms.ols('mpg ~ (hp + cyl) ** 2', data=df)
print(model3.fit().summary())

# Include only the interaction term and not the variables themselves with :
# Hard to interpret! Occasionally useful though.
model4 = sms.ols('mpg ~ hp : cyl', data=df)
print(model4.fit().summary())
```
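After fitting a polynomial model like the one above, you can compute the average marginal effect by hand. Here is a hedged sketch: it uses a small simulated dataset (invented for illustration, so the test does not depend on downloading mtcars) and approximates the derivative numerically, then checks it against the analytic derivative \(\beta_1 + 2\beta_2 \cdot hp\):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data standing in for mtcars; coefficients are invented
rng = np.random.default_rng(0)
df = pd.DataFrame({'hp': rng.uniform(50, 300, 200)})
df['mpg'] = 40 - 0.1 * df['hp'] + 0.0001 * df['hp']**2 + rng.normal(0, 1, 200)

# Fit a quadratic model
fit = smf.ols('mpg ~ hp + I(hp**2)', data=df).fit()

# Average marginal effect of hp via a small numerical derivative:
# predict at hp and hp + eps, then average the slopes over the sample
eps = 1e-4
bumped = df.assign(hp=df['hp'] + eps)
ame = np.mean((fit.predict(bumped) - fit.predict(df)) / eps)

# Analytic check: d(mpg)/d(hp) = b1 + 2*b2*hp, averaged over the sample.
# Parameters are ordered as in the formula: Intercept, hp, I(hp ** 2)
b = fit.params.values
ame_analytic = np.mean(b[1] + 2 * b[2] * df['hp'])
print(ame, ame_analytic)
```

The numerical approach generalizes to models where the analytic derivative is tedious to write out, at the cost of choosing a sensible step size.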
```r
# Load mtcars data
data(mtcars)

# Include a linear, squared, and cubic term using the I() function
model1 <- lm(mpg ~ hp + I(hp^2) + I(hp^3) + cyl, data = mtcars)

# Include a linear, squared, and cubic term using the poly() function
# The raw = TRUE option will give the exact same result as model1
# Omitting this will give you orthogonal polynomial terms,
# which are not correlated with each other but are more difficult to interpret
model2 <- lm(mpg ~ poly(hp, 3, raw = TRUE) + cyl, data = mtcars)

# Include an interaction term and the variables by themselves using *
model3 <- lm(mpg ~ hp*cyl, data = mtcars)

# Include only the interaction term and not the variables themselves with :
# Hard to interpret! Occasionally useful though.
model4 <- lm(mpg ~ hp:cyl, data = mtcars)
```
Stata builds interaction and polynomial terms with the # and ## operators: join two variables to create an interaction, or join a variable with itself to create a polynomial. You must also specify whether each variable is continuous (prefix the variable with c.) or a factor (prefix it with i.).
```stata
* Load auto data
sysuse auto.dta, clear

* Use ## to interact variables together and also include the variables individually
* foreign is a factor variable so we prefix it with i.
* weight is continuous so we prefix it with c.
reg mpg c.weight##i.foreign

* Use # to include just the interaction term and not the variables themselves
* If one is a factor, this will include the effect of the continuous variable
* for each level of the factor
reg mpg c.weight#i.foreign

* Interact a variable with itself to create a polynomial term
reg mpg c.weight##c.weight##c.weight foreign
```