Reshape Panel Data from Long to Wide
Panel data is data in which individuals are observed at multiple points in time. There are two standard ways of storing this data:
In wide format, there is one row per individual. Then, for each variable in the data set that varies over time, there is one column per time period. For example:
Individual | FixedCharacteristic | TimeVarying1990 | TimeVarying1991 | TimeVarying1992 |
---|---|---|---|---|
1 | C | 16 | 20 | 22 |
2 | H | 23.4 | 10 | 14 |
This format makes it easy to perform calculations across multiple years.
In long format, there is one row per individual per time period:
Individual | FixedCharacteristic | Year | TimeVarying |
---|---|---|---|
1 | C | 1990 | 16 |
1 | C | 1991 | 20 |
1 | C | 1992 | 22 |
2 | H | 1990 | 23.4 |
2 | H | 1991 | 10 |
2 | H | 1992 | 14 |
This format makes it easy to run models like fixed effects.
Reshaping is the method of converting wide-format data to long and vice versa.
Keep in Mind
- If your data has multiple observations per individual/time, then standard reshaping techniques generally won’t work.
- It’s a good idea to check your data by directly looking at it both before and after a reshape to check that it worked properly.
Also Consider
- To go in the other direction, reshape from wide to long.
- Determine the observation level of a data set.
Implementations
Python
The pandas package has several functions to reshape data. For going from long data to wide data, there’s pivot
and pivot_table
, both of which are demonstrated in the example below.
# Install pandas using pip or conda, if you don't already have it installed.
import pandas as pd
# Load WHO data on population as an example, which has 'country', 'year',
# and 'population' columns.
df = pd.read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/tidyr/population.csv',
index_col=0)
# In this example, we would like to have one row per country but the data have
# multiple rows per country, each corresponding with
# a year-country value of population.
# Let's take a look at the first 5 rows:
print(df.head())
# To reshape this into a dataframe with one country per row, we can use
# the pivot function and set 'country' as the index. As we'd like to
# split out years into different columns, we set columns to 'years', and the
# values within this new dataframe will be population:
df_wide = df.pivot(index='country',
columns='year',
values='population')
# What if there are multiple year-country pairs? Pivot can't work
# because it needs unique combinations. In this case, we can use
# pivot_table which can aggregate any duplicate year-country pairs. To test it, let's
# create some synthetic duplicate data for France and add it to the original
# data. We'll pretend there was a second count of population that came in with
# 5% higher values for all years.
# Copy the data for France
synth_fr_data = df.loc[df['country'] == 'France']
# Add 5% for all years
synth_fr_data['population'] = synth_fr_data['population']*1.05
# Append it to the end of the original data
df = pd.concat([df, synth_fr_data], axis=0)
# Compute the wide data - averaging over the two estimates for France for each
# year.
df_wide = df.pivot_table(index='country',
columns='year',
values='population',
aggfunc='mean')
R
There are many ways to reshape in R, including base-R reshape
and the deprecated reshape2::melt
and cast
and tidyr::gather
and spread
. We will be using the tidyr package function pivot_wider
, which requires tidyr version 1.0.0 or later.
# install.packages('tidyr')
library(tidyr)
# Load in population, which has one row per country per year
data("population")
# If we look at the data, we'll see that we have:
# identifying information in "country",
# a time indicator in "year",
# and our values in "population"
head(population)
Now we think:
- Think about the set of variables that contain the values we’re interested in reshaping. Here’s it’s
population
. This list of variable names will be ourvalues_from
argument. - Think about what we want the new variables to be called. The
variable
variable says which variable we’re looking at. So that will be ournames_from
argument. And we want to specify that each variable represents population in a given year (rather than some other variable, so we’ll add “pop_” as ournames_prefix
.
pop_wide <- pivot_wider(population,
names_from = year,
values_from = population,
names_prefix = "pop_")
Another way to do this is using data.table
.
#install.packages('data.table')
library(data.table)
# The second argument here is the formula describing the observation level of the data
# The full set of variables together is the current observation level (one row per country and year)
# The parts before the ~ are what we want the new observation level to be in the wide data (one row per country)
# The parts after the ~ are for the variables we want to no longer be part of the observation level (we no longer want a row per year)
population = as.data.table(population)
pop_wide = dcast(population,
country ~ year,
value.var = "population"
)
Stata
* Load blood pressure data in long format, which contains
* blood pressure both before and after a treatment for some patients
sysuse bplong.dta
The next steps involve thinking:
- Think about the set of variables that identify individuals. Here it’s
patient
. This will go ini()
, so we havei(patient)
. - Think about the set of variables that vary across time. Here’s it’s
bp
. This will be one of our “stub”s. - Think about which variable separates the different time periods within individual. Here we have “when”, and this goes in
j()
, so we havej(when)
.
* Syntax is:
* reshape wide stub, i(individualvars) j(newtimevar)
* So we have
reshape wide bp i(patient) j(when)
* Note that simply typing
reshape
* will show the syntax for the function
With especially large datasets, the Gtools package provides a much faster version of reshape known as greshape. The syntax can function exactly the same, though they provide alternative syntax that you may find more intuitive.
* First, we will create a toy dataset that is very large to demonstrate the speed gains
* If necessary, first install gtools:
* ssc install gtools
* Clear memory
clear all
* Turn on return message to see command run time
set rmsg on
* Set data size to 15 million observations
set obs 15000000
* Create ten observations per person
generate person_id = floor((_n-1)/10)
* Number time periods from 1 to 10 for each person
generate time_id = mod((_n-1), 10) + 1
*Create an income in each period
generate income = round(rnormal(100, 20))
* Demonstrate the comparative speed of these two reshape approaches.
* preserve and restore aren't a part of the reshape command;
* they just store the current state of the data and then restore it,
* so we can try our different reshape commands on the same data.
*The traditional reshape command
preserve
reshape wide income, i(person_id) j(time_id)
restore
*The Gtools reshape command
preserve
greshape wide income, i(person_id) j(time_id)
restore
*The Gtools reshape command, alternative syntax
preserve
greshape wide income, by(person_id) keys(time_id)
restore
Note: there is much more guidance to the usage of greshape on the Gtools reshape page.