Link Search Menu Expand Document

Creating a Variable with Group Calculations

Many data sets have hierarchical structures, where individual observations belong within certain groups. For example, data with students inside classrooms inside schools. Or companies inside countries. See the below table for one example

I J X
1 1 3
1 2 3.5
2 1 2
2 2 5

Here, we have data where each group \(I\) has multiple rows, one for each \(J\).

We often might want to create a new variable that performs a calculation within each group, and assigns the result to each value in that group. For example, perhaps we want to calculate the mean of \(X\) within each group \(I\), so we can know how far above or below the group average each observation is. Our goal is:

I J X AvgX
1 1 3 3.25
1 2 3.5 3.25
2 1 2 3.5
2 2 5 3.5

Also Consider

  • If the goal is to produce a data set with one row per group rather than a new variable assigning the calculation to each observation in the group, then instead look at Collapse a Data Set.

Implementations

Julia

The DataFrames.jl package makes data aggregation and manipulaion relatively straigthforward.

# Load required packages
using CSV             # Import .csv files 
using DataFrames      # Working with data frames 
using Statistics      # Required to calculate a mean 

# Import .csv file from GitHub and store as a DataFrame 
storms = CSV.read(download("https://vincentarelbundock.github.io/Rdatasets/csv/dplyr/storms.csv"), DataFrame)

# Use 'groupby' to aggregate data by groups (namely: name, year, month and day
# columns) and use 'transform!' to add a new column called 'mean_wind'
# containing the mean of the existing 'wind' column. (The `!` means that this
# change will be made in-place.)
transform!(groupby(storms, [:name, :year, :month, :day]), :wind=> mean => :mean_wind)

Python

pandas doesn’t have a straightforward and flexible built-in method for doing this, with aggregation methods heavily preferring to work as described on Collapse a Data Set. However, we can just follow those methods and then merge the result back in.

import pandas as pd

# Pull in data on storms
storms = pd.read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/dplyr/storms.csv')

# Use groupby and agg to perform a group calculation
# Here it's a mean, but it could be anything
meanwind = (storms.groupby(['name', 'year', 'month', 'day'])
            .agg({'wind': 'mean'})
            # Rename so that when we merge it in it has a 
            # different name
           .rename({'wind': 'mean_wind'}))
            # make sure it's a data frame so we can join it  
    
# Use merge to bring the result back into the data
storms = pd.merge(storms,meanwind,
                    on = ['name', 'year', 'month', 'day'])

R

In R, we can use either the dplyr or data.table package to do this.

Here’s how to do it with dplyr

library(dplyr)

data("storms") # The dataset is bundled with dplyr, so we'll just open directly

# Use 'group_by' to designate the groups and 'mutate' to create new column(s).
# Note that dplyr doesn't modify in-place, so we need to reassign the result.
storms = storms %>%
  group_by(name, year, month, day) %>%
  mutate(mean_wind = mean(wind))

…and here’s how to do it with data.table.

library(data.table)

# storms = fread("https://vincentarelbundock.github.io/Rdatasets/csv/dplyr/storms.csv")
setDT(storms) # Set the already-loaded storms DF as a data.table

# Use ':=' for in-place modification
storms[, mean_wind := mean(wind), by = .(name, year, month, day)]

Stata

This process is made easy in Stata using the egen command. egen is not entirely flexible and is limited to a set of predetermined group calculations (see help egen), although this can be expanded using the egenmore package (ssc install egenmore then help egenmore). However, the list of predetermined functions is long enough that you’ll rarely be caught out!

import delimited "https://vincentarelbundock.github.io/Rdatasets/csv/dplyr/storms.csv", clear

* Use bysort to determine the grouping, and egen to do the calculation
bysort name year month day: egen mean_wind = mean(wind)