Rowwise Calculations
When working with a table of data, it’s not uncommon to want to perform a calculations across many columns. For example, taking the mean of a bunch of columns for each row.
This is generally not difficult to do by hand if the number of variables being handled is small. For example, in most software packages, you could take the mean of columns A
and B
for each row by just asking for (A+B)/2
.
This becomes more difficult, though, when the list of variables gets too long to type out by hand, or when the calculation doesn’t play nicely with being given columns. In these cases, approaches explicitly designed for rowwise calculations are necessary.
Keep in Mind
- When incorporating lots of variables, rowwise calculations often allow you to select those variables by group, such as “all variables starting with r_”. When doing this, check ahead of time to make sure you aren’t accidentally incorporating unintended variables.
Implementations
Python
The pandas data analysis package provides several methods for performing row-wise (or column-wise) operations in Python. Many common operations, such as sum and mean, can be called directly (eg summing over multiple columns to create a new column). It’s useful to know the axis convention in pandas: operations that combine columns often require the user to pass axis=1
to the function, while operations that combine rows require axis=0
. This convention follows the usual one for matrices of denoting individual elements first by the ith row and then by the jth column.
Although not demonstrated in the example below, lambda functions can be used for more complex operations that aren’t built-in and apply to multiple rows or columns.
# If necessary, install pandas using pip or conda
import pandas as pd
# Grab the data
df = pd.read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/ggplot2/midwest.csv",
index_col=0)
# Let's assume that we want to sum, row-wise, every column
# that contains 'perc' in its column name and check that
# the total is 300. Use a list comprehension to get only
# relevant columns, sum across them (axis=1), and create a
# new column to store them:
df['perc_sum'] = df[[x for x in df.columns if 'perc' in x]].sum(axis=1)
# We can now check whether, on aggregate, each row entry of this new column
# is 300 (it's not!)
df['perc_sum'].describe()
R
There are a few ways to perform rowwise operations in R. If you are summing the columns or taking their mean, rowSums
and rowMeans
in base R are great.
For something more complex, apply
in base R can perform any necessary rowwise calculation, but pmap
in the purrr
package is likely to be faster.
In all cases, the tidyselect helpers in the dplyr package can help you to select many variables by name.
# If necessary
# install.packages(c('purrr','ggplot2','dplyr'))
# ggplot2 is only for the data
data(midwest, package = 'ggplot2')
# dplyr is for the tidyselect functions, the pipe %>%, and select() to pick columns
library(dplyr)
# There are three sets of variables starting with "perc" - let's make sure they
# add up to 300 as they maybe should
# Use starts_with to select the variables
# First, do it with rowSums,
# either by picking column indices or using tidyselect
midwest$rowsum_rowSums1 <- rowSums(midwest[,c(12:16,18:20,22:26)])
midwest$rowsum_rowSums2 <- midwest %>%
select(starts_with('perc')) %>%
rowSums()
# Next, with apply - we're doing sum() here for the function
# but it could be anything
midwest$rowsum_apply <- apply(
midwest %>% select(starts_with('perc')),
MARGIN = 1,
sum)
# Next, two ways with purrr:
library(purrr)
# First, using purrr::reduce, which is good for some functions like summing
# Note that . is the data set being sent by %>%
midwest <- midwest %>%
mutate(rowsum_purrrReduce = reduce(select(., starts_with('perc')), `+`))
# More flexible, purrr::pmap, which works for any function
# using pmap_dbl here to get a numeric variable rather than a list
midwest <- midwest %>%
mutate(rowsum_purrrpmap = pmap_dbl(
select(.,starts_with('perc')),
sum))
# So do we get 300?
summary(midwest$rowsum_rowSums2)
# Uh-oh... looks like we didn't understand the data after all.
Stata
Stata has a series of built-in row operations that use the egen
command. See help egen
for the full list, and look for functions beginning with row
like rowmean
.
The full list includes: rowfirst
and rowlast
(first or last non-missing observation), rowmean
, rowmedian
, rowmax
, rowmin
, rowpctile
, and rowtotal
(the mean, median, max, min, given percentile, or sum of all the variables), and rowmiss
and rownonmiss
(the count of the number of missing or nonmissing observations across the variables).
The egenmore package, which can be installed with ssc install egenmore
, adds rall
, rany
, and rcount
(checks a condition for each variable and returns whether all are true, any are true, or the number that are true), rownvals
and rowsvals
(number of unique values for numeric and string variables, respectively), and rsum2
(rowtotal
with some additional options).
* Get data on midwestern states
import delimited using "https://vincentarelbundock.github.io/Rdatasets/csv/ggplot2/midwest.csv"
* There are three sets of variables starting with "perc" - let's make sure they
* add up to 300 as they should
* Use * as a wildcard for variable names
egen total_perc = rowtotal(perc*)
summ total_perc
* They don't! Uh oh.
* Let's just check the education variables - should add up to 100
* Use - to include all variables from one to the other
* based on their current order in the data
egen total_ed = rowtotal(perchsd-percprof)
* Oh that explains it...
* These aren't exclusive categories (HSD, college overlap)
* and also leaves out non-HS graduates.
summ total_ed