When working with a table of data, it’s not uncommon to want to perform a calculations across many columns. For example, taking the mean of a bunch of columns for each row.
This is generally not difficult to do by hand if the number of variables being handled is small. For example, in most software packages, you could take the mean of columns
B for each row by just asking for
This becomes more difficult, though, when the list of variables gets too long to type out by hand, or when the calculation doesn’t play nicely with being given columns. In these cases, approaches explicitly designed for rowwise calculations are necessary.
- When incorporating lots of variables, rowwise calculations often allow you to select those variables by group, such as “all variables starting with r_”. When doing this, check ahead of time to make sure you aren’t accidentally incorporating unintended variables.
The pandas data analysis package provides several methods for performing row-wise (or column-wise) operations in Python. Many common operations, such as sum and mean, can be called directly (eg summing over multiple columns to create a new column). It’s useful to know the axis convention in pandas: operations that combine columns often require the user to pass
axis=1 to the function, while operations that combine rows require
axis=0. This convention follows the usual one for matrices of denoting individual elements first by the ith row and then by the jth column.
Although not demonstrated in the example below, lambda functions can be used for more complex operations that aren’t built-in and apply to multiple rows or columns.
# If necessary, install pandas using pip or conda import pandas as pd # Grab the data df = pd.read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/ggplot2/midwest.csv", index_col=0) # Let's assume that we want to sum, row-wise, every column # that contains 'perc' in its column name and check that # the total is 300. Use a list comprehension to get only # relevant columns, sum across them (axis=1), and create a # new column to store them: df['perc_sum'] = df[[x for x in df.columns if 'perc' in x]].sum(axis=1) # We can now check whether, on aggregate, each row entry of this new column # is 300 (it's not!) df['perc_sum'].describe()
There are a few ways to perform rowwise operations in R. If you are summing the columns or taking their mean,
rowMeans in base R are great.
For something more complex,
apply in base R can perform any necessary rowwise calculation, but
pmap in the
purrr package is likely to be faster.
In all cases, the tidyselect helpers in the dplyr package can help you to select many variables by name.
# If necessary # install.packages(c('purrr','ggplot2','dplyr')) # ggplot2 is only for the data data(midwest, package = 'ggplot2') # dplyr is for the tidyselect functions, the pipe %>%, and select() to pick columns library(dplyr) # There are three sets of variables starting with "perc" - let's make sure they # add up to 300 as they maybe should # Use starts_with to select the variables # First, do it with rowSums, # either by picking column indices or using tidyselect midwest$rowsum_rowSums1 <- rowSums(midwest[,c(12:16,18:20,22:26)]) midwest$rowsum_rowSums2 <- midwest %>% select(starts_with('perc')) %>% rowSums() # Next, with apply - we're doing sum() here for the function # but it could be anything midwest$rowsum_apply <- apply( midwest %>% select(starts_with('perc')), MARGIN = 1, sum) # Next, two ways with purrr: library(purrr) # First, using purrr::reduce, which is good for some functions like summing # Note that . is the data set being sent by %>% midwest <- midwest %>% mutate(rowsum_purrrReduce = reduce(select(., starts_with('perc')), `+`)) # More flexible, purrr::pmap, which works for any function # using pmap_dbl here to get a numeric variable rather than a list midwest <- midwest %>% mutate(rowsum_purrrpmap = pmap_dbl( select(.,starts_with('perc')), sum)) # So do we get 300? summary(midwest$rowsum_rowSums2) # Uh-oh... looks like we didn't understand the data after all.
Stata has a series of built-in row operations that use the
egen command. See
help egen for the full list, and look for functions beginning with
The full list includes:
rowlast (first or last non-missing observation),
rowtotal (the mean, median, max, min, given percentile, or sum of all the variables), and
rownonmiss (the count of the number of missing or nonmissing observations across the variables).
The egenmore package, which can be installed with
ssc install egenmore, adds
rcount (checks a condition for each variable and returns whether all are true, any are true, or the number that are true),
rowsvals (number of unique values for numeric and string variables, respectively), and
rowtotal with some additional options).
* Get data on midwestern states import delimited using "https://vincentarelbundock.github.io/Rdatasets/csv/ggplot2/midwest.csv" * There are three sets of variables starting with "perc" - let's make sure they * add up to 300 as they should * Use * as a wildcard for variable names egen total_perc = rowtotal(perc*) summ total_perc * They don't! Uh oh. * Let's just check the education variables - should add up to 100 * Use - to include all variables from one to the other * based on their current order in the data egen total_ed = rowtotal(perchsd-percprof) * Oh that explains it... * These aren't exclusive categories (HSD, college overlap) * and also leaves out non-HS graduates. summ total_ed