Link Search Menu Expand Document

Cross Tabulations

A cross-tabulation is a table that shows the relationship between two or more variables. While more complex models are incredibly important, it is just as useful to quickly understand and present a basic picture of your data. This is where cross-tabulations come in handy, they simplify the data by creating subgroups which can be interpreted at a smaller and more granular scale.

A cross-tabulation is rudimentary form of analysis, and a great starting point for working with relationships between discrete variables. When presenting data for initial qualitative and quantitative analysis it is important to show how distribution of responses and distribution of groups works in the dataset. This can allow you to immediately see where deeper analysis can be used and the patterns within the data. They are specifically useful in both market research and population surveys.

Keep in Mind

  • Cross-tabulations are generally not appropriate if either of the variables you’re looking at is continuous. You may want to consider a scatterplot in this case, or a number of other options.
  • By default, cross-tabulations count (tabulate) the number of observations in each cell. But it is usually straightforward to have it provide the percentage in each cell instead. Which one you want depends on what question you’re trying to answer.

Also Consider

  • Cross-tabulations are relatively simple but the possibilities are endless. It is less about knowing how to make the table and more about knowing what variables to include in order to spot trends in the data.

Implementations

Julia

Julia ecosystem has a specialized library for building cross-tables named FreqTables.jl. As any other registered library it can be installed from a REPL session with import Pkg; Pkg.add("FreqTables") command. We’ll also need CSV and DataFrames libraries for data loading and preparation.

using CSV, DataFrames, FreqTables

# Get the data and convert from a CSV into a Data Frame
lakers = CSV.read(download("https://github.com/LOST-STATS/lost-stats.github.io/raw/source/Presentation/Tables/Data/lakers.csv"), DataFrame)

# Filter for only a couple Lakers players for space
lakersd = lakers[lakers.team .== "LAL", :]
# Ignore missing names while filtering for the given names
lakersd = lakersd[coalesce.(in.(lakersd.player, Ref(["Jordan Farmar", "Pau Gasol", "Kobe Bryant"])), false), :]

# Create a crosstab for each basketball action for each player
ftbl = freqtable(lakersd, :player, :etype)

# 3×6 Named Matrix{Int64}
#  player ╲ etype │       "foul"  "free throw"     "rebound"        "shot"    "turnover"   "violation"
# ────────────────┼───────────────────────────────────────────────────────────────────────────────────
# "Jordan Farmar" │           99            75           110           389            87             4
# "Kobe Bryant"   │          180           529           410          1619           200             7
# "Pau Gasol"     │          159           423           752           993           148             6

You can even create tables analyzing three (or more) variables. Here is the same table as above for home and away games.

freqtable(lakersd, :player, :etype, :game_type)

# 3×6×2 Named Array{Int64, 3}

# [:, :, game_type="away"] =
#  player ╲ etype │       "foul"  "free throw"     "rebound"        "shot"    "turnover"   "violation"
# ────────────────┼───────────────────────────────────────────────────────────────────────────────────
# "Jordan Farmar" │           51            21            48           189            45             2
# "Kobe Bryant"   │           93           315           213           882            93             3
# "Pau Gasol"     │           78           221           389           513            78             5

# [:, :, game_type="home"] =
#  player ╲ etype │       "foul"  "free throw"     "rebound"        "shot"    "turnover"   "violation"
# ────────────────┼───────────────────────────────────────────────────────────────────────────────────
# "Jordan Farmar" │           48            54            62           200            42             2
# "Kobe Bryant"   │           87           214           197           737           107             4
# "Pau Gasol"     │           81           202           363           480            70             1

In order to obtain proportions instead of count use prop function with an optional margins argument. If margins is nothing (the default), proportions over the whole table are computed. If margins is an Integer, or an iterable of Integers, proportions sum to 1 over dimensions specified by margins. In particular for a two-dimensional array, when margins is 1 row proportions are calculated, and when margins is 2 column proportions are calculated.

prop(ftbl)

# 3×6 Named Matrix{Float64}
#  player ╲ etype │       "foul"  "free throw"     "rebound"        "shot"    "turnover"   "violation"
# ────────────────┼───────────────────────────────────────────────────────────────────────────────────
# "Jordan Farmar" │    0.0159935     0.0121163     0.0177706     0.0628433     0.0140549   0.000646204
# "Kobe Bryant"   │    0.0290792     0.0854604     0.0662359      0.261551     0.0323102    0.00113086
# "Pau Gasol"     │    0.0256866      0.068336      0.121486       0.16042     0.0239095   0.000969305

prop(ftbl, margins = 2)

# 3×6 Named Matrix{Float64}
#  player ╲ etype │       "foul"  "free throw"     "rebound"        "shot"    "turnover"   "violation"
# ────────────────┼───────────────────────────────────────────────────────────────────────────────────
# "Jordan Farmar" │     0.226027     0.0730282      0.086478      0.129623           0.2      0.235294
# "Kobe Bryant"   │     0.410959      0.515093      0.322327      0.539487       0.45977      0.411765
# "Pau Gasol"     │     0.363014      0.411879      0.591195       0.33089       0.34023      0.352941

Python

In Python we can use the .crosstab() method in pandas.

import pandas as pd

# Get Lakers data
lakers = pd.read_csv('https://github.com/LOST-STATS/lost-stats.github.io/raw/source/Presentation/Tables/Data/lakers.csv')

# Filter for only a couple Lakers players for space
lakersd = lakers.loc[lakers['team'] == "LAL"]
lakersd = lakersd.loc[lakers['player'].isin(['Jordan Farmar', 'Pau Gasol', 'Kobe Bryant'])]

# Create a crosstab for each basketball action for each player
pd.crosstab(lakersd['player'], lakersd['etype'])

# etype          foul  free throw  rebound  shot  turnover  violation
# player
# Jordan Farmar    99          75      110   389        87          4
# Kobe Bryant     180         529      410  1619       200          7
# Pau Gasol       159         423      752   993       148          6

You can even create tables analyzing three variables. Here is the same table as above for home and away games.

# add game_type to the crosstab
# Use brackets[] to distinguish which two-way crosstab is "inside" a broader one.
pd.crosstab([lakersd['player'], lakersd['etype']], lakersd['game_type'])

# game_type                 away  home
# player        etype
# Jordan Farmar foul          51    48
#               free throw    21    54
#               rebound       48    62
#               shot         189   200
#               turnover      45    42
#               violation      2     2
# Kobe Bryant   foul          93    87
#               free throw   315   214
#               rebound      213   197
#               shot         882   737
#               turnover      93   107
#               violation      3     4
# Pau Gasol     foul          78    81
#               free throw   221   202
#               rebound      389   363
#               shot         513   480
#               turnover      78    70
#               violation      5     1

You can use the normalize argument to get percentages instead of counts. normalize = 'index' gets proportions relative to the row totals, normalize = 'column' goes relative to the column, and normalize = 'all' does relative to the whole table.

pd.crosstab(lakersd['player'], lakersd['etype'], normalize = 'all')

# etype              foul  free throw   rebound      shot  turnover  violation
# player
# Jordan Farmar  0.015994    0.012116  0.017771  0.062843  0.014055   0.000646
# Kobe Bryant    0.029079    0.085460  0.066236  0.261551  0.032310   0.001131
# Pau Gasol      0.025687    0.068336  0.121486  0.160420  0.023910   0.000969

Tests of independence between the two variables can be obtained by passing .crosstab() output to scipy.stats.contingency_table.

R

There are many functions for creating cross tabulations in R, including the base-R table() function. Here we will focus on one: tabyl, which is part of the janitor package, and is the tidyverse version of the table() function. This command can be used to quickly create pretty cross-tabulations which are report ready.

#Load in the packages
library(pacman)
p_load(tidyverse,janitor,kableExtra, lubridate)

First we can see how tabyl can create a quick summary table. To create the input just put the desired x and y variables into the tabyl() command.

# Get lakers data
data("lakers")

# Filter for only a couple Lakers players for space

lakersd = lakers %>% filter(team == "LAL") %>% filter(player %in% c('Jordan Farmar','Pau Gasol','Kobe Bryant'))

# Example using the base-R table() function
table(lakersd$player, lakersd$etype)

# Create a crosstab for each basketball action for each player
# Syntax is similar to table() but tabyl takes the data set as its first argument
lakersd %>%
  tabyl(player,etype)

# player foul free throw rebound shot turnover violation
# Jordan Farmar   99         75     110  389       87         4
# Kobe Bryant  180        529     410 1619      200         7
# Pau Gasol  159        423     752  993      148         6

You can even create tables analyzing three variables. Here is the same table as above for home and away games.


# add game_type to the crosstab
lakersd %>%
  tabyl(player, etype, game_type)

# $away
# player foul free throw rebound shot turnover violation
# Jordan Farmar   51         21      48  189       45         2
# Kobe Bryant   93        315     213  882       93         3
# Pau Gasol   78        221     389  513       78         5
#
# $home
# player foul free throw rebound shot turnover violation
# Jordan Farmar   48         54      62  200       42         2
# Kobe Bryant   87        214     197  737      107         4
# Pau Gasol   81        202     363  480       70         1

With tabyl you can also use the adorn commands to add percentages. For basketball this would help see what players you would be analyzing when looking at team shot data.

lakers_shot = lakersd %>% filter(etype == "shot")

lakers_shot %>%
  # Create tabyl
  tabyl(player,result) %>%
  # Add a totals row
  adorn_totals(where = c("row","col")) %>%
  # Add percentages of total shots taken by players
  adorn_percentages(denominator = "col") %>%
  # Format to 1 decimal place
  adorn_pct_formatting(digits = 1)%>%
  # Include amount of observations
  adorn_ns()

# player          made        missed         Total
# Jordan Farmar  10.3%  (153)  15.5%  (236)  13.0%  (389)
# Kobe Bryant  51.1%  (757)  56.7%  (862)  53.9% (1619)
# Pau Gasol  38.5%  (570)  27.8%  (423)  33.1%  (993)
# Total 100.0% (1480) 100.0% (1521) 100.0% (3001)

A chi square test of independence can use the chisq.test() function, which can accept table() output, or just your two variables directly.

Stata

In Stata we can use the base tabulate command, which can be shortened to tab.

* Load data
import delimited "https://github.com/LOST-STATS/lost-stats.github.io/raw/source/Presentation/Tables/Data/lakers.csv"

* Filter for only a couple Lakers players for space
keep if team == "LAL" & inlist(player, "Jordan Farmar","Pau Gasol","Kobe Bryant")

* Create a crosstab for each basketball action for each player
tab player etype


*                      |                               etype
*               player |      foul  free th..    rebound       shot   turnover  violation |     Total
*----------------------+------------------------------------------------------------------+----------
*        Jordan Farmar |        99         75        110        389         87          4 |       764
*          Kobe Bryant |       180        529        410      1,619        200          7 |     2,945
*            Pau Gasol |       159        423        752        993        148          6 |     2,481
*----------------------+------------------------------------------------------------------+----------
*                Total |       438      1,027      1,272      3,001        435         17 |     6,190

To creat tables analyzing three variables, use by with tab. The following example produces a separate table for home and away games. You could alternately use the table command, using table player etype game_type to get a very similar result. Switching to table is a good idea for anything more complex than this, as it’s a very flexible command.

bysort game_type: tab player etype

*----------------------------------------------------------------------------------------------------------------
*-> game_type = away
*
*                      |                               etype
*               player |      foul  free th..    rebound       shot   turnover  violation |     Total
*----------------------+------------------------------------------------------------------+----------
*        Jordan Farmar |        51         21         48        189         45          2 |       356
*          Kobe Bryant |        93        315        213        882         93          3 |     1,599
*            Pau Gasol |        78        221        389        513         78          5 |     1,284
*----------------------+------------------------------------------------------------------+----------
*                Total |       222        557        650      1,584        216         10 |     3,239
*
*----------------------------------------------------------------------------------------------------------------
*-> game_type = home
*
*                      |                               etype
*               player |      foul  free th..    rebound       shot   turnover  violation |     Total
*----------------------+------------------------------------------------------------------+----------
*        Jordan Farmar |        48         54         62        200         42          2 |       408
*          Kobe Bryant |        87        214        197        737        107          4 |     1,346
*            Pau Gasol |        81        202        363        480         70          1 |     1,197
*----------------------+------------------------------------------------------------------+----------
*                Total |       216        470        622      1,417        219          7 |     2,951

tab has a number of options that allow for different kinds of analysis. row calculates the percentage of the row that each cell constitutes. col does the same for columns, and cell works for the overall table. Tests of independence between the two variables can be performed with the chi2 option (among some other available tests; see the help file).