Histograms

Histograms are an indespensible tool of research across disciplines. They offer a helpful way to represent the distribution of a variable of interest. Specifically, their function is to record how frequently data values fall within pre-specified ranges called “bins.” Such visual representations can help researchers easily detect whether their data are distributed in a skewed or symmetric way, and can help detect outliers.

Despite being such a popular tool for scientific research, choosing the bin width (alternatively, number of bins) is ultimately a choice by the researcher. Histograms are intended to convey information about the variable, and choosing the “right” bin size to convey the information helpfully can be something of an art.

The relationship between bin width \(h\) and the number of bins \(k\) is given by:

\[k = \frac{ \max x - \min x}{h}\]

For this reason, statistical softwares such as R and Stata will often accept either custom bin width specifications, or a number of bins.

Histogram vs. bar graph

Because histograms represent data frequency using rectangular bars, they might be mistaken for bar graphs at first glance. Whereas bar graphs (sometimes called bar charts) plot values for categorical data, histograms represent the distribution of continuous variables such as income, height, weight, etc.

Implementations

When feeding data to visualise using a histogram, one will notice that both R and Stata will attempt to “guess” what the “best” bin width/number of bins are. These may be overridden by user commands, as we will see.

Python

There are many plotting libraries in Python, including declarative (say what you want) and imperative (build what you want) options. In the example below, we’ll explore several different options for plotting histogram data.

By far the quickest way to plot a histogram is to use data analysis package pandas’ built-in bar chart option (which uses plotting library matplotlib).

import pandas as pd

df = pd.read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/Ecdat/PSID.csv",
                 index_col=0)

df['earnings'].plot.hist()

png

Note that, by default, the hist function shows counts on the y-axis.

We can make this plot a bit more appealing by calling on the customisation features from matplotlib. Calling df['column-name'].plot.hist() returns a matplotlib ax object that accepts further customisation. We will use this and a style to make the plot more appealing, while also switching to show a density rather than counts, setting the number of bins explicitly, and using a logarithmic scale.

import matplotlib.pyplot as plt

plt.style.use('seaborn')

ax = df['earnings'].plot.hist(density=True, log=True, bins=80)
ax.set_title('Earnings in the PSID', loc='left')
ax.set_ylabel('Density')
ax.set_xlabel('Earnings');

png

An alternative to the matplotlib-pandas combination is seaborn, a declarative plotting library. Using seaborn, we can quickly compared histograms of different cuts of the data. In the example below, a binary variable that sorts individuals into two groups based on age is added and used as the ‘hue’ of the keyword argument.

import seaborn as sns

age_cut_off = 45
df[f'Older than {age_cut_off}'] = df['age']>age_cut_off

ax = sns.histplot(df, x="earnings", hue=f"Older than {age_cut_off}", element="step", stat="density")
ax.set_yscale('log')

png

Finally, let’s look at a different declarative library, plotnine, which is inspired by (and has very similar syntax to) R’s plotting library ggplot.

from plotnine import ggplot, aes, geom_histogram

(
    ggplot(df, aes(x='earnings', y='stat(density)')
          )
    + geom_histogram(bins=80)
)

png

R

Histograms can be represented using base R, or more elegantly with ggplot. R comes with a built-in state.x77 dataset containing per-capita income in the US states for the year 1974, which we will be using.

# loading the data

incomes = data.frame(income = state.x77[,'Income'])

# first using base R

hist(incomes$income)

# now using ggplot

library(ggplot2)

ggplot(data = incomes) +
  geom_histogram(aes(x = income))

# showing how we can adjust number of bins...

ggplot(data = incomes) +
  geom_histogram(aes(x = income), bins = 15)

# ...or the width of each bin

ggplot(data = incomes) +
  geom_histogram(aes(x = income), binwidth = 500)

Stata

To illustrate the basic histogram function in Stata we will use the “auto” dataset.

** loading the data

webuse auto

* histogram with default bin width
* The frequency option puts a count of observations on the y-axis
* rather than a proportion

histogram mpg, frequency

* we can adjust the number of bins...

histogram mpg, bin(15) frequency

* ...or the bin width

hist mpg, width(2) frequency