Combining Datasets Overview

There are two main ways to combine data: vertically and horizontally. That is, you can want to combine observations (adding new variables) or combine variables (adding new observations). This is perhaps easiest to show visually:

Individual Name Info

Name	ID
John Smith	A63240
Desiree Thomas	B78242

Individual Age Info

ID	Age
B78242	22
A63240	27

In the case above, we would like to combine two datasets, the Individual Name Info and the Individual Date Info, that have different information about the same people, who are identified by the ID variable. The result from the merge would be to have a new dataset with more columns than the original datasets because it contains all of the information for each individual from both of the original datasets. Here we have to combine the files according to the ID variable, placing the information from observations with the same ID on the same row in the combined dataset.

Alternatively, the below example has two datasets that collect the same information about different people. We would like to combine these datasets vertically, with the result containing more rows than the original dataset, because it contains all of the people that are present in each of the original datasets. Here we combine the files based on the name or position of the columns in the dataset.

Name	ID	Age
John Smith	A63240	22
Desiree Thomas	B78242	27

Name	ID	Age
Teresa Suarez	Y34208	19
Donald Akliberti	B72197	34

These ways of combining data are referred to by different names across different programming languages, but will largely be referred to by one common set of terms (used by Stata and Python’s Pandas): merge for horizontal combination and append for for vertical combination.

Combining Datasets Overview

Table of contents