Merging & Appending¶
summary: This tutorial explains how to combine datasets by merging or appending them. Merging means transferring columns from one dataset to another, while appending means transferring rows from one dataset to another. These are two important tasks since datasets only very rarely come with all the information we need.
Merging¶
Say you’re interested in how economic development affects democracy across countries, but all you have are two separate datasets on these two factors. What you need to do is to combine the two. When you’re done with this, you also need information on what world region each country is located in. These tasks can be accomplished with one-to-one merging and one-to-many merging.
One-to-one merging¶
To illustrate, say you had two datasets that look like this:
[1]:
# Generate toy dataset 1:
demo <- data.frame(
country = c("USA", "Albania", "Turkey", "China", "Sudan"),
democracy_score = c(19, 16, 16, 5, 10)
)
demo
# Generate toy dataset 2:
econ <- data.frame(
country = c("China", "Albania", "Turkey", "USA", "Sudan"),
gdp_pc = c(12000, 10000, 9000, 20000, 500)
)
econ
country | democracy_score |
---|---|
USA | 19 |
Albania | 16 |
Turkey | 16 |
China | 5 |
Sudan | 10 |
country | gdp_pc |
---|---|
China | 12000 |
Albania | 10000 |
Turkey | 9000 |
USA | 20000 |
Sudan | 500 |
We can merge these datasets using R’s merge()
command, where by
specifies the variable the two datasets has in common (usually called an “ID” variable):
[2]:
df <- merge(demo, econ, by = "country")
df
country | democracy_score | gdp_pc |
---|---|---|
Albania | 16 | 10000 |
China | 5 | 12000 |
Sudan | 10 | 500 |
Turkey | 16 | 9000 |
USA | 19 | 20000 |
You can merge on more than one variable. Say you had two datasets that look like this:
[3]:
# Generate toy dataset 1:
demo <- data.frame(
expand.grid(country = c("USA", "China", "Sudan"), year = 1994:1996),
democracy_score = round(runif(9, 0, 20), 0)
)
demo
# Generate toy dataset 2:
econ <- data.frame(
expand.grid(year = 1994:1996, country = c("USA", "China", "Sudan")),
gdp_pc = round(runif(9, 1000, 20000), 0)
)
econ
merge(demo, econ, by = c("country", "year"))
country | year | democracy_score |
---|---|---|
USA | 1994 | 15 |
China | 1994 | 13 |
Sudan | 1994 | 9 |
USA | 1995 | 1 |
China | 1995 | 7 |
Sudan | 1995 | 3 |
USA | 1996 | 2 |
China | 1996 | 10 |
Sudan | 1996 | 9 |
year | country | gdp_pc |
---|---|---|
1994 | USA | 17180 |
1995 | USA | 14869 |
1996 | USA | 8093 |
1994 | China | 8566 |
1995 | China | 14255 |
1996 | China | 13959 |
1994 | Sudan | 19699 |
1995 | Sudan | 17739 |
1996 | Sudan | 11488 |
country | year | democracy_score | gdp_pc |
---|---|---|---|
China | 1994 | 13 | 8566 |
China | 1995 | 7 | 14255 |
China | 1996 | 10 | 13959 |
Sudan | 1994 | 9 | 19699 |
Sudan | 1995 | 3 | 17739 |
Sudan | 1996 | 9 | 11488 |
USA | 1994 | 15 | 17180 |
USA | 1995 | 1 | 14869 |
USA | 1996 | 2 | 8093 |
Here, we’re merging on both country and year.
If one of the data frames has missing observations, only observations the two data frames have in common will be kept when merging, unless you specify all.x
, all.y
, or all
:
[4]:
econ <- econ[-c(5, 9), ] #delete obs. 5 & 9 for illustrative purposes
econ
dim(demo) #different number of observations (rows) than 'econ'
dim(econ) #different number of observations (rows) than 'demo'
merge(demo, econ, by = c("country", "year")) #keep only matching observations
merge(demo, econ, by = c("country", "year"), all.x = TRUE) #keep all observations in 'demo'
year | country | gdp_pc | |
---|---|---|---|
1 | 1994 | USA | 17180 |
2 | 1995 | USA | 14869 |
3 | 1996 | USA | 8093 |
4 | 1994 | China | 8566 |
6 | 1996 | China | 13959 |
7 | 1994 | Sudan | 19699 |
8 | 1995 | Sudan | 17739 |
- 9
- 3
- 7
- 3
country | year | democracy_score | gdp_pc |
---|---|---|---|
China | 1994 | 13 | 8566 |
China | 1996 | 10 | 13959 |
Sudan | 1994 | 9 | 19699 |
Sudan | 1995 | 3 | 17739 |
USA | 1994 | 15 | 17180 |
USA | 1995 | 1 | 14869 |
USA | 1996 | 2 | 8093 |
country | year | democracy_score | gdp_pc |
---|---|---|---|
USA | 1994 | 15 | 17180 |
USA | 1995 | 1 | 14869 |
USA | 1996 | 2 | 8093 |
China | 1994 | 13 | 8566 |
China | 1995 | 7 | NA |
China | 1996 | 10 | 13959 |
Sudan | 1994 | 9 | 19699 |
Sudan | 1995 | 3 | 17739 |
Sudan | 1996 | 9 | NA |
One-to-many merging¶
In the previous examples the ID variables we used for merging uniquely identified observations in each data frame. This is called one-to-one merging.
Sometimes we need to do things slightly differently, using one-to-many merging. Say, for example, we have the following two data frames:
[5]:
demo <- data.frame(
expand.grid(country = c("USA", "China", "Sudan"), year = 1994:1996),
democracy_score = round(runif(9, 0, 20), 0)
)
demo
region <- data.frame(
country = c("USA", "China", "Sudan"),
region = c("America", "Asia", "Africa")
)
region
country | year | democracy_score |
---|---|---|
USA | 1994 | 3 |
China | 1994 | 6 |
Sudan | 1994 | 11 |
USA | 1995 | 8 |
China | 1995 | 15 |
Sudan | 1995 | 2 |
USA | 1996 | 11 |
China | 1996 | 3 |
Sudan | 1996 | 3 |
country | region |
---|---|
USA | America |
China | Asia |
Sudan | Africa |
A one-to-many merge of these data frames would look like this:
[6]:
merge(demo, region, by = "country")
country | year | democracy_score | region |
---|---|---|---|
China | 1994 | 6 | Asia |
China | 1995 | 15 | Asia |
China | 1996 | 3 | Asia |
Sudan | 1995 | 2 | Africa |
Sudan | 1994 | 11 | Africa |
Sudan | 1996 | 3 | Africa |
USA | 1994 | 3 | America |
USA | 1996 | 11 | America |
USA | 1995 | 8 | America |
Appending¶
Appending means matching datasets vertically. We can do this in R using rbind()
. The two dataframes you’re appending must have identical variable names. Here’s an example:
[7]:
df1 <- data.frame(
year = rep(1990:1995, 2),
country = c(rep("country1", 6), rep("country2", 6))
)
df2 <- data.frame(
year = rep(1996:2000, 2),
country = c(rep("country1", 5), rep("country2", 5))
)
df <- rbind(df1, df2) #append
require(dplyr) #to display output in certain order
arrange(df, country, year)
Loading required package: dplyr
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
year | country |
---|---|
1990 | country1 |
1991 | country1 |
1992 | country1 |
1993 | country1 |
1994 | country1 |
1995 | country1 |
1996 | country1 |
1997 | country1 |
1998 | country1 |
1999 | country1 |
2000 | country1 |
1990 | country2 |
1991 | country2 |
1992 | country2 |
1993 | country2 |
1994 | country2 |
1995 | country2 |
1996 | country2 |
1997 | country2 |
1998 | country2 |
1999 | country2 |
2000 | country2 |
Exercises¶
Warm-up: Merge the three data frames defined below. Your final data frame should have 9 rows and 5 columns. Hint: merge only two data frames at a time.
[8]:
df1 <- data.frame(
name = c("Mary", "Thor", "Sven", "Jane", "Ake", "Stephan",
"Bjorn", "Oden", "Dennis"),
treatment_gr = c(rep(c(1, 2, 3), each = 3)),
weight_p1 = round(runif(9, 100, 200), 0)
)
df2 <- data.frame(
name = c("Sven", "Jane", "Ake", "Mary", "Thor", "Stephan",
"Oden", "Bjorn"),
weight_p2 = round(runif(8, 100, 200), 0)
)
df3 <- data.frame(
treatment_gr = c(1, 2, 3),
type = c("dog-lovers", "cat-lovers", "all-lovers")
)
More extensive: A prominent economic theory predicts that higher income inequality should be associated with more redistribution from the rich to the poor (Meltzer and Richard 1981). Let’s create a dataset that will allow us to test this prediction using U.S. state-level data.
- Download data on inequality and tax collection (links below). Take a moment to familiarize yourself with the codebooks. Then read each dataset into R.
- Dataset with six measures of inequality (originally from Mark W. Frank).
- Tax data that can be used to measure fiscal capacity/redistribution. Use the “State Government Tax Collections” link.
- Keep only the Year, State, Name, Total Taxes, and Total Income Taxes variables in the tax dataset, and rename these if necessary. Keep all the variables in the inequality dataset. Subset both datasets to be in the year range 1960-2012.
- Merge the two datasets. Take a moment to think about how to do this. Your final dataset should have 2650 rows and 12 variables. (Keep only data on the 50 states; 50 states x 53 years = 2650 observations.) Hint: You may find this dataset with state identifiers helpful.
- The Total Tax and Total Income Tax variables from the tax dataset have commas in them, and are therefore not numeric. Remove the commas and convert the variables to numeric. Hint:
gsub()
.