Merging & Appending

summary: This tutorial explains how to combine datasets by merging or appending them. Merging means transferring columns from one dataset to another, while appending means transferring rows from one dataset to another. These are two important tasks since datasets only very rarely come with all the information we need.

Merging

Say you’re interested in how economic development affects democracy across countries, but all you have are two separate datasets on these two factors. What you need to do is to combine the two. When you’re done with this, you also need information on what world region each country is located in. These tasks can be accomplished with one-to-one merging and one-to-many merging.

One-to-one merging

To illustrate, say you had two datasets that look like this:

[1]:
# Generate toy dataset 1:
demo <- data.frame(
    country = c("USA", "Albania", "Turkey", "China", "Sudan"),
    democracy_score = c(19, 16, 16, 5, 10)
    )
demo

# Generate toy dataset 2:
econ <- data.frame(
    country = c("China", "Albania", "Turkey", "USA", "Sudan"),
    gdp_pc = c(12000, 10000, 9000, 20000, 500)
    )
econ
countrydemocracy_score
USA 19
Albania16
Turkey 16
China 5
Sudan 10
countrygdp_pc
China 12000
Albania10000
Turkey 9000
USA 20000
Sudan 500

We can merge these datasets using R’s merge() command, where by specifies the variable the two datasets has in common (usually called an “ID” variable):

[2]:
df <- merge(demo, econ, by = "country")
df
countrydemocracy_scoregdp_pc
Albania16 10000
China 5 12000
Sudan 10 500
Turkey 16 9000
USA 19 20000

You can merge on more than one variable. Say you had two datasets that look like this:

[3]:
# Generate toy dataset 1:
demo <- data.frame(
    expand.grid(country = c("USA", "China", "Sudan"), year = 1994:1996),
    democracy_score = round(runif(9, 0, 20), 0)
    )
demo

# Generate toy dataset 2:
econ <- data.frame(
    expand.grid(year = 1994:1996, country = c("USA", "China", "Sudan")),
    gdp_pc = round(runif(9, 1000, 20000), 0)
    )
econ

merge(demo, econ, by = c("country", "year"))
countryyeardemocracy_score
USA 1994 15
China1994 13
Sudan1994 9
USA 1995 1
China1995 7
Sudan1995 3
USA 1996 2
China1996 10
Sudan1996 9
yearcountrygdp_pc
1994 USA 17180
1995 USA 14869
1996 USA 8093
1994 China 8566
1995 China14255
1996 China13959
1994 Sudan19699
1995 Sudan17739
1996 Sudan11488
countryyeardemocracy_scoregdp_pc
China1994 13 8566
China1995 7 14255
China1996 10 13959
Sudan1994 9 19699
Sudan1995 3 17739
Sudan1996 9 11488
USA 1994 15 17180
USA 1995 1 14869
USA 1996 2 8093

Here, we’re merging on both country and year.

If one of the data frames has missing observations, only observations the two data frames have in common will be kept when merging, unless you specify all.x, all.y, or all:

[4]:
econ <- econ[-c(5, 9), ] #delete obs. 5 & 9 for illustrative purposes
econ
dim(demo) #different number of observations (rows) than 'econ'
dim(econ) #different number of observations (rows) than 'demo'

merge(demo, econ, by = c("country", "year")) #keep only matching observations
merge(demo, econ, by = c("country", "year"), all.x = TRUE) #keep all observations in 'demo'
yearcountrygdp_pc
11994 USA 17180
21995 USA 14869
31996 USA 8093
41994 China 8566
61996 China13959
71994 Sudan19699
81995 Sudan17739
  1. 9
  2. 3
  1. 7
  2. 3
countryyeardemocracy_scoregdp_pc
China1994 13 8566
China1996 10 13959
Sudan1994 9 19699
Sudan1995 3 17739
USA 1994 15 17180
USA 1995 1 14869
USA 1996 2 8093
countryyeardemocracy_scoregdp_pc
USA 1994 15 17180
USA 1995 1 14869
USA 1996 2 8093
China1994 13 8566
China1995 7 NA
China1996 10 13959
Sudan1994 9 19699
Sudan1995 3 17739
Sudan1996 9 NA

One-to-many merging

In the previous examples the ID variables we used for merging uniquely identified observations in each data frame. This is called one-to-one merging.

Sometimes we need to do things slightly differently, using one-to-many merging. Say, for example, we have the following two data frames:

[5]:
demo <- data.frame(
    expand.grid(country = c("USA", "China", "Sudan"), year = 1994:1996),
    democracy_score = round(runif(9, 0, 20), 0)
    )
demo

region <- data.frame(
    country = c("USA", "China", "Sudan"),
    region = c("America", "Asia", "Africa")
    )
region
countryyeardemocracy_score
USA 1994 3
China1994 6
Sudan1994 11
USA 1995 8
China1995 15
Sudan1995 2
USA 1996 11
China1996 3
Sudan1996 3
countryregion
USA America
China Asia
Sudan Africa

A one-to-many merge of these data frames would look like this:

[6]:
merge(demo, region, by = "country")
countryyeardemocracy_scoreregion
China 1994 6 Asia
China 1995 15 Asia
China 1996 3 Asia
Sudan 1995 2 Africa
Sudan 1994 11 Africa
Sudan 1996 3 Africa
USA 1994 3 America
USA 1996 11 America
USA 1995 8 America

Appending

Appending means matching datasets vertically. We can do this in R using rbind(). The two dataframes you’re appending must have identical variable names. Here’s an example:

[7]:
df1 <- data.frame(
    year = rep(1990:1995, 2),
    country = c(rep("country1", 6), rep("country2", 6))
    )
df2 <- data.frame(
    year = rep(1996:2000, 2),
    country = c(rep("country1", 5), rep("country2", 5))
    )

df <- rbind(df1, df2) #append

require(dplyr) #to display output in certain order
arrange(df, country, year)
Loading required package: dplyr

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

yearcountry
1990 country1
1991 country1
1992 country1
1993 country1
1994 country1
1995 country1
1996 country1
1997 country1
1998 country1
1999 country1
2000 country1
1990 country2
1991 country2
1992 country2
1993 country2
1994 country2
1995 country2
1996 country2
1997 country2
1998 country2
1999 country2
2000 country2

Exercises

Warm-up: Merge the three data frames defined below. Your final data frame should have 9 rows and 5 columns. Hint: merge only two data frames at a time.

[8]:
df1 <- data.frame(
    name = c("Mary", "Thor", "Sven", "Jane", "Ake", "Stephan",
             "Bjorn", "Oden", "Dennis"),
    treatment_gr = c(rep(c(1, 2, 3), each = 3)),
    weight_p1 = round(runif(9, 100, 200), 0)
    )
df2 <- data.frame(
    name = c("Sven", "Jane", "Ake", "Mary", "Thor", "Stephan",
             "Oden", "Bjorn"),
    weight_p2 = round(runif(8, 100, 200), 0)
    )
df3 <- data.frame(
    treatment_gr = c(1, 2, 3),
    type = c("dog-lovers", "cat-lovers", "all-lovers")
    )

More extensive: A prominent economic theory predicts that higher income inequality should be associated with more redistribution from the rich to the poor (Meltzer and Richard 1981). Let’s create a dataset that will allow us to test this prediction using U.S. state-level data.

  1. Download data on inequality and tax collection (links below). Take a moment to familiarize yourself with the codebooks. Then read each dataset into R.
  • Dataset with six measures of inequality (originally from Mark W. Frank).
  • Tax data that can be used to measure fiscal capacity/redistribution. Use the “State Government Tax Collections” link.
  1. Keep only the Year, State, Name, Total Taxes, and Total Income Taxes variables in the tax dataset, and rename these if necessary. Keep all the variables in the inequality dataset. Subset both datasets to be in the year range 1960-2012.
  2. Merge the two datasets. Take a moment to think about how to do this. Your final dataset should have 2650 rows and 12 variables. (Keep only data on the 50 states; 50 states x 53 years = 2650 observations.) Hint: You may find this dataset with state identifiers helpful.
  3. The Total Tax and Total Income Tax variables from the tax dataset have commas in them, and are therefore not numeric. Remove the commas and convert the variables to numeric. Hint: gsub().