Modifying Data

summary: Being able to quickly modify datasets is critical. This involves, among other things: subsetting, sorting, extracting unique observations, renaming variables, dropping variables, and creating new variables. To accomplish each of these tasks, we’ll use a set of relatively new functions introduced in the dplyr package developed by Hadley Wickham. A nice introduction to this package is here. I’ll try to give more of a beginner’s introduction below.

Overview

The table below gives an overview of very common data management tasks and their corresponding dplyr function.

Task

Function

\*Covered in a `separate tutorial <../collapsing-data>`__.

Subsetting

filter()

Sorting

arrange()

Dropping variables

select()

Renaming variables

rename()

Extracting unique values

distinct()

Creating new variables

mutate()

Collapsing data*

group_by() and summarize()

All of these tasks can also be accomplished using R’s base functions, but this usually requires more involved code. An additional advantage of using dplyr functions is that they can be combined in an elegant way using chaining.

To demonstrate these functions, we’ll rely on a small data frame called countries that you can create as follows:

[1]:
countries <- data.frame(
    expand.grid(country = c("USA", "China", "Sudan"), year = 1994:1996),
    gdp_pc = round(runif(9, 1000, 20000), 0)
    )
countries$country <- as.character(countries$country) #factor --> character
countries
countryyeargdp_pc
USA 1994 2306
China1994 7525
Sudan1994 3702
USA 1995 16697
China1995 4938
Sudan1995 9435
USA 1996 13227
China1996 17835
Sudan1996 4718

We’ll also make use of world-small.csv, which you can download here.

Before we begin, let’s also load two packages we’ll need. (Of course, they need to be installed first — remind yourself here.)

[2]:
require(plyr)
require(dplyr)
Loading required package: plyr
Loading required package: dplyr

Attaching package: ‘dplyr’

The following objects are masked from ‘package:plyr’:

    arrange, count, desc, failwith, id, mutate, rename, summarise,
    summarize

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

Subsetting

The best way to subset a dataset is by using a logical statement. The idea is that we want to keep only some rows in a dataset that meet some logical condition.

Say we wanted to subset the countries dataset to (1) a certain country, (2) a certain year, (3) a certain country and year, or (4) a certain GDP range. Here’s how we can do this using filter().

[3]:
filter(countries, country == "China")
filter(countries, year == 1996)
filter(countries, country == "USA" & year == 1995)
filter(countries, gdp_pc > 5000 & gdp_pc < 14000)
countryyeargdp_pc
China1994 7525
China1995 4938
China1996 17835
countryyeargdp_pc
USA 1996 13227
China1996 17835
Sudan1996 4718
countryyeargdp_pc
USA 1995 16697
countryyeargdp_pc
China1994 7525
Sudan1995 9435
USA 1996 13227

As should be obvious, the first argument in the filter() function specifies the dataset on which to carry out the operation. The second argument specifies the logical operation used to filter the data.

In fact, the first argument in all dplyr functions is a dataset. Using dplyr we can move this first argument outside the function and use the special operator %>% to chain statements. The benefits of this will become more obvious in the section on chaining. The code below accomplishes the same thing as that above using %>% and moving the data frame outside the filter() function.

[4]:
countries %>% filter(country == "China")
countries %>% filter(year == 1996)
countries %>% filter(country == "USA" & year == 1995)
countries %>% filter(gdp_pc > 5000 & gdp_pc < 14000)
countryyeargdp_pc
China1994 7525
China1995 4938
China1996 17835
countryyeargdp_pc
USA 1996 13227
China1996 17835
Sudan1996 4718
countryyeargdp_pc
USA 1995 16697
countryyeargdp_pc
China1994 7525
Sudan1995 9435
USA 1996 13227

One last note that applies throughout most of this tutorial: I’m not saving the subsets to any object here. You would normally want to do this. (I’m not doing it here so that you can see the output of the function without extra lines of code.) Just to be clear, if you wanted to take a subset and save the result into an object you’d use the assignment operator as usual.

[5]:
china <- countries %>% filter(country == "China")
china
countryyeargdp_pc
China1994 7525
China1995 4938
China1996 17835

Sorting

Use arrange() to sort a dataset. Here are a few examples. (I’ll use the chaining operator %>% from now on — if you’re confused about this remember that you can just bring the data frame countries inside the function.)

[6]:
# Sort by country names
countries %>% arrange(country)

# Sort by GDP (ascending is default)
countries %>% arrange(gdp_pc)

# Sort by GDP (descending)
countries %>% arrange(desc(gdp_pc))

# Sort by country name, then GDP
countries %>% arrange(country, gdp_pc)
countryyeargdp_pc
China1994 7525
China1995 4938
China1996 17835
Sudan1994 3702
Sudan1995 9435
Sudan1996 4718
USA 1994 2306
USA 1995 16697
USA 1996 13227
countryyeargdp_pc
USA 1994 2306
Sudan1994 3702
Sudan1996 4718
China1995 4938
China1994 7525
Sudan1995 9435
USA 1996 13227
USA 1995 16697
China1996 17835
countryyeargdp_pc
China1996 17835
USA 1995 16697
USA 1996 13227
Sudan1995 9435
China1994 7525
China1995 4938
Sudan1996 4718
Sudan1994 3702
USA 1994 2306
countryyeargdp_pc
China1995 4938
China1994 7525
China1996 17835
Sudan1994 3702
Sudan1996 4718
Sudan1995 9435
USA 1994 2306
USA 1996 13227
USA 1995 16697

Dropping

To only keep some variables in a data frame use select():

[7]:
# Keep country and GDP
countries %>% select(country, gdp_pc)

# Same thing using '-', implying you want to delete a variable
countries %>% select(-year)

# Selecting and renaming in one
countries %>% select(country_name = country, gdp_pc)
countrygdp_pc
USA 2306
China 7525
Sudan 3702
USA 16697
China 4938
Sudan 9435
USA 13227
China17835
Sudan 4718
countrygdp_pc
USA 2306
China 7525
Sudan 3702
USA 16697
China 4938
Sudan 9435
USA 13227
China17835
Sudan 4718
country_namegdp_pc
USA 2306
China 7525
Sudan 3702
USA 16697
China 4938
Sudan 9435
USA 13227
China17835
Sudan 4718

Renaming

As illustrated in the last line of code above you can rename variables using select(). But this can also be done using rename():

[8]:
# Rename GDP per capita
countries %>% rename(GDP.PC = gdp_pc)
countryyearGDP.PC
USA 1994 2306
China1994 7525
Sudan1994 3702
USA 1995 16697
China1995 4938
Sudan1995 9435
USA 1996 13227
China1996 17835
Sudan1996 4718

Unique values

Removing duplicate observations can be useful, but be careful: entire rows will be deleted. Use distinct():

[9]:
countries %>% distinct(country)
country
USA
China
Sudan

Note that the function keeps the first non-duplicate. This is more useful when you suspect that duplicate values of the following kind have slipped into the dataset:

[10]:
countries2 <- rbind(
    data.frame(country = "USA", year = 1994, gdp_pc = 10000),
    countries
    )
countries2
countryyeargdp_pc
USA 1994 10000
USA 1994 2306
China1994 7525
Sudan1994 3702
USA 1995 16697
China1995 4938
Sudan1995 9435
USA 1996 13227
China1996 17835
Sudan1996 4718

Note that we all of a sudden have two USA 1994 observations. Delete one of them using distinct().

[11]:
countries2 %>% distinct(country, year)
countryyear
USA 1994
China1994
Sudan1994
USA 1995
China1995
Sudan1995
USA 1996
China1996
Sudan1996

New variables

Use mutate() to create new variables or to modify existing variables. Here are a few examples.

[12]:
# Create a new variable that has GDP per capita in 1000s
countries %>% mutate(gdppc_1k = gdp_pc / 1000)

# Create a new variable with lower-case country names
countries %>% mutate(country_lc = tolower(country))

# Both in one statement
countries %>% mutate(gdppc_1k = gdp_pc / 1000,
                     country_lc = tolower(country))
countryyeargdp_pcgdppc_1k
USA 1994 2306 2.306
China 1994 7525 7.525
Sudan 1994 3702 3.702
USA 1995 16697 16.697
China 1995 4938 4.938
Sudan 1995 9435 9.435
USA 1996 13227 13.227
China 1996 17835 17.835
Sudan 1996 4718 4.718
countryyeargdp_pccountry_lc
USA 1994 2306usa
China1994 7525china
Sudan1994 3702sudan
USA 1995 16697usa
China1995 4938china
Sudan1995 9435sudan
USA 1996 13227usa
China1996 17835china
Sudan1996 4718sudan
countryyeargdp_pcgdppc_1kcountry_lc
USA 1994 2306 2.306usa
China 1994 7525 7.525china
Sudan 1994 3702 3.702sudan
USA 1995 16697 16.697usa
China 1995 4938 4.938china
Sudan 1995 9435 9.435sudan
USA 1996 13227 13.227usa
China 1996 17835 17.835china
Sudan 1996 4718 4.718sudan

A lot of times it makes more sense to just overwrite an existing variable rather than adding a variable.

[13]:
countries %>% mutate(country = tolower(country))
countryyeargdp_pc
usa 1994 2306
china1994 7525
sudan1994 3702
usa 1995 16697
china1995 4938
sudan1995 9435
usa 1996 13227
china1996 17835
sudan1996 4718

mutate() can be combined with two other useful functions: ifelse() and revalue().

ifelse()

ifelse() is a logical function that is useful for modifying variables in datasets (or individual vectors). Here’s an illustration of how it works:

[14]:
numbers <- 1:10
ifelse(numbers > 5, numbers * 10, numbers / 10)
  1. 0.1
  2. 0.2
  3. 0.3
  4. 0.4
  5. 0.5
  6. 60
  7. 70
  8. 80
  9. 90
  10. 100

The function takes three arguments: (1) a logical test, (2) what to do if the test is true, and (3) what to do if the test is false. Thus, in the code above numbers greater than 5 are multiplied by 10 and numbers less than or equal to 5 are divided by 10.

We can apply this to variables in data frames. Let’s use the world-small.csv dataset to illustrate (download here).

[15]:
# Read world-small dataset
world <- read.csv("data/world-small.csv")
[16]:
head(world)

# Create a new variable equal to "democracy" if a country has
# a polity score >= 15 and "autocracy" otherwise
world <- world %>% mutate(democracy = ifelse(polityIV >= 15, "democracy", "autocracy"))
head(world)

# Or represent this information as a dummy variable instead
world <- world %>% mutate(democracy = ifelse(polityIV >= 15, 1, 0))
head(world)
countryregiongdppcap08polityIV
Albania C&E Europe 7715 17.8
Algeria Africa 8033 10.0
Angola Africa 5899 8.0
Argentina S. America 14333 18.0
Armenia C&E Europe 6070 15.0
Australia Asia-Pacific35677 20.0
countryregiongdppcap08polityIVdemocracy
Albania C&E Europe 7715 17.8 democracy
Algeria Africa 8033 10.0 autocracy
Angola Africa 5899 8.0 autocracy
Argentina S. America 14333 18.0 democracy
Armenia C&E Europe 6070 15.0 democracy
Australia Asia-Pacific35677 20.0 democracy
countryregiongdppcap08polityIVdemocracy
Albania C&E Europe 7715 17.8 1
Algeria Africa 8033 10.0 0
Angola Africa 5899 8.0 0
Argentina S. America 14333 18.0 1
Armenia C&E Europe 6070 15.0 1
Australia Asia-Pacific35677 20.0 1

We first created a new variable called democracy equal to “democracy” if a country has a Polity IV score of at least 15 and “autocracy” otherwise. We then overwrote this variable with a “dummy variable” representing the same information using zeroes and ones.

General note about dummy variables: Dummy variables — sometimes called “indicator variables” or “binary variables” — are extremely useful for representing binary information. Their usefulness in part stems from their applicability in regression models and in part from how they can be used to summarize information. For example, taking the mean of the dummy variable ``democracy`` above — that is, ``mean(world$democracy)`` — gives us the proportion of countries that we classified as “democracy”.

revalue()

We often want to recode the categories of a variable. For example, take a look at the region variable in world:

[17]:
table(world$region)

      Africa Asia-Pacific   C&E Europe  Middle East   N. America   S. America
          42           24           25           16            3           19
 Scandinavia    W. Europe
           4           12

Currently this variable has eight categories. Say we wanted to simplify it a little bit by combining the three regions in Europe. We also want to rename the two regions in America. This can be accomplished with revalue() from the plyr package.

[18]:
world$region <- revalue(world$region, c(
    "C&E Europe" = "Europe",
    "Scandinavia" = "Europe",
    "W. Europe" = "Europe",
    "N. America" = "North America",
    "S. America" = "South America"))
table(world$region)

       Africa  Asia-Pacific        Europe   Middle East North America
           42            24            41            16             3
South America
           19

This can also be accomplished with ifelse(), but we would have to nest several ifelse() statements. For this reason I prefer revalue().

A note on factor variables

The read.csv() function automatically converts variables that contain strings to the class “factor”. Note, for example, that both the country and region variables in the world dataset are represented as factors.

[19]:
class(world$country)
class(world$region)
'factor'
'factor'

These variables are useful in regression models in R. For example, including region in an OLS model (using command lm()) would automatically represent all regions as dummy variables except one left as the reference category. If that’s not making sense to you at this stage, don’t worry at all.

But factor variables can also cause issues if you’re not careful. In particular, factor variables are sometimes displayed to the user as a certain set of numbers, yet are stored internally as an entirely different set of numbers. For example, say the gdppcap08 variable in world had been read as a factor. (Fortunately, in reality it was read correctly as numeric, but this sometimes isn’t the case.)

[20]:
# Recode gdp per capita to factor
# NOTE: Only for illustrative purposes!
world <- world %>% mutate(gdppcap08 = as.factor(gdppcap08))
head(world)
class(world$gdppcap08)
countryregiongdppcap08polityIVdemocracy
Albania Europe 7715 17.8 1
Algeria Africa 8033 10.0 0
Angola Africa 5899 8.0 0
Argentina South America14333 18.0 1
Armenia Europe 6070 15.0 1
Australia Asia-Pacific 35677 20.0 1
'factor'

Ok, the GDP variable is now a factor variable. Note, though, that it looks just like a numeric variable. You might therefore be tempted to try to do things like mean(world$gdppcap08), yet this would return NA instead of an actual value as before. You might also be tempted to recode the GDP variable from factor to numeric using the following code:

[21]:
head(world %>% mutate(gdppcap08 = as.numeric(gdppcap08)))
countryregiongdppcap08polityIVdemocracy
Albania Europe 76 17.8 1
Algeria Africa 79 10.0 0
Angola Africa 67 8.0 0
Argentina South America 99 18.0 1
Armenia Europe 69 15.0 1
Australia Asia-Pacific 132 20.0 1

Something has gone seriously wrong — look at the values of gdppcap08. Here’s the learning lesson: to convert a factor variable to numeric, you have to recode it to character first. Use as.numeric(as.character(variable)), where variable is the factor variable to convert.

[22]:
world <- world %>% mutate(gdppcap08 = as.numeric(as.character(gdppcap08)))
head(world)
class(world$gdppcap08)
countryregiongdppcap08polityIVdemocracy
Albania Europe 7715 17.8 1
Algeria Africa 8033 10.0 0
Angola Africa 5899 8.0 0
Argentina South America14333 18.0 1
Armenia Europe 6070 15.0 1
Australia Asia-Pacific 35677 20.0 1
'numeric'

Chaining

What if you wanted to apply multiple functions to one data frame? This is where chaining is elegant and facilitates debugging.

More specifically, say we wanted to use countries to create a new data frame called countries_new, which should have observations from years 1995 and 1996 (dropping 1994), should be sorted by country name (in lower case), and should have a new variable equal to GDP per capita in 1000s.

Here’s how we could do this without chaining:

[23]:
countries_new <- filter(countries, year != 1994) #drop year 1994
countries_new <- arrange(countries_new, country) #sort by country names
countries_new <- mutate(countries_new, country = tolower(country), #convert name to lower-case
                                       gdppc_1k = gdp_pc / 1000) #create GDP pc in 1000s
countries_new
countryyeargdp_pcgdppc_1k
china 1995 4938 4.938
china 1996 17835 17.835
sudan 1995 9435 9.435
sudan 1996 4718 4.718
usa 1995 16697 16.697
usa 1996 13227 13.227

Here’s the same thing using chaining:

[24]:
countries_new <- countries %>%
    filter(year != 1994) %>%
    arrange(country) %>%
    mutate(country = tolower(country), gdppc_1k = gdp_pc / 1000)
countries_new
countryyeargdp_pcgdppc_1k
china 1995 4938 4.938
china 1996 17835 17.835
sudan 1995 9435 9.435
sudan 1996 4718 4.718
usa 1995 16697 16.697
usa 1996 13227 13.227

I find code that makes use of chaining more readable. Chaining always begins with specifying the data frame we want to operate on (e.g., countries). Every subsequent statement will operate on this data frame, starting with the function that comes right after the data frame and working its way down. In our case, the first thing we’ll do to countries is to subset it. We’ll then sort it by country name. Lastly, we’ll overwrite the country name to be lower-case and create a new variable representing GDP per capita in 1000s.

Exercises

  1. Read the world-small.csv dataset into R and store it in an object called world.
  2. Subset world to European countries. Save this subset as a new data frame called europe.
  3. Add two variables to europe:
    1. A variable that recodes polityIV from 0-20 to -10-10.
    2. A variable that categorizes a country as “rich” or “poor” based on some cutoff of gdppcap08 you think is reasonable.
  4. Drop the region variable in europe (keep the rest).
  5. Sort europe based on Polity IV.
  6. Repeat Exercises 2-5 using chaining.
  7. What was the world’s mean GDP per capita in 2008? Polity IV score?
  8. What was Africa’s mean GDP per capita and Polity IV score?
  9. What was the poorest country in the world in 2008? Richest?
  10. How many countries in Europe are “rich” according to your coding? How many are poor? What percentage have Polity IV scores of at least 18?
[25]: