Modifying Data¶
summary: Being able to quickly modify datasets is critical. This involves, among other things: subsetting, sorting, extracting unique observations, renaming variables, dropping variables, and creating new variables. To accomplish each of these tasks, we’ll use a set of relatively new functions introduced in the dplyr package developed by Hadley Wickham. A nice introduction to this package is here. I’ll try to give more of a beginner’s introduction below.
Overview¶
The table below gives an overview of very common data management tasks and their corresponding dplyr function.
Task | Function |
---|---|
\*Covered in a `separate tutorial <../collapsing-data>`__. | |
Subsetting |
|
Sorting |
|
Dropping variables |
|
Renaming variables |
|
Extracting unique values |
|
Creating new variables |
|
Collapsing data* |
|
All of these tasks can also be accomplished using R’s base functions, but this usually requires more involved code. An additional advantage of using dplyr functions is that they can be combined in an elegant way using chaining.
To demonstrate these functions, we’ll rely on a small data frame called countries
that you can create as follows:
[1]:
countries <- data.frame(
expand.grid(country = c("USA", "China", "Sudan"), year = 1994:1996),
gdp_pc = round(runif(9, 1000, 20000), 0)
)
countries$country <- as.character(countries$country) #factor --> character
countries
country | year | gdp_pc |
---|---|---|
USA | 1994 | 2306 |
China | 1994 | 7525 |
Sudan | 1994 | 3702 |
USA | 1995 | 16697 |
China | 1995 | 4938 |
Sudan | 1995 | 9435 |
USA | 1996 | 13227 |
China | 1996 | 17835 |
Sudan | 1996 | 4718 |
We’ll also make use of world-small.csv
, which you can download here.
Before we begin, let’s also load two packages we’ll need. (Of course, they need to be installed first — remind yourself here.)
[2]:
require(plyr)
require(dplyr)
Loading required package: plyr
Loading required package: dplyr
Attaching package: ‘dplyr’
The following objects are masked from ‘package:plyr’:
arrange, count, desc, failwith, id, mutate, rename, summarise,
summarize
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
Subsetting¶
The best way to subset a dataset is by using a logical statement. The idea is that we want to keep only some rows in a dataset that meet some logical condition.
Say we wanted to subset the countries
dataset to (1) a certain country, (2) a certain year, (3) a certain country and year, or (4) a certain GDP range. Here’s how we can do this using filter()
.
[3]:
filter(countries, country == "China")
filter(countries, year == 1996)
filter(countries, country == "USA" & year == 1995)
filter(countries, gdp_pc > 5000 & gdp_pc < 14000)
country | year | gdp_pc |
---|---|---|
China | 1994 | 7525 |
China | 1995 | 4938 |
China | 1996 | 17835 |
country | year | gdp_pc |
---|---|---|
USA | 1996 | 13227 |
China | 1996 | 17835 |
Sudan | 1996 | 4718 |
country | year | gdp_pc |
---|---|---|
USA | 1995 | 16697 |
country | year | gdp_pc |
---|---|---|
China | 1994 | 7525 |
Sudan | 1995 | 9435 |
USA | 1996 | 13227 |
As should be obvious, the first argument in the filter()
function specifies the dataset on which to carry out the operation. The second argument specifies the logical operation used to filter the data.
In fact, the first argument in all dplyr functions is a dataset. Using dplyr we can move this first argument outside the function and use the special operator %>%
to chain statements. The benefits of this will become more obvious in the section on chaining. The code below accomplishes the same thing as that above using %>%
and moving the data frame outside the filter()
function.
[4]:
countries %>% filter(country == "China")
countries %>% filter(year == 1996)
countries %>% filter(country == "USA" & year == 1995)
countries %>% filter(gdp_pc > 5000 & gdp_pc < 14000)
country | year | gdp_pc |
---|---|---|
China | 1994 | 7525 |
China | 1995 | 4938 |
China | 1996 | 17835 |
country | year | gdp_pc |
---|---|---|
USA | 1996 | 13227 |
China | 1996 | 17835 |
Sudan | 1996 | 4718 |
country | year | gdp_pc |
---|---|---|
USA | 1995 | 16697 |
country | year | gdp_pc |
---|---|---|
China | 1994 | 7525 |
Sudan | 1995 | 9435 |
USA | 1996 | 13227 |
One last note that applies throughout most of this tutorial: I’m not saving the subsets to any object here. You would normally want to do this. (I’m not doing it here so that you can see the output of the function without extra lines of code.) Just to be clear, if you wanted to take a subset and save the result into an object you’d use the assignment operator as usual.
[5]:
china <- countries %>% filter(country == "China")
china
country | year | gdp_pc |
---|---|---|
China | 1994 | 7525 |
China | 1995 | 4938 |
China | 1996 | 17835 |
Sorting¶
Use arrange()
to sort a dataset. Here are a few examples. (I’ll use the chaining operator %>%
from now on — if you’re confused about this remember that you can just bring the data frame countries
inside the function.)
[6]:
# Sort by country names
countries %>% arrange(country)
# Sort by GDP (ascending is default)
countries %>% arrange(gdp_pc)
# Sort by GDP (descending)
countries %>% arrange(desc(gdp_pc))
# Sort by country name, then GDP
countries %>% arrange(country, gdp_pc)
country | year | gdp_pc |
---|---|---|
China | 1994 | 7525 |
China | 1995 | 4938 |
China | 1996 | 17835 |
Sudan | 1994 | 3702 |
Sudan | 1995 | 9435 |
Sudan | 1996 | 4718 |
USA | 1994 | 2306 |
USA | 1995 | 16697 |
USA | 1996 | 13227 |
country | year | gdp_pc |
---|---|---|
USA | 1994 | 2306 |
Sudan | 1994 | 3702 |
Sudan | 1996 | 4718 |
China | 1995 | 4938 |
China | 1994 | 7525 |
Sudan | 1995 | 9435 |
USA | 1996 | 13227 |
USA | 1995 | 16697 |
China | 1996 | 17835 |
country | year | gdp_pc |
---|---|---|
China | 1996 | 17835 |
USA | 1995 | 16697 |
USA | 1996 | 13227 |
Sudan | 1995 | 9435 |
China | 1994 | 7525 |
China | 1995 | 4938 |
Sudan | 1996 | 4718 |
Sudan | 1994 | 3702 |
USA | 1994 | 2306 |
country | year | gdp_pc |
---|---|---|
China | 1995 | 4938 |
China | 1994 | 7525 |
China | 1996 | 17835 |
Sudan | 1994 | 3702 |
Sudan | 1996 | 4718 |
Sudan | 1995 | 9435 |
USA | 1994 | 2306 |
USA | 1996 | 13227 |
USA | 1995 | 16697 |
Dropping¶
To only keep some variables in a data frame use select()
:
[7]:
# Keep country and GDP
countries %>% select(country, gdp_pc)
# Same thing using '-', implying you want to delete a variable
countries %>% select(-year)
# Selecting and renaming in one
countries %>% select(country_name = country, gdp_pc)
country | gdp_pc |
---|---|
USA | 2306 |
China | 7525 |
Sudan | 3702 |
USA | 16697 |
China | 4938 |
Sudan | 9435 |
USA | 13227 |
China | 17835 |
Sudan | 4718 |
country | gdp_pc |
---|---|
USA | 2306 |
China | 7525 |
Sudan | 3702 |
USA | 16697 |
China | 4938 |
Sudan | 9435 |
USA | 13227 |
China | 17835 |
Sudan | 4718 |
country_name | gdp_pc |
---|---|
USA | 2306 |
China | 7525 |
Sudan | 3702 |
USA | 16697 |
China | 4938 |
Sudan | 9435 |
USA | 13227 |
China | 17835 |
Sudan | 4718 |
Renaming¶
As illustrated in the last line of code above you can rename variables using select()
. But this can also be done using rename()
:
[8]:
# Rename GDP per capita
countries %>% rename(GDP.PC = gdp_pc)
country | year | GDP.PC |
---|---|---|
USA | 1994 | 2306 |
China | 1994 | 7525 |
Sudan | 1994 | 3702 |
USA | 1995 | 16697 |
China | 1995 | 4938 |
Sudan | 1995 | 9435 |
USA | 1996 | 13227 |
China | 1996 | 17835 |
Sudan | 1996 | 4718 |
Unique values¶
Removing duplicate observations can be useful, but be careful: entire rows will be deleted. Use distinct()
:
[9]:
countries %>% distinct(country)
country |
---|
USA |
China |
Sudan |
Note that the function keeps the first non-duplicate. This is more useful when you suspect that duplicate values of the following kind have slipped into the dataset:
[10]:
countries2 <- rbind(
data.frame(country = "USA", year = 1994, gdp_pc = 10000),
countries
)
countries2
country | year | gdp_pc |
---|---|---|
USA | 1994 | 10000 |
USA | 1994 | 2306 |
China | 1994 | 7525 |
Sudan | 1994 | 3702 |
USA | 1995 | 16697 |
China | 1995 | 4938 |
Sudan | 1995 | 9435 |
USA | 1996 | 13227 |
China | 1996 | 17835 |
Sudan | 1996 | 4718 |
Note that we all of a sudden have two USA 1994 observations. Delete one of them using distinct()
.
[11]:
countries2 %>% distinct(country, year)
country | year |
---|---|
USA | 1994 |
China | 1994 |
Sudan | 1994 |
USA | 1995 |
China | 1995 |
Sudan | 1995 |
USA | 1996 |
China | 1996 |
Sudan | 1996 |
New variables¶
Use mutate()
to create new variables or to modify existing variables. Here are a few examples.
[12]:
# Create a new variable that has GDP per capita in 1000s
countries %>% mutate(gdppc_1k = gdp_pc / 1000)
# Create a new variable with lower-case country names
countries %>% mutate(country_lc = tolower(country))
# Both in one statement
countries %>% mutate(gdppc_1k = gdp_pc / 1000,
country_lc = tolower(country))
country | year | gdp_pc | gdppc_1k |
---|---|---|---|
USA | 1994 | 2306 | 2.306 |
China | 1994 | 7525 | 7.525 |
Sudan | 1994 | 3702 | 3.702 |
USA | 1995 | 16697 | 16.697 |
China | 1995 | 4938 | 4.938 |
Sudan | 1995 | 9435 | 9.435 |
USA | 1996 | 13227 | 13.227 |
China | 1996 | 17835 | 17.835 |
Sudan | 1996 | 4718 | 4.718 |
country | year | gdp_pc | country_lc |
---|---|---|---|
USA | 1994 | 2306 | usa |
China | 1994 | 7525 | china |
Sudan | 1994 | 3702 | sudan |
USA | 1995 | 16697 | usa |
China | 1995 | 4938 | china |
Sudan | 1995 | 9435 | sudan |
USA | 1996 | 13227 | usa |
China | 1996 | 17835 | china |
Sudan | 1996 | 4718 | sudan |
country | year | gdp_pc | gdppc_1k | country_lc |
---|---|---|---|---|
USA | 1994 | 2306 | 2.306 | usa |
China | 1994 | 7525 | 7.525 | china |
Sudan | 1994 | 3702 | 3.702 | sudan |
USA | 1995 | 16697 | 16.697 | usa |
China | 1995 | 4938 | 4.938 | china |
Sudan | 1995 | 9435 | 9.435 | sudan |
USA | 1996 | 13227 | 13.227 | usa |
China | 1996 | 17835 | 17.835 | china |
Sudan | 1996 | 4718 | 4.718 | sudan |
A lot of times it makes more sense to just overwrite an existing variable rather than adding a variable.
[13]:
countries %>% mutate(country = tolower(country))
country | year | gdp_pc |
---|---|---|
usa | 1994 | 2306 |
china | 1994 | 7525 |
sudan | 1994 | 3702 |
usa | 1995 | 16697 |
china | 1995 | 4938 |
sudan | 1995 | 9435 |
usa | 1996 | 13227 |
china | 1996 | 17835 |
sudan | 1996 | 4718 |
mutate()
can be combined with two other useful functions: ifelse()
and revalue()
.
ifelse()¶
ifelse()
is a logical function that is useful for modifying variables in datasets (or individual vectors). Here’s an illustration of how it works:
[14]:
numbers <- 1:10
ifelse(numbers > 5, numbers * 10, numbers / 10)
- 0.1
- 0.2
- 0.3
- 0.4
- 0.5
- 60
- 70
- 80
- 90
- 100
The function takes three arguments: (1) a logical test, (2) what to do if the test is true, and (3) what to do if the test is false. Thus, in the code above numbers greater than 5 are multiplied by 10 and numbers less than or equal to 5 are divided by 10.
We can apply this to variables in data frames. Let’s use the world-small.csv
dataset to illustrate (download here).
[15]:
# Read world-small dataset
world <- read.csv("data/world-small.csv")
[16]:
head(world)
# Create a new variable equal to "democracy" if a country has
# a polity score >= 15 and "autocracy" otherwise
world <- world %>% mutate(democracy = ifelse(polityIV >= 15, "democracy", "autocracy"))
head(world)
# Or represent this information as a dummy variable instead
world <- world %>% mutate(democracy = ifelse(polityIV >= 15, 1, 0))
head(world)
country | region | gdppcap08 | polityIV |
---|---|---|---|
Albania | C&E Europe | 7715 | 17.8 |
Algeria | Africa | 8033 | 10.0 |
Angola | Africa | 5899 | 8.0 |
Argentina | S. America | 14333 | 18.0 |
Armenia | C&E Europe | 6070 | 15.0 |
Australia | Asia-Pacific | 35677 | 20.0 |
country | region | gdppcap08 | polityIV | democracy |
---|---|---|---|---|
Albania | C&E Europe | 7715 | 17.8 | democracy |
Algeria | Africa | 8033 | 10.0 | autocracy |
Angola | Africa | 5899 | 8.0 | autocracy |
Argentina | S. America | 14333 | 18.0 | democracy |
Armenia | C&E Europe | 6070 | 15.0 | democracy |
Australia | Asia-Pacific | 35677 | 20.0 | democracy |
country | region | gdppcap08 | polityIV | democracy |
---|---|---|---|---|
Albania | C&E Europe | 7715 | 17.8 | 1 |
Algeria | Africa | 8033 | 10.0 | 0 |
Angola | Africa | 5899 | 8.0 | 0 |
Argentina | S. America | 14333 | 18.0 | 1 |
Armenia | C&E Europe | 6070 | 15.0 | 1 |
Australia | Asia-Pacific | 35677 | 20.0 | 1 |
We first created a new variable called democracy
equal to “democracy” if a country has a Polity IV score of at least 15 and “autocracy” otherwise. We then overwrote this variable with a “dummy variable” representing the same information using zeroes and ones.
revalue()¶
We often want to recode the categories of a variable. For example, take a look at the region
variable in world
:
[17]:
table(world$region)
Africa Asia-Pacific C&E Europe Middle East N. America S. America
42 24 25 16 3 19
Scandinavia W. Europe
4 12
Currently this variable has eight categories. Say we wanted to simplify it a little bit by combining the three regions in Europe. We also want to rename the two regions in America. This can be accomplished with revalue()
from the plyr package.
[18]:
world$region <- revalue(world$region, c(
"C&E Europe" = "Europe",
"Scandinavia" = "Europe",
"W. Europe" = "Europe",
"N. America" = "North America",
"S. America" = "South America"))
table(world$region)
Africa Asia-Pacific Europe Middle East North America
42 24 41 16 3
South America
19
This can also be accomplished with ifelse()
, but we would have to nest several ifelse()
statements. For this reason I prefer revalue()
.
A note on factor variables¶
The read.csv()
function automatically converts variables that contain strings to the class “factor”. Note, for example, that both the country
and region
variables in the world
dataset are represented as factors.
[19]:
class(world$country)
class(world$region)
These variables are useful in regression models in R. For example, including region
in an OLS model (using command lm()
) would automatically represent all regions as dummy variables except one left as the reference category. If that’s not making sense to you at this stage, don’t worry at all.
But factor variables can also cause issues if you’re not careful. In particular, factor variables are sometimes displayed to the user as a certain set of numbers, yet are stored internally as an entirely different set of numbers. For example, say the gdppcap08
variable in world
had been read as a factor. (Fortunately, in reality it was read correctly as numeric, but this sometimes isn’t the case.)
[20]:
# Recode gdp per capita to factor
# NOTE: Only for illustrative purposes!
world <- world %>% mutate(gdppcap08 = as.factor(gdppcap08))
head(world)
class(world$gdppcap08)
country | region | gdppcap08 | polityIV | democracy |
---|---|---|---|---|
Albania | Europe | 7715 | 17.8 | 1 |
Algeria | Africa | 8033 | 10.0 | 0 |
Angola | Africa | 5899 | 8.0 | 0 |
Argentina | South America | 14333 | 18.0 | 1 |
Armenia | Europe | 6070 | 15.0 | 1 |
Australia | Asia-Pacific | 35677 | 20.0 | 1 |
Ok, the GDP variable is now a factor variable. Note, though, that it looks just like a numeric variable. You might therefore be tempted to try to do things like mean(world$gdppcap08)
, yet this would return NA
instead of an actual value as before. You might also be tempted to recode the GDP variable from factor to numeric using the following code:
[21]:
head(world %>% mutate(gdppcap08 = as.numeric(gdppcap08)))
country | region | gdppcap08 | polityIV | democracy |
---|---|---|---|---|
Albania | Europe | 76 | 17.8 | 1 |
Algeria | Africa | 79 | 10.0 | 0 |
Angola | Africa | 67 | 8.0 | 0 |
Argentina | South America | 99 | 18.0 | 1 |
Armenia | Europe | 69 | 15.0 | 1 |
Australia | Asia-Pacific | 132 | 20.0 | 1 |
Something has gone seriously wrong — look at the values of gdppcap08
. Here’s the learning lesson: to convert a factor variable to numeric, you have to recode it to character first. Use as.numeric(as.character(variable))
, where variable
is the factor variable to convert.
[22]:
world <- world %>% mutate(gdppcap08 = as.numeric(as.character(gdppcap08)))
head(world)
class(world$gdppcap08)
country | region | gdppcap08 | polityIV | democracy |
---|---|---|---|---|
Albania | Europe | 7715 | 17.8 | 1 |
Algeria | Africa | 8033 | 10.0 | 0 |
Angola | Africa | 5899 | 8.0 | 0 |
Argentina | South America | 14333 | 18.0 | 1 |
Armenia | Europe | 6070 | 15.0 | 1 |
Australia | Asia-Pacific | 35677 | 20.0 | 1 |
Chaining¶
What if you wanted to apply multiple functions to one data frame? This is where chaining is elegant and facilitates debugging.
More specifically, say we wanted to use countries
to create a new data frame called countries_new
, which should have observations from years 1995 and 1996 (dropping 1994), should be sorted by country name (in lower case), and should have a new variable equal to GDP per capita in 1000s.
Here’s how we could do this without chaining:
[23]:
countries_new <- filter(countries, year != 1994) #drop year 1994
countries_new <- arrange(countries_new, country) #sort by country names
countries_new <- mutate(countries_new, country = tolower(country), #convert name to lower-case
gdppc_1k = gdp_pc / 1000) #create GDP pc in 1000s
countries_new
country | year | gdp_pc | gdppc_1k |
---|---|---|---|
china | 1995 | 4938 | 4.938 |
china | 1996 | 17835 | 17.835 |
sudan | 1995 | 9435 | 9.435 |
sudan | 1996 | 4718 | 4.718 |
usa | 1995 | 16697 | 16.697 |
usa | 1996 | 13227 | 13.227 |
Here’s the same thing using chaining:
[24]:
countries_new <- countries %>%
filter(year != 1994) %>%
arrange(country) %>%
mutate(country = tolower(country), gdppc_1k = gdp_pc / 1000)
countries_new
country | year | gdp_pc | gdppc_1k |
---|---|---|---|
china | 1995 | 4938 | 4.938 |
china | 1996 | 17835 | 17.835 |
sudan | 1995 | 9435 | 9.435 |
sudan | 1996 | 4718 | 4.718 |
usa | 1995 | 16697 | 16.697 |
usa | 1996 | 13227 | 13.227 |
I find code that makes use of chaining more readable. Chaining always begins with specifying the data frame we want to operate on (e.g., countries
). Every subsequent statement will operate on this data frame, starting with the function that comes right after the data frame and working its way down. In our case, the first thing we’ll do to countries
is to subset it. We’ll then sort it by country name. Lastly, we’ll overwrite the country name to be lower-case and create a new variable
representing GDP per capita in 1000s.
Exercises¶
- Read the
world-small.csv
dataset into R and store it in an object calledworld
. - Subset
world
to European countries. Save this subset as a new data frame calledeurope
. - Add two variables to
europe
:- A variable that recodes
polityIV
from 0-20 to -10-10. - A variable that categorizes a country as “rich” or “poor” based on some cutoff of
gdppcap08
you think is reasonable.
- A variable that recodes
- Drop the
region
variable ineurope
(keep the rest). - Sort
europe
based on Polity IV. - Repeat Exercises 2-5 using chaining.
- What was the world’s mean GDP per capita in 2008? Polity IV score?
- What was Africa’s mean GDP per capita and Polity IV score?
- What was the poorest country in the world in 2008? Richest?
- How many countries in Europe are “rich” according to your coding? How many are poor? What percentage have Polity IV scores of at least 18?
[25]: