Vectors

This tutorial introduces a key object in R: vectors. By the end of this section, you will learn how to create different types of vectors, subset them, modify them, and summarize them.

What is a vector?

A vector is an ordered collection of values, all of the same type. For example, you can have a vector that contains the ages of everyone in your class (a numeric vector), or a vector of all the names of people in your family (a character vector).

Vectors are central to R because R was designed for statistics, and vectors are a perfect way to collect lots of different observations of a variable. In fact, one of the secrets of R is that almost everything is a vector. Surprise! All the variable you created before were actually vectors – they just happened to be vectors with only one entry (vectors of length 1) so you didn’t notice. But you can see if you ask for the length of the variable nick_pet_dinosaurs from the Introduction tutorials:

[1]:
nick_pet_dinosaurs = 2
length(nick_pet_dinosaurs)
1

To make vectors that have more than one item, we use the c function (c stands for “combine”). Just pass it a set of values seperated by commas, and you’ll get back a vector.

To illustrate, let me tell you about a week at the zoo I wish I owned. Here’s what attendance looked like at my zoo last week:

Day of Week Attendees
Monday 132 people
Tuesday 94 people
Wednesday 112 people
Thursday 84 people
Friday 254 people
Saturday 322 people
Sunday 472 people

Let’s make a vector for this attendance pattern:

[3]:
attendance = c(132, 94, 112, 84, 254, 322, 472)
attendance
  1. 132
  2. 94
  3. 112
  4. 84
  5. 254
  6. 322
  7. 472

You may find that this looks a little different in your R session than on this site. By default, R likes to print out vectors sideways, so that this vector may look more like this:

sideways_attendance

But these are just two ways of printing the same thing.

One of the nice things about vectors is that we can easily do things like multiple all the values by another number easily. For example, suppose tickets to my zoo cost $15 per person. What is the total money generated by ticket sales each day? Let’s find out!

[5]:
revenue = attendance * 15
revenue
  1. 1980
  2. 1410
  3. 1680
  4. 1260
  5. 3810
  6. 4830
  7. 7080

Cool!

Now what if we want to know to the total amount raised in a week, instead of just the amount on each day? We can use one of R’s many helper functions – in this case sum – which adds up all the values of a vector

[6]:
sum(revenue)
22050
[ ]:
# Numeric vectors
n1 <- 20
n2 <- c(20, 25, 60, 55)

# Character vectors
c1 <- "Blue"
c2 <- c("Red", "Green", "Purple")

# Logical vectors
l1 <- TRUE
l2 <- c(TRUE, FALSE, TRUE)

Note that vectors can consist of one or many elements. Three common ways to create vectors with more than one element is to use c(), seq(), or rep().

c()

As illustrated above, one very common way to create vectors with more than one element is to use c() (“concatenate”), which simply combines whatever values you specify in the parentheses.

seq()

seq() applies to numeric vectors only:

[ ]:
n1 <- seq(from = 0, to = 10, by = 2)   #using 'by'
n1
n2 <- seq(from = 0, to = 10, length.out = 5)   #using 'length.out'
n2
n3 <- seq(1, 2, 0.1)   #no argument names specified; automatically uses 'from', 'to', 'by'
n3
n4 <- 1:5    #shortcut for integer sequence; same as 'seq(1, 5, 1)'
n4

seq() by default takes three parameters: starting value, end value, and a value that specifies how elements will be incremented (“by”), which can be substituted with “length.out”. Integer sequences can be created using a colon.

General note about argument names: A function’s argument names need not be specified, as illustrated when we created ``n3``. If they are not specified, R uses arguments based on a default order. One way to learn about this order is to use ``?`` (e.g., ``?seq``). If you specify the argument names, the order doesn’t matter. Putting all this together, hopefully it’s obvious why ``seq(by = -1, from = 10, to = 2)`` is the same as ``seq(10, 2, -1)``.

rep()

Vectors can also be created using rep(). As the name implies, this function is useful if you want to repeat an element or elements.

[ ]:
rep(1, 5)
rep("blue", 3)
rep(TRUE, 4)

As should be obvious, the first parameter in the function specifies the element to repeat, and the second the number of times to repeat it.

Using more than one function

Perhaps the most powerful use of these functions comes from combining them. Here are a two examples:

[ ]:
rep(c("blue", "red"), 3)
c(rep(seq(0, 6, 2), 2), 4:1)

The second example is somewhat hard to follow, and is probably at the limit of complexity in terms of how many functions we want to combine. Separating a task into multiple lines of codes can help.

[ ]:
s <- rep(seq(0, 6, 2), 2)
c(s, 4:1)

Subsetting vectors

Extracting a subset of elements from a vector is an extremely important task, not least because it generalizes nicely to datasets (which are at the heart of data science). This process — whether applied to a vector or a dataset — is often referred to as “taking a subset”, “subsetting”, or “filtering”. If there is one skill you need to master as quickly as possible, it’s this.

In R, there are three ways to filter a vector: using a separate logical vector, using indexing, and using names. I tend to use the first method most, but all three are useful.

Subsetting with logicals

Let’s jump right into an example. Say we have a character vector with only two elements (“apple” and “banana”). Subsetting it to “apple” could be done like so:

[ ]:
fruits <- c("apple", "banana")
fruits[c(TRUE, FALSE)]

Note the use of brackets, [] — this is common when filtering. Within these brackets is a vector with the same number of logical elements as there are elements in the vector you want to subset. Elements across the two vectors are matched by order: elements that match with TRUE are kept while elements that match with FALSE are dropped.

This process is extremely useful when combined with a logical operation. Please familiarize yourself with the logical operations listed here. For example, using a logical operation we can filter a large vector of oranges, apples and bananas:

[ ]:
# Create a vector with 30 fruits
fruits <- rep(c("orange", "apple", "banana"), 10)
fruits

# Create a logical vector for dropping bananas
# Note: I'm creating the exact same logical vector three times (overriding it each time)
# This is for illustrative purposes; using one of these is sufficient
lv <- fruits == "orange" | fruits == "apple"
lv <- fruits != "banana"
lv <- fruits %in% c("orange", "apple")
lv

# Carry out the subset
fruits[lv]

We applied the same logic as above: We have a vector (fruits) that we want to subset. We do so using a logical vector (lv), where elements that match with TRUE are kept. The only difference here is that we create the logical vector with a logical operation. The logical operators (e.g., !=, |) used here are discussed in the link above, with the exception of %in%.

General note about ``%in%``: This operator is extremely useful as an alternative for repeated “or” (``|``) statements. For example, say you have a vector with 10 types of fruits and you want to keep elements that are equal to “orange”, “apple”, “mango”, “mandarin”, or “kiwi”. You could accomplish this by creating a logical vector like so: ``lv <- fruits == "orange" | fruits == "apple" | fruits == "mango" | fruits == "mandarin" | fruits == "kiwi"``. What a nighmarishly long statement compared to

the %in% option that accomplishes the exact same thing: lv <- fruits %in% c("orange", "apple", "mango", "mandarin", "kiwi").

Of course, subsetting using logicals can also be done on numeric vectors. Here are a few examples:

[ ]:
# Create a numeric vector
numbers <- seq(0, 100, by = 10)
numbers

# Illustrate three different filters
numbers[numbers <= 50 & numbers != 30]
numbers[numbers == 0 | numbers == 100]
numbers[numbers > 100] #returns an empty vector

Note that I didn’t create logical objects to carry out the subsets here, as opposed to above where we explicitly defined lv. I find it more compact and intuitive to take subsets without first creating a logical vector.

Subsetting using indexing

A different way to subset a vector is to specify the index or indeces you want to keep, again using brackets. Here are a few examples:

[ ]:
fruits <- c("apple", "banana")
fruits[1]
fruits <- rep(c("orange", "apple", "banana"), 10)
fruits[c(10, 20)]
fruits[seq(1, 30, by = 5)]

I sometimes use this when I want to inspect or modify an element that I know occurs at a specific index in the vector, a more manual approach than using logical statements.

Subsetting using indexing can also be used in random sampling, which has many important applications — for example, in experiments and when you want to test-run code on a representative subset of your data. So, let’s introduce the sample() function:

[ ]:
# Draw 10 elements at random from 1 to 100
sample(1:100, size = 10)

The function takes a vector of values (often successive integer values) and an argument that specifies how many values to draw at random from this vector. We can use the resulting values as indeces to subset another vector:

[ ]:
fruits <- rep(c("orange", "apple", "banana"), 10)
fruits[sample(1:30, size = 5)]

Here, we’re drawing a random sample of five elements from the vector fruits. Why did I specify 1:30? Well, fruits consists of 30 elements, so specifying something like 1:100 likely would have resulted in sampled values outside the bounds of the vector (e.g., fruits[35] doesn’t exist). Specifying 1:30 gives every element in fruits an equal chance of being included in the sample.

Subsetting using names

Lastly, we can assign names to each element in a vector and take a subset based on the names.

[ ]:
age <- c(50, 55, 80)
names(age) <- c("mom", "dad", "grandpa")
age #note that values now have names
age[c("dad", "grandpa")] #subset

That is, we have a vector representing the age of three family members. We assign names to each value, and then keep the values associated with two of the family members.

Modifying vectors

The subsetting logic from above can be used to modify vectors. The idea here is that instead of keeping elements that meet a logical condition or occur at a specific index, we can change them. For example, what if we had mis-entered grandpa’s age above? We can fix it using indexing, a logical statement, or naming.

[ ]:
# Recreate vector with age values from above
age <- c(50, 55, 80)
names(age) <- c("mom", "dad", "grandpa")

# Three ways of changing grandpa's age
# Note: you'd only need to use one of these
age[age == 80] <- 82 #using a logical statement
age[3] <- 82 #using indexing
age["grandpa"] <- 82 #using naming
age

A logical statement is most efficient when we need to change a lot of elements.

[ ]:
fruits <- rep(c("orange", "apple", "bamama"), 5)
fruits #bamamas anyone?

# Let's fix the misspelled element
fruits[fruits == "bamama"] <- "banana"
fruits

Vector arithmetics

We can modify or create new numeric vectors using arithmetic operations. Three common types of operations involve:

  1. A vector with more than one element and a vector with only one element.
  2. Two vectors with the same number of elements. Elements are matched based on index.
  3. A vector modified by a function.

In all cases, we can modify all elements of a vector or only a subset of elements using the bracket notation we learned above.

[ ]:
numbers <- 1:10
numbers

# One value modifying all values in a vector
numbers <- numbers / 10
numbers

# One value modifying a subset of a vector
numbers[numbers > 0.5] <- numbers[numbers > 0.5] * 100
numbers

# Two vectors with the same number of elements
numbers1 <- 1:10
numbers2 <- 10:1
numbers3 <- numbers2 - numbers1
numbers3

# Replacing a subset of a vector using another vector
numbers <- 1:10
numbers[numbers > 5] <- 5:1
numbers

# Modify a vector (or a subset of a vector) using a function
numbers <- 1:10
sqrt(numbers) #square root
exp(numbers) #exponentiate
log(numbers[c(1, 5, 10)]) #natural log

Vector arithmetics can also be carried out in R on two multi-value vectors with different number of elements. Such operations use the recycling rule.

Summarizing vectors

We often want to get summary statistics from a vector — that is, learn something general about it by looking beyond its constituent elements. If we have a vector in which each element represents a person’s height, we may want to know who the shortest or tallest person is, what the median or mean height is, what the standard deviation is, etc. Here are common summary facts for vectors:

[ ]:
numbers <- sample(1:1000, 10)
numbers

class(numbers) #check the class
length(numbers) #number of elements
max(numbers) #maximum value
min(numbers) #minimum value
sum(numbers) #sum of all values in the vector
mean(numbers) #mean
median(numbers) #median
var(numbers) #variance
sd(numbers) #standard deviation
quantile(numbers) #percentiles in intervals of .25
quantile(numbers, probs = seq(0, 1, 0.1)) #percentiles in invervals of 0.1
summary(numbers) #function that contains many summary stats from above

If you forget one of these functions or if I haven’t included one here that you need, google almost surely has the answer for you. Also note that some of the operations above — most notably class() and length() — apply to non-numeric vectors.

Code generalization

We want our code to be as general as possible so that it can be reapplied to a different coding task or if the data change. Commands that summarize vectors can be useful to accomplish this.

Remember above when we found a random sample of fruits? Here is more or less the code we used:

[ ]:
fruits <- rep(c("orange", "apple", "banana"), 10)
length(fruits)
random_sample <- fruits[sample(1:30, size = 5)]
random_sample

The third line, where we create random_sample, is not very general. Why? In this case, fruits has 30 elements. What if it instead had 50 elements? Then the third line would not give us a random sample. Or more precisely, this line would give us a random sample of the 30 first elements of fruits — the last 20 elements would not have a chance of being included. We could modify the third line to read random_sample <- fruits[sample(1:50, size = 5)]. But if we then modified fruits to have a different number of elements again we’d end up with the same problem.

Here’s the solution: find the number of elements of fruits using length() and then input this as an argument in the sample() function.

[ ]:
fruits <- rep(c("orange", "apple", "banana"), 100)
n <- length(fruits) #store the result of length() in an object
n

random_sample <- fruits[sample(1:n, size = 5)] #now use 'n' in the sample() function
random_sample

# Or we could have used length() directly in the sample() function
# Note: Accomplishes the same thing as first creating 'n'
random_sample <- fruits[sample(1:length(fruits), size = 5)]

Exercises

Create a vector that represents the age of at least four different family members or friends. You can name it whatever you want.

  1. What is the mean age of the people in your vector? Find out in two ways, with and without using the mean() command.
  2. How old is the youngest person in your vector? (Use an R command to find out.)
  3. What is the age gap between the youngest person and the oldest person in your vector? (Again use R to find out, and try to be as general as possible in the sense that your code should work even if the elements in your vector, or their order, change.)
  4. How many people in your vector are above age 25? (Again, try to make your code work even in the case that your vector changes.)
  5. Replace the age of the oldest person in your vector with the age of someone else you know.
  6. Create a new vector that indicates how old each person in your vector will be in 10 years.
  7. Create a new vector that indicates what year each person in your vector will turn 100 years old.
  8. Create a new vector with a random sample of 3 individuals from your original vector. What is the mean age of the people in this new vector?
[ ]: