# Vectors

This tutorial introduces a key object in R: vectors. By the end of this section, you will learn how to create different types of vectors, subset them, modify them, and summarize them.

## What is a vector?

A *vector* is an ordered collection of values, all of the same type. For example, you can have a vector that contains the ages of everyone in your class (a numeric vector), or a vector of all the names of people in your family (a character vector).

Vectors are central to R because R was designed for statistics, and vectors are a natural way to collect many observations of a variable. In fact, one of the secrets of R is that *almost everything* is a vector. Surprise! All the variables you created before were actually vectors. They just happened to be vectors with only one entry (vectors of length 1), so you didn’t notice. You can see this if you ask for the `length` of the variable `nick_pet_dinosaurs` from the Introduction tutorials:

```
nick_pet_dinosaurs = 2
length(nick_pet_dinosaurs)
```

To make vectors that have more than one item, we use the `c` function (c stands for “combine”). Just pass it a set of values separated by commas, and you’ll get back a vector.

To illustrate, let me tell you about a week at the zoo I wish I owned. Here’s what attendance looked like at my zoo last week:

| Day of Week | Attendees |
|---|---|
| Monday | 132 people |
| Tuesday | 94 people |
| Wednesday | 112 people |
| Thursday | 84 people |
| Friday | 254 people |
| Saturday | 322 people |
| Sunday | 472 people |

Let’s make a vector for this attendance pattern:

```
attendance = c(132, 94, 112, 84, 254, 322, 472)
attendance
```

- 132
- 94
- 112
- 84
- 254
- 322
- 472

You may find that this looks a little different in your R session than on this site. By default, R prints vectors sideways, so in your console this vector may look more like `[1] 132  94 112  84 254 322 472`. These are just two ways of printing the same thing.

One of the nice things about vectors is that we can easily do things like multiply *all* the values by another number. For example, suppose tickets to my zoo cost $15 per person. How much money do ticket sales generate each day? Let’s find out!

```
revenue = attendance * 15
revenue
```

- 1980
- 1410
- 1680
- 1260
- 3810
- 4830
- 7080

Cool!

Now what if we want to know the total amount raised in a week, instead of just the amount on each day? We can use one of R’s many helper functions (in this case `sum`), which adds up all the values of a vector:

```
sum(revenue)
```

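Vectors come in several types. The three you will meet most often are numeric, character, and logical, and each can hold one value or many: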

```
# Numeric vectors
n1 <- 20
n2 <- c(20, 25, 60, 55)
# Character vectors
c1 <- "Blue"
c2 <- c("Red", "Green", "Purple")
# Logical vectors
l1 <- TRUE
l2 <- c(TRUE, FALSE, TRUE)
```
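Because every element of a vector must share one type, it helps to see what happens when types are mixed. The sketch below uses made-up object names (`mixed`, `flags`) purely for illustration: rather than raising an error, `c()` converts all values to a single common type.

```r
# Mixing numbers and text: every element becomes character
mixed <- c(1, "two", 3)
mixed
class(mixed)   # "character"

# Mixing logicals and numbers: TRUE becomes 1, FALSE becomes 0
flags <- c(TRUE, FALSE, 5)
flags
class(flags)   # "numeric"
```

If a numeric vector unexpectedly shows quotation marks around its values, a stray character element is a likely cause, and `class()` is a quick way to check.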

Note that vectors can consist of one or many *elements*. Three common ways to create vectors with more than one element are `c()`, `seq()`, and `rep()`.

### c()

As illustrated above, one very common way to create vectors with more than one element is to use `c()` (“combine”, sometimes read as “concatenate”), which simply combines whatever values you specify in the parentheses.

### seq()

`seq()` applies to numeric vectors only:

```
n1 <- seq(from = 0, to = 10, by = 2) #using 'by'
n1
n2 <- seq(from = 0, to = 10, length.out = 5) #using 'length.out'
n2
n3 <- seq(1, 2, 0.1) #no argument names specified; automatically uses 'from', 'to', 'by'
n3
n4 <- 1:5 #shortcut for integer sequence; same values as 'seq(1, 5, 1)'
n4
```

`seq()` is typically called with three arguments: a starting value (`from`), an end value (`to`), and an increment (`by`). Instead of `by`, you can supply `length.out` to say how many elements the result should have. Integer sequences can be created using a colon.

### rep()

Vectors can also be created using `rep()`. As the name implies, this function is useful if you want to *repeat* an element or elements.

```
rep(1, 5)
rep("blue", 3)
rep(TRUE, 4)
```

The first argument specifies the element to repeat, and the second specifies the number of times to repeat it.
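When the first argument has more than one element, it matters *how* the repetition is done. As a small sketch (the vector `colors` is invented for this example), base R's `rep()` accepts a `times` argument to repeat the whole vector and an `each` argument to repeat every element in turn:

```r
colors <- c("blue", "red")   # hypothetical example vector

# Repeat the whole vector twice
whole <- rep(colors, times = 2)
whole      # "blue" "red" "blue" "red"

# Repeat each element twice before moving on to the next
by_each <- rep(colors, each = 2)
by_each    # "blue" "blue" "red" "red"

# Give a separate repeat count for each element
uneven <- rep(colors, times = c(3, 1))
uneven     # "blue" "blue" "blue" "red"
```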

### Using more than one function

Perhaps the most powerful use of these functions comes from combining them. Here are two examples:

```
rep(c("blue", "red"), 3)
c(rep(seq(0, 6, 2), 2), 4:1)
```

The second example is somewhat hard to follow, and is probably at the limit of complexity in terms of how many functions we want to combine. Separating a task into multiple lines of code can help.

```
s <- rep(seq(0, 6, 2), 2)
c(s, 4:1)
```

## Subsetting vectors

Extracting a subset of elements from a vector is an extremely important task, not least because it generalizes nicely to datasets (which are at the heart of data science). This process — whether applied to a vector or a dataset — is often referred to as “taking a subset”, “subsetting”, or “filtering”. If there is one skill you need to master as quickly as possible, it’s this.

In R, there are three ways to filter a vector: using a separate logical vector, using indexing, and using names. I tend to use the first method most, but all three are useful.

### Subsetting with logicals

Let’s jump right into an example. Say we have a character vector with only two elements (“apple” and “banana”). Subsetting it to “apple” could be done like so:

```
fruits <- c("apple", "banana")
fruits[c(TRUE, FALSE)]
```

Note the use of brackets, `[]`, which is common when filtering. Within these brackets is a logical vector with the same number of elements as the vector you want to subset. Elements across the two vectors are matched by position: elements that match with `TRUE` are kept, while elements that match with `FALSE` are dropped.

This process is extremely useful when combined with a *logical operation*. Please familiarize yourself with the logical operations listed here. For example, using a logical operation we can filter a large vector of oranges, apples and bananas:

```
# Create a vector with 30 fruits
fruits <- rep(c("orange", "apple", "banana"), 10)
fruits
# Create a logical vector for dropping bananas
# Note: I'm creating the exact same logical vector three times (overwriting it each time)
# This is for illustrative purposes; using one of these is sufficient
lv <- fruits == "orange" | fruits == "apple"
lv <- fruits != "banana"
lv <- fruits %in% c("orange", "apple")
lv
# Carry out the subset
fruits[lv]
```

We applied the same logic as above: we have a vector (`fruits`) that we want to subset. We do so using a logical vector (`lv`), where elements that match with `TRUE` are kept. The only difference here is that we create the logical vector with a logical operation. The logical operators (e.g., `!=`, `|`) used here are discussed in the link above, with the exception of `%in%`.

`%in%` checks, for each element on its left, whether that element appears anywhere in the vector on its right. It is especially handy when there are many values to keep: instead of chaining one `==` comparison per value with `|`, you can use the `%in%` option that accomplishes the exact same thing: `lv <- fruits %in% c("orange", "apple", "mango", "mandarin", "kiwi")`.
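To check that `%in%` and a chain of `==` comparisons really agree, here is a minimal sketch. The vectors `basket` and `wanted` are made up for this illustration and are not part of the examples above.

```r
basket <- c("kiwi", "banana", "apple", "mango")   # hypothetical example data
wanted <- c("orange", "apple", "mango", "mandarin", "kiwi")

# Chained comparisons: one == per value we want to keep
lv_chained <- basket == "orange" | basket == "apple" | basket == "mango" |
  basket == "mandarin" | basket == "kiwi"

# The same test written with %in%
lv_in <- basket %in% wanted

lv_in                          # TRUE FALSE TRUE TRUE
identical(lv_chained, lv_in)   # TRUE
basket[lv_in]                  # "kiwi" "apple" "mango"
```

The `%in%` version stays the same length no matter how many values are in `wanted`, which is why it scales better than chaining.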

Of course, subsetting using logicals can also be done on numeric vectors. Here are a few examples:

```
# Create a numeric vector
numbers <- seq(0, 100, by = 10)
numbers
# Illustrate three different filters
numbers[numbers <= 50 & numbers != 30]
numbers[numbers == 0 | numbers == 100]
numbers[numbers > 100] #returns an empty vector
```

Note that I didn’t create logical objects to carry out the subsets here, as opposed to above where we explicitly defined `lv`. I find it more compact and intuitive to take subsets without first creating a logical vector.
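Logical vectors are useful for more than subsetting. As a hedged aside, the sketch below reuses the `numbers` vector from the example above with three base R helpers: `sum()` counts the `TRUE` values (because `TRUE` is treated as 1), `which()` reports their positions, and `any()` tells you whether there is at least one.

```r
numbers <- seq(0, 100, by = 10)

# How many elements are greater than 50?
n_big <- sum(numbers > 50)
n_big                 # 5

# At which positions do they occur?
pos_big <- which(numbers > 50)
pos_big               # 7 8 9 10 11

# Is any element greater than 100?
any(numbers > 100)    # FALSE
```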

### Subsetting using indexing

A different way to subset a vector is to specify the index or indices you want to keep, again using brackets. Here are a few examples:

```
fruits <- c("apple", "banana")
fruits[1]
fruits <- rep(c("orange", "apple", "banana"), 10)
fruits[c(10, 20)]
fruits[seq(1, 30, by = 5)]
```

I sometimes use this when I want to inspect or modify an element that I know occurs at a specific index in the vector, a more manual approach than using logical statements.
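Indexing can also say which elements to *drop*. In R, a negative index removes the element at that position instead of keeping it. A short sketch, using a small made-up vector so the result is easy to read:

```r
fruits <- c("orange", "apple", "banana", "kiwi")   # hypothetical example data

# Drop the first element
no_first <- fruits[-1]
no_first      # "apple" "banana" "kiwi"

# Drop the first two elements
no_first_two <- fruits[-c(1, 2)]
no_first_two  # "banana" "kiwi"

# Drop the last element, whatever the length of the vector
no_last <- fruits[-length(fruits)]
no_last       # "orange" "apple" "banana"
```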

Subsetting using indexing can also be used in *random sampling*, which has many important applications, for example in experiments and when you want to test-run code on a representative subset of your data. So, let’s introduce the `sample()` function:

```
# Draw 10 elements at random from 1 to 100
sample(1:100, size = 10)
```

The function takes a vector of values (often successive integer values) and an argument that specifies how many values to draw at random from this vector. We can use the resulting values as indices to subset another vector:

```
fruits <- rep(c("orange", "apple", "banana"), 10)
fruits[sample(1:30, size = 5)]
```

Here, we’re drawing a random sample of five elements from the vector `fruits`. Why did I specify `1:30`? Well, `fruits` consists of 30 elements, so specifying something like `1:100` likely would have resulted in sampled values outside the bounds of the vector (e.g., `fruits[35]` doesn’t exist, so R returns `NA` for it). Specifying `1:30` gives every element in `fruits` an equal chance of being included in the sample.
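Two details are worth seeing in action. First, indexing past the end of a vector does not raise an error; it quietly returns `NA`. Second, because `sample()` is random, your results will differ from run to run unless you fix the random seed with `set.seed()`. The seed value 42 below is an arbitrary choice for this sketch.

```r
fruits <- rep(c("orange", "apple", "banana"), 10)

# Index 35 is beyond the 30 elements, so the result is NA
out_of_bounds <- fruits[35]
out_of_bounds

# Fixing the seed makes a random draw repeatable
set.seed(42)                      # arbitrary seed, assumed for illustration
idx <- sample(1:30, size = 5)
set.seed(42)
idx_again <- sample(1:30, size = 5)
identical(idx, idx_again)         # TRUE

fruits[idx]
```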

### Subsetting using names

Lastly, we can assign names to each element in a vector and take a subset based on the names.

```
age <- c(50, 55, 80)
names(age) <- c("mom", "dad", "grandpa")
age #note that values now have names
age[c("dad", "grandpa")] #subset
```

That is, we have a vector representing the age of three family members. We assign names to each value, and then keep the values associated with two of the family members.

## Modifying vectors

The subsetting logic from above can be used to modify vectors. The idea here is that instead of keeping elements that meet a logical condition or occur at a specific index, we can change them. For example, what if we had mis-entered grandpa’s age above? We can fix it using indexing, a logical statement, or naming.

```
# Recreate vector with age values from above
age <- c(50, 55, 80)
names(age) <- c("mom", "dad", "grandpa")
# Three ways of changing grandpa's age
# Note: you'd only need to use one of these
age[age == 80] <- 82 #using a logical statement
age[3] <- 82 #using indexing
age["grandpa"] <- 82 #using naming
age
```

A logical statement is most efficient when we need to change a lot of elements.

```
fruits <- rep(c("orange", "apple", "bamama"), 5)
fruits #bamamas anyone?
# Let's fix the misspelled element
fruits[fruits == "bamama"] <- "banana"
fruits
```

## Vector arithmetic

We can modify or create new numeric vectors using arithmetic operations. Three common types of operations involve:

- A vector with more than one element and a vector with only one element.
- Two vectors with the same number of elements. Elements are matched based on index.
- A vector modified by a function.

In all cases, we can modify all elements of a vector or only a subset of elements using the bracket notation we learned above.

```
numbers <- 1:10
numbers
# One value modifying all values in a vector
numbers <- numbers / 10
numbers
# One value modifying a subset of a vector
numbers[numbers > 0.5] <- numbers[numbers > 0.5] * 100
numbers
# Two vectors with the same number of elements
numbers1 <- 1:10
numbers2 <- 10:1
numbers3 <- numbers2 - numbers1
numbers3
# Replacing a subset of a vector using another vector
numbers <- 1:10
numbers[numbers > 5] <- 5:1
numbers
# Modify a vector (or a subset of a vector) using a function
numbers <- 1:10
sqrt(numbers) #square root
exp(numbers) #exponentiate
log(numbers[c(1, 5, 10)]) #natural log
```

Vector arithmetic can also be carried out on two multi-element vectors with different numbers of elements. Such operations use R’s recycling rule: the shorter vector is repeated until it matches the length of the longer one.
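A minimal sketch of recycling, with made-up vectors `long_v` and `short_v`. When the longer length is a multiple of the shorter one, R recycles silently; when it is not, R still returns a result but issues a warning.

```r
long_v  <- 1:6
short_v <- c(10, 100)

# short_v is reused as 10, 100, 10, 100, 10, 100
recycled <- long_v * short_v
recycled    # 10 200 30 400 50 600

# Lengths 5 and 2 do not divide evenly: R recycles anyway but warns
uneven <- 1:5 + c(10, 100)
uneven      # 11 102 13 104 15
```

The single-value case from earlier (`numbers / 10`) is the same rule at work: the length-1 vector `10` is recycled across every element.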

## Summarizing vectors

We often want to get summary statistics from a vector — that is, learn something general about it by looking beyond its constituent elements. If we have a vector in which each element represents a person’s height, we may want to know who the shortest or tallest person is, what the median or mean height is, what the standard deviation is, etc. Here are common summary facts for vectors:

```
numbers <- sample(1:1000, 10)
numbers
class(numbers) #check the class
length(numbers) #number of elements
max(numbers) #maximum value
min(numbers) #minimum value
sum(numbers) #sum of all values in the vector
mean(numbers) #mean
median(numbers) #median
var(numbers) #variance
sd(numbers) #standard deviation
quantile(numbers) #percentiles in intervals of .25
quantile(numbers, probs = seq(0, 1, 0.1)) #percentiles in intervals of 0.1
summary(numbers) #function that contains many summary stats from above
```

If you forget one of these functions, or if you need one I haven’t included here, a quick web search will almost surely turn up the answer. Also note that some of the operations above (most notably `class()` and `length()`) apply to non-numeric vectors.
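As an illustration of summarizing a non-numeric vector, the sketch below applies `class()` and `length()` to the `fruits` vector from earlier, and adds two further base R helpers that were not covered above: `unique()` for the distinct values and `table()` for how often each one occurs.

```r
fruits <- rep(c("orange", "apple", "banana"), 10)

class(fruits)     # "character"
length(fruits)    # 30

# Distinct values, in order of first appearance
distinct <- unique(fruits)
distinct          # "orange" "apple" "banana"

# Frequency of each value (10 of each here)
counts <- table(fruits)
counts
```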

## Code generalization

We want our code to be as general as possible so that it can be reapplied to a different coding task or if the data change. Commands that summarize vectors can be useful to accomplish this.

Remember above when we found a random sample of `fruits`? Here is more or less the code we used:

```
fruits <- rep(c("orange", "apple", "banana"), 10)
length(fruits)
random_sample <- fruits[sample(1:30, size = 5)]
random_sample
```

The third line, where we create `random_sample`, is not very general. Why? In this case, `fruits` has 30 elements. What if it instead had 50 elements? *Then the third line would not give us a random sample.* Or more precisely, this line would give us a random sample of the first 30 elements of `fruits`; the last 20 elements would not have a chance of being included. We could modify the third line to read `random_sample <- fruits[sample(1:50, size = 5)]`. But if we then modified `fruits` to have a different number of elements again, we’d end up with the same problem.

Here’s the solution: find the number of elements of `fruits` using `length()` and then pass this as an argument to the `sample()` function.

```
fruits <- rep(c("orange", "apple", "banana"), 100)
n <- length(fruits) #store the result of length() in an object
n
random_sample <- fruits[sample(1:n, size = 5)] #now use 'n' in the sample() function
random_sample
# Or we could have used length() directly in the sample() function
# Note: Accomplishes the same thing as first creating 'n'
random_sample <- fruits[sample(1:length(fruits), size = 5)]
```
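A slightly more defensive variant is sketched below. It relies on two base R behaviors: `sample()` can draw directly from a vector of values, and `seq_along(x)` yields the indices of `x` while also behaving sensibly for an empty vector (where `1:length(x)` would produce `c(1, 0)`). The object name `empty` is made up for this illustration.

```r
fruits <- rep(c("orange", "apple", "banana"), 100)

# Draw five values straight from the vector, no index arithmetic needed
direct_sample <- sample(fruits, size = 5)
direct_sample

# Or keep the index-based style, using seq_along() instead of 1:length()
index_sample <- fruits[sample(seq_along(fruits), size = 5)]
index_sample

# Why seq_along() is safer: compare the two on an empty vector
empty <- character(0)
1:length(empty)     # 1 0  (counts down, which is rarely what you want)
seq_along(empty)    # integer(0)
```

One caveat with drawing directly: if the vector passed to `sample()` is a single number such as `10`, R treats it as shorthand for `1:10`, so the direct form is best suited to vectors you know have more than one element.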

## Exercises

Create a vector that represents the age of at least four different family members or friends. You can name it whatever you want.

- What is the mean age of the people in your vector? Find out in two ways, with and without using the `mean()` command.
- How old is the youngest person in your vector? (Use an R command to find out.)
- What is the age gap between the youngest person and the oldest person in your vector? (Again use R to find out, and try to be as general as possible in the sense that your code should work even if the elements in your vector, or their order, change.)
- How many people in your vector are above age 25? (Again, try to make your code work even in the case that your vector changes.)
- Replace the age of the oldest person in your vector with the age of someone else you know.
- Create a new vector that indicates how old each person in your vector will be in 10 years.
- Create a new vector that indicates what year each person in your vector will turn 100 years old.
- Create a new vector with a random sample of 3 individuals from your original vector. What is the mean age of the people in this new vector?
