About This Site

For Whom Is This Course Designed?

The goal of this course is to teach a student enough about R to do serious data analysis. However, there are many “flavors” of data science these days, so it’s worth emphasizing the type of data science this course is most designed to support.

This course is designed to serve students who are interested in analyzing data to answer specific questions. For example, it is well suited for social scientists who wish to do statistical analyses of political or economic data, or people advising policymakers on the likely costs of different courses of action. In this kind of data science, analyses of data are usually “one-offs” – the goal is to analyze a dataset to answer a specific question, not build software that will continuously analyze data on an ongoing basis.

It is not well suited for people who wish to design programs that, as a part of their operation, analyze data. So if your interest is in writing an iOS app that analyzes user data to make restaurant suggestions, this is probably not the best resource. That kind of data science is closer to “software engineering” (the writing of software programs you plan to distribute to users), for which you will probably find better training in a traditional computer science curriculum.

Assumed Knowledge

This site is explicitly designed to require zero prior knowledge. People forget how unintuitive programming is when they start, and most tutorials accidentally (implicitly) assume understanding of things like variables, variable assignment, or functions. I’ve done my best to avoid that here, and if you find a place I screwed up, please reach out and let me know so we can improve it!

How should I take this course?

Each topic here is presented in two parts: a tutorial that provides instructional materials about the topic in question, and then a set of exercises to allow you to practice the skills you are learning.

These tutorials will not generally be very long, and we strongly recommend that while you read through them you do so with an open programming session so you can just play around a little, trying out the things you learn.

But the most important thing is for readers to do the exercises at the end of each tutorial. The research on learning to program is exceedingly clear on this point: the only way to learn to program is to actually program, so the more time you spend playing with the tools we are using, making mistakes, and troubleshooting, the more you will learn.

Why R?

Why R? Because it’s currently one of the two most-used languages in data science (the other being Python), which means there is a good chance you’ll be called upon to use it when working in teams. Moreover, it’s a much easier tool to get started with than a language like Python.

It is worth emphasizing that we’re not learning R because it is necessarily “the best” language. There are many tools for statistical programming (R, Stata, SPSS, Python, Julia, Matlab, and others), and each has its own strengths and weaknesses. People often develop strong opinions about which language is best, and sometimes pass judgement on people who use other languages. But what is “best” depends on your use-case (the types of things you are using the language to do). This is true not only because the languages themselves differ, but also because the tools and packages that have been created for each language differ (e.g. people just haven’t made a good package for doing geo-spatial work in Julia yet). And if you’re working on teams, you’ll also have to make decisions based on the backgrounds and tool sets of your teammates. All of which is to say: there is no single best language for all purposes. But R is a very popular, strong, general-purpose language, so it will serve as a great starting point.

As a result, over the course of your career you may find yourself gravitating to one tool or another as required by your research. But in providing you with a firm foundation in a very popular language like R, we feel confident that we will not only be providing you with tools that will allow you to do most everything you’ll want to do in graduate school, but we will also be providing you with generalizable skills around data manipulation that you will find useful if you later change platforms.

To learn more about the differences between common programming languages, you can find a summary here.

Where did this course come from?

These tutorials form the backbone of Programming for Data Science, a course taught as part of the Master in Interdisciplinary Data Science (MIDS) at Duke University by me, Nick Eubank. They are built off of the amazing tutorials of Simon Ejdemyr.

Want to learn more?

These tutorials are from the first class in a two-course sequence. If you want to learn more, check out Programming For Data Science, which provides a more in-depth introduction to computer science concepts relevant for data science.