Overview

This page contains links to a number of web resources, tutorials, and slides related to a Data Science and High Performance Computing tutorial originally presented at ICPSR Session II as part of the Blalock Lecture series on July 27-28 & 31, 2015. You can email me at mzd5530@psu.edu or matthewjdenny@gmail.com if you have any questions. There are lots of related materials available on my website at http://www.mjdenny.com/teaching.html if you are interested in pushing beyond the scope of this tutorial. This tutorial draws many examples and materials from an earlier 20-hour (3 day) tutorial presented at the University of Massachusetts Institute for Social Science Research -- Summer Methodology Summit 2015. Materials for this workshop are available on GitHub.

Schedule

This is a draft outline of the workshop schedule. It will likely change over the course of the workshop depending on how fast we go.

Background

Please check out the following resources outside of the lectures as they provide a nice background in R programming:

  • I have prepared an online R tutorial that will complement what we go over during the lectures, so please check it out. This tutorial assumes some basic experience with R (like how to install it and enter input) and is available here.
  • If you are looking for a more basic introduction to R, this section of Quick-R gives an overview of the R interface. You can navigate between pages by clicking on the links on the top left.

We will be going over some R programming on the first night, but the more time you spend with this material, the more you will get out of the tutorial. Nobody learns to write great code overnight, but as Hadley Wickham says: "The only way to write good code is to write tons of shitty code first. Feeling shame about bad code stops you from getting to good code". Practice makes perfect!

Data Science Introduction -- Monday 7/27/15

This lecture introduces Data Science broadly, then covers some foundational programming concepts in R before diving into a host of related technologies, including remote access and version control.

  • Many of the tools I discuss in this lecture are covered in greater detail in the Data Science Tools tutorial, available here.
  • Slides for this lecture are available here.
  • Example R code from slides for this lecture is available here.
  • Unix for Poets is a great introduction to shell scripting and is available here.

HPC and Big Data Analytics -- Tuesday 7/28/15

This lecture focuses specifically on High Performance Computing (HPC) and big data analytics. It introduces a number of approaches to HPC using parallelization in R, as well as memory-efficient regression for large datasets. I will then provide a big data example of preprocessing legislative texts before discussing some of the hardware choices available for HPC and big data.

  • Those interested in the legislative texts example may also want to look at some of the tools for text processing available in R. A short tutorial on string processing in R is available here.
  • Slides for this lecture are available here.
  • Example R code from slides for this lecture is available here.

The Bigger Picture -- Friday 7/31/15

This lecture will cover a number of more advanced topics and tools related to Data Science and HPC. I will begin by showing how to use C++ with R to write highly performant functions and by covering some common challenges people encounter. I will then discuss web scraping, with some examples, and the legal and ethical considerations that go along with it. Finally, I will spend some time discussing collaboration and code distribution, and work through the basics of package development in R.

  • For those interested in integrating C++ and R code, a tutorial with a number of examples is available here.
  • I will be referencing an R package development tutorial, which is available here.
  • Slides for this lecture are available here.

Resources

I will keep adding to this list as I remember more useful resources:

  • A nice place to start learning R interactively is Swirl.
  • Quick-R has a bunch of easy to read tutorials for doing all sorts of basic things -- http://www.statmethods.net/.
  • Hadley Wickham wrote a book that covers a bunch of advanced functionality in R, titled Advanced R, which is available online for free -- http://adv-r.had.co.nz/.
  • Hadley Wickham also wrote a book on R packages, aptly named R Packages, which is also available online for free -- http://r-pkgs.had.co.nz/.
  • The official website for Rcpp is -- http://www.rcpp.org/
  • Dirk Eddelbuettel has a great site for all things R; check out the code and blog sections. He is the creator of the Rcpp package, among many others -- http://dirk.eddelbuettel.com/
  • Tim Churches has a nice tutorial on parallelization in R, available here.