Increasingly, scholars from the social sciences and humanities use ‘big data’ to conduct research. These data can be obtained from a wide variety of online sources, such as web sites and social media, or from external data providers (for example, Statistics Netherlands).
This course introduces issues of collecting, preparing, analysing, and visualising ‘big’ data. Participants will be familiarised with how to write, debug, and keep track of their own code using R, a popular programming language for data manipulation, analysis, and visualisation.
After completion of this course, you will be able to:
- acquire a basic understanding of big data and social media analytics in the context of social science and humanities research;
- write code in R in order to obtain, prepare, analyse, and visualise data;
- understand how to automate data collection from web sites and social media;
- gain basic proficiency with tools for analysing large quantities of text;
- monitor and manage the various steps of data collection and analysis for both integrity and replication purposes; and
- work as a more productive (taking less time to analyse your data) and more careful (making fewer mistakes) scientist.
There are four weekly sessions of 3-4 hours each. Sessions will include a mix of lectures, demonstrations, and/or in-class exercises. You will need to bring a laptop to these sessions on which you have the necessary rights to install software.
Students will work with data sets supplied for the course, but can also work with a data set of their own. Data can be from any source: experiments, surveys, time series, panels, etc.
Students following this course are expected to satisfy the following requirements:
- Prior exposure to the R programming language. This is a very low threshold of knowledge, and one that can be attained, for example, by following an online tutorial or course.
Visit https://www.rstudio.com/online-learning/#r-programming for a list of resources.
- Knowledge of basic probability theory and statistical analysis, for example, regarding linear models or analysis of variance. If you are in doubt about your background, contact the Graduate School office (Jan Nagtzaam: email@example.com).
Sessions are both iterative and cumulative, so attendance at all four sessions is mandatory. In the first session, you will follow a tutorial that encompasses many of the tools you will eventually encounter in the course, but you are not expected to understand every aspect of this exercise at this stage.
In each session, we will build upon the previous, adding new tools while reinforcing what you have already learned. The goal is that by the fourth session, you will have learned enough to apply these tools to your own research.
Between sessions, you will complete exercises in order to practise and develop your new skills. Although these exercises will not be graded, their completion is mandatory, as students will review and attempt to replicate each other’s work throughout the course.
- Session 1:
Course overview and first steps with R
- You will create, edit, and compile an R-markdown file that contains a free-text discussion of your data analysis, your code, and any output from that code (including plots).
- We will build an R-markdown file that collects data from an online source, performs a few basic manipulations, and plots the results. You will learn how to use version control software to track changes to this markdown file over time.
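As a rough illustration of what such a file might look like, the sketch below writes a tiny R-markdown document to a temporary folder and compiles it when the tooling is available. The file name, title, and chunk contents are made up for this example, and it assumes the `rmarkdown` package and pandoc are installed; it is not the tutorial used in the course.

```r
# Three backticks, built in code so the chunk delimiters are easy to see
fence <- strrep("`", 3)

# Lines of a minimal R-markdown file: header, free text, and one code chunk
rmd_lines <- c(
  "---",
  "title: \"My first analysis\"",   # hypothetical title
  "output: html_document",
  "---",
  "",
  "Free-text discussion of the analysis goes here.",
  "",
  paste0(fence, "{r}"),
  "x <- c(1, 2, 3)",
  "mean(x)",
  "plot(x)",
  fence
)

# Write the file to a temporary folder (hypothetical file name)
rmd_path <- file.path(tempdir(), "analysis.Rmd")
writeLines(rmd_lines, rmd_path)

# Compile to HTML only if rmarkdown and pandoc are available on this machine
if (requireNamespace("rmarkdown", quietly = TRUE) && rmarkdown::pandoc_available()) {
  rmarkdown::render(rmd_path, quiet = TRUE)
}
```

The compiled document interleaves the text, the code, and the code's output (here a mean and a plot), which is what makes the analysis easy to review and replicate.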
- Session 2:
Acquiring, preparing, and visualising data
- You will learn how to write code to acquire data from files located on the web or stored on your local computer, load them into R, and “clean” the data in preparation for further analysis (such as data visualisation). You will then learn about a powerful yet relatively simple “grammar” for visualising data that has been implemented in the ggplot2 package in R.
- We will also discuss the underlying theory that drives this grammar (including the psychological principles behind effective data visualisation), and gain an appreciation for how visualisation can lead to insights about data more quickly than statistical analysis.
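The load, clean, and plot workflow described for this session could be sketched as follows. The data are made-up toy values embedded in the script so that the example runs without a download, and the column names (`group`, `week`, `score`) are assumptions for illustration only; the example assumes the `ggplot2` package is installed.

```r
library(ggplot2)

# Toy data with typical problems: inconsistent labels, stray whitespace, a missing value
raw_csv <- "group,week,score
a,1,2.5
 A,2,3.1
b,1,
B,2,4.0
b,3,4.4"

# Load: read.csv() also accepts a file path or a URL in place of text =
raw <- read.csv(text = raw_csv, stringsAsFactors = FALSE)

# Clean: drop rows with a missing score and standardise the group labels
clean <- raw[!is.na(raw$score), ]
clean$group <- toupper(trimws(clean$group))

# Visualise: map variables to aesthetics, then add geometric layers
p <- ggplot(clean, aes(x = week, y = score, colour = group)) +
  geom_point() +
  geom_line()
print(p)
```

The plot is built by combining a data set, a mapping from variables to visual properties, and one or more layers, which is the "grammar" idea behind ggplot2.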
- Session 3:
Obtaining data from web sites and social media
- You will learn how to acquire data from various online sources, such as web pages and the Twitter API, and how to automate these procedures. You will continue to gain practice preparing, analysing, and visualising these data.
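A minimal sketch of extracting data from a web page is shown below. It assumes the `rvest` package, which is one common choice and not necessarily the tool used in the course. The HTML is a made-up snippet embedded in the script so that the example runs offline, and the function name `scrape_posts` and the CSS selectors are hypothetical.

```r
library(rvest)

# Made-up page content; in practice this would come from a real web page
html <- '<html><body>
  <div class="post"><h2 class="title">First post</h2><a href="/posts/1">Read</a></div>
  <div class="post"><h2 class="title">Second post</h2><a href="/posts/2">Read</a></div>
</body></html>'

# Hypothetical helper: parse one page and return its posts as a data frame
scrape_posts <- function(source) {
  page  <- read_html(source)               # accepts literal HTML, a file path, or a URL
  posts <- html_elements(page, "div.post") # one node per post
  data.frame(
    title = html_text2(html_element(posts, "h2.title")),
    link  = html_attr(html_element(posts, "a"), "href"),
    stringsAsFactors = FALSE
  )
}

results <- scrape_posts(html)
print(results)
```

Wrapping the steps in a function is what makes automation straightforward: the same function can be applied to a vector of pages with `lapply()` and the resulting data frames combined with `rbind()`.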
- Session 4:
Text and sentiment analysis
- You will learn how to process large amounts of unstructured data (e.g., text documents) to extract important features (e.g., the occurrence of special words). You will also learn how to conduct automatic sentiment analysis (scoring text based on its positivity or negativity).
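To make the idea of sentiment scoring concrete, here is a deliberately simple sketch in base R that counts positive and negative words. The word lists and example texts are invented for illustration; a real analysis would rely on a published sentiment dictionary and more careful text processing, and this is not necessarily the method taught in the session.

```r
# Hypothetical mini-lexicon, for illustration only
positive_words <- c("good", "great", "love", "excellent")
negative_words <- c("bad", "poor", "hate", "terrible")

# Split a text into lower-case word tokens
tokenise <- function(text) {
  words <- strsplit(tolower(text), "[^a-z']+")[[1]]
  words[nzchar(words)]
}

# Score = number of positive words minus number of negative words
sentiment_score <- function(text) {
  words <- tokenise(text)
  sum(words %in% positive_words) - sum(words %in% negative_words)
}

# Made-up example texts
reviews <- c(
  "I love this film, the acting is excellent",
  "Terrible plot and poor pacing",
  "The cinema was on the corner"
)

scores <- vapply(reviews, sentiment_score, integer(1), USE.NAMES = FALSE)
print(data.frame(review = reviews, score = scores))
```

Here the first text scores 2, the second scores -2, and the third scores 0, because it contains no words from either list.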
Jason Roos is an Assistant Professor at the Department of Marketing Management of the Rotterdam School of Management (RSM), Erasmus University Rotterdam (EUR). His research focuses on issues related to new media and the Internet, as well as the entertainment industry.
Jason received his PhD from Duke University's Fuqua School of Business. Before he entered academia, he was a consultant and software engineer in the Seattle area during the original dot-com bubble, working on projects for Microsoft, BP, AT&T Wireless, and the U.S. Government.