Digital research methods for textual data
Methodology courses and philosophy of science
Number of sessions: 2
Hours per session: 3
Entry level: Intermediate
This course is suitable for PhD candidates at all levels. Its material overlaps slightly with the PhD course “Data visualisation, web scraping, and text analysis in R”, which also covers social media scraping and text analytics. However, this DRM course focuses on non-programming approaches to data acquisition, cleaning, and topic modelling, so that course is not a prerequisite for this one.
This course was previously entitled "Topic modelling". The course content remains unchanged.
This course introduces a set of digital research methods (DRM). With these innovative methods, it is possible to analyse large textual datasets from social media, news articles, interviews, and other sources. In virtually all disciplines in the social sciences and humanities, these techniques are becoming increasingly popular.
The course is specifically designed for people who do not feel comfortable using technical programming software. We will focus on how DRM can be applied with accessible software based on user-friendly interfaces.
The first class will introduce basic approaches to scraping social media content (namely Twitter) and news articles (LexisNexis), and will cover the steps for cleaning textual data. Some text analysis approaches will also be introduced, followed by an in-depth exploration of topic modelling, a powerful but easy-to-use text analytic method for uncovering hidden themes across many text documents.
In the second class, we will explore additional social media scraping tools (for Facebook and YouTube) and investigate how topic modelling results can be visualized as networks. Finally, we will work through exercises that use network analysis to visualize and analyse qualitative content coding. Network depictions of textual content can reveal new perspectives and lead to enhanced interpretations.
- There will be two 3-hour sessions. Each session will include a mix of lectures (15%), demonstrations (5%), and in-class exercises (80%).
- Participants can work with the text data supplied for the course or bring text data of their own.
After completing this workshop, you will be able to:
- Scrape and clean textual data from social media and news articles;
- Apply digital research methods, particularly topic modelling;
- Visualize and interpret the results of the analysis.
How to prepare
In order to actively participate in the course, you are required to read the following literature:
- Levallois, C. (2017). A primer on text mining for business. URL: https://seinecle.github.io/mk99/generated-pdf/text-mining-for-business.pdf
- Levallois, C. (2017). A primer on network analysis for business. URL: https://seinecle.github.io/mk99/generated-pdf/network-analysis-for-business.pdf
- Blei, D. (2012). Probabilistic topic models. Communications of the ACM, 55(4), 77-84.
The first two readings are very short introductions and applicable to domains beyond business.
You should also familiarize yourself with the instructor’s Digital Research Methods Step-by-Step Guide, particularly the sections on topic modelling (5.8) and topic networks (7.9) and data scraping: Mozdeh (3.8), LexisNexis (4.2), GetOldTweets (4.8), and Netvizz (4.3, 4.6):
- DRM Step-by-Step Guide: https://goo.gl/3ToLWW
You are welcome to use your own laptops during the in-class exercises. However, you may need to enable Administrator rights to install the software. If you choose to use your own laptop, the following software programs need to be installed:
- ConText 1.2, Gephi 0.9.1, Mozdeh (Big Data Text Analysis, Windows only), and GetOldTweets tool (authored by Jay Lee)
These tools may be acquired from either the hosts’ websites or from the course instructor’s Digital Research Methods Dropbox ‘tools’ folder (see below).
- ConText: http://context.lis.illinois.edu
- Gephi: https://gephi.org
- Mozdeh: http://mozdeh.wlv.ac.uk
- GetOldTweets is available only through the DRM Dropbox ‘tools/Extra’ folder: https://goo.gl/f418Yh
- DRM Dropbox ‘tools’ folder: https://goo.gl/yAuJaA
Basic scraping and cleaning of data and basic topic modelling
- In this session, you will learn to scrape data from Twitter and LexisNexis using several online and offline tools, extract their textual elements, and learn how to conduct basic, but necessary, cleaning of the data in the ConText text analysis software.
- You will also learn how topic models operate and how they are applied, and you will then perform and interpret topic modelling on the acquired data.
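The course itself performs cleaning through the ConText interface, without any programming. For readers curious about what "basic, but necessary" cleaning typically involves, the steps can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `clean_text`, the tiny stop-word list, and the sample tweet are all hypothetical, and the steps shown (lower-casing, removing URLs, @-mentions, punctuation, numbers, and stop words) are common choices rather than a description of what ConText does internally.

```python
import re
import string

# Illustrative stop-word list (an assumption); real projects use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "are", "for", "on", "rt"}

URL_RE = re.compile(r"https?://\S+")   # web links
MENTION_RE = re.compile(r"@\w+")       # Twitter-style @-mentions
# Map every ASCII punctuation character to a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text(text):
    """Return lower-case tokens with URLs, mentions, punctuation,
    pure numbers, and stop words removed."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.translate(PUNCT_TABLE)
    return [tok for tok in text.split()
            if tok not in STOP_WORDS and not tok.isdigit()]


if __name__ == "__main__":
    sample = "RT @someone: Refugees are welcome in the city! https://example.org/x #news"
    print(clean_text(sample))
```

The order of the steps matters: URLs and mentions are removed before punctuation is stripped, because stripping punctuation first would break them into fragments that are harder to recognise.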
Further scraping and advanced topic modelling
- In this session, we will cover other approaches to social media scraping (namely, Netvizz) and more rigorous cleaning through Excel.
- You will learn about more advanced approaches to topic modelling and become familiar with a more precise tool for topic modelling (MALLET).
- We will explore visual interpretation of topic models through network representations (in Gephi).
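As a rough illustration of how topic modelling output might become a network, the sketch below links two topics whenever they share top terms and writes the result as a CSV edge list with `Source`, `Target`, and `Weight` columns, a layout that Gephi's spreadsheet import accepts. Everything here is an assumption made for the example: the topic labels, the term lists, and the shared-term linking rule are hypothetical, and the course's Step-by-Step Guide may construct topic networks differently.

```python
import csv
import io
from itertools import combinations


def topic_edges(topic_terms, min_shared=1):
    """Link each pair of topics that share at least `min_shared` top terms.

    `topic_terms` maps a topic label to its list of top terms.
    Returns (source, target, weight) tuples, where weight is the
    number of shared terms.
    """
    edges = []
    for (a, terms_a), (b, terms_b) in combinations(sorted(topic_terms.items()), 2):
        shared = set(terms_a) & set(terms_b)
        if len(shared) >= min_shared:
            edges.append((a, b, len(shared)))
    return edges


def edges_to_csv(edges):
    """Serialise edges as CSV text with a Source,Target,Weight header."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    writer.writerows(edges)
    return buf.getvalue()


if __name__ == "__main__":
    # Hypothetical topics; real ones would come from a topic modelling tool.
    topics = {
        "T1": ["refugee", "border", "asylum"],
        "T2": ["asylum", "policy", "border"],
        "T3": ["artist", "gallery", "policy"],
    }
    print(edges_to_csv(topic_edges(topics)))
```

Saving the printed text to a `.csv` file gives an edge list in which heavier edges indicate topics with more vocabulary in common.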
About the instructor
Ju-Sung (Jay) Lee is assistant professor of digital research methods at the Department of Media and Communication of Erasmus University Rotterdam (EUR).
His research focuses on various digital, network, and statistical methodologies and their application to online and offline discourse and interactions, recently in the context of the refugee crisis and artist communities. Jay holds a PhD in sociology from Carnegie Mellon University (USA) and has a background in computer science, organisation and decision sciences, and quantitative sociology.