9 Web Scraping in R

🎯 Learning goals

After working through Tutorial 9, you’ll be able to…

  • explain web scraping, including ethical and legal limitations
  • explain the basics of CSS and HTML
  • apply functions for web scraping

1. What is web scraping?

I understand web scraping as the automated collection and retrieval of relevant data from website code. Generally, this includes at least three steps:

  1. Identify the URL of the website you want to scrape

  2. Download its content

  3. From downloaded data, separate „junk“ from relevant data

Since a lot of social media platforms have closed automated access to their data via Application Programming Interfaces (APIs), researchers increasingly have to collect data via other methods.

As a data collection method, web scraping has several advantages:

  • It allows us to collect data that may otherwise not be available for research
  • It allows us to collect new (meta) data (e.g., timestamps of content, multi-modal data)
  • It allows us to rely on the structuredness of websites to scale-up data collection

3. Source code

To understand web scraping, you have to understand what it relies on: source code.

Generally, websites are text documents that are interpreted and designed based on their source code. Let’s take the Wikipedia page on “Communication studies” as an example.

This is how the website looks like in my browser:

Image: Wikipedia page

Overview of wikipedia site on comm studies

Now, let us look at the underlying code (click on the right, then choose “View Page Source” or “View Source” depending on your computer). For different options to do so across browsers and operating systems, see here.

Image: Wikipedia page Source Code

Overview of wikipedia site on comm studies

You can see: Websites are simple text documents (ok, the code does not look simple - but we can easily understand parts of it!). When you visit websites, your browser reads the underlying source code (e.g., HTML, CSS, Javascript) to correctly display the website.

  • HTML (Hypertext Markup Language): is a “markup language” that defines the structure of websites. “Markup” means that the code includes additional info besides just the content you want to display. For example, you can use HTML to define the title of your website: you may not only include information on the content of the title, but also where and how it should be displayed.
  • CSS (Cascading Style Sheets): is a “style sheet” language we use to change the design of websites. For example, you can use CSS to define the color of the title of your website.
  • JavaScript: is a language we also use to change the design/behavior of websites (mostly for dynamic and interactive elements). For example, you can use Javascript to automatically play a video when you hover over the title of your website.

In the following, we will focus on understanding the basics of HTML and CSS to learn web scraping.

3.1 HTML

You can use HTML (Hypertext Markup Language) to structure websites. For our seminar, you mainly need to know three things about HTML:

  1. Websites consist of nested elements. For example, this is an element: <body>Some content</body>. This gives HTML files a “tree-like” structure where elements are nested in elements nested in elements etc. For example, the element <body>Some content</body> is usually nested in html: <html><body>Some content</body></html>

  2. Most elements (e.g., <body>Some content</body>) consist of a tag which marks the beginning <> of an element. Similarly, a tag marks the end of the element </> (this differs for some elements, but we can ignore this for now). In between these tags is some content, e.g. text. For example: the element <h1>My heading</h1>consists of a start tag (<h1>), content (“My heading”), and an end tag (</h1>). What type of tag is used tells us a bit about what content we may expect (e.g., in h1 we may expect a title, in img an image - which is useful information for automatically scraping such code!).

  3. You can also include attributes in-between tags. Attributes provide additional information. For example, we may want to add a link to our heading “My heading” from above. We could do that via the a tag (an “anchor” tag used to embed links) and an additional href attribute that specifies where the link should guide readers.

A typical HTML text may look like this:

<html>
<body>

<h1>My heading</h1>
<p>Some text I wrote.</p>

</body>
</html>

What does this mean?

  • the <html> element is the root element of an HTML website
  • the <body> element defines the document’s body (where text, images etc. are included)
  • the <h1> element defines a large heading
  • the <p> element defines a paragraph

On a website, the HTML snippet above is rendered to only the following two sentences:

Output of HTML snippet

My heading

Some text I wrote.

If you want to try this yourself, copy-in the HTML code here and try to play around with it a bit.

As you can see, while the HTML file contains a lot of information, only some of it is displayed here. Most of it is, instead, used to structure text “in the background”.

Next, let’s try to include a link for the text “Some text I wrote”. We can use the a anchor and the href attribute (including the link, here to Google).

<html>
<body>

<h1>My heading</h1>
<p><a href="https://www.google.de/">Some text I wrote</a></p>

</body>
</html>

On a website, the result looks like this (try clicking the link!):

Output of HTML snippet

My heading

Some text I wrote

You may ask yourself: Great - so why exactly should you know about HTML?

Knowing about the structure of HTML files (and what type of content different elements contain) is important to systematically parse relevant data from websites.

For example, articles in news websites will often be included in body, article titles in h1, etc. So we could look for these tags/elements when scraping news websites to only extract the title and text of a news article.

3.2 CSS

While you could fine-tune the appearance of your website via HTML elements (e.g., <b>, <i>), developers came up with CSS (Cascading Style Sheets) to more neatly format the appearance of HTML pages.

For our seminar, you mainly need to know four things about CSS:

  1. Rules are the building blocks of CSS. They describe how different sections of websites should be formatted. Rules consist of selectors and declaration blocks. For example, a rule could define that all headings of a website should be displayed in the color red.

  2. Selectors define which HTML element you want to style. In the example above, the selector could be the element h1, which stands for the first heading.

  3. Declaration blocks describe how the element should be styled by including information on properties (e.g., please set my color) and values (e.g., please set it to red).

  4. There are different ways in which we can include CSS in HTML. Here, we will discuss inline CSS (defining rules for every single element) and internal CSS (defining rules for types of elements, so-called classes).

3.2.1 Inline CSS

Let’s try to change the color of my h1 heading “My heading”. My new rule should be that all headings h1 should displayed in red.

How do I do this?

  • I want to change the content included in the heading h1 element via inline CSS. To do so, I change the style attribute within the h1 element.
  • I want to change the color of this element, so I include the property color within the style attribute.
  • I want to change the color of this element to red, so I include the value red for the property color within the style attribute.
<html>

<body>

<h1 style="color:red;">My heading</h1>
<p>Some text I wrote</p>

</body>
</html>

On a website, this changes the color of the h1 element:

Output of HTML snippet

My heading

Some text I wrote

3.2.2 Internal CSS

The inline version above is a bit inefficient, since you would have to include information on the color for every single heading (making your code very long and more prone to errors).

Instead of inline code, we could also use internal CSS: Here, we define rules not for every single element (e.g., every h1) but for types of elements, so-called classes. We define a CSS style for all elements of a certain class and assign this class to elements we want to be displayed in a certain way via attributes.

Let’s say, for example, that we only want parts of “Some text I wrote” to be depicted in pink:

  • I create a new class .text-pink for text that should be pink within the <style> element. Notice how <style> is now defined at the beginning of the document, so not related to a specific element.
  • For .text-pink, I want to define a color, so I include the property color.
  • For .text-pink, I want to set the color to pink, so I include the value pink for the property color.
  • To mark which words of “Some text I wrote” should be depicted in pink, I can use the <span> element. <span> is used to mark specific parts of text. I only want the word “text” to be pink, so I include “text” between <span> and </span>.
<html>

<style>
.text-pink {
  color:pink;
}
</style>

<body>

<h1 class="heading-new">My heading</h1>
<p>Some <span class="text-pink">text</span> I wrote</p>

</body>
</html>

On a website, this changes the color of the word “text”:

Output of HTML snippet

My heading

Some text I wrote

Again, you may ask yourself: Great - so why exactly should you know about CSS?

Again, knowing about the structure of CSS syntax is important to systematically parse relevant data from websites.

For example, articles in news websites may be formatted according to a specific style, e.g., the class style-article. We could look for style-article when scraping news websites to only extract the text of the article (and ignore all “junk code” around it).

4. Let’s go: Web scraping in R

Let’s try this. Remember the steps of scraping you learned about in the last tutorial:

  1. Identify the URL of the website you want to scrape

  2. Download its content

  3. From downloaded data, separate „junk“ from relevant data

4.1 Identify URL of website

Let’s again take the Wikipedia page on “Communication studies” as an example: https://en.wikipedia.org/wiki/Communication_studies

This is how the website looks like in my browser:

Image: Wikipedia page

Overview of wikipedia site on comm studies

Now, look at the underlying code (click on the right, then choose “View Page Source” or “View Source” depending on your computer/browser).

Image: Wikipedia page Source Code

Overview of wikipedia site on comm studies

To be sure that we can scrape the website, take a look at Wikipedia’s “robots.txt” file here.

4.2 Download website content

To download the source code, we download and activate two packages:

  • r package polite (for more info, see here)
  • r package rvest (for more info, see here)

While rvest allows us to scrape and parse content, polite assures that we adhere to “good behavior” while doing so. With polite, we can…

  • tell website owners who we are
  • respect which website content we are (not) allowed to scrape
  • time our scraper so that we do not constantly hit websites with our scraping requests
install.packages("rvest")
install.packages("polite")
library("rvest")
library("polite")

Next, we use the bow() command polite to “introduce” us to the website and see whether we can scrape it:

session <- bow(url = "https://en.wikipedia.org/wiki/Communication_studies",
               user_agent = "Teaching project, 
               Valerie Hase, 
               Department of Media and Communications,
               AAU Klagenfurt")
#Result
session
<polite session> https://en.wikipedia.org/wiki/Communication_studies
    User-agent: Teaching project, 
               Valerie Hase, 
               Department of Media and Communications,
               AAU Klagenfurt
    robots.txt: 464 rules are defined for 34 bots
   Crawl delay: 5 sec
  The path is scrapable for this user-agent

When inspecting the result, we can see that we are allowed to scrape this website. However, we should wait for 5 seconds in-between scraping requests.

We now use the scrape() command to implement these rules while scraping the website:

url <- scrape(session)

Important: If the url object is empty (i.e. NULL), it may be that you are not allowed/cannot scrape this website.

4.3 Identify relevant HTML elements

Remember the third part of scraping: separating „junk“ from relevant data. Now this is actually the hardest part. From the long HTML code, how do we find (and keep) relevant data?

The solution: We have to identify the right selector (e.g., an element, a class, an attribute) and retrieve it via html_element (retrieves the first match) or html_elements (retrieves all matches).

The following Table gives you a small overview of common commands for retrieving selectors based on this great, more detailed overview by T. Gessler. The W3 school also has a great overview here.

Overview of selected commands
Command Example Meaning
html_elements(element) html_elements(p) retrieves all elements p
html_elements(element, element) html_elements(div, p) retrieves all elements div and p
html_elements(element element) html_elements(div p) retrieves all elements p inside div
html_elements(.class) html_elements(.article) retrieves all elements with class article
html_elements(element.class) html_elements(p.article) retrieves all p elements with class article
html_elements(element) |> html_attr(attribute) html_elements(a) |> html_attr(href) retrieves all links within elements a
html_elements(element) |> html_attr(attribute) html_elements(img) |> html_attr(src) retrieves the source of all images within elements img

To find the right selector, you can…

  • Inspect: click right on the specific element and choose “Inspect”, which will open the underlying source code
  • SelectorGadget: use SelectorGadget to automatically find the right selectors. To install SelectorGadget, click on this link. Install the Chrome Extension (link at the end of the website, search for “Try our new Chrome Extension”). For a more detailed instruction of how the tool works, see here.

4.3.1. Example 1: Scrape a website title

For example, let’s say I want to extract the title of the website. Using the Inspect function (right click, then inspect to open the source code), we can see that Wikipedia has defined its own class for the title: .mw-page-title-main.

Image: Wikipedia page Source Code

Overview of wikipedia site on comm studies

Using the html_elements function from rvest, we now only keep relevant content:

url |>
  html_elements(".mw-page-title-main")
{xml_nodeset (2)}
[1] <span class="mw-page-title-main">Communication studies</span>
[2] <span class="mw-page-title-main">Communication studies</span>

Great, but we only want the content within span, not all “junk” around it.

To clean our data, we can now use the html_text() command:

url |>
  html_elements(".mw-page-title-main") |>
  html_text()
[1] "Communication studies" "Communication studies"

4.3.2. Example 2: Scrape website subheadings

Let’s try another example: Let’s say we want to extract all subheadings.

We already know that subheadings are often indicated by h1, h2, etc., so we use this information to retrieve all headings from h1 to h6. We can combine selectors like so:

url |>
  html_elements("h1, h2, h3, h4, h5, h6") |>
  html_text()
 [1] "Contents"                              
 [2] "Communication studies"                 
 [3] "History"                               
 [4] "Origins"                               
 [5] "Foundations of the academic discipline"
 [6] "In the United States"                  
 [7] "In Canada"                             
 [8] "Scope and topics"                      
 [9] "Business"                              
[10] "Healthcare"                            
[11] "Professional associations"             
[12] "See also"                              
[13] "References"                            
[14] "Bibliography"                          

4.3.3. Example 3: Scrape website text

Let’s try another example: Let’s say we want to scrape the text of the article. For this, we retrieve all paragrahs as indicated by p.

url |>
  html_elements("p") |>
  html_text() |>
  
  #we only want to read the first block of text
  head(1)
[1] "Communication studies (or communication science) is an academic discipline that deals with processes of human communication and behavior, patterns of communication in interpersonal relationships, social interactions and communication in different cultures.[1]Communication is commonly defined as giving, receiving or exchanging ideas, information, signals or messages through appropriate media, enabling individuals or groups to persuade, to seek information, to give information or to express emotions effectively.[2][3] Communication studies is a social science that uses various methods of empirical investigation and critical analysis to develop a body of knowledge that encompasses a range of topics, from face-to-face conversation at a level of individual agency and interaction to social and cultural communication systems at a macro level.[4][5]"

If you look at the output, you see that this worked (although the text is a bit messy and probably needs some more cleaning).

🤓 Smart Hacks

Oftentimes, you may want to automate your scraping process, for example by writing a timer that automatically runs your scraping script every x-th minute.

To automate the scraping process, packages like taskscheduleR offer great a solution.

install.packages("taskscheduleR")
library("taskscheduleR")

Using taskscheduler_create(), you can tell your computer to run a scraping script automatically at a dedicated time. Just define the task name via taskname and tell R which script to run via rscript. Next, well R that you want to run the script every 5 minutes via schedule and modifier.

taskscheduler_create(taskname = "Automated-Scraping", 
                     rscript = "Scraping-Skript.R",
                     schedule = "MINUTE", 
                     modifier = 5)

For example, you may need PDF screenshots of webpages to understand what content was shown how where. To automate this process, you can use the pagedown package.

install.packages("pagedown")
library("pagedown")

Using chrome_print(), you can take screenshots. Just define the URL via input and the file name for the resulting screenshot via output

chrome_print(input = "https://www.bbc.com", 
       output = "screenshot.pdf")

Important: Usually, these screenshots are not perfect. You often have to include some “waiting” time via the wait argument so the website can fully load before you take a screenshot. Some hosts also use automated means of disallowing automated screenshots.

💡 Take-Aways

HTML: a “markup language” that defines the structure of websites

  • consists of nested elements (e.g., body, title) marked by tags (<>, </>) (for overview lists of elements, see here)
  • elements may include attributes including additional information (e.g., a link via href) (for overview lists of attributes, see here)

As an overview, see the most important types of elements and attributes in HTML (for a full list, see here):

Overview of HTML elements
Element Meaning
<head> structure - defines meta data of a document
<title> structure - defines the title of a document
<body> structure - defines the document’s body
<h1> structure - defines headings (h1, h2, hc, etc.)
<p> structure - defines a paragraph
<a> structure - defines a link
<img> structure - defines an image
<div> structure - defines a container in which elements can be styled via CSS/JavaScript
<b> formatting - makes text bold
<i> formatting - makes text italic

CSS: a “style sheet” language that defines the design of websites

  • consists of rules (e.g., this element should be green) specified by selectors and related declaration blocks

  • rules can be specified for classes of elements

  • selectors describe which element should be formatted (e.g., the heading h1)

  • declaration blocks define what property of the selector should be formatted (e.g., its color color) and what value should be used (e.g., `red``)

R Code for scraping:

  • introduce your scraping agent: bow()

  • scrape website: scrape()

  • parse content: hmtl_elements(), hmtl_attr(), hmtl_text()

  • screenshots: chrome_print()

  • automate script execution: taskscheduler_create()

🎲 Quiz

🎲 Question 1

Which of the following statements about web scraping are correct?

🎲 Question 2

Which of the following statements about HTMl, CSS, and Java are correct?

📚 More tutorials on this

You still have questions? The following tutorials & papers can help you with that:

📌 Test your knowledge

Task 1 (Easy🔥)

Use your knowledge about web scraping to scrape content from the Wikipedia page on web scraping. Retrieve the main topics discussed here (see blue circled headings).

Image: Wikipedia page: Web Scraping

Overview of headings on Wikipedia site on web scraping

Task 2 (Medium🔥🔥)

Now get the number of references mentioned in the reference list of this Wikipedia article. At the moment, it is 37 but this may change one you scrape the website.

Task 3 (Hard🔥🔥🔥)

Now, we want to collect information on the most heated Wikipedia debates!

Write a function that goes to a Wikipedia “Talk” page for a given topic (e.g., debates on the Wikipedia Web Scraping page here).

Image: Wikipedia page: Web Scraping - Talks

Overview of open debates on Wikipedia site on web scraping

The function should ….

  • Count how many discussion any Wikipedia talk page currently has.
  • Save the result (topic, link, number of discussion sections).

The function should be flexible: It should run for any list of pages you give to it.

Using this function, run the function for three Wikipedia articles: Web scraping, Artificial intelligence, Algorithm. The result could/should look something like this (here, I named my function as scrape_discussion():

# Run and compare number of discussions on articles
bind_rows(scrape_discussion("Web scraping"),
          scrape_discussion("Artificial intelligence"),
          scrape_discussion("Algorithm"))
# A tibble: 3 × 3
  article                 url                                         discussion
  <chr>                   <chr>                                            <dbl>
1 Web scraping            https://en.wikipedia.org/wiki/Talk:Web_scr…          3
2 Artificial intelligence https://en.wikipedia.org/wiki/Talk:Artific…          6
3 Algorithm               https://en.wikipedia.org/wiki/Talk:Algorit…          5