
Introduction

Webscraping is the process of programmatically extracting information from the internet that was intended to be displayed in a browser. But it should only be used as a last resort; generally an API (application programming interface) is a much better way to obtain information, if one is available.

If you do find yourself in a scraping situation, be sure to check that it’s legally allowed and that you are not violating the website’s robots.txt rules. robots.txt is a special file, found on almost every website, that sets out what’s fair game to crawl (conditional on legality) and what your webscraper should not go poking around in.
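As an illustrative sketch, Python’s built-in urllib.robotparser module can check whether a given page is allowed to be crawled; the URLs below are just placeholder examples.

from urllib import robotparser

# Minimal sketch: check whether a page may be crawled according to robots.txt.
# The URLs here are purely illustrative placeholders.
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()
print(rp.can_fetch("*", "https://example.com/some-page"))  # True if allowed for any user agent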

Keep in Mind

  Remember that webscraping is as much an art as a science, so play around with a problem and figure out creative ways to solve issues; the solution might not pop out at you immediately.

Implementation

Python

Five of the most well-known and powerful libraries for webscraping in Python, which between them cover a huge range of needs, are requests, lxml, beautifulsoup, selenium, and scrapy. Broadly, requests is for downloading webpages via code, beautifulsoup and lxml are for parsing webpages and extracting info, and scrapy and selenium are full web-crawling solutions. For the special case of scraping tables from websites, pandas is the best option.

For quick and simple webscraping of individual HTML tags, a good combo is requests, which does little more than go and grab the HTML of a webpage, and beautifulsoup, which then helps you to navigate the structure of the page and pull out what you’re actually interested in. For dynamic webpages that use JavaScript rather than just HTML, you’ll need selenium, which drives a real browser. To scale up and hit thousands of webpages in an efficient way, you might try scrapy, which can work with the other tools, handle multiple sessions, and all kinds of other bells and whistles… it’s actually a “web scraping framework”.
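To give a flavour of selenium, here is a minimal sketch that loads a page in a real browser and returns the HTML after any JavaScript has run; it assumes you have Chrome and a compatible driver installed (setup details vary by browser and operating system), and the URL is a placeholder.

from selenium import webdriver

# Minimal sketch: open a browser, load a page, and read the page source
# *after* any JavaScript has executed. Assumes Chrome and its driver are available.
driver = webdriver.Chrome()
driver.get("https://example.com")  # placeholder URL
html = driver.page_source  # the rendered HTML
driver.quit()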

Let’s see a simple example using requests and beautifulsoup, followed by an example of extracting a table using pandas. First we need to import the packages; remember you may need to install these first by running pip install packagename on your computer’s command line.

import requests
from bs4 import BeautifulSoup
import pandas as pd

Now we’ll specify a URL to scrape, download it as a page, and show some of the HTML as downloaded (here, the first 500 characters):

url = "https://blackopaldirect.com/product/black-opals/2-86-ct-black-opal-11-6x9-7x3-9mm/"
page = requests.get(url)
print(page.text[:500])
<!DOCTYPE html>
<html lang="en-US">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1.0, user-scalable=no">
<link rel="profile" href="http://gmpg.org/xfn/11">
<link rel="pingback" href="https://blackopaldirect.com/xmlrpc.php">
<!-- Facebook Pixel Code -->
<script>
!function(f,b,e,v,n,t,s)
{if(f.fbq)return;n=f.fbq=function(){n.callMethod?
n.callMethod.apply(n,arguments):n.queue.push(arguments)};
if(!f._fbq)f._fbq=n;n.push=n;n.loaded=

That’s a bit tricky to read, let alone extract any useful data from. So let’s now use beautifulsoup, which parses the downloaded HTML. To pull out just the human-readable text of the page, use .text. In the example below, we’ll just show the first 100 characters, and we’ll also use lstrip and rstrip to trim leading and trailing whitespace:

soup = BeautifulSoup(page.text, 'html.parser')
print(soup.text[:100].lstrip().rstrip())
2.86 ct black opal 11.6x9.7x3.9mm - Black Opal Direct
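If instead you want a nicely indented view of the raw HTML itself (rather than the visible text), beautifulsoup also provides a prettify method; for example:

# Show the first 200 characters of the re-indented HTML
print(soup.prettify()[:200])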

There are lots of different elements, with tags, that make up a page of HTML. For example, a title might have a tag ‘h1’ and a class ‘product_title’. Let’s see how we can retrieve anything with a class that is ‘price’ and a tag that is ‘p’ as these are the characteristics of prices displayed on the URL we are scraping.

price_html = soup.find("p", {"class": "price"})
print(price_html)
<p class="price"><span class="woocommerce-Price-amount amount"><bdi><span class="woocommerce-Price-currencySymbol">US$</span>2,500.00</bdi></span></p>

This returns the first tag found that satisfies the conditions (to get all tags matching the criteria use soup.find_all). To extract the value, just use .text:

price_html.text
'US$2,500.00'
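As an aside, if the page contained several elements matching these criteria, soup.find_all would return all of them in a list; on this particular page there happens to be just one, so the list below has a single entry.

# Find every <p> tag with class 'price' and extract the text of each
all_prices = soup.find_all("p", {"class": "price"})
print([p.text for p in all_prices])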

Now let’s see an example of reading in a whole table of data. For this, we’ll use pandas, the ubiquitous Python library for working with data. We will read data from one of the tables on ‘https://simple.wikipedia.org/wiki/FIFA_World_Cup’. The function we’ll use is read_html, which returns a list of dataframes of all the tables it finds when you pass it a URL. If you want to filter the list of tables, use the match= keyword argument with text that only appears in the table(s) you’re interested in.

The example below shows how this works; looking at the website, we can see that the table we’re interested in (of past World Cup results) has a ‘Fourth Place’ column while the other tables on the page do not. Therefore we run:

df_list = pd.read_html('https://simple.wikipedia.org/wiki/FIFA_World_Cup', match='Fourth Place')
# Retrieve first and only entry from list of dataframes
df = df_list[0]
df.head()
Year Host Winner Score Runners-up Third Place Score.1 Fourth Place
0 1930 Details Uruguay Uruguay 4 - 2 Argentina United States [note 1] Yugoslavia
1 1934 Details Italy Italy 2 - 1(a.e.t.) Czechoslovakia Germany 3 - 2 Austria
2 1938 Details France Italy 4 - 2 Hungary Brazil 4 - 2 Sweden
3 1950 Details Brazil Uruguay [note 2] Brazil Sweden [note 2] Spain
4 1954 Details Switzerland West Germany 3 - 2 Hungary Austria 3 - 1 Uruguay

This delivers the table neatly loaded into a pandas dataframe ready for further use.
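As a quick illustration of that further use (assuming the column names shown above), you could, for example, count how many times each country has won the tournament:

# Count the number of World Cup wins per country
df['Winner'].value_counts().head()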

R

The “rvest” package is a webscraping package in R that provides a tremendous amount of versatility, as well as being easy to use. For this specific task of web scraping pages on a website, we will be using read_html(), html_node(), html_table(), html_elements(), and html_text(). We will also make use of the SelectorGadget tool, as well as the browser’s developer tools (F12), to find the HTML paths.

html_node and html_text

library(rvest)

black_opals = read_html("https://blackopaldirect.com/product/black-opals/2-86-ct-black-opal-11-6x9-7x3-9mm/") # Website of interest

price = black_opals %>% 
  html_node("#product-103505 > div.summary.entry-summary > p.price > span > bdi") %>% # Find the exact element's node for the price
  html_text() # Convert it to text 

price # print the price

## [1] "US$2,500.00"

html_table

world_cup = read_html("https://simple.wikipedia.org/wiki/FIFA_World_Cup") # Past_World_Cup_results

cup_table = world_cup %>% html_elements(xpath = "/html/body/div[3]/div[3]/div[5]/div[1]/table[2]") %>%
  html_table() # Extract html elements

cup_table = cup_table[[1]] # Assign the table from the lists

cup_table %>% head(5) # First 5 obs

## # A tibble: 5 x 8
##   Year    Host    Winner  Score  `Runners-up` `Third Place` Score `Fourth Place`
##   <chr>   <chr>   <chr>   <chr>  <chr>        <chr>         <chr> <chr>         
## 1 1930 D~ Uruguay Uruguay 4 - 2  Argentina    United States [not~ Yugoslavia    
## 2 1934 D~ Italy   Italy   2 - 1~ Czechoslova~ Germany       3 - 2 Austria       
## 3 1938 D~ France  Italy   4 - 2  Hungary      Brazil        4 - 2 Sweden        
## 4 1950 D~ Brazil  Uruguay [note~ Brazil       Sweden        [not~ Spain         
## 5 1954 D~ Switze~ West G~ 3 - 2  Hungary      Austria       3 - 1 Uruguay

Another good tool is html_element(). See the rvest website for more information.