Web Scraping

Table of Contents

Scraping Instagram

This is a function to scrape the user data and the last 12 posts from that user. Set this up as a recurring cron job and it could scrape multiple users and store the data. R code is adapted from this post, which was written in python. We’re also using the rjsoncons package, which allows for JMESpath subsetting of JSON data. Since we’re adding headers to the GET request, we’ll use the httr package, rather than rvest.

library(httr)
library(jsonlite)

# function to parse data and put into list 
parse_user <- function(data) {
  # get data on the account
  account_data <- rjsoncons::jmespath(data, "data.user.{
        name: full_name,
        username: username,
        id: id,
        category: category_name,
        business_category: business_category_name,
        phone: business_phone_number,
        email: business_email,
        bio: biography,
        bio_links: bio_links[].url,
        homepage: external_url,        
        followers: edge_followed_by.count,
        follows: edge_follow.count,
        facebook_id: fbid,
        is_private: is_private,
        is_verified: is_verified,
        profile_image: profile_pic_url_hd,
        video_count: edge_felix_video_timeline.count,
        image_count: edge_owner_to_timeline_media.count,
        saved_count: edge_saved_media.count,
        related_profiles: edge_related_profiles.edges[].node.username
                                   
}"
  ) |> 
    fromJSON()
  
  # get post data
  post_data <- rjsoncons::jmespath(data, "data.user.edge_owner_to_timeline_media.edges[].node.{
         post_date: taken_at_timestamp, 
         typename: __typename,
         type: is_video,
         id: id,
         shortcode: shortcode,
         thumbnail: thumbnail_src,
         video_url: video_url,
         caption: edge_media_to_caption.edges[].node.text,
         video_viewcount: video_view_count,
         comment_count: edge_media_to_comment.count,
         like_count: edge_liked_by.count,
         location: location.name,
         music_artist: clips_music_attribution_info.artist_name,
         song_name: clips_music_attribution_info.song_name
}"
  ) |> 
    fromJSON() |> 
    mutate(post_date = as.POSIXct(post_date, origin = "1970-01-01"))
  
  list(account_data = account_data, post_data = post_data)
}

scrape_user <- function(username) {
  INSTAGRAM_APP_ID <- "936619743392459" # this is the public app id for instagram.com
  url <- paste0("https://i.instagram.com/api/v1/users/web_profile_info/?username=", username)
  response <- GET(
    url = url,
    add_headers(`x-ig-app-id` = INSTAGRAM_APP_ID)
  )
  data <- content(response, "text")
  parse_user(data)
}

out <- scrape_user('bmw')

Two lists are saved. The first is the account data.

out$account_data[c(1,2,3,5,8,11:12,17,18)] |> 
  cbind() |> 
  data.frame() |> 
  set_names(nm = "") |> 
  knitr::kable()
name BMW
username bmw
id 43109246
business_category Auto Dealers
bio The official #BMW account, home of Sheer Driving Pleasure.
Use #BMWrepost for the chance to get featured.
followers 37361042
follows 76
video_count 49
image_count 10594

The second is the post data that we’ve pulled.

out$post_data |> 
  head(3) |> 
  knitr::kable()
post_date typename type id shortcode caption video_viewcount comment_count like_count location music_artist song_name
2023-09-05 09:12:49 GraphVideo TRUE 3185240436572707198 Cw0P00KIFl- Taking over our hometown in our home colours 💙🤍 #NeueKlasse #IAA23 #THEVisionNeueKlasse #TheNeueNew 678386 813 164169 Max-Joseph-Platz ottocarclub Original audio
2023-09-07 01:00:45 GraphVideo TRUE 3186442829185025149 Cw4hN7iqgR9 Every decision is an act of creation, shaping the path of our journey. THE NEUE NEW. The BMW Vision Neue Klasse. #NeueKlasse #IAA23 #THEVisionNeueKlasse #TheNeueNew 325513 256 40401 NA bmw Original audio
2023-09-08 10:34:30 GraphVideo TRUE 3187455988330466509 Cw8HlVhoQzN BMW faces its toughest critic yet @liamcarps1 🫣 Shop Liam’s look via the link in bio. #NeueKlasse #IAA23 #THEVisionNeueKlasse #TheNeueNew #BMW 1540983 959 121633 NA bmw Original audio

Scraping hidden data

In this example I show how when there is an identifiable script, that we can scrape the javascript from a dynamically loaded site. But what to do when this hidden data is loaded via a generic script? When this occurs, we have to do some parsing. It can be a pain, but being able to do this correctly opens a whole new world of data availability.

As an example, we’re going to use the Home Depot state by state store location page. In this case, for the state of Alabama. This data is actually easily scrapable using rvest and css selection, but pulling the JSON is far more convenient, and often yields additional information.

After loading the page, we go to devtools and search for part of the first address. As can be seen, the script isn’t identified, but we see the window.__APOLLO_STATE__:. We’re going to use this little bit to find the JSON data.

img
The first thing we’re going to do is use the httr package to request the page and then parse it.

# Define the URL
url <- "https://www.homedepot.com/l/AL"

# Send a GET request to the URL
response <- GET(url)

# Check if the request was successful
(http_status(response)$category == "Success")
  
# Parse the HTML content from the response
html_content <- content(response, as = "text", encoding = "UTF-8")

Now that we have our content in text format, we have to find that script identifier. The start_pos variable is finding the location after the window.__APOLLO_STATE__= text is found. The end_pos_relative is looking for the location of the end of the JSON. In this case, becasue the data is nested, we’re looking for 3 end brackets. This will change depending on the structure of the data.

# Find the position of the substring "window.__APOLLO_STATE__="
start_pos <- str_locate(html_content, "window.__APOLLO_STATE__=")[2] + 1
    
# If the start position is found, proceed to find the end position and extract the JSON string
end_substring <- str_sub(html_content, start = start_pos)
end_pos_relative <- str_locate(end_substring, "\\}\\}\\}")

Now we adjust our end position relative to where our JSON data started, extract the data, and parse it with the jsonlite::fromJSON() function.

# Assuming we found the end_pos, adjust end_pos relative to start_pos 
end_pos <- end_pos_relative[2] + start_pos - 1
        
# Extract the JSON string using substring
json_str <- substr(html_content, start_pos, end_pos)
        
# Parse the JSON string to a list
json_data <- fromJSON(json_str, flatten = TRUE)

Because our data is nested, it’s going to parse as a list. Calling the data, we can see that our data can be found in the following location.

stores_df <- json_data$ROOT_QUERY$`storeDirectoryByState({"state":"AL"})`$storesInfo |> 
  select(storeName, address.street, address.city, address.state, 
         address.postalCode, address.county, url, phone)

head(stores_df)
##      storeName        address.street address.city address.state address.postalCode address.county
## 1     W Mobile  755 Schillinger Rd S       Mobile            AL              36695         Mobile
## 2        Foley    2899 S Mckenzie St        Foley            AL              36535        Baldwin
## 3 W Huntsville  4045 Lawson Ridge Dr      Madison            AL              35757        Madison
## 4 N Huntsville 1035 Memorial Pkwy Nw   Huntsville            AL              35801        Madison
## 5       Pelham      3191 Pelham Pkwy       Pelham            AL              35124         Shelby
## 6   Prattville  2710 Legends Parkway   Prattville            AL              36066         Elmore
##                                                                url         phone
## 1         https://www.homedepot.com/l/W-Mobile/AL/Mobile/36695/801 (251)634-0351
## 2             https://www.homedepot.com/l/Foley/AL/Foley/36535/802 (251)955-2401
## 3    https://www.homedepot.com/l/W-Huntsville/AL/Madison/35757/803 (256)837-6658
## 4 https://www.homedepot.com/l/N-Huntsville/AL/Huntsville/35801/804 (256)536-2216
## 5           https://www.homedepot.com/l/Pelham/AL/Pelham/35124/805 (205)685-1837
## 6   https://www.homedepot.com/l/Prattville/AL/Prattville/36066/806 (334)285-1693

Scraping YouTube

YouTube has an API, but it can be rate limited pretty quickly. The following will get performance metrics and all comments from each video, without running into any limit.

Set up

Within a conda environment created via the reticulate package, install yt-dlp (more detail).

reticulate::conda_install("[ENV Name]", "yt-dlp", pip = TRUE)

Next, write the python function that will call the yt-dlp library and pull the video metrics and comments for any url that’s passed to it.

def youtube_dl_info(URL):
  import json
  import yt_dlp

  # See help(yt_dlp.YoutubeDL) for a list of available options and public functions
  ydl_opts = {'getcomments':True}
  with yt_dlp.YoutubeDL(ydl_opts) as ydl:
      info = ydl.extract_info(URL, download=False)
      return(info)

After that, write quick function to clean the urls that we’re going to read in.

yt_clean <- function(file_loc) {
  tmp <- readr::read_csv(file_loc) |> 
    dplyr::filter(stringr::str_detect(links, "^https://www.youtube.com/watch?")) |> 
    dplyr::distinct(links) |> 
    dplyr::pull()
}

Because I prefer to work in R, I need to get the data out of the python environment and into R. From there, I’m going to clean the data. The data will be split into two different dataframes - the summary metrics and the comments.

# to get data out of the returned lists # 
youtube_extract <- function(datalist) {
  # get summary data about the video
  summary <- purrr::map_dfr(datalist, 
                            magrittr::extract, 
                            c("uploader", "upload_date", "title", "description", "categories",
                              "duration_string", "view_count", 
                              "like_count", "comment_count", "original_url")) |> 
    dplyr::rename(duration = duration_string) |> 
    dplyr::mutate(upload_date = lubridate::ymd(upload_date))
  # get the comments
  x <- purrr::map(datalist, "comments")
  # drop empty lists (no comments)
  x <- Filter(length, x)
  
  # loop over comments and put into dataframe
  comments <- NULL
  for (i in seq_len(nrow(summary))) {
    print(i)
    tmp <- x[[i]] %>% 
      data.table::rbindlist(fill = TRUE) |> 
      dplyr::as_tibble() |> 
      dplyr::mutate(timestamp = lubridate::as_datetime(as.numeric(timestamp, tz = "America/Los_Angeles"))) |>
      dplyr::select(c(text, timestamp, like_count, author)) |> 
      dplyr::mutate(video = summary$title[i],
                    timestamp = as.Date(timestamp)) |> 
      dplyr::rename(date = timestamp)
    comments <- rbind(comments, tmp)
    tmp <- NULL
  }
  out <- list(summary = summary, comments = comments)
}

Finally, we write a wrapper function that will run the above functions and save the output into .rds and .xlsx files.

Everything in this function is referenced by the here() function. The directory structure looks like the following:

├── R
│   └── helpers
├── py_functions
├── youtube_report
│   └── youtube_data
├── yt_caption_data
└── yt_link_data
# YouTube metrics and comments # 

# 1. start with LinkGopher to extract all urls
# 2. add links to a csv with the column name "links"
# 3. save csv as "[handle_name]_yt_links.csv" in the "yt_link_data" folder
# 4. run this function 

yt_scrape_fn <- function(handle) {
  # load packages
  pacman::p_load(tidyverse, janitor, here, glue)
  # load functions
  reticulate::source_python(here::here('youtube', 'py_functions', 'youtube_dl_info.py'))
  source(here('youtube','R','helpers', 'helpers.R'))
  source(here('youtube','R', "youtube_extract.R"))
  
  loc <- here('youtube','yt_link_data', glue("{handle}_yt_links.csv"))
  urls <- yt_clean(loc)
  
  # map over function and save to environment for safe processing
  tmpdata <<- urls |>
    purrr::map(youtube_dl_info)
  
  # clean and add to reactive values
  out <- youtube_extract(tmpdata)
  
  outlist <- list(summary = out$summary, comments = out$comments)
  saveRDS(outlist, here("youtube","youtube_report", "youtube_data", glue::glue("yt_{handle}_{Sys.Date()}.rds")))
  
  openxlsx::write.xlsx(outlist, here("youtube","youtube_report", "youtube_data", glue::glue("yt_{handle}_{Sys.Date()}.xlsx")))
}

We have to pass links into the function and to get them, we’re going to use the Link Gopher extension to pull all urls on the YouTube page of the account of interest. Then copy and paste the data and save as a csv.

Run it

As written, the script will save the data.

yt_scrape_fn("tesla")

And when the data is loaded, it looks like this

## # A tibble: 3 Ă— 10
##   uploader upload_date title  description categories duration view_count like_count comment_count original_url
##   <chr>    <date>      <chr>  <chr>       <chr>      <chr>         <int>      <int>         <int> <chr>       
## 1 Tesla    2023-01-21  Tesla… "Tesla is … Autos & V… 2:17         619217      21844          1586 https://www…
## 2 Tesla    2023-03-31  Lars … "Engineere… Autos & V… 2:45         319640      15934          1089 https://www…
## 3 Tesla    2023-09-25  Tesla… "Optimus c… Autos & V… 1:18          28591       5882           729 https://www…
## # A tibble: 6 Ă— 5
## # Groups:   video [3]
##   text                                                                      date       like_count author video
##   <chr>                                                                     <date>          <int> <chr>  <chr>
## 1 From Lars to Lars. Nice Explanation 👍                                    2023-09-24         NA @lars… Lars…
## 2 When you drill into the crash test data, it becomes apparent that the Mo… 2023-09-11         NA @cras… Lars…
## 3 And my car still can't drive itself halfway through the city in a light … 2023-09-25         NA @ne0n… Tesl…
## 4 The fact that it moves it whole body, hips included when moving the bloc… 2023-09-25         NA @Wify… Tesl…
## 5 I love to see them working very hard to replace every human on earth, ro… 2023-09-04         NA @Ahme… Tesl…
## 6 Please build artificial, noiseless, space- and energy-saving muscles int… 2023-08-25         NA @gk... Tesl…

Scraping TikTok

Code below can be used to scrape three different aspects of TikTok data - Profile data, Post data, Comments and comment data.

Profile data

The first thing we’ll scrape is the Profile data, which is found in a JavaScript script tag. The data is pretty comprehensive and is returned in a JSON format. A function to pull the data is below.

get_tt_user <- function(handle) {
  pacman::p_load(tidyverse)
  handle <- sub("^@", "", handle) # drop @ if it's there
  link <- glue::glue("https://www.tiktok.com/@{handle}?lang=en")

  # setting headers
  headers = c(
    `Accept` = '*/*',
    `Accept-Language` = 'en-US,en;q=0.9',
    `Connection` = 'keep-alive',
    `Content-type` = 'application/x-www-form-urlencoded',
    `DNT` = '1',
    `Referer` = "https://www.google.com/",
    `Sec-Fetch-Dest` = 'empty',
    `Sec-Fetch-Mode` = 'cors',
    `Sec-Fetch-Site` = 'same-origin',
    `User-Agent` = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36",
    `cache-control` = 'no-cache',
    `pragma` = 'no-cache',
    `sec-ch-ua` = '"Chromium";v="116", "Not)A;Brand";v="24", "Google Chrome";v="116"',
    `sec-ch-ua-mobile` = '?0',
    `sec-ch-ua-platform` = '"macOS"'
  )

  # GET request
  res <- httr::GET(link, httr::add_headers(.headers = headers))
  if (res$status_code != 200) {
    stop(paste("Error: Received status code", res$status_code))
  }

  tmpout <- httr::content(res) |>
    rvest::html_elements("script#__UNIVERSAL_DATA_FOR_REHYDRATION__") |>
    rvest::html_text() |>
    jsonlite::fromJSON()

  out <- tmpout$`__DEFAULT_SCOPE__`$`webapp.user-detail`$userInfo

  # build out a table of user data
  tibble::tibble(name = out$user$nickname,
         handle = out$user$uniqueId,
         signature = out$user$signature,
         verified = out$user$verified,
         biolink = out$user$bioLink$link,
         region = out$user$region,
         followers = out$stats$followerCount,
         following = out$stats$followingCount,
         hearts = out$stats$heartCount,
         videos = out$stats$videoCount,
         diggs = out$stats$diggCount,
         friends = out$stats$friendCount,
         date_pulled = Sys.Date())
}

Let’s run it.

adre <- get_tt_user("addisonre")
adre |> 
  knitr::kable()
name handle signature verified biolink region followers following hearts videos diggs friends date_pulled
Addison Rae addisonre TRUE addisonraefragrance.com US 88800000 58 1.5e+09 1781 0 18 2024-05-18

Post data

Similar to profile data, post-level data can be found in a hidden script. The function to pull post level data is below.

get_tt_posts <- function(link){
  # setting headers
  headers = c(
    `Accept` = '*/*',
    `Accept-Language` = 'en-US,en;q=0.9',
    `Connection` = 'keep-alive',
    `Content-type` = 'application/x-www-form-urlencoded',
    `DNT` = '1',
    `Referer` = "https://www.google.com/",
    `Sec-Fetch-Dest` = 'empty',
    `Sec-Fetch-Mode` = 'cors',
    `Sec-Fetch-Site` = 'same-origin',
    `User-Agent` = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36",
    `cache-control` = 'no-cache',
    `pragma` = 'no-cache',
    `sec-ch-ua` = '"Chromium";v="116", "Not)A;Brand";v="24", "Google Chrome";v="116"',
    `sec-ch-ua-mobile` = '?0',
    `sec-ch-ua-platform` = '"macOS"'
  )
  
  # GET request
  res <- httr::GET(link, httr::add_headers(.headers = headers))
  if (res$status_code != 200) {
    stop(paste("Error: Received status code", res$status_code))
  }
  # content and pull script
  tmpout <- httr::content(res) |>
    rvest::html_elements("script#__UNIVERSAL_DATA_FOR_REHYDRATION__") |>
    rvest::html_text() |>
    jsonlite::fromJSON()
  
  out <- tmpout$`__DEFAULT_SCOPE__`$`webapp.video-detail`$itemInfo$itemStruct
  # build out a table of user data
  tibble::tibble(id = out$id, 
                 post_text = out$desc,
                 created_time = out$createTime,
                 duration = out$video$duration,
                 author_id = out$author$id,
                 author_uniqueID = out$author$uniqueId,
                 author_nickname = out$author$nickname,
                 author_signature = out$author$signature,
                 author_verified = out$author$verified,
                 like_count = out$stats$diggCount,
                 share_count = out$stats$shareCount,
                 comment_count = out$stats$commentCount,
                 play_count = out$stats$playCount,
                 collect_count = out$stats$collectCount,
                 diversification_labels = paste(out$diversificationLabels, collapse = ", "),
                 suggested_words = paste(out$suggestedWords, collapse = ", ")) |>  
    dplyr::mutate(hashtags = stringr::str_extract_all(post_text, "#\\S+"),
                  at_mentions = stringr::str_extract_all(post_text, "@\\S+"),
                  created_time = lubridate::as_datetime(as.integer(created_time), tz = "America/Los_Angeles"))
} 

To run it, just provide a link

single_post <- get_tt_posts(link = "https://www.tiktok.com/@addisonre/video/7363016297233354027")
single_post |> 
  knitr::kable()
id post_text created_time duration author_id author_uniqueID author_nickname author_signature author_verified like_count share_count comment_count play_count collect_count diversification_labels suggested_words hashtags at_mentions
7363016297233354027 Got married at stagecoach. thank u @airbnb 🙏🏼 2024-04-28 13:24:14 30 6703550784929793030 addisonre Addison Rae TRUE 236500 2353 820 9800000 6180 Romance, Family & Relationship addison rae, addison rae boyfriend, Getting Married, I Got Married On The Show, Abby Roberts And Addison Rae, enzo and addison rae, enzo addison rae, Abby And Addison Rae, Addison Rae’s Dad, addison rae throw it @airbnb

Comment data

Finally, we’ll pull the comments from a post. Comments aren’t found in a hidden script, but are loaded dynamically through hidden APIs as a user scrolls. We’re going to call the hidden API directly, and iterate through all of the comments to pull them. We can get 50 comments per API call, and one of the API parameters is cursor, which identifies where in the number of comments we want to start to pull our next 50.

There are several functions that will all be used in our convenience wrapper function. We’re going to use the same post as above.

parse_comments <- purrr::possibly(function(l){
  # putting comment data into a cleaned tibble
  tibble::tibble(
    text = l$comments$text,
    comment_language = l$comments$comment_language,
    digg_count = l$comments$digg_count,
    reply_comment_count = l$comments$reply_comment_total,
    author_pin = l$comments$author_pin,
    create_time = l$comments$create_time,
    cid = l$comments$cid,
    nickname = l$comments$user$nickname,
    unique_id = l$comments$user$unique_id,
    post_id = l$comments$aweme_id
  ) |> 
    dplyr::mutate(create_time = lubridate::as_datetime(create_time, tz = "America/Los_Angeles"))
}, otherwise = NA_character_)

scrape_comments <- function(post_id, cursor) {
  # setting headers
  headers = c(
    `Accept` = '*/*',
    `Accept-Language` = 'en-US,en;q=0.9',
    `Connection` = 'keep-alive',
    `Content-type` = 'application/x-www-form-urlencoded',
    `DNT` = '1',
    `Referer` = "https://www.google.com/",
    `Sec-Fetch-Dest` = 'empty',
    `Sec-Fetch-Mode` = 'cors',
    `Sec-Fetch-Site` = 'same-origin',
    `User-Agent` = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36",
    `cache-control` = 'no-cache',
    `pragma` = 'no-cache',
    `sec-ch-ua` = '"Chromium";v="116", "Not)A;Brand";v="24", "Google Chrome";v="116"',
    `sec-ch-ua-mobile` = '?0',
    `sec-ch-ua-platform` = '"macOS"'
  )
  # build url and first pull 
  base_url <- "https://www.tiktok.com/api/comment/list/?"
  params <- list(
    aweme_id = post_id,
    count = 50,
    cursor = cursor
  )
  # GET request
  res <- httr::GET(base_url, query = params, httr::add_headers(.headers = headers))
  # parse data depending on cursor location
  if (cursor == 0) {
    tmp <- httr::content(res, "text") |> 
      jsonlite::fromJSON()
    
    out <- list(comment_data = parse_comments(tmp), total_comments = tmp$total)
  } else {
    tmp <- httr::content(res, "text") |> 
      jsonlite::fromJSON()
    
    out <- parse_comments(tmp)
  }
}

fetch_tt_comments <- function(post_id) {
  tmp <- scrape_comments(post_id, cursor = 0)
  # get cursor starting locations
  cursors <- (1:floor(tmp$total_comments/50)*50)+1
  # map across cursor starts
  out <- purrr::map2(post_id, cursors, scrape_comments)
  out <- out[!is.na(out)]
  out <- dplyr::bind_rows(out)
  # bind all pulls together
  rbind(tmp$comment_data, out)
}

We’ll run it and provide the first 5 comments.

fetch_tt_comments("7363016297233354027") |> 
  head(5) |> 
  knitr::kable()
text comment_language digg_count reply_comment_count author_pin create_time cid nickname unique_id post_id
W pl 9 0 FALSE 2024-04-29 09:51:59 7363332657865212702 kka kkalton 7363016297233354027
SHES THIS GENERATIONS BRITNEY en 8875 38 FALSE 2024-04-28 13:37:19 7363019734805349151 Fany🎀 kyngfany 7363016297233354027
ATEEE tl 8 0 FALSE 2024-04-28 18:56:47 7363102061292765982 fredylon fr3dylon 7363016297233354027
SHES SO ICONIC en 5729 13 FALSE 2024-04-28 13:32:38 7363018484672791328 meilospov ིྀ owlrae 7363016297233354027
ICONIC en 143 2 FALSE 2024-04-28 14:40:51 7363036103747683077 Luara luara 7363016297233354027
Taylor Grant
Taylor Grant
Group Director, Strategy & Analytics
Next
Previous

Related