Scraping with RSelenium


Normally, when a website loads its data through an identifiable script, it’s fairly simple to pull that data with code. It’s just as simple when the data arrives via a GET request to an API, which you can spot in the browser’s developer tools. But I recently ran into a site that loaded its data dynamically and wouldn’t authorize my curl request.

I guess it’s time to learn the basics of RSelenium.

RSelenium needs Java

If the computer doesn’t have Java, it’s easy to install with Homebrew. From the terminal, check whether Java is installed:

java -version

If Java isn’t found, use Homebrew to install it. After the install, Homebrew prints the command that sets the path to Java.

brew install openjdk
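# set the path: symlink the JDK so the system java wrapper can find it
# (use the exact command Homebrew prints; this path is for Apple silicon)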
sudo ln -sfn /opt/homebrew/opt/openjdk/libexec/openjdk.jdk /Library/Java/JavaVirtualMachines/openjdk.jdk
# check it can be found 
java -version

Once that’s done, install RSelenium from CRAN, or from GitHub for the development version.

# CRAN release
install.packages("RSelenium")
# or the development version from GitHub
# install.packages("remotes")
remotes::install_github("ropensci/RSelenium")

RSelenium has a helpful vignette that walks through the basics of use. The vignette recommends running a Selenium Server from a Docker image. For whatever reason, I’ve been unable to get that working on an Apple silicon Mac. But the built-in rsDriver() function does work.

Why RSelenium

I was trying to scrape Kawasaki dealer locations in the US. The site forces you to search by zip code and then dynamically returns a list of 10 dealers. You can’t expand the search criteria to return more than 10 listings. I can’t find any script that loads the data, nor can I find an API that’s being queried. So I’ll use RSelenium to open the site, search for a specific zip code, and then scrape the results. Eventually, I’ll do this for every zip code in a state.
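One detail worth settling before looping over a whole state: zip codes are best kept as five-character strings, since numeric storage drops leading zeros. A minimal sketch, assuming the zip codes arrive as numbers (the values below are hypothetical examples, not the list I used):

```r
# Hypothetical numeric zip codes, e.g. as read from a spreadsheet.
# 2108 stands in for a zip code that has lost its leading zero.
raw_zips <- c(98105, 98125, 2108)

# Zero-pad to five characters so the search box receives a valid zip code
zip_codes <- sprintf("%05d", as.integer(raw_zips))
zip_codes
```

The result is a character vector, which can be passed straight to sendKeysToElement() later on.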

Using RSelenium

The first thing to do is to instantiate a new remote driver. This will open up a new Firefox browser.

library(RSelenium)

# start a session (this opens a Firefox window)
rD <- rsDriver(browser = "firefox", port = 4545L, verbose = FALSE)
remDr <- rD$client

After starting, you have to direct your driver to a specific site. In this case, I want the Kawasaki dealer locator.

remDr$navigate("https://www.kawasaki.com/en-us/research-tools/dealer-locator")

Figure 1: Dealer locator with results

The page has a text field where you enter the zip code. To the right of the text field is a search button. If my browser is going to enter a zip code and search, I need to tell it where to put the zip and which button to push. That means specifying the CSS selector for each element and telling the browser what to do with it.

The CSS selectors for the input and the button are easy to find using developer tools. As you run each of these commands, you can watch the open Firefox browser and see that it’s directed to the page, the zip is filled in, and the button is pushed, with dealer information returned.

# Enter the zip code
zip_input <- remDr$findElement(using = "css", "#dealerSearchInput")
zip_input$clearElement() # Clear any pre-filled text
zip_input$sendKeysToElement(list(as.character(98105)))
  
# Click the search button
search_button <- remDr$findElement(using = "css", "#dealerSearchBtn")
search_button$clickElement()
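
The click returns before the results have finished loading, so anything that reads the results immediately may find nothing. A minimal sketch of one way to handle that, polling until a condition is met instead of pausing for a fixed time. The wait_until() helper is hypothetical and not part of RSelenium, and the commented usage assumes the results end up as list items under #dealerSearchResults.

```r
# Poll `condition` until it returns TRUE or `timeout` seconds have passed.
# `wait_until` is a hypothetical helper, not an RSelenium function.
wait_until <- function(condition, timeout = 10, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    # Treat an error inside the condition as "not ready yet"
    ok <- tryCatch(isTRUE(condition()), error = function(e) FALSE)
    if (ok) return(TRUE)
    if (Sys.time() >= deadline) return(FALSE)
    Sys.sleep(interval)
  }
}

# With a live session, the condition could check for result items:
# wait_until(function() {
#   length(remDr$findElements(using = "css", "#dealerSearchResults > li")) > 0
# })
```

It returns TRUE as soon as the condition holds and FALSE on timeout, so the caller can decide whether an empty page really means no dealers.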

Now, to get the dealer data, I also need the CSS selector for each piece.


Figure 2: View from Developer Tools

If I look at the developer tools, I can see that the dealer data is found in a list group with the id #dealerSearchResults.

dealers <- remDr$findElements(using = "css", "#dealerSearchResults > li") # Get all dealer result items

This returns a list of 10 items in my R environment, but to extract the data from each item, I need the CSS selector for where each piece is found. If I highlight the row of data I’m interested in, e.g., “LAKE CITY KAWASAKI,” and right-click on it, I can copy the selector, which gives me the CSS location of that data: #dealerSearchResults > li:nth-child(1) > a > h2. I’ve already pulled the dealer items using #dealerSearchResults > li, so I only need the part after it, a > h2. I’ll do the same thing with the address and phone data.

The code below pulls the details for the first of the 10 dealers returned.

# Extract dealer name
name <- dealers[[1]]$findChildElement(using = "css", "a > h2")$getElementText() |> 
  unlist()
      
# Extract address line 1
address_line1 <- dealers[[1]]$findChildElement(using = "css", "a > p:nth-child(2)")$getElementText() |> 
  unlist()
      
# Extract address line 2 (if available)
address_line2 <- dealers[[1]]$findChildElement(using = "css", "a > p:nth-child(3)")$getElementText() |> 
  unlist()
      
# Extract phone number
phone <- dealers[[1]]$findChildElement(using = "css", "a > p:nth-child(4)")$getElementText() |> 
  unlist()

library(tibble)

tibble(name = name, address_line1 = address_line1, address_line2 = address_line2, phone = phone)
## # A tibble: 1 × 4
##   name               address_line1          address_line2     phone        
##   <chr>              <chr>                  <chr>             <chr>        
## 1 LAKE CITY KAWASAKI 12048 LAKE CITY WAY NE SEATTLE, WA 98125 (206)364-1372
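
The four extractions above repeat the same pattern, and any of them will throw an error if a dealer is missing that element. A minimal sketch of a small wrapper that returns NA instead. The name child_text() is hypothetical; it only assumes the element has the findChildElement() and getElementText() methods already used above.

```r
# Return the text of the first child matching `css`, or NA if it is missing.
# `child_text` is a hypothetical helper; `element` is expected to be an
# RSelenium webElement, such as one entry of `dealers`.
child_text <- function(element, css) {
  tryCatch(
    unlist(element$findChildElement(using = "css", css)$getElementText()),
    error = function(e) NA_character_
  )
}

# With a live session:
# child_text(dealers[[1]], "a > h2")
# child_text(dealers[[1]], "a > p:nth-child(3)")
```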

Putting this all into a function

Now that I have everything working, I can put this into a function to run over multiple zip codes.

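library(purrr)  # map_dfr()
library(tibble) # tibble()
library(glue)   # glue()

# Zip codes to search; kept as strings so leading zeros survive
zip_codes <- c("98105")

dealer_data <- map_dfr(zip_codes, function(zip) {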
  # Open the dealer locator page
  remDr$navigate("https://www.kawasaki.com/en-us/research-tools/dealer-locator")
  Sys.sleep(2) # Wait for the page to load
  
  # Enter the zip code
  zip_input <- remDr$findElement(using = "css", "#dealerSearchInput")
  zip_input$clearElement() # Clear any pre-filled text
  zip_input$sendKeysToElement(list(as.character(zip)))
  
  # Click the search button
  search_button <- remDr$findElement(using = "css", "#dealerSearchBtn")
  search_button$clickElement()
  
  Sys.sleep(3) # Wait for results to load
  
  # Extract dealer information
  dealers <- remDr$findElements(using = "css", "#dealerSearchResults > li") # Get all dealer result items
  
  if (length(dealers) == 0) {
    # No dealers found, return a single row tibble with NA values.
    # Column names match the dealer tibble below so the rows bind cleanly.
    message(glue("No dealerships found for zip code: {zip}"))
    return(tibble(
      query_zip = zip,
      name = NA_character_,
      address_line1 = NA_character_,
      address_line2 = NA_character_,
      phone = NA_character_
    ))
  } else {
    # Map over each dealer element to extract data
    map_dfr(dealers, function(dealer) {
      # Extract dealer details
      name <- dealer$findChildElement(using = "css", "a > h2")$getElementText() |> 
        unlist()
      address_line1 <- dealer$findChildElement(using = "css", "a > p:nth-child(2)")$getElementText() |> 
        unlist()
      address_line2 <- tryCatch(
        dealer$findChildElement(using = "css", "a > p:nth-child(3)")$getElementText() |> unlist(),
        error = function(e) NA_character_
      )
      phone <- tryCatch(
        dealer$findChildElement(using = "css", "a > p:nth-child(4)")$getElementText() |> unlist(),
        error = function(e) NA_character_
      )
      
      # Return a tibble for each dealer
      tibble(
        query_zip = zip,
        name = name,
        address_line1 = address_line1,
        address_line2 = address_line2,
        phone = phone
      )
    })
  }
})

If I run this across the 98105 zip code, this is what I get.

## # A tibble: 10 × 4
##    name                             address_line1            address_line2 phone
##    <chr>                            <chr>                    <chr>         <chr>
##  1 LAKE CITY KAWASAKI               12048 LAKE CITY WAY NE   SEATTLE, WA … (206…
##  2 BELLEVUE KAWASAKI                14004 NE 20TH ST         BELLEVUE, WA… (425…
##  3 LYNNWOOD MOTOPLEX KAWASAKI       17900 HIGHWAY 99         LYNNWOOD, WA… (425…
##  4 ADVENTURE MOTORSPORTS KAWASAKI   320 N LEWIS ST           MONROE, WA 9… (360…
##  5 SOUTH BOUND KAWASAKI             1200 CHARLESTON BEACH R… BREMERTON, W… (360…
##  6 NASH POWERSPORTS AUBURN KAWASAKI 1611 W VALLEY HWY S      AUBURN, WA 9… (253…
##  7 CYCLE BARN KAWASAKI              15202 SMOKEY POINT BLVD  MARYSVILLE, … (360…
##  8 ENUMCLAW KAWASAKI                408 ROOSEVELT AVE        ENUMCLAW, WA… (360…
##  9 SKAGIT VALLEY KAWASAKI           107 SUNDQUIST DR         MOUNT VERNON… (360…
## 10 PAULSON'S KAWASAKI               4402 6TH AVE SE          LACEY, WA 98… (360…
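
Since every search returns the 10 nearest dealers, running this across all the zip codes in a state will return the same dealers many times over. A minimal sketch of dropping the repeats with base R, assuming the combined results have the columns built above. The rows here are hypothetical stand-ins for two neighboring searches, not real output.

```r
# Hypothetical combined results from two neighboring zip codes
dealer_data <- data.frame(
  query_zip     = c("98105", "98105", "98125"),
  name          = c("LAKE CITY KAWASAKI", "BELLEVUE KAWASAKI", "LAKE CITY KAWASAKI"),
  address_line1 = c("12048 LAKE CITY WAY NE", "14004 NE 20TH ST", "12048 LAKE CITY WAY NE"),
  stringsAsFactors = FALSE
)

# Keep the first row for each distinct dealer, ignoring which zip found it
dealer_cols <- c("name", "address_line1")
unique_dealers <- dealer_data[!duplicated(dealer_data[dealer_cols]), ]
nrow(unique_dealers)
```

dplyr::distinct() with the same columns would do the same job if the tidyverse is already loaded.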

Close the session

Don’t forget to close the RSelenium session when you’re done.

remDr$close()
rD$server$stop()
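
If the scrape errors partway through, those two lines never run and the browser and server stay open. A minimal sketch of one way to make the cleanup automatic with on.exit(). The with_session() wrapper is hypothetical, and it assumes the driver object has the same $client and $server shape that rsDriver() returned above.

```r
# `with_session` is a hypothetical helper. `start_driver` should return an
# object shaped like the result of rsDriver(): a list with $client and $server.
with_session <- function(start_driver, scrape) {
  rD <- start_driver()
  # Runs when the function exits, whether scrape() succeeds or errors
  on.exit({
    try(rD$client$close(), silent = TRUE)
    try(rD$server$stop(), silent = TRUE)
  }, add = TRUE)
  scrape(rD$client)
}

# With RSelenium loaded:
# with_session(
#   function() rsDriver(browser = "firefox", port = 4545L, verbose = FALSE),
#   function(remDr) remDr$navigate("https://www.kawasaki.com/en-us/research-tools/dealer-locator")
# )
```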
Taylor Grant
Group Director, Strategy & Analytics