Scraping with rvest

Take a look at the movie Space Jam on IMDb. We want to collect its user reviews.

We will first use the popular package rvest for scraping. We will load the package as well as dplyr for some additional data manipulation functions.

library(rvest)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

How do we go about scraping the reviews? Manually inspecting the site, we find them at [https://www.imdb.com/title/tt0117705/reviews]. To read this page into R, we first save its URL in the new object url. Next, we parse the web page with the function read_html().

url <- "https://www.imdb.com/title/tt0117705/reviews"
spacejamrevhtml <- read_html(url)

spacejamrevhtml now holds the entire parsed document, including the HTML formatting (it is an object of class xml_document). If you do not know anything about HTML, W3Schools is a great resource for learning it.

Excursus: HTML and CSS

For web-scraping, we make use of the fact that web pages follow a standardised structure. We can tell R exactly which element of this structure we want to collect.

Web pages have a head and a body. HTML documents consist of a tree of elements and text. Inside the body of an HTML page, tags specify the elements to be displayed. Examples:

<h1>This is a heading.</h1>

<p>This is a paragraph.</p><p>And this is the next.</p>

In addition, HTML attributes provide further information about HTML elements. Attributes are included in the opening tag and come as name-value pairs (attributename="attributevalue"). We often need attributes to specify exactly what we want to scrape. One example of an attribute is class, which assigns a class to an HTML element. Multiple HTML elements can share the same class. We encounter classes often because they are frequently used to format similar content consistently.

Example: <p class="definitions">This is the text within the p of class definitions.</p>

For an overview of further attributes, check W3Schools.

How can we specify the tags and attributes of the information we are interested in? We use CSS selectors. CSS selectors are pattern-matching rules that were originally designed to determine which style and formatting rules apply to which elements.

Examples: p refers to all paragraphs on a website. .definitions refers to all elements with the class definitions on a website.
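
To see these two selectors at work without downloading anything, here is a minimal sketch on a small hand-written page. It assumes rvest 1.0.0 or later; the page content is made up from the examples above, and minimal_html() is only used here to wrap the fragment into a complete HTML document.

```r
library(rvest)

# A tiny hand-written page, so nothing has to be downloaded
page <- minimal_html('
  <h1>This is a heading.</h1>
  <p>This is a paragraph.</p>
  <p class="definitions">This is the text within the p of class definitions.</p>
')

# "p" matches every paragraph on the page
all_paragraphs <- html_text(html_elements(page, "p"))

# ".definitions" matches only elements that carry the class "definitions"
definitions <- html_text(html_elements(page, ".definitions"))

# "p.definitions" combines both conditions: paragraphs with that class
p_definitions <- html_text(html_elements(page, "p.definitions"))

length(all_paragraphs)
definitions
```

The first selector returns both paragraphs, while the class selector returns only the one marked with class="definitions".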

To learn more about CSS selectors, check W3Schools.

We can find the relevant element tags and attributes through our browser, by looking at the HTML source code of any website and by inspecting specific elements. There are also tools to make our life easier like SelectorGadget (check out its webpage).

The limits of rvest

What we want to extract from the scraped site are the user-written reviews, which sit in specific elements of the page.

The function html_elements() allows us to select such specific elements of the HTML code. The documentation of html_elements() reveals that we need CSS selectors (or XPath expressions) to specify what we want to select.

We inspect the website to find out how we can access that information.

You can use SelectorGadget to find the correct CSS selector. We assign the selected elements to the new object selected_elements and then inspect the results. To get rid of the HTML tags, we use the html_text() function.

selected_elements <- html_elements(spacejamrevhtml, ".review-container")
selected_elements
## {xml_nodeset (25)}
##  [1] <div class="review-container">\n        <div class="lister-item-content" ...
##  [2] <div class="review-container">\n        <div class="lister-item-content" ...
##  [3] <div class="review-container">\n        <div class="lister-item-content" ...
##  [4] <div class="review-container">\n        <div class="lister-item-content" ...
##  [5] <div class="review-container">\n        <div class="lister-item-content" ...
##  [6] <div class="review-container">\n        <div class="lister-item-content" ...
##  [7] <div class="review-container">\n        <div class="lister-item-content" ...
##  [8] <div class="review-container">\n        <div class="lister-item-content" ...
##  [9] <div class="review-container">\n        <div class="lister-item-content" ...
## [10] <div class="review-container">\n        <div class="lister-item-content" ...
## [11] <div class="review-container">\n        <div class="lister-item-content" ...
## [12] <div class="review-container">\n        <div class="lister-item-content" ...
## [13] <div class="review-container">\n        <div class="lister-item-content" ...
## [14] <div class="review-container">\n        <div class="lister-item-content" ...
## [15] <div class="review-container">\n        <div class="lister-item-content" ...
## [16] <div class="review-container">\n        <div class="lister-item-content" ...
## [17] <div class="review-container">\n        <div class="lister-item-content" ...
## [18] <div class="review-container">\n        <div class="lister-item-content" ...
## [19] <div class="review-container">\n        <div class="lister-item-content" ...
## [20] <div class="review-container">\n        <div class="lister-item-content" ...
## ...
reviews <- html_text(selected_elements)
head(reviews)
## [1] "\n        \n    \n            \n            8/10\n            \n    \n Authentic fun on everyone's behalf\n            \n                    StevePulaski7 February 2016\n            \n            \n                NOTE: This film was recommended to me by Ryan Clevenger for \"Steve Pulaski Sees It.\" Living in Illinois, Space Jam is a film that hits the tender spots of the last two generations; one generation that got to experience Michael Jordan's unfathomable legacy as arguably the greatest basketball player who ever lived, and the other, mine, that reflects on his legacy through highlights and documentaries to keep the memory of such an all-star alive. Jordan's legacy didn't stop at on-court talent, as he was one of the most marketed athletes of his time and helped popularize the NBA, let alone the Chicago Bulls, on a previously unforeseen international level.If we remove the nostalgia factor from Space Jam, which is a very difficult thing to do by the way, then the film serves as Jordan's versatility. After retiring from the NBA at a relatively young age to pursue a career in baseball, Jordan only became more of a fascinating person, in addition to someone with impeccable charisma. Space Jam exists as a response to Jordan's departure from the NBA to the MLB, as the Looney Tune gang of Bugs Bunny, Daffy Duck, Tweety Bird, Sylvester the Cat, Porky Pig, and Lola Bunny all call Jordan out of retirement when they challenge a group of intergalactic invaders from \"Moron Mountain\" to a basketball game in exchange for the planet.The Looney Tunes thing this will be an easy win, until the aliens from Moron Mountain, who are relatively puny in size and strength, find a way to steal the talents of star basketball players like Charles Barkley and Larry Johnson and become the \"Monstars\" of the court. 
Meanwhile, Jordan agrees to play for the Looney Tunes team, but it takes all of the five minutes of practice to show that the team is disproportionately talented towards Jordan. As a result, the team indulges in some aggressive training tactics to beat the Monstars and save the planet.As an amalgamation of live-action and animation, especially in an age where Pixar was coming on the scene and traditional animation was soon to be phased out, Space Jam is bright and vivid. The real-life characters of Michael Jordan, Wayne Knight, who has an amusing role, like he always does, Larry Bird, and even Bill Murray's interactions with the animated characters of Bugs Bunny and the like in a convincing, believable manner. The result is a beautifully colored and nicely executed mix of whimsy.Because both worlds of reality and animation are explored here, Space Jam has the luxury of being a film that can go beyond traditional boundaries of a sports film, and the Looney Tunes are no better characters to incite such zaniness. The animated bunch are quick-witted and ecstatic, and Jordan is clearly doing this for fun and excitement rather than a phoned-in project or another endorsement. Had Space Jam been more of a lackluster cash-in, sports fans and Jordan fans would've seen it from a mile away and dismissed the film immediately. However, because everyone involved recognizes what a zany project this is, they don't try to fight the lunacy, but instead, play along, and that provides us, the audience, with a wickedly entertaining stride into a lively sports film that is so fun you almost, almost miss the clichéd underdog element.Starring: Michael Jordan, Wayne Knight, Bill Murray, and Larry Bird. Directed by: Joe Pytka.\n                \n                    11 out of 11 found this helpful.\n                        \n                            Was this review helpful?  
Sign in to vote.\n                        \n                        Permalink\n                \n            \n        \n        \n    "
## [2] "\n        \n    \n            \n            7/10\n            \n    \n Silly but hard to dislike\n            \n                    preppy-32 April 2008\n            \n            \n                Movie about Michael Jordan and the Looney Tunes characters (Bugs Bunny, Daffy Duck, Roadrunner etc etc). An evil animated monster runs Moron Mountain--a planet with an amusement park. They need a new ride so he sends five helpers to Earth to kidnap the Looney Tunes characters. Then they can be used as a new act at the amusement park. Bugs and the others agree--IF they play them in a basketball game and win. Bugs and the others convince Michael Jordan to help them--but the aliens have evil plans up their sleeve.Michael Jordan meets the Looney Tunes. Sounds like a sure recipe for disaster. I was positive this film was going to be a bomb when it was released (with HEAVY publicity) in 1996. Yes--it's silly but if you love Looney Tunes (like I do) you'll probably love this. The characters are treated (more or less) respectfully and during the climatic basketball game the stands are full of every Warner Bros. cartoon character ever made. One small mouse character who talks nonstop I remembered from childhood and I literally broke up laughing when he appeared! The animation is just great, the merging of live action with cartoon figures works and the script is fun. There are groaners and stupid lines but, all in all, it was pretty amusing.The debits: Michael Jordan just can't act. I'm sure he was a wonderful basketball player but his acting was as wooden as a basketball court. Also various other sports figures pop up and prove they're worse actors than Jordan! Bill Murray (who can be good) is just lousy in a supporting role. And Wayne Knight is just SO annoying. Also I could have lived without seeing the Warner Bros. logo popping up everywhere. At one point it's on Daffy Duck's butt...and he kisses it! 
And there's a brief take-off on \"Pulp Fiction\" which isn't as funny now as it was in 1996. So, these things lessen the movie but don't destroy it. Worth catching if you're a Warner Bros. animation fan. Kids will love it. I give it a 7.\n                \n                    39 out of 47 found this helpful.\n                        \n                            Was this review helpful?  Sign in to vote.\n                        \n                        Permalink\n                \n            \n        \n        \n    "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
## [3] "\n        \n    \n            \n            7/10\n            \n    \n milestone of childhood\n            \n                    hedin_8826 January 2013\n            \n            \n                This movie has defined generations. Kids loved it because this dude who was a good sports player got to play with these loony tones which is what the movie was all about. Its not about MJ, its not about Bugs Bunny, its not about a deep storyline or memorable characters. Its about fun and cartoons,which barely ever make sense, if you remember. Its great fun for the young folks out there, with animation that really is worth watching. Its entertainment for the entire family, combining real life with cartoons as its never done before. I love it, ill always love it. Looks like i have to add a few lines for this review to be accepted and posted. I really have nothing else to add, go ahead and watch it without expecting nothing more than fun and entertainment and you will get far more then you expect.\n                \n                    86 out of 92 found this helpful.\n                        \n                            Was this review helpful?  
Sign in to vote.\n                        \n                        Permalink\n                \n            \n        \n        \n    "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
## [4] "\n        \n It's got game\n            \n                    Op_Prime25 September 2000\n            \n            \n                Space Jam is a very enjoyable movie featuring probably the most popular cartoon characters ever. The plot may seem rather weak to you and me, but let's not forget that this movie was made for children. Space Jam has all the humor of the classic shorts that made us love Bugs, Daffy and all the rest. Michael Jordan actually wasn't that bad in this movie, considering he is not a professional actor. Bill Murray was very hilarious. Thumbs up on this one.\"That's all folks!\"\n                \n                    60 out of 78 found this helpful.\n                        \n                            Was this review helpful?  Sign in to vote.\n                        \n                        Permalink\n                \n            \n        \n        \n    "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           
## [5] "\n        \n    \n            \n            9/10\n            \n    \n Slam Dunk For Bugs\n            \n                    EmperorNortonII4 May 2001\n            \n            \n                One of my earliest inspirations was Bugs Bunny and the Warner Bros. Looney Tunes.  This movie brings back all the favorites.  And not just Bugs Bunny, Daffy Duck and the other stars.  This movie features every character that ever appeared in a Warner Bros. cartoon.  Your eyes are certainly kept busy looking for each one!  Pairing the animation with Michael Jordan's athletic abilities may seem a little mismatched.  But the game just becomes that much more enjoyable.\n                \n                    51 out of 64 found this helpful.\n                        \n                            Was this review helpful?  Sign in to vote.\n                        \n                        Permalink\n                \n            \n        \n        \n    "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           
## [6] "\n        \n    \n            \n            7/10\n            \n    \n Hit em high (low)\n            \n                    kosmasp16 July 2021\n            \n            \n                What a soundtrack ... I just rewatched this to be ready for the new one ... and wow! I was quite blown away by all the songs on this, that I almost had forgotten about too. I've watched the new one too and while for me MJ is the one (you can feel differently about that, we can have different feelings on who is the GOAT, it's all good), when it comes to acting ... well LeBron is better for sure.That being said, in a movie like this, there is not much you have to do acting wise. What I did like is that MJ does not take himself too seriously. While he has serious moments in there - the sub story with his dad, that I am certain was very important to him - maybe not so important to viewers who are unaware, but it's a nice touch for sure - overall this is a funny affair.Add to that some really good actors who have some hits and misses with their jokes (I love Bill Murray, but certain scenes did not age well) - of course this completely ridiculous story (or shall I say ... \"looney\"?) is aimed at a younger audience. A now maybe grown up audience that went ahead and watched the new one with their kids. It's been 25 years since this was released.There is some lessons to be learned here ... but overall you should watch it with a grain of salt .. and a lot of suspension of disbelief.\n                \n                    8 out of 9 found this helpful.\n                        \n                            Was this review helpful?  Sign in to vote.\n                        \n                        Permalink\n                \n            \n        \n        \n    "
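
The raw strings above still contain long runs of spaces and line breaks, and rating, title and text are glued together. The following is a minimal sketch of one way to tidy this, shown on a hand-written stand-in for two review containers so that it runs offline. Only the class review-container comes from the page above; the inner class names (rating, title, text) are hypothetical and would have to be looked up with the browser inspector or SelectorGadget.

```r
library(rvest)

# Hand-written stand-in for two review containers.
# Only "review-container" is taken from the real page;
# the inner class names are hypothetical.
page <- minimal_html('
  <div class="review-container">
    <span class="rating">8/10</span>
    <a class="title"> Silly but hard to dislike </a>
    <div class="text">The animation is just great.</div>
  </div>
  <div class="review-container">
    <a class="title"> Slam Dunk For Bugs </a>
    <div class="text">This movie brings back all the favorites.</div>
  </div>
')

containers <- html_elements(page, ".review-container")

# Collapse runs of whitespace (spaces, tabs, line breaks) into single spaces
raw_text   <- html_text(containers)
clean_text <- trimws(gsub("\\s+", " ", raw_text))

# html_element() (singular) returns one result per container, so the
# columns stay aligned even when a review has no rating (NA in that row)
reviews_tbl <- data.frame(
  rating = html_text(html_element(containers, ".rating"), trim = TRUE),
  title  = html_text(html_element(containers, ".title"), trim = TRUE),
  text   = html_text(html_element(containers, ".text"), trim = TRUE)
)

reviews_tbl
```

Using html_element() on each container, rather than html_elements() on the whole page, matters for reviews such as the fourth one above, which has no rating: the missing value becomes NA instead of shifting all later ratings up by one row.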

There we are - with a character vector of the first 25 reviews. Collecting more is more complex: The user reviews page is not static but dynamic. To see more reviews, we need to click the "Load More" button. This is what we need RSelenium for!

Time to shine for RSelenium

We just collected user reviews of Space Jam. However, we only collected the first 25 reviews. To see more, we need to click the "Load More" button.

We do that using the package RSelenium. Selenium drives a web browser natively, the way a user would. There are different ways to run RSelenium. We will use the rsDriver() function, which relies on the WebDriver manager package wdman. Because operating systems, browser versions and Selenium versions differ, it can be fiddly to get this to run. When aiming for more replicable scraping and better platform stability, a recommended way is to use Docker. Docker is free software for isolating applications using container virtualisation. Running a Docker container standardises the build across operating systems. However, rsDriver() allows us to follow along more easily and see exactly what is happening.
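
For readers who prefer the Docker route, the following is a rough sketch rather than a tested setup. It assumes that a Selenium container was started beforehand and maps its port to 4445 on your machine; the image name, host, port and browser are all assumptions to adapt. The helper name connect_docker_selenium() is made up for illustration, while remoteDriver() and its open() method are part of RSelenium.

```r
# Sketch only: assumes a Selenium container is already running, e.g. started with
#   docker run -d -p 4445:4444 selenium/standalone-firefox
# Image name, host and port are assumptions; adjust them to your setup.
connect_docker_selenium <- function(host = "localhost", port = 4445L,
                                    browser_name = "firefox") {
  remote_browser <- RSelenium::remoteDriver(
    remoteServerAddr = host,
    port = port,
    browserName = browser_name
  )
  # Start a browser session inside the container
  remote_browser$open(silent = TRUE)
  remote_browser
}

# Usage (requires the container to be running):
# browser <- connect_docker_selenium()
# browser$navigate("https://www.imdb.com/title/tt0117705/reviews")
```

The object returned this way is used like the browser object from rsDriver(). The difference is that the browser runs inside the container, so no window opens on your screen.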

We now load the RSelenium package and set up our driver. First, we need to download binaries, start the driver and get the client object. We will open a Chrome browser and specify the driver version we are calling via chromever; it should match the major version of the Chrome browser installed on your machine. The port identifies your browser. You cannot open multiple instances on the same port. If you get the error message that a port is already in use, just change the port number.

library(RSelenium)
driver <- rsDriver(browser = "chrome", chromever = "108.0.5359.71", port = 4444L)
## checking Selenium Server versions:
## BEGIN: PREDOWNLOAD
## BEGIN: DOWNLOAD
## BEGIN: POSTDOWNLOAD
## checking chromedriver versions:
## BEGIN: PREDOWNLOAD
## BEGIN: DOWNLOAD
## BEGIN: POSTDOWNLOAD
## checking geckodriver versions:
## BEGIN: PREDOWNLOAD
## BEGIN: DOWNLOAD
## BEGIN: POSTDOWNLOAD
## checking phantomjs versions:
## BEGIN: PREDOWNLOAD
## BEGIN: DOWNLOAD
## BEGIN: POSTDOWNLOAD
## [1] "Connecting to remote server"
## $acceptInsecureCerts
## [1] FALSE
## 
## $browserName
## [1] "chrome"
## 
## $browserVersion
## [1] "108.0.5359.98"
## 
## $chrome
## $chrome$chromedriverVersion
## [1] "108.0.5359.71 (1e0e3868ee06e91ad636a874420e3ca3ae3756ac-refs/branch-heads/5359@{#1016})"
## 
## $chrome$userDataDir
## [1] "C:\\Users\\NSCHWI~1\\AppData\\Local\\Temp\\scoped_dir14696_1222579386"
## 
## 
## $`goog:chromeOptions`
## $`goog:chromeOptions`$debuggerAddress
## [1] "localhost:65225"
## 
## 
## $networkConnectionEnabled
## [1] FALSE
## 
## $pageLoadStrategy
## [1] "normal"
## 
## $platformName
## [1] "windows"
## 
## $proxy
## named list()
## 
## $setWindowRect
## [1] TRUE
## 
## $strictFileInteractability
## [1] FALSE
## 
## $timeouts
## $timeouts$implicit
## [1] 0
## 
## $timeouts$pageLoad
## [1] 300000
## 
## $timeouts$script
## [1] 30000
## 
## 
## $unhandledPromptBehavior
## [1] "dismiss and notify"
## 
## $`webauthn:extension:credBlob`
## [1] TRUE
## 
## $`webauthn:extension:largeBlob`
## [1] TRUE
## 
## $`webauthn:virtualAuthenticators`
## [1] TRUE
## 
## $webdriver.remote.sessionid
## [1] "47519a33de681fe898b6c6ec4c63c080"
## 
## $id
## [1] "47519a33de681fe898b6c6ec4c63c080"
browser <- driver$client

A new browser window opened up! You will now navigate it remotely. We will tell our browser to navigate to the page of user reviews for SpaceJam:

browser$navigate("https://www.imdb.com/title/tt0117705/reviews")

The browser opened the website! We now want to tell the browser to click on the "Load More" button. We do this by telling our browser to find the element and then to click on it. To find an element, we use the function findElement(). Again, we use CSS selectors to tell our browser that we are looking for the "Load More" button. Once we have found the button, we assign it to the object load_button. We then click on it and use Sys.sleep() to wait 2 seconds so that the new elements have some time to load. Depending on the speed of your internet connection and the complexity of the website you are working with, you might want to wait shorter or longer than 2 seconds.

load_button <- browser$findElement(using = "css selector", "#load-more-trigger")
load_button$clickElement()
Sys.sleep(2) 

It is a little bit like magic: The browser clicked the button! We need to repeat this now. We know that 290 reviews have been written and 25 are displayed per page. Thus, there are 12 pages with information (290 / 25 = 11.6, rounded up), which takes 11 clicks in total. Having clicked once already, we need to click the button 10 more times and can do so in a simple loop. Look at your browser and watch the magic happen!
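
The number of remaining clicks can also be computed rather than counted by hand. This small sketch uses only the two figures stated above; the variable names are made up for illustration.

```r
n_reviews <- 290  # total number of reviews, as stated above
per_page  <- 25   # reviews added per page load

n_pages      <- ceiling(n_reviews / per_page)  # 12 pages
clicks_total <- n_pages - 1                    # the first page needs no click
clicks_done  <- 1                              # the single click made above
clicks_to_go <- clicks_total - clicks_done

clicks_to_go
```

With these figures, clicks_to_go is 10, which is the upper bound used in the loop.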

for (i in 1:10){
  load_button <- browser$findElement(using = "css selector", "#load-more-trigger")
  load_button$clickElement()
  Sys.sleep(2)
}
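
The fixed number of 10 clicks only works while the review count stays at 290. A more flexible variant keeps clicking until the button can no longer be found or clicked. The following is a sketch under assumptions, not a tested recipe for IMDb: the helper name click_until_gone() and the max_clicks safety limit are made up, and how the page behaves once all reviews are loaded (button removed, hidden or disabled) is not verified here. findElements() (plural) and clickElement() are RSelenium methods.

```r
click_until_gone <- function(browser, selector, max_clicks = 50, pause = 2) {
  clicks <- 0
  while (clicks < max_clicks) {
    # findElements() (plural) returns an empty list if nothing matches,
    # whereas findElement() would raise an error
    buttons <- browser$findElements(using = "css selector", value = selector)
    if (length(buttons) == 0) break

    # A button that is still in the HTML but cannot be clicked any more
    # raises an error; treat that as the end of the reviews as well
    clicked <- tryCatch({
      buttons[[1]]$clickElement()
      TRUE
    }, error = function(e) FALSE)
    if (!clicked) break

    clicks <- clicks + 1
    Sys.sleep(pause)  # give the new reviews time to load
  }
  clicks  # number of successful clicks
}

# Usage with the browser object from above:
# n_clicks <- click_until_gone(browser, "#load-more-trigger")
```

The max_clicks argument guards against an endless loop if the button never disappears.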

Now, all reviews are visible! Let us get to the HTML. We cannot use rvest right now, because we need to get the HTML from the currently displayed version that RSelenium navigated us to. RSelenium also has a function to get the HTML code of a site, getPageSource(). We use this function and save it in reviewdata.

reviewdata <- browser$getPageSource()

Using getPageSource(), the source code is saved as a list. We extract the first element of the list and can then continue with methods from rvest.

reviewdata <- reviewdata[[1]]

spacejamrevhtml_all <- read_html(reviewdata)
reviews_all <- html_text(html_elements(spacejamrevhtml_all, ".review-container"))

There you go - we collected all reviews!

Dynamic web pages are the time to shine for RSelenium. Besides clicking on elements, another popular application is sending text to form fields:

# Open the page containing the search form
browser$navigate("https://tfl.gov.uk/modes/tube/")

# Find the input field and type the station name into it
station <- browser$findElement(using = "css selector", "#Input")
station$sendKeysToElement(list("Paddington Underground Station"))

# Find the submit button and click it
go_button <- browser$findElement(using = "css selector", "#go-submit")
go_button$clickElement()
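
Once you are done, it is worth closing the browser and stopping the Selenium server, because a server that keeps running is a common reason for the "port already in use" error mentioned earlier. Here is a minimal sketch, assuming driver is the object returned by rsDriver() above; the helper name close_selenium() is made up for illustration.

```r
close_selenium <- function(driver) {
  driver$client$close()  # close the browser window and end the session
  driver$server$stop()   # stop the Selenium server so that the port is freed
  invisible(TRUE)
}

# Usage with the driver object from above:
# close_selenium(driver)
```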

If you are interested in learning more about web-scraping, you can also take a look at my scripts and slides here: [https://github.com/nschwitter/webdata-warwick]