In this post we are going to learn what web scraping is and how it is done using the ‘rvest’ package in R.
What is Web Scraping?
Web scraping is a technique for extracting data from websites. Unfortunately, not every website offers its data in an easily downloadable format such as CSV.
Many libraries and modules are available for web scraping. One of the most popular web scraping libraries for R is ‘rvest’.
Package Used: R provides the ‘rvest’ package for scraping data from web pages. You need to install this package from your R console.
Data Used: ‘http://decisionstats.org/’
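The installation log that follows would come from a call along these lines. This is a minimal sketch: the requireNamespace() guard and the CRAN mirror URL are assumptions added for illustration, not something the original post shows.

```r
# Install rvest only if it is not already available.
# The mirror URL is an assumption; any CRAN mirror should work.
if (!requireNamespace("rvest", quietly = TRUE)) {
  install.packages("rvest", repos = "https://cloud.r-project.org")
}

# Load the package for the current session
library(rvest)
```

The guard avoids reinstalling the package every time the script runs.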
Installing package into ‘/home/farheen/R/x86_64-pc-linux-gnu-library/3.2’
(as ‘lib’ is unspecified)
trying URL 'http://cran.rstudio.com/src/contrib/rvest_0.2.0.tar.gz'
Content type 'application/x-gzip' length 5407840 bytes (5.2 MB)
==================================================
downloaded 5.2 MB
* installing *source* package ‘rvest’ ...
** package ‘rvest’ successfully unpacked and MD5 sums checked
** R
** demo
** inst
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded
* DONE (rvest)
The downloaded source packages are in
    ‘/tmp/RtmpvPDdFk/downloaded_packages’
SCRAPE AN HTML TABLE:
STEP 1: Get the link.
> library("rvest")
> url1 = 'http://decisionstats.org/'
> url1
[1] "http://decisionstats.org/"
STEP 2: Now get the web page at that link.
> # read_html() replaces html(), which newer rvest releases no longer provide
> webpage = read_html(url1)
> webpage
We are a bunch of dreamers who dream big on Big Data and are focused on technology driven change
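To see what the parsed page object looks like without depending on a live site, here is a small sketch that parses a literal HTML string instead of a URL. The page content is made up for illustration.

```r
library(rvest)

# Hypothetical page given as a literal string, so no network access is needed
page <- read_html(
  "<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"
)

# The result is an xml2 document that the rvest selectors can query
class(page)

# Pull out the text of the <title> element
page_title <- html_text(html_node(page, "title"))
page_title
```

The same calls should apply to a page fetched from a URL, since read_html() returns the same kind of object in both cases.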
STEP 3: Now get the table from the web page.
> tbl = html_node(webpage,"table")
>
20 No Yes M.S.
21 yes No minor, B.S., M.S., PhD
22 Yes No M.S.
23 Yes No MSc
24 Yes No MSc
25 Yes No MSc
26 Yes No MSc
27 Yes No MSc
28 Yes No MSc
29 Yes No MSc
30 Yes No MSc
31 Yes No MSc/Certificate
32 Yes No MSc
33 Yes No MSc
34 Yes No MSc
35 Yes No PhD
36 Yes No MSc
37 Yes No MSc
38 Yes No MSc
39 Yes No MSc
40 Yes No MSc
41 Yes No MSc
42 Yes No MSc
43 Yes No MSc
44 Yes No MSc
45 Yes No MSc
Note: It is important to note that this may not be the table we wanted to extract, because we used ‘html_node()’ to extract the table.
It only takes the first match in the extracted HTML page. If the extracted web page contains multiple tables and you want to convert a specific table into a DataFrame, use ‘html_nodes()’ instead.
Suppose we are interested in the second table of the extracted web page.
> tbl = html_nodes(webpage,"table")[2]
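The difference between the two functions can be sketched on a small, made-up page with two tables. The inline HTML and the variable names here are assumptions for illustration only.

```r
library(rvest)

# Hypothetical inline page with two tables, so no network access is needed
page <- read_html('
<html><body>
  <table><tr><th>id</th></tr><tr><td>1</td></tr></table>
  <table><tr><th>name</th></tr><tr><td>alpha</td></tr><tr><td>beta</td></tr></table>
</body></html>')

first_tbl  <- html_node(page, "table")    # only the first <table> on the page
all_tbls   <- html_nodes(page, "table")   # every <table> on the page
second_tbl <- all_tbls[[2]]               # pick the second one by position

length(all_tbls)
second_df <- html_table(second_tbl)
second_df
```

Indexing with [[2]] yields a single node, so html_table() returns one table. Indexing with [2] keeps a node set of length one, and html_table() then returns a list containing that table.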
STEP 4: Convert table into DataFrame
> df = as.data.frame(html_table(tbl))
> df
X1 X2 X3 X4
1 Basics of Data Science Introduction to Python Introduction to R Introduction to Interface
2 Basics of Analytics Introduction to iPython Introduction to R Studio Introduction to SAS language
3 LTV Analysis Introduction to Pandas Introduction to R Data Step
4 LTV Analysis Quiz Introduction to iPython Notebook Introduction to Rattle Proc Print
5 RFM Analysis IDE- IDLE and Spyder Deducer Proc Means and Proc Freq
6 RFM Analysis Quiz Python 1 Quiz R Quiz 1 SAS Quiz 1
7 Basic Stats Data Input Data Input Proc Univariate
8 Introduction to Modeling Data Analysis Data Analysis Do loops
9 Data Summarization Data Summarization Proc sgplot
10 Introduction to Google Analytics Data Visualization Data Visualization Proc SQL
11 Blogging Data Output Data Output SAS Macro Language
12 Web Analytics Quiz Ipython 2 Quiz R Quiz 2 menu driven options
13 data.table ODS Output
14 ggplot
15 sports analytics SAS Quiz 2
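The conversion step can be tried on a small inline table. This is a minimal sketch, and the table contents and column names are invented for illustration.

```r
library(rvest)

# Hypothetical table given as literal HTML
page <- read_html('
<table>
  <tr><th>Course</th><th>Tool</th></tr>
  <tr><td>Data Input</td><td>R</td></tr>
  <tr><td>Data Output</td><td>SAS</td></tr>
</table>')

tbl    <- html_nodes(page, "table")     # node set holding one table
tables <- html_table(tbl)               # list of tables, one per node
df     <- as.data.frame(tables[[1]])    # plain data.frame for the first table

dim(df)
names(df)
```

Because the first row uses th cells, html_table() treats it as the header; a table without header cells would get generated column names instead.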
You are done with scraping html tables.
If your table contains links, follow one more step to get the actual link targets.
This example uses a different web page to extract the table.
Perform all 4 steps above (don’t forget to replace the link with the new one) and then perform the next step to get the links.
STEP 5: Now the DataFrame would be:
> head(df)
  School                           Program                            On-Campus     Online  Degrees
1 Stevens Institute of Technology  Business Intelligence & Analytics  Yes           Yes     M.S.
2 iSchool @ Syracuse University    Data Science                       Yes           Yes     Grad Certificate
3 North Carolina State University  Analytics                          Yes           No      M.S.
4 Northwestern University          Predictive Analytics               No            Yes     M.S.
5 Northwestern University          Analytics                          Yes           No      M.S.
6 Stanford University              Data Mining                        Some Courses  Yes     Grad Certificate
Note that we are using html_nodes() to get all the nodes of a given type. We will use the same method as described in Step 3 for extracting a particular node or multiple nodes.
> tr = html_nodes(tbl,'tr')
> tds = html_nodes(tr,"td a")
> links = html_attr(tds,'href')
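The same three calls can be tried on a small inline table. This is a sketch with made-up school names and example.com URLs, not the page used in the post.

```r
library(rvest)

# Hypothetical table whose cells contain links
page <- read_html('
<table>
  <tr><th>School</th></tr>
  <tr><td><a href="http://example.com/a">School A</a></td></tr>
  <tr><td><a href="http://example.com/b">School B</a></td></tr>
</table>')

tbl   <- html_node(page, "table")
tr    <- html_nodes(tbl, "tr")       # all rows, header row included
tds   <- html_nodes(tr, "td a")      # <a> elements inside data cells
links <- html_attr(tds, "href")      # the href attribute of each link
texts <- html_text(tds)              # the visible text of each link

links
texts
```

html_attr() reads the link target, while html_text() reads what the reader sees in the cell, so the two vectors line up element by element.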
Now see what is in ‘links’:
links .... .. [] Rensselaer Polytechnic Institute
Dublin City University
STEP 6: Now append this list to the DataFrame created earlier.
> df$URL = links
> df
> head(df)
  School                           Program                            On-Campus     Online  Degrees
1 Stevens Institute of Technology  Business Intelligence & Analytics  Yes           Yes     M.S.
2 iSchool @ Syracuse University    Data Science                       Yes           Yes     Grad Certificate
3 North Carolina State University  Analytics                          Yes           No      M.S.
4 Northwestern University          Predictive Analytics               No            Yes     M.S.
5 Northwestern University          Analytics                          Yes           No      M.S.
6 Stanford University              Data Mining                        Some Courses  Yes     Grad Certificate
  URL
1 http://www.stevens.edu/howe/academics/graduate/business-intelligence-analytics-bi-ms
2 http://ischool.syr.edu/future/cas/datascience.aspx
3 http://analytics.ncsu.edu/
4 http://www.scs.northwestern.edu/grad/mspa/
5 http://www.analytics.northwestern.edu/
6 http://scpd.stanford.edu/public/category/courseCategoryCertificateProfile.do?method=load&certificateId=1209602
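Assigning the links as a new column only works when there is exactly one link per row. If some rows have no link, one possible variant is to look up the first link row by row, so that missing links become NA. This is a sketch on invented data, and dropping the first row assumes the table has a single header row.

```r
library(rvest)

# Hypothetical table in which the second school has no link
page <- read_html('
<table>
  <tr><th>School</th><th>Online</th></tr>
  <tr><td><a href="http://example.com/a">School A</a></td><td>Yes</td></tr>
  <tr><td>School B</td><td>No</td></tr>
  <tr><td><a href="http://example.com/c">School C</a></td><td>Yes</td></tr>
</table>')

tbl <- html_node(page, "table")
df  <- as.data.frame(html_table(tbl))

rows       <- html_nodes(tbl, "tr")[-1]     # drop the header row (assumed to be first)
first_link <- html_node(rows, "td a")       # one result per row, missing if no <a>
df$URL     <- html_attr(first_link, "href") # NA where a row has no link

df
```

Applied to a set of rows, html_node() returns one result per row, which keeps the new column the same length as the DataFrame.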
Now we are done with extracting links. Explore more on the RStudio blog.
The credit goes to ‘Ryan Swanstorm’.