OVERVIEW:

In this post we will learn what web scraping is and how to do it with the ‘rvest’ package in R.

INTRODUCTION

What is Web Scraping?

Web scraping is a technique for extracting data from websites. Unfortunately, not every website offers its data in an easily downloadable format such as CSV.

Many libraries and modules are available for web scraping. One of the most popular web scraping libraries for R is ‘rvest’.

Package Used: ‘rvest’, an R package for scraping data from web pages. Install it from your R console before you start.

Data Used: http://decisionstats.org/

> install.packages("rvest")
Installing package into ‘/home/farheen/R/x86_64-pc-linux-gnu-library/3.2’
(as ‘lib’ is unspecified)
trying URL 'http://cran.rstudio.com/src/contrib/rvest_0.2.0.tar.gz'
Content type 'application/x-gzip' length 5407840 bytes (5.2 MB)
==================================================
downloaded 5.2 MB

* installing *source* package ‘rvest’ ...
** package ‘rvest’ successfully unpacked and MD5 sums checked
** R
** demo
** inst
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded
* DONE (rvest)

The downloaded source packages are in
	‘/tmp/RtmpvPDdFk/downloaded_packages’
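
If you rerun your script often, reinstalling the package every time is unnecessary. A minimal sketch, assuming a CRAN mirror is reachable (the mirror URL below is just one choice, not the one shown in the log above), that installs ‘rvest’ only when it is missing:

```r
# Install rvest only if it is not already available.
# The repos value is an assumption; any CRAN mirror should work.
if (!requireNamespace("rvest", quietly = TRUE)) {
  install.packages("rvest", repos = "https://cloud.r-project.org")
}

library(rvest)
```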

SCRAPE AN HTML TABLE:


STEP 1: Get the link.

> library("rvest")
> url1 = 'http://decisionstats.org/'
> url1
[1] "http://decisionstats.org/"

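STEP 2: Now fetch the web page at that link. The html() call below matches the rvest 0.2.0 release installed above; since rvest 0.3.0 it has been deprecated in favour of read_html().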

> webpage = html(url1)
> webpage

Management

 We are a bunch of dreamers who dream big on Big Data and are focused on technology driven change
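
Fetching a page can fail (bad URL, no connection, missing file), so it may help to wrap the call. The sketch below assumes rvest 0.3.0 or later, where read_html() is available; fetch_page() is a hypothetical helper name, not part of rvest. It returns NULL instead of stopping the script when the page cannot be read.

```r
library(rvest)

# Hypothetical helper: read a page from a URL, file path, or literal HTML
# string, and return NULL (with a message) if reading fails.
fetch_page <- function(src) {
  tryCatch(
    read_html(src),
    error = function(e) {
      message("Could not read page: ", conditionMessage(e))
      NULL
    }
  )
}

# Literal HTML is used here so the example runs without network access.
page <- fetch_page("<html><head><title>Demo</title></head><body></body></html>")
page_title <- html_text(html_node(page, "title"))

# A source that does not exist yields NULL rather than an error.
missing_page <- fetch_page("no_such_file.html")
```

In a real run you would pass url1 to the helper and check for NULL before moving on to Step 3.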

STEP 3: Now get the table from the web page.

> tbl = html_node(webpage,"table")
> 
20            No    Yes                   M.S.
21           yes     No minor, B.S., M.S., PhD
22           Yes     No                   M.S.
23           Yes     No                    MSc
24           Yes     No                    MSc
25           Yes     No                    MSc
26           Yes     No                    MSc
27           Yes     No                    MSc
28           Yes     No                    MSc
29           Yes     No                    MSc
30           Yes     No                    MSc
31           Yes     No        MSc/Certificate
32           Yes     No                    MSc
33           Yes     No                    MSc
34           Yes     No                    MSc
35           Yes     No                    PhD
36           Yes     No                    MSc
37           Yes     No                    MSc
38           Yes     No                    MSc
39           Yes     No                    MSc
40           Yes     No                    MSc
41           Yes     No                    MSc
42           Yes     No                    MSc
43           Yes     No                    MSc
44           Yes     No                    MSc
45           Yes     No                    MSc

Note: It is important to note that this is not the table we wanted, because we used ‘html_node()’ to extract the table.

html_node() returns only the first match in the extracted HTML page. If the page contains multiple tables and you want to convert a specific one into a data frame, use ‘html_nodes()’, which returns every match, and pick the table you need by index.

Suppose we are interested in the second table of the extracted web page:


> tbl = html_nodes(webpage,"table") [[2]]
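
To see the difference between the two functions without touching a live site, here is a small sketch on a made-up page with two tables. The HTML is invented for illustration, and read_html() assumes rvest 0.3.0 or later.

```r
library(rvest)

# Hypothetical two-table page, built inline so no network access is needed.
page <- read_html('
<html><body>
  <table><tr><th>a</th></tr><tr><td>1</td></tr></table>
  <table><tr><th>b</th></tr><tr><td>2</td></tr></table>
</body></html>')

first_tbl  <- html_node(page, "table")   # first match only
all_tbls   <- html_nodes(page, "table")  # every match
second_tbl <- all_tbls[[2]]              # pick one table by position

n_tables  <- length(all_tbls)
second_df <- as.data.frame(html_table(second_tbl))
```

Checking length(all_tbls) first is a cheap way to confirm that the index you are about to use actually exists on the page.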

STEP 4: Convert the table into a data frame.

> df = as.data.frame(html_table(tbl))
> df
                                 X1                               X2                       X3                           X4
1            Basics of Data Science           Introduction to Python        Introduction to R    Introduction to Interface
2               Basics of Analytics          Introduction to iPython Introduction to R Studio Introduction to SAS language
3                      LTV Analysis           Introduction to Pandas        Introduction to R                    Data Step
4                 LTV Analysis Quiz Introduction to iPython Notebook   Introduction to Rattle                   Proc Print
5                      RFM Analysis             IDE- IDLE and Spyder                  Deducer     Proc Means and Proc Freq
6                 RFM Analysis Quiz                    Python 1 Quiz                 R Quiz 1                   SAS Quiz 1
7                       Basic Stats                       Data Input               Data Input              Proc Univariate
8          Introduction to Modeling                    Data Analysis            Data Analysis                     Do loops
9                                                 Data Summarization       Data Summarization                  Proc sgplot
10 Introduction to Google Analytics               Data Visualization       Data Visualization                     Proc SQL
11                         Blogging                      Data Output              Data Output           SAS Macro Language
12               Web Analytics Quiz                   Ipython 2 Quiz                 R Quiz 2          menu driven options
13                                                                                 data.table                   ODS Output
14                                                                                     ggplot                             
15                                                                           sports analytics                   SAS Quiz 2
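
The generic column names X1 to X4 appear when html_table() finds no header cells in the table. If the first row of your table does hold the column titles, the header argument of html_table() can promote it. A small sketch on invented HTML (read_html() assumes rvest 0.3.0 or later):

```r
library(rvest)

# Hypothetical table whose first row holds titles but uses <td>, not <th>.
page <- read_html('
<table>
  <tr><td>Course</td><td>Tool</td></tr>
  <tr><td>Data Input</td><td>R</td></tr>
  <tr><td>Data Output</td><td>SAS</td></tr>
</table>')

tbl <- html_node(page, "table")

# Default: no <th> cells are found, so generic names are assigned.
df_default <- as.data.frame(html_table(tbl))

# header = TRUE treats the first row as the column names.
df_header <- as.data.frame(html_table(tbl, header = TRUE))
```

Whether this applies to the table above depends on what its first row really contains, so inspect the data frame before deciding.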

You are done with scraping HTML tables.

If the table cells contain links, a few more steps are needed to get the actual URLs.

Here we use the web page “http://101.datascience.community/2012/04/09/colleges-with-data-science-degrees/” to extract the table.

Perform all 4 steps above (don’t forget to replace the link with the new one) and then perform the next steps to get the links.


STEP 5: The data frame now looks like this.

> head(df)
                           School                           Program    On-Campus Online          Degrees
1 Stevens Institute of Technology Business Intelligence & Analytics          Yes    Yes             M.S.
2   iSchool @ Syracuse University                      Data Science          Yes    Yes Grad Certificate
3 North Carolina State University                         Analytics          Yes     No             M.S.
4         Northwestern University              Predictive Analytics           No    Yes             M.S.
5         Northwestern University                         Analytics          Yes     No             M.S.
6             Stanford University                       Data Mining Some Courses    Yes Grad Certificate

Note that we use html_nodes() to get all nodes of a given type. As described in Step 3, the same function lets us extract either one particular node or several nodes.

> tr = html_nodes(tbl,'tr')
> tds = html_nodes(tr,"td a")
> links = html_attr(tds,'href')

Now see what is in ‘links’:

> links
....
..
[[89]]

Rensselaer Polytechnic Institute

Business Analytics

Yes

No

M.S.

[[90]]
Dublin City University

Computing (Data Analytics)

Yes

No

MSc

[[91]]

STEP 6: Now append these links to the data frame created earlier.

> df$URL = links
> df
> head(df)
                           School                           Program    On-Campus Online          Degrees
1 Stevens Institute of Technology Business Intelligence & Analytics          Yes    Yes             M.S.
2   iSchool @ Syracuse University                      Data Science          Yes    Yes Grad Certificate
3 North Carolina State University                         Analytics          Yes     No             M.S.
4         Northwestern University              Predictive Analytics           No    Yes             M.S.
5         Northwestern University                         Analytics          Yes     No             M.S.
6             Stanford University                       Data Mining Some Courses    Yes Grad Certificate
                                                                                                             URL
1                           http://www.stevens.edu/howe/academics/graduate/business-intelligence-analytics-bi-ms
2                                                             http://ischool.syr.edu/future/cas/datascience.aspx
3                                                                                     http://analytics.ncsu.edu/
4                                                                     http://www.scs.northwestern.edu/grad/mspa/
5                                                                         http://www.analytics.northwestern.edu/
6 http://scpd.stanford.edu/public/category/courseCategoryCertificateProfile.do?method=load&certificateId=1209602
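
Assigning the links as a new column only works when there is exactly one link per data row. If some rows have no link, one way to keep things aligned is to look for the link row by row. The sketch below uses invented HTML and example.com URLs, and assumes rvest 0.3.0 or later, where html_node() applied to a set of rows returns one result per row and html_attr() gives NA for rows without a match.

```r
library(rvest)

# Hypothetical table: the second data row has no link.
page <- read_html('
<table>
  <tr><th>School</th><th>Online</th></tr>
  <tr><td><a href="http://example.com/a">School A</a></td><td>Yes</td></tr>
  <tr><td>School B</td><td>No</td></tr>
  <tr><td><a href="http://example.com/c">School C</a></td><td>Yes</td></tr>
</table>')

tbl <- html_node(page, "table")
df  <- as.data.frame(html_table(tbl))

# Drop the header row, then take the first link inside each remaining row.
rows  <- html_nodes(tbl, "tr")[-1]
links <- html_attr(html_node(rows, "td a"), "href")  # NA where no link exists

# Guard against a length mismatch before adding the column.
stopifnot(length(links) == nrow(df))
df$URL <- links
```

Rows without a link end up with NA in the URL column instead of shifting every later link up by one.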

Now we are done with extracting links. Explore more on the RStudio blog.

Credit goes to ‘Ryan Swanstrom’.
