To create your own datasets, Python is particularly useful for scraping the web, but you can also use R. You will also want to familiarize yourself with regular expressions.

  • Simple RegEx Tutorial (HTML)
  • Basic Regular Expressions in R (PDF)
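
As a small illustration of why regular expressions matter for scraped data: numeric fields often arrive as text with thousands separators and footnote markers. The sketch below uses base R's gsub on three strings copied from Table 1 of this section; the variable names are hypothetical, and the patterns only cover the markers seen in those strings.

```r
# Raw strings as they come out of a scraped table
raw <- c("1,922,000[3]", "1,637,000+[4]", "1,620,000+")

# Step 1: drop footnote markers such as "[3]" or "[a]"
no_notes <- gsub("\\[[0-9a-z]+\\]", "", raw)

# Step 2: drop everything that is not a digit (commas, "+")
digits <- gsub("[^0-9]", "", no_notes)

# Step 3: convert the remaining digit strings to numbers
copies <- as.numeric(digits)
copies
```

Removing the footnote markers first matters: if only non-digits were stripped, the "3" in "[3]" would be glued onto the end of the number.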

The tidyverse, rvest and janitor packages are needed.

read_html transforms an HTML webpage into an HTML tree (with parents and children), and html_table extracts all the tables in a webpage as a list of data frames.
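
A minimal sketch of these two functions, assuming rvest 1.0 or later (for html_element). To stay offline, it parses a small hypothetical HTML string rather than a real URL; read_html accepts either.

```r
library(rvest)

# Hypothetical stand-in for a downloaded page: read_html() also accepts
# a literal HTML string, which keeps the example offline.
html <- "<html><body>
  <h1>Demo</h1>
  <table><tr><th>x</th></tr><tr><td>1</td></tr></table>
  <table><tr><th>Artist</th><th>Song</th></tr>
         <tr><td>Avicii</td><td>Wake Me Up</td></tr></table>
</body></html>"

page <- read_html(html)

# Walk the tree: names of the element children of <body>
body <- html_element(page, "body")
child_names <- html_name(html_children(body))
child_names

# Extract every <table> in the page; the result is a list of data frames
tables <- html_table(page)
length(tables)
```

With a real page, the string would be replaced by the page's URL; everything after read_html stays the same.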

Most downloaded Songs in the U.K.

The interesting data is in the second data.frame.
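
Since html_table returns a list, the second table is simply its second element. The sketch below shows this on a hypothetical two-table page whose second table reuses two rows from Table 1; the page content is an assumption made to keep the example offline. It also applies janitor's clean_names, presumably the reason the janitor package is listed above.

```r
library(rvest)
library(janitor)

# Hypothetical page with two tables; rows of the second are copied from Table 1
html <- "<html><body>
  <table><tr><th>Note</th></tr><tr><td>navigation box</td></tr></table>
  <table>
    <tr><th>No.</th><th>Artist</th><th>Song</th><th>Copies sold[a]</th></tr>
    <tr><td>1</td><td>Pharrell Williams</td><td>Happy</td><td>1,922,000[3]</td></tr>
    <tr><td>8</td><td>Avicii</td><td>Wake Me Up</td><td>1,340,000+</td></tr>
  </table>
</body></html>"

tables <- html_table(read_html(html))

# The interesting data is the second element of the list
songs <- tables[[2]]

# clean_names() rewrites headers such as "Copies sold[a]" into
# syntactic snake_case column names
songs <- clean_names(songs)
names(songs)
```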

Table 1: List of most downloaded songs in the United Kingdom.

| No. | Artist | Song | Copies sold[a] |
|---|---|---|---|
| 1 | Pharrell Williams | “Happy” | 1,922,000[3] |
| 2 | Adele | “Someone Like You” | 1,637,000+[4] |
| 3 | Robin Thicke featuring T.I. and Pharrell Williams | “Blurred Lines” | 1,620,000+ |
| 4 | Maroon 5 featuring Christina Aguilera | “Moves Like Jagger” | 1,500,000+ |
| 5 | Gotye featuring Kimbra | “Somebody That I Used to Know” | 1,470,000+ |
| 6 | Daft Punk featuring Pharrell Williams | “Get Lucky” | 1,400,000+ |
| 7 | The Black Eyed Peas | “I Gotta Feeling” | 1,350,000+ |
| 8 | Avicii | “Wake Me Up” | 1,340,000+ |
| 9 | Rihanna featuring Calvin Harris | “We Found Love” | 1,337,000+ |
| 10 | Kings of Leon | “Sex on Fire” | 1,293,000+ |

Salaries at U.C. Berkeley, by specialty

Using the XML package and its readHTMLTable function, we can download the salaries at UC Berkeley and then plot them by field.

| Name | Rank | Department | Salary |
|---|---|---|---|
| Turner, James | Professor | English | 202650 |
| Lee, Taeku | Professor | Political Science | 240691 |
| Jeziorski, Przemyslaw | Assistant Professor | Business | 209741 |
| Rosen, Christine | Associate Professor | Business | 74184 |
| Tang, Maureen | Lecturer | Theater, Dance and Performance Studies | 30931 |
| Barnes, Barbara | Lecturer | Gender and Women’s Studies | 48540 |
| Sankara Rajulu, Bharathy | Lecturer | South & Southeast Asian Studies | 67348 |
| Faber, Benjamin | Assistant Professor | Economics | 189085 |
| Olsen, Carl | Lecturer | Scandinavian Languages | 15663 |
| Foote, Christopher | Lecturer | Business | 18430 |
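
A rough sketch of the readHTMLTable workflow, using a hypothetical inline page built from six rows of the salary table above, since the original URL is not given here. The choice of ggplot2 box plots is an assumption about what "plot them by field" looked like; with a real page, the parsed string would be replaced by the downloaded document.

```r
library(XML)
library(ggplot2)

# Hypothetical inline page; the rows are copied from the salary table above
html <- "<html><body><table>
  <tr><th>Name</th><th>Rank</th><th>Department</th><th>Salary</th></tr>
  <tr><td>Turner, James</td><td>Professor</td><td>English</td><td>202650</td></tr>
  <tr><td>Lee, Taeku</td><td>Professor</td><td>Political Science</td><td>240691</td></tr>
  <tr><td>Jeziorski, Przemyslaw</td><td>Assistant Professor</td><td>Business</td><td>209741</td></tr>
  <tr><td>Rosen, Christine</td><td>Associate Professor</td><td>Business</td><td>74184</td></tr>
  <tr><td>Faber, Benjamin</td><td>Assistant Professor</td><td>Economics</td><td>189085</td></tr>
  <tr><td>Foote, Christopher</td><td>Lecturer</td><td>Business</td><td>18430</td></tr>
</table></body></html>"

# Parse the string as HTML (asText = TRUE means "not a file name or URL")
doc <- htmlParse(html, asText = TRUE)

# which = 1 returns the first table as a single data frame
salaries <- readHTMLTable(doc, which = 1, header = TRUE,
                          stringsAsFactors = FALSE)

# Scraped columns arrive as text, so convert before plotting
salaries$Salary <- as.numeric(salaries$Salary)

# One box per department, flipped so long department names stay readable
p <- ggplot(salaries, aes(x = Department, y = Salary)) +
  geom_boxplot() +
  coord_flip()
p
```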