To create your own datasets, Python is particularly useful for scraping the web, but you can also use R. You will also want to familiarize yourself with regular expressions.
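As a minimal sketch of the kind of regular expression that comes up when cleaning scraped text (the example strings below are made up for illustration):

# Hypothetical scraped strings: strip everything that is not a digit or a
# decimal point, then convert to numeric
prices <- c("$1,299.00", "USD 850", "£2,100+")
as.numeric(gsub("[^0-9.]", "", prices))
#> [1] 1299  850 2100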
The tidyverse, rvest, janitor and magrittr packages are needed.
# Required packages, loaded through a helper script
pklist <- c("tidyverse", "rvest", "janitor", "magrittr")
source("https://fgeerolf.com/code/load-packages.R")
read_html() is used to transform an HTML webpage into an HTML tree (with parents and children), and html_table() extracts all the tables from a webpage, for example from https://smartasset.com/mortgage/price-to-rent-ratio-in-us-cities.
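For instance, a sketch along these lines would pull every table on that page as a list of tibbles (assuming the page serves its tables as static HTML; the object names are just illustrative):

# Read the page once, then extract every <table> it contains
prices_page <- read_html("https://smartasset.com/mortgage/price-to-rent-ratio-in-us-cities")
price_to_rent_tables <- html_table(prices_page, header = TRUE)
length(price_to_rent_tables)   # how many tables were found on the page
price_to_rent_tables[[1]]      # the first one, as a tibble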
read_html("https://ec.europa.eu/energy/observatory/reports/")
# {html_document}
# <html>
# [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
# [2] <body>\n<h1>Index of /emos_prod/reports/public</h1>\n<pre> <a href=" ...
html_elements() then selects every node matching a CSS selector (here, the <a> links), and html_text2() extracts their text:
a_elements <- read_html("https://ec.europa.eu/energy/observatory/reports/") |>
  html_elements("a") |>
  html_text2()
a_elements
# [1] "Name" "Last modified" "Size" "Description"
# [5] "Parent Directory"
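The same selection can also return the link targets rather than their text: html_attr() pulls any attribute of the selected nodes, here the href of each <a>, which is how the URLs of the listed reports could be collected (a sketch, not run here; report_links is just an illustrative name):

# Extract the href attribute of every link on the directory page
report_links <- read_html("https://ec.europa.eu/energy/observatory/reports/") |>
  html_elements("a") |>
  html_attr("href")
head(report_links)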
Consider, as another example, the Wikipedia page listing the most-downloaded songs in the United Kingdom: https://en.wikipedia.org/wiki/List_of_most-downloaded_songs_in_the_United_Kingdom
data <- "https://en.wikipedia.org/wiki/List_of_most-downloaded_songs_in_the_United_Kingdom" %>%
  read_html() %>%
  html_table(header = TRUE, fill = TRUE)
The interesting data is in the second data.frame of the list returned by html_table().
data[[2]][, c(1, 2, 3, 7)] %>%
  as_tibble() %>%
  head(10) %>%
  {if (is_html_output()) print_table(.) else .}
| No. | Artist | Song | Copies sold[a] |
|---|---|---|---|
| 1 | Pharrell Williams | “Happy” | 1,922,000[4] |
| 2 | Adele | “Someone Like You” | 1,637,000+[5] |
| 3 | Robin Thicke featuring T.I. and Pharrell Williams | “Blurred Lines” | 1,620,000+ |
| 4 | Maroon 5 featuring Christina Aguilera | “Moves Like Jagger” | 1,500,000+ |
| 5 | Gotye featuring Kimbra | “Somebody That I Used to Know” | 1,470,000+ |
| 6 | Daft Punk featuring Pharrell Williams | “Get Lucky” | 1,400,000+ |
| 7 | The Black Eyed Peas | “I Gotta Feeling” | 1,350,000+ |
| 8 | Avicii | “Wake Me Up” | 1,340,000+ |
| 9 | Rihanna featuring Calvin Harris | “We Found Love” | 1,337,000+ |
| 10 | Kings of Leon | “Sex on Fire” | 1,293,000+ |
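The Copies sold column comes in as text, with thousands separators, "+" signs, and footnote markers. One way to turn it into a number is readr::parse_number(), which keeps the first number in each string and drops everything after it (a sketch; the songs object and the copies_sold name are just illustrative):

songs <- data[[2]][, c(1, 2, 3, 7)] %>%
  as_tibble() %>%
  rename(copies_sold = 4) %>%   # rename the 4th column
  mutate(copies_sold = readr::parse_number(copies_sold))
songs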
Using the XML package and the readHTMLTable() function, we can download the salaries at UC Berkeley from http://projects.dailycal.org/paychecker and then plot them by department.
library(XML)
pays_berkeley <- readHTMLTable("http://projects.dailycal.org/paychecker")[[1]] %>%
  rename(Salary = `Salary (2015)`) %>%
  mutate(Salary = as.numeric(gsub('[$,]', '', Salary)))
pays_berkeley %>%
  head(10) %>%
  {if (is_html_output()) print_table(.) else .}
| Name | Rank | Department | Salary |
|---|---|---|---|
| Turner, James | Professor | English | 202650 |
| Lee, Taeku | Professor | Political Science | 240691 |
| Jeziorski, Przemyslaw | Assistant Professor | Business | 209741 |
| Rosen, Christine | Associate Professor | Business | 74184 |
| Tang, Maureen | Lecturer | Theater, Dance and Performance Studies | 30931 |
| Barnes, Barbara | Lecturer | Gender and Women’s Studies | 48540 |
| Sankara Rajulu, Bharathy | Lecturer | South & Southeast Asian Studies | 67348 |
| Faber, Benjamin | Assistant Professor | Economics | 189085 |
| Olsen, Carl | Lecturer | Scandinavian Languages | 15663 |
| Foote, Christopher | Lecturer | Business | 18430 |
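Before plotting, a quick dplyr summary gives a sense of how salaries vary across departments (a sketch using the pays_berkeley data frame built above):

pays_berkeley %>%
  group_by(Department) %>%
  summarise(n = n(),
            median_salary = median(Salary, na.rm = TRUE)) %>%
  arrange(desc(median_salary))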
pays_berkeley %>%
  ggplot(aes(x = reorder(Department, Salary, FUN = max), y = Salary)) +
  geom_boxplot(aes(color = Rank)) +
  coord_flip() +
  scale_y_continuous(labels = scales::dollar) +
  labs(title = "Salaries by Department",
       subtitle = "UC Berkeley Salaries",
       x = "Department",
       y = "Annual Salary (2015)") +
  theme(plot.caption = element_text(size = 7.5))