Assignment 1-1: Web Scraping

Objective
Data scientists often need to crawl data from websites and turn the crawled data (HTML pages) into structured data (tables). Thus, web scraping is an essential skill that every data scientist should master. In this assignment, you will learn the following:

How to download HTML pages from a website
How to extract relevant content from an HTML page
Furthermore, you will gain a deeper understanding of the data science lifecycle.

Requirements:

Please use pandas.DataFrame rather than spark.DataFrame to manipulate data.

Please use BeautifulSoup rather than lxml to parse an HTML page and extract data from the page.

Please follow the Python code style guide (https://www.python.org/dev/peps/pep-0008/). If the TA finds your code hard to read, you will lose points. This requirement applies for the whole semester.

Preliminary
If this is your first time writing a web scraper, you need to learn some basics of this topic. I found this to be a good resource: Tutorial: Web Scraping and BeautifulSoup.

Please let me know if you find a better resource. I'll share it with the other students.
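As a warm-up, here is a minimal sketch of the parse-and-extract workflow with BeautifulSoup on an inline HTML snippet (the tag names, classes, and people below are made up for illustration; a real scraper would read the downloaded file instead):

```python
from bs4 import BeautifulSoup

# a tiny inline page standing in for a downloaded HTML file
html = """
<div class="person"><h4>Ada Lovelace, Professor</h4>
  <a href="/people/ada.html">Profile</a></div>
<div class="person"><h4>Alan Turing, Lecturer</h4>
  <a href="/people/alan.html">Profile</a></div>
"""

soup = BeautifulSoup(html, 'html.parser')
for person in soup.find_all('div', class_='person'):
    # the <h4> holds "Name, Rank"; the <a> holds a relative link
    name, rank = person.h4.text.split(', ')
    link = person.a['href']
    print(name, '|', rank, '|', link)
```

The same two calls, find_all to locate repeated containers and attribute access to drill into each one, are all you need for Task 1.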

Overview
Imagine you are a data scientist working at SFU. Your job is to extract insights from SFU data to answer questions.

In this assignment, you will do two tasks. Please recall the high-level data science lifecycle from Lecture 1. I suggest that, while doing this assignment, you remind yourself of what data you collected and what questions you tried to answer.

Task 1: SFU CS Faculty Members
Sometimes you don't know what questions to ask. No worries. Start collecting data first.

In Task 1, your job is to write a web scraper to extract the faculty information from this page: http://www.sfu.ca/computing/people/faculty.html.

(a) Crawl Web Page
A web page is essentially a file stored on a remote machine (called a web server). Please write code to download the HTML page and save it as a text file ("csfaculty.html").

from requests import get

url = 'http://www.sfu.ca/computing/people/faculty.html'
response = get(url)
response.raise_for_status()  # fail fast if the download did not succeed

with open('./csfaculty.html', 'w', encoding='utf-8') as f:
    f.write(response.text)
(b) Extract Structured Data
Please write code to extract relevant content (name, rank, area, profile, homepage) from "csfaculty.html" and save them as a CSV file (like faculty_table.csv).

from bs4 import BeautifulSoup
import pandas as pd

with open('./csfaculty.html', 'r', encoding='utf-8') as f:
    txt = f.read()
html_soup = BeautifulSoup(txt, 'html.parser')
faculty_containers = html_soup.find_all('div', class_='textimage section')

names = []
ranks = []
areas = []
profiles = []
homepages = []

for container in faculty_containers:
    # get name and rank from the <h4> header ("Name, Rank")
    name_rank = container.h4.text.split(',')
    name = name_rank[0].title()
    if len(name_rank) == 1:
        rank = None
    else:
        rank = name_rank[1].title().strip()

    # deal with one special case ("Anders Miltner (Starting August 15th)")
    if name.startswith('Anders Miltner'):
        name = 'Anders Miltner'
        rank = None

    names.append(name)
    ranks.append(rank)

    # get area
    if container.p:
        area = container.p.text.replace('Area:', '').strip()
    else:
        area = None
    areas.append(area)

    # get profile and homepage links
    profile = None
    homepage = None
    for link in container.find_all('a'):
        link_text = link.text.strip()
        if link.has_attr('href'):  # guard against anchors with no href (KeyError)
            if link_text == 'Profile & Contact Information':
                tmp_link = link['href']
                if tmp_link.startswith('http://www.sfu.ca'):
                    profile = tmp_link
                else:
                    profile = 'http://www.sfu.ca' + tmp_link
            if link_text == 'Home Page':
                tmp_link = link['href']
                if tmp_link.startswith('http'):
                    homepage = tmp_link
                else:
                    homepage = 'http://www.sfu.ca' + tmp_link
    profiles.append(profile)
    homepages.append(homepage)

# convert to a pandas data frame and save as CSV
faculty_df = pd.DataFrame({'name': names,
                           'rank': ranks,
                           'area': areas,
                           'profile': profiles,
                           'homepage': homepages})
faculty_df.to_csv('faculty_table.csv', index=False)
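As a side note, the manual 'http://www.sfu.ca' prefixing above can be done more robustly with urllib.parse.urljoin, which resolves both root-relative and already-absolute links against the page they were found on (a sketch):

```python
from urllib.parse import urljoin

# the page the links were scraped from
base = 'http://www.sfu.ca/computing/people/faculty.html'

# absolute URLs pass through unchanged
print(urljoin(base, 'http://example.com/x'))
# -> http://example.com/x

# root-relative paths are resolved against the site
print(urljoin(base, '/computing/people/faculty/jiannanwang.html'))
# -> http://www.sfu.ca/computing/people/faculty/jiannanwang.html
```

This avoids having to special-case which prefix each href starts with.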
(c) Interesting Finding
Note that you don't need to do anything for Task 1(c). The purpose of this part is to give you some sense of how to leverage exploratory data analysis (EDA) to come up with interesting questions about the data. EDA is an important topic in data science; you will learn it soon in this course.

First, please install dataprep. Then, run the cell below. It shows a bar chart for every column. What interesting findings can you get from these visualizations?

from dataprep.eda import plot
import pandas as pd

df = pd.read_csv("faculty_table.csv")
plot(df)
Below are some examples:

Finding 1: The number of Professors (26) is more than 2x the number of Associate Professors (10).

Questions: Why did it happen? Is it common in all CS schools in Canada? Will the gap go larger or smaller in five years? What actions can be taken to enlarge/shrink the gap?

Finding 2: The Homepage has 22% missing values.

Questions: Why are there so many missing values? Is it because many faculty do not have their own homepages, or because they do not add their homepages to the school page? What actions can be taken to prevent this from happening in the future?

Task 2: Age Follows Normal Distribution?
In this task, you start with a question and then figure out what data to collect.

The question that you are interested in is Does SFU CS faculty age follow a normal distribution?

To estimate the age of a faculty member, you can collect the year in which they graduated from a university (gradyear) and then estimate age using the following equation:

age≈2021+23−gradyear
 
For example, if one graduates from a university in 1990, then the age is estimated as 2021+23-1990 = 54.
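The estimate above can be written as a small helper (the default of 23 for the typical graduation age and 2021 for the current year follow the equation; the function name is mine):

```python
def estimate_age(gradyear, current_year=2021, grad_age=23):
    """Estimate a faculty member's age from their earliest graduation year."""
    return current_year + grad_age - gradyear

print(estimate_age(1990))  # -> 54
```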

(a) Crawl Web Page
You notice that faculty profile pages contain graduation information. For example, you can see that Dr. Jiannan Wang graduated from Harbin Institute of Technology in 2008 at http://www.sfu.ca/computing/people/faculty/jiannanwang.html.

Please write code to download the 68 profile pages and save each page as a text file.

from time import sleep
from random import randint

import pandas as pd
from requests import get

fac_data = pd.read_csv('./faculty_table.csv')
profiles = fac_data['profile'].tolist()
names = fac_data['name'].tolist()

for name, profile in zip(names, profiles):
    if pd.isna(profile):  # some faculty have no profile link
        continue
    response = get(profile)
    with open('./' + name + '.html', 'w', encoding='utf-8') as f:
        f.write(response.text)
    sleep(randint(1, 4))  # be polite: pause between requests
(b) Extract Structured Data
Please write code to extract the earliest graduation year (e.g., 2008 for Dr. Jiannan Wang) from each profile page, and create a csv file like faculty_grad_year.csv.

import re

from bs4 import BeautifulSoup
import pandas as pd


def earliest_year(entries):
    """Return the smallest numeric year among a list of degree entries."""
    grad_year = 9999
    # first pass: the year is usually the last token of each entry
    for degree in entries:
        tokens = re.split(r'[,\s()]\s*', degree.text.rstrip())
        tokens[-1] = tokens[-1].rstrip('.')
        if tokens[-1].isnumeric() and int(tokens[-1]) < grad_year:
            grad_year = int(tokens[-1])
    # second pass (special formats): fall back to any numeric token
    if grad_year == 9999:
        for degree in entries:
            tokens = re.split(r'[,\s()]\s*', degree.text.rstrip())
            for t in tokens:
                if t.isnumeric() and int(t) < grad_year:
                    grad_year = int(t)
    return grad_year


grad_years = []

for name in names:
    grad_years.append(-1)
    if name == 'Ryan Shea':  # special case: no education section
        continue

    with open('./' + name + '.html', 'r', encoding='utf-8') as f:
        txt = f.read()
    html_soup = BeautifulSoup(txt, 'html.parser')
    containers = html_soup.find_all('div', class_='text parbase section')

    for content in containers:
        for head in content.find_all('h2'):
            if head.text.strip().strip(':') == 'Education':
                edu = head.parent.find_all('p')
                edu.extend(head.parent.find_all('li'))
                grad_years[-1] = earliest_year(edu)

# second sweep for abnormal grad years:
# some pages mark "Education" with a <p> instead of an <h2>
for i, name in enumerate(names):
    if grad_years[i] == -1 and name != 'Ryan Shea':
        with open('./' + name + '.html', 'r', encoding='utf-8') as f:
            txt = f.read()
        html_soup = BeautifulSoup(txt, 'html.parser')
        containers = html_soup.find_all('div', class_='text parbase section')
        for content in containers:
            for p in content.find_all('p'):
                if p.text.strip() == 'Education':
                    edu = p.parent.find_all('li')
                    edu.extend(p.parent.find_all('p'))
                    grad_years[i] = earliest_year(edu)

# build the data frame; -1/9999 mark pages where no year was found
faculty_grad_year = pd.DataFrame({'name': names, 'gradyear': grad_years})
faculty_grad_year = faculty_grad_year.replace([-1, 9999], pd.NA)
faculty_grad_year.to_csv('faculty_grad_year.csv', index=False)
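As an aside, the token-splitting heuristic above could be replaced by a single regular expression that pulls every four-digit year out of a text block and keeps the smallest one (a sketch; the assumption that all relevant years match 19xx/20xx is mine):

```python
import re


def earliest_year_regex(text):
    """Return the smallest 19xx/20xx year found in the text, or None."""
    years = [int(y) for y in re.findall(r'\b(?:19|20)\d{2}\b', text)]
    return min(years) if years else None


print(earliest_year_regex('Ph.D., Tsinghua University (2008); B.Sc., 2001.'))  # -> 2001
```

This trades the two-pass "last token, then any token" logic for one uniform rule, at the cost of also matching four-digit numbers that are not years.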
(c) Interesting Finding
Similar to Task 1(c), you don't need to do anything here. Just look at the different visualizations w.r.t. age and answer the question for yourself: Does SFU CS faculty age follow a normal distribution?

from dataprep.eda import plot
import pandas as pd

df = pd.read_csv("faculty_grad_year.csv")
df["age"] = 2021+23-df["gradyear"]

plot(df, "age")
Overview
Approximate Distinct Count    29
Approximate Unique (%)    45.3%
Missing    5
Missing (%)    7.2%
Infinite    0
Infinite (%)    0.0%
Memory Size    1024.0 B
Mean    44.4688
Minimum    27
Maximum    67
Zeros    0
Zeros (%)    0.0%
Negatives    0
Negatives (%)    0.0%
Quantile Statistics
Minimum    27
5-th Percentile    29
Q1    33
Median    45
Q3    53
95-th Percentile    62.85
Maximum    67
Range    40
IQR    20
Descriptive Statistics
Mean    44.4688
Standard Deviation    11.2954
Variance    127.5863
Sum    2846
Skewness    0.1277
Kurtosis    -1.0925
Coefficient of Variation    0.254
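Beyond eyeballing the histogram and Q-Q plot, you could complement the visual check with a formal normality test such as Shapiro–Wilk from scipy (a sketch on seeded synthetic data standing in for the age column; scipy is an extra dependency not required by the assignment):

```python
import numpy as np
from scipy import stats

# synthetic stand-in for the age column, using the mean/std reported above
rng = np.random.default_rng(0)
ages = rng.normal(loc=44.5, scale=11.3, size=64)

statistic, pvalue = stats.shapiro(ages)
# a small p-value (e.g. < 0.05) would be evidence against normality
print(round(statistic, 3), round(pvalue, 3))
```

On the real data, remember to drop the missing ages first (e.g. `df["age"].dropna()`) before running the test.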
Submission
Complete the code in this notebook, and submit it to the CourSys activity Assignment 1.
