For this assignment, you'll write several small Python programs (within a single Jupyter notebook) to scrape simple HTML data from several websites. You will use Python 3 with the following libraries:
Beautiful Soup 4 (makes it easier to pull data out of HTML and XML documents)
Requests (for handling HTTP requests from Python)
Here is a fairly simple example for finding out how many datasets can currently be searched/accessed on data.gov. You should make sure you can run this code before going on to the questions you’ll be writing (the answer when I last ran this was 194,708).
import bs4
import requests
response = requests.get('http://www.data.gov/')
soup = bs4.BeautifulSoup(response.text, "html.parser")
# select_one returns the first matching tag; select would return a list, which has no .text
link = soup.select_one("small a")
print(link.text)
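One detail worth checking in the Beautiful Soup documentation before moving on: `select` returns a list of every matching tag, while `select_one` returns only the first match (or `None` if nothing matches). The sketch below shows the difference on a made-up HTML fragment; the markup and the number in it are placeholders, not data.gov's actual page.

```python
import bs4

# Hypothetical HTML fragment; the real data.gov markup may differ.
html = '<p><small><a href="/dataset">1,234 datasets</a></small></p>'
soup = bs4.BeautifulSoup(html, "html.parser")

matches = soup.select("small a")    # list of all matching tags
first = soup.select_one("small a")  # first matching tag, or None

print(len(matches))     # how many tags matched
print(matches[0].text)  # text of the first match, via indexing
print(first.text)       # same text, via select_one
```

Calling `.text` directly on the result of `select` raises an error, because the result is a list rather than a single tag.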
This is an individual programming assignment.
Write Python programs to answer the following questions. You will need to do some reading/research regarding the Beautiful Soup interface and possibly on Python as well. Also reference the relevant material in the Grus text and my Jupyter notebook from the 2/14 class. Do not hardcode any data; everything should be dynamically scraped from the live websites. Remember to post questions on Piazza.
1. data.gov (relevant url: http://catalog.data.gov/dataset?q=&sort=metadata_created+desc): Accept an integer as input and find the name (href text) of the nth "most recent" dataset on data.gov. For example, if the user enters 1, print the name of the first dataset on data.gov when ordered by "date added". You can assume that the dataset appears on the first page.
It is possible to prompt and receive user input in a jupyter notebook cell using the standard python input syntax. Try this code:
num = input("Enter a number: ")
print(num)
Example (based on data when viewed on 2/18/2019):
Which dataset? 4
Summary of RHESSys Simulations of GI Sensitivity
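Note that `input` returns a string, so it needs converting with `int` before it can be used as a position, and that "nth" is 1-based while Python lists are 0-based. A minimal sketch of that step, using made-up markup: `SAMPLE_HTML`, the `h3.title a` selector, and the helper name `nth_link_text` are all hypothetical, and the real catalog page must be inspected to find the right selector.

```python
import bs4

# Hypothetical markup standing in for a results page; the real
# catalog.data.gov page uses its own tags and class names.
SAMPLE_HTML = """
<ul>
  <li><h3 class="title"><a href="/d/a">First dataset</a></h3></li>
  <li><h3 class="title"><a href="/d/b">Second dataset</a></h3></li>
  <li><h3 class="title"><a href="/d/c">Third dataset</a></h3></li>
</ul>
"""


def nth_link_text(html, n, selector="h3.title a"):
    """Return the text of the nth (1-based) link matching selector."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    links = soup.select(selector)
    if not 1 <= n <= len(links):
        raise ValueError(f"n must be between 1 and {len(links)}")
    # Subtract 1 to go from a 1-based position to a 0-based list index.
    return links[n - 1].get_text(strip=True)


# In the notebook, n would come from the user, for example:
# n = int(input("Which dataset? "))
print(nth_link_text(SAMPLE_HTML, 2))
```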
2. White House Press Briefings (relevant url: https://www.whitehouse.gov/briefing-room/press-briefings) Programmatically find the link for the most recent press conference (this will be the first one on the page identified by the "Remarks" subheader), follow the link and display the time that the briefing took place. Note that the url for the most recent press briefing should not be hardcoded. If a new press briefing is added, your program should give the time of the newly added briefing. Test your code with several press briefings to be sure it is consistently getting the correct time.
Example output:
https://www.whitehouse.gov/briefings-statements/remarks-vice-president-pence-2019-munich-security-conference-munich-germany/
Time of most recent White House Press Briefing
Issued on: February 16, 2019
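Following a link means reading its `href` attribute and, if the value is relative, resolving it against the page's URL before requesting it. The sketch below shows only that step on made-up markup: `LISTING_HTML`, the `article h2 a` selector, the `example.com` URLs, and the helper name `first_link_url` are hypothetical placeholders, not the White House site's actual structure.

```python
from urllib.parse import urljoin

import bs4

# Hypothetical listing markup; the actual page must be inspected
# to find the element that identifies the most recent briefing.
LISTING_HTML = """
<article><h2><a href="/briefings/newest/">Newest</a></h2></article>
<article><h2><a href="/briefings/older/">Older</a></h2></article>
"""


def first_link_url(html, base_url, selector="article h2 a"):
    """Return the absolute URL of the first link matching selector."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    link = soup.select_one(selector)
    if link is None:
        raise LookupError("no matching link found")
    # urljoin turns a relative href into an absolute URL.
    return urljoin(base_url, link["href"])


url = first_link_url(LISTING_HTML, "https://www.example.com/briefings/")
print(url)
```

The resulting URL could then be passed to `requests.get` and the response parsed the same way as the listing page.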
3. Texas Dept of Criminal Justice (relevant url: http://www.tdcj.state.tx.us/death_row/dr_executions_by_year.html): Accept two integers as input. You can assume that these values represent a valid starting and ending year within the range of the years in the table. Process the html and find the total number of executions in Texas between the starting year and the ending year (inclusive of the start and end years).
Example:
Enter starting year: 1990
Enter ending year: 2000
Total executions: 206
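The general pattern here is to walk the table rows, convert the cell text to integers, and add up the rows whose year falls in the requested range. A minimal sketch on a made-up table: `TABLE_HTML`, its two-column layout, its numbers, and the helper name `total_between` are hypothetical, and the real TDCJ table has its own columns that need inspecting before choosing indexes.

```python
import bs4

# Hypothetical two-column table (year, count) with invented numbers.
TABLE_HTML = """
<table>
  <tr><th>Year</th><th>Total</th></tr>
  <tr><td>2001</td><td>5</td></tr>
  <tr><td>2000</td><td>7</td></tr>
  <tr><td>1999</td><td>3</td></tr>
</table>
"""


def total_between(html, start, end, year_col=0, count_col=1):
    """Sum count_col over rows whose year_col is in [start, end]."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    total = 0
    for row in soup.select("tr"):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        # Skip header rows (no <td> cells) and rows without a numeric year.
        if len(cells) <= max(year_col, count_col):
            continue
        if not cells[year_col].isdigit():
            continue
        year = int(cells[year_col])
        if start <= year <= end:
            total += int(cells[count_col])
    return total


print(total_between(TABLE_HTML, 1999, 2000))
```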
4. For this problem, you'll interact with the Twitter API. Once again, consult the example code in the Grus text and the 2/14 in-class jupyter notebook. Use:
tweepy (for wrapping the Twitter API)
csv (for interacting with csv files)
json (for parsing json data)
Create a free Twitter account if necessary. Follow the instructions in the Grus text to enable free access to the Twitter API. Feel free to use a credentials.json file similar to my usage in the example notebook.
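Keeping keys in a credentials.json file means they never appear in the notebook itself. A minimal sketch of reading such a file with the `json` module follows; the four key names in `REQUIRED_KEYS` and the helper name `load_credentials` are assumptions, so match them to whatever your own file and the example notebook use.

```python
import json
import os
import tempfile

# Hypothetical key names; adjust to match your credentials file.
REQUIRED_KEYS = (
    "consumer_key",
    "consumer_secret",
    "access_token",
    "access_token_secret",
)


def load_credentials(path="credentials.json"):
    """Read API keys from a JSON file and check that none are missing."""
    with open(path, encoding="utf-8") as f:
        creds = json.load(f)
    missing = [key for key in REQUIRED_KEYS if key not in creds]
    if missing:
        raise KeyError(f"missing credential fields: {missing}")
    return creds


# Demo with a throwaway file that holds placeholder values only.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "credentials.json")
    with open(demo_path, "w", encoding="utf-8") as f:
        json.dump({key: "placeholder" for key in REQUIRED_KEYS}, f)
    demo = load_credentials(demo_path)

print(sorted(demo))
```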
Use the data at http://unitedstates.sunlightfoundation.com/legislators/legislators.csv to find which currently serving senator (not representative) has the most Twitter followers and which has the fewest. Also, look up the 10 most recent tweets of each currently serving senator (once again, no representatives) and report the totals of how many people have favorited the last ten tweets and how many people retweeted the last ten tweets. Use the requests library to read the csv file in and use the csv module to process it (look at csv.DictReader). Be sure to filter out those who aren't currently in office and those who don't have a Twitter account. Write your output in a reasonable/readable format (example below):
Example output:
Most followers: xxx
Fewest followers: xxx
xxx last 10 tweets: xx favorited, xx retweeted
xxx last 10 tweets: xx favorited, xx retweeted
...
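For the csv step, `csv.DictReader` yields one dictionary per row keyed by the header line, and it accepts any iterable of text lines, so downloaded text can be wrapped in `io.StringIO`. The sketch below filters made-up rows; the column names (`title`, `in_office`, `twitter_id`), their values, and the helper name `current_senator_handles` are assumptions to verify against the real file's header row.

```python
import csv
import io

# Made-up rows; check the real file's header for the actual column names.
SAMPLE_CSV = """title,lastname,in_office,twitter_id
Sen,Alpha,1,SenAlpha
Rep,Bravo,1,RepBravo
Sen,Charlie,0,SenCharlie
Sen,Delta,1,
"""


def current_senator_handles(csv_text):
    """Return Twitter handles of in-office senators that have one."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row["twitter_id"]
        for row in reader
        # Every csv value is a string, so compare against "1", not 1.
        if row["title"] == "Sen"
        and row["in_office"] == "1"
        and row["twitter_id"]
    ]


print(current_senator_handles(SAMPLE_CSV))
```

With the live file, `csv_text` would be the `.text` of the response returned by `requests.get`.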