New York Times editorials and news articles
Collaboration Policy
Data science is a collaborative activity. While you may talk with others about the homework, we ask that you write your solutions individually. If you do discuss the assignments with others please include their names below.
Welcome to the third homework assignment of Data 100! In this assignment, we will be exploring tweets from several high profile Twitter users.
In this assignment you will gain practice with:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from ds100_utils import *
# Ensure that Pandas shows at least 280 characters in columns, so we can see full tweets
pd.set_option('max_colwidth', 280)
plt.style.use('fivethirtyeight')
sns.set()
sns.set_context("talk")
AOC_recent_tweets.txt
BernieSanders_recent_tweets.txt
BillGates_recent_tweets.txt
Cristiano_recent_tweets.txt
EmmanuelMacron_recent_tweets.txt
elonmusk_recent_tweets.txt
with open("filename", "r") as f:
    f.read(2)
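Reading only the first few characters of a file is a cheap way to inspect it before deciding how to parse it. The sketch below shows the idea on a throwaway file; the helper name `peek` and the file contents are made up for illustration and are not part of the assignment.

```python
import os
import tempfile

def peek(path, n=2):
    """Return the first n characters of a text file without reading the rest."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read(n)

# Write a tiny stand-in file so the example runs on its own (contents are invented).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("first line\nsecond line\n")
    sample_path = tmp.name

first_chars = peek(sample_path, n=5)
print(repr(first_chars))  # 'first'
os.remove(sample_path)
```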
What format is the data in? Answer this question by entering the letter corresponding to the right format in the variable q1b below.
A. CSV
B. HTML
C. JavaScript Object Notation (JSON)
D. Excel XML
Question 1c
Question 1d
There are many ways we could choose to read tweets. Why might someone be interested in doing data analysis on tweets? Name a kind of person or institution which might be interested in this kind of analysis. Then, give two reasons why a data analysis of tweets might be interesting or useful for them. Answer in 2-3 sentences.
In this question we will use a regular expression to convert this messy HTML snippet into something more readable. For example: <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> should be Twitter for iPhone.
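One way to approach this kind of cleanup is to delete everything that looks like a tag and keep the text in between. The following is a minimal sketch under the assumption that the snippet is a simple anchor tag like the example; `strip_tags` and `tag_re` are hypothetical names, not the assignment's required solution.

```python
import re

# Assumption: tags never contain a literal ">" inside an attribute value.
tag_re = re.compile(r"<[^>]+>")

def strip_tags(html):
    """Remove anything that looks like an HTML tag, keeping the text between tags."""
    return tag_re.sub("", html)

snippet = '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'
cleaned = strip_tags(snippet)
print(cleaned)  # Twitter for iPhone
```

A pattern this simple is fine for short, regular snippets; arbitrary HTML generally calls for a real parser.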
Question 2a
most_freq(pd.Series(["A", "B", "A", "C", "B", "A"]), k=2)
would return:
We just looked at the top 5 most commonly used devices for each user. However, we used the number of tweets as a measure, when it might be better to compare these distributions by comparing proportions of tweets. Why might proportions of tweets be better measures than numbers of tweets?
Type your answer here, replacing this text.
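To see how counts and proportions are computed side by side in pandas, here is a small sketch on invented data. The two series are hypothetical users with very different tweet volumes; none of the numbers come from the assignment's data.

```python
import pandas as pd

# Made-up device labels for two hypothetical users.
user_a = pd.Series(["iPhone"] * 80 + ["Web"] * 20)
user_b = pd.Series(["iPhone"] * 8 + ["Web"] * 2)

# Raw counts depend on how much each user tweets.
counts_a = user_a.value_counts()
counts_b = user_b.value_counts()

# normalize=True divides each count by the total number of rows.
props_a = user_a.value_counts(normalize=True)
props_b = user_b.value_counts(normalize=True)

print(counts_a["iPhone"], counts_b["iPhone"])  # 80 8
print(props_a["iPhone"], props_b["iPhone"])    # 0.8 0.8
```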
$$\text{hour} + \frac{\text{minute}}{60} + \frac{\text{second}}{60^2}$$
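As a rough illustration of this calculation (hour, plus minutes divided by 60, plus seconds divided by 60 squared), the sketch below applies it to a small datetime Series. The function name `fractional_hour` and the timestamps are assumptions for the example; the assignment's own `add_hour` has a different signature.

```python
import pandas as pd

def fractional_hour(timestamps):
    """Convert a datetime Series to hours past midnight as floats."""
    return (timestamps.dt.hour
            + timestamps.dt.minute / 60
            + timestamps.dt.second / 60**2)

# Made-up timestamps for illustration.
times = pd.Series(pd.to_datetime(["2021-01-01 06:30:00", "2021-01-01 18:15:36"]))
hours = fractional_hour(times)
print(hours.tolist())
```

The first value is exactly 6.5 (half past six); the second is approximately 18.26, up to floating-point rounding.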
Note: The below code calls your add_hour function and updates each tweets dataframe by using the created_at timestamp column to calculate and store the hour column.
With our new hour column, let's take a look at the distribution of tweets for each user by time of day. The following cell helps create a density plot on the number of tweets based on the hour they are posted.
Question 3b
In [ ]: def convert_timezone(df, new_tz):
    ...
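For orientation, pandas can shift timezone-aware timestamps to another zone with `Series.dt.tz_convert`. The sketch below assumes the timestamps are already timezone-aware in UTC, which the source does not state; the values and the target zone are made up.

```python
import pandas as pd

# Made-up UTC timestamps standing in for a tweet timestamp column.
utc_times = pd.Series(
    pd.to_datetime(["2021-02-01 14:00:00", "2021-07-01 03:30:00"], utc=True)
)

# tz_convert keeps the same instant and changes only the wall-clock representation.
eastern = utc_times.dt.tz_convert("America/New_York")
print(eastern.dt.hour.tolist())  # [9, 23]
```

The two offsets differ (five hours in February, four in July) because the target zone observes daylight saving time.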
With our adjusted timestamps for each user based on their timezone, let's take a look again at the distribution of tweets by time of day.
In [ ]: # just run this cell
tweets = {handle: add_hour(df, "converted_time", "converted_hour")
          for handle, df in ...}  # iterable truncated in the original
binned_hours = {handle: bin_df(df, hour_bins, "converted_hour")
                for handle, df in ...}  # iterable truncated in the original
How do we actually measure the sentiment of each tweet? In our case, we can use the words in the text of a tweet for our calculation! For example, the word "love" within the sentence "I love America!" has a positive sentiment, whereas the word "hate" within the sentence "I hate taxes!" has a negative sentiment. In addition, some words have stronger positive / negative sentiment than others: "I love America." is more positive than "I like America."
As you can see, the lexicon contains emojis too! Each row contains a word and the polarity of that word, measuring how positive or negative the word is.
VADER Sentiment Analysis
Question 4a
Please score the sentiment of one of the following words, using your own personal interpretation. No code is required for this question!
Optional (ungraded): Are there circumstances (e.g. certain kinds of language or data) when you might not want to use VADER? What features of human speech might VADER misrepresent or fail to capture?
Question 4b
Assign a regular expression to a new variable punct_re that captures all of the punctuation within a tweet. We consider punctuation to be any non-word, non-whitespace character.
Note: A word character is any character that is alphanumeric or an underscore. A whitespace character is any character that is a space, a tab, a new line, or a carriage return.
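The definition above maps directly onto a negated character class: anything that is neither `\w` nor `\s`. The following sketch shows one possible pattern on a made-up sentence; the variable name `punct_sketch_re` is hypothetical, and other patterns could satisfy the same definition.

```python
import re

# One possible pattern: a single character that is not a word character and not whitespace.
punct_sketch_re = r"[^\w\s]"

text = "Wow!!! Is this_real? Yes, it's 100% real."
found = re.findall(punct_sketch_re, text)
print(found)  # ['!', '!', '!', '?', ',', "'", '%', '.']
```

Note that the underscore in `this_real` is not matched, since an underscore counts as a word character.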
Question 4c Part (ii)
Assign a regular expression to a new variable mentions_re that matches any mention in a tweet. Your regular expression should use a capturing group to extract the user's username in a mention.
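To illustrate how a capturing group pulls out just the username, here is a minimal sketch. It assumes a username is a run of word characters directly after `@`, which is a simplification; the pattern name and the example tweet are made up.

```python
import re

# Assumption: a username is one or more word characters immediately following "@".
mentions_sketch_re = r"@(\w+)"

tweet = "Thanks @BillGates and @elonmusk for the chat!"
usernames = re.findall(mentions_sketch_re, tweet)
print(usernames)  # ['BillGates', 'elonmusk']
```

With exactly one capturing group, `re.findall` returns the group contents rather than the whole match, so the `@` is dropped. A pattern this loose would also fire on email addresses, so a stricter version may be needed in practice.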
Tweet Sentiments and User Mentions
3. Calculate the sentiment of each tweet by taking the sum of the sentiments of its words.
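The summing step can be sketched with a plain dictionary as the lexicon. The polarity values below are invented for illustration and are not taken from VADER, and the function `tweet_sentiment` is a hypothetical helper that assumes the text is already lowercased and stripped of punctuation.

```python
# Toy polarity values, invented for illustration; not the VADER lexicon.
toy_lexicon = {"love": 3.0, "like": 1.5, "hate": -2.5}

def tweet_sentiment(text, lexicon):
    """Sum the polarity of every word found in the lexicon; unknown words count as 0."""
    words = text.lower().split()
    return sum(lexicon.get(word, 0.0) for word in words)

print(tweet_sentiment("i love america", toy_lexicon))  # 3.0
print(tweet_sentiment("i hate taxes", toy_lexicon))    # -2.5
```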
Question 4d
Complete the following function extract_mentions that takes in the full_text (not clean_text!) column from a tweets dataframe and uses mentions_re to extract all the mentions in a dataframe. The returned dataframe is:
single-indexed by the IDs of the tweets
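One pandas tool that fits this shape of problem is `Series.str.extractall`, which returns one row per regex match. The sketch below uses made-up tweets and IDs, a simplified pattern, and an assumed column name `mentions`; it is not the assignment's reference solution.

```python
import pandas as pd

# Made-up tweets indexed by hypothetical tweet IDs.
full_text = pd.Series(
    ["Hi @AOC and @BernieSanders", "no mentions here", "cc @BillGates"],
    index=pd.Index([101, 102, 103], name="id"),
)

# extractall returns one row per match, indexed by (id, match number).
mentions = (
    full_text.str.extractall(r"@(\w+)")
    .rename(columns={0: "mentions"})
    .reset_index(level="match", drop=True)  # drop the match counter, keep the tweet ID
)
print(mentions)
```

Tweets without any mention simply produce no rows, and a tweet with several mentions appears once per mention under the same ID.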
The to_tidy_format function implemented for you uses the clean_text column of each tweets dataframe to create a tidy table, which is:
single-indexed by the IDs of the tweets, for every word in the tweet.
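The idea of a tidy table, one row per word with the tweet ID repeated, can be sketched with `str.split` and `explode`. This is only an illustration on made-up text; the provided to_tidy_format may be implemented differently, and the column name `word` is an assumption.

```python
import pandas as pd

# Made-up cleaned tweets indexed by hypothetical tweet IDs.
clean_text = pd.Series(
    ["i love data", "tweets are short"],
    index=pd.Index([101, 102], name="id"),
)

# Split each tweet into a list of words, then give every word its own row.
tidy = clean_text.str.split().explode().rename("word").to_frame()
print(tidy)
```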
Adding in the Polarity Score
Question 4f
Type your answer here, replacing this text.
Question 5: You Do EDA!
a dataframe, series, or plot)
2. a short (4-5 sentence) description of the findings of your analysis: what were you looking for? What did you find? How did you go about answering your question?
Your work should involve text analysis in some way, whether that's using regular expressions or some other form.
Each of the provided helpers is in ds100_utils.py and has a comprehensive docstring. You can read the docstring by calling help on the plotting function:
how sentiment varies with time of tweet
expand on regexes from 4b to perform additional analysis (e.g. hashtags)
Attempts to describe analysis
Description | No attempt | ||
---|---|---|---|
at writing a | |||
comprehensively and summarizes results correctly | |||
of results is disconnected from | description |
Question 5a
Use this space to put your EDA code.
Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. Please save before exporting!