New York Times editorials and news articles

In [1]:

Collaboration Policy

Data science is a collaborative activity. While you may talk with others about the homework, we ask that you write your solutions individually. If you do discuss the assignments with others, please include their names below.

Welcome to the third homework assignment of Data 100! In this assignment, we will be exploring tweets from several high-profile Twitter users.

In this assignment you will gain practice with:

	Conducting Data Cleaning and EDA on a text-based dataset.
	Manipulating data in pandas with the datetime and string accessors.

from ds100_utils import *

# Ensure that Pandas shows at least 280 characters in columns, so we can see full tweets
pd.set_option('max_colwidth', 280)
plt.style.use('fivethirtyeight')
sns.set()
sns.set_context("talk")

Question	Points
4cii 4d 4e 4f 4g 5a 5b Total

In [3]:

AOC_recent_tweets.txt
BernieSanders_recent_tweets.txt
BillGates_recent_tweets.txt
Cristiano_recent_tweets.txt
EmmanuelMacron_recent_tweets.txt
elonmusk_recent_tweets.txt

with open("filename", "r") as f:
    f.read(2)

In [7]:

In [ ]:

What format is the data in? Answer this question by entering the letter corresponding to the right format in the variable q1b below.

A. CSV
B. HTML
C. JavaScript Object Notation (JSON)
D. Excel XML

Question 1c

In [ ]:

Question 1d

There are many ways we could choose to read tweets. Why might someone be interested in doing data analysis on tweets? Name a kind of person or institution which might be interested in this kind of analysis. Then, give two reasons why a data analysis of tweets might be interesting or useful for them. Answer in 2-3 sentences.

In [ ]:

In this question we will use a regular expression to convert a messy HTML snippet into something more readable. For example, <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> should become Twitter for iPhone.
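As a rough illustration of the idea, here is a minimal sketch that removes anything that looks like an HTML tag from the example string. The pattern name html_tag_re is a hypothetical choice, and the sketch assumes each value is a single well-formed anchor like the one above; it is not necessarily the intended solution.

```python
import re

# Hypothetical pattern name; matches "<", then any run of non-">" characters, then ">".
html_tag_re = r"<[^>]*>"

snippet = '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'

# Replace every tag with the empty string, leaving only the visible label.
device = re.sub(html_tag_re, "", snippet)
print(device)  # Twitter for iPhone
```

On a pandas column, the same pattern could be applied with the string accessor, for example with str.replace and regex=True.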

Question 2a

most_freq(pd.Series(["A", "B", "A", "C", "B", "A"]), k=2)

would return:

We just looked at the top 5 most commonly used devices for each user. However, we used the number of tweets as a measure, when it might be better to compare these distributions by comparing proportions of tweets. Why might proportions of tweets be better measures than numbers of tweets?
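To make the difference concrete, here is a small sketch with made-up data (the two users and their device labels are hypothetical). It shows how value_counts can report either raw counts or proportions.

```python
import pandas as pd

# Hypothetical device labels for two users with very different tweet volumes.
user_a = pd.Series(["iPhone"] * 900 + ["Web"] * 100)
user_b = pd.Series(["iPhone"] * 45 + ["Web"] * 5)

# Raw counts differ by a factor of 20 between the two users...
counts_a = user_a.value_counts()
counts_b = user_b.value_counts()

# ...but the proportions show that both users split their tweets the same way.
props_a = user_a.value_counts(normalize=True)
props_b = user_b.value_counts(normalize=True)

print(counts_a["iPhone"], counts_b["iPhone"])
print(props_a["iPhone"], props_b["iPhone"])
```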

Type your answer here, replacing this text.

$$\text{hour} + \frac{\text{minute}}{60} + \frac{\text{second}}{60^2}$$

Note: The below code calls your add_hour function and updates each tweets dataframe by using the created_at timestamp column to calculate and store the hour column.
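For reference, the hour formula above could be sketched with the datetime accessor as follows. The argument order (df, time_col, result_col) is inferred from how add_hour is called later in this notebook; the body is only one possible implementation, and the example timestamp is made up.

```python
import pandas as pd

def add_hour(df, time_col, result_col):
    """Return a copy of df with a fractional-hour column computed from time_col."""
    ts = df[time_col]
    # hour + minute / 60 + second / 60^2
    hours = ts.dt.hour + ts.dt.minute / 60 + ts.dt.second / 60**2
    return df.assign(**{result_col: hours})

# Made-up timestamp: 18:30:36 is 18 + 0.5 + 0.01 hours.
example = pd.DataFrame({"created_at": pd.to_datetime(["2021-02-03 18:30:36"])})
with_hour = add_hour(example, "created_at", "hour")
print(with_hour["hour"].iloc[0])
```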

With our new hour column, let's take a look at the distribution of tweets for each user by time of day. The following cell helps create a density plot on the number of tweets based on the hour they are posted.

Question 3b

In [ ]: def convert_timezone(df, new_tz):
            ...
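One way such a function might look is sketched below. It assumes that created_at is already timezone-aware (for example, in UTC) and that the result belongs in a column named converted_time, which is the column name used in the next code cell. Treat it as an illustration rather than the expected answer.

```python
import pandas as pd

def convert_timezone(df, new_tz):
    """Return a copy of df with a converted_time column expressed in new_tz."""
    # Assumption: created_at holds timezone-aware timestamps, so tz_convert applies.
    return df.assign(converted_time=df["created_at"].dt.tz_convert(new_tz))

# Made-up UTC timestamp in winter, when New York is five hours behind UTC.
example = pd.DataFrame(
    {"created_at": pd.to_datetime(["2021-02-03 18:30:00"], utc=True)}
)
converted = convert_timezone(example, "America/New_York")
print(converted["converted_time"].dt.hour.iloc[0])  # 13
```

If the timestamps were timezone-naive instead, they would first need dt.tz_localize before they could be converted.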

With our adjusted timestamps for each user based on their timezone, let's take a look again at the distribution of tweets by time of day.

In [ ]: # just run this cell
tweets = {handle: add_hour(df, "converted_time", "converted_hour") for handle, df
binned_hours = {handle: bin_df(df, hour_bins, "converted_hour") for handle, df in

How do we actually measure the sentiment of each tweet? In our case, we can use the words in the text of a tweet for our calculation! For example, the word "love" within the sentence "I love America!" has a positive sentiment, whereas the word "hate" within the sentence "I hate taxes!" has a negative sentiment. In addition, some words have stronger positive / negative sentiment than others: "I love America." is more positive than "I like America."

As you can see, the lexicon contains emojis too! Each row contains a word and the polarity of that word, measuring how positive or negative the word is.

VADER Sentiment Analysis

Optional (ungraded): Are there circumstances (e.g. certain kinds of language or data) when you might not want to use VADER? What features of human speech might VADER misrepresent or fail to capture?

Question 4b

Assign a regular expression to a new variable punct_re that matches every punctuation character within a tweet. We consider punctuation to be any non-word, non-whitespace character.

Note: A word character is any character that is alphanumeric or an underscore. A whitespace character is any character that is a space, a tab, a new line, or a carriage return.
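The definition above maps directly onto a negated character class. The following is a minimal sketch of one pattern that fits the definition; the example sentence is made up.

```python
import re

# Any single character that is neither a word character (\w) nor whitespace (\s).
punct_re = r"[^\w\s]"

text = "Hello, world! It's #great_day @user."
found = re.findall(punct_re, text)
print(found)  # [',', '!', "'", '#', '@', '.']
```

Note that the underscore in great_day is not matched, because an underscore counts as a word character.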

Question 4c Part (ii)

Assign a regular expression to a new variable mentions_re that matches any mention in a tweet. Your regular expression should use a capturing group to extract the user's username in a mention.
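As an illustration, here is a minimal sketch of a pattern with one capturing group. It assumes that a username is a run of word characters following an @ sign, which is a simplification; for instance, it would also match the part after @ in an email address.

```python
import re

# The parentheses form a capturing group, so findall returns only the username.
mentions_re = r"@(\w+)"

tweet = "Thanks @BillGates and @elonmusk for joining!"
usernames = re.findall(mentions_re, tweet)
print(usernames)  # ['BillGates', 'elonmusk']
```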

Tweet Sentiments and User Mentions

3. Calculate the sentiment of each tweet by taking the sum of the sentiments of its words.
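The summing step can be sketched with a toy lexicon. The polarity values below are hypothetical numbers chosen for illustration and are not taken from the VADER lexicon; the helper name tweet_sentiment is also an assumption.

```python
# Hypothetical polarities for illustration only (not actual VADER values).
toy_lexicon = {"love": 3.0, "like": 1.5, "hate": -2.5}

def tweet_sentiment(text, lexicon):
    """Sum the polarity of each word; words missing from the lexicon count as 0."""
    words = text.lower().split()
    return sum(lexicon.get(word, 0.0) for word in words)

# Assumes punctuation has already been removed from the text.
print(tweet_sentiment("i love america", toy_lexicon))
print(tweet_sentiment("i like america", toy_lexicon))
print(tweet_sentiment("i hate taxes", toy_lexicon))
```

With these toy values, the "love" sentence scores higher than the "like" sentence, and the "hate" sentence scores below zero.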

Question 4d

Complete the following function extract_mentions that takes in the full_text (not clean_text!) column from a tweets dataframe and uses mentions_re to extract all the mentions it contains. The returned dataframe is:

single-indexed by the IDs of the tweets
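One possible shape for this function is sketched below using str.extractall. The pattern, the output column name mentions, and the example tweet IDs are all assumptions made for illustration, so the required output may differ in its details.

```python
import pandas as pd

# Assumed pattern: one capturing group holding the username.
mentions_re = r"@(\w+)"

def extract_mentions(full_texts):
    """Return one row per mention, indexed only by tweet ID."""
    # extractall yields a MultiIndex of (tweet ID, match number).
    mentions = full_texts.str.extractall(mentions_re)
    mentions.columns = ["mentions"]  # hypothetical column name
    # Drop the match-number level so the result is single-indexed by tweet ID.
    return mentions.reset_index(level="match", drop=True)

texts = pd.Series(
    ["Thanks @BillGates and @elonmusk!", "No mentions here", "cc @AOC"],
    index=pd.Index([101, 102, 103], name="id"),
)
result = extract_mentions(texts)
print(result)
```

Tweets without any mention (ID 102 here) simply do not appear in the result.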

In [ ]:

In [ ]:

The to_tidy_format function implemented for you uses the clean_text column of each tweets dataframe to create a tidy table, which is:

single-indexed by the IDs of the tweets, for every word in the tweet.
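To give a feel for what a tidy table like this looks like, here is a rough sketch built with str.split and explode. It is not the provided to_tidy_format implementation; the function name to_tidy_sketch, the column name word, and the example tweets are hypothetical.

```python
import pandas as pd

def to_tidy_sketch(df):
    """Return one row per word, keeping the tweet ID as the index."""
    # Split each tweet on whitespace, then give every word its own row.
    words = df["clean_text"].str.split().explode()
    return words.to_frame(name="word")  # hypothetical column name

tweets_example = pd.DataFrame(
    {"clean_text": ["i love data", "hate taxes"]},
    index=pd.Index([101, 102], name="id"),
)
tidy = to_tidy_sketch(tweets_example)
print(tidy)
```

Each tweet ID repeats once per word, so ID 101 appears three times and ID 102 twice.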

In [ ]:

Adding in the Polarity Score

In [ ]:

Question 4f

Type your answer here, replacing this text.

Question 5: You Do EDA!

a dataframe, series, or plot)
2. a short (4-5 sentence) description of the findings of your analysis: what were you looking for? What did you find? How did you go about answering your question?

Your work should involve text analysis in some way, whether that's using regular expressions or some other form.

Helper	Description
make_scatter_plot

Each of the provided helpers is in ds100_utils.py and has a comprehensive docstring. You can read the docstring by calling help on the plotting function:

In [ ]:

how sentiment varies with time of tweet

expand on regexes from 4b to perform additional analysis (e.g. hashtags)

Attempts to describe analysis (rubric levels, as far as they can be recovered): describes the analysis comprehensively and summarizes results correctly; description of results is disconnected from ...; no attempt at writing a description.

Question 5a

Use this space to put your EDA code.

In [ ]:

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. Please save before exporting!

In [ ]: