Star Wars Survey (FiveThirtyEight)

This was a guided project on Dataquest.io in the Data Analyst path. While I closely followed the steps provided for the data cleaning process, I took much more freedom in the data visualization section, especially with the formatting. I’ve also included visuals that were not part of the guided project. The data used for this project is provided by FiveThirtyEight on their GitHub page.

# Setup environment
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# Read in data, summarize, and review
star_wars = pd.read_csv('star_wars.csv', encoding='ISO-8859-1')
star_wars.shape
(1187, 38)
star_wars.columns
Index(['RespondentID',
       'Have you seen any of the 6 films in the Star Wars franchise?',
       'Do you consider yourself to be a fan of the Star Wars film franchise?',
       'Which of the following Star Wars films have you seen? Please select all that apply.',
       'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8',
       'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.',
       'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13',
       'Unnamed: 14',
       'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.',
       'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19',
       'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23',
       'Unnamed: 24', 'Unnamed: 25', 'Unnamed: 26', 'Unnamed: 27',
       'Unnamed: 28', 'Which character shot first?',
       'Are you familiar with the Expanded Universe?',
       'Do you consider yourself to be a fan of the Expanded Universe?ξ',
       'Do you consider yourself to be a fan of the Star Trek franchise?',
       'Gender', 'Age', 'Household Income', 'Education',
       'Location (Census Region)'],
      dtype='object')
# Remove rows where RespondentID is null, recheck shape
# (.copy() avoids a SettingWithCopyWarning when columns are reassigned later)
star_wars = star_wars[pd.notnull(star_wars['RespondentID'])].copy()
star_wars.shape
(1186, 38)
# Map yes/no to True/False
yes_no = {
    'Yes':True,
    'No':False
}
map_cols = [
    'Have you seen any of the 6 films in the Star Wars franchise?',
    'Do you consider yourself to be a fan of the Star Wars film franchise?']
for c in map_cols:
    star_wars[c] = star_wars[c].map(yes_no)
# Import numpy in order to use np.nan
import numpy as np
# Change movie names to True and NaN to False
movie_mapping = {
    'Star Wars: Episode I  The Phantom Menace': True,
    np.nan: False,
    'Star Wars: Episode II  Attack of the Clones': True,
    'Star Wars: Episode III  Revenge of the Sith': True,
    'Star Wars: Episode IV  A New Hope': True,
    'Star Wars: Episode V The Empire Strikes Back': True,
    'Star Wars: Episode VI Return of the Jedi': True
}

for col in star_wars.columns[3:9]:
    star_wars[col] = star_wars[col].map(movie_mapping)
# Rename the columns to more intuitive names
rename_cols = {
    'Which of the following Star Wars films have you seen? Please select all that apply.':'seen_1',
    'Unnamed: 4':'seen_2',
    'Unnamed: 5':'seen_3',
    'Unnamed: 6':'seen_4',
    'Unnamed: 7':'seen_5',
    'Unnamed: 8':'seen_6'
}
star_wars = star_wars.rename(columns=rename_cols)
# Convert columns 9-14 to float and rename
star_wars[star_wars.columns[9:15]] = star_wars[star_wars.columns[9:15]].astype(float)
rename_cols2 = {
    'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.':'Ranking_1',
    'Unnamed: 10':'Ranking_2',
    'Unnamed: 11':'Ranking_3',
    'Unnamed: 12':'Ranking_4',
    'Unnamed: 13':'Ranking_5',
    'Unnamed: 14':'Ranking_6'
}
star_wars = star_wars.rename(columns=rename_cols2)
star_wars[star_wars.columns[9:15]].head()
   Ranking_1  Ranking_2  Ranking_3  Ranking_4  Ranking_5  Ranking_6
1        3.0        2.0        1.0        4.0        5.0        6.0
2        NaN        NaN        NaN        NaN        NaN        NaN
3        1.0        2.0        3.0        4.0        5.0        6.0
4        5.0        6.0        1.0        2.0        4.0        3.0
5        5.0        4.0        6.0        2.0        1.0        3.0
# Average ranking by movie
ranking_cols = star_wars.columns[9:15]
ranking_avg = star_wars[ranking_cols].mean()
ranking_avg
Ranking_1    3.732934
Ranking_2    4.087321
Ranking_3    4.341317
Ranking_4    3.272727
Ranking_5    2.513158
Ranking_6    3.047847
dtype: float64
# Plot movie rankings; 1 = 'Most Liked' and 6 = 'Least Liked'
fig, ax = plt.subplots()
ax.bar(range(1,7), ranking_avg, align='center', color='dodgerblue', linewidth=0)
ax.set_xlabel('Movies')
ax.set_ylabel('Ranking')
plt.yticks(range(1,6,1))
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.tick_params(length=0)
ax.set_title('Average Movie Rankings', y=1.1, loc='left', fontweight='bold')
plt.tight_layout()

[Figure: Average Movie Rankings]

Movie Rankings

So far I have cleaned up the data by renaming columns, changing yes/no to True/False, and converting strings to floats for calculation.

The numbers along the x-axis represent the Star Wars episodes: number 1 is episode 1 (The Phantom Menace) and number 6 is episode 6 (Return of the Jedi). Something doesn’t look right, though. At first glance, the first three episodes, which are actually the newer movies, appear to rank higher than the original trilogy. Any Star Wars fan will tell you this can’t be true. The explanation is that a taller bar means less liked: respondents ranked the films from 1 = ‘Most Liked’ through 6 = ‘Least Liked’, so, as is typical with rankings, 1 is the best.
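One way to make the ranking direction easier to read is to sort the averages so the best-liked episode comes first. The snippet below is only a sketch: it rebuilds the averages by hand from the output above, and the episode labels and the name `liking_score` are my own additions, not part of the guided project.

```python
import pandas as pd

# Average rankings copied from the notebook output above (lower = better liked).
# The 'Ep. N' labels are an assumption added for readability.
ranking_avg = pd.Series(
    [3.732934, 4.087321, 4.341317, 3.272727, 2.513158, 3.047847],
    index=['Ep. 1', 'Ep. 2', 'Ep. 3', 'Ep. 4', 'Ep. 5', 'Ep. 6'],
)

# Sort ascending so the best-liked episode is listed first.
best_to_worst = ranking_avg.sort_values()
print(best_to_worst)

# Optional: flip the scale so that a larger value means "liked more".
# 7 - rank maps a rank of 1 to 6 and a rank of 6 to 1.
liking_score = 7 - ranking_avg
print(liking_score.round(2))
```

Plotting `liking_score` instead of `ranking_avg` would give bars where taller means better, at the cost of no longer showing the raw survey scale.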

# Sum movie views
movie_views = star_wars[star_wars.columns[3:9]].sum()
movie_views
seen_1    673
seen_2    571
seen_3    550
seen_4    607
seen_5    758
seen_6    738
dtype: int64
# Plot movie views
fig, ax = plt.subplots()
ax.bar(range(1,7), movie_views, align='center', color='dodgerblue', linewidth=0)
ax.set_xlabel('Movies')
ax.set_ylabel('Views')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.tick_params(length=0)
ax.set_title('Movie Views', y=1.1, loc='left', fontweight='bold')
plt.tight_layout()

[Figure: Movie Views]

Movie Views

Overall, the original movies (4-6) were seen by more respondents than the three newer movies (1-3), although episode 4 had fewer views than episode 1. While a lot of respondents viewed episode 1, episodes 2 and 3 were seen by fewer respondents. This may suggest that many respondents didn’t like episode 1 and therefore had no desire to see episodes 2 and 3. This seems consistent with the movie rankings in the previous plot. Let’s produce a scatter plot that shows movie views vs. movie rankings.

# Plot correlation between movie views and movie rankings
movie_ep = ['Ep. 1','Ep. 2','Ep. 3','Ep. 4','Ep. 5','Ep. 6']

fig, ax = plt.subplots()
ax.scatter(movie_views, ranking_avg, color='dodgerblue')
ax.set_xlabel('Movie Views')
ax.set_ylabel('Ranking (Lower = Liked More)')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.tick_params(length=0)
ax.set_title('Movie Views vs. Movie Rankings', y=1.1, loc='left', fontweight='bold')
plt.tight_layout()

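# Label each point; .iloc is used because both Series have string labels,
# so plain integer indexing would rely on deprecated positional fallback
for i, txt in enumerate(movie_ep):
    ax.annotate(txt, (movie_views.iloc[i]+1, ranking_avg.iloc[i]+.10))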

[Figure: Movie Views vs. Movie Rankings]

Correlation

Keep in mind that the lower the ranking, the better. We can easily see from the scatterplot that the better ranked episodes are, for the most part, watched more. While episode 1 has more views than episode 4, episode 4 has a better ranking.
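To put a rough number on what the scatter plot shows, one could compute a Pearson correlation coefficient between views and average ranking. This is a sketch rather than part of the guided project: the values are retyped from the outputs above, and the episode labels are added for illustration.

```python
import pandas as pd

episodes = ['Ep. 1', 'Ep. 2', 'Ep. 3', 'Ep. 4', 'Ep. 5', 'Ep. 6']

# Values copied from the notebook outputs above.
movie_views = pd.Series([673, 571, 550, 607, 758, 738], index=episodes)
ranking_avg = pd.Series(
    [3.732934, 4.087321, 4.341317, 3.272727, 2.513158, 3.047847],
    index=episodes,
)

# Series.corr uses Pearson correlation by default.
# A negative value means more-watched episodes tend to have a lower (better) rank.
r = movie_views.corr(ranking_avg)
print(round(r, 2))
```

With only six data points, the coefficient is best read as a rough summary of the plot rather than as a statistically robust result.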

# Plot movie ranks by fans
sw_fan = star_wars[star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?']==1]
sw_no_fan = star_wars[star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?']==0]
sw_fan_rank = sw_fan[sw_fan.columns[9:15]]
sw_no_fan_rank = sw_no_fan[sw_no_fan.columns[9:15]]

fig, ax = plt.subplots()
ax.bar(range(1,7), sw_fan_rank.mean(), align='center', color='dodgerblue', linewidth=0)
ax.set_xlabel('Movies')
ax.set_ylabel('Ranking')
plt.yticks(range(1,6,1))
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.tick_params(length=0)
ax.set_title('Average Movie Rankings Among Fans', y=1.1, loc='left', fontweight='bold')
plt.tight_layout()

[Figure: Average Movie Rankings Among Fans]

# Plot movie ranks by non-fans
fig, ax = plt.subplots()
ax.bar(range(1,7), sw_no_fan_rank.mean(), align='center', color='dodgerblue', linewidth=0)
ax.set_xlabel('Movies')
ax.set_ylabel('Ranking')
plt.yticks(range(1,6,1))
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.tick_params(length=0)
ax.set_title('Average Movie Rankings Among Non-Fans', y=1.1, loc='left', fontweight='bold')
plt.tight_layout()

[Figure: Average Movie Rankings Among Non-Fans]

Movie Rankings Among Fans & Non-Fans

There is quite a difference in rankings between fans and non-fans. Whereas fans clearly ranked the original episodes more favorably, non-fans tended to rank episodes 5 and 6 in line with episodes 1 and 2. However, both fans and non-fans disliked episode 3 the most.

# Plot movie views by fans
sw_fan = star_wars[star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?']==1]
sw_no_fan = star_wars[star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?']==0]
sw_fan_views = sw_fan[sw_fan.columns[3:9]]
sw_no_fan_views = sw_no_fan[sw_no_fan.columns[3:9]]

fig, ax = plt.subplots()
ax.bar(range(1,7), sw_fan_views.sum(), align='center', color='dodgerblue', linewidth=0)
ax.set_xlabel('Movies')
ax.set_ylabel('Views')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.tick_params(length=0)
ax.set_title('Movie Views Among Fans', y=1.1, loc='left', fontweight='bold')
plt.tight_layout()

[Figure: Movie Views Among Fans]

# Plot movie views by non-fans
fig, ax = plt.subplots()
ax.bar(range(1,7), sw_no_fan_views.sum(), align='center', color='dodgerblue', linewidth=0)
ax.set_xlabel('Movies')
ax.set_ylabel('Views')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.tick_params(length=0)
ax.set_title('Movie Views Among Non-Fans', y=1.1, loc='left', fontweight='bold')
plt.tight_layout()

[Figure: Movie Views Among Non-Fans]

num_fans = len(sw_fan)
num_fans
552
num_non_fans = len(sw_no_fan)
num_non_fans
284
ttl_respondents = len(star_wars)
ttl_respondents - (num_fans + num_non_fans)
350
# What percent of all respondents are fans, non-fans, and unknown?
val = [num_fans/ttl_respondents*100, num_non_fans/ttl_respondents*100, (ttl_respondents - (num_fans+num_non_fans))/ttl_respondents*100]

fig, ax = plt.subplots()
ax.bar(range(1,4), val, align='center', color='dodgerblue', linewidth=0)
ax.set_ylabel('% of Total')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.tick_params(length=0)
ax.xaxis.set_ticklabels('')
ax.set_title('Fan Category Pct of Total Respondents', loc='left', fontweight='bold', y=1.1)
ax.text(0.88, 50, 'Fans')
ax.text(1.80, 28, 'Non-Fans')
ax.text(2.80, 34, 'Unknown')
plt.tight_layout()

[Figure: Fan Category Pct of Total Respondents]

Movie Views Among Fans and Non-Fans

From the plots we can see that far more fans than non-fans saw each movie. However, there are about twice as many fans in the data as non-fans (552 vs. 284). Another 350 respondents, about 30% of the total, did not specify whether they were fans or not. For that reason, I think a more meaningful analysis would be to answer the questions: Out of all fans, what percent saw each episode? Out of all non-fans, what percent saw each episode?
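The arithmetic behind these proportions can be checked with a short sketch. The counts are taken from the outputs above; the category labels and the variable names `pct_of_total` and `fan_ratio` are made up for this example.

```python
import pandas as pd

# Respondent counts copied from the notebook outputs above.
counts = pd.Series({'Fans': 552, 'Non-Fans': 284, 'Unknown': 350})

# Share of all respondents in each category, in percent.
pct_of_total = counts / counts.sum() * 100
print(pct_of_total.round(1))

# Ratio of fans to non-fans (roughly two to one).
fan_ratio = counts['Fans'] / counts['Non-Fans']
print(round(fan_ratio, 2))
```

Because the groups differ this much in size, raw view counts favor fans almost by construction, which is why the per-group percentages are the fairer comparison.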

# What percent of fans watched each episode?
sw_fan_pct = sw_fan_views.sum() / num_fans

fig, ax = plt.subplots()
ax.bar(range(1,7), sw_fan_pct * 100, align='center', color='dodgerblue', linewidth=0)
ax.set_xlabel('Movies')
ax.set_ylabel('Pct')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.tick_params(length=0)
ax.set_title('What pct of fans saw each episode?', y=1.1, loc='left', fontweight='bold')
plt.tight_layout()

[Figure: What pct of fans saw each episode?]

# What percent of non-fans watched each episode?
sw_no_fan_pct = sw_no_fan_views.sum() / num_non_fans

fig, ax = plt.subplots()
ax.bar(range(1,7), sw_no_fan_pct * 100, align='center', color='dodgerblue', linewidth=0)
ax.set_xlabel('Movies')
ax.set_ylabel('Pct')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.tick_params(length=0)
ax.set_title('What pct of non-fans saw each episode?', y=1.1, loc='left', fontweight='bold')
plt.tight_layout()

[Figure: What pct of non-fans saw each episode?]

Percent of Movie Views Among Fans and Non-Fans

As would be expected, a large percentage of those who consider themselves fans of Star Wars saw each episode, whereas a much smaller percentage of those who consider themselves non-fans saw each episode. For fans, the lowest percentage is about 80%, for episode 3. For non-fans, the lowest percentage is about 35%, also for episode 3. The shape of each plot is somewhat similar, which suggests that the quality of the movies is recognized by fans and non-fans alike.
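As a side note, the per-group percentages could also be computed in one step with a grouped mean, since the mean of a True/False column is the share of True values. The sketch below uses a tiny made-up table, not the survey data; the column names `fan_status`, `seen_1`, and `seen_2` and all the values are hypothetical.

```python
import pandas as pd

# Toy data standing in for the survey (hypothetical values and column names).
toy = pd.DataFrame({
    'fan_status': ['fan', 'fan', 'fan', 'fan', 'non-fan', 'non-fan'],
    'seen_1':     [True, True, True, False, True, False],
    'seen_2':     [True, True, False, False, False, False],
})

# The mean of a boolean column is the fraction of True values,
# so a grouped mean gives the percent of each group that saw each film.
pct_seen = toy.groupby('fan_status')[['seen_1', 'seen_2']].mean() * 100
print(pct_seen)
```

On the real data, the fan column would play the role of `fan_status`. Note that `groupby` drops rows with a missing group key by default, so respondents who did not answer the fan question would be left out, which matches the analysis above.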

 
