Visualizing My Netflix Viewing Activity with Python and Matplotlib (Part II)

How to find out your Netflix viewing patterns using your own Netflix data for fun.

10 min read · Aug 26, 2020

What does your Netflix homepage look like? Here’s mine.

Previously, I wrote a simple tutorial on visualizing your Netflix viewing activity by downloading the history file (.CSV) from your own account. The article can be viewed here. Now I want to go further by requesting more data from Netflix.

As I said in my previous article, I requested more complete data on my account on 26 June, and on 20 July I finally got it! Yay! And yes, it took me quite a long time to finish writing this article.

Unluckily, the data sets are not exactly what I expected. I expected a data set detailing whether each video I watched is a movie or a television show and, if it is a show, what the original title is (instead of the combined show title + episode title), or even better, the genres of the shows/movies, the cast, the directors, etc. Instead, I got files that are more of a report on every click I made on Netflix.

Still, let’s try to gain insights from those data.

After your request is granted, you will get a compressed file called “netflix-report.zip”. If you decompress it, you will see the following folders inside.

If you want to see what’s inside each of those folders, you can see it below.

As you can see above, there are a lot of CSV files here that you can process and try to visualize if you’re curious. For this article, I want to keep it brief and focus on analyzing one file only. I chose “ViewingActivity.csv” for this article. It’s like a more detailed version of the “NetflixViewingActivity.csv” that I used in my previous article.

So, here we go.

There are 5 profiles on my account. I have my own profile, my sister has hers, and the three others are used by my friends. I know that I don’t let anyone use my profile (yes, I am selfish about Netflix!), but I have no idea whether my friends share theirs with other people (they probably do). Mostly I don’t care, as long as I don’t get the “too many people are using your account right now” notification.

Here are the five profiles on my account.

El is mine, Silev is my sister’s, and the rest are my friends’.

So let’s begin by importing some libraries and the file itself.

import pandas as pd
import numpy as np

df = pd.read_csv("netflix-report/Content_Interaction/ViewingActivity.csv")
df.head(5)

From the data above, I can observe the following columns.

Profile Name: clear enough
Start Time: the time when you start watching the show/movie
Duration: how long you watch it
Attributes: I am still not sure what this column covers, but it seems to describe whether the show/movie/trailer you watch was autoplayed
Title: clear enough
Supplemental Video Type: this column describes whether what you watch is a trailer, hook, preview, teaser trailer, recap, etc. This is a guess, but when it is left empty, it seems to mean you are actually watching the show/movie, not a preview/hook/trailer, etc.
Device Type: explains what device you watch Netflix on
Bookmark: not clear
Latest Bookmark: not clear
Country: explains the country from which you watch Netflix. Y̶e̶s̶ ̶I̶ ̶d̶o̶ ̶u̶s̶e̶ ̶V̶P̶N̶ ̶s̶o̶m̶e̶t̶i̶m̶e̶s̶,̶ ̶a̶n̶d̶ ̶i̶t̶ ̶w̶i̶l̶l̶ ̶r̶e̶g̶i̶s̶t̶e̶r̶ ̶s̶a̶i̶d̶ ̶c̶o̶u̶n̶t̶r̶y̶ ̶(̶t̶h̶e̶ ̶V̶P̶N̶)̶ ̶t̶o̶ ̶t̶h̶i̶s̶.̶

Data Cleaning

This is important because not all data presented are relevant to our analysis below, e̶s̶p̶e̶c̶i̶a̶l̶l̶y̶ ̶t̶h̶e̶ ̶c̶o̶l̶u̶m̶n̶s̶ ̶t̶h̶a̶t̶ ̶I̶ ̶d̶o̶n̶’̶t̶ ̶u̶n̶d̶e̶r̶s̶t̶a̶n̶d̶.

Dropping unnecessary data

I don’t want to include the times I watched trailers or previews f̶r̶o̶m̶ ̶s̶c̶r̶o̶l̶l̶i̶n̶g̶ ̶N̶e̶t̶f̶l̶i̶x̶ ̶m̶i̶n̶d̶l̶e̶s̶s̶l̶y̶ ̶w̶h̶i̶l̶e̶ ̶n̶o̶t̶ ̶b̶e̶i̶n̶g̶ ̶a̶b̶l̶e̶ ̶t̶o̶ ̶d̶e̶c̶i̶d̶e̶ ̶w̶h̶a̶t̶ ̶t̶o̶ ̶w̶a̶t̶c̶h̶, so I am going to drop all rows that have values in ‘Supplemental Video Type’ column.

df = df[df['Supplemental Video Type'].isna()]
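To see what that filter does, here is a small sketch on a toy table. The titles and the values "TRAILER" and "HOOK" are made up for illustration; check the values in your own file with value_counts before trusting them.

```python
import pandas as pd

# Toy stand-in for ViewingActivity.csv; titles and type values are made up.
sample = pd.DataFrame({
    "Title": ["Show A: Episode 1", "Show A (Trailer)", "Movie B", "Movie C (Hook)"],
    "Supplemental Video Type": [None, "TRAILER", None, "HOOK"],
})

# Count every category, including the empty (NaN) one.
type_counts = sample["Supplemental Video Type"].value_counts(dropna=False)
print(type_counts)

# Keep only the rows where the column is empty, i.e. the actual viewings.
actual = sample[sample["Supplemental Video Type"].isna()]
print(actual["Title"].tolist())
```

Running value_counts(dropna=False) on the real file first is a cheap way to confirm that an empty value is by far the most common case before dropping everything else.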

Convert timestamp to your local timezone

I think Netflix records times in GMT, so I need to convert the ‘Start Time’ column to my local timezone. I live in Western Indonesia, so I set the timezone below to “Asia/Jakarta”, which is GMT+7. Change it to match your own location. The list of available time zones can be seen on a StackOverflow thread here.

from datetime import datetime, timezone
import pytz

def utc_to_local(utc_dt):
    return utc_dt.replace(tzinfo=timezone.utc).astimezone(tz=pytz.timezone("Asia/Jakarta"))

df['start_time'] = pd.to_datetime(df['Start Time']).apply(utc_to_local)
df['start_time']
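As an alternative sketch, pandas can do the same conversion on the whole column at once, without a helper function. The timestamps below are made up, and the "YYYY-MM-DD HH:MM:SS" format is an assumption about what ‘Start Time’ looks like.

```python
import pandas as pd

# Made-up UTC timestamps, assumed to match the shape of the 'Start Time' column.
sample = pd.DataFrame({"Start Time": ["2020-07-19 14:30:00", "2020-07-19 18:05:00"]})

# Parse as UTC, then convert the whole column to the local zone in one step.
sample["start_time"] = (
    pd.to_datetime(sample["Start Time"], utc=True).dt.tz_convert("Asia/Jakarta")
)
print(sample["start_time"])
```

Note how the second row rolls over to the next day after the +7 hour shift, which is exactly why the conversion matters for hour-of-day charts.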

Convert timestamp to appropriate format

df['start_time'] = df['start_time'].apply(lambda x: x.strftime("%Y-%m-%d, %H:00:00"))
df['start_time']

Drop rows with very short durations

Okay, so there are probably times when we open a show/movie on Netflix but change our mind a few seconds later for whatever reason. Does this only happen to me? Well, I don’t want to count those, so we drop viewings that lasted less than 5 minutes. You can set this to whatever duration fits you, though.

df['duration_minutes'] = df['Duration'].str.split(':').apply(lambda x: int(x[0]) * 60 + int(x[1]))

# only include viewings with at least 5 minutes duration
df = df[df['duration_minutes']>=5]
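The conversion above only uses the hours and minutes and ignores whatever comes after the second colon. Here is a small sketch that keeps the seconds too, assuming the duration strings look like "HH:MM:SS" (duration_to_minutes is a hypothetical helper name, not something from the Netflix file).

```python
def duration_to_minutes(duration):
    """Convert an 'HH:MM:SS' string to minutes, including the seconds part."""
    hours, minutes, seconds = (int(part) for part in duration.split(":"))
    return hours * 60 + minutes + seconds / 60


print(duration_to_minutes("00:04:59"))  # just under 5 minutes
print(duration_to_minutes("01:30:00"))  # one and a half hours
```

For the 5-minute cutoff both versions keep the same rows; the seconds only matter if you later want to sum up total watch time.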

Data Visualization

O̶k̶a̶y̶ ̶t̶h̶a̶t̶ ̶d̶a̶t̶a̶ ̶c̶l̶e̶a̶n̶i̶n̶g̶ ̶p̶a̶r̶t̶ ̶w̶a̶s̶ ̶b̶o̶r̶i̶n̶g̶, now here comes the fun part! The visualization with graphics and such!

Who watches the most Netflix?

Let’s see the viewing frequency of each profile. Who watches the most Netflix?

import matplotlib.pyplot as plt

profile_count = df["Profile Name"].value_counts()

plt.figure(figsize=(8,5))
plt.bar(profile_count.index, profile_count.values, color="teal")
plt.ylabel("Freq", fontsize=14)
plt.xlabel("Profile Names", fontsize=14)
plt.xticks(fontsize=11)
plt.title("Viewing Frequency of Each Profile", fontsize=16)
plt.show()

There’s also another way to do this that gives the same result.

fig, ax = plt.subplots(figsize=(8,5))
ax.bar(profile_count.index, profile_count.values, color="teal")
ax.set_xlabel("Profile names", fontsize=14)
ax.set_ylabel("Freq", fontsize=14)
ax.set_title("Viewing Frequency of Each Profile", fontsize=16)
plt.show()

How long is the duration of shows/movies we watch?

We already converted the duration into minutes before. Let’s try to turn that into a histogram.

# histogram of viewing durations in minutes (the number of bins is an arbitrary choice)
fig, ax = plt.subplots(figsize=(8,5))
ax.hist(df['duration_minutes'], bins=20, color="teal")
ax.set_xlabel("Duration (minutes)", fontsize=14)
ax.set_ylabel("Freq", fontsize=14)
ax.set_title("Durations of Viewing Activity", fontsize=16)
plt.show()

Seems like we mostly watch shorter shows

T̶h̶e̶ ̶h̶i̶s̶t̶o̶g̶r̶a̶m̶ ̶l̶o̶o̶k̶s̶ ̶a̶w̶f̶u̶l̶,̶ ̶a̶n̶d̶ ̶I̶ ̶d̶o̶n̶’̶t̶ ̶l̶i̶k̶e̶ ̶i̶t̶.̶ So let’s try to do something else. You can categorize the duration (in minutes) into discrete categories, like shows with less than 30 minutes length, shows between 30–60 minutes, and so on.

# .copy() so that adding a column below does not trigger a SettingWithCopyWarning
df_duration = df[['Profile Name', 'duration_minutes']].copy()

def categorize_duration(x):
    cat = ""
    if x < 30:
        cat = "less than 30 mins"
    elif x < 60:
        cat = "30–60 mins"
    else:
        cat = "more than 1 hour"
    return cat

df_duration['duration_cats'] = df_duration['duration_minutes'].apply(categorize_duration)
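Another way to get the same kind of categories is pandas’ pd.cut, which bins a numeric column in one call. This is only a sketch with made-up minute values; the bin edges and labels mirror the three categories described above.

```python
import numpy as np
import pandas as pd

# Made-up durations in minutes, including the edge values 30 and 60.
minutes = pd.Series([12, 29, 30, 45, 59, 60, 125])

# right=False makes each bin include its left edge: [0, 30), [30, 60), [60, inf).
cats = pd.cut(
    minutes,
    bins=[0, 30, 60, np.inf],
    right=False,
    labels=["less than 30 mins", "30–60 mins", "more than 1 hour"],
)
print(cats.value_counts(sort=False))
```

A side benefit is that the result is an ordered categorical, so value_counts(sort=False) lists the categories from shortest to longest instead of by frequency.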

Then turn that into a graphic.

durations_count = df_duration['duration_cats'].value_counts()

fig, ax = plt.subplots(figsize=(8,5))
ax.bar(durations_count.index, durations_count.values, color="orange")
ax.set_xticks(durations_count.index)
ax.set_xticklabels(durations_count.index)
ax.set_xlabel("Duration categories", fontsize=14)
ax.set_ylabel("Freq", fontsize=14)
ax.set_title("Durations of Viewing Activity", fontsize=16)
plt.show()

Yeah, apparently most of our viewings last less than 30 minutes. S̶u̶c̶h̶ ̶a̶ ̶s̶h̶o̶r̶t̶ ̶a̶t̶t̶e̶n̶t̶i̶o̶n̶ ̶s̶p̶a̶n̶ ̶i̶n̶d̶e̶e̶d̶!̶

Now if you want to see the viewing durations for each profile, you can do that too!

df_duration.groupby(['Profile Name', 'duration_cats']).size().unstack().plot(kind='bar', stacked=True)
plt.title("Duration of Viewing Activity")
plt.legend(loc=(1.05, 0))# loc = x,y
plt.show()

Can’t believe the same applies to my friends too

Where do you access Netflix the most?

There used to be a state-owned internet provider in Indonesia that blocked Netflix, and unluckily I use that provider. I̶ ̶c̶a̶n̶ ̶w̶r̶i̶t̶e̶ ̶a̶ ̶w̶h̶o̶l̶e̶ ̶r̶a̶n̶t̶ ̶a̶b̶o̶u̶t̶ ̶t̶h̶e̶ ̶g̶o̶v̶e̶r̶n̶m̶e̶n̶t̶s̶ ̶h̶e̶r̶e̶,̶ ̶b̶u̶t̶ ̶t̶h̶i̶s̶ ̶i̶s̶ ̶n̶o̶t̶ ̶t̶h̶e̶ ̶t̶i̶m̶e̶.̶ At least they are not blocking Netflix anymore. But back when they still did, you know what I did.

country_count = df['Country'].value_counts()

plt.figure(figsize=(8,5))
plt.bar(country_count.index, country_count.values, color="crimson")
plt.xlabel("Countries", fontsize=14)
plt.ylabel("Freq", fontsize=14)
plt.xticks(rotation=45, ha="right", fontsize=10)
plt.title("Locations Used to Access Netflix", fontsize=16)
plt.show()

If you look at the graphic above and think, “nah, I can’t see that,” then you’re right. It’s in cases like this that a logarithmic scale is actually a good option.
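To get a feel for why, here is a rough back-of-the-envelope sketch with made-up counts: one dominant home country and a few rare ones. It compares each bar’s height relative to the tallest bar on a linear axis and on a log10 axis, assuming the log axis starts at 1 (Matplotlib picks its own lower limit, so the real picture differs a bit).

```python
import math

# Made-up counts: one dominant country and a few rare ones (names are placeholders).
counts = {"ID (Indonesia)": 4200, "Country A": 37, "Country B": 5, "Country C": 2}

top = max(counts.values())
shares = {}
for country, n in counts.items():
    linear = n / top                      # relative bar height on a linear axis
    log = math.log10(n) / math.log10(top)  # relative height on a log10 axis starting at 1
    shares[country] = (linear, log)
    print(f"{country}: linear {linear:.4f}, log {log:.4f}")
```

On the linear axis the smallest bar is a tiny fraction of a percent of the tallest one, so it is effectively invisible, while on the log axis it is still a visible sliver.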

plt.figure(figsize=(8,5))
plt.bar(country_count.index, country_count.values, color="crimson")
plt.xlabel("Countries", fontsize=14)
plt.ylabel("Freq (log scale)", fontsize=14)
plt.xticks(rotation=45, ha="right", fontsize=10)
plt.title("Locations Used to Access Netflix", fontsize=16)

# log scale, the base doesn't have to be 10 btw,
# you can use 2 or whatever
# (Matplotlib 3.3+ takes `base`; older versions used `basey`)
plt.yscale("log", base=10)
plt.show()

I have also painstakingly tried another way, which gives the same result.

fig, ax = plt.subplots(figsize=(8,5))
ax.bar(country_count.index, country_count.values, color="crimson")
ax.set_xticks(country_count.index)
ax.set_xticklabels(country_count.index, rotation=45)
ax.set_xlabel("Countries", fontsize=14)
ax.set_ylabel("Freq (log scale)", fontsize=14)
ax.set_title("Locations Used to Access Netflix", fontsize=16)
ax.set_yscale("log")
plt.show()

Now let’s do that to each profile a̶n̶d̶ ̶s̶e̶e̶ ̶w̶h̶o̶ ̶a̶c̶c̶e̶s̶s̶e̶d̶ ̶N̶e̶t̶f̶l̶i̶x̶ ̶u̶s̶i̶n̶g̶ ̶V̶P̶N̶ ̶t̶h̶e̶ ̶m̶o̶s̶t̶.̶

df.groupby(['Profile Name', 'Country']).size().unstack().plot(kind='bar', stacked=True)
plt.title("Locations Used to Access Netflix for Each Profile", fontsize=14)
plt.legend(loc=(1.05, 0))# loc = x,y
plt.show()

Obviously everyone accessed Netflix from Indonesia here smh

Since all five of us are from Indonesia, you see that we access Netflix mostly from Indonesia.

That’s so obvious! So let’s remove Indonesia, not from reality though, just from the dataset.

# remove your home country
df[df['Country']!='ID (Indonesia)'].groupby(['Profile Name', 'Country']).size().unstack().plot(kind='bar', stacked=True)
plt.title("Locations Used to Access Netflix (not including Indonesia)", fontsize=14)
plt.legend(loc=(1.05, 0))# loc = x,y
plt.show()

Well hello there, it’s me again. I̶ ̶d̶o̶n̶’̶t̶ ̶w̶a̶n̶t̶ ̶t̶o̶ ̶a̶d̶m̶i̶t̶ ̶t̶h̶a̶t̶ ̶I̶ ̶u̶s̶e̶ ̶V̶P̶N̶ ̶a̶ ̶l̶o̶t̶ ̶t̶o̶ ̶a̶c̶c̶e̶s̶s̶ ̶N̶e̶t̶l̶i̶x̶.̶ ̶L̶e̶t̶’̶s̶ ̶j̶u̶s̶t̶ ̶a̶s̶s̶u̶m̶e̶ ̶t̶h̶a̶t̶ ̶I̶ ̶d̶i̶d̶ ̶i̶n̶d̶e̶e̶d̶ ̶g̶o̶ ̶t̶o̶ ̶t̶h̶o̶s̶e̶ ̶c̶o̶u̶n̶t̶r̶i̶e̶s̶ ̶o̶n̶l̶y̶ ̶t̶o̶ ̶w̶a̶t̶c̶h̶ ̶N̶e̶t̶f̶l̶i̶x̶!̶

My sister (Silev) probably used a VPN once or twice. Meanwhile my friend (calico cat) doesn’t appear on the graph at all, which means she never used a VPN. Now that’s a law-abiding citizen.

What device do you use to watch Netflix?

How do you watch Netflix? On a big screen? On a phone screen? On a smart fridge?

device_count = df['Device Type'].value_counts()

plt.figure(figsize=(10,5))
plt.barh(device_count.index, device_count.values, color="mediumturquoise")
plt.xlabel("Freq", fontsize=14)
plt.ylabel("Devices used", fontsize=14)
plt.xticks(fontsize=10)
plt.title("Devices Used to Access Netflix", fontsize=16)

# you need this line so the highest values will be on top
plt.gca().invert_yaxis() 
plt.show()

Chrome PC is apparently the most common choice to watch Netflix here. And that Linux thing (Chrome and Firefox) is definitely me, because I am the only one who uses Linux here (that may sound like humblebragging, trust me I am not, I̶ ̶a̶m̶ ̶j̶u̶s̶t̶ ̶t̶o̶o̶ ̶c̶h̶e̶a̶p̶ ̶t̶o̶ ̶b̶u̶y̶ ̶o̶r̶i̶g̶i̶n̶a̶l̶ ̶O̶S̶). In case you don’t believe it, let’s prove it here.

df.groupby(['Profile Name','Device Type']).size().unstack().plot(kind='bar', stacked=True, colormap="tab20b")
plt.title("Devices Used for Each Profile")
plt.legend(loc=(1.05, 0))# loc = x,y
plt.show()

My sister watches a lot from her phone, and so do I (dark blue), though I also watch a lot from Chrome Linux (light beige) and Opera (salmon). Meanwhile Chrome PC (light green) is more popular among my friends.

Heat Map

Here comes the actual fun part, heat map! I am still learning about this too, hopefully this is good enough for you, because I got lost a few times when writing the code. At least it’s working now, or so I guess.

For this part I only want to analyze my own data, from the El profile, so I am going to focus on that one. I will make a new dataframe called ‘dfl’.

dfl = df[df['Profile Name']=='El']
dfl.head()

First, let’s count the viewing frequency and sort that by hour.

by_hour = dfl['start_time'].value_counts().sort_index(ascending=True)
by_hour

by_hour.index = pd.to_datetime(by_hour.index)

df_datehour = by_hour.rename_axis('date_hour').reset_index(name='counts')
df_datehour

You see, there are a lot of dates and hours missing, for example during the time when I didn’t watch Netflix (well I don’t watch Netflix 24/7 you know), meanwhile we need a continuous time series with 1 hour in between. So we will have to fill those missing dates and time with 0s.

idx = pd.date_range(min(by_hour.index), max(by_hour.index), freq='1H')
s = by_hour.reindex(idx, fill_value=0)
s
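If the reindex step feels abstract, here is a tiny sketch of the same idea with made-up counts: two viewing hours with a gap in between, filled in with zeros. The "60min" frequency string is just another way of asking for hourly steps.

```python
import pandas as pd

# Made-up hourly counts with a gap between 20:00 and 23:00.
counts = pd.Series(
    [2, 1],
    index=pd.to_datetime(["2020-07-19 20:00:00", "2020-07-19 23:00:00"]),
)

# Build every hour between the first and last timestamp, then fill the gaps with 0.
full_index = pd.date_range(counts.index.min(), counts.index.max(), freq="60min")
filled = counts.reindex(full_index, fill_value=0)
print(filled)
```

The two original hours keep their counts, and 21:00 and 22:00 appear as new rows with 0.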

After that, we add the date, hour, day, month, and year to the dataframe.

dfl_count = s.rename_axis('datetime').reset_index(name='freq')
dfl_count['date'] = dfl_count['datetime'].dt.date
dfl_count['hour'] = dfl_count['datetime'].dt.hour
# .dt.weekday_name was removed in pandas 1.0; .dt.day_name() returns the same names
dfl_count['day'] = dfl_count['datetime'].dt.day_name()
dfl_count['month'] = dfl_count['datetime'].dt.month
dfl_count['year'] = dfl_count['datetime'].dt.year

dfl_count = dfl_count.drop(['datetime'], axis=1)

Then we make a matrix-like dataframe before we can turn it into a heat map.

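dfl_hm = dfl_count[['day', 'hour', 'freq']].groupby(['day', 'hour']).sum()

m = dfl_hm.unstack().fillna(0)
# groupby sorts the day names alphabetically, so put the rows back in
# Monday-to-Sunday order to match the labels used in the heat map
m = m.reindex(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'], fill_value=0)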
m
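One thing to watch with a matrix like this: groupby sorts the day names alphabetically (Friday comes first), while heat map labels usually run Monday to Sunday. Here is a small sketch with made-up numbers that builds the same kind of matrix with pivot_table and forces the row and column order explicitly. The toy table only imitates the shape of dfl_count.

```python
import pandas as pd

# Made-up rows in the same shape as dfl_count: weekday name, hour, frequency.
toy = pd.DataFrame({
    "day": ["Monday", "Friday", "Monday", "Sunday"],
    "hour": [21, 21, 22, 9],
    "freq": [1, 2, 3, 4],
})

day_order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Sum the frequencies per (day, hour), then force 7 rows and 24 columns in a fixed order.
matrix = (
    toy.pivot_table(index="day", columns="hour", values="freq", aggfunc="sum", fill_value=0)
       .reindex(index=day_order, columns=range(24), fill_value=0)
)
print(matrix.shape)
```

With the rows pinned to day_order, the first row of the heat map is guaranteed to be Monday, whatever days happen to appear in the data.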

Let’s define how many hours there are in a day and how many days there are in a week.

hours_list = list(range(0,24))
days_name = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

Now here is the heatmap!

import seaborn as sns

sns.set_context('talk')
f, ax = plt.subplots(figsize=(12,5))
ax = sns.heatmap(m, linewidths=.5, ax=ax, yticklabels=days_name, xticklabels=hours_list,cmap='viridis')
ax.axes.set_title('Heatmap of My Netflix Viewing Activity', fontsize=20, y=1.02)
ax.set(xlabel='Hour of day',ylabel='Day of Week');

The lighter the color, the more often I watch Netflix; darker means less. I watch Netflix mostly at night and very rarely during the day. I guess Sunday is a pretty prominent time, though that’s quite an interesting 20.00 for a Thursday too.

Conclusion

Finally, this article reaches its end too! I only made use of one of the files I got from Netflix, but I believe there is still a lot more you can dig out of those files. I wish I could analyze and visualize them all, b̶u̶t̶ ̶I̶ ̶a̶m̶ ̶a̶ ̶b̶i̶t̶ ̶t̶o̶o̶ ̶l̶a̶z̶y̶. While I know my writing isn’t perfect, I hope this little piece can help you or inspire you (or even aggravate you) or whatever. Feel free to comment below or contact me on Twitter (it’s on my profile) about what you think or if you need any help.

You’re finished reading? Now go away. Okay, nope just kidding.