Parsing YouTube History With Beautiful Soup
Last Updated: 1934Z 28NOV19 (Created: 1929Z 28NOV19)

You can export your YouTube data using Google Takeout[1] and receive it as a zipfile via email.

The exported data includes:

  • my-messages.html
  • search-history.html
  • watch-history.html
  • my-comments.html
  • likes.json
  • watch-later.json
  • subscriptions.json

I had hoped that my watch history would be available in an easy-to-parse format that could be used for statistical analysis. Unfortunately, watch history is contained in a large and complex HTML file. Online research didn't turn up a worked example for this problem, so I got to solve it the old-fashioned way.

It turns out to be a trivial problem to solve using Python 3 and Beautiful Soup[2], and I am documenting my results here.

Opening the file with Sublime Text and searching for youtube.com will take you to the area of the file containing the first video record. Copying a chunk of the file into another text window and cleaning it up results in a readable version of the divs that we're looking for.

<div class="content-cell mdl-cell mdl-cell--6-col mdl-typography--body-1">
     Watched
     <a href="https://www.youtube.com/watch?v=VIDEO_ID>
      VIDEO_TITLE
     </a>
     <br/>
     <a href="https://www.youtube.com/channel/CHANNEL_ID">
      CHANNEL_TITLE
     </a>
     <br/>
     Jul 21, 2011, 8:21:55 PM EDT
</div>

With the target div identified, we can easily use Beautiful Soup to parse the file and extract the interesting divs. Once we have the divs, we can pull the video information out of them.

from pathlib import Path

from bs4 import BeautifulSoup
from bs4 import SoupStrainer

HISTORY_FILE = "watch-history.html"

data_file = Path(HISTORY_FILE)

# Only parse the elements we're interested in; adding this cut 60 seconds off of
# the parse time for my file. The dict unpacking works around "class" being a
# reserved word in Python, so it can't be passed as a keyword argument directly.
kwargs = {"class": "content-cell mdl-cell mdl-cell--6-col mdl-typography--body-1"}
only_vid_data = SoupStrainer(**kwargs)

soup = BeautifulSoup(data_file.open().read(), "html.parser", parse_only=only_vid_data)

for tag in soup:
    print(tag.contents)
    print("===================")

    links = tag.find_all('a')
    print(links)

    print("===================")

    print(links[0].attrs['href'])
    print(links[0].contents)

    print("===================")

    print(tag.text)
    print("\n++++++++++++++++++++++++++++++++++++++++++++++++++++++\n")
    break # only parse the first tag

Abbreviated output (the initial print of tag.contents is omitted):

[<a href="https://www.youtube.com/watch?v=JFivtOmXPPM">
    The Decline of RadioShack...What Happened?</a>,
 <a href="https://www.youtube.com/channel/UCQMyhrt92_8XM0KgZH6VnRg">Company Man</a>
 ]

===================

https://www.youtube.com/watch?v=JFivtOmXPPM
['The Decline of RadioShack...What Happened?']

===================
Watched The Decline of RadioShack...What Happened?Company ManMay 28, 2019, 3:49:29 PM EDT

From here it is trivial to extract the video data for analysis (see the sketch after the list):

  • channel id
  • channel title
  • video id
  • video title
  • time watched
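
Here is a minimal sketch of that extraction, assuming each div keeps the layout shown earlier (the "Watched" text, two links, and a trailing date string). Divs for videos that have since been removed can be missing one or both links, so the sketch skips those; the field names in the record dict are my own choices.

from pathlib import Path

from bs4 import BeautifulSoup
from bs4 import SoupStrainer

HISTORY_FILE = "watch-history.html"

kwargs = {"class": "content-cell mdl-cell mdl-cell--6-col mdl-typography--body-1"}
only_vid_data = SoupStrainer(**kwargs)

soup = BeautifulSoup(Path(HISTORY_FILE).open().read(), "html.parser",
                     parse_only=only_vid_data)

records = []
for tag in soup.find_all("div"):
    links = tag.find_all("a")
    if len(links) < 2:
        continue  # removed videos are missing the video and/or channel link

    records.append({
        "video_id": links[0].attrs["href"].split("watch?v=")[-1],
        "video_title": links[0].text.strip(),
        "channel_id": links[1].attrs["href"].split("/channel/")[-1],
        "channel_title": links[1].text.strip(),
        # the watch time is the last bare string inside the div
        "time_watched": tag.contents[-1].strip(),
    })

Each dict in records now carries the five fields listed above, ready for a CSV writer or a DataFrame.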

I have expanded this code into a CLI script that extracts all of the video data from my watch history and writes it to a CSV file for reading by other programs. It is available here: GitHub Gist
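
The gist's implementation isn't reproduced here, but the CSV-writing step can be as small as the sketch below, reusing the records list from the previous snippet; the output filename is my own choice.

import csv

FIELDS = ["video_id", "video_title", "channel_id", "channel_title", "time_watched"]

# write the extracted records out with a header row for downstream tools
with open("watch-history.csv", "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)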