Director recommender pt.1

March 7, 2019, 10:26 a.m.

I love watching movies, and when I find a director I like, I tend to want to watch more movies by that same director. My favourites include Andrei Tarkovsky, Yimou Zhang and Hayao Miyazaki. Finding a new director that fits my interests can be challenging, so I went looking for a movie director recommender system. Although many recommender systems exist for movies, I could not find a single one that recommends directors.

My vision for this blog is to showcase my work. That means I shouldn't just create tools; I should also learn how to share them on this website. And although my machine learning skills are quite extensive, my website building skills are not quite up to par. So what better way to practice than to create a relatively simple tool as a first project to publish here and write about it? In this series I will write about my simple implementation of a movie director recommender system, how to make the system efficient and fast, and how to publish it with the Django framework. You can find the finished product in the Tools section, or here.

In this first post I will write about cleaning the data and setting it up for deployment on a MySQL server.

 

The data I will be using comes from IMDb. They publish data on the titles in their database in multiple clean and easily accessible files. One of the files we are going to use includes basic information about each title: title, type (e.g. movie, short, tvSeries, tvEpisode, video, etc.), year and so forth. Another file holds the ratings and another one the crew (directors and writers). The last file we are using is an important one. IMDb calls it the "principals", and it lists the (up to) 10 most important people that worked on each title. These people include the directors, producers, writers, cinematographers and actors. The principals are what I am going to base my recommender system on: I believe that to find similarity between directors I should look at the people they choose to work with. Many current recommender systems work with similarity in just the actors, in the words used in the synopsis, or in user preferences. I believe that to recommend a director we should trust the professionalism of the individual and see who he or she likes to work with on a daily basis.

 

Preparing the data

 

import pandas as pd
import dask.dataframe as dd
import numpy as np

movies = pd.read_csv('title.basics.tsv', sep='\t',
                     na_values=r'\N', low_memory=False)
ratings = pd.read_csv('title.ratings.tsv', sep='\t',
                      na_values=r'\N', low_memory=False)

 

Because it is easy to work with and has a large collection of functions, I love using Pandas when speed is not of the utmost importance. However, Pandas can have some difficulty reading large files with a plain read_csv call. Specifying the chunksize parameter in read_csv can help, because the file is then read in pieces instead of all at once. Dask is a package that can increase the speed even more since it utilizes multiple cores. The reason I don't use Dask for all files is that I feel a little more comfortable with Pandas' read_csv, and the other (smaller) files threw a warning when read with Dask.
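
As a minimal sketch of the chunked reading mentioned above: the snippet below reads a tiny in-memory stand-in for an IMDb-style TSV in pieces and keeps only the rows typed as 'movie'. The sample data, the helper name read_in_chunks and the chunk size are made up for illustration; with a real file you would pass the file path and a much larger chunk size.

```python
import io
import pandas as pd

# Tiny stand-in for an IMDb-style TSV file; the real files are far larger.
sample_tsv = (
    "tconst\ttitleType\tstartYear\n"
    "tt0000001\tshort\t1894\n"
    "tt0000009\tmovie\t1894\n"
    "tt0000335\tmovie\t1900\n"
    "tt0000502\tmovie\t\\N\n"
)


def read_in_chunks(source, rows_per_chunk):
    """Hypothetical helper: read a TSV piece by piece, keeping only movies."""
    kept = []
    # With chunksize set, read_csv returns an iterator of DataFrames.
    reader = pd.read_csv(source, sep='\t', na_values=r'\N',
                         chunksize=rows_per_chunk)
    for chunk in reader:
        kept.append(chunk[chunk['titleType'] == 'movie'])
    return pd.concat(kept, ignore_index=True)


movies_sample = read_in_chunks(io.StringIO(sample_tsv), rows_per_chunk=2)
print(movies_sample)
```

Filtering inside the loop means that only the rows we care about stay in memory, which is where most of the benefit of chunking comes from.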

Creating the movies DataFrame and our filtered movie list

A large portion (~90%) of the database is not relevant for our recommender system, because those titles are not movies but rather series or documentaries. Removing them can increase the speed of our tool, and since this is a movie director recommender system we should leave them out of our model. Next, I want to remove outliers in terms of directors: some movies have too many directors and some directors have too many movies. I feel this data might do our model more harm than good.

docus = movies[movies['genres'].str.contains('Documentary') == True]
movies = movies.drop(docus.index)
movies = movies.query('titleType == "movie" or \
                       titleType == "tvMovie" or \
                       titleType == "tvMiniSeries"')
movies = movies[['tconst', 'primaryTitle', 'originalTitle', 'startYear']]
movies = pd.merge(movies, ratings, how='left', on=['tconst'])

int_cols = ['startYear', 'numVotes']
str_cols = ['tconst', 'primaryTitle', 'originalTitle']
movies.fillna(-1, inplace=True)
movies[int_cols] = movies[int_cols].astype(np.int64)
movies[str_cols] = movies[str_cols].astype(str)
movie_list = list(movies['tconst'])

 

First we find all the titles that have Documentary among their 'genres' and remove them. Then we filter our data on the 'titleType' values that are relevant to us and add the ratings so we can use those later. We also drop the columns that are not relevant to us and convert the types of the remaining columns. Finally, movie_list is created so that we can later filter our other datasets with it.
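
The same filtering idea can be tried out on a handful of rows. This is only a sketch on made-up ids: it uses isin() in place of the chained query and na=False so that titles without a genre count as "not a documentary", which are my own substitutions and not the code used for the real data.

```python
import pandas as pd

# Made-up rows that mimic the columns of title.basics.tsv.
toy = pd.DataFrame({
    'tconst': ['tt1', 'tt2', 'tt3', 'tt4', 'tt5'],
    'titleType': ['movie', 'movie', 'tvSeries', 'tvMovie', 'short'],
    'genres': ['Drama', 'Documentary,History', 'Comedy', None, 'Drama'],
})

# na=False treats a missing genre as "not a documentary".
is_docu = toy['genres'].str.contains('Documentary', na=False)
wanted_types = ['movie', 'tvMovie', 'tvMiniSeries']

# Keep rows that are not documentaries and have a wanted title type.
filtered = toy[~is_docu & toy['titleType'].isin(wanted_types)]
toy_movie_list = list(filtered['tconst'])
print(toy_movie_list)
```

Only tt1 (a drama movie) and tt4 (a tvMovie without a genre) survive the filter.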

Next, we create a DataFrame that maps the id 'nconst' to an actual name:

names = dd.read_csv('name.basics.tsv', sep='\t',
                    na_values=r'\N', low_memory=False)
names = names.compute()
names = names[['nconst', 'primaryName']]
names = names.rename(columns={'nconst': 'dconst'})
names.fillna(-1, inplace=True)

 

Creating the movie_dir DataFrame

The movie_dir DataFrame maps a movie to its director(s) and a director to their movie(s).

move_dir = pd.read_csv('title.crew.tsv', sep='\t',
                       na_values=r'\N', low_memory=False)

 

For every movie there is a column for the writers and a column for the directors. Multiple people in one column are separated by commas.

In [9]: move_dir[6:10]
Out[9]: 
      tconst            directors    writers
6  tt0000007  nm0005690,nm0374658        NaN
7  tt0000008            nm0005690        NaN
8  tt0000009            nm0085156  nm0085156
9  tt0000010            nm0525910        NaN

 

First we drop the writers column and the rows without a director, then we keep only the movies that are in our movie list, and then we split the directors on the comma so we get a list of directors instead of one long string. We then check the length of those lists and drop the movies that have too many directors (those above the 0.999 quantile).

move_dir.drop('writers', axis=1, inplace=True)
move_dir.dropna(inplace=True)
movie_dir = move_dir.query('tconst in @movie_list').copy()

movie_dir['directors'] = movie_dir['directors'].str.split(',')
upper = movie_dir['directors'].str.len().quantile(0.999)
upper_dirs = movie_dir[movie_dir['directors'].str.len() > upper].index
movie_dir = movie_dir.drop(upper_dirs)

 

movie_dir = pd.concat([movie_dir.reset_index(drop=True),
                       pd.DataFrame(movie_dir['directors'].tolist())],
                      axis=1)

 

We then do some magic. We create a DataFrame that spreads the 'directors' lists over multiple columns, one column per individual, and then use concat to attach it to the original DataFrame so we get something that looks like this:

In [21]: movie_dir.head()
Out[21]: 
      tconst               directors          0          1     2     3     4     5     6     7
0  tt0000009             [nm0085156]  nm0085156       None  None  None  None  None  None  None
1  tt0000335  [nm0675140, nm0095714]  nm0675140  nm0095714  None  None  None  None  None  None
2  tt0000502             [nm0063413]  nm0063413       None  None  None  None  None  None  None
3  tt0000574             [nm0846879]  nm0846879       None  None  None  None  None  None  None
4  tt0000615             [nm0533958]  nm0533958       None  None  None  None  None  None  None

movie_dir.drop('directors', axis=1, inplace=True)
movie_dir = movie_dir.melt(id_vars='tconst', value_name='dconst')

 

We then drop the redundant 'directors' column and use the melt function to create something as follows:

In [24]: movie_dir[movie_dir['tconst'] == 'tt0000335']
Out[24]: 
            tconst variable     dconst
1        tt0000335        0  nm0675140
467926   tt0000335        1  nm0095714
935851   tt0000335        2       None
1403776  tt0000335        3       None
1871701  tt0000335        4       None
2339626  tt0000335        5       None
2807551  tt0000335        6       None
3275476  tt0000335        7       None

 

We can now drop the variable column and None values using:

movie_dir.drop('variable', axis=1, inplace=True)
movie_dir.dropna(inplace=True)
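
As an aside, newer versions of Pandas (0.25 and up) offer DataFrame.explode, which gets from the split lists to one row per movie-director pair in a single step. The sketch below is an alternative to the concat and melt route, not the code used above, and it runs on two made-up rows in the shape of title.crew.tsv.

```python
import pandas as pd

# Two rows shaped like title.crew.tsv after dropping the writers column.
crew_toy = pd.DataFrame({
    'tconst': ['tt0000009', 'tt0000335'],
    'directors': ['nm0085156', 'nm0675140,nm0095714'],
})

# Turn the comma separated string into a list of ids ...
crew_toy['directors'] = crew_toy['directors'].str.split(',')

# ... and give every id in the list its own row.
long_toy = (crew_toy.explode('directors')
                    .rename(columns={'directors': 'dconst'})
                    .reset_index(drop=True))
print(long_toy)
```

Unlike melt, explode keeps the rows of one movie next to each other and never creates None placeholders, so no extra dropna is needed.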

 

Then, as our final filtering step, we remove all the directors that have too many movies:

temp = movie_dir.groupby('dconst')['tconst'].apply(list)
upper = temp.apply(len).quantile(0.999)
# temp is indexed by dconst, so this index holds director ids
upper_dconst = temp[temp.apply(len) > upper].index
# match on 'dconst' (not 'tconst') to find the rows of those directors
to_drop = movie_dir[movie_dir['dconst'].isin(upper_dconst)].index
movie_dir.drop(to_drop, inplace=True)

 

We used groupby() and apply(list) to create a list for every director containing all the movies they have made. We find the upper 0.999 quantile again, find those directors and drop them.
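
To see the quantile cut-off at work, here is a small sketch on made-up data in which one director has far more titles than the others. The ids are invented, size() stands in for apply(list) plus len, and the 0.75 quantile is used only because the toy set is tiny.

```python
import pandas as pd

# Made-up pairs: director 'nmA' has far more titles than the rest.
pairs = pd.DataFrame({
    'tconst': [f'tt{i}' for i in range(12)],
    'dconst': ['nmA'] * 8 + ['nmB', 'nmB', 'nmC', 'nmD'],
})

# Number of titles per director, indexed by dconst.
counts = pairs.groupby('dconst')['tconst'].size()

# 0.75 instead of 0.999 only because there are just four directors here.
cutoff = counts.quantile(0.75)
prolific = counts[counts > cutoff].index

# Drop every row that belongs to a director above the cut-off.
trimmed = pairs[~pairs['dconst'].isin(prolific)]
print(sorted(trimmed['dconst'].unique()))
```

The counts are 8, 2, 1 and 1, so the cut-off lands between 2 and 8 and only 'nmA' is removed.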

Creating the directors DataFrame

directors = movie_dir['dconst'].unique()
directors = pd.DataFrame(directors, columns=["dconst"])
directors = pd.merge(directors, names, how='left', on=['dconst'])

 

This is straightforward. From our movie_dir we take the unique director ids and merge them with the names DataFrame, which contains the actual names.

Creating the movie_prin DataFrame

Lastly, we create the DataFrame that maps the movies to their principals.

movie_prin = dd.read_csv('title.principals.tsv', sep='\t')
movie_prin = movie_prin.compute()
movie_prin = movie_prin.query('tconst in @movie_list')
movie_prin = movie_prin[['tconst', 'nconst']]
movie_prin = movie_prin.rename(columns={'nconst': 'pconst'})
movie_prin.dropna(inplace=True)
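
As a rough illustration of how movie_dir and movie_prin could fit together later on: the sketch below joins two toy tables on tconst and collects, per director, the set of people they worked with. The ids are made up, and this is an assumption about one possible next step, not necessarily how the finished tool works.

```python
import pandas as pd

# Made-up ids for illustration only.
movie_dir_toy = pd.DataFrame({
    'tconst': ['tt1', 'tt2', 'tt3'],
    'dconst': ['nmD1', 'nmD1', 'nmD2'],
})
movie_prin_toy = pd.DataFrame({
    'tconst': ['tt1', 'tt1', 'tt2', 'tt3', 'tt3'],
    'pconst': ['nmD1', 'nmP1', 'nmP1', 'nmD2', 'nmP2'],
})

# Pair every director of a movie with every principal of that movie.
joined = pd.merge(movie_dir_toy, movie_prin_toy, how='inner', on='tconst')

# A director can appear among the principals of their own movie; drop those.
joined = joined[joined['dconst'] != joined['pconst']]

# One set of collaborators per director.
collaborators = joined.groupby('dconst')['pconst'].apply(set)
print(collaborators)
```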

 

DataFrame overview

At this point we have the following DataFrames:

Movies

In [29]: movies.head()
Out[29]: 
      tconst                 primaryTitle                originalTitle  startYear  averageRating  numVotes
0  tt0000009                   Miss Jerry                   Miss Jerry       1894            5.6        75
1  tt0000335        Soldiers of the Cross        Soldiers of the Cross       1900            6.2        38
2  tt0000502                     Bohemios                     Bohemios       1905           -1.0        -1
3  tt0000574  The Story of the Kelly Gang  The Story of the Kelly Gang       1906            6.2       499
4  tt0000615           Robbery Under Arms           Robbery Under Arms       1907            4.8        14

Directors

In [30]: directors.head()
Out[30]: 
      dconst       primaryName
0  nm0085156   Alexander Black
1  nm0675140      Joseph Perry
2  nm0063413  Ricardo de Baños
3  nm0846879      Charles Tait
4  nm0533958  Charles MacMahon

Movie_dir

In [32]: movie_dir.head()
Out[32]: 
      tconst     dconst
0  tt0000009  nm0085156
1  tt0000335  nm0675140
2  tt0000502  nm0063413
3  tt0000574  nm0846879
4  tt0000615  nm0533958

Movie_prin

In [33]: movie_prin.head()
Out[33]: 
        tconst     pconst
24   tt0000009  nm0063086
25   tt0000009  nm0183823
26   tt0000009  nm1309758
27   tt0000009  nm0085156
530  tt0000335  nm1010955

 

These DataFrames can now be uploaded to the MySQL server that I set up. In the next post we will explore how this is done.