Director recommender pt.2

March 12, 2019, 12:05 a.m.

In the previous post I described how I prepared the data for creating and publishing our model.

The data is structured so that it is suitable for a relational database like MySQL. Many alternatives are available, of which PostgreSQL is probably the most popular at this time. I am using Pythonanywhere.com to host my server, and since they only offer MySQL for free, that is what I will be using for this project.


Uploading the data to the server

import sshtunnel
import pandas as pd
import sqlalchemy as db

sshtunnel.SSH_TIMEOUT = 5.0
sshtunnel.TUNNEL_TIMEOUT = 5.0


Pythonanywhere has an interesting server architecture in place: to reach our MySQL database from outside of Pythonanywhere we first need to SSH into our account, through which we can then upload our data. For this we will be using the sshtunnel package. Later, we will use sqlalchemy to run some SQL queries. The actual posting of the DataFrames to the server is done by the pandas method to_sql(), which uses sqlalchemy in the backend. We can also set some sshtunnel timeouts to break the connection to the server when it is idle, although all of the posting will be done inside a with ... as ... statement anyway:

with sshtunnel.SSHTunnelForwarder(
    ('ssh.pythonanywhere.com'),
    ssh_username='<username>', ssh_password='<password>',
    remote_bind_address=('<username>.mysql.pythonanywhere-services.com', 3306)
) as tunnel:


Then, inside the with block (so the tunnel is still open), we connect to our MySQL server:

engine = db.create_engine('mysql+mysqldb://<username>:<mySQL password>@127.0.0.1:{}'
    '/<username>$<database>?charset=utf8mb4'.format(tunnel.local_bind_port))
con = engine.connect()


And we can post the DataFrames we created in our previous post as easy as pie:

directors.to_sql(con=con, name='directors', if_exists='replace',
                 chunksize=5000, index=False)
movies.to_sql(con=con, name='movies', if_exists='replace',
              chunksize=5000, index=False)

movie_dir.to_sql(con=con, name='movie_dir', if_exists='replace',
                 chunksize=5000, index=False)

movie_prin.to_sql(con=con, name='movie_prin', if_exists='replace',
                  chunksize=5000, index=False)


This process takes quite some time. I'm not sure whether that should be attributed to the SSH tunnel or to other causes, but for me this step took about 30 to 60 minutes. Some people report a speed-up after changing the chunksize, but I did not see such results. The other arguments of this function are quite straightforward: con is the database connection, name is the name of the table to create, if_exists='replace' drops and recreates the table if it already exists, and index=False keeps pandas from uploading the DataFrame index as an extra column.

Next, we create a new table called principles and copy into it only the pconst column, which holds the IDs of all the principles. Creating a separate table like this is needed if we want to implement a many-to-many relationship.

con.execute("""CREATE TABLE principles(
                   pconst VARCHAR(10),
                   PRIMARY KEY (pconst));""")

con.execute("""INSERT INTO principles (pconst)
               SELECT DISTINCT pconst
               FROM movie_prin;""")


The first command creates the table principles with a single column: a VARCHAR of at most 10 characters that also serves as the index (PRIMARY KEY). The second copies over the unique (DISTINCT) values from the movie_prin table we uploaded in the previous step.
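As a quick sanity check, the same two statements can be run against a throwaway SQLite database (standing in for MySQL here; the IDs m1, p1 and so on are made up):

```python
import sqlite3

# Throwaway in-memory database; 'm1', 'p1', ... are made-up stand-ins
# for the real tconst/pconst values.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movie_prin (tconst VARCHAR(10), pconst VARCHAR(10))")
conn.executemany("INSERT INTO movie_prin VALUES (?, ?)",
                 [("m1", "p1"), ("m1", "p2"), ("m2", "p1")])

conn.execute("""CREATE TABLE principles(
                    pconst VARCHAR(10),
                    PRIMARY KEY (pconst));""")
conn.execute("""INSERT INTO principles (pconst)
                SELECT DISTINCT pconst
                FROM movie_prin;""")

rows = conn.execute("SELECT pconst FROM principles ORDER BY pconst").fetchall()
print(rows)  # [('p1',), ('p2',)] -- 'p1' appears twice in movie_prin, once here
```

Note how the DISTINCT deduplicates: each principle ends up in the new table exactly once, no matter how many movies it appears in.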

Remember that we specified index=False during the upload? Except for the principles table, the tables don't have any indexes yet. We can query them as-is, but that will be quite slow, since MySQL has to scan through the whole table to find a value. In the next step we ALTER the tables to properly set the column types and assign the indexes:

con.execute("""ALTER TABLE directors MODIFY dconst VARCHAR(10),
                                     MODIFY primaryName VARCHAR(255),
                                     ADD PRIMARY KEY (dconst);""")

con.execute("""ALTER TABLE movies MODIFY tconst VARCHAR(10),
                                  MODIFY primaryTitle VARCHAR(255),
                                  MODIFY originalTitle VARCHAR(255),
                                  MODIFY startYear SMALLINT,
                                  MODIFY averageRating FLOAT,
                                  MODIFY numVotes INT,
                                  ADD PRIMARY KEY (tconst);""")

con.execute("""ALTER TABLE movie_dir MODIFY tconst VARCHAR(10),
                                     MODIFY dconst VARCHAR(10),
                                     ADD FOREIGN KEY (dconst)
                                         REFERENCES directors (dconst),
                                     ADD FOREIGN KEY (tconst)
                                         REFERENCES movies (tconst);""")

con.execute("""ALTER TABLE movie_prin MODIFY tconst VARCHAR(10),
                                      MODIFY pconst VARCHAR(10),
                                      ADD FOREIGN KEY (tconst)
                                          REFERENCES movies (tconst),
                                      ADD FOREIGN KEY (pconst)
                                          REFERENCES principles (pconst);
                                      """)

con.close()


The directors and movies tables get a PRIMARY KEY, and movie_dir and movie_prin get FOREIGN KEYs, as those tables hold the many-to-many relationships.

With that, we are done uploading and we can close the connection.
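If you want to convince yourself that the new indexes actually get used, EXPLAIN shows the query plan. MySQL has its own EXPLAIN output; the sketch below uses an in-memory SQLite stand-in with a made-up directors row, where a primary-key lookup shows up as a SEARCH rather than a full-table SCAN:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE directors (dconst VARCHAR(10) PRIMARY KEY, "
             "primaryName VARCHAR(255))")
conn.execute("INSERT INTO directors VALUES ('nm0000001', 'Some Director')")

# With the primary key in place, the lookup plan is an index SEARCH
# instead of a SCAN over the whole table.
plan = conn.execute("EXPLAIN QUERY PLAN SELECT primaryName FROM directors "
                    "WHERE dconst = 'nm0000001'").fetchall()
detail = plan[0][-1]
print(detail)
```

The same reasoning applies to the MySQL tables above: without the ALTER step, every lookup by tconst or dconst would be a full scan.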

Implementing the recommendation system

The recommendation system works in a fairly straightforward manner. It finds the principles (the most prominently credited people in a movie) that the DOI (director of interest) most commonly works with, and then finds all the directors that share those collaborators. All of these steps are done in one big SQL query; the reason being that every round-trip to the database costs precious time, so if everything can be done at once, it should.


  1. Find all the movies that the DOI directed.
  2. For every movie, find all the principles (a maximum of 10 per movie).
  3. Count how often the DOI worked with each of the principles.
  4. For every principle, find all the movies they played in and sum the counts.
  5. Find the directors that directed these movies and add up the sums per director.
  6. Divide by the number of movies that the director directed to obtain a director score.
  7. Go down the director list in descending order of score.
  8. To filter out obscure or plain bad directors, require that each director has at least one movie above a minimum rating and vote-count threshold.
  9. Limit the number of directors returned (currently set to 5).
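Before turning to SQL, the scoring part of this recipe (steps 1 through 6) can be sketched in plain Python on a made-up toy dataset (d1 is the DOI, p* are principles, m* are movies; all IDs hypothetical):

```python
from collections import Counter

# Toy stand-in data with made-up IDs.
movie_dir = {"m1": "d1", "m2": "d2", "m3": "d2", "m4": "d3"}   # movie -> director
movie_prin = {"m1": ["p1", "p2"], "m2": ["p1", "p2"],
              "m3": ["p1", "p2"], "m4": ["p2"]}                # movie -> principles

doi = "d1"

# Steps 1-3: count how often the DOI worked with each principle.
colleague_counts = Counter(p for m, d in movie_dir.items()
                           if d == doi for p in movie_prin[m])

# Step 4: for every movie, sum the counts of its principles.
movie_sums = {m: sum(colleague_counts[p] for p in ps)
              for m, ps in movie_prin.items()}

# Steps 5-6: total the sums per director (excluding the DOI) and
# divide by the number of movies they directed.
totals, n_movies = Counter(), Counter()
for m, d in movie_dir.items():
    if d != doi:
        totals[d] += movie_sums[m]
        n_movies[d] += 1
scores = {d: totals[d] / n_movies[d] for d in totals}
print(scores)  # {'d2': 2.0, 'd3': 1.0}
```

Here d2 scores highest because both of their movies share principles with the DOI's movie; the SQL below computes the same quantity inside the database.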

I put the code into a Python script that is run from the Django framework that my website runs on.

import sqlalchemy as db
import os

# assuming the MySQL password is stored in an environment variable
# (hence the os import); adjust to however you keep secrets out of source
DATABASE_KEY = os.environ['DATABASE_KEY']

def dir_recomm(DOI_dconst, minimum_rating,
               minimum_votes, dir_limit=5):
    try:
        engine = db.create_engine('mysql+mysqldb://<username>:{}'
            '@<username>.mysql.pythonanywhere-services.com:3306/'
            '<username>$<database>?charset=utf8mb4'.format(DATABASE_KEY))
        conn = engine.connect()

        DOI_dconst_query = """SELECT primaryName
                              FROM directors
                              WHERE dconst = '{d}'
                              LIMIT 1""".format(d=DOI_dconst)

        DOI_name = conn.execute(DOI_dconst_query).fetchone()[0]

    ...

    except Exception:
        conn.close()
        return("error", "")


The arguments for the function are:

  - DOI_dconst: the dconst (IMDb ID) of the director of interest.
  - minimum_rating: the minimum average rating that at least one movie of a recommended director must have.
  - minimum_votes: the minimum number of votes that at least one movie of a recommended director must have.
  - dir_limit: the maximum number of directors to recommend (default 5).

If the script fails for whatever reason we have no output, and we get a very uninformative error in the browser. We could add error handling that returns more information about the failure (problems with the database connection? No movies found for the director? No recommendations found?). For now I've decided to put everything in one big try/except block, but this can be improved upon greatly.

We connect to the database again, and the first thing we do is convert DOI_dconst to the director's real name, for use later when we display the results in the browser. For readability (but also writeability!) we split the SQL query up into multiple parts, which are then put together to form one big query. This first query covers steps 1, 2 and 3 (find all the movies that the DOI directed; for every movie, find all the principles; count how often the DOI worked with each of them).

colleague_query = """SELECT mp2.pconst, COUNT(mp2.pconst) AS count
                     FROM movie_prin AS mp2
                     WHERE mp2.tconst IN (
                         SELECT md1.tconst
                         FROM movie_dir AS md1
                         WHERE md1.dconst = '{d}')
                     GROUP BY mp2.pconst""".format(d=DOI_dconst)


The next part of the query represents step 4 (for every principle, find all the movies they played in and sum the counts).

col_movie_query = """SELECT mp4.tconst, SUM(mp3.count) AS sum
                     FROM movie_prin AS mp4
                     INNER JOIN ({q}) AS mp3
                     ON mp4.pconst = mp3.pconst
                     GROUP BY mp4.tconst""".format(q=colleague_query)


Steps 5 and 6 (find the directors that directed these movies and add up the sums per director; divide by the number of movies the director directed to obtain a director score):

dir_points_query = """SELECT md5.dconst, SUM(mp5.sum) as sum,
                             MAX(numVotes) as votes,
                             MAX(averageRating) as rating
                      FROM movie_dir AS md5
                      INNER JOIN ({q}) AS mp5
                      ON mp5.tconst = md5.tconst
                      INNER JOIN movies as mo
                      ON mo.tconst = md5.tconst
                      WHERE md5.dconst <> '{d}'
                      GROUP BY md5.dconst""".format(q=col_movie_query,
                                                    d=DOI_dconst)


And steps 7, 8 and 9 (sort the director list in descending order of score, filter out directors below the thresholds, and limit the number of directors):

dir_score_query = """SELECT md7.dconst
                     FROM movie_dir AS md7
                     INNER JOIN ({q}) AS md6
                     ON md7.dconst = md6.dconst
                     WHERE md6.votes >= {v}
                       AND md6.rating >= {r}
                     GROUP BY md7.dconst
                     ORDER BY md6.sum / COUNT(md7.dconst) DESC
                     LIMIT {lim}""".format(q=dir_points_query,
                                           v=minimum_votes,
                                           r=minimum_rating,
                                           lim=dir_limit)


We then submit the query to our server:

result = conn.execute(dir_score_query).fetchall()
directors = [r for r, in result]


The query returns a list of one-element tuples; the list comprehension (note the tuple unpacking in for r, in result) flattens it into a regular list of director IDs.
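To check that the composed query chain behaves as intended, here it is run end-to-end against a tiny in-memory SQLite database (a stand-in for MySQL; the IDs d1, p1, m1 and so on are made up, and the step-4 join is written as a plain equality):

```python
import sqlite3

# Toy stand-in for the MySQL schema; all IDs are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE movies (tconst TEXT PRIMARY KEY, averageRating REAL,
                         numVotes INTEGER);
    CREATE TABLE movie_dir (tconst TEXT, dconst TEXT);
    CREATE TABLE movie_prin (tconst TEXT, pconst TEXT);
    INSERT INTO movies VALUES ('m1', 7.0, 100), ('m2', 7.0, 500),
                              ('m3', 8.0, 2000), ('m4', 7.6, 1500);
    INSERT INTO movie_dir VALUES ('m1', 'd1'), ('m2', 'd2'),
                                 ('m3', 'd2'), ('m4', 'd3');
    INSERT INTO movie_prin VALUES ('m1', 'p1'), ('m1', 'p2'),
                                  ('m2', 'p1'), ('m2', 'p2'),
                                  ('m3', 'p1'), ('m3', 'p2'),
                                  ('m4', 'p2');
""")

doi, min_votes, min_rating, limit = 'd1', 1000, 7.5, 5

colleague = """SELECT mp2.pconst, COUNT(mp2.pconst) AS count
               FROM movie_prin AS mp2
               WHERE mp2.tconst IN (SELECT md1.tconst FROM movie_dir AS md1
                                    WHERE md1.dconst = '{d}')
               GROUP BY mp2.pconst""".format(d=doi)
col_movie = """SELECT mp4.tconst, SUM(mp3.count) AS sum
               FROM movie_prin AS mp4
               INNER JOIN ({q}) AS mp3 ON mp4.pconst = mp3.pconst
               GROUP BY mp4.tconst""".format(q=colleague)
dir_points = """SELECT md5.dconst, SUM(mp5.sum) AS sum,
                       MAX(numVotes) AS votes, MAX(averageRating) AS rating
                FROM movie_dir AS md5
                INNER JOIN ({q}) AS mp5 ON mp5.tconst = md5.tconst
                INNER JOIN movies AS mo ON mo.tconst = md5.tconst
                WHERE md5.dconst <> '{d}'
                GROUP BY md5.dconst""".format(q=col_movie, d=doi)
dir_score = """SELECT md7.dconst
               FROM movie_dir AS md7
               INNER JOIN ({q}) AS md6 ON md7.dconst = md6.dconst
               WHERE md6.votes >= {v} AND md6.rating >= {r}
               GROUP BY md7.dconst
               ORDER BY md6.sum / COUNT(md7.dconst) DESC
               LIMIT {lim}""".format(q=dir_points, v=min_votes,
                                     r=min_rating, lim=limit)

result = conn.execute(dir_score).fetchall()
directors = [r for r, in result]
print(directors)  # ['d2', 'd3']: d2 shares principles with the DOI in 2 movies
```

The highest-scoring director comes out first, exactly as steps 7 through 9 describe.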

We then go through our list of directors, retrieve each one's 3 highest-rated movies, and return them to our Django framework.

movie_list = []
for director in directors:
    query1 = """SELECT md1.dconst, md1.tconst
                FROM movie_dir AS md1
                WHERE dconst = '{d}'""".format(d=director)
    query2 = """SELECT md2.dconst,
                       m2.tconst,
                       m2.primaryTitle,
                       m2.originalTitle,
                       m2.startYear,
                       m2.averageRating,
                       m2.numVotes
                FROM movies AS m2
                INNER JOIN ({q}) AS md2
                ON m2.tconst = md2.tconst
                ORDER BY averageRating DESC
                LIMIT 3""".format(q=query1)
    query3 = """SELECT d3.dconst,
                       d3.primaryName,
                       m3.tconst,
                       m3.primaryTitle,
                       m3.originalTitle,
                       m3.startYear,
                       m3.averageRating,
                       m3.numVotes
                FROM directors AS d3
                INNER JOIN ({q}) AS m3
                ON d3.dconst = m3.dconst""".format(q=query2)
    movie_list.extend(conn.execute(query3).fetchall())

conn.close()
return(DOI_name, movie_list)


For example, for Kaige Chen (nm0155280) what it returns looks like this:

DOI_name:

'Kaige Chen'


movie_list:

[('nm0396284', 'Hsiao-Hsien Hou', 'tt0096908', 'A City of Sadness', 'Beiqíng chéngshì', 1989, 8.0, 3503),
 ('nm0396284', 'Hsiao-Hsien Hou', 'tt0087164', "A Summer at Grandpa's", 'Dong dong de jiàqi', 1984, 7.8, 1004),
 ('nm0396284', 'Hsiao-Hsien Hou', 'tt0090185', 'A Time to Live and a Time to Die', 'Tóngnián wangshì', 1985, 7.8, 1957),
 ('nm0939182', 'Kar-Wai Wong', 'tt0118694', 'In the Mood for Love', 'Faa yeung nin wa', 2000, 8.1, 103058),
 ('nm0939182', 'Kar-Wai Wong', 'tt0109424', 'Chungking Express', 'Chung Hing sam lam', 1994, 8.1, 51516),
 ('nm0939182', 'Kar-Wai Wong', 'tt0118845', 'Happy Together', 'Chun gwong cha sit', 1997, 7.8, 18901),
 ('nm1249411', 'Xiaolu Xue', 'tt1498858', 'Ocean Heaven', 'Hai yang tian tang', 2010, 7.6, 2147),
 ('nm1249411', 'Xiaolu Xue', 'tt2169322', 'Finding Mr. Right', 'Bei Jing yu shang Xi Ya Tu', 2013, 6.4, 1963),
 ('nm1249411', 'Xiaolu Xue', 'tt4960242', 'Book of Love', 'Beijing yu shang: Xiyatu 2', 2016, 6.3, 661),
 ('nm0000487', 'Ang Lee', 'tt0454876', 'Life of Pi', 'Life of Pi', 2012, 7.9, 525158),
 ('nm0000487', 'Ang Lee', 'tt0190332', 'Crouching Tiger, Hidden Dragon', 'Wo hu cang long', 2000, 7.9, 237510),
 ('nm0000487', 'Ang Lee', 'tt0111797', 'Eat Drink Man Woman', 'Yin shi nan nv', 1994, 7.8, 16314),
 ('nm0423524', 'Alejandro Jodorowsky', 'tt0071615', 'The Holy Mountain', 'La montaña sagrada', 1973, 7.9, 30745),
 ('nm0423524', 'Alejandro Jodorowsky', 'tt0098253', 'Santa Sangre', 'Santa Sangre', 1989, 7.7, 16632),
 ('nm0423524', 'Alejandro Jodorowsky', 'tt4451458', 'Endless Poetry', 'Poesía Sin Fin', 2016, 7.6, 3817)]

(How amazing is it that Jodorowsky is recommended in this list?!)


Let's do a quick test of how long it takes for dir_recomm to give us this recommendation:

%timeit dir_recomm('nm0155280', 7.5, 1000)
293 ms ± 17.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


About a third of a second, not bad!
Note: I found that it sometimes takes a few seconds to get a recommendation. This has to do with the SQL server spinning up again after it has been idle for a while (I think an hour); Pythonanywhere.com tends to do this, which is annoying but very understandable.

In this post I showed you how to prepare a dataset and push it to your own MySQL server, and how to use some basic SQL to query the database and build your own recommender system out of it. I think this was a fun project and a great foundation for bigger projects on this blog!