Streaming services such as Netflix, Prime Video, and Hulu are attracting larger audiences every day. With more streaming services launching than one can keep count of, families must decide which services to subscribe to, and the quality of a service's recommendations has become pivotal to that decision. The objective of this project is to recommend movies for a subscriber to watch.
While there are several factors one could consider when building a recommendation engine, this project focuses on a few aspects of the user and the movie: the movie's genres, the tags associated with it, and its director and cast. We also recommend movies to a user based on how closely their taste matches that of other users. The goal is to give the user a well-rounded experience by providing recommendations based on several criteria: popularity, user-user collaborative filtering, and content-based filtering.
The data comes from GroupLens, a research lab at the University of Minnesota, which collects data from MovieLens. MovieLens is a non-commercial, personalized movie recommendation website that collects user ratings for movies and provides recommendations.
This dataset describes 5-star rating and free-text tagging activity from MovieLens, a movie recommendation service. It contains 100836 ratings and 3683 tag applications across 9742 movies. These data were created by 610 users between March 29, 1996 and September 24, 2018. The dataset was generated on September 26, 2018. Users were selected at random for inclusion; all selected users had rated at least 20 movies. No demographic information is included. Each user is represented by an id, and no other information is provided.
To get more details about each movie, I used the links provided in the links.csv file and scraped the corresponding Wikipedia pages to extract details such as the director and cast.
Word count comparison: the bar graph below shows the most frequently occurring words.
We want to look at the average rating for each genre: are some genres generally rated higher than others, or are average ratings not affected by genre at all?
Inferences:
A few interesting inferences stand out from the graph:
i. Ratings of 4 and 3 have the highest frequency. This could imply that most people are more comfortable giving mid-level ratings rather than going very high or very low.
ii. The half-star values are far less popular than the whole numbers they neighbour, which implies that people tend to choose a whole number over a .5.
iii. Very few ratings fall between 0.5 and 1.5. It would make sense to conclude either that people are a little liberal when scoring movies they didn't enjoy, or that if they didn't enjoy a movie, they generally don't take the effort to rate it.
There are about 20 genres across the roughly 10k movies, and each movie can be tagged with more than one genre. Let's take a look at how many movies fall under each genre before diving deeper.
Inferences:
Of the roughly 10k movies, around 4,300 are tagged as Drama, the most popular genre. This tells us that recommending movies based solely on a genre the user seems to like might be futile, since so many movies share the same genre.
There also seem to be movies with no genre listed at all. If a user likes one of those movies and would like something similar, going by genre would be counterproductive.
Inferences:
Surprisingly, genre does not appear to affect the average rating: from the ratings given, we cannot tell whether a particular genre is especially loved or hated by raters. The average values fall between 3.25 and 4 for all genres.
We now move on to the modeling part of the project, with the aim of building a tool that can recommend movies to a user.
To provide the recommendations, we will look at the following approaches:
Popularity based
Content based
Collaborative filtering based
The first model is based on which movies are the most popular across our entire database. Movies that seem to be universally liked (by the majority) are a good place to start for a basic model.
For example, The Shawshank Redemption, The Godfather, and The Dark Knight are the top three rated movies on IMDb. These movies have universal appeal, and if we recommend them to anyone, there is a high probability they will enjoy them as well.
But if we just go with the movies that have the highest average rating, we run into an issue. Say there is a new movie, or a movie very few people have rated, and everyone who rated it gave it 5 stars. Even though its average might be higher than The Shawshank Redemption's, it would be a fallacy to consider it a popular movie, as there aren't enough data points to validate the claim.
So to make the recommendations fair, we use a weighted popularity calculation:

WR = (v / (v + m)) × R + (m / (v + m)) × C

where:
R = the mean rating for the movie
v = the number of votes (ratings) for the movie
m = the minimum number of votes required to be listed in the top-rated list
C = the mean rating across all movies
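As a sketch, this is how the weighted rating could be computed with pandas; the choice of m (here the 90th percentile of vote counts) is an assumption, since the report does not state the cutoff.

```python
import pandas as pd

# Load the MovieLens ratings (userId, movieId, rating, timestamp).
ratings = pd.read_csv("ratings.csv")

# Per-movie mean rating (R) and vote count (v).
stats = ratings.groupby("movieId")["rating"].agg(R="mean", v="count")

C = ratings["rating"].mean()        # mean rating across all movies
m = stats["v"].quantile(0.90)       # vote cutoff; the 90th percentile is an assumption

stats["weighted_rating"] = (stats["v"] / (stats["v"] + m)) * stats["R"] \
                         + (m / (stats["v"] + m)) * C

print(stats.sort_values("weighted_rating", ascending=False).head(20))
```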
Movie Id | Name |
---|---|
277 | The Shawshank Redemption |
659 | The Godfather |
2226 | Fight Club |
922 | The Godfather Part II |
46 | The Usual Suspects |
224 | Star Wars Episode IV - A New Hope |
602 | Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb |
914 | Goodfellas |
461 | Schindler's List |
6710 | The Dark Knight |
6315 | The Departed |
899 | The Princess Bride |
686 | Rear Window |
898 | Star Wars Episode V - The Empire Strikes Back |
694 | Casablanca |
257 | Pulp Fiction |
900 | Raiders of the Lost Ark |
841 | A Streetcar Named Desire |
1939 | The Matrix |
1734 | American History X |
A second way to recommend movies is to use the movies a user has liked in the past. The premise is that if we find movies similar to the ones a user has rated highly, the user is likely to enjoy those as well.
For example, if user 1 likes Toy Story and Cars, we can assume the user likes animated or kid-friendly movies, or movies made by directors such as John Lasseter, and recommend accordingly.
For our set of movies we are going to focus on the following features:
i. The genres of the movie
ii. The tags associated with the movie
iii. The director and actors of the movie
The steps we’ll follow to find similar movies are as follows:
1. Combining the movie name, genres, tags, director and actor names into a single string:
To get the director and actor names, we use the Beautiful Soup package along with the links provided in the links.csv file and extract the necessary details from the HTML pages, as in the sketch below.
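Something like the following, using requests and Beautiful Soup; the infobox labels and the example URL are assumptions, since the exact scraping code is not reproduced in this report.

```python
import requests
from bs4 import BeautifulSoup

def scrape_credits(url):
    """Pull director and cast names from a film page's infobox.
    The infobox layout is an assumption and varies from page to page,
    so production code would need more defensive parsing."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")

    credits = {"director": [], "cast": []}
    infobox = soup.find("table", class_="infobox")
    if infobox is None:
        return credits

    for row in infobox.find_all("tr"):
        header, cell = row.find("th"), row.find("td")
        if header is None or cell is None:
            continue
        label = header.get_text(" ", strip=True)
        values = list(cell.stripped_strings)
        if label == "Directed by":
            credits["director"] = values
        elif label == "Starring":
            credits["cast"] = values
    return credits

# Hypothetical usage with a URL resolved from a row of links.csv:
# scrape_credits("https://en.wikipedia.org/wiki/Toy_Story")
```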
2. Preparing the string for comparison:
We need a single string per movie on which to perform the comparison, so we combine the movie name, genres, tags, director and cast. We also make a few enhancements to this string. The director generally plays a major role when choosing one movie based on another, followed by the genres; since we cannot weight parts of a single string directly, we repeat the director's name three times and the genres twice to give them more weight. We also remove the space between first and last names, so that a common first or last name does not skew the result.
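As a concrete illustration, here is a minimal sketch of how such a metadata string could be built; the repetition scheme follows the description above, but the column names of final_df are assumptions.

```python
def make_metadata(row):
    """Build the comparison string for one movie: director repeated three
    times, genres twice, plus tags and cast, with spaces removed from person
    names so common first/last names do not create false matches."""
    director = row["director"].replace(" ", "").lower()
    cast = [name.replace(" ", "").lower() for name in row["cast"]]
    genres = [g.lower() for g in row["genres"]]
    tags = [t.lower() for t in row["tags"]]
    parts = [row["title"].lower()] + [director] * 3 + genres * 2 + tags + cast
    return " ".join(parts)

# final_df["metadata"] = final_df.apply(make_metadata, axis=1)
```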
3. Vectorizing the string:
There are two techniques we could use to vectorize the string: a count vectorizer, or TF-IDF (term frequency-inverse document frequency).
With the count vectorizer, words that appear several times receive more weight; in our metadata we have repeated the director's name, so it is given more weight.
TF-IDF does the opposite: it penalizes words that occur frequently across documents. That would be more applicable if we were using the movies' descriptions, where words such as "a", "the", "in" and "on" appear in almost every description, and giving them more weight would skew the results.
Since we are not using descriptions, just succinct tags, we use the count vectorizer.
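A minimal sketch of this step with scikit-learn, assuming the metadata column built in step 2:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Turn the per-movie metadata strings into a document-term count matrix.
# `final_df["metadata"]` is the column built in step 2 (an assumed name).
count_vec = CountVectorizer(stop_words="english")
count_matrix = count_vec.fit_transform(final_df["metadata"])
```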
4. Finding similarities using cosine similarity:
Now that we have a matrix representing each movie's metadata features, we find similarities between movies by computing the cosine similarity of their vectors. The more similar two movies' metadata are, the closer their vectors point in the same direction and the closer the cosine similarity is to 1, which makes it straightforward to rank the most closely related movies.
5. Extracting the cosine matrix row for a particular movie:
We now extract the row of the cosine matrix that corresponds to a particular movie. The user must enter the movie name exactly as it appears in movies.csv, i.e. including the year of release, e.g. Toy Story (1995). From that we look up the movie id, which in turn gives us the index in the final_df dataframe, and we use this index to extract the corresponding row of the cosine matrix.
6. Sorting the resulting row in descending order and finding the top 5 recommendations:
From the previous step we have the row corresponding to a particular movie, containing its cosine similarity to every other movie. We want the five highest values and the movies they correspond to, as in the sketch below.
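A minimal sketch of steps 4 through 6, assuming count_matrix and final_df from the previous steps (column names are assumptions):

```python
from sklearn.metrics.pairwise import cosine_similarity

# Pairwise cosine similarity between every pair of movie metadata vectors.
cosine_matrix = cosine_similarity(count_matrix)

def recommend(title, n=5):
    """Return the n movies most similar to `title`.
    `title` must match movies.csv exactly, e.g. "Toy Story (1995)"."""
    label = final_df.index[final_df["title"] == title][0]
    idx = final_df.index.get_loc(label)                 # row position in the matrix
    scores = list(enumerate(cosine_matrix[idx]))
    scores.sort(key=lambda pair: pair[1], reverse=True)
    top = [i for i, _ in scores[1 : n + 1]]             # skip the movie itself
    return final_df["title"].iloc[top]

# recommend("Harry Potter and the Chamber of Secrets (2002)")
```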
For Harry Potter and the Chamber of Secrets, we get the following recommendations:
Name |
---|
Pokemon: The First Movie (1998) |
Harry Potter and the Prisoner of Azkaban (2004) |
The Lord of the Rings: The Fellowship of the Ring (2001) |
Harry Potter and the Goblet of Fire (2005) |
Harry Potter and the Sorcerer's Stone (2001) |
Since a straightforward performance metric is not applicable to a content-based recommendation system, I derived a modified version of recall for comparing content-based models.
• I split the movies a user has rated 3.5 or higher (i.e. rated favorably) into training and test sets.
• From the training movies, I use the cosine matrix to derive a list of recommended movies.
• I then calculate the percentage of test movies that appear among the recommended movies.
The result was a recall of 2%. (This is expected behavior, as we do not have a controlled dataset.)
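For reference, a rough sketch of how this modified recall could be computed, reusing the recommend() helper sketched above; the 80/20 split and the five-recommendations-per-seed-movie budget are assumptions.

```python
def modified_recall(user_id, threshold=3.5):
    """Split a user's favorably rated movies into train/test, recommend from
    the training titles with the content model, and measure what share of the
    held-out favorites show up among the recommendations."""
    liked = ratings.loc[(ratings["userId"] == user_id) &
                        (ratings["rating"] >= threshold), "movieId"]
    titles = final_df.loc[final_df["movieId"].isin(liked), "title"].tolist()
    if len(titles) < 2:
        return None

    split = int(0.8 * len(titles))
    train, test = titles[:split], titles[split:]

    recommended = set()
    for title in train:
        recommended.update(recommend(title, n=5))

    hits = sum(title in recommended for title in test)
    return hits / len(test)
```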
The third model uses collaborative filtering, which takes advantage of the similarity between users and the way they have rated movies to provide recommendations.
For example, there are 2 users A and B.
User A likes Movies x , y and z
User B likes movies w , x and y
We can recommend user A to watch movie w and user B to watch movie z
One major disadvantage of this model is that the more users there are, the more computationally expensive it becomes. But since we have a relatively small dataset, we will build this model for the sake of experimentation.
The steps involved in building a collaborative filtering model are:
1. Splitting the data into training and test sets:
For each user, we put about 20% of their ratings into the test set and the rest into training.
A point of significance here is that the split cannot be done in the traditional way. If we split the users 80:20, then for 20% of the users we would have no information about their rating patterns on which to base suggestions, i.e. a cold start. Similarly, if we split the movies 80:20, the held-out movies would never be recommended. The solution is to split each user's ratings 80:20 and use the 20% as test data, as in the sketch below.
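A minimal sketch of such a per-user split, assuming the ratings dataframe from earlier; the random seed and the one-rating minimum per user are assumptions.

```python
import numpy as np

def per_user_split(ratings, test_frac=0.2, seed=42):
    """Hold out roughly `test_frac` of every user's ratings as test data,
    so no user is a complete cold start at evaluation time."""
    rng = np.random.default_rng(seed)
    test_idx = []
    for _, user_ratings in ratings.groupby("userId"):
        n_test = max(1, int(test_frac * len(user_ratings)))
        test_idx.extend(rng.choice(user_ratings.index, size=n_test, replace=False))
    test = ratings.loc[test_idx]
    train = ratings.drop(index=test_idx)
    return train, test

# train_df, test_df = per_user_split(ratings)
```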
2. Creating a Keras model
3. Training the model
4. Testing to find accuracy
5. Recommending movies to the user
Keras model structure and prediction:
We first create an input layer each for the user and movie data; these are the tensors feeding the hidden layers.
An embedding layer then converts each input into a continuous vector.
We then reshape the embedding output according to the number of latent factors.
From these we create the output layer, a dense layer with 'relu' activation.
We compile the model with the Adam optimizer.
The model works by predicting the rating a user would give to a particular movie, and the movies with the highest predicted ratings are used as recommendations.
We then evaluate the model on the test data, using RMSE to measure accuracy. A sketch of the model is shown below.
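A minimal sketch of the model described above, assuming user and movie ids have been re-indexed to contiguous integers; the number of latent factors, the concatenation of the two embeddings, and the training call are assumptions, as the exact architecture is not spelled out in this report.

```python
from tensorflow import keras
from tensorflow.keras import layers

n_factors = 50                             # latent factors -- an assumed hyperparameter
n_users = ratings["userId"].nunique()      # ids assumed re-indexed to 0..n_users-1
n_movies = ratings["movieId"].nunique()    # ids assumed re-indexed to 0..n_movies-1

# One input per id, an embedding each, flattened and combined, then a single
# ReLU output that predicts the rating.
user_in = layers.Input(shape=(1,), name="user")
movie_in = layers.Input(shape=(1,), name="movie")

user_vec = layers.Flatten()(layers.Embedding(n_users, n_factors)(user_in))
movie_vec = layers.Flatten()(layers.Embedding(n_movies, n_factors)(movie_in))

x = layers.Concatenate()([user_vec, movie_vec])
out = layers.Dense(1, activation="relu")(x)

model = keras.Model(inputs=[user_in, movie_in], outputs=out)
model.compile(optimizer="adam", loss="mean_squared_error")

# model.fit([train_df["userId"], train_df["movieId"]], train_df["rating"], epochs=5)
# preds = model.predict([test_df["userId"], test_df["movieId"]])
```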
Accuracy:
The root mean square error (RMSE) is 0.516.
Collaborative filtering is the most sophisticated and useful of the three approaches, so to ensure our system provides the best recommendations we build a second collaborative model, based on singular value decomposition (SVD).
In theory this model is similar to the Keras-based model: movies are recommended via user-user filtering. It also addresses two shortcomings of the deep learning model:
i. With the Keras model, the computation becomes more and more expensive as the number of users and movies grows; the model does not scale well. By performing matrix factorization we are able to scale up.
ii. The matrix of user ratings is sparse, which prevents us from applying a variety of functions to it. To overcome this, we decompose the matrix into three denser matrices.
What is SVD?
Using SVD, we are able to find the latent features in our data.
For example, say user A likes Harry Potter and the Chamber of Secrets, Toy Story 2, and Charlie and the Chocolate Factory.
There is a common underlying theme here that is not immediately evident: the user seems to enjoy children's movies. Our model would then recommend, say, Cars and Trolls. Even though we never specified genre as a feature, the algorithm picks up on important latent features and uses them for its recommendations.
Breaking down the math behind the algorithm:
A = U × Σ × Vᵀ

where:
A is the matrix containing the user ratings;
U is the user-feature matrix, i.e. how much a user likes a particular feature (comedy, thriller, ...);
Σ is the diagonal matrix containing the weight/strength of each of these features;
V is the movie-feature matrix, i.e. how much each of these features applies to a movie.
Accuracy: To find out how effective the SVD method is, we use the SVD implementation and cross-validation utilities from the Surprise package, performing 10-fold cross-validation and reporting the root mean square error, as in the sketch below.
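A minimal sketch of this evaluation with the Surprise package, assuming the ratings dataframe from earlier:

```python
from surprise import Dataset, Reader, SVD
from surprise.model_selection import cross_validate

# Load the ratings into Surprise's format and run 10-fold cross-validation
# of SVD, reporting the RMSE for each fold.
reader = Reader(rating_scale=(0.5, 5.0))
data = Dataset.load_from_df(ratings[["userId", "movieId", "rating"]], reader)

algo = SVD()
results = cross_validate(algo, data, measures=["RMSE"], cv=10, verbose=True)
print(results["test_rmse"].mean())
```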
Fold 1 | Fold 2 | Fold 3 | Fold 4 | Fold 5 | Fold 6 | Fold 7 | Fold 8 | Fold 9 | Fold 10 | Mean |
---|---|---|---|---|---|---|---|---|---|---|
0.8736 | 0.8799 | 0.866 | 0.8594 | 0.8689 | 0.8609 | 0.8648 | 0.8799 | 0.866 | 0.8726 | 0.8692 |
Top 5 movies rated by the user:
The Holiday, The King's Speech, Troy, Despicable Me, An Education
Top 5 movies recommended for the user:
Shrek, Lord of the Rings: The Return of the King, Finding Nemo, Up, The Dark Knight
The content-based model gave us a modified recall of 2%.
The two collaborative filtering models have the following RMSE values:
SVD RMSE | Keras RMSE |
---|---|
0.8962 | 0.516 |
For RMSE, the smaller the value, the better the model. From the two values we see that the Keras model produces the more accurate predictions.
We have seen 3 major categories of recommendation systems in this project.
The popularity-based model is a good model for the cold start of a new user: there is a high chance that most users will enjoy the movies it recommends. But it has the disadvantage of no personalization; all users get the same recommendations.
That problem is solved by the content-based model, which caters to a user's niche tastes and recommends movies based on their interests. But this model requires a lot of domain knowledge: we need to know on what criteria a user rates a movie favorably.
Our third category of models bypasses this issue by using the latent similarity between users or items to make recommendations. However, these models run the risk of scalability problems as the number of users and movies on the platform grows.
Based on each platform's priorities and expertise, we could pick any of these models to provide recommendations for a user, and by employing them we can help ensure users enjoy their experience on the platform.