Do you ever wonder how Google comes up with movie recommendations that match your taste? How it just “figures it out” for you? Well, after reading this post you will know how. Even better, you will be able to build a recommendation system yourself.
As a web creator working in Python, you should know libraries such as pandas and NumPy. The beginner’s program used in this article cannot be compared to industry standards, so it serves only as an introduction to recommender systems. We assume that readers have previous experience with Python.
What is a recommender system?
A recommender system is an algorithm that detects correlations in a dataset in order to provide a user with the most relevant information. The algorithm evaluates the items and shows the user the ones that are closest to their preferences.
Netflix and Amazon are some of the best examples of such recommendation systems. Whenever you choose an item on Amazon, it automatically starts showing you other items that you might like. The same is the case with Netflix and its list of movies recommended for you.
There are three ways to build a recommender system:
- Popularity-based recommender system
- Content-based recommender system
- Similarity-based recommender system
Building a simple recommender system in Python
In this basic recommender system, we use the MovieLens dataset. This is a similarity-based recommender system. You can work in PyCharm or any other Python environment if you'd like. So, moving on to the first step, importing NumPy and pandas is our top priority.
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
Next, we use the pandas read_csv() utility to load the dataset. The file is tab-separated, so we pass \t to the sep parameter. We then supply the column names through the names parameter.
df = pd.read_csv('u.data', sep='\t', names=['user_id','item_id','rating','timestamp'])
Let's look at the head of the data to see what we are working with.
It would be much easier if we could see the movie titles rather than just the IDs. Let's load the movie titles so that we can merge them with the dataset:
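As a small illustration of the loading step, the sketch below reads a few illustrative tab-separated rows from an in-memory string rather than from the real u.data file. The names sample and sample_df are assumptions introduced only for this example.

```python
import io
import pandas as pd

# A few illustrative rows in a tab-separated layout with no header row.
sample = "196\t242\t3\t881250949\n186\t302\t3\t891717742\n22\t377\t1\t878887116\n"

columns = ['user_id', 'item_id', 'rating', 'timestamp']

# sep='\t' splits on tabs; names= supplies the missing header.
sample_df = pd.read_csv(io.StringIO(sample), sep='\t', names=columns)

# head() returns the first five rows by default (here, all three).
print(sample_df.head())
```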
movie_titles = pd.read_csv('Movie_Titles')
movie_titles.head()
Since both datasets share the item_id column, we can merge them on it.
df = pd.merge(df, movie_titles, on='item_id')
df.head()
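To see what the merge does, here is a minimal sketch on two tiny hand-made frames. The frames and their values are hypothetical and stand in for the real ratings and titles files.

```python
import pandas as pd

# Hypothetical ratings: two users rated item 10, one user rated item 20.
ratings_df = pd.DataFrame({
    'user_id': [1, 2, 3],
    'item_id': [10, 10, 20],
    'rating': [4, 5, 3],
})

# Hypothetical lookup table from item_id to title.
titles_df = pd.DataFrame({
    'item_id': [10, 20],
    'title': ['Movie A', 'Movie B'],
})

# Every rating row picks up the title whose item_id matches.
merged = pd.merge(ratings_df, titles_df, on='item_id')
print(merged)
```

Each rating row keeps its own values and gains a title column, so the result has as many rows as there are ratings with a matching title.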
Let's take a look at what each column represents:
user_id - the ID of the user who rated the movie.
item_id - the ID of the movie.
rating - The rating the user gave the movie, between 1 and 5.
timestamp - The time the movie was rated.
title - The title of the movie.
We can get a brief description of our dataset using the describe() or info() methods.
From the output, we can see that the average rating is 3.52 and the maximum rating is 5.
Let's construct a data frame with each film's average rating and number of ratings. We will use these ratings later to measure the correlation between the films. Movies with a high correlation coefficient are the films that are most similar to one another. We're going to use the Pearson correlation coefficient in our case. The number ranges from -1 to 1. A value of 1 indicates a perfect positive linear correlation, while -1 indicates a perfect negative linear correlation. A value of 0 indicates no linear correlation, meaning the ratings of the two films show no linear relationship.
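A minimal sketch of these summary calls on a toy frame follows. The five ratings are made up, so the numbers differ from the 3.52 average reported for the real dataset.

```python
import pandas as pd

# Hypothetical ratings, one of each value from 1 to 5.
toy = pd.DataFrame({'rating': [1, 3, 5, 4, 2]})

# describe() returns count, mean, std, min, quartiles and max per numeric column.
summary = toy.describe()
print(summary)

mean_rating = summary.loc['mean', 'rating']
max_rating = summary.loc['max', 'rating']

# info() prints the column dtypes and non-null counts.
toy.info()
```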
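To make the coefficient concrete, the sketch below computes the Pearson correlation by hand and compares it with the pandas Series.corr method. The helper pearson and the three rating lists are assumptions made up for this illustration.

```python
import numpy as np
import pandas as pd

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

a = [1, 2, 3, 4, 5]
b = [2, 4, 6, 8, 10]   # rises with a
c = [5, 4, 3, 2, 1]    # falls as a rises

r_ab = pearson(a, b)
r_ac = pearson(a, c)

# pandas uses the Pearson method by default.
pandas_r_ab = pd.Series(a).corr(pd.Series(b))

print(r_ab, r_ac, pandas_r_ab)
```

Here a and b move together, so their coefficient is at the top of the range, while a and c move in opposite directions, so theirs is at the bottom.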
ratings = pd.DataFrame(df.groupby('title')['rating'].mean())
ratings.head()
Now we want to see the number of ratings for each film. We do this by creating a column called number_of_ratings. It is important to look at the relationship between a film's average rating and the number of ratings that the film has received. A movie with a 5-star average may have been rated by only one person. Classifying that film as a 5-star film is therefore statistically unsound.
For this reason, as we build the recommender system, we need to set a threshold for the minimum number of ratings. We use the pandas groupby utility to create this new column. We group by title and then use the count function to determine each film's number of ratings. Let’s use the head() function to view the new data frame.
ratings['number_of_ratings'] = df.groupby('title')['rating'].count()
ratings.head()
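The following sketch repeats the two groupby steps on a toy frame to show why the count matters. The titles and ratings are hypothetical.

```python
import pandas as pd

# Hypothetical data: 'A' has three ratings, 'B' has a single 5-star rating.
toy = pd.DataFrame({
    'title': ['A', 'A', 'A', 'B'],
    'rating': [4, 5, 3, 5],
})

# Average rating per title.
toy_ratings = pd.DataFrame(toy.groupby('title')['rating'].mean())

# Number of ratings per title, aligned on the title index.
toy_ratings['number_of_ratings'] = toy.groupby('title')['rating'].count()

print(toy_ratings)
```

Film B ends up with the higher average, but that average rests on one rating, which is exactly the case a minimum-ratings threshold is meant to filter out.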
Let's now use pandas to plot a histogram of the ratings distribution:
import matplotlib.pyplot as plt
%matplotlib inline
ratings['rating'].hist(bins=50)
We see that almost all of the movies have an average rating between 2.5 and 4. Next, let's take a similar look at the number_of_ratings column.
It is evident from the histogram that many movies have very few ratings. The movies with the most ratings are the most famous ones.
Now let's check the relationship between a film's average rating and its number of ratings. We do that by using seaborn to draw a scatter plot with the jointplot() function.
import seaborn as sns
sns.jointplot(x='rating', y='number_of_ratings', data=ratings)
The graph shows a positive relationship between a movie's average rating and its number of ratings: the higher a film's average rating, the more ratings it tends to receive.
Let's now move on and build a simple recommender system based on similarity. We want movie titles as columns, user_id as the index and ratings as values, so we need to turn our dataset into a matrix.
By doing so, we get a data frame with movie titles as columns and user IDs as rows. Each column holds all of the users' ratings of one movie. A value of NaN indicates that a user has not rated a particular film. To build the movie matrix, we use the pandas pivot_table utility.
movie_matrix = df.pivot_table(index='user_id', columns='title', values='rating')
movie_matrix.head()
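Here is a minimal sketch of the same pivot on a toy frame, showing where the NaN values come from. The users, titles and ratings are hypothetical.

```python
import pandas as pd

# Hypothetical long-format ratings: one row per (user, title) pair.
toy = pd.DataFrame({
    'user_id': [1, 1, 2, 3],
    'title': ['A', 'B', 'A', 'B'],
    'rating': [4, 5, 3, 2],
})

# Rows become users, columns become titles, cells hold the rating.
toy_matrix = toy.pivot_table(index='user_id', columns='title', values='rating')
print(toy_matrix)
```

User 2 never rated B and user 3 never rated A, so those two cells are NaN.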
First, let's look at the top-rated movies and pick two of them to begin with in this simple recommender system. To order the movies from the top down, we use the pandas sort_values utility with ascending set to False. Then we use the head() function to view the top 10.
Let's assume that a user has watched Air Force One (1997) and Contact (1997). Based on this watching history, we would like to recommend movies to this user. The goal is to find films similar to Contact (1997) and Air Force One (1997), which we will then recommend to this user. This can be accomplished by calculating the correlation between the ratings of these two films and the ratings of the rest of the films in the dataset. The first step is to extract the ratings of these two films from the movie matrix.
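The sorting step can be sketched as follows on a toy ratings frame. Sorting by number_of_ratings is an assumption made for this example, since the text does not name the column; the film names and values are hypothetical.

```python
import pandas as pd

# Hypothetical summary frame indexed by title.
toy_ratings = pd.DataFrame(
    {'rating': [3.5, 4.2, 4.8], 'number_of_ratings': [120, 300, 2]},
    index=pd.Index(['Film A', 'Film B', 'Film C'], name='title'),
)

# ascending=False puts the largest values first; head(2) keeps the top two.
top = toy_ratings.sort_values('number_of_ratings', ascending=False).head(2)
print(top)
```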
AFO_user_rating = movie_matrix['Air Force One (1997)']
contact_user_rating = movie_matrix['Contact (1997)']
AFO_user_rating.head()
contact_user_rating.head()
We use the pandas corrwith functionality to calculate the correlation between a data frame and a series or another data frame. corrwith calculates the pairwise correlation between the rows or columns of two objects. Let's use this feature to get our results:
similar_to_air_force_one = movie_matrix.corrwith(AFO_user_rating)
We can see that the correlation between Air Force One (1997) and Til There Was You (1997) is 0.867. This suggests that these two films are very similar.
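As a small worked example of corrwith, the sketch below correlates every column of a toy user-by-title matrix with one chosen column. The matrix and its ratings are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical matrix: rows are users, columns are films, NaN means "not rated".
toy_matrix = pd.DataFrame({
    'A': [5, 4, 1, 2],
    'B': [4, 5, 2, 1],                 # broadly agrees with A
    'C': [1, 2, 5, 4],                 # the mirror image of A
    'D': [np.nan, 3, np.nan, 1],       # only two users rated D
}, index=[1, 2, 3, 4])

# Correlate each column with column A, using only users who rated both films.
similar_to_a = toy_matrix.corrwith(toy_matrix['A'])
print(similar_to_a)
```

Film D reaches a correlation of 1 with A from just two shared ratings, which shows how a sparsely rated film can look like a perfect match.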
Moving on, let's calculate the correlation between the ratings of "Contact" and the rest of the films by using the same procedure.
similar_to_contact = movie_matrix.corrwith(contact_user_rating)
We see a very strong correlation between Contact (1997) and Til There Was You (1997).
As noted earlier, our matrix has a lot of missing values because not every film was rated by every user. Therefore, we drop those null values and convert the correlation results into data frames to make them easier to read.
corr_contact = pd.DataFrame(similar_to_contact, columns=['correlation'])
corr_contact.dropna(inplace=True)
corr_contact.head()
corr_AFO = pd.DataFrame(similar_to_air_force_one, columns=['correlation'])
corr_AFO.dropna(inplace=True)
corr_AFO.head()
The data frames above tell us which movies are most similar to Contact and Air Force One. However, we have a problem: some of the movies have very few ratings and may end up being recommended merely because one or two people gave them a 5-star rating.
We can fix this by setting a threshold for the number of ratings. The histogram earlier showed a sharp decline in the number of ratings from about 100 onward. Therefore, we will set this as the limit, but you can experiment with this number until you find a suitable value.
To do this, the two data frames must be joined with the number_of_ratings column of the ratings data frame.
corr_AFO = corr_AFO.join(ratings['number_of_ratings'])
corr_contact = corr_contact.join(ratings['number_of_ratings'])
corr_AFO.head()
corr_contact.head()
Now we can find the films that are most similar to Air Force One by limiting the results to films that have at least 100 ratings. Then we sort them by the correlation column and view the first ten.
We notice that Air Force One correlates perfectly with itself, as expected. Air Force One's next most similar film is Hunt for Red October, with a correlation of 0.554. Changing the threshold for the number of ratings gives different results from the previous approach. Limiting the number of ratings gives us better results, and we can recommend the above films with confidence to someone who watched Air Force One (1997).
This is, of course, a very simple way to build a recommender system and is nowhere near industry standards. Still, you can now produce similarity-based movie recommendations on your own. How cool is that!
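The filtering and sorting described above can be sketched like this on a toy frame. The film names, correlations and counts are hypothetical, and min_ratings is a name introduced for the example.

```python
import pandas as pd

# Hypothetical correlations with a chosen film, joined with rating counts.
toy_corr = pd.DataFrame(
    {'correlation': [1.0, 0.9, 0.55, 0.3],
     'number_of_ratings': [431, 2, 227, 150]},
    index=pd.Index(['Film A', 'Film B', 'Film C', 'Film D'], name='title'),
)

min_ratings = 100  # threshold; tune this to your data

# Keep films with at least min_ratings ratings, then rank by correlation.
enough_ratings = toy_corr[toy_corr['number_of_ratings'] >= min_ratings]
recommendations = enough_ratings.sort_values(by='correlation', ascending=False).head(10)
print(recommendations)
```

Film B has a high correlation but only two ratings, so the threshold removes it before ranking.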
Always remember, the secret to coding is to keep going!