Content-based recommendation systems find items that are similar to the items a user has liked before. An advantage of a content-based recommendation system is that it works well when there’s not a lot of data available. However, it is limited in that it cannot surface items that are unlike the ones the user has used before.

To create a content-based recommendation system, one must first create an item profile and a user profile. From these profiles, one can determine how similar each item is to the user profile. If an item’s profile is similar to the user’s profile, the item is recommended.

Item Profiles

An item profile is a matrix that maps items to the features that might describe them. You can think of it as a collection of vectors (item vectors) in feature space. A simple example is illustrated by the following table:

Movie                Tom Cruise   Meryl Streep   Julia Roberts
Mission Impossible   1            0              0
Steel Magnolias      0            1              1
Erin Brockovich      0            0              1
Pretty Woman         0            0              1

In the table above, the features are the actors Tom Cruise, Meryl Streep, and Julia Roberts, while the items are the movies listed in the first column. The values 1 and 0 indicate whether the feature is present in the movie or not, respectively.
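As a minimal sketch, a 1/0 table like this could be built in Python from a cast list per movie. The `casts` dictionary and the `build_item_profile` helper are names made up for this illustration, and the cast data simply mirrors the toy table rather than real filmographies.

```python
# Toy cast data mirroring the table above (illustrative, not real filmographies).
casts = {
    "Mission Impossible": {"Tom Cruise"},
    "Steel Magnolias": {"Meryl Streep", "Julia Roberts"},
    "Erin Brockovich": {"Julia Roberts"},
    "Pretty Woman": {"Julia Roberts"},
}
features = ["Tom Cruise", "Meryl Streep", "Julia Roberts"]


def build_item_profile(casts, features):
    """Return {item: [1 or 0 per feature]} in the order given by `features`."""
    return {
        item: [1 if feature in cast else 0 for feature in features]
        for item, cast in casts.items()
    }


item_profile = build_item_profile(casts, features)
for item, vector in item_profile.items():
    print(f"{item:20s} {vector}")
```

Each row of the table becomes one item vector, so "Steel Magnolias" ends up as [0, 1, 1].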

The values in an item profile can simply be filled in by hand. If the items are documents or texts, as when recommending news articles, both the features and the values can be derived using bag-of-words or TF-IDF methods. In this case, the features are the words (or only the important words), and the values are word counts, or TF-IDF scores when TF-IDF is used to extract features from the documents. The values are therefore not 1’s and 0’s but other real numbers. In some cases, these values are normalized or recentered to zero before the similarities between item profiles and user profiles are determined.
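To make the bag-of-words and TF-IDF idea concrete, here is a small sketch using only the Python standard library. The three "documents" are invented, and the formula used (raw count times ln(N / document frequency)) is just one textbook variant; libraries such as scikit-learn use slightly different, smoothed formulas, so the numbers will not match theirs.

```python
import math
from collections import Counter

# Three tiny made-up "articles" standing in for real documents.
documents = {
    "doc1": "the cat sat on the mat",
    "doc2": "the dog sat on the log",
    "doc3": "cats and dogs",
}


def bag_of_words(text):
    """Count how often each lowercase, whitespace-separated word occurs."""
    return Counter(text.lower().split())


def tf_idf(documents):
    """Return {doc: {word: tf * idf}} with tf = raw count, idf = ln(N / df)."""
    counts = {name: bag_of_words(text) for name, text in documents.items()}
    n_docs = len(documents)
    # Document frequency: in how many documents does each word appear?
    df = Counter(word for bag in counts.values() for word in bag)
    return {
        name: {word: tf * math.log(n_docs / df[word]) for word, tf in bag.items()}
        for name, bag in counts.items()
    }


scores = tf_idf(documents)
print(bag_of_words(documents["doc1"]))  # the bag-of-words row for doc1
print(scores["doc1"])                   # the TF-IDF row for doc1
```

A word that appears in only one document (like "cat") gets a higher weight than a word of the same count shared by several documents (like "sat"), which is the point of the IDF part.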

User Profiles

A user profile uses the same features as the item profile, but each row in the matrix now represents a user. The values this time represent the user’s degree of affinity for each feature.

For example, the user profile corresponding to the item profile illustrated above can be constructed as follows. Suppose user A watched 10 movies. Six of those movies starred Meryl Streep, one starred Tom Cruise, and three starred Julia Roberts. To calculate the user’s affinity for Tom Cruise, we take the ratio of the number of Tom Cruise movies to the total number of movies the user has watched: 1/10 = 0.1. The same is done for the other features, Julia Roberts and Meryl Streep.

User     Tom Cruise   Meryl Streep   Julia Roberts
user A   0.1          0.6            0.3

Again, the values can also be normalized in some cases.
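The ratio calculation for user A can be sketched as follows. The `watched` list and the `build_user_profile` helper are hypothetical, and the sketch assumes each watched movie is tagged with exactly one featured actor, which is a simplification of the example in the text.

```python
from collections import Counter

features = ["Tom Cruise", "Meryl Streep", "Julia Roberts"]

# Hypothetical watch history for user A: the featured actor of each of 10 movies.
watched = ["Meryl Streep"] * 6 + ["Tom Cruise"] * 1 + ["Julia Roberts"] * 3


def build_user_profile(watched, features):
    """Affinity per feature = movies watched with that feature / all movies watched."""
    counts = Counter(watched)
    total = len(watched)
    return [counts[feature] / total for feature in features]


user_a = build_user_profile(watched, features)
print(user_a)  # [0.1, 0.6, 0.3]
```

Because the values are in the same feature order as the item vectors, the user vector can be compared with any item vector directly.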

Determining the Similarities of Item and User Profiles

To determine the similarities of item profiles and user profiles, a number of techniques can be used. Some techniques that don’t use machine learning are cosine similarity, Jaccard similarity and Euclidean distance. Here is a blog that not only discusses these techniques but also other techniques and how to implement them in Python.
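As a rough sketch of two of these techniques, Jaccard similarity and Euclidean distance can each be written in a few lines of plain Python. The function names and the toy data are my own choices for illustration, not taken from any particular library.

```python
import math


def jaccard_similarity(a, b):
    """Size of the intersection over size of the union of two feature sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(a & b) / len(a | b)


def euclidean_distance(x, y):
    """Straight-line distance between two equal-length vectors (0 = identical)."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


# Jaccard works on sets, e.g. the casts of two movies from the toy table.
steel_magnolias = {"Meryl Streep", "Julia Roberts"}
pretty_woman = {"Julia Roberts"}
print(jaccard_similarity(steel_magnolias, pretty_woman))  # 1 shared / 2 total = 0.5

# Euclidean distance works on the numeric vectors directly.
# Feature order: Tom Cruise, Meryl Streep, Julia Roberts.
user_a = [0.1, 0.6, 0.3]
print(euclidean_distance(user_a, [0, 1, 1]))  # Steel Magnolias
print(euclidean_distance(user_a, [1, 0, 0]))  # Mission Impossible
```

Note that Euclidean distance is a distance rather than a similarity, so smaller values mean a closer match.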

Most content-based recommender system examples I encountered in blogs and other websites use the cosine similarity technique. A blog by Christian Perone is a great resource for understanding cosine similarity. In my opinion, it is easier to understand if you can remember some things you might have learned in high school trigonometry. In my case, I was able to understand it because I still remember the unit circle and the mnemonic SOHCAHTOA.

Some examples of content-based systems calculate the dot product of the item and user profiles (treated as vectors) instead of the cosine similarity. I’ll put these two ways of calculating distance or similarity in the same category. It actually confused me to see some examples use the dot product while most of the discussions I read talked about cosine similarity.

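So suppose we have two vectors, i and u (for item profile and user profile, respectively), with components i_k and u_k, and let θ be the angle between them. The dot product of i and u and the cosine similarity are given by the formulas below.

$$\mathbf{i} \cdot \mathbf{u} = \|\mathbf{i}\| \, \|\mathbf{u}\| \cos\theta$$

$$\cos\theta = \frac{\mathbf{i} \cdot \mathbf{u}}{\|\mathbf{i}\| \, \|\mathbf{u}\|}$$

$$\mathbf{i} \cdot \mathbf{u} = \sum_{k=1}^{n} i_k u_k$$

The dot product can be expressed either using the cosine of the angle between the two vectors (first formula) or as the sum of the products of the components of the vectors (third formula). I take it that this is why some examples use the latter instead of the cosine similarity (the second formula above).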

Similarity (or closeness) of the two vectors is judged by the value of the dot product or the cosine similarity. The larger and more positive the value, the more similar (or closer) the two vectors are. A value near zero means they have little in common, and a negative value means the vectors point away from each other.
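Here is a small sketch that computes both the dot product and the cosine similarity for the toy movie and user vectors from earlier, then ranks the movies for user A. The function names are my own, and returning 0.0 for an all-zero vector is a convention I picked for the sketch.

```python
import math


def dot(i, u):
    """Sum of the products of matching components."""
    return sum(ik * uk for ik, uk in zip(i, u))


def cosine_similarity(i, u):
    """Dot product divided by the product of the two vector lengths."""
    norm_i = math.sqrt(dot(i, i))
    norm_u = math.sqrt(dot(u, u))
    if norm_i == 0 or norm_u == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot(i, u) / (norm_i * norm_u)


# Feature order: Tom Cruise, Meryl Streep, Julia Roberts.
item_profile = {
    "Mission Impossible": [1, 0, 0],
    "Steel Magnolias": [0, 1, 1],
    "Erin Brockovich": [0, 0, 1],
    "Pretty Woman": [0, 0, 1],
}
user_a = [0.1, 0.6, 0.3]

# Rank the movies for user A, most similar first.
ranked = sorted(
    item_profile,
    key=lambda movie: cosine_similarity(item_profile[movie], user_a),
    reverse=True,
)
for movie in ranked:
    vector = item_profile[movie]
    print(f"{movie:20s} dot={dot(vector, user_a):.2f} "
          f"cos={cosine_similarity(vector, user_a):.3f}")
```

With these toy numbers, "Steel Magnolias" comes out on top for user A and "Mission Impossible" at the bottom, under either measure.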

Here is an illustration directly from Christian Perone:

To learn how to implement cosine similarity in scikit-learn, head on over to his blog! I highly recommend it. I was so happy to have found it.

Cosine similarity does not go beyond -1 and 1, while the dot product can be any number, since it also grows with the lengths of the vectors. Judging closeness works the same way for the two, though: if the value is negative, the vectors point away from each other, and if it is positive, they point in a similar direction.
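A quick way to see the difference is to stretch one vector and compare both measures before and after. This is only a sketch with made-up vectors; the helper functions are redefined here so the snippet runs on its own.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    # Assumes neither vector is all zeros.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


user = [0.1, 0.6, 0.3]
item = [0, 1, 1]
scaled_item = [10 * x for x in item]  # same direction, ten times longer
opposite = [-x for x in user]         # points the opposite way

# The dot product grows with the length of the vector...
print(dot(item, user), dot(scaled_item, user))
# ...while the cosine similarity stays the same.
print(cosine_similarity(item, user), cosine_similarity(scaled_item, user))
# Opposite directions give the lowest possible cosine, about -1.
print(cosine_similarity(user, opposite))
```

So the two measures agree on the sign, but only the cosine is insensitive to how long the vectors are, which can matter when item vectors hold raw counts of very different sizes.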

Aside from determining similarities using the techniques above, one can use linear regression to predict recommendations. Here, the scores a user gave to items are fitted against the item features, and the user’s score for a new item is “extrapolated or interpolated” (for lack of a better term, I’m not a stat major, but the better term might be just “predicted”) to estimate the user’s degree of affinity for that particular item.
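A bare-bones sketch of the regression idea, in plain Python, might look like the following. The ratings are invented, the helper names are mine, and the fit uses the normal equations solved by Gauss-Jordan elimination, which assumes the features are not redundant (otherwise the system is singular). A real project would more likely reach for a library.

```python
def fit_linear_regression(X, y):
    """Least-squares fit of y ~ b + w.x via the normal equations (A^T A) p = A^T y."""
    A = [[1.0] + list(row) for row in X]  # prepend an intercept column
    n = len(A[0])
    # Build the augmented system [A^T A | A^T y].
    M = [
        [sum(r[i] * r[j] for r in A) for j in range(n)]
        + [sum(r[i] * t for r, t in zip(A, y))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * b for a, b in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]  # [intercept, w1, ..., wn]


def predict(params, x):
    """Predicted score = intercept + weighted sum of the item's features."""
    return params[0] + sum(w * xi for w, xi in zip(params[1:], x))


# Feature order: Tom Cruise, Meryl Streep, Julia Roberts.
# Made-up ratings (1 to 5) that one user gave to movies they have seen.
watched_features = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 1, 1],
    [1, 0, 1],
]
ratings = [2.0, 4.0, 3.5, 4.5, 2.5]

params = fit_linear_regression(watched_features, ratings)
unseen = [1, 1, 0]  # a hypothetical unseen movie with Tom Cruise and Meryl Streep
print(params)
print(predict(params, unseen))
```

The fitted weights play the role of the user profile: a positive weight means the feature pulls the predicted rating up, a negative one pulls it down.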

I think there’s really free rein in how people create content-based recommender systems. One example I encountered used unsupervised machine learning, not for prediction but just for cleaning up the item profile. This was the recommender by Thom Hopmans, who blogged about a content-based recommender system for blogs. Here, since the items are documents, text analysis was used to extract the features and scores that populate the utility matrices. However, before similarity was determined, the utility matrix was subjected to dimensionality reduction using singular value decomposition (he also recommended principal component analysis).


See links in the text above. I am grateful for the bloggers and data scientists who discussed their work.