# How do you build a “People who bought this also bought that”-style recommendation engine

## Collaborative Filtering

Collaborative Filtering (CF) is a method of making automatic predictions about the interests of a user by learning its preferences (or taste) based on information of his engagements with a set of available items, along with other users’ engagements with the same set of items. in other words, CF assumes that, if a person A has the same opinion as person B on some set of issues X={x1,x2,…}, then A is more likely to have B‘s opinion on a new issue y than to have the opinion of any other person that doesn’t agree with A on X.

So for example, while person A‘s favorite movies are {American Hustle, Hunger Games and Delivery Man}, person B really likes {American Hustle, Hunger Games and Captain Philips} and person C just loves {12 Years a Slave, Reasonable Doubt and Die Hard II}, following CF approach it should be safer to assume person B should also like Delivery Man and person A would have liked to watch Captain Philips, than to assume any of them would be interested in watching 12 Years a Slave, Reasonable Doubt or Die Hard II.

By discovering usage/engagement patterns among users and items in the data, CF algorithms are able to infer users’ hidden preferences and to exploit those preferences to recommend them new potentially-good items. CF algorithms are best known for their use on e-commerce web sites, where they serve as cornerstones for their recommendation-engines.

In this post I will introduce some basic theory behind one of the most popular CF algorithms, known as Matrix Factorization, review some of its applications, provide some basic details about large-scale implementations of it, and hopefully will trigger you to read more and get interested in this field.

Let’s begin!

### Clustering

In data mining, clustering is the process of grouping objects based of object-similarity, in such a way that objects that belong to the same group (also known as a cluster) are more similar to each other than to those in other groups. In most scenarios it is considered an unsupervised learning task, as no real “feedback” (a ground truth mapping of an object to its group) exists.

There are various algorithms that aim to solve clustering problems. They are mostly based on machine learning techniques (among them the famous K-Means and K-Medoids models) or statistical inference techniques (a popular example is the Gaussian Mixture Model), and are widely used in a wide range of applications.

Depending on the nature of the data and the purpose for which clustering is being used, different clustering methods and different similarity measures may be used. More advanced algorithms extend traditional clustering methods to exploit temporal dynamics, connectivity in graphs, intensity or desired attributes of more complex signals. In this post, I’m going to concentrate on the special case of co-clustering and on the Matrix Factorization approach for solving it.

### Hard vs. soft (fuzzy) clustering

On the highest abstract level of clustering approaches, they are roughly divided into two groups: hard and soft clustering. In hard clustering, objects of the original data (observations) are mapped into distinct clusters – each observation is associated with exactly one cluster. In soft clustering (also known as fuzzy clustering), observations are mapped into clusters in a “fuzzy” fashion, such that each observation is a member of potentially more than one cluster, and is assigned with a set of cluster-membership-strength values that indicate how much it is associated with each of the clusters.

### The Co-Clustering problem

Real world data is often bimodal, that is to say created by a joint interaction between two types of entities. For example, a user rating a document is affected by both the user characteristics (affinity to some subjects/categories) and the document characteristics (its connections to one or more of those subjects/categories). Often, this type of signal is represented as a matrix, of which each dimension represents one of the entity types. Co-clustering (or Biclustering) is a term in data mining that relates to a simultaneous clustering of the rows and columns of a matrix. Where classical clustering methods assume that a membership of an object (in a group of objects) depends solely on its similarity to other objects of the same type (same entity type), co-clustering can be seen as a method of co-grouping two types of entities simultaneously, based on similarity of their pairwise interactions.

The signal in the former example of users rating documents can be represented as a matrix with users as rows and documents as columns, and the inner cells of the matrix as the ratings, or any affinity signal of a user towards a document. Co-clustering this matrix can be explained as grouping both similar usersandsimilar documents into, let’s say, categories or interests, synchronously.

Co-clustering is extremely useful when the above mentioned pairwise interactions signal is sparse. And again, it is easier to demonstrate with an example: consider the case of internet users reading news articles. Now think of two users with similar affinity to the same news categories. Even if the two match perfectly in their preferences, it is extremely unlikely that the two will read exactly the same articles, due to the huge variety of news articles offered daily on the net. Moreover, it is even fair to assume that the amount of articles read by both users is very low. In this case, clustering similar users based on the articles they read (or on the opposite side, clustering articles based on overlapping readers) seems pretty useless. And it is the case where co-clustering approach really shines, comparing to “regular” unimodal clustering.

## Matrix Factorization

### Model Formulation

One of the most popular algorithms to solve co-clustering problems (and specifically for collaborative recommender systems) is called Matrix Factorization (MF). In its simplest form, it assumes a matrix ${{R}\in{R^{m \times n}}}$ of ratings given by musers to nitems. Applying this technique on R will end up factorizing R into two matrices ${{U}\in{R^{m \times k}}}$ and ${{P}\in{R^{n \times k}}}$ such that $R \approx U \times P$ (their multiplication approximates R).

Note that this algorithm introduces a new quantity, k, that serves as both U‘s and P‘s dimensions. This is the rank of the factorization. Formally, each ${R_{i,j}}$ is factorized into the dot product of ${u_i}\cdot{p_j}$, having ${u_i},{p_j} \in {R^k}$. Intuitively, this model assumes every rating in R is affected by k effects. Moreover, it represents both users and items in U and P accordingly, in terms of those k effects.

So, assuming R is a matrix of users rating movies, it is pretty straightforward to assume that each movie is tied to one or more categories with some strength, each user likes/dislikes some movie-categories with different strengths, and the rating that a user should have given a movie depended on the similarity between his characteristics and this movie’s characteristics. The problem is, how to form those categories efficiently. Hell, it can even depends of certain affinity to some movie actors, directors, language, location of filming, and more, and the number of possible features to create is immense.

Actually, we are back to the co-clustering problem where we have pairwise interactions between two entities (a user and a movie), that eventually can incorporate a rating matrix that is quite sparse (due high variety of movies, for example, users will be able to rate only a small fraction of them). Intuitively, we can represent users (or movies) by their affinity (or connection) to categories/actors/languages or any other factors, and finally try to cluster them based on this representation. The problem is, it is quite hard to quantify the effect of each such factor on the original ratings. Moreover, there might be factors that are inherent to the rating process, but are missed from our new representations.

So back to linear algebra, MF is a form of optimization process that aims to approximate the original matrix R with the two matrices U and P, such that it minimizes the following cost function:

$J = ||R - U \times {P^T}||_2 + \lambda \left(||U||_2 + ||P||_2 \right)$

The first term in this cost function is the Mean Square Error (MSE) distance measure between the original rating matrix R and its approximation ${U \times {P^T}}$. The second term is called a “regularization term” and is added to govern a generalized solution (to prevent overfitting to some local noisy effects on ratings).

This optimization problem is naturally solved by machine learning techniques, which will be explained soon. But first, note that this cost function introduces two parameters: k and $\lambda$. Apart from trying to produce a minimal value of this cost function for a given k and $\lambda$, it is essential to determine what is the optimal values of those parameters. In the rest of this chapter, I will go over two optimization process that are used to have this cost function converged to minimum, given k and $\lambda$. Bear in mind that some higher level optimization process (such as Cross Validation) can be incorporated to insure a good selection of those parameters.

### Gradient Descent

Gradient Descent is a first-order optimization algorithm that is widely used in the field of machine learning. Conceptually, it is a simple iterative optimization process that assumes the existence of a cost function and arbitrary initial values for the optimization variables.
In every iteration it re-computes the gradient of the cost function with respect to the optimization variables, and updates them in a step that is proportional to the negative of the gradient of the cost function, targeting at minimizing the cost function, until the latter converges to a minimum point. However, optimizing a cost function with gradient descent only guarantees convergence to a local minima.

Gradient descent has been shown to work well optimizing MF models. However, it is not a popular choice for an optimizer for MF if the dimensionality of the original rating matrix is high, as there are effectively $\left({n}\cdot{k}+{m}\cdot{k}\right)$ parameters to optimize. In real life problems, this number can get very large quite often, requiring both a parallelization mechanism and a better exploitation of matrix factorization’s unique cost function nature.

### Alternating Least Squares

Looking again at MF’s cost function, it appears that we aim at learning two types of variables – those of U and those of P, and the two types are tied in the multiplication of $U \times {P^T}$. Recall that the actual cost function is the sum $||R-{U}\times{P^T}||_2 = \mathop \sum \limits_{i,j}\left({R_{i,j}-{u_i}\times{p_j}}\right)$ plus regularization term. The fact that both U’s and V’s values are unknown variables makes this cost function non-convex.

But another interesting fact lies in this term – if we fix P and optimize for U alone, the problem is simply reduced to the problem of linear regression. Recall that in linear regression we solve for $\beta$ by minimizing the squared error $||y - X\beta||_2$ given X and y. The solution is ultimately given by the Ordinary Least Squares (OLS) formula ${\beta = \left({{X^T}X}\right)^{ - 1}}{X^T}y$.

Alternating least squares does just that. It is a two-step iterative optimization process. In every iteration it first fixes P and solves for U, and following that it fixes U and solves for P. Since OLS solution is unique and guarantees a minimal MSE, in each step the cost function can either decrease or stay unchanged, but never increase. Alternating between the two steps guarantees reduction of the cost function, until convergence. Similarly to gradient descent optimization, it is guaranteed to converge only to a local minima, and is ultimately depends on initial values for U or P.

Since the actual cost function includes a regularization term, it is slightly longer. According to the two-step process, the cost function can be broken down into two cost functions:

$\forall{u_i}: J\left({u_i}\right) = ||R_i - {u_i}\times{P^T}||_2 + \lambda \cdot ||u_i||_2$

$\forall{p_j}: J\left({p_j}\right) = ||R_i - U\times{p_j^T}||_2 + \lambda \cdot ||p_j||_2$

And the matching solutions for ${u_i}$ and ${p_j}$ are:

${u_i} = {\left( {{P^T} \times P + \lambda I} \right)^{ - 1}} \times {P^T} \times {R_i}$

${p_j} = {\left( {{U^T} \times U + \lambda I} \right)^{ - 1}} \times {U^T} \times {R_j}$

And since each ${u_i}$ doesn’t depend on other ${u_{j \ne i}}$ vectors, each step can potentially be introduced to massive parallelization.

### SVD vs MF

Although Singular Value Decomposition (SVD) is not the main objective of this post, it is well worth mentioning due to its applicative relation to the subject of this post. SVD is a matrix decomposition technique that has mathematically originated from linear algebra. It decomposes any matrix into 3 matrices ${{U}\in{R^{m \times k}}}$ , ${{\Sigma}\in{R^{k \times k}}}$ and ${{V}\in{R^{m \times k}}}$ such that ${{\;A}} = {{U}} \times {{\Sigma }} \times {{{V}}^{{T}}}$.

It comes with stronger guarantees than Matrix Factorization’s:

• ${{\Sigma }}$ is a diagonal matrix having the singular values of A on its diagonal. A common convention is to list the singular values in ${{\Sigma }}$ in a descending order.
• U and V are orthonormal matrices (their columns are orthogonal and their norm equals 1)
• SVD solution is unique

So if we assume A is a matrix of user-item ratings (users as rows, items as columns), that means that:

• Each row in $U$ (or $V$) corresponds to a user (or item) characteristics. So for example, ${{{U}}_{i,k}}$ can be interpreted as the strength of membership of user i to the kth category, and ${{{V}}_{j,k}}$ can be interpreted as the strength of membership of user i to the samekth Since $U$ and $V$ form orthonormal bases, it follows that (for each k) the overall strength of the kth effect on ratings is already “deducted” from the values of the kth column in U or V.
• Each rating (value in ${{{A}}_{{{ij}}}}$) is explained by a set of independent categories / effects. For a specific useri and itemj, the rating ${A_{ij}}$ is a decomposed into the following summation $\mathop \sum \limits_{k = 1}^K \left[ {{u_{i,k}}*\;{\Sigma _{k,k}}*v_{j,k}^T} \right]$ where each value of k determines:
• ${{{u}}_{{{i}},{{k}}}}$user_i‘s kth latent factor
• ${{v}}_{{{j}},{{k}}}^T$item_j‘s kth latent factor
• ${{{\Sigma }}_{{{k}},k}}$ – the overall weight (magnitude of effect) of the kth factor

It follows that any rating given by user i to item j is affected by K factors, each, in turn, is decomposed into: similarity between user I and item j in this dimension (${u_{i,k}}*v_{j,k}^T$), and the overall effect of this dimension (${\Sigma _{k,k}}$) on ratings across all users and items.

• Since $\Sigma$ is ordered by the size of the singular values of A in descending order, the accumulative sum $\mathop \sum \limits_{k = 1}^K {\Sigma_{k,k}}$ accounts for the total variance (effect on the ratings) explained by the k strongest effects (or categories). Therefore, it is easy to roughly estimate what should be the rank of A (how many factors affect the ratings).

There are two main problems with this approach:

1. Its computational complexity
2. The fact that the original matrix must be fully-dense (no missing values allowed), and it is often the case where the matrix that should be decomposed is sparse (introduces missing values).

### Missing Values

MF allows missing values in the rating matrix R. The cost function is then changed into:

$J = \mathop \sum \limits_{i,j} w_{i,j} \cdot \left( R_{i,j} - {u_i}\times{p_j^T}\right)^2 + \lambda \left( ||U||_2 + ||P||_2 \right)$

where

${w_{i,j}} = \left\{ {\begin{array}{*{20}{c}}{\begin{array}{*{20}{c}}1&{{R_{i,j}}\;is\;known}\end{array}}\\{\begin{array}{*{20}{c}}0&{{R_{i,j}}\;is\;unknown}\end{array}}\end{array}} \right.$

And the matching solutions for ${u_i}$ and ${p_j}$ are:

${u_i} = {\left( {{P^T} \times {w_i} \times P + \lambda I} \right)^{ - 1}} \times {P^T} \times {w_i} \times {r_i}$

${p_j} = {\left( {{U^T} \times {w_j} \times U + \lambda I} \right)^{ - 1}} \times {U^T} \times {w_j} \times {r_j}$

## MF Applications

So far we have been dealing mostly with how to co-cluster a matrix of rating signals. In a more realistic scenario of a rating signal that contains missing values, MF variations have been shown to perform better than other alternative approaches. But co-clustering is not the target, it is only the tool.

### Recommending Items

The real objective of a recommender engine is to produce the best possible item recommendation lists for each user, rather than just clustering users/item. Luckily for us, the fact that MF is actually a soft clustering algorithm (recall that each user/item is represented as a vector of factors that represents how much it is associated with a phenomena/cluster) enables us to produce such recommendation lists. Following the fact that each known value in the rating matrix R was decomposed into the dot product of its matching user/item factor vectors, it is pretty straightforward to reconstruct a “full” ratings matrix R* by multiplying each user factor vector with every item factor vector.

Consequently, R* is a smoother (filtered) approximation of R that is lacking every effect on ratings that is not inherent to the rank of the model K (the length of user/item factor vectors). It can be interpreted as the set of the expected ratings given by any user to any item, given the collaborative patterns learned from the known values in R. Thus, it is straightforward to use all expected item-ratings in R* that were previously unknown in R, for a certain user, to produce a potentially well-ordered recommendation list consists of never-seen-before-by-that-user items.

One thing to note, however, is that the accurate and formal definition of the recommendation problem would be to produce recommendation lists for of the right order of items per user, or in other words to rank the items (per user) in the best possible way. Even though ranking is not inherent to MF optimization objective (recall that MF’s cost function is all about approximating the known values in R well using the factorization, rather than ranking items for each users), and there are other approaches and methods that are aimed at just that, MF is considered an industry standard approach for collaborative recommender systems. Although, in most cases a more complex extension of the algorithm reviewed in this post is used.

### MF for dimensionality reduction

In some applications, MF isn’t used to produce recommendation, but rather as a pre-processing step for high-dimensional data, followed by another learning mechanism (possibly classification/regression). An example is can be found here.

An additional marginal insight can be the dense low-rank representation of both users and items as the learned vectors of factors. For example, MF’s user-factor matrix can be used to learn how users are similar (in light of item-usage), overcoming the sparsity in the original rating matrix. It can be done by computing your favorite distance measure between users’ factor-vectors, be it RMSE, Cosine Similarity or any other suitable measure.

## Introducing Metadata (aka Content Based approach)

Another common approach to building a recommender systems is the content-based (CB) approach. As the name suggests, this approach is based on a features that relate to the actual content of the items and the profiles of the users. The main drawback of this approach is the need to describe both users and items content prior to running MF. Since such information is not always present or available to extract from either users of items (or on the contrast, it is available but not accurate enough), CB approach is often infeasible to apply to a lot of real world problems. However, when item/user metadata is available, it is common to use either CB or a combination of CF and CB altogether, in order to augment the recommender system with potentially greater learning power.

Sometimes, connecting MF’s factor values to an actual meaning it is also required, for clearer interpretation of the results. Here, user/item metadata can be leveraged in a post-processing step by explaining the magnitude of the factors with correlating them to the metadata labels.

## Frameworks and libraries for applying MF

### MapReduce-based Apache Mahout

MapReduce is a programming paradigm that is suited for big data processing. Apache Hadoop is an open-source software framework for distributed storage and MapReduce-based processing of large data sets, harnessed with failure-safe capabilities that are transparent to the end-user.

Apache Mahout is an open-source project founded to offer MapReduce-based library of distributed machine learning algorithms. Besides distributed classification, linear algebra modules, clustering and more, it offers an implementation of MF based on both GD and ALS. An example of how to use the preferable ALS optimizer for MF can be found here.

### Spark’s MLlib

Spark is the new industry-standard distributed batch processing framework for general-purpose cluster computing. It provides APIs in Java, Scala, Python and R. It is augmented with failure-safe capabilities, and offers better coding flexibility and many other improvements over Hadoop. It excels mostly when applying multi-pass algorithms (algorithms that make use of the data more than once).

MLlib was built on top of Spark to take advantage of Spark’s capabilities when running iterative Machine Learning algorithms. Due to the fact MF (using either GD or ALS) is an iterative optimization process, leveraging Spark framework’s capabilities for fast iterative processing can be very beneficial. MLLib has an ALS implementation, offering both explicit feedback and implicit feedback MF cost functions. There is also a nice tutorial of it.

## Further Readings

### Online Lectures

1. Andrew Ng’s Machine Learning course on Coursera.org has great lectures on recommender systems and specifically on MF. Look for Chapter 9.
2. Collaborative Filtering for Implicit Feedback Datasets is an article written back in 2008 by Netflix-Prize‘s winners and is actually implemented today as part of MLLib’s ALS. It is targeted at dealing with implicit feedback – worth reading!

Like this post? Subscribe to get a notification every time a new post is published! Want to choose what will I write on next? Vote!