Candidate generation in recommender systems
Soft negatives, in-batch negative sampling and log-Q correction
Background
Choice overload is a well known problem in Neuroscience, our brain struggles to choose an item from a collection of as few as 15 items.
Whereas modern recommender systems (RecSys) need to pick items for a user-query from a pool of billions of candidates. As mentioned in my last post, this is achieved by dividing RecSys into 3 stages namely:
Generators - One or many generators that use simple models, and approx nearest neighbor (ANN) to quickly select a few thousands of candidates to be passed onto the next stage. Note ANN algorithms use precomputed item embedding to make inference computation much more tractable.
Early stage ranker (ESR) - A more complex but still simpler model to further rank thousands of candidates from the output of generator stage and filter them to a few hundreds.
Late stage ranker (LSR) - A full blown model that is run only on hundreds of candidates from the output of ESR.
As biggest drop occur at generator stage they are central in solving choice overload problem in RecSys. In this post, I will deep dive into generators, discuss the challenges they face and present solutions used by the state-of-the-art (SOTA) implementations.
Problem formulation
The goal of generators is to develop a model that predicts if a given candidate is likely to generate positive user interaction. That is, for every user-query, qj, one needs to find the probability
for every candidate ci in the pool. This is a multi-class classification problem, with each candidate being a separate class. Recall in multi-class classification formulation we use softmax to convert output of the neural network, lets say s(ci, qj), into probabilities:
Here index k runs over all candidates in the pool. The overall log-likelihood then is:
where y is 1 iff ci generates positive user interaction for query qj and 0 otherwise.
Challenges
Generators face multiple issues, most important of these are:
Data sparsity: Even in the best of RecSys a given user is only exposed to a few thousand candidates in the billion item pool. Therefore, models built on this data lack the capability to generalize well on all the candidates in the pool. In this post I will cover how data sparsity is tackled in SOTA RecSys.
Training computation: Computation of L require computation of s for all (ci, qj) pairs which is O(#queries x #candidates). For a global scale RecSys this amounts to O(quintillion) passes over the underlying neural network for training, which is intractable. In this post I will discuss SOTA methods to reduce this.
Inference computation: These generators need to pick thousand of candidates from a pool of billion of candidates in response to a user-query. This computation needs to happen in 100s of milliseconds. Therefore, one need smart design of these generators. This is a topic for a later post and involves smart architecture design, pre-computation and and ANN search.
Solutions
Below I describe how SOTA RecSys solve data sparsity and training computation challenges.
Data Sparsity
Classical way of handling data sparsity is to expose users to a randomly selected set of candidates from the pool. However, this does not work well in RecSys as most of the random candidates from the pool are irrelevant to the user-query and generate negative response. Hence, exposing users to such candidates becomes detrimental to the quality of the RecSys, and can compromise user trust.
That being said, the clue to tackle the problem lies in the problem itself. That is “most randomly selected candidates from the pool are irrelevant to the user-query”. Therefore SOTA generators assumes that only candidates with user-provided positive interaction are positive for the query and all other candidates are negatives. These candidates are called soft-negatives as they are without explicit negative feedback.
Training computation
In order to make training computation tractable, first time complexity of computing p for a given ci, qj is reduced. This is achieved by doing in-batch negative sampling approximation as:
where N is the overall pool size and B is the batch size. This reduces time complexity of computing p to O(B) computation of s(ci, qj)). Note- one computation of s require a pass over the underlying neural network, which itself is quite costly.
Then computation of L is simplified by invoking soft-negative assumption again. Particularly under this assumption for a given query y is 1 only for handful of candidates and is 0 otherwise. Therefore, computation of L only require to compute p only for these handful of candidates.
These approximation and simplification make computation of L require computation of s for only O(#queries x B) (ci, qj) pairs instead of O(#queries x #candidates). A staggering million times reduction for a global scale RecSys. However, doing this introduces a bias towards popular candidates as we will see in the popularity bias section below.
Final computation reduction comes from modifying the neural network architecture to compute s(ci, qj). Particularly, observe that if s(ci, qj) is of following form
where f and g are two independent neural networks, [.] is the dot product and f(ci) and g(qj) are candidate and query embeddings. In this formulation in order to get s(ci, qj) for all (ci, qj) pairs in a batch one needs to compute f for the B candidates and g for the B queries only1 as oppose to computing s for B2 (ci, qj) pairs. This architecture is the famous two-tower architecture2. This further gives a O(B) ~ O(1000) reduction.
Overall in-batch negative sampling and two tower architecture reduces computation cost of training generator models from the O(#queries x #candidates) to O(#queries), a billion times reduction for a global scale RecSys.
Popularity bias and Log-Q correction
Although in-batch sampling reduces the computation cost during training dramatically, it introduces a bias.
In the modern RecSys certain candidates get popular and get exposed much often then other candidates. Hence these candidates are much more frequent in the training batch, making the batch a biased representation of the entire candidate pool. This makes in-batch approximation done above biased and causes popularity bias in the RecSys.
Log-Q correction is a way to further reduce popularity bias. Particularly here we correct in-batch approximation by dividing individual terms by 𝜙(ck) the probability of candidate ck appearing in the training data. This leads to to a corrected logit s as shown below3.
Although doing this reduces the popularity bias significantly but its not a complete fix since many candidates in the pool never show up in the training data, and hence are never considered. There are ways to handle this such as mixed negative sampling. One can refer to this paper from Google for details.
Conclusion
Overall candidate generation in RecSys is a very hard problem where one needs to train models over vast space of candidates and user-queries in a reasonable amount of time. In-batch negative sampling, with logQ correction and two tower networks lead to O(#candidates in the pool) ~ O(billion) reduction in training time, and is central to modern RecSys performance.
ref: Sampling-Bias-Corrected Neural Modeling for Large Corpus Item Recommendations
Disclaimer: These are the personal opinions of the author. Any assumptions, opinions stated here are theirs and not representative of their current or any prior employer(s).
Once one have f(ci) and g(qj) for all the candidates and queries in a batch one can get s(ci, qj) for all the (ci, qj) pairs using matrix multiplication which is a much faster operation, particularly on GPUs.
One can then get s by much faster matrix multiplication for f(ci) and g(qj).
For further simplification one can replace s by sc everywhere since one can subtract log(𝜙(ci)) from the log likelihood loss equation since it does not depend on the neural network parameters.