Learning-to-rank (LTR) is a machine learning field that replaced manually adjusted ranking formulas with models trained to arrange documents by how relevant they are to a search. Earlier articles explained which signals matter (PageRank, HITS, entity importance). This article explains how a search engine automatically learns to balance those signals without engineers setting every limit by hand. This change is important because no team can manually adjust hundreds of connected signals or handle billions of different queries. This article explains why hand-made formulas reached their limit, the three LTR design methods and their purposes, RankNet as the first neural pairwise model, the problem with NDCG (Normalized Discounted Cumulative Gain) not being differentiable, and how LambdaRank fixed it, and what the DOJ (Department of Justice) trial showed about how much of Google’s ranking is still manually made.[1]Source 1Burges, C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., & Hullender, G. (2005). Learning to rank using gradient descent. In Proceedings of the 22nd International Conference on Machine Learning (ICML '05) (pp. 89–96). ACM.View source ↗
Why Did Hand-Crafted Ranking Formulas Stop Working?
The earliest commercial search engines ranked results with explicitly programmed formulas: a TF-IDF score, a PageRank score, a freshness score, a domain authority score. Engineers set the weights of each factor by inspecting query logs, reading user complaints, and adjusting numbers by hand. This process worked when search engines handled tens of millions of queries per day with a handful of well-understood signals.[2]Source 2Burges, C. J. C. (2010). From RankNet to LambdaRank to LambdaMART: An overview. (Technical Report No. MSR-TR-2010-82). Microsoft Research.View source ↗[3]Source 3Liu, T.-Y. (2009). Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3(3), 225–331.View source ↗
It stopped working for three interconnected reasons.
The number of signals grew beyond what humans can handle. As signals increased from a few to dozens and then hundreds, their interactions became too complex to manage by hand. Making link authority more important might help some searches but hurt others. A human team cannot keep track of all these complex interactions at once, much less adjust them for thousands of different query types.
The variety of queries became too wide for one fixed formula. One ranking formula has to work well for very different searches like “weather tomorrow,” “how to appeal a denied insurance claim,” “Albert Einstein birthplace,” and “best Python ORM library” using the same settings. Each search needs a different mix of signals. A fixed formula is a compromise that is not perfect for any of them.
The way signals affect each other is not simple. A short page with high PageRank should be treated differently from a long page with the same PageRank. Page speed matters more on mobile than on desktop. Freshness is more important for news than for timeless reference questions. Writing all these special rules by hand leads to many exceptions. Eventually, the formula becomes impossible to manage.
The main problem is that a hand-made formula uses the same signal weights for every query and document. A learned ranking model changes the weights based on the type of query, the type of document, and how they relate, all automatically from training data.
The change happened slowly. Google did not suddenly switch from all hand-made to all machine-learned ranking. The 2024 DOJ trial showed that most of Google’s ranking signals are still hand-made using sigmoid functions and manual settings. Only RankBrain and DeepRank are fully machine learning systems based on large language models. So, learning-to-rank does not fully replace hand-made signals. Instead, it is a layer that learns to combine them better than a fixed formula.
What Is Learning-to-Rank?
Learning-to-rank uses supervised machine learning to put documents in order based on how relevant they are to a search. Instead of a programmer setting weights for each signal, the LTR model learns these weights from training data where human raters have judged how relevant each document is for a query.[3]Source 3Liu, T.-Y. (2009). Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3(3), 225–331.View source ↗[4]Source 4Wikipedia contributors. (n.d.). Learning to rank. Wikipedia, The Free Encyclopedia.View source ↗
The training data for a production search engine LTR model consists of thousands of queries paired with their candidate result sets, where each result has been assessed by human raters using a graded relevance scale. Google uses a five-point scale ranging from “Completely Off-Topic” to “Perfectly On-Topic,” documented in its publicly available Search Quality Rater Guidelines. Bing uses a similar graded scale. These human judgments become the ground truth that the model learns to predict.
Once the model is trained, ranking works like this: the system finds a set of possible documents, calculates features for each query-document pair (like PageRank, TF-IDF score, entity match, freshness, click history, and many more), runs these features through the model to get a relevance score, and sorts the documents by these scores. This all happens in milliseconds because the model is kept in memory and runs fast.
Three architectural approaches dominate the LTR literature, differing in what unit of comparison the model reasons about during training.
Html Block
What Is the Pointwise Approach?
The pointwise approach treats ranking like a regular prediction problem. Each query-document pair is scored on its own: the model predicts how relevant one document is without looking at the other documents in the set.[3]Source 3Liu, T.-Y. (2009). Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3(3), 225–331.View source ↗
How it works: A training example consists of a single query-document pair and a relevance label (for example, “highly relevant” = 3 on a 0-4 scale). The model learns to predict that label given the feature vector. At inference time, all documents in the candidate set are scored individually and sorted by their predicted scores.
The key advantage: Any standard regression or classification algorithm works out of the box. Logistic regression, gradient-boosted trees, and neural networks can all serve as pointwise rankers without modification to their algorithms.
The main drawback is that the model learns absolute relevance, but users care about the order of results. A document marked “highly relevant” that appears at position 5 because four others scored the same is not a prediction mistake. But it is a ranking problem because the best result is not at the top. Pointwise models cannot understand the order of documents compared to each other.
What Is the Pairwise Approach?
The pairwise approach treats ranking as deciding between pairs of documents. Instead of predicting a score for a single document, the model learns which of two documents should appear first in a search.[1]Source 1Burges, C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., & Hullender, G. (2005). Learning to rank using gradient descent. In Proceedings of the 22nd International Conference on Machine Learning (ICML '05) (pp. 89–96). ACM.View source ↗[3]Source 3Liu, T.-Y. (2009). Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3(3), 225–331.View source ↗
How it works: A training example consists of a query and two documents where the ground truth ordering is known (document A is more relevant than document B for this query). The model learns to predict the correct ordering of each pair. At inference time, pairwise comparisons produce a total ordering of the candidate set.
This is better than pointwise because the model learns the important signal for ranking: the order between documents. It does not need to predict exact relevance scores; it only needs to determine which document is better in each pair. A model that always picks the more relevant document in pairs will create a good ranked list even if the exact scores are off.
The number of document pairs grows very fast as the number of documents increases. For 1,000 documents, there can be up to 500,000 pairs per query. This makes training and using the model costly. Also, pairwise models treat all pairs the same: swapping documents at the top of the list counts the same as swapping ones near the bottom, even though users care mostly about the top results.
What Is RankNet?
RankNet was the first important neural network using the pairwise learning-to-rank method. Published by Chris Burges and team at Microsoft Research in 2005, it introduced a probabilistic cost function over document pairs and was trained fully using backpropagation.
RankNet’s main idea is to estimate the chance that document A should rank higher than document B using a sigmoid function based on their score difference. The model is trained to reduce the difference between predicted order and true order using cross-entropy loss. Gradient descent adjusts the network to lower the number of wrong pair orderings in training.
RankNet’s importance is not its design, which was a simple two-layer network with one output per document. Its value lies in showing two things: first, a neural network trained with gradient descent can learn to rank from pairwise data; second, it outperforms hand-tuned formulas on real search engine data. Burges tested RankNet on real internet search data and found clear improvements in ordering accuracy.
RankNet proved that ranking can be treated as a learning problem and that neural networks can solve it. All later pairwise and listwise learning-to-rank methods in commercial search systems build on this foundation.
What Is the Listwise Approach, and Why Can’t You Optimize NDCG Directly?
The listwise approach improves the ranking of the entire candidate list at once, rather than scoring each document separately (pointwise) or comparing pairs (pairwise). Its target metric is usually NDCG (Normalized Discounted Cumulative Gain), the top evaluation metric for ranked lists covered in article 1.9.[4]Source 4Wikipedia contributors. (n.d.). Learning to rank. Wikipedia, The Free Encyclopedia.View source ↗
NDCG rewards placing highly relevant documents near the top of the list and applies a logarithmic discount to gains from lower positions: a document at position 1 contributes more to the score than an equally relevant document at position 5. This is exactly what matters for user experience.
NDCG is calculated using fixed rank positions. The rank of a document is a whole number: 1, 2, or 3, not 1.7. Whole number positions cannot be differentiated. A function that cannot be differentiated cannot be used as a loss function for gradient descent.
This is not just a technical problem. It is the main mathematical barrier that has stopped direct NDCG optimization for years. The pairwise loss used by RankNet can be differentiated and gives gradient descent something to reduce. NDCG cannot.
How Did LambdaRank Solve the Non-Differentiability Problem?
LambdaRank, developed by Burges and colleagues in a 2006 paper, solved this problem with an idea that is simple to explain but not easy to find: instead of finding a differentiable substitute function that approximates NDCG, define the gradient (the lambda) directly, without getting it from any loss function at all.
The lambda for a document pair in LambdaRank is the gradient of the RankNet cross-entropy loss, scaled by the absolute change in NDCG that would result from swapping the positions of the two documents. In formula terms:
If swapping document A and document B would produce a large improvement in NDCG, the lambda for that pair is large, and training pushes hard to correct the ordering. If swapping them would barely change NDCG, the lambda is small, and the training signal is weak. The model, therefore, focuses its learning effort on the swaps that matter most to the metric users who actually care about.
The insight that makes this work: Burges showed empirically that the lambdas defined this way, even though they are not the gradients of any explicit loss function, behave like proper gradients for gradient descent. The model trained with LambdaRank gradients consistently improved NDCG on held-out test queries, outperforming both RankNet and pointwise baselines.
LambdaMART improved this by replacing LambdaRank's neural network with gradient-boosted decision trees (specifically, the MART algorithm). Decision trees are faster to train and run than neural networks on structured feature data used in ranking (hundreds of numerical features such as PageRank scores, TF-IDF values, and freshness scores). LambdaMART combines the metric-aware gradient from LambdaRank with the speed of gradient boosting. A group of LambdaMART rankers won Track 1 of the 2010 Yahoo! Learning-to-Rank Challenge, making it the strongest baseline algorithm for web search ranking at that time.
| Column 1 | Column 2 | Column 3 | Column 4 |
| Algorithm | Architecture | Training objective | Key advance |
| RankNet (2005) | Two-layer neural net | Pairwise cross-entropy loss | First neural pairwise ranker trained with gradient descent |
| LambdaRank (2006) | Neural net | Lambdas scaled by NDCG delta | NDCG-aware gradients without a differentiable NDCG surrogate |
| LambdaMART (2010) | Gradient-boosted trees | Same lambdas as LambdaRank | Decision tree efficiency + metric-aware gradients |
Html Block
How Does LTR Fit into Google’s Actual Architecture?
The 2024 DOJ antitrust trial included sworn testimonies from Google engineers that gave the clearest public view of Google’s ranking system so far. The reality is more complex than the simple story that “ML replaced hand-crafted formulas.”[5]Source 5Goodwin, D. (2025, May 13). The ABCs of Google ranking signals: What top search engineers revealed. Search Engine Land.View source ↗[6]Source 6Schwartz, B. (2025, May 13). Google says Navboost is not machine learning. Search Engine Roundtable.View source ↗
Google engineer HJ Kim testified that the vast majority of Google’s ranking signals are hand-crafted, meaning engineers analyze data, choose a sigmoid or similar function, and manually set the threshold. This approach is preferred for most signals because “if anything breaks, Google knows what to fix.” Machine learning is harder to interpret when it goes wrong.
The ML-based systems that do exist inside Google’s ranking stack include RankBrain (Google’s first ML ranking component, introduced in 2015) and DeepRank, an LLM-based system built on BERT-style architecture that decomposes complex signals. Navboost, Google’s click-based signal system aggregating user interactions over 13 months of data, is not an ML system. It is described by former Google Distinguished Engineer Eric Lehman as “just a big table” that maps queries to documents and their aggregated click counts.
The correct mental model for how LTR fits into Google’s pipeline is therefore not replacement but combination:
- Hundreds of signals are computed at indexing time and query time (PageRank, TF-IDF, entity match, freshness, structured data presence, mobile-friendliness).
- Most of these signals are hand-crafted with manually tuned thresholds.
- A learned ranking model (using LTR principles) takes those signals as input features and learns the optimal combination for each query type.
- ML-based components like RankBrain and DeepRank contribute additional features or post-processing layers.
- Click-based systems like Navboost provide usage signals that effectively serve as implicit human judgments, similar to the labeled training data LTR models use.
The LTR framework does not make individual signals unnecessary. It makes individual signal weight-tuning unnecessary. The weights are learned. The signals still must be computed and fed to the model.
What Does Learning-to-Rank Mean for Content Strategy?
LTR has three direct implications for how content should be built and positioned.[7]Source 7Murrell, T. (2025, July 30). LambdaMART explained: The workhorse of learning-to-rank. Shaped.View source ↗
First: no single signal can be gamed in isolation. A hand-crafted formula with known weights is, in principle, gameable: maximize the highest-weighted signals, and the formula favors your page. An LTR model weighs signals differently depending on query context. A page that has high PageRank but poor entity coverage and thin content will not receive the same benefit from its PageRank on a long-tail informational query that it would on a navigational query. The model learns that PageRank assigns different relevance scores to different query types. Optimizing one signal without the others produces diminishing returns faster than it did with fixed-formula ranking.
Second: click and behavioral signals are real ranking features. LTR models are trained on relevance judgments from human raters and from aggregated user behavior. When Navboost records that users who see your page for a given query consistently click it and stay there, that signal feeds directly into the system that the ranking model was trained to satisfy. A page that generates strong behavioral signals reinforces its ranking position. A page with high technical authority scores but poor user satisfaction metrics is sending conflicting signals to a model trained to maximize user-judged relevance.
Third: content quality has a measurable, trainable target. Human rater guidelines, which Google publishes (the Search Quality Rater Guidelines document), define exactly what “relevant” and “high quality” mean in the training data that LTR models learn from. Those guidelines emphasize expertise, authoritativeness, trustworthiness, and clear, accurate answers to user intent. The guidelines are not decorative. They describe what the training labels mean, and the ranking model is optimizing for them. Content that consistently satisfies the guidelines aligns with the model’s training objective.
Key Takeaways
Key takeaways from this article:
- Learning-to-rank is the ML discipline that learns to automatically combine hundreds of ranking signals from labeled training data, eliminating the need for engineers to manually tune every signal weight for every query type.
- Three architectural approaches: pointwise (score each document independently; any classifier works), pairwise (compare document pairs; explicitly optimizes relative order), listwise (optimize the entire ranked list; targets NDCG). Listwise approaches outperform the others on standard benchmarks but require solving the NDCG non-differentiability problem.
- RankNet (2005) by Burges et al. at Microsoft Research was the first neural pairwise LTR model trained with gradient descent. It used a probabilistic cost function over document pairs and demonstrated that a learned neural ranker could outperform a hand-crafted formula on real commercial search engine data.
- NDCG is non-differentiable because it uses discrete rank positions. You cannot directly use gradient descent on NDCG. LambdaRank solved this by defining a proxy gradient (the lambda) as the RankNet pairwise gradient scaled by the NDCG delta from swapping each document pair, making the metric optimizable without a differentiable surrogate.
- LambdaMART combines LambdaRank’s metric-aware gradients with gradient-boosted decision trees. It won the 2010 Yahoo! LTR Challenge and remains a production baseline at major search engines.
- Google’s actual architecture (per 2024 DOJ testimony) blends hand-crafted signals with ML layers. The majority of signals are still manually engineered. RankBrain and DeepRank are the primary ML-based components. Navboost is a large aggregated click table, not an ML system.
- For content strategy: LTR means no single signal can be isolated and gamed, because the model weights signals differently based on query context. Behavioral signals (clicks, dwell time) are genuine ranking features, not just nice-to-haves. Content that satisfies the human rater guidelines used to generate training labels aligns with the model’s training objective.
Your next step: Read Google’s Search Quality Rater Guidelines (publicly available, search “Google Search Quality Rater Guidelines PDF”). Focus on the definitions of “Expertise, Authoritativeness, and Trustworthiness” in Section 3 and “Needs Met” in Section 4. These definitions describe precisely what the training labels for Google’s LTR models represent. Content that consistently satisfies those definitions is content that produces a positive training signal for the ranking model at every query where it appears.
Coming up next: Article 1.9 covers the evaluation metrics that learning-to-rank models are trained to optimize: Mean Average Precision, Mean Reciprocal Rank, and NDCG. Understanding how these metrics work is the final piece of the foundation. Every A/B test at Google, every LambdaRank training objective, and every search quality improvement initiative is measured against one of these metrics. Article 1.9 explains what each one captures, what each one misses, and why NDCG became the dominant standard for web search evaluation.
Sources
-
Burges, C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., & Hullender, G. (2005). Learning to rank using gradient descent. In Proceedings of the 22nd International Conference on Machine Learning (ICML '05) (pp. 89–96). ACM.
-
Burges, C. J. C. (2010). From RankNet to LambdaRank to LambdaMART: An overview. (Technical Report No. MSR-TR-2010-82). Microsoft Research.
-
Liu, T.-Y. (2009). Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3(3), 225–331.
-
Wikipedia contributors. (n.d.). Learning to rank. Wikipedia, The Free Encyclopedia.
-
Goodwin, D. (2025, May 13). The ABCs of Google ranking signals: What top search engineers revealed. Search Engine Land.
-
Schwartz, B. (2025, May 13). Google says Navboost is not machine learning. Search Engine Roundtable.
-
Murrell, T. (2025, July 30). LambdaMART explained: The workhorse of learning-to-rank. Shaped.





