Information retrieval (IR) is the process of finding documents in a large collection that meet a person’s information need. A search engine is essentially an IR system operating at scale. Understanding three IR ideas (precision, recall, and relevance) explains why some SEO methods work and others don’t. This article explains what IR is, where it came from, how relevance is measured, and what the precision-recall tradeoff means for every page you create.
Information retrieval is the study of finding and returning relevant items from a large, unorganized collection of documents based on a user’s search. These items can be web pages, academic papers, legal contracts, or product descriptions. The goal is always to meet the user’s information need, the real question or task behind the search, which is often broader than the exact words they type.
In one sentence: IR is the study of how to order a group of documents based on how likely each is to meet a specific information need.
This definition, refined across six decades of academic research, is exactly what Google, Bing, and every other search engine implement at massive scale. The three core vocabulary terms every IR system uses are:
| Term | What it means in IR | What it means for SEO |
| --- | --- | --- |
| Query | The words or phrase a user submits | The search term a page is optimized for |
| Document | Any unit of content in the collection | A web page, article, or URL |
| Relevance | How well a document satisfies the query’s information need | Whether your page actually answers what the searcher wanted |
The word “relevance” does the heaviest lifting in all of IR. It is not a switch that is either on or off. Relevance is a scored, graded property that search engines continuously calculate, and the way they do so has changed dramatically since 1958.
In the late 1950s, scientific publishing was growing faster than any indexing system could manage. Physicists, engineers, and medical researchers missed important papers not because they didn’t exist, but because the indexing systems back then, physical card catalogs and hand-coded summaries, couldn’t reliably match searches to documents.
A librarian named Cyril Cleverdon at the College of Aeronautics in Cranfield, UK, conducted the first formal experiments to answer one question: which indexing method finds the most relevant documents? Natural language words? Controlled vocabulary headings? Chemical Abstracts Notation?
The Cranfield Experiments, done in two parts from 1958 to 1966, used 1,398 aeronautical engineering summaries, 225 test searches, and human-made relevance ratings for each query-document pair. These experiments created the field’s two key measurement tools: precision and recall.
Here is what most IR articles get wrong: Cleverdon was not trying to judge search engines. He was comparing types of indexing methods. Precision and recall were created to measure how well indexing methods worked, not ranking algorithms. They became the standard for all retrieval systems because they proved very useful.
What the Cranfield Data Actually Showed
The experiments found something surprising. More specific controlled vocabularies improved precision but lowered recall. Looser natural language terms improved recall but lowered precision. No single indexing method maximized both at once. This is where the precision-recall tradeoff comes from.
The Cranfield 1400 corpus, published in machine-readable form in 1967, became the primary benchmarking dataset for IR research through the 1970s at a time when the IBM System/360 Model 50 had between 64 and 512 kilobytes of main memory.
What Are Precision and Recall, and Why Do They Create a Tradeoff?
Precision measures the share of returned documents that are actually relevant. If a search engine shows 10 results and 7 are relevant, precision is 70%.
Recall measures the share of all relevant documents in the collection that were actually found. If the collection holds 50 relevant documents and the search engine retrieves 20 of them, recall is 40%.
Neither number is useful alone. You can get 100% recall by returning every document in the collection. You can get 100% precision by returning just one document, as long as it is relevant.
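The two definitions can be checked with a few lines of Python. This is a minimal sketch, not a standard library API: the function name `precision_recall` and the 100-document collection are assumptions made for illustration. It reproduces the 70% precision example and the two degenerate strategies described above.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually returned
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical collection: 100 documents, of which the first 50 are relevant.
collection = [f"doc{i}" for i in range(100)]
relevant = set(collection[:50])

# 10 results shown, 7 of them relevant: precision is 7/10.
shown = collection[:7] + collection[50:53]
p, r = precision_recall(shown, relevant)

# Degenerate strategy 1: return every document (perfect recall, poor precision).
p_all, r_all = precision_recall(collection, relevant)

# Degenerate strategy 2: return a single relevant document (perfect precision).
p_one, r_one = precision_recall(collection[:1], relevant)

print(p, r)          # precision and recall for the 10-result page
print(p_all, r_all)  # everything returned
print(p_one, r_one)  # one relevant document returned
```

In this toy setup the 10-result page has high precision but finds only 7 of the 50 relevant documents, which is why neither number is meaningful without the other.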
The tradeoff appears when you try to improve one metric:
- Making the query broader (by adding synonyms or loosening match rules) increases recall: more relevant documents appear. But irrelevant documents also come in, lowering precision.
- Making the query stricter (requiring exact phrase matches, adding filters) increases precision; results stay focused. But relevant documents worded differently get missed, lowering recall.
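One way to see the tradeoff numerically is to take a ranked result list and widen the cutoff step by step. The sketch below uses the cutoff `k` as a stand-in for broadening the query; the relevance labels and the count of five relevant documents are invented for illustration.

```python
# Hypothetical ranked results, best first. True means the result is relevant.
ranked = [True, True, False, True, False, False, True, False, False, False]
total_relevant = 5  # assumption: one relevant document never makes the top 10

def precision_recall_at_k(ranked, total_relevant, k):
    """Precision and recall when only the top k results are returned."""
    hits = sum(ranked[:k])
    return hits / k, hits / total_relevant

curve = []
for k in (1, 3, 5, 10):
    p, r = precision_recall_at_k(ranked, total_relevant, k)
    curve.append((k, p, r))
    print(f"k={k:2d}  precision={p:.2f}  recall={r:.2f}")
```

With these labels, precision falls from 1.00 to 0.40 as the cutoff grows from 1 to 10, while recall rises from 0.20 to 0.80. Real ranked lists are noisier, but the general shape is the same.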
The Brain Surgery Analogy
Think of a brain surgeon removing a tumor. A surgeon who removes only tissue that is certainly cancerous has high precision but low recall: some cancer cells remain. A surgeon who removes all possibly cancerous tissue has high recall but low precision: healthy tissue is removed too. Neither extreme is good. The goal is the best balance between the two.
Search engines face the same tradeoff on every query. The question is: what is the right operating point for web search?
Where Google Sits on the Tradeoff Curve
Search engines are purposely designed to favor precision over recall. The reason is simple: showing 10 highly relevant results is much more useful to a user than showing every relevant page in a collection of hundreds of billions. A user who finds their answer in the top 3 results does not care that 4,000 other relevant pages were not shown.
This directly affects SEO content. A page that focuses deeply on one topic scores better on precision than a page that tries to cover many related queries lightly. Depth on one information need beats covering many shallowly.
Why Relevance Is Not a Binary Property
You might first think of relevance as a yes/no label: a page either answers the query or it does not. This is how the earliest Boolean retrieval systems worked. A document either had the query terms (relevant) or it did not (irrelevant).
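A Boolean matcher of this kind fits in a few lines. The sketch below is a simplified illustration that assumes AND semantics and whitespace tokenization; the documents and the function name `boolean_and` are hypothetical.

```python
# Hypothetical mini-collection.
docs = {
    "d1": "precision and recall in information retrieval",
    "d2": "recall of defective products",
    "d3": "measuring retrieval precision",
}

def boolean_and(query, docs):
    """Return ids of documents that contain every query term (yes/no relevance)."""
    terms = set(query.lower().split())
    return [doc_id for doc_id, text in docs.items()
            if terms <= set(text.lower().split())]

print(boolean_and("precision recall", docs))
print(boolean_and("recall", docs))
```

The second query shows the weakness of a binary label: the product-recall page matches "recall" exactly as well as the IR page does, and nothing in the output says which one is the better answer.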
Modern search engines use multi-level relevance scoring. The key distinction is between three levels:
| Relevance level | What it signals | How search engines detect it |
| --- | --- | --- |
| Topically relevant | The document is about the same subject | Keyword presence, entity matching |
| Informationally relevant | The document addresses the information need | Semantic similarity, query intent matching |
| Purposefully relevant | The document satisfies the user’s actual goal | Click behavior, dwell time, task completion signals |
The third level, purposeful relevance, is what Google’s NavBoost system tries to measure via engagement signals. A page can be topically and informationally relevant but fail at the purposeful level, for example, a guide that explains a concept but does not help the user apply it. Google’s quality rater guidelines use the label “Needs Met” to describe this purposeful layer, scoring results on a 5-point scale from “Fully Meets” to “Fails to Meet.”
The path from the Cranfield experiments to Google goes through one researcher: Gerard Salton of Cornell University. In 1968, he defined information retrieval as the field concerned with representing, storing, organizing, and accessing information. Salton led the creation of the SMART Retrieval System, the first large-scale IR system to use mathematical models rather than controlled-vocabulary lists.
In 1975, Salton, Wong, and Yang published a paper that gave every modern embedding model its basic idea: the Vector Space Model [2]. The paper, “A vector space model for automatic indexing” (Communications of the ACM, 18(11), 613-620), proposed that every document and query could be represented as a vector of term weights in a multi-dimensional space. Relevance is measured by the angle between the two vectors: a small angle means high relevance; a large angle means low relevance.
This is not just old academic history. When Google talks about “vector similarity search” and “nearest neighbor search” in its technical documents, it is describing the same geometric idea Salton explained in 1975. The embedding models behind semantic search, BERT, and AI Overviews all build on that geometric idea.
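The angle comparison can be sketched with cosine similarity over simple term-count vectors. This is an illustration under stated assumptions, not the weighting scheme from the 1975 paper: raw term counts stand in for term weights, and the helper names `term_vector` and `cosine_similarity` are made up for this example.

```python
import math
from collections import Counter

def term_vector(text):
    """Represent text as a sparse vector of raw term counts (an assumed weighting)."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors: 1.0 = same direction, 0.0 = no overlap."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

query = term_vector("vector space model")
doc_a = term_vector("the vector space model represents documents as vectors")
doc_b = term_vector("brain surgery removes tumor tissue")

sim_a = cosine_similarity(query, doc_a)  # shares three terms with the query
sim_b = cosine_similarity(query, doc_b)  # shares no terms with the query
print(sim_a, sim_b)
```

A higher cosine means a smaller angle, so `doc_a` would rank above `doc_b` for this query. Modern embedding models replace the term-count vectors with learned dense vectors, but the comparison step has the same shape.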
The Three-Stage Pipeline That Replaced Manual Indexing
Every search engine built since the 1990s uses a three-step process taken directly from IR research:
- Acquisition: discovering and fetching documents (crawling)
- Indexing: analyzing documents and building a data structure for fast retrieval (the inverted index)
- Ranking: scoring indexed documents against a query and returning the top-k results
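The three stages above can be sketched end to end in a toy form. Everything here is a simplifying assumption: acquisition is replaced by a hard-coded dictionary of pages, the URLs are placeholders, and the ranking score is a plain sum of matching term counts rather than anything a production engine would use.

```python
from collections import Counter, defaultdict

# Acquisition stand-in: a crawler would fetch these pages; here they are hard-coded.
pages = {
    "url1": "precision and recall measure retrieval quality",
    "url2": "recall measures coverage of relevant documents",
    "url3": "brain surgery analogy",
}

def build_inverted_index(pages):
    """Indexing: map each term to the documents that contain it, with counts."""
    index = defaultdict(dict)  # term -> {doc_id: term count}
    for doc_id, text in pages.items():
        for term, count in Counter(text.lower().split()).items():
            index[term][doc_id] = count
    return index

def rank(query, index, k=2):
    """Ranking: score documents by summed counts of query terms, return the top k."""
    scores = Counter()
    for term in query.lower().split():
        for doc_id, count in index.get(term, {}).items():
            scores[doc_id] += count
    return [doc_id for doc_id, _ in scores.most_common(k)]

index = build_inverted_index(pages)
print(rank("precision recall", index))
```

The inverted index is what makes the ranking step fast: a query only touches the postings for its own terms instead of scanning every document.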
Knowing that Google is an IR system working at web scale is not just a theory. It is practical. Every ranking factor, from anchor text to topical authority to time spent on a page, exists to solve one of the three basic IR problems: representing, storing, or finding relevant information.
What Does IR Theory Say About Why Keyword Stuffing Never Worked?
The short answer is that keyword stuffing increases word frequency without improving relevance at any of the three levels mentioned above. It does not make a page more informative. It does not make a page more useful. It worsens the user experience, which lowers purposeful relevance signals.
What surprises most beginners is that even early IR systems, before click data existed, were designed to resist this kind of repetition. The IDF (Inverse Document Frequency) component of TF-IDF downweights words that appear in many documents. A word that appears on every web page has almost no distinguishing power. Keyword stuffing raises a term’s frequency on one page but does nothing to make the term rarer across the collection, so repeating an already common, low-IDF term adds little relevance weight.
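A small worked example makes the IDF effect concrete. The sketch assumes the plain `log(N / df)` form of IDF and raw counts for TF; real systems typically use smoothed variants, so treat the numbers as illustrative only. The three documents are invented.

```python
import math
from collections import Counter

# Hypothetical collection: document "a" stuffs the word "shoes".
docs = {
    "a": "best shoes best shoes best shoes buy shoes",
    "b": "how to choose running shoes for flat feet",
    "c": "shoes care guide leather cleaning",
}

def idf(term, docs):
    """Inverse document frequency, using the unsmoothed log(N / df) form (an assumption)."""
    df = sum(1 for text in docs.values() if term in text.split())
    return math.log(len(docs) / df) if df else 0.0

def tf_idf(term, doc_id, docs):
    """Raw term count in the document multiplied by the term's IDF."""
    tf = Counter(docs[doc_id].split())[term]
    return tf * idf(term, docs)

stuffed = tf_idf("shoes", "a", docs)     # "shoes" occurs in every document
specific = tf_idf("running", "b", docs)  # "running" occurs in only one document
print(stuffed, specific)
```

Under this formula, "shoes" appears in all three documents, so its IDF is zero and four repetitions in document "a" still score zero, while a single occurrence of the rarer word "running" earns a positive weight.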
The reason stuffing seemed to work briefly in the late 1990s is that early web search engines had simple versions of IR theory. They had not yet added IDF weighting, the link graph, or user behavior signals that IR researchers had recommended for 30 years. Google’s rise was partly an IR research team using the field’s standard tools at scale.
Key Takeaways
- Information retrieval is the science of finding relevant documents in a large collection given a user’s information need. Search engines are IR systems at web scale.
- Precision and recall were invented in the Cranfield Experiments (1958-1966) to evaluate index languages, not ranking algorithms. They became the universal standard for IR evaluation.
- The precision-recall tradeoff means that improving one metric typically degrades the other. Web search engines are designed to favor precision, because 10 highly relevant results matter more than exhaustive coverage.
- Relevance has three levels: topical, informational, and purposeful. SEO content that achieves only topical relevance falls short of the level search engines actually optimize for.
- Gerard Salton’s Vector Space Model (1975) is the direct conceptual ancestor of modern semantic search and embedding models. Understanding it makes BERT and AI Overviews intuitive rather than magical.
Your next step: Read the original Salton, Wong, and Yang (1975) paper abstract on CACM to see how the vector space geometry is described in its own language. You do not need the full mathematics yet. The abstract alone will change how you read technical search documentation.
Coming up next: Article 1.2 covers the Vector Space Model in full: how documents and queries become vectors, what the cosine similarity calculation looks like, and why this 1975 model is still the conceptual foundation of every modern embedding search system.