On the Theoretical Limitations of Embedding-Based Retrieval
Neural Embedding Models
- Dense retrieval has gained a lot of traction in the past few years.
- In the era of LLMs, dense retrieval is used for more complicated queries involving conditioning or reasoning.
- The authors focus on finding the limitations of embedding-based retrieval by establishing a theoretical connection between the embedding dimension and the sign rank of the query-relevance (qrel) matrix.
Representational Capacity of Vector Embeddings
Formalization
- \(m\) queries and \(n\) documents, with a ground-truth relevance matrix \(A \in \{0,1\}^{m \times n}\) where \(A_{ij} = 1\) if and only if document \(j\) is relevant to query \(i\).
- Vector embedding models map each query \(i\) to a \(d\)-dimensional vector \(u_i\) and each document \(j\) to a \(d\)-dimensional vector \(v_j\).
- Relevance is modeled by the dot product \(u_i^T v_j\), with the goal that relevant documents score higher than irrelevant ones. The scores of all queries against all documents form a score matrix \(B = U^T V\).
- The smallest embedding dimension \(d\) that can realize a given score matrix is the rank of \(B\). The question therefore becomes: what is the minimum rank of a score matrix \(B\) that correctly orders documents according to the relevance specified in matrix \(A\)?
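As a quick sketch of this setup (toy sizes of my own choosing, using numpy for illustration), the score matrix is just a low-rank product of the query and document embedding matrices, so its rank can never exceed the embedding dimension:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, d = 5, 8, 3                  # toy sizes: queries, documents, embedding dim
U = rng.normal(size=(d, m))        # query embeddings as columns of U
V = rng.normal(size=(d, n))        # document embeddings as columns of V
B = U.T @ V                        # score matrix: B[i, j] = u_i . v_j
print(B.shape, np.linalg.matrix_rank(B))  # (5, 8), rank at most d = 3
```

Whatever ordering constraints \(A\) imposes, any \(B\) built this way has rank at most \(d\), which is exactly why the rank definitions below matter.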
If we treat the relevance matrix \(A\) as binary for simplicity, then based on the above we can define the following notions of rank, which give us theoretical bounds. Bounds on what? Let us see.
- Row-wise order-preserving rank (rop): given a matrix \(A \in \mathbb{R}^{m \times n}\), the row-wise order-preserving rank of \(A\) is the smallest integer \(d\) such that there exists a rank-\(d\) matrix \(B\) that preserves the relative order of entries in each row of \(A\).
- Row-wise thresholdable rank (rt): the minimum rank of a matrix \(B\) for which there exist row-specific thresholds \(\tau_i\) such that for all \(i, j\), \(B_{ij} > \tau_i\) if \(A_{ij} = 1\) and \(B_{ij} < \tau_i\) if \(A_{ij} = 0\).
- Globally thresholdable rank (gt): the minimum rank of a matrix \(B\) for which there exists a single threshold \(\tau\) such that \(B_{ij} > \tau\) if \(A_{ij} = 1\) and \(B_{ij} < \tau\) if \(A_{ij} = 0\).
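The row-wise thresholdable condition is easy to check mechanically: a per-row threshold \(\tau_i\) exists exactly when every relevant score in that row beats every irrelevant one. A small sketch (my own toy matrices, not from the paper):

```python
import numpy as np

def row_wise_thresholdable(A, B):
    """Check whether per-row thresholds tau_i exist separating
    relevant (A == 1) from irrelevant (A == 0) scores in B."""
    ok = True
    for a_row, b_row in zip(A, B):
        pos, neg = b_row[a_row == 1], b_row[a_row == 0]
        # tau_i exists iff min relevant score > max irrelevant score
        ok &= (pos.size == 0 or neg.size == 0 or pos.min() > neg.max())
    return bool(ok)

A = np.array([[1, 1, 0],
              [0, 1, 1]])
B_good = np.array([[0.9, 0.8, 0.1],
                   [0.2, 0.7, 0.6]])
B_bad  = np.array([[0.9, 0.1, 0.8],   # an irrelevant doc outscores a relevant one
                   [0.2, 0.7, 0.6]])
print(row_wise_thresholdable(A, B_good), row_wise_thresholdable(A, B_bad))
# -> True False
```

The rt rank asks for the lowest-rank \(B\) passing this check; the gt rank additionally forces all rows to share one threshold.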
Propositions
Based on these definitions, the authors make two propositions.
- Proposition 1: for a binary matrix \(A\) of size \(m \times n\), \(\mathrm{rank}_{\text{rop}}\, A = \mathrm{rank}_{\text{rt}}\, A\).
- Proposition 2: if \(A\) is a binary matrix and \(1_{m \times n}\) is the all-ones matrix, then the sandwich inequality \(\mathrm{rank}_{\pm}(2A - 1_{m \times n}) - 1 \le \mathrm{rank}_{\text{rt}}\, A \le \mathrm{rank}_{\pm}(2A - 1_{m \times n})\) holds, where \(\mathrm{rank}_{\pm}\) is the sign rank. The sign rank of a matrix \(M \in \{-1,1\}^{m \times n}\) is the smallest integer \(d\) such that there exists a rank-\(d\) matrix \(B \in \mathbb{R}^{m \times n}\) whose entries have the same sign as those of \(M\).
You can prove these with pen and paper. But why do we care? The authors derive a number of consequences:
- It provides a lower and an upper bound on the embedding dimension required to exactly capture a given set of retrieval objectives. Given a binary relevance matrix \(A\), we need at least \(\mathrm{rank}_{\pm}(2A - 1_{m \times n}) - 1\) dimensions to capture the relationships in \(A\) exactly, and can always accomplish this in at most \(\mathrm{rank}_{\pm}(2A - 1_{m \times n})\) dimensions.
- For every dimension \(d\), there exists a binary relevance matrix which cannot be captured via \(d\)-dimensional embeddings, meaning no matter how large (but finite) you make your embedding dimension, there are always retrieval tasks too complex to be represented exactly.
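A tiny concrete instance of the upper-bound direction (my own toy example, not from the paper): for the \(2 \times 2\) identity relevance pattern, \(2A - 1\) happens to be rank 1, so a rank-1 score matrix thresholded at \(\tau = 0\) reproduces \(A\) exactly even though \(\mathrm{rank}(A) = 2\) — the thresholdable rank can sit strictly below the ordinary matrix rank.

```python
import numpy as np

A = np.eye(2)                      # each query has exactly one relevant doc
M = 2 * A - 1                      # [[1, -1], [-1, 1]]: second row = -1 * first row
print(np.linalg.matrix_rank(A), np.linalg.matrix_rank(M))  # -> 2 1

# M itself is a rank-1 matrix with the right sign pattern, so
# thresholding it at tau = 0 recovers A exactly
assert np.array_equal((M > 0).astype(float), A)
```

This is exactly the mechanism behind \(\mathrm{rank}_{\text{rt}}\, A \le \mathrm{rank}_{\pm}(2A - 1_{m \times n})\): any low-rank matrix matching the signs of \(2A - 1\) globally thresholds \(A\) at zero.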
Empirical Connection: Best Case Optimization
- Free embedding optimization: the embedding vectors themselves are free parameters optimized directly with gradient descent, with no encoder in between.
- Premise: If the free embedding optimization cannot solve the problem, real retrieval models will not be able to either.
- The authors create a random document matrix of size \(n\) and a random query matrix of size \(m\) with top-\(k\) relevant sets, both with unit vectors.
- Adam optimizer with the InfoNCE loss. The number of documents and queries is increased gradually until no further improvement is observed; they call this the critical-\(n\) point.
- They use \(k = 2\) and increase \(n\) by one for each \(d\) value until optimization breaks, and then fit a polynomial regression line to the data.
Empirical Connection: Real-World Datasets
- They argue that in existing datasets like QUEST, BrowseComp, etc., the space of queries used for evaluation is a very small sample of the space of potential queries. For example, the number of unique top-20 document sets that could be returned from the QUEST corpus is \(\binom{325\text{k}}{20} \approx 7.1 \times 10^{91}\). Thus, the 3k queries in QUEST cover only an infinitesimally small part of the qrel combination space.
- Proposes the LIMIT dataset for evaluation: 50K documents, 1000 queries. For each query, they use \(k = 2\) relevant documents, both for simplicity in instantiating the dataset and to mirror past work.
- Though it is hard to find the hardest qrel matrix via sign rank directly, they use the qrel pattern over the largest possible pool of documents (46 in this case), for which the number of top-\(k = 2\) combinations, \(\binom{46}{2} = 1035\), is just above the 1000 queries. There are two versions: the full one, where the corpus contains these 46 documents plus 49.95K non-relevant documents, and a small one containing only the 46 documents that are relevant to at least one of the 1000 queries.
- They evaluate a wide range of embedding models, including Qwen-3, Gemini, Mistral, MRL-based models, etc.
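The combinatorial claims above are easy to verify directly (a quick check, nothing beyond the stdlib):

```python
import math

# Number of unique top-20 document sets over the ~325K-document QUEST corpus
top20_sets = math.comb(325_000, 20)
print(f"about 1e{math.floor(math.log10(top20_sets))}")  # on the order of 1e91

# With 46 documents and k = 2 relevant docs per query, the number of
# possible relevance sets is just above the 1000 queries in LIMIT
print(math.comb(46, 2))  # -> 1035
```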
Results
- Models struggle badly: they fail to reach even 20% recall@100, and even on the small version (46 documents in total) they cannot solve the task as measured by recall@20.
- Models trained with diverse instructions do a bit better, but performance mainly depends on the embedding dimensionality (the higher, the better).
- The poor performance is not due to domain shift.
What about other kinds of models? Lexical, cross-encoder?
- Sparse models can be thought of as single-vector models with very high dimensionality (on the order of the vocabulary size). This dimensionality helps models like BM25 avoid the problems of the neural embedding models.
- Multi-vector models are better than other neural models, but they are not widely used for instruction-following or reasoning-based tasks.
- Cross-encoders are expensive but much better; for example, the Gemini reranker can solve all 1000 queries in a single pass. The problem is that these models can't be used for first-stage retrieval when there are large numbers of documents.
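To make the sparse-model point concrete (a hypothetical raw term-frequency scorer of my own, not real BM25 weighting): a lexical model is effectively a single-vector model whose dimension equals the vocabulary size, so its sign-rank ceiling is far higher than any fixed-\(d\) dense model's.

```python
from collections import Counter

# Toy corpus and query; scoring is a dot product in |vocab|-dim space
docs = ["red apple", "green pear", "red pear"]
query = "red pear"

vocab = sorted({w for doc in docs for w in doc.split()})

def vec(text):
    """Raw term-frequency vector over the corpus vocabulary."""
    counts = Counter(text.split())
    return [counts[w] for w in vocab]

scores = [sum(q * t for q, t in zip(vec(query), vec(doc))) for doc in docs]
print(len(vocab), scores)  # -> 4 [1, 1, 2]
```

Here the "embedding dimension" grows automatically with the corpus vocabulary, which is the structural escape hatch dense fixed-dimension embeddings lack.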