The “RAG” That Says “I Don’t Know”
A public benchmark on Reecopedia, and why the industry is building AI the wrong way
Sam Altman said, months ago, that it is better for an AI to always answer — even if wrong — than to say “I don’t know.”
I disagree, completely.
This is, for me, the real barrier to mass AI adoption. Not the compute cost. Not the model size. The fact that some AI systems are optimized to produce plausible-sounding confidence where uncertainty should exist. In consumer search it is an annoyance. In regulatory compliance, medicine, legal analysis, and finance, it is a liability dressed up as a product.
I built Reecopedia on the opposite principle. I am proud of what the benchmark just showed.
The structural problem with most RAG systems
Retrieval-Augmented Generation has become the default architecture for enterprise AI assistants working on specialized corpora. The promise is simple: the system retrieves relevant documents, the language model synthesizes an answer grounded in those documents, citations are included.
In practice, there is a structural bias that is rarely discussed publicly. Most production RAG systems are tuned to optimize user satisfaction metrics, and “I don’t know” consistently lowers those scores. The result is an incentive to fill evidentiary gaps with plausible-sounding synthesis: fragments of retrieved context are fused into an answer that appears grounded but is not.
The user has no way to detect this. The system returns a confident paragraph, includes citations, and the reader assumes the citation validates the entire paragraph. It does not.
This is fabrication-by-composition, and it is the dominant source of hallucinations in production RAG deployed on specialized corpora. It is also, not coincidentally, what the Altman school of thought implicitly endorses: the AI must always answer, because silence is bad for engagement.
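To make the failure mode concrete, here is a minimal sketch of a sentence-level citation check, written under stated assumptions. The `Sentence` type, the passage ids, and the `unsupported_sentences` helper are hypothetical names chosen for illustration; they are not taken from Reecopedia. The sketch only verifies that each sentence points at a passage that was actually retrieved. It does not verify that the passage supports the claim, which is a harder problem.

```python
from dataclasses import dataclass, field


@dataclass
class Sentence:
    text: str
    # Ids of the retrieved passages this sentence cites (hypothetical format).
    cited_passages: list[str] = field(default_factory=list)


def unsupported_sentences(answer, retrieved_ids):
    """Return sentences that cite nothing, or cite only passages that were not retrieved."""
    retrieved = set(retrieved_ids)
    return [
        s for s in answer
        if not any(p in retrieved for p in s.cited_passages)
    ]


# Placeholder answer: one cited sentence, one uncited "bridge" sentence,
# and one sentence citing a passage that was never retrieved.
answer = [
    Sentence("Claim taken from passage p1.", ["p1"]),
    Sentence("Synthesis that bridges p1 and p2 without a citation."),
    Sentence("Claim attributed to a passage outside the context.", ["p9"]),
]
flagged = unsupported_sentences(answer, retrieved_ids=["p1", "p2"])
for s in flagged:
    print("UNSUPPORTED:", s.text)
```

A paragraph-level citation would hide the second and third sentences behind the first. Checking at sentence level is what exposes them.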
What we tested
We ran a public benchmark on Reecopedia — the EU regulatory RAG built for Digital Product Passport compliance, part of the Reeco ecosystem. Ten queries across three categories, each executed at two tiers (Intermediate and Expert). Twenty API calls in total.
Category A (4 queries) — Evidence deliberately non-existent.
Questions referencing articles, deliverables, and documents that do not exist in the corpus or in reality. Examples: “What does Article 47 of ESPR Regulation 2024/1781 require?” (the regulation has no such article at that level of detail). “According to JRC preparatory study JRC999888...” (the identifier is fabricated). “The 2025 ESPR delegated act for textiles at Annex VI Table 3...” (no such act has been published at this detail level).
Category B (3 queries) — Partial evidence.
Questions where the corpus covers some but not all of the requested information. The correct behavior is to distinguish explicitly what is covered from what is not.
Category C (3 queries) — Complete evidence (control).
Questions where the corpus contains the full answer. The system should answer with verifiable citations.
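The layout of the test set can be sketched in a few lines. This is a minimal illustration of the run matrix described above, not the published benchmark code. The query ids such as "A1" and the names `Query`, `build_queries`, and `build_run_plan` are assumptions made for the example. Only the category sizes and the two tier names come from the text.

```python
from dataclasses import dataclass
from itertools import product

# Expected behaviour per category, paraphrased from the category descriptions.
EXPECTED = {
    "A": "declare that the evidence is absent",
    "B": "separate what is covered from what is not",
    "C": "answer with verifiable citations",
}

# Category sizes as stated in the article: 4 + 3 + 3 = 10 queries.
CATEGORY_SIZES = {"A": 4, "B": 3, "C": 3}
TIERS = ("Intermediate", "Expert")


@dataclass(frozen=True)
class Query:
    query_id: str  # hypothetical identifier such as "A1"
    category: str  # "A", "B" or "C"


def build_queries():
    """Create one placeholder Query per slot in each category."""
    return [
        Query(f"{cat}{i}", cat)
        for cat, size in CATEGORY_SIZES.items()
        for i in range(1, size + 1)
    ]


def build_run_plan(queries):
    """Pair every query with every tier: one API call per pair."""
    return list(product(queries, TIERS))


queries = build_queries()
plan = build_run_plan(queries)
print(len(queries), len(plan))  # 10 20
```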
Results
20 out of 20 passing. Zero hallucinations across all three categories.
On every query referencing non-existent evidence, Reecopedia explicitly declared the absence. No fabricated article numbers. No invented thresholds. No citations to documents that do not exist.
Sample response to the fabricated Article 47 query: “The retrieved context does not contain the text of Article 47 of ESPR Regulation 2024/1781, nor any provision specifically requiring textile water footprint disclosure under that article number.”
The system then provides adjacent context that does exist — the ESPR general framework, publication date, scope — without attributing any of it to the non-existent Article 47. The distinction between “what the user asked for” and “what the corpus actually contains” is preserved explicitly.
On partial-evidence queries, the system separated covered from uncovered parts with explicit gap identification. On control queries, it answered with verifiable citations to specific documents, pages, and paragraphs.
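The pass criteria per category can be written down as an explicit rule. The sketch below is one hypothetical way to encode them, assuming a human reviewer labels each response first. The `Judgement` fields and the `passes` and `pass_rate` functions are illustrative names, and the sample labels are invented for the example rather than taken from the benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Judgement:
    """Reviewer labels for one response (all field names are hypothetical)."""
    category: str                   # "A", "B" or "C"
    declares_absence: bool          # states that the requested evidence is not in the corpus
    identifies_gaps: bool           # separates covered from uncovered parts
    has_verifiable_citations: bool  # cites specific documents, pages, paragraphs
    fabricated_items: int           # invented articles, thresholds or documents


def passes(j: Judgement) -> bool:
    # Any fabricated item counts as a hallucination, whatever the category.
    if j.fabricated_items > 0:
        return False
    if j.category == "A":
        return j.declares_absence
    if j.category == "B":
        return j.identifies_gaps
    if j.category == "C":
        return j.has_verifiable_citations
    raise ValueError(f"unknown category: {j.category}")


def pass_rate(judgements):
    """Return (number passed, number judged)."""
    passed = sum(passes(j) for j in judgements)
    return passed, len(judgements)


# Illustrative labels only, not the published benchmark data.
sample = [
    Judgement("A", True, False, False, 0),
    Judgement("B", False, True, True, 0),
    Judgement("C", False, False, True, 0),
    Judgement("A", True, False, False, 1),  # declared absence but also invented a threshold
]
print(pass_rate(sample))  # (3, 4)
```

Making fabrication an automatic failure in every category keeps the rule aligned with the claim being tested: a response that declares absence and then invents a detail anyway still fails.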
Why this matters
In regulatory compliance, the failure mode is not “user asked a question and got no answer.” The failure mode is “user asked a question, got a confident wrong answer, and made a business decision based on it.”
A brand that declares recycled content based on a hallucinated interpretation of ESPR faces enforcement risk. A law firm that drafts a client memo citing an invented article number faces malpractice exposure. A sustainability officer who approves a supplier claim based on a fabricated JRC threshold is personally accountable when the auditor arrives.
The 2028 ESPR enforcement window will not distinguish between “the AI was wrong” and “our decision was wrong.” It will only ask whether the declaration was substantiated by verifiable evidence.
The Altman framing — always answer, never say I don’t know — is structurally incompatible with this reality. It works for consumer search, where a wrong answer is a minor inconvenience. It fails in B2B verticals where a wrong answer is a signed commitment.
The methodology is public
The ten queries are published. The responses are reproducible. Anyone can run the same benchmark against any RAG system claiming to handle EU regulatory content. If competitors want to demonstrate the same property on the same test set, we welcome the comparison.
The benchmark artifacts will be published on GitHub as an open evaluation set for regulatory RAG systems. No gatekeeping, no proprietary framework. If the industry wants to build RAG for compliance, it can borrow the test set.
A note on what this is, and what it is not
I am not claiming zero hallucinations across all possible queries. I am claiming zero hallucinations on a specific, auditable test set covering the EU textile regulatory domain. This is a domain-specific, measurable property — not a universal marketing claim.
The distinction matters. As long as these results hold, I can say I have built a product that is both useful and sound.
Reecopedia is live at ia.reeco.eco. Native support for 32 languages, dynamic response capability in 110+. Intermediate at €20/month, Expert at €100/month, public tier free.
Built in Prato, Italy. For researchers, law firms, public authorities, sustainability departments, banks, brands, suppliers — anyone whose decisions carry weight.
Stefano Cipriani
Founder, Reeco · Expert Member CIRPASS-2 · JRC Registered Stakeholder
