
RankLens Entities: AI Brand Visibility (Open-Source)

Measuring how LLMs recommend brands & sites — with open data, code, and a reproducible method

Summary: We ran 15,600 samples across 52 categories/locales to study how LLMs (ChatGPT, Claude, Gemini, etc.) mention or rank brands/sites. We open-sourced the method (Entity-Conditioned Probing + Resampling), data, and code so teams can audit, reproduce, and extend.

Why this matters

LLMs increasingly act like recommenders. For queries like “best [tool]” or “top [service]”, users often see a small set of brands/sites. If you’re building AI products or care about brand visibility, you need answers to:

  • Which brands/sites appear most often?
  • How stable are those results across samples, locales, and models?
  • How reliable is a “top-k” list derived from an LLM?

Our goal: provide a transparent, reproducible way to measure and discuss LLM “recommendations.” – Jim Liu, CEO, SEO Vendor

What is RankLens Entities?

A public, reproducible suite for measuring LLM brand/site visibility:

  • Method: Entity-Conditioned Probing (ECP) with multi-sampling + half-split consensus to estimate reliability (overlap@k); a minimal probing sketch follows the diagram below.
  • Data: 15,600 samples over 52 categories/locales, with parsed entities and metadata.
  • Code: Scripts to aggregate lists, compute consensus top-k, and evaluate reliability.
[Figure: ECP sampling + half-split consensus diagram]
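
For concreteness, here is a minimal sketch of the probing and parsing steps. It is an assumed illustration, not the study's exact setup: `query_model` stands in for whatever LLM client you use, and the prompt wording, sample count, and the naive numbered-list parser are all placeholders.

```python
import re

# Hypothetical sketch: `query_model` stands in for a real LLM client;
# prompt text and sample count are illustrative only.
PROMPT_TEMPLATE = "What are the best {category} for users in {locale}? List the top options."

def probe(query_model, category, locale, n_samples=20):
    """Send the same standardized prompt n_samples times; collect raw responses."""
    prompt = PROMPT_TEMPLATE.format(category=category, locale=locale)
    return [query_model(prompt) for _ in range(n_samples)]

def parse_entities(response_text):
    """Naively pull entity names out of a numbered-list style response."""
    return [m.group(1).strip()
            for m in re.finditer(r"^\s*\d+[.)]\s*(.+)$", response_text, re.M)]

# Usage with a stub model (swap in a real API call):
stub = lambda prompt: "1. Acme CRM\n2. ExampleSoft\n3. acme.com"
lists = [parse_entities(r) for r in probe(stub, "CRM tools", "en-US", n_samples=3)]
```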

Key findings (high level)

  • Reliability varies by category & locale. Some categories show high overlap@k; others are volatile.
  • Alias risk is real. The same brand can appear under multiple strings (e.g., “Acme CRM”, “acme.com”). Normalization helps but isn’t perfect.
  • Prompt sensitivity exists. Small phrasing changes can shift inclusion/ordering—use multi-sampling and track prompts.
  • Model drift happens. Time-stamp runs and avoid comparing apples to oranges across model updates.

Who it’s for

  • AI/ML engineers: a reproducible eval harness for LLM list-style answers.
  • SEO/marketing teams: visibility tracking for brands/sites in LLM responses.
  • Researchers: open artifacts + method to extend, critique, or benchmark.

What’s included

  • Preprint: full method + limitations and guidance.
  • Repository: parsing, consensus, reliability metrics, export utilities.
  • Dataset: per-prompt lists, JSONL results, schema docs.
  • Examples: ready-to-run code for overlap@k and consensus top-k.


How it works (quick overview)

  1. Probe each (category, locale, model) with standardized prompts.
  2. Parse responses into entity lists (brands/sites).
  3. Resample & split: randomly divide the sampled lists into two halves.
  4. Consensus: compute top-k for each half via frequency aggregation.
  5. Reliability: measure overlap@k between the halves.
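
Steps 3–5 reduce to a few lines. Below is a minimal sketch, assuming each sample has already been parsed into a list of entity strings; the set-based frequency aggregation, arbitrary tie-breaking, and single random split are simplifying assumptions, so check the repo for the exact implementation (e.g., it may average over many splits).

```python
import random
from collections import Counter

def consensus_top_k(samples, k):
    """Frequency-aggregate entity lists (one list per sample) into a top-k."""
    counts = Counter(entity for lst in samples for entity in set(lst))
    return [entity for entity, _ in counts.most_common(k)]  # ties break arbitrarily

def overlap_at_k(samples, k, seed=0):
    """Half-split consensus: shuffle samples, split in two, compare top-k sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    top_a = set(consensus_top_k(shuffled[:half], k))
    top_b = set(consensus_top_k(shuffled[half:], k))
    return len(top_a & top_b) / k  # 1.0 = identical top-k sets, 0.0 = disjoint

# Toy example: 4 parsed samples, k=2; in practice, average over many splits.
samples = [["acme", "examplesoft"], ["acme", "widgetco"],
           ["acme", "examplesoft"], ["examplesoft", "acme"]]
print(overlap_at_k(samples, k=2))
```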

If overlap@k is high, the derived top-k is likely stable for that setup; if it's low, treat any single top-k as noisy.

Limitations (read before using)

  • Not a ranking of “truth,” but a measurement of what LLMs surface under given conditions.
  • Aliases & normalization: we include baseline rules; entity resolution is imperfect.
  • Prompt & temperature sensitivity: track your configurations; use multi-sampling.
  • Locale effects: some markets/categories are inherently more volatile.
  • Model updates: repeat runs over time to monitor drift.

Get started

  1. Skim the preprint for method + caveats.
  2. Clone the repo and run the overlap@k example.
  3. Load the dataset (CSV/JSONL) and reproduce a chart (a loading sketch follows this list).
  4. Add your category/locale and compare reliability.
  5. Open a PR with improvements (normalizers, metrics, charts).
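
As an illustration of step 3, here is a sketch for loading the JSONL release. The file name and field names (`category`, `locale`, `model`, `entities`) are assumptions; consult the dataset's schema docs for the actual layout.

```python
import json
from collections import defaultdict

# Hypothetical file and field names -- see the dataset's schema docs.
records = []
with open("ranklens_entities.jsonl", encoding="utf-8") as f:
    for line in f:
        records.append(json.loads(line))

# Group parsed entity lists by experimental setup for downstream analysis,
# e.g., to feed them into an overlap@k computation per (category, locale, model).
by_setup = defaultdict(list)
for r in records:
    by_setup[(r["category"], r["locale"], r["model"])].append(r["entities"])
```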


FAQ

Is this a product?
No—the study and artifacts are open for research and transparency; they inform products but aren’t a product themselves.

Can I cite this?
Yes. Cite the Zenodo preprint; include the GitHub repo and dataset link.

What’s “alias risk”?
Multiple strings referring to the same brand/site (e.g., name variants, URL vs. brand). We apply normalization; you can extend it.
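
To make the failure mode concrete, a baseline normalizer might look like the sketch below. These rules are assumed for illustration and are not the repo's actual normalization.

```python
import re
from urllib.parse import urlparse

def normalize_entity(raw: str) -> str:
    """Collapse common surface variants (case, URL forms, legal suffixes)."""
    s = raw.strip().lower()
    # Reduce URL-like strings to their host.
    if "://" in s or s.startswith("www."):
        s = urlparse(s if "://" in s else "http://" + s).netloc or s
        s = s.removeprefix("www.")
    s = re.sub(r"[^\w\s.&-]", "", s)                     # drop stray punctuation
    s = re.sub(r"\s+(inc|llc|ltd)\.?$", "", s).strip()   # drop legal suffixes
    return s

# Note: "Acme CRM" and "acme.com" still normalize to different keys;
# rules like these need an alias lookup table on top to merge such pairs.
print(normalize_entity("https://www.Acme.com"))  # -> "acme.com"
```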

Can I use this commercially?
Check the licenses in the repo/dataset. Contributions welcome.