SourceBench evaluates whether a generative engine cites high-quality web sources, not only whether it produces a fluent final answer.
SourceBench focuses on the quality of cited sources along dimensions such as:
- semantic relevance
- factual accuracy
- freshness
- objectivity / tone
- layout / ad density
- accountability
- transparency
- authority
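The dimensions above can be pictured as per-source scores that roll up into an aggregate. Below is a minimal sketch; the field names, the 0-1 scale, and the unweighted mean are illustrative assumptions, not SourceBench's actual schema or aggregation.

```python
# Hypothetical judged-source record. Field names and the 0-1 scale are
# assumptions for illustration; the real schema comes from the scoring scripts.
judged_source = {
    "url": "https://example.com/article",
    "scores": {
        "semantic_relevance": 0.9,
        "factual_accuracy": 0.8,
        "freshness": 0.6,
        "objectivity": 0.7,
        "layout": 0.8,
        "accountability": 0.5,
        "transparency": 0.7,
        "authority": 0.6,
    },
}

# One possible aggregate: an unweighted mean over all dimensions.
overall = sum(judged_source["scores"].values()) / len(judged_source["scores"])
print(round(overall, 3))  # -> 0.7
```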
This repository is the benchmark codebase. It contains:
- the fixed public query split for open evaluation
- source collection scripts
- source judging scripts
- metric computation scripts
- official submission validation and runner scripts
- split policy and official submission contract
The public leaderboard site should be hosted separately from this repository.
Repository layout:

```
data/queries/
  sourcebench_public_queries_v1.csv
leaderboard/
  QUERY_SPLIT_POLICY.md
  OFFICIAL_SUBMISSION_CONTRACT.md
  README.md
  examples/
src/source-collection/
src/content-scoring/scripts/
src/evaluation/
requirements.txt
```
Install dependencies:

```shell
pip install -r requirements.txt
```

Run the public evaluation pipeline:

```shell
python src/source-collection/get_urls.py \
  --input path/to/public_queries.jsonl \
  --output output/public_urls.json \
  --model YOUR_MODEL_NAME \
  --openai-base-url YOUR_OPENAI_COMPATIBLE_ENDPOINT \
  --openai-api-key YOUR_GE_API_KEY \
  --ai-model-name YOUR_MODEL_NAME

python src/source-collection/collect_sources_from_urls.py \
  --input output/public_urls.json \
  --output output/public_sources.json \
  --rejected-output output/public_rejected.jsonl \
  --ai-model-name YOUR_MODEL_NAME

export QWEN_API_KEY=YOUR_QWEN_API_KEY
python src/content-scoring/scripts/scoring.py \
  --input-file output/public_sources.json \
  --out-dir output/scored \
  --run-name YOUR_MODEL_NAME

python src/evaluation/compute_metrics.py \
  --run YOUR_MODEL_NAME=output/scored/YOUR_MODEL_NAME.enriched.json \
  --query-metadata data/queries/sourcebench_public_queries_v1.csv \
  --out-dir output/metrics
```

The evaluation-side scripts are described below.
In short:

- validate_official_submission.py: validate the official submission schema
- official_submission_backend.py: intake and queue a submission
- official_run.py: run the hidden official evaluation
- compute_metrics.py: aggregate the final metrics
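compute_metrics.py receives each run as a NAME=PATH pair (the --run flag shown in the pipeline above). As a small illustration of that convention, here is a hypothetical parser, not the repository's actual code:

```python
def parse_run_spec(spec: str) -> tuple[str, str]:
    """Split a "RUN_NAME=path/to/file.json" spec into (name, path)."""
    # Partition on the first "=" only, so paths containing "=" stay intact.
    name, sep, path = spec.partition("=")
    if not (name and sep and path):
        raise ValueError(f"expected NAME=PATH, got {spec!r}")
    return name, path

print(parse_run_spec("my-model=output/scored/my-model.enriched.json"))
# -> ('my-model', 'output/scored/my-model.enriched.json')
```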
Open evaluation can be run locally on the public split.
Official leaderboard evaluation is intended to be run server-side by the SourceBench team using:
- a hidden holdout split
- the fixed judge model and prompts
- the fixed metrics code
- a standardized submission contract
Relevant files:
- leaderboard/QUERY_SPLIT_POLICY.md
- leaderboard/OFFICIAL_SUBMISSION_CONTRACT.md
- leaderboard/README.md
- leaderboard/examples/
This repository keeps only the public split.
The hidden holdout queries must not be committed here. The benchmark's master query pool should likewise stay out of the public release, since it could be used to reconstruct the holdout split; for that reason, data/queries/queries.csv is excluded from this repository.
Internal official submission artifacts are also excluded. The ignored directory leaderboard/.official_submissions/ is reserved for server-side submission intake and official evaluation runs.
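Assuming the exclusions are enforced through a standard .gitignore (an assumption; check the repository's actual ignore rules), the relevant entries might look like:

```gitignore
# Master query pool: could reconstruct the hidden holdout split
data/queries/queries.csv

# Server-side submission intake and official runs only
leaderboard/.official_submissions/
```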
If you use SourceBench in your research, please cite:
```bibtex
@article{sourcebench2026,
  title={SourceBench: Can AI Answers Reference Quality Web Sources?},
  author={Hexi Jin and Stephen Liu and Yuheng Li and Simran Malik and Yiying Zhang},
  journal={arXiv preprint arXiv:2602.16942},
  year={2026}
}
```