SourceBench evaluates whether a generative engine cites high-quality web sources, not only whether it produces a fluent final answer.
SourceBench focuses on the quality of cited sources along dimensions such as:
- semantic relevance
- factual accuracy
- freshness
- objectivity / tone
- layout / ad density
- accountability
- transparency
- authority
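The dimensions above can be pictured as per-source scores that roll up into an aggregate. Below is a minimal sketch; the field names, the 0-1 scale, and the unweighted mean are illustrative assumptions, not SourceBench's actual schema or aggregation.

```python
# Hypothetical judged-source record. Field names and the 0-1 scale are
# assumptions for illustration; the real schema comes from the scoring scripts.
judged_source = {
    "url": "https://example.com/article",
    "scores": {
        "semantic_relevance": 0.9,
        "factual_accuracy": 0.8,
        "freshness": 0.6,
        "objectivity": 0.7,
        "layout": 0.8,
        "accountability": 0.5,
        "transparency": 0.7,
        "authority": 0.6,
    },
}

# One possible aggregate: an unweighted mean over all dimensions.
overall = sum(judged_source["scores"].values()) / len(judged_source["scores"])
print(round(overall, 3))  # -> 0.7
```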
This repository is the benchmark codebase. It contains:
- the fixed public query split for open evaluation
- source collection scripts
- source judging scripts
- metric computation scripts
- official submission validation and runner scripts
- split policy and official submission contract
The public leaderboard site should be hosted separately from this repository.
Repository layout:

```
data/queries/
  sourcebench_public_queries_v1.csv
leaderboard/
  QUERY_SPLIT_POLICY.md
  OFFICIAL_SUBMISSION_CONTRACT.md
  README.md
  examples/
src/source-collection/
src/content-scoring/scripts/
src/evaluation/
requirements.txt
```
Install dependencies:

```shell
pip install -r requirements.txt
```

Run the public evaluation pipeline:

```shell
python src/source-collection/get_urls.py \
  --input path/to/public_queries.jsonl \
  --output output/public_urls.json \
  --model YOUR_MODEL_NAME \
  --openai-base-url YOUR_OPENAI_COMPATIBLE_ENDPOINT \
  --openai-api-key YOUR_GE_API_KEY \
  --ai-model-name YOUR_MODEL_NAME

python src/source-collection/collect_sources_from_urls.py \
  --input output/public_urls.json \
  --output output/public_sources.json \
  --rejected-output output/public_rejected.jsonl \
  --ai-model-name YOUR_MODEL_NAME

export QWEN_API_KEY=YOUR_QWEN_API_KEY
python src/content-scoring/scripts/scoring.py \
  --input-file output/public_sources.json \
  --out-dir output/scored \
  --run-name YOUR_MODEL_NAME

python src/evaluation/compute_metrics.py \
  --run YOUR_MODEL_NAME=output/scored/YOUR_MODEL_NAME.enriched.json \
  --query-metadata data/queries/sourcebench_public_queries_v1.csv \
  --out-dir output/metrics
```

The evaluation-side scripts are described below.
In short:

- validate_official_submission.py: validate the official submission schema
- official_submission_backend.py: intake and queue a submission
- official_run.py: run the hidden official evaluation
- compute_metrics.py: aggregate the final metrics
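compute_metrics.py receives each run as a NAME=PATH pair (the --run flag shown in the pipeline above). As a small illustration of that convention, here is a hypothetical parser, not the repository's actual code:

```python
def parse_run_spec(spec: str) -> tuple[str, str]:
    """Split a "RUN_NAME=path/to/file.json" spec into (name, path)."""
    # Partition on the first "=" only, so paths containing "=" stay intact.
    name, sep, path = spec.partition("=")
    if not (name and sep and path):
        raise ValueError(f"expected NAME=PATH, got {spec!r}")
    return name, path

print(parse_run_spec("my-model=output/scored/my-model.enriched.json"))
# -> ('my-model', 'output/scored/my-model.enriched.json')
```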
Open evaluation can be run locally on the public split.
Official leaderboard evaluation is intended to be run server-side by the SourceBench team using:
- a hidden holdout split
- the fixed judge model and prompts
- the fixed metrics code
- a standardized submission contract
Relevant files:
- leaderboard/QUERY_SPLIT_POLICY.md
- leaderboard/OFFICIAL_SUBMISSION_CONTRACT.md
- leaderboard/README.md
- leaderboard/examples/
This repository keeps only the public split.
The hidden holdout queries must not be committed here. The benchmark's master query pool should likewise stay out of the public release, since it could be used to reconstruct the holdout split; for that reason, data/queries/queries.csv is excluded from this repository.
Internal official submission artifacts are also excluded. The ignored directory leaderboard/.official_submissions/ is reserved for server-side submission intake and official evaluation runs.
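Assuming the exclusions are enforced through a standard .gitignore (an assumption; check the repository's actual ignore rules), the relevant entries might look like:

```gitignore
# Master query pool: could reconstruct the hidden holdout split
data/queries/queries.csv

# Server-side submission intake and official runs only
leaderboard/.official_submissions/
```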
If you use SourceBench in your research, please cite:
```bibtex
@article{sourcebench2026,
  title={SourceBench: Can AI Answers Reference Quality Web Sources?},
  author={Hexi Jin and Stephen Liu and Yuheng Li and Simran Malik and Yiying Zhang},
  journal={arXiv preprint arXiv:2602.16942},
  year={2026}
}
```