test(sdk): add evals for llm judge, tool selection, followup quality, and multi-turn memory #1838

Open
Mason Daugherty (mdrxy) wants to merge 7 commits into main from mdrxy/ab-evals
Conversation

Mason Daugherty (@mdrxy), Member, commented Mar 12, 2026

Port from Agent Builder: LLM-as-judge assertion (LLMJudge) and three new eval suites — tool selection, followup question quality, and multi-turn memory behavior.

The judge assertion fills a gap where substring matching can't evaluate semantic correctness, using a second LLM to grade agent responses against human-readable criteria with per-criterion pass/fail granularity.
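As a rough illustration of per-criterion judging, here is a minimal sketch. It is not the `LLMJudge` implementation from this PR: the names `grade_response`, `assert_all_pass`, `CriterionResult`, and `make_stub_judge` are hypothetical, and the judge is a deterministic stub standing in for the second LLM call so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical judge signature: (response, criterion) -> (passed, reason).
# In a real eval this would wrap a call to a second LLM.
JudgeFn = Callable[[str, str], Tuple[bool, str]]


@dataclass
class CriterionResult:
    criterion: str
    passed: bool
    reason: str


def grade_response(response: str, criteria: List[str], judge: JudgeFn) -> List[CriterionResult]:
    """Ask the judge about each criterion separately and keep every verdict."""
    results = []
    for criterion in criteria:
        passed, reason = judge(response, criterion)
        results.append(CriterionResult(criterion, passed, reason))
    return results


def assert_all_pass(results: List[CriterionResult]) -> None:
    """Raise AssertionError naming each failed criterion, if any."""
    failures = [r for r in results if not r.passed]
    if failures:
        details = "; ".join(f"{r.criterion!r}: {r.reason}" for r in failures)
        raise AssertionError(f"{len(failures)} of {len(results)} criteria failed: {details}")


def make_stub_judge(verdicts: Dict[str, bool]) -> JudgeFn:
    """Deterministic stand-in for the judge model, keyed by criterion text."""
    def judge(response: str, criterion: str) -> Tuple[bool, str]:
        passed = verdicts[criterion]
        return passed, "stubbed pass" if passed else "stubbed fail"
    return judge


# Example: two human-readable criteria, one of which the stub rejects.
criteria = ["Names the capital of France", "Cites a source"]
judge = make_stub_judge({criteria[0]: True, criteria[1]: False})
results = grade_response("The capital of France is Paris.", criteria, judge)
```

Grading each criterion on its own means a failure message can point at the specific criterion that was missed, which a single substring check or a single overall score would not show.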

@github-actions bot added labels on Mar 12, 2026:
deepagents (Related to the `deepagents` SDK / agent harness)
internal (User is a member of the `langchain-ai` GitHub organization)
size: L (500-999 LOC)
tests (Adding tests or correcting existing)
Eugene Yurtsev (@eyurtsev), Collaborator, left a comment

very nice! do you think you could run the evals against this branch to just get a sense of what it looks like?

Mason Daugherty (@mdrxy) changed the title from "test(sdk): add llm judge assertion and tool selection evals" to "test(sdk): add evals for llm judge, tool selection, followup quality, and multi-turn memory" on Mar 12, 2026
@mdrxy

This comment was marked as outdated.

2 participants