llm-evaluation

Here are 673 public repositories matching this topic...

mlflow / mlflow

The open source AI engineering platform for agents, LLMs, and ML models. MLflow enables teams of all sizes to debug, evaluate, monitor, and optimize production-quality AI applications while controlling costs and managing access to models and data.

open-source machine-learning ai apache-spark evaluation ml openai agents observability model-management mlops mlflow agentops prompt-engineering ai-governance langchain llmops llm-evaluation

Updated Mar 17, 2026
Python

langfuse / langfuse

🪢 Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Integrates with OpenTelemetry, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23

open-source playground monitoring analytics evaluation self-hosted ycombinator openai observability autogen large-language-models llm prompt-engineering langchain llmops llama-index prompt-management llm-evaluation llm-observability

Updated Mar 17, 2026
TypeScript

comet-ml / opik

Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.

open-source playground evaluation openai hacktoberfest llm prompt-engineering hacktoberfest2025 langchain llmops llama-index llm-evaluation llm-observability

Updated Mar 17, 2026
Python

promptfoo / promptfoo

Test your prompts, agents, and RAGs. Red teaming/pentesting/vulnerability scanning for AI. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.

testing ci evaluation ci-cd pentesting cicd vulnerability-scanners prompts evaluation-framework red-teaming rag llm prompt-engineering llmops prompt-testing llm-eval llm-evaluation llm-evaluation-framework

Updated Mar 16, 2026
TypeScript

confident-ai / deepeval

The LLM Evaluation Framework

python evaluation-metrics evaluation-framework llm-evaluation llm-evaluation-framework llm-evaluation-metrics

Updated Mar 13, 2026
Python

Arize-ai / phoenix

AI Observability & Evaluation

openai datasets agents ai-monitoring ai-observability prompt-engineering llms langchain llmops anthropic llamaindex llm-eval evals llm-evaluation aiengineering smolagents

Updated Mar 17, 2026
Jupyter Notebook

NVIDIA / garak

the LLM vulnerability scanner

ai vulnerability-assessment security-scanners llm-security llm-evaluation

Updated Mar 16, 2026
HTML

jeinlee1991 / chinese-llm-benchmark

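ReLE evaluation: a continuously updated capability benchmark for Chinese-language AI large models. It currently covers 359 models, including commercial models such as chatgpt, gpt-5.2, o4-mini, Google gemini-3-pro, Claude-4.6, Wenxin ERNIE-X1.1, ERNIE-5.0, qwen3-max, qwen3.5-plus, Baichuan, iFlytek Spark, and SenseTime senseChat, as well as open-source models such as step3.5-flash, kimi-k2.5, ernie4.5, MiniMax-M2.5, deepseek-v3.2, Qwen3.5, llama4, Zhipu GLM-5, GLM-4.7, LongCat, gemma3, and mistral. Besides leaderboards, it provides a library of more than 2 million large-model defect cases, so the community can study, analyze, and improve large models.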

artificial-intelligence llm-agent llm-evaluation agentic-ai

Updated Mar 17, 2026

Helicone / helicone

🧊 Open source LLM observability platform. One line of code to monitor, evaluate, and experiment. YC W23 🍓

open-source playground monitoring analytics evaluation ycombinator openai gpt large-language-models llm prompt-engineering langchain llmops llama-index prompt-management llm-evaluation llm-observability agent-monitoring llm-cost

Updated Mar 15, 2026
TypeScript

Giskard-AI / giskard-oss

🐢 Open-Source Evaluation & Testing library for LLM Agents

ai-security mlops fairness-ai responsible-ai ml-validation red-team-tools trustworthy-ai ml-testing llm ai-red-team ai-testing llmops llm-security llm-eval llm-evaluation rag-evaluation agent-evaluation

Updated Mar 17, 2026
Python

PacktPublishing / LLM-Engineers-Handbook

The LLM's practical guide: From the fundamentals to deploying advanced LLM and RAG apps to AWS using LLMOps best practices

aws rag mlops llm llmops genai fine-tuning-llm llm-evaluation ml-system-design

Updated Mar 2, 2026
Python

Marker-Inc-Korea / AutoRAG

AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation

python open-source qa benchmarking ops pipeline analysis optimization evaluation embeddings automl document-parser rag llm retrieval-augmented-generation llm-ops llm-evaluation rag-evaluation

Updated Mar 10, 2026
Python

Agenta-AI / agenta

The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.

evaluation agents observability prompt-engineering llmops prompt-management llm-tools llm-framework llm-playground llm-platform llm-evaluation rag-evaluation llm-monitoring llm-as-a-judge llm-observability

Updated Mar 17, 2026
TypeScript

EvolvingLMMs-Lab / lmms-eval

One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks

benchmark evaluation agi video-understanding vlm multimodal large-language-models vision-language-model llm-evaluation audio-evaluation multimodal-evaluation

Updated Mar 15, 2026
Python

truera / trulens

Evaluation and Tracking for LLM Experiments and AI Agents

machine-learning neural-networks ai-agents explainable-ml agentops ai-monitoring ai-observability llms llmops llm-eval evals llm-evaluation agent-evaluation

Updated Mar 10, 2026
Python

lmnr-ai / lmnr

Laminar - open-source observability platform purpose-built for AI agents. YC S24.

Updated Mar 17, 2026
TypeScript

genieincodebottle / generative-ai

Comprehensive resources on Generative AI, including a detailed roadmap, projects, use cases, interview preparation, and coding preparation.

Updated Mar 16, 2026
Jupyter Notebook

msoedov / agentic_security

Agentic LLM Vulnerability Scanner / AI red teaming kit 🧪

agent-framework ai-red-team prompt-testing llm-security llm-vulnerabilities llm-evaluation llm-fuzzing llm-evaluation-framework llm-guardrails llm-scanner llm-jailbreaks llm-fuzzer llm-fuzzer-aggregator agent-security

Updated Feb 3, 2026
Python

huggingface / aisheets

Build, enrich, and transform datasets using AI models with no code

oss ai synthetic-data nocode llms llm-evaluation

Updated Oct 23, 2025
TypeScript

cyberark / FuzzyAI

A powerful tool for automated LLM fuzzing. It is designed to help developers and security researchers identify and mitigate potential jailbreaks in their LLM APIs.

security ai jailbreak fuzzing jailbreaking llm llms ai-red-team llm-security llm-evaluation

Updated Feb 6, 2026
Jupyter Notebook

