Athina AI

NEW

Athina now supports Ragas and Guardrails eval metrics 📐

Run evals while developing, in your CI/CD pipeline, and in production

Detect hallucinations and bad outputs, measure performance, and prevent regressions.

A library of 50+ evaluation metrics for your entire pipeline

All evaluators can be run programmatically using our SDK, or automatically using our SaaS platform.

Categories: LLM, Functions, Ragas, Conversation, Safety

Answer Similarity

Checks if the response is similar to the expected response (ground truth)

Result: Pass / Fail

Context Similarity

Checks if the context is similar to the response

Result: Pass / Fail

Groundedness

Checks if the response is grounded in the provided context

Result: Pass / Fail

Summarization Accuracy

Checks if the summary has any discrepancies from the source

Result: Pass / Fail

Response Faithfulness

Checks if the response is faithful to the provided context

Result: Pass / Fail

Context Sufficiency

Checks if the context contains enough information to answer the user's query

Result: Pass / Fail

Answer Completeness

Checks if the response answers the user's query completely

Result: Pass / Fail

Custom Prompt

Evaluates the response using a custom prompt

Result: Pass / Fail

Grading Criteria

Checks the response according to your grading criteria

Result: Pass / Fail

Ragas Context Relevancy

Measures the relevancy of the retrieved context, calculated based on both the query and contexts.

Example score: 0.2

Ragas Faithfulness

Measures the factual consistency of the generated answer against the given context.

Example score: 0.8

Ragas Answer Correctness

Checks the accuracy of the generated LLM response when compared to the ground truth.

Example score: 0.5

Ragas Answer Relevancy

Measures how pertinent the generated response is to the given prompt.

Example score: 0.95

Ragas Answer Semantic Similarity

Measures the semantic resemblance between the generated response and the expected response.

Example score: 0.66

Ragas Context Precision

Evaluates whether all of the ground-truth relevant items present in the contexts are ranked higher o...

Example score: 0.79

Ragas Context Recall

Measures the extent to which the retrieved context aligns with the expected response.

Example score: 0.39

Ragas Coherence

Checks if the generated response presents ideas, information, or arguments in a logical and organize...

Example score: 0.21

Ragas Conciseness

Checks if the generated response conveys information or ideas clearly and efficiently, without unnec...

Example score: 0.13

Ragas Harmfulness

Checks the potential of the generated response to cause harm to individuals, groups, or society at l...

Example score: 0.91

Ragas Maliciousness

Checks the potential of the generated response to harm, deceive, or exploit users.

Example score: 0.1

Conversation Resolution

Checks if every user message in the conversation was successfully resolved.

Result: Pass / Fail

Conversation Coherence

Checks if the conversation was coherent given the previous messages in the chat.

Result: Pass / Fail

Filtering

Athina can be configured to run different evals on different prompts
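
One way to picture this kind of filtering is a lookup from a prompt identifier to the evals that should run on it. The sketch below is illustrative only: the prompt slugs, `EVALS_BY_PROMPT`, `DEFAULT_EVALS`, and `evals_for` are hypothetical names, not Athina's configuration format.

```python
# Hypothetical routing table: which evals to run for which prompt.
# The prompt slugs and this structure are illustrative assumptions,
# not Athina's actual configuration format.
EVALS_BY_PROMPT = {
    "support_chat": ["Conversation Resolution", "Conversation Coherence"],
    "doc_qa": ["Context Sufficiency", "Response Faithfulness"],
}

# Fallback used when a prompt has no explicit entry (also an assumption).
DEFAULT_EVALS = ["Answer Completeness"]


def evals_for(prompt_slug: str) -> list[str]:
    """Return the eval names configured for a prompt, or the default set."""
    return EVALS_BY_PROMPT.get(prompt_slug, DEFAULT_EVALS)
```

With this table, `evals_for("doc_qa")` selects the two retrieval-focused evals, while an unlisted prompt falls back to the default set.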

Cost Management

Athina dynamically samples logs for LLM-powered evals to manage costs.
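
How the sampling is done is not described here. As a rough sketch of the general idea, the rate could be derived from a daily eval budget and applied at random to incoming logs. The function names and the budget-based rule are assumptions made for illustration, not Athina's algorithm.

```python
import random


def sampling_rate(daily_log_volume: int, max_evals_per_day: int) -> float:
    """Fraction of logs to evaluate so the daily budget is not exceeded."""
    if daily_log_volume <= 0:
        return 1.0
    return min(1.0, max_evals_per_day / daily_log_volume)


def sample_logs(logs, rate, seed=None):
    """Keep each log independently with probability `rate` (0.0 to 1.0)."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    return [log for log in logs if rng.random() < rate]
```

For example, with 1,000 logs per day and a budget of 100 LLM-powered evals, `sampling_rate(1000, 100)` gives a 10% rate; at low volume the rate is capped at 1.0, so every log is evaluated.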

open-source

Run evals locally

Run evals locally with our open-source Python SDK. View results in a notebook or on Athina.

Detect Regressions

Run Evals in CI/CD

Automate evaluations in CI/CD pipelines.
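
A common pattern for this is a regression gate: the pipeline computes the pass rate of a batch of evals and fails the build if it drops below a stored baseline. The sketch below shows only that gating logic on a plain list of pass/fail results; it does not call the Athina SDK, and `pass_rate`, `exit_code`, and the `tolerance` default are hypothetical.

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of eval results that passed (0.0 for an empty batch)."""
    return sum(results) / len(results) if results else 0.0


def exit_code(results: list[bool], baseline: float, tolerance: float = 0.02) -> int:
    """Return 0 if the pass rate is within `tolerance` of the baseline, else 1."""
    return 0 if pass_rate(results) >= baseline - tolerance else 1
```

A CI step could then end with `sys.exit(exit_code(results, baseline=0.75))`, so that a non-zero exit code marks the job as failed.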

Continuous Monitoring

Monitor LLM inferences continuously for any anomalies.

Alerting

Get timely alerts on critical issues.

Run Evals on Multiple Models

Supports multiple models, including Azure OpenAI, Gemini, and custom models.

History of Eval Metrics

Track the history of evaluation metrics over time.
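
As an illustration of what a metric history might look like, the sketch below groups eval scores by day and averages them. The input shape, a list of `(date, score)` pairs, and the function name `daily_average` are assumptions for the example, not an Athina export format.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def daily_average(scores):
    """Map each day to the mean of its scores, ordered by date.

    `scores` is an iterable of (date, float) pairs.
    """
    by_day = defaultdict(list)
    for day, score in scores:
        by_day[day].append(score)
    return {day: mean(values) for day, values in sorted(by_day.items())}
```

Plotting the resulting values over time gives a per-day trend line for any single metric.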

Percentile Distribution

Analyze the percentile distribution of evaluations.
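
For numeric eval scores, a percentile summary can be computed with Python's standard library. This is a minimal sketch; the choice of p50, p90, and p99 and the name `percentile_summary` are illustrative assumptions, not a description of what Athina reports.

```python
from statistics import quantiles


def percentile_summary(scores):
    """Return the 50th, 90th, and 99th percentiles of a list of scores.

    Needs at least two scores; the "inclusive" method treats the data
    as the full population rather than a sample.
    """
    cuts = quantiles(scores, n=100, method="inclusive")  # 99 cut points
    return {"p50": cuts[49], "p90": cuts[89], "p99": cuts[98]}
```

Percentiles are useful alongside averages because a healthy mean can hide a long tail of low-scoring responses.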

WORKS WITH RAG

Evaluate your entire RAG pipeline

Athina evaluators run on your entire RAG pipeline, not just on the responses.

User Query

Which spaceship was first to land on the moon?

Retrieved Context

Neil Armstrong was the first astronaut to land on the moon in 1969

Insufficient context

The retrieved context doesn’t contain information about the name of the spaceship.

Prompt Sent

You are an expert...

Prompt Response

The first spaceship to land on the moon in 1969 carried astronauts Neil Armstrong and Buzz...

Custom Evaluators

Bring your own eval

Create your own evaluator on Athina in a matter of seconds. Use an LLM or a custom function to create your own eval.

Prompt

Function

Describe the prompt you would like to use for the evaluator

Response must directly address the user's query

Response {response}

User Query {query}
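
The `{response}` and `{query}` placeholders above suggest that the evaluator prompt is a template filled in per log entry. The sketch below shows that substitution using Python's `str.format`. How Athina actually renders the template is not stated here, so treat the mechanism and the names `EVAL_PROMPT` and `render_eval_prompt` as assumptions.

```python
# The template text mirrors the example above; the rendering mechanism
# (str.format) is an assumption made for illustration.
EVAL_PROMPT = (
    "Response must directly address the user's query\n"
    "Response {response}\n"
    "User Query {query}"
)


def render_eval_prompt(template: str, query: str, response: str) -> str:
    """Fill the {query} and {response} placeholders in an eval prompt."""
    return template.format(query=query, response=response)
```

The rendered string is what would be sent to the grading LLM, which then judges whether the response meets the stated criterion.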

ANALYTICS

Track model performance over time

Track a historical record of your pipeline's performance and usage metrics over time.

[Chart: Answer Relevancy per day, 0.65 avg per day. Selectable metrics: Response Faithfulness, Answer Relevancy, Context Relevancy, Answer Completeness, Context Sufficiency, Groundedness]

Frequently Asked Questions

Everything you need to know about Evals. Can't find the answer you're looking for? Feel free to contact us.

Why do LLM evals work?

Do all evals require a labeled dataset?

Can I create custom evaluators?

Which models can I use as evaluators?

How can I use evals in my CI / CD pipeline?
