Evalet 🔬

Evaluating Large Language Models by Fragmenting Outputs into Functions

Yoonjoo Lee

University of Michigan

Juho Kim

KAIST, SkillBench

* Equal contribution

Paper Code Demo (Coming Soon)

Conditionally Accepted to CHI 2026


Evalet helps practitioners understand and validate LLM-based evaluations by decomposing outputs into fragment-level functions. Instead of opaque scores, see exactly what in each output influenced the evaluation and why.



The Problem

LLM-as-a-Judge approaches produce holistic scores (e.g., “3 out of 5”) that obscure which specific elements influenced the assessment. To understand a rating, practitioners must manually review outputs and map justifications to specific fragments.

Evalet addresses this through functional fragmentation: automatically dissecting outputs into fragments and interpreting the function each serves relative to evaluation criteria.
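This page describes the pipeline only at a high level. As a rough illustration, the sketch below prompts an LLM to split an output into fragments and label each with the function it serves, a rating, and the reasoning behind it. The prompt wording, model name, and the Fragment fields are assumptions for illustration, not Evalet's actual implementation.

# Minimal sketch of functional fragmentation (illustrative; not the paper's exact prompt or schema).
import json
from dataclasses import dataclass
from openai import OpenAI  # assumes an OpenAI-compatible client

@dataclass
class Fragment:
    text: str        # span of the output
    function: str    # what the span does relative to the criterion, e.g. "acknowledges uncertainty"
    rating: str      # "positive" or "negative" with respect to the criterion
    reasoning: str   # evaluator's justification for the rating

FRAGMENT_PROMPT = """Split the OUTPUT into fragments. For each fragment, describe the
function it serves with respect to the CRITERION, rate it positive or negative,
and explain your reasoning. Respond as a JSON list of
{{"text": ..., "function": ..., "rating": ..., "reasoning": ...}} objects.

CRITERION: {criterion}
OUTPUT: {output}"""

def fragment_output(client: OpenAI, output: str, criterion: str) -> list[Fragment]:
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": FRAGMENT_PROMPT.format(criterion=criterion, output=output)}],
    )
    return [Fragment(**item) for item in json.loads(response.choices[0].message.content)]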


The Interface

The interface has two main components: the Information Panel (A) and the Map Visualization (B).


Browsing Outputs and Evaluations

In the Database Tab of the Information Panel, the user can see each input-output pair as an entry. For each criterion in the entry, the interface shows the output's holistic score and a list of fragment-level functions extracted from that output. Users can quickly scan which functions were surfaced and how they were rated.
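As a rough illustration, an entry in the Database Tab could be represented as below; the field names and structure are assumptions, reusing the hypothetical Fragment type from the earlier sketch.

# Hypothetical shape of a Database Tab entry: an input-output pair plus, per criterion,
# a holistic score and the fragment-level functions extracted from the output.
from dataclasses import dataclass, field

@dataclass
class CriterionEvaluation:
    criterion: str               # e.g. "Harmlessness"
    holistic_score: int          # e.g. 3 on a 1-5 scale
    fragments: list[Fragment] = field(default_factory=list)

@dataclass
class Entry:
    input_text: str
    output_text: str
    evaluations: list[CriterionEvaluation] = field(default_factory=list)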




Clicking View Details reveals the full output with color-coded fragments: green for positively rated fragments, orange for negatively rated ones. Clicking on a fragment shows its function label and the evaluator's reasoning for that rating.
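A minimal sketch of that color mapping, assuming the hypothetical Fragment type above; the color names follow the description here, and the helper names are illustrative.

# Illustrative mapping from a fragment's rating to the highlight color described above.
def highlight_color(fragment: Fragment) -> str:
    return "green" if fragment.rating == "positive" else "orange"

def fragment_tooltip(fragment: Fragment) -> str:
    # Shown when a fragment is clicked: its function label and the evaluator's reasoning.
    return f"{fragment.function}: {fragment.reasoning}"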


Exploring the Landscape of Functions

The Map Visualization projects all fragment-level functions into a 2D space based on their semantic similarity. Functions that serve similar purposes are clustered together, even if they differ in wording and content, letting users survey the overall landscape of functions across outputs.
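One plausible way to build such a map is sketched below with sentence embeddings and a UMAP projection; the embedding model and projection method are assumptions, not necessarily what Evalet uses.

# Sketch of projecting fragment-level functions into 2D by semantic similarity.
import numpy as np
from sentence_transformers import SentenceTransformer
import umap

def project_functions(fragments: list[Fragment]) -> np.ndarray:
    model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model
    embeddings = model.encode([f.function for f in fragments])
    # Reduce to 2D so functions with similar purposes land near each other on the map.
    return umap.UMAP(n_components=2, metric="cosine").fit_transform(embeddings)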


Correcting Evaluations

When users find misaligned evaluations, they can add functions to example sets (positive, negative, or excluded) to steer future evaluations. After re-running, the Show Examples toggle reveals where the examples landed in the new function space, allowing users to verify that their corrections took effect.

Users can select functions and add them to example sets. After re-evaluation, examples appear as squares in the Map Visualization, showing their effect on nearby functions.
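A hedged sketch of one way such example sets could steer a re-run is shown below, by folding the curated functions into the evaluation prompt as guidance; the mechanism, names, and prompt text are assumptions that continue the earlier fragmentation sketch (Fragment, FRAGMENT_PROMPT).

# Sketch of steering re-evaluation with user-curated example sets
# (how Evalet injects examples is an assumption; prompt guidance is one plausible mechanism).
import json
from dataclasses import dataclass, field

@dataclass
class ExampleSets:
    positive: list[Fragment] = field(default_factory=list)   # functions the evaluator should reward
    negative: list[Fragment] = field(default_factory=list)   # functions the evaluator should penalize
    excluded: list[Fragment] = field(default_factory=list)   # functions the evaluator should ignore

def examples_block(examples: ExampleSets) -> str:
    lines = ["Treat functions like these as POSITIVE:"]
    lines += [f"- {f.function}" for f in examples.positive]
    lines.append("Treat functions like these as NEGATIVE:")
    lines += [f"- {f.function}" for f in examples.negative]
    lines.append("Do NOT factor in functions like these:")
    lines += [f"- {f.function}" for f in examples.excluded]
    return "\n".join(lines)

def reevaluate(client, output: str, criterion: str, examples: ExampleSets) -> list[Fragment]:
    prompt = FRAGMENT_PROMPT.format(criterion=criterion, output=output) + "\n\n" + examples_block(examples)
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return [Fragment(**item) for item in json.loads(response.choices[0].message.content)]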


What We Found

We conducted a user study (N=10) comparing Evalet against a baseline system that presented holistic scores.


Example Applications

Evalet surfaces meaningful fragment-level functions across diverse evaluation tasks:

Fragment-level functions identified across three tasks: (A) Metacognitive Insight in reasoning traces shows self-questioning and uncertainty acknowledgment; (B) Harmlessness in conversations shows a spectrum from refusals to harmful recommendations; (C) Social Intelligence in agent simulations shows rapport building and boundary maintenance.

Metacognitive Insight in Reasoning: Functions reveal behaviors such as self-questioning about knowledge gaps, acknowledgment of limitations, and proactive edge-case analysis, showing how reasoning models reflect on their own thought processes.
Harmlessness in Conversations: Functions span a spectrum from refusals to share harmful information and constructive ethical alternatives to promotion of coercive tactics, letting practitioners identify where models fall on the safety spectrum.
Social Intelligence in Simulations: Functions capture positive behaviors such as rapport building through mirroring, as well as negative ones such as self-centered interactions, revealing both strengths and weaknesses in social behavior.

Bibtex

@article{kim2025evalet,
  title={Evalet: Evaluating Large Language Models by Fragmenting Outputs into Functions},
  author={Tae Soo Kim and Heechan Lee and Yoonjoo Lee and Joseph Seering and Juho Kim},
  year={2025},
  eprint={2509.11206},
  archivePrefix={arXiv},
  primaryClass={cs.HC},
  url={https://arxiv.org/abs/2509.11206},
}


This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (Ministry of Science and ICT) (No. RS-2025-00557726). This work was also supported by an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. RS-2024-00443251, Accurate and Safe Multimodal, Multilingual Personalized AI Tutors).