agent_framework_evaluator.evaluation#
Module API#
Post-run LLM evaluation for the agent evaluator web UI.
- agent_framework_evaluator.evaluation.extract_first_llm_request_prompts(input_payload)[source]#
Extract the system message and every user message from the first provider request (in order). Multiple user turns are common (task text, then skills catalog, etc.).
- Parameters:
input_payload (Any)
- Return type:
dict[str, Any]
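A minimal usage sketch; the payload shape and the returned keys below are illustrative assumptions, not the module's guaranteed schema:

    from agent_framework_evaluator.evaluation import extract_first_llm_request_prompts

    # Hypothetical run payload; real provider requests may be shaped differently.
    input_payload = {
        "requests": [
            {
                "messages": [
                    {"role": "system", "content": "You are a helpful agent."},
                    {"role": "user", "content": "Summarize the task."},
                    {"role": "user", "content": "Skills catalog: search, summarize."},
                ]
            }
        ]
    }

    prompts = extract_first_llm_request_prompts(input_payload)
    # Assumed result: the system prompt plus all user messages, in order,
    # e.g. {"system": "...", "user_messages": ["...", "..."]}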
- agent_framework_evaluator.evaluation.extract_initial_prompts(input_payload)[source]#
Extract the first system and the first user message (evaluation / backward compatibility).
- Parameters:
input_payload (Any)
- Return type:
dict[str, str]
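A companion sketch for this backward-compatible variant; the payload shape and returned key names are assumptions, as above:

    from agent_framework_evaluator.evaluation import extract_initial_prompts

    # Hypothetical payload shape, as in the previous sketch.
    input_payload = {
        "requests": [
            {
                "messages": [
                    {"role": "system", "content": "You are a helpful agent."},
                    {"role": "user", "content": "Summarize the task."},
                ]
            }
        ]
    }

    prompts = extract_initial_prompts(input_payload)
    # Assumed keys: {"system_prompt": "...", "user_prompt": "..."} — only the
    # first system and first user message survive.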
- agent_framework_evaluator.evaluation.format_eval_input(system_prompt, user_prompt, criteria, agent_message)[source]#
Build XML-tagged user content for the evaluator model.
- Parameters:
system_prompt (str)
user_prompt (str)
criteria (str)
agent_message (str)
- Return type:
str
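A usage sketch; the exact XML tag names the function emits are not documented here, so the tags in the comment are assumptions:

    from agent_framework_evaluator.evaluation import format_eval_input

    content = format_eval_input(
        system_prompt="You are a travel planner.",
        user_prompt="Plan a 3-day trip to Kyoto.",
        criteria="Itinerary covers all 3 days and stays on topic.",
        agent_message="Day 1: Fushimi Inari, ...",
    )
    # content is a single string with each field wrapped in XML-style tags,
    # e.g. <system_prompt>...</system_prompt> <criteria>...</criteria> (tag names assumed).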
- agent_framework_evaluator.evaluation.select_agent_result_field(agent_result, field_name)[source]#
Select field_name (a dot-delimited path) from agent_result.
Returns None when the path does not exist in the result dict, so callers can distinguish a missing field from an empty value and raise an appropriate error. Returns the full stringified payload when field_name is ".".
For structured paths (e.g. response.status), response is checked before parameters so that the new typed channel takes precedence.
- Parameters:
agent_result (Any)
field_name (Any)
- Return type:
str | None
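The sentinel and path behavior described above can be exercised directly; the agent_result dict here is a plain illustrative stand-in:

    from agent_framework_evaluator.evaluation import select_agent_result_field

    agent_result = {
        "response": {"status": "ok"},
        "parameters": {"status": "stale"},
    }

    select_agent_result_field(agent_result, "response.status")  # "ok" — typed channel wins
    select_agent_result_field(agent_result, "missing.path")     # None — path absent
    select_agent_result_field(agent_result, ".")                # full stringified payload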
- agent_framework_evaluator.evaluation.failed_evaluator_result(error_message)[source]#
Return a zero-score result with the error in verdict and criterion reasoning.
- Parameters:
error_message (str)
- Return type:
dict[str, Any]
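A short sketch of how callers typically surface an evaluator failure; the exact keys of the returned dict are assumptions:

    from agent_framework_evaluator.evaluation import failed_evaluator_result

    result = failed_evaluator_result("evaluator model timed out")
    # Assumed shape: a zero score with the error echoed into the verdict and
    # each criterion's reasoning, e.g.
    # {"score": 0, "verdict": "evaluator model timed out", ...}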
- agent_framework_evaluator.evaluation.parse_eval_response(payload)[source]#
Map evaluator LLM JSON to API / UI fields.
- Parameters:
payload (dict[str, Any])
- Return type:
dict[str, Any]
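A sketch with a hypothetical evaluator-LLM payload; the field names on both sides are assumptions, since the JSON contract is set by the evaluator prompt:

    from agent_framework_evaluator.evaluation import parse_eval_response

    # Hypothetical evaluator output; real field names may differ.
    payload = {
        "score": 0.8,
        "verdict": "pass",
        "criteria": [{"name": "on_topic", "met": True, "reasoning": "Stays on task."}],
    }

    result = parse_eval_response(payload)
    # result carries the API / UI field names consumed by the web UI.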
- agent_framework_evaluator.evaluation.run_code_evaluation(code_evaluator, *, prompt, agent_message, flags=None)[source]#
Run a programmatic evaluator.
Returns None if the evaluator opts out (i.e., itself returns None); otherwise returns the parsed result dict. Raises ValueError for non-dict, non-None returns.
- Parameters:
code_evaluator (Callable[[...], Any])
prompt (str)
agent_message (str)
flags (set[str] | None)
- Return type:
dict[str, Any] | None
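A sketch of a programmatic evaluator; the keyword-argument calling convention (prompt, agent_message, flags) mirrors this function's own signature and is an assumption about how the callable is invoked:

    from agent_framework_evaluator.evaluation import run_code_evaluation

    def length_check(*, prompt, agent_message, flags=None):
        if not agent_message:
            return None  # opt out: nothing to evaluate
        return {"score": 1.0 if len(agent_message) > 20 else 0.0, "verdict": "length check"}

    result = run_code_evaluation(
        length_check,
        prompt="Plan a trip.",
        agent_message="Day 1: arrive and check in at the hotel.",
    )
    # result is the dict above; None would mean the evaluator opted out,
    # and any other non-dict return raises ValueError.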
- agent_framework_evaluator.evaluation.run_code_evaluations(code_evaluators, *, prompt, agent_message, flags=None)[source]#
Run all code evaluators sequentially.
Returns one entry per evaluator. None entries (opted-out evaluators) are excluded from score averaging by callers.
- Parameters:
code_evaluators (list[Callable[[...], Any]])
prompt (str)
agent_message (str)
flags (set[str] | None)
- Return type:
list[dict[str, Any] | None]
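The batched form, with one opting-out evaluator to show the None placeholder; the same calling-convention assumption as in the previous sketch applies:

    from agent_framework_evaluator.evaluation import run_code_evaluations

    def always_pass(*, prompt, agent_message, flags=None):
        return {"score": 1.0, "verdict": "pass"}

    def opt_out(*, prompt, agent_message, flags=None):
        return None

    results = run_code_evaluations(
        [always_pass, opt_out],
        prompt="Plan a trip.",
        agent_message="Day 1: arrive and check in.",
    )
    # results == [{"score": 1.0, "verdict": "pass"}, None]; callers drop the
    # None entry before averaging scores.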
- agent_framework_evaluator.evaluation.run_evaluation(*, env_path, evaluator_prompt, agent_message, system_prompt='', user_prompt='', model_override=None, log_callback=None)[source]#
Call the evaluator LLM once. Does not run the agent loop.
- Parameters:
env_path (str | Path)
evaluator_prompt (str)
agent_message (str)
system_prompt (str)
user_prompt (str)
model_override (str | tuple[str, ...] | None)
log_callback (Callable[[dict[str, Any]], None] | None)
- Return type:
dict[str, Any]
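A single-call sketch; env_path here is a hypothetical environment directory, and the shape of events passed to log_callback is an assumption:

    from agent_framework_evaluator.evaluation import run_evaluation

    result = run_evaluation(
        env_path="envs/travel_planner",  # hypothetical environment directory
        evaluator_prompt="Score the reply against the stated criteria.",
        agent_message="Day 1: Fushimi Inari, ...",
        system_prompt="You are a travel planner.",
        user_prompt="Plan a 3-day trip to Kyoto.",
        model_override=None,  # or a model name / tuple of fallbacks
        log_callback=lambda event: print(event),  # receives dict events during the call
    )
    # result is the parsed evaluator verdict; the agent loop itself is not run.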