agent_framework_evaluator.evaluation#

Module API#

Post-run LLM evaluation for the agent evaluator web UI.

agent_framework_evaluator.evaluation.extract_first_llm_request_prompts(input_payload)[source]#

Extract the system message and every user message, in order, from the first provider request.

Multiple user turns are common (task text, then skills catalog, etc.).

Parameters:

input_payload (Any)

Return type:

dict[str, Any]
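The payload layout is not documented here, but the documented behavior can be sketched. The nested `{"requests": [{"messages": [...]}]}` shape below is an assumption, and `extract_first_llm_request_prompts_sketch` is a hypothetical stand-in, not the library's implementation:

```python
from typing import Any

def extract_first_llm_request_prompts_sketch(input_payload: Any) -> dict[str, Any]:
    """Sketch: system prompt plus every user message from the first request.

    Assumes a hypothetical payload shape:
    {"requests": [{"messages": [{"role": ..., "content": ...}, ...]}]}
    """
    messages = input_payload["requests"][0]["messages"]
    system = next((m["content"] for m in messages if m["role"] == "system"), "")
    # Keep every user turn, in order: task text, skills catalog, etc.
    users = [m["content"] for m in messages if m["role"] == "user"]
    return {"system_prompt": system, "user_prompts": users}
```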

agent_framework_evaluator.evaluation.extract_initial_prompts(input_payload)[source]#

Extract the first system and first user message (kept for evaluation and backward compatibility).

Parameters:

input_payload (Any)

Return type:

dict[str, str]
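Unlike `extract_first_llm_request_prompts`, this variant keeps only the first user turn. A minimal sketch, again assuming a hypothetical `{"requests": [{"messages": [...]}]}` payload shape:

```python
from typing import Any

def extract_initial_prompts_sketch(input_payload: Any) -> dict[str, str]:
    """Sketch: first system and first user message only (legacy behavior)."""
    messages = input_payload["requests"][0]["messages"]
    system = next((m["content"] for m in messages if m["role"] == "system"), "")
    user = next((m["content"] for m in messages if m["role"] == "user"), "")
    return {"system_prompt": system, "user_prompt": user}
```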

agent_framework_evaluator.evaluation.format_eval_input(system_prompt, user_prompt, criteria, agent_message)[source]#

Build XML-tagged user content for the evaluator model.

Parameters:
  • system_prompt (str)

  • user_prompt (str)

  • criteria (str)

  • agent_message (str)

Return type:

str
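The exact tag names are not documented; the sketch below illustrates the general shape of XML-tagged evaluator input, with assumed tag names:

```python
def format_eval_input_sketch(system_prompt: str, user_prompt: str,
                             criteria: str, agent_message: str) -> str:
    """Sketch: wrap each piece in an XML-style tag (tag names are assumptions)."""
    return (
        f"<system_prompt>{system_prompt}</system_prompt>\n"
        f"<user_prompt>{user_prompt}</user_prompt>\n"
        f"<criteria>{criteria}</criteria>\n"
        f"<agent_message>{agent_message}</agent_message>"
    )
```

Tagging each section lets the evaluator model distinguish the original prompts from the agent's answer and the grading criteria without ambiguity.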

agent_framework_evaluator.evaluation.select_agent_result_field(agent_result, field_name)[source]#

Select field_name (dot-delimited path) from agent_result.

Returns None when the path does not exist in the result dict, so callers can distinguish a missing field from an empty value and raise an appropriate error. Returns the full stringified payload when field_name is ".".

For structured paths (e.g. response.status), response is checked before parameters so that the new typed channel takes precedence.

Parameters:
  • agent_result (Any)

  • field_name (Any)

Return type:

str | None

agent_framework_evaluator.evaluation.failed_evaluator_result(error_message)[source]#

Return a zero-score result with the error in verdict and criterion reasoning.

Parameters:

error_message (str)

Return type:

dict[str, Any]
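A sketch of the documented contract, with assumed key names (the real result schema may use different fields):

```python
from typing import Any

def failed_evaluator_result_sketch(error_message: str) -> dict[str, Any]:
    """Sketch: zero score, error surfaced in verdict and criterion reasoning."""
    return {
        "score": 0,
        "verdict": error_message,
        "criteria": [
            {"criterion": "evaluation", "score": 0, "reasoning": error_message},
        ],
    }
```

Returning a well-formed result instead of raising keeps the UI rendering path uniform: a failed evaluation displays like any other, with the error text where the reasoning would be.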

agent_framework_evaluator.evaluation.parse_eval_response(payload)[source]#

Map evaluator LLM JSON to API / UI fields.

Parameters:

payload (dict[str, Any])

Return type:

dict[str, Any]
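The field mapping itself is schema-specific; the sketch below only illustrates the tolerant-mapping pattern, with assumed key names and defaults:

```python
from typing import Any

def parse_eval_response_sketch(payload: dict[str, Any]) -> dict[str, Any]:
    """Sketch: map evaluator JSON onto API/UI keys with safe defaults.

    Key names here are assumptions; the real mapping follows the
    evaluator's response schema.
    """
    return {
        "score": payload.get("score", 0),
        "verdict": payload.get("verdict", ""),
        "criteria": payload.get("criteria", []),
    }
```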

agent_framework_evaluator.evaluation.run_code_evaluation(code_evaluator, *, prompt, agent_message, flags=None)[source]#

Run a programmatic evaluator.

Returns None when the evaluator opts out (by itself returning None); otherwise returns the parsed result dict. Raises ValueError when the evaluator returns anything that is neither a dict nor None.

Parameters:
  • code_evaluator (Callable[[...], Any])

  • prompt (str)

  • agent_message (str)

  • flags (set[str] | None)

Return type:

dict[str, Any] | None

agent_framework_evaluator.evaluation.run_code_evaluations(code_evaluators, *, prompt, agent_message, flags=None)[source]#

Run all code evaluators sequentially.

Returns one entry per evaluator. None entries (opted-out evaluators) are excluded from score averaging by callers.

Parameters:
  • code_evaluators (list[Callable[[...], Any]])

  • prompt (str)

  • agent_message (str)

  • flags (set[str] | None)

Return type:

list[dict[str, Any] | None]

agent_framework_evaluator.evaluation.run_evaluation(*, env_path, evaluator_prompt, agent_message, system_prompt='', user_prompt='', model_override=None, log_callback=None)[source]#

Call the evaluator LLM once. Does not run the agent loop.

Parameters:
  • env_path (str | Path)

  • evaluator_prompt (str)

  • agent_message (str)

  • system_prompt (str)

  • user_prompt (str)

  • model_override (str | tuple[str, ...] | None)

  • log_callback (Callable[[dict[str, Any]], None] | None)

Return type:

dict[str, Any]
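The single-call flow can be illustrated end to end with a stubbed model. Everything provider-specific is replaced by a `call_model` callable (an assumption: the real function builds its client from `env_path` and `model_override`), and the tag names mirror the assumed `format_eval_input` shape:

```python
import json
from typing import Any, Callable

def run_evaluation_sketch(
    *,
    call_model: Callable[[str], str],
    evaluator_prompt: str,
    agent_message: str,
    system_prompt: str = "",
    user_prompt: str = "",
) -> dict[str, Any]:
    """Sketch of the flow: format tagged input, one model call, parse JSON.

    `call_model` is a hypothetical stand-in for the provider client; it
    receives the tagged evaluation input and returns the model's raw JSON.
    """
    eval_input = (
        f"<system_prompt>{system_prompt}</system_prompt>\n"
        f"<user_prompt>{user_prompt}</user_prompt>\n"
        f"<criteria>{evaluator_prompt}</criteria>\n"
        f"<agent_message>{agent_message}</agent_message>"
    )
    return json.loads(call_model(eval_input))
```

Because the agent loop is never run, this function can re-score an existing transcript cheaply: only the evaluator model is invoked.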