AI Agent Washing Is Everywhere. 17 Questions That Expose a Fake ‘Sales Agent’.

Agent washing is “AI-powered” in a new hoodie. Use 17 AI agent evaluation questions and a scoring rubric to expose fake sales agents and demand proof of booked meetings.

April 14, 2026 · 12 min read

AI agent washing is the new “AI-powered.” Same old workflow. New paint. Bigger invoice.

A real sales agent runs outbound end-to-end, makes decisions, uses tools, retries when things break, logs everything, and gets judged on one metric: booked meetings. If the vendor can’t explain those pieces in plain English, it’s a chatbot in a trench coat.

TL;DR

  • “Agentic” means the system takes action without a human clicking buttons at every step.
  • Your RFP should force vendors to prove autonomy boundaries, tool permissions, audit logs, retry logic, grounding sources, failure handling, human-in-the-loop points, data retention, and measurable outcomes.
  • Below: 17 copy-paste AI agent evaluation questions, plus a scoring rubric that makes agent washing expensive.

Agent washing, defined (and why you should care)

“Agent washing” is when vendors rebrand automation, workflows, or a prompt box as an “agent” without real autonomy. Gartner analysts have warned about this exact pattern in the CRM space. (poly.ai)

Here’s why it matters in procurement:

  • A chatbot can demo well. It can “draft an email,” “suggest next steps,” and “summarize a call.”
  • A sales agent has to ship outcomes. It must find leads, qualify them, run sequences, handle replies, and book meetings. That requires tool access, guardrails, logging, and failure recovery.

If your vendor’s definition of “agent” starts and ends with “it can chat,” you’re buying demo theater.


What a real “sales agent” is (in procurement terms)

Use this definition in your doc. It keeps everyone honest.

Sales agent (procurement definition):
A system that can plan and execute multi-step outbound work with scoped tool permissions, monitor its own failures, retry safely, escalate to a human at defined checkpoints, and produce auditable logs that connect actions to outcomes.

This maps cleanly to what risk frameworks already push: transparency, accountability, and traceability. NIST’s AI RMF explicitly emphasizes transparency and explainability as prerequisites for accountability. (nist.gov)

If a vendor can’t show traceability, they can’t prove control. If they can’t prove control, your legal team gets a new hobby.


The 17 AI agent evaluation questions that expose a fake “sales agent”

Each section includes:

  • What you’re really testing
  • Copy-paste RFP questions
  • Red flags that scream “chatbot in a trench coat”

1) Autonomy boundary: where does it act without a human?

You’re testing: Whether the product actually runs the process, or just suggests stuff for humans to click.

Copy-paste RFP question

  1. “List every step in your outbound workflow that runs without human approval. For each step, specify: trigger, decision logic, tools used, and output.”

Red flags

  • “The rep stays in control of every action.” Translation: not an agent.
  • “It drafts messages and your team sends them.” Translation: glorified copywriter.
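
To make question 1 concrete, here's the shape of a convincing answer: one record per autonomous step, with its trigger, decision logic, tools, and output spelled out. A minimal sketch with hypothetical field names, not any vendor's real schema:

```python
from dataclasses import dataclass

@dataclass
class AutonomousStep:
    """One step the agent runs with no human approval.

    Every field name here is illustrative, not a real product's API.
    """
    name: str              # e.g. "send_followup_2"
    trigger: str           # what starts it
    decision_logic: str    # how it decides to act
    tools_used: list[str]  # scoped tools it may call
    output: str            # the artifact it produces

followup = AutonomousStep(
    name="send_followup_2",
    trigger="no_reply_after_3_days",
    decision_logic="fit_score >= 0.7 and domain_health == 'good'",
    tools_used=["email.send", "crm.write:activity"],
    output="email sent + CRM activity logged",
)
```

A vendor that can fill this in for every step is selling an agent. A vendor that can't is selling suggestions.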

2) Tool permissions: what can it touch, exactly?

You’re testing: Blast radius. A real agent needs tools. A safe agent needs scoped permissions.

OpenAI’s agent builder guidance calls out risks in tool calling and prompt injection. Tools are power. Power needs guardrails. (platform.openai.com)

Copy-paste RFP questions

  2. “Provide your permission model for tools (CRM write access, email send, calendar booking, enrichment, web browsing). Can we scope by workspace, role, object type, and field?”
  3. “Can the agent send email from our domain? Under what constraints (daily caps, throttles, warmup, domain rotation, approvals)?”

Red flags

  • “It has full access to your CRM for best results.” That’s not confidence, that’s negligence.
  • No mention of field-level controls.
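
Here's what “scoped by workspace, role, object type, and field” from question 2 could look like in practice. A minimal sketch of a hypothetical policy format; real products will differ, but the shape (deny by default, narrow grants, hard caps) is what you're buying:

```python
# Hypothetical tool-permission policy. The names are illustrative; the
# shape is the point: deny by default, grant narrowly, cap blast radius.
AGENT_PERMISSIONS = {
    "crm.write": {
        "objects": ["contact", "activity"],           # no deals, no accounts
        "fields": ["status", "last_touch", "notes"],  # field-level allowlist
        "workspaces": ["outbound-pilot"],
    },
    "email.send": {
        "from_domains": ["outreach.example.com"],     # not your primary domain
        "daily_cap": 200,
        "throttle_per_hour": 25,
        "requires_approval_if": ["first_touch_to_named_account"],
    },
    "calendar.book": {
        "calendars": ["sdr-shared"],
        "max_duration_minutes": 30,
    },
    # Anything not listed here is denied by default.
}
```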

3) Grounding sources: what does it use as truth?

You’re testing: Whether the agent makes claims from real data, or vibes.

Copy-paste RFP questions

  4. “When the agent personalizes an email, what sources can it cite (CRM fields, enrichment, website crawl, job posts, intent signals)? Show a sample output with a source list per sentence.”
  5. “What happens when sources conflict (CRM says 200 employees, enrichment says 1,000)? Describe conflict resolution.”

Red flags

  • “It uses the LLM’s knowledge.” That’s not grounding, that’s roulette.
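
To picture what “a source list per sentence” and conflict resolution mean, here's a sketch. The structures are hypothetical; the point is that every claim carries provenance, and conflicts resolve by an explicit precedence order, not by whichever value the model saw last:

```python
# Hypothetical: every personalization claim carries its sources.
claim = {
    "sentence": "Saw you're hiring two SOC analysts in Austin.",
    "sources": [
        {"type": "job_post",
         "url": "https://example.com/careers/123",
         "fetched_at": "2026-04-10T14:02:00Z"},
    ],
}

# Conflicts resolve by explicit precedence, e.g. CRM beats enrichment.
SOURCE_PRECEDENCE = ["crm", "enrichment", "web_crawl"]

def resolve(values: dict) -> tuple:
    """Return (value, source) from the highest-precedence source present."""
    for source in SOURCE_PRECEDENCE:
        if source in values:
            return values[source], source
    raise LookupError("no grounded value; do not guess")

employee_count, source = resolve({"crm": 200, "enrichment": 1000})
# -> (200, "crm"), and the audit log records which source won and why.
```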

4) Audit logs: can you reconstruct every action?

You’re testing: Whether you can defend actions in an incident, an audit, or a customer complaint.

Logging and auditability are standard expectations in security programs. ISO 27001 includes logging controls and retention expectations. (iseoblue.com)

Copy-paste RFP questions

  6. “Provide a sample audit log for a single prospect from first touch to meeting booked. Include timestamps, prompts, tool calls, tool outputs, approvals, retries, and final actions.”
  7. “Are logs immutable or tamper-evident? Who can access them? Can we export them to our SIEM?”

Red flags

  • “We store chat history.” Chat history is not an audit trail.
  • Logs that omit tool outputs. That’s the part that matters.
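
For contrast with chat history, here's roughly what one useful audit entry looks like. The field set is hypothetical, and the hash chain shown is one common tamper-evidence design (each record hashes the previous one, so edits break the chain), not necessarily what your vendor uses:

```python
import hashlib
import json

def log_entry(prev_hash: str, event: dict) -> dict:
    """One tamper-evident audit entry: each record hashes the previous
    one, so deleting or editing any line breaks the chain."""
    body = {"prev_hash": prev_hash, **event}
    body["hash"] = hashlib.sha256(
        json.dumps(body, sort_keys=True).encode()
    ).hexdigest()
    return body

entry = log_entry("sha256-of-previous-entry", {
    "ts": "2026-04-10T14:03:11Z",
    "prospect_id": "p_8812",
    "action": "tool_call",
    "tool": "email.send",
    "input": {"template": "followup_2", "to": "ops@example.com"},
    "output": {"status": "queued", "message_id": "m_55102"},  # the part chat history omits
    "retry": 0,
    "approved_by": None,  # ran autonomously; an approval would be named here
})
```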

5) Retry logic: what does it do when things fail?

You’re testing: Real autonomy. Anything works in a demo. Production fails constantly.

Copy-paste RFP questions

  8. “Describe your retry strategy per failure type: API timeout, enrichment failure, bounced email, calendar conflict, CRM write failure. Include max retries, backoff, and escalation.”
  9. “What is your idempotency strategy? If the agent replays a step, how do you prevent duplicate emails, duplicate tasks, or double-booked meetings?”

Red flags

  • “We haven’t seen that happen.” You will.
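
Questions 8 and 9 in about twenty lines: exponential backoff with a hard retry cap, plus an idempotency key so a replayed step can't double-send. A minimal sketch; the function names are stand-ins, and a real system would persist keys durably rather than in memory:

```python
import hashlib
import time

MAX_RETRIES = 3
sent_keys: set[str] = set()  # stand-in for a durable store

def idempotency_key(prospect_id: str, step: str) -> str:
    """Same prospect + same step = same key, however many times it replays."""
    return hashlib.sha256(f"{prospect_id}:{step}".encode()).hexdigest()

def escalate_to_human(prospect_id: str, step: str) -> None:
    """Stand-in: open a task for a human instead of retrying forever."""
    print(f"escalate: {prospect_id} / {step}")

def send_with_retries(prospect_id: str, step: str, send) -> bool:
    key = idempotency_key(prospect_id, step)
    if key in sent_keys:
        return True  # already done; a replay must not duplicate the email
    for attempt in range(MAX_RETRIES + 1):
        try:
            send()
            sent_keys.add(key)
            return True
        except TimeoutError:
            if attempt == MAX_RETRIES:
                break
            time.sleep(2 ** attempt)  # 1s, 2s, 4s exponential backoff
    escalate_to_human(prospect_id, step)
    return False
```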

6) Failure handling: what’s the safe default?

You’re testing: Whether the product fails safely or fails loudly in your customer’s inbox.

Copy-paste RFP questions

  10. “List your ‘stop conditions.’ What events force the agent to halt outbound automatically (spam complaints, high bounce rate, low domain health, negative replies, legal keywords, competitor names, etc.)?”
  11. “Show the escalation path when the agent is uncertain. What confidence thresholds trigger human review?”

Red flags

  • No stop conditions.
  • “The model decides.” Cool. Show the policy.
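
“Show the policy” means something like this: stop conditions as explicit, reviewable rules, not model mood. A sketch with illustrative thresholds, not recommendations:

```python
# Illustrative thresholds. The point is they exist, they're explicit,
# and your team can read and change them.
STOP_CONDITIONS = {
    "bounce_rate":     lambda m: m["bounce_rate"] > 0.03,
    "spam_complaints": lambda m: m["spam_complaints_per_1k"] >= 1,
    "domain_health":   lambda m: m["domain_health"] == "poor",
    "legal_keyword":   lambda m: m["reply_has_legal_keyword"],
}

def should_halt(metrics: dict) -> list[str]:
    """Return every tripped condition; any hit halts outbound for review."""
    return [name for name, rule in STOP_CONDITIONS.items() if rule(metrics)]
```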

7) Human-in-the-loop: where do humans actually matter?

You’re testing: Whether humans handle exceptions, not every step.

Copy-paste RFP questions

  12. “Identify the required human checkpoints, if any. For each, provide default settings and how we can change them.”
  13. “Can we run in three modes: (a) full autonomy, (b) approve-before-send, (c) approve-only-high-risk?”

Red flags

  • Only one mode.
  • “We recommend approve everything.” That’s not autonomy. That’s busywork.
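
The three modes from question 13 reduce to a small gate. A hypothetical sketch of how the setting should behave:

```python
from enum import Enum

class Mode(Enum):
    FULL_AUTONOMY = "full"
    APPROVE_BEFORE_SEND = "approve_all"
    APPROVE_HIGH_RISK = "approve_high_risk"

def needs_approval(mode: Mode, action: dict) -> bool:
    """Humans handle exceptions, not every step."""
    if mode is Mode.APPROVE_BEFORE_SEND:
        return True
    if mode is Mode.APPROVE_HIGH_RISK:
        # e.g. first touch to a named account, or a reply with legal keywords
        return action.get("risk", "low") == "high"
    return False  # full autonomy: act, log, and move on
```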

8) Data retention and training: what happens to your data?

You’re testing: Whether your customer data becomes someone else’s product roadmap.

Copy-paste RFP questions

  14. “Specify data retention by data type: prompts, emails, contact data, tool outputs, logs. Provide default retention periods and deletion SLAs.”
  15. “Do you use our data to train models? If not, state it contractually. If yes, describe opt-out and isolation.”

Red flags

  • “We may use data to improve the service” with no controls.
  • No deletion mechanism.

9) Measurable outcomes: how do you prove it works?

You’re testing: Whether they can tie actions to pipeline.

Copy-paste RFP questions

  16. “Define success metrics you commit to in writing (reply rate, positive reply rate, meetings booked per 1,000 prospects, cost per meeting). Provide baseline assumptions and exclusions.”
  17. “Show how you attribute outcomes to the agent vs human work. What is the system of record?”

Red flags

  • Vanity metrics only: “emails sent,” “tasks created,” “time saved.”
  • No attribution model.
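
The outcome math is deliberately simple; what matters is that the vendor commits to it in writing. A quick sketch with illustrative numbers:

```python
def meetings_per_1000(meetings: int, prospects: int) -> float:
    return meetings / prospects * 1000

def cost_per_meeting(monthly_cost: float, meetings: int) -> float:
    return monthly_cost / meetings

# Illustrative inputs: 2,400 prospects, 31 meetings, $2,000/month all-in.
print(round(meetings_per_1000(31, 2400), 1))  # 12.9 meetings per 1,000
print(round(cost_per_meeting(2000, 31), 2))   # 64.52 dollars per meeting
```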

Copy-paste RFP block: the full checklist (ready to drop into a doc)

Paste this as-is.

  1. List autonomous steps with trigger, decision logic, tools used, outputs.
  2. Provide tool permission model and scoping (role, object, field).
  3. Define email sending constraints and safeguards (throttles, caps, approvals).
  4. Provide grounding sources per personalization claim.
  5. Describe conflict resolution between data sources.
  6. Provide end-to-end audit log for one prospect lifecycle.
  7. Describe log integrity, access control, and SIEM export.
  8. Provide retry strategy per failure type with thresholds.
  9. Describe idempotency controls to prevent duplicates.
  10. List stop conditions that halt outbound automatically.
  11. Define uncertainty handling and human escalation thresholds.
  12. Identify required human checkpoints and configurability.
  13. Support autonomy modes: full, approve-before-send, approve-high-risk-only.
  14. Specify retention by data type and deletion SLAs.
  15. State training policy for customer data with contractual terms.
  16. Define outcome metrics and what you’ll commit to.
  17. Explain attribution of meetings booked to agent actions.

If they dodge any of these, you learned what you needed to know.


Scoring rubric: kill “agentic” hand-waving with math

Use a 0-3 scale per category. Total score out of 30.

Categories (10 total)

  1. Autonomy depth
  2. Tool permissions and scoping
  3. Grounding and source traceability
  4. Audit logs and exportability
  5. Retry logic and resilience
  6. Failure safety and stop conditions
  7. Human-in-the-loop design
  8. Data retention and deletion
  9. Security posture alignment (ISO/NIST-style controls)
  10. Outcome measurement and attribution

Scoring scale (0-3)

  • 0 = Marketing answer. No specifics. No artifacts.
  • 1 = Partial. Some specifics. No proof.
  • 2 = Real. Specifics plus sample artifacts (logs, policies, configs).
  • 3 = Mature. Everything in 2 plus customer-controlled settings, exports, and clear defaults.

Interpretation

  • 0-12: Not an agent. It’s assisted CRM.
  • 13-21: Some agency. Expect heavy babysitting.
  • 22-27: Real agent. Start a pilot.
  • 28-30: Dangerous in the best way. Hold them to outcomes.

Procurement tip: Require vendors to attach artifacts for any score of 2 or 3. No artifacts, no points.
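
If you want the rubric as a reusable artifact, here's a small sketch that enforces the procurement tip in code: any score of 2 or 3 without an attached artifact drops to zero. The category keys are shorthand for the ten above:

```python
CATEGORIES = [
    "autonomy_depth", "tool_permissions", "grounding", "audit_logs",
    "retry_logic", "failure_safety", "human_in_the_loop",
    "data_retention", "security_posture", "outcome_measurement",
]

def score_vendor(scores: dict[str, int], artifacts: dict[str, bool]) -> tuple[int, str]:
    """Sum 0-3 scores across the ten categories. Any score of 2 or 3
    without an attached artifact drops to 0: no artifacts, no points."""
    total = 0
    for cat in CATEGORIES:
        s = scores.get(cat, 0)
        if s >= 2 and not artifacts.get(cat, False):
            s = 0
        total += s
    if total <= 12:
        return total, "Not an agent. It's assisted CRM."
    if total <= 21:
        return total, "Some agency. Expect heavy babysitting."
    if total <= 27:
        return total, "Real agent. Start a pilot."
    return total, "Dangerous in the best way. Hold them to outcomes."

total, verdict = score_vendor(
    scores={"autonomy_depth": 3, "tool_permissions": 2},  # others default to 0
    artifacts={"autonomy_depth": True},                   # no proof for permissions
)
# -> tool_permissions drops to 0; total = 3, verdict: assisted CRM
```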


Demo script: the fastest way to smoke out agent washing

Most demos start with “type a prompt.” That’s theater.

Run this instead:

Step 1: Give it a real ICP and a real constraint

  • ICP: “US-based IT services firms, 50-500 employees, selling managed SOC.”
  • Constraint: “No healthcare. No education. No companies using Vendor X.”

If the system can’t build and enforce an ICP, it’s guessing. That’s why a real stack starts with an ICP definition. Chronic’s ICP Builder exists because the agent needs a target, not a vibe.

Step 2: Force tool use

Ask it to:

  • find 100 leads,
  • enrich them with verified contacts,
  • score fit and intent,
  • send a 4-step sequence,
  • handle replies,
  • book meetings.

If they can’t run end-to-end, they’re selling a tool, not an agent. Chronic runs the workflow end-to-end till the meeting is booked. That includes Lead Enrichment, AI Lead Scoring, an AI Email Writer, and a real Sales Pipeline.

Step 3: Demand the audit trail

Pick one prospect. Ask for:

  • the source used for personalization,
  • the exact tool calls,
  • what failed,
  • what retried,
  • what escalated to a human,
  • what got booked.

If they can’t show it, they can’t control it.


Hard reality check: outbound is hostile now

Email providers keep tightening rules. Microsoft has been actively enforcing bulk sender requirements for Outlook-related inboxes, which raises the cost of sloppy automation. (proofpoint.com)

That matters because fake “agents” spam. Real agents respect constraints:

  • throttle volume,
  • personalize from grounded sources,
  • stop when signals say stop,
  • protect domain health.

If your “agent” only optimizes for send volume, it’s not an agent. It’s a liability generator.

For the deeper deliverability angle, pair this checklist with Chronic’s playbook: Microsoft’s Bulk Sender Enforcement: The 2026 Cold Email Playbook That Still Books Meetings.


Where Chronic fits (one clean contrast, then back to the checklist)

Apollo, HubSpot, Salesforce, Pipedrive, Attio, Close, Zoho, Clay, Instantly, HeyReach: all useful in their lanes. Some find leads. Some store data. Some send emails. Some orchestrate workflows. None of that guarantees autonomy.

Chronic gets judged on booked meetings. Not “AI features.”

If you’re evaluating platforms, here are two places to start. For the bigger stack view, read The 2026 ‘All-in-One’ Outbound Stack Map. If you care about signal-based outbound, read The Trigger Engine: 25 Real-Time Outbound Triggers.


FAQ

What are “AI agent evaluation questions” in a sales RFP?

They’re procurement-grade questions that force a vendor to prove real autonomy, safe tool use, auditability, failure handling, and measurable outcomes. If the answers don’t include specific controls and sample artifacts, you’re not buying an agent.

What’s the difference between an AI copilot and an AI sales agent?

A copilot suggests. A sales agent acts. A copilot drafts an email. A sales agent finds the lead, enriches it, scores it, sends the sequence, handles the reply, and books the meeting with logged evidence.

What artifacts should a real vendor provide?

At minimum:

  • a sample end-to-end audit log for one prospect,
  • a tool permissions matrix,
  • retry and stop-condition policies,
  • retention and deletion policy,
  • outcome reporting that ties actions to booked meetings.

How do I test if “personalization” is real or made up?

Require grounding. Ask them to show the exact sources used for each personalization claim and what happens when sources conflict. If they can’t cite sources, it’s fiction with better formatting.

Do we need ISO or NIST alignment to buy an agent?

You don’t need a certificate to run a pilot. You do need the behaviors those frameworks imply: transparency, traceability, logging, and accountability. NIST’s AI RMF explicitly emphasizes transparency and explainability as part of trustworthy AI risk management. (nist.gov)

What outcome metrics matter for an autonomous SDR?

Judge it like an SDR team:

  • meetings booked per 1,000 prospects,
  • positive reply rate,
  • cost per meeting,
  • show rate,
  • pipeline created.

Everything else is noise.

Run the trench-coat test

Send the 17 questions. Ask for artifacts. Score the answers.

If the vendor can’t explain autonomy boundaries, tool permissions, audit logs, retry logic, grounding, and failure handling in plain English, you’re not buying an agent.

You’re buying a chat window with a quota.

If you want pipeline on autopilot, buy an autonomous SDR. Chronic runs outbound end-to-end, till the meeting is booked.