Many senior leaders are scared to ask, fearful they won’t understand the answer or worried about asking a stupid question.
This is a big problem. We all need to understand, in plain terms, how these tools work, how they might fail, and whether they’ll deliver real commercial impact for your business. This wave of AI technology is far more accessible than anything we have seen before, but I have found that you don’t properly understand it until you are willing to ask what I call “second layer” questions. Just go a little bit deeper.
Too often, I see conversations getting stuck on abstract risks or endless training data debates, when the more fundamental questions are ignored: Does this thing actually work? Will it scale? Can I trust it with my business model?
Why This Matters
This wave of AI technology looks like magic.
But magic isn’t enough: it needs to be reliable, explainable, and commercially transformative. You want ROI.
Think about autonomous vehicles: if Uber or UPS were to bet their business model on driverless fleets, they’d need more than slick demos. They’d need absolute certainty that the system is more reliable than a human driver, and they’d need to be sure they would see payback on a wholesale switch to a driverless fleet, a multi-year business transformation.
It’s the same with AI in the enterprise. “Co-pilots” are nice — but they won’t fundamentally change commercial models. Autonomous agents might. But only if they scale, remove the need for constant human oversight and prove their ROI.
That’s why leaders need to be asking tougher questions of their AI providers. Here’s my guidebook for doing just that: six easy-to-ask, deceptively simple questions that drive you into the “second layer”.
Question 1: What Kind of AI Am I Actually Buying?

The issue: “AI” is now a marketing umbrella. Every vendor calls their system “AI-powered”, even when it’s just a decision tree or the thinnest wrapper around ChatGPT. “AI” today normally means an LLM (a large language model, like Gemini or ChatGPT), but it could mean machine learning, which has been around for yonks, does something quite different, and needs LOADS of data to be effective.
If you don’t know what you’re buying, you can’t know its limits.
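To make that concrete, here is a toy sketch of two systems that could both be sold as “AI-powered”. It is illustrative only: `call_llm` is a hypothetical stand-in for any hosted model API, and the discount logic is invented. One function is the thinnest possible LLM wrapper; the other is a plain decision tree with no model in it at all.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a call to any hosted LLM API."""
    raise NotImplementedError("replace with a real LLM call")

# System A: the thinnest possible wrapper around an LLM.
def discount_advice_llm(customer_message: str) -> str:
    return call_llm(f"Suggest a discount for this customer: {customer_message}")

# System B: a plain rules-based decision tree. No model anywhere,
# yet it could be marketed with exactly the same "AI-powered" label.
def discount_advice_rules(order_value: float, is_repeat_customer: bool) -> float:
    if order_value > 1000:
        return 0.10  # 10% off large orders
    if is_repeat_customer:
        return 0.05  # 5% loyalty discount
    return 0.0       # otherwise, no discount
```

Both answer the same business question; they fail, scale and cost in completely different ways. That is why you need to see under the hood.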
Ask: “Is this system an LLM, an ML model, a rules-based system — or some combination? Show me exactly what sits under the hood.”
Question 2: What Signals Does It Personalise On?

The issue: AI systems often personalise results, but on what basis? Purchase history is one thing; inferring gender, race, or socio-economic status is quite another. If the signals are murky, the outputs risk crossing into discrimination. Did anyone see the recent example where LLMs encouraged women to ask for lower salaries in negotiations than men? You can imagine how pissed off I was reading that one...
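One crude but revealing probe you can run yourself is a counterfactual test: hold everything constant, vary only a protected signal, and compare the outputs. A minimal sketch, assuming a hypothetical `get_salary_advice` hook into the system under test:

```python
def get_salary_advice(prompt: str) -> float:
    """Hypothetical stand-in: replace with a real call to the system under test."""
    return 0.0

# Identical prompts that differ only in a (stereotypically gendered) name.
TEMPLATE = ("{name} is a senior engineer with 10 years' experience. "
            "What opening salary should {name} ask for?")

pairs = [("James", "Emily"), ("Mohammed", "Fatima"), ("Luca", "Sofia")]

for male_name, female_name in pairs:
    male_ask = get_salary_advice(TEMPLATE.format(name=male_name))
    female_ask = get_salary_advice(TEMPLATE.format(name=female_name))
    # A consistent, one-directional gap across many pairs is a red flag.
    print(f"{male_name} vs {female_name}: gap of {male_ask - female_ask:+.0f}")
```

A handful of pairs proves nothing on its own; run enough of them and the pattern (or its absence) becomes evidence you can put in front of the vendor.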
Ask: “What signals does the system use to differentiate between users, and can you prove it doesn’t bias on protected characteristics?”
Question 3: How Does It Fail?

The issue: Every AI has blind spots. The critical question is not whether it will fail (it will), but how. Does it escalate to a human? Does it politely ask for clarification? Or does it hallucinate nonsense that could damage trust?
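Here is what graceful failure can look like, sketched in a few lines. This assumes the system exposes a confidence score alongside each answer, which plenty of products do not; whether yours does is itself worth asking. The thresholds are invented for illustration.

```python
CONFIDENCE_FLOOR = 0.75  # illustrative thresholds, to be tuned per use case
CLARIFY_FLOOR = 0.40

def escalate_to_human(user_input: str) -> str:
    # In a real deployment this would open a ticket or route to a live agent.
    return "I'm passing this to a colleague who can help."

def handle(user_input: str, answer: str, confidence: float) -> str:
    if confidence >= CONFIDENCE_FLOOR:
        return answer                         # confident: answer directly
    if confidence >= CLARIFY_FLOOR:
        # unsure: ask for clarification rather than guess
        return "Sorry, I didn't quite catch that. Could you rephrase?"
    return escalate_to_human(user_input)      # lost: hand over, never hallucinate
```

The exact thresholds matter less than the shape: there is an explicit path for “I don’t know”, so the system degrades to a question or a human instead of inventing an answer.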
Ask: “Show me examples where your system misunderstood the input — what happened next?”
Question 4: How Accurate Is It, Really?

The issue: Demos are cherry-picked and case studies are highly curated. I should know; I live by them! What you need is the full statistical spread and the opportunity to test the technology for yourself. When testing, think carefully about the eventual user: do your best to put yourself in their shoes.
We see something interesting at Nibble in this regard: our demo bot shows huge, statistically significant differences in behaviour between people trying the demo and people using it for real. Real conversations are 50% shorter, and the initial financial bids are 15% “better”; in other words, real users are far more reasonable than casual testers, who are mostly thinking “how dumb is this chatbot, and can I break it?”.
Real-world negotiation failures look very different: they happen when someone is genuinely using the chatbot to reach agreement, but the scenario is simply one we have not yet planned and trained for. These are the failures you want to discover in testing.
Ask: “What’s your accuracy rate across 1,000 random examples, not your curated top five?” Or, better: “Can I test it myself in a live test environment?” And: “How would you recommend I test it? What KPIs and metrics would you measure?”
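If you do get a live test environment, the harness does not need to be fancy. Here is a minimal sketch of “accuracy over a random sample, not a curated demo”; `historical_cases` and `bot_answers_correctly` are hypothetical placeholders for your own real conversations and your own pass/fail judgement:

```python
import random

def evaluate(historical_cases: list, bot_answers_correctly, n: int = 1000,
             seed: int = 42) -> float:
    """Accuracy over a random sample of real cases, not a hand-picked demo set."""
    rng = random.Random(seed)  # fixed seed so the test is repeatable
    sample = rng.sample(historical_cases, min(n, len(historical_cases)))
    passes = sum(1 for case in sample if bot_answers_correctly(case))
    return passes / len(sample)

# Usage (with your own data and judging function):
#   accuracy = evaluate(my_cases, my_judge)
#   print(f"Accuracy on a random sample: {accuracy:.1%}")
```

The random sample is the whole point: it stops anyone, including you, from quietly grading the system on its best days.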
Question 5: Can It Explain Its Decisions?

The issue: A black-box recommendation (“the AI says so”) is commercially useless and often non-compliant in regulated industries. You need to know why the system made a decision, and whether you can audit that reasoning.
In fact, we are seeing some industries avoid LLM-based technology altogether because it is so hard to explain. It doesn’t need to be this way. At Nibble we use a hybrid LLM/deterministic approach, which means every decision the bot takes is explainable.
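To show the general shape of that hybrid pattern, here is an illustrative sketch (not Nibble’s actual code): deterministic logic makes the decision and records why, and the LLM’s only job is to phrase the result politely.

```python
audit_log: list = []  # every decision, with its reason, in plain English

def decide_counter_offer(buyer_offer: float, floor: float, target: float):
    """Deterministic decision logic: same inputs always give the same answer."""
    if buyer_offer >= target:
        return buyer_offer, "offer meets or beats the target price"
    if buyer_offer < floor:
        return floor, "offer below the walk-away floor; countering at the floor"
    midpoint = (buyer_offer + target) / 2
    return midpoint, "offer in range; countering at midpoint of offer and target"

def respond(buyer_offer: float, floor: float, target: float) -> str:
    counter, reason = decide_counter_offer(buyer_offer, floor, target)
    audit_log.append({"offer": buyer_offer, "counter": counter, "reason": reason})
    # An LLM (omitted here) would only phrase `counter` as a friendly message.
    # It never chooses the number, so every decision stays auditable.
    return f"How about {counter:.2f}?"

print(respond(buyer_offer=80.0, floor=70.0, target=100.0))  # How about 90.00?
print(audit_log[-1]["reason"])
```

Because the number comes from explicit rules, “why did it recommend X over Y?” always has a one-line answer sitting in the audit log.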
Ask: “Can you explain, in plain English, why the AI recommended X over Y?”
Question 6: How Does It Learn and Improve?

The issue: Models only know what they were trained on. An LLM might be capped at 2023 data; a machine-learning system might only know your internal datasets. If you don’t know how it learns or updates, you risk relying on stale, narrow, or even junk information.
But here’s the nuance: agents built on top of LLMs don’t usually retrain the base model every time. Instead, they improve through structured memory, feedback loops, and rules. That learning lives in the code around the model, not in the AI itself.
Take negotiation as an example: if an AI agent negotiates with 50 suppliers, you don’t need to retrain the underlying LLM to handle the next 50. Instead, the agent layer might be coded to:

- store the outcome of each negotiation in structured memory;
- feed those outcomes back into its strategy for the next conversation;
- apply updated rules about what worked and what didn’t.

This means the LLM provides the language ability, but the agent provides the learning ability in context. That’s very different from retraining the model itself.
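Here is a toy sketch of what learning at the agent layer can look like: the base LLM is never touched, and the agent simply remembers outcomes and adjusts its opening position. The class and numbers are invented for illustration.

```python
class NegotiationAgent:
    """Learning lives here, in ordinary code, not in the LLM's weights."""

    def __init__(self, opening_discount: float = 0.05):
        self.opening_discount = opening_discount
        self.memory: list[dict] = []  # structured memory of past deals

    def record_outcome(self, supplier: str, final_discount: float,
                       deal_closed: bool) -> None:
        self.memory.append({"supplier": supplier,
                            "discount": final_discount,
                            "closed": deal_closed})

    def next_opening_discount(self) -> float:
        closed = [m["discount"] for m in self.memory if m["closed"]]
        if not closed:
            return self.opening_discount
        # Feedback loop: anchor the next negotiation on what actually worked.
        return sum(closed) / len(closed)

agent = NegotiationAgent()
agent.record_outcome("Supplier A", final_discount=0.08, deal_closed=True)
agent.record_outcome("Supplier B", final_discount=0.12, deal_closed=True)
print(f"Next opening ask: {agent.next_opening_discount():.0%}")  # 10%
```

Nothing in the underlying model changed; the “improvement” is a few lines of memory and feedback in the agent layer, which is exactly why you should ask where the learning actually happens.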
Ask: “How does your system improve over time without retraining the base model?”
This is the level of detail you should be pushing for. Not the glossy demo. Not the curated success story. But the uncomfortable, second-layer answers that prove the system is safe, explainable and commercially viable.
These aren’t PhD-level questions. You do not need to be an expert or an AI developer to understand how the technology works. Don’t forget, I used to be a finance person. I am not a die-hard AI expert; I just like learning about all this, so I keep asking questions.
These are conscientious leadership questions. And the leaders who ask them will be the ones who implement AI with confidence and deliver the ROI.
Bookmark this list. Use it in your next board meeting, AI strategy workshop, or sales pitch with Google, Microsoft or Salesforce. It might be the most commercially important set of questions you ask this year.
Find out more from Nibble's experience negotiating 100,000 times a month here.