You may know The Riddle of the Two Guards.
It’s a classic riddle where there are two gates, each with a gatekeeper. One gate safely leads to a traveller’s destination. One gate leads to destruction.
The traveller doesn’t know which gate is which, so they ask the guards which way to go. But the traveller is allowed to ask only one question.
However, one guard always tells the truth. The other guard always lies.
So which single question could the traveller ask in order to find out which gate leads safely to their destination?
Spoiler alert: If you’ve not heard it before and want to try to figure the riddle out yourself (or you’ve not seen Labyrinth), don’t read on.
It might seem like an impossible challenge because one guard always lies and one always tells the truth. How can you determine which is the correct door by asking just one question?
You solve the puzzle by asking one guard:
“What would the *other* guard say is the correct gate?”
Because one lies and the other tells the truth, the answer you receive is *always* going to be wrong: the truthful guard accurately reports the liar’s false answer, and the lying guard misreports the truthful guard’s correct answer. Either way you hear the wrong gate, so the solution is to choose the opposite one.
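The cancelling of truth and lie can be checked mechanically. Here is a minimal sketch of the riddle in Python; the guard functions and gate labels are illustrative, not from any established formulation:

```python
SAFE, DEADLY = "safe gate", "deadly gate"

def truth_teller(claim: str) -> str:
    # Always reports a claim exactly as it is.
    return claim

def liar(claim: str) -> str:
    # Always reports the opposite gate.
    return DEADLY if claim == SAFE else SAFE

def ask_what_other_would_say(guard, other, safe_gate: str) -> str:
    # "What would the *other* guard say is the correct gate?"
    # The guard works out the other's answer, then relays it
    # honestly or dishonestly according to its own nature.
    others_answer = other(safe_gate)
    return guard(others_answer)

# Whichever guard we happen to ask, one layer of lying is always
# applied exactly once, so the reported gate is always the wrong one.
for guard, other in [(truth_teller, liar), (liar, truth_teller)]:
    answer = ask_what_other_would_say(guard, other, SAFE)
    assert answer == DEADLY  # always wrong, so take the opposite gate
```

Both orderings produce the same (wrong) answer, which is exactly why taking the opposite gate works regardless of which guard you asked.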
The puzzle works because the system contains deception as a built-in possibility.
Even though one guard is always truthful, the entire system and its solution are built on dishonesty.
LLMs frequently use phrases that directly signal sincerity:
“To be honest…”
“Frankly…”
“In my opinion…”
“The reality is…”
“My honest take is…”
“You can trust me.”
It sounds reassuring. But logically, it introduces the same structure as the gatekeeper puzzle.
For even a rhetorical claim of honesty to be meaningful, the AI must be capable of both:
- Truthful responses
- Misleading or incorrect responses
Otherwise the claim of honesty would be meaningless.
In other words:
- A system that cannot lie doesn’t need to claim honesty.
- A system that claims honesty implies it can also fail to be honest.
This creates an honesty paradox similar to the gatekeeper puzzle: a system that only works because dishonesty is built into it.
If the system can mislead, its claim of honesty cannot be taken at face value.
If it cannot mislead, the claim adds no information.
Which leaves us in a strangely familiar position.
When a machine assures us it is being honest, we are effectively standing in front of the same two gates as the traveller in the riddle. One path may be correct. The other may confidently lead us in the wrong direction. And the system itself cannot resolve the uncertainty simply by asserting sincerity.
The safest approach is not to treat the machine as a trustworthy guide, but as a source of answers that still need checking.
In other words, we might apply the same strategy that solved the original puzzle.
Always take the opposite gate.