Here is the most trippy thing I know in the AI safety space, as far as I understand it
Suppose you want to program a system to make reasonable decisions in full generality. You then need to be able to solve this extremely difficult and complicated dilemma:
Choose between taking a 5 dollar bill and a 10 dollar bill.The reason this is complicated is that, assuming you take the 5 dollar bill, you can prove "If I took the 10 dollar bill, I would get 0$.". This is because "I take the 10 dollar bill" is false, and "false implies x" is true for any x.
And worse, you can also prove "If I take the 5 dollar bill, I get 5$; if I take the 10$ bill, I get 0$". This is because a thing called Löb's theorem, which says that the following sentence is true and provable in any formal system that is at least as strong as basic arithmetic, and for any P:
The box means "provable" so this sentence says "if you can prove that [a proof of P implies P], then you can prove P". This sounds totally insane because obviously a proof of P should imply P; this shouldn't mean P is provable. But for reasons I don't quite understand, it is a theorem.
The natural language analog to Löb's theorem is roughly this:
ErrinF created the greatest dominion expansion of all time. Don't believe me? Here is the water-tight proof.
Let S be the sentence, 'if S is true, ErrinF created the greatest dominion expansion of all time'
Suppose S is true. Then the above is true, i.e., it's true that, 'if S is true, ErrinF created the greatest dominion expansion of all time' -- and since S is, in fact, true, that means ErrinF created the greatest dominion expansion of all time.
Thus we have shown that, if S is true, ErrinF created the greatest dominion expansion of all time.
Since the above sentence is precisely S, that means we have shown that S is true -- without assuming anything.
As shown above, this means that ErrinF created the greatest dominion expansion of all time.
(You can't do this in a formal language because a sentence can't be about itself, i.e., you can't define S := S -> X, but you can
can construct a sentence S that is equivalent to []S -> X. With [] being the "provable" box. I even used to know how to do this.)
So if your decision algorithm does the provably better thing, then if P = "if I take the 5 dollar bill, I get 5$; if I take the 10 dollar bill, I get 0$", then you can prove that []P -> P. That is because []P means you've proven P, i.e., you've proven that taking 5$ gives 5$ but taking 10$ gives 0$, and since again you do the provably better thing, you take the 5$.
So you've proven []P -> P, and because of Löb's theorem, you can thus prove P
and then you take the 5$.