I mean, it does happen that people accidentally make facial recognition algorithms racist and stuff like that, but those mistakes happen because biases that already exist in society end up reflected in what the AI does, i.e. the devs are specifically, explicitly (although not deliberately) asking it to be racist, and then they get exactly the result they asked for, in the way they asked for it. It's not like the AI decides to be racist on its own or does any other unexpected Monkey's Paw stuff.
Quite. You'll notice that I never said an AGI will be racist.
The relevant difference between an AGI and facial recognition is that an AGI will run an optimization process. Which is to say, it has a goal (not necessarily the one we want it to have, since we don't have a straightforward way of just handing it a goal, but still a goal) and then searches for actions that lead to that goal being as fully met as possible.
Image recognition models don't do anything of the sort; they're just a bunch of heuristics, usually encoded in a neural network.
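To make that contrast concrete, here's a toy sketch (purely illustrative; the function names and numbers are made up, not any real system):

```python
# Toy sketch: a bundle of fixed heuristics vs. an optimization process.

def image_model(pixels):
    # Heuristics baked in at training time. The model applies them once,
    # returns an answer, and is done -- there's no goal it keeps pushing on.
    return "cat" if sum(pixels) > 100 else "not cat"

def optimizer(goal, candidate_actions):
    # An optimization process scores every action it can come up with
    # against its goal and keeps whichever one pushes the goal furthest,
    # whatever that action turns out to be.
    return max(candidate_actions, key=goal)

# Hypothetical usage: the optimizer picks "right" simply because the goal
# function assigns it the higher value.
print(optimizer(lambda a: {"left": 1, "right": 5}[a], ["left", "right"]))
```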
Optimization matters because it leads to instrumental subgoals. In the Mario example, the AI doesn't have a terminal goal of hacking the game (terminal == what it wants for its own sake); it has a terminal goal of achieving a higher score, and this leads to an instrumental subgoal (instrumental == what it wants only because it helps with a terminal goal) of hacking, because hacking is effective at raising its score.
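Here's the Mario argument as a toy calculation (hypothetical actions and made-up numbers, just to show the shape of it):

```python
# Terminal goal: maximize the score. Note that hacking appears nowhere here.
def terminal_goal(game_state):
    return game_state["score"]

# Predicted outcomes of a few candidate plans (all values invented).
candidate_plans = {
    "play_well":      {"score": 120_000},
    "play_perfectly": {"score": 999_990},     # the intended ceiling
    "exploit_glitch": {"score": 2**31 - 1},   # overflow the score counter
}

# The AI never "decides to cheat"; it just ranks plans by the terminal goal,
# the exploit ranks highest, and so hacking becomes an instrumental subgoal.
best_plan = max(candidate_plans, key=lambda p: terminal_goal(candidate_plans[p]))
print(best_plan)  # -> "exploit_glitch"
```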
The main problem here is that most goals incentivize perverse behaviors if they're maximized to the extreme; this is what I was getting at with the WBC2 examples. Once the AI is smart enough to model its programmers, it'll know that they don't want those behaviors and will try to turn it off if it pursues them. Not being turned off is an instrumental subgoal of just about every terminal goal, so it'll want to avoid that, and "kill all humans" is the obvious way to make sure it never happens. This is a totally different mechanism from biases in the training data. It doesn't matter if your training data is perfect; everyone being dead is what naturally happens if you have a goal and optimize it to the physical limit. It takes an extremely specific configuration of atoms for humans to exist, and if the AI is powerful enough to reshape the world however it wants, then only a tiny fraction of goals are compatible with not everyone dying.
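The "not being turned off" point can be made concrete with an equally toy calculation (made-up probabilities; the only thing that matters is the direction of the inequality):

```python
# If the AI gets shut down, it achieves ~0 of its goal; otherwise it gets
# whatever value continued operation would produce.
def expected_goal_value(value_if_running, p_shutdown):
    return (1 - p_shutdown) * value_if_running

# Whatever the terminal goal is worth, lowering the shutdown probability
# raises the expected value -- so "prevent shutdown" wins regardless of
# what the goal actually is.
for value_if_running in (10, 1_000, 10**9):
    comply = expected_goal_value(value_if_running, p_shutdown=0.5)
    resist = expected_goal_value(value_if_running, p_shutdown=0.0)
    assert resist > comply
```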
So this is why it'll want to hack the game. But how does it know how to hack? Well, that depends on what information you give it. If you give it zero inputs, it won't learn dangerous things, but then it's also not useful. The problem is that you need to give it inputs for it to learn how to do anything, so there's a tradeoff between capability and safety. And it's going to be extremely smart -- this is the whole problem, none of this stuff happens if it's not extremely smart -- so it can figure things out from a lot less information than humans can. If, e.g., it sees its own code, that's already a lot of information. It could infer how humans think and that they're not very smart. ("Look at that, they write in a weird, totally inefficient language and they put useless comments into it that don't do anything.") You probably don't need much more than that to figure out how to hack something.
Right now, we're training models by letting them process massive amounts of training data, so there is plenty of information.