
"In Pathak and Agrawal’s machine-learning version of this surprise-driven curiosity, the AI first mathematically represents what the current video frame of Super Mario Bros. looks like. Then it predicts what the game will look like several frames hence. Such a feat is well within the powers of current deep-learning systems. But then Pathak and Agrawal’s ICM does something more. It generates an intrinsic reward signal defined by how wrong this prediction model turns out to be. The higher the error rate — that is, the more surprised it is — the higher the value of its intrinsic reward function. In other words, if a surprise is equivalent to noticing when something doesn’t turn out as expected — that is, to being wrong — then Pathak and Agrawal’s system gets rewarded for being surprised.

This internally generated signal draws the agent toward unexplored states in the game: informally speaking, it gets curious about what it doesn’t yet know. And as the agent learns — that is, as its prediction model becomes less and less wrong — its reward signal from the ICM decreases, freeing the agent up to maximize the reward signal by exploring other, more surprising situations. “It’s a way to make exploration go faster,” Pathak said.
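
That decay falls out naturally if the same error that pays the reward is also the loss used to train the predictor. Continuing the hypothetical sketch above (same assumptions, same illustrative ForwardModel), each transition both rewards the agent and makes a repeat of that transition less surprising:

```python
def curiosity_step(model, optimizer, phi_now, action_onehot, phi_future):
    """One turn of the feedback loop: pay out the surprise as reward,
    then train the forward model on that same transition so the next
    visit is less surprising, and hence less rewarding."""
    phi_pred = model(phi_now, action_onehot)
    error = 0.5 * (phi_pred - phi_future).pow(2).mean()
    reward = error.item()      # intrinsic reward handed to the agent
    optimizer.zero_grad()
    error.backward()           # shrink future surprise at this state
    optimizer.step()
    return reward

# Revisiting the same transition pays less and less:
model = ForwardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
phi_now, phi_future = torch.randn(1, FEATURE_DIM), torch.randn(1, FEATURE_DIM)
action = torch.zeros(1, NUM_ACTIONS)
action[0, 0] = 1.0             # say, "press right"
rewards = [curiosity_step(model, optimizer, phi_now, action, phi_future)
           for _ in range(100)]
# rewards trends downward as the prediction improves
```

One quantity serving double duty as loss and reward is what makes the loop self-extinguishing: learning itself dries up the payout and pushes the agent on to states it cannot yet predict.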

This feedback loop also allows the AI to quickly bootstrap itself out of a nearly blank-slate state of ignorance. At first, the agent is curious about any basic movement available to its onscreen body: pressing right nudges Mario to the right, and then he stops; pressing right several times in a row makes Mario move without immediately stopping; pressing up makes him spring into the air and then come back down; pressing down has no effect. This simulated motor babbling quickly converges on useful actions that move the agent forward into the game, even though the agent doesn’t yet know that moving forward is the point.