Exploration & Observation Fractions

It's always struck me as a surprising limitation of BECCA that the exploration and observation fractions in the planner are fixed constants. It seems to me that for practically all tasks, these ratios ought to be weighted by what the agent has learned.

While playing with BECCA tonight, it occurred to me that the observation step provides valuable information that could be used to weight the exploration fraction -- that is, observation allows the agent to measure the stochasticity of its environment. I.e., "if I do nothing, and my inputs are the same as before, is my reward the same as before?" If not, we increase our stochasticity measure. If taking no action (observation) leads to no change in inputs or rewards, stochasticity is low.

When stochasticity is high, we should explore more, because it's more likely that I'll stumble on a strategy that works even better, or that my current strategy will cease to be effective.
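A minimal sketch of the idea, assuming the agent can compare inputs and reward across an observation step. The names `StochasticityEstimator` and `exploration_fraction`, the running-average update, and the `low`/`high` bounds are all hypothetical; none of this comes from BECCA's code. Steps where the inputs change on their own are simply skipped here, which is one possible reading of the rule above.

```python
class StochasticityEstimator:
    """Running estimate of how often reward changes while inputs stay fixed.

    Hypothetical helper, not part of BECCA. It should only be updated after
    observation steps, i.e. steps where the agent took no action.
    """

    def __init__(self, rate=0.1):
        self.rate = rate          # step size of the running average
        self.stochasticity = 0.0  # stays within [0, 1]

    def update(self, prev_inputs, prev_reward, inputs, reward, tol=1e-6):
        if list(prev_inputs) != list(inputs):
            # Inputs changed by themselves; this sketch does not score
            # such steps and leaves the estimate untouched.
            return self.stochasticity
        # Same inputs, no action: was the reward the same as before?
        surprised = 1.0 if abs(reward - prev_reward) > tol else 0.0
        self.stochasticity += self.rate * (surprised - self.stochasticity)
        return self.stochasticity


def exploration_fraction(stochasticity, low=0.05, high=0.5):
    """Interpolate linearly between a low and a high exploration fraction."""
    return low + (high - low) * stochasticity
```

With something like this, the planner would call `update` after each observation step and use `exploration_fraction(est.stochasticity)` in place of the fixed constant.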

Is there anything like this in BECCA already that I'm not aware of?

Is there any reason not to weight the exploration fraction based on this kind of stochasticity measure?

Comments

in search of a better hack

Matt, you're spot on. The exploration fraction is a hack so that sometimes Becca does something surprising. This can serve to find a better solution or to get it out of a rut when it is acting on an inaccurate model of its world. The observation fraction is another hack to get Becca to sit still occasionally and watch what's going on in its world. This can be useful when its world takes more than one time step to respond to one of Becca's actions.
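To make the two hacks concrete, here is a rough sketch of what fixed fractions amount to. It is an illustration only, not Becca's actual planner code; `choose_mode` and both constant values are made up for the example.

```python
import random

# Hypothetical values, chosen only for illustration.
EXPLORATION_FRACTION = 0.1
OBSERVATION_FRACTION = 0.2


def choose_mode(rng=random):
    """Pick what to do on one time step using fixed probabilities."""
    draw = rng.random()  # uniform in [0, 1)
    if draw < OBSERVATION_FRACTION:
        return "observe"  # sit still and watch the world
    if draw < OBSERVATION_FRACTION + EXPLORATION_FRACTION:
        return "explore"  # do something random and surprising
    return "exploit"      # act on the current model of the world
```

Nothing in this choice depends on what the agent has experienced, which is exactly the limitation being discussed.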

The stochasticity (or unpredictability) of the world is also very important, as you say, but its relationship to exploration is not always straightforward. Sometimes unpredictability can be decreased by gaining greater experience in some part of the world. This is the case in a complex, yet deterministic world. And sometimes no amount of experience can make a world predictable. Imagine having Becca stare at television static, for instance--it would never make Becca better at predicting the static. And it's important to note that the unpredictability of a world doesn't indicate that you need a new strategy. The dice game craps is entirely unpredictable on a roll-by-roll basis, but a player shouldn't mix her strategy around based on that--the best strategy in the long run is always the same.

Finding a better hack for the exploration fraction goes by different labels: intrinsic motivation, active learning, goal exploration, directed exploration, etc. But the idea is the same as what you express: choose how and when to explore based on what you experience. There are a few main strategies for this that I know of:

0) Occasional random exploratory actions, like Becca does now. The dumbest solution.
1) Searching out unpredictable stimuli, similar to what you describe.
2) Searching out *somewhat* unpredictable stimuli. Look for experiences that you can kinda predict, but not perfectly.
3) Searching out parts of your environment where your new experiences most rapidly decrease the unpredictability.

I'm sure there are others, but these give you the idea. Each implementation has its own bias, that is, there are problems that it solves well and others that it solves very poorly. The question is: what will work best for the set of worlds that Becca is expected to address? That's an open question.
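As a rough illustration of how strategies 1 through 3 differ, the sketch below scores regions of a world by their recent prediction errors. Everything in it is hypothetical: the function names, the idea of splitting the world into named regions, the error values, and the `target` and `window` parameters.

```python
def score_unpredictable(errors):
    """Strategy 1: prefer wherever the latest prediction error is highest."""
    return errors[-1]


def score_somewhat_unpredictable(errors, target=0.5):
    """Strategy 2: prefer errors close to a moderate target level."""
    return -abs(errors[-1] - target)


def score_learning_progress(errors, window=3):
    """Strategy 3: prefer wherever the error has been dropping fastest."""
    if len(errors) < 2 * window:
        return 0.0  # not enough history to estimate progress
    older = sum(errors[-2 * window:-window]) / window
    recent = sum(errors[-window:]) / window
    return older - recent


def pick_region(error_histories, score):
    """Return the name of the region with the highest score."""
    return max(error_histories, key=lambda name: score(error_histories[name]))


# Made-up prediction error histories, oldest first.
histories = {
    "static": [0.9, 0.9, 0.9, 0.9, 0.9, 0.9],     # never gets predictable
    "learnable": [0.8, 0.7, 0.6, 0.4, 0.3, 0.2],  # improving with experience
    "mastered": [0.05] * 6,                       # already predictable
}
```

On these made-up numbers, strategy 1 is drawn to the television static, while strategies 2 and 3 both pick the region where experience is still paying off. That is the kind of bias each implementation carries.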

My strategy for Becca has been to avoid implementing sophisticated mechanisms until I need them, and I haven't run hard into a need for directed exploration yet. But if it excites you, feel free to dive in! I would google 'goal directed exploration' and read a half dozen papers or so to get oriented first.

The observation fraction on

The observation fraction, on the other hand, is much more closely tied to predictability. If I learn through experience that an environment does react to my actions almost immediately in most cases, then I can get more reward by acting more quickly. But if I need to wait around to see what effects my actions will have, then acting too quickly will probably just mess me up. It would indeed be good to decide whether to observe or act based on our predictions of what is likely to happen next.
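One way this could look, as a sketch only: keep a running estimate of how many time steps the world takes to respond, and observe until that much time has passed since the last action. The class `ResponseDelayEstimator` is hypothetical, and so is the assumption that the agent can measure a response delay at all.

```python
class ResponseDelayEstimator:
    """Running average of how long the world takes to respond to an action.

    Hypothetical helper, not part of Becca. Delays are in time steps.
    """

    def __init__(self, rate=0.2, initial_delay=1.0):
        self.rate = rate                     # step size of the running average
        self.expected_delay = initial_delay  # current estimate, in time steps

    def record(self, observed_delay):
        """Fold one measured response delay into the estimate."""
        self.expected_delay += self.rate * (observed_delay - self.expected_delay)

    def should_observe(self, steps_since_action):
        """Keep observing until the expected response time has elapsed."""
        return steps_since_action < self.expected_delay
```

In a world that reacts almost immediately, `expected_delay` stays near one step and the agent acts nearly every step. In a slow world the estimate grows and the agent waits longer before acting again.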