Tuesday, April 10, 2012

Bucket-brigading neural networks

I've recently been playing around with some Python code to explore a hunch I've had for a couple of years: that you can train a feed-forward neural network simply by indicating whether an output in response to an input was "good" or "bad".

I'd always imagined that I would hook up a small robot with an embedded neural network, giving myself a remote control with a button like this:


The robot would rove around, and whenever it did something "bad" (e.g. ran into a wall that it should have registered on its sensors) I'd press the button and it would train itself using that "bad" input->output pairing - e.g. that "move forward" when the front sonar sensor is registering an obstruction is "bad". I could also have a "good" button for when it did something desirable, such as turning just before a wall, to reinforce the correct behaviours.

This appealed to me as it was also very similar to how I (attempt to) train our cat...

Yes, that is our cat. No, that was not a training session...
Anyway, I have migrated this hunch to the GitHub repository BadCat. It has taken a few twists and turns along the way, but I have been able to "train" some very elementary neural networks using a simple set of rules based on the original hunch. I ended up taking a few pointers out of genetic algorithms theory just for fun too.

The algorithm works in the following way:

  1. Read the "sensors"
  2. Apply sensor readings to a learning tool (neural network), get the output
  3. Try out the output "in the real world"
  4. If the result of trying out the output is "bad":
    1. Slightly mutate the output
    2. Go to step 3 above
  5. Train the network with the resultant (original or mutated) output

The mutation amount increases the longer the output stays "bad", based on the assumption that the original output will already be close to the desired output, while still allowing the output to change dramatically if the robot is stuck in a new situation. The "good" input->output pairs form part of a fixed-length queue of recent memory that is used for regular training. A minimal Python sketch of this loop follows below.
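Here is a rough sketch of that loop, under a few assumptions: `read_sensors()`, `try_in_real_world()`, `is_bad()` and the `network` object (with `predict()` and `train()` methods) are hypothetical stand-ins, and the output is treated as a list of floats. The real code in the BadCat repository differs in its details.

```python
import random
from collections import deque

MEMORY_SIZE = 50                  # fixed-length queue of recent "good" pairs
memory = deque(maxlen=MEMORY_SIZE)

def training_step(network, read_sensors, try_in_real_world, is_bad):
    sensors = read_sensors()                   # 1. read the "sensors"
    output = network.predict(sensors)          # 2. ask the network for an output
    mutation = 0.05                            # start with a small nudge
    while is_bad(try_in_real_world(output)):   # 3-4. try it; was the result "bad"?
        # Slightly mutate the output, increasing the mutation amount the
        # longer the result stays "bad" so a stuck robot can escape its rut.
        output = [o + random.uniform(-mutation, mutation) for o in output]
        mutation *= 1.5
    memory.append((sensors, output))           # remember the "good" pairing
    for inputs, target in memory:              # 5. train on recent good memory
        network.train(inputs, target)
```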

This approach is similar to the "bucket brigade" rule-reinforcement technique that can be used to train expert systems. It is also not dissimilar to reinforcement learning, except that the observation-action-reward mechanism is implicit rather than explicit: the action is the output generated from the observation and the neural network's weights, and the reward (or penalty) is externally sourced and applied to the network only when needed.*

I am looking forward to trying this out on a real mobile robot as soon as I can order my Pi, and I will keep you up to date on how it turns out.

* Oh, and just to be clear, I am not a robotics or AI PhD student and this is not part of a proper academic research paper. It is very likely that what I am doing here has been done before, so I make no claim to extraordinary originality or breakthrough genius - just consider this some musings and a pinch of free Python code :)