DeepMind’s AI Learns Locomotion From Scratch | Two Minute Papers #190

Dear Fellow Scholars, this is Two Minute Papers with Károly Zsolnai-Fehér. We have talked about some awesome previous works where we used learning algorithms to teach digital creatures to navigate complex environments. The input is a terrain and a set of joints, feet, and movement types, and the output has to be a series of motions that maximizes some kind of reward. A previous technique borrowed smaller snippets of movement from an existing database of motions and learned to stitch them together in a way that looks natural, and as you can see, those results are phenomenal. The selling point of this new work, which you might say looks less elaborate, is that it synthesizes these motions from scratch.

This problem is typically solved via reinforcement learning, a technique that comes up with a series of decisions to maximize a prescribed score. This score typically needs to be reasonably complex; otherwise, the algorithm is given too much freedom in how to maximize it. For instance, we may want to teach a digital character to run or jump hurdles, but it may start crawling instead, which still counts as a perfectly valid solution if our objective is too simple, for instance, just maximizing the distance from the starting point. To alleviate this, we typically resort to reward engineering, which means that we add additional terms to the reward function to regularize the behavior of these creatures. For instance, we can specify that throughout these motions, the body has to remain upright, which likely favors locomotion-type solutions.

However, one of the main advantages of machine learning is that we can reuse our solutions for a large set of problems. If we have to specialize our algorithm for all terrain and motion types, and for different kinds of games, we lose out on one of the biggest advantages of learning techniques. So researchers at DeepMind decided to solve this problem with a reward function that is nothing else but forward progress. That’s it. The further we get, the higher the score we obtain. This is amazing because it doesn’t require any specialized reward function, but at the same time, there are a ton of different solutions that get us far in these terrains. And as you see here, beyond bipeds, a bunch of different agent types are supported.

The key to making this happen is to apply two modifications to the original reinforcement learning algorithm. One makes the learning process more robust and less dependent on what parameters we choose, and the other makes it more scalable, which means that it is able to efficiently deal with larger problems. Furthermore, the training process itself happens on a rich, carefully selected set of challenging levels. Make sure to have a look at the paper for details.

A byproduct of this kind of problem formulation is that, as you can see, even though this humanoid does its job well with its lower body, in the meantime it is flailing its arms like a madman. The reason is likely that there is not much difference in the reward between different arm motions. This means that we most likely get through a maze or a heightfield even when flailing, therefore the algorithm doesn’t have any reason to favor more natural-looking movements for the upper body. It will probably choose a random one, which is highly unlikely to be a natural motion. This creates high-quality, albeit amusing, results that I am sure some residents of the internet will honor with a sped-up remix video with some Benny Hill music.

In summary: no precomputed motion database, no handcrafting of rewards, and no additional wizardry needed. Everything is learned from scratch with a few small modifications to the reinforcement learning algorithm. Highly remarkable work. If you’ve enjoyed this episode and would like to help us and support the series, have a look at our Patreon page. Details and cool perks are available in the video description, or just click the letter P at the end of this video. Thanks for watching and for your generous support, and I’ll see you next time!
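The contrast between an engineered reward and the pure forward-progress reward discussed in the episode can be sketched as follows. This is a minimal illustration, not DeepMind's actual implementation: the state fields, weights, and function names are hypothetical.

```python
def engineered_reward(state):
    """A hand-crafted ("engineered") reward: forward progress plus extra
    terms that regularize behavior. The terms and weights below are
    illustrative only and would need tuning per task."""
    r = state["forward_velocity"]                 # main objective: move forward
    r += 0.5 * state["torso_uprightness"]         # bonus for staying upright
    r -= 0.1 * sum(t * t for t in state["joint_torques"])  # penalize wasted effort
    return r


def forward_progress_reward(state):
    """The minimal reward from the episode: the further the agent gets,
    the higher the score. Nothing else is specified."""
    return state["forward_velocity"]


# A hypothetical snapshot of an agent's state:
state = {
    "forward_velocity": 1.0,
    "torso_uprightness": 1.0,
    "joint_torques": [0.0, 2.0],
}
print(engineered_reward(state))       # forward term + shaping terms
print(forward_progress_reward(state))  # forward term only
```

The trade-off the episode describes is visible in the shape of these functions: every extra term in the engineered version is a task-specific choice that may need re-tuning for a new terrain, morphology, or game, while the forward-progress version transfers unchanged across all of them, at the cost of leaving behaviors like arm flailing unpenalized.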

81 thoughts on “DeepMind’s AI Learns Locomotion From Scratch | Two Minute Papers #190”

  1. Shouldn't it be possible to get natural looking results for the majority of use cases by using just a single reward function that favors less "effort" (i.e. least amount of joint movements)? For example, this would indirectly imply upright position for a humanoid character because walking upright is presumably more efficient.

  2. The footage in this video is hilarious. I can only imagine how fun and infuriating debugging and developing this must have been.

  3. Maybe they would walk more naturally if the agents were also instructed to take energy conservation into account. The reason we walk with our arms mostly hanging down is that gravity pulls them down and it costs energy to hold them up.

  4. It's highly unlikely that the arm motions are random. More likely they contribute in some minuscule way that in real life would be outweighed by the energy expenditure. I mean, we've seen AIs learn to manipulate the pseudo-random number generators in Nintendo games, and my prior probability on that is WAY lower than my prior on getting slight boosts to balance from bizarre arm gestures.

  5. it looks like they are using their arms for balance at least some of the time, just not in the most efficient way. is energy use part of the calculation? like the more the robot moves its arms, the more energy it consumes. i think if they encouraged energy efficiency it would probably look more natural.

  6. you can't just say "click the letter p at the end of this video" and then nothing happens.
    i feel really sad now
    but this video was cool so we're even

  7. They should have added an energy cost for flailing limbs, the extra energy wasted by the disorderly limbs would make it more difficult for the test dummy to have sufficient energy left to jump across gaps or over barriers or get to its destination quickly enough.

  8. The movement of the arms is definitely not random, since he clearly uses them to balance himself. Even if this wasn't intended, it is the result of lower-level NNs, whether it is because the feet are too small/narrow (therefore he is always barely in balance), or there aren't enough joints/data points that account for internal balance.

  9. Why don't they put some second function for minimizing energy consumption on it as a constraint? This way it would favor more natural movements of the arms etc probably.

  10. All the people who talk about energy efficiency completely missed the point of the paper.

    The point is that you don't have to do any tuning or engineer a reward function. You just say "go as far as possible" and it works.

    If you do energy tuning you add a layer of complexity and a problem-specific issue, that completely ruins the advantage that their method gives.

  11. if they made movements cost energy and added a small penalty for energy used, you could probably get rid of the weird arm movements.

  12. This is really neat to think about nature doing something like this over and over again. The flailing is hilarious but isn't related to the goal (in this case). Maybe this is why men have nipples?

  13. Waving your arms like crazy when walking seems to be the most efficient solution. Humans just have evolved bad. This type of locomotion is the future!

  14. A lot of this looks panicked, which makes me wonder if the objectively hilarious way that panicking people run is actually efficient.

  15. They don't look natural because the virtual world doesn't include STAMINA. If they added stamina as in games, with moving parts costing stamina, it could be one of the score terms or act as a limit: you set a max stamina, and it limits distance. If standing costs stamina and movement costs stamina, the algorithm stops making big moves.

  16. I'm in tears… so much silly walking. They should add penalties for every movement of a muscle to simulate the energy we need to move our muscles and try to avoid unnecessary moves like flailing our arms.

  17. why aren't they phasing the reward? phase one, get as far as you can. phase two, efficiency of motion or conservation of expended energy. in this way you could get multiple results in a sort of priority queue?

  18. I guess the movement of the arms isn't totally random, but more to counterbalance the movement of the lower body (like in human locomotion). It's not energy-efficient as done by the algorithm, but it works. Could a conservation-of-energy term in the algorithm help some? Or some fatigue after continued overly fast motion?

  19. you should add a reward for the algorithm making the smallest amount of movement, to make it more natural, because in nature, energy conservation is an important aspect, I suppose

  20. What if the constant waving of the arm was to achieve a gyroscope-like effect? If the AI can coordinate itself well enough to redirect the momentum achieved by the motion, at will, then this can have a powerful stabilization effect without any need for wheels.

  21. I would really like to see DeepMind doing something beyond simulated environments. Because they haven't answered the most important question: is all their work applicable beyond simulation? Any ideas on that question?


  23. I'd like to know how long this took to achieve the final running movement shown in the later generations. Can this be done in seconds, or does it take weeks of high horsepower supercomputing to do this?

  24. How about a loss that incurs a penalty for every limb movement? In this case, the upper body is more likely to stay still, as its movements don't greatly improve the distance score but incur a penalty on the energy consumption term.

  25. So, in the case of animated motion, computer programming still beats artificial intelligence (video game characters have much more natural movements than these did). Yet, in complex games like Go and chess, A.I. wins easily. Interesting.

  26. Wouldn't it be wise to include "energy efficiency" as a secondary optimization parameter (additional to distance). Basically this should reduce the amount of "unnecessary" movement.

  27. With Control Suite (dm_control) how do you save state (the learning achieved to file) so as to apply it to new terrain?
    Can anyone help please in answering a question (in four parts) to do with dm_control please?
    I have installed DeepMind's dm_control (from GitHub/deepmind/dm_control) with the physics engine MuJoCo and have it working, and now need to save results to disk. I can’t find the command or way to do this, so this is probably just a lack of understanding on my part, or possibly it's something about the software. This state-saving requirement breaks down into four use cases:
    1. If I am doing a long training run it would be useful to be able to take checkpoints at intervals, say after an hour. These checkpoints could then be used to restart a run from that point without having to restart from the beginning. The goal is largely to make the learning process more robust (and hence faster and cheaper). Tensorflow, in comparison, has the ability to write all or a subset of variables to file as a checkpoint. Does dm_control have some such capability?
    2. The second use case is that after learning it would be very useful to be able to save the state of the learning so that it can be used operationally. State here includes the neural network node values within the policy networks, plus all other variables needed to reproduce results. It would appear that the videos that DeepMind provides of the humanoid running past obstacles have probably been created in this way on saving a checkpoint after each stage of the curriculum (although maybe they used another approach). How should I do this?
    3. The third use case relates to the essence of the paper, in being able to take a partially trained agent, trained on one terrain (or environment), and then give it a different environment to train it further. In the dm_control XML files for the various bodies the <geom > tag is used to define the basic terrain, however I can’t see how richer terrains are generated and applied. What is the best way to do this?
    4. A fourth use case not directly described in the papers would be to apply learning from one body to another body. It would be interesting to use say the planar walker (which just has legs and no arms) as the initial learning vehicle and then apply that learning to a body with arms (such as the humanoid).
    Any help on this would be greatly appreciated.

  28. why does it even have hands if it only randomizes its own movement? we do use our hands to balance ourselves in different scenarios in real life by using momentum and such, but the algorithm seems to be missing that point. although the generation is hilarious

  29. The reason this training program converged to this weird movement style is that humans (and other animals) know how to move toward a goal while consuming minimal energy and minimal torque; adding these two factors to the optimization target would converge to a more natural movement style
