Hiding Action Prediction Inference Time in Robotics Applications
Reducing inference latency is critical to smooth robot motion. To be explicitly clear, I'm talking about worst-case per-action inference latency.
The strategy that has generally been pursued is action chunking: from a single observation, the policy predicts the next several actions. This clearly reduces the average per-action inference latency, because you run the model far less often. But every few timesteps you still have to run a full inference, which takes much longer than a single control step. The result is a choppy execution pattern: sometimes you move fast, sometimes you just stop and wait for the model to finish running.

There's a simple enough trick here using pipelining. At training time, for a sequence model that takes in a sequence of observations and predicts a sequence of actions, you predict the action chunk associated with the next observation as well as the one for the current observation. At runtime you can then execute the current chunk while the inference for the next one runs in the background (a minimal sketch of this loop is below).

There's a critical problem with this approach that stems from control theory. A simple analogy: if you were asked to close your eyes and walk in a straight line, odds are you can't do it. In other words, open-loop control accumulates error exponentially in time. With the overlapping action prediction you still run into this problem: if you discover new information at the next timestep, you have no way of folding it into the action you already predicted for that timestep, because that prediction relied entirely on the past observation. So is there a way to make a partial action prediction that could be refined with more information from the world?
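Here's what I mean by the pipelined chunking loop. This is a minimal sketch with made-up interfaces: `policy(obs)` is assumed to return the chunk for the segment starting at `obs` plus a look-ahead chunk for the segment after, and `env` is a stand-in for the robot. Note the comment about staleness at the bottom.

```python
import threading

def control_loop(policy, env, num_segments=100):
    """Pipelined action chunking (hypothetical `policy` and `env` interfaces).

    policy(obs) returns two action chunks: one for the segment starting at
    `obs`, and a look-ahead chunk for the segment after that.
    """
    obs = env.reset()
    exec_chunk, _ = policy(obs)                      # blocking warm-up inference
    for _ in range(num_segments):
        pending = {}
        worker = threading.Thread(
            target=lambda o=obs: pending.update(out=policy(o)))
        worker.start()                               # inference overlaps with motion
        for action in exec_chunk:                    # robot keeps moving, no stall
            obs = env.step(action)
        worker.join()
        # The chunk we execute next was predicted from the observation taken at
        # the START of the segment we just finished -- i.e. it is one full
        # chunk-period stale. That's the open-loop problem described above.
        _, exec_chunk = pending["out"]
```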
Recently, there was a paper titled Diffusion Forcing. I'll briefly sketch what it discusses. It explores how diffusion can be applied to sequence modeling by using a per-timestep diffusion level. For instance, if you had a world model and wanted to use diffusion forcing, you would noise each observation in the episode with a different diffusion level. This contrasts with methods like Video Diffusion Models, which noise every frame in the sequence at the same level.
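Roughly, the training-time difference looks like this. This is just a sketch in the standard DDPM framing with made-up shapes, not the paper's actual implementation:

```python
import torch

def diffusion_forcing_noising(trajectory, alphas_cumprod, num_levels=1000):
    """Noise each timestep of a trajectory at its own diffusion level.

    trajectory: (T, D) tensor of per-timestep tokens (observations and/or actions).
    alphas_cumprod: (num_levels,) cumulative alpha-bar schedule from DDPM.
    A standard video diffusion model would instead sample ONE level and apply
    it to every timestep in the sequence.
    """
    T = trajectory.shape[0]
    levels = torch.randint(0, num_levels, (T,))      # independent level per timestep
    abar = alphas_cumprod[levels].unsqueeze(-1)      # (T, 1)
    noise = torch.randn_like(trajectory)
    noisy = abar.sqrt() * trajectory + (1 - abar).sqrt() * noise
    return noisy, levels, noise                      # model conditions on `levels`
```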
Side note: I've tried video diffusion models for robotics applications, which is an alternative approach. It worked, but tuning it was a huge pain. I had to carefully assign learning rates to the actions and observations independently during sampling. Also, deploying gradient-descent methods on edge hardware isn't trivial; edge hardware can lack the kernels for training. My conspiratorial belief is that hardware makers want you to buy their training hardware or consumer graphics cards rather than use their edge inference machines for training applications, so they handicap training workloads on edge inference hardware.
In diffusion forcing, the authors treat the forward diffusion process in DDPM-style training as a "partial masking," in the style of causal masking from transformers. With this perspective you can explicitly represent your uncertainty about the future. Let's walk through it. Imagine you have a trajectory model that predicts both observations and actions, and you train it with diffusion forcing on a trajectory dataset. You can now start predicting future actions while holding the noise levels of the still-unknown observations steady. In other words, you start denoising the actions a bit, then when an observation actually arrives, you assign it a noise level close to certainty and complete the rollout of those actions. While you do this rollout, you also start the partial rollout of the actions after that. As a result, you reduce the number of diffusion steps needed for a given action at the time it's due, because you've already partially decoded it relying on past information.
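Here is a minimal sketch of one control-loop step under this scheme. Everything here is a named assumption: `denoise(tokens, levels)` is a hypothetical trajectory model that does one reverse-diffusion update per token given per-token noise levels, and the specific levels are arbitrary:

```python
import torch

def control_step(denoise, pending_actions, pending_level, obs_new,
                 max_level=999, partial_level=300):
    """One step of latency hiding with per-token noise levels.

    pending_actions: next action chunk, already partially denoised down to
        `pending_level` before the current observation existed.
    obs_new: freshly captured observation, pinned at noise level 0 (certainty).
    """
    obs_tok = obs_new.unsqueeze(0)
    obs_lvl = torch.zeros(1, dtype=torch.long)

    # Finish the pending chunk: only `pending_level` denoising steps remain,
    # not the full `max_level`, because it was pre-decoded from past context.
    for level in range(pending_level, 0, -1):
        lvls = torch.full((pending_actions.shape[0],), level, dtype=torch.long)
        out = denoise(torch.cat([obs_tok, pending_actions]),
                      torch.cat([obs_lvl, lvls]))
        pending_actions = out[1:]
    ready_actions = pending_actions

    # Start the chunk after that: denoise it from pure noise only down to
    # `partial_level`, holding the still-unknown next observation at max noise.
    next_actions = torch.randn_like(ready_actions)
    unknown_obs = torch.randn_like(obs_tok)
    for level in range(max_level, partial_level, -1):
        lvls = torch.full((next_actions.shape[0],), level, dtype=torch.long)
        tokens = torch.cat([obs_tok, ready_actions, unknown_obs, next_actions])
        levels = torch.cat([obs_lvl,
                            torch.zeros(ready_actions.shape[0], dtype=torch.long),
                            torch.tensor([max_level]), lvls])
        out = denoise(tokens, levels)
        next_actions = out[-next_actions.shape[0]:]
    return ready_actions, next_actions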
Also, I have a feeling that in the nominal case you could actually jump directly to the fully decoded action very quickly, bringing it on par with the pipelined case. For example, if a new observation reveals nothing new about the environment, the model should be able to just move forward by taking the partially denoised prediction and using the DDPM reverse-process equation to recover \(x_0\) directly. I believe you could verify this by looking at the partially denoised observation prediction in your trajectory model: if that prediction is close to the real observation, you already have the information needed to jump to the right answer. Mind you, the math doesn't entirely support this; you should really use a DDIM-based approach to reduce the number of steps in a more continuous fashion. For example, you drop the number of steps in the DDIM formulation from 200 to 20 if you know that the predicted observation is close to the observed observation (lol ik, observed observation).
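For reference, the \(x_0\) jump I'm referring to is the standard DDPM relation between a noisy sample and the clean estimate: given the model's noise prediction \(\epsilon_\theta(x_t, t)\),

\[
\hat{x}_0 = \frac{x_t - \sqrt{1 - \bar{\alpha}_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}},
\]

which is only exact when the noise prediction is exact. That's exactly why the DDIM-style step reduction is the safer version of the same shortcut.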
In terms of deployment, a lot of work has gone into getting huge models to run on edge hardware recently. However, they run painfully slowly. If you use the above latency-hiding method, you could build a single model that's runnable across many target hardware platforms by varying how many future action predictions are happening. You could do this in conjunction with DDIM-like formulations, which reduce the overall number of diffusion steps that need to be taken. This will of course cut performance, but that's what you get for being cheap and buying shit hardware :).
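As a toy illustration of what I mean by one model, many targets: the deployment knob could be as simple as a per-platform config. All of these names and numbers are made up:

```python
# Hypothetical per-platform inference settings for the same trained model.
# lookahead_chunks: how many future action chunks get partially pre-decoded.
# ddim_steps: how many DDIM steps are spent per chunk at inference time.
EDGE_PROFILES = {
    "workstation_gpu": {"lookahead_chunks": 1, "ddim_steps": 50},
    "jetson_class":    {"lookahead_chunks": 2, "ddim_steps": 20},
    "mcu_class":       {"lookahead_chunks": 3, "ddim_steps": 8},
}

def load_profile(platform: str) -> dict:
    """Pick the latency/quality trade-off for the target hardware."""
    return EDGE_PROFILES[platform]
```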
I should note, I think there's a way to do the same thing with a two-timestep attentive model, so I don't think this is really unique to diffusion forcing. But as someone who spends a lot of time thinking of ways to run models on embedded hardware, I find this application fascinating.