PDPP: Projected Diffusion for Procedure Planning in Instructional Videos

Hanlin Wang, Yilu Wu, Sheng Guo, Limin Wang, H. Wang, Y. Wu, and L. Wang are with the State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China. (E-mail: {hlwang, yiluwu}@smail.nju.edu.cn, [email protected]) L. Wang is also with Shanghai AI Laboratory, Shanghai 200232, China. S. Guo is with MYbank, Ant Group, Hangzhou 310000, China. (E-mail: [email protected]) Corresponding author: L. Wang. Manuscript received April 19, 2005; revised August 26, 2015.

Abstract

In this paper, we study the problem of procedure planning in instructional videos, which aims to make a plan (i.e. a sequence of actions) given the current visual observation and the desired goal. Previous works cast this as a sequence modeling problem and leverage either intermediate visual observations or language instructions as supervision to make autoregressive planning, resulting in complex learning schemes and expensive annotation costs. To avoid intermediate supervision annotation and error accumulation caused by planning autoregressively, we propose a diffusion-based framework, coined as PDPP (Projected Diffusion model for Procedure Planning), to directly model the whole action sequence distribution with task label as supervision instead. Our core idea is to treat procedure planning as a distribution fitting problem under the given observations, thus transform the planning problem to a sampling process from this distribution during inference. The diffusion-based modeling approach also effectively addresses the uncertainty issue in procedure planning. Based on PDPP, we further apply joint training to our framework to generate plans with varying horizon lengths using a single model and reduce the number of training parameters required. We instantiate our PDPP with three popular diffusion models and investigate a series of condition-introducing methods in our framework, including condition embeddings, Mixture-of-Experts (MoEs), two-stage prediction and Classifier-Free Guidance strategy. Finally, we apply our PDPP to the Visual Planners for human Assistance (VPA) problem which requires the goal specified in natural language rather than visual observation. We conduct experiments on challenging datasets of different scales and our PDPP model achieves the state-of-the-art performance on multiple metrics, even compared with those strongly-supervised counterparts. These results further demonstrates the effectiveness and generalization ability of our model. Code and trained models are available at https://github.com/MCG-NJU/PDPP.

Index Terms:

Goal-directed procedure planning, instructional video, diffusion model, conditional projection, AI assistants.

1 Introduction

Instructional videos[58, 2, 48] are strong knowledge carriers, which contain rich scene changes and various action steps. People can learn new skills from these videos by figuring out what actions should be performed to achieve the desired goals. Although this seems to be natural for humans, it is quite challenging for AI agents. Training a model that can learn how to make action plans from the start state to goal is crucial for the next-generation AI system as such a model is able to mine procedure knowledge from huge amounts of videos to help people with goal-directed problems like cooking. Nowadays the computer vision community is paying growing attention to the instructional video understanding[6, 35, 56, 28, 11, 10]. Among them, procedure planning in instructional vieos[6] is becoming an important problem, which requires a model to produce goal-directed action plans given the current and goal visual observations. Different with traditional procedure planning problem in structured environments[15, 46], this task needs to deal with unstructured environments and thus forces the model to learn structured and plannable representations from realistic videos. Specifically, given the visual observation at start and the desired goal, we need to produce a sequence of actions that transform the start state to the goal state, as shown in Fig. 1.

Refer to caption — Figure 1: Planning example. ”Seen observations” denotes the visual inputs for model, ”Intermediate supervision” denotes additional supervision used in previous approaches, and ”Predicted action” means the actions to be predicted by model. For procedure planning, given a start observation $o_{start}$ and a goal state $o_{goal}$ , the model is required to generate an action sequence that can transform $o_{start}$ to $o_{goal}$ . Previous approaches rely on heavy intermediate supervision during training, while our model only needs the task class labels (shown in bottom row). We also evaluate another similar Visual Planners for human Assistance (VPA) problem, which takes visual inputs and a language-described goal to make plans. We remove the requirement of visual history and plan directly with the given start observation and goal description.

Previous approaches for procedure planning in instructional videos often treat it as a sequence decision making problem and focus on predicting the individual action accurately in each step. Since ground truth intermediate observations are not available for procedure planning, most autoregressive methods have to predict intermediate states and actions alternately with two-branch structure in a step-by-step manner[47, 6, 4]. Such models are complex and easy to accumulate errors during the planning process, especially for long sequences. Recently, Zhao et al.[55] solves procedure planning with a single branch non-autoregressive model based on Transformer[50] to predict all intermediate steps in parallel. Yet, to obtain good performance, this method brings multiple learning objectives, complex training schemes, and tedious inference process. In contrast, we assume the procedure planning problem as a distribution fitting problem and thus can be solved with a sampling process. We directly model the joint distribution of the whole action sequence in instructional video rather than every discrete action. In this perspective, we can use a simple MSE loss to optimize our generative model and generate the whole action sequence at once, which contains less learning objectives and enjoys simpler training schemes.

For supervision in procedure planning, in addition to the action sequence, previous methods often need heavy intermediate visual[47, 6, 4] or language[55] annotations for their learning process. In contrast, we remove the requirement for these annotations and only use task labels from instructional videos as a condition for our model training (as shown in Fig. 1), which could be easily obtained from the keyword or caption of the video with less labeling cost. Meanwhile, task information is closely related to the action sequences and can provide effective guidance for action planning. For example, in a video of $jacking$ $up$ $a$ $car$ , the possibility of action $add\ sugar$ in it is almost zero. To some extent, supervision signals applied in previous methods are more specific task information. For instance, intermediate languages contain words unique to certain tasks, and intermediate observations include scenes relevant to specific tasks. These supervision signals implicitly contain the current task category, whereas the task labels we used explicitly leverage this information. We use the task-prediction model in the first stage to reduce the learning complexity of the later process. From the annotation perspective, task labels are much easier to obtain than intermediate supervision and are also easier to train with. Therefore, we chose to use task labels as the supervision signal.

Previous approaches[47, 6, 4, 55] for procedure planning train models with different horizons (i.e. the number of predicted action steps) separately, which means different models have to be applied when predicting with varied horizons. This setting presents some challenges as real-world tasks typically vary in the number of actions required to complete them. The main issue is that as the prediction horizon increases, the total number of trained parameters grows linearly, which is not conducive to development. We thus propose to train our model that can predict different planning horizons by joint training with data sequences of varied lengths. Our main goal is to reduce the number of training parameters to improve computation speed and avoid complex inference processes. To achieve this, embedding for horizon as condition and Mixture-of-Experts (MoEs)[43] are applied to our PDPP.

In addition, modeling the uncertainty in procedure planning is a key factor for consideration. In reality, there might be more than one reasonable action sequences that can transform the given start state to goal state. For example, changing the order of $add$ $sugar$ and $add$ $butter$ in $making$ $cake$ process will not affect the final result. So action sequences can vary even with the same start and goal states. To address this problem, we consider adding randomness to our distribution-fitting process and perform training with a diffusion model[20, 34]. Solving procedure planning problem with a probabilistic diffusion model has two main benefits. First, a diffusion model changes the goal distribution to a random Gaussian noise by adding noise progressively to the initial data and learns the sampling process at inference time as an iterative denoising procedure starting from a random Gaussian noise. So randomness is naturally involved both for training and sampling in a diffusion model, which is helpful to model the uncertain action sequences for procedure planning. Second, it is easy to apply conditional diffusion process with the given start observation and goal state based on diffusion models, so we can model the procedure planning problem as a conditional sampling process with a simple training scheme. In this work, we directly concatenate conditions and action sequences together and propose a projected diffusion model to explicitly perform conditional diffusion process.

Finally, we extend our PDPP framework to a more challenging task called Visual Planners for human Assistance (VPA)[35]. In this task, it argues that procedure planning is not useful in a real-world assistance setting due to the unavailability of the goal observation. Instead, VPA uses a language described goal like “make a shelf”. So we can apply PDPP to the VPA problem by simply replacing the goal observation with task description. This extension demonstrates the flexibility and generalization power of our PDPP framework.

In summary, our contributions are summarized as follows: 1) We cast the goal-directed procedure planning in instruction videos as a conditional distribution-fitting problem and directly model the joint distribution of the whole action sequence, which can be learned with a simple training scheme. 2) We introduce an efficient scheme for training the goal-directed planner, which avoids additional supervision of visual or language features and simply relies on task supervision instead. 3) We propose a projected diffusion model (PDPP) to learn the distribution of action sequences, which is able to naturally model the uncertainty in procedure planning. Our PDPP is actually a multi-step generative prediction model based on probabilistic modeling rather than a simple action retrieval model. PDPP takes observation from the current environment and a video clip or text description as goal to make multi-step look ahead predictions and real-time inference. By introducing diffusion, our model is capable of generating multiple plausible action sequences and effectively modeling the uncertainty in planning task. 4) We apply joint training to PDPP to save the amount of training parameters. The resulting model can be more flexible and practical for procedure planning with varied horizons. 5) Extensive experiments on three instructional videos datasets show that our method outperforms the existing state-of-the-art methods across different prediction horizons, demonstrating its generalization ability on goal-directed planning in instructional videos.

A preliminary version of this work[51] was published on the conference of IEEE CVPR 2023 as a highlight presentation. In this new version, we make the following extensions: i) To save the amount of training parameters and make one single model handle planning with different horizons, we apply joint training to our model. We introduce horizon embedding and MoEs[43] to PDPP to improve the model performance. ii) We instantiate our PDPP with three popular diffusion model backbones: UNet[41], UNet with attention layers[20] and Transformer[36]. For task supervision, we further divide it into three levels: no task supervision, task-level one-hot supervision, and task-level mask supervision. We study the performance of these backbones with different levels of task supervision on three datasets. iii) We propose to split the planning process into two stages to achieve better planning performance. That is, we first predict the first and last actions with the given start and goal observations. Then the predicted first and last actions are used as conditions to predict the intermediate actions. Experiments show that this two-stage planning process can bring better results on large scale dataset. iv) We address the VPA problem by replacing the goal observation with task description in our PDPP framework, and our approach sets a new state-of-the-art performance on this task. Experiments demonstrate that our methods have excellent generalization performance and can handle long-horizon prediction. We also provide more detailed analysis and ablations for the task supervision, joint training, and uncertainty modeling in our PDPP.

2 Related work

2.1 Procedural video understanding

With an aim to learn the inter-relationship between different events or steps in videos, the problem of procedural video understanding, which is important for the training of AI models as human assistants, has gained more and more attention recently. Zhao et al.[56] investigated the problem of abductive visual reasoning, which requires vision systems to infer the most plausible visual explanation for the given visual observations. Furthermore, Liang et al.[28] proposed a new task: given an incomplete set of visual events, AI agents are asked to generate descriptions not only for the visible events and but also for the lost events by logical inferring. Unlike these works trying to learn the abductive information of intermediate events, Chang et al.[6] introduced procedure planning in instructional videos which requires AI systems to plan an action sequence that can transform the given start observation to the goal state in order to assist humans in daily scenarios. Different with the long-term action anticipation (LTA) task[29, 14, 31], procedure planning requires a visual observation as goal to make goal-directed planning and covers a wide range of goal-oriented activities in everyday life. Patel et al.[35] further point out the unavailability of the goal visual state in procedure planning and propose VPA task to use a language described goal instead. In this paper, we study the procedural video understanding problem by learning the goal-directed procedure planning task.

Multiple methods have been proposed to solve the procedure planning. Chang et al.[6] conduct a dual dynamic network to model the transformation relationship between intermediate states and actions. When inference, the actions and states are generated step by step. Bi et al.[4] follow this idea and use model-based reinforcement learning to predict the state-action pairs in an autoregressive way. They also learn the time-invariant context information from the given observations for better planning. Instead of using the intermediate states supervision, Zhao et al.[55] apply a single branch non-autoregressive transformer decoder to predict all intermediate actions in parallel with language annotations. We here propose our PDPP model, which can generate both accurate and diverse plans without any states or language supervision, to solve this problem.

2.2 Diffusion probabilistic models

Diffusion probabilistic models[44] are generative models which have been gaining significant popularity nowadays and have achieved great success in many areas. Ho et al.[20] used a reweighted objective to train diffusion model and achieved great synthesis quality for image synthesis problem. Janner et al.[23] studied the trajectory planning problem with diffusion model and get remarkable results. Besides, diffusion models are also used in video generation[22, 19], density estimation[25], human motion[49], sound generation[54], text generation[27] and other domains, all achieved competitive results. There are various architectures for implementing diffusion models, such as Unet[41], Unet with attention layers[20] and Transformer[36].

Diffusion Models work by destroying training data through the successive addition of Gaussian noise. Then these models learn to reverse this noising process by removing the added noises step by step to recover the initial data. Under such a design, diffusion models require no adversarial training and have the added benefits of learning scalability and parallelizability with a simple learning scheme. Besides, the generated results can vary with different noises, which is helpful to the diverse generation. With the above advantages of diffusion models, we in this work treat the whole action sequence in procedure planning as a distribution and apply diffusion process to this problem to model the uncertainty. Our proposed projected diffusion model PDPP achieves state-of-the-art performance only with a simple learning scheme.

2.3 Projected gradient descent

Projected gradient descent is an optimal solution suitable for constrained optimization problems, which is proven to be effective in optimization with rank constraints[7], online power system optimization problems[16] and adversarial attack[8], etc. The core idea of projected gradient descent is adding a projection operation to the normal gradient descent method to project the output onto the feasible set, so that a function subject can be optimized to a constraint and the result is ensured to be in the feasible region. In our PDPP, we apply diffusion to procedure planning to fit the goal distribution, which contains both the action sequence and condition-guided information. These conditions can be changed during the noise-adding and denoising stages. Inspired by projected gradient descent, we add a similar projection operation to our diffusion process, which keeps the conditional information for diffusion unchangeable and thus provides accurate guides for learning. The process of “projecting back to a bounded normal range from an abnormal state” is analogous to how we rectify a condition altered by noise to its accurate state, thereby remapping the overall distribution to a precisely guided range.

2.4 Mixture of Experts (MoEs)

As an ensemble learning technology, MoEs[43] divides a problem space into homogeneous regions by learning multiple expert modules to process the input signal, together with a gating module called as router to help combine the outputs of different experts. When training, the router is learned to assign the combination weight across different experts, thus each expert is expected to focus on learning a certain sub-region of the input space. MoEs has shown its remarkable ability to scale neural networks such as LSTM[43], CNN[1, 53] and Transformer[38, 42, 26]. By activating dynamic sub-networks for multiple inputs with routing strategy, MoEs provides separate parameters for different tasks and can thus reduce the bad influence of sharing parameters for conflicting inputs[57]. We in this paper apply MoEs to our Unet-attention based model on large scale dataset COIN[48] for better joint training. Specifically, we try two routing strategies: direct routing and learned routing for MoEs and apply this method to different parts of our model when planning with varied horizons.

3 Method

In this section, we present the details of our projected diffusion model for procedure planning in instructional videos (PDPP). An overview of PDPP is provided in Fig. 2. We will first introduce our setup for the planning problem in Sec. 3.1. Then we present the diffusion model used to model the action sequence distribution and the way to add conditions in Sec. 3.2. To provide more precise conditional guidance both for the training and sampling process, a simple projection method is applied to our model, which we will discuss in Sec. 3.3. Sec. 3.4 presents the two stage prediction process for procedure planning, which predicts { $a_{1},a_{T}$ } first and is proven to be useful for large scale dataset. Finally, we show the training scheme (Sec. 4.1) and sampling process (Sec. 4.2) of our PDPP.

3.1 Problem formulation

For procedure planning, we follow the problem set-up of Chang et al.[6]: given two video clips, start visual observation $o_{s}$ and visual goal $o_{g}$ , a model is required to plan a sequence of actions $a_{1:T}$ so that the environment state can be transformed from $o_{s}$ to $o_{g}$ . Here $T$ is the horizon of planning, which denotes the number of action steps for the model to take and { $o_{s}$ , $o_{g}$ } indicates two different environment states appeared in an instructional video. We decompose the procedure planning problem into two sub-problems, as shown in Eq. 1. The first problem is to learn the task-related information $c$ with the given { $o_{s}$ , $o_{g}$ } pair. This can be seen as a preliminary inference for procedure planning. Then the second problem is to generate action sequences with the task-related information and given observations. Note that Bi et al.[4] also decompose the procedure planning problem into two sub-problems, but their purpose of the first sub-problem is to provide long-horizon information for the second stage since Bi et al.[4] plans actions step by step, while our purpose is to get condition for sampling to achieve an easier learning.

p(a_{1:T}|o_{s},o_{g})=\int p(a_{1:T}|o_{s},o_{g},c)p(c|o_{s},o_{g})dc.

(1)

One of our motivations is to use one model for multiple planning horizons with joint training. That is, we sample one batch of action sequences for every prediction horizon and conducting gradient descent on all these data to complete one training step. Thus horizon information $h$ for every action sequence is also provided to our model as condition.

At training time, we first train a simple model (implemented as multi-layer perceptrons (MLPs)) with the given observations { $o_{s},o_{g}$ } to predict which the task category is. We use the task labels in instructional videos $\overline{c}$ to supervise the output $c$ . After that, we evaluate $p(a_{1:T}|o_{s},o_{g},c,h)$ in parallel with our model and leverage the ground truth (GT) intermediate action labels as supervision for training. Compared with the visual and language supervision in previous works, task label supervision is easier to get and brings simpler learning schemes. At inference phase, we just use the start and goal observations to predict the task class information $c$ and then samples action sequences $a_{1:T}$ from the learned distribution with the given observations, horizon and predicted $c$ , where $T$ is the planning horizon.

3.2 Projected diffusion for procedure planning

As explained in Sec. 3.1, the main part of our model is how to generate $a_{1:T}$ with given conditions to solve the procedure planning problem. Bi et al.[4] assume procedure planning as a Goal-conditioned Markov Decision Process and use a policy $p(a_{t}|o_{t})$ along with a transition model $\tau_{\mu}(o_{t}|c,o_{t-1},a_{t-1})$ to perform the planning step by step, which is complex to train and slow for inference. We instead solve this as a direct distribution fitting problem with a diffusion model. In this section, we will first introduce the standard diffusion model. Then show our conditional sampling ways and our learning objective.

Diffusion model. A diffusion model[20, 34] solves the data generation problem by modeling the data distribution $p(x_{0})$ as a denoising Markov chain over variables { $x_{N}...x_{0}$ } and assume $x_{N}$ to be a random Gaussian distribution. The forward process of a diffusion model is incrementally adding Gaussian noise to the initial data $x_{0}$ and can be represented as $q(x_{n}|x_{n-1})$ , by which we can get all intermediate noisy latent variables $x_{1:N}$ with a diffusion step $N$ . In the sampling stage, the diffusion model conducts iterative denoising procedure $p(x_{n-1}|x_{n})$ for $N$ times to approximate samples from the target data distribution. The forward and reverse diffusion processes are shown in Fig. 3.

In a standard diffusion model, the ratio of Gaussian noise added to the data at diffusion step $n$ is pre-defined as $\{\beta_{n}\in(0,1)\}_{n=1}^{N}$ . Each adding noise step can be parametrized as

q(x_{n}|x_{n-1})=\mathcal{N}(x_{n};\sqrt{1-\beta_{n}}x_{n-1},\beta_{n}\textbf{I}).

(2)

Since hyper-parameters $\{\beta_{n}\}_{n=1}^{N}$ are pre-defined, there is no training in the noise-adding process. As discussed in[20], re-parameterize Eq. 2 we can get:

x_{n}=\sqrt{\overline{\alpha}_{n}}x_{0}+\sqrt{1-\overline{\alpha}_{n}}\epsilon,

(3)

where $\overline{\alpha}_{n}=\prod_{s=1}^{n}(1-\beta_{s})$ and $\epsilon\sim\mathcal{N}(0,I)$ .

In the denoising process, each step is parametrized as:

p_{\theta}(x_{n-1}|x_{n})=\mathcal{N}(x_{n-1};\mu_{\theta}(x_{n},n),\Sigma_{\theta}(x_{n},n)),

(4)

where $\mu_{\theta}$ is produced by a learnable model and $\Sigma_{\theta}$ can be directly calculated with $\{\beta_{n}\}_{n=1}^{N}$ [20]. Then Ho et al.[20] conduct a further derivation and set the learning objective for their diffusion model as the noise added to the uncorrupted data $x_{0}$ at each step. When training, the diffusion model first selects a diffusion step $n\in[1,N]$ and calculates $x_{n}$ as shown in Eq. 3. Then the learnable model will compute $\epsilon_{\theta}(x_{n},n)$ and calculate loss with the true noise add to the distribution at step $n$ . After training, the diffusion model can simply generate data like $x_{0}$ by iteratively processing the denoising step starting from a random Gaussian noise.

However, such a diffusion model can not be applied to procedure planning problem since the sampling process in this task is condition-guided while no condition is applied in the standard diffusion model. There are multiple methods for implementing conditional-guided diffusion model, which could be divided into two categories: adding condition to diffusion model or conducting sampling guidance. The former adds conditional inputs to the model through cross-attention[39], AdaGN[9] or token concatenation along the sequence dimension[30], while the later conducts conditional sampling by providing explicit[9] or implicit[21] guidance during the generation phase. Inspired by Janner et al.[23], we here choose to add condition to our diffusion model via designing conditional action sequence input and applying condition projection. Besides, we also tried to restrict predicted actions by task mask and apply MoEs to our model for task and horizon conditions.

Another noteworthy point is that the distribution we want to fit is the whole action sequence, which has a strong semantic information. We notice that in experiments which take noise as the predicting objective like[20], our model just fails to be trained, indicating that directly predicting the semantically meaningless noise sampled from random Gaussian distribution can be hard. To address this problem, we change the learning objective of our model to provide a clear and strong learning distribution.

Conditional action sequence input. The input of a standard diffusion model is the data distribution it needs to fit and no guided information is required. For the procedure planning problem, the distribution we aim to fit is the intermediate action sequences $[a_{1},a_{2}...a_{T}]$ , which depends on the given observations. Thus we need to find how to add these guided conditions into the diffusion process. For a concise learning strategy, we here apply a simple way to achieve our goal: just treat these conditions as additional information and concatenate them along the action feature dimension. Specifically, our conditions include observation(o), task label(c) and horizon(h) for joint learning. For observation, since the start/end observations are more related to the first/last actions, we concatenate $o_{s}$ and $a_{1}$ together, same for $o_{g}$ and $a_{T}$ . Task label and horizon features are useful to the whole predicted sequence, thus we duplicate them for every action. Thus our model input for training now can be represented as a multi-dimension array:

\begin{bmatrix}h&h&&h&h\\ c&c&&c&c\\ a_{1}&a_{2}&...&a_{T-1}&a_{T}\\ o_{s}&0&&0&o_{g}\end{bmatrix}.

(5)

Each column in our model input represents a certain action and the corresponding condition information. Note that we do not need the intermediate visual observations as supervision, so all observation dimensions are set to zero except for the start and end observations. For VPA, $o_{g}$ is not provided and thus set as zero.

Adding condition by task mask and MoEs. Note that the conditional action sequence input can change with different ways to add task and horizon conditions. That is, task label dimension is removed when task mask condition is applied and horizon dimension will be discarded when using MoEs.

Task mask can be seen as a more adequate application of task information. Once the task category is determined, the planning sequence can only consist of actions belonging to this task. For example, actions in task ”Build Simple Floating Shelves” are ”cut shelve”, ”assemble shelve”, ”sand shelve”, ”paint shelve” and ”attach shelve”. Only these five actions can be involved in any action sequence describing how to build simple floating shelves. So with the predicted or given task condition, we can get a task mask which blocks out all actions that are not part of this task and thus reduce the searching space for action planning. In this sense, we can introduce task condition by multiplying the action distribution by the task mask.

As for horizon condition, in addition to concatenating horizon information and action together, we also apply MoEs[43]. The main idea of MoEs is to replicate certain parameters within a model and use routing algorithms to process different inputs with different sets of parameters. Thus, we replicate the parts of our model where MoEs is used and provide condition information to the routing gate module added in PDPP. We implement both direct routing and learned routing strategies. Direct routing just distributes input to specified sub-networks according to the horizon condition while learned routing learns a weight to integrate the outputs from all sub-networks together with the given conditions. Specifically, the learned routing module is implemented as MLP. Since our main goal is to save model parameters by joint training, we here only consider applying MoEs to the convolution part or attention part of Unet with attention layers model. By introducing MoEs, we hope PDPP can learn how to handle inputs with different planning horizons to minimize the negative impact brought by joint training.

Learning objective of our diffusion model. As mentioned above, the learning objective of a standard diffusion model is the random noise added to the distribution at each diffusion step. This learning scheme has demonstrated great success in visual data synthesis area. For procedure planning, however, the distribution we need to fit contains high-level features rather than raw pixels. Predicting random noise for procedure planning results in fault training, which is similar to text generation[27]. One possible reason is that the action sequence distribution we need to predict has strong semantics. When the noise predicted in the early stage of sampling is not very accurate, the feature distribution after initial denoising steps can not have the required strong semantics, which leads to the following denoising operation deviates from correct direction and get meaningless sampling result. So we modify the learning objective to the initial input $x_{0}$ , which is described in Sec. 4.1.

3.3 Condition projection during learning

Our model transforms a random Gaussian noise to the final result, which has the same structure with our model input by conducting $N$ denoising steps. Since we combine the conditional information with action sequences as the data distribution, these conditional guides can be changed during the denoising process. However, the change of these conditions will bring wrong guidance for the learning process and make the conditions useless. To address this problem, we add a condition projection operation into the learning process. That is, we force the condition dimensions not changed during training and inference by assigning the initial value. The input $x$ of condition projection is either a noise-add data (Alg.1 L5) or the predicted result of model (Alg.1 L7). We use $\{\hat{c},\hat{a},\hat{o},\hat{h}\}$ to represent different dimensions in $x$ , then our projection operation $\mathrm{Proj}()$ can be denoted as:

	$\displaystyle\begin{bmatrix}\hat{h}_{1}&\hat{h}_{2}&&&\hat{h}_{T}\\ \hat{c}_{1}&\hat{c}_{2}&&&\hat{c}_{T}\\ \hat{a}_{1}&\hat{a}_{2}&...&&\hat{a}_{T}\\ \hat{o}_{1}&\hat{o}_{2}&&&\hat{o}_{T}\end{bmatrix}\to\begin{bmatrix}h&h&&&h\\ c&c&&&c\\ \hat{a}_{1}&\hat{a}_{2}&...&&\hat{a}_{T}\\ o_{s}&0&&&o_{g}\end{bmatrix},$		(6)
	$\displaystyle x\qquad\qquad\qquad\qquad\mathrm{Proj}(x)$		(6)

where $\hat{h}_{i}$ , $\hat{c}_{i}$ , $\hat{o}_{i}$ and $\hat{a}_{i}$ denote the $i^{th}$ horizon, task, observation dimensions and predicted action logits in $x$ , respectively. $h$ , $c$ , $o_{s},o_{g}$ are the conditions. When task mask is applied, condition projection for task information is replaced with multiplying the action distribution by the task mask. For joint training with MoEs, horizon condition is used in routing gate module and there is no condition projection for horizon dimensions.

3.4 Two stage prediction process

On the basis of discussions in Sec. 3.1, we further split $p(a_{1:T}|o_{s},o_{g},c,h)$ modeling process into two stages for procedure planning. Considering that the given observations are visual states around the start and end actions, we propose to predict $a_{1},a_{T}$ first, then the predicted two actions can be regarded as action conditions for planning and stay unchanged with condition projection. The two stage prediction process makes full use of the given observations and provides more guidance to the sampling process. One problem with this scheme is that the action condition used in training is the ground truth, but $a_{1},a_{T}$ predicted by model cannot be completely correct, thus the distribution difference between the training set and test set may be further amplified, resulting in overfitting problem. Experiments show that this strategy can benefit our model on large scale dataset, while performs bad for datasets with limited size, which confirms our thought.

Algorithm 1 Training

0: Initial input

x_{0}

, total diffusion steps number

N

, model

f_{\theta}

\{\overline{\alpha}_{n}\}_{n=1}^{N}

, weight matrix

w

1: repeat

n\sim Uniform(\{1,...,N\})

\epsilon\sim\mathcal{N}(0,I)

x_{n}=\sqrt{\overline{\alpha}_{n}}x_{0}+\sqrt{1-\overline{\alpha}_{n}}\epsilon

\hat{x}_{0}=f_{\theta}(Proj(x_{n}),n)

6: Take gradient descent step on

\bigtriangledown_{\theta}\left|\left|\left(x_{0}-Proj(\hat{x}_{0})\right)*w\right|\right|^{2}

8: until converged

4 Training And Inference

4.1 Training scheme

Our training scheme contains two stages: a) training a task-classifier model $\mathcal{T}_{\phi}(c|o_{s},o_{g})$ that extracts conditional guidance from the given start and goal observations; b) leveraging the projected diffusion model to fit the target action sequence distribution.

For the first stage, we apply MLP models to predict the task class $c$ with the given $o_{s},o_{g}$ . Ground truth task class labels $\overline{c}$ are used as supervision. In the second learning stage, we follow the basic training scheme for diffusion model, but change the learning objective as the initial input $x_{0}$ . We denote our learnable model as $f_{\theta}(x_{n},n)$ and our training loss is:

\mathcal{L}_{\mathrm{diff}}=\sum_{n=1}^{N}(f_{\theta}(x_{n},n)-x_{0})^{2},

(7)

We believe that predicting actions corresponding to the given observations are more important because they can be inferred directly from observation conditions. Thus we rewrite Eq. 7 as a weighted loss by assigning a greater weight to { $a_{1},a_{T}$ } in procedure planning with a weight matrix $w$ .

Besides, we add a condition projection step to our diffusion process. So given the initial input $x_{0}$ which contains action sequences, task conditions and observations, we first add noise to the input to get $x_{n}$ , and then apply condition projection to ensure the guidance not changed. With $x_{n}$ and the corresponding diffusion step $n$ , we calculate the denosing output $f_{\theta}(x_{n},n)$ , followed by condition projection again. Finally, we compute the weighted $\mathcal{L}_{\mathrm{diff}}$ and update model, as shown in Algorithm 1. Note that the planning horizon of $x_{0}$ can vary for joint training.

4.2 Inference

At inference time, only the start observation $o_{s}$ and goal observation $o_{g}$ are provided for procedure planning. We thus first predict the task class by choosing the maximum probability value in the output of task-classifier model $\mathcal{T}_{\phi}$ . Then the predicted task class $c$ is used as the task condition. To sample from the learned action sequence distribution, we start with a Gaussian noise, and iteratively conduct denoise and condition projection for $N$ times. The detailed inference process is shown in Algorithm 2.

Algorithm 2 Inference

0: total diffusion steps number

N

, model

f_{\theta}

\{\overline{\alpha}_{n}\}_{n=1}^{N}

\{\beta_{n}\}_{n=1}^{N}

\hat{x}_{N}\sim\mathcal{N}(0,I)

2: for

n=N,...,1

\hat{x}_{0}=f_{\theta}(Proj(\hat{x}_{n}),n)

4: if

n>1

then

\hat{\mu}_{n}=\frac{\sqrt{\overline{\alpha}_{n-1}}\beta_{n}}{1-\overline{\alpha}_{n}}\hat{x}_{0}+\frac{\sqrt{\alpha_{n}}(1-\overline{\alpha}_{n-1})}{1-\overline{\alpha}_{n}}\hat{x}_{n}

\hat{\Sigma}_{n}=\frac{1-\overline{\alpha}_{n-1}}{1-\overline{\alpha}_{n}}\cdot\beta_{n}

\hat{x}_{n-1}\sim\mathcal{N}(\hat{x}_{n-1};\hat{\mu}_{n},\hat{\Sigma}_{n}\textbf{I})

8: end if

9: end for

10: return

\hat{x}_{0}

To further accelerate the sampling process, we apply DDIM[45] to our model and thus generate output with $N^{\prime}(<N)$ sampling steps. The sampling process with DDIM is presented in Algorithm 3.

Algorithm 3 Inference with DDIM

0: DDIM sampling timesteps

[t_{1},t_{2},...,t_{N^{\prime}}]

, model

f_{\theta}

\{\overline{\alpha}_{t_{n}}\}_{n=1}^{N^{\prime}}

, hyperparameter

\eta

\hat{x}_{t_{N^{\prime}}}\sim\mathcal{N}(0,I)

2: for

n=N^{\prime},...,1

\hat{x}_{0}=f_{\theta}(Proj(\hat{x}_{t_{n}}),t_{n})

\sigma_{t_{n}}=\eta\sqrt{(1-\overline{\alpha}_{t_{n-1}})/(1-\overline{\alpha}_{t_{n}})}\cdot\sqrt{1-\overline{\alpha}_{t_{n}}/\overline{\alpha}_{t_{n-1}}}

5: if

n>1

then

\hat{\mu}_{t_{n}}=\sqrt{\overline{\alpha}_{t_{n-1}}}\hat{x}_{0}+\sqrt{1-\overline{\alpha}_{t_{n-1}}-\sigma_{t_{n}}^{2}}\cdot\frac{\hat{x}_{t_{n}}-\sqrt{\overline{\alpha}_{t_{n}}}\hat{x}_{0}}{\sqrt{1-\overline{\alpha}_{t_{n}}}}

\hat{x}_{t_{n-1}}\sim\mathcal{N}(\hat{x}_{t_{n-1}};\hat{\mu}_{t_{n}},\sigma_{t_{n}}^{2}\textbf{I})

8: end if

9: end for

10: return

\hat{x}_{0}

Once we get the predicted output $\hat{x}_{0}$ , we take out the action sequence dimensions $[\hat{a}_{1},...,\hat{a}_{T}]$ and select the index of every maximum value in $\hat{a}_{i}(i=1,...,T)$ as the action sequence plan. Note that in the training stage of procedure planning, the class condition dimensions of $x_{0}$ are the ground truth task labels, not the output of our task-classifier as in inference.

5 Experiments

In this section, we evaluate our PDPP model on three real-life datasets and show our competitive results. We first present the result of our first training stage, which predicts the task class with the given observations in Sec. 5.3. Then we conduct detailed ablation studies to provide a thorough analysis on our proposed PDPP in Sec. 5.4. After that, we compare our performance with other alternative approaches on the three datasets and demonstrate the effectiveness of our model in Sec. 5.5. For a more comprehensive evaluation, we analyze the failure cases and discuss the computational efficiency of PDPP in Sec. 5.6 and Sec. 5.7, respectively. Finally, we show our prediction uncertainty evaluation results in Sec. 5.8. For VPA, we train our PDPP with best settings obtained from procedure planning and the results are presented in Sec. 5.9. To avoid results influenced by initial random noise when sampling, we calculate the mean values of multiple sampling results with different initial random noises as our results. Unless otherwise specified, we use the DDIM sampling process to get all results.

5.1 Evaluation protocol

Datasets. We evaluate our PDPP model on three instructional video datasets: CrossTask[58], NIV[2], and COIN[48]. CrossTask contains 2,750 videos from 18 different tasks, with an average of 7.6 actions per video. The NIV dataset consists of 150 videos about 5 daily tasks, which has 9.5 actions in one video on average. COIN is much larger with 11,827 videos, 180 different tasks and 3.6 actions/video.

For procedure planning, We randomly select 70% data for training and 30% for testing as previous work[6, 4, 55]. Following previous work[6, 4, 55], we extract all action sequences $\{[a_{i},...,a_{i+T-1}]\}_{i=1}^{n-T+1}$ with predicting horizon $T$ from the given video which contains $n$ actions by sliding a window of size $T$ . Then for each action sequence $[a_{i},...,a_{i+T-1}]$ , we choose the video clip feature at the beginning time of action $a_{i}$ and clip feature around the end time of $a_{i+T-1}$ as the start observation $o_{s}$ and goal state $o_{g}$ , respectively. Both clips are 3 seconds long. For experiments conduct on CrossTask, we use two kinds of pre-extracted video features as the start and goal observations. One are the features provided in CrossTask dataset: each second of video content is encoded into a 3200-dimensional feature vector as a concatenation of the I3D, ResNet-152 and audio VGG features[17, 18, 5], which are also applied in[4, 6]. The other kind of features are generated by the encoder trained with the HowTo100M[32] dataset, as in[55]. For experiments on the other two datasets, we follow[55] to use the HowTo100M features for a fair comparison.

For VPA, we follow Patel et al.[35] to conduct our experiments on CrossTask and COIN. Ratios of training set and test set are 7/3 and 9/1 for these two datasets, respectively. Action sequences are extracted as in procedure planning, and the start observation video clip is 4 seconds long. All video features are get with the encoder trained on HowTo100M dataset. Note that our PDPP do not require video history as additional input.

Metrics. Following previous work[6, 4, 55], we apply three metrics to evaluate the performance. a) Success Rate (SR) considers a plan as a success only if every action matches the ground truth sequence. b) mean Accuracy (mAcc) calculates the average correctness of actions at each individual time step, which means an predicted action is considered correct if it matches the action in ground truth at the same time step. c) mean Intersection over Union (mIoU) measures the overlap between predicted actions and ground truth by computing Iou $\frac{|\{a_{t}\}\cap\{\hat{a_{t}}\}|}{|\{a_{t}\}\cup\{\hat{a_{t}}\}|}$ , where $\{a_{t}\}$ is the set of ground truth actions and $\{\hat{a_{t}}\}$ is the set of predicted actions. Previous approaches[55, 4] compute the mIoU metric on every mini-batch (batch size larger than one) and calculate the average as the result. This brings a problem that the mIoU value can be influenced heavily by batch size. Consider if we set batch size is equal to the size of training data, then all predicted actions can be involved in the ground truth set and thus be correct predictions. However, if batch size is set to one, then any predicted action that not appears in the corresponding ground truth action sequence will be wrong. To address this problem, we standardize the way to get mIoU as computing IoU on every single sequence and calculating the average of these IoUs as the result (equal to setting of batch size $=1$ ).

5.2 Implementation details

Architectures. We use three backbones to implement our learnable model for projection diffusion.

$\bullet$ Unet[41]. We implement our model as a basic 3-layer Unet. As in[23], each layer in our model consists of two residual blocks[17] and one downsample or upsample operation. One residual block consists of two convolutions, each followed by a group norm[52] and Mish activation function[33]. Time embedding is produced by a fully-connected layer and added to the output of first convolution. We apply a 1d-convolution along the planning horizon dimension as the downsample/upsample operation. We set the kernel size of 1d-convolution as $2$ , stride as $1$ , padding as $0$ so the length change of planning horizon dimension keeps $1$ after each downsample or upsample. The hidden dimensions for each layer are 256, 512, 1024, respectively.
$\bullet$ Unet with attention layers[20]: We implement our Unet-attention architecture by adding self-attention[50] operation to a 2-layer Unet. The hidden dimensions are 512, 1024, respectively. Follow[20], we use SiLU as activation function and set the number of attention heads as 32.
$\bullet$ Transformer[36]: Since there is no downsample operation for transformer based model, we here stack more layers in our model to study whether attention with large number parameters can bring better results. We instantiate our transformer-based diffusion model with 12 layers and set the hidden dimension as 1024. The number of attention heads is 32. Layer norm[3] is applied to this backbone, time embedding is produced by a fully-connected layer and added with AdaLN, as in[36].

Diffusion process. For diffusion, we use the cosine noise schedule to produce the hyper-parameters $\{\beta_{n}\}_{n=1}^{N}$ , which denote the ratio of Gaussian noise added to the data at each diffusion step. We set diffusion step to 200 for CrossTask and COIN dataset and 50 for NIV due to their different scales. When DDIM sampling is applied, we set the sampling step number to 10 for faster inference.

Training. We use the linear warm-up training scheme to optimize our model and train different steps for the three datasets, corresponding to their scales. In CrossTask_Base, we set the diffusion step as $200$ and train a Unet based model for $12,000$ steps with learning rate increasing linearly to $8\times 10^{-4}$ in the first $4,000$ steps. Then the learning rate decays to $4\times 10^{-4}$ at step $10,000$ . In CrossTask_How, we keep diffusion step as 200 and train our Unet based model for $24,000$ steps with learning rate increasing linearly to $5\times 10^{-4}$ in the first $4,000$ steps and decays by 0.5 at step $10,000,16,000$ and $22,000$ . When joint training is applied to CrossTask_How, only $12,000$ training steps are required since gradient descent is conducted on multiple batches with different horizons in one training step. In NIV, the diffusion step is $50$ and we train a Unet based model for $6,500$ steps due to the small size of this dataset. The learning rate increases linearly to $3\times 10^{-4}$ in the first $4,500$ steps and decays by $0.5$ at step $6,000$ . We train a Unet-attention model on the COIN dataset with task mask and two stage prediction. We set diffusion step as $200$ and train our model for $14,000$ steps.The learning rate increases linearly to $1\times 10^{-4}$ in the first $4,000$ steps, then we keep learning rate unchanged for the remaining training steps. The training batch size for all experiments is $256$ . We set $w=10$ for the weighted loss. All our experiments are conducted with ADAM[24] on 8 NVIDIA TITAN XP GPUs.

5.3 Results of task classifier

The first stage of our learning is to predict the task class with the given start and goal observations. We implement this with MLP models. The classification results for different planning horizons on three datasets are shown in Tab. I. We can see that our classifier can perfectly figure out the task class in the NIV dataset since only 5 tasks are involved. For larger datasets CrossTask and COIN, our model can make right predictions most of the time.

TABLE I: Classification results on all datasets. CrossTask_Base uses features provided by the dataset while CrossTask_How applies features extracted by HowTo100M trained encoder.

	CrossTask_Base	CrossTask_How	COIN	NIV
$T$ = 3	83.87	92.43	79.42	100.00
$T$ = 4	83.64	92.98	78.89	100.00
$T$ = 5	83.37	93.39	-	-
$T$ = 6	83.85	93.20	-	-

5.4 Exploration Studies

To further study the effectiveness and performance of our PDPP, we in this section conduct detailed ablation studies. We start with the basic setting that we use Unet as backbone and train our model separately. Task condition is added by concatenating task feature in the conditional action sequence input. We mainly focus on the ablation study on CrossTask with better HowTo100M features.

5.4.1 Impact of batch size on mIoU

In this section, we study the impact of batch size on mIoU, which can illustrate the importance for the standardization of computing mIoU. As we discussed in Sec. 5.1, previous approaches calculate the IoU value on every mini-batch and take their mean as the final mIoU. However, the batch size value for different methods may be different, which results in an unfair comparison.

We use models trained with the basic setting and vary the evaluation batch size to compute the mIoU metric on CrossTask. We use CrossTask_Base to denote our model with features provided by CrossTask and CrossTask_How as model with features extracted by HowTo100M trained encoder. Planning horizon is set to $\{3,4,5,6\}$ . The results are shown in Tab. II, which validate our thought and show the huge impact of batch size on mIoU. The value of mIoU evaluated on the same model can vary widely as batch size changes, so comparing mIoU with different evaluation batch size has no meaning. To address this problem, we standardize the way to compute mIoU as setting inference batch size to 1.

TABLE II: Evaluation results of mIoU with different batch size on CrossTask.

	Batch size	T = 3	T=4	T=5	T=6
CrossTask_Base	1	56.90	56.99	56.32	57.51
	4	65.30	67.14	67.10	70.48
	8	68.83	69.64	67.39	69.31
	16	69.79	67.26	64.53	63.19
CrossTask_How	1	66.57	65.13	65.32	64.70
	4	75.21	77.07	78.56	78.59
	8	79.74	81.74	81.73	80.88
	16	80.50	82.32	81.41	78.64

TABLE III: Ablation study on the effect of joint training. HowTo100M features are applied to CrossTask.

	Dataset	Train jointly			Train separately
	Dataset	SR $\uparrow$	mAcc $\uparrow$	mIoU $\uparrow$	SR $\uparrow$	mAcc $\uparrow$	mIoU $\uparrow$
$T$ = 3	CrossTask	37.00	65.16	67.24	37.20	64.67	66.57
	NIV	30.60	47.64	57.35	30.20	48.45	57.28
	COIN	18.21	43.65	49.11	21.33	45.62	51.82
$T$ = 4	CrossTask	22.53	59.09	66.57	21.48	57.82	65.13
	NIV	27.01	46.90	60.13	26.67	46.89	59.45
	COIN	13.38	43.58	50.65	14.41	44.10	51.39
$T$ = 5	CrossTask	13.96	54.64	66.43	13.45	54.01	65.32
$T$ = 6	CrossTask	8.65	50.46	66.18	8.41	49.65	64.70

TABLE IV: Ablation study on the role of task supervision. The

w.

task

sup.

denotes learning with task supervision and

w.o.

task

sup.

means training with no task information.

	Dataset	w. task sup.			w.o. task sup.
	Dataset	SR $\uparrow$	mAcc $\uparrow$	mIoU $\uparrow$	SR $\uparrow$	mAcc $\uparrow$	mIoU $\uparrow$
$T$ = 3	CrossTask	37.20	64.67	66.57	35.69	63.91	66.04
	NIV	30.20	48.45	57.28	28.37	45.96	54.31
	COIN	21.33	45.62	51.82	16.48	36.57	43.48
$T$ = 4	CrossTask	21.48	57.82	65.13	20.52	57.47	64.39
	NIV	26.67	46.89	59.45	26.50	46.08	58.94
	COIN	14.41	44.10	51.39	11.65	35.04	41.75
$T$ = 5	CrossTask	13.45	54.01	65.32	12.80	53.44	64.01
$T$ = 6	CrossTask	8.41	49.65	64.70	8.15	50.45	64.13

5.4.2 Joint training vs. training separately

Previous approaches involved training a separate model for each prediction horizon, known as ”training separately”, which suffers from high training costs and complex inference. Here, we consider using data from different horizons to train one system to save training parameters, referred to as ”joint training”. Here we study the effects of directly introducing joint training to our model. That is, we just apply joint training to the basic training setting without giving any horizon condition. Tab. III presents the results, which show that joint training has a positive effect on smaller datasets CrossTask and NIV, while brings bad performance to COIN. We assume this is because much more data in large scale dataset has a more serious conflict, thus modeling the data with shared parameters can be much harder. We will introduce methods to help our model achieve better joint training results later.

5.4.3 Study on task supervision and backbones

Task label plays a crucial role in our PDPP, which can provide class information about action sequence thus narrow down the searching space and help model mask better planning. In experiments we notice that the best way to add task information for different backbones can vary. So in this section, we conduct ablation studies on task supervision and backbones to get the best setting for the three datasets.

We first conduct a simple study on the role of task supervision with the basic setting by removing the task dimensions in conditional input, thus model can only plan with the start and end observations. The results are shown in Tab. IV. It can be seen from the results that task label is useful for all three datasets when training separately. Planning without task information brings slight performance drop on CrossTask and NIV, while all metrics drop dramatically for large scale dataset COIN. We think the reason for this is that providing task class explicitly to large scale dataset can effectively reduce the difficulty of model training, while model can learn task information of action sequence implicitly on CrossTask and NIV, thus the impact on these two datasets is much smaller.

Based on the above analysis, we try three applications of task condition on CrossTask and NIV: planning with no task, adding task information by concatenating, and multiplying task mask. For COIN dataset, task condition is crucial, so we only study the concatenation and task mask methods. Since in Sec. 5.4.2 we find that joint training is good for CrossTask and NIV while bad for COIN, here we just apply joint training to CrossTask and NIV and train models separately on COIN. Results for different model backbones with varied task-adding methods are shown in Fig. 4.

Analyzing the study results on task supervision and backbones, we can draw the following conclusions: a) Unet based model performs best for smaller datasets and Transformer model is more suitable for large scale datasets. Task mask works bad for CrossTask and NIV but benefits COIN a lot. We believe these results are related to the dataset size. It can be expected that Transformer only performs well for COIN since applying large model to small datasets CrossTask and COIN can cause serious overfitting problem and get bad generalization results on the test set. For task mask, it is a strong planning guide which constrains the predicted results to a smaller range of actions. We assume the strong limit amplifies the distribution difference between the training and test sets. That is, task mask further breaks down the problem into learning in subspaces and reduces the difficulty of fitting the training set, thus results in overfitting problem on small datasets. b) Compared with results in Tab. IV, joint training makes the gap between no task and task concatenation smaller. One explanation for this is that with joint training, more data with different horizons are provided for the model, thus model can better infer the task information from action sequences implicitly and get similar results to model trained with task supervision. c) Transformer model performs better with task concatenation than task mask on COIN dataset. One possible reason is that transformer model trained with task mask just overfits the training set of COIN.

We also notice that applying task mask to COIN can speed up the model convergence greatly. For example, when task mask is applied to Unet-attention based model on COIN, the model can converge in around 25,000 training steps. However, when concatenation is used to add task condition to model, the training step is 200,000, which is much time-consuming.

TABLE V: Ablation study on two-stage prediction process for CrossTask. We experiment with Unet models and joint training is applied.

	Task sup.	w. two-stage prediction			w.o. two-stage prediction
	Task sup.	SR $\uparrow$	mAcc $\uparrow$	mIoU $\uparrow$	SR $\uparrow$	mAcc $\uparrow$	mIoU $\uparrow$
$T$ = 3	no task	34.61	65.08	65.91	37.64	65.47	67.45
$T$ = 3	concatenation	33.92	64.61	65.63	37.00	65.16	67.24
$T$ = 4	no task	19.69	58.59	65.52	22.35	59.39	66.59
$T$ = 4	concatenation	19.89	58.47	65.42	22.53	59.09	66.57
$T$ = 5	no task	12.33	54.46	65.98	13.68	54.91	65.92
$T$ = 5	concatenation	12.08	53.98	65.89	13.96	54.64	66.43
$T$ = 6	no task	7.55	50.29	66.19	8.38	51.12	66.26
$T$ = 6	concatenation	7.46	50.00	65.68	8.65	50.46	66.18

TABLE VI: Ablation study on two-stage prediction process for COIN. Task mask is applied to Unet and Unet-attention models. Transformer model is trained with task concatenation. Models are trained separately.

	Backbone	w. two-stage prediction			w.o. two-stage prediction
	Backbone	SR $\uparrow$	mAcc $\uparrow$	mIoU $\uparrow$	SR $\uparrow$	mAcc $\uparrow$	mIoU $\uparrow$
$T$ = 3	Unet	27.62	48.16	57.36	26.97	47.42	56.91
	Unet-attention	31.05	50.81	59.18	27.82	48.61	57.60
	Transformer	29.70	51.53	58.66	30.14	51.74	58.76
$T$ = 4	Unet	21.53	46.67	59.43	20.20	46.30	58.92
	Unet-attention	22.45	47.45	60.00	20.87	46.99	59.42
	Transformer	21.31	49.10	60.32	20.86	48.70	60.06

5.4.4 Study on two-stage prediction process

With ablation studies on task supervision and model architectures, we can select Unet backbone as the best model for CrossTask and NIV. However, for COIN dataset, though transformer based model with task concatenation performs the best, the training cost is higher than training a Unet-attention model with task mask. So in this section, we try to further split the planning process by two-stage prediction process, with an aim to achieve better performance and faster training.

The first stage in two-stage prediction strategy is to predict start and end actions with given conditions. We implement this with the same architecture of the corresponding planning model and change the learning objective to action sequence $\{a_{1},a_{T}\}$ . However, the downsample and upsample operations in Unet and Unet-attention makes action sequence of length two not available to these models. So we make some modifications by instantiating 2-layer Unet and Unet-attention models to predict action sequence $\{a_{1},a_{T}\}$ . Hidden dimensions are set to 512, 1024, respectively.

After the first prediction stage, we can consider the predicted start and end actions as additional condition and apply them to the condition projection. In experiments we find that the SR metric for predicting $\{a_{1},a_{T}\}$ on NIV is lower than directly predicting the whole action sequence, thus using the predicted $\{a_{1},a_{T}\}$ as condition will definitely gets worse performance. So here we just conduct experiments on CrossTask and COIN. The results for study on two-stage prediction process are shown in Tabs. V and VI, in which joint training is applied only to CrossTask. As we expected in Sec. 3.4, the two-stage prediction strategy provides more guidance to the sampling process and benefits performance on COIN, but causes overfitting problem on smaller datasets, thus the results on CrossTask are worse than planning the whole action sequence. Note that with the two-stage prediction process, the performance of Unet-attention model surpasses transformer based model. Besides, with the predicted $\{a_{1},a_{T}\}$ , the training of Unet-attention model can be completed in 10,000 steps, and less than 20,000 training steps are required for predicting $\{a_{1},a_{T}\}$ . Thus now we can conduct better and faster learning on COIN with Unet-attention model rather than transformer.

TABLE VII: Ablation study on horizon condition for CrossTask with Unet backbone.

	Task sup.	w. horizon			w.o. horizon
	Task sup.	SR $\uparrow$	mAcc $\uparrow$	mIoU $\uparrow$	SR $\uparrow$	mAcc $\uparrow$	mIoU $\uparrow$
$T$ = 3	no task	37.96	65.76	67.75	37.64	65.47	67.45
$T$ = 3	concatenation	37.46	65.34	67.35	37.00	65.16	67.24
$T$ = 4	no task	22.56	59.43	66.72	22.35	59.39	66.59
$T$ = 4	concatenation	22.83	59.22	66.39	22.53	59.09	66.57
$T$ = 5	no task	14.30	55.17	66.68	13.68	54.91	65.92
$T$ = 5	concatenation	13.79	54.58	66.19	13.96	54.64	66.43
$T$ = 6	no task	8.93	51.45	66.56	8.38	51.12	66.26
$T$ = 6	concatenation	8.65	50.47	66.11	8.65	50.46	66.18

TABLE VIII: Ablation study on horizon condition for NIV with Unet backbone.

	Task sup.	w. horizon			w.o. horizon
	Task sup.	SR $\uparrow$	mAcc $\uparrow$	mIoU $\uparrow$	SR $\uparrow$	mAcc $\uparrow$	mIoU $\uparrow$
$T$ = 3	no task	29.18	45.21	55.08	31.46	47.00	56.27
$T$ = 3	concatenation	30.74	48.10	57.96	30.60	47.64	57.35
$T$ = 4	no task	26.15	46.74	59.58	26.87	46.72	58.02
$T$ = 4	concatenation	27.78	46.85	59.62	27.01	46.90	60.13

TABLE IX: Ablation study on horizon for COIN with Unet-attention backbone and task mask. Two stage prediction process is applied. MoEs_atten and MoEs_conv denote applying MoEs to all attention or convolution layers.

Horizon sup.	Routing	$T$ = 3			$T$ = 4
Horizon sup.	Routing	SR $\uparrow$	mAcc $\uparrow$	mIoU $\uparrow$	SR $\uparrow$	mAcc $\uparrow$	mIoU $\uparrow$
Train separately	-	31.05	50.81	59.18	22.45	47.45	60.00
No horizon	-	28.97	50.58	58.35	21.95	48.06	60.16
Concatenation	-	30.12	50.95	59.00	22.24	48.15	60.22
MoEs_atten	Direct	29.56	50.70	58.46	22.25	48.17	60.31
MoEs_atten	Learned	29.64	50.82	58.53	21.98	48.08	60.17
MoEs_conv	Direct	29.44	50.61	58.37	22.46	48.25	60.04
MoEs_conv	Learned	29.47	50.70	58.35	22.23	48.22	60.26

5.4.5 Study on methods for better joint training

We in this section study two methods to utilize horizon information for better joint training: horizon concatenation and MoEs. We try horizon concatenation on all datasets. For MoEs, as explained in Sec. 3.2, we only apply it to the convolution part or attention part of Unet-attention model with direct routing and learned routing strategies on COIN since Unet-attention model only performs well on large scale dataset. The results are presented in Tabs. VII, VIII and IX.

For CrossTask, we can see that adding horizon condition is overall positive to our model, especially for models trained with no task. For NIV, applying horizon information along with task concatenation can benefit models, while harms the performance of models with no task. As for COIN dataset, joint training still has a negative impact on our model. Compared with directly train PDPP on COIN jointly(”No horizon” in Tab. IX), utilizing horizon concatenation and applying MoEs can both alleviate the influence caused by joint training. On the whole, horizon concatenation performs the best and requires the minimum additional training parameters.

With all above ablation studies, we can now get the best settings for our joint trained PDPP on different datasets. For CrossTask, Unet based model trained with horizon condition concatenation is applied. Task supervision is not required. For NIV, we just concatenate task and horizon conditions to the action sequence to train our Unet diffusion model. For COIN, we use the Unet-attention backbone to implement our model. Horizon information is added by concatenation and task condition is added by task mask. Two stage prediction is applied for COIN.

5.4.6 Study on conditions applied to PDPP

In this section we conduct a more comprehensive study on conditions applied to PDPP. We chose to experiment on the largest and most comprehensive dataset COIN to achieve a thorough evaluation.

We first ablate on sampling conditions to figure out how different conditions affect PDPP. Apart from predicted task class ( $task$ ), predicted start-end actions ( $a_{1}a_{T}$ ) and given observations ( $ob.$ ) mentioned in the article, we also follow Stable Diffusion[40] to encode the name of predicted task with CLIP[37] text encoder as language condition ( $lang.$ ). Tab. X below shows the ablation results. It can be seen that the input observations and start-end actions predicted by given observations are important for the successful prediction of PDPP, which is natural because without observations (see the first row in Tab. X), we can only randomly predict action sequences from the corresponding category task. Although this sequence may maintain the logical relationship between actions, it would not correspond to the given observations. Moreover, introducing language embedding does not improve the prediction effect. Useful conditions are the given observations, predicted actions and tasks. We think this is because in a non-open-vocabulary prediction setting, the main role of conditions is to guide model to distinguish between different tasks and actions, thus using mask or one-hot embedding as condition is more direct and easier to learn than introducing language embeddings.

TABLE X: Ablation study on conditions applied to PDPP with the COIN dataset.

$ob.$	$a_{1}a_{T}$	$task$	$lang.$	T=3			T=4
$ob.$	$a_{1}a_{T}$	$task$	$lang.$	SR $\uparrow$	mAcc $\uparrow$	mIoU $\uparrow$	SR $\uparrow$	mAcc $\uparrow$	mIoU $\uparrow$
✗	✗	✓	✗	0.11	0.45	0.70	0.06	0.52	0.82
✗	✓	✓	✗	27.79	49.91	56.83	16.74	44.11	54.32
✓	✗	✓	✗	28.85	50.06	58.07	21.38	48.05	59.86
✓	✓	✓	✗	30.12	50.95	59.00	22.24	48.15	60.22
✓	✓	✗	✗	28.41	51.35	58.17	19.87	48.68	59.52
✓	✓	✗	✓	28.10	50.84	58.01	19.68	47.97	59.29
✓	✓	✓	✓	29.77	50.90	58.90	21.58	47.78	60.21

Based on above findings, we retained the effective conditions and studied the effect of applying the Classifier-Free Guidance (CFG) strategy to PDPP. Specifically, we follow[21] to train PDPP by removing a certain proportion (20%) of conditions, allowing it to perform both conditional and unconditional predictions simultaneously. After trained, we sample from the learned distribution with the following formula:

\epsilon_{t}=\epsilon_{\theta}(x_{t},c)+\lambda(\epsilon_{\theta}(x_{t},c)-\epsilon_{\theta}(x_{t}))

(8)

where $c$ denotes conditions and $\lambda$ controls the degree to which the sampling is influenced by the condition. We then conducted ablation on the value of $\lambda$ . Since predicted start-end actions from the first stage are part of the prediction result, we combine them with randomly selected intermediate actions as a baseline for comparison. The results provided in Tab. XI show that: (1) Even without any conditions ( $\lambda=-1$ ), PDPP can learn some of the logical relationships between actions and thus obtain better results than the random baseline; (2) A smaller $\lambda(0-1)$ slightly improves the planning result, while larger $\lambda(>1)$ can cause prediction errors. (3) Removing a certain proportion of conditions during training is a data augmentation method that can improve model performance on COIN. Overall, the CFG strategy provides little improvement for the effectiveness of PDPP. We believe this is due to the semantic differences between procedure planning and image generation. Unconditional prediction, lacking observations, cannot achieve accurate plans and tends to predict action sequences that have a higher probability of occurring. Thus using the difference between conditional and unconditional predictions as the main factor ( $\lambda>1$ ) would result in poorer results.

TABLE XI: Ablation study of the cfg_scale value

\lambda

on the largest COIN dataset. Two-stage prediction process is applied.

	T=3			T=4
	SR $\uparrow$	mAcc $\uparrow$	mIoU $\uparrow$	SR $\uparrow$	mAcc $\uparrow$	mIoU $\uparrow$
Random Baseline	9.59	42.63	47.22	2.27	36.31	47.75
$\lambda$ =-1.0(uncondition)	18.54	47.55	53.47	13.08	45.74	56.50
$\lambda$ =0.0(condition)	30.33	51.16	58.99	22.60	48.59	60.32
$\lambda$ =0.1	30.34	51.16	59.00	22.60	48.58	60.34
$\lambda$ =0.3	30.33	51.15	58.99	22.75	48.62	60.38
$\lambda$ =0.5	30.31	51.12	58.97	22.72	48.59	60.39
$\lambda$ =1.0	30.21	51.08	58.93	22.52	48.45	60.31
$\lambda$ =2.0	30.01	51.00	58.83	21.97	48.23	60.09
$\lambda$ =3.0	29.74	50.90	58.70	21.09	47.87	59.76
$\lambda$ =4.0	29.40	50.76	58.53	19.80	47.36	59.21
$\lambda$ =5.0	28.90	50.55	58.27	18.44	46.81	58.56

5.5 Comparison with other approaches

TABLE XII: Evaluation results on CrossTask for procudure planning with prediction horizon

T\in\{3,4\}

. The

Feature

column denotes the type of video feature used for learning. The

Supervision

column denotes the type of supervision applied in training, where V denotes intermediate visual states, L denotes language feature and C means task class. Note that we compute mIoU by calculating average of every IoU of a single antion sequence rather than a mini-batch.

			$T$ = 3			$T$ = 4
Models	Feature	Supervision	SR $\uparrow$	mAcc $\uparrow$	mIoU $\uparrow$	SR $\uparrow$	mAcc $\uparrow$	mIoU $\uparrow$
Random	Base	-	$<$ 0.01	0.94	1.66	$<$ 0.01	0.83	1.66
Retrieval-Based	Base	-	8.05	23.30	32.06	3.95	22.22	36.97
WLTDO[12]	Base	-	1.87	21.64	31.70	0.77	17.92	26.43
UAAA[13]	Base	-	2.15	20.21	30.87	0.98	19.86	27.09
UPN[46]	Base	V	2.89	24.39	31.56	1.19	21.59	27.85
DDN[6]	Base	V	12.18	31.29	47.48	5.97	27.10	48.46
Ext-GAILw/o Aug.[4]	Base	V	18.01	43.86	57.16	-	-	-
Ext-GAIL[4]	Base	V	21.27	49.46	61.70	16.41	43.05	60.93
Ours_Base	Base	C	26.43	55.04	57.93	16.20	50.85	58.47
P³IV[55]	HowTo100M	L	23.34	49.96	73.89	13.40	44.16	70.01
Ours_How	HowTo100M	-	37.96	65.76	67.75	22.56	59.43	66.72

In this section, We follow previous work[55] and compare our approach with other alternative methods for procedure planning on the three datasets, across multiple prediction horizons.

Baselines.

- $Random$ . This policy randomly selects actions from the available action space in dataset to produce the plans.

- $Retrieval$ - $Based$ . Given the observations $\{o_{s},o_{g}\}$ , the retrieval-based method retrieves the closest neighbor by calculating the minimum visual feature distance in the train dataset. Then the action sequence associated with the retrieved result will be used as the action plan.

- $WLTDO$ [12]. This approach applies a recurrent neural network(RNN) to predict action steps with the given observation pairs.

- $UAAA$ [13]. UAAA is a two-stage approach which uses RNN-HMM model to predict action steps autoregressively.

- $UPN$ [46]. UPN is a physical-world path planning algorithm and learns a plannable representation to make predictions. To produce the discrete action steps, we follow[6] to add a softmax layer to the output of this model.

- $DDN$ [6]. DDN model is a two-branch autoregressive model, which learns an abstract representation of action steps and tries to predict the state-action transition in the feature space.

- $PlaTe$ [47]. PlaTe model follows DDN and uses transformer modules in two-branch to predict instead. Note that the evaluation protocol of PlaTe is different with other models, so we move the comparison with PlaTe to supplementary material, which we will discuss later.

- $Ext$ - $GAIL$ [4]. This model solves the procedure planning problem by reinforcement learning techniques. Similar to our work, Ext-GAIL decomposes the procedure planning problem into two sub-problems. However, the purpose of the first sub-problem in Ext-GAIL is to provide long-horizon information for the second stage while our purpose is to get condition for sampling.

- $P^{3}IV$ [55]. P³IV is a single-branch transformer-based model which augments itself with a learnable memory bank and an extra generative adversarial framework. Like our model, P³IV predicts all action steps at once during inference process.

CrossTask (short horizon). We first evaluate on CrossTask with two prediction horizons typically used in previous work. We use Ours_Base to denote our model with features provided by CrossTask and Ours_How as model with features extracted by HowTo100M trained encoder. For Ours_Base, here we apply Unet model trained jointly with task and horizon concatenation to compare with previous methods. Note that we compute mIoU by calculating the mean of every IoU for a single antion sequence rather than a mini-batch as explained in Sec. 5.1, though the latter can achieve a higher mIoU value. Results in Tab. XII show that Ours_Base beats all methods for most metrics except for the success rate (SR) when $T=4$ , where our model is the second best, and Ours_How just significantly outperforms all previous methods. Specifically, for using HowTo100M-extracted video features, we outperform[55] by more than 14% and 9% for SR when $T=3,4$ , respectively. As for features provided by CrossTask, Ours_Base outperforms the previous best method[4] by more than 5% both for SR and mAcc when $T=3$ .

TABLE XIII:

SR

evaluation results on CrossTask with longer planning horizons.

	$T$ = 3	$T$ = 4	$T$ = 5	$T$ = 6
Models	SR $\uparrow$	SR $\uparrow$	SR $\uparrow$	SR $\uparrow$
Retrieval-Based	8.05	3.95	2.40	1.10
DDN[6]	12.18	5.97	3.10	1.20
P³IV[55]	23.34	13.40	7.21	4.40
Ours_Base	26.43	16.20	9.20	6.28
Ours_How	37.96	22.56	14.30	8.93

CrossTask (long horizon). We further study the ability of predicting with longer horizons for our model. Following[55], we here evaluate the SR value with planning horizon $T=\{3,4,5,6\}$ . We present the result of our model along with other approaches that reported results for longer horizons in Tab. XIII. This result shows our model can get a stable and great improvement with all planning horizons compared with the previous best model.

TABLE XIV: Evaluation results on NIV and COIN with prediction horizon

T\in\{3,4\}

. Sup. denotes the type of supervision in training. Note that we compute IoU on every action sequence and take the mean as mIoU.

		NIV			COIN
Horizon	Models	SR $\uparrow$	mAcc $\uparrow$	mIoU $\uparrow$	SR $\uparrow$	mAcc $\uparrow$	mIoU $\uparrow$
$T$ = 3	Random	2.21	4.07	6.09	$<$ 0.01	$<$ 0.01	2.47
	Retrieval	-	-	-	4.38	17.40	32.06
	DDN[6]	18.41	32.54	56.56	13.9	20.19	64.78
	Ext-GAIL[4]	22.11	42.20	65.93	-	-	-
	P³IV[55]	24.68	49.01	74.29	15.4	21.67	76.31
	Ours	30.74	48.10	57.96	30.12	50.95	59.00
$T$ = 4	Random	1.12	2.73	5.84	$<$ 0.01	$<$ 0.01	2.32
	Retrieval	-	-	-	2.71	14.29	36.97
	DDN[6]	15.97	27.09	53.84	11.13	17.71	68.06
	Ext-GAIL[4]	19.91	36.31	53.84	-	-	-
	P³IV[55]	20.14	38.36	67.29	11.32	18.85	70.53
	Ours	27.78	46.85	59.62	22.24	48.15	60.22

NIV and COIN. Tab. XIV shows our evaluation results on the other two datasets NIV and COIN, from which we can see that our approach remains to be the best performer for both datasets. Specifically, in the NIV dataset where mAcc is relatively high, our model raises the SR value by more than 6% both for the two horizons and outperforms the previous best by more than 8% on mAcc metric when $T=4$ . As for the large COIN dataset where mAcc is low, our model significantly improves mAcc by around 30% and almost double the SR value when $T=3,4$ .

All the results suggest that our model performs well across datasets with different scales.

5.6 Analysis of Failure Cases

We further analyze failure cases where PDPP makes wrong predictions on the largest and most comprehensive COIN dataset. Specifically, we calculate the proportions of the following three types of errors: task prediction error (task error), start-end actions prediction error (action error), and intermediate action prediction error (intermediate error). The results are presented in Tab. XV. Visualization results are shown in Fig. 5. It can be seen that the primary cause of failure cases is start-end actions prediction error, indicating that inferring semantic action information from given observations is the key bottleneck to be improved. Model needs to incorporate finer details observed, such as the lower hammer and seasoning bottle in Fig. 5 “Action error”, to make more accurate predictions. For task predicting error, an imbalance in the distribution of task categories within the training set makes certain tasks difficult to recognize accurately (e.g. very few training samples of task “Resize Watch Band”). Besides, similarities between certain tasks give rise to errors in prediction due to their overlapping nature (e.g. “Make Cookie” and “Make Chocolate” share similar steps, “Change Battery Of Watch” and “Resize Watch Band” share similar appearances). Intermediate errors are caused by two reasons: the uncertainty of procedure planning and the gap between training and inference. In the example of Fig. 5, all the ingredients in the test video have already been cut, but the model has encountered many examples during training where ingredients are chopped and then added into the bowl, so it predicts the action of “cut ingredients”.

TABLE XV: Statistical results of different error types on the largest COIN dataset.

	task error	action error	intermediate error
$T$ = 3	29.6%	56.3%	14.1%
$T$ = 4	26.9%	47.7%	25.4%

TABLE XVI: Computational efficiency study of PDPP. The

w.

j.t.

denotes learning with joint training and

w.o.

j.t.

means training separately.

Dataset	FLOPs(M)	time(s)	Training Params(M)		Training steps
Dataset	FLOPs(M)	time(s)	w. j.t.	w.o. j.t.	w. j.t.	w.o. j.t.
CrossTask	52.98	0.13	41.78	41.77 * $T$	24,000	24,000 * $T$
NIV	52.77	0.12	41.71	41.71 * $T$	6,500	6,500 * $T$
COIN	341.54	0.26	157.44	157.42 * $T$	14,000	14000 * $T$

TABLE XVII: Evaluation results of probabilistic modeling metrics on COIN and NIV. Joint training with horizon concatenation is applied. Unet-attention based model with task mask and Unet based model with task concatenation are used for COIN and NIV, respectively.

Metric	Model	$T$ = 3	$T$ = 4	$T$ = 3	$T$ = 4
		COIN		NIV
NLL $\downarrow$	Deterministic	5.29	5.69	5.53	5.57
	Noise	5.15	5.46	5.11	5.13
	Ours	5.03	5.20	4.77	4.58
KL-Div $\downarrow$	Deterministic	4.34	4.29	5.45	5.44
	Noise	4.21	4.06	5.03	5.01
	Ours	4.08	3.79	4.69	4.46
SR $\uparrow$	Deterministic	30.70	23.39	27.94	24.14
	Noise	30.66	23.02	25.36	23.83
	Ours	30.12	22.24	30.74	27.78
ModePrec $\uparrow$	Deterministic	37.89	35.64	30.00	25.76
	Noise	37.86	34.90	26.64	24.66
	Ours	37.62	34.18	32.72	29.56
ModeRec $\uparrow$	Deterministic	29.99	22.51	27.42	23.37
	Noise	34.24	29.98	36.50	31.43
	Ours	36.13	34.25	37.62	34.88

TABLE XVIII: Evaluation results of probabilistic modeling metrics on CrossTask. Joint training with horizon concatenation is applied to the Unet based model.

Metric	Model	$T$ = 3	$T$ = 4	$T$ = 5	$T$ = 6	$T$ = 3	$T$ = 4	$T$ = 5	$T$ = 6
		no task				task concatenation
NLL $\downarrow$	Deterministic	3.62	4.23	4.74	5.11	3.62	4.17	4.61	5.03
	Noise	3.52	4.06	4.43	4.78	3.52	4.07	4.45	4.80
	Ours	2.96	3.26	3.49	3.58	2.93	3.19	3.40	3.48
KL-Div $\downarrow$	Deterministic	3.04	3.35	3.59	3.80	3.04	3.29	3.46	3.72
	Noise	2.94	3.17	3.28	3.48	2.94	3.19	3.29	3.49
	Ours	2.38	2.37	2.33	2.27	2.35	2.30	2.24	2.17
SR $\uparrow$	Deterministic	36.54	22.08	12.04	8.05	37.37	22.75	13.29	7.83
	Noise	36.44	21.89	13.14	7.96	36.72	21.48	13.50	7.92
	Ours	37.96	22.56	14.30	8.93	37.46	22.83	13.79	8.65
ModePrec $\uparrow$	Deterministic	54.64	46.14	33.56	21.57	55.27	47.48	36.80	24.96
	Noise	53.77	45.55	34.73	24.21	54.06	45.27	35.41	24.08
	Ours	55.31	47.38	36.79	25.94	54.86	46.96	36.75	25.48
ModeRec $\uparrow$	Deterministic	33.15	18.56	10.12	6.00	33.67	19.54	11.39	6.59
	Noise	37.34	23.41	14.69	9.62	38.13	23.47	15.04	9.83
	Ours	52.44	43.99	36.90	31.59	53.77	44.90	38.68	33.44

5.7 Discussion on Computational Efficiency

The computational efficiency of PDPP is provided in Tab. XVI, including the number of training parameters, FLOPs and time required for a single sampling of PDPP across different datasets. We have also studied the effect of joint training on model parameter savings and training cost. Experimental results demonstrate that joint training effectively saves model training parameters and training steps when planning step $T>1$ , which also enables flexible multi-step procedure predicting. All the experimental results are obtained on a single NVIDIA TITAN XP GPU.

5.8 Evaluation on probabilistic modeling

As discussed in Sec. 1, we introduce diffusion model to procedure planning to model the uncertainty in this problem. Here we follow[55] to evaluate our probabilistic modeling.

Our model is probabilistic by starting from random noises and denoising step by step. We here introduce two baselines to compare with our diffusion based approach. We first remove the diffusion process in our method to establish the Noise baseline, which just samples from a random noise with the given observations and task class condition in one shot. Then we further establish the Deterministic baseline by setting the start distribution $\hat{x}_{N}=0$ , thus the model directly predicts a certain result with the given conditions. For the Deterministic baseline, we just sample once to get the plan since the result for Deterministic is certain when observations and task class conditions are given. For the Noise baseline and our diffusion based model, we sample 1500 action sequences as our probabilistic result to calculate the uncertain metrics. Note that the multiple sampling process is only required while evaluating probabilistic modeling and our model can generate a good plan just by sampling once.

We reproduce the KL divergence, NLL, ModeRec and ModePrec in[55] and use these metrics along with SR to evaluate our probabilistic model. Specifically, the SR and ModePrec can reflect the accuracy of planning results, ModeRec reflects the diversity of plans, and KL divergence, NLL indicate the overall agreement between predictions and the ground truth distribution.

Role of diffusion process. Results in Tabs. XVIII and XVII suggest that PDPP achieves the best performance on most metrics, which shows our approach has an excellent ability to model the uncertainty in procedure planning and can produce both

TABLE XIX: Study on influence of joint training and DDIM sampling for probabilistic modeling results with PDPP.

Metric	Train&Sample	$T$ = 3	$T$ = 4	$T$ = 5	$T$ = 6	$T$ = 3	$T$ = 4	$T$ = 3	$T$ = 4
		CrossTask				COIN		NIV
NLL $\downarrow$	Separate_DDIM	3.61	3.85	3.77	4.06	5.15	5.47	4.93	4.75
	Join_DDIM	2.93	3.19	3.40	3.48	5.03	5.20	4.77	4.58
	Join_DDPM	3.14	3.42	3.65	3.72	5.13	5.41	5.01	4.95
KL-Div $\downarrow$	Separate_DDIM	3.03	2.96	2.62	2.76	4.20	4.07	4.85	4.62
	Join_DDIM	2.35	2.30	2.24	2.17	4.08	3.79	4.69	4.46
	Join_DDPM	2.56	2.54	2.49	2.42	4.18	4.00	4.92	4.82
SR $\uparrow$	Separate_DDIM	37.20	21.48	13.45	8.41	31.05	22.45	30.20	26.67
	Join_DDIM	37.46	22.83	13.79	8.65	30.12	22.24	30.74	27.78
	Join_DDPM	38.07	23.79	14.75	9.52	30.34	22.68	30.85	27.96
ModePrec $\uparrow$	Separate_DDIM	53.14	44.55	36.30	25.61	38.21	33.66	31.78	29.10
	Join_DDIM	54.86	46.96	36.75	25.48	37.62	34.18	32.72	29.56
	Join_DDPM	55.12	48.02	38.44	27.78	37.81	34.68	33.26	30.49
ModeRec $\uparrow$	Separate_DDIM	36.49	31.10	29.45	22.68	34.63	29.64	33.09	33.08
	Join_DDIM	53.77	44.90	38.68	33.44	36.13	34.25	37.62	34.88
	Join_DDPM	45.96	36.66	30.97	26.50	33.90	28.96	34.33	32.66

diverse and reasonable plans. Specifically, Deterministic baseline generates more accurate results while Noise baseline gives more diverse results when $T=3,4$ . Longer horizons bring more diverse plans, thus the Noise baseline can achieve better SR results when $T=5,6$ since the added noise helps to model the uncertainty. Our diffusion based PDPP makes a good trade-off between the diversity and accuracy of planning, thus generates both accurate and diverse results that best match the ground truth.

Influence of joint training and DDIM sampling. In this section, we study how joint training and DDIM sampling can influence the modeling of uncertainty. For NIV and COIN, we just apply the best setting as described in the end of Sec. 5.4. For CrossTask, we here use the Unet based model trained with task and horizon concatenation instead since its $NLL$ and $KL-Div$ metrics are better than training without task condition(shown in Tab. XVIII). The results are shown in Tab. XIX. Results of $Separate\_DDIM$ and $Join\_DDIM$ indicate that joint training can further improve the ModeRec performance and generates more diverse plans. So we can draw the conclusion that joint training has a positive effect on modeling the uncertainty. As for DDIM sampling, we can know that sampling with DDPM gets more precise results and DDIM sampling helps for planning diversely. One explanation for this is that the DDPM sampling is consistent with our training process, while DDIM sampling conducts skip sampling for acceleration. Thus DDIM sampling can deviate from the intended sampling route slightly and brings more diversity to planning results.

Visualization Results. In Figs. 6 and 7, we show the visualizations of different plans with the same start and goal observations produced by our PDPP for different prediction horizons on CrossTask. In each figure, the images denote the start and goal observations, the first row denotes the ground truth actions (rows with ”GT”) and other rows denote multiple reasonable plans produced by our model, respectively. Here the reasonable plans are plans that share the same start and end actions with the ground truth plan and exist in the test dataset. We can see that our model can generate both right and diverse plans with the given observations.

TABLE XX: Evaluation results on CrossTask and COIN for VPA with prediction horizon

T\in\{3,4\}

. Here ”Random_g” denotes the

Random

w/goal

baseline, ”P, p1, p2” means Protocol, protocol1 and protocol2, respectively.

			$T$ = 3			$T$ = 4
Dataset	Model	P	SR $\uparrow$	mAcc $\uparrow$	mIoU $\uparrow$	SR $\uparrow$	mAcc $\uparrow$	mIoU $\uparrow$
Crosstask	Random	p1	0.0	0.9	1.5	0.0	0.9	1.9
	Random_g	p1	0.3	13.4	23.6	0.0	12.7	27.8
	DDN[6]	p1	6.8	25.8	35.2	3.6	24.1	37.0
	VLaMP[35]	p1	10.3	35.3	44.0	4.4	31.7	43.4
	PDPP	p1	17.5	48.5	55.3	9.8	44.3	56.6
	PDPP	p2	11.6	36.7	47.7	6.3	35.1	50.9
COIN	Random	p1	0.0	0.1	0.2	0.0	0.1	0.2
	Random_g	p1	1.7	21.4	42.7	0.3	20.1	47.7
	DDN[6]	p1	10.1	22.3	32.2	7.0	21.0	37.3
	VLaMP[35]	p1	18.3	39.2	56.6	9.0	35.2	54.2
	PDPP	p1	25.5	52.9	67.3	19.2	51.0	71.5
	PDPP	p2	21.6	47.2	65.1	16.2	46.9	69.5

5.9 Results on the VPA task

We in this section apply our PDPP to the Visual Planners for human Assistance (VPA) problem by removing the goal observation and using the ground truth task label as condition instead. The input of VPA task includes the current observation, the goal description, and the video history. The video history is an additional input (not required by procedure planning) that records the entire process from the beginning to the current state. Although it can provide some contextual information, it is computationally expensive and difficult to achieve real-time planning in the existing frameworks. Therefore, we choose to only use the current observation as starting point to predict action sequence to reach the goal, which is more convenient and efficient. We follow Patel et al.[35] to evaluate our model on CrossTask and COIN datasets. For CrossTask, we use the Unet based model and add both the task and horizon conditions by concatenation. For COIN, we apply the Unet-attention model with task mask and horizon concatenation.

Take the start time of the first action to be predicted as $t_{start}$ , Patel et al.[35] gets their start observation $o_{s}$ by capturing the video clip from time $t_{start}-2$ to $t_{start}+2$ , which we denote as ”protocol 1”. However, we here argue that since the aim for VPA problem is to assist human in everyday lives, taking video clip from time $t_{start}$ to $t_{start}+2$ is not suitable because actions after $t_{start}$ are inaccessible and have not been performed. Thus we select 4 seconds-long video content before $t_{start}$ as $o_{s}$ , which we denote as ”protocol 2”.

We compare our model with all approaches in[35]:

- $Random$ . This policy randomly selects actions from the available action space in dataset to produce the plans.

- $Random$ $w/goal$ . This policy randomly selects actions from the task-limited action space in dataset to produce the plans. Only actions applicable to the given task class can be involved.

- $DDN$ [6]. DDN model is a two-branch autoregressive model, which learns an abstract representation of action steps and tries to predict the state-action transition in the feature space.

- $VLaMP$ [35]. VLaMP implements their planning model with pretrained transformer-based language model and predicts action and observation tokens autoregressively. Beam search is applied for inference. Besides, video history is required and processed by the segmentation module for better planning.

Tab. XX presents the experiment results for VPA. Our PDPP achieves the state-of-the-art performance, even without the video history. This again shows the great generalization ability of our PDPP on goal-directed planning in instructional videos and the effectiveness of modeling action sequence as a whole.

6 Conclusion

In this paper, we have casted procedure planning in instructional videos as a distribution fitting problem and addressed it with a projected diffusion model PDPP. Compared with previous work, our model requires less supervision and can be trained with a simple learning scheme. PDPP performs diffusion process both for training and sampling, thus models the uncertainty well and can generate reasonable and diverse plans. We instantiate our PDPP with three popular architectures for diffusion model and conduct extensive ablation study on the design and condition introduction of our model. We also apply joint training to PDPP to plan with multiple horizons and save training parameters. Evaluation results on three datasets with different scales demonstrate that our model obtains a notable improvement over previous approaches among multiple planning horizons. Our work demonstrates that modeling action sequence as a whole distribution is an effective solution to goal-directed planning in instructional videos. In the future, we consider extending PDPP to open domain planning with language described action and task labels.

Acknowledgments

This work is supported by the National Key R $\&$ D Program of China (No. 2022ZD0160900), the National Natural Science Foundation of China (No. 62076119), the Fundamental Research Funds for the Central Universities (No. 020214380119), Jiangsu Frontier Technology Research and Development Program (No. BF2024076), and the Collaborative Innovation Center of Novel Software Technology and Industrialization.

References

[1] Alhabib Abbas and Yiannis Andreopoulos. Biased mixtures of experts: Enabling computer vision inference under data transfer limitations. IEEE Trans. Image Process., 29:7656–7667, 2020.
[2] Jean-Baptiste Alayrac, Piotr Bojanowski, Nishant Agrawal, Josef Sivic, Ivan Laptev, and Simon Lacoste-Julien. Unsupervised learning from narrated instruction videos. In CVPR, pages 4575–4583. IEEE Computer Society, 2016.
[3] Lei Jimmy Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. CoRR, abs/1607.06450, 2016.
[4] Jing Bi, Jiebo Luo, and Chenliang Xu. Procedure planning in instructional videos via contextual modeling and model-based policy learning. In ICCV, pages 15591–15600. IEEE, 2021.
[5] João Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the kinetics dataset. In CVPR, pages 4724–4733. IEEE Computer Society, 2017.
[6] Chien-Yi Chang, De-An Huang, Danfei Xu, Ehsan Adeli, Li Fei-Fei, and Juan Carlos Niebles. Procedure planning in instructional videos. In ECCV (11), volume 12356 of Lecture Notes in Computer Science, pages 334–350. Springer, 2020.
[7] Yudong Chen and Martin J. Wainwright. Fast low-rank estimation by projected gradient descent: General statistical and algorithmic guarantees. CoRR, abs/1509.03025, 2015.
[8] Yingpeng Deng and Lina J. Karam. Universal adversarial attack via enhanced projected gradient descent. In ICIP, pages 1241–1245. IEEE, 2020.
[9] Prafulla Dhariwal and Alexander Quinn Nichol. Diffusion models beat gans on image synthesis. In NeurIPS, pages 8780–8794, 2021.
[10] Nikita Dvornik, Isma Hadji, Konstantinos G. Derpanis, Animesh Garg, and Allan D. Jepson. Drop-dtw: Aligning common signal between sequences while dropping outliers. In NeurIPS, pages 13782–13793, 2021.
[11] Nikita Dvornik, Isma Hadji, Hai X. Pham, Dhaivat Bhatt, Brais Martínez, Afsaneh Fazly, and Allan D. Jepson. Flow graph to video grounding for weakly-supervised multi-step localization. In ECCV (35), volume 13695 of Lecture Notes in Computer Science, pages 319–335. Springer, 2022.
[12] Kiana Ehsani, Hessam Bagherinezhad, Joseph Redmon, Roozbeh Mottaghi, and Ali Farhadi. Who let the dogs out? modeling dog behavior from visual data. In CVPR, pages 4051–4060. Computer Vision Foundation / IEEE Computer Society, 2018.
[13] Yazan Abu Farha and Juergen Gall. Uncertainty-aware anticipation of activities. In ICCV Workshops, pages 1197–1204. IEEE, 2019.
[14] Yazan Abu Farha, Qiuhong Ke, Bernt Schiele, and Juergen Gall. Long-term anticipation of activities with cycle consistency. In GCPR, volume 12544 of Lecture Notes in Computer Science, pages 159–173. Springer, 2020.
[15] Chelsea Finn and Sergey Levine. Deep visual foresight for planning robot motion. In ICRA, pages 2786–2793. IEEE, 2017.
[16] Adrian Hauswirth, Saverio Bolognani, Gabriela Hug, and Florian Dörfler. Projected gradient descent on riemannian manifolds with applications to online power system optimization. In Allerton, pages 225–232. IEEE, 2016.
[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778. IEEE Computer Society, 2016.
[18] Shawn Hershey, Sourish Chaudhuri, Daniel P. W. Ellis, Jort F. Gemmeke, Aren Jansen, R. Channing Moore, Manoj Plakal, Devin Platt, Rif A. Saurous, Bryan Seybold, Malcolm Slaney, Ron J. Weiss, and Kevin W. Wilson. CNN architectures for large-scale audio classification. In ICASSP, pages 131–135. IEEE, 2017.
[19] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey A. Gritsenko, Diederik P. Kingma, Ben Poole, Mohammad Norouzi, David J. Fleet, and Tim Salimans. Imagen video: High definition video generation with diffusion models. CoRR, abs/2210.02303, 2022.
[20] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020.
[21] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. CoRR, abs/2207.12598, 2022.
[22] Jonathan Ho, Tim Salimans, Alexey A. Gritsenko, William Chan, Mohammad Norouzi, and David J. Fleet. Video diffusion models. In NeurIPS, 2022.
[23] Michael Janner, Yilun Du, Joshua B. Tenenbaum, and Sergey Levine. Planning with diffusion for flexible behavior synthesis. In ICML, volume 162 of Proceedings of Machine Learning Research, pages 9902–9915. PMLR, 2022.
[24] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR (Poster), 2015.
[25] Diederik P. Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. CoRR, abs/2107.00630, 2021.
[26] Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. Gshard: Scaling giant models with conditional computation and automatic sharding. In ICLR. OpenReview.net, 2021.
[27] Xiang Lisa Li, John Thickstun, Ishaan Gulrajani, Percy Liang, and Tatsunori B. Hashimoto. Diffusion-lm improves controllable text generation. In NeurIPS, 2022.
[28] Chen Liang, Wenguan Wang, Tianfei Zhou, and Yi Yang. Visual abductive reasoning. In CVPR, pages 15544–15554. IEEE, 2022.
[29] Siyuan Brandon Loh, Debaditya Roy, and Basura Fernando. Long-term action forecasting using multi-headed attention-based variational recurrent neural networks. In CVPR Workshops, pages 2418–2426. IEEE, 2022.
[30] Haoyu Lu, Guoxing Yang, Nanyi Fei, Yuqi Huo, Zhiwu Lu, Ping Luo, and Mingyu Ding. VDT: an empirical study on video diffusion with transformers. CoRR, abs/2305.13311, 2023.
[31] Esteve Valls Mascaro, Hyemin Ahn, and Dongheui Lee. Intention-conditioned long-term human egocentric action anticipation. In WACV, pages 6037–6046. IEEE, 2023.
[32] Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In ICCV, pages 2630–2640. IEEE, 2019.
[33] Diganta Misra. Mish: A self regularized non-monotonic neural activation function. CoRR, abs/1908.08681, 2019.
[34] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In ICML, volume 139 of Proceedings of Machine Learning Research, pages 8162–8171. PMLR, 2021.
[35] Dhruvesh Patel, Hamid Eghbalzadeh, Nitin Kamra, Michael Louis Iuzzolino, Unnat Jain, and Ruta Desai. Pretrained language models as visual planners for human assistance. In ICCV, pages 15256–15268. IEEE, 2023.
[36] William Peebles and Saining Xie. Scalable diffusion models with transformers. In ICCV, pages 4172–4182. IEEE, 2023.
[37] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In ICML, volume 139 of Proceedings of Machine Learning Research, pages 8748–8763. PMLR, 2021.
[38] Carlos Riquelme, Joan Puigcerver, Basil Mustafa, Maxim Neumann, Rodolphe Jenatton, André Susano Pinto, Daniel Keysers, and Neil Houlsby. Scaling vision with sparse mixture of experts. In NeurIPS, pages 8583–8595, 2021.
[39] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, pages 10674–10685. IEEE, 2022.
[40] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, pages 10674–10685. IEEE, 2022.
[41] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI (3), volume 9351 of Lecture Notes in Computer Science, pages 234–241. Springer, 2015.
[42] Noam Shazeer, Youlong Cheng, Niki Parmar, Dustin Tran, Ashish Vaswani, Penporn Koanantakool, Peter Hawkins, HyoukJoong Lee, Mingsheng Hong, Cliff Young, Ryan Sepassi, and Blake A. Hechtman. Mesh-tensorflow: Deep learning for supercomputers. In NeurIPS, pages 10435–10444, 2018.
[43] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V. Le, Geoffrey E. Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In ICLR (Poster). OpenReview.net, 2017.
[44] Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In ICML, volume 37 of JMLR Workshop and Conference Proceedings, pages 2256–2265. JMLR.org, 2015.
[45] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In ICLR. OpenReview.net, 2021.
[46] Aravind Srinivas, Allan Jabri, Pieter Abbeel, Sergey Levine, and Chelsea Finn. Universal planning networks: Learning generalizable representations for visuomotor control. In ICML, volume 80 of Proceedings of Machine Learning Research, pages 4739–4748. PMLR, 2018.
[47] Jiankai Sun, De-An Huang, Bo Lu, Yun-Hui Liu, Bolei Zhou, and Animesh Garg. Plate: Visually-grounded planning with transformers in procedural tasks. IEEE Robotics Autom. Lett., 7(2):4924–4930, 2022.
[48] Yansong Tang, Dajun Ding, Yongming Rao, Yu Zheng, Danyang Zhang, Lili Zhao, Jiwen Lu, and Jie Zhou. COIN: A large-scale dataset for comprehensive instructional video analysis. In CVPR, pages 1207–1216. Computer Vision Foundation / IEEE, 2019.
[49] Guy Tevet, Sigal Raab, Brian Gordon, Yonatan Shafir, Daniel Cohen-Or, and Amit Haim Bermano. Human motion diffusion model. In ICLR. OpenReview.net, 2023.
[50] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, pages 5998–6008, 2017.
[51] Hanlin Wang, Yilu Wu, Sheng Guo, and Limin Wang. PDPP: projected diffusion for procedure planning in instructional videos. In CVPR, pages 14836–14845. IEEE, 2023.
[52] Yuxin Wu and Kaiming He. Group normalization. In ECCV (13), volume 11217 of Lecture Notes in Computer Science, pages 3–19. Springer, 2018.
[53] Brandon Yang, Gabriel Bender, Quoc V. Le, and Jiquan Ngiam. Condconv: Conditionally parameterized convolutions for efficient inference. In NeurIPS, pages 1305–1316, 2019.
[54] Dongchao Yang, Jianwei Yu, Helin Wang, Wen Wang, Chao Weng, Yuexian Zou, and Dong Yu. Diffsound: Discrete diffusion model for text-to-sound generation. IEEE ACM Trans. Audio Speech Lang. Process., 31:1720–1733, 2023.
[55] He Zhao, Isma Hadji, Nikita Dvornik, Konstantinos G. Derpanis, Richard P. Wildes, and Allan D. Jepson. P ${}^{\mbox{3}}$ iv: Probabilistic procedure planning from instructional videos with weak supervision. In CVPR, pages 2928–2938. IEEE, 2022.
[56] Wenliang Zhao, Yongming Rao, Yansong Tang, Jie Zhou, and Jiwen Lu. Videoabc: A real-world video dataset for abductive visual reasoning. IEEE Trans. Image Process., 31:6048–6061, 2022.
[57] Jinguo Zhu, Xizhou Zhu, Wenhai Wang, Xiaohua Wang, Hongsheng Li, Xiaogang Wang, and Jifeng Dai. Uni-perceiver-moe: Learning sparse generalist models with conditional moes. In NeurIPS, 2022.
[58] Dimitri Zhukov, Jean-Baptiste Alayrac, Ramazan Gokberk Cinbis, David F. Fouhey, Ivan Laptev, and Josef Sivic. Cross-task weakly supervised learning from instructional videos. In CVPR, pages 3537–3545. Computer Vision Foundation / IEEE, 2019.