
Dynamic Storyboard Generation in an Engine-based Virtual Environment for Video Production
Abstract.
Amateurs working on mini-films and short-form videos usually spend lots of time and effort on the multi-round, complicated process of setting and adjusting scenes, plots, and cameras to deliver satisfying video shots. We present Virtual Dynamic Storyboard (VDS) to allow users to storyboard shots in virtual environments, where the filming staff can easily test shot settings before the actual filming. VDS runs in a “propose-simulate-discriminate” mode: given a formatted story script and a camera script as input, it generates several character animation and camera movement proposals following predefined story and cinematic rules and renders them with an off-the-shelf simulation engine. To pick the top-quality dynamic storyboard from the candidates, we equip VDS with a shot ranking discriminator based on shot quality criteria learned from professionally created data. VDS is comprehensively validated via extensive experiments and user studies, demonstrating its efficiency, effectiveness, and great potential in assisting amateur video production.
1. Introduction

Originating around 1930, the storyboarding technique helps video directors and cinematographers design each shot (a video is usually made up of several shots, where each shot is a series of visually continuous frames), figure out potential problems, and communicate ideas, saving time and resources in practical video production (Wikipedia, 2023). A conventional storyboard (Fig. 2(a,b)) represents each shot with one or two still frames, depicting the scene layout, character actions, as well as camera parameters such as scale and angle. However, such a static storyboard is often dry and rigid, since it lacks the intrinsic capability to faithfully demonstrate dynamic semantics such as character and camera movements. Although arrows and textual instructions can be added to indicate movement directions, they often leave significant semantic ambiguity in the final static storyboard. Moreover, most conventional storyboards are hand-painted sketches that require both time and drawing skills. A rudimentary storyboard for a target video lasting a few minutes can take artists hours to finish, let alone amateur creators who lack such skills.
Given the aforementioned drawbacks, there is great demand to improve conventional storyboard generation in two aspects: 1) producing dynamic storyboards instead of static ones, so that dynamic semantics can be demonstrated directly; 2) building a handy tool that can create customized storyboards in a semi-automatic way. There are two candidate solutions, namely neural video generative models and virtual cinematography. While neural video generation models have made significant progress in recent years (Singer et al., 2022; He et al., 2022), their outputs still suffer from frame-wise spatial-temporal inconsistency, let alone meeting the cinematic requirements of dynamic storyboards. A better alternative is virtual cinematography (He et al., 1996), which includes three modules that can be adapted for storyboard generation: a real-time executor to drive character motion, a virtual cinematographer to handle the visual layout, and a renderer to output the results. Still, two significant problems need to be addressed to fulfill this task with sufficiently high quality. First, most existing works (Shah et al., 2018; Fabbri et al., 2021; Jiang et al., 2021a, 2020) only study camera behavior under fixed plots and environments that require heavy manual effort to modify. Second, the full action space of camera movements is too large to allow effective per-frame decisions (Truong et al., 2018). The camera action space in filming consists of seven degrees of freedom (7DoF), where a decision on camera position, rotation, and focal length must be made for each frame. Moreover, abrupt changes can easily occur when creating video frame by frame, yet one of our goals is to produce smooth videos with harmonious content and cinematic styles.
In this paper, we propose a novel virtual dynamic storyboard approach that serves amateurs in an efficient and customized way, as shown in Fig. 4. It follows a “propose-simulate-discriminate” paradigm. For each shot, it first translates the input high-level story and camera scripts, given as char do sth/swh tuples for story scripts and movement scale angle tuples for camera scripts, into proposals of shot hyperparameters. Each proposal is then executed by a simulation module built on top of a modern graphics engine (e.g., Unity, Unreal, Omniverse (Unity, 2023; Unreal, 2023; Nvidia, 2023)) to acquire its corresponding rendered output. Finally, a data-driven discriminator ranks all these rendered outputs as a recommendation for users to choose the final dynamic demonstration of this shot. Our key insights are as follows: 1) The enormous action space of all possible character and camera trajectories on a frame-by-frame basis easily leads to low-quality shot candidates in terms of inter-frame abruptness and less meaningful camera motions, let alone the desired artistic feeling. Our shot-based story and camera proposal generation significantly reduces the search space for appropriate camera configurations, ensuring more plausible results. 2) By wrapping the graphics engine into a simulation module, the proposed approach significantly improves accessibility and computational efficiency, thanks to engines’ rich, highly structured functionalities and highly optimized computation pipelines. Moreover, the proposed approach readily enjoys a series of additional benefits brought by the graphics engine, such as function-wise scalability and high rendering quality. 3) There might be several plausible shots satisfying the requirements of the story and camera scripts, and no universal criteria can serve as direct supervision or evaluation metrics for generated shots. We therefore resort to a data-driven learning approach with a shot ranking discriminator that aims to automatically discover rules from professionally created clips. It learns from a large amount of data to evaluate shots by distinguishing between samples of different quality. Through class-aware contrastive objectives, our approach can successfully select shots with top-ranked quality among the generated proposals.
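To make the paradigm concrete, the following minimal Python sketch outlines one shot’s propose-simulate-discriminate loop; the function arguments (propose_story, propose_camera, render, score) are illustrative placeholders standing in for the modules of Sec. 3, not the actual VDS API.

def dynamic_storyboard_shot(story_script, camera_script,
                            propose_story, propose_camera, render, score, top_k=5):
    """Return the top-k rendered candidates for one shot, given its two scripts."""
    candidates = []
    # 1) Propose: translate the scripts into executable shot hyperparameters.
    for story_params in propose_story(story_script):
        for camera_traj in propose_camera(camera_script, story_params):
            # 2) Simulate: drive the character and camera in the engine and render.
            candidates.append(render(story_params, camera_traj))
    # 3) Discriminate: rank rendered candidates by the learned shot-quality score.
    return sorted(candidates, key=score, reverse=True)[:top_k]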
In summary, we contribute Virtual Dynamic Storyboard (VDS): (1) It takes story and camera scripts as input and translates them into dynamic shot sequences for pre-production that follow cinematic filming rules. (2) The design of our cinematic filming module with the associated camera action subspace not only brings interpretable control to users but also effectively reduces the action space, leading to more plausible results. (3) The shot ranking discriminator bridges the gap caused by the absence of ground truth and universal criteria by learning from professionally created clips with carefully designed class-aware contrastive objectives. (4) Extensive user studies and qualitative evaluations demonstrate the effectiveness of VDS. Code will be released upon publication.
Assumption. Though our tool can generate a sequence of shots, its technical novelty mainly lies in how to compose a single shot. Users have the freedom to decide how different shots are connected into a sequence and what transition effects are applied. Hence, in this paper we assume that each shot’s duration equals that of its corresponding atomic character action and that shots are connected directly to form a sequence, noting that different shots can be “cut” and connected in various ways using our shots as raw data (Pardo et al., 2021a, b). This corresponds to the industry practice of using multiple cameras to record one action and taking the recordings as raw inputs to the post-editing process with “cuts”.
2. Related Work
Storyboard and dynamic content creation. Storyboards are investigated in keyframe or text summarization from videos (Mohanta et al., 2013; Bhaumik et al., 2015; Ronfard et al., 2022), textual script writing (Chandu et al., 2019; Mirowski et al., 2022), and video creation assistance (Goldman et al., 2006; Ye and Baldwin, 2008; Pizzi et al., 2010). Among the most relevant works, Ye et al. (Ye and Baldwin, 2008) focus on the language mapping between action descriptions and avatar animation instead of the visual content, and Pizzi et al. (Pizzi et al., 2010) only generate sketch-style static images. Besides, intelligent creation tools are in great demand as they help users efficiently create customized dynamic content, e.g., video and animation (Louarn et al., 2018, 2020). Some researchers focus on key steps such as frame composition (Zhong et al., 2021), shot selection (Liao et al., 2020; Jiang et al., 2021b), and shot cut suggestion (Pardo et al., 2021a). Others tackle high-level automatic procedures with simple user interactions, taking multiple videos captured by different cameras to produce a coherent video in different application scenarios (Arev et al., 2014; Leake et al., 2017; Truong and Agrawala, 2019) using different data sources (Wang et al., 2019; Chi et al., 2021; Moorthy et al., 2020; Rao et al., 2022b, a). Our system also belongs to high-level automatic creation and takes story/camera scripts as input.
Camera control for cinematography. Camera motion plays an important role in delivering content from a given environment (He et al., 1996; Wu et al., 2018). Early works start from handcrafted quality criteria such as visibility and smoothness (Huang et al., 2016; Galvane et al., 2015) for route planning (Oskam et al., 2009). However, they are limited in the scope of possible actions and lack generalizability to broader scenarios. Later works tend to directly imitate exemplar videos to generate similar camera trajectories with SfM (Sanokho et al., 2014). Jiang et al. (Jiang et al., 2020) use deep neural networks to extract camera behaviors from real movie clips based on the toric space (Lino and Christie, 2015), which is helpful for imitating camera motion (Yoo et al., 2021) and keyframing (Jiang et al., 2021a). However, the inaccuracy of SfM and toric space estimation may severely affect their performance. Another research direction targets drone photography (Gebhardt and Hilliges, 2021; Galvane et al., 2018) and explores imitation learning to learn from experts. Reinforcement learning has been used to maximize an aesthetic-based reward (Gschwindt et al., 2019; Huang et al., 2019), with an extension to style control (Huang et al., 2021). However, the shot styles studied are quite limited compared to the broad categories of plausible shots in real film production, and it is generally not easy to acquire ground-truth trajectory data for training.
Simulation and virtual environments. Simulation engines can facilitate the training of machine learning models in autonomous driving, drones, robots, and so on (Richter et al., 2016; Brodeur et al., 2018; Gao et al., 2019), with the potential to augment an unlimited number of data samples. While many virtual platforms for real-world scenes only support single-agent tasks (Starke et al., 2019; Zhang et al., 2018), a recent line of research on social AI constructs multi-person tasks (Savva et al., 2019; Shridhar et al., 2020) and develops virtual scenes based on Unity (Unity, 2023) or Unreal (Unreal, 2023). With the developed APIs, our VDS supports script-level automatic control over the Unity (Unity, 2023) and Omniverse (Nvidia, 2023) engines. It is able to simulate a wide variety of camera shot styles and character actions to fulfill a storyboard and render it out.
3. Virtual Dynamic Storyboard
The framework of Virtual Dynamic Storyboard (VDS) is shown in Fig. 4. Taking each shot’s story script and camera script as inputs, it first proposes several candidate sets of executable parameters (Sec. 3.1 and Sec. 3.2), following story and cinematic rules. Proposal storyboard videos satisfying these parameters are then rendered by an engine-based simulator. A shot ranking discriminator trained with class-aware contrastive learning scores the generated proposals and assists users in selecting their favorites (Sec. 3.3). The practical UI is presented in Sec. 3.4.
3.1. Story Script Proposal
One of the raw inputs to VDS is a story script sentence in the format char do sth/swh. Although graphics engines have strong abilities in scene establishment and character animation via manual operation (mouse clicking/dragging and keyboard typing), these functions cannot be directly driven by the aforementioned story scripts. To fulfill this goal, we develop an engine-based simulation module with a series of automatic APIs for scene selection, character placement and animation, and camera control.
After the selection of characters and scenes from the available assets, each story script lets a chosen virtual character interact with a predefined scene and executes its animation for each timestamp in the simulator. Specifically, 1) char and sth/swh: char links the selected character asset, and the scene is represented as a hierarchical tree, e.g., a house is composed of several rooms, which in turn are composed of several objects. From this scene graph, every object and place has a position, which can be associated with the sth/swh in the story script. 2) do: For each atomic action do, we retrieve a corresponding pre-recorded animation clip $A$ from a predefined animation clip pool and associate it with the character and the object/place. Each clip can be executed by the simulator to output a video reflecting the semantics of the atomic motion. If the atomic action involves distance motion, e.g., walk/run, proposal paths $\rho$ are generated between the character and the object, and each path is checked against the scene graph to avoid objects getting in the way. The motion trajectory of the character can then be accurately represented in world coordinates, and the shot duration $T$ equals the length of the animation clip $A$.
The story parameters of each executable storyboard animation for one story script can be represented as:
(1)   $P_{\mathrm{story}} = (\textit{char},\ \textit{sth/swh},\ A,\ \rho,\ T)$
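As an illustration of the story parameters in Eq. (1), the data structure below sketches one executable story proposal; the field names and types are assumptions for exposition, not the exact representation used in VDS.

from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]  # a position in world coordinates

@dataclass
class StoryProposal:
    """One executable proposal for a `char do sth/swh` story script (cf. Eq. (1))."""
    character: str        # id of the selected character asset (char)
    target: str           # object/place node in the scene graph (sth/swh)
    animation_clip: str   # pre-recorded clip A retrieved for the atomic action `do`
    path: List[Vec3]      # way-points rho avoiding obstacles (empty for in-place actions)
    duration: float       # shot duration T, equal to the clip length in seconds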
3.2. Camera Script Proposal
Preliminaries for camera control. Following our assumption, each story script char do sth/swh produces a set of executable animations, and the filmed shot duration is the same as the character’s animation length, which is specified by its corresponding action clip $A$. Thus, in order to take a shot of each atomic action, it is necessary to obtain a list of 7DoF camera parameters that represents the camera trajectory:
(2)   $C_t = (x_t,\ y_t,\ z_t,\ \phi_t,\ \theta_t,\ \psi_t,\ f_t), \quad t \in [0, T]$
where $(x_t, y_t, z_t)$ stands for the position in world coordinates. The roll $\phi_t$, pitch $\theta_t$, and yaw $\psi_t$ describe the camera rotations along the axes of the camera’s local coordinate frame, where the upward direction is specified by a “Look-At” constraint. (The commonly adopted “Look-At” constraint is realized by a TR matrix that positions/rotates the camera to look at a target point in space from its own position, according to the orientation set up by the up-vector and the looking direction. It keeps the camera horizontal in most cases, except for some advanced effects such as “roll” shots. Details are specified in the supplementary.) The focal length $f_t$ is set within the range of focal lengths (in mm) commonly used in films.
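A common way to realize the “Look-At” constraint is to build the camera rotation from the looking direction and an up-vector; the NumPy sketch below is a generic version of this construction under an assumed y-up, column-vector convention, not the exact matrix used in our simulator.

import numpy as np

def look_at_rotation(cam_pos, target, up=(0.0, 1.0, 0.0)):
    """Rotation matrix that orients a camera at cam_pos toward target (y-up assumed)."""
    forward = np.asarray(target, dtype=float) - np.asarray(cam_pos, dtype=float)
    forward /= np.linalg.norm(forward)
    right = np.cross(np.asarray(up, dtype=float), forward)
    right /= np.linalg.norm(right)
    true_up = np.cross(forward, right)
    # Columns are the camera's right / up / forward axes expressed in world coordinates.
    return np.stack([right, true_up, forward], axis=1)

# One 7DoF camera sample C_t: position, rotation (derived from Look-At), focal length.
camera_sample = {"position": (0.0, 1.6, -3.0),
                 "rotation_matrix": look_at_rotation((0.0, 1.6, -3.0), (0.0, 1.5, 0.0)),
                 "focal_length_mm": 35.0}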


Cinematic camera control. To associate the produced shots with classical cinematic filming language and provide easy control to users, we introduce three dimensions for shot type control and allow input in the format movement scale angle, which is widely used in the filming industry to increase production efficiency (Rao et al., 2020). The definitions of these three factors and their complete lists of subcategories can be found in (Giannetti and Leach, 1999). Instead of freely searching the general camera action space, we identify a set of camera subspaces that define meaningful camera trajectories in terms of shot scale, angle, and movement.
In the following, we adopt a human-centric formulation and introduce a direction vector represented in the spherical coordinate system by the radial distance $r$, polar angle $\theta$, and azimuthal angle $\varphi$ to facilitate control. This representation explicitly derives the relative position between the main character and the camera, and bears the merit of a one-to-one mapping from $(r, \theta, \varphi)$ to $(x, y, z)$ via:
(3)   $x = r\sin\theta\cos\varphi, \quad y = r\sin\theta\sin\varphi, \quad z = r\cos\theta$
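The conversion in Eq. (3) can be implemented directly; in the sketch below the character position and the choice of up axis are assumptions (engines such as Unity are y-up, so axes may need remapping).

import math

def direction_to_cartesian(r, theta, phi):
    """Spherical direction vector (r, theta, phi) -> Cartesian offset, as in Eq. (3)."""
    x = r * math.sin(theta) * math.cos(phi)
    y = r * math.sin(theta) * math.sin(phi)
    z = r * math.cos(theta)   # theta is measured from the up axis
    return (x, y, z)

def place_camera(char_pos, r, theta, phi):
    """World-space camera position obtained by offsetting the character position."""
    ox, oy, oz = direction_to_cartesian(r, theta, phi)
    return (char_pos[0] + ox, char_pos[1] + oy, char_pos[2] + oz)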
Based on the specification of the shot scale and angle, we can determine the camera parameters for the keyframes in a shot, while the type of camera movement finally determines the camera parameters for each interpolated timestamp in between. Fig. 3 shows an illustration of the camera parameters and a selection of shot types. Some basic categories are described below and can be naturally extended; details are explained in the supplementary.
3.2.1. Shot Angle
This is largely determined by the relative position between the camera and the target, with a special focus on their altitude difference. As shown in Fig. 3, different angles are reflected by the polar angle $\theta$ of the direction vector: $\theta$ near 90° yields an eye-level shot, a smaller $\theta$ (camera above the subject) produces a high-angle shot, and a larger $\theta$ (camera below the subject) serves a low-angle shot.
3.2.2. Shot Scale
It is determined by the size of the target object within the frame and is implemented with a virtual sphere centered on the target character, controlled by the distance radius $r$ and the focal length $f$. For example, for a fixed focal length $f$ and a character of height $h$, the radius $r$ is pre-set to increasing values for close-up, medium, and full shots, respectively. $r$ and $f$ can be adjusted accordingly to meet individual needs.
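The mapping from categorical shot angle and scale to keyframe parameters can be expressed as a small lookup; the preset radii and polar angles below are illustrative placeholders only, since the actual values in VDS are tuned to the character height and focal length.

import math

# Illustrative presets only; VDS tunes these to the character height and focal length.
SCALE_RADIUS = {"close-up": 1.0, "medium": 2.5, "full": 5.0}   # sphere radius r in meters
ANGLE_POLAR = {"high": math.radians(45),       # camera above the subject, looking down
               "eye-level": math.radians(90),
               "low": math.radians(135)}       # camera below the subject, looking up

def keyframe_params(scale, angle, azimuth, focal_length_mm=35.0):
    """Keyframe camera parameters from (scale, angle) categories and a shooting direction."""
    return {"r": SCALE_RADIUS[scale], "theta": ANGLE_POLAR[angle],
            "phi": azimuth, "focal_length_mm": focal_length_mm}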
3.2.3. Shot Movement
In the human-centric scenario, camera movements are correlated with the motion trajectory of the character. Based on the above controls for shot scale and angle, we elaborate the control of camera movements, which determines the camera parameters for each timestamp $t \in [0, T]$. Nine basic movement types are briefly explained below.
Static shot keeps the 7DoF parameters constant throughout the duration $[0, T]$, with the reference target coordinate set to the character’s start or end position.
Follow shot aims to follow the character’s motion trajectory within the duration $[0, T]$ while keeping the camera looking at the character. With an easing function parameterized by $\gamma$ to control the movement rhythm, the camera position $\mathbf{p}_{\mathrm{cam}}(t)$ at time $t$ is determined by
(4)   $\mathbf{p}_{\mathrm{cam}}(t) = \mathbf{p}_{\mathrm{char}}\big(e(t)\big) + \mathbf{d}(r, \theta, \varphi)$
(5)   $e(t) = T \cdot (t/T)^{\gamma}, \quad t \in [0, T]$
where $\mathbf{p}_{\mathrm{char}}$ denotes the character position and $\mathbf{d}(r, \theta, \varphi)$ is the Cartesian offset given by Eq. (3). In general, a large $\gamma$ makes the shot “slow first, fast later” and vice versa. The camera rotation parameters are determined by the Look-At constraint.
Push/pull shot compresses/enlarges the shooting space to focus on a single object or show the surroundings. It adjusts the distance between the camera and the subject with a zoom ratio parameter $\lambda$ and the easing function in Eq. (5),
(6)   $r(t) = r_0 \cdot \big(1 + (\lambda - 1)\, e(t)/T\big)$
Zoom shots are similarly achieved by adjusting the camera focal length $f_t$, with the scale and angle defined based on the first frame. Tilt and pan shots rotate the pitch angle or the yaw angle, respectively, to shift the audience’s focus from one subject to another vertically or horizontally. Dolly (horizontal) and pedestal (vertical) shots are implemented by specifying a camera trajectory between starting and ending points with the easing function, similar to push/pull. Arc shot is usually applied to a character staying at a specific location, where the camera moves around the subject by rotating the azimuthal angle $\varphi$ of the direction vector.
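The sketch below illustrates how the easing-based movements can be realized, assuming the simple power easing written in Eq. (5) (a placeholder consistent with the “slow first, fast later” behavior; VDS may use a different easing family) and a character trajectory sampled at uniform timestamps.

def ease(t, T, gamma=2.0):
    """Power easing (cf. Eq. (5)): remaps t in [0, T]; gamma > 1 -> slow first, fast later."""
    return T * (t / T) ** gamma

def follow_positions(char_traj, offset, gamma=2.0):
    """Follow shot: keep a fixed offset from the character, sampled at eased timestamps.
    char_traj is a list of at least two (x, y, z) samples at uniform time steps."""
    T = len(char_traj) - 1
    positions = []
    for t in range(T + 1):
        i = min(T, round(ease(t, T, gamma)))          # nearest character sample at eased time
        cx, cy, cz = char_traj[i]
        positions.append((cx + offset[0], cy + offset[1], cz + offset[2]))
    return positions

def push_pull_distance(r0, zoom_ratio, t, T, gamma=2.0):
    """Push/pull shot (cf. Eq. (6)): interpolate the camera-subject distance toward r0 * zoom_ratio."""
    progress = ease(t, T, gamma) / T                  # normalized progress in [0, 1]
    return r0 * (1.0 + (zoom_ratio - 1.0) * progress)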
The above definitions constrain some of the variables in the camera parameters $C_t$ depending on the shot angle, scale, and movement type. We then enumerate the unconstrained variables within their ranges (a subspace of the 7DoF space) to acquire camera trajectory proposals, as sketched below.
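Enumerating the unconstrained variables of a chosen subspace amounts to a small Cartesian product; the ranges below (eight azimuths, three easing rhythms) are illustrative examples rather than the exact grids used in VDS.

import itertools
import math

def enumerate_camera_proposals(scale, angle, movement):
    """Sweep the unconstrained variables of the selected camera subspace (example ranges)."""
    azimuths = [math.radians(a) for a in range(0, 360, 45)]   # shooting directions
    gammas = (0.5, 1.0, 2.0)                                  # easing rhythms
    for phi, gamma in itertools.product(azimuths, gammas):
        yield {"scale": scale, "angle": angle, "movement": movement,
               "phi": phi, "gamma": gamma}

proposals = list(enumerate_camera_proposals("medium", "eye-level", "follow"))  # 24 proposals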
3.3. Shot Ranking Discriminator
For a given story and camera script, the subspaces defined above can produce multiple plausible shot proposals rendered by our simulator, differing in their specific parameters, e.g., character paths and camera views. The next question is how to effectively evaluate these shots and select the ones that look best in terms of the combination of content, environment, and proposed camera trajectory. Since there are no standard metrics for evaluating a good shot, and simple criteria such as smoothness are quite limited in telling the overall shot quality, we propose a data-driven shot ranking discriminator to score the quality of generated shots, so that users can easily select high-quality shots based on the scores. To gain more capacity in discriminating the spatial-temporal structure among shot proposals, it is implemented with a TSN structure (Wang et al., 2018) that samples 8 frames of size 224 × 224 from a shot, feeds each frame to a ResNet-50, and fuses the extracted features at the end.
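A minimal PyTorch sketch of such a discriminator backbone is given below, following the description above (8 sampled 224 × 224 frames, a shared ResNet-50, and late feature fusion by averaging); the head sizes and the optional shot-type head are our assumptions.

import torch
import torch.nn as nn
from torchvision.models import resnet50

class ShotRankingDiscriminator(nn.Module):
    """TSN-style backbone: 8 sampled frames -> shared ResNet-50 -> averaged feature -> heads."""
    def __init__(self, num_shot_types=0):
        super().__init__()
        backbone = resnet50(weights=None)
        backbone.fc = nn.Identity()                 # keep the 2048-d pooled feature
        self.backbone = backbone
        self.quality_head = nn.Linear(2048, 2)      # generated vs. professional (cf. Eq. (7))
        self.type_head = nn.Linear(2048, num_shot_types) if num_shot_types else None

    def forward(self, frames):                      # frames: (B, 8, 3, 224, 224)
        b, n = frames.shape[:2]
        feat = self.backbone(frames.flatten(0, 1))  # (B*8, 2048), shared weights per frame
        feat = feat.view(b, n, -1).mean(dim=1)      # temporal average fusion
        logits = self.quality_head(feat)
        return (logits, self.type_head(feat)) if self.type_head else logits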
We train the binary classifier using professionally created clips as positive samples and randomly generated virtual samples as negative ones, under the assumption that the professionally created clips represent higher quality. The network is trained with the classification loss,
(7)   $\mathcal{L}_{cls} = -\sum_{i}\big[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \,\big]$
During inference, the generated samples can be sorted by their classification scores for the “professional” class. Our goal is to find the ones with the highest scores, i.e., the samples that the network is most inclined to treat as professional.
Nevertheless, we found that the network tends to learn only superficial appearance-level cues to distinguish the two classes, e.g., color/texture, due to the variance among shot types. The features of the generated samples to be ranked also cluster tightly in the feature space, making them less informative for judging their quality in terms of similarity to high-quality shots. To overcome this problem, we design the following two objectives to facilitate the selection of high-quality shots.
Class-aware contrastive objectives. Considering the variance among generated samples caused by their intrinsic shot styles, we add a class-aware loss $\mathcal{L}_{class}$ to encourage the discriminator to be more shot-type specific and become an expert in determining the quality of the corresponding shot type:
(8)   $\mathcal{L}_{class} = -\sum_{c=1}^{C} y_c \log p_c$
where $C$ is the total number of shot types in the training data.
To better select high-quality generated shots, we need to magnify the feature differences among samples of different quality. Inspired by recent successes in video contrastive learning (Pan et al., 2021), which can learn high-level features of a shot such as layout and temporal pace that are helpful for judging generation quality, we include a contrastive objective,
(9)   $\mathcal{L}_{ctr} = -\log \frac{\exp(q \cdot k_{+}/\tau)}{\sum_{i=0}^{K} \exp(q \cdot k_i/\tau)}$
The loss is computed over one query sample $q$ and $K$ other samples in the training set. A large $K$ means that more shots are taken into account when maximizing the differences among shot clips, which leads to better performance. The positive key $k_{+}$ comes from frames at other timestamps within the same shot as the target sample $q$.
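The contrastive term can be implemented in the usual InfoNCE form; the sketch below assumes L2-normalized features, a temperature of 0.07, and an externally maintained queue of K negatives (as in MoCo), which are our assumptions rather than the exact settings of VDS.

import torch
import torch.nn.functional as F

def contrastive_loss(q, k_pos, queue, temperature=0.07):
    """InfoNCE over a query, its positive key, and K queued negatives (cf. Eq. (9)).
    q, k_pos: (B, D) features; queue: (K, D) negative features."""
    q = F.normalize(q, dim=1)
    k_pos = F.normalize(k_pos, dim=1)
    queue = F.normalize(queue, dim=1)
    l_pos = torch.einsum("bd,bd->b", q, k_pos).unsqueeze(1)    # (B, 1) positive logits
    l_neg = q @ queue.t()                                      # (B, K) negative logits
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)  # positives at index 0
    return F.cross_entropy(logits, labels)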
In summary, at training time the shot ranking discriminator is optimized with the composite loss combining $\mathcal{L}_{cls}$, $\mathcal{L}_{class}$, and $\mathcal{L}_{ctr}$, with the aim of singling out high-quality shots from multiple candidates. We take professionally created clips as positive samples and randomly generated shots as negative samples. For the positive samples, 9 professional designers select high-quality VDS-generated samples and double-check each other’s selections. The negative samples come from random perturbations in the camera action space. At inference time, the shot ranking discriminator sorts the proposals of each shot according to their classification scores, and the sample with the highest positive score is chosen as the final output.
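At inference time, ranking reduces to sorting the rendered candidates by their positive-class probability; the sketch below assumes the discriminator interface defined earlier in this section and that class index 1 denotes the “professional” class.

import torch
import torch.nn.functional as F

@torch.no_grad()
def rank_proposals(model, proposal_frames):
    """Sort rendered proposals by their probability of being classified as professional.
    proposal_frames: list of (8, 3, 224, 224) tensors, one per rendered candidate."""
    model.eval()
    scores = []
    for frames in proposal_frames:
        out = model(frames.unsqueeze(0))
        logits = out[0] if isinstance(out, tuple) else out      # ignore shot-type logits here
        scores.append(F.softmax(logits, dim=1)[0, 1].item())    # assumes index 1 = professional
    order = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    return order, scores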
3.4. Practical User Interface

As a pre-production tool designed for the practical video production pipeline, Virtual Dynamic Storyboard is used to rehearse the plot and provide guidance for downstream on-stage videography. To fulfill this goal, we follow the advice of professional storyboard staff and present a two-stage user interface prototype comprising an Environment Setting Stage and a Filming Stage. Fig. 5 gives a brief introduction, with each panel highlighted by a letter. The key design idea is to divide storyboard creation into a static stage and a dynamic stage, which allows users to apply clearer controls to static scenes and to dynamic characters and cameras.
Following this idea, the first step is to prepare scene and character assets for the story and camera scripts in the static Environment Setting Stage. The Choose Scene window (a) and Choose Characters window (b) allow users to select the scene, the characters, and their initial locations in the story. Users can freely use the Scene Setting window (d) to control the viewing angle of the monitor camera, facilitating operations on the scene, and view the results through the Monitor View window (c).
With the chosen assets, the system is ready to produce the dynamic storyboard in the Filming Stage. The Input windows (e, f) provide users with text input to the system. Users can type story scripts and camera scripts and check the results of the corresponding story in the Output window (g). To monitor the generated results in a timely manner, the keyframe of each shot can be found in the Preview window (i). Users can click the drag box to select more candidates according to the ranking score. Additionally, the Statistics window (h) visualizes basic real-time statistics of the generated storyboards, e.g., shot style counts, total shot number, etc.
4. Experiments
4.1. Implementation Details
The assets used in VDS cover a variety of scenes, actions, and characters. The character height in the simulation engine is set to a fixed reference value. Each story and camera script pair yields 40–200 proposals. For a fixed shot scale, we generate proposals with different azimuthal angles to represent different shooting directions. For the easing function, we enumerate several values of $\gamma$. In push/zoom-in shots, the zoom ratio $\lambda$ decreases the camera-subject distance, while in pull/zoom-out shots it increases it. In tilt and pan shots, the angle change is constrained within 30°–60°, and we enumerate all combinations of camera moving directions (e.g., up/down) and ending points (on/off the person). In arc shots, the azimuthal angle changes within 90°–120°. We perform the inference of VDS on a laptop with an Intel i7 CPU and an NVIDIA 2080Ti GPU to output 720P videos; the processing time is shown in Tab. 1. The shot ranking discriminator is trained with a batch size of 128; the learning rate and number of training epochs are detailed in the supplementary. To compute $\mathcal{L}_{ctr}$, we use a momentum-updated dictionary (He et al., 2020) whose size corresponds to the number of negatives $K$. $\mathcal{L}_{class}$ is implemented as the sum of the shot movement, scale, and angle classification losses. More details can be found in the supplementary.
4.2. User Evaluation on Storyboard Designer
Setting. To evaluate the performance of VDS in practical usage, we invite 20 amateur designers to compare hand-painting with our VDS. 8 of them are skilled users who have learned painting for over two years, while the remaining 12 are beginners without painting skills. Before the experiments, they take a 30-minute training session to familiarize themselves with the tool and read reference material showing examples of the types of activities and characters. After the training, each of them first uses traditional hand-painting to create storyboards for 2 different practical stories. Each story depicts a scene containing 10 shots. They then use VDS to generate the top 5 results for each shot and pick one to create dynamic storyboards for the same stories. In the process, we record their creation time to study time efficiency, and ask them to self-evaluate, in a pairwise manner, which approach generates results that better reach their expectations. In the end, we ask them to vote on which method offers higher flexibility in creation (Tab. 2).
step | proposal | render | shot ranking |
speed | 24.8 item/s | 13.0 frame/s | 10.3 shot/s |

Results. Fig. 6 compares the storyboards generated by hand-painting and VDS from the same story and camera scripts in an indoor and an outdoor scene, respectively. Beginners without painting skills have difficulty expressing the story under the camera script, e.g., the character just stands in front of the door in the 5th row of Fig. 6 (a). Skilled users tend to use simple shapes and omit background details in their designs, e.g., the road is depicted by a few simple lines in the 9th row of Fig. 6 (b). For actions, hand-painting also tends to rely on symbols, e.g., musical notes indicate that the character is singing in the 12th row of Fig. 6 (b). We observe that hand-painting requires painting skills and more time from designers to create a good storyboard. Thanks to VDS’s support for rich actions and scenes, amateurs do not need painting skills to create a dynamic storyboard. It also allows easier modification and offers more options to users, as shown in Fig. 6 (d).
Designer Type | Approach | Time Cost per shot | Reach Expectation | Camera Flexibility | Story Flexibility |
Beginner | Hand-paint | 6.69 min | 16.67% | 8.33% | 75.00%
Beginner | VDS | 3.42 min | 83.33% | 91.67% | 25.00%
Skilled user | Hand-paint | 12.73 min | 25.00% | 12.50% | 62.50%
Skilled user | VDS | 3.13 min | 75.00% | 87.50% | 37.50%
Tab. 2 shows that VDS lowers the creation time cost to around 3 minutes per shot while better reaching designers’ expectations, and they vote for its high flexibility in camera settings. Compared to hand-painting, which can freely depict anything, VDS sacrifices some story flexibility. To better understand their choices, we conduct a semi-structured interview with them. The designers enjoyed the efficiency, performance, and rich camera choices brought by VDS. Though the new tool sacrifices a bit of flexibility in story design, 11 out of 20 of them mentioned that this depends on their painting skills in different scenarios: “When I deal with common cases, e.g., a couple talking in a bedroom, hand-painting allows me to design freely. But if I want to design something in a different style, e.g., magic realism like The Lord of the Rings or postmodernism like Westworld, the provided assets ease our design process.” Based on these observations, we believe that as more assets are developed and added, VDS will provide greater flexibility. It also opens up opportunities for people without painting skills to design their own dynamic storyboards.
4.3. User Evaluation on Storyboard Reader
Setting. The most important functionality of a storyboard is to convey the creators’ ideas and guide the videography process. To validate this, we invite 22 amateurs as storyboard readers, who are split into 3 groups (8, 10, 4) according to the number of years they have used storyboards in video production.
Given the storyboards created by the designers in Sec. 4.2, we ask the readers whether they can understand the content in the storyboard and obtain the information needed for filming. To further understand which aspects play important roles, we conduct pairwise ratings on the story and camera elements, where the hand-painted result serves as the reference and readers rate the corresponding VDS results on a seven-point Likert scale.
Years using storyboards / # users | 0–0.5 / 8 | 0.5–1.5 / 10 | 1.5–2.5 / 4
Content Delivery | 72.50% | 62.00% | 95.00%
Instruction Ability | 77.50% | 80.00% | 100.00%
Story Character | 5.45 ± 0.53 | 5.36 ± 0.37 | 6.25 ± 0.50
Story Action | 5.25 ± 0.68 | 5.50 ± 0.95 | 6.70 ± 0.47
Story Scene | 5.30 ± 0.38 | 5.12 ± 0.42 | 6.20 ± 0.80
Camera Scale | 4.98 ± 0.32 | 4.36 ± 0.70 | 6.05 ± 0.81
Camera Angle | 4.58 ± 0.59 | 4.02 ± 0.69 | 6.80 ± 0.71
Camera Movement | 5.67 ± 0.39 | 5.18 ± 0.49 | 6.85 ± 0.51
Results. Tab. 3 shows that the storyboards produced by VDS achieve significantly better content delivery and instruction ability, with scores above 70% for readers without much experience and above 95% for readers with about two years of experience. Readers with more experience appreciate its better fit and usefulness in the practical video production pipeline. Diving deeper into the contributing components, we find that readers prefer the story setting of VDS, i.e., character, action, and scene. As mentioned above, hand-painted storyboards tend to be very simple and greatly simplify characters, actions, and scenes, so only designers who are skilled in painting can accurately convey their ideas by hand. In contrast, the dynamic storyboards generated with VDS make it easy for readers to understand the information in the story. For camera movement, readers also prefer VDS: still images make it hard to reconstruct the camera motion, whereas VDS produces more intuitive results in which readers can directly see how the frames proceed.
To gain a deeper understanding of VDS in comparison with conventional hand-painting, we conduct a semi-structured interview with the participants and collect the reasons why they prefer or dislike VDS or hand-painting. They are enthusiastic about the way VDS speeds up their creation procedure and consider this tool a good substitute for the conventional storyboard, being more descriptive and vivid. Among the 22 storyboard readers, 16 prefer VDS because its dynamics make the shots much easier to understand, and 19 appreciate its standardized process, which is similar to real-world videography. The skilled users in particular give the highest scores and praise its potential toward a more standardized pipeline for the whole video production process.
We also receive feedback on drawbacks and suggestions, such as “…, looking forward to adding more characters into the tool, …” and “…, it would be better if it could support more character facial expressions”. These comments reveal insightful and exciting directions for future improvements to our system.
4.4. Ablation Study on Shot Ranking
To show the rationality of our shot ranking discriminator, we additionally collect 200 shots picked by professional designers from VDS proposals and score them together with 20,000 generated proposals using the ranking network. All of the designers’ shots are ranked in the top 10%, which shows that our shot ranking discriminator is able to pick out high-quality shots. We also demonstrate the effectiveness of the loss function for training the shot ranking discriminator by visualizing the top 3 shot candidates for fixed scripts (Fig. 7). The top 3 results from the model without the class-aware contrastive objectives look similar, and their shot types are not accurate, e.g., the shot scale of Anna does not perfectly match the medium type. With the help of the class-aware contrastive objectives, the full model displays more accurate shot types matching the input scripts, and the top 3 results provide enough diversity for users to choose from. This matches our expectation, since the ablated model without these objectives has difficulty learning a class-aware, distinguishable feature space for shot proposals.

5. Discussion and Conclusion
Virtual Dynamic Storyboard (VDS) is a semi-automatic storyboard creation tool operating in virtual environments. It takes user-specified story and camera scripts as inputs and proposes multiple plausible character and camera trajectories that can be rendered into dynamic shots. A shot ranking discriminator trained with class-aware contrastive objectives then sorts these generated shots to output the top results. Experiments show that our tool can effectively compose dynamic storyboards and assist amateurs in their creation. It also bears the following limitations.
Quality of assets. The quality of VDS’s results heavily depends on the quality of the assets, i.e., the character models, action animations, and virtual environments. For example, VDS cannot generate rich expressions for a character if the character asset itself lacks detailed facial modeling. As our design is easy to extend with different assets, we will continue to improve VDS with more advanced 3D assets, including characters, associated actions, prefabs, and scenes. There are also opportunities to incorporate character/motion synthesis (Wang et al., 2021; Hong et al., 2022).
Trade-off between flexibility and efficiency of control. VDS provides a relatively flexible and fast way to generate dynamic storyboards for pre-production, but it does not yet allow users to control everything as they can with a pen. The current version of VDS addresses the most important steps in setting up a shot, i.e., the story plot and the cinematic shot styles. For different user groups, the interaction portal for control needs to be adjusted to meet their requirements and preferences. It is promising to provide professionals with more detailed control over characters and cameras, e.g., character details and lens aperture, though this would increase creation time. For novice users, it would be friendlier to provide fuzzier input control.
References
- (1)
- Arev et al. (2014) Ido Arev, Hyun Soo Park, Yaser Sheikh, Jessica Hodgins, and Ariel Shamir. 2014. Automatic editing of footage from multiple social cameras. ACM Transactions on Graphics (TOG) 33, 4 (2014), 1–11.
- Bhaumik et al. (2015) Hrishikesh Bhaumik, Siddhartha Bhattacharyya, Mausumi Das Nath, and Susanta Chakraborty. 2015. Real-time storyboard generation in videos using a probability distribution based threshold. In 2015 Fifth International Conference on Communication Systems and Network Technologies. IEEE, 425–431.
- Brodeur et al. (2018) Simon Brodeur, Ethan Perez, Ankesh Anand, Florian Golemo, Luca Celotti, Florian Strub, Jean Rouat, Hugo Larochelle, and Aaron Courville. 2018. HoME: a Household Multimodal Environment. In International Conference on Learning Representations Workshop.
- Chandu et al. (2019) Khyathi Chandu, Eric Nyberg, and Alan W Black. 2019. Storyboarding of recipes: grounded contextual generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 6040–6046.
- Chi et al. (2021) Peggy Chi, Nathan Frey, Katrina Panovich, and Irfan Essa. 2021. Automatic Instructional Video Creation from a Markdown-Formatted Tutorial. In The 34th Annual ACM Symposium on User Interface Software and Technology. 677–690.
- Fabbri et al. (2021) Matteo Fabbri, Guillem Brasó, Gianluca Maugeri, Orcun Cetintas, Riccardo Gasparini, Aljoša Ošep, Simone Calderara, Laura Leal-Taixé, and Rita Cucchiara. 2021. Motsynth: How can synthetic data help pedestrian detection and tracking?. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 10849–10859.
- Galvane et al. (2015) Quentin Galvane, Marc Christie, Christophe Lino, and Rémi Ronfard. 2015. Camera-on-rails: automated computation of constrained camera paths. In Proceedings of the 8th ACM SIGGRAPH Conference on Motion in Games. 151–157.
- Galvane et al. (2018) Quentin Galvane, Christophe Lino, Marc Christie, Julien Fleureau, Fabien Servant, François-Louis Tariolle, and Philippe Guillotel. 2018. Directing cinematographic drones. ACM Transactions on Graphics (TOG) 37, 3 (2018), 1–18.
- Gao et al. (2019) Xiaofeng Gao, Ran Gong, Tianmin Shu, Xu Xie, Shu Wang, and Song-Chun Zhu. 2019. VRKitchen: An interactive 3D environment for learning real life cooking tasks. In International Conference on Machine Learning Workshop.
- Gebhardt and Hilliges (2021) Christoph Gebhardt and Otmar Hilliges. 2021. Optimization-based User Support for Cinematographic Quadrotor Camera Target Framing. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems. 1–13.
- Giannetti and Leach (1999) Louis D Giannetti and Jim Leach. 1999. Understanding movies. Vol. 1. Prentice Hall Upper Saddle River, New Jersey.
- Goldman et al. (2006) Dan B Goldman, Brian Curless, David Salesin, and Steven M Seitz. 2006. Schematic storyboarding for video visualization and editing. ACM Transactions on Graphics (TOG) 25, 3 (2006), 862–871.
- Gschwindt et al. (2019) Mirko Gschwindt, Efe Camci, Rogerio Bonatti, Wenshan Wang, Erdal Kayacan, and Sebastian Scherer. 2019. Can a robot become a movie director? learning artistic principles for aerial cinematography. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 1107–1114.
- He et al. (2020) Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. 2020. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 9729–9738.
- He et al. (1996) Li-wei He, Michael F Cohen, and David H Salesin. 1996. The virtual cinematographer: A paradigm for automatic real-time camera control and directing. In Proceedings of the annual conference on Computer graphics and interactive techniques. 217–224.
- He et al. (2022) Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, and Qifeng Chen. 2022. Latent Video Diffusion Models for High-Fidelity Video Generation with Arbitrary Lengths. arXiv preprint arXiv:2211.13221 (2022).
- Hong et al. (2022) Fangzhou Hong, Mingyuan Zhang, Liang Pan, Zhongang Cai, Lei Yang, and Ziwei Liu. 2022. AvatarCLIP: Zero-Shot Text-Driven Generation and Animation of 3D Avatars. ACM Transactions on Graphics (TOG) 41, 4 (2022), 1–19.
- Huang et al. (2021) Chong Huang, Yuanjie Dang, Peng Chen, Xin Yang, and Kwang-Ting Tim Cheng. 2021. One-Shot Imitation Drone Filming of Human Motion Videos. IEEE Transactions on Pattern Analysis and Machine Intelligence (2021).
- Huang et al. (2019) Chong Huang, Chuan-En Lin, Zhenyu Yang, Yan Kong, Peng Chen, Xin Yang, and Kwang-Ting Cheng. 2019. Learning to film from professional human motion videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4244–4253.
- Huang et al. (2016) Hui Huang, Dani Lischinski, Zhuming Hao, Minglun Gong, Marc Christie, and Daniel Cohen-Or. 2016. Trip Synopsis: 60km in 60sec. In Computer Graphics Forum, Vol. 35. Wiley Online Library, 107–116.
- Jiang et al. (2021a) Hongda Jiang, Marc Christie, Xi Wang, Bin Wang, and Baoquan Chen. 2021a. Camera Keyframing with Style and Control. ACM Transactions on Graphics (TOG) (2021).
- Jiang et al. (2020) Hongda Jiang, Bin Wang, Xi Wang, Marc Christie, and Baoquan Chen. 2020. Example-driven virtual cinematography by learning camera behaviors. ACM Transactions on Graphics (TOG) 39, 4 (2020), 45–1.
- Jiang et al. (2021b) Xuekun Jiang, Libiao Jin, Anyi Rao, Linning Xu, and Dahua Lin. 2021b. Jointly Learning the Attributes and Composition of Shots for Boundary Detection in Videos. IEEE Transactions on Multimedia (2021).
- Leake et al. (2017) Mackenzie Leake, Abe Davis, Anh Truong, and Maneesh Agrawala. 2017. Computational video editing for dialogue-driven scenes. ACM Trans. Graph. 36, 4 (2017), 130–1.
- Liao et al. (2020) Junhua Liao, Haihan Duan, Xin Li, Haoran Xu, Yanbing Yang, Wei Cai, Yanru Chen, and Liangyin Chen. 2020. Occlusion Detection for Automatic Video Editing. In Proceedings of the ACM International Conference on Multimedia. 2255–2263.
- Lino and Christie (2015) Christophe Lino and Marc Christie. 2015. Intuitive and efficient camera control with the toric space. ACM Transactions on Graphics (TOG) 34, 4 (2015), 1–12.
- Louarn et al. (2018) Amaury Louarn, Marc Christie, and Fabrice Lamarche. 2018. Automated staging for virtual cinematography. In Proceedings of the 11th Annual International Conference on Motion, Interaction, and Games. 1–10.
- Louarn et al. (2020) Amaury Louarn, Quentin Galvane, Fabrice Lamarche, and Marc Christie. 2020. An interactive staging-and-shooting solver for virtual cinematography. In Motion, Interaction and Games. 1–6.
- Mirowski et al. (2022) Piotr Mirowski, Kory W Mathewson, Jaylen Pittman, and Richard Evans. 2022. Co-writing screenplays and theatre scripts with language models: An evaluation by industry professionals. arXiv preprint arXiv:2209.14958 (2022).
- Mohanta et al. (2013) Partha Pratim Mohanta, Sanjoy Kumar Saha, and Bhabatosh Chanda. 2013. A novel technique for size constrained video storyboard generation using statistical run test and spanning tree. International Journal of Image and Graphics 13, 01 (2013), 1350001.
- Moorthy et al. (2020) KL Bhanu Moorthy, Moneish Kumar, Ramanathan Subramanian, and Vineet Gandhi. 2020. Gazed–gaze-guided cinematic editing of wide-angle monocular video recordings. In Proceedings of the CHI Conference on Human Factors in Computing Systems. 1–11.
- Nvidia (2023) Nvidia. 2023. Omniverse Platform. https://www.nvidia.com/omniverse/.
- Oskam et al. (2009) Thomas Oskam, Robert W Sumner, Nils Thuerey, and Markus Gross. 2009. Visibility transition planning for dynamic camera control. In Proceedings of the ACM SIGGRAPH/Eurographics Symposium on Computer Animation. 55–65.
- Pan et al. (2021) Tian Pan, Yibing Song, Tianyu Yang, Wenhao Jiang, and Wei Liu. 2021. Videomoco: Contrastive video representation learning with temporally adversarial examples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 11205–11214.
- Pardo et al. (2021a) Alejandro Pardo, Fabian Caba, Juan León Alcázar, Ali K Thabet, and Bernard Ghanem. 2021a. Learning to Cut by Watching Movies. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 6858–6868.
- Pardo et al. (2021b) Alejandro Pardo, Fabian Caba Heilbron, Juan León Alcázar, Ali Thabet, and Bernard Ghanem. 2021b. MovieCuts: A New Dataset and Benchmark for Cut Type Recognition. arXiv preprint arXiv:2109.05569 (2021).
- Pizzi et al. (2010) David Pizzi, Jean-Luc Lugrin, Alex Whittaker, and Marc Cavazza. 2010. Automatic generation of game level solutions as storyboards. IEEE Transactions on Computational Intelligence and AI in Games 2, 3 (2010), 149–161.
- Rao et al. (2022a) Anyi Rao, Xuekun Jiang, Sichen Wang, Yuwei Guo, Zihao Liu, Bo Dai, Long Pang, Xiaoyu Wu, Dahua Lin, and Libiao Jin. 2022a. Temporal and Contextual Transformer for Multi-Camera Editing of TV Shows. arXiv preprint arXiv:2210.08737 (2022).
- Rao et al. (2020) Anyi Rao, Jiaze Wang, Linning Xu, Xuekun Jiang, Qingqiu Huang, Bolei Zhou, and Dahua Lin. 2020. A Unified Framework for Shot Type Classification Based on Subject Centric Lens. In The European Conference on Computer Vision (ECCV).
- Rao et al. (2022b) Anyi Rao, Linning Xu, and Dahua Lin. 2022b. Shoot360: Normal View Video Creation from City Panorama Footage. In ACM SIGGRAPH 2022 Conference Proceedings. 1–9.
- Richter et al. (2016) Stephan R Richter, Vibhav Vineet, Stefan Roth, and Vladlen Koltun. 2016. Playing for data: Ground truth from computer games. In European conference on computer vision. Springer, 102–118.
- Ronfard et al. (2022) Rémi Ronfard, Vineet Gandhi, Laurent Boiron, and A Murukutla. 2022. The prose storyboard language: A tool for annotating and directing movies (version 2.0, revised and illustrated edition). In Eurographics Workshop on Intelligent Cinematography and Editing, Vol. 4.
- Sanokho et al. (2014) Cunka Bassirou Sanokho, Clement Desoche, Billal Merabti, Tsai-Yen Li, and Marc Christie. 2014. Camera Motion Graphs.. In Symposium on Computer Animation. Citeseer, 177–188.
- Savva et al. (2019) Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, et al. 2019. Habitat: A platform for embodied ai research. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 9339–9347.
- Shah et al. (2018) Shital Shah, Debadeepta Dey, Chris Lovett, and Ashish Kapoor. 2018. Airsim: High-fidelity visual and physical simulation for autonomous vehicles. In Field and service robotics. Springer, 621–635.
- Shridhar et al. (2020) Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk, Winson Han, Roozbeh Mottaghi, Luke Zettlemoyer, and Dieter Fox. 2020. Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 10740–10749.
- Singer et al. (2022) Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. 2022. Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792 (2022).
- Starke et al. (2019) Sebastian Starke, He Zhang, Taku Komura, and Jun Saito. 2019. Neural state machine for character-scene interactions. ACM Trans. Graph. 38, 6 (2019), 209–1.
- Truong and Agrawala (2019) Anh Truong and Maneesh Agrawala. 2019. A Tool for Navigating and Editing 360 Video of Social Conversations into Shareable Highlights.. In Graphics Interface. 14–1.
- Truong et al. (2018) Anh Truong, Sara Chen, Ersin Yumer, David Salesin, and Wilmot Li. 2018. Extracting regular fov shots from 360 event footage. In Proceedings of the CHI Conference on Human Factors in Computing Systems. 1–11.
- Unity (2023) Unity. 2023. Real-Time Development Platform. https://unity.com/.
- Unreal (2023) Unreal. 2023. Real-Time 3D Creation Tool. https://www.unrealengine.com/.
- Wang et al. (2021) Jingbo Wang, Sijie Yan, Bo Dai, and Dahua Lin. 2021. Scene-aware Generative Network for Human Motion Synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12206–12215.
- Wang et al. (2018) Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. 2018. Temporal segment networks for action recognition in videos. IEEE transactions on pattern analysis and machine intelligence 41, 11 (2018), 2740–2755.
- Wang et al. (2019) Miao Wang, Guo-Wei Yang, Shi-Min Hu, Shing-Tung Yau, and Ariel Shamir. 2019. Write-a-video: computational video montage from themed text. ACM Trans. Graph. 38, 6 (2019), 177–1.
- Wikipedia (2023) Wikipedia. 2023. Storyboard. https://en.wikipedia.org/wiki/Storyboard.
- Wu et al. (2018) Hui-Yin Wu, Francesca Palù, Roberto Ranon, and Marc Christie. 2018. Thinking like a director: Film editing patterns for virtual cinematographic storytelling. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 14, 4 (2018), 1–22.
- Ye and Baldwin (2008) Patrick Ye and Timothy Baldwin. 2008. Towards Automatic Animated Storyboarding.. In AAAI. 578–583.
- Yoo et al. (2021) Jung Eun Yoo, Kwanggyoon Seo, Sanghun Park, Jaedong Kim, Dawon Lee, and Junyong Noh. 2021. Virtual Camera Layout Generation using a Reference Video. In Proceedings of the CHI Conference on Human Factors in Computing Systems. 1–11.
- Zhang et al. (2018) He Zhang, Sebastian Starke, Taku Komura, and Jun Saito. 2018. Mode-adaptive neural networks for quadruped motion control. ACM Transactions on Graphics (TOG) 37, 4 (2018), 1–11.
- Zhong et al. (2021) Lei Zhong, Feng-Heng Li, Hao-Zhi Huang, Yong Zhang, Shao-Ping Lu, and Jue Wang. 2021. Aesthetic-guided outward image cropping. ACM Transactions on Graphics (TOG) 40, 6 (2021), 1–13.