Visual Story Generation Based on Emotion and Keywords
Abstract
Automated visual story generation aims to produce stories with corresponding illustrations that exhibit coherence, progression, and adherence to characters’ emotional development. This work proposes a story generation pipeline to co-create visual stories with users. The pipeline allows the user to control the events and emotions in the generated content. The pipeline has two parts: narrative generation and image generation. For narrative generation, the system generates the next sentence using user-specified keywords and emotion labels. For image generation, diffusion models are used to create a visually appealing image corresponding to each generated sentence. Further, object recognition is applied to the generated images so that objects appearing in them can be mentioned in future story development.
Introduction
Storytelling has been an important means of communication, entertainment, and even making sense of the world. Stories can be told with just text or using the visual medium. The use of visualizations can make stories more expressive and engaging. In this work, we propose a visual story co-creation system that leverages the power of large language models and diffusion models to create short visual stories with a user (GitHub repository: https://github.com/Stry233/Visual-Story-Generation-Based-on-Emotional-and-Keyword-Scheme).
The example short stories we have in mind come from ROCStories, a popular dataset for commonsense reasoning and narrative understanding associated with the Story Cloze Test (?). Each story is composed of five sentences and follows a character through a series of events to an ending event or situation. When the data were collected, the crowd workers were instructed that “the story should read like a coherent story, with a specific beginning and ending, where something happens in between” (?). These instructions make the stories good examples of meaningful short stories. While the stories in the dataset are text-only, our proposed system aims to help people create similar stories with visualizations.
To co-create visual stories, the proposed system aims to provide plausible suggestions for story development while allowing users to override them with their own. Figure 1 shows the overall pipeline of the system. The system is iterative in nature. Based on the existing partial story, the system suggests the characters’ emotions and keywords for the next sentence, such as an object, a location, etc. The user can accept them as defaults or provide their own. The system then generates the next sentence based on these inputs and the previously generated story. The user is then presented with several visualizations for the new sentence and must choose one to proceed. Finally, the generated images are subjected to object recognition so that the system can extract additional keywords and recommend their use in future storytelling.

Related Work
Automatic story generation has long been a challenging task in natural language processing, with efforts dating back to the 1970s (?). Earlier attempts, such as symbolic planning systems (?), utilized planning techniques to create plausible and coherent plots for stories. In addition, case-based or analogical modeling systems generated new narratives by adapting existing stories (?). However, while these approaches can create stories with impressive coherence and consistency, they often require extensive knowledge engineering and are restricted to limited domains. To address this issue, large crowd-sourced corpora of stories have been used to help generate stories (?; ?). Modern approaches based on neural language models have been explored to generate plot-driven stories (?). Moreover, (?) incorporates commonsense knowledge into the generator using a novel memory mechanism and utilizes adversarial training to improve the diversity and originality of essay generation. (?) leverages reward strategies during generation to guide the language model toward producing a coherent story. (?) proposes a character-centric story-generation model that can produce stories consistent with characters’ profiles.
In contrast to previous work, we propose an interactive visual story generation pipeline in which the user can specify keywords and emotions that will appear in the next sentence. The system generates the sentence and associated images for the user based on this information.
Visual Story Co-creation Pipeline
This section describes our story generation pipeline and its interactive workflow, as shown in Figure 1. The pipeline is divided into two parts: next-sentence generation and image generation. Next-sentence generation is in charge of producing a plausible story based on the specified keywords and emotions, whereas image generation is in charge of producing visualizations for the generated story. When the pipeline operates interactively, the generated images can be mined for additional keywords to include in future story development.
Next Sentence Generation
To begin the co-creation process, the pipeline must be fed with the first sentence of the story. It will then suggest emotions for the characters as well as keywords for the next sentence, inviting the user to edit them. This process is repeated iteratively by the pipeline to build the entire story.
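For illustration, the iterative workflow can be summarized by the following minimal sketch. The helper callables (emotion suggestion, user interaction, sentence generation, image generation, and object detection) are hypothetical stand-ins for the components described in the rest of this section, not the exact interface of our implementation.

# Sketch of the co-creation loop; the helper callables are placeholders for the
# BERT classifier, T5 generator, diffusion model, object detector, and UI.
def co_create_story(first_sentence, suggest_emotions, edit_suggestions,
                    generate_sentence, generate_images, choose_image,
                    detect_objects, n_sentences=5):
    story = [first_sentence]
    keyword_pool = []                              # later filled from generated images
    while len(story) < n_sentences:
        # 1. Suggest emotions and keywords; the user may edit both.
        emotions = suggest_emotions(story)         # next-sentence emotion labels
        emotions, keywords = edit_suggestions(emotions, keyword_pool)

        # 2. Generate the next sentence from context + keywords + emotions.
        sentence = generate_sentence(story, keywords, emotions)
        story.append(sentence)

        # 3. Render candidate images and let the user pick one.
        chosen_image = choose_image(generate_images(sentence))

        # 4. Mine the chosen image for keyword suggestions for the next round.
        keyword_pool = detect_objects(chosen_image)
    return story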
We trained a next-sentence emotion prediction model using the Story Commonsense dataset (?). Here, the existing story text was used as input, while the emotion labels from the next sentence were used as labels. We fine-tuned a BERT model (?) for this task (hereinafter the next-sentence emotion classifier). As defaults for the keyword suggestions, we provide the objects recognized in the generated images.
To enable the system to generate a sentence with the suggested keywords and emotions, we fine-tuned a pre-trained T5 (?) model on the ROCStories dataset (?). The T5 model had been pre-trained on the CommonGen dataset (?), which enables it to generate a sentence containing a given set of keywords; we refer to this checkpoint as the CommonGen T5. Fine-tuning it on the ROCStories dataset further teaches the model the story-writing style of ROCStories.
Keywords Extraction
To train the model and for evaluation, we extract keywords from the original five-sentence stories. We use the SceneGraphParser from (?) to parse sentences (in natural language) into scene graphs. The entities in the scene graphs become the keywords. For example, for the following sentence:
Input: I brought the movie home and watched the whole thing.
the extracted keywords are:
I, the movie, the whole thing
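A minimal sketch of this extraction step is shown below. It assumes the sng_parser package that accompanies the SceneGraphParser; the exact field names may differ between versions.

# Sketch: extracting keywords (scene-graph entities) from a sentence.
# Assumes: pip install SceneGraphParser (plus a spaCy English model for parsing).
import sng_parser

def extract_keywords(sentence):
    """Return the entity spans of a sentence as keyword candidates."""
    graph = sng_parser.parse(sentence)
    return [entity["span"] for entity in graph["entities"]]

print(extract_keywords("I brought the movie home and watched the whole thing."))
# Expected (roughly): ['I', 'the movie', 'the whole thing']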
Categorical Emotion Inference
We use Plutchik’s Wheel of Emotions as our basic model of characters’ emotions. The wheel includes eight emotions in pairs: joy/sadness, trust/disgust, fear/anger, and surprise/anticipation (?), as shown in Figure 2. We define the emotion entries of a sentence $s_t$ as a real-valued, low-dimensional vector $e_t \in \mathbb{R}^8$, with one entry per emotion:

$e_t = (e_{\mathrm{joy}}, e_{\mathrm{trust}}, e_{\mathrm{fear}}, e_{\mathrm{surprise}}, e_{\mathrm{sadness}}, e_{\mathrm{disgust}}, e_{\mathrm{anger}}, e_{\mathrm{anticipation}})$   (1)

Given the context of a story, i.e., all previous sentences $s_1, \ldots, s_{t-1}$, the goal is to predict the emotion vector $e_t$ of the next sentence $s_t$.
For example, given the following sentence as context, the model may predict that the emotions in the next sentence are joy, anticipation, and trust:
Context: He was hoping this year to be tall enough for the coaster.
To prepare the training data, we obtained 17,910 pairs of story context and next-sentence emotions from the Story Commonsense dataset. The dataset was divided into training and validation sets in a 6:4 ratio. We then fine-tuned the BERT model on a multi-label classification task: the story context is the input, and the next-sentence emotions are the output. The maximum input length of the tokenizer was set to 120. The batch size was set to 16, and the learning rate was set to . We ran 16 epochs of training, which took about 45 minutes on a Tesla P100-PCIE-16GB GPU. The model achieved a macro ROC-AUC score of 0.69 on prediction. The predictions are not perfect, but they are only used as suggestions, and the user can always override them.
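The fine-tuning setup can be sketched as follows with Hugging Face Transformers. The wrapper code and the example labels are illustrative; the sequence length, emotion set, and multi-label objective follow the description above.

# Sketch: BERT as a multi-label classifier over Plutchik's eight emotions.
import torch
from transformers import BertTokenizerFast, BertForSequenceClassification

EMOTIONS = ["joy", "trust", "fear", "surprise",
            "sadness", "disgust", "anger", "anticipation"]

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(EMOTIONS),
    problem_type="multi_label_classification",  # uses BCE-with-logits loss
)

context = "He was hoping this year to be tall enough for the coaster."
labels = torch.tensor([[1., 1., 0., 0., 0., 0., 0., 1.]])  # joy, trust, anticipation

inputs = tokenizer(context, truncation=True, max_length=120,
                   padding="max_length", return_tensors="pt")
loss = model(**inputs, labels=labels).loss  # a real training loop would backprop this

# At inference time, sigmoid(logits) > 0.5 yields the suggested emotion set.
with torch.no_grad():
    probs = torch.sigmoid(model(**inputs).logits)[0]
suggested = [e for e, p in zip(EMOTIONS, probs) if p > 0.5]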
Next Sentence Generation with Prompts

An overview of our story generation model is given in Figure 3. We constructed our training data using the stories from both the ROCStories and Story Commonsense datasets.
For a sample story from ROCStories, we define it as a 5-dimensional vector $S = (s_1, s_2, s_3, s_4, s_5)$, where each $s_i$ represents the corresponding sentence in the story.
To train our system to generate the next sentence, each target sentence is fed into the emotion classifier as well as the keyword extractor to identify the emotions and keywords it contains. Based on the story context and this additional knowledge, T5 is trained on triples of context, keywords, and emotion labels.
In addition, we used a fixed prompting structure, combining context, keywords, and emotion labels, in both the training and inference processes; a sketch of how such training examples are assembled is given below.
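The construction of a single training example can be sketched as follows. The exact template string is defined in our repository, so the prompt format below is only indicative; emotion_detector and keyword_extractor stand for the models described above.

# Sketch: turning one five-sentence story into (source prompt, target) pairs.
def build_training_examples(story, emotion_detector, keyword_extractor):
    examples = []
    for i in range(1, len(story)):
        context = " ".join(story[:i])          # s_1 ... s_i
        target = story[i]                      # the sentence to be generated
        keywords = keyword_extractor(target)   # e.g., scene-graph entities
        emotions = emotion_detector(target)    # e.g., ["joy", "anticipation"]
        source = (f"context: {context} "
                  f"keywords: {', '.join(keywords)} "
                  f"emotions: {', '.join(emotions)}")
        examples.append((source, target))
    return examples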
We adopt T5 as the base language model for the training process. The CommonGen dataset contains pairs of keyword sets and sentences; the task is to generate a meaningful sentence that contains all the given keywords. The CommonGen T5 introduced earlier is a T5 model fine-tuned on this dataset. For example, given the following input:
X1: <’Mike’, ’tree’, ’ground’, ’hole’>
the model may generate:
Y1: Mike is digging a hole in the ground near a tree.
By fine-tuning this T5 with the emotions and keyword information as prompts, the model can more effectively capture hidden information in the story and thus improve the accuracy and variety of the generated sentences. Compared to a baseline model fine-tuned only on context information, we found that, lacking a priori knowledge of the link between emotions and events, the baseline is highly susceptible to conflating characters' emotions when their behavioral motivations differ. As shown in Table 1, without the prompts the baseline model could not use the context effectively and correctly: it generated repeated sentences, and it is unclear why the character, Mary, would feel happy about her situation.
Table 1: Example stories generated with and without the emotion and keyword prompts.
Generated with prompt | Generated without prompt
Mary had been feeling depressed lately. | Mary had been feeling depressed lately.
she decided to go see a psychiatrist. | She decided to go to a psychiatrist.
Psyched, her psychiatrist diagnosed her with depression and sent her to see. | She was diagnosed with schizophrenia.
Medicant took her to get an antidepressant and prescribed her. | She was very happy.
Thankfully it eventually made her feel better again. | She was very happy.
In terms of implementation, we used the format above to create a dataset with 18,680 entries, which we divided into training and validation sets in an 8:2 ratio. The maximum lengths of the source and output text are set to 512 and 50 tokens, respectively. To control the diversity of the output, we use top-K search during decoding and, for the same reason, set the repetition penalty and length penalty to 2.6 and 1.0, respectively. For training, we set the batch size to 10 and the learning rate to 5e-4, and introduced early stopping to prevent overfitting. We ran 16 epochs; training the language model took about 10 hours on a Tesla P100-PCIE-16GB GPU.
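These decoding settings map onto standard Hugging Face generation arguments roughly as follows. The checkpoint name and prompt string are placeholders for our fine-tuned model and template, and the exact K for top-K search is set in our configuration.

# Sketch: inference with the fine-tuned T5 using the decoding settings above.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")            # stand-in checkpoint
model = T5ForConditionalGeneration.from_pretrained("t5-base")

prompt = ("context: Mary had been feeling depressed lately. "
          "keywords: psychiatrist emotions: anticipation, trust")
inputs = tokenizer(prompt, truncation=True, max_length=512, return_tensors="pt")

output_ids = model.generate(
    **inputs,
    max_length=50,          # maximum output length
    do_sample=True,         # sampling so that top-K takes effect
    top_k=50,               # top-K search (placeholder value)
    repetition_penalty=2.6,
    length_penalty=1.0,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))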
Image Generation
When image generation is added to text-based stories, it increases flexibility and creates a sense of immersion in the generated plot. As a result, in addition to the text pipeline, we implement an image pipeline that generates additional keywords from the images as suggestions, potentially pushing the story forward in different directions. We separate this pipeline into two parts: image generation and object detection. We experimented with both Stable Diffusion (?) and Disco Diffusion as our primary methods for image generation, as they can produce fairly relevant figures with artistic quality. For object detection, we chose YOLOv5 (?) and Faster R-CNN (?) as our models. The detected keywords serve as suggestions for users to select from, as well as hints to guide the subsequent generation process.
Image Generation
To create images for the generated sentences, we utilize Disco Diffusion, a tool that combines Diffusion, a mathematical process for iteratively removing noise from an image, with CLIP (?), a model for labeling images. Image generation in Disco Diffusion is an iterative process in which CLIP evaluates the intermediate image against the text prompt and provides guidance for the diffusion process at each step.
Disco Diffusion has been extensively explored as a means of generating creative art. In practice, however, we found it difficult to create a good image based solely on one line of a story. Although Disco Diffusion has a portrait generator, the model finds it difficult to generate images that depict actions of the form ”someone is doing something.” To address this, we include user-specified keywords in the text prompt, such as an artist’s name and background information. For example, we added ”Carl Spitzweg” as the artist and ”country view” as the background information in our example. By including the artist’s name in the text prompt, Disco Diffusion can generate images consistent with that artist’s style, such as shadow, color, and strokes. Furthermore, the background information adds more relevant elements to the generated image. As shown in Figure 4, the prompt ”Ray gathered his friends to tell them a funny joke he heard” alone only produces a picture of several people who appear to be talking, whereas adding ”country view” to the text prompt yields an image with explicit yet magnificent scenery in the background. Users can change the artist’s name and background information based on their preferences.


Once Disco Diffusion is able to generate images with reasonably appropriate content based on the text prompt, we use it as a source of commonsense knowledge to suggest additional keywords for the sentence that follows. For instance, consider the prompt ”a boy is picking up shells on a beach.” This generates an image of a boy with a bag picking up shells on a beach with an ocean backdrop, as shown in Figure 5. While the image contains the requested elements, such as ”boy,” ”shells,” and ”beach,” it also introduces new but highly relevant elements such as ”bag” and ”ocean.” This is similar to how people, when reading stories, frequently fill in details based on their imagination. The extra elements in the generated images are used to add an ”imagination” component to the story generation pipeline. Object detection algorithms detect potential keywords in the generated images, which are then presented to users as suggestions for creating the next part of the story.

Implementation
There are about 115 parameters in Disco Diffusion. We mainly focus on settings that directly impact our output images.
clip_guidance_scale: This variable controls how strongly CLIP guides the image toward the prompt. We set it to 5,000 as recommended.
steps: Each step (or iteration) involves the model looking at subsets of the image called “cuts” and calculating the “direction” in which the image should be guided so that the generated result matches the prompt. We set this to 250.
n_batches: We generate three images for each prompt; the user can choose which one is best for the story.
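In the public Disco Diffusion notebook, these choices correspond roughly to settings like the following. Treat this as a configuration sketch (assigned in the notebook’s settings cells) rather than a standalone script; variable names follow the notebook as we recall them.

# Sketch: Disco Diffusion settings for one sentence of the story.
sentence = "Ray gathered his friends to tell them a funny joke he heard"
artist = "Carl Spitzweg"      # user-editable style reference
background = "country view"   # user-editable background information

text_prompts = {0: [f"{sentence}, {background}, by {artist}"]}
clip_guidance_scale = 5000    # how strongly CLIP steers generation toward the prompt
steps = 250                   # number of diffusion iterations
n_batches = 3                 # three candidate images per prompt for the user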
Item | Number | Confidence
horse | 5 | 0.729
bird | 2 | 0.719
person | 42 | 0.694
handbag | 2 | 0.656
… | … | …

Key Object Detection

We explored using YOLOv5 and Faster R-CNN (?) for object detection. In general, YOLOv5 outperforms Faster R-CNN in inference speed due to its smaller model size. On the other hand, Faster R-CNN with a ResNet-50 FPN (?) backbone provides better performance in terms of mAP (mean average precision). Both models are pre-trained on the COCO dataset (?). Although these models are trained only on real-world images, they perform impressively on the paintings generated by Disco Diffusion. We suspect this is because, as long as the artistic style is not too abstract, the objects in the generated images share characteristics with real-world objects.
With the same prompt, Disco Diffusion can generate a batch of images. We run object detection on all generated images and summarize the detected objects, with their confidence levels, in a dictionary. As a result, the user gets a better sense of potential objects associated with the current story and is more likely to include them in the text prompt.
Implementation
We use YOLOv5x because it achieves the best mAP among the YOLOv5 family. Both YOLOv5 and Faster R-CNN are pre-trained on the COCO dataset, which covers roughly 80 detectable object classes. To improve the reliability of the results, we set the Faster R-CNN confidence threshold to 0.4. The final detections are sorted by confidence score, with overlapping boxes filtered by IoU (Intersection over Union).
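A sketch of the keyword-suggestion step with YOLOv5x loaded through torch.hub is given below. The image path is a placeholder for an image produced by the diffusion model, and the 0.4 threshold is applied uniformly here for simplicity.

# Sketch: detecting objects in generated images and ranking them by confidence.
import torch

model = torch.hub.load("ultralytics/yolov5", "yolov5x", pretrained=True)

def suggest_keywords(image_paths, conf_threshold=0.4):
    """Aggregate detected object names with their best confidence per batch."""
    detected = {}
    for path in image_paths:
        detections = model(path).pandas().xyxy[0]  # columns include name, confidence
        for _, row in detections.iterrows():
            if row["confidence"] >= conf_threshold:
                detected[row["name"]] = max(detected.get(row["name"], 0.0),
                                            row["confidence"])
    # Sort so the most reliable objects are suggested to the user first.
    return sorted(detected.items(), key=lambda kv: kv[1], reverse=True)

print(suggest_keywords(["generated_image_0.png"]))  # placeholder image path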

Example Outputs
Figure 6 demonstrates a story generation example produced by our pipeline with different keywords and emotions. By modifying the input prompts for the last two sentences (i.e., the twist and ending of the story) in both examples, the pipeline generates a reasonable story based on the provided information. This demonstrates both the capability of our pipeline and its sensitivity to emotions and keywords. In addition, Figure 7 shows a full visual storytelling report that combines the images produced by the image generation pipeline.
Evaluation and Discussion
To assess the efficacy of a controlled story generation framework based on categorical emotions and keywords, we compared our approach to a baseline model fine-tuned solely on the existing story. To provide sensible emotion keywords as input, we trained another BERT-based model, a fine-tuned emotion predictor based on the 11,129 emotion-sentence pairs in the Story Commonsense dataset. Unlike the next-sentence emotion classifier described in the Categorical Emotion Inference section, this model detects the emotions expressed in a given (current) sentence. Its macro AUC-ROC score reached 0.78 during training, leading us to believe that it is adequate for identifying and extracting emotions from sentences. To ensure that the models can apply prior knowledge to the inference process of story generation, both our pipeline and the baseline use the same fine-tuned T5 and adapt their inputs to the prompt format used in the corresponding fine-tuning phase.
To assess the validity of the prompts we added to the input, we used the average of BLEU (?) scores with n-grams ranging from 1 to 4 and BERTScore (?) as basic metrics in our evaluation. In addition, we use METEOR (?) and SacreBLEU (?) as supplements, since they add stemming and synonym matching (METEOR) and a standardized exact-word-matching protocol (SacreBLEU) when comparing the generated content with the ground truth.
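For a single hypothesis-reference pair, these metrics can be computed as in the following sketch. The sentences are illustrative, and we use cumulative BLEU-1 through BLEU-4 with smoothing, which is one reasonable reading of the averaging described above.

# Sketch: computing the evaluation metrics for one generated sentence.
# Requires: nltk (with the wordnet data downloaded), sacrebleu, bert-score.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score
import sacrebleu
from bert_score import score as bert_score

hyp = "She decided to go see a psychiatrist."
ref = "She made an appointment with a psychiatrist."

smooth = SmoothingFunction().method1
bleu_avg = sum(
    sentence_bleu([ref.split()], hyp.split(),
                  weights=tuple([1.0 / n] * n), smoothing_function=smooth)
    for n in range(1, 5)
) / 4  # average of BLEU-1 .. BLEU-4

meteor = meteor_score([ref.split()], hyp.split())      # stemming/synonym aware
sacre = sacrebleu.corpus_bleu([hyp], [[ref]]).score    # standardized exact matching
_, _, f1 = bert_score([hyp], [ref], lang="en")         # semantic similarity
print(bleu_avg, meteor, sacre, f1.mean().item())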
We present the experimental results in Figure 8 and find that, compared to the baseline model, which relies solely on story context for inference, our pipeline shows consistent improvements in all metrics under the same model by introducing emotions and keywords as prompts.
Story generation based on emotions and keywords yields statistically higher upper bounds regardless of which metric is used or what is tested. As shown in Figure 8, in terms of the average BLEU score, our approach improves on the baseline by 62% on average. It also significantly reduces the number of samples with a BERTScore of zero. The average BERTScore increases by 6.7%, and the scores fluctuate less. This implies that our approach has the potential to significantly improve many aspects of story generation.
Future Work
We are interested in making several improvements in the future. We hope to find a better prompt format to fully utilize the potential of language models for the generation task. We hope that, with the assistance of Chain-of-Thought prompting (?), the model will be able to generate stories with more sophisticated reasoning and causal relationships. The resulting stories would then feel more natural and logical. Another area that can be improved is the emotion classifier. At the moment, the model for predicting emotions in the following sentence only achieves around 60% accuracy for each of Plutchik’s emotion categories, which leaves room for improvement.
The image pipeline also has room for improvement. In particular, Disco Diffusion has difficulty producing images of creatures when configured incorrectly. While we can generate reasonable images by naming artists who frequently paint the corresponding objects and creatures, thanks to CLIP’s recognition capabilities, the diffusion model produces nonsense if we do not provide a ”reference” artist. Furthermore, because they are pre-trained only on the COCO dataset of real-world objects, YOLOv5 and Faster R-CNN can detect a limited number of object classes. In the future, we can apply few-shot learning to YOLOv5 or Faster R-CNN to increase the number of detectable objects (?).
Conclusion
We present a novel pipeline for creating visual stories based on Plutchik’s Wheel of Emotions as the model of emotional knowledge and on keyword information. This method assists visual narrative designers in creating and iterating stories in a controlled manner. Our method also includes a complete narrative text-to-image generation pipeline. Experiments show that stories generated using keywords as prompts have greater grammatical and logical consistency and are more consistent with human writing habits. We believe this work is an important step toward a more comprehensive and functional language-model-based storytelling tool.
References
- [Banerjee and Lavie 2005] Banerjee, S., and Lavie, A. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 65–72. Ann Arbor, Michigan: Association for Computational Linguistics.
- [Devlin et al. 2018] Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- [Fan, Lewis, and Dauphin 2018] Fan, A.; Lewis, M.; and Dauphin, Y. 2018. Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 889–898. Melbourne, Australia: Association for Computational Linguistics.
- [Gervás et al. 2005] Gervás, P.; Díaz-Agudo, B.; Peinado, F.; and Hervás, R. 2005. Story plot generation based on CBR. Knowledge-Based Systems.
- [Ibrahim et al. 2022] Ibrahim, B. I. E.; Eyharabide, V.; Le Page, V.; and Billiet, F. 2022. Few-shot object detection: Application to medieval musicological studies. Journal of Imaging 8(2).
- [Jocher et al. 2022] Jocher, G.; Chaurasia, A.; Stoken, A.; Borovec, J.; NanoCode012; Kwon, Y.; TaoXie; Fang, J.; imyhxy; Michael, K.; Lorna; V, A.; Montes, D.; Nadar, J.; Laughing; tkianai; yxNONG; Skalski, P.; Wang, Z.; Hogan, A.; Fati, C.; Mammana, L.; AlexWang1900; Patel, D.; Yiwei, D.; You, F.; Hajek, J.; Diaconu, L.; and Minh, M. T. 2022. ultralytics/yolov5: v6.1 - TensorRT, TensorFlow Edge TPU and OpenVINO Export and Inference.
- [Li et al. 2013] Li, B.; Lee-Urban, S.; Johnston, G.; and Riedl, M. 2013. Story generation with crowdsourced plot graphs. Proceedings of the 27th AAAI Conference on Artificial Intelligence, AAAI 2013 598–604.
- [Lin and Och 2004] Lin, C.-Y., and Och, F. J. 2004. ORANGE: a method for evaluating automatic evaluation metrics for machine translation. In COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics, 501–507. Geneva, Switzerland: COLING.
- [Lin et al. 2014] Lin, T.-Y.; Maire, M.; Belongie, S.; Bourdev, L.; Girshick, R.; Hays, J.; Perona, P.; Ramanan, D.; Zitnick, C. L.; and Dollár, P. 2014. Microsoft COCO: Common objects in context.
- [Lin et al. 2016] Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; and Belongie, S. 2016. Feature pyramid networks for object detection.
- [Lin et al. 2019] Lin, B. Y.; Zhou, W.; Shen, M.; Zhou, P.; Bhagavatula, C.; Choi, Y.; and Ren, X. 2019. CommonGen: A constrained text generation challenge for generative commonsense reasoning. arXiv preprint arXiv:1911.03705.
- [Liu et al. 2020] Liu, D.; Li, J.; Yu, M.-H.; Huang, Z.; Liu, G.; Zhao, D.; and Yan, R. 2020. A character-centric neural model for automated story generation. Proceedings of the AAAI Conference on Artificial Intelligence 34(02):1725–1732.
- [Meehan 1977] Meehan, J. R. 1977. Tale-spin, an interactive program that writes stories. In IJCAI.
- [Mostafazadeh et al. 2016] Mostafazadeh, N.; Chambers, N.; He, X.; Parikh, D.; Batra, D.; Vanderwende, L.; Kohli, P.; and Allen, J. 2016. A corpus and evaluation framework for deeper understanding of commonsense stories. arXiv preprint arXiv:1604.01696.
- [Plutchik 1980] Plutchik, R. 1980. A general psychoevolutionary theory of emotion. In Theories of emotion. Elsevier. 3–33.
- [Post 2018] Post, M. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, 186–191. Belgium, Brussels: Association for Computational Linguistics.
- [Radford et al. 2021] Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; Krueger, G.; and Sutskever, I. 2021. Learning transferable visual models from natural language supervision.
- [Raffel et al. 2020] Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P. J.; et al. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21(140):1–67.
- [Rashkin et al. 2018] Rashkin, H.; Bosselut, A.; Sap, M.; Knight, K.; and Choi, Y. 2018. Modeling naive psychology of characters in simple commonsense stories. In Gurevych, I., and Miyao, Y., eds., Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, 2289–2299. Association for Computational Linguistics.
- [Ren et al. 2015] Ren, S.; He, K.; Girshick, R.; and Sun, J. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks.
- [Riedl and Young 2010] Riedl, M. O., and Young, R. M. 2010. Narrative planning: Balancing plot and character. Journal of Artificial Intelligence Research 39:217–268.
- [Rombach et al. 2022] Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; and Ommer, B. 2022. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10684–10695.
- [Swanson and Gordon 2012] Swanson, R., and Gordon, A. S. 2012. Say anything: Using textual case-based reasoning to enable open-domain interactive storytelling. ACM Trans. Interact. Intell. Syst. 2(3).
- [Tambwekar et al. 2019] Tambwekar, P.; Dhuliawala, M.; Martin, L. J.; Mehta, A.; Harrison, B.; and Riedl, M. O. 2019. Controllable neural story plot generation via reward shaping. In IJCAI.
- [Wei et al. 2022] Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.; Le, Q.; and Zhou, D. 2022. Chain of thought prompting elicits reasoning in large language models.
- [Wu et al. 2019] Wu, H.; Mao, J.; Zhang, Y.; Jiang, Y.; Li, L.; Sun, W.; and Ma, W.-Y. 2019. Unified visual-semantic embeddings: Bridging vision and language with structured meaning representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6609–6618.
- [Yang et al. 2019] Yang, P.; Li, L.; Luo, F.; Liu, T.; and Sun, X. 2019. Enhancing topic-to-essay generation with external commonsense knowledge. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2002–2012. Florence, Italy: Association for Computational Linguistics.
- [Zhang et al. 2019] Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K. Q.; and Artzi, Y. 2019. BERTScore: Evaluating text generation with BERT.