GenSim: A General Social Simulation Platform with Large Language Model based Agents
Abstract
With the rapid advancement of large language models (LLMs), recent years have witnessed many promising studies on leveraging LLM-based agents to simulate human social behavior. While prior work has demonstrated significant potential across various domains, much of it has focused on specific scenarios involving a limited number of agents and has lacked the ability to adapt when errors occur during simulation. To overcome these limitations, we propose a novel LLM-agent-based simulation platform called GenSim, which: (1) Abstracts a set of general functions to simplify the simulation of customized social scenarios; (2) Supports one hundred thousand agents to better simulate large-scale populations in real-world contexts; (3) Incorporates error-correction mechanisms to ensure more reliable and long-term simulations. To evaluate our platform, we assess both the efficiency of large-scale agent simulations and the effectiveness of the error-correction mechanisms. To our knowledge, GenSim represents an initial step toward a general, large-scale, and correctable social simulation platform based on LLM agents, promising to further advance the field of social science.
1 Introduction
Social science, which focuses on human behavior, communication, and organization, is playing an increasingly significant role as civilization advances. One important research paradigm in social science is collecting real human data. For instance, to study the effectiveness of positive psychology interventions, [1] recruited over 6,000 participants and observed their responses to controlled experiments. Similarly, [5] employed 60 participants in a six-week user study that collected mobile personally identifiable information to investigate the economics of personal mobile data. While the paradigm of collecting real human data is widespread in social science research, it suffers from significant drawbacks, such as high cost, poor controllability, and poor reproducibility, which have long troubled researchers.
In the field of artificial intelligence (AI), researchers have discovered that language serves as a crucial carrier of intelligence [7], and that the objective of “next-token prediction” over a massive training corpus (i.e., large language models, LLMs) has the potential to achieve human-like intelligence. With the advent of these high-intelligence models, a new “AI for Social Science” direction has emerged: leveraging LLMs as proxies for real humans to conduct social science experiments [2]. This approach offers an opportunity to fundamentally address the above challenges faced by social science research, potentially paving the way for an entirely new research paradigm. For instance, Generative Agents [4] leverages 25 agents to simulate human daily life and finds that these agents can autonomously host parties and run mayoral elections. RecAgent [6] simulates users’ online behaviors and studies phenomena such as information cocoons and conformity. EconAgent [3] studies macroeconomic behaviors using LLM-based agents in the context of dynamic markets.
While the above methods have shown promising results, they are primarily limited to specific scenarios and small-scale simulations. Moreover, when discrepancies arise between simulated behaviors and those observed in the real world, existing methods lack effective error-correction mechanisms. To address these limitations, we introduce GenSim, a general social simulation platform based on LLM agents. Specifically, to avoid reinventing the wheel when simulating various social scenarios, we propose a general programming framework composed of three key modules for single-agent construction, multi-agent scheduling, and environment setup. Additionally, we provide three default scenarios as references to help users quickly implement their customized simulations. To achieve large-scale simulations of real-world scenarios, we leverage distributed parallel technology to support one hundred thousand agents in our platform. Finally, we design several error-correction mechanisms, allowing the platform to first perform self-evaluation or seek human feedback and then fine-tune itself to ensure more reliable simulation.
In summary, the main contributions of this paper are as follows: (1) We propose a general, large-scale, and correctable social simulation platform based on LLM agents. (2) We provide detailed usage examples to illustrate the capabilities of our platform. (3) We conduct a series of experiments to evaluate the platform’s effectiveness and efficiency.

2 Features of GenSim
There are several unique features of our platform. To begin with, we abstract a set of general functions to facilitate any customized simulation scenario according to the users’ requirements. Then, our platform supports one hundred thousand agents to better simulate large-scale populations in real-world contexts. Finally, we provide a series of error-correction mechanisms to ensure more reliable simulation. The first two can be seen as static features from the generality and scalability perspectives, respectively, while the last one extends previous work from the dynamic perspective, enabling our platform to continually correct and improve itself.
2.1 General Simulation Framework
Our framework consists of three modules focusing on the single agent, multiple agents, and the environment (see Figure 1). In the single-agent module, users can flexibly configure the agent’s profile, memory, and action components. The profile includes both public information, such as gender, name, and birthplace, and private attributes such as income and health condition. To enable the agent to retain its past behaviors in various ways, users can assemble different memory components (short-term memory, long-term memory, and the reflection mechanism) to build the agent’s memory. The actions of the agents are driven by LLM prompts, which users can flexibly configure to include agent profiles, memories, and so on.
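To make this concrete, the following is a minimal configuration sketch in Python. The class names (AgentProfile, MemoryConfig, Agent) and the call_llm stub are hypothetical placeholders for illustration, not GenSim's actual API.

```python
# Hypothetical sketch of configuring a single agent; names are illustrative.
from dataclasses import dataclass, field

def call_llm(prompt: str) -> str:
    return "..."  # stub standing in for the platform's LLM client

@dataclass
class AgentProfile:
    # Public information visible to other agents.
    name: str
    gender: str
    birthplace: str
    # Private attributes used only inside the agent's own prompts.
    income: float = 0.0
    health_condition: str = "good"

@dataclass
class MemoryConfig:
    # Assemble the agent's memory from optional components.
    short_term: bool = True
    long_term: bool = True
    reflection: bool = False

@dataclass
class Agent:
    profile: AgentProfile
    memory: MemoryConfig = field(default_factory=MemoryConfig)

    def act(self, observation: str) -> str:
        # Actions are driven by an LLM prompt assembled from the profile,
        # retrieved memories, and the current observation.
        prompt = (
            f"You are {self.profile.name} from {self.profile.birthplace}.\n"
            f"Observation: {observation}\n"
            "What do you do next?"
        )
        return call_llm(prompt)
```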
In the multi-agent module, we design two strategies for generating agent interactions: script mode and agent mode. In script mode, all interactions are treated as a whole and generated in a single call to the LLM. For example, one can directly prompt the LLM to generate a dialogue between a doctor and a teacher in one step. In this strategy, the LLM acts as a meta-agent, producing the dialogue from a third-person perspective. In agent mode, interactions are generated by different agents, each representing a distinct role and producing outputs from a first-person perspective. This interaction generation process requires multiple calls to the LLM. In the example above, two agents are deployed, with each agent’s output conditioned on the entire history of interactions between them.
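The following sketch contrasts the two interaction strategies. It reuses the hypothetical call_llm stub from the previous sketch, and all prompts are illustrative rather than the platform's actual templates.

```python
# Script mode vs. agent mode; call_llm is the stub defined in the sketch above.

def script_mode(role_a: str, role_b: str, topic: str, turns: int = 4) -> str:
    # Script mode: one LLM call, the LLM acts as a meta-agent and writes the
    # whole dialogue from a third-person perspective.
    prompt = (
        f"Write a {turns}-turn dialogue between a {role_a} and a {role_b} "
        f"about {topic}."
    )
    return call_llm(prompt)

def agent_mode(role_a: str, role_b: str, topic: str, turns: int = 4) -> list[str]:
    # Agent mode: one LLM call per utterance, each agent speaks in the first
    # person conditioned on the full interaction history.
    history: list[str] = []
    roles = [role_a, role_b]
    for t in range(turns):
        speaker = roles[t % 2]
        prompt = (
            f"You are a {speaker} discussing {topic}.\n"
            "Conversation so far:\n" + "\n".join(history) +
            "\nReply with your next utterance."
        )
        history.append(f"{speaker}: {call_llm(prompt)}")
    return history
```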
In the environment module, we store all information beyond the agents that is necessary for running the simulation, such as the recommendation algorithm used in a web user simulator [6]. Additionally, we allow users to globally intervene in the platform, which is useful for counterfactual inference. We also provide essential functions to facilitate interviewing, searching, and storing different agents.
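As a rough illustration, an environment could hold the non-agent state and expose intervention and interview hooks along the following lines; the class and method names are hypothetical and build on the Agent sketch above.

```python
# Illustrative environment holding non-agent state; not GenSim's actual API.

class Environment:
    def __init__(self, recommender, agents):
        self.recommender = recommender      # e.g., the recommendation algorithm
        self.agents = {a.profile.name: a for a in agents}
        self.global_context = {}            # shared state visible to all agents

    def intervene(self, key: str, value):
        # Global intervention, e.g., injecting a policy change for
        # counterfactual inference.
        self.global_context[key] = value

    def interview(self, agent_name: str, question: str) -> str:
        # Query a single agent without advancing the simulation.
        return self.agents[agent_name].act(f"Interview question: {question}")
```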
Based on the above general framework, users can easily create customized simulations. To provide additional references, we offer three default scenarios: job market, recommender system, and group discussion. These scenarios not only facilitate related research but also provide code bases, enabling users to construct new scenarios with minimal effort.
2.2 Large-scale Simulation
While there are many previous studies on leveraging LLMs to simulate human social behaviors [2], the number of agents in their simulators is usually very small. In such cases, users need to sample a small set of individuals from the real-world large-scale population and then leverage agents to simulate the sampled individuals, assuming that these samples can accurately approximate the real population. However, a small number of sampled individuals may lead to very large fluctuations in the simulation results. To verify this claim, we conduct a preliminary experiment by simulating user-item rating behaviors on a movie website. Specifically, we base our experiment on the well-known MovieLens-32M dataset (https://grouplens.org/datasets/movielens/), which consists of 200,948 users’ 32M ratings on 87,585 movies. For each user-movie pair, we use LLMs to simulate the user’s rating of the movie within the dataset’s rating scale. To study the fluctuation of the simulation results at different agent scales, we first sample 3.2K, 32K, 320K, and 3.2M user-item pairs from the complete dataset, and then, for each case, we repeat the simulation of predicting user-item ratings 10 times. Formally, suppose $\boldsymbol{d}_i$ represents the rating distribution of the $i$-th experiment, where $i \in \{1, 2, \dots, 10\}$. For each rating $r$, we compute the standard deviation across all experiments as
$$ s_r = \mathrm{std}\left(d_1^r, d_2^r, \dots, d_{10}^r\right), $$
where $\mathrm{std}(\cdot)$ denotes the standard deviation operation. We use the sum of the standard deviations over all possible ratings, $S = \sum_{r} s_r$, to measure the fluctuation of the simulation results. The experiment results are presented in Figure 2(a), from which we can see that as the number of samples becomes larger, the fluctuation of the simulation results is greatly reduced. This result suggests that if we only have a small number of agents, the simulation results may not be reliable, since they can hardly be reproduced due to the large simulation fluctuation.
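For clarity, the fluctuation metric above can be computed as in the following sketch, which uses synthetic rating distributions in place of the real simulation outputs.

```python
# Sketch of the fluctuation metric: collect one rating distribution per
# repeated run, then sum the per-rating standard deviations.
import numpy as np

def fluctuation(distributions: np.ndarray) -> float:
    """distributions: shape (n_runs, n_rating_levels); each row is the
    rating distribution d_i of one repeated experiment."""
    per_rating_std = distributions.std(axis=0)   # s_r for each rating r
    return float(per_rating_std.sum())           # S = sum over all ratings

# Example with synthetic data: 10 repeated runs, 5 rating levels.
rng = np.random.default_rng(0)
dists = rng.dirichlet(np.ones(5), size=10)
print(fluctuation(dists))
```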
To solve the above problem, our platform supports up to one hundred thousand agents to better simulate real-world scenarios. To accelerate the simulation, we adopt a series of techniques, including distributed parallel computing. Using our platform, we evaluate the simulation speed in the job market and recommendation scenarios, running the simulator for one round in both settings (for all experiments in this paper, we used a server with a 192-core CPU, eight A100-40G GPUs, and 440 GB of memory). The results are presented in Figure 2(b), from which we can see that as the number of agents grows, the time cost increases; with 100,000 agents, one round takes 15,492 and 3,024 seconds in the job market and recommendation scenarios, respectively.
In addition, we evaluate the acceleration effect of distributed parallel computing in our platform. Specifically, we measure the time cost of running our platform for one round with different numbers of GPUs. The results are presented in Figure 3(a), where we can see that as the number of GPUs increases, the time cost decreases, which suggests that, with the help of distributed parallel computing, our platform can effectively take advantage of more GPUs.
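As a simplified illustration of per-round parallelization (not GenSim's actual distributed implementation), independent agent calls within a round can be dispatched concurrently, for example with a thread pool feeding multiple LLM serving instances; the sketch below builds on the Agent example above.

```python
# Hedged sketch of parallelizing one simulation round across workers.
from concurrent.futures import ThreadPoolExecutor

def run_round(agents, observations, max_workers=32):
    # Each agent's LLM call is independent within a round, so the calls can
    # be dispatched concurrently to multiple serving instances / GPUs.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda pair: pair[0].act(pair[1]),
                             zip(agents, observations)))
```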



2.3 Simulation Error Correction
Most previous LLM-agent-based simulation platforms lack error-correction mechanisms, which means that if unexpected results occur during the simulation process, they can accumulate and be amplified as the simulation progresses (see Figure 3(b)). To solve this problem, our platform provides two strategies for correcting simulation errors. The first is based on LLMs, where we leverage GPT-4o to score or revise the simulated results. The second is based on real humans, where we provide interfaces for users to score or revise the simulated agent behaviors. Between these two approaches, the first is more efficient and requires no human intervention, though it may be less accurate due to the inherent biases of LLMs. The second approach is more aligned with real humans but can be labor-intensive and less efficient. Suppose a simulation result is represented by a pair $(p, a)$, where $p$ is the prompt driving an agent action and $a$ is the action. For each of the above strategies, there are two forms of feedback provided by LLMs or real humans: let $s$ be the score for $(p, a)$, and let $a'$ be the revised action (note that $s$ and $a'$ may not co-exist for the same $(p, a)$ pair). Then, we use $(p, a, s)$ and $(p, a')$ to fine-tune the backbone LLMs using PPO and SFT, respectively.
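As an illustration, the two feedback forms could be packaged into fine-tuning records along the following lines; the record layouts and example contents are assumptions for exposition, not GenSim's actual data format.

```python
# Sketch of packaging scored and revised feedback for PPO and SFT fine-tuning.

def make_ppo_record(prompt: str, action: str, score: float) -> dict:
    # Scored feedback (p, a, s) becomes a reward signal for PPO-style tuning.
    return {"prompt": prompt, "response": action, "reward": score}

def make_sft_record(prompt: str, revised_action: str) -> dict:
    # Revised feedback (p, a') becomes a (prompt, target) pair for SFT.
    return {"prompt": prompt, "target": revised_action}

# Hypothetical example: an agent's job-application action that a reviewer
# down-scored and then rewrote.
ppo_example = make_ppo_record("You are a nurse looking for a job...",
                              "I apply to every opening at once.", 0.2)
sft_example = make_sft_record("You are a nurse looking for a job...",
                              "I apply to the two hospitals matching my skills.")
```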
To evaluate the effectiveness of our error-correction mechanisms, we conduct experiments based on the job market scenario with LLMs as the feedback provider. To begin with, we evaluate whether PPO and SFT can improve the simulation results within a single round. In the experiments, we select different numbers of samples for labeling and use GPT-4o to measure the reasonableness of the simulation results. From Figure 4(a), we can see that both PPO and SFT improve the simulation performance across different numbers of labeled samples. Compared with PPO, SFT achieves better results, which is reasonable, since the revised actions used in SFT may contain more effective and comprehensive information.
Next, we evaluate the effectiveness of the error-correction mechanisms in a multi-round setting. Specifically, we fine-tune the backbone LLMs in an earlier round and use the updated models to simulate the results in the subsequent round. The results are presented in Figure 4(b). We can see that if we do not conduct error correction (the yellow line), the simulation performance is unsatisfactory. When using PPO (the blue line) or SFT (the red line), the simulation performance improves significantly, and these improvements continue to grow as the number of simulation rounds increases.

3 Usages of GenSim
In this section, we introduce the basic methods for running a default scenario. For more details, readers can refer to our project website https://github.com/TangJiakai/GenSim.
The complete interface of our platform is presented in Figure 5. Specifically, the user first clicks the ‘Get Started’ button to initiate the platform. Then, the user can configure the simulation by specifying the scenario, agent profiles, memory type, number of agents, LLM parameters, and more. Once configured, the platform can be launched. The simulation interface consists of three sections: at the top, a search box allows the user to look up an agent to observe its status and track its behavior; in the middle, a display window shows the behaviors of multiple agents; on the right, a functionality window lets users view agent profiles, intervene in the system, and interact with agents. After running the platform for several rounds, the user can stop the simulation and label the results. Finally, the backbone LLMs can be fine-tuned based on the labeled results and used in the subsequent simulation process.
4 Conclusions
In this paper, we introduce a general, large-scale, and correctable social simulation platform based on LLM agents. This is the initial version of our platform, and we believe there is still much room for improvement. In the future, we plan to incorporate more advanced simulation acceleration strategies and develop more adaptive self-correction mechanisms.
References
- [1] Linda Bolier, Merel Haverman, Gerben J Westerhof, Heleen Riper, Filip Smit, and Ernst Bohlmeijer. Positive psychology interventions: a meta-analysis of randomized controlled studies. BMC Public Health, 13:1–20, 2013.
- [2] Chen Gao, Xiaochong Lan, Nian Li, Yuan Yuan, Jingtao Ding, Zhilun Zhou, Fengli Xu, and Yong Li. Large language models empowered agent-based modeling and simulation: A survey and perspectives. Humanities and Social Sciences Communications, 11(1):1–24, 2024.
- [3] Nian Li, Chen Gao, Mingyu Li, Yong Li, and Qingmin Liao. Econagent: large language model-empowered agents for simulating macroeconomic activities. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15523–15536, 2024.
- [4] Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pages 1–22, 2023.
- [5] Jacopo Staiano, Nuria Oliver, Bruno Lepri, Rodrigo De Oliveira, Michele Caraviello, and Nicu Sebe. Money walks: a human-centric study on the economics of personal mobile data. In Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pages 583–594, 2014.
- [6] Lei Wang, Jingsen Zhang, Hao Yang, Zhiyuan Chen, Jiakai Tang, Zeyu Zhang, Xu Chen, Yankai Lin, Ruihua Song, Wayne Xin Zhao, et al. User behavior simulation with large language model based agents. arXiv preprint arXiv:2306.02552, 2023.
- [7] Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. A survey of large language models. arXiv preprint arXiv:2303.18223, 2023.