Value-Enriched Population Synthesis: Integrating a Motivational Layer

Alba Aguilera (ORCID 0009-0003-5336-8570; corresponding author, email: [email protected]), Miquel Albertí (ORCID 0009-0005-1666-8421), Nardine Osman (ORCID 0000-0002-2766-3475), Georgina Curto (ORCID 0000-0002-1320-3873)

Artificial Intelligence Research Institute (IIIA-CSIC), Barcelona; Universitat de Barcelona, Barcelona; University of Notre Dame, Notre Dame, USA

Abstract

In recent years, computational improvements have allowed for more nuanced, data-driven and geographically explicit agent-based simulations. So far, simulations have struggled to adequately represent the attributes that motivate the actions of the agents. In fact, existing population synthesis frameworks generate agent profiles limited to socio-demographic attributes. In this paper, we introduce a novel value-enriched population synthesis framework that integrates a motivational layer with the traditional individual and household socio-demographic layers. Our research highlights the significance of extending the profile of agents in synthetic populations by incorporating data on values, ideologies, opinions and vital priorities, which motivate the agents’ behaviour. This motivational layer can help us develop a more nuanced decision-making mechanism for the agents in social simulation settings. Our methodology integrates microdata and macrodata within different Bayesian network structures. This contribution makes it possible to generate synthetic populations with integrated value systems that preserve the inherent socio-demographic distributions of the real population in any specific region.

1 Introduction

Agent-based simulations are now widely used in interdisciplinary research, especially to support policy-making in social contexts. When applied to real-life domains, these simulations are evolving towards more nuanced, data-driven models that accurately reflect the complexities of socio-environmental systems. During the outbreak of the COVID-19 pandemic, the importance of accurate simulations in policy-making scenarios became starkly evident. An important body of studies emerged focusing on simulating disease outbreaks and public health interventions, such as lockdowns or mask mandates. These types of models (along with many others focusing on issues like gentrification, spatial inequality or urban ageing) require a reliable representation of agents, the interactions between them and their environment [46].

In such models, the interactions and behaviour of agents are strongly determined by their profile, which usually comprises specific demographic attributes. These attributes must closely resemble the characteristics of the human population they simulate. Ideally, this information would be sourced from census data; however, due to data privacy constraints, only certain open-source data can be employed. To achieve a simplified yet representative depiction of the population in a specific region, population synthesis methods are used. In this context, the most important challenge is to close the gap between the generated population and the actual one [13]. Traditional synthetic populations generate profiles limited to demographic and socioeconomic information [20]. We argue that there is an additional type of data, referred to as motivational attributes in this article, that can enrich population synthesis frameworks and help reflect the complexities of social interactions and decisions. This data includes cognitive and cultural attributes such as values, ideologies, opinions and vital priorities that have a direct impact on the agents’ behaviours [38]. We argue that the behaviour of agents, which can either be learnt via machine learning techniques or modelled through mathematical decision-making architectures, can be enhanced by relying on these motivational attributes. Our approach opens the door to the incorporation of numerous frameworks that define and quantify both individual and collective values, such as the cross-cultural Schwartz Theory of Basic Values [38], the evolution of values in line with economic security (post-materialism) described by Inglehart [27], the capability approach to human development [39], multiculturalism and the struggle for recognition [41, 23], as well as the impact of prejudices [12, 45].
In this article, we have focused on incorporating the data obtained by surveys following the frameworks defined by Schwartz and Inglehart [21, 19, 8, 7]. By incorporating this data in a population synthesis framework, we contribute to a line of research that increases the realism of social simulations and encourage researchers in the social domains to use empirical data in their population generation processes [42, 14]. Motivational attributes are neither randomly nor uniformly distributed across different regions or social groups, but are deeply correlated with the socioeconomic development of a region [27, 15]. We propose to address this complexity by extending the current demographic-based profile of agents in synthetic populations with the addition of a motivational layer that automatically adapts to the specific region of the case study. The approach therefore facilitates the replication of the model in different regions of the world, working towards policy-oriented research that includes both the Global North and the Global South [16]. This motivational layer contains information about social aspects of the agents’ profiles, reflected in the surveys in scope, that potentially motivate behaviour. Decision-making techniques can then rely on these parameters to fine-tune data-driven outcomes, contributing to the "approach to reality" decision-making research direction [30]. However, sources of motivational data are often limited in size and do not cover the whole population’s socio-demographic attributes. They require meticulous integration with other data sources to ensure that the synthetic population, which includes both socio-demographic and motivational characteristics, is sufficiently representative within the desired geographical scope.

To bridge this gap, we propose a novel value-enriched population synthesis framework that integrates diverse data sources to generate a population that goes beyond the traditional individual and household layers. Through value learning from existing data, we explore the dependencies between attributes that accurately represent the underlying motivational ones in the population. In particular, by following our methodology, we can generate a synthetic population with a (1) socio-demographic layer and a (2) motivational layer at both the individual and collective levels (see Fig. 1). Our approach aims to be replicable and scalable to a series of case studies. Therefore, we use Bayesian networks that can either be expanded or reduced depending on the data requirements of the study. To overcome data scarcity, which is one of the main challenges of population synthesis today, our approach is able to process both macrodata and microdata. We provide a proof of concept for the framework with a use case in the city of Barcelona. We present how we can potentially generate a synthetic population that uniquely characterizes individuals (i.e. their demographic profile along with their value system, ideologies, opinions, worries, priorities, etc.) in a representative manner.

The paper is organized as follows: Section 2 underlines the novelty of our work by exploring the state of the art in value-enriched population synthesis. In Section 3, we describe the proposed data model and the integration of values into the framework, while in Section 4, we explain the formulation of our population synthesis proposal. Section 5 presents a comprehensive application of the framework through a use case for the city of Barcelona. Finally, we conclude with insights into the main implications, limitations, and avenues for future research in Section 6.

2 Related Work

Synthetic populations are essential to develop useful applications in an ethical way that does not compromise individual privacy. Numerous works have generated open-source synthetic populations, for example for the UK [32], the US [43], Canada [36], Ile-de-France (France) [24], Tallinn (Estonia) [10] or some Australian cities [31] (e.g. Sydney, Melbourne and Brisbane), to name a few. These studies rely on two essential components: (i) data sources, such as publicly available microsamples, surveys or government databases, and (ii) population synthesis techniques, which are being constantly refined to match the complex interrelationships between the agent attributes. However, none of these studies has yet incorporated a motivational layer for agents into its framework. Adding such a layer could be highly beneficial for models aiming to more accurately represent the complexities of the physical and social environments as a single fabric, a concept recently referred to as Social Urban Digital Twins (SUDTs) [46]. Let us review the main established data sources and synthesis techniques.

Handling data is perhaps the hardest challenge for population synthesis nowadays, especially due to the variability in format, size, level of detail and disaggregation of data sources across different regions. Many initiatives work towards data availability and harmonization by offering detailed socio-demographic information, such as IPUMS (International Public Use Micro Samples) [2]. In our case, we are particularly interested in existing surveys on values. These are international research programs devoted to the scientific study of social, political, economic, religious and cultural values of people in the world. Similar surveys exist at the national level, offering practically the same sort of information along different territorial units. For instance, the World Values Survey (WVS) [21], the European Values Study (EVS) [19], the European Social Survey (ESS) [1], the Catalonian value survey [8] and the Barcelona value survey [7] contain almost completely harmonized data along continent, country, region, municipality and district. These data sources provide a link between socio-demographic and motivational attributes at different geographic scopes.

The literature categorizes the techniques used to generate synthetic populations into three main groups: synthetic reconstruction, combinatorial optimization, and statistical learning. The choice among the different techniques largely depends on the characteristics of the available data [20]. The first category uses deterministic algorithms that fit and allocate fractions of individuals and/or households to a region, while the second one attempts to reach an optimized solution by randomly drawing from the microsample while minimizing differences in marginals. More recently, researchers have adopted a probabilistic framework instead of a deterministic one, which corresponds to the third category and is the primary focus of this study. This approach searches for the joint distribution of all attributes using partial views available in the data. Within this category, we settle for Bayesian networks because of the main advantages this approach offers: it facilitates replicability (clear and graphical interface) and scalability (easy parallelization for large-size samples). Although other methods may sample a synthetic population with greater technical accuracy, we prioritize successfully merging different data sources over the technical accuracy of the method. The foundational application of Bayesian network synthesis can be traced back to [40], followed by other contributions [47, 37, 25], which constitute our technical starting point.

This study aims to contribute to population synthesis for agent-based models under a social lens. The main novelty of our work is the aggregation of a motivational layer, sourced from social surveys, into the synthetic population. The use case developed for the city of Barcelona is intended to advance the enhancement of the Aporophobia Agent-Based Model (AABM) project [33, 11]. The successful addition of motivational attributes into the population signifies the possibility of adjusting the needs-based model [17] decision-making architecture with real-world data.

Figure 1: Workflow scheme. From diverse data sources we extract personal profile attributes and classify them into socio-demographic and motivational layers, represented in blue and pink, at both the individual and collective levels. This information is used to generate comprehensive agent profiles in synthetic populations that help decision-making architectures tune behaviour.

3 Proposed Data Model

The motivation behind our model is to use diverse data sources to generate comprehensive agent profiles that help decision-making architectures simulate agents’ behaviour. In particular, we aim to extend the individual and collective socio-demographic attributes of agents with motivational ones, providing a deeper understanding of the individual’s value system that contributes to the agents’ actions [38]. The workflow is illustrated in Fig. 1. From the available data sources, denoted as $ds_{1},...,ds_{n}$ , we extract personal attributes and categorize them into two main layers: (i) socio-demographic and (ii) motivational, at both the individual and collective levels. Due to the high dimensionality, particularly of motivational attributes, we further divide the attributes within each layer into types. This allows having an organized overview of the information integrated into the synthetic population.

Socio-demographic attributes, represented in blue, cover the available social, demographic and economic characteristics that define an individual inside a population, a household and a social network. This layer includes "main" attributes at the individual level such as age, gender and nationality, as well as "household" and "network" attributes at the collective level, such as the number of people one lives with, the number of children, detailed information about the household assets and the number of friends. On the other hand, the motivational layer, represented in pink, encompasses attributes that directly impact or motivate behaviour. For this article, they are grouped into four types: values, ideologies, opinions and vital priorities. This layer includes individual and collective information about prevalent values measured following Inglehart’s or Schwartz’s theory. Additionally, it includes other motivational attributes such as the alignment with various ideologies, the perceptions of political or economic situations or institutions, the points of view on controversial topics and the vital priorities given to certain aspects of one’s life (see Table 2 for further details). These individual motivational attributes can become collective (or consensus) attributes when aggregated or averaged across a region [29].

3.1 Values within the Motivational Layer

Information extracted from social surveys on values is used to construct the motivational layer of the synthetic population. The surveys are structured along thematic sub-sections covering diverse topics such as vital life priorities, societal well-being, social values and attitudes, religious values, political behaviour and ideology, cultural and national identity, and opinions towards minorities, climate change, migration, etc. [18].

The sections directly related to values in the surveys are linked to the two major human values theories: Inglehart’s and Schwartz’s. However, surveys can also be linked with other theoretical frameworks (such as the ones mentioned in the introduction). Some motivational data sources, such as the ESS [1], use Schwartz’s portrait values questionnaire (PVQ-21) [3] to measure the ten fundamental values. In contrast, sources such as WVS [21] or EVS [19] use other methods like Inglehart’s materialism/post-materialism (MPM) index [26, 28]. Other regional sources (such as Catalonia’s and Barcelona’s value surveys) use a combination of both approaches. In all the approaches, respondents’ choices are used to quantify their value preferences.

The resulting value preference can be depicted with cultural maps across two predominant dimensions that encompass different values. The Schwartz map [22] includes the dimensions "conservation versus openness to change" and "self-enhancement versus self-transcendence", while the Inglehart-Welzel map [44] features the dimensions "traditional versus secular-rational values" and "survival versus self-expression values." Both maps are represented schematically in Fig. 2, where each specific point indicates a personal value preference.

4 Formulation

In the context of population synthesis under a probabilistic lens, the aim is to infer the underlying joint probability distribution of the data, denoted as $P(X_{1},\dots,X_{n})$ . The random variables $X_{1},\dots,X_{n}$ are the agent’s profile attributes, where each variable $X_{i}$ is capable of assuming various states $x_{k}$ . Bayesian networks can embed this joint distribution through two fundamental components: structure and parameters. The structure $S$ captures the dependencies among these variables by connecting them. The parameters $\delta$ determine the conditional probability agents have of being assigned a certain attribute given that they have been assigned another one.
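As a reminder of how the two components fit together, the standard Bayesian network factorization of the joint distribution is written out below. The parent-set notation $\mathrm{Pa}_{S}(X_{i})$ is introduced here only for illustration, and the three-attribute example (gender $G$, age $A$, educational level $E$) is a hypothetical structure, not one of the networks used in this work.

```latex
% Joint distribution as a product of the conditional probability tables in \delta,
% one per node, each conditioned on the node's parents in the structure S.
P(X_{1},\dots,X_{n}) = \prod_{i=1}^{n} P\left(X_{i} \mid \mathrm{Pa}_{S}(X_{i})\right)

% Hypothetical example with edges G -> E and A -> E:
P(G, A, E) = P(G)\, P(A)\, P(E \mid G, A)
```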

Given a set of observational data $\mathcal{D}$ , one can either learn the parameters $\delta$ when the structure $S$ is known or learn both the structure $S$ and parameters $\delta$ . These processes are known as parameter learning and structural learning, respectively. Our approach employs both methods by (1) crafting an intuitive network based on prior knowledge (knowledge-based model) and (2) using heuristic search techniques to learn an optimal structure (learnt model). These models are then used to sample from the joint probability distribution, obtaining a synthetic dataset where each column is a socio-demographic or motivational attribute and each row uniquely characterizes one individual. The general workflow of a population synthesis process comprises three main steps: (i) data preparation, (ii) model selection and (iii) model validation.
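The sampling step can be illustrated with a minimal sketch of ancestral (forward) sampling, written with the Python standard library only. The toy structure, the attribute states and every probability below are made up for illustration; they are assumptions and do not come from the data sources used in this work.

```python
import random

# Hypothetical structure S: each node lists its parents, in topological order.
structure = {"G": [], "A": [], "E": ["G", "A"]}

# Hypothetical parameters delta: one conditional probability table per node,
# keyed by the tuple of parent states. All numbers are invented.
parameters = {
    "G": {(): {"female": 0.52, "male": 0.48}},
    "A": {(): {"15-44": 0.55, "45-74": 0.45}},
    "E": {
        ("female", "15-44"): {"basic": 0.30, "higher": 0.70},
        ("female", "45-74"): {"basic": 0.55, "higher": 0.45},
        ("male", "15-44"): {"basic": 0.35, "higher": 0.65},
        ("male", "45-74"): {"basic": 0.60, "higher": 0.40},
    },
}


def sample_agent(structure, parameters, rng):
    """Draw one agent profile, sampling parents before their children."""
    agent = {}
    for node, parents in structure.items():
        parent_states = tuple(agent[p] for p in parents)
        distribution = parameters[node][parent_states]
        states = list(distribution)
        weights = [distribution[s] for s in states]
        agent[node] = rng.choices(states, weights=weights, k=1)[0]
    return agent


def sample_population(structure, parameters, size, seed=0):
    """Return a list of agent profiles; each dict is one row of the synthetic dataset."""
    rng = random.Random(seed)
    return [sample_agent(structure, parameters, rng) for _ in range(size)]


population = sample_population(structure, parameters, size=1000)
```

Each dictionary in `population` plays the role of one row of the synthetic dataset, with one key per attribute.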

4.1 Data Preparation

Our model is adapted to handle diverse data. We typically encounter two primary data forms: (1) macrodata, which provides information on separate attribute sets but lacks detail on their interdependencies, and (2) microdata, from which we can identify comprehensive interdependencies among attributes. Macrodata can provide information about the conditional probabilities of one, two, or three attributes at most. Relying solely on these conditionals or marginals to produce a representative sample of agents is not feasible, so macrodata is often used to provide marginal constraints that control the population synthesis process. On the other hand, microdata offers richer insights into the interdependencies between the available attributes (it contains their joint distribution) and can be efficiently used to generate a sample. Nevertheless, this microdata can be outdated, lack the needed granularity or contain numerous missing values.

Before data can be integrated into the model, it is essential to select, clean and harmonize the information in it. The selection of attributes largely depends on the quality of the data. If data on an attribute is substantially incomplete, which is a common situation regarding motivational data, the attribute may be discarded. However, for partially complete data, missing values can be handled either by direct elimination or through imputation techniques [34]. Data harmonization is then required to make diverse datasets compatible and consistent by standardizing and normalizing their values and integrating them into a single, cohesive dataset.
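A minimal sketch of the cleaning and harmonization step is given below, assuming direct elimination of records with missing values and a mapping of raw codes to shared states. The raw records, the code-to-state maps and the helper names are all hypothetical and serve only to illustrate the idea.

```python
# Hypothetical raw survey records; None marks a missing answer.
raw_records = [
    {"gender": "F", "age": 34, "nationality": "ES"},
    {"gender": "M", "age": None, "nationality": "FR"},  # dropped: missing age
    {"gender": "M", "age": 61, "nationality": "MA"},
]

# Hypothetical maps from source-specific codes to harmonized states.
GENDER_STATES = {"F": "female", "M": "male"}
NATIONALITY_STATES = {"ES": "Spain", "FR": "rest of EU", "MA": "rest of the world"}


def age_group(age, width=10):
    """Map an age in years to a group label such as '30-39'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def harmonize(records):
    """Drop incomplete records and translate the rest into shared states."""
    clean = []
    for record in records:
        if any(value is None for value in record.values()):
            continue  # direct elimination of records with missing values
        clean.append(
            {
                "G": GENDER_STATES[record["gender"]],
                "A": age_group(record["age"]),
                "N": NATIONALITY_STATES[record["nationality"]],
            }
        )
    return clean


harmonized = harmonize(raw_records)
```

Imputation, the alternative mentioned above, would replace the `continue` branch with a rule that fills in the missing value instead of discarding the record.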

4.2 Model Selection

Following data preparation, the next crucial step is constructing the model that integrates all the selected attributes. These attributes can be organized according to our proposed data model, illustrated in Fig. 1, where attributes are classified within socio-demographic and motivational layers at both the individual and collective levels. Typically, we can associate each layer with one or several data sources that contain the corresponding household, network or motivational attributes. We denote the set of attributes within each data source $ds\in\mathcal{D}$ as $X^{ds}_{i}$ , where $i=1,...,m_{ds}$ and $m_{ds}$ is the total number of attributes selected from a specific data source.

We aim to connect the profile attributes in $\mathcal{D}$ with a structure $S$ and associated parameters $\delta$ , representing conditional probabilities between them. In the context of Bayesian networks, various strategies exist for structuring and parameterization, which can differ when dealing with macrodata or microdata. Structural learning and parameter learning are straightforward processes for microdata using the common Bayesian network libraries. However, the joint distribution is not attainable for macrodata; only the conditional probabilities of the attributes present separately in each data chunk can be obtained by applying Bayesian estimators of one’s choice [9].
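To make the macrodata case concrete, the sketch below estimates a single conditional probability table from aggregated counts using a uniform pseudo-count (Dirichlet) prior. This is one simple Bayesian estimator chosen for illustration; the text does not specify which estimator is applied, and the districts and counts are invented.

```python
def estimate_cpt(counts, child_states, pseudo_count=1.0):
    """Estimate P(child | parent) from aggregated counts.

    counts: {parent_state: {child_state: count}}
    Returns {parent_state: {child_state: probability}}, where each row
    is smoothed by adding `pseudo_count` to every child state.
    """
    cpt = {}
    for parent_state, row in counts.items():
        total = sum(row.get(s, 0) for s in child_states)
        total += pseudo_count * len(child_states)
        cpt[parent_state] = {
            s: (row.get(s, 0) + pseudo_count) / total for s in child_states
        }
    return cpt


# Hypothetical macrodata: number of residents by district and gender.
district_gender_counts = {
    "district_1": {"female": 5300, "male": 4700},
    "district_2": {"female": 2100, "male": 1900},
}

gender_given_district = estimate_cpt(
    district_gender_counts, child_states=["female", "male"]
)
```

With large counts the pseudo-count has almost no effect; it mainly matters for sparsely populated cells, where it avoids zero probabilities.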

To address this limitation, we create two models: one that works solely with microdata (learnt model) and another that can integrate macrodata (knowledge-based model). The main difference between them is how the structure is obtained. While the learnt model incorporates dependencies between attributes directly from data, the knowledge-based model has a predefined structure based on prior knowledge. This combination of models allows for the enrichment of less comprehensive datasets with the robustness of more complete or up-to-date ones. The precise application of the models is detailed in the application through a use case (Section 5).

In either case (learnt or crafted), for the attributes in each data source $ds\in\mathcal{D}$ , we define a structure $S_{ds}$ and parameters $\delta_{ds}$ that connect the attributes within that data source. In other words, $S_{ds}$ is a graph with vertices $\{X^{ds}_{i}\}_{i=1,...,m_{ds}}$ and directed edges $\{(X^{ds}_{i},X^{ds}_{j})\}_{i,j\in\{1,...,m_{ds}\}}$ , while $\delta_{ds}$ is a set of conditional probability tables $\left\{P(X^{ds}_{i}\mid X^{ds}_{j})\right\}$ that characterize the dependencies between the attributes described by the $S_{ds}$ structure.

The functioning of our model relies on the combination of various datasets that share a common intersection. We will refer to this set of intersecting variables as the core attributes. As all data sources (including surveys, governmental sources and public use microsamples) typically share the same basic information, the core attributes always correspond to either just individual or individual and collective socio-demographic attributes (e.g. gender, age, nationality, etc.) as shown in Fig. 3. As indicated by the edges, the rest of the attributes in the socio-demographic and motivational layers are obtained from the core attributes.

Let $S_{core}$ refer to the structure connecting the core attributes, learnt or crafted from the richest dataset available $ds_{rich}$ . The richest dataset is selected based on various criteria, such as being the largest, the most disaggregated, or the most up-to-date, depending on the specific preferences and requirements of one’s study for accurately representing the socio-demographic characteristics of the population. We can use the core attributes as a binding factor for the remaining structure, which includes all $X^{ds}_{i}\notin core$ . We do so by fixing $S_{core}$ when learning the other structures, avoiding the overwriting of previously learnt $\delta_{core}$ parameters. This approach ensures that the motivational and remaining socio-demographic attributes are linked with the core attributes while respecting the probability distributions in each data source. The general procedure to learn the structure and parameters connecting all the attributes is the following:

Algorithm 1 Procedure to merge attributes contained in different data sources.

1: Find the core attributes present in all data sources $ds$
2: Detect the richest dataset available $ds_{rich}$
3: Craft $S_{core}$ or learn it from $ds_{rich}$
4: Learn or compute $\delta_{core}$ from $ds_{rich}$
5: Fix $S_{core}$ while learning or crafting $S_{ds}$ for all $ds$ with $X^{ds}_{i}\notin core$
6: In the case of learning, eliminate any introduced edge that conflicts with $\delta_{core}$ , i.e. edges between core attributes or edges that go from other data sources to core attributes.
7: Learn or compute $\delta_{ds}$ from each $ds\neq ds_{rich}$ for the newly incorporated variables $X^{ds}_{i}\notin core$

The final structure is a composition of the structures defined for each data source. The set of nodes is connected through a set of edges that link attributes within the same data source, except for those going from the core attributes to attributes in other data sources.
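The structural part of the procedure (finding the core attributes, filtering conflicting edges and composing the final structure) can be sketched in plain Python as follows. The attribute lists, the edges and the function names are hypothetical; in practice the per-source edges would come from a structure learning routine or from prior knowledge.

```python
def core_attributes(sources):
    """Step 1: attributes that appear in every data source."""
    attribute_sets = [set(attributes) for attributes in sources.values()]
    return set.intersection(*attribute_sets)


def drop_conflicting_edges(edges, core):
    """Step 6: discard edges pointing into a core attribute.

    This removes both edges between core attributes and edges going
    from other attributes to core attributes, so S_core stays fixed.
    """
    return [(source, target) for (source, target) in edges if target not in core]


def merge_structures(core_edges, edges_by_source, core):
    """Compose the final edge list from S_core and the filtered S_ds."""
    merged = list(core_edges)
    for edges in edges_by_source.values():
        for edge in drop_conflicting_edges(edges, core):
            if edge not in merged:
                merged.append(edge)
    return merged


# Hypothetical attribute lists for three data sources.
sources = {
    "ds2": ["G", "A", "N", "E"],
    "ds3": ["G", "A", "N", "Fr"],
    "ds4": ["G", "A", "N", "MPM"],
}
core = core_attributes(sources)

# Hypothetical S_core and learnt edges for the remaining sources.
core_edges = [("A", "N")]
edges_by_source = {
    "ds3": [("A", "Fr"), ("Fr", "G"), ("G", "A")],
    "ds4": [("A", "MPM"), ("N", "MPM")],
}
final_edges = merge_structures(core_edges, edges_by_source, core)
```

In this toy case the edges `("Fr", "G")` and `("G", "A")` are discarded because they would alter the dependencies among core attributes that were already fixed.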

4.3 Model Validation

Model validation is the process of ensuring the accuracy of the model by comparing the generated synthetic population with the original data. Researchers choose the validation metrics based on the constraints they want to fulfil. In general, these constraints try to ensure that the probability distributions that will eventually be used in population synthesis (whether marginal, conditional, joint, or partially joint) are close enough to the distributions of the real data.

As we are considering an ensemble of different types of data, i.e. macrodata, microdata or a combination of both, we propose to adapt the validation to these different types. Various metrics can be used, such as: (i) Wasserstein distances [35] for evaluating marginal distributions (to be used with macrodata), and (ii) regression lines or SRMSE, described in [40], for evaluating joint distributions (to be used with microdata). The first metric intuitively shows how different two probability distributions are. Concretely, it measures the minimum amount of "work" or effort required to transform one distribution into another. The lower the distance, the closer the distributions. The second comparative approach is regression lines, which help visualize the fit of the synthetic population to the weighted microsample through a frequency plot, where the frequencies of every unique variable combination in the two datasets are plotted against each other. A perfect match is represented by a line of best fit with zero intercept, unit slope, and a correlation coefficient value of one.
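Both kinds of metric can be sketched in a few lines of standard-library Python. The Wasserstein function assumes two discrete distributions over the same ordered, unit-spaced states; the SRMSE function uses one common formulation (root mean squared error of relative frequencies divided by their mean) and counts only the observed attribute combinations as cells. Both choices are assumptions made for illustration and may differ in detail from the definitions in [35] and [40].

```python
import math


def wasserstein_1d(p, q):
    """1-Wasserstein distance between two discrete distributions.

    p and q are lists of probabilities over the same ordered states,
    assumed to be one unit apart. The distance is the sum of absolute
    differences between the two cumulative distributions.
    """
    distance = 0.0
    cdf_gap = 0.0
    for p_k, q_k in zip(p, q):
        cdf_gap += p_k - q_k
        distance += abs(cdf_gap)
    return distance


def srmse(reference, synthetic):
    """Standardized RMSE between relative frequencies of attribute combinations.

    reference and synthetic map each attribute combination to its count.
    """
    cells = set(reference) | set(synthetic)
    n_reference = sum(reference.values())
    n_synthetic = sum(synthetic.values())
    squared_error = sum(
        (synthetic.get(c, 0) / n_synthetic - reference.get(c, 0) / n_reference) ** 2
        for c in cells
    )
    rmse = math.sqrt(squared_error / len(cells))
    mean_frequency = 1.0 / len(cells)  # relative frequencies sum to one
    return rmse / mean_frequency


# Hypothetical marginal distributions of an ordered attribute (e.g. age groups).
marginal_distance = wasserstein_1d([0.2, 0.5, 0.3], [0.25, 0.45, 0.3])

# Hypothetical counts of (gender, age group) combinations.
reference_counts = {("female", "15-44"): 50, ("male", "15-44"): 50}
synthetic_counts = {("female", "15-44"): 60, ("male", "15-44"): 40}
joint_error = srmse(reference_counts, synthetic_counts)
```

Both quantities are zero when the synthetic and reference distributions coincide, and grow as they diverge.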

5 Application through Use Case

Following the steps defined in Section 4, one can apply our value-enriched population synthesis framework for a specific set of data sources in a region. Once the data is prepared, the models can be created and validated against the original data. The metropolitan area of Barcelona is selected as a proof of concept for our framework. We generate a synthetic population with agents aged from 15 to 74 years old, as motivational data (from existing surveys) is only available for people in that age range.

We use Python’s Bayesian network library pgmpy [5], which allows for the direct implementation of parameter learning and structural learning techniques. All the project materials can be accessed from the corresponding public GitHub repository: https://github.com/albaaguilera/Population-Synthesis

5.1 Data Preparation

After a thorough analysis of the available data for our selected region, working in close collaboration with local governmental organizations (such as the Open Data department and the Opinion Studies Center of the Government of Catalonia) and non-profit organizations (such as fundació Bofill, which focuses on promoting critical knowledge through education-related studies), we decided upon a set of data sources. The primary data sources considered (and their most updated year of coverage) are: OpenData (2022) [4], IPUMS (2011) [2], Panel fundació Bofill (2012) [6], Barcelona values survey (2021) [7] and Catalonia values survey (2023) [8], which are presented in further detail in Table 1.

Table 1: Selected data sources after attribute selection and harmonization. The size of the data sources, represented by the number of individuals, is listed, along with the maximum geographic scope, the type of data and the models to which the data is fed.

No.	Source	Size (individuals)	Scope	Type	Model
$ds_{1}$	OpenData	1,600,000	Neighborhood	Macrodata	Knowledge-based
$ds_{2}$	IPUMS	120,000	Municipality	Microdata	Knowledge-based and learnt
$ds_{3}$	Panel	1,500	Census section	Microdata	Knowledge-based and learnt
$ds_{4}$	Bcn values survey	1,300	District	Microdata	Knowledge-based and learnt
$ds_{5}$	Cat values survey	3,100	Region	Microdata	Knowledge-based and learnt

OpenData [4] is a governmental database, updated annually, that provides socio-demographic macrodata for the entire population of Barcelona up to the neighbourhood level. As a macrodata source, only the knowledge-based model supports it. IPUMS, Panel and value surveys’ data are microdata sources, used to feed the learnt and knowledge-based models. IPUMS [2] contains detailed socio-demographic information from the Spanish census at both the individual and collective levels. Panel fundació Bofill’s data [6] originates from a longitudinal survey aimed at exploring social inequalities in Catalonia, conducted across households up to the census section level. The Barcelona values survey [7] and the Catalonia values survey [8], conducted every two years up to the district and municipality level, investigate ideological, ethical, or attitudinal questions to understand the prevailing value system of the population. We acknowledge that combining data from different years involves a significant assumption, as the populations described may have changed over time. However, the validation process is designed to address this assumption by evaluating the accuracy of the models against all the available data sources. This approach helps us ensure that, despite the temporal differences in the data, the synthetic population does not deviate from the actual characteristics of the population in the region.

Table 2: Summarized classification, definition and description of the selected profile attributes for population synthesis. The attributes are organized into types within the socio-demographic and motivational layers. Note that this simplified table does not specify the 70 attributes comprised in the agent’s profile.

Layer	Type	Attribute	Definition	Description
Socio-demographic	Main	D¹	District	Territorial unit
		G	Gender	Female or male
		A	Age	$0-100$ years old by groups of $10$
		N	Nationality	Spain / rest of EU / rest of the world
		E	Educational level	Last educational level attainment
		U	Unemployment	Registered employed or unemployed
		I	Income	Monthly amount of income
	Household	Hr	Number of people you live with	Number from 0 to 4 or more
	Household	Ch	Children in the household	No children / one or more children
	Network	Fr	Number of friends	Number from 0 to 3 or more
	Network	$X_{Fr}$²	Friends’ main demographic attributes	Friends’ age, gender, educational level and nationality
Motivational	Values		Inglehart’s materialist/post-materialist index	Degree of materialism, mixed values or post-materialism ( $1-7$ )
			Alignment with Schwartz’s fundamental values	Degree of agreement or disagreement with the $10$ fundamental values ( $1-5$ )
	Ideologies		Individual’s and parents’ ideology	Political spectrum ( $1-8$ )
			Alignment with capitalism, socialism, communism and political independence movements	Degree of agreement: agreement, disagreement and indifference ( $1-3$ )
			Alignment with feminism, ecologism, multiculturalism and religion	Degree of agreement: agreement, disagreement and indifference ( $1-3$ )
	Opinions		Interest in politics, sports, culture, etc.	Degree of interest ( $1-4$ )
			View on controversial topics such as immigration, squatting, sustainability, etc.	Multiple options varying with topic
			Confidence in the police, the state, the government, the church, the people, etc.	Degree of confidence ( $1-4$ )
	Vital priorities		Importance given to or satisfaction provided by certain aspects of one’s life: family, friends, work, personal time and studies	Degree of importance or satisfaction ( $1-10$ )

¹ Territorial unit interchangeable for other geographic scopes present in the data sources (e.g. municipality "M", census section "CS" or neighbourhood "N").
² $X_{\text{Fr}}$ refers to the set of variables describing the demographic profile of the individual’s friends (e.g. $G_{1}$ being the number one friend’s gender).

Given the set of data sources, a thorough cleaning, harmonization and selection of attributes needs to be performed. In our case, we resort to the direct elimination of missing values and the establishment of simplified and standardized states $x_{k}$ for each variable to ensure consistency. Additionally, the territorial unit node is designed to be flexible, allowing our application to adapt to various spatial scales.
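As an illustration of this cleaning and harmonization step, the following minimal sketch drops records with missing values and maps raw answers to standardized states. The record fields, the small set of EU country labels and the exact age bins are assumptions made for the example; they are not the actual columns or states used in our data sources.

```python
# Hypothetical raw survey records; None marks a missing answer.
raw_records = [
    {"age": 34, "gender": "F", "nationality": "Spain"},
    {"age": 71, "gender": "M", "nationality": "France"},
    {"age": None, "gender": "F", "nationality": "Morocco"},
    {"age": 8, "gender": "M", "nationality": "Spain"},
]

# Illustrative assumption: a small, incomplete set of EU country labels.
EU_COUNTRIES = {"France", "Italy", "Germany", "Portugal"}


def age_state(age):
    """Map an age in years to a 10-year group label such as '30-39'."""
    lower = min(int(age) // 10 * 10, 90)
    return f"{lower}-{lower + 9}" if lower < 90 else "90-100"


def nationality_state(country):
    """Collapse a country into the three nationality states of Table 2."""
    if country == "Spain":
        return "Spain"
    return "rest of EU" if country in EU_COUNTRIES else "rest of the world"


def harmonize(records):
    """Drop records with any missing value, then standardize the states."""
    clean = []
    for rec in records:
        if any(value is None for value in rec.values()):
            continue  # direct elimination of missing values
        clean.append({
            "A": age_state(rec["age"]),
            "G": rec["gender"],
            "N": nationality_state(rec["nationality"]),
        })
    return clean


harmonized = harmonize(raw_records)
```

Applying the same mapping functions to every data source is what keeps the states of a shared variable comparable across datasets.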

5.2 Model Selection

A summarized classification, definition and description of the selected attributes is provided in Table 2. The models encompass over seventy variables, a number that can either be reduced or extended to narrow or widen the synthetic agent’s profile. We could include as much information as the data sources allow (e.g. workplace sector, extracurricular activities, information about the use of time, the perception of one’s health state or social class, detailed information on the household assets, etc.). As explained in Section 4, we differentiate between two models: (1) the learnt model (Section 5.2.1) and (2) the knowledge-based model (Section 5.2.2).

5.2.1 Learnt Model

The learnt model draws data from the four microdata sources specified in Table 1. It learns the structures and parameters using the hill climb search method and the expectation maximization estimator [9]. Following the procedure explained in Section 4.2, we identify the core attributes present in the four datasets. Among the selected datasets, IPUMS stands out as the richest due to its significantly larger interviewed population. Consequently, we learn both $S_{core}$ and $\delta_{core}$ from it. Once this core is established, we add the remaining attributes from the other datasets and learn their structure and parameters, along with their connection with the core attributes. The final learnt model structure is represented in Fig. 4. The structures and parameters $S_{ds_{3}}$, $\delta_{ds_{3}}$ and $S_{ds_{4}}$, $\delta_{ds_{4}}$ encapsulate the dependencies and distributions of the network and motivational attributes, respectively. Furthermore, the remaining structure and parameters $S_{ds_{5}}$, $\delta_{ds_{5}}$ can be added as aggregated motivational attributes at the collective level, with a regional scope.
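The identification of the core attributes can be pictured as a set intersection over the attribute lists of the microdata sources, with the remaining attributes of each dataset attached to the core afterwards. The sketch below shows only this bookkeeping, not the structure or parameter learning itself; the dataset keys and the attribute lists are hypothetical and do not reproduce the actual columns of each source.

```python
# Hypothetical attribute lists per microdata source. Symbols follow Table 2
# where possible; the exact columns of each dataset are assumptions.
datasets = {
    "ipums": {"G", "A", "N", "E", "U", "Hr", "Ch"},
    "panel": {"G", "A", "N", "E", "Fr", "I"},
    "bcn_values": {"G", "A", "N", "E", "ideology", "inglehart_index"},
    "cat_values": {"G", "A", "N", "E", "trust_in_institutions"},
}

# Core attributes: those present in every dataset.
core = set.intersection(*datasets.values())

# Remaining attributes per dataset, to be connected to the core afterwards.
extensions = {name: sorted(attrs - core) for name, attrs in datasets.items()}
```

Under these assumed lists, the core is the set of attributes shared by all four sources, and each dataset contributes only its non-core attributes in the second stage.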

The dependencies identified by the learnt model align with our expectations regarding the connections between profile attributes, which suggests that the model behaves as intended. For instance, within the socio-demographic layer, age emerges as the predominant influencer, as most of the other attributes depend on it. The social network attributes also exhibit a clear structure where friends share similar traits: the characteristics of individuals are found to influence the characteristics of their friends, as observed with attributes such as gender, education, and age. The individual and collective motivational structures, $S_{ds_{4}}$ and $S_{ds_{5}}$, showcase several connections between attributes within the motivational layer and from the socio-demographic one. In the first case, the interdependencies are classified as within ideologies, within opinions, within values, and across all of them together; one example is the parents’ ideologies influencing the individual’s ideology. It is important to highlight that these dependencies may vary significantly depending on the city of study, especially those related to politics and trust in institutions.

For a more concrete description of the whole structure learnt by the model, beyond the schematic one provided in Fig. 4, we outline some of the interdependencies (connections between attributes within the same layer) and outer dependencies (connections between attributes from the socio-demographic to the motivational layer). Note that, in the context of dependencies, $a\prec b$ represents the causal relation " $b$ depends on $a$ ".

Interdependencies

1. Socio-demographic layer
   - Age $\prec$ other main demographic attributes
   - Within network: individual’s demographic attributes $\prec$ friend’s demographic attributes
2. Motivational layer
   - Within ideologies: parents’ ideologies $\prec$ individual ideology
   - Within opinions: trust in the monarchy $\prec$ trust in other institutions, opinions about feminism $\prec$ opinion towards ecologism and multiculturalism
   - Within Schwartz’s values: benevolence $\prec$ universalism, self-direction $\prec$ stimulation $\prec$ conformity $\prec$ hedonism $\prec$ power $\prec$ achievement
   - Within ideologies, values and opinions: Inglehart’s index $\prec$ political ideology, religion $\prec$ tradition, opinion towards immigration $\prec$ security and social trust in people

Outer dependencies

- Between socio-demographic layer and values: nationality $\prec$ religion and opinion about political independence movements, children in household $\prec$ hedonism
- Between socio-demographic layer and vital priorities: employment status $\prec$ satisfaction with professional and economic aspects, age $\prec$ importance given to work, children in the household $\prec$ importance given to family and social trust in people
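To make the $\prec$ notation concrete, a few of the dependencies listed above can be written as directed edges and checked for consistency as a directed acyclic graph. This is a minimal sketch using only the Python standard library; the node labels are hypothetical names chosen for the example, and only a subset of the learnt edges is included.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# A subset of the dependencies listed above, written as (a, b) for "a ≺ b",
# i.e. b depends on a. The node names are hypothetical labels.
edges = [
    ("age", "education"),
    ("age", "importance_of_work"),
    ("nationality", "religion"),
    ("religion", "tradition"),
    ("parents_ideology", "ideology"),
    ("inglehart_index", "ideology"),
    ("children_in_household", "hedonism"),
    ("children_in_household", "importance_of_family"),
]


def parents_of(edge_list):
    """Return a mapping node -> set of parents (the nodes it depends on)."""
    parents = {}
    for a, b in edge_list:
        parents.setdefault(a, set())
        parents.setdefault(b, set()).add(a)
    return parents


parents = parents_of(edges)

# TopologicalSorter takes node -> predecessors and raises CycleError
# if the relation does not form a valid DAG.
order = list(TopologicalSorter(parents).static_order())
```

The resulting `order` lists every attribute after all of its parents, which is the order in which attribute values would need to be drawn when sampling an agent from such a network.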

5.2.2 Knowledge-based Model

The knowledge-based model draws data from all the sources listed in Table 1. We establish a basic structure based on prior knowledge, which involves imposing dependencies, available in data, that seem naturally evident between the attributes. We acknowledge the biases that the knowledge-based structure can introduce, but bear in mind that both models defined are intended to complement each other in the validation step. By detecting which model most accurately represents a specific data source, we plan to select that particular structure and parameters for each data source.

The crafted structure is simpler than the learnt one in terms of connections. The core attributes (mostly corresponding to the main socio-demographic ones) are chosen to influence all the other socio-demographic and motivational attributes. In fact, the structure is geographically rooted: all attributes are influenced by the territorial unit whenever the data allows us to impose this dependency. The final knowledge-based model structure is represented in Fig. 4, where the core attributes are the parents of all other attributes. The parameters $\delta_{ds_{1}}$ are computed using the maximum likelihood method, while the other parameters are estimated using expectation maximization. By manually crafting this structure, rather than learning it (feasible only with microdata sources), we can integrate macrodata from $ds_{1}$ into the socio-demographic layer. This approach allows us to leverage macrodata that can be more representative than microdata.
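As a simplified illustration of how macrodata can feed a geographically rooted structure, the sketch below computes maximum likelihood estimates from aggregated counts for a two-node fragment (territorial unit D as the parent of gender G) and then draws agents by ancestral sampling. The counts are invented for the example, and the fragment is far smaller than the actual knowledge-based structure.

```python
import random

# Hypothetical macrodata: population counts by district and gender.
# These numbers are invented for illustration only.
counts = {
    ("Eixample", "F"): 140, ("Eixample", "M"): 120,
    ("Gracia", "F"): 65, ("Gracia", "M"): 55,
}


def district_totals(counts):
    """Aggregate the counts per district."""
    totals = {}
    for (district, _), n in counts.items():
        totals[district] = totals.get(district, 0) + n
    return totals


def mle_root(counts):
    """Maximum likelihood estimate of P(D): district share of the total."""
    totals = district_totals(counts)
    grand_total = sum(totals.values())
    return {d: n / grand_total for d, n in totals.items()}


def mle_conditional(counts):
    """Maximum likelihood estimate of P(G | D): counts normalized per district."""
    totals = district_totals(counts)
    cpt = {}
    for (district, gender), n in counts.items():
        cpt.setdefault(district, {})[gender] = n / totals[district]
    return cpt


def sample_agent(p_d, p_g_given_d, rng):
    """Ancestral sampling: draw the territorial unit first, then its child."""
    district = rng.choices(list(p_d), weights=list(p_d.values()))[0]
    genders = p_g_given_d[district]
    gender = rng.choices(list(genders), weights=list(genders.values()))[0]
    return {"D": district, "G": gender}


p_d = mle_root(counts)
p_g_given_d = mle_conditional(counts)
rng = random.Random(0)  # fixed seed so the sketch is reproducible
population = [sample_agent(p_d, p_g_given_d, rng) for _ in range(1000)]
```

With complete count tables such as these, the maximum likelihood estimate reduces to normalized frequencies, which is why aggregated macrodata are sufficient for this part of the network.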

6 Conclusions and Future Work

In this paper, we have presented a novel population synthesis framework that incorporates motivational attributes into agent profiles. By feeding from social survey data, our framework connects information about individuals’ values, ideologies, opinions and vital priorities to the rest of their socio-demographic attributes, preserving the representativeness of the population. We propose two different models for evaluation: a learnt model and a knowledge-based model. These structures lead to the generation of synthetic populations with highly detailed motivational attributes at different geographic scopes. Researchers can use these datasets to initialize their simulations with comprehensive agent profiles and enhance decision-making architectures.

Future work includes developing a hybrid model that leverages the strengths of both the learnt and knowledge-based models: capturing complex variable dependencies effectively while integrating representative and up-to-date macrodata. To achieve this, we will validate the generated synthetic populations against all data sources and compare the performance of the different structures to determine which model’s structure most accurately represents each data source selected for the use case. Additionally, the validation results should be compared with those of other emerging machine-learning methods [VAE1, aemmer2022generative, albiston2024neural]. Once validated, the synthesized population will be integrated into an agent-based model application. By either creating a decision-making architecture or extending an existing one with detailed motivational attributes, we aim to model the behaviour of agents in a way that is closer to real-life complex scenarios.

While agent-based social simulations are always a conceptual simplification of a given social context, their use to inform policy making in sensitive and urgent topics such as poverty mitigation [11] or public health crises [17] calls for a more nuanced analysis and reflection regarding the values that guide the agents’ behaviour. Additional future steps will include testing the replicability of the model in other regions with alternative datasets. The article opens the door to operationalizing value alignment of agent-based simulations in different contexts and geographical locations, as well as to exploring how a diversity of values, ideologies, opinions and vital priorities can affect the effectiveness of policy making by conditioning agents’ behaviours.

Acknowledgements

This research has been supported by the EU-funded VALAWAI (# 101070930), the Spanish-funded VAE (# TED2021-131295B-C31) and the Rhymas (# PID2020-113594RB-100) projects. Special thanks to Raül Tormos, Head of Methodology and Research at CEO (Centre d’Estudis d’Opinió) Generalitat de Catalunya.

References

[1] European Social Survey. https://www.europeansocialsurvey.org/.
[2] Minnesota Population Center. Integrated Public Use Microdata Series, International: Version 7.3. https://doi.org/10.18128/D020.V7.3. Minneapolis, MN: IPUMS, 2020.
[3] Portrait values questionnaire, 21 items. https://zis.gesis.org/skala/Schwartz-Breyer-Danner-Human-Values-Scale-(ESS)#.
[4] Societat i benestar - Open Data Barcelona. https://opendata-ajuntament.barcelona.cat/data/es/organization/societat-i-benestar.
[5] Supported data types - pgmpy 0.1.23 documentation.
[6] Panel de Desigualtats Socials a Catalunya-PaD. https://fundaciobofill.cat/panel-de-desigualtats-socials-catalunya-pad, 2012.
[7] Enquesta de Valors Socials Ajuntament de Barcelona - Oficina Municipal de Dades. https://portaldades.ajuntament.barcelona.cat/es/encuestas/21046, 2021.
[8] Enquesta de Valors a Catalunya. https://ceo.gencat.cat/ca/estudis/registre-estudis-dopinio/estudis-dopinio-ceo/societat/detall/index.html?id=9088, 2023.
[9] Structure learning - pgmpy 0.1.23 documentation, 2023.
[10] Serio Agriesti, Claudio Roncoli, and Bat-hen Nahmias Biran, ‘Assignment of a synthetic population for activity-based modeling employing publicly available data’, ISPRS International Journal of Geo-Information, 11, 148, (02 2022).
[11] Alba Aguilera, Nieves Montes, Georgina Curto, Nardine Osman, and Carles Sierra, ‘Can poverty be reduced by acting on discrimination? An agent-based model for policy-making’, AAMAS ’24: Proceedings of the 2024 International Conference on Autonomous Agents and Multiagent Systems, (2024).
[12] Gordon W Allport, The nature of prejudice, Basic Books, 1954.
[13] Kevin Chapuis and Patrick Taillandier, ‘A brief review of synthetic population generation practices in agent-based social simulation’, in submitted to SSC2019, Social Simulation Conference, (2019).
[14] Kevin Chapuis, Patrick Taillandier, and Alexis Drogoul, ‘Generation of synthetic populations in social simulations: a review of methods and practices’, Journal of Artificial Societies and Social Simulation, 25(2), (2022).
[15] Dov Cohen, ‘Cultural variation: considerations and implications.’, Psychological bulletin, 127(4), 451, (2001).
[16] Philippe De Wilde, Payal Arora, Fernando Buarque de Lima Neto, Yik Chin, Mamello Thinyane, Serge Stinckwich, Eleonore Fournier-Tombs, and T Marwala, ‘Recommendations on the use of synthetic data to train ai models’, (2024).
[17] Frank Dignum, Virginia Dignum, Paul Davidsson, Amineh Ghorbani, Mijke van der Hurk, Maarten Jensen, Christian Kammler, Fabian Lorig, Luis Gustavo Ludescher, Alexander Melchior, René Mellema, Cezara Pastrav, Loïs Vanhee, and Harko Verhagen, ‘Analysing the Combined Health, Social and Economic Impacts of the Corovanvirus Pandemic Using Agent-Based Social Simulation’, Minds and Machines, 30(2), 177–194, (jun 2020).
[18] European Social Survey European Research Infrastructure (ESS ERIC). Ess round 10 - 2020. democracy, digital social contacts, 2023.
[19] GESIS Data Archive, Cologne. ZA7500 Data file Version 5.0.0.
[20] Boyam Fabrice Yaméogo, Pascal Gastineau, Pierre Hankach, and Pierre-Olivier Vandanjon, ‘Comparing methods for generating a two-layered synthetic population’, Transportation research record, 2675(1), 136–147, (2021).
[21] C. Haerpfer, R. Inglehart, A. Moreno, C. Welzel, et al.
[22] Thomas Herdin and Wolfgang Aschauer, ‘Value changes in transforming china’, Kome: An International Journal of Pure Communication Inquiry, 2(1), 1–22, (2013).
[23] Axel Honneth, The Struggle for Recognition, Polity Press, 1996.
[24] Sebastian Hörl and Milos Balac, ‘Synthetic population and travel demand for paris and île-de-france based on open and publicly available data’, Transportation Research Part C: Emerging Technologies, 130, 103291, (2021).
[25] Anugrah Ilahi and Kay W Axhausen, ‘Integrating bayesian network and generalized raking for population synthesis in greater jakarta’, Regional Studies, Regional Science, 6(1), 623–636, (2019).
[26] Ronald Inglehart, The Silent Revolution: Changing Values and Political Styles Among Western Publics, Princeton University Press, Princeton, NJ, 1977.
[27] Ronald Inglehart and Christian Welzel, Modernization, cultural change, and democracy: The human development sequence, volume 333, Cambridge university press Cambridge, 2005.
[28] Jacob Jordaan and Bogdan Dima, ‘Post materialism and comparative economic development: Do institutions act as transmission channel?’, Social Indicators Research, 148, (04 2020).
[29] Roger X Lera-Leri, Enrico Liscio, Filippo Bistaffa, Catholijn M Jonker, Maite Lopez-Sanchez, Pradeep K Murukannaiah, Juan A Rodriguez-Aguilar, and Francisco Salas-Molina, ‘Aggregating value systems for decision support’, Knowledge-Based Systems, 287, 111453, (2024).
[30] Xin Liang, Lizi Luo, Shiying Hu, and Yuke Li, ‘Mapping the knowledge frontiers and evolution of decision making based on agent-based modeling’, Knowledge-Based Systems, 250, 108982, (2022).
[31] Poh Ping Lim, ‘Population synthesis for travel demand modelling in australian capital cities’, (2020).
[32] Nik Lomax, Andrew P Smith, Luke Archer, Alistair Ford, and James Virgo, ‘An open-source model for projecting small area demographic and land-use change’, Geographical Analysis, 54(3), 599–622, (2022).
[33] Nieves Montes, Georgina Curto, Nardine Osman, and Carles Sierra, ‘An agent-based model for poverty and discrimination policy-making’, in Proceedings of the 2nd Workshop on Agent-based Modeling and Policy-Making (AMPM 2022) co-located with 35th International Conference on Legal Knowledge and Information Systems (JURIX 2022), (2022).
[34] Muhammad S Osman, Adnan M Abu-Mahfouz, and Philip R Page, ‘A survey on data imputation techniques: Water distribution system as a use case’, IEEE Access, 6, 63279–63291, (2018).
[35] Victor M Panaretos and Yoav Zemel, ‘Statistical aspects of wasserstein distances’, Annual review of statistics and its application, 6, 405–431, (2019).
[36] Manon Prédhumeau and Ed Manley, ‘A synthetic population for agent-based modelling in canada’, Scientific Data, 10(1), 148, (2023).
[37] Aurore Sallard and Miloš Balać, ‘Travel demand generation using bayesian networks: an application to switzerland’, Procedia Computer Science, 220, 267–274, (2023).
[38] Shalom H. Schwartz, ‘An Overview of the Schwartz Theory of Basic Values’, Online Readings in Psychology and Culture, 2(1), (2012).
[39] A Sen, Development as freedom, Oxford University Press, 2001.
[40] Lijun Sun and Alexander Erath, ‘A bayesian network approach for population synthesis’, Transportation Research Part C: Emerging Technologies, 61, 49–62, (2015).
[41] Charles Taylor, Multiculturalism and "the politics of recognition", Princeton University Press, 1992.
[42] Colin Wan, Zheng Li, Alicia Guo, and Yue Zhao, ‘Sync: A unified framework for generating synthetic population with gaussian copula’, arXiv preprint arXiv:1904.07998, (2019).
[43] William D Wheaton, James C Cajka, Bernadette M Chasteen, Diane K Wagener, Philip C Cooley, Laxminarayana Ganapathi, Douglas J Roberts, and Justine L Allpress, ‘Synthesized population databases: A us geospatial database for agent-based models’, Methods report (RTI Press), 2009(10), 905, (2009).
[44] https://www.worldvaluessurvey.org/WVSNewsShow.jsp?ID=467.
[45] Kaiyuan Xu, Brian Nosek, and Anthony G. Greenwald, ‘Data from the Race Implicit Association Test on the Project Implicit Demo Website’, Journal of Open Psychology Data, 2(1), e3, (mar 2014).
[46] Batel Yossef Ravid and Meirav Aharon-Gutman, ‘The social digital twin: the social turn in the field of smart cities’, Environment and Planning B: Urban Analytics and City Science, 50(6), 1455–1470, (2023).
[47] Meng Zhou, Jason Li, Rounaq Basu, and Joseph Ferreira, ‘Creating spatially-detailed heterogeneous synthetic populations for agent-based microsimulation’, Computers, Environment and Urban Systems, 91, 101717, (2022).