How to keep it adequate: A validation protocol for agent-based simulation

Authors:
Christian Troost, Hans-Ruthenberg-Institute, Universität Hohenheim, Stuttgart, Germany, christian.troost@uni-hohenheim.de, ORCID: 0000-0003-4626-7117 (Corresponding author)
Andrew R. Bell, Earth & Environment Department, Boston University, Boston, MA, USA, ORCID: 0000-0002-1164-312X
Hedwig van Delden, Research Institute for Knowledge Systems (RIKS), Maastricht, the Netherlands, ORCID: 0000-0001-6976-4832
Robert Huber, Agricultural Economics and Policy Group, ETH Zürich, ORCID: 0000-0003-4545456X
Tatiana Filatova, Multi Actor Systems Department, Faculty of Technology, Policy and Management, TU Delft, The Netherlands, ORCID: 0000-0002-3546-6930
Quang Bao Le, International Center for Agricultural Research in the Dry Areas (ICARDA), Cairo, Egypt, ORCID: 0000-0001-8514-1088
Melvin Lippe, Thünen Institute of Forestry, Hamburg, ORCID: 0000-0003-4323-8767
Leila Niamir, Energy, Climate, and Environment Program, International Institute for Applied Systems Analysis (IIASA), Laxenburg, Austria, ORCID: 0000-0002-0285-5542
J. Gareth Polhill, The James Hutton Institute, Craigiebuckler, Aberdeen AB15 8QH, United Kingdom, ORCID: 0000-0002-8596-0590
Zhanli Sun, Leibniz Institute of Agricultural Development in Transition Economies (IAMO), Germany, ORCID: 0000-0001-6204-4533
Thomas Berger, Hans-Ruthenberg-Institute, Universität Hohenheim, Stuttgart, Germany, ORCID: 0000-0003-3316-9614

A variety of approaches for constructing valid ABM have been suggested (e.g. Augusiak et al. 2014; Brenner & Werker, 2007; Deichsel & Pyka, 2009; Moss & Edmonds, 2005; Grimm et al., 2005), and many examples of formal empirical validation and calibration of ABM exist: indirect inference methods for ABM calibration in financial economics (Chen et al., 2012), pattern-oriented modelling as a de-facto standard in ecological modelling (Grimm et al., 2005; Thiele et al., 2014), Approximate Bayesian Computation for individual-based models (van der Vaart et al., 2015), micro-validation in energy economics (Niamir et al. 2020a), automatised calibration for innovation diffusion models (Jensen & Chappin, 2016) and real estate market interactions (Filatova 2015; Magliocca et al., 2016; de Koning and Filatova, 2020), and robust parameter uncertainty reduction in agricultural economics (Arnold et al. 2015; Troost & Berger 2015a; Berger et al. 2017). A consensus or formal guideline on which method to choose for a specific ABM application context that transcends disciplines has, however, not yet been established, even within the more confined field of ABM in environmental and land system sciences (e.g. Polhill & Salt 2017; Filatova 2015).
Empirical output validation, i.e. comparing model predictions to observations of a real-world system, is widely regarded as the ideal of validation because it entails reproducible protocols and quantitative, replicable and transparently communicable results. However, it has also been clearly demonstrated that overreliance on goodness-of-fit to observations is misleading and inadequate if the underlying (statistical) assumptions for empirical validation are not fulfilled in a specific research context (e.g. Oreskes et al. 1994; Polhill & Salt 2017).
As inherently structure-rich models, ABM are often used in contexts where simpler, statistical approaches are not applicable and, as a consequence, the prerequisites for (system-level) empirical validation are typically not fulfilled (Berger & Troost 2014). The importance of structural validation and sensitivity analysis for such contexts has been widely recognised (Moss & Edmonds 2005; Troost & Berger 2015a; Marshall & Galea, 2015; Polhill & Salt 2017). Structural validation, i.e. ensuring adequate correspondence of model structure and processes with their real-world counterparts, is often less formalised. When using empirical validation for model components at the micro level, similar statistical prerequisites have to be considered as in empirical macro-validation. While formal approaches for conducting sensitivity analysis have been clearly formulated (e.g. Saltelli et al. 2008), it is not necessarily obvious which uncertainties and criteria for robustness should be considered and how they relate to the encompassing modelling process (Ligmann-Zielinska et al. 2020).
The recognition that models are by definition abstractions from reality and that their absolute truth ultimately cannot be proven empirically (Oreskes et al., 1994; Quine, 1951) has led the scientific community to replace the condition for model validity 'corresponds to the real system' with 'is adequate for its intended purpose' (e.g. Forrester & Senge, 1980; Gass, 1983; McCarl & Apland, 1986; Oreskes et al., 1994; Barlas, 1996; Kydland & Prescott, 1996; Rykiel, 1996; Beck et al., 1997; Jakeman et al., 2006; Augusiak et al., 2014; Edmonds et al. 2019). This means that the conditions for a valid, i.e. adequate, model and simulation analysis are context-dependent. They depend not only on the characteristics of the system to be modelled, but also on the availability of data describing the system and its behaviour, as well as on the research question to be answered.
ABM are used for a large variety of purposes and contexts (Edmonds et al. 2019; Lippe et al. 2019; Schulze et al. 2017). Hence, on the one hand, formalising ABM validation cannot mean prescribing one statistical validation procedure to all ABM. On the other hand, context-dependency of validity does not mean 'anything goes'. There are fundamental principles that are essential for a valid analysis in certain contexts. There is a vast body of literature that suggests, justifies, discusses or criticises specific approaches for model selection, calibration, testing and analysis. Often, however, the modelling contexts for which these methods are applicable are not explicitly delineated, because they are implicit in the disciplinary context or even ignored.
In this article, we argue that, under a paradigm of adequacy, validity cannot be assured by the one confined, isolated step of the modelling process (typically located after calibration and before predictive simulations) that is commonly called validation. Instead, it requires context-adequate and mutually consistent choices at all stages of the simulation analysis, including the choice of model components, the choice of methods for parameterisation, model inference (inverse modelling, calibration, estimation) and testing, and a consistent tracing, documentation and interpretation of uncertainties throughout the modelling process, in order to finally ensure the validity of the conclusions drawn from the analysis.
The ABM community has successfully adopted the ODD protocol for formal model documentation. Schmolke et al. (2010) and Grimm et al. (2014) have suggested the TRACE format for formally documenting the modelling process. Though TRACE highlights that all elements of a modelling process are relevant for assessing the validity of a simulation analysis, it does not provide formal guidance on which methods to use in which contexts. Our article, which has resulted from community discussions initiated in workshop W9 of the 2020 iEMSs conference, aims to fill this gap.
In the first part of this article, we conceptualise validation as "challenging and substantiating the premises on which the conclusions from simulation analysis are built". We revisit premises typically used in simulation analysis and discuss to what extent they are tested by, and to what extent they are actually presupposed by, empirical and structural validation, uncertainty analysis, model selection, empirical parameter estimation and result interpretation.
On this basis, in the second part, we develop a protocol to help modellers keep it adequate (KIA): a protocol of guiding questions to characterise the modelling context for choosing adequate model components and methods of parameterisation, testing and uncertainty analysis step by step. The KIA protocol is intended to (a) guide modellers during the research process, (b) provide a template structure for transparently documenting the rationale for modelling choices, (c) serve as a checklist for reviewers and stakeholders (addressees of simulation results) when assessing the validity of a documented study and its conclusions, (d) foster efficient communication between authors and reviewers, and (e) help in structuring the scientific discussion on the merits of validation and calibration methods.

Validation: Arguments for model validity and their premises
If there is one cross-disciplinary consensus in the scientific literature on model validation, it is that model validity cannot be established in general, but only with respect to a specific purpose for which the model is intended to be used. Model validity is the adequacy of a model for its intended purpose (e.g. Forrester & Senge, 1980; Gass, 1983; McCarl & Apland, 1986; Oreskes et al., 1994; Barlas, 1996; Kydland & Prescott, 1996; Rykiel, 1996; Beck et al., 1997; Jakeman et al., 2006; Augusiak et al., 2014; Edmonds et al. 2019).
The purpose of any scientific simulation analysis is to answer a research question. Scientific answers result as conclusions from scientific argumentation and are accepted if the conclusions can be validly derived from accepted premises (McCloskey, 1983; Hands 2001). Scientific objectiveness is ensured by transparently subjecting all premises and deductions to critical scrutiny and peer review (Klappholz & Agassi, 1959; Caldwell 1991).
In its most generic form, a scientific argument that employs simulation modelling conforms to the following logical proposition (Troost & Berger 2020): a major premise states that conclusion Z follows if the simulation analysis fulfils certain conditions and produces certain results; premise B is the conjunction of these two parts. The first part, "R(s,x,y): Our model results in Y for inputs X", is supported by result analysis. Showing that the second part, "U(s): Our simulation analysis fulfils conditions U", holds is what is typically understood as validation.
A typical example: we conclude (Z) "Climate change will increase poverty among farming households" if R(s,x,y): "Simulated farm agent income is lower in climate change scenarios than in the baseline". The necessary condition U(s) is very often formulated as: "The model employed in our simulation analysis provides sufficiently reliable predictions of Y(X) in the real system." Empirical output validation and structural validation test whether a simulation s fulfils this (or a very similar) formulation of U(s), but they, in turn, rely on further necessary premises. These premises are discussed in the following two subsections. Recognising the uncertainty in the simulation process, the third subsection discusses the role of uncertainty analysis for sound and robust conclusions (showing sufficient reliability). In the fourth subsection, we highlight that simulation analyses may also rely on differently formulated conditions U(s) that allow for more useful conclusions in some contexts.

Premises of empirical validation and inverse modelling
The key underlying premise of empirical output validation is: "Predictive performance of a model in observed situations can be generalised to the target situations (i.e. the system situations relevant for the research question)". This premise is trivially fulfilled if the target situation is part of the observed situations (in-sample setting). In contrast, whenever the simulation purpose is prediction or counterfactual simulation, the target situations (life after climate change, in our example) have not been (fully) observed. The same holds implicitly for 'explanation', where the objective typically is to find a generalisable explanatory model (Edmonds et al. 2019).
Generalisation of behaviour from observations to unobserved target situations needs to involve statistical considerations in order to avoid propagating spurious, unsystematic relationships (Hansen & Heckman 1996): Direct generalisation of statistical relationships, including X-Y relationships and predictive performance, is only possible if the sample is redundant enough to control for sampling error and the target situations are part of a statistical population for which the observed sample is representative (representative sample setting).
Sampling error is the unavoidable, unsystematic error caused by using a sample rather than the full population. It can potentially be reduced by increased sampling rates (Williams et al. 2022). Non-representativity occurs due to a biased sample, which can be caused by different, sometimes subtle, reasons, including attrition, self-selection, survivorship or failure bias, observer bias, and unobserved heterogeneity (Vandecasteele & Debels 2007; Gangl 2010; Gormley & Matsa 2014; Jager et al. 2020; Smith 2020). While some minor biases may be corrected by statistical means, structural breaks, non-stationarity or regime shifts (such as climate change) substantially alter statistical X-Y relationships, causing extreme sample bias: observed and target situations are so fundamentally different that they must be considered different statistical (sub)populations (non-representative sample setting), and direct generalisation is not possible (Perron 2006; Andersen et al. 2009; Leamer 2010; Filatova et al. 2016).
It is very important to realise that these preconditions apply to any form of model inference by inverse modelling (i.e. calibration, empirical model selection or parameter estimation) using observed behaviour. In all cases, ignoring sampling error and bias leads to the generalisation of unsystematic, confounded or unstable relationships (overfitting) that cause inaccurate and misleading out-of-sample predictions (Browne 2000; Forster 2000; Hansen & Heckman 1996).
In non-representative sample settings, simulation of system behaviour for unobserved situations has to rely on structural knowledge about internal system processes (see next section). Nevertheless, a sample can still be useful here: structural knowledge often admits alternative model formulations or parameterisations (candidates). Even if a sample is not representative of the target situations, it can be used to discriminate between the candidates if it is representative (and sufficiently redundant) in a domain in which the candidates imply clearly distinguishable behaviour. Generalisation to a target situation then relies exclusively on the structural knowledge embodied in the chosen candidate, whereas observed behavioural data contribute only indirectly by selecting this candidate (indirect generalisation) [1]. Importantly, the predictive accuracy measured in the sample cannot be straightforwardly generalised to the target situation in these cases.
Using a sample to reliably discriminate between candidates or detect statistical relationships presupposes structural and practical identifiability (Bellman & Åström, 1970; Cobelli & DiStefano, 1980; Stigter et al. 2017; Guillaume et al. 2019): Structural identifiability means that different candidates are not observationally equivalent, i.e. do not imply the same system behaviour in the observed domain. Even a fully representative and redundant sample is not able to distinguish between models that predict the same output for the same input [2]. Practical identifiability means that the variation in the observational data, in connection with auxiliary assumptions (e.g. on representativity and the form of model errors), is sufficient to unambiguously attribute effects to the individual parameters of a given model structure. Sampling error, confounded input variation (correlated variables, multicollinearity), unobserved heterogeneity and omitted variable bias are key obstacles to unambiguous model selection and parameter estimation. More complex models require more data or more restrictive prior assumptions on parameters to be practically identifiable (Browne 2000; Burnham & Anderson 2004; Polhill & Salt 2017). Two model structures or parameter sets that cannot be discriminated by the given data are termed 'equifinal' (Beven & Freer 2001).
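The notion of equifinality in an observed domain can be illustrated with a minimal numerical sketch (the two candidate response functions and the temperature ranges are purely illustrative assumptions, not models from any application discussed here): two candidates that coincide over the observed input range cannot be discriminated by such a sample, although they diverge strongly in the unobserved target range.

```python
import numpy as np

# Two hypothetical candidate models of crop yield response to temperature
# (illustrative toy assumptions only).
def candidate_linear(temp):
    return 2.0 * temp                      # yield keeps rising with warmth

def candidate_plateau(temp):
    return np.minimum(2.0 * temp, 50.0)    # yield caps at 50 above 25 degrees

observed = np.linspace(18, 22, 50)   # domain the sample is representative for
target = np.linspace(28, 32, 50)     # unobserved target situations (warmer climate)

# Both candidates predict identical behaviour in the observed domain:
# they are 'equifinal' and cannot be identified from such a sample ...
print(np.max(np.abs(candidate_linear(observed) - candidate_plateau(observed))))  # 0.0

# ... yet they imply very different behaviour in the target domain, so the
# choice between them must rest on structural knowledge, not on this sample.
print(np.max(np.abs(candidate_linear(target) - candidate_plateau(target))))      # 14.0
```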

Premises of structural validation and structure-based model choice
As argued above, structure-based simulation is essential to anticipate behaviour for target situations for which direct generalisation from observed data is not possible and to derive structural explanations of system behaviour. Structure-based simulation deduces system reaction from existing knowledge about system components and their interactions. It is sometimes argued that such a deductive process does not create new information. However, as Frisch (1931) argued, the key contribution of quantitative modelling is to analyse the interplay of processes and compare the magnitudes and directions of their individual effects in relation to each other in order to deduce the behaviour of the whole system. This anticipated or emergent behaviour is new information that was not obvious from looking at existing knowledge on individual processes in isolation.
The key premise of structure-based modelling and structural validation is: "A model that contains a sufficiently complete and accurate representation of the internal structure and processes of a system is expected to predict system behaviour well." Sufficient completeness is often complicated by incomplete knowledge of the system and its potential reconfigurations. In addition, modellers are typically forced to strike a balance between completeness and efficiency, striving to include all relevant processes while omitting unimportant ones that complicate the model construction (Forrester & Senge, 1980).

[1] Similarly, indirect generalisation occurs if the output variable of interest has not itself been observed and a model is indirectly tested using another, related output variable. Generalisation of the variable of interest then relies on the premise that the structural knowledge embodied in the model correctly relates the two variables.
[2] Structural identifiability in our understanding also subsumes problems of endogeneity often encountered in econometrics.
Sufficient accuracy in the representation of individual processes is the subject of micro-validation (Moss & Edmonds 2005; Windrum et al. 2007; Midgley et al. 2007; Arnold et al. 2015; Ghaffarian et al. 2021). Some structural processes and their parameters may be directly observable and measurable. Others, however, may have been generalised from observed subsystem behaviour by inverse modelling and estimation. The premises for empirical estimation and validation of process models at the micro level are the same as at the macro (full system) level: sample representativity, identifiability and control of sampling error. The inclusion of estimated model components into a composite model requires ensuring that the observations from which they have been generalised are representative for all contexts for which they are applied in the composite system.

Uncertainty analysis: The premises for robust conclusions
In practice, all system knowledge and data used in simulation analysis are subject to uncertainty.
Just showing that one particular model results in a specific output for a particular input is hence not convincing. This implies, firstly, that the implications of uncertainty in structural knowledge and of uncertainty in model inference from data must be carefully assessed. In predictive analysis, the uncertainty in the anticipated input for a target situation additionally needs to be considered. Secondly, the type and degree of uncertainty and bias that are compatible with conclusion Z must be carefully specified in the major premise.

Alternative basic premises
Not every scientific argument using simulation analysis is based on the premise that the model provides reliable predictions of Y(X) in the real system. Edmonds et al. (2019) have noted that some types of analysis (e.g. theoretical exposition) do not require any immediate claims about the relation of the model to reality at all, or put more emphasis on representing stakeholders' views of the system.
A subtler relation is discussed by Troost & Berger (2020, p. 6f.), who use the following hypothetical ABM application: "Economic policy analysis often works in a normative context: Policy makers need to justify actions with respect to established societal values, norms or ideologies. For example, they might work in a political setting, in which the state is supposed to safeguard minimum living incomes but only to interfere in economic processes if market participants are not at all able to help themselves.
Assume that in this context analysts build their ABM to simulate the adaptation of farmers to climatic change and model each farm agent decision as a rational optimisation problem with perfect anticipation of (projected) climatic impacts on production and market conditions. In addition, farm agents are embedded into a social network of mutual solidarity, in which agents less affected by climatic extreme events indiscriminately help the severely affected ones. Analysing their simulations, the analysts find that their optimising farm agents become food insecure under projected impacts. They conclude that if perfectly-foresighted, optimising agents in a perfectly functioning social solidarity network do not fare well, real-world farmers are even more unlikely to do so and should receive government help." As Troost & Berger (2020) observe, the model would likely not pass conventional structural and empirical validation: key modelled processes do not correspond to our best knowledge of their real-world counterparts (in reality, farmers do not behave as fully rational optimisers with perfect foresight, and networks typically discriminate by family ties, ethnicity, etc.). The model will almost surely overestimate observed farm incomes in the past. Nevertheless, the conclusions would withstand such criticism, because accurately predicted farmer or network behaviour is not a relevant premise of the argument here.
In this case, the premise that would need to be challenged in validation is that the model calculates the best possible reaction in economic terms. Empirically, this could be done, for example, by searching for observed cases for which the model predicts worse than observed outcomes. One might also identify other unexpected deviations, e.g. larger farm holdings having higher per-area incomes than smaller ones, which might be observed in the data but not in the model (or vice versa) and which are not expected to be caused by imperfect optimisation of real-world farmers alone. Nevertheless, even if the intention is not to show accurate prediction, premises on representativity, sampling error and identifiability also apply here. Structural validation could, for example, assess whether assumed constraints are overly pessimistic or whether alternative production, safety or income options that might become available with climate change have been omitted. Troost & Berger (2020) further observe that if, in contrast, the analysts find that their computational agents fare well, it would be a logical fallacy to conclude that real-world agents will fare well based on the same premises. Such an argument would require different premises that are much more difficult to support using a model with a clear upward bias. Both cases use the same model in the same empirical context to address the same motivating research question. This illustrates that judging a model's adequacy requires a very precise definition of its empirical context and of the exact argumentative premise it is supposed to support.

A protocol for context-adequate agent-based simulation
Summarising the previous section, sound conclusions from simulation analysis require (i) a logically valid structure for a scientific argument targeted at a carefully defined research question; (ii) a convincing use of models and methods of analysis to support the premises of the argument; and (iii) a transparent evaluation of whether the preconditions for the use of the chosen models and methods hold in the specific modelling context.
This modelling context consists of the purpose (research question) and the available knowledge and data about the modelled system. We identified eleven dimensions, derived from the modelling context, which influence an adequate choice of models and methods. To improve clarity about the distinct possible reasons for similar method choices, it is useful to distinguish between dimensions that can be derived directly from the research question alone (Fig. 1 a-f) and those that require a more in-depth analysis of the relationship between research question and system knowledge and data during the modelling process (Fig. 1 g-k). In the following sections, we sketch a protocol (Fig. 1), a set of questions for each stage of simulation analysis, that helps characterise the modelling context (3.1) and guides the choice of context-adequate methods (3.2) based on these dimensions. Where available, we list formal methods of analysis with useful references and highlight the premises for their applicability. The protocol is organised in 12 steps and emphasises the documentation and consistent propagation of uncertainty through the modelling process, so that the robustness of the final conclusions can be comprehensively assessed (3.3).

Context: Defining the modelling context
The first step is to characterise the modelling context: the precise research question and the knowledge and data that is available about the system being modelled.

Precisely define the research question (Step 1)
A research question typically arises from a larger debate, discourse or decision problem: for example, a public, political or scientific debate, a participatory planning problem or an economic decision problem. A research question to be addressed by the simulation analysis is supposed to contribute to this debate, even if answering it may not necessarily resolve the whole debate. Useful contributions can comprise very different questions (Edmonds et al. 2019; Epstein 2008): for example, detailed, precise forecasts of future states of the world, statistical testing of explanatory models, exploring and stress-testing possible consequences of decision options (Berger & Troost 2014; Lempert 2019), or purely theoretical questions concerning hypothetical models themselves (theoretical exposition in the sense of Edmonds et al. 2019). It is paramount to be clear about what precise question the simulation analysis is supposed to answer, and what precise argument it is meant to contribute to the debate.

Characterise requirements implied by the research question (Step 2)
Table 1 (a) provides guiding questions for identifying six dimensions of the modelling context from the research question itself, without yet considering data or system knowledge. The most basic consideration is the focus of interest: does it lie in anticipating system output for specific situations or in describing or understanding system structure? Carefully defining the target situations is a necessary precondition for judging the degree of generalisation in the next step. Required resolution and required transparency, as well as computational resource constraints, impose limits on a priori model selection. Judging the robustness of conclusions requires understanding the required precision and accuracy (tolerable uncertainty) in simulation outcomes. At this point, it is often not yet possible to formulate this quantitatively (e.g. that a 2% deviation is acceptable); it should instead be formulated in terms of consequences for the conclusions (e.g. uncertainty should not affect the ranking of policy alternatives by the evaluation criteria).

Identify knowledge and data about structure and behaviour of the modelled system (Step 3)
In addition to the research question, the modelling context is defined by the available information about the simulated system in the form of structural and process knowledge, available observations of system behaviour (input-output trace data) and, in the case of an output focus, the anticipated system input data for the target situations. The next step is to identify which data, information and knowledge are available, can be obtained with reasonable effort, or will remain unattainable for the analysis (e.g. input-output observations of far-future system states) (Tab. 1b).

Context-adequate model and parameter selection and uncertainty documentation
Appropriate simulation models can be selected in two steps: in a first, structural step, a set of candidate models and candidate parameter sets is constructed or identified whose theoretical characteristics comply with structural system knowledge and the requirements implied by the modelling context. A set of multiple candidates fulfilling the requirements represents the prior model uncertainty [3]. In a potential second step, inference by inverse modelling on data of observed system behaviour can be used to ascribe empirical likelihood to the candidates, rank them and narrow down the candidate set, reducing prior to posterior model uncertainty (Beck et al. 1997).
Under suitable conditions, the two steps complement each other: the first step is key to ensuring that only adequate candidates are considered in inverse modelling. Omitting this theory-based pre-selection can only be adequate if the simulation analysis is output-focused and the modelling context allows for the direct generalisation of statistical relationships (namely the expected predictive accuracy) to the target situations (representative and sufficiently redundant data). Only in this specific case can expected out-of-sample predictive accuracy and practical identifiability be derived solely from the data and serve as sufficient criteria for model selection (Polhill & Salt 2017). Nevertheless, even in these direct generalisation cases, incorporating structural knowledge in the chosen candidate models becomes more essential the scarcer the data: a defensible structure-based error model specification and pre-selection of candidate models increases practical identifiability.
For the second step, it is key to ensure the adequacy of the inverse modelling process itself. Do the necessary preconditions discussed in section 2.1 hold in the given modelling context? Is the specific method chosen appropriate for the context? Is uncertainty properly considered and documented? If not, model inference by comparison to observed system behaviour is clearly not adequate.
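The complementarity of the two steps can be sketched as follows (a minimal illustration in which the candidate descriptors, toy models, data and loss function are all invented assumptions, not a prescribed implementation): candidates are first filtered on structural criteria, and only the remaining ones are confronted with observed behaviour.

```python
import numpy as np

# Hypothetical candidate set: each entry records structural properties asserted
# in step one and a callable to be confronted with data in step two.
candidates = [
    {"name": "M1", "covers_target_processes": True,  "meets_resolution": True,
     "model": lambda x, p: p * x},
    {"name": "M2", "covers_target_processes": True,  "meets_resolution": True,
     "model": lambda x, p: p * np.sqrt(x)},
    {"name": "M3", "covers_target_processes": False, "meets_resolution": True,
     "model": lambda x, p: p * x ** 2},
]

# Step 1: structure-based pre-selection -- only candidates consistent with
# structural knowledge and context requirements enter inverse modelling.
prior_set = [c for c in candidates
             if c["covers_target_processes"] and c["meets_resolution"]]

# Step 2: empirical ranking of the remaining candidates on observed behaviour
# (toy data and a simple squared-error loss, for illustration only).
x_obs = np.array([1.0, 2.0, 3.0, 4.0])
y_obs = np.array([2.1, 3.9, 6.2, 7.8])

def loss(candidate, p):
    return np.mean((candidate["model"](x_obs, p) - y_obs) ** 2)

for c in prior_set:
    p_grid = np.linspace(0.1, 5.0, 200)
    losses = [loss(c, p) for p in p_grid]
    print(c["name"], "best p:", round(p_grid[int(np.argmin(losses))], 2),
          "loss:", round(min(losses), 3))
```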

ABM as composite models: Structuring component context (Step 4)
Regarding the application of agent-based simulation, the first question in structural model choice is certainly whether an ABM suits the given modelling context at all. ABM are typically composite models (model systems), composed of lower-hierarchy models that mirror relevant subsystems and processes. For example, they typically contain a model of individual agent behaviour based on the internal state of, and external influences on, the agent. This submodel for agent behaviour may in turn itself be a composite of lower-hierarchy components, e.g. for learning, demographics and economic decisions (Schlüter et al. 2017). ABM also typically contain models of agent interactions, e.g. communication, market, auction, collective action or network models (Schreinemachers & Berger 2011). In addition, many ABM in natural resource management link to biophysical components that model responses of natural systems (e.g. a crop field or watershed) to agent intervention (Arnold et al. 2015) [4]. System behaviour in an ABM emerges not only from the interactions between agents, but conceptually also from the interactions of the individual model components. In general, such structure-rich composite models are typically used for structure-focused analysis or for output-focused analysis when direct generalisation from observed data is not possible (Nolan et al., 2009; Voinov & Shugart 2013). In direct generalisation contexts, prediction is often achieved more efficiently with statistical or machine learning models (Polhill & Salt 2017) [5]. The adequacy of a composite model relies on (i) an assembly of components that together fulfil the relevant premises for the overall research question to be answered, (ii) a careful assessment of the adequacy of each lower-hierarchy component for its intended role in the composite, and (iii) a consistent consideration of the uncertainty in each component at the composite level (Arnold et al. 2015).

[5] This does not imply that ABM cannot be used in direct generalisation contexts; there may often just be more efficient approaches.
It is important to realise that each component has its own specific question to answer and its own specific modelling context, which may differ considerably from the modelling context of the composite as a whole or from that of other components. For example, even if the overall modelling context is not apt for direct statistical inference, this does not rule out that within-model contexts of lower-hierarchy components exist in which representative samples even allow for the use of machine learning components. For example, we may not yet have observed how a specific group of farmers behaves and fares in a warmer climate, so we cannot empirically measure the predictive performance of a composite model that simulates potential future farmer behaviour and welfare. We may, however, be able to include a plant growth component in this composite model that can be tested based on observations and experiments in a range of warmer and colder regions, if we consider this range representative for potential future growth conditions (Troost et al. 2020).
The next step hence is to structure the overall modelling task into subcomponents and then recursively revisit the steps of the protocol for each component individually [6]. The whole process may require iteratively moving back and forth between composite and components through steps 4-10 until an adequate composite structure for the overall modelling context has been established (Tab. 2, step 4).

[6] Especially for inverse modelling, it may be useful to subdivide the composite into observational units that do not necessarily correspond to lower-hierarchy models, but may use different boundaries if that, for example, allows exploiting better identifiability from subsystem input-output datasets.

Representativity of data and degree of generalisation (Step 5)
The next step (Tab. 2, step 5) in choosing an adequate model or model component is to contrast the observed or observable data with the target situations to determine the degree of generalisation and extrapolation implied: can the observed sample of system behaviour be considered representative for the target situations? Are there regime shifts, non-stationarities or structural breaks, or can statistical relationships be considered stable between observed and target situations? Are all relevant system states represented in the data with sufficient probability? This analysis requires a basic system conceptualisation (not yet a full conceptual model) that allows judging the system's degree of openness, internal stability, complexity and stochasticity.

Choosing structurally adequate candidate models and prior parameter ranges for each component (Step 6)
Table 2 also indicates selected literature sources that expand on the relevant theory or suggest formal tests for the assessment of these questions.
Logical consistency, correct technical implementation, and fit to the required resolution and resource constraints are obvious preconditions for candidate models that have to be carefully assessed even if the component context allows for direct generalisation.
For adequate structure-based model selection, it is useful to first sketch a comprehensive conceptual system model, even if not all system processes can or eventually have to be included in the simulation model. This conceptual sketch can serve as a benchmark to check a candidate's match with the domain of applicability and the sufficient completeness of processes for the target situations (Parker et al. 2008). It has to be ensured that the model structure and the parameters fixed in a candidate are also expected to be constant (no change over time) and invariant (unaffected by policy, treatment or the change to the target situation) (Lucas 1976; Engle & Hendry 1993; Hendry 1996). Relevant changes between situations must be captured as exogenous input or result from internal feedback in the model. It is not always possible to explicitly simulate all potential real-world feedback in the model itself, but it should then at least be possible to capture potential feedback as changing boundary conditions that can later be assessed in uncertainty analysis (Troost & Berger 2015b).
Expected deviations, i.e. the part of the system behaviour that is not explained or predicted by the model from a theoretical point of view, should be consistent with the precision and accuracy required by the research question. Research questions requiring accuracy with respect to an absolute reference necessitate not only a high degree of model completeness with respect to all systematic processes, but also probability distributions for the unsystematic effects as well as reliable system input data for the target situations. Research questions requiring accuracy only with respect to the relationships between simulated target situations demand model completeness only with respect to systematic differences.
Simplifying assumptions (such as optimising agents in our example) may lead to systematic over- or underestimation (bias). This is not problematic as long as the major conclusions drawn from the simulation analysis do not depend on such simplification (robustness to the relaxation of simplifying assumptions, no model artefacts) [7]. Conclusions that are based on comparing model results to asymmetrical, one-sided thresholds even get stronger if the methodological approach is biased against them. Conversely, they are weakened by biases in their favour, especially if these cannot be precisely quantified and corrected [8].

[7] The "Lucas critique" (Lucas 1976) is a famous example in economics of a challenge to modelling practice based on these grounds.
[8] This principle mirrors the conservative rationale in statistical hypothesis testing: type II errors (false negatives) are preferred over type I errors (false positives).

Documenting prior and input data uncertainty and assessing structural identifiability (Steps 7, 8)
Structure-based model selection typically results in a number of plausible model structures and parameter values, and this prior uncertainty should be documented (even if not all plausible alternatives can be implemented and tested). The first step in determining whether data-driven model inference (calibration, model selection) can help reduce this prior uncertainty is to assess the structural identifiability of the candidates in the observed range of data, i.e. to analyse whether the behaviour of candidate models differs in the domain for which the data is representative. A variety of analytical and numerical approaches to assess structural identifiability exists (Guillaume et al. 2019; Chis et al. 2012), including numerical parameter screening methods from sensitivity analysis (Campolongo et al. 2007; Troost & Berger 2015a).
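A minimal numerical sketch in the spirit of such screening methods is given below; the toy model, parameter bounds and thresholds are illustrative assumptions, and the hypothetical run_model stands in for a candidate simulation restricted to the observed input domain. Each parameter is perturbed around random base points, and parameters whose variation leaves the simulated output in the observed domain (practically) unchanged are flagged.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in for running a candidate model on the observed input
# domain and returning the output that would be compared to the data.
def run_model(params):
    a, b, c = params
    x = np.linspace(0.0, 1.0, 20)          # observed input range
    return a * x + b * x ** 2 + 0.0 * c    # 'c' has no effect in this range

bounds = np.array([[0.0, 2.0],   # a
                   [0.0, 2.0],   # b
                   [0.0, 2.0]])  # c

# Crude one-at-a-time screening: perturb each parameter from random base points
# and record how strongly the simulated output responds within the observed
# domain. Parameters with (near-)zero effect cannot be identified from the data.
n_base, delta = 30, 0.1
effects = np.zeros(3)
for _ in range(n_base):
    base = rng.uniform(bounds[:, 0], bounds[:, 1])
    y0 = run_model(base)
    for i in range(3):
        pert = base.copy()
        pert[i] = min(pert[i] + delta * (bounds[i, 1] - bounds[i, 0]), bounds[i, 1])
        effects[i] += np.mean(np.abs(run_model(pert) - y0))
effects /= n_base

for name, e in zip(["a", "b", "c"], effects):
    flag = "likely unidentifiable in the observed domain" if e < 1e-9 else ""
    print(f"mean effect of {name}: {e:.4f} {flag}")
```

As discussed in the next paragraph, a parameter flagged as non-influential in the observed domain may still matter in the target situation, so fixing it for calibration must be reversed for predictive simulation.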
In addition to uncertain model structures and parameters, uncertain auxiliary assumptions must also be documented and represented as parameters (e.g. error distributions for expected deviations, imputation to deal with incompleteness in the data, alternative choices in data curation, preparation or aggregation) that may decrease identifiability. Structural identifiability in the data can differ considerably between different groups of parameters or model components. For example, parameters that relate short-term agent behaviour to static characteristics can be estimated from sufficiently heterogeneous cross-sectional data, whereas parameters that affect dynamic behaviour or cumulative development over several periods require panel data (Troost & Berger 2020). Parameters that affect the probability of low-probability events can only be identified if enough low-probability events have been observed (Filatova et al., 2016). Structural non-identifiability cannot be resolved by more of the same data; it requires either widening the range of situations observed or adding dimensions to the data. Under certain conditions, unidentifiable parameters may be temporarily fixed to allow identification of other components. However, the fixing has to be reversed for later predictive simulation in order not to obscure model uncertainty (non-influence in the observed domain does not necessarily mean non-influence in the target situation; see the example in Troost & Berger 2015a).

Choosing adequate methods for model inference and measurement of predictive accuracy (Step 9)
If structural identifiability is given or direct generalisation is possible, one can choose an adequate method for data-driven model inference. If not, it is often still useful to measure the sample predictive accuracy of candidates and compare it against a null model to ensure the models do not go completely astray.
Inverse modelling employs algorithms to characterise the distribution of a loss function over candidates (exploration/estimation of the posterior parameter distribution) or to find the candidate with the optimal loss function value (optimisation, calibration). Available methods differ considerably in the extent to which uncertainty in the selection process is characterised and to which prior uncertainty is considered (Table 3).
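A minimal sketch of such an exploration in the spirit of rejection-based Approximate Bayesian Computation is given below; the toy data, toy model, prior ranges and tolerance are illustrative assumptions. Candidate parameterisations are sampled from the prior ranges, and those whose loss against the observations falls below a tolerance are retained as an approximation of posterior model uncertainty.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy 'observed' behaviour and a toy candidate model (illustrative only).
x_obs = np.linspace(0, 10, 25)
y_obs = 1.8 * x_obs + 3.0 + rng.normal(0, 1.0, x_obs.size)

def simulate(slope, intercept):
    return slope * x_obs + intercept

# Prior model uncertainty: plausible ranges from structure-based selection.
n_draws = 20000
slopes = rng.uniform(0.5, 3.5, n_draws)
intercepts = rng.uniform(0.0, 6.0, n_draws)

# Loss function: root mean squared deviation from observations.
rmse = np.array([np.sqrt(np.mean((simulate(s, i) - y_obs) ** 2))
                 for s, i in zip(slopes, intercepts)])

# Rejection step: accept candidates whose loss stays below a tolerance.
tol = 1.5
accepted = rmse < tol
print(f"accepted {accepted.sum()} of {n_draws} candidate parameterisations")
print("posterior slope range:     ",
      np.round(np.percentile(slopes[accepted], [5, 95]), 2))
print("posterior intercept range: ",
      np.round(np.percentile(intercepts[accepted], [5, 95]), 2))
```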

Adequate choice of loss function or likelihood
Loss functions are used to weight deviations between simulations and observations by severity. From a decision-theoretic point of view, loss functions should more strongly penalise those errors that would lead to stronger changes in conclusions. Hence, in principle, loss functions can be specified to directly reflect the precision, accuracy, relativity and symmetry required by the research question and to penalise misclassifications based on their practical implications (e.g. prefer models with stronger deviations overall, but high reliability in critical areas) (Manderscheid 1965; Berger 1980; McCloskey 1985; Farahmand et al. 2017; Manski 2019). In direct generalisation cases, and when sampling error has been controlled for (e.g. by cross-validation, see below), the measured loss can also be directly generalised to the target situations.
In indirect generalisation cases and in structure-focused analysis, loss functions must reflect the impact of model errors on our confidence that the candidate reflects the underlying system processes. In this case, loss functions should reflect the expected deviations of the model, including sampling error, model bias and error correlation (Schoups & Vrugt 2010): theoretically anticipated deviations of candidate models are considered less severe than deviations unlikely to occur if the model predicts according to its theoretically expected precision (Hansen & Heckman 1996; Blavatsky & Pogrebna 2010). For example, if a model is designed to predict an upper bound, underestimation of observations should be penalised, overestimation should not [9]. If the model is expected to be well specified and implies a well-defined, tractable stochastic error distribution, a parametric likelihood function can be formulated. Using parametric likelihoods in cases where their underlying assumptions are not fulfilled or are in doubt leads to biased model selection and overconfident conclusions (Beven et al. 2008; Stedinger et al. 2008). Robust loss functions allow for occasional outliers potentially generated by processes not captured in the model (Willmott & Matsuura 2005; Hyndman & Koehler 2006). If the model is expected to capture the essential systematic relationship, but the exact error distribution is unknown or intractable, summary statistics that capture the relevant systematic relationships can be estimated on both observations and model output. A loss function can then be applied to the difference in the summary statistics rather than to the individual observations (classical and Bayesian indirect inference: Chen et al. 2012; Beaumont 2010; Drovandi et al. 2015). Pattern-oriented modelling generalises this principle to incorporate more qualitatively described strong and weak statistical patterns (Grimm & Railsback 2012). In other cases, qualitative criteria are used to define binary-valued acceptance functions (Spear & Hornberger 1980; Troost & Berger 2015a).
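The following toy definitions sketch how different expectations about deviations translate into different loss functions; the functional forms and the numbers are illustrative assumptions, not recommendations for a specific application.

```python
import numpy as np

# Illustrative loss functions reflecting different expectations about deviations.

def squared_loss(sim, obs):
    # Assumes roughly symmetric, well-behaved errors; sensitive to outliers.
    return np.mean((sim - obs) ** 2)

def absolute_loss(sim, obs):
    # More robust to occasional outliers generated by unmodelled processes.
    return np.mean(np.abs(sim - obs))

def upper_bound_loss(sim, obs):
    # For a model designed to predict an upper bound (e.g. optimising agents):
    # only penalise observations that exceed the simulated bound.
    return np.mean(np.maximum(obs - sim, 0.0) ** 2)

obs = np.array([10.0, 12.0, 9.0, 30.0])   # last value: an outlier
sim = np.array([11.0, 13.0, 10.0, 12.0])  # candidate intended as an upper bound

print("squared:    ", squared_loss(sim, obs))
print("absolute:   ", absolute_loss(sim, obs))
print("upper bound:", upper_bound_loss(sim, obs))
```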
Pure loss functions and likelihoods provide a relative ranking between candidate models, but their absolute values are specific to the sample used. Absolute goodness-of-fit measures (e.g. model efficiencies) take the sample variance into account in order to allow comparison between models estimated from different samples (Bennett et al., 2013; Hauduc et al., 2015). Implicitly, efficiency criteria compare models with a benchmark or null model that employs only basic information in the data. R² and model efficiency, for example, contain the sample average as the null model. However, the sample average is only one possible choice for the null benchmark. Trend extrapolation, random allocation, or seasonal or group-specific averages can often be more adequate benchmarks (Schaefli & Gupta, 2007; Pontius & Millones 2011). As an alternative, Grimm & Railsback (2012) suggest always explicitly including a benchmark null model among the candidates.
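A minimal sketch with invented numbers illustrates how strongly the choice of null benchmark can affect an efficiency-type score: the same candidate output appears highly skilful against the sample mean but adds nothing beyond a seasonal-average benchmark.

```python
import numpy as np

def efficiency(sim, obs, benchmark):
    # Skill relative to a benchmark: 1 = perfect, 0 = no better than benchmark,
    # negative = worse than the benchmark (analogous to model efficiency / R2).
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - benchmark) ** 2)

# Toy observations with a strong seasonal pattern (illustrative assumption).
season = np.array([0, 0, 0, 1, 1, 1])
obs = np.array([2.0, 2.2, 1.9, 8.1, 7.9, 8.3])
sim = np.array([2.5, 2.4, 2.3, 7.5, 7.4, 7.6])   # candidate model output

mean_null = np.full_like(obs, obs.mean())                             # sample average
seasonal_null = np.array([obs[season == s].mean() for s in season])   # seasonal average

print("efficiency vs sample mean:   ", round(efficiency(sim, obs, mean_null), 3))
print("efficiency vs seasonal mean: ", round(efficiency(sim, obs, seasonal_null), 3))
```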

Adequate assessment of practical identifiability and posterior uncertainty
It is paramount to document uncertainty in measured predictive accuracy and model rankings and to assess how reliably the data could discriminate between candidates. Classical least-squares or maximum-likelihood-based parameter estimation identifies one best-fitting model and quantifies posterior uncertainty in the form of confidence intervals for parameters. This quantification is very limited: it presupposes that both the likelihood and the model structure are certain and correctly specified and that all considered candidate parameterisations are a priori equally likely (Stigler 2007). Moreover, while large confidence intervals point to low practical identifiability, they cannot conceptually be interpreted as posterior probabilities for parameters. Bayesian frameworks (Hobbs & Hilborn 2006) can overcome the latter limitation if formal prior probabilities and certain parametric likelihoods can be specified.
Predictive accuracy measured in a sample is a biased measure of expected predictive accuracy out of sample: it favours models with a higher number of freely adaptable parameters, which increases the danger of overfitting. Adequate model inference requires correcting this bias, which can be achieved by the use of information criteria (AIC, BIC, DIC, WAIC) or appropriately specified prior probabilities in formal Bayesian frameworks (Burnham & Anderson 2004; Ward 2008; Vehtari et al. 2017). Both require parametric likelihoods.
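A minimal sketch with toy data illustrates this optimism and two ways of correcting it: parametrically via an information criterion (here an AIC under an assumed Gaussian error model) and non-parametrically via the k-fold cross-validation discussed in the next paragraph. The polynomial candidates and all numbers are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy sample: the true relationship is linear plus noise (illustrative only).
x = np.linspace(0, 10, 30)
y = 2.0 * x + 1.0 + rng.normal(0, 2.0, x.size)

def fit_predict(x_train, y_train, x_eval, degree):
    coefs = np.polyfit(x_train, y_train, degree)
    return np.polyval(coefs, x_eval)

def aic_gaussian(y_obs, y_sim, k):
    # AIC under an assumed Gaussian error model: n*log(SSE/n) + 2k,
    # where k is the number of fitted coefficients.
    n = y_obs.size
    sse = np.sum((y_obs - y_sim) ** 2)
    return n * np.log(sse / n) + 2 * k

def kfold_mse(x, y, degree, folds=5):
    # Non-parametric alternative: average out-of-fold prediction error.
    idx = np.arange(x.size)
    mse = []
    for f in range(folds):
        test = idx % folds == f
        pred = fit_predict(x[~test], y[~test], x[test], degree)
        mse.append(np.mean((y[test] - pred) ** 2))
    return np.mean(mse)

for degree in (1, 4):
    in_sample = np.mean((y - fit_predict(x, y, x, degree)) ** 2)
    print(f"degree {degree}: in-sample MSE {in_sample:.2f}, "
          f"AIC {aic_gaussian(y, fit_predict(x, y, x, degree), degree + 1):.1f}, "
          f"5-fold CV MSE {kfold_mse(x, y, degree):.2f}")
```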
Cross-validation [10] and bootstrapping are the essential non-parametric alternatives to obtain unbiased estimates of expected predictive accuracy from a sample (Browne 2000; Arlot & Celisse 2010; Bennett et al. 2013; Vehtari et al. 2017). Statistical diagnostics for influential observations (e.g. Cook's distance) and multicollinearity in the data (e.g. variance inflation factors), common in econometric analysis, should complement the analysis.

[10] The traditional separation of data into one training and one validation dataset is the most basic form of cross-validation, but is itself subject to sampling error. K-fold cross-validation is the more robust extension.

Figure 1 illustrates how an adequate modelling process structures, quantifies and potentially reduces uncertainty: the definition of a research question divides uncertainty regarding the research question from uncertainty about wider implications in the debate. Theory-based model selection structures the uncertainty about the research question into prior model uncertainty (represented by different candidate model structures and parameter ranges), input uncertainty (uncertainty in boundary and initial conditions), expected deviation (error terms, bias, aleatory uncertainty) and unmodelled uncertainty (alternative models not included in the analysis [11], processes that have been ignored, potential exogenous events not considered, unformalised error terms, unforeseeable events, critical assumptions for which no alternatives are tested, etc.). If applicable and successful, model inference potentially reduces prior model uncertainty to posterior model uncertainty. If discrimination of candidate models by the data is not possible, the posterior uncertainty remains the same as the prior uncertainty.

[11] Brenner & Werker (2007) emphasise the inclusion of "all logically possible" parameter values and model structures consistent with structural and empirical knowledge. We recognise that this is often not feasible in practice; however, what is left out then needs to be recognised as unmodelled uncertainty and appropriately discussed when deriving conclusions.

Adequate derivation and interpretation of simulation results and uncertainty
In structure-focused analysis (description, explanation), the resulting posterior model uncertainty is already the final uncertainty to be interpreted for conclusions. In output-focused analysis (prediction, scenario analysis, exploration), posterior uncertainty and input uncertainty still need to be translated into output or predictive uncertainty for target situations (e.g. future or policy scenarios) by simulation experiments that include uncertainty analysis.
In an adequate modelling process, in which uncertainty is properly analysed and propagated, the final posterior/predictive uncertainty and the unmodelled uncertainty describe the actual state of knowledge regarding the research question that can be defensibly extracted from the available data and structural system knowledge. This final model uncertainty can then be compared with the precision required by the research question for interpretation and derivation of conclusions.

Interpretation of predictive accuracy and posterior uncertainty (Step 10)
If sampling error has been properly controlled for (e.g. by cross-validation), expected predictive accuracy indicates how well the model predicts or explains the variation in the population of situations for which the sample is representative (subject to the importance weighting embodied in the likelihood or loss function). This is valuable information in its own right. However, care has to be taken when using this information to draw further conclusions, e.g. about the model being "sufficiently good" or the "correct" or "best" explanation (Oreskes et al. 1994). Even though absolute goodness-of-fit measures such as model efficiencies project the predictive error onto an absolute scale between a null model and a perfect fit, defining any threshold to indicate 'sufficient fit' on this scale remains subjective or based on convention (similar to significance levels in statistical analysis), unless the threshold can be convincingly derived from the research question and its encompassing debate (Pontius & Millones 2011). The same holds for thresholds defined on posterior densities or on relative differences in information criteria (Stephens et al. 2005).
The well-known problems of induction, under-determination and theory-ladenness imply that proving by comparison to observation that a model is the 'true' model is ultimately impossible (Oreskes et al., 1994; Quine, 1951). Expected predictive accuracy provides a relative ranking and allows identifying the "best" among the candidate models for the given sample. The more comprehensive the list of candidate models and parameterisations that has been tested and the more representative the sample, the higher the confidence can be in having identified a generalisable best model or parameterisation. Like all other statistical relationships, measured expected predictive accuracy cannot be generalised to target situations across structural breaks.
Uncertainty in inference can be quantified as a posterior probability for the candidates if a formal Bayesian framework with proper prior probabilities and an appropriate likelihood has been used in inverse modelling. However, also in those cases where posterior probabilities or credible intervals cannot be derived, it is important to consider posterior uncertainty and to recognise that the "best" model does not necessarily have, or even approach, a posterior probability of one (Troost & Berger 2015a). The potential explanatory and predictive power of alternatives should not be neglected in interpretation. If the analysis is structure-focused and interested in which model provides the better explanation, it remains inconclusive whenever two alternative models cannot be robustly discriminated by the data, or it needs to employ additional theoretical considerations, e.g. parsimony as a philosophical principle [12] or correspondence to established theory, to justify a decision for one or the other model. In output-focused analysis, subsequent predictive simulation should use the full posterior distribution, consider confidence or credible intervals, or at least use a representative ensemble of all candidates that show non-negligible explanatory power (ensemble modelling, model averaging).

[12] Parsimony as a philosophical principle (simpler models are always to be preferred) differs from a pragmatic argument for parsimony in estimating models for prediction (simpler models are less prone to overfitting).
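The recommendation to propagate the whole weighted candidate set rather than only the single best candidate can be sketched as follows; the AIC values, the Akaike weighting and the target predictions are invented for illustration and represent only one of several possible weighting schemes.

```python
import numpy as np

# Toy posterior information for three candidate models (illustrative numbers):
# information-criterion values from model inference and each candidate's
# prediction for a target situation.
aic = np.array([100.0, 101.5, 108.0])
target_predictions = np.array([12.0, 18.0, 9.0])

# Akaike weights: relative evidence for each candidate given the sample.
delta = aic - aic.min()
weights = np.exp(-0.5 * delta) / np.sum(np.exp(-0.5 * delta))

best_only = target_predictions[np.argmin(aic)]
ensemble_mean = np.sum(weights * target_predictions)
supported = target_predictions[weights > 0.05]   # candidates with non-negligible support

print("weights:          ", np.round(weights, 3))
print("best model only:  ", best_only)
print("ensemble average: ", round(ensemble_mean, 2))
print("prediction range of supported candidates:", supported.min(), "-", supported.max())
```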

Analysis and interpretation of predictive uncertainty (Step 11)
Only in rare cases will it be permissible to directly generalise expected predictive uncertainty from inverse modelling to the target situation (representative sample, negligible input uncertainty, one clearly best model). Generally, comprehensive uncertainty and sensitivity analysis is necessary. Uncertainty analysis must be global, i.e. cover the full range of potential input values including interactions and correlations between input factors (Saltelli & Annoni, 2010). A considerable number of approaches for efficient uncertainty analysis is available, adapted to different model complexities and available computational resources (Helton et al., 2006; Saltelli et al., 2008; Gramacy & Lee 2009; Troost et al. 2022). Stochastic models require sufficient repetitions and statistical comparison tests or, more efficiently, common random numbers schemes to isolate systematic differences from stochastic ones (Stout & Goldie 2008; Troost & Berger 2016).
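Why uncertainty analysis must be global rather than purely one-at-a-time can be illustrated with a deliberately simple toy response (an illustrative assumption, not a real ABM output): varying each factor separately around a base point can completely miss output variation that is generated by interactions and only becomes visible in a global sample.

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy response with a strong interaction between two uncertain factors:
# the effect of x1 depends entirely on x2 (illustrative assumption).
def response(x1, x2):
    return x1 * x2

# One-at-a-time analysis around a base point of (0, 0) sees no effect at all ...
base = 0.0
oat_effect_x1 = response(1.0, base) - response(-1.0, base)
oat_effect_x2 = response(base, 1.0) - response(base, -1.0)

# ... whereas a global sample over the full ranges reveals the output variation
# generated by the interaction of the two factors.
x1 = rng.uniform(-1, 1, 10000)
x2 = rng.uniform(-1, 1, 10000)
global_outputs = response(x1, x2)

print("OAT effects around base point:", oat_effect_x1, oat_effect_x2)
print("global output range:", round(global_outputs.min(), 2),
      "to", round(global_outputs.max(), 2))
```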
Predictive uncertainty for a target situation is a function of the uncertainty about the systematic effect of system input on behaviour that is captured in the set of models and parameterisations (posterior model uncertainty), the model error (bias and unsystematic, aleatory uncertainty) and the uncertainty in system inputs (e.g. scenarios, boundary conditions) for the target situations. Building on the considerations of Marchau et al. (2019) and Walker et al. (2003), Table 5 lists which forms of predictive simulation output are adequate depending on the level of uncertainty in each of these dimensions. Unconditional predictions require low uncertainty in all 'locations' of uncertainty. Probabilistic predictions require probability information in all locations. Simulation analysis can, however, also provide useful insights if uncertainty is high in one or all locations. It is key that the exploration of predictive uncertainty focuses on the output quantity, precision and resolution relevant to answering the targeted research question. When we compare two target situations, we can distinguish the apparent (or observable) difference, i.e. the difference between two predictions that includes unsystematic, stochastic effects, from the systematic difference, i.e. the difference between two predictions controlled for unsystematic effects. In many decision support situations, the systematic difference is much more relevant than the apparent one: the future may not be precisely predictable, but for a good decision it is enough if the systematic differences caused by decision options can be pointed out (Berger & Troost 2014) and strategies that are robust under many different scenarios and assumptions can be detected (Marchau et al. 2019; Lempert 2019).
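The distinction between apparent and systematic differences can be sketched with a toy stochastic model (the policy effect, noise level and numbers of repetitions are illustrative assumptions): running baseline and policy scenario with common random numbers isolates the systematic difference, whereas independently seeded runs mix it with stochastic variability.

```python
import numpy as np

# Toy stochastic model of an outcome under a policy scenario (illustrative):
# the systematic policy effect is small relative to stochastic variability.
def simulate(policy_effect, seed):
    rng = np.random.default_rng(seed)
    shocks = rng.normal(0, 5.0, 100)          # weather/market shocks etc.
    return np.mean(50.0 + policy_effect + shocks)

seeds = range(200)                             # repetitions

# Common Random Numbers: run baseline and policy with identical seeds,
# so stochastic effects cancel in the pairwise differences.
baseline = np.array([simulate(0.0, s) for s in seeds])
policy = np.array([simulate(1.0, s) for s in seeds])
paired_diff = policy - baseline

# Without CRN (independent seeds), the same systematic difference is buried
# in stochastic noise and much harder to isolate.
policy_indep = np.array([simulate(1.0, s + 10_000) for s in seeds])
unpaired_diff = policy_indep - baseline

print("systematic (paired, CRN) difference:", round(paired_diff.mean(), 3),
      "+/-", round(paired_diff.std(), 3))
print("apparent (unpaired) difference:     ", round(unpaired_diff.mean(), 3),
      "+/-", round(unpaired_diff.std(), 3))
```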

Interpretation and conclusions (Step 12)
The interpretation of results should compare the final uncertainty to the required precision and accuracy of the research question. If the required certainty is reached, conclusions that are consistent with the simulated output can be considered valid and sound. If uncertainty is too high, we have to conclude that the knowledge employed in the process is insufficient for the desired type of conclusions (Carauta et al. 2021). It should not be necessary to emphasise that this is an equally valuable and relevant result (Leamer 2010).
The structure of the argument and the premises that are critical to support the conclusions must be clearly laid out. This involves not only the premises that are supported by simulation results, but also the auxiliary and hidden premises (prior model evidence, representativity of data, identifiability, posterior uncertainty).
Both unstructured uncertainty about wider implications and unmodelled uncertainty remain qualitative and unquantified in the modelling process. Nevertheless, they must form an important part of the interpretation: conclusions must be qualified with respect to the information omitted from the modelling process. Hypotheses on how omitted processes or alternative system conceptualisations could affect conclusions must be discussed (Forrester & Senge, 1980). Banerjee et al. (2016) argue for an explicit and structured 'Speculation' section about the external validity (generalisability) of results obtained from case studies. Especially when using models to inform decision-makers in the face of deep uncertainty, transparent documentation of critical and potentially value-laden fundamental assumptions (see protocols in Kloprogge et al. 2011; Saltelli et al. 2013; Fischbach & Davis 2014; van der Sluijs 2017) and additional effort to assess the robustness of decision option outcomes to these assumptions are essential (assumptions-based planning, stress testing, red teaming; Lempert 2019; Marchau et al. 2019).
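One possible way to operationalise such a stress test is sketched below: decision options are evaluated across scenarios that encode contested or value-laden assumptions, and a robustness criterion, here minimax regret, is applied. Option names, scenario labels and outcome values are purely illustrative.

```python
# Minimal sketch of a scenario stress test with minimax regret as robustness criterion.
import numpy as np

options   = ["option_A", "option_B", "option_C"]
scenarios = ["baseline", "high_prices", "policy_shift", "climate_stress"]

# outcome[i, j]: simulated performance of option i under scenario j (illustrative values)
outcome = np.array([
    [8.0, 3.0, 6.0, 2.0],
    [6.0, 5.0, 5.5, 4.5],
    [7.0, 4.0, 4.0, 3.0],
])

regret = outcome.max(axis=0) - outcome   # regret relative to the best option in each scenario
worst_regret = regret.max(axis=1)        # each option's worst-case regret across scenarios
robust_choice = options[int(worst_regret.argmin())]

print(dict(zip(options, worst_regret)))
print("most robust option under minimax regret:", robust_choice)
```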

Discussion and conclusions
Adequate conclusions from simulation analysis require a careful analysis of the logical argumentative structure and of the critical premises the conclusions build upon. Such premises rest on simulation outcomes, but are also implicit in the choice of models and methods of inference from data. The latter, in particular, are not always obvious to modellers, reviewers and addressees of simulation results. Even if, as we have demonstrated, premises in the overarching argumentative structure vary, the preconditions for the use of specific methods of analysis are invariable, and their violation makes the analysis inadequate. For example, empirical output validation and inverse modelling presuppose representativity of data, identifiability and control of sampling error. Moreover, specific methods such as maximum likelihood estimation rely on even more restrictive, not always obvious premises (see Table 3).
A number of previous studies (e.g. Edmonds et al. 2019; Epstein 2008) have highlighted how different modelling purposes require different data and methods. In the presented protocol, we have moved beyond discrete typologies of model purpose and defined concrete dimensions of the research question and the available system knowledge and data that together characterise the modelling context. The typologies of Edmonds et al. (2019), and especially terms such as prediction, forecast, projection or exploration, whose understanding and usage differ between and sometimes even within disciplines (Bray & von Storch 2009), can be mapped onto these dimensions to allow for more precise communication (see Appendix A.1).
When understood comprehensively, the process of ensuring adequate model conclusions is, however, more complex and subtler than a single-step matching of context type to method. Rather, it is a process that is hierarchical, i.e. outcomes of earlier steps affect choices in later steps (e.g. inverse modelling should not be pursued without first ensuring representative data and structural identifiability). It is recursive, i.e. in composite models the context of each component must be assessed, and iterative, i.e. outcomes of later steps may require stepping back and reconsidering earlier choices: for example, if the evaluation of structural identifiability, practical identifiability or predictive uncertainty leads to unsatisfactory results, it may be useful to go back to structure-based model selection or even to a redefinition of the research question. Where the context does not allow the original question to be answered reliably, it may still be possible to answer a more restricted question that remains useful, as illustrated in our initial example above.
The KIA protocol that we suggest in this article is intended to guide modellers in making adequate choices during the process of simulation analysis and in justifying them with adequate argumentation. It provides a guideline for reviewers, who can start from the final statement of conclusions and their premises and work backward to evaluate whether the steps taken during the modelling process adequately support the premises in the given context. Moreover, it is intended to structure documentation: as a checklist to ensure that the modelling context and the justification for all relevant modelling decisions have been discussed in the main body of an article, and as a template for well-structured tabular documentation in an appendix.
We have strived for generality in formulating the protocol. We do not advocate one common method for all ABM; rather, the dimensions of the modelling context that we introduced are intended to help identify which ABM applications share a similar modelling context and might learn from each other, and which do not. For example, Troost & Berger (2015a) and Carrella et al. (2021) both deal with unknown or intractable likelihoods for model inference. However, the former face both low structural and low practical identifiability, while the latter assume few parameters and a large number of identifying summary statistics, i.e. high practical identifiability. As both are explicit about the assumed modelling context, this can be read from their articles, but may still be easily overlooked. Our protocol is intended to highlight such differences and in this way avoid common pitfalls in discussions between modellers and reviewers about adequate and valid model use and inference: for example, to avoid discussions about an appropriate loss function when structural identifiability is the more important issue; to avoid overemphasis on the separation of training and validation data when validation data are not representative of target situations; to avoid discussions about the unreliability of unconditional predictions when these are neither possible nor necessary; and to avoid suggesting model simplification to increase practical identifiability when model complexity is required for structural reasons and direct generalisation is not adequate.
The KIA protocol mirrors and is compatible with established recommendations for a structured modelling process (e.g. Jakeman et al. 2006), but it emphasises the linkages and the propagation of uncertainty between modelling stages and highlights general criteria for the choice of adequate methods at each stage. It concretises the principle "as empirical as possible, as general as necessary" coined for ABM by Brenner & Werker (2007). It incorporates the different levels of uncertainty of Walker et al. (2003) and Marchau et al. (2019), but also explains how this uncertainty comes about in the modelling process. Similar to Polhill & Salt (2017), it highlights the importance of structural model choice compared with purely data-driven model inference. While we have not extensively discussed stakeholder participation, the protocol is meant to be open to valuable stakeholder input and feedback at any step of the process: e.g. in shaping the encompassing debate, defining the targeted research questions, providing information in model selection and inference, and shared interpretation (Barreteau et al. 2010).
At this point, the KIA protocol itself is a theory-based hypothesis that requires practical testing. We propose it to the community of agent-based modellers for adoption in model construction, documentation, and review. Its use in practice will show whether it proves useful as guidance for model development and as a communication device in documentation and review. Based on practical experience, it should then be reviewed and improved.
Many of the guiding questions listed in the tables would warrant articles of their own for exhaustive discussion. Our main intention here has been to list them comprehensively and to highlight their interlinkages. We have linked many of the guiding questions to literature providing more detailed explanations or formal assessment methods. This list of methods does not claim to be complete, and it will certainly become outdated over time as new approaches for model testing, selection or estimation are developed to deal with the questions formulated here. However, we hope that this protocol not only sparks interest in developing new methods, but also assists in clearly communicating the conditions under which they are suitable.
We believe that the principles discussed here are applicable to any modelling endeavour, and that most established disciplinary standards constitute special cases that are in principle covered by the protocol. In this sense, we expect that it can also provide guidance for non-ABM simulation facing similar challenges.