To think inside the box, or to think out of the box?
Scientific discovery via the reciprocation of insights and concepts

Yu-Zhe Shi¹,²,⋆,✉, Manjie Xu¹,²,⋆, Wenjuan Han², Yixin Zhu¹,✉

¹Institute for AI, Peking University ²PersLEARN
⋆Equal contributors ✉ [email protected], [email protected]

Abstract

If scientific discovery is one of the main driving forces of human progress, insight is the fuel for the engine, and it has long attracted behavior-level research to understand and model its underlying cognitive process. However, current tasks that abstract scientific discovery mostly focus on the emergence of insight, ignoring the special role played by domain knowledge. In this concept paper, we view scientific discovery as an interplay between thinking out of the box, which actively seeks insightful solutions, and thinking inside the box, which generalizes on conceptual domain knowledge to stay correct. Accordingly, we propose Mindle, a semantic searching game that triggers scientific-discovery-like thinking spontaneously, as infrastructure for exploring scientific discovery on a large scale. On this basis, the meta-strategies for insights and the usage of concepts can be investigated reciprocally. In the pilot studies, several interesting observations inspire elaborated hypotheses on meta-strategies, context, and individual diversity for further investigation.

Introduction

How do scientists come up with novel ideas that lead to significant discoveries? Psychologists have long worked to understand the underlying cognitive processes (Schickore, 2022) in order to facilitate the progress of scientific discovery (Campbell, 1960). Among the diverse philosophical theories interpreting discovery, the most cited one holds that discovery refers to the eureka moment, a.k.a. the Aha! moment, of having a new insight (Auble et al., 1979; Kounios and Beeman, 2009). Originating from problem-solving, insight is the process that reconstructs the representation of the target problem. Given the insight, the solution can be reached much more straightforwardly than before the reconstruction (Ohlsson, 1984). People tend to follow prior knowledge when solving a problem because experience shows this may lead to success (Öllinger et al., 2008). But after rounds of trial and error, people can predict the error of the current problem representation (Dubey et al., 2021), and this may be the eve of a sudden Aha! moment. Studies have shown that solutions discovered by insight are usually more promising than those generated by analytical approaches (Salvi et al., 2016), though the latter require much more workload than the former; this echoes how specifically adapted representations outperform prior ones on novel problems.

Meanwhile, the prior representation is also necessary, for it provides the relevant domain knowledge for the problem: insight is useless without a sense of how to deal with the problem. This contrast becomes crucial in scientific discovery. Given a scientific inquiry as the target problem, the entry point is the purported paradigm of the domain that seems relevant to the problem, which is deeply rooted in domain knowledge (i.e., atoms, theories, and claims) and shapes the meta-cognitive strategic knowledge (i.e., methodologies). Hence, representation reconstruction is extremely hard when the target problem is a scientific inquiry, because it at least requires generalizing from one scientific concept to another given few observations (Tenenbaum and Griffiths, 2001), and may even require a paradigm shift (Kuhn, 1970); but without relying on domain knowledge, a scientist can go nowhere, because all ideas come from somewhere.

The dilemma we are facing is ubiquitous in scientific discovery. On the one side, we must think out of the box so as not to miss fleeting insights; but unconventional ways of thinking may also lead to ridiculous solutions, taking a detour even compared with analytical solutions. On the other side, we have to think inside the box because domain knowledge keeps us aware of what we are doing and where we are going; but strictly following an established paradigm confines our mind to the prior representation of the problem (see Fig. 1). To reach solutions successfully and efficiently, the two mindsets should interplay with each other. When and where should this happen?

Many scientists handle such an interplay well: they gain insights from the eureka moment, which are later developed into representative scientific discoveries, such as the development of Einstein’s special relativity (Einstein, 1982) and the discovery of the Kekulé structure (Gruber, 1981). This pattern is also found in many works throughout the life of Gauss (Dunnington et al., 2004). Both historical experience and experimental results drive the interest in understanding how insight is obtained and in building computational models (Langley and Jones, 1988), to get closer to the ultimate goal: automated production of insights that improve scientific discovery. Unfortunately, stories of scientists can hardly support concrete modeling and evaluation at the behavior level, as opposed to the metaphysical level, because simulating post hoc how scientists disentangle meta-cognitive strategies from domain knowledge is difficult and imprecise. Hence, we are in need of an experimental environment that abstractly simulates the process of scientific discovery: the domain knowledge should be crucial for solving the problem and should be general enough to carry out large-scale behavioral studies (Almaatouq et al., 2021), without losing group convergence or individual diversity. To the best of our knowledge, we are the first to explicitly consider the interplay between insight-seeking and domain-knowledge-relying. Hence, we propose Mindle (visit mindle.cn to interact with the web-based user interface), a semantic searching game that triggers scientific-discovery-like thinking spontaneously, as infrastructure for exploring scientific discovery on a large scale, filling the gap in the literature.

Figure 1: Overview of insight in scientific discovery. In a classic Gestalt problem, a problem solver first uses domain knowledge to analyze the problem, then seeks insight once she gets trapped; after reconstructing the problem representation, she again uses domain knowledge to reach the solution. In this case, though domain knowledge constrains the thinking, it serves as the vehicle toward the target.

The reciprocation of insights and concepts

Based on the dilemma between insight-seeking and domain-knowledge-relying, the most critical feature that distinguishes scientific discovery from ordinary insight problem-solving is that domain knowledge plays a crucial role both in empowering the solution to be correct and in restricting the efficient emergence of insight solutions. Hence, to understand scientific discovery inside the box, we should first understand how the organization of concepts in domain knowledge affects meta-cognitive strategies. Conversely, to unveil the process of scientific discovery outside the box, we should look into how insightful decisions intervene in the use of concepts. This bidirectional pathway reflects the reciprocation of insights and concepts, captured by two questions: (1) How does a problem grounded in conceptual knowledge improve the study of insight problem-solving? (2) How does the usage of conceptual knowledge driven by insight problem-solving improve the study of knowledge representation? Below, we sketch Mindle by answering these questions.

Concepts improve the study of insight problem solving

Relying on conceptual knowledge is not an obstacle to investigating scientific discovery, but a better chance for understanding insight problem-solving. Current insight problem-solving tasks have long been troubled by subjects’ unawareness of the meta-cognitive strategies they have used (Metcalfe and Wiebe, 1987). Though this inability naturally reflects the sudden arrival of insights, it can be mitigated by improving experimental tasks. Some current insight problems focus on stimulating the eureka moment, such as the nine-dot problem (Kershaw and Ohlsson, 2004), matchstick arithmetic (Knoblich et al., 1999), and the eight-coin problem (Ormerod et al., 2002). These tasks provide a highly confined problem space such that subjects can solve the problems without applying any semantics or commonsense knowledge other than the specific background knowledge given by the problem settings. Although such designs are motivated by cleanly disentangling representation reconstruction from the solution, they make it hard to interpret the solutions from the trajectories, since the trajectories can only be mapped to the given problem settings. Hence, if we expect to map behavioral trajectories to meta-strategic knowledge, a generally understandable semantics of the problem context is necessary, such as conceptual knowledge in human language. Other insight problems introduce general semantics, including insight physical problem solving (i.e., intuitive physics as semantics) (Allen et al., 2020) and the remote association test (i.e., word association as semantics) (Mednick, 1963). However, these tasks come in a one-to-one input-output fashion, where the measured behavior is generated directly from a single stimulus in one step, which makes the representation change hard to track. Tracking this change is crucial in scientific discovery because there are usually multiple steps of insight that lead to the target (Moszkowski, 1972).
In contrast to one-shot problem solving, scientific discovery is more like a path-finding process where the navigation map changes every time a critical point is reached. Putting the two reasons together, Mindle should be equipped with generally understandable semantics in the context of the target problem.

Insight problem-solving improves the study of conceptual knowledge

The domain knowledge of the sciences is believed to be organized as conceptual knowledge (Hiebert and Lefevre, 1986; Rittle-Johnson et al., 2001). Conceptual knowledge systematically combines declarative and procedural knowledge, consisting of both facts about the concepts and active processes describing how concepts interact with each other (Abend, 2008). Though there are many perspectives on concept representation, we take the theory theory as a prerequisite because it is the most accepted theory of scientific knowledge representation (Gopnik, 1994); thus we can mimic domain knowledge in the form of the theory theory. One implementation of the theory theory maintains concepts in a fully-connected network, where each concept is related to all other concepts in the set. Many calculi on fully-connected graphs, such as the general pattern theory (Grenander, 2012), can be applied to formalize the operations over concepts. Such tools help describe scientific knowledge and meta-strategies in a computable way. In addition, the theory theory can also model how a child acquires concepts in cognitive development (Carey, 1985; Gopnik and Meltzoff, 1997; Carey, 2009); people organize everyday concepts in much the same way as they organize scientific knowledge. Hence, we may use such fully-connected networks, statistically extracted from a natural language corpus, to simulate the domain knowledge of the sciences, so that we can carry out large-scale behavioral studies. Most interestingly, there is a major feature shared by both everyday and scientific knowledge: though the semantics of concepts and relations are static and invariable when knowledge in the world is viewed holistically, they may be highly overloaded according to diverse contexts, task utility, and inner preference from the view of individuals (Wang and Bi, 2021).
In this way, the compounded concepts are projected onto simplified semantic attribute spaces (Grand et al., 2022), which are much more tractable to process given the rational use of limited cognitive resources (Gershman et al., 2015; Lieder and Griffiths, 2020; Ho et al., 2022). On this basis, conceptual knowledge in scientific discovery should be studied dynamically rather than statically. However, most current behavior-level experimental methods on the semantic understanding of concepts are confined to general and fixed contexts (Huth et al., 2016; Wang et al., 2020) and work in a straightforward way: given the stimuli (input words), they directly obtain descriptions or similarity judgments (output measurements). Other studies on the use of concepts, such as memory replay and human reinforcement learning (Momennejad et al., 2017), mostly focus on short-term memory for skill learning without retrieving from long-term memory that has already been formed. To capture the two features, Mindle should test the representation of conceptual knowledge in sequential decision-making.

In summary, we profile Mindle with two unique features to simulate the process of scientific discovery (see Tab. 1 for details), and importantly, the two work reciprocally: (1) Mindle should be equipped with generally understandable conceptual knowledge as domain knowledge to help interpret the process of insight problem solving; (2) Mindle should be equipped with a sequential decision-making task to stimulate the flexible, dynamic use of conceptual knowledge.

Table 1: The analogy of scientific discovery to solving Mindle

	scientific discovery	solving Mindle
target	solve a scientific query	find out the secret word
output	the hidden answer is shaped by the known	the secret word is among the known
problem abstraction	path searching from status quo to the unknown	path searching from current guess to the unknown
problem context	conceptual scientific domain knowledge	conceptual knowledge from natural corpus
knowledge representation	concepts connected under logical or intuitive relations	concepts connected under intuitive relations
maintained representation	generalizing from one scientific concept to another	generalizing from one concept to another
reconstruction	changing the domain knowledge or methodology used	changing meta-strategy in action or semantics level
rationality	should be studied from specific perspectives	should be used in specific semantics subspaces
diversity	scientists with different backgrounds think differently	people with different backgrounds think differently
accessibility	captured by only a few individuals	captured by most individuals

Mindle: an infrastructure for large-scale studies on scientific discovery

Given the specific features that Mindle should capture, we describe how to implement Mindle. As the infrastructure for large-scale behavioral studies, Mindle should meet four elementary requirements: (1) providing appropriate tasks that abstract the real target problem to be studied; (2) providing correct computational models to evaluate the results; (3) the abstract task itself should be natural rather than artificial, and interesting rather than boring, so that subjects can easily get into the task; (4) the task should be easy to propagate and robust to unexpected user behaviors. Since the appropriateness of the task abstraction has been illustrated in detail, we describe how Mindle meets the last three.

The semantic searching game

Mindle requires the participants to dig out a hidden secret word. In every single challenge, the participant is given only a starting word. In each guess, the participant inputs a guessed word, and Mindle outputs a score (the given starting word can be viewed as the initial guess). The score indicates how close the current guess is to the target secret word on a 0 to 100 scale: the closer the score is to 100, the closer the guess is to the target. Once the secret word is hit, the challenge ends. The participant can choose to quit at any time during the challenge. Guesses can be typed into a text input bar or selected from a set of given options; the policy for proposing options will be described later. A variant of the challenge mode comes with a hint implying what topic the secret word is related to. The topic can be either abstract or concrete, such as kitchen supplies, classical music, or freedom, and it confines the problem space.

Score and vocabulary

The score is given by the cosine similarity between the embedding vectors of the current guess and the target word. The embedding vectors can be generated by arbitrary embedding methods, such as Word2Vec (Mikolov et al., 2013b), Skip-gram (Mikolov et al., 2013a), or GloVe (Pennington et al., 2014). The score maps the 0 to 1 scale of cosine similarity to a 0 to 100 scale. The vocabulary we use here is a subset consisting of about 40K of the most frequently used concepts in the natural corpus. We denote this set as $C\subset\mathcal{C}$ where $\mathcal{C}$ is the space of all concepts. In the text-input mode, participants are informed if the input guess is outside $C$ .
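
As a minimal sketch of the scoring rule described above, the following Python snippet computes a cosine similarity and rescales it to 0 to 100. The names `cosine_similarity`, `mindle_score`, and `toy_vectors` are hypothetical, the three-dimensional vectors are made-up values rather than trained embeddings, and clipping negative similarities to 0 is an assumption made here so that the result stays on the stated scale.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def mindle_score(guess_vec, target_vec):
    """Rescale cosine similarity to a 0-100 score.

    Assumption: negative similarities are clipped to 0 so that the
    score stays within the 0-100 range described in the text.
    """
    sim = cosine_similarity(guess_vec, target_vec)
    return 100.0 * max(0.0, min(1.0, sim))


# Toy embeddings (hypothetical values, not from a trained model).
toy_vectors = {
    "worker": [1.0, 0.0, 0.0],
    "labor": [1.0, 1.0, 0.0],
    "arts": [0.0, 0.0, 1.0],
}

score_close = mindle_score(toy_vectors["labor"], toy_vectors["worker"])  # about 70.7
score_far = mindle_score(toy_vectors["arts"], toy_vectors["worker"])     # 0.0
```

In practice the vectors would come from whichever embedding model is chosen; only the rescaling step is specific to the game.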

Conceptual knowledge representation

We have mentioned that conceptual knowledge can be represented as a fully-connected network. It can be implemented as a directed graph encoded by an adjacency matrix. The graph is obtained from a natural corpus through a Transformer model (Vaswani et al., 2017). The connection weights in the graph capture the co-occurrence frequency of each pair of concepts. Let the graph $G=\langle C,C\times C\rangle$ be the system of concepts, where each node $c_{i}\in C$ denotes a concept, $c_{k},k\neq i$ is a possibly related concept that shapes $c_{i}$ , and $w(c_{i},c_{k}),k\neq i$ indicates how strongly $c_{i}$ is shaped by $c_{k}$ among all related concepts. The higher $w(c_{i},c_{k})$ is, the more the semantics of $c_{i}$ is influenced by that of $c_{k}$ . Note that usually $w(c_{i},c_{k})\neq w(c_{k},c_{i})$ because the weights indicate the conditional probability $p(c_{i}|c_{k})=w(c_{i},c_{k})/\sum_{j\neq i}w(c_{i},c_{j})$ . Intuitively, this can be viewed as the probability of describing the concept $c_{i}$ by concept $c_{k}$ in a rational fashion (Frank and Goodman, 2012). $p(c_{i})$ is a vector that has $p(c_{i}|c_{k})$ as its $k$ -th dimension. Since these statistics are obtained from a general natural corpus, the real distribution of conceptual knowledge in individuals’ minds differs more or less from them, because it is highly conditioned on diverse individual prior experiences. In this way, we define all general and individual operations over concepts as formal operations on a graph.
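
The normalization $p(c_{i}|c_{k})=w(c_{i},c_{k})/\sum_{j\neq i}w(c_{i},c_{j})$ can be illustrated with a small sketch. The function name `conditional_probabilities` and the 3x3 weight matrix are hypothetical; the matrix holds made-up numbers, not weights extracted from a corpus.

```python
def conditional_probabilities(weights):
    """Row-normalize a square weight matrix, ignoring the diagonal.

    weights[i][k] plays the role of w(c_i, c_k); the returned probs[i][k]
    plays the role of p(c_i | c_k) as defined in the text.
    """
    n = len(weights)
    probs = []
    for i in range(n):
        total = sum(weights[i][j] for j in range(n) if j != i)
        if total <= 0:
            raise ValueError(f"concept {i} has no positive outgoing weight")
        probs.append([weights[i][k] / total if k != i else 0.0 for k in range(n)])
    return probs


# Hypothetical weights for three concepts: 0 = worker, 1 = salary, 2 = arts.
toy_weights = [
    [0.0, 3.0, 1.0],
    [2.0, 0.0, 2.0],
    [1.0, 1.0, 0.0],
]

p = conditional_probabilities(toy_weights)
# p[0] is the vector p(c_0): how strongly "worker" is shaped by each other concept.
```

Because each row is normalized separately, `p[0][1]` and `p[1][0]` generally differ, which mirrors the asymmetry noted above.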

Behavior modeling

Solving Mindle can be ideally modeled as a Markov decision process (MDP). An MDP is a tuple $T=(S,A,P,R,\rho)$ , where $S=C$ is the set of states, $A=C$ is the set of actions, $P:S\times A\times S\mapsto[0,1]$ is the transition probability function with $s_{t+1}\sim P(\cdot|s_{t},a_{t})$ , $R:S\times A\times S\mapsto\mathbb{R}$ is the reward function (the game score plays the role of reward here) indicating under what condition the task is solved, and $\rho$ is the distribution of the initial guess, where $s_{0}\sim\rho$ . Under this formulation, solving Mindle means finding the policy $\pi$ that generates trajectories $\tau=(s_{0},a_{0},s_{1},a_{1},\dots)\sim\pi$ optimizing the objective $\max_{\pi}\mathbb{E}_{\tau\sim\pi}[\sum_{t=0}^{\infty}\gamma^{t}R(s_{t},a_{t},s_{t+1})]$ , where $\gamma$ is the discount factor. But this model is highly idealized, because the state and action spaces that a participant actually considers are much smaller than the whole set of concepts, due to highly limited memory and attention slots. Hence, a mask $M$ can be applied to both $S$ and $A$ such that $S\cdot M\subset C$ and $A\cdot M\subset C$ . $M$ can either be predefined by heuristics (introduced next) or be extracted from trajectories generated by subjects in pre-experiments.
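
To make the formulation concrete, here is a small illustrative sketch of one challenge as an episodic decision process. Everything in it is an assumption for illustration: the class `MindleEpisode`, the deterministic transition in which the guess simply becomes the next state, the lookup table `TOY_SCORES` standing in for the embedding-based score, and the use of a plain Python set as the masked action space.

```python
from typing import Callable, List, Set, Tuple


class MindleEpisode:
    """One challenge viewed as an episodic decision process (illustrative only)."""

    def __init__(self, secret: str, start: str,
                 score_fn: Callable[[str, str], float],
                 allowed: Set[str]) -> None:
        self.secret = secret
        self.state = start        # s_0: the given starting word
        self.score_fn = score_fn  # plays the role of the reward R
        self.allowed = allowed    # masked action set, i.e. A restricted by M

    def step(self, guess: str) -> Tuple[str, float, bool]:
        if guess not in self.allowed:
            raise ValueError(f"{guess!r} is outside the masked action set")
        # Assumption: transitions are deterministic, the guess becomes the state.
        self.state = guess
        reward = self.score_fn(guess, self.secret)
        done = guess == self.secret
        return self.state, reward, done


def discounted_return(rewards: List[float], gamma: float) -> float:
    """Sum of gamma^t * r_t over one trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


# Hypothetical score table standing in for the embedding-based score.
TOY_SCORES = {"school": 20.0, "class": 40.0, "type": 70.0, "keyboard": 100.0}


def toy_score(guess: str, secret: str) -> float:
    return TOY_SCORES[guess]


episode = MindleEpisode(secret="keyboard", start="school",
                        score_fn=toy_score, allowed=set(TOY_SCORES))
rewards = []
for guess in ["class", "type", "keyboard"]:
    state, reward, done = episode.step(guess)
    rewards.append(reward)

total = discounted_return(rewards, gamma=0.5)  # 40 + 35 + 25 = 100.0
```

A stochastic transition function or a learned mask could replace the deterministic pieces without changing the interface.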

Similarity, rule, and the unrelated

Given the current guess, where can we go? Taking worker as the exemplar current guess, there are three types of candidate actions (see Fig. 2): (1) Concepts that are highly similar to the current guess, e.g., labor, navvy, and staff. (2) Concepts that are highly related to the current guess by specific rules, e.g., machine (by rule usage), salary (by rule reward), and supervisor (by rule organization). (3) Concepts that are hardly related to the current guess by any means, e.g., arts, word, and hippocampus. Suppose we have K proposals of each type and denote $c_{t}$ and $c_{t+1}$ as the current and next guess, respectively. The similar concepts are generated by top-K $\max_{k}\cos(\text{Vec}(c_{t}),\text{Vec}(c_{k})),k\neq t$ , the related concepts are generated by top-K $\max_{k}w(c_{t},c_{k}),k\neq t$ , and the unrelated concepts are generated by top-K $\min_{k}w(c_{t},c_{k}),k\neq t$ . Most similar concepts are synonyms of $c_{t}$ , which are not highly related to $c_{t}$ ; exceptions exist, so we apply a filter afterward. When selecting related concepts, we try to maximize the diversity of the candidates by filtering out redundant concepts generated by the same rule, e.g., boss and supervisor (by rule organization). We implement this by applying agglomerative clustering to the candidate concepts and selecting from the root (Murtagh and Legendre, 2014). The three types are actions at the higher level of $A$ , which can be observed directly and easily mapped to decision patterns, such as persistently searching in a local optimum or flexibly jumping between local optima. Besides supporting the analysis of behavior data, the graph pruned by the three modes is also used to control the difficulty of challenges by tuning the length of the shortest path and the number of possible paths from the initial guess to the target word.
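
The three top-K selections can be sketched as follows. This is a simplified illustration under stated assumptions: `top_k` and `propose_candidates` are hypothetical helper names, the similarity and weight values are invented, and the synonym filter and the agglomerative-clustering step mentioned above are deliberately left out.

```python
from typing import Dict, List


def top_k(scores: Dict[str, float], k: int, largest: bool = True) -> List[str]:
    """Return the k concepts with the largest (or smallest) score."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=largest)
    return [concept for concept, _ in ranked[:k]]


def propose_candidates(current: str,
                       similarity: Dict[str, Dict[str, float]],
                       weight: Dict[str, Dict[str, float]],
                       k: int) -> Dict[str, List[str]]:
    """Propose k similar, k related, and k unrelated concepts for `current`."""
    sim = {c: s for c, s in similarity[current].items() if c != current}
    rel = {c: w for c, w in weight[current].items() if c != current}
    return {
        "similar": top_k(sim, k, largest=True),     # highest cosine similarity
        "related": top_k(rel, k, largest=True),     # highest graph weight
        "unrelated": top_k(rel, k, largest=False),  # lowest graph weight
    }


# Hypothetical values for the exemplar guess "worker".
toy_similarity = {"worker": {"labor": 0.90, "staff": 0.80, "machine": 0.40,
                             "salary": 0.35, "arts": 0.05, "word": 0.02}}
toy_weight = {"worker": {"machine": 0.300, "salary": 0.250, "labor": 0.050,
                         "staff": 0.040, "word": 0.002, "arts": 0.001}}

proposals = propose_candidates("worker", toy_similarity, toy_weight, k=2)
# {'similar': ['labor', 'staff'], 'related': ['machine', 'salary'],
#  'unrelated': ['arts', 'word']}
```

A fuller version would deduplicate the related list by rule, for instance by clustering the candidates before picking representatives.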

Evaluation metrics

One significant indicator to be evaluated is the emergence of the eureka moment. We use both absolute and relative measurements. First, we identify the Aha! moments in a single trajectory $\tau=(s_{0},a_{0},s_{1},a_{1},\dots,a_{-1},s_{-1})$ . We calculate the first-order reward difference between adjacent trials, $\Delta r(t)=r_{t+1}-r_{t}$ . For action $a_{t}$ , if there exists $i<t$ such that $\Delta r(i)\geq\Delta r(t)$ , let $r(i:t)=\sum_{k=i+1}^{t}r_{k}/(t-i)$ ; if not, let $i=0$ . Likewise, if there exists $j>t$ such that $\Delta r(j)\geq\Delta r(t)$ , let $r(t:j)=\sum_{k=t+1}^{j}r_{k}/(j-t)$ ; if not, let $j=-1$ , i.e., the last step. We then have $\Delta a(t)=r(t:j)-r(i:t)$ as the cumulative difference between the critical points. Critical points with high $\Delta a(t)$ tend to be eureka moments from the view of a single trajectory. Second, we observe the trajectories in a counterfactual fashion and ask what would have happened if other actions had been taken. The action space can be either the concept space $C$ or the masked $C\cdot M$ at the low level, or the space of the three high-level action types (similarity, rule, and unrelated). Then, we define the updating rate as $\min\{0,1-\min_{a\in A}R(s_{t},a,s_{t+1})/R(s_{t},a_{t},s_{t+1})\}$ . Thus, the current guess can be viewed as an intervention on the status quo following the semantics gradient, supporting investigations on semantics overload. The evaluation metrics can be flexibly modified to meet the needs of specific experiments.
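
One possible reading of the single-trajectory measurement is sketched below. The text does not say which $i$ and $j$ to pick when several qualify, so this sketch assumes the nearest ones on each side; it also assumes that an empty window before the first step falls back to $r_{t}$ itself. The function names and the reward sequence are hypothetical.

```python
from typing import List


def reward_differences(rewards: List[float]) -> List[float]:
    """First-order differences: delta_r[t] = r[t+1] - r[t]."""
    return [rewards[t + 1] - rewards[t] for t in range(len(rewards) - 1)]


def _mean(values: List[float]) -> float:
    return sum(values) / len(values)


def aha_scores(rewards: List[float]) -> List[float]:
    """Delta a(t) for every step t that has a successor."""
    delta_r = reward_differences(rewards)
    last = len(rewards) - 1
    scores = []
    for t, d in enumerate(delta_r):
        # Assumption: nearest earlier step i with delta_r[i] >= delta_r[t], else 0.
        i = next((k for k in range(t - 1, -1, -1) if delta_r[k] >= d), 0)
        # Assumption: nearest later step j with delta_r[j] >= delta_r[t], else last.
        j = next((k for k in range(t + 1, len(delta_r)) if delta_r[k] >= d), last)
        # Assumption: an empty "before" window (only at t = 0) falls back to r[t].
        before = _mean(rewards[i + 1:t + 1]) if t > i else rewards[t]
        after = _mean(rewards[t + 1:j + 1])
        scores.append(after - before)
    return scores


# Hypothetical score sequence with one sharp jump between steps 3 and 4.
toy_rewards = [10.0, 12.0, 11.0, 13.0, 40.0, 42.0, 41.0]
scores = aha_scores(toy_rewards)
best_step = max(range(len(scores)), key=scores.__getitem__)  # step 3
```

On this toy sequence the largest value falls on the step right before the jump, which is the behavior the measure is meant to capture; other tie-breaking choices for $i$ and $j$ would give different numbers.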

Playability as a game

In the post-interview of our pilot study (25 subjects, 11 female), after 3 challenges per subject, 24 of 25 subjects found playing Mindle interesting and 18 of 24 subjects said they would want to play Mindle every day; 15 of 25 succeeded in at least one challenge and 8 of 15 experienced at least one eureka moment in at least one challenge. All solved challenges were finished in about 80 guessing steps on average, and the number of insightful solutions is about 50. These results imply that Mindle is attractive enough, with an appropriate success rate, and has the potential to spread easily. Also, Mindle is unique for its sense of infinite answer space and harmless trial (Hamari, 2007). Thus, Mindle can be played over a long period even if a challenge is interrupted in the middle; indeed, a long period of discontinuous thinking may even stimulate the Aha! moment.

Case Studies

In this section, we discuss three interesting observations from the pilot study. These phenomena may inspire elaborated hypotheses that lead to further investigations. This shows how Mindle empowers large-scale experiments.

Thinking out of the box: action level

Some challenges succeed due to action-level insights. The trajectories have a common pattern: participants switch smoothly between searching within a local optimum and traversing globally between local optima. Once they have been optimizing within a local optimum for a while without a significant score increase, they switch to jumping randomly between concepts that seem unrelated to each other, until they hit a local region with a much higher score. Then they settle down again to optimize the guess locally with synonyms or similar concepts. The eureka moments usually come when hitting a hot solution during the global random jump. Interestingly, the concepts in the trajectories are highly different from each other, indicating that such a behavior pattern is not constrained by semantics but driven by the inner preference of participants. That is, no matter what concepts I am guessing, I just switch flexibly between local search and global jump. This leads to a hypothesis: Do people apply meta-strategies ignoring the problem context?

Thinking out of the box: semantic level

Some challenges succeed due to semantic-level insights. This happens when participants hit a local optimum with a relatively high score, which makes it easy to believe that the target lies in this region. Affected by prior experience, participants tend to search in a subspace of the semantics space, where the selected concepts are projected onto a reduced semantics space. For example, a participant has guessed school, class, and grade, projecting the concepts onto the semantics subspace of school-related concepts. However, these words do not help her go further. The participant then decides to project the anchor concept, say class, onto another semantics subspace, say computer-related terminology. She then guesses type and unexpectedly takes a large step toward the target. Hence, she understands that she has been trapped in a semantics subspace. This case shows that representation reconstruction can also change the semantics subspace. Hence, we come up with another hypothesis: Do people apply meta-strategies according to the problem context? This hypothesis seems to contrast with the one in the last paragraph, but combining the two, we have a comparative hypothesis that is more closely related to our big picture of scientific discovery: Do people use meta-strategy as a policy regardless of context, or subject to their subjective understanding of context semantics?

Thinking inside the box: semantic level

Some challenges succeed through analytical solutions, especially when the participant fortunately gets on the right track at the start. The participant follows a gradient of the semantics landscape in mind. The gradient can be extremely flexible: for example, hierarchy (arts to painting to gallery), extent (large to larger to largest), or distance from humans (human to chimpanzee to monkey). The hypothesis space for such a gradient is almost infinite because the semantics space has very high dimensionality (Grand et al., 2022). Since the choice of the gradient is subject to personal cognitive bias, the trajectories for the same challenge can be highly diverse. This echoes the diverse mindsets of different fields of science. Compared with previous work on testing personal diversity in knowledge representation (Wang and Bi, 2021), Mindle (1) stimulates the spontaneous use of commonsense knowledge, in contrast to other experimental paradigms that probe human knowledge representation explicitly; (2) empowers the scaling-up of pilot studies, both in the breadth of semantics and in the diversity of subjects, to obtain more elaborate results on where people converge or diverge in concept representation. By recovering all trajectories generated by the same group of participants, we can build a computational model that captures their semantics landscape, i.e., a function that outputs the sense of which concepts are more similar or more related to each other than to others. The function can be approximated through inverse reinforcement learning (Abbeel and Ng, 2004), so we can analyze group diversity in concept representation quantitatively. In this way, we may reverse-engineer the organization of conceptual knowledge in people’s minds.

Combining the three hypotheses

Testing the three hypotheses helps us understand the interplay between insight-seeking and domain-knowledge-relying. First, on a confined problem space, we test the existence of these thinking patterns by stimulating the spontaneous use of action-level meta-strategies, semantic-level meta-strategies, and semantic-level landscape optimization. After this, we study whether, given different constrained topics, people tend to converge to a set of similar parameters that control the interplay. Assuming that insight-seeking and domain-knowledge-relying are at the two ends of a continuum, one possible hypothesis is that people rationally control the interplay according to their uncertainty about the use of domain knowledge: high uncertainty about knowing what may lead to reconstruction at the action level; high uncertainty about deciding which may lead to reconstruction at the semantic level; and low uncertainty may lead to maintaining the current representation. Such intuition defines the balance point between the two ends, so irrational behaviors, e.g., those closer to insight-seeking, can be identified by comparison with the rational case. On this basis, we scale up the behavioral studies to generalize the results to larger groups of individuals, and also collect a large dataset of trajectories and reverse-engineer the semantic landscapes in the open domain.

Acknowledgement

The authors thank Mr. David Turner for inspiring and helpful discussions on designing Mindle.

References

Abbeel and Ng, (2004) Abbeel, P. and Ng, A. Y. (2004). Apprenticeship learning via inverse reinforcement learning. In International Conference on Machine Learning (ICML).
Abend, (2008) Abend, G. (2008). The meaning of ‘theory’. Sociological Theory, 26(2):173–199.
Allen et al., (2020) Allen, K. R., Smith, K. A., and Tenenbaum, J. B. (2020). Rapid trial-and-error learning with simulation supports flexible tool use and physical reasoning. Proceedings of the National Academy of Sciences (PNAS), 117(47):29302–29310.
Almaatouq et al., (2021) Almaatouq, A., Becker, J. A., Bernstein, M., Botto, R., Bradlow, E., Damer, E., Duckworth, A. L., Griffiths, T., Hartshorne, J. K., Law, E., and et al. (2021). Scaling up experimental social, behavioral, and economic science.
Auble et al., (1979) Auble, P. M., Franks, J. J., and Soraci, S. A. (1979). Effort toward comprehension: Elaboration or “aha”? Memory & Cognition, 7(6):426–434.
Campbell, (1960) Campbell, D. T. (1960). Blind variation and selective retentions in creative thought as in other knowledge processes. Psychological Review, 67(6):380.
Carey, (1985) Carey, S. (1985). Conceptual change in childhood. MIT Press.
Carey, (2009) Carey, S. (2009). The Origin of Concepts. Oxford University Press.
Dubey et al., (2021) Dubey, R., Ho, M. K., Mehta, H., and Griffiths, T. (2021). Aha! moments correspond to metacognitive prediction errors.
Dunnington et al., (2004) Dunnington, G. W., Gray, J., and Dohse, F.-E. (2004). Carl Friedrich Gauss: titan of science. MAA.
Einstein, (1982) Einstein, A. (1982). How i created the theory of relativity. Physics Today, 35(8):45–47.
Frank and Goodman, (2012) Frank, M. C. and Goodman, N. D. (2012). Predicting pragmatic reasoning in language games. Science, 336(6084):998–998.
Gershman et al., (2015) Gershman, S. J., Horvitz, E. J., and Tenenbaum, J. B. (2015). Computational rationality: A converging paradigm for intelligence in brains, minds, and machines. Science, 349(6245):273–278.
Gopnik, (1994) Gopnik, A. (1994). The theory theory. In Mapping the mind: Domain specificity in cognition and culture, pages 257–293. Cambridge University Press.
Gopnik and Meltzoff, (1997) Gopnik, A. and Meltzoff, A. N. (1997). Words, thoughts, and theories. MIT Press.
Grand et al., (2022) Grand, G., Blank, I. A., Pereira, F., and Fedorenko, E. (2022). Semantic projection recovers rich human knowledge of multiple object features from word embeddings. Nature Human Behaviour, pages 1–13.
Grenander, (2012) Grenander, U. (2012). A calculus of ideas: a mathematical study of human thought. World Scientific.
Gruber, (1981) Gruber, H. E. (1981). On the relation between ‘aha experiences’ and the construction of ideas. History of Science, 19(1):41–59.
Hamari, (2007) Hamari, J. (2007). Gamification. The Blackwell Encyclopedia of Sociology, pages 1–3.
Hiebert and Lefevre, (1986) Hiebert, J. and Lefevre, P. (1986). Conceptual and procedural knowledge in mathematics: An introductory analysis. In Conceptual and procedural knowledge: The case of mathematics, pages 1–27. Erlbaum.
Ho et al., (2022) Ho, M. K., Abel, D., Correa, C. G., Littman, M. L., Cohen, J. D., and Griffiths, T. L. (2022). People construct simplified mental representations to plan. Nature, 606(7912):129–136.
Huth et al., (2016) Huth, A. G., De Heer, W. A., Griffiths, T. L., Theunissen, F. E., and Gallant, J. L. (2016). Natural speech reveals the semantic maps that tile human cerebral cortex. Nature, 532(7600):453–458.
Kershaw and Ohlsson, (2004) Kershaw, T. C. and Ohlsson, S. (2004). Multiple causes of difficulty in insight: the case of the nine-dot problem. Journal of Experimental Psychology: Learning, Memory, and Cognition, 30(1):3.
Knoblich et al., (1999) Knoblich, G., Ohlsson, S., Haider, H., and Rhenius, D. (1999). Constraint relaxation and chunk decomposition in insight problem solving. Journal of Experimental Psychology: Learning, Memory, and Cognition, 25(6):1534.
Kounios and Beeman, (2009) Kounios, J. and Beeman, M. (2009). The aha! moment: The cognitive neuroscience of insight. Current Directions in Psychological Science, 18(4):210–216.
Kuhn, (1970) Kuhn, T. S. (1970). The structure of scientific revolutions. University of Chicago Press: Chicago.
Langley and Jones, (1988) Langley, P. and Jones, R. (1988). A computational model of scientific insight. The nature of creativity: Contemporary psychological perspectives, 177:201.
Lieder and Griffiths, (2020) Lieder, F. and Griffiths, T. L. (2020). Resource-rational analysis: Understanding human cognition as the optimal use of limited computational resources. Behavioral and Brain Sciences, 43.
Mednick, (1963) Mednick, M. T. (1963). Research creativity in psychology graduate students. Journal of Consulting Psychology, 27(3):265.
Metcalfe and Wiebe, (1987) Metcalfe, J. and Wiebe, D. (1987). Intuition in insight and noninsight problem solving. Memory & Cognition, 15(3):238–246.
Mikolov et al., (2013a) Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013a). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
Mikolov et al., (2013b) Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013b). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (NeurIPS).
Momennejad et al., (2017) Momennejad, I., Russek, E. M., Cheong, J. H., Botvinick, M. M., Daw, N. D., and Gershman, S. J. (2017). The successor representation in human reinforcement learning. Nature Human Behaviour, 1(9):680–692.
Moszkowski, (1972) Moszkowski, A. (1972). Conversations with Einstein. Sidgwick & Jackson.
Murtagh and Legendre, (2014) Murtagh, F. and Legendre, P. (2014). Ward’s hierarchical agglomerative clustering method: which algorithms implement ward’s criterion? Journal of Classification, 31(3):274–295.
Ohlsson, (1984) Ohlsson, S. (1984). Restructuring revisited: I. summary and critique of the gestalt theory of problem solving. Scandinavian Journal of Psychology, 25(1):65–78.
Öllinger et al., (2008) Öllinger, M., Jones, G., and Knoblich, G. (2008). Investigating the effect of mental set on insight problem solving. Experimental Psychology, 55(4):269.
Ormerod et al., (2002) Ormerod, T. C., MacGregor, J. N., and Chronicle, E. P. (2002). Dynamics and constraints in insight problem solving. Journal of Experimental Psychology: Learning, Memory, and Cognition, 28(4):791.
Pennington et al., (2014) Pennington, J., Socher, R., and Manning, C. D. (2014). GloVe: Global vectors for word representation. In Annual Conference on Empirical Methods in Natural Language Processing (EMNLP).
Rittle-Johnson et al., (2001) Rittle-Johnson, B., Siegler, R. S., and Alibali, M. W. (2001). Developing conceptual understanding and procedural skill in mathematics: An iterative process. Journal of Educational Psychology, 93(2):346.
Salvi et al., (2016) Salvi, C., Bricolo, E., Kounios, J., Bowden, E., and Beeman, M. (2016). Insight solutions are correct more often than analytic solutions. Thinking & Reasoning, 22(4):443–460.
Schickore, (2022) Schickore, J. (2022). Scientific discovery. In Zalta, E. N. and Nodelman, U., editors, The Stanford Encyclopedia of Philosophy. Metaphysics Research Lab, Stanford University, Winter 2022 edition.
Tenenbaum and Griffiths, (2001) Tenenbaum, J. B. and Griffiths, T. L. (2001). Generalization, similarity, and Bayesian inference. Behavioral and Brain Sciences, 24(4):629–640.
Vaswani et al., (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS).
Wang and Bi, (2021) Wang, X. and Bi, Y. (2021). Idiosyncratic tower of babel: Individual differences in word-meaning representation increase as word abstractness increases. Psychological Science, 32(10):1617–1635.
Wang et al., (2020) Wang, X., Men, W., Gao, J., Caramazza, A., and Bi, Y. (2020). Two forms of knowledge representations in the human brain. Neuron, 107(2):383–393.

Appendix A User interface of Mindle

Given a starting word, users are expected to navigate toward a secret target word. Users travel through the semantic world by guessing words. Each time they jump to a word, they receive a similarity score indicating the distance between their current position and the target. For Mindle’s pilot study, we designed two versions of the game. The Web-based Mindle is designed for both lab-based and online experiments (see Fig. 4(a)), while the Mobile-based Mindle can be accessed from mobile terminals, making it suitable for larger-scale online experiments (see Fig. 4(b)).
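
The scoring loop described above can be illustrated with a minimal sketch. It assumes the score is the cosine similarity between word vectors (the measure used for the contour lines in Appendix C). The toy three-dimensional vectors and the names `TOY_EMBEDDINGS`, `cosine_similarity`, and `play` are hypothetical and are not taken from Mindle’s implementation.

```python
import math

# Hypothetical toy vectors for illustration only; a real game would load
# pretrained word embeddings instead.
TOY_EMBEDDINGS = {
    "cat": [1.0, 0.0, 0.0],
    "kitten": [0.9, 0.1, 0.0],
    "dog": [0.6, 0.8, 0.0],
    "car": [0.0, 0.0, 1.0],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def play(start, target, guesses, embeddings):
    """Return the (word, score) trajectory from the starting word onward.

    Guesses missing from the vocabulary are skipped, and the game stops
    as soon as the target is reached.
    """
    target_vec = embeddings[target]
    trajectory = [(start, cosine_similarity(embeddings[start], target_vec))]
    for guess in guesses:
        if guess not in embeddings:
            continue
        trajectory.append((guess, cosine_similarity(embeddings[guess], target_vec)))
        if guess == target:
            break
    return trajectory


if __name__ == "__main__":
    for word, score in play("car", "cat", ["dog", "kitten", "cat"], TOY_EMBEDDINGS):
        print(f"{word}: {score:.3f}")
```

In this sketch the trajectory is just the sequence of scored guesses, which is the kind of record that Appendix D visualizes for real players.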

Appendix B Examples of conceptual knowledge representation

Tab. 2 shows some canonical examples of conceptual knowledge representation. Each example lists the concepts most highly associated with the central concept. A set of stop words has been removed to make the results more representative.

Table 2: Examples of top-associated conceptual knowledge representation

Word	Top associated concepts
school	building, county, education, national, community, family, government, city, house, university
cat	life, name, family, medal, known, me, house, into, your, time
work	press, have, government, community, title, zone, action, field, works, working, more, time
apple	park, famous, story, award, album, music, different, out, title
train	tour, population, district, project, zone
gift	what, new, different, something, house, special, story, inside

Appendix C Examples of score contour line

Table 3: Score contour lines based on cosine similarity. Legend (rank bands): Top 1, Top 10, Top 100, Top 500.

(a) cat

Target	Word	Cos Similarity
cat	cat	1
	cats	0.809938
	dog	0.760946
	kitten	0.746498
	feline	0.732624
	beagle	0.715058
	puppy	0.707545
	pup	0.693429
	pet	0.689153
	felines	0.675593
	chihuahua	0.670976
	bassets	0.520498
	rooster	0.51893
	owl	0.518349
	pinscher	0.517772
	tiger	0.517296
	piglet	0.516684
	kelpie	0.515803
	dachshunds	0.515763
	schnauzers	0.514645
	bird	0.514626
	earless	0.392231
	hoarder	0.392067
	lynx	0.392044
	shrike	0.392036
	panleukopenia	0.391696
	iguanas	0.391474
	doglike	0.391446
	yelping	0.391318
	crow	0.391283
	rabbity	0.391252

(b) green

Target	Word	Cos Similarity
green	green	1
	greener	0.809938
	red	0.760946
	greening	0.746498
	yellow	0.732624
	blue	0.715058
	brown	0.707545
	florescent	0.693429
	greenest	0.689153
	nongreen	0.675593
	purple	0.670976
	pistache	0.520498
	echeveria	0.51893
	paspalum	0.518349
	greened	0.517772
	bicolored	0.517296
	multicolored	0.516684
	brittlebush	0.515803
	arborvitaes	0.515763
	stripes	0.514645
	sienna	0.514626
	conserve	0.392231
	photovoltaic	0.392067
	leafier	0.392044
	euonymous	0.392036
	alpenglow	0.391696
	coppery	0.391474
	tomatillo	0.391446
	beautifying	0.391318
	marram	0.391283
	tangelo	0.391252

Tab. 3 shows some canonical examples of the similarity-based semantic landscape. A contour line is formed by different concepts that have the same level of score with respect to the target concept. Each example shows a series of contour lines at different scores. The concept cat has a single major meaning, while green has two major meanings. Hence, the concepts similar to cat lie in a single semantic subspace, while those similar to green lie in two semantic subspaces (color and plant).
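
As a rough illustration of how such contour lines could be extracted, the sketch below ranks a vocabulary by cosine similarity to the target and slices out the words occupying a given rank band. The synthetic two-dimensional vocabulary and the names `VOCAB`, `rank_by_similarity`, and `contour` are assumptions made for illustration, not the procedure used to build Tab. 3.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(target, embeddings):
    """Sort all words by decreasing cosine similarity to the target."""
    target_vec = embeddings[target]
    scored = [(word, cosine_similarity(vec, target_vec))
              for word, vec in embeddings.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def contour(ranked, rank, width):
    """Return `width` entries starting at the 1-indexed position `rank`."""
    return ranked[rank - 1:rank - 1 + width]


# Synthetic vocabulary (hypothetical): word "w{k}" sits k degrees away from
# the target "w0", so its similarity to the target decreases as k grows.
VOCAB = {
    f"w{k}": [math.cos(math.radians(k)), math.sin(math.radians(k))]
    for k in range(20)
}

if __name__ == "__main__":
    ranked = rank_by_similarity("w0", VOCAB)
    for rank in (1, 5, 10):
        print(rank, [word for word, _ in contour(ranked, rank, 3)])
```

With real embeddings, words from several senses of the target (as with green) would be interleaved within one rank band, because a single similarity score does not distinguish the subspace a word comes from.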

Appendix D Player trajectories

Fig. 5 shows some exemplar trajectories generated by players. The thinking patterns mentioned in the paper, such as insight-seeking and domain-knowledge-relying, can be clearly observed in these trajectories. In addition, several eureka moments, a.k.a. Aha! moments, can be observed during the test process.
