Is there really such a thing as collective intelligence? If so, how does it work?
As Thomas Malone argues, economics and organizational theory are really all about collective intelligence and intelligences — the ecosystem of “superminds” (teams, firms and organizations, markets, social networks) in which we all participate and which is responsible for the entirety of wealth generation. Yet we don’t have much in the way of a clear explanation of how or why these entities emerge, their inner workings, and their interactions with the environment. We haven’t gone much further than piecemeal, casuistic and empirical observation supported by some psychology and loose biological analogy. We can and should do better: not just to satisfy our intellectual curiosity, but because a better understanding of superminds can confirm or falsify our existing ideas about policy to create better social and economic outcomes, as well as point the way towards new policy.
My requirements are very specific — as a mathematical physicist, I want to see the “nuts and bolts”: a causal, quantitative theory that can produce solvable models or simulations describing relevant aspects of the world, as well as testable hypotheses about future events or experiment outcomes. As an ersatz economist, I expect to find a theory that is fundamentally about imperfect beings trying to make sense of the world and better their situation in the presence of incomplete information, as described so well by Adam Smith, Hayek, Simon, etc. Finally, I want a theory that is ab initio: it is popular to attribute economic and social facts to human psychology, but we should be able to find the deeper reason why humans think and act in the relevant way.
This is a sketch of such a theory. It is based on the framework called active inference or the Free Energy Principle (FEP), originally proposed by Karl Friston to explain the human brain, but later applied to artificial intelligence and even to explaining life itself. This framework allows us to talk about nouns such as agency, intelligence, consciousness, belief in the context of actual, specific dynamical systems. Being grounded in standard results of information theory and complex systems, it helps us to turn impressionistic metaphors into precise, quantitative models with clear applicability and limits. And given its wide realm of applicability, it also acts as a useful Rosetta Stone to help us import insights from other domains.
In a nutshell, FEP is a mathematical framework for describing, modeling and reasoning about self-organized systems. In its most epistemologically pure form, FEP comprises a general model of a stable system interacting with its environment; the system is decomposed between “active”, “sensory” and “internal” subsystems, based on the causal structure of said interactions. It turns out that, in the presence of some reasonable assumptions, such a system can be shown to implement an intentional agent possessing an internal model of its environment, and learning about said environment via conjugated observation and action. The two canonical examples are: molecules forming a cell; and proto-neurons forming a brain.
As the name implies, the FEP postulates that self-organization occurs by virtue of the system minimizing a quantity called informational free energy. This quantity can be interpreted in multiple ways, but it’s often described as the agent’s best available approximation to expected harm or “surprise”, given its knowledge about the world and limited computing capacity. Thus, in active inference, agents are constantly prodding at the world and refining their beliefs about it, in search of better strategies for their continued stability/survival.
As mentioned, FEP applies to arbitrary systems as long as they can be decomposed into a stable unit and its environment. What makes it non-trivial to apply this framework to the social sciences, and the real innovation in the research program described here, is the following. The previously described (biological and neurological) applications of FEP deal with systems composed of small, relatively simple atomic units that learn about a generically “physical” environment with comparatively little structure and few organized degrees of freedom: molecules forming a cell that learns about its primordial soup, neurons forming a brain that learns about the jungle or ocean. The only possible form of composition is via “black box” hierarchical nesting via the physical degrees of freedom.
In a social system, by contrast, the defining and critical feature of the system is the presence of a population of interacting, similar, but independent agents: for each given agent, the salient feature of the environment is the presence of many of its peers, and therefore its learning task is mainly to understand the surrounding social structure, which in turn is defined by the composite of the agents’ actions. Thus, social agents will have models of other agents, of themselves, and eventually of the “macro” system itself. As we will see, this leads to the optimality of agents “bonding”; to a wealth of “grey-box”, semi-hierarchical structures; and eventually to the emergence of intelligence in social groups in a much more compact or efficient manner than in the biological equivalent.
In other words, the capacity for alterity allows social agents to bond, “understand each other”, create alignment and collaborate intentionally, generating “superminds” that leverage much more of the individual agents’ capabilities and that thus exhibit intelligence at much smaller unit numbers. This is why we have teams and markets with just a handful of people and organizations with only dozens, while a functioning brain requires thousands of cells.
By following this mathematical model to its logical conclusions, we believe we will be able to replicate, sharpen or better justify the core results of the key complex adaptive systems research streams, such as the ones commonly associated with Hayek, Simon, Sandy Pentland, and Padgett. Indeed, we find good reason to believe we can account for the emergence of many of the social facts that we take for granted as pure properties of certain dynamical systems, without any casuistic recourse to ground facts of human psychology (i.e., except inasmuch as human thought and behavior can themselves be explained by the same framework). This is aesthetically pleasing, helps us reason about what is universal in other systems of complex agents, and about how different modes of action impact how these systems work.
In this article, I will give a summarized description of the FEP and of a theory of social agents based on it. Some intuition and general knowledge of mathematics, physics and information theory is assumed, but I won’t go into the details or show proofs. I acknowledge that there are still many gaps and the theory is still not fully specified, but I hope this article can give technical readers a backbone to fill out, and non-technical readers a good feel for the intuitions behind it.
An overview of the FEP
Here I provide a conceptual overview of the FEP, its associated jargon, and current open questions. For the technical reader, we suggest this or this, possibly followed by a deep-dive into the long list of articles by Friston and others on the topic.
The FEP, or the active inference framework, deals with a system or “world” W composed of an environment ϕ interacting with an agent A, where the agent is itself composed of an active component a, a sensory component s, and an internal component λ, as illustrated below. This decomposition is possible for any general agent — we’re just labeling parts of the system based on which other parts they interact with.
This system is a (causal) Bayesian network, that is, edges between nodes denote dependencies (usually probabilistic), which are assumed to be created by an underlying fundamental mechanism (e.g., physical dynamics).
This is a very general model, which can be used to represent living cells and animal brains (as the illustration implies), as well as more abstract kinds of entities. The “interface” between environment and agent (the sensory and active components) is called the Markov blanket, because it statistically insulates the internal state from the exterior.
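As a rough illustration of the blanket's insulating role, the sketch below encodes one possible wiring of the four components as a graph and checks that every path between the environment and the internal state passes through a sensory or active node. The edge list is an assumption about the wiring, and the names `EDGES`, `neighbours` and `connected_without` are hypothetical. This is a structural connectivity check only, not a statistical test of conditional independence.

```python
from collections import deque

# Assumed wiring (hypothetical): environment -> sensory -> internal -> active
# -> environment, plus a direct sensory -> active coupling.
# There is deliberately no edge linking "phi" and "lambda" directly.
EDGES = [
    ("phi", "s"),      # environment drives sensory states
    ("s", "lambda"),   # sensory states drive internal states
    ("lambda", "a"),   # internal states drive active states
    ("a", "phi"),      # active states drive the environment
    ("s", "a"),        # reflex-like sensory -> active coupling
]

def neighbours(node, edges):
    """Return the neighbours of a node, ignoring edge direction."""
    out = set()
    for u, v in edges:
        if u == node:
            out.add(v)
        if v == node:
            out.add(u)
    return out

def connected_without(start, goal, blocked, edges):
    """Breadth-first search for a path from start to goal that avoids `blocked`."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in neighbours(node, edges):
            if nxt not in seen and nxt not in blocked:
                seen.add(nxt)
                queue.append(nxt)
    return False

BLANKET = {"s", "a"}
reachable_freely = connected_without("phi", "lambda", set(), EDGES)
reachable_past_blanket = connected_without("phi", "lambda", BLANKET, EDGES)
print(reachable_freely, reachable_past_blanket)
```

With this wiring, the environment can reach the internal state only when the blanket nodes are left unblocked, which is the structural sense in which the blanket separates inside from outside.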
The core tenets of the active inference framework can be summarized as follows:
- The internal states of A encode a causal model m of the system. By a causal model we mean a probability density p(W | m), which represents the causal structure above itself, as well as any and all facts that A might know about the world. Thus, the diagram above does double duty as both a representation of the actual world and an outline of m’s content.
- A performs approximate Bayesian inference, i.e., it continuously adapts its model of reality to approximate the best posterior fit possible given the observations (in jargon, it minimizes the statistical surprise -ln p(s | m)). In general, this inference must be approximate because the world is too complex for exact inference to be computationally tractable, i.e., eliminating the hidden states of the world to compute p(s | m) requires summing over a combinatorial explosion of states.
- A follows the Free Energy Principle. If we allow A to be able to compute a simpler “recognition density” q(ϕ | b), connecting observations to a simpler space of beliefs, there is a quantity called free energy F(s, b) which is easily computable, and can be shown to be an upper bound to surprise. Therefore, by minimizing F(s, b), the agent will get as close to minimizing surprise as it can given its computational constraints.
- A’s actions are explained by active inference. Finally, the agent minimizes free energy not just by perceiving (adjusting its beliefs given observations), but also by acting, i.e., by selecting the policy (a sequence of actions within its realm of possibilities) that yields the minimal expected free energy. Further, the expected free energy reduction can be decomposed as a sum of “extrinsic value” (direct reduction of expected surprise) and “epistemic value” (reduction of uncertainty); correspondingly, the actions will be driven by a mixture of “exploitation” (seeking the “best” outcome given current knowledge) and “exploration” (seeking more knowledge to eventually settle on a better outcome).
- Agents subjectively “prefer” to follow the FEP, because objectively they are whole to the extent that they are effective free energy minimizers. As can be seen from the previous point, active inference is a control theory, but one with no extrinsic reward (a.k.a. cost or objective function); rather, the agent acts towards actualizing its model of the world, and thus m in some sense represents preference as well as knowledge and belief. This apparent paradox is actually the analog of an optical illusion, with two mathematically equivalent interpretations: the “subjective” one above, and an “objective” one, where m actually describes a “homeostatic” steady state for the system, and “surprise” is simply the amount of deviation from that state. Thus, in this view, large surprise will correspond to the system being put very far from its “ideal state”, and indeed, if the system is modified such that the Markov blanket doesn’t hold anymore, surprise will become infinite. In this interpretation, active inference is nothing but survival-seeking.
- Agents are evolutionarily fit to the extent that they are effective free energy minimizers. Consider a system that evolves agents through time, in a structured but noisy environment. Then agents whose beliefs or actions aren’t driven by FE minimization won’t learn enough about the environment to come close to their homeostatic ideal (given the environment), and will be punished by drifting towards “deadly” (high-surprise) states at a much higher rate than good FE minimizers. Thus, there is a pressure for agents to follow the FEP in general, but also, to continuously improve their models and their computational machinery. Of course, if the system has agents that reproduce, the pressure is both intra- and intergenerational.
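The bound in the third tenet can be checked numerically on a toy model. The sketch below assumes a two-state hidden environment and a two-valued observation, with made-up probabilities; `PRIOR`, `LIKELIHOOD`, `belief` and the other names are hypothetical. It computes the variational free energy F = Σ q(ϕ | b) [ln q(ϕ | b) − ln p(s, ϕ | m)] and compares it with the surprise -ln p(s | m).

```python
import math

# Hypothetical generative model m: p(phi | m) and p(s | phi, m).
PRIOR = {"rain": 0.3, "dry": 0.7}
LIKELIHOOD = {
    "rain": {"wet": 0.9, "not_wet": 0.1},
    "dry": {"wet": 0.2, "not_wet": 0.8},
}

def surprise(s):
    """Exact surprise -ln p(s | m), obtained by summing out the hidden state."""
    return -math.log(sum(PRIOR[phi] * LIKELIHOOD[phi][s] for phi in PRIOR))

def free_energy(s, q):
    """Variational free energy of recognition density q for observation s."""
    total = 0.0
    for phi, q_phi in q.items():
        if q_phi > 0.0:
            joint = PRIOR[phi] * LIKELIHOOD[phi][s]
            total += q_phi * (math.log(q_phi) - math.log(joint))
    return total

def exact_posterior(s):
    """Bayes-optimal posterior p(phi | s, m), for comparison."""
    evidence = sum(PRIOR[phi] * LIKELIHOOD[phi][s] for phi in PRIOR)
    return {phi: PRIOR[phi] * LIKELIHOOD[phi][s] / evidence for phi in PRIOR}

def belief(b):
    """Recognition density q(phi | b), parameterized by b = q(rain)."""
    return {"rain": b, "dry": 1.0 - b}

s_obs = "wet"
crude = free_energy(s_obs, belief(0.5))            # an uninformed belief
best = free_energy(s_obs, exact_posterior(s_obs))  # the exact posterior
print(crude, best, surprise(s_obs))
```

In this toy model the gap between free energy and surprise is the Kullback-Leibler divergence between q and the exact posterior, so it closes only when the belief matches the posterior.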
Active inference is then a procedural theory of self-organization, behavior and rationality, and we can implement it in (real or simulated) systems and analyze how closely real systems match its predictions. For small systems (e.g., physical or chemical systems where the most granular unit is a molecule), a single application of the FEP does the trick. Indeed, there are several implementations of the FEP for simple models that do showcase active inference’s main characteristics: see here for a minimal toy model, here for a model with molecule-like units, here for an algorithmic agent, here for a model of “atomic” but coordinated units bundled under a single “team”.
However, for larger systems we need to adopt a multi-scale approach, i.e., group units according to some criterion and analyze each group’s behavior as an aggregate according to an approximate description with fewer degrees of freedom. Luckily, Markov blankets provide perfect boundaries for such encapsulation, being by definition the boundaries of causal influence between the external environment and the agent, and the abstract nature of the FEP means that we can expect it to manifest itself at all such scales.
This “blankets all the way down” theory of life (with one level’s agents being the constituent units of the next level’s agents) is championed by Friston, for instance here. In summary, according to this world-view, large agents like animals and humans (and firms, as we will see) are described as composites of many simple, locally-optimizing agents, but simultaneously as macro-scale, (approximately) actively-inferring agents in their own right.
This echoes Minsky’s “society of mind” picture, and contrasts sharply with the hyper-rational “homo economicus” of neoclassical economic theory, who is assumed to act as a global, exact maximizer of “utility” with access to full information.
One might object that active inference as defined above is too loose and broad to serve as a useful description of things like intention and rationality. Bruineberg et al. invoke the example described by Huygens of two clocks hanging from a beam that can move sideways; the coupling introduced by the beam causes the clocks to eventually synchronize. Being mechanically similar, the clocks have an implicit model of each other, and this enables us to describe their interaction in terms of inference: by knowing the state of clock 1, we have a good idea about the state of clock 2; in this sense, it is “as if” clock 1 is inferring clock 2’s state. This system can be described as following the FEP, and is self-organizing in that sense, but it’s clearly not intentional or rational.
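The clock example is easy to reproduce in simulation. The sketch below does not model the actual pendulum-and-beam mechanics; it swaps in the Kuramoto model, a standard idealization of two coupled phase oscillators, with made-up frequencies and coupling strength. The function name `simulate_clocks` and all parameter values are hypothetical.

```python
import math

def simulate_clocks(omega1=1.0, omega2=1.1, coupling=0.5, dt=0.01, steps=5000):
    """Euler-integrate two Kuramoto oscillators; return the phase gap over time."""
    theta1, theta2 = 0.0, 2.0   # arbitrary, clearly different starting phases
    gaps = []
    for _ in range(steps):
        # Each clock runs at its own frequency and is nudged towards the other.
        d1 = omega1 + coupling * math.sin(theta2 - theta1)
        d2 = omega2 + coupling * math.sin(theta1 - theta2)
        theta1 += dt * d1
        theta2 += dt * d2
        gaps.append(theta2 - theta1)
    return gaps

coupled = simulate_clocks()                # the "beam" couples the clocks
uncoupled = simulate_clocks(coupling=0.0)  # no beam: clocks drift apart

print("coupled gap, start vs end:  ", coupled[0], coupled[-1])
print("uncoupled gap, start vs end:", uncoupled[0], uncoupled[-1])
```

With coupling, the phase gap settles to a small constant, so the state of one clock pins down the state of the other; without it, the gap keeps growing. Nothing in the loop resembles foresight or choice, which is exactly the objection.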
To resolve this issue, Friston introduces a distinction between “mere” and “adaptive” active inference. In mere active inference, an agent’s generative model and inference capabilities are restricted to the system’s present state: such systems are Markov chains, which are special cases of Markov blankets. In contrast, in systems with adaptive active inference, an agent’s generative model has “temporal depth”, i.e., present states are described in terms of past ones. This enables the agent to infer the probabilities of future states in terms of present information, and so to consider possible consequences, counterfactuals, and judge courses of action (not just instant actions) in terms of their expected future (not just present) consequences. Biological systems with non-trivial complexity require such temporally deep models to be described (except maybe by agents possessing perfect information, which is obviously not the case).
Therefore, biological systems should generally be thought of as performing adaptive active inference, with the degree of required temporal depth being correlated with the system’s size and environment’s complexity. This is an attractive picture of the distinction between the biological and the “merely physical”, and a useful framework for thinking of composite agents’ deeper, more complex models as composed of their constituent agents’ simpler ones: functional composition in parallel with structural composition.
Although the FEP was originally formulated for understanding the brain and artificial intelligence, Friston has proposed that it is actually a general principle for self-organization (autopoiesis), specifically arguing that (a) a large class of dynamical systems will feature Markov blankets, and (b) in any system featuring a Markov blanket, the subsystem enclosed by the Markov blanket will appear to follow the free energy minimization imperative, thus performing active inference and maintaining its own integrity. In this sense, active inference is a corollary of the laws of thermodynamics. Friston provides a compelling simulation example for a “primordial soup” made out of 128 macromolecule-like components. In this soup, a Markov blanket is spontaneously formed, exhibiting behaviors that qualitatively resemble a simple cell.
Regardless of the realism of the simulation or rigor of the argument, the motivating question that this literature poses to us is:
If a small set of arbitrary, non-thinking objects can easily produce this behavior, then to what extent is humans’ own formation of social groups — with their own well-defined existence and behavior — also a direct consequence of abstract but overwhelming systemic forces?
Motivated by the question above, we sketch the beginnings of a theory. In this section, we discuss some conditions for models of social agents to produce emerging social entities, or (proto-)superminds. Note that these entities are ontologically made up of lower-level agents, but that the structure of the interactions between these agents is crucial for the existence of the superminds, much as not just any combination of living cells constitutes a macro-organism.
Specifically, we will discuss:
- How two or more similar agents each minimize their individual free energy via bonding (interpreted as partnering or teaming).
- How bonds help agents to survive by collaborating.
- Under which conditions the resulting group of agents forms its own distinguishable Markov blanket, a collective or proto-supermind.
- How the bonding structure and the individuals’ internal structure determine the collective’s overall free energy, and when it can legitimately be called a “supermind”.
From single agents to bonds
As mentioned above, the defining feature of social systems is the presence of a population of interacting, similar, but independent agents. As we will see, this basic fact leads almost directly to the emergence of social bonds and structure.
Consider a population P of agents. Each given agent A interacts both with the “natural environment” ϕ, and the “social environment” comprising the rest of the population P’ (not just its immediate neighbors, but all the other agents that it might eventually meet). Whereas ϕ can be arbitrarily complex and uncertain, P’ is not: it is a composite of agents similar to A, and thus predictable, especially if A has knowledge of self that it can bootstrap. Thus, in such systems:
- Agents can reduce their overall expected surprise by preferring to interact with the social environment.
- Agents who have models of self and the capability to project (understand that other agents are like themselves) have an additional advantage.
We propose that the primary way in which these interactions between agents and the social environment are implemented is via the foundational concept of a bond: a stable interaction pattern between two agents, where each agent carries a model of the other, expressed in terms of how the other differs from itself. Each interaction within the bond reinforces it and delivers to each agent a reassuring quantum of free energy reduction.
Let’s develop this concept in a bit more detail. First, consider a system that consists only of two identical agents A₁ and A₂, without an external environment:
By A₁ and A₂ being identical, we mean that they have:
- The same generative model m — i.e., same generative density p(s, W | m).
- The same recognition densities q(W | b).
For simplicity and without much loss, we assume these parts of the system are deterministic, and introduce the belief and action functions (b* and a* respectively), representing the selected beliefs (given observations) and courses of action (given beliefs). Again, in this example these two functions will be identical for both agents. Further assume that each agent can somehow generate counterfactuals, by evaluating these functions at arbitrary points in their domains. (That is, agents are able to consider: “What would it mean if I observed this? What would I do in that case?”)
Consider the solution where A₁ has a delta-shaped generative density assigning all probability to the state with a₂ = a*(b*(a₁)), and likewise, A₂’s density assigns all probability to a₁ = a*(b*(a₂)). This solution produces an exact recognition density (another delta), so it has minimal free energy. Further, it exactly matches reality, so it actually has minimal surprise: it is the global optimum for m. Given this, we conjecture that, if this model is feasible (given agents’ capabilities), and if the dynamics are such that agents can evolve towards minimal free energy, the system will converge towards the above solution.
Note that this solution is very boring: there is no actual information flowing between A₁ and A₂; each agent “knows everything it needs to know” at the outset, from the system setup. (This is very similar to the example with the two clocks, by the way.) However, to get there, each agent has to have a model that enables it to project, i.e., estimate another agent’s beliefs and actions based on its own. This implies some computing capabilities, albeit simple ones, compared to what’s required to perform adaptive active inference in more general cases. Indeed, by endowing each agent with knowledge about itself, we make it so that it doesn’t need to process any information from the outside world at all.
Agents that learn about each other
Now consider the same system as above, but allow the agents to have similar generative models: m₂ = m₁ + 𝝐, with 𝝐 suitably small. Similarly to the previous example, assume agents have some self-knowledge, in the form of being able to evaluate their own b* and a* counterfactually. We also assume agents are able to evaluate their own s* counterfactually.
Intuitively, we expect this to mean that A₁ and A₂ will have similar beliefs and actions; and since A₂’s actual actions aren’t exactly identical to what A₁ projects from their own actions, there is again a role for sensory data.
Indeed, A₁ forms a probability distribution p(a₂ | s₁) over the set of possible actions given its own observations of A₂. It can be shown that this induces a distribution p(m₂ | s₁) which is Bayes-optimal. That is, A₁ is able to estimate the difference in A₂’s model from its own model, as accurately as theoretically possible, given the observed difference between predicted and actual actions.
Thus, in this scenario, each agent is now able to infer the other’s private model, i.e., to learn information about a hidden fact of the world that is not given a priori in its own model. Note that agents’ uncertainty is due entirely to the unknown difference 𝝐 between the two models, as opposed to noise in sensing or acting, which we have assumed away by taking these parts of the system to be deterministic.
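To make the inference step concrete, here is a minimal sketch under strong simplifying assumptions: a model is just a pair of weights, the combined belief-and-action function is a deterministic linear response to a probe, and A₁ entertains a small finite set of candidate values for 𝝐. None of these choices come from the theory itself, and `M1`, `TRUE_EPS`, `respond`, `shifted` and `update` are hypothetical names. With deterministic responses, the Bayesian update reduces to discarding every candidate whose predicted action disagrees with the observed one.

```python
from itertools import product

M1 = (1.0, -0.5)          # A1's own model (hypothetical weights)
TRUE_EPS = (0.1, 0.0)     # hidden difference; A2's model is M1 + TRUE_EPS
CANDIDATES = list(product((-0.1, 0.0, 0.1), repeat=2))  # A1's hypotheses about eps

def respond(model, probe):
    """Deterministic stand-in for a*(b*(.)): a linear response to a probe."""
    return sum(w * x for w, x in zip(model, probe))

def shifted(model, eps):
    """The model obtained by adding the difference eps to `model`."""
    return tuple(w + e for w, e in zip(model, eps))

def update(posterior, probe, observed):
    """Bayes update with a 0/1 likelihood: keep candidates that predict `observed`."""
    consistent = {
        eps: p for eps, p in posterior.items()
        if abs(respond(shifted(M1, eps), probe) - observed) < 1e-9
    }
    total = sum(consistent.values())
    return {eps: p / total for eps, p in consistent.items()}

M2 = shifted(M1, TRUE_EPS)
posterior = {eps: 1.0 / len(CANDIDATES) for eps in CANDIDATES}  # uniform prior
history = []
for probe in [(1.0, 0.0), (0.0, 1.0)]:
    # A1 projects its own response, then compares it with what A2 actually does.
    prediction_error = respond(M2, probe) - respond(M1, probe)
    posterior = update(posterior, probe, respond(M2, probe))
    history.append(len(posterior))
    print(probe, "prediction error:", prediction_error, "candidates left:", len(posterior))
```

Each probe removes only the candidates it can discriminate, so a single action is not enough to pin down 𝝐. Setting `TRUE_EPS` to zero makes every prediction error vanish, which recovers the identical-agents case.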
Following the law of active inference, A₁ will act to resolve this uncertainty, by selecting policies that create different behaviors in A₂, in order to attempt to narrow down the true value of m₂ — and vice-versa. Note that since, by assumption, the agents believe that they are inherently different (no causal chain from A₁ to m₂), they will not seek to resolve uncertainty by increasing similarity: the active inference here is purely explorative, not exploitative.
What happens now has a strong dependency on the temporal/counterfactual depth of the model space. For reference, in a “mere active inference” system with zero depth, each agent can only select one action ahead and has no memory of the outcomes of its previous action, and so the agents will cycle through possible solutions, without actually narrowing down their uncertainty about each other (except by accident, if they hit a point where s* is invertible).
On the other hand, given a more capacious system, each agent will select a policy (sequence of actions) that will reduce its overall posterior uncertainty about 𝝐. Now, since 𝝐 is small by hypothesis, the uncertainty is mostly directional, i.e., it concerns which degrees of freedom are the ones where the agents deviate the most from each other. The optimal strategy is for the agent to select actions that “exercise” each degree of freedom in turn, in order of decreasing present uncertainty, which suggests that agents would instinctively perform something akin to Principal Component Analysis (PCA). After performing enough actions to exhaust their temporal depth, agents will therefore land in a state of minimal uncertainty about each other, given their capabilities.
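The ordering argument can be illustrated with a deliberately simple sketch. It assumes that the agent's prior uncertainty about 𝝐 is independent across degrees of freedom (so the principal components are just the coordinate axes), that one action fully resolves one degree of freedom, and that temporal depth is the number of actions available. The variances and the names `PRIOR_VARIANCE`, `greedy_order` and `residual_uncertainty` are hypothetical.

```python
# Hypothetical prior variance of eps along each degree of freedom.
PRIOR_VARIANCE = [0.2, 3.0, 0.5, 1.5]

def greedy_order(variances):
    """Probe degrees of freedom in order of decreasing present uncertainty."""
    return sorted(range(len(variances)), key=lambda i: variances[i], reverse=True)

def residual_uncertainty(order, depth):
    """Total variance left after probing the first `depth` entries of `order`."""
    probed = set(order[:depth])
    return sum(v for i, v in enumerate(PRIOR_VARIANCE) if i not in probed)

pca_like = greedy_order(PRIOR_VARIANCE)        # largest-variance axis first
naive = list(range(len(PRIOR_VARIANCE)))       # axes in arbitrary (index) order

for depth in range(len(PRIOR_VARIANCE) + 1):
    print(depth, residual_uncertainty(pca_like, depth), residual_uncertainty(naive, depth))
```

For every temporal depth short of the full number of degrees of freedom, the largest-first ordering leaves no more residual uncertainty than the arbitrary one, which is the sense in which a shallow agent benefits most from a PCA-like strategy.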
Now, consider again the evolutionary process that evolves agents and their models. As observed above, the difference between models is the only source of uncertainty in the system. Therefore, the agents will tend to evolve towards 𝝐=0, i.e., towards becoming identical. This is because, in this scenario, there is no value in being different — something that will be rectified in the next scenario.
Agents that collaborate
Finally, let us introduce again a “natural environment” ϕ, which interacts with both agents. As before, let’s assume that agents are similar and have some measure of introspective self-knowledge. Additionally, assume that each agent’s model includes the correct causal structure (i.e., the one described below), in particular that they know about the existence of both the other agent and the environment.
This is a composition of three causal loops: the two agent-environment loops, and the one comprising the bond between the two agents. Accordingly, both agents will perform active inference both about the environment and about each other. What is notable in this setting, though, is that the presence of a bonded peer reveals information to each agent about the environment, via the a₂→ϕ→s₁ pathway. That is, by observing not only the results of one’s own actions, but also of its peer, each agent learns about the environment faster — albeit at the cognitive cost of having to discriminate the effects of the two actions. This opens up the door for collaborative inference strategies, where peers will use their bond to coordinate on a joint course of action which explores the environment in a way that exploits the availability of two agents — specifically, we conjecture, by performing the same kind of exploratory PCA on the environment, but tag-teaming on alternating degrees of freedom.
For this, it is critical that both agents have a bond, and therefore that they are sufficiently similar to form that bond. However, it is actually useful for them to not be identical: similar but different agents will exhibit similar but different behaviors, and therefore exercise the environment in slightly different ways. Thus, the environment will be explored in a richer way, providing each agent with information that it wouldn’t naturally have uncovered otherwise.
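A toy count shows why coordination matters here. The sketch assumes that the environment has a fixed number of hidden degrees of freedom, that each action reveals exactly one of them, and that every agent observes the outcome of every action (the a₂→ϕ→s₁ pathway). It ignores the cognitive cost of telling the two agents' effects apart. The name `steps_to_learn` and the probing plans are hypothetical.

```python
# Hidden degrees of freedom of the environment (hypothetical values).
ENVIRONMENT = [0.3, -1.2, 0.8, 2.5, -0.7, 1.1]

def steps_to_learn(plans):
    """Count time steps until every degree of freedom has been probed.

    `plans` holds one probing order per agent. At each step every agent
    probes the next index in its plan, and all agents see all outcomes.
    Returns None if the plans never cover the whole environment.
    """
    known = set()
    for step in range(max(len(plan) for plan in plans)):
        for plan in plans:
            if step < len(plan):
                known.add(plan[step])
        if len(known) == len(ENVIRONMENT):
            return step + 1
    return None

solo = steps_to_learn([[0, 1, 2, 3, 4, 5]])
uncoordinated = steps_to_learn([[0, 1, 2, 3, 4, 5], [0, 1, 2, 3, 4, 5]])
tag_team = steps_to_learn([[0, 2, 4], [1, 3, 5]])
print(solo, uncoordinated, tag_team)
```

Two agents that follow the same plan learn no faster than one, while two agents that split the degrees of freedom between them halve the time. Identical agents would naturally fall into the first pattern; a bond plus some difference between the agents is what makes the second available.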
From bonded agents to collectives
From the “elementary particles” of the sections above, we zoom out to see many agents interacting with each other (via the thick edges representing bonds) and with the environment. Note that each agent can have many bonds. In fact, the number of bonds for each given agent is driven by the following tradeoff: more bonds equal more stability and more implicit information from the environment, but also more cognitive cost and more bandwidth spent helping peers.
Now, if we zoom out even further, we can obtain a macro-level Markov blanket by simply aggregating all individual agents’ active, sensory and model components into “macro” components a’, s’, m’ respectively. By exclusion, everything else in the micro systems — including not only the agents’ individual beliefs, but the micro-structure of their relationships — gets aggregated into the “macro belief” b’. This is our final picture of the social collective: an entity composed exclusively of bonded micro-agents, yet looking very much like an agent itself.
The obvious next question is how closely this aggregate agent actually follows the FEP itself. We conjecture that, given perfectly optimizing micro-agents, the collective is also optimizing, and the recognition capacity is:
- If the bonds are fixed: given by the product space of the internal subsystems of the currently bonded agents — i.e., growing quadratically with the number of micro-agents.
- If the bonds are allowed to change as part of the learning dynamics: given by the space of all possible configurations — i.e., growing exponentially with the number of micro-agents.
As promised in the introduction, this is a very different picture of aggregation than the one in purely physical systems. In the latter, constrained by the number and range of possible interactions, individual micro-agents play the roles of active, sensory or internal (belief/model) elements in the macro-agent, depending on their (physical and therefore causal) standing relative to the external environment and to each other; therefore, aggregate capacity grows at best linearly with the number of agents, and in practice much slower. Whereas in social systems, all agents simultaneously contribute to all three subsystems of the collective, via external interactions and internal bonds, allowing capacity to grow at least quadratically. This means that collectives with relatively few members will already exhibit full agent-like behavior.
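One way to get a feel for these growth rates is to count structures. The sketch below adopts a specific and debatable reading of the conjecture: with fixed bonds, capacity is proxied by the number of agent pairs that can be bonded, and with rewirable bonds, by the number of distinct bonding configurations (undirected graphs on the agents). Both proxies, and the function names, are assumptions made for illustration rather than results of the theory.

```python
from math import comb

def linear_capacity(n):
    """Physical-style aggregation: at best one unit of capacity per agent."""
    return n

def fixed_bond_capacity(n):
    """Number of agent pairs that could be bonded: n * (n - 1) / 2."""
    return comb(n, 2)

def rewirable_capacity(n):
    """Number of distinct bond configurations: each pair is bonded or not."""
    return 2 ** comb(n, 2)

table = [(n, linear_capacity(n), fixed_bond_capacity(n), rewirable_capacity(n))
         for n in range(2, 9)]
for row in table:
    print(row)
```

Under this reading, the rewirable count outgrows any fixed power of the group size within a handful of agents, which fits the claim that small collectives can already be rich enough to behave like agents.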
On to superminds
If a collective is also a free-energy-minimizing agent, is it reasonable to call it a supermind? On the one hand, it has a very large recognition capacity compared to that of its constituting agents. On the other hand, we haven’t shown that it is capable of adaptive active inference — only (conjectured) that it follows the FEP. This means we could have collectives of smart agents acting dumbly — i.e., without foresight or learning.
Here, again, bonds are the determinant factor. In a population of solitary agents with no bonds, individual agents’ actions will be independent, and therefore the recognition capacities do not add up; not only that, they may detract from each other, as uncoordinated action causes the agents to interfere destructively. Therefore, this population will, in general, be less smart than an individual agent, and may not even have any meaningful recognition capacity.
However, “true collectives” made out of bonding agents can not only avoid destructive interference, but exhibit collaborative courses of action, as we have seen. Such a system will exhibit learning at the individual and collective level (e.g., with specific tasks being preferentially distributed to individuals with fitting skill sets). Additionally, if bonds are allowed to change, it will also exhibit learning at the structural level (e.g., with new subgroups of individuals gathering around parts of the environment that haven’t been adequately covered yet, and forming bonds to collaborate on them). This shows that the implicit model of the collective has a temporal depth that encompasses that of the individuals’, but also includes time scales that go far beyond what individuals would directly process. Such a collective can reasonably be called a supermind.
Emergence of structure and the global supermind
Such superminds, capable of understanding and intentional action, are also agents, capable of interacting with each other. Of course, unlike micro-agents, who are all more or less cut from the same cloth, superminds vary widely in their size, internal structure and lifespan, and this limits the stability of the bonds that they can form. However, this is more than made up for by their expanded bonding capacity, i.e., their “social surface area”, effectively determined by which micro-agents are most strongly involved in creating social bonds (vs interacting only with the natural environment or internally). Therefore, at any given moment we will have superminds bonding with each other, and forming even larger superminds.
The overall picture, then, is an ecosystem composed of a rich variety of superminds, with a broad range of structures, constantly changing shape, bonding, nesting and decoupling from each other. This is precisely Malone’s account of the “global supermind”, emerging from first principles.
We finally land back where we started. Malone argues that the superminds framework accounts for the entire spectrum of social structures and patterns. I subscribe to that, and believe the above theory will be able to reproduce this spectrum in simulations. Going even further, it seems clear that this framework also gives us quantitative “hooks” into theories of idea formation, symbolic communication, strategic behavior, and collective decision-making — theories which have, to date, been built on more ad hoc foundations. Together with expanded access to behavioral microdata about organizational and social life, such “hooks” should be able to allow us to precisely define and quantify traits of organizations, such as effectiveness, connectivity, agility, velocity, resilience — and perhaps even intelligence, mindfulness, and consciousness.