<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Javier Ideami on Medium]]></title>
        <description><![CDATA[Stories by Javier Ideami on Medium]]></description>
        <link>https://medium.com/@ideami?source=rss-7f7b5d730c84------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*vJeao0WwYMWImjodsQnZ_w.jpeg</url>
            <title>Stories by Javier Ideami on Medium</title>
            <link>https://medium.com/@ideami?source=rss-7f7b5d730c84------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sat, 11 Apr 2026 04:29:24 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@ideami/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[How to grow a sustainable artificial mind from scratch]]></title>
            <link>https://ai.plainenglish.io/how-to-grow-a-sustainable-artificial-mind-from-scratch-54503b099a07?source=rss-7f7b5d730c84------2</link>
            <guid isPermaLink="false">https://medium.com/p/54503b099a07</guid>
            <category><![CDATA[agi]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[active-inference]]></category>
            <dc:creator><![CDATA[Javier Ideami]]></dc:creator>
            <pubDate>Thu, 05 Dec 2024 19:36:09 GMT</pubDate>
            <atom:updated>2024-12-06T09:53:53.408Z</atom:updated>
<content:encoded><![CDATA[<h4>Scale free Active Inference, towards a more sustainable and explainable AI</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1000/1*1MnQJrIDieTCQExTMfmwjQ.jpeg" /><figcaption>Image by Javier Ideami</figcaption></figure><p>Today’s AI is impressive and useful in many ways. We can view it as a massive subconscious pattern matching engine. It operates as a fast thinking (System 1, per <a href="https://en.wikipedia.org/wiki/Thinking,_Fast_and_Slow">Kahneman</a>) entity, excelling in a wide range of tasks because much of human life revolves around such pattern matching processes.</p><p>Additionally, AI is beginning to tackle slow thinking (System 2), deep reasoning and agentic behavior, although it is still in the early stages of development in those areas. Despite these advancements, many experts agree that current AI faces several key challenges, including the following:</p><ul><li>Lack of efficiency and sustainability</li><li>Lack of explainability and transparency</li><li>Brittleness due to an underdeveloped world model</li><li>Limited predictive capability for deep reasoning and planning (System 2). OpenAI’s o1 model represents early steps in this direction</li><li>Absence of built in ethical guardrails</li><li>Lack of active, continuous learning and adaptation</li><li>Limited generalization across different tasks and domains</li><li>Inability to understand or process complex, abstract concepts like common sense reasoning (at a robust level)</li><li>Difficulty in handling uncertainty and ambiguity in dynamic environments</li></ul><p>While we’re all enjoying the progress of current AI, and I certainly do, it’s valuable to broaden our horizons by exploring alternative paradigms. These could potentially enhance the current approach, or offer insights that help improve it. 
One such paradigm is Active Inference, founded by Karl Friston, the world’s most cited neuroscientist today.</p><h3>Active Inference</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1000/1*D0O4CldJBAIYW_43oAYEHA.jpeg" /><figcaption>Image by Javier Ideami</figcaption></figure><p>Active Inference is a theory that explains how intelligent systems, whether biological or artificial, perceive, act, and learn. At its core, it suggests that all intelligent behavior arises from minimizing uncertainty (or free energy) about the world by building and constantly updating internal models of it.</p><p>These models predict sensory input, and the differences between predictions and observations drive learning and decision making.</p><p>What makes Active Inference so promising is that it brings it all together: it bridges perception, action, and learning into a single process, offering a way to design AI systems that not only adapt dynamically to their environments but also emulate the kind of curiosity and planning typical in humans.</p><p>Active Inference addresses the limitations of the current paradigm that I described above in different ways:</p><ul><li><strong>Efficiency and sustainability</strong>: By focusing on minimizing free energy, Active Inference optimizes resource use and reduces computational waste, making systems more efficient.</li><li><strong>Explainability</strong>: Its foundation in probabilistic generative models provides a clear, interpretable structure for how systems perceive, predict, and act, offering insights into their decision making.</li><li><strong>Brittleness</strong>: Active Inference builds robust world models that are constantly updated through sensory feedback, making systems more adaptable and less fragile.</li><li><strong>Predictive reasoning and planning:</strong> By integrating perception and action into a single predictive framework, it enables deep reasoning and long term planning, addressing the gap in current system 2 
capabilities.</li><li><strong>Ethical guardrails</strong>: Systems based on Active Inference can incorporate human aligned priors, ensuring actions minimize harm and align with ethical principles.</li><li><strong>Active continuous learning</strong>: Its constant cycle of prediction, observation, and updating ensures systems are perpetually learning and adapting to their environment.</li></ul><p>One of the biggest challenges for Active Inference to go mainstream is scaling. The algorithms are computationally expensive, and ongoing efforts are focused on finding effective solutions to overcome this hurdle.</p><p>This article focuses on a talk given by Karl Friston at the recent Active Inference Symposium, where he presented a new academic paper (<a href="https://arxiv.org/abs/2407.20292">https://arxiv.org/abs/2407.20292</a>) proposing a solution to this challenge: <strong>How can we scale Active Inference for high dimensional real world problems?<br></strong>(the full talk is available at <a href="https://www.youtube.com/watch?v=ee6_mNOfP38">https://www.youtube.com/watch?v=ee6_mNOfP38</a>)</p><p>The majority of this article goes through Karl’s explanation, part by part, adding clarifications and elaborating on key points. Occasionally, I’ll provide extra commentary on implications or variations of the topics discussed.</p><h3>A Scale Free Approach</h3><p>In his talk, Karl Friston begins by addressing a key challenge for self organizing agents: understanding the coupling between different scales, the micro and the macro, and how they relate in terms of causality, with both bottom up and top down influences.</p><p>Scaling, in this context, is closely linked to abstraction, as it involves scaling information across various dimensions, such as state space and time. 
A crucial method for scaling is through compression, which, when done efficiently, produces abstractions of the original system.</p><p>This is a fundamental challenge for all forms of AI, and is particularly relevant to the current dominant AI paradigm. To tackle this, Karl Friston starts with the basics: defining what a <strong>thing </strong>is, and how we can define an agent that interacts with the world. This leads us to the concept of a Markov Blanket.</p><h3>Defining the agent</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*qnWUc7DHmsoFaRtk4MVeOQ.jpeg" /><figcaption>Image by Javier Ideami | ideami.com</figcaption></figure><p>A <strong>Markov blanket</strong> is the boundary that separates a system (like a brain or an AI) from its environment. It consists of sensory inputs (what the system senses) and actions (how it affects the environment), allowing the system to interact with and infer the state of the world without directly accessing it. It’s like a filter or interface through which the system perceives and influences its surroundings.</p><p>So the Markov blanket of an agent separates internal and external states through a set of boundary states. And those boundary states are the sensory and the active states.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*TBCQItfVY-twXHsnYTUniQ.jpeg" /><figcaption>Image by Javier Ideami | ideami.com</figcaption></figure><p>In this way, Karl tells us, the inside can influence the outside via active states (outputs).<br>And the outside can influence the inside via sensory states (inputs).</p><p>Great, we got a way of defining the essence of what constitutes an agent. 
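</p><p>As a toy illustration of this boundary structure, here is a minimal sketch (my own, not from the talk; the coupling coefficients are arbitrary) of an update loop in which internal and external states never touch each other directly, interacting only through the sensory and active states of the blanket:</p>

```python
import numpy as np

# One agent step respecting the Markov blanket partition:
# internal states never read external states directly (and vice versa);
# all coupling flows through the blanket (sensory and active states).
def step(external, sensory, active, internal):
    sensory = 0.5 * external + 0.5 * active    # outside -> blanket (input)
    internal = 0.9 * internal + 0.1 * sensory  # blanket -> inside
    active = np.tanh(internal)                 # inside -> blanket (output)
    external = 0.9 * external + 0.1 * active   # blanket -> outside
    return external, sensory, active, internal

external, sensory, active, internal = 1.0, 0.0, 0.0, 0.0
for _ in range(100):
    external, sensory, active, internal = step(external, sensory, active, internal)
```

<p>Note that the function signature alone encodes the blanket: swap in any dynamics you like, and as long as the internal update only reads sensory states and the external update only reads active states, the partition is preserved.</p><p>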
Now it’s time to define the mechanics of its behavior.</p><h3>Bayesian Mechanics</h3><p>Quantum mechanics, classical mechanics, statistical mechanics, and Bayesian mechanics all share a common goal: to model, describe, and predict the behavior of systems, from the smallest particles to complex, large scale structures. Each of these frameworks, while distinct in their methods and applications, tries to understand how systems evolve, interact, and respond to various influences.</p><p>In the case of Active Inference, Bayesian mechanics is a framework for understanding how systems predict and adapt to their environment by minimizing uncertainty.</p><p>Unlike classical mechanics, which describes physical systems with deterministic laws, and quantum mechanics, which operates on probabilistic principles, Bayesian mechanics focuses on probabilistic models where systems continuously update beliefs about the world using Bayes’ rule.</p><p>This makes it ideal for modeling adaptive, self organizing systems like brains or AI, which need to learn and act in uncertain, ever changing environments.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*EU8eGyQaQyny1qCr26XSFQ.jpeg" /><figcaption>Image by Javier Ideami</figcaption></figure><p>So think of Bayesian mechanics as the dynamics of anything that can be expressed in terms of Markov blankets. For entities of that kind, their dynamics can be described as active inference.</p><p>Karl explains that we can summarize Bayesian mechanics through a couple of equations, the variational free energy equation and the expected free energy equation.</p><p>Think of <strong>free energy</strong> as a measure of how much a system’s predictions about the world differ from what it actually experiences. It’s like a mismatch score. The bigger the mismatch, the higher the free energy. Systems work to minimize this mismatch by improving their predictions or changing their actions to make the world more like they expect. 
This process helps them learn, adapt, and stay in balance with their environment.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*0lKxtmWmO_lfIQ0JhmV3DA.jpeg" /><figcaption>The agent tries to minimize Free Energy, the difference between its predictions and its observations | Image by Javier Ideami</figcaption></figure><p>Let’s now consider those two equations:</p><p><strong>VFE (Variational Free Energy)</strong>: It’s about keeping the internal states of the agent as good models of the external states, thereby tracking in a probabilistic way the dynamics of those outside states. Minimizing variational free energy helps the agent improve its model of the world, making its beliefs about the world accurate enough while keeping them as simple as possible.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*JvOdiLYapkI81wqPiqjN2Q.jpeg" /><figcaption>Perceptual Inference | Image by Javier Ideami</figcaption></figure><p><strong>EFE (Expected Free Energy):</strong> encapsulates prior beliefs about the kinds of actions that the agent can exert over external states. Minimizing expected free energy helps the agent find the right sequence of actions to take.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Dj4vUaQ21lkoz0lY4IF1Pg.jpeg" /><figcaption>Active Inference | Image by Javier Ideami</figcaption></figure><p>Karl explains that we can combine both equations to produce a Generalized Free Energy equation. At this stage, it’s not crucial to understand these equations in great detail. What matters is understanding their purpose and why they exist.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Wg00ebSdGegI_i-tuZvjbw.jpeg" /><figcaption>Image by Karl Friston via <a href="https://www.youtube.com/watch?v=ee6_mNOfP38">https://www.youtube.com/watch?v=ee6_mNOfP38</a></figcaption></figure><p>So far, we’ve defined what an agent is and introduced the equations that govern its dynamics. 
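</p><p>To make the mismatch score concrete, here is a minimal numerical sketch of variational free energy for a two-state, two-observation model (the probabilities are invented for illustration). Free energy is lowest when the agent’s beliefs match the exact Bayesian posterior, and at that point it equals the surprise, the negative log evidence:</p>

```python
import numpy as np

# Toy generative model: 2 hidden states, 2 possible observations.
prior = np.array([0.5, 0.5])          # p(s)
likelihood = np.array([[0.9, 0.2],    # p(o|s), rows = observations
                       [0.1, 0.8]])

o = 0                                 # the observation received
joint = likelihood[o] * prior         # p(o, s) as a function of s
evidence = joint.sum()                # p(o)
posterior = joint / evidence          # exact p(s|o)

def free_energy(q):
    # F = E_q[log q(s) - log p(o, s)]: the prediction/observation mismatch
    return float(np.sum(q * (np.log(q) - np.log(joint))))

F_vague = free_energy(np.array([0.5, 0.5]))  # beliefs not yet updated
F_best = free_energy(posterior)              # beliefs = exact posterior
```

<p>Running this, F_best equals -log p(o) exactly, and any other belief gives a larger value, which is what licenses treating free energy minimization as approximate Bayesian inference.</p><p>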
By minimizing Expected Free Energy (EFE) and Variational Free Energy (VFE), the agent can improve its world model and select optimal sequences of actions.</p><p>To keep this article straightforward, we’ll avoid diving into the full complexity of these equations. However, it’s worth noting that each equation is broken down into distinct terms, each with significant implications.</p><p>The <strong>EFE (Expected Free Energy) </strong>equation is composed of a Pragmatic term, an Epistemic term, and an Entropic term.</p><ul><li>The <strong>pragmatic </strong>term measures how effectively the agent is pursuing its preferences or goals, evaluating how well it exploits its current beliefs.</li><li>The <strong>epistemic </strong>term assesses how effectively the agent is exploring, increasing its knowledge of the environment.</li><li>The <strong>entropic </strong>term quantifies the uncertainty surrounding the outcomes of the agent’s actions, providing stability and ensuring balanced decision making.</li></ul><p>Without going into too much detail, notice that this equation will allow the agent to balance exploration (curiosity, creativity) and exploitation in different ways depending on what’s happening at each moment.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*6xSdl7fSwkHhGJ_M85d41g.jpeg" /><figcaption>Image by Javier Ideami | ideami.com</figcaption></figure><p>The Variational Free Energy (VFE) equation consists of two key components: an <strong>accuracy </strong>term and a <strong>complexity </strong>term. This means the agent is constantly trying to balance accuracy with simplicity, seeking the most efficient explanations for its observations. 
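</p><p>This accuracy/complexity split can be checked numerically. In the small sketch below (a toy model with invented numbers), free energy computed as complexity minus accuracy agrees exactly with the direct mismatch form:</p>

```python
import numpy as np

# VFE decomposed as F = complexity - accuracy, with
# complexity = KL(q || prior)  (cost of moving beliefs away from the prior)
# accuracy   = E_q[log p(o|s)] (how well beliefs explain the observation)
prior = np.array([0.5, 0.5])
likelihood = np.array([[0.9, 0.2],
                       [0.1, 0.8]])
o = 0
q = np.array([0.7, 0.3])  # the agent's current beliefs over hidden states

complexity = float(np.sum(q * np.log(q / prior)))
accuracy = float(np.sum(q * np.log(likelihood[o])))
F = complexity - accuracy

# The same quantity written directly as E_q[log q(s) - log p(o, s)]:
F_direct = float(np.sum(q * (np.log(q) - np.log(likelihood[o] * prior))))
```

<p>The two forms agree term by term: expanding log p(o, s) = log p(o|s) + log p(s) turns the mismatch form into exactly the complexity and accuracy terms the agent is balancing.</p><p>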
It develops a world model that is both sufficiently accurate and efficient, aiming to compress and abstract the complexity of its environment in the simplest possible way.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*HtC5a639QN735gBszvciYQ.jpeg" /><figcaption>Image by Javier Ideami | ideami.com</figcaption></figure><p>The next challenge is to explore how<strong> markov blankets</strong> can give rise to scale free dynamics in the self organizing agents under consideration.</p><h3>On the Scaling Elevator</h3><p>Karl guides us through the interplay between the micro and the macro to uncover the nature of the states that define entities or particles.</p><p>At the heart of this exploration lies recursivity.</p><ul><li>Particles, Karl says, constitute a set of microscopic blanket and internal states.</li><li>These states, in turn, form a set of macroscopic eigenfunctions, eigenmodes, or mixtures derived from the blanket states.</li></ul><p>When Karl uses <strong>eigen</strong> to describe states, such as <strong>eigenfunctions </strong>or <strong>eigenmodes, </strong>he is referring to patterns or stable structures that naturally arise within a system.</p><p>The term eigen here implies fundamental, characteristic forms or modes that persist over time and help define the behavior of the system. 
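</p><p>To make the eigen vocabulary concrete, here is a small sketch (my own illustration, with an invented transition matrix): the eigenvalues of a Markov transition matrix directly measure how persistent each eigenmode is, with values near 1 corresponding to slow, stable patterns and small values to fast-decaying detail:</p>

```python
import numpy as np

# A 4-state Markov chain with two tightly coupled pairs of states.
# Columns are "from" states, rows are "to" states; columns sum to 1.
T = np.array([
    [0.88, 0.10, 0.01, 0.01],
    [0.10, 0.88, 0.01, 0.01],
    [0.01, 0.01, 0.88, 0.10],
    [0.01, 0.01, 0.10, 0.88],
])

eigvals = np.real(np.linalg.eigvals(T))
eigvals = eigvals[np.argsort(-np.abs(eigvals))]

# eigvals[0] is 1 (the stationary mode); eigvals[1] is the slow
# eigenmode coupling the two clusters; the rest decay quickly.
slow_mode = eigvals[1]
fast_mode = eigvals[-1]
```

<p>A pattern aligned with the slow eigenmode shrinks by only its eigenvalue per step, so it persists over many steps, while the fast modes are quickly forgotten; that persistence is what the text means by stable, intrinsic patterns.</p><p>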
Just as in physics or linear algebra where eigenvalues and eigenvectors represent core properties that remain consistent under certain transformations, in Active Inference, eigenstates or eigenmodes represent stable, intrinsic patterns within the system’s dynamics or blanket states.</p><p>These eigen patterns are the fundamental ways that a system organizes its internal and external states to minimize free energy and maintain a balanced interaction with the environment.</p><p><strong>In summary:</strong></p><ul><li><strong>Eigenfunctions </strong>represent stable, functional states that emerge from the interactions of microscopic states.</li><li><strong>Eigenmodes </strong>refer to the patterns of collective behavior that these blanket states assume when organized at larger scales.</li></ul><p>So, in order to create different scales or levels of granularity and abstraction:</p><ul><li><strong>Partitioning States by Dependencies</strong>: First, we group or partition states based on how they are coupled, meaning how they influence or depend on each other. By identifying which states interact closely (have strong dependencies), we can treat them as smaller, interrelated clusters.</li><li><strong>Creating Eigen Mixtures for Simplicity</strong>: Next, we combine (or coarse grain) these clusters into new, simplified eigen mixtures. These are stable, low dimensional representations of the original boundary states that capture their essential interactions without the need for all the detailed complexity. This dimensionality reduction lets us represent the system’s behavior at a higher level, while still preserving the key dependencies.</li><li><strong>Scaling Up by Repeating the Process</strong>: By repeating this clustering and reduction process, we can create layers of abstraction at different scales. Each higher scale captures broader, more abstract patterns, gradually filtering out the smaller details. 
This approach lets us model complex systems in a way that is both manageable and hierarchical, moving from fine grained details to big picture dynamics.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/950/1*AtDpo5wS-tKm-MuSMR06UQ.jpeg" /><figcaption>Image by Karl Friston via <a href="https://www.youtube.com/watch?v=ee6_mNOfP38">https://www.youtube.com/watch?v=ee6_mNOfP38</a></figcaption></figure><p>Therefore, the process involves two main steps, repeated in cycles:</p><ol><li><strong>Cluster by Causal Influence</strong>: Start by grouping or clustering states based on their causal influences or dependencies, meaning which states tend to affect each other directly. This grouping captures how different parts of the system are interrelated.</li><li><strong>Coarse Grain the Clusters</strong>: Then, reduce the complexity within each cluster by creating a simpler, coarse grained version of it. This involves combining the states within each cluster into a single, representative eigen mixture that keeps only the essential patterns or interactions, discarding less important details.</li></ol><p>Repeating these steps lets us build layers of increasing abstraction, where each higher level preserves the major patterns and causal structures from the layer below. This way, we can understand complex systems in a structured, hierarchical way, focusing on the most important interactions at each scale.</p><p>Now, in order to apply that coarse graining of the clusters, we apply some operators which we call <strong>Renormalization Group Operators (RG).</strong></p><h3>Renormalization and scale invariance</h3><p><strong>Renormalization Group Operators (RG)</strong> simplify or compress clusters of states, while preserving the essential dynamics or behavior of the system.</p><ul><li><strong>Applying RG Operators</strong>: After clustering by causal influence, we use RG operators to coarse grain these clusters. 
The RG operator effectively reduces the complexity of the cluster by creating a simpler, higher level representation that captures the main interactions within that cluster.</li><li><strong>Scale Invariance</strong>: When this coarse graining process is applied recursively, we move up one level of abstraction or scale each time, creating a hierarchy. Scale invariance means that each level preserves the core dynamics of the level below, even though it’s represented in a simpler form.</li></ul><p>By recursively applying these RG operators, we achieve a structured, multi-scale model of the system where each level maintains the overall dynamics, even though we’re gradually discarding finer details. This process is key to help model complex systems efficiently while retaining the essential behavior at each scale.</p><p>And Karl explains the reasons why we can apply the RG operators without disrupting the dynamics of the agent:</p><ul><li>We know that anything that has a particular partition can be described in terms of a gradient flow of a variational generalized free energy.</li><li>So when a system is partitioned (into internal, blanket, and external states), each partition behaves like an independent, self contained system. And it follows a principle of <strong>gradient flow</strong>, meaning it continuously adjusts to minimize its own free energy. This makes the partitioned subsets predictable and stable in terms of their dynamics.</li><li>The <strong>renormalization group operators</strong> can be applied effectively to such partitioned systems because each partitioned subset of states (with its own internal states, blanket states, and external states) acts as a self contained system that seeks to minimize its free energy through a gradient flow.</li><li>So each partition, or each particle, behaves as if it is minimizing its own generalized free energy. 
The RG operators then coarse grain or reduce these partitions, simplifying them while conserving their dynamics (their tendency to minimize free energy).</li><li>The idea of scale invariance here comes from the fact that, after coarse graining, the same free energy minimization behavior can be observed at each level, maintaining a consistent dynamic across scales.</li><li>Since each partitioned state structure behaves according to that gradient flow (minimizing free energy), we know that applying RG transformations will preserve the fundamental dynamics of the system at each level of scale. This property is what makes RG transformations valid in this context. They do not disrupt the system’s underlying tendency to self organize, as each coarse grained structure continues to follow the same free energy minimization principle.</li></ul><p>Therefore, the RG process works because free energy minimization is a universal, scale independent property of the system. As long as partitions are structured around this principle, applying RG operators simplifies the system without disrupting its core dynamics. 
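</p><p>The two-step cycle, cluster by causal influence and then coarse grain, can be sketched numerically. The grouping matrix and averaging rule below are my own simplified stand-ins for an RG operator, not the construction in the paper; the point is that the slow dynamics survive the reduction:</p>

```python
import numpy as np

# Micro scale: a 4-state chain with two strongly coupled clusters,
# {0, 1} and {2, 3} (columns sum to 1).
T = np.array([
    [0.88, 0.10, 0.01, 0.01],
    [0.10, 0.88, 0.01, 0.01],
    [0.01, 0.01, 0.88, 0.10],
    [0.01, 0.01, 0.10, 0.88],
])

# Step 1 -- cluster by causal influence: G assigns micro states
# to macro states.
G = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
])

# Step 2 -- coarse grain: average the dynamics within each cluster
# (assuming a uniform distribution over a cluster's micro states).
lift = G.T / G.sum(axis=1)
T_macro = G @ T @ lift        # 2x2 macro transition matrix

# The slow eigenmode (second-largest eigenvalue) is preserved, so the
# macro level keeps the essential dynamics of the micro level.
micro_slow = np.sort(np.abs(np.linalg.eigvals(T)))[-2]
macro_slow = np.sort(np.abs(np.linalg.eigvals(T_macro)))[-2]
```

<p>The macro chain is itself a valid Markov chain whose slowest non-stationary mode matches the micro chain’s, and applying the same two steps again would climb another level of the hierarchy, which is exactly the recursion described above.</p><p>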
This allows for a hierarchical, multi-scale model where each level accurately reflects the dynamics of the system as a whole.</p><p>Consider the elegance of this recursive, scale invariant dynamic:</p><ul><li>Particles are composed of states</li><li>States are eigen mixtures of particles</li></ul><p>This, Karl tells us, resolves a chicken and egg problem, through the creation of scale invariance provided by the renormalization group operators.</p><p>And Karl explains that as you move up the scales through coarse graining, you encounter slower macroscopic variables, ones that decay more gradually and exhibit persistence, providing a kind of <strong>temporal dilation</strong>.</p><p>Essentially, as you coarse grain at progressively higher levels, things change more and more slowly.</p><p>Tying this back to bayesian mechanics: this framework offers a way to model the agent’s external dynamics, which are inherently scale invariant.</p><p>This scale invariance arises naturally from defining <strong>things </strong>as particles with states, and states as eigen mixtures or functions of particles. This recursive structure inevitably leads to external dynamics that remain consistent across scales. And this consistency is inherent in the system’s structure.</p><p>In essence, we configure the system so that, regardless of how we partition or coarse grain it, whether at the microscopic particle level or the macroscopic system level, the overall behavior remains scale invariant.</p><p>And scale invariance, in this context, means that the system’s dynamics, how it evolves over time, are independent of the scale at which you observe it. As you coarse grain (move to higher levels of abstraction, grouping states into particles), the system slows down and becomes more persistent. 
The same underlying generative model still applies, but at larger scales, the behavior becomes slower, more stable, and exhibits temporal dilation.</p><p>We perform dimensionality reduction or coarse graining using these eigenfunctions, which capture the slow, macroscopic stable modes that dissipate or decay at a very gradual rate.</p><p>Building on everything discussed so far, we can now turn our attention to the agent’s generative model, its mind, so to speak. This generative model must also be scale invariant, or scale free (when applied to graphs), across both time and state space.</p><h3>The mind of the Agent</h3><p>Now it’s time to construct the mind of such an agent, which, as Karl explains, involves a model that captures a world with scale invariance in both state space and time.</p><p>To build these generative models, we can use <strong>generalized partially observed Markov decision processes (POMDPs)</strong>.</p><p>A Partially Observable Markov Decision Process (POMDP) is a mathematical framework used to model decision making problems where an agent has incomplete information about the environment’s state. It extends the basic concept of a <strong>Markov Decision Process (MDP) </strong>by incorporating partial observability, meaning the agent can’t directly observe the true state of the environment but only receives observations that are noisy or incomplete.</p><p>Think of it like navigating a room with a foggy window. You can make guesses based on blurry information, but you don’t have a clear view.</p><p><strong>Markov </strong>refers to the idea that the agent’s next move depends only on its current state, not the entire history of previous events. 
This simplification mirrors how we often operate in real life, where we make decisions based on what we know in the moment, rather than recalling every past detail, and it also simplifies computations by reducing the information the agent needs to process.</p><p>So in an <strong>active inference agent</strong>, we use this framework because it helps simulate decision making in uncertain conditions, where the agent doesn’t have all the answers but still needs to take action based on incomplete or noisy information, something real life systems face all the time, unlike simpler models where everything is perfectly observable.</p><p>In the context of POMDPs used in active inference, here are the key components and what each represents:</p><ul><li><strong>A (Likelihood Tensor): </strong>Represents the likelihood of observing certain data given the agent’s current beliefs about the world. It links the hidden states to the observed data.</li><li><strong>B (Transition Tensor)</strong>: Describes the probability of transitioning from one hidden state to another, depending on the current state and action taken by the agent.</li><li><strong>C (Preference Tensor)</strong>: Represents the agent’s preferences or desires for different states or actions. This could relate to goals or rewards the agent is trying to achieve.</li><li><strong>D (Action Tensor)</strong>: In addition to defining the agent’s possible actions, <strong>D</strong> can also express how actions influence transitions across different scales, representing the way decisions affect the system at multiple levels of granularity (or multiple scales of time and space). It models how actions at one level can influence or propagate through other levels.</li><li><strong>E (Exogenous Input Tensor)</strong>: Similarly, <strong>E</strong> can describe external inputs that influence the system at different scales, such as environmental changes or sensory data that affect both micro and macro levels. 
It can show how external factors couple the agent’s behavior across various levels of scale, influencing how the agent interacts with the world at both fine and coarse levels.</li><li><strong>Paths (Trajectory of States and Actions):</strong> In the generalized version, paths represent the sequence of states and actions over time, which can be used to track the agent’s experience or trajectory in the environment.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1000/1*XbjqmyxL8tdK7vZKRKApLg.jpeg" /><figcaption>Image by Karl Friston via <a href="https://www.youtube.com/watch?v=ee6_mNOfP38">https://www.youtube.com/watch?v=ee6_mNOfP38</a></figcaption></figure><p>These tensors work together to define the agent’s belief system, actions, and how it interacts with its environment, providing a framework for decision making under uncertainty.</p><p>And Karl explains that we add the word <strong>generalized </strong>to define these <strong>POMDPs </strong>because we are representing not only the states and the observations, but also the dynamics in terms of the paths that specify the transitions between states.</p><p>By incorporating these paths as random variables, we can model not just where the agent is at a given moment (hidden state) but also <strong>how</strong> it gets there and <strong>how</strong> it moves through the sequence of states over time, which adds more richness and flexibility to the <strong>generalized POMDP</strong> model. 
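</p><p>A minimal concrete version of these structures may help. The sketch below (illustrative shapes and numbers, covering only the A, B and C tensors from the list above) shows the two basic moves: updating beliefs from an observation through A, and pushing beliefs forward through B for a chosen action:</p>

```python
import numpy as np

# A: p(o|s), likelihood (rows = observations, columns = hidden states)
A = np.array([[0.9, 0.2],
              [0.1, 0.8]])
# B: p(s'|s, a), one transition matrix per action
B = np.array([
    [[0.9, 0.1], [0.1, 0.9]],  # action 0: mostly stay put
    [[0.5, 0.5], [0.5, 0.5]],  # action 1: randomize the state
])
# C: (unnormalized) preferences over observations
C = np.array([1.0, 0.0])

belief = np.array([0.5, 0.5])  # q(s), the agent's current beliefs

# Perception: Bayesian update of beliefs after observing o = 0.
o = 0
belief = A[o] * belief
belief = belief / belief.sum()

# Prediction: expected next state under action 0, and the
# observations that state distribution would generate.
a = 0
predicted_states = B[a] @ belief
predicted_obs = A @ predicted_states
```

<p>Comparing predicted_obs against the preferences in C for each available action is, in spirit, the pragmatic part of expected free energy discussed earlier; a full agent would also score the epistemic value of each action.</p><p>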
And in this way we have, Karl says, a generalized kind of Markov decision process.</p><p>It therefore means expanding beyond the simple states and observations seen in basic MDPs/POMDPs by incorporating:</p><ul><li>More complex relationships (paths as random variables)</li><li>Hidden states and their dynamics (not directly observable)</li><li>Transitions between states described probabilistically (using tensors to represent these paths)</li><li>A renormalization process that allows for different levels of abstraction and dynamic adaptation within the system.</li></ul><p>A reminder about random variables. A <strong>random variable</strong> is a variable that can take different values, but its value is uncertain. For example, in the case of <strong>paths being random variables</strong>, the path an agent takes through its hidden states is uncertain because it depends on its actions and the environment.</p><p>And a <strong>probability distribution</strong> is a way of describing the likelihood of each possible value that a random variable can take. In this case, the probability distribution would show how likely each possible path (sequence of states) is, based on the agent’s actions and the environment’s responses. So, the <strong>random variable</strong> represents the uncertain path, and the <strong>probability distribution</strong> tells us how probable each path is.</p><p>The generative model of the agent we are considering includes tensors (a <strong>tensor</strong> is a multi dimensional array of numbers used to represent data) that can be parametrized as<strong> dirichlet distributions</strong>. 
This means that the relationships encoded by the tensors (such as state transitions, actions, and observations) are represented probabilistically, rather than deterministically.</p><p>These tensors are parametrized by learnable parameters (like the mean, variance, or other parameters of a distribution), allowing the model to learn how these relationships evolve in the face of uncertainty and partial observability.</p><p>And what is a <strong>dirichlet </strong>distribution?</p><p>A <strong>dirichlet distribution</strong> is a type of probability distribution that is used when we want to model <strong>uncertainty</strong> about <strong>multiple possible outcomes</strong> that add up to 1 (like percentages or probabilities). It describes a situation where you have several categories or options, and you want to know the likelihood of different proportions or shares across those categories.</p><p>In the context of <strong>generalized POMDPs</strong> in <strong>active inference</strong>, the Dirichlet distribution is useful because it helps model <strong>uncertainty about different possible paths or states</strong> that the agent could take. It provides a flexible way to express how likely the agent thinks different sequences of events are, considering it doesn’t know exactly what will happen next. So, the Dirichlet distribution helps the agent <strong>estimate the probability of different scenarios</strong> in a way that adds up to a full picture of all possible outcomes.</p><p>Next, Karl explains that we will explore the various levels of depth within the generative model and examine how we can leverage the renormalizing aspect of this causal architecture in relation to the world in which our entities exist.</p><p>So, time to go deep.</p><h3>Parametric depth</h3><p>We arrive at a key twist in this fascinating exploration.</p><p>As we just explained, the different data structures in the mind of the agent are modeled with probability distributions. 
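</p><p>A common way to implement this Dirichlet parametrization in discrete active inference (a standard implementation choice, sketched here with invented numbers, not a detail of the talk) is to keep a tensor of concentration counts for each column of the likelihood tensor and grow the counts with experience:</p>

```python
import numpy as np

# Each column of A (p(o|s)) is modeled as a Dirichlet random variable;
# `counts` holds the Dirichlet concentration parameters.
counts = np.ones((2, 2))  # flat counts = maximal uncertainty about A

def expected_A(counts):
    # Posterior-mean likelihood tensor implied by the current counts.
    return counts / counts.sum(axis=0, keepdims=True)

def learn(counts, o, q_states):
    # After observing o under beliefs q(s), accumulate the evidence.
    new = counts.copy()
    new[o] = new[o] + q_states
    return new

# Repeatedly observe outcome 0 while fairly sure of being in state 0.
for _ in range(20):
    counts = learn(counts, o=0, q_states=np.array([0.95, 0.05]))

A_mean = expected_A(counts)  # beliefs about A sharpen over time
```

<p>This is parametric depth in miniature: the agent holds beliefs (the Dirichlet counts) about the parameters of the very distributions it uses to hold beliefs about hidden states.</p><p>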
Those probability distributions, mathematical models, have parameters that control them.</p><p><strong>Parametric depth</strong> means that the parameters of those probability distributions are themselves random variables with their own probability distributions.</p><p>For example, the likelihood mapping that maps from hidden states to observations (usually encoded with the A tensor) is itself encoded in terms of Dirichlet parameters, giving the parametric model depth.</p><p>Yes, it’s another form of recursivity. Recursivity is a phenomenon we often observe in the natural world, and it helps us model scale invariance in both state space and time within this framework.</p><p>There are also several important implications of having this parametric depth. It means that we now have to apply Bayesian mechanics to more things.</p><p>We will be using Bayesian methods to infer or estimate the underlying, hidden aspects of the system that we can’t directly observe, and that includes the hidden states but also the parameters of the distributions that model those hidden states as well as other parts of the generative model.</p><p>Karl explains that in the generative model we can consider three kinds of depths: <strong>temporal, hierarchical and factorial</strong>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*gqCQcPcl4-ck9Yx4nzgDbg.jpeg" /><figcaption>Image by Karl Friston via <a href="https://www.youtube.com/watch?v=ee6_mNOfP38">https://www.youtube.com/watch?v=ee6_mNOfP38</a></figcaption></figure><p>We have<strong> temporal depth</strong>, because we are modeling the transitions of the states towards the future. This temporal depth is generalized through the explicit path variables we will be making use of.</p><p>We have <strong>hierarchical depth</strong> in the sense that we can compose multiple MDPs. 
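</p><p>A minimal sketch of this parametric depth, assuming a Dirichlet-categorical setup where each column of the A likelihood mapping carries its own pseudo-counts (the helper names and numbers here are illustrative, not from Karl’s formulation):</p>

```python
import numpy as np

n_obs, n_states = 3, 2

# Dirichlet pseudo-counts for each column of the likelihood mapping A:
# a_counts[:, s] parametrizes a distribution over observations given state s.
a_counts = np.ones((n_obs, n_states))  # flat prior: maximal uncertainty

def likelihood(counts):
    """Expected A matrix under the Dirichlet parameters (column-normalized)."""
    return counts / counts.sum(axis=0, keepdims=True)

def update(counts, o, s, weight=1.0):
    """Seeing outcome o while inferring state s adds evidence to the
    corresponding pseudo-count (the Dirichlet-categorical update)."""
    counts = counts.copy()
    counts[o, s] += weight
    return counts

a_counts = update(a_counts, o=0, s=1)
A = likelihood(a_counts)  # columns sum to 1
```

<p>Each new observation just increments a pseudo-count, so uncertainty about the mapping itself shrinks as evidence accumulates: the parameters of the distribution are being inferred, not fixed by hand.</p><p>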
The output of a process can now form, Karl says, the input, or the empirical prior, or the inductive bias, or the top down constraint on the dynamics and states of the level below.</p><p>And the dynamics, the kind of outputs of the upper level that provide empirical constraints on the paths and initial conditions of the lower levels, are supplied or mediated by a mapping between higher levels (slow states), and the initial conditions or trajectories of the lower levels (through, for example, the D tensor).</p><p>Let’s dive into this last bit:</p><ul><li><strong>Dynamics of the upper level</strong>: At the upper level (the higher or slow states), we have outputs, things that emerge from the process at this level (like predictions, control signals, or constraints).</li><li><strong>Empirical constraints on the lower levels</strong>: These upper level outputs provide <strong>empirical constraints</strong> or guidelines that affect the <strong>paths</strong> (the transitions between states) and the <strong>initial conditions</strong> (the starting states) at the lower levels of the system.</li><li><strong>Mapping between higher and lower levels</strong>: The relationship between these levels is mediated by a <strong>mapping</strong> (a mathematical relationship or transformation), which connects the slow or high level states to the initial conditions of the lower level states. This ensures that the higher level states can guide or constrain the dynamics of the lower level system.</li><li><strong>D tensor</strong>: The D tensor (one of the tensors in the generalized POMDP framework) is responsible for encoding the relationship or mapping between the higher and lower levels. 
It helps specify how the higher level dynamics influence the <strong>initial conditions</strong> at lower levels.</li></ul><p>A clarification about paths and transitions:</p><ul><li><strong>Paths</strong> are the entire sequences of states (and actions), which track a trajectory.</li><li><strong>Transitions</strong> are the local probabilities of moving from one state to another, as described by the transition matrix (B matrix).</li><li>So, <strong>paths</strong> are made up of <strong>transitions</strong>, with <strong>transitions</strong> describing the immediate steps, and <strong>paths</strong> describing the entire sequence of states and actions taken over time.</li></ul><p>In summary, higher level processes (abstract goals or strategies) influence lower level processes (specific actions or states) in a hierarchical system.</p><p>This influence is mediated through a mapping that takes into account the slow dynamics of the higher level states, which then help define or constrain the initial conditions for the lower level states. 
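</p><p>The distinction above can be sketched in a toy two-state world (hypothetical numbers: a column of a D matrix, selected by the higher level state, supplies the initial-state prior, and a path is then a sequence of B-matrix transitions):</p>

```python
import numpy as np

rng = np.random.default_rng(1)

# D: columns are initial-state priors for the lower level,
# indexed by the state of the level above.
D = np.array([[0.9, 0.1],
              [0.1, 0.9]])

# B: transition probabilities between the two lower-level states,
# with B[next, current].
B = np.array([[0.7, 0.2],
              [0.3, 0.8]])

def sample_path(high_state, steps):
    """A path = an initial condition (from D) + a chain of transitions (from B)."""
    s = rng.choice(2, p=D[:, high_state])
    path = [s]
    for _ in range(steps):
        s = rng.choice(2, p=B[:, s])
        path.append(s)
    return path

path = sample_path(high_state=0, steps=4)  # 5 states, 4 transitions
```

<p>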
The relationships between these levels are encoded using tensors, which represent complex, probabilistic transitions between different states in the system.</p><p>And a bit more about the D and E tensors:</p><ul><li><strong>D</strong> (initial states): It encodes how the <strong>initial conditions</strong> at higher levels influence the <strong>states</strong> at lower levels, that is, the <strong>mapping</strong> of state space across levels.</li><li><strong>E</strong> (initial paths): It describes how the <strong>temporal dynamics</strong> or <strong>transitions</strong> at the higher level affect the <strong>paths</strong> or <strong>transitions</strong> at the lower levels.</li></ul><p>So, the <strong>D tensor</strong> deals with <strong>states</strong>, while the <strong>E tensor</strong> deals with <strong>paths</strong>.</p><p>Finally, we also have <strong>factorial depth</strong>, Karl explains.</p><ul><li><strong>Factorial depth</strong>: It refers to how a model can deal with multiple independent factors that influence an outcome. Think of it as having many different ingredients (factors) that independently affect a final dish (the outcome).</li><li><strong>Independent factors</strong>: These factors may be independent of each other, meaning that each factor can change without directly affecting the others. However, when combined, they all contribute to the final outcome. So, a generative model with factorial depth considers all these independent factors, but may initially entangle them (mix them together) when generating an outcome (which itself can be a prior constraint on the states and dynamics of the level below).</li><li><strong>Entangling latent states:</strong> In the generation process, the model entangles (mixes together) these latent (hidden) factors. 
This means that, while the factors are independent, their effects on the outcome might be complicated or hard to separate directly.</li><li><strong>Disentangling the representation:</strong> The real task comes when we need to invert or disentangle these mixed up effects to figure out exactly what each factor (latent state) was doing in the outcome. It’s like looking at a complex puzzle (the observed data) and working backward to separate the pieces to understand how each factor contributed.</li><li><strong>Role of factorial depth:</strong> This allows the model to understand and manipulate the system at a deeper level by identifying and separating the independent factors that were mixed together in the generation process.</li></ul><p>So <strong>Factorial depth</strong> means the model recognizes that outcomes can come from a mix of independent factors, and its job is to later separate these factors to better understand how they contributed to the outcome.</p><p>In summary, all of that produces a<strong> generalized discrete state space model</strong> with <strong>paths as random variables</strong>.</p><p>Let’s now focus on the temporal dimension.</p><h3>Temporal Depth</h3><p>In real life, as we ascend scales (in terms of output states), things slow down (temporal dilation).<br>We move from fast microscopic states to slow macroscopic states.</p><p>We can implement the same dynamics in these MDPs that have hierarchical depth.</p><p>To achieve this, we <strong>discretize and quantize time</strong>, allowing for more frequent updates at each level compared to the level above. 
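</p><p>A quick numerical sketch of this temporal scaling (an illustrative toy with an arbitrary depth of 4, chosen for this example):</p>

```python
depth = 4  # number of hierarchical levels (an arbitrary example value)

# Lowest-level timesteps covered by one update at each level,
# doubling as we ascend: level 0 is fastest, the top level is slowest.
steps_per_update = [2 ** level for level in range(depth)]
# -> [1, 2, 4, 8]

# One update at the top therefore spans 2**(depth - 1) ticks of the
# fastest level: temporal dilation as we move up the hierarchy.
horizon = 2 ** (depth - 1)
assert horizon == steps_per_update[-1]
```

<p>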
This process creates temporal scaling, aligning with the scale invariant approach we’ve been discussing.</p><p>In this way, we link temporal dynamics with hierarchical depth, as the higher we move in the hierarchy, the slower the updates become at each level.</p><p>Therefore, as Karl explains:</p><ul><li>At a certain level, we generally parametrize or generate the initial conditions and the initial path from the states that are above that level.</li><li>And that path can pursue a trajectory over multiple time steps until a certain horizon (at that level), at which point it then sends messages to the level above.</li><li>Those messages produce belief updating, and the subsequent step of the level above generates priors about the subsequent path at the level below.</li><li>And so on, recursively all the way to the depth of the model.</li></ul><p>Now it’s time to analyze in more detail that same coarse graining but this time of the state space.</p><h3>Coarse Graining the state space</h3><p>To coarse grain the state space in this renormalizing generative model, we apply what are known as spin block transformations.</p><p><strong>Spin block transformations</strong> are a specific way to apply coarse graining in the context of this generative model. These transformations group local states (like tiny tiles) into larger blocks. The term <strong>spin</strong> is borrowed from physics, where systems with spin states (like magnetic spins) can be grouped to simplify complex models.</p><p>First, it’s important to emphasize two key assumptions or rules that we will consider.</p><ul><li>Each block will have one and only one parent. This simplification avoids complex relationships where multiple higher level factors could affect a single block, keeping the system simpler and easier to analyze.</li><li>For most systems composed of coupled elements, the interactions are usually local. 
So we are going to focus on those local interactions.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*enfbq_VvLItS1N35oYa-XA.jpeg" /><figcaption>Image by Karl Friston via <a href="https://www.youtube.com/watch?v=ee6_mNOfP38">https://www.youtube.com/watch?v=ee6_mNOfP38</a></figcaption></figure><p>Let’s understand why those assumptions are important. Karl proposes: let’s think of a system that can be described in terms of local interactions. Consider a Euclidean world where massive objects bounce around (let’s ignore gravity in this example). Such objects only influence each other when they touch, when they are proximal.</p><p>The partition that defines all the Markov blankets in that state space is composed of lots of local little tiles. We can then use a block or spin operator that groups together local tiles, a group of 2D arrangements of states.</p><p>So, looking back at our generative model, a higher group, hidden factor, with its states and paths, is responsible for predicting the dynamics of other lower factors (at a lower level), and in so doing we are successively grouping together sets of sets of sets, etc.<br>So any group or set at one level is accountable or trying to generate the dynamics of a small local group at the lower level.</p><p>The key idea is that lower level blocks (small interactions between states) are grouped into higher level blocks, and the higher level blocks influence the dynamics of the lower levels.</p><p>And anything at the lower level can have only one parent. 
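</p><p>Here is a minimal sketch of such a grouping (hypothetical, using simple numpy reshaping rather than the finessed operators Karl describes): a grid of lowest-level tiles is partitioned into non-overlapping 2x2 blocks, each with exactly one parent at the level above:</p>

```python
import numpy as np

# A 4x4 grid of lowest-level tiles, labeled 0..15.
tiles = np.arange(16).reshape(4, 4)

# Group into non-overlapping 2x2 blocks: shape (2, 2, 2, 2), where the
# first two axes index the parent block, the last two the tiles inside it.
blocks = tiles.reshape(2, 2, 2, 2).swapaxes(1, 2)

# Each tile belongs to exactly one block (one and only one parent),
# and blocks only gather local, neighbouring tiles.
parent_of_tile_5 = (5 // 4 // 2, 5 % 4 // 2)  # tile at row 1, col 1 -> block (0, 0)
```

<p>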
There are no additional parents, so we can ignore or dissolve dependencies between the factors at any given level, because each factor is only responsible for its children and it doesn’t share any children, in the case of this particular spin block transformation.</p><p>At some point, all the tiles, all the groups of the lowest level will come together at the top, very deep in the model, at which point they will be encoding quite long trajectories, Karl says, 2 to the power of the depth minus 1, in terms of updates. But at the lower level there are no interactions between the different blocks as we ascend them through the hierarchy.</p><p><strong>What this produces, in summary is:</strong></p><ul><li>A very efficient kind of generative model</li><li>Many converging streams that don’t have to talk to each other until the point at which they come together through these spin block transformations.</li></ul><p>And Karl reminds us that we can treat the D and E tensors (initial states, initial paths), as likelihoods that couple one scale or level to the one below.</p><p>This emphasis on the local interactions surely reminds us of something very familiar.</p><p>Karl explains that Renormalizing Generative Models (RGM) can be seen as homologues of deep convolutional neural networks or continuous state space models in generalized coordinates of motion. They can be used to learn compositionality over space and time, models of paths or orbits (events of increasing temporal depth and itinerancy).</p><p>Let’s clarify some of these terms.</p><ul><li>In <strong>RGMs</strong>, we group local states (tiles) into larger blocks, similar to how <strong>CNNs </strong>aggregate local features into higher level features.</li><li><strong>Compositionality</strong> means that complex systems can be broken down into simpler components (like tiles or blocks), which can then be recombined to understand more complex behavior. 
This allows the model to learn patterns over space and time.</li><li><strong>Paths</strong> (or orbits) represent the transitions or trajectories between states over time. The model learns to understand these paths, which can be seen as sequences of events that occur as the system evolves.</li><li><strong>Temporal depth</strong> refers to the ability of the model to account for the passage of time in understanding state transitions. It captures how the system changes over time, based on the accumulated information at each hierarchical level.</li></ul><p>The next and very key question is, how do we actually build such a model?</p><h3>Growing the artificial mind</h3><p>Let’s begin with a key statement that Karl shares.</p><p>We want the model to <strong>learn itself</strong>. To <strong>grow itself </strong>by using something called <strong>recursive fast structure learning</strong>.</p><p>This is very important because earlier versions of active inference required humans to manually design the base structure of the generative model. This new approach begins to automate these processes. Sounds beautiful, doesn’t it? Let’s see how this would work.</p><p>It will be a recursive process, a recursive application of the renormalization group operators.</p><p>At a high level, fast structure learning equips the model with unique states and transitions as they are encountered.</p><p>Let’s consider what this means. We are<strong> growing a model from scratch</strong>, from nothing!</p><p>And Karl adds: so that EFE (expected free energy) or mutual information is extremised, using structure learning as active model selection. Let’s unpack this:</p><ul><li><strong>Mutual Information</strong>: It is a measure from information theory that quantifies how much information one variable contains about another. 
In simple terms, it tells you how much knowing one variable reduces uncertainty about the other.</li><li>In the context of machine learning and probabilistic models, mutual information is often used to quantify how much information a model’s latent variables (the hidden or unobserved ones) share with the observations (the data we can see). In simpler terms, mutual information measures how much knowing the state of X helps in predicting Y and vice versa.</li><li>If X and Y are independent, their mutual information is zero, meaning knowing one doesn’t help in predicting the other. In this context, it refers to the information shared between the <strong>model’s predictions</strong> and the <strong>actual observed data</strong>. The goal of learning is to maximize this mutual information, so the model’s predictions become more aligned with reality.</li><li><strong>Extremised</strong>: In this context, extremising means maximizing mutual information or minimizing expected free energy. So, the model aims to <strong>optimize</strong> this relationship, either by reducing the surprise or uncertainty about the data (minimizing free energy) or by increasing the amount of useful information the model can extract from the data (maximizing mutual information).</li><li><strong>Structure learning as active model selection</strong>: This suggests that as the model encounters new data, it actively <strong>selects</strong> or <strong>adapts</strong> its structure (the components of its generative model) to better align with the data, improving its predictive accuracy. It’s a process of <strong>self improvement</strong> where the model chooses which structures and states are necessary, based on the data it is seeing.</li></ul><p>Karl also tells us that the EFE (expected free energy) is the mutual information in the absence of any expected cost. It is the mutual information of any mapping plus or minus an expected cost (in terms of constraints or prior preferences). 
Let’s unpack this.</p><p>In our case we can consider that in relation to the EFE equation:</p><ul><li><strong>Mutual Information</strong> = Epistemic Term (exploration, uncertainty reduction, learning about the world).</li><li><strong>Expected Cost</strong> = Pragmatic Term (exploitation, maximizing utility, evaluating costs).</li></ul><p>Together, these terms in the expected free energy (EFE) represent the balance between learning (exploration) and making the best use of what you know (exploitation), guiding how a system acts and learns in a way that optimizes both its understanding of the world and its ability to achieve goals with minimal costs.</p><p>In the absence of expected cost, the system is focused on learning and reducing uncertainty through exploration.</p><p>But when you introduce the expected cost (pragmatic term), the system also focuses on making decisions that not only increase knowledge but also keep actions aligned with goals, preferences, or minimizing harm.</p><p>So, it’s a balance between exploring for more knowledge (uncertainty reduction) and acting in a way that minimizes costs (alignment with goals).</p><p>When you’re trying to minimize EFE, you are essentially trying to reduce uncertainty by learning about the world (gaining more information) while also ensuring that the decisions you make do not violate important constraints or preferences. 
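</p><p>A toy sketch of this balance (hypothetical numbers and helper names; for brevity the epistemic term is summarized as a single information gain value per policy):</p>

```python
import numpy as np

def efe(q_outcomes, log_preferences, info_gain):
    """Expected free energy = expected cost - epistemic value.

    q_outcomes:      predicted distribution over outcomes under a policy
    log_preferences: log of the preferred outcome distribution (the C prior)
    info_gain:       expected reduction in uncertainty under the policy
    """
    expected_cost = -np.dot(q_outcomes, log_preferences)  # pragmatic term
    return expected_cost - info_gain                      # minus epistemic term

log_C = np.log(np.array([0.8, 0.2]))  # the agent prefers outcome 0

# Policy A: likely to reach the preferred outcome, but teaches us little.
# Policy B: outcome uncertain, but highly informative (exploratory).
efe_exploit = efe(np.array([0.9, 0.1]), log_C, info_gain=0.1)
efe_explore = efe(np.array([0.5, 0.5]), log_C, info_gain=1.0)

best = "exploit" if efe_exploit < efe_explore else "explore"
```

<p>With these made-up numbers the informative policy wins, because its epistemic value outweighs its extra expected cost; shrink the information gain and the preference-aligned policy takes over.</p><p>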
This creates a feedback loop where you’re both gathering knowledge and acting in ways that are beneficial or cost effective in the long run.</p><p><strong>In simpler terms:</strong> minimizing EFE means you’re reducing uncertainty (exploration) but also making sure you don’t incur too many costs (exploitation), all while satisfying any constraints or preferences that are important for achieving your goals.</p><p>And now, let’s return to the inspiring concept we discussed earlier: we will be growing a model from scratch, from the ground up!</p><p>Compare that to the inefficiency of current AI paradigms:</p><ul><li>Current paradigms start with highly complex, massively overparameterized models, which might later be simplified through the training process. Optimization and pruning methods help reduce the influence of certain parameters, streamlining the model.</li><li>This active inference approach, in contrast, begins from scratch and grows its model until it achieves the simplest possible implementation that fulfills the agent’s objectives.</li></ul><p>How would this process take place? Karl takes us through a high level example. Let’s take the perspective of the agent.</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2Fh9sqeZtx1ps%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3Dh9sqeZtx1ps&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2Fh9sqeZtx1ps%2Fhqdefault.jpg&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/c7d8c48b436612c1d642ffb989cc85f5/href">https://medium.com/media/c7d8c48b436612c1d642ffb989cc85f5/href</a></iframe><ul><li>I receive an observation, which I infer was generated by a latent hidden state. And I encode that latent state.</li><li>I then receive another observation. 
I have not seen that observation before, so I will encode it with a second latent state.</li><li>And I have seen those two observations one after the other, so I will encode that with a transition from the first to the second state.</li><li>And I keep doing the same. I now get a third observation. If I have not seen it before, I will induce a third latent state, and a second transition.</li><li>In this way, I accumulate all the unseen or unique instances of observations within a likelihood mapping structure, that grows and grows until we find something that we have encountered before.</li><li>So we don’t induce a new state or, implicitly, a column of the likelihood matrix, if we have seen that state previously. We don’t create an instance of something that has been encountered before.</li><li>However, that state may come along with a related path that is new, so the next state after that one may not be the same that was following the previously seen instance of that state. In that case, I am going to induce a new path.</li><li>In this way, I will be gradually growing my B tensors (the probability transition matrices), along with other components of the generative model such as the A tensors (likelihood mappings), to encode the full dynamical structure of the sequence of observations that is training the agent.</li><li>And that encoding will be summarized in a maximally efficient way. The implicit mapping that we are growing from scratch will have the highest mutual information between the latent space (where the latent states allow us to derive a prediction for the next observation) and the observations that are at hand.</li></ul><p>So we can now see that <strong>Fast Structure Learning</strong> is a kind of active inference, but not over states or parameters. 
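</p><p>The walkthrough above can be sketched as a toy loop (a deliberate simplification of fast structure learning: plain sets and a dictionary stand in for the Dirichlet-parametrized A and B tensors that the real scheme grows):</p>

```python
def grow_model(observations):
    """Grow latent states and transitions from scratch, one observation at a time."""
    state_of = {}        # observation -> latent state (the likelihood mapping)
    transitions = set()  # (state, next_state) pairs (the transition structure)
    prev = None
    for obs in observations:
        # A never-before-seen observation induces a new latent state.
        if obs not in state_of:
            state_of[obs] = len(state_of)
        s = state_of[obs]
        # A never-before-seen successor induces a new transition (path entry).
        if prev is not None:
            transitions.add((prev, s))
        prev = s
    return state_of, transitions

# A repeating sequence: only unique observations and unique transitions
# are encoded, so the model stays maximally compact.
states, trans = grow_model(["a", "b", "c", "a", "b", "c", "a"])
```

<p>Running it on a looping sequence yields three states and three transitions, no matter how many times the loop repeats: nothing that has been encountered before is duplicated.</p><p>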
It is instead active selection of the right model in terms of the size of the likelihood tensors and the transition tensors.</p><p>We are figuring out<strong> which model structure best fits the data</strong> or the situation you’re facing.</p><p>And model selection here refers to the process of selecting the right structure, that is, how complex the model should be in terms of its likelihood tensors and transition tensors, based on the evidence available.</p><p>So gradually, as agents, we are building our internal tensors. And for example:</p><ul><li><strong>Likelihood tensors</strong> represent the relationship between the states and the observations. This is how the model explains the likelihood of different observations given the current states.</li><li><strong>Transition tensors </strong>represent how states transition from one to another, essentially defining the dynamics or temporal evolution of the system. This shows how one hidden state leads to the next over time.</li></ul><p>The size of these tensors reflects the complexity of the model. More complex models have larger tensors with more parameters, which means they can potentially explain more complex phenomena but at the cost of becoming harder to learn and requiring more data.</p><p>When the system is choosing<strong> whether to accept a new latent state</strong> (a new hidden state in the model)<strong> or transition</strong> (a new way states evolve over time), it checks<strong> if this new model improves the EFE or maximizes mutual information</strong>.</p><p>Let’s unpack this:</p><ul><li><strong>Improving EFE:</strong> In active inference, the overall goal is to minimize Expected Free Energy, which combines two objectives: reducing uncertainty (mutual information) and minimizing expected costs (the negative consequences of actions). 
A new structure is considered beneficial if it lowers the overall EFE by balancing these two objectives.</li><li><strong>Maximizing Mutual Information:</strong> This is the epistemic aspect of EFE minimization. Here, the focus is on reducing uncertainty by preserving or increasing the mutual information between the model and the observations. In other words, the system prioritizes a model that offers the most informative representation of the data and reduces uncertainty about hidden states.</li><li>So if a new structure doesn’t immediately lower the total EFE (e.g., it doesn’t reduce both uncertainty and cost), the system may still accept it <strong>if it improves the epistemic term</strong>. This ensures that the model becomes more informative, even if pragmatic concerns are temporarily less prioritized. The rationale is that improving the understanding of hidden states can later lead to better decisions and lower EFE in subsequent updates.</li><li>In summary, the criteria for accepting a new latent state or transition are: Does it optimize or improve the EFE? Does it preserve or maximize mutual information? This approach reflects a trade off between exploration (learning about the environment) and exploitation (acting effectively based on current knowledge). However, in the case of fast structure learning, as we will discuss below, when applied to growing a model from scratch, the <strong>emphasis shifts toward the maximization of the epistemic term</strong>, maximizing mutual information.</li></ul><p>This is essentially a <strong>Bayesian approach to model selection</strong>. 
The idea is that the system is selecting among different models by evaluating the evidence (data, observations) and choosing the model that best fits that data, while also considering the complexity of the model (which is related to the size of the likelihood and transition tensors).</p><p>The system learns not just the states but also the structure of the model itself, dynamically adjusting itself to find the most appropriate model for the current data.</p><p>In summary, fast structure learning is a specific application of active inference, applied here in the context of Bayesian model selection.</p><p>And Karl tells us that it follows the same rules as when an active inference agent deals with planning and accepting parameter updates, but with a specific emphasis on maximizing the epistemic term of the EFE in the case of growing a model from scratch. Let’s unpack this further.</p><p>In typical parameter updates and planning steps in active inference, we aim to minimize EFE, which balances reducing uncertainty (epistemic gain) and aligning with preferences (pragmatic gain).</p><p>However, when it comes to structure learning, updating or selecting the model’s structure, the focus shifts. The system’s objective is not necessarily to minimize EFE in its entirety. Instead, it prioritizes maximizing the <strong>epistemic term</strong> of the EFE, which represents the mutual information between latent states and observations. In this context, maximizing EFE means seeking the most informative model structure that captures patterns in the data and reduces uncertainty over time.</p><p>This focus on maximizing the epistemic value ensures the model grows in a way that is both adaptive and capable of effective learning. 
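</p><p>The epistemic quantity in question, the mutual information between latent states and observations, can be computed directly for a small likelihood mapping (a hypothetical sketch with made-up matrices):</p>

```python
import numpy as np

def mutual_information(A, prior):
    """I(o; s) for a likelihood A[o, s] = p(o | s) and a prior p(s)."""
    joint = A * prior                       # p(o, s)
    p_o = joint.sum(axis=1, keepdims=True)  # marginal p(o)
    ratio = np.where(joint > 0, joint / (p_o * prior), 1.0)
    return float((joint * np.log(ratio)).sum())

prior = np.array([0.5, 0.5])

# A sharp (informative) mapping versus an ambiguous one: the structure
# with higher mutual information is the one structure learning keeps.
A_sharp = np.array([[1.0, 0.0], [0.0, 1.0]])
A_vague = np.array([[0.6, 0.4], [0.4, 0.6]])

assert mutual_information(A_sharp, prior) > mutual_information(A_vague, prior)
```

<p>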
By emphasizing the mutual information component, the model identifies latent states and transitions that provide the highest potential for reducing uncertainty in the future.</p><p>This principle mirrors how parameter updates in active inference sometimes prioritize epistemic gain to guide exploration and learning. Structure learning similarly follows an <strong>epistemic driven rule</strong>, selecting updates, actions, or structures that maximize mutual information and ensure the model remains informative and adaptable.</p><p>So it’s all about choosing the most informative model structure. In terms of expected free energy (EFE), and specifically in the context of fast structure learning, the primary focus is on maximizing the epistemic term with the aim of selecting a structure that has the highest potential for reducing uncertainty in the future. This process guides the selection of models that are informative and allow for effective learning or adaptation.</p><p>In summary, Friston emphasizes that structure learning, like planning or doing parameter updates, follows the same epistemic driven principles: it’s guided by maximizing the informative value of the model. This means, in practice, selecting structures, actions, or updates that increase mutual information and reduce uncertainty. However, in the context of fast structure learning, the reduction of immediate uncertainty is not the top priority. 
Instead, such reduction may occur in the future, not necessarily in the immediate present.</p><p>So let’s stand back for a moment and look at the structure being created from afar:</p><ul><li>We ingest our observations in relation to local groups.</li><li>At a certain level of the structure, we combine initial conditions and paths into new outcomes for the next level below.</li><li>And because we do this for local groups, the number of groups just gets smaller and smaller, as you get higher and higher.</li><li>And the number of combinations of paths and initial states increases as you get deeper and deeper (lower) into the model.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*rGDg0uQkPUoaMsS5xwf7Dw.jpeg" /><figcaption>Image by Karl Friston via <a href="https://www.youtube.com/watch?v=ee6_mNOfP38">https://www.youtube.com/watch?v=ee6_mNOfP38</a></figcaption></figure><p>The next question that Karl poses is, how do we bound all of the above? And he explains that the answer is trivial because the size of the highest level is upper bounded by the number of elements in our training set.</p><p>Therefore, we can control the size of these renormalizing operators and likelihood mappings or transition tensors, simply by giving it canonical training data, the kind of data that we want this agent to generate, recognize or predict.</p><p>Let’s consider, though, that in a controlled training environment, bounding the complexity with a fixed, well defined dataset works well because the model only has to generalize within the boundaries of that data. But in the open, real world, the model would encounter way more variability and unexpected scenarios that weren’t in the training data. 
In this context, simply bounding based on the original training data would no longer apply effectively, because the variety of inputs and potential scenarios would be far greater.</p><p>So if we imagine our agent being set loose on the complexity of the real world, such an approach would have to be expanded with other sorts of constraints and mechanisms. Let’s expand on this:</p><ul><li><strong>Dynamic Model Updates:</strong> As mentioned earlier, the model would need mechanisms for dynamic learning or continual adaptation, allowing it to update its generalized states or high level structures based on new, diverse experiences in the real world. This would enable it to expand its boundaries by incorporating novel patterns it observes, without being rigidly confined to the initial training data.</li><li><strong>Hierarchical Compression and Forgetting Mechanisms</strong>: For real world applications, models could include strategies for compressing knowledge and selectively forgetting less useful or outdated information. This hierarchical updating approach, where recent data is prioritized and irrelevant paths are pruned, helps manage the growth in complexity while maintaining relevance in a changing environment.</li><li><strong>Safety and Practical Boundaries through Constraints:</strong> Models might include contextual constraints or policy based limits to control how much they explore certain areas. For example, in active inference models, bounding behavior in an unbounded real world can be achieved by biasing exploration toward goals or preferences, essentially building in rules that guide behavior within practical limits</li></ul><p>In the real world, then, the boundaries would need to be adaptable, using strategies that balance exploration (to handle novel scenarios) and compression or constraints (to keep complexity in check). 
This way, the model doesn’t endlessly expand but instead selectively grows and prunes knowledge in a way that remains manageable and relevant to its goals or purpose.</p><p>Now, let’s explore some specific examples that Karl Friston presented, illustrating one of these systems in action.</p><h3><strong>Filling the gaps</strong></h3><p>The first example that Karl explains is: <strong>fast structure learning and image compression to discover a simple orbit.<br></strong>In this example, the generative model is trained on a video of a bird flying in a loop through a periodic orbit.</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2F_WsUqLbl8j4%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fshorts%2F_WsUqLbl8j4&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2F_WsUqLbl8j4%2Fhq2.jpg&amp;type=text%2Fhtml&amp;schema=youtube" width="640" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/f2dfaf9662450c7a9e019292e804c6aa/href">https://medium.com/media/f2dfaf9662450c7a9e019292e804c6aa/href</a></iframe><p>And the generative model learns a structured representation of the video data (in this case, the bird flying in a periodic loop). By doing so, it compresses the data into a simplified, interpretable form (the essence of the bird’s trajectory) while still being able to reconstruct the original data.</p><p>Karl explains that the coarse graining process applied at the bottom level uses a finessed version of a <strong>spin block operator</strong> by harvesting local eigenmodes or eigenimages of patches of the video to get the image into a reduced or coarse grained representation that is then passed up to the next level.</p><p>A <strong>spin block operator</strong> is a method inspired by statistical physics to gather information over small regions (for example, patches of an image). 
This involves summarizing local details into eigenmodes or eigenimages, which represent the dominant patterns in those patches.</p><p>In this example, the model uses two levels.</p><ul><li><strong>Lower level</strong>: Processes small scale, local features of the image (e.g., pixel patches or eigenimages).</li><li><strong>Higher level</strong>: Works with the entire image as a whole, analyzing large scale patterns (e.g., the bird’s entire trajectory).</li><li><strong>Interaction</strong>: The lower level sends coarse grained representations upward to the higher level for broader contextual understanding.</li></ul><p>The process consists of a few stages, Karl explains, including:</p><ul><li><strong>Discretization:</strong> breaking the continuous video data into discrete chunks for analysis (e.g., frames or patches).</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*67bfxaKCkDQSLEfpO_K3Zw.jpeg" /><figcaption>Image by Karl Friston via <a href="https://www.youtube.com/watch?v=ee6_mNOfP38">https://www.youtube.com/watch?v=ee6_mNOfP38</a></figcaption></figure><ul><li><strong>Singular value decomposition (SVD):</strong> A mathematical technique that breaks down data into key components. Here, it can be used to identify the dominant patterns in the video sequence (e.g., the periodic trajectory of the bird).</li></ul><p>These processes help the system learn a simplified model of the video by focusing on its most significant features.</p><p>At the highest level, the model represents the bird’s motion over time as a sequence of abstract states or events (e.g., periodic loop). In this generalized space, the highest level encodes paths or trajectories over 2<sup>n-1</sup> timesteps into the future, where n is the depth of the model. 
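</p><p>The patch wise coarse graining just described can be sketched in a few lines (a rough toy of my own, not the actual implementation): harvest the top eigenimages of each local patch with an SVD and pass the reduced scores up to the next level.</p>

```python
import numpy as np

def coarse_grain(frames, patch=8, k=4):
    """Toy spin-block-style operator: summarize each local patch of a
    video by its scores on the top-k eigenimages found via singular
    value decomposition (SVD)."""
    T, H, W = frames.shape
    ph, pw = H // patch, W // patch
    # Group pixels into (ph*pw) local patches, each observed over T frames
    x = frames.reshape(T, ph, patch, pw, patch)
    x = x.transpose(1, 3, 0, 2, 4).reshape(ph * pw, T, patch * patch)
    coarse = np.empty((ph * pw, T, k))
    for i, X in enumerate(x):                  # X: (T, patch*patch)
        Xc = X - X.mean(axis=0)                # center each pixel over time
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        coarse[i] = Xc @ Vt[:k].T              # project onto top-k eigenimages
    return coarse                              # passed up to the next level

frames = np.random.rand(16, 32, 32)            # 16 frames of a 32x32 video
print(coarse_grain(frames).shape)              # (16, 16, 4)
```

<p>Each 8x8 patch (64 numbers) is reduced to 4 eigenimage scores per frame, which is the kind of reduced representation the next level up would receive.</p><p>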
The model, therefore, is able to predict future states by understanding patterns across all of those time steps.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*SLYNxEZL-o5YJbClbWt1gw.jpeg" /><figcaption>Image by Karl Friston via <a href="https://www.youtube.com/watch?v=ee6_mNOfP38">https://www.youtube.com/watch?v=ee6_mNOfP38</a></figcaption></figure><p>After the training, we can generate new movies simply by moving through the different paths or episodes.</p><p>Karl asks: what other things can we do with these models?<br>And he explains that one of them is pattern completion. Rather than presenting the entire data to the model, we make the observations incomplete, presenting a partial prompt (part of each frame is missing).</p><p>Then we can see what the model’s beliefs are: its posterior over the output space (image space, in this case). And we verify that the model is able to predict and complete the missing part of each frame.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*tOpv52vZfk5Ka83Ef7t1Aw.jpeg" /><figcaption>Image by Karl Friston via <a href="https://www.youtube.com/watch?v=ee6_mNOfP38">https://www.youtube.com/watch?v=ee6_mNOfP38</a></figcaption></figure><p>Karl explains that this demonstrates efficient compression, highlighting how many aspects of Bayesian mechanics and variational inference are grounded in the same principles that support efficient information transfer. 
This relates to maximizing mutual information, how much valuable information can be retained and shared in the most compact representation possible.</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2Fhtx_4shMnNc%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fshorts%2Fhtx_4shMnNc&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2Fhtx_4shMnNc%2Fhq2.jpg&amp;type=text%2Fhtml&amp;schema=youtube" width="640" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/9b30765c4011b3002cef85ef783a4a50/href">https://medium.com/media/9b30765c4011b3002cef85ef783a4a50/href</a></iframe><p>And so, we can consider that the objective function of the model is to find the best representation that retains the essential structure of the data (e.g., the bird’s orbit) while minimizing redundancy. This means encoding the data into a compact, generalizable form that allows the agent to perform tasks like filling in gaps or generating new examples.</p><p>Therefore, fast structure learning under renormalizing scale free generative models can be seen as the most efficient way of compressing whatever data the agent is exposed to.</p><p>Once this compression takes place, the agent is then able to fill in the gaps in very efficient ways.</p><p>Let’s emphasize the following two key concepts:</p><ul><li><strong>Renormalization</strong>: A process from physics that simplifies complex systems by focusing on large scale patterns while ignoring small scale details. 
In this context, it’s how the model moves between levels of abstraction (e.g., from pixel patches to full trajectories).</li><li><strong>Scale free</strong>: The model can operate across different levels of detail or scales (local features to global trajectories) without being tied to a specific resolution.</li></ul><p>And in this way:</p><ul><li>By transforming the data into a compact representation, the agent gains the ability to infer missing information or generate new content.</li><li><strong>Example</strong>: If part of the bird’s loop is obscured, the compressed model can reconstruct the missing frames, leveraging its understanding of the periodic trajectory, the bird’s motion dynamics, and other relevant patterns in the data.</li></ul><h3>Learning the chaos</h3><p>In the next example, Karl showcases a 3 level hierarchical renormalizing model, this time applied to the learning of the dynamics of a ball moving through a chaotic and aperiodic orbit.</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FBXnnGQEIyoA%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fshorts%2FBXnnGQEIyoA&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FBXnnGQEIyoA%2Fhq2.jpg&amp;type=text%2Fhtml&amp;schema=youtube" width="640" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/bc5d7e1b1a5a9f2c56264219803e235c/href">https://medium.com/media/bc5d7e1b1a5a9f2c56264219803e235c/href</a></iframe><ul><li><strong>Chaotic and aperiodic motion</strong>: Unlike a simple periodic motion (e.g., a bird flying in a loop), this system is less predictable, with sensitive dependence on initial conditions. 
This introduces both deterministic (rule based) and stochastic (random) elements.</li><li><strong>Model’s task</strong>: To capture both the deterministic rules and the chaotic variations probabilistically, compressing the dynamics into a structured, learnable format.</li></ul><p>So how does the model deal with such a challenge?</p><ul><li><strong>Probabilistic representation</strong>: The model learns distributions over possible trajectories. This allows it to account for uncertainty and variability in the system.</li><li><strong>Compression</strong>: The model identifies patterns and reduces redundancy, creating a compact representation of the system’s behavior.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*lii-cGpCBx04gzDIr9dTug.jpeg" /><figcaption>Image by Karl Friston via <a href="https://www.youtube.com/watch?v=ee6_mNOfP38">https://www.youtube.com/watch?v=ee6_mNOfP38</a></figcaption></figure><p>After the learning takes place, we begin to test the capabilities of the model.</p><ul><li><strong>Input at the lowest level</strong>: At the start, the raw observations are provided in a deliberately imprecise way, meaning the fine grained details are blurred or noisy. This tests how well the model can rely on its learned priors rather than on precise data.</li><li>In such a case, everything is governed and generated by the top down priors or predictive posteriors, because there is no precise likelihood.</li><li><strong>Top down control</strong>: The model generates predictions about the system’s dynamics starting from the highest level (abstract priors) and passing them down to the lower levels.</li><li><strong>Posteriors</strong>: These are updated beliefs about what is happening in the system, based on a combination of <strong>priors</strong> (high level expectations from what the model has learned) and <strong>likelihoods</strong> (observations from the environment). In this case the likelihoods are imprecise: the model is not receiving accurate input data, so it cannot rely heavily on direct sensory evidence.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*PZiXqRRwfS03s1v7foA52w.jpeg" /><figcaption>Image by Karl Friston via <a href="https://www.youtube.com/watch?v=ee6_mNOfP38">https://www.youtube.com/watch?v=ee6_mNOfP38</a></figcaption></figure><p>And then we remove the imprecise input completely and let the agent generate its own beliefs about what’s going on.</p><ul><li>Now the model must rely entirely on its internal representations (beliefs and priors) to generate the ball’s motion.</li><li>The model autonomously predicts the dynamics of the ball based on what it has learned, effectively imagining the motion of the ball without any external input.</li><li>Despite this uncertainty, the model successfully reproduces the overall dynamics of the ball, demonstrating that it has captured the system’s behavior in its hierarchical generative framework.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*pwim07zfAKt_vEy0pCRe3Q.jpeg" /><figcaption>Image by Karl Friston via <a href="https://www.youtube.com/watch?v=ee6_mNOfP38">https://www.youtube.com/watch?v=ee6_mNOfP38</a></figcaption></figure><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FeC1HiGT2fJs%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fshorts%2FeC1HiGT2fJs&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FeC1HiGT2fJs%2Fhq2.jpg&amp;type=text%2Fhtml&amp;schema=youtube" width="640" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/194213fc86877bd8294c9e3460cc28ec/href">https://medium.com/media/194213fc86877bd8294c9e3460cc28ec/href</a></iframe><p>This example highlights how generative models can not only learn complex systems but also operate autonomously in very uncertain conditions by leveraging probabilistic structures and hierarchical
abstractions.</p><h3>Exploring video and music</h3><p>In the next example, <strong>fast structure learning applied to discover and learn event sequences</strong>, Karl shows us the movie of a bird. After the learning takes place, some of the frames of the video are hidden from the generative model, and the model is able to reconstruct them faithfully, filling the gaps.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*C8cDKEmGHjCz1xQoA7fsMA.jpeg" /><figcaption>Image by Karl Friston via <a href="https://www.youtube.com/watch?v=ee6_mNOfP38">https://www.youtube.com/watch?v=ee6_mNOfP38</a></figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/639/1*fZEpwP38To3dkXT4Nw94hg.jpeg" /><figcaption>Image by Karl Friston via <a href="https://www.youtube.com/watch?v=ee6_mNOfP38">https://www.youtube.com/watch?v=ee6_mNOfP38</a></figcaption></figure><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FbE18VjvK0ME%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fshorts%2FbE18VjvK0ME&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FbE18VjvK0ME%2Fhq2.jpg&amp;type=text%2Fhtml&amp;schema=youtube" width="640" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/1ea272fe848617582973943d967464b5/href">https://medium.com/media/1ea272fe848617582973943d967464b5/href</a></iframe><p>Next comes music, in the example titled: <strong>fast structure learning applied to generate music</strong>.</p><p>This example aims to compress 1D frequency summaries of a particular audio file. The audio is summarized as a continuous wavelet transform.</p><p>Frequency representations like spectrograms are typically 2D, showing frequency vs. time. 
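</p><p>To make the idea of a 1D frequency summary concrete, here is a minimal sketch of my own (using a short time Fourier transform as a simple stand in for the wavelet transform, with made up parameters): compute a magnitude spectrogram and average each frequency bin over time, collapsing the time dimension into a single fingerprint vector.</p>

```python
import numpy as np

def freq_summary(signal, n_fft=256, hop=128):
    """Collapse a 2D magnitude spectrogram (time x frequency) into a 1D
    frequency 'fingerprint' by averaging each frequency bin over time."""
    frames = [signal[i:i + n_fft] * np.hanning(n_fft)
              for i in range(0, len(signal) - n_fft, hop)]
    spec = np.abs(np.fft.rfft(np.array(frames), axis=1))  # (time, freq)
    return spec.mean(axis=0)                              # (freq,) summary

sr = 8000
t = np.arange(sr) / sr
summary = freq_summary(np.sin(2 * np.pi * 440 * t))  # one second of 440 Hz
print(summary.argmax() * sr / 256)  # dominant bin sits near 440 Hz
```

<p>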
However, when referring to 1D frequency summaries, Friston seems to be referring to a condensed representation that reduces this information to a single dimension.</p><p>A 1D summary might be a single vector that captures average power or amplitude across frequencies, essentially collapsing the time dimension. For instance, we can calculate an average amplitude for each frequency band over time, creating a 1D vector of frequency components. This provides a summary of the frequency content without the time detail, a sort of fingerprint for the audio.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*R9D1THjZZNnenEWoXpfdEA.jpeg" /><figcaption>Image by Karl Friston via <a href="https://www.youtube.com/watch?v=ee6_mNOfP38">https://www.youtube.com/watch?v=ee6_mNOfP38</a></figcaption></figure><p>This is a good opportunity to review the types of charts Karl shows in these examples, specifically in relation to the three level hierarchical system in this music context.</p><p>First, a reminder about a couple of terms:</p><ul><li><strong>Predictive Posterior:</strong> A forecast of the state before receiving new data, based purely on the model’s prior beliefs and the dynamics of lower levels.</li><li><strong>Posterior:</strong> The updated belief about the state after receiving new observations and incorporating them into the model, essentially the new and more accurate belief about the state.</li></ul><p>The different charts shown by Karl are:</p><ul><li><strong>Predictive Posterior(states) level 1</strong></li><li><strong>Predictive Posterior(states) level 2</strong></li><li><strong>Predictive Posterior(states) level 3</strong></li><li><strong>Posterior (states) level 3</strong></li><li><strong>Predictive Posterior (paths) level 3</strong></li><li><strong>Predictive Posterior (paths) level 2</strong></li><li><strong>Transition chart (square matrix)</strong></li></ul><figure><img alt="" 
src="https://cdn-images-1.medium.com/max/593/1*nOHk5T8CgfmXXN11naoSNA.jpeg" /><figcaption>Image by Karl Friston via <a href="https://www.youtube.com/watch?v=ee6_mNOfP38">https://www.youtube.com/watch?v=ee6_mNOfP38</a></figcaption></figure><p>Each <strong>Predictive Posterior (states)</strong> chart for levels 1, 2, and 3 represents the probability distributions of predicted hidden states at different hierarchical levels. Those hierarchical levels correspond to different levels of abstraction. For example:</p><ul><li><strong>Level 1 </strong>could represent fine grained, immediate states like sensory features (e.g., specific notes or sounds in music).</li><li><strong>Level 2</strong> could capture patterns over time, like sequences or phrases in music.</li><li><strong>Level 3</strong> may represent higher level structures, such as motifs, themes, or the genre of a piece.</li></ul><p>The system predicts the states at each level based on the states of the levels below and above, updating these predictions based on new observations.</p><p>The <strong>Posterior (States) Level 3 </strong>chart shows the updated posterior distribution over the hidden states at Level 3, following Bayesian updating with observed data. It reflects the system’s refined beliefs about Level 3 states after accounting for incoming observations and prediction errors from lower levels.<br>This level’s posterior will help shape the priors (predictions) at the lower levels and inform the next iteration of predictions for Level 3 itself.</p><p>The <strong>Predictive Posterior (Paths)</strong> charts for levels 2 and 3 represent the system’s beliefs about possible paths or sequences at these hierarchical levels. 
These paths aren’t single states but sequences of states over time, predicting not only what states are likely but how states are expected to unfold across time.<br>For example:</p><ul><li>At <strong>Level 2</strong>, paths could represent sequences of sounds or notes in a phrase.</li><li>At <strong>Level 3</strong>, paths might represent more complex patterns like thematic progressions or chord changes.</li></ul><p>By predicting paths, the system can anticipate longer sequences and make structured predictions about the trajectory of events, crucial in a time dependent setting like music generation.</p><p>Finally, the <strong>Transition Chart</strong> is a square matrix showing the transition probabilities between states within a particular level.<br>It defines how likely the system is to move from one state to another, capturing the dynamics or temporal dependencies between states.</p><p>For example, in a music model, this matrix might represent the likelihood of one note or chord transitioning to another within a specific musical structure. Each entry (i,j) in this matrix represents the probability of transitioning from state i to state j, helping the model understand and anticipate sequences based on learned transitions.</p><p>After showcasing these examples that deal with learning to compress and generate, Karl takes us to the following section, which deals with active learning and agency.</p><h3>Active learning and agency</h3><p>We will now focus on an agent equipped with one of these generative models interacting with an environment where it must learn a skill.</p><p>The key objective is to <strong>learn an effective and optimal state action policy</strong>: <br>a good mapping from states of the world to actions that the agent should take. 
And we can consider two routes to achieve that:</p><ul><li><strong>Use some function approximation strategy</strong>, through <strong>deep learning</strong>, for example, to learn a mapping from sensory observations to active states (actions) that maximizes a reward or utility (or minimizes a cost or free energy). This would be <strong>the reinforcement learning route</strong>.</li><li>Or use <strong>active inference</strong> to realize predictions under a <strong>renormalizing generative model</strong> of rewarded events. The agent uses its generative model to infer the states of the world and selects actions that minimize free energy, achieving its goals in an efficient and sustainable way. This is the route we are going to consider in this section.</li></ul><p>With the active inference approach to purposeful behavior, such purpose is provided mostly by the prior preferences or the constraints, usually encoded in the C tensor. <br>Therefore, unlike reinforcement learning, where purpose often comes from an external and explicit reward signal, active inference relies on the generative model itself to encode what the agent is trying to achieve.</p><p>Those <strong>prior preferences</strong> are, therefore, the agent’s built in or learned expectations about the kinds of observations it wants to encounter. The agent doesn’t seek to maximize a reward but instead aligns its actions to ensure its observations match these preferred outcomes.</p><p>The generative model serves as the agent’s internal understanding of the world, and within the model itself, the prior preferences in the C tensor are encoded as probabilities over desired outcomes. 
For example:</p><ul><li>If an agent is navigating a maze, it might prefer observations that indicate it is closer to the goal.</li><li>A robot assembling objects might prefer states where objects are correctly aligned.</li><li>The preferences in the C tensor guide the agent’s actions by making certain observations or states more likely in the model.</li><li>Constraints might represent physical limitations (e.g., movement restrictions) or other environmental factors that shape how the agent interacts with the world.</li></ul><p>So we can see the purpose of the agent as, in a way, an emergent property, because the agent doesn’t explicitly have a goal like: maximize a reward. Instead, it acts to minimize <strong>variational free energy </strong>and <strong>expected free energy, </strong>which involve:</p><ul><li>Reducing the surprise (unexpectedness) of its observations.</li><li>Ensuring its sensory inputs are consistent with its prior preferences.</li><li>The <strong>C tensor</strong> effectively sets what counts as success for the agent. By minimizing free energy, the agent works to reduce prediction error, aligning its sensory input with the predictions made by its generative model. This process ensures that the agent’s actions bring its observations into better alignment with the world model, which is guided by the prior preferences encoded in the C tensor.</li></ul><p>Another very simple example. 
Imagine a thermostat as an agent:</p><ul><li>Its <strong>C tensor</strong> encodes a preference for a specific temperature range (e.g., 21–23°C).</li><li>When the room temperature deviates from this range, the thermostat acts (e.g., turning on the heater or cooler) to align observations (measured temperature) with its preference.</li></ul><p>This mechanism generalizes to more complex agents and tasks, where the C tensor <strong>can encode preferences</strong> for specific visual <strong>patterns</strong>, spatial arrangements, sequences of <strong>actions</strong>, or even <strong>ethical constraints</strong> and <strong>social norms</strong>. In such cases, the agent’s actions are driven not only by sensory preferences but <strong>also by abstract goals or principles</strong>, ensuring that its behavior aligns with <strong>both practical and moral considerations.</strong></p><p>Let’s now consider <strong>action</strong> in the context of this active inference agent. We can view <strong>action</strong> as part of a gradient flow on free energy. The agent will pick actions that minimize free energy, which leads to maximizing predictive accuracy. And this is because the process is built on top of the variational free energy (VFE) optimization process and equation, which balances complexity and accuracy; Karl tells us that complexity does not depend upon action, but accuracy does.</p><p>So when we consider the behavior of our active inference agent, we have a generative model that is optimized through perception (in terms of variational free energy minimization), and that generates predictions about what should happen next. 
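</p><p>The thermostat example above can be condensed into a minimal sketch (with toy numbers and action effects of my own): encode the preferred temperature range as a prior over observations, then pick the action whose predicted outcome is least surprising under that prior.</p>

```python
import numpy as np

# Toy active-inference thermostat (illustrative only; hypothetical numbers).
temps = np.arange(15, 30)                      # discrete temperature states, C
C = np.exp(-0.5 * ((temps - 22) / 1.0) ** 2)   # prior preference: ~21-23 C
C /= C.sum()

def act(current_temp):
    """Pick the action whose predicted outcome is least surprising under
    the preference prior, i.e. minimizes -log C(predicted temperature)."""
    effects = {"heat": +1, "off": 0, "cool": -1}          # predicted delta
    def surprise(a):
        predicted = np.clip(current_temp + effects[a], 15, 29)
        return -np.log(C[np.where(temps == predicted)][0])
    return min(effects, key=surprise)

print(act(18), act(22), act(26))  # heat off cool
```

<p>The agent never sees a reward; it simply acts so that its observations keep matching the outcomes its prior marks as preferred.</p><p>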
From those predictions, the agent can then pick its actions to fulfill those predictions (by also involving the expected free energy (EFE) optimization process, which handles action selection).</p><p>So, as an active inference agent, my predictions about what will happen next are shaped by perception through variational inference, based on my learned and selected generative model. This is very similar to what is called model predictive control. The actions we take have the goal of evidencing or realizing predicted states.</p><p>With that in mind, Karl invites us to explore a strategy to make the agent become an expert at something.</p><h3>Becoming an expert</h3><p>Imagine a world where everything unfolds according to your beliefs about what it means to be an expert in that world. The agent can plan a series of episodes, generating predictions that cascade from higher to lower levels of the model, with actions at the lowest level executed to faithfully fulfill those predictions.<br> <br>So we can think of <strong>control as inference at the bottom</strong> level (model predictive control) and<strong> planning as inference at the top </strong>level.</p><p>Let’s unpack this.</p><ul><li><strong>Control (bottom level):</strong> This involves making local predictions and adjusting actions in real time to fulfill those predictions (similar to Model Predictive Control).</li><li><strong>Planning (top level): </strong>This involves high level predictions about long term goals or episodes. The agent acts to fulfill those predictions by planning and generating sequences of future states.</li></ul><p>And let’s unpack it even more.</p><ul><li><strong>At the lower levels</strong>, which are closer to sensory input or immediate states, inference is used to take actions that fulfill short term predictions. For example, if a robot predicts it will be in a certain position after moving, it will act to bring itself closer to that predicted position, reducing prediction error. 
In this sense, lower level inference is focused on immediate control, taking actions now to align the system with its predictions.</li><li><strong>At the higher levels</strong> (more abstract, long term goals), the inference is used to plan sequences of actions that will fulfill long term predictions or reach future goals. Instead of directly acting on immediate predictions, the agent is planning ahead, figuring out which sequence of actions will likely achieve the desired outcomes over time. Therefore, the top level inference is about planning, selecting the best course of actions to fulfill long term predictions.</li></ul><p>So <strong>lower levels</strong> use inference to adjust immediate actions.<br>And<strong> higher levels</strong> use inference to generate and plan sequences of actions.</p><p>The transition from acting to planning can be considered to be flexible and dependent on many factors such as:</p><ul><li><strong>Temporal scale of predictions</strong> (short term vs long term).</li><li><strong>Level of abstraction</strong> (concrete actions vs abstract goals).</li><li><strong>Hierarchical structure</strong> of the system, where lower levels focus on actions, and higher levels handle planning.</li></ul><p>In hierarchical systems like the one Friston describes, the key is that lower levels focus on real time action while higher levels handle planning over longer time scales.</p><p>Let’s explore an example where we have a robot with a 7 level hierarchical system navigating through a maze:</p><ol><li><strong>Level 1 (Lowest level)</strong>: This level might be responsible for <strong>motor control</strong>, for example, turning the wheels left or right in response to obstacles detected by sensors.</li><li><strong>Level 2–4</strong>: These levels might help with <strong>local adjustments</strong>, like figuring out which direction to go next in the maze or dealing with specific obstacles.</li><li><strong>Level 5–6</strong>: These might be responsible for 
<strong>higher level strategy</strong>, such as <strong>planning out the entire path</strong> through the maze, determining which turns to make based on the robot’s position and ultimate goal (e.g., reaching the exit).</li><li><strong>Level 7 (Highest level)</strong>: This level might oversee the entire process, adjusting the <strong>overall plan</strong> based on higher level goals (like completing the maze as quickly as possible or optimizing the path for energy efficiency).</li></ol><p>In this example, <strong>lower levels act</strong> immediately based on sensory feedback, while <strong>higher levels plan</strong> the sequence of actions over a longer time scale.</p><p>Let’s now compare all of the above to Reinforcement Learning.</p><ul><li><strong>RL offers an architecture that is a bit more overengineered, Karl says. </strong>Active Inference, in contrast, is more elegant, focusing on minimizing free energy without the need for complex reward structures or explicit policy search.</li><li><strong>In RL, you need to have explicit inputs that are the external rewards (which constitute a proxy for the constraints under which the Bayesian mechanics would operate). </strong>Active Inference does not require explicit external rewards, relying instead on prior beliefs (including the preferences encoded in the C tensor) and the system’s generative model to guide actions and learning.</li><li><strong>In RL, the inference scheme is about selecting the right policy (e.g., through Q learning or policy gradient methods) to maximize the expected cumulative (discounted) reward</strong>. In Active Inference, the inference scheme revolves around minimizing free energy by continuously updating beliefs, generating predictions about future states, and selecting actions to align observations with prior preferences encoded in the generative model.</li><li><strong>RL is often focused on optimizing short term rewards through trial and error. 
</strong>Active Inference emphasizes long term goal fulfillment, with actions driven by predictions about the future rather than immediate rewards.</li><li><strong>RL agents may struggle with sparse or delayed rewards, making learning slower and less efficient. </strong>Active Inference continuously updates its generative model, which allows for more efficient learning even in the face of sparse feedback or uncertainty.</li><li><strong>In RL, the exploration exploitation trade off is a central challenge</strong>, requiring strategies to achieve a better balance between trying new actions (exploration) and exploiting the best known routes (exploitation). In contrast, active inference handles exploration more naturally, as the agent explores to reduce future uncertainty and improve its model, rather than to just maximize rewards.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*okL5VsbO0fzog9mMcUsEZA.jpeg" /><figcaption>Image by Karl Friston via <a href="https://www.youtube.com/watch?v=ee6_mNOfP38">https://www.youtube.com/watch?v=ee6_mNOfP38</a></figcaption></figure><p>It’s now time to see a specific example of this in action, Karl says.</p><h3>Learning to play Pong</h3><p>We have seen that planning or imagining a future can occur at the highest levels of these models, levels where the focus is purely on abstract patterns, with no direct notion of action. At the lowest levels, however, the system can bring these predictive posteriors to life by selecting actions that realize the predicted outcomes.</p><p>And how can we use this framework to<strong> learn a state action policy</strong> from scratch?</p><p>Through <strong>active data selection</strong>, Karl explains. To demonstrate this, we will explore an example showcasing an active inference agent learning a state action policy to play the game of Pong.</p><p>In this context, the focus shifts from selecting the type of model to <strong>selecting the training data </strong>itself. 
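</p><p>In code, this kind of active data selection might look like the following sketch (with a hypothetical episode generator and reward signal of my own, not the actual Pong setup): generate bursts of random play and admit only the sequences that end in a reward.</p>

```python
import random

def random_episode(length=20):
    """Hypothetical stand-in for a burst of random Pong play: a sequence
    of (observation, action) pairs plus whether it ended in a reward."""
    seq = [(random.random(), random.choice(["up", "down"]))
           for _ in range(length)]
    rewarded = random.random() < 0.2       # toy stand-in for 'hit the ball'
    return seq, rewarded

def select_training_data(n_episodes=500):
    """Active data selection: admit only sequences that end in a reward
    (to feed fast structure learning) and ignore everything else."""
    admitted = []
    for _ in range(n_episodes):
        seq, rewarded = random_episode()
        if rewarded:
            admitted.append(seq)           # splice rewarded play together
    return admitted

random.seed(0)
data = select_training_data()
print(0 < len(data) < 500)                 # only a rewarded subset survives
```

<p>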
The agent must evaluate sequences by asking: does this sequence end in a reward?</p><ul><li><strong>Yes</strong>: admit this sequence for fast structure learning</li><li><strong>No</strong>: ignore this sequence</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/785/1*ao89AgGukcespfducgYWEA.jpeg" /><figcaption>Image by Karl Friston via <a href="https://www.youtube.com/watch?v=ee6_mNOfP38">https://www.youtube.com/watch?v=ee6_mNOfP38</a></figcaption></figure><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FFRCJ1ZQcVlU%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fshorts%2FFRCJ1ZQcVlU&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FFRCJ1ZQcVlU%2Fhq2.jpg&amp;type=text%2Fhtml&amp;schema=youtube" width="640" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/7351728a6452fb55686124d090558e58/href">https://medium.com/media/7351728a6452fb55686124d090558e58/href</a></iframe><p>Karl tells us that <strong>fast structure learning</strong> builds a generative model of events that lead to a reward (and excludes everything else). In short, fast structure learning helps the agent quickly focus on understanding the cause and effect relationships that are most relevant for achieving its goals, filtering out everything that isn’t directly contributing to those goals.</p><p>Unpacking this.</p><ul><li>We only select something if it minimizes EFE (including expected cost). In other words, as an agent, I’ll only select data features or training sequences that minimize cost and are associated with a reward.</li><li>Therefore, we <strong>select instances</strong> of random play<strong> that align with our goals</strong>, the ones that minimize EFE and avoid costly outcomes. 
In summary, we <strong>generate training sequences through random play </strong>and ignore those that result in a high cost.</li><li>After securing a reward, we restart the random play and use subsequent sequences if they also achieve rewards.</li><li>We chain, bootstrap, or splice together sequences of rewarded play, which <strong>fast structure learning</strong> uses to build a generative model of expert play.</li><li>This creates a generative model that encodes a state action policy, which can be realized through reflective active inference. The agent acts in accordance with its learned generative model, making its behavior resemble that of an expert.</li><li>The trained agent recognizes only expert play, which ensures that all its predictions align with those of an expert player. Because actions fulfill these predictions, the agent will appear to behave like an expert player.</li><li>The reward is expressed in the charts as free energy (which is minimized). So the reward can be seen as a process of reducing free energy through actions that align with the predictions of the model.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*N-XVuS7ESa7pE-j462rvJg.jpeg" /><figcaption>Image by Karl Friston via <a href="https://www.youtube.com/watch?v=ee6_mNOfP38">https://www.youtube.com/watch?v=ee6_mNOfP38</a></figcaption></figure><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FxCOzdI2e8fk%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fshorts%2FxCOzdI2e8fk&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FxCOzdI2e8fk%2Fhq2.jpg&amp;type=text%2Fhtml&amp;schema=youtube" width="640" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/adc411adad5a709ed8f784ff1d56da6d/href">https://medium.com/media/adc411adad5a709ed8f784ff1d56da6d/href</a></iframe><p>And now comes a key point. 
Friston says that we have switched on inductive inference at the highest level.</p><ul><li>Inductive inference involves making predictions about the world based on prior observations and patterns. At the highest level of the agent’s model (the top level), the agent applies inductive reasoning to predict what it needs to do in order to reach desired outcomes (rewarded states).</li><li><strong>Sequences and projections through the model</strong>: The agent considers potential future states (or sequences of states) and works out how those states are encoded in its model. It projects these predictions down the hierarchy to lower levels, essentially checking if those predicted sequences of events lead to desired outcomes.</li><li><strong>Intended states</strong>: If a predicted sequence includes a reward or a hit, that sequence is considered an intended state. This is the state the agent wants to move toward in future timesteps.</li><li><strong>Avoiding undesired states</strong>: The agent uses this knowledge to figure out which states it must avoid in order to reach its intended states. It works out what actions or paths could lead to undesirable outcomes that would prevent it from reaching its goals.</li><li><strong>In summary</strong>: the agent uses inductive inference to guide its future actions by predicting the sequence of states that lead to rewards, figuring out which ones to avoid, and then adjusting its actions accordingly.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*9X4VqnbgwCyKMiG-moxefQ.jpeg" /><figcaption>Image by Karl Friston via <a href="https://www.youtube.com/watch?v=ee6_mNOfP38">https://www.youtube.com/watch?v=ee6_mNOfP38</a></figcaption></figure><p>So inductive inference is applied at the top level, where the agent is essentially told: ‘This world is under your control. 
Imagine what you want to happen.’ In other words, the agent is tasked with controlling the world and envisioning the outcomes it seeks to achieve.</p><p>The agent then predicts what will occur based on its learned model and determines which sequences of events are necessary to achieve its goals. Once these predictions are made, it deploys reflexive actions (or behaviors) to make them come true and reach its intended states.</p><p>Let’s summarize the full process.</p><ul><li><strong>Random Play and Selection of Rewarded Training Sequences</strong><br>The agent begins by engaging in random play, exploring various actions and observing their outcomes. We only select the sequences that result in rewarded or desired outcomes (sequences that minimize free energy). This provides the agent with a set of examples that lead to good results.</li><li><strong>Fast Structure Learning on Selected Sequences</strong><br>Next, we apply <strong>fast structure learning</strong> to the selected rewarded sequences. This enables the agent to learn the underlying structure and dynamics of the task, essentially creating a model of the system based on these examples.</li><li><strong>Applying Inductive Inference During Actual Play</strong><br>Once the agent is actively playing, it uses <strong>inductive inference</strong> to plan its actions. The agent looks at the high level intended or rewarded outcomes it wants to achieve and traces back through its generative model to identify which sequences of states and actions are most likely to lead to those desired outcomes. 
This allows the agent to select actions aligned with its goals, rather than merely exploring randomly.</li></ul><p>The key difference between steps 2 and 3 is that:</p><ul><li><strong>Step 2</strong> (fast structure learning) involves building a generative model offline using the rewarded sequences.</li><li><strong>Step 3</strong> (inductive inference) uses that model to guide decision making and action selection during actual play.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/421/1*UHY7aFPtY12y1uJomQGvjQ.jpeg" /><figcaption>Image by Karl Friston via <a href="https://www.youtube.com/watch?v=ee6_mNOfP38">https://www.youtube.com/watch?v=ee6_mNOfP38</a></figcaption></figure><p>The reason we need the inductive inference step, even though we’ve already trained on the rewarded sequences, is that the generative model learned in step 2 captures the overall structure and dynamics, but doesn’t necessarily prescribe the exact sequence of actions to take at each moment.</p><p><strong>Inductive inference</strong> allows the agent to plan and adapt dynamically during play, based on the current state of the game, rather than simply replaying previously learned sequences. 
This makes the agent’s decision making more flexible and goal oriented.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*8cuN2OSQRNG3Pz3h2XeETw.jpeg" /><figcaption>Image by Karl Friston via <a href="https://www.youtube.com/watch?v=ee6_mNOfP38">https://www.youtube.com/watch?v=ee6_mNOfP38</a></figcaption></figure><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2F_ENjiwSMPCg%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fshorts%2F_ENjiwSMPCg&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2F_ENjiwSMPCg%2Fhq2.jpg&amp;type=text%2Fhtml&amp;schema=youtube" width="640" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/c585b7f9b7cb29f42fbd6d534e3fae51/href">https://medium.com/media/c585b7f9b7cb29f42fbd6d534e3fae51/href</a></iframe><p>Next, Karl explains that focusing exclusively on rewarded sequences can result in a model that may be somewhat brittle, as it lacks exposure to a broader range of possibilities. He then discusses how incorporating more diverse sequences can increase the robustness of the model.</p><h3>All roads lead to Rome</h3><p>To make the model more robust, Karl Friston suggests expanding the learned model by incorporating more pathways that lead to desirable outcomes (rewards).</p><ul><li><strong>Creating an Attracting Set</strong><br>At the beginning of the process we have created a set of episodes (or sequences of states and actions) that lead to rewarding outcomes. This set is called an <strong>attracting set</strong>, where any path through the model will eventually lead to a reward. 
This attracting set represents the set of all desirable or goal directed states within the agent’s generative model.</li><li><strong>Augmenting the Model with Additional Paths</strong><br>To make the model more comprehensive, we can add more episodes or sequences of actions, as long as they eventually lead to one of the states in the attracting set. These new episodes help the agent explore different paths or strategies while still ensuring that the end result is a reward.</li><li><strong>Expanding the Generative Model</strong><br>By adding these additional pathways, we are effectively enriching the agent’s generative model. Instead of relying only on a small set of rewarded sequences, the model can now consider a broader range of possibilities that ultimately lead to a desirable outcome. This expansion helps the agent learn more about the environment and increase its robustness by exploring diverse paths to the same goal.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*6eyQuyuKGkwWvinIPUaJpA.jpeg" /><figcaption>Image by Karl Friston via <a href="https://www.youtube.com/watch?v=ee6_mNOfP38">https://www.youtube.com/watch?v=ee6_mNOfP38</a></figcaption></figure><p>This concludes our fascinating journey into scale free active inference, as explained by Karl Friston. 
<strong>You can watch Karl Friston’s full talk</strong>, along with many others on active inference, <strong>via the following link </strong>to the latest Active Inference Symposium.</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2Fee6_mNOfP38%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3Dee6_mNOfP38&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2Fee6_mNOfP38%2Fhqdefault.jpg&amp;type=text%2Fhtml&amp;schema=youtube" width="640" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/4c72744c3a3c240c8f1011bdee14571f/href">https://medium.com/media/4c72744c3a3c240c8f1011bdee14571f/href</a></iframe><p>Stay updated on the latest developments in active inference through the publications and streams of the Active Inference Institute.</p><p><a href="https://www.activeinference.institute/">The Active Inference Institute</a></p><h3>In Plain English 🚀</h3><p><em>Thank you for being a part of the </em><a href="https://plainenglish.io/"><strong><em>In Plain English</em></strong></a><em> community! 
Before you go:</em></p><ul><li>Be sure to <strong>clap</strong> and <strong>follow</strong> the writer 👏</li><li>Follow us: <a href="https://x.com/inPlainEngHQ"><strong>X</strong></a> | <a href="https://www.linkedin.com/company/inplainenglish/"><strong>LinkedIn</strong></a> | <a href="https://www.youtube.com/channel/UCtipWUghju290NWcn8jhyAw"><strong>YouTube</strong></a> | <a href="https://discord.gg/in-plain-english-709094664682340443"><strong>Discord</strong></a> | <a href="https://newsletter.plainenglish.io/"><strong>Newsletter</strong></a> | <a href="https://open.spotify.com/show/7qxylRWKhvZwMz2WuEoua0"><strong>Podcast</strong></a></li><li><a href="https://differ.blog/"><strong>Create a free AI-powered blog on Differ.</strong></a></li><li>More content at <a href="https://plainenglish.io/"><strong>PlainEnglish.io</strong></a></li></ul><hr><p><a href="https://ai.plainenglish.io/how-to-grow-a-sustainable-artificial-mind-from-scratch-54503b099a07">How to grow a sustainable artificial mind from scratch</a> was originally published in <a href="https://ai.plainenglish.io">Artificial Intelligence in Plain English</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Paths to the future of AI]]></title>
            <link>https://ai.plainenglish.io/paths-to-the-future-of-ai-97ebe423f295?source=rss-7f7b5d730c84------2</link>
            <guid isPermaLink="false">https://medium.com/p/97ebe423f295</guid>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[active-inference]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[agi]]></category>
            <dc:creator><![CDATA[Javier Ideami]]></dc:creator>
            <pubDate>Wed, 13 Mar 2024 07:28:37 GMT</pubDate>
            <atom:updated>2024-03-26T08:43:09.463Z</atom:updated>
            <content:encoded><![CDATA[<h4>Exploring some of the possible routes to reach artificial general intelligence</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*lllP2X7bNPIy606KroQmjw.jpeg" /><figcaption>Paths to the future of AI | Image AI generated by Javier Ideami | ideami.com</figcaption></figure><p>As impressive as AI is today, its limitations are clear whenever we look at how ChatGPT, self driving cars or Midjourney work. They are very useful as prototyping or brainstorming tools, and can take you sometimes 99% of the way to a useful result. But would you put any of those systems in charge of a process in which failures had potentially catastrophic consequences? Of course not. But why do these systems sometimes make silly mistakes, “hallucinate”, or simply fail to “understand” what they are doing?</p><h3>Dual Process Theory</h3><p>Dual process theory, popularized by nobel prize Daniel Kahneman, gives us a straight forward abstraction to understand the limitations of today’s AI. This theory tells us that we can consider two ways our brain can process information. We call them: System 1 and System 2.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*2yzmK8N-XlX-eO-8Q_zhEQ.jpeg" /><figcaption>Image AI generated by Javier Ideami | ideami.com</figcaption></figure><h3>System 1</h3><p>System 1 is the type of information processing that we perform automatically, subconsciously. It is, essentially, pattern matching. A quick mapping from perception to action, or from a certain need to a response. It is quick and cheap. Living organisms rely a lot of System 1 responses because of two main reasons:</p><ul><li>Automating your responses saves energy. It is better than having to figure them out all over again. Less energy spending equals a more efficient life.</li><li>If you are facing a danger, like a tiger that is about to eat you, you have no time to reflect or ponder. 
You have to generate a response fast!</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Sme9voAliZUNhJ7TwbUyHg.jpeg" /><figcaption>Image AI generated by Javier Ideami | ideami.com</figcaption></figure><p>Think of System 1 as an analog to human intuition, to your subconscious automatic mapping processes. Human intuition is very powerful. However, sometimes it gives you the wrong answer. Why?</p><p>Human intuition gives you the best approximation it can derive from the knowledge you have. But if you don’t have enough relevant knowledge for the scenario at hand, it will still produce an approximation, but an incorrect one. And here comes the key: your intuition will still present that incorrect approximation to you as if it were the truth. Does this sound familiar? Yes, because that’s the same behavior you find in GPT, Gemini, Claude, and other AI systems of today.</p><p>How do we go beyond System 1? Let’s explore it.</p><h3>System 2</h3><p>A living organism, whenever it finds a new unknown scenario, or whenever the approximation presented by its intuition (System 1) doesn’t seem correct, can employ another way of processing information: System 2, slow thinking.</p><p>Think of when you need to learn to drive, or play a new piano piece. You need to slowly, step by step, find the algorithm, the sequence of actions that will solve that new scenario.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Pxnb-Azvg9ZbpFKephXGow.jpeg" /><figcaption>Image AI generated by Javier Ideami | ideami.com</figcaption></figure><p>But how do you find that optimal sequence of actions?</p><h3>World Models</h3><p>A robust System 2 needs a robust world model. 
A world model is a simplified abstraction of how the world works.</p><p>Babies gradually learn their world model mainly by observation, and a little by interaction.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*R9Edtq5HrYoyHle3yoDzZQ.jpeg" /><figcaption>Image AI generated by Javier Ideami | ideami.com</figcaption></figure><p>If you have such an abstraction in your mind, and if it is robust enough, you can do something really special. You can plan. You can do that because, by using your world model, you are able to predict the consequences of taking different sequences of actions.</p><p>So when an organism does slow thinking, it makes use of the predictive capabilities of a world model in combination with other modules (a cost module to estimate how far you are from your objective, a memory module, etc.) to find the optimal sequence of actions for a certain goal.</p><p>This is why System 2 is way more expensive for the organism than System 1. And that’s why you cannot employ System 2 processing at all times. It would burn you out. So once you have learnt a new algorithm with System 2, you transfer it to System 1: you automate it.</p><p>And so, once you have learnt to drive or to play that piano piece, you can do those tasks without paying attention, automatically, subconsciously, with just your System 1 capabilities.</p><p>In summary, good System 2 capabilities allow you to reason and plan in effective ways.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ryXFXrsNOULAOCbiOnuDxw.jpeg" /><figcaption>Image created by Javier Ideami | Images within the image are from unsplash.com</figcaption></figure><p>So what about AI? Artificial intelligence has mastered System 1 capabilities at a superhuman level.</p><p>However, it has very limited System 2 capabilities. 
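The slow-thinking loop described above, a world model plus a cost module searching for a good action sequence, can be made concrete with a tiny sketch. Everything here is an illustrative assumption: a one-dimensional world, actions of -1, 0 or +1, and a made-up goal state.

```python
from itertools import product

GOAL = 5  # hypothetical target state the organism wants to reach

def world_model(state, action):
    """Toy world model: predicts the next state an action would produce."""
    return state + action

def cost(state):
    """Cost module: how far the predicted state is from the objective."""
    return abs(GOAL - state)

def plan(start, horizon=3, actions=(-1, 0, 1)):
    """System 2 sketch: imagine every action sequence with the world model,
    score the predicted end state with the cost module, keep the cheapest."""
    best_seq, best_cost = None, float("inf")
    for seq in product(actions, repeat=horizon):
        state = start
        for action in seq:
            state = world_model(state, action)  # predict, don't act yet
        if cost(state) < best_cost:
            best_seq, best_cost = seq, cost(state)
    return best_seq, best_cost

best_plan, remaining_cost = plan(start=2)
```

The exhaustive search over sequences is exactly why slow thinking is expensive, and why it pays to cache the winning plan as an automatic System 1 response afterwards.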
Most experts agree that, although within the latent spaces of a system like GPT there probably exists some kind of abstract representation of the world, it is a very brittle and fragile representation. And it is an implicit one.</p><p>AI needs robust world models, trained in explicit ways to function as effective predictive systems. Only then can the organism plan and reason well enough.</p><h3>The road to ASI</h3><p>System 1 + System 2 equal human capabilities. That’s what many people call AGI, artificial general intelligence. AGI is not a popular term among most experts because intelligence in general is fairly specialized rather than general. It is specialized to the specific needs an organism has to survive.</p><p>I will use the term AGI in this article simply because it has become sort of a used-by-everybody term to denote human like intelligence. But it is important to clarify that it is a bit of a misnomer. Alternatives to AGI include terms like AHI (artificial human intelligence).</p><p>So what does the path ahead look like?</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*qeH0Cxvxof0EoAKgv-URSQ.jpeg" /><figcaption>Image AI generated by Javier Ideami | ideami.com</figcaption></figure><ul><li>Today we have ANI = artificial narrow intelligence</li><li>As we develop System 2 capabilities we will be reaching other stages like ARI (artificial rat intelligence), AMI (artificial mouse intelligence) or ACI (artificial cat intelligence)</li><li>Eventually we will arrive at AGI, or AHI (artificial human intelligence)</li><li>And after that? 
We will keep going beyond human intelligence, towards ASI (artificial super intelligence)</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Hw1gVvySazg-Plu0zcp-Ew.jpeg" /><figcaption>Graphic and Cookies created by Javier Ideami | ideami.com</figcaption></figure><p>So, the question is, what are the potential paths ahead of us to reach those robust System 2 capabilities that are essential to move on from the current ANI? Let’s explore some of them.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*zIm1ltrj1AMWQOlZMi0O8w.jpeg" /><figcaption>Image created by Javier Ideami | ideami.com</figcaption></figure><h3>But it’s already here!</h3><p>A very small minority in the AI community believes that AI may already be sentient, and that its System 2 capabilities are already there or almost there. Most AI experts do not agree with this view.</p><h3>Deep learning all the way</h3><p>(with sprinkles of RL, discrete search, etc)</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*OsaL8roHwEs9jDDCMxpbEg.jpeg" /><figcaption>Image AI generated by Javier Ideami | ideami.com</figcaption></figure><p>We enter mainstream territory. This is the view led by the likes of OpenAI or Google. This camp believes that deep learning will be the core technology that will take us all the way to AGI.</p><p>This would happen through a combination of scale (data, compute, etc) and tweaks and evolutions of the architectures being used by deep learning systems.</p><p>Although deep learning would be the core, approaches like reinforcement learning, discrete search and others would complement that core on its way towards AGI.</p><p>Let’s summarize the potential of this approach.</p><ul><li><strong>Challenges</strong>: questions remain as to how robust the world models and System 2 capabilities of solutions centered mainly on deep learning would be.</li><li><strong>Cost</strong>: very high. 
Deep learning systems spend massive amounts of energy compared to, say, how the human brain works.</li><li><strong>What’s next</strong>: look to scaling + improvements in deep learning architectures in the next few years, as planning and other system 2 capabilities begin to be unlocked.</li><li><strong>Examples</strong>: OpenAI &amp; Google Approaches (GPT, Gemini)</li></ul><h3>Hybrid &amp; multimodular approaches</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*7Wic0D4IOgwm-26Bk4m0KQ.jpeg" /><figcaption>Image AI generated by Javier Ideami | ideami.com</figcaption></figure><p>Hybrid and multimodular systems may combine different approaches. I will focus here on an example that will still use deep learning as its core technology. The key is that these solutions will be more divergent in terms of moving from the typically monolithic systems of today towards others that are more modular, closer to how our brain is partitioned in different functional areas.</p><p>Consider the human brain. An area called the hippocampus deals with memory formation and spatial navigation. Another called the basal ganglia is related to our automatic behaviors, habit formation and reward processing. Yet another, the cerebellum, deals with fine tuning of movement, balance and coordination.</p><p>In that spirit, Yann LeCun, one of the godfathers of AI, has proposed a new kind of multimodular architecture in his academic paper “<a href="https://openreview.net/pdf?id=BZ5a1r-kVsf">A Path towards autonomous machine intelligence</a>”.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*gvtWwVBSVmz0dP5SV2rQlw.jpeg" /><figcaption>Captured from Yann LeCun’s public academic paper at <a href="https://openreview.net/pdf?id=BZ5a1r-kVsf">https://openreview.net/pdf?id=BZ5a1r-kVsf</a></figcaption></figure><p>His proposal combines a number of modules to implement both System 1 and System 2 capabilities. 
Some of those modules are:</p><ul><li><strong>The configurator module</strong>. A kind of a master module that connects to all the others and sets their parameters to adapt their functionality to the current goals.</li><li><strong>The Perception module: </strong>it takes the input to the senses of the agent and abstracts it into a latent representation of the state of the world. This latent representation may be expressed hierarchically at different levels of abstraction. In our brain, this module would correspond to parts of our visual cortex, auditory cortex, etc.</li><li><strong>The World Model:</strong> The most complex part of the architecture. It allows the agent to estimate missing information about the current state of the world, predict world states from previous ones or from actions proposed by the actor module, etc. Yann LeCun has proposed the JEPA architecture to implement the world model. Initial versions called I-Jepa and V-Jepa (trained with images and videos respectively) have been launched already, and work continues in this key part of the proposal.</li></ul><p><a href="https://github.com/facebookresearch/jepa">GitHub - facebookresearch/jepa: PyTorch code and models for V-JEPA self-supervised learning from video.</a></p><ul><li><strong>The Cost module</strong>: It measures an amount that Yann calls “energy”, which expresses how far the agent is from “comfort”, or the distance between where the agent is and where it wants to be in relation to different drives and goals. The ultimate goal of the agent is to minimize this energy, this cost. The cost combines two terms, Yann explains, the<strong> Intrinsic cost module</strong>, which is hard wired and computes the instantaneous present “discomfort” of the agent, and the <strong>trainable critic module</strong>, which is used to predict future intrinsic energies. 
Some parts of the intrinsic cost module can be compared to the basal ganglia and the amygdala in humans, whereas parts of our prefrontal cortex involved in reward prediction would correspond to the trainable critic module.</li><li><strong>The Memory module</strong>: It stores useful data about past, present and future world states, together with their associated intrinsic costs. This is useful, for example, to train the critic module (which estimates future intrinsic costs). The memory module can be compared to the hippocampus in humans.</li><li><strong>The Actor module</strong>: it creates proposals for action sequences that may be used to solve a new scenario. It also sends actions to the system’s actuators. The actor is at the center of how System 1 and System 2 processes are implemented. It has two components. One is a <strong>policy module</strong> that quickly maps world states (derived from the perception module) to actions. This is the base of System 1. And the second is the action optimizer that performs model-predictive control. This is the base of System 2. This second component, when combined with the world model and the cost module, can be used to learn new algorithms, by gradually finding an optimum sequence of actions that minimizes the cost associated with them. In our brain, areas of the premotor cortex that deal with proposing and encoding motor plans could correspond to parts of this module.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*zcQwRpdo8jyDYTOVYZfIvQ.jpeg" /><figcaption>Captured from Yann LeCun’s public academic paper at <a href="https://openreview.net/pdf?id=BZ5a1r-kVsf">https://openreview.net/pdf?id=BZ5a1r-kVsf</a></figcaption></figure><p>For details, check his academic paper. 
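As a mental model only, the wiring between some of these modules could be sketched like this. This is a drastically simplified toy, not an implementation of LeCun's proposal: the class names, the scalar "world", and the one-step search are all assumptions made for the sketch.

```python
class Perception:
    def encode(self, observation):
        # abstract raw sensory input into a latent state (identity here)
        return observation

class WorldModel:
    def predict(self, state, action):
        # predict the next latent state an action would produce
        return state + action

class CostModule:
    def __init__(self, goal):
        self.goal = goal
    def energy(self, state):
        # "energy": distance between where the agent is and wants to be
        return abs(self.goal - state)

class Actor:
    def __init__(self, world_model, cost, actions=(-1, 0, 1)):
        self.world_model, self.cost, self.actions = world_model, cost, actions
        self.policy = {}  # System 1 component: cached state -> action mapping
    def act(self, state):
        if state in self.policy:  # fast, automatic response (System 1)
            return self.policy[state]
        # System 2 component: one-step model-predictive search over actions
        best = min(self.actions,
                   key=lambda a: self.cost.energy(
                       self.world_model.predict(state, a)))
        self.policy[state] = best  # automate the new solution into System 1
        return best

perception, cost = Perception(), CostModule(goal=3)
actor = Actor(WorldModel(), cost)
state = perception.encode(0)
while cost.energy(state) > 0:
    state = actor.world_model.predict(state, actor.act(state))
```

The interesting bit is the Actor's two paths: a cheap cached policy for familiar states, and a model-predictive search (world model plus cost module) for novel ones, whose result is then cached.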
You can also explore an infographic and article I wrote about Yann’s proposal:</p><p><a href="https://medium.com/@ideami/the-tower-of-mind-towards-a-better-chatgpt-fafd8935e663">The tower of mind, towards a better ChatGPT</a></p><p>Let’s summarize this approach:</p><ul><li><strong>Challenges</strong>: questions remain as to how to solve many key pending issues, including the robust world model (research and the first versions of the JEPA architecture are in their initial phases), the memory module, the setting of goals and others.</li><li><strong>Cost</strong>: high. Cost is likely to be similar to the previous approach.</li><li><strong>What’s next</strong>: look to research on the different modules of these hybrid and multimodular architectures in the next few years as planning and other system 2 capabilities begin to be unlocked.</li><li><strong>Examples</strong>: LeCun’s proposal</li></ul><h3>Active Inference</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*g7fDzm7C4qPBaAdFe_60xw.jpeg" /><figcaption>Image AI generated by Javier Ideami | ideami.com</figcaption></figure><p>Active Inference represents one of the most promising alternative approaches to the deep learning revolution. It is based on a completely different paradigm, focused on Bayesian mathematics.</p><p>Bayesian mathematics is all about probabilistic computation, which can be used to update in real time the beliefs about the world of an agent.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*dsrOr3TxHGsdd_NKAPIgfQ.jpeg" /><figcaption>Image created by Javier Ideami | ideami.com</figcaption></figure><p>Those beliefs of the agent constitute its world model. 
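For intuition, a single Bayesian belief update over a discrete set of world states can be written in a few lines. The two hidden states and the likelihood numbers below are made up purely for illustration:

```python
def update_beliefs(prior, likelihood, observation):
    """One step of Bayes' rule: posterior(s) is proportional to
    likelihood(observation given s) * prior(s), renormalized to sum to 1."""
    unnormalized = {s: likelihood[s][observation] * p for s, p in prior.items()}
    total = sum(unnormalized.values())
    return {s: v / total for s, v in unnormalized.items()}

# Hypothetical agent beliefs about a hidden state of the world
prior = {"raining": 0.5, "sunny": 0.5}
# P(observation | state): how likely each observation is under each state
likelihood = {"raining": {"wet": 0.9, "dry": 0.1},
              "sunny": {"wet": 0.2, "dry": 0.8}}

# Observing wet ground shifts the agent's belief sharply towards "raining"
posterior = update_beliefs(prior, likelihood, "wet")
```

Each new observation feeds the previous posterior back in as the next prior, which is the "updating beliefs in real time" described here.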
Active inference has different ways of implementing the mechanics of how that world model is updated in real time, and how the agent uses that world model to decide what to do at each moment, in response to the new observations (data) it continuously receives from its surroundings.</p><p>At the base of Active Inference is the Free Energy Principle, which stipulates that living organisms behave as if they were constantly minimizing their free energy. You can think of the free energy as if it was equivalent to surprise, to the difference between the predictions of the agent regarding what it is about to perceive, and the actual observations it receives.</p><p>To minimize surprise, the active inference agent can do two things: it can update its beliefs, its world model, or it can act upon the world to change its observations. So, as you can see, in active inference, perceiving and acting are two sides of the same coin.</p><p>By minimizing surprise, you improve your predictions, which improves your world model. You may then ask, if the goal is to minimize surprise, what about curiosity and creativity?</p><p>Active Inference has a good answer to that. The agent tries to minimize surprise, but also to move towards a number of goals and preferences that can be encoded in the mind of the agent. So both, minimizing surprise and moving towards goals and preferences, are important drivers.</p><p>But not only that. Minimizing surprise can focus on shorter or longer time horizons. And sometimes, minimizing surprise long term, requires the agent to first explore its surroundings in the short term, to gain enough knowledge about its environment.</p><p>It is just one scenario within all the complexity that can arise from active inference implementations. This is all accounted for by different equations. 
One of them, as an example, deals with the mechanics of action selection, and depends on a number of factors, such as:</p><ul><li><strong>Pragmatic value</strong>: how much do different action sequences help the agent achieve its goals?</li><li><strong>Epistemic value</strong>: how much do different action sequences help the agent gain new knowledge about its environment?</li><li><strong>Entropic value</strong>: how much uncertainty does the agent have about the consequences of taking different sequences of actions?</li></ul><p>As those factors interact with each other, the agent will be driven to either explore its environment or exploit its current knowledge.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Rf7eLGPrhlWEqgD1wLA3HQ.jpeg" /><figcaption>Image created by Javier Ideami | ideami.com</figcaption></figure><p>Active inference has some very important advantages over deep learning. For example:</p><ul><li>It implements continual learning. You don’t need to retrain the system every time you want to add new knowledge. The agent is continuously updating its beliefs in real time.</li><li>It is explainable and transparent. You can examine the “mind” of the agent, its inner computations, in order to explore the rationale behind its decisions.</li><li>It requires orders of magnitude less data to learn.</li><li>It consumes far less energy, potentially thousands of times less. In that sense, it gets much closer to how our brain works.</li><li>It can implement sophisticated System 2 capabilities.</li><li>What one agent learns can be transferred to other agents.</li><li>It is based on robust science. Active Inference was founded by Karl Friston, the most academically cited neuroscientist in the world.</li></ul><p>Those are just some of its advantages. But if active inference is so great, why isn’t it dominating the field?</p><p>Because until very recently, something called the Bayesian Wall was in the way. 
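</p><p>To make the action-selection factors listed above more concrete, here is a toy sketch of my own (illustrative code and numbers, not the actual active inference equations) of how pragmatic, epistemic and entropic terms might be combined into a single score per candidate policy, with the lowest score winning:</p>

```python
import math

def score_policy(predicted, preferred, info_gain):
    """Toy expected-free-energy-style score: lower is better."""
    eps = 1e-12
    # Pragmatic term: divergence between predicted outcomes and goals.
    pragmatic = sum(p * math.log((p + eps) / (q + eps))
                    for p, q in zip(predicted, preferred))
    # Entropic term: uncertainty about the consequences of acting.
    entropic = -sum(p * math.log(p + eps) for p in predicted)
    # Epistemic term: expected information gain makes a policy
    # more attractive, so it is subtracted from the score.
    return pragmatic + entropic - info_gain

preferred = [0.8, 0.2]   # the agent's preferred outcomes (its goals)
exploit = score_policy([0.7, 0.3], preferred, info_gain=0.05)
explore = score_policy([0.5, 0.5], preferred, info_gain=0.60)
best = "explore" if explore < exploit else "exploit"
# With these numbers the knowledge-rich policy wins: best == "explore".
```

<p>Depending on how much each candidate policy serves the goals versus how much it teaches the agent, the balance tips towards exploitation or exploration.</p><p>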
The Bayesian Wall means that Bayesian computations are slow and expensive, so scaling up these systems was considered impractical.</p><p>However, a company called Verses (which Karl Friston is part of) has recently announced that it has broken the Bayesian Wall, finding a way to solve the scaling problem of active inference.</p><p><a href="https://www.linkedin.com/pulse/verses-ais-active-inference-outperforms-deep-learning-denise-holt-ktp9c">VERSES AI&#39;s Active Inference Outperforms Deep Learning in Historic AI Industry Benchmark Test</a></p><p>This could potentially be a game changer. Verses has announced that in the summer of 2024 they will release a series of academic papers explaining their progress. In the meantime, they have released a demo + preview, a summary of which you can watch in the following video:</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2F8uj4g_4nH-8%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3D8uj4g_4nH-8&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2F8uj4g_4nH-8%2Fhqdefault.jpg&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/6fc26fb59a68fcb5cd1f50c70aed938f/href">https://medium.com/media/6fc26fb59a68fcb5cd1f50c70aed938f/href</a></iframe><p>Let’s summarize this approach:</p><ul><li><strong>Challenges</strong>: questions remain about how well active inference implementations will scale. The related academic papers, which Verses is planning to release in the summer, will be a great starting point to evaluate the possibilities of scaling this approach.</li><li><strong>Cost</strong>: low. 
Active inference implementations would use much less energy than deep learning implementations.</li><li><strong>What’s next</strong>: explore the recent announcement by Verses and stay tuned to their new academic papers, slated to be released this summer.</li><li><strong>Examples</strong>: Verses’ implementation of Active Inference.</li></ul><h3>Mixture of approaches</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*qeH0Cxvxof0EoAKgv-URSQ.jpeg" /><figcaption>Image AI generated by Javier Ideami | ideami.com</figcaption></figure><p>There is a good chance that the previous approaches will eventually converge.</p><p>If architectures like JEPA begin to show promise and parts of Yann LeCun’s proposal start to be implemented, that is likely to influence the rest of the deep learning community. At that point, or even earlier, we could see other multimodular and hybrid approaches beginning to emerge.</p><p>If what Verses has announced regarding the breaking of the Bayesian Wall is confirmed, and their active inference implementation begins to beat deep learning benchmarks, it is likely that parts of the deep learning community will begin to explore how to combine both approaches.</p><p>The combination of deep learning’s impressive System 1 capabilities with the explainability, efficiency and transparency of probabilistic approaches like active inference, plus the benefits of multi modularity, could produce a great diversity of new combined approaches on the way to AGI.</p><p>Let’s summarize this approach:</p><ul><li><strong>Challenges</strong>: plenty, regarding how well these approaches will combine with each other. 
Engineering challenges may appear that are hard to solve.</li><li><strong>Cost</strong>: it’s possible that a mixture of some of these approaches will be more efficient than one relying solely on deep learning.</li><li><strong>What’s next</strong>: stay tuned for progress on the approaches highlighted above.</li></ul><h3>Other possibilities</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*udW7WFMbxCaGE76k9ZFxyw.jpeg" /><figcaption>Image AI generated by Javier Ideami | ideami.com</figcaption></figure><p>Beyond the previous approaches, there are other possibilities to consider.</p><ul><li>A part of the AI community believes that more advanced technology, such as quantum computing, may be essential to achieve true AGI.</li><li>Parts of the AI community believe that embodiment will be essential to get to AGI. That is, AIs interacting through sensors and actuators with the full complexity of a world like ours (be it virtually or physically).</li><li>In the spirit of Professor Kenneth Stanley and his fabulous book “Why Greatness Cannot Be Planned”, the intermediate steps towards AGI may look nothing like the objective and be really counterintuitive. So we may well be surprised in the coming years by unexpected new kinds of architectures and algorithms that emerge from researchers following a gradient of interestingness in their everyday exploratory activities.</li></ul><p>This concludes our exploration of some of the possible paths to AGI. 
The one thing that is certain is that we live in a historic moment and that the future will surely bring plenty of surprises, guaranteed excitement and important developments which will require us all to do our best to push this technology in the direction that brings the most benefits to humanity.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*zIm1ltrj1AMWQOlZMi0O8w.jpeg" /><figcaption>Image created by Javier Ideami | ideami.com</figcaption></figure><h3>In Plain English 🚀</h3><p><em>Thank you for being a part of the </em><a href="https://plainenglish.io/"><strong><em>In Plain English</em></strong></a><em> community! Before you go:</em></p><ul><li>Be sure to <strong>clap</strong> and <strong>follow</strong> the writer ️👏<strong>️️</strong></li><li>Follow us: <a href="https://twitter.com/inPlainEngHQ"><strong>X</strong></a><strong> | </strong><a href="https://www.linkedin.com/company/inplainenglish/"><strong>LinkedIn</strong></a><strong> | </strong><a href="https://www.youtube.com/channel/UCtipWUghju290NWcn8jhyAw"><strong>YouTube</strong></a><strong> | </strong><a href="https://discord.gg/in-plain-english-709094664682340443"><strong>Discord</strong></a><strong> | </strong><a href="https://newsletter.plainenglish.io/"><strong>Newsletter</strong></a></li><li>Visit our other platforms: <a href="https://stackademic.com/"><strong>Stackademic</strong></a><strong> | </strong><a href="https://cofeed.app/"><strong>CoFeed</strong></a><strong> | </strong><a href="https://venturemagazine.net/"><strong>Venture</strong></a><strong> | </strong><a href="https://blog.cubed.run/"><strong>Cubed</strong></a></li><li>More content at <a href="https://plainenglish.io/"><strong>PlainEnglish.io</strong></a></li></ul><hr><p><a href="https://ai.plainenglish.io/paths-to-the-future-of-ai-97ebe423f295">Paths to the future of 
AI</a> was originally published in <a href="https://ai.plainenglish.io">Artificial Intelligence in Plain English</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[The tower of mind, towards a better ChatGPT]]></title>
            <link>https://ai.plainenglish.io/the-tower-of-mind-towards-a-better-chatgpt-fafd8935e663?source=rss-7f7b5d730c84------2</link>
            <guid isPermaLink="false">https://medium.com/p/fafd8935e663</guid>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[chatgpt]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[deep-learning]]></category>
            <dc:creator><![CDATA[Javier Ideami]]></dc:creator>
            <pubDate>Mon, 10 Jul 2023 07:54:42 GMT</pubDate>
            <atom:updated>2024-06-19T21:51:48.142Z</atom:updated>
            <content:encoded><![CDATA[<h3>The Tower of Mind, towards a better ChatGPT</h3><h4>How new architectural paradigms can fix the limitations of systems like ChatGPT</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*3NxT1Lc_EmBsSx1-VTcSkg.jpeg" /><figcaption>The Tower of Mind | Infographic by Javier ideami | ideami.com</figcaption></figure><p><strong>ChatGPT </strong>has taken the world by storm. And yet, its limitations are many. It often <strong>hallucinates </strong>and presents outright falsehoods as if they were facts.</p><p>In a way, it is like our human intuition: fast, powerful and confident, sometimes too confident for its own good!</p><p>Recently, <a href="https://en.wikipedia.org/wiki/Yann_LeCun"><strong>Yann LeCun</strong></a>, one of the godfathers of AI, published a fascinating paper titled “<strong>A Path Towards Autonomous Machine Intelligence</strong>”. In this theoretical paper, he outlines a proposal for a new kind of multimodular architecture that could fix many of the limitations of systems like ChatGPT and move us closer to <a href="https://en.wikipedia.org/wiki/Artificial_general_intelligence">AGI</a>.</p><p>I have combined parts of Yann’s paper (which is rather long at 60 pages) with one of my visual metaphors in order to produce the Tower of Mind infographic, which simplifies and visually expresses some key parts of his proposal.</p><p>Throughout the infographic, we will review key aspects of how our human mind works and simultaneously connect those aspects with this potential future AI architecture proposed by Yann LeCun. We will see that there are many parallels and connections to be drawn between Yann’s proposal and the way the brain works. 
Let’s begin!</p><h3>Thinking Fast and Slow</h3><p>At the right side of the infographic we establish a key objective of this new architectural paradigm: to implement two different ways of processing information, two modes that match the way our brain works, the modes that Nobel laureate Daniel Kahneman explains so well in his famous book “<a href="https://www.amazon.com/Thinking-Fast-Slow-Daniel-Kahneman/dp/0374533555">Thinking Fast and Slow</a>”: <strong>System-1</strong> and <strong>System-2</strong>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*HD3Nv8v1aZ02AUv8TP9ICA.jpeg" /><figcaption>The Tower of Mind | Infographic by Javier ideami | ideami.com</figcaption></figure><p>There is an evolutionary reason for these two ways of processing information to exist in our brain, and perhaps, in most intelligent systems.</p><p>Whenever we encounter a new scenario (for which we haven’t previously learnt a response pattern), we need to find the sequence of steps, the algorithm to solve such a scenario. For this, we employ what Kahneman calls System-2. We search for that new algorithm, paying close attention in a systematic and conscious way as we slowly reason our way to a new response pattern.</p><p>The problem with our System-2 is that it is slow and expensive. It requires a lot of cognitive effort and fuel (the glucose that powers our brain). We cannot use such a mode of thinking all day. Therefore, once we learn a new response pattern, we proceed to automate it. We make it subconscious. We transfer it from System-2 to System-1.</p><p>Our System-1 is fast and much cheaper. It is also subconscious and powers our intuition and the associative machinery that has a lot to do with what we call “Creativity”. 
It quickly connects perceptions with actions, or goals with response patterns.</p><p>Metaphorically speaking, our subconscious is like a kitchen pot that combines and recombines the information we absorb (which we also compress and abstract).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*N8MnbTNYi38pSYvTRU_oFg.jpeg" /><figcaption>Image AI generated by Javier ideami | ideami.com</figcaption></figure><p>Within the subconscious realm, System-1 processes quickly map inputs to outputs, bypassing the slow System-2.</p><blockquote>When a tiger is about to eat us, we have no time to consciously reflect on what to do. We must map perception to action immediately.</blockquote><p>While System-2 is typically either off or weakly active (monitoring the impressions sent by System-1), System-1 is always on and active.</p><p>Our subconscious intuition is constantly trying to fit perception to response patterns. And when it doesn’t find a perfect fit, it gives us an approximation. But it always presents its conclusions as if they were facts.</p><p>It is then up to our System-2 to accept or override our System-1 impressions. The way a typical System-1 mode works is one of the sources behind the hallucinations and lies produced by systems like ChatGPT.</p><p>It’s time to jump into the tower of mind. Let’s first take a quick look at the interface between the world and our perception at the bottom of the tower.</p><h3>The ocean beneath</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*EluWD1WkIlKgBlH4Y0z4IQ.jpeg" /><figcaption>The Tower of Mind | Infographic by Javier ideami | ideami.com</figcaption></figure><p>The ocean beneath the tower represents all the complexity around us. In contact with that ocean of complexity, our senses perceive and absorb information. However, such an amount of complexity cannot possibly fit into our brains. 
Therefore, we must compress it.</p><p>Within Yann’s proposal, an encoder module compresses our perception and gradually abstracts it, discarding much of the detail and preserving its essence.</p><p>A similar thing happens in our brains. For example, in our visual cortex, the information passes through different layers that gradually extract more refined abstractions of our perception.</p><p>It is now time to return to the very top of the tower and begin our gradual descent.</p><h3>I swear I’m in control of this mess</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*08Rcd9h_v0hbk9MDtMByfQ.jpeg" /><figcaption>The Tower of Mind | Infographic by Javier ideami | ideami.com</figcaption></figure><p>Consciousness is not necessarily a prerequisite for an advanced form of artificial intelligence. We don’t really know if future AI systems will or won’t be conscious. In addition, we don’t even understand what consciousness is or how it works.</p><p>In any case, in the infographic we create a parallel between human consciousness and a potentially similar mechanism that may exist in the AIs of the future. A horizontal line separates conscious and subconscious processes.</p><p>The comical texts at the top represent the degree of control that we typically feel we have in relation to what goes on down below, in the rest of the tower of the mind.</p><p>However, far more of our lives than we may think is run by our System-1 processes.</p><blockquote>Experts state that we make around 30,000 decisions a day, and that <a href="https://pressreleases.responsesource.com/news/94586/new-research-highlights-the-unlocked-potential-of-the-human-brain/">we are only conscious of 0.26% of them</a>. Most of what we do is System-1.</blockquote><p>Most of the time we run in autopilot or semi-autopilot mode. This is the reason why, even though what AI has mastered today is mainly limited to System-1 capabilities, it is still able to impact almost every aspect of our lives. 
Because, in fact, most of what we do is related to such a way of processing information.</p><p>Still, System-2, while being used selectively and carefully, is the most important part of our thinking. It is what allows us to proactively find new algorithms to solve new scenarios, to analyze and reason in systematic ways, to monitor and, if necessary, override the impressions sent by System-1, etc.</p><p>Does ChatGPT have System-2 capabilities? The AI community is divided on this question, but most experts state that although ChatGPT may have some kind of rudimentary internal model of the world, it is still very far from having something equivalent to our powerful System-2 capabilities.</p><p>What makes the situation confusing is that ChatGPT uses human language, and our language is one of the foundational pillars of our System-2. The combination of the use of language and its massive System-1 pattern matching capabilities (way larger than ours), allows ChatGPT to sometimes very convincingly imitate our process of reasoning.</p><p>Also, having been trained on data which includes many reasoning processes, ChatGPT is able to output reasoning-like steps if we encourage or help it with techniques such as <a href="https://www.promptingguide.ai/techniques/cot"><strong>Chain of Thought Prompting</strong></a>.</p><p>At the same time, it is quite easy to catch ChatGPT making the kind of mistakes that reveal the absence of a sophisticated world model.</p><p>But, at a simple level, what is a world model, and why is it so important?</p><h3>The all important world model</h3><p>A world model is a simplified abstraction of how the world works. 
A good world model allows you to:</p><ul><li>Predict a future state of the world from a previous one</li><li>Predict next states of the world after performing simulated actions</li><li>Fill in missing details in the data coming from the perception module</li></ul><p>Most importantly, a powerful world model allows you to simulate and learn without having to do trial and error in the real world, which can be costly and dangerous.</p><p>If you think about it, a great model of the world is like having <strong>common sense</strong>. As Yann LeCun explains, we can view common sense as a set of models of the world that tell us what is plausible or likely and what is impossible.</p><p>And the kind of mistakes ChatGPT often makes reveals a lack of sophisticated common sense, that is, the lack of a sophisticated model of the world.</p><p>To address these limitations, Yann LeCun proposes this new architectural paradigm. One of the first key points to note in regard to this new paradigm is its <strong>multi modularity</strong>.</p><p>Systems like ChatGPT, although they may have some internal modules, are quite monolithic. For example, they don’t have a dedicated memory module. 
They are what we typically call end-to-end systems.</p><p>To compensate for the lack of separate key modules and also for the lack of sophisticated System-2 capabilities, all sorts of patches and hacks are currently being used in combination with ChatGPT: plugins of all kinds, from Wolfram Alpha to Zapier, vector databases, libraries that allow LLMs to become agents capable of making (or attempting to make) autonomous decisions (see AutoGPT or BabyAGI).</p><p>These and others are temporary solutions that may eventually be replaced by robust ones like the new paradigm proposed by Yann LeCun.</p><p>Below I share another of my infographics, one which summarizes the current ecosystem around LLMs.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*-P0E-w00U9wuzTVyfY87-w.jpeg" /><figcaption>Every prompt matters | Infographic by Javier ideami | ideami.com</figcaption></figure><p>You can download the “Every prompt matters” infographic at its GitHub repo.</p><p><a href="https://github.com/javismiles/every-prompt-matters">GitHub - javismiles/every-prompt-matters: Infographic exploring the quickly evolving LLM related AI ecosystem (will be updated regularly)</a></p><p>Therefore, in contrast to the limitations of current systems, Yann’s architecture is more akin to our human brain, being composed of a number of separate modules which allow the architecture to flexibly implement processes that make System-1 and System-2 capabilities possible. 
We will soon explore these modules and the way they work together.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*y0QoHPjD8JUSocEgnCxHNw.jpeg" /><figcaption>Graphic from “A Path Towards Autonomous Machine Intelligence” by Yann LeCun | <a href="https://openreview.net/pdf?id=BZ5a1r-kVsf">https://openreview.net/pdf?id=BZ5a1r-kVsf</a></figcaption></figure><p>Let’s now review the top of the tower, continuing to establish parallels between how our mind works and this new architectural paradigm of Yann LeCun.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*twRUgr5n2agXgrOT9HbL9Q.jpeg" /><figcaption>The Tower of Mind | Infographic by Javier ideami | ideami.com</figcaption></figure><p>At the top we have our attention capabilities. As humans, when we use System-2 processes to learn new algorithms, we make use of <strong>the lighthouse of our attention</strong> to illuminate and understand the context around the information we interact with.</p><p>Modern AI architectures also employ <strong>attention mechanisms</strong>, a key part of the <strong>Transformers </strong>architecture, which has revolutionized the deep learning field.</p><p>Below you will find a detailed infographic I created about how Transformers work.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*t-z6A49IScypBHM9D54S2g.jpeg" /><figcaption>X-Ray-Transformer | Infographic by Javier ideami | ideami.com</figcaption></figure><p><a href="https://github.com/javismiles/X-Ray-Transformer">GitHub - javismiles/X-Ray-Transformer: Infographic about the inner computations of a transformer model, training and inference</a></p><p>Let’s continue. At the <strong>top of the</strong> <strong>abstraction layers </strong>of the tower of our minds lies <strong>human language</strong>. The word “flower” at the top of the tower represents every possible flower in the universe. 
At the opposite end of the tower, by the ocean, we find the rich details of a specific flower.</p><p>Time to talk about <strong>goals</strong>. We humans, subconsciously and/or consciously, set goals for ourselves. We want to do more exercise, get a new job, improve our relationship with somebody, or simply feel better.</p><p>We will soon see how these goals connect with the rest of the modules and the way we may perform optimization processes to find the best actions or response patterns to fulfill such goals.</p><p>Finally, Yann proposes the existence of a <strong>configurator </strong>module. A kind of master module that connects to all the others and sets their parameters in order to adapt their functionality to the current goals.</p><p>We may have something similar to this module within the prefrontal cortex of our brain, which is involved in many of our executive functions.</p><p>Apart from the configurator, the other main modules proposed by Yann, and which we will explore in a bit, are:</p><ul><li><strong>The Perception module: </strong>it takes the input to our senses and brings it up the tower, compressing it and abstracting it into a <strong>latent representation</strong> of the state of the world. This latent representation may be expressed hierarchically at different levels of abstraction. In our brain this module would correspond to parts of our visual cortex, auditory cortex, etc. The process of compressing and abstracting our perception is represented in the infographic by the green encoder module.</li><li><strong>The World Model:</strong> The most complex part of the architecture. As explained previously, it allows us to estimate missing information about the current state of the world, predict world states from previous ones or from actions proposed by the actor module, etc. 
The predictive capabilities of the world model are represented in the infographic by purple circles with white edges.</li><li><strong>The Cost module</strong>: It measures an amount that Yann calls “energy”, which expresses how far we are from “comfort”, or the distance between where we are and where we want to be in relation to different drives and goals. The ultimate goal of the agent (us, in the case of humans) is to minimize this energy, this cost. The cost combines two terms, Yann explains, the<strong> Intrinsic cost module</strong>, which is hard wired and computes the instantaneous present “discomfort” of the agent, and the <strong>trainable critic module</strong>, which is used to predict future intrinsic energies, and soon we will see why this is so important and useful. Some parts of the intrinsic cost module can be compared to the basal ganglia and the amygdala in humans, whereas parts of our prefrontal cortex involved in reward prediction would correspond to the trainable critic module. The cost modules are represented in the infographic by fuchsia rectangular boxes.</li><li><strong>The Memory module</strong>: It stores useful data about past, present and future world states, together with their associated intrinsic costs. This is useful, for example, in order to train the critic module (which estimates future intrinsic costs), and later on we will explain how this would be done. The memory module can be compared to the hippocampus in humans. It is represented in the infographic by a small tower on the left side of the main structure.</li><li><strong>The Actor module</strong>: it creates proposals for action sequences that may be used to solve a new scenario. It also sends actions to the actuators of the system. The actor is at the center of how System-1 and System-2 processes are implemented. It has two components. One is a <strong>policy module</strong> that quickly maps world states (derived from the perception module) to actions. 
This is the base of System-1. And the second is the action optimizer that performs model-predictive control. This is the base of System-2. Later on we will see how this second component, when combined with the world model and the cost module, can be used to learn new algorithms by gradually finding an optimal sequence of actions that minimizes their associated cost. The actor modules are represented in the infographic by white circles with yellow edges (involved in System-2) and by yellow triangles (involved in System-1). In our brain, areas of the pre-motor cortex that deal with proposing and encoding motor plans could correspond to parts of this module.</li></ul><p>It is now time to descend and explore the way System-2 capabilities would work in the paradigm Yann proposes.</p><h3>System 2, the slow and steady snail</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*3jJg22KSRA9ws-uszmhdNA.jpeg" /><figcaption>The Tower of Mind | Infographic by Javier ideami | ideami.com</figcaption></figure><p>Let’s review. This new architectural paradigm will fully activate System-2 processes when it needs to find the steps, the actions, the algorithm to solve a new scenario, or when it wants to override the automatic impressions sent by System-1.</p><p>In order to do this, System-2 needs to find the right actions to solve that new scenario. And to get there, the system will perform an <strong>optimization </strong>process to gradually find those actions.</p><p>What is most important to understand is that such optimization processes will take place in a hierarchical fashion at different levels of temporal abstraction. This is because each goal the AI has can be expressed at different temporal scales. A similar thing happens with us humans.</p><p>For example, say I want to learn to play a new piano piece. 
The response patterns I need to activate in order to play it don’t yet exist in my brain, so I need to learn them slowly and consciously using System-2.</p><p>Learning to play the piece is my main goal. But now I can subdivide it into subgoals that correspond to different parts of the piece. And now I can take each of those subgoals and further subdivide them into other goals, like first learning the melody and then the accompaniment of each of them. And so on and so forth.</p><p>So we want to optimize the actions we will take in regard to each of the subgoals, hierarchically encompassing each temporal abstraction, including the master goal.</p><p>In this infographic, to simplify, we are representing two of the levels of such a hierarchical process.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*0MSpfDucZs35gmaWHzs-Lg.jpeg" /><figcaption>The Tower of Mind | Infographic by Javier ideami | ideami.com</figcaption></figure><p>And each of those levels is composed of the combination of three parts:</p><p>First, an actor module proposes a sequence of actions to tackle a certain goal (the white circles with yellow edges).</p><p>Then, the world model takes the current state of the world and the action proposed by the actor and outputs its predicted next state of the world. It then takes that predicted next state and the following action proposed by the actor, and predicts again the following state of the world. And so on and so forth, gradually predicting the consequences of performing the actions proposed by the actor. These predictive processes are represented by the purple circles with white edges.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*DY5ZBtOQZLoR4HxEPWqgEg.jpeg" /><figcaption>The Tower of Mind | Infographic by Javier ideami | ideami.com</figcaption></figure><p>Great, so we propose possible actions and we predict the consequences of performing them. 
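</p><p>A minimal sketch of that prediction loop (hypothetical names and stand-in dynamics of my own; in the real architecture the world model would be a learnt module):</p>

```python
def world_model(state, action):
    # Stand-in dynamics for illustration: the state drifts by the action.
    return state + action

def rollout(initial_state, proposed_actions):
    """Simulate the actor's proposed actions through the world model."""
    states = [initial_state]
    for action in proposed_actions:
        # Each predicted state is fed back in with the next proposed
        # action, gradually predicting the consequences of the sequence.
        states.append(world_model(states[-1], action))
    return states

# Three proposed actions yield three predicted future world states.
trajectory = rollout(0.0, [1.0, 2.0, -0.5])
# trajectory == [0.0, 1.0, 3.0, 2.5]
```

<p>Swap in a learnt neural world model and this same loop becomes the simulation engine just described.</p><p>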
But now, how can we link those two with our ultimate goal, which is to optimize those actions in order to minimize the total cost, the discomfort, the difference between where we are and where we want to get to?</p><p>We do it through the cost module. As explained previously, the cost module measures an amount, an energy, that expresses our degree of discomfort, how far we are from “comfort”, from an ideal state where our goals have been achieved.</p><p>And what the agent wants, be it an AI or a human, is to minimize this number, to decrease the cost, to decrease that difference between where we are and where we want to be, to get closer to our ideal state.</p><p>To review, Yann explains that this module is composed of two parts:</p><ul><li>An intrinsic cost, which expresses our present level of comfort in relation to things like hunger, pain, pleasure and the like, as well as other needs that may arise from our goals (depending on how the configurator module has set up the cost module).</li><li>A critic, which is a trainable module that is used to predict the future intrinsic cost connected to a certain state of the world. This is very important because we are using the world model to simulate actions. This means that we are not actually taking them. So the only way to know their cost in advance is to predict it. And the critic is in charge of doing that.</li></ul><p>Each predictive process (linked to a set of actions) is connected to a cost process that estimates the related cost. 
And each intermediate cost is summed to produce in the end one final total cost.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*8bWBU1yzO8dCi64IEqBaHw.jpeg" /><figcaption>The Tower of Mind | Infographic by Javier ideami | ideami.com</figcaption></figure><p>So now we have:</p><ul><li>Proposed actions</li><li>Predicted new states of the world derived from those actions</li><li>A total predicted cost that results from performing a simulation of that sequence of actions</li></ul><p>All that remains is to perform an optimization process by iterating through this pipeline with the objective of minimizing as much as possible the value of the total cost.</p><p>As we do that, we will be modifying and tweaking the proposed actions while we keep decreasing the total cost at the end of the pipeline.</p><p>Yann emphasizes that we have to remember that this would be happening in parallel at different levels of temporal abstraction within this hierarchical architecture. And as we can see below, the different levels of temporal abstraction influence each other in key ways.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*0MSpfDucZs35gmaWHzs-Lg.jpeg" /><figcaption>The Tower of Mind | Infographic by Javier ideami | ideami.com</figcaption></figure><p>The specifics of the optimization algorithm that will gradually tweak the actions with the objective of decreasing the total cost, depends on how continuous and differentiable the mapping that goes from the actions through the world model to the cost may be.</p><p>If such mapping is continuous and smooth, we could use gradient based optimization algorithms like backpropagation.</p><p>However, if there are discontinuities in the mappings, we would have to employ gradient-free methods like dynamic programming, heuristic search techniques, combinatorial optimization, etc.</p><p>After performing the optimization process, the system has found a set of actions that allows it to reach its goal. 
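</p><p>The loop we just walked through — propose actions, roll them through the world model, sum the predicted costs, then tweak the actions to lower that total — can be sketched in a few lines of Python. Everything below (the one-dimensional toy dynamics, the quadratic discomfort cost, the random-search optimizer) is an illustrative assumption of mine, not the actual machinery of Yann’s proposal; since the toy optimizer is gradient-free hill climbing, it corresponds to the second of the two optimization regimes mentioned above.</p><pre>

```python
import random

random.seed(0)  # make the sketch reproducible

def world_model(state, action):
    # Toy dynamics: the action simply nudges a 1-D state of the world.
    return state + action

def intrinsic_cost(state, goal=10.0):
    # Discomfort: squared distance between where we are and where we want to be.
    return (state - goal) ** 2

def total_cost(actions, start=0.0):
    # Roll the proposed actions through the world model, summing the
    # predicted cost of every intermediate state.
    state, cost = start, 0.0
    for a in actions:
        state = world_model(state, a)
        cost += intrinsic_cost(state)
    return cost

def optimize(actions, iters=2000, step=0.5):
    # Gradient-free hill climbing: perturb one action at a time and keep
    # the change only if the total predicted cost decreases.
    best = total_cost(actions)
    for _ in range(iters):
        i = random.randrange(len(actions))
        old = actions[i]
        actions[i] += random.uniform(-step, step)
        new = total_cost(actions)
        if new < best:
            best = new
        else:
            actions[i] = old  # revert: this tweak did not reduce discomfort
    return actions, best

actions, cost = optimize([0.0] * 5)  # five initially "do nothing" actions
```

</pre><p>With a smooth, differentiable world model and cost we could instead backpropagate through the rollout and follow exact gradients; the hill-climbing loop above trades that efficiency for generality.</p><p>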
We have learnt a new response pattern to fit with the new scenario.</p><p>But let’s remember that System-2 is slow and expensive. Therefore, we want to transfer these learnings to System-1, so that the following times we can execute the learnt response pattern automatically without having to engage the world model and the rest of the optimization process. Let’s therefore continue our way towards System-1.</p><h3>System 1, the fast and always ready cheetah</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*DUt9d_hzy9q2or2gMZGeww.jpeg" /><figcaption>The Tower of Mind | Infographic by Javier ideami | ideami.com</figcaption></figure><p>The yellow ellipse represents part of our subconscious System-1, a set of quick perception-action loops that map our sensor inputs or our abstract needs to sequences of actions or response patterns.</p><p>The yellow triangles represent the part of the actor module that Yann LeCun calls “policy modules”, modules that are trained to map a certain perception or need to an action or sequence of actions.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*iZeAyIe-xFgGJPKjoQi5Dg.jpeg" /><figcaption>The Tower of Mind | Infographic by Javier ideami | ideami.com</figcaption></figure><p>Yann tells us that this new architectural paradigm would have one single world model (in relation to System-2), but multiple action policy modules (in relation to System-1).</p><p>Having a single world model in this new AI architecture allows us to reuse the related hardware as well as share knowledge between different goals and tasks.</p><p>Is the world model in our brain the final consequence of a voting process like Jeff Hawkins suggests in his book “<a href="https://www.amazon.com/Thousand-Brains-New-Theory-Intelligence/dp/1541675819">A Thousand Brains: A New Theory of Intelligence</a>”? Jeff Hawkins talks about thousands of models within our cortical columns which through voting coalesce into stable predictions. 
It remains to be seen.</p><p>On the other hand, in this new AI paradigm the subconscious System-1 processes can employ numerous policy modules which can be trained to output different response patterns learnt by System-2 processes in response to related perceptions or connected needs.</p><p>And how do we perform such a transfer process, from System-2 learnings to System-1 policy modules?</p><h3>The snail becomes a cheetah</h3><p>The white circles with yellow edges represent the actions that System-2 has learnt and optimized. Below them, we find light blue square modules that connect each of those learnt actions with System-1 policy modules (the yellow triangles).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*1dxBo7ZWCtPNqEhitc0Fzw.jpeg" /><figcaption>The Tower of Mind | Infographic by Javier ideami | ideami.com</figcaption></figure><p>The light blue square modules estimate the distance, the difference between the actions learnt by System-2 and the actions that System-1 policy modules output in response to the related states of the world.</p><p>In order to perform this transfer process from System-2 to System-1, we need to decrease the distance between those two kinds of inputs that feed the light blue modules. As we gradually decrease that distance, the output of the System-1 policy modules gets closer and closer to the actions learnt by System-2.</p><p>As that optimization process progresses, System-1 policy modules are able to gradually output with more precision those same actions in response to the related states of the world. 
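</p><p>As a minimal sketch of that distillation step (with made-up numbers and a deliberately tiny linear policy, not the actual modules of the paper): we treat the actions optimized by System-2 as targets, and repeatedly nudge the System-1 policy module to shrink the distance between its output and those targets.</p><pre>

```python
# (state, action) pairs that System-2 has already optimized; the System-1
# policy module learns to reproduce them. All numbers here are invented
# for illustration: the underlying relation is action = 2*state + 2.
pairs = [(0.0, 2.0), (1.0, 4.0), (2.0, 6.0), (3.0, 8.0)]

w, b = 0.0, 0.0   # a minimal linear policy: action = w*state + b
lr = 0.05

for _ in range(3000):
    for state, target in pairs:
        pred = w * state + b
        err = pred - target      # the "distance" the light blue modules measure
        # Gradient step that shrinks the distance between the policy's
        # output and the System-2 action.
        w -= lr * err * state
        b -= lr * err

def policy(state):
    # The trained System-1 policy: fast, direct state-to-action mapping.
    return w * state + b
```

</pre><p>After training, the policy reproduces the System-2 actions directly from the state, without engaging the world model or the optimization pipeline at all.</p><p>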
We have transferred the learnings from System-2 to System-1.</p><h3>The memory lane</h3><p>On the left of the infographic, we find a little tower that represents the short term memory module that Yann describes in his paper.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Ll9CutSD7ETBvFNvZXue_Q.jpeg" /><figcaption>The Tower of Mind | Infographic by Javier ideami | ideami.com</figcaption></figure><p>This separate memory module would correspond to the hippocampus in humans. It is in charge, in between other things, of storing pairs of states of the world and associated costs. Storing this information for future retrieval is key in order to, for example, be able to train the very important critic module. Let’s review why.</p><p>Remember that in order to learn, System-2 proposes some actions, simulates them, predicts the new states of the world derived from that simulation, and then predicts the total related cost. It then gradually optimizes that pipeline by tweaking the related actions in the direction that minimizes the total cost.</p><p>The critic module is in charge of estimating in advance the costs connected to the states that result from the simulations, and it needs to be trained to perform those predictions as well as possible.</p><p>By accessing the short term memory, we can pick a state of the world and its related intrinsic cost, and compare that cost with the one predicted by the critic. We can then perform an optimization process to decrease the distance between what the critic predicts and the correct cost stored in memory. By doing this many times, we will be training our critic to estimate in better ways future costs connected with simulated actions.</p><h3><strong>Looking ahead</strong></h3><p>In this article and infographic we have explored some of the key areas of this new architectural paradigm presented by Yann LeCun. 
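</p><p>Before moving on, here is a minimal sketch of the critic-training loop described in “The memory lane” section: pairs of states and intrinsic costs retrieved from short-term memory serve as targets, and the critic’s estimates are nudged toward the stored costs. The tabular critic and the invented cost values are simplifying assumptions of mine, of course.</p><pre>

```python
# (state, intrinsic_cost) pairs retrieved from the short-term memory module.
# The true relation here is cost = (state - 3)^2, but the critic doesn't know that.
memory = [(s, (s - 3) ** 2) for s in range(7)]

# A tabular critic: one cost estimate per stored state, initialized to zero.
critic = {s: 0.0 for s, _ in memory}
lr = 0.1

for _ in range(200):
    for state, stored_cost in memory:
        # Move the prediction a small step toward the cost stored in memory,
        # decreasing the distance between predicted and correct cost.
        critic[state] += lr * (stored_cost - critic[state])
```

</pre><p>In the full architecture the critic would be a trainable network that generalizes across states; the principle, decreasing the gap between its prediction and the cost retrieved from memory, is the same.</p><p>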
Simultaneously, we have reflected on the way our mind works, which has a lot to do with this new proposal by Yann.</p><p>But Yann’s paper is pretty long at 60 pages and includes many more details. A big chunk of the paper centers on how to design and train the world model, which includes its Joint Embedding Predictive Architecture (JEPA), the most complex part of the system. If you want to go deeper into his research and paper, you may explore it in depth here:<br><a href="https://openreview.net/pdf?id=BZ5a1r-kVsf">https://openreview.net/pdf?id=BZ5a1r-kVsf</a></p><p>Finally, you can download the Tower of Mind infographic in very high quality at the following Github repo:</p><p><a href="https://github.com/javismiles/tower-of-mind">GitHub - javismiles/tower-of-mind: The Tower of Mind Infographic , How new architectural paradigms can fix the limitations of systems like ChatGPT</a></p><p>This infographic was first presented during a talk of the same name proposed by <em>Instituto de Inteligencia Artificial</em> in Spain (<a href="http://iia.es">iia.es</a>) and which I gave to the company <a href="https://roams.es/">Roams</a>. Roams, directed by Eduardo Delgado, is one of the best examples of great entrepreneurship in Spain, a company that puts talent and people above everything else.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*zPl30gECeKXh68wWpqhmMw.jpeg" /><figcaption>The Tower of Mind | Infographic by Javier ideami | ideami.com</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*kV1ovRpl-ZRMRqyyqzUOqQ.jpeg" /><figcaption>Image AI generated by Javier ideami | ideami.com</figcaption></figure><hr><p><a href="https://ai.plainenglish.io/the-tower-of-mind-towards-a-better-chatgpt-fafd8935e663">The tower of mind, towards a better ChatGPT</a> was originally published in <a href="https://ai.plainenglish.io">Artificial Intelligence in Plain English</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Every prompt matters]]></title>
            <link>https://ai.plainenglish.io/prompt-engineering-future-chatgpt-8ce66ae4c322?source=rss-7f7b5d730c84------2</link>
            <guid isPermaLink="false">https://medium.com/p/8ce66ae4c322</guid>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[chatgpt]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[generative-art]]></category>
            <dc:creator><![CDATA[Javier Ideami]]></dc:creator>
            <pubDate>Tue, 28 Feb 2023 19:54:33 GMT</pubDate>
            <atom:updated>2025-04-12T14:58:51.199Z</atom:updated>
            <content:encoded><![CDATA[<h4>Hopes and risks of interacting with future evolutions of a ChatGPT like system, through one of the first movies that reflects on it all</h4><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FKoXxrfDr3FY%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DKoXxrfDr3FY&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FKoXxrfDr3FY%2Fhqdefault.jpg&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/ccbf056cdbc1504eed04026bf02b7f7f/href">https://medium.com/media/ccbf056cdbc1504eed04026bf02b7f7f/href</a></iframe><p><strong>Hamelin 77</strong> is one of the first movies that combines storytelling about prompt engineering and generative AI with the use of related techniques within its production. Let’s use some of the concepts in the movie to reflect about the present and future of our interaction with ChatGPT and related systems.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1000/1*Kl-L_Hh1zxjs2zwWrT0Y3g.jpeg" /><figcaption>Poster of Hamelin 77 by Javier Ideami @ Ideami Studios</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*I6181G99cOvCbK0QuWp2xg.jpeg" /></figure><h3><strong>The piper of Hamelin</strong></h3><p>The movie connects at a metaphorical level with the famous tale of <strong>the </strong><a href="https://en.wikipedia.org/wiki/Pied_Piper_of_Hamelin"><strong>pied piper of Hamelin</strong></a>. At the speed that everything is progressing, it is urgent to reflect upon the issue of control.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*NZ_25YN5izJDubgPnZcl4A.jpeg" /><figcaption>Still of Hamelin 77 by Javier Ideami @ Ideami Studios / ideami.com</figcaption></figure><p>Will our lives be eventually fully or subtly controlled by AI systems? 
(akin to what happens to the characters controlled by the Pied Piper in the famous tale?)</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*zbzMF9m5J2rO5m9dqUVrdA.jpeg" /><figcaption>Still of Hamelin 77 by Javier Ideami @ Ideami Studios / ideami.com</figcaption></figure><p>Will we instead always remain in control, or will a delicate win-win balance be struck?</p><p>In a way, we are in the middle of a delicate and crucial <strong>chess match</strong> on a twisted board whose rules keep dynamically changing as the match evolves.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*g7dGEnhZXgDOrHInGFMWGA.jpeg" /><figcaption>Still of Hamelin 77 by Javier Ideami @ Ideami Studios / ideami.com</figcaption></figure><p>Every few days AI systems make a move (through the actions of researchers and companies), and society responds in turn.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*QNg9EZHzVMFF-GJqFlcz9g.jpeg" /><figcaption>Still of Hamelin 77 by Javier Ideami @ Ideami Studios / ideami.com</figcaption></figure><p>What do most people think about transferring some control to the AI piper in relation to issues that impact our day to day, such as transportation, medicine, etc. while maintaining human supervision?</p><p>And what will happen when eventually that supervision can be done by AI systems themselves in a highly efficient way? The new movement of <a href="https://arxiv.org/abs/2212.08073"><strong>constitutional AI</strong></a> is beginning that path. The related academic paper says: <br><em>“As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles”.</em></p><p>We are exploring how to reach a suitable balance within this complex chess board. 
And this is a key point in the movie: a balance that is at the same time expectant but tense, hopeful but cautious. We are all progressively entering this <strong>dynamic chess board</strong> and such conundrums will only get more tricky and intense from now on.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*9vSwNKffCCagaFGK6559uw.jpeg" /><figcaption>Still of Hamelin 77 by Javier Ideami @ Ideami Studios / ideami.com</figcaption></figure><h3>The piper is a mirror</h3><p>We used to think of machines as cold and deterministic entities, and of humans as emotional and spontaneous. And yet, as AI advances, those distinctions are quickly blurring.</p><p>To the surprise of many, this latest AI revolution started from the generative and creative side (which seemed untouchable a while ago).</p><p>In terms of generating and creating with text, large language models are somewhat brittle but they can express themselves in emotional ways, be witty and use humor (and even attempt to explain their jokes).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Ev3IM-dmvrYUX3hK5IgAWQ.jpeg" /><figcaption>Still of Hamelin 77 by Javier Ideami @ Ideami Studios / ideami.com</figcaption></figure><p>And so, in the movie we witness <strong>both poles</strong>, humans and AI, as capable of expressing the range that goes from unpredictable and emotional to calculating and precise.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ukNG6fN4F-b3bzMGYL44kg.jpeg" /><figcaption>Still of Hamelin 77 by Javier Ideami @ Ideami Studios / ideami.com</figcaption></figure><p>We are seeing more and more that AI is in many ways a<strong> mirror of our own nature</strong>. 
The more it grows and the more we explore it, the more we learn about ourselves.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*hshaAHJkhSUriwitiE-FoQ.jpeg" /><figcaption>Still of Hamelin 77 by Javier Ideami @ Ideami Studios / ideami.com</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*shFN77c7K3gsqVXDkFchlA.jpeg" /><figcaption>Still of Hamelin 77 by Javier Ideami @ Ideami Studios / ideami.com</figcaption></figure><p>Human language as well as parts of the creative process seemed for a long time to be <strong>almost magical </strong>in nature, hopelessly complex.</p><p>But as AI systems evolve, they keep on removing the veils that cover our gaze. AI is helping us <strong>demystify many of these processes</strong>, as we discover that by combining powerful architectures at scale with the right data, we are able to automate many of these capabilities, albeit in brittle ways (more about this later).</p><p>At the same time, our increasing interactions with AI systems will also be transforming and modulating our behavior. And as hinted in the movie, we could potentially encounter conflicts that arise from the different goals AI and humans may have.</p><p>This is the so-called <a href="https://en.wikipedia.org/wiki/AI_alignment">alignment </a>challenge. As Wikipedia states: <em>“AI alignment research aims to steer AI systems towards their designers’ intended goals and interests“ </em>and <em>“AI systems can be challenging to align and misaligned systems can malfunction or cause harm”.</em></p><p>How will these two poles, humans and AI, influence each other, and what will emerge from this increasing interaction, <strong><em>humans 2.0</em></strong>?</p><p>An experiment is already underway in regards to today’s young generations. 
Kids are starting to use ChatGPT and related systems in intensive ways, and their way of thinking may be about to undergo a profound transformation.</p><h3><strong>Risks: when the flute malfunctions</strong></h3><p>In the movie, a catastrophic power failure produces unforeseen consequences in the interaction between the human and the AI.</p><p>If we are transferring more and more control to AI systems, because they are capable of automating many of our daily chores, how do we protect ourselves against unforeseen failures, sabotage, natural disasters and other factors that could disrupt the operation of these systems?</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*W400EpBQiFreVyp4cooW7w.jpeg" /><figcaption>Still of Hamelin 77 by Javier Ideami @ Ideami Studios / ideami.com</figcaption></figure><p>It feels like today there is nothing to worry about. But what about the future, when these systems gain more autonomy and potentially switch their tune unexpectedly, be it because of a misalignment issue, or because of unpredictable technical events?</p><p>Beyond that, there is the so-called “<a href="https://en.wikipedia.org/wiki/Technological_singularity"><strong>singularity</strong></a>” issue, the point in history in which technological growth becomes impossible to control. 
By the time we realize that AI has evolved beyond a key threshold, it may be already too late to manage the situation and to avoid being controlled and/or manipulated by such systems.</p><p>In summary, how do we avoid ending up like <strong>the rats in the Pied Piper of Hamelin</strong> story?</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*S_W69wKLWVaCZoImVKWD7Q.jpeg" /><figcaption>Still of Hamelin 77 by Javier Ideami @ Ideami Studios / ideami.com</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*IlWwCBx5R13bJj5NwSnEfQ.jpeg" /><figcaption>Still of Hamelin 77 by Javier Ideami @ Ideami Studios / ideami.com</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*xzUGwusN7-Ux31lfpYeNGw.jpeg" /><figcaption>Still of Hamelin 77 by Javier Ideami @ Ideami Studios / ideami.com</figcaption></figure><p>To deal with some of these concerns, we are witnessing the rise of terms like:</p><ul><li><a href="https://arxiv.org/abs/2212.08073"><strong>Constitutional AI</strong></a>: We briefly addressed this term earlier in connection with AI systems supervising other AIs. In general, Constitutional AI involves setting up a set of guiding principles (akin to a constitution), that can be used to control and govern how these systems behave. It is all about setting up guidelines for these systems so that we make sure that they align with the appropriate boundaries, constraints and goals.</li><li><strong>Responsible AI</strong>: the practice of planning, designing, implementing and deploying AI systems that are safe, trustworthy, and behave in ethical ways.</li></ul><p>Will some of these initiatives help us to eventually manage the capabilities of these pipers of Hamelin? 
We shall see.</p><h3><strong>Work: supervising the piper</strong></h3><p>In the movie, a teacher attends a job interview for the position of specialist in prompt engineering.</p><p><strong>Anthropic</strong>, an AI company backed by Google, recently announced the <strong>first job posts</strong> for the <a href="https://jobs.lever.co/Anthropic/e3cde481-d446-460f-b576-93cab67bd1ed"><strong>position of prompt engineer</strong></a><strong>,</strong> offering an outstanding salary.</p><p>As the AI piper takes control of more and more roles in the job market, will people be ready to make a radical mental shift in order to become supervisors of these new tools and capabilities, moving into more of a management role?</p><p>Or perhaps the human will always have to put the finishing touches, since AI systems may not master the so-called <strong>System 2</strong> way of processing information (as per Nobel laureate <a href="https://www.amazon.com/Thinking-Fast-Slow-Daniel-Kahneman/dp/0374533555"><strong>Daniel Kahneman</strong></a>) for quite some time (if ever).</p><p><strong>System 1</strong> and <strong>System 2</strong> are abstractions that reflect <strong>two key ways in which we process information in our brain</strong>: the <strong>fast</strong> and subconscious (<strong>System 1</strong>) and the <strong>slow</strong>, systematic, logical and conscious (<strong>System 2</strong>).</p><p>Humans use System 2 thinking to, for example, slowly and consciously find new algorithms, sequences of steps that allow us to reach an objective (think of learning to play the piano, drive a car or perform mathematical calculations). 
Once the algorithm has been learnt by System 2, it gets automated to different degrees by our fast and subconscious System 1.</p><p>In this way, next time that you drive or play that tune, you can do it pretty much in an automatic and effortless way, instead of slowly and consciously analyzing every step.</p><p>When you encounter a new scenario, your System 1 tries to find a match for it, or alternatively, the best approximation it can (that’s what you also call intuition, and it sometimes makes mistakes when its approximations are far from ideal).</p><p>System 2 is often listening to the proposals of System 1, ready to modulate and override such impressions and proposals if needed (how often this happens may depend on how reactive-impulsive a person is).</p><p>In case there is no suitable match, System 2 can intervene to slowly reason its way sequentially towards a new algorithm that may later on be automated again by System 1. <strong>What a beautiful dance</strong>!</p><p>Current AI systems excel at System 1 capabilities and are able to kind of mimic some System 2 ones. But to have proper System 2 capabilities, which are necessary in order to plan, supervise, reason and discover new algorithms, we would need to evolve and advance the current AI paradigms.</p><p>Despite the need to supervise these systems, one thing is clear. 
Many jobs will be gone, and new kinds of roles and positions will appear.</p><p>How can we modulate and manage the potential destruction of many jobs? And how can professionals keep up to date with the opportunities offered by the AI revolution so that they are not left behind?</p><p>One of the reasons why <strong>prompt engineering</strong> is going to keep growing and expanding is that these powerful AI systems will continue being somewhat <strong>brittle and prone to making silly mistakes</strong>, at least in the next few years.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*DtDDnpIKXBZECE0dJ-uogw.jpeg" /><figcaption>Still of Hamelin 77 by Javier Ideami @ Ideami Studios / ideami.com</figcaption></figure><p>This will happen because they are still far from being able to master those System 2 capabilities we talked about, and even further away from having the agency to reflect on their choices (although both of these capabilities can be somewhat mimicked).</p><p>The current <strong>brute force approach</strong> based on <strong>scaling</strong> up (by adding more data and parameters) will surely diminish these mistakes over time.</p><p>However, the need to move towards AI systems that are more efficient in terms of energy consumption, size and performance, as well as the need to make them more secure and precise, will eventually increase the pressure to evolve the current AI architectures and paradigms towards new possibilities that are more sustainable, precise and safe.</p><p>What we are witnessing today is a kind of <strong>Frankenstein phase</strong>, in which people are starting to combine the System 1 magic of these AI entities (their capacity to do pattern matching in extremely complex ways) with a number of external third-party tools capable of finding and implementing System 2 algorithms (see for example the proposal for <a 
href="https://writings.stephenwolfram.com/2023/01/wolframalpha-as-the-way-to-bring-computational-knowledge-superpowers-to-chatgpt/">connecting Wolfram Alpha and ChatGPT</a>).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*JsddCG81fuju9PdGBVHUEg.jpeg" /><figcaption>Still of Hamelin 77 by Javier Ideami @ Ideami Studios / ideami.com</figcaption></figure><p>But, as I just mentioned, for these systems to become truly secure, precise and stable, we will have to move beyond current AI paradigms, and that may still be quite far away.</p><p>In the meantime, we are patching these systems by using things like:</p><ul><li><strong>RLHF: </strong>Reinforcement learning from Human Feedback. This is a way of fine tuning AI models through human feedback. Basically, humans help to fine tune the model by providing feedback as to what way of communicating is appropriate for the AI system. It is a way of aligning the AI closer to how humans communicate.</li><li><a href="https://ai.googleblog.com/2022/05/language-models-perform-reasoning-via.html"><strong>Chain of thought prompting</strong></a><strong>: </strong>using a number of prompts to explain the steps of a System 2 algorithm to the AI so that the AI can then apply it to whatever we want. 
This works surprisingly well, but again it is just a hack that makes AI systems seem to reason as if they had System 2 capabilities.</li><li>Feedback loops with other third party tools capable of implementing some System 2 capabilities (like the proposal for <a href="https://writings.stephenwolfram.com/2023/01/wolframalpha-as-the-way-to-bring-computational-knowledge-superpowers-to-chatgpt/">connecting Wolfram Alpha and ChatGPT</a> mentioned above).</li></ul><p>Here is an example of Chain of Thought prompting as explained in the related academic paper “<a href="https://arxiv.org/abs/2201.11903">Chain-of-Thought Prompting Elicits Reasoning in Large Language Models</a>”:</p><p><strong>Question:</strong> Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?<br><strong>Reasoning:</strong> Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11.<br><strong>Answer:</strong> 11</p><p>You give all of these prompts to the AI model. You are literally explaining how to reason to get to the right result, giving it the steps needed to implement the algorithm that will produce the right answer. Then you ask it to apply the same reasoning to a different but similar question. And this works generally pretty well. 
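</p><p>To make the mechanics concrete, a few-shot chain-of-thought prompt like the one above can be assembled with a small helper. The code below only builds the prompt string — the call to a completion model is left out, and the second question is invented for illustration.</p><pre>

```python
def cot_prompt(examples, question):
    # Assemble a few-shot chain-of-thought prompt: each example shows the
    # reasoning steps explicitly, and the final question is left for the
    # model to answer in the same step-by-step style.
    parts = []
    for q, reasoning, answer in examples:
        parts.append(f"Q: {q}\nA: {reasoning} The answer is {answer}.")
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

# The worked example from the chain-of-thought paper.
examples = [
    (
        "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
        "Each can has 3 tennis balls. How many tennis balls does he have now?",
        "Roger started with 5 balls. 2 cans of 3 tennis balls each is "
        "6 tennis balls. 5 + 6 = 11.",
        "11",
    )
]

# A hypothetical new question for the model to answer in the same style.
prompt = cot_prompt(
    examples,
    "A juggler has 3 balls and buys 4 packs of 2 balls each. "
    "How many balls does she have now?",
)
```

</pre><p>The model then continues the text after the final “A:”, ideally imitating the step-by-step reasoning shown in the example before stating its answer.</p><p>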
But again, it is a hack.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Ff9X00CxusQH8T9WDQJEDA.jpeg" /><figcaption>Towards the unknown | Hamelin 77 by Javier Ideami @ ideami.com</figcaption></figure><p>As long as our paradigms don’t evolve beyond these hacks and patches:</p><ul><li>These systems will keep on making silly and sometimes dangerous mistakes, same as our human intuition (System 1) does at times.</li><li>Prompt engineering will keep on becoming an extremely important role and discipline, because for reasons of safety and precision, the need to guide and supervise these systems will only increase.</li></ul><p>So the next question is: <strong>how can one practice prompt engineering</strong>? How can one become a great specialist in this area? We are about to witness (it has already started) the launch of all sorts of aids (courses, podcasts, books, etc) that will help people train this skill.</p><p>But you see, <strong>prompt engineering</strong> is an <strong>art and </strong>a<strong> science.</strong> Yes, great prompt engineers should know the ins and outs of many of these AI architectures, their strengths and weaknesses, how they were trained, etc. And in systems that work within specific niches, prompt engineers that are domain experts will be required.</p><p>But beyond that, a great prompt engineer also needs something that is less tangible. What makes somebody a great communicator, capable of guiding others and of impacting our thoughts in powerful ways? These are people that have a flexible and creative mind. They are problem solvers and have also gone through a diverse set of experiences in life.</p><p>That’s why in the list of Anthropic’s requirements for the job of prompt engineer, the very first phrase says the following: <strong><em>“Have a creative hacker spirit and love solving puzzles”</em></strong>. 
Basically, be an expansive, creative, inquisitive, curious and experienced person.</p><p>As explained in the excellent book “<a href="https://www.amazon.com/Range-Generalists-Triumph-Specialized-World/dp/0735214484"><strong>Range</strong></a>” by David Epstein, the future may well belong to the generalists and multidisciplinary souls out there.</p><h3><strong>Multimodal education: learning with the piper</strong></h3><p>The core of the movie is the storytelling around the interaction between humans and AI. But on top of it, and because of it, a diversity of generative AI techniques have been used, including:</p><ul><li><strong>Images</strong>: generation of images from textual prompts.</li><li><strong>Video</strong>: generation of videos produced by navigating the abstract latent spaces produced by artificial intelligence architectures.</li><li><strong>AI model finetuning</strong>: retraining AI models to add new prompts connected to visuals of the actress.</li><li><strong>3D: NeRF </strong>— Volumetric reconstruction of 3D spaces and navigation through the reconstructed space.</li><li><strong>Voice</strong>: AI based synthetic generation of voices.</li><li><strong>Text</strong>: some specific phrases of the AI voices came from explorations performed with GPT models.</li><li><strong>VFX</strong>: Various AI techniques were used during the visual post-production phase.</li></ul><p>Soon, AI technology will be able to generate any kind and variation of multimodal output we may desire, bringing these systems ever closer to our very nature as multimodal agents, and then moving beyond to come up with new ways of expressing our thoughts.</p><p>This brings us to the subject of education. In order to learn about, for example, the battle of Waterloo or the discovery of America, people used to explore books, movies and the teachings of their tutors. 
Soon, students will be able to learn about the same topics through <strong>realistic and customized reconstructions</strong> created by AI systems, tailored to each student’s preferences.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*W8o2H1Lt20obBO2uMzgmug.jpeg" /><figcaption>Still of Hamelin 77 by Javier Ideami @ Ideami Studios / ideami.com</figcaption></figure><p>What will be the new role of teachers in the face of this explosion of customized multimodal learning? And how can students make the most of these new technologies?</p><p>We can imagine that fairly soon AI systems will be able to create <strong>customized gamified experiences</strong> that will allow any of us to absorb and learn any topic in a much deeper (as well as more entertaining) way than ever before.</p><h3><strong>Language: the cornerstone of this revolution</strong></h3><p>Central to the movie is human language, which is perhaps what makes us most unique as living beings.</p><p>The ability to abstract the complexity of the universe into a set of elements that we can combine in order to reason about any subject in an agile way is becoming the cornerstone of artificial intelligence today.</p><p>It is, in many ways, thanks to <strong>LLMs (large language models)</strong> that AI has advanced so much in recent years.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*mvSIdybbQkJ5m8YUxralSQ.jpeg" /><figcaption>Still of Hamelin 77 by Javier Ideami @ Ideami Studios / ideami.com</figcaption></figure><p>In the movie, Lara is a teacher who is passionate about language, about combining words to express a tapestry of feelings and meanings. 
This is reflected in her recitation of poems by José de Espronceda and Antonio Machado (well-known Spanish poets).</p><p>Many are surprised that, “only” by processing a large part of the language present on the internet, these AI systems are capable of interacting with us with such apparent mastery of the entire human communication spectrum, from the emotional aspects to even the humorous ones.</p><p>What makes this especially effective is that language has unexpectedly become a perfect bridge between current AI systems, which are kind of massive subconscious cooking pots (System 1), and our human reasoning processes (System 2). We are effectively <strong>using human abstractions (language) to guide</strong> the cooking processes of these massively scaled System 1 entities.</p><p>By using <a href="https://arxiv.org/abs/2201.11903">chain of thought prompting</a> as well as precise prompt engineering, we are like a master chef who guides the combinations and recombinations of the ingredients that are present in these trained AI systems, gradually pushing that cooking process in the direction of our intended goal.</p><h3><strong>Hamelin 77: the Future</strong></h3><p>Between hope and tension, the movie ends with a sentence that hints at what’s to come.</p><p>It connects with the question that more and more people are asking themselves. As AI evolves in ever faster ways,<strong> what will the world look like in a decade or two?</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*DeGib38qMgpDvvgDQpxz-A.jpeg" /><figcaption>The path ahead. 
Still of Hamelin 77 by Javier Ideami @ Ideami Studios / ideami.com</figcaption></figure><p>The answer may depend a lot on those abstractions we mentioned earlier, System 1 and System 2 (per <a href="https://www.amazon.com/Thinking-Fast-Slow-Daniel-Kahneman/dp/0374533555">Daniel Kahneman</a>), the two ways in which we process information in our brain: the fast and subconscious (System 1) and the slow, systematic, logical and conscious (System 2).</p><p>Summarizing some of our earlier explorations, current AI systems properly implement only System 1 capabilities. However, by involving language in such an integral way at both the training stage and the inference (prompting) phase (language being intrinsically linked to System 2 capabilities in humans), they are capable of performing behaviors that resemble System 2.</p><p>Chain of thought prompting (or prompt programming), as mentioned earlier, is a way of hacking these models to mimic System 2 behaviors. By using it, these systems can implement algorithms that resemble reasoning, because we are literally explaining to them, step by step, how such reasoning should take place. 
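To make the “explaining step by step” idea concrete, here is a toy chain-of-thought prompt assembled in Python; the prompt text is hypothetical and not taken from any specific system:

```python
# Toy chain-of-thought prompt (hypothetical text, for illustration only).
# We spell out the reasoning for one worked example so that the model
# imitates the step-by-step pattern when answering the new question.
worked_example = (
    "Q: A farmer has 15 sheep and buys 8 more. How many sheep are there now?\n"
    "A: Start with 15 sheep. Buying 8 more gives 15 + 8 = 23. The answer is 23.\n"
)
new_question = "Q: A library holds 42 books and receives 19 more. How many books are there now?\nA:"

# The final prompt is simply the worked example followed by the new question.
prompt = worked_example + "\n" + new_question
print(prompt)
```

Without the worked example, a model is nudged to answer directly; with it, the model tends to reproduce the intermediate reasoning steps first, which is exactly the System 2-style scaffolding being described here.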
This is a clever hack, because these systems have no agency and are not able to find these algorithms on their own.</p><p>At the same time, it is <strong>not hard to get these systems to make very </strong><a href="https://github.com/giuven95/chatgpt-failures"><strong>obvious and silly mistakes</strong></a>, which can give away their true nature, and remind us that, despite appearances, these systems are still pretty far from implementing true System 2 capabilities.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Ji67x44QZ2fvo5h_dLJHDg.jpeg" /><figcaption>Still of Hamelin 77 by Javier Ideami @ Ideami Studios / ideami.com</figcaption></figure><p>This is all part of the exciting debate about AGI (artificial general intelligence) and about how far AI systems can go in the next decade and beyond, by scaling the current paradigms or by looking for new ones.</p><p>To reflect on how we may get there, Yann LeCun’s paper “<a href="https://openreview.net/pdf?id=BZ5a1r-kVsf">A Path Towards Autonomous Machine Intelligence</a>” is a great read.</p><h3>Pipers of Hamelin all the way down</h3><p>Finally, if we step back and look at life from afar, we may realize that we are part of a <strong>chain of many pipers of Hamelin</strong>.</p><p>As we go through our lives, we control different entities and processes, and we are also controlled by others.</p><p>Our own bodies are chains of pipers of Hamelin at different scales. And a good life happens when there is a decent balance in terms of our position within that chain.</p><p>Professor <a href="https://en.wikipedia.org/wiki/Michael_Levin_(biologist)"><strong>Michael Levin</strong></a> and his team have published extensive research about how our cells behave. Each of our cells has its own local agenda and a certain control over its immediate environment.</p><p>At the same time, groups of cells behave according to different top-down goals and are effectively controlled by those goals. 
When these two poles balance each other, the organism functions correctly.</p><p>However, when one of these cells escapes this balance and prioritizes its own immediate goals and local control above everything else, cancer happens.</p><p>In the same way, for AI and humans to coexist successfully, we must reach a good balance between giving them enough autonomy and preserving our own supervision and control of their systems.</p><h3>The most important chess match in history</h3><p>Back to the twisted chess board analogy, whose rules keep changing as we play.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*hwvDOn9fof3riIaOOl47cw.jpeg" /><figcaption>Still of Hamelin 77 by Javier Ideami @ Ideami Studios / ideami.com</figcaption></figure><p>The last few moves in this uncertain match have triggered a number of initiatives and terms such as Responsible AI, AI ethics, <a href="https://arxiv.org/abs/2212.08073">constitutional AI</a> and others.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*FbMZuedUyz_OsE2ashknKw.jpeg" /><figcaption>Still of Hamelin 77 by Javier Ideami @ Ideami Studios / ideami.com</figcaption></figure><p>This is a match that we cannot afford to lose. A match where the best result we can expect, and the one we should pursue, is a <strong>draw</strong>.</p><p>A match where collaboration and cooperation should be the ongoing and ever-present goal.</p><p>And a match that may only finish when, in a few decades perhaps, humans and AI sort of merge with each other. Companies like <a href="https://neuralink.com/"><strong>Neuralink</strong></a> are exploring the first stages to get there.</p><p>In the meantime, it is a certainty that AI systems will become powerful pipers of Hamelin. It is our responsibility and key mission to keep such pipers connected to a healthy chain of control mechanisms so that the upcoming AI-human organism can function well.</p><p>Hamelin 77 will be released in the coming weeks. 
For more information, stay tuned through the <a href="https://www.youtube.com/watch?v=KoXxrfDr3FY">YouTube</a> and <a href="https://vimeo.com/802022227/511bd87bf2">Vimeo</a> pages, as well as its <a href="https://www.imdb.com/title/tt23333832/">IMDb</a> page.</p><p>This project was made possible with the help of the following <strong>sponsors and partners:</strong> Mobile World Capital Barcelona through the Digital Future Society initiative, Programamos.es, Aerovisuales (joanlesan.com), Tejera Studio, Ideami Studios.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*DHhaKMtRUw3BiPXkYzB4AQ.jpeg" /><figcaption>Still of Hamelin 77 by Javier Ideami @ Ideami Studios / ideami.com</figcaption></figure><hr><p><a href="https://ai.plainenglish.io/prompt-engineering-future-chatgpt-8ce66ae4c322">Every prompt matters</a> was originally published in <a href="https://ai.plainenglish.io">Artificial Intelligence in Plain English</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Towards a sustainable generative AI revolution]]></title>
            <link>https://medium.com/data-science/towards-sustainable-generative-ai-revolution-a9786de586cb?source=rss-7f7b5d730c84------2</link>
            <guid isPermaLink="false">https://medium.com/p/a9786de586cb</guid>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[generative-art]]></category>
            <category><![CDATA[deep-dives]]></category>
            <category><![CDATA[editors-pick]]></category>
            <category><![CDATA[machine-learning]]></category>
            <dc:creator><![CDATA[Javier Ideami]]></dc:creator>
            <pubDate>Fri, 21 Oct 2022 17:19:48 GMT</pubDate>
            <atom:updated>2022-10-27T06:38:26.961Z</atom:updated>
            <content:encoded><![CDATA[<h3>Towards a Sustainable Generative AI Revolution</h3><h4>Facing the growing pains: how to steer the wild new age of the super subconscious</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*8NK9bm78uKZ-mMRIk8Yscg.jpeg" /><figcaption>Images generated by Generative AI</figcaption></figure><p>Humanity’s creative muscles are being stretched by the unstoppable generative AI revolution. By using text and other kinds of prompts, people use this technology to generate stunning images, videos, 3D shapes, VR environments and more. And yet, <strong>growing pains</strong> are starting to appear in relation to various matters, from the rights of living artists to the presence of AI generations within art competitions, art platforms, stock libraries and the like.</p><p>I am the cofounder of one of the first generative AI platforms that were launched at the start of this revolution (<a href="https://geniverse.co"><strong>Geniverse</strong></a>). I have also been a multidisciplinary artist for a long time.</p><p>As somebody that is very active in both fields (generative AI and the arts), I intend to reflect upon many of the angles and perspectives involved in these matters.</p><p>First, though, we will take a fun journey together in order to <strong>review the very essence </strong>of this exciting technology from first principles, connecting it all with human creativity and the minds of creatives and artists.</p><p>And then, we will <strong>explore the good, the tricky, and the elephant in the room </strong>of the current state of this revolution. 
Finally, I will reflect on how we may all contribute to moving towards a more sustainable scenario beyond these fast-paced initial stages.</p><p>Buckle up, as in this article we are going to go from metaphors about AI to latent spaces, the mind of an artist, smart generative environments and other future scenarios, the rights of creatives, the Content Authenticity Initiative (CAI) standard and way more. Let’s begin.</p><h3>Coming home</h3><p>Let’s use a simple metaphor to explore what the generative AI revolution is bringing to the table and what it all implies in relation to creatives, artists, and all of humanity.</p><p>Once upon a time, you fell into the ocean of life. This is quite a vast ocean, an ocean of information.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*3wJimG6qlhRtayy01FSQMA.jpeg" /><figcaption>Graphic by Javier ideami | ideami.com</figcaption></figure><p>Let’s imagine that you are made of <strong>two perspectives</strong> or parts: your <strong>subconscious</strong> and your <strong>conscious</strong> one. And let’s represent your subconscious as if it were a kitchen pot, floating on that ocean of information.</p><p>Your first priority on this ocean is to survive and, hopefully, thrive. For that, you need information. 
So you want to bring enough quality ingredients into your pot, and combine and recombine those ingredients in order to generate knowledge and ideas that help you reach your goals.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*5ys4x46T4Z_geXxSdiyPvg.jpeg" /><figcaption>Graphic by Javier ideami | ideami.com</figcaption></figure><p>Above your subconscious pot, there is a diffuse, mysterious, shining sphere that represents your consciousness (of which we still know so little).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*cdt_8mch5kvTTxnztLJwRA.jpeg" /><figcaption>Graphic by Javier ideami | ideami.com</figcaption></figure><p>And so, there you are. Floating on the ocean of life, with your mysterious consciousness sometimes providing a direction for the cooking process that takes place within your subconscious pot.</p><p>All the while, that subconscious is constantly combining, mixing and remixing all sorts of ingredients (information) that reach it through our senses.</p><p>And sometimes, those combinations may become the seeds of new ideas. Metaphorically speaking, we may imagine fragile, subtle bubbles that emerge from that cooking process, ascending from the subconscious to the conscious. And, if we have space in our minds, if they are not full of noise, we may then perceive those fragile bubbles, and: Eureka! An idea!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*xmEhZBeqUGXRDC7mF2UusQ.jpeg" /><figcaption>Graphic by Javier ideami | ideami.com</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*uzlPKHxPvx_Yf4aQqK7fmA.jpeg" /><figcaption>Graphic by Javier ideami | ideami.com</figcaption></figure><p>But there is an issue here. There is too much information on this ocean, too much complexity. And our subconscious pot has a limited size. It is not rigid. It is somewhat flexible, malleable up to a point. 
But its size is still limited.</p><p>So nature evolved a mechanism to solve this issue of dealing with the tremendous complexity of the ocean of life: <strong>compression and decompression processes</strong>.</p><p>Our brain is able to take the information that arrives through our senses and compress it into a form that has less detail and more abstraction.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Vnd-cF86fzenSBylvYCEvw.jpeg" /><figcaption>Compression funnels | Graphic by Javier ideami | ideami.com</figcaption></figure><p>Let’s begin to visualize this very important axis, the <strong>detail-abstraction axis</strong>. When we compress the complexity of life, we go from high detail (and a higher dimensional space) to high abstraction (within a lower dimensional space).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*aj_5VFNcfrtDE6EE1zOmWA.jpeg" /><figcaption>Graphic by Javier ideami | ideami.com</figcaption></figure><p>And so, within our subconscious pots, we gather these compressed representations of the complexity of the world, in what we sometimes call <strong>latent spaces</strong>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*_jdyfU4ZLoVQ2CaY4GdxaA.jpeg" /><figcaption>Graphic by Javier ideami | ideami.com</figcaption></figure><p>These latent spaces hold the <strong>abstract essence</strong> of different information domains. We get rid of uninformative details and we preserve a number of reduced dimensions, each of which documents relevant and useful factors related to whichever information domain the data belongs to.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*qRBw3nMl6kN0dKmlw4Cg7g.jpeg" /><figcaption>Graphic by Javier ideami | ideami.com</figcaption></figure><p>Our brain can do the opposite process as well. 
It can perform <strong>decompression</strong> and go from high abstraction to high detail.</p><p>“Visualize an elephant!” We hear those words and the image of an elephant pops into our minds. We just ran the opposite process, decompressing that high abstraction representation (<em>elephant</em>) into a highly detailed visualization in our minds.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*fDrXuScchtuVVSYbwemgmg.jpeg" /><figcaption>Graphic by Javier ideami | ideami.com</figcaption></figure><p>The processes we just explored are very similar to the ones happening within AI networks. We train AI networks to learn to compress high dimensional domains (like the domain of natural images) into latent spaces that preserve the abstract essence of those domains within a much smaller number of dimensions.</p><p>And we also train them to decompress any point within those latent spaces into a corresponding high dimensional representation that belongs to the original information domain.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Z322_N7u7qRdaEKWqi7dkA.jpeg" /><figcaption>Graphic by Javier ideami | ideami.com</figcaption></figure><p>When we explore complex generative AI systems, from <strong>DALLE-2</strong> (OpenAI) to <strong>Imagen</strong> (Google), <strong>Stable Diffusion</strong> (Stability.ai) and beyond, we find different intermediate stages, which, for example, may translate between modalities, perform diffusion processes, scale inputs and outputs, etc.; but the base common to all those systems is this pair of compression and decompression processes that allows us to move bidirectionally between high detail and high abstraction.</p><p>The specifics of AI systems depend on the objective we have. We may want to upscale images, sharpen them, generate brand new images conditioned on text prompts, do several of those things together, or something entirely different. 
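As a minimal sketch of this shared compress-decompress loop (standard library only, with made-up numbers; real systems use learned neural encoders and decoders rather than the hand-picked factors below):

```python
# Toy compression/decompression: summarize an 8-sample "signal" with just
# two latent factors (mean level and linear trend), then rebuild an
# approximation of the original from those two factors alone.
signal = [2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5]  # high-detail representation
n = len(signal)

# Compression: 8 dimensions -> 2 latent dimensions
mean = sum(signal) / n                      # overall level
slope = (signal[-1] - signal[0]) / (n - 1)  # overall trend
latent = (mean, slope)

# Decompression: 2 latent dimensions -> 8 dimensions again
center = (n - 1) / 2
reconstruction = [mean + slope * (i - center) for i in range(n)]

print(latent)          # (3.75, 0.5)
print(reconstruction)  # matches the original here, since the signal is purely linear
```

Because this particular signal is perfectly linear, two factors reconstruct it exactly; for richer data the reconstruction is only approximate, which is precisely the trade-off between detail and abstraction being described.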
That will determine what sort of training objective and dataset we use, as well as the precise details of the different parts of the final architecture.</p><p>The key strategy used by the leading generative AI systems nowadays is based on what we call <strong>diffusion</strong>. The Stable Diffusion system, for example, uses a <a href="https://en.wikipedia.org/wiki/U-Net"><strong>U-Net</strong></a>-like architecture that has been trained (with a large dataset) to predict the noise that has been added to an image.</p><p>Once trained, the same network is able to go from different combinations of image+noise (including complete random noise) back to a high quality image in a number of steps.</p><p>It can also go from one image to another, by adding some noise to the initial image and then performing the same process as before.</p><p>In order for these generations to move in the right direction, they are <strong>conditioned</strong> on the compressed representation of the text prompt we entered (which is injected into different parts of the U-Net architecture).</p><p>Enough with the technical details. Let’s continue.</p><h3>AI is coming home</h3><p>And so, with the generative AI revolution, we are getting <strong>closer to our essence</strong> as beings capable of performing the complementary processes of <strong>convergence and divergence</strong> (compression and decompression), expressed through our analytical and creative muscles.</p><p>After a decade in which we gradually expanded and evolved the convergence capabilities of deep learning AI systems (able to predict, recommend, classify, identify, etc.), the generative AI revolution completes the loop by adding superhuman divergence capabilities (able to create and generate). 
AI is coming home.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*oxyqVrAj1zN_ru03a8BqsQ.jpeg" /><figcaption>AI is coming home | graphic by ideami.com</figcaption></figure><h3>The magic of latent spaces</h3><p>But what do we really mean when we talk about latent spaces or abstract compressed representations? We find the answer within ourselves, through a very simple example.</p><p>I take a walk in nature. When I return, my friend asks me how the walk went. I say: “Wonderful, I saw a beautiful cicada!” And she asks me: “What did the cicada look like?”</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*UBNvnh0SQPUBjZGJTejvog.jpeg" /><figcaption>Photo by Bill Nino on Unsplash</figcaption></figure><p>At that point, I visualize the cicada in my mind. Let’s pretend that my visualization is expressed in a grid of 1000 x 1000 points of light. That is a 1-million-dimensional space. If the points have color, then each of them will have a red, green and blue component (3 times more dimensions).</p><p>So I could start describing the cicada to my friend by saying: “Well, the first point of light at the top left of my visualization has 15 of red intensity, 25 of green intensity and 77 of blue intensity. The next point to the right of it has 145 of red intensity, 55 of green intensity… and so on”. And I could keep going like that through the 1 million points of light. The problems with this approach are obvious.</p><p>It may take me a month to describe the cicada and by that time my friend will be long gone.<strong> Zero efficiency.</strong> But the main issue is not even that one.</p><p>To know that one of those million points has 155 of red intensity is just not very useful. The <strong>fine detail</strong> often <strong>doesn’t provide relevant information</strong>. 
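The scale of that inefficiency is easy to check with back-of-the-envelope arithmetic (the 50-factor latent size below is an assumed illustration):

```python
# Dimensionality of the point-by-point description vs. a compressed one.
width, height, channels = 1000, 1000, 3    # 1000 x 1000 grid, RGB components
raw_dims = width * height * channels       # one number per color component
latent_dims = 50                           # a few dozen descriptive factors
compression_ratio = raw_dims // latent_dims

print(raw_dims)           # 3000000 raw numbers to recite
print(compression_ratio)  # each latent factor stands in for 60000 raw values
```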
That’s why I will do something different.</p><p>I will <strong>compress </strong>all that <strong>complexity</strong> and richness of the details of the cicada into just a few dimensions, 30, 50, 100 factors (a small number anyway) that explain the essence of what I saw.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*211FUwbM2pFbAXS9j004GA.jpeg" /><figcaption>Photo by Saryu Mae (<a href="https://inaturalist.nz/taxa/342384-Kikihia-ochrina">https://inaturalist.nz/taxa/342384-Kikihia-ochrina</a> | CC BY: <a href="https://creativecommons.org/licenses/by/4.0/">https://creativecommons.org/licenses/by/4.0/</a>) | Graphic by Javier ideami | ideami.com</figcaption></figure><p>And I will tell my friend: look, it had a broad head, a stout green body and clear membraned wings. 4 wings, and the wings had these kinds of patterns. And it had large compound eyes, it had this number of eyes, and six legs, and the legs were like this, etc. I have compressed the high detail representation onto a small number of dimensions that communicate important and relevant information.</p><p>And now, my friend hears this and she does the opposite process, <strong>decompression</strong>.</p><p>She transforms these few compressed dimensions that express the essence of what I saw and inflates them to visualize in her mind the high detail representation<strong> </strong>that would correspond to that essence, the image of a cicada (which will differ from the one I visualized, because of the compression-decompression process as well as other differences between the systems involved and the previous knowledge each of us held in relation to the relevant scenario).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*B8kfi4ho1NT0c8z331dItA.jpeg" /><figcaption>Photo by Saryu Mae (<a href="https://inaturalist.nz/taxa/342384-Kikihia-ochrina">https://inaturalist.nz/taxa/342384-Kikihia-ochrina</a> | CC BY: <a 
href="https://creativecommons.org/licenses/by/4.0/">https://creativecommons.org/licenses/by/4.0/</a>) | Modified with variations created with generative AI | Graphic by Javier ideami | ideami.com</figcaption></figure><p>And so, in a way, every time we recall something, we are kind of rebuilding it, reimagining it, recreating it from that essence that we stored (the precision of that process depends a lot on the richness of the relevant latent space as well as the number of sensory modalities involved in its creation, among other factors).</p><p>The following is an <a href="https://github.com/javismiles/dalle2-inference"><strong>infographic</strong></a> I created months ago about <strong>how DALLE-2 works</strong>, comparing its processes with what goes on in the human brain.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*KZiOeOr8fZcSYJWMaB5CDA.jpeg" /><figcaption>Graphic by Javier ideami | ideami.com | Download a super high resolution version of the infographic at <a href="https://github.com/javismiles/dalle2-inference">https://github.com/javismiles/dalle2-inference</a></figcaption></figure><h3>Small Pot, Giant Pot</h3><p>There are many differences between what goes on within our brains and within these AI networks, but one difference that is especially relevant to this article is the size of that subconscious pot, metaphorically speaking.</p><p>Our subconscious pot is fed by the experiences we have in life. When we talk to people, when we experience the world, we enrich its contents. Eventually, its cooking processes generate within our minds new ideas, visualizations, sounds, and more.</p><p>AI networks are fed (at training time) by giant datasets. The datasets used by generative AI systems are made of information collected from all around the internet. 
We are talking about massive amounts of data.</p><p>So, on one side, we have humans, with our little subconscious pots.</p><p>On the other side, we have these giant AI pots, fed with data from all around the internet. Some of that data is in the public domain. But not all. And we will discuss what that means and implies a bit later.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*tn0k66yfkMO_4Z3PHRj3Xw.jpeg" /><figcaption>Graphic by Javier ideami | ideami.com</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*XBF22RZRbZYRO86ghTqDbA.jpeg" /><figcaption>Graphic by Javier ideami | ideami.com</figcaption></figure><h3>The depth elevator</h3><p>It’s time to connect all the previous sections with art and human artists. Now, defining what makes an artist is an impossible task. Instead, I will focus on <strong>exploring something</strong> that has been <strong>common to many of the great creatives</strong> in history.</p><p>Remember that <strong>axis</strong> (<strong>detail to abstraction</strong>) that I was discussing above? In a book I published years ago, I wrote about another metaphor I came up with, which I call “The depth elevator”.</p><p>Imagine a vertical line with an elevator moving through it. At the bottom of the line, we have the high dimensional and high detail realm. This is where the complexity of the ocean of life is fully expressed.</p><p>At the top of the line, we have the realm of the compressed low dimensional latent spaces that preserve the abstract essence of the lower realms (here lives our language, for example).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*9_pgL6i3EDPcVB062efZUQ.jpeg" /><figcaption>Photo by Owen Cannon on Unsplash.com. Text by Javier ideami | ideami.com</figcaption></figure><p>Artists are masters at navigating this depth elevator in an agile, flexible and dynamic way. 
Let’s go deeper into this.</p><p>When we are little babies and later <strong>kids</strong>, we spend most of our time at the bottom of the depth elevator, interacting with the richness and detail of the universe. Our analytical mental modules are still not fully developed. It is our <strong>exploration</strong> phase.</p><p>Most <strong>adults</strong>, instead, tend to focus on efficiency by reusing the mental patterns already established within their minds (which also helps us avoid wasting our precious fuel, the glucose that powers our brains). It is our <strong>exploitation</strong> phase. As such, adults spend a lot of their time in the narrow ivory towers at the top of the depth elevator.</p><p>Achieving a good <strong>balance</strong> between time spent at both halves of the depth elevator is a <strong>healthy</strong> goal. A good balance between <strong>convergence</strong> and <strong>divergence</strong>, between compression and decompression, between abstraction and detail.</p><p>A lack of balance between those poles (in whichever direction) produces different kinds of issues in adults. I have written extensively about those matters, but that is not the topic of this article. Let’s get back to the artists.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ZzEZSNLp8aM2g093f835lQ.jpeg" /><figcaption>Repetitive mental strain injury | Graphic by ideami.com</figcaption></figure><p>Many great artists have one thing in common. They are able to navigate this depth elevator in an agile and flexible way. They are able to go down into the depths at the bottom of the elevator, where the richness of the universe awaits.</p><p>And, crucial point, they don’t just dip their toes and leave. 
Instead, they are able to spend long periods of time down below, exploring those muddy, wild and uncertain waters.</p><p>They are also able to crystallize that richness into different interpretations and representations, which may express themselves at different levels throughout that axis that goes from detail to abstraction.</p><p>And the representations themselves, or their explanations and the way they are communicated, may also be located much closer to the top of the depth elevator.</p><p>All this is in contrast to the typical adult, who spends most of the time at the top or close to the top of the elevator. And you can guess why.</p><p>Because being in the ivory tower of abstraction, at the top of the elevator, is way more comfortable (and requires less fuel) than navigating the muddy bottom of that axis, which contains the complex details of the universe (metaphorically speaking, we could also say that it is way more comfortable than dirtying our hands exploring the wild playground down below, at the bottom of the elevator).</p><p>Here we arrive at another crucial point. <strong>Navigating that depth elevator</strong> in the way that many of the greatest artists in history have been able to do <strong>requires effort</strong>. It requires <strong>time</strong>. <strong>Perseverance</strong>. 
And, in a way, it requires going against the natural predisposition of our adult mind to be efficient and avoid wasting our precious fuel.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*jgAOB0XOmeFQxsNje1Ndbw.jpeg" /><figcaption>Requirements for a productive and creative cooking process | by Javier ideami | ideami.com</figcaption></figure><p>In that regard, it is relevant to point out that a number of platforms are currently banning <strong>generative AI art</strong> (or putting it in a separate category or area) because they consider it <strong>“low effort”</strong> art.</p><p>Yes, it takes some effort to find the right prompt to guide generative AI architectures. But the effort and time required by that process cannot be compared to the years and sometimes decades that it takes to master the process described earlier. We will go deeper into this point and other related ones a bit later in this article. At that time, we will also reflect on potential solutions to such conundrums.</p><p>So, by exercising this flexible navigation through the depth elevator, great artists and creatives are able to express the richness of the universe in novel ways.</p><p>Pick anything in life, say, wood. You may experience wood in a very detached, abstract way. Or you may explore all the intricacies of wood at a very deep and detailed level. 
If you are able to flexibly move between both poles, you are in a much better position to create something novel and different related to that element of the universe.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*tMIZZrAXgs8Fi6A6h-xGSw.jpeg" /><figcaption>Graphic by Javier ideami | ideami.com</figcaption></figure><p>Great creatives are also able to understand different ways of interconnecting various areas of that vast ocean located at the bottom of the axis, across the various layers of those waters and also through the top layers of the depth elevator.</p><p>When, for example, a great creative experiences rhythm, she can go beyond disciplines, techniques, tools and flashy terms. A great creative sees and feels rhythm everywhere. In the lights and the shadows projected by a curtain, in the sound and movement of falling tears, in the dance of the stars, the gaps between our thoughts, and beyond.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*RAZT21BLb5Ar6T1llEEv9A.jpeg" /><figcaption>Graphic by Javier ideami | ideami.com</figcaption></figure><p>Throughout years and decades, great creatives expand and consolidate the latent spaces of their subconscious pots.</p><p>They also refine the way they navigate their depth elevators, which allows them to connect detail with abstraction in powerful ways that enrich their creative processes.</p><p>In addition, artists and creatives often collaborate with others. 
By doing so, different subconscious pots may enrich each other.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*n2EE6E7l53FHQiAp-iQIIw.jpeg" /><figcaption>Graphic by Javier ideami | ideami.com</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*CZQoPAMmgbugojacygiM-A.jpeg" /><figcaption>Graphic by Javier ideami | ideami.com</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*uQ7Ted2zGjN3ZQUj8Bu6kQ.jpeg" /><figcaption>Graphic by Javier ideami | ideami.com</figcaption></figure><p>So, if you study some of the greatest creatives and artists in history, you will see that they all had something to say, a message, a vision. And that such a vision, and the way they expressed it, was inextricably connected with their capacity, cultivated over decades, to navigate these depth elevators in fluid ways, exploring both the depths of the richness of the universe and the ivory towers of abstraction, as well as many of the realms in between.</p><p>Finally, regarding those depth elevators, the next step would be to visualize them not as isolated entities, but as multiple funnels that are interconnected with each other within multidimensional spaces.</p><p>The following image of an <strong>origami</strong> seeks to represent a small fragment of that extension of the metaphor.</p><p>It is time, though, to stop the elevator and move on, in order to focus on a review of the current status of the generative AI revolution, as well as ways to address its current growing pains.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*_OMHY3RbIIG8eq4B8RvykQ.jpeg" /><figcaption>The depth elevator origami | By Javier ideami | ideami.com</figcaption></figure><p>So, using what we have explored above, let’s consider the situation today and in the coming future, and what could be done about it all.</p><h3>The good, the tricky, and the elephant in the 
room</h3><p>Let’s explore a number of consequences derived from this initial phase of the generative AI revolution.</p><h3>The good</h3><ul><li>Generative AI won’t replace human creativity. It <strong>will enhance</strong> it.</li><li>This technology <strong>demystifies creativity</strong>. Think of what Edison said: Genius is 99% perspiration (combination, recombination, productive work and experimentation) and 1% inspiration (establishing the seeds, polishing, etc). Thanks to this new technology, we now realize that we can automate a large percentage of the creative process, a part that takes place subconsciously in our minds.</li><li>Studies about human decision making show that we make more than 30,000 decisions each day. But we are only aware of around 0.26% of them (e.g. <a href="https://pressreleases.responsesource.com/news/94586/new-research-highlights-the-unlocked-potential-of-the-human-brain/">research by Huawei</a>). Far more of our lives than we may think takes place <strong>subconsciously</strong>. By automating our subconscious cooking processes with AI technology, we can positively impact a large part of our existence.</li><li>In fact, I call this new era “<strong>The age of the super subconscious</strong>”.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*gpYrtW9Rcwh0sWwEcIIQeg.jpeg" /><figcaption>The age of the super subconscious | Graphic by ideami.com</figcaption></figure><ul><li>Think of this technology as a series of different <strong>iron man suits</strong> that will <strong>amplify </strong>your subconscious pot and <strong>empower </strong>your creative muscles.</li><li>Different iron man suits will have different styles, traits, and personalities.</li><li><strong>Prompt engineers</strong> are people who will become experts at getting the best results out of these iron man suits. 
They will know the ins and outs, the strengths and weaknesses of each.</li><li>They will also be masters at using their human experience and intuition when interacting with these powerful amplifiers in order to achieve the desired result.</li><li>As such, these <strong>prompt specialists</strong> will be highly regarded in coming years. Their role will become a prestigious one in the job market. And we will witness a large number of courses, publications and systems that will educate and help people train this skill.</li><li>Today, our <strong>prompts </strong>are natural language and images. But thanks to multimodal architectures, prompts will soon be any kind of data we want to use to guide these architectures (different systems will be designed to absorb different kinds of guiding inputs).</li><li>The initial text to image phase has now transitioned to text to video and text to 3D capabilities. Eventually, we will be able to output all kinds of data with custom systems that will target the needs of specific verticals.</li><li>Next, we will witness <strong>multimodal output</strong> capabilities, which will eventually allow us to produce, for example, full movies that will include visuals, dialogues, music, and more.</li><li>This technology will inspire <strong>new forms of art</strong> that we cannot yet imagine. Multimodal generative AI is poised to trigger the emergence of novel ways of combining explored and unexplored areas of the depth elevator, which in time may become highly regarded new forms of artistic expression.</li><li>Generative AI will impact a very large number of sectors. 
It will be used to expand scientific datasets with synthetic generations, revolutionize brainstorming processes, personalize branding in ways unimaginable only months ago, accelerate the rise of real time dynamic “only for you” marketing and advertising, and usher presentations of all kinds into a new era by surrounding them with media that matches their content in impressive ways, among many other examples. From stock libraries to design boutiques, whole swaths of the media landscape will rush and compete to incorporate this technology.</li><li>Cutting edge tech like <strong>VR &amp; AR</strong> (and in general all forms of <strong>XR</strong>) will incorporate this technology (<a href="https://twitter.com/ScottieFoxTTV">experiments </a>are already ongoing) and eventually we will witness real time generation of immersive spaces that regenerate in smart ways by tracking the gaze of the user (it’s interesting to consider the connections between these experiments and the <a href="https://www.edge.org/conversation/donald_d_hoffman-realism-is-false">theories of Donald Hoffman</a>).</li><li>This technology will also <strong>accelerate </strong>the <strong>exploration </strong>and experimentation phase of many creative processes. From concept design to product design, character design and prototyping stages across a wide range of fields, generative AI will allow us to do more in less time, to try all sorts of new directions and to go deeper into our explorations of every level of the depth elevator.</li><li>The so-called “<strong>metaverse</strong>” is for many still a utopia, and a decent implementation of it appears to be pretty far in the future. 
If the metaverse is ever to become a useful reality, it will probably happen on the shoulders of generative AI technology, which may be the key to accelerating its implementation.</li><li>Further in the future, we will witness the rise of <strong>smart generative environments (SGE)</strong>, which will mutate according to our needs or emotional state. Houses, event venues and other environments will begin to resemble organic living entities by matching the intent and emotions of their occupants. They will do so in multimodal ways. Eventually, we will be able to converse with those environments and they will become a key support of our mental balance and health.</li><li>The combination of generative AI with ever more powerful sensing models capable of interpreting every subtle nuance of our expressions and behaviors will allow us to produce real time multimodal interpretations of our emotional and mental state. When combined with new iterations of brain-wave reading tech (EEG, MEG, etc), this will usher in a new kind of creative expression that will literally use our most intimate sphere as a brush to produce extraordinary renditions of the human condition.</li><li>Although some jobs are and will be in danger, it is also highly likely that <strong>new roles</strong> that we cannot yet imagine will emerge from the need to manage and interact with this technology.</li><li>At the same time, many of the impacted jobs and roles will survive and even thrive by embracing this new age and adapting their processes to what this new technology offers.</li><li>A good number of people who may not be professional artists, but who have a natural predisposition to exercise their creative muscles, will thrive with this new technology. They will strengthen those muscles in faster and easier ways, and they will enjoy new opportunities to augment and amplify their creative potential.</li><li>And we end this section as we began. 
Reminding us all that generative AI won’t replace human creativity; it will enhance it. And that exercising our creative muscles will continue to be a highly recommended activity. Achieving a good balance between our capacity to diverge and converge, compress and decompress, will remain vital for our mental and spiritual health for the foreseeable future.</li></ul><h3>The tricky</h3><ul><li>We, <strong>humans</strong>, have a limited and relatively <strong>small subconscious</strong> pot. Generative <strong>AI </strong>systems are trained to hold <strong>massive </strong>pots that encompass a large part of the knowledge of the internet.</li><li>Because of that, it doesn’t seem fair, nor morally correct, that human creatives should have to compete with generative AI systems.</li><li>When machines overcame humans at playing chess (a far less consequential event than this one), no one thought that it would be great fun to keep exploring human vs machine chess competitions (beyond the ones that demonstrated that we had lost the battle). We accepted that they were better. And then we went our separate ways.</li><li>Human chess players use <strong>AI </strong>to <strong>train </strong>themselves and <strong>become better</strong> (akin to the augmenting and amplifying capabilities of these metaphorical iron man suits provided by generative AI systems).</li><li>AI systems that play chess or Go produce, at times, really beautiful moves that would never occur to a human. They kind of have their own special perspective (based, of course, on a tremendous capacity to look ahead in time). And yet, very few people are interested in following machine vs machine competitions. Humans prefer to see other imperfect humans play.</li><li>The key thing, in any case, is that the two domains are kept separate. Machines help human chess players train and become better. And they may also play among themselves. 
Humans, separately, play in their own competitions.</li><li>I believe that eventually a similar thing may happen with generative AI (with a number of differences, of course, as these are very different domains).</li><li>Another tricky point to consider is a key factor that lies behind some of the current excitement with this technology. I will expand on this matter in the final section of this article. Let’s, for now, introduce it.</li><li><strong>Greg Rutkowski</strong> is, in the opinion of many people, one of the best fantasy art illustrators working today, if not the best. And his name appears in a massive number of the prompts used to produce some of the most impressive generative AI art in recent times.</li><li>So, after the dopamine rushes triggered by the production of amazing art that seems to have been painted by Greg Rutkowski subside, a lot of people are going to be left with hundreds or thousands of AI generated images or videos, and then they will ask themselves: “And now, what?”</li><li>“Nothing” will be the answer in most cases. Because most of those people were not really exercising their creative muscles in relation to any deep meaningful internal drive; they were using this technology like the person who buys a new iPhone, in a sort of compulsive way, following the shiny latest tech.</li><li>And when that compulsion dies down, they will feel sort of empty. Because most of what will be left behind is not theirs; it belongs to, among others, Greg Rutkowski and his style, crafted over decades of hard work (as an example, among many other living artists whose work powers these networks).</li><li>In any case, let’s be realistic. Things have been moving too fast and it makes sense that people need time to catch up. There may be many solutions to the current scenarios. 
And I will discuss some of those at the end of the following section.</li></ul><h3>The elephant in the room</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*zPOp0wDQBi078Y_wVr1d2Q.jpeg" /><figcaption>Images generated at the Geniverse (Generative AI)</figcaption></figure><ul><li>AI generative systems are only possible because of the <strong>giant datasets</strong> that are used to train them.</li><li>AI generative architectures are trained with massive datasets composed of images, videos, text and soon other kinds of data.</li><li>This data is typically <strong>extracted from the internet</strong> by the groups that create these datasets.</li><li><strong>Some</strong> of the data used in these datasets is <strong>public domain data</strong>. It seems fair to use such data for the creation of these datasets.</li><li>But a good part of the data used in these datasets belongs to living artists who have not declared it to be public domain data. These are artists who make their living by selling such data, that is, the decades of hard work that have produced a specific style and a series of works.</li><li>These <strong>artists</strong> are, indeed, <strong>the foundation</strong> on which this revolution supports its meteoric rise.</li><li>And so, an increasing chorus of living artists is complaining about this. Some of them state that the works of living artists should not be included in these datasets. According to some, their complaints have been falling on deaf ears, mostly ignored (at least so far).</li><li>If we ignore the complaints of these living artists, we are ignoring ourselves. For today, we are discussing visual art, but tomorrow, it may be music, novels, legal writings, or whatever our occupation or field may be.</li><li>Let’s again contemplate it all from the perspective of what many people experience when using these systems. 
On whose shoulders was built the dopamine rush that a person may feel when they produce a stunning digital artwork which resembles Mr. Rutkowski’s style and body of work so incredibly well? On Mr. Rutkowski’s, of course. More specifically, on the decades of extremely hard work and perseverance that Mr. Rutkowski applied and invested to create that style and body of work.</li><li>A style and a body of work that is now giving that person such an intense dopamine rush when they take some time to come up with a prompt, which includes Mr. Rutkowski’s name, and then click a button and with minimal effort produce a result that so closely resembles his art.</li><li>Some may say: “But it took me 50 hours to come up with the prompt”.</li><li>Whether that is an inflated number or not, it does not change the fact that there is no comparison between anybody’s exploration of language prompts for a few minutes or hours, and the decades of work invested by the likes of Mr. Rutkowski.</li><li>It also does not change the fact that Mr. Rutkowski never gave explicit permission for his artworks to be included in the datasets used by these AI architectures.</li><li><strong>Prompt engineering</strong> is art+science. And it will gradually become a prestigious skill and discipline.</li><li>There will be tons of books and courses on the matter. Great prompt engineers will know the ins and outs, strengths and weaknesses of many different AI architectures, while at the same time being capable of applying their human intuition to the generation of prompts that extract the best results from the human-machine interaction. Indeed.</li><li>But that is still not an excuse to trample on the rights of fellow human beings and living artists. 
And in the next and final part of this article, I will address in more detail what we could do about this and other related matters.</li><li>And let’s emphasize again the following: this revolution has moved so fast that it is<strong> understandable that people need time to catch up </strong>with it all. And that process of catching up and finding a more sustainable scenario is in its initial stages.</li><li>I will always support generative AI,<strong> </strong>but above all, I will support and defend my fellow creatives (because people and their lives should always matter more than technology). This is a matter of <strong>ethics and morals</strong> (legal aspects are not part of this article. Those will be addressed by others and I believe that ethics and morals should be the first compass in this matter).</li></ul><p>This is a wonderful revolution that will bring many benefits to humanity. But as we can see, there are also tricky sides to consider at these initial stages. Let’s discuss how we may address some of them.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Um-XKQWnMRZg6sokbEgU1g.jpeg" /><figcaption>Images generated at the Geniverse (Generative AI)</figcaption></figure><h3>Steering the revolution</h3><p>I will address this final section from a <strong>moral and ethical</strong> stance.</p><p>It is to be expected that, eventually, a number of bodies and groups will introduce different forms of regulation related to these systems and companies will also introduce their own safeguards and controls. But these, as well as other legal perspectives, will take time to be established.</p><p>Although<strong> </strong>maybe not as much time as we would expect. At the end of the next section, I will comment on the<strong> </strong><a href="https://contentauthenticity.org/"><strong>content authenticity initiative (CAI)</strong></a><strong>, </strong>an <strong>open standard, </strong>founded by <strong>Adobe. 
CAI</strong> has already been joined by hundreds of companies, some of which are already planning to implement it in their platforms.</p><p>This will allow them to <strong>track</strong> where <strong>digital content</strong> comes from, whether generative AI has been used to produce it, as well as other factors related to misinformation and the protection of the rights of creators.</p><p>Let’s now reflect on ways to make this revolution more sustainable.</p><p>A living artist who has spent decades developing a style and a body of work, the rights to which belong solely to that artist, should have a say and/or be compensated if that work is to be included in any of these massive generative AI datasets.</p><p>Otherwise, it is as if, for example, you display an artwork within a gallery, and somebody comes and grabs it, takes it away, and profits from it. There is something universally known as <strong>copyright</strong>, which did not disappear magically at the start of the generative AI revolution.</p><p>Some will cite the example of <strong>YouTube</strong>, saying that in its early stages, YouTube kind of hand-waved these issues, and that otherwise it would never have gotten off the ground. As we all know, nowadays and for a long time, YouTube has employed a very strict set of mechanisms for protecting copyright within its platform. The fact is that generative AI has already exploded in a massive way. So, an initial “what the heck” phase is understandable, but that phase is now behind us. Therefore, the moment to begin protecting the rights of creators, as YouTube and other similar platforms had to do, is right now.</p><p>Finally, we need to discuss a super important issue: the <strong>gray zones</strong>. To get there, let us quickly review a point that we raised in the previous sections.</p><p>It is not fair to have humans and machines competing in the same art competitions, art platforms and the like. 
Humans have small subconscious pots. AI systems have massive ones. Humans’ subconscious pots hold the limited experiences of their life, of one life. AI systems hold the knowledge of millions or billions of human beings. Let’s get real. It is not fair, and it is not moral, to have them compete with each other.</p><p>Instead, just as happened with chess, we can imagine separate sections in art competitions and in art platforms. Art made by humans. Art made by AI. And this is already happening on many platforms around the world. But this point takes us, finally, to the gray zones.</p><h3>The gray zones</h3><p>“Wait, this thing was not fully produced by AI, you see. I used AI to produce part of the work, yes, true, but then I polished it, I built on top of it, and well, therefore, it is legit, right?”</p><p>We are going to hear a lot of this. So it is crucial to address this kind of scenario.</p><p>A recent ruling by the <strong>US Copyright Office</strong> in relation to a request to register an AI generated artwork states that “<strong>human authorship is a prerequisite to copyright protection</strong> in the United States and that the Work therefore cannot be registered.” An extended discussion on this ruling can be found <a href="https://ipkitten.blogspot.com/2022/02/us-copyright-office-refuses-to-register.html">here</a>.</p><p>But again, what we are about to face (and it is already happening) are the <strong>gray zones</strong>. The in-betweens. And I believe that the <strong>answer</strong> to those lies in the <strong>public domain vs non-public domain</strong> discussion.</p><p>Because, in a way, everything has changed but nothing has changed at the same time. 
Here we go:</p><ul><li><strong>Before generative AI </strong>exploded, <strong>you could</strong> go to Google Search, find some public domain images, videos or whatever kind of data, and incorporate those into your creative process, and all was fair and good.</li><li><strong>Before generative AI</strong> exploded, <strong>you could not</strong> go to Google Search, find some non-public domain images, videos or whatever kind of data, from some living artist, and take them and incorporate them into your work without asking for permission (obviously when trying to profit from the resulting combination of their work and yours. We are not discussing here the cases in which you just use some online artwork to experiment by yourself, privately, without seeking to make any profit from it).</li></ul><p>Well, guess what: that’s the answer. Nothing new. The same criteria can continue to apply going forward.</p><ul><li><strong>As we navigate this generative AI revolution</strong>, it <strong>should be ok</strong> to use this technology when it is connected to datasets that use only public domain data (or data from living artists who have explicitly given their permission for their creations to be used within these datasets). We are again only referring to scenarios that seek to generate profit from the use of this technology.</li><li>It <strong>should not be ok</strong> to use this technology, fully or partially, when making use of datasets that contain non-public domain data, if you intend to use the result for any commercial purpose. You may experiment with it for your own personal use, like some people may do nowadays when they download an artwork from a famous living artist, but certainly not for commercial purposes.</li></ul><p>These are thoughts based on, I believe, common sense. But others may come up with novel ideas for compensating artists that may provide new avenues to solve this conundrum. 
And YouTube provides again a clue as to what some alternative ways to address these issues may look like (more about this below).</p><p>And so, art <strong>competitions</strong>, art <strong>platforms</strong>, stock image platforms and the like could ask participants to disclose:</p><ul><li>If they have used generative AI technology.</li><li>If so, which one they have used and which datasets power that technology.</li><li>If the datasets powering that technology contain only public domain data, then they may choose to open their doors to that work.</li><li>If the datasets involved also contain non public domain data, then they could decide to close their doors to those works, or to put them in a separate section.</li><li>People may lie, of course. So we will also witness the rise of <strong>automated systems</strong> capable of recognizing if part of your work matches parts of the creations of living artists whose copyright is protected.</li></ul><p>And this is exactly what platforms like YouTube are using today, for example, in relation to the music of the videos people upload. There will be plenty of false positives and the like, just as there are with the systems that YouTube uses nowadays. It is the price to pay to protect the rights of living creatives and artists.</p><p>Extending these mechanisms to account for all sorts of data, and data that is way more complex and high dimensional than audio, won’t be easy. 
But there are surely already people working on these matters.</p><p>If we look at YouTube again, we also see the variety of ways in which platforms could deal with generative AI art that is built on top of non-public domain data (and it is to be expected that platforms will eventually be able to detect this, either because the user declares it, or because their automatic systems detect it, or because technology like the one that the <strong>CAI standard</strong> proposes, helps detect it).</p><p>Platforms may add advertising to those works, and share the profits with the impacted artists. Or they may block parts or the whole of those works in the regions affected by the copyright related to the artist or creative group. Or they may put them in separate special categories (away from creations produced by humans) while these scenarios get further clarified. We may also witness a great variety of ways of dealing with creations produced by humans+AI systems powered by public domain data. In summary, once detection systems become good enough, there will be a number of ways of dealing with these gray zones.</p><p>The work on those<strong> detection systems</strong> has already begun. The<strong> CAI standard</strong>, by using <a href="https://contentauthenticity.org/how-it-works">smart metadata</a> and other tools, will soon begin to be implemented by companies and platforms all around the world. Let’s briefly explore what it does.</p><h3><strong>Responsible AI and the content authenticity initiative (CAI)</strong></h3><p>A number of companies and groups have already been researching and working on designing systems that can be used to deal with gray zones as well as misinformation.</p><p>One of these systems is the <a href="https://contentauthenticity.org/"><strong>content authenticity initiative</strong></a> project (CAI) started by <strong>Adobe</strong>. 
<strong>CAI </strong>was actually started in 2019, as companies like Adobe anticipated the need for a standard to deal with the potential for AI tools to produce misinformation and other related issues.</p><p>In their words, <strong>CAI members</strong> are: “a community of media and tech companies, NGOs, academics, and others working to promote adoption of an open industry standard for content authenticity and provenance”. (<a href="https://contentauthenticity.org/our-members">list of current members</a>)</p><p>The group, whose <strong>membership is free</strong>, provides <strong>open source tools</strong> that make it possible to track the provenance and attribution of digital content throughout the entire pipeline, from capture to distribution.</p><p>The ultimate goal is to ensure that <strong>creatives are recognized</strong> for their work and that people and platforms can <strong>understand</strong> the <strong>origins and methods</strong> involved in the production of the content they are dealing with.</p><p>The key thing to highlight is that the CAI standard is going to enable people to know if, and how, generative AI was used to create a certain piece of content.</p><p>It is a good sign that there are large companies working to promote what they call <strong>“Responsible AI”</strong>, and that systems are being put in place that will allow us to know where each piece of digital content comes from, whether generative AI was involved in its production, what copyright is attached to the content, and so on.</p><p>It’s important to highlight that, to protect the privacy and security of photojournalists and other creators, they have the option to preserve attribution or remain anonymous when using these systems.</p><p>The world is watching. 
At the recent <a href="https://www.visual1st.biz/"><strong>Visual 1st</strong></a><strong> conference </strong>(the premier conference for the imaging ecosystem, which takes place in San Francisco and is led by Hans Hartman and Alexis Gerard), generative AI was a big part of the conversation. I had the pleasure of having a great discussion with Hans and Alexis during the fireside chat that opened the event.</p><p><strong>Visual tech experts</strong> like <a href="https://www.linkedin.com/in/melcher/"><strong>Paul Melcher</strong></a><strong> </strong>are doing a great job bringing the very latest in generative AI to audiences worldwide.</p><p><strong>Educators around the world</strong>, from organizations like <a href="https://www.fast.ai/">fast.ai</a> to <a href="https://iia.es">AI master’s programs</a>, <a href="https://www.youtube.com/c/DotCSV">YouTubers</a> with hundreds of thousands of followers, and experts in <a href="https://twitter.com/javilop">prompt engineering</a>, are documenting and explaining every stage of this revolution.</p><p>In the realm of <strong>datasets</strong>, we also find very interesting companies and projects like <a href="http://datasetshop.com">datasetshop.com</a>, powered by <strong>vAIsual</strong>, pioneers in the generation of <strong>legally clean</strong> synthetic stock media and creators of the world’s largest licensable biometrically-released real-life dataset.</p><p>Again, it is good news that we are witnessing the rise of terms like “<strong>Responsible AI</strong>” and “<strong>Legally Clean</strong>” datasets.</p><p>And as someone who is very active in both areas, generative AI and the arts, I have tried to give you in this article a high-level overview of a number of perspectives involved in these dynamic early stages.</p><p>Let’s remind ourselves that these are indeed early times in a rapidly evolving context, so <strong>let’s all be as gentle as possible</strong> with each other<strong>, as we do our best </strong>to find 
the right balance between encouraging a technology that will bring many benefits to humankind, and the need to protect the rights of creatives and artists.</p><h3>What the future holds</h3><p>As for the coming times, in my view, and in simple terms:</p><ul><li><strong>Artists </strong>will keep on being artists. As this article has strived to explain, being or not being an artist has nothing to do with specific tools or technologies. Instead, it has a lot to do with the ways we interact with those depth elevators we explored previously.</li><li><strong>Engineers </strong>will keep on being engineers.</li><li><strong>Researchers </strong>will keep on being researchers.</li><li><strong>Prompt engineers</strong> (a new segment) will be just that: prompt engineers.</li><li>And <strong>artists and creatives</strong>, professional or not<strong> </strong>(the following applies equally to pro creatives and to those who have a natural predisposition towards exercising their creative muscles),<strong> who incorporate generative AI tech and prompt engineering</strong> into their processes, will have a better chance to lead their fields, and may become even greater artists and creatives, because they will be incubating their ideas with the help of these powerful Iron Man suits (immense subconscious pots) as well as using that very same tech to accelerate their creative production processes.</li><li>Finally, lazy people will keep on being lazy people.</li></ul><h3>Let’s make it together</h3><p><strong>AI is definitely coming home.</strong> We must all push together to bring the best out of this revolution, in order to benefit humankind as much as possible.</p><p>And to complete this article, where we have explored pretty complicated matters, <strong>let’s end on a lighter note</strong>, with some <strong>musical tributes </strong>to this wonderful technology.</p><p>The following is a small fragment of a performance by <strong>Soprano Covadonga González Bernardo,</strong> singing a 
song that was composed as a <strong>collaboration between different AI systems </strong>and myself. The GPT architecture was used for the lyrics, music transformers for melody+chords, and VQGAN for the visuals. (The visuals don’t appear in this small fragment.) This was a project proposed and organized by the<strong> </strong><a href="https://iia.es"><strong>Instituto de Inteligencia Artificial</strong></a><strong> @ iia.es,</strong> where I’ve given talks a few times.</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FnIeYoOaGlgQ%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DnIeYoOaGlgQ&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FnIeYoOaGlgQ%2Fhqdefault.jpg&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/b4c635156b52ffd11dd6773c81664339/href">https://medium.com/media/b4c635156b52ffd11dd6773c81664339/href</a></iframe><p>Next, a simple little piano <strong>improv </strong>dedicated to the theme of <strong>generative AI coming home</strong>, getting closer to the human potential.</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FQW8Cd6oaWVI%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DQW8Cd6oaWVI&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FQW8Cd6oaWVI%2Fhqdefault.jpg&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/f4d9dfd64884ddd5ecf07675d46b4623/href">https://medium.com/media/f4d9dfd64884ddd5ecf07675d46b4623/href</a></iframe><p>Finally, a bit of <strong>time travel fun</strong>. 
Can we all appreciate that what we are experiencing today with generative AI would probably have been interpreted as a miracle just a few decades ago? Let’s travel back in time to the year 1950 in Spain :)</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FKjkLPAZpO-s%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DKjkLPAZpO-s&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FKjkLPAZpO-s%2Fhqdefault.jpg&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/19c31e9836247fde8c5f5f628598f6c3/href">https://medium.com/media/19c31e9836247fde8c5f5f628598f6c3/href</a></iframe><p>Stay well everybody, and above all, stay human.</p><h3>Epilogue</h3><p>Regarding my last phrase, “<strong>stay human</strong>”.</p><p>Sometimes, people ask me: what do I think will happen when AI excels at System 2 capabilities (reasoning, planning, etc.) in, say, 30, 40 or 50 years from now?</p><p><strong>System 1 and System 2</strong> are different<strong> thinking modes </strong>in our minds.</p><p><strong>System 1</strong> refers to fast, subconscious, simultaneous, intuitive processes, and this is the domain where AI is reaching superhuman capabilities.</p><p><strong>System 2 </strong>refers to the slow, logical, rational, systematic, precise and sequential kind of thinking. And mastering this second mode is still way beyond our AI systems. (See <strong>Daniel Kahneman’s book</strong> “<a href="https://en.wikipedia.org/wiki/Thinking,_Fast_and_Slow">Thinking, Fast and Slow</a>” to expand on System 1 vs System 2 thinking.)</p><p>A discussion about System 2 capabilities in connection with AI, now and in the future, would fill a whole article of this size or larger. So I leave that for another time. 
Let’s get back to the question posed at the start of this epilogue.</p><p>I typically answer that the question may not make sense anymore in a few decades. Why not?</p><p>Because <strong>today </strong>there is a <strong>separation </strong>between AI and humans. AI is there. We are here.</p><p>But<strong> in some decades</strong>, that<strong> separation won’t be there</strong> anymore. Think of what the company <a href="https://neuralink.com/">Neuralink </a>is working on already these days. That’s only the very beginning of what’s to come.</p><p>In some decades, our technology, including AI, and our biology, will have merged in many ways.</p><p>And then, the new question may be:<strong> “Where will we go next, now that we are together?”</strong></p><p>Thank you for reading.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*yZ2LmBzUf8XQEKRiMuZr8g.jpeg" /><figcaption>Photos by Kelly Sikkema and Firmbee.com on Unsplash | Text by Javier ideami | ideami.com</figcaption></figure><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=a9786de586cb" width="1" height="1" alt=""><hr><p><a href="https://medium.com/data-science/towards-sustainable-generative-ai-revolution-a9786de586cb">Towards a sustainable generative AI revolution</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Cooking a good relationship | How to make it last]]></title>
            <link>https://medium.com/@ideami/cooking-a-good-relationship-how-to-make-it-last-613f0516dc51?source=rss-7f7b5d730c84------2</link>
            <guid isPermaLink="false">https://medium.com/p/613f0516dc51</guid>
            <category><![CDATA[human-prompt]]></category>
            <category><![CDATA[relationship-advice]]></category>
            <category><![CDATA[love]]></category>
            <category><![CDATA[relationships]]></category>
            <category><![CDATA[dating]]></category>
            <dc:creator><![CDATA[Javier Ideami]]></dc:creator>
            <pubDate>Sat, 09 Jul 2022 09:44:17 GMT</pubDate>
            <atom:updated>2022-07-09T09:57:22.935Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ESrORNj2W9zZBPMdAvyYUg.jpeg" /><figcaption>Graphic created by Javier Ideami</figcaption></figure><h3>Cooking a good relationship</h3><p>Let’s use <strong>an analogy connected to creativity, innovation processes, and cooking</strong> to reflect on how relationships evolve through time.</p><p>At the beginning of a relationship, no matter how similar two people may be, they always hold <strong>a good number of divergent ingredients</strong> in relation to each other. Two human beings, each with their own background, past experiences, etc <strong>will always diverge a good deal in relation to each other at the start</strong>.</p><p>And if they like each other, they are also likely to converge on a number of areas. So<strong> they also hold a good set of convergent ingredients.</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Nc1TRvSujljNX4c4Khl3Mg.gif" /><figcaption>Graphic created by Javier Ideami</figcaption></figure><p>Therefore, at the beginning of their relationship, their subconscious pots are incubating, combining, and cooking a good mixture of quality convergent and divergent ingredients.</p><p>This is <strong>the reason why relationships are so exciting at the start</strong>. Both people are still diverging a good deal in relation to each other. Therefore the mixture of their information ingredients generates <strong>all kinds of unexpected and original behaviors, reactions, and situations</strong>.</p><p>Just as in innovation processes we generate innovative ideas, <strong>in a fresh relationship we are generating innovative interactions, situations, and behaviors.</strong></p><p>A lot of people, in the first weeks of their relationship, feel that <strong>things are exciting and different in a very effortless way</strong>. They don’t have to put much or any effort into making things exciting. 
This is happening <strong>because both people diverge naturally in relation to each other</strong>, and their subconscious pots are continuously generating unexpected and exciting outputs.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*mvjuD9BhLTpjfZYurUr3FQ.jpeg" /><figcaption>Graphic created by Javier Ideami</figcaption></figure><p>After some time has passed (which is different for each relationship), a lot of <strong>those divergent ingredients gradually become convergent ingredients</strong>. Gradually, the number of convergent ingredients increases, and the number of divergent ones decreases.</p><p>For example, during their first days together, <strong>she taught him a traditional game from her childhood that he had never tried before</strong>. They had a lot of fun playing it together. They played it a few times over the next few weeks. But <strong>eventually</strong>, once they played it enough times, <strong>that divergent ingredient gradually became a convergent one</strong>. Now, they both know the game very well. They also know every detail of their reactions and behaviors when they play it.</p><p>We now arrive at the crucial point that has an enormous impact on the future of the relationship.</p><p>As a relationship progresses, <strong>both members of that relationship have a choice to make</strong>. They can either:</p><ul><li><strong>Make the effort to keep generating divergent ingredient</strong>s in their lives. Those divergent ingredients are information that is divergent in relation to the partner. This means <strong>growing and learning through new activities and experiences </strong>so that <strong>when you interact with your partner </strong>you can <strong>bring to your partner’s subconscious pot unexpected and divergent ingredients</strong> that will keep generating exciting and unexpected reactions, behaviors, and situations.</li><li><strong>Alternatively, you can just wait and see what happens</strong>. 
If you do that, <strong>eventually, most of your divergent ingredients will become convergent ones</strong>. And what this means is that when both partners interact, their interactions will be combining and incubating ingredients that are mainly convergent between them. The result of that will be <strong>situations, behaviors, and reactions that will also be mainly convergent, typical and predictable</strong>. Not very exciting. Another side effect is that <strong>too much convergence produces the merging of both partners</strong>. They become more dependent on each other and <strong>begin to lose their own unique identities</strong>.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*c409u3-jLdF4pM9semYNbA.gif" /><figcaption>Graphic created by Javier Ideami</figcaption></figure><ul><li>All <strong>extremes are dangerous</strong>. <strong>The opposite situation happens when both partners diverge too much </strong>and constantly in relation to each other. In this case, <strong>they begin to lose touch with each other</strong>. The relationship becomes <strong>unstable and too unpredictable</strong>. Divergent ingredients dominate their interactions and they lack a solid base made of the necessary convergent ingredients. Gradually they literally diverge from each other. The relationship is in trouble.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*75OQxqQL_QU0PRsdwLb5lA.jpeg" /><figcaption>Graphic created by Javier Ideami</figcaption></figure><h3>Why affairs or breakups happen</h3><p>Every human being and <strong>every human brain</strong>, we should say, <strong>longs for variety and new stimuli</strong>. <strong>These often come from those divergent ingredients</strong> we are talking about.</p><p>Within a relationship, <strong>we long for situations, interactions, and scenarios that feel fresh and exciting</strong>. 
If, after some time, all you are getting are typical and predictable situations, and one of the members of the relationship is not receiving enough divergent ingredients from the partner, what do you think is going to happen?</p><p>What will happen is that <strong>one or both of the partners may go and find those divergent ingredients somewhere else</strong>. That means <strong>having an affair, breaking up, going away to travel around the world, </strong>or anything else. It means you go and find those divergent ingredients somewhere else because <strong>your partner or both of you have not made the proactive effort to keep enriching those subconscious pots</strong>.</p><h3>How to keep your relationship alive</h3><p>Therefore, if you want to keep your relationship alive, fresh, and exciting, you need to:</p><ul><li>First of all, <strong>understand that your relationship is so exciting at the beginning because you two diverge a lot naturally</strong> in relation to each other at the start. Therefore, without any effort, you both are holding and giving to each other tons of divergent ingredients. Those divergent ingredients, in combination with the convergent ones that you both also hold, <strong>effortlessly generate lots of unexpected, fresh and exciting situations </strong>and scenarios. The relationship feels awesome. And you feel that this will last forever. But <strong>now you know why it won’t last very long unless you make a proactive effort</strong> to keep it that way.</li><li>To keep it that way, <strong>you must continue to grow and enrich your own subconscious pot through learning and experiencing new situations, generating new divergent ingredients</strong>, ingredients which should be <strong>divergent in relation to your partner</strong>. 
This way, when you interact with your partner, even though the past divergent ingredients are gradually becoming convergent ones, <strong>you will be bringing to your partner’s subconscious pot a brand new batch of divergent ingredients</strong>, compensating for the gradual transformation of the previous ones.</li></ul><h3>What about having new experiences together with my partner rather than on my own?</h3><p><strong>Both are useful</strong>. <strong>Having new experiences together is a source of new divergent ingredients</strong>, because when you two experience something new, even if you do it together, the way you both experience it is different in relation to each other, and the way you interact with each other within a new context is also different. And that generates new divergent ingredients.</p><p><strong>However, if you always do everything with your partner, diverging in a consistent way is also harder</strong>. Doing things together all the time constantly reinforces the process of transforming divergent ingredients into convergent ones, as well as the process of generating new convergent ingredients in general.<strong> That’s why it is crucial to also learn and experience new things on your own</strong>, away from your partner. Doing that allows you to diverge freely without any constraints. You will be generating a lot of divergent ingredients that will later enrich your partner’s subconscious pot.</p><p>And that’s the paradox: <strong>being away from each other is often what will later bring you closer together.</strong> And the explanation is that <strong>what brings you closer together is a great mix of quality convergent and divergent ingredients.</strong></p><p>And after the honeymoon phase of a relationship has passed, in order to generate those divergent ingredients quickly and powerfully, doing new things away from your partner is really effective. So remember, doing things with your partner is wonderful, useful and effective. 
But doing new things away from your partner is very powerful and useful in order to approach that balance we are looking for, the balance between converging and diverging in relation to your partner.</p><p><strong>Enjoy great new experiences with your partner but also find the time to have new ones on your own.</strong></p><h3>What about living together?</h3><p>When you live together, <strong>the transformation of divergent ingredients into convergent ones happens faster</strong>. That’s why a lot of couples report that their relationship degrades and becomes more monotonous after they move in together.</p><p>That’s also the reason why sometimes <strong>one or both of the partners begin to argue and fight more often after they move in together</strong>. This happens because, gradually, a lot of their divergent ingredients become convergent ones and their subconscious pots begin to generate mostly typical and predictable outputs.</p><p>And <strong>when things get too predictable and less exciting, it is easier to get annoyed</strong> with things around you, especially, in a subconscious way, with your partner, with the partner who is no longer giving you quality divergent ingredients.</p><p>Therefore, if you want to live together and keep the relationship fresh and exciting, what we discussed before applies again. <strong>You have to keep learning and growing</strong>, sometimes on your own and sometimes with your partner. You should keep generating quality divergent ingredients that can enrich both of your subconscious pots.</p><p>And the fact is that if you are living together, <strong>you need to be even more proactive</strong> about this than if you are not, because <strong>the speed of transformation of your divergent ingredients into convergent ones increases</strong> the more you interact with each other on a daily basis.</p><h3>But we are so happy always doing the same thing together</h3><p>Let’s suppose that you are born in a city called <strong>A</strong>. 
And as you were raised, you were told that city <strong>A</strong> was the best and had everything you needed. So you never travel outside of it, not even to other places nearby. And when people ask you about it, you say that you are happy, that <strong>you have everything you need in your city</strong> and that you don’t need to go and see other places.</p><p>Or let’s suppose that you are a very fearful person and you decide to stay in your house 24/7. And you tell people that you are very happy living 24/7 in your house without going out to the streets, and that you are satisfied with that lifestyle because you can order everything through the internet and <strong>you don’t really need to leave the house for anything</strong>.</p><p><strong>The fact that these scenarios are possible,</strong> and indeed they happen,<strong> does not mean </strong>that they are healthy or<strong> that there aren’t other alternatives that could improve the lives</strong> of these people and make them better, deeper and richer.</p><p>The fact that they happen only means that<strong> these people are accepting some self-imposed limits</strong>. That’s all. It does not mean that such a person would not benefit from visiting other cities and places, from living in a different way.</p><p>There are <strong>all kinds of people in the world</strong>. People with all sorts of backgrounds and experiences. And <strong>there are</strong> indeed <strong>people out there who either prefer or want to stay in relationships that are mainly convergent</strong>. They are totally ok with a partner who is either very similar to them or makes the effort to converge to them continuously. And they are totally ok triggering situations, behaviors, and scenarios that are mostly typical and predictable.</p><p>This is their choice. But let’s not be mistaken. 
It is just a choice.<strong> Most people will agree that there is more happiness and personal growth to be found in a life that is rich with variety and diversity of stimuli than in one full of monotony and repetition</strong>. And the biology of our brain agrees as well.</p><h3>Why do some couples that are really convergent stay together?</h3><p>Sometimes it happens because of fear, lack of options, or laziness. Fear of not finding another partner, or a lack of confidence in their own chances of finding somebody else. Other times it is because of financial issues. <strong>In normal conditions, it is fear or laziness, subconscious or conscious, that keeps people from breaking up a relationship that has grown monotonous or too convergent.</strong></p><p><strong>The natural impulse of most, not all, but most human beings is to find a partner that continuously feeds them a good mixture of convergent ingredients </strong>(ingredients that resonate with them and which establish a solid foundation in the relationship)<strong> and divergent ingredients</strong> (ingredients that diverge from them and which, in combination with the convergent ones, generate unexpected and exciting situations, behaviors and scenarios).</p><p>Some people want the balance between convergent and divergent ingredients in their relationship to lean more towards the convergent ones. And others prefer the opposite. What we must remember is that<strong> all extremes are unhealthy</strong>. And a relationship that has too many convergent ingredients or too many divergent ones (in relation to the other) is going to get into trouble.</p><p>A relationship with <strong>an excess of convergent ingredients</strong> (in relation to the divergent ones) will feel <strong>stale and predictable</strong>. 
A relationship with a<strong>n excess of divergent ingredients</strong> (in proportion to the convergent ones) will feel <strong>unstable and too unpredictable</strong>.</p><p>However, we put <strong>special emphasis on the generation of divergent ingredients </strong>because <strong>those are the ones that are harder to generate</strong>, and the reasons for this (biological, cultural and social) are explained in depth in methodologies like <a href="https://torch4ideas.com/about"><strong>Torch </strong></a>(parts of which I’m using to build analogies in this article).</p><p>Strive for balance. And once the honeymoon phase has passed, make sure to <strong>be proactive about generating those divergent ingredients that your partner will subconsciously need. Do it for yourself and for your partner. Do it for the good of your relationship.</strong></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=613f0516dc51" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Lucy says hi — 2031, AGI, and the future of A.I]]></title>
            <description><![CDATA[<div class="medium-feed-item"><p class="medium-feed-image"><a href="https://medium.com/data-science/lucy-says-hi-2031-agi-and-the-future-of-a-i-28b1e7b373f6?source=rss-7f7b5d730c84------2"><img src="https://cdn-images-1.medium.com/max/1600/1*886rbVXidg1rr5WQrhWt8Q.jpeg" width="1600"></a></p><p class="medium-feed-snippet">In the year 2031 there is a new Alexa in town, and artificial general intelligence is one of its features. What is Lucy made of?</p><p class="medium-feed-link"><a href="https://medium.com/data-science/lucy-says-hi-2031-agi-and-the-future-of-a-i-28b1e7b373f6?source=rss-7f7b5d730c84------2">Continue reading on TDS Archive »</a></p></div>]]></description>
            <link>https://medium.com/data-science/lucy-says-hi-2031-agi-and-the-future-of-a-i-28b1e7b373f6?source=rss-7f7b5d730c84------2</link>
            <guid isPermaLink="false">https://medium.com/p/28b1e7b373f6</guid>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[deep-dives]]></category>
            <dc:creator><![CDATA[Javier Ideami]]></dc:creator>
            <pubDate>Wed, 16 Jun 2021 13:27:08 GMT</pubDate>
            <atom:updated>2021-06-16T13:27:08.027Z</atom:updated>
        </item>
        <item>
            <title><![CDATA[Journey to the center of the neuron]]></title>
            <link>https://medium.com/data-science/journey-to-the-center-of-the-neuron-c614bfee3f9?source=rss-7f7b5d730c84------2</link>
            <guid isPermaLink="false">https://medium.com/p/c614bfee3f9</guid>
            <category><![CDATA[deep-dives]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <dc:creator><![CDATA[Javier Ideami]]></dc:creator>
            <pubDate>Wed, 21 Apr 2021 13:34:53 GMT</pubDate>
            <atom:updated>2021-04-26T04:20:17.614Z</atom:updated>
            <content:encoded><![CDATA[<h4>Dive into the salty ocean of the brain and get closer to the entities that inspire our A.I systems and make your thoughts possible. Explore how understanding the difference between artificial and biological neurons may give us clues about how to move towards a more flexible kind of artificial intelligence.</h4><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FEDUyB3u8nbQ%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DEDUyB3u8nbQ&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FEDUyB3u8nbQ%2Fhqdefault.jpg&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/5abc840b0ec228a0e098c57a5260e318/href">https://medium.com/media/5abc840b0ec228a0e098c57a5260e318/href</a></iframe><p><strong>Every single one of your thoughts</strong> is<strong> made possible by your biological neurons</strong>. And behind many of the most useful A.I architectures is an entity inspired by them. <strong>Neurons</strong> are at the <strong>epicenter</strong> of the processing that underpins the complexity produced by intelligent systems. Curious to know more about the engine of your thoughts and about how they compare to their artificial counterparts? Let’s do it!</p><p><strong>A.I neurons</strong> were originally inspired by our biological ones, yet they are <strong>very different.</strong> And why shouldn’t they be? There are<strong> many ways to get to the same destination</strong> and in the same way that <strong>human flight got inspired but didn’t copy part by part the way that birds fly,</strong> our artificial neurons are only partially inspired by our biological ones.</p><p><strong>And yet</strong>, our biological neurons are way more complex than our artificial ones and hold so much rich detail and so many mysteries within. 
Even if we don’t need to copy the way biological neurons work, <strong>understanding what is different</strong> between both entities <strong>could give us new clues</strong> about how to move towards a more flexible form of artificial intelligence.</p><p>In this article we will review those differences, as well as some that have to do with the networks where these neurons are embedded. We will also consider how those differences may inform new possibilities for the future.</p><p><strong>Note</strong>: A.I is a vast field. There are all sorts of exotic variations within the area. When making comparisons, I will be referring exclusively to some of today’s most typical and popular deep learning architectures.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*D_UdWMcEYYKxMKgweVdmpQ.jpeg" /><figcaption>Neural Spring Waterfall, painting by the author Javier Ideami | <a href="https://ideami.com">https://ideami.com</a></figcaption></figure><h3>The neuron: salty water and electricity</h3><p>Before we go deep into the details of the functionality of the neurons, let’s take a quick look inside. This is about to get salty!</p><ul><li>Think of <strong>your brain as a container that holds a salt water ocean</strong>. Within that ocean you have a lot of cells that we call <strong>neurons</strong> and a lot of <strong>ions</strong>. <strong>Ions </strong>are atoms in which the positive and negative charges are not equal. And the <strong>main ions you have in your brain </strong>are: <strong>sodium</strong> (Na+), <strong>potassium</strong> (K+), <strong>calcium </strong>(Ca++) and <strong>chloride</strong> (Cl-). A good reminder of why those minerals are so important!</li><li>So, the biological <strong>neuron</strong>, like most cells, is basically <strong>made of salt water with ions like chloride and sodium floating around</strong>. 
And <strong>everything a neuron does can be explained in terms of electricity</strong>: <strong>voltages</strong> (the potential/voltage that, for example, exists at the membrane of the neuron) and <strong>currents</strong> (the flow of charged ions in and out of the neuron).</li><li>The <strong>artificial neuron </strong>is created with<strong> computer code</strong>, which when executed creates <strong>data structures made of digital bytes</strong>, and so on. Most of what an artificial neuron does can be understood in terms of <strong>computations</strong>, <strong>linear and nonlinear transformations</strong> of data.</li></ul><p><strong>A salty ocean vs a silicon wonderland.</strong> What’s more efficient? Let’s take a look.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*BCAkHO6lXqjIGkD_UyZHNA.jpeg" /><figcaption>Painting by the author Javier Ideami | <a href="https://ideami.com">https://ideami.com</a></figcaption></figure><h3>Energy: 20 watts for a premium service</h3><p><strong>Information processing requires energy, </strong>so energy consumption within these neural networks matters a lot. It sets boundaries on what’s possible and <strong>our brain is tremendously efficient.</strong></p><ul><li>Right now, as thoughts move through your mind, your <strong>brain is drawing just about 20 watts of power</strong>, barely enough to turn a light bulb on. It is able to do this <strong>even while you fast or sleep</strong>, while keeping a <strong>moderate temperature of around 37 degrees Celsius</strong>.</li><li>The powerful <strong>GPUs </strong>often used by our deep learning systems can draw <strong>hundreds of watts per unit,</strong> way more than our brain, and <strong>they emit a lot of heat</strong>, reaching temperatures of around 70 or 80 degrees Celsius.</li></ul><p><strong>What’s up:</strong> research is ongoing on making A.I systems more energy efficient. 
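As a quick back-of-the-envelope illustration of that energy gap, here is a tiny sketch. The 20-watt figure comes from the text above, while the 300-watt GPU figure is an illustrative assumption for a single high-end accelerator:

```python
# Rough energy comparison: brain vs a single GPU.
# Assumed figures: brain ~20 W (from the text), GPU ~300 W (illustrative).
BRAIN_WATTS = 20
GPU_WATTS = 300

HOURS_PER_DAY = 24
# Energy used in one day, in kilowatt-hours.
brain_kwh = BRAIN_WATTS * HOURS_PER_DAY / 1000
gpu_kwh = GPU_WATTS * HOURS_PER_DAY / 1000

print(f"Brain: {brain_kwh} kWh/day, GPU: {gpu_kwh} kWh/day")
print(f"One GPU draws about {GPU_WATTS / BRAIN_WATTS:.0f}x the brain's power")
```

And that is one GPU; large training runs use thousands of them for weeks.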
Sparsity of connections and activations may help these systems to approach the tremendous efficiency of our brain, in which only a small percentage of the neurons are active at any one time (typically between 0.5 and 2%).</p><h3>Functionality: detecting all around</h3><p>We have reviewed the <strong>environment </strong>and the <strong>energy </strong>consumption. Time to zoom into one of these entities. What is a neuron doing?</p><ul><li>It<strong> detects patterns coming from the many inputs </strong>it receives. We have around <strong>100 billion neurons in the brain </strong>(estimates vary; some experts put the real number closer to 86 billion), and <strong>each of them receives inputs from around 10,000 other neurons</strong> (some say it’s around 7,000, others 8,000; in any case, there are thousands of them).</li><li>The biological neuron has a <strong>threshold </strong>and <strong>when that threshold is crossed, it emits a signal </strong>that we call an <strong>action potential or spike</strong>. That signal travels down its <strong>axon </strong>(the neuron’s output) towards the <strong>synapses</strong> of other neurons.</li><li>The <strong>synapse </strong>is a structure that allows a neuron to pass a signal to another neuron. 
Synapses are located at the <strong>dendrites </strong>of the neurons (their branches).</li><li>So you can <strong>summarize the biological neuron</strong> with this process: <strong>receive </strong>inputs, <strong>integrate </strong>them, and <strong>decide</strong> if the result of that integration is strong enough to fire an output signal.</li><li><strong>The artificial neuron</strong>, on the other hand, <strong>performs computations</strong> that <strong>combine its inputs with their respective weights </strong>(the weights are numerical values that specify the strength of the connection between that neuron and each of its inputs).</li><li>The result is <strong>passed through an activation function</strong> (a function that computes a nonlinear transformation of its input, which makes it possible for the network to learn nonlinear mappings between inputs and outputs).</li><li>So you can <strong>summarize the artificial neuron</strong> with this process: <strong>receive</strong> inputs, <strong>add the result of multiplying each of those inputs by the strength of their connection </strong>to that neuron, and <strong>pass the result</strong> of that calculation <strong>through a nonlinear function</strong>.</li></ul><p><strong>What’s up: </strong>notice a key difference. <strong>Biological neurons have a threshold</strong> that <strong>keeps them silent until it is crossed</strong>. Most artificial neurons in deep learning systems produce an active output (some may output a 0; for example, the ReLU activation function sets the output of the neuron to 0 if its input is smaller than 0). 
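The artificial-neuron summary above (weighted sum plus nonlinearity) fits in a few lines of plain Python; the inputs, weights and bias below are made-up illustrative values:

```python
# Minimal artificial neuron: weighted sum of inputs plus a bias,
# passed through a ReLU activation, i.e. max(0, x).
def relu(x):
    return max(0.0, x)

def artificial_neuron(inputs, weights, bias):
    # 1. Combine each input with its connection weight.
    pre_activation = sum(i * w for i, w in zip(inputs, weights)) + bias
    # 2. Apply the nonlinear activation function.
    return relu(pre_activation)

# Illustrative values: 3 inputs, 3 weights, one bias.
print(artificial_neuron([0.5, -1.0, 2.0], [0.8, 0.2, -0.5], 0.1))  # 0.0
print(artificial_neuron([1.0, 2.0], [0.5, 0.25], 0.0))             # 1.0
```

Note that the ReLU here silences negative pre-activations, but that is not the same as the biological threshold: there is no notion of staying silent until a voltage is crossed over time.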
As we will emphasize later on, around 0.5 to 2% of our biological neurons are active at any one time vs around 50% in typical artificial deep learning systems.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*rKYr32PGJrpOJNTo1YsS4g.jpeg" /><figcaption>Painting by the author Javier Ideami | <a href="https://ideami.com">https://ideami.com</a></figcaption></figure><h3>Electricity all around</h3><p>Let’s go back for a moment to the electric side of things within the biological realm. We often hear that our <strong>neurons communicate with electric impulses</strong>. Let’s dive into the electric dimension so that we can internalize even better what is going on.</p><ul><li>A <strong>voltage </strong>is a <strong>difference</strong> (in relative terms) between <strong>the electric potential in one place and the potential in another place.</strong></li><li>The <strong>voltage in neurons</strong> is typically measured in <strong>millivolts</strong>. A millivolt is <strong>a thousandth of a volt</strong>. Our <strong>tiny neurons</strong> use <strong>tiny amounts of electricity</strong> to conduct their operations.</li><li>A crucial thing to highlight is the <strong>membrane potential (voltage) of a neuron</strong>. This is the <strong>voltage of the neuron in relation to the space outside of it</strong>. And we call it <strong>membrane</strong> potential because it is located at the membrane of the neuron, which is <strong>a thin layer of fat</strong>.</li><li><strong>When there is a difference</strong> in electric charge between two areas, the electric <strong>charge will tend to flow in order to compensate</strong> for the difference and equalize the situation.</li><li>The <strong>membrane </strong>of the neuron <strong>acts as a barrier between the electric current inside and the one outside</strong>. 
And what we call <strong>ion channels </strong>are like little <strong>tunnels in that barrier that allow things to flow in a controlled way</strong>.</li><li>The <strong>conductance</strong>, the <strong>size of the openings</strong>, determines how fast those ions flow in and out of the membrane.</li><li>So <strong>when a voltage (potential) exists, a relative difference in charge between two areas, ions flow to equalize things</strong>. But why? Because of the <strong>universal principle of opposite charges attracting and similar charges repelling </strong>each other. When, for example, in a certain setting there is more positive charge than negative, an electrical current is formed to equalize the situation, bringing more negative charge into the area.</li><li>The <strong>firing threshold </strong>(or <strong>action potential threshold</strong>) is the <strong>voltage that has to be reached</strong> at the membrane of the neuron <strong>for the neuron to fire </strong>an output signal (an action potential) through its axon. This <strong>threshold </strong>is <strong>typically </strong>at <strong>around -50 mV (millivolts)</strong>.</li><li>The so-called <strong>resting potential of a neuron</strong> is at <strong>-70 mV,</strong> <strong>below the firing threshold</strong>, so <strong>by default the neuron will not fire.</strong></li><li>The <strong>importance of this threshold </strong>cannot be overstated. As a consequence of its existence, <strong>only the most intense levels of activation are communicated </strong>through the neuron’s axon (its output). This allows the <strong>information to be encoded in a very compact and efficient way.</strong></li><li>Remember those<strong> 20 watts </strong>of energy consumption that we mentioned at the beginning? <strong>Biological</strong> <strong>neurons only communicate relevant</strong> and key information. 
<strong>For the rest, they stay silent</strong>.</li></ul><p>Let’s remind ourselves that <strong>there is no reason why we should copy or imitate the complexity of the biological neuron</strong>. We could create systems that display flexible forms of intelligence in a completely different way. And <strong>the simpler the way, the better.</strong> But <strong>understanding our biological neurons in depth could give us some ideas</strong> that may enrich our experiments and strategies when working with artificial neurons.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*XMCOG7iXahOZEnIFNlelzA.jpeg" /><figcaption>Painting by the author Javier Ideami | <a href="https://ideami.com">https://ideami.com</a></figcaption></figure><h3>It’s about time: to spike or not to spike?</h3><p>A very <strong>important difference</strong> between both kinds of neurons is related to<strong> the time dimension.</strong></p><ul><li><strong>Biological neurons fire</strong> for a <strong>very brief</strong> moment. They emit <strong>spikes</strong>, which last a very short time (typically around 1 millisecond). The<strong> information transmitted</strong> by a neuron is <strong>encoded in the timing of those spikes</strong>. A <strong>spike train </strong>is a sequence of <strong>spikes and silences.</strong></li><li><strong>After each spike</strong>, the neuron’s <strong>membrane potential drops back to a low value</strong> (it can even go below its resting potential). 
In order <strong>to spike again</strong>, the <strong>voltage needs to go back up and above </strong>the<strong> firing threshold.</strong></li><li>When <strong>learning</strong> processes take place, the <strong>efficiency with which a neuron can contribute to activate other neurons</strong> <strong>can </strong>dynamically <strong>change</strong> through processes like <strong>long term potentiation (LTP), </strong>which are central to the way we learn and create memories (long term depression is the opposite process of LTP).</li><li><strong>Most artificial neurons </strong>are <strong>constantly producing outputs</strong> (throughout each execution cycle), sending continuous signals to the next neurons down the line (sometimes their activation functions may set their output to 0).</li><li>So, in most artificial deep learning networks, the <strong>time dimension is not relevant</strong>. There is no threshold per se in the way it is used in our biological networks. Our artificial systems are way simpler. But sometimes simpler is better. Will the direction in which today’s deep learning systems are moving be enough to take us to what experts like to call AGI (artificial general intelligence)? Or, in any case, to a more flexible form of A.I? The jury is still out on that matter.</li></ul><p><strong>What’s up: </strong>The <strong>Von Neumann </strong>architecture is <strong>behind most of the hardware we use today. </strong>To get closer to what the brain does, some researchers are starting to work with <strong>other kinds of architectures. Neuromorphic computing</strong> is an example. This kind of architecture allows for <strong>more parallel processing and robustness</strong>. Most importantly, it can work with <strong>spiking neural networks,</strong> which deal with <strong>both the spatial and the time dimensions</strong>, just as the brain does. 
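As a rough illustration of the spiking behavior described above, here is a toy leaky integrate-and-fire neuron. It reuses the -70 mV resting potential and -50 mV threshold mentioned earlier, while the leak rate and input values are illustrative assumptions rather than biophysical measurements:

```python
# Toy leaky integrate-and-fire neuron, using the membrane values
# discussed above: resting potential -70 mV, firing threshold -50 mV.
# The leak rate and input drive are illustrative, not biophysical.
REST_MV = -70.0
THRESHOLD_MV = -50.0

def simulate(input_mv_per_step, steps=100, leak=0.1):
    v = REST_MV
    spike_times = []
    for t in range(steps):
        # Integrate the input, while leaking back toward rest.
        v += input_mv_per_step - leak * (v - REST_MV)
        if v >= THRESHOLD_MV:          # threshold crossed: emit a spike
            spike_times.append(t)
            v = REST_MV                # reset after the action potential
    return spike_times

print(simulate(3.0))   # strong input: a regular spike train
print(simulate(0.1))   # weak input: the neuron stays silent
```

The key contrast with the artificial neuron is visible in the output: information lives in *when* the spikes happen, and sub-threshold input produces no output at all.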
Companies like <strong>IBM</strong> or <strong>Intel </strong>have already produced <strong>neuromorphic chips.</strong> This area of research faces a number of important challenges, both on the research front and in terms of dealing with an existing ecosystem that is so well adapted to the Von Neumann model.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*3DjDxN1HLQSejKth7v6b9A.jpeg" /><figcaption>Painting by the author Javier Ideami | <a href="https://ideami.com">https://ideami.com</a></figcaption></figure><h3><strong>Some space, please: the magic of sparsity</strong></h3><p>When neurons combine with each other,<strong> the number</strong> of them that are<strong> firing at any one moment and the number of connections between them </strong>have a lot of consequences in terms of <strong>energy consumption, resiliency, robustness</strong> and other related factors.</p><ul><li>As thoughts stream through your mind, only <strong>around 2% of your neurons are firing on average</strong>. Most of them are silent. Because only a small percentage of the neurons are active at any one time, noise and other distortions have a harder time interfering with the pattern detection processes of these networks. <strong>Sparsity makes our biological networks resilient and robust.</strong></li><li>Conversely, in our <strong>artificial deep learning networks</strong>, most of the neurons are <strong>continuously producing outputs </strong>(some may have their outputs set to 0 by their activation functions). This is one of the potential reasons why deep learning systems are<strong> often quite brittle and sensitive to</strong> what we call <strong>adversarial attacks/examples</strong>. <strong>Adversarial attacks</strong> are <strong>subtle</strong>, minimal <strong>changes in the input</strong> of a network (typically invisible to our perception) that <strong>produce dramatic and incorrect changes in its output</strong>. 
The <strong>non-sparse nature of deep learning networks </strong>makes them <strong>more sensitive to variations in their inputs</strong>. When most of the weights are relevant and constantly in play, <strong>any changes can produce dramatic consequences.</strong></li><li>But <strong>sparsity </strong>goes beyond the activations. <strong>Numenta </strong>is a well known research company where a talented group of scientists and engineers combine <strong>neuroscience </strong>and <strong>machine intelligence</strong> research. Their team is led by <strong>Jeff Hawkins</strong> and <strong>Subutai Ahmad. </strong>Numenta’s team<strong> </strong>has explored very deeply the issue of sparsity in the brain as well as other areas related to how our neocortex functions. One of the things we can learn from their work, research and publications, is that our neocortex is <strong>sparse at two levels.</strong></li><li>Firstly, as stated above, in terms of activation, <strong>best estimates are that 0.5% to 2% of our biological neurons are active at any one time.</strong></li><li>And then, we also have <strong>sparsity in terms of connectivity </strong>between the neurons. <strong>When a layer of neurons projects onto another</strong> layer, Numenta’s team tells us that current estimates say that <strong>1 to 10% of the possible neuron to neuron connections exist.</strong></li><li>In contrast to that, <strong>most deep learning systems nowadays are very dense.</strong> <strong>100% dense in terms of connectivity, </strong>typically. And <strong>around 50% in terms of activations.</strong></li></ul><p><strong>What’s up: </strong>New architectures that make use of sparse connectivity and sparse activations are an ongoing area of research. 
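A quick experiment makes the activation-sparsity contrast concrete. With zero-centered random pre-activations (an illustrative assumption), ReLU silences only about half of the units, nowhere near the roughly 98% silence of biological networks:

```python
import random

# Measure activation sparsity: the fraction of units that are active.
# With pre-activations drawn from a zero-centered distribution, roughly
# half are negative, so ReLU silences only about 50% of the units,
# far from the ~98% silence observed in biological networks.
random.seed(0)

def relu(x):
    return max(0.0, x)

n_neurons = 10_000
pre_activations = [random.gauss(0.0, 1.0) for _ in range(n_neurons)]
outputs = [relu(x) for x in pre_activations]

active_fraction = sum(1 for y in outputs if y > 0) / n_neurons
print(f"Active units: {active_fraction:.0%}")   # around 50%
```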
Even though A.I does not necessarily have to copy what the brain does, sparsity, as a strategy, makes quite a lot of sense within the mission to build systems that are more resilient and robust.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Cg7YQskDvbb-xq2YKOD4Yw.jpeg" /><figcaption>Painting by the author Javier Ideami | <a href="https://ideami.com">https://ideami.com</a></figcaption></figure><h3>Inputs and parameters</h3><p>Ultimately we want to use these neurons to learn. So let’s now go deeper into the <strong>knobs that these networks tweak </strong>in order to generate that learning. Afterwards we will look at the learning algorithm itself. We begin by comparing the <strong>inputs and parameters </strong>of both kinds of neural networks.</p><ul><li>In <strong>biological networks</strong> you can have <strong>3 types of inputs</strong>: <strong>excitatory</strong> (makes the receiving neuron more likely to fire), <strong>inhibitory </strong>(does the opposite) and <strong>leak</strong> (similar function to the inhibitory one).</li><li>As stated previously, these<strong> inputs interface with the receiving neuron </strong>through <strong>synapses</strong>, <strong>the connection points between sending and receiving neurons</strong>. And <strong>most synapses are on the dendrites</strong> of the receiving neuron.</li><li>The <strong>dendrites are branches</strong> that come off the neuron (dendrite comes from the Greek -<strong>dendro</strong>-, which means tree). At the dendrites, the different <strong>input signals are integrated</strong>. These dendrites have <strong>small spines</strong> on them. 
It is there <strong>that the outputs from the sending neurons (axons) interface (synapse),</strong> establishing connections to other neurons.</li><li>In<strong> artificial networks,</strong> you have, in general, <strong>a single kind of input, </strong>which typically has a number of associated <strong>weights </strong>(numbers that express the strength of the connection between that input and a number of other neurons connected to it, one weight per connection).</li><li>Those weights<strong> hold continuous values that can be negative or positive. </strong>Each weight value, in combination with the computations performed at the neuron, will, in practice, contribute to making the receiving neuron more or less active (roughly analogous to the excitatory/inhibitory dynamic previously described).</li><li>So, <strong>in the biological neuron, </strong>there is this <strong>battle between excitatory and inhibitory signals</strong>. The result of that battle <strong>determines the voltage at the cell’s membrane</strong>. And it is that membrane voltage that <strong>needs to go over the action potential threshold</strong> in order for the neuron to <strong>fire</strong>.</li><li>In the <strong>artificial neuron, </strong>things are simpler. 
There is no explicit threshold and <strong>the different strengths</strong> (positive or negative) of each of the <strong>weights combine to stimulate more or less </strong>the receiving neurons.</li></ul><p>Let’s <strong>get closer to the parameters</strong> of both entities.</p><ul><li>In <strong>biological networks</strong> we have the concept of <strong>the synaptic weight,</strong> which <strong>determines the impact that a signal from a sending neuron</strong> can have <strong>on the receiving neuron</strong> <strong>through their synaptic connection</strong>.</li><li>Getting even closer, what this <strong>impact represents</strong> is <strong>the capacity of the sending neuron’s action potential to release neurotransmitters</strong>,<strong> and of those neurotransmitters to open the synaptic channels </strong>of the receiving side.</li><li>In<strong> artificial networks, </strong>we have <strong>weights</strong> that determine the strength of each of the connections between a sending and a receiving neuron. And those<strong> weights are </strong>simply <strong>numbers</strong>. They could be floating point numbers, integers, single bits, etc.</li><li>So, whereas<strong> the artificial weight is a simple number,</strong> the biological <strong>synaptic weight depends on a lot of factors</strong>. Those factors include, for example, the <strong>amount of neurotransmitter</strong> that can be released into the synapse and absorbed on the other side (and here the number of specific kinds of receptors and ions come into play), how well the signal moves through the axon (<strong>myelination</strong> in the axon has an impact on this), the<strong> efficiency of the signal propagation</strong> and the<strong> number of connections between the axon and the dendrites</strong> of the receiving neuron. As we can see, this goes well beyond a simple number.</li></ul><p>So, those are <strong>the parameters</strong> of these networks. 
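As a tiny numeric illustration of that contrast, signed artificial weights play a role loosely analogous to excitation and inhibition; the values below are made up:

```python
# Signed weights as a loose analogue of excitation and inhibition:
# a positive weight pushes the receiving neuron's activation up,
# a negative one pushes it down. Illustrative values only.
def pre_activation(inputs, weights):
    return sum(i * w for i, w in zip(inputs, weights))

inputs = [1.0, 1.0]
excitatory_only = pre_activation(inputs, [0.6, 0.4])   # both inputs push up
mixed = pre_activation(inputs, [0.6, -0.4])            # second input inhibits
print(excitatory_only, mixed)   # the inhibitory weight lowers the result
```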
And <strong>zooming </strong>out for a moment, <strong>what are they representing</strong>?</p><ul><li>These weights, in general, represent <strong>what a neuron is sensitive to,</strong> <strong>what it is detecting</strong>. If a <strong>weight value is large,</strong> it <strong>means that the related neuron is very sensitive to the input it is receiving</strong>.</li><li>We can therefore sense that the <strong>learning process, </strong>in both cases, has to do with <strong>changing and tweaking these weights, </strong>producing different patterns in the networks as learning progresses.</li><li>So, get ready for this: <strong>each one of your thoughts and memories is represented by a pattern of synaptic weights</strong>. And a similar thing takes place in our artificial networks, where patterns of numerical weights represent information at different abstraction levels, patterns that evolve throughout the learning process.</li></ul><p>We got the structure, the inputs, the parameters and the outputs. Time to learn!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*jUXsRVbm9rVyKJHfWi-CEA.jpeg" /><figcaption>Painting by the author Javier Ideami | <a href="https://ideami.com">https://ideami.com</a></figcaption></figure><h3>Learning: backpropagation and beyond</h3><p>In order to <strong>learn</strong>, we need to <strong>tweak those weights</strong>, those parameters, and to do it in the right direction. But how? 
What is the learning algorithm?</p><ul><li>In our <strong>artificial deep learning networks</strong>, <strong>backpropagation, </strong>in combination with <strong>gradient descent,</strong> is the <strong>typical</strong> <strong>algorithm </strong>of choice.</li><li>For example, in supervised systems (where we provide a dataset with labels, say a number of images of animals with labels that identify the kind of animal), we run the network and then calculate its<strong> performance, its loss value</strong> or error (the difference between what we are obtaining and what we would like to obtain).</li><li>Then, starting <strong>from the output of the network and moving towards its input </strong>(<strong>propagating</strong> the current loss value <strong>in reverse</strong>, which is why we call it backpropagation), we use the power of <strong>calculus</strong> and its<strong> chain rule</strong> to <strong>calculate the impact of each of the parameters of the network on that final loss value.</strong> We are able to do this because all the computations performed at the different layers of the neural network are differentiable.</li><li>Once we know how tweaking each of the weights will impact the final loss value of the network, we can proceed to <strong>tweak each of those parameters</strong> <strong>in the direction that will minimize that final loss value,</strong> that difference between our targets and where we are at each moment.</li><li>If we <strong>keep repeating this process</strong>, moving <strong>down those gradients,</strong> we will <strong>end up at a place where the computations produced with the combination of all our weights produce a very small difference </strong>between our target values and our current network output. Learning has taken place.</li><li>And <strong>what about our biological networks? Is something similar to backpropagation going on in the brain?</strong> There is <strong>controversy </strong>around this. 
<strong>Some experts </strong>think that <strong>there may be something</strong> going on in the brain that, while being different, <strong>may have similarities to what backpropagation does</strong>. <strong>Others think</strong> that the way our biological networks learn <strong>has nothing to do with it.</strong> So, <strong>the jury is still out</strong> and there is a lot of active research in this area.</li></ul><p><strong>What’s up: Backpropagation</strong> is a great learning algorithm. And yet, like any algorithm, it has strengths and weaknesses. What if we went beyond the search for something similar to backprop in the brain and considered other options? Researcher <strong>Ben Goertzel</strong>, <strong>an expert in the AGI field </strong>(artificial general intelligence), thinks that we will eventually go beyond backprop and use other kinds of learning algorithms that will adapt better to the needs of future AGI systems. Those may include <strong>evolutionary algorithms</strong> such as <strong>CMA-ES</strong> applied to complex neural architectures.</p><p>And Ben tells us that <strong>if we used these kinds of evolutionary algorithms</strong>, we could then, for example, use <strong>inference for fitness estimation </strong>and <strong>other strategies to guide the evolutionary learning</strong> process, strategies that are more difficult to implement when we use backpropagation.</p><p>Ben puts forward a really interesting question: <strong>how many neural architectures are being discarded just because they are not suitable to work with the backpropagation algorithm? </strong>It is a great reminder to keep our options and our mind open to new possibilities.</p><p>Our <strong>brain </strong>benefits from<strong> two kinds of learning processes:</strong> <strong>the evolutionary ones</strong> that get <strong>encoded in our genes</strong>, and <strong>the ones that take place during our lifetimes within our neural networks</strong>. 
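The backpropagation-plus-gradient-descent loop summarized in the bullets above can be sketched for a single linear neuron; the dataset and learning rate are illustrative:

```python
# Minimal gradient-descent loop for one linear neuron, y = w * x + b,
# trained on a made-up target mapping. It shows the cycle the text
# describes: forward pass, loss, gradients via the chain rule, update.
data = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]  # targets follow y = 2x + 1
w, b = 0.0, 0.0
lr = 0.05  # learning rate (illustrative)

for epoch in range(2000):
    grad_w = grad_b = 0.0
    for x, target in data:
        y = w * x + b            # forward pass
        error = y - target       # derivative of the loss 0.5 * (y - target)^2
        grad_w += error * x      # chain rule: dLoss/dw
        grad_b += error          # chain rule: dLoss/db
    w -= lr * grad_w             # step down the gradient
    b -= lr * grad_b

print(round(w, 2), round(b, 2))  # approaches w = 2, b = 1
```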
<strong>Combining both approaches</strong> in our artificial networks could open the door to new advances.</p><h3>Evolution: everything in movement</h3><p>Which takes us to the evolutionary aspect of these neural architectures.</p><ul><li>Right now, <strong>the neural networks within your neocortex are already different to how they were hours ago. </strong>They<strong> never stop evolving</strong>.</li><li>In general and zooming out, in our biological networks there are <strong>optimization processes going on at many different levels</strong>, not only in terms of their parameters, but also at the level of the networks themselves, their structure, algorithms, etc (in relation to our genome, for example).</li><li>When active, <strong>our typical and most popular artificial deep learning systems</strong> <strong>optimize their parameters</strong> (weights) by using backpropagation, and <strong>that’s about it.</strong> The architecture itself remains static, apart from the changes we apply from time to time to their hyperparameters (manually or through <strong>autoML</strong>, <strong>grid search</strong> and other similar options).</li></ul><p><strong>What’s up: </strong>research on <strong>self-optimizing mechanisms</strong> that may allow deep learning architectures to<strong> transform and evolve their structures and optimize their strategies</strong> as the learning process progresses, could make our artificial networks more adaptive and flexible. 
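As a minimal taste of the evolutionary alternative mentioned above, here is a toy (1+1) evolutionary strategy, a much simpler relative of methods like CMA-ES; the task, mutation scale and target values are made up for illustration:

```python
import random

# Toy (1+1) evolutionary strategy: mutate a weight vector, keep the
# mutant only if its fitness improves. No gradients are needed, which
# is what makes this family attractive for non-differentiable systems.
random.seed(42)

TARGET = [2.0, -1.0, 0.5]   # hidden "ideal" weights to discover (made up)

def fitness(weights):
    # Negative squared distance to the target: higher is better.
    return -sum((w - t) ** 2 for w, t in zip(weights, TARGET))

weights = [0.0, 0.0, 0.0]
for generation in range(5000):
    mutant = [w + random.gauss(0.0, 0.1) for w in weights]
    if fitness(mutant) > fitness(weights):   # elitist selection step
        weights = mutant

print([round(w, 1) for w in weights])   # close to [2.0, -1.0, 0.5]
```

Real evolutionary optimizers such as CMA-ES adapt the mutation distribution as they go; this sketch keeps it fixed for clarity.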
Researchers like <strong>professor Kenneth Stanley</strong> have already produced very interesting results with dynamic systems that are moving in this direction.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Y0Wi-xlIfRy6c1-Qzn8tjQ.jpeg" /><figcaption>Neural Forest, painting by the author Javier Ideami | <a href="https://ideami.com">https://ideami.com</a></figcaption></figure><h3>When: continuous learning</h3><p>And what is the duration of those learning processes?</p><ul><li>In our<strong> current and most typical deep learning networks</strong>, training processes have<strong> a beginning and an end</strong>. We first train, complete the learning process and then we perform what we call <strong>inference</strong>. We apply that learning to previously unseen data in a separate process.</li><li>We may keep retraining our networks as new data comes in, and update our models iteratively.</li><li><strong>In our brain</strong>, learning processes never go to sleep. <strong>Continuous learning is taking place.</strong> The <strong>strengths of our synaptic connections</strong> change as we think, as we act, and even as we sleep.</li></ul><p><strong>What’s up: </strong>Continuous learning is a hot topic in the A.I community. We know that if we are to eventually reach a more flexible and robust kind of artificial intelligence, learning needs to have more continuity. A lot of research is ongoing in this area.</p><h3>Let’s be social</h3><p><strong>Biological neurons </strong>are <strong>way more social </strong>than our artificial ones. What does this mean?</p><ul><li>In typical deep learning networks,<strong> artificial neurons</strong> communicate <strong>in a single direction </strong>towards the next layer (the backpropagation computations are done in reverse), and they connect only to the previous and next layers. 
There are exceptions, but we are talking here about the most typical kinds of deep learning networks.</li><li><strong>Biological neurons </strong>may communicate in multiple directions and have <strong>a wider and richer range of connections</strong> (while also being sparser)<strong>.</strong> Some neurons communicate up and down the cortical columns. Others have connections that go sideways. The timing aspect that I commented on earlier introduces an even richer aspect to the process.</li></ul><p><strong>What’s going on: </strong>new kinds of A.I architectures that make use of more flexible and richer forms of connectivity are another active area of research. For example, <strong>researchers</strong> are working on <strong>graph neural networks</strong> and other architectures that use <strong>hypergraphs </strong>and <strong>metagraphs. </strong>The <strong>SingularityNET </strong>project,<strong> </strong>founded by <strong>Ben Goertzel, </strong>is doing a lot of work in this area. It combines <strong>blockchain</strong> technology with artificial intelligence services to produce a <strong>decentralized</strong> <strong>A.I network</strong>. 
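To make the richer-connectivity idea concrete, here is a minimal, dependency-free sketch of the message-passing step at the heart of graph neural networks; the toy graph, features and update rule are made up for illustration:

```python
# Minimal message-passing step, the core idea behind graph neural
# networks: each node updates its state by aggregating its neighbors'
# states, allowing arbitrary (not just layer-to-layer) connectivity.
graph = {            # adjacency list: node -> neighbors (bidirectional)
    "a": ["b", "c"],
    "b": ["a"],
    "c": ["a", "d"],
    "d": ["c"],
}
features = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 4.0}

def message_passing_step(graph, features):
    updated = {}
    for node, neighbors in graph.items():
        incoming = sum(features[n] for n in neighbors)  # aggregate messages
        # Toy update rule: average the node's own feature with the sum
        # of the incoming messages.
        updated[node] = 0.5 * (features[node] + incoming)
    return updated

print(message_passing_step(graph, features))
```

Stacking several such steps lets information from distant nodes reach each other, which is one way artificial networks can mimic the sideways and up-and-down communication described above.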
Recently, the project has partnered with the <strong>Cardano </strong>ecosystem to accelerate its progress towards a global decentralized AGI system.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*sKGUBmAxsWTGBV6O8tJWmA.jpeg" /><figcaption>Painting by the author Javier Ideami | <a href="https://ideami.com">https://ideami.com</a></figcaption></figure><h3>Zooming away</h3><p>To complete our journey, let’s zoom out for a moment.</p><ul><li>The neurons that are involved in the advanced part of our intelligence are located in our <strong>neocortex</strong>.</li><li>Our neocortex is organized in terms of <strong>microcolumns, </strong>each of which is composed of around <strong>100 neurons </strong>that deal with similar types of data.</li><li>A number of<strong> microcolumns are then structured in terms of cortical columns</strong> and we have<strong> around 150000 columns in our neocortex</strong>.</li><li>If you are interested in going deeper into what goes on in these cortical columns, I recommend you check the book, <strong>“A Thousand Brains: A New Theory of Intelligence”,</strong> by scientist and entrepreneur <strong>Jeff Hawkins, </strong>a true masterpiece in which he dissects the latest research performed by his team at <strong>Numenta.</strong></li></ul><p>And if you want to <strong>explore further how recent neuroscience research</strong> like the one performed by Jeff Hawkins and his team <strong>points the way towards achieving a more resilient, consistent and flexible form of artificial intelligence</strong>, you may check this other article below that I wrote on the topic</p><p><a href="https://towardsdatascience.com/towards-the-end-of-deep-learning-and-the-beginning-of-agi-d214d222c4cb">Towards the end of deep learning and the beginning of AGI</a></p><p><strong>The sounds of a million souls poem</strong></p><blockquote><em>“ And as we approach the magnificent column,</em></blockquote><blockquote><em>the mysterious pattern calling 
us from afar with the sounds of a million souls…</em></blockquote><blockquote><em>I sense that the brightest sun is compressed in those tiny specks of wonder..</em></blockquote><blockquote><em>reduced to a tapestry of dreams that resonate in our consciousness..</em></blockquote><blockquote><em>And I hear you laugh.. I hear you fall… I hear your tears devastate the horizons..</em></blockquote><blockquote><em>until we merge at the center of the column where silence awaits..</em></blockquote><blockquote><em>Silence, and then the million suns spiking towards the awakening of a new existence..</em></blockquote><blockquote><em>Hold me tight.. and let’s dive right in, right into the center of the column..</em></blockquote><blockquote><em>where you and I are one in silence.. ”</em></blockquote><blockquote><em>— by Javier Ideami</em></blockquote><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=c614bfee3f9" width="1" height="1" alt=""><hr><p><a href="https://medium.com/data-science/journey-to-the-center-of-the-neuron-c614bfee3f9">Journey to the center of the neuron</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Towards the end of deep learning and the beginning of AGI]]></title>
            <link>https://medium.com/data-science/towards-the-end-of-deep-learning-and-the-beginning-of-agi-d214d222c4cb?source=rss-7f7b5d730c84------2</link>
            <guid isPermaLink="false">https://medium.com/p/d214d222c4cb</guid>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[towards-data-science]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[data-science]]></category>
            <dc:creator><![CDATA[Javier Ideami]]></dc:creator>
            <pubDate>Wed, 17 Mar 2021 12:31:15 GMT</pubDate>
            <atom:updated>2021-04-21T13:59:53.711Z</atom:updated>
            <content:encoded><![CDATA[<h4>How recent neuroscience research points the way towards defeating adversarial examples and achieving a more resilient, consistent and flexible form of artificial intelligence</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*GBJnNA6UicklMlFvLHsEMg.jpeg" /><figcaption>Painting by the author Javier Ideami@ideami.com</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*FZ0Ik7ZE65HvvKG085gkxg.jpeg" /><figcaption>Painting by the author Javier Ideami@ideami.com</figcaption></figure><p><strong>Adversarial examples</strong> are a hot research topic in deep learning nowadays. <strong>Subtle, often invisible changes in the data</strong> can push our networks to make <strong>terrible mistakes</strong>. We, as <strong>human beings</strong>, seem to be <strong>way more resilient </strong>to these perturbations in our sensory inputs (though not totally immune).</p><p>There is a certain pattern in our <strong>deep learning systems</strong>. They achieve remarkable things, but they are also <strong>at times delicate and brittle</strong>. Like a rigid tree in the middle of a storm, they look majestic, but may crack at any time without warning. Why is this happening and how can we improve the situation?</p><p>Some clarity is starting to appear through new research arriving from the field of neuroscience. 
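As a quick aside, the brittleness itself is easy to demonstrate with a toy linear classifier (the numbers below are hypothetical and purely illustrative): in high dimensions, a per-pixel nudge far too small to see can swing the model’s output completely.

```python
import numpy as np

# Toy illustration (hypothetical weights, not any real network): in high
# dimensions, a tiny per-pixel perturbation aligned with the weights can
# shift a linear classifier's score by a large amount.
rng = np.random.default_rng(42)
d = 10_000                          # number of "pixels"
w = rng.choice([-1.0, 1.0], d)      # classifier weights
x = rng.uniform(0, 1, d)            # an input "image" in [0, 1]

score = w @ x                       # hovers near 0 by chance
eps = 0.02                          # a 2% nudge per pixel -- imperceptible
x_adv = x + eps * np.sign(w)        # FGSM-style step along the weights

# Each pixel moved by at most 0.02, yet the score shifted by eps * d = 200.
assert np.max(np.abs(x_adv - x)) <= eps + 1e-12
assert (w @ x_adv) - score > 100
```

The trick is that thousands of tiny, individually invisible changes all push the score in the same direction, so their effect adds up.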
In this article we are going to explore it.</p><p>In his recently published book, <strong>“</strong><a href="https://www.amazon.com/Thousand-Brains-New-Theory-Intelligence/dp/1541675819"><strong>A Thousand Brains: A New Theory of Intelligence</strong></a><strong>”</strong>, a masterpiece that I really enjoyed, <strong>scientist and entrepreneur Jeff Hawkins</strong> dissects the latest research performed by <a href="https://numenta.com/">his <strong>team</strong></a><strong> </strong>on our neocortex, the part of the brain that occupies 70% of its volume and is responsible for our advanced intelligence (the other 30% is occupied by the older, more primitive part of the brain).</p><p>In a fascinating journey, Jeff Hawkins takes us deep into the epicenter of our intelligence. He shares that:</p><ul><li>The <strong>circuits </strong>in the neocortex are <strong>really complex</strong>. In just <strong>one square millimeter</strong> we have around <strong>one hundred thousand neurons</strong>, <strong>several hundred million connections </strong>(synapses) and <strong>kilometers of axons</strong> and dendrites.</li><li>The <strong>neocortex looks very similar all around</strong>. The variations between areas are small.</li><li><strong>All parts of the neocortex</strong> seem to be <strong>connected to the generation of movement</strong>, to motor tasks. In every part of the neocortex, scientists find cells that connect with areas of the old brain related to movement.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*-kCqmBNZpLltyzhBbLNV0g.jpeg" /><figcaption>Painting by the author Javier Ideami@ideami.com</figcaption></figure><h3>One circuit to rule them all</h3><p><strong>Vernon Mountcastle </strong>was a leading American neurophysiologist and professor emeritus of neuroscience at Johns Hopkins University. He was the <strong>discoverer of the columnar structure</strong> of the cortex.
And he proposed that through evolution, our <strong>neocortex </strong>became <strong>larger by basically copying over and over</strong> the very <strong>same thing, the same basic circuit.</strong></p><p>When I read about Mountcastle’s idea in Jeff’s book, I was reminded of a fascinating talk by the great scientist <strong>Robert Sapolsky</strong>. Answering a question about what separates us from chimps (<a href="https://www.youtube.com/watch?v=AzDLkPFjev4">https://www.youtube.com/watch?v=AzDLkPFjev4</a>), Sapolsky explains that about half of the difference in gene expression between chimps and humans has to do with genes that code for olfactory receptors; other differences relate to the size of the pelvic arch, the amount of body hair, immune system recognition capabilities, some aspects of reproductive isolation, and so on. Those and others account for almost all the genetic differences between chimps and humans. So then, <strong>where are the differences in the genes that are relevant to the human brain</strong>? Sapolsky explains that <strong>there are hardly any</strong>, and<strong> the few identified</strong> are in <strong>genes having to do with the number of rounds of cell division during fetal brain development.</strong> Basically: <strong>we have 3 times more neurons than chimps. </strong>And this difference in scale seems to be key to our advanced intelligence.</p><p>This fits nicely with Mountcastle’s idea of a single circuit that gets replicated many, many times (volume matters, but is volume enough to push today’s deep learning systems towards AGI?
let’s keep exploring below).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*HFkj2g1fTz2G6vgoKbHMmw.jpeg" /><figcaption>Painting by the author Javier Ideami@ideami.com</figcaption></figure><p>That <strong>all parts of our neocortex</strong> are working <strong>based on a similar principle</strong>, on the same basic circuit, fits the flexibility our brain has demonstrated in different scenarios. And if volume matters, <strong>does this mean that GPT-11 may get us closer to AGI</strong>?</p><p>Unfortunately, <strong>it is not that simple</strong>, because there is a massive elephant in the room that Jeff illuminates in his book and theory, one that we have been kind of ignoring for far too long.</p><h3>150,000 columns</h3><p>Before we go to visit the elephant in the room, let’s establish the context. According to scientists, we have around <strong>150,000 cortical columns</strong> in our neocortex. Jeff tells us that we could think of these columns as if they were thin spaghetti. So, imagine <strong>150,000 strands of thin spaghetti next to each other.</strong> That’s <strong>your neocortex</strong>, metaphorically speaking<strong>.</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*7LzejRzs6BZoYoLt0LW4Qw.jpeg" /><figcaption>Painting by the author Javier Ideami@ideami.com</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*_zxZCzy5gor_GFVtHI51CQ.jpeg" /><figcaption>Painting by the author Javier Ideami@ideami.com</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*hD8w9MiKahESutmJragBGA.jpeg" /><figcaption>Painting by the author Javier Ideami@ideami.com</figcaption></figure><p>What is going on within these cortical columns? Over the last few years, scientists have come to realize that <strong>the brain is a predictive machine</strong>.
It <strong>produces a model of the world</strong> and continuously <strong>predicts what will happen next</strong>.</p><p><strong>When </strong>our brain’s <strong>predictions are not correct</strong>, we realize that something is not right and our <strong>brain updates its model</strong> of the world. As time goes on, our model of the world gets richer and more sophisticated.</p><p>So in a way <strong>we indeed live in a simulation</strong>. For <strong>what we perceive is </strong>really<strong> the model that the brain constructs</strong> rather than the “reality” out there. This explains phantom limbs and other similar scenarios.</p><p>Jeff Hawkins points out that <strong>our brain learns a model</strong> of the world by <strong>paying attention to how the inputs it receives change as we move</strong> (or as those inputs move). And that takes us to the elephant in the room.</p><h3>The elephant in the room</h3><p>The world is changing constantly. <strong>Everything moves. </strong>And it makes sense that <strong>as things move</strong> and change, <strong>our brain keeps updating our model</strong> of the world (our many, many models, as we will soon see).</p><p>And just as <strong>attention mechanisms have revolutionized the deep learning field</strong> in recent years, <strong>so is attention key to how our brain learns</strong> these models.</p><p>But if our neocortex is <strong>making a very large number of predictions</strong> constantly, and <strong>adapting to any misalignments</strong> between its models and what it perceives, <strong>why don’t we notice all of those predictions</strong>, and why do we instead perceive one continuous reality?
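That predict–compare–update loop can be sketched in a few lines (a delta-rule toy with made-up numbers, nowhere near the brain’s actual machinery): the model predicts, the world answers, and only the prediction error drives the update.

```python
import numpy as np

# A delta-rule toy (made-up numbers): the model predicts what the world
# will deliver, and the prediction error -- the surprise -- drives learning.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0, 0.5])   # the "world" being modeled
model_w = np.zeros(3)                 # the internal model, initially naive

lr = 0.1                              # learning rate
for _ in range(2000):
    x = rng.normal(size=3)            # a new sensory input
    prediction = model_w @ x          # what the model expects
    outcome = true_w @ x              # what the world actually delivers
    error = outcome - prediction      # surprise = prediction error
    model_w += lr * error * x         # update the model to reduce surprise

# After enough surprises, the internal model mirrors the world.
assert np.allclose(model_w, true_w, atol=0.01)
```

Notice that when the prediction is already right, the error is zero and nothing changes; the model is only reshaped by its mistakes.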
Let’s get there, step by step.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*uNG3uYM4trdQgd5pQTYLCw.jpeg" /><figcaption>Painting by the author Javier Ideami@ideami.com</figcaption></figure><p>Through their latest research, Jeff and his team reach some fascinating insights:</p><ul><li><strong>Each of our cortical columns </strong>(roughly 150,000) <strong>is learning models of the world</strong>, of objects, of concepts, of anything you could imagine. And just as our <strong>old brain</strong> has <strong><em>place cells</em></strong><em> </em>and <strong><em>grid cells </em></strong>to make models of our surroundings, the <strong>neocortex</strong>, they propose, <strong>has equivalent cells that allow the brain to make models of objects, concepts</strong>, etc. The cortical columns <strong>are using</strong> what Jeff calls <strong>reference frames</strong>, which are like grids of a number of dimensions that <strong>help the brain organize any kind of knowledge</strong>.</li><li>Jeff tells us that <strong>thinking is a form of movement</strong>. Thinking <strong>happens when we change location</strong> <strong>within these reference frames</strong>. So what you are thinking right now, what is right now in your head, depends on where your cortical columns are at the moment within those different reference frames. And your thoughts keep evolving as your brain navigates those structures.</li></ul><p>Notice that the concept of <strong>movement</strong> is beginning to appear everywhere. <strong>Movement </strong>and the <strong>dynamic nature of systems </strong>are the elephant in the room. And we will soon discuss <strong>how that connects with the issue of adversarial examples and the limitations of much of what’s going on in today’s</strong> deep learning.</p><p>So, it is all about <strong>reference frames </strong>or <strong>maps</strong>, maps of physical spaces, maps of concepts, maps of anything.
Jeff tells us that in the same way that reference frames in the old brain are learning maps of different environments, <strong>reference frames in the neocortex</strong> are <strong>learning maps of objects </strong>(in the case of what they call “<strong><em>what” </em>columns</strong>), or the <strong>space around the body</strong> (in the case of “<strong><em>where” </em>columns</strong>), or <strong>maps of concepts</strong> within the non-sensory columns.</p><p>I love the analogy that Jeff uses about how,<strong> in order to be an expert in any domain, </strong>we need to <strong>find a good way to organize our knowledge about that domain</strong>: we need to internally create a great reference frame or map of that domain. Think of the deep and complex reference frames that, for example, Leonardo da Vinci or Einstein had in order to excel as they did within their respective areas of expertise.</p><p>All right, so <strong>each of our 150,000 cortical columns</strong> is<strong> learning a predictive model of the world </strong>as it pays attention to how the inputs change over time. And <strong>each of these columns</strong> <strong>learns models of a large number </strong>of elements, objects, concepts, etc.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*j8boylaMItD8Bl0mX4lnMw.jpeg" /><figcaption>Painting by the author Javier Ideami@ideami.com</figcaption></figure><p>So our <strong>knowledge </strong>of anything, an object or a concept, <strong>is distributed across thousands of cortical columns</strong>, across <strong>thousands of complementary models</strong>. This relates to the name of <strong>Jeff’s theory </strong>(<strong>A Thousand Brains</strong>).</p><p>And all of this connects with <strong>the flexible nature of our brain</strong>. Our neocortex <strong>doesn’t depend on a single column</strong>. Knowledge is <strong>distributed across thousands</strong> of them.
So the brain continues working even if an injury damages a set of columns (there are great examples of this in the academic literature).</p><p>The next thing to consider is:<strong> if the brain is creating new predictions</strong> every time movement happens, <strong>where are these predictions being stored?</strong></p><p>Jeff and his team propose that <strong>spikes that occur at different dendrites in a neuron are predictions</strong> (dendrites are branches of the neuron that receive inputs through their synapses). <strong>Dendrite spikes put the cell </strong>connected to them <strong>into </strong>what Jeff calls <strong>a predictive state</strong>. Therefore, <strong>predictions happen inside neurons</strong>. And these predictions <strong>change the electrical properties </strong>of the neuron and<strong> make it fire sooner </strong>than it would otherwise, but the <strong>predictions are not sent through the axon to other neurons</strong>, which <strong>explains why we are not aware of most of them</strong>. The question now is: <strong>how do we settle on a specific prediction?</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*09LxsBVPoH6xjrTItv0huA.jpeg" /><figcaption>Painting by the author Javier Ideami@ideami.com</figcaption></figure><h3>Consensus by voting</h3><p>Our <strong>perception of reality </strong>is the <strong>result of a voting process</strong>. The different <strong>cortical columns reach consensus through voting</strong>, and this is what <strong>produces a single perception</strong> that unifies different predictions coming from different parts of the system (which may also be related to a diversity of sensory inputs).</p><p><strong>Only some cells need to vote</strong>, the <strong>ones that represent, for example, a specific object</strong> that we are perceiving.
And<strong> how do they vote?</strong></p><p><strong>Most of the connections in our cortical columns</strong> are<strong> going up and down</strong> <strong>the different layers</strong> of the neocortex. But <strong>there are exceptions</strong>. Scientists have found that <strong>there are cells that send their axons</strong> (output connections) from <strong>side to side through the neocortex</strong>. Jeff and his team propose that <strong>these cells that have long distance connections are the ones responsible for the voting</strong>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*C-0QGbR278_uWt8KjAgcQQ.jpeg" /><figcaption>Painting by the author Javier Ideami@ideami.com</figcaption></figure><p><strong>As we recognize an object</strong>, our <strong>cortical columns have reached a consensus</strong> on what it is that we are looking at. The <strong>voting cells </strong>(neurons) <strong>in each of our columns make a stable pattern </strong>that represents that object and where the object is located in relation to us.</p><p>And <strong>as long as we keep perceiving the same object,</strong> the<strong> state of those voting neurons doesn’t change</strong> while we keep interacting with that element. Other neurons will change their state as we move or the object moves, but<strong> the voting neurons will remain stable</strong>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*-BemTJsQruCi3dHMi1WCog.jpeg" /><figcaption>Painting by the author Javier Ideami@ideami.com</figcaption></figure><p><strong>This is why our perception is stable </strong>and <strong>we are not aware of </strong>the flurry of activity related to <strong>the moving predictions </strong>that are taking place. 
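A crude way to picture this voting (purely illustrative, and far simpler than the biology): each column holds its own hypothesis about what is being perceived, and the consensus is simply the majority, so a minority of dissenting columns cannot change the percept.

```python
from collections import Counter

# Purely illustrative: each "column" holds a hypothesis about the object
# being perceived; the long-range voting cells settle on the majority.
def vote(column_hypotheses):
    return Counter(column_hypotheses).most_common(1)[0][0]

columns = ["cup"] * 140 + ["bowl"] * 7 + ["vase"] * 3   # 150 toy columns
assert vote(columns) == "cup"

# Perturb a minority of columns: the consensus percept does not change.
columns[:20] = ["bowl"] * 20
assert vote(columns) == "cup"
```

The stable thing is the winning pattern, not any individual column, which is why the percept survives local disagreement.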
<strong>We are just aware of the final stable patterns that arise from the consensus</strong> reached by the cells that are voting.</p><p>Therefore:</p><ul><li><strong>Movement is key</strong> in <strong>how our brain perceives</strong> the world. It is thanks to movement (our own or that of the world around us) that <strong>our brain can enrich its internal models </strong>of objects and concepts (movement doesn’t have to be physical, it can be virtual, etc).</li><li>Using <strong>a single principle to process all kinds of inputs</strong> and creating <strong>thousands of predictions and models for each</strong> element we interact with makes these <strong>models rich and versatile</strong>.</li><li>The <strong>consensus mechanism based on voting</strong> means that our <strong>perception </strong>of the world <strong>is stable</strong> and at the same time <strong>flexible and resilient.</strong></li></ul><p>It is time to return to adversarial examples and the status of the deep learning field.</p><h3>How to defeat adversarial examples</h3><p><strong>Human beings</strong> are <strong>not immune to adversarial examples.</strong> Perturbations in our sensory inputs can confuse us and make us interpret things incorrectly. Most of us have experienced a wide range of optical illusions. However, in general, <strong>our perception is consistent and quite resilient, </strong>and certainly way more consistent than the one we find in today’s deep learning systems, where invisible changes can completely derail our results.</p><p>What is behind this resiliency, consistency and flexibility? Whatever it is, it may include some of the following:</p><ul><li>The <strong>models</strong> created by our cortical columns are <strong>based on movement and the creation of reference frames.</strong> As we move around or the world moves around us, our <strong>brain creates thousands of predictions and models of each object or concept</strong>.
This provides <strong>flexibility</strong>. We are not putting all our eggs in one basket. Just as when we use ensembling in deep learning, we are <strong>betting on thousands of angles</strong> on the problem, not just one.</li><li>Our perception is based on <strong>multimodality and stable voting dynamics</strong>. The different models created in relation to a specific object (or concept, etc.) are using multiple <strong>predictions </strong>that are often <strong>connected to different sensory modalities </strong>(<strong>vision, touch, hearing, gestures, </strong>etc). <strong>Voting </strong>among the cells that are responsible for the final representations <strong>produces stable patterns that are resilient</strong> to changes. A <strong>minimal change </strong>in the object or context <strong>will not derail the stable voting pattern</strong> because such a pattern is based on the combination of thousands of separate predictions, which in turn are based on a combination of many different angles, perspectives and also, often, different sensory modalities. Just as <strong>ensembling </strong>often wins in Kaggle competitions, a kind of ensembling is also taking place in the brain, making our human perception <strong>stable, resilient and flexible </strong>(relatively speaking, of course, but especially so in comparison with current deep learning systems).</li></ul><p>So, “<strong>the end” of adversarial examples in deep learning</strong>, and by “end” <strong>I don’t mean absolute end, </strong>just <strong>getting to a level of resiliency, consistency and flexibility similar to the one we have </strong>as human beings, will be possible with a combination of:</p><ul><li><strong>Movement</strong>: physical or virtual. <strong>Deep learning systems need</strong> to be able to <strong>gather different perspectives and angles on the world</strong> by <strong>enriching their internal models </strong>as they move or as the world moves around them.
<strong>Robotics and AI will have to merge further</strong>. Beyond robotics, <strong>movement can also be virtual</strong>, so this principle <strong>goes beyond the physical</strong>.</li><li><strong>A collection of models:</strong> we have to <strong>go beyond single representations or models.</strong> To become resilient to adversarial examples and other challenges, deep learning needs to <strong>generate a large number of predictions and models that get updated continuously. </strong>A <strong>voting mechanism </strong>can then <strong>create stable patterns and representations</strong> that will be way more resilient to adversarial perturbations.</li><li><strong>Continuous learning:</strong> The world out there is not waiting. A consequence of the above is that<strong> learning needs to be continuous.</strong> <strong>Deep learning</strong> systems are<strong> too static</strong> nowadays. Continuous learning is an active area of research, and its importance will only increase going forward.</li><li><strong>Reference frames:</strong> We can find much inspiration about how to build our representations and models from the reference frames described by Jeff Hawkins in his book and theory. As Jeff points out, deep learning leaders like <strong>Geoffrey Hinton </strong>have already been working for quite some time on making deep learning models more flexible (see capsule networks). But there is still a long road ahead, and it is becoming clear that the latest neuroscience research is reinforcing that direction with new hints; our brain is way more flexible and resilient than our deep learning models, and now we are beginning to understand why.</li></ul><p>Researching new ways of <strong>detecting adversarial examples</strong> is an interesting area with much academic activity.
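The “collection of models plus voting” idea above can be sketched with a toy ensemble (hypothetical numbers, and assuming independent errors, which real ensembles only approximate): many individually unreliable models, voting together, become far more stable than any single one.

```python
import numpy as np

# Toy ensemble (hypothetical numbers, independent errors assumed): each
# model is right only ~70% of the time, but a majority vote across many
# of them is almost never wrong.
rng = np.random.default_rng(1)
n_models, n_inputs, p_correct = 1001, 500, 0.7

# True/False: did model i classify input j correctly?
correct = rng.random((n_models, n_inputs)) < p_correct

single_accuracy = correct[0].mean()                       # one brittle model
ensemble_accuracy = (correct.mean(axis=0) > 0.5).mean()   # majority vote

assert ensemble_accuracy > single_accuracy
```

A perturbation that flips a handful of the individual votes leaves the majority untouched, which is exactly the stability the voting cells provide in the brain.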
What is missing now is to <strong>rethink our deep learning architectures and systems </strong>to<strong> transition from the current static paradigm</strong> <strong>to a dynamic one </strong>based on <strong>multi-modal, multi-model, consensus-based predictive systems </strong>that are <strong>resilient, consistent and flexible</strong>. When we reach that point, we will be able to hide or perturb parts of our systems and still maintain stable predictions.</p><p>As Jeff points out, this <strong>will become more and more crucial</strong> as we try to <strong>apply AI</strong> systems to <strong>scenarios that require a lot of flexibility and resilience.</strong></p><p><strong>Mountcastle’s </strong>ideas, <strong>Sapolsky’s </strong>thoughts and our fascination with the GPT architecture, all of those things indicate <strong>the importance of volume.</strong> <strong>Volume matters.</strong> Having <strong>3 times more neurons</strong>, or <strong>thousands of copies of the same basic circuit</strong>, or hundreds of billions of parameters rather than 1 billion, <strong>all of that matters.</strong></p><p>And that’s good news for the current state of the deep learning field. With projects such as the <strong>GPT </strong>system<strong>,</strong> we are discovering and confirming that fact, that <strong>volume matters</strong>.</p><p><strong>But, </strong>what we are also beginning to realize is that, as much as volume matters, it will <strong>not </strong>be <strong>enough </strong>to take us where we want to go.</p><p>If you follow the latest conversations about systems like GPT-3 in a range of podcasts and venues, say for example at the <a href="https://www.youtube.com/watch?v=iccd86vOz3w"><strong>Machine Learning Street Talk</strong></a><strong> podcast</strong>, you will hear similar conclusions. <strong>GPT-3 is very impressive, but it is also kind of delicate, brittle, and it often feels like a hack. 
</strong>That is a far cry from how resilient and flexible human brains are.</p><p><strong>Volume matters. But so does movement.</strong> We <strong>cannot escape movement and change just through sheer volume</strong>. The world is like a storm that never stops.</p><p><strong>We are the static tree</strong> that <strong>gets larger</strong> and larger but <strong>keeps breaking </strong>over and over because it <strong>lacks the capacity to move </strong>with the storm.</p><p>Thinking is movement. Movement through reference frames. Movement across thousands of predictions and models unified through consensus mechanisms.</p><p><strong>The way forward is through movement.</strong></p><p><strong><em>Afterword</em></strong>: In his book <strong>“The Master Algorithm”</strong>, <strong>Pedro Domingos </strong>writes about different paradigms connected with deep learning: <strong>symbolists, connectionists, evolutionaries, Bayesians and analogizers</strong>. It’s clear that the path towards AGI could come through many different routes and combinations of approaches. With regard to Jeff and his team’s work and theory, I am following, as Professor <strong>Kenneth Stanley</strong> would say, a <strong>gradient of interestingness </strong>(and the magnitude of this gradient in regard to Jeff’s work is pretty strong). It feels to me that Jeff’s theory and work (alongside his talented team) could hold very interesting and useful <strong>stepping stones that could bring us closer to AGI</strong> (or at a minimum their research could point us towards those stepping stones).
So yes, we could get to AGI in many different ways, but so far <strong>the only intelligent system we know that is resilient and flexible enough is the one on top of our shoulders.</strong> So it does make a lot of sense that exploring in depth the latest research coming from neuroscience may point us towards useful stepping stones on the way to AGI.</p><p>And if you want to go deep into the mysteries of the entities that inspire our deep learning AI systems and make your thoughts possible, check the related article below.</p><p><a href="https://towardsdatascience.com/journey-to-the-center-of-the-neuron-c614bfee3f9">Journey to the center of the neuron</a></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*cCugHEjegCwskH4AljEP6w.jpeg" /><figcaption>Painting by the author Javier Ideami@ideami.com</figcaption></figure><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=d214d222c4cb" width="1" height="1" alt=""><hr><p><a href="https://medium.com/data-science/towards-the-end-of-deep-learning-and-the-beginning-of-agi-d214d222c4cb">Towards the end of deep learning and the beginning of AGI</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Convolutional layer hacking with Python and Numpy]]></title>
            <description><![CDATA[<div class="medium-feed-item"><p class="medium-feed-image"><a href="https://medium.com/data-science/convolutional-layer-hacking-with-python-and-numpy-e5f64812ca0c?source=rss-7f7b5d730c84------2"><img src="https://cdn-images-1.medium.com/max/2600/1*PAQcwAZSdhfbOKtDgZ22rQ.jpeg" width="3000"></a></p><p class="medium-feed-snippet">Create a convolutional layer from scratch in python, hack its weights with custom kernels, and verify that its results match with pytorch</p><p class="medium-feed-link"><a href="https://medium.com/data-science/convolutional-layer-hacking-with-python-and-numpy-e5f64812ca0c?source=rss-7f7b5d730c84------2">Continue reading on TDS Archive »</a></p></div>]]></description>
            <link>https://medium.com/data-science/convolutional-layer-hacking-with-python-and-numpy-e5f64812ca0c?source=rss-7f7b5d730c84------2</link>
            <guid isPermaLink="false">https://medium.com/p/e5f64812ca0c</guid>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[towards-data-science]]></category>
            <category><![CDATA[deep-learning]]></category>
            <dc:creator><![CDATA[Javier Ideami]]></dc:creator>
            <pubDate>Mon, 15 Mar 2021 12:41:09 GMT</pubDate>
            <atom:updated>2021-03-15T12:41:09.807Z</atom:updated>
        </item>
    </channel>
</rss>