Introduction

Wigner’s remark about the ‘unreasonable effectiveness’ of mathematics in allowing us to understand physical phenomena (Wigner, 1959) is famously contrasted by Gelfand’s quip about its ‘unreasonable ineffectiveness’ in doing the same for biology (Borovik, 2021). Considering the component of randomness that is inherent to evolution, this may not be all that surprising. However, while this argument holds just as well for the brain at the cellular level, ultimately brains are computing devices. At the level of computation, machine learning and neuroscience have revealed near-optimal strategies for information processing and storage, and evolution is likely to have found similar principles through trial and error (Hassabis et al., 2017). Thus, we have reason to hope for the existence of fundamental principles of cortical computation that are similar to those we have found in the physical sciences. In the end, it is important for such approaches to relate these principles back to brain phenomenology and to connect function to structure and dynamics.

In physics, a fundamental measure of ‘effort’ is the action of a system, which nature seeks to ‘minimize’. Given an appropriate description of interactions between the system’s constituents, the least-action principle can be used to derive the equations of motion of any physical system (Feynman et al., 2011; Coopersmith, 2017). Here, we suggest that in biological information processing, a similar principle holds for prediction errors, which are of obvious relevance for cognition and behavior. Based on such errors, we formulate a neuronal least-action (NLA) principle which can be used to derive neuronal dynamics and map them to observed dendritic morphologies and cortical microcircuits. Within this framework, local synaptic plasticity at basal and apical dendrites can be derived by stochastic gradient descent on errors. The errors that are minimized refer to the errors in output neurons that are typically thought to represent motor trajectories, planned and encoded in cortical motor areas and ultimately in the spinal cord and muscles. In the context of motor control, a phenomenological ‘minimal action principle’ has previously been proposed that guides the planning and execution of movements (Feldman & Levin, 2009). Our neuronal least-action principle reformulates and formalizes the classical equilibrium point hypothesis (Latash, 2010) in a dynamical setting, linking it to optimality principles in sensory-motor control (Todorov, 2004).

Other attempts exist to link biological information processing and neural networks with the least-action principle, for instance by directly learning to reproduce a given trajectory (Amirikian & Lukashin, 1992), by minimizing the physical action for the muscle force generation by motor unit recruitment (Senn, Wyler, Clamann, Kleinle, Larkum, et al., 1995), minimizing cognitive prediction errors (Alonso et al., 2012), minimizing output errors with a weight-change regularization (Betti & Gori, 2016), minimizing psychomotor work (Fox & Kotelba, 2018), minimizing data transport through a network (Karkar et al., 2021), minimizing the discrimination information (Summers, 2021), or minimizing the free energy (Friston, 2010; Friston et al., 2022). Apart from the latter, however, these attempts remain far from the biology that seems to resist a formalization with the tools of physics, at least when these are applied too strictly.

The fundamental novelty of our NLA principle is the way it deals with time. In physics, bodies interact based on where they are now, irrespective of what happens in the future. Living systems, instead, interact based on what could happen in the near future, and react early to stay alive. This difference is also mirrored in the way our NLA principle looks for an error-minimizing trajectory of brain states. We postulate that the brain trades with near-future states and seeks a path that minimizes the errors of these future states at any moment in time. Looking ahead towards what will likely happen allows the network to correct the internal trajectory of deep neurons early enough so that the delayed output moves along the desired path. The notion of looking into the future to gate a dynamical system is also central in optimal control theory (as expressed by the Bellman equation, see e.g. Todorov, 2006). Yet, starting with a neuronal action is more principled as it includes the derivation of the dynamical system itself that will be optimally controlled.

The insight into the time structure of biological information processing allows us to express a simple form of a total ‘mismatch energy’ for our cortical neuronal networks, from which we derive the dynamic neuronal and synaptic laws. In short, the mismatch energy within a single pyramidal neuron is the squared prediction error between the basal dendrites and the soma, with the apical dendrites receiving top-down feedback. The apical dendrites calculate a local prospective prediction error that looks ahead in time and overcomes neuronal integration delays (Fig. 1a). As a consequence, the output neurons are corrected on the fly by the prospective error processing, pushing them in real time closer to the desired path. In addition, the prospective errors are suited for gradient learning of the sensory synapses on the basal dendrites. This gradient learning is proven to reduce the error in the output neurons at any moment in time.

Somato-dendritic mismatch energies and the neuronal least-action (NLA) principle.

(a1) Sketch of a cross-cortical network of pyramidal neurons described by the NLA. (a2) Correspondence between elements of the NLA and biological observables such as membrane voltages and synaptic weights. (b1) The NLA principle postulates that small variations δũ (dashed) of the trajectories ũ (solid) leave the action invariant, δA = 0. It is formulated in the look-ahead coordinates ũ (symbolized by the spyglass) in which ‘hills’ of the Lagrangian (shaded grey zones) are foreseen by the prospective voltage so that the trajectory can turn early enough to circumvent them. (b2) In the absence of output nudging (β = 0), the trajectory u(t) is solely driven by the sensory input, and prediction errors and energies vanish (L = 0, outer blue trajectory at bottom). When nudging the output neurons towards a target voltage (β > 0), somato-dendritic prediction errors appear, the energy increases (red dashed arrows symbolizing the growing ‘volcano’) and the trajectory u(t) moves out of the L = 0 hyperplane, riding on top of the ‘volcano’ (red trajectory). Synaptic plasticity reduces the somato-dendritic mismatch along the trajectory by optimally ‘shoveling down the volcano’ (blue dashed arrows) while the trajectory settles in a new place on the L = 0 hyperplane (inner blue trajectory at bottom).

The NLA principle builds on and integrates various ingredients from existing work and theories. Output neurons, be they motor neurons or decision making neurons, are postulated to be ‘nudged’ towards the desired target time course by additional synaptic input to the soma or the proximal apical dendrite, as described in Urbanczik & Senn, 2014. The cortical microcircuit with lateral ‘inhibition’ that seeks to cancel the top-down feedback in order to extract the apical error is inspired by Sacramento et al., 2018 and Haider et al., 2021. The energy-based approach for describing error-backpropagation for weak nudging is borrowed from the Equilibrium Propagation algorithm (Scellier & Bengio, 2017) that we generalize from a steady-state algorithm to real-time computation in cross-cortical microcircuits. Our theory covers both cases of weak and strong output nudging. For strong nudging, it likewise generalizes the least-control principle (Meulemans, Zucchet, et al., 2022) and the prospective configuration algorithm (Song et al., 2024) from a steady-state to a dynamic real-time version, linking to optimal feedback control (Todorov & Jordan, 2002). Finally, the apical activity of our pyramidal neurons can be seen in the tradition of predictive coding (Rao & Ballard, 1999), where cortical feedback connections try to explain away lower-level activities. Yet, different from classical predictive coding, our prediction errors are integrated in the soma, and these errors are prospective in time. The errors extrapolate from current to future activities, so that their integration improves the network output in real time. The combination of an energy-based model with prospective coding, in which neuronal integration delays are compensated on the fly, is also present in Haider et al., 2021.

The paper is organized as follows: we first define the prospective somato-dendritic mismatch error, construct from this the mismatch energy of a network, and ‘minimize’ this energy to obtain the error-corrected, prospective voltage dynamics of the network neurons. We then show that the prospective error coding leads to an instantaneous and joint processing of low-pass filtered input signals and backpropagated errors. Applied to motor control, the instantaneous processing is interpreted as a moving equilibrium hypothesis according to which sensory inputs, network state, motor commands and muscle feedback are in a self-consistent equilibrium at any point of the movement. We then derive a local learning rule that globally minimizes the somato-dendritic mismatch errors across the network, and show how this learning can be implemented through error-extracting cortical microcircuits and dendritic predictive plasticity.

Results

Somato-dendritic mismatch errors and the Lagrangian of cortical circuits

We consider a network of neurons – identified as pyramidal cells – with firing rates ri(t) in continuous time t. The somatic voltage ui of pyramidal neuron i is driven by the close-by basal input current, Σj Wij rj, with presynaptic rates rj and synaptic weights Wij, and an additional distal apical input ei that will be learned to represent a prospective prediction error at any moment in time (Fig. 1a). While in classical rate-based neuron models the firing rate ri of a neuron is a function of the somatic voltage, ρ(ui), the NLA principle implies that the effective firing rate of a cortical neuron is prospective. More concretely, the formalism derives a firing rate that linearly extrapolates from ρ(ui) into the future with the temporal derivative, ri = ρ(ui) + τ dρ(ui)/dt, where dρ(ui)/dt represents the temporal derivative of ρ(ui(t)). There is experimental evidence for such prospective coding in cortical pyramidal neurons where the instantaneous rate ri is in fact not only a function of the underlying voltage, but also a function of how quickly that voltage increases (see Fig. 2a).

Prospective coding in cortical pyramidal neurons enables instantaneous voltage-to-voltage transfer.

(a1) The instantaneous spike rate of cortical pyramidal neurons (top) in response to sinusoidally modulated noisy input current (bottom) is phase-advanced with respect to the input (adapted from Köndgen et al., 2008). (a2) Similarly, in the NLA, the instantaneous firing rate of a model neuron (r = ρ(u) + τ dρ(u)/dt, black) is phase-advanced with respect to the underlying voltage (u, red), postulating that the low-pass filtered rate is a function of the voltage, r̄ = ρ(u). (b) Dendritic input in the apical tree (here called ē) instantaneously causes a somatic voltage modulation (u; modeling data from Ulrich, 2002). The low-pass filtering with τ along the dendritic shaft is compensated by a lookahead mechanism in the dendrite, e = ē + τ dē/dt. In Ulrich (2002) a phase advance is observed even with respect to the dendritic input current, not only the dendritic voltage, although only for slow modulations (as here). (c) While the voltage of the first neuron (u1) integrates the input rates rin from the past (bottom black upward arrows), the output rate r1 of that first neuron looks ahead in time, r1 = ρ(u1) + τ dρ(u1)/dt (red dashed arrows pointing into the future). The voltage of the second neuron (u2) integrates the prospective rates r1 (top black upward arrows). By doing so, it inverts the lookahead operation, resulting in an instantaneous transfer from u1(t) to u2(t) (blue arrow and circles).
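To make the prospective rate concrete, the following minimal sketch (plain Python; the sinusoidal drive, time constant, and logistic transfer function are illustrative choices, not parameters from the paper) computes r = ρ(u) + τ dρ(u)/dt for a sinusoidally modulated voltage and reproduces the phase advance of Fig. 2a numerically.

```python
import numpy as np

# Illustrative parameters, not taken from the paper
tau = 0.01                    # membrane time constant (10 ms)
f = 5.0                       # modulation frequency (Hz)
t = np.arange(0.0, 0.4, 1e-4)

u = np.sin(2 * np.pi * f * t)                # somatic voltage (arbitrary units)
rho = lambda v: 1.0 / (1.0 + np.exp(-v))     # assumed logistic transfer function

r_bar = rho(u)                               # instantaneous rate r̄ = ρ(u)
r = r_bar + tau * np.gradient(r_bar, t)      # prospective rate r = ρ(u) + τ dρ(u)/dt

# The prospective rate peaks before the voltage does: a phase advance,
# qualitatively matching the advanced firing in Fig. 2a.
print("voltage peak:          t = %.4f s" % t[np.argmax(u)])
print("prospective rate peak: t = %.4f s" % t[np.argmax(r)])
```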

The second central notion of the theory is the prospective error ei, which we interpret as a prospective somato-dendritic mismatch error in the individual network neurons, ei = ŭi − Σj Wij rj. It is defined as a mismatch between the prospective voltage, ŭi = ui + τ u̇i, and the weighted prospective input rates, Σj Wij rj. In the same way as the firing rates rj linearly extrapolate into the future given the current voltages uj of the presynaptic neurons j, the postsynaptic error is based on the linear extrapolation of its current voltage ui using its temporal derivative, ŭi = ui + τ u̇i. If the prospective error ei is low-pass filtered with time constant τ, it takes the form ēi = ui − Σj Wij r̄j, where r̄j is the corresponding low-pass filtered firing rate of the presynaptic neuron j (that becomes a function of the presynaptic voltage, r̄j = ρ(uj), see Methods, Sect. A). We refer to ēi as the somato-dendritic mismatch error of neuron i that, as compared to ei, is non-prospective and instantaneous.
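The relation between ei and ēi follows from one line of calculus; a minimal derivation, using only the definitions above and the low-pass filter τ dx̄/dt + x̄ = x:

```latex
\begin{aligned}
e_i &= \breve u_i - \sum_j W_{ij}\, r_j
     = (u_i + \tau\dot u_i) - \sum_j W_{ij}\,(\bar r_j + \tau\dot{\bar r}_j)
     = \Big(1 + \tau\tfrac{d}{dt}\Big)\Big(u_i - \sum_j W_{ij}\,\bar r_j\Big)\\[2pt]
\Rightarrow\quad
\bar e_i &= u_i - \sum_j W_{ij}\,\bar r_j\,,
\end{aligned}
% since low-pass filtering inverts the lookahead operator (1 + \tau\, d/dt),
% up to an exponentially decaying transient.
```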

We next interpret the mismatch error ēi in terms of the morphology and biophysics of pyramidal neurons with basal and apical dendrites. While the error ei is formed in the apical dendrite, this error is low-pass filtered and added to the somatic voltage ui, which is also driven by the low-pass filtered basal input Σj Wij r̄j, so that ui = Σj Wij r̄j + ēi. From the perspective of the basal dendrites, the low-pass filtered apical error ēi can be calculated as the difference between the somatic voltage and the local low-pass filtered basal input, ēi = ui − Σj Wij r̄j. The somatic voltage ui is assumed to be sampled in the basal dendrite by backpropagating action potentials (Urbanczik & Senn, 2014; Spicher et al., 2017). The apical error thus also appears as a ‘somato-basal’ mismatch error; both readings are summarized as the somato-dendritic mismatch error. It expresses the difference between ‘what a neuron does’, which is based on the somatic voltage ui, and ‘what the basal inputs think it should do’, which is based on its own input (Fig. 1a2). The two quantities may deviate because neuron i receives additional ‘unpredicted’ apical input from higher-area neurons that is integrated into the somatic voltage ui. What cannot be predicted in ui by the sensory-driven basal input remains as the somato-basal (somato-dendritic) mismatch error ēi.

Associated with this mismatch error is the somato-dendritic mismatch energy, defined for each network neuron i ∈ 𝒩 as the squared mismatch error,

Ei^M = ½ ēi² = ½ (ui − Σj Wij r̄j)² .    (1)

On a subset of output neurons of the whole network, 𝒪 ⊆ 𝒩, a cost is defined as a function of the somatic voltage and some instructive reference signal such as targets or a reward. When a target trajectory uo^trg(t) is available, the cost is defined at each time point as a squared target error,

C = ½ Σo∈𝒪 (uo^trg − uo)² .    (2)

Much more general mismatch energies and cost functions are conceivable, encompassing e.g. conductance-based neurons or including further dynamic variables (see SI, Sect. F). The cost represents a performance measure for the entire network that produces the output voltages uo(t) in response to some input rates rin(t). The cost directly relates to behavioral or cognitive measures such as the ability of an animal or human to perform a particular task in real time. The target could be provided by explicit external supervision, for example target movements in time encoded by uo^trg(t), it could represent an expected reward signal, or it could arise via self-supervision from other internal prediction errors.

We define the Lagrangian (or ‘total energy’) of the network as a sum across all mismatch energies and the cost, weighted by the nudging strength β of the output neurons,

L = EM + βC = Σi∈𝒩 ½ ēi² + βC .    (3)

The low-pass filtered presynaptic rates, r̄j, also encompass those of the external input neurons. While in classical energy-based approaches L is called the total energy, we call it the ‘Lagrangian’ because it will be integrated along real and virtual voltage trajectories as done in variational calculus (leading to the Euler-Lagrange equations, see below and SI, Sect. F). We ‘prospectively’ minimize L locally across a voltage trajectory so that, as a consequence, the local synaptic plasticity for Wij will globally reduce the cost along the trajectory (Theorem 1 below).

Due to the prospective coding, the Lagrangian can be minimal at any moment in time while the network dynamics evolves. This is different from classical predictive coding (Rao & Ballard, 1999) and energy-based approaches (Scellier & Bengio, 2017; Song et al., 2024), where a stimulus needs to be fixed in time while the network relaxes to a steady state, and only there is the prediction error minimized (see SI, Sect. C, comparison with Equilibrium Propagation).

The least-action principle expressed for prospective firing rates

Motivated by the prospective firing in pyramidal neurons, we postulate that cortical networks strive to look into the future to prevent instantaneous errors. Each neuron tries to move along a trajectory that minimizes its own mismatch error ēi across time (Fig. 1b1). The ‘neuronal currency’ with which each neuron ‘trades’ with others to choose its own error-minimizing trajectory is the future discounted membrane potential,

ũi(t) = (1/τ) ∫_t^∞ ui(t′) e^(−(t′−t)/τ) dt′ .    (4)

The prospective voltages ũ are the ‘canonical coordinates’ entering the NLA principle, and in these prospective coordinates the overall network searches for a ‘least-action trajectory’. Since from ũ we can recover the instantaneous voltage via u = ũ − τ dũ/dt (see SI, Sect. B), we can replace u in the Lagrangian and obtain L as a function of our new prospective coordinates ũ and the ‘velocities’ dũ/dt, i.e. L = L(ũ, dũ/dt), where bold fonts represent vectors. Inspired by the least-action principle from physics, we define the neuronal action A as a time-integral of the Lagrangian,

A = ∫ L(ũ, dũ/dt) dt .    (5)
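The inversion u = ũ − τ dũ/dt follows in one line from the definition of ũ (Eq. 4); a minimal derivation sketch:

```latex
% Write \tilde u(t) = \frac{1}{\tau}\, e^{t/\tau}\!\int_t^{\infty} e^{-t'/\tau}\, u(t')\, dt'
% and differentiate by the product rule (boundary term from the lower limit):
\begin{aligned}
\dot{\tilde u}(t) \;=\; \frac{1}{\tau}\,\tilde u(t) \;-\; \frac{1}{\tau}\, u(t)
\quad\Longrightarrow\quad
u(t) \;=\; \tilde u(t) \;-\; \tau\, \dot{\tilde u}(t)\,.
\end{aligned}
```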

The NLA principle postulates that the trajectory ũ(t) keeps the action A stationary with respect to small variations δũ (Fig. 1b1). In other words, nature chooses a trajectory such that, when deviating a little bit from it, say by δũ, the value of A will not change (or at most up to second order in the variation), formally δA = 0. The motivation to search for a trajectory that keeps the action stationary is borrowed from physics. The motivation to search for a stationary trajectory by varying the near-future voltages ũ, instead of u, reflects the evolutionary pressure in biology to ‘think ahead of time’. To avoid reacting too late, internal delays involved in the integration of external feedback need to be considered and eventually overcome. In fact, only for the ‘prospective coordinates’ defined by looking ahead into the future, even if only virtually, does real-time learning from feedback errors become possible (as expressed by our Theorems below).

The equations of motion that keep the action stationary with respect to these prospective coordinates are known to satisfy the Euler-Lagrange equations

∂L/∂ũ − d/dt ∂L/∂(dũ/dt) = 0 .    (6)

Applying these equations to our Lagrangian yields a prospective version of the classical leaky integrator voltage dynamics, with rates r and errors e that are looking into the future (Methods, Sects A, B),

τ u̇ = −u + W r + e ,    (7a)
ē = ρ′(u) · (Wnet^⊤ ē) + β ē^trg .    (7b)

The ‘·’ denotes the component-wise product, and the weight matrix splits into weights from input neurons and weights from network neurons, W = (Win, Wnet). While for output neurons a target error can be defined, ēo^trg = uo^trg − uo, for non-output neurons i no target exists and we hence set ēi^trg = 0. In a control theoretic framework, the neuronal dynamics (Eq. 7a) represents the state trajectory, and the adjoint error dynamics (Eq. 7b) represents the integrated costate trajectory (Todorov, 2006).

From the point of view of theoretical physics, where the laws of motion derived from the least-action principle contain an acceleration term (as in Newton’s law of motion, for example for a harmonic oscillator), one may wonder why no second-order time derivative appears in the NLA dynamics. As an intuitive example, consider driving into a bend. Looking ahead in time helps us to reduce the lateral acceleration by braking early enough, as opposed to braking only when the lateral acceleration is already present. This intuition is captured by minimizing the neuronal action A with respect to the discounted future voltages ũi instead of the instantaneous voltages ui. Keeping up an internal equilibrium in the presence of a changing environment requires looking ahead and compensating early for the predicted perturbations. Technically, the acceleration disappears because the Euler-Lagrange operator (Eq. 6) turns into a lookahead-gradient operator, (1 + τ d/dt) ∂L/∂u = 0, since the second-order term is absorbed via u = ũ − τ dũ/dt (see Methods, Sect. A, and SI, Sect. F for the link to the least-action principle in physics).
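The collapse to first order can be made explicit with the chain rule, using only u = ũ − τ dũ/dt:

```latex
% Since L(\tilde u, \dot{\tilde u}) = L(u) with u = \tilde u - \tau\dot{\tilde u}:
\begin{aligned}
\frac{\partial L}{\partial \tilde u} = \frac{\partial L}{\partial u}\,, \qquad
\frac{\partial L}{\partial \dot{\tilde u}} = -\tau\,\frac{\partial L}{\partial u}\,,\\[3pt]
\frac{\partial L}{\partial \tilde u} - \frac{d}{dt}\frac{\partial L}{\partial \dot{\tilde u}}
 = \left(1 + \tau\,\frac{d}{dt}\right)\frac{\partial L}{\partial u} = 0\,.
\end{aligned}
% The d/dt acts on a first-order gradient only; no acceleration term survives.
```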

Mathematically, the voltage dynamics in Eq. 7a specifies an implicit differential equation since u̇ also appears on the right-hand side. This is because the prospective rates r include u̇ through r = ρ(u) + τ ρ′(u) u̇. Likewise, the prospective errors e = ē + τ dē/dt, with ē given in Eq. 7b and plugged into Eq. 7a, imply u̇ through dē/dt. Nevertheless, the voltage dynamics can be stably run by replacing u̇(t) on the right-hand side of Eq. 7a with the temporal derivative from the previous time step (technically, the Hessian of L with respect to the voltages is required to be strictly positive definite, see Methods Sect. F and SI, Sect. C). This ensures that the voltage dynamics of Eq. 7 can be implemented in cortical neurons with a prospective firing and a prospective dendritic error (see Fig. 2).
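A minimal Euler-integration sketch of this scheme (not the authors’ simulation code; network size, weights, transfer function and external drive are placeholders, and output nudging is omitted, so the error term drops with β = 0):

```python
import numpy as np

rng = np.random.default_rng(0)
N, tau, dt, T = 20, 10e-3, 1e-4, 0.2            # placeholder sizes and times
W = rng.normal(0.0, 0.3 / np.sqrt(N), (N, N))   # placeholder network weights
rho = np.tanh                                    # placeholder transfer function
drive = lambda t: 0.5 * np.sin(2 * np.pi * 3 * t) * np.ones(N)  # toy input

u = np.zeros(N)
u_dot = np.zeros(N)            # voltage derivative kept from the previous step
for step in range(int(T / dt)):
    t = step * dt
    # prospective rate r = rho(u) + tau * rho'(u) * u_dot, with the previous u_dot
    r = rho(u) + tau * (1.0 - rho(u) ** 2) * u_dot
    # leaky-integrator dynamics, Eq. 7a with the error term omitted (beta = 0)
    u_dot = (-u + W @ r + drive(t)) / tau
    u = u + dt * u_dot
```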

The error expression in Eq. 7b is reminiscent of error backpropagation (Rumelhart et al., 1986) and can in fact be related to it (Methods, Sect. C). Formally, the errors are backpropagated via the transposed network matrix, Wnet^⊤, modulated by ρ′(u), the derivative of r̄ = ρ(u) with respect to the underlying voltage. While the transpose can be constructed with various local methods (see Akrout et al., 2019; Max et al., 2022), in our simulations we mainly adhere to the phenomenon of feedback alignment (Lillicrap, Cownden, et al., 2016) and consider fixed and randomized feedback weights B (unless stated otherwise). Recent control theoretical work exploits the same prospective coding technique as expressed in Eq. 7 to tackle general time-varying optimization problems (see Simonetto et al., 2020 for a review and the SI, Sect. C for the detailed connection).

Prospective coding in neurons and instantaneous propagation

The prospective rates and errors entering via r and e in the NLA (Eq. 7) are consistent with the prospective coding observed in cortical pyramidal neurons in vitro (Köndgen et al., 2008). Upon sinusoidal current injection into the soma, the somatic firing rate is advanced with respect to its voltage (Fig. 2a), effectively compensating for the delay caused by the current integration. Likewise, sinusoidal current injection in the apical tree causes a lag-less voltage response in the soma (Fig. 2b, Ulrich, 2002). While the rates and errors in general can be reconstructed from their low-pass filterings via r = r̄ + τ dr̄/dt and e = ē + τ dē/dt, they become prospective in time because r̄ and ē are themselves instantaneous functions of the voltage u, and hence r and e depend on u̇. The derivative of the membrane potential implicitly also appears in the firing mechanism of Hodgkin-Huxley-type conductances, with a quick depolarization leading to a stronger sodium influx due to the dynamics of the gating variables (Hodgkin & Huxley, 1952). This advances the action potential as compared to a firing that would only depend on u, not u̇, giving an intuition of how such a prospective coding may arise. A similar prospective coding has been observed for retinal ganglion cells (Palmer et al., 2015) and cerebellar Purkinje cells (Ostojic et al., 2015), making a link from the visual input to the motor control.

To understand the instantaneous propagation through the network, we low-pass filter the dynamic equation τ u̇ + u = W r + e (obtained by rearranging Eq. 7a), with ē given by Eq. 7b, to obtain the somatic voltage u(t) = W r̄(t) + ē(t). At any point in time, the voltage is in a moving equilibrium between forward and backpropagating inputs. Independently of the network architecture, whether recurrent or not, the output is an instantaneous function of the low-pass filtered input and a putative correction towards the target, uo(t) = FW(r̄in(t), ē(t)), see Fig. 2c and Methods, Sect. D. The mapping again expresses an instantaneous propagation of voltages throughout the network in response to both the low-pass filtered input r̄in and the feedback error ē. This instantaneity is independent of the network size, and in a feed-forward network it is independent of the depth (see also Haider et al., 2021, where the instantaneity is on the rates, not the voltages). In the absence of the look-ahead activity, each additional layer would slow down the network relaxation time.

Notice that an algorithmic implementation of the time-continuous dynamics of an N-layer feedforward network would still need N calculation steps until information from layer 1 reaches layer N. However, this does not imply that an analog implementation of the prospective dynamics will encounter delays. To see why, consider a finite step-change Δu1 in the voltage of layer 1. In the absence of the look-ahead, Δu1 would be mapped within the infinitesimal time interval dt to an infinitesimal change du2 in the voltages of layer 2. But with a prospective firing rate, r1 = ρ(u1) + τ dρ(u1)/dt, a step-change Δu1 translates to a delta-function in r1, this in turn to a step-change in the low-pass filtered rates r̄1, and therefore within dt to a step-change Δu2 in the voltages u2 of the postsynaptic neurons (Fig. 2c). Iterating this argument, a step-change Δu1 propagates ‘instantaneously’ through N layers within the ‘infinitesimal’ time interval N dt to a step-change ΔuN in the last layer. When run in a biophysical device in continuous time that exactly implements the dynamical equations (Eq. 7), the implementation becomes an instantaneous computation (since dt → 0). Yet, in a biophysical device, information has to be moved across space. This typically introduces further propagation delays that may not be captured in our formalism, where low-pass filtering and prospective coding cancel each other exactly. Nevertheless, analog computation in continuous time, as formalized here, offers an idea to ‘instantaneously’ realize an otherwise time-consuming numerical recipe run on time-discrete computing systems that operate with a finite clock cycle.
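This cancellation can be checked numerically: feeding the prospective rate through a low-pass filter with the same τ returns ρ(u(t)) itself, including at a step change. A sketch with an assumed logistic transfer function and arbitrary step input:

```python
import numpy as np

tau, dt = 10e-3, 1e-5
t = np.arange(0.0, 0.1, dt)
u1 = np.where(t < 0.05, 0.0, 1.0)            # step change in the layer-1 voltage
rho = lambda v: 1.0 / (1.0 + np.exp(-v))     # assumed transfer function

r_bar1 = rho(u1)
r1 = r_bar1 + tau * np.gradient(r_bar1, dt)  # prospective rate: delta at the step

# dendritic low-pass filter seen by layer 2: tau * dx/dt = -x + r1
x = np.zeros_like(t)
for k in range(1, len(t)):
    x[k] = x[k - 1] + dt / tau * (-x[k - 1] + r1[k - 1])

# After the initial transient, x tracks rho(u1) instantaneously,
# including the step (up to discretization error around the jump).
i0 = len(t) // 2 + 5
print(np.max(np.abs(x[i0:] - r_bar1[i0:])))
```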

Prospective control and the moving equilibrium hypothesis

Crucially, at the level of the voltage dynamics (Eq. 7a) the correction is based on the prospective error e. This links our framework to optimal control theory and motor control where delays are also taken into account, so that a movement can be corrected early enough (Wolpert & Ghahramani, 2000; Todorov & Jordan, 2002; Todorov, 2004). The link between energy-based models and optimal control was recently drawn for strong nudging (β → ∞) to learn individual equilibrium states (Meulemans, Zucchet, et al., 2022). Our prospective error e(t) appears as a ‘controller’ that, when looking at the output neurons, pushes the voltage trajectories towards the target trajectories. Depending on the nudging strength β, the control is tighter or weaker. For infinitely large β, the voltages of the output neurons are clamped to the time-dependent target voltages, uo(t) = uo^trg(t) (implying ēo^trg → 0), while their errors, ēo = β ēo^trg, stay finite and instantaneously correct all network neurons. For small β, the output voltages are only weakly controlled, and they are dominated by the forward input, uo ≈ FW(r̄in).

To show how the NLA principle with the prospective coding globally maps to cortico-spinal circuits, we consider the example of motor control. In the context of motor control, our network mapping can be seen as a forward internal model that quickly calculates an estimate of the future muscle length uo based on some motor plans, sensory inputs, and the current proprioceptive feedback (Fig. 3a). Forward models help to overcome delays in the execution of the motor plan by predicting the outcome, so that the intended motor plans and commands can be corrected on the fly (Kawato, 1999; Wolpert & Ghahramani, 2000).

Moving equilibrium hypothesis for motor control and real-time learning of cortical activity.

(a) A voluntary movement trajectory can be specified by the target length of the muscles in time, uo^trg(t), encoded through the γ-innervation of muscle spindles, and the deviation of the effective muscle lengths uo(t) from the target, uo − uo^trg. The Ia-afferents emerging from the spindles prospectively encode the error, so that their low-pass filtering is roughly proportional to the length deviation, truncated at zero (red). The moving equilibrium hypothesis states that the low-pass filtered input r̄in, composed of the movement plan and the sensory input (here encoding the state of the plant, e.g. through visual and proprioceptive input, r̄in^vis and r̄in^prop), together with the low-pass filtered error feedback from the spindles, ēo, instantaneously generate the muscle lengths, uo, and are thus at any point in time in an instantaneous equilibrium (defined by Eq. 7). (b1) Intracortical iEEG activity recorded from 56 deep electrodes and projected to the brain surface. Red nodes symbolize the 56 iEEG recording sites modeled alternately as input or output neurons, and blue nodes symbolize the 40 ‘hidden’ neurons for which no data is available, but which are used to reproduce the iEEG activity. (b2) Corresponding NLA network. During training, the voltages of the output neurons were nudged by the iEEG targets (black input arrows, applying to all red output neurons). During testing, nudging was removed for 14 out of these 56 neurons (here, represented by neurons 1, 2, 3). (c1) Voltage traces for the 3 example neurons in b2, before (blue) and after (red) training, overlaid with their iEEG target traces (grey). (c2) Total cost, integrated over a window of 8 s for the 56 output nodes during training with sequences of the same duration. The cost for the test sequences was evaluated on an 8 s window not used during training.

The observation that muscle spindles prospectively encode the muscle length and velocity (Dimitriou & Edin, 2010) suggests that the prospective coding in the internal forward model mirrors the prospective coding in the effective forward pathway. This forward pathway leads from the motor plan to the spindle feedback, integrating also cerebellar and brainstem feedback (Kawato, 1999). Based on the motor plans, the intended spindle lengths and the effective muscle innervation are communicated via descending pathways to activate the γ- and α-motoneurons, respectively (Li et al., 2015). The mapping from the intended arm trajectory to the intended spindle lengths via γ-innervation is mainly determined by the joint geometry. The mapping from the intended arm trajectory to the force-generating α-innervation, however, also needs to take account of the internal and external forces, and this engages our network W.

When we prepare an arm movement, spindles in antagonistic muscle pairs that measure the muscle length are tightened or relaxed before the movement starts (Papaioannou & Dimitriou, 2021). According to the classical equilibrium-point hypothesis (Feldman & Levin, 2009; Latash, 2010), top-down input adjusts the activation threshold of the spindles through (γ-)innervation from the spinal cord so that slight deviations from the equilibrium position can be signaled (Fig. 3a). We postulate that this γ-innervation acts also during the movement, setting an instantaneous target uo^trg(t) for the spindle lengths. The effective lengths of the muscle spindles are uo, and the spindles prospectively signal back the deviation from the target through the Ia-afferents (Dimitriou & Edin, 2010; Dimitriou, 2022). The low-pass filtered Ia-afferents may be approximated by a threshold-nonlinearity, ēo = β ⌊uo − uo^trg⌋₊, with β being interpreted as spindle gain (Latash, 2018). Combining the feedback from agonistic and antagonistic muscle pairs allows for extracting the scaled target error β(uo^trg − uo). Taking account of the prospective feedback, we postulate the moving equilibrium hypothesis according to which the instructional inputs, r̄in, the spindle feedback, ēo, and the muscle lengths, uo, are at any point of the movement in a dynamic equilibrium. The moving equilibrium hypothesis extends the classical equilibrium-point hypothesis from the spatial to the temporal domain (for a formal definition of a moving equilibrium see Methods, Sect. F).

Prediction errors are also reduced when motor units within a muscle are recruited according to the size principle (Senn, Wyler, Clamann, Kleinle, Lüscher, et al., 1997), which itself was interpreted in terms of the physical least-action principle (Senn, Wyler, Clamann, Kleinle, Larkum, et al., 1995). With regard to the interpretation of the prospective feedback error as spindle activity, it is worth noticing that in humans the spindle activity is not only ahead of the muscle activation (Dimitriou & Edin, 2010), but also shares the properties of a motor error (Dimitriou, 2016). The experiments show that during the learning of a gated hand movement, spindle activity is initially stronger when making movement errors, and it returns to baseline with the success of learning. This observation is consistent with the NLA principle, which posits that proprioceptive prediction errors are minimized through the movement learning. We next address how the synaptic strengths W involved in producing the muscle length can be optimally adapted to capture this learning.

Local plasticity at basal synapses minimizes the global cost in real time

The general learning paradigm starts with input time series rin,i(t) and target time series uo^trg(t), while assuming that the target series are an instantaneous function of the low-pass filtered input series, uo^trg(t) = F*(r̄in(t)). The low-pass filtering in the individual inputs could be with respect to any time constant (that may also be learned, see SI, Sect. B). Yet, for simplicity we assume the same time constant τ for low-pass filtering the rates of the network neurons and input neurons. The goal of learning is to adapt the synaptic strengths W in the student network so that its mapping moves towards the target mapping, FW → F*. The local synaptic plasticity will also reduce the global cost C defined on the output neurons o in terms of the deviation of the voltage from the target, uo^trg − uo (Eq. 2).

The problem of changing synaptic weights to correct the behavior of downstream neurons, potentially multiple synapses away, is typically referred to as the credit assignment problem and is notoriously challenging in physical or biological substrates operating in continuous time. A core aspect of the NLA principle is how it relates the global cost C to the Lagrangian L and eventually to the somato-dendritic prediction errors ē that can be reduced through local synaptic plasticity. We define this synaptic plasticity as gradient descent on the Lagrangian with respect to the weights, Ẇij = −η ∂L/∂Wij. Since the somato-dendritic mismatch error is ēi = ui − Σj Wij r̄j, this leads to the local learning rule of the form ‘postsynaptic error times low-pass filtered presynaptic rate’,

Ẇij = η ēi r̄j .    (8)

The plasticity rule runs simultaneously with the neuronal dynamics in the presence of a given nudging strength β that tells how strongly the voltage of an output neuron is pushed towards the target, uo^trg. The learning rule is local in space since the dendritic prediction Σj Wij r̄j is represented as the voltage of the basal dendrite, and the somatic voltage u may be read out at the synaptic site on the basal dendrite from the backpropagating action potentials that sample u at a given time (Urbanczik & Senn, 2014). The basal voltage becomes the dendritic prediction of the somatic activity u, interpreting Eq. 8 as ‘dendritic predictive plasticity’.
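In discrete time, Eq. 8 is a plain outer product between the somato-dendritic mismatch and the low-pass filtered presynaptic rates; a minimal sketch (learning rate and array shapes are placeholders):

```python
import numpy as np

def dendritic_predictive_plasticity(W, u, r_bar, eta=1e-3, dt=1e-4):
    """One Euler step of Eq. 8: dW/dt = eta * e_bar * r_bar^T,
    with e_bar = u - W r_bar the somato-dendritic mismatch error."""
    e_bar = u - W @ r_bar             # somatic voltage minus basal prediction
    return W + dt * eta * np.outer(e_bar, r_bar)
```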

We have derived the neuronal dynamics as a path that keeps the action stationary. Without an external teaching signal, the errors vanish and the voltage trajectory wriggles on the bottom of the energy landscape (L = 0, Fig. 1b2). If the external nudging is turned on, β > 0, errors emerge and hills grow out of the landscape. The trajectory still tries to locally minimize the action, but it is lifted upwards on the hills (L > 0, Fig. 1b2). Synaptic plasticity reshapes the landscape so that, while keeping β fixed, the errors are reduced and the landscape flattens again. The transformed trajectory settles anew in another place (inside the ‘volcano’ in Fig. 1b2). Formally, the local plasticity rule (Eq. 8) is shown to perform gradient descent on the Lagrangian and hence on the action. In the energy landscape picture, plasticity ‘shovels off’ energy along the voltage path so that the path is lowered most efficiently. The error that is backpropagated through the network tells at any point on the voltage trajectory how much to ‘dig’ in each direction, i.e. how to adapt the basal input of each neuron in order to optimally lower the local error.

The following theorem states that synaptic plasticity pushes the network mapping towards the target mapping at any moment in time. The convergence of the mapping is a consequence of the fact that the plasticity reduces the Lagrangian L = EM + βC along its gradient.

Theorem 1

(real-time Dendritic Error Propagation, rt-DeEP)

Consider an arbitrary network W with voltage and error dynamics following Eq. 7. Then the local plasticity rule (Eq. 8), acting at each moment along the voltage trajectories, is gradient descent

  1. on the Lagrangian L for any nudging strength β ≥ 0, i.e. Ẇ = −η ∂L/∂W, with ∂L/∂W = −ē r̄^⊤.

  2. on the cost C for small nudging, β → 0, while up-scaling the error to ē → ē/β, i.e. Ẇ = −η ∂C/∂W.

The gradient statements hold at any point in time (long enough after initialization), even if the input trajectories rin(t) contain delta-functions and the target trajectories uo^trg(t) contain step-functions.

Loosely speaking, the NLA enables the network to localize in space and time an otherwise global problem: what is good for a single neuron (the local plasticity) becomes good for the entire network (the gradient on the global cost). Learning is possible at any point in time along the trajectory because the NLA infers a prospective voltage dynamics expressed in the prospective firing rates ri and prospective errors ei of the network neurons. In the limit of strong nudging (β → ∞), the learning rule performs gradient descent on the mismatch energies in the individual neurons. If the network architecture is powerful enough so that after learning all the mismatch energies vanish, Ei^M = 0, then the cost will also vanish, C = 0. This is because, for the output neurons, the mismatch error includes the target error (Eq. 7b). In the limit of weak nudging (β → 0), the learning rule performs gradient descent on C, and with this also finds a local minimum of the mismatch energies.
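The two limits can be summarized compactly; a sketch of the logic, with EM and C as defined above, following the Equilibrium Propagation reasoning discussed next:

```latex
% With L = E^{\mathrm M} + \beta C and plasticity \dot W \propto -\partial L/\partial W:
\begin{aligned}
\beta \to \infty:&\quad \dot W \;\propto\; -\frac{\partial E^{\mathrm M}}{\partial W}
   \quad \text{(outputs clamped; mismatch energies minimized),}\\[2pt]
\beta \to 0:&\quad \frac{\partial C}{\partial W}
   \;=\; \lim_{\beta\to 0}\, \frac{1}{\beta}\,\frac{\partial L}{\partial W}
   \quad \text{(up-scaling } \bar e \to \bar e/\beta \text{ recovers the cost gradient).}
\end{aligned}
```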

In the case of weak nudging and a single steady-state equilibrium, the NLA algorithm reduces to the Equilibrium Propagation algorithm (Scellier & Bengio, 2017) that minimizes the cost C for a constant input and a constant target. In the case of strong nudging and a single steady-state equilibrium, the NLA principle reduces to the Least-Control Principle (Meulemans, Zucchet, et al., 2022) that minimizes the mismatch energy EM for a constant input and a constant target, with the apical prediction error becoming the prediction error from standard predictive coding (Rao & Ballard, 1999). While in the Least-Control Principle the inputs and outputs are clamped to fixed values, the output errors are backpropagated and the network equilibrates in a steady state where the corrected network activities reproduce the clamped output activities. This state is called the ‘prospective configuration’ in Song et al., 2024 because neurons deep in the network are informed about the distal target already during the inference, and are correspondingly adapted to be consistent with this distal target. In the NLA principle, after an initial transient, the network always remains in the moving equilibrium due to the prospective coding. While inputs and targets dynamically change, the network moves along a continuous sequence of prospective configurations.

In the motor control example, the theorem states that a given target motor trajectory uo^trg(t) is learned to be reproduced with the forward model FW by applying the dendritic predictive plasticity for the network neurons (Eq. 8). We next exemplify the theory by looking into the brain, reproducing cortical activity, and showing how a multi-layer cortical network can learn a sensory-motor mapping while staying in a moving equilibrium throughout the training.

Reproducing intracortical EEG recordings and recognizing handwritten digits

As an illustration, we consider a recurrently connected network that learns to represent intracortical electroencephalogram (iEEG) data from epileptic patients (Fig. 3b). For each electrode, we assign a neuron within this network to represent the activity of the cell cluster recorded in the corresponding iEEG signal via its membrane potential. During learning, a randomly selected subset of electrode neurons is nudged towards the target activity from recorded data, while the other neurons learn to reproduce this activity. After learning, we can present only a subset of electrode neurons with previously unseen recordings and observe how the activity of the other neurons closely matches the recordings of their respective electrodes (Fig. 3c). The network derived from the NLA is thus able to learn complex correlations between signals evolving in real time by embedding them in a recurrent connectivity structure.

As an example of sensory-motor processing in the NLA framework, we next consider a well-studied image recognition task, here reformulated in a challenging time-continuous setting, and interpreted as a motor task where 1 out of 10 fingers has to be bent upon seeing a corresponding visual stimulus (see Fig. 3a). In the context of our moving equilibrium hypothesis, we postulate that during the learning phase, but not the testing phase, an auditory signal identifies the correct finger and sets the target spindle lengths of the 10 finger flexors, uo^trg(t). The target spindle length encodes the desired contraction of the flexor muscle in the correct finger upon the visual input rin(t), and a corresponding relaxation for the 9 incorrect fingers.

We train a hierarchical three-layer network on images of handwritten digits (MNIST, LeCun, 1998), with image presentation times between 0.5τ (= 5 ms) and 20τ (= 200 ms, with τ = 10 ms the membrane time constant). Fig. 4a-c1 depicts the most challenging scenario with the shortest presentation time. Synaptic plasticity is continuously active, despite the network never reaching a temporal steady state (Fig. 4b1). Due to the lookahead firing rates in the NLA, the mismatch errors ēi(t) represent the correct gradient and propagate without lag throughout the network. As a consequence, our mismatch errors are almost equal to the errors obtained from classical error backpropagation applied at each time step to the purely forward network (i.e. the network that suppresses the error-correction ē of the voltage and instead considers the ‘classical’ voltage ul = Wl ρ(ul−1) only, see blue dots in Fig. 4b2). The network eventually learned to implement the mapping with a performance comparable to error backpropagation at each dt, despite the short presentation time of only 5 ms (Fig. 4c1). The approximation is due to the fact that the NLA learns an instantaneous mapping from the low-pass filtered input rates r̄in to the output voltage uo, while the mapping from the original input rates rin to the voltages u1 of the first-layer neurons (and hence also to the output voltages uo) is delayed by τin. Since in the simulations the target voltages were switched instantaneously with rin (and not with r̄in), however, a mismatch error between uo and uo^trg remains for stimulus presentation times shorter than τin (Fig. 4c2). The Latent Equilibrium (Haider et al., 2021) avoids these temporal limitations by implementing an instantaneous mapping on the rates instead of on the voltages (Methods, Sect. F).

On-the-fly learning of finger responses to visual input with real-time Dendritic Error Propagation (rt-DeEP).

(a) Functionally feedforward network with handwritten digits as visual input (as in Fig. 3a, here from the MNIST data set, 5 ms presentation time per image), backprojections enabling credit assignment, and the activity of the 10 output neurons interpreted as commands for the 10 fingers (forward architecture: 784 × 500 × 10 neurons). (b) Example voltage trace (b1) and local error (b2) of a hidden neuron in the NLA (red) compared to an equivalent network without lookahead rates (orange). Note that neither network achieves a steady state due to the extremely short input presentation times. Errors calculated via exact backpropagation, i.e. by using the error backpropagation algorithm on a purely feedforward NLA network at every simulation time step (with output errors scaled by β), are shown for comparison (blue dots). (c) Comparison of network models during and after learning. Color scheme as in (b). (c1) The test error under the NLA evolves during learning on par with classical error backpropagation performed at each Euler dt based on the feedforward activities. In contrast, networks without lookahead rates are incapable of learning such rapidly changing stimuli. (c2) With increasing presentation time, the performance under the NLA further improves, while networks without lookahead rates stagnate at high error rates. This is caused by transient, but long-lasting misrepresentations of errors following stimulus switches: when plasticity is turned off during transients and is only active in the steady state, comparably good performance can be achieved (dashed orange). (d) Receptive fields of 6 hidden-layer neurons after training, demonstrating that even for very brief image presentation times (5 ms), the combined neuronal and synaptic dynamics are capable of learning useful feature extractors such as edge filters.

The instantaneous voltage propagation reduces an essential constraint of previous models of bio-plausible error backpropagation (e.g., Scellier & Bengio (2017), Whittington & Bogacz (2017), and Sacramento et al. (2018), with reviews Richards et al. (2019), Whittington & Bogacz (2019), and Lillicrap, Santoro, et al. (2020)): without lookahead firing rates, networks need much longer to correctly propagate errors across layers, with each layer roughly adding another membrane time constant of 10 ms, and thus they cannot cope with realistic input presentation times. In fact, in networks without lookahead output, learning is only successful if plasticity is switched off whenever the network dynamics has not yet reached a stationary state within a stimulus presentation interval (Fig. 4c2). As a comparison, the neuronal response latency to flashed stimuli in the primary visual cortex (V1) of rats in a synchronized state is on the order of 50 ms (Wang et al., 2014). The latency shortens to roughly 40 ms in a desynchronized state (Wang et al., 2014).

In humans, cortical responses to expected visual stimuli (rectangles) are observed 70-90 ms ahead of the responses observed in the absence of expectations (which have a latency of roughly 150 ms, see Blom et al., 2020). A remaining latency in the presence of prospective coding is consistent with a remaining response delay of τin in the presented NLA principle. The minimal cortical response latency of roughly 50 ms to an expected rectangle (Blom et al., 2020) compares with the roughly 80 ms cortical response time when humans have to decide whether a flashed image contains an animal or not (Thorpe et al., 1996). The ongoing higher cortical processing during animal detection does not seem to substantially slow down cortical responses to sensory stimuli. This supports a ‘sensory equilibrium hypothesis’, the sensory correlate of our ‘moving equilibrium hypothesis’ for motor control, that enables us to interact with a dynamic visual environment in almost real time (as also suggested in Blom et al., 2020). Notice that, from a technical perspective, making the time constants of individual cortical neurons arbitrarily short leads to network instabilities and is unlikely to be the option chosen by the brain (see SI Sect. C, comparison with Equilibrium Propagation).

Implementation in cortical microcircuits

So far, we did not specify how the errors e appearing in the differential equation for the voltage (Eq. 7a) are transmitted across the network in a biologically plausible manner. Building on Sacramento et al., 2018, we propose a cortical microcircuit to enable this error transport, with all neuron dynamics evolving according to the NLA principle. Although the idea applies to arbitrarily connected networks, we use the simpler case of functionally feedforward networks to illustrate the flow of information in these microcircuits (Fig. 5a).

Hierarchical plastic microcircuits implement real-time Dendritic Error Learning (rt-DeEL).

(a) Microcircuit with ‘top-down’ input (originating from peripheral motor activity, blue line) that is explained away by the lateral input via interneurons (dark red), with the remaining activity representing the error ēl. Plastic connections are denoted with a small red arrow and nudging with a dashed line. (b1) Simulated network with 784-300-10 pyramidal neurons and a population of 40 interneurons in the hidden layer, used for the MNIST learning task where the handwritten digits have to be associated with the 10 fingers. (b2) Test errors for rt-DeEL with joint tabula rasa learning of the forward and lateral weights of the microcircuit. A similar performance is reached as with classical error backpropagation. For comparability, we also show the performance of a shallow network (dashed line). (b3) Angle derived from the Frobenius norm between the lateral pathway Wl^PI Wl^IP and the feedback pathway Bl Wl+1. During training, both pathways align to allow correct credit assignment throughout the network. Indices are dropped in the axis label for readability.

For such an architecture, pyramidal neurons in area l (that is, a ‘layer’ of the feedforward network) are accompanied by a pool of interneurons in the same layer (area). The dendrites of the interneurons integrate in time (with time constant τ) lateral input from the pyramidal neurons of the same layer (rl) through plastic weights Wl^IP. Additionally, interneurons receive ‘top-down nudging’ from the pyramidal neurons in the next layer through randomly initialized and fixed backprojecting synapses Bl^I targeting the somatic region, with interneuron nudging strength β^I. The notion of ‘top-down’ originates from the functionally feedforward architecture leading from sensory to ‘higher cortical areas’. In the context of motor control, the highest ‘area’ is the last stage controlling the muscle lengths, being at the same time the first stage for the proprioceptive input (Fig. 3a).

According to the biophysics of the interneuron, the somatic membrane potential becomes a convex combination of the two types of afferent input (Urbanczik & Senn, 2014),

ul^I = (1 − β^I) Wl^IP r̄l + β^I Bl^I ul+1 .    (9)

In the biological implementation, the feedback input is mediated by the low-pass filtered firing rates r̄l+1 = ρ(ul+1), not by ul+1 as expressed in the above equation. Yet, we argue that for a threshold-linear ρ the ‘top-down nudging’ by the rate is effectively reduced to a nudging by the voltage ul+1. This is because errors are only backpropagated when the slope of the transfer function is positive, ρ′(ul+1) > 0, and hence when the upper-layer voltage is in the linear regime. For more general transfer functions, we argue that short-term synaptic depression may invert the low-pass filtered presynaptic rate back to the presynaptic membrane potential, provided that the recovery time constant τ matches the membrane time constant (see end of Results and SI, Sect. A).

Apical dendrites of pyramidal neurons in each layer receive top-down input from the pyramidal population in the upper layer through synaptic weights Bl. These top-down weights could be learned to predict the lower-layer activity (Rao & Ballard, 1999) or to become the transpose of the forward weight matrix (Bl = Wl+1^⊤, Max et al., 2022), but for simplicity we randomly initialized them and keep them fixed (Lillicrap, Santoro, et al., 2020). Besides the top-down projections, the apical dendrites also receive lateral input via an interneuron population in the same layer through synaptic weights −Wl^PI that are plastic and will be learned to obtain suitable dendritic errors. The ‘−’ sign is suggestive of these interneurons subtracting away the top-down input entering through Bl (while the weights can still be positive or negative). Assuming again a conversion of rates to voltages, also for the inhibitory neurons that may operate in a linear regime, the overall apical voltage becomes

ēl = Bl ul+1 − Wl^PI ul^I .    (10)

What cannot be explained away from the top-down input Bl ul+1 by the lateral feedback, −Wl^PI ul^I, remains as the dendritic prediction error ēl in the apical tree (Fig. 5a). If the top-down and lateral feedback weights are learned as outlined next, these apical prediction errors take the role of the backpropagated errors in the classical backprop algorithm.

To adjust the interneuron circuit in each layer (‘area’), the synaptic strengths from pyramidal to interneurons, Wl^IP, are learned to minimize the interneuron mismatch energy, El^I = ½ ‖ul^I − Wl^IP r̄l‖². The interneurons, while being driven by the lateral inputs Wl^IP r̄l, learn to reproduce the upper-layer activity that also nudges the interneuron voltage. Learning is accomplished if the upper-layer activity, in the absence of an additional error on the upper layer, is fully reproduced in the interneurons by the lateral input.

Once the interneurons have learned to represent the ‘error-free’ upper-layer activity, they can be used to explain away the top-down activities that also project to the apical trees. The synaptic strengths from the inter- to the pyramidal neurons, Wl^PI, are learned to minimize the apical mismatch energy, El^api = ½ ‖ēl‖² = ½ ‖Bl ul+1 − Wl^PI ul^I‖². While in the absence of an upper-layer error the top-down activity Bl ul+1 can be fully cancelled by the interneuron activity Wl^PI ul^I, a neuron-specific error will remain in the apical dendrites of the lower-level pyramidal neurons if there was an error endowed in the upper-layer neurons. Gradient descent learning on these two energies results in the learning rules for the P-to-I and I-to-P synapses,

Ẇl^IP = η^IP (ul^I − Wl^IP r̄l) r̄l^⊤ ,  Ẇl^PI = η^PI ēl (ul^I)^⊤ .    (11)
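Putting Eqs. 9-11 together, one learning step of the lateral microcircuit can be sketched as follows (a sketch under the linear-regime assumption that interneuron voltages stand in for their rates, as discussed above; learning rates, nudging strength, and array shapes are placeholders):

```python
import numpy as np

def microcircuit_step(W_IP, W_PI, r_bar_l, u_next, B_l, B_I,
                      beta_I=0.1, eta_IP=1e-3, eta_PI=1e-3, dt=1e-4):
    """One Euler step of the rt-DeEL microcircuit plasticity (Eqs. 9-11).
    W_IP: pyramidal-to-interneuron weights, W_PI: interneuron-to-pyramidal weights,
    r_bar_l: low-pass filtered layer-l rates, u_next: layer-(l+1) voltages."""
    # interneuron voltage: convex combination of lateral drive and nudging (Eq. 9)
    u_I = (1 - beta_I) * (W_IP @ r_bar_l) + beta_I * (B_I @ u_next)
    # apical error: top-down input minus lateral 'explaining away' (Eq. 10)
    e_api = B_l @ u_next - W_PI @ u_I
    # gradient descent on the interneuron and apical mismatch energies (Eq. 11)
    W_IP = W_IP + dt * eta_IP * np.outer(u_I - W_IP @ r_bar_l, r_bar_l)
    W_PI = W_PI + dt * eta_PI * np.outer(e_api, u_I)
    return W_IP, W_PI, e_api
```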

The following theorem on dendritic error learning states that the plasticity in the lateral feedback loop leads to an appropriate error representation in the apical dendrites of pyramidal neurons.

Theorem 2

(real-time Dendritic Error Learning, rt-DeEL)

Consider a cortical microcircuit composed of pyramidal and interneurons, as illustrated in Fig. 5a, with more interneurons in layer (‘cortical area’) l than pyramidal neurons in layer l+1, and with adaptable pyramidal-to-inhibitory weights Wl^IP within the same layer (that are nudged through top-down weights Bl^I, see Methods, Sect. E). Then, for suitable top-down nudging, learning rates, and initial conditions, the inhibitory-to-pyramidal synapses Wl^PI within each layer l (Eq. 11) evolve such that the lateral feedback circuit aligns with the bottom-up-top-down feedback circuit,

Wl^PI Wl^IP = Bl Wl+1 .    (12)

After this horizontal-to-vertical circuit alignment, the apical voltages of the layer-l pyramidal neurons (Eq. 10) represent the ‘B-backpropagated’ errors. When modulated by the postsynaptic rate derivatives, ρ′(ul), the apical voltages yield the appropriate error signals

ēl = ρ′(ul) · (Bl ēl+1)    (13)

for learning the forward weights Wl according to Eq. 8.

The backprojecting weights can also be learned by a local real-time learning rule to become the transpose of the forward weights, Bl = Wl+1^⊤ (Max et al., 2022). In this case, the error signals ēl learned in the apical dendrites according to the above theorem (Eq. 13) represent the gradient errors ē appearing in the real-time dendritic error propagation (rt-DeEP, Theorem 1). There, the errors ē drive the gradient plasticity of the general weight matrix W, split up here into the forward weights Wl to a layer l (for l = 1, .., N).

Simultaneously learning apical errors and basal signals

Microcircuits following these neuronal and synaptic dynamics are able to learn the classification of handwritten digits from the MNIST dataset while learning the apical signal representation (Fig. 5b1-2). In this case, the feedforward weights Wl and the lateral weights Wl^IP and Wl^PI are all adapted simultaneously. Including the Wl^IP-plasticity (by turning on the interneuron nudging from the upper layer, β^I > 0 in Eq. 9) greatly speeds up the learning.

With and without Wl^IP-plasticity, the lateral feedback via interneurons (with effective weight Wl^PI Wl^IP) learns to align with the forward-backward feedback via upper-layer pyramidal neurons (with effective weight Bl Wl+1, Fig. 5b3). The microcircuit extracts the gradient-based errors (Eq. 13), and the forward weights use these errors to first minimize the neuron-specific mismatch errors, and eventually the output cost.

Since the apical voltage ēl appears as a postsynaptic factor in the plasticity rule for the interneuron-to-pyramidal synapses Wl^PI (Eq. 11), this I-to-P plasticity can be interpreted as Hebbian plasticity of inhibitory neurons, consistent with earlier suggestions (Vogels et al., 2012; Bannon et al., 2020). The plasticity of the P-to-I synapses Wl^IP, in the same way as the plasticity of the forward synapses Wl, can be interpreted as learning from the dendritic prediction of somatic activity (Urbanczik & Senn, 2014).

Crucially, when choosing a large enough interneuron population, the simultaneous learning of the lateral microcircuit and the forward network can be accomplished without fine-tuning of parameters. As a case in point, all weights shared the same learning rate. Such stability bolsters the biophysical plausibility of our NLA framework and improves on the previous, more heuristic approach (Sacramento et al., 2018; Mesnard et al., 2019). The stability may be related to the nested gradient descent learning, according to which somatic and apical mismatch errors in pyramidal neurons, and somatic mismatch errors in inhibitory neurons, are minimized.

Finally, since errors are defined at the level of membrane voltages (Eq. 11), synapses need a mechanism by which they can recover the presynaptic voltage errors from their afferent firing rates. While for threshold-linear transfer functions the backpropagated voltage errors translate into rate errors (SI, Sect. A), more general neuronal nonlinearities must be matched by corresponding synaptic nonlinearities. Pfister et al. (2010) have illustrated how spiking neurons can leverage short-term synaptic depression to estimate the membrane potential of their presynaptic partners. Here, we assume a similar mechanism in the context of our rate-based neurons. The monotonically increasing neuronal activation function can be approximately compensated by a vesicle release probability that monotonically decreases with the low-pass filtered presynaptic rate (see SI, Sect. A and Fig. 6 therein). If properly matched, this leads to a linear relationship between the presynaptic membrane potential ul+1 and the postsynaptic voltage contribution.
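As a numerical illustration of this matching, the sketch below picks a simple decreasing release probability p(r̄) = 1/(1 + d·r̄), a stand-in of our own choosing rather than the paper's Eq. 42, and defines the activation implicitly by requiring p(ρ(u))·ρ(u) = u. Solving by bisection shows that the rate-weighted release reproduces the presynaptic potential, i.e. the readout is linear:

    import numpy as np

    d = 2.0
    def p(r):                      # monotonically decreasing release probability (assumed form)
        return 1.0 / (1.0 + d * r)

    def rho(u, lo=0.0, hi=1e6, iters=80):
        # matched activation, defined implicitly by p(rho(u)) * rho(u) = u;
        # p(r)*r is monotonically increasing in r, so bisection applies
        for _ in range(iters):
            mid = 0.5 * (lo + hi)
            lo, hi = (mid, hi) if p(mid) * mid < u else (lo, mid)
        return 0.5 * (lo + hi)

    for u in [0.05, 0.1, 0.2, 0.4]:
        r = rho(u)
        print(u, r, p(r) * r)      # third column reproduces u: the readout is linear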

Recovering presynaptic potentials through short-term depression.

(a1) Relative voltage response of a depressing cortical synapse (recreated from Abbott et al., 1997), identified as the synaptic release probability p. (a2) The product of the low-pass filtered presynaptic firing rate and the synaptic release probability is proportional to the presynaptic membrane potential. (a3) Average in vivo firing rate of a neuron in the visual cortex as a function of the somatic membrane potential (recreated from Anderson et al., 2000), which can be qualitatively identified with the stationary rate derived in Eq. 43.

Discussion

We introduced a least-action principle to neuroscience for deriving the basic laws of the voltage and synaptic dynamics in networks of cortical neurons. The approach is inspired by the corresponding principle in physics, where the basic laws of motion are derived across the various scales. While in physics the action is defined as the time integral of the kinetic minus the potential energy, we define the action as the time integral of the instantaneous somato-dendritic mismatch errors across network neurons plus a behavioural error. The ‘kinetics’ of a voltage trajectory only arises because we postulate that the action along a trajectory is minimized with respect to future voltages, not the instantaneous voltage, as would be done in physics. The postulate implies a prospective voltage dynamics that looks ahead in time, together with prospective local errors, in order to minimize the action and hence the somato-dendritic mismatch errors. The prospective errors nudge the firing of pyramidal neurons deep in the brain, so that motor neurons improve the output of the network right in time. A putative behavioural error, encoded in the motor feedback, propagates back through the network and produces prospective corrections of the pyramidal neuron activities that effectively manifest in instantaneous corrections of the motor trajectory. Through this prospective coding, the sensory stream, the deep network activity, and the motor feedback are in sync at any moment in time. We formulated this dynamic synchronization as the ‘moving equilibrium hypothesis’, referring to the classical equilibrium point hypothesis for motor control (Feldman & Levin, 2009; Latash, 2010). More generally, the brain activity formed by the prospective firing of cortical pyramidal neurons is in a moving equilibrium while converting sensory input streams into motor outputs, consistent with prospective sensory processing in the human cortex (Blom et al., 2020).

Because the neuronal dynamics derived from the global NLA principle is in a moving equilibrium, the prospective dendritic errors that globally correct the output trajectory are also suited to instruct local synaptic plasticity in the dendrites. In fact, working down the apical errors by adapting the sensory-driven synapses on the basal dendrites reduces the global output errors in real time. The apical errors are extracted from the top-down feedback via lateral ‘inhibition’ that tries to cancel the top-down signal. This top-down feedback includes activity from a putative erroneous motor output that was not foreseen by the local inhibition and thus survives as a local apical error. Given the prospective coding of the pyramidal neurons, the dendritic errors are also prospective and thus able to induce the correct error-minimizing plasticity online, while stimuli and targets continuously change.

The NLA principle as a bottom-up theory from neurons to behaviour

To show that the NLA principle offers a viable program for a formalization of neuroscience following the example of physics, we exemplified its ramifications into dendritic computation, cortical microcircuits, synaptic plasticity, motor control, and sensory-based decision making. The crucial point of our axiomatization is that it connects the local neuronal errors to the global behavioural errors right in the formulation of the principle, eventually leading to local gradient-based plasticity rules. Because the formulation builds upon computations that can be realized in single neurons and dendrites to produce a behavioural output, the NLA principle can be seen as a bottom-up theory of behaviour. It is articulated in terms of apical and basal dendrites, somatic firing, network connectivity and behavioural outputs that jointly minimize their errors. This contrasts with the related free-energy principle, for instance, which leads to a top-down theory of behaviour by starting with the statistical, but more universal, notion of a free energy. That principle postulates that any self-organizing system at a statistical equilibrium with its environment must minimize its free energy (Friston, 2010; Friston et al., 2022), and from there works its way down to neurons and dendrites (Bastos et al., 2012; Kiebel & Friston, 2011).

Starting with a single Lagrangian function that specifies the form of the somato-dendritic prediction errors leaves some freedom for the interpretation and the implementation of the emerging dynamical equations for the voltages. We interpret errors to be represented in the apical dendrites of pyramidal neurons while sensory input targets the basal dendrites, but other dendritic configurations are conceivable (Mikulasch, Rudelt, Wibral, et al., 2023) that apply also to non-pyramidal neurons. We have chosen a specific interneuron circuitry to extract our apical errors, but other microcircuits or error representations might also be considered (Keller & Mrsic-Flogel, 2018). On the other hand, the derived gradient-based synaptic plasticity is tightly linked to the specific form of the somato-dendritic prediction errors expressed in the Lagrangian and its interpretation, making specific predictions for synaptic plasticity (as outlined below). The ‘external’ feedback entering through the cost function offers additional freedom to model behavioural interactions. We considered an explicit time course of a target voltage in motor neurons, for instance imposed by the feedback from muscle spindles that are themselves innervated by a prospective top-down signal to control muscle lengths (Papaioannou & Dimitriou, 2021; Dimitriou, 2022). But the cost may also link to reinforcement learning and express a delayed reward feedback delivered upon a behavioural decision of an agent acting in a changing environment (Friedrich, Urbanczik, et al., 2011; Friedrich & Senn, 2012).

A fundamental difficulty arises when the neuronal implementation of the Euler-Lagrange equations requires an additional microcircuit with its own dynamics. This is the case for the suggested microcircuit extracting the local errors. Formally, the representation of the apical feedback errors first needs to be learned before the errors can teach the feedforward synapses on the basal dendrites. We showed that this error learning can itself be formulated as minimizing an apical mismatch energy. What the lateral feedback through interneurons cannot explain away from the top-down feedback remains as an apical prediction error. Ideally, while the network synapses targeting the basal tree are performing gradient descent on the global cost, the microcircuit synapses involved in the lateral feedback are performing gradient descent on local error functions, both at any moment in time. The simulations show that this intertwined system can in fact learn simultaneously with a single, properly tuned learning rate shared by all synapses. The cortical model network of inter- and pyramidal neurons learned to classify handwritten digits on the fly, with 10 digit samples presented per second. Yet, the overall learning is more robust if the error learning in the apical dendrites operates in phases without output teaching but with corresponding sensory activity, as may arise during sleep (see e.g. Deperrois et al., 2022; Deperrois et al., 2023).

The NLA principle integrates classical theories for cortical processing and learning

The prospective variational principle introduced with the NLA allows for integrating previous ideas on formalizing the processing and learning in cortical networks. Four such classical lines of theories come together. (i) The first line refers to the use of an energy function to jointly infer the neuronal dynamics and synaptic plasticity, originally formulated for discrete-time networks (Hopfield, 1982; Ackley et al., 1985), and recently extended to continuous-time networks (Scellier & Bengio, 2017). (ii) The second line refers to understanding error-backpropagation in the brain (Rumelhart et al., 1986; Xie & Seung, 2003; Whittington & Bogacz, 2017; Whittington & Bogacz, 2019; Lillicrap, Santoro, et al., 2020). (iii) The third line refers to dendritic computation and the use of dendritic compartmentalization for various functions such as nonlinear processing (Schiess et al., 2016; Poirazi & Papoutsi, 2020) and deep learning (Guerguiev et al., 2017; Sacramento et al., 2018; Haider et al., 2021). (iv) The fourth line refers to predictive coding (Rao & Ballard, 1999) and active inference (Pezzulo et al., 2022) to improve the sensory representation and motor output, respectively.

  1. With regard to energy functions, the NLA principle adds a variational approach to characterize continuous-time neuronal trajectories and plasticity. Variational approaches are studied in the context of optimal control theory, where a cost integral is minimized across time, constrained to some network dynamics (Todorov & Jordan, 2002; Meulemans, Farinha, et al., 2021). The NLA represents a unifying notion that allows one to infer both the network dynamics and its optimal control from a single Lagrangian. The errors we derive represent prospective control variables that are applied to the voltages of each network neuron so that they push the output neurons towards their target trajectory. The full expressive power of this control-theoretic framework has yet to be demonstrated when it is extended to genuine temporal processing that includes longer time constants, for instance inherent in a slow threshold adaptation (Bellec et al., 2020). The NLA principle can also treat the case of strong feedback, studied so far in relaxation networks only (Meulemans, Zucchet, et al., 2022; Song et al., 2024). Our rt-DeEP Theorem makes a statement for real-time gradient descent learning while the network is in a moving equilibrium, linking to motor learning in the presence of perturbing force fields (Herzfeld et al., 2014) or perturbing visual inputs (Dimitriou, 2016).

  2. With regard to error-backpropagation, the NLA principle infers a local error that originates from the error in the output neurons. In the current version, the NLA relies on feedback alignment (Lillicrap, Cownden, et al., 2016) to obtain local apical errors without postulating plasticity in the top-down synapses. Other works have explored learning the feedback weights (Akrout et al., 2019; Kunin et al., 2020), notably within a phase-less real-time learning framework as considered here (Max et al., 2022). It would be promising to combine these ideas to obtain a fully plastic microcircuit that adjusts from scratch to various real-time learning tasks.

  3. With regard to dendritic computation, the NLA principle extends the idea of dendritic error representation (Sacramento et al., 2018; Mesnard et al., 2019) by the prospective coding of both errors and firing rates (Fig. 2). As a consequence, the various dendritic delays are compensated, and synaptic plasticity can operate at any moment, without the need to wait for network relaxations (Scellier & Bengio, 2017; Song et al., 2024). In the present framework, the input rates are low-pass filtered by variable input time constants (τin) before the induced voltages are instantaneously processed in the network. While this offers the possibility of temporal processing of the inputs, step changes in the input rates cannot propagate instantaneously (as made possible in Haider et al., 2021). The dendritic error representation has also been applied to spike-based learning in recurrent networks (Mikulasch, Rudelt & Priesemann, 2021).

  4. With regard to the principle of predictive coding (Rao & Ballard, 1999; Pezzulo et al., 2022), the NLA offers a form of predictive error-coding in the apical tree of sensory pyramidal neurons that shapes the sensory representation. Both predictive coding and the NLA principle imply an error-based adaptation of the lower and higher cortical representations. However, while predictive coding is based on propagating errors, the NLA principle directly propagates the error-corrected representations. Based on the backpropagated activities, it extracts in each area the prospective errors needed to reshape the local representations and induce synaptic plasticity. Otherwise, the freedom in defining the cost function in the NLA, and the inclusion of active inference in predictive coding (Pezzulo et al., 2022), make the two frameworks equally powerful. Both frameworks can explain learning from motor and sensory prediction errors. In fact, our example of motor control minimizes the proprioceptive error formed by the target muscle lengths minus the effective muscle lengths. Active inference likewise minimizes the mismatch between an internal motor target and the proprioceptive inputs caused by the agent's own actions. In turn, the cost function in the NLA principle may also capture sensory prediction errors, where the ground truth lies in the external inputs that are learned to be matched by the sensory expectation. Hence, while being functionally equivalent, the neuronal interpretations differ: predictive coding focuses on the propagation of errors, using the sensory and motor representations as local auxiliary quantities, whereas the NLA principle directly integrates errors in ‘auxiliary dendrites’ to propagate sensory or motor representations.

The NLA integrates and predicts features of synapses, dendrites and circuits

Motivated by the predictive power of the least-action principle in physics, we ask about experimental confirmation and predictions of the NLA principle. Given its axiomatic approach, it appears astonishing to find various preliminary matches at the dendritic, somatic, interneuron, synaptic and even behavioural level. Some of these are:

  1. the prospective coding of pyramidal neuron firing (Köndgen et al., 2008);

  2. the prospective processing of apical signals while propagating to the soma (Ulrich, 2002);

  3. the basal synaptic plasticity on pyramidal neurons and synaptic plasticity on interneurons, driven by the postsynaptic activity that is ‘unexplained’ by the distal dendritic voltage (Urbanczik & Senn, 2014) – while partly consistent with spike-timing dependent plasticity (Sjöström et al., 2001; Spicher et al., 2017), the postulated dendritic voltage-dependence of the plasticity rules still awaits an experimental inspection;

  4. the Hebbian homeostatic plasticity of interneurons targeting the apical dendritic tree of pyramidal neurons (Bannon et al., 2020);

  5. the short-term synaptic depression at top-down synapses targeting inhibitory neurons and apical dendrites (akin to Abbott et al., 1997, but with a faster recovery time constant) that inverts the presynaptic activation function (see also Pfister et al., 2010);

  6. the modulation of the apical contribution to the somatic voltage by the slope of the somatic activation function (for instance by downregulating apical NMDA receptors with increasing rate of backpropagating action potentials, Theis et al., 2018); and

  7. the involvement of muscle spindles in the prospective encoding of motor errors during motor learning, with the γ-innervation setting the target length of the spindles (Dimitriou, 2016; Papaioannou & Dimitriou, 2021; Vargas et al., 2023).

More experimental and theoretical work is required to substantiate these links and test specific predictions, such as the apical error representation in cortical pyramidal neurons.

Overall, our approach adapts the least-action principle from physics to be applied to neuroscience, and couples it with a normative perspective on the prospective processing of neurons and synapses in global cortical networks and local microcircuits. Given its physical underpinnings, the approach may inspire the rebuilding of computational principles of cortical neurons and circuits in neuromorphic hardware (Bartolozzi et al., 2022). A step in this direction, building on the instantaneous computational capabilities of slowly integrating neurons, has been made in Haider et al., 2021. Given its aspiration towards a theoretical framework for neurobiology, a next challenge would be to generalize the NLA principle to spiking neurons (Gerstner & Kistler, 2002; Brendel et al., 2020) with their potential for hardware implementation (Zenke & Ganguli, 2018; Göltz et al., 2021; Cramer et al., 2022), to include attentional mechanisms in terms of dendritic gain modulation (Larkum et al., 2004) with a putative link to self-attention in artificial intelligence (Vaswani et al., 2017), to add second-order errors to cope with uncertainties (Granier et al., 2023), and to incorporate longer temporal processing as, for instance, offered by neuronal adaptation processes (La Camera et al., 2006) or realistically modelled dendrites (Chavlis & Poirazi, 2021).

Methods

A. Euler-Lagrange equations as inverse low-pass filters

The theory is based on the lookahead of neuronal quantities. In general, the lookahead of a trajectory x(t) is defined via the lookahead operator applied to x,

The lookahead operator is the inverse of the low-pass filter operator denoted by a bar,

This low-pass filtering can also be characterized by the differential equation τ dx̄/dt = x − x̄, see SI, Sect. B. Hence, applying the low-pass filtering to x and then the lookahead operator to x̄, and using the Leibniz rule for differentiating an integral, we recover the original trace x. In turn, applying first the lookahead and then the low-pass filtering also yields the original trace back.
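A quick numerical check of this inverse relationship (our own discretization; the residual is set by the Euler step):

    import numpy as np

    tau, dt = 0.01, 2e-5
    t = np.arange(0, 0.1, dt)
    x = np.sin(2*np.pi*20*t) + 0.5*np.sin(2*np.pi*55*t)    # test trace

    xbar = np.empty_like(x); xbar[0] = x[0]
    for i in range(1, len(t)):                 # low-pass filter: tau dx̄/dt = x − x̄
        xbar[i] = xbar[i-1] + dt/tau * (x[i-1] - xbar[i-1])

    lookahead = xbar + tau*np.gradient(xbar, dt)      # lookahead: x̄ + tau dx̄/dt
    print(np.max(np.abs(lookahead[100:] - x[100:])))  # small: the lookahead recovers x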

We consider an arbitrary network architecture with network neurons that are recurrently connected and that receive external input through an overall weight matrix W = (Win, Wnet), aggregated column-wise. The instantaneous presynaptic firing rates are r = (rin, rnet)^T, interpreted as a single column vector. A subset of network neurons are output neurons, 𝒪 ⊆ 𝒩, for which target voltages u* may be imposed. Rates and voltages may change in time t. Network neurons are assigned a voltage u, generating the low-pass filtered rate r̄ and a low-pass filtered error ē. We further define output errors e*_o = u*_o − u_o for o ∈ 𝒪, and e*_i = 0 for non-output neurons i ∈ 𝒩 \ 𝒪. With this, the Lagrangian from Eq. 3 takes the form

We next use that u = ũ − τũ̇, with the operator ˜ defined in Eq. 4, to write out the Lagrangian L in the canonical coordinates ũ and ũ̇ as (see also Eq. 3)

The neuronal dynamics is derived from requiring a stationary action (see Eq. 5), which is generally solved by the Euler-Lagrange equations (see Eq. 6). Because ũ only arises in L in the compound u = ũ − τũ̇, the derivative of L with respect to ũ is identical to the derivative with respect to this compound,

Using the lookahead operator Eq. 14, the Euler-Lagrange equations can then be rewritten as

Since L depends on the voltages only through low-pass filtered quantities, the derivative of L with respect to ũ is the same as the derivative of L with respect to ū. Plugging this into Eq. 19, the Euler-Lagrange equations become a function of u and u̇,

Notice that, if we had directly calculated the Euler-Lagrange equations in the canonical coordinates, the second-order time derivative of the discounted future voltage would be absorbed in a first-order time derivative of the voltage. The reason is that u = ũ − τũ̇, and ũ only arises in this combination because the Lagrangian L = L(u) is only a function of u and not of u̇. Hence, the acceleration term disappears, while a voltage derivative appears.

The solution of this differential equation Eq. 20 is an exponential decay, and hence any trajectory that satisfies the Euler-Lagrange equations causes ∂L/∂ū to converge to zero with a characteristic time scale of τ. Since we require the initialisation to be at t0 = −∞, we conclude that ∂L/∂ū = 0 at any time t, as required in the rt-DeEP Theorem.

B. Deriving the network dynamics from the Euler-Lagrange equations

We now derive the equations of motion from the Euler-Lagrange equations. Noticing that u enters the Lagrangian twice, directly and through the low-pass filtered rate r̄, and once more in the output error ē*, we calculate ∂L/∂ū from Eq. 16, using W = (Win, Wnet),

Remember that for non-output neurons i no target exists, and for those we set e*_i = 0. Next, we apply the lookahead operator to this expression, as required by the Euler-Lagrange equations Eq. 19. In general, the lookahead undoes the low-pass filtering, and we write f(u, t) for the expression on the right-hand side of Eq. 21. Hence, the Euler-Lagrange equations in the form of Eq. 20 translate into

To move from the middle to the last equality we replaced e by the sum of the two errors, e = ϵ + βe*, again using the middle equality. This proves Eq. 7.

Notice that the differential equation in Eq. 22 represents an implicit ordinary differential equation, as the right-hand side contains not only u but also u̇ (through r and e). The uniqueness of the solution u(t) for a given initial condition is only guaranteed if the equation can be converted into an explicit ordinary differential equation (see Sect. C).

In taking the temporal derivative we assumed small learning rates, such that terms including Ẇ can be neglected. The derived dynamics for the membrane potential of a neuron ui in Eq. 22 show the usual leaky behavior of biological neurons. However, both the presynaptic rates and the prediction errors enter the equation of motion with lookaheads, i.e., they are advanced relative to r̄i and ēi, cancelling the low-pass filtering. The rate and error, ri and ei, can hence also be seen as nonlinear extrapolations from the voltage and its derivative into the future.

The instantaneous transmission of information throughout the network at the level of the voltages can now be seen by low-pass filtering Eq. 22 with initialization far back in the past,

with the low-pass filtered rates and errors interpreted as column vectors. Hence, solving the voltage dynamics for u (Eq. 7a), with the apical voltage derived from Eq. 7b, yields a somatic voltage u satisfying the self-consistency equation (Eq. 23) at any time. In other words, u and ē ‘propagate instantaneously’.
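The self-consistency can be made tangible by solving Eq. 23 at a single time point with a fixed-point iteration (toy network with β = 0 so that ē = 0; names and sizes are ours):

    import numpy as np

    rng = np.random.default_rng(1)
    n, n_in = 20, 5
    W_net = rng.normal(0, 0.15, (n, n))     # weak recurrence: the iteration contracts
    W_in  = rng.normal(0, 0.5, (n, n_in))
    r_in  = rng.normal(size=n_in)           # low-pass filtered input rates
    rho   = np.tanh

    u = np.zeros(n)
    for _ in range(200):                    # fixed point of u = W_net ρ(u) + W_in r̄_in
        u = W_net @ rho(u) + W_in @ r_in
    print(np.linalg.norm(u - (W_net @ rho(u) + W_in @ r_in)))   # ≈ 0: moving equilibrium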

C. Deriving the error backpropagation formula

For clarity, we derive the error backpropagation algorithm for layered networks here. These can be seen as a special case of a general network with membrane potentials u and an all-to-all weight matrix W (as introduced in appendix I), where the membrane potentials decompose into layer-wise membrane potential vectors ul and the weight matrix into the corresponding weight blocks Wl (with Wl being the weights that project into layer l).

Assuming a network with N layers, by low-pass filtering the equations of motion we get

for all l ∈ 1, .., N, with the output error ēo of the general recurrent network becoming the error in the last layer, which itself is given by the target error. The error ēl, obtained from the general dynamics (see Eq. 21 and Eq. 22), translates into an iterative formula for the error at the current layer l given the error at the downstream layer l+1, inherited from the drive of that downstream layer via Wl+1,

and ēN = β ē* for the output layer. The learning rule that reduces ēl by gradient descent, being proportional to this error and the presynaptic rate as stated by Theorem 1, is

for l = 1…N. Eq. 25 and Eq. 26 together take the form of the error backpropagation algorithm, where an output error is iteratively propagated through the network and used to adjust the weights in order to reduce the output cost C. From this it is easy to see that, without output nudging (i.e., β = 0), the output error vanishes, and consequently all other prediction errors vanish as well, for all l ≤ N. This also means that in the absence of nudging, no weight updates are performed by the plasticity rule.
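In explicit discrete-time form, and choosing Bl = W^T_{l+1}, the forward pass, the error propagation of Eq. 25, and the plasticity of Eq. 26 read as follows (a sketch with our own toy dimensions, evaluated at a single time step):

    import numpy as np

    rng = np.random.default_rng(2)
    sizes = [4, 8, 6, 3]                      # hypothetical layer widths
    W = [rng.normal(0, 0.3, (sizes[l+1], sizes[l])) for l in range(3)]

    def rho(u):  return np.tanh(u)
    def drho(u): return 1.0 - np.tanh(u)**2

    r_in   = rng.normal(size=sizes[0])        # low-pass filtered input rates
    u_star = rng.normal(size=sizes[-1])       # target voltages u*
    beta, eta = 0.1, 1e-2

    # forward pass: voltages are instantaneous functions of the input (β = 0 limit)
    u = [W[0] @ r_in]
    for l in range(1, 3):
        u.append(W[l] @ rho(u[l-1]))

    # error propagation: ē_N = β(u* − u_N), ē_l = ρ'(u_l) ⊙ (W_{l+1}^T ē_{l+1})
    e = [None] * 3
    e[2] = beta * (u_star - u[2])
    for l in (1, 0):
        e[l] = drho(u[l]) * (W[l+1].T @ e[l+1])

    # plasticity (Eq. 26): weight change proportional to ē_l times presynaptic rate
    W[0] += eta * np.outer(e[0], r_in)
    for l in (1, 2):
        W[l] += eta * np.outer(e[l], rho(u[l-1]))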

The learning rule for arbitrary connectivities is obtained in the same way by dropping the layer-wise notation. In this case, low-pass filtering the equations of motion yields the self-consistency equation ū = W r̄ + ē, as calculated in Eq. 23, together with the low-pass filtered error ē, as inferred from Eqs. 21 and 22. Hence, the plasticity rule in general reads

D. Proving Theorem 1 (rt-DeEP)

The implicit assumption in Theorem 1 is that u̇ exists in the distributional sense for t > −∞, which is the case for delta-functions in rin and step-functions in u*. Both parts (i) and (ii) of the Theorem are based on the requirement of a stationary action, δA = 0, and hence on u satisfying the Euler-Lagrange equations in the form of Eq. 22. From the solution we conclude that for an initialization at t0 = −∞ we have f(u, t) = 0 for all t. It is this latter, stronger condition that we require in the proof. With this, the main ingredient of the proof is the mathematical argument of Scellier & Bengio, 2017, according to which the total and partial derivatives of L with respect to W are identical, and this in our case is true at any time t,

For convenience we considered the weight derivative to be a column vector, deviating from the standard notation (see the tutorial at the end of Methods, Sect. F). Analogously to Eq. 28, the total and partial derivatives of L with respect to β are identical as well. Reading Eq. 28 from right to left, we conclude that the learning rule for all β > 0 is gradient descent on L. This total derivative of L can be analyzed for large and small β.

  1. We show that in the limit of large β, the plasticity becomes gradient descent on the mismatch energy. For this we first show that there is a solution of the self-consistency equation that is uniformly bounded for all t and β. We assume that the transfer function ρ(u) is non-negative, monotonically increasing and bounded, that its derivative ρ′(u) is bounded too, and that the input rates rin and the target potentials are also uniformly bounded. To show that under these conditions we always find a uniformly bounded solution u(t), we first consider the case where the output voltages are clamped to the target, such that ē* = 0. For simplicity we assume that ρ′(u) = 0 for |u| ≥ c0. For voltages u with |ui| ≤ c0 the recurrent input current is bounded, say by some c1 > c0. When including the error term, the total current still remains uniformly bounded, say |F(u)j| ≤ c2 for all u with |ui| ≤ c0. Because for larger voltages ui > c0 the error term vanishes due to the vanishing derivative ρ′(ui) = 0, the mapping F(u) maps the c2-box (for which |ui| ≤ c2) into itself. Brouwer’s fixed point theorem then tells us that there is a fixed point u = F(u) within the c2-box. The theorem requires the continuity of F, and this is assured if the neuronal transfer function is continuous.

    We next relax the voltages of the output neurons from their clamped state. Remember that these voltages satisfy the self-consistency equation at any time t. We determine the correction term such that in the limit β → ∞ we recover the clamped state. The correction remains finite, also in this limit. For arbitrarily large nudging strength β, the output voltage uo deviates arbitrarily little from the target voltage, with the target error shrinking like 1/β. Likewise, also for non-output neurons i, the self-consistency solution ui = F(u)i deviates arbitrarily little from the solution of the clamped state. To ensure the smooth drift of the fixed point while 1/β deviates from 0, we require that the Jacobian of F at the fixed point is invertible.

    Because the output error shrinks with 1/β, the cost shrinks quadratically with increasing nudging strength, and hence the cost term βC that enters in L vanishes in the limit β → ∞. In this large-β limit, where the outputs are clamped, the Lagrangian reduces to the mismatch energy, L = EM. Along the least-action trajectories we therefore get dL/dW = ∂L/∂W = ∂EM/∂W. The first equality uses Eq. 28, and the second uses L = EM, just derived for β = ∞. This is statement (i) of Theorem 1. In the case of successful learning, EM = 0, we also conclude that the cost vanishes, C = 0. This is the case because EM = 0 implies a vanishing mismatch for all output neurons o. Since the mismatch of an output neuron contains its target error, we conclude that ēo = 0, and if the output neurons do not feed back to the network (which we can assume without loss of generality), we conclude that the output voltages reach their targets and the cost vanishes.

  2. To consider the case of small β, we use that the cost C can be expressed as the derivative of the Lagrangian with respect to β. This is a direct consequence of how C enters L, see Eq. 16 and Scellier & Bengio, 2017. We now put this together with Eq. 28 and the finding that the least-action trajectories satisfy f(u, t) = 0 at any t. Since L is Lipschitz continuous in u, W and β (L is even smooth in these arguments), the total derivatives interchange (a consequence of the Moore-Osgood theorem applied to the limits of the difference quotients), and we get at any t,

The last expression is calculated from the specific form of the Lagrangian (Eq. 17), using the definition of the output error. Finally, in the absence of output nudging, β = 0, we can assume vanishing errors, ē = 0, as they solve the self-consistency equation for all t, see Eq. 27. For these solutions the weight gradient g = dL/dW vanishes, g(0) = 0. Writing out the total derivative of this function with respect to β at β = 0 as the limit of the difference quotient, and using that g(0) = 0, we calculate at any t,

Here, the numerator is evaluated at β > 0 (which itself approaches 0), while the reference value is taken at β = 0. Combining Eqs. 29 and 30 yields the cost gradient at any t,

This justifies the gradient learning rule in Eq. 27. Learning is stochastic gradient descent on the expected cost, where stochasticity enters through the randomization of the stimulus and target sequences rin(t) and u*(t). For the regularity statement, see Methods, Sect. F (‘From implicit to explicit differential equations’). Notice that this proof works for a very general form of the Lagrangian L, up to the final step that uses its specific form. For a proof in terms of partial derivatives only, see SI, Sect. I, where a primer on partial and total derivatives is also found (SI, Sect. H).
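For orientation, the two limits established in parts 1 and 2 above can be condensed into one line each (our shorthand restatement of statements (i) and (ii), to be read with the boundedness and regularity caveats of the proof):

    \lim_{\beta\to\infty} \frac{dL}{dW} \;=\; \frac{\partial E^{\mathrm M}}{\partial W}\,,
    \qquad\qquad
    \frac{dC}{dW}\Big|_{\beta=0} \;=\; \lim_{\beta\to 0}\,\frac{1}{\beta}\,\frac{dL}{dW}\,.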

Instantaneous gradient descent on the cost C

The cost at each time t is a function of the voltage uo of the output neurons and the corresponding targets. In a feedforward network, due to the instantaneity of the voltage propagation (Eq. 23), uo is, in the absence of output nudging (β = 0), an instantaneous function of the voltage at the first layer. For initialisation at t = −∞, the second term vanishes for all t. The output voltage uo(t) therefore becomes a function FW of the low-pass filtered input rate r̄in(t) that captures the instantaneous network mapping, uo = FW(r̄in), and with this the cost also becomes an instantaneous function of r̄in and u*, namely C(t) = C(FW(r̄in(t)), u*(t)).

For a general network, again assuming t0 = −∞, the voltage is determined by the vanishing gradient f(u, t) = 0, see Eq. 21. For the inclusive treatment of the initial transient see SI, Sects. C and D. For a given input and output error at time t, the equation f(u, t) = 0 can be locally solved for u if the Hessian H = ∂f/∂u is invertible. This mapping can be restricted to the output voltages uo on the left-hand side, while keeping the full voltage vector in the argument on the right-hand side (even if this again introduces uo there). With this we obtain an instantaneous mapping from the low-pass filtered input and the output error to the output itself. Notice that for a functionally feedforward network, the network weight matrix Wnet is lower triangular, and for small enough β the Hessian H is therefore always positive definite (see also Methods, Sect. F).

E. Proving Theorem 2 (rt-DeEL)

Here we restrict ourselves to layered network architectures. To prove Theorem 2, first assume that the interneurons receive no nudging (βI = 0) and that only the lateral interneuron-to-pyramidal weights are plastic. This is already sufficient to prove the rt-DeEL theorem. Yet, simulations showed that learning the lateral pyramidal-to-interneuron weights via top-down nudging, so that the interneuron activity mimics the upper-layer pyramidal neuron activity, helps in learning a correct error representation. We consider this case of learning later.

If the microcircuit is to correctly implement error backpropagation, all local prediction errors ēl must vanish in the absence of output nudging (β = 0), as there is no target error. Consequently, any remaining errors in the network are caused by a misalignment of the lateral microcircuit. We show how learning the interneuron-to-pyramidal weights corrects for such misalignments.

To define the gradient descent plasticity of the weights W^PI_l from the interneurons to the pyramidal neurons, we consider the apical error formed by the difference of the top-down input and the interneuron input, and define the apical mismatch energy E^PI_l as the squared norm of this apical error. Gradient descent on this energy with respect to W^PI_l yields

evaluated online while presenting input patterns from the data distribution to the network. We assume that the apical contribution to the somatic voltage is further modulated by the somatic spike rate. After successful learning, the top-down input Blūl+1 is fully subtracted away by the lateral input in the apical compartment, and we have

Once this condition is reached, the network achieves a state where, over the activity space spanned by the data, top-down prediction errors throughout the network vanish,

We show that this top-down prediction error, after the successful learning of the microcircuit, shares the properties of error backpropagation for suitable backprojection weights B.

Due to the vanishing prediction errors, pyramidal cells only receive the bottom-up input ūl = Wl r̄l−1. Using this expression, as well as the expression for the interneuron membrane potentials without top-down nudging (βI = 0 in Eq. 9), ū^I_l = W^IP_l r̄l, and plugging both into Eq. 33, we get

Assuming that W^IP_l has full rank, and that the low-pass filtered rates r̄l span the full nl dimensions of layer l when sampled across the data set, we conclude that

In other words, the loop via the upper layer and back is learned to be matched by a lateral loop through the interneurons. Eq. 36 imposes a restriction on the minimal number of interneurons at layer l. In fact, the matrix product BlWl+1 maps an nl-dimensional space onto itself via an nl+1-dimensional space. The maximal rank of this matrix product is limited by the smallest dimension, i.e. rank(BlWl+1) ≤ min(nl, nl+1). Analogously, the rank of the lateral loop is bounded by the number of interneurons, n^I_l. But since the two ranks are the same according to Eq. 36, in general n^I_l ≥ min(nl, nl+1) must hold, i.e. there should be at least as many interneurons at layer l as the lowest number of pyramidal neurons at either layer l or l+1. Note that by choosing n^I_l = nl+1 as in Sacramento et al., 2018 (or as in this work), the condition is fulfilled.
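The rank bookkeeping can be checked numerically (illustrative dimensions of our own choosing):

    import numpy as np

    rng = np.random.default_rng(3)
    n_l, n_up = 8, 4
    BW = rng.normal(size=(n_l, n_up)) @ rng.normal(size=(n_up, n_l))   # B_l W_{l+1}
    print(np.linalg.matrix_rank(BW))   # 4 = min(n_l, n_up): at least 4 interneurons needed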

Together with Eq. 36, the top-down prediction error from Eq. 34, in the presence of output nudging (β > 0), can be written in the backpropagation form

Finally, the simulations showed that learning the lateral weights W^PI_l in the microcircuit greatly benefits from also adapting the pyramidal-to-interneuron weights W^IP_l by gradient descent on the corresponding mismatch energy E^IP_l, using top-down nudging of the inhibitory neurons (βI > 0),

After learning, the interneuron potentials match the nudged upper-layer activities. Plugging in Eq. 9, we conclude as before,

The top-down weights that nudge the lower-layer interneurons have randomized entries and may be considered to be of full rank. If there are fewer pyramidal neurons in the upper layer than interneurons in the lower layer, the nudging selects a subspace of dimension nl+1 in the interneuron space. This seems to simplify the learning of the interneuron-to-pyramidal cell connections W^PI_l. In fact, this learning now only has to match the nl+1-dimensional interneuron subspace, embedded in n^I_l dimensions, to an equally (nl+1-)dimensional pyramidal cell subspace embedded in nl dimensions.

Learning of the interneuron-to-pyramidal cell connections works with the interneuron nudging as before, and combining Eqs. 36 and 39 yields the ‘loop consistency’

The learning of the microcircuit was described in the absence of output nudging. Conceptually, this is not a problem, as one could introduce a pre-learning phase in which the lateral connections are first correctly aligned before the learning of the feedforward weights begins. In simulations we find that both the lateral and the forward connections can be trained simultaneously, without the need for such a pre-learning phase. We conjecture that this is due to the fact that our plasticity rules are gradient descent on the energy functions L, E^PI and E^IP, respectively.

F. From implicit to explicit differential equations

The voltage dynamics is solved by a forward-Euler scheme, u(t + dt) = u(t) + u̇(t) dt. The derivative u̇ is calculated either (i) through the implicit differential equation (Eq. 7), or (ii) by isolating u̇ and solving the resulting explicit differential equation, as explained in SI, Sect. C (after Eq. 51); a code sketch of scheme (i) follows the list.

  1. The implicit differential equation, see Eq. 22, is iteratively solved by repeatedly assigning the right-hand side to u̇ and recalculating the rate and the error with the updated lookahead, until the assignment converges.

    This iteration converges exponentially to a fixed point, at a rate determined by 1 − k > 0, the smallest eigenvalue of the Hessian H, see SI, Sect. C.

  2. The explicit differential equation is obtained by eliminating u̇ from the right-hand side of the implicit differential equation. Since u̇ enters linearly, the elimination amounts to a matrix inversion, which is possible whenever the Hessian is strictly positive definite (which is typically the case, see SI, Sect. C, after Eq. 48). The external input and the target enter through their time derivatives, where the derivative of the target voltage is only added for the output neurons o. This explicit differential equation is shown to be contractive in the sense that, for each input trajectory rin(t) and target trajectory u*(t), the voltage trajectory u(t) is locally attracting for neighbouring trajectories. This locally attracting trajectory is the vanishing-gradient trajectory f(u, t) = 0, and the gradient remains 0 even if the input contains delta-functions, see SI, Sect. D.
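The following sketch illustrates scheme (i) on a toy recurrent network (parameters and names are ours): the inner loop settles the implicit equation for u̇, the outer loop performs the forward-Euler step, and the trajectory indeed rides the vanishing-gradient condition f(u, t) = 0 (written here for β = 0):

    import numpy as np

    rng = np.random.default_rng(4)
    n, tau, dt = 10, 10e-3, 1e-4
    W    = rng.normal(0, 0.2, (n, n))        # ||W|| < 1: the u̇-iteration contracts
    W_in = rng.normal(0, 0.5, (n, 3))
    r_in = np.ones(3)                        # constant low-pass filtered input
    rho  = np.tanh

    def f(u):                                # right-hand side whose zero defines the
        return -u + W @ rho(u) + W_in @ r_in # moving equilibrium, f(u, t) = 0

    u, du = np.zeros(n), np.zeros(n)
    for step in range(1000):
        for _ in range(20):                  # scheme (i): settle the implicit equation
            du = (-u + W @ rho(u + tau * du) + W_in @ r_in) / tau
        u += dt * du                         # forward-Euler step with the settled u̇
    print(np.linalg.norm(f(u)))              # ≈ 0: the trajectory rides f(u, t) = 0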

Moving and latent equilibria: a formal definition

We showed that the motor output uo, together with the low-pass filtered sensory input and the motor feedback, is in a moving equilibrium, see Fig. 3a. In general, a dynamical system in u that is given in an implicit form G(u, u̇, x) = 0 with external inputs x is said to be in a moving equilibrium if the variable u is an instantaneous function of the input x at any point in time, u = F(x). The fact that the implicit differential equation G = 0 represents a dynamical system in u implies that, in principle, it has a representation in an explicit form with u̇ isolated on the left-hand side, guaranteed by an invertible Jacobian ∂G/∂u̇.

Our example is obtained from the Euler-Lagrange equations with inputs x = (r̄in, u*). Given the parameterization of G by the weights, we get the parametrized function u = FW(x), and this is restricted to the output components uo of u. The condition on the Jacobian translates to the Hessian of the Lagrangian being invertible. Crucially, the description of the dynamics in the biological or physical substrate is not given in its explicit form. Rather, it is given in an implicit form in which u̇ still appears on the right-hand side. This ‘hybrid’ form is directly solved either in real time by the biophysical substrate itself, or by the forward-Euler scheme on clocked hardware, see (i) above. Notice that moving equilibria u = FW(x) are able to capture complex temporal processing of the instantaneous input rin. In fact, the low-pass filtering can be obtained on various time scales through different τin’s, and FW for a general network W can be arbitrarily complex. The task is to adapt W such that the ‘hybrid’ dynamical system eventually implements the target mapping.

The Latent Equilibrium (Haider et al., 2021) can be analogously formalized as a dynamical system in u that is implicitly given, and whose solution makes the prospective voltage an instantaneous function of the input. With the same Lagrangian as in the present NLA, the Latent Equilibrium is obtained for instantaneous, unfiltered inputs. The solution implies that the rate is an instantaneous function of the input, here without low-pass filtering. As for moving equilibria, the crucial point is that the biophysical substrate implements a hybrid form of the dynamical system that is implicitly solved by the analog substrate, and that also allows for a solution on clocked hardware. For an extended stability analysis of the moving and latent equilibria see SI, Sect. D.

G. Simulation details

Solving the explicit differential equation appears to be more robust when the learning rate gets larger. The explicit form is also less sensitive to large Euler steps dt, see SI, Sect. C. For this reason, the ordinary differential equations (ODEs) were solved in the explicit form whenever plasticity was included. The algorithms are summarized as follows, once without interneurons (Algo 1), and once with interneurons (Algo 2):

Algorithm 1

with projection neurons only, for Figs 3 & 4 (using the explicit ODE, i.e. Step 12 instead of 11)

Algorithm 2

including plastic interneurons, for Fig. 5 (using the explicit ODE, i.e. Step 13 instead of 12)

Details for Fig. 3b

Color-coded snapshot of cortical local field potentials (LFPs) in a human brain from 56 deep iEEG electrodes at various locations, converted with the sigmoidal voltage-to-rate function and plotted onto a standard Talairach brain (Talairach & Tournoux, 1988). The iEEG data are from a patient with pharmacoresistant epilepsy, with the electrodes implanted during presurgical evaluation, extracted from the data release of Burrello et al., 2019. The locations of the electrodes were chosen for plausibility, as the original positions were omitted owing to ethical standards that prevent patient identification.

Details for Fig. 3c

Simulations of the voltage dynamics (Eq. 7a) and the weight dynamics (Eq. 8), with learning rate η = 10−3, step size dt = 1 ms for the forward-Euler integration, membrane time constant τ = 10 ms and a logistic activation function. Weights were initialized randomly from a normal distribution 𝒩(0, 0.1²) with a cut-off at ±0.3. The number of neurons in the network 𝒩 was n = 96, among them 56 output neurons 𝒪 ⊂ 𝒩 that were simultaneously nudged, and 40 hidden neurons. During training, all output neurons were nudged simultaneously (with β = 0.1), whereas during testing, only 42 out of the 56 neurons were nudged, with the remaining 14 left to reproduce the traces. Data points of the iEEG signal were sampled at a frequency of 512 Hz. For simplicity, we therefore assumed that successive data points are separated by 2 ms, and up-sampled the signal via simple interpolation to a 1 ms resolution as required by our integration scheme. Furthermore, the raw values were normalized by dividing them by a factor of 200 to ensure that they lie approximately in a range of ±1 to ±2. Training and testing were done on two separate 8 s traces of the iEEG recording. Same data as in Fig. 3b1.

Details for Fig. 4

Simulation of the neuronal and synaptic dynamics as given by Eq. 7a, Eq. 7b and Eq. 8. For 5 ms, 10 ms and 50 ms presentation time, we used an integration step size of dt = 0.05 ms, dt = 0.1 ms and dt = 0.5 ms, respectively (and dt = 1 ms otherwise). As an activation function, we used the step-linear function (hard sigmoid) with ρ(u) = 0 for u ≤ 0, ρ(u) = 1 for u ≥ 1, and ρ(u) = u in between. The learning rate was initially set to η = 10−3 and then reduced to η = 10−4 after 22 000 s. The nudging strength was β = 0.1 and the membrane time constant τ = 10 ms. In these simulations (and only in these) we assumed that at each presynaptic layer l = 0, 1, .., n−1 there is a first neuron, indexed by 0, that fires with a constant rate, effectively allowing the postsynaptic neurons to learn a bias through the first column of the weight matrix Wl+1. Weights were initialized randomly from a normal distribution 𝒩(0, 0.01²) with a cut-off at ±0.03. For an algorithmic conversion see the scheme below. In Fig. 4c1, ‘rt-DeEP w/o lookahead’ is based on the dynamics without the lookahead of rates and errors. For ‘w/o error + backprop’, we use the forward model without error terms on the membrane potential (but with a prospective r), and calculate weight updates using error backpropagation. In Fig. 4c2, we provide three controls: the test error for (i) a standard shallow artificial neural network trained on MNIST (black dashed line), (ii) rt-DeEP without prospective coding (as in Fig. 4c1), but in c2 with plasticity only turned on when the network is completely stationary, i.e., after waiting for several 100 ms, such that synaptic weights are not changed during transients (orange dashed line, denoted by ‘w/o transients’), and (iii) an equivalent artificial neural network trained using error backpropagation (black dashed line, ‘standard backprop’).

Details for Fig. 5

Simulation of the neuronal and synaptic dynamics with a plastic microcircuit, i.e., with the pyramidal-to-interneuron and lateral interneuron-to-pyramidal weights of the microcircuit learned during training.

For the results shown in Fig. 5c2, the following parameters were used. As an activation function, we used a hard sigmoid function, and the membrane time constant was set to τ = 10 ms. Image presentation time is 100 ms. Forward, pyramidal-to-interneuron and interneuron-to-pyramidal weights were initialized randomly from a normal distribution 𝒩(0, 0.01²) with a cut-off at ±0.03. All learning rates were chosen equal, η = 10−3, and were subsequently reduced to η = 10−4 after 22 000 s of training time. The nudging parameters were set to β = 0.1, with the interneuron nudging turned on. The feedback connections Bl and the nudging matrices were initialized randomly from a scaled normal distribution 5·𝒩(0, 0.01²) with a cut-off at ±0.15. The integration step size was dt = 0.25 ms. All weights were trained simultaneously. For an algorithmic conversion see the scheme below. The interneuron membrane potential was calculated by Eq. 9 with a linear transfer function.

Acknowledgements

We would like to thank Federico Benitez, Jonathan Binas, Paul Haider, Kevin Max, Alexander Mathis, Alexander Meulemans and Jean-Pascal Pfister for helpful discussions, Kaspar Schindler for providing the human intracortical EEG data, and the late Karlheinz Meier for his dedication and support throughout the early stages of the project. WS thanks Nicolas Zucchet for mathematical discussions and hints to the literature on time-varying optimal control. This work has received funding from the European Union 7th Framework Programme under grant agreement 604102 (HBP), the Horizon 2020 Framework Programme under grant agreements 720270, 785907 and 945539 (HBP) and the Swiss National Science Foundation (SNSF, Sinergia grant CRSII5180316; João Sacramento is supported by SNSF Ambizione grant PZ00P3_186027). We further express our particular gratitude towards the Manfred Stärk Foundation for their continued support. We acknowledge the use of Fenix Infrastructure resources, which are partially funded from the European Union’s Horizon 2020 research and innovation programme through the ICEI project under the grant agreement No. 800858.

Author contributions

WS, MAP and DD designed the conceptual approach. WS, DD, MAP, AFK, JJ, JS and YB contributed to different aspects of the framework and model. WS, DD, MAP and AFK derived different components of the theory. DD, BE and AFK performed the simulations. The manuscript went through multiple drafts, written by MAP, DD, JJ, BE and WS. The two published versions were written by WS.

Supplementary information

A. Extracting the presynaptic voltage error

Threshold-linear transfer functions

There is an important special case in which the presynaptic voltage error can be extracted directly from the presynaptic firing rates, without the need to invert the transfer function via synaptic depression as shown below. This is the case when the voltage errors in the upper layers are small, and the voltage-to-rate transfer function has derivatives ρ′ = 0 or 1, so that voltage errors translate one-to-one into rate errors. The condition is satisfied, for instance, for a doubly threshold-linear function (a ‘doubly rectified linear unit’, ReLU) defined by ρ(ŭ) = 0 for ŭ < 0 and ρ(ŭ) = ŭ for 0 ≤ ŭ ≤ rmax, while ρ(ŭ) = rmax for larger voltages. In this case we calculate

The approximation uses the Taylor expansion in ul+1 and assumes that the error is small. The crucial point of Eq. 41 is that the mismatch error defined on the voltage can be factorized into a product of the postsynaptic rate derivative and the apical error, and hence can be expressed as an error defined on the rate. Restricted to the segment where the transfer function is linear and errors do not vanish, the same microcircuit delivers the feedback to the apical tree through the top-down projections and through the lateral connections from the interneurons. While the plasticity rules for W^PI and W^IP stay the same, the top-down nudging of the interneurons (see Eq. 9) can then be formulated based on the rate instead of the upper-layer voltage, with the transfer function of the interneurons again being the (doubly) threshold-linear function. Since voltages and rates are identical in the linear segment for each component, Eqs. 36 and 39 can still be inferred.
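A short check of this factorization for the doubly rectified linear unit (sample points are ours, chosen away from the two kinks):

    import numpy as np

    r_max = 1.0
    def rho(u):  return np.clip(u, 0.0, r_max)
    def drho(u): return ((u > 0) & (u < r_max)).astype(float)

    u = np.array([-0.4, 0.2, 0.5, 0.8, 1.3])
    err = 0.05 * np.ones_like(u)          # small voltage error on u
    lhs = rho(u + err) - rho(u)           # rate mismatch
    rhs = drho(u) * err                   # ρ'(u) times voltage error
    print(np.max(np.abs(lhs - rhs)))      # 0.0: voltage errors become rate errors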

Rate-to-voltage inversion by short-term synaptic depression

We wish to read out the voltage error also for nonlinear transfer functions other than clipped ReLUs. To do so, we take inspiration from the classical short-term synaptic depression model (Tsodyks & Markram, 1997; Abbott et al., 1997; Varela et al., 1997), see also Fig. 6a1 (this deviates from release-independent synaptic depression, see Campagnola et al., 2022). We consider a dynamic vesicle release probability that is proportional to the pool size of available vesicles, and this pool size is postulated to depend on past presynaptic rates,

where r̄ is the low-pass filtered presynaptic rate and a and d are constants. The proportionality factor turns the vesicle pool size into a probability. The effective synaptic strength B of a ‘backprojecting top-down’ connection is the product of the absolute synaptic strength Bº and the vesicle pool size v. The contribution of the synapse to the postsynaptic current is Wr, and the contribution to the postsynaptic voltage is its low-pass filtered version.

We search for an activation function such that the postsynaptic voltage contribution is the scaled presynaptic potential. Plugging in the above expression for B and dividing out Bº yields the matching condition, see Fig. 6a2. With the expression for the vesicle pool in Eq. 42 we obtain a quadratic equation in r̄ that is solved by the non-negative and monotonically increasing function

with a ‘smooth’ threshold θ = (1 + a)/d and an asymptote for large voltages. This gives us a transfer function that qualitatively matches those observed for pyramidal neurons recorded in the steady state (Anderson et al., 2000; Rauch et al., 2003), see Fig. 6a3.

The approach generalizes to other pairs of strictly monotonic neuronal activation and depression functions, as long as u is not driven below some minimal value, here urest = 0, corresponding to a vanishing rate. The last requirement can, for instance, be accomplished by offsetting the activation function into a regime that guarantees that u stays positive.

In our simulations for Fig. 5, we did not explicitly implement a dynamic vesicle pool, but instead directly used the recovered membrane potentials u.

B. Looking back and forward in time with derivatives

Since dealing with extrapolations into the future is a crucial notion of the paper, we present here some of the calculations. The discounted future voltage ũ was introduced in Eq. 4 as

To show that ũ satisfies τũ̇ = ũ − u, we need to apply the Leibniz integral rule in calculating the derivative ũ̇. This leads to

Multiplying this equation by τ and using the definition of ũ yields τũ̇ = ũ − u, or u = ũ − τũ̇. By applying the Leibniz integral rule one also shows that the low-pass filtered trace x̄, defined in Eq. 15,

solves τ dx̄/dt = x − x̄. This differential equation can be written as x being the lookahead of x̄, with the lookahead operator defined in Eq. 14. To show that, conversely, the low-pass filtering of the lookahead returns the original trace, one applies partial integration. Note that the equality only holds if we integrate from −∞, and hence if the initialization of the trajectory lies far back in the past as compared to the time constant τ.

Uniqueness of the rate function

In the main text we concluded that the low-pass filtered rate equals the transfer function applied to the low-pass filtered voltage, r̄ = ρ(ū). This conclusion is a consequence of the uniqueness of the solution of an ordinary differential equation for a given initial condition (which may include delta-functions on the right-hand side, see e.g. Nedeljkov & Oberguggenberger, 2012). In fact, we may consider both ρ(ū) and r̄ as solutions of the same differential equation, with x being either of the two. Because the solution is unique, we conclude that r̄ = ρ(ū).

Learning the input time constants

In our applications we assumed that the input rates in the original mapping are low-pass filtered with a common time constant that is also shared by the membrane time constant of the neurons. But in general, we may consider a subclass of network neurons (with voltage vector u1) that exclusively receive input rates rin and integrate these rates with corresponding time constants τin. The prospective rates of these neurons are then produced with the same voltage time constant τ as in the downstream neurons. The downstream voltage integration then cancels the presynaptic lookahead in the presynaptic firing rate, leading again to an instantaneous voltage propagation across the entire network.

The general setting of learning to map time series is that the input time series rin(t) are low-pass filtered with given time constants τin, and the target output time series are a function of these low-pass filtered inputs. In the student network, the input time constants appear as parameters that can be learned by gradient descent to match the output of the teacher network, just as the synaptic weights are learned.

We ask for a learning rule so that the student network can extract the input time constants used by the teacher network. For this, we assume that only the neurons u1 receive the external input and hence carry the time constant τin. Because the low-pass filtered input depends smoothly on τin, the gradient rule for the input time constant is

This local learning rule also globally reduces the Lagrangian: in the limit β → ∞ it is gradient descent on the mismatch energy, while in the limit β → 0 it is gradient descent on the cost. The proof works as in Theorem 1. To learn a more complex mapping of time series, with temporal processing beyond a function of merely the low-pass filtered inputs, additional variables need to be introduced that form memories (see SI, Sect. F, ‘Generalizations’).
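As an illustration of such a gradient rule, the reduced teacher-student sketch below learns a single scalar input time constant; the sensitivity s = ∂x̄/∂τ obeys its own filter equation, ṡ = −(s + ẋ̄)/τ, which makes the rule local in time. All parameters are our own illustrative choices, and the convergence is approximate:

    import numpy as np

    rng = np.random.default_rng(5)
    dt, T = 1e-3, 100.0
    tau_star, tau = 0.05, 0.15      # teacher and initial student time constants
    eta = 5.0
    x = xbar_t = xbar_s = s = 0.0
    for _ in range(int(T/dt)):
        x += dt*(-x/0.02) + np.sqrt(dt)*rng.normal()   # noisy input trace
        dx_t = (x - xbar_t)/tau_star
        dx_s = (x - xbar_s)/tau
        xbar_t += dt*dx_t                              # teacher's low-pass filter
        xbar_s += dt*dx_s                              # student's low-pass filter
        s += dt*(-(s + dx_s)/tau)                      # sensitivity ∂x̄/∂τ
        tau = max(0.01, tau + dt*eta*(xbar_t - xbar_s)*s)  # descend the mismatch
    print(tau)                                         # drifts toward tau_star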

C. From the implicit to the explicit differential equation: more details

Global convergence to a moving equilibrium (i.e. to f(u, t) = 0)

The original Euler-Lagrange equation is Eq. 20. Remember the definition of f(u, t) given in Eq. 21. While the underlying error may not vanish, the ‘corrected error’ f(u, t) is shown to vanish exponentially quickly, leading to the moving equilibrium with f(u, t) = 0. In fact, f has the general solution

as an exponentially decaying function of t with u(t0) = u0. Note that the argument u of the function f does not a priori depend on time. It only becomes a function of time, u(t), by requiring that the Euler-Lagrange equation holds. The voltage u(t) becomes a non-trivial function of time because the time dependence of the function f(u, t), which a priori enters only through the second argument, expresses the temporally varying input, including a possible target output.

Implicit differential equation

To make the dependence on u̇ of the right-hand side of the implicit differential equation (Eq. 22) more transparent, we rewrite it in the form

The partial derivative of f (u, t) with respect to time, ∂f/∂t, captures the external drive from the inputs and output targets that do not depend on u, but may directly depend on time. In fact, instead of the argument t of f, we could consider the two time-dependent external signals, the low-pass filtered input rates and the output targets, so that the partial derivative of f with respect to t can be written as

where δio is the Kronecker delta that is equal to 1 if i is an output neuron, and 0 else. The partial derivative of f with respect to u represents the (symmetric) Hessian of the Lagrangian, with δij being the Kronecker delta, ē defined in Eq. 21 and ē* defined above Eq. 16. Remember that W = (Win, Wnet) and r̄ = (r̄in, ρ(u)). In vectorial notation the Hessian of the Lagrangian is

where 1 is the identity matrix. This Hessian is ‘typically’ positive definite, as we argue next. For a functionally feedforward network, the weight matrix Wnet is lower triangular, and for small enough β the diagonal elements of H are all positive and hence the Hessian H is invertible. Since H is also symmetric, it is therefore positive definite. In the general case, we require that Wnet has eigenvalues smaller than 1, since otherwise the recurrent network dynamics may explode. If ρ′(u) is also smaller than 1, and β still small enough, we conclude again that the Hessian is invertible (and hence positive definite). Notice that the symmetry of the Hessian is a general property of twice differentiable functions and is far weaker than the requirement that the recurrent weight matrix Wnet itself be symmetric, as classically required for fixed-point convergence in Hopfield-type networks.
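A quick numerical sanity check of this claim (our sketch; we assume β = 0 so that the Hessian reduces to H = 1 − Wnet ρ′(u), and we bound the spectral norm of Wnet, a stronger sufficient condition than the eigenvalue bound above):

```python
import numpy as np

# Sketch (ours): with beta = 0 the Hessian reduces to H = I - W_net diag(rho'(u)).
# We bound the spectral norm of W_net by 0.9 < 1; since rho'(u) <= 1 for tanh,
# the symmetric part of H then has eigenvalues >= 1 - 0.9 = 0.1 > 0.
rng = np.random.default_rng(0)
n = 50
W = rng.normal(size=(n, n))
W *= 0.9 / np.linalg.norm(W, 2)          # spectral norm 0.9
u = rng.normal(0.0, 2.0, n)
rho_prime = 1.0 / np.cosh(u) ** 2        # derivative of tanh, in (0, 1]

H = np.eye(n) - W * rho_prime            # H_ij = delta_ij - W_ij * rho'(u_j)
eigs = np.linalg.eigvalsh((H + H.T) / 2.0)
print(f"smallest eigenvalue of sym(H): {eigs.min():.3f}")  # >= 0.1 by construction
```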

Since in the implicit differential equation of Eq. 46 the u̇ also arises on the right-hand side, we need to show that this Eq. 46, for fixed u, can be solved for u̇. To find the solution for given arguments in K, we search for a fixed point of the mapping u̇ ↦ K(u̇). In the argument u̇, the mapping is affine, K(u̇) = M u̇ + b, with the matrix M appearing in Eq. 46. The Banach fixed point theorem asserts that if K is strictly contracting with constant k, i.e. if ‖K(u̇1) − K(u̇2)‖ ≤ k ‖u̇1 − u̇2‖ for all pairs of inputs and some 0 ≤ k < 1, then the iteration u̇ ← K(u̇) (here with iteration time step dt) locally converges to a fixed point. Because the mapping is affine in u̇, it is k-contractive if the eigenvalues of M have absolute value smaller than k. This is the case if the Hessian H, given in Eq. 48, is strictly positive definite (which is ‘typically’ the case, see above).

Stable solution of the implicit differential equation

According to the above reasoning, the Banach fixed point theorem asserts (for the typically positive definite Hessian) that during the iteration the distance to the fixed point u̇ is bounded by a constant times k^i, with iteration index i and ‘virtual’ Euler step dt. Formally,

Crucially, because the mapping is affine in u̇, the convergence is global in u̇ while fixing the other arguments of K.

In an analogue physical device that implements exactly this feedback circuit in continuous time, the dt becomes truly infinitesimal and in this sense the convergence is instantaneous. If dt remains finite, the iteration converges to a moving target because the mapping changes with time. The target should not move more quickly than the time scale of the convergence (but, fortunately, this convergence time scale can be made arbitrarily short by reducing dt). Given a time course of the input rate rin(t) and target uo*(t) that has bounded variation, the dt can be chosen so that the convergence becomes arbitrarily quick, and in the limit instantaneous. As a consequence, for small enough dt, u̇ becomes the same on the left- and right-hand sides of the implicit differential equation 46.
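A toy version of this tracking argument (entirely our construction, with made-up numbers): an affine contraction converges geometrically for a frozen drive, and tracks the moving fixed point when the drive changes slowly compared to the iteration:

```python
import numpy as np

# Toy fixed-point tracking (ours): iterate x <- M x + b(t) with spectral
# radius k < 1. For frozen b the error shrinks like k**i; for slowly drifting
# b(t) the iterate lags the moving fixed point x*(t) = (I - M)^{-1} b(t) by an
# amount of order k/(1-k) times the per-step drift.
rng = np.random.default_rng(2)
n, k = 10, 0.5
M = rng.normal(size=(n, n))
M *= k / np.max(np.abs(np.linalg.eigvals(M)))

x = np.zeros(n)
for i in range(200):
    b = np.sin(2 * np.pi * i / 1000.0) * np.ones(n)   # slowly moving drive
    x = M @ x + b
x_star = np.linalg.solve(np.eye(n) - M, b)            # current fixed point
print(f"tracking error: {np.linalg.norm(x - x_star):.2e}")
```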

If rin(t) contains well-separated delta-functions, while otherwise still having bounded variation, the reasoning still applies since at any point in time, except at the time points of the isolated delta-functions, f (u, t) = 0. This statement is proven in Appendix D below.

In practical simulations, a caveat is in place when the learning rate becomes too big and starts to affect the neuronal dynamics. In this case, the Hessian acquires an additional learning-rate-dependent term, and the eigenvalues may become negative due to this term. Simulations can in fact become unstable with a big learning rate, and this is more pronounced if the Euler step dt is also large. The explicit differential equation avoids the fast iteration towards a moving target and hence allows for larger dt. This pays off in particular in the presence of a high learning rate (although the Cholesky decomposition also requires positive definiteness). For this reason, the large-scale simulations involving plasticity are performed with the explicit form described next.

Explicit differential equation

To isolate u̇ in the implicit differential equation, we rewrite it again as

Plugging the Hessian from Eq. 48 into Eq. 50, we obtain the voltage dynamics from Eq. 22 in the equivalent form

In our applications, the Hessian H appears to be invertible (although this may not be the case for arbitrary networks), and Eq. 51 can be solved for the unique u̇ using the Cholesky decomposition. In fact, if H is invertible, the system of implicit ordinary differential equations from Eq. 51 can be converted into a system of explicit ordinary differential equations (with H given in Eq. 48),

and as such it has a unique solution for any given initial condition. Eq. 52 represents the explicit solution for u̇ of the iteration after the convergence to the fixed point. The condition for the convergence is the same as the condition for solving Eq. 51 for u̇, namely an invertible (and hence positive definite) Hessian.
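In code, the explicit form amounts to one symmetric positive definite solve per Euler step. A minimal sketch (ours; H and the right-hand side c are placeholders standing in for Eq. 48 and Eq. 51):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

# Sketch (ours): one Euler step of the explicit dynamics H(u) u_dot = c(u, t).
# H and c are placeholders for the Hessian of Eq. 48 and the right-hand side
# of Eq. 51; here we only fix their shapes and the solver logic.
def explicit_euler_step(u, H, c, dt=1e-3):
    """One step of u_dot = H^{-1} c, with H symmetric positive definite."""
    u_dot = cho_solve(cho_factor(H), c)
    return u + dt * u_dot

# toy usage with a random symmetric positive definite 'Hessian'
rng = np.random.default_rng(3)
n = 20
A = rng.normal(size=(n, n))
H = np.eye(n) + 0.05 * (A + A.T)   # symmetric, positive definite for small A
c = rng.normal(size=n)
u = explicit_euler_step(np.zeros(n), H, c)
print(u[:3])
```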

Because the external drive enters only through the partial derivative ∂f/∂t, see Eq. 47, the regularity requirement for u̇ to be integrable is satisfied even if the low-pass filtered inputs and the targets contain step functions (and rin(t) delta-functions), see Sect. D for details.

Even in the presence of such step-functions, the Euler-Lagrange equations lead to an f that is a decaying exponential, f (u(t), t) = f (u0, t0) e^(−(t−t0)/τ), see Eq. 45. For initialization at t0 = −∞ we have f (u, t) = 0 at any time. In fact, Eq. 52 is equivalent to f + τ (d/dt) f = 0, and hence any solution of Eq. 52, even if u̇ contains a delta-function, is also a solution of f (u, t) = 0. Possible jumps in the inputs or targets are compensated by the jumps they induce in u (see below for the full mathematical description with a simple example). To give an intuition, assume that a recurrent network of our prospective neurons has separate fixed points for two constant input currents rin(1) and rin(2). This network still shows an overall relaxation time of τin (but not longer!) when the input rates instantaneously switch from rin(1) to rin(2). Nevertheless, at any moment during this relaxation process, gradient learning of the mapping is still guaranteed (Theorem 1).

Link to time-varying optimal control

The explicit differential equation, Eq. 52, is a special case of the one in Simonetto, Dall’Anese, et al., 2020 (their Eq. 20), where the function to be minimized (their f) can take a general (Lipschitz continuous) form (hence their f is our Lagrangian L). To avoid inverting the Hessian, an iteration algorithm can be applied, similar to our implicit form, although more involved in order to deal with the more general form of L (Simonetto & Dall’Anese, 2017). The idea of tracking the solution of a time-varying optimization problem with a linear look-ahead in time was introduced in Zhao & Swamy, 1998.

The NLA retains stability as compared to an infinitesimally fast (τ → 0) Equilibrium Propagation

What have we gained as compared to making the time constant in Equilibrium Propagation very small? The answer lies in the stability. To make the comparison explicit, we recall the Equilibrium Propagation, which we formulate in terms of a Lagrangian. In both cases, the NLA and the Equilibrium Propagation, the Lagrangians are identical. We have chosen the notation r̄ to express that these quantities can be interpreted as a low-pass filtering of the instantaneous rates r. In the Equilibrium Propagation, they are called r = (rin, ρ(u)), but this only represents a renaming of the input rates (and a notation change from r̄ to r). Hence, the Lagrangian in the NLA, expressed in terms of r̄ = (r̄in, ρ(u)), is rewritten as the same Lagrangian expressed in terms of r = (rin, ρ(u)).

The only true difference is that in the NLA we apply the combined lookahead operator (1 + τ d/dt) ∂/∂ũ to the Lagrangian L, while in the Equilibrium Propagation one applies the reduced operator ∂/∂u to L. We claim that the lookahead makes the difference for ‘living’ systems that learned to optimize in time with respect to future quantities. The Euler-Lagrange equations for the Equilibrium Propagation reduce to ∂L/∂u = 0, where now e = u − Wr. Since these are not differential equations (no u̇ in there), one introduces a gradient descent dynamics on L and waits for convergence for fixed input and fixed target. If τ = 0 in both the NLA and the Equilibrium Propagation, we obtain the same implicit and stationary equations in u.

Because the equation ∂L/∂u = 0 is implicit in u, it has to be solved first for u. While the Equilibrium Propagation suggests keeping the inputs constant until a fixed point is reached, the NLA dynamically moves along the trajectory of fixed points while the inputs are changing. In the Equilibrium Propagation we cannot simply take the limit τ → 0, since then the dynamics either disappears (when τ remains on the left, τ Δu → 0) or explodes (when τ is moved to the right, Δu ∝ 1/τ → ∞), leading to either too small or too big jumps.

D. Contraction analysis and delta-function inputs

Contraction property

We next show when the voltage dynamics obtained from the Euler-Lagrange equations is locally contracting, i.e. when neighboring trajectories of the explicit differential equations, for given inputs and targets, locally converge. This is a different stability property as compared to the (global) convergence properties for the implicit differential equation stated in Eqs 45 and 49.

For the contraction analysis we rewrite Eq. 51 in the form E(u, u̇, t) = 0, where the explicit time dependence of E and f is a short-cut to express the dependence on the external inputs and targets. According to the implicit function theorem, at any point in time when ∂E/∂u̇ is invertible, we can locally solve for u̇. When absorbing the dependence of u̇ on the external variables into an explicit time dependence, we can rewrite this differential equation as u̇ = g(u, t), see Eq. 52, when explicitly expressing the dependency of g on u only (instead of on all the variables y = (u, u̇, t)). This differential equation is contractive, and thus stable, if the Jacobian of g with respect to u is uniformly negative definite (Lohmiller & Slotine, 1998). The contraction analysis tells us that locally, where ∂E/∂u̇ is invertible, we can express u̇ as a function of the remaining arguments, whose derivative is given by the implicit function theorem. Restricted to the u-component of y we get the Jacobian

To prove Eq. 53 we note that, according to Eq. 51, E is affine in u̇ with the Hessian H specified above as coefficient matrix, from which ∂E/∂u̇ follows. Since, according to Eq. 47, the term ∂f/∂t does not depend on u, its contribution to ∂E/∂u vanishes. Hence, the Jacobian in Eq. 53 follows from the implicit function theorem.

If we had the term −(1/τ) 1 in Eq. 53 alone, the Jacobian would obviously be negative definite, guaranteeing the local contraction property (Lohmiller & Slotine, 1998). To understand this statement, we consider the local representation of neighboring trajectories. The linear approximation of the voltage dynamics u(t), starting with u(tº) in a dim-u-dimensional neighbourhood around some point uº(tº), is u(t) = uº(t) + e^(−(t−tº)/τ) (u(tº) − uº(tº)). While u(t) is the linear approximation of the original voltage dynamics starting at u(tº), the trajectory uº(t) is the true voltage dynamics starting at the ‘center’ uº(tº) of the neighbourhood. This shows the exponential local contraction of surrounding trajectories towards uº at times t around tº, would we have the term −(1/τ) 1 alone.

The additional log-term in Eq. 53 may cause a local violation of this contraction property. However, the additional term in Eq. 53 becomes small for large or small voltages, for which we assume that the curvature of the transfer function also becomes small, ρ′′(u) ≈ 0. In fact, based on Eq. 48 we calculate this term to be a function of ρ′′(u) that vanishes with vanishing ρ′′(u),

Plugging Eq. 54 into Eq. 53, we conclude that the contraction property holds wherever the curvature of the transfer function vanishes, ρ′′(u) ≈ 0. Yet, ρ′′ may strongly deviate from 0 for intermediate values of u. For a rectified linear unit (ReLU), ρ(u) = u for u ≥ 0 and 0 else, for instance, ρ′′ is a delta-function at 0. Even if the deviation from ρ′′(u) = 0 in this case is on a set of measure 0, integrating across voltages u = 0 can cause a jump in the voltage dynamics, preventing a strict local contraction property everywhere. The contraction property, in the example of a ReLU, only holds almost everywhere. Because the ReLU is an important transfer function in practical applications, we elaborate on this example further.

Comparison of the NLA with the Latent Equilibrium

Here we point out that the appearance of a delta-function for ReLUs in the NLA may represent a technical challenge, that this specific challenge is tamed in the simpler version of the NLA principle formulated as the Latent Equilibrium (Haider et al., 2021), and that the Latent Equilibrium may represent a viable new approach for time-varying optimization in general.

The Euler-Lagrange equation for the presented NLA, f + τ (d/dt) f = 0, Eq. 20, with its Jacobian ∂f/∂u, implies second-order derivatives with respect to the voltage u. This is the case because the total derivative with respect to time, d/dt, applied to the error contained in f (u, t), implies an expression containing ρ′′(u), with ′′ equivalent to d²/du². These second-order derivatives enter both the implicit form (Eq. 46 and Eq. 51) and the explicit differential equation (Eq. 52). When integrating these equations across the components with u = 0, a jump in u emerges that must be explicitly included in the simulations. It is the same jump that may transiently violate the contraction property as shown above (Eqs 53 and 54), but it does not violate the exponential convergence from Eq. 45 (see below).

In the Latent Equilibrium, our Lagrangian expressed in terms of r̄ = (r̄in, ρ(u)) is replaced by a Lagrangian expressed in terms of r = (rin, ρ(ŭ)), with ŭ = u + τ u̇. If we were asking for stationary trajectories u(t) of the action with respect to variations in u, we would obtain unstable Euler-Lagrange equations, solved by an exponentially growing function of time, c e^(t/τ), unless c = 0. To force the trajectory to move along the stable solution, one again requires that variations of the action ∫ L dt are stationary with respect to a prospective voltage, this time with respect to ŭ. Because the Lagrangian for the Latent Equilibrium can be expressed as a function L(ŭ) of ŭ only, not of u and u̇ separately, the Euler-Lagrange equations become ∂L/∂ŭ = 0, where now e = ŭ − Wr.

The Latent Equilibrium is consistent with the central idea of the NLA to deal with future quantities. While the presented version of the NLA deals with the discounted future voltage ũ (leading to the operator (1 + τ d/dt) ∂/∂ũ applied to the Lagrangian), the Latent Equilibrium deals with the linear approximation ŭ = u + τ u̇ of the future voltage (leading to the operator ∂/∂ŭ applied to the Lagrangian). Both the NLA and the Latent Equilibrium differ in this crucial aspect from the Equilibrium Propagation, which deals with the current voltage u (leading to the operator ∂/∂u applied to the Lagrangian L(u), without appearance of u̇, and hence without intrinsic dynamics, see above).

For the Latent Equilibrium, it is not obvious how to find global convergence properties as expressed in Eqs 45 and 49 for the presented version of the NLA. However, the Latent Equilibrium satisfies the strict local contraction property. In fact, writing the Euler-Lagrange equations as E(ŭ, t) = 0, with the abbreviation ŭ = u + τ u̇, leads to the corresponding partial derivatives of E, and the linear approximation obtained from plugging these partial derivatives into Eq. 53 yields the claimed contraction property for the Latent Equilibrium. The Jacobian is invertible if the Hessian is (which looks as in Eq. 48, but with u replaced by ŭ).

Crucially, for the Latent Equilibrium, the analogous theorems on real-time dendritic error propagation and learning (rt-DeEP and rt-DeEL) hold. The key property is that the trajectories of the Latent Equilibrium are stationary solutions of the corresponding action A = ∫ L dt, so that the variation of A induced by W reduces to the partial derivative of L with respect to W.

The above strict contraction property, which for non-vanishing errors only holds for the Latent Equilibrium but not for the NLA, may be a reason why, in our hands, the simulations of the error-based updates seem to be more stable for the Latent Equilibrium as compared to the NLA. This does not detract from the fact that the NLA equations are used in the context of time-varying optimal control (see above). It would be interesting to also consider the Latent Equilibrium as a tool for non-stationary, time-varying optimization (Simonetto, Dall’Anese, et al., 2020).

Delta-function inputs keep f (u, t) = 0

We next explain in more detail why, for delta-functions in the input rates rin(t), the NLA stationarity (‘equilibrium’) condition f (u, t) = 0 remains satisfied at all times (a delta-function in rin for the NLA corresponds to a step-function in rin for the Latent Equilibrium). We reconsider the explicit differential equation given in Eq. 52, with H given in Eq. 48.

To simplify matters, we consider a single delta-function at t = 0 as input in the absence of output nudging. In this case we get the input drive Win δin(t), where the input matrix Win is typically sparse (not all network neurons receive external input), and δin(t) is a vector of delta-functions restricted to the input neurons. Following Nedeljkov & Oberguggenberger, 2012, Proposition 2.1, we can then write the explicit differential equation, Eq. 51, in the form

where the regular part is globally Lipschitz continuous. Due to the Lipschitz continuity, the change in u evoked by the regular part during a small time interval [−ε, ε] around t = 0 vanishes when this interval shrinks, ε → 0. To quantify the change in u during these intervals, it is therefore enough to consider the delta-function contribution alone. To estimate the jump induced by the delta-functions, we consider some mollifier ϕε(t) = ε−1ϕ(t/ε), where ϕ(t) is a smooth function on the interval [−1, 1] with integral 1. By ϕin,ε(t) we denote the vector of mollifiers centered at the delta-functions of the input neurons. We now consider two differential equations, the second approximating the first on the interval [−ε, ε], but without the regular term,

We assume that for all t ∈ [−ε, ε] the matrices H(uε(t)) and H(u(t)) are invertible, so that the two Eqs 56a,b can be turned into explicit differential equations. Analogously to the 1-dimensional case (Nedeljkov & Oberguggenberger, 2012), we conclude that the solutions of Eqs 56a,b on the interval [−ε, ε] converge to each other for ε → 0. As a consequence, the jump of u at t = 0 converges to the corresponding jump of uε in the various dimensions.

To calculate the jump of uε we have to integrate across the time interval [−ε, ε]. Instead of integrating Eq. 56a, we first integrate the approximation given in Eq. 56b. Moving the Hessian from the right- to the left-hand side and integrating yields

where Iin is the index vector of the input neurons receiving a delta-function, i.e. Iin,j = 1 if there is a delta-input at neuron j, and 0 else. Note that the mollifiers integrate to this index vector, ∫ϕin,ε(t) dt = Iin. Because H is itself a derivative with respect to u, we can explicitly calculate the latter integral (also for β > 0, but for clarity here only done for β = 0). The last integral in Eq. 57 is defined as a vector with i-th component being

where in the second last equality we used that Hij(u) = δij − Wij ρ′(uj) only depends on the component uj, see Eq. 48. In the last equality we introduced the ‘network voltage error’ v = u − Wnetρ(u). Following the 1-dimensional case treated in Nedeljkov & Oberguggenberger, 2012, Proposition 1.2, we introduce the ‘jump function’ (called G(y) in the cited work)

with v0 thought to represent v at some time t0 before the delta-kick sets in. With this setting, Eq. 57 turns into

In the limit ε → 0 we get a relation between u(t) immediately before and after the jump, u(−0) and u(+0), using that in this limit the boundary points of the trajectories also converge, uε(±ε) → u(±0),

We now assume that the function J(u) = τ (v − v0) is invertible around the jump. This is the case if the Jacobian ∂J/∂u = τ H is invertible, and because v = u − Wnetρ(u), we require invertibility of H = 1 − Wnetρ′(u), the Hessian defined in Eq. 48.

In the case of invertibility we get the voltage after the jump as

We next calculate the jump in v. This is easy since v = u − Wnetρ(u) enters linearly in the function J(u) = τ (v − v0). Plugging the explicit expression for J into Eq. 61 we get τ (v(+0) − v0) = τ (v(−0) − v0) + WinIin, or

Knowing the jump in v helps to show that the equilibrium condition f (u, t) = 0 is always satisfied, even immediately after the delta-input, provided the initialization is at t0 = −∞. To show this, remember that in the absence of nudging the equilibrium condition reads v = Win r̄in. The jump size of r̄in for a delta-function at t = 0, rin(t) = δin(t), is Iin/τ. This is because r̄in satisfies the differential equation τ (d/dt) r̄in = rin − r̄in, provided that t0 = −∞. Hence,

With Eqs 63 and 64 we conclude that v = Win r̄in, and hence f (u, t) = 0, holds throughout.

E. Example of a single recurrently connected neuron

To get an intuition for the instantaneity in the recurrent case, we consider the example of a single, recurrently connected neuron. We also put this into the context of the Latent Equilibrium (Haider et al., 2021). Consider the weight vector W = (Win, Wnet) with an input rate rin(t) driving the postsynaptic voltage u. The postsynaptic rate is r = ρ(u) + τ (d/dt)ρ(u), and its low-pass filter with respect to τ is r̄ = ρ(u). As always, the low-pass filtering reaches back to an initialization at t0 = −∞, see Eq. 15. The Lagrangian has the form

The Euler-Lagrange equations are derived from the stationarity of the action with respect to the look-ahead voltage ũ. Applying the look-ahead operator (Eq. 14), and abbreviating the error by ē, the Euler-Lagrange equations deliver the voltage dynamics,

To simplify matters, we consider the nudging-free case, β = 0. This implies that the error reduces to ē = u − Win r̄in − Wnetρ(u). With r̄ = ρ(u), we obtain the differential equation

Abbreviating v = u − Wnetρ(u) as ‘network voltage error’, the above differential equation reads as

Integrating the effective voltage dynamics (Eq. 68), assuming initialization infinitely far in the past, yields v = Win r̄in. This equation is equivalent to the integrated Euler-Lagrange equation, and because the solution of the Euler-Lagrange equation is a decaying exponential in f, we have (using t0 = −∞)

Voltage dynamics for a delta-function input

We next apply a delta-function in the input rate, say rin(t) = δ(t), and consider the dynamics at the level of the voltage, Eq. 67. As in Nedeljkov & Oberguggenberger, 2012, Proposition 1.2, we introduce the ‘jump function’ (called G(y) in the cited work)

As in Nedeljkov & Oberguggenberger, 2012, Proposition 1.2, we show that the voltage u makes a unique jump at the moment of the delta-function, provided that J(u) is invertible around the jump.

We set v0 = u0 − Wnetρ(u0). Here, u0 is some voltage before the jump, say u0 = u(−1) evaluated at time t = −1, when the jump is at t = 0. When u(−0) is the voltage immediately before the jump, the voltage immediately after the jump is specified by

The reason is that the Win-scaled delta-function triggers a step of size Win when integrating over it as done in Eq. 70. The new value u(+0) is unique if J is invertible, and looking at the defining integral in Eq. 70, this is the case if Wnetρ′(u) ≠ 1.

The jump in u translates into a jump in v = u − Wnetρ(u) from v(−0) to v(+0). This endpoint can also be expressed as

To check this, we assume without loss of generality that v0 = u0 − Wnetρ(u0) = 0. Then J(u) = τ (u − Wnetρ(u)) = τ v, and J(u(+0)) = J(u(−0)) + Win according to Eq. 71. Since also J(u(+0)) = τ v(+0), we conclude that v(+0) = v(−0) + Win/τ, as claimed above.

We finally show that, even far away from the initialization, the stationarity condition f = 0 holds before and immediately after the jump. In fact, for t0 = −∞, the evolution of the ‘network voltage error’ becomes

Here we used that v(−0) = Win r̄in(−0) and that, according to Eq. 72, v jumps to v(+0) = v(−0) + Win/τ. Remember that for initialization far in the past, f = 0 is equivalent to v = Win r̄in, see Eq. 69. We therefore have to calculate the jump in r̄in induced by the delta-input. Since r̄in is the solution of τ (d/dt) r̄in = rin − r̄in for t0 = −∞, we find that

Combining the two Eqs 73 and 74 proves that v = Win r̄in, and hence f = 0, holds true at any moment in time, provided the initialization is far in the past.

One may ask why the delta-kink is different from resetting v at a new initialization off from 0. The reason is that at t = 0 there is a cause for the jump in r(0), while at t0 there is no cause in r(t0). In fact, there is no jump initially, just the start of v at some initial condition. Differently from the initialization at t0, where v(t0) > 0 implies f ≠ 0 for finite t − t0 > 0, the jump of v(0) at t = 0 to a positive value does leave f = 0 intact for all t > 0, provided t0 = −∞.

Linear transfer function

We first consider the case of a linear transfer function, ρ(u) = u (or a threshold-linear one operating in its linear regime). Then the differential equation becomes

With initialization at t0 = −∞ and the low-pass filtering defined in Eq. 15, the solution is

The point is that the time constant is τ and not τ/(1 − Wnet), as would be the case without the prospective firing rate. In fact, for the ‘classical’ differential equation,

and ρ the identity, we obtain the differential equation τ u̇ = −(1 − Wnet) u + Win rin, with effective time constant τ/(1 − Wnet) and a solution that is now the low-pass filtering of the input with respect to this effective time constant.
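A short simulation (ours, with arbitrary parameters) makes the difference in time constants explicit; both dynamics share the fixed point, but only the classical one is slowed down by the recurrence:

```python
import numpy as np

# Step response of a single self-connected linear neuron, rho(u) = u (ours).
# Prospective rate: u relaxes with time constant tau (this update encodes the
# result derived above, v = Win*r_bar with u = v/(1 - W_net)). Classical rate:
# effective time constant tau / (1 - W_net), here five times slower.
tau, W_net, W_in, r_in, dt = 0.1, 0.8, 1.0, 1.0, 1e-4
t = np.arange(0.0, 3.0, dt)
u_inf = W_in * r_in / (1.0 - W_net)          # common fixed point

u_nla = np.zeros_like(t)
u_cls = np.zeros_like(t)
for i in range(1, len(t)):
    u_nla[i] = u_nla[i-1] + dt / tau * (u_inf - u_nla[i-1])
    u_cls[i] = u_cls[i-1] + dt / tau * (W_in * r_in - (1 - W_net) * u_cls[i-1])

for name, u in [("prospective", u_nla), ("classical", u_cls)]:
    t63 = t[np.argmax(u >= (1 - np.exp(-1)) * u_inf)]
    print(f"{name}: 63% settling time = {t63:.2f} s")   # ~0.10 s vs ~0.50 s
```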

Sigmoidal transfer function

For a sigmoidal transfer function ρ(u), a positive feedback weight Wnet > 0, and a constant external input rin, say, the solutions u(t) of Eq. 66 either converge to a fixed point or diverge. When converging, the voltage satisfies the fixed point condition

This fixed point equation can be solved numerically by a time-discrete iteration. But it can also be solved by a time-continuous process that underlies a neural or neuromorphic implementation. The prospective firing rate introduced in the NLA can be seen as a method to quickly find the fixed point in continuous time. When directly solving the implicit differential equation (as opposed to converting it into an explicit differential equation using e.g. the Cholesky decomposition), the fixed point is potentially found with fewer steps.

To estimate the speed of convergence, we look at the initial speed when taking off at an initial condition u(0) between the unstable and the stable fixed point. The initial speeds for the classical differential equation, Eq. 77, and the NLA version, Eq. 66, are, respectively,

where we set Δu(0) = −u(0) + Wnetρ(u(0)) + Winrin(0). As Wnetρ′ > 0, the initial convergence speed of the NLA solution is larger. The scheme has some resemblance to the Newton algorithm for finding zeros of a function by using its derivative.
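The speed advantage can be checked numerically. In the sketch below (ours; parameters are chosen such that Wnetρ′(u) < 1 everywhere, so the factor 1 − Wnetρ′(u) stays positive and no jumps occur), the prospective dynamics settles faster than the classical one:

```python
import numpy as np

# Speed comparison for a single self-coupled sigmoidal neuron (ours).
# Classical: tau du/dt = -u + W_net rho(u) + W_in r_in
# NLA:       tau (1 - W_net rho'(u)) du/dt = -u + W_net rho(u) + W_in r_in
tau, W_net, W_in, r_in, dt = 0.1, 3.0, 1.0, 0.2, 1e-4
rho = lambda u: 1.0 / (1.0 + np.exp(-u))
rho_p = lambda u: rho(u) * (1.0 - rho(u))

def settling_time(nla, u=-2.0, tol=1e-3):
    for i in range(10**6):
        drive = -u + W_net * rho(u) + W_in * r_in
        if abs(drive) < tol:
            return i * dt
        h = 1.0 - W_net * rho_p(u) if nla else 1.0
        u += dt / tau * drive / h
    return float("inf")

print(f"classical: {settling_time(False):.2f} s")
print(f"NLA:       {settling_time(True):.2f} s")   # faster, since h < 1
```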

F. NLA for conductance-based neurons and least-action principle in physics

The mismatch energies and costs can be generalized in different ways. Here we focus on a biophysical version of the mismatch energy that includes conductance-based neurons. This also relates to the least-action principle in physics. But the NLA can also be generalized to include other dynamical variables such as adaptive thresholds or synaptic short-term plasticity.

Equivalent somato-dendritic circuit

For conductance-based synapses, the excitatory and inhibitory conductances, gE and gI, are driven by the presynaptic firing rates and have the form gE(t) = WE r(t) and analogously gI(t) = WI r(t). The dynamics of a somatic voltage u and a dendritic voltage v read as

where c and cd are the somatic and dendritic capacitances, EL/E/I the reversal potentials for the leak, excitatory and inhibitory currents, gsd the transfer conductance from the dendrite to the soma, and gds the one in the other direction.

We consider the case when the dendritic capacitance cd is small as compared to the sum of conductances gd on the right-hand side of Eq. 81, yielding a fast dendritic time constant. In this case we can solve this equation in the steady state for v, plug the result into Eq. 80, and get

with effective reversal potential V, total conductance g, feedforward dendritic voltage vff = (gLEL + gEEE + gIEI)/gff and feedforward dendritic conductance gff = gL + gE + gI. Because gE and gI are driven by the prospective presynaptic rates, the conductance depends on the presynaptic voltage and its derivative. Equation 82 describes the effective circuit that has the identical voltage time course as Eqs 80 and 81 in the limit of vanishing cd, but with a single time-dependent ‘battery voltage’ V and Ohmic resistance R = 1/g.

Somato-dendritic mismatch power and action

The synaptic inputs gE(t) and gI(t) continuously drive V (t), and the best one can hope for the dynamics of u is that it traces V with some integration delay determined by the time constant τ = c/g. In fact, if u follows the dynamics of Eq. 82, then u becomes the low-pass filtered target potential, u = V̄, where V (t) is filtered with the dynamic time constant τ (t). The defining equation for the low-pass filtering is

and this self-consistency equation is equivalent to the explicit form

To capture the voltage dynamics with our NLA principle we recall that the somatic voltage u can be nudged by an ‘apical voltage’ ē that causes a somatic voltage error. The voltage error drives a current I = gē through the conductance g. The electrical power of this current I, driven by the voltage ē, is P = Iē = gē². This motivates the definition of the mismatch power in a network of N neurons by

𝒫 = Σᵢ gᵢ ēᵢ², summed over the N neurons i.

𝒫 is a virtual power that, nevertheless, is related to some physical flow of ions. Assume we could measure all the ions flowing in the original circuit of Eqs 80 and 81 (in the limit of a small ratio cd/gd). From this flow, delete the ion flow that cancels at the level of electrical charge exchange due to the counter-directed flow. The remaining effective ion flow defines an effective current flowing through the conductance g with driving force (V − u), Eq. 82. If it were only this effective current in the reduced circuit, the voltage u(t), starting at u(0), would converge with time constant τ to the low-pass filtered battery voltage V̄. Without additional ‘hidden’ currents, the voltage u would then instantaneously follow V̄, which is itself given by the forward dendritic input conductances. The deviation of u from V̄, caused by some initial conditions in u or by a feedback current from the network affecting u, builds the mismatch power 𝒫. The feedback may originate from a target imposed downstream, and the neuron is ‘free’ in how to dynamically match u and V̄. It is therefore tempting to see 𝒫 as a ‘free power’, and the NLA principle as minimizing the corresponding ‘free energy’. In fact, the free-energy principle says that any self-organizing system that is in a dynamic equilibrium with its environment (here in the absence of output nudging) must minimize its free energy (that here builds up by imposing a target; Friston, 2010).

The NLA principle states that the time-integral of 𝒫 is minimized with respect to the look-ahead voltage ũ. We therefore define the physical mismatch energy as

that has the units of energy. εM takes the role of our neural action (A in the main text) for conductance-based neurons.

Euler-Lagrange equations for conductance-based neurons

The NLA for conductance-based neurons seeks to minimize A = ∫𝒫(u) dt with respect to variations of the look-ahead voltage ũ, such that δA/δũ = 0. In the simplest example of the main text we considered prospective rates, r = ρ(u) + τ (d/dt)ρ(u), so that the low-pass filtered rates become a function of the instantaneous voltage, r̄ = ρ(u). These low-pass filtered presynaptic rates determine the postsynaptic voltage u. Analogously, the low-pass filtered reversal potential, V̄, determines the postsynaptic voltage u, and we again postulate that V̄ is an instantaneous function of the presynaptic voltage. Here we argue that active dendritic mechanisms advance the postsynaptic reversal potential V, so that the delayed V̄ again becomes instantaneous, similarly to the advancement of the apical dendritic potential observed in cortical pyramidal neurons (Ulrich, 2002), see also Fig. 2b. With this instantaneity, the stationarity of the action with respect to generalized (compact and non-compact) variations translates to a stationarity condition on the Lagrangian at any moment in time.

Calculating the derivative of 𝒫 with respect to u, with 𝒫 given in Eq. 85 and τ = c/g for the total conductance g(u) specified after Eq. 82, is a bit more demanding. For a probabilistic version, where 𝒫 is derived from the negative log-likelihood of a Gaussian density of the voltage, the calculation is done in Jordan et al., 2022. In this probabilistic version, an additional normalization term enters as log g. In the deterministic version considered here, this log-term is not present and we calculate

Notice that the transpose selects the downstream network neurons from which the first- and second-order errors are backpropagated. With τ = c/g we conclude that

We next apply the lookahead operator to the expression in this Eq. 88. Assuming an initialization at t0 = −∞, the stationarity condition becomes equivalent to the vanishing of this lookahead, and with the total conductance g(u) specified after Eq. 82 this is calculated to become (see Jordan et al., 2022)

A learning rule of the form Ẇ ∝ ēpost r̄pre, with an appropriate postsynaptic error ēpost, can again be derived (see Jordan et al., 2022, for a single neuron in the probabilistic framework).

Generalizations: long memories, reinforcement learning

One could also extend the NLA principle by adding e.g. threshold adaptation that endows the dynamics with additional and longer time constants. For this, the rate function is parametrized by an additional threshold ϑ, and the Lagrangian is augmented by an error term on the threshold. Such an error term can involve the mismatch between the threshold and a low-pass filtering of the rate with a long threshold adaptation time constant τϑ. Neurons that show such threshold adaptation will still be able to instantaneously transmit a voltage jump through a prospective firing rate, but this will now also depend on the neuron’s history. Short-term plasticity may be included in the same way. Due to the history dependence, the stationarity of the action δA = 0 cannot anymore be reduced to the stationarity of the Lagrangian at any moment in time. As a consequence, the errors will look ahead into the future more than only based on the local derivatives.

Generalizations are further possible for the cost function that favor voltage regions with high cost, say corresponding to punishment, or negative cost, corresponding to reward. These extensions will be considered in future work.

Voltage dynamics from the least-action principle in physics

We have shown that through the look-ahead in Hamilton’s least-action principle, the notion of friction enters through the backdoor. In the least-action formalism in physics, friction is directly introduced by extending the Hamiltonian principle to a generalized D’Alembert principle, where at the level of the Euler-Lagrange equations the generalized force is equated to the dissipation force (Flannery, 2005). For an introduction to the least-action principle in physics see e.g. Feynman et al., 2011, and for an introduction to the calculus of variations in general, with a derivation of the Euler-Lagrange equation, see e.g. Chachuat, 2007.

The electro-chemical properties of a membrane can be captured by an equivalent circuit consisting of a battery voltage V, a capacitance C and a resistance R, arranged in parallel. The voltage dynamics is derived from the Euler-Lagrange equations augmented by a dissipative force. Formally, in the absence of an inductance defining the kinetic energy, the Lagrangian ℒ becomes identical to the potential energy 𝒰, which itself is 𝒰 = Q²/(2C) − QV,

where Q represents the charge across the membrane. Looking for a stationary trajectory of the action (the time integral of the Lagrangian) with only the potential energy in the Lagrangian would not give any dynamics. In fact, the Euler-Lagrange equation yields Q = C V. The potential energy is therefore complemented by the dissipative Rayleigh energy ℱ that introduces friction, ℱ = (R/2) Q̇².

According to D’Alembert’s principle, the Rayleigh energy enters as a gradient at the level of the Euler-Lagrange equation. At this point, the theory steps out of the original variational principle by adding the Rayleigh gradient ‘by hand’ to the Euler-Lagrange equation (different from the NLA, see below). Otherwise, no first-order differential equation with friction in the desired form would be obtained from the least-action principle in physics (if we were adding ℱ to ℒ, for instance, the d/dt ∂ℒ/∂Q̇ term in the Euler-Lagrange equations would yield R Q̈ instead of R Q̇). Hence, with D’Alembert’s patch, the dynamics is characterized by the Euler-Lagrange equation complemented by the dissipative force,

and this equation reduces to Q/C − V + R Q̇ = 0. Identifying the charge by means of the voltage across the capacitance, Q = Cu, this equation can also be written as RC u̇ + u = V, or

with τ = RC, similarly as derived in Eq. 89.
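For the record, the steps above can be written out compactly (our rendering of the derivation, with the sign convention 𝒰 = Q²/(2C) − QV used here):

```latex
% Our rendering of the derivation above; signs follow U = Q^2/(2C) - QV.
\mathcal{U}(Q) = \frac{Q^2}{2C} - QV, \qquad
\mathcal{F}(\dot Q) = \frac{R}{2}\,\dot Q^2 .

% Euler-Lagrange equation with the Rayleigh gradient added 'by hand':
\frac{\partial\mathcal{U}}{\partial Q} + \frac{\partial\mathcal{F}}{\partial\dot Q}
  = \frac{Q}{C} - V + R\dot Q = 0 .

% With Q = Cu and tau = RC this is the usual leaky integration:
\tau\dot u = V - u .
```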

The link to our NLA principle is most easily made by assuming that the ‘membrane’ conductance g = 1/R does not depend on the voltage u of the post- or presynaptic neurons (as it does in general, see the explanation after Eq. 82). In this simplified case, the Lagrangian from Eq. 86 becomes the power

The Euler-Lagrange equation with respect to ũ, applied to the action A = ∫𝒫 dt, is then calculated as

We used that τ = RC, together with the definitions of the low-pass filtered quantities entering 𝒫.

The dynamics derived from the NLA (Eq. 95) is identical to the dynamics derived from the least-action principle with friction in physics (Eq. 92). Hence, the minimization of the energy (i.e. of the time-integral of the power, A = ∫𝒫 dt) by looking ahead in time is equivalent to the stationarity of the physical action without looking ahead, but taking account of the Rayleigh dissipation. The trick of looking ahead in time generates a dynamics out of an action A that encompasses only a potential energy, no kinetic energy. Without looking ahead, no dynamics would emerge from such an action. Notice, though, that the NLA action has units of energy (the time-integral of a power), while the physical action has units of energy times time (the time-integral of energies).

G. Table of mathematical symbols

H. A tutorial on total and partial derivatives as used in the paper

The proof of Theorem 1 given in the Methods (Sect. D) makes use of partial and total derivatives and follows the notation of Scellier & Bengio, 2017 and Meulemans et al., 2022. As there is some variability (and community-dependent sloppiness) in the notation of partial and total derivatives here and in general, we provide some explanations on how this notation is interpreted, and why, for instance, total derivatives commute with each other, and also partial with each other (although not total with partial). For a standard basic textbook introducing the concepts of partial derivatives, total differentials, the implicit function theorem etc., see Stewart, 2010, being historically based on the classics of Courant, 1934.

  1. In a differential geometric setting, the derivative of a real-valued function E(u) at a point u of a manifold (like the flat Euclidean space) is considered as a mapping of tangent vectors at u to the real numbers. When interpreting the derivative as a mapping, the gradient ∂E/∂u is living in the dual space of u and is therefore a row vector if u is a column vector. If a function f (u) of u is vector-valued, then its derivative is a matrix with entries ∂fi/∂uj, where i is indexing the rows (i.e. running down) and j is indexing the columns (i.e. running right). When r = ρ(u) is a column vector with ρ applied to each component of u, we consider the (partial) derivative ρ′(u) for convenience as a column vector with components ρ′(ui). To strictly follow the formalism, it should be a diagonal matrix.

  2. Because we introduced the error vector ē as a column vector, it is easier to write the corresponding derivative of the Lagrangian also as a column vector. To be consistent with the above, we should have written it with a transpose. The ᵀ appeared to us as too heavy, so we dropped it where it did not have further mathematical consequences (hoping this does not cause confusion). Sticking to a column vector here also renders the backpropagated error a column vector in Eq. 21. This error then takes the classical form with the weight transpose, Wᵀē.

  3. We typically have real-valued functions of the form E(uθ, θ), with θ = (W, β) being a vector of parameters, and uθ being a function of θ. To get the total derivative of E with respect to θ we consider the values of E as a function of θ alone. This can be done by introducing a new function ℰ(θ) defined by ℰ(θ) = E(uθ, θ), where the components θi are considered as independent variables. The total derivative of E with respect to θ is then defined as dE/dθ = ∂ℰ/∂θ, a vector-valued function for an n-dimensional θ. It can be helpful to think of the components of this total derivative as (total) directional derivatives in the unit directions. For the last (n-th) unit direction Δθ(n) = (0, ..., 0, 1), for instance, we have dE/dθn = lim h→0 [ℰ(θ + h Δθ(n)) − ℰ(θ)]/h.

  4. The cost gradient d𝒞/dW has the same dimensions as W. Recall that by the cost gradient we mean the total derivative d𝒞/dW, where 𝒞 is defined as 𝒞(W) = C(uo(W)), with the voltage uo of the output neurons being itself a function of W.

  5. To calculate the partial derivative with respect to θ, we fix the first argument uθ, even if for u we often plug in the components of the trajectory u = uθ(t) that itself depends on θ. In contrast, the total derivative is dE/dθ = (∂E/∂u)(∂uθ/∂θ) + ∂E/∂θ. Here, ∂E/∂u is a row vector, as is ∂E/∂θ, consistent with the convention (that we have broken in Eq. 28 to keep vectors as columns). When u is considered as a trajectory, ∂uθ/∂θ does not vanish in general, but it does when u is simply considered as an independent variable.

  6. When replacing the argument u in E by the look-ahead voltage ũ we get the ‘Lagrangian’ L(ũ, θ). The partial derivative of L with respect to ũ, for instance, is then ∂L/∂ũ. The partial derivative considers L as a function of the independent arguments ũ and θ.

  7. We also used that the total derivatives commute, in the current example d/dW d/dβ = d/dβ d/dW. This is generally true for derivatives of Lipschitz continuous functions, for which derivatives exist almost everywhere. The total derivatives (where they exist) then commute because the difference quotients in W and β are uniformly bounded. The Moore-Osgood theorem tells us that two limits, of which at least one is uniform in the other, can be commuted. This also applies to the double difference quotients involved in the definition of the second total derivative. Remember that the total derivative, for instance with respect to the n-th parameter, can be written as the limit of the difference quotient introduced above. But note again that, while for Lipschitz functions partial derivatives commute among themselves, and total derivatives commute among themselves, partial and total derivatives in general do not commute (not even with additional smoothness conditions)!
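A small symbolic check (ours; generic functions unrelated to the NLA) illustrates two of the points above, namely that total derivatives commute and that the total derivative differs from the partial one by the chain-rule term whenever u depends on the parameters:

```python
import sympy as sp

# Symbolic check (ours; generic functions). (i) total derivatives commute;
# (ii) total and partial derivatives differ by the chain-rule term.
W, beta = sp.symbols('W beta')
u = sp.Symbol('u')                       # u treated as an own argument of E
u_expr = sp.sin(W * beta)                # a concrete stand-in for u(W, beta)
E = u**2 * W + sp.exp(beta) * u          # some E(u, W, beta)
E_total = E.subs(u, u_expr)              # composed function of (W, beta) only

# (i) mixed total derivatives agree:
print(sp.simplify(sp.diff(E_total, W, beta) - sp.diff(E_total, beta, W)))  # 0

# (ii) dE/dW = partial dE/dW + (dE/du)(du/dW): the difference below vanishes
total_W = sp.diff(E_total, W)
partial_W = sp.diff(E, W).subs(u, u_expr)
chain = sp.diff(E, u).subs(u, u_expr) * sp.diff(u_expr, W)
print(sp.simplify(total_W - partial_W - chain))  # 0
```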

I. Proof of Theorem 1, part (ii), using only partial derivatives

This Section proves Eqs 29–31 in terms of partial derivatives only, banning nested functions that would require dealing with a combination of total and partial derivatives. While this represents a conceptual simplification, it requires many more additional (unnested) functions to be defined (this being in fact the reason why in mathematics the concept of a total differential / total derivative is introduced, see e.g. Courant, 1934). – The cited 3 equations are also the core of the proof for Equilibrium Propagation (Scellier & Bengio, 2017), although there they are only applied in the steady state after the network has converged to a constant activity.

We assume a network of d neurons whose membrane potentials are given by u ∈ ℝd and which are connected via weights W ∈ ℝd×d. By ∇u, we denote the gradient with respect to the membrane potentials, i.e., (∇u)i = ∂/∂ui. Similarly, ∇W is a matrix containing the derivatives with respect to the weights, (∇W)ij = ∂/∂Wij.

To prove the rt-DeEP theorem (theorem 1), we first have to make a few definitions and observations:

  1. For a given (W, β), the dynamics yield certain membrane potentials fu ∈ ℝd. Formally, we define this as

    The ith element of fu is denoted by fu,i and hence fu = (fu,1, ..., fu,d).

  2. We define the following functions:

    • the mismatch energy EM : ℝd × ℝd×d → ℝ, (u, W) ↦ EM(u, W),

    • the cost function C : ℝd → ℝ, u ↦ C(u),

    • the Lagrangian L : ℝd × ℝd×d × ℝ → ℝ, (u, W, β) ↦ L(u, W, β) = EM(u, W) + βC(u).

    To make the dependency of the cost and energies on β and W explicit, we further introduce three auxiliary functions FM, FC and FL:

    • for the mismatch energy FM : ℝd×d × ℝ → ℝ, (W, β) ↦ FM(W, β) = EM (fu(W, β), W),

    • for the cost function FC : ℝd×d × ℝ → ℝ, (W, β) ↦ FC(W, β) = C (fu(W, β)),

    • for the Lagrangian FL : ℝd×d × ℝ → ℝ, (W, β) ↦ FL(W, β) = FM(W, β) + βFC(W, β).

  3. The Euler-Lagrange equations can be written as ∇u L = 0. Hence, far enough away from initialization and for smooth enough input (and targets), we have ∇u L = 0 along the trajectory fu at all times, even when the network input changes continuously. Note that both the cost and the mismatch energies are defined on low-pass-filtered signals, and it is with respect to the low-pass filtered external input that the low-pass-filtered output error is minimized.

  4. Without output nudging (i.e., β = 0), the output error vanishes and consequently all other prediction errors vanish as well, ē = 0. This can be easily shown for layered network architectures and holds true for arbitrary connectivities (e.g., recurrent networks) as long as fu uniquely exists, i.e., the Euler-Lagrange equations have a unique solution for u. From the form of the mismatch energy, we then get

    Since we are assuming smooth functions, this also implies that

  5. From the assumption of well-behaved (smooth) functions, it also follows that partial derivatives commute.

Our goal is to find a plasticity rule that minimizes the cost C, which we do by calculating ∇W FC at β = 0. Similar to Scellier & Bengio (2017), we achieve this by first calculating the partial derivative of FL with respect to the nudging strength β

and the weights W

or in vectorized form

With these identities in place, we can calculate the plasticity rule:

where we used that limits can be exchanged for smooth functions. Using the definition of EM, we obtain a plasticity rule that minimizes the cost function

In practice, the prefactor is absorbed into the learning rate.
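As an implementation sketch (our reading of the result, not code from the paper): with the prefactor absorbed into the learning rate, the update is a plain outer product of the low-pass filtered error and rate vectors, assumed to be read out from the network state:

```python
import numpy as np

# Sketch (ours). With the 1/beta prefactor absorbed into eta, the update is
# the outer product of the low-pass filtered error vector e_bar and the
# low-pass filtered presynaptic rate vector r_bar.
def weight_update(e_bar, r_bar, eta=1e-3):
    """dW_ij = eta * e_bar_i * r_bar_j."""
    return eta * np.outer(e_bar, r_bar)

# toy usage
rng = np.random.default_rng(4)
W = rng.normal(0.0, 0.1, (5, 5))
W += weight_update(rng.normal(size=5), rng.normal(size=5))
print(W.shape)
```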