Introduction

Wigner’s remark about the ‘unreasonable effectiveness’ of mathematics in allowing us to understand physical phenomena (Wigner, 1959) is famously contrasted by Gelfand’s quip about its ‘unreasonable ineffectiveness’ in doing the same for biology (Borovik, 2021). Considering the component of randomness that is inherent to evolution, this may not be all that surprising. However, while this argument holds just as well for the brain at the cellular level, ultimately brains are computing devices. At the level of computation, machine learning and neuroscience have revealed near-optimal strategies for information processing and storage, and evolution is likely to have found similar principles through trial and error (Hassabis et al., 2017). Thus, we have reason to hope for the existence of fundamental principles of cortical computation that are similar to those we have found in the physical sciences. Ultimately, it is important for such approaches to relate these principles back to brain phenomenology and to connect function to structure and dynamics.

In physics, a fundamental measure of ‘effort’ is the action of a system, which nature seeks to ‘minimize’. Given an appropriate description of interactions between the system’s constituents, the least-action principle can be used to derive the equations of motion of any physical system (Feynman et al., 2011; Coopersmith, 2017). Here, we suggest that in biological information processing, a similar principle holds for prediction errors, which are of obvious relevance for cognition and behavior. Based on such errors, we formulate a neuronal least-action (NLA) principle which can be used to derive neuronal dynamics and map them to observed dendritic morphologies and cortical microcircuits. Within this framework, local synaptic plasticity at basal and apical dendrites can be derived by stochastic gradient descent on errors. The errors that are minimized refer to the errors in output neurons that are typically thought to represent motor trajectories, planned and encoded in cortical motor areas and ultimately in the spinal cord and muscles. In the context of motor control, a phenomenological minimal action principle has previously been proposed that guides the planning and execution of movements (Feldman & Levin, 2009). The present paper formalizes this principle and reformulates the classical equilibrium point hypothesis (Latash, 2010) in a dynamical setting. Other attempts exist to link neural networks with the least action principle, for instance by directly learning to reproduce a given trajectory (Amirikian & Lukashin, 1992), by minimizing data transport through a network (Karkar et al., 2021), or by minimizing the free energy (Friston, 2010; Friston et al., 2022).

Our theory considers the minimization of a mismatch between the effective output trajectory and an internally or externally imposed target trajectory. The strength by which the output trajectory is pushed towards the target trajectory is referred to as ‘nudging strength’. For weak nudging, the feedback slightly adapts the internal network states so that the output activities follow their targets more closely. For a target that is fixed in time, the dynamics reduces to Equilibrium Propagation (Scellier & Bengio, 2017). For strong nudging, the output neurons are clamped to tightly follow the target trajectory (Song et al., 2022; Meulemans, Zucchet, et al., 2022). Each network neuron performs gradient descent on its own dendritic prediction error (Urbanczik & Senn, 2014), and with this eventually also contributes to a reduction of the output errors. For both versions, weak and strong nudging, synaptic plasticity in an interneuron feedback circuitry supports the extraction of neuron-specific dendritic errors (Sacramento et al., 2018), providing a cortical implementation of the suggested ‘real-time dendritic error propagation (rt-DeEP)’.

The NLA principle can be put into the history of energy-based models used to understand neuronal processing and learning in recurrent neural networks (Hopfield, 1982), artificial intelligence (LeCun et al., 2006), and biological versions of deep learning (Scellier & Bengio, 2017; Richards et al., 2019; Song et al., 2022). As a continuous-time theory of cortical processing that instantaneously corrects neuronal activities throughout the network, it makes a link to optimal feedback control (Todorov & Jordan, 2002) that has recently been considered in terms of energy-based models at equilibria (Meulemans, Zucchet, et al., 2022). It can further be seen in the tradition of predictive coding (Rao & Ballard, 1999; Bastos et al., 2012), where cortical feedback connections try to explain away lower-level activities. Yet, our prospective coding extrapolates from current quantities to predict activity in the future, not the activity at the current point in time, as in classical predictive coding. The NLA principle combines energy-based models with prospective coding in which neuronal integration delays are compensated on the fly (as also done in Haider et al., 2021). The prospective coding within each neuron may overcome putative delays that extend across the whole cortex and would allow for a real-time processing of stimuli while instantaneously correcting for ongoing errors. In this sense, the NLA represents an alternative to the recently suggested forward-forward algorithm which circumvents the delay problem by avoiding error propagation altogether (Hinton, 2022).

The paper is organized as follows: we first define the prospective somato-dendritic mismatch error, construct from this the mismatch energy of a network, and ‘minimize’ this energy to obtain the error-corrected, prospective voltage dynamics of the network neurons. We then show that the prospective error coding leads to an instantaneous and joint processing of low-pass filtered input signals and backpropagated errors. Applied to motor control, the instantaneous processing is interpreted as a moving equilibrium hypothesis according to which sensory inputs, network state, motor commands and muscle feedback are in a self-consistent equilibrium at any point of the movement. We then derive a local learning rule that globally minimizes the somato-dendritic mismatch errors across the network, and show how this learning can be implemented through error-extracting cortical microcircuits and dendritic predictive plasticity.

Results

Somato-dendritic mismatch errors and the Lagrangian of cortical circuits

We consider a network of neurons – identified as pyramidal cells – with firing rates ri(t) in continuous time t. The somatic voltage ui of pyramidal neuron i is driven by the close-by basal input current, ∑j Wijrj, with presynaptic rates rj and synaptic weights Wij, and an additional distal apical input ei that will be learned to represent a prospective prediction error at any moment in time (Fig. 1a). While in classical rate-based neuron models the firing rate ri of a neuron is a function of the somatic voltage, ρ(ui), we postulate that the firing rate is prospective and extrapolates from ρ(ui) into the future with the temporal derivative, ri = ρ(ui) + τ ρ̇(ui), with ρ̇(ui) representing the temporal derivative of ρ(ui(t)). There is experimental evidence for such prospective coding in cortical pyramidal neurons to which we return later (Fig. 2a).
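The phase advance produced by such a prospective rate can be checked in a few lines. The following minimal sketch is our own illustration; the threshold-linear ρ, the variable names and all parameters are assumptions, not the paper’s simulation settings:

```python
import numpy as np

tau, dt = 0.01, 1e-4                        # membrane time constant 10 ms, Euler step
t = np.arange(0.0, 0.1, dt)
u = 1.0 + 0.5 * np.sin(2 * np.pi * 10 * t)  # sinusoidally modulated somatic voltage

rho = lambda v: np.maximum(v, 0.0)          # rate as function of the voltage
r = rho(u) + tau * np.gradient(rho(u), dt)  # prospective rate r = rho(u) + tau * d rho(u)/dt

# the prospective rate peaks before the voltage does: a phase advance as in Fig. 2a
print("u peaks at t =", t[np.argmax(u)], "s;  r peaks at t =", t[np.argmax(r)], "s")
```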

Somato-dendritic mismatch energies and the neuronal least-action (NLA) principle.

(a1) Sketch of a cross-cortical network of pyramidal neurons described by NLA. (a2) Correspondence between elements of NLA and biological observables such as membrane voltages and synaptic weights. (b1) The NLA principle postulates that small variations δũ (dashed) of the trajectories ũ (solid) leave the action invariant, δA = 0. It is formulated in the look-ahead coordinates ũ (symbolized by the spyglass) in which ‘hills’ of the Lagrangian (shaded grey zones) are foreseen by the prospective voltage so that the trajectory can turn early enough to surround them. (b2) In the absence of output nudging (β = 0), the trajectory u(t) is solely driven by the sensory input, and prediction errors and energies vanish (L = 0, outer blue trajectory at bottom). When nudging the output neurons towards a target voltage (β > 0), somato-dendritic prediction errors appear, the energy increases (red dashed arrows symbolising the growing ‘volcano’) and the trajectory u(t) moves out of the L = 0 hyperplane, riding on top of the ‘volcano’ (red trajectory). Synaptic plasticity reduces the somato-dendritic mismatch along the trajectory by optimally ‘shoveling down the volcano’ (blue dashed arrows) while the trajectory settles in a new place on the L = 0 hyperplane (inner blue trajectory at bottom).

Prospective coding in cortical pyramidal neurons enables instantaneous voltage-to-voltage transfer.

(a1) The instantaneous spike rate of cortical pyramidal neurons (top) in response to sinusoidally modulated noisy input current (bottom) is phase-advanced with respect to the input (adapted from Köndgen et al., 2008). (a2) Similarly, in NLA, the instantaneous firing rate of a model neuron (r = ρ(u) + τ ρ̇(u), black) is phase-advanced with respect to the underlying voltage (u, red), postulating that the low-pass filtered rate is a function of the voltage, r̄ = ρ(u). (b) Dendritic input in the apical tree (here called ē) instantaneously causes a somatic voltage modulation (u; modeling data from Ulrich, 2002). The low-pass filtering with τ along the dendritic shaft is compensated by a lookahead mechanism in the dendrite, e = ē + τ dē/dt. In Ulrich (2002) a phase advance is observed even with respect to the dendritic input current, not only the dendritic voltage, although only for slow modulations (as here). (c) While the voltage of the first neuron (u1) integrates the input rates rin from the past (bottom black upward arrows), the output rate r1 of that first neuron looks ahead in time, r1 = ρ(u1) + τ ρ̇(u1) (red dashed arrows pointing into the future). The voltage of the second neuron (u2) integrates the prospective rates r1 (top black upward arrows). By doing so, it inverts the lookahead operation, resulting in an instantaneous transfer from u1(t) to u2(t) (blue arrow and circles).

The second central notion of the theory is the prospective somato-dendritic mismatch error in the individual network neurons, ei(t). It is defined as a mismatch between the prospective voltage, ŭi = ui + τ u̇i, and the weighted prospective input rates, ∑j Wij rj. If the prospective error is low-pass filtered with time constant τ, it takes the form ēi = ui − ∑j Wij r̄j, where r̄j is the corresponding low-pass filtered firing rate of the presynaptic neuron j (Methods). We refer to ēi as the somato-dendritic mismatch error of neuron i.

We next interpret the mismatch error ēi in terms of the morphology of pyramidal neurons with basal and apical dendrites. While the error is formed in the apical dendrite, it is added to the somatic voltage and, from the perspective of the basal inputs, becomes a somato-dendritic mismatch error. The mismatch error tells the difference between ‘what a neuron does’, which is based on the somatic voltage ui, and ‘what the basal inputs think it should do’, which is based on its own input ∑j Wij r̄j (Fig. 1a2). The two quantities may deviate because the neuron receives the apical input ēi that it integrates in the soma, but not in the basal dendrite. What cannot be predicted from the somatic voltage ui by the basal input remains as the ‘somato-basal’ mismatch error, ēi = ui − ∑j Wij r̄j, and this is just the apical error ēi.

Associated with this mismatch error is the somato-dendritic mismatch energy, defined for each network neuron i ∈ 𝒩 as the squared mismatch error,

EiM = ½ ēi² = ½ (ui − ∑j Wij r̄j)² .   (1)

On a subset of output neurons of the whole network, 𝒪 ⊆ 𝒩, a cost is defined as a function of the somatic voltage and some instructive reference signal such as a target or a reward. When a target trajectory uo*(t) is available, the cost is defined at each time point as a squared target error,

C = ½ ∑o∈𝒪 (uo* − uo)² .   (2)

Much more general cost functions and mismatch energies are conceivable, encompassing e.g. conductance-based neurons or including further dynamic variables (see SI). The cost represents a performance measure for the entire network that produces the output voltages uo(t) in response to some input rates rin(t). The cost directly relates to behavioral or cognitive measures such as the ability of an animal or human to perform a particular task in real time. The target could be provided by explicit external supervision, for example target movements in time encoded by uo*(t), it could represent an expected reward signal, or it could arise via self-supervision from other internal prediction errors.

We define the Lagrangian (or total ‘energy’) of the network as a sum across all mismatch energies and costs, weighted by the nudging strength β of the output neurons,

L = EM + βC = ∑i∈𝒩 EiM + β C .   (3)

The low-pass filtered presynaptic rates, r̄j, also encompass the external input neurons. Due to the prospective coding, the Lagrangian can be minimal at any moment in time while the network dynamics evolves. This is different from classical predictive coding (Rao & Ballard, 1999) and energy-based approaches (Scellier & Bengio, 2017; Song et al., 2022), where a stimulus needs to be fixed in time while the network relaxes to a steady state, and only there the prediction error is minimized. Since we postulated that the firing rates are prospective and linearly extrapolate into the future, r = ρ(u) + τ ρ̇(u), and because any quantity can be reconstructed from its low-pass filtering at any point in time, x = x̄ + τ dx̄/dt, we conclude that the low-pass filtered rate is a function of the instantaneous somatic voltage, r̄ = ρ(u), see SI. As a consequence, the Lagrangian also becomes a function of the instantaneous voltage u(t) and does not depend on past voltage values, although it depends on past firing rates.
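That low-pass filtering the prospective rate returns ρ(u) without lag can be verified numerically. The sketch below is our own check (Euler integration; the threshold-linear ρ and all parameters are assumptions chosen for illustration):

```python
import numpy as np

tau, dt = 0.01, 1e-5
t = np.arange(0.0, 0.2, dt)
u = 1.0 + 0.5 * np.sin(2 * np.pi * 20 * t)
rho = lambda v: np.maximum(v, 0.0)
r = rho(u) + tau * np.gradient(rho(u), dt)   # prospective rate

r_bar = np.zeros_like(r)
for k in range(1, len(t)):                   # low-pass filter: tau * d r_bar/dt = -r_bar + r
    r_bar[k] = r_bar[k - 1] + dt / tau * (r[k - 1] - r_bar[k - 1])

# after the initial transient, r_bar equals rho(u) at every instant, with no phase lag
print("max |r_bar - rho(u)|:", np.max(np.abs(r_bar[len(t) // 2:] - rho(u)[len(t) // 2:])))
```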

The previous mathematical operation of extrapolating from the voltage to future rates, and going back to the current voltage via low-pass filtering, is the key ingredient for the magical-sounding instantaneous processing, see Fig. 2c.

To achieve this instantaneity it is not really necessary to precisely look into the future. It is only necessary to undo the temporal operation of the postsynaptic neuron, and this is achieved by encoding the presynaptic voltage in the prospective rate. We come back to this instantaneity, including its practical limitations, in the section after next.

The least-action principle expressed for prospective voltages

Motivated by the prospective firing in pyramidal neurons, we postulate that cortical networks strive to look into the future to prevent instantaneous errors. Each neuron tries to move along a trajectory that minimizes its own mismatch error ēi across time (Fig. 1b1). The ‘neuronal currency’ with which each neuron ‘trades’ with others to choose its own error-minimizing trajectory is the future discounted membrane potential,

ũi(t) = (1/τ) ∫_t^∞ ui(t′) e^−(t′−t)/τ dt′ .   (4)

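A short derivation (our addition, using only the definition in Eq. 4) shows how the instantaneous voltage is recovered from the prospective one, a relation used repeatedly below:

```latex
\tilde u(t) = \frac{1}{\tau}\int_t^{\infty} u(t')\, e^{-\frac{t'-t}{\tau}}\, dt'
\quad\Rightarrow\quad
\dot{\tilde u} = \tfrac{1}{\tau}\bigl(\tilde u - u\bigr)
\quad\Rightarrow\quad
u = \tilde u - \tau\,\dot{\tilde u}\,.
```
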
These prospective voltages ũ are the ‘canonical coordinates’ entering the NLA principle, and in these prospective coordinates the overall network searches for a ‘least-action trajectory’. Since from ũ we can recover the instantaneous voltage via u = ũ − τ dũ/dt (see SI), we can replace u in the Lagrangian and obtain L as a function of our new canonical coordinates ũ and the ‘velocities’ dũ/dt, i.e. L = L(ũ, dũ/dt), where bold fonts represent vectors. Inspired by the least-action principle from physics, we can now define the neuronal action A as a time integral of the Lagrangian,

A = ∫ L(ũ, dũ/dt) dt .   (5)

The NLA principle postulates that the trajectory ũ(t) keeps the action A stationary with respect to small variations δũ (Fig. 1b1). In other words, nature chooses a trajectory such that, when deviating a little bit from it, say by δũ, the value of A will not change (or at most up to second order in the variation), formally δA = 0. The equations of motion that keep the action stationary are known to satisfy the Euler-Lagrange equations,

∂L/∂ũ − d/dt ∂L/∂(dũ/dt) = 0 .   (6)

Applying these equations to our Lagrangian yields a prospective version of the classical leaky integrator voltage dynamics, with rates r and errors e that are looking into the future (Methods),

τ u̇ = −u + W r + e ,   (7a)
ē = r̄′ ⊙ (Wnet⊤ ē) + β ē* .   (7b)

The ‘⊙’ denotes the component-wise product, and the weight matrix splits into weights from input neurons and weights from network neurons, W = (Win, Wnet). While for output neurons a target error can be defined, ēo* = uo* − uo, for non-output neurons i no target exists and we hence set ēi* = 0. In a control theoretic framework, the neuronal dynamics (Eq. 7a) represents the state trajectory, and the adjoint error dynamics (Eq. 7b) represents the integrated costate trajectory (Todorov, 2006).

Formally, the voltage dynamics in Eq. 7a specifies an implicit differential equation since u̇ also appears on the right-hand side. This is because the prospective rates r imply u̇ through r = ρ(u) + τ ρ̇(u). Likewise, the prospective errors e = ē + τ dē/dt, with ē given in Eq. 7b and plugged into Eq. 7a, imply u̇ through their dependence on u. Nevertheless, the voltage dynamics can be run by replacing u̇ on the right-hand side of Eq. 7a by the temporal derivative from the previous time step (technically, the corresponding Hessian is required to be strictly positive definite, see Methods and SI). This ensures that the voltage dynamics of Eq. 7 can be implemented in cortical neurons with a prospective firing and a prospective dendritic error (see Fig. 2).
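The following sketch illustrates this integration scheme on a toy two-layer network. It is our own minimal reading of Eq. 7, not the paper’s simulation code: shapes, parameters and the fixed random feedback matrix B1 (standing in for Wnet⊤, in the spirit of the feedback alignment used below) are assumptions, and for simplicity the errors are kept at their low-pass values rather than made prospective:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 5, 20, 2
W1 = rng.normal(0.0, 0.3, (n_hid, n_in))     # input -> hidden weights
W2 = rng.normal(0.0, 0.3, (n_out, n_hid))    # hidden -> output weights
B1 = rng.normal(0.0, 0.3, (n_hid, n_out))    # fixed random backprojection

tau, dt, beta = 0.01, 1e-4, 0.1
rho = lambda v: np.maximum(v, 0.0)
drho = lambda v: (v > 0).astype(float)

u1, du1 = np.zeros(n_hid), np.zeros(n_hid)   # hidden voltages and previous-step derivative
u2 = np.zeros(n_out)
r_in_bar = rng.uniform(0.0, 1.0, n_in)       # low-pass filtered input rates (held constant)
u_star = np.array([1.0, 0.2])                # target voltages for the output neurons

for _ in range(3000):
    r1 = rho(u1) + tau * drho(u1) * du1      # prospective hidden rates, using previous du1
    e2 = beta * (u_star - u2)                # output error from nudging (Eq. 7b)
    e1 = drho(u1) * (B1 @ e2)                # hidden error, backprojected via B1
    du1_new = (-u1 + W1 @ r_in_bar + e1) / tau   # Eq. 7a, hidden layer
    du2_new = (-u2 + W2 @ r1 + e2) / tau         # Eq. 7a, output layer
    u1, du1 = u1 + dt * du1_new, du1_new
    u2 = u2 + dt * du2_new

# with weak nudging the output settles near, but not on, the target
print("output:", u2, " target:", u_star)
```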

The error expression in Eq. 7b is reminiscent of error backpropagation (Rumelhart et al., 1986) and can in fact be mathematically linked to it (Methods). Formally, the errors are backpropagated via the transposed network matrix, Wnet⊤, modulated by r̄′ = ρ′(u), the derivative of r̄ = ρ(u) with respect to the underlying voltage. While the transpose can be constructed with various local methods (see Akrout et al., 2019; Max et al., 2022), in our simulations we mainly adhere to the phenomenon of feedback alignment (Lillicrap, Cownden, et al., 2016) and consider fixed and randomized feedback weights B (unless stated otherwise). Recent control theoretical work exploits the same prospective coding technique as expressed in Eq. 7 to tackle general time-varying optimization problems (see Simonetto et al., 2020 for a review and the SI for the detailed connection).

Prospective coding in neurons and instantaneous propagation

The prospective rates and errors entering via r and e in the NLA (Eq. 7) are consistent with the prospective coding observed in cortical pyramidal neurons in vitro (Köndgen et al., 2008). Upon sinusoidal current injection into the soma, the somatic firing rate is advanced with respect to its voltage (Fig. 2a), effectively compensating for the delay caused by the current integration. Likewise, sinusoidal current injection in the apical tree causes a lag-less voltage response in the soma (Fig. 2b, Ulrich, 2002). While the rates and errors can in general be reconstructed from their low-pass filterings via r = r̄ + τ dr̄/dt and e = ē + τ dē/dt, they become prospective in time because r̄ and ē are themselves instantaneous functions of the voltage u, and hence r and e depend on u̇. The derivative of the membrane potential implicitly also appears in the firing mechanism of Hodgkin-Huxley-type conductances, with a quick depolarization leading to a stronger sodium influx due to the dynamics of the gating variables (Hodgkin & Huxley, 1952). This advances the action potential as compared to a firing that would only depend on u, not u̇, giving an intuition of how such a prospective coding may arise. A similar prospective coding has been observed for retinal ganglion cells (Palmer et al., 2015) and cerebellar Purkinje cells (Ostojic et al., 2015), making a link from the visual input to motor control.

To understand the instantaneous propagation through the network, we low-pass filter the dynamic equation τ u̇ = −u + W r + e (obtained by rearranging Eq. 7a), with ē given by Eq. 7b, to obtain the somatic voltage u(t) = W r̄(t) + ē(t). At any point in time, the voltage is in a moving equilibrium between forward and backpropagated inputs. Independently of the network architecture, whether recurrent or not, the output is an instantaneous function of the low-pass filtered input and a putative correction towards the target, uo(t) = FW(r̄in(t), ē(t)), see Fig. 2c and Methods. This mapping again expresses an instantaneous propagation of voltages throughout the network in response to both the low-pass filtered input r̄in and the feedback error ē. This instantaneity is independent of the network size, and in a feed-forward network it is independent of the depth (see also Haider et al., 2021, where the instantaneity is on the rates, not the voltages). In the absence of the look-ahead activity, each additional layer slows down the network relaxation time.

Notice that an algorithmic implementation of the time-continuous dynamics of an N-layer feedforward network would still need N calculation steps until information from layer 1 reaches layer N. However, this does not imply that an analog implementation of the prospective dynamics will encounter delays. To see why, consider a finite step-change Δu1 in the voltage of layer 1. In the absence of the look-ahead, Δu1 would be mapped within the infinitesimal time interval dt to an infinitesimal change du2 in the voltages of layer 2. But with a prospective firing rate, r1 = ρ(u1) + τ ρ̇(u1), a step-change Δu1 translates into a delta-function in r1, this in turn into a step-change in the low-pass filtered rates r̄1, and therefore within dt into a step-change Δu2 in the voltages u2 of the postsynaptic neurons (Fig. 2c). Iterating this argument, a step-change Δu1 propagates ‘instantaneously’ through N layers within the ‘infinitesimal’ time interval N dt to a step-change ΔuN in the last layer. When run in a biophysical device in continuous time that exactly implements the dynamical equations (Eq. 7), the implementation becomes an instantaneous computation. Yet, in a biophysical device information has to be moved across space. This typically introduces further propagation delays that may not be captured in our formalism, where low-pass filtering and prospective coding cancel each other exactly. Nevertheless, analog computation in continuous time, as formalized here, offers an idea to ‘instantaneously’ realize an otherwise time-consuming numerical recipe run on time-discrete computing systems that operate with a finite clock cycle.
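The step-change argument can be replayed numerically. The toy chain below is our own construction (unit weights and all parameters are assumptions); it shows that with prospective rates a step in layer 1 arrives in layer 3 within a few Euler steps rather than a few membrane time constants:

```python
import numpy as np

tau, dt = 0.01, 1e-5
t = np.arange(0.0, 0.04, dt)
u1 = np.where(t < 0.02, 0.2, 1.0)            # step in layer 1 at t = 20 ms
rho = lambda v: np.maximum(v, 0.0)

def next_layer(u_pre):
    """Integrate tau*du/dt = -u + r_pre, with r_pre the prospective rate of u_pre."""
    r_pre = rho(u_pre) + tau * np.gradient(rho(u_pre), dt)  # contains a delta at the step
    u = np.zeros_like(u_pre)
    for k in range(1, len(u_pre)):
        u[k] = u[k - 1] + dt / tau * (-u[k - 1] + r_pre[k - 1])
    return u

u3 = next_layer(next_layer(u1))              # two synaptic stages after layer 1
k = np.searchsorted(t, 0.02)
# the 0.8 step has essentially arrived two layers downstream within ~10 dt = 0.1 ms
print("u3 before step:", u3[k - 1], " u3 shortly after:", u3[k + 10])
```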

Prospective control and the moving equilibrium hypothesis

Crucially, at the level of the voltage dynamics (Eq. 7a) the correction is based on the prospective error e. This links our framework to optimal control theory and motor control where delays are also taken into account, so that a movement can be corrected early enough (Wolpert & Ghahramani, 2000; Todorov & Jordan, 2002; Todorov, 2004). The link between energy-based models and optimal control was recently drawn for strong nudging (β → ∞) to learn individual equilibrium states (Meulemans, Zucchet, et al., 2022). Our prospective error e(t) appears as a ‘controller’ that, when looking at the output neurons, pushes the voltage trajectories towards the target trajectories. Depending on the nudging strength β, the control is tighter or weaker. For infinitely large β, the voltages of the output neurons are clamped to the time-dependent target voltages, uo(t) = uo*(t) (implying ēo* → 0), while their errors, ēo = β ēo*, instantaneously correct all network neurons. For small β, the output voltages are only weakly controlled, and they are dominated by the forward input, uo ≈ Wo r̄.

In the context of motor control, our network mapping can be seen as a forward internal model that quickly calculates an estimate of the future muscle length uo based on some motor plans, sensory inputs, and the current proprioceptive feedback (Fig. 3a). Forward models help to overcome delays in the execution of the motor plan by predicting the outcome, so that the intended motor plans and commands can be corrected on the fly (Kawato, 1999; Wolpert & Ghahramani, 2000).

Moving equilibrium hypothesis for motor control and real-time learning of cortical activity.

(a) A voluntary movement trajectory can be specified by the target length of the muscles in time, uo*(t), encoded through the γ-innervation of muscle spindles, and the deviation of the effective muscle lengths uo(t) from the target, ēo* = uo* − uo. The Ia-afferents emerging from the spindles prospectively encode the error, so that their low-pass filtering is roughly proportional to the length deviation, truncated at zero (red). The moving equilibrium hypothesis states that the low-pass filtered input r̄in, composed of the movement plan and the sensory input (here encoding the state of the plant, e.g. through visual and proprioceptive input), together with the low-pass filtered error feedback from the spindles, ēo, instantaneously generate the muscle lengths, uo, and are thus at any point in time in an instantaneous equilibrium (defined by Eq. 7). (b1) Intracortical iEEG activity recorded from 56 deep electrodes and projected to the brain surface. Red nodes symbolize the 56 iEEG recording sites modeled alternately as input or output neurons, and blue nodes symbolize the 40 ‘hidden’ neurons for which no data is available, but which are used to reproduce the iEEG activity. (b2) Corresponding NLA network. During training, the voltages of the output neurons were nudged by the iEEG targets (black input arrows, applied to all red output neurons). During testing, nudging was removed for 14 out of these 56 neurons (here represented by neurons 1, 2, 3). (c1) Voltage traces for the 3 example neurons in b2, before (blue) and after (red) training, overlaid with their iEEG target traces (grey). (c2) Total cost, integrated over a window of 8 s, of the 56 output nodes during training with sequences of the same duration. The cost for the test sequences was evaluated on an 8 s window not used during training.

The observation that muscle spindles prospectively encode the muscle length and velocity (Dimitriou & Edin, 2010) suggests that the prospective coding in the internal forward model mirrors the prospective coding in the effective forward pathway. This forward pathway leads from the motor plan to the spindle feedback, integrating also cerebellar and brainstem feedback (Kawato, 1999). Based on the motor plans, the intended spindle lengths and the effective muscle innervation are communicated via descending pathways to activate the γ- and α-motoneurons, respectively (Li et al., 2015). The transform from the intended arm trajectory to the intended spindle lengths via γ-innervation is mainly determined by the joint geometry. The transform from the intended arm trajectory to the force-generating α-innervation, however, needs to also take into account the internal and external forces, and this engages our network W.

When we prepare an arm movement, spindles in antagonistic muscle pairs that measure the muscle length are tightened or relaxed before the movement starts (Papaioannou & Dimitriou, 2021). According to the classical equilibrium-point hypothesis (Feldman & Levin, 2009; Latash, 2010), top-down input adjusts the activation threshold of the spindles through (γ-)innervation from the spinal cord so that slight deviations from the equilibrium position can be signaled (Fig. 3a). We postulate that this γ-innervation acts also during the movement, setting an instantaneous target uo* for the spindle lengths. The effective lengths of the muscle spindles are uo, and the spindles prospectively signal back the deviation from the target through the Ia-afferents (Dimitriou & Edin, 2010; Dimitriou, 2022). The low-pass filtered Ia-afferents may be approximated by a threshold nonlinearity, ēo = β⌊uo* − uo⌋+, with β interpreted as spindle gain (Latash, 2018). Combining the feedback from agonistic and antagonistic muscle pairs allows for extracting the scaled target error β(uo* − uo), as shown below. Taking account of the prospective feedback, we postulate the moving equilibrium hypothesis according to which the instructional inputs, r̄in, the spindle feedback, ēo, and the muscle lengths, uo, are at any point of the movement in a dynamic equilibrium. The moving equilibrium hypothesis extends the classical equilibrium-point hypothesis from the spatial to the temporal domain.
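The extraction of the signed error from the rectified spindle signals follows from a one-line identity (our illustration, writing ⌊x⌋+ = max(x, 0) and using that the target errors of an agonist-antagonist pair have opposite signs):

```latex
\beta \lfloor u_o^* - u_o \rfloor_+ \;-\; \beta \lfloor -(u_o^* - u_o) \rfloor_+ \;=\; \beta\,(u_o^* - u_o)\,.
```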

With regard to the interpretation of the prospective feedback error as spindle activity, it is worth noting that in humans the spindle activity is not only ahead of the muscle activation (Dimitriou & Edin, 2010), but also shares the property of a motor error (Dimitriou, 2016). The experiments show that during the learning of a gated hand movement, spindle activity is initially stronger when making movement errors, and it returns back to baseline with the success of learning. We next address how the synaptic strengths W specifying the forward model can be optimally adapted to capture this learning.

Local plasticity at basal synapses minimizes the global cost in real time

The general learning paradigm starts with input time series rin(t) and target time series uo*(t), while assuming that the target series is an instantaneous function of the low-pass filtered input series, uo*(t) = F*(r̄in(t)). The low-pass filtering in the individual inputs could be with respect to any time constant (that may also be learned, see SI), but for the network neurons we assume the same time constants for the voltage integration and the prospective rate. The goal of learning is to adapt the synaptic strengths W in the student network so that its mapping approximates the target mapping, FW → F*. This will also reduce the cost C defined on the output neurons in terms of the deviation of the voltage from the target, uo* − uo (Eq. 2).

The problem of changing synaptic weights to correct the behavior of downstream neurons, potentially multiple synapses away, is typically referred to as the credit assignment problem and is notoriously challenging in physical or biological substrates operating in continuous time. A core aspect of the NLA principle is how it relates the cost C to the Lagrangian L and eventually to somato-dendritic prediction errors ē that can be reduced through synaptic plasticity Ẇ. We define this synaptic plasticity as the negative partial derivative of the Lagrangian with respect to the weights, Ẇij ∝ −∂L/∂Wij. Since the somato-dendritic mismatch error is ēi = ui − ∑j Wij r̄j, this leads to the local learning rule of the form ‘postsynaptic error times low-pass filtered presynaptic rate’,

Ẇij = η ēi r̄j .   (8)

This plasticity rule runs simultaneously with the neuronal dynamics in the presence of a given nudging strength β that tells how strongly the voltage of an output neuron is pushed towards the target, uo*. The learning rule is local in space since the dendritic prediction ∑j Wij r̄j is represented as the voltage of the basal dendrites, and the somatic voltage u may be read out at the synaptic site on the basal dendrite from the backpropagating action potentials that sample u at a given time (Urbanczik & Senn, 2014). The basal voltage becomes the dendritic prediction of the somatic activity u, interpreting Eq. 8 as ‘dendritic predictive plasticity’.
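As a minimal sketch of Eq. 8 (our names and parameters; the biophysical readout of u via backpropagating action potentials is abstracted away), the rule amounts to one outer product per Euler step, run alongside the voltage dynamics:

```python
import numpy as np

def plasticity_step(W, u_post, r_bar_pre, eta, dt):
    """One Euler step of dW/dt = eta * e_bar * r_bar^T, with e_bar = u - W r_bar."""
    e_bar = u_post - W @ r_bar_pre              # somato-dendritic mismatch error
    return W + dt * eta * np.outer(e_bar, r_bar_pre)

# usage: called at every integration step of the network dynamics
rng = np.random.default_rng(1)
W = rng.normal(0.0, 0.1, (3, 5))
W = plasticity_step(W, u_post=rng.uniform(0, 1, 3),
                    r_bar_pre=rng.uniform(0, 1, 5), eta=1e-2, dt=1e-4)
```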

We have derived the neuronal dynamics as a path that keeps the action stationary. Without external teaching signal, the voltage trajectory wriggles on the bottom of the energy landscape (L = 0, Fig. 1b1). If the external nudging is turned on, β > 0, errors emerge and hills grow out of the landscape. The trajectory still tries to locally minimize the action, but it is lifted upwards on the hills (L > 0, Fig. 1b2). Synaptic plasticity reshapes the landscape so that, while keeping β fixed, the errors are reduced and the landscape again flattens. The transformed trajectory settles anew in another place (inside the ‘volcano’ in Fig. 1b2). Formally, the local plasticity rule (Eq. 8) is shown to perform gradient descent on the Lagrangian. In the energy landscape picture, plasticity ‘shovels off’ earth along the voltage path so that this is lowered most efficiently. The error that is back-propagated through the network tells at any point on the voltage trajectory how much to ‘dig’ in each direction, i.e. how to adapt the basal input in each neuron in order to optimally lower the local error.

The following theorem states that synaptic plasticity pushes the network mapping towards the target mapping at any moment in time. The convergence of the mapping is a consequence of the fact that plasticity reduces the Lagrangian L = EM + βC along its gradient.

Theorem 1 (real-time Dendritic Error Propagation, rt-DeEP)

Consider an arbitrary network W with voltage and error dynamics following Eq. 7. Then the local plasticity rule (Eq. 8), acting at each moment along the voltage trajectories, is gradient descent

  1. on the Lagrangian L for any nudging strength β ≥ 0, i.e. Ẇ ∝ −∂L/∂W, with ∂L/∂W = −ē r̄⊤.

  2. on the cost C for small nudging, β → 0, while up-scaling the error to ē/β, i.e. Ẇ ∝ −∂C/∂W.

The gradient statements hold at any point in time (long enough after initialization), even if the input trajectories rin(t) contain delta-functions and the target trajectories contain step-functions.

Loosely speaking, the NLA enables the network to localize an otherwise global problem: what is good for a single neuron becomes good for the entire network. In the limit of strong nudging (β → ∞), the learning rule performs gradient descent on the mismatch energies in the individual neurons of the network. If the network architecture is powerful enough so that after learning all the mismatch energies vanish, EM = 0, then the cost will also vanish, C = 0. This is because for the output neurons the mismatch error includes the target error. In the limit of weak nudging (β → 0), the learning rule performs gradient descent on C, and with this also finds a local minimum of the mismatch energies.

In the case of weak nudging and a single equilibrium state, the NLA algorithm reduces to the Equilibrium Propagation algorithm (Scellier & Bengio, 2017) that minimizes the cost C for a constant input and a constant target. In the case of strong nudging and a single equilibrium state, the NLA principle reduces to the Least Control Principle (Meulemans, Zucchet, et al., 2022) that minimizes the mismatch energy EM for a constant input and a constant target. While in the Least Control Principle the inputs and outputs are clamped to fixed values, the output errors are backpropagated and the network equilibrates in a steady state where the corrected network activities reproduce the clamped output activities. This state is called the ‘prospective configuration’ in Song et al. (2022) because neurons deep in the network are informed about the distal target and are correspondingly adapted to be consistent with this distal target. In the NLA principle, after an initial transient, this relaxation process occurs on the fly while inputs and targets dynamically change, and hence the network moves along a continuous sequence of prospective configurations.

In the motor control example, the theorem states that a given target motor trajectory is learned to be reproduced by the forward model through the dendritic predictive plasticity (Eq. 8), applied in the various layers of the forward model. Below we give an example of such learning that involves different sensory modalities to train the forward model.

Reproducing intracortical EEG recordings and recognizing handwritten digits

To illustrate the framework, we consider a recurrently connected network that learns to represent intracortical electroencephalogram (iEEG) data from epileptic patients (Fig. 3b). For each electrode, we assign a neuron within this network to represent the activity of the cell cluster recorded in the corresponding iEEG signal via its membrane potential. During learning, a randomly selected subset of electrode neurons is nudged towards the target activity from the recorded data, while the other neurons learn to reproduce this activity. After learning, we can present previously unseen recordings to only a subset of electrode neurons and observe how the activity of the other neurons closely matches the recordings of their respective electrodes (Fig. 3c). The network derived from NLA is thus able to learn complex correlations between signals evolving in real time by encoding them into its connectivity structure.

As an example of visual processing in the NLA framework, we next consider a well-studied image recognition task, here reformulated in a challenging time-continuous setting, and interpreted as a motor task where 1 out of 10 fingers has to be bent upon seeing a corresponding visual stimulus (see Fig. 3a). In the context of our moving equilibrium hypothesis, we postulate that during the learning phase, but not the testing phase, an auditory signal identifying the correct finger sets the target spindle lengths of the 10 finger flexors, uo*(t), i.e. a desired contraction for the correct finger in response to the visual input rin(t), and a desired relaxation for the 9 incorrect fingers.

We train a hierarchical three-layer network on images of handwritten digits (MNIST, LeCun, 1998), with image presentation times between 0.5τ (= 5 ms, with τ the membrane time constant) and 20τ (= 200 ms). Fig. 4a-c1 depict the most challenging scenario with the shortest presentation time. Synaptic plasticity is continuously active, despite the network never reaching a steady state (Fig. 4b1). Due to the lookahead firing rates in NLA, the mismatch errors ēi(t) propagate without lag throughout the network, as explained above. As a consequence, our mismatch errors are equal to the errors obtained from classical error backpropagation applied at each time step to the purely forward network, without error-correcting the voltage, and instead, at each layer l, considering only the forward voltage updates ul = Wl ρ(ul−1), see Fig. 4b2. The network eventually learned to implement the mapping with a performance comparable to error backpropagation at each dt, despite the short presentation time of only 5 ms (Fig. 4c1). The limiting stimulus presentation time in the order of τin (Fig. 4c2) is owed to the fact that for simultaneous step-changes in the input rates and the target we effectively required uo*(t) = F*(rin(t)), instead of uo*(t) = F*(r̄in(t)). The time constant for low-pass filtering the input rates was chosen to be τin = 10 ms, like the time constant τ for the prospective rates and the downstream membranes, but τin could also have been chosen much smaller to improve performance at even shorter presentation times.
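For concreteness, the following sketch shows one way to set up such a time-continuous stimulus stream (our construction with placeholder arrays, not the study’s data pipeline):

```python
import numpy as np

# each image is clamped as input rates for T seconds and low-pass filtered with
# tau_in, so targets switch as steps while r_bar_in follows with time constant tau_in
tau_in, dt, T = 0.010, 1e-4, 0.005       # tau_in = 10 ms, presentation time T = 5 ms
images = np.random.rand(3, 784)          # placeholder for flattened MNIST digits
steps_per_image = int(T / dt)            # 50 Euler steps per image

r_bar_in = np.zeros(784)
for img in images:
    for _ in range(steps_per_image):
        r_bar_in += dt / tau_in * (img - r_bar_in)  # tau_in * d r_bar/dt = -r_bar + r_in
        # ...voltage dynamics (Eq. 7) and plasticity (Eq. 8) would run here at every dt...
```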

On-the-fly learning of finger responses to visual input with real-time Dendritic Error Propagation (rt-DeEP).

(a) Functionally feedforward network with handwritten digits as visual input (rin in Fig. 3a, here from the MNIST data set, 5 ms presentation time per image), backprojections enabling credit assignment, and activity of the 10 output neurons interpreted as commands for the 10 fingers (forward architecture: 784 × 500 × 10 neurons). (b) Example voltage trace (b1) and local error (b2) of a hidden neuron in NLA (red) compared to an equivalent network without lookahead rates (orange). Note that neither network achieves a steady state due to the extremely short input presentation times. Errors calculated via exact backpropagation, i.e. by using the error backpropagation algorithm on a purely feedforward NLA network at every simulation time step (with output errors scaled by β), are shown for comparison (blue). (c) Comparison of network models during and after learning. Color scheme as in (b). (c1) The test error under NLA evolves during learning on par with classical error backpropagation performed at each Euler dt based on the feedforward activities. In contrast, networks without lookahead rates are incapable of learning such rapidly changing stimuli. (c2) With increasing presentation time, the performance under NLA further improves, while networks without lookahead rates stagnate at high error rates. This is caused by transient, but long-lasting misrepresentations of errors following stimulus switches: when plasticity is turned off during transients and is only active in the steady state, comparably good performance can be achieved (dashed orange). (d) Receptive fields of 6 hidden-layer neurons after training, demonstrating that even for very brief image presentation times (5 ms), the combined neuronal and synaptic dynamics are capable of learning useful feature extractors such as edge filters.

The instantaneous voltage propagation relaxes an essential constraint of previous models of bio-plausible error backpropagation (e.g., Scellier & Bengio (2017), Whittington & Bogacz (2017), and Sacramento et al. (2018), with reviews Richards et al. (2019), Whittington & Bogacz (2019), and Lillicrap, Santoro, et al. (2020)): without lookahead firing rates, networks need much longer to correctly propagate errors across layers, with each layer roughly adding another membrane time constant of 10 ms, and thus cannot cope with realistic input presentation times. In fact, in networks without lookahead output, learning is only successful if plasticity is switched off while the network dynamics is not stationary (Fig. 4c2). As a comparison, neuronal response latencies in the primary visual cortex (V1) of rats to flashed stimuli are in the order of 50 ms if the cortex is in a synchronized state, shorten to roughly 40 ms in a desynchronized state (Wang et al., 2014), and potentially shorten further if the area is preactivated through expectations (Blom et al., 2020).

Implementation in cortical microcircuits

So far, we did not specify how the errors e appearing in the differential equation for the voltage (Eq. 7a) are transmitted across the network in a biologically plausible manner. Building on Sacramento et al. (2018), we propose a cortical microcircuit to enable this error transport, with all neuronal dynamics evolving according to the NLA principle. Although the idea applies to arbitrarily connected networks, we use the simpler case of functionally feedforward networks to illustrate the flow of information in these microcircuits (Fig. 5a).

Hierarchical plastic microcircuits implement real-time Dendritic Error Learning (rt-DeEL).

(a) Microcircuit with ‘top-down’ input (originating from peripheral motor activity, blue line) that is explained away by the lateral input via interneurons (dark red), with the remaining activity representing the error ēl. Plastic connections are denoted with a small red arrow and nudging with a dashed line. (b1) Simulated network with 784-300-10 pyramidal neurons and a population of 40 interneurons in the hidden layer, used for the MNIST learning task where the handwritten digits have to be associated with the 10 fingers. (b2) Test errors for rt-DeEL with joint tabula rasa learning of the forward and lateral weights of the microcircuit. A similar performance is reached as with classical error backpropagation. For comparability, we also show the performance of a shallow network (dashed line). (b3) Angle derived from the Frobenius norm between the lateral pathway W_l^PI W_l^IP and the feedback pathway Bl Wl+1. During training, both pathways align to allow correct credit assignment throughout the network. Indices are dropped in the axis label for readability.

For such an architecture, pyramidal neurons in area l (a ‘layer’ of the feedforward network) are accompanied by a pool of interneurons in the same layer (area). The dendrites of the interneurons integrate in time (with time constant τ) lateral input from the pyramidal neurons of the same layer (rl) through plastic weights W_l^IP. Additionally, the interneurons receive ‘top-down nudging’ from the pyramidal neurons in the next layer through randomly initialized and fixed backprojecting synapses B_l^I targeting the somatic region, with interneuron nudging strength βI. The notion of ‘top-down’ originates from the functionally feedforward architecture leading from sensory to ‘higher cortical areas’. In the context of motor control, the highest ‘area’ is the last stage controlling the muscle lengths, being at the same time the first stage for the proprioceptive input (Fig. 3a).

According to the biophysics of the interneuron, the somatic membrane potential becomes a convex combination of the two types of afferent input (Urbanczik & Senn, 2014),

u_l^I = (1 − βI) W_l^IP r̄l + βI B_l^I ul+1 .   (9)

In the biological implementation, the feedback input is mediated by the low-pass filtered firing rates r̄l+1 = ρ(ul+1), not by ul+1 as expressed in the above equation. Yet, we argue that for a threshold-linear ρ the ‘top-down nudging’ by the rate is effectively reduced to a nudging by the voltage ul+1. This is because errors are only backpropagated when the slope of the transfer function is positive, ρ′(ul+1) > 0, and hence when the upper-layer voltage is in the linear regime. For more general transfer functions, we argue that short-term synaptic depression may invert the low-pass filtered presynaptic rate back to the presynaptic membrane potential, p(r̄) r̄ ∝ u, provided that the recovery time constant τ matches the membrane time constant (see the end of the section and Methods).

Apical dendrites of pyramidal neurons in each layer receive top-down input from the pyramidal population in the upper layer through synaptic weights Bl. These top-down weights could be learned to predict the lower-layer activity (Rao & Ballard, 1999) or to become the transpose of the forward weight matrix (Max et al., 2022), but for simplicity we randomly initialize them and keep them fixed (Lillicrap, Santoro, et al., 2020). Besides the top-down projections, the apical dendrites also receive lateral input via an interneuron population in the same layer through synaptic weights −W_l^PI that are plastic and will be learned to obtain suitable dendritic errors. The ‘−’ sign is suggestive of these interneurons subtracting away the top-down input entering through Bl (while the weights W_l^PI can still be positive or negative). Assuming again a conversion of rates to voltages, also for the inhibitory neurons that may operate in a linear regime, the overall apical voltage becomes

ēl = Bl ul+1 − W_l^PI r̄_l^I .   (10)

What cannot be explained away from the top-down input Bl ul+1 by the lateral feedback, −W_l^PI r̄_l^I, remains as the dendritic prediction error in the apical tree (Fig. 5a). If the top-down and lateral feedback weights are learned as outlined next, these apical prediction errors take the role of the backpropagated errors in the classical backprop algorithm.

To adjust the interneuron circuit in each layer (‘area’), the synaptic strengths from the pyramidal neurons to the interneurons, W_l^IP, are learned to minimize the interneuron mismatch energy, E_l^I = ½ ||u_l^I − W_l^IP r̄l||². The interneurons, while being driven by the lateral inputs W_l^IP r̄l, learn to reproduce the upper-layer activity that also nudges the interneuron voltage. Learning is accomplished if the upper-layer activity, in the absence of an additional error on the upper layer, is fully reproduced in the interneurons by the lateral input.

Once the interneurons have learned to represent the ‘error-free’ upper-layer activity, they can be used to explain away the top-down activities that also project to the apical trees. The synaptic strengths from the interneurons to the pyramidal neurons, W_l^PI, are learned to minimize the apical mismatch energy, E_l^api = ½ ||ēl||². While in the absence of an upper-layer error the top-down activity Bl ul+1 can be fully cancelled by the interneuron activity W_l^PI r̄_l^I, a neuron-specific error will remain in the apical dendrites of the lower-level pyramidal neurons if there was an error endowed in the upper-layer neurons. Gradient descent learning on these two energies results in the learning rules for the P-to-I and I-to-P synapses,

Ẇ_l^IP = η^IP (u_l^I − W_l^IP r̄l) r̄l⊤  and  Ẇ_l^PI = η^PI ēl (r̄_l^I)⊤ .   (11)

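A compact sketch of these two rules (our discretization and naming; the threshold-linear interneuron rate and all shapes are assumptions) combines Eqs. 9-11 into a single update step:

```python
import numpy as np

def microcircuit_step(W_IP, W_PI, B_I, B_l, r_bar_l, u_lp1, beta_I, eta, dt):
    """One Euler step of the layer-l microcircuit (our reading of Eqs. 9-11)."""
    u_I = (1 - beta_I) * (W_IP @ r_bar_l) + beta_I * (B_I @ u_lp1)  # Eq. 9
    e_I = u_I - W_IP @ r_bar_l                  # interneuron mismatch error
    r_bar_I = np.maximum(u_I, 0.0)              # interneuron rates (threshold-linear)
    e_api = B_l @ u_lp1 - W_PI @ r_bar_I        # apical residual error (Eq. 10)
    W_IP = W_IP + dt * eta * np.outer(e_I, r_bar_l)     # P-to-I plasticity (Eq. 11)
    W_PI = W_PI + dt * eta * np.outer(e_api, r_bar_I)   # I-to-P plasticity (Eq. 11)
    return W_IP, W_PI, e_api

# usage with 8 pyramidal cells and 4 interneurons in layer l, 3 cells in layer l+1
rng = np.random.default_rng(2)
W_IP, W_PI = rng.normal(0, 0.1, (4, 8)), rng.normal(0, 0.1, (8, 4))
W_IP, W_PI, e_api = microcircuit_step(
    W_IP, W_PI, B_I=rng.normal(0, 0.1, (4, 3)), B_l=rng.normal(0, 0.1, (8, 3)),
    r_bar_l=rng.uniform(0, 1, 8), u_lp1=rng.uniform(0, 1, 3),
    beta_I=0.1, eta=1e-2, dt=1e-4)
```
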
The following theorem on dendritic error learning states that the plasticity in the lateral feedback loop leads to an appropriate error representation in the apical dendrites of the pyramidal neurons.

Theorem 2 (real-time Dendritic Error Learning, rt-DeEL)

Consider a cortical microcircuit composed of pyramidal and interneurons, as illustrated in Fig. 5a (with dimensionality constraints specified in Methods, and fixed or dynamic weights to the interneurons). Then the inter-to-pyramidal synapses evolve at each layer l (‘cortical area’) such that the lateral feedback circuit aligns with the top-down feedback circuit,

W_l^PI W_l^IP = Bl Wl+1 .   (12)

In the presence of an output nudging, the apical voltages of the layer-l pyramidal neurons (Eq. 10) then represent the ‘B-backpropagated’ error Bl ēl+1. When modulated by the postsynaptic rate derivative, r̄l′ = ρ′(ul), this apical error ensures the correct representation of the errors ēl for the real-time dendritic error propagation (rt-DeEP, Theorem 1),

ēl = r̄l′ ⊙ Bl ēl+1 .   (13)

Simultaneously learning apical errors and basal signals

Microcircuits following these neuronal and synaptic dynamics are able to learn the classification of handwritten digits from the MNIST dataset while learning the apical signal representation (Fig. 5b1-2). In this case, the feedforward weights Wl and the lateral weights W_l^IP and W_l^PI are all adapted simultaneously. Including the W_l^IP-plasticity (by turning on the interneuron nudging from the upper layer, βI > 0 in Eq. 9) greatly speeds up the learning.

With and without W_l^IP-plasticity, the lateral feedback via interneurons (with effective weight W_l^PI W_l^IP) learns to align with the forward-backward feedback via upper-layer pyramidal neurons (with effective weight Bl Wl+1, Fig. 5b3). The microcircuit extracts the gradient-based errors (Eq. 13), which the forward weights use to first minimize the neuron-specific mismatch errors and eventually the output cost.

Since the apical voltage ēl appears as a postsynaptic factor in the plasticity rule for the I-to-P synapses (Eq. 11), this I-to-P plasticity can be interpreted as Hebbian plasticity of inhibitory neurons, consistent with earlier suggestions (Vogels et al., 2012; Bannon et al., 2020). The plasticity of the P-to-I synapses, in the same way as the plasticity of the forward synapses Wl, can be interpreted as learning from the dendritic prediction of somatic activity (Urbanczik & Senn, 2014).

Crucially, when choosing a large enough interneuron population, the simultaneous learning of the lateral microcircuit and the forward network can be accomplished without fine-tuning of parameters. As a case in point, all weights shared the same learning rate. Such stability bolsters the biophysical plausibility of our NLA framework and improves on the previous, more heuristic approach (Sacramento et al., 2018; Mesnard et al., 2019). The stability may be related to the nested gradient descent learning according to which somatic and apical mismatch errors in pyramidal neurons, and somatic mismatch errors in inhibitory neurons, are minimized.

Finally, since errors are defined at the level of membrane voltages (Eq. 11), synapses need a mechanism by which they can recover the presynaptic voltage errors from their afferent firing rates. While for threshold-linear transfer functions this is not too involved (Methods), more general neuronal nonlinearities must be matched by corresponding synaptic nonlinearities. Pfister et al. (2010) have previously illustrated how spiking neurons can leverage short-term synaptic depression to estimate the membrane potential of their presynaptic partners. Here, we assume a similar mechanism in the context of our rate-based neurons. The monotonically increasing neuronal activation function, r̄ = ρ(u), can be approximately compensated by a vesicle release probability p(r̄) that monotonically decreases with the low-pass filtered presynaptic rate (see SI and Fig. 6 therein). If properly matched, this leads to a linear relationship between the presynaptic membrane potential ul+1 and the postsynaptic voltage contribution.
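A toy parameterization (ours, not the one fitted in the SI) makes the matching condition concrete: with an expansive rate function and a hyperbolically decreasing release probability, the product p(r̄) r̄ returns the presynaptic potential exactly:

```python
import numpy as np

rho = lambda u: u / (1.0 - u)       # expansive activation on u in (0, 1), r_bar = rho(u)
p = lambda r: 1.0 / (1.0 + r)       # release probability, decreasing with the rate

u = np.linspace(0.1, 0.9, 5)        # presynaptic membrane potentials
r_bar = rho(u)                      # low-pass filtered presynaptic rates
print(p(r_bar) * r_bar)             # equals u: the depressing synapse inverts rho
print(u)
```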

Recovering presynaptic potentials through short term depression.

(a1) Relative voltage response of a depressing cortical synapse (recreated from Abbott et al., 1997), identified as the synaptic release probability p. (a2) The product of the low-pass filtered presynaptic firing rate and the synaptic release probability is proportional to the presynaptic membrane potential, p(r̄) r̄ ∝ u. (a3) Average in vivo firing rate of a neuron in the visual cortex as a function of the somatic membrane potential (recreated from Anderson et al., 2000), which can be qualitatively identified with the stationary rate derived in Eq. 43.

Discussion

We introduced a least-action principle for neuronal networks from which we derived the membrane potential dynamics of the involved neurons, together with gradient descent plasticity of the synapses in the network. The central notion of the theory is the somato-dendritic mismatch error in each individual neuron. Given the various findings on the dendritic integration of top-down signals (Larkum, 2013; Takahashi et al., 2020), we suggest that the error is formed in the apical dendrite of pyramidal neurons. Active dendritic and somatic processes compensate for delays caused by the dendritic error propagation and the somatic integration. This leads to the prospective firing of pyramidal neurons that are jointly driven by the forward input on the basal dendrites and the error on the apical dendrite. We derived a local plasticity rule for synapses on the basal tree that, by extracting the apical feedback error, reduces the prospective somato-dendritic mismatch error in an individual neuron. This plasticity is shown to also globally reduce the instantaneous cost at output neurons (defined by deviations from target voltages) at any moment in time while stimuli and targets may continuously change. Motor control offers an intuitive context where self-correcting prospective networks may enter. We put forward the moving equilibrium hypothesis, according to which sensory input, motor commands and muscle spindle feedback are in a recurrent equilibrium at any moment during the movement.

We further showed how a local microcircuit can serve to calculate a neuron-specific error in the apical tree of each individual pyramidal neuron. The apical tree of pyramidal neurons receives feedback from higher-level neurons in the network, while local inhibitory neurons try on the fly to ‘explain away’ the top-down expectations. What cannot be explained away remains as an apical prediction error. This prediction error eventually induces error-correcting plasticity at the sensory-driven input on the basal tree. To find the correct error representation in the apical tree, plasticity of the inter-to-pyramidal synapses seeks to homeostatically drive the apical de- or hyperpolarized voltage back to rest. While the network synapses targeting the basal tree perform gradient descent on the global cost, the microcircuit synapses involved in the lateral feedback perform gradient descent on local error functions, both at any moment in time.

Our work builds on three general lines of research on formalizing the learning capabilities of cortical networks. The first line refers to the use of an energy function to jointly infer the neuronal dynamics and synaptic plasticity, originally formulated for discrete-time networks (Hopfield, 1982; Ackley et al., 1985), and recently extended to continuous-time networks (Scellier & Bengio, 2017). The second line refers to understanding error-backpropagation in the brain (Rumelhart et al., 1986; Xie & Seung, 2003; Whittington & Bogacz, 2017; Whittington & Bogacz, 2019; Lillicrap, Santoro, et al., 2020). The third line refers to the use of dendritic compartmentalization in various kinds of computation (Schiess et al., 2016; Poirazi & Papoutsi, 2020), recently linked to deep learning (Guerguiev et al., 2017; Sacramento et al., 2018; Haider et al., 2021).

With regard to energy functions, the NLA principle adds a variational approach to characterize continuous-time neuronal trajectories and plasticity. Variational approaches are studied in the context of optimal control theory, where a cost integral is minimized across time, constrained to some network dynamics (Todorov & Jordan, 2002; Meulemans, Farinha, et al., 2021). The NLA represents a unifying notion that allows one to infer both the network dynamics and its optimal control from a single Lagrangian. The errors we derive represent prospective control variables that are applied to the voltages of each network neuron so that they push the output neurons towards their target trajectory. The full expressive power of this control theoretic framework has yet to be proven when it is extended to genuine temporal processing that includes longer time constants, for instance inherent in a slow threshold adaptation (Bellec et al., 2020). The NLA principle can also treat the case of strong feedback, studied so far in relaxation networks only (Meulemans, Zucchet, et al., 2022; Song et al., 2022). Our rt-DeEP theorem makes a statement for real-time gradient descent learning while the network is in a moving equilibrium, linking to motor learning in the presence of perturbing force fields (Herzfeld et al., 2014) or perturbing visual inputs (Dimitriou, 2016).

With regard to error-backpropagation, the NLA principle relies on feedback alignment (Lillicrap, Cownden, et al., 2016) to correctly interpret the apical errors. Other works have explored learning of feedback weights (Akrout et al., 2019; Kunin et al., 2020; Max et al., 2022). Since these are complementary to the principles suggested here, a promising direction would be to explore how feedback weights can be learned in NLA microcircuits to improve credit assignment in deeper networks.

With regard to dendritic and neuronal processing, we emphasize the prospective nature of apical voltages and somatic firing (Fig. 2). The prospective coding of errors and signals overcomes the various delays inherent in our previous approach (Sacramento et al., 2018; Mesnard et al., 2019). Each neuron of the network instantaneously corrects its voltage so that the output neurons are pushed towards the target trajectories. Ongoing synaptic plasticity adjusts the synaptic strengths so that the errors in each neuron are reduced at any moment in time, without needing to wait for network relaxations (Scellier & Bengio, 2017; Song et al., 2022). Yet, because in the present framework the input rates are still low-pass filtered, step inputs cannot be propagated instantaneously, only their low-pass filterings. Haider et al. (2021) offers a framework that is also suited to instantaneously process step inputs.

Motivated by the predictive power of the least-action principle in physics, we ask about experimental confirmation and predictions of the NLA in biology. Given its axiomatic approach, it appears astonishing to find various preliminary matches at the dendritic, somatic, interneuron, synaptic and even behavioural level. Some of these are: (1) the prospective coding of pyramidal neuron firing (Köndgen et al., 2008); (2) the prospective processing of apical signals while propagating to the soma (Ulrich, 2002); (3) the basal synaptic plasticity on pyramidal neurons and synaptic plasticity on interneurons, driven by the postsynaptic activity that is ‘unexplained’ by the distal dendritic voltage (Urbanczik & Senn, 2014); (4) the Hebbian homeostatic plasticity of interneurons targeting the apical dendritic tree of pyramidal neurons (Bannon et al., 2020); (5) the short-term synaptic depression at top-down synapses targeting inhibitory neurons and apical dendrites (akin to Abbott et al., 1997, but with a faster recovery time constant) that inverts the presynaptic activation function (see also Pfister et al., 2010); (6) the modulation of the apical contribution to the somatic voltage by the slope of the somatic activation function (for instance by downregulating apical NMDA receptors with increasing rate of backpropagating action potentials, Theis et al., 2018); and (7) the role of muscle spindles in the prospective encoding of motor errors during motor learning (Dimitriou, 2016; Papaioannou & Dimitriou, 2021). More theoretical and experimental work is required to explore these various links.

Overall, our approach introduces a method from theoretical physics to computational neuroscience and couples it with a normative perspective on the dynamical processing of neurons and synapses within global cortical networks and local microcircuits. Given its physical underpinnings, the approach may inspire the rebuilding of computational principles of cortical neurons and circuits in neuromorphic hardware (Bartolozzi et al., 2022). A step in this direction, building on the instantaneous computational capabilities of slowly integrating neurons, has been taken (Haider et al., 2021), and a next challenge is to generalize the NLA principle to spiking neurons (Zenke & Ganguli, 2018; Göltz et al., 2021) and longer temporal processing.

Acknowledgements

We would like to thank Federico Benitez, Jonathan Binas, Paul Haider, Kevin Max, Alexander Mathis, Alexander Meulemans, and Jean-Pascal Pfister for helpful discussions, Kaspar Schindler for providing the human intracortical EEG data, and the late Karlheinz Meier for his dedication and support throughout the early stages of the project. WS particularly thanks Nicolas Zucchet for inspiring mathematical discussions and hints to literature on time-varying optimal control. This work has received funding from the European Union 7th Framework Programme under grant agreement 604102 (HBP), the Horizon 2020 Framework Programme under grant agreements 720270, 785907 and 945539 (HBP) and the Swiss National Science Foundation (SNSF, Sinergia grant CRSII5180316; João Sacramento is supported by SNSF Ambizione grant PZ00P3_186027). We further express our particular gratitude towards the Manfred Stärk Foundation for their continued support. We acknowledge the use of Fenix Infrastructure resources, which are partially funded from the European Union’s Horizon 2020 research and innovation programme through the ICEI project under the grant agreement No. 800858.

Methods

Euler-Lagrange equations as inverse low-pass filters

The theory is based on the lookahead of neuronal quantities. In general, the lookahead of a trajectory x(t) is defined as $\breve{x}(t) = x(t) + \tau \dot{x}(t)$, with lookahead operator

$\mathcal{L} = 1 + \tau \frac{d}{dt}$ . (14)

The lookahead operator is the inverse of the low-pass filter operator denoted by a bar,

$\bar{x}(t) = \frac{1}{\tau} \int_{-\infty}^{t} e^{-\frac{t-t'}{\tau}}\, x(t')\, dt'$ . (15)

This low-pass filtering can also be characterized by the differential equation $\tau \dot{\bar{x}} = x - \bar{x}$, see SI. Hence, applying the low-pass filtering to x and then the lookahead operator ℒ to $\bar{x}$, and using the Leibniz rule for differentiating an integral, we calculate $\mathcal{L}\bar{x} = \bar{x} + \tau \dot{\bar{x}} = x$. In turn, applying first the lookahead, and then the low-pass filtering, also yields the original trace back, $\overline{\mathcal{L}x} = x$ (see SI).
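As a numerical sanity check of this inverse relation, the following minimal sketch applies the low-pass filter and then the lookahead operator to a test trajectory (the signal, time constant, and step size are arbitrary choices, not taken from the paper's simulations):

```python
import numpy as np

tau, dt = 10e-3, 1e-5
t = np.arange(0, 0.2, dt)
x = np.sin(2 * np.pi * 20 * t)           # arbitrary smooth test trajectory

# Low-pass filter: tau * d(x_bar)/dt = x - x_bar
x_bar = np.empty_like(x)
x_bar[0] = x[0]                          # stand-in for initialization at -infinity
for i in range(1, len(x)):
    x_bar[i] = x_bar[i-1] + dt / tau * (x[i-1] - x_bar[i-1])

# Lookahead operator: L x = x + tau * dx/dt
x_rec = x_bar + tau * np.gradient(x_bar, dt)

# Away from the initial transient, the original trajectory is recovered
half = len(t) // 2
print(np.max(np.abs(x_rec[half:] - x[half:])))   # small discretization error
```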

We consider an arbitrary network architecture with network neurons that are recurrently connected and that receive external input through an overall weight matrix W = (Win, Wnet), aggregated column-wise. The instantaneous presynaptic firing rates are r = (rin, rnet), interpreted as a single column vector. A subset of network neurons are output neurons, 𝒪 ⊆ 𝒩, for which target voltages u* may be imposed. Rates and voltages may change in time t. Network neurons are assigned a voltage u, generating the low-pass filtered rate $\bar{r} = \rho(u)$, and a low-pass filtered error $\bar{e}$. We further define output errors $\bar{e}^*_o = u^*_o - u_o$ for o ∈ 𝒪, and $\bar{e}^*_i = 0$ for non-output neurons i ∈ 𝒩 \ 𝒪. With this, the Lagrangian from Eq. 3 takes the form

$L = \frac{1}{2} \big\| u - W\bar{r} \big\|^2 + \frac{\beta}{2} \big\| \bar{e}^* \big\|^2$ . (16)

We next use that $u = \tilde{u} - \tau \dot{\tilde{u}}$, with the operator defined in Eq. 4, to write out the Lagrangian L in the canonical coordinates $(\tilde{u}, \dot{\tilde{u}})$ as (see also Eq. 3)

$L = \frac{1}{2} \big\| \tilde{u} - \tau \dot{\tilde{u}} - W\bar{r} \big\|^2 + \frac{\beta}{2} \big\| \bar{e}^* \big\|^2$ . (17)

The neuronal dynamics is derived from requiring a stationary action (see Eq. 5), which is generally solved by the Euler-Lagrange equations (see Eq. 6). Because ũ only arises in L in the compound $u = \tilde{u} - \tau \dot{\tilde{u}}$, the derivative of L with respect to ũ is identical to the derivative with respect to $\dot{\tilde{u}}$, up to the factor −τ,

$\frac{\partial L}{\partial \dot{\tilde{u}}} = -\tau\, \frac{\partial L}{\partial \tilde{u}}$ . (18)

Using the lookahead operator Eq. 14, the Euler-Lagrange equations can then be rewritten as

$\mathcal{L}\, \frac{\partial L}{\partial \tilde{u}} = \frac{\partial L}{\partial \tilde{u}} + \tau \frac{d}{dt} \frac{\partial L}{\partial \tilde{u}} = 0$ . (19)

Since $\frac{\partial u}{\partial \tilde{u}} = 1$ and ũ enters L only through u, the derivative of L with respect to ũ is the same as the derivative of L with respect to u, $\frac{\partial L}{\partial \tilde{u}} = \frac{\partial L}{\partial u}$. Plugging this into Eq. 19, the Euler-Lagrange equations become a function of u and $\dot{u}$,

$\mathcal{L}\, \frac{\partial L}{\partial u} = \frac{\partial L}{\partial u} + \tau \frac{d}{dt} \frac{\partial L}{\partial u} = 0$ . (20)

The solution of this differential equation is $\frac{\partial L}{\partial u}(t) = \frac{\partial L}{\partial u}(t_0)\, e^{-\frac{t-t_0}{\tau}}$, and hence any trajectory that satisfies the Euler-Lagrange equations will cause $\frac{\partial L}{\partial u}$ to converge to zero with a characteristic time scale of τ. Since we require that the initialisation is at t0 = −∞, we conclude that $\frac{\partial L}{\partial u} = 0$ for all t, as required in the rt-DeEP Theorem.

Deriving the network dynamics from the Euler-Lagrange equations

We now derive the equations of motion from the Euler-Lagrange equations. Noticing that u enters in L twice, directly and through the rate ρ(u), and once in the output error ē*, we calculate from Eq. 16, using $\bar{r} = \rho(u)$ and W = (Win, Wnet),

$f := \frac{\partial L}{\partial u} = \bar{e} - \rho'(u) \odot \big( W_{net}^T\, \bar{e} \big) - \beta\, \bar{e}^* \,,$ with $\bar{e} = u - W\bar{r}$ . (21)

Next, we apply the lookahead operator to this expression, as required by the Euler-Lagrange equations Eq. 19. In general $\mathcal{L}x = x + \tau \dot{x}$, and we set $\bar{e} = \rho'(u) \odot \big( W_{net}^T\, \bar{e} \big) + \beta\, \bar{e}^*$ for the expression on the right-hand side of Eq. 21, which at the same time is $\bar{e} = u - W\bar{r}$. Hence, the Euler-Lagrange equations in the form of Eq. 20, $\mathcal{L} f = 0$, translate into

$\tau \dot{u} = -u + W r + e \,,$ with $r = \mathcal{L}\bar{r}$ and $e = \mathcal{L}\bar{e}$ . (22)

To move from the middle to the last equality we replaced e by $\mathcal{L}\bar{e}$. In the last equality we interpret e as the sum of the two errors, e = ϵ + βe*, again using the middle equality. This proves Eq. 7.

Notice that the differential equation in Eq. 22 represents an implicit ordinary differential equation, as the right-hand side depends not only on u but also on $\dot{u}$ (through r and e). The uniqueness of the solution u(t) for a given initial condition is only guaranteed if it can be converted into an explicit ordinary differential equation (see Sect. C).

In taking the temporal derivative we assumed small learning rates, such that terms including $\dot{W}$ can be neglected. The derived dynamics for the membrane potential of a neuron ui in Eq. 22 show the usual leaky behavior of biological neurons. However, both presynaptic rates and prediction errors enter the equation of motion with lookaheads, i.e., they are advanced ($r = \mathcal{L}\bar{r}$ and $e = \mathcal{L}\bar{e}$), cancelling the low-pass filtering. Since $r = \rho(u) + \tau \frac{d}{dt}\rho(u)$, the rate and error, ri and ei, can also be seen as nonlinear extrapolations from the voltage and its derivative into the future.
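As a numerical illustration of this cancellation, the following minimal sketch low-pass filters the prospective rate r = ρ(u) + τ d/dt ρ(u) and recovers ρ(u(t)) without delay (the voltage trajectory, logistic transfer function, and parameter values are arbitrary assumptions):

```python
import numpy as np

tau, dt = 10e-3, 1e-5
t = np.arange(0, 0.2, dt)
u = 0.5 + 0.4 * np.sin(2 * np.pi * 10 * t)   # an arbitrary voltage trajectory
rho = lambda v: 1.0 / (1.0 + np.exp(-v))     # logistic transfer function

# Prospective (advanced) rate: r = rho(u) + tau * d rho(u) / dt
r = rho(u) + tau * np.gradient(rho(u), dt)

# Low-pass filtering r with the same time constant tau ...
r_bar = np.empty_like(r)
r_bar[0] = rho(u[0])                         # stand-in for initialization at -infinity
for i in range(1, len(r)):
    r_bar[i] = r_bar[i-1] + dt / tau * (r[i-1] - r_bar[i-1])

# ... recovers rho(u(t)) instantaneously, up to discretization error
print(np.max(np.abs(r_bar - rho(u))))
```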

The instantaneous transmission of information throughout the network at the level of the voltages can now be seen by low-pass filtering Eq. 22 with initialization far back in the past,

$u = W \bar{r} + \bar{e}$ , (23)

with $\bar{r} = (\bar{r}_{in}, \rho(u))$ interpreted as column vector, and $\bar{e} = \rho'(u) \odot (W_{net}^T \bar{e}) + \beta \bar{e}^*$. Hence, solving the voltage dynamics for u (Eq. 7a), with apical voltage derived from Eq. 7b, yields the somatic voltage u satisfying the self-consistency equation (Eq. 23) at any time. In other words, u and ē ‘propagate instantaneously’.

Deriving the error backpropagation formula

For clarity, we derive the error backpropagation algorithm for layered networks here. These can be seen as a special case of a general network with membrane potentials u and all-to-all weight matrix W (as introduced in appendix H), where the membrane potentials decompose into layerwise membrane potential vectors ul and the weight matrix into corresponding block-diagonal matrices Wl (with Wl being the weights that project into layer l).

Assuming a network with N layers, by low-pass filtering the equations of motion we get

$u_l = W_l\, \bar{r}_{l-1} + \bar{e}_l$ , (24)

l ∈ [1, N], with $\bar{r}_0 = \bar{r}_{in}$ and the output error $\bar{e}^* = u^* - u_N$. The error e = ϵ + βe* we obtain from the general dynamics, with $\bar{e} = \rho'(u) \odot (W_{net}^T \bar{e}) + \beta \bar{e}^*$, see Eq. 21 and Eq. 22, translates to an iterative formula for the error at the current layer l given the error at the downstream layer l+1, inherited from the drive of that downstream layer via Wl+1,

$\bar{e}_l = \rho'(u_l) \odot \big( W_{l+1}^T\, \bar{e}_{l+1} \big)$ , (25)

and ēN = β ē* for the output layer. The learning rule that reduces ēl by gradient descent, proportional to this error and the presynaptic rate as stated by Theorem 1, is

$\dot{W}_l = \eta\, \bar{e}_l\, \bar{r}_{l-1}^T$ , (26)

for l = 1…N. Eq. 25 and Eq. 26 together take the form of the error backpropagation algorithm, where an output error is iteratively propagated through the network and used to adjust the weights in order to reduce the output cost C. From this, it is easy to see that without output nudging (i.e., β = 0), the output error vanishes, and consequently all other prediction errors ēl vanish as well, for all l ≤ N. This also means that in the absence of nudging, no weight updates are performed by the plasticity rule.
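To make Eqs 24–26 concrete, here is a minimal numerical sketch of the weak-nudging case (all layer sizes, the tanh transfer function, the squared-error cost, and the learning rate are illustrative assumptions; the small error feedback onto hidden voltages is neglected in the forward pass, as appropriate for small β):

```python
import numpy as np

rng = np.random.default_rng(0)
rho  = lambda u: np.tanh(u)              # illustrative transfer function
drho = lambda u: 1.0 - np.tanh(u)**2

# Toy layered network: 4 inputs -> 5 hidden -> 2 output neurons
W = [rng.normal(0, 0.1, (5, 4)), rng.normal(0, 0.1, (2, 5))]
beta, eta = 0.1, 0.5

def step(r_in, u_star):
    # Forward pass at the moving equilibrium (Eq. 24, error feedback onto
    # hidden neurons neglected in the weak-nudging approximation)
    u1 = W[0] @ r_in
    u2 = W[1] @ rho(u1)
    # Output error, assuming a squared-error cost so that e* = u* - u_N
    e2 = beta * (u_star - u2)
    # Backpropagated dendritic error, Eq. 25
    e1 = drho(u1) * (W[1].T @ e2)
    # Plasticity, Eq. 26: dW_l proportional to e_l * r_{l-1}^T
    W[1] += eta * np.outer(e2, rho(u1))
    W[0] += eta * np.outer(e1, r_in)
    return np.sum((u_star - u2)**2)

r_in, u_star = rng.normal(size=4), np.array([0.3, -0.2])
for _ in range(2000):
    cost = step(r_in, u_star)
print(cost)   # decreases towards zero as the single target is learned
```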

The learning rule for arbitrary connectivities is obtained in the same way by dropping the layer-wise notation. In this case, low-pass filtering the equations of motion yields $u = W\bar{r} + \bar{e}$, as calculated in Eq. 23, and the low-pass filtered error $\bar{e} = \rho'(u) \odot (W_{net}^T \bar{e}) + \beta \bar{e}^*$, as inferred from Eqs 21 and 22. Hence, the plasticity rule in general reads

$\dot{W} = \eta\, \bar{e}\, \bar{r}^T$ . (27)

Proving Theorem 1 (rt-DeEP)

The implicit assumption in Theorem 1 is that $\dot{u}$ exists in the distributional sense for t > −∞, which is the case for delta-functions in rin and step-functions in u*. Both parts (i) and (ii) of the Theorem are based on the requirement of stationary action δA = 0, and hence on u satisfying the Euler-Lagrange equations in the form of Eq. 22, $\mathcal{L}\frac{\partial L}{\partial u} = 0$. From the solution $\frac{\partial L}{\partial u}(t) = \frac{\partial L}{\partial u}(t_0)\, e^{-\frac{t-t_0}{\tau}}$ we conclude that for initialization at t0 = −∞ we have $\frac{\partial L}{\partial u} = 0$ for all t. It is the latter, stronger condition that we require in the proof. With this, the main ingredient of the proof is the mathematical argument of Scellier & Bengio, 2017, according to which the total and partial derivative of L with respect to W are identical, and this in our case is true for any time t,

$\frac{dL}{dW} = \frac{\partial L}{\partial W} = -\bar{e}\, \bar{r}^T$ . (28)

For convenience we considered the gradient $\frac{\partial L}{\partial W}$ to be a column vector, deviating from the standard notations (see tutorial at the end of the SI). Analogously to Eq. 28, we infer $\frac{dL}{d\beta} = \frac{\partial L}{\partial \beta}$. Reading Eq. 28 from the right to the left, we conclude that the learning rule Eq. 27 for all β > 0 is gradient descent on L, i.e. $\dot{W} \propto -\frac{dL}{dW}$. This total derivative of L can be analyzed for large and small β.

(i) We show that in the limit of large β, the learning rule $\dot{W} \propto -\frac{dL}{dW}$ becomes gradient descent on the mismatch energy $E_M$. For this we first show that there is a solution of the self-consistency equation u = F(u), with F(u) = W r̄ + ē (Eq. 23), that is uniformly bounded for all t and β. For this we assume that the transfer function ρ(u) is non-negative, monotonically increasing and bounded, that its derivative ρ′(u) is bounded too, and that the input rates rin and the target potentials u* are also uniformly bounded. To show that under these conditions we always find a uniformly bounded solution u(t), we first consider the case where the output voltages are clamped to the target, such that ē* = 0. For simplicity we assume that ρ′(u) = 0 for |u| ≥ c0. For voltages u with |ui| ≤ c0 the recurrent input current is bounded, say $|(W\bar{r})_j| \le c_1$ for some c1 > c0. When including the error term $\rho'(u) \odot (W_{net}^T \bar{e})$, the total current still remains uniformly bounded, say |F(u)j| ≤ c2 for all u with |ui| ≤ c0. Because for larger voltages |ui| > c0 the error term vanishes due to a vanishing derivative ρ′(ui) = 0, the mapping F(u) maps the c2-box (for which |ui| ≤ c2) onto itself. Brouwer’s fixed point theorem then tells us that there is a fixed point u = F(u) within the c2-box. The theorem requires the continuity of F, and this is assured if the neuronal transfer function is continuous.

We next relax the voltages of the output neurons from their clamped state, uo = u*o. Remember that these voltages satisfy uo = F(u)o at any time t. We determine the correction term such that in the limit β → ∞ we get uo → u*o. The correction remains finite, and in the limit must be equal to the nudging current that clamps the outputs. For arbitrarily large nudging strength β, the output voltage uo deviates arbitrarily little from the target voltage, with a target error shrinking like c2/β. Likewise, also for non-output neurons i, the self-consistency solution ui = F(u)i deviates arbitrarily little from the solution of the clamped state. To ensure the smooth drift of the fixed point while 1/β deviates from 0 we require that the Jacobian of F at the fixed point is invertible.

Because the output error shrinks like 1/β, the cost shrinks quadratically with increasing nudging strength, C ∝ 1/β², and hence the cost term βC that enters in L = EM + βC vanishes in the limit β → ∞. In this large-β limit, where the outputs are clamped, uo = u*o, the Lagrangian reduces to the mismatch energy, L = EM. Along the least-action trajectories we therefore get $\frac{dL}{dW} = \frac{\partial L}{\partial W} = \frac{\partial E_M}{\partial W}$. The first equality uses Eq. 28, and the second uses L = EM just derived for β = ∞. This is statement (i) of Theorem 1. In the case of successful learning, EM = 0, we also conclude that the cost vanishes, C = 0. This is the case because EM = 0 implies $u_o = (W\bar{r})_o$ for all output neurons o. Since uo also satisfies the self-consistency equation, we conclude that ēo = 0, and if the output neurons do not feed back to the network (which we can assume without loss of generality), we conclude that ē*o = 0 and hence C = 0.

(ii) To consider the case of small β, we use that the cost C can be expressed as $C = \frac{\partial L}{\partial \beta}$. This is a direct consequence of how C enters in L = EM + βC, see Eq. 16 and Scellier & Bengio, 2017. We now put this together with Eq. 28 and the finding that $\frac{dL}{d\beta} = \frac{\partial L}{\partial \beta}$. Since for the Lipschitz continuous function L in u, W and β (L is even smooth in these arguments) the total derivatives interchange (which is a consequence of the Moore-Osgood theorem applied to the limits of the difference quotients), we then get at any t,

$\frac{d}{d\beta} \frac{dL}{dW} = \frac{d}{dW} \frac{dL}{d\beta} = \frac{d}{dW} \frac{\partial L}{\partial \beta} = \frac{dC}{dW}$ . (29)

The last expression is calculated from the specific form of the Lagrangian (Eq. 17), using that by definition $C = \frac{\partial L}{\partial \beta}$. Finally, in the absence of output nudging, β = 0, we can assume vanishing errors, ē = 0, as they solve the self-consistency equation for all t, see Eq. 27. For these solutions we have $\frac{dL}{dW}\big|_{\beta=0} = 0$. Writing out the total derivative of the function $g(\beta) = \frac{dL}{dW}$ with respect to β at β = 0 as limit of the difference quotient, $g'(0) = \lim_{\beta \to 0} \frac{g(\beta) - g(0)}{\beta}$, using that g(0) = 0, we calculate at any t,

$\frac{d}{d\beta} \frac{dL}{dW} \Big|_{\beta=0} = \lim_{\beta \to 0} \frac{1}{\beta} \frac{dL}{dW}$ . (30)

Here we assume that $\frac{dL}{dW}$ is evaluated at β > 0 (that itself approaches 0), while $\frac{d}{d\beta}\frac{dL}{dW}$ is evaluated at β = 0. Combining Eq. 29 and 30 yields the cost gradient at any t,

$\frac{dC}{dW} \Big|_{\beta=0} = \lim_{\beta \to 0} \frac{1}{\beta} \frac{dL}{dW}$ .

This justifies the gradient learning rule in Eq. 27. Learning is stochastic gradient descent on the expected cost, where stochasticity enters in the randomization of the stimulus and target sequences rin(t) and u*(t). For the regularity statement, see ‘From implicit to explicit differential equations’ in the SI. Notice that this proof works for a very general form of the Lagrangian L, up to the specific expression for the cost. For a proof in terms of partial derivatives only, see SI, where a primer on partial and total derivatives is also found.

Instantaneous gradient descent on the cost

The cost at each time t is a function of the voltage uo of the output neurons and the corresponding targets. In a feedforward network, due to the instantaneity of the voltage propagation (Eq. 23), uo is in the absence of output nudging (β = 0) an instantaneous function of the voltage at the first layer, up to a transient term that decays with τ. For initialisation at t0 = −∞, this transient term vanishes for all t. The output voltage uo(t) therefore becomes a function FW of the low-pass filtered input rate that captures the instantaneous network mapping, $u_o(t) = F_W(\bar{r}_{in}(t))$, and with this the cost also becomes an instantaneous function of $\bar{r}_{in}$ and u*, namely $C(t) = C\big( F_W(\bar{r}_{in}(t)),\, u^*(t) \big)$.

For a general network, again assuming t0 = −∞, the voltage is determined by the vanishing gradient F(u, t) = 0, with $F = \frac{\partial L}{\partial u}$, see Eq. 21. For the inclusive treatment of the initial transient see SI, Sects C and D. Remember that $\bar{r} = (\bar{r}_{in}, \rho(u))$ and $\bar{e}^* = u^* - u_o$. For a given $\bar{r}_{in}$ and u* at time t, the equation F(u, t) = 0 can be locally solved for u if the Hessian $H = \frac{\partial F}{\partial u}$ is invertible. This mapping can be restricted to the output voltages uo on the left-hand side, while replacing uo in the argument on the right-hand side (even if this again introduces uo there). With this we obtain the instantaneous mapping from the low-pass filtered input and the output error to the output itself. Notice that for a functional feedforward network, the network weight matrix Wnet is lower triangular, and for small enough β the Hessian H is therefore always positive definite.

Proving Theorem 2 (rt-DeEL)

Here we restrict ourselves to layered network architectures. To prove Theorem 2 we first assume that interneurons receive no nudging (βI = 0) and that only the lateral interneuron-to-pyramidal weights are plastic. This is already sufficient to prove the rt-DeEL theorem. Yet, simulations showed that shaping the lateral pyramidal-to-interneuron weights so that they mimic the upper-layer activity helps tremendously in learning a correct error representation. We consider this case of learning later.

If the microcircuit is to correctly implement error backpropagation, all local prediction errors ēl must vanish in the absence of output nudging (β = 0), as there is no target error. Consequently, any remaining errors in the network are caused by a misalignment of the lateral microcircuit, and we show how learning the interneuron-to-pyramidal weights WPI corrects for such misalignments.

To define the gradient descent plasticity of the weights from the interneurons to the pyramidal neurons, we consider the apical error formed by the difference of top-down input and interneuron input, $\bar{e}^{api}_l = B_l \bar{u}_{l+1} - W^{PI}_l \bar{r}^I_l$, and define the apical mismatch energy as $E^{PI} = \frac{1}{2} \| \bar{e}^{api}_l \|^2$. Gradient descent along this energy with respect to $W^{PI}_l$ yields

$\dot{W}^{PI}_l \propto \bar{e}^{api}_l\, (\bar{r}^I_l)^T$ , (31)

evaluated online while presenting input patterns from the data distribution to the network. We assume that the apical contribution to the somatic voltage is further modulated by the somatic spike rate, $\bar{e}_l = \rho'(u_l) \odot \bar{e}^{api}_l$ (32). After successful learning, the top-down input Bl ūl+1 is fully subtracted away by the lateral input in the apical compartment, and we have

$B_l \bar{u}_{l+1} = W^{PI}_l \bar{r}^I_l$ . (33)

Once this condition is reached, the network achieves a state where, over the activity space spanned by the data, top-down prediction errors throughout the network vanish,

$\bar{e}_l = \rho'(u_l) \odot \big( B_l \bar{u}_{l+1} - W^{PI}_l \bar{r}^I_l \big) = 0$ . (34)

We show that this top-down prediction error, after the successful learning of the microcircuit, shares the properties of error-backpropagation for suitable backprojection weights B.

Due to the vanishing prediction errors, pyramidal cells only receive bottom-up input, $\bar{u}_{l+1} = W_{l+1} \bar{r}_l$. Using this expression as well as the expression for the interneuron membrane potentials without top-down nudging (βI = 0 in Eq. 9), $u^I_l = W^{IP}_l \bar{r}_l$ (with the linear interneuron transfer function used here, $\bar{r}^I_l = u^I_l$), and plugging both into Eq. 33, we get

$B_l W_{l+1}\, \bar{r}_l = W^{PI}_l W^{IP}_l\, \bar{r}_l$ . (35)

Assuming that $W^{IP}_l$ has full rank and that the low-pass filtered rates $\bar{r}_l$ span the full nl dimensions of layer l when sampled across the data set, we conclude that

$B_l W_{l+1} = W^{PI}_l W^{IP}_l$ . (36)

In other words, the loop via the upper layer and back is learned to be matched by a lateral loop through the interneurons. Eq. 36 imposes a restriction on the minimal number of interneurons at layer l. In fact, the matrix product BlWl+1 maps an nl-dimensional space onto itself via an nl+1-dimensional space. The maximal rank of this matrix product is limited by the smallest dimension, i.e. rank(BlWl+1) ≤ min(nl, nl+1). Analogously, $\mathrm{rank}(W^{PI}_l W^{IP}_l) \le \min(n_l, n^I_l)$, with $n^I_l$ the number of interneurons at layer l. But since the two ranks are the same according to Eq. 36, we conclude that in general $n^I_l \ge \min(n_l, n_{l+1})$ must hold, i.e. there should be at least as many interneurons at layer l as the lowest number of pyramidal neurons at either layer l or l+1. Note that by choosing one interneuron per upper-layer pyramidal neuron, $n^I_l = n_{l+1}$, as in Sacramento et al., 2018 (or as in this work), the condition is fulfilled.
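The following toy sketch illustrates this loop matching numerically (dimensions, learning rate, and Gaussian stand-in data are arbitrary assumptions; interneurons are taken linear, and the pyramidal-to-interneuron weights are fixed to mimic the upper layer, as discussed below):

```python
import numpy as np

rng = np.random.default_rng(1)
n_l, n_up = 6, 4                         # pyramidal neurons at layer l and l+1
W_up = rng.normal(0, 0.5, (n_up, n_l))   # forward weights W_{l+1}
B    = rng.normal(0, 0.5, (n_l, n_up))   # fixed top-down weights B_l
W_IP = W_up.copy()                       # pyramidal-to-interneuron weights,
                                         # here fixed to mimic the upper layer
W_PI = rng.normal(0, 0.1, (n_l, n_up))   # plastic interneuron-to-pyramidal weights
eta = 0.02

for _ in range(10000):
    r_l   = rng.normal(size=n_l)         # low-pass filtered layer-l rates (data)
    u_up  = W_up @ r_l                   # upper-layer voltages, bottom-up drive only
    u_I   = W_IP @ r_l                   # interneurons, beta_I = 0, linear transfer
    e_api = B @ u_up - W_PI @ u_I        # apical mismatch
    W_PI += eta * np.outer(e_api, u_I)   # gradient descent on E_PI (Eq. 31)

# Loop consistency (Eq. 36): the top-down loop matches the lateral loop
print(np.max(np.abs(B @ W_up - W_PI @ W_IP)))   # small after learning
```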

With $\bar{u}_{l+1} = W_{l+1} \bar{r}_l + \bar{e}_{l+1}$ and Eq. 36, the top-down prediction error from Eq. 34, in the presence of output nudging (β > 0), can be written in the backpropagation form

$\bar{e}_l = \rho'(u_l) \odot \big( B_l\, \bar{e}_{l+1} \big)$ . (37)

Finally, the simulations showed that learning the lateral weights in the microcircuit greatly benefits from also adapting the pyramidal-to-interneuron weights WIP by gradient descent on the corresponding interneuron mismatch energy EIP, using top-down nudging of the inhibitory neurons (βI > 0),

After learning we have a vanishing interneuron mismatch, and plugging in the interneuron membrane potentials (Eq. 9), we obtain the lateral prediction of the upper-layer activity. Since this holds across the data distribution, we conclude as before,

The top-down weights that nudge the lower-layer interneurons have randomized entries and may be considered as full rank. If there are fewer pyramidal neurons in the upper layer than interneurons in the lower layer, the top-down nudging selects a subspace in the interneuron space of dimension nl+1. This seems to simplify the learning of the interneuron-to-pyramidal cell connections WPI. In fact, this learning now only has to match the nl+1-dimensional interneuron subspace embedded in $n^I_l$ dimensions to an equal (nl+1-)dimensional pyramidal cell subspace embedded in nl dimensions.

Learning of the interneuron-to-pyramidal cell connections works with the interneuron nudging as before, and combining Eqs 36 and 39 yields the ‘loop consistency’

The learning of the microcircuit was described in the absence of output nudging. Conceptually, this is not a problem as one could introduce a pre-learning phase where the lateral connections are first correctly aligned before learning of the feedforward weights begins. In simulations we find that both the lateral connections as well as the forward connections can be trained simultaneously, without the need for such a pre-learning phase. We conjecture that this is due to the fact that our plasticity rules are gradient descent on the energy functions L, EPI and EIP respectively.

Simulation details

The voltage dynamics is solved by a forward-Euler scheme, $u(t + dt) = u(t) + \dot{u}(t)\, dt$. The derivative $\dot{u}$ is calculated either through (i) the implicit differential equation (Eq. 7), yielding $\dot{u}$ by a fixed-point iteration, or (ii) by isolating $\dot{u}$ and solving the explicit differential equation $\dot{u} = H^{-1} c$, as explained after Eq. 49 in the SI.

(i) The implicit differential equation, $\tau \dot{u} = -u + W r + e$, see Eq. 22, is iteratively solved by assigning a new $\dot{u}$ from the right-hand side and recomputing the prospective rates and errors, $r = \rho(u) + \tau \rho'(u) \dot{u}$ and the correspondingly advanced e, that enter this right-hand side.

This iteration exponentially converges to a fixed point on a time scale dt/(1 − k), where 1 − k > 0 is the smallest eigenvalue of the Hessian H, see SI, Sect. C.

(ii) The explicit differential equation is obtained by eliminating the $\dot{u}$ from the right-hand side of the implicit differential equation. Since $\dot{u}$ enters linearly, we obtain an equation of the form $H \dot{u} = c$. The explicit form is obtained by matrix inversion, $\dot{u} = H^{-1} c$, as the Hessian is invertible if it is strictly positive definite. The external input and the target enter through the inhomogeneity c, where the derivative of the target voltage is only added for the output neurons o. This explicit differential equation is shown to be contractive in the sense that for each input trajectory rin(t) and target trajectory u*(t), the voltage trajectory u(t) is locally attracting for neighbouring trajectories. This local attracting trajectory is the vanishing-gradient trajectory F(u, t) = 0, and the gradient remains 0 even if the input contains delta-functions, see SI Sect. D.

Solving the explicit differential equation seems to be more robust when the learning rate for W gets larger. The explicit form is also less sensitive to large Euler steps dt, see SI Sect. C. For this reason, the ordinary differential equations (ODE) were solved in the explicit form when including plasticity. The algorithms are summarized as follows, once without interneurons (Algo 1), and once with interneurons (Algo 2):
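A minimal sketch in the spirit of scheme (i) and Algo 1, for a small nudging-free network (β = 0, so the error terms drop out; the architecture, parameter values, and the step input are arbitrary assumptions). At every Euler step, the assignment for $\dot{u}$ is iterated to its fixed point before the voltage is updated, and the self-consistency of Eq. 23 is checked at the end:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 6
tau, dt, T = 10e-3, 1e-4, 0.2
W_net = np.tril(rng.normal(0, 0.3, (n, n)), k=-1)  # functional feedforward net
w_in  = rng.normal(0, 0.5, n)                      # input weights, scalar input
rho, drho = np.tanh, lambda u: 1 - np.tanh(u)**2

u = np.zeros(n)
r_in = lambda t: 1.0 if t > 0.05 else 0.0          # step input at t = 50 ms
r_in_bar = 0.0                                     # low-pass filtered input

for k in range(int(T / dt)):
    t = k * dt
    r_in_bar += dt / tau * (r_in(t) - r_in_bar)
    du = np.zeros(n)
    for _ in range(20):                            # fixed-point iteration for du/dt
        r_net = rho(u) + tau * drho(u) * du        # prospective network rates
        du = (-u + w_in * r_in(t) + W_net @ r_net) / tau
    u += dt * du

# Self-consistency (Eq. 23) with beta = 0 holds at any time, despite the step:
print(np.max(np.abs(u - (w_in * r_in_bar + W_net @ rho(u)))))
```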

Details for Fig. 3b

Color-coded snapshot of cortical local field potentials (LFPs) in a human brain from 56 deep iEEG electrodes at various locations, converted with the sigmoidal voltage-to-rate function and plotted onto a standard Talairach Brain (Talairach & Tournoux, 1988). The iEEG data is from a patient with pharmacoresistant epilepsy and electrodes implanted during presurgical evaluation, extracted from the data release of Burrello et al., 2019. The locations of the electrodes are chosen in accordance with plausibility, as the original positions of the electrodes were omitted due to ethical standards to prevent patient identification.

Details for Fig. 3c

Simulations of the voltage dynamics (Eq. 7a) and weight dynamics (Eq. 8), with learning rate η = 10⁻³, step size dt = 1 ms for the forward Euler integration, membrane time constant τ = 10 ms and logistic activation function. Weights were initialized randomly from a normal distribution 𝒩(0, 0.1²) with a cut-off at ±0.3. The number of neurons in the network 𝒩 was n = 96, among them 56 output neurons 𝒪 ⊂ 𝒩 and 40 hidden neurons. During training, all output neurons were nudged simultaneously (with β = 0.1), whereas during testing, only 42 out of the 56 were nudged, with the remaining 14 left to reproduce the traces. Data points of the iEEG signal were sampled with a frequency of 512 Hz. For simplicity, we therefore assumed that successive data points are separated by 2 ms, and up-sampled the signal via simple interpolation to 1 ms resolution as required by our integration scheme. Furthermore, the raw values were normalized by dividing them by a factor of 200 to ensure that they are approximately in a range of ±1–2. Training and testing was done on two separate 8 s traces of the iEEG recording. Same data as in Fig. 3b1.
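A minimal sketch of this preprocessing (the array below is random stand-in data, not the actual recording from Burrello et al., 2019):

```python
import numpy as np

raw = np.random.randn(56, 4096) * 200     # stand-in for 56 electrodes at 512 Hz

x = raw / 200.0                           # normalize to roughly the +-1 to 2 range
t_src = np.arange(x.shape[1]) * 2.0       # treat successive samples as 2 ms apart
t_dst = np.arange(0.0, t_src[-1] + 1.0)   # target grid with 1 ms resolution
x_up = np.stack([np.interp(t_dst, t_src, ch) for ch in x])
print(x_up.shape)                         # (56, 8191)
```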

Algorithm 1 with projection neurons only, for Figs 3 & 4

(using the explicit ODE, i.e. Step 12 instead of 11)

Algorithm 2 including plastic interneurons, for Fig. 5

(using the explicit ODE, i.e. Step 13 instead of 12)

Details for Fig. 4

Simulation of the neuronal and synaptic dynamics as given by Eq. 7a, Eq. 7b and Eq. 8. For 5 ms, 10 ms and 50 ms presentation time, we used an integration step size of dt = 0.05 ms, dt = 0.1 ms and dt = 0.5 ms, respectively (and dt = 1 ms otherwise). As an activation function, we used the step-linear function (hard sigmoid) with ρ(u) = 0 for u ≤ 0, ρ(u) = 1 for u ≥ 1 and ρ(u) = u in between. The learning rate was initially set to η = 10⁻³ and then reduced to η = 10⁻⁴ after 22 000 s. The nudging strength was β = 0.1 and the membrane time constant τ = 10 ms. In these simulations (and only for these) we assumed that at each presynaptic layer l = 0, 1, …, N − 1 there is a first neuron indexed by 0 that fires with a constant rate, effectively allowing the postsynaptic neurons to learn a bias through the first column of the weight matrix Wl+1. Weights were initialized randomly from a normal distribution 𝒩(0, 0.01²) with a cut-off at ±0.03. For an algorithmic conversion see the scheme below. In Fig. 4c1, ‘rt-DeEP w/o lookahead’ is based on the dynamics without lookaheads. For ‘w/o error + backprop’, we use the dynamics without error terms on the membrane potential (but with a prospective r) as the forward model, and calculate weight updates using error backpropagation. In Fig. 4c2, we provide three controls: the test error for (i) a standard shallow artificial neural network trained on MNIST (black dashed line), (ii) rt-DeEP without prospective coding (as in Fig. 4c1), but in c2 with plasticity only turned on when the network is completely stationary, i.e., after waiting for several 100 ms, such that synaptic weights are not changed during transients (orange dashed line, denoted by ‘w/o transients’), and (iii) an equivalent artificial neural network trained using error backpropagation (black dashed line, ‘standard backprop’).

Details for Fig. 5

Simulation of neuronal and synaptic dynamics with plastic microcircuit, i.e., the pyramidal-to-interneuron and lateral weights of the microcircuit learned during training.

For the results shown in Fig. 5c2, the following parameters were used. As an activation function, we used a hard sigmoid function and the membrane time constant was set to τ = 10 ms. Image presentation time is 100 ms. Forward, pyramidal-to-interneuron and interneuron-to-pyramidal weights were initialized randomly from a normal distribution 𝒩(0, 0.01²) with a cut-off at ±0.03. All learning rates were chosen equal, η = 10⁻³, and were subsequently reduced to η = 10⁻⁴ after 22 000 s training time. The nudging strength was set to β = 0.1, with a corresponding nudging strength βI for the interneurons. The feedback connections Bl and the nudging matrices were initialized randomly from a normal distribution 5 · 𝒩(0, 0.01²) with a cut-off at ±0.15. The used integration step size was dt = 0.25 ms. All weights were trained simultaneously. For an algorithmic conversion see the scheme below. The interneuron membrane potential was calculated by Eq. 9 with a linear transfer function.

Author contributions

WS, MAP and DD designed the conceptual approach. WS, DD, MAP, AFK, JJ, JS and YB contributed to different aspects of the framework and model. WS, DD, MAP and AFK derived different components of the theory. DD, BE and AFK performed the simulations. MAP, DD, JJ and BE wrote the first draft of the manuscript. WS wrote the final draft.

Supplementary information

A Extracting the presynaptic voltage error

Threshold-linear transfer functions

There is an important special case where the presynaptic voltage error can be directly extracted from presynaptic firing rates, without the need to invert the transfer function via synaptic depression as shown below. This is the case when voltage errors in the upper layers are small, and the voltage-to-rate transfer function has derivatives ρ′ = 0 or 1. This condition is satisfied, for instance, for a doubly threshold-linear function (a ‘doubly rectified linear unit’, ReLU) defined by ρ(ŭ) = 0 for ŭ < 0 and ρ(ŭ) = ŭ for 0 ≤ ŭ ≤ rmax, while ρ(ŭ) = rmax for larger voltages. In this case we calculate

The approximation uses the Taylor expansion in ul+1 and assumes that ēl+1 is small. The crucial point of Eq. 41 is that the mismatch error defined on the voltage can be factorized into a product of the postsynaptic rate derivative, ρ′(ul), and the apical error, and hence it can be expressed as an error defined on the rate. Restricted to the segment where the transfer function is linear and errors do not vanish, the same microcircuit delivers the feedback to the apical tree through the top-down projections, and through the lateral connections from the interneurons. While the plasticity rules for WPI and WIP stay the same, the top-down nudging of the interneurons, see Eq. 9, can then be formulated based on the rate instead of the upper-layer voltage, with the transfer function of the interneurons again the (doubly) threshold-linear function. Since voltages and rates are identical in the linear segment for each component, Eqs 36 and 39 can still be inferred.

Rate-to-voltage inversion by short-term synaptic depression

We wish to read out the voltage error also for other nonlinear transfer functions than clipped ReLUs. To do so, we take inspiration from the classical short-term synaptic depression model (Tsodyks & Markram, 1997; Abbott et al., 1997; Varela et al., 1997), see also Fig. 6a1. We consider a dynamic vesicle release probability that is proportional to the pool size of available vesicles, and this pool size is postulated to depend on past presynaptic rates,

where $\bar{r}$ is the low-pass filtered presynaptic rate and a and d are constants. The proportionality factor makes a probability out of the vesicle pool size. The effective synaptic strength of a ‘backprojecting top-down’ connection is the product of the absolute synaptic strength B and the vesicle pool size v, i.e. Bv. The contribution of the synapse to the postsynaptic current is Wr, and the contribution to the postsynaptic voltage is its low-pass filtering.

We search for an activation function such that the postsynaptic voltage contribution is the scaled presynaptic potential. Plugging in the above expression for the effective strength Bv, dividing B out, see Fig. 6a2, and using the expression for v in Eq. 42, we obtain a quadratic equation in $\bar{r}$ that is solved by the non-negative and monotonically increasing function

with ‘smooth’ threshold θ = (1 + a)/d and an asymptote for large voltages. This gives us a transfer function that qualitatively matches those observed for pyramidal neurons recorded in the steady state (Anderson et al., 2000; Rauch et al., 2003), see Fig. 6a3.

The approach generalizes to other pairs of strictly monotonic neuronal activation and depression functions, as long as u is not driven below some minimal value, here urest = 0. The last requirement can, for instance, be accomplished by offsetting the activation function into a regime that guarantees that u stays positive.

In our simulations for Fig. 5, we did not explicitly implement a dynamic vesicle pool, i.e. the right-hand side of Eq. 42, but instead directly used the recovered membrane potentials u.
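The following toy sketch illustrates the idea of such a rate-to-voltage inversion (this is not the vesicle-pool model of Eq. 42: the logistic transfer function, its explicit inverse, and the scale b are illustrative assumptions, and the release factor is constructed directly from the target identity v(r̄)·r̄ = b·u):

```python
import numpy as np

rho     = lambda u: 1.0 / (1.0 + np.exp(-u))   # presynaptic transfer function
rho_inv = lambda r: np.log(r / (1.0 - r))      # its inverse, assumed known here
b = 0.1                                        # scale of the voltage readout

def release_factor(r_bar):
    # Depression factor chosen such that v(r) * r = b * rho^{-1}(r): the
    # stronger the (filtered) rate, the more the synapse is depressed.
    return b * rho_inv(r_bar) / r_bar

u_pre = np.linspace(0.5, 3.0, 6)               # presynaptic voltages
r_bar = rho(u_pre)                             # low-pass filtered rates
drive = release_factor(r_bar) * r_bar          # effective depressed drive
print(np.max(np.abs(drive - b * u_pre)))       # ~0: the voltage is recovered
```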

B Looking back and forward in time with derivatives

Since dealing with extrapolations into the future is a crucial notion of the paper, we present here some of the calculations. The discounted future voltage was introduced in Eq. 4 as

$\tilde{u}(t) = \frac{1}{\tau} \int_t^{\infty} e^{-\frac{t'-t}{\tau}}\, u(t')\, dt'$ .

To show that ũ satisfies $\tau \dot{\tilde{u}} = \tilde{u} - u$, we need to apply the Leibniz integral rule in calculating the derivative $\dot{\tilde{u}}$. This leads to

$\dot{\tilde{u}}(t) = -\frac{1}{\tau}\, u(t) + \frac{1}{\tau}\, \tilde{u}(t)$ .

Multiplying this equation by τ and using the definition of ũ yields $\tau \dot{\tilde{u}} = \tilde{u} - u$, or $u = \tilde{u} - \tau \dot{\tilde{u}}$.

By applying the Leibniz integral rule one also shows that $\bar{x}$, defined in Eq. 15,

$\bar{x}(t) = \frac{1}{\tau} \int_{-\infty}^{t} e^{-\frac{t-t'}{\tau}}\, x(t')\, dt'$ ,

solves $\tau \dot{\bar{x}} = -\bar{x} + x$. This differential equation can be written as $\mathcal{L}\bar{x} = x$, with lookahead operator ℒ defined in Eq. 14. To show that $\overline{\mathcal{L}x} = x$, one applies partial integration to $\overline{\mathcal{L}x}(t) = \frac{1}{\tau} \int_{-\infty}^{t} e^{-\frac{t-t'}{\tau}} \big( x(t') + \tau \dot{x}(t') \big)\, dt'$. Note that the equality only holds if we integrate from −∞, and hence if the initialization of the trajectory is far back in the past as compared to the time constant τ.

Uniqueness of the rate function

In the main text we concluded from the postulate $r = \breve{\rho}(u) = \rho(u) + \tau \frac{d}{dt}\rho(u)$ and the general relation $\overline{\breve{x}} = x$ that $\bar{r} = \rho(u)$. This conclusion is a consequence of the uniqueness of a solution of an ordinary differential equation for a given initial condition (that may include delta-functions on the right-hand side, see e.g. Nedeljkov & Oberguggenberger, 2012). In fact, we may consider both variables ρ(u) and $\bar{r}$ as solutions of the differential equation $\tau \dot{y} = -y + r$. Because the solution is unique, we conclude that $\bar{r} = \rho(u)$.

Learning the input time constants

In our applications we assumed that the input rates in the original mapping are low-pass filtered by a common time constant that is also shared as membrane time constant of the neurons. The general setting of learning to map time series is that the input time series rin(t) are low-pass filtered with given time constants τ*in, and the target output time series are a function of these low-pass filtered inputs.

To learn to reproduce the input time constants in the student network by τin, we assume that the inputs converge to neurons u1 that only receive external inputs. Because the low-pass filtered input rates depend on τin, the gradient rule for the input time constant is

This local learning rule also globally reduces the Lagrangian, and in the limit β → ∞ it is gradient descent on the mismatch energy, while in the limit β → 0 it is gradient descent on the cost. The proof works as in Theorem 1. To learn a more complex mapping of time series that includes temporal processing beyond a function of merely the low-pass filtered inputs, additional variables need to be introduced that form memories (see ‘Generalizations:…’ below).

C From implicit to explicit differential equations

The original Euler-Lagrange equation is $\mathcal{L} f = 0$, Eq. 20, with $f = \frac{\partial L}{\partial u}$ and the Jacobian $H = \frac{\partial f}{\partial u}$. We have shown that $\mathcal{L} f = 0$ is equivalent to the voltage dynamics given in Eq. 22.

Implicit differential equation

The implicit differential equation can be written as

The partial derivative of f(u, t) with respect to time, $\frac{\partial f}{\partial t}$, captures the external drive from the inputs and output targets that do not depend on u, but may directly depend on time. In fact, instead of the argument t of f, we could consider the two arguments $\bar{r}_{in}(t)$ and $u^*(t)$, and we can then write

where δio is the Kronecker delta, equal to 1 if i is an output neuron and 0 else. The partial derivative of f with respect to u represents the (symmetric) Hessian of the Lagrangian, $H = \frac{\partial f}{\partial u}$, with δij being the Kronecker delta, ē defined in Eq. 21 and ē* defined above Eq. 16. Remember that W = (Win, Wnet) and $\bar{r} = (\bar{r}_{in}, \rho(u))$. In vectorial notation the Hessian of the Lagrangian is

where 1 is the identity matrix.

Fixing the arguments of K in Eq. 45, we need to find a fixed point $\dot{u} = K(\dot{u})$ of the mapping K. In the argument $\dot{u}$, the mapping is affine, $K(\dot{u}) = \Lambda \dot{u} + s$, with matrix Λ and offset s. The Banach fixed point theorem asserts that if K is strictly contracting with factor k, i.e. if $\| K(\dot{u}_1) - K(\dot{u}_2) \| \le k\, \| \dot{u}_1 - \dot{u}_2 \|$ for all pairs of inputs and some 0 ≤ k < 1, then the iteration (here with iteration time step dt) locally converges to a fixed point. Because K is affine, the mapping is k-contractive if the eigenvalues of Λ have absolute value smaller than k. This is the case if the Hessian H(u) = 1 − Λ(u) = 1 − W ρ′(u) − ē′(u) is strictly positive definite. Crucially, because the mapping is affine in $\dot{u}$, the convergence is global.

For strict positive definiteness of the Hessian, the Banach fixed point theorem asserts that during (global) convergence the distance to the fixed point is bounded by a constant times $k^i$, with iteration index i and ‘virtual’ Euler step dt. In an analogue physical device that implements exactly this feedback circuit in continuous time, the dt becomes truly infinitesimal and in this sense the convergence is instantaneous. If dt remains finite, the iteration converges to a moving target because the mapping K changes with time. The target should not change quicker than the time scale of the convergence. Given a time course of the input rate rin(t) and target uo*(t) that has bounded variation, the dt can be chosen so that convergence becomes arbitrarily quick, and in the limit instantaneous.

If rin(t) contains well-separated delta-functions, while otherwise still having bounded variation, the reasoning still applies since at any point in time, except at the time point of the singularity, f(u, t) = 0. This is shown in Appendix D below.

There is a caveat for the strict positive definiteness of the Hessian when the learning rate becomes too big and starts to change the neuronal dynamics. In this case, the Hessian acquires additional terms proportional to $\dot{W}$, and the eigenvalues may become negative due to these terms. Simulations can in fact become unstable with a big learning rate, and this is more pronounced if also the Euler step dt is large. The explicit differential equation avoids the fast iteration towards a moving target and hence allows for larger dt. This in particular pays off in the presence of a high learning rate (although the Cholesky decomposition also requires positive definiteness). For this reason, the large-scale simulations involving plasticity are performed with the explicit form described next.

Explicit differential equation

To isolate $\dot{u}$ in the implicit differential equation, we rewrite ℒ f = 0 again as

Plugging the Hessian from Eq. 47 into Eq. 48, we obtain the voltage dynamics from Eq. 22 in the equivalent form

In our applications, the Hessian H appears to be invertible (although this may not be the case for arbitrary networks), and Eq. 49 can be solved for the unique $\dot{u}$ using the Cholesky decomposition. In fact, if H is invertible, the system of implicit ordinary differential equations from Eq. 49 can be converted into a system of explicit ordinary differential equations (with H given in Eq. 47),

and as such it has a unique solution for any given initial condition. Because the explicit time dependence only enters through $\bar{r}_{in}$ and u*, see Eq. 46, the regularity requirement for $\dot{u}$ to be integrable is satisfied even if $\bar{r}_{in}$ and u* contain step functions (and rin(t) delta-functions), see Sect. D for details.
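A minimal sketch of one such explicit integration step, for β = 0 and the same kind of toy feedforward network as above (architecture and parameters are again arbitrary assumptions; a generic linear solve stands in for the Cholesky decomposition):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 6
tau, dt = 10e-3, 1e-4
W_net = np.tril(rng.normal(0, 0.3, (n, n)), k=-1)  # lower triangular => H invertible
w_in  = rng.normal(0, 0.5, n)
rho, drho = np.tanh, lambda u: 1 - np.tanh(u)**2

def euler_step_explicit(u, r_in):
    # Hessian for beta = 0 (cf. Eq. 47): H_ij = delta_ij - W_ij * rho'(u_j)
    H = np.eye(n) - W_net * drho(u)[None, :]
    rhs = (-u + W_net @ rho(u) + w_in * r_in) / tau
    du = np.linalg.solve(H, rhs)   # the paper uses a Cholesky decomposition here
    return u + dt * du

u, r_bar = np.zeros(n), 0.0
for k in range(2000):
    t = k * dt
    r_in = 1.0 if t > 0.05 else 0.0
    r_bar += dt / tau * (r_in - r_bar)
    u = euler_step_explicit(u, r_in)
print(np.max(np.abs(u - (w_in * r_bar + W_net @ rho(u)))))  # ~0 (Eq. 23)
```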

Even in the presence of such step-functions, the Euler-Lagrange equations ℒ f = 0 lead to an f that is a decaying exponential, $f(t) = f(t_0)\, e^{-\frac{t-t_0}{\tau}}$. For initialization at t0 = −∞ we have f = 0 at any time. In fact, Eq. 50 is equivalent to ℒ f = 0, and hence any solution of Eq. 50, even if $\dot{u}$ contains a delta-function, is also a solution of ℒ f = 0. Possible jumps in the input or target are compensated by the jumps they induce in u (see below for the full mathematical description with a simple example). To give an intuition, we assume that a recurrent network of our prospective neurons has separate fixed points for two constant input currents. This network still shows an overall relaxation time of τin (but not longer!) when the input rates instantaneously switch from the one constant input to the other. Nevertheless, at any moment during this relaxation process, gradient learning of the mapping towards the target is still guaranteed (Theorem 1).

In the case of a functional feedforward network, the network weight matrix Wnet is lower triangular. The Hessian is then itself a lower triangular matrix, but now with unit diagonal, and as such it is invertible. For feedforward networks, H remains invertible also for small β > 0, since then the emerging upper diagonal entries remain small compared to the diagonal entries 1. It also remains invertible for growing nudging strengths, β → ∞, provided that the weighted target errors remain small, see (i) in the proof of Theorem 1.

Link to time-varying optimal control

The explicit differential equation, Eq. 50, is a special case of the one in Simonetto, Dall’Anese, et al., 2020 (Eq. 20 therein), where the function to be minimized (their f) can take a general (Lipschitz continuous) form (hence their f is our Lagrangian, f = L). To avoid inverting the Hessian, an iteration algorithm can be applied similar to our implicit form, although more involved to deal with the more general form of L (Simonetto & Dall’Anese, 2017). The idea of tracking the solution of a time-varying optimization problem with a linear look-ahead in time has been introduced in Zhao & Swamy, 1998.

D Contraction analysis and delta-function inputs

Stability

We show that the voltage dynamics obtained from ℒ f = 0 is always stable, provided that the Hessian H is invertible. For this we rewrite Eq. 49 in the form $\frac{d}{dt} f(u, t) = -\frac{1}{\tau} f(u, t)$, where the explicit time dependence of f is a short-cut to express the dependence on $\bar{r}_{in}(t)$ and $u^*(t)$. Stated in this generality, the stability analysis also applies to the voltage dynamics derived in the Latent Equilibrium (Haider et al., 2021).

According to the implicit function theorem, at any point in time when H is invertible, we can locally solve for $\dot{u}$. When absorbing the dependence of f on $\bar{r}_{in}$ and u* into a time dependence, we can rewrite this differential equation as $\dot{u} = g(u, t)$, see Eq. 50. This differential equation is contractive and thus stable if the Jacobian of g with respect to u is uniformly negative definite (Lohmiller & Slotine, 1998). The contraction analysis tells us that locally, where H is invertible, we can express $\dot{u}$ as a function of u and t, and for the u-component we get the Jacobian

According to Eq. 49 we have $\tau \frac{d}{dt} f = -f$, and the Jacobian can be calculated from the terms specified above. Since according to Eq. 46 the partial derivative with respect to t does not depend on u, the u-dependence of the Jacobian enters only through f. The term proportional to f may cause a violation of the positive definiteness. However, this appears only transiently after initialization, since the gradient f exponentially quickly converges to 0 after initialization. If t0 = −∞, the term vanishes, and with it also the possible violation. This asserts the local contraction property of the fixed point trajectory: around a point u on the trajectory, the linear approximation of the dynamics shows an exponential local contraction to u.

Global convergence

The above stability analysis yields the strict contraction property only after the convergence of the gradient to f = 0. We have shown that the iteration of K globally converges to a fixed point $\dot{u}$, see Eq. 45. Let us therefore assume that $\dot{u}$ satisfies this fixed point equation. This defines a trajectory u(t) that globally converges to the fixed points characterized by the vanishing gradients, f = 0, and these vanishing gradients are globally reached. This is because f(u(t), t), as a function of time, exponentially decays to 0 in each component, as noticed above. To restate this in a compound version, we consider an arbitrary initialization u(t0) for which in general f(u(t0), t0) ≠ 0. Because at any time we have $\tau \frac{d}{dt} f = -f$, the length of f(u(t), t) exponentially converges to 0, as expressed by its temporal derivative,

Hence, starting at any initial point, the voltage dynamics finds a self-consistency solution of f(u(t), t) = 0, or equivalently of u = W ρ(u) + ē(u), with ē given in Eq. 23.

Delta-function inputs keep f = 0

We next explain in more detail why, for delta-functions in the input rates rin(t), the NLA stationarity (‘equilibrium’) condition is always satisfied (the delta-function in rin for the NLA corresponds to the step-function in rin for the Latent Equilibrium). We reconsider the explicit differential equation given in Eq. 50, with H given in Eq. 47.

To simplify matters, we consider a single delta-function at t = 0 as input in the absence of output nudging. In this case we get the input term Win δin(t), where the input matrix Win is typically sparse (not all network neurons receive external input), and δin(t) is a vector of delta-functions restricted to the input neurons. Following Nedeljkov & Oberguggenberger, 2012, Proposition 2.1, we can then write the explicit differential equation, Eq. 49, in the form

where the regular part is globally Lipschitz continuous. Due to the Lipschitz continuity, the change in u evoked by this regular part during a small time interval [−ε, ε] around t = 0 vanishes when this interval shrinks, ε → 0. To quantify the change in u during these intervals it is therefore enough to consider the singular part. To estimate the jump induced by the delta-functions, we consider some mollifier $\phi_\varepsilon(t) = \varepsilon^{-1} \phi(t/\varepsilon)$, where ϕ(t) is a smooth function on the interval [−1, 1] with integral 1. By ϕin,ε(t) we denote the vector of mollifiers centered at the delta-functions of the input neurons. We now consider the two differential equations, with the second approximating the first on the interval [−ε, ε], but without the regular term,

We assume that for all t ∈ [−ε, ε] the matrices H(uε(t)) are invertible, so that the two Eqs 54a,b can be turned into explicit differential equations. Analogously to the 1-dimensional case (Nedeljkov & Oberguggenberger, 2012), we conclude that the solutions of Eqs 54a,b on the interval [−ε, ε] converge to each other for ε → 0. As a consequence, the jump of u at t = 0 converges to the corresponding jump of uε in the various dimensions.

To calculate the jump of u we have to integrate $\dot{u}$ across the time interval [−ε, ε]. Instead of integrating Eq. 54a, we first integrate the approximation given in Eq. 54b. Moving from right to left yields

where Iin is the index vector of the input neurons having a delta-function, i.e. Iin,j = 1 if there is a delta-input, and 0 else. Note that the mollifiers integrate to Iin. Because H is itself a derivative, $H = \frac{\partial f}{\partial u}$, we can explicitly calculate the latter integral (also for β > 0, but for clarity here only done for β = 0). The last integral in Eq. 55 is defined as a vector with i-th component being

where in the second-to-last equality we used that $H_{ij}(u) = \delta_{ij} - W_{ij}\, \rho'(u_j)$ depends only on the component uj, see Eq. 47. In the last equality we introduced the ‘network voltage error’ $v = u - W_{net}\, \rho(u)$. Following the 1-dimensional case treated in Nedeljkov & Oberguggenberger, 2012, Proposition 1.2, we introduce the ‘jump function’ (called G(y) in the cited work)

with u thought to represent the voltage at some time t before the delta-kick sets in. With this setting, Eq. 55 turns into

In the limit ε → 0 we get a relation between u(t) immediately before and after the jump, u(−0) and u(+0), using that in this limit the boundary points of the trajectories also converge,

We now assume that the function J(u) = τ (v − v₀) is invertible around the jump, with v₀ the network voltage error at the reference voltage. This is the case if the Jacobian $\frac{\partial J}{\partial u}$ is invertible, and because v = u − Wnet ρ(u), we require invertibility of $\frac{\partial v}{\partial u} = 1 - W_{net}\, \rho'(u) = H(u)$, with the Hessian defined in Eq. 47.

In the case of invertibility we get the voltage after the jump as

We next calculate the jump in v. This is easy since v = u − Wnet ρ(u) enters linearly in the function J(u) = τ (v − v₀). Plugging the explicit expression for J into Eq. 59 we get τ (v(+0) − v₀) = τ (v(−0) − v₀) + Win Iin, or

$v(+0) = v(-0) + \frac{1}{\tau}\, W_{in} I_{in}$ . (61)

Knowing the jump in v helps to show that the equilibrium condition f = 0 is always satisfied, even immediately after the delta-input, provided the initialization is at t0 = −∞. To show this, remember that in the absence of nudging the equilibrium condition is equivalent to $v = W_{in}\, \bar{r}_{in}$. The jump size of $\bar{r}_{in}$ for a delta-function at t = 0, rin(t) = δin(t), is Iin/τ. This is because $\bar{r}_{in}$ satisfies the differential equation $\tau \dot{\bar{r}}_{in} = -\bar{r}_{in} + r_{in}$, provided that t0 = −∞. Hence,

$\bar{r}_{in}(+0) = \bar{r}_{in}(-0) + \frac{1}{\tau}\, I_{in}$ . (62)

With Eqs 61 and 62 we conclude that f = 0 throughout.

E Example of a single recurrently connected neuron

To get an intuition for the instantaneity in the recurrent case we consider the example of a single, recurrently connected neuron. We also put this into the context of the Latent Equilibrium (Haider et al., 2021). Consider the weight vector W = (Win, Wnet) with an input rate rin(t) driving the postsynaptic voltage u. The postsynaptic rate is $r = \rho(u) + \tau \frac{d}{dt}\rho(u)$, and its low-pass filter with respect to τ is $\bar{r} = \rho(u)$. As always, the low-pass filtering reaches back to an initialization at t0 = −∞, see Eq. 15. The Lagrangian has the form

The Euler-Lagrange equations are derived from the stationarity of the action. Applying the look-ahead operator ℒ (Eq. 14), the Euler-Lagrange equations deliver the voltage dynamics,

To simplify matters, we consider the nudging-free case, β = 0. This implies that the error term vanishes. With $\bar{r} = \rho(u)$, we obtain the differential equation

$\tau \dot{u} = -u + W_{in} r_{in} + W_{net} \big( \rho(u) + \tau \tfrac{d}{dt}\rho(u) \big)$ . (65)

Abbreviating v = u − Wnet ρ(u) as ‘network voltage error’, the above differential equation reads as

$\tau \dot{v} = -v + W_{in} r_{in}$ . (66)

Integrating the effective voltage dynamics (Eq. 66), assuming initialization infinitely far in the past, v is equal to the low-pass filtered input, $v(t) = W_{in} \bar{r}_{in}(t)$. This equation is equivalent to the Euler-Lagrange equation being integrated, and because the solution of the Euler-Lagrange equation is a decaying exponential, we have (using t0 = −∞)

$u - W_{net}\, \rho(u) = W_{in}\, \bar{r}_{in}$ . (67)

Voltage dynamics for a delta-function input

We next apply a delta-function in the input rate, say rin(t) = δ(t) and consider the dynamics at the level of the voltage, Eq. 65. As in Nedeljkov & Oberguggenberger, 2012, Proposition 1.2 we introduce the ‘jump function’ (called G(y) in the cited work)

As in Nedeljkov & Oberguggenberger, 2012, Proposition 1.2, we show that the voltage u makes a unique jump at the moment of the delta function that J(u) is invertible around the jump.

We set v = uWnetρ(u). Here, u is some voltage before the jump, say u = u(−1) evaluated at time t = −1, when the jump is at t = 0. When is the voltage immediately before the jump, the voltage immediately after the jump is specified by

The reason is that the Win-scaled delta-function triggers a step of size Win when integrating over it as done in Eq. 68. The new value is unique if J is invertible, and looking at the defining integral in Eq. 68, this is the case if .

The jump in u translates into a jump in v = uWnetρ(u) from to . This endpoint can also be expressed as

To check this, we assume without loss of generality that v = u Wnetρ(u) = 0. Then J(u) = τ (u Wnetρ(u)) = τv and according to Eq. 69. Since also , we conclude that , as claimed above.

We finally show that even far away from the initialization, the stationarity condition (Eq. 67) holds before and immediately after the jump. In fact, for t0 = −∞, the evolution of the ‘network voltage error’ becomes

Here we used that according to Eq. 70 the error v jumps to $v(+0) = v(-0) + W_{in}/\tau$. Remember that for initialization far in the past, the stationarity condition is equivalent to $v = W_{in}\, \bar{r}_{in}$, see Eq. 67. We therefore have to calculate the jump in $\bar{r}_{in}$ induced by the delta-input. Since $\bar{r}_{in}$ is the solution of $\tau \dot{\bar{r}}_{in} = -\bar{r}_{in} + r_{in}$ for t0 = −∞, we find that

Combining the two Eqs 71 and 72 proves that the stationarity condition holds true at any moment in time, provided the initialization is far in the past.

One may ask why the delta-kink is different from resetting v at a new initialization off from 0. The reason is that at t = 0 there is a cause for the jump in r(0), while at t0 there is no cause in r(t0). In fact, there is no jump initially, just the start of v at some initial condition. Differently from the initialization at t0, where v(t0) > 0 implies a non-vanishing gradient for finite t − t0 > 0, the jump of v(0) at t = 0 to a positive value does leave the stationarity condition intact for all t > 0, provided t0 = −∞.

Linear transfer function

We first consider the case of a linear transfer function ρ(u) = u (or threshold-linear, being in the linear regime). Then the differential equation becomes

$\tau \dot{u} = -u + W_{net} (u + \tau \dot{u}) + W_{in} r_{in}$ . (73)

With initialization at t0 = −∞ and low-pass filtering defined in Eq. 15 the solution is

$u(t) = \frac{W_{in}}{1 - W_{net}}\, \bar{r}_{in}(t)$ . (74)

The point is that the time constant of the relaxation is τ, and not τ/(1 − Wnet), as would be the case without the prospective firing rate. In fact, for the ‘classical’ differential equation,

$\tau \dot{u} = -u + W_{net}\, \rho(u) + W_{in} r_{in}$ , (75)

and ρ the identity, we obtain the differential equation $\tau \dot{u} = -(1 - W_{net})\, u + W_{in} r_{in}$, with effective time constant $\tau_{\mathrm{eff}} = \tau / (1 - W_{net})$ and a solution that is now the low-pass filtering of the input with respect to this effective time constant.
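A quick numerical comparison of the two relaxation times (parameter values are arbitrary assumptions; both dynamics share the fixed point u∞ = Win r̄in/(1 − Wnet)):

```python
import numpy as np

tau, dt, W_net, W_in = 10e-3, 1e-5, 0.8, 1.0
u_nla, u_cls, trace_nla, trace_cls = 0.0, 0.0, [], []

for _ in range(int(0.4 / dt)):                 # constant step input r_in = 1
    # NLA with prospective rate (Eq. 73), reduced to du/dt = (-u + u_inf)/tau
    u_nla += dt / tau * (-u_nla + W_in / (1.0 - W_net))
    # Classical dynamics (Eq. 75): tau du/dt = -(1 - W_net) u + W_in
    u_cls += dt / tau * (-(1.0 - W_net) * u_cls + W_in)
    trace_nla.append(u_nla); trace_cls.append(u_cls)

u_inf = W_in / (1.0 - W_net)                   # common fixed point (= 5.0)
level = (1.0 - np.exp(-1.0)) * u_inf           # 63% criterion
t63 = lambda tr: dt * next(i for i, v in enumerate(tr) if v >= level)
print(t63(trace_nla), t63(trace_cls))          # ~tau versus ~tau/(1 - W_net)
```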

Sigmoidal transfer function

For a sigmoidal transfer function, a positive feedback weight Wnet > 0, and a constant external input rin, say, the solutions u(t) of Eq. 64 either converge to a fixed point or diverge. When converging, the voltage satisfies the fixed point condition

$u = W_{net}\, \rho(u) + W_{in}\, \bar{r}_{in}$ . (76)

This fixed point equation can be numerically solved by a time-discrete iteration process. But it can also be solved by a time-continuous process that underlies a neural or neuromorphic implementation. The prospective firing rate introduced in the NLA can be seen as a method to quickly find the fixed point in continuous time. When directly solving the implicit differential equation (as opposed to converting this into an explicit differential equation using e.g. the Cholesky decomposition), the fixed point is potentially found with fewer steps.

To estimate the speed of convergence, we look at the initial speed when taking off at an initial condition u(0) between the unstable and the stable fixed point. The initial speeds for the classical differential equation, Eq. 75, and the NLA version, Eq. 64, are, respectively,

where we set Δu(0) = −u(0) + Wnet ρ(u(0)) + Win rin(0). As Wnet ρ′ > 0, the initial convergence speed of the NLA solution is larger. The scheme has some resemblance to the Newton algorithm for finding zeros of a function by using its derivative.

F Generalizations: NLA for conductance-based neurons and more dynamic variables

The mismatch energies and costs can be generalized in different ways. Here we focus on a biophysical version of the mismatch energy that includes conductance-based neurons. This also relates to the least-action principle in physics. But the NLA can also be generalized to include other dynamical variables such as adaptive thresholds or synaptic short-term plasticity.

Equivalent somato-dendritic circuit

For conductance-based synapses, the excitatory and inhibitory conductances, gE and gI, are driven by the presynaptic firing rates and have the form gE(t) = WE r(t), and analogously gI(t) = WI r(t). The dynamics of a somatic voltage u and a dendritic voltage v read as

where c and cd are the somatic and dendritic capacitances, EL/E/I the reversal potentials for the leak, the excitatory and inhibitory currents, gsd the transfer conductance from the dendrite to the soma, and gds in the other direction.

We consider the case when the dendritic capacitance cd is small as compared to the sum of conductances gd on the right-hand side of Eq. 79, yielding a fast dendritic time constant. In this case we can solve this equation in the steady state for v, plug this into Eq. 78, and get

with effective reversal potential V, total conductance g, feedforward dendritic voltage vff = (gL EL + gE EE + gI EI)/gff and feedforward dendritic conductance gff = gL + gE + gI. Because the conductances are driven by the prospective presynaptic rates, the total conductance depends on the presynaptic voltage and its derivative. Equation 80 describes an effective circuit that has the identical voltage time course as Eqs 78 and 79 in this limit, but with a single time-dependent ‘battery voltage’ V and Ohmic resistance R = 1/g.
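A minimal sketch of this effective circuit, using only the feedforward quantities vff and gff given above (units and weights are arbitrary assumptions; the somato-dendritic transfer conductances entering Eq. 80 are omitted for brevity):

```python
import numpy as np

E_L, E_E, E_I, g_L = -70.0, 0.0, -80.0, 0.1   # reversal potentials and leak
c, dt = 1.0, 1e-3                             # capacitance and Euler step

def step(u, r_E, r_I, W_E=0.05, W_I=0.05):
    g_E, g_I = W_E * r_E, W_I * r_I           # conductances driven by rates
    g = g_L + g_E + g_I                       # feedforward conductance g_ff
    V = (g_L * E_L + g_E * E_E + g_I * E_I) / g   # battery voltage v_ff
    u += dt * g / c * (V - u)                 # c du/dt = g (V - u)
    return u, V

u = E_L
for _ in range(50000):                        # relax towards V for fixed rates
    u, V = step(u, r_E=4.0, r_I=1.0)
print(round(u, 2), round(V, 2))               # u has converged to V (about -31.4)
```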

Somato-dendritic mismatch power and action

The synaptic inputs gE(t) and gI(t) are continuously driving V(t), and the best one can hope for the dynamics of u is that it traces V with some integration delay determined by the time constant τ = c/g. In fact, if u follows the dynamics of Eq. 80, then u becomes the low-pass filtered target potential, u = V̄, where V is filtered with the dynamic time constant τ(t). The defining equation for the low-pass filtering is

and this self-consistency equation is equivalent to the explicit form

To capture the voltage dynamics with our NLA principle we recall that the somatic voltage u can be nudged by an ‘apical voltage’ that causes a somatic voltage error ē. The voltage error drives a current I = g ē through the conductance g. The electrical power of this current I driven by the voltage ē is P = g ē². This motivates the definition of the mismatch power in a network of N neurons by

𝒫 is a virtual power that, nevertheless, is related to some physical flow of ions. Assume we could measure all the ions flowing in the original circuit of Eqs 78 and 79 (in the limit of a small ratio cd/gd). From this flow, delete the ion flow that cancels at the level of electrical charge exchange due to the counter-directed flow. The remaining effective ion flow defines an effective current flowing through the conductance g with driving force (V − u), Eq. 80. If it were only this effective current in the reduced circuit, the voltage u(t), starting at u(0), would converge with time constant τ to the low-pass filtering V̄. Without additional ‘hidden’ current, the voltage u would then instantaneously follow V̄, which is itself given by the forward dendritic input conductances. The deviation of u from V̄, caused by some initial conditions in u or by a feedback current from the network affecting u, builds the mismatch power 𝒫. The feedback may originate from a target imposed downstream, and the neuron is ‘free’ in how to dynamically match u and V̄. It is therefore tempting to see 𝒫 as a ‘free power’, and the NLA principle as minimizing the corresponding ‘free energy’. In fact, the free-energy principle says that any self-organizing system that is in a dynamic equilibrium with its environment (here in the absence of output nudging) must minimize its free energy (that here builds up by imposing a target; Friston, 2010).

The NLA principle states that the time integral of 𝒫 is minimized with respect to the look-ahead voltage ũ. We therefore define the physical mismatch energy as

    EM = ∫ 𝒫 dt,   (84)

which has the units of energy. EM takes the role of our neural action (A in the main text) for conductance-based neurons.
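Numerically, Eqs 83 and 84 amount to a conductance-weighted sum of squared voltage errors and its time integral; a minimal sketch (our reading of the definitions, with ēi = ui − V̄i) could look as follows:

    import numpy as np

    # Sketch of Eqs 83-84 as we read them: mismatch power P = sum_i g_i ebar_i^2
    # with ebar_i = u_i - Vbar_i, and mismatch energy E_M = integral of P dt.
    def mismatch_power(u, Vbar, g):
        ebar = u - Vbar                 # somatic voltage errors, one per neuron
        return np.sum(g * ebar**2)      # electrical power of the error currents

    def mismatch_energy(u_traj, Vbar_traj, g_traj, dt):
        # Riemann sum approximating E_M = integral of P dt along a trajectory
        return dt * sum(mismatch_power(u, V, g)
                        for u, V, g in zip(u_traj, Vbar_traj, g_traj))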

Euler-Lagrange equations for conductance-based neurons

The NLA for conductance-based neurons seeks to minimize A = ∫ 𝒫(u) dt with respect to variations of u, such that δA = 0. In the simplest example of the main text we considered prospective rates, r = ρ(u) + τ dρ(u)/dt, so that the low-pass filtered rates become a function of the instantaneous voltage, r̄ = ρ(u). These low-pass filtered presynaptic rates, r̄, determine the postsynaptic voltage u. Analogously, the low-pass filtered reversal potential, V̄, determines the postsynaptic voltage u, and we again postulate that V̄ is an instantaneous function of the presynaptic voltage. Here we argue that active dendritic mechanisms advance the postsynaptic reversal potential V, so that the delayed V̄ again becomes instantaneous, similarly to the advancement of the apical dendritic potential observed in cortical pyramidal neurons (Ulrich, 2002), see also Fig. 2b. With this instantaneity, the stationarity of the action with respect to generalized (compact and non-compact) variations, δA = 0, translates to the condition ∂𝒫/∂ũ = 0.
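The instantaneity argument can be checked numerically: low-pass filtering the prospective rate recovers ρ(u) at every instant. The sketch below uses a toy voltage trace and an assumed sigmoidal ρ:

    import numpy as np

    # Sketch: with the prospective rate r = rho(u) + tau * d(rho(u))/dt, the
    # low-pass filtered rate rbar equals rho(u) at every instant. The voltage
    # trace and the nonlinearity are toy assumptions.
    dt, tau = 0.001, 0.1
    ts = np.arange(0.0, 5.0, dt)
    u = np.tanh(ts - 2.0)                     # arbitrary smooth voltage trace
    rho = lambda x: 1.0 / (1.0 + np.exp(-x))  # rate nonlinearity (assumed)

    r = rho(u) + tau * np.gradient(rho(u), dt)  # prospective (look-ahead) rate
    rbar = np.empty_like(r)
    rbar[0] = rho(u[0])                       # matched initial condition
    for k in range(1, len(ts)):
        rbar[k] = rbar[k-1] + dt * (r[k-1] - rbar[k-1]) / tau  # low-pass filter

    print(np.max(np.abs(rbar - rho(u))))      # ~0 up to discretization error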

Calculating ∂𝒫/∂ũ with 𝒫 given in Eq. 83 and τ = c/g for the total conductance g(u) specified after Eq. 80 is a bit more demanding. For a probabilistic version, where 𝒫 is derived from the negative log-likelihood of a Gaussian density of the voltage, the calculation is carried out in Jordan et al., 2022. In the probabilistic version, an additional normalization term enters as log g. In the deterministic version considered here, this log term is not present and we calculate

Notice that the transpose selects the downstream network neurons from which the first- and second-order errors are backpropagated. From this expression and τ = c/g we conclude that

We next apply the look-ahead operator to the expression in Eq. 86. Assuming an initialization at t0 = −∞, the condition ∂𝒫/∂ũ = 0 becomes equivalent to its look-ahead version, (1 + τ d/dt) ∂𝒫/∂ũ = 0, and hence Eq. 86 becomes equivalent to

A learning rule of the form Ẇ ∝ ēpost r̄ᵀ, with an appropriate postsynaptic error ēpost, can again be derived (see Jordan et al., 2022, for a single neuron in the probabilistic framework). But this, starting from the above sketch, still needs to be worked out.
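For illustration, the general form of such a rule is an outer product between a postsynaptic error vector and the low-pass filtered presynaptic rates; the following fragment shows one update step (all values are placeholders, and the precise error ēpost remains to be derived):

    import numpy as np

    # Hedged sketch of the general form W_dot ∝ ebar_post * rbar_pre^T: a
    # Hebbian outer product between postsynaptic errors and low-pass filtered
    # presynaptic rates. All values are illustrative placeholders.
    eta = 1e-3                                # learning rate (assumed)
    ebar_post = np.array([0.2, -0.1])         # postsynaptic dendritic errors
    rbar_pre = np.array([0.9, 0.1, 0.5])      # low-pass filtered presyn. rates
    W = np.zeros((2, 3))
    W += eta * np.outer(ebar_post, rbar_pre)  # one plasticity step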

Generalizations: long memories, reinforcement learning

One could also extend the NLA principle by adding, for example, a threshold adaptation that endows the dynamics with additional and longer time constants. For this, the rate function is parametrized by an additional threshold, ϑ, and the Lagrangian is augmented by an error term on the threshold, where the threshold now represents a low-pass filtering of the rate with a long threshold adaptation time constant τϑ. Neurons that show such an additional threshold adaptation will still be able to instantaneously transmit a voltage jump through a prospective firing rate, but this transmission will now also depend on the neuron’s history (a minimal sketch follows below). Short-term plasticity may be included in the same way. Due to the history dependence, the stationarity of the action, δA = 0, can no longer be reduced to the stationarity of the Lagrangian at any moment in time. As a consequence, the errors will look ahead into the future based on more than only the local derivatives.
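A minimal sketch of such an adaptive threshold, under our own simplifying assumptions (first-order voltage dynamics, rectified-linear rate), could read:

    import numpy as np

    # Minimal sketch, under our own simplifying assumptions: the threshold
    # theta tracks a low-pass filter of the rate with a long time constant
    # tau_theta, so the rate rho(u - theta) depends on the neuron's history.
    dt, tau, tau_theta = 0.001, 0.01, 1.0
    rho = lambda x: np.maximum(0.0, x)         # rate nonlinearity (assumed)

    u, theta = 0.0, 0.0
    for k in range(5000):
        u_in = 1.0 if k*dt > 1.0 else 0.0      # step input at t = 1
        u += dt * (u_in - u) / tau             # fast (toy) voltage dynamics
        r = rho(u - theta)                     # threshold-adapted rate
        theta += dt * (r - theta) / tau_theta  # slow threshold adaptation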

Generalizations are further possible for the cost function, with voltage regions of high cost, say, corresponding to punishment, and regions of negative cost corresponding to reward. These extensions will be considered in future work.

Voltage dynamics from the physical least-action principle

We have shown that through the look-ahead in Hamilton’s least-action principle, the notion of friction enters through the backdoor. In the least-action formalism of physics, friction is instead introduced directly by extending the Hamiltonian principle to a generalized D’Alembert principle, where at the level of the Euler-Lagrange equations the generalized force is equated to the dissipation force (Flannery, 2005).

The electro-chemical properties of a membrane can be captured by an equivalent circuit consisting of a battery voltage V, a capacitance C and a resistance R, arranged in parallel. The voltage dynamics is derived from the Euler-Lagrange equations augmented by a dissipative force. Formally, in the absence of an inductance that would define a kinetic energy, the Lagrangian becomes identical to the negative potential energy, ℒ = −𝒰, with

    𝒰 = Q²/(2C) − VQ  and  ℱ = (R/2) Q̇².   (88)

Here, ℱ is the dissipative Rayleigh energy related to friction and Q represents the charge across the membrane. According to D’Alembert’s principle, the dynamics is characterized by the Euler-Lagrange equation with dissipative force, ∂ℒ/∂Q = ∂ℱ/∂Q̇, and this equation reduces to R Q̇ = V − Q/C. Identifying the charge by means of the voltage across the capacitance, Q = Cu, this equation can also be written as C u̇ = (V − u)/R, or τ u̇ = V − u with τ = RC, just as also derived in Eq. 87. Loosely speaking, the minimization of the mismatch energy (EM = ∫ 𝒫 dt) by looking ahead is equivalent to a minimization without looking ahead, but taking account of friction.
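The reduction can be verified in a few lines: integrating τ u̇ = V − u lets u converge to the battery voltage V with time constant τ = RC, as stated. A sketch with assumed circuit values:

    # Sketch: the dissipative Euler-Lagrange dynamics of the parallel RC
    # circuit, tau * du/dt = V - u with tau = RC; u converges to the battery
    # voltage V with time constant tau (circuit values are assumptions).
    R, C, V = 2.0, 0.5, 1.0
    tau = R * C
    dt, u = 0.001, 0.0
    for _ in range(int(5 * tau / dt)):   # integrate for five time constants
        u += dt * (V - u) / tau          # tau u_dot = V - u
    assert abs(u - V) < 0.01 * abs(V)    # within ~e^-5 of V after 5 tau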

G A tutorial on total and partial derivatives as used in the paper

The proof of Theorem 1 given in the Methods makes use of partial and total derivatives and follows the notation of Scellier & Bengio, 2017 and Meulemans et al., 2022. As there is some variability (and community-dependent sloppiness) in the notation of partial and total derivatives, we provide some explanations of how this notation is to be interpreted and why, for instance, total derivatives commute with each other, and partial derivatives do so as well (although total derivatives do not commute with partial ones).

  1. In a differential-geometric setting, the derivative of a real-valued function E(u) at a point u of a manifold (like the flat Euclidean space) is considered as a mapping of tangent vectors at u to the real numbers. When interpreting the derivative as such a mapping, ∂E/∂u lives in the dual space of the tangent space at u and is therefore a row vector if u is a column vector. If a function f(u) of u is vector-valued, then its derivative is a matrix with entries ∂fi/∂uj, where i indexes the rows (i.e. running down) and j indexes the columns (i.e. running right). When r = ρ(u) is a column vector with ρ applied to each component of u, we consider the (partial) derivative r′ = ρ′(u) for convenience as a column vector with components ρ′(ui). To strictly follow the formalism, it would have to be a diagonal matrix.

  2. Because we introduced the error vector ē as a column vector, it is easier to also write the cost derivative defining it as a column vector. To be consistent with the above, we should have written it with a transpose. The ·ᵀ appeared to us as too heavy, so we dropped it where this had no further mathematical consequences (hoping it does not cause confusion). Sticking to a column vector here also renders the backpropagated error a column vector in Eq. 21. This error then takes the classical form with the weight transpose, Wᵀ.

  3. We typically have real-valued functions of the form E(uθ, θ), with θ = (W, β) being a vector of parameters and u being a function of θ. To get the total derivative of E with respect to θ, we consider the values of E as a function of θ alone. This can be done by introducing a new function Ẽ(θ) defined as Ẽ(θ) = E(uθ, θ), where the components θi are considered as independent variables. The total derivative of E with respect to θ is then defined as the vector-valued function dE/dθ = dẼ/dθ for an n-dimensional θ. It can be helpful to think of the components of this total derivative as (total) directional derivatives in the unit directions. For the last (n-th) unit direction Δθ(n) = (0, .., 0, 1), for instance, we have dE/dθn = limh→0 (Ẽ(θ + hΔθ(n)) − Ẽ(θ))/h.

  4. The cost gradient dC/dW has the same dimension as W. Recall that by the cost gradient we mean the total derivative dC/dW, where C is defined as a function of the voltage uo of the output neurons, which is itself a function of W.

  5. To calculate the partial derivative of E with respect to θ, we fix the first argument uθ, even if for u we often plug in the components of the trajectory u = uθ(t), which does depend on θ. In contrast, the total derivative is dE/dθ = ∂E/∂u · duθ/dθ + ∂E/∂θ. Here, ∂E/∂u is a row vector, as is ∂E/∂θ, consistent with the convention above (that we have broken in Eq. 28 to keep vectors as columns). When u is considered as a trajectory, duθ/dθ does not vanish in general, but it does vanish when u is simply considered as an independent variable (see the numerical sketch after this list).

  6. When replacing the argument u in E(u, θ) by the look-ahead voltage ũ, we get the ‘Lagrangian’ L. The partial derivative of L with respect to ũ, for instance, is then ∂L/∂ũ = ∂E/∂u evaluated at u = ũ. This partial derivative considers L as a function of the independent arguments ũ and θ.

  7. We also used that total derivatives can be exchanged, in the current example d²E/(dW dβ) = d²E/(dβ dW). This is generally true for derivatives of Lipschitz continuous functions, for which derivatives exist almost everywhere. The total derivatives (where they exist) then commute because the difference quotients in W and β are uniformly bounded. The Moore-Osgood theorem tells us that two limits, of which at least one is uniform in the other, can be interchanged. This also applies to the double difference quotients involved in the definition of the second total derivatives. Remember that the total derivative, for instance with respect to the n-th parameter, can be written as dE/dθn = limh→0 (Ẽ(θ + hΔθ(n)) − Ẽ(θ))/h.
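The following finite-difference sketch (our own toy function E(u, θ) = u²θ with uθ = sin θ, not from the paper) illustrates the distinction between partial and total derivatives and the chain rule of point 5:

    import numpy as np

    # Finite-difference illustration (toy function, not from the paper) of the
    # partial/total distinction: E(u, theta) = u^2 * theta, u_theta = sin(theta).
    E = lambda u, th: u**2 * th
    u_of = np.sin                     # u as a function of theta
    th, h = 0.7, 1e-6

    # partial dE/dtheta: the first argument u is held fixed
    u_fixed = u_of(th)
    dE_partial = (E(u_fixed, th + h) - E(u_fixed, th - h)) / (2*h)

    # total dE/dtheta: differentiate the nested function F(theta) = E(u(theta), theta)
    F = lambda t: E(u_of(t), t)
    dE_total = (F(th + h) - F(th - h)) / (2*h)

    # chain rule: dE/dtheta = dE/du * du/dtheta + partial dE/dtheta
    chain = 2*u_fixed*th * np.cos(th) + u_fixed**2
    print(dE_partial, dE_total, chain)  # dE_total matches the chain rule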

H Proof of Theorem 1, part (ii), using only partial derivatives

This section proves Eqs 29–31 in terms of partial derivatives only, avoiding nested functions that would require dealing with total derivatives. The three equations are also the core of the proof of Equilibrium Propagation (Scellier & Bengio, 2017), although there they are applied only in the steady state, after the network has converged to a constant activity.

We assume a network of d neurons whose membrane potentials are given by u ∈ ℝd and which are connected via weights W ∈ ℝd×d. By ∇u, we denote the gradient with respect to the membrane potentials, i.e., (∇u)i = ∂/∂ui. Similarly, ∇W is a matrix containing the derivatives with respect to the weights, (∇W)ij = ∂/∂Wij.

To prove the rt-DeEP theorem (Theorem 1), we first have to make a few definitions and observations:

  1. For a given (W, β), the dynamics yield certain membrane potentials fu ∈ ℝd. Formally, we define this as

    fu : ℝd×d × ℝ → ℝd, (W, β) ↦ fu(W, β),

    where fu(W, β) is the solution u of the Euler-Lagrange equations (see point 3 below). The ith element of fu is denoted by fui, and hence ui = fui(W, β).

  2. We define the following functions:

    • the mismatch energy EM : ℝd × ℝd×d → ℝ, (u, W) ↦ EM(u, W),

    • the cost function C : ℝd → ℝ, u ↦ C(u),

    • the Lagrangian L : ℝd × ℝd×d × ℝ → ℝ, (u, W, β) ↦ L(u, W, β) = EM(u, W) + βC(u).

      To make the dependency of the cost and energies on β and W explicit, we further introduce three auxiliary functions FM, FC and FL:

    • for the mismatch energy FM : ℝd×d × ℝ → ℝ, (W, β) ↦ FM(W, β) = EM(fu(W, β), W),

    • for the cost function FC : ℝd×d × ℝ → ℝ, (W, β) ↦ FC(W, β) = C(fu(W, β)),

    • for the Lagrangian FL : ℝd×d × ℝ → ℝ, (W, β) ↦ FL(W, β) = FM(W, β) + βFC(W, β).

  3. The Euler-Lagrange equations can be written as ∇u L(u, W, β) = 0. Hence, far enough away from initialization and for smooth enough input (and targets), we have ∇u L(fu(W, β), W, β) = 0 at all times, even when the network input changes continuously. Note that both the cost and the mismatch energy are defined on low-pass-filtered signals, and it is with respect to the low-pass-filtered external input that the low-pass-filtered output error is minimized.

  4. Without output nudging (i.e., β = 0), the output error vanishes and consequently all other prediction errors vanish as well, ē = 0. This can easily be shown for layered network architectures, and it holds true for arbitrary connectivities (e.g., recurrent networks) as long as fu uniquely exists, i.e., as long as ∇u L(u, W, β) = 0 has a unique solution for u. From the form of the mismatch energy, we then get

    ∇W EM(fu(W, 0), W) = 0.

    Since we are assuming smooth functions, this also implies that

    limβ→0 ∇W EM(fu(W, β), W) = 0.
  5. From the assumption of well-behaved (smooth) functions, it also follows that partial derivatives commute, ∂²FL/(∂β ∂Wij) = ∂²FL/(∂Wij ∂β).

Our goal is to find a plasticity rule that minimizes the cost C, which we do by calculating the gradient ∇W FC. Similar to Scellier & Bengio (2017), to achieve this, we first calculate the partial derivative of FL with respect to the nudging strength β,

    ∂FL/∂β = ∇u L(fu(W, β), W, β)ᵀ ∂fu/∂β + C(fu(W, β)) = FC(W, β),

where the first term vanishes by the Euler-Lagrange equations (point 3), and with respect to the weights W,

    ∂FL/∂Wij = ∇u L(fu(W, β), W, β)ᵀ ∂fu/∂Wij + ∂EM/∂Wij(fu(W, β), W) = ∂EM/∂Wij(fu(W, β), W),

or in vectorized form

    ∇W FL(W, β) = ∇W EM(fu(W, β), W).
With these identities in place, we can calculate the plasticity rule:

    ∇W FC(W, 0) = ∇W (∂FL/∂β)|β=0 = ∂/∂β (∇W FL)|β=0 = limβ→0 (1/β) [∇W FL(W, β) − ∇W FL(W, 0)] = limβ→0 (1/β) ∇W EM(fu(W, β), W),

where we used that limits can be exchanged for smooth functions, and that ∇W FL(W, 0) = ∇W EM(fu(W, 0), W) = 0 by point 4. Using the definition of EM, we obtain a plasticity rule that minimizes the cost function,

    Ẇ = −η ∇W FC ≈ −(η/β) ∇W EM(fu(W, β), W) for small β.

In practice, the prefactor 1/β is absorbed into the learning rate.
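The chain of identities can be verified numerically on a toy network. The sketch below (a two-neuron chain with ρ = tanh; all names and values are our own assumptions, not from the paper) finds fu(W, β) by gradient descent on L and compares (1/β) ∇W EM at the nudged state with a finite-difference estimate of ∇W FC:

    import numpy as np

    # Numerical check (our own toy example) of the identity behind the proof:
    # dF_C/dW = lim_{beta->0} (1/beta) * grad_W E_M at u = f_u(W, beta).
    # Network: scalar chain x -> u1 -> u2 with weights w1, w2 and rho = tanh.
    x, target = 1.0, 0.5
    rho = np.tanh

    def free_potentials(w1, w2, beta, steps=5000, lr=0.05):
        """Approximate f_u(W, beta) by gradient descent on L = E_M + beta*C."""
        u1, u2 = 0.0, 0.0
        for _ in range(steps):
            e1 = u1 - w1 * x                   # dendritic prediction errors
            e2 = u2 - w2 * rho(u1)
            dL_du1 = e1 - e2 * w2 * (1 - rho(u1)**2)
            dL_du2 = e2 + beta * (u2 - target)
            u1, u2 = u1 - lr * dL_du1, u2 - lr * dL_du2
        return u1, u2

    def grad_W_EM(w1, w2, beta):               # dE_M/dW at u = f_u(W, beta)
        u1, u2 = free_potentials(w1, w2, beta)
        return np.array([-(u1 - w1*x) * x, -(u2 - w2*rho(u1)) * rho(u1)])

    def F_C(w1, w2):                           # cost at the free (beta = 0) state
        _, u2 = free_potentials(w1, w2, 0.0)
        return 0.5 * (u2 - target)**2

    w1, w2, beta, h = 0.8, -0.6, 1e-3, 1e-5
    ep_grad = grad_W_EM(w1, w2, beta) / beta
    fd_grad = np.array([(F_C(w1 + h, w2) - F_C(w1 - h, w2)) / (2*h),
                        (F_C(w1, w2 + h) - F_C(w1, w2 - h)) / (2*h)])
    print(ep_grad, fd_grad)                    # the two gradients agree to O(beta)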