Evolino: Hybrid Neuroevolution / Optimal Linear Search for Sequence Learning Jurgen Schmidhuber1,2 , Daan Wierstra2 , and Faustino Gomez2 ¨ {juergen, daan, tino}@idsia.ch 1 TU Munich, Boltzmannstr. 3, 85748 Garching, Munchen, Germany ¨ 2 IDSIA, Galleria 2, 6928 Lugano, Switzerland Abstract Echo State Networks (ESNs; [Jaeger, 2004a]) deal with temporal dependencies by simply ignoring the gradients as- Current Neural Network learning algorithms are sociated with hidden neurons. Composed primarily of a large limited in their ability to model non-linear dynami- pool of neurons (typically hundreds or thousands) with fixed cal systems. Most supervised gradient-based recur- random weights, ESNs are trained by computing a set of rent neural networks (RNNs) suffer from a vanish- weights analytically from the pool to the output units using ing error signal that prevents learning from inputs fast, linear regression. The idea is that with so many ran- far in the past. Those that do not, still have prob- dom hidden units, the pool is capable of very rich dynamics lems when there are numerous local minima. We that just need to be correctly "tapped" by adjusting the output introduce a general framework for sequence learn- weights. This simple approach is currently the title holder in ing, EVOlution of recurrent systems with LINear the Mackey-Glass time-series benchmark, improving on the outputs (Evolino). Evolino uses evolution to dis- accuracy of all other methods by as much as three orders of cover good RNN hidden node weights, while us- magnitude [Jaeger, 2004a]. ing methods such as linear regression or quadratic The drawback of ESNs, of course, is that the only truly programming to compute optimal linear mappings computationally powerful, nonlinear part of the net does not from hidden state to output. Using the Long Short- learn at all. This means that on some seemingly simple tasks, Term Memory RNN Architecture, the method is such as generating multiple superimposed sine waves, the tested in three very different problem domains: 1) method fails. According to our experience, it is also not able context-sensitive languages, 2) multiple superim- to solve a simple context-sensitive grammar task [Gers and posed sine waves, and 3) the Mackey-Glass sys- Schmidhuber, 2001]. Moreover, because ESNs use such a tem. Evolino performs exceptionally well across large number of processing units, they are prone to overfit- all tasks, where other methods show notable defi- ting, i.e. poor generalization. ciencies in some. One method that adapts all weights and succeeds in us- ing gradient information to learn long-term dependencies is Long Short-Term Memory (LSTM; [Hochreiter and Schmid- 1 Introduction huber, 1997; Gers and Schmidhuber, 2001]). LSTM uses a specialized network architecture that includes linear memory Real world non-linear dynamical systems are black-box in na- cells that can sustain their activation indefinitely. The cells ture: it is possible to observe their input/output behavior, but have input and output gates that learn to open and close at the internal mechanism that generates this behavior is often appropriate times either to let in new information from out- unknown. Modeling such systems to accurately predict their side and change the state of the cell, or to let activation out behavior is a huge challenge with potentially far-reaching im- to potentially affect other cells or the network's output. The pact on areas as broad as speech processing/recognition, fi- cell structure enables LSTM to use gradient descent to learn nancial forecasting, and engineering. dependencies across almost arbitrarily long time spans. How- Artificial Neural Networks with feedback connections or ever, in cases where gradient information is of little use due Recurrent Neural Networks (RNNs; [Werbos, 1990; Robin- to numerous local minima, LSTM becomes less competitive. son and Fallside, 1987; Williams and Zipser, 1989]) are An alternative approach to training RNNs is neuroevolu- an attractive formalism for non-linear modeling because of tion [Yao, 1999]. Instead of using a single neural network, their ability, in principle, to approximate any dynamical sys- tem with arbitrary precision [Siegelmann and Sontag, 1991]. the space of network parameters is searched in parallel using the principle of natural selection. A population of chromo- However, training RNNs with standard gradient descent algo- somes or strings encoding, for instance, network weight val- rithms is only practical when a short time window (less than ues and connectivity is evaluated on the problem, and each 10 time-steps) is sufficient to predict the correct system out- chromosome is awarded a fitness value that quantifies its rel- put. For longer temporal dependencies, the gradient vanishes ative performance. The more highly fit chromosomes are as the error signal is propagated back through time so that combined by exchanging substrings (crossover) and by ran- network weights are never adjusted correctly to account for events far in the past [Hochreiter et al., 2001]. domly changing some values (mutation), producing new so- problem. Properties that require non-linearity and recurrence y1(t) y2(t) ym(t) are then dealt with by evolution. Figure 1 illustrates the basic operation of an Evolino net- work. The output of the network at time t, y (t) Rm , is Linear Output computed by the following formulas: Layer W y (t) = W (t), (1 ) 1(t) 2(t) 3(t) 4(t) n(t) (t) = f (u(t), u(t - 1), . . . , u(0)), (2 ) n where (t) R is the output of a recurrent neural network Recurrent f (·), and W is a weight matrix. Note that because the net- works are recurrent, f (·) is indeed a function of the entire in- Neural Network put history, u(t), u(t - 1), . . . , u(0). In the case of maximum margin classification problems [Vapnik, 1995] we may com- pute W by quadratic programming. In what follows, how- u1(t) u2(t) u3(t) u4(t) up(t) ever, we focus on mean squared error minimization problems and compute W by linear regression. Figure 1: Evolino network. A recurrent neural network receives In order to evolve an f (·) that minimizes the error between sequential inputs u(t) and produce the vector (1 , 2 , . . . , n ) at ev- y and the correct output, d, of the system being modeled, ery time step t. These values are linearly combined with the weight Evolino does not specify a particular evolutionary algorithm, matrix W to yield the network's output vector y (t). While the RNN but rather only stipulates that networks be evaluated using the is evolved, the output layer weights are computed using a fast, opti- following two-phase procedure. mal method such as linear regression or quadratic programming. In the first phase, a training set of sequences obtained from the system, {ui , di }, i = 1..k , each of length li , is presented to the network. For each sequence ui , starting at time t = 0, lutions that hopefully improve upon the existing population. each input pattern ui (t) is successively propagated through This approach has been very effective in solving continu- the recurrent network to produce a vector of activations i (t) ous, partially observable reinforcement learning tasks where k the gradient is not directly available, outperforming conven- that is stored as a row in a n × i li matrix . Associated tional methods (e.g. Q-learning, SARSA) on several diffi- with each i (t), is a target row vector di in D containing the cult learning benchmarks [Moriarty and Miikkulainen, 1996; correct output values for each time step. Once all k sequences Gomez and Miikkulainen, 1999]. However, neuroevolution have been seen, the output weights W (the output layer in fig- is rarely used for supervised learning tasks such as time se- ure 1) are computed using linear regression from to D. The ries prediction because it has difficulty fine-tuning solution column vectors in (i.e. the values of each of the n outputs parameters (e.g. network weights), and because of the pre- over the entire training set) form a non-orthogonal basis that vailing maxim that gradient information should be used when is combined linearly by W to approximate D. it is available. In the second phase, the training set is presented to the net- In this paper, we present a novel framework called EVO- work again, but now the inputs are propagated through the lution of recurrent systems with LINear outputs (Evolino) recurrent network f (·) and the newly computed output con- that combines elements of the three aforementioned meth- nections to produce predictions y (t). The error in the predic- ods, to address the disadvantages of each, extending ideas tion or the residual error is then used as the fitness measure proposed for feedforward networks of radial basis functions to be minimized by evolution. (RBFs) [Maillard and Gueriot, 1997]. Applied to the LSTM Neuroevolution is normally applied to reinforcement learn- architecture, Evolino can solve tasks that ESNs cannot, and ing tasks where correct network outputs (i.e. targets) are not achieves higher accuracy in certain continuous function gen- known a priori. Here we use neuroevolution for supervised eration tasks than conventional gradient descent RNNs, in- learning to circumvent the problems of gradient-based ap- cluding gradient-based LSTM. proaches. In order to obtain the precision required for time- Section 2 explains the basic concept of Evolino and de- series prediction, we do not try to evolve a network that makes scribes in detail the specific implementation used in this pa- predictions directly. Instead, the network outputs a set of vec- per. Section 3 presents our experiments using Evolino in tors that form a basis for linear regression. The intuition is three different domains: context-sensitive grammars, contin- that finding a sufficiently good basis is easier than trying to uous function generation, and the Mackey-Glass time-series. find a network that models the system accurately on its own. Section 4 and 5 discuss the algorithm and the experimental In this study, Evolino is instantiated using Enforced Sub- results, and summarize our conclusions. Populations to evolve LSTM networks. The next sections de- scribe ESP and LSTM, and the details of how they are com- 2 The Evolino Framework bined within the Evolino framework. Evolino is a general framework for supervised sequence 2.1 Enforced Subpopulations learning that combines neuroevolution (i.e. the evolution of Enforced SubPopulations differs from standard neuroevolu- neural networks) and analytical linear methods that are opti- tion methods in that instead of evolving complete networks, mal in some sense, such as linear regression or quadratic pro- it coevolves separate subpopulations of network components gramming. The underlying principle of Evolino is that often a or neurons (figure 2). Evolution in ESP proceeds as follows: linear model can account for a large number of properties of a Time series peephole output GF ESP fitness Go S input GI output external inputs Figure 3: Long Short-Term Memory. The figure shows an LSTM pseudo-inverse memory cell. The cell has an internal state S together with a forget weights gate (GF ) that determines how much the state is attenuated at each time step. The input gate (GI ) controls access to the cell by the LSTM Network external inputs that are summed into the unit, and the output gate Figure 2: Enforced SubPopulations (ESP). The population of (GO ) controls when and how much the cell fires. Small dark nodes neurons is segregated into subpopulations. Networks are formed by represent the multiplication function. randomly selecting one neuron from each subpopulation. A neuron accumulates a fitness score by adding the fitness of each network in which it participated. The best neurons within each subpopula- functions needed to form good networks because members of tion are mated to form new neurons. The network shown here is an different evolving sub-function types are prevented from mat- LSTM network with four memory cells (the triangular shapes). ing. Subpopulations also reduce noise in the neuron fitness measure because each evolving neuron type is guaranteed to be represented in every network that is formed. This allows 1. Initialization: The number of hidden units H in the net- ESP to evolve recurrent networks, where SANE could not. works that will be evolved is specified and a subpopula- If the performance of ESP does not improve for a predeter- tion of n neuron chromosomes is created for each hidden mined number of generations, a technique called burst muta- unit. Each chromosome encodes a neuron's input, out- tion is used. The idea of burst mutation is to search the space put, and recurrent connection weights with a string of of modifications to the best solution found so far. When burst random real numbers. mutation is activated, the best neuron in each subpopulation 2. Evaluation: A neuron is selected at random from each is saved, the other neurons are deleted, and new neurons are of the H subpopulations, and combined to form a recur- created for each subpopulation by adding Cauchy distributed rent network. The network is evaluated on the task and noise to its saved neuron. Evolution then resumes, but now awarded a fitness score. The score is added to the cu- searching in a neighborhood around the previous best solu- mulative fitness of each neuron that participated in the tion. Burst mutation injects new diversity into the subpopu- network. lations and allows ESP to continue evolving after the initial 3. Recombination: For each subpopulation the neurons are subpopulations have converged. ranked by fitness, and the top quartile is recombined us- ing 1-point crossover and mutated using Cauchy dis- 2.2 Long Short-Term Memory tributed noise to create new neurons that replace the LSTM is a recurrent neural network purposely designed to lowest-ranking half of the subpopulation. learn long-term dependencies via gradient descent. The 4. Repeat the Evaluation­Recombination cycle until a suf- unique feature of the LSTM architecture is the memory cell ficiently fit network is found. that is capable of maintaining its activation indefinitely (fig- ESP searches the space of networks indirectly by sampling ure 3). Memory cells consist of a linear unit which holds the the possible networks that can be constructed from the sub- state of the cell, and three gates that can open or close over populations of neurons. Network evaluations serve to pro- time. The input gate "protects" a neuron from its input: only vide a fitness statistic that is used to produce better neurons when the gate is open, can inputs affect the internal state of that can eventually be combined to form a successful network. the neuron. The output gate lets the state out to other parts This cooperative coevolutionary approach is an extension to of the network, and the forget gate enables the state to "leak" Symbiotic, Adaptive Neuroevolution (SANE; [Moriarty and activity when it is no longer useful. Miikkulainen, 1996]) which also evolves neurons, but in a The state of cell i is computed by: single population. By using separate subpopulations, ESP si (t) = neti (t)gi n (t) + gi orget (t)si (t - 1), f i (3 ) accelerates the specialization of neurons into different sub- Training data Gradient LSTM Evolino LSTM where g in and g f orget are the activation of the input and for- get gates, respectively, and net is the weighted sum of the 1 ..1 0 1 ..2 8 1 ..5 3 external inputs (indicated by the s in figure 3): 1 ..2 0 1 ..6 6 1 ..9 5 k j 1 ..3 0 1 ..9 1 1 ..3 5 5 wik ll uk (t)), (4) ce wij ll cj (t - 1) + ce neti (t) = h( 1 ..4 0 1 ..1 2 0 1 ..8 0 4 Table 1: Generalization results for the an bn cn language. The where h is usually the identity function, and cj is the output table compares Evolino-based LSTM to Gradient-based LSTM. The of cell j : left column shows the set of legal strings used to train each method. cj (t) = tanh(gj ut (t)sj (t)). o (5 ) The other columns show the set of strings that each method was able to accept after training. The result for LSTM with gradient descent where g out is the output gate of cell j . The amount each gate are from [Gers and Schmidhuber, 2001]. Averages of 20 runs. gi of memory cell i is open or closed at time t is calculated by: k type j type gi ype (t) = ( t was found useful for continuous function generation tasks, wik uk (t)), (6) wij cj (t - 1) + but interferes to some extent with performance in the discrete where ty pe can be input, output, or f org et, and is the context-sensitive language task. standard sigmoid function. The gates receive input from the output of other cells cj , and from the external inputs to the 3 Experimental Results network. Experiments were carried out on three test problems: context- 2.3 Combining ESP with LSTM in Evolino sensitive languages, multiple superimposed sine waves, and the Mackey-Glass time series. The first two were chosen to We apply our general Evolino framework to the LSTM archi- highlight Evolino's ability to perform well in both discrete tecture, using ESP for evolution and regression for comput- and continuous domains. For a more detailed description of ing linear mappings from hidden state to outputs. ESP co- setups used in these two problems, and further experiments, evolves subpopulations of memory cells instead of standard we direct the reader to [Wierstra et al., 2005]. The Mackey- recurrent neurons (figure 2). Each chromosome is a string Glass system was selected to compare Evolino with ESNs, the containing the external input weights and the input, output, reference method on this widely used time series benchmark. and forget gate weights, for a total of 4 (I + H ) weights in each memory cell chromosome, where I is the number of 3.1 Context-Sensitive Grammars external inputs and H is the number of memory cells in the network. There are four sets of I + H weights because the Learning to recognize context-sensitive languages is a diffi- cult and often intractable problem for standard RNNs because three gates (equation 6) and the cell itself (equation 4) receive it can require unlimited memory. For instance, recognizing input from outside the cell and the other cells. ESP, as de- the language an bn cn (i.e. strings where the number of as, bs, scribed in section 2.1, normally uses crossover to recombine and cs is equal) entails counting the number of consecutive as, neurons. However, for the present Evolino variant, where fine bs, and cs, and potentially having to remember these quan- local search is desirable, ESP uses only mutation. The top tities until the whole string has been read. Gradient-based quarter of the chromosomes in each subpopulation are dupli- LSTM has previously been used to learn an bn cn , so here we cated and the copies are mutated by adding Cauchy noise to compare the results in [Gers and Schmidhuber, 2001] to those all of their weight values. of Evolino-based LSTM. The linear regression method used to compute the output weights (W in equation 2) is the Moore-Penrose pseudo- Four sets of 20 simulations were run each using a different training set of legal strings, {an bn cn }, n = 1..N , where N inverse method, which is both fast and optimal in the sense that it minimizes the summed squared error [Penrose, was 10, 20, 30, and 40. Symbol strings were presented to the 1955]--compare [Maillard and Gueriot, 1997] for an appli- networks, one symbol at a time. The networks had 4 input units, one for each possible symbol: S for start, a, b, and cation to feedforward RBF nets. The vector (t) consists of c. An input is set to 1.0 when the corresponding symbol is both the cell outputs, ci (equation 5), and their internal states, si (equation 3), so that the pseudo-inverse computes two con- observed, and -1.0 when it is not present. At every time step, the network predicts what symbols could come next, a, b, nection weights for each memory cell. We refer to the con- c, and the termination symbol T , by activating its 4 output nections from internal states to the output units as "output units. An output unit is considered to be "on" if its activation peephole" connections, since they peer into the interior of the is greater than 0.0. cells. ESP evolved LSTM networks with 4 memory cells, For continuous function generation, backprojection (or weights randomly initialized to values between -0.1 and 0.1. teacher forcing in standard RNN terminology) is used where The Cauchy noise parameter for both mutation and burst the predicted outputs are fed back as inputs in the next time mutation was set to 0.00001, i.e. 50% of the mutations is kept step: (t) = f (u(t), y (t - 1), u(t - 1), . . . , y (0), u(0)). within this bound. Evolution was terminated after 50 gener- During training, the correct target values are backprojected, ations, after which the best network in each simulation was in effect "clamping" the network's outputs to the right values. tested. During testing, the network backprojects its own predictions. This technique is also used by ESNs, but whereas ESNs do The results are summarized in Table 1. Evolino-based not change the backprojection connection weights, Evolino LSTM learns in approximately 3 minutes on average, but, evolves them, treating them like any other input to the net- more importantly, it is able to generalize substantially better work. In the experiments described below, backprojection than gradient-based LSTM. MGS) in this domain to show its capacity for making precise predicted predictions. We used the same setup in our experiments as system in [Jaeger, 2004a]. Networks were trained on the first 3000 time steps of the series using a "washout time" of 100 steps. During the washout time the vectors (t) are not collected for calculating the pseudo-inverse. We evolved networks with 30 memory cells for 200 gen- erations, and a Cauchy noise of 10-7 . A bias input of 1.0 was added to the network, and the backprojection values 300 300 1300 600 1000 time steps were scaled by a factor of 0.1. For testing, the outputs were Figure 4: Performance of Evolino on the triple superimposed clamped to the correct targets for the first 3000 steps, after sine wave task. The plot show the behavior of a typical network which the network backprojected its own prediction for the produced after 50 generations (3000 evaluations). The first 300 steps next 84 steps1 . The cell input (equation 4) was squashed with (the data-points left of the vertical dashed line) were used as training the tanh function. The average NRMSE84 for Evolino with data, the rest must be predicted by the network during testing. Time- 30 cells over the 15 runs was 1.9 × 10-3 compared to 10-4.2 steps above 300 show the network predictions (dashed curve) during for ESNs with 1000 neurons [Jaeger, 2004a]. The Evolino testing plotted against the correct system output (solid curve). The results are currently the second-best reported so far. inset is a magnified detail that more clearly shows the two curves. Figure 5 shows the performance of an Evolino network on the MG time-series with even fewer memory cells, after 50 3.2 Multiple Superimposed Sine Waves generations. Because this network has fewer parameters, it is unable to achieve the same precision as with 30 neurons, Jaeger [Jaeger, 2004b] reports that Echo State Networks are but it demonstrates how Evolino can learn complex functions unable to learn functions composed of multiple superim- very quickly; in this case within approximately 3 minutes of posed oscillators. Specifically, functions like sin(0.2x) + CPU time. sin(0.311x), in which the individual sines have the same am- plitude but their frequencies are not multiples of each other. ESNs have difficulty solving this problem because the dy- 4 Discussion namics of all the neurons in the ESN "pool" are coupled, The real strength of the Evolino framework is its general- whereas truly solving the task requires an internal representa- ity. Across different classes of sequence prediction prob- tion of multiple attractors due to the non-periodic behavior of lems, it was able to compete with the best known methods the function. and convincingly outperform them in several cases. In partic- We evolved networks with 10 memory cells to predict ular, it generalized much better than gradient-based LSTM in the aforementioned double sine, sin(0.2x) + sin(0.311x), the context-sensitive grammar task, and it solved the super- and network with 15 cells for a more complex triple sine, imposed sine wave task, which ESNs cannot. These results sin(0.2x) + sin(0.311x) + sin(0.42x). Evolino used the suggest that Evolino could be widely applicable to model- same parameter settings as in the previous section, except that ing complex processes that have both discrete and continuous backprojection was used (see section 2.3). Networks for both properties, such as speech. tasks were evolved for 50 generations to predict the first 300 Evolino avoids the problem of vanishing gradient and local time steps of each function, and then tested on data points minima normally associated with RNN training by searching from time-steps 300..600. the space of networks in parallel through evolution. Further- The average summed squared error over the training set more, by using LSTM memory cells, Evolino searches in a was 0.011 for the double sine and 0.2 for the triple sine. The weight space that is already biased toward extracting, retain- average error over the test set was 0.044 and 1.58, respec- ing, and relating discrete events that may be very far apart in tively. These error levels are barely visible out to time-step time. And, by borrowing the idea of linear regression from 600. Figure 4 shows the behavior of one of the triple sine ESNs, Evolino is capable of making very precise predictions wave Evolino networks out to time-step 1300. The magni- in tasks like the Mackey-Glass benchmark. fied inset illustrates how even beyond 3 times the length of Apart from its versatility, another advantage of Evolino the training set, the network still makes very accurate predic- over ESNs is that it produces more parsimonious solutions. tions. ESNs have large pools of neurons that are more likely to over- fit the data. Evolino networks can be made much smaller and, 3.3 Mackey-Glass Time-Series Prediction therefore, potentially more general, less susceptible to noise, The Mackey-Glass system (MGS; [Mackey and Glass, 1977]) and more easily comprehensible by, for instance, RNN rule is a standard benchmark for chaotic time series prediction. extraction techniques. The system produces an irregular time series that is produced Evolino is a template that can be instantiated by plugging by the following differential equation: y (t) = y (t - )/(1 + in (1) alternative analytical methods for computing optimal y (t - ) ) - y (t), where the parameters are usually set to linear mappings to the outputs, given the hidden state, (2) dif- = 0.2, = 10, = 0.1. The system is chaotic whenever ferent neuroevolution algorithms, and (3) various recurrent the delay > 16.8. We use the most common value for the network architectures. In particular, our implementation used delay = 17. Although the MGS can be modeled very accurately us- 1 The normalized root mean square error (NRMSE84 ) 84 steps ing feedforward networks with a time-window on the input, after the end of the training sequence, is the standard comparison we compare Evolino to ESNs (currently the best method for measure used for this problem. 1.4 Mackey-Glass Evolino Mackey-Glass Evolino 1.2 1 0.8 0.6 0.4 0 600 1000 1400 200 400 800 1200 Figure 5: Performance of Evolino on the Mackey-Glass time-series. The plot shows both the Mackey-Glass system and the prediction made by a typical Evolino-based LSTM network evolved for 50 generations. The obvious difference between the system and the prediction during the first 100 steps is due to the washout time. The inset shows a magnification more clearly showing the deviation between the two curves. [Hochreiter et al., 2001] S. Hochreiter, Y. Bengio, P. Fras- mean squared error and linear regression, but we could as well use the maximum margin optimality criterion [Vapnik, 1995] coni, and J. Schmidhuber. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In S. C. and use quadratic programming to find optimal linear map- Kremer and J. F. Kolen, editors, A Field Guide to Dynam- pings from hidden state to sequence classifications, obtaining ical Recurrent Neural Networks. IEEE Press, 2001. a hitherto unknown species of sequential support vector ma- [Jaeger, 2004a] H. Jaeger. Harnessing nonlinearity: Predict- chines. ing chaotic systems and saving energy in wireless commu- We could also use neuroevolution methods that evolve net- nication. Science, 304:78­80, 2004. work topology as well, so that network complexity is also [Jaeger, 2004b] H. Jaeger. http://www.faculty.iu-bremen.de/ determined through genetic search. Other RNNs, such as hjaeger/courses/seminarspring04/esnstandardslides.pdf, higher-order networks could be used instead of LSTM. Gen- 2004. eralizations to nonlinear readout mechanisms (e.g., nonlinear [Mackey and Glass, 1977] M. C. Mackey and L. Glass. Os- neural networks) with gradient-based search are obvious. We cillation and chaos in physiological control systems. Sci- may also start training LSTM by Evolino, then fine-tune by ence, 197:287­289, 1977. traditional pure gradient search. [Maillard and Gueriot, 1997] E. P. Maillard and D. Gueriot. Future work will further explore this space of possible im- RBF neural network, basis functions and genetic algo- plementations to provide potentially even more powerful pre- rithms. In IEEE International Conference on Neural Net- dictors, classifiers, and sequence generators. works, pages 2187­2190, Piscataway, NJ, 1997. IEEE. 5 Conclusion [Moriarty and Miikkulainen, 1996] D. E. Moriarty and R. Miikkulainen. Efficient reinforcement learning through We introduced EVOlution of recurrent systems with LIN- symbiotic evolution. Machine Learning, 22:11­32, 1996. ear outputs (Evolino), a general framework that combines [Penrose, 1955] R. Penrose. A generalized inverse for matri- evolution of recurrent neural networks and analytical linear ces. In Proceedings of the Cambridge Philosophy Society, methods to solve sequence learning tasks. The implemen- volume 51, pages 406­413, 1955. tation of Evolino in this paper combined the pseudo-inverse [Robinson and Fallside, 1987] A. J. Robinson and F. Fall- and Enforced Subpopulations algorithms to search a space of side. The utility driven dynamic error propagation net- Long-Short Term Memory networks. This yielded a versatile work. Technical Report CUED/F-INFENG/TR.1, Cam- method that can solve both tasks that require long-term mem- bridge University Engineering Department, 1987. ory of discrete events such as context-sensitive languages, and [Siegelmann and Sontag, 1991] H. T. Siegelmann and E. D. continuous time-series such as the Mackey-Glass benchmark Sontag. Turing computability with neural nets. Applied and multiple superimposed sine waves. Mathematics Letters, 4(6):77­80, 1991. Acknowledgments [Vapnik, 1995] V. Vapnik. The Nature of Statistical Learning This research was partially funded by CSEM Alpnach and the Theory. Springer, New York, 1995. EU MindRaces project, FP6 511931. [Werbos, 1990] P. Werbos. Backpropagation through time: References what does it do and how to do it. In Proceedings of IEEE, volume 78, pages 1550­1560, 1990. [Gers and Schmidhuber, 2001] F. A. Gers and J. Schmidhu- [Wierstra et al., 2005] Daan Wierstra, Juergen Schmidhuber, ber. LSTM recurrent networks learn simple context free and Faustino Gomez. Modeling non-linear dynamical and context sensitive languages. IEEE Transactions on systems with evolino. To appear in Proceedings of the Neural Networks, 12(6):1333­1340, 2001. Genetic Evolutionary Computation Conference (GECCO- [Gomez and Miikkulainen, 1999] Faustino Gomez and Risto 05). Springer-Verlag, Berlin; New York, 2005. Miikkulainen. Solving non-Markovian control tasks with [Williams and Zipser, 1989] R. J. Williams and D. Zipser. A neuroevolution. In Proceedings of the 16th International learning algorithm for continually running fully recurrent Joint Conference on Artificial Intelligence, Denver, CO, networks. Neural Computation, 1(2):270­280, 1989. 1999. Morgan Kaufmann. [Yao, 1999] Xin Yao. Evolving artificial neural networks. [Hochreiter and Schmidhuber, 1997] S. Hochreiter and Proceedings of the IEEE, 87(9):1423­1447, 1999. J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735­1780, 1997.