1 INTRODUCTION
Recurrent networks can in principle use their feedback connections to store representations of recent input events in form of activations ("short-term memory", as opposed to "long-term memory" embodied by slowly changing weights). This is potentially significant for many applications, including speech processing, non-Markovian control, and music composition (e.g., Mozer 1992). The most widely used algorithms for learning what to put in short-term memory, however, take too much time or do not work well at all, especially when minimal time lags between inputs and corresponding teacher signals are long. Although theoretically fascinating, existing methods do not provide clear practical advantages over, say, backprop in feedforward nets with limited time windows. This paper will review an analysis of the problem and suggest a remedy.

The problem. With conventional "Back-Propagation Through Time" (BPTT, e.g., Williams and Zipser 1992, Werbos 1988) or "Real-Time Recurrent Learning" (RTRL, e.g., Robinson and Fallside 1987), error signals "flowing backwards in time" tend to either (1) blow up or (2) vanish: the temporal evolution of the backpropagated error exponentially depends on the size of the weights (Hochreiter 1991). Case (1) may lead to oscillating weights, while in case (2) learning to bridge long time lags takes a prohibitive amount of time, or does not work at all (see Section 3). A small numerical sketch of this effect is given at the end of this introduction.

The remedy. This paper presents "Long Short-Term Memory" (LSTM), a novel recurrent network architecture in conjunction with an appropriate gradient-based learning algorithm. LSTM is designed to overcome these error back-flow problems. It can learn to bridge time intervals in excess of 1000 steps even in case of noisy, incompressible input sequences, without loss of short time lag capabilities. This is achieved by an efficient, gradient-based algorithm for an architecture enforcing constant (thus neither exploding nor vanishing) error flow through internal states of special units (provided the gradient computation is truncated at certain architecture-specific points; this does not affect long-term error flow, though).
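To make the scaling argument in "The problem" concrete, here is a small NumPy sketch (not part of the original paper; network size, weight scales, and sequence length are arbitrary illustrative choices). It runs a simple tanh recurrent net forward for 100 steps and then propagates an error vector backwards; the error norm typically shrinks towards zero for small weights and grows for large ones, mirroring cases (2) and (1) above.

```python
import numpy as np

def backprop_error_norms(weight_scale, T=100, n=10, seed=0):
    """Propagate an error vector backwards through T steps of a simple
    recurrent net h(t) = tanh(W h(t-1)) and record its norm at each step.

    Backward recursion: delta(t-1) = W^T (f'(net(t)) * delta(t)).
    Since |f'| <= 1, the norm is governed by the size of the weights:
    small weights tend to make it vanish, large weights to make it blow up.
    """
    rng = np.random.default_rng(seed)
    W = weight_scale * rng.standard_normal((n, n)) / np.sqrt(n)
    h = 0.1 * rng.standard_normal(n)          # some initial hidden state
    nets = []
    for _ in range(T):                        # forward pass, store pre-tanh nets
        net = W @ h
        nets.append(net)
        h = np.tanh(net)
    delta = rng.standard_normal(n)            # error injected at the last step
    norms = []
    for net in reversed(nets):                # backward pass through time
        delta = W.T @ ((1.0 - np.tanh(net) ** 2) * delta)
        norms.append(np.linalg.norm(delta))
    return norms

for scale in (0.5, 1.0, 2.0):
    norms = backprop_error_norms(scale)
    print(f"weight scale {scale}: |error| after 100 steps = {norms[-1]:.3e}")
```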
Outline of paper. Section 2 will briefly review previous work. Section 3 begins with an outline of the detailed analysis of vanishing errors due to Hochreiter (1991). It will then introduce a naive approach to constant error backprop for didactic purposes, and highlight its problems concerning information storage and retrieval. These problems will lead to the LSTM architecture as described in Section 4. Section 5 will present numerous experiments and comparisons with competing methods. LSTM outperforms them, and also learns to solve complex, artificial tasks no other recurrent net algorithm has solved. Section 6 will discuss LSTM's limitations and advantages. The appendix contains a detailed description of the algorithm (A.1), and explicit error flow formulae (A.2).

2 PREVIOUS WORK

This section will focus on recurrent nets with time-varying inputs (as opposed to nets with stationary inputs and fixpoint-based gradient calculations, e.g., Almeida 1987, Pineda 1987).

Gradient-descent variants. The approaches of Elman (1988), Fahlman (1991), Williams (1989), Schmidhuber (1992a), Pearlmutter (1989), and many of the related algorithms in Pearlmutter's comprehensive overview (1995) suffer from the same problems as BPTT and RTRL (see Sections 1 and 3).

Time-delays. Other methods that seem practical for short time lags only are Time-Delay Neural Networks (Lang et al. 1990) and Plate's method (Plate 1993), which updates unit activations based on a weighted sum of old activations (see also de Vries and Principe 1991). Lin et al. (1995) propose variants of time-delay networks called NARX networks.

Time constants. To deal with long time lags, Mozer (1992) uses time constants influencing changes of unit activations (de Vries and Principe's above-mentioned approach (1991) may in fact be viewed as a mixture of TDNN and time constants). For long time lags, however, the time constants need external fine tuning (Mozer 1992). Sun et al.'s alternative approach (1993) updates the activation of a recurrent unit by adding the old activation and the (scaled) current net input. The net input, however, tends to perturb the stored information, which makes long-term storage impractical (a small sketch further below illustrates this perturbation effect).

Ring's approach. Ring (1993) also proposed a method for bridging long time lags. Whenever a unit in his network receives conflicting error signals, he adds a higher order unit influencing appropriate connections. Although his approach can sometimes be extremely fast, to bridge a time lag involving 100 steps may require the addition of 100 units. Also, Ring's net does not generalize to unseen lag durations.

Bengio et al.'s approaches. Bengio et al. (1994) investigate methods such as simulated annealing, multi-grid random search, time-weighted pseudo-Newton optimization, and discrete error propagation. Their "latch" and "2-sequence" problems are very similar to problem 3a with minimal time lag 100 (see Experiment 3). Bengio and Frasconi (1994) also propose an EM approach for propagating targets. With n so-called "state networks", at a given time, their system can be in one of only n different states. See also beginning of Section 5. But to solve continuous problems such as the "adding problem" (Section 5.4), their system would require an unacceptable number of states (i.e., state networks).
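The perturbation problem of purely additive state updates mentioned under "Time constants" above can be illustrated with the following sketch (my own toy example, not code from Sun et al. 1993 or any other cited work): a value stored in a scalar state at step 0 is gradually drowned by later, irrelevant net inputs that keep being added to it.

```python
import numpy as np

def additive_state(inputs, scale=0.1):
    """State update s(t) = s(t-1) + scale * net(t): the old activation plus
    the scaled current net input, as in the additive approaches above."""
    s = 0.0
    trajectory = []
    for net in inputs:
        s = s + scale * net
        trajectory.append(s)
    return trajectory

rng = np.random.default_rng(1)
T = 1000
inputs = np.zeros(T)
inputs[0] = 10.0                            # the value we would like to remember
inputs[1:] = rng.standard_normal(T - 1)     # irrelevant later inputs

traj = additive_state(inputs)
print("state right after storing  :", round(traj[0], 2))
print("state after 1000 noisy steps:", round(traj[-1], 2))
print("accumulated drift           :", round(traj[-1] - traj[0], 2))
```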
Kalman filters. Puskorius and Feldkamp (1994) use Kalman filter techniques to improve recurrent net performance. Since they use "a derivative discount factor imposed to decay exponentially the effects of past dynamic derivatives," there is no reason to believe that their Kalman Filter Trained Recurrent Networks will be useful for very long minimal time lags.

Second order nets. We will see that LSTM uses multiplicative units (MUs) to protect error flow from unwanted perturbations. It is not the first recurrent net method using MUs though. For instance, Watrous and Kuhn (1992) use MUs in second order nets. Some differences to LSTM are: (1) Watrous and Kuhn's architecture does not enforce constant error flow and is not designed to solve long time lag problems.
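To show what multiplicative units buy, here is a minimal input-gate-only sketch (an illustration under simplifying assumptions; neither the Watrous and Kuhn architecture nor the full LSTM cell of Section 4): a sigmoid gate multiplies the squashed input, so a closed gate leaves the stored state untouched while an open gate lets new information in.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_cell(inputs, gate_nets, s0=0.0):
    """Minimal multiplicatively gated memory: the internal state is only
    changed by the input when the sigmoid gate is open.

        s(t) = s(t-1) + sigmoid(gate_net(t)) * tanh(input(t))

    With the gate closed (large negative gate_net), ds(t)/ds(t-1) = 1,
    so error flowing back through the state is neither scaled up nor down.
    """
    s = s0
    for x, gn in zip(inputs, gate_nets):
        s = s + sigmoid(gn) * np.tanh(x)   # tanh squashes the candidate input
    return s

inputs = np.array([2.0, -1.3, 0.7, 0.4])
open_then_closed = np.array([6.0, -6.0, -6.0, -6.0])   # store once, then protect
always_open = np.array([6.0, 6.0, 6.0, 6.0])           # every input perturbs the state

print("protected state  :", round(gated_cell(inputs, open_then_closed), 3))
print("unprotected state:", round(gated_cell(inputs, always_open), 3))
```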
l = out_j (output gates):
\[
e_{out_j}(t) = f'_{out_j}\big(net_{out_j}(t)\big)
\left( \sum_{v=1}^{S_j} h\big(s_{c_j^v}(t)\big)
\sum_{k:\; k \text{ output unit}} w_{k c_j^v}\, e_k(t) \right) . \tag{21}
\]
For all possible l, time t's contribution to w_{lm}'s update is
\[
\Delta w_{lm}(t) = \alpha\, e_l(t)\, y^m(t-1) .
\]
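As a rough reading of these formulae, the following sketch (hypothetical array names and shapes; the squashing function h is taken as tanh here, whereas the paper uses a scaled sigmoid) computes the output gate error of one memory cell block from the output unit errors, and one weight's contribution to the update at time t.

```python
import numpy as np

def sigmoid_prime(net):
    # derivative of the logistic gate activation f_{out_j}
    s = 1.0 / (1.0 + np.exp(-net))
    return s * (1.0 - s)

def h(s):
    # stand-in for the cell output squashing function h(.)
    return np.tanh(s)

def output_gate_error(net_out_j, s_cells, W_out, e_out_units):
    """Error of output gate j, following eq. (21) as reconstructed above.

    net_out_j   : scalar net input of the output gate
    s_cells     : internal states s_{c_j^v}(t), shape (S_j,)
    W_out       : weights w_{k c_j^v} from the block's cells to output units,
                  shape (num_output_units, S_j)
    e_out_units : errors e_k(t) of the output units, shape (num_output_units,)
    """
    inner = np.sum(h(s_cells) * (W_out.T @ e_out_units))
    return sigmoid_prime(net_out_j) * inner

def weight_update_contribution(alpha, e_l_t, y_m_prev):
    """Contribution of time t to the update of w_{lm}:
    Delta w_{lm}(t) = alpha * e_l(t) * y^m(t-1)."""
    return alpha * e_l_t * y_m_prev

# toy numbers, purely illustrative
e_out = output_gate_error(net_out_j=0.3,
                          s_cells=np.array([1.2, -0.4]),
                          W_out=np.array([[0.5, -0.1], [0.2, 0.7]]),
                          e_out_units=np.array([0.05, -0.02]))
print("output gate error         :", e_out)
print("weight update contribution:", weight_update_contribution(0.1, e_out, 0.8))
```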