Deep Dive on RNNs

Charles Martin

What is an Artificial Neurone?

Source - Wikimedia Commons

Feed-Forward Network

For each unit: \(y = \text{tanh}\big(Wx + b \big)\)

Recurrent Network

For each unit: \(y_t = \text{tanh}\big(Ux_t + Vh_{t-1} + b \big)\)

Sequence Learning Tasks

Recurrent Network

simplifying and rotating…

“State” in Recurrent Networks

  • Recurrent Networks are all about storing a “state” in between computations.
  • A “lossy summary of… past sequences”
  • h is the “hidden state” of our RNN
  • What influences h?

Defining the RNN State

We can define a simplified RNN represented by this diagram as follows:

\[h_t = \text{tanh}\big(Ux_t + Vh_{t-1} + b \big)\]

\[\hat{y}_t = \text{softmax}(c + Wh_t)\]

Unfolding an RNN in Time

  • By unfolding the RNN we can compute \(\hat{y}\) for a given length of sequence.
  • Note that the weight matrices \(U\), \(V\), \(W\) are the same for each timestep; this is the big advantage of RNNs!

Forward Propagation

We can now use the following equations to compute \(\hat{y}_3\), by computing \(h\) for the previous steps:

\[h_t = \text{tanh}\big(Ux_t + Vh_{t-1} + b \big)\]

\[\hat{y}_t = \text{softmax}(c + Wh_t)\]
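The two equations above can be sketched in NumPy as a loop over time steps. The layer sizes, random weights and inputs below are assumptions chosen only for illustration.

```python
import numpy as np

def softmax(z):
    # Subtracting the max avoids overflow and does not change the result.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def rnn_forward(xs, h0, U, V, W, b, c):
    """Run the RNN over the sequence xs, returning every h_t and y_hat_t."""
    h = h0
    hs, y_hats = [], []
    for x in xs:
        h = np.tanh(U @ x + V @ h + b)       # h_t = tanh(U x_t + V h_{t-1} + b)
        y_hats.append(softmax(c + W @ h))    # y_hat_t = softmax(c + W h_t)
        hs.append(h)
    return hs, y_hats

# Illustrative sizes (assumptions): 4 input features, 5 hidden units, 3 classes.
rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 4, 5, 3
U = rng.normal(scale=0.5, size=(n_hidden, n_in))
V = rng.normal(scale=0.5, size=(n_hidden, n_hidden))
W = rng.normal(scale=0.5, size=(n_out, n_hidden))
b = np.zeros(n_hidden)
c = np.zeros(n_out)

xs = [rng.normal(size=n_in) for _ in range(3)]   # x_1, x_2, x_3
hs, y_hats = rnn_forward(xs, np.zeros(n_hidden), U, V, W, b, c)
y_hat_3 = y_hats[-1]                             # depends on h_1, h_2 and h_3
```

Note that the same U, V and W are reused on every pass through the loop.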

Y-hat is Softmax’d

\(\hat{y}\) is a probability distribution! A finite number of non-negative weights that add to 1:

\[\sigma(\mathbf{z})_j = \frac{e^{z_j}}{\sum_{k=1}^K e^{z_k}} \text{ for } j = 1,\ldots, K\]
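A minimal sketch of the softmax formula in plain Python. Shifting by the maximum score is a common numerical-stability trick that is not on the slide; it leaves the result unchanged.

```python
import math

def softmax(z):
    """Map a list of K real scores to K probabilities that sum to 1."""
    m = max(z)                               # shift by the max to avoid overflow
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])             # larger scores get larger probabilities
```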

Calculating Loss: Categorical Cross Entropy

We use the categorical cross-entropy function for loss:

\[\begin{align*} h_t &= \text{tanh}\big( {b} + {Vh}_{t-1} + {Ux}_t \big) \\ \hat{y}_t &= \text{softmax}(c + Wh_t) \\ L_t &= -y_t \cdot \text{log}(\hat{y}_t) \\ \text{Loss} &= \sum_t L_t \\ \end{align*}\]
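A small sketch of the loss for one-hot targets, using made-up predictions (assumptions for illustration). With a one-hot \(y_t\), the dot product picks out the log-probability of the correct class.

```python
import math

def cross_entropy(y, y_hat):
    """L_t = -y_t . log(y_hat_t) for a one-hot target y and prediction y_hat."""
    # Terms with y = 0 contribute nothing, so they are skipped.
    return -sum(t * math.log(p) for t, p in zip(y, y_hat) if t > 0)

# Two time steps with illustrative targets and predictions.
targets = [[0, 1, 0], [1, 0, 0]]
predictions = [[0.2, 0.7, 0.1], [0.5, 0.3, 0.2]]

step_losses = [cross_entropy(y, y_hat) for y, y_hat in zip(targets, predictions)]
loss = sum(step_losses)                      # Loss = sum over t of L_t
```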

Backpropagation Through Time (BPTT)

Propagates the error gradient backwards through the unfolded network graph, adjusting all parameters (U, V, W and the biases b, c) to minimise the loss.
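A sketch of the gradients BPTT computes for the simplified RNN defined earlier, assuming one-hot targets \(y_t\) and the cross-entropy loss. Here \(o_t = c + Wh_t\) is the pre-softmax output and \(\tau\) is the final time step; both symbols are introduced for this sketch and are not on the slides. The recursion runs backwards from \(\tau\):

\[\begin{align*} \nabla_{o_t}\text{Loss} &= \hat{y}_t - y_t \\ \nabla_{h_\tau}\text{Loss} &= W^{\top}\nabla_{o_\tau}\text{Loss} \\ \nabla_{h_t}\text{Loss} &= V^{\top}\text{diag}\big(1 - h_{t+1}^2\big)\nabla_{h_{t+1}}\text{Loss} + W^{\top}\nabla_{o_t}\text{Loss} \\ \nabla_{W}\text{Loss} &= \sum_t \big(\nabla_{o_t}\text{Loss}\big)h_t^{\top} \\ \nabla_{V}\text{Loss} &= \sum_t \text{diag}\big(1 - h_t^2\big)\big(\nabla_{h_t}\text{Loss}\big)h_{t-1}^{\top} \\ \nabla_{U}\text{Loss} &= \sum_t \text{diag}\big(1 - h_t^2\big)\big(\nabla_{h_t}\text{Loss}\big)x_t^{\top} \end{align*}\]

Because the same weights are used at every time step, each weight gradient is a sum over time.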

Example: Character-level text model

  • Training data: a collection of text.
  • Input (X): snippets of 30 characters from the collection.
  • Target output (y): 1 character, the next one after the 30 in each X.

Training the Character-level Model

  • Target: A one-hot probability distribution, e.g. \(P(\text{n}) = 1\) when the next character is “n”.
  • Output: A probability distribution over all next letters.
  • E.g.: “My cat is named Simon” would lead to X: “My cat is named Simo” and y: “n”
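The slicing of text into (X, y) pairs can be sketched as below. The function name is hypothetical. The snippet length defaults to 30 as on the previous slide, and is set to 20 here so that the example sentence yields exactly the pair shown above.

```python
def make_training_pairs(text, seq_len=30):
    """Slice text into (X, y) pairs: seq_len characters and the one that follows."""
    pairs = []
    for i in range(len(text) - seq_len):
        snippet = text[i:i + seq_len]        # X: seq_len consecutive characters
        next_char = text[i + seq_len]        # y: the character right after X
        pairs.append((snippet, next_char))
    return pairs

pairs = make_training_pairs("My cat is named Simon", seq_len=20)
```

In practice each character would then be mapped to an integer index or a one-hot vector before being fed to the network.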

Using the trained model to generate text

  • S: Sampling function, sample a letter using the output probability distribution.
  • The generated letter is reinserted as the next input.
  • We don’t want to always draw the most likely character. That would give frequent repetition and “copying” from the training text. We need a sampling strategy.
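One common sampling strategy is temperature sampling. The slides do not name a specific strategy, so this choice is an assumption. The output distribution is sharpened (temperature below 1) or flattened (temperature above 1) before a character index is drawn.

```python
import math
import random

def sample_with_temperature(probs, temperature=1.0, rng=random):
    """Draw an index from probs after reshaping it with a temperature."""
    # Rescale the log-probabilities; the small constant guards against log(0).
    logits = [math.log(p + 1e-12) / temperature for p in probs]
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    # random.choices normalises the weights internally.
    return rng.choices(range(len(probs)), weights=weights, k=1)[0]

probs = [0.6, 0.3, 0.1]                      # illustrative model output
rng = random.Random(0)
draws = [sample_with_temperature(probs, temperature=1.0, rng=rng) for _ in range(1000)]
```

With a temperature of 1.0 the draws follow the model's distribution. As the temperature approaches 0 the sampler behaves like always picking the most likely character.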


  • RNN as a sequence generator
  • Input is current symbol, output is next predicted symbol.
  • Connect output to input and continue!
  • CharRNN simply applies this to (a subset of) ASCII characters.
  • Train and generate on any text corpus: Fun!
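The "connect output to input" loop can be sketched as follows. `toy_next_probs` is a hypothetical stand-in for a trained RNN, used only so the example runs: it strongly prefers the next letter of a three-letter alphabet.

```python
import random

def generate(next_probs, alphabet, seed_text, n_chars, rng):
    """Repeatedly sample a character and feed it back in as the next input."""
    text = seed_text
    for _ in range(n_chars):
        probs = next_probs(text)                          # distribution over alphabet
        nxt = rng.choices(alphabet, weights=probs, k=1)[0]
        text += nxt                                       # output becomes next input
    return text

alphabet = list("abc")

def toy_next_probs(text):
    # Stand-in model (assumption): 90% on the letter after the last one, cycling.
    i = alphabet.index(text[-1])
    probs = [0.05] * len(alphabet)
    probs[(i + 1) % len(alphabet)] = 0.9
    return probs

out = generate(toy_next_probs, alphabet, "a", 20, random.Random(0))
```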

Char-RNN Examples

Shakespeare (Karpathy, 2015):

Second Senator: They are away this miseries, produced upon my soul, Breaking and strongly should be buried, when I perish The earth and thoughts of many states.

DUKE VINCENTIO: Well, your wit is in the care of side and that.

Latex Algebraic Geometry:

N.B. “Proof. Omitted.” Lol.

RNN Architectures and LSTM

Bidirectional RNNs

  • Useful for tasks where the whole sequence is available.
  • Each output unit (\(\hat{y}\)) depends on both past and future inputs, but is most sensitive to nearby time steps.
  • Popular in speech recognition, translation etc.
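A minimal sketch of the bidirectional idea: one RNN reads the sequence forwards, a second reads it backwards, and the two hidden states for each time step are concatenated. The sizes and random weights are assumptions, and the output layer is omitted.

```python
import numpy as np

def run_rnn(xs, U, V, b):
    """Return the hidden state after each input in xs."""
    h = np.zeros(V.shape[0])
    hs = []
    for x in xs:
        h = np.tanh(U @ x + V @ h + b)
        hs.append(h)
    return hs

def bidirectional_states(xs, fwd, bwd):
    """fwd and bwd are (U, V, b) tuples for the two directions."""
    h_fwd = run_rnn(xs, *fwd)                 # reads x_1 ... x_T
    h_bwd = run_rnn(xs[::-1], *bwd)[::-1]     # reads x_T ... x_1, then realigned
    return [np.concatenate([f, g]) for f, g in zip(h_fwd, h_bwd)]

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 5

def make_params():
    return (rng.normal(scale=0.5, size=(n_hidden, n_in)),
            rng.normal(scale=0.5, size=(n_hidden, n_hidden)),
            np.zeros(n_hidden))

fwd, bwd = make_params(), make_params()
xs = [rng.normal(size=n_in) for _ in range(4)]
states = bidirectional_states(xs, fwd, bwd)
```

The first half of each state summarises the past and the second half the future, so the whole sequence must be available before any output is computed.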

Encoder-Decoder (seq-to-seq)

  • Learns to generate output sequence (y) from an input sequence (x).
  • Final hidden state of encoder is used to compute a context variable C.
  • For example, translation.
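A deliberately simplified encoder-decoder sketch. It assumes the context C is just the encoder's final hidden state, and that the decoder starts from C and does not receive its own previous outputs. Real seq-to-seq models are usually richer than this.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def encode(xs, U, V, b):
    """Read the whole input sequence; the final hidden state serves as context C."""
    h = np.zeros(V.shape[0])
    for x in xs:
        h = np.tanh(U @ x + V @ h + b)
    return h

def decode(C, n_steps, V_dec, W, b_dec, c):
    """Start from C and emit n_steps output distributions."""
    h = C
    y_hats = []
    for _ in range(n_steps):
        h = np.tanh(V_dec @ h + b_dec)
        y_hats.append(softmax(c + W @ h))
    return y_hats

# Illustrative sizes and random weights (assumptions).
rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 4, 6, 3
U = rng.normal(scale=0.5, size=(n_hidden, n_in))
V = rng.normal(scale=0.5, size=(n_hidden, n_hidden))
V_dec = rng.normal(scale=0.5, size=(n_hidden, n_hidden))
W = rng.normal(scale=0.5, size=(n_out, n_hidden))

xs = [rng.normal(size=n_in) for _ in range(5)]            # input of length 5
C = encode(xs, U, V, np.zeros(n_hidden))
y_hats = decode(C, 3, V_dec, W, np.zeros(n_hidden), np.zeros(n_out))  # output of length 3
```

The input and output lengths are independent, which is what makes this arrangement suitable for tasks such as translation.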

Deep RNNs

  • Does adding deeper layers to an RNN make it work better?
  • Several options for architecture.
  • Simply stacking RNN layers is very popular; shown to work better by Graves et al. (2013)
  • Intuitively: layers might learn some hierarchical knowledge automatically.
  • Typical setup: up to three recurrent layers.
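Stacking can be sketched as feeding the hidden-state sequence of one recurrent layer into the next as its input sequence. The three layers and their sizes below are assumptions for illustration.

```python
import numpy as np

def stacked_rnn(xs, layers):
    """layers is a list of (U, V, b) tuples, applied bottom to top."""
    seq = xs
    for U, V, b in layers:
        h = np.zeros(V.shape[0])
        out = []
        for x in seq:
            h = np.tanh(U @ x + V @ h + b)
            out.append(h)
        seq = out                     # this layer's states feed the next layer
    return seq

rng = np.random.default_rng(0)
n_in, n_hidden = 4, 5
sizes = [n_in, n_hidden, n_hidden, n_hidden]      # three recurrent layers
layers = [(rng.normal(scale=0.5, size=(sizes[i + 1], sizes[i])),
           rng.normal(scale=0.5, size=(sizes[i + 1], sizes[i + 1])),
           np.zeros(sizes[i + 1]))
          for i in range(3)]

xs = [rng.normal(size=n_in) for _ in range(6)]
top_states = stacked_rnn(xs, layers)
```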

Long-Term Dependencies

  • Learning long dependencies is a mathematical challenge.
  • Basically: gradients propagated through the same weights tend to vanish (mostly) or explode (rarely)
  • E.g., consider a simplified RNN with no nonlinear activation function or input.
  • Each time step multiplies \(h_0\) by W again.
  • This corresponds to raising the eigenvalues in \(\Lambda\) to the power \(t\).
  • Eventually, components of \(h_0\) not aligned with the eigenvector of the largest eigenvalue will be discarded.

\[\begin{align*} h_t &= Wh_{t-1}\\ h_t &= (W^t)h_0 \end{align*}\]

(supposing W admits an eigendecomposition with orthogonal matrix Q)

\[\begin{align*} W &= Q\Lambda Q^{\top}\\ h_t &= Q\Lambda ^t Q^{\top}h_0 \end{align*}\]
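A numerical check of this argument, using a W built from a random orthogonal Q and hand-picked eigenvalues (assumptions chosen so that one eigenvalue exceeds 1 and the rest are below 1).

```python
import numpy as np

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))    # random orthogonal matrix
lam = np.array([1.1, 0.9, 0.5, 0.1])            # chosen eigenvalues
W = Q @ np.diag(lam) @ Q.T                      # W = Q Lambda Q^T

h0 = rng.normal(size=4)
t = 100

h_power = np.linalg.matrix_power(W, t) @ h0     # h_t = W^t h_0
h_eig = Q @ np.diag(lam ** t) @ Q.T @ h0        # h_t = Q Lambda^t Q^T h_0

# After many steps h_t points almost entirely along the dominant eigenvector.
direction = h_power / np.linalg.norm(h_power)
alignment = abs(direction @ Q[:, 0])
```

Components along eigenvalues below 1 shrink towards zero, while the component along 1.1 keeps growing. This is the vanish/explode behaviour in miniature.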

Vanishing and Exploding Gradients

  • “in order to store memories in a way that is robust to small perturbations, the RNN must enter a region of parameter space where gradients vanish”
  • “whenever the model is able to represent long term dependencies, the gradient of a long term interaction has exponentially smaller magnitude than the gradient of a short term interaction.”
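The shrinking gradient can be observed directly by accumulating the Jacobian \(\partial h_t / \partial h_0\) of a small tanh RNN over time. The size, weight scale and random inputs are assumptions. Small recurrent weights are used here, which is the regime where gradients vanish.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
U = rng.normal(scale=0.2, size=(n, n))
V = rng.normal(scale=0.2, size=(n, n))

h = np.zeros(n)
J = np.eye(n)                  # accumulates d h_t / d h_0
norms = []
for t in range(50):
    x = rng.normal(size=n)
    h = np.tanh(U @ x + V @ h)
    # Chain rule for one step: d h_t / d h_{t-1} = diag(1 - h_t^2) V
    J = np.diag(1 - h ** 2) @ V @ J
    norms.append(np.linalg.norm(J, 2))
```

The list `norms` records how strongly the state at step t still depends on the initial state. In this setup it decays roughly exponentially with t.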

Gated RNNs

  • Possible solution!
  • Provide a gate that can change the hidden state a little bit at each step.
  • The gates are controlled by learnable weights as well!
  • Hidden state weights that may change at each time step.
  • Create paths through time with derivatives that do not vanish/explode.
  • Gates choose information to accumulate or forget at each time step.
  • Most effective sequence models used in practice!

Long Short-Term Memory

  • Self-loop containing an internal state (c).
  • Three extra gating units:
    • Forget gate: controls how much memory is preserved.
    • Input gate: controls how much of the current input is stored.
    • Output gate: controls how much of the state is shown to the output.
  • Each gate has own weights and biases, so this uses lots more parameters.
  • Some variants on this design, e.g., use c as additional input to three gate units.

Long Short-Term Memory

  • Forget gate: f
  • Internal state: s (written c on the previous slide)
  • Input gate: g
  • Output gate: q
  • Output: h
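One step of an LSTM cell can be sketched with the symbols listed above. This uses the common formulation with a tanh candidate value, which is one of several variants. The parameter dictionary `p` and its key names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, s_prev, p):
    """One LSTM step. p maps each name to a (U, W, b) triple of weights."""
    def affine(name):
        U, W, b = p[name]
        return U @ x + W @ h_prev + b
    f = sigmoid(affine("f"))               # forget gate
    g = sigmoid(affine("g"))               # input gate
    q = sigmoid(affine("q"))               # output gate
    candidate = np.tanh(affine("cand"))    # proposed new content
    s = f * s_prev + g * candidate         # internal state (the self-loop)
    h = q * np.tanh(s)                     # output
    return h, s

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4

def make(bias):
    return (rng.normal(scale=0.1, size=(n_hidden, n_in)),
            rng.normal(scale=0.1, size=(n_hidden, n_hidden)),
            np.full(n_hidden, bias))

# Biases chosen so the forget gate is open and the input gate is shut.
p = {"f": make(20.0), "g": make(-20.0), "q": make(0.0), "cand": make(0.0)}
s_prev = rng.normal(size=n_hidden)
h, s = lstm_step(rng.normal(size=n_in), np.zeros(n_hidden), s_prev, p)
```

With these biases the internal state passes through the step almost unchanged, which is the kind of path through time that lets information and gradients survive.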

Other Gating Units

  • Are three gates necessary?
  • Other gating units are simpler, e.g., Gated Recurrent Unit (GRU)
  • For the moment, LSTMs are winning in practical use.
  • Maybe someone wants to explore alternatives in a project?
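For comparison, a sketch of one GRU step, which has two gates and no separate internal state. Conventions differ on whether the update gate z weights the old state or the candidate; this sketch lets z weight the candidate. The parameter dictionary `p` and its key names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, p):
    """One GRU step. p maps each name to a (U, W, b) triple of weights."""
    def affine(name, h):
        U, W, b = p[name]
        return U @ x + W @ h + b
    z = sigmoid(affine("z", h_prev))                  # update gate
    r = sigmoid(affine("r", h_prev))                  # reset gate
    candidate = np.tanh(affine("cand", r * h_prev))   # proposed new state
    return (1 - z) * h_prev + z * candidate           # blend old state and candidate

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4

def make(bias):
    return (rng.normal(scale=0.1, size=(n_hidden, n_in)),
            rng.normal(scale=0.1, size=(n_hidden, n_hidden)),
            np.full(n_hidden, bias))

# A strongly negative update-gate bias keeps the old state almost unchanged.
p = {"z": make(-20.0), "r": make(0.0), "cand": make(0.0)}
h_prev = rng.uniform(-0.9, 0.9, size=n_hidden)
h = gru_step(rng.normal(size=n_in), h_prev, p)
```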

Visualising LSTM activations

Sometimes, the LSTM cell state corresponds with features of the sequential data:

Source: (Karpathy, 2015)

CharRNN Applications: FolkRNN

Some kinds of music can be represented in a text-like manner.

Source: Sturm et al. 2015. Folk Music Style Modelling by Recurrent Neural Networks with Long Short Term Memory Units

Other CharRNN Applications

Google Magenta Performance RNN

  • State-of-the-art in music generating RNNs.
  • Encode MIDI musical sequences as categorical data.
  • Now supports polyphony (multiple notes), dynamics (volume), expressive timing (rubato).
  • E.g.: YouTube demo

Neural iPad Band, another CharRNN

  • iPad music transcribed as sequence of numbers for each performer.
  • Trick: encode multiple ints as one (preserving ordering).
  • Video
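One way to pack several integers into one while preserving ordering is a mixed-radix (positional) code. This is a generic illustration and not necessarily the scheme used in the Neural iPad Band. The function names and the shared `base` are assumptions.

```python
def encode(values, base):
    """Pack a list of ints, each in range(base), into a single int."""
    code = 0
    for v in values:
        if not 0 <= v < base:
            raise ValueError("each value must be in range(base)")
        code = code * base + v           # earlier values are more significant
    return code

def decode(code, n, base):
    """Recover the n original ints from a packed code."""
    values = []
    for _ in range(n):
        code, v = divmod(code, base)
        values.append(v)
    return values[::-1]

packed = encode([3, 0, 7], base=10)
unpacked = decode(packed, n=3, base=10)
```

Because earlier values are more significant, comparing two packed codes gives the same result as comparing the original lists element by element.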

Books and Learning References


  • Recurrent Neural Networks let us capture and model the structure of sequential data.
  • Sampling from trained RNNs allows us to generate new, creative sequences.
  • The internal state of RNNs makes them interesting for interactive applications, since it lets them capture and continue from the current context or “style”.
  • LSTM units are able to overcome the vanishing gradient problem to some extent.