## What is an Artificial Neuron?

Source - Wikimedia Commons

## Feed-Forward Network

For each unit: $y = \text{tanh}\big(Wx + b \big)$
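A minimal NumPy sketch of this unit (the sizes, 4 inputs and 3 units, are illustrative assumptions, not from the slides):

```python
import numpy as np

# A single feed-forward layer: y = tanh(W x + b)
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))   # weight matrix (3 units, 4 inputs)
b = rng.standard_normal(3)        # bias vector
x = rng.standard_normal(4)        # input vector

y = np.tanh(W @ x + b)            # unit activations, each in (-1, 1)
```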

## Recurrent Network

For each unit: $y_t = \text{tanh}\big(Ux_t + Vh_{t-1} + b \big)$

simplifying…

## Recurrent Network

simplifying and rotating…

## “State” in Recurrent Networks

Recurrent Networks are all about storing a “state” in between computations.

A “lossy summary of… past sequences”

h is the “hidden state” of our RNN

What influences h?

### Defining the RNN State

We can define a simplified RNN represented by this diagram as follows:

$h_t = \text{tanh}\big(Ux_t + Vh_{t-1} + b \big)$

$\hat{y}_t = \text{softmax}(c + Wh_t)$

## Unfolding an RNN in Time

• By unfolding the RNN we can compute $\hat{y}$ for a given length of sequence.
• Note that the weight matrices $U$, $V$, $W$ are the same for each timestep; this is the big advantage of RNNs!

## Forward Propagation

We can now use the following equations to compute $\hat{y}_3$, by computing $h$ for the previous steps:

$h_t = \text{tanh}\big(Ux_t + Vh_{t-1} + b \big)$

$\hat{y}_t = \text{softmax}(c + Wh_t)$
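A NumPy sketch of this forward pass over a three-step sequence (all dimensions are illustrative assumptions):

```python
import numpy as np

# Forward propagation through a simple RNN for 3 timesteps, following
# h_t = tanh(U x_t + V h_{t-1} + b) and y_t = softmax(c + W h_t).
# Illustrative sizes: 5 input features, 4 hidden units, 3 output classes.
rng = np.random.default_rng(1)
U = rng.standard_normal((4, 5))   # input-to-hidden weights
V = rng.standard_normal((4, 4))   # hidden-to-hidden (recurrent) weights
W = rng.standard_normal((3, 4))   # hidden-to-output weights
b = rng.standard_normal(4)
c = rng.standard_normal(3)

def softmax(z):
    e = np.exp(z - z.max())       # subtract max for numerical stability
    return e / e.sum()

xs = rng.standard_normal((3, 5))  # a length-3 input sequence
h = np.zeros(4)                   # h_0: initial hidden state
for x_t in xs:
    h = np.tanh(U @ x_t + V @ h + b)
y_hat_3 = softmax(c + W @ h)      # prediction after the third step
```

Note that the same U, V, W are reused at every timestep, as the previous slide emphasised.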

### Y-hat is Softmax’d

$\hat{y}$ is a probability distribution!

$\sigma(\mathbf{z})_j = \frac{e^{z_j}}{\sum_{k=1}^K e^{z_k}} \text{ for } j = 1,\ldots, K$
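A quick worked example of the formula above (NumPy, illustrative scores):

```python
import numpy as np

# Softmax turns arbitrary scores into a probability distribution:
# each entry is positive and the entries sum to 1.
def softmax(z):
    e = np.exp(z - np.max(z))     # shift by max for numerical stability
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
# Larger scores get larger probabilities, and p.sum() == 1.
```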

## Calculating Loss: Categorical Cross Entropy

We use the categorical cross-entropy function for loss:

\begin{align*}
h_t &= \text{tanh}\big( {b} + {Vh}_{t-1} + {Ux}_t \big) \\
\hat{y}_t &= \text{softmax}(c + Wh_t) \\
L_t &= -y_t \cdot \text{log}(\hat{y}_t) \\
\text{Loss} &= \sum_t L_t
\end{align*}
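A worked example of the per-timestep loss $L_t$ with made-up numbers (one-hot target, three classes):

```python
import numpy as np

# Categorical cross-entropy for one timestep: L_t = -y_t . log(y_hat_t),
# where y_t is a one-hot target and y_hat_t is the softmax output.
y_true = np.array([0.0, 1.0, 0.0])     # one-hot: class 1 is correct
y_hat = np.array([0.2, 0.7, 0.1])      # model's predicted distribution

L_t = -np.sum(y_true * np.log(y_hat))  # only the correct class contributes
# Here L_t = -log(0.7); the total loss is the sum of L_t over all timesteps.
```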

## Backpropagation Through Time (BPTT)

Propagates error correction backwards through the network graph, adjusting all parameters (U, V, W) to minimise loss.

## Example: Character-level text model

• Training data: a collection of text.
• Input (X): snippets of 30 characters from the collection.
• Target output (y): 1 character, the next one after the 30 in each X.

## Training the Character-level Model

• Target: a one-hot probability distribution, with $P = 1$ on the correct next character.
• Output: a probability distribution over all possible next characters.
• E.g.: “My cat is named Simon” would lead to X: “My cat is named Simo” and y: “n”
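A sketch of how such (X, y) pairs could be built from a corpus (the helper name and the window-size default are illustrative):

```python
# Build (X, y) training pairs from a text corpus: each X is a
# `window`-character snippet and y is the single character that follows it.
def make_pairs(text, window=30):
    pairs = []
    for i in range(len(text) - window):
        pairs.append((text[i:i + window], text[i + window]))
    return pairs

pairs = make_pairs("My cat is named Simon and he sleeps all day.")
# Each X has 30 characters; each y is the 1 character that follows.
```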

## Using the trained model to generate text

• S: sampling function; draw a letter from the output probability distribution.
• The generated letter is fed back in as the next input.
• We don’t want to always draw the most likely character: that would give frequent repetition and “copying” from the training text. We need a sampling strategy.
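One common strategy is temperature sampling, sketched below (the function and parameter names are illustrative). Low temperature sharpens the distribution towards the most likely character; high temperature flattens it, giving more varied output.

```python
import numpy as np

# Temperature sampling: reshape the output distribution before drawing.
def sample(probs, temperature=1.0, rng=None):
    rng = rng or np.random.default_rng()
    logits = np.log(probs) / temperature   # scale log-probabilities
    p = np.exp(logits - logits.max())      # re-apply softmax (stable form)
    p /= p.sum()
    return rng.choice(len(p), p=p)         # draw an index from the new dist.

probs = np.array([0.5, 0.3, 0.2])
idx = sample(probs, temperature=0.5)       # 0.5: sharper than the raw dist.
```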

## Char-RNN

• RNN as a sequence generator
• Input is current symbol, output is next predicted symbol.
• Connect output to input and continue!
• CharRNN simply applies this to (a subset of) the ASCII characters.
• Train and generate on any text corpus: Fun!

## Char-RNN Examples

Shakespeare (Karpathy, 2015):

Second Senator: They are away this miseries, produced upon my soul, Breaking and strongly should be buried, when I perish The earth and thoughts of many states.

DUKE VINCENTIO: Well, your wit is in the care of side and that.

LaTeX algebraic geometry:

N.B. “Proof. Omitted.” Lol.

## Bidirectional RNNs

• Useful for tasks where the whole sequence is available.
• Each output unit ($\hat{y}$) depends on both past and future, but is most sensitive to nearby timesteps.
• Popular in speech recognition, translation etc.

## Encoder-Decoder (seq-to-seq)

Learns to generate output sequence (y) from an input sequence (x).

The final hidden state of the encoder is used to compute a context variable $C$.

For example, translation.

## Deep RNNs

• Does adding deeper layers to an RNN make it work better?
• Several options for architecture.
• Simply stacking RNN layers is very popular; shown to work better by Graves et al. (2013)
• Intuitively: layers might learn some hierarchical knowledge automatically.
• Typical setup: up to three recurrent layers.

## Long-Term Dependencies

• Learning long dependencies is a mathematical challenge.
• Basically: gradients propagated through the same weights tend to vanish (mostly) or explode (rarely)
• E.g., consider a simplified RNN with no nonlinear activation function or input.
• Each time step multiplies $h_0$ by $W$.
• This corresponds to raising the eigenvalues in $\Lambda$ to the power $t$.
• Eventually, components of $h_0$ not aligned with the eigenvector of the largest eigenvalue will be discarded.

\begin{align*} h_t &= Wh_{t-1}\\ h_t &= (W^t)h_0 \end{align*}

(supposing W admits eigendecomposition with orthogonal matrix Q)

\begin{align*} W &= Q\Lambda Q^{\top}\\ h_t &= Q\Lambda ^t Q^{\top}h_0 \end{align*}
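A small NumPy demonstration of this collapse (the eigenvalues 1.1, 0.5, 0.2 and the 50-step horizon are illustrative):

```python
import numpy as np

# With h_t = W h_{t-1} and W = Q Lam Q^T, each eigen-component of h_0 is
# scaled by its eigenvalue raised to the power t, so h_t collapses onto
# the direction of the largest eigenvalue.
Q, _ = np.linalg.qr(np.random.default_rng(2).standard_normal((3, 3)))
Lam = np.diag([1.1, 0.5, 0.2])     # one eigenvalue > 1, two < 1
W = Q @ Lam @ Q.T

h = Q.sum(axis=1)                  # h_0: unit component along each eigenvector
for _ in range(50):
    h = W @ h                      # h_50 = W^50 h_0

# After 50 steps h is (nearly) parallel to the eigenvector for 1.1:
# the 0.5 and 0.2 components have shrunk by factors of 0.5^50 and 0.2^50.
direction = h / np.linalg.norm(h)
```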

• “in order to store memories in a way that is robust to small perturbations, the RNN must enter a region of parameter space where gradients vanish”
• “whenever the model is able to represent long term dependencies, the gradient of a long term interaction has exponentially smaller magnitude than the gradient of a short term interaction.”

## Gated RNNs

• Possible solution!
• Provide gates that can change the hidden state a little bit at each step.
• The gates are controlled by learnable weights as well!
• This effectively gives hidden-state weights that can change at each time step.
• Create paths through time with derivatives that do not vanish/explode.
• Gates choose information to accumulate or forget at each time step.
• Most effective sequence models used in practice!

## Long Short-Term Memory

• Self-loop containing an internal state (c).
• Three extra gating units:
• Forget gate: controls how much memory is preserved.
• Input gate: controls how much of the current input is stored.
• Output gate: controls how much of the state is shown to the output.
• Each gate has own weights and biases, so this uses lots more parameters.
• Some variants on this design, e.g., use c as additional input to three gate units.

## Long Short-Term Memory

• Forget gate: f
• Internal state: s
• Input gate: g
• Output gate: q
• Output: h
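Using the gate names above and the $U$ (input) / $V$ (recurrent) weight convention from the earlier slides, the standard LSTM update can be written as follows (superscripts distinguish each gate’s own weights; $\odot$ is elementwise multiplication):

\begin{align*}
f_t &= \sigma\big(U^f x_t + V^f h_{t-1} + b^f\big) \\
g_t &= \sigma\big(U^g x_t + V^g h_{t-1} + b^g\big) \\
q_t &= \sigma\big(U^q x_t + V^q h_{t-1} + b^q\big) \\
s_t &= f_t \odot s_{t-1} + g_t \odot \text{tanh}\big(U x_t + V h_{t-1} + b\big) \\
h_t &= q_t \odot \text{tanh}(s_t)
\end{align*}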

## Other Gating Units

• Are three gates necessary?
• Other gating units are simpler, e.g., Gated Recurrent Unit (GRU)
• For the moment, LSTMs are winning in practical use.
• Maybe someone wants to explore alternatives in a project?

## Visualising LSTM activations

Sometimes, the LSTM cell state corresponds to features of the sequential data:

Source: (Karpathy, 2015)

## CharRNN Applications: FolkRNN

Some kinds of music can be represented in a text-like manner.

## Other CharRNN Applications

• State-of-the-art in music-generating RNNs.
• Encode MIDI musical sequences as categorical data.
• Now supports polyphony (multiple notes), dynamics (volume), expressive timing (rubato).

## Neural iPad Band, another CharRNN

• iPad music transcribed as sequence of numbers for each performer.
• Trick: encode multiple ints as one (preserving ordering).
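A sketch of one such order-preserving encoding (treating the ints as digits in a fixed base; base 128 is an assumption, any base larger than the maximum value works):

```python
# Encode a small tuple of ints as a single int, and decode it back,
# preserving the ordering of the values.
def encode(values, base=128):
    n = 0
    for v in values:
        n = n * base + v           # shift previous digits up, append v
    return n

def decode(n, count, base=128):
    values = []
    for _ in range(count):
        n, v = divmod(n, base)     # peel off the lowest digit
        values.append(v)
    return list(reversed(values))
```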

## Summary

• Recurrent Neural Networks let us capture and model the structure of sequential data.
• Sampling from trained RNNs allows us to generate new, creative sequences.
• The internal state of RNNs makes them interesting for interactive applications, since it lets them capture and continue from the current context or “style”.
• LSTM units are able to overcome the vanishing gradient problem to some extent.