Charles Martin

For each unit: \(y = \text{tanh}\big(Wx + b \big)\)

For each unit: \(y_t = \text{tanh}\big(Ux_t + Vh_{t-1} + b \big)\)

simplifying…

simplifying and rotating…

- Recurrent Networks are all about storing a “state” in between computations.
- A “lossy summary of… past sequences”
- *h* is the “hidden state” of our RNN.
- What influences *h*?

We can define a simplified RNN represented by this diagram as follows:

\[h_t = \text{tanh}\big(Ux_t + Vh_{t-1} + b \big)\]

\[\hat{y}_t = \text{softmax}(c + Wh_t)\]

- By unfolding the RNN we can compute \(\hat{y}\) for a given length of sequence.
- Note that the weight matrices \(U\), \(V\), \(W\) are the same for each timestep; this is the big advantage of RNNs!

We can now use the following equations to compute \(\hat{y}_3\), by computing \(h\) for the previous steps:

\[h_t = \text{tanh}\big(Ux_t + Vh_{t-1} + b \big)\]

\[\hat{y}_t = \text{softmax}(c + Wh_t)\]
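The unfolding can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the layer sizes, the random initialisation, the zero initial state, and the function name `rnn_forward` are all assumptions made for the example.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - np.max(z, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)

def rnn_forward(xs, U, V, W, b, c, h0=None):
    """Unfold the simplified RNN over a sequence xs of shape (T, n_in).

    Returns the hidden states (T, n_hidden) and the outputs y_hat (T, n_out).
    The same U, V, W, b, c are reused at every time step.
    """
    h = np.zeros(V.shape[0]) if h0 is None else h0  # assumed zero initial state
    hs, y_hats = [], []
    for x_t in xs:
        h = np.tanh(U @ x_t + V @ h + b)    # h_t = tanh(U x_t + V h_{t-1} + b)
        hs.append(h)
        y_hats.append(softmax(c + W @ h))   # y_hat_t = softmax(c + W h_t)
    return np.array(hs), np.array(y_hats)

# Illustrative sizes and random weights (not from the slides).
rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 4, 8, 4
U = rng.normal(scale=0.1, size=(n_hidden, n_in))
V = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
W = rng.normal(scale=0.1, size=(n_out, n_hidden))
b = np.zeros(n_hidden)
c = np.zeros(n_out)

xs = np.eye(n_in)[[0, 2, 1]]          # three one-hot inputs x_1, x_2, x_3
hs, y_hats = rnn_forward(xs, U, V, W, b, c)
y_hat_3 = y_hats[2]                   # prediction after the third step
```

Here `y_hat_3` depends on all three inputs through the chain of hidden states, even though only one set of weights is stored.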

\(\hat{y}\) is a probability distribution! A finite number of non-negative weights that sum to 1:

\[\sigma(\mathbf{z})_j = \frac{e^{z_j}}{\sum_{k=1}^K e^{z_k}} \text{ for } j = 1,\ldots, K\]

We use the categorical cross-entropy function for loss:

\[\begin{align*} h_t &= \text{tanh}\big( {b} + {Vh}_{t-1} + {Ux}_t \big) \\ \hat{y}_t &= \text{softmax}(c + Wh_t) \\ L_t &= -y_t \cdot \text{log}(\hat{y}_t) \\ \text{Loss} &= \sum_t L_t \\ \end{align*}\]
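As a small worked example of the loss, the sketch below evaluates \(L_t\) and the summed loss for a two-step sequence with hand-picked predictions. The function name `sequence_loss` and the small `eps` guard against \(\log 0\) are assumptions added for the example; they are not part of the formula above.

```python
import numpy as np

def sequence_loss(y_true, y_hat, eps=1e-12):
    """Categorical cross-entropy summed over time.

    y_true: one-hot targets, shape (T, K)
    y_hat:  predicted distributions, shape (T, K)
    eps is a hypothetical numerical guard so that log(0) never occurs.
    """
    per_step = -np.sum(y_true * np.log(y_hat + eps), axis=1)  # L_t = -y_t . log(y_hat_t)
    return per_step, per_step.sum()                           # Loss = sum_t L_t

# Two time steps, three classes; the correct class gets probability 0.5 each time.
y_true = np.array([[1, 0, 0],
                   [0, 1, 0]])
y_hat = np.array([[0.50, 0.25, 0.25],
                  [0.25, 0.50, 0.25]])

per_step, loss = sequence_loss(y_true, y_hat)
```

Because the targets are one-hot, each \(L_t\) reduces to the negative log-probability assigned to the correct class, so here both steps contribute \(-\log 0.5\).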

Backpropagation through time propagates the error backwards through the unfolded network graph, adjusting all parameters (*U*, *V*, *W* and the biases *b*, *c*) to minimise loss.

- **Training data:** a collection of text.
- **Input (*X*):** snippets of 30 characters from the collection.
- **Target output (*y*):** 1 character, the next one after the 30 in each *X*.

- Target: A probability distribution with \(P(n) = 1\)
- Output: A probability distribution over all next letters.
- E.g.: “My cat is named Simon” would lead to **X**: “My cat is named Simo” and **y**: “n”
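One way to build such input/target pairs is a sliding window over the text. The sketch below is an assumption about how this could be done, not the code used for the slides; the helper names are hypothetical, and the window is shortened to 20 characters so that the “Simon” example fits.

```python
import numpy as np

def make_training_pairs(text, seq_len=30):
    """Slide a window over text: X is seq_len characters, y is the next one."""
    X, y = [], []
    for i in range(len(text) - seq_len):
        X.append(text[i:i + seq_len])
        y.append(text[i + seq_len])
    return X, y

def one_hot_targets(y, vocab):
    """Encode each target character as a distribution with P(target) = 1."""
    index = {ch: i for i, ch in enumerate(vocab)}
    targets = np.zeros((len(y), len(vocab)))
    targets[np.arange(len(y)), [index[ch] for ch in y]] = 1.0
    return targets

text = "My cat is named Simon"
X, y = make_training_pairs(text, seq_len=20)   # 20 instead of 30 for this short text
vocab = sorted(set(text))                      # the character set of the corpus
targets = one_hot_targets(y, vocab)
```

With this 21-character text and a window of 20, there is exactly one pair: the input “My cat is named Simo” and the target “n”.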

- **S**: Sampling function, sample a letter using the output probability distribution.
- The generated letter is reinserted as the next input.
- We don’t want to always draw the most likely character. That would give frequent repetition and “copying” from the training text. Need a sampling strategy.
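The slides do not say which sampling strategy is used. One common choice, shown here purely as an assumption, is temperature sampling: the output distribution is sharpened or flattened before a letter is drawn. The function name and the example probabilities are illustrative.

```python
import numpy as np

def sample_with_temperature(probs, temperature=1.0, rng=None):
    """Draw an index from probs after rescaling by a temperature.

    temperature < 1 sharpens the distribution (closer to always picking the
    most likely symbol); temperature > 1 flattens it (more random).
    Returns the sampled index and the rescaled distribution.
    """
    rng = np.random.default_rng() if rng is None else rng
    logits = np.log(np.asarray(probs, dtype=float) + 1e-12) / temperature
    scaled = np.exp(logits - logits.max())   # subtract max for numerical stability
    scaled /= scaled.sum()
    return rng.choice(len(scaled), p=scaled), scaled

probs = [0.6, 0.3, 0.1]                      # example network output
rng = np.random.default_rng(0)
idx, scaled_cold = sample_with_temperature(probs, temperature=0.5, rng=rng)
_, scaled_hot = sample_with_temperature(probs, temperature=2.0, rng=rng)
```

At a low temperature the most likely letter dominates even more, while at a high temperature the less likely letters are drawn more often, which trades fidelity to the training text against variety.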

- RNN as a sequence generator
- Input is current symbol, output is next predicted symbol.
- Connect output to input and continue!
- CharRNN simply applies this to a subset of ASCII characters.
- Train and generate on any text corpus: Fun!
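The “connect output to input” loop can be sketched as follows. The weights here are random and untrained, and the five-letter alphabet is a toy stand-in, so the generated string is meaningless; the point is only the feedback structure. All names are assumptions for the example.

```python
import numpy as np

def generate(seed_idx, n_steps, U, V, W, b, c, rng):
    """Generate symbol indices by feeding each sampled symbol back as input."""
    n_symbols = U.shape[1]
    h = np.zeros(V.shape[0])
    idx, out = seed_idx, [seed_idx]
    for _ in range(n_steps):
        x = np.eye(n_symbols)[idx]               # one-hot encoding of current symbol
        h = np.tanh(U @ x + V @ h + b)           # update the hidden state
        z = c + W @ h
        p = np.exp(z - z.max())
        p /= p.sum()                             # softmax over the next symbol
        idx = int(rng.choice(n_symbols, p=p))    # sample, then reuse as next input
        out.append(idx)
    return out

alphabet = "abcd "                               # toy alphabet, purely illustrative
rng = np.random.default_rng(0)
n_symbols, n_hidden = len(alphabet), 16
U = rng.normal(scale=0.5, size=(n_hidden, n_symbols))
V = rng.normal(scale=0.5, size=(n_hidden, n_hidden))
W = rng.normal(scale=0.5, size=(n_symbols, n_hidden))
b, c = np.zeros(n_hidden), np.zeros(n_symbols)

indices = generate(seed_idx=0, n_steps=20, U=U, V=V, W=W, b=b, c=c, rng=rng)
text = "".join(alphabet[i] for i in indices)
```

With trained weights and a real character set, the same loop would produce text in the style of the training corpus.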

Shakespeare (Karpathy, 2015):

Second Senator: They are away this miseries, produced upon my soul, Breaking and strongly should be buried, when I perish The earth and thoughts of many states.

DUKE VINCENTIO: Well, your wit is in the care of side and that.

LaTeX Algebraic Geometry:

N.B. “*Proof.* Omitted.” Lol.

- Useful for tasks where the whole sequence is available.
- Each output unit (\(\hat{y}\)) depends on both past and future - but most sensitive to closer times.
- Popular in speech recognition, translation etc.
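A rough sketch of the bidirectional idea: one RNN reads the sequence forwards, a second reads it backwards, and the two hidden states for each time step are combined. Concatenation is assumed here as the combination rule, and the weights are random; neither detail comes from the slides.

```python
import numpy as np

def run_rnn(xs, U, V, b):
    """Simple tanh RNN over xs of shape (T, n_in); returns states (T, n_hidden)."""
    h = np.zeros(V.shape[0])
    hs = []
    for x_t in xs:
        h = np.tanh(U @ x_t + V @ h + b)
        hs.append(h)
    return np.array(hs)

def bidirectional_states(xs, fwd, bwd):
    """Concatenate forward-in-time and backward-in-time hidden states per step."""
    h_fwd = run_rnn(xs, *fwd)                  # summarises x_1 .. x_t
    h_bwd = run_rnn(xs[::-1], *bwd)[::-1]      # summarises x_T .. x_t
    return np.concatenate([h_fwd, h_bwd], axis=1)

rng = np.random.default_rng(0)
n_in, n_hidden, T = 3, 5, 6                    # illustrative sizes

def init():
    """Random (U, V, b) for one direction; purely illustrative."""
    return (rng.normal(scale=0.5, size=(n_hidden, n_in)),
            rng.normal(scale=0.5, size=(n_hidden, n_hidden)),
            np.zeros(n_hidden))

fwd, bwd = init(), init()
xs = rng.normal(size=(T, n_in))
states = bidirectional_states(xs, fwd, bwd)
```

Changing the last input alters the backward half of the state at the first time step but leaves the forward half untouched, which is what it means for each output to depend on both past and future.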

- Learns to generate an output sequence (**y**) from an input sequence (**x**).
- The final hidden state of the encoder is used to compute a context variable *C*.
- For example, translation.

- Does adding deeper layers to an RNN make it work better?
- Several options for architecture.
- Simply stacking RNN layers is very popular; shown to work better by Graves et al. (2013)
- Intuitively: layers might learn some hierarchical knowledge automatically.
- Typical setup: up to three recurrent layers.
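Stacking can be sketched by feeding the hidden-state sequence of one recurrent layer into the next. The example below assumes plain tanh layers with random weights and three layers of arbitrary size; it illustrates only the wiring, not any particular published architecture.

```python
import numpy as np

def stacked_rnn_forward(xs, layers):
    """Run a stack of simple tanh RNN layers over xs of shape (T, n_in).

    layers is a list of (U, V, b); layer k reads the hidden states of layer k-1.
    Returns the hidden states of the top layer.
    """
    inputs = xs
    for U, V, b in layers:
        h = np.zeros(V.shape[0])
        outputs = []
        for x_t in inputs:
            h = np.tanh(U @ x_t + V @ h + b)
            outputs.append(h)
        inputs = np.array(outputs)   # becomes the input sequence of the next layer
    return inputs

rng = np.random.default_rng(0)
sizes = [4, 8, 8, 6]                 # input size, then three hidden sizes (illustrative)
layers = [
    (rng.normal(scale=0.1, size=(n_out, n_in)),
     rng.normal(scale=0.1, size=(n_out, n_out)),
     np.zeros(n_out))
    for n_in, n_out in zip(sizes[:-1], sizes[1:])
]

xs = rng.normal(size=(5, 4))         # a sequence of 5 input vectors
top_states = stacked_rnn_forward(xs, layers)
```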

- Learning long dependencies is a mathematical challenge.
- Basically: gradients propagated through the same weights tend to vanish (mostly) or explode (rarely)
- E.g., consider a simplified RNN with no nonlinear activation function or input.
- Each time step multiplies *h(0)* by *W*.
- This corresponds to raising the eigenvalues in \(\Lambda\) to the power \(t\).
- Eventually, components of *h(0)* not aligned with the eigenvector of the largest eigenvalue will be discarded.

\[\begin{align*} h_t &= Wh_{t-1}\\ h_t &= (W^t)h_0 \end{align*}\]

(supposing **W** admits eigendecomposition with orthogonal matrix **Q**)

\[\begin{align*} W &= Q\Lambda Q^{\top}\\ h_t &= Q\Lambda ^t Q^{\top}h_0 \end{align*}\]
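The eigenvalue argument can be checked numerically. The sketch below builds a symmetric \(W\) from a random orthogonal \(Q\) and three hand-picked eigenvalues (all illustrative assumptions), then compares repeated multiplication by \(W\) with the eigendecomposition form.

```python
import numpy as np

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))   # random orthogonal matrix Q
lam = np.array([1.1, 0.9, 0.5])                # illustrative eigenvalues
W = Q @ np.diag(lam) @ Q.T                     # W = Q Lambda Q^T

h0 = Q @ np.ones(3)                            # equal weight on every eigenvector
t = 50

h_t = np.linalg.matrix_power(W, t) @ h0        # h_t = W^t h_0
h_t_eig = Q @ np.diag(lam ** t) @ Q.T @ h0     # h_t = Q Lambda^t Q^T h_0

coords = Q.T @ h_t                             # components of h_t along each eigenvector
```

After 50 steps the component along the eigenvalue 1.1 has grown by a factor of about 117, while the components along 0.9 and 0.5 have shrunk to roughly 0.005 and \(10^{-15}\): one direction explodes and the others vanish.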

- “in order to store memories in a way that is robust to small perturbations, the RNN must enter a region of parameter space where gradients vanish”
- “whenever the model is able to represent long term dependencies, the gradient of a long term interaction has exponentially smaller magnitude than the gradient of a short term interaction.”

- Note that this problem is only relevant for recurrent networks since the weights **W** affecting the hidden state are the same at each time step.
- Goodfellow and Bengio (2016): “the problem of learning long-term dependencies remains one of the main challenges in deep learning”
- WildML (2015). Backpropagation Through Time and Vanishing Gradients
- ML for artists

- Possible solution!
- Provide a gate that can change the hidden state a little bit at each step.
- The gates are controlled by **learnable weights** as well!
- Hidden state weights that may **change** at each time step.
- Create **paths through time** with derivatives that do not vanish/explode.
- Gates choose information to **accumulate** or **forget** at each time step.
- **Most effective sequence models used in practice!**

- Self-loop containing an internal state (c).
- Three extra gating units:
  - **Forget gate**: controls how much memory is preserved.
  - **Input gate**: controls how much of the current input is stored.
  - **Output gate**: controls how much of the state is shown to the output.

- Each gate has its own **weights** and **biases**, so this uses *lots* more parameters.
- Some variants on this design, e.g., use c as an additional input to the three gate units.

- Forget gate: *f*
- Internal state: *s*
- Input gate: *g*
- Output gate: *q*
- Output: *h*
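Using the symbols above, a single LSTM step might look like the sketch below. It follows one common formulation and makes several assumptions not stated in the slides: a tanh candidate value, tanh squashing of the state before the output gate, and a hypothetical `params` dictionary holding one `(U, W, b)` triple per gate plus one for the candidate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, s_prev, params):
    """One LSTM step. params maps 'f', 'g', 'q', 'cand' to (U, W, b)."""
    def affine(name):
        U, W, b = params[name]
        return U @ x + W @ h_prev + b

    f = sigmoid(affine("f"))                       # forget gate
    g = sigmoid(affine("g"))                       # input gate
    q = sigmoid(affine("q"))                       # output gate
    s = f * s_prev + g * np.tanh(affine("cand"))   # internal state (self-loop)
    h = q * np.tanh(s)                             # output / hidden state
    return h, s

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4                              # illustrative sizes
params = {
    name: (rng.normal(scale=0.1, size=(n_hidden, n_in)),
           rng.normal(scale=0.1, size=(n_hidden, n_hidden)),
           np.zeros(n_hidden))
    for name in ("f", "g", "q", "cand")
}

x = rng.normal(size=n_in)
h, s = lstm_step(x, np.zeros(n_hidden), np.zeros(n_hidden), params)
```

In this formulation there are four sets of weights and biases where the simple RNN has one. When *f* is close to 1 and *g* close to 0, the state *s* is copied forward almost unchanged, which is the path through time along which gradients do not vanish.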

Source: (Olah, C. 2015.)

- Are three gates necessary?
- Other gating units are simpler, e.g., Gated Recurrent Unit (GRU)
- For the moment, LSTMs are winning in practical use.
- Maybe someone wants to explore alternatives in a project?
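For comparison, a GRU step can be sketched with two gates and no separate internal state. This uses one common convention for the update gate (some texts swap the roles of `z` and `1 - z`); the `params` layout and names are assumptions for the example.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h_prev, params):
    """One GRU step. params maps 'z', 'r', 'cand' to (U, W, b)."""
    Uz, Wz, bz = params["z"]
    Ur, Wr, br = params["r"]
    Uc, Wc, bc = params["cand"]
    z = sigmoid(Uz @ x + Wz @ h_prev + bz)             # update gate
    r = sigmoid(Ur @ x + Wr @ h_prev + br)             # reset gate
    cand = np.tanh(Uc @ x + Wc @ (r * h_prev) + bc)    # candidate state
    return (1.0 - z) * h_prev + z * cand               # blend old state and candidate

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4                                  # illustrative sizes
params = {
    name: (rng.normal(scale=0.1, size=(n_hidden, n_in)),
           rng.normal(scale=0.1, size=(n_hidden, n_hidden)),
           np.zeros(n_hidden))
    for name in ("z", "r", "cand")
}

x = rng.normal(size=n_in)
h_prev = rng.normal(size=n_hidden)
h = gru_step(x, h_prev, params)
```

This version needs three sets of weights and biases rather than the four of a three-gate LSTM with a candidate value.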

Sometimes, the LSTM cell state corresponds with features of the sequential data:

Source: (Karpathy, 2015)

Some kinds of music can be represented in a text-like manner.

- State-of-the-art in music generating RNNs.
- Encode MIDI musical sequences as categorical data.
- Now supports polyphony (multiple notes), dynamics (volume), expressive timing (rubato).
- E.g.: YouTube demo

- iPad music transcribed as sequence of numbers for each performer.
- Trick: encode multiple ints as one (preserving ordering).
- Video

- Recurrent Neural Networks let us capture and model the structure of sequential data.
- Sampling from trained RNNs allows us to generate new, creative sequences.
- The internal state of RNNs makes them interesting for interactive applications, since it lets them capture and continue from the current context or “style”.
- LSTM units are able to overcome the vanishing gradient problem to some extent.