From Seq2Seq to Attention: How a Simple Translation Problem Led Toward Transformers

Imagine it is 2016.

I want to build a machine translation model that converts English into Tamil.

I give it this sentence: The dog chased the cat because it was fast.

Seems simple.

But hidden inside this sentence is a difficult problem:

Who was fast?
Was it the dog?
Or the cat?

That single ambiguity exposes one of the biggest limitations in early neural translation systems.

Let’s walk through why.

The Early Seq2Seq Idea:

The classic sequence-to-sequence (Seq2Seq) models were built from recurrent neural networks (RNNs) such as Long Short-Term Memory (LSTM) networks.

They had three major pieces:

  1. Encoder
    Think of this like someone reading a sentence and understanding it.
  2. Context Vector
    This is like a summary of what was read: a single “idea” or “memory”.
  3. Decoder
    This is like someone taking that idea and rephrasing it in another language (or generating a response).

What Happens in Seq2Seq:

Step #1: The encoder takes the input word by word and processes it:

    “The”     → h1
    “dog”     → h2
    “chased”  → h3
    “the”     → h4
    “cat”     → h5
    “because” → h6
    “it”      → h7
    “was”     → h8
    “fast”    → h9
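To make the word-by-word reading concrete, here is a minimal sketch of a recurrent encoder in plain Python. It is a toy under stated assumptions, not a real LSTM: the random word vectors, the scalar weights `w_h` and `w_x`, and the state size of 4 are all invented for illustration. What it does show is the shape of the computation: each hidden state is built from the previous state plus the current word.

```python
import math
import random

def make_embeddings(words, dim, seed=0):
    """Give each distinct word a small random vector (hypothetical toy embeddings)."""
    rng = random.Random(seed)
    return {w: [rng.uniform(-1, 1) for _ in range(dim)] for w in sorted(set(words))}

def rnn_step(h_prev, x, w_h=0.5, w_x=0.5):
    """One recurrent step: mix the previous state with the current word vector.
    Real RNNs use learned weight matrices; scalars keep the sketch short."""
    return [math.tanh(w_h * h + w_x * xi) for h, xi in zip(h_prev, x)]

def encode(words, dim=4):
    """Read the sentence left to right and return one hidden state per word."""
    emb = make_embeddings(words, dim)
    h = [0.0] * dim          # the state before any word has been read
    states = []
    for w in words:
        h = rnn_step(h, emb[w])   # h for this word depends on the previous h
        states.append(h)
    return states

sentence = "The dog chased the cat because it was fast".split()
states = encode(sentence)
for i, (word, h) in enumerate(zip(sentence, states), start=1):
    print(f"{word:<8} -> h{i} = {[round(v, 2) for v in h]}")
```

Running it prints nine hidden states, h1 through h9, one per input word.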

Step #2: The old Seq2Seq model compresses everything into one dense, fixed-size vector.

    The context vector is expected to hold everything about the sentence: the subject, the object, the action, the relationships between them, and even what the pronoun “it” refers to.
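The fixed size is the important part. The sketch below, again a toy with assumed details (the size `DIM = 4`, the character-based `word_vector` helper, and the 0.5 mixing weights are hypothetical), keeps only the final hidden state as the context vector and shows that a three-word sentence and a nine-word sentence both end up in the same number of slots.

```python
import math

DIM = 4  # hypothetical context-vector size

def word_vector(word):
    """Deterministic toy vector for a word, derived from its character codes."""
    code = sum(ord(c) for c in word)
    return [math.sin((i + 1) * code) for i in range(DIM)]

def context_vector(words):
    """Run a toy recurrent encoder and keep only the final hidden state."""
    h = [0.0] * DIM
    for w in words:
        x = word_vector(w)
        h = [math.tanh(0.5 * hi + 0.5 * xi) for hi, xi in zip(h, x)]
    return h  # everything the decoder will ever see about the input

short_sentence = "The dog ran".split()
long_sentence = "The dog chased the cat because it was fast".split()

print(len(short_sentence), "words ->", len(context_vector(short_sentence)), "numbers")
print(len(long_sentence), "words ->", len(context_vector(long_sentence)), "numbers")
```

Both lines report 4 numbers. The longer the sentence, the more has to be squeezed into the same space.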

    The Bottleneck Problem:

    Now the model has to decide: who was fast, the dog or the cat? What should the word “it” refer to?

    This is where it gets interesting.

    If “it” refers to “dog”, the translation would be something like:

    நாய் வேகமாக இருந்ததால் பூனையை துரத்தியது
    (Because the dog was fast, it chased the cat)

    If “it” refers to “cat” instead, the translation would begin:

    பூனை வேகமாக இருந்ததால்…
    (Because the cat was fast…)

    Picking the wrong referent changes the meaning of the sentence, and the translation comes out wrong.

    The issue is not vocabulary.

    It is relationship tracking.

    Early Seq2Seq often lost those relationships because everything was compressed into one vector.

    This became known as the encoder bottleneck.

    The Early Attention Mechanism (Bahdanau, Cho, and Bengio, 2014–2015):

    To address this weakness, Bahdanau, Cho, and Bengio introduced the first attention mechanism for neural machine translation in 2014–2015. It works as a component of the decoder.

    RNNs with attention enabled better translation of long sentences and improved performance in tasks like speech recognition and text generation.

    What Does Attention Do in an RNN?

    RNN + Attention Architecture: Attention is applied INSIDE the decoder at every step:

    After the encoder has processed the input, the decoder starts generating the first output word.

    At each decoding step:

    Before generating each word of output, the model looks back and determines which words deserve more attention.

    When the decoder reaches the point of translating “it was fast…”, it can look back at the input words:

    dog      ← high attention

    cat      ← lower attention

    Instead of relying only on a fixed context vector, the decoder can dynamically revisit relevant encoder states.
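One decoding step of this idea can be sketched as follows: score every encoder state against the current decoder state, turn the scores into weights with a softmax, and build a fresh context vector as the weighted sum. The scoring function is a simplified form of additive scoring (a `tanh` of the combined states, weighted by a vector `v`) with the learned matrices left out. All numbers below are hand-picked assumptions so that “dog” comes out ahead; a trained model would learn them from data.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def additive_score(decoder_state, encoder_state, v):
    """Simplified additive score: v . tanh(s + h), with no learned matrices."""
    return sum(vk * math.tanh(s + h)
               for vk, s, h in zip(v, decoder_state, encoder_state))

def attend(decoder_state, encoder_states, v):
    """Return the attention weights and the weighted-sum context vector."""
    scores = [additive_score(decoder_state, h, v) for h in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * h[k] for w, h in zip(weights, encoder_states))
               for k in range(dim)]
    return weights, context

# Hand-picked toy values (hypothetical, not from a trained model).
words = ["dog", "chased", "cat"]
encoder_states = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]]
decoder_state = [0.5, 0.5]   # decoder state while translating "it was fast"
v = [1.0, -1.0]

weights, context = attend(decoder_state, encoder_states, v)
for word, w in zip(words, weights):
    print(f"{word:<7} attention = {w:.2f}")
```

The context vector is recomputed at every decoding step, so each output word gets its own view of the input instead of sharing one frozen summary.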

    Even After Attention… What Was Still Wrong?

    Problem #1: Reading One Word at a Time Was Slow

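    An RNN reads the sentence one word at a time, and each step depends on the result of the previous one. That dependency means the steps cannot run in parallel, so there is little room to speed things up.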

    Analogy:

    It’s like cooking one dosa fully before starting the next one.

    Transformers later allowed cooking many dosas at once 😄

    Problem #2: Memory Could Still Fade

    Attention helped, but the model still processed one word at a time and carried its memory forward step by step. Long, complicated relationships could still get blurry along the way.

    Analogy:

    Like trying to remember the beginning of a very long phone number while someone keeps adding more digits.
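A deliberately crude way to see the fading is to track how much of the first word's signal is left after more words are read. The sketch assumes a toy recurrence `h = carry * h + x` with a made-up `carry` of 0.5. Real LSTMs use gates precisely to slow this decay, so treat the numbers as an illustration of the trend rather than a measurement.

```python
def first_word_trace(num_words, carry=0.5):
    """Feed a signal of 1.0 for the first word and 0.0 afterwards,
    then return what is left of it in the state after num_words steps."""
    h = 0.0
    for t in range(num_words):
        x = 1.0 if t == 0 else 0.0
        h = carry * h + x   # each step keeps only a fraction of the old state
    return h

for n in (1, 5, 9, 20):
    print(f"after {n:>2} words: {first_word_trace(n):.6f}")
```

With these toy settings, less than half a percent of the first word's signal survives to the ninth word.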

    Problem #3: Attention Was Only Used as a Helper

    Input words were encoded through recurrence, but they did not directly attend to one another. Attention existed only in the decoder.

    Once researchers recognized Problem #3, a new question came up: “Why add attention only in the decoder?”

    Why not let all words talk to each other from the start?

    Why wait until the decoder stage to bring in attention?

    This was the turning point that led to the Transformer architecture.
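The “all words talk to each other” idea can be sketched in a few lines: every word scores every word in the same sentence, including itself, and takes a weighted mix of them. This is a bare-bones illustration under assumptions, using scaled dot products on hand-picked two-number vectors and leaving out the learned projections that a real Transformer applies first.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Each word attends to every word; no step waits for a previous one."""
    dim = len(vectors[0])
    all_weights, outputs = [], []
    for query in vectors:
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        weights = softmax(scores)
        mixed = [sum(w * vec[k] for w, vec in zip(weights, vectors))
                 for k in range(dim)]
        all_weights.append(weights)
        outputs.append(mixed)
    return all_weights, outputs

# Hand-picked toy vectors (hypothetical): "it" is placed close to "dog".
words = ["dog", "cat", "it"]
vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]

attn, mixed = self_attention(vectors)
for word, row in zip(words, attn):
    print(f"{word:<4}", [round(w, 2) for w in row])
```

Each pass of the outer loop is independent of the others, which is what lets this kind of computation run for all words at once.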

    Reference:

    Bahdanau, D., Cho, K., and Bengio, Y. “Neural Machine Translation by Jointly Learning to Align and Translate.” Conference paper, ICLR 2015.