• Trying to Understand (blu3mo)

  • Watching 3Blue1Brown’s videos gave me a pretty good understanding.

  • While the understanding is fresh, I want to jot down those insights.

  • This is verbalization aimed at cementing my own memory, rather than text written to explain things to others.

    • For now, I’ll write as if word = token.
  • https://www.youtube.com/watch?v=KlZ-QmPteqM

    • Well, the discussion on probability distributions makes sense.
    • Also, things like Softmax.
    • If you tried to do what GPT does with word2vec alone, without attention (a code sketch follows this list):
      • Turn the last token of a sentence into a vector.
      • Train a model that predicts the probability distribution over the “next token” given that single “token.”
        • Isn’t this essentially just generating “plausible sentences” in a Markov chain-like way? (blu3mo)(blu3mo)
    • However, since this method only looks at the “last token,” it naturally cannot capture the “meaning of words in context.”
    • Hence the need for an attention mechanism.
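    • A minimal sketch of that “last token only” predictor, to pin the idea down. This is my own toy code (PyTorch; vocabulary size and embedding width are arbitrary), not something from the video — it amounts to a learned bigram / Markov-chain model.

```python
# Toy sketch (not from the video): predict the next-token distribution
# from ONLY the last token's embedding -- effectively a learned bigram
# model, i.e. a Markov chain over tokens.
import torch
import torch.nn as nn

VOCAB_SIZE = 1000   # assumed toy vocabulary size
EMBED_DIM = 64      # assumed embedding width

class LastTokenPredictor(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMBED_DIM)  # word2vec-style lookup table
        self.unembed = nn.Linear(EMBED_DIM, VOCAB_SIZE)   # map back to vocabulary logits

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        last = token_ids[:, -1]                  # only the final token is used
        logits = self.unembed(self.embed(last))  # no mixing with earlier context
        return torch.softmax(logits, dim=-1)     # probability distribution over the next token

# Two different sentences that happen to end in the same token get the
# exact same prediction -- which is precisely the limitation noted above.
model = LastTokenPredictor()
probs = model(torch.tensor([[5, 42, 7], [99, 3, 7]]))
print(torch.allclose(probs[0], probs[1]))  # True
```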
  • https://www.youtube.com/watch?v=j3_VgCt18fA

    • I understood the meaning of what 0xikkun said about AbstractMap resembling an attention mechanism. (blu3mo)(blu3mo)(blu3mo)
      • Particularly, the experiment in Search for Abstractly Related Pages bears a resemblance in structure.
      • Rather than GPT itself, it is closer to an attention mechanism where the data along the vertical and horizontal axes differ.
      • Trying to spell out the correspondences in the analogy (a code sketch follows this list):
        • Each token in the Transformer = Each page or consultation matter in AbstractMap.
        • The query vectors derived from tokens are like “questions abstracted from pages or consultation matters.”
        • “$\Sigma V_n + E_n$” is like “text explaining the abstract relationship between consultation matters and pages.”
        • Something along those lines?
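    • To make those correspondences concrete, here is a minimal single-head attention sketch (again my own toy code; dimensions and names are arbitrary, and the causal mask is omitted). Each token’s embedding asks a “question” via Q, other tokens advertise answers via K, and the update added back to each $E_n$ is the attention-weighted sum of value vectors, roughly the “$\Sigma V_n + E_n$” above.

```python
# Toy single-head attention sketch (my own code, causal mask omitted):
# Q = the "question" each token asks, K = what each token can answer,
# V = the information actually passed along.  Each embedding is updated
# as E_n + (attention-weighted sum of value vectors).
import torch
import torch.nn as nn

EMBED_DIM = 64  # assumed model width
HEAD_DIM = 16   # assumed head width

class SingleHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.W_q = nn.Linear(EMBED_DIM, HEAD_DIM, bias=False)   # query: the abstracted "question"
        self.W_k = nn.Linear(EMBED_DIM, HEAD_DIM, bias=False)   # key: what each token offers
        self.W_v = nn.Linear(EMBED_DIM, EMBED_DIM, bias=False)  # value: content to hand over

    def forward(self, E: torch.Tensor) -> torch.Tensor:
        Q, K, V = self.W_q(E), self.W_k(E), self.W_v(E)
        scores = Q @ K.transpose(-2, -1) / HEAD_DIM ** 0.5  # how well each question matches each key
        w = torch.softmax(scores, dim=-1)                   # attention weights
        return E + w @ V                                    # residual update: E_n + Σ w_i V_i

E = torch.randn(1, 10, EMBED_DIM)  # 10 tokens (≈ pages / consultation matters in the analogy)
print(SingleHead()(E).shape)       # torch.Size([1, 10, 64])
```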
  • https://www.youtube.com/watch?v=9-Jl0dxWQs8

    • In the part about MLP, each token is processed independently.
      • That makes sense (blu3mo).
      • Training separate MLPs all the way from “an MLP for the 1st token” to “an MLP for the 1024th token” doesn’t seem very meaningful anyway.
        • Positional encoding already conveys “which position in the sequence it is,” so sharing one MLP across all positions seems fine.
    • I finally intuitively understood the purpose of using non-linear functions like ReLU.
      • The example with Michael Jordan was enlightening.
        • With only linear transformations, no matter how you set things up, something that matches only “Michael” or only “Jordan” (i.e. some other Jordan) still ends up scored 50% “Michael Jordan.”
        • The image I got is that the flat region of ReLU (or a similar non-linearity) squashes that 50% partial match down to zero, so only the full “Michael Jordan” match fires (see the sketch below).
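    • A minimal sketch of the position-wise MLP (my own toy code following the video’s intuition; sizes are arbitrary): the same two linear layers plus ReLU are applied to every token independently, and the ReLU is what lets a neuron fire only when both the “Michael” and “Jordan” directions are present.

```python
# Toy position-wise MLP sketch (my own code).  The same weights are applied
# to every token independently; tokens never interact inside the MLP.
import torch
import torch.nn as nn

EMBED_DIM = 64    # assumed model width
HIDDEN_DIM = 256  # assumed MLP width

mlp = nn.Sequential(
    nn.Linear(EMBED_DIM, HIDDEN_DIM),  # each row ≈ a direction to test for, e.g. "Michael" AND "Jordan"
    nn.ReLU(),                         # flat region squashes partial matches (only "Michael", or only "Jordan") to 0
    nn.Linear(HIDDEN_DIM, EMBED_DIM),  # write the detected features back into the embedding
)

# One forward pass over a whole (batch, seq_len, dim) tensor: nn.Linear
# broadcasts over the leading dimensions, so the 1st and the 1024th token
# go through exactly the same MLP -- no per-position MLPs needed.
E = torch.randn(2, 1024, EMBED_DIM)
print(mlp(E).shape)  # torch.Size([2, 1024, 64])
```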