-
Trying to Understand (blu3mo)
-
- Watching 3Blue1Brown’s videos gave me a good understanding.
-
While the understanding is fresh, I want to jot down those insights.
-
- This is verbalization aimed at solidifying my own memory, rather than text written to explain things to others.
- For now, I’ll write as if word = token.
-
https://www.youtube.com/watch?v=KlZ-QmPteqM
- Well, the discussion on probability distributions makes sense.
- Also, things like Softmax.
- If you tried to do what GPT does with only word2vec, without attention:
- Turn the last token of the sentence into a vector.
- Train a model that predicts the probability distribution over the “next token” from that “token” (a minimal sketch follows this list).
- Isn’t this essentially akin to generating “plausible sentences” in a Markov chain-like manner? (blu3mo)(blu3mo)
- However, since this method only looks at the “last token,” it naturally cannot capture the “meaning of words in context.”
- Hence the need for an attention mechanism.
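- To make the “word2vec only, no attention” idea concrete, here is a minimal sketch (my own toy example, not from the video; vocabulary, dimensions, and weights are all made up): embed the last token, map it linearly to vocabulary scores, and apply softmax to get a probability distribution over the next token. Because it only sees one token, it is essentially a learned bigram / Markov-chain-style model.
```python
# Minimal sketch: predict the next token from ONLY the last token's embedding
# (no attention). All names and numbers here are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)

vocab = ["the", "cat", "sat", "on", "mat"]      # toy vocabulary (assumption)
V, d = len(vocab), 8                            # vocab size, embedding dimension

E = rng.normal(size=(V, d))      # token embeddings (word2vec-like lookup table)
W_out = rng.normal(size=(d, V))  # "unembedding" matrix mapping back to vocab logits

def softmax(z):
    z = z - z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def next_token_distribution(last_token: str) -> np.ndarray:
    """Probability distribution over the next token, using only the last token."""
    x = E[vocab.index(last_token)]   # embed the last token
    logits = x @ W_out               # linear map to vocabulary scores
    return softmax(logits)           # softmax turns scores into probabilities

p = next_token_distribution("cat")
print({tok: round(float(pi), 3) for tok, pi in zip(vocab, p)})
```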
-
https://www.youtube.com/watch?v=j3_VgCt18fA
- I understood the meaning of what 0xikkun said about AbstractMap resembling an attention mechanism. (blu3mo)(blu3mo)(blu3mo)
- Particularly, the experiment in Search for Abstractly Related Pages bears a resemblance in structure.
- Rather than GPT’s setup, it’s closer to an attention mechanism where the data along the vertical and horizontal axes differ (the rows and columns come from different sets of items).
- Trying to explain the analogy relationships:
- Each token in the Transformer = Each page or consultation matter in AbstractMap.
- The queries/keys derived from tokens are like “questions abstracted from pages or consultation matters.”
- $\Sigma V_n + E_n$ is like “text explaining the abstract relationship between consultation matters and pages” (the attention computation itself is sketched after this list).
- Something along those lines?
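- To pin down the notation above (queries/keys and $\Sigma V_n + E_n$), here is a minimal sketch of single-head self-attention (my own toy code, not from the video or AbstractMap; sizes and weights are assumptions): each token’s updated vector is its original embedding plus a softmax-weighted sum of value vectors.
```python
# Minimal sketch of single-head self-attention with a residual add.
import numpy as np

rng = np.random.default_rng(1)

T, d = 4, 8                        # number of tokens, embedding dimension (assumed)
E = rng.normal(size=(T, d))        # token embeddings, one row per token

W_q = rng.normal(size=(d, d))      # learned projections (random stand-ins here)
W_k = rng.normal(size=(d, d))
W_v = rng.normal(size=(d, d))

Q, K, V = E @ W_q, E @ W_k, E @ W_v

scores = Q @ K.T / np.sqrt(d)                               # relevance of token n to token i
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)     # row-wise softmax

attended = weights @ V            # Sigma_n weight_in * V_n for each token i
updated = E + attended            # E_i + Sigma V_n: context-enriched embedding

print(updated.shape)              # (4, 8): same shape as E, now context-aware
```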
-
https://www.youtube.com/watch?v=9-Jl0dxWQs8
- In the part about MLP, each token is processed independently.
- That makes sense (blu3mo).
- Training a separate MLP for each position, from “MLP for the 1st token” all the way through “MLP for the 1024th token,” doesn’t seem very meaningful.
- Since positional encoding already conveys “which position the word is at,” sharing a single MLP across positions seems fine (see the sketch right after this list).
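- A minimal sketch of that point (my own toy code, not from the video; dimensions are assumptions): the same MLP weights are applied to each token vector independently, so per-token application and a batched matrix multiply give identical results.
```python
# Minimal sketch: one SHARED two-layer MLP applied to every token position
# independently (no per-position MLPs). Position info is already baked into
# each vector by positional encoding.
import numpy as np

rng = np.random.default_rng(2)

T, d, d_hidden = 1024, 8, 32             # sequence length, model dim, hidden dim (assumed)
X = rng.normal(size=(T, d))              # token vectors (embeddings + positional encoding)

W1, b1 = rng.normal(size=(d, d_hidden)), np.zeros(d_hidden)
W2, b2 = rng.normal(size=(d_hidden, d)), np.zeros(d)

def mlp(x):
    """One shared MLP; it only ever sees one token vector at a time."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2   # ReLU non-linearity

# Applying it row by row (per token) matches the batched computation exactly,
# which is what "each token is processed independently" means.
per_token = np.stack([mlp(X[i]) for i in range(T)])
batched = mlp(X)
print(np.allclose(per_token, batched))   # True
```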
- I finally intuitively understood the purpose of using non-linear functions like ReLU.
- The example with Michael Jordan was enlightening.
- With only linear transformations, no matter how hard you try, an input that matches only “Michael” or only “Jordan” (some other Jordan) inevitably ends up about 50% activated as “Michael Jordan.”
- The image I grasped is that the flat region of ReLU (or a similar function) squashes that 50% “Michael Jordan” activation down to zero (sketched below).
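- A minimal sketch of how I read the Michael Jordan example (my own toy code, details assumed): a purely linear score can’t cleanly separate “Michael Jordan” from inputs containing only one of the two words, but ReLU plus a bias pushes the halfway cases into the flat region and squashes them to zero.
```python
# Minimal sketch: a neuron that should fire only for "Michael" AND "Jordan".
# Linear-only scoring gives partial matches half the activation; ReLU with a
# bias squashes those halfway cases to 0.
import numpy as np

def relu(x):
    return np.maximum(0, x)

# Features: [contains "Michael", contains "Jordan"] as 0/1 indicators (assumed encoding).
michael_jordan = np.array([1.0, 1.0])
michael_phelps = np.array([1.0, 0.0])
alexis_jordan  = np.array([0.0, 1.0])

w = np.array([1.0, 1.0])   # weight both words equally
b = -1.0                   # bias so a single match lands in ReLU's flat region

for name, x in [("Michael Jordan", michael_jordan),
                ("Michael Phelps", michael_phelps),
                ("Alexis Jordan", alexis_jordan)]:
    linear = w @ x / 2          # linear-only score: 1.0, 0.5, 0.5 -- can't separate cleanly
    gated = relu(w @ x + b)     # ReLU score: 1.0, 0.0, 0.0 -- the AND we wanted
    print(f"{name:>15}  linear={linear:.1f}  relu={gated:.1f}")
```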