https://www.youtube.com/watch?v=kCc8FmEb1nY

  • It’s possible to convert a string into an array of token IDs.
  • A chunk of block size + 1 tokens yields block size training examples: for each t from 1 to block size, the first t tokens are the input and token t + 1 is the target.
    • Block size is therefore also the maximum context length.
    • Training on every prefix this way teaches next-token prediction at every context length, from a single token up to block size.
  • Bigram Neural Network
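    • Trained only on pairs of adjacent tokens: the i-th token is the input and the (i+1)-th token is the target.
      • Predictions come from a lookup table of size vocab size x vocab size.
        • The row for a given token holds a score for every possible next token; a softmax turns that row into next-token probabilities.
      • The table entries are learned parameters, trained as a neural network by minimizing the cross-entropy loss on the next token.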
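
The notes above can be sketched end to end in plain Python. This is a minimal illustration under stated assumptions, not the code from the video: the character-level tokenizer, the sample text, `block_size = 4`, the learning rate, the step count, and all function names are hypothetical choices made for the example, and the training loop is hand-written full-batch gradient descent rather than a deep learning library.

```python
import math

# Assumption: a toy character-level tokenizer built from a sample string.
text = "hello world"
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}
vocab_size = len(chars)

def encode(s):
    """String -> list of token IDs."""
    return [stoi[c] for c in s]

def decode(ids):
    """List of token IDs -> string."""
    return "".join(itos[i] for i in ids)

ids = encode(text)

# A chunk of block_size + 1 tokens gives block_size (context, target) examples,
# with context lengths 1, 2, ..., block_size.
block_size = 4
chunk = ids[:block_size + 1]
examples = [(chunk[:t + 1], chunk[t + 1]) for t in range(block_size)]

# Bigram model: a vocab_size x vocab_size table of scores (logits).
# Row a holds the scores for every token that could follow token a.
logits = [[0.0] * vocab_size for _ in range(vocab_size)]

def softmax(row):
    """Turn one row of scores into probabilities."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

# Training data: every pair of (i-th token, (i+1)-th token).
pairs = list(zip(ids[:-1], ids[1:]))

def mean_loss():
    """Average cross-entropy of the true next token over all pairs."""
    return -sum(math.log(softmax(logits[a])[b]) for a, b in pairs) / len(pairs)

initial_loss = mean_loss()

learning_rate = 1.0  # assumption: chosen for this toy problem
for _ in range(200):
    grads = [[0.0] * vocab_size for _ in range(vocab_size)]
    for a, b in pairs:
        probs = softmax(logits[a])
        for j in range(vocab_size):
            # Gradient of cross-entropy w.r.t. logits: probs - one_hot(target).
            grads[a][j] += (probs[j] - (1.0 if j == b else 0.0)) / len(pairs)
    for i in range(vocab_size):
        for j in range(vocab_size):
            logits[i][j] -= learning_rate * grads[i][j]

final_loss = mean_loss()
```

After training, `softmax(logits[stoi["h"]])` gives the learned next-token distribution for the row of `"h"`. Because the table only ever sees the previous token, longer contexts from `examples` cannot improve a bigram model's predictions; they matter once the model can attend to earlier tokens.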