https://www.youtube.com/watch?v=kCc8FmEb1nY

Given a chunk of block size + 1 tokens, each prefix of length 1, 2, …, block size can be used as an input, with the token that follows it as the target. A single chunk therefore yields block size training examples.

  • The block size is also the maximum context length the model sees when predicting.
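
The prefix idea above can be sketched in a few lines of plain Python. This is only an illustration: the token ids, the `block_size` value of 8, and the helper name `make_examples` are assumptions made up for this note, not taken from the video.

```python
# Hypothetical toy values: the token ids below are made up for illustration.
block_size = 8
data = [18, 47, 56, 57, 58, 1, 15, 47, 58]  # block_size + 1 tokens


def make_examples(chunk, block_size):
    """Return one (context, target) pair per prefix length 1..block_size."""
    assert len(chunk) == block_size + 1
    x = chunk[:block_size]        # inputs
    y = chunk[1:block_size + 1]   # targets: the inputs shifted left by one
    # The context grows from 1 token up to block_size tokens.
    return [(x[:t + 1], y[t]) for t in range(block_size)]


examples = make_examples(data, block_size)
for context, target in examples:
    print(f"input {context} -> target {target}")
```

The longest context has exactly `block_size` tokens, which is why the block size doubles as the context length.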