Premise: these are the ramblings of a machine learning amateur.

  • The Transformer and the AbstractMap I created recently have similar structures.

  • Building on that:

  • What if we replaced the “token vectors” and “linear transformations” in the Transformer’s attention mechanism with “LLM processing” of “natural language sentences”? Would something interesting happen? (The standard mechanism is sketched below for reference.)
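
For reference, this is the standard mechanism the analogy starts from: token vectors and learned linear maps produce Queries, Keys, and Values, and a softmax over Query–Key similarities mixes the Values. A minimal NumPy sketch, not tied to any particular library:

```python
import numpy as np

def scaled_dot_product_attention(X, W_q, W_k, W_v):
    """Standard Transformer attention over token vectors.

    X: (n_tokens, d_model) token vectors
    W_q, W_k, W_v: (d_model, d_head) learned linear transformations
    """
    Q = X @ W_q   # Queries
    K = X @ W_k   # Keys
    V = X @ W_v   # Values
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # Query-Key similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over Keys
    return weights @ V   # each token becomes a weighted mix of Values
```

The question above is what happens if the token vectors and the W matrices are replaced by sentences and LLM calls.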

  • How would we train the “LLM processing” part?

    • For example, by roughly improving the prompts with a genetic-style search (a sketch follows below)
    • Or by backpropagating through it and fine-tuning the LLM used there?
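
A hedged sketch of the “improve prompts genetically” option. `run_llm` and `score` are hypothetical stand-ins for an LLM call and some downstream evaluation; nothing here is a real API, it is only the shape of the loop:

```python
import random

def mutate(prompt, run_llm):
    # Ask the LLM itself to propose a variation of the prompt (hypothetical call).
    return run_llm("Rewrite this instruction, keeping its intent: " + prompt)

def evolve_prompts(seed_prompts, sentences, run_llm, score,
                   generations=10, keep=4):
    """Crude genetic search: score prompts, keep the best, mutate to refill."""
    population = list(seed_prompts)
    for _ in range(generations):
        scored = sorted(
            ((score([run_llm(p + "\n\n" + s) for s in sentences]), p)
             for p in population),
            reverse=True,
        )
        survivors = [p for _, p in scored[:keep]]
        # Refill the population with mutated copies of the survivors.
        population = survivors + [
            mutate(random.choice(survivors), run_llm)
            for _ in range(len(population) - keep)
        ]
    return population[:keep]   # best prompts found so far
```
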
  • Concept:

    • There are 100 sentences.
    • For each sentence, have an LLM generate a sentence to serve as its Query and another to serve as its Key.
      • Turn those into embeddings and measure something like cosine distance between them.
      • Use those similarities as attention weights, so the Key sentences influence the Query sentences (sketched just below).
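
A rough sketch of this concept, under stated assumptions: `llm(prompt)` and `embed(text)` are hypothetical stand-ins for an LLM call and a sentence-embedding model, and the “values” here are simply the original sentence embeddings.

```python
import numpy as np

def sentence_attention(sentences, llm, embed):
    """Attention over sentences instead of token vectors (conceptual sketch)."""
    # An LLM plays the role of W_q and W_k: it writes a Query and a Key sentence.
    queries = [llm(f"Write a question that this sentence answers: {s}") for s in sentences]
    keys = [llm(f"Summarize what this sentence could offer other sentences: {s}") for s in sentences]

    Q = np.array([embed(q) for q in queries])       # (n, d)
    K = np.array([embed(k) for k in keys])          # (n, d)
    V = np.array([embed(s) for s in sentences])     # "values" = original embeddings

    # Cosine similarity between every Query sentence and every Key sentence.
    Qn = Q / np.linalg.norm(Q, axis=1, keepdims=True)
    Kn = K / np.linalg.norm(K, axis=1, keepdims=True)
    scores = Qn @ Kn.T

    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # softmax over Keys

    # Each sentence's representation becomes a weighted mix of all the others.
    return weights @ V
```
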
  • It’s really just an analogy, but if we had to force a justification, the reasoning would be:

    • It’s known that the attention mechanism is useful for understanding natural language.
    • So, at an abstract level, the same mechanism might also be useful here.
  • It seems beneficial if each token corresponds to a large amount of text rather than just a sentence.

    • With 100 documents, if each document influences the others and their meanings seep into one another, the value of each document seems to increase (a rough sketch follows below).
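
A speculative sketch of that “seeping into each other”, assuming a hypothetical `llm(prompt)` call and an `attention_weights(docs)` function (for example, the cosine-similarity weights from the sketch above): each document is rewritten with its most strongly attended-to neighbours as context, for a few rounds.

```python
def cross_pollinate(documents, llm, attention_weights, top_k=3, rounds=2):
    """Let documents influence each other over a few attention-guided rounds."""
    docs = list(documents)
    for _ in range(rounds):
        W = attention_weights(docs)   # (n, n) influence weights between documents
        new_docs = []
        for i, doc in enumerate(docs):
            # The k documents that most strongly influence document i.
            neighbors = sorted((j for j in range(len(docs)) if j != i),
                               key=lambda j: W[i][j], reverse=True)[:top_k]
            context = "\n\n".join(docs[j] for j in neighbors)
            new_docs.append(llm(
                "Rewrite the following document, letting relevant ideas from the "
                f"context seep into it.\n\nDocument:\n{doc}\n\nContext:\n{context}"
            ))
        docs = new_docs
    return docs
```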