dnsod_si666

joined 11 months ago
[–] dnsod_si666@alien.top 1 points 11 months ago (1 children)

RWKV looks awesome

[–] dnsod_si666@alien.top 1 points 11 months ago

You could also use this to measure different models against each other, right? And more generally, use it as a model benchmark (rough sketch in code after the list):

  1. Get dataset of text.
  2. Tokenize dataset.
  3. Measure true probabilities straight from the dataset.
  4. Train model number 1 on tokenized dataset.
  5. Measure KL divergence of model from true probabilities.
  6. Repeat steps 4 and 5 for model number 2.
  7. Compare KL divergence of model 1 to model 2.
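
Something like this (a rough sketch only: the whitespace "tokenizer" is a stand-in, the "true" probabilities are approximated with bigram counts, and model.next_token_probs() is a hypothetical interface, not a real library call):

    # Rough sketch. Assumptions: a toy whitespace "tokenizer", bigram counts as the
    # "true" next-token probabilities, and a hypothetical model.next_token_probs()
    # method that returns {token: probability} for a given previous token.
    from collections import Counter, defaultdict
    import math

    def tokenize(text):
        return text.split()  # stand-in for a real tokenizer

    def empirical_next_token_probs(tokens):
        # Step 3: measure the probabilities straight from the dataset (bigram counts).
        counts = defaultdict(Counter)
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
        return {prev: {tok: n / sum(c.values()) for tok, n in c.items()}
                for prev, c in counts.items()}

    def kl_divergence(p, q, eps=1e-12):
        # Step 5: KL(true || model) over the tokens seen in the dataset for this context.
        return sum(p_tok * math.log(p_tok / max(q.get(tok, 0.0), eps))
                   for tok, p_tok in p.items())

    def benchmark(model, text):
        tokens = tokenize(text)
        true_probs = empirical_next_token_probs(tokens)
        kls = [kl_divergence(p, model.next_token_probs(prev))  # hypothetical model API
               for prev, p in true_probs.items()]
        return sum(kls) / len(kls)

    # Step 7: lower average KL means the model is closer to the measured probabilities.
    # score_1 = benchmark(model_1, dataset_text)
    # score_2 = benchmark(model_2, dataset_text)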

-Separate Idea- Also, isn't getting the true probabilities useful anyway? Because then the training process could be (see the sketch after this list):

  1. Get dataset.
  2. Tokenize.
  3. Get true probabilities.
  4. Train on probabilities instead of directly on the tokens.
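
A minimal training-step sketch of step 4, assuming PyTorch. `model`, `optimizer`, and the shape of `context` are hypothetical; `target_probs` is a measured probability vector over the vocabulary, like the ones built in the benchmark sketch above:

    import torch.nn.functional as F

    def train_step(model, optimizer, context, target_probs):
        logits = model(context)  # hypothetical model call
        # Cross-entropy against the measured distribution instead of a one-hot token.
        loss = -(target_probs * F.log_softmax(logits, dim=-1)).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()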

Like instead of training twice (sequence to probabilities):

  1. sequence1 -> [1, 0]
  2. sequence1 -> [0, 1]

you train it once with:

  3. sequence1 -> [0.5, 0.5]

So you are training on less data, which would reduce training costs and whatnot.
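
A tiny numeric check of that (plain Python, with a made-up model output): averaging the two one-hot targets gives the same loss as the single soft target, because cross-entropy is linear in the target probabilities.

    import math

    def cross_entropy(target, predicted):
        return -sum(t * math.log(q) for t, q in zip(target, predicted))

    predicted = [0.7, 0.3]  # whatever the model currently outputs for sequence1
    two_passes = (cross_entropy([1, 0], predicted) +
                  cross_entropy([0, 1], predicted)) / 2
    one_pass = cross_entropy([0.5, 0.5], predicted)
    print(two_passes, one_pass)  # both ~0.78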

[–] dnsod_si666@alien.top 1 points 11 months ago

This may be a dumb question, but why do we use any sampling modifications at all? Doesn't that defeat the purpose of training the model to learn those probabilities?
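
To make the question concrete, this is roughly what those modifications do to the distribution the model learned (plain Python sketch; temperature and top-k as examples, numbers made up):

    import math

    def modify(probs, temperature=1.0, top_k=None):
        logits = [math.log(p) / temperature for p in probs]  # temperature sharpens/flattens
        if top_k is not None:
            cutoff = sorted(logits, reverse=True)[top_k - 1]
            logits = [l if l >= cutoff else float("-inf") for l in logits]  # drop the tail
        m = max(logits)
        exps = [math.exp(l - m) for l in logits]
        total = sum(exps)
        return [e / total for e in exps]

    learned = [0.5, 0.3, 0.15, 0.05]                   # what the model was trained to output
    print(modify(learned, temperature=0.7, top_k=3))   # no longer the learned distribution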