You could also use this to compare different models against each other, right? And more generally, use it as a model benchmark:
- Get dataset of text.
- Tokenize dataset.
- Measure true probabilities straight from the dataset.
- Train model number 1 on tokenized dataset.
- Measure KL divergence of model from true probabilities.
- Repeat steps 4 and 5 for model number 2.
- Compare KL divergence of model 1 to model 2.
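A minimal sketch of the benchmark steps above, assuming the "true probabilities" are just next-token frequencies counted per context in the dataset, with a very short context so that contexts actually repeat. The helper names (`empirical_next_token_probs`, `kl_divergence`, `mean_kl`) and the two toy "models" are hypothetical stand-ins, not a real training setup:

```python
from collections import Counter, defaultdict
import math


def empirical_next_token_probs(tokens, context_len=1):
    """Count next-token frequencies for every context seen in the dataset."""
    counts = defaultdict(Counter)
    for i in range(context_len, len(tokens)):
        context = tuple(tokens[i - context_len:i])
        counts[context][tokens[i]] += 1
    return {
        ctx: {tok: n / sum(c.values()) for tok, n in c.items()}
        for ctx, c in counts.items()
    }


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two dicts mapping token -> probability."""
    return sum(
        pi * math.log(pi / max(q.get(tok, 0.0), eps))
        for tok, pi in p.items()
        if pi > 0
    )


def mean_kl(true_probs, model):
    """Unweighted mean of KL(true || model) over all observed contexts."""
    total = sum(kl_divergence(p, model(ctx)) for ctx, p in true_probs.items())
    return total / len(true_probs)


# Toy "tokenized dataset" (assumption: tokens are plain strings here).
tokens = ["a", "b", "a", "c", "a", "b", "a", "c"]
true_probs = empirical_next_token_probs(tokens, context_len=1)

# Hypothetical model 1: reproduces the dataset frequencies exactly.
def model_1(ctx):
    return true_probs[ctx]

# Hypothetical model 2: always predicts a uniform distribution.
def model_2(ctx):
    return {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}

kl_model_1 = mean_kl(true_probs, model_1)
kl_model_2 = mean_kl(true_probs, model_2)
print("model 1:", kl_model_1)
print("model 2:", kl_model_2)
```

The lower mean KL wins. One caveat with this sketch: with long contexts most contexts show up only once in the data, so the counted "true" distribution collapses to one-hot and the comparison becomes much less informative.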
Separate idea: isn’t getting the true probabilities useful anyway? Then the training process could be:
- Get dataset.
- Tokenize.
- Get true probabilities.
- Train on probabilities instead of directly on the tokens.
Like instead of training twice (sequence to probabilities):
- sequence1 -> [1, 0]
- sequence1 -> [0, 1]
You train it once with:
- sequence1 -> [0.5, 0.5]
So you are training on fewer examples, which should reduce training costs.
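A quick sanity check of the [1, 0] / [0, 1] vs. [0.5, 0.5] example, assuming a softmax output trained with cross-entropy loss (that loss choice is my assumption, it isn't stated above). Cross-entropy is linear in the target, so the averaged loss and gradient over the two one-hot targets should match the single soft target. The function names and logits are made up for illustration:

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, logits):
    """Cross-entropy between a target distribution and softmax(logits)."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)


def cross_entropy_grad(target, logits):
    """Gradient of the cross-entropy with respect to the logits."""
    probs = softmax(logits)
    target_mass = sum(target)
    return [p * target_mass - t for p, t in zip(probs, target)]


# Arbitrary model output for sequence1 (hypothetical values).
logits = [0.3, -1.2]

# Training twice: average the loss and gradient over both one-hot targets.
loss_two = 0.5 * (cross_entropy([1, 0], logits) + cross_entropy([0, 1], logits))
grad_two = [
    0.5 * (a + b)
    for a, b in zip(
        cross_entropy_grad([1, 0], logits),
        cross_entropy_grad([0, 1], logits),
    )
]

# Training once: a single soft target.
loss_soft = cross_entropy([0.5, 0.5], logits)
grad_soft = cross_entropy_grad([0.5, 0.5], logits)

print(loss_two, loss_soft)
print(grad_two, grad_soft)
```

The match is exact only when both one-hot examples are evaluated at the same weights (e.g. in the same batch). If they are separate SGD steps, the weights change in between, so the two procedures are close but not identical. And the saving only shows up for sequences that actually repeat in the dataset.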
RWKV looks awesome