Long time lurker here. Made an account just to post this.
I've been experimenting with some modifications to the transformer architecture (the addition of a new standalone component).
Recently I arrived at something that seems to improve validation loss by ~25-30% over a vanilla decoder-only transformer; the task is next-token prediction.
My question is whether this is significant enough to justify more serious effort (e.g. getting more compute credits to train a bigger model, running a beefier benchmark, sharing it with folks in academia for feedback, writing a paper, etc.) or whether it's likely a fluke.
In terms of methodology, I've compared vanilla vs. modified on 3 datasets (in increasing order of difficulty): Penn Treebank, Lord of the Rings, and the complete works of Shakespeare. The datasets are small enough that the results can be verified quickly on any laptop.
I've also kept everything else identical across the 2 variants (vocab size, embedding dim, number of layers, layer norm and residual connection positions, etc.).
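For concreteness, the shared settings look roughly like this (a minimal sketch; the values below are illustrative placeholders, not my actual settings):

```python
# Shared hyperparameters, held fixed for both the vanilla and the modified run.
# The specific values below are illustrative placeholders, not my real settings.
shared_config = dict(
    vocab_size=65,    # tokenizer / vocab size
    n_embd=256,       # embedding dimension
    n_layer=6,        # number of transformer blocks
    n_head=8,         # attention heads
    block_size=256,   # context length
)
```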
With the new component added, however, the model has ~130% more parameters when everything else is held equal (from 800K on vanilla to 1.8M on the modified version). To compensate, I also increased the number of layers in the vanilla model to bring its parameter count up to the same level, and the improvement is still noticeable.
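The parameter-matched baseline is built roughly along the lines of the sketch below: count trainable parameters with the usual PyTorch idiom and keep adding layers to the vanilla model until it catches up (`build_vanilla` / `build_modified` are hypothetical stand-ins for my two model constructors):

```python
import torch.nn as nn

def count_params(model: nn.Module) -> int:
    # Standard PyTorch idiom: total number of trainable parameter elements.
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# `build_vanilla` / `build_modified` are hypothetical stand-ins for the two constructors.
modified = build_modified(**shared_config)
target = count_params(modified)  # roughly 1.8M here

# Deepen the vanilla model until its parameter count reaches (or passes) the target.
cfg = dict(shared_config)
vanilla = build_vanilla(**cfg)
while count_params(vanilla) < target:
    cfg["n_layer"] += 1
    vanilla = build_vanilla(**cfg)
```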
Below are the loss comparisons after 100 iterations for both vanilla and modified across the 3 datasets.
I'd appreciate any input you may have! What next steps, if any, do you recommend? For background, I'm a software engineer by day and have been a neural net enthusiast by night since college (more than 10 years ago). I'm loosely connected with some folks who could weigh in, but I'd appreciate the community's feedback before nagging them and getting more serious about this :)
# Lord of the Rings
**vanilla**:
step 100 evaluated train loss = 2.9514, valid loss = **2.9790**
step 100 evaluated train loss = 2.8528, valid loss = **2.8742** (w/ more layers => 10% more params than modified)
**modified**:
step 100 evaluated train loss = 2.1858, valid loss = **2.1094**
# Shakespeare's works
**vanilla**:
step 100 evaluated train loss = 3.1380, valid loss = **3.1767**
step 100 evaluated train loss = 2.9478, valid loss = **2.9677** (w/ more layers => 10% more params than modified)
**modified**:
step 100 evaluated train loss = 2.2036, valid loss = **2.2190**
# Penn Treebank
**vanilla**:
step 100 evaluated train loss = 2.7331, valid loss = **2.7417**
step 100 evaluated train loss = 2.8184, valid loss = **2.5611** (w/ 10 layers => 10% more params than modified)
**modified**:
step 100 evaluated train loss = 2.0061, valid loss = **2.0184**