Could you share a bit more about your constraints? If space is the issue, you might be better off with a more memory-friendly model, like an LSTM. Some of these models even give you per-token attention.
There's a really interesting SparkFun video, which I'll try to track down, showing a question-answering model (some sort of BERT, I think) running on a Raspberry Pi Zero-type chip with 25-50 MB of flash memory.
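To get a rough feel for whether an LSTM would fit in a flash budget like that, here is a minimal back-of-the-envelope sketch. It assumes the textbook LSTM layout (four gates, one bias vector per gate; some frameworks store two biases per gate, so their counts come out slightly higher). The function names and the example sizes (10,000-token vocabulary, 128-dim embeddings, 256 hidden units, 2 layers) are made up for illustration and are not taken from any particular model.

```python
def lstm_param_count(input_size: int, hidden_size: int, num_layers: int = 1) -> int:
    """Parameters in a stacked unidirectional LSTM, one bias vector per gate."""
    total = 0
    layer_input = input_size
    for _ in range(num_layers):
        # Four gates, each with input weights, recurrent weights, and a bias.
        total += 4 * (hidden_size * (layer_input + hidden_size) + hidden_size)
        # Layers after the first take the previous layer's hidden state as input.
        layer_input = hidden_size
    return total


def size_mb(param_count: int, bytes_per_param: int = 4) -> float:
    """Storage in megabytes (10^6 bytes); 4 bytes = float32, 1 byte = int8."""
    return param_count * bytes_per_param / 1e6


# Hypothetical model sizes, chosen only for illustration.
vocab_size, embed_dim, hidden, layers = 10_000, 128, 256, 2

embedding_params = vocab_size * embed_dim
lstm_params = lstm_param_count(embed_dim, hidden, layers)
total_params = embedding_params + lstm_params

print(f"total parameters: {total_params:,}")
print(f"float32: {size_mb(total_params):.1f} MB")
print(f"int8:    {size_mb(total_params, bytes_per_param=1):.1f} MB")
```

With these made-up sizes the weights come to roughly 9 MB in float32 and a bit over 2 MB if quantized to int8, so the embedding table and the hidden size are the knobs to watch. This only counts weights; runtime activations and the inference library itself need room too.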