Well, but how do you marry this with batching so that flash attention kernels can still work with it? Any complicated attention mask makes supporting batching hard.
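To make the difficulty concrete, here is a minimal sketch (all shapes and names are illustrative, not from the original) of what a batched, per-sequence attention mask looks like when handed to PyTorch's `scaled_dot_product_attention`. Whether a fused flash kernel is actually used depends on the mask pattern and backend; with an arbitrary boolean mask the call may fall back to a slower path, which is exactly the tension described above.

```python
import torch
import torch.nn.functional as F

# Hypothetical setup: a batch of 2 sequences with different valid lengths,
# padded to a common length L so they can be batched together.
B, H, L, D = 2, 4, 8, 16
valid_lens = torch.tensor([8, 5])  # real (non-padding) tokens per sequence

q = torch.randn(B, H, L, D)
k = torch.randn(B, H, L, D)
v = torch.randn(B, H, L, D)

# Per-sequence mask: causal, and also blocking attention to padding keys.
causal = torch.tril(torch.ones(L, L, dtype=torch.bool))       # (L, L)
not_pad = torch.arange(L)[None, :] < valid_lens[:, None]      # (B, L)
attn_mask = causal[None, None] & not_pad[:, None, None, :]    # (B, 1, L, L)

# With a custom boolean mask like this, the flash kernel may not be eligible
# and the implementation can silently pick a less efficient backend.
out = F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)
print(out.shape)  # torch.Size([2, 4, 8, 16])
```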