[–] Forsaken-Data4905@alien.top 1 points 2 years ago

The point is that the adapted (frozen) layers have a significantly higher parameter count than the adapter layers, which is what leads to the huge memory savings. You never take gradients with respect to the frozen layers, only with respect to the adapter layers and whatever trainable parts of the original model remain.
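A rough back-of-the-envelope sketch of why this saves memory, assuming a LoRA-style setup with hypothetical dimensions (a single d x d weight matrix adapted by rank-r factors, Adam as the optimizer):

```python
# Hypothetical LoRA-style setup: one d x d weight matrix, frozen,
# adapted with low-rank factors A (d x r) and B (r x d).
d, r = 4096, 8

base_params = d * d              # frozen original weights
adapter_params = d * r + r * d   # trainable low-rank adapters

# With Adam, each *trainable* parameter needs a gradient plus two
# optimizer-state values (~3 extra floats per parameter). Frozen
# parameters need none of this.
full_ft_extra = 3 * base_params    # full fine-tuning
lora_extra = 3 * adapter_params    # adapters only

print(f"base parameters:    {base_params:,}")
print(f"adapter parameters: {adapter_params:,}")
print(f"training-state floats, full fine-tune: {full_ft_extra:,}")
print(f"training-state floats, adapters only:  {lora_extra:,}")
print(f"ratio: {full_ft_extra // lora_extra}x")
```

With these (made-up) numbers the adapters carry d*r*2 = 65,536 parameters against 16,777,216 frozen ones, so gradient and optimizer memory shrinks by a factor of d/(2r) = 256. At small d the adapter overhead is no longer negligible relative to the base weights, which is the caveat in the last sentence.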

This is of course not necessarily true for smaller models.