this post was submitted on 16 Nov 2023

Machine Learning

I am currently trying to build small convolutional regression models under very tight model-size constraints (at most a few thousand parameters).

Are there any rules of thumb, gold standards, or best practices to consider here? E.g., should I prefer depth over width, do skip connections add anything at these small scales, are there any special training hacks that might boost performance, etc.?

Any hints or pointers on where to look are greatly appreciated.

semicausal@alien.top · 10 months ago

In my experience, it honestly depends on what you're trying to have the models learn and the task at hand.

- Spend lots of time cleaning up your data and doing feature engineering. Regulated industries like insurance, for example, spend significantly more time on feature engineering than on tuning fancy models.

- I would recommend trying simple regression and random forest models first, or even XGBoost.
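
A minimal sketch of what that baseline loop might look like (scikit-learn with synthetic placeholder data; XGBoost's `XGBRegressor` would slot in the same way):

```python
# Hypothetical baseline comparison for a regression task:
# a linear model and a random forest, scored with cross-validation.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Placeholder data standing in for the real problem.
X, y = make_regression(n_samples=500, n_features=20, noise=0.1, random_state=0)

models = {
    "ridge": Ridge(alpha=1.0),
    "random_forest": RandomForestRegressor(n_estimators=200, random_state=0),
    # "xgboost": xgboost.XGBRegressor(...) would be added the same way
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
    print(f"{name}: MAE = {-scores.mean():.3f}")
```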

qalis@alien.top · 10 months ago

(I assume you are talking about convolutional models in the context of computer vision)

I had similar constraints (embedded devices in a specific environment) and we didn't use deep learning at all. Instead, we used classical image descriptors from OpenCV, like color histograms, HOG, SIFT, etc., with an SVM as the classifier. It can work surprisingly well for many problems, and it is blazing fast.
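
A rough sketch of that kind of pipeline, adapted to the regression setting of the question (HOG computed here via scikit-image rather than OpenCV, and SVR standing in for the SVM classifier; the data is a random placeholder):

```python
# Classical descriptor + SVM pipeline: HOG features into a support
# vector regressor. All shapes and data below are illustrative.
import numpy as np
from skimage.feature import hog
from sklearn.svm import SVR

def extract_features(images):
    # images: array of shape (n, H, W), grayscale in [0, 1]
    return np.array([
        hog(img, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
        for img in images
    ])

rng = np.random.default_rng(0)
images = rng.random((100, 64, 64))   # placeholder images
targets = rng.random(100)            # placeholder regression targets

model = SVR(kernel="rbf", C=1.0)
model.fit(extract_features(images), targets)
```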

Consider how you can make the problem easier. Maybe you can do binary classification instead of multiclass, or use only grayscale images. Anything that will make the task itself easier will be a good improvement.

If your problem absolutely requires neural networks, I would use all tools available:

  1. Skip connections, either residuals or to all layers (like DenseNet)
  2. Sharpness-Aware Minimization (SAM) or one of its variants
  3. Label smoothing
  4. Data augmentation with a few genuinely problem-relevant transformations
  5. Extensive hyperparameter tuning with a Gaussian Process or the multivariate Tree-structured Parzen Estimator (see e.g. Optuna)
  6. You can concatenate those classical features, like color histograms or HOG, to the flattened output of the CNN, before the MLP head. This reduces what the CNN needs to learn, so you can get away with fewer parameters
  7. Go for more convolutional layers instead of a large MLP head. Convolutional layers eat up far less of the parameter budget than MLPs. (Items 1, 6, and 7 are sketched together after this list.)
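
A minimal sketch of items 1, 6, and 7 combined, assuming PyTorch; all layer sizes are illustrative, picked only to land under a few thousand parameters:

```python
# Tiny residual CNN whose flattened output is concatenated with a
# precomputed classical feature vector (e.g. a color histogram)
# before a small linear regression head.
import torch
import torch.nn as nn

class TinyResCNN(nn.Module):
    def __init__(self, classical_dim=16):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 8, 3, padding=1)
        self.conv2 = nn.Conv2d(8, 8, 3, padding=1)      # residual block
        self.pool = nn.AdaptiveAvgPool2d(4)             # -> 8 * 4 * 4 = 128 features
        self.head = nn.Linear(128 + classical_dim, 1)   # regression output

    def forward(self, x, classical_feats):
        x = torch.relu(self.conv1(x))
        x = torch.relu(x + self.conv2(x))               # skip connection
        x = self.pool(x).flatten(1)
        return self.head(torch.cat([x, classical_feats], dim=1))

model = TinyResCNN()
print(sum(p.numel() for p in model.parameters()))       # ~800 with these sizes
```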

You can also consider training a larger network and then applying compression techniques, such as knowledge distillation, quantization or pruning.
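
For the distillation route, a sketch of what the training objective could look like for regression (`teacher` and `student` are assumed to be existing `nn.Module` regressors, and `alpha` is a hypothetical mixing weight):

```python
import torch.nn.functional as F

def distillation_loss(student_out, teacher_out, target, alpha=0.5):
    # Blend the supervised loss with a term pulling the student
    # toward the (frozen) teacher's predictions.
    return (alpha * F.mse_loss(student_out, target)
            + (1 - alpha) * F.mse_loss(student_out, teacher_out))

# Inside the training loop (teacher frozen, e.g. under torch.no_grad()):
#   teacher_out = teacher(x)
#   loss = distillation_loss(student(x), teacher_out, y)
#   loss.backward()
```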

txhwind@alien.top · 10 months ago

If you have a lot of data, you can try training larger models first and then solve the model-size problem with any of the various model compression or inference optimization methods.

Seankala@alien.top · 10 months ago

TL;DR The more constraints on the model, the more time you should spend analyzing your data and formulating your problem.

I'll agree with the top comment. I've also had to deal with a problem at work where we were trying to perform product-name classification for our e-commerce product. The problem was that we couldn't afford anything too large or anything that would increase infrastructure costs (i.e., if possible, we didn't want to use any more GPU compute than we already were).

It turns out that extensive EDA was what saved us. We were able to come up with a string-matching algorithm sophisticated enough that it achieved high precision with practically no latency concerns. Might not be as flexible as something like BERT but it got the job done.

fuankarion@alien.top · 10 months ago

A few thousand parameters is very little. I would have a look at models like LeNet, which are extremely small but have proven effective. Maybe start by copying the architecture and then reduce the number of filters until you hit your target parameter count.
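
A sketch of that approach, assuming PyTorch and 32x32 grayscale inputs; the sizes below are LeNet-5's layout shrunk arbitrarily, not tuned:

```python
# LeNet-5-style layout with filter counts and hidden sizes reduced
# to approach a budget of a few thousand parameters.
import torch.nn as nn

tiny_lenet = nn.Sequential(
    nn.Conv2d(1, 4, 5), nn.ReLU(), nn.MaxPool2d(2),   # LeNet uses 6 filters; 4 here
    nn.Conv2d(4, 8, 5), nn.ReLU(), nn.MaxPool2d(2),   # LeNet uses 16 filters; 8 here
    nn.Flatten(),
    nn.Linear(8 * 5 * 5, 16), nn.ReLU(),              # assumes 32x32 inputs
    nn.Linear(16, 1),                                 # regression head
)
print(sum(p.numel() for p in tiny_lenet.parameters()))  # ~4k with these sizes
```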
