Machine Learning

1 readers

1 users here now

Community Rules:

Be nice. No offensive behavior, insults or attacks: we encourage a diverse community in which members feel safe and have a voice.
Make your post clear and comprehensive: posts that lack insight or effort will be removed. (ex: questions which are easily googled)
Beginner or career related questions go elsewhere. This community is focused in discussion of research and new projects that advance the state-of-the-art.
Limit self-promotion. Comments and posts should be first and foremost about topics of interest to ML observers and practitioners. Limited self-promotion is tolerated, but the sub is not here as merely a source for free advertisement. Such posts will be removed at the discretion of the mods.

founded 2 years ago

MODERATORS

communick@academy.garden

[P] Model training bottlenecked by CPU. (alien.top)

submitted 2 years ago by AdSignificant9235@alien.top to c/machinelearning@academy.garden

8 comments fedilink hide all child comments

Recently, I've been working on some projects for fun, trying out some things I hadn't worked with before, such as profiling.

But after profiling my code, I found out that my average GPU activity is around 50%. Apparently, the code frequently hangs for a few hundred milliseconds on the dataloader process. I've tried a few things in the dataloader: increasing/decreasing the number of workers, setting pin-memory to true or false, but neither seems to really matter. I have an NVME drive, so the disk is not the problem either. I've concluded that the bottleneck must be the CPU.

Now, I've read that pre-processing the data might help, so that the dataloader doesn't have to decode the images, for example, but I don't really know how to go about this. I have around 2TB of NVME storage, and I've got a couple datasets on the disk (ImageNet and INaturalist are the two biggest ones), so I don't suppose I'll be able to store them on the disk uncompressed.

Is there anything I can do to lighten the load on the CPU during training so that I can take advantage of the 50% of the GPU that I'm not using at the moment?

top 8 comments

sorted by: hot top controversial new old

[–] Celmeno@alien.top 1 points 2 years ago

Might be cache misses or RAM access times. NVMe is mind boggling slow compared to L2 cache

[–] RetroPenguin_@alien.top 1 points 2 years ago

Does your GPU util peak at 100%? If not, increase the batch size until it does (roughly). A couple ideas: do any transforms on the GPU or before starting your training job, have the CPU be solely responsible for loading images from disc. Increase the number of workers to the number of CPUs you have.

[–] Main_Path_4051@alien.top 1 points 2 years ago

Which Api do you use ?

[–] Coarchitect@alien.top 1 points 2 years ago

Number of workers should be the number of cpus. The preprocessing should be done in the dataset class not data loader.

[–] AtharvBhat@alien.top 1 points 2 years ago

I've dealt with similar issues in my own projects.

A couple of pointers :-

Use Image formats that are fast to decode for example BMP ( you can try converting all your images to BMP before you start training ) This will increase their size on disk but should reduce the CPU load. If you are doing any complex preprocessing on large images in your dataset class, try preprocessing images first and storing them to disk and loading those directly

These are just some general suggestions. It'd be more helpful if we knew more about your task so that we can offer more directed suggestions :)

[–] DidgeridooMH@alien.top 1 points 2 years ago

Check your VRAM usage. If it's above your dedicated VRAM, then you may be waiting for data to transfer from shared (system RAM) to your GPUs dedicated memory. You can also check the task manager while training to see the copy buffer usage and see if that's constantly spiking throughout training.

[–] btcmx@alien.top 1 points 2 years ago

As other have rightly pointed out, verify you're using the Data Loader the right way. Ideally you need to create a custom dataset (in PyTorch terms) and apply all the transformations in this custom dataset. This might be helpful. Also, have you tried PyTorch Lightning?

[–] arg_max@alien.top 1 points 2 years ago

CPU bottlenecks can be easily found by monitoring the CPU usage during training. If all of your cores are constantly at 100% your cpu might be too slow. If both the cpu and gpu are idle from time to time your storage could be the bottleneck.

To increase data loading performance, you could try out Nvidias Dali or FFCV which are both libraries optimized for that purpose. They replace some of the inefficient python code with highly optimised code. FFCV is quite nice but it requires you to convert your dataset into a specific format.