Torch.multiprocessing Best Practices
However, virtual memory is only one side of the story. What if the issue doesn't go away even after adjusting the swap disk?
The other side of the story is the underlying issues of the torch.multiprocessing module. There are a number of best-practice recommendations in the official PyTorch documentation.
But in addition to those, three more approaches should be considered, especially regarding memory usage.
The first thing is shared memory leakage. Leakage means that memory isn't released properly after each run of the child worker, and you can observe this phenomenon when you monitor virtual memory usage at runtime. Memory consumption keeps growing and eventually reaches the point of being "out of memory." This is a very typical memory leak.
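For instance, a quick way to watch for this at runtime (a minimal sketch, assuming the third-party psutil package is installed; the function name is just illustrative) is to log the process memory every few steps:

import psutil

def log_memory(step):
    # RSS (resident) and VMS (virtual) memory of the current process, in GB
    mem = psutil.Process().memory_info()
    print(f"step {step}: rss={mem.rss / 1e9:.2f} GB, vms={mem.vms / 1e9:.2f} GB")

If these numbers keep climbing across epochs and never drop back, you are most likely looking at this kind of leak.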
So what causes the leakage?
Let's take a look at the DataLoader class itself:
https://github.com/pytorch/pytorch/blob/main/torch/utils/data/dataloader.py
Looking under the hood of DataLoader, we'll see that when num_workers > 0, _MultiProcessingDataLoaderIter is used. Inside _MultiProcessingDataLoaderIter, torch.multiprocessing creates the worker queues. torch.multiprocessing uses two different strategies for memory sharing and caching: file_descriptor and file_system. While file_system requires no file descriptor caching, it is prone to shared memory leaks.
To check which sharing strategy your machine is using, simply add this to your script:
print(torch.multiprocessing.get_sharing_strategy())
To get your system's file descriptor limit (Linux), run the following command in the terminal:
ulimit -n
To switch your sharing strategy to file_descriptor:
torch.multiprocessing.set_sharing_strategy('file_descriptor')
To count the number of open file descriptors, run the following command:
ls /proc/self/fd | wc -l
As long as the system allows it, the file_descriptor strategy is recommended.
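Putting these pieces together, the sharing strategy can be chosen at the top of the training script; the sketch below is one possible way to do it (the 4096 threshold is an arbitrary illustration, not an official recommendation):

import resource
import torch.multiprocessing as mp

# Soft limit on open file descriptors for this process (Linux)
soft_limit, hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)

# Prefer file_descriptor sharing when the limit is generous enough,
# otherwise fall back to file_system
if soft_limit >= 4096:
    mp.set_sharing_strategy("file_descriptor")
else:
    mp.set_sharing_strategy("file_system")

print(mp.get_sharing_strategy(), "soft fd limit:", soft_limit)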
The second is the multiprocessing worker start method. Simply put, it's the debate over whether to use fork or spawn to start the workers. Fork is the default way to start multiprocessing on Linux; it can avoid certain file copying, so it's much faster, but it may have issues handling CUDA tensors and third-party libraries such as OpenCV in your DataLoader.
To use the spawn method, you can simply pass the argument multiprocessing_context="spawn" to the DataLoader.
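For example (a minimal sketch; the dataset, batch size, and worker count are placeholders):

from torch.utils.data import DataLoader

# "spawn" starts each worker in a fresh interpreter, which avoids
# fork-related issues with CUDA tensors and libraries like OpenCV
loader = DataLoader(
    dataset,                # any torch.utils.data.Dataset
    batch_size=32,
    num_workers=4,
    multiprocessing_context="spawn",
)

Note that spawn makes worker startup slower and requires the dataset object to be picklable, which ties into the next point.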
The third is to make the Dataset objects picklable/serializable.
There's a very good post that further discusses the "copy-on-read" effect of process forking: https://ppwwyyxx.com/blog/2022/Demystify-RAM-Usage-in-Multiprocess-DataLoader/
Simply put, it's no longer a good approach to create a list of filenames and load them in the __getitem__ method. Create a numpy array or pandas dataframe to store the list of filenames for serialization purposes. And if you're familiar with HuggingFace, using a CSV/dataframe is the recommended way to load a local dataset: https://huggingface.co/docs/datasets/v2.19.0/en/package_reference/loading_methods#datasets.load_dataset.example-2
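As a rough illustration of the idea (a sketch, not the exact implementation from the post above): instead of keeping a Python list of strings, store the filenames in a numpy array, which lives in one contiguous buffer and therefore isn't touched by per-object refcount updates in the workers.

import numpy as np
from torch.utils.data import Dataset

class FileListDataset(Dataset):
    """Stores filenames in a numpy array instead of a Python list."""

    def __init__(self, filenames):
        # A fixed-width string array avoids the copy-on-read growth caused
        # by refcounting millions of Python string objects in each worker
        self.filenames = np.array(filenames)

    def __len__(self):
        return len(self.filenames)

    def __getitem__(self, idx):
        path = str(self.filenames[idx])
        # load and return the actual sample from `path` here (omitted)
        return path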