I am training a neural network on a Nano using Python 3.6 and CUDA. However, my process gets killed. If I run the same code on OS X, the script works fine.
device = torch.device("cuda" if (torch.cuda.is_available() and in_args.gpu == "gpu") else "cpu")
I get the output below when monitoring performance with tegrastats (1000 ms interval). I believe the 64GB swap file is working, as are CUDA and the GPU.
Would you mind retrying it without the swap file?
The swap memory is only accessible by the CPU.
However, since Jetson’s memory is shared, there is some possibility that the swap memory is being used as GPU memory.
It’s recommended to first check whether this issue still occurs without using swap memory.
This may require you to decrease the batch size so the model fits into native memory.
So I guess I am asking too much of the Nano. I turned swap back on so some non-GPU data may be swapped, reduced the batch size, and will see how far this gets me.
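A common pattern for finding how far you can push the batch size is to halve it on every out-of-memory failure. A minimal sketch, where `train_one_epoch` is a hypothetical callable standing in for your own training loop (with PyTorch you would typically also call `torch.cuda.empty_cache()` after catching the error):

```python
def find_max_batch_size(train_one_epoch, start=64):
    """Halve the batch size until an epoch runs without an OOM error.

    `train_one_epoch` is a hypothetical callable that takes a batch size
    and raises a RuntimeError containing "out of memory" when it does not fit.
    """
    bs = start
    while bs >= 1:
        try:
            train_one_epoch(bs)
            return bs
        except RuntimeError as e:
            if "out of memory" not in str(e).lower():
                raise  # not an OOM error; don't swallow it
            bs //= 2  # with PyTorch, also call torch.cuda.empty_cache() here
    raise RuntimeError("even batch size 1 does not fit")
```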
If you have the memory pointer, you could use ctypes to call mlock and prevent that memory from being swapped. The post you linked only states that it is not possible to lock a Python object.
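A sketch of what that could look like (Linux/macOS), using a plain ctypes buffer as a stand-in for the real allocation; with a PyTorch CPU tensor you would get the address from `tensor.data_ptr()` and the size from `tensor.numel() * tensor.element_size()`:

```python
import ctypes
import ctypes.util

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)

# Stand-in buffer for the memory you want to pin.
buf = ctypes.create_string_buffer(4096)
addr = ctypes.addressof(buf)
size = ctypes.sizeof(buf)

# mlock(2) pins the pages backing [addr, addr+size) into RAM,
# so the kernel will not swap them out.
if libc.mlock(ctypes.c_void_p(addr), ctypes.c_size_t(size)) != 0:
    raise OSError(ctypes.get_errno(), "mlock failed")

# ... work with the pinned buffer ...

# Release the lock when done.
libc.munlock(ctypes.c_void_p(addr), ctypes.c_size_t(size))
```

Note that the amount you can lock is capped by RLIMIT_MEMLOCK (`ulimit -l`), so pinning a whole training workload this way may need that limit raised.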
But first: Did you make sure it really is an Out-Of-Memory kill and not a segfault or something? IIRC, OOM kills are logged in the dmesg output. If it is an OOM kill, deactivating swap would only make it worse, not better. I would think the CUDA library makes sure that memory pages shared between the GPU and CPU are prevented from swapping. But if the library does not take care of this, the result would be incorrect computations or segmentation faults since the GPU would work on incorrect memory.
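To check, here is a rough sketch that scans dmesg output for OOM-killer messages. The exact wording varies between kernel versions, so the regex matches both the older "Kill process" and the newer "Killed process" forms:

```python
import re

# Kernel OOM-killer messages look roughly like:
#   "Out of memory: Killed process 1234 (python3) ..."          (newer kernels)
#   "Out of memory: Kill process 1234 (python3) score 987 ..."  (older kernels)
OOM_RE = re.compile(r"Out of memory: Kill(?:ed)? process (\d+) \(([^)]+)\)")

def find_oom_kills(dmesg_text):
    """Return (pid, process name) tuples for every OOM kill in dmesg output."""
    return [(int(m.group(1)), m.group(2)) for m in OOM_RE.finditer(dmesg_text)]

# To feed it real data (may need sudo on some systems):
#   log = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
#   print(find_oom_kills(log))
```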
What is the actual error message that comes up when the script gets killed?
Yes, you’re running out of memory. The kernel OOM killer chooses your training process to kill.
In general, swap is never a good solution – modern Linux systems often choose to run without it entirely. (In fact, Kubernetes will even fail to start a node if swap is enabled!)
If you try to run a workload that’s bigger than the available RAM, it will fail. Get a bigger computer, or make the workload smaller.
In general, swap is a good solution and you should not disable it. Kubernetes refuses to start a node with swap enabled only because it is too difficult for the developers to handle swap correctly (e.g., enforcing memory limits correctly). For a normal machine, disabling swap does not make any sense.
What you do not want is swap thrashing, but that is not what the OP complained about. He might even get away with increasing the swap space to see whether that allows his script to run through.
I have done systems programming for 25 years, and I can tell you: Swap is bad.
The sales people from the early computer era were right: “Virtual memory is a way to sell real memory.”
For any system where you need to guarantee performance and behavior, swap adds unacceptable uncontrollable factors.
Note that virtual memory mapping is great! Similarly, demand paging of position-independent shared libraries may be acceptable, depending on your particular performance needs. But that is not the same thing as swap, which actually pages dirty memory out to disk.
The Linux kernel will already overcommit on memory, and if it turns out you ACTUALLY need all the memory it “promised” to you, it will kill you … or some other process. Whichever process the OOM killer decides to kill. If you have a system that needs to actually provide defined services to defined customers, like almost every server and embedded system on the planet, any uncertainty about this process is just bad, period.
I too prefer to run completely without swap. However, I cheat: I keep a swap file or partition configured (the capability to swap) most of the time, but only enable it when absolutely necessary. Someone may know in advance that they will need it, but if that isn’t the case, it is best to avoid having swap actually enabled.
I keep swap on but turn vm.swappiness way down on my devices with flash storage. @arne.caspari’s link says you should use a high value, but that will wear the flash faster, and I’m not sure you gain any performance from it given that even a very fast SSD is still much slower than RAM. Using 100 as the swappiness seems like bad advice for most workloads: 10 is the value Red Hat recommends for database workloads, and 60 is the default most distros use.
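For reference, persisting a lower swappiness across reboots is done with a sysctl setting; the value 10 mentioned above would go into /etc/sysctl.conf (or a file under /etc/sysctl.d/) as:

```
vm.swappiness=10
```

It takes effect immediately with `sudo sysctl -p`, or until reboot only with `sudo sysctl vm.swappiness=10`.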