I am training a neural network with PyTorch on a Jetson Nano using Python 3.6 and CUDA, but the process keeps getting killed partway through training. The same code runs fine on OS X. I select the device with:
device = torch.device("cuda" if (torch.cuda.is_available() and in_args.gpu == "gpu") else "cpu")
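As a sanity check I am thinking of printing what PyTorch actually sees on the Nano before training starts. This is just a minimal sketch, not part of my original script:

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
if device.type == "cuda":
    props = torch.cuda.get_device_properties(device)
    # On the Nano the GPU shares system RAM, so total_memory should report
    # roughly the same ~4 GB pool that tegrastats shows as RAM.
    print(f"{props.name}: {props.total_memory / 1024**2:.0f} MiB total")
    print(f"allocated: {torch.cuda.memory_allocated(device) / 1024**2:.0f} MiB")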
I get the output below when monitoring performance with tegrastats (1000 ms interval). As far as I can tell, the 64 GB swap file is working, and CUDA and the GPU are being used.
Any suggestions as to what I am missing here? Do I have to assign memory to the GPU, similar to the link below?
https://stackoverflow.com/questions/48285308/killed-error-in-tensorflow-when-i-try-load-convolutional-pretrained-model-in-jet
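If I do need to cap GPU memory, the closest PyTorch equivalent I could find to the TensorFlow per_process_gpu_memory_fraction option in that link is torch.cuda.set_per_process_memory_fraction. I am not sure the PyTorch build for Python 3.6 on JetPack includes it, so this is only a sketch:

import torch

# Sketch only: cap this process at ~50% of the visible device memory, assuming
# the installed PyTorch build exposes set_per_process_memory_fraction (1.8+).
if torch.cuda.is_available() and hasattr(torch.cuda, "set_per_process_memory_fraction"):
    torch.cuda.set_per_process_memory_fraction(0.5, device=0)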
tegrastats output in the middle of the job:
RAM 3478/3957MB (lfb 95x4MB) SWAP 2130/65536MB (cached 4MB) CPU [24%@1428,25%@1428,18%@1428,18%@1428] EMC_FREQ 0% GR3D_FREQ 33% PLL@46.5C CPU@49.5C PMIC@100C GPU@48C AO@55C thermal@49.5C POM_5V_IN 2670/3214 POM_5V_GPU 82/58
tegrastats output just before the job gets killed:
RAM 1391/3957MB (lfb 128x4MB) SWAP 653/65536MB (cached 28MB) CPU [7%@102,7%@102,7%@102,2%@102] EMC_FREQ 0% GR3D_FREQ 0% PLL@46C CPU@49C PMIC@100C GPU@48.5C AO@54.5C thermal@48.75C POM_5V_IN 1614/3197 POM_5V_GPU 41/59 POM_5V_CPU 165/878
RAM 1391/3957MB (lfb 128x4MB) SWAP 653/65536MB (cached 28MB) CPU [10%@1428,7%@1428,6%@1428,5%@1428] EMC_FREQ 0% GR3D_FREQ 90% PLL@47C CPU@50.5C PMIC@100C GPU@48C AO@54.5C thermal@48.5C POM_5V_IN 3317/3197 POM_5V_GPU 286/59 POM_5V_CPU 899/878
RAM 1392/3957MB (lfb 128x4MB) SWAP 653/65536MB (cached 28MB) CPU [7%@102,8%@102,7%@102,4%@102] EMC_FREQ 0% GR3D_FREQ 0% PLL@46C CPU@49C PMIC@100C GPU@48.5C AO@54.5C thermal@49C POM_5V_IN 1573/3194 POM_5V_GPU 41/59 POM_5V_CPU 165/877
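To confirm whether the kernel's OOM killer is what terminates the process, I plan to log the script's peak resident memory during training and then check dmesg for an out-of-memory message right after the kill. The log_memory helper below is something I would add for debugging; it is not in my current script:

import os
import resource

def log_memory(tag=""):
    # On Linux, ru_maxrss is the peak resident set size in kilobytes.
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print(f"[{tag}] pid={os.getpid()} peak RSS: {peak_kb / 1024:.0f} MiB")

# Called once per batch inside the training loop, e.g.:
#   log_memory(f"epoch {epoch}, batch {batch_idx}")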