When two containers train separately on a DGX-1 at the same time
and use the same GPU, we sometimes encounter the following error:
“out of memory invalid argument an illegal memory access was encountered”
However, we found that there was still enough free GPU memory.
Sorry for posting in the wrong forum topic…
What software produced this error? Your own application?
Those appear to be three separate error conditions. It is plausible that they might occur in sequence if the software continues to operate on a null pointer after a memory allocation failure, which I would consider questionable software design. I will note that there is no indication here that these errors pertain to the GPU rather than the host.
How did you determine that there is “enough GPU memory”? Did you identify the failing allocation call and add a print-out of the amount of available memory just before it? Note that with any memory allocator, the total amount of available memory reported may be larger than what can be allocated in a single allocation, due to fragmentation.
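To illustrate the fragmentation point, here is a minimal sketch in Python. It is a toy model of a memory pool, not the actual CUDA allocator; the pool layout and the function names `free_stats` and `can_allocate` are made up for illustration. It shows how the total free memory can exceed a request that still cannot be satisfied, because no single free region is large enough.

```python
def free_stats(free_blocks):
    """Return (total_free, largest_contiguous) for a list of (offset, size) free regions."""
    sizes = [size for _, size in free_blocks]
    return sum(sizes), max(sizes, default=0)


def can_allocate(free_blocks, request):
    """A single allocation needs one contiguous free region of at least `request` units."""
    return any(size >= request for _, size in free_blocks)


# Hypothetical 16-unit pool with live allocations scattered through it,
# leaving three separate free holes, given as (offset, size).
free_blocks = [(0, 3), (5, 4), (12, 3)]

total_free, largest = free_stats(free_blocks)
print(f"total free: {total_free}, largest contiguous: {largest}")
print("can allocate 6 units:", can_allocate(free_blocks, 6))
print("can allocate 4 units:", can_allocate(free_blocks, 4))
```

In this toy pool 10 units are free in total, yet a 6-unit request fails because the largest hole is only 4 units. A tool that reports only the total free figure would suggest there is "enough" memory in this situation.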