CUDA fragmentation

I am using the Bitonic sort from here https://gist.github.com/mre/1392067

Since this is doing an in-place sort, I expected that elements (structures) would simply be swapped within the original global memory buffer, so no additional memory would need to be allocated.
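
For reference, my understanding of the compare-and-swap step in a bitonic sort kernel along the lines of that gist is sketched below (the struct, its fields, and the kernel name are placeholders for my element type, not the gist verbatim): the exchange just reads and writes within the buffer that was already allocated with cudaMalloc.

// Hedged sketch of a bitonic compare-and-swap step on an array of structs
// already resident in global memory; both writes land in the existing buffer.
struct Point { float x, y, z; unsigned int key; };   // hypothetical element type

__global__ void bitonic_sort_step(Point *dev_values, int j, int k)
{
    unsigned int i   = threadIdx.x + blockDim.x * blockIdx.x;
    unsigned int ixj = i ^ j;
    if (ixj > i) {
        bool ascending = ((i & k) == 0);
        if (( ascending && dev_values[i].key > dev_values[ixj].key) ||
            (!ascending && dev_values[i].key < dev_values[ixj].key)) {
            Point tmp       = dev_values[i];   // plain in-place exchange,
            dev_values[i]   = dev_values[ixj]; // no new allocation involved
            dev_values[ixj] = tmp;
        }
    }
}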

However, after running this with 100 million points, cudaMemGetInfo shows zero for used, available, and total memory.

Does this mean:
a) That in-place swaps are not happening and each write to global memory allocates a new location, so memory is being filled up by the sort itself;
b) That some memory corruption has occurred in the CUDA code; or
c) That some other form of fragmentation is occurring (note: each array element is a structure, not a simple 32-bit value such as a float).

My guess is that, because I am using structures, the driver may allocate a new structure and link it in rather than performing an in-place swap, in which case memory would become completely fragmented.

Any increase in understanding would help :-)

Thank you

If it is showing 0 for total memory, then clearly something fundamental is wrong. Perhaps you are not calling cudaMemGetInfo correctly.
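
As a sanity check, a minimal sketch of a properly checked query looks roughly like this (the print format just mirrors your output lines):

size_t free_bytes = 0, total_bytes = 0;
cudaError_t err = cudaMemGetInfo(&free_bytes, &total_bytes);
if (err != cudaSuccess) {
    fprintf(stderr, "cudaMemGetInfo failed: %s\n", cudaGetErrorString(err));
    exit(EXIT_FAILURE);
}
printf("Device memory: used %zu available %zu total %zu\n",
       total_bytes - free_bytes, free_bytes, total_bytes);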

Thank you for responding. Using the same print memory routine, I get reasonable answers (using a Quadro P2000) before the sort.

Starting memory
Device memory: used 165281792 available 5130682368 total 5295964160
After LAS cudaMalloc
Device memory: used 3107586048 available 2188378112 total 5295964160
Total size loaded 2941378300
CUDA Synchronize
CUDA Synchronize done
Time to execute CallSort 64.0448
After Sort
Device memory: used 0 available 0 total 0

(1) Make sure you have proper CUDA error checking on all API calls and kernel launches
(2) Run the application under the control of cuda-memcheck (an example invocation is shown below)
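
For example (the executable name here is just a placeholder):

cuda-memcheck ./my_sort_app
cuda-memcheck --tool racecheck ./my_sort_app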

Any time you are having trouble with a CUDA code, my suggestion would be that you always do the 2 things that njuffa mentioned before asking others for help.

I have checkCudaErrors around all of my calls (been a programmer for a long time :-) )

But I will have to check out cuda-memcheck.

Will that show me levels of fragmentation?

Note that detecting errors in CUDA API calls and detecting errors in kernel launches are separate and different tasks. Just checking the success of CUDA API calls (which I presume is what checkCudaErrors does) is insufficient.

My working hypothesis here is that cudaMemGetInfo fails due to a prior undetected error in the CUDA stack. cuda-memcheck can diagnose many kinds of errors. If it reports zero issues with your code, consider posting a minimal, self-contained reproducer.

Thanks for that.

Do the NVIDIA samples do a sufficient job of detecting errors in kernel launches? If not, can you point me to an example, please?

One way to check for errors in kernel launches is to call this macro after every kernel launch:

#include <stdio.h>   /* fprintf, stderr */
#include <stdlib.h>  /* exit, EXIT_FAILURE */

// Macro to catch CUDA errors in kernel launches
#define CHECK_LAUNCH_ERROR()                                          \
do {                                                                  \
    /* Check synchronous errors, i.e. pre-launch */                   \
    cudaError_t err = cudaGetLastError();                             \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString(err) );       \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
    /* Check asynchronous errors, i.e. kernel failed (ULF) */         \
    err = cudaDeviceSynchronize();                                    \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString( err) );      \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
} while (0)
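
It is used immediately after each launch, e.g. (kernel name and launch configuration are placeholders):

bitonic_sort_step<<<blocks, threads>>>(dev_values, j, k);
CHECK_LAUNCH_ERROR();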

If you are doing proper CUDA error checking on the call to cudaMemGetInfo, and you are getting 0 back for total memory, that is just bizarre and unexplainable. At that point, if I were trying to investigate, I would want a complete repro case: a full code that demonstrates the problem, stripped down to a minimal level but still complete, along with the GPU you are running on, the OS, the CUDA version, and the compile command.

This should be completely unaffected by whether you are doing proper CUDA error checking around any particular kernel call.
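
To illustrate, a minimal reproducer skeleton might look like the sketch below (the allocation size and buffer are placeholders, and your sort launches would go where indicated); this is just a suggestion of the shape, not your code:

#include <cstdio>
#include <cstdlib>

#define CHECK(call)                                                    \
    do {                                                               \
        cudaError_t e = (call);                                        \
        if (e != cudaSuccess) {                                        \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,         \
                    cudaGetErrorString(e));                            \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

int main()
{
    size_t free_b = 0, total_b = 0;
    CHECK(cudaMemGetInfo(&free_b, &total_b));
    printf("before: free %zu total %zu\n", free_b, total_b);

    float *d_buf = nullptr;                                   // placeholder buffer
    CHECK(cudaMalloc(&d_buf, 100000000 * sizeof(float)));     // placeholder size

    // ... launch the sort kernels here, followed by CHECK_LAUNCH_ERROR() ...

    CHECK(cudaMemGetInfo(&free_b, &total_b));
    printf("after: free %zu total %zu\n", free_b, total_b);

    CHECK(cudaFree(d_buf));
    return 0;
}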

I have a Quadro P2000 here (Win7, CUDA 8) on which I could run a minimal self-contained reproducer.

Ok
I have a Quadro P2000 (Win10, CUDA 9.2). I will do some more testing over the weekend and see if I can get a working self-contained reproducer.

Thank you both for your help.