CUDA fragmentation

I am using the Bitonic sort from here https://gist.github.com/mre/1392067

Since this is doing an in-place sort, I expected that elements (structures) would simply be swapped within the original global memory buffer, so no additional memory would need to be allocated.
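
For reference, my understanding of the compare-and-swap step in a bitonic sort kernel along the lines of that gist is sketched below (the struct, its fields, and the kernel name are placeholders for my element type, not the gist verbatim): the exchange just reads and writes within the buffer that was already allocated with cudaMalloc.

// Hedged sketch of a bitonic compare-and-swap step on an array of structs
// already resident in global memory; both writes land in the existing buffer.
struct Point { float x, y, z; unsigned int key; };   // hypothetical element type

__global__ void bitonic_sort_step(Point *dev_values, int j, int k)
{
    unsigned int i   = threadIdx.x + blockDim.x * blockIdx.x;
    unsigned int ixj = i ^ j;
    if (ixj > i) {
        bool ascending = ((i & k) == 0);
        if (( ascending && dev_values[i].key > dev_values[ixj].key) ||
            (!ascending && dev_values[i].key < dev_values[ixj].key)) {
            Point tmp       = dev_values[i];   // plain in-place exchange,
            dev_values[i]   = dev_values[ixj]; // no new allocation involved
            dev_values[ixj] = tmp;
        }
    }
}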

However, after running this with 100 million points, cudaMemGetInfo shows zero for used, available, and total memory.

Does this mean:
a) That in-place swaps are not happening and each write to global memory allocates a new location, so memory is being filled up by the sort itself;
b) That some memory corruption has occurred in the CUDA code; or
c) That some other form of fragmentation is occurring (note: each array element is a structure, not a simple 32-bit value such as a float).

My guess is that, because I am using structures, the driver may allocate a new structure and link it in rather than performing an in-place swap, in which case memory would become completely fragmented.

Any increase in understanding would help :-)

Thank you

If it is showing 0 for total memory, then clearly something fundamental is wrong. Perhaps you are not calling cudaMemGetInfo correctly.
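
As a sanity check, a minimal sketch of a properly checked query looks roughly like this (the print format just mirrors your output lines):

size_t free_bytes = 0, total_bytes = 0;
cudaError_t err = cudaMemGetInfo(&free_bytes, &total_bytes);
if (err != cudaSuccess) {
    fprintf(stderr, "cudaMemGetInfo failed: %s\n", cudaGetErrorString(err));
    exit(EXIT_FAILURE);
}
printf("Device memory: used %zu available %zu total %zu\n",
       total_bytes - free_bytes, free_bytes, total_bytes);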

Thank you for responding. Using the same print memory routine, I get reasonable answers (using a Quadro P2000) before the sort.

Starting memory
Device memory: used 165281792 available 5130682368 total 5295964160
After LAS cudaMalloc
Device memory: used 3107586048 available 2188378112 total 5295964160
Total size loaded 2941378300
CUDA Synchronize
CUDA Synchronize done
Time to execute CallSort 64.0448
After Sort
Device memory: used 0 available 0 total 0

(1) Make sure you have proper CUDA error checking on all API calls and kernel launches
(2) Run the application under the control of cuda-memcheck (an example invocation is shown below)
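
For example (the executable name here is just a placeholder):

cuda-memcheck ./my_sort_app
cuda-memcheck --tool racecheck ./my_sort_app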

Any time you are having trouble with a CUDA code, my suggestion would be that you always do the 2 things that njuffa mentioned before asking others for help.

I have checkCudaErrors around all of my calls (been a programmer for a long time :-) )

But I will have to check out cuda-memcheck.

Will that show me levels of fragmentation?

Note that detecting errors in CUDA API calls and detecting errors in kernel launches are separate and different tasks. Just checking the success of CUDA API calls (which I presume is what checkCudaErrors does) is insufficient.

My working hypothesis here is that cudaMemGetInfo fails due to a prior undetected error in the CUDA stack. cuda-memcheck can diagnose many kinds of errors. If it reports zero issues with your code, consider posting a minimal, self-contained reproducer.

Thanks for that.

Do the NVIDIA samples do a sufficient job of detecting errors in kernel launches? If not, can you point me to an example, please?

One way to check for errors in kernel launches is to call this macro after every kernel launch:

#include <stdio.h>   /* fprintf, stderr */
#include <stdlib.h>  /* exit, EXIT_FAILURE */

// Macro to catch CUDA errors in kernel launches
#define CHECK_LAUNCH_ERROR()                                          \
do {                                                                  \
    /* Check synchronous errors, i.e. pre-launch */                   \
    cudaError_t err = cudaGetLastError();                             \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString(err) );       \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
    /* Check asynchronous errors, i.e. kernel failed (ULF) */         \
    err = cudaDeviceSynchronize();                                    \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString( err) );      \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
} while (0)
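
It is used immediately after each launch, e.g. (kernel name and launch configuration are placeholders):

bitonic_sort_step<<<blocks, threads>>>(dev_values, j, k);
CHECK_LAUNCH_ERROR();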

If you are doing proper CUDA error checking on the call to cudaMemGetInfo, and you are getting 0 back for total memory, that is just bizarre and unexplainable. At that point, if I were trying to investigate, I would want a complete repro case: a full code that demonstrates the problem, stripped down to a minimal level but still complete, along with the GPU you are running on, the OS, the CUDA version, and the compile command.

This should be completely unaffected by whether you are doing proper CUDA error checking around any particular kernel call.
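
To illustrate, a minimal reproducer skeleton might look like the sketch below (the allocation size and buffer are placeholders, and your sort launches would go where indicated); this is just a suggestion of the shape, not your code:

#include <cstdio>
#include <cstdlib>

#define CHECK(call)                                                    \
    do {                                                               \
        cudaError_t e = (call);                                        \
        if (e != cudaSuccess) {                                        \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,         \
                    cudaGetErrorString(e));                            \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

int main()
{
    size_t free_b = 0, total_b = 0;
    CHECK(cudaMemGetInfo(&free_b, &total_b));
    printf("before: free %zu total %zu\n", free_b, total_b);

    float *d_buf = nullptr;                                   // placeholder buffer
    CHECK(cudaMalloc(&d_buf, 100000000 * sizeof(float)));     // placeholder size

    // ... launch the sort kernels here, followed by CHECK_LAUNCH_ERROR() ...

    CHECK(cudaMemGetInfo(&free_b, &total_b));
    printf("after: free %zu total %zu\n", free_b, total_b);

    CHECK(cudaFree(d_buf));
    return 0;
}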

I have a Quadro P2000 here (Win7, CUDA 8) on which I could run a minimal self-contained reproducer.

Ok
I have a Quadro P2000 (Win10, CUDA 9.2). I will do some more testing over the weekend and see if I can get a working self-contained reproducer.

Thank you both for your help.