Since this is an in-place sort, I had expected that elements (structures) in the original global-memory array would simply get swapped, with no additional memory allocated.
However, after running this with 100 million points, cudaMemGetInfo reports zero for used, available, and total memory.
Does this mean:
a) That in-place swaps are not happening and each write to global memory allocates a new location, so memory is being filled up by the sort itself
b) That some memory corruption has occurred in the CUDA code
c) That some other fragmentation is occurring (NOTE: I am using structures for each element of the array, not simple 32-bit values like floats).
My guess is that since I am using structures, the driver may allocate a new structure and link it in rather than performing an in-place swap, in which case memory is becoming completely fragmented.
Thank you for responding. Using the same print memory routine, I get reasonable answers (using a Quadro P2000) before the sort.
Starting memory
Device memory: used 165281792 available 5130682368 total 5295964160
After LAS cudaMalloc
Device memory: used 3107586048 available 2188378112 total 5295964160
Total size loaded 2941378300
CUDA Synchronize
CUDA Synchronize done
Time to execute CallSort 64.0448
After Sort
Device memory: used 0 available 0 total 0
Any time you are having trouble with CUDA code, my suggestion is to always do the two things that njuffa mentioned before asking others for help.
Note that detecting errors in CUDA API calls and detecting errors in kernel launches are separate and different tasks. Just checking the success of CUDA API calls (which I presume is what checkCudaErrors does) is insufficient.
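To illustrate the distinction, here is a minimal sketch of the two separate checks. The `checkCudaErrors` macro below is a stand-in for whatever error-checking helper you already use, and `sortKernel`, `devPoints`, `blocks`, and `threads` are hypothetical names for your own kernel and launch parameters:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Stand-in for a typical error-checking helper (e.g. from helper_cuda.h)
#define checkCudaErrors(call)                                          \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                \
                    cudaGetErrorString(err_), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

// ... inside your host code, around the launch:
sortKernel<<<blocks, threads>>>(devPoints, n);

// 1) Check for launch errors (bad grid/block config, etc.).
//    A kernel launch returns no status, so this is a separate step.
checkCudaErrors(cudaGetLastError());

// 2) Check for asynchronous execution errors (e.g. out-of-bounds
//    accesses inside the kernel) by synchronizing and checking again.
checkCudaErrors(cudaDeviceSynchronize());
```

Only the combination of both checks tells you whether the kernel launched *and* ran without corrupting the CUDA context.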
My working hypothesis here is that cudaMemGetInfo fails due to a prior undetected error in the CUDA stack. cuda-memcheck can diagnose many kinds of errors. If it reports zero issues with your code, consider posting a minimal, self-contained reproducer.
If you are doing proper CUDA error checking on the call to cudaMemGetInfo and you are still getting 0 back for total memory, that is bizarre and unexplainable. At that point, if I were trying to investigate, I would want a complete repro case: a full code that demonstrates the problem, stripped down to a minimum but still complete, along with the GPU you are running on, the OS, the CUDA version, and the compile command.
This should be completely unaffected by whether you are doing proper CUDA error checking around any particular kernel call.
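For reference, a checked version of the memory-printing routine might look like the following sketch. This is an assumption about what your print routine does, based on the output you posted; the point is that the return status of cudaMemGetInfo itself must be inspected, since a "sticky" error from an earlier kernel fault will surface here and leave the output values meaningless:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: print device memory, but only trust the
// numbers if cudaMemGetInfo actually succeeded.
void printDeviceMemory(const char *label)
{
    size_t freeMem = 0, totalMem = 0;
    cudaError_t err = cudaMemGetInfo(&freeMem, &totalMem);
    if (err != cudaSuccess) {
        // A prior undetected error (e.g. a faulting kernel) will
        // typically be reported here instead of valid sizes.
        fprintf(stderr, "%s: cudaMemGetInfo failed: %s\n",
                label, cudaGetErrorString(err));
        return;
    }
    printf("%s\nDevice memory: used %zu available %zu total %zu\n",
           label, totalMem - freeMem, freeMem, totalMem);
}
```

If the call after the sort returns an error string such as "unspecified launch failure", the zeros you saw were never valid data, and the sort kernel is the place to look.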