I have 16 OpenMP threads, each with its own CUDA stream. I am getting the right results but too much running time. Basically, each thread calls a set of kernels. Before the OpenMP fork I call cudaDeviceSetLimit to set the heap limit, as you can see in Fig 1 below. But in the middle of the processing a cudaFree appears out of nowhere (Fig 2); I do not know where it comes from. Perhaps from cuFFT, which I call three times in each thread, but as you can see it happens only in one. Can anyone give me a clue how to get rid of this extra runtime? I do not call cudaFree inside the OpenMP threads.
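The structure is roughly like this (a simplified sketch, not my actual code; the heap-limit value, kernel, and buffer sizes are placeholders):

```cuda
#include <cuda_runtime.h>
#include <omp.h>

__global__ void work(float *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;

    // heap limit set once, before the OpenMP fork (value is a placeholder)
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 128u * 1024 * 1024);

    float *d[16];
    for (int t = 0; t < 16; ++t)
        cudaMalloc(&d[t], n * sizeof(float));   // all allocation up front

    #pragma omp parallel num_threads(16)
    {
        int t = omp_get_thread_num();
        cudaStream_t s;
        cudaStreamCreate(&s);
        // each thread launches its set of kernels on its own stream
        work<<<(n + 255) / 256, 256, 0, s>>>(d[t], n);
        cudaStreamSynchronize(s);
        cudaStreamDestroy(s);
    }

    for (int t = 0; t < 16; ++t)
        cudaFree(d[t]);   // cudaFree only here, after the parallel region
    return 0;
}
```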
calling cudaFree on a pointer without an underlying memory allocation should result in an error (or a crash, if the error goes unchecked)
however, you do not report any error or crash, which suggests that the cudaFree is successful
this in turn suggests that the underlying memory is not allocated by you, at least not directly
i almost want to hypothesize that it is linked to the call to cudaDeviceSetLimit
you can easily test this: if you can afford to skip the call to cudaDeviceSetLimit, re-profile and see whether the cudaFree disappears along with it
otherwise, profile an elementary program, first with cudaDeviceSetLimit present, and then with cudaDeviceSetLimit absent
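for instance, an elementary program along these lines (a sketch; the heap-limit value is a placeholder, and SET_HEAP_LIMIT is just a compile-time switch I made up for the comparison):

```cuda
// build twice: once with -DSET_HEAP_LIMIT and once without,
// then profile both builds and compare the CUDA API traces for a cudaFree
#include <cuda_runtime.h>

__global__ void dummy() {}

int main() {
#ifdef SET_HEAP_LIMIT
    // same call as in your code, placeholder value
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 128u * 1024 * 1024);
#endif
    dummy<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}
```

if the cudaFree shows up only in the build with cudaDeviceSetLimit, you have your culprit; if it shows up in both, it is more likely coming from a library such as cuFFT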