In a device function, I want to allocate global GPU memory dynamically, but the heap available for this is limited. I can set the limit by calling cudaDeviceSetLimit(cudaLimitMallocHeapSize, hsize) on the host, where hsize is a size_t value. However, it seems I can only raise this limit up to 1024*1024*(1024+1024-1) = 2146435072 bytes, around 2 GB. Any number bigger than this assigned to hsize makes the limit read back as 18446744071563116544, a very big number. When I run cuda-memcheck, it shows an error:
Program hit cudaErrorLaunchOutOfResources (error 7) due to “too many resources requested for launch” on CUDA API call to cudaLaunch. Saved host backtrace up to driver entry point at error
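For reference, the call sequence I'm using is roughly the following (the 8 GB request is illustrative; cudaDeviceGetLimit reads back whatever the runtime actually accepted, and I also check the return code of cudaDeviceSetLimit):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    size_t request = 8ULL * 1024 * 1024 * 1024;  // ask for an 8 GB heap
    cudaError_t err = cudaDeviceSetLimit(cudaLimitMallocHeapSize, request);
    printf("set limit: %s\n", cudaGetErrorString(err));

    size_t actual = 0;
    cudaDeviceGetLimit(&actual, cudaLimitMallocHeapSize);
    printf("heap limit now: %zu bytes\n", actual);
    return 0;
}
```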
I am using a TITAN X GPU, which has 12 GB of memory, but it seems only about 2 GB is available for dynamic allocation in device functions. Is there any way to use all of my GPU memory for allocation in device code?
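For completeness, the in-kernel allocation pattern I'm using looks like this sketch (sizes are illustrative; device-side malloc draws from the heap sized by cudaLimitMallocHeapSize and returns NULL when that heap is exhausted):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void alloc_kernel(size_t bytes) {
    // Each thread allocates from the device heap; NULL means exhaustion.
    char *p = (char *)malloc(bytes);
    if (p == NULL) {
        printf("thread %d: device malloc failed\n", threadIdx.x);
        return;
    }
    p[0] = 42;  // touch the allocation
    free(p);    // device allocations persist across launches unless freed
}

int main() {
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 1ULL << 30);  // 1 GB heap
    alloc_kernel<<<1, 32>>>(1 << 20);  // 1 MB per thread
    cudaDeviceSynchronize();
    return 0;
}
```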