I am trying to convert parts of my C++ code into CUDA kernels. So far I have the following questions; any input on these is highly appreciated.
To execute any portion of C++ code on the GPU using CUDA, I need to call cudaMalloc, which takes a lot of time, actually more than the complete execution time on CPU. What is the appropriate way to allocate memory on the GPU?
Which of the different types of memory available on the GPU are appropriate to use with CUDA?
What is the “complete execution time on CPU”? How did you determine that “CudaMalloc … takes a lot of time”?
Note that CUDA context creation / initialization occurs lazily, usually triggered by the first CUDA API call, such as the first cudaMalloc() call. CUDA context initialization varies based on system configuration, and generally takes longer if there is a lot of system memory and a lot of GPU memory.
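One way to check whether the time is going into context initialization rather than the allocation itself is to time two consecutive cudaMalloc() calls. The following is only a minimal sketch: the helper name timedMallocMs and the 1 MiB allocation size are made up for illustration, and the actual numbers will depend on your system.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: time one cudaMalloc() of `bytes` bytes on the host.
// Returns the elapsed time in milliseconds, or a negative value on failure.
static double timedMallocMs(size_t bytes, void **ptr)
{
    auto start = std::chrono::steady_clock::now();
    cudaError_t err = cudaMalloc(ptr, bytes);
    auto stop = std::chrono::steady_clock::now();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return -1.0;
    }
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

int main()
{
    const size_t bytes = 1 << 20;  // 1 MiB, an arbitrary size for illustration
    void *a = nullptr;
    void *b = nullptr;

    // The first call pays for lazy context creation in addition to the allocation.
    double first = timedMallocMs(bytes, &a);
    // The second call runs with the context already in place.
    double second = timedMallocMs(bytes, &b);

    if (first >= 0.0 && second >= 0.0) {
        std::printf("first  cudaMalloc: %.3f ms\n", first);
        std::printf("second cudaMalloc: %.3f ms\n", second);
    }

    // cudaFree() accepts a null pointer, so this is safe even after a failure.
    cudaFree(a);
    cudaFree(b);
    return (first >= 0.0 && second >= 0.0) ? 0 : 1;
}
```

If the first call is much slower than the second, the cost you are seeing is the one-time initialization, not the allocation.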
This is what I have learned: the CUDA subsystem gets initialized by the first CUDA runtime API call.
To avoid this, the user guide mentions that we should use CUT_DEVICE_INIT, which does the required initialization, and call cudaMalloc only after that. I have verified that cudaMalloc then takes only 10 microseconds…
But I am currently struggling to use CUT_DEVICE_INIT with CUDA 6.5. The forums say CUT_DEVICE_INIT was removed after CUDA 5.0, so what is the alternative way to initialize the CUDA subsystem with CUDA 6.5?
Which “user guide” recommended the use of CUT_DEVICE_INIT? Can you point to the relevant document and the relevant section in that document?
As far as I recall, all the CUT stuff was part of a utility library that was introduced to shorten the example programs shipping with CUDA. NVIDIA pointed out numerous times that this code was not to be considered part of the CUDA deliverables, could change or go away at any time, and should therefore not be used by CUDA programmers for production code.
Try calling cudaFree(0) to trigger initialization of the CUDA context.
cudaFree frees the memory space pointed to by devPtr, which must have been returned by a previous call to cudaMalloc(), and if the argument is zero, as you suggested, it performs no operation…
It solves the issue of initializing the CUDA subsystem, but cudaFree doesn’t seem to be an API meant for this purpose. Is there any other API in CUDA 6.5 that can be used at the start of the application to initialize the CUDA subsystem?
In the CUDA runtime API, there is no dedicated context creation API call. Instead, context creation and initialization happen lazily, as needed. This is by design, as the CUDA runtime seeks to hide low-level details that are exposed when using the CUDA driver API.
Most CUDA runtime API calls will trigger the context creation if a context doesn’t exist yet. Calling cudaFree(0) is one API call that is convenient to manually trigger creation of the context, as it initiates no other activity besides the side effect of kicking off context creation and initialization.
If, for reasons I do not understand, you do not want to use cudaFree(0) to trigger the CUDA context creation, you are free to invoke some other suitable CUDA runtime API call for this purpose, or you can simply rely on the default lazy on-demand initialization.
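For illustration, here is a minimal sketch of triggering context creation at application start with cudaFree(0), so that later allocations do not include the initialization cost. The helper name warmUpCudaContext and the buffer size are hypothetical, chosen only for this example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: force CUDA context creation up front.
// cudaFree(0) frees nothing, but like most runtime API calls it
// creates the context if one does not exist yet.
static cudaError_t warmUpCudaContext()
{
    return cudaFree(0);
}

int main()
{
    cudaError_t err = warmUpCudaContext();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA init failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // The context now exists, so this allocation does not trigger initialization.
    float *d_data = nullptr;
    err = cudaMalloc((void **)&d_data, 1024 * sizeof(float));
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // ... launch kernels that use d_data here ...

    cudaFree(d_data);
    return 0;
}
```

Wrapping the call in a named function at least documents the intent, which may address the concern that cudaFree does not look like an initialization API.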