Conceptual questions on memory and blocks limit

I have a few conceptual questions for which I couldn’t really objective answers in other related posts here or StackOverflow, so maybe you guys can provide some guidance:

  • When we declare an array with cudaMallocManaged and are finished with the computation in GPU, it is not needed to manually declare yet another array in host memory and do a cudaMemcpy(host_array, device_array, length), as a managed variable is moved between host and device depending on where it is called from. Correct?

  • When we declare an array with cudaMallocManaged without checking if there is enough free video memory for the operation (cudaMemGetInfo()), does the function return an error or it virtualizes the space with video memory + host memory and whatnot?

  • When setting up the kernel function call, assume a certain number of blocks is passed, a_kernel_method<<< nBlocks, 1>>>(). What defines the number of blocks that a device can execute and, more specifically, what happens if I submit more blocks than the device can run at once, or it is limited by the API version instead? What is the blocks limit in v9?

  • From the CUDA documentation: “On current GPUs, a thread block may contain up to 1024 threads”. If I don’t specify a SM in the compilation and the number of threads is only known in runtime, that is, it can go up to 1024, what happens when it is run on a device that doesn’t support the number of threads? When writing the code, should we make sure this never gets bigger than 1024 or the API handles that?

If you guys know any of these questions and want to share the information, then I already thank you.

Rather than using the Q+A here or on SO as your primary source of learning, you may want to consider actually using the documentation, i.e. the CUDA programming guide, as well as other original source material such as blogs. I think most of your questions are already answered in locations such as those. Even if you don’t want to read the whole programming guide, there is a particular section dedicated to managed memory.

correct. This should be evident simply by studying a managed memory sample code. It is a basic principle of managed memory: only one pointer is needed to refer to data, whether from device or host code.

The behavior varies by CUDA version and system configuration (e.g. GPU, OS, etc.) with CUDA 9.x, on windows, or on linux with a GPU prior to Pascal generation, the attempt to allocate more than available memory will return a runtime API error. With a Pascal or Volta GPU on CUDA 9.x on Linux, oversubscription is possible. You can read more about it in the relevant section of the programming guide:

or via a blog:

and probably elsewhere.

The limit on blocks is a hardware limit defined by the GPU itself. Run the deviceQuery program to determine what it is for your GPU. It is not a function of CUDA version. You can also discover this information from the relevant table in the programming guide:

If you exceed the limit, an API error is returned on a subsequent CUDA runtime API call (not from the kernel call itself.) If you exceed the limit, the kernel will not launch at all. I strongly recommend proper CUDA error checking. Not sure what that is? Google “proper CUDA error checking” and read the first hit. Also note that in order to take advantage of the higher limit available for the grid X dimension on cc3.0 and higher GPUs, it is necessary to explicitly compile for that target on CUDA 8.0 and older (CUDA 9.0 compiles for cc3.0 by default.)

This is also covered in dozens of questions on cuda SO tag.

All CUDA devices except the very first generation of CUDA capable GPUs support a maximum of 1024 threads per block. So unless you are using CUDA 6.5 or older, and one of these very old GPUs (they are not supported in CUDA 7.0 and beyond), this should be a non-issue. To answer your question, you would programmatically determine at runtime such a hardware limit using a method similar to what is depicted in the cuda deviceQuery sample code, and then your code would be able to programmatically at runtime adhere to the hardware limit, for the GPU you intend to run on.

If, for any reason, you pass a number greater than 1024, the kernel launch will fail, with a runtime API error as previously described. The API doesn’t automatically “handle” this case, other than to return an error and fail the kernel launch.

Good afternoon, txbob, and thanks for your time to provide this detailed answer.

I always have the browser with 10+ tabs pointing to either nvidia documentation or other rich discussions from active guys, like Robert Crovella, Talonmies and Njuffa, but it is possible that some information escapes my eyes.

The blog from Mark Harris I knew, but not the one from Nikolay, so I am bookmarking it for later reference. I will digest all the information you provided and adjust code to do the proper error checking as well as kernel parameters.

BTW, your assistance on getting the CUDA code in a DLL with MSVC was really good as I can work on totally separate modules. Now I am trying to convince nvcc/MSVC to link to the cuRAND libs, but it refuses regardless of the -lcurand -L"path_to_x64_lib". The fight goes on, txbob!

Thank you, as usual.