New to CUDA and need some help!

Hello,

I’ve been investigating CUDA for about a week now and I’ve been very impressed so far. Using pre-existing kernels is pretty straightforward. I’ve recently begun writing my own kernel to do a histogram and I’ve run into a few questions that I can’t seem to find answers to. Most of them are basic, theoretical questions, but could you please answer them or point me to the right resources? I’ve looked around in the documentation and on these forums, but a lot of the information I’ve found is somewhat confusing and, in some cases, seems contradictory.

  1. The concept of grids, blocks, and threads is a little confusing at times. This is what I’ve gathered, but there are some things I’ve read that seem to suggest otherwise.

a) A grid is basically equal to a kernel call. When you call

myKernel<<<config_parameters>>> (arguments)

, you are creating a grid. Only one grid may be executing on the GPU at any time.

b) A block is nothing more than a group of threads. Multiple blocks can be executed on the GPU at one time. The number of blocks running in parallel on the GPU is based on the number of free processors and the number of threads per block.

  2. This is a question pertaining to what I believe I understand in (1). What exactly do the <<<config_parameters>>> define? The first one, usually labeled as nBlocks or gridDim, is the number of blocks in the grid. The second parameter, usually labeled as blockSize, is really the number of threads within each block, correct? Therefore, each call to the kernel will execute the code within the kernel (nBlocks * blockSize) times.

  3. This also has to do with (1) above. Why have blocks? Why have M blocks with N threads when you could just have 1 block of (M × N) threads? Wouldn’t the computations be the same? There would be (M × N) individual execution paths within each grid/kernel call either way, wouldn’t there?

  4. What is the difference between having a grid with (2 x 3) blocks and a grid with (6 x 1) or (1 x 6) blocks? Perhaps this is something that will become more apparent to me down the road when I write a kernel dealing with two-dimensional data, rather than the one-dimensional data I’m dealing with at this time.

  5. What happens if the number of threads specified is greater than the number of processors in the GPU? It appears that everything is still computed correctly in my situation, but what is happening at a lower level? Does doing this run the risk of causing unforeseen errors/bugs?

  6. Let’s say we have 1 GB of memory on the GPU. Our data is 950 MB. If all of the memory on the GPU were available to us, it would be easy to know whether our data will fit, but that isn’t the case. There is memory used by CUDA and by the OS’s GUI. Since these amounts can vary, and can even change during the execution of our code, is there a way to determine how much memory is available at runtime? What happens if we exceed this amount?

I guess it’s worth mentioning that I’m using CUDA 2.0 on a GTX 280. We are developing on Linux.

Thanks in advance!

Bryan

Hi,
As a beginner in CUDA, I found these articles useful:
http://llpanorama.wordpress.com/2008/05/21…t-cuda-program/
http://llpanorama.wordpress.com/2008/06/11…nd-grids-oh-my/

  1. At the moment, that is correct. I don’t think it’s good to assume that grids will never be resident on the GPU concurrently, though. (that’s not a hint at anything, don’t get excited. I’m just saying, there’s nothing in the CUDA spec that explicitly says “grids will never run simultaneously”)

  2. Basically, yes, although keep in mind that gridDim and blockDim are dim3 variables. You can have 2D arrays of blocks and 3D arrays of threads for each block; that’s primarily a convenience thing, though. (There’s a short launch sketch after this list.)

  3. Because blocks are guaranteed to be local to each other so you can use shared memory. As soon as you talk about making shared memory accessible from every SM, you start talking about either an absolutely enormous increase in complexity on the chip or using global memory instead of shared memory, which means it would be really slow. You can think of a CUDA device as basically an array of SIMT (SIMD, except the hardware transparently handles divergent paths) processors with some scheduling hardware to manage the whole thing and onboard memory; it’s not 240 independent processors. This is why divergent warps are bad for performance.

  4. Nothing, except the values you get out of blockIdx. Either way, you have six independent blocks.

  5. The GPU has its own scheduling hardware to assign blocks to free SMs, and each SM has its own scheduling hardware to schedule warps (which don’t have to be from the same block). In other words: it Just Works, don’t worry about it.

  6. Not really. Hopefully you’re running a console app; if you boot directly to a console and don’t run X, you should get significantly more free memory.
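To make 1, 2, and 4 a bit more concrete, here is a minimal launch sketch (the kernel, names, and sizes are purely illustrative, not taken from this thread):

[code]
// Illustrative only: a trivial kernel plus its launch configuration.
__global__ void fillKernel(float *out, int n)
{
    // Global index: which of the (nBlocks * blockSize) threads am I?
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                    // guard for the last, partially used block
        out[i] = (float)i;
}

int main()
{
    const int n = 1 << 20;
    float *d_out;
    cudaMalloc((void **)&d_out, n * sizeof(float));

    dim3 blockSize(256);                                 // threads per block
    dim3 nBlocks((n + blockSize.x - 1) / blockSize.x);   // blocks per grid

    // One launch = one grid, and the kernel body runs nBlocks.x * blockSize.x
    // times in total. A (2,3) grid and a (6,1) grid both give six blocks;
    // only the blockIdx values the blocks see are different.
    fillKernel<<<nBlocks, blockSize>>>(d_out, n);
    cudaThreadSynchronize();      // CUDA 2.x-era synchronization call

    cudaFree(d_out);
    return 0;
}
[/code]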

I have a question about number 1, which you mentioned; please correct me if I am wrong. As I understood from the programming manual, only one kernel can run on the GPU at any moment, so it would not be possible to run kernels (grids) concurrently, yes?

Suppose that I have 4 matrices (A, B, C, and D) and a kernel for matrix multiplication. Which one of the following scenarios can be implemented on a GPU:

1- Simultaneously compute E = A*B and F = C*D with one kernel, and then as the next step compute G = E*F? (this needs 2 steps)

OR

2- I have to first compute E, then in a next step F, and finally G? (this needs 3 steps)

[quote name=‘tmurray’ date=‘Sep 4 2008, 02:13 PM’]

  1. At the moment, that is correct. I don’t think it’s good to assume that grids will never be resident on the GPU concurrently, though. (that’s not a hint at anything, don’t get excited. I’m just saying, there’s nothing in the CUDA spec that explicitly says “grids will never run simultaneously”)

[/quote]

You could either do this with a reduction or with a custom matrix multiply kernel. I wouldn’t worry about the number of “steps” something takes; launch overhead is very small unless your matrix dimensions are tiny.

So, you are saying that scenario #1 can be implemented?

I see no reason why it couldn’t, but it’s probably messy and faster (both in dev time and execution time) to just do scenario 2.
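For what it’s worth, scenario 2 is just three back-to-back launches; on the same (default) stream they execute in order, and the per-launch overhead is small compared to the multiplies themselves. The matMul kernel and the d_* pointer names below are illustrative, not from this thread:

[code]
// Sketch only: a naive multiply kernel plus scenario 2's three launches,
// assuming square N x N matrices already resident on the device.
__global__ void matMul(const float *A, const float *B, float *C, int N)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; ++k)
            sum += A[row * N + k] * B[k * N + col];
        C[row * N + col] = sum;
    }
}

void scenario2(const float *d_A, const float *d_B, const float *d_C,
               const float *d_D, float *d_E, float *d_F, float *d_G, int N)
{
    dim3 block(16, 16);
    dim3 grid((N + block.x - 1) / block.x, (N + block.y - 1) / block.y);

    // Launches on the default stream execute one after another.
    matMul<<<grid, block>>>(d_A, d_B, d_E, N);   // step 1: E = A * B
    matMul<<<grid, block>>>(d_C, d_D, d_F, N);   // step 2: F = C * D
    matMul<<<grid, block>>>(d_E, d_F, d_G, N);   // step 3: G = E * F
    cudaThreadSynchronize();
}
[/code]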

Thanks for those links kolonel. They’ve been very informative.

And thanks tmurray for your prompt response. It’s always nice to see people from the company answering questions on their message boards.

I do have one follow-up for now. You mention:

[quote name=‘tmurray’]
Because blocks are guaranteed to be local to each other so you can use shared memory.
[/quote]

Was this a typo? From my understanding, the threads are local and have access to common shared memory, but the blocks do not. Block 0 has a different shared memory than Block 1, correct?

And to add to my understanding of the difference between blocks and threads: each block is executed entirely on one SM. The threads in Block 0, for instance, cannot be divided between several SMs, right? So if your threads do not need the benefit of shared memory, it would be beneficial to spread them out over multiple blocks, so that you can take advantage of more than one SM. I think I’m starting to understand the underlying architecture a little more, which is helping me understand how things are structured at the software level as well.

Thanks again for all the help!

Sorry, I was unclear. All threads within a block are guaranteed to run on the same SM. Blocks, even those that execute on the same SM due to whatever scheduling takes place, can never share memory.

Basically, the idea with block sizes is that there’s a sweet spot where one block (or two blocks, or as many blocks can be concurrently scheduled on a single SM) can keep an individual SM busy and there are enough blocks to keep all of the SMs busy while still keeping within the register and shared memory limitations. It’ll be different for each app, but it should be fairly intuitive once you get used to CUDA a bit more.
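Since you mentioned you’re writing a histogram, here is roughly how per-block shared memory usually comes into play for that. This is just a sketch under a few assumptions (256 bins, byte-valued input, and compute capability 1.2+ for shared-memory atomics, which the GTX 280 has):

[code]
#define NUM_BINS 256   // illustrative bin count

// Each block builds a private histogram in its own shared memory (one copy
// per block, invisible to other blocks), then folds it into the global
// result with atomic adds.
__global__ void histogram256(const unsigned char *data, int n,
                             unsigned int *globalHist)
{
    __shared__ unsigned int localHist[NUM_BINS];

    // Cooperatively zero this block's private histogram.
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        localHist[b] = 0;
    __syncthreads();

    // Grid-stride loop over the input; shared-memory atomics need sm_12+.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&localHist[data[i]], 1u);
    __syncthreads();

    // Merge this block's counts into the global histogram.
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        atomicAdd(&globalHist[b], localHist[b]);
}
[/code]

The point is that each block’s localHist is its own copy; the only cross-block communication happens through global memory at the end.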