I’ve been investigating CUDA for about a week now and I’ve been very impressed so far. Using pre-existing kernels is simple and pretty straightforward. I’ve recently begun writing my own kernel to do a histogram and I’ve run into a few questions that I can’t seem to find answers to. Most of them are basic, theoretical questions, but could you please answer them or point me to the right resource? I’ve looked around in the documentation and on these forums, but a lot of the information I’ve found is somewhat confusing and, in some cases, seems contradictory.
- The concept of grids, blocks, and threads is a little confusing at times. This is what I’ve gathered, but there are some things I’ve read that seem to suggest otherwise.
a) A grid is basically equal to a kernel call. When you call kernel<<<nBlocks, blockSize>>>(...), you are creating a grid. Only one grid may be executing on the GPU at any time.
b) A block is nothing more than a group of threads. Multiple blocks can be executed on the GPU at one time. The number of blocks running in parallel on the GPU is based on the number of free processors and the number of threads per block.
This question pertains to what I believe I understand in the first question above. What exactly do the <<<config_parameters>>> define? The first one, usually labeled nBlocks or gridDim, is the number of blocks in the grid. The second parameter, usually labeled blockSize, is really the number of threads within each block, correct? Therefore, each call to the kernel will execute the kernel’s code (nBlocks * blockSize) times.
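To make sure I’m reading this right, here’s a minimal sketch of how I picture it (the kernel name and sizes are made up just for illustration):

#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical kernel, just for illustration: each thread touches one element.
__global__ void increment(float *data)
{
    // blockIdx.x selects the block, threadIdx.x the thread within it,
    // so each of the nBlocks * blockSize threads gets a unique global index.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    data[idx] += 1.0f;
}

int main(void)
{
    const int nBlocks = 4, blockSize = 256;   // 1024 threads total
    const int n = nBlocks * blockSize;

    float *d_data;
    cudaMalloc((void **)&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    // One grid of 4 blocks, each with 256 threads:
    // the kernel body executes 1024 times in total.
    increment<<<nBlocks, blockSize>>>(d_data);
    cudaThreadSynchronize();                  // CUDA 2.0-era sync call

    cudaFree(d_data);
    return 0;
}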
This also has to do with the first question above. Why have blocks at all? Why have M blocks of N threads each when you could just have one block of (M * N) threads? Wouldn’t the computations be the same? There would be (M * N) individual execution paths within each grid/kernel call either way, wouldn’t there?
What is the difference between having a grid with (2 x 3) blocks and a grid with (6 x 1) or (1 x 6) blocks? Perhaps this is something that will become more apparent to me down the road when I write a kernel dealing with two-dimensional data, rather than the one-dimensional data I’m dealing with at this time.
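In code, I assume the difference is just how you build the dim3 launch configuration, something like this (the kernel is hypothetical, only there to show the indexing):

#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical kernel: record which block each thread landed in.
__global__ void whichBlock(int *out)
{
    // Flatten the (possibly 2D) block index into a single number.
    int blockId = blockIdx.y * gridDim.x + blockIdx.x;
    int idx = blockId * blockDim.x + threadIdx.x;
    out[idx] = blockId;
}

int main(void)
{
    const int threadsPerBlock = 256;
    const int totalThreads = 6 * threadsPerBlock;

    int *d_out;
    cudaMalloc((void **)&d_out, totalThreads * sizeof(int));

    dim3 block(threadsPerBlock);
    dim3 grid2x3(2, 3);   // blockIdx.x in 0..1, blockIdx.y in 0..2
    dim3 grid6x1(6, 1);   // blockIdx.x in 0..5, blockIdx.y always 0

    // Both launches run the same 6 * 256 threads; only the indexing differs.
    whichBlock<<<grid2x3, block>>>(d_out);
    whichBlock<<<grid6x1, block>>>(d_out);
    cudaThreadSynchronize();

    cudaFree(d_out);
    return 0;
}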
What happens if the number of threads specified is greater than the number of processors in the GPU? It appears that everything is still computed correctly in my situation, but what is happening at a lower level? Does doing this run the risk of causing unforeseen errors/bugs?
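For example, this is the kind of thing I mean: the test below launches far more threads than my GTX 280 has processors, and the result still checks out on my machine (the kernel and names are just for illustration):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread writes its own global index.
__global__ void fill(int *out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)              // guard in case n isn't a multiple of blockSize
        out[idx] = idx;
}

int main(void)
{
    const int n = 1 << 20;    // ~1M threads, far more than 240 processors
    const int blockSize = 256;
    const int nBlocks = (n + blockSize - 1) / blockSize;

    int *d_out;
    int *h_out = (int *)malloc(n * sizeof(int));
    cudaMalloc((void **)&d_out, n * sizeof(int));

    fill<<<nBlocks, blockSize>>>(d_out, n);
    cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);

    // Verify every element on the host.
    int errors = 0;
    for (int i = 0; i < n; ++i)
        if (h_out[i] != i) ++errors;
    printf("%d errors\n", errors);   // prints "0 errors" for me

    cudaFree(d_out);
    free(h_out);
    return 0;
}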
Let’s say we have 1 GB of memory on the GPU and our data is 950 MB. If all of the memory on the GPU were available to us, it would be easy to know whether our data will fit, but that isn’t the case: some memory is used by CUDA itself and by the OS’s GUI. Since these amounts can vary, and can even change during the execution of our code, is there a way to determine how much memory is available at runtime? And what happens if we exceed that amount?
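The only call I’ve found so far is cuMemGetInfo in the driver API. Here’s a minimal sketch of how I’ve been trying to query it; I’m assuming the unsigned int signature from the 2.0-era headers (I gather later toolkits use size_t, and that a runtime-API cudaMemGetInfo exists there too), so please correct me if this is the wrong approach:

#include <stdio.h>
#include <cuda.h>   // driver API; link with -lcuda

int main(void)
{
    CUdevice dev;
    CUcontext ctx;
    unsigned int freeMem, totalMem;   // size_t in later toolkits, I believe

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    // Ask the driver how much device memory is free right now.
    cuMemGetInfo(&freeMem, &totalMem);
    printf("free: %u bytes, total: %u bytes\n", freeMem, totalMem);

    cuCtxDetach(ctx);
    return 0;
}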
I guess it’s worth mentioning that I’m using CUDA 2.0 on a GTX 280. We are developing on Linux.
Thanks in advance!