I’ve tried reading the pertinent manual sections, but something is just not making its way through my thick skull… I can’t seem to wrap my mind around how to appropriately choose thread blocks for launching my kernel. This wasn’t an issue when I was testing my work with small data sets; however, I am now encountering an “invalid configuration argument” when trying to work with larger data sets.
I also tried looking at the CUDA Occupancy spreadsheet, but it might as well be written in Mandarin for all I can understand…
My current example attempts to work on a set of 1D float arrays which are 19,695,844 elements in length.
I was following someone else’s example, launching the kernel:
int block_size = 128;
//nArrayElements = 19,695,844
int n_blocks = nArrayElements/block_size + (nArrayElements%block_size == 0 ? 0 : 1 );
PerformWork<<< n_blocks, block_size >>>(…)
Can anyone point me in the right direction to launch my kernel in a better way?
int block_size = 128;
int max_blocksX=65535; //maximum gridDim, according to dev guide
//nArrayElements = 19,695,844
int n_blocksX = (nArrayElements-1)/block_size+1;
PerformWork<<<dim3(n_blocksX,n_blocksY), block_size >>>(...)
this one scales up to 2^31-1 elements, and then you’ve got a problem with the int. :-p
use something like threadIdx.x + blockIdx.x*blockDim.x + blockIdx.y*blockDim.x*gridDim.x to get the absolute index in every thread.
You’re basically just wrapping a 1D array in a 2D one. It’s a small trick in addressing to use 2D. You can convert a 1D address into a 2D one, and a 2D address back into a 1D one using some simple math. However, I think Ocire made a mistake. 1D->2D conversion is:
int n_blocks = (nArrayElements-1) / block_size + 1;
int n_blocksX = (n_blocks >= max_blocksX) ? max_blocksX : n_blocks;
int n_blocksY = (n_blocks-1) / max_blocksX + 1;
Looking ahead, I’m still trying to figure out the best way to launch my kernel in general. I am starting with a specific example to try to figure it all out…
The documentation shows
DoWork<<< Dg, Db >>>(…)
Dg = dim3 such that Dg.x*Dg.y = number of blocks
Db = dim3 such that Db.x*Db.y*Db.z = number of threads per block
Another section tells me that I should have at least twice as many blocks as multiprocessors (16 in my case), but 100 is better and 1000 will scale better in the future. The number of threads should be a multiple of the warp size (32 in my case). I am examining the particular case of an array that is sized 10,475,400. I started by looking at the block sizes that would give me between 100 and 1,000 blocks. I saw that selecting Db by:
Db = 512 x 128 x 1 = 65,536 = 2,048*32
Would give me ~160 blocks => 10,475,400/(512*128) = 159.84.
The maximum number of threads per block is 512, so Db.x*Db.y*Db.z must be <= 512. Note that depending on the number of registers used in your kernel, the limit may be lower.
I always choose dimensions the other way around. Fix the block size (benchmark different block sizes in multiples of 32 to find the optimum). For 1D data, it is easiest just to keep the block size 1D. Then I launch enough blocks of that size to cover the whole data set.
In your case, say you benchmark and find that a block size of 128 is the fastest. Then ceil(10,475,400 / 128) = 81,840. That is larger than 65,535, so: ceil(sqrt(81,840)) = 287. You need to launch a kernel with a 287x287 grid with a block size of 128.
Note that the sqrt might not be the best way to choose the 2D block size, I haven’t thought that one through fully yet. So far my data sets have always fit nicely with only a 1D grid.
Well, I feel like an idiot… I even spent a little time today writing the card properties (including maximum threads per block) to a diagnostics file. I definitely should have caught that.
That being said, I am getting very slow times when I take this approach. My baseline is to use 512 threads per block and a grid of ceil(10,475,400 / 512) = 20,460 blocks. This gives me a run time of 3.3 minutes (a speedup of less than 5x - another problem I am working on). I have played around with different block sizes and grid dimensions as per your suggestion. Any 2D value I use seems to yield a run time of around 11 minutes (give or take 15 seconds) - over 3x the original run time.
You don’t have to overcomplicate the code by using threadIdx.y; just leave the block dimensions 1D.
Apart from that, your correction was effective, as now consecutive threads will read consecutive parts of global memory, which allows for coalescing. (see programming guide for more information)
Provided you chose the threadIdx.x nicely (e.g., a multiple of the warp size, i.e. 32*x), you should also not have a problem with alignment.
ps: sorry for my mistake in the first post, don’t know why i didn’t see the missing min(…) ;-)
just some things to check, maybe it’ll give you the missing seconds:
does each thread only read a single value from global mem? if not, it would be advisable to use shared mem or const/texture mem.
is your code divergent within a warp?
have you chosen the optimal blockSize according to the occupancy calculator? (you can also just test different ones)
are you using many registers in your kernel? can you reduce them?
are you using “expensive” functions like sinf, cosf, expf, etc.? there exist lower-precision alternatives which are much faster (they usually look like __cosf etc.; also make sure you are using the single-precision functions if you compile for sm_13)
look for things like 1.f/sqrtf(x), which will cost much more than rsqrtf(x) and do the same.
look for constant values which are computed the same way in every thread.
the performance guidelines in the programming guide are a good way to start and well worth reading. there is also information about the runtime of most functions.
Assuming you are bandwidth limited (most CUDA apps are), you should compute the effective bandwidth you are getting by counting up the total number of reads and writes your kernel does and dividing by the run time of the kernel. On GTX 280, you should be able to get ~110 GiB/s for fully coalesced reads.
As others have said, getting coalesced reads is very important. The difference in performance can be a factor of 10-20. The visual profiler can be used to verify that your reads are coalesced, but only on pre-G200 hardware.
If you can give us at least some idea of the memory access pattern of the kernel, we can suggest uses of constant, texture, and/or shared memory to boost performance. If telling us that would give too much away, you’ll just have to learn how and when to use them on your own. The “performance guidelines” section of the CUDA programming guide is the best place to start. Sites with additional information are linked to from the CUDA 2.1 FAQ: http://forums.nvidia.com/index.php?showtopic=84440