I wrote a very simple kernel (see below) and ran it with different parameter sets.
The results surprised me somewhat:
even if you exclude global memory reads, the kernel time does not change as you increase the number of blocks (each block is 32 threads = 1 warp in this example).
The graph in the attached image shows the dependence of kernel time (Y) on the number of blocks (X).
Does anybody have an idea how to explain the “constant” parts of the graph?
// XS, YS and K are assumed to be __constant__ arrays set up by the host, e.g.:
// __constant__ int XS[1], YS[1], K[1];
__device__ inline
float func(float &x, float &y)
{
    return (__cosf(x)*__cosf(x) + __sinf(x)*__sinf(x)
          + __cosf(y)*__cosf(y) + __sinf(y)*__sinf(y)) / 2;
    //return 1;
}

__global__ void
get_ints(float *d_minx, float *d_miny, float *d_maxx, float *d_maxy, float *dres)
{
    int tidx = blockIdx.x*blockDim.x*blockDim.y + threadIdx.y*blockDim.x + threadIdx.x;
    int xs = XS[0];
    int ys = YS[0];
    int all = xs*ys;
    int t = K[0];
    for (int kk = 0; kk < t; ++kk)
    {
        int idx = kk*blockDim.x*blockDim.y*gridDim.x + tidx;
        float x0 = kk;    //d_minx[idx];
        float y0 = kk;    //d_miny[idx];
        float xL = kk + 1;//d_maxx[idx];
        float yL = kk + 1;//d_maxy[idx];
        float dx = (xL - x0)/xs;
        float dy = (yL - y0)/ys;
        float x = x0 + dx/2;
        float y = y0 + dy/2;
        float ret = 0;
        //for (int i = 0; i < ys; ++i)
        //{
        //    for (int j = 0; j < xs; ++j)
        //    {
        //        ret += func(x, y)*dx*dy;
        //        x += dx;
        //    }
        //    y += dy;
        //}
        // seems ineffective
        // well, nevermind - it's just a test :)
        for (int i = 0; i < all; ++i)
        {
            ret += func(x, y)*dx*dy;
            y += dy;
            if (i % ys == ys - 1)
            {
                x += dx;
                y = y0 + dy/2; // was "y = 0"; restart the column at the cell midpoint
            }
        }
        dres[idx] = ret;
    }
}
If the answer is 4, that would explain the graph. A card running 4*16=64 blocks at a time would show discrete jumps every 64 blocks, because the 65th block needs to wait until the first set of blocks finish before starting. The 65th block adds a fixed amount of running time to the entire kernel. Blocks 66-128 can slot right into otherwise unused multiprocessors at that point, leading to the flat time behavior you see. Then you get another discrete jump in time at block 129, which has to wait until blocks 65-128 finish before starting.
Of course you’re right! The graph jumps because the kernel’s occupancy allows only 4 blocks per MP to run concurrently. (I meant it implicitly :) )
However, the more interesting question is: “Why does the kernel time not change between the jumps?”
It cannot be explained by global memory latency, since there are no global memory reads. Any ideas?
Ah, good point. Unless there is latency to hide, there should be jumps every 16 blocks, rather than every 64. Perhaps there are read-after-write register dependencies in the generated code? Only 32 threads per block means that those sorts of dependencies will stall the pipeline. Does interleaving block execution hide that as well?
Those plots are very interesting! The shorter stairsteps in the 96-thread plot do confirm that there was some kind of instruction latency in the 32-thread version that extra blocks could hide. I’m a little surprised that the stairsteps are every 32 blocks and not every 16 as I expected. That says that, for this kernel, if you have an odd number of blocks on a multiprocessor, there is room to run another block with no additional time penalty. The CUDA manual says that read-after-write register dependencies are hidden once you have 192 active threads per multiprocessor. Perhaps the jumps every 32 blocks are because there still is some sort of collision in this kernel, even with 96 threads.
Registers per block = CEILING(MyWarpsPerBlock * 2; 4) * 16 * MyRegCount
I see. You’re right.
Well, that means it is better (from an “occupancy → max” point of view) to use blocks with an even number of warps (in case we are not limited by shared memory allocation).
An official NVIDIA comment about register allocation would be valuable :)
I’ve made some experiments with my kernel, and here are the results. Maybe they will be interesting for some of you.
1.JPG This graph shows the dependence of kernel performance on the total number of blocks for different threads-per-block (tpb) values.
As you can see, peak performance is achieved with an even number of warps per block, at certain “magic” numbers of blocks.
NB: The term “operation” here stands for an elementary unit of work done by the kernel in the context of the concrete task. (I.e. the number of “jobs” the kernel does, in terms of your task, is the number of “operations”.)
2.JPG This graph is just another form of the previous one. It gives the dependence of kernel time on the number of blocks for different tpb values.
3.JPG Let’s try to work out a formula for the previous graph. IMO, and according to my experiments, the graph “jumps” when the following inequality (see image) becomes true, i.e. when MP resources become overloaded according to the occupancy value.
4.JPG So we can figure out a kind of kernel time model which includes “occupancy” and other kernel parameters.
T0 is a rather “strange” parameter. It is the kernel execution time when the MP is fully loaded (i.e. directly before the first “jump”), divided by the number of “operations” executed by the kernel.
5.JPG I changed the number of “operations” per kernel, made kernel timing graphs, and compared them with the theoretical ones (from the previous image). The T0 values for the theoretical graphs were computed from the previous data. The theoretical graphs fit perfectly, except at the beginning of the graph steps for even warps-per-block numbers (marked in red).
6.JPG The T0(threads-per-block) graph. At first sight, very strange.
7.JPG However, occupancy depends on threads-per-block in a similar way.
We can guess that T0 = tA*occupancy + tB (see 8.JPG)
Does anybody have remarks?
Now I’m trying to figure out:
why the theoretical graphs diverge from the experimental ones in 5.JPG