Waiting for global memory access.

Well, I’ve made some more tests:

I wrote a very simple kernel (see below) and ran it with different parameter sets.

The results slightly surprised me:

even if you exclude the global memory reads, the kernel time does not change as you increase the number of blocks (each block is 32 threads = 1 warp in this example).

The graph in the attached image shows the dependence of kernel time (Y) on the number of blocks (X).

Does anybody have an idea how to explain the “constant” parts of the graph?

// XS, YS and K are assumed to be __constant__ device arrays set from the host;
// their declarations were not shown in the original post.
__constant__ int XS[1];   // x resolution of the integration grid
__constant__ int YS[1];   // y resolution of the integration grid
__constant__ int K[1];    // number of sub-intervals handled per thread

__device__ inline
float func(float &x, float &y)
{
    // sin^2 + cos^2 == 1 for both arguments, so this always evaluates to 1,
    // but it keeps the ALU busy with intrinsic calls.
    return (__cosf(x)*__cosf(x) + __sinf(x)*__sinf(x)
          + __cosf(y)*__cosf(y) + __sinf(y)*__sinf(y)) / 2;
    //return 1;
}

__global__ void
get_ints(float *d_minx, float *d_miny, float *d_maxx, float *d_maxy, float *dres)
{
    int tidx = blockIdx.x*blockDim.x*blockDim.y + threadIdx.y*blockDim.x + threadIdx.x;

    int xs  = XS[0];
    int ys  = YS[0];
    int all = xs*ys;
    int t   = K[0];

    for (int kk = 0; kk < t; ++kk)
    {
        int idx = kk*blockDim.x*blockDim.y*gridDim.x + tidx;

        // The global memory reads are commented out to exclude them from the timing.
        float x0 = kk;      //d_minx[idx];
        float y0 = kk;      //d_miny[idx];
        float xL = kk + 1;  //d_maxx[idx];
        float yL = kk + 1;  //d_maxy[idx];

        float dx = (xL - x0)/xs;
        float dy = (yL - y0)/ys;
        float x  = x0 + dx/2;
        float y  = y0 + dy/2;
        float ret = 0;

        // Original nested-loop midpoint rule, kept for reference:
        //for (int i = 0; i < ys; ++i)
        //{
        //    for (int j = 0; j < xs; ++j)
        //    {
        //        ret += func(x, y)*dx*dy;
        //        x += dx;
        //    }
        //    y += dy;
        //}

        // seems ineffective
        // well, nevermind - it's just a test :)
        // (Note: this flattened loop resets y to 0 instead of y0 + dy/2 at the end
        // of each row, but since func() always returns 1 the result is unaffected.)
        for (int i = 0; i < all; ++i)
        {
            ret += func(x, y)*dx*dy;
            y += dy;
            if (i % ys == ys - 1)
            {
                x += dx;
                y = 0;
            }
        }

        dres[idx] = ret;
    }
}
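The launch and timing code was not included in the post; a minimal sketch of how such a kernel could be timed is below. The grid/block sizes, the way XS/YS/K are filled, and the reuse of dres for the unused pointer arguments are assumptions of this sketch, not the original harness.

#include <cuda_runtime.h>

// Times one launch of get_ints and returns the kernel time in milliseconds.
// Assumes it lives in the same file as the kernel above, that XS, YS, K are the
// __constant__ int[1] symbols declared there, and that dres holds at least
// numBlocks*threadsPerBlock*k floats.
float time_kernel(int numBlocks, int threadsPerBlock,
                  int xs, int ys, int k, float *dres)
{
    cudaMemcpyToSymbol(XS, &xs, sizeof(int));
    cudaMemcpyToSymbol(YS, &ys, sizeof(int));
    cudaMemcpyToSymbol(K,  &k,  sizeof(int));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    // The min/max pointers are unused while the global reads are commented out,
    // so dres is passed for all of them in this test.
    get_ints<<<numBlocks, threadsPerBlock>>>(dres, dres, dres, dres, dres);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}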

Can you use the CUDA occupancy calculator ( http://developer.download.nvidia.com/compu…_calculator.xls ) to figure out how many blocks can run your kernel simultaneously per multiprocessor?

If the answer is 4, that would explain the graph. A card running 4*16=64 blocks at a time would show discrete jumps every 64 blocks, because the 65th block needs to wait until the first set of blocks finish before starting. The 65th block adds a fixed amount of running time to the entire kernel. Blocks 66-128 can slot right into otherwise unused multiprocessors at that point, leading to the flat time behavior you see. Then you get another discrete jump in time at block 129, which has to wait until blocks 65-128 finish before starting.
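For reference, the "blocks in flight" number behind that explanation can also be queried programmatically; the sketch below uses cudaOccupancyMaxActiveBlocksPerMultiprocessor, which only appeared in CUDA 6.5, so it is an alternative to the spreadsheet rather than what was available at the time.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void get_ints(float*, float*, float*, float*, float*);  // kernel from above

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int blocksPerMP = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerMP, get_ints, /*blockSize=*/32, /*dynamicSMemSize=*/0);

    int blocksInFlight = blocksPerMP * prop.multiProcessorCount;
    printf("%d blocks/MP x %d MPs = %d blocks running concurrently\n",
           blocksPerMP, prop.multiProcessorCount, blocksInFlight);
    printf("so a jump in kernel time is expected every %d blocks\n", blocksInFlight);
    return 0;
}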

seibert

Of course you’re right! The graph jumps because the kernel’s occupancy allows only 4 blocks per MP to run concurrently. (I implied that :) )

However, the more interesting question is: why doesn’t the kernel time change between the jumps?
It cannot be explained by global memory latency, since there are no global memory reads. Any ideas?

Ah, good point. Unless there is latency to hide, there should be jumps every 16 blocks, rather than every 64. Perhaps there are read-after-write register dependencies in the generated code? Only 32 threads per block means that those sorts of dependencies will stall the pipeline. Does interleaving block execution hide that as well?

Probably you’re right.

However, it is very surprising to me that read-after-write register dependencies affect performance so much.
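For illustration, the arithmetic in func() forms exactly this kind of dependent chain; the comments below are an informal reading of the source, not compiler output:

__device__ float chain_example(float x)
{
    float c  = __cosf(x);   // result written to a register
    float c2 = c * c;       // read right after the write: the warp stalls for the
                            // register read-after-write latency unless another
                            // resident warp can be issued in the meantime
    float s  = __sinf(x);   // independent of c2, so it may be scheduled earlier
    float s2 = s * s;       // again depends on the immediately preceding result
    return c2 + s2;         // depends on both chains
}

With only one warp per block and few blocks per MP there may simply be no other warp ready to issue while those stalls drain, so extra blocks end up hiding the latency instead.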

Can you try the same test with 64 and 96 threads per block? If it really is read-after-write latency, then I would expect more stair-stepping.

Graphs for the same kernel with 64 and 96 threads per block are attached.

Hm… the question is how to interpret these results.

Those plots are very interesting! The shorter stairsteps in the 96 thread plot do confirm that there was some kind of instruction latency in the 32 thread version that extra blocks could hide. I’m a little surprised that the stairsteps come every 32 blocks and not every 16 as I expected. That says that, for this kernel, if you have an odd number of blocks on a multiprocessor, there is room to run another block with no additional time penalty. The CUDA manual says that read-after-write dependencies can be hidden once you have 192 active threads per multiprocessor. Perhaps the jumps every 32 blocks mean there is still some sort of collision in this kernel, even with 96 threads.

seibert
Thanks for the tips.

As I understand it, the key thing here is that block size affects occupancy.

(To be honest, I thought occupancy depended only on the number of registers and the amount of shared memory allocated by the kernel.)

There is one more thing I still do not understand:

Open the Occupancy Calculator and choose:
32 threads per block
25 registers per thread
40 bytes of shared memory per thread

The calculator gives us “allocation = 1600 registers per block”,
while I expected 25*32 = 800.

If you set 64 threads per block, the allocation is still 1600 registers per block, so occupancy doubles.

Hmm, apparently registers are allocated with some kind of minimum. This is the calculation:

CEILING(MyWarpsPerBlock*2; 4) * 16 * MyRegCount

So apparently there is a minimum amount of registers allocated, namely the registers for 2 warps.
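Spelled out in code, the calculator’s rule looks roughly like the sketch below (ceil_to and regs_per_block are just names for this sketch); it reproduces the 1600 figure for both 32 and 64 threads per block at 25 registers per thread.

#include <cstdio>

// Registers allocated per block according to the occupancy calculator:
//   CEILING(MyWarpsPerBlock*2; 4) * 16 * MyRegCount
// i.e. registers are handed out in chunks of two warps' worth (4 half-warps of 16 threads).
static int ceil_to(int x, int multiple)
{
    return ((x + multiple - 1) / multiple) * multiple;
}

static int regs_per_block(int threadsPerBlock, int regsPerThread)
{
    int warpsPerBlock = ceil_to(threadsPerBlock, 32) / 32;
    return ceil_to(warpsPerBlock * 2, 4) * 16 * regsPerThread;
}

int main()
{
    printf("%d\n", regs_per_block(32, 25));   // 1600, not the naive 25*32 = 800
    printf("%d\n", regs_per_block(64, 25));   // still 1600
    printf("%d\n", regs_per_block(96, 25));   // 3200: the next two-warp chunk
    return 0;
}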

CEILING(MyWarpsPerBlock*2; 4) * 16 * MyRegCount
I see. You’re right.

Well, that means it is better (from the “maximize occupancy” point of view) to use blocks with an even number of warps (as long as we are not limited by shared memory allocation).

An official NVIDIA comment about register allocation would be valuable :)

I’ve made some experiments with my kernel and here are the results. Maybe they will be interesting to some of you.

1.JPG The graph shows the dependence of kernel performance on the total number of blocks for different threads-per-block (tpb) values.
As you can see, peak performance is reached with an even number of warps per block, at certain “magic” numbers of blocks.
NB: “operation” here stands for the elementary unit of work the kernel does in the context of the concrete task (i.e. the number of “operations” is the number of “jobs” the kernel does in terms of your task).

2.JPG This graph is just another form of the previous one. It gives the dependence of kernel time on the number of blocks for different tpb values.

3.JPG Let’s try to work out a formula for the previous graph. IMO, and according to my experiments, the graph “jumps” when the inequality shown in the image becomes true, i.e. when the MP resources become overloaded according to the occupancy value.

4.JPG So we can put together a kind of kernel time model which includes occupancy and the other kernel parameters.
T0 is a rather “strange” parameter: it is the kernel execution time when the MPs are fully loaded (i.e. directly before the first “jump”), divided by the number of “operations” executed by the kernel.
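The formula itself is only in the attached image, so the sketch below is just a guess at the shape of such a model; every name in it is an assumption.

// Guessed shape of the model: time is flat within one "wave" of concurrently
// resident blocks and jumps by a fixed amount at every wave boundary.
//   blocksInFlight = blocksPerMP * numMPs   (from the occupancy value)
//   waves          = ceil(numBlocks / blocksInFlight)
//   T(numBlocks)  ~= T0 * opsPerBlock * blocksInFlight * waves
float predicted_time(int numBlocks, int blocksPerMP, int numMPs,
                     float opsPerBlock, float T0)
{
    int blocksInFlight = blocksPerMP * numMPs;
    int waves = (numBlocks + blocksInFlight - 1) / blocksInFlight;  // ceiling division
    return T0 * opsPerBlock * blocksInFlight * waves;
}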

5.JPG I changed the number of “operations” per kernel, made kernel timing graphs and compared them with the theoretical ones (from the previous image). The T0 values for the theoretical curves were computed from the previous data. The theoretical graphs fit perfectly, except at the beginning of the graph steps for even warps-per-block counts (marked in red).

6.JPG The T0(threads-per-block) graph. At first sight it looks very strange.

7.JPG However, occupancy depends on threads-per-block in a similar way.
We can guess that T0 = tA*occupancy + tB (see 8.JPG).
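If that linear relation holds, tA and tB can be estimated from the measured (occupancy, T0) pairs with an ordinary least-squares fit; the arrays in this sketch are placeholders, not the measured values.

#include <cstdio>

int main()
{
    const int n = 4;
    float occ[n] = { 0.25f, 0.33f, 0.50f, 0.67f };  // placeholder occupancy values
    float t0[n]  = { 0.0f, 0.0f, 0.0f, 0.0f };      // fill in the measured T0 values

    // Least-squares fit of T0 = tA*occupancy + tB.
    float sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (int i = 0; i < n; ++i) {
        sx += occ[i];  sy += t0[i];
        sxx += occ[i]*occ[i];  sxy += occ[i]*t0[i];
    }
    float tA = (n*sxy - sx*sy) / (n*sxx - sx*sx);
    float tB = (sy - tA*sx) / n;
    printf("tA = %f, tB = %f\n", tA, tB);
    return 0;
}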

Does anybody have any remarks?

Now I’m trying to figure out:

  1. why the theoretical graphs diverge from the experimental ones in 5.JPG
  2. what tA and tB mean.