Interchangeability of block X and Y configuration on launch

I’m currently testing GPU performance by analyzing how the launch configuration (blockDim and gridDim) affects it.

I have a reasonable understanding of the results from all the different configurations apart from one.
If I launch with the configuration
blockDimX = 2^2
blockDimY = 2^6

It’s 3 times slower than
blockDimX = 2^6
blockDimY = 2^2

(Edited the numbers: I meant 2^2 and 2^6, not 2 and 6 by themselves, sorry.)

(I’m respecting the dimensions and keeping the grid the same.)
How can this be?
It doesn’t look like it’s affecting warps, threads per SM, threads per block, the register limit, or any of the other important parameters… Any hints?
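
For concreteness, the two launches I’m comparing look roughly like this (the kernel name, the data size, and the grid math are just placeholders, not my actual code):

const int width = 4096, height = 4096;          // placeholder data size

dim3 blockA(1 << 2, 1 << 6);                    // blockDim.x = 4,  blockDim.y = 64  (slow case)
dim3 gridA((width  + blockA.x - 1) / blockA.x,
           (height + blockA.y - 1) / blockA.y);

dim3 blockB(1 << 6, 1 << 2);                    // blockDim.x = 64, blockDim.y = 4   (fast case)
dim3 gridB((width  + blockB.x - 1) / blockB.x,
           (height + blockB.y - 1) / blockB.y);

// myKernel<<<gridA, blockA>>>( /* args */ );   // ~3x slower
// myKernel<<<gridB, blockB>>>( /* args */ );   // same total threads, ~3x faster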

EDIT: It might be important to mention that the compute capability is 7.5.

A blockDim.x of 2 and a blockDim.x of 6 are both probably lousy choices.

An authoritative answer cannot be given, IMO, without knowing how you compute the x and y indexes in your kernel code. However, I will assume the “canonical” form:

int idx = threadIdx.x+blockDim.x*blockIdx.x;
int idy = threadIdx.y+blockDim.y*blockIdx.y;

Let’s also suppose you have a “two-dimensional” data set, and perhaps you are doing an operation like:

out[idy*width+idx] = in[idy*width+idx];
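
Putting those two pieces together, a minimal self-contained version might look like this (the kernel name, the float type, and the bounds check are my additions for illustration):

__global__ void copy2d(const float *in, float *out, int width, int height)
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x;   // column
    int idy = threadIdx.y + blockDim.y * blockIdx.y;   // row

    if (idx < width && idy < height)                   // guard the right/bottom edges
        out[idy * width + idx] = in[idy * width + idx];
}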

A blockDim.x of 2 and a blockDim.y of 6 means you have 2 threads in the x direction and 6 threads in the y direction. (A total of 12 threads per block is a bad choice performance-wise.) The only warp in that block will have

2 threads that read from row q
2 threads that read from row q+1
2 threads that read from row q+2
...
2 threads that read from row q+5

GPU cache lines are organized horizontally, i.e. row-wise, in memory. Those first two threads above will force a cacheline load of at least 32 bytes, and maybe 128 bytes or more, yet the warp only needs 2 quantities (8 bytes, for 4-byte data) out of that load. The next two threads will force another, different cacheline load, but again only need 2 of the loaded quantities. And so on.

Now work the same description with the other case.
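
The same reasoning carries over to the sizes in your edited question (4x64 vs 64x4), just per warp instead of per block. If it helps to see which (x, y) positions end up in the same warp, this little host-side sketch mimics the x-fastest linearization the hardware uses to form warps:

#include <stdio.h>

// Print the (threadIdx.x, threadIdx.y) assigned to the 32 lanes of the
// first warp of a block, given that thread IDs are linearized x-fastest.
void show_first_warp(int bx, int by)
{
    printf("block %dx%d:\n", bx, by);
    for (int lane = 0; lane < 32; ++lane) {
        int tx = lane % bx;
        int ty = (lane / bx) % by;
        printf("  lane %2d -> (x=%2d, y=%d)\n", lane, tx, ty);
    }
}

int main(void)
{
    show_first_warp(4, 64);   // a warp spans 8 different rows, 4 columns each
    show_first_warp(64, 4);   // a warp spans 1 row, 32 consecutive columns
    return 0;
}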

Then answer these 2 questions:

  1. How many cachelines had to be loaded per block in each case?
  2. Per block, what was the efficiency (bytes used/bytes loaded) in each case?

Hopefully that does it for you.
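
And if you want to sanity-check your counts afterwards, here is a rough bit of bookkeeping for the 4x64 and 64x4 shapes, under some assumptions of mine (4-byte elements, 128-byte L1 cachelines, blocks aligned to a line boundary, and counting only the read side of the copy):

#include <stdio.h>

int main(void)
{
    const int line_bytes = 128;   // assumed L1 cacheline size
    const int elem_bytes = 4;     // assumed int/float data

    int shapes[2][2] = { {4, 64}, {64, 4} };   // {blockDim.x, blockDim.y}

    for (int i = 0; i < 2; ++i) {
        int bx = shapes[i][0], by = shapes[i][1];
        // each of the by rows touched by the block needs
        // ceil(bx * elem_bytes / line_bytes) cachelines
        int lines_per_row   = (bx * elem_bytes + line_bytes - 1) / line_bytes;
        int lines_per_block = lines_per_row * by;
        double used   = (double)bx * by * elem_bytes;
        double loaded = (double)lines_per_block * line_bytes;
        printf("%2d x %-2d block: %2d cachelines, %5.1f%% of loaded bytes used\n",
               bx, by, lines_per_block, 100.0 * used / loaded);
    }
    return 0;
}

Under those assumptions it works out to 64 cachelines at 12.5% efficiency for the 4x64 block, versus 8 cachelines at 100% for the 64x4 block.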

Yes, huge caches with zero inefficiencies (evictions, etc.) should mostly calm this down. But GPUs provide no guarantee of thread or threadblock execution order, so large data sets (ones that exceed the cache size, and remember the L1 cache is “pretty small”) can easily thrash the cache with poor block-sizing choices like these.

Either one of those choices is a lousy way to design GPU code. Design as if caches did not exist (memory still gets loaded segment by segment, as if there were 32-byte cache lines, even when no caching occurs). Ideally the caches are there to fix the “small” problems that are hard to design out of your code. Don’t depend on them for the problems that are easy to design out.


I’m sorry, I made a mistake in my original description: it should be 2^2 and 2^6 rather than simply 2 and 6.

However, I do think your answer explained the detail that I was missing, which was that cache lines are organized row-wise.

Thank you!!