Full warp vs. half warp coalesced global memory loads

Hi all,

I use the code listed below to split a source image into two half-size images, where one destination image contains the even rows and the other contains the odd rows.

It's implemented as a sliding window in the vertical direction because it is part of a larger project which requires that. While playing around with the block dimensions, however, I noticed a large drop in performance when going from a (32,4) block to a (16,8) block. I know there's a performance hit when using blocks with an x-dimension of less than 16, but I was under the impression that 16 ints are coalesced into a single 64-byte memory transaction within a half warp.

Could anybody tell me why the full-warp (32,4) block is almost twice as fast as the half-warp (16,8) block?

Cheers,

Nico

Here’s the output generated on a Quadro FX 1600M (in all tests, the total number of threads per block as well as the amount of shared memory required is the same):

Uploading input data to GPU memory…

Running GPU test…

Array size : 2048x2048 (int)

blockDim : 128x1

gridDim : 16x1

Average GPU time : 1.945023 ms

Checking result…

Test: PASSED

Running GPU test…

Array size : 2048x2048 (int)

blockDim : 64x2

gridDim : 32x1

Average GPU time : 1.967922 ms

Checking result…

Test: PASSED

Running GPU test…

Array size : 2048x2048 (int)

blockDim : 32x4

gridDim : 64x1

Average GPU time : 2.053766 ms

Checking result…

Test: PASSED

Running GPU test…

Array size : 2048x2048 (int)

blockDim : 16x8

gridDim : 128x1

Average GPU time : 4.024047 ms

Checking result…

Test: PASSED

Running GPU test…

Array size : 2048x2048 (int)

blockDim : 8x16

gridDim : 256x1

Average GPU time : 17.548977 ms

Checking result…

Test: PASSED

Shutting down…

Press ENTER to exit…

[codebox]#include <cutil_inline.h>

#define NUM_ITERATIONS 128

__global__ void splitKernel(int* src, int* dst1, int* dst2, const int w, const int h) {

extern __shared__ int shared[];

const unsigned int stride 	= blockDim.y*w;

unsigned int		write	= threadIdx.y*w+blockIdx.x*blockDim.x+threadIdx.x;

unsigned int         read 	= threadIdx.y*w+blockIdx.x*blockDim.x+threadIdx.x;

for (unsigned int i=0;i<h/(2*blockDim.y);++i) {

	shared[(0*blockDim.y+threadIdx.y)*blockDim.x+threadIdx.x] = src[read];

	read+=stride;

	shared[(1*blockDim.y+threadIdx.y)*blockDim.x+threadIdx.x] = src[read];

	read+=stride;

	__syncthreads();

	dst1[write] = shared[(2*threadIdx.y+0)*blockDim.x+threadIdx.x];

	dst2[write] = shared[(2*threadIdx.y+1)*blockDim.x+threadIdx.x];

	write+=stride;

}

}

int main(int argc, char **argv) {

if ( cutCheckCmdLineFlag(argc, (const char**)argv, "device") )

    cutilDeviceInit(argc, argv);

else

    cudaSetDevice( cutGetMaxGflopsDeviceId() );

unsigned int hTimer;

cutilCheckError( cutCreateTimer(&hTimer) );

int* 	h_data1 = 0;

int* 	h_data2 = 0;

int*	d_data1 = 0;

int*	d_data2 = 0;

const unsigned int width  			= 2048;

const unsigned int height			= 2048;

cudaMallocHost((void**)&h_data1, width*height*sizeof(int));

cudaMallocHost((void**)&h_data2, width*height*sizeof(int));

for (unsigned int i = 0 ; i<width*height;++i)

	h_data1[i] = rand();

cudaMalloc((void **)&d_data1, width*height*sizeof(int));

cudaMalloc((void **)&d_data2, width*height*sizeof(int));

fprintf(stderr,"Uploading input data to GPU memory...\n");

cudaMemcpy(d_data1, h_data1, width*height*sizeof(int), cudaMemcpyHostToDevice);

for (unsigned int i=0;i<5;++i) {

	dim3 bdim(128>>i , 1<<i);

	dim3 gdim(width/bdim.x,1);

	fprintf(stderr,"\nRunning GPU test...\nArray size : %dx%d (int)\nblockDim : %dx%d\ngridDim : %dx%d\n",width,height,bdim.x,bdim.y,gdim.x,gdim.y);

	cutilSafeCall( cudaThreadSynchronize() );

	cutilCheckError( cutResetTimer(hTimer) );

	cutilCheckError( cutStartTimer(hTimer) );

	for(int it = 0; it < NUM_ITERATIONS; it++){

		splitKernel<<<gdim, bdim , 2*bdim.x*bdim.y*sizeof(int)>>>(d_data1, &d_data2[0], &d_data2[width*height/2], width, height);

		cutilCheckMsg("splitKernel() execution failed\n");

	}

	cutilSafeCall( cudaThreadSynchronize() );

	cutilCheckError( cutStopTimer(hTimer) );

	double gpuTime = cutGetTimerValue(hTimer) / NUM_ITERATIONS;

	fprintf(stderr,"Average GPU time : %f ms\n", gpuTime);

	cudaMemcpy(h_data2, d_data2, width*height*sizeof(int), cudaMemcpyDeviceToHost);

	fprintf(stderr,"Checking result...\n");

	bool passed = true;

	for (unsigned int y=0;y<height;++y) {

		for (unsigned int x=0;x<width;++x) {

			if (h_data1[y*width+x]!= h_data2[ (y&0x1)*width*(height>>1)+(y>>1)*width+x])

				passed = false;

		}

	}

	fprintf(stderr,"Test: %s\n", (passed ? "PASSED" : "FAILED"));

}

fprintf(stderr,"Shutting down...\n");

cudaFree(d_data1);

cudaFree(d_data2);

cudaFreeHost(h_data1);

cudaFreeHost(h_data2);

cutilCheckError( cutDeleteTimer(hTimer) );

cutilExit(argc, argv);

cudaThreadExit();

}

[/codebox]

The only thing I can think of at this moment is that you process twice as many values per synchronisation barrier when the block is twice as wide. Are you sure the performance difference has to do with global memory access at all?

If so, have you tried the CUDA profiler to see if memory read/writes are really being coalesced in both cases?

Thanks for the reply,

If I understand the documentation correctly, __syncthreads() synchronizes all threads within a block. In all tests the total number of threads per block (and therefore also the number of values processed per barrier) is the same.

I’m not sure it has anything to do with global memory accesses, but it was my first guess. The profiler shows that everything is coalesced and apart from the gpuTime column, all values for (32,4) and (16,8) are approximately the same (see attachment).
profiler.png

I do notice that “CTA Launched” is twice as high in the second selected column. Is this intentional? It would explain “twice as slow” very well.

It is supposed to be twice as high for the (16,8) block, because twice as many blocks have to be launched, but each block's sliding-window for-loop then runs half as many iterations as it does for the (32,4) block.

This is consistent for all other block sizes:

(128,1) -> cta Launched = 8

(64,2) -> cta Launched = 16

(32,4) -> cta Launched = 32

(16,8) -> cta Launched = 64

(8,16) -> cta Launched = 128

After some extensive testing, I found that the problem is caused by the for-loop and not by the global memory loads, but I'm still seeing a strange performance hit when going from a (32,4) block to a (16,8) one.

For example, for an (M,N) block size I can divide the 2048x2048 image into (2048/M)x(2048/(2N)) blocks (each block handles 2N source rows) and process each block independently, or I can issue (2048/M) blocks and let each block iterate 2048/(2N) times in the vertical direction.

It looks like using a for-loop is much faster for block sizes of (128,1), (64,2) and (32,4), but for a block size of (16,8) the for-loop is much slower. I'm guessing this is caused by block scheduling / time slicing, or maybe it's just my graphics card…

N.

Uploading input data to GPU memory…

Running GPU test…

Array size : 2048x2048 (int)

blockDim: 128x1

Average GPU time WITHOUT for-loop: 2.750547 ms

Checking result…

Test: PASSED

Average GPU time WITH for-loop: 1.953461 ms

Checking result…

Test: PASSED

Running GPU test…

Array size : 2048x2048 (int)

blockDim: 64x2

Average GPU time WITHOUT for-loop: 2.739609 ms

Checking result…

Test: PASSED

Average GPU time WITH for-loop: 1.936773 ms

Checking result…

Test: PASSED

Running GPU test…

Array size : 2048x2048 (int)

blockDim: 32x4

Average GPU time WITHOUT for-loop: 2.735976 ms

Checking result…

Test: PASSED

Average GPU time WITH for-loop: 2.017203 ms

Checking result…

Test: PASSED

Running GPU test…

Array size : 2048x2048 (int)

blockDim: 16x8

Average GPU time WITHOUT for-loop: 2.741539 ms

Checking result…

Test: PASSED

Average GPU time WITH for-loop: 3.584320 ms

Checking result…

Test: PASSED

Shutting down…

Press ENTER to exit…

[codebox]#include <cutil_inline.h>

#define NUM_ITERATIONS 128

__global__ void splitKernel1(int* src, int* dst1, int* dst2, const unsigned int w, const unsigned int h) {

extern __shared__ int shared[];

shared[                      threadIdx.y*blockDim.x+threadIdx.x] = src[(2*(blockIdx.y*blockDim.y+threadIdx.y)+0)*w+(blockIdx.x*blockDim.x+threadIdx.x)];

shared[blockDim.x*blockDim.y+threadIdx.y*blockDim.x+threadIdx.x] = src[(2*(blockIdx.y*blockDim.y+threadIdx.y)+1)*w+(blockIdx.x*blockDim.x+threadIdx.x)];

__syncthreads();

//! do something

//! __syncthreads();

dst1[(blockIdx.y*blockDim.y+threadIdx.y)*w+(blockIdx.x*blockDim.x+threadIdx.x)] = shared[                      threadIdx.y*blockDim.x+threadIdx.x];

dst2[(blockIdx.y*blockDim.y+threadIdx.y)*w+(blockIdx.x*blockDim.x+threadIdx.x)] = shared[blockDim.x*blockDim.y+threadIdx.y*blockDim.x+threadIdx.x];

}

__global__ void splitKernel2(int* src, int* dst1, int* dst2, const unsigned int w, const unsigned int h) {

extern __shared__ int shared[];

for (unsigned int i=0;i<h/(2*blockDim.y);++i) {

	shared[                      threadIdx.y*blockDim.x+threadIdx.x] = src[(2*(i*blockDim.y+threadIdx.y)+0)*w+(blockIdx.x*blockDim.x+threadIdx.x)];

	shared[blockDim.x*blockDim.y+threadIdx.y*blockDim.x+threadIdx.x] = src[(2*(i*blockDim.y+threadIdx.y)+1)*w+(blockIdx.x*blockDim.x+threadIdx.x)];

	__syncthreads();

	//! do something

	//! __syncthreads();

	dst1[(i*blockDim.y+threadIdx.y)*w+(blockIdx.x*blockDim.x+threadIdx.x)] = shared[                      threadIdx.y*blockDim.x+threadIdx.x];

	dst2[(i*blockDim.y+threadIdx.y)*w+(blockIdx.x*blockDim.x+threadIdx.x)] = shared[blockDim.x*blockDim.y+threadIdx.y*blockDim.x+threadIdx.x];

}

}

int main(int argc, char **argv) {

if ( cutCheckCmdLineFlag(argc, (const char**)argv, "device") )

    cutilDeviceInit(argc, argv);

else

    cudaSetDevice( cutGetMaxGflopsDeviceId() );

unsigned int hTimer;

cutilCheckError( cutCreateTimer(&hTimer) );

int* 	h_data1 = 0;

int* 	h_data2 = 0;

int*	d_data1 = 0;

int*	d_data2 = 0;

const unsigned int width  			= 2048;

const unsigned int height			= 2048;

cudaMallocHost((void**)&h_data1, width*height*sizeof(int));

cudaMallocHost((void**)&h_data2, width*height*sizeof(int));

for (unsigned int i = 0 ; i<width*height;++i)

	h_data1[i] = rand();

cudaMalloc((void **)&d_data1, width*height*sizeof(int));

cudaMalloc((void **)&d_data2, width*height*sizeof(int));

fprintf(stderr,"Uploading input data to GPU memory...\n");

cudaMemcpy(d_data1, h_data1, width*height*sizeof(int), cudaMemcpyHostToDevice);

for (unsigned int i=0;i<4;++i) {

	dim3 bdim(128>>i , 1<<i);

	dim3 gdim1(width/bdim.x,height/(2*bdim.y));

	dim3 gdim2(width/bdim.x,1);

	fprintf(stderr,"\nRunning GPU test...\nArray size : %dx%d (int)\nblockDim: %dx%d\n",width,height,bdim.x,bdim.y);

	//!------------------------------------------------------------------------------------------------------------------------------

	cutilSafeCall( cudaThreadSynchronize() );

	cutilCheckError( cutResetTimer(hTimer) );

	cutilCheckError( cutStartTimer(hTimer) );

	for(int it = 0; it < NUM_ITERATIONS; it++){

		splitKernel1<<<gdim1, bdim, 2*bdim.x*bdim.y*sizeof(int)>>>(d_data1, &d_data2[0], &d_data2[width*height/2], width, height);

		cutilCheckMsg("splitKernel1() execution failed\n");

	}

	cutilSafeCall( cudaThreadSynchronize() );

	cutilCheckError( cutStopTimer(hTimer) );

	double gpuTime = cutGetTimerValue(hTimer) / NUM_ITERATIONS;

	fprintf(stderr,"Average GPU time WITHOUT for-loop: %f ms\n", gpuTime);

	cudaMemcpy(h_data2, d_data2, width*height*sizeof(int), cudaMemcpyDeviceToHost);

	fprintf(stderr,"Checking result...\n");

	bool passed = true;

	for (unsigned int y=0;y<height;++y) {

		for (unsigned int x=0;x<width;++x) {

			if (h_data1[y*width+x]!= h_data2[ (y&0x1)*width*(height>>1)+(y>>1)*width+x])

				passed = false;

		}

	}

	fprintf(stderr,"Test: %s\n", (passed ? "PASSED" : "FAILED"));

	//!------------------------------------------------------------------------------------------------------------------------------

	cutilSafeCall( cudaThreadSynchronize() );

	cutilCheckError( cutResetTimer(hTimer) );

	cutilCheckError( cutStartTimer(hTimer) );

	for(int it = 0; it < NUM_ITERATIONS; it++){

		splitKernel2<<<gdim2, bdim, 2*bdim.x*bdim.y*sizeof(int)>>>(d_data1, &d_data2[0], &d_data2[width*height/2], width, height);

		cutilCheckMsg("splitKernel2() execution failed\n");

	}

	cutilSafeCall( cudaThreadSynchronize() );

	cutilCheckError( cutStopTimer(hTimer) );

	gpuTime = cutGetTimerValue(hTimer) / NUM_ITERATIONS;

	fprintf(stderr,"Average GPU time WITH    for-loop: %f ms\n", gpuTime);

	cudaMemcpy(h_data2, d_data2, width*height*sizeof(int), cudaMemcpyDeviceToHost);

	fprintf(stderr,"Checking result...\n");

	passed = true;

	for (unsigned int y=0;y<height;++y) {

		for (unsigned int x=0;x<width;++x) {

			if (h_data1[y*width+x]!= h_data2[ (y&0x1)*width*(height>>1)+(y>>1)*width+x])

				passed = false;

		}

	}

	fprintf(stderr,"Test: %s\n", (passed ? "PASSED" : "FAILED"));

}

fprintf(stderr,"Shutting down...\n");

cudaFree(d_data1);

cudaFree(d_data2);

cudaFreeHost(h_data1);

cudaFreeHost(h_data2);

cutilCheckError( cutDeleteTimer(hTimer) );

cutilExit(argc, argv);

cudaThreadExit();

}

[/codebox]

It would be explained if warps were assembled horizontally (across X) and never across Y: if your block's X dimension is 8, a warp would contain only 8 threads, wasting the 24 other threads the hardware is capable of running simultaneously.

This is something simple I have never thought about or tried (my apps don't use 2D thread blocks), but it would explain the behavior. The CUDA programming guide doesn't say anything about this one way or the other beyond "use a multiple of 32 threads per block", which doesn't get into any dimension effects.

If my theory is correct, performance drops by half when you go from X=32 to X=16, since each hardware warp has only 16 active threads instead of 32 and so wastes half its compute power (the SPs are idle).
And performance drops again from X=16 to X=8 because even more of each warp is idle, wasting even more compute. But this case suffers doubly, because now the number of threads is also less than the memory transaction size, so your memory accesses slow down by half as well; it could be worse still on old hardware, since you'd lose coalescing in some cases (depending on how coalescing works with 2D thread blocks, which is another unknown).

This is just a theory, but it makes sense. You could verify it with a simple program, something like

if (__any(threadIdx.x==0 && threadIdx.y==0)) atomicAdd(accumulator_pointer, 1);

This tells every thread in the warp that contains the (x=0, y=0) thread to safely increment a counter. Examine the value: if your block is 8 wide and the accumulator shows 8, it means the warp does indeed hold only 8 threads, and therefore 24 of the hardware's warp slots are idle.

If the counter says 32, then warps do get formed across both dimensions, my theory is wrong, and I don't know why your code gets slower. :-)

At first I was thinking it might be something along those lines, but my graphics card does not support warp vote functions, so I wasn't able to test the theory as you suggested.

In my last post I already tracked the problem down to the for-loop in the code. Without a for-loop, the kernel executes in a steady 2.74 ms for all block sizes, which seems to indicate that the performance drop is not caused by underpopulated warps. When using a for-loop that generates exactly the same results, however, the kernel executes in less than 2 ms (strangely enough, faster than the kernel without the for-loop) for block sizes of (128,1), (64,2) and (32,4), but for a block size of (16,8) performance drops to 3.58 ms.

As mentioned earlier, there don't seem to be any issues with memory coalescing or underpopulated warps when not using a for-loop, because in that case a (16,8) block performs just as well as a (32,4) block. So I guess the question is: how does a for-loop affect block scheduling?

N.