CUDA 2.2 running half the speed of 2.1

Hi,

When compiled with CUDA 2.2, my code runs at half the speed it does when compiled with CUDA 2.1. Here are excerpts from the profiler logs.

CUDA 2.1:

timestamp	method	gputime	cputime	occupancy	gridSizeX	blockSizeX	registerPerThread	branch	instructions	cta_launched
433463	_Z10cuda_4pipejjjjjjjjPhS_	74317.1	74348	0.25	65535	64	37	163802	24717049	10922
582794	memcopy	4.9	242
583662	_Z10cuda_4pipejjjjjjjjPhS_	74302.2	74316	0.25	65535	64	37	163858	24718589	10922
732528	memcopy	3.81	12

(Blocks were 64x1x1 with 64 bytes of static smem in every launch; the gld/gst, local_load/local_store, divergent_branch, and warp_serialize counters were all 0 in both runs.)

CUDA 2.2:

timestamp	method	gputime	cputime	occupancy	gridSizeX	blockSizeX	registerPerThread	branch	instructions	cta_launched
455903	_Z10cuda_4pipejjjjjjjjPhS_	147923	148007	0.25	65535	64	37	316738	49356931	10922
623450	_Z10cuda_4pipejjjjjjjjPhS_	147917	147995	0.25	65535	64	37	316738	49353107	10922

In particular, note that the number of instructions executed has doubled. Any suggestions on how to go about determining why?

Thanks!

I trimmed the unnecessary code from the source and combined everything needed into one file; the source is attached. When compiled with the command
/usr/local/cuda/bin/nvcc -O -I /usr/local/cuda/include -L /usr/local/cuda/lib -o cuda-4pipe cuda-4pipe.cu
the CUDA 2.1 build runs at twice the speed of the CUDA 2.2 build. On my 9600GSO, the CUDA 2.1 version runs in about 3.8 seconds while the CUDA 2.2 version takes 7.5 seconds. Any suggestions as to where the issue may lie?

Thanks!
cuda_4pipe.zip (2.77 KB)

As a quick first question (without even looking at your code), have you compared register use in both compiles?

Different toolkits can produce different register counts. Unfortunately, a higher count can sometimes increase each block's resource requirements, meaning the number of simultaneous blocks an SM can handle decreases, giving you a net slowdown. In my SHA1 code, CUDA 2.0 used 1/3 of the registers that CUDA 2.2 did, and this caused problems on G80/G90.
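
A quick way to check, without firing up the profiler: ptxas can print each kernel's register and shared memory usage at compile time. Using your compile line, something along these lines

/usr/local/cuda/bin/nvcc -O --ptxas-options=-v -I /usr/local/cuda/include -L /usr/local/cuda/lib -o cuda-4pipe cuda-4pipe.cu

should print the per-kernel register and smem counts under each toolkit, so you can diff them directly.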

Well, it’s not register count; I got 35 in CUDA 2.0 and 37 in 2.3.

But I did reproduce the 2.3 slowdown… 1.8 seconds with CUDA 2.0 and 3.3 seconds with CUDA 2.3, on a 280GTX, which is even a different architecture from your 9600GSO.

Now why… I don’t know.

Reproduced in Barra, which is yet another kind of architecture… :)

CUDA 2.2 is correct.

The bug was that CUDA 2.1 was using a signed 16-bit multiplication for bx * bd in:

  /* Block index */
  int bx = blockIdx.x;

  /* Block of threads dimension */
  int bd = blockDim.x;

  /* Thread index */
  int tx = threadIdx.x;

  /* Drop out early if we don't have any data to process */
  if( ((bx * bd) + tx) > process_amount ) {
	return;
  }

So blocks with an ID greater than 32767 caused a numeric overflow/wraparound, and you end up processing much less data than expected…

Too bad for your speedup. ;)
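
If anyone needs to keep building with the older toolkits, a minimal workaround sketch (assuming the guard shown above is the only place bx * bd is formed) is to compute the index with an explicitly unsigned 24-bit multiply, so the compiler has no signed 16-bit path to pick:

  /* Global thread index; __umul24() forces an unsigned 24-bit multiply,
     which is safe here because blockIdx.x stays well below 2^24 */
  unsigned int idx = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;

  if( idx > process_amount ) {
	return;
  }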

If bx * bd is wrapping, then shouldn’t it always be less than process_amount, and end up doing more work than it should? I’m not sure I follow…

Edit: I still don’t follow the explanation, but it was trivial to demonstrate that you are correct. Thanks to this CUDA 2.0 and 2.1 bug, all blocks with indices > 32767 are not processed. This invalidates nearly all of distributed.net’s RC5 CUDA results to date, a lot of work, but at least it’s now known and corrected in CUDA 2.2 and 2.3.

Right, but the comparison that is done afterward is unsigned. (go figure…)
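
To make the arithmetic concrete, take a hypothetical block 40000, thread 0, with blockDim.x = 64 as in the profile:

  short bx = (short)40000;            /* sign-wraps to -25536 */
  int r2 = bx * 64 + 0;               /* -25536 * 64 + 0 = -1634304 */
  unsigned int u = (unsigned int)r2;  /* reinterpreted as 4293332992 */
  /* the unsigned compare process_amount >= 4293332992 fails,
     so every thread in the block returns early */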

For the record, the assembly code generated by CUDA 2.1 was:

mov.half.u16 r0.hi, s[0x000c]   # blockIdx.x
mov.half.u16 r1.lo, s[0x0002]   # blockDim.x
cvt.u32.u16.rn r2, r0.lo        # threadIdx.x
mad24.s16 r2, r0.hi, r1.lo, r2  # r2 = bx * bd + tx, *signed* 16-bit multiply (the bug)
set.ge.u32 p0|_, s[0x002c], r2  # p0 = (process_amount >= r2);
@p0.equ return                  # if(!p0) return;

And by CUDA 2.2:

mov.u16.u16.rd r1.lo, s[0x0002]    # blockDim.x
mul24.u16.lo r1, s[0x000c], r1.lo  # r1 = bx * bd, *unsigned* 16-bit multiply
cvt.u32.u16.rn r0, r0.lo           # threadIdx.x
add.s32 r1, r0, r1                 # r1 = bx * bd + tx
set.ge.u32 p0|_, s[0x002c], r1     # p0 = (process_amount >= r1);
@p0.equ return                     # if(!p0) return;