CUDA 2.2 running half the speed of 2.1

Hi,

When compiled with CUDA 2.2, my code runs at half the speed it does when compiled with CUDA 2.1. Here are excerpts from the profiler logs.

CUDA 2.1:

timestamp	method	gputime	cputime	occupancy	gridSizeX	blockSizeX	registerPerThread	branch	instructions	cta_launched
433463	_Z10cuda_4pipejjjjjjjjPhS_	74317.1	74348	0.25	65535	64	37	163802	24717049	10922
582794	memcopy	4.9	242
583662	_Z10cuda_4pipejjjjjjjjPhS_	74302.2	74316	0.25	65535	64	37	163858	24718589	10922
732528	memcopy	3.81	12

(Blocks were 64x1x1 with 64 bytes of static smem in every launch; the gld/gst, local_load/local_store, divergent_branch, and warp_serialize counters were all 0 in both runs.)

CUDA 2.2:

timestamp	method	gputime	cputime	occupancy	gridSizeX	blockSizeX	registerPerThread	branch	instructions	cta_launched
455903	_Z10cuda_4pipejjjjjjjjPhS_	147923	148007	0.25	65535	64	37	316738	49356931	10922
623450	_Z10cuda_4pipejjjjjjjjPhS_	147917	147995	0.25	65535	64	37	316738	49353107	10922

In particular, note that the number of instructions executed has doubled. Any suggestions on how to go about determining why?

Thanks!

I trimmed the unnecessary code from the source and combined everything needed into one file; the source is attached. When compiled with the command
/usr/local/cuda/bin/nvcc -O -I /usr/local/cuda/include -L /usr/local/cuda/lib -o cuda-4pipe cuda-4pipe.cu
the CUDA 2.1 build runs at twice the speed of the CUDA 2.2 build. On my 9600GSO, the CUDA 2.1 version runs in about 3.8 seconds while the CUDA 2.2 version takes 7.5 seconds. Any suggestions as to where the issue may lie?

Thanks!
cuda_4pipe.zip (2.77 KB)

As a quick first question (without even looking at your code), have you compared register use in both compiles?

Different toolkits can produce different register counts. Unfortunately, a higher count can sometimes increase each block's resource requirements, meaning the number of simultaneous blocks an SM can handle decreases, giving you a net slowdown. In my SHA1 code, CUDA 2.0 used 1/3 of the registers that CUDA 2.2 did, and this caused problems on G80/G90.
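
A quick way to check, without firing up the profiler: ptxas can print each kernel's register and shared memory usage at compile time. Using your compile line, something along these lines

/usr/local/cuda/bin/nvcc -O --ptxas-options=-v -I /usr/local/cuda/include -L /usr/local/cuda/lib -o cuda-4pipe cuda-4pipe.cu

should print the per-kernel register and smem counts under each toolkit, so you can diff them directly.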

Well, it’s not register count; I got 35 in CUDA 2.0 and 37 in 2.3.

But I did reproduce the 2.3 slowdown… 1.8 seconds with CUDA 2.0 and 3.3 seconds with CUDA 2.3, on a 280GTX, which is even a different architecture from your 9600GSO.

Now why… I don’t know.

Reproduced in Barra, which is yet another kind of architecture… :)

CUDA 2.2 is correct.

The bug was that CUDA 2.1 was using a signed 16-bit multiplication for bx * bd in:

  /* Block index */
  int bx = blockIdx.x;

  /* Block of threads dimension */
  int bd = blockDim.x;

  /* Thread index */
  int tx = threadIdx.x;

  /* Drop out early if we don't have any data to process */
  if( ((bx * bd) + tx) > process_amount ) {
	return;
  }

So blocks with an ID greater than 32767 caused a numeric overflow/wraparound, and you end up processing much less data than expected…

Too bad for your speedup. ;)
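
If anyone needs to keep building with the older toolkits, a minimal workaround sketch (assuming the guard shown above is the only place bx * bd is formed) is to compute the index with an explicitly unsigned 24-bit multiply, so the compiler has no signed 16-bit path to pick:

  /* Global thread index; __umul24() forces an unsigned 24-bit multiply,
     which is safe here because blockIdx.x stays well below 2^24 */
  unsigned int idx = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;

  if( idx > process_amount ) {
	return;
  }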

If bx * bd is wrapping, then shouldn’t it always be less than process_amount, and end up doing more work than it should? I’m not sure I follow…

Edit: I still don’t follow the explanation, but it was trivial to demonstrate that you are correct. Thanks to this CUDA 2.0 and 2.1 bug, all blocks with indices > 32767 are not processed. This invalidates nearly all of distributed.net’s RC5 CUDA results to date, a lot of work, but at least it’s now known and corrected in CUDA 2.2 and 2.3.

Right, but the comparison that is done afterward is unsigned. (go figure…)
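
To make the arithmetic concrete, take a hypothetical block 40000, thread 0, with blockDim.x = 64 as in the profile:

  short bx = (short)40000;            /* sign-wraps to -25536 */
  int r2 = bx * 64 + 0;               /* -25536 * 64 + 0 = -1634304 */
  unsigned int u = (unsigned int)r2;  /* reinterpreted as 4293332992 */
  /* the unsigned compare process_amount >= 4293332992 fails,
     so every thread in the block returns early */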

For the record, the assembly code generated by CUDA 2.1 was:

mov.half.u16 r0.hi, s[0x000c]   # blockIdx.x
mov.half.u16 r1.lo, s[0x0002]   # blockDim.x
cvt.u32.u16.rn r2, r0.lo        # threadIdx.x
mad24.s16 r2, r0.hi, r1.lo, r2  # r2 = bx * bd + tx, *signed* 16-bit multiply (the bug)
set.ge.u32 p0|_, s[0x002c], r2  # p0 = (process_amount >= r2);
@p0.equ return                  # if(!p0) return;

And by CUDA 2.2:

mov.u16.u16.rd r1.lo, s[0x0002]    # blockDim.x
mul24.u16.lo r1, s[0x000c], r1.lo  # r1 = bx * bd, *unsigned* 16-bit multiply
cvt.u32.u16.rn r0, r0.lo           # threadIdx.x
add.s32 r1, r0, r1                 # r1 = bx * bd + tx
set.ge.u32 p0|_, s[0x002c], r1     # p0 = (process_amount >= r1);
@p0.equ return                     # if(!p0) return;