Yet another performance question

Hi all,

I have a kernel which puzzles me; below is the kernel in its simplest form:

for ( int iTraceChunk = iTraceChunkStart; iTraceChunk < smTracesChunks; iTraceChunk++ )
{
	__syncthreads();

	int iCurrentTraceChunkSize = TRACE_SRXO_CHUNK_SIZE;

	if ( threadIdx.x < iCurrentTraceChunkSize )  // One thread per trace - maybe should do 2 or 3 per trace?
	{
		// Do the algorithm's main calculation - uses ~30 registers. The smXXX parameters are the output params.
		CalculateTraceParams( .... , smnb[ threadIdx.x ], smNb[ threadIdx.x ], smnumMoveSamples[ threadIdx.x ], smw1[ threadIdx.x ], smw2[ threadIdx.x ] );
	}

	__syncthreads();

	////////////////////////////////////////////////////////////////////////////////////
	// If I break out of the kernel now, I get ~145ms, which is amazing given that this is
	// the main calculation: ~30 registers, asin, atan, sqrt etc.
	//
	// Continuing the kernel with the code below brings the kernel time to ~1900ms!
	// I don't get it. It's only additions and work with shared memory. Why would it take so much time?
	////////////////////////////////////////////////////////////////////////////////////

	// Go over the sample chunks.
	for ( int iTimeChunk = 0; iTimeChunk < iTimeBlock; iTimeChunk++ )
	{
		float fOut1 = 0.0f, fOut2 = 0.0f, fOut3 = 0.0f;
		int iTimeIndex = iTimeChunk * BLOCK_THREAD_SIZE + threadIdx.x;

		// Go over the traces in the current chunk using the loop, which I unrolled manually.
		for ( int iTraceIndex = 0; iTraceIndex < iCurrentTraceChunkSize; iTraceIndex += 8 )
		{
			int iCurrentTraceIndex = iTraceChunk * TRACE_SRXO_CHUNK_SIZE + iTraceIndex;

			if ( ( smNb[ iTraceIndex ] != -1 ) && ( iTimeIndex >= smNb[ iTraceIndex ] ) && ( iTimeIndex < smNb[ iTraceIndex ] + smnumMoveSamples[ iTraceIndex ] ) )
			{
				fOut1 += smw1[ iTraceIndex ] + smw2[ iTraceIndex ];
				fOut2 += smw1[ iTraceIndex ] - smw2[ iTraceIndex ];
				fOut3 += smw1[ iTraceIndex ] * smw2[ iTraceIndex ];
			}
			iCurrentTraceIndex++;

			if ( ( smNb[ iTraceIndex + 1 ] != -1 ) && ( iTimeIndex >= smNb[ iTraceIndex + 1 ] ) && ( iTimeIndex < smNb[ iTraceIndex + 1 ] + smnumMoveSamples[ iTraceIndex + 1 ] ) )
			{
				fOut1 += smw1[ iTraceIndex + 1 ] + smw2[ iTraceIndex + 1 ];
				fOut2 += smw1[ iTraceIndex + 1 ] - smw2[ iTraceIndex + 1 ];
				fOut3 += smw1[ iTraceIndex + 1 ] * smw2[ iTraceIndex + 1 ];
			}
			iCurrentTraceIndex++;

			.......
			.......

			iCurrentTraceIndex++;

			if ( ( smNb[ iTraceIndex + 7 ] != -1 ) && ( iTimeIndex >= smNb[ iTraceIndex + 7 ] ) && ( iTimeIndex < smNb[ iTraceIndex + 7 ] + smnumMoveSamples[ iTraceIndex + 7 ] ) )
			{
				fOut1 += smw1[ iTraceIndex + 7 ] + smw2[ iTraceIndex + 7 ];
				fOut2 += smw1[ iTraceIndex + 7 ] - smw2[ iTraceIndex + 7 ];
				fOut3 += smw1[ iTraceIndex + 7 ] * smw2[ iTraceIndex + 7 ];
			}
		}

		// Write data to gmem
		pTemp1[ smOutputPos + iTimeIndex ] += fOut1;
		pTemp2[ smOutputPos + iTimeIndex ] += fOut2;
		pTemp3[ smOutputPos + iTimeIndex ] += fOut3;
	}
}

iTimeBlock == 4 and iCurrentTraceChunkSize == 80

Why does it take so much time? Bank conflicts? If so, what can be done?

Thanks in advance

Hi,

If you are “breaking out” of your kernel, do you still write the results to global mem? If you don’t, the optimizer will eliminate almost all of your computation.
In general, I think it’s better to use clock() to measure how long a chunk of code takes, instead of commenting regions out and hoping nothing funky happens.

In any case, use the profiler to look for bank conflicts (the field is “warp serialize”, or something like that) and other oddities. Make sure no two threads in a half-warp access different addresses in the same bank.
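For what it's worth, here's a minimal sketch of the clock() approach (the kernel name, buffers, and the timed section are hypothetical; on pre-Fermi hardware clock() returns a 32-bit per-SM cycle counter, so keep the timed section short). Note that the timed section's result is still written to global memory, so the compiler can't eliminate it:

```cuda
// Hypothetical kernel: time a compute section with the per-SM cycle counter.
__global__ void timedKernel( float* pOut, unsigned int* pCycles )
{
	unsigned int start = clock();          // 32-bit per-multiprocessor counter

	float fAcc = 0.0f;
	for ( int i = 0; i < 1000; i++ )       // the section under test
		fAcc += sqrtf( (float)i );

	pOut[ threadIdx.x ] = fAcc;            // write the result so it isn't optimized away

	unsigned int stop = clock();
	if ( threadIdx.x == 0 )
		pCycles[ blockIdx.x ] = stop - start;  // cycles for this block
}
```

Copy pCycles back with cudaMemcpy and divide by the shader clock (1.3 GHz on a C1060) to convert cycles to seconds.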

Hi,

I’ve taken your advice and used the clock option, and indeed the first part of the code is ~10 times faster than the second part, which only works with shared memory and sums things up.

Any ideas why? I’ll run the profiler tomorrow, but I’m not sure I can avoid the bank conflicts. Given the amount of work done by each part, I expected the second part to be by far the faster one, and now I see it’s the opposite.

thanks

Hi,
I’ve run the code through the profiler on a Tesla C1060. I’d appreciate any assistance… :)
Profiler.JPG

Looking at the profiler output, it looks like divergent warps might be your problem (note that the higher the number of divergent warps, the longer the kernel takes); bank conflicts are about the last thing to optimize for.

The first part of your code runs when threadIdx.x < iCurrentTraceChunkSize (no divergence; there might just be some inactive threads in the last warp).
The second part has 3 ifs that are potential warp-divergence causes, and they are within 3 nested for loops!

Maybe you can get rid of some for loops by letting a block calculate what you now calculate per thread?

Hi Denis,

The problem is that the output of the first part of the calculation can reach ~200MB*100 for even a medium-sized project. So if I split it up (into 2 kernels - is that what you suggest?)

I’ll have to write and read that amount of data. That seems like too much, no?

I still don’t get why the second part, which works only on registers and smem, would take sooooo long :( Should the divergence really have such a vast impact on performance?

thanks

eyal

Hi Eyal,

The second part runs in a double for loop, so it runs 320 times. (1900 - 145) ms / 320 = ~5.5 ms per iteration.

You might win some time with the following changes: iTraceChunk * TRACE_SRXO_CHUNK_SIZE -> __mul24( iTraceChunk, TRACE_SRXO_CHUNK_SIZE ), and the same for iTimeChunk * BLOCK_THREAD_SIZE. Integer multiplication takes a lot more time (from the back of my mind, 16x) than a floating-point multiplication; __mul24 only 4x as much.
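A sketch of that substitution, with constants taken from the thread (__mul24 gives the exact product only when both operands fit in 24 bits, which easily holds at these sizes):

```cuda
#define TRACE_SRXO_CHUNK_SIZE 80   // value from the thread
#define BLOCK_THREAD_SIZE     320  // hypothetical block size

__device__ int currentTraceIndex( int iTraceChunk, int iTraceIndex )
{
	// was: iTraceChunk * TRACE_SRXO_CHUNK_SIZE + iTraceIndex
	return __mul24( iTraceChunk, TRACE_SRXO_CHUNK_SIZE ) + iTraceIndex;
}

__device__ int timeIndex( int iTimeChunk )
{
	// was: iTimeChunk * BLOCK_THREAD_SIZE + threadIdx.x
	return __mul24( iTimeChunk, BLOCK_THREAD_SIZE ) + threadIdx.x;
}
```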

But it might indeed be that splitting the kernel into 2 parts and using the parallelism possible in the second part will enhance the speed. (You can count how much data you need to write in the first and read in the second; divide that by the bandwidth of almost 100 GB/s and you see how much overhead that will introduce.)

That is one not so nice part of CUDA, it is not always possible to predict how much a code change will improve the speed. You really have to try and see.

With iCurrentTraceChunkSize = 80, and the unrolled part going from iTraceIndex + 0 to iTraceIndex + 7, you have 80*8 combinations, so a 40x8 thread block with a

#pragma unroll 2
for ( int i_offset = 0; i_offset < 2; i_offset++ )
{
}

followed by a reduction can calculate one pTemp1[ smOutputPos + iTimeIndex ] per block.
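A minimal sketch of the block-level reduction suggested here (kernel name and block size are hypothetical; the tree reduction assumes a power-of-two block size): each thread produces one partial sum, the block reduces the partials in shared memory, and thread 0 writes the single result.

```cuda
#define BLOCK_SIZE 256  // a power of two keeps the tree reduction simple

__global__ void blockReduce( const float* pIn, float* pOut )
{
	__shared__ float smPartial[ BLOCK_SIZE ];

	// Each thread contributes one partial result (the real kernel would
	// compute its (trace, offset) term here instead of loading it).
	smPartial[ threadIdx.x ] = pIn[ blockIdx.x * BLOCK_SIZE + threadIdx.x ];
	__syncthreads();

	// Tree reduction in shared memory.
	for ( unsigned int s = BLOCK_SIZE / 2; s > 0; s >>= 1 )
	{
		if ( threadIdx.x < s )
			smPartial[ threadIdx.x ] += smPartial[ threadIdx.x + s ];
		__syncthreads();
	}

	if ( threadIdx.x == 0 )
		pOut[ blockIdx.x ] = smPartial[ 0 ];  // one pTemp1-style value per block
}
```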

Hi,

Ok, I’ve narrowed it down further. I’ve changed the code to this:

for ( int iTraceChunk = iTraceChunkStart; iTraceChunk < smTracesChunks; iTraceChunk++ )
{
	__syncthreads();

	int iCurrentTraceChunkSize = TRACE_SRXO_CHUNK_SIZE;

	if ( threadIdx.x < iCurrentTraceChunkSize )  // One thread per trace - maybe should do 2 or 3 per trace?
	{
		// Do the algorithm's main calculation - uses ~30 registers. The smXXX parameters are the output params.
		CalculateTraceParams( .... , smnb[ threadIdx.x ], smNb[ threadIdx.x ], smnumMoveSamples[ threadIdx.x ], smw1[ threadIdx.x ], smw2[ threadIdx.x ] );
	}

	__syncthreads();

	// Go over the sample chunks.
	for ( int iTimeChunk = 0; iTimeChunk < iTimeBlock; iTimeChunk++ )
	{
		float fOut1 = 0.0f;
		int iTimeIndex = iTimeChunk * BLOCK_THREAD_SIZE + threadIdx.x;

		// Go over the traces in the current chunk.
		for ( int iTraceIndex = 0; iTraceIndex < iCurrentTraceChunkSize; iTraceIndex++ )
		{
			fOut1++;
		}

		// Write data to gmem
		pTemp1[ smOutputPos + iTimeIndex ] += fOut1;
	}
}

This takes ~145ms. If I comment out the fOut1++ line, it takes ~25ms - obviously the optimizer has optimized out the loops when I take this line out. But it just re-shows what I keep getting: those loops/register adds (no ifs, no smem) cause the kernel to run ~6x slower! (and in the real code it gets to ~10x slower). I've used different project settings, hence the times differ from my previous posts.

I just don’t understand how this can be :( :(

smTracesChunks == 68, iTimeBlock == 4 and iCurrentTraceChunkSize == 80.

So the fOut1++ line executes 68 * 80 * 4 = 21760 times.

Any further suggestions are more than welcome :(

thanks

eyal

It is quite simple: integer calculation is a lot slower in CUDA than floating point, and you have lots of integer operations (more than 21760 per thread!)

Get rid of integer ops by using the fact that your for loops are parallelizable.

Hi,

Long time :) was busy with some other stuff… :(

Attached are the profiler results for two different runs. The first, “Loop unrolled”, is with the innermost loop unrolled to 16 (i.e. for ( … iValue += 16 )); the second, “No Unrolling”, is a simple loop without unrolling ( …iValue++ ).

Look at the kernel with the 75% occupancy in both outputs. The “Loop unrolled” version runs ~30% faster.

It seems to me that the main difference between the runs is in the “branch” field (although I don’t fully understand why there should be such a difference); the divergence looks about the same, because of how the data is distributed, and it is distributed the same for both runs.

BTW, if I add more integer/float operations to the inner loop, the kernel time doesn’t seem to increase much, so I assume I’m still not GFlop-limited (and BTW, according to the number of operations / grid size, I indeed shouldn’t be).

What do you think? I’m quite puzzled at what’s eating my performance.

thanks a lot

eyal
Profile1.jpg

Not sure if you noticed, but more than 70% of the instructions you schedule are branches.

So let’s have a look at your IPC and see what that tells us (caveat: I may be grossly misinterpreting the values in the profiler, so someone shoot me down if I’m wrong here).

You have (on the marked NoUnroll kernel) the following numbers:

Instructions per MP: 268834658

Time: 1.855s

Now I see you have a C1060, so that’s 1.3 GHz shader clock.

As the MP schedules an instruction every 4 clocks, we can compute your IPC in the following way:

IPC = 4 * instructions / clock / time = 4 * 268834658 / 1.3e9 / 1.855 ≈ 0.44

So basically, that’s not too bad. You can potentially get up to 1.0 (and a little higher, actually), but that will only make you twice as fast.

Any significant speed-up will have to come from algorithmic changes.

Hi,

Thanks for the info. Yes, the main loop contains an if statement with a few computations inside it, so the main code that runs is the if. I guess the best speedup could be achieved if I were able to reduce the number of if statements. Not sure I can :)

Still, a speedup of two would be great… :) Any suggestions regarding this, or on how to check what’s causing the IPC to be at 0.44 and not higher?

thanks

eyal

Well, there are a couple of things you can try. Most people will tell you to check whether you’re bandwidth-limited.

If you truly are, that’s your problem, and nothing short of an algorithmic change will help you.

So a better tip is to look at the PTX for high-latency instructions. Missing the ‘f’ behind a floating-point literal can

be really, really expensive. Simply look for opcodes that have a 64 in the name and change the offending lines

(use “-LIST:source=on” when building the PTX to make life easier). Other things that hurt are integer ops, especially

modulo. If you loop over some variable and compute per-thread offsets with integer-divide and modulo, you can either

(big win) use shift-and-mask techniques to remove the mods, or (still a good win) use an incremental update technique,

using some conditionals. (Branches hurt, but some integer-ops hurt more).

Hope that helps. ;)
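For concreteness, a sketch of both tricks for a hypothetical power-of-two row width W: the shift/mask form replaces divide and modulo outright, and the incremental form carries (row, col) through the loop with just a compare and an add.

```cuda
#define W       256        // row width; must be a power of two for the shift/mask form
#define W_SHIFT 8          // log2(W)

__host__ __device__ void offsetsShiftMask( int i, int* pRow, int* pCol )
{
	*pRow = i >> W_SHIFT;       // replaces i / W
	*pCol = i & ( W - 1 );      // replaces i % W
}

// Incremental update: no division at all, and it works for any W.
__global__ void walkOffsets( int n, int* pOut )
{
	int row = 0, col = 0;
	for ( int i = 0; i < n; i++ )
	{
		pOut[ i ] = col;                        // per-iteration column offset without %
		if ( ++col == W ) { col = 0; row++; }   // conditional instead of div/mod
	}
}
```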

Hi,

Thanks for the tips. I’ll have a look at the PTX tomorrow, though as far as I remember, I did see 64 in the instructions. Could you please elaborate on why this is a problem?

As for the integer stuff, I do calculate offsets per thread, but mostly with __mul24 and ‘+’, i.e. no div and mod.

What is the incremental update technique you were speaking about? Can you elaborate on this as well?

thanks a lot,

eyal

Hi,

Here’s the PTX code for the main internal loop. Do you see anything special/faulty about it? Thanks.

// 559		  for( int iTraceIndex = 0; iTraceIndex < BLOCK_THREAD_SIZE; iTraceIndex++ )

 // 560		  {

 // 561  

 // 562			if ( ( iTimeIndex >= smNb[ iTraceIndex ] ) && ( iTimeIndex < smnumMoveSamples[ iTraceIndex ] ) )

	ld.shared.s32 	%r82, [%rd36+0];	// id:490 __cuda_smNb16960+0x0

	setp.gt.s32 	%p10, %r82, %r58;	// 

	@%p10 bra 	$Lt_1_106;		 	// 

 //<loop> Part of loop body line 556, head labeled $Lt_1_92

	add.u64 	%rd37, %rd35, %rd18; 	// 

	ld.shared.s32 	%r83, [%rd37+0];	// id:491 __cuda_smnumMoveSamples17984+0x0

	setp.le.s32 	%p11, %r83, %r58;	// 

	@%p11 bra 	$Lt_1_106;		 	// 

 //<loop> Part of loop body line 556, head labeled $Lt_1_92

	add.u64 	%rd38, %rd35, %rd22; 	// 

	ld.shared.s32 	%r84, [%rd38+0];	// id:492 __cuda_smInputTracePos19008+0x0

	add.s32 	%r85, %r84, %r58;		// 

	mov.s32 	%r86, %r85;			  // 

	mov.s32 	%r87, 0;			 	// 

	mov.s32 	%r88, 0;			 	// 

	mov.s32 	%r89, 0;			 	// 

	tex.1d.v4.f32.s32 {%f16,%f17,%f18,%f19},[texPhaseInput,{%r86,%r87,%r88,%r89}];

	.loc	14	568	0

 // 564				  float w2 = smw2[ iTraceIndex ];

 // 565				  float w1 = 1.f - w2;

 // 566				  int iInputPos = smInputTracePos[ iTraceIndex ] + iTimeIndex;

 // 567  

 // 568				  fTraceOutPhase += tex1Dfetch( texPhaseInput, iInputPos ) * w1 + tex1Dfetch( texPhaseInput, iInputPos + 1 ) * w2;

	mov.f32 	%f20, %f16;			  // 

	add.s32 	%r90, %r85, 1;	   	// 

	mov.s32 	%r91, %r90;			  // 

	mov.s32 	%r92, 0;			 	// 

	mov.s32 	%r93, 0;			 	// 

	mov.s32 	%r94, 0;			 	// 

	tex.1d.v4.f32.s32 {%f21,%f22,%f23,%f24},[texPhaseInput,{%r91,%r92,%r93,%r94}];

	mov.f32 	%f25, %f21;			  // 

	add.u64 	%rd39, %rd35, %rd23; 	// 

	ld.shared.f32 	%f26, [%rd39+0];	// id:493 __cuda_smw220032+0x0

	mov.f32 	%f27, 0f3f800000;		// 1

	sub.f32 	%f28, %f27, %f26;		// 

	mul.f32 	%f29, %f28, %f20;		// 

	mad.f32 	%f30, %f25, %f26, %f29;	// 

	add.f32 	%f12, %f12, %f30;		// 

	mov.s32 	%r95, %r85;			  // 

	mov.s32 	%r96, 0;			 	// 

	mov.s32 	%r97, 0;			 	// 

	mov.s32 	%r98, 0;			 	// 

	tex.1d.v4.f32.s32 {%f31,%f32,%f33,%f34},[texPhase2Input,{%r95,%r96,%r97,%r98}];

	.loc	14	569	0

 // 569				  fTraceOutPhase2 += tex1Dfetch( texPhase2Input, iInputPos ) * w1 + tex1Dfetch( texPhase2Input, iInputPos + 1 ) * w2;

	mov.f32 	%f35, %f31;			  // 

	mov.s32 	%r99, %r90;			  // 

	mov.s32 	%r100, 0;				// 

	mov.s32 	%r101, 0;				// 

	mov.s32 	%r102, 0;				// 

	tex.1d.v4.f32.s32 {%f36,%f37,%f38,%f39},[texPhase2Input,{%r99,%r100,%r101,%r102}];

	mov.f32 	%f40, %f36;			  // 

	mul.f32 	%f41, %f28, %f35;		// 

	mad.f32 	%f42, %f40, %f26, %f41;	// 

	add.f32 	%f11, %f11, %f42;		// 

	mov.s32 	%r103, %r85;		 	// 

	mov.s32 	%r104, 0;				// 

	mov.s32 	%r105, 0;				// 

	mov.s32 	%r106, 0;				// 

	tex.1d.v4.f32.s32 {%f43,%f44,%f45,%f46},[texUseForStackInput,{%r103,%r104,%r105,%r106}];

	.loc	14	570	0

 // 570				  fTraceOutStack += tex1Dfetch( texUseForStackInput, iInputPos ) * w1 + tex1Dfetch( texUseForStackInput, iInputPos + 1 ) * w2;

	mov.f32 	%f47, %f43;			  // 

	mov.s32 	%r107, %r90;		 	// 

	mov.s32 	%r108, 0;				// 

	mov.s32 	%r109, 0;				// 

	mov.s32 	%r110, 0;				// 

	tex.1d.v4.f32.s32 {%f48,%f49,%f50,%f51},[texUseForStackInput,{%r107,%r108,%r109,%r110}];

	mov.f32 	%f52, %f48;			  // 

	mul.f32 	%f53, %f28, %f47;		// 

	mad.f32 	%f54, %f52, %f26, %f53;	// 

	add.f32 	%f14, %f14, %f54;		// 

	.loc	14	574	0

 // 571		  ///		fTraceOutPhase += threadIdx.x * w1 + blockIdx.x * w2;

 // 572		  //		fTraceOutPhase2 += numMoveSamples * w1 + Nb * w1; 

 // 573		  //		fTraceOutStack += iTimeIndex * w1 + iInputPos * w2;

 // 574				  fFold++;

	mov.f32 	%f55, 0f3f800000;		// 1

	add.f32 	%f13, %f13, %f55;		// 

$Lt_1_106:

$L_1_63:

 //<loop> Part of loop body line 556, head labeled $Lt_1_92

	add.u16 	%rh1, %rh1, 4;	   	// 

	add.u64 	%rd35, %rd35, 4;	 	// 

	add.u64 	%rd36, %rd36, 4;	 	// 

	mov.u16 	%rh2, 1024;			  // 

	setp.ne.s16 	%p12, %rh1, %rh2;	// 

	@%p12 bra 	$Lt_1_92;			  // 

 //<loop> Part of loop body line 532, head labeled $Lt_1_85

	.loc	14	534	0

	add.s32 	%r62, %r62, 1;	   	// 

	ld.shared.s32 	%r111, [smCalculatedTracesChunks];	// id:470 smCalculatedTracesChunks+0x0

	setp.gt.s32 	%p13, %r111, %r62;	// 

	@%p13 bra 	$Lt_1_85;			  // 

	bra.uni 	$Lt_1_83;				//

Do you have optimization enabled?

That code looks totally unoptimized. (Note all those redundant “mov.s32 X, 0” instructions.)
Other than that, you have some 64-bit integer math in there, but I don’t know how expensive that is.

Hi,

I thought I had…

This is the command I use:

"$(CUDA_BIN_PATH)\nvcc.exe" --keep --ptxas-options="-v -mem" --opencc-options "-LIST:source=on" -maxrregcount=30 -ccbin "$(VCInstallDir)bin" -arch sm_13 -c -D_MBCS -Xcompiler /EHsc,/W3,/nologo,/Wp64,/O2,/Zi,/MT -I"C:\Program Files (x86)\NVIDIA Corporation\NVIDIA CUDA SDK\common\inc" -I./ -I../../common/inc -o $(PlatformName)$(ConfigurationName)\GGPUGenericEngine.obj D:\GeoProject\GeoEngine\GenericEngine\GPU\GGPUGenericEngine.cu