Extension cl_nv_pragma_unroll doesn't seem to work

eimunic · September 21, 2011, 9:57am

Hello,

I am running in a curious problem concerning the unroll pragma.

When I use “#pragma unroll 4” before a loop within my kernel, I don’t get any performance-improvement and NO compiler-warning.

So I’ve unrolled the loop myself which resulted in a good performance gain. Thus I assume that #pragma unroll is not working correctly.

When I explicitly try to enable the cl_nv_prama_unroll extension in my .cpp file, I receive the following warning:

warning #161: unrecognized #pragma

#pragma OPENCL EXTENSION cl_nv_pragma_unroll : enable

        ^

What am I doing wrong?

Btw, I am using CUDA 4.0 and a Quadro 6000.

Best,

eimunic

laughingrice · September 25, 2011, 10:08am

Hello,

I am running in a curious problem concerning the unroll pragma.

When I use “#pragma unroll 4” before a loop within my kernel, I don’t get any performance-improvement and NO compiler-warning.

So I’ve unrolled the loop myself which resulted in a good performance gain. Thus I assume that #pragma unroll is not working correctly.

When I explicitly try to enable the cl_nv_prama_unroll extension in my .cpp file, I receive the following warning:
warning #161: unrecognized #pragma

#pragma OPENCL EXTENSION cl_nv_pragma_unroll : enable

        ^
What am I doing wrong?

Btw, I am using CUDA 4.0 and a Quadro 6000.

Best,

eimunic

This pragma should go in the .cl file not in the .cpp file

eimunic · September 28, 2011, 12:04pm

Hi laughingrice,

thanks for the advice. Now it does not produce any compiler-warnings, BUT still no speedup.

Can I somehow determine if the loop unrooling took place (beside looking at the ptx files)?

Best,
eimunic

eimunic · October 6, 2011, 8:55am

Hello,

I am still wondering why I do not get any speed up at all.

Here is what the code looks like:

#pragma OPENCL EXTENSION cl_nv_pragma_unroll : enable

__kernel void dense(__global float *Ahat, __global float *bhat, __global float *x, 

			     const int n, const int workload, __global float *buffer)

{

int idx = get_group_id(0) * workload * BLOCKDIM; 

    int idy = get_group_id(1) * BLOCKDIM + get_local_id(1); 

__local float tile[BLOCKDIM][BLOCKDIM+1];

    __local float s_x[BLOCKDIM];

/*the dense part*/

    float tmp = 0.0f;

    int i,j,l;

for(i=0;i < workload;++i){

	/*load to shared memory*/

	...

    	barrier(CLK_LOCAL_MEM_FENCE);

	/*start computation*/

#ifdef MY_UNROLL

	for(j=0;j < get_local_size(1) ;j+=16){

	    tmp+=tile[get_local_id(1)][j] * s_x[j];

	    tmp+=tile[get_local_id(1)][j+1] * s_x[j+1];

	    tmp+=tile[get_local_id(1)][j+2] * s_x[j+2];

	    tmp+=tile[get_local_id(1)][j+3] * s_x[j+3];

	    tmp+=tile[get_local_id(1)][j+4] * s_x[j+4];

	    tmp+=tile[get_local_id(1)][j+5] * s_x[j+5];

	    tmp+=tile[get_local_id(1)][j+6] * s_x[j+6];

	    tmp+=tile[get_local_id(1)][j+7] * s_x[j+7];

	    tmp+=tile[get_local_id(1)][j+8] * s_x[j+8];

	    tmp+=tile[get_local_id(1)][j+9] * s_x[j+9];

	    tmp+=tile[get_local_id(1)][j+10] * s_x[j+10];

	    tmp+=tile[get_local_id(1)][j+11] * s_x[j+11];

	    tmp+=tile[get_local_id(1)][j+12] * s_x[j+12];

	    tmp+=tile[get_local_id(1)][j+13] * s_x[j+13];

	    tmp+=tile[get_local_id(1)][j+14] * s_x[j+14];

	    tmp+=tile[get_local_id(1)][j+15] * s_x[j+15];

	}

#else

#pragma unroll 16

	for(j=0;j < get_local_size(1) ;++j){

	    tmp+=tile[get_local_id(1)][j] * s_x[j];

	}

#endif

    	barrier(CLK_LOCAL_MEM_FENCE);

    }

    buffer[idy * get_num_groups(0) + get_group_id(0)] = tmp;

}

So if I define MY_UNROLL I get a speedup of two. From my point of view the two versions (with define MY_UNROLL and without) should yield the same runtime, but they don’t.

Thus it seems that the #pragma unroll is not working.

Can someone help me out, please?

Best,

eimunic

laughingrice · October 12, 2011, 10:34pm

Two things that I can think about. Partial unrolling requires an extra conditional to handle the tail cases, maybe that takes away too much of the benefit for some reason. Another thing is that I seem to recall that the compiler had some issue with unrolling either an internal or external loop (don’t recall which). Try manually setting workload to 1 and removing the external loop and see if that helps.
Also watch the compiler output for any warning regarding failed loop unrolling, or compare ptx code and see if the loop is actually unrolled.

Topic		Replies	Views
#Pragma unroll doesn't work? CUDA Programming and Performance	8	6152	September 19, 2008
#pragma unroll not working? CUDA Programming and Performance	3	5019	June 8, 2009
#pragma unroll CUDA Programming and Performance	20	5954	July 27, 2010
Problems about #pragma unroll and auto optimization CUDA Programming and Performance	0	3505	March 8, 2009
CUDA #pragma CUDA Programming and Performance	1	1784	July 28, 2013
Loop unrolling CUDA Programming and Performance	3	2746	April 25, 2012
compiler bug? CUDA Programming and Performance	4	1841	January 13, 2009
pragma unroll error - Advisory: Loop was not unrolled, unexpected control flow construct CUDA Programming and Performance	3	1742	July 18, 2010
BUG? nvcc fails to unroll the loop CUDA Programming and Performance	6	6105	May 26, 2009
loop unrolling CUDA Programming and Performance	11	17239	January 31, 2008

Extension cl_nv_pragma_unroll doesn't seem to work

Related topics