Extension cl_nv_pragma_unroll doesn't seem to work


I am running in a curious problem concerning the unroll pragma.

When I use “#pragma unroll 4” before a loop within my kernel, I don’t get any performance-improvement and NO compiler-warning.

So I’ve unrolled the loop myself which resulted in a good performance gain. Thus I assume that #pragma unroll is not working correctly.

When I explicitly try to enable the cl_nv_prama_unroll extension in my .cpp file, I receive the following warning:

warning #161: unrecognized #pragma

#pragma OPENCL EXTENSION cl_nv_pragma_unroll : enable


What am I doing wrong?

Btw, I am using CUDA 4.0 and a Quadro 6000.



This pragma should go in the .cl file not in the .cpp file

Hi laughingrice,

thanks for the advice. Now it does not produce any compiler-warnings, BUT still no speedup.

Can I somehow determine if the loop unrooling took place (beside looking at the ptx files)?



I am still wondering why I do not get any speed up at all.

Here is what the code looks like:

#pragma OPENCL EXTENSION cl_nv_pragma_unroll : enable

__kernel void dense(__global float *Ahat, __global float *bhat, __global float *x, 

			     const int n, const int workload, __global float *buffer)


int idx = get_group_id(0) * workload * BLOCKDIM; 

    int idy = get_group_id(1) * BLOCKDIM + get_local_id(1); 

__local float tile[BLOCKDIM][BLOCKDIM+1];

    __local float s_x[BLOCKDIM];

/*the dense part*/

    float tmp = 0.0f;

    int i,j,l;

for(i=0;i < workload;++i){

	/*load to shared memory*/



	/*start computation*/

#ifdef MY_UNROLL

	for(j=0;j < get_local_size(1) ;j+=16){

	    tmp+=tile[get_local_id(1)][j] * s_x[j];

	    tmp+=tile[get_local_id(1)][j+1] * s_x[j+1];

	    tmp+=tile[get_local_id(1)][j+2] * s_x[j+2];

	    tmp+=tile[get_local_id(1)][j+3] * s_x[j+3];

	    tmp+=tile[get_local_id(1)][j+4] * s_x[j+4];

	    tmp+=tile[get_local_id(1)][j+5] * s_x[j+5];

	    tmp+=tile[get_local_id(1)][j+6] * s_x[j+6];

	    tmp+=tile[get_local_id(1)][j+7] * s_x[j+7];

	    tmp+=tile[get_local_id(1)][j+8] * s_x[j+8];

	    tmp+=tile[get_local_id(1)][j+9] * s_x[j+9];

	    tmp+=tile[get_local_id(1)][j+10] * s_x[j+10];

	    tmp+=tile[get_local_id(1)][j+11] * s_x[j+11];

	    tmp+=tile[get_local_id(1)][j+12] * s_x[j+12];

	    tmp+=tile[get_local_id(1)][j+13] * s_x[j+13];

	    tmp+=tile[get_local_id(1)][j+14] * s_x[j+14];

	    tmp+=tile[get_local_id(1)][j+15] * s_x[j+15];



#pragma unroll 16

	for(j=0;j < get_local_size(1) ;++j){

	    tmp+=tile[get_local_id(1)][j] * s_x[j];





    buffer[idy * get_num_groups(0) + get_group_id(0)] = tmp;


So if I define MY_UNROLL I get a speedup of two. From my point of view the two versions (with define MY_UNROLL and without) should yield the same runtime, but they don’t.

Thus it seems that the #pragma unroll is not working.

Can someone help me out, please?



Two things that I can think about. Partial unrolling requires an extra conditional to handle the tail cases, maybe that takes away too much of the benefit for some reason. Another thing is that I seem to recall that the compiler had some issue with unrolling either an internal or external loop (don’t recall which). Try manually setting workload to 1 and removing the external loop and see if that helps.
Also watch the compiler output for any warning regarding failed loop unrolling, or compare ptx code and see if the loop is actually unrolled.