[BUG] clBuildProgram taking exponential amount of time

Hi forum,

I recently started porting a kernel I had written for CUDA to OpenCL. Since it does not use any advanced features, not that many changes were necessary, apart from keywords like global etc.

However, I discovered that clBuildProgram does take exponential amounts of time (c * 2^n) for every n BF_ENC operations I include. This only happens when I use declare the second argument __constant, using __global it compiles quickly and the resulting PTX looks just fine.

I’ve tried to reduce the problem as much as possible (so don’t try to read any functionality into it) with the kernel below. Replace every __constant by __global and watch what happens. Also try reducing the number of calls to BF_ENC until it compiles in about 10 seconds:

#define BF_ENC(LL,R,S,P) ( \

	LL^=(((	S[       ((uint)(R>>24)&0xff)] + \

		S[0x0100+((uint)(R>>16)&0xff)])^ \

		S[0x0200+((uint)(R>> 8)&0xff)])+ \

		S[0x0300+((uint)(R    )&0xff)]) \


__kernel void BFencKernel(__global unsigned long *data, __constant unsigned int *bf_constant_schedule) {

	__private unsigned long block = data[get_global_id(0)];


	unsigned int l, r;

	l = data[get_global_id(0)];

	r = l;

	__constant unsigned int *p=bf_constant_schedule;

	__constant unsigned int *s=bf_constant_schedule+18;


	BF_ENC(r,l,s,p[ 1]);

	BF_ENC(l,r,s,p[ 2]);

	BF_ENC(r,l,s,p[ 3]);

	BF_ENC(l,r,s,p[ 4]);

	BF_ENC(r,l,s,p[ 5]);

	BF_ENC(l,r,s,p[ 6]);

	BF_ENC(r,l,s,p[ 7]);

	BF_ENC(l,r,s,p[ 8]);

	BF_ENC(r,l,s,p[ 9]);









	block = ((unsigned long)r) << 32 | l;

	data[get_global_id(0)] = block;


I’m using CUDA Toolkit 3.2, driver 270.18, Gentoo linux 2.6.37-x86_64 (vanilla). OpenCL is 1.0 obviously, the PTX generated is 2.2. GCC is version 4.5.1-r1. Right now I’m using “-w -cl-nv-verbose” as option to clBuildProgram, but that does not make a difference. The build is for my GeForce 8600 GT (CC 1.1).

I’ve read posts about some strange memory dependence using constant memory, but I can’t figure out anything wrong with my code, especially since it is very similar to my CUDA C routine.

Small amendment: The kernel compiles fine (and returns the expected PTX with a lot of ld.const.u32) when I use the clcc tool available here: http://forums.nvidia.com/index.php?showtopic=188884