constant memory problem

I’m facing a weird problem when using constant memory (see attached template project)

In my .cu file, constant memory is defined as:

constant float4 f1[8];
constant float4 f2[8];
constant float4 f3[8];
constant float4 f4[8];
constant float4 fd[4];

of which only fd is used in the test kernel.
When compiled like this, many tests fail.

If I write it as:
constant float4 fd[4];
constant float4 f1[8];
constant float4 f2[8];
constant float4 f3[8];
constant float4 f4[8];

then all tests pass.

If I comment one or more of [f1,f2,f3,f4], then all test pass also.

I’m using cuda 3.0 beta SDK and driver 195.30 on 64bit fedora 11. Same issue on 64 bit ubuntu 9.10

I also tried different drivers, but that didn’t help. Maybe it’s a cuda 3.0 beta or a pebcak problem :-)
I’d really appreciate it if someone could tell me why the first case doesn’t work.

Best regards,
Nico
template.zip (4.68 KB)

Without looking at your code, my guess is that you have an out of bounds index somewhere to that fd constant memory array. When you declare other constant memory storage after it, your code is probably reading out of bounds, but into something “safe”, so no errors occur. When you declare fd last, it sits at the end of your legal constant memory block, which makes you code read out of bounds into somewhere it isn’t allowed, and it fails.

Just a theory mind you…

Thanks for the reply avidday.

I’ve dealt with my fair share of out-of-bounds problems, so I already checked that :)

I’m accessing the constant memory in my kernel without a variable index, so it’s either

fd[0],fd[1],fd[2] or fd[3]

The other constant memory allocations aren’t used, they’re just there to illustrate the ‘bug’, so I guess it can’t be an out-of-bounds issue.

Cheers,

Nico

The problem is probably here:

cudaMemcpyToSymbol ("fd", tmpd, 4*sizeof(float4), 0, cudaMemcpyHostToDevice );

It should be fd without the quotes (“”):

cudaMemcpyToSymbol (fd, tmpd, 4*sizeof(float4), 0, cudaMemcpyHostToDevice );

Also this is also weird (altough probably working):

float4 tmpd[4];

	int* tmpdi = (int*)&(tmpd[0].x);

	tmpdi[ 0] = 0x424d49f5;

	tmpdi[ 1] = 0x40c9d0ff;

	tmpdi[ 2] = 0x3f208ee8;

	tmpdi[ 3] = 0x3e134f49;

	tmpdi[ 4] = 0x424a97e3;

	tmpdi[ 5] = 0x40e91b65;

	tmpdi[ 6] = 0x3f1f1d80;

	tmpdi[ 7] = 0x3e134f49;

	tmpdi[ 8] = 0x424e0a39;

	tmpdi[ 9] = 0x40cdd1c5;

	tmpdi[10] = 0xbe6ecfdc;

	tmpdi[11] = 0x3e134f49;

	tmpdi[12] = 0xc24ca2ff;

	tmpdi[13] = 0xc0e93cea;

	tmpdi[14] = 0xbeeb9dc7;

	tmpdi[15] = 0xbe114300;

Why don’t you just access tmpd[0].x = …, tmpd[0].y = …, … tmpd[3].z = … ??

eyal

Nico , I tested your code compiled with CUDA 3.0 beta on windows 7 OS.

Indeed, it’s very weird. When changed the declaration “constant float4 fd[4];” position, I got different results.

I compared the .ptx files, but they are almost same except the declaration “constant float4 fd[4];” position.

I simplified the kernel function as following, but still I got wrong result:

__global__ void testKernel( float d, float* g_odata, int imageW, int imageH) 

{

	const   int ix = IMAD(blockDim.x, blockIdx.x, threadIdx.x);

	const   int iy = IMAD(blockDim.y, blockIdx.y, threadIdx.y);

	g_odata[IMAD(iy, imageW, ix)]=  d*fd[3].z + d*fd[3].w;

}

And, if I changed the last expression in this kernel function as following, I got correct result:

__global__ void testKernel( float d, float* g_odata, int imageW, int imageH) 

{

	const   int ix = IMAD(blockDim.x, blockIdx.x, threadIdx.x);

	const   int iy = IMAD(blockDim.y, blockIdx.y, threadIdx.y);

	g_odata[IMAD(iy, imageW, ix)]= d*(fd[3].z + fd[3].w);

}

Hi eyal,

The problem is not located in the cudaMemcpyToSymbol call. According to the reference manual you can pass either a character array or the parameter name.
I double checked it be uploading the data to GPU and then downloading it again to an array initialized to all zeroes and the returned data is correct.

The reason I’m allocating the float constant array as ints is because I wanted to fill the array in this trimmed down template project with the exact same values as my larger project where I first encountered this bug.

Cheers,
Nico

Thanks a lot for performing the tests iceberg.

It’s a weird problem indeed. My best guess is that it’s some sort of optimization bug.

When I change

[codebox]

ftmp.z = afd[0].z+bfd[1].z+cfd[2].z+dfd[3].z;

ftmp.w = afd[0].w+bfd[1].w+cfd[2].w+dfd[3].w;

[/codebox]

to

[codebox]

ftmp.z = afd[0].z+bfd[1].z+cfd[2].z+1.00001fd*fd[3].z;

ftmp.w = afd[0].w+bfd[1].w+cfd[2].w+1.00001fd*fd[3].w;

[/codebox]

then it passes my epsilon tests in all cases, even with fd in the last position.

N.

I agree with you. I believe it’s a nvcc compiler 3.0 beta related bug.