cbuffer doesn't help constant memory

Hi, I’m using an 8800 GTX with CUDA 0.8.1. I’m doing random parallel searches into a global memory array, d_A, of 3 MB. The first 512 bytes are visited by all threads.

The elements of d_A are structures, “Box”, of 4 ints each.

To speed up the search using the 64 KB constant memory, with its 8 KB cache on each multiprocessor, I load the first part of d_A into a constant buffer. I write:

//file search.cu:
__constant__ Box c_First[n];
...
CUDA_SAFE_CALL(cudaMemcpyToSymbol(c_First, h_A, sizeof(Box)*n, 0)); //h_A is the host copy of d_A
...

//file search_kernel.cu:
extern __constant__ Box c_First[n];
...
if(i < n)
    ...c_First[i]...;
else
    ...d_A[i]...;

I use 1024 blocks of 64 threads each, because the kernel needs a large amount of shared memory, which limits how many threads I can run.

The weird thing is that no matter what size I choose for c_First, it never outperforms the old version without the constant buffer. Search results are correct, but timings are:

I’m not sure what causes this. Maybe I use too many branches in the kernel; there are indeed a lot of them.

There should be no cache thrashing in the “w 8KB cbuf” or “w 512B” cases, since everything I read from c_First should already be in the 8 KB cache.

How can I optimize this random search, please?

Thank you very much!
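
For readers who want to reproduce the setup described above, here is a minimal sketch, not the poster's actual code. It assumes a recent CUDA toolkit rather than 0.8.1, and it keeps the `__constant__` array and the kernel in a single .cu file, which avoids any question about `extern` linkage of constant symbols across files. The names `N_CONST`, `N_TOTAL`, `lookup` and `runLookupCheck`, the layout of `Box`, and the random index pattern are all hypothetical.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

struct Box { int v[4]; };           // four ints, as described in the post

#define N_CONST 32                  // assumed: 32 * 16 B = 512 B hot prefix
#define N_TOTAL 4096                // assumed total number of elements

// Defined in the same file as the kernel that reads it.
__constant__ Box c_First[N_CONST];

// Thread t fetches element idx[t]: from constant memory when the index
// falls inside the prefix, from global memory otherwise.
__global__ void lookup(const Box* d_A, const int* idx, int* out, int count)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= count) return;
    int i = idx[t];
    out[t] = (i < N_CONST) ? c_First[i].v[0] : d_A[i].v[0];
}

// Runs the lookup and returns the number of results that differ from a
// CPU reference, or -1 if any CUDA call fails.
int runLookupCheck()
{
    const int count = 1024 * 64;    // one lookup per thread
    Box* h_A   = (Box*)malloc(sizeof(Box) * N_TOTAL);
    int* h_idx = (int*)malloc(sizeof(int) * count);
    int* h_out = (int*)malloc(sizeof(int) * count);
    for (int i = 0; i < N_TOTAL; ++i)
        for (int k = 0; k < 4; ++k) h_A[i].v[k] = 10 * i + k;
    srand(1234);
    for (int t = 0; t < count; ++t) h_idx[t] = rand() % N_TOTAL;

    Box* d_A = 0; int* d_idx = 0; int* d_out = 0;
    bool ok =
        cudaMalloc((void**)&d_A, sizeof(Box) * N_TOTAL) == cudaSuccess &&
        cudaMalloc((void**)&d_idx, sizeof(int) * count) == cudaSuccess &&
        cudaMalloc((void**)&d_out, sizeof(int) * count) == cudaSuccess &&
        cudaMemcpy(d_A, h_A, sizeof(Box) * N_TOTAL, cudaMemcpyHostToDevice) == cudaSuccess &&
        cudaMemcpy(d_idx, h_idx, sizeof(int) * count, cudaMemcpyHostToDevice) == cudaSuccess &&
        // The first N_CONST elements of the host array go into constant memory.
        cudaMemcpyToSymbol(c_First, h_A, sizeof(Box) * N_CONST) == cudaSuccess;

    int mismatches = -1;
    if (ok) {
        lookup<<<1024, 64>>>(d_A, d_idx, d_out, count);
        ok = cudaMemcpy(h_out, d_out, sizeof(int) * count, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    if (ok) {
        mismatches = 0;
        for (int t = 0; t < count; ++t)
            if (h_out[t] != h_A[h_idx[t]].v[0]) ++mismatches;
    }
    cudaFree(d_A); cudaFree(d_idx); cudaFree(d_out);
    free(h_A); free(h_idx); free(h_out);
    return mismatches;
}

int main()
{
    printf("mismatches: %d\n", runLookupCheck());
    return 0;
}
```

This only checks that the split lookup returns correct values; it says nothing about speed, which depends on how the indices are distributed across the threads that execute together.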

Hi, let me simplify my question:

To speed up the search using the constant cache, I load the first n elements of the data from global memory into constant memory. The kernel is like:

__constant__ Box c_First[n];
...
CUDA_SAFE_CALL(cudaMemcpyToSymbol(c_First, h_A, sizeof(Box)*n, 0)); //h_A is the host copy of d_A
...

//file search_kernel.cu:
extern __constant__ Box c_First[n];
...
if(i < n)
    ...c_First[i]...;
else
    ...d_A[i]...;

But this is slower than the version without c_First. I think it’s because both branches are taken. I’ll try to generate the .ptx. Thanks for any opinions.

Constant memory reads are only register-speed if all of the threads are accessing the same element of the constant buffer at the same time. The more divergent the constant buffer accesses are, the slower it gets. If you’re randomly accessing the constant buffer, then you can expect it to be slow.

John Stone
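
The effect described in the reply can be measured with a small timing test. The following is a rough sketch under stated assumptions, not a definitive benchmark: it assumes a recent CUDA toolkit, and the names `c_table`, `readConstant`, `initTable` and `timeReads`, the table size, and the iteration count are all made up for illustration. The same kernel is timed twice, once with every thread reading the same constant element at each step and once with neighbouring threads reading different elements. The whole table is 4 KB, so any difference should not come from the data failing to fit in the constant cache.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define TABLE_SIZE 1024             // 4 KB of ints (power of two, for the mask)
#define ITERS      4096             // reads per thread
#define BLOCKS     1024             // launch shape quoted in the thread
#define THREADS    64

__constant__ int c_table[TABLE_SIZE];

// stride == 0: all threads read the same element at each step.
// stride == 1: neighbouring threads read different elements at each step.
__global__ void readConstant(int* out, int stride)
{
    int sum = 0;
    for (int k = 0; k < ITERS; ++k)
        sum += c_table[(k + stride * threadIdx.x) & (TABLE_SIZE - 1)];
    out[blockIdx.x * blockDim.x + threadIdx.x] = sum;   // keep the loop alive
}

// Fills the constant table; returns false if the copy fails.
bool initTable()
{
    int h_table[TABLE_SIZE];
    for (int j = 0; j < TABLE_SIZE; ++j) h_table[j] = j;
    return cudaMemcpyToSymbol(c_table, h_table, sizeof(h_table)) == cudaSuccess;
}

// Returns the time of one kernel launch in milliseconds,
// or a negative value if a CUDA call fails.
float timeReads(int stride)
{
    int* d_out = 0;
    if (cudaMalloc((void**)&d_out, sizeof(int) * BLOCKS * THREADS) != cudaSuccess)
        return -1.0f;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    readConstant<<<BLOCKS, THREADS>>>(d_out, stride);   // warm-up launch
    cudaEventRecord(start);
    readConstant<<<BLOCKS, THREADS>>>(d_out, stride);   // timed launch
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = -1.0f;
    if (cudaGetLastError() == cudaSuccess)
        cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_out);
    return ms;
}

int main()
{
    if (!initTable()) { printf("CUDA setup failed\n"); return 1; }
    printf("same element per step:        %.3f ms\n", timeReads(0));
    printf("different element per thread: %.3f ms\n", timeReads(1));
    return 0;
}
```

How large the gap is will depend on the hardware and compiler. If the second timing comes out clearly slower, that matches the explanation above: the random search pays for divergent constant reads even when everything is cached.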