Hi, I'm using an 8800GTX with CUDA 0.8.1. I'm doing random parallel searches into a 3MB array in global memory, d_A. The first 512 bytes are read by every thread.
The elements of d_A are structs, "Box", of 4 ints each.
To speed up the search, I want to use the 64KB constant memory, which has an 8KB cache on each multiprocessor, so I load the first part of d_A into a constant buffer. I wrote:
//file search.cu:
__constant__ Box c_First[n];
...
CUDA_SAFE_CALL(cudaMemcpyToSymbol(c_First, h_A, sizeof(Box)*n, 0)); //h_A is the host copy of d_A
...
//file search_kernel.cu:
extern __constant__ Box c_First[n];
...
if(i < n)
    ...c_First[i]...;
else
    ...d_A[i]...;
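For anyone who wants to reproduce the setup, here is a minimal self-contained sketch of the same idea. It is not my actual kernel: the Box layout, N_CONST (standing in for n), loadBox, lookupKernel and runLookupDemo are all made-up names for illustration, everything sits in one .cu file, and it assumes a current CUDA toolkit rather than 0.8.1.

```cuda
#include <vector>
#include <cuda_runtime.h>

struct Box { int v[4]; };            // assumed layout: 4 ints, 16 bytes

#define N_CONST 32                   // hypothetical n: 32 Boxes = 512 bytes

__constant__ Box c_First[N_CONST];   // copy of the first N_CONST elements of the array

// Read element i from the constant copy when it is in range, else from global memory.
__device__ Box loadBox(const Box* d_A, int i)
{
    if (i < N_CONST)
        return c_First[i];
    else
        return d_A[i];
}

// Each thread fetches one Box at a given index and stores the sum of its 4 ints.
__global__ void lookupKernel(const Box* d_A, const int* d_idx, int* d_out, int count)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= count) return;
    Box b = loadBox(d_A, d_idx[t]);
    d_out[t] = b.v[0] + b.v[1] + b.v[2] + b.v[3];
}

// Host driver: returns the number of mismatches against a CPU reference,
// or -1 if the final copy back reports a CUDA error.
int runLookupDemo()
{
    const int total = 1024;          // elements in the toy array
    const int count = 1024;          // number of lookups, one per thread
    std::vector<Box> h_A(total);
    for (int i = 0; i < total; ++i)
        for (int k = 0; k < 4; ++k)
            h_A[i].v[k] = i + k;     // sum of element i is 4*i + 6

    std::vector<int> h_idx(count), h_out(count);
    for (int t = 0; t < count; ++t)
        h_idx[t] = (t * 37) % total; // scattered indices, some below N_CONST

    Box* d_A = nullptr; int* d_idx = nullptr; int* d_out = nullptr;
    cudaMalloc(&d_A, total * sizeof(Box));
    cudaMalloc(&d_idx, count * sizeof(int));
    cudaMalloc(&d_out, count * sizeof(int));
    cudaMemcpy(d_A, h_A.data(), total * sizeof(Box), cudaMemcpyHostToDevice);
    cudaMemcpy(d_idx, h_idx.data(), count * sizeof(int), cudaMemcpyHostToDevice);
    // Fill the constant buffer from the host copy, as in the post.
    cudaMemcpyToSymbol(c_First, h_A.data(), sizeof(Box) * N_CONST);

    lookupKernel<<<count / 64, 64>>>(d_A, d_idx, d_out, count);  // 64 threads per block

    // Blocking copy; waits for the kernel to finish before reading results.
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, count * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_A); cudaFree(d_idx); cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int t = 0; t < count; ++t)
        if (h_out[t] != 4 * h_idx[t] + 6) ++mismatches;
    return mismatches;
}
```

Calling runLookupDemo() from a main() and checking that it returns 0 confirms that both the constant path and the global path return the same data; it says nothing about speed.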
I use 1024 blocks of 64 threads each, because each block needs a large amount of shared memory, which rules out more threads per block.
The weird thing is that no matter what size I choose for c_First, it never outperforms the old version without the constant buffer. Search results are correct, but timings are:
I'm not sure what causes this. Maybe I use too many branches in the kernel; there are indeed a lot of them.
There should be no cache thrashing in the "w 8KB cbuf" or "w 512B" case, since everything I read from c_First fits in the 8KB cache.
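To make the sizes concrete, here is a small sketch of the arithmetic, assuming Box is exactly 4 ints of 4 bytes each with no padding. The constant names and boxesIn are made up for illustration, and constexpr needs a C++11-capable toolkit, so this would not compile on 0.8.1.

```cuda
#include <cstddef>

struct Box { int v[4]; };   // assumed layout: 4 ints

constexpr std::size_t kConstTotalBytes = 64 * 1024;  // total constant memory
constexpr std::size_t kConstCacheBytes = 8 * 1024;   // constant cache per multiprocessor
constexpr std::size_t kSharedHeadBytes = 512;        // bytes read by every thread

// Number of whole Box elements that fit in a buffer of the given size.
constexpr std::size_t boxesIn(std::size_t bytes) { return bytes / sizeof(Box); }

static_assert(sizeof(Box) == 16, "Box is assumed to be 16 bytes");
static_assert(boxesIn(kConstTotalBytes) == 4096, "64KB holds 4096 Boxes");
static_assert(boxesIn(kConstCacheBytes) == 512, "8KB holds 512 Boxes");
static_assert(boxesIn(kSharedHeadBytes) == 32, "512 bytes hold 32 Boxes");
```

So under that assumption the "w 8KB cbuf" case corresponds to n = 512 and the "w 512B" case to n = 32, both within a single 8KB cache.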
How can I optimize this random search?
Thank you very much!