Hello everybody. I'm building an app that uses a CUDA kernel, and I've run into a strange issue: the kernel takes 30 ms to execute when compiled as 32-bit code, but 56 ms when compiled as x64 code.

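__device__ int nOffset1[500] = { 10, 56, …, 36 };
__device__ int nOffset2[500] = { 41, 68, 24, …, 36 };

__global__ void kernel(int* pnSrc, int* pnDes)
{
    // Assumed: the post does not show how nThreadId is computed.
    int nThreadId = blockIdx.x * blockDim.x + threadIdx.x;

    int nResult = 0;
    int i, j;

    // Sum pnSrc over every combination of the two offset tables.
    for (i = 0; i < 500; i++)
    {
        for (j = 0; j < 500; j++)
        {
            nResult += pnSrc[nThreadId + nOffset1[i] + nOffset2[j]];
        }
    }

    // Assumed: the post does not show the store of the result.
    pnDes[nThreadId] = nResult;
}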

If I replace the offsets with constant values, the execution time drops and the gap between the 32-bit and x64 builds also shrinks. So I think the problem is addressing overhead.

Why does this happen, and how can I solve it?

I don’t know the answer to your 32 vs 64 bit speed difference. You might want to post the exact code that does the timing, though, since it’s common to get it wrong by including unrelated host-side calls or by querying timers before the device has completed.
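For reference, here is a minimal sketch of one way to time only the kernel, using CUDA events so that the measurement waits for the device to finish. It is not the original poster's timing code, which was not shown. The names dummyKernel and timeKernelMs are hypothetical stand-ins, and cudaDeviceSynchronize assumes a reasonably recent CUDA toolkit.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel; substitute the kernel being measured.
__global__ void dummyKernel(const int* pnSrc, int* pnDes, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) pnDes[tid] = pnSrc[tid] + 1;
}

// Returns the GPU time in milliseconds for a single launch of dummyKernel.
float timeKernelMs(const int* dSrc, int* dDes, int n)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // Warm-up launch so one-time startup costs are not part of the measurement.
    dummyKernel<<<blocks, threads>>>(dSrc, dDes, n);
    cudaDeviceSynchronize();

    // Only the kernel launch sits between the two events: no copies, no allocation.
    cudaEventRecord(start);
    dummyKernel<<<blocks, threads>>>(dSrc, dDes, n);
    cudaEventRecord(stop);

    // Kernel launches are asynchronous; wait for the stop event before reading the time.
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const int n = 1 << 20;
    int *dSrc = NULL, *dDes = NULL;
    cudaMalloc(&dSrc, n * sizeof(int));
    cudaMalloc(&dDes, n * sizeof(int));
    cudaMemset(dSrc, 0, n * sizeof(int));

    printf("kernel time: %.3f ms\n", timeKernelMs(dSrc, dDes, n));

    cudaFree(dSrc);
    cudaFree(dDes);
    return 0;
}
```

If the 32-bit and x64 builds are both timed this way and the gap remains, the difference is in the kernel itself rather than in the measurement.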

Is this toy code or is it a real application you want to accelerate? It’s applying a kind of stencil convolution to a 1D array, but that stencil is huge and sparse, with 500 x 500 = 250,000 entries. If this is a real application, you can probably speed it up by on the order of 100x by completely removing the inner j loop and having your i indices each access a new second array, pre-accumulated with the effect of ALL the offset[j] sums at once. I.e., make a new array x[i] = d[i+41] + d[i+68] + d[i+24] + … + d[i+36]. Then loop over i and accumulate the 500 terms as result[tid] = x[tid+10] + x[tid+56] + … + x[tid+36]. There are ways to speed up that second accumulation loop as well (especially if there is structure to the offset array), but getting rid of just the [j] array will give you the most immediate huge speedup with the smallest coding effort.
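A rough sketch of that two-pass idea follows, assuming only 4 offsets per table instead of 500 to keep it short. The offset values, the kernel names preAccumulate and accumulate, and the helper runTwoPassDemo are made up for illustration, since the real tables are elided in the post. With 500 offsets per table, each output needs roughly 500 + 500 reads instead of 250,000. The host code checks the result against the original double loop.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define NUM_OFFSETS 4  // the post uses 500; kept small for illustration

// Hypothetical offset values; the real tables are elided in the post.
static const int hOffset1[NUM_OFFSETS] = { 10, 56, 3, 36 };
static const int hOffset2[NUM_OFFSETS] = { 41, 68, 24, 36 };
static const int MAX_OFF1 = 56;  // largest entry of hOffset1
static const int MAX_OFF2 = 68;  // largest entry of hOffset2

__device__ int nOffset1[NUM_OFFSETS];
__device__ int nOffset2[NUM_OFFSETS];

// Pass 1: x[i] = sum over j of src[i + nOffset2[j]]
__global__ void preAccumulate(const int* pnSrc, int* pnX, int nX)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nX) return;
    int nSum = 0;
    for (int j = 0; j < NUM_OFFSETS; j++) nSum += pnSrc[i + nOffset2[j]];
    pnX[i] = nSum;
}

// Pass 2: des[tid] = sum over i of x[tid + nOffset1[i]]
__global__ void accumulate(const int* pnX, int* pnDes, int nDes)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= nDes) return;
    int nSum = 0;
    for (int i = 0; i < NUM_OFFSETS; i++) nSum += pnX[tid + nOffset1[i]];
    pnDes[tid] = nSum;
}

// Runs both passes and compares against the original double loop on the host.
bool runTwoPassDemo()
{
    const int nDes = 1000;            // number of outputs
    const int nX   = nDes + MAX_OFF1; // pass 2 reads x up to tid + MAX_OFF1
    const int nSrc = nX + MAX_OFF2;   // pass 1 reads src up to i + MAX_OFF2

    std::vector<int> hSrc(nSrc);
    for (int k = 0; k < nSrc; k++) hSrc[k] = k % 7;

    cudaMemcpyToSymbol(nOffset1, hOffset1, sizeof(hOffset1));
    cudaMemcpyToSymbol(nOffset2, hOffset2, sizeof(hOffset2));

    int *dSrc = NULL, *dX = NULL, *dDes = NULL;
    cudaMalloc(&dSrc, nSrc * sizeof(int));
    cudaMalloc(&dX,   nX   * sizeof(int));
    cudaMalloc(&dDes, nDes * sizeof(int));
    cudaMemcpy(dSrc, hSrc.data(), nSrc * sizeof(int), cudaMemcpyHostToDevice);

    const int threads = 256;
    preAccumulate<<<(nX + threads - 1) / threads, threads>>>(dSrc, dX, nX);
    accumulate<<<(nDes + threads - 1) / threads, threads>>>(dX, dDes, nDes);

    // This blocking copy also waits for both kernels to finish.
    std::vector<int> hDes(nDes);
    cudaMemcpy(hDes.data(), dDes, nDes * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dSrc);
    cudaFree(dX);
    cudaFree(dDes);

    // Reference: the original nested loop over both offset tables.
    bool bMatch = true;
    for (int t = 0; t < nDes; t++) {
        int nRef = 0;
        for (int i = 0; i < NUM_OFFSETS; i++)
            for (int j = 0; j < NUM_OFFSETS; j++)
                nRef += hSrc[t + hOffset1[i] + hOffset2[j]];
        if (nRef != hDes[t]) bMatch = false;
    }
    return bMatch;
}

int main()
{
    printf("%s\n", runTwoPassDemo() ? "two-pass result matches" : "MISMATCH");
    return 0;
}
```

The intermediate array x has to be MAX_OFF1 entries longer than the output, and the source MAX_OFF2 entries longer than x, so that neither pass reads out of bounds.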

Hello SPWorley.