Hi folks,
Preamble:
Firstly, I must apologise in advance for a bit of a long post. I’m afraid it was necessary for me to adequately describe my issue. I wouldn’t waste your time if I wasn’t really stuck on it. Also, it’s a key concept to understand. Hopefully others can benefit from this thread as well.
Story:
I’m a researcher who utilises CUDA. Until now I have never needed (multi-block) thread communication, but the time has come. After lots of reading I’ve decided the best approach is probably to have multiple kernels (2) and to call them in a loop, so as to take advantage of the natural/implicit synchronisation of all blocks that happens between kernel calls.
Instead of jumping into the specifics of my program, I decided to write a simple test program that sums the numbers in the range [0…N], where N is an arbitrary number. At the moment I’m setting threadsPerBlock to 256 and then allocating an appropriate number of blocks to cover my problem size (padding with zeroes as necessary). The plan is for each block to calculate its own sum and then for a 2nd kernel to add all the block sums (much like the SDK example).
If anyone believes my idea/approach is incorrect and/or inefficient thus far, please let me know.
Here is some code:
#define MAX_TPB 16 // threads per block
// TYPE is my element type (e.g. float), defined elsewhere

// Kernel A: each thread writes its own global index, so the array holds 0..N-1
__global__ void kernelA(TYPE* c){
    c[blockIdx.x * blockDim.x + threadIdx.x] = blockIdx.x * blockDim.x + threadIdx.x;
}

// Kernel B: each block reduces its MAX_TPB elements in shared memory
__global__ void kernelB(TYPE* c){
    __shared__ TYPE smem[MAX_TPB];
    smem[threadIdx.x] = c[blockIdx.x * blockDim.x + threadIdx.x];
    __syncthreads(); // make sure the whole tile is loaded before reducing

    // Unrolling the loop = faster? Not sure, playing it safe :)
    if (threadIdx.x < 8){ smem[threadIdx.x] += smem[threadIdx.x+8]; }
    __syncthreads();
    if (threadIdx.x < 4){ smem[threadIdx.x] += smem[threadIdx.x+4]; }
    __syncthreads();
    if (threadIdx.x < 2){ smem[threadIdx.x] += smem[threadIdx.x+2]; }
    __syncthreads();
    if (threadIdx.x < 1){ smem[threadIdx.x] += smem[threadIdx.x+1]; }
    __syncthreads();

    // Thread 0 writes this block's partial sum back to global memory
    if (threadIdx.x == 0) c[blockIdx.x * blockDim.x + threadIdx.x] = smem[threadIdx.x];
}
//SUM block sums on CPU
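For completeness, this is roughly what my host side looks like (names and sizes are placeholders, error checking is stripped out, and I’ve picked N as a multiple of the block size so this toy version needs no zero padding):

#include <cstdio>
#include <cuda_runtime.h>

typedef float TYPE; // placeholder element type, matching the kernels above

int main(){
    const int N = 1 << 12;                 // problem size (a multiple of MAX_TPB)
    const int blocks = N / MAX_TPB;

    TYPE* d_c;
    cudaMalloc((void**)&d_c, N * sizeof(TYPE));

    kernelA<<<blocks, MAX_TPB>>>(d_c);     // fill the array with 0..N-1
    kernelB<<<blocks, MAX_TPB>>>(d_c);     // all of kernelA's blocks finish before this launch runs

    // Copy back and add the per-block sums on the CPU
    TYPE* h_c = new TYPE[N];
    cudaMemcpy(h_c, d_c, N * sizeof(TYPE), cudaMemcpyDeviceToHost);
    TYPE total = 0;
    for (int b = 0; b < blocks; ++b)
        total += h_c[b * MAX_TPB];         // thread 0 of each block wrote its sum here

    printf("sum = %.0f (expected %.0f)\n", (double)total, N * (N - 1) / 2.0);

    delete[] h_c;
    cudaFree(d_c);
    return 0;
}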
The problem:
If I only run the first if() then everything works as expected. If I run the first two if()s then it only SOMETIMES works as expected (I get one of two results), and if I run all the if()s then it NEVER works as expected (but gives consistently incorrect answers).
Edit:
I realised, after posting, that my problem was actually with how I was copying back to global memory in kernelB - so silly!
Bad:
c[threadIdx.x] = smem[threadIdx.x];
Good:
c[blockIdx.x * blockDim.x + threadIdx.x] = smem[threadIdx.x];
However, I’d still like to know:
A. Is my overall approach correct/good?
B. Is unrolling the loops manually beneficial? (I’ll probably be using blockDim=256 due to my kernel’s memory usage; see the loop-based sketch after these questions.)
C. What is the limit for N? (i.e. the max number of blocks?)
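Regarding B, here is the plain loop version I would compare the manual unroll against for blockDim.x = 256 (just a sketch on my part; it assumes blockDim.x is a power of two):

#define TPB 256

__global__ void kernelB_loop(TYPE* c){
    __shared__ TYPE smem[TPB];
    smem[threadIdx.x] = c[blockIdx.x * blockDim.x + threadIdx.x];
    __syncthreads();

    // Halve the number of active threads each iteration
    for (int s = blockDim.x / 2; s > 0; s >>= 1){
        if (threadIdx.x < s)
            smem[threadIdx.x] += smem[threadIdx.x + s];
        __syncthreads();
    }

    // Thread 0 writes this block's partial sum back to global memory
    if (threadIdx.x == 0) c[blockIdx.x * blockDim.x] = smem[0];
}

I’d probably just time both rather than guess.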
Also, I’m having a new issue now:
If I run the kernel in a loop, it fails (cudaGetErrorString() reports "invalid argument") whenever the loop iterates more than once:
// This doesn't work (or, rather, doesn't work 99% of the time.
// Oddly after many re-compilations it sometimes works for a few runs!? )
for (int i=0; i<N; i++){
kernel<<<...>>>(..., i);
}
If I manually unroll, it works:
// No problems here...
kernel<<<...>>>(..., 0);
kernel<<<...>>>(..., 1);
...
Does anyone have any ideas as to why this may be the case?
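In case it’s relevant, this is roughly how I’m checking for the error inside the loop (the kernel arguments and launch configuration are placeholders):

for (int i = 0; i < N; i++){
    kernel<<<blocks, threadsPerBlock>>>(d_data, i);

    // Check the launch itself (catches invalid launch configurations/arguments)
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("launch %d failed: %s\n", i, cudaGetErrorString(err));

    // Wait for the kernel to finish and check for errors during execution
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        printf("kernel %d failed: %s\n", i, cudaGetErrorString(err));
}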
Edit:
I ran the program in debug mode and it produced correct results, so I figured it was a compilation issue and suspected the optimization level to be the culprit. I removed the "-O3" flag from my nvcc arguments and the issue seems to be resolved! I’m still confused by this, however, as I was under the impression that an -O3 passed to nvcc only affected the host (non-CUDA) code. Or does nvcc decide what to do with the flag based on the filename extension?