I have a kernel which contains code roughly like this:
__shared__ BigStruct shared_buffer[NumBig];
for (offset = 0; offset < total; offset += NumBig) {
    __syncthreads();
    if (thread < NumBig) {
        // copy the struct into shared_buffer (see below)
    }
    __syncthreads();
    // All threads process shared_buffer.
}
-
BigStruct is ~64 bytes, made up of 4-byte fields.
-
An implicit struct copy (i.e. shared_buffer[thread] = global_mem[offset+thread]) produces a launch failure. Certain cases do work, for example copying into a fixed slot: shared_buffer[0] = global_mem[offset+thread].
-
If the loop executes once (i.e. total <= NumBig): member-by-member copy works; otherwise it generates a launch failure or locks up the machine.
-
If the loop executes more than once: member-by-member copy of only the first several fields in the struct runs, but produces incorrect results. If I attempt to copy more than a few members, a launch failure is produced. If I copy all of the fields, the machine locks up.
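For illustration only, here is a minimal sketch of one way the copy step could be written so that it is easy to test in isolation. It is not the actual kernel. The BigStruct layout (sixteen int fields), the value of NumBig, the kernel name process, the placeholder "processing" (summing the first field) and the host harness are all assumptions made up for the example. It copies the tile as 32-bit words spread across all threads rather than one struct per thread, and clamps the last tile so it never reads past total.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for BigStruct: sixteen 4-byte fields (64 bytes).
struct BigStruct { int field[16]; };

constexpr int NumBig = 16;  // assumed tile size
constexpr int WordsPerStruct = sizeof(BigStruct) / sizeof(int);

__global__ void process(const BigStruct* global_mem, int total, int* out)
{
    __shared__ BigStruct shared_buffer[NumBig];
    const int thread = threadIdx.x;
    int sum = 0;

    for (int offset = 0; offset < total; offset += NumBig) {
        // Number of valid structs in this tile; the last tile may be partial.
        const int remaining = total - offset;
        const int count = remaining < NumBig ? remaining : NumBig;

        __syncthreads();  // previous tile is fully consumed before overwriting

        // Treat both buffers as arrays of 32-bit words and let every
        // thread copy a strided share of them.
        const int* src = reinterpret_cast<const int*>(global_mem + offset);
        int* dst = reinterpret_cast<int*>(shared_buffer);
        for (int w = thread; w < count * WordsPerStruct; w += blockDim.x)
            dst[w] = src[w];

        __syncthreads();  // tile is fully loaded before anyone reads it

        // Placeholder processing: every thread sums the first field.
        for (int i = 0; i < count; ++i)
            sum += shared_buffer[i].field[0];
    }
    out[blockIdx.x * blockDim.x + thread] = sum;
}

int main()
{
    const int total = 40;    // not a multiple of NumBig, so the loop runs 3 times
    const int threads = 64;

    BigStruct h_in[total];
    int expected = 0;
    for (int i = 0; i < total; ++i) {
        for (int f = 0; f < WordsPerStruct; ++f) h_in[i].field[f] = i;
        expected += i;
    }

    BigStruct* d_in = nullptr;
    int* d_out = nullptr;
    int h_out[threads];
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_out, sizeof(h_out));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    process<<<1, threads>>>(d_in, total, d_out);

    // The blocking copy also surfaces any error from the kernel launch.
    cudaError_t err = cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    int bad = 0;
    for (int t = 0; t < threads; ++t)
        if (h_out[t] != expected) ++bad;
    printf("%d of %d threads produced an unexpected sum\n", bad, threads);

    cudaFree(d_in);
    cudaFree(d_out);
    return bad != 0;
}
```

Both __syncthreads() calls sit outside any thread-dependent branch, and the loop bounds are the same for every thread, so all threads in the block reach each barrier the same number of times.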
-
In each of these cases the kernel uses 15 registers, 2k local memory, and <2k of shared memory per block.
-
Commenting out the __syncthreads() calls eliminates the launch failures but produces incorrect results.
If I skip buffering the BigStruct in shared memory (and just have every thread read from global memory), everything works [slowly].
I’d appreciate suggestions on how to debug or work around this.
-Thanks
Abe