big structs

I have a kernel which contains code roughly like this:

__shared__ BigStruct shared_buffer[NumBig];

for (offset = 0; offset < total; offset += NumBig) {

  if (thread < NumBig) {
    // copy the struct into shared_buffer (see below)
  }

  __syncthreads();

  // All threads process shared_buffer.
}

  • BigStruct is ~64 bytes, made up of 4-byte fields.

  • An implicit struct copy (i.e. shared_buffer[thread] = global_mem[offset+thread]) produces a launch failure. Certain cases do work, e.g. shared_buffer[0] = global_mem[offset+thread].

  • If the loop executes once (i.e. total <= NumBig): member-by-member copy works; otherwise it generates a launch failure or locks up the machine.

  • If the loop executes more than once: member-by-member copy of only the first several fields in the struct runs, but produces incorrect results. If I attempt to copy more than a few members, a launch failure is produced. If I copy all of the fields, the machine locks up.

  • In each of these cases the kernel uses 15 registers, 2 KB of local memory, and <2 KB of shared memory per block.

  • Commenting out the __syncthreads() eliminates the launch failures but produces incorrect results.

If I skip buffering the BigStruct in shared memory and just have every thread read from global memory, everything works (slowly).

I’d appreciate suggestions on how to debug or work around this.




Please file a bug and include everything needed to reproduce the problem.
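
In the meantime, one thing that may be worth trying is a word-by-word cooperative copy instead of a per-thread struct copy. The sketch below is only an illustration under assumptions, not a confirmed fix: the 16-int layout of BigStruct, the value of NumBig, the single-block launch, the placeholder "processing" loop, and the run_demo helper are all made up for the example. It keeps every __syncthreads() outside of divergent branches and checks the launch status on the host.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real struct: 16 four-byte fields = 64 bytes.
struct BigStruct { int f[16]; };

constexpr int NumBig = 16;  // structs buffered per pass (assumed value)
constexpr int WordsPerStruct = sizeof(BigStruct) / sizeof(int);

__global__ void process(const BigStruct* global_mem, int total, int* out)
{
    __shared__ BigStruct shared_buffer[NumBig];
    const int thread = threadIdx.x;
    int sum = 0;

    // The trip count depends only on total, so all threads reach every barrier.
    for (int offset = 0; offset < total; offset += NumBig) {
        const int count = min(NumBig, total - offset);  // last pass may be partial

        // Cooperative copy: treat the structs as a flat array of 32-bit words
        // so that consecutive threads read consecutive words.
        const int* src = reinterpret_cast<const int*>(global_mem + offset);
        int* dst = reinterpret_cast<int*>(shared_buffer);
        for (int w = thread; w < count * WordsPerStruct; w += blockDim.x)
            dst[w] = src[w];

        __syncthreads();  // buffer fully written before anyone reads it

        // Placeholder for "all threads process shared_buffer".
        for (int i = 0; i < count; ++i)
            sum += shared_buffer[i].f[thread % WordsPerStruct];

        __syncthreads();  // everyone is done reading before the next pass overwrites
    }
    out[thread] = sum;
}

// Hypothetical host-side check: runs one block and compares against a CPU sum.
bool run_demo()
{
    const int total = 40, threads = 64;  // 40 is not a multiple of NumBig
    BigStruct* in = nullptr;
    int* out = nullptr;
    if (cudaMallocManaged(&in, total * sizeof(BigStruct)) != cudaSuccess) return false;
    if (cudaMallocManaged(&out, threads * sizeof(int)) != cudaSuccess) { cudaFree(in); return false; }

    for (int i = 0; i < total; ++i)
        for (int j = 0; j < WordsPerStruct; ++j)
            in[i].f[j] = i + j;

    process<<<1, threads>>>(in, total, out);
    cudaError_t err = cudaGetLastError();                  // launch-time errors
    if (err == cudaSuccess) err = cudaDeviceSynchronize(); // execution-time errors
    if (err != cudaSuccess)
        printf("kernel failed: %s\n", cudaGetErrorString(err));

    bool ok = (err == cudaSuccess);
    for (int t = 0; ok && t < threads; ++t) {
        int expected = 0;
        for (int i = 0; i < total; ++i) expected += i + (t % WordsPerStruct);
        ok = (out[t] == expected);
    }
    cudaFree(in);
    cudaFree(out);
    return ok;
}
```

If a variant like this runs cleanly while the per-thread struct copy does not, that difference would probably be useful to include in the bug report alongside the reproduction case.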