type cast bug...?

Hi!

I ran into the following problem: I need to copy two elements (each 128 bits in size) from global memory to shared memory, and I have to do this for every thread in the block. If I implement it like this:

for (int i = 0; i < 32; i += 16)
    *(PixelType*)&shared[tid+i] = *(PixelType*)&d_in[num+i];

the kernel runs perfectly.

If I unroll the loop by hand,

*(PixelType*)&shared[tid] = *(PixelType*)&d_in[num];
*(PixelType*)&shared[tid+16] = *(PixelType*)&d_in[num+16];

I get an “unspecified driver error” from the kernel.

The declarations are:
unsigned char* d_in; extern __shared__ unsigned char shared[];
union __align__(16) un
{
unsigned char c[16];
};
typedef un PixelType;

The full code is in the attached file. Another version, which fails with “unspecified launch failure”, is described here:
http://forums.nvidia.com/index.php?showtopic=31828

What do you think, is this a bug? I would expect that unrolling the loop should not affect stability…
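To make the expectation concrete, here is a minimal host-side C++ sketch (not the kernel itself) showing that the loop and the hand-unrolled form perform the same two 16-byte copies. It uses `std::memcpy` in place of the pointer casts, and the names `copyLoop`, `copyUnrolled`, and `loopMatchesUnrolled` as well as the 64-byte test buffers are assumptions made for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// 16-byte element, mirroring the PixelType union from the post.
union alignas(16) PixelType {
    unsigned char c[16];
};
static_assert(sizeof(PixelType) == 16, "PixelType must be 128 bits");

// Loop form: copies two 16-byte elements, at offsets +0 and +16.
void copyLoop(unsigned char* dst, const unsigned char* src,
              std::size_t tid, std::size_t num) {
    for (std::size_t i = 0; i < 32; i += 16)
        std::memcpy(&dst[tid + i], &src[num + i], sizeof(PixelType));
}

// Hand-unrolled form of the same two copies.
void copyUnrolled(unsigned char* dst, const unsigned char* src,
                  std::size_t tid, std::size_t num) {
    std::memcpy(&dst[tid], &src[num], sizeof(PixelType));
    std::memcpy(&dst[tid + 16], &src[num + 16], sizeof(PixelType));
}

// Runs both forms on identical input and reports whether the results agree.
bool loopMatchesUnrolled() {
    std::vector<unsigned char> src(64);
    for (std::size_t i = 0; i < src.size(); ++i)
        src[i] = static_cast<unsigned char>(i);

    std::vector<unsigned char> a(64, 0), b(64, 0);
    copyLoop(a.data(), src.data(), 32, 0);      // hypothetical tid = 32, num = 0
    copyUnrolled(b.data(), src.data(), 32, 0);
    return a == b;
}
```

On the host the two forms are interchangeable, so a difference on the device would point at either the compiler or at an addressing problem that only one of the generated code paths exposes.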

Many thanks.

As far as I can tell, this looks like a bug. You should probably file it.

Does the loop-version work fine?

I think you may have an issue with alignment on 16-byte boundaries (required by your union definition). Say the “shared” array starts at address 0. Then:

  • the thread with tid=0 will write to byte addresses 0 through 31, inclusive.
  • the thread with tid=1 will write to byte addresses 1 through 32, inclusive.

So, now thread 1 is violating the alignment requirement. It is also overwriting the data that thread 0 wrote (unless you intend this, you’ll have correctness issues). And you also have bank conflicts as multiple threads are writing to the same bank of shared memory, which will reduce your performance.
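The address arithmetic above can be checked with a small host-side C++ sketch. It assumes, as in the explanation, that tid is used directly as a byte offset and that the shared array starts at a 16-byte-aligned address; the helper names (`isAligned16`, `firstByte`, `lastByte`, `rangesOverlap`) are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cstddef>

// Size of one PixelType element in bytes (128 bits).
constexpr std::size_t kElemBytes = 16;

// True if a byte offset is a legal start for a 16-byte-aligned access,
// assuming the array itself begins on a 16-byte boundary.
constexpr bool isAligned16(std::size_t byteOffset) {
    return byteOffset % kElemBytes == 0;
}

// First and last byte (inclusive) written by a thread that stores two
// consecutive 16-byte elements starting at byte offset tid.
constexpr std::size_t firstByte(std::size_t tid) { return tid; }
constexpr std::size_t lastByte(std::size_t tid) { return tid + 2 * kElemBytes - 1; }

// True if the byte ranges written by two threads share at least one byte.
constexpr bool rangesOverlap(std::size_t tidA, std::size_t tidB) {
    return firstByte(tidA) <= lastByte(tidB) && firstByte(tidB) <= lastByte(tidA);
}
```

With these helpers, `isAligned16(1)` is false and `rangesOverlap(0, 1)` is true, which matches the two problems described for thread 1.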

Let me know if I missed something reading the code.

Paulius

Thanks, paulius.

tid is not the thread number (I forgot to mention this in my post :( ): tid = 2*threadIdx.x*16. So your explanation, although interesting, does not apply here.
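A small host-side C++ sketch of this offset calculation, reading the formula as tid = 2 * threadIdx.x * 16 (an assumption about the intended expression) and assuming the shared array starts on a 16-byte boundary. The helper names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Byte offset per thread, reading the formula as tid = 2 * threadIdx.x * 16.
constexpr std::size_t tidFromThread(std::size_t threadIdxX) {
    return 2 * threadIdxX * 16;
}

// True if both 16-byte accesses of a thread (at tid and tid + 16) start on
// a 16-byte boundary, assuming the array base is itself 16-byte aligned.
constexpr bool accessesAligned(std::size_t threadIdxX) {
    return tidFromThread(threadIdxX) % 16 == 0 &&
           (tidFromThread(threadIdxX) + 16) % 16 == 0;
}

// True if this thread's 32-byte range ends before the next thread's begins.
constexpr bool disjointFromNext(std::size_t threadIdxX) {
    return tidFromThread(threadIdxX) + 32 <= tidFromThread(threadIdxX + 1);
}
```

Under this reading every thread starts at a multiple of 32 bytes, so the accesses are aligned and the ranges of neighbouring threads do not overlap. This says nothing about the alignment of `&d_in[num]`, which depends on how num is computed.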

When I use shared memory, I keep running into problems like this one or similar ones: http://forums.nvidia.com/index.php?showtopic=30410*hl=loop. How should I organize my code to avoid such problems? (I have spent a lot of time trying to solve this.) Could somebody from NVIDIA comment on this, please?