I have run into the following problem: I need to copy two elements (each 128 bits in size) from global memory to the cache (shared memory). I have to do that for each thread in the block. If I implement this as follows:
for (int i = 0; i < 32; i += 16)
*(PixelType*)&shared[tid+i] = *(PixelType*)&d_in[num+i];
the kernel runs perfectly.
If I unroll the loop manually,
*(PixelType*)&shared[tid] = *(PixelType*)&d_in[num];
*(PixelType*)&shared[tid+16] = *(PixelType*)&d_in[num+16];
I get the message “unspecified driver error” from the kernel.
The relevant declarations are:
unsigned char* d_in;
extern __shared__ unsigned char shared[];
union __align__(16) un
{
unsigned char c[16];
};
typedef un PixelType;
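For reference, the unrolled copy can be sketched as a self-contained kernel. This is a minimal sketch rather than the attached code: the kernel name `copyTwoElements`, the host helper `run_copy_check`, the output buffer `d_out`, the `c[16]` member size, and the 32-byte-per-thread offsets are all assumptions, chosen so that every 128-bit access falls on a 16-byte boundary, which CUDA requires for 16-byte loads and stores.

```cuda
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// 128-bit element as in the post; c[16] is assumed from the stated size.
union __align__(16) un { unsigned char c[16]; };
typedef un PixelType;

// Hypothetical kernel: each thread copies two 16-byte elements (32 bytes).
__global__ void copyTwoElements(const unsigned char* d_in, unsigned char* d_out)
{
    extern __shared__ __align__(16) unsigned char shared[];

    // Assumption: both byte offsets are multiples of 16, so every cast
    // below points at a 16-byte-aligned address.
    const int tid = threadIdx.x * 32;
    const int num = (blockIdx.x * blockDim.x + threadIdx.x) * 32;

    // Manually unrolled version of the loop from the post.
    *(PixelType*)&shared[tid]      = *(const PixelType*)&d_in[num];
    *(PixelType*)&shared[tid + 16] = *(const PixelType*)&d_in[num + 16];
    __syncthreads();

    // Write the data back so the host can verify the copy.
    *(PixelType*)&d_out[num]      = *(PixelType*)&shared[tid];
    *(PixelType*)&d_out[num + 16] = *(PixelType*)&shared[tid + 16];
}

// Hypothetical host helper: returns true if the round trip preserved the data.
bool run_copy_check()
{
    constexpr int threads = 16, blocks = 2;
    constexpr size_t bytes = size_t(threads) * blocks * 32;

    unsigned char h_in[bytes], h_out[bytes];
    for (size_t i = 0; i < bytes; ++i) h_in[i] = (unsigned char)(i * 7 + 3);
    memset(h_out, 0, bytes);

    unsigned char *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, bytes);

    // Dynamic shared memory: 32 bytes per thread in the block.
    copyTwoElements<<<blocks, threads, threads * 32>>>(d_in, d_out);
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err == cudaSuccess)
        err = cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return false;
    }
    return memcmp(h_in, h_out, bytes) == 0;
}
```

Calling `run_copy_check()` from `main` is enough to exercise the sketch. If `tid` or `num` were not multiples of 16, the same casts would produce misaligned 128-bit accesses, so the offsets used in the real kernel are worth checking.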
The full code is in the attached file. Another version, which returns “unspecified launch failure”, was described here:
What do you think, is it a bug? I would expect that unrolling a loop should not affect stability…