A weird local memory error in transpose

Hi, all:
I am using OpenCL, not CUDA. Sorry for the inconvenience.

I have a transpose program that uses local memory, shown below. Generally the program runs smoothly, but it has a weird bug. I transpose my matrix twice and compare the result with the raw matrix. Most positions are correct, but the values at a few positions differ from the raw matrix. Here is an example. My matrix is 4096*4906*64, and I transpose every 4096*4096 frame twice. In this big matrix, the values at four positions are wrong.
They are:

framenum 0, x 1230, y 3448, value 22 ,input 64, trans 86
framenum 0, x 1230, y 3449, value 29 ,input 67, trans 96
framenum 0, x 1231, y 3448, value 1 ,input 67, trans 68
framenum 0, x 1231, y 3449, value 9 ,input 60, trans 69

You can see that the four positions are adjacent. More weirdly, if I re-run the program, sometimes the error disappears, sometimes it shows up at other positions, and sometimes 8, 12, or 16 positions have errors. In other words, the error occurs randomly from run to run.
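For anyone who wants to reproduce the check, the comparison described above could look roughly like this on the host. This is only a sketch, not the code from the post: the function name `report_mismatches` and the layout (consecutive row-major frames of `width * height` bytes, with `x` as the column) are assumptions. The `value` column is printed as the absolute difference, which is what the four lines above appear to show (for example |64 - 86| = 22).

```c
#include <stdio.h>
#include <stdlib.h>
#include <stddef.h>

/* Compare the original data with the twice-transposed data and print one
 * line per differing element. Returns the number of mismatches.
 * Assumed layout: `frames` consecutive row-major frames of width*height bytes. */
size_t report_mismatches(const unsigned char *input, const unsigned char *trans,
                         int width, int height, int frames)
{
    size_t bad = 0;
    for (int f = 0; f < frames; f++) {
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                size_t i = ((size_t)f * height + y) * width + x;
                if (input[i] != trans[i]) {
                    printf("framenum %d, x %d, y %d, value %d, input %d, trans %d\n",
                           f, x, y, abs((int)input[i] - (int)trans[i]),
                           input[i], trans[i]);
                    bad++;
                }
            }
        }
    }
    return bad;
}
```

Because the failures move between runs, it helps to log the mismatch count of several consecutive runs rather than a single one.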

If I use only global memory for the transpose (the simplest way), no error happens. And if I use local memory for a simple copy, that is, (1) copy data from global memory to local memory, then (2) copy data from local memory back to global memory, there is no error either!

So I am very confused: why does this happen? I tried different block sizes, with and without volatile, and a local tile of BLOCK_DIM*(BLOCK_DIM+1) or BLOCK_DIM*(BLOCK_DIM). The error is always there.

My card is a K40. By the way, I tried two K40 cards, and both of them show this behavior.

Does anyone have any suggestions? Many thanks in advance for your replies.

Here is the program: transpose_local_memory.

__kernel void transpose_local_uchar(__global unsigned char* input, __global volatile unsigned char* output, int width, int height, int BLOCK_DIM, __local volatile unsigned char* block)
{
    // read the matrix tile into shared memory

    for (int frm = 0; frm < 16; frm++) {
        int x2 = get_global_id(2) / 4;
        int y2 = get_global_id(2) % 4;

        unsigned int xIndex = get_global_id(0) + get_global_size(0) * x2;
        unsigned int yIndex = get_global_id(1) + get_global_size(1) * y2;

        if ((xIndex < width) && (yIndex < height)) {
            unsigned int index_in = yIndex * width + xIndex + width * height * frm;
            //block[get_local_id(1)*(BLOCK_DIM+1)+get_local_id(0)] = input[index_in];
            block[get_local_id(1) * (BLOCK_DIM) + get_local_id(0)] = input[index_in];
        }

        // write the transposed matrix tile to global memory
        xIndex = get_group_id(1) * BLOCK_DIM + get_local_id(0) + get_global_size(1) * y2;
        yIndex = get_group_id(0) * BLOCK_DIM + get_local_id(1) + get_global_size(0) * x2;
        if ((xIndex < height) && (yIndex < width)) {
            unsigned int index_out = yIndex * height + xIndex + width * height * frm;
            //output[index_out] = block[get_local_id(0)*(BLOCK_DIM+1)+get_local_id(1)];
            output[index_out] = block[get_local_id(0) * (BLOCK_DIM) + get_local_id(1)];
        }
    }
}


(1) Uninitialized data
(2) Out-of-bounds memory access
(3) Race condition
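On (3): errors that move between runs and show up in small adjacent groups are what a local memory race tends to look like. The kernel as posted shows no `barrier()` call, and it reuses the same local tile on every pass of the `frm` loop, so two synchronization points seem worth checking. Below is a hedged sketch of where they would go, not a verified fix. It assumes a work-group of `BLOCK_DIM` x `BLOCK_DIM` x 1 work-items, so that every work-item in a group executes both barriers the same number of times. The kernel name `transpose_local_uchar_sync` is made up for illustration.

```c
__kernel void transpose_local_uchar_sync(__global const unsigned char *input,
                                         __global unsigned char *output,
                                         int width, int height, int BLOCK_DIM,
                                         __local unsigned char *block)
{
    const int x2 = get_global_id(2) / 4;
    const int y2 = get_global_id(2) % 4;

    for (int frm = 0; frm < 16; frm++) {
        unsigned int xIndex = get_global_id(0) + get_global_size(0) * x2;
        unsigned int yIndex = get_global_id(1) + get_global_size(1) * y2;

        // read the matrix tile into local memory
        if ((xIndex < width) && (yIndex < height)) {
            unsigned int index_in = yIndex * width + xIndex + width * height * frm;
            block[get_local_id(1) * (BLOCK_DIM + 1) + get_local_id(0)] = input[index_in];
        }

        // every load into the tile must finish before any work-item reads it
        barrier(CLK_LOCAL_MEM_FENCE);

        // write the transposed matrix tile to global memory
        xIndex = get_group_id(1) * BLOCK_DIM + get_local_id(0) + get_global_size(1) * y2;
        yIndex = get_group_id(0) * BLOCK_DIM + get_local_id(1) + get_global_size(0) * x2;
        if ((xIndex < height) && (yIndex < width)) {
            unsigned int index_out = yIndex * height + xIndex + width * height * frm;
            output[index_out] = block[get_local_id(0) * (BLOCK_DIM + 1) + get_local_id(1)];
        }

        // every read must finish before the next pass overwrites the tile
        barrier(CLK_LOCAL_MEM_FENCE);
    }
}
```

Both barriers sit outside the bounds checks on purpose, since a barrier that only some work-items of a group reach is undefined behavior in OpenCL. With the padded indexing used here, the local buffer needs `BLOCK_DIM * (BLOCK_DIM + 1)` bytes, passed from the host as `clSetKernelArg(kernel, 5, size, NULL)`. If the original kernel already has the first barrier, the second one at the end of the loop body is the one I would look at first.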

[s](1) Use cuda-memcheck, it may find the issue (try both debug and release builds)
(2) Acquaint yourself with the CUDA debugger
(3) Use printf to log key variables. Are the values what you expected? If not, trace back to where they originate[/s]
Sorry, noticed belatedly that you are using OpenCL, not CUDA. Wasn’t expecting that in a forum dedicated to CUDA. Suggestions: Carefully reconsider your choice of programming language. Ask for help on a forum dedicated to OpenCL.

Another suggestion: provide a complete example that someone could compile and run. Define the expected output and how to observe the problem. Include the compile command line showing how you build the code.
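As a starting point for such an example, a CPU reference can define the expected output without any GPU involved. The sketch below is an assumption-laden stand-in, not the poster's program: `TILE` plays the role of `BLOCK_DIM`, and the function names are invented. It walks the matrix tile by tile in two phases (load the tile, then store it transposed), which mirrors the structure of a local memory kernel with the phase boundary standing in for the barrier.

```c
#include <stddef.h>
#include <string.h>

#define TILE 16  /* hypothetical tile size, stands in for BLOCK_DIM */

/* Transpose one row-major frame of `width` columns and `height` rows into
 * `out`, which then has `height` columns and `width` rows. */
void transpose_tiled(const unsigned char *in, unsigned char *out,
                     int width, int height)
{
    unsigned char tile[TILE][TILE + 1];

    for (int ty = 0; ty < height; ty += TILE) {
        for (int tx = 0; tx < width; tx += TILE) {
            /* Phase 1: each "work-item" (lx, ly) loads one element. */
            for (int ly = 0; ly < TILE; ly++)
                for (int lx = 0; lx < TILE; lx++)
                    if (tx + lx < width && ty + ly < height)
                        tile[ly][lx] = in[(size_t)(ty + ly) * width + (tx + lx)];

            /* Phase 2: each "work-item" stores one transposed element. */
            for (int ly = 0; ly < TILE; ly++)
                for (int lx = 0; lx < TILE; lx++) {
                    int x = ty + lx;  /* column in out */
                    int y = tx + ly;  /* row in out */
                    if (x < height && y < width)
                        out[(size_t)y * height + x] = tile[lx][ly];
                }
        }
    }
}

/* Returns 1 if transposing twice reproduces the input, 0 otherwise. */
int double_transpose_ok(const unsigned char *in, unsigned char *tmp,
                        unsigned char *back, int width, int height)
{
    transpose_tiled(in, tmp, width, height);
    transpose_tiled(tmp, back, height, width);
    return memcmp(in, back, (size_t)width * height) == 0;
}
```

Comparing the GPU result against `transpose_tiled` after a single transpose, rather than only after the round trip, would also show whether the wrong bytes come from the first or the second pass.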

Many thanks for all of your suggestions.