Coalesced shared memory access? Read and write from which thread to which?

Nsight VSE CUDA Profiler and the CUDA Visual Profiler 6.0 can collect the number of transactions per memory instruction executed and display this information at the SASS (assembly) and C source level. I recommend you run your program in these tools and look at the annotated source code to determine if your access patterns are coalesced.

Unfortunately I have problems running nsight visual profiler.

Is there any way to check with nvprof command if I have coalesced access?

little_jimmy ,

I checked your approach and it gives me wrong results also.

Now , I spotted something.

  1. Regarding my code ( using
if ( tx == 0 ) myshared[ty][tx] = *( dev_input + (J-1)  );
if ( tx == tile_width -1 ) myshared[ty][tx+2] =  *( dev_input + (J+1) );

)

If I run cuda-memcheck ,it shows :

========= Invalid __global__ read of size 4
=========     at 0x00000a18.....
=========     by thread (15,15,0) in block (3,4,0)
=========     Address 0x13002fff88 is out of bounds

 Invalid __global__ read of size 4
=========     at 0x00000a18 in .....
=========     by thread (15,15,0) in block (3,4,1)
=========     Address 0x1300364000 is out of bounds

Program hit error 4 on CUDA API call to cudaDeviceSynchronize 
=========     Saved host backtrace up to driver entry point at error

It shows 2 errors regarding the address out of bounds which are in threads 15,15 and blocks 3,4 .
I am using 16,16 threads and 4,5 blocks, so its on the last blocks.

If I run using

cuda-memcheck --destroy-on-device-error kernel

,it shows me many errors but all refer to thread 15,15 and block 3,4 for all images that I load.

I am refreshig the code:

for (int i = 0; i < N; i++ ) {   //N is number of images

           __syncthreads();

           J = RowIdx * Cols + ColIdx + Rows * Cols * i;

           Jm1 = theIJ - 1;
	   if ( 0 == ColIdx ) Jm1 += Cols;

	   Jp1 = theIJ + 1;
	   if ( Cols == ( ColIdx - 1 ) ) Jp1 -= Cols;

            myshared[ty][tx + 1] = *( dev_input + J );

//taking into account boundary conditions
            if ( tx == 0 ) myshared[ty][tx] = *( dev_input + (J-1)  );
	    if ( tx == tile_width -1 ) myshared[ty][tx+2] =  *( dev_input + (J+1) );
            
             __syncthreads();

All the errors refer to line :

if ( tx == tile_width -1 ) myshared[ty][tx+2] =  *( dev_input + (J+1) );

If I use N<=6 I receive no errors!Errors appear for N>6 , N is the number of the images.

I am using rows =80 ,cols = 64 ,so 4 blocks in x => 416threads = 64 ,5 blocks in y => 516=80.

  1. Using your code , I am taking error:
Invalid __global__ read of size 4
=========     at 0x00000cb8 in 
...
=========     by thread (0,0,0) in block (3,0,0)
....

And all errors refer to thread 0 ,0 in block 3,0 for all images.

And to line :

boundary_in_shared[(threadIdx.y * 2) + 1] = image_in_gbl[tile_gbl_offset + tile_gbl_index - (columns - 1)];

ANy ideas?

Thanks!

I think I found the error!

if ( Cols == ( ColIdx - 1 ) ) Jp1 -= Cols;

It should be :

if ( ( Cols  - 1)  == ColIdx  ) Jp1 -= Cols;

and now no errors!!!