Coalesced shared memory access? Read and write from which thread to which?

Greg · May 20, 2014, 4:27am

Nsight VSE CUDA Profiler and the CUDA Visual Profiler 6.0 can collect the number of transactions per memory instruction executed and display this information at the SASS (assembly) and C source level. I recommend you run your program in these tools and look at the annotated source code to determine if your access patterns are coalesced.

ggeo · May 20, 2014, 7:15am

Unfortunately, I have problems running the Nsight visual profiler.

Is there any way to check with the nvprof command whether my accesses are coalesced?

ggeo · May 22, 2014, 2:18pm

little_jimmy,

I checked your approach, and it also gives me wrong results.

Now, I spotted something.

Regarding my code, which uses:

if ( tx == 0 ) myshared[ty][tx] = *( dev_input + (J-1)  );
if ( tx == tile_width -1 ) myshared[ty][tx+2] =  *( dev_input + (J+1) );

If I run cuda-memcheck, it shows:

========= Invalid __global__ read of size 4
=========     at 0x00000a18.....
=========     by thread (15,15,0) in block (3,4,0)
=========     Address 0x13002fff88 is out of bounds

 Invalid __global__ read of size 4
=========     at 0x00000a18 in .....
=========     by thread (15,15,0) in block (3,4,1)
=========     Address 0x1300364000 is out of bounds

Program hit error 4 on CUDA API call to cudaDeviceSynchronize 
=========     Saved host backtrace up to driver entry point at error

It shows two out-of-bounds address errors, both in thread (15,15) of block (3,4).
I am using 16 x 16 threads and 4 x 5 blocks, so it is in the last block.

If I run using

cuda-memcheck --destroy-on-device-error kernel

it shows many errors, but all of them refer to thread (15,15) in block (3,4), for every image that I load.

Here is the updated code:

for (int i = 0; i < N; i++ ) {   // N is the number of images

    __syncthreads();

    J = RowIdx * Cols + ColIdx + Rows * Cols * i;

    Jm1 = theIJ - 1;
    if ( 0 == ColIdx ) Jm1 += Cols;

    Jp1 = theIJ + 1;
    if ( Cols == ( ColIdx - 1 ) ) Jp1 -= Cols;

    myshared[ty][tx + 1] = *( dev_input + J );

    // taking into account boundary conditions
    if ( tx == 0 ) myshared[ty][tx] = *( dev_input + (J-1)  );
    if ( tx == tile_width -1 ) myshared[ty][tx+2] =  *( dev_input + (J+1) );

    __syncthreads();

All the errors refer to the line:

if ( tx == tile_width -1 ) myshared[ty][tx+2] =  *( dev_input + (J+1) );

If I use N <= 6, I receive no errors! Errors appear only for N > 6, where N is the number of images.

I am using rows = 80 and cols = 64, so 4 blocks in x => 4 * 16 threads = 64, and 5 blocks in y => 5 * 16 = 80.
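The index arithmetic behind these numbers can be checked on the host. The sketch below is plain C++ rather than kernel code. It assumes the usual mapping ColIdx = blockIdx.x * tile_width + threadIdx.x and RowIdx = blockIdx.y * tile_width + threadIdx.y, which the thread does not show, and it assumes dev_input holds exactly N * Rows * Cols elements. The helper names are hypothetical. Under those assumptions, thread (15,15) in block (3,4) owns the last pixel of each image, so for the last image J + 1 is one past the end of the buffer. This does not by itself explain why no error is reported for N <= 6.

```cpp
#include <cassert>

// Hypothetical host-side mirror of the kernel's index arithmetic.
// Assumed mapping (not shown in the thread): global = block * tile_width + thread.
constexpr int kTileWidth = 16;  // 16 x 16 threads per block
constexpr int kRows = 80;       // 5 blocks in y
constexpr int kCols = 64;       // 4 blocks in x

constexpr int global_coord(int block, int thread) {
    return block * kTileWidth + thread;
}

// J = RowIdx * Cols + ColIdx + Rows * Cols * i, as in the posted loop.
inline long flat_index(int row, int col, int image) {
    assert(row >= 0 && row < kRows);
    assert(col >= 0 && col < kCols);
    return static_cast<long>(row) * kCols + col
         + static_cast<long>(kRows) * kCols * image;
}

// True when reading dev_input[J + 1] would fall outside a buffer that is
// assumed to hold exactly n_images * kRows * kCols elements.
inline bool right_halo_out_of_bounds(int row, int col, int image, int n_images) {
    const long total = static_cast<long>(n_images) * kRows * kCols;
    return flat_index(row, col, image) + 1 >= total;
}

// Thread (15,15) in block (3,4) maps to the last column and the last row.
static_assert(global_coord(3, 15) == kCols - 1, "last column");
static_assert(global_coord(4, 15) == kRows - 1, "last row");
```

For example, right_halo_out_of_bounds(79, 63, N - 1, N) is true for any N, while the same pixel in an earlier image reads the first element of the next image instead.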

Using your code, I get this error:

Invalid __global__ read of size 4
=========     at 0x00000cb8 in 
...
=========     by thread (0,0,0) in block (3,0,0)
....

All the errors refer to thread (0,0) in block (3,0), for every image,

and to the line:

boundary_in_shared[(threadIdx.y * 2) + 1] = image_in_gbl[tile_gbl_offset + tile_gbl_index - (columns - 1)];

Any ideas?

Thanks!

ggeo · May 22, 2014, 3:33pm

I think I found the error!

if ( Cols == ( ColIdx - 1 ) ) Jp1 -= Cols;

It should be:

if ( ( Cols  - 1)  == ColIdx  ) Jp1 -= Cols;

and now there are no errors!
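A minimal host-side sketch of the corrected wrap-around logic, written as plain C++ so it can be checked without a GPU. It assumes the intent is a periodic boundary within each row (the left neighbour of column 0 is column Cols - 1, and the right neighbour of column Cols - 1 is column 0). The function names are hypothetical; Cols, ColIdx, J, Jm1 and Jp1 follow the names used above. Note that the posted halo loads read J - 1 and J + 1 directly, so the wrapped values only take effect where Jm1 and Jp1 replace those offsets.

```cpp
#include <cassert>

// Hypothetical helpers. J is the flat index of the current element:
// RowIdx * Cols + ColIdx + Rows * Cols * i.

// Left neighbour with periodic wrap: column 0 wraps to column Cols - 1.
inline long left_neighbour(long J, int ColIdx, int Cols) {
    assert(ColIdx >= 0 && ColIdx < Cols);
    long Jm1 = J - 1;
    if (0 == ColIdx) Jm1 += Cols;
    return Jm1;
}

// Right neighbour with periodic wrap: column Cols - 1 wraps to column 0.
// The original test, Cols == (ColIdx - 1), can never be true for a valid
// column index, so the wrap was never applied.
inline long right_neighbour(long J, int ColIdx, int Cols) {
    assert(ColIdx >= 0 && ColIdx < Cols);
    long Jp1 = J + 1;
    if ((Cols - 1) == ColIdx) Jp1 -= Cols;
    return Jp1;
}
```

With Rows = 80 and Cols = 64, the last element of the first image is J = 5119, and right_neighbour(5119, 63, 64) gives 5056, the start of the same row, rather than 5120, the first element of the next image.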
