Nsight VSE CUDA Profiler and the CUDA Visual Profiler 6.0 can collect the number of transactions per memory instruction executed and display this information at the SASS (assembly) and C source level. I recommend you run your program in these tools and look at the annotated source code to determine if your access patterns are coalesced.
Unfortunately I have problems running nsight visual profiler.
Is there any way to check with nvprof command if I have coalesced access?
little_jimmy ,
I checked your approach and it gives me wrong results also.
Now , I spotted something.
- Regarding my code ( using
if ( tx == 0 ) myshared[ty][tx] = *( dev_input + (J-1) );
if ( tx == tile_width -1 ) myshared[ty][tx+2] = *( dev_input + (J+1) );
)
If I run cuda-memcheck ,it shows :
========= Invalid __global__ read of size 4
========= at 0x00000a18.....
========= by thread (15,15,0) in block (3,4,0)
========= Address 0x13002fff88 is out of bounds
Invalid __global__ read of size 4
========= at 0x00000a18 in .....
========= by thread (15,15,0) in block (3,4,1)
========= Address 0x1300364000 is out of bounds
Program hit error 4 on CUDA API call to cudaDeviceSynchronize
========= Saved host backtrace up to driver entry point at error
It shows 2 errors regarding the address out of bounds which are in threads 15,15 and blocks 3,4 .
I am using 16,16 threads and 4,5 blocks, so its on the last blocks.
If I run using
cuda-memcheck --destroy-on-device-error kernel
,it shows me many errors but all refer to thread 15,15 and block 3,4 for all images that I load.
I am refreshig the code:
for (int i = 0; i < N; i++ ) { //N is number of images
__syncthreads();
J = RowIdx * Cols + ColIdx + Rows * Cols * i;
Jm1 = theIJ - 1;
if ( 0 == ColIdx ) Jm1 += Cols;
Jp1 = theIJ + 1;
if ( Cols == ( ColIdx - 1 ) ) Jp1 -= Cols;
myshared[ty][tx + 1] = *( dev_input + J );
//taking into account boundary conditions
if ( tx == 0 ) myshared[ty][tx] = *( dev_input + (J-1) );
if ( tx == tile_width -1 ) myshared[ty][tx+2] = *( dev_input + (J+1) );
__syncthreads();
All the errors refer to line :
if ( tx == tile_width -1 ) myshared[ty][tx+2] = *( dev_input + (J+1) );
If I use N<=6 I receive no errors!Errors appear for N>6 , N is the number of the images.
I am using rows =80 ,cols = 64 ,so 4 blocks in x => 416threads = 64 ,5 blocks in y => 516=80.
- Using your code , I am taking error:
Invalid __global__ read of size 4
========= at 0x00000cb8 in
...
========= by thread (0,0,0) in block (3,0,0)
....
And all errors refer to thread 0 ,0 in block 3,0 for all images.
And to line :
boundary_in_shared[(threadIdx.y * 2) + 1] = image_in_gbl[tile_gbl_offset + tile_gbl_index - (columns - 1)];
ANy ideas?
Thanks!
I think I found the error!
if ( Cols == ( ColIdx - 1 ) ) Jp1 -= Cols;
It should be :
if ( ( Cols - 1) == ColIdx ) Jp1 -= Cols;
and now no errors!!!