help me! coalesced access

help me

/******

dim3 nThreads(16, 16);                      // 16x16 threads per block
int dimBlockW = (width + 15) / 16;          // ceil(width / 16)
int dimBlockH = (height + 15) / 16;         // ceil(height / 16)
dim3 nBlocks(dimBlockW, dimBlockH);

uchar4 *dptr = NULL;
size_t num_bytes;
cutilSafeCall(cudaGraphicsMapResources(1, &pboResource, 0));
cutilSafeCall(cudaGraphicsResourceGetMappedPointer((void **)&dptr, &num_bytes, pboResource));

kernel<<< nBlocks, nThreads >>>(dptr, width, height);
cutilSafeCall(cudaThreadSynchronize());

.....

******/

__global__ void kernel(uchar4 *img, int width, int height)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;   // column within the image
    int i = blockIdx.y * blockDim.y + threadIdx.y;   // row within the image

    if ((j < width) && (i < height))
    {
        ....

        img[(i * width) + j] = make_uchar4(r, g, b, 255); // <-------------- COALESCED ACCESS ?
    }
}


i’m assuming that img is in shared memory. (you can have 2D arrays so that it’s just img[i][j], btw)

in which case, since blockDim.x is only 16, it depends on the compute version, because it depends on the number of memory banks (i.e. whether it uses “half-warps”). on a lower compute version, with 16 memory banks, yes, that’s coalesced. but later ones have 32 banks and thus require 32-word granularity.

for compute versions with 32 memory banks, for instance, each 16x16 thread block will have the threads in threadIdx.y=0 and threadIdx.y=1 hit bank conflicts (both writing to banks 0 through 15), thus taking 2 operations instead of 1. likewise for rows 2 & 3, 4 & 5, etc.

i’m assuming make_uchar4(…) creates a 32-bit word. i don’t know much about it, just that the processor doesn’t really support vector operations, so that one instruction is about all the speedup you’re going to get. a mistake on nvidia’s part, if you ask me, especially for integer ops: for integer ops all you have to do is not do the carries. but i digress.
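to make the bank arithmetic concrete, here is a toy host-side model (my own sketch, nothing from the code above). it assumes 32 four-byte-wide banks and shows that whether the two rows a warp covers collide depends on the row stride of the shared array:

// my own sketch, not the OP's code: model which of the 32 shared-memory
// banks each lane of a warp hits when a 16x16 block writes tile[y][x].
#include <stdio.h>

static int worstCaseLanesPerBank(int rowStrideWords)
{
    int hits[32] = {0};
    for (int lane = 0; lane < 32; ++lane) {
        int x = lane % 16;                    // blockDim.x = 16
        int y = lane / 16;                    // one warp spans rows 0 and 1
        int word = y * rowStrideWords + x;    // word index of tile[y][x]
        hits[word % 32]++;                    // 32 banks, each one word wide
    }
    int worst = 0;
    for (int b = 0; b < 32; ++b)
        if (hits[b] > worst) worst = hits[b];
    return worst;
}

int main(void)
{
    printf("stride 16   -> %d-way\n", worstCaseLanesPerBank(16));   // rows land in banks 0-15 and 16-31: conflict-free
    printf("stride 1024 -> %d-way\n", worstCaseLanesPerBank(1024)); // both rows hit banks 0-15: 2-way conflict
    // on older 16-bank parts each half-warp (one row here) is resolved
    // separately, so even stride 1024 is conflict-free there
    return 0;
}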


img is not in shared memory…it’s in global memory


Hi!

- If the compute capability of your device is less than 1.2: if width >= 2, TOTALLY UNCOALESCED.
- Else:
  • You are accessing 4-byte words → segment size = 128 bytes → 32 words → if width is 1 or 2: TOTALLY COALESCED;
  • else if 2 < width < 32: UNCOALESCED ACCESSES, more inefficient with each increment of width;
  • else (width > 32): TOTALLY UNCOALESCED ACCESSES (32 transactions per warp).

I hope that it’s clear now :)
Equally, you can check these results with the profiler (in Linux, or Nsight in Windows).

Regards!
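for intuition about why the 16-wide block costs extra transactions, here is a little sketch (mine, not from the posts above) that just prints the addresses warp 0 of block (0,0) touches for the img[i*width + j] write; with blockDim = (16,16) one warp straddles two image rows:

// my own sketch: enumerate the uchar4 addresses touched by warp 0 of
// block (0,0) for the write img[i*width + j], with blockDim = (16,16).
#include <stdio.h>

int main(void)
{
    int width = 1024;                         // assumed image width
    for (int lane = 0; lane < 32; ++lane) {
        int j = lane % 16;                    // threadIdx.x, blockIdx.x = 0
        int i = lane / 16;                    // threadIdx.y, blockIdx.y = 0
        printf("lane %2d -> img[%4d] (byte offset %5d)\n",
               lane, i * width + j, (i * width + j) * 4);
    }
    // lanes 0-15  -> byte offsets    0..60   (one 64-byte run)
    // lanes 16-31 -> byte offsets 4096..4156 (a second 64-byte run, width*4 bytes away)
    // so each warp needs two memory transactions instead of one
    return 0;
}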


but what if I put ------> dim3 nThreads(32,16)?

i = 0, j = 0,1,2,…,31 ==> img[0] → img[31]
i = 1, j = 0,1,2,… ==> img[width] → img[width+31]
…

warp ↔ row of the block


Sorry, but I don’t understand what you’re trying to tell me.
What is the value of width?

PS: In block #0, j goes from 0 to 15, not to 31.


ok

I changed it… now ----> block size = (32 x 16), grid size = [(width+31)/32 x (height+15)/16]

for example, an image of 1024 (pixels) x 768 (pixels) ----> width = 1024, height = 768 ------> block size = [32 x 16] and grid size = [32 x 48], because (1024+31)/32 = 32 and (768+15)/16 = 48

img is 1024 x 768 x 4 bytes, img[0] [1] [2] … [1024*768 - 1], as linear memory (in device memory). =========> for example, the warp with i=0 and j=0,1,…,31 ====> img[0*1024+0] [0*1024+1] … [0*1024+31]: a segment of 128 bytes (4 x 32)
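spelled out as code, the revised launch would look like this (a sketch using the same names as before):

// sketch of the revised launch configuration described above
dim3 nThreads(32, 16);                 // 32-wide: one warp per block row
int dimBlockW = (width  + 31) / 32;    // 1024 -> 32 blocks across
int dimBlockH = (height + 15) / 16;    //  768 -> 48 blocks down
dim3 nBlocks(dimBlockW, dimBlockH);

// each warp now writes 32 consecutive uchar4s, img[i*width + j .. j+31]:
// one aligned 128-byte segment when width is a multiple of 32
kernel<<< nBlocks, nThreads >>>(dptr, width, height);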


OK, in that case, with a 32-thread x-dimension it should be coalesced (as far as global memory access is concerned) on any compute version now, assuming that img is aligned (i.e. the memory address of img[0] is a multiple of the transaction size - I’d align it on at least a 32-dword (128-byte) boundary just to be safe).

another optimization, if possible: if this data is then read by another kernel, you might want to fuse the kernels and thus eliminate the intermediate global-memory read/write altogether.

if that’s not possible, I’d do what I could to make sure you can run (and are running) another kernel with high compute intensity (low global memory access) on the same SM at the same time, to occupy all the idle time the global-memory latency is going to produce. i.e. I/O-limited kernel + compute-limited kernel = balanced kernel. i.e. 1+1=1!
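for the alignment point, a quick check costs nothing (my sketch; in practice cudaMalloc allocations and, as far as I know, mapped PBO pointers are already well aligned, but asserting it is cheap):

// my sketch: verify the mapped pointer is 128-byte aligned before
// counting on one-transaction-per-warp writes
#include <assert.h>
#include <stdint.h>

uchar4 *dptr = NULL;
size_t num_bytes;
cutilSafeCall(cudaGraphicsMapResources(1, &pboResource, 0));
cutilSafeCall(cudaGraphicsResourceGetMappedPointer((void **)&dptr, &num_bytes, pboResource));
assert(((uintptr_t)dptr % 128) == 0);  // 32 dwords = 128 bytes
// row starts sit at i*width*4 bytes, so rows stay 128-byte aligned
// as long as width is a multiple of 32 (1024 qualifies)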
