help me! coalesced access

help me

/******

dim3 nThreads(16, 16);                      // 16x16 threads per block
int dimBlockW = (width + 15) / 16;          // ceil(width / 16)
int dimBlockH = (height + 15) / 16;         // ceil(height / 16)
dim3 nBlocks(dimBlockW, dimBlockH);

uchar4 *dptr = NULL;
size_t num_bytes;
cutilSafeCall(cudaGraphicsMapResources(1, &pboResource, 0));
cutilSafeCall(cudaGraphicsResourceGetMappedPointer((void **)&dptr, &num_bytes, pboResource));

kernel<<< nBlocks, nThreads >>>(dptr, width, height);
cutilSafeCall(cudaThreadSynchronize());

.....

******/

__global__ void kernel(uchar4 *img, int width, int height)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;   // column within the image
    int i = blockIdx.y * blockDim.y + threadIdx.y;   // row within the image

    if ((j < width) && (i < height))
    {
        ....

        img[(i * width) + j] = make_uchar4(r, g, b, 255); // <-------------- COALESCED ACCESS ?
    }
}


i’m assuming that img is in shared memory. (you can have 2D arrays so that it’s just img[i][j], btw)

in which case, since blockDim.x is only 16, it depends on the compute version, because it depends on the number of memory banks (i.e. whether it uses “half-warps”). on a lower compute version, with 16 memory banks, yes, that’s coalesced. but later ones have 32 banks and thus require 32-word granularity.

for compute versions with 32 memory banks, for instance, each 16x16 thread block will have the threads in threadIdx.y=0 and threadIdx.y=1 hit bank conflicts (both writing to banks 0 through 15), thus taking 2 operations instead of 1. likewise for rows 2 & 3, 4 & 5, etc.

i’m assuming make_uchar4(…) creates a 32-bit word. i don’t know much about it, just that the processor doesn’t really support vector operations, so that one instruction is about all the speedup you’re going to get. a mistake on nvidia’s part, if you ask me, especially for integer ops: for integer ops all you have to do is not do the carries. but i digress.
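to make the bank arithmetic concrete, here is a toy host-side model (my own sketch, nothing from the code above). it assumes 32 four-byte-wide banks and shows that whether the two rows a warp covers collide depends on the row stride of the shared array:

// my own sketch, not the OP's code: model which of the 32 shared-memory
// banks each lane of a warp hits when a 16x16 block writes tile[y][x].
#include <stdio.h>

static int worstCaseLanesPerBank(int rowStrideWords)
{
    int hits[32] = {0};
    for (int lane = 0; lane < 32; ++lane) {
        int x = lane % 16;                    // blockDim.x = 16
        int y = lane / 16;                    // one warp spans rows 0 and 1
        int word = y * rowStrideWords + x;    // word index of tile[y][x]
        hits[word % 32]++;                    // 32 banks, each one word wide
    }
    int worst = 0;
    for (int b = 0; b < 32; ++b)
        if (hits[b] > worst) worst = hits[b];
    return worst;
}

int main(void)
{
    printf("stride 16   -> %d-way\n", worstCaseLanesPerBank(16));   // rows land in banks 0-15 and 16-31: conflict-free
    printf("stride 1024 -> %d-way\n", worstCaseLanesPerBank(1024)); // both rows hit banks 0-15: 2-way conflict
    // on older 16-bank parts each half-warp (one row here) is resolved
    // separately, so even stride 1024 is conflict-free there
    return 0;
}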


img is not in shared memory…it’s in global memory


Hi!

- If the compute capability of your device is less than 1.2: if width >= 2, TOTALLY UNCOALESCED.
- Else:
  • You are accessing 4-byte words → segment size = 128 bytes → 32 words → if width is 1 or 2: TOTALLY COALESCED;
  • else if 2 < width < 32: UNCOALESCED ACCESSES, more inefficient with each increment of width;
  • else (width > 32): TOTALLY UNCOALESCED ACCESSES (32 transactions per warp).

I hope that it’s clear now :)
Equally, you can check these results with the profiler (in Linux, or Nsight in Windows).

Regards!
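for intuition about why the 16-wide block costs extra transactions, here is a little sketch (mine, not from the posts above) that just prints the addresses warp 0 of block (0,0) touches for the img[i*width + j] write; with blockDim = (16,16) one warp straddles two image rows:

// my own sketch: enumerate the uchar4 addresses touched by warp 0 of
// block (0,0) for the write img[i*width + j], with blockDim = (16,16).
#include <stdio.h>

int main(void)
{
    int width = 1024;                         // assumed image width
    for (int lane = 0; lane < 32; ++lane) {
        int j = lane % 16;                    // threadIdx.x, blockIdx.x = 0
        int i = lane / 16;                    // threadIdx.y, blockIdx.y = 0
        printf("lane %2d -> img[%4d] (byte offset %5d)\n",
               lane, i * width + j, (i * width + j) * 4);
    }
    // lanes 0-15  -> byte offsets    0..60   (one 64-byte run)
    // lanes 16-31 -> byte offsets 4096..4156 (a second 64-byte run, width*4 bytes away)
    // so each warp needs two memory transactions instead of one
    return 0;
}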


but what if I put ------> dim3 nThreads(32,16)?

i = 0, j = 0,1,2,…,31 ==> img[0] → img[31]
i = 1, j = 0,1,2,… ==> img[width] → img[width+31]
…

warp ↔ row of the block


Sorry, but I don’t understand what you’re trying to tell me.
What is the value of width?

PS: In block #0, j goes from 0 to 15, not to 31.


ok

I changed it… now ----> block size = (32 x 16), grid size = [(width+31)/32 x (height+15)/16]

for example, an image of 1024 (pixels) x 768 (pixels) ----> width = 1024, height = 768 ------> block size = [32 x 16] and grid size = [32 x 48], because (1024+31)/32 = 32 and (768+15)/16 = 48

img is 1024 x 768 x 4 bytes, img[0] [1] [2] … [1024*768 - 1], as linear memory (in device memory). =========> for example, the warp with i=0 and j=0,1,…,31 ====> img[0*1024+0] [0*1024+1] … [0*1024+31]: a segment of 128 bytes (4 x 32)
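spelled out as code, the revised launch would look like this (a sketch using the same names as before):

// sketch of the revised launch configuration described above
dim3 nThreads(32, 16);                 // 32-wide: one warp per block row
int dimBlockW = (width  + 31) / 32;    // 1024 -> 32 blocks across
int dimBlockH = (height + 15) / 16;    //  768 -> 48 blocks down
dim3 nBlocks(dimBlockW, dimBlockH);

// each warp now writes 32 consecutive uchar4s, img[i*width + j .. j+31]:
// one aligned 128-byte segment when width is a multiple of 32
kernel<<< nBlocks, nThreads >>>(dptr, width, height);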


OK, in that case, with a 32-thread x-dimension it should be coalesced (as far as global memory access is concerned) on any compute version now, assuming that img is aligned (i.e. the memory address of img[0] is a multiple of the transaction size - I’d align it on at least a 32-dword (128-byte) boundary just to be safe).

another optimization, if possible: if this data is then read by another kernel, you might want to fuse the kernels and thus eliminate the intermediate global-memory read/write altogether.

if that’s not possible, I’d do what I could to make sure you can run (and are running) another kernel with high compute intensity (low global memory access) on the same SM at the same time, to occupy all the idle time the global-memory latency is going to produce. i.e. I/O-limited kernel + compute-limited kernel = balanced kernel. i.e. 1+1=1!
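for the alignment point, a quick check costs nothing (my sketch; in practice cudaMalloc allocations and, as far as I know, mapped PBO pointers are already well aligned, but asserting it is cheap):

// my sketch: verify the mapped pointer is 128-byte aligned before
// counting on one-transaction-per-warp writes
#include <assert.h>
#include <stdint.h>

uchar4 *dptr = NULL;
size_t num_bytes;
cutilSafeCall(cudaGraphicsMapResources(1, &pboResource, 0));
cutilSafeCall(cudaGraphicsResourceGetMappedPointer((void **)&dptr, &num_bytes, pboResource));
assert(((uintptr_t)dptr % 128) == 0);  // 32 dwords = 128 bytes
// row starts sit at i*width*4 bytes, so rows stay 128-byte aligned
// as long as width is a multiple of 32 (1024 qualifies)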
