Coalescing

TRT_Sam · October 22, 2010, 12:41pm

Hello!

I am in front of a trouble of coalescing in a simple kernel that can be resume as follow :

global void Kernel(unsigned char* OP,int W, int H)
{
unsigned int x = blockIdx.xblockDim.x + threadIdx.x;
unsigned int y = blockIdx.yblockDim.y + threadIdx.y;

if (x<W && y<H)
{
    float OP_val;

        // Process

    OP[(W*y)+x] = (unsigned char)OP_val;
}

}

For some reasons I cannot give more details about the processing part but there no memory access in this part but texture hits.
Anyway, I do not understand why the writes are not coalesced. Reading the programming guide, it should be…
Does somebody know why? Can it come from the unsigned char cast?
Thanks in advance.

Sam

TRT_Sam · October 22, 2010, 12:41pm

Hello!

I am in front of a trouble of coalescing in a simple kernel that can be resume as follow :

global void Kernel(unsigned char* OP,int W, int H)
{
unsigned int x = blockIdx.xblockDim.x + threadIdx.x;
unsigned int y = blockIdx.yblockDim.y + threadIdx.y;

if (x<W && y<H)
{
    float OP_val;

        // Process

    OP[(W*y)+x] = (unsigned char)OP_val;
}

}

For some reasons I cannot give more details about the processing part but there no memory access in this part but texture hits.
Anyway, I do not understand why the writes are not coalesced. Reading the programming guide, it should be…
Does somebody know why? Can it come from the unsigned char cast?
Thanks in advance.

Sam

tera · October 22, 2010, 12:59pm

What is the compute capability of your GPU? How is OP aligned? What are the values of W, H, and blocksize?

tera · October 22, 2010, 12:59pm

What is the compute capability of your GPU? How is OP aligned? What are the values of W, H, and blocksize?

TRT_Sam · October 22, 2010, 1:53pm

Thanks for your quick answer.

I am using a GTS250 (capability 1.1). The image size (WxH) is 1280x1024 and I tried different block sizes (8x8, 16x16, 32x8, 256x1 …). OP is allocated with cudaMalloc which is implicitely aligned isn’t it?

TRT_Sam · October 22, 2010, 1:53pm

Thanks for your quick answer.

I am using a GTS250 (capability 1.1). The image size (WxH) is 1280x1024 and I tried different block sizes (8x8, 16x16, 32x8, 256x1 …). OP is allocated with cudaMalloc which is implicitely aligned isn’t it?

insmvb00 · October 25, 2010, 11:02am

Hi!

if(W>=4) this access will be totally uncoalesced (16tx by halfwarp).

See the algorithm G.3.2.1 of the programming guide.

Regards!

insmvb00 · October 25, 2010, 11:02am

Hi!

if(W>=4) this access will be totally uncoalesced (16tx by halfwarp).

See the algorithm G.3.2.1 of the programming guide.

Regards!

TRT_Sam · October 25, 2010, 11:35am

Thanks for your answer!

So, if I got it, accesing “unsigned char” (one byte word) cannot be coalesced cause it must access at least 4-bytes word. Right?

TRT_Sam · October 25, 2010, 11:35am

Thanks for your answer!

So, if I got it, accesing “unsigned char” (one byte word) cannot be coalesced cause it must access at least 4-bytes word. Right?

insmvb00 · October 25, 2010, 11:59am

I’m not sure (about the segment size), but as you are accessing to 1byte words, the segment that is taken is of 64 bytes, ok?

well, if your thread 0 of block 0 accesses at word 0 of this segment,

the tread 1 of block 0 accesses at word 4 of this segment,

…

the thread 15 of block 0 accesses at word 60 if this segment, you see that for this first halfwarp all accesses lies in the same segment! but if W were > 4?

If W > 4 there’ll be uncoalescing and in your case (CC 1.1) 16 tx per halfwarp.

insmvb00 · October 25, 2010, 11:59am

I’m not sure (about the segment size), but as you are accessing to 1byte words, the segment that is taken is of 64 bytes, ok?

well, if your thread 0 of block 0 accesses at word 0 of this segment,

the tread 1 of block 0 accesses at word 4 of this segment,

…

the thread 15 of block 0 accesses at word 60 if this segment, you see that for this first halfwarp all accesses lies in the same segment! but if W were > 4?

If W > 4 there’ll be uncoalescing and in your case (CC 1.1) 16 tx per halfwarp.

TRT_Sam · October 25, 2010, 1:35pm

Ok. I got it. The only way to have coalesced access using 1-byte word is to group them into a structure :

struct align((4)) u8_4 { // 32 bit

uint8_t a,b,c,d;

};

So, you can access 4-bytes word and respect the fisrt requirement of the section G.3.2.1 of the programming guide. I made it and I have 8192 coalesced store instead of 327680 uncoalesced stores and got a x2 speed up on this kernel.

@ insmvb00, thanks for your advices and explanations.

TRT_Sam · October 25, 2010, 1:35pm

Ok. I got it. The only way to have coalesced access using 1-byte word is to group them into a structure :

struct align((4)) u8_4 { // 32 bit

uint8_t a,b,c,d;

};

So, you can access 4-bytes word and respect the fisrt requirement of the section G.3.2.1 of the programming guide. I made it and I have 8192 coalesced store instead of 327680 uncoalesced stores and got a x2 speed up on this kernel.

@ insmvb00, thanks for your advices and explanations.

mkaushik · October 25, 2010, 5:48pm

From the manual, 1.1 devices cannot coalesce accesses when threads access 1byte each. In your example above, the segment size cannot be 64 bytes since the word size is 1byte. And thread 0 accesses byte 0, thread 1 accesses byte 1 (and not 4).

mkaushik · October 25, 2010, 5:48pm

From the manual, 1.1 devices cannot coalesce accesses when threads access 1byte each. In your example above, the segment size cannot be 64 bytes since the word size is 1byte. And thread 0 accesses byte 0, thread 1 accesses byte 1 (and not 4).

insmvb00 · October 25, 2010, 5:56pm

Hi!

The algorithm does not say anything about the segment size when the word size is less than 4 bytes. ok, i’m agree.
But i’m not sure that there is’n coalescing working with words which size is < 4 bytes, i think.

On the other hand, with W=4, the thread 0 accesses to word 0, the thread 1 accesses to word 4, the thread 2 accesses to word 8, …

Regards!

insmvb00 · October 25, 2010, 5:56pm

Hi!

The algorithm does not say anything about the segment size when the word size is less than 4 bytes. ok, i’m agree.
But i’m not sure that there is’n coalescing working with words which size is < 4 bytes, i think.

On the other hand, with W=4, the thread 0 accesses to word 0, the thread 1 accesses to word 4, the thread 2 accesses to word 8, …

Regards!

Topic		Replies	Views
Coalesced Memory access related doubt CUDA Programming and Performance	13	2105	December 9, 2010
Help with kernel CUDA Programming and Performance	6	1603	April 23, 2010
Question about coalesced memory access CUDA Programming and Performance	10	2813	September 24, 2009
How to resolve this Coalescing problem? CUDA Programming and Performance	11	2203	May 28, 2009
Help me about coalescing my program run too slow CUDA Programming and Performance	5	2939	May 14, 2008
help me! coalesced access CUDA Programming and Performance	15	7059	October 27, 2010
question in the sample code (simpleStream.cu) CUDA Programming and Performance	3	3826	November 26, 2007
Coalesced read/write memory details More informations about coalesced memory CUDA Programming and Performance	10	12452	March 2, 2008
how do coalesced writes work CUDA Programming and Performance	9	5040	April 2, 2007
Need help on non-coalesced access CUDA Programming and Performance	0	1134	May 9, 2009

Coalescing

Related topics