Coalescing

Hello!

I am in front of a trouble of coalescing in a simple kernel that can be resume as follow :

global void Kernel(unsigned char* OP,int W, int H)
{
unsigned int x = blockIdx.xblockDim.x + threadIdx.x;
unsigned int y = blockIdx.y
blockDim.y + threadIdx.y;

if (x<W && y<H)
{
    float OP_val;

        // Process

    OP[(W*y)+x] = (unsigned char)OP_val;
}

}

For some reasons I cannot give more details about the processing part but there no memory access in this part but texture hits.
Anyway, I do not understand why the writes are not coalesced. Reading the programming guide, it should be…
Does somebody know why? Can it come from the unsigned char cast?
Thanks in advance.

Sam

Hello!

I am in front of a trouble of coalescing in a simple kernel that can be resume as follow :

global void Kernel(unsigned char* OP,int W, int H)
{
unsigned int x = blockIdx.xblockDim.x + threadIdx.x;
unsigned int y = blockIdx.y
blockDim.y + threadIdx.y;

if (x<W && y<H)
{
    float OP_val;

        // Process

    OP[(W*y)+x] = (unsigned char)OP_val;
}

}

For some reasons I cannot give more details about the processing part but there no memory access in this part but texture hits.
Anyway, I do not understand why the writes are not coalesced. Reading the programming guide, it should be…
Does somebody know why? Can it come from the unsigned char cast?
Thanks in advance.

Sam

What is the compute capability of your GPU? How is OP aligned? What are the values of W, H, and blocksize?

What is the compute capability of your GPU? How is OP aligned? What are the values of W, H, and blocksize?

Thanks for your quick answer.

I am using a GTS250 (capability 1.1). The image size (WxH) is 1280x1024 and I tried different block sizes (8x8, 16x16, 32x8, 256x1 …). OP is allocated with cudaMalloc which is implicitely aligned isn’t it?

Thanks for your quick answer.

I am using a GTS250 (capability 1.1). The image size (WxH) is 1280x1024 and I tried different block sizes (8x8, 16x16, 32x8, 256x1 …). OP is allocated with cudaMalloc which is implicitely aligned isn’t it?

Hi!

if(W>=4) this access will be totally uncoalesced (16tx by halfwarp).

See the algorithm G.3.2.1 of the programming guide.

Regards!

Hi!

if(W>=4) this access will be totally uncoalesced (16tx by halfwarp).

See the algorithm G.3.2.1 of the programming guide.

Regards!

Thanks for your answer!

So, if I got it, accesing “unsigned char” (one byte word) cannot be coalesced cause it must access at least 4-bytes word. Right?

Thanks for your answer!

So, if I got it, accesing “unsigned char” (one byte word) cannot be coalesced cause it must access at least 4-bytes word. Right?

I’m not sure (about the segment size), but as you are accessing to 1byte words, the segment that is taken is of 64 bytes, ok?

well, if your thread 0 of block 0 accesses at word 0 of this segment,

the tread 1 of block 0 accesses at word 4 of this segment,

the thread 15 of block 0 accesses at word 60 if this segment, you see that for this first halfwarp all accesses lies in the same segment! but if W were > 4?

If W > 4 there’ll be uncoalescing and in your case (CC 1.1) 16 tx per halfwarp.

I’m not sure (about the segment size), but as you are accessing to 1byte words, the segment that is taken is of 64 bytes, ok?

well, if your thread 0 of block 0 accesses at word 0 of this segment,

the tread 1 of block 0 accesses at word 4 of this segment,

the thread 15 of block 0 accesses at word 60 if this segment, you see that for this first halfwarp all accesses lies in the same segment! but if W were > 4?

If W > 4 there’ll be uncoalescing and in your case (CC 1.1) 16 tx per halfwarp.

Ok. I got it. The only way to have coalesced access using 1-byte word is to group them into a structure :

struct align((4)) u8_4 { // 32 bit

uint8_t a,b,c,d;

};

So, you can access 4-bytes word and respect the fisrt requirement of the section G.3.2.1 of the programming guide. I made it and I have 8192 coalesced store instead of 327680 uncoalesced stores and got a x2 speed up on this kernel.

@ insmvb00, thanks for your advices and explanations.

Ok. I got it. The only way to have coalesced access using 1-byte word is to group them into a structure :

struct align((4)) u8_4 { // 32 bit

uint8_t a,b,c,d;

};

So, you can access 4-bytes word and respect the fisrt requirement of the section G.3.2.1 of the programming guide. I made it and I have 8192 coalesced store instead of 327680 uncoalesced stores and got a x2 speed up on this kernel.

@ insmvb00, thanks for your advices and explanations.

From the manual, 1.1 devices cannot coalesce accesses when threads access 1byte each. In your example above, the segment size cannot be 64 bytes since the word size is 1byte. And thread 0 accesses byte 0, thread 1 accesses byte 1 (and not 4).

From the manual, 1.1 devices cannot coalesce accesses when threads access 1byte each. In your example above, the segment size cannot be 64 bytes since the word size is 1byte. And thread 0 accesses byte 0, thread 1 accesses byte 1 (and not 4).

Hi!

The algorithm does not say anything about the segment size when the word size is less than 4 bytes. ok, i’m agree.
But i’m not sure that there is’n coalescing working with words which size is < 4 bytes, i think.

On the other hand, with W=4, the thread 0 accesses to word 0, the thread 1 accesses to word 4, the thread 2 accesses to word 8, …

Regards!

Hi!

The algorithm does not say anything about the segment size when the word size is less than 4 bytes. ok, i’m agree.
But i’m not sure that there is’n coalescing working with words which size is < 4 bytes, i think.

On the other hand, with W=4, the thread 0 accesses to word 0, the thread 1 accesses to word 4, the thread 2 accesses to word 8, …

Regards!