I am in front of a trouble of coalescing in a simple kernel that can be resume as follow :
global void Kernel(unsigned char* OP,int W, int H)
{
unsigned int x = blockIdx.xblockDim.x + threadIdx.x;
unsigned int y = blockIdx.yblockDim.y + threadIdx.y;
if (x<W && y<H)
{
float OP_val;
// Process
OP[(W*y)+x] = (unsigned char)OP_val;
}
}
For some reasons I cannot give more details about the processing part but there no memory access in this part but texture hits.
Anyway, I do not understand why the writes are not coalesced. Reading the programming guide, it should be…
Does somebody know why? Can it come from the unsigned char cast?
Thanks in advance.
I am in front of a trouble of coalescing in a simple kernel that can be resume as follow :
global void Kernel(unsigned char* OP,int W, int H)
{
unsigned int x = blockIdx.xblockDim.x + threadIdx.x;
unsigned int y = blockIdx.yblockDim.y + threadIdx.y;
if (x<W && y<H)
{
float OP_val;
// Process
OP[(W*y)+x] = (unsigned char)OP_val;
}
}
For some reasons I cannot give more details about the processing part but there no memory access in this part but texture hits.
Anyway, I do not understand why the writes are not coalesced. Reading the programming guide, it should be…
Does somebody know why? Can it come from the unsigned char cast?
Thanks in advance.
I am using a GTS250 (capability 1.1). The image size (WxH) is 1280x1024 and I tried different block sizes (8x8, 16x16, 32x8, 256x1 …). OP is allocated with cudaMalloc which is implicitely aligned isn’t it?
I am using a GTS250 (capability 1.1). The image size (WxH) is 1280x1024 and I tried different block sizes (8x8, 16x16, 32x8, 256x1 …). OP is allocated with cudaMalloc which is implicitely aligned isn’t it?
I’m not sure (about the segment size), but as you are accessing to 1byte words, the segment that is taken is of 64 bytes, ok?
well, if your thread 0 of block 0 accesses at word 0 of this segment,
the tread 1 of block 0 accesses at word 4 of this segment,
…
the thread 15 of block 0 accesses at word 60 if this segment, you see that for this first halfwarp all accesses lies in the same segment! but if W were > 4?
If W > 4 there’ll be uncoalescing and in your case (CC 1.1) 16 tx per halfwarp.
I’m not sure (about the segment size), but as you are accessing to 1byte words, the segment that is taken is of 64 bytes, ok?
well, if your thread 0 of block 0 accesses at word 0 of this segment,
the tread 1 of block 0 accesses at word 4 of this segment,
…
the thread 15 of block 0 accesses at word 60 if this segment, you see that for this first halfwarp all accesses lies in the same segment! but if W were > 4?
If W > 4 there’ll be uncoalescing and in your case (CC 1.1) 16 tx per halfwarp.
Ok. I got it. The only way to have coalesced access using 1-byte word is to group them into a structure :
struct align((4)) u8_4 { // 32 bit
uint8_t a,b,c,d;
};
So, you can access 4-bytes word and respect the fisrt requirement of the section G.3.2.1 of the programming guide. I made it and I have 8192 coalesced store instead of 327680 uncoalesced stores and got a x2 speed up on this kernel.
@ insmvb00, thanks for your advices and explanations.
Ok. I got it. The only way to have coalesced access using 1-byte word is to group them into a structure :
struct align((4)) u8_4 { // 32 bit
uint8_t a,b,c,d;
};
So, you can access 4-bytes word and respect the fisrt requirement of the section G.3.2.1 of the programming guide. I made it and I have 8192 coalesced store instead of 327680 uncoalesced stores and got a x2 speed up on this kernel.
@ insmvb00, thanks for your advices and explanations.
From the manual, 1.1 devices cannot coalesce accesses when threads access 1byte each. In your example above, the segment size cannot be 64 bytes since the word size is 1byte. And thread 0 accesses byte 0, thread 1 accesses byte 1 (and not 4).
From the manual, 1.1 devices cannot coalesce accesses when threads access 1byte each. In your example above, the segment size cannot be 64 bytes since the word size is 1byte. And thread 0 accesses byte 0, thread 1 accesses byte 1 (and not 4).
The algorithm does not say anything about the segment size when the word size is less than 4 bytes. ok, i’m agree.
But i’m not sure that there is’n coalescing working with words which size is < 4 bytes, i think.
On the other hand, with W=4, the thread 0 accesses to word 0, the thread 1 accesses to word 4, the thread 2 accesses to word 8, …
The algorithm does not say anything about the segment size when the word size is less than 4 bytes. ok, i’m agree.
But i’m not sure that there is’n coalescing working with words which size is < 4 bytes, i think.
On the other hand, with W=4, the thread 0 accesses to word 0, the thread 1 accesses to word 4, the thread 2 accesses to word 8, …