As the topic says, I have to use the type “short” because the memory on my graphics card is limited. The question is: can I still get coalesced reads and writes when I use “short”? Does anyone have any ideas?
If it is necessary, I can upload some code for analysis.
Thank you guys!
I’ve even coalesced chars. Just read them to shared memory as ints, and then access the bytes (or shorts) in shared memory. Pointer type casting is your friend.
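To make the suggestion above concrete, here is a minimal sketch of the idea, not anyone's actual code from this thread. The kernel name `addOffsetColumns`, the constant `TILE`, and the host helper `runAddOffsetDemo` are all made up for illustration. It assumes the image width is a multiple of `TILE` and that the buffers come from `cudaMalloc` (so every row tile starts on a 4-byte boundary). Each thread owns one column of shorts, but global memory is only touched through 32-bit words:

```cuda
#include <vector>
#include <cuda_runtime.h>

#define TILE 256  // threads per block = shorts per row tile (hypothetical choice, must be even)

// Illustrative kernel: each thread owns one column and walks down the rows,
// adding `offset` to every pixel. Assumes width % TILE == 0.
__global__ void addOffsetColumns(const short *src, short *dst,
                                 int width, int height, short offset)
{
    __shared__ int tile[TILE / 2];      // TILE shorts stored as TILE/2 ints (512 bytes)
    short *tileS = (short *)tile;       // the same bytes viewed as shorts

    int x0 = blockIdx.x * TILE;         // first column handled by this block
    for (int y = 0; y < height; ++y) {
        const int *rowIn  = (const int *)(src + (size_t)y * width + x0);
        int       *rowOut = (int *)(dst + (size_t)y * width + x0);

        if (threadIdx.x < TILE / 2)
            tile[threadIdx.x] = rowIn[threadIdx.x];    // 32-bit global loads
        __syncthreads();

        // Every thread updates only its own short in shared memory.
        tileS[threadIdx.x] = (short)(tileS[threadIdx.x] + offset);
        __syncthreads();

        if (threadIdx.x < TILE / 2)
            rowOut[threadIdx.x] = tile[threadIdx.x];   // 32-bit global stores
    }
}

// Hypothetical host-side check: runs the kernel on a small image and
// compares against a CPU result. Returns false on any CUDA error or mismatch.
bool runAddOffsetDemo()
{
    const int width = 512, height = 4;
    const short offset = 3;
    std::vector<short> in(width * height), out(width * height);
    for (int i = 0; i < width * height; ++i) in[i] = (short)(i % 1000);

    size_t bytes = in.size() * sizeof(short);
    short *dSrc = 0, *dDst = 0;
    if (cudaMalloc((void **)&dSrc, bytes) != cudaSuccess) return false;
    if (cudaMalloc((void **)&dDst, bytes) != cudaSuccess) { cudaFree(dSrc); return false; }

    cudaMemcpy(dSrc, &in[0], bytes, cudaMemcpyHostToDevice);
    addOffsetColumns<<<width / TILE, TILE>>>(dSrc, dDst, width, height, offset);
    cudaError_t err = cudaMemcpy(&out[0], dDst, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dSrc);
    cudaFree(dDst);
    if (err != cudaSuccess) return false;

    for (size_t i = 0; i < in.size(); ++i)
        if (out[i] != (short)(in[i] + offset)) return false;
    return true;
}
```

Calling `runAddOffsetDemo()` from `main` should return `true` on a working device. Note that only one row tile is staged at a time, so the shared memory footprint is 512 bytes per block regardless of the image height.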
It’s very nice to see your reply!
I was thinking about doing that, but as far as I know shared memory has a 64 KB limit, and I am not sure whether I could fit enough data in.
Shared memory is limited to 16 KB per block on all current architectures.
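If you would rather check the limit on your own card than rely on a quoted number, the runtime API reports it. The following is only a sketch: the helper names `sharedMemPerBlockBytes` and `rowTileFits` are invented for this example, and the second one assumes that a block stages one short per thread at a time rather than a whole slice.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Returns the per-block shared memory limit of device `dev` in bytes,
// or 0 if the properties cannot be queried.
size_t sharedMemPerBlockBytes(int dev)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) return 0;
    return prop.sharedMemPerBlock;
}

// Hypothetical budget check: does a tile holding one short per thread fit?
// For 256 threads this is 256 * sizeof(short) = 512 bytes.
bool rowTileFits(int threadsPerBlock, int dev)
{
    size_t needed = (size_t)threadsPerBlock * sizeof(short);
    return needed <= sharedMemPerBlockBytes(dev);
}
```

For example, `rowTileFits(256, 0)` asks whether a 512-byte row tile fits on device 0, which it does comfortably under a 16 KB limit.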
That is a good way to get coalesced reads and writes when sizeof(type) != 4 bytes (for instance, char or short). I have done several image-processing projects using the same method, and all of them worked correctly.
My original design lets each block process half of a 512×512 slice, where every pixel has the data type “short”, i.e. I use 256 threads per block. Do I have to reduce the number of threads per block to make shared memory work in this case? Half of a 512×512 slice is obviously larger than 64 KB…
I can’t quite follow your description. Could you post more details about your project, so that I or someone else here can understand it clearly and give you some suggestions?
Regards,
[codebox]__global__ void d_recursiveGaussianY(short *d_src, short *d_dest, int depth, int height, int width, float b0, float b1, float b2, float b3, float B, int order, float M11, float M12, float M13, float M21, float M22, float M23, float M31, float M32, float M33)
{
    float wP1 = 0.f, wP2 = 0.f, wP3 = 0.f;
    int y = 0;
    float outF1 = 0.f, outF2 = 0.f, outF3 = 0.f;
    unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;   // column handled by this thread
    unsigned int yy = blockIdx.y*width*height;               // offset of this slice
    if(x >= width)   // skip threads past the last column (valid columns are 0..width-1)
        return;
    d_src += x + yy;
    d_dest += x + yy;
    wP1 = (float)*d_src/sqrt(B); wP2 = wP1; wP3 = wP1;
    // forward pass: walk down the column, one row per iteration
    for(y=0;y<height;y++)
    {
        float xC = (float)*d_src;
        float wC = (float)(xC - b1*wP1 - b2*wP2 - b3*wP3)/b0;
        *d_dest = (short)wC;
        d_src += width; d_dest += width;
        wP3 = wP2; wP2 = wP1; wP1 = wC;
    }
    // step back onto the last row
    d_src -= width;
    d_dest -= width;
.
.
.
.
    *d_dest = (short)out;
    d_src -= width;
    d_dest -= width;
    // backward pass: walk back up the column
    for(y=height-1-1;y>=0;y--)
    {
        float wC = (float)*d_dest;
        out = (float)(B*wC - b1*outF1 - b2*outF2 - b3*outF3)/b0;
        *d_dest = (short)out;
        d_src -= width; d_dest -= width;
        outF3 = outF2; outF2 = outF1; outF1 = out;
    }
}
extern "C" void Call_d_recursiveGaussianY(short *d_src, short *d_dest, int width, int height, int depth, float b0, float b1, float b2, float b3, float B, int order, int n, int nthread, float M11, float M12, float M13, float M21, float M22, float M23, float M31, float M32, float M33)
{
    // grid: n blocks across the columns, one grid row per slice
    d_recursiveGaussianY<<<dim3(n, depth), nthread>>>(d_src, d_dest, depth, height, width, b0, b1, b2, b3, B, order, M11, M12, M13, M21, M22, M23, M31, M32, M33);
}
[/codebox]
Above is the code in the kernel file. I call it from a member function of a class:
[codebox]int CCudaRecursiveGaussian::GaussianFilterY(short *d_src, short *d_dest, int width, int height, int depth, int order, float sigma)
{
    int nthread = 256;
    int n = iDivUp(width, nthread); /* This function has only one line: return (a % b != 0) ? (a / b + 1) : (a / b); */
.
.Calculating the parameters.
.
    Call_d_recursiveGaussianY(d_src, d_dest, width, height, depth, b0, b1, b2, b3, B, order, n, nthread, M11, M12, M13, M21, M22, M23, M31, M32, M33);
    return 0;
}[/codebox]