Now I’m working on my own code, and I just want to start simple.
I’ve got 4 nested loops that I want to run on the GPU.
Here is the CPU version:
for (int dy = 0; dy < dataH - data2H; dy++)
{
    for (int dx = 0; dx < dataW - data2W; dx++)
    {
        for (int j = 0; j < data2H; j++)
        {
            for (int i = 0; i < data2W; i++)
            {
                h_out[dy*corrW + dx] += h_data1[dy*dataW + dx + j*dataW + i] * h_data2[j*data2W + i];
            }
        }
    }
}
Here, data2H = data2W = 100.
I made a simple kernel, but it doesn’t work; I think different threads are reading and writing the same memory at the same time, so I get more or less the right result, but not the correct one.
I was thinking of using atomic addition and shared memory. Is that the best way to do it?
And about shared memory: how many pixels can I put in it?
I have a Quadro 4000 (48 KB of shared memory per multiprocessor, and 8 multiprocessors).
To use atomicAdd(float*, float), I have to specify “-arch=sm_20” in my additional options, right?
I still get “error : identifier “atomicAdd” is undefined”, though.
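For reference, the compile line would look roughly like this (file names are placeholders); the “atomicAdd is undefined” error usually means the architecture flag isn’t actually reaching the compiler:

```shell
# Hypothetical compile line -- file and output names are placeholders.
# atomicAdd(float*, float) requires compute capability 2.0 or higher,
# so the architecture flag must be sm_20 (or above):
nvcc -arch=sm_20 -O2 -o corr corr.cu
```

If the flag is set in the IDE's "additional options" but the error persists, check that it is applied to the .cu file's actual build rule and not to a different configuration.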
If you have good advice for the implementation in CUDA, please tell me.
Here is my first try; I followed your advice. Thank you!
__global__ void corrGPU2(float* dev_data1, float* dev_data2, float* dev_corr,
                         int dataW, int data2W, int data2H, int corrW, const int N)
{
    int offset = threadIdx.x + blockIdx.x * blockDim.x;
    // Decompose the flat output index into a (row, col) pair, then
    // rebuild the corresponding index into the wider input image.
    int row = offset / corrW;
    int col = offset - row * corrW;
    int offset2 = col + row * dataW;
    if (offset < N)
    {
        for (int j = 0; j < data2H; j++)
        {
            for (int i = 0; i < data2W; i++)
            {
                dev_corr[offset] += dev_data1[offset2 + j*dataW + i] * dev_data2[j*data2W + i];
            }
        }
    }
}
So, first I compute my offset as usual, then I derive a second offset from it so that I read dev_data1 at the right location.
Then the two inner loops accumulate the product directly into the result array.
I tried to put dev_data2 into constant memory, and I was expecting a big improvement, since every thread reads the same dev_data2 (size: 4900), but I only went from 21 ms to 20 ms.
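For what it’s worth, putting the template into constant memory looks roughly like this (a sketch, names assumed; 4900 floats is about 19 KB, well under the 64 KB constant-memory limit):

```cuda
// Sketch only: template stored in __constant__ memory.
__constant__ float c_data2[4900];

// Host side, before launching the kernel:
//   cudaMemcpyToSymbol(c_data2, h_data2, 4900 * sizeof(float));

__global__ void corrGPU_const(float* dev_data1, float* dev_corr,
                              int dataW, int data2W, int data2H,
                              int corrW, const int N)
{
    int offset = threadIdx.x + blockIdx.x * blockDim.x;
    int row = offset / corrW;
    int col = offset - row * corrW;
    int offset2 = col + row * dataW;
    if (offset < N)
    {
        for (int j = 0; j < data2H; j++)
            for (int i = 0; i < data2W; i++)
                dev_corr[offset] += dev_data1[offset2 + j*dataW + i]
                                  * c_data2[j*data2W + i];
    }
}
```

Constant memory is broadcast to all threads of a warp reading the same address, which is exactly the access pattern here; on compute 2.x the hardware caches also help, which may explain the small difference.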
What should I do to make it faster? Make the reads coalesced / contiguous?
EDIT: using an intermediate variable for the sum, and only writing to the output array at the end, the time drops to 14 ms!
A local variable is unique to each thread, but is it as fast as one in shared memory? I tried a “__shared__ float cache[threadPerBlock]” array to store each thread’s result, but it’s slower…
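The intermediate-variable version would look something like this (a sketch of the idea, not the exact code):

```cuda
// Sketch: accumulate into a per-thread register and write to global
// memory once at the end, instead of a read-modify-write per step.
__global__ void corrGPU3(float* dev_data1, float* dev_data2, float* dev_corr,
                         int dataW, int data2W, int data2H,
                         int corrW, const int N)
{
    int offset = threadIdx.x + blockIdx.x * blockDim.x;
    int row = offset / corrW;
    int col = offset - row * corrW;
    int offset2 = col + row * dataW;
    if (offset < N)
    {
        float sum = 0.0f;               // lives in a register
        for (int j = 0; j < data2H; j++)
            for (int i = 0; i < data2W; i++)
                sum += dev_data1[offset2 + j*dataW + i]
                     * dev_data2[j*data2W + i];
        dev_corr[offset] = sum;         // single global write
    }
}
```

A register is generally at least as fast as shared memory; shared memory only pays off when several threads need the same data, which would explain why the per-thread cache array was slower.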
If dev_data1 and dev_data2 are both in global memory, I think you would want to ensure coalesced memory access throughout, especially when targeting compute 1.x devices.
To me it seems your kernel is entirely memory-bandwidth bound, so making use of the available peak memory bandwidth and minimizing repeated memory accesses will win the game.
Yes, they’re both in global memory (except when I tried constant memory for dev_data2), but I don’t know how to ensure coalesced memory access… (PS: I have a compute 2.1 device, but still…)
How should I do it? Is there a method or a really simple example?
I understand the principle of coalesced memory access, but actually ensuring it is another thing.
On compute 1.x you’re only accessing dev_data1 coalesced when offset2 + j*dataW is a multiple of 16.
If you can’t guarantee this alignment, preload aligned chunks of dev_data1 into a shared memory array; then you can access the shared memory inside the for loops.
Compute 2.x has relaxed alignment requirements, so you’re better off when accessing unaligned memory locations: the hardware will break the transaction apart into two coalesced memory transactions.
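A sketch of that preload pattern (deliberately simplified: it assumes corrW is a multiple of blockDim.x so a block never straddles two output rows, and data2W = 100; boundary handling is omitted, and all names beyond the original kernel’s are made up):

```cuda
// Simplified sketch of staging rows of dev_data1 through shared memory.
// Each block's threads cooperatively copy the stretch of the input row
// they all need, then read it from shared memory in the inner loop.
#define THREADS 256
#define TILE (THREADS + 100 - 1)   // block's columns + template width - 1

__global__ void corrGPU_shared(float* dev_data1, float* dev_data2,
                               float* dev_corr, int dataW, int data2W,
                               int data2H, int corrW)
{
    __shared__ float row_cache[TILE];

    int offset = threadIdx.x + blockIdx.x * blockDim.x;
    int row = offset / corrW;
    int col = offset - row * corrW;
    int col0 = col - threadIdx.x;          // first column of this block

    float sum = 0.0f;
    for (int j = 0; j < data2H; j++)
    {
        // Cooperative, coalesced load: consecutive threads read
        // consecutive global addresses into shared memory.
        int base = (row + j) * dataW + col0;
        for (int k = threadIdx.x; k < TILE; k += blockDim.x)
            row_cache[k] = dev_data1[base + k];
        __syncthreads();

        for (int i = 0; i < data2W; i++)
            sum += row_cache[threadIdx.x + i] * dev_data2[j * data2W + i];
        __syncthreads();
    }
    dev_corr[offset] = sum;
}
```

The point of the staging step is that each input value is fetched from global memory once per block instead of once per thread, and the fetches are contiguous, so they coalesce regardless of where each thread’s window starts.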
What do you mean by “preload aligned chunks of dev_data1 into a shared memory array”?
I’ve already tried loading it into a shared memory array, but what do you mean by aligned chunks?
Thank you anyway! :)
EDIT: I ran my code in Release mode and now the time is 3.3 ms.
EDIT: Release mode + dev_data2 in constant memory: 2.9 ms.
Do you think page-locked memory could really change something?
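Page-locked (pinned) host memory mainly speeds up host↔device transfers, not the kernel itself, so it would only help if the cudaMemcpy time matters in your measurement. A sketch (variable names assumed from the CPU code):

```cuda
// Sketch: allocate page-locked host memory with cudaHostAlloc so that
// cudaMemcpy between host and device can run at full PCIe bandwidth.
// This affects transfer time only, not kernel execution time.
float *h_data1;
size_t bytes = dataW * dataH * sizeof(float);   // sizes as in the host code

cudaHostAlloc((void**)&h_data1, bytes, cudaHostAllocDefault);
// ... fill h_data1, cudaMemcpy it to the device, run the kernel ...
cudaFreeHost(h_data1);
```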