Basic parallel programming: I need some help

Hello everyone.

Now I’m working on my own code, and I just want to start simple.

I’ve got 4 nested loops that I want to run on the GPU.

Here is the CPU version:

for (int dy = 0; dy < dataH - data2H; dy++)
{
	for (int dx = 0; dx < dataW - data2W; dx++)
	{
		for (int j = 0; j < data2H; j++)
		{
			for (int i = 0; i < data2W; i++)
			{
				h_out[dy*corrW + dx] += h_data1[dy*dataW + dx + j*dataW + i] * h_data2[j*data2W + i];
			}
		}
	}
}

Here, data2H = data2W = 100.

I made a simple kernel, which doesn’t work because (I think) threads are reading/writing the same memory at the same time (so I get more or less the right result, but not the correct one).

I was thinking of using atomic addition and shared memory. Is that the best way to do it?

And about shared memory, how many pixels can I fit in it?

I have a Quadro 4000 (48 KB of shared memory per multiprocessor, and 8 multiprocessors).
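
(Naively, 48 KB ÷ 4 bytes per float is 12,288 pixels, i.e. roughly a 110×110 tile of float pixels per multiprocessor, if I count correctly.)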

Thanks in advance!

To use atomicAdd(float*, float) I have to specify “-arch=sm_20” in my additional options, right?
I still get “error: identifier "atomicAdd" is undefined”, though.
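
(From the programming guide, atomicAdd(float*, float) only exists on compute capability 2.0 and higher, so I assume the flag really is required. For reference, this is the kind of call I mean; just a sketch with made-up names:)

[code]
// Sketch only: compile with -arch=sm_20 or higher, since
// atomicAdd(float*, float) is undefined on compute capability 1.x targets.
__global__ void accumulate(float* out, const float* in, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        atomicAdd(out, in[idx]);   // many threads add into the same cell
}
[/code]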

If you have any good advice for the CUDA implementation, please tell me.

Hi,

I think the 1st key to splitting that up well is here:

[code]

h_out[dy*corrW + dx] += …

[/code]

If each element of h_out is only updated by a single thread, then there is no need to use atomicAdd.

(Minimise the use of atomics: they are sometimes essential, but always come at a price.)

So I would be looking to have 1 thread per element of h_out (e.g. dx blocks each of dy threads)

and each thread doing the 2 inner loops; each thread can then sum into a local variable and write the final total to h_out.
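
A rough, untested sketch of that layout, with names borrowed from your code (one thread per element of h_out, one block row per output row):

[code]
__global__ void corrSketch(const float* data1, const float* data2, float* out,
                           int dataW, int data2W, int data2H, int corrW)
{
    int dx = blockIdx.x * blockDim.x + threadIdx.x;   // output column
    int dy = blockIdx.y;                              // output row

    if (dx < corrW)
    {
        float sum = 0.0f;                             // per-thread accumulator
        for (int j = 0; j < data2H; j++)
            for (int i = 0; i < data2W; i++)
                sum += data1[(dy + j)*dataW + dx + i] * data2[j*data2W + i];
        out[dy*corrW + dx] = sum;                     // single write, no atomics
    }
}
[/code]

Launched e.g. as corrSketch<<<dim3((corrW + 255)/256, corrH), 256>>>(…), where corrH is the output height (dataH - data2H in your CPU loop).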

Also look to see if threads of the same warp read adjacent cells from h_data1 (or h_data2) at the same time;

if you can design the code so that they do, that will give you ‘contiguous’ (coalesced) reads from device memory (see the manual).

Also look to see whether threads in the same block will need the same cell of h_data1 or h_data2 more than once;

if they do, those cells are candidates for shared memory.

NB you may not need to use shared memory at all.

PS: there may be more efficient ways of splitting the work up; those are just some quick ideas.

Hi,

here is my first try; I followed your advice. Thank you!

__global__ void corrGPU2( float* dev_data1, float* dev_data2, float* dev_corr, int dataW, int data2W, int data2H, int corrW, const int N) {

	int offset = threadIdx.x + blockIdx.x * blockDim.x;

	// Map the flat output index to (row, col) in the corrW-wide output,
	// then to the matching top-left corner in the dataW-wide dev_data1.
	int row = offset / corrW;
	int col = offset - row * corrW;
	int offset2 = col + row * dataW;

	if (offset < N)
	{
		for (int j = 0; j < data2H; j++)
		{
			for (int i = 0; i < data2W; i++)
			{
				dev_corr[offset] += dev_data1[offset2 + j*dataW + i] * dev_data2[j*data2W + i];
			}
		}
	}
}

So, first I compute my offset as usual, then I make a second offset based on the first one, just to read dev_data1 as I should.

Then the 2 inner loops and the product directly into the result array.

I tried to put dev_data2 into constant memory, and I was expecting great improvements, as every thread reads the same dev_data2 (size: 4900 floats), but I only went from 21 ms to 20 ms.
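
For reference, this is roughly how I put it there (a sketch; the symbol name is made up, and 4900 floats is about 19 KB, well under the 64 KB constant memory limit):

[code]
// Sketch: the template lives in constant memory, which is cached and
// broadcast when all threads of a warp read the same address, as the
// inner loops here do (every thread reads dev_data2[j*data2W + i] in
// lockstep).
__constant__ float c_data2[4900];

// Host side, before the kernel launch:
cudaMemcpyToSymbol(c_data2, h_data2, 4900 * sizeof(float));
[/code]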

What should I do to make it faster? How do I make the reads coalesced/contiguous?

EDIT: using an intermediate variable for the sum, and only writing to the output array at the end, the time drops to 14 ms!
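
Concretely, the change looks like this (sketch of the loop body only):

[code]
float sum = 0.0f;                         // per-thread register accumulator
for (int j = 0; j < data2H; j++)
    for (int i = 0; i < data2W; i++)
        sum += dev_data1[offset2 + j*dataW + i] * dev_data2[j*data2W + i];
dev_corr[offset] = sum;                   // one global write instead of 10,000
[/code]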

A variable is unique to each thread, but is it as fast as one in shared memory? I tried a “__shared__ float cache[threadPerBlock]” array to save the result for each thread, but it’s slower…

If dev_data1 and dev_data2 are both in global memory I think you would want to ensure that you’re doing coalesced memory access all the time - especially when targeting Compute 1.x devices.

To me it seems your kernel is entirely memory bandwidth bound, so making use of the available peak memory bandwidth and minimizing repeated memory access will win the game.
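
As a rough count with your sizes (data2H = data2W = 100): each output element does 10,000 multiply-adds but, written naively, issues 20,000 global float reads (80 KB) to do them, and almost all of those reads are repeats of data that other threads also need. So every value served from a cache, constant memory, or shared memory instead of DRAM is a direct win.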

Yes, they’re both in global memory (except when I tried constant memory for dev_data2), but I don’t know how to ensure coalesced memory access… (PS: I have a Compute 2.1 device, but still…)
What should I do? Is there a method, or a really simple example?

I understood the principle of coalesced memory access, but actually ensuring it is another matter.

On compute 1.x you’re only accessing dev_data1 coalesced when offset2 + j*dataW is a multiple of 16.

If you can’t ensure this alignment, preload aligned chunks of dev_data1 into a shared memory array. Then you can access your shared memory inside the for loops.

Compute 2.x has relaxed alignment requirements, so you’re better off when accessing unaligned memory locations. The hardware will break the transaction apart into two coalesced memory transactions.
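
To illustrate, here is a rough, untested sketch of a cooperative, aligned load into shared memory. It stages a row of dev_data2 rather than dev_data1, but the mechanics are the same; launch it with data2W * sizeof(float) bytes of dynamic shared memory:

[code]
// Sketch: each block stages one template row in shared memory per
// j-iteration; consecutive threads load consecutive addresses, so the
// loads are coalesced.
__global__ void corrShared(const float* data1, const float* data2, float* out,
                           int dataW, int data2W, int data2H, int corrW, int N)
{
    extern __shared__ float s_row[];              // data2W floats
    int offset = blockIdx.x * blockDim.x + threadIdx.x;
    int row = offset / corrW;
    int col = offset - row * corrW;
    int offset2 = col + row * dataW;

    float sum = 0.0f;
    for (int j = 0; j < data2H; j++)
    {
        for (int i = threadIdx.x; i < data2W; i += blockDim.x)
            s_row[i] = data2[j*data2W + i];       // coalesced cooperative load
        __syncthreads();                          // row fully loaded

        if (offset < N)
            for (int i = 0; i < data2W; i++)
                sum += data1[offset2 + j*dataW + i] * s_row[i];
        __syncthreads();                          // don't overwrite s_row early
    }
    if (offset < N)
        out[offset] = sum;
}
[/code]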

I’m working on Compute 2.1, so what more can I do?

What do you mean by “preload aligned chunks of dev_data1 into a shared memory array”?
I’ve already tried to load it into a shared memory array, but what do you mean by aligned chunks?

Thank you anyway! :)

EDIT: I ran my code in Release mode and now the time is 3.3 ms.
EDIT: Release mode + dev_data2 in constant memory: 2.9 ms.
Do you think page-locked memory could really change anything?
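
(If I read the docs right, page-locked memory only speeds up the cudaMemcpy transfers, not the kernel itself. A sketch of the allocation I mean:)

[code]
// Sketch: page-locked (pinned) host allocation. Transfers from pinned
// memory are faster and can overlap with kernels via streams, but the
// kernel execution time is unchanged.
float* h_data1 = 0;
cudaMallocHost((void**)&h_data1, dataW * dataH * sizeof(float));
// ... cudaMemcpy(dev_data1, h_data1, ..., cudaMemcpyHostToDevice); ...
cudaFreeHost(h_data1);
[/code]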
