Our scenario is this: we have a list of objects, a time period divided into steps, and a number of simulations to run.
e.g. 1,000,000 objects, 5 steps and 2,000,000 simulations.
The loops look like this:
for (int i = 0; i < NumberOfSimulations; i++)
{
    for (int j = 0; j < NumberOfSteps; j++)
    {
        for (int k = 0; k < NumberOfObjects; k++)
        {
            // calculate the result for each object and aggregate it
            // by adding it to a running total
        }
        // the total aggregated over all objects by the inner loop is saved
        // to an array, one entry per step per simulation
    }
}
Using CUDA, we can remove the innermost loop over the objects and instead use CUDA threads to iterate through the list of objects, so that each thread calculates the results for some subset of the objects for each step of each simulation.
As the result we need for each step of each simulation is the total over all objects, we need a way to aggregate the results from different threads. We created a results array whose size is the number of simulations multiplied by the number of steps.
At the moment we are using an atomic operation to accumulate the results from threads within the same block into shared memory, and another atomic operation to write the data from shared memory to global memory, but we are running into problems with atomic operations on doubles.
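(For context, the only emulation of atomicAdd for doubles we know of is the atomicCAS loop from the CUDA programming guide, sketched below; it needs 64-bit atomicCAS, which on pre-Fermi hardware is only available for global memory, not shared memory:)

__device__ double atomicAddDouble(double *address, double val)
{
    // reinterpret the double as a 64-bit integer so atomicCAS can operate on it
    unsigned long long int *address_as_ull = (unsigned long long int *)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old); // retry if another thread updated the value meanwhile
    return __longlong_as_double(old);
}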
If we try to use a reduction algorithm instead, how could we create the storage space for each thread? (Given the size of the results array, it seems impossible to create one copy per thread.)
And how could we synchronize threads across different blocks?
Use separate kernel launches for the computation and prefix-sum or reduction phases of the calculation. So your algorithm becomes something like
for each simulation
    for each timestep
        1. Run a kernel to compute the partial solution of each active object in the computation domain
        2. Run a kernel to reduce the partial solution
You can repeat 1 and 2 as many times as necessary to get the final solution within a given timestep. Kernel launches only take a few tens of microseconds on the platforms I use CUDA on, so there is not much penalty in running many small kernels in place of a single large kernel that tries to do everything in a single call.
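To make that concrete, here is a minimal sketch of the two-kernel pattern, assuming a hypothetical per-object function computeResult, one accumulator per (simulation, step) pair, and a power-of-two block size:

// stand-in for the real per-object calculation
__device__ double computeResult(double object)
{
    return object;
}

// Kernel 1: each block reduces its share of the objects to one partial sum
__global__ void partialSums(const double *objects, int numObjects, double *blockSums)
{
    extern __shared__ double sdata[];
    int tid = threadIdx.x;
    double sum = 0.0;
    // grid-stride loop so a fixed-size grid can cover any number of objects
    for (int i = blockIdx.x * blockDim.x + tid; i < numObjects; i += gridDim.x * blockDim.x)
        sum += computeResult(objects[i]);
    sdata[tid] = sum;
    __syncthreads();
    // shared-memory tree reduction within the block (blockDim.x must be a power of two)
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0)
        blockSums[blockIdx.x] = sdata[0];
}

// Kernel 2: a single block reduces the per-block partials to the final total
__global__ void reducePartials(const double *blockSums, int numBlocks, double *result)
{
    extern __shared__ double sdata[];
    int tid = threadIdx.x;
    double sum = 0.0;
    for (int i = tid; i < numBlocks; i += blockDim.x)
        sum += blockSums[i];
    sdata[tid] = sum;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0)
        *result = sdata[0]; // one entry of the simulations x steps results array
}

// host side, per simulation i and step j (sketch):
// partialSums<<<blocks, threads, threads * sizeof(double)>>>(d_objects, numObjects, d_blockSums);
// reducePartials<<<1, threads, threads * sizeof(double)>>>(d_blockSums, blocks, &d_results[i * NumberOfSteps + j]);

Note that this also answers the question about synchronizing across blocks: the grid-wide synchronization happens implicitly at the kernel boundary, since kernel 2 cannot start until every block of kernel 1 has finished.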
No problem. The advice you gave has already helped us a lot.
Another question, if you do not mind.
How could we allocate memory within the kernel?
When we declare variables in a kernel, are they stored in GPU global memory?
We need to declare a double array within the kernel. As the number of elements is dynamic, we are planning to use pointers and allocate memory for it dynamically.
The basic answer is that you can’t. Fermi does/will have support for the C++ new operator, but that isn’t what you want, from what I can gather.
No, they go in either registers or local memory.
I wouldn’t recommend using pointers. Use array indexing instead (for one thing, indices are portable between host and device; pointers are not). You can also leverage textures for lookup a lot more easily with indices than with pointers. I generally allocate a static scratch space when the output size of a kernel isn’t known a priori: give each thread or block a maximum number of outputs per kernel launch. If the limit is exceeded, have the block stop; then, once the kernel has completed, process the scratch into a partial result set (things like stream compaction and prefix sums are very handy for this), then rinse and repeat until the total input space has been covered.
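A minimal sketch of that scratch-space pattern, with hypothetical names (produceOutput stands in for whatever generates your variable-length results); each thread owns a fixed slice of a preallocated buffer plus a count the host can prefix-sum to compact the output:

// stand-in for the real calculation: given a candidate index, maybe emit a result
__device__ bool produceOutput(const double *input, int candidate, double *out)
{
    *out = input[candidate] * 2.0;
    return input[candidate] > 0.0; // pretend only some candidates produce output
}

__global__ void scatterToScratch(const double *input, int candidatesPerThread,
                                 double *scratch, int *counts, int *resumeAt,
                                 int maxOutputsPerThread)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int begin = tid * candidatesPerThread;  // this thread's input range
    int written = 0;
    int k = resumeAt[tid];                  // zero-initialized by the host; where we stopped last launch
    for (; k < candidatesPerThread && written < maxOutputsPerThread; ++k) {
        double r;
        if (produceOutput(input, begin + k, &r))
            scratch[tid * maxOutputsPerThread + written++] = r; // private slice, no atomics needed
    }
    counts[tid] = written;  // host prefix-sums these to compact the scratch
    resumeAt[tid] = k;      // relaunch until every thread reaches candidatesPerThread
}

After each launch the host (or another kernel) compacts the scratch into the partial result set using the counts, then relaunches until resumeAt shows every thread has covered its input range.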
I am not sure I understand what you are trying to ask. If you declare a variable in a kernel, it has either thread scope or block scope. If it has thread scope it will be stored in registers or local memory. If it has block scope it will be stored in shared memory. Anything in constant or global memory is declared at context scope, outside of any kernel.
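To illustrate (a sketch with arbitrary names, assuming a block size of at most 128 threads):

__device__   double d_scale = 1.0;   // context scope: lives in global memory
__constant__ double d_rate  = 0.05;  // context scope: lives in constant memory

__global__ void scopeExample(double *out)
{
    double x = d_scale * d_rate;     // thread scope: a register (or local memory if spilled)
    double history[16];              // thread scope: per-thread arrays are often placed in local memory
    __shared__ double tile[128];     // block scope: shared memory, visible to the whole block

    int tid = threadIdx.x;
    history[0] = x;
    tile[tid] = history[0];
    __syncthreads();
    out[blockIdx.x * blockDim.x + tid] = tile[tid];
}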
It isn’t clear to me what you want to see, or what you don’t understand. Indexing into a chunk of dynamically allocated memory is functionally identical to doing pointer arithmetic in C. Data structures which would otherwise use pointers can be made to use array indices instead. Indices have another advantage in CUDA on 64-bit platforms: an unsigned integer is 4 bytes, whereas a pointer is 8 bytes, so there are both memory and register savings to be had from using indexing rather than pointers.
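As a sketch of what that looks like (hypothetical types; a linked structure stored in a flat array and chained by indices instead of pointers):

#define END_OF_LIST 0xFFFFFFFFu

struct Node {
    float value;
    unsigned int next; // index of the next node in the array, END_OF_LIST at the tail
};

// walking the list via indices works identically on host and device;
// the same Node array can be copied between the two without fixing anything up
__host__ __device__ float sumList(const Node *nodes, unsigned int head)
{
    float total = 0.0f;
    for (unsigned int i = head; i != END_OF_LIST; i = nodes[i].next)
        total += nodes[i].value;
    return total;
}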