atomicAdd support for double? Which card?

Hi Avidday,

The scenario is this: we have a list of objects, a time period divided into steps, and a number of simulations to run.

e.g. 1,000,000 objects, 5 steps and 2,000,000 simulations.

The loops look like this:

for(int i=0; i<NumberOfSimulations; i++)
{
	for(int j=0; j<NumberOfSteps; j++)
	{
		for(int k=0; k<NumberOfObjects; k++)
		{
			// code to calculate the result for each object and aggregate it by accumulating a running total
		}
		// for all the objects in this step, the aggregated value from the inner loop is saved to a results array entry for this step of this simulation
	}
}

By using CUDA, we can replace the innermost loop over the objects by using CUDA threads to iterate through the list of objects. Each thread calculates the results for a certain number of objects for each step of each simulation.

Since the result we need for each step of each simulation is the total over all objects, we need a way to aggregate the results from the different threads. We created a result array whose size is the number of simulations multiplied by the number of steps.

At the moment, we are using an atomic operation to accumulate the results from threads within the same block into shared memory, and another atomic operation to add the data from shared memory into global memory, but we are running into the problem that atomicAdd does not support double.

If we try to use a reduction algorithm, how could we create the storage space for each thread? (Given the size of the results array, it seems impossible to create one per thread.)

How could we synchronize threads across different blocks?

Use separate kernel launches for the computation and prefix-sum or reduction phases of the calculation. So your algorithm becomes something like

for each simulation
	for each timestep
		1. Run a kernel to compute the partial solution of each active object in the computation domain
		2. Run a kernel to reduce the partial solution

You can repeat 1 and 2 as many times as necessary to get the final solution within a given timestep. Kernel launches only take a few tens of microseconds on the platforms I use CUDA on, so there is not much penalty in running many small kernels in place of a single large kernel that tries to do everything in a single call.
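
Concretely, a minimal sketch of those two kernels might look like the code below (the kernel names, the objects array and the placeholder per-object calculation are all invented for illustration; both kernels also assume a power-of-two block size and a launch with blockDim.x * sizeof(double) bytes of dynamic shared memory). Step 1 has each block sum its objects in shared memory and write a single double per block, and step 2 sums those per-block values, so no double-precision atomicAdd is needed.

__global__ void computePartials(const double *objects, double *blockPartials, int numObjects)
{
	extern __shared__ double sdata[];
	int tid = threadIdx.x;
	int gid = blockIdx.x * blockDim.x + threadIdx.x;

	// Placeholder for the real per-object calculation for this step of this simulation.
	sdata[tid] = (gid < numObjects) ? objects[gid] : 0.0;
	__syncthreads();

	// Shared-memory tree reduction within the block.
	for (int s = blockDim.x / 2; s > 0; s >>= 1) {
		if (tid < s)
			sdata[tid] += sdata[tid + s];
		__syncthreads();
	}

	// One double written per block - no double-precision atomics required.
	if (tid == 0)
		blockPartials[blockIdx.x] = sdata[0];
}

__global__ void reducePartials(const double *blockPartials, double *stepResult, int numPartials)
{
	extern __shared__ double sdata[];
	int tid = threadIdx.x;

	// Let a single block stride over all the per-block partial sums.
	double sum = 0.0;
	for (int i = tid; i < numPartials; i += blockDim.x)
		sum += blockPartials[i];
	sdata[tid] = sum;
	__syncthreads();

	for (int s = blockDim.x / 2; s > 0; s >>= 1) {
		if (tid < s)
			sdata[tid] += sdata[tid + s];
		__syncthreads();
	}

	// stepResult could point at results[sim * NumberOfSteps + step].
	if (tid == 0)
		*stepResult = sdata[0];
}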

Thanks. I think this is how we should change the algorithm.

By the way, what random number generator do you use that is callable from within a kernel function?

Can’t help you with that - I don’t work with random numbers or the types of numerical methods that need them.

No problem. The advice you gave has already helped us a lot.

Another question, if you do not mind.

How could we allocate memory within a kernel?

When we declare variables in a kernel, are they stored in GPU global memory?

We need to declare a double array within the kernel. As the number of elements is dynamic, we are planning to use pointers and allocate the memory for it dynamically.

Is this the correct way to do it?

Thanks

The basic answer is that you can't. Fermi does/will have support for the C++ new operator, but that isn't what you want, from what I can gather.

No, either register or local memory.

I wouldn't recommend using pointers. Use array indexing instead (for one thing, indices are portable between host and device; pointers are not). You can also leverage textures for lookup a lot more easily with indices than with pointers. I generally allocate a static scratch space when the output size of a kernel isn't known a priori. Give each thread or block a maximum number of outputs per kernel launch. If the limit is exceeded, have the block stop; then, once the kernel is completed, process the scratch into a partial result set (things like stream compaction and prefix sums are very handy for this), then rinse and repeat until the total input space has been covered.
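
For what it's worth, a minimal sketch of that scratch-space pattern might look like the code below (produceOutputs, scratch, counts, MAX_PER_BLOCK and the placeholder computation are all invented names, not anything from a library). Each block owns a fixed slice of a pre-allocated global buffer, writes at most MAX_PER_BLOCK results into it, and records how many entries it produced, so a later compaction or prefix-sum pass can gather the valid outputs before the next launch.

#define MAX_PER_BLOCK 256	// assumed fixed output budget per block

__global__ void produceOutputs(const double *input, int n, double *scratch, int *counts)
{
	// Each block owns a fixed slice of the pre-allocated scratch buffer.
	double *mySlice = scratch + blockIdx.x * MAX_PER_BLOCK;

	__shared__ int written;
	if (threadIdx.x == 0)
		written = 0;
	__syncthreads();

	int gid = blockIdx.x * blockDim.x + threadIdx.x;
	if (gid < n) {
		double value = 2.0 * input[gid];	// placeholder computation
		if (value > 0.0) {			// placeholder "this thread produced an output" test
			// Reserve a slot with an integer atomic in shared memory
			// (integer atomicAdd works even where double atomics do not).
			int slot = atomicAdd(&written, 1);
			if (slot < MAX_PER_BLOCK)
				mySlice[slot] = value;
			// Anything beyond the budget is skipped in this launch; the host
			// compacts the scratch and re-runs the kernel for the remainder.
		}
	}
	__syncthreads();

	// Record how many valid entries this block produced.
	if (threadIdx.x == 0)
		counts[blockIdx.x] = min(written, MAX_PER_BLOCK);
}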

Where is the data stored, then, if not in any of them?

If we use array indexing, will that require the array to be declared with a constant size?

If we use pointers, then we can allocate the memory dynamically and populate the values.

I am not sure I understand what you are trying to ask. If you declare a variable in a kernel, it either has thread scope or block scope. If it has thread scope, it will be stored in a register or in local memory. If it has block scope, it will be stored in shared memory. Anything in constant or global memory is declared at context scope, outside of any kernel.

No, why would it?
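
To put the scope rules above into code, here is a small hypothetical sketch (the names are invented, and it assumes the kernel is launched with at most 128 threads per block):

__constant__ double coeff;		// constant memory: declared at file (context) scope
__device__ double globalBias;		// global memory: declared at file (context) scope

__global__ void scopeExample(double *out)
{
	double x = 2.0 * coeff;		// thread scope: a register (or local memory if spilled)
	__shared__ double tile[128];	// block scope: shared memory, one copy per block

	tile[threadIdx.x] = x + globalBias;
	__syncthreads();

	out[threadIdx.x] = tile[threadIdx.x];
}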

Would it be possible to have a code example showing the array indexing method?

It isn't clear to me what you want to see, or what you don't understand. Indexing into a chunk of dynamically allocated memory is functionally identical to doing pointer arithmetic in C. Data structures which would otherwise use pointers can be made to use array indices instead. Indices have another advantage in CUDA on 64-bit platforms: an unsigned integer is 4 bytes, whereas a pointer is 8 bytes, so there are both memory and register savings to be had using indexing rather than pointers.
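
As one hypothetical illustration of that pattern (the Node struct and sumChain are made up for this example, not taken from any library): a structure that would normally hold a pointer stores an integer index into a pre-allocated array instead, and the same indices remain valid in both the host copy and the device copy of that array.

struct Node {
	double value;
	int next;	// index of the next node in the array; -1 marks the end
};

// The same index-based traversal compiles for host and device, because the
// indices stay meaningful regardless of where the node array is stored.
__host__ __device__ double sumChain(const Node *nodes, int head)
{
	double total = 0.0;
	for (int i = head; i != -1; i = nodes[i].next)
		total += nodes[i].value;
	return total;
}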
