Our scenario is this: we have a list of objects, a time period divided into steps, and a number of simulations to run.
e.g. 1,000,000 objects, 5 steps and 2,000,000 simulations.
The loops look like this:
for (int i = 0; i < NumberOfSimulations; i++)
{
    for (int j = 0; j < NumberOfSteps; j++)
    {
        for (int k = 0; k < NumberOfObjects; k++)
        {
            // calculate the result for each object and aggregate it
            // by adding it to a running total
        }
        // the total aggregated over all objects by the inner loop is saved
        // to an array, one entry per step per simulation
    }
}
Using CUDA, we can remove the innermost loop over the objects and instead use CUDA threads to iterate through the list of objects, so that each thread calculates the results for some subset of the objects for each step of each simulation.
As the result we need for each step of each simulation is the total over all objects, we need a way to aggregate the results from different threads. We created a results array whose size is the number of simulations multiplied by the number of steps.
At the moment we are using an atomic operation to accumulate the results from threads within the same block into shared memory, and another atomic operation to write the data from shared memory to global memory, but we are running into problems with atomic operations on doubles.
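(For context, the only emulation of atomicAdd for doubles we know of is the atomicCAS loop from the CUDA programming guide, sketched below; it needs 64-bit atomicCAS, which on pre-Fermi hardware is only available for global memory, not shared memory:)

__device__ double atomicAddDouble(double *address, double val)
{
    // reinterpret the double as a 64-bit integer so atomicCAS can operate on it
    unsigned long long int *address_as_ull = (unsigned long long int *)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old); // retry if another thread updated the value meanwhile
    return __longlong_as_double(old);
}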
If we try to use a reduction algorithm instead, how could we create the storage space for each thread? (Given the size of the results array, it seems impossible to create one copy per thread.)
And how could we synchronize threads across different blocks?
Use separate kernel launches for the computation and prefix-sum or reduction phases of the calculation. So your algorithm becomes something like
for each simulation
    for each timestep
        1. Run a kernel to compute the partial solution of each active object in the computation domain
        2. Run a kernel to reduce the partial solution
You can repeat 1 and 2 as many times as necessary to get the final solution within a given timestep. Kernel launches only take a few tens of microseconds on the platforms I use CUDA on, so there is not much penalty in running many small kernels in place of a single large kernel that tries to do everything in a single call.
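To make that concrete, here is a minimal sketch of the two-kernel pattern, assuming a hypothetical per-object function computeResult, one accumulator per (simulation, step) pair, and a power-of-two block size:

// stand-in for the real per-object calculation
__device__ double computeResult(double object)
{
    return object;
}

// Kernel 1: each block reduces its share of the objects to one partial sum
__global__ void partialSums(const double *objects, int numObjects, double *blockSums)
{
    extern __shared__ double sdata[];
    int tid = threadIdx.x;
    double sum = 0.0;
    // grid-stride loop so a fixed-size grid can cover any number of objects
    for (int i = blockIdx.x * blockDim.x + tid; i < numObjects; i += gridDim.x * blockDim.x)
        sum += computeResult(objects[i]);
    sdata[tid] = sum;
    __syncthreads();
    // shared-memory tree reduction within the block (blockDim.x must be a power of two)
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0)
        blockSums[blockIdx.x] = sdata[0];
}

// Kernel 2: a single block reduces the per-block partials to the final total
__global__ void reducePartials(const double *blockSums, int numBlocks, double *result)
{
    extern __shared__ double sdata[];
    int tid = threadIdx.x;
    double sum = 0.0;
    for (int i = tid; i < numBlocks; i += blockDim.x)
        sum += blockSums[i];
    sdata[tid] = sum;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0)
        *result = sdata[0]; // one entry of the simulations x steps results array
}

// host side, per simulation i and step j (sketch):
// partialSums<<<blocks, threads, threads * sizeof(double)>>>(d_objects, numObjects, d_blockSums);
// reducePartials<<<1, threads, threads * sizeof(double)>>>(d_blockSums, blocks, &d_results[i * NumberOfSteps + j]);

Note that this also answers the question about synchronizing across blocks: the grid-wide synchronization happens implicitly at the kernel boundary, since kernel 2 cannot start until every block of kernel 1 has finished.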
No problem. The advice you gave has already helped us a lot.
Another question, if you do not mind.
How could we allocate memory within the kernel?
When we declare variables in a kernel, are they stored in GPU global memory?
We need to declare a double array within the kernel. As the number of elements is dynamic, we are planning to use pointers and allocate memory for it dynamically.
The basic answer is that you can’t. Fermi does/will have support for the C++ new operator, but that isn’t what you want, from what I can gather.
No, they go in either registers or local memory.
I wouldn’t recommend using pointers. Use array indexing instead (for one thing, indices are portable between host and device; pointers are not). You can also leverage textures for lookup a lot more easily with indices than with pointers. I generally allocate a static scratch space when the output size of a kernel isn’t known a priori: give each thread or block a maximum number of outputs per kernel launch. If the limit is exceeded, have the block stop; then, once the kernel has completed, process the scratch into a partial result set (things like stream compaction and prefix sums are very handy for this), then rinse and repeat until the total input space has been covered.
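A minimal sketch of that scratch-space pattern, with hypothetical names (produceOutput stands in for whatever generates your variable-length results); each thread owns a fixed slice of a preallocated buffer plus a count the host can prefix-sum to compact the output:

// stand-in for the real calculation: given a candidate index, maybe emit a result
__device__ bool produceOutput(const double *input, int candidate, double *out)
{
    *out = input[candidate] * 2.0;
    return input[candidate] > 0.0; // pretend only some candidates produce output
}

__global__ void scatterToScratch(const double *input, int candidatesPerThread,
                                 double *scratch, int *counts, int *resumeAt,
                                 int maxOutputsPerThread)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int begin = tid * candidatesPerThread;  // this thread's input range
    int written = 0;
    int k = resumeAt[tid];                  // zero-initialized by the host; where we stopped last launch
    for (; k < candidatesPerThread && written < maxOutputsPerThread; ++k) {
        double r;
        if (produceOutput(input, begin + k, &r))
            scratch[tid * maxOutputsPerThread + written++] = r; // private slice, no atomics needed
    }
    counts[tid] = written;  // host prefix-sums these to compact the scratch
    resumeAt[tid] = k;      // relaunch until every thread reaches candidatesPerThread
}

After each launch the host (or another kernel) compacts the scratch into the partial result set using the counts, then relaunches until resumeAt shows every thread has covered its input range.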
I am not sure I understand what you are trying to ask. If you declare a variable in a kernel, it has either thread scope or block scope. If it has thread scope it will be stored in registers or local memory. If it has block scope it will be stored in shared memory. Anything in constant or global memory is declared at context scope, outside of any kernel.
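To illustrate (a sketch with arbitrary names, assuming a block size of at most 128 threads):

__device__   double d_scale = 1.0;   // context scope: lives in global memory
__constant__ double d_rate  = 0.05;  // context scope: lives in constant memory

__global__ void scopeExample(double *out)
{
    double x = d_scale * d_rate;     // thread scope: a register (or local memory if spilled)
    double history[16];              // thread scope: per-thread arrays are often placed in local memory
    __shared__ double tile[128];     // block scope: shared memory, visible to the whole block

    int tid = threadIdx.x;
    history[0] = x;
    tile[tid] = history[0];
    __syncthreads();
    out[blockIdx.x * blockDim.x + tid] = tile[tid];
}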
It isn’t clear to me what you want to see, or what you don’t understand. Indexing into a chunk of dynamically allocated memory is functionally identical to doing pointer arithmetic in C. Data structures which would otherwise use pointers can be made to use array indices instead. Indices have another advantage in CUDA on 64-bit platforms: an unsigned integer is 4 bytes, whereas a pointer is 8 bytes, so there are both memory and register savings to be had from using indexing rather than pointers.
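As a sketch of what that looks like (hypothetical types; a linked structure stored in a flat array and chained by indices instead of pointers):

#define END_OF_LIST 0xFFFFFFFFu

struct Node {
    float value;
    unsigned int next; // index of the next node in the array, END_OF_LIST at the tail
};

// walking the list via indices works identically on host and device;
// the same Node array can be copied between the two without fixing anything up
__host__ __device__ float sumList(const Node *nodes, unsigned int head)
{
    float total = 0.0f;
    for (unsigned int i = head; i != END_OF_LIST; i = nodes[i].next)
        total += nodes[i].value;
    return total;
}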