Strategies for dynamic global memory access from the CPU?

Hi all,

GPU: 1 Tesla M2090

I am trying to solve a problem in which the amount of data that needs to be sent back to the CPU is not known before the GPU kernel runs. Basically orbit propagation/ODE integration with result data at every time step. The kernel is compute intensive.

I have never dealt with this kind of data management problem before and would like to know a good way to handle it …

I have been doing some digging on the internet (Stack Overflow, here, etc.) and it looks like I have the following options:

  1. Allocate ALL 6 GB of device memory, write to it, and use another kernel to figure out how much memory needs to be transferred back. The problem is what happens if my kernel produces more than 6 GB of data? Plus it can be slow.

  2. Use zero copy: my kernel does a lot of computation per thread and the results are written only once. Can I use this? I have no experience with zero copy, but it seems like a good candidate for writing data directly to host memory. (I have 24 GB RAM, so memory size is not a huge issue.)

  3. Dynamically allocate memory on the device and then copy it to the host. I found claims that this is no longer supported since CUDA 4.1 and that Nvidia is working on a fix? Maybe I am wrong; the 4.1 programming guide on pg 108 says it can be done:
    “Memory allocated via malloc() can be copied using the runtime (i.e. by calling any of the copy memory functions from Section 3.2.2).”

So I am confused as to which is the best way forward … or is there a better way of solving this problem?

All my computations are in double precision.

Any help or inputs are welcome… thanks for the help.

Georgia Tech.

As the amount of data output per thread apparently varies for your problem, I assume you have some variable tracking the amount of output so far that is updated via atomicAdd()?

In that case you have another option: allocate as much memory as you can spare for the output buffer, have threads exit if the output buffer is full, and launch another kernel if not all data has been processed. To avoid spending time on calculations whose results cannot be saved, you will probably want to check at the beginning of each block whether the remaining output buffer space is sufficient to hold the maximum number of results from the maximum number of blocks that can run in parallel (or some reasonable estimate), and exit early if not.

To overlap copy and compute you can also create multiple smaller output buffers.
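The multi-buffer idea could be sketched roughly like this. Everything here is an assumption on top of the post: the names (buf_d, buf_h, launch_fill_kernel, all_work_done) and the fixed CHUNK size are placeholders, not actual code from this thread.

```cuda
// Sketch only: ping-pong between two output buffers so that while the kernel
// fills one buffer in its stream, the previously filled buffer is copied back
// to pinned host memory in the other stream.
#define CHUNK (1 << 20)   // entries per buffer; placeholder value

struct buffer_entry *buf_d[2], *buf_h[2];
cudaStream_t stream[2];

for (int i = 0; i < 2; i++) {
    cudaMalloc(&buf_d[i], CHUNK * sizeof(struct buffer_entry));
    cudaHostAlloc(&buf_h[i], CHUNK * sizeof(struct buffer_entry),
                  cudaHostAllocDefault);          // pinned, so async copies overlap
    cudaStreamCreate(&stream[i]);
}

int cur = 0;
while (!all_work_done()) {                        // hypothetical completion check
    launch_fill_kernel(buf_d[cur], stream[cur]);  // fills at most CHUNK entries
    // copy the *other* buffer back while this one is being filled
    cudaMemcpyAsync(buf_h[1 - cur], buf_d[1 - cur],
                    CHUNK * sizeof(struct buffer_entry),
                    cudaMemcpyDeviceToHost, stream[1 - cur]);
    cudaStreamSynchronize(stream[1 - cur]);       // buf_h[1 - cur] is now safe to consume on the CPU
    cur = 1 - cur;
}
```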

Thanks for the quick reply. I am not using any atomics (yet). I just have a global counter array which keeps track of how much data each thread has written, which equals how many steps it took. Each thread takes a variable number of steps (propagates an orbit), but the data per step is fixed at 6 doubles. I have a huge global array and each thread writes to a chunk of it (fixed from the CPU). Some threads completely fill their chunk while others only fill it partially (as they take fewer steps), and the counter array helps me keep track of this. THIS IS VERY INEFFICIENT and breaks down as soon as some thread tries to store more than its allocated chunk.

Your suggestion to use output buffers and overlap them seems good. I will try to implement it. I am also trying to implement zero copy and my kernel is mostly compute bound.

Also, I am curious how one would use atomicAdd as you suggested above? I am not a CS major (AE major) … so not exactly sure how to use atomics.

thanks again !

Zero copy indeed is a good option, although I would again combine it with the use of an atomic buffer pointer.
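For reference, mapped (zero-copy) memory is set up roughly as below. The names (out_h, out_d, N, out_kernel) are placeholders; the kernel writes through the mapped device pointer straight into host memory.

```cuda
// Zero-copy setup sketch: results land directly in host RAM over PCIe.
cudaSetDeviceFlags(cudaDeviceMapHost);   // must run before the CUDA context is created

double *out_h, *out_d;
cudaHostAlloc(&out_h, N * sizeof(double), cudaHostAllocMapped);
cudaHostGetDevicePointer(&out_d, out_h, 0);

out_kernel<<<blocks, threads>>>(out_d);  // kernel writes through the mapped pointer
cudaDeviceSynchronize();                 // only after this are the values in out_h valid
```

Since each result is written exactly once and the kernel is compute bound, the PCIe latency of the mapped writes can hide behind the computation.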

Here’s a skeleton indicating the use of atomic operations:

struct buffer_entry {
        unsigned int block;
        double data[6];   // 6 doubles per step, per your description
};

#define BUF_SIZE ... // some large number
#define MAX_CONCURRENT_BLOCKS (16*15) // only a rough estimate. Doesn't need to be exact
#define NUM_BLOCKS ...

__device__ struct buffer_entry buffer[BUF_SIZE];
__device__ unsigned long long int buffer_index; // you can use unsigned int here if you know your kernel can never produce more than 4294967295 buffer entries
__device__ char done[NUM_BLOCKS];

__global__ void my_kernel(...)
{
    const int blockID = (blockIdx.z * gridDim.y + blockIdx.y) * gridDim.x + blockIdx.x;

    if (done[blockID] || buffer_index >= BUF_SIZE - MAX_CONCURRENT_BLOCKS)
        return; // in rare cases this may lead to suboptimal placement of reconvergence points, but otherwise improves performance by preventing useless calculations whose output cannot be saved

    while (...) {
        struct buffer_entry entry;
        // compute and set entry.data
        unsigned long long int pos = atomicAdd(&buffer_index, 1);
        if (pos >= BUF_SIZE)
            return; // buffer full: leave done[blockID] unset so the next launch redoes this block
        entry.block = blockID;
        buffer[pos] = entry;
    }

    done[blockID] = 1;
}

int main(...)
{
    char *done_ptr;
    unsigned long long int *buffer_index_ptr;
    CHECK_ERROR(cudaGetSymbolAddress((void **)&done_ptr, done));
    CHECK_ERROR(cudaGetSymbolAddress((void **)&buffer_index_ptr, buffer_index));
    CHECK_ERROR(cudaMemset(done_ptr, 0, sizeof(done)));
    CHECK_ERROR(cudaMemset(buffer_index_ptr, 0, sizeof(buffer_index)));

    my_kernel<<<NUM_BLOCKS, ...>>>(...);
}

In a more advanced version you could use stream compaction to collect the entries from the entire block and write them out at once with a single atomicAdd(). Given the speed of atomics on Fermi and later hardware it’s probably not worth the trouble though.
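For what it's worth, the per-block aggregation mentioned above could look roughly like this inside the kernel's output loop. The shared-memory counter approach and the names (has_result, my_entry) are assumptions; the post deliberately leaves the details open.

```cuda
// Sketch: threads first reserve slots via a cheap shared-memory atomic,
// then one thread per block performs a single global atomicAdd for the
// whole block. Assumes the buffer/buffer_index globals from the skeleton.
__shared__ unsigned int block_count;        // entries produced by this block this round
__shared__ unsigned long long block_base;   // block's start index in the global buffer

if (threadIdx.x == 0)
    block_count = 0;
__syncthreads();

unsigned int my_slot = 0;
if (has_result)                             // hypothetical per-thread flag
    my_slot = atomicAdd(&block_count, 1);   // shared-memory atomic, much cheaper than global
__syncthreads();

if (threadIdx.x == 0)                       // one global atomic for the whole block
    block_base = atomicAdd(&buffer_index, (unsigned long long)block_count);
__syncthreads();

if (has_result && block_base + my_slot < BUF_SIZE)
    buffer[block_base + my_slot] = my_entry; // my_entry: this thread's result struct
```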