achieve inter process communication, host<->GPU to avoid copying between global and shared m

Hi Everybody,

I have to multiply a large block-structured matrix, M, by a vector, V. I can already do this with a single kernel, and I do not plan to use multiple kernels. Now I have to do it for different Vs, one V for each time step. One way to do this is:

  1. copy M, V to global memory from Host.
    start (time step loop)
  2. call kernel that multiplies
  3. the kernel threads load M and V from global memory to shared memory, and multiply them
  4. Wait for the kernel to complete with cudaThreadSynchronize();
  5. copy result R from device to host
  6. copy new V into the global memory
    end (time step loop)
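
The loop above can be sketched roughly as follows. This is a minimal illustration, not the actual code: the names `matVec` and `runTimeSteps`, the dense row-major layout, and the placeholder data (M = 2*I, V filled with t + 1) are all assumptions made for the example. The kernel reads M directly from global memory and skips the shared-memory staging of step 3, and it is not tuned for coalescing. Because a blocking `cudaMemcpy` on the default stream already waits for the preceding kernel, the explicit synchronization of step 4 is left out here.

```cuda
#include <vector>
#include <cuda_runtime.h>

// One thread per row: r[row] = sum_j m[row * n + j] * v[j].
// M is read from global memory; nothing is staged in shared memory.
__global__ void matVec(const double *m, const double *v, double *r, int n)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n) return;
    double sum = 0.0;
    for (int j = 0; j < n; ++j)
        sum += m[(size_t)row * n + j] * v[j];
    r[row] = sum;
}

// Hypothetical driver: runs `steps` time steps and returns the sum of all
// result entries, so that the outcome is easy to check.
double runTimeSteps(int n, int steps)
{
    std::vector<double> hM((size_t)n * n, 0.0), hV(n), hR(n);
    for (int i = 0; i < n; ++i) hM[(size_t)i * n + i] = 2.0;  // placeholder M = 2*I

    double *dM, *dV, *dR;
    cudaMalloc((void **)&dM, (size_t)n * n * sizeof(double));
    cudaMalloc((void **)&dV, n * sizeof(double));
    cudaMalloc((void **)&dR, n * sizeof(double));

    // Step 1: M is copied to the device once, outside the time loop.
    cudaMemcpy(dM, hM.data(), (size_t)n * n * sizeof(double), cudaMemcpyHostToDevice);

    double total = 0.0;
    for (int t = 0; t < steps; ++t) {
        for (int i = 0; i < n; ++i) hV[i] = t + 1.0;          // placeholder V for this step
        cudaMemcpy(dV, hV.data(), n * sizeof(double), cudaMemcpyHostToDevice);
        matVec<<<(n + 127) / 128, 128>>>(dM, dV, dR, n);      // steps 2 and 3
        // Step 5: this blocking copy also waits for the kernel to finish.
        cudaMemcpy(hR.data(), dR, n * sizeof(double), cudaMemcpyDeviceToHost);
        for (int i = 0; i < n; ++i) total += hR[i];
    }
    cudaFree(dM); cudaFree(dV); cudaFree(dR);
    return total;
}
```

Calling `runTimeSteps(n, steps)` from `main()` exercises the whole loop; error checking of the CUDA calls is omitted to keep the sketch short.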

This looks fairly straightforward and should work. But the matrix is the same in every time step. Is there a way to avoid the copy of M from global memory to shared memory in step (3)?

I am trying to exploit the fact that the kernel runs asynchronously. Is the following possible?

  1. Copy M to global memory from host. Make the vectors V and R visible to both host and device (mapped host memory) with cudaHostAlloc(…,cudaHostAllocMapped). Copy V. Set R to, say, 0s.
  2. Create an int variable called flag, mapped in the same way with cudaHostAlloc(…,cudaHostAllocMapped), and set flag = 1.
  3. Call the new kernel.
    start (time loop)
  4. Check if flag = 0, else sleep. If flag = 0 (changed by the kernel), display the result vector R (populated by the kernel).
  5. Copy the new V, for the next time step, into V.
  6. Set flag = 1
    end (time loop)

The new kernel in turn executes another time loop

  1. Copy M from global memory to shared memory
    start (time loop)
  2. Check if flag = 1. If no, keep checking. If yes, read V to shared memory, multiply M with V, and store result in R (which is on the host. Hope the memory controller takes care of the copy).
  3. Reset flag=0. (which is again on the host. Hope the DMA has its operations queued, and finishes step 2 before it updates the flag variable)
    end (time loop)
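
The handshake described in the two lists above can be sketched as follows. This is only an illustration under several assumptions, not the code from the experiment mentioned in this post: the names `persistentKernel` and `runHandshake` are made up, the real M*V product is replaced by a stand-in (R = 2*V), a single thread block is used, and the kernel runs a fixed number of steps so that it terminates. The flag and the vectors are accessed through `volatile` pointers and the kernel calls `__threadfence_system()` before writing the flag, with the aim of ordering the writes to R before the flag update. Whether this behaves well on a particular GPU, driver, or operating system is not guaranteed; a display watchdog may also kill a long-running kernel.

```cuda
#include <atomic>
#include <cuda_runtime.h>

// Hypothetical persistent kernel, launched once with a single block.
__global__ void persistentKernel(volatile int *flag, const volatile float *v,
                                 volatile float *r, int n, int steps)
{
    for (int s = 0; s < steps; ++s) {
        if (threadIdx.x == 0)
            while (*flag != 1) { }        // spin until the host publishes a new V
        __syncthreads();
        for (int i = threadIdx.x; i < n; i += blockDim.x)
            r[i] = 2.0f * v[i];           // stand-in for the real M*V product
        __threadfence_system();           // order the writes to R before the flag write
        __syncthreads();
        if (threadIdx.x == 0) *flag = 0;  // tell the host that R is ready
    }
}

// Hypothetical host side: returns the number of steps with a correct R,
// or -1 if the kernel could not be launched.
int runHandshake(int n, int steps)
{
    cudaSetDeviceFlags(cudaDeviceMapHost);
    int *flag; float *v, *r;
    cudaHostAlloc((void **)&flag, sizeof(int), cudaHostAllocMapped);
    cudaHostAlloc((void **)&v, n * sizeof(float), cudaHostAllocMapped);
    cudaHostAlloc((void **)&r, n * sizeof(float), cudaHostAllocMapped);
    int *dFlag; float *dV, *dR;
    cudaHostGetDevicePointer((void **)&dFlag, flag, 0);
    cudaHostGetDevicePointer((void **)&dV, v, 0);
    cudaHostGetDevicePointer((void **)&dR, r, 0);

    volatile int *hFlag = flag;
    volatile float *hR = r;
    *hFlag = 0;                           // nothing published yet
    persistentKernel<<<1, 128>>>(dFlag, dV, dR, n, steps);
    if (cudaGetLastError() != cudaSuccess) return -1;
    cudaStreamQuery(0);                   // may prompt a batching driver to submit the launch

    int good = 0;
    for (int s = 0; s < steps; ++s) {
        for (int i = 0; i < n; ++i) v[i] = float(s + i);     // new V for this step
        std::atomic_thread_fence(std::memory_order_seq_cst); // V before flag
        *hFlag = 1;
        while (*hFlag != 0) { }           // busy-wait; a real program might sleep here
        std::atomic_thread_fence(std::memory_order_seq_cst);
        bool ok = true;
        for (int i = 0; i < n; ++i) ok = ok && (hR[i] == 2.0f * float(s + i));
        good += ok ? 1 : 0;
    }
    cudaDeviceSynchronize();              // the kernel exits after `steps` iterations
    cudaFreeHost(flag); cudaFreeHost(v); cudaFreeHost(r);
    return good;
}
```

In this sketch the host never calls a synchronization function inside the time loop; it only polls the mapped flag. If the launch is never actually submitted to the GPU, the host loop spins forever, so a timeout would be advisable in real code.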

In this variation a single kernel keeps running after one initial launch. The advantage is that step 1 of the new kernel, copying the big matrix M from global memory to shared memory, takes place just once. I don’t know how large the overhead of launching a kernel is, but that would be saved too.

I tried this kind of IPC with simple addition programs, and it does not work. Step 2 of the new kernel, a kernel-initiated memory copy, happens only when the host calls cudaThreadSynchronize(). But that will never happen if the kernel is always ‘on’, waiting for new data to process.

  1. Should the above work, which means I only have to debug, or is it simply not possible? The documentation says that mapped memory helps to amortize one-time copy latencies, without going into what is and is not possible.
  2. Is there any other mechanism offered by CUDA to achieve the aforementioned objective?

Thank you very much,

There is no point in trying this. Assuming your matrix even fits into shared memory at all, it is so small it will be reloaded from global memory in an instant.

To put some numbers to it: The GPU with the biggest shared memory on chip is the GTX580. It has 16 multiprocessors with 48 kilobytes of shared memory, for a total of 768 kbytes. It also has a global memory bandwidth of 192.4 gigabytes per second. Even allowing for 50% overhead, you could thus completely reload shared memory from global memory 125 000 times per second.
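
Spelled out, and taking a kilobyte as 10^3 bytes (an assumption; with 1024-byte kilobytes the result comes out at roughly 122 000):

```latex
\[
\frac{0.5 \times 192.4 \times 10^{9}\ \mathrm{B/s}}{16 \times 48 \times 10^{3}\ \mathrm{B}}
= \frac{96.2 \times 10^{9}}{768 \times 10^{3}}\ \mathrm{s^{-1}}
\approx 1.25 \times 10^{5}\ \mathrm{s^{-1}}
\]
```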

Hallo Tera,

Thank you very much for the reply and the figures. I should have made clearer how the block matrix multiplication is executed. A block matrix looks like, e.g., the one in the figure in

Such a structure enables massively parallel computing. Each thread takes only one block (if the whole matrix fit into the shared memory of one multiprocessor, there would be no saving in switching to the GPU at all). The full matrix is still too large for one multiprocessor. I have been given one GTX480 with 15 multiprocessors, each with 48kB of shared memory, similar to the figures you gave. I was thinking as follows:

  1. Assume the problem has a thousand unknowns (the size of vector V). Assume each block is a 20*20 submatrix, which makes 400 doubles.

  2. Assign as many threads to a thread block as it takes for their submatrices to add up to 48kB.

  3. Each thread will multiply its block by the corresponding components of vector V. The remaining entries contribute zero anyway.

  4. The partial sums from each thread (spanning all multiprocessors, over many blocks) are to be sent back to the host.

  5. This is done in every time step.

  6. Do you still think there will be no gain in the approach I outlined? I wanted to deal with the problems of coalesced access, two threads accessing the same bank, etc., just once.

But you have a point: if the problem grows bigger, to a million unknowns, say, then there is going to be more than one warp per multiprocessor, and if scheduling them involves main memory, the saving is lost.
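
For the figures assumed in the list above, the shared-memory budget works out roughly as follows. This takes 8 bytes per double and 48kB as 48 * 1024 bytes, and it ignores the space needed for the components of V, so it is an upper bound on the number of submatrices per multiprocessor:

```latex
\[
20 \times 20 \times 8\ \mathrm{B} = 3200\ \mathrm{B}
\quad\text{per submatrix},
\qquad
\left\lfloor \frac{48 \times 1024\ \mathrm{B}}{3200\ \mathrm{B}} \right\rfloor
= \lfloor 15.36 \rfloor = 15
\quad\text{submatrices per multiprocessor}.
\]
```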

  1. Even if so, just out of curiosity, is such an IPC not possible?

Thank you,