stream processing paradigm

Hello,
I would like to know if there are any examples out there of stream-oriented processing in CUDA. Suppose you have a stream of data coming in on the host. You want to feed the GPU relatively small chunks of data. A number of threads should process each chunk, potentially modifying the data, which is then integrated back into an output stream.

One obvious way to do this is shown in the following pseudo-code:

while (1) {
    cudaMemcpyAsync(d_buf[i], h_buf[i], size, cudaMemcpyHostToDevice, stream[i]);
    kernel<<<…, stream[i]>>>(d_buf[i]);
    cudaMemcpyAsync(h_buf[i], d_buf[i], size, cudaMemcpyDeviceToHost, stream[i]);
    i = (i + 1) % nstreams;
}

Here the idea is that even though the number of blocks and threads per kernel is relatively small, I should still reach decent performance by virtue of a high number of kernels running concurrently.
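Roughly, what I have in mind looks like the sketch below (the stream count, chunk size and the process() kernel are made-up placeholders; the host buffers are pinned so the async copies can actually overlap with execution):

#include "cuda_runtime.h"
#include <stdint.h>

#define NSTREAMS   4
#define CHUNK_SIZE (64 * 1024)          // bytes per chunk, arbitrary example value

__global__ void process(uint8_t *buf, size_t n) {
    // placeholder processing: each thread modifies one byte in place
    size_t idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) buf[idx] += 1;
}

int main() {
    uint8_t *h_buf[NSTREAMS], *d_buf[NSTREAMS];
    cudaStream_t stream[NSTREAMS];

    for (int i = 0; i < NSTREAMS; i++) {
        cudaHostAlloc(&h_buf[i], CHUNK_SIZE, cudaHostAllocDefault);  // pinned, so async copies can overlap
        cudaMalloc(&d_buf[i], CHUNK_SIZE);
        cudaStreamCreate(&stream[i]);
    }

    int i = 0;
    for (int chunk = 0; chunk < 1000; chunk++) {   // stand-in for "while input keeps arriving"
        cudaStreamSynchronize(stream[i]);          // wait until this buffer's previous round has drained
        // ... fill h_buf[i] with the next input chunk, consume the previous output ...
        cudaMemcpyAsync(d_buf[i], h_buf[i], CHUNK_SIZE, cudaMemcpyHostToDevice, stream[i]);
        process<<<(CHUNK_SIZE + 255) / 256, 256, 0, stream[i]>>>(d_buf[i], CHUNK_SIZE);
        cudaMemcpyAsync(h_buf[i], d_buf[i], CHUNK_SIZE, cudaMemcpyDeviceToHost, stream[i]);
        i = (i + 1) % NSTREAMS;
    }
    cudaDeviceSynchronize();
    return 0;
}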

However, I find that when the granularity of the data is fairly small there seems to be excessive overhead in the various runtime calls, and the kernels barely run concurrently. What is really responsible for such high overhead? Is it user/kernel mode transitions?
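One crude way to put a number on the per-call cost would be to time a batch of empty launches, e.g. with something like this (a minimal measurement sketch, separate from the actual pipeline):

#include "cuda_runtime.h"
#include <stdio.h>

__global__ void empty_kernel() { }

int main() {
    const int N = 10000;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    empty_kernel<<<1, 32>>>();          // warm-up launch
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    for (int i = 0; i < N; i++)
        empty_kernel<<<1, 32>>>();      // back-to-back asynchronous launches
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("average per-launch cost: %.2f us\n", 1000.0f * ms / N);
    return 0;
}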

Ideally I would want a single kernel operating in an infinite loop, and to feed it data continuously (possibly through cudaMemcpy). To achieve that, however, I'd need a way for the host to deposit a value in a mailbox for the device, with the device polling on it. Toying with this idea, I wrote the following code:

#include "cuda_runtime.h"
#include <stdio.h>
#include <inttypes.h>
#include <unistd.h>


__global__ void kernel(uint32_t *mailbox) {
    int i=0;
    // spin until the host writes a nonzero value into the mailbox
    while(*mailbox == 0) {
        i++;
        if (i%1000 == 0)
            printf("i=%d\n", i);
    }
    printf("GOT DATA\n");
}


int main(int argc, char **argv) {
    uint32_t *mailbox, local;
    cudaStream_t st1;

    cudaMalloc(&mailbox, sizeof(uint32_t));
    cudaStreamCreate(&st1);
    local = 0;
    // clear the mailbox before launching the polling kernel
    cudaMemcpyAsync(mailbox, &local, sizeof(uint32_t), cudaMemcpyHostToDevice, st1);
    kernel<<<1, 512>>>(mailbox);
    printf("kernel called\n");
    local = 1;
    // signal the running kernel to exit by writing 1 into the mailbox
    cudaMemcpyAsync(mailbox, &local, sizeof(uint32_t), cudaMemcpyHostToDevice, st1);
    printf("second memcpy done\n");
    cudaDeviceSynchronize();
}

The kernel is supposed to exit when the value changes to 1, but apparently it doesn't see the change, and cudaDeviceSynchronize gets stuck waiting for the kernel to complete.

I do know this is not your typical usage model, so please refrain from replying along those lines. I'm interested either in comments about what limits such a model, or in examples of how to organize code around this paradigm.

Thanks!

Try marking your mailbox variable as volatile.
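That is, something along these lines for the polling kernel (just the signature change, shown in isolation):

#include <inttypes.h>

// volatile tells the compiler to re-read *mailbox from global memory on every
// iteration, instead of caching the first value in a register
__global__ void kernel(volatile uint32_t *mailbox) {
    while (*mailbox == 0) { /* spin until the host writes a nonzero value */ }
}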

OK, there were a few issues with your code. As a general rule, if you want predictable stream behavior, don't use the default stream; it is synchronizing and blocking device-wide.

Also, you need to mark variables like mailbox as volatile.

The following modification of your code seems to work for me:

$ cat t608.cu
#include "cuda_runtime.h"
#include <stdio.h>
#include <inttypes.h>
#include <unistd.h>


__global__ void kernel(volatile uint32_t *mailbox) {
    int i=0;
    while(*mailbox == 0) {
        i++;
        if (i%1000 == 0)
            printf("i=%d\n", i);
    }
    printf("GOT DATA\n");
}


int main(int argc, char **argv) {
    uint32_t *mailbox, local;
    cudaStream_t st1, st2;

    cudaMalloc(&mailbox, sizeof(uint32_t));
    cudaStreamCreate(&st1);
    cudaStreamCreate(&st2);
    local = 0;
    cudaMemcpyAsync(mailbox, &local, sizeof(uint32_t), cudaMemcpyHostToDevice, st1);
    kernel<<<1, 1, 0, st1>>>(mailbox);
    printf("kernel called\n");
    local = 1;
    cudaMemcpyAsync(mailbox, &local, sizeof(uint32_t), cudaMemcpyHostToDevice, st2);
    printf("second memcpy done\n");
    cudaDeviceSynchronize();
}
$ nvcc -arch=sm_20 -o t608 t608.cu
$ ./t608
kernel called
second memcpy done
GOT DATA
$