Passing data to already executing threads

Hi all, has anyone here dealt with this problem before? I want to have device threads that are already running wait for data to be generated by the CPU and passed to them. So far my attempts have mostly looked like this:

[codebox]__global__ void kernel(int *arr) {
    while(arr[0] == 0)
        arr[3] = 3;
}

int main() {
    int *test;
    unsigned long i;
    int *arr, *d_arr;

    // set up page locked host arrays. test is used to set a bit on the device, arr is the input array to the kernel
    cudaMallocHost((void **)&test, sizeof(int));
    cudaMallocHost((void **)&arr, sizeof(int) * 10);

    for(i = 0; i < 10; i++) arr[i] = 0;
    test[0] = 1;

    // generate cuda streams to permit concurrent kernel execution and memory copies
    cudaStream_t streams[2];
    cudaStreamCreate(&streams[0]);
    cudaStreamCreate(&streams[1]);

    cudaMalloc((void **)&d_arr, sizeof(int) * 10);
    cudaMemcpyAsync(d_arr, arr, sizeof(int) * 10, cudaMemcpyHostToDevice, streams[0]);

    kernel<<<30, 256, 0, streams[0]>>>(d_arr);

    for(i = 0; i < 1000000000; i++) ; // wait for a time

    // copy the value one into the first location of the input array, hopefully causing it to break out of the loop
    cudaMemcpyAsync(d_arr, test, sizeof(int), cudaMemcpyHostToDevice, streams[1]);

    // wait for device threads to return (but they never do...)
    cudaThreadSynchronize();

    return 0;
}
[/codebox]

So far this has not worked. It seems that everything completes, but the change in the array value is not seen on the device. Does anyone have experience with this sort of problem?

Thanks very much for any help.

You can safely assume that what you are trying to do can’t be done. There is no way of guaranteeing memory coherence between the host and the device while the device is executing a kernel. The most obvious candidate to try is zero-copy memory, but the NVIDIA developers who post here have definitively said that it can’t work for the sort of “inter-process communication” model you are looking for. I doubt that Fermi brings anything new to the table that changes this (except for the ability to run multiple kernels simultaneously, but that doesn’t affect the memory coherence problem).
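
For comparison, here is a minimal sketch of the conventional pattern that stays within those limits, assuming the CPU can hand over its data in batches: rather than a kernel that spins waiting for host writes, launch one short kernel per batch once that batch is ready. The names `process_batch` and `run_batches` and the "add 3" operation are hypothetical placeholders, not anything from the code above.

```cuda
#include <cuda_runtime.h>

// Hypothetical per-batch kernel: adds 3 to each element of the batch.
__global__ void process_batch(int *data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        data[idx] += 3;
}

// Hypothetical host driver: copies one batch to the device, runs the
// kernel on it, and copies the result back before moving to the next batch.
// Returns true if every CUDA call succeeded.
bool run_batches(int *host_data, int total, int batch_size)
{
    int *d_buf = nullptr;
    if (cudaMalloc((void **)&d_buf, sizeof(int) * batch_size) != cudaSuccess)
        return false;

    cudaStream_t stream;
    if (cudaStreamCreate(&stream) != cudaSuccess) {
        cudaFree(d_buf);
        return false;
    }

    bool ok = true;
    for (int offset = 0; offset < total && ok; offset += batch_size) {
        int n = (total - offset < batch_size) ? (total - offset) : batch_size;
        size_t bytes = sizeof(int) * n;

        // Operations in the same stream run in issue order, so the kernel
        // only starts after its input copy has finished.
        cudaMemcpyAsync(d_buf, host_data + offset, bytes,
                        cudaMemcpyHostToDevice, stream);
        process_batch<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, n);
        ok = (cudaGetLastError() == cudaSuccess);
        cudaMemcpyAsync(host_data + offset, d_buf, bytes,
                        cudaMemcpyDeviceToHost, stream);

        // The kernel has finished by the time this returns, so the host
        // can safely read the results for this batch.
        ok = ok && (cudaStreamSynchronize(stream) == cudaSuccess);
    }

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    return ok;
}
```

Each kernel sees only data that was fully copied before it launched, so nothing depends on the device noticing a host write in the middle of a kernel. The cost is one launch per batch, and to actually overlap copies with kernels you would want page-locked host buffers and more than one stream.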