CUDA 4.0: Two threads cannot access the same memory

Using CUDA 4.0 RC, I am trying to have two threads access the same memory, but cudaMemcpy fails with error code 11, “invalid argument”.
What might be the reason for this? Which argument is invalid, and why?

My actual call is

cudaMemcpy(pParticlesHost,pParticlesDev,sizeof(Particle)*numParticles,cudaMemcpyDeviceToHost);

where I copy a number of particles from device memory to host memory. The exact same call works in one thread and fails in another. The thread in which it works is the one that originally allocated the memory, and it is also the thread that launches the kernel which updates the device memory. I have tested this without running the kernel, to rule out a conflict between a running kernel and a thread reading the memory. The problem persists.
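To see what the runtime reports in the failing thread, it can help to wrap the copy and print the error string together with the device the calling thread is currently using. This is only a minimal sketch: checkCuda and copyToHost are hypothetical helper names, not part of the original code, and the buffer types are reduced to void pointers.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not from the original post): prints which call failed and why.
static bool checkCuda(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s failed: %s (code %d)\n",
                     what, cudaGetErrorString(err), static_cast<int>(err));
        return false;
    }
    return true;
}

// Hypothetical wrapper around the device-to-host copy from the question.
// It also reports which device the calling thread is bound to, so the
// output of the working thread and the failing thread can be compared.
bool copyToHost(void *hostDst, const void *devSrc, size_t bytes)
{
    int device = -1;
    if (!checkCuda(cudaGetDevice(&device), "cudaGetDevice"))
        return false;
    std::fprintf(stderr, "calling thread uses device %d\n", device);

    return checkCuda(cudaMemcpy(hostDst, devSrc, bytes, cudaMemcpyDeviceToHost),
                     "cudaMemcpy");
}
```

Calling copyToHost(pParticlesHost, pParticlesDev, sizeof(Particle)*numParticles) from both threads and comparing the output is one way to narrow down where the two threads differ.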

Before CUDA 4.0, this is what I did:

thread1 :
allocate and initialize device memory
run kernel to update device memory
read updated data back to host every 100th kernel call

thread2:
render the updated data in host memory

Now I am trying to optimize this as

thread1 :
allocate and initialize device memory
run kernel to update device memory

thread2:
load device memory into host memory
render the updated data

It may be that

  • you need to call cudaSetDevice() to ensure that both threads are talking to the same device

  • you need to synchronize access to the buffers between the threads on the host

The following should achieve the behavior you are expecting

// Thread 1                         | // Thread 2
cudaSetDevice(1);                   | cudaSetDevice(1);
cudaMalloc(&pdev, size);            |
cudaMallocHost(&phost, size);       |
pthread_barrier_wait(&barrier);     | pthread_barrier_wait(&barrier);
while (1) {                         | while (1) {
    K<<<...>>>(pdev);               |     cudaMemcpy(phost, pdev, size, cudaMemcpyDeviceToHost);
    K<<<...>>>(pdev);               |     myRender(phost);
}                                   | }

Also, first make sure that the exact same cudaMemcpy() call succeeds when issued from Thread 1, and only then try to migrate it to Thread 2.
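For reference, the two-column sketch could be written out as compilable code roughly as follows. This assumes CUDA 4.0 or later, where host threads that select the same device share that device's context, and a platform with POSIX barriers. The kernel body, the Shared struct, the failure counter and runTwoThreads are all made up for illustration; rendering is left as a comment, and the loops run a fixed number of iterations instead of forever.

```cuda
#include <cstdio>
#include <pthread.h>
#include <cuda_runtime.h>

// Hypothetical stand-in for the kernel K: increments every element.
__global__ void K(float *p, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] += 1.0f;
}

// State shared between the two host threads (illustrative only).
struct Shared {
    pthread_barrier_t barrier; // thread 2 waits here until the buffers exist
    float *pdev;               // device buffer, allocated by thread 1
    float *phost;              // pinned host buffer, allocated by thread 1
    int n, device, iterations;
    int failures;              // written by thread 2 only, read after join
};

static void *thread1(void *arg)
{
    Shared *s = static_cast<Shared *>(arg);
    cudaSetDevice(s->device);
    cudaMalloc(&s->pdev, s->n * sizeof(float));
    cudaMemset(s->pdev, 0, s->n * sizeof(float));
    cudaMallocHost(&s->phost, s->n * sizeof(float));
    pthread_barrier_wait(&s->barrier);          // buffers are now allocated
    for (int i = 0; i < s->iterations; ++i)
        K<<<(s->n + 255) / 256, 256>>>(s->pdev, s->n);
    cudaDeviceSynchronize();                    // wait for all launched kernels
    return NULL;
}

static void *thread2(void *arg)
{
    Shared *s = static_cast<Shared *>(arg);
    cudaSetDevice(s->device);
    pthread_barrier_wait(&s->barrier);          // do not copy before allocation
    for (int i = 0; i < s->iterations; ++i) {
        cudaError_t err = cudaMemcpy(s->phost, s->pdev, s->n * sizeof(float),
                                     cudaMemcpyDeviceToHost);
        if (err != cudaSuccess) {
            std::fprintf(stderr, "cudaMemcpy: %s\n", cudaGetErrorString(err));
            ++s->failures;
        }
        // rendering of s->phost would go here
    }
    return NULL;
}

// Runs both threads and returns how many copies failed in thread 2.
int runTwoThreads(int device, int n, int iterations)
{
    Shared s = {};
    s.n = n; s.device = device; s.iterations = iterations;
    pthread_barrier_init(&s.barrier, NULL, 2);
    pthread_t t1, t2;
    pthread_create(&t1, NULL, thread1, &s);
    pthread_create(&t2, NULL, thread2, &s);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    pthread_barrier_destroy(&s.barrier);
    cudaSetDevice(device);
    cudaFree(s.pdev);
    cudaFreeHost(s.phost);
    return s.failures;
}
```

A call such as runTwoThreads(0, 1024, 100) returning 0 would indicate that the second thread could read the buffer allocated by the first. With a runtime older than 4.0, where each host thread gets its own context, the copy in thread 2 would be expected to fail.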

Chris…,
One thing I liked about your reply: you have a lot of patience!

  • you need to call cudaSetDevice() to ensure that both threads are talking to the same device

I left out some details because I did not want to add more than required, and I was convinced that the extra details were not relevant to the problem.
I do call cudaSetDevice(0) in both threads, and I do synchronize access to the buffer as follows:

thread1 : acquire mutex, run kernel, release mutex
thread2 : acquire mutex, copy memory, release mutex

Now that I write the above, I realize that my thread synchronization may fail because the kernel runs asynchronously. In thread1 the mutex is released before the kernel completes on the device. Could this be the reason for my problems?
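If the asynchronous launch is the concern, one option is to wait for the kernel to finish before the mutex is released. The following is a minimal sketch of that idea, not the code from my application: updateKernel, bufferMutex, updateStep and readStep are hypothetical names, and a float array stands in for the particle buffer.

```cuda
#include <pthread.h>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real update kernel.
__global__ void updateKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Guards the device buffer shared by the two host threads.
static pthread_mutex_t bufferMutex = PTHREAD_MUTEX_INITIALIZER;

// Thread 1: launch the kernel and hold the mutex until it has finished.
cudaError_t updateStep(float *dev, int n)
{
    pthread_mutex_lock(&bufferMutex);
    updateKernel<<<(n + 255) / 256, 256>>>(dev, n);
    cudaError_t err = cudaGetLastError();   // catches launch errors
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();      // blocks until the kernel is done
    pthread_mutex_unlock(&bufferMutex);
    return err;
}

// Thread 2: copy the buffer back while holding the same mutex.
cudaError_t readStep(float *host, const float *dev, int n)
{
    pthread_mutex_lock(&bufferMutex);
    cudaError_t err = cudaMemcpy(host, dev, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    pthread_mutex_unlock(&bufferMutex);
    return err;
}
```

The trade-off is that thread 1 now blocks on every kernel call, so the CPU and GPU no longer overlap there. cudaDeviceSynchronize() is the CUDA 4.0 name for this call; older toolkits use cudaThreadSynchronize().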


EDIT:
The problem turned out to be that someone had rolled part of the code back, so I was in fact not running CUDA 4.0! That explains the “invalid argument” error quite well, and everything is fine now.
Thanks for the reply, though. It forced me to look elsewhere for the cause. Initially I thought I was missing something elementary… which it turns out I was… but something other than what I expected ;-)