cudaMemcpy on 9800 GX2

I wonder whether anybody has experience using the 9800 GX2. The kernel computation seems very fast, but the memory copies take a really long time. How do the two GPUs share memory and divide the computation?

They’re essentially two separate boards occupying a single PCIe slot. There’s no memory sharing between them; each chip has its own 512 MB pool.
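
You can see this from the runtime API: the card enumerates as two devices, each reporting its own memory. A quick sketch (error checking omitted):

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        int count = 0;
        cudaGetDeviceCount(&count);    /* a 9800 GX2 should report 2 */
        for (int dev = 0; dev < count; ++dev) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, dev);
            printf("device %d: %s, %lu bytes of global memory\n",
                   dev, prop.name, (unsigned long)prop.totalGlobalMem);
        }
        return 0;
    }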

They don’t; they show up as two separate devices under CUDA.

So does that mean I should specify which GPU I want to use when invoking the kernel?
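
Yes. With the runtime API you bind a host thread to a device by calling cudaSetDevice() before the first allocation; in CUDA 2.0 a host thread holds a single context, so driving both halves of the card means one host thread per device. A minimal sketch (the kernel and sizes are just placeholders):

    __global__ void myKernel(float *buf, int n)  /* placeholder kernel */
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) buf[i] = 2.0f * buf[i];
    }

    int main(void)
    {
        const int n = 1 << 20;
        float *d_buf;

        /* bind this host thread to the second GPU; must come
           before any cudaMalloc etc. */
        cudaSetDevice(1);

        cudaMalloc((void **)&d_buf, n * sizeof(float)); /* on device 1 */
        myKernel<<<n / 256, 256>>>(d_buf, n);           /* runs on device 1 */
        cudaThreadSynchronize();

        cudaFree(d_buf);
        return 0;
    }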

I’m still wondering why the memcpy is so slow…

I strongly suggest using cudaMemcpyAsync() on pinned memory; that’s how I solved my CUDA bottleneck.
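
For reference, the pattern looks roughly like this (buffer names and the size n are made up): allocate the host buffer with cudaMallocHost() so it is page-locked, then issue the transfer with cudaMemcpyAsync() on a stream. Even a plain synchronous cudaMemcpy is faster from pinned memory, since the driver can DMA directly instead of staging through an internal buffer.

    const size_t n = 1 << 20;
    float *h_buf, *d_buf;
    cudaStream_t stream;

    cudaMallocHost((void **)&h_buf, n * sizeof(float)); /* pinned host memory */
    cudaMalloc((void **)&d_buf, n * sizeof(float));
    cudaStreamCreate(&stream);

    /* returns immediately; the copy proceeds while the host keeps working */
    cudaMemcpyAsync(d_buf, h_buf, n * sizeof(float),
                    cudaMemcpyHostToDevice, stream);

    /* ... other host work here ... */

    cudaStreamSynchronize(stream);   /* wait for the copy to finish */

    cudaStreamDestroy(stream);
    cudaFreeHost(h_buf);
    cudaFree(d_buf);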

It seems the bottleneck is not the memcpy but the first allocation. In my program the first cudaMalloc takes most of the time; the following ones are much faster. Is there any explanation for this?

Yes. The first cuda* call you make initializes the driver and CUDA context, which takes a little time.
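
If that startup cost is polluting your measurements, you can force the initialization up front with a throwaway call; cudaFree(0) is the usual idiom. A sketch (d_buf and bytes are placeholders):

    /* force driver/context initialization here so its cost is not
       charged to the first "real" allocation below */
    cudaFree(0);

    /* subsequent calls now pay only their own cost */
    cudaMalloc((void **)&d_buf, bytes);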

In my application it takes nearly 4 seconds, much longer than what I saw on the 8800 GTX. Is that normal?

That is a bit high. It usually takes under 1 second here. What driver version are you running? Maybe the latest drivers fix this problem.

I’m using 2.0

On the 8800 GTX card I used an older version but got much better performance. I don’t know what’s going on with memory allocation on these new cards.

You’re not timing your code correctly. Kernel launches are asynchronous, so insert a cudaThreadSynchronize() after the kernel call to measure how long it’s really taking.

(This is another thing for the SDK to fix. Why won’t the cutTimer routines do this themselves?)
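
Concretely, the pattern is (myKernel and its launch configuration are placeholders; the timer calls are the cutil ones from the SDK):

    #include <cutil.h>

    unsigned int timer = 0;
    cutCreateTimer(&timer);

    cutStartTimer(timer);
    myKernel<<<grid, block>>>(d_buf);  /* launch returns immediately */
    cudaThreadSynchronize();           /* block until the kernel finishes */
    cutStopTimer(timer);

    printf("kernel time: %.3f ms\n", cutGetTimerValue(timer));
    cutDeleteTimer(timer);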