Can my CUDA kernel access host memory if running on NVIDIA ION?

As far as I understand, host and device memory on an NVIDIA ION system are physically the same, so it should not be necessary to copy data between the device and host memory spaces. However, I tried accessing host memory from a CUDA kernel on ION and it didn’t work. Am I doing something wrong?

Is it still necessary to copy data from the host memory to the device memory on ION?

Vasily

To my understanding, you still need to allocate pinned memory; otherwise there would be problems with the CPU’s cache behavior.

I can’t make it work even with pinned memory.

Say, it works if I allocate device memory using cudaMalloc. But if I change every cudaMalloc to cudaMallocHost, it doesn’t work.

Are you trying to use host pointers as device pointers?

I think that doesn’t work even on ION, as the GPU doesn’t go through the CPU’s MMU and so uses either physical addresses or its own logical address space.

When using pinned memory, the host pointer can’t be used directly on the device, IIRC. You need to call cudaHostGetDevicePointer to get a device pointer before passing it to the kernel function. Take a look at the “simpleZeroCopy” sample for more details.

Thanks, now it works! I also had to make a few other changes: (i) call cudaSetDeviceFlags( cudaDeviceMapHost ) when initializing; (ii) add a couple of cudaThreadSynchronize() to ensure that data has passed over PCIe. There is a useful detailed document on the topic at http://forums.nvidia.com/index.php?showtopic=98502.
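For anyone landing on this thread later, here is a minimal sketch of the steps described above. The kernel scale_by_two and the helper run_zero_copy_demo are hypothetical names made up for illustration, and the sketch uses cudaDeviceSynchronize(), which replaced the now-deprecated cudaThreadSynchronize() in later CUDA releases. It assumes a device that can map host memory and has not been tested on ION itself.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: doubles each element in place.
__global__ void scale_by_two(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Returns the number of wrong elements, or -1 if a CUDA call failed.
int run_zero_copy_demo(int n)
{
    // (i) Enable mapping of host memory; do this before other CUDA work.
    if (cudaSetDeviceFlags(cudaDeviceMapHost) != cudaSuccess) return -1;

    // Pinned host allocation that is also mapped into the device address space.
    float *h_data = NULL;
    if (cudaHostAlloc((void **)&h_data, n * sizeof(float),
                      cudaHostAllocMapped) != cudaSuccess) return -1;
    for (int i = 0; i < n; ++i) h_data[i] = (float)i;

    int bad = -1;
    float *d_alias = NULL;
    // The host pointer is not passed to the kernel; ask for its device alias.
    if (cudaHostGetDevicePointer((void **)&d_alias, h_data, 0) == cudaSuccess) {
        scale_by_two<<<(n + 255) / 256, 256>>>(d_alias, n);

        // (ii) Wait for the kernel before reading the results on the host.
        if (cudaDeviceSynchronize() == cudaSuccess) {
            bad = 0;
            for (int i = 0; i < n; ++i)
                if (h_data[i] != 2.0f * (float)i) ++bad;
        }
    }

    cudaFreeHost(h_data);
    return bad;
}

int main()
{
    int bad = run_zero_copy_demo(1 << 16);
    printf("mismatches: %d\n", bad);
    return bad == 0 ? 0 : 1;
}
```

Note that there is no cudaMemcpy anywhere: the kernel reads and writes the host allocation through d_alias.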

Vasily


New problem: mapped host memory appears to be twice as slow as global memory on NVIDIA ION.

Say, cudaMemcpy() runs at 4.9 GB/s if arrays are allocated using cudaMalloc(), and only at 2.4 GB/s if they are allocated using cudaHostAlloc().

Any fix?

Vasily

I don’t have an ION system, so I’m not sure what the problem is. However, I have a Mac mini with a GeForce 9400M, which is the same GPU as ION.
I ran bandwidthTest with pinned memory and the result is something like:

host-to-device: 4GB/s
device-to-host: 4.4GB/s
device-to-device: 7.4GB/s

So I suspect that host-to-device and device-to-host copies have to go through the PCI Express bus even though both buffers reside in the same physical memory. That’s probably why they are slower.
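To compare the copy directions without the SDK sample, something like the following rough sketch could be used. The helper names copy_bandwidth_gbs and run_bandwidth_demo are hypothetical, the buffer size and repetition count are arbitrary, and each byte copied is counted once, so the numbers may not follow the same conventions as bandwidthTest.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Times `reps` blocking copies of `bytes` each with CUDA events.
// Returns GB/s (bytes copied counted once), or -1 on error.
float copy_bandwidth_gbs(void *dst, const void *src, size_t bytes,
                         cudaMemcpyKind kind, int reps)
{
    cudaEvent_t start, stop;
    if (cudaEventCreate(&start) != cudaSuccess) return -1.0f;
    if (cudaEventCreate(&stop) != cudaSuccess) {
        cudaEventDestroy(start);
        return -1.0f;
    }

    cudaEventRecord(start, 0);
    cudaError_t err = cudaSuccess;
    for (int r = 0; r < reps && err == cudaSuccess; ++r)
        err = cudaMemcpy(dst, src, bytes, kind);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    if (err != cudaSuccess || ms <= 0.0f) return -1.0f;

    return (float)(((double)bytes * reps / 1e9) / (ms / 1e3));
}

// Allocates one pinned host buffer and two device buffers, then reports
// the three copy directions. Returns 0 on success, -1 on failure.
int run_bandwidth_demo(size_t bytes, int reps)
{
    void *h_pinned = NULL, *d_a = NULL, *d_b = NULL;
    int rc = -1;
    if (cudaHostAlloc(&h_pinned, bytes, cudaHostAllocDefault) == cudaSuccess &&
        cudaMalloc(&d_a, bytes) == cudaSuccess &&
        cudaMalloc(&d_b, bytes) == cudaSuccess) {
        float h2d = copy_bandwidth_gbs(d_a, h_pinned, bytes, cudaMemcpyHostToDevice, reps);
        float d2h = copy_bandwidth_gbs(h_pinned, d_a, bytes, cudaMemcpyDeviceToHost, reps);
        float d2d = copy_bandwidth_gbs(d_b, d_a, bytes, cudaMemcpyDeviceToDevice, reps);
        printf("host-to-device:   %.2f GB/s\n", h2d);
        printf("device-to-host:   %.2f GB/s\n", d2h);
        printf("device-to-device: %.2f GB/s\n", d2d);
        if (h2d > 0.0f && d2h > 0.0f && d2d > 0.0f) rc = 0;
    }
    // Freeing a NULL pointer is a no-op for both calls.
    cudaFree(d_b);
    cudaFree(d_a);
    cudaFreeHost(h_pinned);
    return rc;
}

int main()
{
    return run_bandwidth_demo((size_t)32 << 20, 10) == 0 ? 0 : 1;
}
```

The events are recorded on the default stream, so the elapsed time covers all copies issued between them.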

Yeah, I get about the same figures on my ION:

host-to-device: 2.1GB/s
device-to-host: 2.3GB/s
device-to-device: 4.6GB/s

Quite unexpected.