CUDA device memory access?

hding001 · July 6, 2011, 11:14pm

Hi,

I copy a large buffer from the CPU to the GPU. Now I want to access the GPU memory in small slots.

My code is as follows:
// large memory on GPU
cudaMemcpy(d_signal, h_signal, mem_size, cudaMemcpyHostToDevice);
// small slot in memory
for(int i=0; i<N; i++)
{
d_slot[i] = d_signal[i+10];
}

The part above causes a segmentation fault. I thought my operation on the device memory was correct. Can someone help me with this problem?

hamster143 · July 7, 2011, 12:21am

You can’t access device memory like that from host code. But you could call

cudaMemcpy(d_slot, d_signal+10, N*sizeof(d_signal[0]), cudaMemcpyDeviceToDevice);
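
As a minimal sketch of that suggestion, assuming float data: the helper below (copySlotOnDevice is a hypothetical name, not something from the thread) uploads the signal, extracts the slot with a device-to-device copy, and reads the slot back so it can be checked on the host. The pointer arithmetic d_signal + offset is fine in host code because the pointer is only passed along, never dereferenced.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical helper: uploads h_signal, copies the n-element slot that starts
// at `offset` into a second device buffer, and returns that slot to the host.
// Returns an empty vector if the range is invalid or any CUDA call fails.
std::vector<float> copySlotOnDevice(const std::vector<float>& h_signal,
                                    size_t offset, size_t n)
{
    if (n == 0 || offset + n > h_signal.size()) return {};

    const size_t signalBytes = h_signal.size() * sizeof(float);
    const size_t slotBytes   = n * sizeof(float);
    float *d_signal = nullptr, *d_slot = nullptr;
    std::vector<float> h_slot(n, 0.0f);

    bool ok = cudaMalloc((void**)&d_signal, signalBytes) == cudaSuccess
           && cudaMalloc((void**)&d_slot, slotBytes) == cudaSuccess
           // large buffer: host -> device
           && cudaMemcpy(d_signal, h_signal.data(), signalBytes,
                         cudaMemcpyHostToDevice) == cudaSuccess
           // small slot: device -> device, no host dereference of d_signal
           && cudaMemcpy(d_slot, d_signal + offset, slotBytes,
                         cudaMemcpyDeviceToDevice) == cudaSuccess
           // bring the slot back only so the result can be inspected
           && cudaMemcpy(h_slot.data(), d_slot, slotBytes,
                         cudaMemcpyDeviceToHost) == cudaSuccess;

    cudaFree(d_slot);    // cudaFree(nullptr) is a harmless no-op
    cudaFree(d_signal);
    if (!ok) h_slot.clear();
    return h_slot;
}
```

Calling copySlotOnDevice(signal, 10, N) mirrors the loop in the question, with the copy done by the runtime instead of by host code.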

hding001 · July 7, 2011, 1:30am

Thanks a lot. It works. If I want to do it without cudaMemcpy, how should I do it? Write a device function or …?

I am trying to avoid cudaMemcpy because of the overhead. Do you have any suggestions on how to optimize when I have a large amount of data that needs to be copied to the GPU for FFT and IFFT?

Thanks a lot.
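
One possible shape for the "device function" idea, sketched under the assumption of float data: device pointers can be dereferenced inside a kernel, so the loop from the first post can run on the GPU with one thread per element. The names copySlotKernel and copySlotWithKernel are hypothetical. In practice a processing kernel could simply read d_signal + offset directly instead of copying the slot first.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One thread per slot element. d_signal and d_slot are device pointers, so
// dereferencing them is legal here, unlike in host code.
__global__ void copySlotKernel(const float* d_signal, float* d_slot,
                               size_t offset, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) d_slot[i] = d_signal[i + offset];
}

// Hypothetical host wrapper: uploads the signal, launches the kernel, and
// returns the slot. Returns an empty vector on an invalid range or CUDA error.
std::vector<float> copySlotWithKernel(const std::vector<float>& h_signal,
                                      size_t offset, size_t n)
{
    if (n == 0 || offset + n > h_signal.size()) return {};

    const size_t signalBytes = h_signal.size() * sizeof(float);
    const size_t slotBytes   = n * sizeof(float);
    float *d_signal = nullptr, *d_slot = nullptr;
    std::vector<float> h_slot(n, 0.0f);

    bool ok = cudaMalloc((void**)&d_signal, signalBytes) == cudaSuccess
           && cudaMalloc((void**)&d_slot, slotBytes) == cudaSuccess
           && cudaMemcpy(d_signal, h_signal.data(), signalBytes,
                         cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        const unsigned threads = 256;
        const unsigned blocks  = (unsigned)((n + threads - 1) / threads);
        copySlotKernel<<<blocks, threads>>>(d_signal, d_slot, offset, n);
        // The blocking device-to-host copy also waits for the kernel to finish.
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(h_slot.data(), d_slot, slotBytes,
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    cudaFree(d_slot);
    cudaFree(d_signal);
    if (!ok) h_slot.clear();
    return h_slot;
}
```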

ArchaeaSoftware · July 7, 2011, 1:22pm

With discrete GPUs, there is no getting around copying the data across PCI Express. But, you can avoid the overhead of a memcpy call by using mapped pinned memory: instead of calling cudaMalloc() to allocate device memory that you copy to/from, call cudaHostAlloc() with the cudaHostAllocMapped flag. This passes back pinned host memory that you can access with the CPU, but that also has been mapped into the CUDA address space. Call cudaHostGetDevicePointer() to get the device pointer of that pinned memory.
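
A rough sketch of that allocation pattern, with hypothetical names (scaleKernel, scaleThroughMappedInput) and float data assumed: the host fills the pinned buffer directly, and the kernel reads it through the device pointer, so no cudaMemcpy is issued for the input. Toolkits from the era of this thread may additionally need cudaSetDeviceFlags(cudaDeviceMapHost) before any other CUDA call; that step is left out here as an assumption about newer runtimes.

```cuda
#include <cuda_runtime.h>

// Reads the input through a device pointer that aliases mapped pinned host memory.
__global__ void scaleKernel(const float* in, float* out, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = factor * in[i];
}

// Hypothetical helper: fills a mapped pinned buffer with 0, 1, 2, ... on the
// CPU, scales it on the GPU, and writes n results to h_out. Returns false if
// the device cannot map host memory or any CUDA call fails.
bool scaleThroughMappedInput(float factor, int n, float* h_out)
{
    if (n <= 0) return false;

    int dev = 0;
    cudaDeviceProp prop;
    if (cudaGetDevice(&dev) != cudaSuccess ||
        cudaGetDeviceProperties(&prop, dev) != cudaSuccess ||
        !prop.canMapHostMemory)
        return false;

    float *h_in = nullptr, *d_in = nullptr, *d_out = nullptr;
    if (cudaHostAlloc((void**)&h_in, n * sizeof(float),
                      cudaHostAllocMapped) != cudaSuccess)
        return false;

    for (int i = 0; i < n; ++i) h_in[i] = (float)i;   // plain CPU writes

    // Device-side alias of the same pinned memory.
    bool ok = cudaHostGetDevicePointer((void**)&d_in, h_in, 0) == cudaSuccess
           && cudaMalloc((void**)&d_out, n * sizeof(float)) == cudaSuccess;
    if (ok) {
        scaleKernel<<<(n + 255) / 256, 256>>>(d_in, d_out, factor, n);
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(h_out, d_out, n * sizeof(float),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    cudaFree(d_out);
    cudaFreeHost(h_in);
    return ok;
}
```

The data still crosses PCI Express when the kernel reads it; what disappears is the separate copy call and the staging device buffer for the input.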

hding001 · July 7, 2011, 4:22pm

Thanks a lot. I am trying now, will report if I get any results.

hding001 · July 7, 2011, 4:59pm

Hi, thanks a lot for your previous help. It works. Now, for the memory copy from GPU to CPU, is there a good way to avoid the memcpy function call overhead?

Thanks a lot.

ArchaeaSoftware · July 9, 2011, 2:36pm

The kernel can write to the device pointer corresponding to the mapped pinned memory. This is actually a preferred mode of operation, since the GPU is just posting writes and there is no latency to cover. Just remember to synchronize the CPU and GPU before reading the results written by the kernel. cudaDeviceSynchronize() is the big hammer (it waits until the GPU is done processing); for finer-grained synchronization, use CUDA events.
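
A minimal sketch of the output direction, assuming float results and using hypothetical names (fillSquares, squaresIntoMappedOutput): the kernel writes straight into mapped pinned memory, an event is recorded after the launch, and the host waits on that event before reading.

```cuda
#include <cuda_runtime.h>

// Writes results directly into mapped pinned host memory via its device pointer.
__global__ void fillSquares(float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = (float)i * (float)i;
}

// Hypothetical helper: has the GPU write i*i into a mapped pinned buffer, waits
// on an event recorded after the kernel, then copies the values into h_copy
// with ordinary CPU reads. Returns false if any CUDA call fails.
bool squaresIntoMappedOutput(int n, float* h_copy)
{
    if (n <= 0) return false;

    float *h_out = nullptr, *d_out = nullptr;
    if (cudaHostAlloc((void**)&h_out, n * sizeof(float),
                      cudaHostAllocMapped) != cudaSuccess)
        return false;

    cudaEvent_t done;
    if (cudaEventCreate(&done) != cudaSuccess) {
        cudaFreeHost(h_out);
        return false;
    }

    bool ok = cudaHostGetDevicePointer((void**)&d_out, h_out, 0) == cudaSuccess;
    if (ok) {
        fillSquares<<<(n + 255) / 256, 256>>>(d_out, n);
        ok = cudaGetLastError() == cudaSuccess
          // The event is queued behind the kernel in the default stream...
          && cudaEventRecord(done, 0) == cudaSuccess
          // ...so waiting on it guarantees the kernel's writes have completed.
          && cudaEventSynchronize(done) == cudaSuccess;
    }
    if (ok)
        for (int i = 0; i < n; ++i) h_copy[i] = h_out[i];  // no cudaMemcpy

    cudaEventDestroy(done);
    cudaFreeHost(h_out);
    return ok;
}
```

Waiting on an event only blocks until the work queued before it in that stream has finished, which is what makes it finer-grained than cudaDeviceSynchronize() once several streams are in flight.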

alrikai · July 19, 2011, 3:54am

Just as a warning, don’t overdo it on the pinned memory. Pinned memory cannot be paged out, which can be bad if you allocate too much of it: since your system can’t page it out, it has less overall memory to work with, which in turn can increase your page fault rate.

ArchaeaSoftware · July 20, 2011, 9:58am

You’d be surprised how much pinned memory you can allocate before it starts to noticeably drag on system performance.

I was: a couple of years ago, I downloaded a CPU benchmark (this was on Windows; I think it was Futuremark) and ran it as a baseline, then created a CUDA program that performed a pinned allocation of a specified size and waited for a keypress before exiting.

That way, I could pin a variable amount of memory and rerun the benchmark, watching for performance degradation.

The benchmark didn’t start to exhibit slower performance until 50% of physical RAM had been pinned. If I recall correctly, the machine got pretty sluggish (but still useable) at 75%. I was surprised the OS was letting me pin that much, to be honest with you; no application should try to pin that much memory anyway.

It was long enough ago that I’d have to re-do the test in order to report the results more formally. But it’s not a hard test to undertake yourself.

Obviously, YMMV on the specific machine and the workloads you are running concurrently with your CUDA app, but I have not felt the slightest twinge of guilt about allocating pinned memory since doing that study.
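
The original test program is not shown in the thread. A rough sketch of what such a program might look like follows, with hypothetical function names (pinHostBytes, holdPinnedUntilKeypress) and the allocation size supplied by the caller.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical helper: pins `bytes` of host memory, or returns nullptr if the
// runtime refuses the request.
void* pinHostBytes(size_t bytes)
{
    void* p = nullptr;
    if (cudaHostAlloc(&p, bytes, cudaHostAllocDefault) != cudaSuccess)
        return nullptr;
    return p;
}

// Hypothetical driver: holds the pinned allocation until Enter is pressed, so
// a benchmark can be run in the meantime. Returns 0 on success, 1 on failure.
int holdPinnedUntilKeypress(size_t bytes)
{
    void* p = pinHostBytes(bytes);
    if (p == nullptr) {
        fprintf(stderr, "Could not pin %zu bytes\n", bytes);
        return 1;
    }
    printf("Pinned %zu bytes; press Enter to release.\n", bytes);
    getchar();
    cudaFreeHost(p);
    return 0;
}
```

A small main() that parses a size from the command line and calls holdPinnedUntilKeypress() would reproduce the workflow described above: start it with a given size, rerun the benchmark, press Enter, and repeat with a larger size.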

alrikai · August 3, 2011, 10:13pm

Ah that’s very interesting! Maybe I should check that out…

shawkie · August 4, 2011, 7:58am

If the system has 96GB of memory and Windows 7 can run happily on 2GB, then wouldn’t you expect to be able to pin 94GB without it affecting general system performance? A lot of the problems that GPUs are being used to tackle require a lot of memory, and with the introduction of the unified address space and the improved cache on Fermi, I would have thought that you might well want to pin this much memory.

In fact, something else I’ve been wondering is whether NVIDIA might make it possible to use some device memory as a hardware-managed cache. I think it would already be possible to implement a fully associative cache inside a kernel (I once considered doing this with shared memory on previous-generation hardware), but I don’t think it would perform very well, and a more complex n-way cache would probably be quite tricky to do in software.

shawkie · August 5, 2011, 8:20am

This is quite an interesting link:

Archived MSDN and TechNet Blogs | Microsoft Docs

I’m assuming that the nonpaged pool limits are what apply to pinned memory. So, if you really wanted to, you could pin approximately 75% of system memory up to 128GB on Windows 7 x64.
