Performance issues accessing pinned memory 1070 / 3060

Hi,

I see different performance on a 1070 and a 3060 when accessing pinned memory allocated with cuMemAllocHost.

I have a kernel that accesses the same 32-bit location about 32 times. All threads access the same location. On a 3060 I cannot see any performance impact, whether it accesses the memory or not.

On a 1070 I see an almost 40% performance drop if it accesses the memory.
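
A simplified sketch of the access pattern (the real kernel does other work in between; the names are just placeholders):

    // Sketch only: 'flag' points to a 32-bit pinned host allocation made with
    // cuMemAllocHost and accessed directly from the kernel.
    __global__ void worker(const volatile unsigned int *flag, unsigned int *out)
    {
        unsigned int acc = 0;
        for (int i = 0; i < 32; ++i)      // ~32 reads of the same 32-bit word
        {
            acc += *flag;                 // every thread reads the same pinned location
            // ... other per-thread work and memory traffic in between ...
        }
        out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
    }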

Is there anything I can do to get better performance on a 1070 when the kernel accesses pinned memory?

Thanks a lot,
Daniel

What results do you get from running the bandwidthTest on the 1070?

Hi,

I do not think that this is a bandwidth issue. We are talking about 32 bits. I have not run any test so far, but I can say that the bandwidth of all GPUs I use is in the expected range; no issues so far. The problem I see is that accessing these 32 bits on a 30xx has no measurable impact on kernel performance compared to not accessing them, while running the same kernel on a 1070 gives a 30% - 40% performance drop.

My guess is that it is related to the memory type (pinned memory) and that I forgot to set some flags or the like - or that this is simply expected behavior on a 1070.

If the kernel is performing other memory operations in between the pinned system memory reads, then warps may be stalled in the SM waiting for the longer-latency pinned system memory access. The 1070 (GP10x) L1TEX cache requires in-order completion of all loads for all warps going to that L1TEX (each SM has 2 L1TEX units). The 3060 (GA10x) L1TEX supports out-of-order return between warps and out-of-order return on hits.

Recent Nsight Compute does not support the Pascal architecture. An older version of Nsight Compute or NVVP is required to profile the 1070.

Thanks for this information. Would it bring some improvement if I replaced the read from pinned memory with this:

  • copy pinned memory asynchronously to device global memory (if this is possible as an async operation while a kernel is active)
  • issue a prefetch to the desired memory location - or - async copy the value from global to shared memory
  • read the value - either the prefetched global value or the value from shared memory

The Pascal L1TEX is in order across warps in the SM. This means that the order in which warps, in the same thread block or in different thread blocks, issue memory operations matters.

  • copy pinned memory asynchronously to device global memory (if this is possible as an async operation while a kernel is active)

The asynchronous copy engine can run at the same time as the GR engine; however, I don’t think you will have the level of control that you require. If the values in pinned system memory do not change during the kernel, then you may want to try a cudaMemcpyAsync HtoD prior to the kernel and a cudaMemcpyAsync DtoH after the kernel, and access the data in device memory.
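
A minimal sketch of that staging approach, assuming the runtime API and a value that is constant while the kernel runs (worker, h_pinned, d_value, and the launch configuration are placeholders; error checking omitted):

    #include <cuda_runtime.h>

    __global__ void worker(const unsigned int *d_value /*, ... */)
    {
        unsigned int v = *d_value;   // now an ordinary device-memory read
        // ... use v ...
    }

    void run(unsigned int *h_pinned, unsigned int *d_value, cudaStream_t stream)
    {
        // Stage the pinned host value into device memory before the kernel runs.
        cudaMemcpyAsync(d_value, h_pinned, sizeof(unsigned int),
                        cudaMemcpyHostToDevice, stream);

        worker<<<80, 256, 0, stream>>>(d_value);

        // Copy any result the kernel produced back to pinned host memory afterwards.
        cudaMemcpyAsync(h_pinned, d_value, sizeof(unsigned int),
                        cudaMemcpyDeviceToHost, stream);
    }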

  • issue a prefetch to the desired memory location - or - async copy the value from global to shared memory

I don’t think a prefetch instruction (available in PTX) will do anything for pinned system memory.

  • read the value - either the prefetched global value or the value from shared memory

You have not specified how much pinned system memory is read or if the value changes by either GPU writes or CPU writes.

Reading from shared memory implies the value is not changing.

If the data is read-only then

  • if the data is small then consider passing via kernel parameters (can be up to 4KiB)
  • if the data is accessed often then try cudaMemcpyAsync (HtoD) prior to kernel execution
  • if the data is accessed many times by multiple warps then try to cache it in shared memory (see the sketch after this list). System memory is not cached in GPU L2 but may be cached in L1 if you use the correct command line options and instructions.
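
For the last point, a minimal sketch of caching a small read-only value in shared memory once per block (placeholder names; assumes the value does not change while the kernel runs):

    __global__ void worker(const unsigned int *sys_value /* pinned system memory */)
    {
        __shared__ unsigned int cached;

        // One thread per block pays the long-latency system-memory read ...
        if (threadIdx.x == 0)
            cached = *sys_value;
        __syncthreads();

        // ... and all other threads read the cheap shared-memory copy.
        unsigned int v = cached;
        // ... use v ...
    }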

If the data is mutable by GPU only then

  • try cudaMemcpyAsync (HtoD) prior to kernel execution

If the data is mutable by CPU during execution then

Thanks a lot for the detailed information.

The data is 4 bytes. All threads read this integer, and it is modified by the CPU during execution of the kernel. That works great on most modern GPUs, but not on the 10xx and older. I didn’t check 20xx.

Thanks.

Is there any recommended way to send some data to an executing kernel other than using pinned memory?

Thanks.

There are various methods; this answer covers one and links to others. Ultimately it is possible to do it using pinned memory, ordinary device memory, or managed memory.
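
As a rough illustration of the pinned-memory variant for a 4-byte value: the host updates the value while the kernel polls it through a volatile pointer (placeholder names, runtime API, error checking omitted; memory from cuMemAllocHost can be used the same way on systems with unified addressing):

    #include <cuda_runtime.h>

    constexpr unsigned int STOP = 0xFFFFFFFFu;   // placeholder sentinel value

    // The kernel re-reads a 32-bit value the host updates while it is running.
    // 'volatile' keeps the compiler from caching the load in a register.
    __global__ void worker(const volatile unsigned int *h_flag)
    {
        while (true)
        {
            unsigned int v = *h_flag;   // fresh read from pinned system memory
            if (v == STOP)
                break;
            // ... do work based on v ...
        }
    }

    int main()
    {
        unsigned int *h_flag = nullptr;
        // Pinned, mapped host allocation; with unified addressing the same
        // pointer is directly usable on the device.
        cudaHostAlloc((void **)&h_flag, sizeof(unsigned int), cudaHostAllocMapped);
        *h_flag = 0;

        worker<<<1, 32>>>(h_flag);       // kernel starts polling

        *h_flag = 42;                    // host updates the value mid-kernel
        // ...
        *h_flag = STOP;                  // tell the kernel to finish

        cudaDeviceSynchronize();
        cudaFreeHost(h_flag);
        return 0;
    }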

For the record, Nsight Compute 2019.5 is the last Linux version to support SM 6.1 - here.