New to CUDA having memory transfer issues

Hi, I’m relatively new to CUDA, but I’ve coded in OpenCL before and I’m trying to learn CUDA.
(GPU Programming looks to be trending towards CUDA and I really only need to know OpenCL for FPGAs now.)

Anyways I have a relatively simple example that is based off of Mark Harris’ example from his “An Even Easier Introduction to CUDA.”

You should be able to copy and run the program.
Anyways, my main question is why I’m only seeing 1GBps on DtoH on average and 191MBps on HtoD?
I’ve achieved 11GBps on my OpenCL stuff and even the bandwidthTest sample states that I should be getting ~11GBps.

Or is it that I’m just looking at the profiler incorrectly?

Any help would be greatly appreciated!

Another question would be does, cudaMallocManaged create pinned memory? I would assume that it does but it always helps to make sure.

Thanks,

BHa

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>

#include <iostream>
#include <math.h>

//for perf. counters
#include <Windows.h>

using namespace std;

// Kernel function to add the elements of two arrays
__global__ void add(float *a, float *b, float *c)
{
   const int x = threadIdx.x;
   const int y = blockIdx.x;
   const int width = blockDim.x;

   const int id = y * width + x;
   
   c[id] = a[id] + b[id];
}

int main(void)
{
   int N = 4194304;
   size_t size = N * sizeof(float);
   int blockSize = 128;
   int numBlocks = (N + blockSize - 1) / blockSize;

   LARGE_INTEGER perfFrequency;
   LARGE_INTEGER performanceCountNDRangeStart;
   LARGE_INTEGER performanceCountNDRangeStop;

   float *x = NULL;
   float *y = NULL;
   float *z = NULL;

   // Allocate Unified Memory – accessible from CPU or GPU
   cudaMallocManaged(&x, size);
   cudaMallocManaged(&y, size);
   cudaMallocManaged(&z, size);

   // initialize x and y arrays on the host
   for (int i = 0; i < N; ++i) 
   {
      x[i] = 1.0f;
      y[i] = 2.0f;
   }
   cout << "\nSingle Queue\n";
   QueryPerformanceCounter(&performanceCountNDRangeStart);
   // Run kernel on 1M elements on the GPU
   add <<<numBlocks, blockSize>>>(x, y, z);

   // Wait for GPU to finish before accessing on host
   cudaDeviceSynchronize();
   QueryPerformanceCounter(&performanceCountNDRangeStop);
   QueryPerformanceFrequency(&perfFrequency);
   printf("NDRange performance counter time %f ms.\n", 1000.0f*(float)(performanceCountNDRangeStop.QuadPart - performanceCountNDRangeStart.QuadPart) / (float)perfFrequency.QuadPart);

// Check for errors (all values should be 3.0f)
   for (unsigned int a = 0; a < N; ++a)
   {
      if (z[a] != x[a] + y[a])
         printf("Kernel 0 Failure @ %d: %0.2f != %0.2f + %0.2f\n", a, z[a], x[a], y[a]);
   }

   // Free memory
   cudaFree(x);
   cudaFree(y);
   cudaFree(z);

   return 0;
}

what GPU are you running on?

cudaMallocManaged creates “managed” memory, which is definitely not the same as pinned memory.

I’m running on a K5100M.

So should I use cudaMalloc with cudaMemcpy instead then to get pinned memory speed?

The benchmarking methodology used here appears unsuitable. It seems the operations under test are invoked just once. At minimum, for benchmarking anything you would want to run everything twice and measure on the second pass to account for one-time startup and warm-up effects. With memory operations of any kind, it is customary to run a higher number of times (e.g. 10 times, as in the STREAM benchmark) and record the fastest time. This gives a good assessment of steady-state performance.

Due to the asynchronous nature of many GPU activities, it is a good idea to insert a call to cudaDeviceSynchronize() before starting the timed portion of a CUDA benchmark, to make sure all previous activity has finished.

PCIe transmissions are packetized and have a fixed overhead. Maximum throughput is usually not achieved until transfer block size reaches 16 MB or so. When measuring throughput between system memory and GPU via PCIe it is also important to maximize the throughput of the system memory, which is especially important in a dual sockets scenario, where careful CPU and memory bindings must be used to have the GPU communicate with the “near” CPU.

With a PCIe gen3 x16 link, you should see a throughput of 11.5 GB/sec simultaneous upstream / downstream at full performance. I am not familiar with the K5100M, does it have PCI gen 3 and dual DMA engines?

Note that robustly benchmarking anything CUDA on a Windows platform with the default WDDM driver can be a major challenge due to performance issues and artifacts caused by WDDM. If possible, use a TCC driver with Windows, or use (a professional grade) Linux.

@njuffa

My apologies, I should have had it set up to run multiple times, that is something that I do in OpenCL for that exact reason but had forgotten to do here.

That being said, even @ 10 times, I still only get a speed of ~26ms which compared to a similar OpenCL setup I have that is running ~6.2ms is still not similar.

I expect CUDA to be around that speed (and I’m assuming that it will be once I fix the transfer issues I have)

That being said, switching to Linux is not “relatively” easy for me at this point.

I also do not believe that I can place my GPU into TCC mode, as it is servicing my display (laptop). (This also means that installing another GPU is not easy for me either.)

Thanks!

BHa

Looking at the posted code again, I don’t see that what is being measured is the speed of PCIe transfers, but the execution time of a kernel? Color me confused.

What is the purpose of these measurements? I find examining the performance of actual CUDA application code with the help of the powerful CUDA profiler much more fruitful than running benchmarks, which are often tricky to configure to make sure that one isolates and measures exactly what one intended to measure (and often difficult to correlate back to application-level performance on top of that).

@njuffa

I totally understand the confusion. Yes I’m measuring the execution time of the kernel, but the kernel is simply an add operation which is a minuscule amount of time. This should roughly put out the amount of time that the transfers take (with overhead).

That being said, I’m also looking at NVVP at the same time and maybe due to my own inexperience with NVVP or some other issues, I’m only seeing average transfer speeds of 1GBps. (I’m looking underneath Unified Memory on the timeline and clicking on DtoH and HtoD to find the transfer speeds.)

The real purpose of this to just obtain familiarity with CUDA. I’m essentially learning it from scratch and I’m trying to attempt things I know should work (compared to OpenCL) and see how they perform comparatively.

Thanks for your help though!

BHa

this particular code seems to measure time required to send data over pci-e plus access them once

the culprit is what you don’t use pinned memory, so copying data works via double-buffering

CUDA SDK contains bandwidthTest.cu - llok up word PINNED there

@BulatZiganshin

Right that’s what I was thinking in my response to txBob.

Is the correct way to setup pinned memory through the use of cudaMalloc & cudaMemcpy or is there another way of doing so. (If possible maybe a link would be incredibly helpful)

It looks like in the programming guide, the correct call is using cudaHostAlloc and cudaFreeHost.

If the goal is to measure PCIe throughput in isolation, take a look at the CUDA sample application bandwidthTest. In general, you would want to use pinned system memory for such experiments, to avoid hidden overhead e.g. additional copies.

bandwidthTest should at least give you a good idea how to measure uni-directional PCIe throughput. Not sure whether it offers a mode to measure simultaneous upstream / downstream traffic which requires a GPU with dual DMA controllers, and the use of CUDA streams on the software side.

As I stated, PCIe throughput is very much a function of transfer block size. For small block sizes effective throughput can be very low due to fixed overhead of the interconnect. It will be instructive to run bandwidthTest with different block sizes to get a feel for that.

@njuffa

I ran it through shmoo mode and saw at roughly 1024000 that it was running @ 11GBps.

Anyways, using the programming guide and realizing that the method to obtain pinned memory was through the use of cudaHostAlloc and cudaHostGetDevicePointers, I was able to see the speeds that I was looking for.

Maybe in the future, cudaMallocManaged uses pinned memory as well to make life easier for the programmer, but no big deal in my opinion.

I was able to get speeds closer to ~3.3ms which is a significant difference compared to my OpenCL version which is good news for me.

Thanks for everyone’s help!
I’m slowly getting better at CUDA!

BHa

Has anyone ever profiled Thrust before in this kind of regard?

I’m assuming the Thrust devs are acquainted with CUDA well enough to know to use cudaHostAlloc for their host_vector implementation, for example. I’m away from my CUDA machine so I can’t test right now.

AFAIK thrust doesn’t automatically used pinned memory for host allocations.

thrust has an experimental pinned allocator. If you simply google that, you’ll find docs/examples

here’s one:

http://stackoverflow.com/questions/25064383/how-to-asynchronously-copy-memory-from-the-host-to-the-device-using-thrust-and-c

It should be noted that the performance disparity between PCIe transfers using pinned host memory and those using regular paged host memory has diminished over the years, as Intel is paying more attention to system memory bandwidth these days (reducing the overhead of the host-to-host copy needed with paged host memory).

Now that AMD is gearing up to deliver even faster x86 system memory configurations (by increasing the number of DDR4 channels for their planned Naples server platforms to eight), we may see that difference shrink even more.

txbob, thank you for the link.

That’s so interesting. I wonder why they consider a pinned-memory allocator to be experimental. Are there some subtle caveats I’m not seeing up front?

Pinned host allocations are not without issues. They are primarily intended to be used in the way the CUDA driver uses them internally: it creates a good-sized (single-digits MB) pinned buffer at context creation and re-uses it for all DMA transfers from and to the GPU.

Since pinned host allocations require physically contiguous chunks of memory, these allocations may be slower to create, are more likely to cause system memory fragmentation, and can interfere with the operating system’s page-based memory virtualization. Operating systems may limit pinned allocations to just a fraction of the total system memory, or may experience significant slowdown if many pinned allocations are used.

Hypothesis: These may be reason why pinned allocations are not supported by Thrust as a standard feature.

That’s so cool about pinned memory!

Thank you for your detailed explanation on the inner workings. Those are some really good reasons why an allocator like that would be experimental.

I don’t even think I’m transferring data back and forth enough to see any real benefits from pinned memory anyway but it’s good to know how it works. And it’s even better to know the hidden costs of using pinned memory and that it seems like it’s something that shouldn’t be used all willy nilly.

Truth be told, I’d rather see some C++14 support and move semantics added to some of the Thrust stuff than a pinned allocator now :P