Is this a fair way of timing UMA code?

Hey all,

I’ve been tasked with benchmarking the performance of code that uses unified data buffers as opposed to the traditional host/device model. To be honest I didn’t expect any difference in performance, which is why the results I’m getting are confusing me.

I’ve written a contrived example that adds 1 to a buffer of floats once on the device and once on the host. The performance of the UMA code is consistently better than that of the traditional cuda code (for lack of a better term) on my Jetson TK1

I’ve written a scoped timing struct that uses the cudaEvent_t type to keep track of when it was constructed/destructed. On destruction it spits out its lifespan.

// Scoped timing struct
struct CudaStopWatch{
   std::string name;             // Name of timed region
   cudaEvent_t start, stop;      // Start/Stop event
   CudaStopWatch(std::string n)  // Constructor
    : name(n)
      cudaEventCreate(&start);   // Create start event
      cudaEventCreate(&stop);    // Create stop event
      cudaEventRecord(start);    // Start recording
   ~CudaStopWatch(){             // On destruction
      cudaEventRecord(stop);     // Stop recording
      cudaEventSynchronize(stop);// And synchronize the event

      // Print out the elapsed time
      float mS(0.f);
      cudaEventElapsedTime(&mS, start, stop);
      printf("%s took %f mS to execute\n", name.c_str(), mS);

If my thinking is sound, this struct should time the Cuda code that occurs from its instantiation to the end of its scope.

Here’s the actual code of interest:

#include <stdio.h>
#include <string>

#define MAX_THREADS 1024

// Add 1 to buf
void inc(float * data, int N){
   int idx = threadIdx.x+blockDim.x*blockIdx.x;
   if (idx < N) data[idx]++;

// Test the UMA version
void testUma(int N){
   // Size of data buffer, in bytes
   const int nBytes = sizeof(float) * N;

   // Allocate unified buffer
   float * data(0);
   cudaMallocManaged((void **)&data, nBytes);

   // Initialize buffer data
   for (int i=0; i<N; i++)
      data[i] = float(i);

   {  // Time execution
      CudaStopWatch CSW("UMA");

      // add 1 on device, copy back, add 1 on host
      inc<<< N / MAX_THREADS, MAX_THREADS >>>(data, N);
      for (int i=0; i<N; i++)


// Test the traditional version
void testCuda(int N){
   // Size of data buffer, in bytes
   const int nBytes = sizeof(float) * N;

   // Allocate host and device buffers
   float * h_Data(0), * d_Data(0);
   h_Data = (float *)malloc(nBytes);
   cudaMalloc((void **)&d_Data, nBytes);

   // Initialize host buffer
   for (int i=0; i<N; i++)
      h_Data[i] = float(i);

   { // Time execution
      CudaStopWatch CSW("CUDA");

      // Copy to device
      cudaMemcpy(d_Data, h_Data, nBytes, cudaMemcpyHostToDevice);
      // add 1 on device, add 1 on host
      inc<<< N / MAX_THREADS, MAX_THREADS >>>(d_Data, N);
      // copy back
      cudaMemcpy(h_Data, d_Data, nBytes, cudaMemcpyDeviceToHost);
      // add 1
      for (int i=0; i<N; i++)

   // Free

int main(){
   const int N(1000 * MAX_THREADS);


   return 0;

In the UMA version, I start timing after I’ve initialized the data on the host, whereas in the traditional version I start timing before I copy the data from the host to the device.

My assumption was that until I actually call a kernel on my unified buffer it hasn’t yet been uploaded, and that the upload only occurs after I start timing. However, I can’t tell if this is true, since I don’t know if nvcc is doing anything tricky under the hood to optimize the data transfer.

Does this timing code seem fair? Sorry if it’s too long of a block, but it’s nothing complex I promise!

“since I don’t know if nvcc is doing anything tricky under the hood to optimize the data transfer”

sound thought; however, i struggle to sustain it

to at all be able to optimize the transfer, one access pattern must be a) faster, or b) able to receive ‘updates’ faster than the other
i may be wrong, but i do not see how uma would be faster than ordinary memory copies in terms of actual transfer, or how it would be able to receive the updated data quicker - both cases generally wait for the kernel to finish, and thus receives the data at the same time, i would think

your end event recording is only when the structure is deconstructed
note the difference between your 2 measures: the one has an additional memory free before the structure’s deconstruction
hence, the measure would be fair to the extent that the time taken by the (additional) memory overhead (free) is a fraction of total time measured
you could have the host array allocated beforehand, and destroyed afterwards, to remove this potential nuisance variable

and it might be helpful to consider measures with different packet sizes - that being transferred

even best would be, in case of a consistently better measurement of uma versus ordinary transfer, being able to offer reasons for it
i cant think why uma should be faster - there is little grounds for it in my mind

UM can be faster on Jetson because Jetson (i.e. TK1) has a fundamentally different architecture than most desktop/notebook/Tesla GPUs. The host memory and device memory are physically unified. Therefore no copying is required (under UM) and UM (on TK1) takes advantage of this. If, on the other hand, you do actually create separate “host” and “device” copies of the data, as you “normally” would, and then copy between them, that will, of course, be slower.

I’m not sure I follow. Do you mean the destruction of the CudaStopWatch object, or the free of the buffer? I believe the object is destructed once the code leaves the scope in which the kernel is called (note the “unnecessary” brackets), so I don’t think any of the malloc/free calls on the host/device get counted.

Neat. A quick google search of “tk1 unified” has turned up some interesting results, so it looks like I have some reading to do.

Thanks, you’ve been quite the help

edit: Actually, are you (or anyone else reading this) aware of anything in the datasheet / technical reference manual that would indicate that the GPU and CPU memory are physically unified? I’ve got an official source here

As well as a stackoverflow post that Robert Crovella replied to, but I was wondering if there was an official diagram or something…

I don’t know about a diagram, but take a careful look at the deviceQuery output from your TK1:

Integrated GPU sharing Host Memory: Yes

contrast that with the corresponding line from the C2050 deviceQuery output in the same thread.

Here’s a block diagram of Tegra TK1:

Ha, I was just going to ask about that. Although I thought it had to do with context sharing… I don’t know why.

How many times will I need you to tell me to read the manual?

Thanks again.