Hey all,
I’ve been tasked with benchmarking the performance of code that uses unified data buffers as opposed to the traditional host/device model. To be honest I didn’t expect any difference in performance, which is why the results I’m getting are confusing me.
I’ve written a contrived example that adds 1 to a buffer of floats, once on the device and once on the host. The performance of the UMA code is consistently better than that of the traditional CUDA code (for lack of a better term) on my Jetson TK1.
I’ve written a scoped timing struct that uses the cudaEvent_t type to keep track of when it was constructed/destructed. On destruction it spits out its lifespan.
// Scoped timing struct
struct CudaStopWatch{
    std::string name;               // Name of timed region
    cudaEvent_t start, stop;        // Start/stop events
    CudaStopWatch(std::string n)    // Constructor
        : name(n)
    {
        cudaEventCreate(&start);    // Create start event
        cudaEventCreate(&stop);     // Create stop event
        cudaEventRecord(start);     // Start recording
    }
    ~CudaStopWatch(){               // On destruction
        cudaEventRecord(stop);      // Stop recording
        cudaEventSynchronize(stop); // And synchronize the event
        // Print out the elapsed time
        float mS(0.f);
        cudaEventElapsedTime(&mS, start, stop);
        printf("%s took %f mS to execute\n", name.c_str(), mS);
        // Clean up the events so they don't leak
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
    }
};
If my thinking is sound, this struct should time the CUDA code that executes between its instantiation and the end of its scope.
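For example, a trivial, self-contained use would look something like this (just a sketch; the buffer and the cudaMemset call are placeholders for whatever work is being timed):

void timeMemset(){
    float * d_buf(0);
    cudaMalloc((void **)&d_buf, sizeof(float) * 1024);
    { // Timed scope
        CudaStopWatch CSW("memset");
        cudaMemset(d_buf, 0, sizeof(float) * 1024);
    } // Destructor fires here and prints the elapsed time
    cudaFree(d_buf);
}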
Here’s the actual code of interest:
#include <stdio.h>
#include <string>

#define MAX_THREADS 1024

// Add 1 to each element of data
__global__
void inc(float * data, int N){
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    if (idx < N) data[idx]++;
}

// Test the UMA version
void testUma(int N){
    // Size of data buffer, in bytes
    const int nBytes = sizeof(float) * N;
    // Allocate unified buffer
    float * data(0);
    cudaMallocManaged((void **)&data, nBytes);
    // Initialize buffer data
    for (int i = 0; i < N; i++)
        data[i] = float(i);
    { // Time execution
        CudaStopWatch CSW("UMA");
        // Add 1 on device, synchronize, then add 1 on host
        // (N is an exact multiple of MAX_THREADS, so no remainder block is needed)
        inc<<< N / MAX_THREADS, MAX_THREADS >>>(data, N);
        cudaDeviceSynchronize();
        for (int i = 0; i < N; i++)
            data[i]++;
    }
    // Free
    cudaFree(data);
}

// Test the traditional version
void testCuda(int N){
    // Size of data buffer, in bytes
    const int nBytes = sizeof(float) * N;
    // Allocate host and device buffers
    float * h_Data(0), * d_Data(0);
    h_Data = (float *)malloc(nBytes);
    cudaMalloc((void **)&d_Data, nBytes);
    // Initialize host buffer
    for (int i = 0; i < N; i++)
        h_Data[i] = float(i);
    { // Time execution
        CudaStopWatch CSW("CUDA");
        // Copy to device
        cudaMemcpy(d_Data, h_Data, nBytes, cudaMemcpyHostToDevice);
        // Add 1 on device
        inc<<< N / MAX_THREADS, MAX_THREADS >>>(d_Data, N);
        // Copy back (the blocking cudaMemcpy waits for the kernel to finish)
        cudaMemcpy(h_Data, d_Data, nBytes, cudaMemcpyDeviceToHost);
        // Add 1 on host
        for (int i = 0; i < N; i++)
            h_Data[i]++;
    }
    // Free
    free(h_Data);
    cudaFree(d_Data);
}

int main(){
    const int N(1000 * MAX_THREADS);
    testUma(N);
    testCuda(N);
    return 0;
}
In the UMA version, I start timing after I’ve initialized the data on the host, whereas in the traditional version I start timing before I copy the data from the host to the device.
My assumption was that the unified buffer hasn’t actually been uploaded to the device until I launch a kernel on it, so the migration only happens after I start timing. However, I can’t tell if this is true, since I don’t know whether nvcc or the CUDA runtime is doing anything tricky under the hood to optimize the data transfer.
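To make the comparison easier to interpret, one variation would be to time the copy and the kernel separately in the traditional version. A sketch of what I mean, reusing the struct and the variable names from testCuda above:

    { // Time the host-to-device copy on its own
        CudaStopWatch CSW("CUDA H2D copy");
        cudaMemcpy(d_Data, h_Data, nBytes, cudaMemcpyHostToDevice);
    }
    { // Then time the kernel on its own
        CudaStopWatch CSW("CUDA kernel");
        inc<<< N / MAX_THREADS, MAX_THREADS >>>(d_Data, N);
        cudaDeviceSynchronize();
    }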
Does this timing code seem fair? Sorry if it’s too long of a block, but it’s nothing complex I promise!