Background DMA making applications faster!?

Hi,

I am running the sample matrix multiplication code that ships with the CUDA samples (NVIDIA_CUDA-9.1_Samples/0_Simple/matrixMul). When I run it on its own, it takes ~0.24 ms on my GTX 1070.

But if I run a background application that calls cudaMemcpy() in a loop (copying 128 MB from host to device), the runtime of matrixMul drops to ~0.16 ms. That is a 33% reduction!

Why is the runtime decreasing? Shouldn't it be increasing, since the cudaMemcpy() traffic should cause interference (in terms of reduced memory bandwidth available to matrixMul)?

If it helps, below is my code for the background process:

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>
#include <inttypes.h>
#include <assert.h>

#define MEM_SIZE (1024 * 1024 * 1024)

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
    if (code != cudaSuccess)
    {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}

void initializeEvents(cudaEvent_t *start, cudaEvent_t *stop){
    gpuErrchk( cudaEventCreate(start) );
    gpuErrchk( cudaEventCreate(stop) );
    gpuErrchk( cudaEventRecord(*start, 0) );
}

float finalizeEvents(cudaEvent_t start, cudaEvent_t stop){
    gpuErrchk( cudaGetLastError() );
    gpuErrchk( cudaEventRecord(stop, 0) );
    gpuErrchk( cudaEventSynchronize(stop) );
    float kernel_time;
    gpuErrchk( cudaEventElapsedTime(&kernel_time, start, stop) );
    gpuErrchk( cudaEventDestroy(start) );
    gpuErrchk( cudaEventDestroy(stop) );
    return kernel_time;
}

void *allocate_gpu_contigous(size_t mem)
{
    size_t size = mem;
    int device = -1;
    void *gpu_mem;
    cudaEvent_t start, stop;
    float time;

    gpuErrchk(cudaGetDevice(&device));

    initializeEvents(&start, &stop);
    gpuErrchk(cudaMallocManaged(&gpu_mem, size));
    time = finalizeEvents(start, stop);
    printf("CudaMallocManaged: Size:0x%lx, Time: %f ms\n", size, time);

    memset(gpu_mem, 0x1, size);

    initializeEvents(&start, &stop);
    gpuErrchk(cudaMemPrefetchAsync(gpu_mem, size, device, NULL));
    time = finalizeEvents(start, stop);
    printf("cudaMemPrefetchAsync To GPU: Size:0x%lx, Time: %f ms\n", size, time);

    return gpu_mem;
}

void memcpy_to_device(void *dest, void *src, size_t size, int eval)
{
    cudaEvent_t start, stop;
    float time;

    if (eval) {
        initializeEvents(&start, &stop);
        gpuErrchk(cudaMemcpy(dest, src, size, cudaMemcpyHostToDevice));
        time = finalizeEvents(start, stop);
        printf("CudaMemCpy (HostToDevice): Size:0x%lx, Time: %f ms\n", size, time);
    } else {
        gpuErrchk(cudaMemcpy(dest, src, size, cudaMemcpyHostToDevice));
    }
}

void memcpy_to_host(void *dest, void *src, size_t size, int eval)
{
    cudaEvent_t start, stop;
    float time;

    if (eval) {
        initializeEvents(&start, &stop);
        gpuErrchk(cudaMemcpy(dest, src, size, cudaMemcpyDeviceToHost));
        time = finalizeEvents(start, stop);
        printf("CudaMemCpy (DeviceToHost): Size:0x%lx, Time: %f ms\n", size, time);
    } else {
        gpuErrchk(cudaMemcpy(dest, src, size, cudaMemcpyDeviceToHost));
    }
}

int main()
{
    void *gpu_mem = (void *)allocate_gpu_contigous(MEM_SIZE);
    void *cpu_mem = (void *)malloc(MEM_SIZE);
    assert(gpu_mem != NULL);
    assert(cpu_mem != NULL);

    memset(cpu_mem, 0x1, MEM_SIZE);

    // Warmup
    memcpy_to_device(gpu_mem, cpu_mem, MEM_SIZE, true);

    while (1) {
        memcpy_to_device(gpu_mem, cpu_mem, MEM_SIZE, false);
    }
}

No idea how you are measuring time or what OS this is.

If on Linux, the background process may be keeping the GPU out of an idle state, which will reduce the overall CUDA start-up time, possibly affecting your foreground application wallclock time.
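
To illustrate what I mean, here is a minimal sketch (my own, not from the sample; the helper time_dummy_call is just something I made up) that times the very first CUDA runtime call with a host-side clock. On a GPU that has dropped back to an idle state, that first call tends to absorb most of the startup/context-creation cost, while a repeat of the same call is nearly free:

#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

// Times a trivial runtime call with a host-side (wallclock) timer.
// The first such call in a process triggers context creation, so its
// cost depends on how "awake" the GPU/driver already are.
static double time_dummy_call(void)
{
    auto t0 = std::chrono::steady_clock::now();
    cudaFree(0);                 // forces CUDA context initialization on first use
    cudaDeviceSynchronize();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main()
{
    printf("first  call: %8.3f ms\n", time_dummy_call());  // includes startup/init
    printf("second call: %8.3f ms\n", time_dummy_call());  // startup already paid
    return 0;
}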

I am running on Linux.
For time measurement, the sample code itself is measuring time (using cudaEventElapsedTime).
So two questions:

  1. How do I stop the GPU from going into an idle state?
  2. What does cudaEventElapsedTime use for time calculation (host process runtime or wall clock)?

The best you can do is set persistence mode in nvidia-smi

Not 100% sure persistence mode is supported for GeForce GPUs, but I think it is.

The event system tracks activity on the device, but the situation is not as simple as that statement may sound.

Nevertheless, if you put your CUDA event markers (again - you haven’t shown that app, right?) around the first CUDA statements in your code, you are likely to pick up startup/initialization time. If you put enough CUDA statements prior to your first event record, then most of the startup overhead should be gone (although wallclock time may still vary).
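
Something like this minimal sketch (my own illustration, with a made-up dummy_kernel rather than the matrixMul kernel) shows the idea: issue a few untimed warm-up CUDA calls first, so the region between the event records no longer includes context creation:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummy_kernel() { }   // hypothetical stand-in for the real kernel

int main()
{
    // Warm-up: these untimed calls absorb context creation / startup cost.
    cudaFree(0);
    dummy_kernel<<<1, 1>>>();
    cudaDeviceSynchronize();

    // Only now start the timed region.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    dummy_kernel<<<1, 1>>>();        // the work you actually want to measure
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("timed kernel: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}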

I wonder whether, in addition to persistence mode and basic power-management handling, there may also be an influence from automatic clock boosting, where GPU usage by the background task leads to higher boost clocks.

With any short-running CUDA app, it is interesting to see what happens when one runs it multiple times in a row, recording the performance for each invocation. Often, the first run is pretty slow, it gets faster on subsequent runs (say the first half dozen), then performance plateaus, and as one keeps repeating the runs, ultimately performance drops again as the GPU heats up.

Best I could tell when I looked into this phenomenon some years ago, different boost clocks were selected for different runs, corresponding to the observed performance differences. Since the details of the boost clock mechanism are not public, this behavior may differ between GPUs, driver versions, and environmental factors.
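
For a quick in-process experiment (not quite the same as separate invocations, since it skips per-run startup cost), a rough sketch like the following, again with a made-up dummy_kernel, can show the same drift: launch the same kernel repeatedly and watch the per-run times change as clocks ramp up or the GPU heats up:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummy_kernel(float *out)
{
    // Trivial kernel: a bit of arithmetic and a global store per thread.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = sqrtf((float)i);
}

int main()
{
    const int n = 1 << 20;
    float *d_out = NULL;
    cudaMalloc((void **)&d_out, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Launch the same kernel many times and report each run's time;
    // variation between iterations hints at boost-clock / thermal effects.
    for (int run = 0; run < 20; run++) {
        cudaEventRecord(start, 0);
        dummy_kernel<<<n / 256, 256>>>(d_out);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("run %2d: %f ms\n", run, ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_out);
    return 0;
}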

I agree. It’s possible clock boosting could be a factor. If you combine persistence mode with setting application clocks, it may help to smooth this out. I don’t remember if setting application clocks in nvidia-smi is supported for GeForce, but I think it is.

I also think it would be useful to profile this to find out what exactly is changing; using cudaEvent timing is probably still too coarse to figure this out.
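
One low-tech way to see what is changing, short of a full profiler, is to poll the SM and memory clocks via NVML while the two scenarios run. A rough sketch of my own, assuming the NVML header and library are available (link with -lnvidia-ml):

#include <cstdio>
#include <unistd.h>
#include <nvml.h>

// Polls the current SM and memory clocks once a second.
// Run this alongside matrixMul, with and without the background
// memcpy process, to see whether the boost clocks differ.
int main()
{
    if (nvmlInit() != NVML_SUCCESS) {
        fprintf(stderr, "nvmlInit failed\n");
        return 1;
    }

    nvmlDevice_t dev;
    if (nvmlDeviceGetHandleByIndex(0, &dev) != NVML_SUCCESS) {
        fprintf(stderr, "could not get device handle\n");
        nvmlShutdown();
        return 1;
    }

    for (int i = 0; i < 30; i++) {
        unsigned int sm_clk = 0, mem_clk = 0;
        nvmlDeviceGetClockInfo(dev, NVML_CLOCK_SM, &sm_clk);
        nvmlDeviceGetClockInfo(dev, NVML_CLOCK_MEM, &mem_clk);
        printf("SM clock: %u MHz, memory clock: %u MHz\n", sm_clk, mem_clk);
        sleep(1);
    }

    nvmlShutdown();
    return 0;
}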