Background DMA making applications faster!?

Hi,

I am running the sample matrix multiplication code that ships with the CUDA samples (NVIDIA_CUDA-9.1_Samples/0_Simple/matrixMul). When I run it on its own, it takes ~0.24 ms on my GTX 1070.

But if I run a background application that calls cudaMemcpy() in a loop (copying 128 MB from host to device), the runtime of matrixMul drops to ~0.16 ms. That is a 33% reduction!

Why is the runtime decreasing? Shouldn't it be increasing, since the cudaMemcpy() traffic should cause interference (in terms of reduced memory bandwidth available to matrixMul)?

If it helps, below is my code for the background process:

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>
#include <inttypes.h>
#include <assert.h>

#define MEM_SIZE (1024 * 1024 * 1024)

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
    if (code != cudaSuccess)
    {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}

void initializeEvents(cudaEvent_t *start, cudaEvent_t *stop){
    gpuErrchk( cudaEventCreate(start) );
    gpuErrchk( cudaEventCreate(stop) );
    gpuErrchk( cudaEventRecord(*start, 0) );
}

float finalizeEvents(cudaEvent_t start, cudaEvent_t stop){
    gpuErrchk( cudaGetLastError() );
    gpuErrchk( cudaEventRecord(stop, 0) );
    gpuErrchk( cudaEventSynchronize(stop) );
    float kernel_time;
    gpuErrchk( cudaEventElapsedTime(&kernel_time, start, stop) );
    gpuErrchk( cudaEventDestroy(start) );
    gpuErrchk( cudaEventDestroy(stop) );
    return kernel_time;
}

void *allocate_gpu_contigous(size_t mem)
{
    size_t size = mem;
    int device = -1;
    void *gpu_mem;
    cudaEvent_t start, stop;
    float time;

    gpuErrchk(cudaGetDevice(&device));

    initializeEvents(&start, &stop);
    gpuErrchk(cudaMallocManaged(&gpu_mem, size));
    time = finalizeEvents(start, stop);
    printf("CudaMallocManaged: Size:0x%lx, Time: %f ms\n", size, time);

    memset(gpu_mem, 0x1, size);

    initializeEvents(&start, &stop);
    gpuErrchk(cudaMemPrefetchAsync(gpu_mem, size, device, NULL));
    time = finalizeEvents(start, stop);
    printf("cudaMemPrefetchAsync To GPU: Size:0x%lx, Time: %f ms\n", size, time);

    return gpu_mem;
}

void memcpy_to_device(void *dest, void *src, size_t size, int eval)
{
    cudaEvent_t start, stop;
    float time;

    if (eval) {
        initializeEvents(&start, &stop);
        gpuErrchk(cudaMemcpy(dest, src, size, cudaMemcpyHostToDevice));
        time = finalizeEvents(start, stop);
        printf("CudaMemCpy (HostToDevice): Size:0x%lx, Time: %f ms\n", size, time);
    } else {
        gpuErrchk(cudaMemcpy(dest, src, size, cudaMemcpyHostToDevice));
    }
}

void memcpy_to_host(void *dest, void *src, size_t size, int eval)
{
    cudaEvent_t start, stop;
    float time;

    if (eval) {
        initializeEvents(&start, &stop);
        gpuErrchk(cudaMemcpy(dest, src, size, cudaMemcpyDeviceToHost));
        time = finalizeEvents(start, stop);
        printf("CudaMemCpy (DeviceToHost): Size:0x%lx, Time: %f ms\n", size, time);
    } else {
        gpuErrchk(cudaMemcpy(dest, src, size, cudaMemcpyDeviceToHost));
    }
}

int main()
{
    void *gpu_mem = (void *)allocate_gpu_contigous(MEM_SIZE);
    void *cpu_mem = (void *)malloc(MEM_SIZE);
    assert(gpu_mem != NULL);
    assert(cpu_mem != NULL);

    memset(cpu_mem, 0x1, MEM_SIZE);

    // Warmup
    memcpy_to_device(gpu_mem, cpu_mem, MEM_SIZE, true);

    while (1) {
        memcpy_to_device(gpu_mem, cpu_mem, MEM_SIZE, false);
    }
}

No idea how you are measuring time or what OS this is.

If on Linux, the background process may be keeping the GPU out of an idle state, which will reduce the overall CUDA start-up time, possibly affecting your foreground application wallclock time.
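
To illustrate what I mean, here is a minimal sketch (my own, not from the sample; the helper time_dummy_call is just something I made up) that times the very first CUDA runtime call with a host-side clock. On a GPU that has dropped back to an idle state, that first call tends to absorb most of the startup/context-creation cost, while a repeat of the same call is nearly free:

#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

// Times a trivial runtime call with a host-side (wallclock) timer.
// The first such call in a process triggers context creation, so its
// cost depends on how "awake" the GPU/driver already are.
static double time_dummy_call(void)
{
    auto t0 = std::chrono::steady_clock::now();
    cudaFree(0);                 // forces CUDA context initialization on first use
    cudaDeviceSynchronize();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main()
{
    printf("first  call: %8.3f ms\n", time_dummy_call());  // includes startup/init
    printf("second call: %8.3f ms\n", time_dummy_call());  // startup already paid
    return 0;
}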

I am running on Linux.
For time measurement, the sample code itself is measuring time (using cudaEventElapsedTime).
So two questions:

  1. How do I stop the GPU from going into an idle state?
  2. What does cudaEventElapsedTime use for time calculation (host process runtime or wall clock)?

The best you can do is set persistence mode in nvidia-smi

Not 100% sure persistence mode is supported for GeForce GPUs, but I think it is.

The event system tracks activity on the device, but the situation is not as simple as that statement may sound.

Nevertheless, if you put your CUDA event markers (again - you haven’t shown that app, right?) around the first CUDA statements in your code, you are likely to pick up startup/initialization time. If you put enough CUDA statements prior to your first event record, then most of the startup overhead should be gone (although wallclock time may still vary).
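
Something like this minimal sketch (my own illustration, with a made-up dummy_kernel rather than the matrixMul kernel) shows the idea: issue a few untimed warm-up CUDA calls first, so the region between the event records no longer includes context creation:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummy_kernel() { }   // hypothetical stand-in for the real kernel

int main()
{
    // Warm-up: these untimed calls absorb context creation / startup cost.
    cudaFree(0);
    dummy_kernel<<<1, 1>>>();
    cudaDeviceSynchronize();

    // Only now start the timed region.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    dummy_kernel<<<1, 1>>>();        // the work you actually want to measure
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("timed kernel: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}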

I wonder whether, in addition to persistence mode and basic power-management handling, there may also be an influence from automatic clock boosting, where GPU usage by the background task leads to higher boost clocks.

With any short-running CUDA app, it is interesting to see what happens when one runs it multiple times in a row, recording the performance for each invocation. Often, the first run is pretty slow, it gets faster on subsequent runs (say the first half dozen), then performance plateaus, and as one keeps repeating the runs, ultimately performance drops again as the GPU heats up.

Best I could tell when I looked into this phenomenon some years ago, different boost clocks were selected for different runs, corresponding to the observed performance differences. Since the details of the boost clock mechanism are not public, this behavior may differ between GPUs, driver versions, and environmental factors.
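
For a quick in-process experiment (not quite the same as separate invocations, since it skips per-run startup cost), a rough sketch like the following, again with a made-up dummy_kernel, can show the same drift: launch the same kernel repeatedly and watch the per-run times change as clocks ramp up or the GPU heats up:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummy_kernel(float *out)
{
    // Trivial kernel: a bit of arithmetic and a global store per thread.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = sqrtf((float)i);
}

int main()
{
    const int n = 1 << 20;
    float *d_out = NULL;
    cudaMalloc((void **)&d_out, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Launch the same kernel many times and report each run's time;
    // variation between iterations hints at boost-clock / thermal effects.
    for (int run = 0; run < 20; run++) {
        cudaEventRecord(start, 0);
        dummy_kernel<<<n / 256, 256>>>(d_out);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("run %2d: %f ms\n", run, ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_out);
    return 0;
}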

I agree. It’s possible clock boosting could be a factor. If you combine persistence mode with setting application clocks, it may help to smooth this out. I don’t remember if setting application clocks in nvidia-smi is supported for GeForce, but I think it is.

I also think it would be useful to profile this to find out what exactly is changing; using cudaEvent timing is probably still too coarse to figure this out.
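
One low-tech way to see what is changing, short of a full profiler, is to poll the SM and memory clocks via NVML while the two scenarios run. A rough sketch of my own, assuming the NVML header and library are available (link with -lnvidia-ml):

#include <cstdio>
#include <unistd.h>
#include <nvml.h>

// Polls the current SM and memory clocks once a second.
// Run this alongside matrixMul, with and without the background
// memcpy process, to see whether the boost clocks differ.
int main()
{
    if (nvmlInit() != NVML_SUCCESS) {
        fprintf(stderr, "nvmlInit failed\n");
        return 1;
    }

    nvmlDevice_t dev;
    if (nvmlDeviceGetHandleByIndex(0, &dev) != NVML_SUCCESS) {
        fprintf(stderr, "could not get device handle\n");
        nvmlShutdown();
        return 1;
    }

    for (int i = 0; i < 30; i++) {
        unsigned int sm_clk = 0, mem_clk = 0;
        nvmlDeviceGetClockInfo(dev, NVML_CLOCK_SM, &sm_clk);
        nvmlDeviceGetClockInfo(dev, NVML_CLOCK_MEM, &mem_clk);
        printf("SM clock: %u MHz, memory clock: %u MHz\n", sm_clk, mem_clk);
        sleep(1);
    }

    nvmlShutdown();
    return 0;
}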