Hello
I am a real beginner in CUDA programming, and I figured out that even though my very simple codes work, I didn't really understand how CUDA works.
Thus I have here some basic questions on synchronisation between the GPU & CPU:
First question:
I have seen the notion of streams. I don't really use them, so if I understood correctly, by default all my GPU instructions go to stream 0 of the GPU.
A stream is a succession of CUDA operations that are executed in order (not in parallel). So by default, if I don't use this notion, all the CUDA calls I write in my C++ main program will be executed in order.
But the GPU and the CPU can run asynchronously. This mainly happens when a kernel is launched (meaning a computation by a CUDA function).
Am I right for this first question?
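To make this question concrete, here is a minimal sketch of what I have in mind (the kernel name and sizes are just placeholders I made up):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Trivial placeholder kernel, only for illustration.
__global__ void dummyKernel(float *d_data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_data[i] *= 2.0f;
}

int main() {
    const int N = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, N * sizeof(float));

    // Without any explicit stream, both launches go to the default
    // stream (stream 0) and execute one after the other on the GPU...
    dummyKernel<<<(N + 255) / 256, 256>>>(d_data, N);
    dummyKernel<<<(N + 255) / 256, 256>>>(d_data, N);

    // ...but each launch returns immediately, so the CPU reaches this
    // printf while the GPU may still be working (asynchronous launch).
    printf("CPU is here, GPU may still be running\n");

    // Block the CPU until all queued GPU work is done.
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```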
Next question: take the code from this page:
https://devblogs.nvidia.com/parallelforall/how-implement-performance-metrics-cuda-cc/
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);
cudaEventRecord(start);
saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);
cudaEventRecord(stop);
cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);
cudaEventSynchronize(stop);
float milliseconds = 0;
cudaEventElapsedTime(&milliseconds, start, stop);
In this example I don't understand what the function cudaEventCreate does. Is it just an initialisation of the variables start and stop? In general, when we measure execution time, we create a timer and read its value before & after the instruction we want to time. So here I just don't understand the "philosophy" of this time measurement in CUDA (why do we need cudaEventCreate when we just declared the cudaEvent_t variables on the line above?).
I would also like to be really sure that in this code we really measure the execution time of the saxpy computation, and that there is nothing else "hidden" behind it. Indeed, I only just realised that the kernel is launched asynchronously with respect to the CPU (yes, I am a beginner), and I want to be 100% certain I am not missing another point here.
Also, imagine that I had a function on the CPU, so that I would have:
cudaEventRecord(start);
saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);
CPUfunction();
cudaEventRecord(stop);
Would the measured time include this CPU execution (which runs asynchronously with respect to the GPU), or do these timers only measure what happens on stream 0 in this example (so the CPU elapsed time is not taken into account: what is measured is only what happens on my GPU)?
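To make this question self-contained, here is a sketch of the experiment I imagine (CPUfunction is just a stand-in busy loop I made up, and I compare the event time against a CPU wall-clock measurement):

```cuda
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Stand-in for CPUfunction(): pure CPU busy work, made up for this sketch.
void CPUfunction() {
    volatile double s = 0.0;
    for (int i = 0; i < 10000000; ++i) s += i * 0.5;
}

int main() {
    const int N = 1 << 20;
    float *d_x, *d_y;
    cudaMalloc(&d_x, N * sizeof(float));
    cudaMalloc(&d_y, N * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    auto cpu_t0 = std::chrono::steady_clock::now();
    cudaEventRecord(start);                        // queued on stream 0
    saxpy<<<(N + 255) / 256, 256>>>(N, 2.0f, d_x, d_y);
    CPUfunction();                                 // runs on the CPU meanwhile
    cudaEventRecord(stop);                         // queued on stream 0
    cudaEventSynchronize(stop);
    auto cpu_t1 = std::chrono::steady_clock::now();

    float gpu_ms = 0.0f;
    cudaEventElapsedTime(&gpu_ms, start, stop);
    double cpu_ms =
        std::chrono::duration<double, std::milli>(cpu_t1 - cpu_t0).count();

    // My question: does gpu_ms include the time spent in CPUfunction(),
    // or only the GPU work between the two recorded events?
    printf("event time: %f ms, CPU wall time: %f ms\n", gpu_ms, cpu_ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_x);
    cudaFree(d_y);
    return 0;
}
```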
Final question:
I have read about the cudaDeviceSynchronize() function. So when I use it in my code, just after this line the GPU and the CPU are synchronised. It is not just "the CPU will wait for the GPU if the CPU was faster"; it really works in both directions: right after this line both the CPU and GPU have a fresh start (for example, if I measured a timestamp on the GPU & CPU just after this line, I would get the same value).
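For example, here is a small test I imagine writing to check this, with a deliberately slow made-up kernel and CPU timers around the launch and the synchronisation:

```cuda
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

// Deliberately slow kernel, made up so the GPU lags behind the CPU.
__global__ void slowKernel(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        for (int k = 0; k < 1000; ++k)
            d[i] = d[i] * 1.0001f + 1.0f;
}

int main() {
    const int N = 1 << 20;
    float *d;
    cudaMalloc(&d, N * sizeof(float));

    auto t0 = std::chrono::steady_clock::now();
    slowKernel<<<(N + 255) / 256, 256>>>(d, N);
    auto t1 = std::chrono::steady_clock::now();  // launch returns immediately
    cudaDeviceSynchronize();                     // CPU blocks here until the GPU is done
    auto t2 = std::chrono::steady_clock::now();

    printf("launch: %f ms, sync wait: %f ms\n",
           std::chrono::duration<double, std::milli>(t1 - t0).count(),
           std::chrono::duration<double, std::milli>(t2 - t1).count());

    cudaFree(d);
    return 0;
}
```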
Thank you a lot.