Very newbie questions on synchronisation between GPU & CPU, and time measurement

Hello

I am a really beginner in CUDA programming and I figured out that even if my very simple codes work, I didn’t really understood how CUDA works.

Thus I have here some basics question on synchronisation times between GPU & CPU :

First question :

I have seen the notion of streams. I don’t really use them so if I understood well by default I am on the stream 0 of the GPU for all my GPU instructions.

A stream is a succession of CUDA code that is executed in order (not in parallel). So by default, if I don’t use this notion all the CUDA lines I will write in my main program of C++ will be executed in order.

But the GPU and the CPU can run in an asynchronous way. It mainly happens when a kernel is launched (it means a computation by a cuda function)

Am I right for this first question ?

Next question : take the code from this page :

https://devblogs.nvidia.com/parallelforall/how-implement-performance-metrics-cuda-cc/

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);

cudaEventRecord(start);
saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);
cudaEventRecord(stop);

cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);

cudaEventSynchronize(stop);
float milliseconds = 0;
cudaEventElapsedTime(&milliseconds, start, stop);

In this example I don’t understand what we do with the function cudaEventCreate ? Is it just an initialisation of the variables start and stop ? Indeed in general when we measure execution time we usually create a timer and we measure the value of the timer before & after the instruction we want to time. So here I just don’t understand the “philosophy” of this time measurement in CUDA (what is the need of this cudaEventCreate as we just created the variables cudaEvent_t on the line above).

And I also would like to be really sure that in this code we really mesure the time execution of the computation saxpy and there is no other thing “hidden” behind. Indeed I just figured out the fact the Kernel is launched in parallel of the GPU (yes I am a beginner), and I want to be a 100% certain I don’t miss another point here.

Also, imagine that I had a function on the CPU such that I would have :

cudaEventRecord(start);
saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);
CPUfunction()
cudaEventRecord(stop);

would the measure time include this CPU execution (that is launched in an asynchronous way with the GPU), or these timers only measure what is happening on the stream0 in this example (so the cpu elapsed time are not taken in account, what is measured is only what is happening on my GPU).

Final question :

I have read about the cudaDeviceSynchronize() function. So when I use it in my code, what it does is that just after this line, the GPU and the CPU are synchronised. It is not just “the CPU will wait the GPU if the CPU was faster”, it really works in both directions : right after this line both the CPU and GPU have a fresh start (if I measure the timestamp on the GPU & CPU just after this line, I would have the same value for example).

Thank you a lot.

Yes, correct. CUDA streams represent a sequence of CUDA activity that will run in issue-order on the GPU. However host and device activity can still be asynchronous to each other (overlapping).

It is an initialization of the variable, and it also “registers” it (not the same thing as recording, which occurs later) with the CUDA runtime system. There is a need for something other than a “simple” variable because of the asynchrony between host and device. The cudaEvent gets “recorded” into a stream at a particular point (i.e. after some other “recorded” or “issued” activity, and before some other activity), but the event is not actually deemed to have “occurred” or been “completed” until the CUDA stream execution reaches that point. Again, because of asynchrony. As a result, a simple, purely host-based variable won’t work

The code is not guaranteed to measure only the saxpy function, for at least 2 reasons.

  1. If the saxpy function is so short that the subsequent CPUfunction() does not fully overlap with it, then the duration between the two events will be greater than that predicted by the duration of the saxpy function. The end event does not get recorded until CPUfunction is complete, so this particular sequence will measure the time from the start of saxpy until the end of CPUfunction, or the end of saxpy, whichever is later in time. If you want to measure just saxpy, in a single-stream scenario, then record your events only around saxpy. The general description I am giving here may also be affected if there is or is not prior CUDA activity not reflected in the snippet here.

  2. Use of CUDA events for time measurement in a multi-stream scenario may not give expected results. Your sample does not obviously indicate multi-stream activity, and might be construed to be using the default stream semantics which would preclude this concern, however CUDA default stream semantics are now modifiable. This is a rather complicated topic, but the hazard is referred to here:

http://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__EVENT.html#group__CUDART__EVENT_1g40159125411db92c835edb46a0989cd6

For multi-stream use of any kind, cudaEvent based timing must be used carefully, if at all. It’s commonly recommended to use some other timing scheme, instead. Again, your sample does not obviously indicate multi-stream usage, but since it’s just a snippet and you are a newbie, then going for full disclosure here.

Thank you for your fast answer.

I have some questions with your last point :

I would like to understand why the cudaEventRecord(stop); is waiting for the CPU operation to complete.
Is it the case because this command synchronise the GPU and the CPU, THEN measure the time (thus the return time will measure the two times).

Or is it because of any other reason I don’t know (and thus I will indeed measure the two times CPU+GPU but both won’t be synchronised at the end of it).

And to be sure : if I only use one stream and I put the cudaEventRecord right before and after the saxpy function, in this case I’m 100% sure to only really measure the time execution of this function ? No more hidden stuff behind ?

cudaEventRecord(stop) is not “waiting for the CPU operation to complete”. The CPU thread does not reach the point of issuing the cudaEventRecord(stop) UNTIL the CPU thread has finished executing the CPUfunction(). This is the nature of ordinary C program execution. A statement B issued after a statement A will not begin executing until A is complete. The cudaEventRecord() operation does not begin executing (so the event is not recorded) until CPUfunction() is compleete.

Asynchrony means we have multiple timelines to consider. Consider the following timelines:

HOST:     START KERNEL_LAUNCH CPUfunction EXECUTION****** STOP

DEVICE:   START KERNEL_EXECUTION****************          STOP
          |<---------------duration-----------------------|

Assuming no prior activity, the cudeEventRecord(START) operation is performed, and this records the event in the device timeline. Since there is no prior activity, the event immediately completes. Next, the host issues the kernel launch for saxpy. On the device timeline the kernel begins executing. The host kernel launch returns control to the host thread before the kernel is finished executing, so while the kernel continues to execute, the host will begin processing CPUfunction(). When CPU function is complete, the host thread will proceed, and the next thing occurring is the cudaEventRecord(STOP) call. This records the event on the device timeline, and if the saxpy function has already finished, then the STOP event will “complete” immediately, but the completion time is not earlier than the time it was recorded. And it was not recorded until CPUfunction was complete. Therefore the overall duration is longer than just the execution of saxpy (for this particular made-up example).

Yes, to my knowledge, that should yield expected results on Windows TCC model and on Linux. On Windows WDDM driver model, there may still be issues, because Windows WDDM driver model allows for the “batching” of commands to the GPU. This means that even the cudeEventRecord(STOP) function is not guaranteed to be issued when the CPU thread issues it. See this recent question:

https://devtalk.nvidia.com/default/topic/1027690/cuda-programming-and-performance/kernel-won-t-start-until-cudadevicesynchronize-is-called/

I am ok with all this.

Ok I think I get what I didn’t understand. In fact I really have to consider that everything I put in my main is CPU code that just make calls to the Kernel of the GPU. The Kernel in itself never read the main.

Thus if I take this very simple code :

saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);
CPUfunction();
saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);

And I admit for example saxpy last 1s in both case and CPUfunction last 10s.

So in fact, if I understood you well, it does the following :

The CPU calls the first saxpy, then it launches CPUfunction. The call is really fast but the real work is now being done on the GPU in parallel.

So the first saxpy and CPUfunction() are launched in parallel : thus those two first lines last 10s.

Then the second saxpy called is called when CPUfunction has finished. The kernel never read the main directly.

Overall time : 11s.

What I thought it worked was :

On the first saxpy, the GPU launch the process and as soon as it is finished it launches the next GPU process it finds in the program.

So from my perspective it was be 1s+1s in parallel of 10s: so 10s globally.

But it is wrong as the GPU never read the main.cpp, it is only waiting for calls.

Sorry, it was probably a very basic question but I was confused.