Transfer results of executing kernel calculation to host

geohei · May 24, 2022, 2:47pm

Hi.

I’d like my host code to access kernel calculation results during actual kernel execution.

My kernel runs for a couple of days. I could split it into chunks, but I would like to avoid the overhead that comes with kernel starts (data copied to shared memory, initialization of variables, …). Presently, the results of the kernel calculation are printed out, but a power failure would force me to start again from scratch. I’d like to write a restore point of the running calculation to file every minute.

Since CUDA doesn’t provide an API to save data to file, I was wondering whether there is something like a unified memory monitoring API that I could use in host code to save intermediate data calculated on the device.

Generally speaking, is there a possibility for host code to probe device data during execution?

Robert_Crovella · May 24, 2022, 2:52pm

One approach would be to use either pinned memory or unified memory (Linux and Pascal or beyond) to transfer data from device to host during kernel execution. To get reliable transfer, it’s necessary to understand concepts like volatile and memory fencing.

This may be of interest.
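A minimal sketch of the pinned-memory variant, assuming a single publishing thread and a tiny result. The names `longKernel` and `pollProgress` and the busy-wait loop are hypothetical placeholders, not taken from any NVIDIA sample. The kernel writes its payload, issues `__threadfence_system()`, then writes a sequence number; the host polls the mapped buffer through `volatile` pointers while the kernel is still running.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for a long-running kernel: after each step, a single
// thread publishes its intermediate result to mapped pinned host memory.
__global__ void longKernel(volatile unsigned long long *result,
                           volatile int *seq, int steps)
{
    unsigned long long acc = 0;
    for (int i = 1; i <= steps; ++i) {
        long long t0 = clock64();
        while (clock64() - t0 < 5000000) { }  // busy-wait standing in for real work
        acc += i;
        *result = acc;            // 1. write the payload
        __threadfence_system();   // 2. order the payload before the flag, system-wide
        *seq = i;                 // 3. announce that step i has been published
    }
}

// Launches the kernel and polls the mapped buffer while it runs.
// Returns the last sequence number the host observed.
int pollProgress(int steps)
{
    unsigned long long *h_result = nullptr, *d_result = nullptr;
    int *h_seq = nullptr, *d_seq = nullptr;

    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaHostAlloc((void **)&h_result, sizeof(*h_result), cudaHostAllocMapped);
    cudaHostAlloc((void **)&h_seq, sizeof(*h_seq), cudaHostAllocMapped);
    *h_result = 0;
    *h_seq = 0;
    cudaHostGetDevicePointer((void **)&d_result, h_result, 0);
    cudaHostGetDevicePointer((void **)&d_seq, h_seq, 0);

    longKernel<<<1, 1>>>(d_result, d_seq, steps);  // returns immediately

    // volatile keeps the compiler from caching the values in registers
    volatile int *v_seq = h_seq;
    volatile unsigned long long *v_result = h_result;
    int last = 0;
    do {
        int s = *v_seq;
        if (s != last) {
            // Note: the payload read here may already belong to a newer step than s.
            printf("step %d published, result so far %llu\n", s, *v_result);
            last = s;
        }
    } while (cudaStreamQuery(0) == cudaErrorNotReady);  // kernel still running

    cudaDeviceSynchronize();
    last = *v_seq;  // final value after the kernel has finished
    printf("kernel finished at step %d, result %llu\n", last, *v_result);

    cudaFreeHost(h_result);
    cudaFreeHost(h_seq);
    return last;
}

int main()
{
    return pollProgress(20) == 20 ? 0 : 1;
}
```

For a real checkpoint consisting of more than one word, the host would additionally need some handshake (for example, re-reading the sequence number after copying the payload) to be sure the snapshot is consistent; that part is left out here.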

Another approach would be to have your kernel halt on some periodic basis, write results to memory, then restart the kernel calculation from there (that is, relaunch the kernel for the next time step).
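The halt-and-relaunch idea could look roughly like the sketch below. Everything in it is an assumption for illustration: the `State` struct, the `advance` kernel, the `runChunked` helper, and the checkpoint file name are hypothetical. The state stays in device global memory between launches, so only the small checkpoint is copied to the host after each chunk; anything the kernel keeps in shared memory would still have to be reloaded on every launch.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical resumable state of the computation.
struct State {
    unsigned long long step;
    double value;
};

// Advances the computation by a bounded number of steps, then returns.
__global__ void advance(State *s, unsigned long long stepsThisLaunch)
{
    State local = *s;  // resume from where the previous launch stopped
    for (unsigned long long i = 0; i < stepsThisLaunch; ++i) {
        local.value += 1.0 / (double)(local.step + 1);  // stand-in for real work
        ++local.step;
    }
    *s = local;  // write the restore point back to global memory
}

bool saveCheckpoint(const char *path, const State &s)
{
    FILE *f = fopen(path, "wb");
    if (!f) return false;
    bool ok = fwrite(&s, sizeof(s), 1, f) == 1;
    return (fclose(f) == 0) && ok;
}

bool loadCheckpoint(const char *path, State *s)
{
    FILE *f = fopen(path, "rb");
    if (!f) return false;
    State tmp;
    bool ok = fread(&tmp, sizeof(tmp), 1, f) == 1;
    fclose(f);
    if (ok) *s = tmp;
    return ok;
}

// Runs the computation in chunks, saving a checkpoint after each launch.
State runChunked(unsigned long long totalSteps, unsigned long long chunk,
                 const char *path)
{
    State h = {0, 0.0};
    loadCheckpoint(path, &h);  // resume if an earlier run left a checkpoint

    State *d = nullptr;
    cudaMalloc((void **)&d, sizeof(State));
    cudaMemcpy(d, &h, sizeof(State), cudaMemcpyHostToDevice);

    while (h.step < totalSteps) {
        unsigned long long remaining = totalSteps - h.step;
        unsigned long long n = remaining < chunk ? remaining : chunk;
        advance<<<1, 1>>>(d, n);
        // This blocking copy also waits for the kernel to finish.
        if (cudaMemcpy(&h, d, sizeof(State), cudaMemcpyDeviceToHost) != cudaSuccess)
            break;
        saveCheckpoint(path, h);
    }

    cudaFree(d);
    return h;
}

int main()
{
    State s = runChunked(1000000ULL, 100000ULL, "checkpoint.bin");
    printf("finished at step %llu, value %f\n", s.step, s.value);
    return 0;
}
```

If the program is killed and started again with the same file, `runChunked` picks up at the last saved step instead of step 0. The chunk size sets the trade-off between relaunch overhead and how much work a failure can cost.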

It should also be possible to simply have a cudaMemcpyAsync run periodically on a stream that is not the same as the stream the kernel is running on. I don’t have a demonstrator for this, and it will still be necessary to make appropriate use of volatile and/or fencing.
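An untested-in-production sketch of what the separate-stream idea might look like, under several assumptions: the device can overlap copies with kernel execution, the snapshot is a single 8-byte value, and the names `worker` and `snapshotWhileRunning` are hypothetical. The kernel publishes progress to global memory through a `volatile` pointer followed by `__threadfence()`, while the host repeatedly issues `cudaMemcpyAsync` on a second, non-blocking stream into pinned host memory.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical worker: publishes its progress counter to global memory.
__global__ void worker(volatile unsigned long long *progress,
                       unsigned long long steps)
{
    for (unsigned long long i = 1; i <= steps; ++i) {
        long long t0 = clock64();
        while (clock64() - t0 < 5000000) { }  // busy-wait standing in for real work
        *progress = i;     // volatile store goes out to memory each iteration
        __threadfence();   // order this store before any later writes
    }
}

// Copies the progress value to the host while the kernel is running.
// Returns the value seen after the kernel has completed.
unsigned long long snapshotWhileRunning(unsigned long long steps)
{
    unsigned long long *d_progress = nullptr, *h_snapshot = nullptr;
    cudaMalloc((void **)&d_progress, sizeof(*d_progress));
    cudaMemset(d_progress, 0, sizeof(*d_progress));
    cudaDeviceSynchronize();  // make sure the memset is done before the launch

    // Pinned host memory, so the copy can actually run asynchronously.
    cudaHostAlloc((void **)&h_snapshot, sizeof(*h_snapshot), cudaHostAllocDefault);
    *h_snapshot = 0;

    cudaStream_t kernelStream, copyStream;
    cudaStreamCreateWithFlags(&kernelStream, cudaStreamNonBlocking);
    cudaStreamCreateWithFlags(&copyStream, cudaStreamNonBlocking);

    worker<<<1, 1, 0, kernelStream>>>(d_progress, steps);

    unsigned long long last = 0;
    while (cudaStreamQuery(kernelStream) == cudaErrorNotReady) {
        cudaMemcpyAsync(h_snapshot, d_progress, sizeof(*h_snapshot),
                        cudaMemcpyDeviceToHost, copyStream);
        cudaStreamSynchronize(copyStream);  // wait for the copy, not the kernel
        if (*h_snapshot != last) {
            last = *h_snapshot;
            printf("snapshot: step %llu\n", last);
        }
    }

    // One more copy after the kernel has finished, to pick up the final value.
    cudaStreamSynchronize(kernelStream);
    cudaMemcpyAsync(h_snapshot, d_progress, sizeof(*h_snapshot),
                    cudaMemcpyDeviceToHost, copyStream);
    cudaStreamSynchronize(copyStream);
    last = *h_snapshot;

    cudaStreamDestroy(kernelStream);
    cudaStreamDestroy(copyStream);
    cudaFreeHost(h_snapshot);
    cudaFree(d_progress);
    return last;
}

int main()
{
    unsigned long long final = snapshotWhileRunning(20);
    printf("final: step %llu\n", final);
    return final == 20 ? 0 : 1;
}
```

A copy taken this way is only a snapshot of whatever is in memory at that moment. For state larger than one aligned word, the kernel and host would need an agreed protocol (for example, a version counter written before and after the payload) so the host can detect a torn snapshot and retry.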

njuffa · May 24, 2022, 4:00pm

Side remark regarding design:

Creating a restore point every minute seems excessive, and depending on the amount of state that needs to be written out per restore point, it could create a significant load on mass storage.

I do not know the environment this machine is operating in, but in typical environments, the likelihood of a power failure within the next minute is exceedingly small. A fairly common approach for computations that are projected to run on the order of hours is to checkpoint every 5% to completion, or every 1% for something that is designed to run a few days. The largest practical computations I have undertaken ran on the order of 700 hours of wallclock time.
