Transfer results of executing kernel calculation to host

Hi.

I’d like my host code to access kernel calculation results during actual kernel execution.

My kernel runs for a couple of days. I could split it into chunks, but I would like to avoid the overhead going along with kernel starts (data copied to shared memory, initialization of variables, …). Presently, the results from kernel calculation is printed out, but a power failure would make me start again from the scratch. I’d like to have some restore point of the running calculation every minute on file.

Since CUDA doesn’t provide an API to save data to file, I was wondering if there would be some kind of e.g. unified memory monitoring API, which I could use on host code to save intermediate data calculated on device?

Generally speaking, is there a possibility for host code to probe device data during execution?

One approach would be to use either pinned memory or unified memory (linux and pascal or beyond) to transfer data from device to host during kernel execution. To get reliable transfer its necessary to understand concepts like volatile and memory fencing.

This may be of interest.

Another approach would be to have your kernel halt on some periodic basis, write results to memory, then restart the kernel calcs from there (relaunch the kernel for the next time step.)

It should also be possible to simply have a cudaMemcpyAsync run periodically on a stream that is not the same as the stream the kernel is running on. I don’t have a demonstrator for this, and it will still be necessary to make appropriate use of volatile and/or fencing.

1 Like

Side remark regarding design:

Creating a restore point every minute seems excessive, and depending on the amount of state that needs to be written out to per restore point it could create a significant load on mass storage.

I do not know the environment this machine is operating in, but in typical environments, the likelihood of a power failure within the next minute is exceedingly small. A fairly common approach for computations that are projected to run on the order of hours is to checkpoint every 5% to completion, or every 1% for something that is designed to run a few days. The largest practical computations I have undertaken ran on the order of 700 hours of wallclock time.

1 Like