Polling device variable while kernel is running

venetis · August 31, 2016, 8:30am

Hello everyone,

I have an application that runs on a GPU for hours or even days, depending on the input. Hence, I would like to have some indication of the percentage of calculations that have been completed at constant time intervals (e.g. every minute).

The application consists of a set of nested loops with known numbers of iterations. As a result, it is relatively easy to count on the device in a global variable how many iterations have already been calculated and divide that number by the total number of iterations.

But to print the percentage of completion, I need some way on the host to poll that global variable at specific time intervals while the kernel is running. How can I do this continuously on the host? What I need to do on the host is described by the following pseudocode:

call kernel

do
  wait for time interval to pass
  poll device variable
  print "Percentage completed is ..."
while (kernel is running)

copy output data from device to host

My issue is what function of the CUDA API I have to use to read the value of the device variable without blocking execution on the host until the kernel finishes.

Any suggestions?

Robert_Crovella · August 31, 2016, 1:57pm

There are a number of concepts involved, including streams and zero-copy memory.

Here is a fully -worked example:

[url]cuda - How can I check the progress of matrix multiplication? - Stack Overflow

venetis · September 1, 2016, 8:49am

Somehow I missed that example. Probably did not use the appropriate search terms in Google :-)

I will have a look at it and come back if I have further questions.

Thanks!

venetis · September 19, 2016, 1:17pm

I managed to create a working version, based on the above example. I used the same logic of increasing a counter after each block of threads finishes execution, as this proved to be sufficiently accurate for my purposes.

However, I am not completely happy with this solution and I have a few more questions about how to improve it. Before I ask them, let me present my solution so that we have something more concrete to discuss about.

My kernel looks like this:

__global__ void findSolutions(<Many problem-specific parameters>, volatile unsigned long long *progress)
{
  // Do work here

  // I use a 1D grid with 1D blocks of threads.
  // Only the thread with ID 0 in each block
  //  increases atomically the progress counter.
  if (threadIdx.x == 0) {
    atomicAdd((unsigned long long *)progress, 1);
    __threadfence_system();
  }
}

The code in the main() function is as follows:

volatile unsigned long long     *progress_h, *progress_d, currValue;
double                          currProgress, prevProgress = 0.0;
cudaError_t                     err;

err = cudaSetDeviceFlags(cudaDeviceMapHost);
if (err != cudaSuccess) {
  printf("cudaSetDeviceFlags failed.\n");
  exit(1);
}

err = cudaHostAlloc((void **)&progress_h, sizeof(volatile unsigned long long), cudaHostAllocMapped);
if (err != cudaSuccess) {
  printf("cudaHostAlloc failed.\n");
  exit(1);
}
err = cudaHostGetDevicePointer((void **)&progress_d, (void *)progress_h, 0);
if (err != cudaSuccess) {
  printf("cudaHostGetDevicePointer failed.\n");
  exit(1);
}

*progress_h = 0;

findSolutions<<<blocksPerGrid, threadsPerBlock>>>(<Other parameters>, progress_d);

do {
  currValue = *progress_h;
  currProgress = (double)currValue / (double)blocksPerGrid;
  // Print percentage only if there has been more than 1% progress.
  if (currProgress - prevProgress > 0.01) {
    printf("%5.1f%%\n", currProgress * 100.0);
    prevProgress = currProgress;
  }
} while (currProgress < 1.0);

Now my questions:

How do I check under the above scheme that the kernel actually launched? I understand that normally cudaGetLastError() is called to check for this immediately after calling the kernel. But if cudaGetLastError() is blocking, execution will not reach the loop to show progress.

Furthermore, I am not certain what the following note from the cudaGetLastError() documentation might mean in the context of my code and whether some error caused by another CUDA API call could be misinterpreted as an error of the kernel not being launched:

“Note that this function may also return error codes from previous, asynchronous launches.”

I am not very happy with the fact that I check variable currProgress to see whether execution of the kernel reached 100%. I would be much more comfortable to use some CUDA specific API to check whether the kernel actually finished execution or not. Reading through several posts here and on other forums I think that a combination of cudaEventRecord() and cudaEventQuery() can provide the necessary functionality I need. However, I have not been able to find a fully functional example about how to use them, i.e., I am not certain at which points in my code I need to call them. Does anyone know a link to a good example?

Furthermore, does somehow use of these two functions interact with the use of cudaGetLastError() to check whether the kernel launched or not?

The loop spins continuously using 100% of the CPU all the time. Would using some sleep function (sleep, usleep, nanosleep) cause any problem in the execution of the code (apart from some small latency in reacting to kernel completion)? Is usage of any of the sleep functions appropriate? Or is there some other CUDA specific way to do this?

Thank you in advance for your comments on the above.

Robert_Crovella · September 19, 2016, 1:56pm

cudaGetLastError is not blocking, and can be used directly after a kernel launch to verify that the launch itself was successful. It does not verify that the kernel successfully completed unless you use a blocking call like cudaDeviceSynchronize before the call to cudaGetLastError
You can use a method like this:

[url]Launch CUDA-Kernel with a timeout - Stack Overflow

to verify that the kernel has completed (and optionally kill it).

I’m sure you can use a CPU thread based method to sleep instead of spinning on the progress variable. Such as the usleep function used in the example above.

venetis · September 20, 2016, 3:01pm

Dear txbob,

Thank you for your to-the-point answers and links. I managed to make it work as I wanted. Just as a reference, in case someone searches for something similar, I quote the new code I have created.

The kernel didn’t change:

__global__ void findSolutions(<Many problem-specific parameters>, volatile unsigned long long *progress)
{
  // Do work here

  // I use a 1D grid with 1D blocks of threads.
  // Only the thread with ID 0 in each block
  // increases atomically the progress counter.
  if (threadIdx.x == 0) {
    atomicAdd((unsigned long long *)progress, 1);
    __threadfence_system();
  }
}

The main() function now includes the following code:

volatile unsigned long long     *progress_h, *progress_d, currValue;
double                          currProgress, prevProgress = 0.0;
cudaError_t                     err;
cudaEvent_t                     kernelFinished;

err = cudaSetDeviceFlags(cudaDeviceMapHost);
if (err != cudaSuccess) {
  printf("cudaSetDeviceFlags failed.\n");
  exit(1);
}

err = cudaEventCreate(&kernelFinished);
if (err != cudaSuccess) {
  printf("cudaEventCreate failed.\n");
  exit(0);
}

err = cudaHostAlloc((void **)&progress_h, sizeof(volatile unsigned long long), cudaHostAllocMapped);
if (err != cudaSuccess) {
  printf("cudaHostAlloc failed.\n");
  exit(1);
}

err = cudaHostGetDevicePointer((void **)&progress_d, (void *)progress_h, 0);
if (err != cudaSuccess) {
  printf("cudaHostGetDevicePointer failed.\n");
  exit(1);
}

*progress_h = 0;

/*
 * Call the kernel.
 */
findSolutions<<<blocksPerGrid, threadsPerBlock>>>(<Other parameters>, progress_d);

/*
 * Check that the kernel actually launched.
 */
err = cudaGetLastError();
if (err != cudaSuccess) {
  printf("CUDA error: %s\n", cudaGetErrorString(err));
  exit(0);
}

/*
 * Record when the kernel finishes execution.
 */
err = cudaEventRecord(kernelFinished);
if (err != cudaSuccess) {
  printf("cudaEventRecord failed.\n");
  exit(0);
}

/*
 * Print progress of calculations every second and only if more than 1% has been completed.
 */
printf("Progress of calculations:\n");
printf("  0.0%%\n");
fflush(NULL);
while (cudaEventQuery(kernelFinished) != cudaSuccess) {
  usleep(1000000);
  currValue = *progress_h;
  currProgress = (double)currValue / (double)blocksPerGrid;
  if (currProgress - prevProgress > 0.01) {
    printf("%5.1f%%\n", currProgress * 100.0);
    fflush(NULL);
    prevProgress = currProgress;
  }
}

cudaDeviceSynchronize();
err = cudaGetLastError();
if (err != cudaSuccess) {
  printf("CUDA error: %s\n", cudaGetErrorString(err));
  exit(0);
}

printf("Calculations finished\n");
fflush(NULL);

I typically redirect the output of the program to a file, hence I use fflush() to immediately update the file and see the progress.

I also artificially made the kernel fail to launch (allowed it to use too many registers for the number of threads per block) and indeed this is correctly caught.

Thanks again for all your help.

Topic		Replies	Views
Increment Host variable in kernal code and report continuously back to CPU for progress of work Nsight Visual Studio Edition kernel	0	312	November 18, 2020
GPU polling for completion CUDA Programming and Performance	1	2389	March 27, 2009
Kernel progress status CUDA Programming and Performance	3	1961	April 18, 2009
No printf(".") output from the kernel CUDA Programming and Performance	6	2790	March 15, 2023
[Solved] How to update host memory variable from device during OpenCL kernel execution CUDA Programming and Performance	13	3220	August 2, 2019
Non-blocking kernel execution CUDA Programming and Performance	5	9544	August 8, 2007
Signalling between cuda thread and host thread cuda-thread host-thread communication CUDA Programming and Performance	7	19634	June 21, 2011
Silent kernel failure CUDA Programming and Performance	25	8425	May 18, 2020
How to detect async event without polling CUDA Programming and Performance	28	5853	August 20, 2010
A simple question about printf() inside a kernel with no convincing answer on google or nvidia docs CUDA Programming and Performance	8	5738	August 4, 2019

Polling device variable while kernel is running

Related topics