Realtime kernel

Hello.

I’m coding a program that has to be ‘real time’; by ‘real time’ I mean that the kernel can’t execute beyond some predefined time threshold.

My question is: how can I force the kernel to finish executing when it reaches this limit?

I thought maybe the watchdog might serve as a helper in this situation…

Can anybody comment on this?

Thanks.

Use the “clock” function on the GPU to time your code and exit when it’s time to end…

Sarnath could you be more specific? I didn’t find anything in the manual about any clock() function…

Section B.10, p113 of the programming guide I am looking at right now…

“when executed in device code, returns the value of a per-multiprocessor counter that is incremented every clock cycle. Sampling this counter at the beginning and at the end of a kernel, taking the difference of the two samples, and recording the result per thread provides a measure for each thread of the number of clock cycles taken by the device to completely execute the thread, but not of the number of clock cycles the device actually spent executing thread instructions. The former number is greater than the latter since threads are time sliced.”

Please correct me if I’m wrong: to achieve the desired behaviour, I’ll need to sum up all the clock cycles and then multiply that by the ‘time per clock cycle’? And how does one thread access the ‘clock’ of an SM other than the one it is itself running on…?
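(For reference, the per-thread sampling pattern the guide describes looks roughly like this — the kernel name and output arrays are placeholders of my own, not anything from the guide:)

```cuda
__global__ void timedKernel(int *out, clock_t *cycles)
{
    clock_t start = clock();            // sample the per-SM counter at entry

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = tid * 2;                 // placeholder for the real work

    clock_t stop = clock();             // sample the same SM's counter at exit
    // Difference = cycles the SM took to run this thread end to end,
    // including time spent sliced out in favour of other warps.
    cycles[tid] = stop - start;
}
```

To convert cycles to time on the host, divide by the device clock rate (`cudaDeviceProp::clockRate`, reported in kHz by `cudaGetDeviceProperties`). Note that each thread only ever reads the counter of the SM it is running on — there is no cross-SM clock access.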

Hmmm, I haven’t tried what you’re doing, but would the problem be solved if you used a __syncthreads() so that you can guarantee that all the threads have reached a certain point?

Ex

int per_thread_timestamp = clock();

// do something
for( … )
{

}

__syncthreads(); // checkpoint

int delta = clock() - per_thread_timestamp;

if( delta > something)
{

}

But I guess with this setup there could still be context switching to other blocks, which will make this not work. Depends on your setup… Anyway, just an idea. Otherwise you have to do a reduction and look at the maximum and minimum timestamps, an operation which will in itself take some time to do…
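(The reduction mentioned above could be sketched like this — a standard shared-memory max reduction over the per-thread deltas; the kernel name, output array, and the assumption of a power-of-two block size of 256 are all mine:)

```cuda
__global__ void kernelWithTiming(clock_t *block_max)
{
    __shared__ clock_t s_delta[256];          // assumes blockDim.x == 256

    clock_t start = clock();
    // ... do the real work here ...
    s_delta[threadIdx.x] = clock() - start;
    __syncthreads();

    // Halve the active range each step, keeping the larger delta of each pair.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1)
    {
        if (threadIdx.x < stride)
        {
            clock_t other = s_delta[threadIdx.x + stride];
            if (other > s_delta[threadIdx.x])
                s_delta[threadIdx.x] = other;
        }
        __syncthreads();
    }

    if (threadIdx.x == 0)
        block_max[blockIdx.x] = s_delta[0];   // slowest thread in this block
}
```

A min reduction is the same loop with the comparison flipped; a second small kernel (or the host) can then reduce the per-block results across the grid.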

I doubt it needs to be that elaborate. Have one thread per block atomically set a shared memory start clock count; when the elapsed count reaches some limiting value, have the block abort. If the bound is chosen correctly, every block in the grid is guaranteed to have completed or exited inside your wallclock time window. CUDA is not designed to do hard realtime. If you really need those facilities, you probably need different hardware and a different OS (like an RTOS and FPGAs).
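(A minimal sketch of that per-block abort scheme — kernel name, data layout, and the cycle budget constant are all placeholders; the budget would have to be tuned from the device clock rate to match the wallclock limit:)

```cuda
#define CYCLE_BUDGET 1000000     // placeholder; derive from clock rate and time limit

__global__ void boundedKernel(float *data, int n)
{
    __shared__ clock_t s_start;
    if (threadIdx.x == 0)
        s_start = clock();       // one thread records the block's start count
    __syncthreads();             // everyone sees s_start before proceeding

    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    for (int i = tid; i < n; i += stride)
    {
        // Check the budget each pass; once exceeded, the thread exits,
        // leaving data in whatever state the completed passes produced.
        if (clock() - s_start > CYCLE_BUDGET)
            return;
        data[i] *= 2.0f;         // placeholder for the real work
    }
}
```

Since all threads of a block run on the same SM, they all read the same counter that s_start was sampled from, so the comparison is consistent within a block. One caveat: clock() is a 32-bit counter and wraps, so very large budgets would need clock64() where available.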
