It sounds like you would want to rework your app to avoid kernels that get close to the timeout limit.
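One common way to do that is to split a long-running kernel into several shorter launches, each covering part of the data. The following is only a minimal sketch under assumptions: `processChunk`, `numChunks`, and `runInChunks` are hypothetical names, the per-element work is a placeholder, and `chunkSize` must be greater than zero and small enough that one launch finishes well inside the watchdog limit on your GPU.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical kernel: processes elements [offset, offset + count) of data.
__global__ void processChunk(float *data, size_t offset, size_t count)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < count) {
        data[offset + i] = data[offset + i] * 2.0f + 1.0f;  // placeholder work
    }
}

// Number of launches needed to cover n elements (assumes chunkSize > 0).
inline size_t numChunks(size_t n, size_t chunkSize)
{
    return (n + chunkSize - 1) / chunkSize;
}

// Issue one short kernel per chunk instead of a single long-running kernel.
cudaError_t runInChunks(float *d_data, size_t n, size_t chunkSize)
{
    const unsigned int threads = 256;
    const size_t chunks = numChunks(n, chunkSize);
    for (size_t c = 0; c < chunks; ++c) {
        size_t offset = c * chunkSize;
        size_t count = (n - offset < chunkSize) ? (n - offset) : chunkSize;
        unsigned int blocks = (unsigned int)((count + threads - 1) / threads);
        processChunk<<<blocks, threads>>>(d_data, offset, count);
        cudaError_t err = cudaGetLastError();   // catches launch-configuration errors
        if (err != cudaSuccess) return err;
        // Synchronizing here lets a failure be attributed to a specific chunk.
        err = cudaDeviceSynchronize();
        if (err != cudaSuccess) return err;
    }
    return cudaSuccess;
}
```

A caller would allocate `d_data` with `cudaMalloc`, then call something like `runInChunks(d_data, n, 1 << 20)` and tune the chunk size by timing a single launch.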
Your description of the driver “crashing” when a watchdog event is triggered does not sound right to me. It used to be the case, on both Linux and Windows, that in such a situation the current CUDA context was destroyed, but the CUDA driver itself recovered. This recovery could take up to several seconds. After recovery, other CUDA apps could be run.
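To tell a watchdog timeout apart from a genuine driver problem, it can help to check whether the run time limit is active on the GPU and which error the runtime reports after the kernel. The sketch below uses the CUDA runtime attribute `cudaDevAttrKernelExecTimeout` and the error code `cudaErrorLaunchTimeout`; the helper names `watchdogEnabled` and `describeSyncResult` are hypothetical.

```cuda
#include <cuda_runtime.h>

// Returns true if the kernel run time limit (watchdog) is enabled on the device.
bool watchdogEnabled(int device)
{
    int enabled = 0;
    cudaError_t err = cudaDeviceGetAttribute(&enabled,
                                             cudaDevAttrKernelExecTimeout,
                                             device);
    return err == cudaSuccess && enabled != 0;
}

// Classifies the status returned by, for example, cudaDeviceSynchronize().
const char *describeSyncResult(cudaError_t err)
{
    if (err == cudaSuccess) return "ok";
    if (err == cudaErrorLaunchTimeout) return "watchdog timeout";
    return "other error";
}
```

If `describeSyncResult(cudaDeviceSynchronize())` reports a watchdog timeout, the context of that process is no longer usable, so the app should exit and be restarted rather than keep issuing CUDA calls.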
I ran on an RHEL-based workstation and driver recovery seemed to work quite well, although it did happen on a few occasions that after multiple consecutive timeout events, unloading and reloading of the driver as described by txbob became necessary. This required stopping X and dropping into console mode, but it did not require rebooting the machine as a whole.
So what exactly are the symptoms observed when the “driver crashes”? Is there a possibility that this is some sort of Ubuntu-specific issue? I may be biased, but over the years I have seen so many reports of “crazy stuff” happening on Ubuntu that I never observed on RHEL that I have decided to stay as far away from Ubuntu as possible.