After a driver update from 418 (CUDA 10.1) to 450 (CUDA 11.0) on Debian and Ubuntu Linux systems with K80 and V100 GPUs, an application we use started to hang in clFinish calls at random times. The application is multithreaded, compiled for OpenCL 1.2, and works without apparent problems on the 418 driver.
All calls to OpenCL are wrapped with error assertions, and no errors are logged on either driver version. There is no difference in behavior apart from clFinish sometimes not returning, which of course hangs the whole application. In the rare cases where clFinish does not hang, the results are identical.
The driver versions mentioned are the nvidia-tesla-*** versions distributed through the Debian package system, although a recent 455 driver fresh from NVIDIA’s repositories has also been tested, with negative results.
I tried to find actual changelogs for the driver with information on changes which might introduce a problem like this, but could only find rather superficial information.
I suspect the application might rely on undefined behaviour whose handling might have changed between the drivers, although inspection of the application’s source code does not reveal any obvious discrepancies with the usage specified in the OpenCL specs.
Are there any known issues or changes regarding OpenCL introduced after 418?
Comparing the clinfo output from both drivers, I can see that 450 added extensions:
So some work was done regarding OpenCL.
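For anyone who wants to repeat the extension comparison on their own machines, here is a rough Python sketch. It assumes the clinfo output from each driver has been saved as text and that extension names are the tokens starting with cl_. The function names are my own and purely illustrative.

```python
import re

# Assumption: OpenCL extension names appear in clinfo output as tokens
# such as cl_khr_... or cl_nv_..., regardless of the surrounding layout.
EXT_RE = re.compile(r"\bcl_[A-Za-z0-9_]+\b")


def extensions(clinfo_text):
    """Return the set of extension-like tokens found in clinfo output."""
    return set(EXT_RE.findall(clinfo_text))


def added_extensions(old_text, new_text):
    """Return, sorted, the extensions present in new_text but not in old_text."""
    return sorted(extensions(new_text) - extensions(old_text))
```

Calling added_extensions with the saved 418 output first and the 450 output second lists what the newer driver reports in addition. Swapping the arguments shows anything that was dropped.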