Problem with Tesla K40 TCC driver in Windows 8.1?

I have a problem with our workstation (with a Tesla K40c and Tesla K20 + Quadro NVS 300) in Windows 8.1 64-bit.
The Teslas are switched into TCC mode. I installed the latest Tesla driver for Windows 8.1 (ForceWare 377.35).

I tried to run the CUDA SDK samples for Toolkit 7.0 (compiled with VS 2013 using the solution file provided by NVIDIA in the SDK directory). I of course excluded the CUDA SDK samples that use OpenGL or DirectX interop. Note that the NVCC settings in the SDK samples also compile machine code for compute capability 3.5.

The ‘deviceQuery’ program works and reports information for both Tesla GPUs.
But all other SDK samples seem to hang forever on the first memory allocation.
For example, the ‘alignedTypes’ program hangs after printing the string ‘Allocating memory …’ to the console. It uses the Tesla K40c.

My own ‘AllocationTester’ program (from https://devtalk.nvidia.com/default/topic/963440/cudamalloc-pitch-significantly-slower-on-windows-with-geforce-drivers-gt-350-12/) also hangs. It just allocates a few buffers with ‘cudaMallocPitch’ and frees them.
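For reference, a minimal reproducer along the lines of my ‘AllocationTester’ looks roughly like this (the buffer sizes and loop count are illustrative, not the exact values from the linked program):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // On the affected driver in TCC mode, the process hangs inside the
    // first allocation call below and never returns.
    printf("Allocating memory ...\n");

    for (int i = 0; i < 4; ++i) {
        void  *devPtr = nullptr;
        size_t pitch  = 0;

        // Allocate a pitched 2D buffer (4096 x 4096 floats) and free it again.
        cudaError_t err = cudaMallocPitch(&devPtr, &pitch,
                                          4096 * sizeof(float), 4096);
        if (err != cudaSuccess) {
            printf("cudaMallocPitch failed: %s\n", cudaGetErrorString(err));
            return 1;
        }
        cudaFree(devPtr);
    }

    printf("Done.\n");
    return 0;
}
```

Built for the K40 with e.g. `nvcc -arch=sm_35`, this never prints “Done.” on the broken driver; with a working driver it completes immediately.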

I am wondering whether other users are having the same problems with Tesla GPUs on Windows. I think I have read something on this forum about issues with Teslas in recent Windows versions (or with recent drivers) when TCC mode is enabled, but I can’t find the link.

I also tried to switch the Tesla K40 to WDDM mode, but the ‘nvidia-smi’ tool just reports that this is not possible. How can I do that?
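For completeness, this is what I attempted, from an elevated command prompt (the GPU index 0 is just an example; pick the right index from the nvidia-smi listing). The first command lists the GPUs and their current driver model (TCC/WDDM); the second requests WDDM (0 = WDDM, 1 = TCC) for the given GPU and would normally take effect after a reboot, but on the K40c it just reports that the switch is not supported:

```
nvidia-smi
nvidia-smi -i 0 -dm 0
```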

I believe this is a known issue.

It should be fixed in a future driver:

https://devtalk.nvidia.com/default/topic/1003087/cuda-programming-and-performance/cudamalloc-hang-when-building-x64-version-binary/post/5172447/#5172447

I don’t believe either the K40c or the K40m can be switched to WDDM mode. I’m certain the K40m cannot; I’m pretty sure the K40c cannot.

As best I recall, the K40c does not have a video output, therefore cannot drive a display, and thus cannot be used with a WDDM driver.

Ok, thanks for the information and the link to the issue. I have some additional questions for txbob:

  • When will the 377.48 driver be released, approximately: days / weeks / months? Will it fix the bug?
  • Is there some older working driver that I can revert to for now?
  • Can the ‘Tesla M40’ (Maxwell generation) be switched to WDDM mode?

P.S. I find it a bit hard to understand how a Tesla TCC driver that simply does not work at all can pass internal QC and subsequently be released …

Weeks, as best I can tell. I actually thought it would be released by now, so there is some sort of delay. It will be part of an r375 driver release process that has other gates (e.g. WHQL) besides just this issue, so I think that is why it hasn’t been released yet. I don’t have specific details; this is just a guess.

I mentioned it because I think it would fix the issue. I cannot be 100% certain. It fixed a similar issue according to our internal testing. Once you get the driver, you can decide for yourself if it fixes the issue. Alternatively, feel free to file a bug at developer.nvidia.com

Probably, subject to some caveats. The K40 has been around since roughly 2013, and certainly this issue has not persisted since then. If you install CUDA 7.0 or CUDA 7.5 on that machine, along with the driver associated with CUDA 7.0 or CUDA 7.5 (i.e. the driver from the CUDA toolkit installer), I suspect that might avoid the issue. I haven’t personally tested that; it’s just conjecture. With respect to CUDA 8, the current CUDA 8.0.61 requires a driver from r375 or newer, and my guess right now is that this issue is prevalent in (i.e. manifesting throughout) the r375 branch.

I don’t believe so. GPUs without a display output generally publish a 3D Controller PCI class code, rather than a VGA class code, in PCI config space. Such devices cannot be run under WDDM. You can check this yourself with lspci (and I’m sure there is a comparable tool on Windows), or just look in Device Manager and see whether the M40 appears under “Display adapters” (= publishes a VGA class code) or somewhere else (= publishes a 3D controller class code).
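On a Linux box the check looks like this; the device strings and bus addresses below are only examples, but the bracketed numbers are the standard PCI class codes:

```
lspci -nn | grep -i nvidia
# display-capable board:  01:00.0 VGA compatible controller [0300]: NVIDIA Corporation ...
# compute-only board:     02:00.0 3D controller [0302]: NVIDIA Corporation ...
```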

Some handwaving background suggesting why this may have slipped through testing was given in one of the 4 or 5 posts on this board that refer to this issue in one form or another. I don’t wish to search for them all now, nor can I shed any further light on it.

I have never even seen handwaving as to possible explanations, nor would I expect to see any: the details of any company’s QA process are usually confidential information. The only statement I recall seeing in other threads was that the issue of how this bug fell through the cracks of the CUDA QA process is under investigation.

As someone with extensive historic knowledge of the CUDA QA process I could have sworn that it is impossible for a bug like this to escape into the wild, but here we are. One might hypothesize a most unusual concatenation of events. Maybe we’ll find the backstory published in some article twenty years from now :-)

what I meant by handwaving:

https://devtalk.nvidia.com/default/topic/1003087/cuda-programming-and-performance/cudamalloc-hang-when-building-x64-version-binary/post/5123877/#5123877

I certainly did not mean to convey that a concise description or explanation had been provided, and I don’t think it will be. But as stated there, there is a hole, it’s understood there is a hole, and some corrective action is being looked into.

The statements in that thread provided factual information indicating that the bug is restricted to very specific, though not particularly uncommon, configurations.

It does not provide speculation as to how that bug was not noticed during the QA process, except possibly hinting that this particular space in the configuration matrix was unpopulated during QA. If we want to call such an implicit hint “handwaving”, I am fine with that.

As I said, I do not expect NVIDIA to provide any information on their QA process (handwavy or otherwise). I trust that steps have been taken to make the CUDA QA process more watertight going forward.

OK, that sounds good, thanks for the detailed response.

Meanwhile, I tried out some older Tesla drivers from the archive. Driver version 354.99 from October 2016 seems to be the newest ‘working’ driver, so I reverted to that version. As we are using CUDA Toolkit 7.0, that is fine for us for now (until the fixed driver is released).