I wrote some code to generate a histogram of random data (always the same size), and I need to run it many times. Sometimes it fails with cudaErrorInvalidConfiguration, but sometimes it runs 10,000 times without a crash. Note that the number of threads, the number of blocks, and the shared memory size are all fixed. Does anybody have any idea what could cause this? It is really annoying.
Thanks for the help,
That is very odd: that error should be a fully deterministic consequence of the configuration data passed. The only thing I can think of is that one of the configuration parameters is not actually fixed, but depends on an uninitialized variable (possibly through multiple levels of indirection).
Is there more than one GPU installed in this system? If so, are they all of the same type, or of different types? I wonder whether it is possible that in some runs, your app picks up a different GPU (possibly through use of the CUDA_VISIBLE_DEVICES environment variable).
Thanks very much for the reply. We have been working on this issue for a while and have not found a solution. Here is more context. Since the data is huge, we split it and run on 4 GPUs (GeForce GTX 1080, same configuration, each handling 1/4 of the data). To test the program, we ran it 10,000 times (each time modifying the data a little while keeping the distribution fixed) and drew the histogram. The error is quite random: sometimes we get an invalid device ID (note that the device ID is const once initialized), and it happens at different runs (sometimes it fails at run 7500 and sometimes at run 5000). I am now disabling parts of the code one after another to see which part is the root cause. Do you think this is related to the hardware?
Thanks again for your help.
cudaErrorInvalidConfiguration should be a synchronous error reported by the CUDA driver through the CUDA runtime, after checking the passed-in configuration parameters against the GPU capabilities. So this should be 100% deterministic, and other than GPUs temporarily “disappearing” I do not have a mental model of how your observations could be related to hardware flakiness. The “invalid device ID” in particular would hint at GPUs “falling off the bus” (~= disappearing), which is a hardware flakiness issue.
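To illustrate why this error should be deterministic: the check is a pure function of the launch parameters and the device limits, so identical inputs always produce the same verdict. Below is a minimal host-side sketch in plain C++. The struct and function names are hypothetical, the default limits are assumptions for a compute capability 6.1 GPU such as the GTX 1080, and the real driver checks more than this. In an actual application the limits would come from cudaGetDeviceProperties.

```cpp
#include <cstddef>
#include <string>

// Hypothetical mirror of a few cudaDeviceProp fields.
// The default values are assumptions for a compute capability 6.1 GPU.
struct DeviceLimits {
    unsigned maxThreadsPerBlock = 1024;
    unsigned maxGridDimX = 2147483647u;
    std::size_t sharedMemPerBlock = 48 * 1024;
};

// Hypothetical 1D launch configuration: grid size, block size,
// and dynamic shared memory in bytes.
struct LaunchConfig {
    unsigned blocks;
    unsigned threadsPerBlock;
    std::size_t sharedMemBytes;
};

// Returns an empty string if the configuration passes these checks,
// otherwise the name of the first offending parameter.
std::string check_launch(const LaunchConfig& cfg,
                         const DeviceLimits& lim = DeviceLimits{}) {
    if (cfg.blocks == 0 || cfg.blocks > lim.maxGridDimX) {
        return "blocks";
    }
    if (cfg.threadsPerBlock == 0 ||
        cfg.threadsPerBlock > lim.maxThreadsPerBlock) {
        return "threadsPerBlock";
    }
    if (cfg.sharedMemBytes > lim.sharedMemPerBlock) {
        return "sharedMemBytes";
    }
    return "";
}
```

Calling something like check_launch(LaunchConfig{64, 256, 4096}) just before each kernel launch, and logging the three values whenever it returns a non-empty string, would reveal a parameter that is not as fixed as it is believed to be, for example one read from an uninitialized variable.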
You might want to design a targeted test to check which GPU(s) become temporarily unavailable. If you cycle the GPUs through the PCIe slots, are failures correlated with a specific GPU or a specific PCIe slot? PCIe slots can become damaged (e.g. dirty or bent connector fingers) and so can GPUs (e.g. electrostatic discharge when handling without proper grounding, shipment not in conductive bags).
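One way to evaluate such a rotation test is to record, for every failure, which physical card failed and which slot it occupied at the time, then count failures along both axes. Here is a minimal sketch in plain C++; the struct names and the card and slot labels are hypothetical, chosen by whoever runs the test.

```cpp
#include <map>
#include <string>
#include <vector>

// One observed failure: which physical card failed and which slot
// it was installed in. Both labels are hypothetical identifiers.
struct Failure {
    std::string card;
    std::string slot;
};

// Failure counts grouped by card and by slot.
struct Tally {
    std::map<std::string, int> byCard;
    std::map<std::string, int> bySlot;
};

// Counts how often each card and each slot appears in the failure log.
Tally tally_failures(const std::vector<Failure>& failures) {
    Tally t;
    for (const Failure& f : failures) {
        ++t.byCard[f.card];
        ++t.bySlot[f.slot];
    }
    return t;
}
```

If the failures pile up on one slot regardless of which card sits in it, the slot or its wiring is the suspect. If they follow one card from slot to slot, the card is.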
Four high-end GPUs in the same system would invite double-checking on potential sources of hardware flakiness, such as an insufficient power supply (your PSU should ideally be rated for 1300W to 1400W, depending on system components), elevated operating temperatures (use nvidia-smi to check; for Pascal-family cards you want to be as close as possible to 65 degrees Celsius), or aggressive overclocking (note that many models of GeForce GPUs are overclocked by the vendor, and there is no evidence I am aware of that vendors use compute applications when qualifying their overclocked models). You would also want to double-check the PCIe connectors (GPUs plugged in firmly and secured at the bracket) and the PCIe power connectors on the GPUs (plugged in all the way, no converters or Y-splitters used in the cabling).
What is the system platform? Is it a standard system from HP, Dell, Lenovo, etc? A custom build based on SuperMicro or similar high-quality motherboards?
Personally I consider it iffy to build a high-end system that is under continuous load with consumer-grade electronics that are not designed for a 24/7 duty cycle. That applies to all components in such a system.
Thanks a lot njuffa. Lots of helpful information. I will design some targeted tests for the items listed above.
So basically, if I understand correctly, it is very likely that one of the GPUs was temporarily off/disappearing while the program was running? And it was just back on when I restarted and ran the program the second time?
A GPU “falling off the bus” (meaning the card is still powered on, but the host PCIe controller has lost its connection to it) is my best guess at this point. Check the system logs for any GPU-related error messages.
Note that if it is indeed a GPU falling off the bus this is one of the hardest issues to root cause, in particular when you try to do it over the internet. All we know in such a case is that connection between host and GPU was lost, and it could be due to myriad issues from defects in host or device hardware, to bugs or configuration issues in system BIOS and VBIOS, to GPU driver bugs. Environmental factors (“dirty” power supply, electromagnetic interference, vibrations, excessive humidity, excessive altitude) could come into play as well. Have you checked on the operating temperatures of the GTX 1080s under full load? Are they about equal?
An 80 PLUS Platinum PSU makes a quality product likely, but I am not clear how to interpret the stated power rating for this redundant setup. The PSU should be sized such that the total nominal power of all system components does not exceed 60% of the PSU rating (that’s because PSUs are most efficient in the 25% to 60% load range, and you want sufficient reserves to absorb any short-duration power spikes that can occur when GPU loads change rapidly, to prevent brown-outs, i.e. voltage drops).
In your case that is 720W for the four GTX 1080s (180W each), plus an estimated 120W for the rest of the system, although this will depend on the CPUs (number and type), amount of DRAM (about 0.4W per GB for DDR4), and mass storage (~5W per HDD/SSD). With 120W we would be at 840W total, which requires a 1400W power supply to not exceed 60% (840W / 0.6 = 1400W). If your system’s two PSUs are set up to deliver 2000W in combination, you should be good even for an ambitiously configured system (dual high-end CPUs plus tons of memory).
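The sizing rule above can be written out as a small calculation. This is a sketch in plain C++ with hypothetical function names; the 180W per GPU figure is simply the 720W above divided by four, and the 120W for the rest of the system is the estimate from this post, not a measurement.

```cpp
#include <cassert>

// Total nominal load: GPUs plus an estimate for the rest of the system.
int total_load_watts(int gpu_count, int watts_per_gpu, int rest_of_system_watts) {
    return gpu_count * watts_per_gpu + rest_of_system_watts;
}

// Smallest PSU rating (in watts) such that load_watts stays at or below
// max_load_percent of the rating. Integer arithmetic, rounded up.
int required_psu_watts(int load_watts, int max_load_percent) {
    assert(load_watts >= 0 && max_load_percent > 0 && max_load_percent <= 100);
    return (load_watts * 100 + max_load_percent - 1) / max_load_percent;
}
```

With the numbers from this thread, total_load_watts(4, 180, 120) gives 840 and required_psu_watts(840, 60) gives 1400. Plugging in a larger estimate for the rest of the system (dual CPUs, lots of DRAM) shows how quickly the required rating grows.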
I am not familiar with Silicon Mechanics, but I see that they are listed on NVIDIA’s official list of VARs, so there should be no cause for concern in that respect.
Thanks a lot njuffa for all the help. I will look into this and report back in this thread.
If you haven’t done so already, now may be a good time to update your system BIOS and your NVIDIA drivers to the latest available for your platform. What CUDA version are you running?
Is this a Windows or a Linux system, by the way? If Linux, check driver output with dmesg to see if you can find anything interesting about GPU operation after a failure occurred.
It also seems advisable to run valgrind or a similar utility on your application to check for out-of-bounds reads or writes, which could disrupt the application by overwriting valid data or by picking up uninitialized data.