cudaMemcpy Hung

System Configuration

  • Dual Hex Core Intel Xeon CPUs
  • 512GB System Memory
  • Two NVIDIA Tesla K80 GPUs (4 logical devices)
  • CentOS 7.6.1810
  • Kernel: 3.10.0-957.1.3.el7.x86_64
  • CUDA 10.0.130
  • CUDA Driver: 410.79
  • GCC 4.8.5

    I have a C++ CUDA application that involves performing lots of FFTs and various arithmetic operations. The application is kicked off on a cluster via SLURM/MPI (not using CUDA-aware MPI). Each node is executing 24 ranks / processes, each GPU is being utilized concurrently by 6 processes - this yields ~60% overall GPU utilization in most cases. Each process is only using ~90MB of GPU memory.

    I’m encountering a problem where some processes (usually 1-2 per node) are hanging after 1-2 days of execution. After lots of experiments, it does not seem to be a function of the work they are performing. The jobs can be terminated / restarted, they finish the work they were originally hung on, and continue for another 1-2 days (eventually hanging at the same point, see below).

    Below are some data points that lead me to believe there is something suspect in the software / driver stack:

  • The same cluster (before being upgraded) with CentOS 6.6 and CUDA 7.5 did not exhibit this problem.
  • A different cluster with V100s and a similar version of CentOS 7 does not exhibit this problem.
  • As stated previously, the problem does not seem to be a function of the input, it is correlated with execution time.
  • Note: I ran cuda-memcheck, host memory checkers, and nvidia-smi to ensure there aren’t ECC / memory errors.

    Below is the stack trace obtained with gdb. The program is clearly hung, as it is not producing any output. I can continue the program and get the same stack trace every time in gdb. Note: I had to manually type in the stack traces, so I omitted the memory addresses and truncated it at the point where the application calls into the CUDA library.

    #0 in clock_gettime ()
    #1 in clock_gettime() from /usr/lib64/
    #2 in ?? () from /usr/lib64/
    #3 in ?? () from /usr/lib64/
    #4 in ?? () from /usr/lib64/
    #5 in ?? () from /usr/lib64/
    #6 in ?? () from /usr/lib64/
    #7 in ?? () from /usr/lib64/
    #8 in ?? () from /usr/lib64/
    #9 in cuMemcpyDtoH_v2 () from /usr/lib64/
    #10 in ?? () from /usr/local/cuda-10.0/targets/x86_x64-linux/lib/
    #11 in ?? () from /usr/local/cuda-10.0/targets/x86_x64-linux/lib/
    #12 in cudaMemcpy () from /usr/local/cuda-10.0/targets/x86_x64-linux/lib/
    #13 in cuMemoryWrap::moveToHost () from

    Below is the stack trace from cuda-gdb:

    #0 in cuVDPAUCtxCreate () from /usr/lib64/
    #1 in cuVDPAUCtxCreate () from /usr/lib64/
    #2 in cuVDPAUCtxCreate () from /usr/lib64/
    #3 in cuMemGetAttribute_v2 () from /usr/lib64/
    #4 in cudbgApiDetach () from /usr/lib64/
    #5 in cudbgApiDetach () from /usr/lib64/
    #6 in cuVDPAUCtxCreate () from /usr/lib64/
    #7 in cuVDPAUCtxCreate () from /usr/lib64/
    #8 in cuEGLApiInit () from /usr/lib64/
    #9 in cuEGLApiInit () from /usr/lib64/
    #10 in cuEGLApiInit () from /usr/lib64/
    #11 in cuMemGetAttribute_v2 () from /usr/lib64/
    #12 in cuMemGetAttribute_v2 () from /usr/lib64/
    #13 in cuMemcpyDtoH_v2 () from /usr/lib64/
    #14 in __cudaInitModule () from /usr/local/cuda-10.0/targets/x86_x64-linux/lib/
    #15 in cudaGetExportTable () from /usr/local/cuda-10.0/targets/x86_x64-linux/lib/
    #16 in cudaMemcpy () from /usr/local/cuda-10.0/targets/x86_x64-linux/lib/
    #17 in cuMemoryWrap::moveToHost () from

    Does anyone know why gdb shows clock_gettime at the top of the stack, while cuda-gdb shows functions within libcuda? Is it that gdb has limited insight at the point where the application is waiting for the driver - presumably in a poll / sleep loop?
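
    To illustrate the guess above: a blocking wait written as a poll loop spends most of its time reading the clock, so a debugger that interrupts it at an arbitrary moment will often land in the clock call (on Linux, std::chrono::steady_clock is typically backed by clock_gettime). The sketch below is not driver code and makes no claim about how libcuda waits internally; `poll_until_done` and the completion flag are hypothetical names used only to show the pattern.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Hypothetical illustration only: spin until `done` becomes true or the
// timeout expires. Returns true if the flag was observed set in time.
bool poll_until_done(const std::atomic<bool>& done,
                     std::chrono::milliseconds timeout) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (!done.load(std::memory_order_acquire)) {
        // Every iteration reads the clock, so an interrupted thread is
        // likely to be found inside this call.
        if (std::chrono::steady_clock::now() >= deadline) {
            return false;  // gave up waiting
        }
        std::this_thread::yield();
    }
    return true;
}
```

    If the flag is never set and no timeout is applied, such a loop spins forever with the same stack trace each time it is interrupted, which would match what gdb reports here.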

    We have started down the long trial-and-error road of rolling back the driver, but I’m posting this issue to the forum to see if anyone has encountered similar issues or if they have suggestions on how to overcome this error. Thanks in advance for any help / suggestions!


    Here is an update after further experimentation.

    We rolled back the CUDA toolkit to v9.2 and the driver to 396.37 and encountered the same hanging issue (on the systems defined in the original post).

    We also performed the same test on another machine with the configuration below and encountered the same hanging issue.

    System #2 Configuration

    • Dual 10-core Intel Xeon CPUs (E5-2670v2)
    • 377GB System Memory
    • One NVIDIA Tesla K80 GPU (2 logical devices)
    • CentOS 7.6.1810
    • Kernel: 3.10.0-957.1.3.el7.x86_64
    • CUDA 10.0.130
    • CUDA Driver: 410.79
    • GCC 4.8.5

    At this point, we are considering the following paths:

    • Disable ECC (motivation: possible interference by ECC scrubbing)
    • Roll back to CUDA 8.0 toolkit/driver (this is not easy, as it appears we need to install an earlier version of CentOS)
    • Change Linux kernels (keep CUDA 10 toolkit/driver, possibly try other versions)
    • Reproduce problem with different application, although ultimately we need to understand why this application is hanging, so this is a bit of a time-sink / dead-end path (although interesting data point)

    Thanks in advance for any help / suggestions!


    We have performed lots of experiments and explored many dead-end paths. I’ll attempt to summarize the important discoveries to provide everyone an update.

    I realize the application is a huge unknown variable in this equation, so I attempted to implement a “reproducer” / test application (publishable source code) that exhibits the same “hanging” issue, but I have been unsuccessful. There must be something subtle in the real application versus the test application that I’m not capturing. Anyway, I’ll continue development of the test application and will publish the source code once I’m able to successfully recreate the hanging issue.

    Reiterating from the original post: I have a C++ CUDA application that involves performing lots of FFTs and various arithmetic operations. The application is kicked off on a cluster via SLURM/MPI (not using CUDA-aware MPI). Each node is executing 24 ranks / processes, each GPU is being utilized concurrently by 6 processes - this yields ~60% overall GPU utilization in most cases. Each process is only using ~90MB of GPU memory.

    Below is a summary of the environments under which the application hangs / does not hang:

    Hangs

  • CUDA Toolkit: 10.0 | CUDA Driver: 410.79 | Linux Kernel: 3.10.0-957
  • CUDA Toolkit: 9.2 | CUDA Driver: 396.37 | Linux Kernel: 3.10.0-957

    Does Not Hang

  • CUDA Toolkit: 8.0 | CUDA Driver: 375.26 | Linux Kernel: 3.10.0-229

    Our next experiments will include testing with the newer Linux kernels and CUDA drivers that were recently released.

    Thanks in advance for any help / suggestions!
    (note: the plot is also attached to this post in case it doesn’t render via the link or gets taken down over time)

    A picture is worth a thousand words - in the absence of code, hopefully this plot provides enough information for someone that has access to the driver / closed source code to infer what might be happening at the low level during the time the GPU is hanging.

    I’ll start by describing the application again. We have a C++ CUDA application that involves performing lots of FFTs and various arithmetic operations. The application is kicked off on a cluster via SLURM/MPI (not using CUDA-aware MPI). Each node is executing 24 ranks / processes, each GPU is being utilized concurrently by 6 processes - this yields ~60% overall GPU utilization in most cases. Each process is only using ~90MB of GPU memory.

    The application is running on a cluster of the following systems:

    • Dual Hex Core Intel Xeon CPUs
    • 512GB System Memory
    • Two NVIDIA Tesla K80 GPUs (4 logical devices)
    • CentOS 7.6.1810
    • Kernel: 3.10.0-957.1.3.el7.x86_64
    • CUDA 10.0.130
    • CUDA Driver: 410.79
    • GCC 4.8.5

    The data from the plot was obtained by running nvidia-smi, capturing measurements at a 1 second interval throughout the run-time of the application (~1 day). There are 4 logical GPU devices on the system, but the plot only includes data from a single device to reduce clutter and clearly illustrate the issue. Note, the behavior of the other devices is very similar.

    During the one hour zoom shown in the plot, some of the 6 processes finish their work, which causes the SM% (overall for the device) to begin dropping at ~07:31, as expected. At ~07:41, the clock is reduced and the temperature declines. At ~07:42 the temperature begins to sharply increase - it is not apparent why this is the case. At 07:44 the GPU hangs for about 4 minutes. We added timing instrumentation to the application and plotted the start and end hang time as vertical lines on the plot - this correlates exactly with the time the SM is at 0% utilization, as expected.

    As mentioned, the other devices in this machine and all machines across the cluster exhibit similar behavior (i.e., the plots are very similar), so this is not an isolated incident and it is very repeatable with all test cases we execute - although very frustrating because it only happens toward the end of the run when the processes finish their work.

    A summary of the behavior is as follows:

    1. Processes execute CUDA kernels successfully / consistently for many hours / days
    2. Processes finish their work and no longer use the GPU (indicated by reduced / step behavior on the SM% plot)
    3. Temperature of the device declines
    4. Clock is reduced by ~300MHz
    5. Temperature increases relatively quickly, then tapers off
    6. GPU hangs

    Note on #1: we ran a much longer test case that executes fine for ~3 days and it is not until the final hour (when processes begin completing their work) that the GPU hangs. This tells us the hang issue is not input dependent and not run-time or temperature dependent. There must be an issue when one or more processes’ GPU utilization goes to zero (maybe caused by a staggered drop-off?).

    Note on the hanging: once we enter the hang state, we hang for various amounts of time (sometimes for hours, which is obviously detrimental to our throughput) and eventually must kill the application and restart. When the results are cached, the processes reach the completion state faster (upon restart), and enter the hang state sooner - this is further evidence to support our previous statement regarding input / run-time / temperature independence.

    Here are some questions I’m dying to ask a GPU driver / firmware / hardware engineer at NVIDIA:

    1. What conditions cause the clock to be decreased / increased?
    2. What effect does clock throttling have on CUDA kernel execution (my hope would be nothing, but it seems there might be a timing / synchronization issue as a result of the throttling)?
    3. Were any of these control systems different between CUDA 8.0 and 9.2? Note: we never experienced the hang in CUDA 8 (driver 375.26). It would be great to see this type of plot with CUDA 8, but it is very difficult to change our configuration. However, we can do this if we receive feedback from this forum that it is valuable.

    We desperately need to find a solution to this hanging problem because we cannot run CUDA 8 in our environment any longer.

    Thanks in advance for any help / suggestions!

    While you provide a lot of detailed information, this doesn’t look like an issue that can be diagnosed well remotely. Consider filing a bug with NVIDIA, given that observations appear to correlate with changes to NVIDIA software. These forums are not designed as a bug-reporting venue. You may also want to consider bringing the issue to the attention of the system integrator who delivered these GPU-accelerated systems; to my understanding this is how support for Tesla-based systems works.

    The heuristics driving clock, power, and thermal management of GPUs are internal implementation details of NVIDIA hardware, firmware, and software that are (to the best of my knowledge) not documented publicly and are unlikely to be revealed to end customers. By observation: these details seem to change frequently. You may want to try the latest CUDA (10.1?) and the latest drivers (418?).

    I am not convinced that thermal / clock management is at the core to your observations. It might be, but it could just as well be a red herring. Have you tried fixing clocks by making use of the application clock feature? nvidia-smi can tell you what the supported application clock settings are, and it lets you set application clocks.

    Based on the information presented, the most likely root cause of your issue is software, thus my suggestions to try different drivers and consider filing a bug with NVIDIA.

    I don’t think one can exclude system issues (rather than GPU issues) with certainty as a root cause or contributing factor of the observations. For example, with passively-cooled devices like the K80, the fans in the system enclosure are responsible for directing adequate airflow across the heatsink fins of the GPU, which in turn impacts its operating temperature.

    I also wonder whether there could be contributing hardware-related issues, as the hardware is presumably quite old, being K80s. Semiconductor components age through a variety of failure mechanisms. The more adverse the operating environment, the faster they die, with heat usually being the biggest ageing factor. The sensors for heat and power on a GPU may similarly be affected by ageing, but I don’t have any insights into those components.

    Sometimes weird things happen with older hardware: I recently noticed a GPU on which the PCIe interface operated only in x4 mode for no apparent reason. Re-seating the GPU in its slot magically fixed the issue. I have similarly observed strange issues with GPUs (and CPUs) whose cooling fins, fans, or ducts had been partially clogged by years of accumulated filth. This caused the processors to run hotter. The issues went away after blowing out all that gunk.

    Thanks for your reply, njuffa!

    We will file a bug report with NVIDIA. That said, your comments / suggestions regarding contacting our hardware vendor, upgrading to the latest driver, fixing the clocks, and NVIDIA frequently changing the thermal management / clock control systems are great! This is exactly the type of feedback we were hoping to receive on this forum. We will immediately employ your feedback. Please let us know if you think the results of your suggestions are worth following up on this post for the benefit of other / future readers - otherwise, we will stop posting updates here.

    We’re not expecting NVIDIA to reveal their internal clock / power / thermal management algorithms. We stated the questions as an attempt to trigger someone that works in this area to think about potential pitfalls in their algorithms given the application / scenario described and portrayed in the plot.

    We completely agree that the thermal / clock management might not be at the core of the problem. We have a hanging application (as shown in the stack trace) under CUDA 9.2 and 10.0 (not under 8.0), thermal/clock/SM% data from nvidia-smi around the time of the hang, and a closed source CUDA library and driver. We’re not sure where else to start in order to drive toward the root of the problem. Granted, our environment is not the most conducive to debug this type of problem (remotely).

    We’re struggling to believe that this is a thermally induced issue. Maybe NVIDIA changed something between CUDA 8.0 and 9.2 that triggered this behavior under the same thermal conditions? This prompted us to research the acceptable operating temperature for the K80, but surprisingly, we could not find it. The K80 datasheet states the “board” operating temperature is 0-45degC, which seems way off base for a GPU operating temperature. We’ve seen other posts that indicate thermal problems arise when the temperature reaches upwards of 90degC. As shown in the plot, our GPU temperature is usually around 70degC - this seems reasonable / within an acceptable range. This post states that 105degC is the upper limit for NVIDIA GPUs:

    We found it particularly intriguing that the clocks were throttled right before the hang. It seemed like a high probability of an issue (synchronization?) existing in this scenario under a fairly high load - this is why we made lots of assertions in this area. Again, we have limited visibility, so all we can do is speculate and hope someone at NVIDIA can offer insight. We will definitely try your suggestion to fix the application clocks - this will be a good datapoint.

    While the hardware is about 3 years old, we’re struggling to believe this problem is caused by a component failure because the exact same issue exists across many machines. Also, the hanging issue does not occur when we revert to CUDA 8.0. Note: this cluster was built by an NVIDIA Tesla certified vendor, listed on this page: Buy Tesla Personal High Performance Supercomputers | NVIDIA

    Thanks again for your feedback, we really appreciate it!

    Your points are well taken; in my previous post I was simply providing the output of my internal brain-storming session given that remotely diagnosing computer problems of this nature is akin to diagnosis of car trouble over the phone, and I am not the computer equivalent of Click and Clack of Car Talk fame :-).

    I don’t have personal experience with operating clusters, but am aware of published information that issues like significantly increased GPU memory error rates have been observed due to higher operating temperatures where identical nodes in the same cluster operate in different environments.

    I consider it helpful to “close the loop” by posting an outline of how an issue was ultimately resolved. It may help the next forum reader running into similar hard to diagnose issues. Some issues have a very non-intuitive root cause and/or resolution.

    Relevant power and thermal limits for a given GPU should be reported in the output of nvidia-smi:

    GPU Shutdown Temp
    GPU Max Operating Temp
    Memory Max Operating Temp
    Max Power Limit

    Some items may show up as N/A, not sure what is driving that. Note that (based on observation) the power and thermal management software seems to put on the brakes quite some ways before these absolute limits are reached. For example, for my GPU the shutdown temperature is reported as 104 deg C, but thermal throttling seems to kick in above 83 deg C. Likewise, any continuous load above 90% of the max power limit seems to trigger power capping.

    A theoretically possible connection between thermal management and your observation could be that a bug has crept into the “step down” sequence of the management software which could cause some GPUs to hang during a clock frequency transition. But that is pure speculation.

    The “operating temperature” you quote looks more like a supported range for the ambient temperature, i.e. environment. Often specs quote such ranges with the addition “non-condensing”, and may also specify a “non-operating” temperature range, i.e. when storing the parts.

    Hi Adam,

    Did you ever manage to figure out what the issue is?

    I have currently been dealing with a very similar issue of jobs hanging on our cluster that are using MPI and GPUs. The jobs run with 8 tasks (one core each) and are assigned to one GPU (mostly K80s). The jobs that hang continue to have 100% CPU activity, but the assigned GPU has 0% activity and a temperature that indicates no activity as well (however, it still has some memory filled from the tasks).

    Here are a few details of our systems that I have found the jobs to fail on:

    System Configuration
    2x E5-2660 v3
    (128 or 256)GB System Memory (on different nodes)
    One or Two NVIDIA Tesla K80 GPUs (2 or 4 logical devices, depending on node)
    RHEL 7.5 Maipo
    Kernel: 3.10.0-862.27.1.el7.x86_64
    CUDA 10.0
    CUDA Driver: 410.79
    GCC 4.8.5

    Regarding the code that I’m running, I’ve tried compiling with various versions of OpenMPI and IntelMPI, as well as with CUDA 5, 8, and 10, all of which result in similar hanging of jobs. Unfortunately, I don’t have any data points for running this code on an older version of the Linux kernel, but I’m interested to know if that may be related to the issue, since your application worked prior to the updates.

    I have not done as much detailed analysis of the job issues as you have, but I’m really interested to hear what you have done if you managed to get it fixed because our issues sound extremely similar. Our cluster also has some V100s, GTX 1080 Tis, and a few P100s. Next, I’m going to see if I can isolate the issue to just the K80s.


    Hi Sheridan,

    Sorry for the delay, I was hoping to include some testing results in my response to you (based on njuffa’s feedback), but we haven’t completed them yet.

    Anyway, we haven’t figured out what is causing the GPUs to hang. We suspect it is an issue introduced somewhere between CUDA 8.0 and 9.2 (possibly in the respective drivers: 375.26, 396.37). We actually just started running this application on AWS with V100s and believe the issue exists with those GPUs as well. We don’t have the same level of instrumentation / debugging capability enabled in the application yet, so we have limited insight. That said, we experience the same behavior - the application runs without issue for a long time (hours/days), but during the dynamic period when processes complete their work, clocks are throttling (due to auto/gpu boost), and thermal conditions are changing, we encounter the hanging condition. Based on the stack trace obtained from GDB, we observe the application is hung on a CUDA call (please see my first post).

    In our experience, with our application, running on our hardware / software (note all the caveats), the hanging issue is not a function of system, compiler, and the fact that it is using MPI / SLURM. It appears to be correlated with the CUDA toolkit / driver and possibly the Linux kernel.

    Note: the 100% CPU utilization is due to MPI being in a very tight message polling loop. The process is likely not doing any “actual” work.

    I will follow up with our testing results soon.


    Attached are 3 plots (at various zoom locations / levels) of one K80 device, obtained from nvidia-smi after fixing the clocks. A narrative for each plot is provided below.

    Note: I only attached the plots to the post this time, rather than in-lining images / providing external links, as those methods were problematic.

    System Configuration

    • Dual 10-core Intel Xeon CPUs (E5-2670v2)
    • 377GB System Memory
    • One NVIDIA Tesla K80 GPU (2 logical devices)
    • CentOS 7.6.1810
    • Kernel: 3.10.0-957.1.3.el7.x86_64
    • CUDA 10.0.130
    • CUDA Driver: 410.79
    • GCC 4.8.5

    Plot #1 [K80GpuHang2-1.png]:
    This is a view of the entire run. Notice the clocks are fixed. During the beginning 24 hour period, the GPU is near 100% and running without any problems / hangs. However, there are three interesting distinct rises in temperature followed by a sharp decrease in GPU utilization (~05:00, ~10:00, ~14:30) - maybe this is a key as to what is going on with the thermal management algorithm, or maybe it is a red herring? Anyway, the processes start finishing work @16:00, leading to the hangs starting @18:35.

    Plot #2 [K80GpuHang2-2.png]:
    This is a zoom of the time period when some processes are finishing their work and the GPU hangs for ~2 minutes (@18:35), and then indefinitely @18:40.

    Plot #3 [K80GpuHang2-3.png]:
    This is a zoom of the 2 minute hang.

    At this point, we will test the following CUDA toolkit / driver and Linux kernel combinations:

    • CUDA: 10.1, Driver: 418.39, Linux Kernel: 3.10.0-957.5.1
    • CUDA: 10.1, Driver: 418.39, Linux Kernel: 3.10.0-957.10.1
    • CUDA: 8.0, Driver: 375.26, Linux Kernel: 3.10.0-957.10.1


    I cannot tell what the units are on the x-axis and y-axis of the plots. I see small spikes in temperature, but how big are these spikes (from what temperature to what temperature)? The scale on the left seems to imply that overall temperature is very low, even during the spikes. Far away from any thermal thresholds of the GPU.

    After the small spikes in temperature, there are down-clocking events. What is the time difference between temperature spike and down-clocking? The time scale at the bottom seems to suggest the units are hours, which would imply down-clocking happens many minutes after temperature spike? Any down-clocking due to reaching GPU thermal limits would happen within seconds.

    I don’t know the workload you are monitoring here, which makes it difficult to interpret the diagrams. I sometimes do long-time monitoring of Folding@Home tasks with GPU-Z. From that it is quite clear that programs tend to have phases with particular power draw, GPU temperature, and clock characteristics. This is presumably a function of how hard particular program phases exercise various function units of the GPU, such as computational cores, memory interfaces, and interconnect (PCIe). These may be cyclical, which then results in visual patterns.

    Your description of the third image says it is from a hang with CUDA 8. I thought your previous observations were that weird hangs occur only with CUDA 10, not CUDA 8, giving rise to a working hypothesis that the issue is triggered by a change in CUDA software stack version?

    As for the third image, all it proves is there were two minutes of GPU inactivity. That could happen for any number of reasons. Maybe the app was in a non-GPU accelerated portion at the time (such as writing out checkpoint data), a blockage at the PCIe level caused by other PCIe devices may have prevented new commands from reaching the GPU, some sort of issue with the host system memory controller (scrubbing?). Generally, we might want to look for some sort of resource contention. It might be interesting to use OS performance counters to check whether any other system events correlate with these mysterious regions of GPU inactivity.

    Hi njuffa,

    I’m very confused by your response. Your comments/questions are inconsistent with my post and attached plots.

    All the axes are labelled: the left vertical axis is temperature (degC) and SM% (I was able to overload this axis since the SM% offered a reasonable scale for the temperature). The right vertical axis is the clock frequency (MHz). The horizontal axis is time (the first two plots are ticks of hours, the last has ticks of 5 minutes). The temperature spikes are on the order of 5 degC. Agreed, these are far away from thermal GPU thresholds.

    I don’t see any down-clocking events throughout the plots. The clock is stable at 562MHz throughout the entire run, as expected since we fixed the clocks.

    The GPU-Z characteristics you listed are all available from nvidia-smi. We are running nvidia-smi “process” and “device” monitoring at a 1 second sampling rate to generate the plots - this is how we can zoom to such fine detail.
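
    As a sketch of how the idle intervals can be picked out of such 1 second samples (this is not our plotting code; `Sample`, `IdleRun`, and `find_idle_runs` are hypothetical names, and reading the nvidia-smi log into `Sample` values is left out):

```cpp
#include <cstddef>
#include <vector>

// One sample per second, as in the 1 s nvidia-smi logging described above.
struct Sample {
    long t_seconds;  // sample time in seconds (any monotonic origin)
    int sm_percent;  // SM utilization in percent
};

struct IdleRun {
    long start_seconds;  // time of the first idle sample in the run
    long end_seconds;    // time of the last idle sample in the run
};

// Returns every maximal run of consecutive samples with sm_percent == 0
// that spans at least `min_duration_seconds`.
std::vector<IdleRun> find_idle_runs(const std::vector<Sample>& samples,
                                    long min_duration_seconds) {
    std::vector<IdleRun> runs;
    std::size_t i = 0;
    while (i < samples.size()) {
        if (samples[i].sm_percent != 0) {
            ++i;
            continue;
        }
        // Extend the run while the following samples are also idle.
        std::size_t j = i;
        while (j + 1 < samples.size() && samples[j + 1].sm_percent == 0) {
            ++j;
        }
        const long duration = samples[j].t_seconds - samples[i].t_seconds;
        if (duration >= min_duration_seconds) {
            runs.push_back(IdleRun{samples[i].t_seconds, samples[j].t_seconds});
        }
        i = j + 1;
    }
    return runs;
}
```

    A run found this way only shows that the SMs were idle, not why. It still has to be matched against the application's own timing of the blocked CUDA call before it can be called a hang.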

    I did not say the third image is a hang from CUDA 8. The last section of my post outlines the configurations we are planning to test next. The system configuration is stated at the beginning of the post and indicates our test used CUDA 10.0 with the 410.79 driver.


    I just looked at the linked pictures again, and I still don’t see any information that the units on the left y-axis “temp and sm%” are degC and units on the right y-axis “clock” are MHz. The labeling of the x-axis is simply “datetime”.

    A clock of 562 MHz seemed too low to make sense for a K80 under load, but I see now after review of the K80 specs that the base core clock for K80 is specified at 560 MHz (and you fixed the clock at that, as you say).

    I was confused about down-clocking, because I got confused about the colors (maybe I should have my color vision checked). The teal (?) curve is SM activity, the drop is there.

    So you are running a very cool (in the literal sense) K80. Occasionally it runs a bit hotter for a short while. And occasionally it is inactive for a few minutes. I don’t think we can tell anything from that. Have you had a chance to file a bug report with NVIDIA?

    You did not. I skimmed the text too quickly and conflated two separate pieces of data. My mind is pre-occupied with an important issue today, I probably should have refrained from posting.

    Here is another idea what one might want to look at: I wonder whether the inactivity could be related to lock handling either in the CUDA driver or elsewhere in the application. I forget what the appropriate Linux tool is for looking at that. Maybe “perf lock”?

    My apologies, I assumed (probably wrongly) that everyone would know the units on the axes from the labels, given that the data comes from nvidia-smi. I will be more explicit by including this information in future plots.

    Yes, I fixed the clock at 562 MHz to be conservative, helping stay far below thermal thresholds.

    I contacted NVIDIA about filing a bug after your previous post. I thought this would be fairly straight-forward, but I’m working through the process with NVIDIA support. It feels like I’m going down the wrong path and it shouldn’t be this difficult, so if you have any links / suggestions on filing a bug, they would be very much appreciated!

    No problem - we are all very busy. I appreciate your reply!

    I will explore using “perf lock” as part of our testing.

    Thanks again for all your help / suggestions!


    I have only filed a few bugs from outside of NVIDIA, one fairly recently. The process seems to have been streamlined in recent times, as the web form one fills in to file a bug appeared much simplified in the most recent iteration compared to what I recalled from a few years back.

    NVIDIA handles bug reports pretty much like any other company that I am familiar with (which is about ten or so). Before a bug is assigned to appropriate engineering resources for resolution, it must be reproduced in-house, as reliably as possible. It may be at this stage that you find yourself iterating at length with the “frontline” repro team, in particular if the issue requires very specific hardware and software for successful reproduction. NVIDIA may have to assemble a system first that matches yours, and that may be more difficult when some components are older and no longer readily available.

    The repro team may also work on increasing the likelihood of failure, or shortening time to failure, in order to make the best use of the most expensive engineering resources (actual cost + opportunity cost).

    In-house reproduction is an indispensable part of an orderly bug resolution process, and there are really no ways to side step it.

    If a bug is filed based on what I see here, the repro team will almost certainly ask for a self-contained reproducer code. It’s usually the first request back to the customer (when one has not been supplied).

    If that isn’t supplied, forward progress is likely to be approximately zero, based on my experience.

    You’re welcome to do what you wish, of course. I’ve not jumped into this thread so far, because this concept was identified early on in the thread (search for the first instance of the use of the word “reproduce”).

    I ran into a 15 second cudaMemcpy hang just yesterday and I can reproduce it with one of the CUDA samples.
    CUDA Samples\v10.0\0_Simple\simpleP2P
    It can be reproduced by making it select two devices that support peer-access, but are not actually housed in the same K80 (without peer-access it is OK, and with peer-access it is OK for devices that are housed together in one K80).
    So my devices 3 to 6 could have peer-access among each other, but only the combinations 3,5 and 3,6 had problems, not 3,4 (on the same K80). You will have to hack a little into the sample to achieve this scenario, because it tries to use the first two peer-access supported devices (for K80 that will then be two that are housed together).
    Tested with CUDA 9.1 and 10, drivers 411 and 419, Windows Server 2016 with 3 K80.
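
    The device selection described above can be sketched as follows. This is not the simpleP2P source, only the selection logic one would substitute for its "first two peer-capable devices" choice; `pick_cross_board_peer_pair` is a hypothetical helper. It assumes the caller has already filled `can_access` (for example with cudaDeviceCanAccessPeer for every ordered pair) and supplies a board label per device in `board_of`; how the board label is obtained is left to the caller.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Picks the first device pair (i, j), i < j, that reports peer access in
// both directions but sits on different physical boards.
// can_access must be an n x n matrix and board_of must have n entries;
// both are assumed to be filled by the caller.
// Returns {-1, -1} if no such pair exists.
std::pair<int, int> pick_cross_board_peer_pair(
        const std::vector<std::vector<bool> >& can_access,
        const std::vector<int>& board_of) {
    const std::size_t n = board_of.size();
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = i + 1; j < n; ++j) {
            const bool peers = can_access[i][j] && can_access[j][i];
            if (peers && board_of[i] != board_of[j]) {
                return std::make_pair(static_cast<int>(i),
                                      static_cast<int>(j));
            }
        }
    }
    return std::make_pair(-1, -1);
}
```

    With four mutually peer-capable devices on two boards (labels 0, 0, 1, 1), this returns the pair of indices (0, 2), which corresponds to a combination like my devices 3 and 5 rather than 3 and 4.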

    The only way to resolve this issue is through filing bugs with NVIDIA. The issue could well be limited to specific system platforms, so in addition to a self-contained, ready-to-run, reliable reproducer app to be added to the bug report, a detailed description of the specifics of the host platform may be essential.

    Thanks for your help, guys!

    The “reproducer” application I mentioned early in this thread exhibits slightly different behavior than the “real” application. The hangs are shorter, although still significant, and we have never been able to enter an infinite hang state. I was hoping to exactly match the behavior of the “real” application, but this is probably sufficient to support a bug report, so I’ll move forward on that front. I will work with the repro team if the application needs refinement.

    Thanks again for all your help! I’ll try to keep this thread up to date with bug / repro developments and results from our CUDA / Linux testing.

    I think the easiest thing for the team at NVIDIA is to have it reproducible with their own samples. That is why I have added my case to this thread, and I have just filed a bug report:
    “Hang in cudaMemcpy in CUDA sample simpleP2P”
    Will post news here.
    If anybody wants to try my case for reproduction, please revisit my previous post (3 posts up). I would appreciate any kind of feedback; it might help the communication with NVIDIA support.