Severe CUDA performance regression on Kepler hardware (K20, K40, K80) using latest drivers (410.xx)

Hello,

We are witnessing an 85% CUDA performance drop - down to 1/6th of the original performance (!!!) - when using the (latest) available drivers on our Linux (Debian/Stretch) K20, K40 and K80 setups, namely:

  • driver 410.78 (latest NVIDIA, general GPUs) alongside CUDA 9.2.148
  • driver 410.72 (latest NVIDIA, Tesla-specific) alongside CUDA 9.2.148
  • driver 384.130 (stock Debian/Stretch) alongside CUDA 8.0.44

We’ve been able to restore the expected performance using:

  • driver 375.26 (old “forward-ported” custom Debian/Jessie build) alongside CUDA 8.0.44

This affects all our Kepler hardware (corresponding to a hundred-odd thousand dollars of investment) and prevents us from moving our infrastructure to CUDA 9.x (since it is incompatible with the 375.xx drivers).

Is anyone aware of this issue?
Is there any known workaround, e.g. parameters passed to the driver (modprobe) or via nvidia-smi?

Thanks in advance for your support,

Cédric

PS: I can post our crude pyCUDA benchmarking script if need be (although it does nothing other than time the CPU-to-GPU memory transfer of two 1024MB vectors and their subsequent addition, subtraction, multiplication and division; as simple as it gets)
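
In the meantime, the gist of it is roughly the following (a minimal sketch only, NOT our actual script; the float32 dtype, the single 1024MB size shown and the gpuarray-based timing are illustrative assumptions):

#!/usr/bin/env python3
# Rough sketch only: time the host-to-GPU copy of two vectors, then four
# element-wise ops on them; sizes/dtype here are illustrative.
import time
import numpy as np
import pycuda.autoinit as autoinit  # creates a context on the first visible GPU
import pycuda.gpuarray as gpuarray

SIZE_MB = 1024
n = SIZE_MB * 1024 * 1024 // 4      # number of float32 elements per vector

a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32)

# CPU-to-GPU transfer ("Bandwidth")
t0 = time.time()
a_gpu = gpuarray.to_gpu(a)
b_gpu = gpuarray.to_gpu(b)
autoinit.context.synchronize()
print("H2D bandwidth: %.1f MB/s" % (2 * SIZE_MB / (time.time() - t0)))

# Element-wise add/sub/mul/div ("MFlops"); note the first launch also pays for
# kernel compilation unless pyCUDA's kernel cache is already warm
t0 = time.time()
for op in (lambda x, y: x + y, lambda x, y: x - y,
           lambda x, y: x * y, lambda x, y: x / y):
    op(a_gpu, b_gpu)
autoinit.context.synchronize()
print("GPU element-wise ops: %.1f Mops/s" % (4 * n / (time.time() - t0) / 1e6))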

Please check whether the NVIDIA persistence daemon is running; if not, start it and check whether that resolves the issue.
Otherwise, please put the GPUs under load, run nvidia-bug-report.sh as root and attach the resulting .gz file to your post. Hovering the mouse over an existing post of yours will reveal a paperclip icon.
https://devtalk.nvidia.com/default/topic/1043347/announcements/attaching-files-to-forum-topics-posts/

I doubt driver persistence has anything to do with the problem at hand, since our pyCUDA script keeps hold of the GPU throughout the entire “benchmark”.

However, for the sake of thoroughness, and after going through Driver Persistence :: GPU Deployment and Management Documentation, I tried:

  • nvidia-smi -i … -pm 1 → no change
  • nvidia-smi -i … -pm 0 + nvidia-persistenced → no change

Please find attached the two nvidia-bug-report.sh outputs, from the exact same node, first with driver 375.26 and then after upgrading to driver 410.78. Below is the output of nvidia-smi and of our “benchmark” during both runs:

* root@vgnb001:~
# nvidia-smi 
Fri Nov 23 12:32:27 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.26                 Driver Version: 375.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K20m          Off  | 0000:00:05.0     Off |                    0 |
| N/A   31C    P0    49W / 225W |      0MiB /  4742MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K20m          Off  | 0000:00:06.0     Off |                    0 |
| N/A   28C    P0    51W / 225W |      0MiB /  4742MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
* 2018-11-23 12:32:27 +0100

* root@vgnb001:~
# CUDA_VISIBLE_DEVICES=0 ./pycuda_benchmark.py
Size [MB]|Bandwidth GPU [MB/s]|MFlops GPU [Mops/s]|MFlops CPU [Mops/s]|GPU vs CPU speedup
---------+--------------------+-------------------+-------------------+------------------
64       |3868.6              |134.9              |nan                |nan               
256      |4201.9              |7565.0             |nan                |nan               
1024     |4526.9              |9141.9             |nan                |nan               
* 2018-11-23 12:33:40 +0100

* root@vgnb001:~
# nvidia-smi 
Fri Nov 23 13:34:19 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.78       Driver Version: 410.78       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K20m          Off  | 00000000:00:05.0 Off |                    0 |
| N/A   32C    P0    49W / 225W |      0MiB /  4743MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K20m          Off  | 00000000:00:06.0 Off |                    0 |
| N/A   30C    P0    51W / 225W |      0MiB /  4743MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
* 2018-11-23 13:34:19 +0100

* root@vgnb001:~
# CUDA_VISIBLE_DEVICES=0 ./pycuda_benchmark.py
Size [MB]|Bandwidth GPU [MB/s]|MFlops GPU [Mops/s]|MFlops CPU [Mops/s]|GPU vs CPU speedup
---------+--------------------+-------------------+-------------------+------------------
64       |3068.2              |162.2              |nan                |nan               
256      |2115.5              |1905.1             |nan                |nan               
1024     |3306.4              |1971.8             |nan                |nan               
* 2018-11-23 13:35:21 +0100

PS: on the given node (an oldie I can readily run tests on), the performance drop is only 78% (~1/5th of the original performance); the 85% performance drop is witnessed on our “beefed up” production nodes.

nvidia-bug-report-375.26.log.gz (350 KB)
nvidia-bug-report-410.78.log.gz (1.48 MB)

Depending on system/GPU, the persistence daemon is responsible for more than just keeping the driver alive, and is thus essential, e.g. https://devtalk.nvidia.com/default/topic/1037778/?comment=5272411
TBH, both logfiles exhibit very strange GPU behaviour, especially this:

*** /proc/driver/nvidia/./gpus/0000:00:06.0/information
*** ls: -r--r--r-- 1 root root 0 2018-11-23 12:33:15.060934313 +0100 /proc/driver/nvidia/./gpus/0000:00:06.0/information
Model: 		 Tesla K20m
IRQ:   		 33
GPU UUID: 	 GPU-????????-????-????-????-????????????
Video BIOS: 	 ??.??.??.??.??
Bus Type: 	 PCIe
DMA Size: 	 40 bits
DMA Mask: 	 0xffffffffff
Bus Location: 	 0000:00:06.0
Device Minor: 	 1

The ??? normally result from an uninitialized GPU due to a missing persistence daemon, but at the same time nvidia-smi reports a GPU usage of 21% (375) / 100% (410) without any running process, and the other GPU at 0000:00:05.0 is reporting a running process but only 1% load.
Could you please create and attach a new nvidia-bug-report.log while the persistence daemon is running?

Yes. We’ve always been baffled by those weird nvidia-smi outputs (but never worried about it, since workloads did not seem - so far - to suffer from it)

Please see the attached output, along with the corresponding CLI session:

* root@vgnb001:~
# nvidia-persistenced  # Note: I haven't taken the time - yet - to provision a non-root account for the purpose
* 2018-11-23 14:54:40 +0100

* root@vgnb001:~
# ps waxuf | grep nvidia-persistenced
root       302  3.4  0.0   8504  1336 ?        Ss   14:54   0:00 nvidia-persistenced
* 2018-11-23 14:54:48 +0100

* root@vgnb001:~
# CUDA_VISIBLE_DEVICES=0 ./pycuda_benchmark.py
Size [MB]|Bandwidth GPU [MB/s]|MFlops GPU [Mops/s]|MFlops CPU [Mops/s]|GPU vs CPU speedup
---------+--------------------+-------------------+-------------------+------------------
64       |3056.3              |323.7              |nan                |nan               
256      |3242.1              |1824.5             |nan                |nan               
1024     |3282.2              |1968.8             |nan                |nan               
* 2018-11-23 14:56:10 +0100

* root@vgnb001:~
# nvidia-smi 
Fri Nov 23 15:00:53 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.78       Driver Version: 410.78       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K20m          On   | 00000000:00:05.0 Off |                    0 |
| N/A   31C    P8    16W / 225W |      0MiB /  4743MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K20m          On   | 00000000:00:06.0 Off |                    0 |
| N/A   30C    P8    26W / 225W |      0MiB /  4743MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
* 2018-11-23 15:00:53 +0100

Ha! nvidia-smi now makes sense! But the performance still isn’t there.

nvidia-bug-report-410.78+persistenced.log.gz (1.43 MB)

OK, now that the values make sense, you’ll unfortunately have to start gathering comparison data anew.
One variation: please start the persistence daemon with options

--verbose --no-persistence-mode

Then please create new logs for both the 410 and the 375 driver.
One question: did you ever test driver 410 alongside CUDA 10?

* root@vgnb001:~
# pkill -f nvidia-persistenced 
* 2018-11-23 16:01:09 +0100

* root@vgnb001:~
# nvidia-persistenced --verbose --no-persistence-mode
* 2018-11-23 16:01:38 +0100

* root@vgnb001:~
# ps waxuf | grep nvidia-persistenced
root     13421  0.0  0.0   8504  1316 ?        Ss   16:01   0:00 nvidia-persistenced --verbose --no-persistence-mode
* 2018-11-23 16:01:49 +0100

* root@vgnb001:~
# CUDA_VISIBLE_DEVICES=0 ./pycuda_benchmark.py
Size [MB]|GPU Bandwidth [MB/s]|GPU MFlops [Mops/s]|CPU MFlops [Mops/s]|GPU vs CPU speedup
---------+--------------------+-------------------+-------------------+------------------
64       |3041.8              |1125.0             |nan                |nan               
256      |3153.5              |1906.0             |nan                |nan               
1024     |3217.7              |1623.3             |nan                |nan               
* 2018-11-23 16:04:34 +0100

Note: very interestingly, the 1024MB results get worse than before when I launch nvidia-bug-report.sh concurrently during the corresponding timing phase; this does not happen when the persistence daemon is launched without the ‘--verbose --no-persistence-mode’ options.

Please find attached the logs for the 410 driver.

I’ll come back to you with the 375 logs as soon as I have “imported” nvidia-persistenced (which is missing from the stock Debian packaging) from the NVIDIA “source” (375.26) download into my existing base.

Well, we’re not there yet. “Hacking” our own drivers and toolkit (and Debian packaging) onto our existing Debian/Stretch base and readying it for production is already enough work ;-)
nvidia-bug-report-410.78+persistenced-no-pm.log.gz (1.32 MB)

This doesn’t look right. Without persistence mode, the GPUs are again not properly initialized despite being registered by the persistence daemon. So please also run the 375 driver without persistence mode (persistence daemon started with --no-persistence-mode).
For the 410 driver, the collection of logs is now complete.

Here we go.

* root@vgnb001:~
# ./nvidia-persistenced.375.26 --verbose --no-persistence-mode
* 2018-11-23 16:35:46 +0100

* root@vgnb001:~
# ps waxuf | grep nvidia-persistenced
root      5066  0.0  0.0   8468  1284 ?        Ss   16:35   0:00 ./nvidia-persistenced.375.26 --verbose --no-persistence-mode
* 2018-11-23 16:35:55 +0100

* root@vgnb001:~
# CUDA_VISIBLE_DEVICES=0 ./pycuda_benchmark.py
Size [MB]|GPU Bandwidth [MB/s]|GPU MFlops [Mops/s]|CPU MFlops [Mops/s]|GPU vs CPU speedup
---------+--------------------+-------------------+-------------------+------------------
64       |3911.5              |2585.2             |nan                |nan               
256      |4382.6              |7561.5             |nan                |nan               
1024     |4499.4              |5896.4             |nan                |nan               
* 2018-11-23 16:39:36 +0100

* root@vgnb001:~
# nvidia-smi 
Fri Nov 23 16:40:33 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.26                 Driver Version: 375.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K20m          Off  | 0000:00:05.0     Off |                    0 |
| N/A   32C    P0    49W / 225W |      0MiB /  4742MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K20m          Off  | 0000:00:06.0     Off |                    0 |
| N/A   27C    P0    51W / 225W |      0MiB /  4742MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
* 2018-11-23 16:40:33 +0100

* root@vgnb001:~
# ps waxuf | grep nvidia-persistenced
root      5066  0.0  0.0   8468  1284 ?        Ss   16:35   0:00 ./nvidia-persistenced.375.26 --verbose --no-persistence-mode
* 2018-11-23 16:42:16 +0100

Same performance drop when running nvidia-bug-report.sh concurrently.

And I can confirm that the weird nvidia-smi outputs are back, despite nvidia-persistenced running (albeit with the ‘--no-persistence-mode’ option).

And I can also confirm that running nvidia-persistenced without those options fixes the two issues above.

nvidia-bug-report-375.26+persistenced-no-pm.log.gz (352 KB)

Please also attach a log from the last test case, 375 + persistence mode, to have the whole picture.

There they are
nvidia-bug-report-375.26+persistenced.log.gz (353 KB)

OK, looking at the data, this boils down to poor memory and memory-bandwidth usage with the 410 driver.
What kind of ops are you using in your benchmark to measure the bandwidth and the ‘Mops/s’?

375:

FB Memory Usage
        Total                       : 4742 MiB
        Used                        : 3455 MiB
    Utilization
        Gpu                         : 85 %
        Memory                      : 84 %
    Power Readings
        Power Management            : Supported
        Power Draw                  : 102.19 W
    Clocks
        Graphics                    : 705 MHz
        SM                          : 705 MHz
        Memory                      : 2600 MHz

410:

FB Memory Usage
        Total                       : 4743 MiB
        Used                        : 1158 MiB
    Utilization
        Gpu                         : 80 %
        Memory                      : 5 %
    Power Readings
        Power Management            : Supported
        Power Draw                  : 59.18 W
    Clocks
        Graphics                    : 705 MHz
        SM                          : 705 MHz
        Memory                      : 2600 MHz

Attached is our pycuda_benchmark.py script, exactly as we’ve been using it for years to compare performance as we evolve our setups. Below is a typical output:

* root@vgnb001:~
# CUDA_VISIBLE_DEVICES=0 ./pycuda_benchmark.py
INFO: Data (vector) size = 1MB
INFO: Creating (CPU) data
INFO: Timing CPU MFlops (add, sub, mul, div; 20 times)
INFO: Timing GPU Bandwith (CPU-to-GPU data transfer; 20 times)
INFO: Timing GPU MFlops (add, sub, mul, div; 20 times)
INFO: Data (vector) size = 4MB
INFO: Creating (CPU) data
INFO: Timing CPU MFlops (add, sub, mul, div; 20 times)
INFO: Timing GPU Bandwith (CPU-to-GPU data transfer; 20 times)
INFO: Timing GPU MFlops (add, sub, mul, div; 20 times)
INFO: Data (vector) size = 16MB
INFO: Creating (CPU) data
INFO: Timing CPU MFlops (add, sub, mul, div; 20 times)
INFO: Timing GPU Bandwith (CPU-to-GPU data transfer; 20 times)
INFO: Timing GPU MFlops (add, sub, mul, div; 20 times)
INFO: Data (vector) size = 64MB
INFO: Creating (CPU) data
INFO: Timing CPU MFlops (add, sub, mul, div; 20 times)
INFO: Timing GPU Bandwith (CPU-to-GPU data transfer; 20 times)
INFO: Timing GPU MFlops (add, sub, mul, div; 20 times)
INFO: Data (vector) size = 256MB
INFO: Creating (CPU) data
INFO: Timing CPU MFlops (add, sub, mul, div; 20 times)
INFO: Timing GPU Bandwith (CPU-to-GPU data transfer; 20 times)
INFO: Timing GPU MFlops (add, sub, mul, div; 20 times)
INFO: Data (vector) size = 1024MB
INFO: Creating (CPU) data
INFO: Timing CPU MFlops (add, sub, mul, div; 20 times)
INFO: Timing GPU Bandwith (CPU-to-GPU data transfer; 20 times)
INFO: Timing GPU MFlops (add, sub, mul, div; 20 times)
Size [MB]|GPU Bandwidth [MB/s]|GPU MFlops [Mops/s]|CPU MFlops [Mops/s]|GPU vs CPU speedup
---------+--------------------+-------------------+-------------------+------------------
1        |466.9               |35.4               |1386.4             |0.0               
4        |1492.5              |516.3              |1225.4             |0.4               
16       |2628.5              |1424.5             |821.7              |1.7               
64       |3040.4              |1749.6             |399.3              |4.4               
256      |3158.9              |1894.4             |403.9              |4.7               
1024     |3219.4              |1620.1             |406.3              |4.0               
WARNING: Please 'rm -rf ~/.cache/pycuda' (pyCUDA cache directory) NOW!
* 2018-11-23 20:23:51 +0100

PS: in the posts above, I had stripped out the less relevant tests/information.

(I’ll follow up with you next Monday; it’s time to call it a day in my timezone)
pycuda_benchmark.py.gz (859 Bytes)

OK, that benchmark looks simple and straightforward. Which versions of pyCUDA and numpy are installed?
Did you also test the performance on the 410 driver using your regular workload, and if so, what kind of libraries/frameworks are involved there?

pyCUDA:

  • 2016.1.2+git20161024-1+b1; stock Debian/Stretch
  • 2018.1.1; backported from Debian/Unstable and rebuilt/repackaged against CUDA 9.2 (used alongside the 410 driver, to clear up the doubt raised by your mention of GPU initialization issues)

numpy:

  • 1.12.1-3: stock Debian/Stretch

I’ll do some comparative ResNet and AlexNet runs - taken from https://github.com/pytorch/examples/tree/master/imagenet - using pyTorch 0.4.1 on Monday.
I guess/hope the results will be conclusive.

PS: Maybe I should have mentioned in the opening post that we do not witness such a regression when using the 384 driver on our Pascal (P40) nodes. We haven’t tried the 410 driver on those nodes yet; but remember, the 384 driver leads to the same regression as the 410 one on Kepler nodes. This points towards the driver rather than the computation stack :-/

Nope. That wouldn’t be conclusive. ImageNet is a very large dataset and any benchmark will most likely be storage/network-bound.

I finally opted for comparative MNIST runs - taken from https://github.com/pytorch/examples/tree/master/mnist - whose underlying dataset is small enough to be held in the host’s local filesystem cache (as verified by our host performance monitoring tools).

PS: I had to switch to K40 (11GiB DRAM) setups for the purpose.
PS: nvidia-persistenced is running with no options.

Driver 375.26 + CUDA 8.0.44 + pyCUDA 2016.1.2 + pyTorch 0.4.1 results:

* user@vgnc001:/tmp/user/pyTorch
$ nvidia-smi 
Mon Nov 26 11:54:01 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.26                 Driver Version: 375.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K40m          On   | 0000:00:04.0     Off |                    0 |
| N/A   30C    P8    21W / 235W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
* 2018-11-26 11:54:01 +0100

* user@vgnc001:/tmp/user/pyTorch
$ apt-cache policy python3-pycuda
python3-pycuda:
  Installed: 2016.1.2+git20161024-1+b1
* 2018-11-26 11:54:16 +0100

* user@vgnc001:/tmp/user/pyTorch
$ CUDA_VISIBLE_DEVICES=0 ./pycuda_benchmark.py
Size [MB]|GPU Bandwidth [MB/s]|GPU MFlops [Mops/s]|CPU MFlops [Mops/s]|GPU vs CPU speedup
---------+--------------------+-------------------+-------------------+------------------
1        |948.9               |36.3               |1695.5             |0.0               
4        |2948.1              |1296.0             |1655.2             |0.8               
16       |6520.2              |4192.6             |1058.5             |4.0               
64       |8025.9              |8416.8             |426.8              |19.7              
256      |7539.0              |11321.6            |434.0              |26.1              
1024     |7384.9              |12072.4            |441.2              |27.4              
* 2018-11-26 11:55:53 +0100

* user@vgnc001:/tmp/user/pyTorch
$ source ./pytorch-0.4.1+cuda-8.0.py35.env.d/bin/activate
(pytorch-0.4.1+cuda-8.0.py35.env.d) * 2018-11-26 11:56:02 +0100

* user@vgnc001:/tmp/user/pyTorch
$ time CUDA_VISIBLE_DEVICES=0 python3.5 ./pytorch/examples/mnist/main.py --epochs 10 ./ >/dev/null
real	1m43.329s
user	2m3.280s
sys	0m21.552s
(pytorch-0.4.1+cuda-8.0.py35.env.d) * 2018-11-26 11:57:51 +0100

Driver 410.78 + CUDA 9.2.148 + pyCUDA 2018.1.1 + pyTorch 0.4.1 results:

* user@vgnc002:/tmp/user/pyTorch
$ nvidia-smi 
Mon Nov 26 11:57:57 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.78       Driver Version: 410.78       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K40m          On   | 00000000:00:04.0 Off |                    0 |
| N/A   26C    P8    21W / 235W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
* 2018-11-26 11:57:57 +0100

* user@vgnc002:/tmp/user/pyTorch
$ apt-cache policy python3-pycuda
python3-pycuda:
  Installed: 2018.1.1-1+1custom1
* 2018-11-26 11:58:12 +0100

* user@vgnc002:/tmp/user/pyTorch
$ CUDA_VISIBLE_DEVICES=0 ./pycuda_benchmark.py
Size [MB]|GPU Bandwidth [MB/s]|GPU MFlops [Mops/s]|CPU MFlops [Mops/s]|GPU vs CPU speedup
---------+--------------------+-------------------+-------------------+------------------
1        |1705.4              |47.3               |1706.1             |0.0               
4        |3012.2              |1568.6             |1659.8             |0.9               
16       |3260.2              |2000.3             |753.7              |2.7               
64       |3064.1              |2150.3             |351.6              |6.1               
256      |3064.1              |2163.4             |362.9              |6.0               
1024     |3082.9              |2167.4             |376.0              |5.8               
* 2018-11-26 12:00:29 +0100

* user@vgnc002:/tmp/user/pyTorch
$ source ./pytorch-0.4.1+cuda-9.2.py35.env.d/bin/activate
(pytorch-0.4.1+cuda-9.2.py35.env.d) * 2018-11-26 12:00:37 +0100

* user@vgnc002:/tmp/user/pyTorch
$ time CUDA_VISIBLE_DEVICES=0 python3.5 ./pytorch/examples/mnist/main.py --epochs 10 ./ >/dev/null
real	1m42.776s
user	1m59.872s
sys	0m19.704s
(pytorch-0.4.1+cuda-9.2.py35.env.d) * 2018-11-26 12:02:23 +0100

Just making sure the GPU is actually being used (by comparing with a CPU-only run):

* user@vgnc002:/tmp/user/pyTorch
$ time CUDA_VISIBLE_DEVICES=999 python3.5 ./pytorch/examples/mnist/main.py --epochs 10 ./ >/dev/null
real	5m6.849s
user	5m52.268s
sys	0m56.552s
(pytorch-0.4.1+cuda-9.2.py35.env.d) * 2018-11-26 12:29:15 +0100
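
(Side note: a trivial way to double-check from within Python that the GPUs are indeed hidden in that case - a quick sketch assuming the same pyTorch virtualenv:)

# With CUDA_VISIBLE_DEVICES=999, pyTorch should not see any GPU at all.
import torch
print(torch.cuda.is_available())  # expected: False
print(torch.cuda.device_count())  # expected: 0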

The problem cannot be reproduced with pyTorch! Our pyCUDA-based benchmark, however, is still affected.

I guess pyCUDA is now the most likely culprit for this regression, and I can mark this post INVALID. Sorry for the false alarm.

Any idea why this pyCUDA ↔ Kepler incompatibility might show up, though (but not with Pascal)?

AFAIK, pyCUDA is built upon the CUDA driver API while pyTorch is built upon the CUDA runtime API. So this might still be a driver problem, but you’d better check with the pyCUDA devs since they should have better insight into this.
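
If you want a quick sanity check of which CUDA level each stack actually reports, something along these lines should do (a rough sketch, assuming pycuda and torch are importable from the same environment):

# Print the CUDA version pyCUDA was compiled against, the CUDA version the
# installed kernel driver supports, and the toolkit version pyTorch was built with.
import pycuda.driver as cuda
import torch

cuda.init()
print("pyCUDA - compiled against CUDA :", cuda.get_version())         # e.g. (9, 2, 0)
print("pyCUDA - driver-supported CUDA :", cuda.get_driver_version())  # e.g. 10000 for 10.0
print("pyTorch - built against CUDA   :", torch.version.cuda)         # e.g. '9.2'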

And another thing that might clear pyCUDA is the fact that MNIST does not use large data structures (GPU memory usage peaks at 400MiB), while our pyCUDA benchmark - which uses as much as ~3.7GiB of GPU memory (incl. two data structures that weigh up to 1GiB each) - shows a clear capping occurring between the 4MB and 16MB test cases.

I guess there is no hope of having NVIDIA simply verify this issue in its own lab environment?
(given that all the necessary resources are now readily available, including our pycuda_benchmark script)

You could send the essence of it to linux-bugs[at]nvidia.com, and maybe also ask here: https://devtalk.nvidia.com/default/board/57/cuda-programming-and-performance/
whether anybody else has noticed a performance regression with pyCUDA.