Running TensorFlow on top of cuDNN 7.0.5 for CUDA 9.1 on macOS 10.13.4

Hi All:

I followed https://byai.io/howto-tensorflow-1-6-on-mac-with-gpu-acceleration/ to build TensorFlow 1.6rc with cuDNN 7.0.5 and CUDA 9.1 on macOS.

However, I get the following issue when running TensorFlow:

>>> tf.Session()
2018-04-24 23:12:48.265064: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:859] OS X does not support NUMA - returning NUMA node zero
2018-04-24 23:12:48.265304: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1331] Found device 0 with properties:
name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.7335
pciBusID: 0000:c4:00.0
totalMemory: 8.00GiB freeMemory: 7.86GiB
2018-04-24 23:12:48.265343: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1410] Adding visible gpu devices: 0
2018-04-24 23:12:48.877817: I tensorflow/core/common_runtime/gpu/gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-04-24 23:12:48.878091: I tensorflow/core/common_runtime/gpu/gpu_device.cc:917] 0
2018-04-24 23:12:48.878273: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 0: N
2018-04-24 23:12:48.878514: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1021] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7591 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:c4:00.0, compute capability: 6.1)
2018-04-24 23:12:48.879315: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 7.41G (7960118784 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2018-04-24 23:12:48.879703: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 6.67G (7164106752 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
<tensorflow.python.client.session.Session object at 0x10c2abfd0>

1. I'm just wondering if anyone else has met with this issue?
2. Is there any official release of cuDNN for CUDA 9.1 on macOS 10.13.4? I haven't found one in the official cuDNN downloads area.
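
(I know TF 1.x can be told not to pre-allocate nearly all free GPU memory via ConfigProto's gpu_options; a minimal sketch below, though I'd still like to understand why the default allocation fails on this setup.)

import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # allocate GPU memory on demand instead of ~all of it up front
# Or cap the fraction of device memory TF may claim:
# config.gpu_options.per_process_gpu_memory_fraction = 0.5
sess = tf.Session(config=config)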

Best Regards
Orlando

Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:01:00.0
totalMemory: 11.00GiB freeMemory: 9.86GiB
2018-07-02 09:21:41.499407: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1483] Adding visible gpu devices: 0
2018-07-02 09:21:42.048827: I tensorflow/core/common_runtime/gpu/gpu_device.cc:964] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-07-02 09:21:42.048842: I tensorflow/core/common_runtime/gpu/gpu_device.cc:970]      0
2018-07-02 09:21:42.048846: I tensorflow/core/common_runtime/gpu/gpu_device.cc:983] 0:   N
2018-07-02 09:21:42.049213: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 9529 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
2018-07-02 09:21:42.049856: E tensorflow/core/common_runtime/gpu/gpu_device.cc:228] Illegal GPUOptions.experimental.num_dev_to_dev_copy_streams=0 set to 1 instead.
2018-07-02 09:21:42.050080: E tensorflow/stream_executor/cuda/cuda_driver.cc:903] failed to allocate 9.31G (9992399104 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY

I met the exact same issue. @xiandao.airs, do you have any solutions?

cuDNN 7.1.4 is the current recommended release of cuDNN.

It is available for OSX with CUDA 9.2.

If you're building TF from source for OSX, I would currently recommend using CUDA 9.2 and cuDNN 7.1.4.
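
If you want to confirm which cuDNN your build will actually pick up, you can read the version macros out of cudnn.h. A small sketch (the header path below is the typical install location and is an assumption; adjust it for your machine):

import re

CUDNN_HEADER = "/usr/local/cuda/include/cudnn.h"  # adjust if cuDNN is installed elsewhere

version = {}
with open(CUDNN_HEADER) as f:
    for line in f:
        m = re.match(r"#define CUDNN_(MAJOR|MINOR|PATCHLEVEL)\s+(\d+)", line)
        if m:
            version[m.group(1)] = m.group(2)

print("cuDNN {MAJOR}.{MINOR}.{PATCHLEVEL}".format(**version))  # expect 7.1.4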

@txbob, thanks for your reply. However, when I use CUDA 9.2 and cuDNN 7.1.4 to build TensorFlow on my Mac, I run into the following issue:

external/eigen_archive/unsupported/Eigen/CXX11/../../../Eigen/src/Core/NumTraits.h(180): warning: calling a __host__ function from a __host__ __device__ function is not allowed
detected during instantiation of "T Eigen::GenericNumTraits<T>::quiet_NaN() [with T=std::__1::complex<float>]"
./tensorflow/core/kernels/reduction_ops.h(60): here
./tensorflow/core/kernels/reduction_gpu_kernels.cu.h(271): error: initializer not allowed for shared variable
./tensorflow/core/kernels/reduction_gpu_kernels.cu.h(319): error: initializer not allowed for shared variable
./tensorflow/core/kernels/reduction_gpu_kernels.cu.h(271): error: initializer not allowed for shared variable
./tensorflow/core/kernels/reduction_gpu_kernels.cu.h(319): error: initializer not allowed for shared variable
./tensorflow/core/kernels/reduction_gpu_kernels.cu.h(271): error: initializer not allowed for shared variable
./tensorflow/core/kernels/reduction_gpu_kernels.cu.h(319): error: initializer not allowed for shared variable
external/eigen_archive/unsupported/Eigen/CXX11/../../../Eigen/src/Core/NumTraits.h(180): warning: calling a __host__ function("Eigen::internal::device::numeric_limits< ::std::__1::complex<float>> ::quiet_NaN") from a __host__ __device__ function("Eigen::GenericNumTraits< ::std::__1::complex<float>> ::quiet_NaN") is not allowed
6 errors detected in the compilation of “/var/folders/p8/91_v9_9d12q9wmlydb406rbr0000gn/T//tmpxft_00006e68_00000000-6_reduction_ops_gpu_complex64.cu.cpp1.ii”.
ERROR: /Users/llv23/Documents/05_machine_learning/dl_gpu_mac/tensorflow/tensorflow/core/kernels/BUILD:2807:1: output ‘tensorflow/core/kernels/_objs/reduction_ops_gpu/tensorflow/core/kernels/reduction_ops_gpu_complex64.cu.pic.o’ was not created
ERROR: /Users/llv23/Documents/05_machine_learning/dl_gpu_mac/tensorflow/tensorflow/core/kernels/BUILD:2807:1: not all outputs were created or valid
Target //tensorflow/tools/pip_package:build_pip_package failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 10131.637s, Critical Path: 181.47s
FAILED: Build did NOT complete successfully

It's a build error in the Eigen dependency, and it blocks the TensorFlow build.
As my daily research work depends heavily on TensorFlow, I rolled back to CUDA 9.1 + cuDNN v7.0.5 Library for OSX. @rovingbreeze

With that combination, the TensorFlow 1.8 build passes successfully.
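
For what it's worth, here is the minimal script I'd use to sanity-check a GPU build (stock TF 1.x API; the exact log output will vary):

import tensorflow as tf

print(tf.__version__)  # expect 1.8.0
with tf.device("/gpu:0"):
    a = tf.constant([1.0, 2.0])
    b = tf.constant([3.0, 4.0])
    c = a + b
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    print(sess.run(c))  # [4. 6.], with placement on GPU:0 logged to stderr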

The workaround (WAR) is:
Modify "./tensorflow/core/kernels/reduction_gpu_kernels.cu.h" and replace:

__shared__ value_type partial_sums[32 * 33];

with:

__shared__ __align__(alignof(value_type)) char partial_sums_raw[32 * 33 * sizeof(value_type)];
value_type* partial_sums = reinterpret_cast<value_type*>(partial_sums_raw);

There are two places defining "partial_sums" (the two flagged lines, 271 and 319); apply the change to both. The trick works because CUDA does not allow a __shared__ variable to be initialized, and value_type (std::complex here) has a default constructor that counts as an initializer; a raw char buffer has no constructor, so reinterpret_cast-ing it back to value_type* sidesteps the error.