CUDNN_STATUS_EXECUTION_FAILED with CUDA 8.0.61 + cuDNN 6.0.21 + Ubuntu 16.04 + BVLC Caffe

Hi, I ran into the problem below. Google turns up only a few results for it, and none of the suggestions worked for me.

System: Ubuntu 16.04
CUDA: 8.0.61
cuDNN: 6.0.21
Caffe: latest master branch of https://github.com/BVLC/caffe, compiled with CUDA and cuDNN.
GPU: the Ubuntu server has 8 1080 Ti GPUs.

Problem description:
Inference with my network works on GPU 0. However, when I switch to any of the other GPUs, it fails with the following error message:

F0709 10:23:23.628834 11793 cudnn_conv_layer.cu:28] Check failed: status == CUDNN_STATUS_SUCCESS (8 vs. 0) CUDNN_STATUS_EXECUTION_FAILED
*** Check failure stack trace: ***
[1] 11793 abort (core dumped) python tools/infer_better.py
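
To narrow down whether the failure depends on the physical GPU or on the device index used inside the process, one option is to run the same script once per GPU with only that GPU exposed through the CUDA_VISIBLE_DEVICES environment variable. The following is a minimal diagnostic sketch, not a confirmed fix. The helper names env_for_gpu and run_on_each_gpu are hypothetical, and it assumes the script then selects device 0 internally, since the single visible GPU is renumbered to 0.

```python
import os
import subprocess
import sys


def env_for_gpu(gpu_id, base_env=None):
    """Return a copy of the environment that exposes only one physical GPU."""
    env = dict(os.environ if base_env is None else base_env)
    # With this variable set, the CUDA runtime sees a single device,
    # which the process then addresses as device 0.
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    return env


def run_on_each_gpu(command, gpu_ids):
    """Run `command` once per GPU id and return {gpu_id: exit code}."""
    results = {}
    for gpu_id in gpu_ids:
        # No check=True: a crash on one GPU should not stop the sweep.
        proc = subprocess.run(command, env=env_for_gpu(gpu_id))
        results[gpu_id] = proc.returncode
    return results
```

For the setup above, a call such as run_on_each_gpu([sys.executable, "tools/infer_better.py"], range(8)) would give one exit code per GPU. On Linux a negative code means the process was killed by a signal, so an abort like the one in the log would show up as -6 (SIGABRT).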