"CreateMutex" operator error on CUDA with caffe2 on TX2

I have run the program on my desktop GPU (1050Ti), but an error occurs when I run it on the TX2.
Here is the error output:
"Cannot create operator of type 'CreateMutex' on the device 'CUDA'"
It occurs at these calls:
workspace.RunNetOnce(train_model.param_init_net)
workspace.CreateNet(train_model.net)

Why does this happen? Any suggestion would be a great help.

Hi,

Do you use caffe2? Or could you share more information about your program?

Thanks.

Sorry for the late reply. Here is my main caffe2 script:
from caffe2.python import brew, core, optimizer
from caffe2.python.modeling.initializers import pFP16Initializer

def create_model_FP16(m, device_opts, dtype, is_test=False):
    with core.DeviceScope(device_opts):
        initializer = pFP16Initializer
        with brew.arg_scope([brew.conv, brew.fc],
                            WeightInitializer=initializer,
                            BiasInitializer=initializer,
                            enable_tensor_core=True):
            conv1 = brew.conv(m, 'data', 'conv1', dim_in=1, dim_out=96, kernel=11, stride=4)
            relu1 = brew.relu(m, conv1, 'conv1')
            norm1 = brew.lrn(m, relu1, 'norm1', size=5, alpha=0.0001, beta=0.75)
            pool1 = brew.max_pool(m, norm1, 'pool1', kernel=3, stride=2)
            conv2 = brew.conv(m, pool1, 'conv2', dim_in=96, dim_out=256, kernel=5)
            relu2 = brew.relu(m, conv2, 'conv2')
            norm2 = brew.lrn(m, relu2, 'norm2', size=5, alpha=0.0001, beta=0.75)
            pool2 = brew.max_pool(m, norm2, 'pool2', kernel=3, stride=2)
            conv3 = brew.conv(m, pool2, 'conv3', dim_in=256, dim_out=384, kernel=3)
            relu3 = brew.relu(m, conv3, 'conv3')
            conv4 = brew.conv(m, relu3, 'conv4', dim_in=384, dim_out=384, kernel=3)
            relu4 = brew.relu(m, conv4, 'conv4')
            conv5 = brew.conv(m, conv4, 'conv5', dim_in=384, dim_out=256, kernel=3)
            relu5 = brew.relu(m, conv5, 'conv5')
            pool5 = brew.max_pool(m, relu5, 'pool5', kernel=3, stride=2)
            fc6 = brew.fc(m, pool5, 'fc6', dim_in=256 * 2 * 2, dim_out=4096)
            relu6 = brew.relu(m, fc6, 'fc6')
            dropout1 = brew.dropout(m, relu6, 'dropout1', ratio=0.5, is_test=0)
            fc7 = brew.fc(m, dropout1, 'fc7', dim_in=4096, dim_out=4096)
            relu7 = brew.relu(m, fc7, 'fc7')
            dropout2 = brew.dropout(m, relu7, 'dropout2', ratio=0.5, is_test=0)
            fc8 = brew.fc(m, dropout2, 'fc8', dim_in=4096, dim_out=1000)
            softmax = brew.softmax(m, fc8, 'softmax')
        m.net.AddExternalOutput(softmax)
        # New Addition
        softmax = m.net.HalfToFloat(softmax, softmax + '_fp32')
        xent = m.LabelCrossEntropy([softmax, "label"], 'xent')
        loss = m.AveragedLoss(xent, "loss")
        brew.accuracy(m, [softmax, "label"], "accuracy")

        m.AddGradientOperators([loss])
        opt = optimizer.build_sgd(m, base_learning_rate=0.01, policy="step", stepsize=1, gamma=0.999)
    return softmax

#########
Here is how I run the function "create_model_FP16":
workspace.FeedBlob("data", data, device_option=device_opts)
workspace.FeedBlob("label", label, device_option=device_opts)

train_model= model_helper.ModelHelper(name="train_net")
softmax = create_model_FP16(train_model, device_opts=device_opts,dtype=dtype,is_test='false')
with core.DeviceScope(device_opts):
    brew.add_weight_decay(train_model, 0.001)  # any effect???

workspace.RunNetOnce(train_model.param_init_net)
workspace.CreateNet(train_model.net)
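
For reference, device_opts and the input blobs are prepared roughly like this earlier in the script (a sketch; the exact batch size and image shape in my real script may differ):

import numpy as np
from caffe2.proto import caffe2_pb2
from caffe2.python import core

# GPU 0 is the device the error message refers to (device_type: 1).
device_opts = core.DeviceOption(caffe2_pb2.CUDA, 0)

# Dummy FP16 inputs with the same layout as the real data:
# a batch of single-channel images plus integer class labels.
data = np.random.rand(100, 1, 227, 227).astype(np.float16)
label = np.random.randint(0, 1000, size=(100,)).astype(np.int32)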

This code runs fine on my desktop GPU (1050Ti), but on the Nvidia TX2 it fails with the error below:
"Cannot create operator of type 'CreateMutex' on the device 'CUDA'"
Thanks for any help.

Thanks.

We will reproduce this issue and update information to you later.

Hi,

Could you check if there is a built-in example that can reproduce this issue?
We have tested several built-in samples, and they all work correctly.

Thanks.

Thanks for the reply. Could I send you my code by email?

Hi AastaLLL, could you give me your email address? You can run my scripts to reproduce the issue. Thanks for your help.

Hi,

Sorry for the late reply. We can't share our email address here.
Could you upload the script and pass the link via private message?

Thanks.

Hi AastaLLL, I have sent you a private message.

Hi,

Thanks for the feedback.
We have downloaded the script. Will update information to you later.

Hi,

The cause is that the TX2 GPU architecture (sm_62) is missing from caffe2's CMake configuration.
We can run your script after applying this change in caffe2:

diff --git a/cmake/Cuda.cmake b/cmake/Cuda.cmake
index 2425375..54605ef 100644
--- a/cmake/Cuda.cmake
+++ b/cmake/Cuda.cmake
@@ -6,7 +6,7 @@
 # Default is set to cuda 9. If we detect the cuda architectores to be less than
 # 9, we will lower it to the corresponding known archs.
 set(Caffe2_known_gpu_archs "30 35 50 52 60 61 70") # for CUDA 9.x
-set(Caffe2_known_gpu_archs8 "20 21(20) 30 35 50 52 60 61") # for CUDA 8.x
+set(Caffe2_known_gpu_archs8 "20 21(20) 30 35 50 52 60 61 62") # for CUDA 8.x
 set(Caffe2_known_gpu_archs7 "20 21(20) 30 35 50 52") # for CUDA 7.x

Thanks.

Thanks, I will try it soon.

Thanks for the help, but it still doesn't work. I changed Cuda.cmake as follows:

# Known NVIDIA GPU achitectures Caffe2 can be compiled for.
# This list will be used for CUDA_ARCH_NAME = All option
set(Caffe2_known_gpu_archs "20 21(20) 30 35 50 52 60 61 70")  # for CUDA 9.x
set(Caffe2_known_gpu_archs8 "20 21(20) 30 35 50 52 60 61 62") # for CUDA 8.x
set(Caffe2_known_gpu_archs7 "20 21(20) 30 35 50 52")          # for CUDA 7.x

And then I do this:
rm -rf build
./scripts/build_tegra_x1.sh
cd XX/XX
python XXX.py

The error still occurs:

Traceback (most recent call last):
  File "AlexTest2.py", line 242, in <module>
    train(INIT_NET, PREDICT_NET, epochs=3, batch_size=100, device_opts=device_opts, dtype=dtype, is_test=is_test)
  File "AlexTest2.py", line 162, in train
    workspace.RunNetOnce(train_model.param_init_net)
  File "/usr/local/caffe2/python/workspace.py", line 161, in RunNetOnce
    return C.run_net_once(StringifyProto(net))
RuntimeError: [enforce fail at operator.cc:110] op. Cannot create operator of type 'CreateMutex' on the device 'CUDA'. Verify that implementation for the corresponding device exist. It might also happen if the binary is not linked with the operator implementation code. If Python frontend is used it might happen if dyndep.InitOpsLibrary call is missing. Operator def: output: "iteration_mutex" name: "" type: "CreateMutex" device_option { device_type: 1 cuda_gpu_id: 0 }
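
From the operator def in the error, the failing op is the "iteration_mutex" CreateMutex that the SGD optimizer adds for its iteration counter. It should be reproducible without my training script at all; a minimal probe along these lines (a sketch using only the standard caffe2 Python API) goes through the same operator-creation path:

from caffe2.proto import caffe2_pb2
from caffe2.python import core, workspace

# Try to create a single CreateMutex op directly on the CUDA device.
# If the operator is not registered for CUDA in this build, this raises
# the same "Cannot create operator of type 'CreateMutex'" error.
op = core.CreateOperator(
    "CreateMutex", [], ["test_mutex"],
    device_option=core.DeviceOption(caffe2_pb2.CUDA, 0),
)
workspace.RunOperatorOnce(op)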

Any other suggestions?

Hi,

Which JetPack do you use?
We tested your script on JetPack 3.1. Could you also try caffe2 on JetPack 3.1?

Thanks.

I also use JetPack 3.1 on my TX2 board. I am confused.

Hi,

Okay. We will check caffe2 with another TX2 board and update information to you later.
Thanks.

Hi,

Could you help us check CUDA functionality first?

/usr/local/cuda-8.0/bin/cuda-install-samples-8.0.sh .
cd NVIDIA_CUDA-8.0_Samples/0_Simple/vectorAdd
make
./vectorAdd
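
If that sample passes, the caffe2 GPU path can also be exercised directly from Python, independent of your model code (a minimal sketch):

import numpy as np
from caffe2.proto import caffe2_pb2
from caffe2.python import core, workspace

print("Detected CUDA devices:", workspace.NumCudaDevices())

# Run a single Relu op on the GPU as a smoke test.
device_opts = core.DeviceOption(caffe2_pb2.CUDA, 0)
workspace.FeedBlob("x", np.random.randn(4, 4).astype(np.float32),
                   device_option=device_opts)
workspace.RunOperatorOnce(
    core.CreateOperator("Relu", ["x"], ["y"], device_option=device_opts))
print("Relu on CUDA OK:", workspace.FetchBlob("y").shape)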

Thanks.

Hi, AastaLLL.
I have run the commands you listed. The result is as follows:

nvidia@tegra-ubuntu:/usr/local/cuda-8.0/bin/NVIDIA_CUDA-8.0_Samples/0_Simple/vectorAdd$ ./vectorAdd
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

Hi,

Thanks for checking. Your GPU looks good.

We have double-checked caffe2 today, and we can run your script without error.
Here are our steps:

1. Flash the device with JetPack3.1
2. Clone

$ git clone --recursive https://github.com/caffe2/caffe2.git

3. Apply the change in comment #11.
4. Build and run

$ ./scripts/build_tegra_x1.sh
$ export PYTHONPATH=$PYTHONPATH:[caffe2_root]/build
$ sudo pip install future
$ sudo apt-get install python-six
$ python [Your script]
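
Before running the full training script, a tiny net that goes through the same optimizer path (which creates the "iteration_mutex" CreateMutex op) is a quick sanity check that the rebuilt library is picked up correctly. A sketch, with arbitrary layer sizes:

import numpy as np
from caffe2.proto import caffe2_pb2
from caffe2.python import brew, core, model_helper, optimizer, workspace

device_opts = core.DeviceOption(caffe2_pb2.CUDA, 0)
workspace.FeedBlob("data", np.random.rand(8, 16).astype(np.float32),
                   device_option=device_opts)
workspace.FeedBlob("label", np.random.randint(0, 10, size=(8,)).astype(np.int32),
                   device_option=device_opts)

m = model_helper.ModelHelper(name="sanity")
with core.DeviceScope(device_opts):
    fc = brew.fc(m, "data", "fc", dim_in=16, dim_out=10)
    softmax, loss = m.SoftmaxWithLoss([fc, "label"], ["softmax", "loss"])
    m.AddGradientOperators([loss])
    optimizer.build_sgd(m, base_learning_rate=0.01)

# These are the calls that fail in your script.
workspace.RunNetOnce(m.param_init_net)
workspace.CreateNet(m.net)
workspace.RunNet(m.net.Proto().name)
print("Sanity net ran on CUDA")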

By the way, could you try to install caffe2 and run the script with the 'nvidia' account?
Please also let us know your results.

Thanks.