Nvidia Modulus: failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED

I am running Nvidia Modulus from a bare-metal installation in a conda environment.

The run fails with the error "failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED", and I have no idea how to fix it.

Here is the full output:

[s.1915438@scs2042 ldc]$ python ldc_2d.py
/home/s.1915438/.conda/envs/modulus/lib/python3.6/site-packages/modulus-21.6-py3.6.egg/modulus/controller.py:8: UserWarning: horovod was not imported. This will make multi-gpu runs impossible
  warnings.warn("horovod was not imported. This will make multi-gpu runs impossible")
WARNING:tensorflow:From /home/s.1915438/.conda/envs/modulus/lib/python3.6/site-packages/modulus-21.6-py3.6.egg/modulus/optimizer.py:353: The name tf.train.AdamOptimizer is deprecated. Please use tf.compat.v1.train.AdamOptimizer instead.

WARNING:tensorflow:From /home/s.1915438/.conda/envs/modulus/lib/python3.6/site-packages/modulus-21.6-py3.6.egg/modulus/optimizer.py:361: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.

CONFIGS: FullyConnectedArch, /home/s.1915438/.conda/envs/modulus/lib/python3.6/site-packages/modulus-21.6-py3.6.egg/modulus/architecture/fully_connected.py
  activation_fn: swish
  layer_size: 512
  nr_layers: 6
  skip_connections: False
  weight_norm: True
  adaptive_activations: False
CONFIGS: ExponentialDecayLR, /home/s.1915438/.conda/envs/modulus/lib/python3.6/site-packages/modulus-21.6-py3.6.egg/modulus/learning_rate.py
  start_lr: 0.001
  end_lr: 0.0
  decay_steps: 4000
  decay_rate: 0.95
CONFIGS: AdamOptimizer, /home/s.1915438/.conda/envs/modulus/lib/python3.6/site-packages/modulus-21.6-py3.6.egg/modulus/optimizer.py
  beta1: 0.9
  beta2: 0.999
  epsilon: 1e-08
  amp: False
WARNING:tensorflow:From /home/s.1915438/.conda/envs/modulus/lib/python3.6/site-packages/modulus-21.6-py3.6.egg/modulus/arch.py:36: The name tf.make_template is deprecated. Please use tf.compat.v1.make_template instead.

CONFIGS: LDCSolver, ldc_2d.py
  network_dir: ./network_checkpoint_ldc_2d
  initialize_network_dir: 
  added_config_dir: 
  rec_results: True
  rec_results_cpu: False
  rec_results_freq: 1000
  max_steps: 400000
  save_filetypes: vtk,np
  xla: False
  inner_norm: 2
  outer_norm: 2
  save_network_freq: 1000
  print_stats_freq: 100
  tf_summary_freq: 500
  optimizer_params_index: None
  initialize_network_params: None
  seq_train_domain: [<class '__main__.LDCTrain'>]
  config: {'config': ModulusConfig(activation_fn='swish', adaptive_activations=False, added_config_dir='', amp=False, beta1=0.9, beta2=0.999, decay_rate=0.95, decay_steps=4000, end_lr=0.0, epsilon=1e-08, initialize_network_dir='', inner_norm=2, layer_size=512, max_steps=400000, network_dir='./network_checkpoint_ldc_2d', nr_layers=6, outer_norm=2, rec_results=True, rec_results_cpu=False, rec_results_freq=1000, run_mode='solve', save_filetypes='vtk,np', skip_connections=False, start_lr=0.001, weight_norm=True, xla=False)}
  arch: <modulus.architecture.fully_connected.FullyConnectedArch object at 0x7f93e0020ef0>
  lr: <modulus.learning_rate.ExponentialDecayLR object at 0x7f93e1005e10>
  optimizer: <modulus.optimizer.AdamOptimizer object at 0x7f93e1005cf8>
  equations: [<modulus.node.Node object at 0x7f93e1015160>, <modulus.node.Node object at 0x7f93defb9390>, <modulus.node.Node object at 0x7f93e0020358>]
  nets: [<modulus.node.Node object at 0x7f93df01b128>]
  diff_nodes: []
WARNING:tensorflow:From /home/s.1915438/.conda/envs/modulus/lib/python3.6/site-packages/modulus-21.6-py3.6.egg/modulus/solver.py:224: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

WARNING:tensorflow:From /home/s.1915438/.conda/envs/modulus/lib/python3.6/site-packages/modulus-21.6-py3.6.egg/modulus/solver.py:236: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

2022-04-08 09:26:08.773805: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2022-04-08 09:26:08.848041: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: NVIDIA A100-PCIE-40GB major: 8 minor: 0 memoryClockRate(GHz): 1.41
pciBusID: 0000:27:00.0
2022-04-08 09:26:08.854994: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2022-04-08 09:26:08.888436: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2022-04-08 09:26:08.912782: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2022-04-08 09:26:08.936875: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2022-04-08 09:26:08.966060: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2022-04-08 09:26:08.989298: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2022-04-08 09:26:09.070889: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2022-04-08 09:26:09.075647: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2022-04-08 09:26:09.085358: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2022-04-08 09:26:09.464066: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2350065000 Hz
2022-04-08 09:26:09.472379: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5579824ebc40 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2022-04-08 09:26:09.472406: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2022-04-08 09:26:09.708881: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5579825039c0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2022-04-08 09:26:09.708955: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): NVIDIA A100-PCIE-40GB, Compute Capability 8.0
2022-04-08 09:26:09.713053: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: NVIDIA A100-PCIE-40GB major: 8 minor: 0 memoryClockRate(GHz): 1.41
pciBusID: 0000:27:00.0
2022-04-08 09:26:09.713109: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2022-04-08 09:26:09.713140: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2022-04-08 09:26:09.713160: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2022-04-08 09:26:09.713181: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2022-04-08 09:26:09.713200: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2022-04-08 09:26:09.713219: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2022-04-08 09:26:09.713238: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2022-04-08 09:26:09.717462: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2022-04-08 09:26:09.717508: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2022-04-08 09:26:09.720791: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2022-04-08 09:26:09.720829: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0 
2022-04-08 09:26:09.720853: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N 
2022-04-08 09:26:09.726211: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 37943 MB memory) -> physical GPU (device: 0, name: NVIDIA A100-PCIE-40GB, pci bus id: 0000:27:00.0, compute capability: 8.0)
WARNING:tensorflow:
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

WARNING:tensorflow:From /home/s.1915438/.conda/envs/modulus/lib/python3.6/site-packages/modulus-21.6-py3.6.egg/modulus/solver.py:175: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.get_or_create_global_step
WARNING:tensorflow:From /home/s.1915438/.conda/envs/modulus/lib/python3.6/site-packages/modulus-21.6-py3.6.egg/modulus/variables.py:241: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

UNROLLING GRAPH: 
    TopWall
WARNING:tensorflow:From /home/s.1915438/.conda/envs/modulus/lib/python3.6/site-packages/modulus-21.6-py3.6.egg/modulus/tf_utils/layers.py:34: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.

WARNING:tensorflow:From /home/s.1915438/.conda/envs/modulus/lib/python3.6/site-packages/modulus-21.6-py3.6.egg/modulus/tf_utils/layers.py:34: The name tf.AUTO_REUSE is deprecated. Please use tf.compat.v1.AUTO_REUSE instead.

WARNING:tensorflow:From /home/s.1915438/.conda/envs/modulus/lib/python3.6/site-packages/modulus-21.6-py3.6.egg/modulus/tf_utils/layers.py:307: The name tf.get_variable is deprecated. Please use tf.compat.v1.get_variable instead.

    NoSlip
    Interior
grad calls: 2
calculated: [v__x, u__x, p__x, v__y, u__y, p__y]
grad calls: 2
calculated: [v__y, u__y, u__y__y, v__y__y, v__x, u__x, v__x__x, u__x__x]
WARNING:tensorflow:From /home/s.1915438/.conda/envs/modulus/lib/python3.6/site-packages/modulus-21.6-py3.6.egg/modulus/variables.py:218: The name tf.summary.scalar is deprecated. Please use tf.compat.v1.summary.scalar instead.

WARNING:tensorflow:From /home/s.1915438/.conda/envs/modulus/lib/python3.6/site-packages/modulus-21.6-py3.6.egg/modulus/learning_rate.py:65: The name tf.train.exponential_decay is deprecated. Please use tf.compat.v1.train.exponential_decay instead.

WARNING:tensorflow:From /home/s.1915438/.conda/envs/modulus/lib/python3.6/site-packages/tensorflow_core/python/ops/math_grad.py:1375: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
UNROLLING GRAPH: 
    Val
WARNING:tensorflow:From /home/s.1915438/.conda/envs/modulus/lib/python3.6/site-packages/modulus-21.6-py3.6.egg/modulus/solver.py:480: The name tf.summary.merge_all is deprecated. Please use tf.compat.v1.summary.merge_all instead.

WARNING:tensorflow:From /home/s.1915438/.conda/envs/modulus/lib/python3.6/site-packages/modulus-21.6-py3.6.egg/modulus/solver.py:241: The name tf.global_variables_initializer is deprecated. Please use tf.compat.v1.global_variables_initializer instead.

WARNING:tensorflow:From /home/s.1915438/.conda/envs/modulus/lib/python3.6/site-packages/modulus-21.6-py3.6.egg/modulus/solver.py:262: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.

WARNING:tensorflow:From /home/s.1915438/.conda/envs/modulus/lib/python3.6/site-packages/modulus-21.6-py3.6.egg/modulus/solver.py:262: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.

WARNING:tensorflow:From /home/s.1915438/.conda/envs/modulus/lib/python3.6/site-packages/modulus-21.6-py3.6.egg/modulus/solver.py:520: The name tf.summary.FileWriter is deprecated. Please use tf.compat.v1.summary.FileWriter instead.

Solving for Domain  iteration 0
2022-04-08 09:33:55.831349: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2022-04-08 09:36:01.040438: E tensorflow/stream_executor/cuda/cuda_blas.cc:428] failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED
Traceback (most recent call last):
  File "/home/s.1915438/.conda/envs/modulus/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/home/s.1915438/.conda/envs/modulus/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/home/s.1915438/.conda/envs/modulus/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: Blas GEMM launch failed : a.shape=(1000, 2), b.shape=(2, 512), m=1000, n=512, k=2
         [[{{node flow_net/fc0/MatMul}}]]
         [[Sum_7/_41]]
  (1) Internal: Blas GEMM launch failed : a.shape=(1000, 2), b.shape=(2, 512), m=1000, n=512, k=2
         [[{{node flow_net/fc0/MatMul}}]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "ldc_2d.py", line 91, in <module>
    ctr.run()
  File "/home/s.1915438/.conda/envs/modulus/lib/python3.6/site-packages/modulus-21.6-py3.6.egg/modulus/controller.py", line 91, in run
    self.solver.solve()
  File "/home/s.1915438/.conda/envs/modulus/lib/python3.6/site-packages/modulus-21.6-py3.6.egg/modulus/solver.py", line 527, in solve
    train_stats = seq_train_step[domain_index](train_np_var)
  File "/home/s.1915438/.conda/envs/modulus/lib/python3.6/site-packages/modulus-21.6-py3.6.egg/modulus/variables.py", line 510, in np_function
    np_outvar_list = sess.run(outvar_placeholders, feed_dict)
  File "/home/s.1915438/.conda/envs/modulus/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/home/s.1915438/.conda/envs/modulus/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/s.1915438/.conda/envs/modulus/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/home/s.1915438/.conda/envs/modulus/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: Blas GEMM launch failed : a.shape=(1000, 2), b.shape=(2, 512), m=1000, n=512, k=2
         [[node flow_net/fc0/MatMul (defined at /home/s.1915438/.conda/envs/modulus/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
         [[Sum_7/_41]]
  (1) Internal: Blas GEMM launch failed : a.shape=(1000, 2), b.shape=(2, 512), m=1000, n=512, k=2
         [[node flow_net/fc0/MatMul (defined at /home/s.1915438/.conda/envs/modulus/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
0 successful operations.
0 derived errors ignored.

Original stack trace for 'flow_net/fc0/MatMul':
  File "ldc_2d.py", line 91, in <module>
    ctr.run()
  File "/home/s.1915438/.conda/envs/modulus/lib/python3.6/site-packages/modulus-21.6-py3.6.egg/modulus/controller.py", line 91, in run
    self.solver.solve()
  File "/home/s.1915438/.conda/envs/modulus/lib/python3.6/site-packages/modulus-21.6-py3.6.egg/modulus/solver.py", line 389, in solve
    train_pred_domain_outvar = unroll_graph_on_dict(self.nets+self.equations, train_domain_invar, train_true_domain_outvar, diff_nodes=self.diff_nodes)
  File "/home/s.1915438/.conda/envs/modulus/lib/python3.6/site-packages/modulus-21.6-py3.6.egg/modulus/graph.py", line 128, in unroll_graph_on_dict
    outvar_dict[key] = unroll_graph(nodes, invar_with_global, req_outvar_names, diff_nodes)
  File "/home/s.1915438/.conda/envs/modulus/lib/python3.6/site-packages/modulus-21.6-py3.6.egg/modulus/graph.py", line 67, in unroll_graph
    outvar.update(node.evaluate(input_variables))
  File "/home/s.1915438/.conda/envs/modulus/lib/python3.6/site-packages/tensorflow_core/python/ops/template.py", line 393, in __call__
    return self._call_func(args, kwargs)
  File "/home/s.1915438/.conda/envs/modulus/lib/python3.6/site-packages/tensorflow_core/python/ops/template.py", line 355, in _call_func
    result = self._func(*args, **kwargs)
  File "/home/s.1915438/.conda/envs/modulus/lib/python3.6/site-packages/modulus-21.6-py3.6.egg/modulus/arch.py", line 36, in <lambda>
    network_template = tf.make_template(name, lambda x: self._network_template(x, output_keys=Key.convert_list(outputs)))
  File "/home/s.1915438/.conda/envs/modulus/lib/python3.6/site-packages/modulus-21.6-py3.6.egg/modulus/architecture/fully_connected.py", line 73, in _network_template
    activation_par = activation_par)
  File "/home/s.1915438/.conda/envs/modulus/lib/python3.6/site-packages/modulus-21.6-py3.6.egg/modulus/tf_utils/layers.py", line 48, in fc_layer
    outputs = tf.add(tf.matmul(inputs, weights), biases, name=name)
  File "/home/s.1915438/.conda/envs/modulus/lib/python3.6/site-packages/tensorflow_core/python/util/dispatch.py", line 180, in wrapper
    return target(*args, **kwargs)
  File "/home/s.1915438/.conda/envs/modulus/lib/python3.6/site-packages/tensorflow_core/python/ops/math_ops.py", line 2754, in matmul
    a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
  File "/home/s.1915438/.conda/envs/modulus/lib/python3.6/site-packages/tensorflow_core/python/ops/gen_math_ops.py", line 6136, in mat_mul
    name=name)
  File "/home/s.1915438/.conda/envs/modulus/lib/python3.6/site-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "/home/s.1915438/.conda/envs/modulus/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/home/s.1915438/.conda/envs/modulus/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "/home/s.1915438/.conda/envs/modulus/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "/home/s.1915438/.conda/envs/modulus/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()

[s.1915438@scs2042 ldc]$ 

Hello, we have released a new version of Modulus that uses PyTorch, so this problem should be resolved now.
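
For readers who cannot upgrade right away, a possible explanation (not confirmed in this thread) is a toolkit/GPU mismatch. The log above shows TensorFlow loading CUDA 10.0 libraries (libcudart.so.10.0, libcublas.so.10.0) on an A100 with compute capability 8.0, and Ampere GPUs need CUDA 11.0 or newer. The sketch below pulls both values out of a log and compares them. The function name `check_log` and the table `MIN_CUDA_FOR_CAPABILITY` are hypothetical helpers written for this illustration, and the table is deliberately partial.

```python
import re

# Assumed minimum CUDA toolkit version per GPU compute capability.
# Partial table for illustration only; extend it for other GPUs.
MIN_CUDA_FOR_CAPABILITY = {
    (7, 0): (9, 0),   # Volta
    (7, 5): (10, 0),  # Turing
    (8, 0): (11, 0),  # Ampere, e.g. A100
    (8, 6): (11, 1),  # Ampere, e.g. RTX 30 series
}


def check_log(log_text):
    """Parse a TensorFlow startup log (hypothetical helper).

    Returns (capability, cuda_version, ok), or None if either value
    is missing from the log. ok is None when the capability is not
    in the table above.
    """
    cap = re.search(r"compute capability: (\d+)\.(\d+)", log_text)
    cuda = re.search(r"libcudart\.so\.(\d+)\.(\d+)", log_text)
    if cap is None or cuda is None:
        return None
    capability = (int(cap.group(1)), int(cap.group(2)))
    cuda_version = (int(cuda.group(1)), int(cuda.group(2)))
    required = MIN_CUDA_FOR_CAPABILITY.get(capability)
    ok = None if required is None else cuda_version >= required
    return capability, cuda_version, ok


# Two lines copied from the log in the question.
sample = (
    "2022-04-08 09:26:08.854994: I tensorflow/stream_executor/platform/default/"
    "dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0\n"
    "2022-04-08 09:26:09.726211: I tensorflow/core/common_runtime/gpu/"
    "gpu_device.cc:1304] Created TensorFlow device "
    "(/job:localhost/replica:0/task:0/device:GPU:0 with 37943 MB memory) -> "
    "physical GPU (device: 0, name: NVIDIA A100-PCIE-40GB, "
    "pci bus id: 0000:27:00.0, compute capability: 8.0)\n"
)

print(check_log(sample))  # ((8, 0), (10, 0), False)
```

A result ending in `False` means the CUDA runtime that was loaded is older than the assumed minimum for that GPU, which would be consistent with the first GEMM launch failing. It only checks version numbers from the log; it does not prove that the mismatch is the cause.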