Deep Convolutional Generative Adversarial Network example from manual

Hi, I'm trying to run the Deep Convolutional Generative Adversarial Network (DCGAN) example from section 2.2 of the 9.2 machine-learning-manual.pdf. The Hello World from section 2.1 works fine, but the DCGAN example prints some warnings and then fails. The manual doesn't mention needing a Google authentication bearer token, and I'm not sure whether that is the cause.

[cht@node001 ~]$ module load tensorflow2-extra-py39-cuda11.2-gcc9
Loading tensorflow2-extra-py39-cuda11.2-gcc9/2.7.0
  Loading requirement: openblas/dynamic/0.3.18 hdf5_18/1.8.21 gcc9/9.5.0 python39 cuda11.2/toolkit/11.2.2 cudnn8.1-cuda11.2/8.1.1.33 ml-pythondeps-py39-cuda11.2-gcc9/4.8.1 protobuf3-gcc9/3.9.2
    nccl2-cuda11.2-gcc9/2.14.3 tensorflow2-py39-cuda11.2-gcc9/2.7.0 opencv4-py39-cuda11.2-gcc9/4.5.4
[cht@node001 ~]$ module load openmpi4-cuda11.2-ofed51-gcc9
Loading openmpi4-cuda11.2-ofed51-gcc9/4.1.4
  Loading requirement: hpcx/mlnx-ofed51/2.7.4 ucx/1.10.1 cm-pmix3/3.1.4 hwloc/1.11.11
[cht@node001 ~]$ cd ${CM_TENSORFLOW2_EXTRA}/tensorflow_examples/models/dcgan/
[cht@node001 dcgan]$ python dcgan.py --epochs 5
/cm/shared/apps/ml-pythondeps-py39-cuda11.2-gcc9/4.8.1/lib/python3.9/site-packages/requests/__init__.py:102: RequestsDependencyWarning: urllib3 (1.26.8) or chardet (5.1.0)/charset_normalizer (2.0.10) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({})/charset_normalizer ({}) doesn't match a supported "
2023-03-22 18:50:53.627768: W tensorflow/core/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "NOT_FOUND: Could not locate the credentials file.". Retrieving token from GCE failed with "INTERNAL: Couldn't parse JSON response from OAuth server.".
I0322 18:50:53.700929 23456247932736 dataset_builder.py:400] Generating dataset mnist (/home/cht/tensorflow_datasets/mnist/3.0.1)
Downloading and preparing dataset Unknown size (download: Unknown size, generated: Unknown size, total: Unknown size) to /home/cht/tensorflow_datasets/mnist/3.0.1...
Dl Completed...: 0 url [00:00, ? url/s]
I0322 18:50:53.994837 23456247932736 download_manager.py:354] Downloading https://storage.googleapis.com/cvdf-datasets/mnist/t10k-images-idx3-ubyte.gz into /home/cht/tensorflow_datasets/downloads/cvdf-datasets_mnist_t10k-images-idx3-ubytedDnaEPiC58ZczHNOp6ks9L4_JLids_rpvUj38kJNGMc.gz.tmp.55e4a786bd8d40478f88319e535a65fb...
Dl Completed...:   0%|
I0322 18:50:53.997988 23456247932736 download_manager.py:354] Downloading https://storage.googleapis.com/cvdf-datasets/mnist/t10k-labels-idx1-ubyte.gz into /home/cht/tensorflow_datasets/downloads/cvdf-datasets_mnist_t10k-labels-idx1-ubyte4Mqf5UL1fRrpd5pIeeAh8c8ZzsY2gbIPBuKwiyfSD_I.gz.tmp.d93d3e181b22493dba85b77ce1fdd027...
Dl Completed...:   0%|
I0322 18:50:54.000667 23456247932736 download_manager.py:354] Downloading https://storage.googleapis.com/cvdf-datasets/mnist/train-images-idx3-ubyte.gz into /home/cht/tensorflow_datasets/downloads/cvdf-datasets_mnist_train-images-idx3-ubyteJAsxAi0QnOBEygBw_XW2X7zp-LBZAIqqYSHN8ru4ZO4.gz.tmp.d98e7529af77497dabbe5c21e38fe395...
Dl Completed...:   0%|
I0322 18:50:54.004412 23456247932736 download_manager.py:354] Downloading https://storage.googleapis.com/cvdf-datasets/mnist/train-labels-idx1-ubyte.gz into /home/cht/tensorflow_datasets/downloads/cvdf-datasets_mnist_train-labels-idx1-ubytedcDWkl3FO9T-WMEH1f1Xt51eIRmePRIMAk6X147Qw8w.gz.tmp.2ec0b80ac0854337bebc0b03e47659fb...
Extraction completed...: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:27<00:00, 13.56s/ file]
Dl Size...: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:27<00:00,  2.71s/ MiB]
Dl Completed...:  50%|█████████████████████████████████████████████████████████████████████████                                                                         | 2/4 [00:27<00:27, 13.56s/ url]
Traceback (most recent call last):
  File "/cm/shared/apps/ml-pythondeps-py39-cuda11.2-gcc9/4.8.1/lib/python3.9/site-packages/urllib3/response.py", line 438, in _error_catcher
    yield
  File "/cm/shared/apps/ml-pythondeps-py39-cuda11.2-gcc9/4.8.1/lib/python3.9/site-packages/urllib3/response.py", line 519, in read
    data = self._fp.read(amt) if not fp_closed else b""
  File "/cm/local/apps/python39/lib/python3.9/http/client.py", line 463, in read
    n = self.readinto(b)
  File "/cm/local/apps/python39/lib/python3.9/http/client.py", line 507, in readinto
    n = self.fp.readinto(b)
  File "/cm/local/apps/python39/lib/python3.9/socket.py", line 704, in readinto
    return self._sock.recv_into(b)
  File "/cm/local/apps/python39/lib/python3.9/ssl.py", line 1242, in recv_into
    return self.read(nbytes, buffer)
  File "/cm/local/apps/python39/lib/python3.9/ssl.py", line 1100, in read
    return self._sslobj.read(len, buffer)
ssl.SSLError: [SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC] decryption failed or bad record mac (_ssl.c:2633)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/cm/shared/apps/tensorflow2-extra-py39-cuda11.2-gcc9/2.7.0/examples/tensorflow_examples/models/dcgan/dcgan.py", line 225, in <module>
    app.run(run_main)
  File "/cm/shared/apps/ml-pythondeps-py39-cuda11.2-gcc9/4.8.1/lib/python3.9/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/cm/shared/apps/ml-pythondeps-py39-cuda11.2-gcc9/4.8.1/lib/python3.9/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "/cm/shared/apps/tensorflow2-extra-py39-cuda11.2-gcc9/2.7.0/examples/tensorflow_examples/models/dcgan/dcgan.py", line 213, in run_main
    main(**kwargs)
  File "/cm/shared/apps/tensorflow2-extra-py39-cuda11.2-gcc9/2.7.0/examples/tensorflow_examples/models/dcgan/dcgan.py", line 217, in main
    train_dataset = create_dataset(buffer_size, batch_size)
  File "/cm/shared/apps/tensorflow2-extra-py39-cuda11.2-gcc9/2.7.0/examples/tensorflow_examples/models/dcgan/dcgan.py", line 47, in create_dataset
    train_dataset = tfds.load(
  File "/cm/shared/apps/tensorflow2-extra-py39-cuda11.2-gcc9/2.7.0/lib/python3.9/site-packages/tensorflow_datasets/core/load.py", line 318, in load
    dbuilder.download_and_prepare(**download_and_prepare_kwargs)
  File "/cm/shared/apps/tensorflow2-extra-py39-cuda11.2-gcc9/2.7.0/lib/python3.9/site-packages/tensorflow_datasets/core/dataset_builder.py", line 439, in download_and_prepare
    self._download_and_prepare(
  File "/cm/shared/apps/tensorflow2-extra-py39-cuda11.2-gcc9/2.7.0/lib/python3.9/site-packages/tensorflow_datasets/core/dataset_builder.py", line 1113, in _download_and_prepare
    split_generators = self._split_generators(  # pylint: disable=unexpected-keyword-arg
  File "/cm/shared/apps/tensorflow2-extra-py39-cuda11.2-gcc9/2.7.0/lib/python3.9/site-packages/tensorflow_datasets/image_classification/mnist.py", line 118, in _split_generators
    mnist_files = dl_manager.download_and_extract(
  File "/cm/shared/apps/tensorflow2-extra-py39-cuda11.2-gcc9/2.7.0/lib/python3.9/site-packages/tensorflow_datasets/core/download/download_manager.py", line 634, in download_and_extract
    return _map_promise(self._download_extract, url_or_urls)
  File "/cm/shared/apps/tensorflow2-extra-py39-cuda11.2-gcc9/2.7.0/lib/python3.9/site-packages/tensorflow_datasets/core/download/download_manager.py", line 767, in _map_promise
    res = tf.nest.map_structure(lambda p: p.get(), all_promises)  # Wait promises
  File "/cm/shared/apps/tensorflow2-py39-cuda11.2-gcc9/2.7.0/lib/python3.9/site-packages/tensorflow/python/util/nest.py", line 869, in map_structure
    structure[0], [func(*x) for x in entries],
  File "/cm/shared/apps/tensorflow2-py39-cuda11.2-gcc9/2.7.0/lib/python3.9/site-packages/tensorflow/python/util/nest.py", line 869, in <listcomp>
    structure[0], [func(*x) for x in entries],
  File "/cm/shared/apps/tensorflow2-extra-py39-cuda11.2-gcc9/2.7.0/lib/python3.9/site-packages/tensorflow_datasets/core/download/download_manager.py", line 767, in <lambda>
    res = tf.nest.map_structure(lambda p: p.get(), all_promises)  # Wait promises
  File "/cm/shared/apps/ml-pythondeps-py39-cuda11.2-gcc9/4.8.1/lib/python3.9/site-packages/promise/promise.py", line 512, in get
    return self._target_settled_value(_raise=True)
  File "/cm/shared/apps/ml-pythondeps-py39-cuda11.2-gcc9/4.8.1/lib/python3.9/site-packages/promise/promise.py", line 516, in _target_settled_value
    return self._target()._settled_value(_raise)
  File "/cm/shared/apps/ml-pythondeps-py39-cuda11.2-gcc9/4.8.1/lib/python3.9/site-packages/promise/promise.py", line 226, in _settled_value
    reraise(type(raise_val), raise_val, self._traceback)
  File "/cm/shared/apps/ml-pythondeps-py39-cuda11.2-gcc9/4.8.1/lib/python3.9/site-packages/six.py", line 719, in reraise
    raise value
  File "/cm/shared/apps/ml-pythondeps-py39-cuda11.2-gcc9/4.8.1/lib/python3.9/site-packages/promise/promise.py", line 844, in handle_future_result
    resolve(future.result())
  File "/cm/local/apps/python39/lib/python3.9/concurrent/futures/_base.py", line 439, in result
    return self.__get_result()
  File "/cm/local/apps/python39/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
    raise self._exception
  File "/cm/local/apps/python39/lib/python3.9/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/cm/shared/apps/tensorflow2-extra-py39-cuda11.2-gcc9/2.7.0/lib/python3.9/site-packages/tensorflow_datasets/core/download/downloader.py", line 228, in _sync_download
    for block in iter_content:
  File "/cm/shared/apps/ml-pythondeps-py39-cuda11.2-gcc9/4.8.1/lib/python3.9/site-packages/requests/models.py", line 760, in generate
    for chunk in self.raw.stream(chunk_size, decode_content=True):
  File "/cm/shared/apps/ml-pythondeps-py39-cuda11.2-gcc9/4.8.1/lib/python3.9/site-packages/urllib3/response.py", line 576, in stream
    data = self.read(amt=amt, decode_content=decode_content)
  File "/cm/shared/apps/ml-pythondeps-py39-cuda11.2-gcc9/4.8.1/lib/python3.9/site-packages/urllib3/response.py", line 541, in read
    raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
  File "/cm/local/apps/python39/lib/python3.9/contextlib.py", line 137, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/cm/shared/apps/ml-pythondeps-py39-cuda11.2-gcc9/4.8.1/lib/python3.9/site-packages/urllib3/response.py", line 449, in _error_catcher
    raise SSLError(e)
urllib3.exceptions.SSLError: [SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC] decryption failed or bad record mac (_ssl.c:2633)

I was able to resolve the Google warning with

$ export NO_GCE_CHECK='true'

But I’m still getting these errors:

Training ...
2023-03-24 19:56:31.056080: I tensorflow/stream_executor/cuda/cuda_dnn.cc:366] Loaded cuDNN version 8101
2023-03-24 19:56:31.547750: E tensorflow/stream_executor/dnn.cc:764] CUDNN_STATUS_EXECUTION_FAILED
in tensorflow/stream_executor/cuda/cuda_dnn.cc(4706): 'cudnnBatchNormalizationForwardTrainingEx( cudnn.handle(), mode, bn_ops, &one, &zero, x_descriptor.handle(), x.opaque(), x_descriptor.handle(), side_input.opaque(), x_descriptor.handle(), y->opaque(), scale_offset_descriptor.handle(), scale.opaque(), offset.opaque(), exponential_average_factor, batch_mean_opaque, batch_var_opaque, epsilon, saved_mean->opaque(), saved_inv_var->opaque(), activation_desc.handle(), workspace.opaque(), workspace.size(), reserve_space.opaque(), reserve_space.size())'
2023-03-24 19:56:31.551308: I tensorflow/stream_executor/stream.cc:4442] [stream=0x16dfedf0,impl=0x5d00110] INTERNAL: stream did not block host until done; was already in an error state
2023-03-24 19:56:31.551325: W tensorflow/core/kernels/gpu_utils.cc:69] Failed to check cudnn convolutions for out-of-bounds reads and writes with an error message: 'stream did not block host until done; was already in an error state'; skipping this check. This only means that we won't check cudnn for out-of-bounds reads and writes. This message will only be printed once.
2023-03-24 19:56:31.551345: I tensorflow/stream_executor/stream.cc:4442] [stream=0x16dfedf0,impl=0x5d00110] INTERNAL: stream did not block host until done; was already in an error state
2023-03-24 19:56:31.556861: I tensorflow/stream_executor/stream.cc:4442] [stream=0x16dfedf0,impl=0x5d00110] INTERNAL: stream did not block host until done; was already in an error state
2023-03-24 19:56:31.556886: I tensorflow/stream_executor/stream.cc:4442] [stream=0x16dfedf0,impl=0x5d00110] INTERNAL: stream did not block host until done; was already in an error state
2023-03-24 19:56:32.320397: W tensorflow/core/kernels/data/cache_dataset_ops.cc:768] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset  will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.
Traceback (most recent call last):
  File "/cm/shared/apps/tensorflow2-extra-py39-cuda11.2-gcc9/2.7.0/examples/tensorflow_examples/models/dcgan/dcgan.py", line 225, in <module>
    app.run(run_main)
  File "/cm/shared/apps/ml-pythondeps-py39-cuda11.2-gcc9/4.8.1/lib/python3.9/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/cm/shared/apps/ml-pythondeps-py39-cuda11.2-gcc9/4.8.1/lib/python3.9/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "/cm/shared/apps/tensorflow2-extra-py39-cuda11.2-gcc9/2.7.0/examples/tensorflow_examples/models/dcgan/dcgan.py", line 213, in run_main
    main(**kwargs)
  File "/cm/shared/apps/tensorflow2-extra-py39-cuda11.2-gcc9/2.7.0/examples/tensorflow_examples/models/dcgan/dcgan.py", line 222, in main
    return dcgan_obj.train(train_dataset, checkpoint_pr)
  File "/cm/shared/apps/tensorflow2-extra-py39-cuda11.2-gcc9/2.7.0/examples/tensorflow_examples/models/dcgan/dcgan.py", line 194, in train
    gen_loss, disc_loss = self.train_step(image)
  File "/cm/shared/apps/tensorflow2-py39-cuda11.2-gcc9/2.7.0/lib/python3.9/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/cm/shared/apps/tensorflow2-py39-cuda11.2-gcc9/2.7.0/lib/python3.9/site-packages/tensorflow/python/eager/execute.py", line 58, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InternalError:  cuDNN launch failure : input shape ([64,128,7,7])
         [[node sequential/batch_normalization_1/FusedBatchNormV3
 (defined at /cm/shared/apps/tensorflow2-py39-cuda11.2-gcc9/2.7.0/lib/python3.9/site-packages/keras/layers/normalization/batch_normalization.py:589)
]] [Op:__inference_train_step_142280]

Errors may have originated from an input operation.

Thanks to Andrew in BCM support: Setting the TF_FORCE_GPU_ALLOW_GROWTH environment variable to “true” allowed the example to run perfectly.
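
For anyone who finds this later, here is a minimal sketch of that fix, using the same shell session as in the question. It assumes the modules are already loaded as shown above. By default TensorFlow reserves nearly all GPU memory at startup, and with this variable set it allocates memory on demand instead. My guess (not confirmed by support) is that this leaves room for cuDNN to do its own allocations, which would explain why the batch normalization launch failure goes away.

```shell
# Tell TensorFlow to grow its GPU memory allocation on demand rather than
# reserving nearly all of it up front. The variable has to be exported
# before the Python process starts so that TensorFlow sees it at import time.
export TF_FORCE_GPU_ALLOW_GROWTH=true

# Sanity check: confirm the variable is exported to child processes.
env | grep '^TF_FORCE_GPU_ALLOW_GROWTH='
```

After that, rerun `python dcgan.py --epochs 5` from the dcgan example directory as before. The export only lasts for the current shell, so it needs to be repeated in each new session or added to a job script.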
