Custom YOLOv3 training runs out of memory in TLT

Hello everyone,
I am using TLT for custom YOLOv3 training on a Quadro RTX GPU, with a 1-class dataset of ~40k images, and I keep running into memory trouble.
First it was a CPU memory failure, then CUDA host allocation failures, depending on my batch_size.
Here is my config:
random_seed: 42
yolo_config {
  big_anchor_shape: "[(65.41, 141.27), (170.16, 92.18), (283.82, 176.38)]"
  mid_anchor_shape: "[(76.64, 29.21), (53.04, 43.57), (101.22, 56.79)]"
  small_anchor_shape: "[(20.99, 17.11), (37.16, 25.72), (23.00, 55.92)]"
  matching_neutral_box_iou: 0.5

  arch: "resnet"
  nlayers: 18
  arch_conv_blocks: 2

  loss_loc_weight: 0.75
  loss_neg_obj_weights: 200.0
  loss_class_weights: 1.0

  freeze_blocks: 0
  freeze_bn: false
}
training_config {
  batch_size_per_gpu: 32
  num_epochs: 20
  enable_qat: false
  learning_rate {
    soft_start_annealing_schedule {
      min_learning_rate: 1e-6
      max_learning_rate: 1e-4
      soft_start: 0.1
      annealing: 0.8
    }
  }
  regularizer {
    type: L1
    weight: 5e-5
  }
}
eval_config {
  validation_period_during_training: 1
  average_precision_mode: SAMPLE
  batch_size: 32
  matching_iou_threshold: 0.5
}
nms_config {
  confidence_threshold: 0.01
  clustering_iou_threshold: 0.6
  top_k: 100
}
augmentation_config {
  preprocessing {
    output_image_width: 416
    output_image_height: 416
    output_image_channel: 3
    crop_right: 0
    crop_bottom: 0
    min_bbox_width: 1.0
    min_bbox_height: 1.0
  }
  spatial_augmentation {
    hflip_probability: 0.5
    vflip_probability: 0.0
    zoom_min: 0.7
    zoom_max: 1.8
    translate_max_x: 8.0
    translate_max_y: 8.0
  }
  color_augmentation {
    hue_rotation_max: 25.0
    saturation_shift_max: 0.20000000298
    contrast_scale_max: 0.10000000149
    contrast_center: 0.5
  }
}
dataset_config {
  data_sources: {
    tfrecords_path: "/workspace/tlt-experiments/data/tfrecords/kitti_trainval/kitti_trainval*"
    image_directory_path: "/workspace/tlt-experiments/data/training"
  }
  image_extension: "jpg"
  target_class_mapping {
    key: "person"
    value: "person"
  }
  validation_fold: 0
}

And here are my latest errors:

2020-09-29 12:59:25.775103: W tensorflow/core/common_runtime/bfc_allocator.cc:424] __________________________________________________________________________*________________________*
2020-09-29 12:59:25.795500: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 17179869184 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-09-29 12:59:25.795586: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 17179869184
2020-09-29 12:59:25.795687: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 17179869184 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-09-29 12:59:25.795716: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 17179869184
2020-09-29 12:59:25.795818: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 17179869184 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-09-29 12:59:25.795847: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 17179869184
2020-09-29 12:59:25.795924: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 17179869184 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-09-29 12:59:25.795950: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 17179869184
2020-09-29 12:59:25.796028: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 17179869184 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-09-29 12:59:25.796054: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 17179869184

2020-09-29 12:59:35.795239: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 17179869184 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-09-29 12:59:35.795297: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 17179869184
2020-09-29 12:59:35.795398: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 17179869184 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-09-29 12:59:35.795417: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 17179869184
2020-09-29 12:59:35.795510: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 17179869184 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-09-29 12:59:35.795527: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 17179869184
2020-09-29 12:59:35.795619: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 17179869184 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-09-29 12:59:35.795636: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 17179869184
2020-09-29 12:59:35.795728: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 17179869184 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-09-29 12:59:35.795745: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 17179869184
2020-09-29 12:59:35.795778: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 17179869184 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-09-29 12:59:35.795792: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 17179869184
2020-09-29 12:59:35.795809: W tensorflow/core/common_runtime/bfc_allocator.cc:419] Allocator (gpu_host_bfc) ran out of memory trying to allocate 13.90GiB (rounded to 14929920000).  Current allocation summary follows.
2020-09-29 12:59:35.795834: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (256): 	Total Chunks: 41, Chunks in use: 35. 10.2KiB allocated for chunks. 8.8KiB in use in bin. 2.9KiB client-requested in use in bin.
2020-09-29 12:59:35.795848: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (512): 	Total Chunks: 1, Chunks in use: 0. 512B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-09-29 12:59:35.795860: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (1024): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-09-29 12:59:35.795872: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (2048): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-09-29 12:59:35.795883: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (4096): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-09-29 12:59:35.795895: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (8192): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-09-29 12:59:35.795909: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (16384): 	Total Chunks: 8, Chunks in use: 8. 226.0KiB allocated for chunks. 226.0KiB in use in bin. 225.0KiB client-requested in use in bin.
2020-09-29 12:59:35.795923: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (32768): 	Total Chunks: 7, Chunks in use: 7. 312.8KiB allocated for chunks. 312.8KiB in use in bin. 295.3KiB client-requested in use in bin.
2020-09-29 12:59:35.795936: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (65536): 	Total Chunks: 11, Chunks in use: 1. 775.8KiB allocated for chunks. 64.5KiB in use in bin. 42.2KiB client-requested in use in bin.
2020-09-29 12:59:35.795949: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (131072): 	Total Chunks: 1, Chunks in use: 0. 141.0KiB allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-09-29 12:59:35.795960: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (262144): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-09-29 12:59:35.795972: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (524288): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-09-29 12:59:35.795983: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (1048576): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-09-29 12:59:35.795995: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (2097152): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-09-29 12:59:35.796006: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (4194304): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-09-29 12:59:35.796018: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (8388608): 	Total Chunks: 1, Chunks in use: 0. 15.57MiB allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-09-29 12:59:35.796031: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (16777216): 	Total Chunks: 1, Chunks in use: 0. 16.00MiB allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-09-29 12:59:35.796043: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (33554432): 	Total Chunks: 1, Chunks in use: 0. 32.00MiB allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-09-29 12:59:35.796055: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (67108864): 	Total Chunks: 1, Chunks in use: 0. 64.00MiB allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-09-29 12:59:35.796067: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (134217728): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-09-29 12:59:35.796078: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (268435456): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-09-29 12:59:35.796090: I tensorflow/core/common_runtime/bfc_allocator.cc:885] Bin for 13.90GiB was 256.00MiB, Chunk State: 
2020-09-29 12:59:35.796100: I tensorflow/core/common_runtime/bfc_allocator.cc:898] Next region of size 67108864
2020-09-29 12:59:35.796111: I tensorflow/core/common_runtime/bfc_allocator.cc:905] Free  at 0x7f8f9a000000 next 18446744073709551615 of size 67108864
2020-09-29 12:59:35.796121: I tensorflow/core/common_runtime/bfc_allocator.cc:898] Next region of size 33554432
2020-09-29 12:59:35.796130: I tensorflow/core/common_runtime/bfc_allocator.cc:905] Free  at 0x7f8f9e000000 next 18446744073709551615 of size 33554432
2020-09-29 12:59:35.796140: I tensorflow/core/common_runtime/bfc_allocator.cc:898] Next region of size 16777216
2020-09-29 12:59:35.796151: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f8fa0000000 next 11 of size 28928
2020-09-29 12:59:35.796161: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f8fa0007100 next 1266 of size 60672
2020-09-29 12:59:35.796170: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f8fa0015e00 next 1265 of size 28928
2020-09-29 12:59:35.796181: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f8fa001cf00 next 1264 of size 43264
2020-09-29 12:59:35.796190: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f8fa0027800 next 16 of size 256
2020-09-29 12:59:35.796200: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f8fa0027900 next 1262 of size 256
2020-09-29 12:59:35.796210: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f8fa0027a00 next 1261 of size 256
2020-09-29 12:59:35.796219: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f8fa0027b00 next 1260 of size 256
2020-09-29 12:59:35.796229: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f8fa0027c00 next 1259 of size 28928
2020-09-29 12:59:35.796238: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f8fa002ed00 next 1258 of size 43264
2020-09-29 12:59:35.796248: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f8fa0039600 next 1274 of size 256
2020-09-29 12:59:35.796257: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f8fa0039700 next 1318 of size 256
2020-09-29 12:59:35.796267: I tensorflow/core/common_runtime/bfc_allocator.cc:905] Free  at 0x7f8fa0039800 next 1254 of size 512
2020-09-29 12:59:35.796276: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f8fa0039a00 next 1253 of size 256
2020-09-29 12:59:35.796286: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f8fa0039b00 next 1256 of size 28928
2020-09-29 12:59:35.796295: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f8fa0040c00 next 1257 of size 43264
2020-09-29 12:59:35.796304: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f8fa004b500 next 2 of size 28928
2020-09-29 12:59:35.796314: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f8fa0052600 next 1340 of size 43264
2020-09-29 12:59:35.796324: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f8fa005cf00 next 1300 of size 28928
2020-09-29 12:59:35.796333: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f8fa0064000 next 9 of size 43264
2020-09-29 12:59:35.796342: I tensorflow/core/common_runtime/bfc_allocator.cc:905] Free  at 0x7f8fa006e900 next 18446744073709551615 of size 16324352
2020-09-29 12:59:35.796352: I tensorflow/core/common_runtime/bfc_allocator.cc:898] Next region of size 16777216
2020-09-29 12:59:35.796361: I tensorflow/core/common_runtime/bfc_allocator.cc:905] Free  at 0x7f8fb1000000 next 18446744073709551615 of size 16777216
2020-09-29 12:59:35.796371: I tensorflow/core/common_runtime/bfc_allocator.cc:898] Next region of size 1048576
2020-09-29 12:59:35.796380: I tensorflow/core/common_runtime/bfc_allocator.cc:905] Free  at 0x7f8fe2e00000 next 1312 of size 144384
2020-09-29 12:59:35.796390: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f8fe2e23400 next 8 of size 256
2020-09-29 12:59:35.796399: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f8fe2e23500 next 1319 of size 256
2020-09-29 12:59:35.796409: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f8fe2e23600 next 1321 of size 256
2020-09-29 12:59:35.796418: I tensorflow/core/common_runtime/bfc_allocator.cc:905] Free  at 0x7f8fe2e23700 next 1320 of size 256
2020-09-29 12:59:35.796427: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f8fe2e23800 next 18 of size 256
2020-09-29 12:59:35.796437: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f8fe2e23900 next 1282 of size 256
2020-09-29 12:59:35.796446: I tensorflow/core/common_runtime/bfc_allocator.cc:905] Free  at 0x7f8fe2e23a00 next 1287 of size 72192
2020-09-29 12:59:35.796456: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f8fe2e35400 next 17 of size 256
2020-09-29 12:59:35.796465: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f8fe2e35500 next 1291 of size 256
2020-09-29 12:59:35.796474: I tensorflow/core/common_runtime/bfc_allocator.cc:905] Free  at 0x7f8fe2e35600 next 1296 of size 256
2020-09-29 12:59:35.796484: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f8fe2e35700 next 1343 of size 256
2020-09-29 12:59:35.796493: I tensorflow/core/common_runtime/bfc_allocator.cc:905] Free  at 0x7f8fe2e35800 next 1289 of size 73472
2020-09-29 12:59:35.796503: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f8fe2e47700 next 21 of size 256
2020-09-29 12:59:35.796513: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f8fe2e47800 next 1377 of size 256
2020-09-29 12:59:35.796522: I tensorflow/core/common_runtime/bfc_allocator.cc:905] Free  at 0x7f8fe2e47900 next 1388 of size 72192
2020-09-29 12:59:35.796532: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f8fe2e59300 next 22 of size 256
2020-09-29 12:59:35.796541: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f8fe2e59400 next 1293 of size 256
2020-09-29 12:59:35.796551: I tensorflow/core/common_runtime/bfc_allocator.cc:905] Free  at 0x7f8fe2e59500 next 1301 of size 256
2020-09-29 12:59:35.796561: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f8fe2e59600 next 1298 of size 256
2020-09-29 12:59:35.796571: I tensorflow/core/common_runtime/bfc_allocator.cc:905] Free  at 0x7f8fe2e59700 next 1290 of size 73472
2020-09-29 12:59:35.796580: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f8fe2e6b600 next 1255 of size 256
2020-09-29 12:59:35.796590: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f8fe2e6b700 next 1352 of size 256
2020-09-29 12:59:35.796600: I tensorflow/core/common_runtime/bfc_allocator.cc:905] Free  at 0x7f8fe2e6b800 next 1373 of size 72192
2020-09-29 12:59:35.796609: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f8fe2e7d200 next 1263 of size 256
2020-09-29 12:59:35.796619: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f8fe2e7d300 next 1371 of size 256
2020-09-29 12:59:35.796628: I tensorflow/core/common_runtime/bfc_allocator.cc:905] Free  at 0x7f8fe2e7d400 next 1286 of size 256
2020-09-29 12:59:35.796641: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f8fe2e7d500 next 1275 of size 256
2020-09-29 12:59:35.796653: I tensorflow/core/common_runtime/bfc_allocator.cc:905] Free  at 0x7f8fe2e7d600 next 1392 of size 73472
2020-09-29 12:59:35.796667: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f8fe2e8f500 next 1267 of size 256
2020-09-29 12:59:35.796679: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f8fe2e8f600 next 1279 of size 256
2020-09-29 12:59:35.796692: I tensorflow/core/common_runtime/bfc_allocator.cc:905] Free  at 0x7f8fe2e8f700 next 1369 of size 72192
2020-09-29 12:59:35.796705: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f8fe2ea1100 next 1 of size 256
2020-09-29 12:59:35.796717: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f8fe2ea1200 next 1362 of size 256
2020-09-29 12:59:35.796729: I tensorflow/core/common_runtime/bfc_allocator.cc:905] Free  at 0x7f8fe2ea1300 next 1299 of size 256
2020-09-29 12:59:35.796741: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f8fe2ea1400 next 1332 of size 256
2020-09-29 12:59:35.796753: I tensorflow/core/common_runtime/bfc_allocator.cc:905] Free  at 0x7f8fe2ea1500 next 1326 of size 73472
2020-09-29 12:59:35.796765: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f8fe2eb3400 next 6 of size 256
2020-09-29 12:59:35.796777: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f8fe2eb3500 next 1324 of size 256
2020-09-29 12:59:35.796789: I tensorflow/core/common_runtime/bfc_allocator.cc:905] Free  at 0x7f8fe2eb3600 next 1338 of size 72192
2020-09-29 12:59:35.796801: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f8fe2ec5000 next 7 of size 256
2020-09-29 12:59:35.796813: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f8fe2ec5100 next 1273 of size 256
2020-09-29 12:59:35.796825: I tensorflow/core/common_runtime/bfc_allocator.cc:905] Free  at 0x7f8fe2ec5200 next 1310 of size 256
2020-09-29 12:59:35.796837: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f8fe2ec5300 next 1355 of size 256
2020-09-29 12:59:35.796849: I tensorflow/core/common_runtime/bfc_allocator.cc:905] Free  at 0x7f8fe2ec5400 next 23 of size 73472
2020-09-29 12:59:35.796861: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f8fe2ed7300 next 10 of size 28928
2020-09-29 12:59:35.796875: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f8fe2ede400 next 15 of size 43264
2020-09-29 12:59:35.796888: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f8fe2ee8d00 next 20 of size 28928
2020-09-29 12:59:35.796900: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f8fe2eefe00 next 18446744073709551615 of size 66048
2020-09-29 12:59:35.796910: I tensorflow/core/common_runtime/bfc_allocator.cc:914]      Summary of in-use Chunks by size: 
2020-09-29 12:59:35.796925: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 35 Chunks of size 256 totalling 8.8KiB
2020-09-29 12:59:35.796938: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 8 Chunks of size 28928 totalling 226.0KiB
2020-09-29 12:59:35.796950: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 6 Chunks of size 43264 totalling 253.5KiB
2020-09-29 12:59:35.796962: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 60672 totalling 59.2KiB
2020-09-29 12:59:35.796974: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 66048 totalling 64.5KiB
2020-09-29 12:59:35.796986: I tensorflow/core/common_runtime/bfc_allocator.cc:921] Sum Total of in-use chunks: 612.0KiB
2020-09-29 12:59:35.796997: I tensorflow/core/common_runtime/bfc_allocator.cc:923] total_region_allocated_bytes_: 135266304 memory_limit_: 68719476736 available bytes: 68584210432 curr_region_allocation_bytes_: 17179869184
2020-09-29 12:59:35.797012: I tensorflow/core/common_runtime/bfc_allocator.cc:929] Stats: 
Limit:                 68719476736
InUse:                      626688
MaxInUse:                104521216
NumAllocs:                    3425
MaxAllocSize:             16777216

2020-09-29 12:59:35.797037: W tensorflow/core/common_runtime/bfc_allocator.cc:424] __________________________________________________________________________*________________________*

Can someone tell me why these allocations are failing? Is it a dataset problem, or a config one?

Please set a smaller batch size and retry.
For example,
batch_size_per_gpu: 4
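
To get a feel for how batch size drives the failing request, the 13.90GiB figure from the log can be divided out. The sketch below assumes, without confirmation from the log, that the buffer holds one batch of 3-channel float32 images. Only requested_bytes and the batch size of 32 come from the post; everything else is an assumption for illustration.

```python
# Figures copied from the post.
requested_bytes = 14_929_920_000   # "rounded to 14929920000" in the bfc_allocator line
batch_size = 32                    # batch_size_per_gpu in the posted config

# Assumptions (not confirmed by the log): float32 pixel values, 3 channels.
bytes_per_value = 4
channels = 3

# Pixels per image if the buffer is exactly one batch of such images.
pixels_per_image = requested_bytes // (batch_size * channels * bytes_per_value)
print(pixels_per_image)  # 38880000


def batch_bytes(bs):
    """Bytes needed for a batch of `bs` images under the assumptions above."""
    return bs * channels * bytes_per_value * pixels_per_image


# How the same buffer would scale with smaller batch sizes.
for bs in (32, 4, 1):
    print(bs, round(batch_bytes(bs) / 2**30, 2), "GiB")
```

If this reading is right, each image holds 38,880,000 pixels (for instance 7200 x 5400), far more than the 416 x 416 network input, so it may be worth checking the resolution of the source images as well as the batch size.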

If the batch size is too low, the process gets killed:
Epoch 1/80
2020-09-30 08:38:11.177945: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-09-30 08:38:11.901130: I tensorflow/core/kernels/cuda_solvers.cc:159] Creating CudaSolver handles for stream 0x7f154f0
2020-09-30 08:38:11.901300: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
/usr/local/bin/tlt-train: line 32: 989 Killed tlt-train-g1 ${PYTHON_ARGS[*]}

I am afraid it is still an OOM issue. Could you try a batch size of 2 or 1?
Or, if possible, do you have a chance to train with 2 GPUs?
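
A bare "Killed" from the shell usually means the process received SIGKILL, which on Linux is often the kernel's out-of-memory killer reacting to exhausted host RAM. That is an assumption here, since the log does not say who sent the signal. A minimal sketch for checking how much host memory is available before starting training, reading the standard /proc/meminfo file (the helper names are made up for this example):

```python
def parse_meminfo(text):
    """Return {field: kibibytes} parsed from /proc/meminfo-style text."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def available_gib(text):
    """MemAvailable converted from kB (KiB) to GiB."""
    return parse_meminfo(text)["MemAvailable"] / 2**20


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as f:
            print(f"MemAvailable: {available_gib(f.read()):.1f} GiB")
    except FileNotFoundError:
        print("/proc/meminfo not found (not a Linux host?)")
```

When training runs inside a container, the value read there may not reflect any memory limit applied to the container itself.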

I have an RTX 6000. Why would I need 2 GPUs with that kind of card?

Please check whether the link below helps.

Could you please share the output of nvidia-smi? Is there any other process using the GPU?
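
For a compact view of GPU memory use, nvidia-smi can also be queried in CSV form. The following is a rough sketch, assuming nvidia-smi is on the PATH and reports plain integers for the memory fields; parse_gpu_csv is a hypothetical helper written for this example.

```python
import shutil
import subprocess

# Ask nvidia-smi for name and memory figures (MiB) as bare CSV rows.
QUERY = [
    "nvidia-smi",
    "--query-gpu=name,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_csv(text):
    """Turn 'name, used, total' rows into a list of dicts (values in MiB)."""
    rows = []
    for line in text.strip().splitlines():
        # Split from the right so a comma inside the GPU name does not break parsing.
        name, used, total = [field.strip() for field in line.rsplit(",", 2)]
        rows.append({"name": name, "used_mib": int(used), "total_mib": int(total)})
    return rows


if __name__ == "__main__":
    if shutil.which("nvidia-smi") is None:
        print("nvidia-smi not found on PATH")
    else:
        try:
            out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
            for gpu in parse_gpu_csv(out):
                print(gpu)
        except (subprocess.CalledProcessError, ValueError) as err:
            print("could not query GPU memory:", err)
```

A GPU that already shows a large used_mib before training starts would suggest another process is holding memory. Note that the failing allocations in the log above are pinned host memory, so this check complements rather than replaces a look at host RAM.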