CUDA is allocating much more GPU memory than expected

I’m trying to allocate memory for my algorithm. According to my calculations it should only use about 315 MB, but it’s using 3.5 GB of memory, and on the second iteration it gives an illegal memory access error (probably the memory is not being allocated, so the nth element of the array can’t be accessed).

Here’s my calculation (the data size is 125,381, so I’ll round it down to 125,000).

Variables:

  • sizeff = 125,000
  • input_num = 390
  • population_size = 300
  • GTX 1650 with 14 SMs (maximum 14,336 concurrent threads)

Static Memory Allocations:

  1. Rolling Data: 125,000 * 6 * 65 * 4 bytes = 195,000,000 bytes ≈ 185.96 MB
  2. Neurons for Networks: (390 + 3) * 28 bytes * 300 = 3,301,200 bytes ≈ 3.15 MB
  3. Connections for Networks: (390 * 3) * 20 bytes * 300 = 7,020,000 bytes ≈ 6.69 MB
  4. Trades Array: 125,000 * 1 byte * 300 = 37,500,000 bytes ≈ 35.76 MB
  5. Incoming connections for 3 output neurons: 3 * 390 * 300 * 4 bytes = 1,404,000 bytes ≈ 1.34 MB
  6. Other small allocations: ~1.5 MB (estimated)

Total Static Memory: ~237 MB

Dynamic Memory (per thread):

  1. In GetNetworkOutput function:
  • outputs: (390 + 3) * 4 bytes = 1,572 bytes
  • values: (390 + 3) * 4 bytes = 1,572 bytes
  • completed: (390 * 3) * 1 byte = 1,170 bytes
    Total: 4,314 bytes
  2. In EvaluateNetworks function:
  • data_chnk: 390 * 4 bytes = 1,560 bytes

Total per thread: 5,874 bytes

Total with GTX 1650’s concurrency:

With maximum 14,336 concurrent threads: 14,336 * 5,874 bytes = 84,209,664 bytes ≈ 80.31 MB

Total GPU RAM usage: 237 MB + 80.31 MB ≈ 317.31 MB

I’m sending all of the code to make sure I’m not missing anything major.

It’s not only using much more memory than expected, it also gives an “illegal memory access was encountered” error after Population 1. (When I delete the trade->traades[i] = result_id; line the error goes away, so the problem is in the trades array.)

You’re allocating in device code without freeing it, as far as I can see.

Device-code (in-kernel) allocations are not automatically freed at the end of a kernel’s execution. Studying your code, that may of course be intentional in this case.
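For illustration, here is a minimal sketch (a hypothetical kernel, not taken from your code) of how an in-kernel allocation persists on the device heap unless it is explicitly freed:

    // Each thread allocates from the device malloc heap. Anything not freed
    // survives the end of the kernel, so repeated launches without a matching
    // delete[] keep consuming heap space until the heap limit is exceeded.
    __global__ void leaky_kernel(int n)
    {
        float *scratch = new float[n];
        if (scratch == nullptr) return;   // allocation can fail once the heap is exhausted
        // ... use scratch ...
        delete[] scratch;                 // omit this and the allocation leaks across launches
    }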

GPU memory usage on a Windows WDDM GPU will also include various overheads that are not directly calculable from your code’s allocations:

  • WDDM (i.e. display GPU) overhead due to Windows
  • CUDA context overhead

Each of those could easily be several hundred megabytes, or perhaps more.

Additionally, “small” device-side allocations will have noticeable overhead due to allocation granularity. I don’t happen to know what it is offhand, and it may vary by GPU, CUDA version, OS, etc., but as a rule of thumb I would expect a minimum device-side allocation granularity of 4 KB.
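To illustrate with a purely hypothetical figure (the actual granularity isn’t documented): the per-thread estimate above consists of four separate allocations (outputs, values, completed, data_chnk), each smaller than 4 KB. If each were rounded up to 4 KB, that would be 16 KB per thread, or 14,336 * 16 KB ≈ 224 MB across the maximum concurrent threads, roughly 2.8 times the ~80 MB computed from the raw sizes.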

Why not track where the big hits to memory usage are occurring? You should be able to narrow things down quite a bit by scattering cudaMemGetInfo calls after every step in your host code sequence.
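A minimal sketch of that kind of bookkeeping (host code; the step names and variables are placeholders, not from your program):

    #include <cstdio>
    #include <cuda_runtime.h>

    // Print free/total device memory so the step that eats memory stands out.
    static void report(const char *label)
    {
        size_t free_b = 0, total_b = 0;
        cudaMemGetInfo(&free_b, &total_b);
        printf("[mem] %-25s free = %zu MiB / total = %zu MiB\n",
               label, free_b >> 20, total_b >> 20);
    }

    // Usage, interleaved with the existing host sequence:
    //   report("after context creation");
    //   cudaMalloc(&d_rolling_data, rolling_bytes);              report("after rolling data");
    //   cudaDeviceSetLimit(cudaLimitMallocHeapSize, heap_bytes); report("after heap limit");
    //   kernel<<<grid, block>>>(...); cudaDeviceSynchronize();   report("after kernel");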

When you call cudaDeviceSetLimit, the limit takes effect at the next kernel launch that uses the feature. In this case the call is reserving a 3 GiB device malloc heap. The heap allocation is performed before the grid launches. Device-side calls to cudaMalloc (and in-kernel malloc/new) sub-allocate from that heap.
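In host-code terms, that sequence looks roughly like the sketch below (the 3 GiB size is illustrative, taken from the discussion above, not from your exact code):

    #include <cuda_runtime.h>

    int main()
    {
        // Must be set before the first kernel launch that uses in-kernel allocation.
        size_t heap_bytes = (size_t)3 << 30;                      // 3 GiB
        cudaDeviceSetLimit(cudaLimitMallocHeapSize, heap_bytes);

        size_t granted = 0;
        cudaDeviceGetLimit(&granted, cudaLimitMallocHeapSize);    // read back what was granted

        // The heap itself is carved out of device memory at the next kernel launch,
        // whether or not the kernels end up using all of it; in-kernel allocations
        // then sub-allocate from that reserved region.
        return 0;
    }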


Ohh yes, it’s allocating 3 GB of memory even though it’s not being used. But I still don’t understand why it gives the illegal memory access after Population 0. Here’s the output:

[+] Heap size successfully set to 2 GBs
[+] Data preparation finished successfully!
[+] Population created successfully!
[!] Population 0 finished
GPUassert: an illegal memory access was encountered a.cu 358

As a diagnostic when using in-kernel allocations of any sort (e.g. new, malloc, or cudaMalloc in device code), test the returned pointer for NULL. A null pointer is the API’s way of communicating that the allocation failed, and most of the time a failed allocation is due to insufficient heap space.
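A minimal sketch of that diagnostic (a hypothetical kernel using device-side malloc, not your code):

    #include <cstdio>
    #include <cstdlib>

    __global__ void evaluate_kernel(int n)
    {
        float *outputs = (float *)malloc(n * sizeof(float));
        if (outputs == nullptr)
        {
            // A null return means the device heap allocation failed, usually because
            // the malloc heap is exhausted; bail out instead of dereferencing it and
            // triggering an illegal memory access later.
            printf("thread %d: device heap allocation failed\n",
                   blockIdx.x * blockDim.x + threadIdx.x);
            return;
        }
        // ... use outputs ...
        free(outputs);
    }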

To sort out why this is happening, I would add those tests to your code. If one of your allocations is returning NULL, that is a possible reason for the illegal memory access (you can also get some confirmation by running compute-sanitizer and seeing whether the reported out-of-bounds access is at or near a zero pointer).

If none of your allocations are returning NULL, then use the method described here to continue tool-assisted debugging and localize the fault to a particular line of device code. Once you’ve done that, debugging is considerably simpler, in my opinion: you can usually identify the offending pointer, test whether it is out of range, and then try to discover why, using ordinary or even printf-based debugging.


Almost certainly, you are running out of (exceeding) available heap space.

But normally even 400 MB of memory should be enough, and I’m setting the limit to 3 GB. How can I prevent the issue?

Those allocations for the neurons and connections are made only once and are used until the program finishes, so they don’t need to be freed.