I’m trying to allocate memory for the algorithm, but according to my calculations it should only use about 315 MB; instead it’s using 3.5 GB of memory, and on the second iteration it gives an illegal memory access (probably the memory is not allocated, and because of that the nth element of the array can’t be accessed).
Here’s my calculation (the data size is 125,381, so I’ll round it to 125,000):
Variables:
sizeff = 125,000
input_num = 390
population_size = 300
GTX 1650 with 14 SMs (maximum 14,336 concurrent threads)
I’m posting all of the code to make sure I’m not missing anything major.
It’s not only using much more memory than expected; it also gives an “an illegal memory access was encountered” error after Population 1 (when I delete the trade->trades[i] = result_id; line, the error doesn’t occur, so the problem is in the trades array).
You’re allocating in device code without freeing it, as far as I can see.
Device-code (in-kernel) allocations are not automatically freed at the end of kernel execution. Studying your code, that may of course be intentional in this case.
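A minimal sketch of that behaviour (not the poster’s code; kernel and variable names are made up): memory obtained with in-kernel malloc stays on the device heap after the kernel returns, and the only way to release it is to call free() from device code in a later (or the same) kernel.

```
#include <cuda_runtime.h>

// Each thread grabs a device-heap allocation; it is NOT released when the
// kernel returns.
__global__ void allocate_kernel(int **slots, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        slots[idx] = (int *)malloc(100 * sizeof(int));
}

// The only way to return that memory to the device heap is a device-side
// free(), e.g. from a follow-up kernel like this one.
__global__ void free_kernel(int **slots, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n && slots[idx] != nullptr) {
        free(slots[idx]);
        slots[idx] = nullptr;
    }
}

int main()
{
    const int n = 256;
    int **slots;
    cudaMalloc(&slots, n * sizeof(int *));
    allocate_kernel<<<1, n>>>(slots, n);   // device heap usage grows here
    cudaDeviceSynchronize();
    free_kernel<<<1, n>>>(slots, n);       // and shrinks again here
    cudaDeviceSynchronize();
    cudaFree(slots);
    return 0;
}
```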
GPU memory usage on a Windows WDDM GPU will also include various overheads that are not directly calculable from your code’s allocations:
WDDM (i.e. display GPU) overhead due to Windows
CUDA context overhead
Each of those could be easily several hundred megabytes or perhaps more.
Additionally, “small” device-side allocations will have noticeable overhead due to allocation granularity. I don’t know the exact figure offhand, and it may vary with GPU, CUDA version, OS, etc., but as a rule of thumb I would expect a minimum device-side allocation granularity of 4 KB.
Why not track where the big hits to memory usage are occurring? You should be able to narrow things down quite a bit by scattering cudaMemGetInfo calls after every step in your host code sequence.
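A minimal sketch of that kind of instrumentation, assuming a simple helper you call between host-side steps (the labels and call sites are illustrative, not taken from the original code):

```
#include <cstdio>
#include <cuda_runtime.h>

// Print free/total device memory with a label, so drops between steps are
// easy to spot.
static void report_mem(const char *label)
{
    size_t free_bytes = 0, total_bytes = 0;
    cudaMemGetInfo(&free_bytes, &total_bytes);
    printf("[mem] %-25s free: %8.1f MB / total: %8.1f MB\n",
           label,
           free_bytes / (1024.0 * 1024.0),
           total_bytes / (1024.0 * 1024.0));
}

int main()
{
    report_mem("after context init");
    // ... set the heap limit, allocate, launch a kernel, etc., then call
    // report_mem("after first kernel"); and so on after each step.
    return 0;
}
```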
When you call cudaDeviceSetLimit, the new limit takes effect at the next kernel launch that uses the feature. In this case the call is reserving a 3 GiB device malloc heap; the heap allocation is performed before the grid runs, and device-side calls to cudaMalloc sub-allocate from that heap.
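A minimal sketch of that sequence, with an illustrative 2 GiB size; the reservation itself is only taken out of device memory when the first kernel that uses in-kernel malloc/new launches, not at the cudaDeviceSetLimit call:

```
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    size_t heap_bytes = 2ULL * 1024 * 1024 * 1024;   // 2 GiB, illustrative
    cudaError_t err = cudaDeviceSetLimit(cudaLimitMallocHeapSize, heap_bytes);
    if (err != cudaSuccess) {
        printf("cudaDeviceSetLimit failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    size_t actual = 0;
    cudaDeviceGetLimit(&actual, cudaLimitMallocHeapSize);
    printf("[+] Device malloc heap limit set to %zu bytes\n", actual);

    // The whole heap is carved out of device memory at the first kernel
    // launch that performs an in-kernel malloc/new, regardless of how much
    // that kernel actually sub-allocates.
    return 0;
}
```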
Ohh yes, it’s allocating 3 GB of memory even though it’s not being used. But I still don’t understand why it gives an illegal memory access after population 0. Here’s the output:
[+] Heap size successfully set to 2 GBs
[+] Data preparation finished successfully!
[+] Population created successfully!
[!] Population 0 finished
GPUassert: an illegal memory access was encountered a.cu 358
As a diagnostic when using in-kernel allocations of any sort (e.g. new, malloc, or cudaMalloc in device code), test the returned pointer for NULL. If it is a null pointer, that is the API’s way of communicating that the allocation failed, and most of the time a failed allocation is due to insufficient heap space.
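A minimal sketch of that diagnostic, with hypothetical names and sizes: check each in-kernel allocation for NULL before storing or using the pointer, and make the failure visible instead of letting a null dereference surface later as an illegal memory access.

```
#include <cstdio>
#include <cuda_runtime.h>

__global__ void create_population(int **trades, int trade_count)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < trade_count) {
        int *buf = (int *)malloc(trade_count * sizeof(int));
        if (buf == nullptr) {
            // Allocation failed -- almost always insufficient device heap.
            printf("thread %d: device malloc failed\n", idx);
            return;   // or set an error flag the host can check
        }
        trades[idx] = buf;
    }
}

int main()
{
    const int trade_count = 256;
    int **trades;
    cudaMalloc(&trades, trade_count * sizeof(int *));
    create_population<<<1, trade_count>>>(trades, trade_count);
    cudaDeviceSynchronize();
    cudaFree(trades);
    return 0;
}
```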
To sort out why this is happening, I would add those tests to your code. If one of your allocations is returning NULL, then that is a possible reason for the illegal memory access (you can also get some confirmation here by running compute-sanitizer and seeing if the reported out-of-bounds access is at or near a zero pointer).
If none of your allocations are returning NULL, then use the method described here to continue tool-assisted debugging and localize the fault to a particular line of device code. Once you’ve done that, debugging is considerably simplified, in my opinion: you can usually identify the offending pointer, test whether it is out of range, etc., and then try to discover why, using ordinary or even printf-based debugging.
Those allocations for the neurons and connections are made only once and are used until the script finishes, so it’s not required to free them.