Large (300MB+) memory allocation issues

My system: GTX 980, Ubuntu 14.04, Eclipse IDE

I’ve been attempting to allocate a chunk of pitched memory on the device with dimensions of 65536 bytes x 5000 rows, so roughly 300 MB. I have an error trap on the call to cudaMallocPitch and it does return success. However, when I try to access the memory in device code I get errors, which cuda-memcheck has thus far been useless in diagnosing. I’m assuming an allocation that large is going to give unstable behavior.

My questions are:

  1. Are there size limits on cudaMallocPitch that I don’t see in the documentation? I can sacrifice some width in the rows, but I cannot drop any rows; the only alternative would be to cut it into multiple allocations with fewer rows. That’s doable, but eventually the code will be deployed to a multi-GPU environment, so I need to be certain that whatever size I end up with will always work.
  2. Could the fact that I am using the GTX 980 as the display card in my system (no on-board VGA) be causing the issue? i.e. if the card were solely for CUDA, could I assume the memory is basically “empty” until I call cudaMalloc?
  3. Do you have any suggestions for best practices on addressing this chunk of memory? i.e., the way I’m doing it now is to allocate a device memory pointer on the host, and copy the pointer address to a symbol in constant memory, which is then read by the kernel. Yes, I know I could just pass the address as a parameter, but the kernel is fairly register intensive and I figured saving a local variable/parameter buffer was worth the extra read time (I’m not utilizing much more than 4K of constant memory so the cache should basically have all of it). Also, if we were to break up the large chunk into smaller pieces, I would need to pass multiple parameters and that could get cumbersome quickly.

It would be best if you could post a minimal, self-contained, buildable program that reproduces the issue and that other forum participants can build and run to check what might be wrong.

It’s impossible to tell from the description what exactly it is you are doing and how exactly the code is failing. For now a reasonable working hypothesis is that there is a bug in your code. It will also be helpful to state the CUDA version and driver version, and the nvcc command line used to build the code.

Why? Because your code doesn’t work?

cudaMallocPitch doesn’t do anything magical. It is no different than a cudaMalloc operation, except that it will increase the size of the allocation beyond what you explicitly requested (and it returns a “pitch” number).
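For reference, a minimal sketch of what such a pitched allocation and kernel-side access typically look like, using the dimensions from the original post (the error handling and the `touch` kernel are illustrative, not the poster’s actual code):

```cuda
#include <cstdio>

// Illustrative kernel: each thread touches one byte of a pitched 2D region.
__global__ void touch(unsigned char *buf, size_t pitch, int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height) {
        // Rows are separated by `pitch` bytes, which may exceed `width`.
        unsigned char *p = buf + (size_t)row * pitch + col;
        *p = 1;
    }
}

int main()
{
    const int width = 65536, height = 5000;   // ~312.5 MB logical size
    unsigned char *d_buf = nullptr;
    size_t pitch = 0;
    cudaError_t err = cudaMallocPitch((void **)&d_buf, &pitch, width, height);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMallocPitch: %s\n", cudaGetErrorString(err));
        return 1;
    }
    dim3 block(256, 1), grid((width + 255) / 256, height);
    touch<<<grid, block>>>(d_buf, pitch, width, height);
    err = cudaDeviceSynchronize();            // catches kernel-side errors too
    if (err != cudaSuccess)
        fprintf(stderr, "kernel: %s\n", cudaGetErrorString(err));
    cudaFree(d_buf);
    return 0;
}
```

Note that the kernel indexes rows by `pitch`, not by `width`; mixing the two up is a common source of out-of-bounds accesses with pitched memory.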

  1. Are there size limits on cudaMallocPitch that I don’t see in the documentation?

None that I know of. There should be no trouble allocating a gigabyte or more via cudaMallocPitch.

  2. Could the fact that I am using the GTX 980 as the display card in my system (no on-board VGA) be causing the issue?

Certainly. Since we have no idea what “the issue” is, it’s entirely possible. For example, suppose that the amount of “work” your kernel does, and therefore its execution time, increases with the size of the allocation. At some point, you may run into a Linux display watchdog, since your GTX 980 is servicing a display. This is normally discoverable with proper cuda error checking. But I don’t think memory size per se is the issue. A GTX 980 has 4GB of memory. At most about 0.5GB will be used by the display, I would guess (and rather than guessing, there are various tools, such as nvidia-smi or cudaMemGetInfo(), to assess the situation). So a 300MB allocation by itself is not going to be an issue.
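As a sketch, the cudaMemGetInfo() query mentioned above looks like this (the reported numbers will of course vary by system, and on a display GPU the free figure is already reduced by the display’s footprint):

```cuda
#include <cstdio>

// Query how much device memory is actually free before a large allocation.
int main()
{
    size_t free_bytes = 0, total_bytes = 0;
    cudaError_t err = cudaMemGetInfo(&free_bytes, &total_bytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemGetInfo: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("free: %.1f MB / total: %.1f MB\n",
           free_bytes / (1024.0 * 1024.0), total_bytes / (1024.0 * 1024.0));
    return 0;
}
```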

  3. Regarding this question, I’ve seen other code examples where someone thinks it’s sensible to explicitly put pointers in constant memory. I’m not sure how this concept ever got traction, but it makes no sense to me. Pass the pointer as a kernel parameter. Do you know where those get stored? In constant memory! And without more information, there is absolutely no reason to arbitrarily break up a large chunk allocation into smaller pieces.
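To illustrate the point about kernel parameters (a sketch; the kernel name and signature are made up): the pointer argument below is already placed in constant memory by the driver/hardware when the kernel launches, so there is nothing to gain from staging it through a __constant__ symbol and cudaMemcpyToSymbol yourself.

```cuda
// Kernel parameters (including the pointer) live in constant memory already,
// so passing the device pointer directly is the simple, idiomatic approach.
__global__ void process(float *data, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

// Host side: no __constant__ staging needed.
//   float *d_data;
//   cudaMalloc(&d_data, n * sizeof(float));
//   process<<<grid, block>>>(d_data, n);
```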

I find cuda-memcheck to be an extremely powerful and useful tool, especially with codes compiled with -lineinfo. Does cuda-memcheck report any errors with your code? If so, rather than grasping at concocted theories, I think you would be well-advised to grab a hold of one of those errors and aggressively understand it and root it out of your code. Then wash, rinse, and repeat.
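The typical workflow looks like this (a sketch; the file and binary names are placeholders):

```shell
# Compile with line information so cuda-memcheck can map errors to source lines
nvcc -lineinfo -o myapp myapp.cu

# Run under cuda-memcheck; out-of-bounds accesses are reported with the
# kernel name and (thanks to -lineinfo) the offending source line
cuda-memcheck ./myapp
```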

For both debugging and profiling of code, it’s common practice to let conventional wisdom guide our efforts, but in my opinion that wisdom is frequently misguided. I would suggest using the tools, rather than speculation, to guide your efforts.

I will try… the problem is that I am basically writing an application that translates a complex model written in a different scripting language into CUDA. I built a test case and it worked fine, but when I generated the full model I started to get all these weird (and non-repeating) errors. It will be hard to come up with an “intermediate” stage that still fails, but I will give it a crack - that might tell me what my error is anyway. I doubt it’s a bug in the code; not because I’m a master programmer, but because all the code was generated algorithmically, so if I had screwed up building that it would be pretty obvious. Will post again soon.

The fact that the failing code has been generated rather than written manually would seem to have no bearing on the chances that it contains bugs. In my experience, code generators can generate “bad” code just like a human can.

Standard debugging suggestions should apply to your situation: (1) check all status returns, (2) bisect and conditionally disable portions of the code to narrow down the source of error, (3) use logging to get a trace of activity, (4) single-step in a debugger, etc.
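Suggestion (1) is commonly handled with a small wrapper macro like the following (a sketch, not part of the CUDA API; the macro name is arbitrary):

```cuda
#include <cstdio>
#include <cstdlib>

// Wrap every CUDA runtime call so failures are reported with file and line,
// and the program stops at the first error instead of failing mysteriously later.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err__ = (call);                                       \
        if (err__ != cudaSuccess) {                                       \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                   \
                    cudaGetErrorString(err__), __FILE__, __LINE__);       \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

// Usage:
//   CUDA_CHECK(cudaMallocPitch((void **)&d_buf, &pitch, width, height));
//   kernel<<<grid, block>>>(...);
//   CUDA_CHECK(cudaGetLastError());        // launch-time errors
//   CUDA_CHECK(cudaDeviceSynchronize());   // asynchronous kernel errors
```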

As txbob points out, cuda-memcheck, especially in conjunction with modern GPU architectures, can diagnose and pinpoint a whole slew of issues. Highly recommended.

Yea, I’m an idiot.
It was a bug in the code - basically, I had an integer variable that was being used as an index into some arrays and I wasn’t initializing it properly, so it had some ridiculously high value that took the reference out of bounds. Thanks for the feedback; you prevented me from going down a way more complicated route than I needed to.