I’m a newbie with CUDA, so assistance is greatly appreciated, especially when directed at my level of understanding!
I have an old serial code that I am trying to port to a GTX 275. It runs fine on the CPU, but when I port just one (very) compute-intensive loop I get:
“call to cuMemAlloc returned error 2: Out of Memory
CUDA driver version: 4000”
The array used in the loop has ~4,000,000 elements of 4 bytes each (about 16 MB).
To make life simpler, I wrote a matrix multiplication program of a few lines (separate from the code above) and increased the size until I reproduced the same error. array(3000,3000,3) worked, but array(4000,4000,3) returned the same error as quoted above. The screen also temporarily blacked out on the latter…
Using GPU-Z, it looks like I should easily have sufficient memory, and in the smaller problem (the 3000 one) GPU load goes up to 100% for a second or two.
Does anyone know what could be wrong or what I can do to fix it?
For info, I am working in Fortran with the Portland Visual Fortran compiler with accelerator directives (not CUDA in C). I don’t believe this is where the problem lies, but I am open to suggestions.
If anyone here uses or is thinking of using the PG compiler, I would recommend it. It has made life much easier for me while learning.
You could try this first to see how much free and total device memory you’ve got (in bytes; from CUDA 3.2 onward the arguments are size_t pointers):
CUresult cuMemGetInfo(size_t* free, size_t* total);
The screen going black, or something like that, is almost certainly an indication of some out-of-bounds memory operation. Maybe you could consider setting up another card for graphics and using the GTX 275 for CUDA only.
Thanks for the post. I’m not sure how to do that through the Portland compiler… GPU-Z does imply, though, that I’m not using much memory.
I do have another graphics card. I’ll try to figure out how to use the 275 for CUDA only. I’m sure I read somewhere that you can specify which ‘device’ is used for porting to GPGPU.
If I do this and it all works, then I understand it won’t black out (which is good!), but it won’t solve the root problem for the code I’m trying to parallelize.
Things other than the array can use device memory: kernels reserve per-thread local memory when contexts are established. If the kernel uses a lot of local memory (it can be up to 16 KB per thread), loading it into a context can consume a lot of memory. Given that you have no direct access to the actual CUDA code being run (IIRC), I would raise this with PGI.