Problems with local memory

Hi people, first topic here and I’m a newbie in CUDA.
I’m writing a parallel version (CUDA C) of a graph coloring algorithm. In my kernel function I use five linearized arrays (they were 2D matrices in the sequential algorithm), and each thread needs its own copy of each one. I need them to be local because each thread runs a long process that uses these arrays many times. I create and allocate them inside the kernel like this:

double* trail = (double*) malloc(d_problem.max_colors * d_problem.nof_vertices * sizeof(double));

My problem is that when these arrays get big (e.g. when nof_vertices = 450) I get an “an illegal memory access was encountered” error, but the funny thing is that this happens only when I launch more than 32 threads (I know that number of threads is laughable, but with <= 32 threads it runs OK). If I run the same code on a problem where nof_vertices = 128, it runs even with 4096 threads (although the kernel run time increases a lot when I increase the number of threads, which I find really odd; I’m going to look into this later). To summarize, I think the problem is with my local memory data. What confuses me is that the error seems to depend on the size of each array rather than on the total amount of memory allocated: the nof_vertices = 128 problem runs with 4096+ threads, while the nof_vertices = 450 problem can’t even run with 64 threads.

So my questions are: Is it good practice to use this much local memory? Why do I get an error when the arrays are big, but not when the total memory is a lot bigger? Is there some limit on per-thread or per-block memory? Would it solve the problem to move all this data to global memory and do index math with the thread ID so that each thread accesses only its own portion? But global memory is the slowest, so that would destroy my performance, right?

I’m sorry for throwing all this at once. I know I’m lost and confused, but I googled everywhere and couldn’t find anyone with a similar problem. If someone can help I will be really grateful!

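You might want to read the relevant section of the programming guide. Note that in-kernel malloc allocates from a device heap with a default size limit of 8MB in total (shared by all threads, not per thread), and that the limit can be increased. When the heap is exhausted, in-kernel malloc returns NULL, and writing through that pointer is what produces the illegal memory access.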

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#dynamic-global-memory-allocation-and-operations
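
For illustration, here is a minimal sketch of raising the heap limit and checking the result of in-kernel malloc. The kernel name alloc_kernel, the failure counter, the 256MB value, and max_colors = 10 are assumptions made up for this example, not taken from your code. The limit has to be set before the first kernel that uses malloc is launched.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread allocates one scratch array from the
// device heap and counts a failure instead of dereferencing a NULL pointer.
__global__ void alloc_kernel(size_t doubles_per_thread, int *failures)
{
    double *trail = (double *)malloc(doubles_per_thread * sizeof(double));
    if (trail == nullptr) {          // heap exhausted: do not touch the pointer
        atomicAdd(failures, 1);
        return;
    }
    trail[0] = 1.0;                  // touch both ends of the allocation
    trail[doubles_per_thread - 1] = 2.0;
    free(trail);
}

int main()
{
    // Raise the device heap from the 8MB default (256MB is just an example).
    size_t heap_bytes = 256u * 1024u * 1024u;
    cudaError_t err = cudaDeviceSetLimit(cudaLimitMallocHeapSize, heap_bytes);
    if (err != cudaSuccess) {
        printf("cudaDeviceSetLimit failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    int *d_failures = nullptr;
    cudaMalloc(&d_failures, sizeof(int));
    cudaMemset(d_failures, 0, sizeof(int));

    // Assumed sizes: nof_vertices = 450, max_colors = 10 (hypothetical).
    size_t doubles_per_thread = (size_t)10 * 450;
    alloc_kernel<<<2, 32>>>(doubles_per_thread, d_failures);

    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    int h_failures = 0;
    cudaMemcpy(&h_failures, d_failures, sizeof(int), cudaMemcpyDeviceToHost);
    printf("threads whose malloc returned NULL: %d\n", h_failures);

    cudaFree(d_failures);
    return 0;
}
```

A rough budget is threads x arrays per thread x max_colors x nof_vertices x sizeof(double). The heap also has some bookkeeping overhead, so the usable amount is somewhat less than the limit you set.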

Hey, that solved the problem! Thank you so much! To launch big problems with a lot of threads I changed the heap size to 256MB and it works! The performance is pretty bad, though, and it gets worse when I launch more threads. Since the code is parallel, that doesn’t make much sense, does it? Any advice or overall suggestions?
And thank you again!

Device malloc is slow. The more you do, the slower it is. Instead, try to allocate a single buffer from the host side using cudaMalloc, and carve it up for use by as many threads as needed.
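
A minimal sketch of that idea, assuming one scratch array per thread (you would repeat it, or enlarge the slice, for all five arrays). The names color_kernel and scratch and the value max_colors = 10 are hypothetical. Each thread finds its own slice from its global thread index, so there is no malloc inside the kernel at all.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: thread tid owns the slice
// [tid * doubles_per_thread, (tid + 1) * doubles_per_thread) of scratch.
__global__ void color_kernel(double *scratch, size_t doubles_per_thread,
                             int max_colors, int nof_vertices, int num_threads)
{
    size_t tid = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= (size_t)num_threads) return;

    double *trail = scratch + tid * doubles_per_thread;

    // Linearized 2D access: element (c, v) lives at c * nof_vertices + v.
    for (int c = 0; c < max_colors; ++c)
        for (int v = 0; v < nof_vertices; ++v)
            trail[(size_t)c * nof_vertices + v] = (double)tid;
}

int main()
{
    const int nof_vertices = 450;
    const int max_colors   = 10;     // assumed value, for illustration only
    const int blocks = 2, threads_per_block = 32;
    const int num_threads = blocks * threads_per_block;

    size_t doubles_per_thread = (size_t)max_colors * nof_vertices;
    size_t total_doubles = doubles_per_thread * num_threads;

    // One host-side allocation for every thread's scratch space.
    double *d_scratch = nullptr;
    cudaError_t err = cudaMalloc(&d_scratch, total_doubles * sizeof(double));
    if (err != cudaSuccess) {
        printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    color_kernel<<<blocks, threads_per_block>>>(d_scratch, doubles_per_thread,
                                                max_colors, nof_vertices,
                                                num_threads);
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Copy back and check that each slice holds its owner's thread index.
    std::vector<double> h(total_doubles);
    cudaMemcpy(h.data(), d_scratch, total_doubles * sizeof(double),
               cudaMemcpyDeviceToHost);
    bool ok = true;
    for (int t = 0; t < num_threads && ok; ++t)
        ok = (h[(size_t)t * doubles_per_thread] == (double)t);
    printf("slices %s\n", ok ? "look correct" : "are wrong");

    cudaFree(d_scratch);
    return 0;
}
```

With this contiguous-slice layout, neighboring threads access addresses far apart. If memory throughput turns out to be the bottleneck, an interleaved layout (element i of thread t stored at i * num_threads + t) may coalesce better, but that is something to measure rather than assume.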