CUDA in-kernel malloc

I have narrowed the problem in my code down to the malloc calls in my kernel. They don't return an error, but the values of other variables in the kernel are changing, which I suspect is memory corruption from using too much of the heap. A cudaThreadGetLimit call in my code reports 8MB. My kernel launch looks as follows:

dim3 dimGrid (100,100);
dim3 dimBlock (1,1);
kernel <<< dimGrid, dimBlock >>> (…arguments…);

So I want 10,000 threads (just trying to keep the code I am working with simple). Inside the kernel there are two places with mallocs. The first allocates two char sequences (at most 500 chars each) and a matrix of at most 500*500 ints. By my calculations that's less than the limit reported by cudaThreadGetLimit. Am I looking at this incorrectly? Is that value telling me something different from what I think? Does the 8MB mean per thread, or is it the maximum that can be allocated by all threads together? Thanks for the help; I am a beginning CUDA programmer.

Do you deallocate the memory when the thread exits?

I have another question: can we malloc memory inside a kernel?
Do you have an example of this?

See section B.15 of the CUDA 4.0 Programming Guide:
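A minimal sketch along the lines of that section (assuming a compute capability 2.0+ card and CUDA 3.2 or later; the kernel name, buffer size, and the 32MB heap figure are just illustrative):

```cuda
#include <cstdio>
#include <cstdlib>

__global__ void mallocTest() {
    // Each thread allocates from the device heap; malloc returns NULL on failure.
    char* ptr = (char*)malloc(64);
    if (ptr == NULL) {
        printf("thread %d: heap exhausted\n", threadIdx.x);
        return;
    }
    ptr[0] = (char)threadIdx.x;   // use the buffer
    free(ptr);                    // free before the thread exits, or the heap leaks
}

int main() {
    // Optionally enlarge the device heap before the first kernel launch;
    // once a kernel has run, the heap size can no longer be changed.
    cudaThreadSetLimit(cudaLimitMallocHeapSize, 32 * 1024 * 1024);
    mallocTest<<<1, 4>>>();
    cudaThreadSynchronize();
    return 0;
}
```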

If you’re using smallish amounts of memory per block in your kernel (16KB or 48KB, depending on your card), it’s much better to use shared memory; in a nutshell, it’s faster. With that said, if you’re using CUDA 3.2 or higher then you can use both malloc and free in device code.
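For comparison, a per-block scratch buffer in shared memory instead of in-kernel malloc (a sketch; the kernel name and the 256-element buffer are arbitrary, and the buffer must fit in the card's shared memory):

```cuda
__global__ void scaleWithSharedScratch(const int* in, int* out) {
    __shared__ int scratch[256];      // one buffer per block, in fast on-chip memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    scratch[threadIdx.x] = in[i];
    __syncthreads();                  // make the writes visible to the whole block
    out[i] = scratch[threadIdx.x] * 2;   // no malloc/free needed
}
```

Launched as scaleWithSharedScratch<<<blocks, 256>>>(d_in, d_out), so that blockDim.x matches the buffer size.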

I’d advise checking out the SDK examples.