CL_OUT_OF_RESOURCES error on clGetEventProfilingInfo

Hi

I am using “OpenCL 1.0 CUDA 4.0.1” to program an Nvidia Tesla C2050 on Linux. I get a CL_OUT_OF_RESOURCES error from clGetEventProfilingInfo while measuring the kernel execution time. My kernel calls other functions. Calling one of these functions several times (5 or more) gives the error even with get_global_size(0) = 32, whereas calling it only once works for global sizes up to 1024. Why does calling the function several times give this error?

My kernel program looks like:

void mul()    // multiplies two polynomials in a binary field, each represented by 20 unsigned long words

for (i = 0; i < 612; i++)
{
    ...
    mul(...);    // mul is called 5 times inside a loop running 612 times
    mul(...);
    mul(...);
    mul(...);
    mul(...);
    ...
}
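
To make the per-call storage concrete, here is a rough plain-C sketch of the kind of multiplication the comment on mul() describes. It is not the actual kernel code. The names gf2_poly_mul and GF2_WORDS are made up for illustration, the words are assumed to be 64 bits wide (as unsigned long is in OpenCL C), and the modular reduction is left out because the field polynomial is not given in the post.

```c
#include <stdint.h>
#include <string.h>

#define GF2_WORDS 20  /* operand length from the post: 20 words (hypothetical name) */

/* Carry-less (XOR) product of two polynomials over GF(2).
 * a, b: GF2_WORDS 64-bit words each, least significant word first.
 * r:    2 * GF2_WORDS words; receives the unreduced product.
 * Assumption: no reduction step, since the field polynomial is unknown. */
void gf2_poly_mul(const uint64_t a[GF2_WORDS], const uint64_t b[GF2_WORDS],
                  uint64_t r[2 * GF2_WORDS])
{
    memset(r, 0, 2 * GF2_WORDS * sizeof(uint64_t));
    for (int i = 0; i < GF2_WORDS; i++) {
        for (int bit = 0; bit < 64; bit++) {
            if (((a[i] >> bit) & 1u) == 0)
                continue;
            /* XOR in b shifted left by 64*i + bit bit positions. */
            for (int j = 0; j < GF2_WORDS; j++) {
                r[i + j] ^= b[j] << bit;
                if (bit != 0)  /* avoid an undefined shift by 64 */
                    r[i + j + 1] ^= b[j] >> (64 - bit);
            }
        }
    }
}
```

Under these assumptions each operand takes 20 × 8 = 160 bytes and the unreduced product takes 320 bytes, so every temporary of this kind that is live at the same time adds to the private storage each work-item needs.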

This gives a CL_OUT_OF_RESOURCES error from clGetEventProfilingInfo while measuring the kernel execution time.
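
For context, the timing measurement amounts to something like the sketch below. It assumes the two timestamps were read with clGetEventProfilingInfo using CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END (both reported in nanoseconds) on a queue created with CL_QUEUE_PROFILING_ENABLE. The helper kernel_time_ms and the SKETCH_ constants are hypothetical. The constants only mirror the values in CL/cl.h so that the snippet compiles without the OpenCL headers.

```c
#include <stdint.h>
#include <stdio.h>

/* Values mirror CL/cl.h; redefined here (hypothetical names) only so the
 * sketch builds without the OpenCL headers. */
#define SKETCH_CL_SUCCESS          0
#define SKETCH_CL_OUT_OF_RESOURCES (-5)

/* status:   return code of the clGetEventProfilingInfo call(s)
 * start_ns: value read for CL_PROFILING_COMMAND_START
 * end_ns:   value read for CL_PROFILING_COMMAND_END
 * Returns 0 and stores the elapsed time in milliseconds in *ms_out,
 * or -1 if the query failed or the timestamps are inconsistent. */
int kernel_time_ms(int status, uint64_t start_ns, uint64_t end_ns, double *ms_out)
{
    if (status != SKETCH_CL_SUCCESS) {
        fprintf(stderr, "profiling query failed with status %d%s\n", status,
                status == SKETCH_CL_OUT_OF_RESOURCES ? " (CL_OUT_OF_RESOURCES)" : "");
        return -1;
    }
    if (end_ns < start_ns)
        return -1;
    *ms_out = (double)(end_ns - start_ns) / 1.0e6;  /* ns -> ms */
    return 0;
}
```

Checking the status of the earlier calls (the kernel enqueue and any clFinish or clWaitForEvents) in the same way may help show which call first reports the failure.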
Is there a limit on the size of each thread's private memory? If so, I may be exceeding it.
I have another question: where is each thread's private memory allocated? Is it on chip and as fast as shared memory, or is accessing it the same as accessing global memory? Any help or suggestions would be appreciated.
Thanks.