Kernel Question

Hi.
In my algorithm, I have a matrix of size 1000 x 1000.
I am trying to write a kernel where each thread maps to one (i,j) coordinate in the matrix, but I'm getting an error.

My question is: is there a limit on how many threads or blocks can execute one kernel at a time (I'm asking because I heard there is), or am I just giving the kernel too much data to deal with, since it copes fine with a matrix of size 100 x 100?

Thanks for any answers!
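To show the usual way of mapping one thread per (i,j), here is a minimal sketch with a 2D grid. It assumes a row-major float matrix and uses an in-place scaling as a placeholder operation; the kernel and variable names are made up for illustration.

```cuda
#define N 1000

// Placeholder kernel: each thread handles one (i, j) element.
__global__ void scale(float *mat, float factor)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // row
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // column
    if (i < N && j < N)              // guard: the grid may overshoot N
        mat[i * N + j] *= factor;
}

int main()
{
    float *d_mat;
    cudaMalloc(&d_mat, N * N * sizeof(float));

    // 16x16 = 256 threads per block, well under the 1024-thread limit;
    // round the grid up so all 1000 rows/columns are covered
    // (63 blocks of 16 = 1008 threads per dimension).
    dim3 threads(16, 16);
    dim3 blocks((N + threads.x - 1) / threads.x,
                (N + threads.y - 1) / threads.y);
    scale<<<blocks, threads>>>(d_mat, 2.0f);
    cudaDeviceSynchronize();

    cudaFree(d_mat);
    return 0;
}
```

The point of the bounds check is that 1008 is not a multiple of 1000, so the last block in each dimension has threads with nothing to do.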

1024 threads per block, and up to 65535 x 65535 x 65535 blocks in the grid for compute capability 2.0 and higher. How are you launching the kernel?

CUDA Programming Guide is our friend.

1024 threads per block, but is that per dimension, or overall?

For example, can I do:

dim3 threads(1024,1024,1) - so that would be 1024 * 1024 threads in total

or can I only have 1024 threads per block in TOTAL, e.g.:

dim3 threads(100,10,1) - ??

Let's say you define a dim3 threads;

You must obey these rules:

threads.x <= 1024

threads.y <= 1024

threads.z <= 64 and

threads.x * threads.y * threads.z <= 1024
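To make those limits concrete, here are a few example block shapes (a config sketch, not a complete program):

```cuda
dim3 a(1024, 1, 1);    // OK:  1024 * 1 * 1 = 1024 <= 1024
dim3 b(32, 32, 1);     // OK:  32 * 32 * 1  = 1024 <= 1024
dim3 c(100, 10, 1);    // OK:  100 * 10 * 1 = 1000 <= 1024
dim3 d(1024, 1024, 1); // NOT OK: each dimension is legal on its own,
                       // but the product (1048576) exceeds 1024, so the
                       // launch fails with an invalid configuration error.
```

So 1024 is the cap on the product of the three dimensions, not per dimension.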

In order to do the job you can define dim3 threads(1024,1,1); and dim3 blocks(1024,1,1);

Now you launch the kernel with kernel<<<blocks,threads>>>(arguments); This way you have 1024 blocks with 1024 threads per block, 1 million threads in total.

Alternatively, you can use kernel<<<1024,1024>>>(arguments); - the same thing if you use only 1D grids.
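In case it helps, here is a sketch of that 1D launch recovering (i,j) from the flat thread index. It assumes the same 1000 x 1000 row-major float matrix and a placeholder scaling operation; names are made up for illustration.

```cuda
#define N 1000

// 1D version: one flat index per thread, converted to (i, j).
__global__ void scale1d(float *mat, float factor)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // 0 .. 1048575
    if (idx < N * N) {        // 1024*1024 threads, only 1000000 elements
        int i = idx / N;      // row
        int j = idx % N;      // column
        mat[i * N + j] *= factor;
    }
}

int main()
{
    float *d_mat;
    cudaMalloc(&d_mat, N * N * sizeof(float));
    scale1d<<<1024, 1024>>>(d_mat, 2.0f);  // 1024 blocks x 1024 threads
    cudaDeviceSynchronize();
    cudaFree(d_mat);
    return 0;
}
```

Since 1024 * 1024 = 1,048,576 is larger than 1,000,000, the idx < N * N guard is what keeps the extra threads from writing out of bounds.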