Limit on kernel block / grid numbers?

I’m working on a couple of fairly simple data conversion kernels in my code. Both work fine for small data sets, but when I scale up, the output gets badly corrupted.

For example: I’m operating on memory that is num_samples in size (we’ll call the data sets data1 and data2).

Each “thread” of the function is independent of the others, so I don’t really care what order they get executed in, as long as the input (data1) and output (data2) are in the same order.

I call the function like this:

my_function <<< (num_samples/256) , 256, 0 >>> ( data1, data2 );

If num_samples <= 8388608 everything works fine; when num_samples > 8388608 it doesn’t work (admittedly I’ve only tried 16M, not 8M+1). I don’t get any errors; the data is just wrong.

I’ve been through the documentation repeatedly this week and haven’t found anything that mentions a limit I’d be running into here. (The kernel time for 16M should be ~30 ms at most.)


The maximum size of each grid dimension is 2^16-1 = 65535.
If you are using a 1D grid and 256 threads per block, you can only process
65535*256 = 16776960 elements with a 1:1 mapping between element position and thread ID.
16M (16777216) elements would need 65536 blocks, one more than the limit.
Look at the Black-Scholes example for a way to handle arrays of arbitrary size.
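
As a rough illustration of one way to handle arbitrary sizes (not necessarily how the Black-Scholes example does it), here is a minimal sketch in which each thread loops over the array with a stride equal to the total number of threads in the grid. The extra num_samples parameter, the placeholder "multiply by 2" conversion, and the float element type are assumptions made for the example, since the original kernel body isn't shown. The sketch also checks cudaGetLastError() after the launch; a launch with an invalid grid size does not report anything unless you ask for the error, which would explain the wrong data with no visible failure. Newer devices allow a larger x dimension than 65535, but the same pattern applies.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical stand-in for the conversion kernel from the question.
// Each thread starts at its global index and advances by the total number
// of threads in the grid, so a bounded grid covers any num_samples.
__global__ void my_function(const float *data1, float *data2, size_t num_samples)
{
    size_t stride = (size_t)blockDim.x * gridDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
         i < num_samples; i += stride) {
        data2[i] = 2.0f * data1[i];  // placeholder conversion (assumption)
    }
}

int main()
{
    const size_t num_samples = (size_t)1 << 24;  // 16M, the size that failed
    const int threads = 256;
    const size_t max_blocks = 65535;             // per-dimension limit from the answer

    // Round up so a size that is not a multiple of 256 is still covered,
    // then clamp to the grid limit; the loop in the kernel handles the rest.
    size_t blocks = (num_samples + threads - 1) / threads;
    if (blocks > max_blocks) blocks = max_blocks;

    const size_t bytes = num_samples * sizeof(float);
    float *h_in = (float *)malloc(bytes);
    float *h_out = (float *)malloc(bytes);
    if (!h_in || !h_out) { fprintf(stderr, "host allocation failed\n"); return 1; }
    for (size_t i = 0; i < num_samples; ++i) h_in[i] = (float)(i % 1024);

    float *data1 = NULL, *data2 = NULL;
    if (cudaMalloc((void **)&data1, bytes) != cudaSuccess ||
        cudaMalloc((void **)&data2, bytes) != cudaSuccess) {
        fprintf(stderr, "device allocation failed\n");
        return 1;
    }
    cudaMemcpy(data1, h_in, bytes, cudaMemcpyHostToDevice);

    my_function<<<(unsigned int)blocks, threads, 0>>>(data1, data2, num_samples);

    // Launch errors (such as an oversized grid) only show up if queried.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaMemcpy(h_out, data2, bytes, cudaMemcpyDeviceToHost);

    // Verify every element, including those past the 65535*256 boundary.
    size_t mismatches = 0;
    for (size_t i = 0; i < num_samples; ++i)
        if (h_out[i] != 2.0f * h_in[i]) ++mismatches;
    printf("mismatches: %zu\n", mismatches);

    cudaFree(data1);
    cudaFree(data2);
    free(h_in);
    free(h_out);
    return mismatches == 0 ? 0 : 1;
}
```

The order of input and output is preserved because each element i is still read from data1[i] and written to data2[i]; only the assignment of elements to threads changes.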

Ahh, that would be why it’s not working. Thanks, I didn’t see that documented specifically.