Limitation of the number of iterations in CUDA


I am a beginner in CUDA programming and am trying to understand its techniques.
I wrote the very simple CUDA program shown below.
It runs fine when the iter value entered in main is around 10,000,
but it stops when the value is around 1,000,000.
I cannot understand why it behaves this way.
Could someone please explain?
(I am using a 9600M GT on an LG notebook with CUDA 2.3.)
------------------------ my program --------------------------------

#include <stdio.h>
#include <stdlib.h>
#include <cutil.h>

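__global__ void iterative(int *ptr, int n, int iter)
{
    for (int i = 0; i < iter; i++) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;  // rest of this line was cut off in the post; 1-D global index assumed
        if (idx < n) {
            // (loop body not preserved in the post)
        }
    }
}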

int main()
{
    int *dMem, size = 10000, iter;

    printf("input iter:");
    scanf("%d", &iter);
    CUDA_SAFE_CALL(cudaMalloc((void**)&dMem, sizeof(int) * size));
    CUDA_SAFE_CALL(cudaMemset(dMem, 0, sizeof(int) * size));

    int block = size / 32 + 1;
    iterative<<<block, 32>>>(dMem, size, iter);

    return 0;
}



I believe the issue may be that a kernel call (“iterative<<<block, 32>>>(dMem, size, iter);”, in your case) is subject to a run-time limit when the GPU also has to render a display. The limit depends on the operating system under which you are running CUDA.

I can’t recall the exact numbers but, from memory, the limits are about 2-3 seconds in Windows and 5 seconds in Linux.
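One way to check whether such a limit applies to a given GPU is to ask the runtime. The following is a minimal sketch, assuming device 0 is the card in question; hasKernelTimeout is a hypothetical helper name, while kernelExecTimeoutEnabled is a field of cudaDeviceProp.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: returns 1 if the device reports a kernel run-time
// limit, 0 if it does not, and -1 if the query itself failed.
int hasKernelTimeout(int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess)
        return -1;
    return prop.kernelExecTimeoutEnabled ? 1 : 0;
}

int main()
{
    int r = hasKernelTimeout(0);  // device 0 is an assumption
    if (r < 0)
        printf("could not query device 0\n");
    else
        printf("kernel run-time limit on device 0: %s\n", r ? "yes" : "no");
    return 0;
}
```

If this reports "yes", a kernel that runs past the limit is killed, and the next runtime call that waits on it should return cudaErrorLaunchTimeout.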

So, if you are using Linux, you can test whether this is the cause of your issue by stopping the X server. In Ubuntu the command is something like:

sudo /etc/init.d/gdm stop # Replace 'stop' with 'start' when you want to re-start it afterwards!

Let us know how that goes.

Thank you for your explanation.

I am using 32-bit Windows 7.

However, I am developing a program whose kernel has to iterate many times (more than 20,000,000).

I still have no idea how to solve this problem.

As far as I’m aware, you’ll have to break your single kernel call up into many shorter kernel calls to avoid this issue under Windows.

So, something like this:

int itersPerCall = 10000;   // Experiment to see what the largest value you can use is
int nKernelCalls = iter / itersPerCall;   // note: any remainder of iter is dropped

for (int n = 0; n < nKernelCalls; n++) {
	iterative<<<block, 32>>>(dMem, size, itersPerCall);
}

Give that a try and report back?
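For reference, the splitting idea can be sketched as a complete program. This is only an illustration under assumptions: the kernel body is a stand-in (the original one is not shown in the thread), runChunked and launchesNeeded are hypothetical helper names, and the chunk size of 10000 is just a starting value to tune. It uses cudaDeviceSynchronize, which newer toolkits provide; on CUDA 2.3 the equivalent call is cudaThreadSynchronize.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in kernel (assumption): each thread adds 1 to its element `iter` times.
__global__ void iterative(int *ptr, int n, int iter)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        int v = ptr[idx];
        for (int i = 0; i < iter; i++) v += 1;
        ptr[idx] = v;
    }
}

// Number of launches needed to cover `iter` iterations, `chunk` at a time.
int launchesNeeded(int iter, int chunk) { return (iter + chunk - 1) / chunk; }

// Runs `iter` iterations in total, split into launches of at most `chunk`
// iterations each, so that no single launch runs for long.
cudaError_t runChunked(int *dMem, int size, int iter, int chunk)
{
    if (chunk <= 0) return cudaErrorInvalidValue;
    int block = (size + 31) / 32;
    int left = iter;
    while (left > 0) {
        int thisCall = left < chunk ? left : chunk;  // last call takes the remainder
        iterative<<<block, 32>>>(dMem, size, thisCall);
        cudaError_t err = cudaGetLastError();        // launch configuration errors
        if (err == cudaSuccess) err = cudaDeviceSynchronize();  // run-time errors, e.g. a timeout
        if (err != cudaSuccess) return err;
        left -= thisCall;
    }
    return cudaSuccess;
}

int main()
{
    const int size = 10000, iter = 1000000, chunk = 10000;
    int *dMem = 0;
    if (cudaMalloc((void**)&dMem, sizeof(int) * size) != cudaSuccess) return 1;
    cudaMemset(dMem, 0, sizeof(int) * size);

    cudaError_t err = runChunked(dMem, size, iter, chunk);
    printf("%d launches: %s\n", launchesNeeded(iter, chunk), cudaGetErrorString(err));

    cudaFree(dMem);
    return err == cudaSuccess ? 0 : 1;
}
```

Because the data stays in device memory between launches, each call simply continues from where the previous one stopped. Synchronizing after every launch costs a little throughput, but it pins any failure to the chunk that caused it.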