kernell calls inside a loop is it ok?

So is it OK to make a loop and call a kernel in it like

for(int i = 0 ; i < n ; i++)

{

DoSomeWork<<<dim3,dim3>>>(someargs,i);

}

I ask this by 2 reasons :

  1. performance

  2. I try to do the same in my prog right now and it fails in a very strange way.

So the situation is: I have a kernel that accepts 7 arguments, all the arguments are type of int except one which is resulting array type of int*

I allocate a lot of memory for that array and then start a loop in which I launch my kernel (I’m using loop variable as an argument for the kernel)

And what is REALLY STRANGE - the first kernel lauch is successful, all the others result in “invalid argument”, though I don’t even reallocate or anyhow change pointer for resulting array. Besides I can launch empty kernel and the result will be the same. By the way I’ve made a similar test app and it works =|

Can anyone help?

This is what doesn’t work (all libs are included and so on)

//I've changed some defined values and variables for exact numbers

__global__ void Test(int* points)

{

	//it can be either completely empty - anyhow doesn't work

	points[blockIdx.x + threadIdx.x] = 1;

}

main( )

{

	...

	int* d_points;

	cudaMalloc( (void**)&d_points , 1024*1024*10);

	...

	for (int i = 0 ; i < 1024; i++)

	{

		Test<<<1,1>>>(d_points);

		CUDA_CHECK_ERROR(cudaThreadSynchronize());

		

		//was made to check results after the first iteration

		cudaMemcpy(points, d_points, 1024*1024*10, cudaMemcpyDeviceToHost);

		//because memcpy seems to be async operation

		CUDA_CHECK_ERROR(cudaThreadSynchronize());

		cudaError_t err = cudaGetLastError();

		if( cudaSuccess != err) 

		{

			fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudaGetErrorString( err) );

		}

	}

	...

}

This is simple which works

__global__ void DoCuda(int* points,int offset)

{

	points[threadIdx.x + offset*5] = threadIdx.x;

}

int main(int argc, char* argv[])

{

	int* h_points = (int*)malloc(500);

	int* d_points = NULL;

	cudaMalloc( (void**)&d_points,500 );

	for(int i = 0 ; i < 25 ; i++)

	{

		DoCuda<<<1,5>>>(d_points,i);

		cudaError_t err = cudaGetLastError();

		if (err != cudaSuccess)

		{

			printf("%s",cudaGetErrorString(err));

		}

	}

	cudaMemcpy( h_points, d_points, 500, cudaMemcpyDeviceToHost );

	cudaError_t err = cudaGetLastError();

	for (int i = 0 ; i < 125 ; i++)

	{

		h_points[i]++;

	}

	return 0;

}

Yes, that would be ok.

After replacing [font=“Courier New”]500[/font] with [font=“Courier New”]125*sizeof(int)[/font] and [font=“Courier New”]printf(“%s”,err);[/font] with [font=“Courier New”]printf(“%s”,cudaGetErrorString(err));[/font] and adding [font=“Courier New”]#include <stdio.h>[/font] on top your program ran fine for me.

sizeof(int) is 4 bytes so 125*sizeof(int) is the same as 500…(yes I could use sizeof(int) to make it more clear but 500 is shorter than sizeof(int) and will be good for just a simple test app)

About the second - I’m watching until the first error in debugger, but that doesn’t really matter. I’ll change it not to confuse people.

All libs are included, it compiles and runs, the matter are described problems.

OK,
I’ve found error. It’s interesting - it was related with the texture I’ve tried to allocate.
So I’ve tried to allocate texture as big as half memory of my gpu memory (256 mb).
And the maximum texture size you can find here http://forums.nvidia.com/index.php?showtopic=30934 , it’s 2^27 * sizeof(texturedata)
The most interesting fact is that there’s no error allocating, binding and so on and even the first kernel run is ok, the error appears just on the second kernel call.