Maximum number of instructions inside a Kernel

nachovall · October 7, 2009, 2:25pm

Hi everyone.

After some investigation, I think I have finally found the cause of a problem in my kernel. Some days ago I posted (topic 107943) about my graphics card behaving strangely after some executions. teju suggested that I might be accessing memory that was never allocated. To rule this out, I wrote a simple program that guarantees correct memory accesses, but the problem was still there.

I wondered whether the problem was the loop in the kernel, which is very long. I ran some tests with this program:

#include <cuda_runtime.h>
#include <iostream>

#define BLOCK_X	32
#define BLOCK_Y	16
#define N1 4500000
//#define N1 4700000

using namespace std;

// Every thread runs the same loop Nbucle times, copying data[n%512] to res[n%512].
__global__ void bucles(float * data, float * res, unsigned int Nbucle)
{
	for(unsigned int n=0;n<Nbucle;n++)
		res[n%512]=data[n%512];
}

int main()
{
	unsigned int Nbucle = N1;	// loop iterations per thread
	unsigned int N = 512;
	float * data = new float[N];
	for(unsigned int i=0;i<N;i++)
		data[i]=3.;

	float * dataDevice;
	cudaMalloc((void**)&dataDevice,sizeof(float)*N);
	cudaMemcpy(dataDevice,data,sizeof(float)*N,cudaMemcpyHostToDevice);
	float * resDevice;
	cudaMalloc((void**)&resDevice,sizeof(float)*N);

	dim3 dimBlock(BLOCK_X,BLOCK_Y);	// 32*16 = 512 threads per block
	dim3 dimGrid(N/(dimBlock.x*dimBlock.y));	// 512/512 = 1 block

	bucles<<<dimGrid,dimBlock>>>(dataDevice,resDevice,Nbucle);

	float * res = new float[N];
	cudaMemcpy(res,resDevice,sizeof(float)*N,cudaMemcpyDeviceToHost);
	cudaFree(dataDevice);
	cudaFree(resDevice);

	for(unsigned int i=0;i<N;i++)
	{
		if(i%200==0) cout<<endl;
		cout<<res[i]<<",";
	}
	cout<<endl;

	delete[] data;	// arrays allocated with new[] need delete[]
	delete[] res;
	return 0;
}

The grid has just one block, and the block size is 512 threads. Every thread loops Nbucle times. I want to find the limit on the number of instructions a thread can execute (if one exists). I tried different values of N1 and roughly bounded the threshold. For N1=4500000 the results were correct and the graphics hardware responded normally, but for N1=4700000 the results were all wrong and the screen looked like the attached image.

I couldn’t find the exact value where this happens, because sometimes the result was wrong even for N1=4500000 (it depends on the run). So I analyzed my kernel with the CUDA Visual Profiler and saw that the same kernel sometimes executed a different number of instructions on different runs. Another post (topic 108354) explains why. I concluded that this is why N1=4500000 was sometimes right and sometimes wrong: I am at the limit of the maximum number of instructions per thread.

So my question is: am I correct? Is there a maximum number of instructions per thread? If so, why?

In the CUDA Programming Guide 2.2 I read something about the maximum size of a kernel: “The limit on kernel size is 2 million PTX instructions”. As I understand it, this refers to the size of the compiled kernel, not to the number of instructions a thread executes. I assume that nvcc doesn’t unroll my loop (since the loop bound is a variable), so my simple program doesn’t reach 2 million PTX instructions.

Any help would be greatly appreciated. Thanks.

avidday · October 7, 2009, 2:40pm

This looks like a hardware problem to me. I had one GPU (a DDR3 GT9500) which would generate hard driver errors and unspecified launch failures, hose the display, and do all sorts of other odd things while running very computationally intensive kernels. There are no limits of the kind you are speculating about.

nachovall · October 8, 2009, 8:05am

That’s what I thought at first, but a friend of mine in the same laboratory has exactly the same problem. He has a different graphics card, but the same OS and CUDA version. Any more ideas?

Thanks.

biebo · October 8, 2009, 1:04pm

I also have the same problem: whenever I increase the size of the memory allocated on the device, my screen gets corrupted, and then I have to press “ctrl+alt+f3” and go back to f7, since my X server is on f7.

i have ubuntu 9.04
card = 9200M
driver 180.XXX

As far as memory is concerned, I ran a simple memory test, but the results before and after the kernel launch were different:

– before running kernel
total mem: 128.405 MB, free: 128.293 MB, used : 0.113 MB
total mem: 128.405 MB, free: 128.293 MB, used : 0.113 MB
– after running kernel
total mem: 255.312 MB, free: 133.688 MB, used : 121.625 MB
error is no error
Processing time: 7073.609863 (ms)
total mem: 255.312 MB, free: 145.688 MB, used : 109.625 MB
total mem: 255.312 MB, free: 145.688 MB, used : 109.625 MB
– end

So maybe it is not allocating the requested memory on the device, and we are unintentionally accessing unallocated space.

This is only my guess.

If you find the solution, please let me know.

thanks

nachovall · October 8, 2009, 1:43pm

This could explain a lot of things… How can I get the memory state? (total, free, and used) Is there a CUDA command to check it?

This sounds very strange. Why is the total memory value different before and after kernel execution?

Snowball_Two · October 8, 2009, 1:55pm

CUDA is full of restrictions. A device’s maximum properties (such as block size, etc.) can be found by calling the function cudaGetDeviceProperties, which fills a cudaDeviceProp struct (look in the manual).
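As a rough illustration of the query mentioned above, the sketch below prints a few per-device limits with the runtime call cudaGetDeviceProperties. The helper name printDeviceLimits is hypothetical, and the extra cudaDeviceGetAttribute query assumes a toolkit newer than the CUDA 2.x releases discussed in this thread.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative helper (hypothetical name): print a few limits of every
// CUDA device. Returns the number of devices, or 0 if the query fails.
int printDeviceLimits()
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess)
        return 0;

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess)
            continue;

        printf("Device %d: %s\n", dev, prop.name);
        printf("  global memory:         %.1f MB\n",
               (double)prop.totalGlobalMem / (1024.0 * 1024.0));
        printf("  max threads per block: %d\n", prop.maxThreadsPerBlock);
        printf("  max block dimensions:  %d x %d x %d\n",
               prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
        printf("  max grid dimensions:   %d x %d x %d\n",
               prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);

        // Separate attribute query: non-zero when a run time limit applies
        // to kernels on this device.
        int timeout = 0;
        if (cudaDeviceGetAttribute(&timeout, cudaDevAttrKernelExecTimeout, dev) == cudaSuccess)
            printf("  kernel run time limit: %s\n", timeout ? "yes" : "no");
    }
    return count;
}
```

Calling printDeviceLimits() once at program start is enough to see which limits apply to the card in use; none of the fields printed here is a per-thread instruction count.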

Maximum instructions per thread… I read about that somewhere; I remember something like 2 million, if I remember correctly.

The values of these restrictions seem to depend on your NVIDIA graphics chip.

I used to create screens like the one attached to the first post by allocating too much memory on the device. It always made me smile ;)

_teju · October 8, 2009, 3:09pm

It is strongly advised to always check the status of each ‘cudaMalloc’ call, either through its return value or through the cudaGetLastError() function. That way, before proceeding to actually work with your kernels, you can be sure that you have actually allocated what you wanted to allocate :)
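A minimal sketch of that advice, assuming a reasonably recent toolkit: cudaMalloc returns a cudaError_t itself, so its status can be tested directly, while cudaGetLastError() is what reports a failed kernel launch. The names checkCuda, copyKernel and runChecked are hypothetical and not from this thread; cudaDeviceSynchronize() is the current name for what CUDA 2.x called cudaThreadSynchronize().

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: report a failed runtime call on stderr.
// Returns true when the call succeeded.
bool checkCuda(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// Trivial kernel used only to have something to launch.
__global__ void copyKernel(const float *data, float *res, unsigned int n)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        res[i] = data[i];
}

// Allocate, launch and synchronize, checking every step (expects n > 0).
bool runChecked(unsigned int n)
{
    float *data = NULL, *res = NULL;
    bool ok = checkCuda(cudaMalloc((void**)&data, n * sizeof(float)), "cudaMalloc(data)")
           && checkCuda(cudaMalloc((void**)&res,  n * sizeof(float)), "cudaMalloc(res)");
    if (ok)
        ok = checkCuda(cudaMemset(data, 0, n * sizeof(float)), "cudaMemset");
    if (ok) {
        copyKernel<<<(n + 255) / 256, 256>>>(data, res, n);
        ok = checkCuda(cudaGetLastError(), "kernel launch")           // bad configuration etc.
          && checkCuda(cudaDeviceSynchronize(), "kernel execution");  // errors raised while running
    }
    cudaFree(data);  // cudaFree(NULL) performs no operation
    cudaFree(res);
    return ok;
}
```

Calling runChecked(512) returns true only if every step succeeded; otherwise the first failing step is named on stderr, which separates an allocation failure from a launch or execution failure.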

@nachovall:
It would be really useful if you could post the code you used to measure the memory usage. Also, if you want to measure the memory usage in real time, you could use the function ‘cuMemGetInfo’. BUT, the main issue is that it is a CUDA driver API function. So, if you are using the CUDA runtime API, then you cannot simply use this function in your code to monitor the memory usage. :(
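For code written against the runtime API, CUDA releases newer than the one discussed here provide cudaMemGetInfo, which reports the same two numbers (free and total bytes) as the driver call. The following is only a sketch under that assumption; printMemInfo and toMB are illustrative names.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Convert a byte count to megabytes (1 MB = 1024 * 1024 bytes here).
double toMB(size_t bytes)
{
    return (double)bytes / 1024.0 / 1024.0;
}

// Print total, free and used device memory for the current device.
// Returns false if the query itself fails.
bool printMemInfo(const char *label)
{
    size_t freeBytes = 0, totalBytes = 0;
    cudaError_t err = cudaMemGetInfo(&freeBytes, &totalBytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemGetInfo failed: %s\n", cudaGetErrorString(err));
        return false;
    }
    printf("%s total mem: %0.3f MB, free: %0.3f MB, used: %0.3f MB\n",
           label, toMB(totalBytes), toMB(freeBytes), toMB(totalBytes - freeBytes));
    return true;
}
```

Calling printMemInfo("before") ahead of the allocations and printMemInfo("after") once the kernel has finished gives a before/after comparison. On a card that also drives the display, other processes can change the free figure between the two calls.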

biebo · October 9, 2009, 5:46pm

//[debug]
// Note: CUDA 2.x declares cuMemGetInfo with unsigned int* arguments;
// newer toolkits use size_t* instead.
unsigned int free_mem,total_mem, used_mem;
cuMemGetInfo( &free_mem, &total_mem );
used_mem = total_mem-free_mem;
printf("total mem: %0.3f MB, free: %0.3f MB, used : %0.3f MB\n",
((double)total_mem)/1024.0/1024.0,
((double)free_mem )/1024.0/1024.0,
((double)used_mem )/1024.0/1024.0 );
//[end debug]

This is the code I used to find the memory size.

_teju · October 10, 2009, 3:37am

From the CUDA reference manual:

So, my guess (though I can’t justify it fully) would be that the GPU from which you obtained the memory usage information is the primary card, which is also used for the display. So the card was probably performing some other tasks while you were doing this experiment.

nachovall · October 13, 2009, 7:52am

Thanks everyone for your replies. Now I have some ideas to work on (checking memory usage, state…). I’ll check all these things, do some experiments, and if it still doesn’t work, I’ll post again. Thanks again.
