Maximum number of instructions inside a kernel

Hi everyone.

I have a problem in my kernel and, after investigating, I think I have finally found the cause. Some days ago I wrote a [topic=“107943”]post[/topic] about a problem with my graphics card, which seems to behave strangely after some executions. teju told me that I might be accessing a position that was not malloc'ed. To rule this out I wrote a simple program that guarantees correct memory access, but the problem was still there.

I wondered whether the problem was in the kernel's loop, which is very long. I ran some tests with this program:

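#include <cuda.h>
#include <cutil.h>	// utility header shipped with the CUDA SDK
#include <iostream>

#define BLOCK_X	32
#define BLOCK_Y	16
#define N1 4500000
//#define N1 4700000

using namespace std;

// Every thread runs the same loop Nbucle times.
__global__ void bucles(float * data, float * res, float Nbucle)
{
	// Flatten the 2D thread index; the grid has a single block.
	uint idx = threadIdx.y*blockDim.x + threadIdx.x;
	float acc = data[idx];
	for(uint n=0;n<Nbucle;n++)
	{
		// Loop body missing from the post; placeholder work (assumption).
		acc = acc*0.5f + 1.0f;
	}
	res[idx] = acc;
}

int main()
{
	uint Nbucle = N1;
	uint N = 512;

	float * data = new float[N];
	for(uint i=0;i<N;i++)
	{
		data[i] = i;	// initial values missing from the post (assumption)
	}

	// Device copies of the input and output arrays (calls assumed).
	float * dataDevice;
	cudaMalloc((void**)&dataDevice, N*sizeof(float));
	cudaMemcpy(dataDevice, data, N*sizeof(float), cudaMemcpyHostToDevice);

	float * resDevice;
	cudaMalloc((void**)&resDevice, N*sizeof(float));

	dim3 dimBlock(BLOCK_X,BLOCK_Y);
	dim3 dimGrid(N/(dimBlock.x*dimBlock.y));

	bucles<<<dimGrid,dimBlock>>>(dataDevice, resDevice, Nbucle);

	float * res = new float[N];
	cudaMemcpy(res, resDevice, N*sizeof(float), cudaMemcpyDeviceToHost);

	for(uint i=0;i<N;i++)
	{
		cout<<res[i]<<" ";
		if(i%200==0) cout<<endl;
	}

	cudaFree(dataDevice);
	cudaFree(resDevice);
	delete[] data;
	delete[] res;
	return 0;
}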

The grid has just 1 block and the block size is 512. Every thread loops Nbucle times. I want to find the limit on the number of instructions a thread can execute (if one exists). I tried different values of N1 and roughly bracketed the threshold. For N1=4500000 the results were OK and the graphics hardware responded correctly, but for N1=4700000 the results were all wrong and the screen looked like the attached image.

I didn’t find the exact value where this happens because sometimes the result was wrong even for N1=4500000 (it depends on the execution). So I analyzed my kernel with the CUDA Visual Profiler and saw that the same kernel sometimes executed a different number of instructions on different runs. This [topic=“108354”]post[/topic] explains why. That would explain why N1=4500000 was sometimes right and sometimes wrong: I am at the limit of the maximum instructions per thread (or so I concluded).

So my question is: am I correct? Is there a maximum number of instructions per thread? If so, why?

In the CUDA Programming Guide 2.2 I read something about the maximum size of a kernel: “The limit on kernel size is 2 million PTX instructions”. I understand that this refers to the size of the compiled kernel, not to the number of instructions each thread executes. I assume that nvcc doesn’t unroll my loop (since Nbucle is a variable), so my simple program doesn’t reach 2 million PTX instructions.

Any help would be much appreciated. Thanks.

This looks like a hardware problem to me. I had one GPU (a DDR3 GT9500) which would generate hard driver errors, unspecified launch failures, hose the display, and do all sorts of other odd things while running very computationally intensive kernels. There are no limits of the kind you are speculating about.

That’s what I thought at first, but a friend of mine in the same laboratory has exactly the same problem. Different graphics card, but the same O.S. and CUDA version. Any more ideas?
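One way to tell a kernel that actually failed apart from one that only produced odd numbers is to ask the runtime for its error state after the launch. The sketch below is only an illustration: the kernel `long_loop` and its loop body are hypothetical stand-ins, not the code from the first post. It uses `cudaDeviceSynchronize()`, which exists in newer toolkits; on CUDA 2.x the equivalent call was `cudaThreadSynchronize()`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for a long-running kernel (not the code from the post).
__global__ void long_loop(float *out, unsigned int iterations)
{
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;
    for (unsigned int n = 0; n < iterations; ++n) {
        acc = acc * 0.5f + 1.0f;
    }
    out[idx] = acc;
}

int main()
{
    const unsigned int threads = 512;
    float *out_device = NULL;
    if (cudaMalloc((void **)&out_device, threads * sizeof(float)) != cudaSuccess) {
        printf("cudaMalloc failed\n");
        return 1;
    }

    long_loop<<<1, threads>>>(out_device, 4500000u);

    // Problems with the launch itself (bad configuration etc.) show up here.
    cudaError_t launch_status = cudaGetLastError();
    // Problems during execution show up once the host waits for the kernel.
    cudaError_t run_status = cudaDeviceSynchronize();

    printf("launch: %s\n", cudaGetErrorString(launch_status));
    printf("run:    %s\n", cudaGetErrorString(run_status));

    cudaFree(out_device);
    return (launch_status == cudaSuccess && run_status == cudaSuccess) ? 0 : 1;
}
```

If the second line prints anything other than "no error", the results copied back from the device should not be trusted.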


I also have the same problem: whenever I increase the size of the memory to be allocated on the device, my screen gets garbled and I have to press “ctrl+alt+f3” and then go back to f7, as my X server is on f7.

I have Ubuntu 9.04
card = 9200M
driver 180.XXX

As far as memory is concerned, I ran a simple memory test, but the results before and after the kernel launch were different…

– before running kernel
total mem: 128.405 MB, free: 128.293 MB, used : 0.113 MB
total mem: 128.405 MB, free: 128.293 MB, used : 0.113 MB
– after running kernel
total mem: 255.312 MB, free: 133.688 MB, used : 121.625 MB
error is no error
Processing time: 7073.609863 (ms)
total mem: 255.312 MB, free: 145.688 MB, used : 109.625 MB
total mem: 255.312 MB, free: 145.688 MB, used : 109.625 MB
– end

So maybe it's not allocating the desired memory on the device and we are unintentionally accessing unmalloc'ed space.

??? This is my guess.
If you find the solution, please let me know.


This could explain a lot of things… How can I get the memory state? (total, free, and used) Is there a CUDA command to check it?

This sounds very strange. Why is the total memory value different before and after kernel execution?

CUDA is full of restrictions. The system's maximum properties (like block size etc.) can be found by calling the function cudaGetDeviceProperties (look in the manual).

Maximum instructions per thread… I read that somewhere; it was something like 2 million, if I remember correctly.

The values of these restrictions seem to depend on your NVIDIA graphics chip.

I used to create such screens (like the one attached to the first post) by allocating too much memory on the device. Always made me smile ;)
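For reference, the runtime call is `cudaGetDeviceProperties`, which fills a `cudaDeviceProp` struct. A minimal sketch of reading the per-device limits could look like the following; it assumes device 0 is the card of interest and prints only a handful of the available fields.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int device = 0;  // assumption: query the first device in the system
    cudaDeviceProp prop;
    cudaError_t status = cudaGetDeviceProperties(&prop, device);
    if (status != cudaSuccess) {
        printf("cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(status));
        return 1;
    }

    // A few of the per-device limits reported by the runtime.
    printf("device name:           %s\n", prop.name);
    printf("global memory:         %.3f MB\n", prop.totalGlobalMem / 1024.0 / 1024.0);
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("max block dimensions:  %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("max grid dimensions:   %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return 0;
}
```

Note that none of these fields describes a cap on how many instructions a single thread may execute.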

It is always strongly advised to check the status of ‘cudaMalloc’, through the cudaGetLastError() function. That way, before proceeding to actually work with your kernels, you can be quite sure that you have actually allocated what you wanted to allocate :)
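As a sketch of what such a check might look like: `cudaMalloc` also returns a `cudaError_t` directly, so the status can be tested on the spot. The `CHECK_CUDA` macro below is a hypothetical helper written for this example, not something provided by CUDA.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro (not part of CUDA): print the error and exit.
#define CHECK_CUDA(call)                                          \
    do {                                                          \
        cudaError_t status_ = (call);                             \
        if (status_ != cudaSuccess) {                             \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,    \
                    cudaGetErrorString(status_));                 \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

int main()
{
    const size_t n = 512;
    float *data_device = NULL;

    // Each call is checked before the next one relies on its result.
    CHECK_CUDA(cudaMalloc((void **)&data_device, n * sizeof(float)));
    CHECK_CUDA(cudaMemset(data_device, 0, n * sizeof(float)));
    CHECK_CUDA(cudaFree(data_device));

    printf("allocation of %lu bytes succeeded\n",
           (unsigned long)(n * sizeof(float)));
    return 0;
}
```

Wrapping every runtime call this way means a failed allocation stops the program before any kernel touches the pointer.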

It could be really useful if you can post the code which you used to measure the memory usage. Also, if you want to measure the memory usage in real-time, you could use the function ‘cuMemGetInfo’. BUT, the main issue is that it is a CUDA driver function. So, if you are using CUDA runtime API, then you can use this function in your code to monitor the mem usage. :(

unsigned int free_mem,total_mem, used_mem;
cuMemGetInfo( &free_mem, &total_mem );
used_mem = total_mem-free_mem;
printf("total mem: %0.3f MB, free: %0.3f MB, used : %0.3f MB\n",
((double)total_mem)/1024.0/1024.0,
((double)free_mem )/1024.0/1024.0,
((double)used_mem )/1024.0/1024.0 );
//[end debug]

This is the code which I used to find the memory size…
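Later CUDA toolkits also expose this query through the runtime API as `cudaMemGetInfo`, which takes `size_t` pointers. A rough sketch of the same before/after measurement is shown below; the helper `print_mem` and the 16 MB test allocation are assumptions made for the example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: print total, free and used device memory in MB.
static void print_mem(const char *label)
{
    size_t free_mem = 0, total_mem = 0;
    cudaError_t status = cudaMemGetInfo(&free_mem, &total_mem);
    if (status != cudaSuccess) {
        printf("%s cudaMemGetInfo failed: %s\n", label, cudaGetErrorString(status));
        return;
    }
    const double mb = 1024.0 * 1024.0;
    printf("%s total mem: %0.3f MB, free: %0.3f MB, used : %0.3f MB\n",
           label, total_mem / mb, free_mem / mb, (total_mem - free_mem) / mb);
}

int main()
{
    print_mem("before:");

    // Assumed test allocation of 16 MB, just to make the numbers move.
    float *buf = NULL;
    const size_t bytes = 16u * 1024u * 1024u;
    if (cudaMalloc((void **)&buf, bytes) == cudaSuccess) {
        print_mem("after cudaMalloc:");
        cudaFree(buf);
    } else {
        printf("cudaMalloc of %lu bytes failed\n", (unsigned long)bytes);
    }

    print_mem("after cudaFree:");
    return 0;
}
```

On a card that also drives the display, the free value can change between calls for reasons unrelated to the program.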

From the CUDA reference manual:

So, my guess (though I can’t justify it fully) would be that the GPU from which you obtained the memory usage information is the primary card, which is also used for the display. So there were probably some other tasks being performed by your card while you were doing this experiment.

Thanks everyone for your replies. Now I have some ideas to work on (checking memory usage, state…). I’ll check all these things, do some experiments and, if it still doesn’t work, I’ll post again. Thanks again.