Maximum size of a CUDA kernel


I am designing a CUDA kernel to compute a median filter.

My question: what is the maximum number of lines of code a CUDA kernel can contain? Are there any restrictions? The examples I have seen are generally very small.
In a median filter I have to store the 9 values of the 3x3 neighbourhood of every element, sort them into ascending order, and then take the 5th element of the sorted array as the output value.

I am implementing all of this in one kernel. Please suggest or share an alternative approach if there is one. Thanks.

The limit is 2 million instructions per kernel (that is the figure for compute capability 1.x; devices of compute capability 2.x and later allow 512 million), so you can forget about it as a practical limitation. There are a few examples in the SDK which you probably want to have a look at for ideas (FDTD3d and DCT8x8, off the top of my head). They show how to use shared memory to improve read performance: have each block read a 2D tile into shared memory, then have each thread extract its 3x3 nearest neighbours, do an in-register sort to find the median, and write that value out.
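To make the tile-and-sort idea concrete, here is a rough sketch, not a tuned implementation. The names `median9`, `median3x3` and `TILE`, the `float` pixel type, the 16x16 block size and the clamp-to-edge border handling are all my own assumptions, not something taken from the SDK samples. The sort is a plain partial selection sort with fixed loop bounds, chosen because it is easy to check and the compiler can unroll it; a smaller sorting network would do fewer compare-exchanges. The guards at the top only exist so the sorting helper can also be compiled and checked as ordinary C++ on the host.

```cpp
// Build with nvcc for the kernel; the guards let median9 compile as plain C++ too.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Compare-exchange: afterwards a <= b.
HD inline void cswap(float &a, float &b) {
    float lo = a < b ? a : b;
    float hi = a < b ? b : a;
    a = lo;
    b = hi;
}

// Median of 9 values (sorts the first five slots of v in place).
// After pass i, v[i] holds the smallest of v[i..8], so after five
// passes v[4] is the 5th smallest value, i.e. the median.
HD inline float median9(float v[9]) {
    for (int i = 0; i < 5; ++i)
        for (int j = i + 1; j < 9; ++j)
            cswap(v[i], v[j]);
    return v[4];
}

#ifdef __CUDACC__
#define TILE 16  // assumed block edge; launch with blockDim = (TILE, TILE)

__global__ void median3x3(const float *in, float *out, int width, int height) {
    // Tile plus a one-pixel halo on every side.
    __shared__ float tile[TILE + 2][TILE + 2];

    // Image coordinates of tile[0][0].
    int x0 = blockIdx.x * TILE - 1;
    int y0 = blockIdx.y * TILE - 1;

    // The TILE*TILE threads cooperatively load all (TILE+2)^2 entries.
    // Out-of-range coordinates are clamped to the image edge (an assumption).
    for (int i = threadIdx.y * TILE + threadIdx.x;
         i < (TILE + 2) * (TILE + 2); i += TILE * TILE) {
        int ty = i / (TILE + 2);
        int tx = i % (TILE + 2);
        int gx = min(max(x0 + tx, 0), width - 1);
        int gy = min(max(y0 + ty, 0), height - 1);
        tile[ty][tx] = in[gy * width + gx];
    }
    __syncthreads();

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x >= width || y >= height) return;

    // Copy the 3x3 neighbourhood into a small per-thread array and sort it.
    float v[9];
    for (int dy = 0; dy < 3; ++dy)
        for (int dx = 0; dx < 3; ++dx)
            v[dy * 3 + dx] = tile[threadIdx.y + dy][threadIdx.x + dx];

    out[y * width + x] = median9(v);
}
#endif
```

A launch would look something like `median3x3<<<dim3((width + TILE - 1) / TILE, (height + TILE - 1) / TILE), dim3(TILE, TILE)>>>(d_in, d_out, width, height);` where `d_in` and `d_out` are hypothetical device buffers of `width * height` floats. The whole thing is a few dozen lines, nowhere near any instruction limit.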

The information is in the programming guide. In general, the limit is large and you will probably never hit it, partly because compiling a kernel that big would take a very, very long time.