"Inline" kernels?

Hi All,

I was wondering if anyone at Nvidia has looked (or will look) at a change to the syntax for kernel calls in CUDA.

EXAMPLE WITH CURRENT SYNTAX:

__global__ void map(int * foo)
{
   int i = threadIdx.x + blockIdx.x * blockDim.x;
   foo[i] = func(foo[i]);
}

int caller()
{
   ... // set up device foo.
   int threads_per_tile = ...;
   int tiles = ...;
   map<<<tiles, threads_per_tile>>>(foo);
   cudaDeviceSynchronize();
   cudaError_t rv = cudaGetLastError();
   ... // access foo results.
}

EXAMPLE WITH PROPOSED SYNTAX:

int caller()
{
   ... // set up device foo.
   int threads_per_tile = ...;
   int tiles = ...;
   do_all(b = 0..tiles, t = 0..threads_per_tile)
   {
      int i = t + threads_per_tile * b;
      foo[i] = func(foo[i]);
   }
   cudaDeviceSynchronize();
   cudaError_t rv = cudaGetLastError();
   ... // access foo results.
}

I would like to see an “inline” version of a kernel call, where the body of the “do_all” statement is a compound statement. I think this is much easier to read than forcing kernel bodies to be abstracted into a separate procedure. Imagine if you had to write “for” or “while” statements with the body of the statement forced to be a procedure call. Not many would like to program in that language.
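
For what it’s worth, here is a rough sketch of what such a do_all might expand to with today’s tools, using a templated kernel and a hand-written functor. This is purely illustrative, not something CUDA offers; the names do_all_kernel and MapBody are made up:

   // Hypothetical expansion of do_all using only existing CUDA features:
   // the statement body becomes a functor with a __device__ operator(),
   // and a generic templated kernel calls it once per thread.
   template <typename Body>
   __global__ void do_all_kernel(Body body)
   {
      body(blockIdx.x, threadIdx.x);
   }

   struct MapBody
   {
      int * foo;
      int threads_per_tile;
      __device__ void operator()(int b, int t) const
      {
         int i = t + threads_per_tile * b;
         foo[i] = func(foo[i]);   // assumes func is a __device__ function
      }
   };

   // In caller():
   //    MapBody body = { foo, threads_per_tile };
   //    do_all_kernel<<<tiles, threads_per_tile>>>(body);

That works, but the body still ends up far away from the call site, which is exactly what I would like to avoid.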

Ken

Interesting proposal. I think it could look nice when the kernel is really short.

I’m not that familiar with the C preprocessor, but I’d guess it could help you achieve what you want. If that guess is wrong, you’ll just have to write your own preprocessor and use it as a wrapper around nvcc.
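
If the preprocessor route works at all, I imagine it would look roughly like this (an untested sketch; the DO_ALL name is made up, and it only handles bodies without top-level commas):

   /* Hypothetical macro: pastes the body into a generated __global__ function.
      func would still have to be a __device__ function. */
   #define DO_ALL(NAME, BODY)                            \
      __global__ void NAME(int * foo)                    \
      {                                                  \
         int i = threadIdx.x + blockIdx.x * blockDim.x;  \
         BODY;                                           \
      }

   DO_ALL(map_kernel, foo[i] = func(foo[i]))

   /* later: map_kernel<<<tiles, threads_per_tile>>>(foo); */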

You might want to have a look at PGI’s accelerator compilers or hiCUDA.

You might also want to have a look at a technology called HMPP, presented here.

Hi All,

Thanks for the info. While these technologies seem to implement something like my do_all statement, there are a number of things I don’t like about them:

  • hiCUDA and PGI Acc C are based on C99, an ancient language by today’s standards: no classes, no templates, and so on.

  • #pragma directives are used instead of a more direct, syntactically cleaner C++ language extension. CUDA already extends C++ with the chevron syntax for kernel calls.

  • The tile id variables (e.g., threadIdx, blockIdx, …) are not supported, at least in hiCUDA and PGI Acc C. The do_all statement I suggest keeps the tile id variables, just like existing kernels.

I am hoping to see NVCC/CUDA itself implement inlined kernels. But OpenCL doesn’t have inlined kernels either, so I probably won’t see this language enhancement in my lifetime, unless I implement a compiler on top of NVCC to do the translation.

Ken

Hi,
I understand your remark about using pragmas, and I remember thinking the same at first, especially once you realize that, since they are going to build a pre-compiler anyway, it doesn’t really matter to them whether it reads pragmas or some other purpose-designed syntactic sugar.
But at the same time, there are quite a few advantages to using pragmas, because the code you produce is semantically equivalent if you remove the pragmas. That means you don’t need to know anything about GPGPU to read and understand the code, and, more importantly, you can still compile the code with a normal C/C++ or Fortran compiler.
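
For example, something roughly in the spirit of the PGI accelerator directives (the exact directive spelling may differ) could look like this; strip the pragma and you are left with an ordinary C loop that any compiler accepts:

   /* Illustrative sketch only; the directive name approximates the PGI accelerator model. */
   void scale(int n, float * restrict foo)
   {
   #pragma acc region
      {
         for (int i = 0; i < n; ++i)
            foo[i] = 2.0f * foo[i];
      }
   }
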
Concerning the rather low programming level (C99…), you have to keep in mind that these technologies rely heavily on CUDA, so any restriction at the CUDA level has to be enforced at the higher level as well. CUDA keeps allowing more and more high-level syntax, and those changes take time to be reflected in these technologies.

Regards,

Guillaume
