Cuda for Gimp

Hi there,

For people who might be interested in writing a Cuda-enabled Gimp plug-in, here's what I've been able to do as a base. Hope it might be useful.

Being new to Gimp plug-ins and Cuda, I welcome improvements and comments (there are surely mistakes ;-) ).

It is divided into two files: the first retrieves the raw image data from Gimp, calls the second, and puts the computed values back into Gimp.

To compile, I used the following command :

nvcc -o cudatest --host-compilation C `pkg-config --cflags --libs gimp-2.0`

(the plug-in just inverts the current image)

#include <libgimp/gimp.h>

void kernel_render(unsigned char *buffer,int width,int height,int channels);

static void query(void) {

  static GimpParamDef args[]={
    {GIMP_PDB_INT32   ,"run-mode","Run mode"      },
    {GIMP_PDB_IMAGE   ,"image"   ,"Input image"   },
    {GIMP_PDB_DRAWABLE,"drawable","Input drawable"}
  };

  // The procedure name, date, image types and menu path were lost from
  // the original post; the values used here are placeholders
  gimp_install_procedure("plug-in-cuda-test",
                         "Fred's Cuda test plug-in",
                         "Tests the integration of Cuda in a Gimp plug-in",
                         "Frédéric BISSON",
                         "Copyright Frédéric BISSON",
                         "2008",
                         "_Cuda test...",
                         "RGB*",
                         GIMP_PLUGIN,
                         G_N_ELEMENTS (args), 0,
                         args, NULL);

  gimp_plugin_menu_register("plug-in-cuda-test","<Image>/Filters");
}
static void render(GimpDrawable *drawable) {

  gint         channels;
  gint         x1,y1,x2,y2;
  GimpPixelRgn src,dst;
  guchar       *buffer;
  gint         width,height;

  // Part of this function was lost from the original post; the missing
  // lines are filled in the standard libgimp way
  gimp_drawable_mask_bounds(drawable->drawable_id,&x1,&y1,&x2,&y2);
  channels=gimp_drawable_bpp(drawable->drawable_id);

  width =x2-x1;
  height=y2-y1;

  gimp_pixel_rgn_init(&src,drawable,x1,y1,width,height,FALSE,FALSE);
  gimp_pixel_rgn_init(&dst,drawable,x1,y1,width,height,TRUE ,TRUE );

  buffer=g_new(guchar,width*height*channels);
  gimp_pixel_rgn_get_rect(&src,buffer,x1,y1,width,height);

  kernel_render(buffer,width,height,channels);

  gimp_pixel_rgn_set_rect(&dst,buffer,x1,y1,width,height);
  g_free(buffer);

  gimp_drawable_flush(drawable);
  gimp_drawable_merge_shadow(drawable->drawable_id,TRUE);
  gimp_drawable_update(drawable->drawable_id,x1,y1,width,height);
}

static void run(const gchar *name,gint nparams,const GimpParam *param,gint *nreturn_vals,GimpParam **return_vals) {

  static GimpParam  values[1];
  GimpPDBStatusType status = GIMP_PDB_SUCCESS;
  GimpRunMode       run_mode;
  GimpDrawable      *drawable;

  // Setting mandatory output values
  *nreturn_vals=1;
  *return_vals =values;

  values[0].type         =GIMP_PDB_STATUS;
  values[0].data.d_status=status;

  // Getting run_mode - we won't display a dialog if we are in NONINTERACTIVE mode
  run_mode=param[0].data.d_int32;

  if(run_mode!=GIMP_RUN_NONINTERACTIVE)
    g_message("Cuda test in progress...\n");

  // Get specified drawable
  drawable=gimp_drawable_get(param[2].data.d_drawable);

  render(drawable);

  gimp_displays_flush();
  gimp_drawable_detach(drawable);
}

GimpPlugInInfo PLUG_IN_INFO={
  NULL,  // init_proc
  NULL,  // quit_proc
  query, // query_proc
  run    // run_proc
};

MAIN()

#include <stdio.h>

// Each thread inverts a band of gridDim.x consecutive rows; the loop body
// was lost from the original post and is reconstructed to match the
// described behaviour (inverting the image)
__global__ void doProcessPicture(unsigned char *src,unsigned char *dst,int width,int height,int channels) {

  int i,j;
  int base=threadIdx.x*width*channels*gridDim.x;
  int offset;

  for(j=0;j<gridDim.x;j++) {
    for(i=0;i<width*channels;i++) {
      offset=base+j*width*channels+i;
      dst[offset]=255-src[offset];
    }
  }
}

void kernel_render(unsigned char *buffer,int width,int height,int channels) {

  unsigned char *devSrc;
  unsigned char *devDst;
  int bufferSize;
  int nbIter;
  int nbThread;

  // nbThread=192 comes from the discussion below; the launch shape is
  // reconstructed to be consistent with the kernel's indexing
  nbThread  =192;
  nbIter    =height/nbThread;
  bufferSize=width*height*channels*sizeof(unsigned char);

  // Allocate two buffers on the GPU
  cudaMalloc((void **)&devSrc,bufferSize);
  cudaMalloc((void **)&devDst,bufferSize);

  // Copy our buffer into the GPU input buffer
  cudaMemcpy(devSrc,buffer,bufferSize,cudaMemcpyHostToDevice);

  // Run the GPU routine
  doProcessPicture<<<nbIter,nbThread>>>(devSrc,devDst,width,height,channels);

  // Retrieve the GPU output buffer into our buffer
  cudaMemcpy(buffer,devDst,bufferSize,cudaMemcpyDeviceToHost);

  // Free allocated GPU buffers
  cudaFree(devSrc);
  cudaFree(devDst);
}

Thanks for posting this, I think there’s a lot of potential for accelerating GIMP filters using the GPU.

However, I don’t quite understand your code - it looks like you’re using a thread per row in the image. Wouldn’t it make more sense to use a thread per pixel (at least for simple image processing tasks)?

That’s probably because I didn’t understand Cuda and GPU programming very well.

This dummy plug-in assumes it can launch 192 threads (nbThread=192;) concurrently. My reasoning was the following:

  • the GTX280 has 240 processing units, 192 for the GTX260,

  • so it should be able to run up to 240 threads concurrently, 192 for the GTX260,

  • I just divide the image height by 192 (so it can run on either the GTX260 or the GTX280),

  • then each part can be processed concurrently.

In fact, if the image has 1920 rows, each thread will work on 10 rows.

The images I tried this plug-in on are 2048x1536 RGB images. That's more than 3 million pixels. I can't possibly launch 3 million threads (???) at the same time.

I probably missed important points in Cuda development.

Maybe I should try a tiled approach to the problem? What do you think?

At the moment, the code sends everything to the GPU, computes the result (one call to the kernel function) and gets it back. It might be more efficient to send the image tile by tile to the GPU while it computes the previous tile's result.
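For what it could look like: here is a rough, untested sketch of the tile-by-tile overlap using two CUDA streams, assuming a linear doProcessPicture(src, dst, num_int) kernel; nbStripes, stripeBytes and stream are names I made up, and buffer would have to be allocated with cudaMallocHost (pinned memory) for the asynchronous copies to actually overlap with the kernel.

```cuda
/* Hypothetical sketch: split the image into horizontal stripes and use two
   CUDA streams so the copy of one stripe can overlap the kernel working on
   the previous one. Assumes width*height*channels divides evenly. */
int nbStripes    = 8;                                      /* illustrative */
int stripeBytes  = width * height * channels / nbStripes;
int stripeBlocks = (stripeBytes + 255) / 256;

cudaStream_t stream[2];
cudaStreamCreate(&stream[0]);
cudaStreamCreate(&stream[1]);

for (int s = 0; s < nbStripes; s++) {
    cudaStream_t st = stream[s % 2];
    size_t off = (size_t) s * stripeBytes;

    cudaMemcpyAsync(devSrc + off, buffer + off, stripeBytes,
                    cudaMemcpyHostToDevice, st);
    doProcessPicture<<<stripeBlocks, 256, 0, st>>>(devSrc + off,
                                                   devDst + off,
                                                   stripeBytes);
    cudaMemcpyAsync(buffer + off, devDst + off, stripeBytes,
                    cudaMemcpyDeviceToHost, st);
}

cudaStreamSynchronize(stream[0]);
cudaStreamSynchronize(stream[1]);
```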

So many questions…

Actually, the GTX280 can run 30k threads simultaneously: each multiprocessor can keep 1024 threads in flight, and there are 30 multiprocessors (24 for the GTX260). Apart from that, it is actually much better for performance if you request many more threads than that; they get scheduled one after another as needed.

So the easiest thing to do would be:

  • ask for 256 threads per block (to have better performance on older hardware),

  • ask for ceil(num_ints/256) blocks (num_ints = num_pixels * num_channels).

Your code will look like this:

doProcessPicture<<<ceil(num_pixels * num_channels/256.0f), 256>>>(devSrc,devDst,num_pixels * num_channels);

__global__ void doProcessPicture(unsigned char *src,unsigned char *dst,int num_int) {

  unsigned int int_index = blockIdx.x * blockDim.x + threadIdx.x;

  if (int_index < num_int)
    dst[int_index]=255-src[int_index];
}


Ok!

I think I get it.

So, does the following setup take advantage of the architecture?

dim3 dimBlock(12,12,channels);

dim3 dimGrid(width/12,height/12);



__global__ void doProcessPicture(unsigned char *src,unsigned char *dst,int num_int) {

  unsigned int x=blockIdx.x*blockDim.x+threadIdx.x;
  unsigned int y=blockIdx.y*blockDim.y+threadIdx.y;
  unsigned int c=threadIdx.z;
  // The body was lost from the original post; a row-major index and the
  // inversion would look like this
  unsigned int index=(y*gridDim.x*blockDim.x+x)*blockDim.z+c;

  if (index<num_int)
    dst[index]=255-src[index];
}

The number of threads per block would be 432 (with channels=3), which remains below 512, the limit for the GTX280.

deviceQuery from the Cuda SDK gives me 65535x65535 as the limits for grid sizes. Is there a limit on the number of blocks such a grid can contain (apart from memory)?

Thanks for the answers

It is actually much easier to treat your matrix as a linear array; there is no need for x and y values in this case. If you were doing fancier things, then I think you would keep your block 2D like your grid, and let each thread work on all channels of a pixel. Then it is also very much advisable to make your block size 16x16, so that it is more natural to coalesce memory accesses.

In this case, the following is by far the easiest (and best-performing) option:

dim3 dimBlock(256,1,1);

dim3 dimGrid(ceil(width*height*channels/256.0f),1,1);



__global__ void doProcessPicture(unsigned char *src,unsigned char *dst,int num_int) {

  unsigned int index=blockIdx.x*blockDim.x+threadIdx.x;

  if (index<num_int)
    dst[index]=255-src[index];
}

Thanks!

Post it and I wouldn't mind trying it.

I have two 8800 GTS G92s.