Pointer arithmetic in host code

I’d like to do something like this:

float* vmem_in;
float* vmem_out;

cudaMalloc( (void**)&vmem_in,  width*height*sizeof(float) );
cudaMalloc( (void**)&vmem_out, width*height*sizeof(float) );

// Just to illustrate
dim3 grid, block;
block.x = 32;
block.y = 16;
grid.x  = width/32;
grid.y  = height/16;

// mykernel does per-element computation
mykernel<<<grid,block,shmsz>>>( vmem_in, vmem_out, width );

// Here is the interesting part
grid.x  = 1;
block.x = width%32;   // assumes width is not a multiple of 32; otherwise skip this launch
mykernel<<<grid,block,shmsz>>>( vmem_in + 32*(width/32), vmem_out + 32*(width/32), width );

It is useful when you cannot find a block size that evenly covers the whole matrix (other than 1x1) and you don’t want to add a shift in the kernel code (which implies not only a performance loss, but also more registers per thread). I don’t know enough architectural details to tell whether this is even possible. It does not seem to work right now, so what I’d like to know is whether it is technically possible and whether it will be implemented in the (near) future.

Thank you!

You should be able to do it (where is gpu_mem defined?).
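Device pointers returned by cudaMalloc behave like ordinary addresses, so adding an offset to them on the host side is fine. A minimal sketch (no error checking, and the buffer sizes are arbitrary):

#include <cuda_runtime.h>

int main( void )
{
    float* d_buf;
    cudaMalloc( (void**)&d_buf, 1024*sizeof(float) );

    float h_data[256] = { 0 };

    // d_buf + 256 is plain pointer arithmetic evaluated on the host;
    // the result still addresses device memory, so it can be passed to
    // cudaMemcpy or to a kernel just like d_buf itself.
    cudaMemcpy( d_buf + 256, h_data, 256*sizeof(float), cudaMemcpyHostToDevice );
    cudaMemcpy( h_data, d_buf + 256, 256*sizeof(float), cudaMemcpyDeviceToHost );

    cudaFree( d_buf );
    return 0;
}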

However, you could do something like this (the code is for a square matrix, but it will work for a generic N):

// Dimensions of the block
dim3 dimBlock(32, 16);

// Find out how many blocks are needed to cover the matrix
dim3 dimGrid( (N/dimBlock.x) + (!(N%dimBlock.x) ? 0 : 1),
              (N/dimBlock.y) + (!(N%dimBlock.y) ? 0 : 1) );

// Launch the kernel
mykernel<<<dimGrid, dimBlock>>>( vmem_in, vmem_out, N );

where mykernel is:

__global__ void mykernel( float *in, float *out, int N )
{
    unsigned int idx = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;
    unsigned int idy = __mul24(blockIdx.y, blockDim.y) + threadIdx.y;

    // Check that idx and idy point to a valid element of the matrix
    if( idx < N && idy < N )
    {
        unsigned int index = idx + __mul24(idy, N);
        out[index] = in[index];
    }
}

I’ve been doing something similar:

dim3 dimBlock(32, 16);
dim3 dimGrid( ceil((float)N/(float)dimBlock.x), ceil((float)N/(float)dimBlock.y) );

With the same index checking code in the kernel. The integer mod might be a little faster, though.
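Either form rounds up; there is also a purely integer idiom that avoids both the float casts and the conditional (same N and launch shape assumed):

// Integer-only ceiling division: (N + b - 1)/b rounds N/b up.
dim3 dimBlock( 32, 16 );
dim3 dimGrid( (N + dimBlock.x - 1) / dimBlock.x,
              (N + dimBlock.y - 1) / dimBlock.y );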

I’m sorry, I wrote that code ad hoc to illustrate the problem. I have corrected it (I also corrected the number of threads to something more realistic :P).

What I want to do is pointer arithmetic in host code (on pointers to device memory). I don’t know how pointers are passed to kernels from host to device, so I don’t know whether it is even possible.

Here is the same example, extended for clarity:

This is host code:

float* vmem_in;
float* vmem_out;

cudaMalloc( (void**)&vmem_in,  width*height*sizeof(float) );
cudaMalloc( (void**)&vmem_out, width*height*sizeof(float) );

// Suppose that height is a multiple of 16, but I cannot guarantee the same for width
dim3 grid, block;
block.x = 32;
block.y = 16;
grid.x  = width/32;
grid.y  = height/16;

// I invoke my kernel
mykernel<<<grid,block,shmsz>>>( vmem_in, vmem_out, width );

// Here is the interesting part
grid.x  = 1;
block.x = width%32;   // assumes width is not a multiple of 32
mykernel<<<grid,block,shmsz>>>( vmem_in + 32*(width/32), vmem_out + 32*(width/32), width );

And here is the kernel code:

__global__ void mykernel( float* in, float* out, int width )
{
    int x = blockIdx.x*blockDim.x + threadIdx.x;
    int y = blockIdx.y*blockDim.y + threadIdx.y;

    out[y*width + x] = in[y*width + x] * in[y*width + x];
}

This way, if 32x16 is the best block size, I can cover as much of the matrix as possible with 32x16 thread blocks and then launch smaller blocks for the remaining elements, without modifying the kernel code and without passing an extra argument to the kernel and performing an extra computation (basically, adding a shift to x).

As I said, I don’t know whether it is technically possible or not, but it is not working right now (unless I’m missing something).

I think it should work, but another option would be to pad your input arrays and just do the extra work at the edge of the array, rather than doing another launch with a smaller block size. As long as the valid output doesn’t depend on invalid input (the padded region), this will work correctly and will likely be faster (as long as the work for the extra threads does not exceed the kernel launch overhead).
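Something along these lines, as a sketch only (padded_width is an illustrative name, and I’m reusing the 32x16 block from above):

// Round the row length up to the next multiple of block.x, so a full grid
// of 32x16 blocks covers the whole (padded) matrix in a single launch.
int padded_width = ((width + 31) / 32) * 32;

float *vmem_in, *vmem_out;
cudaMalloc( (void**)&vmem_in,  padded_width*height*sizeof(float) );
cudaMalloc( (void**)&vmem_out, padded_width*height*sizeof(float) );

dim3 block( 32, 16 );
dim3 grid( padded_width/32, height/16 );

// No extra kernel argument needed: threads over the pad just compute
// values that are never read back.
mykernel<<<grid,block,shmsz>>>( vmem_in, vmem_out, padded_width );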

Mark

The problem here is that I have to launch several kernels with different block sizes on the same image, and the padding would affect the results in some of them, so padding is not a solution for this particular case. Another option would be something like “if I’m outside the image, return”, but then I have to pass more arguments and my kernels use more registers (to the point that I can’t find a block size that achieves full occupancy). The perfect solution for me is the one I described above, and if you say that it should work, I will keep trying until it works.

EDIT:

OK, it’s working now! It was just a little problem with the shift of the destination matrix (I’m using different matrix sizes for input and output, and you know… copy-pasting…). It works as expected. Thank you for your help ;)
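For reference, the mistake was of this kind; each device pointer needs a shift computed from its own matrix width (in_width and out_width are made-up names here):

// Wrong (copy-paste): the output pointer reuses the input's shift.
//   vmem_out + 32*(in_width/32)
// Right: derive each shift from the matrix it belongs to (the kernel must
// also index each array with its own width/stride).
float* in_tail  = vmem_in  + 32*(in_width/32);
float* out_tail = vmem_out + 32*(out_width/32);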