Raytracing on a dataset whose size is not a power of 2 (uneven dataset)

Is it possible to easily set up the grid or blocks so that they work with uneven datasets? The thing is that I have a 3D CT dataset whose dimensions are not always a multiple of 16. In that case, say the dataset is 524 x 340 x 1, I currently calculate how many blocks of 16x16x1 fit into the dataset.

When no more full blocks fit, I change the block size for the last block so that it fits into the dataset.
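For the 524 x 340 example, the block arithmetic can be written out explicitly. This is a minimal host-side C++ sketch (it is also valid in a .cu file); the helper names `fullBlocks`, `leftover`, and `blocksNeeded` are hypothetical and not part of the original code:

```cpp
#include <cassert>

// Number of complete tiles of size `block` that fit into `n` elements
// (integer division rounds down).
unsigned int fullBlocks(unsigned int n, unsigned int block) {
    assert(block > 0);
    return n / block;
}

// Elements left over after the complete tiles (the "uneven" edge).
unsigned int leftover(unsigned int n, unsigned int block) {
    assert(block > 0);
    return n % block;
}

// Number of tiles needed to cover all `n` elements (ceiling division).
// This is the usual way to size a grid when a single launch should
// cover the whole dataset.
unsigned int blocksNeeded(unsigned int n, unsigned int block) {
    assert(block > 0);
    return (n + block - 1) / block;
}
```

With 16 x 16 blocks, x gives 32 full blocks plus 12 leftover columns and y gives 21 full blocks plus 4 leftover rows; a rounded-up grid of 33 x 22 blocks covers everything in one go.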

Is there another solution to this, or is this the only way to do it?

A little code example:

if ((dimx_ % 16) == 0 && (dimy_ % 16) == 0) {
    // Dimensions are multiples of 16: a single launch covers the whole dataset
    dim3 dimBlock(blockSize_x_, blockSize_y_);
    dim3 dimGrid(dimx_ * dimz_ / dimBlock.x, dimy_ / dimBlock.y);
    gpu_raytracer(d_radiologicalDepth_, d_CT_dataset_,
                  (dimGrid.x / dimz_), blockSize_x_, blockSize_y_,
                  offset_x_, offset_y_);

    std::cout << "DoTrace..." << std::endl;
} else {
    // a = pixels covered by complete 16-wide blocks, b = leftover pixels
    // (a and b are used for both x and y, so this assumes dimx_ == dimy_)
    unsigned int a = blockSize_x_ * (dimx_ / 16);
    unsigned int b = dimx_ - a;

    // Interior a x a region made of full blocks
    dim3 dimBlock(blockSize_x_, blockSize_y_);
    dim3 dimGrid(a * dimz_ / dimBlock.x, a / dimBlock.y);
    gpu_raytracer(d_radiologicalDepth_, d_CT_dataset_,
                  (dimGrid.x / dimz_), blockSize_x_, blockSize_y_,
                  offset_x_, offset_y_);

    // b x b corner (an existing dim3 is reassigned, not re-constructed)
    blockSize_x_ = b; blockSize_y_ = b;
    offset_x_ = a; offset_y_ = a;
    dimBlock = dim3(blockSize_x_, blockSize_y_);
    dimGrid = dim3(b * dimz_ / dimBlock.x, b / dimBlock.y);
    gpu_raytracer(d_radiologicalDepth_, d_CT_dataset_,
                  (dimGrid.x / dimz_), blockSize_x_, blockSize_y_,
                  offset_x_, offset_y_);

    // b x a strip on the right, one block per slice
    // (note: a single b x a block can exceed the maximum threads per block)
    blockSize_x_ = b; blockSize_y_ = a;
    offset_x_ = a; offset_y_ = 0;
    dimBlock = dim3(blockSize_x_, blockSize_y_);
    dimGrid = dim3(b * dimz_ / dimBlock.x, a / dimBlock.y);
    gpu_raytracer(d_radiologicalDepth_, d_CT_dataset_,
                  (dimGrid.x / dimz_), blockSize_x_, blockSize_y_,
                  offset_x_, offset_y_);

    // a x b strip at the bottom, one block per slice
    blockSize_x_ = a; blockSize_y_ = b;
    offset_x_ = 0; offset_y_ = a;
    dimBlock = dim3(blockSize_x_, blockSize_y_);
    dimGrid = dim3(a * dimz_ / dimBlock.x, 1);
    gpu_raytracer(d_radiologicalDepth_, d_CT_dataset_,
                  (dimGrid.x / dimz_), blockSize_x_, blockSize_y_,
                  offset_x_, offset_y_);
}

Thanks,

Jordy

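Can't you pass the size of your dataset to the kernel as arguments and put a bounds check at the beginning of your kernel, like this?

__global__ void function(/* inputs, */ int size_x, int size_y)
{
    // Global pixel indices of this thread
    int index_x = threadIdx.x + __mul24(blockIdx.x, blockDim.x);
    int index_y = threadIdx.y + __mul24(blockIdx.y, blockDim.y);

    // Threads that fall outside the dataset do nothing
    if ((index_x < size_x) && (index_y < size_y)) {
        // your code
    }
}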
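To put a number on how much "nothing" the bounds check costs, the launch can be emulated on the host. This is a sketch only: the loops below stand in for the GPU grid, and `LaunchStats` and `emulateBoundedLaunch` are hypothetical names, not CUDA API:

```cpp
#include <cassert>
#include <cstdint>

// Result of emulating one 2D launch on the host.
struct LaunchStats {
    std::uint64_t launched;  // threads in the whole (rounded-up) grid
    std::uint64_t active;    // threads that pass the bounds check
};

// Host-side emulation of a 2D launch in which every thread computes
// index_x/index_y like the kernel above and only works when in range.
LaunchStats emulateBoundedLaunch(unsigned int size_x, unsigned int size_y,
                                 unsigned int block_x, unsigned int block_y) {
    assert(block_x > 0 && block_y > 0);
    // Round the grid up so the partial edge blocks are included.
    unsigned int grid_x = (size_x + block_x - 1) / block_x;
    unsigned int grid_y = (size_y + block_y - 1) / block_y;

    LaunchStats s{0, 0};
    for (unsigned int by = 0; by < grid_y; ++by)
        for (unsigned int bx = 0; bx < grid_x; ++bx)
            for (unsigned int ty = 0; ty < block_y; ++ty)
                for (unsigned int tx = 0; tx < block_x; ++tx) {
                    unsigned int index_x = tx + bx * block_x;
                    unsigned int index_y = ty + by * block_y;
                    ++s.launched;
                    if (index_x < size_x && index_y < size_y)
                        ++s.active;  // "your code" would run here
                }
    return s;
}
```

For 524 x 340 with 16 x 16 blocks, one launch of 33 x 22 blocks starts 185,856 threads; 178,160 of them do work and 7,696 (about 4%) return immediately.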

Hi Dennis,

Thanks for the reply, but I already found another solution to my problem, because I don’t want my kernel to do too much useless work. This is what I wrote now:

template <class VType>
void GPURaytracer<VType>::DoTrace()
{
    // XXX TODO
    // Grid implementation for generic dimensions that are not a multiple of 16

    // Number of complete blocks in x and y (integer division already rounds down)
    unsigned int a = dimx_ / blockSize_x_;
    unsigned int b = dimy_ / blockSize_y_;

    // Save the full block size so it can be restored at the end
    int blockSize_x_temp = blockSize_x_;
    int blockSize_y_temp = blockSize_y_;

    if (a > 0 && b > 0) {
        // 1) Interior: a x b complete blocks per slice
        dim3 dimBlock(blockSize_x_, blockSize_y_);
        dim3 dimGrid(a * dimz_, b);
        gpu_raytracer(d_radiologicalDepth_, d_CT_dataset_,
                      (dimGrid.x / dimz_), blockSize_x_, blockSize_y_,
                      offset_x_, offset_y_);

        // 2) Right strip: leftover columns
        //    (the offset must be computed before blockSize_x_ is overwritten)
        offset_x_ = blockSize_x_ * a;
        blockSize_x_ = dimx_ - offset_x_;
        if (blockSize_x_ > 0) {
            dimBlock = dim3(blockSize_x_, blockSize_y_);
            dimGrid = dim3(1 * dimz_, b);
            gpu_raytracer(d_radiologicalDepth_, d_CT_dataset_,
                          (dimGrid.x / dimz_), blockSize_x_, blockSize_y_,
                          offset_x_, offset_y_);
        }

        // 3) Bottom-right corner: leftover columns x leftover rows
        offset_y_ = blockSize_y_ * b;
        blockSize_y_ = dimy_ - offset_y_;
        if (blockSize_x_ > 0 && blockSize_y_ > 0) {
            dimBlock = dim3(blockSize_x_, blockSize_y_);
            dimGrid = dim3(1 * dimz_, 1);
            gpu_raytracer(d_radiologicalDepth_, d_CT_dataset_,
                          (dimGrid.x / dimz_), blockSize_x_, blockSize_y_,
                          offset_x_, offset_y_);
        }

        // 4) Bottom strip: leftover rows below the complete blocks
        offset_x_ = 0;
        blockSize_x_ = blockSize_x_temp;
        if (blockSize_y_ > 0) {
            dimBlock = dim3(blockSize_x_, blockSize_y_);
            dimGrid = dim3(a * dimz_, 1);
            gpu_raytracer(d_radiologicalDepth_, d_CT_dataset_,
                          (dimGrid.x / dimz_), blockSize_x_, blockSize_y_,
                          offset_x_, offset_y_);
        }

        // Restore the members modified above (assumes the offsets start at 0)
        blockSize_y_ = blockSize_y_temp;
        offset_y_ = 0;
    }
}
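The four launches in DoTrace split each slice into an interior of full blocks, a right-hand strip, a bottom-right corner, and a bottom strip. A host-only C++ sketch that reproduces that split and checks that every pixel is covered exactly once; `Region`, `splitIntoRegions`, and `coversExactlyOnce` are hypothetical helpers, not members of the original class:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One rectangular launch: offset and extent in pixels.
struct Region {
    unsigned int off_x, off_y, width, height;
};

// Split a dim_x x dim_y slice into the (up to) four regions launched
// separately: interior, right strip, bottom-right corner, bottom strip.
// Empty regions are skipped, mirroring the if() checks in DoTrace.
std::vector<Region> splitIntoRegions(unsigned int dim_x, unsigned int dim_y,
                                     unsigned int block_x, unsigned int block_y) {
    assert(block_x > 0 && block_y > 0);
    unsigned int a = dim_x / block_x;           // complete blocks in x
    unsigned int b = dim_y / block_y;           // complete blocks in y
    unsigned int rest_x = dim_x - a * block_x;  // leftover columns
    unsigned int rest_y = dim_y - b * block_y;  // leftover rows

    std::vector<Region> regions;
    if (a == 0 || b == 0) return regions;       // DoTrace launches nothing here
    regions.push_back({0, 0, a * block_x, b * block_y});
    if (rest_x > 0) regions.push_back({a * block_x, 0, rest_x, b * block_y});
    if (rest_x > 0 && rest_y > 0)
        regions.push_back({a * block_x, b * block_y, rest_x, rest_y});
    if (rest_y > 0) regions.push_back({0, b * block_y, a * block_x, rest_y});
    return regions;
}

// True if the regions cover every pixel of the slice exactly once.
bool coversExactlyOnce(const std::vector<Region>& regions,
                       unsigned int dim_x, unsigned int dim_y) {
    std::vector<int> hits(static_cast<std::size_t>(dim_x) * dim_y, 0);
    for (const Region& r : regions)
        for (unsigned int y = r.off_y; y < r.off_y + r.height; ++y)
            for (unsigned int x = r.off_x; x < r.off_x + r.width; ++x)
                ++hits[static_cast<std::size_t>(y) * dim_x + x];
    for (int h : hits)
        if (h != 1) return false;
    return true;
}
```

For 524 x 340 this yields a 512 x 336 interior, a 12 x 336 right strip, a 12 x 4 corner, and a 512 x 4 bottom strip, that is, four launches per call.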

Actually, what I wrote is (I believe) very fast ;)
A few threads (in the higher warps of the edge blocks) will exit immediately, so some threads do no work, but that is cheap.
Your code now launches the kernel several times (first for the large part that is a multiple of 16, and then for the 'edges', if I read it correctly), and the overhead of a kernel launch is probably much higher than the extra cost of the if(). That holds as long as adding the if() lets you launch the kernel only once. I must say, though, that I don’t completely follow your code (although I can read the comments).
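The single-launch approach also works when the z-slices are folded into grid.x, which is what the `(dimGrid.x / dimz_)` argument (x blocks per slice) suggests the kernel does. A host-side C++ sketch, assuming the kernel recovers the slice as `blockIdx.x / blocksPerSliceX` (an assumption, since the kernel body is not shown in the thread); `emulateFoldedLaunch` is a hypothetical name:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Emulates ONE launch over a dim_x x dim_y x dim_z volume with the slices
// folded into grid.x (grid.x = blocksPerSliceX * dim_z). Grid sizes are
// rounded up, and a single bounds check discards out-of-range threads.
// Returns how many times each voxel was visited.
std::vector<int> emulateFoldedLaunch(unsigned int dim_x, unsigned int dim_y,
                                     unsigned int dim_z,
                                     unsigned int block_x, unsigned int block_y) {
    assert(block_x > 0 && block_y > 0);
    unsigned int blocksPerSliceX = (dim_x + block_x - 1) / block_x;
    unsigned int grid_x = blocksPerSliceX * dim_z;
    unsigned int grid_y = (dim_y + block_y - 1) / block_y;

    std::vector<int> visits(static_cast<std::size_t>(dim_x) * dim_y * dim_z, 0);
    for (unsigned int by = 0; by < grid_y; ++by)
        for (unsigned int bx = 0; bx < grid_x; ++bx) {
            // Recover the slice and the x block within that slice.
            unsigned int z = bx / blocksPerSliceX;
            unsigned int bxInSlice = bx % blocksPerSliceX;
            for (unsigned int ty = 0; ty < block_y; ++ty)
                for (unsigned int tx = 0; tx < block_x; ++tx) {
                    unsigned int x = tx + bxInSlice * block_x;
                    unsigned int y = ty + by * block_y;
                    if (x < dim_x && y < dim_y)  // the single if()
                        ++visits[(static_cast<std::size_t>(z) * dim_y + y) * dim_x + x];
                }
        }
    return visits;
}
```

Every voxel is visited exactly once by a single launch, so the edge handling costs one comparison per thread instead of three extra launches.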