Has anyone heard when we might expect to get support for 3D grids? I just looked at the list of features for the new 2.3 beta and it’s not there. I have some kernels that would really benefit from using 3D blocks, but I haven’t found any method to calculate my global z that would allow me to use 3D threadblock x, y, and z coordinates. If anyone knows how to do this I would really appreciate a pointer to some code. I saw Jonathan Cohen’s code posted by Simon Green in Jan 09, but it doesn’t allow me to use native 3D block indexing, which would be so much more intuitive.

One approach would be to find out how CUDA creates a 2D grid of 3D blocks and then implement that algorithm to get to the global z coordinate.

Thanks wanderine, I see how the index can be produced, but this ties the number of threads in a block to the width of the original dataset. It would be so sweet if we could just take 8x8x8 or 8x8x4 chunks out of a 3D dataset so as to control the number of threads. It shouldn’t be that hard to discover the algorithm they’re using to make a 2D grid out of 3D blocks.

Perhaps someone from NVidia would care to share the algorithm and save me some time. Please, ;-)

I think we may be misunderstanding each other. I’m assuming that you have a 3D volume, and you want to give each cell to a thread. Each block of threads can easily be made 3D - support is there already. The question I was attempting to answer was how to assign the blocks themselves 3D co-ordinates. For the 3D volume, you know the number of blocks you need in each dimension (by dividing the number of cells in a given dimension by the number of threads a block has along the same dimension). You can then launch a 1D grid, and have each block convert its 1D blockIdx.x co-ordinate into its true 3D location using the method I showed.

Yes, you understand what I’m trying to do. I don’t want any restrictions on the sizes of the 3D threadblocks I can create beyond the limit on threads per block. I would rather not be restricted to cubes, although I can live with that. Correct me if I’m wrong, but converting the index to global x, y, and z the way that you did above using modulus is predicated on the idea that we’re dealing with a 3D threadblock that is a cube, correct?

Weakling! I’m working on a 6-D search for an image registration :D

Well, you can divvy up the threads within a block into 1, 2 or 3D as you please. The index decomposition I described would have to know about this, but it could cope with anything you needed. And it’s certainly not restricted to cubes - any cuboid will work if you change the value used in the div/mod for each stage. It’s just the reverse operation to composing a 1D index from the 3D co-ordinates.

Thanks, I see what you’re doing now. I’m still having a hard time compressing the dimensionality of my data, but it’s getting easier after my first week of full-time CUDA programming. I do love the raw power of it, very heady stuff. Just ported C++ code that we use to create observed gene data that took about 12 hours before CUDA; now it’s 30 minutes!

I still want NVidia to free us from all this work and increase grid dimensions. Code maintenance and being able to bring new developers up to speed quickly on CUDA code would be so much easier if indexing our data wasn’t so cryptic and awkward.

Well, indexing is always going to be awkward (to a certain extent) until someone teaches C what an array is - it’ll be interesting to see what nvfc does (if it’s ever released), since the Fortran specification defines arrays of up to six dimensions as first-class datatypes (and ELEMENTAL functions, which should automatically meet the requirements of being a kernel). For the actual grid and thread blocks… I wouldn’t hold my breath. We have all this computational power because playing Quake involves the manipulation of a lot of 3D data to produce a 2D representation of that data. I expect that’s the origin of grids being 2D, but blocks being (potentially) 3D. I don’t think NVIDIA’s prime revenue stream is going to shift any time soon. Of course, they might be able to make this happen (witness the appearance of double precision hardware), but I doubt it’ll be a high priority.

Almost… but to get the right answer, you would have needed the correct information. It’s actually 9-D - three rotations, three translations and three dilations. Right now, I do each individual fitness evaluation on the GPU, and determine which transforms to use on the CPU. That’s leaving the GPU badly under-utilised, so I want to do the 9D space of all candidate transforms in a single kernel launch.

I thought about this over the weekend and realized that without knowing how the mapping is done you can still establish a layout and assign coordinates. The trick was getting z, so I just created a grid that looks like a piece of celluloid film where grid X = dataset X and Z is time. Each frame is a slice, gridX will always be equal to dataX, and dataY and dataZ can be easily calculated as long as you know the original dimensions of the dataset.

So, as an example, 3D matrix multiplication with float datasets of size 32x32x32:

//////
// .h
typedef struct {
    unsigned int dataX;
    unsigned int dataY;
    unsigned int dataZ;
    float* elements;
} Matrix3D;

__global__ void k_MatrixMult3D( Matrix3D a, Matrix3D b, Matrix3D c ); // declare before use in the launch below

////////////
// host side
extern "C" void
MatrixMult3D( Matrix3D a, Matrix3D b )
{
    Matrix3D c;
    c.dataX = a.dataX; // 32 here but can be any multiple of 8
    c.dataY = a.dataY; // 32 here but can be any multiple of 8
    c.dataZ = a.dataZ; // 32 here but can be any multiple of 4 or 8 depending upon desired threads
    cudaMalloc( (void**) &c.elements, c.dataX*c.dataY*c.dataZ*sizeof( float ) ); // cudaMalloc returns an error code, not the pointer
    int numBlocks = (c.dataX*c.dataY*c.dataZ)/(8*8*4);
    dim3 dimBlock(8,8,4); // 256 thread version
    dim3 dimGrid(c.dataX/8, numBlocks / (c.dataX/8));
    k_MatrixMult3D <<< dimGrid, dimBlock >>> ( a, b, c );
}
//////////////
// device side
__global__ void
k_MatrixMult3D( Matrix3D a, Matrix3D b, Matrix3D c )
{
    unsigned int dX = threadIdx.x + (blockDim.x*blockIdx.x); // just like a 2D grid of 2D blocks
    unsigned int dS = (blockDim.y*blockIdx.y)/c.dataY; // slice index
    unsigned int dY = (dS == 0) ? threadIdx.y + (blockDim.y*blockIdx.y)
                                : ( threadIdx.y + (blockDim.y*blockIdx.y) ) % c.dataY; // comparison, not assignment
    unsigned int dZ = threadIdx.z + (dS * blockDim.z);
    unsigned int ndx;
    if (dX < c.dataX && dY < c.dataY && dZ < c.dataZ) {
        ndx = dX + dY*c.dataX + dZ*c.dataX*c.dataY;
        c.elements[ndx] = a.elements[ndx] * b.elements[ndx];
    }
}

To me this is more intuitive, and still leaves us room for adding other dimensions. For 3D it really makes things so much simpler because it’s just stacked slices. Any potential problems with this approach that you guys can see (other than the obvious fact that I didn’t use any shared memory in this example so it would run slow)?