Extracting multiple texels at once: how to extract a 1D array from a 3D or 2D texture?

The title and description more or less sum up what I need help with. In more detail, I have a 3D texture, a cube if you will, organized so that every element with the coordinates (X,Y,Z) belongs to the same data group as all the other elements with the same X and Y coordinates. Each thread needs to access all of the relevant data to do the math it’s supposed to. The range of the depth dimension (Z coordinate) is 256 (Z ∈ {0…255}) and the data type is char, which takes up one byte per element, which means that I can launch 64 threads per block.

Anyway, I would like to use tex3D() or some other function to copy all of the data with the coordinates (X,Y, 0…255) in one move to a local variable. I’ve spent some time pondering this and the only thing that comes to mind is using two double2 variables for reading, and extracting 256 chars from that, but it doesn’t strike me as a particularly efficient method. Essentially, I could make a 2D texture that is 256 wide and max(X)*max(Y) high, or even a 1D texture that is 256*max(X)*max(Y) long, but the problem of copying multiple texels at once still remains. So, the problem can be generalised into efficiently extracting a submatrix of any N-dim texture (which is essentially an N-dim matrix).

So, does anyone have a solution?

P.S. I will be going into why I decided on this data organisation in a different thread, where I’ll describe what I’m trying to code, since I’m not sure anymore that this is the optimal solution, but I think that it’s not relevant to this discussion…

Thanks in advance,
Nemanja

Sounds like you’re looking at the problem from a graphics API background. Forget the textures, use ordinary arrays, and read up on the coalescing rules to see how to get maximum DRAM bandwidth. You’d probably want to read char4s. (You won’t get better performance casting as double4s, and in fact casting as double may corrupt your data.)
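As a rough illustration of the char4 suggestion, here is a minimal sketch, assuming the data sits in an ordinary linear device array whose length is a multiple of 4. The names sumChar4 and runChar4Demo are made up for this example, and uchar4 (the unsigned counterpart of char4) is used only so the byte sums stay non-negative.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread loads one uchar4 (4 bytes) with a single load instruction.
// Consecutive threads read consecutive 4-byte words, which is the access
// pattern the coalescing rules reward.
__global__ void sumChar4(const uchar4* in, unsigned int* out, int n4)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) {
        uchar4 v = in[i];
        out[i] = v.x + v.y + v.z + v.w;
    }
}

// Hypothetical host helper: returns the number of mismatches against a CPU
// reference, or -1 if a CUDA call fails.
int runChar4Demo()
{
    const int n  = 1024;      // number of bytes, assumed to be a multiple of 4
    const int n4 = n / 4;     // number of uchar4 elements

    std::vector<unsigned char> h(n);
    for (int i = 0; i < n; ++i) h[i] = (unsigned char)(i % 251);

    unsigned char* dIn  = nullptr;
    unsigned int*  dOut = nullptr;
    if (cudaMalloc(&dIn, n) != cudaSuccess) return -1;
    if (cudaMalloc(&dOut, n4 * sizeof(unsigned int)) != cudaSuccess) {
        cudaFree(dIn);
        return -1;
    }
    cudaMemcpy(dIn, h.data(), n, cudaMemcpyHostToDevice);

    // cudaMalloc returns suitably aligned memory, so viewing it as uchar4 is safe.
    sumChar4<<<(n4 + 127) / 128, 128>>>(reinterpret_cast<const uchar4*>(dIn), dOut, n4);

    std::vector<unsigned int> r(n4);
    cudaError_t err = cudaMemcpy(r.data(), dOut, n4 * sizeof(unsigned int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    if (err != cudaSuccess) return -1;

    int bad = 0;
    for (int i = 0; i < n4; ++i) {
        unsigned int ref = h[4 * i] + h[4 * i + 1] + h[4 * i + 2] + h[4 * i + 3];
        if (r[i] != ref) ++bad;
    }
    return bad;
}
```

Note that the benefit comes from neighbouring threads touching neighbouring addresses. If every thread instead walked its own 256-byte column, the threads of a warp would be 256 bytes apart and the loads would generally not coalesce, whatever vector type is used.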

There’s no automatic way to extract a submatrix from a 3D texture in CUDA.

Thank you very, very much for your answer. I can’t believe I overlooked the obvious like that… Textures would have been better since I need the clamping addressing mode and there’s going to be a lot of transfer from the cube, but I’ll have to use regular matrices in global/constant memory instead… Thank you again!
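For what it’s worth, the clamping behaviour is easy to reproduce by hand on a regular array. The following is only a sketch: clampIndex and fetchClamped are hypothetical helper names, and the volume is assumed to be stored as a flat array with Z varying fastest, matching a C array declared as [X][Y][Z].

```cuda
#include <cstddef>

// Let the helpers also compile as plain C++ when nvcc is not used.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Clamp i into the valid range [0, n-1], mimicking the clamp addressing mode.
__host__ __device__ inline int clampIndex(int i, int n)
{
    return i < 0 ? 0 : (i >= n ? n - 1 : i);
}

// Read one element of an nx * ny * nz volume with every coordinate clamped.
// Assumed layout: flat array, Z fastest, i.e. index = (x * ny + y) * nz + z.
__host__ __device__ inline unsigned char fetchClamped(const unsigned char* vol,
                                                      int x, int y, int z,
                                                      int nx, int ny, int nz)
{
    x = clampIndex(x, nx);
    y = clampIndex(y, ny);
    z = clampIndex(z, nz);
    return vol[((size_t)x * ny + y) * nz + z];
}
```

Inside a kernel this would be called as, for example, fetchClamped(vol, x - 1, y, z, nx, ny, nz), and out-of-range neighbours resolve to the nearest edge element.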

Hm, I have an additional question:

Let’s say that the aforementioned 3D array is stored in a variable defined as __device__ unsigned char 3dArray[256][256][256]. The __device__ qualifier makes it a part of global memory. Can I simply use memcpy(), a string.h function, to copy, for example, 3dArray[0…15][1][1] if I use pointers? Or do I have to use, for example, cudaMemcpy()? Again, it is imperative for me to be able to read only a PART of an array.

Thanks in advance,
Nemanja

memcpy() doesn’t exist in CUDA, and cudaMemcpy() is for copying device to device memory (called from the host). To copy device to shared (or to copy device to device in a kernel), you’ll have to write a for() loop.

Also, the only way to get good performance is if you copy contiguous memory. So, rearrange your 3D array so that you’re copying 3dArray[1][1][0…15].
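The for() loop copy suggested above, applied to a contiguous run of Z values, might look roughly like this. It is a sketch under stated assumptions: the volume is a flat array with Z fastest, one block handles one (X,Y) column, and columnSum, runColumnDemo and DEPTH are names invented for this example. The final sum is only there to show the shared copy being used.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define DEPTH 256   // number of Z values per (X,Y) column

// One block per (X,Y) column; blockIdx.x is the flattened column index.
__global__ void columnSum(const unsigned char* vol, unsigned int* out)
{
    __shared__ unsigned char column[DEPTH];

    // The 256 bytes of this column are contiguous in global memory.
    const unsigned char* src = vol + (size_t)blockIdx.x * DEPTH;

    // Cooperative copy: thread t copies bytes t, t + blockDim.x, ...
    // so neighbouring threads read neighbouring addresses.
    for (int z = threadIdx.x; z < DEPTH; z += blockDim.x)
        column[z] = src[z];
    __syncthreads();   // the whole column is now visible to every thread

    // Placeholder for the real math: thread 0 sums the column.
    if (threadIdx.x == 0) {
        unsigned int s = 0;
        for (int z = 0; z < DEPTH; ++z) s += column[z];
        out[blockIdx.x] = s;
    }
}

// Hypothetical host helper: returns the number of mismatches against a CPU
// reference, or -1 if a CUDA call fails.
int runColumnDemo()
{
    const int columns = 16;   // e.g. a 4 x 4 grid of (X,Y) pairs
    std::vector<unsigned char> h((size_t)columns * DEPTH);
    for (size_t i = 0; i < h.size(); ++i) h[i] = (unsigned char)((i * 7) % 256);

    unsigned char* dVol = nullptr;
    unsigned int*  dOut = nullptr;
    if (cudaMalloc(&dVol, h.size()) != cudaSuccess) return -1;
    if (cudaMalloc(&dOut, columns * sizeof(unsigned int)) != cudaSuccess) {
        cudaFree(dVol);
        return -1;
    }
    cudaMemcpy(dVol, h.data(), h.size(), cudaMemcpyHostToDevice);

    columnSum<<<columns, 64>>>(dVol, dOut);

    std::vector<unsigned int> r(columns);
    cudaError_t err = cudaMemcpy(r.data(), dOut, columns * sizeof(unsigned int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dVol);
    cudaFree(dOut);
    if (err != cudaSuccess) return -1;

    int bad = 0;
    for (int c = 0; c < columns; ++c) {
        unsigned int ref = 0;
        for (int z = 0; z < DEPTH; ++z) ref += h[(size_t)c * DEPTH + z];
        if (r[c] != ref) ++bad;
    }
    return bad;
}
```

With the original [Z-slowest] indexing such as 3dArray[0…15][1][1], consecutive elements would be 256*256 bytes apart, so the same loop would still work but each read would land in a different memory segment.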

Again, read the Programming Guide, and pay attention to the section on coalescing. It answers everything.