Interleaved Texture Fetch How?

papag62 · January 31, 2008, 6:29pm

OK, I’m looking to do some undistortion on some images using texture fetches. I capture the image with OpenCV to a RGBA IplImage data structure, the data is stored in as a char*.

I then copy the image date to the device, where imData is the char* to the IplImage, RES_H and RES_V are the horizontal and vertical resolution, and POINTS is RES_HRES_V4 (where 4 is the number of channels). I then create a texture binding.

cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc(8, 8, 8, 8, cudaChannelFormatKindSigned);

  

cudaArray* cu_array;

CUDA_SAFE_CALL(cudaMallocArray( &cu_array, &channelDesc, RES_H, RES_V ));

 CUDA_SAFE_CALL( cudaMemcpyToArray( cu_array, 0, 0, imData, POINTS, cudaMemcpyHostToDevice));

// set texture parameters

tex_Image.addressMode[0] = cudaAddressModeClamp;

tex_Image.addressMode[1] = cudaAddressModeClamp;

tex_Image.filterMode = cudaFilterModePoint;

tex_Image.normalized = false;    

// Bind the array to the texture

CUDA_SAFE_CALL( cudaBindTextureToArray( tex_Image, cu_array, channelDesc));

My kernel follows:

texture <char4, 2, cudaReadModeElementType> tex_Image;

__global__ void

cuUndistort(char* imData,float* Mapx,float* Mapy,int* RES_H, int* RES_V, char4* ResultData)

{

	int x = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;

	int y = __mul24(blockIdx.y, blockDim.y) + threadIdx.y;

	int loc = __mul24(y, RES_H[0]) + x;

	if (x < RES_H[0] && y < RES_V[0]) 

	{

  ResultData[loc] = tex2D(tex_Image, Mapx[loc], Mapy[loc]);

	}    

}

This all works just fine, but my problem is with using char4. I need to copy the resulting data back into a interleaved char* which alone takes ~15ms. I tried allocating char types instead of char4 and using a char texture reference but the image didn’t come out right. I then still have the problem on not being able to use cudaFilterModeLinear as my filter mode.

Does anybody have any suggestion on how to go from my char* image and use a texture reference that can use bi-linear interpolation and still be able to transfer to the host as a char*?

I would also like to be able to do 32bit memory read/writes as I heard this is much faster than 8bit.

Thanks for your time.

paulius · January 31, 2008, 6:57pm

I don’t understand what issue you’re having with char/char4 conversion, please give a more detailed description.

To get bilinear filtering, configure your texture to return normalized floats, rather than element type - filtering is available only for textures that return floats. Then scale the returned value inside your kernel, to get an integer.

Paulius

papag62 · January 31, 2008, 7:22pm

Like I said, I am able to use char4 because the texture fetch returns each channel, RGBA, into its own element in the char4 vector. It work fine but then I need to do

int i=0;

	int j=0;

	/*for(i=0;i<POINTS;i+=CHANNELS)

	{

  imResultData[i]=(char)hResultData[j].x;

  imResultData[i+1]=(char)hResultData[j].y;

  imResultData[i+2]=(char)hResultData[j].z;

  imResultData[i+3]=(char)hResultData[j].w;

  j++;

	}*/

to copy the char4 to a char. When I try using char instead of char4 in my texture reference the resulting data isn’t correct. It seems like CUDA isn’t handling the interleaved data correctly. I first want to make sure that I can use none vector types for interleaved texture fetches.

paulius · January 31, 2008, 11:20pm

…

int i=0;

	int j=0;

	/*for(i=0;i<POINTS;i+=CHANNELS)

	{

  imResultData[i]=(char)hResultData[j].x;

  imResultData[i+1]=(char)hResultData[j].y;

  imResultData[i+2]=(char)hResultData[j].z;

  imResultData[i+3]=(char)hResultData[j].w;

  j++;

	}*/

…[snapback]318113[/snapback]

I still don’t fully follow your case, but I think I’m beginning to understand. By interleaved texture fetches, you mean that color-components are interleaved, is that correct?

In the code snipped, is imResultData a pointer (array) of char in global memory? I don’t understand why you need to write one component at a time to the output. Based on your addressing above you can just assign a char4 at a time. Especially if imResultData is in global memory, you’ll get much better performance writing char4s, since that allows coalescing (run your code through the profiler to check for uncoalesced accesses).

In regards to fetching scalar chars from textures, I haven’t run into, or heard of, an issue fetching scalars from a texture. I’m leaning towards the possibility of a bug in your code causing an incorrect image in the output. Have you tested the correctness of scalar fetching in a small test program, rather than a full kernel?

Paulius

papag62 · February 1, 2008, 4:31am

Don’t worry, I’ve been looking at this for so long I’m beginning to wonder if I can follow my own case :blink:

Yes you are correct, when I stated interleaved texture fetch I meant performing a texture fetch on interleaved data (R,G,B,A,R,G,B,A,R…). In my code snipped imResultData is a char* in host memory (I should have clarified that this was host code).

What I am starting with is a char* image (this is the data type for the data in an OpenCV image) of length RES_HRES_V#CHANNELS on the host side. I am transfering this to the device with a texture bind operation. I then run the kernel to perform a texture fetch using Mapx and Mapy, where Mapx and Mapy are distortion maps that indicate the pixel location in a distorted image that maps to my current pixel location on an undistorted image (I hope this make sense, the threads are set up so that each thread is responsible for a pixel in the undistorted image). I then wish to transfer this back to the host so that I can put the resulting data back into an OpenCV image which requires a char*.

I have been assuming that when I do a channel descriptor with non-zero values for x,y,z, and w that CUDA understands that I have 4 channels interleaved. Is this true, or does the vector type in the texture reference tell CUDA how the data is arranged?

cudaCreateChannelDesc(8, 8, 8, 8, cudaChannelFormatKindSigned);

The only clue I have on this is section 4.4.5.1 which states:

Which uses a CUDA _4 vector for 4-tuples, which I take to be four channel interleaved data and I take it that this means only _4 vector types can be used for four channel interleaved data.

Would my best bet be to have NormalizedFloat char4 texutre reference, call the texture fetches to a float4 and then typecast and copy the float4 data to a char* that I can then transfer back to the host and use in OpenCV?

I really think this is coming across more complicated than it is, so thanks for putting up with it :)

I still don’t fully follow your case, but I think I’m beginning to understand. By interleaved texture fetches, you mean that color-components are interleaved, is that correct?

In the code snipped, is imResultData a pointer (array) of char in global memory? I don’t understand why you need to write one component at a time to the output. Based on your addressing above you can just assign a char4 at a time. Especially if imResultData is in global memory, you’ll get much better performance writing char4s, since that allows coalescing (run your code through the profiler to check for uncoalesced accesses).

In regards to fetching scalar chars from textures, I haven’t run into, or heard of, an issue fetching scalars from a texture. I’m leaning towards the possibility of a bug in your code causing an incorrect image in the output. Have you tested the correctness of scalar fetching in a small test program, rather than a full kernel?

Paulius

[snapback]318223[/snapback]