uchar3 to texture

I do some digital image processing with CUDA.

From a “Bitmap” data structure I get an

unsigned char[imageWidth * imageHeight * 3]

. To make a cudaArray out of it is no problem.

But what is the fastest way to use it as a texture, given that texture references are restricted to 1-, 2-, and 4-component vector types?

See the CUDA Programming Guide Version 1.0, page 24.

Does anyone have a good idea?


The fastest way is to pad an extra byte and read it as a uchar4 texture.
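In plain C terms, padding just means inserting a fourth (unused) byte after every RGB triple. A minimal host-side sketch of that expansion (the function name and layout are illustrative, not from the thread):

```c
#include <stdlib.h>

/* Expand a tightly packed 3-byte-per-pixel buffer into a 4-byte-per-pixel
 * buffer, writing 0x00 into the padding byte. Caller frees the result. */
unsigned char *pad_uchar3_to_uchar4(const unsigned char *src, size_t num_pixels)
{
    unsigned char *dst = malloc(num_pixels * 4);
    if (!dst) return NULL;
    for (size_t i = 0; i < num_pixels; i++) {
        dst[i * 4 + 0] = src[i * 3 + 0];
        dst[i * 4 + 1] = src[i * 3 + 1];
        dst[i * 4 + 2] = src[i * 3 + 2];
        dst[i * 4 + 3] = 0x00;   /* padding byte */
    }
    return dst;
}
```

The padded buffer can then be copied to a cudaArray and bound as a uchar4 texture.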

I did, but I think it is slow. So I wanted to find out if anyone has a different solution which is faster than mine.

So, what is the best way to make a uchar4 out of a uchar3?

My actual solution is the following:

__global__ void CopyFromUnsignedChar3ToUchar4( unsigned char* input, uchar4* output )
{
    const int x = blockDim.x * blockIdx.x + threadIdx.x;
    const int y = blockDim.y * blockIdx.y + threadIdx.y;
    const int index = Index_ZS(y, x);

    uchar4 temp = make_uchar4( *(input + index),
                               *(input + index + 1),
                               *(input + index + 2),
                               0x00 );

    *(output + y * IMAGE_WIDTH + x) = temp;
}


Then I copy the output to the cudaArray.

I don’t know if it’s better, but you can use the Memcpy2D functions to “expand” your array: destination stride is 4 bytes, source stride is 3 bytes.
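Concretely, the trick is to treat every pixel as one “row” of the 2D copy: source pitch 3 bytes, destination pitch 4 bytes, row width 3 bytes, and height = number of pixels (i.e. something like cudaMemcpy2D(dst, 4, src, 3, 3, numPixels, …), assuming the driver accepts such small pitches). A plain-C model of what such a pitched copy does:

```c
#include <string.h>

/* Model of a pitched 2D copy: copies `width` bytes per row, advancing the
 * source pointer by `spitch` and the destination by `dpitch` after each row.
 * With dpitch=4, spitch=3, width=3, height=num_pixels this expands a packed
 * uchar3 image into a uchar4 image (padding bytes are left untouched). */
void memcpy2d_model(unsigned char *dst, size_t dpitch,
                    const unsigned char *src, size_t spitch,
                    size_t width, size_t height)
{
    for (size_t row = 0; row < height; row++)
        memcpy(dst + row * dpitch, src + row * spitch, width);
}
```

Zero the destination buffer first if you want the padding bytes to be 0x00.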

Also, you could write a CUDA kernel to do this. It may be faster; my hunch says you should be able to get somewhat close to the full device memory bandwidth. It’s also a good exercise.

It will take some fuddling about with shared memory to coalesce reads and writes in a uchar3->uchar4 kernel. The Memcpy2D idea by kristleifur is a good one, let us know if that works.

Just now, I was re-reading Mark Harris’s optimization lecture notes from Supercomputing 2007; there’s a discussion of this sort of shared-memory wrangling on pages 15–20.


Thanks for your good ideas!

I tried out three of them:


__global__ void CopyUnsignedChar3ToUchar4( unsigned char* input, uchar4* output )
{
    const int x = blockDim.x * blockIdx.x + threadIdx.x;
    const int y = blockDim.y * blockIdx.y + threadIdx.y;
    const int index = Index_ZS(y, x);

    *(output + y * IMAGE_WIDTH + x) = make_uchar4( *(input + index),
                                                   *(input + index + 1),
                                                   *(input + index + 2),
                                                   0x00 );
}



__global__ void CopyUnsignedChar3ToUchar4_S( unsigned char* input, uchar4* output )
{
    const int x = blockDim.x * blockIdx.x + threadIdx.x;
    const int y = blockDim.y * blockIdx.y + threadIdx.y;
    const int index = Index_ZS(y, x);

    // stage through shared memory; index with threadIdx (not the global x/y)
    // so the 8x8 buffer is not overrun
    __shared__ uchar4 storage[8][8];
    storage[threadIdx.x][threadIdx.y] = make_uchar4( *(input + index),
                                                     *(input + index + 1),
                                                     *(input + index + 2),
                                                     0x00 );

    *(output + y * IMAGE_WIDTH + x) = storage[threadIdx.x][threadIdx.y];
}



cudaMemcpy2D(dest_uchar4, 4, pcInputImage, 3, 704, 576, cudaMemcpyDeviceToDevice);


Method no.:              1          1          2          2          3          3
Number of measurements:  250        10000      250        10000      250        10000
Median [ms]:             1.080130   1.080480   1.076800   1.082360   0.008363   0.008508
Variance [ms]:           0.046579   0.062817   0.006150   0.076088   0.001715   0.003705
Uncertainty [ms]:        0.002946   0.000628   0.000389   0.000761   0.000108   0.000037

I tried using the shared memory as a buffer in a kernel that performs some more calculations than the one above. This gave me an improvement of about 1–2 ms. The kernel processed a 704x576x3 (unsigned char) image.

I used an 8800 GTS with 640 MB.
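As a sanity check against the ~1.08 ms kernel timings: the conversion reads 3 and writes 4 bytes per pixel, so at roughly 64 GB/s peak memory bandwidth (an assumed figure for the 8800 GTS 640 MB; check your card’s specs) the whole thing should take on the order of tens of microseconds. A small arithmetic sketch:

```c
/* Ideal (bandwidth-bound) time in ms to read a packed uchar3 image and
 * write a padded uchar4 image of w x h pixels at `bandwidth` bytes/s. */
double ideal_copy_time_ms(double w, double h, double bandwidth)
{
    double bytes_moved = w * h * (3 + 4);  /* 3 bytes read + 4 bytes written per pixel */
    return bytes_moved / bandwidth * 1e3;
}
```

With the assumed ~64 GB/s, ideal_copy_time_ms(704, 576, 64e9) is about 0.044 ms, so the measured ~1.08 ms kernels are more than 20x away from the bandwidth bound.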

None of the methods you tried will read memory coalesced. Did you try the cudaMemcpy2D method? You could even do the cudaMemcpy2D expansion right when you copy from host to device.

To coalesce, you need to read 4-byte values into a staging area and then write out the created uchar4. But the stride with which the 4-byte values are read also matters for coalescing, which limits the block size to one where 3/4*block_size is a multiple of the warp size.

Like this: (sorry for the nasty pointer tricks, there may be a better way)

// to be run with a block_size of 128 (or any where 3/4 * block_size is a multiple of 32)
// nblocks = ceil(N/block_size);
// shared memory must be allocated with 3*block_size bytes
// I'm being lazy and not checking input/output reads/writes: pad data
// to be a multiple of block_size bytes
__global__ void CopyUnsignedChar3ToUchar4( unsigned char* input, uchar4* output, int N )
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    extern __shared__ unsigned char sdata[];

    // read in the uchar3's.... yes, this reads in more data than needed
    int nints_to_read = blockDim.x * 3 / 4;
    int start_int = blockIdx.x * nints_to_read;
    if (threadIdx.x < nints_to_read)
    {
        ((int*)sdata)[threadIdx.x] = ((int*)input)[start_int + threadIdx.x];
    }
    __syncthreads();  // other threads read the staged bytes below

    uchar4 out = make_uchar4(sdata[threadIdx.x*3], sdata[threadIdx.x*3+1], sdata[threadIdx.x*3+2], 0);
    output[idx] = out;
}

Note: I just whipped this up while replying: it is untested code. I’m not even positive it will work, but I’m pretty sure I got everything set for coalescing.

Edit: you can check for coalescing by running through the visual profiler. In this operation, memory coalescing is the ONLY thing that matters as far as performance is concerned.

After I fought through the pointer basics, I started to like them… but every day I see there is so much to learn about them.

But on to what I wanted to say about this topic:

I tried out the following:

cudaMemcpy2D(pcInputImage2, 4, inOutImage, 3, IMAGE_WIDTH, IMAGE_HEIGHT, cudaMemcpyHostToDevice);

cudaMemcpy2D(inOutImage, 3, pcInputImage2, 4, IMAGE_WIDTH, IMAGE_HEIGHT, cudaMemcpyDeviceToHost);

It works fine in both directions, and I don’t think there is a faster solution.

Cool, it works. Did you speed-test the 3- vs. 4-byte Memcpy2D between host and device? I think I agree that the speed should be fine: the host doesn’t need coalescing, and the device will probably read/write coalesced. At least it should be able to.

Are you using pinned host memory btw?


Maybe I was a little fast when I said it works fine.

//CUDA Programming Guide Version 1.0, Page 81

cudaError_t cudaMemcpy2D(void* dst, size_t dpitch,

                          const void* src, size_t spitch,

                          size_t width, size_t height,

                          enum cudaMemcpyKind kind);

When I execute the following code:

extern "C" _declspec(dllexport) float CIPDLL_ProcessImageDataNeu( unsigned char *pcSource, unsigned char *pcDestination )
{
    uchar4 *pcLocalImage;
    cudaMalloc( (void**) &pcLocalImage, IMAGE_WIDTH * IMAGE_HEIGHT * sizeof(uchar4) );

    uchar3 *pcLocalImage2;
    cudaMalloc( (void**) &pcLocalImage2, IMAGE_WIDTH * IMAGE_HEIGHT * sizeof(uchar3) );
    cudaMemcpy( pcLocalImage2, pcSource, IMAGE_WIDTH * IMAGE_HEIGHT * sizeof(uchar3), cudaMemcpyHostToDevice );
    // cudaError_t: ... no error

    cudaMemcpy2D( pcLocalImage, sizeof(uchar4), (uchar3*)pcLocalImage2, sizeof(uchar3), IMAGE_WIDTH, IMAGE_HEIGHT, cudaMemcpyDeviceToDevice );
    // cudaError_t: ... invalid argument

    // ...

Why do I get the “invalid argument” error when I perform the cudaMemcpy2D call?

I see a couple of problems:

  1. You don’t need to cast pcLocalImage2 to uchar3*; the memcpy functions simply take a pointer and don’t really care what it points to.

  2. You need IMAGE_WIDTH * sizeof(uchar3) in argument 5. I had trouble with this myself, but if you read the memcpy2D documentation really carefully, you’ll see that some arguments are in bytes and some are not.

I’d guess that the invalid argument comes from problem 2.
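For reference, cudaMemcpy2D rejects calls where the width in bytes exceeds either pitch, which is exactly what happens with dpitch = sizeof(uchar4) = 4 and width = IMAGE_WIDTH = 704. A tiny model of that documented argument constraint (my reading of the docs, not NVIDIA’s actual code):

```c
#include <stddef.h>

/* cudaMemcpy2D requires width (in bytes) to not exceed dpitch or spitch;
 * returns 1 if the arguments would pass that check, 0 otherwise. */
int memcpy2d_args_valid(size_t dpitch, size_t spitch, size_t width_bytes)
{
    return width_bytes <= dpitch && width_bytes <= spitch;
}
```

The failing call corresponds to memcpy2d_args_valid(4, 3, 704), which is 0; with pitches of a full row (704*4 and 704*3 bytes) and width 704*3 bytes, the check passes.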

I’m really curious about the speed of transfers directly from [pinned uchar3 host memory] --> [uchar4 device memory] - did you try that? Thanks!

I found out that my pitch argument was wrong and that I have to allocate the memory with cudaMallocPitch.

The following function runs without any cudaError:

extern "C" _declspec(dllexport) void CIPDLL_DivTests( unsigned char *pSource,
                                                      unsigned char *pDestin,
                                                      double *testResults,
                                                      int numberOfResults )
{
    unsigned int timer = 0;
    cutCreateTimer( &timer );
    cutResetTimer( timer );

    uchar3 *pcLocalUchar3;
    size_t pitchUchar3;
    cudaMallocPitch( (void**) &pcLocalUchar3, &pitchUchar3, IMAGE_WIDTH * sizeof(uchar3), IMAGE_HEIGHT );

    uchar4 *pcLocalUchar4;
    size_t pitchUchar4;
    cudaMallocPitch( (void**) &pcLocalUchar4, &pitchUchar4, IMAGE_WIDTH * sizeof(uchar4), IMAGE_HEIGHT );

    // Copy to device
    cudaMemcpy( pcLocalUchar3, pSource, IMAGE_WIDTH * IMAGE_HEIGHT * sizeof(uchar3), cudaMemcpyHostToDevice );

    for( int i = 0; i < numberOfResults; i++ )
    {
        cutResetTimer( timer );
        cutStartTimer( timer );

        cudaMemcpy2D( pcLocalUchar4, pitchUchar4, pcLocalUchar3, pitchUchar3, IMAGE_WIDTH, IMAGE_HEIGHT, cudaMemcpyDeviceToDevice );

        cutStopTimer( timer );
        *(testResults + i) = (double)cutGetTimerValue( timer );
    }

    cudaMemcpy2D( pcLocalUchar3, pitchUchar3, pcLocalUchar4, pitchUchar4, IMAGE_WIDTH, IMAGE_HEIGHT, cudaMemcpyDeviceToDevice );
    cudaMemcpy( pDestin, pcLocalUchar3, IMAGE_WIDTH * IMAGE_HEIGHT * sizeof(uchar3), cudaMemcpyDeviceToHost );
}



Number of measurements = 250
Mean        [ms] = 0.148122
Variance    [ms] = 0.012344
Uncertainty [ms] = 0.000780702

Number of measurements = 10000
Mean        [ms] = 0.127644
Variance    [ms] = 0.0225162
Uncertainty [ms] = 0.000225162

THANK you very much for your help!