Best way to copy small pixel arrays in device code?

I have been reading/writing pixel values in device code using three assignments,
as in the following:

const unsigned char *src = srcImage + row*srcPitch + col*3;
unsigned char srcPixel[3];
srcPixel[0] = src[0];
srcPixel[1] = src[1];
srcPixel[2] = src[2];

In host code I can use std::copy (I am afraid cudaMemcpy is overkill in this case,
and thrust::copy does not work here). Is there a better way?
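
For comparison, here is a hedged sketch of one compact alternative: reinterpreting the packed RGB pointer as CUDA's built-in uchar3 and doing a single struct assignment. uchar3 has 1-byte alignment, so the reinterpret is safe for tightly packed 3-byte pixels; the compiler will typically still emit three byte loads, so this is a readability win rather than a guaranteed performance win. The function name copyPixel is made up for illustration.

```cuda
// Sketch (not production code): copy one packed 3-byte RGB pixel via uchar3.
// uchar3 is 1-byte aligned, so reinterpreting a packed RGB pointer is legal.
__device__ void copyPixel(const unsigned char *srcImage, size_t srcPitch,
                          int row, int col, unsigned char dstPixel[3])
{
    const uchar3 *src = reinterpret_cast<const uchar3 *>(
        srcImage + row * srcPitch + col * 3);
    uchar3 p = *src;   // one struct assignment instead of three scalar copies
    dstPixel[0] = p.x;
    dstPixel[1] = p.y;
    dstPixel[2] = p.z;
}
```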

P.S. I never understood why CUDA does not provide types and operators similar to
GLSL’s vec3, dot(), etc…

These are the kinds of reads that I would normally do through textures (e.g. texture objects). I believe CUDA supports only 1-, 2- and 4-component textures, so the source image would have to be padded with a blank alpha channel.

If you configure the texture read mode to cudaReadModeNormalizedFloat, a single texture read of a float4 gives you the RGBA components of one pixel in the 0…1 range. Reads of neighboring pixels will also benefit from the texture caches.
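
A minimal sketch of that read path, assuming the image has already been padded to uchar4 (RGBA) and bound to a texture object created with readMode = cudaReadModeNormalizedFloat (the kernel name and the out buffer are illustrative):

```cuda
// Sketch: one tex2D<float4> read returns all four channels, normalized to 0..1.
// Assumes tex was created over a uchar4 image with
// texDesc.readMode = cudaReadModeNormalizedFloat.
__global__ void readKernel(cudaTextureObject_t tex, float4 *out, int width)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    // Texel centers sit at half-integer coordinates in unnormalized mode.
    out[row * width + col] = tex2D<float4>(tex, col + 0.5f, row + 0.5f);
}
```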

There are non-member operator overloads and functions for the float3/float4 vector types that let you do useful math (see samples/common/inc/helper_math.h).
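
For example, with that header included, a luma computation reads much like GLSL (the brightness function and its weights are just an illustration, using the standard Rec. 601 coefficients):

```cuda
#include "helper_math.h"  // non-member operators, dot(), etc. for float3/float4

// Sketch: GLSL-style math on a float3 using helper_math.h's dot().
__device__ float brightness(float3 rgb)
{
    const float3 lumaWeights = make_float3(0.299f, 0.587f, 0.114f);
    return dot(rgb, lumaWeights);
}
```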

Thanks @cbuchner1
The particular case I am working on is better suited to shared memory than to textures.
It seems NVIDIA keeps helper_math.h out of the primary sources because it was only
meant for the sample demos and not for production code (I think I read a comment to
that effect somewhere), although I have used those wrappers before.
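
For completeness, a sketch of the shared-memory staging pattern mentioned above, assuming a 16x16 thread block where each thread copies its own 3-byte pixel into a shared tile before the block operates on it (the kernel name and tile shape are assumptions):

```cuda
// Sketch: stage packed RGB pixels into shared memory, one pixel per thread.
__global__ void tileKernel(const unsigned char *srcImage, size_t srcPitch)
{
    __shared__ unsigned char tile[16][16][3];
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    const unsigned char *src = srcImage + row * srcPitch + col * 3;
    tile[threadIdx.y][threadIdx.x][0] = src[0];
    tile[threadIdx.y][threadIdx.x][1] = src[1];
    tile[threadIdx.y][threadIdx.x][2] = src[2];
    __syncthreads();  // tile is now visible to the whole block
    // ... operate on tile ...
}
```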