Mapping BGR to ARGB for video capture

Hello all,

I have been doing some image-processing computations with CUDA, using a webcam as the image source and NVIDIA’s image denoising sample as a base. The denoising sample uses an unsigned int to represent the color of a pixel (ARGB).

Unlike the original denoising code, I am using PBOs (not textures) to pass data to the device (via the glBindBuffer - glBufferData - glMapBufferARB - memcpy - glUnmapBufferARB cycle). The video-capture library I am using delivers the pixels as BGR, so instead of a plain memcpy I copy the video buffer into the PBO with this loop:

GLubyte* ptr = (GLubyte*)glMapBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB, GL_WRITE_ONLY_ARB);

// change from BGR to ARGB in order to satisfy memory coalescing requirements
for (int r = 0; r < imageH; r++)
{
	for (int c = 0; c < imageW; c++)
	{
		ptr[r*imageW*4 + 4*c]     = frame[r*imageW*3 + 3*c + 2];  // R
		ptr[r*imageW*4 + 4*c + 1] = frame[r*imageW*3 + 3*c + 1];  // G
		ptr[r*imageW*4 + 4*c + 2] = frame[r*imageW*3 + 3*c];      // B
		ptr[r*imageW*4 + 4*c + 3] = 0;                            // alpha
	}
}

As you can guess, this hurts performance. I was wondering whether I could do this copy more efficiently, maybe by using texture references? Any suggestions appreciated…

If the camera frame buffer is in pinned memory, the most efficient way would be to map it and run a conversion function on the GPU:
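A minimal sketch of that path (untested) — it assumes the frame was allocated with cudaHostAlloc() using the cudaHostAllocMapped flag, so cudaHostGetDevicePointer() can expose it to the device; the kernel name and launch configuration below are just illustrative:

__global__ void bgr_to_argb(const unsigned char *in, unsigned int *out, int w, int h)
{
	int x = blockIdx.x * blockDim.x + threadIdx.x;
	int y = blockIdx.y * blockDim.y + threadIdx.y;
	if (x >= w || y >= h)
		return;

	const unsigned char *p = &in[(y * w + x) * 3];       // B, G, R
	out[y * w + x] = (p[2] << 16) | (p[1] << 8) | p[0];  // 0x00RRGGBB, i.e. ARGB with alpha = 0
}

// host side (illustrative), writing straight into the mapped PBO:
// unsigned char *d_frame;  unsigned int *d_pbo;
// cudaHostGetDevicePointer((void **)&d_frame, h_frame, 0);
// cudaGLMapBufferObject((void **)&d_pbo, pbo);
// dim3 block(16, 16), grid((w + 15) / 16, (h + 15) / 16);
// bgr_to_argb<<<grid, block>>>(d_frame, d_pbo, w, h);
// cudaGLUnmapBufferObject(pbo);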

If the camera frame buffer can’t be mapped, your memory bandwidth use will probably double, but the conversion can still be done efficiently if you write it with SSE integer intrinsics:

  1. read 16 RGB triples (48 bytes) with 3 _mm_load_si128() loads

  2. use _mm_shuffle_epi8() (SSSE3, needs Core or better) to permute the data;
    you’ll also need OR operations to join pixels that span 2 SSE registers

  3. write the results back with _mm_store_si128()

To save you the trouble, the code is something like this:

This is just off the top of my head and not tested (I’ve been using SSE a lot these days):

#include <emmintrin.h>
#include <tmmintrin.h>
#include <stdint.h>

// Expand 3-byte pixels to 4-byte pixels (fourth byte zeroed), 16 pixels per
// iteration. in, out and both pitches must be 16-byte aligned.
void convert_to_argb(const uint8_t *in, uint8_t *out,
                     unsigned w, unsigned h, unsigned in_pitch, unsigned out_pitch)
{
	unsigned unroll_end_x = w & ~15u;   // round w down to a multiple of 16

	// in a shuffle mask, 0x80 zeroes the destination byte
	__m128i mask0 = _mm_set_epi8(0x80, 11, 10, 9, 0x80, 8, 7, 6, 0x80, 5, 4, 3, 0x80, 2, 1, 0),
	        mask1 = _mm_set_epi8(0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 15, 0x80, 14, 13, 12),
	        mask2 = _mm_set_epi8(0x80, 7, 6, 5, 0x80, 4, 3, 2, 0x80, 1, 0, 0x80, 0x80, 0x80, 0x80, 0x80),
	        mask3 = _mm_set_epi8(0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 15, 14, 0x80, 13, 12, 11, 0x80, 10, 9, 8),
	        mask4 = _mm_set_epi8(0x80, 3, 2, 1, 0x80, 0, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80),
	        mask5 = _mm_set_epi8(0x80, 15, 14, 13, 0x80, 12, 11, 10, 0x80, 9, 8, 7, 0x80, 6, 5, 4);

	for (unsigned y = 0; y < h; ++y)
	{
		unsigned x;
		for (x = 0; x < unroll_end_x; x += 16)
		{
			// pixels 0-3 come entirely from the first 16 input bytes
			__m128i src0 = _mm_load_si128((const __m128i *)&in[y * in_pitch + x * 3]),
			        dst0 = _mm_shuffle_epi8(src0, mask0);
			_mm_store_si128((__m128i *)&out[y * out_pitch + x * 4], dst0);

			// pixels 4-7 straddle src0 and src1, so OR two shuffles together
			__m128i dst1 = _mm_shuffle_epi8(src0, mask1),
			        src1 = _mm_load_si128((const __m128i *)&in[y * in_pitch + x * 3 + 16]),
			        dst1_hi = _mm_shuffle_epi8(src1, mask2);
			dst1 = _mm_or_si128(dst1, dst1_hi);
			_mm_store_si128((__m128i *)&out[y * out_pitch + x * 4 + 16], dst1);

			// pixels 8-11 straddle src1 and src2
			__m128i dst2 = _mm_shuffle_epi8(src1, mask3),
			        src2 = _mm_load_si128((const __m128i *)&in[y * in_pitch + x * 3 + 32]),
			        dst2_hi = _mm_shuffle_epi8(src2, mask4);
			dst2 = _mm_or_si128(dst2, dst2_hi);
			_mm_store_si128((__m128i *)&out[y * out_pitch + x * 4 + 32], dst2);

			// pixels 12-15 come entirely from src2
			__m128i dst3 = _mm_shuffle_epi8(src2, mask5);
			_mm_store_si128((__m128i *)&out[y * out_pitch + x * 4 + 48], dst3);
		}

		// handle the remaining (w % 16) pixels one at a time
		while (x < w)
		{
			out[y * out_pitch + x * 4]     = in[y * in_pitch + x * 3];
			out[y * out_pitch + x * 4 + 1] = in[y * in_pitch + x * 3 + 1];
			out[y * out_pitch + x * 4 + 2] = in[y * in_pitch + x * 3 + 2];
			out[y * out_pitch + x * 4 + 3] = 0;
			++x;
		}
	}
}

correction:

use _mm_loadu_si128() and _mm_storeu_si128(), because there’s no guarantee all rows of either the source or the destination image are 16-byte aligned, unless you pad them.
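i.e. the first load/store pair above becomes:

__m128i src0 = _mm_loadu_si128((const __m128i *)&in[y * in_pitch + x * 3]),
        dst0 = _mm_shuffle_epi8(src0, mask0);
_mm_storeu_si128((__m128i *)&out[y * out_pitch + x * 4], dst0);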
