I'm just not getting it ...

Hi all,

I’m very new to CUDA development, and I’m afraid I need a nudge in the right direction to get started. I started out by creating a simple kernel to convert UYVY 4:2:2 video images to YUYV (simple byte swapping). I did something like the following:

  1. Create UYVY and YUYV images in system memory.
  2. Allocate space using cudaMallocPitch() in the GPU to hold the source UYVY and dst YUYV images.
  3. DMA the UYVY image to the GPU using cudaMemcpy2D().
  4. Run the kernel to convert the UYVY to YUYV.
  5. DMA the result back to system memory using cudaMemcpy2D().
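
For reference, the conversion in step 4 amounts to swapping the two bytes of every 16-bit pair. Below is a minimal host-side C++ sketch of that operation, handling four bytes (two pixels) per 32-bit word. The names `swapBytePairs` and `uyvyToYuyvRow` are hypothetical and are not taken from the kernel described above; a GPU thread could in principle apply the same per-word expression.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Swap the two bytes in each 16-bit half of a 32-bit word.
// In memory order, U Y0 V Y1 (UYVY) becomes Y0 U Y1 V (YUYV).
// Both halves are swapped the same way, so the result does not
// depend on the byte order of the host.
inline std::uint32_t swapBytePairs(std::uint32_t w)
{
    return ((w & 0x00FF00FFu) << 8) | ((w >> 8) & 0x00FF00FFu);
}

// Convert one row of widthPixels pixels (2 bytes per pixel in 4:2:2).
// Assumption: widthPixels is even, so the row is a whole number of
// 4-byte groups.
inline void uyvyToYuyvRow(const std::uint8_t* src, std::uint8_t* dst,
                          std::size_t widthPixels)
{
    assert(widthPixels % 2 == 0);
    const std::size_t words = widthPixels / 2;    // 4 bytes cover 2 pixels
    for (std::size_t i = 0; i < words; ++i) {
        std::uint32_t w;
        std::memcpy(&w, src + 4 * i, sizeof w);   // alignment-safe load
        w = swapBytePairs(w);
        std::memcpy(dst + 4 * i, &w, sizeof w);   // alignment-safe store
    }
}
```

A routine like this is also handy as a CPU reference for checking the GPU output byte for byte.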

This produced the correct result but ran incredibly slowly, more slowly than an IPP routine using a single thread on the host. From here I started to simplify my problem. I ended up just trying to write a kernel to initialize a block of memory. Again, the app produced the correct result, but memset on the host ran faster (note that I’m not considering the DMA transfer times in my timings).

So … what is the correct (and fast) way to do something simple like initialize memory? Do you want to create a thread per byte, per word, or per block of bytes?
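
One commonly suggested option is a thread per 32-bit word rather than per byte, since single-byte writes from neighbouring threads may not combine into wide memory transactions on early GPUs. Whether that helps on this particular card is an assumption to verify by measurement. The host-side C++ sketch below shows only the two small calculations such an approach would need; `replicateByte` and `wordsPerRow` are hypothetical helper names, and the pitch is assumed to be a multiple of 4 bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Replicate one byte into all four bytes of a 32-bit word,
// e.g. 0xAB -> 0xABABABAB, so one store fills four pixels.
inline std::uint32_t replicateByte(std::uint8_t val)
{
    return 0x01010101u * val;
}

// Number of 32-bit words needed to cover widthBytes bytes of a row.
// Rounding up means the last word may spill into the row padding,
// which is only safe when pitchBytes is a multiple of 4 and at least
// as large as widthBytes (checked here as an assumption).
inline std::size_t wordsPerRow(std::size_t widthBytes, std::size_t pitchBytes)
{
    assert(pitchBytes % 4 == 0 && pitchBytes >= widthBytes);
    return (widthBytes + 3) / 4;
}
```

With these, a grid would be sized over `wordsPerRow(...)` columns instead of `w` columns, and each thread would store `replicateByte(val)` once.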

I’m running these tests on a dual quad-core Xeon PC running 64-bit XP with a Quadro FX 1700. The application has been compiled as a 32-bit (not 64-bit) binary.

If someone could show me a simple kernel and the call to it, I’d really appreciate it.


You should probably post source for some of your kernels and the functions that call them.

My very simple memory init kernel looks like this:

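    __global__ static void initMem(U8 *dst, size_t pitch, U8 val)
    {
        // One thread per byte; each 16x16 block covers a 16x16-pixel tile.
        int i = blockIdx.y * 16 + threadIdx.y;   // row
        int j = blockIdx.x * 16 + threadIdx.x;   // column (byte offset within the row)
        int idx = i * pitch + j;                 // pitch is the row stride in bytes
        dst[idx] = val;
    }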

and I call it from the host like the following:

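    void cudaInitMem(U8* pix, size_t pitch, int w, int h, U8 val)
    {
        PerformanceTimer timer;

        dim3 dimBlock(16, 16);
        dim3 dimGrid(w/16, h/16);

        initMem<<<dimGrid, dimBlock>>>((U8*)pix, pitch, val);

        // Catch launch errors such as an invalid grid configuration.
        if (cudaSuccess != cudaGetLastError()) {
            printf("cudaInitUYVY kernel execution failed!\n");
        }

        // The launch is asynchronous, so wait for the kernel before reading the timer.
        if (cudaSuccess != cudaThreadSynchronize())
            printf("Sync failed with %d\n", cudaGetLastError());

        I64 t = timer.getUs();
        // I64 is assumed to be a 64-bit integer, so print it with %lld rather than %d.
        printf("cudaInitUYVY in %lldus\n", (long long)t);
    }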

Note that I’m currently assuming an 8-bpp grayscale image whose width and height are both multiples of 16 for simplicity.
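
The multiple-of-16 assumption matters because `w/16` and `h/16` truncate, so any other size would leave the right and bottom edges uncovered. The usual pattern is to round the grid size up and have each thread check its coordinates before writing. Here is a host-side C++ sketch of that arithmetic; `ceilDiv` and `inBounds` are hypothetical helper names, not part of the code above.

```cpp
#include <cassert>

// Integer division rounded up: the number of block-sized tiles
// needed to cover n elements.
inline int ceilDiv(int n, int block)
{
    assert(n >= 0 && block > 0);
    return (n + block - 1) / block;
}

// Guard a kernel would apply before writing: threads in the last
// partial tile fall outside the image and must do nothing.
inline bool inBounds(int row, int col, int w, int h)
{
    return row >= 0 && row < h && col >= 0 && col < w;
}
```

Under these assumptions the launch would use `dim3 dimGrid(ceilDiv(w, 16), ceilDiv(h, 16));`, and the kernel would take `w` and `h` as extra parameters and return early when the equivalent of `inBounds(i, j, w, h)` is false.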