Number of threads for processing 640x480 image

Hello everyone,
I am a beginner in CUDA programming. I am trying to convert an RGB 24-bit BMP image to the YUV color format. My image is 640x480 = 307200 pixels, and each pixel has 3 bytes (R, G, B), so 921600 bytes in total. To convert it to YUV, do I need to create 921600 threads, or can 1 thread per pixel be written so that it calculates Y, U and V separately?

I think 1 thread per pixel will suffice, so that’s 640x480 = 307200 threads. However, reading/writing 3 bytes at a time might perhaps create some race conditions or other problems… maybe CUDA can only safely read/write 4 bytes (an integer) at a time.
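
A minimal sketch of the one-thread-per-pixel idea, assuming a packed buffer in R, G, B order (BMP files themselves store B, G, R, so the order may need swapping) and assuming full-range BT.601 coefficients, since the thread does not say which YUV variant is wanted. The names toByte, rgbToYuvPixel and rgbToYuvPerPixel are hypothetical. Each thread reads and writes only its own 3 bytes; single-byte loads and stores to distinct addresses are well defined in CUDA, so no race arises here, although packed 3-byte pixels may not give ideally coalesced memory access.

```cuda
#include <cstdint>
#include <cuda_runtime.h>

// Clamp a float to [0, 255] and round to the nearest byte.
__host__ __device__ inline uint8_t toByte(float v) {
    v = v < 0.0f ? 0.0f : (v > 255.0f ? 255.0f : v);
    return static_cast<uint8_t>(v + 0.5f);
}

// Convert one packed pixel. ASSUMPTION: full-range BT.601 (JPEG-style) weights
// and R, G, B byte order; neither is stated in the thread.
__host__ __device__ inline void rgbToYuvPixel(const uint8_t* rgb, uint8_t* yuv) {
    const float r = rgb[0], g = rgb[1], b = rgb[2];
    yuv[0] = toByte( 0.299f    * r + 0.587f    * g + 0.114f    * b);
    yuv[1] = toByte(-0.168736f * r - 0.331264f * g + 0.5f      * b + 128.0f);
    yuv[2] = toByte( 0.5f      * r - 0.418688f * g - 0.081312f * b + 128.0f);
}

// One thread per pixel: thread i reads bytes 3i..3i+2 of rgb and writes
// bytes 3i..3i+2 of yuv, so no two threads touch the same byte.
__global__ void rgbToYuvPerPixel(const uint8_t* rgb, uint8_t* yuv, int numPixels) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numPixels) {
        rgbToYuvPixel(rgb + 3 * i, yuv + 3 * i);
    }
}
```

With device pointers d_rgb and d_yuv (hypothetical names), a launch such as rgbToYuvPerPixel<<<(307200 + 255) / 256, 256>>>(d_rgb, d_yuv, 307200) would cover the 640x480 image with 1200 blocks of 256 threads.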

Thank you :).

Yeah. In Visual Studio, an integer occupies 4 bytes.

After reading a lot of material, I found that the usual approach is to divide the image into 16x16 blocks, where the height and width of the image are multiples of 16. My image (640x480) satisfies that, so the whole image can be processed this way. Am I right?

Below is what I have written in my code:

dim3 block(16,16); // 16x16 = 256
dim3 grid(hp->biWidth/16,hp->biHeight/16); // 40x30 = 1200 blocks (grid.x spans the width, grid.y the height) // 1200x256 = 307200 threads
rgbToyuv<<<grid,block>>>(d_hp, d_data, height, width);
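
The launch configuration above can be sketched with x mapped to columns and y to rows, which is the usual convention but an assumption about how rgbToyuv indexes its data. The helper names divUp, gridFor, invertPixels2D and launchInvert are hypothetical, and the kernel body only inverts bytes as a stand-in for the real conversion. Rounding the grid up and guarding the bounds inside the kernel keeps the launch correct even when a dimension is not a multiple of 16.

```cuda
#include <cstdint>
#include <cuda_runtime.h>

// Hypothetical helper: blocks needed to cover n items, rounding up.
inline unsigned int divUp(unsigned int n, unsigned int blockSize) {
    return (n + blockSize - 1) / blockSize;
}

// Hypothetical helper: grid.x covers the width (columns), grid.y the height (rows).
inline dim3 gridFor(int width, int height, dim3 block) {
    return dim3(divUp(width, block.x), divUp(height, block.y));
}

// Stand-in kernel: inverts every byte of a packed 3-byte-per-pixel image.
__global__ void invertPixels2D(uint8_t* data, int width, int height) {
    const int col = blockIdx.x * blockDim.x + threadIdx.x;
    const int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col >= width || row >= height) return;  // threads past the image edge do nothing

    const int base = 3 * (row * width + col);   // first byte of this pixel
    for (int c = 0; c < 3; ++c) {
        data[base + c] = 255 - data[base + c];
    }
}

// Hypothetical launcher; d_data must already point to device memory.
inline void launchInvert(uint8_t* d_data, int width, int height) {
    const dim3 block(16, 16);
    const dim3 grid = gridFor(width, height, block);  // 40 x 30 blocks for 640x480
    invertPixels2D<<<grid, block>>>(d_data, width, height);
}
```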

When we say 16x16 = 256, does each block process 256 contiguous values of the image's 1D array, or does it mean something different?

I don’t think that is right. 16x16 is meant for video processing and stuff like that. If all you want to do is convert RGB to YUV, then my guess is that linear processing would work best. Did you find some CUDA-related documentation that stated otherwise? If so, a link to it would be nice. If it was just some general-purpose doc you found, then that might not be the most ideal.
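
One way to read "linear processing" is a 1D launch in which each thread walks the pixel array with a grid-stride loop, so any number of threads can cover any number of pixels. The sketch below computes only the Y (luma) plane, assuming BT.601 weights and R, G, B byte order; lumaOf and lumaLinear are hypothetical names, not something from this thread.

```cuda
#include <cstdint>
#include <cuda_runtime.h>

// ASSUMPTION: BT.601 luma weights; result rounded to the nearest byte.
__host__ __device__ inline uint8_t lumaOf(uint8_t r, uint8_t g, uint8_t b) {
    const float y = 0.299f * r + 0.587f * g + 0.114f * b;
    return static_cast<uint8_t>(y > 255.0f ? 255.0f : y + 0.5f);
}

// Grid-stride loop: thread t handles pixels t, t + stride, t + 2 * stride, ...
// where stride is the total number of threads in the grid.
__global__ void lumaLinear(const uint8_t* rgb, uint8_t* yPlane, int numPixels) {
    const int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < numPixels; i += stride) {
        yPlane[i] = lumaOf(rgb[3 * i], rgb[3 * i + 1], rgb[3 * i + 2]);
    }
}
```

With hypothetical device pointers d_rgb and d_y, a launch such as lumaLinear<<<120, 256>>>(d_rgb, d_y, 307200) would have each of the 30720 threads handle 10 pixels of the 640x480 image.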

Thank you for replying, Skybuck :),

Yes, I am doing this for my video processing project. After extracting the frames, I started with basic code that works on a single frame; later it can be extended with a loop or something similar to process all frames.

If my logic is correct, the kernel function I have written means that 256 contiguous pixels are processed by each block. In that way, the whole image gets processed.
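
To check the "contiguous pixels" reading, the index arithmetic of a 2D launch can be written out. In this sketch (hypothetical names, assuming row-major storage and x mapped to columns), a 16x16 block covers a 16x16 tile: 16 contiguous pixels on each of 16 rows, with consecutive rows one image width apart in memory, rather than one run of 256 contiguous pixels.

```cuda
#include <cuda_runtime.h>

// Row-major pixel index for thread (tx, ty) of block (bx, by).
__host__ __device__ inline int pixelIndex(int bx, int by, int tx, int ty,
                                          int blockW, int blockH, int width) {
    const int col = bx * blockW + tx;
    const int row = by * blockH + ty;
    return row * width + col;
}

// Records, for every pixel, the linear id of the block that handled it.
__global__ void recordOwningBlock(int* owner, int width, int height) {
    const int col = blockIdx.x * blockDim.x + threadIdx.x;
    const int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height) {
        const int idx = pixelIndex(blockIdx.x, blockIdx.y, threadIdx.x, threadIdx.y,
                                   blockDim.x, blockDim.y, width);
        owner[idx] = blockIdx.y * gridDim.x + blockIdx.x;
    }
}
```

For a 640-pixel-wide image, thread (15, 0) of block (0, 0) maps to index 15, while thread (0, 1) of the same block maps to index 640, the start of the next row.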

Does CUDA support writing values in parallel?