Question about thread configuration

Hello all.

I have a problem running image processing with CUDA.
The original image size is 8196 * 1024 (char). The device is a 9800GT.
I want to multiply every pixel of the image by a particular value.

Example:

original image ↓
(0,0)0    (1,0)255  (2,0)1  (3,0)50  (4,0)0  (5,0)0  ...  (8196,0)0
(0,1)255  (1,1)1    (2,1)0  (3,1)0   (4,1)0  (5,1)0  ...  (8196,1)0
...
(0,1023) ...



Multiply every pixel of the image by a particular value (0):

processed image ↓
(0,0)0  (1,0)0  (2,0)0  (3,0)0  (4,0)0  (5,0)0  ...  (8196,0)0
(0,1)0  (1,1)0  (2,1)0  (3,1)0  (4,1)0  (5,1)0  ...  (8196,1)0
...
(0,1023) ...

So I tried to run this program using CUDA, but I couldn't get a reasonable processing speed.

I would like to know the optimal thread configuration values (block size, grid size, number of threads per block, shared memory per block, etc.).

Thanks for reading.

I'm not entirely sure what you're trying to do, but that is not of extreme importance. Compile with the -keep option and look at the .cubin file. You should see something like this:

code {
name = foobar
lmem = 12
smem = 2112
reg = 21
bar = 1

Then plug the register usage and shared memory values (reg and smem, respectively) into the CUDA Occupancy Calculator, and play with the block size to get the optimal occupancy.
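
For reference, here is a rough sketch of what a per-pixel multiply could look like, so there is something concrete to feed into the calculator. It is not the original poster's code: the names scalePixels4 and scaleImageOnGpu are made up for illustration, the pixel count is assumed to be a multiple of 4, and the block size of 256 is only a starting guess to tune. It reads four pixels at a time as a uchar4 because, as far as I know, single-byte accesses do not coalesce on compute capability 1.x parts such as the 9800GT. A kernel this simple needs no shared memory.

```cuda
#include <cuda_runtime.h>

// Multiply one 8-bit pixel by 'factor' and clamp the result to 0..255.
__device__ unsigned char scaleOne(unsigned char v, float factor)
{
    return (unsigned char)fminf(fmaxf(v * factor, 0.0f), 255.0f);
}

// Each thread handles groups of four pixels (one 32-bit word per access).
// The grid-stride loop keeps the kernel correct when the grid is capped.
__global__ void scalePixels4(uchar4* pixels, int count4, float factor)
{
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < count4; i += stride) {
        uchar4 p = pixels[i];
        p.x = scaleOne(p.x, factor);
        p.y = scaleOne(p.y, factor);
        p.z = scaleOne(p.z, factor);
        p.w = scaleOne(p.w, factor);
        pixels[i] = p;
    }
}

// Hypothetical host helper: copies the image to the GPU, scales it in place,
// and copies it back. Returns false on any CUDA error or if pixelCount is
// not a positive multiple of 4 (an assumption of this sketch).
bool scaleImageOnGpu(unsigned char* hostPixels, int pixelCount, float factor)
{
    if (pixelCount <= 0 || pixelCount % 4 != 0) return false;

    size_t bytes = (size_t)pixelCount;   // one byte per pixel
    uchar4* dev = NULL;
    if (cudaMalloc((void**)&dev, bytes) != cudaSuccess) return false;

    bool ok = cudaMemcpy(dev, hostPixels, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        int count4  = pixelCount / 4;
        int threads = 256;                                // starting guess, tune it
        int blocks  = (count4 + threads - 1) / threads;   // round up
        if (blocks > 65535) blocks = 65535;               // 1D grid limit on older GPUs

        scalePixels4<<<blocks, threads>>>(dev, count4, factor);

        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(hostPixels, dev, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dev);
    return ok;
}
```

You would call it as scaleImageOnGpu(image, 8196 * 1024, 0.0f) for the example above. Keep in mind that for an operation this cheap, the host-to-device and device-to-host copies may well take longer than the kernel itself, so time the kernel and the transfers separately before blaming the thread configuration.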