Newbie help on thread blocks

No, each thread reads a number of values from global memory. There are 13 input-only arrays in total, and they remain unchanged throughout the execution of the program. Ten of these are pretty small, so I was able to move them into constant memory easily. The remaining three are pretty large (the same size as the I/O arrays), so I am not sure what I can do with those.
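For anyone following along, here is roughly what moving one small read-only array into constant memory looks like. This is only a minimal sketch, not my actual kernel: the names `d_coeffs`, `scaleByCoeff` and `NUM_COEFFS` and the sizes are made up for illustration, and error checking is left out to keep it short.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define NUM_COEFFS 16   // hypothetical size of one small read-only array

// Hypothetical small lookup table, placed in constant memory.
__constant__ float d_coeffs[NUM_COEFFS];

// Every thread reads the same coefficient, so the read is served from the
// constant cache and broadcast instead of going to global memory.
__global__ void scaleByCoeff(const float* in, float* out, int n, int k)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * d_coeffs[k];
}

int main()
{
    const int n = 1024;
    float h_coeffs[NUM_COEFFS];
    for (int j = 0; j < NUM_COEFFS; ++j)
        h_coeffs[j] = 0.5f * j;

    // Copy the table once, before the first kernel launch.
    cudaMemcpyToSymbol(d_coeffs, h_coeffs, sizeof(h_coeffs));

    float h_in[n], h_out[n];
    for (int i = 0; i < n; ++i)
        h_in[i] = (float)i;

    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    scaleByCoeff<<<(n + 255) / 256, 256>>>(d_in, d_out, n, 4);
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    printf("out[10] = %f\n", h_out[10]);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Constant memory works best when all threads of a warp read the same address at the same time. If threads index the table differently, the reads get serialized, so this layout is an assumption about the access pattern rather than a general recipe.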

Are there any tutorials out there on what shared memory does or how to use it effectively? I was starting to look into that yesterday…
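To make the question concrete, this is the kind of shared memory usage I have seen described: each block copies its slice of the input into shared memory once, synchronizes, and then every thread reads its neighbours' values from there instead of from global memory. The sketch below is a generic 3-point average, not my kernel. `average3`, `BLOCK` and the zero padding at the array ends are assumptions made for the example, and the kernel assumes it is launched with exactly `BLOCK` threads per block.

```cuda
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

#define BLOCK 256   // threads per block; the kernel assumes blockDim.x == BLOCK

__global__ void average3(const float* in, float* out, int n)
{
    // One slot per thread plus one halo element on each side.
    __shared__ float tile[BLOCK + 2];

    int g = blockIdx.x * blockDim.x + threadIdx.x;  // global index
    int l = threadIdx.x + 1;                        // local index, past the left halo

    // Out-of-range elements are treated as 0.
    tile[l] = (g < n) ? in[g] : 0.0f;
    if (threadIdx.x == 0)
        tile[0] = (g > 0 && g - 1 < n) ? in[g - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)
        tile[l + 1] = (g + 1 < n) ? in[g + 1] : 0.0f;

    // All loads must be finished before any thread reads a neighbour's slot.
    __syncthreads();

    if (g < n)
        out[g] = (tile[l - 1] + tile[l] + tile[l + 1]) / 3.0f;
}

int main()
{
    const int n = 1000;
    float h_in[n], h_out[n];
    for (int i = 0; i < n; ++i)
        h_in[i] = (float)i;

    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    average3<<<(n + BLOCK - 1) / BLOCK, BLOCK>>>(d_in, d_out, n);
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    // Compare against a CPU reference that uses the same zero padding.
    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        float left  = (i > 0)     ? h_in[i - 1] : 0.0f;
        float right = (i + 1 < n) ? h_in[i + 1] : 0.0f;
        float ref   = (left + h_in[i] + right) / 3.0f;
        if (fabsf(h_out[i] - ref) > 1e-3f)
            ++mismatches;
    }
    printf("mismatches: %d\n", mismatches);

    cudaFree(d_in);
    cudaFree(d_out);
    return mismatches != 0;
}
```

Shared memory only pays off when several threads in a block reuse the same loaded values, as with the neighbour reads here. If every thread touches its own elements exactly once, staging them in shared memory adds work without saving any global reads.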

Currently, it is. I am working on a way to overcome this issue. My initial testing indicates that I can cut my time in half if I remove the current branching statements.
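As an illustration of the kind of rewrite I mean (my real branches are different, so treat this as a hypothetical example), a clamp written with if/else can be expressed with `fmaxf` and `fminf` so that every thread in a warp executes the same instructions. The kernel names and the test values are invented for the sketch.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Branchy version: threads of one warp may take different paths.
__global__ void clampBranchy(const float* in, float* out, int n, float lo, float hi)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = in[i];
    if (v < lo)      out[i] = lo;
    else if (v > hi) out[i] = hi;
    else             out[i] = v;
}

// Branch-free version: the same result computed with min/max.
__global__ void clampBranchless(const float* in, float* out, int n, float lo, float hi)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    out[i] = fminf(fmaxf(in[i], lo), hi);
}

int main()
{
    const int n = 1024;
    float h_in[n], h_a[n], h_b[n];
    for (int i = 0; i < n; ++i)
        h_in[i] = (float)(i % 7) - 3.0f;   // values between -3 and 3

    float *d_in, *d_a, *d_b;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_a, n * sizeof(float));
    cudaMalloc(&d_b, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    clampBranchy<<<(n + 255) / 256, 256>>>(d_in, d_a, n, -1.0f, 1.0f);
    clampBranchless<<<(n + 255) / 256, 256>>>(d_in, d_b, n, -1.0f, 1.0f);

    cudaMemcpy(h_a, d_a, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_b, d_b, n * sizeof(float), cudaMemcpyDeviceToHost);

    // Both kernels should produce identical output.
    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (h_a[i] != h_b[i])
            ++mismatches;
    printf("mismatches: %d\n", mismatches);

    cudaFree(d_in);
    cudaFree(d_a);
    cudaFree(d_b);
    return mismatches != 0;
}
```

For a branch this small the compiler may already generate predicated code, so the two versions could well run at the same speed. The gain I measured comes from larger divergent branches, and whether a rewrite like this helps has to be timed case by case.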

I have played around with the Occupancy calculator. I am not absolutely certain that I have used it correctly, but I chose the block size that seemed appropriate according to the calculator. My experiments with different block sizes seem to bear that out for now.
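As a cross-check on the spreadsheet, newer CUDA toolkits expose the same calculation through the runtime call `cudaOccupancyMaxPotentialBlockSize`. It is not available in older releases, so this is only a sketch that assumes a recent toolkit, and `dummyKernel` is a placeholder standing in for the real kernel.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel; the suggestion depends on the real kernel's
// register and shared memory usage.
__global__ void dummyKernel(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

int main()
{
    int minGridSize = 0;  // smallest grid that can reach full occupancy
    int blockSize   = 0;  // suggested threads per block

    // No dynamic shared memory (0) and no upper limit on block size (0).
    cudaError_t err = cudaOccupancyMaxPotentialBlockSize(
        &minGridSize, &blockSize, dummyKernel, 0, 0);
    if (err != cudaSuccess) {
        printf("occupancy query failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("suggested block size: %d, minimum grid size: %d\n",
           blockSize, minGridSize);
    return 0;
}
```

The number it returns maximizes occupancy only, which is not always the fastest configuration, so timing a few block sizes as I did is still worth doing.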

I am currently using 14 registers in my kernel. I originally started with 20. I don’t think I can reduce it any further (or at least not by much) unless I split the kernel into multiple kernels. However, that would increase the number of global memory reads per iteration of the program, so I assume it would ultimately slow my program down.

Fortunately, I do not use any of these functions. Everything is straightforward multiplication, division, addition, and subtraction.

Any constants are computed prior to the first execution of the kernel and passed in.

Thanks for the comments!

I don’t think that talking about the memory access pattern would give away anything to be concerned about. Once I started playing with constant memory, I started a new thread to discuss its usage. I just started talking about the memory access pattern there last night.

Memory access pattern thread

At this point I’d start thinking about how to use more registers, not fewer. 32 is a good number, and even 64 is fine on G200.