Various general beginner questions on stream processors

Hi all,

Newbie here, got a set of quick questions. If you have insight on even just one of them please let me know.

  1. I thought the stream processors were very lightweight. I am guessing they do in-order execution, can someone verify? Is there any prefetching of memory or is it all explicit?

  2. I believe there is a library/utility that you can include in your application that will spit out pre-profiling results at compile time. It will say stuff like how many registers were used and if you’re maxing out your register limit and stuff like that. Does anyone know the name of this?

  3. Regarding coalescing of memory - I believe the memory bandwidth is only 128 bits (on what hardware?). Would that mean with coalescing, you can only max out the memory bandwidth with 16 doubles (8 bytes each)? So a half-warp can fully maximize the memory bandwidth, right? With singles (4 bytes each), you would need 32 values read in a cycle to max out the memory bandwidth, right? Since a half-warp only has 16 values, a group of coalescing threads reading in singles cannot max out the memory bandwidth. Is this case one where you would want to use the float2 struct? Is that the idea of the ___{2,3,4} structs? Is it possible to make a single thread read in a float2 worth of data in a single cycle? Maybe I’m misunderstanding this altogether…

  4. Regarding coalescing - I am using cudaMallocPitch to align every row of my array. After that I need to align my threadblock to the memory banks, right? How large is a memory bank? In other words, how do I find the exact length and multiples of the memory alignment?

  5. If I declare an array as constant where does that memory reside? For my application, this array will be used EXTREMELY often and I would want it in the fastest memory possible - possibly shared memory or register memory. So is constant putting my memory where I’d want it? Where would I want to declare this memory? In my kernel or outside it, as a global?

  6. How do I determine how many registers I get (per thread) before I run out? Will it depend on the GPU?

  7. So I realize that shared memory is shared among all cores on the SM. When a new block is executed [interleaved] won’t it have to swap out the shared memory similar to a context switch? I am possibly thinking about this all wrong.

Please help.

Thanks,
jbu

[list=1]

[*] The SMs are capable of a light form of out-of-order execution: new instructions start as soon as their operands become available. However, there is no register renaming and no speculative execution.

[*] Compile with [font="Courier New"]--ptxas-options=-v[/font] to see register and shared memory use. You can then feed this data into the occupancy calculator spreadsheet.

[*] The memory bus width depends on the specific GPU you are using. Coalescing, however, is independent of the physical bus width. It only depends on the access pattern per warp (per half-warp for compute capability 1.x devices).

There are instructions to load aligned 64 bit and 128 bit objects per thread, so a float2 (or a float4) can indeed be loaded with a single instruction.

[*] The relevant alignment for coalescing is always 128 bytes, independent of the physical bus width. Some optimizations are possible for accesses that only read aligned 64 byte or 32 byte subsets of the 128 byte segments/cachelines.

[*] Variables declared as __constant__ reside in global memory, but are cached. Constant memory accesses get serialized if the access is not uniform per warp (half-warp for 1.x devices), i.e. if the threads read different addresses. In that case, use global memory instead on 2.x devices, or textures in linear memory on 1.x devices.

[*] Compile with [font="Courier New"]--ptxas-options=-v[/font]. 1.x devices have a maximum of 127 registers/thread, 2.x devices of 63 registers/thread. However, the total number of registers per block may not exceed the total number of registers per SM. Check appendix F of the (4.0) Programming Guide.

[*] Shared memory is divided between the blocks that execute concurrently on an SM. Once a block finishes, its contents are discarded. So no swapping of shared memory is ever needed.
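
To make the points about float4 loads, pitched rows and constant memory a bit more concrete, here is a minimal sketch (not taken from the Programming Guide). It allocates a pitched array with cudaMallocPitch, keeps a small table in __constant__ memory declared at file scope, and has every thread load one float4 with a single instruction. The names scaleRows, rowAt and coeff and the array sizes are made up for illustration, and the sketch assumes the pitch returned by cudaMallocPitch is a multiple of 16 bytes so that the float4 cast stays aligned.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical coefficient table; __constant__ variables live at file scope.
__constant__ float coeff[4];

// Start of a row in a pitched allocation. Rows are `pitch` bytes apart,
// so the offset is computed in bytes before casting to float4.
// Assumes pitch is a multiple of 16 bytes.
__host__ __device__ inline float4 *rowAt(float *base, size_t pitch, int row)
{
    return reinterpret_cast<float4 *>(reinterpret_cast<char *>(base) + row * pitch);
}

__global__ void scaleRows(float *data, size_t pitch, int width4, int height)
{
    int col4 = blockIdx.x * blockDim.x + threadIdx.x;  // column in float4 units
    int row  = blockIdx.y;
    if (col4 >= width4 || row >= height) return;

    float4 v = rowAt(data, pitch, row)[col4];  // one 128-bit load per thread
    // All threads of a warp read the same coeff element in each instruction,
    // which is the uniform access pattern constant memory handles well.
    v.x *= coeff[0];
    v.y *= coeff[1];
    v.z *= coeff[2];
    v.w *= coeff[3];
    rowAt(data, pitch, row)[col4] = v;
}

int main()
{
    const int width = 64, height = 4;  // width in floats, a multiple of 4
    static float host[height][width];
    for (int r = 0; r < height; ++r)
        for (int c = 0; c < width; ++c)
            host[r][c] = 1.0f;
    const float hostCoeff[4] = {1.0f, 2.0f, 3.0f, 4.0f};

    float *dev = 0;
    size_t pitch = 0;
    if (cudaMallocPitch((void **)&dev, &pitch, width * sizeof(float), height) != cudaSuccess) {
        printf("cudaMallocPitch failed\n");
        return 1;
    }
    cudaMemcpy2D(dev, pitch, host, width * sizeof(float),
                 width * sizeof(float), height, cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(coeff, hostCoeff, sizeof(hostCoeff));

    dim3 block(16);
    dim3 grid((width / 4 + block.x - 1) / block.x, height);
    scaleRows<<<grid, block>>>(dev, pitch, width / 4, height);

    // The blocking copy back also waits for the kernel to finish.
    cudaMemcpy2D(host, width * sizeof(float), dev, pitch,
                 width * sizeof(float), height, cudaMemcpyDeviceToHost);
    printf("pitch = %zu bytes, first elements: %g %g %g %g\n",
           pitch, host[0][0], host[0][1], host[0][2], host[0][3]);

    cudaFree(dev);
    return 0;
}
```

If threads in a warp indexed coeff with different values (say coeff[threadIdx.x % 4]), the constant reads would be serialized, and plain global memory or a texture would likely be the better choice.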

If in doubt, consult the Programming Guide. It has all the information you are seeking.
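
As a rough way to check the register and shared memory numbers on your own card, something like the following should work. It asks the runtime for the per-block limits with cudaGetDeviceProperties and for one kernel's actual usage with cudaFuncGetAttributes. The kernel dummyKernel and the helper maxThreadsByRegisters are placeholders made up for this sketch, and the helper ignores register allocation granularity, so treat its result as an upper bound only.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel, only here so there is something to inspect.
__global__ void dummyKernel(float *out)
{
    out[threadIdx.x] = threadIdx.x * 2.0f;
}

// Upper bound on threads per block imposed by the register file alone.
// Ignores allocation granularity, so the real limit may be lower.
static int maxThreadsByRegisters(int regsPerThread, int regsPerBlock)
{
    return regsPerThread > 0 ? regsPerBlock / regsPerThread : 0;
}

int main()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        printf("no usable CUDA device found\n");
        return 1;
    }

    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, dummyKernel) != cudaSuccess) {
        printf("cudaFuncGetAttributes failed\n");
        return 1;
    }

    printf("compute capability:       %d.%d\n", prop.major, prop.minor);
    printf("registers per block:      %d\n", prop.regsPerBlock);
    printf("shared memory per block:  %zu bytes\n", prop.sharedMemPerBlock);
    printf("kernel registers/thread:  %d\n", attr.numRegs);
    printf("kernel static shared mem: %zu bytes\n", attr.sharedSizeBytes);
    printf("threads/block by regs:    at most %d\n",
           maxThreadsByRegisters(attr.numRegs, prop.regsPerBlock));
    return 0;
}
```

The per-kernel numbers reported here should match what ptxas prints at compile time for the same architecture.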