Hi all,
Newbie here, with a set of quick questions. If you have insight on even just one of them, please let me know.
-
I thought the stream processors were very lightweight. I am guessing they do in-order execution; can someone verify? Is there any prefetching of memory, or is it all explicit?
-
I believe there is a library/utility you can include in your application that spits out pre-profiling results at compile time: things like how many registers were used and whether you’re maxing out the register limit. Does anyone know the name of this?
-
Regarding coalescing of memory - I believe the memory bus is only 128 bits wide (on what hardware?). Would that mean that, with coalescing, you can only max out the memory bandwidth with 16 doubles (8 bytes each)? So a half-warp can fully saturate the memory bandwidth, right? With singles (4 bytes each), you would need 32 values read in a cycle to max it out, right? Since a half-warp only has 16 threads, a group of coalescing threads reading singles cannot max out the memory bandwidth. Is this a case where you would want to use the float2 struct? Is that the idea behind the ___{2,3,4} structs? Is it possible for a single thread to read a float2 worth of data in a single cycle? Maybe I’m misunderstanding this altogether…
-
Regarding coalescing - I am using cudaMallocPitch to align every row of my array. After that, I need to align my thread block to the memory banks, right? How large is a memory bank? In other words, how do I find the exact length and multiples of the memory alignment?
-
If I declare an array as __constant__, where does that memory reside? For my application, this array will be used EXTREMELY often, and I would want it in the fastest memory possible - possibly shared memory or registers. So does __constant__ put my memory where I’d want it? And where should I declare this array - in my kernel, or outside it as a global?
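In case my terminology is off, here is the shape of what I'm trying. The array name and size are just placeholders for my real data:

```cuda
// File scope (outside any kernel) - this is where I've seen __constant__ declared.
__constant__ float coeffs[64];           // placeholder name and size

__global__ void myKernel(float *out) {
    int i = threadIdx.x;
    out[i] = coeffs[i % 64] * 2.0f;      // every thread reads coeffs heavily
}

// Host side, before launching the kernel:
//   float host_coeffs[64] = { /* ... */ };
//   cudaMemcpyToSymbol(coeffs, host_coeffs, sizeof(host_coeffs));
```

Is this the right pattern, and is constant memory actually fast for this access pattern, or should I be staging the array into shared memory inside the kernel instead?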
-
How do I determine how many registers I get per thread before I run out? Does it depend on the GPU?
-
So I realize that shared memory is shared among all cores on the SM. When a new block is executed (interleaved), won’t it have to swap out the shared memory, similar to a context switch? I may be thinking about this all wrong.
Please help.
Thanks,
jbu