If I have one of these:
Device 0: “GeForce 8800 GTX”
Major revision number: 1
Minor revision number: 0
Total amount of global memory: 804585472 bytes
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 1350000 kilohertz
What is the expected maximum memory bandwidth, and how do I reach it from my threads?
Has anyone done a study of the bandwidth achieved under different loading schemes in a CUDA thread?
E.g. should I load float4s or char1s, should every thread do a load or only some of them, and should I load via texture fetches or plain device memory accesses?
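To make the question concrete, here is a rough sketch of the kind of test I have in mind. It is only an assumption about how one might measure this, not a validated benchmark: the names copy_kernel, measure_copy and gigabytes_per_second are made up for illustration, the 64 MiB buffer size, 256 threads per block and 20 repeats are arbitrary choices, and it only compares plain device memory copies of different element widths (texture fetches are not covered). Each thread reads consecutive elements in a grid-stride loop, and the grid is capped at 65535 blocks to stay within the grid limit listed above.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s (line %d)\n",             \
                    cudaGetErrorString(err_), __LINE__);              \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Grid-stride copy: neighbouring threads touch neighbouring elements.
template <typename T>
__global__ void copy_kernel(const T* in, T* out, size_t n) {
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
         i < n; i += stride) {
        out[i] = in[i];
    }
}

// Convert bytes moved and elapsed milliseconds to GB/s (1 GB = 1e9 bytes).
double gigabytes_per_second(double bytes, double ms) {
    return (bytes / 1e9) / (ms / 1e3);
}

// Time `repeats` copies of `total_bytes` using element type T.
template <typename T>
double measure_copy(size_t total_bytes, int threads_per_block, int repeats) {
    size_t n = total_bytes / sizeof(T);
    T* in = NULL;
    T* out = NULL;
    CHECK(cudaMalloc((void**)&in, n * sizeof(T)));
    CHECK(cudaMalloc((void**)&out, n * sizeof(T)));
    CHECK(cudaMemset(in, 0, n * sizeof(T)));

    // Cap the grid at 65535 blocks; the stride loop covers the rest.
    size_t needed = (n + threads_per_block - 1) / threads_per_block;
    unsigned int blocks = (unsigned int)(needed < 65535 ? needed : 65535);

    // Warm-up launch, not timed.
    copy_kernel<T><<<blocks, threads_per_block>>>(in, out, n);
    CHECK(cudaGetLastError());

    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));
    CHECK(cudaEventRecord(start, 0));
    for (int r = 0; r < repeats; ++r) {
        copy_kernel<T><<<blocks, threads_per_block>>>(in, out, n);
    }
    CHECK(cudaEventRecord(stop, 0));
    CHECK(cudaEventSynchronize(stop));
    CHECK(cudaGetLastError());

    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFree(in));
    CHECK(cudaFree(out));

    // Every element is read once and written once per repeat.
    double bytes_moved = 2.0 * (double)(n * sizeof(T)) * repeats;
    return gigabytes_per_second(bytes_moved, ms);
}

int main() {
    const size_t total_bytes = 64u << 20;  // 64 MiB per buffer (arbitrary)
    const int tpb = 256;                   // threads per block (arbitrary)
    const int repeats = 20;

    printf("char   : %.1f GB/s\n", measure_copy<char>(total_bytes, tpb, repeats));
    printf("float  : %.1f GB/s\n", measure_copy<float>(total_bytes, tpb, repeats));
    printf("float4 : %.1f GB/s\n", measure_copy<float4>(total_bytes, tpb, repeats));
    return 0;
}
```

The reported figure counts both the read and the write, so it is a copy bandwidth rather than a pure load bandwidth. The device listing above does not show the memory clock or bus width, so I cannot work out the theoretical peak to compare against from that output alone.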
Thanks for helping a newbie!