the gpu has 8k or 16k registers per multiprocessor (depending on the compute capability, see the programming guide for more info), which are shared between all threads within a thread block. so if you have, e.g., 256 threads per block, each thread can use up to 32 or 64 registers.
you can have a maximum of 512 threads per block.
in total you should (and probably will) have thousands or even millions of threads, but they are divided into many blocks.
i.e. the following call will generate a total of 128*128*256 = 4M threads. they are organized as 128*128 = 16k blocks with 256 threads/block.
myKernel<<<dim3(128,128),dim3(256)>>>(…);
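for illustration, a kernel launched like that could compute its unique global index as follows (a minimal sketch, the kernel body is just a placeholder):

__global__ void myKernel(float *data)
{
    // linear block index within the 128x128 grid
    unsigned int block = blockIdx.y * gridDim.x + blockIdx.x;
    // unique global thread index in [0, 128*128*256)
    unsigned int tid = block * blockDim.x + threadIdx.x;
    data[tid] = 0.0f; // placeholder per-thread work
}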
btw, your “optimization” will run much slower on a gpu than the original version, as reading data from global memory outweighs any and all calculation. also you should take care to interleave the threads (e.g. in iteration i, thread x reads element numThreads*i+x) to get coalesced reads.
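in code the interleaving looks roughly like this (a sketch with my own names, not your actual kernel):

__global__ void readInterleaved(const float *in, float *out, unsigned int n)
{
    unsigned int numThreads = gridDim.x * blockDim.x;        // total number of threads
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's global index
    float sum = 0.0f;
    // in iteration i, thread x reads element numThreads*i + x, so the threads
    // of a warp always touch neighboring addresses -> coalesced
    for (unsigned int i = 0; numThreads * i + x < n; ++i)
        sum += in[numThreads * i + x];
    out[x] = sum; // one partial result per thread
}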
[edit]
don’t bother with micro-optimizations like those in your last post; on a gpu it’s all about “bandwidth” (in this case device-to-device bandwidth ;-)) and coalescing.
don’t forget that “in” and “out” have to be in device memory, so if you don’t have cudaMemcpy calls surrounding your kernel call, there is something wrong. ;-)
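the usual host-side pattern looks roughly like this (yourKernel, the launch config and the 256-entry “out” array are my placeholders, adjust to your code):

// sketch: wraps the kernel call with the required copies
void runOnGpu(const unsigned char *h_in, unsigned int *h_out, unsigned int n)
{
    unsigned char *d_in;   // device copy of "in"
    unsigned int  *d_out;  // device copy of "out"
    cudaMalloc((void**)&d_in,  n * sizeof(unsigned char));
    cudaMalloc((void**)&d_out, 256 * sizeof(unsigned int));
    cudaMemcpy(d_in, h_in, n * sizeof(unsigned char), cudaMemcpyHostToDevice); // host -> device
    cudaMemset(d_out, 0, 256 * sizeof(unsigned int));                          // clear counters
    yourKernel<<<dim3(128,128), dim3(256)>>>(d_in, d_out, n);                  // kernel sees device memory only
    cudaMemcpy(h_out, d_out, 256 * sizeof(unsigned int), cudaMemcpyDeviceToHost); // device -> host
    cudaFree(d_in);
    cudaFree(d_out);
}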
this way you have coalesced reads of “in” (given that you choose “nice” numbers for blockDim, i.e. multiples of 64).
the read and write of “out” will only be coalesced if the same letter is found in neighboring positions.
i’ve used valsPerThread as i’m not sure how much overhead is generated; just try some different values.
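for reference, here is roughly the kind of kernel i mean; just a sketch, assuming you count letter occurrences into “out” (the atomicAdd avoids write races and needs compute capability 1.1+ for global memory, your actual code may differ):

__global__ void countLetters(const unsigned char *in, unsigned int *out,
                             unsigned int n, unsigned int valsPerThread)
{
    unsigned int numThreads = gridDim.x * blockDim.x;
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    for (unsigned int i = 0; i < valsPerThread; ++i) {
        unsigned int idx = numThreads * i + x;   // interleaved -> coalesced read of "in"
        if (idx < n)
            atomicAdd(&out[in[idx]], 1u);        // scattered write, only coalesced by luck
    }
}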
Ok thanks, but tell me: can I also define Letter, in, and out as wchar_t to save memory?
Do you know if this coalescing stuff also applies to ‘normal’ DDR2 RAM read/written by the CPU? I ran some tests but could not see any difference between t1m1->t1m2… and t1m1->t2m1…
memory access on the cpu is completely different, as a cpu has lots of different caches…
you should be able to use wchar_t as well; if not, just try unsigned short on windows or unsigned int on linux (wchar_t is 16 bits with MSVC and 32 bits with gcc, so those are the matching sizes).
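a quick way to check what you get on your platform (plain host code, compiles with nvcc as well):

#include <stddef.h>
#include <stdio.h>
int main(void)
{
    // MSVC prints 2 (16-bit wchar_t), gcc on linux prints 4 (32-bit)
    printf("sizeof(wchar_t) = %u\n", (unsigned)sizeof(wchar_t));
    return 0;
}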
Sort of. DRAM always needs “coalescing,” but the CPU’s cache buffers it. If you request a few bytes, the CPU will do a coalesced 32-byte transfer anyway and save it in its cache. If you then request a few bytes elsewhere in that same cache line, you won’t have to go to DRAM at all.
A GPU, without a cache, will go to DRAM every time and waste tons of bandwidth, so you want to get as much out of every 32 (or 64/128) byte access as possible.
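To see the effect yourself, compare two kernels like these (just a sketch; names and the stride are placeholders). The strided version wastes most of every 32/64/128 byte transaction:

__global__ void copyCoalesced(const float *in, float *out, unsigned int n)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];   // a warp reads one contiguous segment
}

__global__ void copyStrided(const float *in, float *out, unsigned int n, unsigned int stride)
{
    unsigned int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n)
        out[i] = in[i];   // a warp's reads are scattered -> roughly one transaction each
}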