number of registers of a GPU processor vs. 5 general purpose registers in an x86 CPU

Ok I wrote a program that looks a bit like this:


[codebox]wchar_t Stream[1000000];
int StreamFlow[1000000];

void GiveFlow(register wchar_t Letter, int StreamStartPos, register int StreamEndPos)
{
    for (register int i = StreamStartPos; i < StreamEndPos; i++)
        if (Stream[i] == Letter)
            StreamFlow[i] += 100000;
}[/codebox]


I marked the variables that should stay within CPU registers with ‘register’ (MSDN actually says ‘register’ may be used in code

but is ignored by the compiler; I used it for clarification).

So this code should use 3 registers (i, StreamEndPos, Letter).

Will it also work well on a GPU, or does a GPU have so few (general purpose) registers that there’d be high memory latency?

Can you give me tips on how to improve the performance of that code?

the gpu has 8k or 16k registers (depending on the compute capability, see the programming guide for more info), which are shared among all threads within a thread block. so if you have, e.g., 256 threads per block, each thread can use up to 32 or 64 registers.

Ok, but in another post someone wrote that you have a few thousand threads. Have a look here, post #7: …this+gonna+rock!

So 8k or 16k registers and 16k threads -> one register per thread :(

Can I define how many threads there are? I thought that depends on the number of kernels.

I am confused…

BTW, I found a way to speed up the code:

[codebox]wchar_t Stream[1000000];
int StreamFlow[1000000];

void GiveFlow(register wchar_t Letter, int StreamStartPos, register int StreamEndPos)
{
    register int i = StreamStartPos;
    wchar_t LetterOriginal = Stream[StreamEndPos];
    Stream[StreamEndPos] = Letter;  // sentinel: the loop is guaranteed to hit it

    while (1)
    {
        if (Stream[i] == Letter)
        {
            if (i >= StreamEndPos) break;  // that match was the sentinel
            StreamFlow[i] += 100000;
        }
        i++;
    }

    Stream[StreamEndPos] = LetterOriginal;  // restore the overwritten element
}[/codebox]


This way you save the termination check of the for-loop. Note that Stream contains natural-language text,

so the letter comparison won’t hit that often.

You only have to make sure that this kernel does not overwrite the start of the next kernel’s Stream range

(Stream[StreamEndPos] of kernel #1 is Stream[StreamStartPos] of kernel #2).


does anybody know if this would work on a GPU, too:

[codebox]while (1)
{
    if (Stream[i    ] == Letter)
    {
        if (i     >= StreamEndPos) break;
        StreamFlow[i    ] += 100000;
    }

    if (Stream[i + 1] == Letter)
    {
        if (i + 1 >= StreamEndPos) break;
        StreamFlow[i + 1] += 100000;
    }

    if (Stream[i + 2] == Letter)
    {
        if (i + 2 >= StreamEndPos) break;
        StreamFlow[i + 2] += 100000;
    }

    if (Stream[i + 3] == Letter)
    {
        if (i + 3 >= StreamEndPos) break;
        StreamFlow[i + 3] += 100000;
    }

    i += 4;
}[/codebox]


A CPU has a dedicated address-generation unit for those index calculations, so

Stream[i + 1] takes the same number of CPU cycles as

Stream[i]: the ‘+ 1’ is handled within one cycle by that extra unit.

Can a GPU do that as well?

Thanks in advance!

you can have a maximum of 512 threads per block.
in total, you sure should (and probably will) have thousands or even millions of threads, but they are divided into several blocks.
e.g. a launch with a 128*128 grid and 256 threads per block generates a total of 128*128*256 = 4M threads; they are organized as 128*128 = 16k blocks with 256 threads/block.

btw your “optimization” will be much slower on a gpu than the original version, as reading data from global memory outweighs any amount of calculation. also you should take care to interleave the threads (e.g. thread x reads elements numThreads*i+x) to get coalesced reads.

don’t bother with such things as in your last post, on a gpu it’s all about “bandwidth” (in this case device-device bandwidth ;-)) and coalescing.


And what if I copied Stream and StreamFlow into the device memory?

I’m not sure if I understood that right in the CUDA manual: does that mean that in memory, I must have:

t1m1 t2m1 t3m1 t4m1 t1m2 t2m2 t3m2 t4m2 t1m3…

and NOT

t1m1 t1m2 t1m3 t1m4 t2m1 t2m2…

(t=thread, m=32 bit memory value (one int))


How must I change my code to get maximal bandwidth? Is there any possibility?


“global memory” is memory on the device, which has a bandwidth of “only” 100GB/s on a gt200.

that’s the access path where coalescing matters.

a kernel doing the same as written in your first post would look like this with cuda:

__global__ static void kernel(int *in,int *out,int letter,int valsPerThread){

  int tid=blockIdx.x*blockDim.x+threadIdx.x;
  int numThreads=gridDim.x*blockDim.x;

  for(int i=0;i<valsPerThread;i++){
    int idx=i*numThreads+tid;   // interleaved: neighboring threads read neighboring elements
    if(in[idx]==letter)
      out[idx]+=100000;
  }
}
don’t forget that “in” and “out” have to be in device memory, so if you don’t have memcpys surrounding your kernel call, there is something wrong. ;-)

this way you have coalesced reads of “in” (given that you choose “nice” numbers for blockDim, i.e. multiples of 64).

the reads and writes of “out” will only be coalesced if the same letter is found in neighboring positions.

i’ve used valsPerThread as i’m not sure how much overhead is generated; just try some different values.

Ok, thanks. But tell me: can I also declare Letter, in, and out as wchar_t to save memory?

Do you know if this coalescing stuff also applies to ‘normal’ DDR2 RAM read/written by the
CPU? I made some tests but could not see any difference between the t1m1->t1m2… and t1m1->t2m1… orderings.

memory access on the cpu is completely different, as a cpu has lots of different caches…
you should be able to use wchar_t too; if not, just try unsigned short on windows, unsigned int on linux.

Yes, I got odd behaviour. I ran my test program several times; once I got ~58M read operations

from an array per second, then I ran it again and had ~128M operations.

Then I ran it a 3rd time and had ~50M. No idea what was going on.

I think there was no error within my program, and I didn’t have other applications running

(the Windows File Indexer was switched off).


Sort of. DRAM always needs “coalescing,” but the CPU’s cache buffers it. If you request a few bytes, the CPU will do a coalesced 32 byte transfer anyway and save it in its cache. If you request a few bytes elsewhere from that same cache line, you won’t have to go to DRAM at all.

A GPU, without a cache, will go to DRAM every time and waste tons of bandwidth, so you want to get as much out of every 32 (or 64/128) byte access as possible.