Help with writing from local memory to global memory

I’m new to CUDA and I recently ran into a problem that probably comes from my misunderstanding of the CUDA material.

I wrote a simple program that counts all the prime numbers from 0 to N.

Every thread has an “int counter=0”, which is probably stored in local memory.
Assume that every thread checks 16 numbers (regardless of how the data gets to the thread),
and every time it finds a prime number, it does counter=counter+1.

So far everything works just fine.

But when I want every thread to write its “counter” to global memory, or even to shared memory, the program slows down drastically,
with something like “*(C+threadID)=counter” or “threadCount[threadID]=counter” (where threadCount is a __shared__ array),
where C is a pointer to the start of the global memory buffer and threadID is the serial number of the thread among all the threads in the grid.

If I instead make every thread write some constant value, like “*(C+threadID)=1”, the runtime does not slow down at all.

So, can anyone explain to me how to write data from a thread’s local memory to shared or global memory?
And yes, I am using __syncthreads().

thank you!


Why do you think that counter is placed in local memory? By priority, the compiler first tries to put it in a register; only if there are not enough registers does it spill into local memory.

And I think that is why executing *(C+threadID)=counter slows things down: C is in global memory, and a global-memory access costs much more time for sure.

And the reason why the shared-memory write “threadCount[threadID]=counter” also hurts efficiency may be a bank conflict, I guess.

best regards…

I don’t really see a bank conflict there with the shared memory, but queeten is generally right. Also, if you’re comparing code that does a memory write against code that doesn’t, be aware that the compiler is pretty aggressive at optimizing dead code, i.e., if it sees that the result of some computation is never written anywhere, it may eliminate the computation entirely.

Could you show us the code? (both kernel and host)


First of all, thanks for your patience.

Now, this is my kernel

It is a simple and naive code that calculates how many primes there are in some range,

in this case from 0 to n, where n is an argument of the kernel function.

The most important line in this segment is the last line; don’t be bothered with the rest.

Every thread writes only once to global memory, which starts at the address “C”.

It writes the number of primes in its range.

(e.g. thread number 0 checks 0~7 (in case n=2^20) and writes 4.)

__global__ void
getsPrimes(int* C, int n)
{
	int counter = 0;

	int blockID = blockIdx.y*WG + blockIdx.x;
	int threadID = threadIdx.y*WB + threadIdx.x;
	int jobPerThread = n/(WG*HG*WB*HB);

	for (int a = blockID*WB*HB*jobPerThread + threadID*jobPerThread;
	     a < blockID*WB*HB*jobPerThread + threadID*jobPerThread + jobPerThread;
	     a++)
	{
		if (isPrime(a) == 1)
			counter = counter + 1;
	}

	*(C + blockID*WB*HB + threadID) = counter;
}

Now, in this way, it takes 1.1 seconds.

But if I change the last line from “*(C+threadID)=counter;” to “*(C+threadID)=1;” (a constant value rather than a variable), it takes 0.2 seconds.

Which means that the access to “int counter” takes a lot of time.


Can I improve my performance?

Again, thanks for everything.

I hope that some day even I will be able to help here…


I usually declare most (if not all) of the variables that I want to live in registers as volatile. It works (tested with SDK 1.1 and 2.0), and in some cases it reduces register usage for me.

As already stated in a previous post: if you remove the *(C+threadID)=counter, then the compiler might optimize away your counter=counter+1; statement, because it sees that the value of counter is never accessed anywhere.

To reduce path divergence, I suggest doing counter=counter+isPrime(a) (this only works if 0 and 1 are the only possible return values and no else {} block is needed).

But I guess the isPrime function in itself will lead to a lot of divergent paths during execution.

Last year I used CUDA for primality testing and found that you could make minimal divergence by using strong-probable-prime testing.

Since all threads use a similar power loop with only differing data, the divergence was minor, especially if your tested values are all roughly the same magnitude.

If you were doing primality testing using trial divisions… your divergence would be huge.

In GPU programming, it’s not even worthwhile to check for divisibility by 2… early rejection doesn’t gain you anything since the other threads still need to work!

The GPU isn’t ideal for integer computes because its integer multiplies are slow, but the massive SP count makes up for it.

Using CUDA in fact I developed probably the best guaranteed isPrime() function for n<2^32 using a lookup table and one single probable-prime test.

I haven’t released that result yet, though… too busy and it was for my own curiosity and not publication.