Newbie question on shared variables

I’m currently looking into a simple CUDA example from here

It is a program where each position in an array of size N is incremented by one in both host code and device code. I have modified it a bit to time the execution of each method for comparison purposes.

The processing time is lower for the device code only for very large values of N, so I wanted to introduce some more “work” for each position in the array to see how that would affect the difference. A for loop that increments an integer 10,000 times was my first idea, but it doesn’t work very well. Here are the two functions with the for loops included.

[codebox]
void incrementArrayOnHost(int *a, int N)
{
	for (int i = 0; i < N; i++){
		a[i] = a[i] + 1;
		int x = 0;
		for (int j = 0; j < 10000; j++){
			x++;
		}
	}
}

__global__ void incrementArrayOnDevice(int *a, int N)
{
	int idx = blockIdx.x*blockDim.x + threadIdx.x;
	if (idx < N){
		a[idx] = a[idx] + 1;
		int x = 0;
		for (int j = 0; j < 10000; j++){
			x++;
		}
	}
}
[/codebox]
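One way to time the device side is with CUDA events; a rough sketch (nBlocks, blockSize and a_d are placeholders for whatever launch configuration and device array the example uses, not names from the original code):

[codebox]
cudaEvent_t start, stop;
float elapsedMs = 0.0f;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
incrementArrayOnDevice<<<nBlocks, blockSize>>>(a_d, N);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);                     // block until the kernel is done
cudaEventElapsedTime(&elapsedMs, start, stop);  // elapsed time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);
[/codebox]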

The processing time for the device code stays roughly the same no matter how many iterations the for loop runs. But if I’m not mistaken, x is a shared variable and all the threads cooperate to increment it? Is there any way to make it unique for each thread, or should I use another method to introduce more “work” for each thread?

You never write your calculated x out to global memory, so the dead-code optimizer recognizes that your kernel does nothing and optimizes its contents away.

And your x variable is already unique per thread: it is an ordinary local variable, not a __shared__ one.
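For comparison, a variable is only shared between threads if you explicitly declare it __shared__, and then only within a block; a plain local like yours gets its own copy per thread. A quick sketch of the difference (scopeExample is just an illustration, not part of the original code):

[codebox]
__global__ void scopeExample(int *out)
{
	int x = 0;             // per-thread: every thread has its own x
	__shared__ int s;      // per-block: one s shared by all threads in the block
	if (threadIdx.x == 0)
		s = 42;            // shared data must be written in a coordinated way
	__syncthreads();       // make the write to s visible to the whole block
	x++;                   // no race here: each thread increments its own copy
	out[threadIdx.x] = x + s;
}
[/codebox]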

How can I avoid the extra code being excluded by the optimizer?

By writing your variable out to global memory.

Probably a very stupid question, but how? :)

[codebox]
__global__ void incrementArrayOnDevice(int *d_out, int *a, int N)
{
	int idx = blockIdx.x*blockDim.x + threadIdx.x;
	if (idx < N){
		a[idx] = a[idx] + 1;
		int x = 0;
		for (int j = 0; j < 10000; j++){
			x++;
		}
		d_out[idx] = x;   /* the store makes x observable, so the loop is no longer dead code */
	}
}
[/codebox]

And allocate d_out on the device with cudaMalloc.
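A rough sketch of that host side, assuming a_d is your existing device array and using an arbitrary example block size:

[codebox]
int *d_out;
cudaMalloc((void **)&d_out, N * sizeof(int));   // output buffer on the device

int blockSize = 256;                            // example block size
int nBlocks = (N + blockSize - 1) / blockSize;  // enough blocks to cover all N elements

incrementArrayOnDevice<<<nBlocks, blockSize>>>(d_out, a_d, N);
cudaDeviceSynchronize();                        // wait for the kernel to finish

cudaFree(d_out);
[/codebox]

The store into d_out is what gives x an observable effect, which is what keeps the optimizer from discarding the work.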