thread id repetitions leading to problems with problem sizes over 16777216

Hi all,

I’m a CUDA newbie on Linux. I’m trying out the following code on a GTX 480:

8-<----8-<----8-<----8-<----8-<----8-<----8-<----8-<----8-<----8-<----8-<----8-<----8-<----8-<----8-<----

#include <cuda.h>
#include <stdio.h>
#include <string.h>
#include "cuda_runtime_api.h"
#include <sys/time.h>
#include <time.h>
#include <math.h>
#include <gsl_statistics_float.h>
#include <gsl_rng.h>

#define PROBLEM_SIZE 30000000

// Prototypes
__global__ void irc_kernel(float *device_input, int problem_size, float *device_output);

// Host function
int main(int argc, char** argv) {

int i;

float *host_input;
host_input = (float *)malloc(PROBLEM_SIZE * sizeof(float));
for(i=0;i<PROBLEM_SIZE;i++) {
host_input[i] = 4000.0; // arbitrary initialization value, no real purpose
}

float *device_input;
cudaMalloc((void **)&device_input, PROBLEM_SIZE * sizeof(float));

float *device_output;
cudaMalloc((void **)&device_output, PROBLEM_SIZE * sizeof(float));

// send input to device
cudaMemcpy(device_input, host_input, PROBLEM_SIZE*sizeof(float), cudaMemcpyHostToDevice);

dim3 dimGrid(65534,1,1);
dim3 dimBlock(512,1,1);

float *host_output;
host_output = (float *)malloc(PROBLEM_SIZE * sizeof(float));

// invoke the kernel
irc_kernel<<< dimGrid, dimBlock >>>(device_input,PROBLEM_SIZE,device_output);

// retrieve the results from the device
cudaMemcpy(host_output, device_output, PROBLEM_SIZE*sizeof(float), cudaMemcpyDeviceToHost);

for(i=0;i<10;i++) {
printf("%d: %2.7f\n", i, host_output[i]);
}

for(i=PROBLEM_SIZE-10;i<PROBLEM_SIZE;i++) {
printf("%d: %2.7f\n", i, host_output[i]);
}

cudaFree(device_input);
cudaFree(device_output);
free(host_input);
free(host_output);

return 0;
}

// Device kernel
__global__ void irc_kernel(float *device_input, int problem_size, float *device_output) {

// single dim :
int idx = blockIdx.x * blockDim.x + threadIdx.x;

if(idx < problem_size) {

__syncthreads();
device_output[idx] = idx;

}
}

8-<----8-<----8-<----8-<----8-<----8-<----8-<----8-<----8-<----8-<----8-<----8-<----8-<----8-<----8-<----

When PROBLEM_SIZE is more than 16777216 I get:

0: 0.0000000
1: 1.0000000
2: 2.0000000
3: 3.0000000
4: 4.0000000
5: 5.0000000
6: 6.0000000
7: 7.0000000
8: 8.0000000
9: 9.0000000
29999990: 29999990.0000000
29999991: 29999992.0000000
29999992: 29999992.0000000
29999993: 29999992.0000000
29999994: 29999994.0000000
29999995: 29999996.0000000
29999996: 29999996.0000000
29999997: 29999996.0000000
29999998: 29999998.0000000
29999999: 30000000.0000000

which suggests that the thread ID is not being written to the correct memory location. If you print intermediate values you will see that the issue begins after 16777216. Moreover, if PROBLEM_SIZE is less than 16777216 this does not happen.

e.g. if PROBLEM_SIZE is 16000000:

0: 0.0000000
1: 1.0000000
2: 2.0000000
3: 3.0000000
4: 4.0000000
5: 5.0000000
6: 6.0000000
7: 7.0000000
8: 8.0000000
9: 9.0000000
15999990: 15999990.0000000
15999991: 15999991.0000000
15999992: 15999992.0000000
15999993: 15999993.0000000
15999994: 15999994.0000000
15999995: 15999995.0000000
15999996: 15999996.0000000
15999997: 15999997.0000000
15999998: 15999998.0000000
15999999: 15999999.0000000

Any ideas on why this is happening and how I can solve it?

If someone wants any more info, please let me know.

Thanks!

  • V

P.S: I know there are a bunch of header files in the code which have nothing to do with it, but those are just for some experiments I want to do later. I also know that defining PROBLEM_SIZE as a macro on top and then using a separate variable as a function parameter in the kernel call seems redundant, but that’s because I want to do some experiments with changing PROBLEM_SIZE in a shell script later on when it won’t be a macro anymore.

Use int or unsigned int instead of float. float variables only have 24 bits of mantissa, so they cannot accurately store odd integers larger than 16777216.
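To see the rounding for yourself, here is a small host-side C snippet (just an illustration of the effect, not code from your program) that shows where float starts collapsing adjacent integers:

    #include <stdio.h>

    int main(void) {
        // 16777216 = 2^24 is the last point at which float can still
        // represent every consecutive integer exactly.
        float a = 16777216.0f;
        float b = (float)16777217;   // rounds back down to 16777216.0f
        float c = (float)29999991;   // rounds to the even neighbour 29999992.0f

        printf("16777217 as float: %.1f\n", b);   // prints 16777216.0
        printf("a == b: %d\n", a == b);           // prints 1
        printf("29999991 as float: %.1f\n", c);   // prints 29999992.0
        return 0;
    }

This reproduces exactly the pattern in your output: above 2^24 consecutive float values are 2 apart, so odd indices snap to a neighbouring even value.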

Thanks a million tera! :-) A question though: I want to store numbers ranging from hundreds of millions up to tens of billions and do arithmetic on them, which would need roughly 27 to 37 bits of mantissa, so yeah, I end up using double. But then, would that automatically make CUDA go into double precision mode, thus reducing performance?

Yes (although there is no particular double precision mode; it’s just that operations on double precision operands are slower).

On GeForce cards, you might be better off with long long instead of double, particularly if you have a lot of additions.
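As a sketch of what that could look like (the kernel name is hypothetical, and it assumes the output buffer is reallocated as PROBLEM_SIZE * sizeof(long long)), your kernel would become:

    // Sketch: store indices exactly in 64-bit integers instead of float.
    // long long holds every index up to PROBLEM_SIZE with no rounding.
    __global__ void irc_kernel_ll(float *device_input, int problem_size, long long *device_output) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < problem_size) {
            device_output[idx] = (long long)idx;   // exact for all 30000000 indices
        }
    }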

Oh, and one thing I meant to write in my earlier post but somehow forgot:

if (idx < problem_size) {

    __syncthreads();

    device_output[idx] = idx;

}

has undefined behavior and might well cause trouble later. In this particular instance you can just drop the __syncthreads() with no ill effect (I guess you inserted it merely to make sure there is no synchronization problem). In general, __syncthreads() may not appear in conditional code where the conditional evaluates differently for different threads of a block.
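For reference, here is a sketch of the kernel with the barrier taken out of the conditional path; in this code it can simply be deleted, since each thread writes only its own element:

    __global__ void irc_kernel(float *device_input, int problem_size, float *device_output) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;

        // If a barrier were ever needed, it would go here, outside the
        // conditional, so that every thread of the block reaches it.

        if (idx < problem_size) {
            device_output[idx] = idx;   // no barrier needed: no thread reads another's element
        }
    }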

Thanks again tera! That is really good advice about __syncthreads(); in fact I was wondering whether I should use it or not, as I don’t foresee any race condition in this code, but I put it there just to play it safe in case the threads race ahead of each other in an unknown order. I am currently using a GTX 480 and decided to use double; I will try long long as well and see if it performs better.