thread id repetitions leading to problems with problem sizes over 16777216

Hi all,

I’m a CUDA newbie on Linux. I’m trying out the following code on a GTX 480:

8-<----8-<----8-<----8-<----8-<----8-<----8-<----8-<----8-<----8-<----8-<----8-<----8-<----8-<----8-<----

#include <cuda.h>
#include <stdio.h>
#include <string.h>
#include "cuda_runtime_api.h"
#include <sys/time.h>
#include <time.h>
#include <math.h>
#include <gsl_statistics_float.h>
#include <gsl_rng.h>

#define PROBLEM_SIZE 30000000

// Prototypes
__global__ void irc_kernel(float *device_input, int problem_size, float *device_output);

// Host function
int main(int argc, char** argv) {

int i;

float *host_input;
host_input = (float *)malloc(PROBLEM_SIZE * sizeof(float));
for(i=0;i<PROBLEM_SIZE;i++) {
host_input[i] = 4000.0; // arbitrary initialization value, no real purpose
}

float *device_input;
cudaMalloc((void **)&device_input, PROBLEM_SIZE * sizeof(float));

float *device_output;
cudaMalloc((void **)&device_output, PROBLEM_SIZE * sizeof(float));

// send input to device
cudaMemcpy(device_input, host_input, PROBLEM_SIZE*sizeof(float), cudaMemcpyHostToDevice);

dim3 dimGrid(65534,1,1);
dim3 dimBlock(512,1,1);

float *host_output;
host_output = (float *)malloc(PROBLEM_SIZE * sizeof(float));

// invoke the kernel
irc_kernel<<< dimGrid, dimBlock >>>(device_input,PROBLEM_SIZE,device_output);

// retrieve the results from the device
cudaMemcpy(host_output, device_output, PROBLEM_SIZE*sizeof(float), cudaMemcpyDeviceToHost);

for(i=0;i<10;i++) {
printf("%d: %2.7f\n", i, host_output[i]);
}

for(i=PROBLEM_SIZE-10;i<PROBLEM_SIZE;i++) {
printf("%d: %2.7f\n", i, host_output[i]);
}

cudaFree(device_input);
cudaFree(device_output);
free(host_input);
free(host_output);

return 0;
}

// Device kernel
__global__ void irc_kernel(float *device_input, int problem_size, float *device_output) {

// single dim :
int idx = blockIdx.x * blockDim.x + threadIdx.x;

if(idx < problem_size) {

__syncthreads();
device_output[idx] = idx;

}
}

8-<----8-<----8-<----8-<----8-<----8-<----8-<----8-<----8-<----8-<----8-<----8-<----8-<----8-<----8-<----

When PROBLEM_SIZE is more than 16777216 I get:

0: 0.0000000
1: 1.0000000
2: 2.0000000
3: 3.0000000
4: 4.0000000
5: 5.0000000
6: 6.0000000
7: 7.0000000
8: 8.0000000
9: 9.0000000
29999990: 29999990.0000000
29999991: 29999992.0000000
29999992: 29999992.0000000
29999993: 29999992.0000000
29999994: 29999994.0000000
29999995: 29999996.0000000
29999996: 29999996.0000000
29999997: 29999996.0000000
29999998: 29999998.0000000
29999999: 30000000.0000000

which suggests that the thread ID is not being written to the correct memory location. If you print intermediate values you will see that the issue begins after 16777216. Moreover, if PROBLEM_SIZE is less than 16777216 this does not happen.

e.g. if PROBLEM_SIZE is 16000000:

0: 0.0000000
1: 1.0000000
2: 2.0000000
3: 3.0000000
4: 4.0000000
5: 5.0000000
6: 6.0000000
7: 7.0000000
8: 8.0000000
9: 9.0000000
15999990: 15999990.0000000
15999991: 15999991.0000000
15999992: 15999992.0000000
15999993: 15999993.0000000
15999994: 15999994.0000000
15999995: 15999995.0000000
15999996: 15999996.0000000
15999997: 15999997.0000000
15999998: 15999998.0000000
15999999: 15999999.0000000

Any ideas on why this is happening and how I can solve it?

If someone wants any more info, please let me know.

Thanks!

  • V

P.S: I know there are a bunch of header files in the code which have nothing to do with it, but those are just for some experiments I want to do later. I also know that defining PROBLEM_SIZE as a macro on top and then using a separate variable as a function parameter in the kernel call seems redundant, but that’s because I want to do some experiments with changing PROBLEM_SIZE in a shell script later on when it won’t be a macro anymore.

Use int or unsigned int instead of float. float variables only have 24 bits of mantissa, so they cannot accurately store odd integers larger than 16777216.
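To see the rounding for yourself, here is a small host-side C snippet (just an illustration of the effect, not code from your program) that shows where float starts collapsing adjacent integers:

    #include <stdio.h>

    int main(void) {
        // 16777216 = 2^24 is the last point at which float can still
        // represent every consecutive integer exactly.
        float a = 16777216.0f;
        float b = (float)16777217;   // rounds back down to 16777216.0f
        float c = (float)29999991;   // rounds to the even neighbour 29999992.0f

        printf("16777217 as float: %.1f\n", b);   // prints 16777216.0
        printf("a == b: %d\n", a == b);           // prints 1
        printf("29999991 as float: %.1f\n", c);   // prints 29999992.0
        return 0;
    }

This reproduces exactly the pattern in your output: above 2^24 consecutive float values are 2 apart, so odd indices snap to a neighbouring even value.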

Thanks a million tera! :-) A question though: I want to store numbers ranging from hundreds of millions up to tens of billions and do arithmetic on them, which would need roughly 27 to 37 bits of mantissa, so yeah, I end up using double. But then, would that automatically make CUDA go into double precision mode, thus reducing performance?

Yes (although there is no particular double precision mode; it’s just that operations on double precision operands are slower).

On GeForce cards, you might be better off with long long instead of double, particularly if you have a lot of additions.
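As a sketch of what that could look like (the kernel name is hypothetical, and it assumes the output buffer is reallocated as PROBLEM_SIZE * sizeof(long long)), your kernel would become:

    // Sketch: store indices exactly in 64-bit integers instead of float.
    // long long holds every index up to PROBLEM_SIZE with no rounding.
    __global__ void irc_kernel_ll(float *device_input, int problem_size, long long *device_output) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < problem_size) {
            device_output[idx] = (long long)idx;   // exact for all 30000000 indices
        }
    }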

Oh, and one thing I meant to write in my earlier post but somehow forgot:

if (idx < problem_size) {

    __syncthreads();

    device_output[idx] = idx;

}

has undefined behavior and might well cause trouble later. In this particular instance you can just drop the __syncthreads() with no ill effect (I guess you inserted it merely to make sure there is no synchronization problem). In general, __syncthreads() may not appear in conditional code where the conditional evaluates differently for different threads of a block.
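For reference, here is a sketch of the kernel with the barrier taken out of the conditional path; in this code it can simply be deleted, since each thread writes only its own element:

    __global__ void irc_kernel(float *device_input, int problem_size, float *device_output) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;

        // If a barrier were ever needed, it would go here, outside the
        // conditional, so that every thread of the block reaches it.

        if (idx < problem_size) {
            device_output[idx] = idx;   // no barrier needed: no thread reads another's element
        }
    }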

Thanks again tera! That is really good advice about __syncthreads(); in fact I was wondering whether I should use it or not, as I don’t foresee any race condition in this code, but I put it there just to play it safe in case the threads race ahead of each other in an unknown order. I am currently using a GTX 480 and decided to use double; I will try long long as well and see if it performs better.