Hi all,
I’m a CUDA newbie on Linux. I’m trying out the following code on a GTX 480:
8-<----8-<----8-<----8-<----8-<----8-<----8-<----8-<----8-<----8-<----8-<----8-<----8-<----8-<----8-<----
#include <cuda.h>
#include <stdio.h>
#include <string.h>
#include "cuda.h"
#include "cuda_runtime_api.h"
#include <sys/time.h>
#include <time.h>
#include <math.h>
#include <gsl_statistics_float.h>
#include <gsl_rng.h>
#include <time.h>
#define PROBLEM_SIZE 30000000
// Prototypes
__global__ void irc_kernel(float *device_input, int problem_size, float *device_output);
// Host function
int main(int argc, char** argv) {
int i;
float *host_input;
host_input = (float *)malloc(PROBLEM_SIZE * sizeof(float));
for(i=0;i<PROBLEM_SIZE;i++) {
host_input[i] = 4000.0f; // arbitrary initialization value, no real purpose
}
float *device_input;
cudaMalloc((void **)&device_input, PROBLEM_SIZE * sizeof(float));
float *device_output;
cudaMalloc((void **)&device_output, PROBLEM_SIZE * sizeof(float));
// send input to device
cudaMemcpy(device_input, host_input, PROBLEM_SIZE*sizeof(float), cudaMemcpyHostToDevice);
dim3 dimGrid(65534,1,1);
dim3 dimBlock(512,1,1);
float *host_output;
host_output = (float *)malloc(PROBLEM_SIZE * sizeof(float));
// invoke the kernel
irc_kernel<<< dimGrid, dimBlock >>>(device_input,PROBLEM_SIZE,device_output);
// retrieve the results from the device
cudaMemcpy(host_output, device_output, PROBLEM_SIZE*sizeof(float), cudaMemcpyDeviceToHost);
for(i=0;i<10;i++) {
printf("%d: %2.7f\n", i, host_output[i]);
}
for(i=PROBLEM_SIZE-10;i<PROBLEM_SIZE;i++) {
printf("%d: %2.7f\n", i, host_output[i]);
}
cudaFree(device_input);
cudaFree(device_output);
free(host_input);
free(host_output);
return 0;
}
// Device kernel
__global__ void irc_kernel(float *device_input, int problem_size, float *device_output) {
// single dim :
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if(idx < problem_size) {
__syncthreads();
device_output[idx] = idx;
}
}
8-<----8-<----8-<----8-<----8-<----8-<----8-<----8-<----8-<----8-<----8-<----8-<----8-<----8-<----8-<----
When PROBLEM_SIZE is more than 16777216 I get:
0: 0.0000000
1: 1.0000000
2: 2.0000000
3: 3.0000000
4: 4.0000000
5: 5.0000000
6: 6.0000000
7: 7.0000000
8: 8.0000000
9: 9.0000000
29999990: 29999990.0000000
29999991: 29999992.0000000
29999992: 29999992.0000000
29999993: 29999992.0000000
29999994: 29999994.0000000
29999995: 29999996.0000000
29999996: 29999996.0000000
29999997: 29999996.0000000
29999998: 29999998.0000000
29999999: 30000000.0000000
which suggests that the thread index is not being written to the correct memory location. If you print intermediate values you will see that the issue begins right after 16777216. Moreover, if PROBLEM_SIZE is less than 16777216 this does not happen.
For example, if PROBLEM_SIZE is 16000000:
0: 0.0000000
1: 1.0000000
2: 2.0000000
3: 3.0000000
4: 4.0000000
5: 5.0000000
6: 6.0000000
7: 7.0000000
8: 8.0000000
9: 9.0000000
15999990: 15999990.0000000
15999991: 15999991.0000000
15999992: 15999992.0000000
15999993: 15999993.0000000
15999994: 15999994.0000000
15999995: 15999995.0000000
15999996: 15999996.0000000
15999997: 15999997.0000000
15999998: 15999998.0000000
15999999: 15999999.0000000
Any ideas on why this is happening and how I can solve it?
If someone wants any more info, please let me know.
Thanks!
- V
P.S: I know there are a bunch of header files in the code which have nothing to do with it, but those are just for some experiments I want to do later. I also know that defining PROBLEM_SIZE as a macro on top and then using a separate variable as a function parameter in the kernel call seems redundant, but that’s because I want to do some experiments with changing PROBLEM_SIZE in a shell script later on when it won’t be a macro anymore.