I am totally new to CUDA and wanted to do some simple vector addition, but somehow I always get zeros as the answer.
Can anybody help me out?
The code is really simple and I have absolutely no idea what could be wrong.
#include <stdio.h>
#include <stdlib.h>

#define VAR 8   /* number of elements */

__global__ void matAdd(float *A, float *B, float *C)
{
int i = threadIdx.x;
C[i] = A[i] + B[i];
}
int
main()
{
float *A_h;
float *B_h;
float *C_h;
float *A_d;
float *B_d;
float *C_d;
A_h = (float *) (malloc(sizeof(float) * VAR));
B_h = (float *) (malloc(sizeof(float) * VAR));
C_h = (float *) (malloc(sizeof(float) * VAR));
cudaMalloc( (void **) &A_d, sizeof(float) * VAR);
cudaMalloc( (void **) &B_d, sizeof(float) * VAR);
cudaMalloc( (void **) &C_d, sizeof(float) * VAR);
printf("RAM allocated ...\n");
for (int i = 0; i < VAR; i++)
{
A_h[i] = 2.0f;
B_h[i] = 2.0f;
}
/* copy data to GPU */
printf("copying data ...\n");
cudaMemcpy(A_d, A_h, sizeof(float) * VAR, cudaMemcpyHostToDevice);
cudaMemcpy(B_d, B_h, sizeof(float) * VAR, cudaMemcpyHostToDevice);
/* kernel invocation */
printf("calling kernel ...\n");
dim3 dimBlock(1, 4);
matAdd<<<1, dimBlock>>>(A_d, B_d, C_d);
printf("addition done ...\n");
/* copy answer back and display */
cudaMemcpy(C_h, C_d, sizeof(float) * VAR, cudaMemcpyDeviceToHost);
for (int i = 0; i < VAR; i++)
{
printf("line: %f\n", C_h[i]);
}
printf("fixed ...\n");
return 0;
}
I am initializing two vectors, copying them to the GPU and adding them up.
Then I copy the answer back to the CPU.
No core dumps, no compiler warnings - just crappy answers.
hs@quad:/data/projects/crealiity/cuda$ make
PATH=/usr/local/cuda/bin:/usr/local/cuda/bin/:/home/hs/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games
nvcc main.cu -o prog
After a quick read-through of your code, the only thing amiss I see is that your grid/thread configuration is incorrect. You are launching 1 block that is 1x4, but indexing by threadIdx.x in your kernel.
However, that launch would run all 4 threads with threadIdx.x = 0, so I'm not sure why even the 0th element is not correct. You are also printing 8 elements, but the kernel only writes to 4 of them.
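A launch configuration like this would give one thread per element (just a sketch; it assumes VAR elements with VAR ≤ 512, and leaves the grid at one block):

```cuda
// One block of VAR threads along x, so threadIdx.x runs 0..VAR-1
// and matches the kernel's indexing by threadIdx.x.
dim3 dimBlock(VAR);   // x dimension, not y -- dim3(1, VAR) leaves blockDim.x at 1
matAdd<<<1, dimBlock>>>(A_d, B_d, C_d);
```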
Seems strange to me. I ran your code (I added free() and cudaFree() calls at the end of it, and also zeroed the C_d array using cudaMemset).
I got the expected result of having only the first element in the output set to 4 (this was noted in a previous reply - block dimensions...).
So other than the two issues I listed in parentheses, I had no problems with it.
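For reference, the extra calls were roughly (a sketch; VAR as in the original code):

```cuda
// Zero the output buffer before the launch, so stale device memory
// can't be mistaken for a kernel result:
cudaMemset(C_d, 0, sizeof(float) * VAR);

/* ... kernel launch and copy-back as before ... */

// Release device and host memory at the end:
cudaFree(A_d); cudaFree(B_d); cudaFree(C_d);
free(A_h);  free(B_h);  free(C_h);
```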
The odd thing here is that we seem to have the same card (maybe not the same vendor, but the same chip), yet mine gives a totally different output
on the first test, and passes the second one:
$ ~/NVIDIA_CUDA_SDK/bin/linux/release/deviceQuery
There is 1 device supporting CUDA
Device 0: "GeForce 8600 GT"
Major revision number: 1
Minor revision number: 1
Total amount of global memory: 536150016 bytes
Number of multiprocessors: 4
Number of cores: 32
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 1.19 GHz
Concurrent copy and execution: Yes
Test PASSED
$ ~/NVIDIA_CUDA_SDK/bin/linux/release/histogram64 --help
Using device 0: GeForce 8600 GT
Initializing data...
...allocating CPU memory.
...generating input data
...allocating GPU memory and copying input data
Running GPU histogram (1 iterations)...
histogram64GPU() time (average) : 49.438999 msec //1928.991954 MB/sec
Comparing the results...
...histogramCPU()
histogram64CPU() time : 141.229996 msec //675.263291 MB/sec
Total sum of histogram elements: 100000000
Sum of absolute differences: 0
TEST PASSED
Shutting down...
I seem to have only 4 multiprocessors and 32 cores, while your card reports 16 multiprocessors and 128 cores. Now, this is really strange to me. Can someone explain this?
What driver are you using? 16 MP/128 cores was a bug from the 169 or 173.xx drivers, I think, so odds are low that it's working correctly with the CUDA 2.0 (or 2.1) examples.
I’m using the 177.82 driver. And from what I understand from tmurray’s post, the numbers postgresql is getting are incorrect, as the 8600GT has only 4 MP/32 cores, like what I’m getting.
postgresql, what driver are you using? Try installing the 177.82 driver and re-running the tests and your code.
I guess this was the golden advice.
I am using the 169 driver; this could well be the issue ...
It even seems to work once in 10000 tries.
I will fix the driver side.
If you have a chance to look at my code (vector addition), would you mind helping me figure out the error? It produces exactly the same all-zero result on the GPGPU partition.
I don’t remember the details, but I have some memory that some error cases are missed by the cudaThreadSynchronize(...) invoked after the kernel launch. I think if the kernel doesn’t launch at all, then cudaThreadSynchronize(...) would not return an error. I would suggest invoking cudaGetLastError(...) right after the <<<...>>> operator (the kernel launch) and checking its error code, just to be sure.
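Roughly like this (a sketch of the pattern, using the launch from the original post):

```cuda
matAdd<<<1, dimBlock>>>(A_d, B_d, C_d);

// Launch failures (invalid configuration, driver problems) surface here,
// not necessarily in a later cudaThreadSynchronize():
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    printf("kernel launch failed: %s\n", cudaGetErrorString(err));
```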
On a separate note, you might want to initialize C_h with some garbage numbers to verify that these values really are overwritten when the data is pulled back from the device. I would do it in the same loop where you initialize A_h and B_h.
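For example (the sentinel value is arbitrary; anything you would never compute works):

```cuda
for (int i = 0; i < VAR; i++)
{
    A_h[i] = 2.0f;
    B_h[i] = 2.0f;
    C_h[i] = -999.0f;  /* sentinel: if this survives the copy-back,
                          the kernel never wrote that element */
}
```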