I am totally new to CUDA and wanted to do some simple vector addition, but somehow I always get zeros as the result.
Can anybody help me out?
The code is really simple and I have absolutely no idea what could be wrong.
After a quick read-through of your code, the only thing amiss I see is that your grid/thread configuration is incorrect. You are launching 1 block that is 1x4, but indexing by threadIdx.x in your kernel.
However, that would run all 4 threads with threadIdx.x = 0, so element 0 should still be written, and I’m not sure why it is not correct. You are also printing 8 elements, but the kernel only writes to 4 of them.
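To illustrate the launch-configuration point, here is a minimal sketch. It is not the original poster's code, which isn't reproduced in this thread. The kernel name vecAdd, the helper addOnDevice, the A_d/B_d names and the block size of 4 are all assumptions made for illustration. It uses a block that is 4 wide in x, so threadIdx.x actually varies, and launches enough blocks to cover all n elements:

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel (the original kernel is not shown in the thread).
// Each thread handles one element, indexed along x only.
__global__ void vecAdd(const float *A, const float *B, float *C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                      // guard threads past the end of the arrays
        C[i] = A[i] + B[i];
}

// Hypothetical helper: computes C_h = A_h + B_h on the device.
// Returns 0 on success, -1 if any CUDA call reported an error.
int addOnDevice(const float *A_h, const float *B_h, float *C_h, int n)
{
    size_t bytes = n * sizeof(float);
    float *A_d = NULL, *B_d = NULL, *C_d = NULL;

    bool ok = cudaMalloc((void **)&A_d, bytes) == cudaSuccess
           && cudaMalloc((void **)&B_d, bytes) == cudaSuccess
           && cudaMalloc((void **)&C_d, bytes) == cudaSuccess
           && cudaMemcpy(A_d, A_h, bytes, cudaMemcpyHostToDevice) == cudaSuccess
           && cudaMemcpy(B_d, B_h, bytes, cudaMemcpyHostToDevice) == cudaSuccess;

    if (ok) {
        // 4 x 1 x 1 block: threadIdx.x runs 0..3.
        // A 1 x 4 block, dim3(1, 4), would give every thread threadIdx.x == 0.
        dim3 block(4);
        dim3 grid((n + block.x - 1) / block.x);   // enough blocks to cover n
        vecAdd<<<grid, block>>>(A_d, B_d, C_d, n);
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(C_h, C_d, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    // cudaFree on a null pointer is a no-op, so this is safe on every path.
    cudaFree(A_d);
    cudaFree(B_d);
    cudaFree(C_d);
    return ok ? 0 : -1;
}
```

With n = 8 this launches two blocks of 4 threads, so all 8 printed elements are written rather than only the first 4.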
Seems strange to me. I ran your code (after adding free and cudaFree calls at the end and zeroing the C_d array with cudaMemset).
I got the expected result: only the first element of the output is set to 4 (the block-dimension problem was already pointed out in a previous reply).
So apart from the two issues I listed in the parentheses, I had no problems with it.
The odd thing here is that we seem to have the same card (maybe not the same vendor, but the same chip), yet mine gives a totally different output
on the first test, and passes the second one:
$ ~/NVIDIA_CUDA_SDK/bin/linux/release/deviceQuery
There is 1 device supporting CUDA
Device 0: "GeForce 8600 GT"
Major revision number: 1
Minor revision number: 1
Total amount of global memory: 536150016 bytes
Number of multiprocessors: 4
Number of cores: 32
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 1.19 GHz
Concurrent copy and execution: Yes
$ ~/NVIDIA_CUDA_SDK/bin/linux/release/histogram64 --help
Using device 0: GeForce 8600 GT
...allocating CPU memory.
...generating input data
...allocating GPU memory and copying input data
Running GPU histogram (1 iterations)...
histogram64GPU() time (average) : 49.438999 msec //1928.991954 MB/sec
Comparing the results...
histogram64CPU() time : 141.229996 msec //675.263291 MB/sec
Total sum of histogram elements: 100000000
Sum of absolute differences: 0
I seem to have only 4 multiprocessors and 32 cores, while your card reports 16 multiprocessors and 128 cores. Now, this is really strange to me. Can someone
I don’t remember the details, but I recall that some error cases are missed by cudaThreadSynchronize(…) when it is invoked after the kernel launch. I think that if the kernel doesn’t launch at all, cudaThreadSynchronize(…) will not return an error. I would suggest calling cudaGetLastError() right after the <<<…>>> operator (the kernel launch) and checking its return code, just to be sure.
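A minimal sketch of that check, assuming a do-nothing kernel and a helper named launchChecked (both hypothetical, not from the original code). It also uses cudaDeviceSynchronize(), which replaces the deprecated cudaThreadSynchronize() in newer toolkits; on an old toolkit you would keep cudaThreadSynchronize() in that spot:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical empty kernel, used only to exercise the launch path.
__global__ void noop() {}

// Hypothetical helper: launches one block of `threadsPerBlock` threads
// and returns the first error seen, or cudaSuccess if there was none.
cudaError_t launchChecked(int threadsPerBlock)
{
    noop<<<1, threadsPerBlock>>>();

    // Check immediately after the launch: this is where problems such as
    // an invalid block size are reported.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));
        return err;
    }

    // Then wait for the kernel to finish and check again for errors
    // raised while it was running.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
    return err;
}
```

Calling launchChecked with a block size beyond the device limit (512 threads per block on the card in the deviceQuery output above) should report an error at the first check, before any synchronization.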
On a separate note, you might want to initialize C_h with some garbage values to verify that they are overwritten when the data is pulled from the device. I would do it in the same loop where you initialize A_h and B_h.
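Something along these lines, as a rough sketch. The SENTINEL value, the input values and both helper names are assumptions for illustration; pick a marker that no valid sum in your data can produce:

```cuda
// Hypothetical marker value: chosen so that no valid A[i] + B[i] equals it.
#define SENTINEL (-999.0f)

// Fill the inputs and poison the output buffer in the same loop.
void initHost(float *A_h, float *B_h, float *C_h, int n)
{
    for (int i = 0; i < n; ++i) {
        A_h[i] = (float)i;
        B_h[i] = 2.0f * i;
        C_h[i] = SENTINEL;   // should be overwritten by the copy from the device
    }
}

// Count how many elements of C_h still hold the marker value.
int countSentinels(const float *C_h, int n)
{
    int count = 0;
    for (int i = 0; i < n; ++i)
        if (C_h[i] == SENTINEL)
            ++count;
    return count;
}
```

After the cudaMemcpy from C_d into C_h, countSentinels(C_h, n) should be 0. If markers are still there, the host buffer was never overwritten, which points at the copy back (or an earlier failed call) rather than at the kernel's arithmetic.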