I have a program that computes a[i] + 100*b[i], but using a loop instead of a multiplication. The kernel code is below:
__global__ void GpuAddImpl(int* a, int* b, int* c, size_t size) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < size) {
        // Intended result: c[i] = a[i] + 100 * b[i]
        c[i] = a[i];
        for (int t = 0; t < 100; ++t) {
            c[i] += b[i];
        }
    }
}
However, I get an incorrect result: it looks like the loop body is executed only once. For example, with a[i] = 990060 and b[i] = 274112, the expected result is 990060 + 100 * 274112 = 28401260, but c[i] comes back as 1264172, which is exactly 990060 + 274112. Why is this?
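For context, here is a simplified sketch of how the kernel is launched and the result read back (the array size, fill values, and use of managed memory here are illustrative, not my exact host code):

// Illustrative host-side driver for the GpuAddImpl kernel shown above.
// (Size, fill values, and managed memory are placeholders for my real setup.)
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t size = 1 << 20;
    const size_t bytes = size * sizeof(int);

    int *a = nullptr, *b = nullptr, *c = nullptr;
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);

    for (size_t i = 0; i < size; ++i) {
        a[i] = 990060;
        b[i] = 274112;
    }

    const int threads = 256;
    const int blocks = static_cast<int>((size + threads - 1) / threads);
    GpuAddImpl<<<blocks, threads>>>(a, b, c, size);
    cudaDeviceSynchronize();

    // Expected: 990060 + 100 * 274112 = 28401260; observed: 1264172.
    printf("c[0] = %d\n", c[0]);

    cudaFree(a);
    cudaFree(b);
    cudaFree(c);
    return 0;
}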
I'm using a GCP VM with a T4 GPU. GPU driver/CUDA info:
$ sudo nvidia-smi
Mon Jul 24 04:21:55 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   66C    P0              31W /  70W |      2MiB / 15360MiB |      7%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                            |
+---------------------------------------------------------------------------------------+
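For completeness, I compile and run with nvcc roughly like this (the file name and flags are illustrative; sm_75 corresponds to the T4):

$ nvcc -O2 -arch=sm_75 gpu_add.cu -o gpu_add
$ ./gpu_add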