Loop of 100 only executed once?

I have a program that computes a[i] + 100*b[i], but using a loop of 100 additions instead of a multiplication. The kernel is below:

__global__ void GpuAddImpl(int* a, int* b, int* c, size_t size) {
  size_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < size) {
    c[i] = a[i];
    for (int t = 0; t < 100; ++t) {
      c[i] += b[i];
    }
  }
}

However, I got an incorrect result: it seems the loop executed only once. For example, with a[i]=990060 and b[i]=274112, the expected result is 28401260 (990060 + 100*274112), but I got 1264172 in c[i], which is exactly a[i] + b[i]. Why is this?
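
The host code is not shown here, so the following is only a sketch of how the kernel could be exercised in isolation, to tell a genuinely wrong result apart from a failed launch or copy. The CUDA_CHECK macro, the element count, and the 256-thread launch configuration are assumptions for illustration and not part of the original program; the input values are the ones quoted above.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper, not from the original post: abort on any CUDA error.
#define CUDA_CHECK(call)                                          \
  do {                                                            \
    cudaError_t err_ = (call);                                    \
    if (err_ != cudaSuccess) {                                    \
      std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,     \
                   cudaGetErrorString(err_));                     \
      std::exit(EXIT_FAILURE);                                    \
    }                                                             \
  } while (0)

// Kernel as posted above.
__global__ void GpuAddImpl(int* a, int* b, int* c, size_t size) {
  size_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < size) {
    c[i] = a[i];
    for (int t = 0; t < 100; ++t) {
      c[i] += b[i];
    }
  }
}

int main() {
  const size_t size = 1 << 20;  // assumed element count
  const size_t bytes = size * sizeof(int);
  std::vector<int> a(size, 990060), b(size, 274112), c(size, 0);

  int *da = nullptr, *db = nullptr, *dc = nullptr;
  CUDA_CHECK(cudaMalloc(&da, bytes));
  CUDA_CHECK(cudaMalloc(&db, bytes));
  CUDA_CHECK(cudaMalloc(&dc, bytes));
  CUDA_CHECK(cudaMemcpy(da, a.data(), bytes, cudaMemcpyHostToDevice));
  CUDA_CHECK(cudaMemcpy(db, b.data(), bytes, cudaMemcpyHostToDevice));

  const int threads = 256;  // assumed launch configuration
  const int blocks = static_cast<int>((size + threads - 1) / threads);
  GpuAddImpl<<<blocks, threads>>>(da, db, dc, size);
  CUDA_CHECK(cudaGetLastError());       // catches launch failures
  CUDA_CHECK(cudaDeviceSynchronize());  // catches errors during execution

  CUDA_CHECK(cudaMemcpy(c.data(), dc, bytes, cudaMemcpyDeviceToHost));

  // Compare every element against the host-side reference a[i] + 100*b[i].
  size_t mismatches = 0;
  for (size_t i = 0; i < size; ++i) {
    const int expected = a[i] + 100 * b[i];
    if (c[i] != expected) {
      if (mismatches == 0) {
        std::printf("first mismatch at %zu: got %d, expected %d\n",
                    i, c[i], expected);
      }
      ++mismatches;
    }
  }
  std::printf("mismatches: %zu of %zu\n", mismatches, size);

  CUDA_CHECK(cudaFree(da));
  CUDA_CHECK(cudaFree(db));
  CUDA_CHECK(cudaFree(dc));
  return mismatches == 0 ? 0 : 1;
}
```

Saved under a hypothetical name such as repro.cu, this could be built with nvcc -o repro repro.cu and run several times. If every CUDA call succeeds and mismatches are still reported, the kernel itself is producing the wrong value; if a check fails first, the problem is more likely in the launch or the memory transfers.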

I’m using a GCP VM with a T4 GPU. GPU driver/CUDA info:
$ sudo nvidia-smi
Mon Jul 24 04:21:55 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------|
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   66C    P0              31W /  70W |      2MiB / 15360MiB |      7%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Interestingly, the issue went away today.