Bus error when working with a 90k-element array (CUDA C)

I am writing a program that reads two numbers, N and K, from the input, then reads N integers and stores them in an array. The goal is to compute the maximum average over all segments of length at least K, and print that average multiplied by 1000 with the decimal places truncated. It is a task from a programming competition I once participated in, and an O(n log n) solution exists. However, I figured that using a graphics card instead of the CPU should be enough for an O(n²) solution with O(n) memory. For some reason, though, once the input reaches around 90 000 elements, the program exits with a bus error.
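For reference, here is a plain CPU version of the same O(n²) scan, just so it is clear what the kernel below is supposed to compute (a sketch for illustration, not the code I actually run):

#include <bits/stdc++.h>
using namespace std;

int main() {
    int n, k;
    cin >> n >> k;
    vector<int> a(n);
    for (int i = 0; i < n; i++)
        cin >> a[i];

    long long best = 0;
    // for every start i, extend the segment to the right, keep a running sum,
    // and track the best floor(average * 1000) over segments of length >= k
    for (int i = 0; i < n; i++) {
        long long sum = 0;
        for (int j = i; j < n; j++) {
            sum += a[j];
            if (j - i + 1 >= k)
                best = max(best, sum * 1000 / (j - i + 1));
        }
    }
    cout << best << endl;
    return 0;
}

The CUDA version, which is the one that crashes, looks like this: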

#include <bits/stdc++.h>
using namespace std;

__global__
void solve(int n, int k, int *a, int *x) {
    // i is the beginning of the segment
    for (int i = threadIdx.x; i < n; i += blockDim.x) {
        int sum = 0;
        // j is the end of the segment
        for (int j = i; j < n; j++) {
            sum = sum + a[j];
            if (j-i+1 >= k && x[threadIdx.x] < sum * 1000ll / (j-i+1))
                x[threadIdx.x] = sum * 1000ll / (j-i+1);
        }
    }
}

int main() {
    ios::sync_with_stdio(0);
    cin.tie(0);
    cout.tie(0);
    
    // n = number of elements, k = minimum segment length, T = number of threads
    int n, k, T = 256;
    cin >> n >> k;

    // a = input array, x = array of results (one for each thread, so no two threads write to the same place at the same time)
    int *a, *x;

    // https://devblogs.nvidia.com/even-easier-introduction-cuda/ here I just copy their approach
    cudaMallocManaged(&a, n*sizeof(int));
    cudaMallocManaged(&x, T*sizeof(int));
    
    for (int i = 0; i < n; i++)
        cin >> a[i];
    for (int i = 0; i < T; i++)
        x[i] = 0;

    solve<<<1, T>>>(n, k, a, x);

    cudaDeviceSynchronize();

    // the result is the maximum of all results
    int res = 0;
    for (int i = 0; i < T; i++)
        res = max(res, x[i]);
    cout << res << endl;

    cudaFree(a);
    cudaFree(x);

    return 0;
}

I tried figuring out where exactly the error occurs, and from what I could tell the program crashes here:

for (int i = 0; i < T; i++)
    res = max(res, x[i]);

As soon as the program tries to read any value from x, it crashes. If I comment out this for loop, it exits normally. I checked the pointers a and x, and neither is null.

The program works fine for input arrays of length <= 30 000. That would seem to imply I am allocating too much memory; however, I can run the program below (copied from the Even Easier Introduction) with 1e7 elements just fine:

#include <iostream>
#include <math.h>
// Kernel function to add the elements of two arrays
__global__
void add(int n, float *x, float *y)
{
  for (int i = 0; i < n; i++)
    y[i] = x[i] + y[i];
}

int main(void)
{
  int N = 1<<20;
  float *x, *y;

  // Allocate Unified Memory – accessible from CPU or GPU
  cudaMallocManaged(&x, N*sizeof(float));
  cudaMallocManaged(&y, N*sizeof(float));

  // initialize x and y arrays on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // Run kernel on 1M elements on the GPU
  add<<<1, 1>>>(N, x, y);

  // Wait for GPU to finish before accessing on host
  cudaDeviceSynchronize();

  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  std::cout << "Max error: " << maxError << std::endl;

  // Free memory
  cudaFree(x);
  cudaFree(y);
  
  return 0;
}

My graphics card is a 750M. If there is any extra information you would like, I will reply asap.

This may be a WDDM TDR timeout, or the display watchdog equivalent on Linux if that is what you are using. When the watchdog kills a long-running kernel, the managed allocations can become inaccessible from the host afterwards, which would explain why the crash shows up at the first read of x after cudaDeviceSynchronize(). There are no limits on memory size/usage applicable to what you are showing here.
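To confirm the diagnosis, add error checking around the kernel launch. A minimal sketch, dropped into main() in place of the existing launch and cudaDeviceSynchronize() call, reusing the names from your code; a launch killed by the watchdog typically reports something like "the launch timed out and was terminated":

    solve<<<1, T>>>(n, k, a, x);

    // errors from the launch itself (bad configuration, etc.)
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess)
        // errors raised while the kernel runs, e.g. the watchdog killing it
        err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        cerr << "CUDA error: " << cudaGetErrorString(err) << endl;
        return 1;
    }

Also note that <<<1, T>>> launches a single block, which uses only a small fraction of the GPU; spreading the work across many blocks would shorten the kernel's run time considerably and make it far less likely to hit the watchdog limit (about two seconds by default under WDDM).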