I am writing a program that reads two numbers, N and K, from the input, then reads N integers and stores them in an array. The goal is to compute the maximum average over all contiguous segments of length at least K and print it, multiplied by 1000 and truncated to an integer. This was a task in a programming competition I once participated in, and an O(n*log(n)) solution exists. However, I think that running an O(n*n) solution with O(n) memory on a graphics card instead of a processor should be fast enough. For some reason, though, once the input reaches around 90 000 elements, the program exits with a bus exception.
#include <bits/stdc++.h>
using namespace std;

__global__
void solve(int n, int k, int *a, int *x) {
    // i is the beginning of the segment
    for (int i = threadIdx.x; i < n; i += blockDim.x) {
        int sum = 0;
        // j is the end of the segment
        for (int j = i; j < n; j++) {
            sum = sum + a[j];
            if (j-i+1 >= k && x[threadIdx.x] < sum * 1000ll / (j-i+1))
                x[threadIdx.x] = sum * 1000ll / (j-i+1);
        }
    }
}

int main() {
    ios::sync_with_stdio(0);
    cin.tie(0);
    cout.tie(0);
    // n = number of elements, k = minimum segment length, T = number of threads
    int n, k, T = 256;
    cin >> n >> k;
    // a = input array, x = array of results (one for each thread, so no two threads write to the same place at the same time)
    int *a, *x;
    // https://devblogs.nvidia.com/even-easier-introduction-cuda/ here I just copy their approach
    cudaMallocManaged(&a, n*sizeof(int));
    cudaMallocManaged(&x, T*sizeof(int));
    for (int i = 0; i < n; i++)
        cin >> a[i];
    for (int i = 0; i < T; i++)
        x[i] = 0;
    solve<<<1, T>>>(n, k, a, x);
    cudaDeviceSynchronize();
    // the result is the maximum of all results
    int res = 0;
    for (int i = 0; i < T; i++)
        res = max(res, x[i]);
    cout << res << endl;
    cudaFree(a);
    cudaFree(x);
    return 0;
}
I tried figuring out where exactly the error occurs, and from what I could tell the program crashes here:
    for (int i = 0; i < T; i++)
        res = max(res, x[i]);
As soon as the program tries to read any value from x, it crashes. If I comment out this for loop, the program exits normally. I checked the pointer x itself, and it is not null.
The program works fine for input arrays of length <= 30 000. That would suggest I am allocating too much memory, but I can run the following program, copied from the "Even Easier Introduction to CUDA" post linked above, with 1e7 elements just fine.
#include <iostream>
#include <math.h>

// Kernel function to add the elements of two arrays
__global__
void add(int n, float *x, float *y)
{
    for (int i = 0; i < n; i++)
        y[i] = x[i] + y[i];
}

int main(void)
{
    int N = 1<<20;
    float *x, *y;
    // Allocate Unified Memory – accessible from CPU or GPU
    cudaMallocManaged(&x, N*sizeof(float));
    cudaMallocManaged(&y, N*sizeof(float));
    // initialize x and y arrays on the host
    for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }
    // Run kernel on 1M elements on the GPU
    add<<<1, 1>>>(N, x, y);
    // Wait for GPU to finish before accessing on host
    cudaDeviceSynchronize();
    // Check for errors (all values should be 3.0f)
    float maxError = 0.0f;
    for (int i = 0; i < N; i++)
        maxError = fmax(maxError, fabs(y[i]-3.0f));
    std::cout << "Max error: " << maxError << std::endl;
    // Free memory
    cudaFree(x);
    cudaFree(y);
    return 0;
}
My graphics card is a 750M. If there is any extra information you would like, I will reply asap.