I have been having a small problem with CUDA for the last few days.
I have written a very simple program that applies a median filter to a 1D array. It works fine with a small float array. Today I tried to test it on a much larger array of 1,440,000 elements. I use a macro that checks for a CUDA error after each call to a CUDA function.
Everything works fine until I reach the cudaMemcpy (DeviceToHost), which freezes my PC. If I comment out that line of code, it is the cudaFree that freezes my PC…
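Here is roughly what the failing part looks like (simplified; CUDA_CHECK, h_out and d_out are just the names I use here for my macro and my host/device buffers):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Macro that checks the return code of every CUDA call
#define CUDA_CHECK(call) do { \
    cudaError_t err = (call); \
    if (err != cudaSuccess) { \
        printf("CUDA error: %s at %s:%d\n", cudaGetErrorString(err), __FILE__, __LINE__); \
        exit(EXIT_FAILURE); \
    } \
} while (0)

const int N = 1440000;

// ... d_out allocated with cudaMalloc, h_out with malloc, kernel already launched ...

// This copy freezes the machine:
CUDA_CHECK(cudaMemcpy(h_out, d_out, N * sizeof(float), cudaMemcpyDeviceToHost));
// If I comment it out, this call freezes instead:
CUDA_CHECK(cudaFree(d_out));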
I don't know what is happening. Has anybody run into the same problem?
That's not a lot of information. On the face of it, I can only think that the presumably needed first copy from host to device does something undesirable, maybe writing out of bounds? Perhaps comment out that first copy and see what happens then; that should point to the real culprit.
I also think this is my mistake. I allocate a float array of 1,440,000 * sizeof(float) bytes and bind it to a 1D texture with cudaBindTexture(). Am I out of bounds for the texture? And if so, why doesn't cudaBindTexture() return an error?
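Roughly what I do (simplified; tex and d_in are just the names used here):

// 1D texture reference bound to linear device memory
texture<float, 1, cudaReadModeElementType> tex;

const int N = 1440000;
float *d_in;
cudaMalloc((void **)&d_in, N * sizeof(float));

// Bind the linear memory to the texture; the first argument (the offset pointer)
// is 0 because d_in comes straight from cudaMalloc
cudaBindTexture(0, tex, d_in, N * sizeof(float));

// In the kernel the values are then read with tex1Dfetch(tex, i)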
I found my problem; it was my fault, I was not writing to the correct chunk of memory.
Now I have another problem. My program works fine when I use this kernel:
(Grid configuration: 32 threads per block, num_blocks = N / threads_per_block, so one thread processes one array element)
const int x = blockIdx.x * blockDim.x + threadIdx.x;
const int idx = threadIdx.x;
// Compute & put the float value in shared memory //
__syncthreads();
d_out[x] = median[idx];
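For reference, the launch configuration described above looks roughly like this (assuming N is a multiple of the block size; the kernel name and argument list are just placeholders):

const int THREADS_PER_BLOCK = 32;
dim3 block(THREADS_PER_BLOCK);
dim3 grid(N / THREADS_PER_BLOCK);              // one thread per array element
medianFilterKernel<<<grid, block>>>(d_out);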
I'd like to get better coalesced writes to global memory, so I tried this:
const int x = blockIdx.x * blockDim.x + threadIdx.x;
const int idx = threadIdx.x;
// Compute & put the float value in an array called "median" in shared memory //
__syncthreads();
int index = blockDim.x / 4;                        // number of float4 stores per block
if (idx >= index) return;                          // only the first blockDim.x / 4 threads write
d_out += blockIdx.x * blockDim.x;                  // advance to this block's output chunk (in floats)
((float4 *)d_out)[idx] = median_float4[idx];       // each remaining thread stores one float4
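Put together, the idea looks roughly like this (just a sketch: the kernel name, the fixed shared-array size of 32, and the placeholder median computation are assumptions, and the actual filtering code is elided):

__global__ void medianFilterKernel(float *d_out)
{
    // One value per thread; aligned so the array can be reinterpreted as float4
    __shared__ __align__(16) float median[32];

    const int x   = blockIdx.x * blockDim.x + threadIdx.x;
    const int idx = threadIdx.x;

    // ... compute the median for element x ...
    median[idx] = (float)x;                        // placeholder for the real computation

    __syncthreads();

    // Vectorized write-back: the first blockDim.x / 4 threads each write one float4
    const int writes_per_block = blockDim.x / 4;
    if (idx >= writes_per_block) return;

    float4 *out4    = (float4 *)(d_out + blockIdx.x * blockDim.x);
    float4 *median4 = (float4 *)median;
    out4[idx] = median4[idx];
}

With 32 threads per block, the 8 float4 stores of one block cover a single contiguous 128-byte segment, which is what I hope makes the write coalesced.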