Invalid global write of size 4. Need help with debugging

edft56 · October 26, 2020, 10:41am

Hi, I am having trouble debugging a simple kernel I wrote. This is the first kernel I’ve written in cuda so maybe its an obvious mistake on my part. Here’s the code:

Code

include <stdio.h>
include <stdlib.h>
include <string.h>
include
include <assert.h>

include <cuda_runtime.h>

define gpuErrchk(ans) { gpuAssert((ans), FILE, LINE); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
if (code != cudaSuccess)
{
fprintf(stderr,“GPUassert: %s %s %d\n”, cudaGetErrorString(code), file, line);
if (abort) exit(code);
}
}

define SHMEM_SIZE 256*6

global void BtdB(const float* restrict input, float* restrict output){
int tid = (blockIdx.x * blockDim.x * 6) + threadIdx.x * 6;
__shared__ float s_d[SHMEM_SIZE];

if (tid<338682) {
    for(int i=0; i<6; i++){
        s_d[tid] = input[tid];
    }
    __syncthreads();

    output[tid] = 4*s_d[tid] - 5*s_d[tid+2] + s_d[tid+4];
    output[tid+1] = -4*s_d[tid+1] - 4*s_d[tid+2] + s_d[tid+3] + s_d[tid+4];
    output[tid+2] = 4*s_d[tid+1] - 4*s_d[tid+2] - s_d[tid+3] + s_d[tid+4];
    output[tid+3] = -2*s_d[tid+1] - s_d[tid+2] + 2*s_d[tid+3] + s_d[tid+4];
    output[tid+4] =  2*s_d[tid+1] - s_d[tid+2] - 2*s_d[tid+3] + s_d[tid+4];
    output[tid+5] =  4*s_d[tid+1] - 5*s_d[tid+3] + s_d[tid+5];
}
}

void verify(float* input, float* output_gpu){
float B[36] = {4,0,-5,0,1,0,0,-4,-4,1,1,0,0,4,-4,-1,1,0,0,-2,-1,2,1,0,0,2,-1,-2,1,0,0,4,0,-5,0,1};
for(int i=0; i<6; i++){
    for(int j=0; j<6; j++){
        float tmp=0;
        for(int k=0; k<6; k++){
            tmp = tmp + B[j*6+k]*input[i*6+k];
        }
        assert(output_gpu[i*6+j] == tmp);
    }
}
}

int main(){
float *input;
float *output;

cudaMallocHost((void**) &input, 338688*sizeof(float));
cudaMallocHost((void**) &output, 338688*sizeof(float));


for(int i=0; i<338688; i++){
    input[i] = (float)(rand() % 100);
}


float* d_input;
float* d_output;
cudaMalloc((void **) &d_input, 338688*sizeof(float));
cudaMalloc((void **) &d_output, 338688*sizeof(float));


cudaMemcpy(d_input, input, 338688*sizeof(float), cudaMemcpyHostToDevice);

int THREADS = 256;

int BLOCKS = (56448 + THREADS - 1) / THREADS;

dim3 threads(THREADS);
dim3 blocks(BLOCKS);


BtdB<<<blocks, threads>>>(d_input,d_output);

gpuErrchk( cudaPeekAtLastError() );
gpuErrchk( cudaDeviceSynchronize() );

gpuErrchk( cudaMemcpy(output, d_output, 338688*sizeof(float), cudaMemcpyDeviceToHost) );

verify(input,output);


cudaFree(d_input);
cudaFree(d_output);
cudaFreeHost(input);
cudaFreeHost(output);

return 0;

}

The problem arises when I use shared memory. If I read straight from VRAM without writing to shared memory, then the code executes correctly so the indexing seems fine. There might be some shared memory restriction that I am not aware of. Here is the output of cuda-memcheck:

cuda-memcheck output

Invalid global write of size 4
========= at 0x000003b0 in /*****/Btdb.cu:29:BtdB(float const , float)
========= by thread (1,0,0) in block (1,0,0)
========= Address 0x01001818 is out of bounds
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/lib/x86_64-linux-gnu/libcuda.so.1 (cuLaunchKernel + 0x2b8) [0x1e5cc8]
========= Host Frame:./Btdb [0xc7bb]
========= Host Frame:./Btdb [0x52495]
========= Host Frame:./Btdb [0x843f]
========= Host Frame:./Btdb [0x82e4]
========= Host Frame:./Btdb [0x8333]
========= Host Frame:./Btdb [0x808b]
========= Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf3) [0x270b3]
========= Host Frame:./Btdb [0x7aee]

There are a lot of similar errors(different threads,blocks etc). The culprit seems to be this line of code:

for(int i=0; i<6; i++){
     s_d[tid] = input[tid];   <------
}

Memcheck seems confused though as it shouble a shared write instead of a global one.

Some last things to note. I am using a gtx860m (cc 5.0). Driver 455.23.05 on ubuntu 20.04.1 and cuda 11.1
I am using these flags to compile: -G -gencode arch=compute_50,code=sm_50

Thank you for your time.

Robert_Crovella · October 26, 2020, 2:10pm

Perhaps you don’t understand how shared memory works. You’ve allocated 1536 float locations in shared memory:

__shared__ float s_d[SHMEM_SIZE];

Why would you think you can index into it with indices as large as 338681?

if (tid<338682) {
    for(int i=0; i<6; i++){
        s_d[tid] = input[tid];
    }

It also doesn’t make sense to have a loop that is writing the same thing 6 times into the same location.

Perhaps you meant something like this:

  if (tid<338682) {
      for(int i=0; i<6; i++){
          s_d[threadIdx.x*6+i] = input[tid+i];
      }

That’s probably not right either (I haven’t studied your code), but it may help with the illegal indexing.

https://developer.nvidia.com/blog/using-shared-memory-cuda-cc/

cuda-memcheck is confused about global/shared because the pointer you have assembled by adding such a large value to a shared base pointer is out of range of the shared space altogether, and so it appears to be a global pointer.

edft56 · October 26, 2020, 2:49pm

Hi Robert, you are absolutely right. I messed with the code trying to debug it and didnt notice it when posting it. It should have been :

if (tid<338682) {
        for(int i=0; i<6; i++){
            s_d[tid+i] = input[tid+i];
        }

But looking at your suggestion it became obvious to me that while i was allocating shared memory space on a per threadblock basis, I wasnt indexing in that way, stupidly thinking that since I included blockId.x in tid the compiler would just get it. So the solution was to replace s_d[tid+i] with s_d[threadIdx.x*6+i] as you suggested.

I want to thank you for your help as I was starting to lose my mind trying to debug this for hours!

Topic		Replies	Views
What is wrong? please help me,thanks CUDA Programming and Performance	2	1732	December 13, 2017
Writing to global memory CUDA Programming and Performance	5	1719	November 19, 2009
Invalid __global__ write how to determine the wriight line? CUDA Programming and Performance	1	775	October 31, 2011
Invalid __global__ write how to determine the wriight line? CUDA Programming and Performance	0	2528	October 31, 2011
Memcheck error, dynamic parallelism and address Compute Sanitizer	4	1181	November 16, 2023
Issue with Writing to Global memory CUDA Programming and Performance	5	2961	May 16, 2009
bug in cuda-memcheck CUDA-MEMCHECK	1	3095	January 17, 2013
Tracking Invalid read size and illegal memory access CUDA Programming and Performance	3	7799	May 24, 2016
shared memory wrong allocation? CUDA Programming and Performance	2	880	July 29, 2009
COMPUTE-SANITIZER " Invalid __shared__ write of size 16 bytes " CUDA Programming and Performance	3	1279	August 17, 2022

Invalid __global__ write of size 4. Need help with debugging

Related topics

Invalid global write of size 4. Need help with debugging