How to properly debug this memory issue

dries · August 26, 2020, 1:05pm

Hi all,

I’m still rather new to CUDA and I’m struggling with an issue I can’t seem to debug.
I’m working on a Windows system with a Quadro P620 GPU.
For some reason I cannot get the profilers working properly the way I did on a Linux system, so I’m flying half blind.
I hope somebody can give me a push in the right direction.

Below is some sample code for my problem.
This code generates two vectors, one float* and one Complex*.
A kernel copies the floats into the real part of the first elements of the complex vector.

In the second part, some small vectors are created and sent to the GPU. Without being altered, they are copied back and printed out. The expectation is that the same value is printed twice.

For N = 16 I correctly get:
5
5

For N = 32 I get:
5
-4.31602e+08

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <iostream>

typedef float2 Complex;
__global__ void copyImageToReal(float*, Complex*);

int main()
{
    int N = 32;
    int height = 2160; // define height
    int width = 4096; // define width

    Complex* d_target;
    cudaMalloc((void**)&d_target, sizeof(Complex)* height* width * N); // assign device memory
    cudaMemset((void**)&d_target, 0, sizeof(Complex)* height* width * N); // assign 0 to all elements

    float* d_source;
    cudaMalloc((void**)&d_source, sizeof(float) * height * width); // assign device memory
    cudaMemset((void**)&d_source, 1, sizeof(float)* height* width); // assign 1 to all elements

    //copyImageToReal <<< 8640, 1024 >>>(d_source, d_target); // put source data to real part of the target // Target needs to be larger than source

    cudaDeviceSynchronize();

    // Some random code that has two numbers going to the GPU and back
    float* d_Polar;
    float* h_Polar = (float*)malloc(sizeof(float) * 2);
    float* h_target = (float*)malloc(sizeof(float) * 2);

    h_target[0] = 5;
    h_target[1] = 7;

    std::cout << h_target[0] << std::endl;
    cudaMalloc((void**)&d_Polar, sizeof(float) * 2); // assign device memory
    cudaMemcpy(d_Polar, h_target, sizeof(float) * 2, cudaMemcpyHostToDevice); // copy data
    cudaMemcpy(h_Polar, d_Polar, sizeof(float) * 2, cudaMemcpyDeviceToHost); // copy data
    cudaDeviceSynchronize();

    std::cout << h_Polar[0] << std::endl;

    cudaFree(d_target);
    cudaFree(d_source);
    cudaFree(d_Polar);
    cudaDeviceReset();
    return 0;
}

__global__ void copyImageToReal(float* src, Complex* dst) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    dst[idx].x = src[idx];
};

mfatica · August 27, 2020, 4:42pm

This is probably overflowing for N=32. Add (size_t) in front of height or width.

njuffa · August 27, 2020, 8:28pm

Starting a size computation with sizeof(Type) represents a safe idiom. sizeof returns the size as a size_t (or to be more precise, std::size_t), and since further evaluation proceeds left to right according to C++ rules, the entire expression should be evaluated using type size_t.

What is not safe (danger of overflow in intermediate computation) is the other idiom one sees frequently: [int-expression]*sizeof(Type).
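
To make the size arithmetic concrete, here is a minimal C++ sketch for the dimensions in the posted code. It is an illustration, not code from the thread: `Complex` is a plain stand-in for CUDA's `float2` (assumed to be 8 bytes), the helper names `targetBytes` and `fitsInInt` are hypothetical, and a 64-bit build is assumed, where `size_t` is 64 bits wide.

```cpp
#include <climits>
#include <cstddef>
#include <cstdint>

// Stand-in for CUDA's float2 so this compiles without the CUDA headers
// (assumption: two packed floats, 8 bytes in total).
struct Complex { float x, y; };
static_assert(sizeof(Complex) == 8, "sketch assumes an 8-byte Complex");

// Bytes requested for the complex buffer. Because the product starts with
// the (unsigned, 64-bit) size of the type, every later multiplication is
// carried out in that wide type, mirroring sizeof(Type) * h * w * N.
constexpr std::uint64_t targetBytes(int height, int width, int n) {
    return static_cast<std::uint64_t>(sizeof(Complex)) * height * width * n;
}

// True if the byte count could still be represented in a 32-bit int,
// i.e. an all-int computation of the same product would not overflow.
constexpr bool fitsInInt(std::uint64_t bytes) {
    return bytes <= static_cast<std::uint64_t>(INT_MAX);
}
```

With the posted values, `targetBytes(2160, 4096, 16)` is 1,132,462,080 bytes and `targetBytes(2160, 4096, 32)` is 2,264,924,160 bytes. The second value is larger than `INT_MAX` (2,147,483,647), so it only comes out right because the product is evaluated in a 64-bit unsigned type. It is also more than 2 GiB, which would exceed the memory of a card with roughly 2 GB, as the Quadro P620 is usually specified.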

I would suggest adding error checking for all CUDA API calls and kernel launches. Alternatively, try running the executable with cuda-memcheck. You will likely discover that the cudaMalloc allocation failed, as more memory is requested than is available. Also, this is most certainly incorrect:

cudaMemset((void**)&d_target,

You want to pass a pointer, not the address of a pointer.
