Unused memory detected using compute-sanitizer initcheck for Thrust functions

Hello!

I'm currently optimizing CUDA code that computes a gradient for each subset, with the total number of subsets fixed. Depending on the initial guess, a gradient computation may fail, so the code iterates with different initial guesses until all gradients have been computed. A device array (d_mask) serves as the control mechanism, storing the status of each subset (1 means not computed yet, 0 means already computed). The current version launches the same number of thread blocks in every iteration, which leaves idle thread blocks for the subsets that have already been computed.
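
To make the starting point concrete, the pattern being replaced looks roughly like this (a sketch with hypothetical names, assuming one block per subset, which is what produces the idle blocks):

    __global__ void gradient_kernel_sketch(const unsigned char* mask /*, ... */) {
        const int subset = blockIdx.x;       // one block per subset
        if (mask[subset] == 0) return;       // idle block: subset already computed
        // ... compute the gradient for this subset ...
    }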

My optimization works as follows:

  1. Create a control array (d_control) using process_mask_kernel, where -1 marks an already computed subset and the element's own index marks a subset not yet computed. For example:

mask[0] = 1  →  control[0] = 0
mask[1] = 0  →  control[1] = -1
mask[2] = 1  →  control[2] = 2
mask[3] = 0  →  control[3] = -1
mask[4] = 1  →  control[4] = 4
mask[5] = 0  →  control[5] = -1

  2. Use thrust::count on the mask array to count the number of 1s (subsets still to be computed).
  3. Use thrust::remove on the control array to remove the -1s (subsets already computed), which compacts it to:

control[0] = 0
control[1] = 2
control[2] = 4
control[3] = 6
control[4] = 8
control[5] = 10

  4. Launch the gradient kernel for exactly that number of subsets, indexing through the compacted control array (sketched below).
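
A sketch of the compacted launch in step 4 (hypothetical kernel name, assuming one block per remaining subset):

    __global__ void gradient_kernel_compacted(const int* control /*, ... */) {
        const int subset = control[blockIdx.x];   // map block to its real subset index
        // ... compute the gradient for this subset ...
    }

    // host side: launch only as many blocks as there are remaining subsets
    gradient_kernel_compacted<<< val_count, threads >>>(d_control /*, ... */);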

My optimization is faster, gives the right results, and the time taken by the two Thrust calls is negligible compared to the gradient kernel. However, while testing with compute-sanitizer initcheck, it detects two memory access errors, and I don't know why. I created a toy version of the original code (where the odd entries are already computed), but compute-sanitizer finds the same errors. The two functions generating the errors are the following lines:

    const int val_count = thrust::count(t_d_mask, t_d_mask + elements, 1);      // count 1s
    thrust::remove(t_d_control, t_d_control + elements, -1);                    // remove -1s

I've uploaded the toy version (code.zip, code/main.cu) and posted it below, as well as the compute-sanitizer output (code/compute-sanitizer_output.dat).

#include <iostream>
#include <cuda_runtime.h>
#include <thrust/device_ptr.h>
#include <thrust/remove.h>
#include <thrust/count.h>

//#include "src/print_output_file.h"

inline void check_cuda_error(cudaError_t err) {
    if (err != cudaSuccess) {
        std::cerr << "CUDA Error: " << cudaGetErrorString(err) << std::endl;
        exit(EXIT_FAILURE);
    }
}

__global__ void process_mask_kernel(const unsigned char* input, const int size, int* output) {
    const int thd_id = blockIdx.x * blockDim.x + threadIdx.x;
    if (thd_id < size)  output[thd_id] = (input[thd_id] == 0) ? -1 : thd_id;
}

int main() {
    constexpr int elements = 500000;
    constexpr int threads = 256;
    const int blocks = (elements + threads - 1) / threads;

    unsigned char *d_mask;

    int *d_control;

    check_cuda_error( cudaMalloc(&d_mask, elements*sizeof(unsigned char)) );
    check_cuda_error( cudaMalloc(&d_control, elements*sizeof(int)) );

    unsigned char *h_mask = new unsigned char[elements];
    for(int i=0; i<elements; ++i)  h_mask[i] = (i+1) % 2;
    check_cuda_error( cudaMemcpy(d_mask, h_mask, elements*sizeof(unsigned char), cudaMemcpyHostToDevice) );
    delete[] h_mask;

    process_mask_kernel<<< blocks, threads >>>(d_mask, elements, d_control);
    check_cuda_error( cudaGetLastError() );     // check the kernel launch for errors

    //copy_device_host_write_to_file( d_control, elements, "odd_entries.dat" );   // test output

    thrust::device_ptr<unsigned char> t_d_mask(d_mask);
    thrust::device_ptr<int> t_d_control(d_control);

    const int val_count = thrust::count(t_d_mask, t_d_mask + elements, 1);      // count 1s
    thrust::remove(t_d_control, t_d_control + elements, -1);                    // remove -1s

    //copy_device_host_write_to_file( d_control, elements, "odd_remove.dat" );    // test output
    std::cout << "Elements: " << elements << ", number of 1s: " << val_count << std::endl;

    cudaFree(d_mask);
    cudaFree(d_control);
    return 0;
}
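
One note on the listing above: thrust::remove returns an iterator to the new logical end of the range, which the toy code discards because val_count already gives the compacted size. Capturing it instead looks like this, and the two values should agree:

    const auto new_end = thrust::remove(t_d_control, t_d_control + elements, -1);
    const int compacted = static_cast<int>(new_end - t_d_control);   // equals val_count

Entries beyond new_end are left with unspecified values after the removal, so only the first val_count entries of d_control should feed the launch.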

compute-sanitizer with memcheck doesn't display any errors, but running it with initcheck shows 2 errors:

$ compute-sanitizer --tool initcheck --track-unused-memory ./main
========= COMPUTE-SANITIZER
========= Unused memory in allocation 0x77132c7e8600 of size 6,151 bytes
========= Not written 4,183 bytes between offsets 0x8 (0x77132c7e8608) and 0x1806 (0x77132c7e9e06) (inclusive)
========= 68.0052% of allocation were unused.
========= Saved host backtrace up to driver entry point at allocation time
========= Host Frame: [0x27d01e]
========= in /lib/x86_64-linux-gnu/libcuda.so.1
========= Host Frame:libcudart_static_5382377d5c772c9d197c0cda9fd9742ee6ad893c [0x65e5d]
========= in /home/amedina/daily_work/cuda_samples/21_thrust/devel_01_thrust/./main
========= Host Frame:libcudart_static_f74e2f2bcf2cf49bd1a61332e1d15bd1e748f9cf [0x30812]
========= in /home/amedina/daily_work/cuda_samples/21_thrust/devel_01_thrust/./main
========= Host Frame:cudaMalloc [0x753b4]
========= in /home/amedina/daily_work/cuda_samples/21_thrust/devel_01_thrust/./main
========= Host Frame:void* thrust::THRUST_200500_890_NS::cuda_cub::malloc<thrust::THRUST_200500_890_NS::cuda_cub::tag>(thrust::THRUST_200500_890_NS::cuda_cub::execution_policy<thrust::THRUST_200500_890_NS::cuda_cub::tag>&, unsigned long) [0x29ca1]

(The full output is attached as compute-sanitizer_output.dat.)

I'm running the following configuration on Ubuntu 24.04.2 LTS:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.05              Driver Version: 560.35.05      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4060 ...    Off |   00000000:01:00.0  On |                  N/A |
| N/A   37C    P8              5W /   60W |      56MiB /   8188MiB |     35%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
  1. Can you help me identify the memory usage problem in my CUDA code?
  2. Am I referencing the pointers correctly in my implementation?
  3. Are the Thrust functions I am using well-defined and appropriate for my use case?

Thanks in advance


code.zip (4.4 KB)

Those aren't memory access errors. They are reports of unused allocations (or unused portions of them). That doesn't necessarily represent a problem. compute-sanitizer flags it as an "error" based on the assumption that you want/need/expect to use every byte of every allocation. But that assumption (evidently) isn't sensible with Thrust.

If you are questioning Thrust's design/behavior, you can file a Thrust issue. You could also open a discussion there. But this looks innocuous to me.
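
For reference, the backtrace shows the flagged allocation comes from Thrust's internal temporary storage (cuda_cub::malloc), not from your own cudaMalloc calls. If you want to observe or control those temporaries, Thrust lets you supply an allocator through the execution policy. A minimal sketch (logging_allocator is illustrative, not part of Thrust, and reuses the check_cuda_error helper from the toy code):

    #include <cstddef>
    #include <thrust/system/cuda/execution_policy.h>

    // Routes Thrust's temporary storage through cudaMalloc/cudaFree
    // and logs each request size.
    struct logging_allocator {
        typedef char value_type;

        char* allocate(std::ptrdiff_t num_bytes) {
            std::cout << "thrust temporary: " << num_bytes << " bytes\n";
            char* p = nullptr;
            check_cuda_error( cudaMalloc(&p, num_bytes) );
            return p;
        }

        void deallocate(char* p, size_t) {
            check_cuda_error( cudaFree(p) );
        }
    };

    // usage: pass the allocator via the CUDA execution policy
    logging_allocator alloc;
    const int val_count = thrust::count(thrust::cuda::par(alloc),
                                        t_d_mask, t_d_mask + elements, 1);
    thrust::remove(thrust::cuda::par(alloc),
                   t_d_control, t_d_control + elements, -1);

With this in place, the initcheck report points at your own allocator instead, and you can see how large a scratch buffer Thrust requests for each call.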

Thank you for the clarification, Bob. If any further issues arise, I’ll reach out to the Thrust community for discussion.
