Hello!
I’m currently optimizing a CUDA code that computes a gradient for each subset, with the total number of subsets fixed. Depending on the initial guess, the gradient computation may fail for some subsets, so the code iterates with different initial guesses until all gradients have been computed. A device array (d_mask) serves as a control mechanism, storing the status of each subset (1 means not yet computed, 0 means already computed). The current version launches the same number of thread blocks in every iteration, which leaves idle thread blocks for the subsets that have already been computed.
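Schematically, the current scheme looks like the sketch below; the kernel name, argument list, and the one-block-per-subset mapping are placeholders for illustration, not the real gradient code:

__global__ void gradient_kernel(const unsigned char* mask, const int n_subsets /*, ... */) {
    const int subset = blockIdx.x;     // hypothetical mapping: one block per subset
    if (subset >= n_subsets) return;
    if (mask[subset] == 0) return;     // idle block: this subset was already computed
    // ... compute the gradient for this subset ...
}

// Launched every iteration with the same fixed grid, no matter how many subsets remain:
// gradient_kernel<<< n_subsets, threads_per_block >>>(d_mask, n_subsets /*, ... */);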
My optimization works as follows:
- Create a control array (d_control) with process_mask_kernel, where an entry is -1 for a subset that has already been computed and holds its own index for a subset that still needs to be computed. For example:
mask[0] = 1, → control[0] = 0
mask[1] = 0, → control[1] = -1
mask[2] = 1, → control[2] = 2
mask[3] = 0, → control[3] = -1
mask[4] = 1, → control[4] = 4
mask[5] = 0, → control[5] = -1
…
- Use thrust::count on the mask array to count the number of 1s (subsets still to be computed).
- Use thrust::remove on the control array to remove the -1s (subsets already computed), which leaves:
control[0] = 0
control[1] = 2
control[2] = 4
control[3] = 6
control[4] = 8
control[5] = 10
…
- Launch the gradient kernel for just the remaining subsets, i.e. with a correspondingly smaller grid (a simplified sketch follows below).
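A minimal sketch of the optimized launch, again with placeholder names and assuming one block per remaining subset:

__global__ void gradient_kernel_compacted(const int* control, const int n_remaining /*, ... */) {
    const int slot = blockIdx.x;            // dense index into the compacted control array
    if (slot >= n_remaining) return;
    const int subset = control[slot];       // original subset index; the -1 entries were removed
    // ... compute the gradient for this subset ...
}

// Only val_count blocks instead of the full fixed grid:
// gradient_kernel_compacted<<< val_count, threads_per_block >>>(d_control, val_count /*, ... */);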
My optimization is faster, gives correct results, and the time taken by the two Thrust calls is negligible compared to the gradient kernel. However, when I test with compute-sanitizer initcheck, it reports two errors, and I don’t know why. I created a toy version of the original code (where every other subset is marked as already computed), and compute-sanitizer reports the same errors. The two calls it flags are the following lines:
const int val_count = thrust::count(t_d_mask, t_d_mask + elements, 1); // count 1s
thrust::remove(t_d_control, t_d_control + elements, -1); // remove -1s
I’ve uploaded the toy version (code.zip, code/main.cu) and posted it below, together with the compute-sanitizer output (code/compute-sanitizer_output.dat).
#include <iostream>
#include <cuda_runtime.h>
#include <thrust/device_ptr.h>
#include <thrust/remove.h>
#include <thrust/count.h>
//#include "src/print_output_file.h"

inline void check_cuda_error(cudaError_t err) {
    if (err != cudaSuccess) {
        std::cerr << "CUDA Error: " << cudaGetErrorString(err) << std::endl;
        exit(EXIT_FAILURE);
    }
}

// Build the control array: -1 for an already-computed subset (mask == 0),
// the subset's own index otherwise.
__global__ void process_mask_kernel(const unsigned char* input, const int size, int* output) {
    const int thd_id = blockIdx.x * blockDim.x + threadIdx.x;
    if (thd_id < size) output[thd_id] = (input[thd_id] == 0) ? -1 : thd_id;
}

int main() {
    constexpr int elements = 500000;
    constexpr int threads = 256;
    const int blocks = (elements + threads - 1) / threads;

    unsigned char *d_mask;
    int *d_control;
    check_cuda_error( cudaMalloc(&d_mask, elements*sizeof(unsigned char)) );
    check_cuda_error( cudaMalloc(&d_control, elements*sizeof(int)) );

    // Toy mask: every other subset (the odd-indexed ones) is marked as already computed.
    unsigned char *h_mask = new unsigned char[elements];
    for (int i = 0; i < elements; ++i) h_mask[i] = (i+1) % 2;
    check_cuda_error( cudaMemcpy(d_mask, h_mask, elements*sizeof(unsigned char), cudaMemcpyHostToDevice) );
    delete[] h_mask;

    process_mask_kernel<<< blocks, threads >>>(d_mask, elements, d_control);
    check_cuda_error( cudaGetLastError() );
    //copy_device_host_write_to_file( d_control, elements, "odd_entries.dat" ); // test output

    thrust::device_ptr<unsigned char> t_d_mask(d_mask);
    thrust::device_ptr<int> t_d_control(d_control);
    const int val_count = thrust::count(t_d_mask, t_d_mask + elements, 1); // count 1s
    thrust::remove(t_d_control, t_d_control + elements, -1);               // remove -1s
    //copy_device_host_write_to_file( d_control, elements, "odd_remove.dat" ); // test output

    std::cout << "Elements: " << elements << ", number of 1s: " << val_count << std::endl;

    cudaFree(d_mask);
    cudaFree(d_control);
    return 0;
}
compute-sanitizer with memcheck doesn’t report any errors, but the run with initcheck shows two:
$ compute-sanitizer --tool initcheck --track-unused-memory ./main
========= COMPUTE-SANITIZER
========= Unused memory in allocation 0x77132c7e8600 of size 6,151 bytes
========= Not written 4,183 bytes between offsets 0x8 (0x77132c7e8608) and 0x1806 (0x77132c7e9e06) (inclusive)
========= 68.0052% of allocation were unused.
========= Saved host backtrace up to driver entry point at allocation time
========= Host Frame: [0x27d01e]
========= in /lib/x86_64-linux-gnu/libcuda.so.1
========= Host Frame:libcudart_static_5382377d5c772c9d197c0cda9fd9742ee6ad893c [0x65e5d]
========= in /home/amedina/daily_work/cuda_samples/21_thrust/devel_01_thrust/./main
========= Host Frame:libcudart_static_f74e2f2bcf2cf49bd1a61332e1d15bd1e748f9cf [0x30812]
========= in /home/amedina/daily_work/cuda_samples/21_thrust/devel_01_thrust/./main
========= Host Frame:cudaMalloc [0x753b4]
========= in /home/amedina/daily_work/cuda_samples/21_thrust/devel_01_thrust/./main
========= Host Frame:void* thrust::THRUST_200500_890_NS::cuda_cub::malloc<thrust::THRUST_200500_890_NS::cuda_cub::tag>(thrust::THRUST_200500_890_NS::cuda_cub::execution_policy<thrust::THRUST_200500_890_NS::cuda_cub::tag>&, unsigned long) [0x29ca1]
…
(full output attached as compute-sanitizer_output.dat)
I’m running the following configuration on Ubuntu 24.04.2 LTS:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.05              Driver Version: 560.35.05      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4060 ...    Off |   00000000:01:00.0  On |                  N/A |
| N/A   37C    P8               5W /   60W |      56MiB /   8188MiB |     35%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
- Can you help me identify the memory usage problem in my CUDA code?
- Am I referencing the pointers correctly in my implementation?
- Are the Thrust functions I am using well-defined and appropriate for my use case?
Thanks in advance
code.zip (4.4 KB)