nvCOMP - get compressed data from device

Hi,

Recently I have been experimenting with the nvCOMP API, specifically the low-level quick start example (“low_level_quickstart_example.cpp” in the repo).

More specifically, I am interested in obtaining the compressed data that should be generated by the nvcompBatchedLZ4CompressAsync call.

However, when I try to access it via the approach below, I get only zeros.

Snippet:

  nvcompStatus_t comp_res = nvcompBatchedLZ4CompressAsync(
      device_uncompressed_ptrs,
      device_uncompressed_bytes,
      chunk_size, // The maximum chunk size
      batch_size,
      device_temp_ptr,
      temp_bytes,
      device_compressed_ptrs,
      device_compressed_bytes,
      nvcompBatchedLZ4DefaultOpts,
      stream);

  if (comp_res != nvcompSuccess)
  {
    std::cerr << "Failed compression!" << std::endl;
    assert(comp_res == nvcompSuccess);
  }
  cudaStreamSynchronize(stream);
  
  size_t* host_compressed_bytes;
  cudaMallocHost(&host_compressed_bytes, sizeof(size_t) * batch_size);
  cudaMemcpy(
    host_compressed_bytes, device_compressed_bytes,
    sizeof(size_t) * batch_size, cudaMemcpyDeviceToHost);

  for ( size_t i = 0; i < batch_size; i++ )
  {  
    printf("        host_compressed_bytes[%zu]=%zu\n", i, host_compressed_bytes[i]);
  }

  // Re-use host_compressed_ptrs, free the allocated memory from the above lines
  for ( size_t i = 0; i < batch_size; i++ )
      cudaFree(&host_compressed_ptrs[i]);

  for ( size_t i = 0; i < batch_size; i++ )
  {
    cudaMallocHost(&host_compressed_ptrs[i], host_compressed_bytes[i]);

    printf("        host_compressed_ptrs[%zu] = %p\n", i, host_compressed_ptrs[i]);

    cudaMemcpy(
      &host_compressed_ptrs[i], &device_compressed_ptrs[i],
      host_compressed_bytes[i], cudaMemcpyDeviceToHost);
  }
  
  printf("    [+] Content (leading 5 and trailing 5 bytes):\n");
  for ( size_t i = 0; i < batch_size; i++ )
  {
    uint8_t leading_values[5] = { 0x00 };
    uint8_t trailing_values[5] = { 0x00 };

    size_t start_ofs = 0;
    size_t end_ofs = 5;

    size_t idx = 0;
    for ( size_t j = start_ofs; j < end_ofs; j++ )
      leading_values[idx++] = host_compressed_ptrs[i][j];

    start_ofs = host_compressed_bytes[i] - 5;
    end_ofs = host_compressed_bytes[i];
    // printf("        [+] Index  of trailing %zu start_ofs = %zu, end_ofs = %zu\n", i, start_ofs, end_ofs);
    
    idx = 0;
    for ( size_t j = start_ofs; j < host_compressed_bytes[i]; j++ )
      trailing_values[idx++] = host_compressed_ptrs[i][j];

    printf("        host_compressed_ptrs[%zu] = ", i);
    for ( size_t j = 0; j < 5; j++ )
      printf("%02hhX ", leading_values[j]);

    printf(" . . .  ");
    for ( size_t j = 0; j < 5; j++ )
      printf("%02hhX ", trailing_values[j]);
    printf("\n");
  }

Could you please advise what I am doing wrong? And is there a more natural way (via the API) to get the compressed buffer from the GPU to the host?

Questions about nvCOMP might be a better fit for the sub-forum dealing with GPU-accelerated libraries:

GPU-Accelerated Libraries - NVIDIA Developer Forums.

What happens when you run the code shown here under the control of Compute Sanitizer and/or add proper CUDA error checking?

Hi @njuffa,

I modified the snippet a bit:

  nvcompStatus_t comp_res = nvcompBatchedLZ4CompressAsync(
      device_uncompressed_ptrs,
      device_uncompressed_bytes,
      chunk_size, // The maximum chunk size
      batch_size,
      device_temp_ptr,
      temp_bytes,
      device_compressed_ptrs,
      device_compressed_bytes,
      nvcompBatchedLZ4DefaultOpts,
      stream);

  if (comp_res != nvcompSuccess)
  {
    std::cerr << "Failed compression!" << std::endl;
    assert(comp_res == nvcompSuccess);
  }
  cudaStreamSynchronize(stream);

  printf("[+] Copy device_compressed_bytes array from device to host.\n");
  
  size_t* host_compressed_bytes;
  cudaMallocHost(&host_compressed_bytes, sizeof(size_t) * batch_size);
  cudaMemcpy(host_compressed_bytes, device_compressed_bytes, sizeof(size_t) * batch_size, cudaMemcpyDeviceToHost);

  printf("    [+] Max bytes in host_compressed_bytes array slot: %zu\n", max_out_bytes);
  printf("    [+] Slots in host_compressed_ptrs array: %zu\n", batch_size);
  printf("    [+] Elements:\n");

  for (size_t i = 0; i < batch_size; i++)
    printf("        host_compressed_bytes[%zu]=%zu\n", i, host_compressed_bytes[i]);

  // Re-use host_compressed_ptrs, free the allocated memory from the above lines
  cudaFreeHost(host_compressed_ptrs);

  cudaMallocHost(&host_compressed_ptrs, sizeof(size_t) * batch_size);

  for (size_t i = 0; i < batch_size; i++)
  {
    cudaMallocHost(&host_compressed_ptrs[i], host_compressed_bytes[i]);
    printf("        host_compressed_ptrs[%zu] = %p\n", i, host_compressed_ptrs[i]);

    cudaMemcpy(host_compressed_ptrs[i], device_compressed_ptrs[i], host_compressed_bytes[i], cudaMemcpyDeviceToHost);
  }

Compute Sanitizer stops with:

    [+] Content (leading 5 and trailing 5 bytes):
========= Error: process didn't terminate successfully
========= Target application returned an error
========= ERROR SUMMARY: 0 errors

And the reason is the line where I attempt to copy the device byte arrays to the host, namely here:

cudaMemcpy(host_compressed_ptrs[i], device_compressed_ptrs[i], host_compressed_bytes[i], cudaMemcpyDeviceToHost);

Am I misunderstanding the usage of cudaMemcpy?

Given that Compute Sanitizer reports ERROR SUMMARY: 0 errors, there does not seem to be any error condition on the CUDA side. The next step would be to debug this from the nvCOMP side. I have zero knowledge of nvCOMP.

When running without Compute Sanitizer, I actually get a “Segmentation fault” upon the first invocation of cudaMemcpy:

cudaMemcpy(host_compressed_ptrs[i], device_compressed_ptrs[i], host_compressed_bytes[i], cudaMemcpyDeviceToHost);
    [+] Elements:
        host_compressed_bytes[0]=65794
        host_compressed_bytes[1]=65794
        host_compressed_bytes[2]=65794
        host_compressed_bytes[3]=65794
        host_compressed_bytes[4]=65794
        host_compressed_bytes[5]=65794
        host_compressed_bytes[6]=65794
        host_compressed_bytes[7]=65794
        host_compressed_bytes[8]=65794
        host_compressed_bytes[9]=65794
        host_compressed_bytes[10]=65794
        host_compressed_bytes[11]=65793
        host_compressed_bytes[12]=65794
        host_compressed_bytes[13]=65794
        host_compressed_bytes[14]=65794
        host_compressed_bytes[15]=17028
        host_compressed_ptrs[0] = 0x7f9a3daf4e00
Segmentation fault (core dumped)

When asking for help with debugging, it is usually a good idea to readily share available information rather than making it available bit by bit. So there is an error, but it is not on the GPU side.

Segmentation fault means access to an illegal address on the host side. Most common error scenario is use of an invalid pointer. Track that pointer value back to its origin.

Thanks again for your quick response, @njuffa!

Well, this is my file (modified low_level_quickstart_example):

#include <random>
#include <assert.h>
#include <iostream>

#include "nvcomp/lz4.h"

/* 
  To build, execute
  
  mkdir build
  cd build
  cmake -DBUILD_EXAMPLES=ON ..
  make -j

  To execute, 
  bin/low_level_quickstart_example
*/

#define CUDA_CHECK(cond)                                                       \
  do {                                                                         \
    cudaError_t err = cond;                                                    \
    if (err != cudaSuccess) {                                                  \
      std::cerr << "CUDA error: " << cudaGetErrorString(err) << " at "         \
                << __FILE__ << ":" << __LINE__ << std::endl;                   \
      exit(1);                                                                 \
    }                                                                          \
  } while (false)

void execute_example(uint8_t* input_data, const size_t in_bytes)
{
  printf("[+] Creating CUDA stream.\n");
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // First, initialize the data on the host.
  printf("[+] Initialising data on the device.\n");

  // compute chunk sizes
  printf("    [+] Compute chunk sizes.\n");
  size_t* host_uncompressed_bytes;
  const size_t chunk_size = 65536;
  const size_t batch_size = (in_bytes + chunk_size - 1) / chunk_size;
  printf("    [+] chunk_size = %zu\n", chunk_size);
  printf("    [+] batch_size = %zu\n", batch_size);

  printf("    [+] Allocating memory on the device.\n");
  uint8_t* device_input_data;
  uint8_t* host_input_data;
  cudaMalloc(&device_input_data, in_bytes);
  cudaMemcpyAsync(device_input_data, input_data, in_bytes, cudaMemcpyHostToDevice, stream);

  printf("    [+] Memory address of device_input_data: %p\n", device_input_data);
  printf("    [+] Number of bytes: %zu\n", in_bytes);
  printf("    [+] Bytes:\n");

  cudaMallocHost(&host_input_data, in_bytes);
  cudaMemcpy(host_input_data, device_input_data, in_bytes, cudaMemcpyDeviceToHost);

  // size_t num_bytes_to_print = in_bytes;
  size_t num_bytes_to_print = 20;
  for ( size_t i = 0; i < num_bytes_to_print; i++ )
    if ( i == num_bytes_to_print - 1 ) printf("%X\n", host_input_data[i]);
    else printf("%02hhX,", host_input_data[i]);
  cudaFreeHost(host_input_data);
  // printf("\n");
  printf("...\n");

  printf("[+] Initialising data on the host.\n");
  cudaMallocHost(&host_uncompressed_bytes, sizeof(size_t) * batch_size);
  for (size_t i = 0; i < batch_size; ++i) {
    if (i + 1 < batch_size) {
      host_uncompressed_bytes[i] = chunk_size;
    } else {
      // last chunk may be smaller
      host_uncompressed_bytes[i] = in_bytes - (chunk_size*i);
    }
  }
  printf("    [+] Max bytes in host_uncompressed_bytes array slot: %zu\n", chunk_size);
  printf("    [+] Slots in host_uncompressed_ptrs array: %zu\n", batch_size);

  printf("    [+] Elements:\n");

  for ( size_t i = 0; i < batch_size; i++ )
    printf("        host_uncompressed_bytes[%zu]=%zu\n", i, host_uncompressed_bytes[i]);

  // Setup an array of pointers to the start of each chunk
  uint8_t** host_uncompressed_ptrs;
  cudaMallocHost(&host_uncompressed_ptrs, sizeof(size_t) * batch_size);
  for (size_t ix_chunk = 0; ix_chunk < batch_size; ix_chunk++) {
    host_uncompressed_ptrs[ix_chunk] = device_input_data + chunk_size * ix_chunk;
  }

    // Setup an array of pointers to the start of each chunk
  uint8_t** host_uncompressed_ptrs_read;
  cudaMallocHost(&host_uncompressed_ptrs_read, sizeof(size_t) * batch_size);
  for (size_t ix_chunk = 0; ix_chunk < batch_size; ix_chunk++) {
    host_uncompressed_ptrs_read[ix_chunk] = host_input_data + chunk_size * ix_chunk;
  }

  printf("    [+] Content (leading 5 and trailing 5 bytes):\n");
  for ( size_t i = 0; i < batch_size; i++ )
  {
    const size_t len = 5;
    uint8_t leading_values[len] = { 0x00 };
    uint8_t trailing_values[len] = { 0x00 };

    size_t start_ofs = 0;
    size_t end_ofs = 5;

    size_t idx = 0;
    for ( size_t j = start_ofs; j < end_ofs; j++ )
      leading_values[idx++] = host_uncompressed_ptrs_read[i][j];

    start_ofs = host_uncompressed_bytes[i] - len;
    end_ofs = host_uncompressed_bytes[i];
    
    // printf("        [+] Index  of trailing %zu start_ofs = %zu, end_ofs = %zu\n", i, start_ofs, end_ofs);
    
    idx = 0;
    for ( size_t j = start_ofs; j < end_ofs; j++ )
      trailing_values[idx++] = host_uncompressed_ptrs_read[i][j];

    printf("        host_uncompressed_ptrs_read[%zu] = ", i);
    for ( size_t j = 0; j < len; j++ )
      printf("%02hhX ", leading_values[j]);

    printf(" . . .  ");
    for ( size_t j = 0; j < len; j++ )
      printf("%02hhX ", trailing_values[j]);
    printf("\n");
  }

  // Moving the uncompressed data to the device
  printf("[+] Allocate memory on the device.\n");

  size_t* device_uncompressed_bytes;
  void** device_uncompressed_ptrs;
  cudaMalloc(&device_uncompressed_bytes, sizeof(size_t) * batch_size);
  cudaMalloc(&device_uncompressed_ptrs, sizeof(size_t) * batch_size);
  
  printf("[+] Move the uncompressed data from host to device.\n");
  cudaMemcpyAsync(device_uncompressed_bytes, host_uncompressed_bytes, sizeof(size_t) * batch_size, cudaMemcpyHostToDevice, stream);
  cudaMemcpyAsync(device_uncompressed_ptrs, host_uncompressed_ptrs, sizeof(size_t) * batch_size, cudaMemcpyHostToDevice, stream);

  // Then we need to allocate the temporary workspace and output space needed by the compressor.
  printf("[+] Allocate the temporary workspace on the device and output space needed by the compressor.\n");
  size_t temp_bytes;
  nvcompBatchedLZ4CompressGetTempSize(batch_size, chunk_size, nvcompBatchedLZ4DefaultOpts, &temp_bytes);
  void* device_temp_ptr;
  cudaMalloc(&device_temp_ptr, temp_bytes);
  printf("    [+] Allocated temp_bytes = %zu - should be ( (chunk_size * batch_size) / 2 )\n", temp_bytes);

  // get the maximum output size for each chunk
  printf("[+] Get the maximum output size for each chunk.\n");
  size_t max_out_bytes;
  nvcompBatchedLZ4CompressGetMaxOutputChunkSize(chunk_size, nvcompBatchedLZ4DefaultOpts, &max_out_bytes);
  printf("    [+] max_out_bytes = %zu\n", max_out_bytes);

  // Next, allocate output space on the device
  uint8_t** host_compressed_ptrs;
  cudaMallocHost(&host_compressed_ptrs, sizeof(size_t) * batch_size);
  for(size_t ix_chunk = 0; ix_chunk < batch_size; ++ix_chunk) {
      cudaMalloc(&host_compressed_ptrs[ix_chunk], max_out_bytes);
  }

  printf("[+] Allocate device compressed pointers array.\n");
  void** device_compressed_ptrs;
  cudaMalloc(&device_compressed_ptrs, sizeof(size_t) * batch_size);
  cudaMemcpyAsync(
      device_compressed_ptrs, host_compressed_ptrs,
      sizeof(size_t) * batch_size, cudaMemcpyHostToDevice, stream);

  // allocate space for compressed chunk sizes to be written to
  printf("[+] Allocate buffer space for compressed chunk sizes to be written to.\n");
  size_t* device_compressed_bytes;
  cudaMalloc(&device_compressed_bytes, sizeof(size_t) * batch_size);

  // And finally, call the API to compress the data
  printf("[+] Call the nvCOMP API to compress the data.\n");
  nvcompStatus_t comp_res = nvcompBatchedLZ4CompressAsync(
      device_uncompressed_ptrs,
      device_uncompressed_bytes,
      chunk_size, // The maximum chunk size
      batch_size,
      device_temp_ptr,
      temp_bytes,
      device_compressed_ptrs,
      device_compressed_bytes,
      nvcompBatchedLZ4DefaultOpts,
      stream);

  if (comp_res != nvcompSuccess)
  {
    std::cerr << "Failed compression!" << std::endl;
    assert(comp_res == nvcompSuccess);
  }
  cudaStreamSynchronize(stream);

  printf("[+] Copy device_compressed_bytes array from device to host.\n");
  
  size_t* host_compressed_bytes;
  cudaMallocHost(&host_compressed_bytes, sizeof(size_t) * batch_size);
  cudaMemcpy(host_compressed_bytes, device_compressed_bytes, sizeof(size_t) * batch_size, cudaMemcpyDeviceToHost);

  printf("    [+] Max bytes in host_compressed_bytes array slot: %zu\n", max_out_bytes);
  printf("    [+] Slots in host_compressed_ptrs array: %zu\n", batch_size);
  printf("    [+] Elements:\n");

  for (size_t i = 0; i < batch_size; i++)
    printf("        host_compressed_bytes[%zu]=%zu\n", i, host_compressed_bytes[i]);

  // Re-use host_compressed_ptrs, free the allocated memory from the above lines
  cudaFreeHost(host_compressed_ptrs);

  cudaMallocHost(&host_compressed_ptrs, sizeof(size_t) * batch_size);

  for (size_t i = 0; i < batch_size; i++)
  {
    cudaMallocHost(&host_compressed_ptrs[i], host_compressed_bytes[i]);
    printf("        host_compressed_ptrs[%zu] = %p\n", i, host_compressed_ptrs[i]);

    cudaMemcpy(host_compressed_ptrs[i], device_compressed_ptrs[i], host_compressed_bytes[i], cudaMemcpyDeviceToHost);
  }

  printf("    [+] Content (leading 5 and trailing 5 bytes):\n");
  for (size_t i = 0; i < batch_size; i++)
  {
    uint8_t leading_values[5] = { 0x00 };
    uint8_t trailing_values[5] = { 0x00 };

    size_t start_ofs = 0;
    size_t end_ofs = 5;

    size_t idx = 0;
    for (size_t j = start_ofs; j < end_ofs; j++)
      leading_values[idx++] = host_compressed_ptrs[i][j];

    start_ofs = host_compressed_bytes[i] - 5;
    end_ofs = host_compressed_bytes[i];
    // printf("        [+] Index  of trailing %zu start_ofs = %zu, end_ofs = %zu\n", i, start_ofs, end_ofs);
    
    idx = 0;
    for (size_t j = start_ofs; j < host_compressed_bytes[i]; j++)
      trailing_values[idx++] = host_compressed_ptrs[i][j];

    printf("        host_compressed_ptrs[%zu] = ", i);
    for (size_t j = 0; j < 5; j++)
      printf("%02hhX ", leading_values[j]);

    printf(" . . .  ");
    for (size_t j = 0; j < 5; j++)
      printf("%02hhX ", trailing_values[j]);
    printf("\n");
  }

  // Decompression can be similarly performed on a batch of multiple compressed input chunks. 
  // As no metadata is stored with the compressed data, chunks can be re-arranged as well as decompressed 
  // with other chunks that originally were not compressed in the same batch.

  // If we didn't have the uncompressed sizes, we'd need to compute this information here. 
  // We demonstrate how to do this.
  nvcompBatchedLZ4GetDecompressSizeAsync(
      device_compressed_ptrs,
      device_compressed_bytes,
      device_uncompressed_bytes,
      batch_size,
      stream);

  // Next, allocate the temporary buffer 
  size_t decomp_temp_bytes;
  nvcompBatchedLZ4DecompressGetTempSize(batch_size, chunk_size, &decomp_temp_bytes);
  void * device_decomp_temp;
  cudaMalloc(&device_decomp_temp, decomp_temp_bytes);

  // allocate statuses
  nvcompStatus_t* device_statuses;
  cudaMalloc(&device_statuses, sizeof(nvcompStatus_t)*batch_size);

  // Also allocate an array to store the actual_uncompressed_bytes.
  // Note that we could use nullptr for this. We already have the 
  // actual sizes computed during the call to nvcompBatchedLZ4GetDecompressSizeAsync.
  size_t* device_actual_uncompressed_bytes;
  cudaMalloc(&device_actual_uncompressed_bytes, sizeof(size_t)*batch_size);

  // And finally, call the decompression routine.
  // This decompresses each input, device_compressed_ptrs[i], and places the decompressed
  // result in the corresponding output list, device_uncompressed_ptrs[i]. It also writes
  // the size of the uncompressed data to device_uncompressed_bytes[i].
  nvcompStatus_t decomp_res = nvcompBatchedLZ4DecompressAsync(
      device_compressed_ptrs, 
      device_compressed_bytes, 
      device_uncompressed_bytes, 
      device_actual_uncompressed_bytes, 
      batch_size,
      device_decomp_temp, 
      decomp_temp_bytes, 
      device_uncompressed_ptrs, 
      device_statuses, 
      stream);
  
  if (decomp_res != nvcompSuccess)
  {
    std::cerr << "Failed decompression!" << std::endl;
    assert(decomp_res == nvcompSuccess);
  }

  cudaStreamSynchronize(stream);
}

int main()
{
  // Initialize a random array of chars
  const size_t in_bytes = 1000000;
  uint8_t* uncompressed_data;
  
  cudaMallocHost(&uncompressed_data, in_bytes);
  
  std::mt19937 random_gen(42);

  // char specialization of std::uniform_int_distribution is
  // non-standard, and isn't available on MSVC, so use short instead,
  // but with the range limited, and then cast below.
  std::uniform_int_distribution<short> uniform_dist(0, 255);
  for (size_t ix = 0; ix < in_bytes; ++ix) {
    uncompressed_data[ix] = static_cast<char>(uniform_dist(random_gen));
  }
  
  execute_example(uncompressed_data, in_bytes);
  return 0;
}

When commenting out the problematic cudaMemcpy (line 221):

// cudaMemcpy(host_compressed_ptrs[i], device_compressed_ptrs[i], host_compressed_bytes[i], cudaMemcpyDeviceToHost);

The host_compressed_ptrs array seems to be properly initialised:

        host_compressed_ptrs[0] = 0x7fc683af4e00
        host_compressed_ptrs[1] = 0x7fc683b05000
        host_compressed_ptrs[2] = 0x7fc683b15200
        host_compressed_ptrs[3] = 0x7fc683b25400
        host_compressed_ptrs[4] = 0x7fc683b35600
        host_compressed_ptrs[5] = 0x7fc683b45800
        host_compressed_ptrs[6] = 0x7fc683b55a00
        host_compressed_ptrs[7] = 0x7fc683b65c00
        host_compressed_ptrs[8] = 0x7fc683b75e00
        host_compressed_ptrs[9] = 0x7fc683b86000
        host_compressed_ptrs[10] = 0x7fc683b96200
        host_compressed_ptrs[11] = 0x7fc683ba6400
        host_compressed_ptrs[12] = 0x7fc683bb6600
        host_compressed_ptrs[13] = 0x7fc683bc6800
        host_compressed_ptrs[14] = 0x7fc683bd6a00
        host_compressed_ptrs[15] = 0x7fc683be6c00

And the program runs without any issues:

$ echo $?
0

Hi everyone,
using the same example described above, I cannot obtain the compressed data. It seems that the cudaMemcpy in the snippet below (run after the compression succeeds):

  for (size_t i = 0; i < batch_size; i++)
  {
    cudaMallocHost(&host_compressed_ptrs[i], host_compressed_bytes[i]);
    printf("        host_compressed_ptrs[%zu] = %p\n", i, host_compressed_ptrs[i]);

    cudaMemcpy(host_compressed_ptrs[i], device_compressed_ptrs[i], host_compressed_bytes[i], cudaMemcpyDeviceToHost);
  }

does not copy the compressed data of each chunk. In fact, when I print bytes via

 host_compressed_ptrs[i]

I always get zero values for each chunk. The output follows: first I print each chunk's address and its size, then I try to print some bytes of the compressed data:

        host_compressed_ptrs[0] = 00000002052F4C00  whith chunk size[0] = 65794
        host_compressed_ptrs[1] = 0000000205304E00  whith chunk size[1] = 65794
        host_compressed_ptrs[2] = 0000000205315000  whith chunk size[2] = 65794
        host_compressed_ptrs[3] = 0000000205325200  whith chunk size[3] = 65794
        host_compressed_ptrs[4] = 0000000205335400  whith chunk size[4] = 65794
        host_compressed_ptrs[5] = 0000000205345600  whith chunk size[5] = 65794
        host_compressed_ptrs[6] = 0000000205355800  whith chunk size[6] = 65794
        host_compressed_ptrs[7] = 0000000205365A00  whith chunk size[7] = 65794
        host_compressed_ptrs[8] = 0000000205375C00  whith chunk size[8] = 65794
        host_compressed_ptrs[9] = 0000000205385E00  whith chunk size[9] = 65793
        host_compressed_ptrs[10] = 0000000205396000  whith chunk size[10] = 65794
        host_compressed_ptrs[11] = 00000002053A6200  whith chunk size[11] = 65794
        host_compressed_ptrs[12] = 00000002053B6400  whith chunk size[12] = 65794
        host_compressed_ptrs[13] = 00000002053C6600  whith chunk size[13] = 65794
        host_compressed_ptrs[14] = 00000002053D6800  whith chunk size[14] = 65794
        host_compressed_ptrs[15] = 00000002053E6A00  whith chunk size[15] = 17028
    [+] Content (leading 20 and trailing 20 bytes):
        host_compressed_ptrs[0] = 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  . . .  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        host_compressed_ptrs[1] = 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  . . .  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        host_compressed_ptrs[2] = 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  . . .  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        host_compressed_ptrs[3] = 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  . . .  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        host_compressed_ptrs[4] = 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  . . .  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        host_compressed_ptrs[5] = 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  . . .  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        host_compressed_ptrs[6] = 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  . . .  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        host_compressed_ptrs[7] = 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  . . .  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        host_compressed_ptrs[8] = 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  . . .  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        host_compressed_ptrs[9] = 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  . . .  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        host_compressed_ptrs[10] = 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  . . .  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        host_compressed_ptrs[11] = 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  . . .  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        host_compressed_ptrs[12] = 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  . . .  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        host_compressed_ptrs[13] = 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  . . .  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        host_compressed_ptrs[14] = 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  . . .  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        host_compressed_ptrs[15] = 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  . . .  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Please, can you suggest where I should look to find the error?
Is there another way to get the compressed data than the approach reported in the code above?
Thank you in advance.

System information:

  • Windows 11
  • GeForce GTX 1650 (laptop)
  • NVIDIA-SMI 552.12 Driver Version: 552.12 CUDA Version: 12.4
  • Cuda compilation tools, release 12.2, V12.2.140
  • nvCOMP 3.0.6 (Windows, CUDA 12.x)

Another attempt (which solved the initial segmentation-fault issue) is below.

Code:

#include <random>
#include <assert.h>
#include <iostream>

#include "nvcomp/lz4.h"

void execute_example(uint8_t* input_data, const size_t in_bytes)
{
  printf("[+] Creating CUDA stream.\n");
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // First, initialize the data on the host.
  printf("[+] Initialising data on the device.\n");

  // compute chunk sizes
  printf("    [+] Compute chunk sizes.\n");
  size_t* host_uncompressed_bytes;
  const size_t chunk_size = 65536;
  const size_t batch_size = (in_bytes + chunk_size - 1) / chunk_size;
  printf("    [+] chunk_size = %zu\n", chunk_size);
  printf("    [+] batch_size = %zu\n", batch_size);

  printf("    [+] Allocating memory on the device.\n");
  uint8_t* device_input_data;
  cudaMalloc(&device_input_data, in_bytes);
  cudaMemcpy(device_input_data, input_data, in_bytes, cudaMemcpyHostToDevice);

  printf("    [+] Memory address of device_input_data: %p\n", device_input_data);
  printf("    [+] Number of bytes: %zu\n", in_bytes);
  printf("    [+] Bytes:\n");

  // size_t num_bytes_to_print = in_bytes;
  uint8_t* host_input_data;
  cudaMallocHost(&host_input_data, in_bytes);
  cudaMemcpy(host_input_data, device_input_data, in_bytes, cudaMemcpyDeviceToHost);

  size_t num_bytes_to_print = 20;
  for ( size_t i = 0; i < num_bytes_to_print; i++ )
    if ( i == num_bytes_to_print - 1 ) printf("%X\n", host_input_data[i]);
    else printf("%02hhX,", host_input_data[i]);
  printf("...\n");

  printf("[+] Initialising data on the host.\n");
  cudaMallocHost(&host_uncompressed_bytes, sizeof(size_t) * batch_size);
  for (size_t i = 0; i < batch_size; ++i) {
    if (i + 1 < batch_size) {
      host_uncompressed_bytes[i] = chunk_size;
    } else {
      // last chunk may be smaller
      host_uncompressed_bytes[i] = in_bytes - (chunk_size*i);
    }
  }
  printf("    [+] Max bytes in host_uncompressed_bytes array slot: %zu\n", chunk_size);
  printf("    [+] Slots in host_uncompressed_ptrs array: %zu\n", batch_size);

  printf("    [+] Elements:\n");

  for ( size_t i = 0; i < batch_size; i++ )
    printf("        host_uncompressed_bytes[%zu]=%zu\n", i, host_uncompressed_bytes[i]);

  // Setup an array of pointers to the start of each chunk
  uint8_t** host_uncompressed_ptrs;
  cudaMallocHost(&host_uncompressed_ptrs, sizeof(size_t) * batch_size);
  for (size_t ix_chunk = 0; ix_chunk < batch_size; ix_chunk++) {
    host_uncompressed_ptrs[ix_chunk] = device_input_data + chunk_size * ix_chunk;
  }

    // Setup an array of pointers to the start of each chunk
  uint8_t** host_uncompressed_ptrs_read;
  cudaMallocHost(&host_uncompressed_ptrs_read, sizeof(size_t) * batch_size);
  for (size_t ix_chunk = 0; ix_chunk < batch_size; ix_chunk++) {
    host_uncompressed_ptrs_read[ix_chunk] = host_input_data + chunk_size * ix_chunk;
  }

  printf("    [+] Content #1 (leading 5 and trailing 5 bytes):\n");
  for ( size_t i = 0; i < batch_size; i++ )
  {
    uint8_t leading_values[5] = { 0x00 };
    uint8_t trailing_values[5] = { 0x00 };

    size_t start_ofs = 0;
    size_t end_ofs = 5;

    size_t idx = 0;
    for ( size_t j = start_ofs; j < end_ofs; j++ )
      leading_values[idx++] = host_uncompressed_ptrs_read[i][j];

    start_ofs = host_uncompressed_bytes[i] - 5;
    end_ofs = host_uncompressed_bytes[i];
        
    idx = 0;
    for ( size_t j = start_ofs; j < end_ofs; j++ )
      trailing_values[idx++] = host_uncompressed_ptrs_read[i][j];

    printf("        host_uncompressed_ptrs_read[%zu] = ", i);
    for ( size_t j = 0; j < 5; j++ )
      printf("%02hhX ", leading_values[j]);

    printf(" . . .  ");

    for ( size_t j = 0; j < 5; j++ )
      printf("%02hhX ", trailing_values[j]);
    printf("\n");
  }

  // Moving the uncompressed data to the device
  printf("[+] Allocate memory on the device.\n");

  size_t* device_uncompressed_bytes;
  void** device_uncompressed_ptrs;
  cudaMalloc(&device_uncompressed_bytes, sizeof(size_t) * batch_size);
  cudaMalloc(&device_uncompressed_ptrs, sizeof(size_t) * batch_size);
  
  printf("[+] Move the uncompressed data from host to device.\n");
  cudaMemcpy(device_uncompressed_bytes, host_uncompressed_bytes, sizeof(size_t) * batch_size, cudaMemcpyHostToDevice);
  cudaMemcpy(device_uncompressed_ptrs, host_uncompressed_ptrs, sizeof(size_t) * batch_size, cudaMemcpyHostToDevice);

  // Then we need to allocate the temporary workspace and output space needed by the compressor.
  printf("[+] Allocate the temporary workspace on the device and output space needed by the compressor.\n");
  size_t temp_bytes;
  nvcompBatchedLZ4CompressGetTempSize(batch_size, chunk_size, nvcompBatchedLZ4DefaultOpts, &temp_bytes);
  void* device_temp_ptr;
  cudaMalloc(&device_temp_ptr, temp_bytes);
  printf("    [+] Allocated temp_bytes = %zu - should be ( (chunk_size * batch_size) / 2 )\n", temp_bytes);

  // get the maximum output size for each chunk
  printf("[+] Get the maximum output size for each chunk.\n");
  size_t max_out_bytes;
  nvcompBatchedLZ4CompressGetMaxOutputChunkSize(chunk_size, nvcompBatchedLZ4DefaultOpts, &max_out_bytes);
  printf("    [+] max_out_bytes = %zu\n", max_out_bytes);

  // Next, allocate output space on the device
  uint8_t** host_compressed_ptrs;
  cudaMallocHost(&host_compressed_ptrs, sizeof(size_t) * batch_size);
  for(size_t ix_chunk = 0; ix_chunk < batch_size; ++ix_chunk) {
      cudaMalloc(&host_compressed_ptrs[ix_chunk], max_out_bytes);
  }
  size_t array_address_step = (size_t)(host_compressed_ptrs[1] - host_compressed_ptrs[0]);
  printf("    [+] Address step within array (hex): %lX, (dec): %zu\n", array_address_step, array_address_step);
  assert(array_address_step >= max_out_bytes);

  printf("[+] Allocate device compressed pointers array.\n");
  void** device_compressed_ptrs;
  cudaMalloc(&device_compressed_ptrs, sizeof(size_t) * batch_size);
  cudaMemcpy(device_compressed_ptrs, host_compressed_ptrs, sizeof(size_t) * batch_size, cudaMemcpyHostToDevice);

  // allocate space for compressed chunk sizes to be written to
  printf("[+] Allocate buffer space for compressed chunk sizes to be written to.\n");
  size_t* device_compressed_bytes;
  cudaMalloc(&device_compressed_bytes, sizeof(size_t) * batch_size);

  // And finally, call the API to compress the data
  printf("[+] Call the nvCOMP API to compress the data.\n");
  
  nvcompStatus_t comp_res = nvcompBatchedLZ4CompressAsync(
      device_uncompressed_ptrs,
      device_uncompressed_bytes,
      chunk_size, // The maximum chunk size
      batch_size,
      device_temp_ptr,
      temp_bytes,
      device_compressed_ptrs,
      device_compressed_bytes,
      nvcompBatchedLZ4DefaultOpts,
      stream);

  if (comp_res != nvcompSuccess)
  {
    std::cerr << "Failed compression!" << std::endl;
    assert(comp_res == nvcompSuccess);
  }
  cudaStreamSynchronize(stream);

  printf("[+] Copy compressed from device to host.\n");
  
  size_t* read_back_compressed_bytes;
  cudaMallocHost(&read_back_compressed_bytes, sizeof(size_t) * batch_size);
  cudaMemcpy(read_back_compressed_bytes, device_compressed_bytes, sizeof(size_t) * batch_size, cudaMemcpyDeviceToHost);

  for (size_t i = 0; i < batch_size; i++)
    printf("        read_back_compressed_bytes[%zu]=%zu\n", i, read_back_compressed_bytes[i]);

  uint8_t* read_back_compressed_data;
  size_t total_compressed_bytes_len = array_address_step * batch_size * sizeof(uint8_t);
  printf("        total_compressed_bytes_len = %zu (incl. additional \"padding\" dummy bytes at the end of each slot)\n", total_compressed_bytes_len);

  cudaMallocHost(&read_back_compressed_data, total_compressed_bytes_len);
  cudaMemcpy(read_back_compressed_data, device_compressed_ptrs, total_compressed_bytes_len, cudaMemcpyDeviceToHost);    // Step #1 (copy all bytes starting from device_compressed_ptrs into read_back_compressed_data)

  uint8_t** read_back_compressed_ptrs;
  cudaMallocHost(&read_back_compressed_ptrs, sizeof(size_t) * batch_size);
  for (size_t ix_chunk = 0; ix_chunk < batch_size; ix_chunk++)
  {    
    read_back_compressed_ptrs[ix_chunk] = read_back_compressed_data + array_address_step * ix_chunk;                    // Step #2 (map the host_compressed_ptrs over read_back_compressed_data)
  }

  printf("    [+] Content #2 (leading 5 and trailing 5 bytes):\n");
  
  for (size_t i = 0; i < batch_size; i++)
  {
    uint8_t leading_values[5] = { 0x00 };
    uint8_t trailing_values[5] = { 0x00 };

    size_t start_ofs = 0;
    size_t end_ofs = 5;

    size_t idx = 0;
    for (size_t j = start_ofs; j < end_ofs; j++)
      leading_values[idx++] = read_back_compressed_ptrs[i][j];

    start_ofs = read_back_compressed_bytes[i] - 5;
    end_ofs = read_back_compressed_bytes[i];

    idx = 0;
    for (size_t j = start_ofs; j < end_ofs; j++)
      trailing_values[idx++] = read_back_compressed_ptrs[i][j];

    printf("        read_back_compressed_ptrs[%zu] = ", i);
    for (size_t j = 0; j < 5; j++)
      printf("%02hhX ", leading_values[j]);

    printf(" . . .  ");

    for (size_t j = 0; j < 5; j++)
      printf("%02hhX ", trailing_values[j]);
    printf("\n");
  }
  
  /*
  // // //
  size_t* copied_compressed_bytes;
  cudaMalloc(&copied_compressed_bytes, sizeof(size_t) * batch_size);
  cudaMemcpy(copied_compressed_bytes, read_back_compressed_bytes, sizeof(size_t) * batch_size, cudaMemcpyHostToDevice);
 
  uint8_t* copied_compressed_data;
  cudaMalloc(&copied_compressed_data, total_compressed_bytes_len);
  cudaMemcpy(copied_compressed_data, read_back_compressed_data, total_compressed_bytes_len, cudaMemcpyHostToDevice);
  
  // Redirect read_back_compressed_ptrs to copied_compressed_data
  for (size_t ix_chunk = 0; ix_chunk < batch_size; ix_chunk++)    
  {
    read_back_compressed_ptrs[ix_chunk] = copied_compressed_data + array_address_step * ix_chunk;
  }

  void** copied_compressed_ptrs;
  cudaMalloc(&copied_compressed_ptrs, sizeof(size_t) * batch_size);
  cudaMemcpy(copied_compressed_ptrs, read_back_compressed_ptrs, sizeof(size_t) * batch_size, cudaMemcpyHostToDevice);
  // // //

  // Decompression can be similarly performed on a batch of multiple compressed input chunks. 
  // As no metadata is stored with the compressed data, chunks can be re-arranged as well as decompressed 
  // with other chunks that originally were not compressed in the same batch.

  // If we didn't have the uncompressed sizes, we'd need to compute this information here. 
  // We demonstrate how to do this.
  nvcompBatchedLZ4GetDecompressSizeAsync(
      copied_compressed_ptrs,
      copied_compressed_bytes,
      device_uncompressed_bytes,
      batch_size,
      stream);

  // Next, allocate the temporary buffer 
  size_t decomp_temp_bytes;
  nvcompBatchedLZ4DecompressGetTempSize(batch_size, chunk_size, &decomp_temp_bytes);
  void * device_decomp_temp;
  cudaMalloc(&device_decomp_temp, decomp_temp_bytes);

  // allocate statuses
  nvcompStatus_t* device_statuses;
  cudaMalloc(&device_statuses, sizeof(nvcompStatus_t)*batch_size);

  // Also allocate an array to store the actual_uncompressed_bytes.
  // Note that we could use nullptr for this. We already have the 
  // actual sizes computed during the call to nvcompBatchedLZ4GetDecompressSizeAsync.
  size_t* device_actual_uncompressed_bytes;
  cudaMalloc(&device_actual_uncompressed_bytes, sizeof(size_t)*batch_size);

  // And finally, call the decompression routine.
  // This decompresses each input, device_compressed_ptrs[i], and places the decompressed
  // result in the corresponding output list, device_uncompressed_ptrs[i]. It also writes
  // the size of the uncompressed data to device_uncompressed_bytes[i].
  nvcompStatus_t decomp_res = nvcompBatchedLZ4DecompressAsync(
      copied_compressed_ptrs, 
      copied_compressed_bytes, 
      device_uncompressed_bytes, 
      device_actual_uncompressed_bytes, 
      batch_size,
      device_decomp_temp, 
      decomp_temp_bytes, 
      device_uncompressed_ptrs, 
      device_statuses, 
      stream);
  
  if (decomp_res != nvcompSuccess)
  {
    std::cerr << "Failed decompression!" << std::endl;
    assert(decomp_res == nvcompSuccess);
  }

  cudaStreamSynchronize(stream);

  printf("    [+] Content #3 (leading 5 and trailing 5 bytes):\n");

  size_t* feedback_host_uncompressed_bytes;
  cudaMallocHost(&feedback_host_uncompressed_bytes, sizeof(size_t) * batch_size);
  cudaMemcpy(feedback_host_uncompressed_bytes, device_actual_uncompressed_bytes, sizeof(size_t) * batch_size, cudaMemcpyDeviceToHost);
  for (size_t i = 0; i < batch_size; i++)
    printf("        feedback_host_uncompressed_bytes[%zu]=%zu\n", i, feedback_host_uncompressed_bytes[i]);
  */
}

int main()
{
  // Initialize a random array of chars
  const size_t in_bytes = 1000000;
  uint8_t* uncompressed_data;
  
  cudaMallocHost(&uncompressed_data, in_bytes);
  
  std::mt19937 random_gen(42);

  // char specialization of std::uniform_int_distribution is
  // non-standard, and isn't available on MSVC, so use short instead,
  // but with the range limited, and then cast below.
  std::uniform_int_distribution<short> uniform_dist(0, 255);
  for (size_t ix = 0; ix < in_bytes; ++ix) {
    uncompressed_data[ix] = static_cast<char>(uniform_dist(random_gen));
  }
  
  execute_example(uncompressed_data, in_bytes);
  return 0;
}

Output:

[+] Creating CUDA stream.
[+] Initialising data on the device.
    [+] Compute chunk sizes.
    [+] chunk_size = 65536
    [+] batch_size = 16
    [+] Allocating memory on the device.
    [+] Memory address of device_input_data: 0x7f440dc00000
    [+] Number of bytes: 1000000
    [+] Bytes:
5F,CB,F3,2E,BB,C7,99,98,27,72,27,19,0E,75,DD,55,99,24,B5,A6
...
[+] Initialising data on the host.
    [+] Max bytes in host_uncompressed_bytes array slot: 65536
    [+] Slots in host_uncompressed_ptrs array: 16
    [+] Elements:
        host_uncompressed_bytes[0]=65536
        host_uncompressed_bytes[1]=65536
        host_uncompressed_bytes[2]=65536
        host_uncompressed_bytes[3]=65536
        host_uncompressed_bytes[4]=65536
        host_uncompressed_bytes[5]=65536
        host_uncompressed_bytes[6]=65536
        host_uncompressed_bytes[7]=65536
        host_uncompressed_bytes[8]=65536
        host_uncompressed_bytes[9]=65536
        host_uncompressed_bytes[10]=65536
        host_uncompressed_bytes[11]=65536
        host_uncompressed_bytes[12]=65536
        host_uncompressed_bytes[13]=65536
        host_uncompressed_bytes[14]=65536
        host_uncompressed_bytes[15]=16960
    [+] Content #1 (leading 5 and trailing 5 bytes):
        host_uncompressed_ptrs_read[0] = 5F CB F3 2E BB  . . .  C1 FA 13 E6 27 
        host_uncompressed_ptrs_read[1] = E4 14 C9 51 B1  . . .  FD E0 0A 75 0D 
        host_uncompressed_ptrs_read[2] = 57 30 A1 A8 9E  . . .  CC 30 04 78 54 
        host_uncompressed_ptrs_read[3] = 70 73 31 8D DE  . . .  5D 09 B4 54 FC 
        host_uncompressed_ptrs_read[4] = 69 DE 3B F2 F8  . . .  97 40 45 90 D4 
        host_uncompressed_ptrs_read[5] = CF E6 F5 6B 02  . . .  7A 52 D1 F1 7F 
        host_uncompressed_ptrs_read[6] = 92 EB 2F D5 6F  . . .  E8 EB B2 86 49 
        host_uncompressed_ptrs_read[7] = 67 4D 3A F7 33  . . .  53 01 5A 41 7C 
        host_uncompressed_ptrs_read[8] = 2F B9 87 FB E5  . . .  B6 B3 C9 4A 9B 
        host_uncompressed_ptrs_read[9] = 83 E4 5B 85 BD  . . .  5A 58 54 0F 6E 
        host_uncompressed_ptrs_read[10] = FB A9 EA 41 F0  . . .  49 EB 0A E5 E8 
        host_uncompressed_ptrs_read[11] = D1 89 AE 15 ED  . . .  61 2F C2 8A F5 
        host_uncompressed_ptrs_read[12] = 31 1B 5F 38 F2  . . .  55 C9 5A 5C 0B 
        host_uncompressed_ptrs_read[13] = E4 E3 75 A8 12  . . .  65 78 03 88 92 
        host_uncompressed_ptrs_read[14] = 13 60 03 1F 37  . . .  33 FC 44 18 D6 
        host_uncompressed_ptrs_read[15] = 6C 83 B6 0B 13  . . .  89 AA A6 A0 37 
[+] Allocate memory on the device.
[+] Move the uncompressed data from host to device.
[+] Allocate the temporary workspace on the device and output space needed by the compressor.
    [+] Allocated temp_bytes = 524288 - should be ( (chunk_size * batch_size) / 2 )
[+] Get the maximum output size for each chunk.
    [+] max_out_bytes = 65800
    [+] Address step within array (hex): 10200, (dec): 66048
[+] Allocate device compressed pointers array.
[+] Allocate buffer space for compressed chunk sizes to be written to.
[+] Call the nvCOMP API to compress the data.
[+] Copy compressed from device to host.
        read_back_compressed_bytes[0]=65794
        read_back_compressed_bytes[1]=65794
        read_back_compressed_bytes[2]=65794
        read_back_compressed_bytes[3]=65794
        read_back_compressed_bytes[4]=65794
        read_back_compressed_bytes[5]=65794
        read_back_compressed_bytes[6]=65794
        read_back_compressed_bytes[7]=65794
        read_back_compressed_bytes[8]=65794
        read_back_compressed_bytes[9]=65794
        read_back_compressed_bytes[10]=65794
        read_back_compressed_bytes[11]=65793
        read_back_compressed_bytes[12]=65794
        read_back_compressed_bytes[13]=65794
        read_back_compressed_bytes[14]=65794
        read_back_compressed_bytes[15]=17028
        total_compressed_bytes_len = 1056768 (incl. additional "padding" dummy bytes at the end of each slot)
    [+] Content #2 (leading 5 and trailing 5 bytes):
        read_back_compressed_ptrs[0] = 00 00 00 00 00  . . .  00 00 00 00 00 
        read_back_compressed_ptrs[1] = 00 00 00 00 00  . . .  00 00 00 00 00 
        read_back_compressed_ptrs[2] = 00 00 00 00 00  . . .  00 00 00 00 00 
        read_back_compressed_ptrs[3] = 00 00 00 00 00  . . .  00 00 00 00 00 
        read_back_compressed_ptrs[4] = 00 00 00 00 00  . . .  00 00 00 00 00 
        read_back_compressed_ptrs[5] = 00 00 00 00 00  . . .  00 00 00 00 00 
        read_back_compressed_ptrs[6] = 00 00 00 00 00  . . .  00 00 00 00 00 
        read_back_compressed_ptrs[7] = 00 00 00 00 00  . . .  00 00 00 00 00 
        read_back_compressed_ptrs[8] = 00 00 00 00 00  . . .  00 00 00 00 00 
        read_back_compressed_ptrs[9] = 00 00 00 00 00  . . .  00 00 00 00 00 
        read_back_compressed_ptrs[10] = 00 00 00 00 00  . . .  00 00 00 00 00 
        read_back_compressed_ptrs[11] = 00 00 00 00 00  . . .  00 00 00 00 00 
        read_back_compressed_ptrs[12] = 00 00 00 00 00  . . .  00 00 00 00 00 
        read_back_compressed_ptrs[13] = 00 00 00 00 00  . . .  00 00 00 00 00 
        read_back_compressed_ptrs[14] = 00 00 00 00 00  . . .  00 00 00 00 00 
        read_back_compressed_ptrs[15] = 00 00 00 00 00  . . .  00 00 00 00 00 

And the snippet with my attempt where I try to obtain the compressed bytes:

  size_t array_address_step = (size_t)(host_compressed_ptrs[1] - host_compressed_ptrs[0]);
  printf("    [+] Address step within array (hex): %lX, (dec): %zu\n", array_address_step, array_address_step);
  assert(array_address_step >= max_out_bytes);
  printf("[+] Copy compressed from device to host.\n");
  
  size_t* read_back_compressed_bytes;
  cudaMallocHost(&read_back_compressed_bytes, sizeof(size_t) * batch_size);
  cudaMemcpy(read_back_compressed_bytes, device_compressed_bytes, sizeof(size_t) * batch_size, cudaMemcpyDeviceToHost);

  for (size_t i = 0; i < batch_size; i++)
    printf("        read_back_compressed_bytes[%zu]=%zu\n", i, read_back_compressed_bytes[i]);

  uint8_t* read_back_compressed_data;
  size_t total_compressed_bytes_len = array_address_step * batch_size * sizeof(uint8_t);
  printf("        total_compressed_bytes_len = %zu (incl. additional \"padding\" dummy bytes at the end of each slot)\n", total_compressed_bytes_len);

  cudaMallocHost(&read_back_compressed_data, total_compressed_bytes_len);
  cudaMemcpy(read_back_compressed_data, device_compressed_ptrs, total_compressed_bytes_len, cudaMemcpyDeviceToHost);    // Step #1 (copy all bytes starting from device_compressed_ptrs into read_back_compressed_data)

  uint8_t** read_back_compressed_ptrs;
  cudaMallocHost(&read_back_compressed_ptrs, sizeof(size_t) * batch_size);
  for (size_t ix_chunk = 0; ix_chunk < batch_size; ix_chunk++)
  {    
    read_back_compressed_ptrs[ix_chunk] = read_back_compressed_data + array_address_step * ix_chunk;                    // Step #2 (map the host_compressed_ptrs over read_back_compressed_data)
  }

In my code (and I suppose in yours too) I get the error “cudaErrorInvalidValue” when calling cudaMemcpy:

cudaError_t memCopyStatus = cudaMemcpy(host_compressed_ptrs[i], device_compressed_ptrs[i], host_compressed_bytes[i], cudaMemcpyDeviceToHost);

I suppose this is one of the reasons why host_compressed_ptrs[i] ends up filled with all-zero values.

It seems that at least one of the values, either the chunk size or the chunk address on the device, could be wrong.

Lastly, if I use a size different from host_compressed_bytes[i] (e.g. a value less than or equal to 128 bytes), the call:
cudaError_t memCopyStatus = cudaMemcpy(host_compressed_ptrs[i], device_compressed_ptrs[i], 128, cudaMemcpyDeviceToHost);

returns success.

I will continue to investigate.

Hi,

@martini.andrea

  1. Are you sure that device_compressed_ptrs is actually allocated on the GPU in your case?
    Because whenever you try to dereference it from the host (with device_compressed_ptrs[i]), it will typically give you a Segmentation Fault, regardless of how many bytes you specify in the cudaMemcpy call.

  2. I saw you have opened an issue on the nvCOMP GitHub page, pointing to this post. I have nothing against that, but let’s first identify and confirm with the community what exactly is going wrong.
    Because at the moment of opening the issue, the first obvious thing was our lack of understanding of how CUDA deals with pointers on the host and the GPU side. So to say, IMO, nobody would take such a question seriously.

@nVidia
3. Now, as we have managed to get to the point where there is nothing suspicious in our CUDA code, can we rely on nVidia for support on how to obtain the compressed data, or should I move to an alternative solution and abandon the nvCOMP library?

Thanks

When I run the code you have shown in your post that begins with:

under compute-sanitizer, I get an error of invalid argument on a call to cudaMemcpy.

So the very first suggestion I have is: any time you are having trouble with CUDA code, use proper CUDA error checking. If you’re not sure what that is, please google “proper CUDA error checking”, take the first hit from Stack Overflow (this one), and apply it to your code. When you have done that, please post the new code showing your changes, and let us know whether you get any errors when you run it. If so, identify the exact line of code on which the error is reported.
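The idea, in sketch form (a generic pattern, not copied from that answer; the macro name is illustrative):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Minimal error-checking wrapper: report the failing call with file and line,
// then abort, so the first failing CUDA runtime call is identified immediately.
#define CUDA_CHECK(call)                                             \
  do {                                                               \
    cudaError_t err_ = (call);                                       \
    if (err_ != cudaSuccess) {                                       \
      fprintf(stderr, "CUDA error '%s' at %s:%d\n",                  \
              cudaGetErrorString(err_), __FILE__, __LINE__);         \
      exit(EXIT_FAILURE);                                            \
    }                                                                \
  } while (0)

// Usage: wrap each runtime call, e.g.
// CUDA_CHECK(cudaMemcpy(dst, src, nbytes, cudaMemcpyDeviceToHost));
```

Wrapping every cudaMalloc/cudaMemcpy this way pinpoints the exact line that returns cudaErrorInvalidValue instead of letting it fail silently.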

(FWIW, I get no such error report from compute-sanitizer when I run the similar example code.)

It’s evident from the code you have posted that you’re not properly handling an array of pointers. This creates the top-level pointer storage, for a pointer to an array of pointers:

  void** device_compressed_ptrs;

Here, space for an array of pointers is allocated in device memory, using that top-level pointer:

  cudaMalloc(&device_compressed_ptrs, sizeof(size_t) * batch_size);

Later, you do this:

  cudaMemcpy(read_back_compressed_data, device_compressed_ptrs, total_compressed_bytes_len, cudaMemcpyDeviceToHost);    // Step #1 (copy all bytes starting from from device_compressed_ptrs over read_back_compressed_data)

We can immediately see that total_compressed_bytes_len is not calculated the same way as sizeof(size_t) * batch_size, and is therefore not the same size, so by inspection this copy cannot possibly be correct. It is also not logical to assume that the data itself can be copied through a top-level pointer-to-pointer.

This is something like a deep copy, and although CUDA is being used here, it is entirely analogous to how you would do something similar in C++. You’ll need to realize that there is a pointer-to-pointer arrangement here, and handle it appropriately.

(hint: this last cudaMemcpy line that I have excerpted above, is the line you will discover is problematic when you implement proper CUDA error checking).
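To illustrate the pointer-to-pointer handling, here is one way the readback could be done, as a sketch reusing the variable names from the posted code (not the library's own approach): host_compressed_ptrs already holds the device addresses of the individual chunks, since it was the source used to fill device_compressed_ptrs, so each chunk can be copied out with its own cudaMemcpy:

```cuda
// Illustrative sketch only: copy each compressed chunk out individually.
// host_compressed_ptrs[i] holds the *device* address of chunk i (it was
// filled by cudaMalloc in the allocation loop), so it is a valid device
// source pointer for cudaMemcpyDeviceToHost.
for (size_t i = 0; i < batch_size; ++i) {
  uint8_t* host_chunk = nullptr;
  cudaMallocHost(&host_chunk, read_back_compressed_bytes[i]);
  cudaMemcpy(host_chunk,                      // host destination
             host_compressed_ptrs[i],         // device source (chunk i)
             read_back_compressed_bytes[i],   // actual compressed size
             cudaMemcpyDeviceToHost);
  // ... use host_chunk, then cudaFreeHost(host_chunk);
}
```

If the host-side copy of the pointer array were not available, the device addresses could first be fetched with a sizeof(void*) * batch_size device-to-host cudaMemcpy from device_compressed_ptrs, and the resulting addresses used the same way.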

Do as you wish, of course.

Hi @Robert_Crovella,

Yes, indeed my code reports errors when run under compute-sanitizer, and the very first error is on the cudaMemcpy line you mentioned. Below I present my idea and explain why I am doing it this way…

Here is how my logic goes: in order to get the content of the buffers pointed to by the elements of the array of pointers, I would need to:

  1. Iterate through the array of pointers - for each element of the array of pointers, copy N bytes starting from the address where the pointer points to. N is kept in the other array with buffer lengths (like device_compressed_bytes[…]).

    But because:
    1.1. I cannot de-reference a pointer on the device (GPU), I cannot do such iteration.
    1.2. If I copy the array of pointers from device to host (sizeof(size_t) * batch_size), I will have only addresses. And I still have no idea how to get the actual buffers that contain the compressed data.

  2. Do it in a single cudaMemcpy, without de-referencing device pointers, having in mind the following assumptions:
    2.1. After I copy the void** device_compressed_ptrs, I print the addresses of host_compressed_ptrs[0], [1], […] and see that they are all contiguous regions in memory with a step of 0x10200 (66048 decimal).
    2.2. I think of the data as a matrix of 16 rows, each of the length mentioned above (66048 bytes).
    2.3. Because the matrix rows appear to be contiguous regions, that would mean I can think of the compressed data buffer as a one-dimensional array.
    2.4. The first byte of this array should be pointed by the device_compressed_ptrs[0].
    2.5. I assume that inside the nvCOMP .so library, something similar happens when it comes to mapping the data (the below code is my assumption of what could be happening inside the library):

// Includes and constants assumed for this mock-up; the ROWS and ROW_LEN
// values are taken from the 16-row / 66048-byte layout discussed above.
#include <cstdio>
#include <cstdint>
#include <cuda_runtime.h>

#define ROWS 16
#define ROW_LEN 66048

int main()
{
  uint8_t* host_uncompressed_data;
  uint8_t** host_uncompressed_ptrs;
  size_t* host_uncompressed_bytes;
  cudaMallocHost(&host_uncompressed_data, sizeof(uint8_t) * ROWS * ROW_LEN);
  
  for (size_t i = 0; i < ROWS * ROW_LEN; i++)
    host_uncompressed_data[i] = (uint8_t)i;

  cudaMallocHost(&host_uncompressed_bytes, sizeof(size_t) * ROWS);
  for (size_t i = 0; i < ROWS; i++)
    host_uncompressed_bytes[i] = ROW_LEN;

  cudaMallocHost(&host_uncompressed_ptrs, sizeof(size_t *) * ROWS);
  
  size_t offset = 0;
  host_uncompressed_ptrs[0] = host_uncompressed_data + offset;

  for (size_t i = 1; i < ROWS; i++)
  { 
    offset += host_uncompressed_bytes[i-1];
    host_uncompressed_ptrs[i] = host_uncompressed_data + offset;
  }

  for (size_t i = 0; i < ROWS; i++)
    for (size_t j = 0; j < ROW_LEN; j++)
      printf("     host_uncompressed_ptrs[%zu][%zu] = %u at address %p\n", i, j, host_uncompressed_ptrs[i][j], host_uncompressed_ptrs[i] + j);

  uint8_t* device_uncompressed_data;
  uint8_t** device_uncompressed_ptrs;
  size_t* device_uncompressed_bytes;

  cudaMalloc(&device_uncompressed_data, sizeof(uint8_t) * ROWS * ROW_LEN);
  cudaMalloc(&device_uncompressed_ptrs,  sizeof(size_t *) * ROWS);
  cudaMalloc(&device_uncompressed_bytes,   sizeof(size_t) * ROWS);
  
  cudaMemcpy(device_uncompressed_data, host_uncompressed_data, sizeof(uint8_t) * ROWS * ROW_LEN, cudaMemcpyHostToDevice);
  cudaMemcpy(device_uncompressed_bytes, host_uncompressed_bytes, sizeof(size_t) * ROWS, cudaMemcpyHostToDevice);
  cudaMemcpy(device_uncompressed_ptrs, host_uncompressed_ptrs, sizeof(size_t *) * ROWS, cudaMemcpyHostToDevice);

  uint8_t* read_back_uncompressed_data;
  uint8_t** read_back_uncompressed_ptrs;
  size_t* read_back_uncompressed_bytes;

  cudaMallocHost(&read_back_uncompressed_data, sizeof(uint8_t) * ROWS * ROW_LEN);
  cudaMallocHost(&read_back_uncompressed_ptrs,  sizeof(size_t *) * ROWS);
  cudaMallocHost(&read_back_uncompressed_bytes,   sizeof(size_t) * ROWS);

  cudaMemcpy(read_back_uncompressed_data, device_uncompressed_ptrs, sizeof(uint8_t) * ROWS * ROW_LEN, cudaMemcpyDeviceToHost); // The "matrix" starts from device_uncompressed_ptrs, length is (sizeof(uint8_t) * ROWS * ROW_LEN)
  cudaMemcpy(read_back_uncompressed_bytes, device_uncompressed_bytes, sizeof(size_t) * ROWS, cudaMemcpyDeviceToHost);
  cudaMemcpy(read_back_uncompressed_ptrs, device_uncompressed_ptrs, sizeof(size_t *) * ROWS, cudaMemcpyDeviceToHost);

  printf("\n");
  for (size_t i = 0; i < ROWS; i++)
    for (size_t j = 0; j < ROW_LEN; j++)
      printf("read_back_uncompressed_data[%zu][%zu] = %u at address %p\n", i, j, read_back_uncompressed_ptrs[i][j], read_back_uncompressed_ptrs[i] + j);

  printf("\n");

  printf("       Address of      host_uncompressed_data is: %p\n", host_uncompressed_data);
  printf("       Address of    device_uncompressed_data is: %p\n", device_uncompressed_data);
  printf("       Address of read_back_uncompressed_data is: %p\n", read_back_uncompressed_data);
  printf("\n");
  printf("       Address of      host_uncompressed_ptrs is: %p\n", host_uncompressed_ptrs);
  printf("       Address of    device_uncompressed_ptrs is: %p\n", device_uncompressed_ptrs);
  printf("       Address of read_back_uncompressed_ptrs is: %p\n", read_back_uncompressed_ptrs);

  printf("\n");

  return 0;
}

And in case that assumption is correct, and if device_compressed_ptrs[0] points to the first byte, that would mean that the void** device_compressed_ptrs may take me to the beginning of the array. Then I would copy all the data (possibly with additional bytes) and discard the extras when mapping it to an array of pointers on the host. And then, for each row, I would get the N bytes by using the lengths from the device_compressed_bytes array.

Because of the above assumption, I simply tried to do the “magic” using:

cudaMemcpy(read_back_compressed_data, device_compressed_ptrs, total_compressed_bytes_len, cudaMemcpyDeviceToHost);

I simply could not figure out anything else. I admit I am new to CUDA; I installed the Nsight tools… but as nvCOMP ships as a compiled library (.so), I couldn’t do any further debugging.
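For what it is worth, assumption 2.1 above (that separate cudaMalloc calls return adjacent blocks) is not guaranteed by the CUDA runtime. A sketch of an alternative that makes the layout contiguous by construction, so that a single bulk copy becomes valid, provided the copy sources from the data address and not from the pointer-array address (variable names reuse the posted code; this is an illustration, not tested against the library):

```cuda
// Sketch: one device allocation, sliced into batch_size slots of
// max_out_bytes each, so the chunk "rows" are contiguous by construction.
uint8_t* device_compressed_data;
cudaMalloc(&device_compressed_data, max_out_bytes * batch_size);
for (size_t ix_chunk = 0; ix_chunk < batch_size; ++ix_chunk)
  host_compressed_ptrs[ix_chunk] = device_compressed_data + ix_chunk * max_out_bytes;

// ... fill device_compressed_ptrs and compress as before, then a single
// copy of the whole region is valid, sourced from the first data byte
// (not from the pointer array device_compressed_ptrs):
cudaMemcpy(read_back_compressed_data,
           device_compressed_data,            // == host_compressed_ptrs[0]
           max_out_bytes * batch_size,
           cudaMemcpyDeviceToHost);
```

Each host-side row i then starts at read_back_compressed_data + i * max_out_bytes, with read_back_compressed_bytes[i] meaningful bytes and padding after.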

Do as you wish, of course.

If I wished, I wouldn’t have started the discussion.

My idea was to compress a file on Host A (via CUDA GPU A), send it to Host B, and decompress it on Host B (via CUDA GPU B).

I could not find any example that obtains and shows the compressed data. My impression is that nvCOMP serves as a “GDDR”-saver for data residing on the GPU (maybe large language models and similar).

That makes me wonder not whether I wish to or not, but whether my expectations of nvCOMP are relevant at all…

@vmetodiev my idea is to compress a file on host A (via a CUDA GPU) and send the compressed data to host B for decompression (using a CUDA GPU or maybe the CPU).

I could be wrong, but I tried to make the following reasoning, based on how the host and device memory locations are created (concerning the compressed data).

Starting from host_compressed_ptrs. This array of pointers, declared as void **, is created on host with:

cudaMallocHost(&host_compressed_ptrs, sizeof(size_t) * batch_size)

After this call, I get:

[+] host_compressed_ptrs: Create output space on HOST
[+] on HOST : host_compressed_ptrs[0] = 0000000000000000 ; &host_compressed_ptrs[0] = 00000002052F4800
[+] on HOST : host_compressed_ptrs[1] = 0000000000000000 ; &host_compressed_ptrs[1] = 00000002052F4808
[+] on HOST : host_compressed_ptrs[2] = 0000000000000000 ; &host_compressed_ptrs[2] = 00000002052F4810
[+] on HOST : host_compressed_ptrs[3] = 0000000000000000 ; &host_compressed_ptrs[3] = 00000002052F4818
[+] on HOST : host_compressed_ptrs[4] = 0000000000000000 ; &host_compressed_ptrs[4] = 00000002052F4820
[+] on HOST : host_compressed_ptrs[5] = 0000000000000000 ; &host_compressed_ptrs[5] = 00000002052F4828
[+] on HOST : host_compressed_ptrs[6] = 0000000000000000 ; &host_compressed_ptrs[6] = 00000002052F4830
[+] on HOST : host_compressed_ptrs[7] = 0000000000000000 ; &host_compressed_ptrs[7] = 00000002052F4838
[+] on HOST : host_compressed_ptrs[8] = 0000000000000000 ; &host_compressed_ptrs[8] = 00000002052F4840
[+] on HOST : host_compressed_ptrs[9] = 0000000000000000 ; &host_compressed_ptrs[9] = 00000002052F4848
[+] on HOST : host_compressed_ptrs[10] = 0000000000000000 ; &host_compressed_ptrs[10] = 00000002052F4850
[+] on HOST : host_compressed_ptrs[11] = 0000000000000000 ; &host_compressed_ptrs[11] = 00000002052F4858
[+] on HOST : host_compressed_ptrs[12] = 0000000000000000 ; &host_compressed_ptrs[12] = 00000002052F4860
[+] on HOST : host_compressed_ptrs[13] = 0000000000000000 ; &host_compressed_ptrs[13] = 00000002052F4868
[+] on HOST : host_compressed_ptrs[14] = 0000000000000000 ; &host_compressed_ptrs[14] = 00000002052F4870
[+] on HOST : host_compressed_ptrs[15] = 0000000000000000 ; &host_compressed_ptrs[15] = 00000002052F4878

The next call tries to allocate space on the device:

 for (size_t ix_chunk = 0; ix_chunk < batch_size; ++ix_chunk) {
        memCreateResult = cudaMalloc(&host_compressed_ptrs[ix_chunk], max_out_bytes);
        printf("\t [+]    host_compressed_ptrs[%lu] = %p  ;  &host_compressed_ptrs[%lu] = %p  \n", ix_chunk, host_compressed_ptrs[ix_chunk], ix_chunk, &host_compressed_ptrs[ix_chunk]);
    }

[+] host_compressed_ptrs: Create output space on DEVICE
[+] host_compressed_ptrs[0] = 0000000B02D74800 ; &host_compressed_ptrs[0] = 00000002052F4800
[+] host_compressed_ptrs[1] = 0000000B02D84A00 ; &host_compressed_ptrs[1] = 00000002052F4808
[+] host_compressed_ptrs[2] = 0000000B02D94C00 ; &host_compressed_ptrs[2] = 00000002052F4810
[+] host_compressed_ptrs[3] = 0000000B02DA4E00 ; &host_compressed_ptrs[3] = 00000002052F4818
[+] host_compressed_ptrs[4] = 0000000B02DB5000 ; &host_compressed_ptrs[4] = 00000002052F4820
[+] host_compressed_ptrs[5] = 0000000B02DC5200 ; &host_compressed_ptrs[5] = 00000002052F4828
[+] host_compressed_ptrs[6] = 0000000B02DD5400 ; &host_compressed_ptrs[6] = 00000002052F4830
[+] host_compressed_ptrs[7] = 0000000B02DE5600 ; &host_compressed_ptrs[7] = 00000002052F4838
[+] host_compressed_ptrs[8] = 0000000B02E00000 ; &host_compressed_ptrs[8] = 00000002052F4840
[+] host_compressed_ptrs[9] = 0000000B02E10200 ; &host_compressed_ptrs[9] = 00000002052F4848
[+] host_compressed_ptrs[10] = 0000000B02E20400 ; &host_compressed_ptrs[10] = 00000002052F4850
[+] host_compressed_ptrs[11] = 0000000B02E30600 ; &host_compressed_ptrs[11] = 00000002052F4858
[+] host_compressed_ptrs[12] = 0000000B02E40800 ; &host_compressed_ptrs[12] = 00000002052F4860
[+] host_compressed_ptrs[13] = 0000000B02E50A00 ; &host_compressed_ptrs[13] = 00000002052F4868
[+] host_compressed_ptrs[14] = 0000000B02E60C00 ; &host_compressed_ptrs[14] = 00000002052F4870
[+] host_compressed_ptrs[15] = 0000000B02E70E00 ; &host_compressed_ptrs[15] = 00000002052F4878

With the next call, the code creates an array of pointers on the device, which will later be filled with the same values as host_compressed_ptrs:

  void** device_compressed_ptrs;
    memCreateResult = cudaMalloc(&device_compressed_ptrs, sizeof(size_t) * batch_size);

    for (size_t ix_chunk = 0; ix_chunk < batch_size; ++ix_chunk)
    {
        printf("\t [+] on DEVICE : device_compressed_ptrs[%lu] = NOT ALLOWED  ;  &device_compressed_ptrs[%lu] = %p  \n", ix_chunk, ix_chunk, &device_compressed_ptrs[ix_chunk]);
    }

[+] device_compressed_ptrs: Create output space on DEVICE
[+] on DEVICE : device_compressed_ptrs[0] = NOT ALLOWED ; &device_compressed_ptrs[0] = 0000000B02DF5800
[+] on DEVICE : device_compressed_ptrs[1] = NOT ALLOWED ; &device_compressed_ptrs[1] = 0000000B02DF5808
[+] on DEVICE : device_compressed_ptrs[2] = NOT ALLOWED ; &device_compressed_ptrs[2] = 0000000B02DF5810
[+] on DEVICE : device_compressed_ptrs[3] = NOT ALLOWED ; &device_compressed_ptrs[3] = 0000000B02DF5818
[+] on DEVICE : device_compressed_ptrs[4] = NOT ALLOWED ; &device_compressed_ptrs[4] = 0000000B02DF5820
[+] on DEVICE : device_compressed_ptrs[5] = NOT ALLOWED ; &device_compressed_ptrs[5] = 0000000B02DF5828
[+] on DEVICE : device_compressed_ptrs[6] = NOT ALLOWED ; &device_compressed_ptrs[6] = 0000000B02DF5830
[+] on DEVICE : device_compressed_ptrs[7] = NOT ALLOWED ; &device_compressed_ptrs[7] = 0000000B02DF5838
[+] on DEVICE : device_compressed_ptrs[8] = NOT ALLOWED ; &device_compressed_ptrs[8] = 0000000B02DF5840
[+] on DEVICE : device_compressed_ptrs[9] = NOT ALLOWED ; &device_compressed_ptrs[9] = 0000000B02DF5848
[+] on DEVICE : device_compressed_ptrs[10] = NOT ALLOWED ; &device_compressed_ptrs[10] = 0000000B02DF5850
[+] on DEVICE : device_compressed_ptrs[11] = NOT ALLOWED ; &device_compressed_ptrs[11] = 0000000B02DF5858
[+] on DEVICE : device_compressed_ptrs[12] = NOT ALLOWED ; &device_compressed_ptrs[12] = 0000000B02DF5860
[+] on DEVICE : device_compressed_ptrs[13] = NOT ALLOWED ; &device_compressed_ptrs[13] = 0000000B02DF5868
[+] on DEVICE : device_compressed_ptrs[14] = NOT ALLOWED ; &device_compressed_ptrs[14] = 0000000B02DF5870
[+] on DEVICE : device_compressed_ptrs[15] = NOT ALLOWED ; &device_compressed_ptrs[15] = 0000000B02DF5878

memCopyResult = cudaMemcpyAsync(device_compressed_ptrs, host_compressed_ptrs, sizeof(size_t) * batch_size, cudaMemcpyHostToDevice, stream);
    printf("\n[+] device_compressed_ptrs: Copy from  host_compressed_ptrs to device_compressed_ptrs\n");
    for (size_t ix_chunk = 0; ix_chunk < batch_size; ++ix_chunk)
    {
        printf("\t [+] on DEVICE : device_compressed_ptrs[%lu] = not allowed  ;  &device_compressed_ptrs[%lu] = %p  ;  host_compressed_ptrs[%lu] = %p \n", ix_chunk, ix_chunk, &device_compressed_ptrs[ix_chunk], ix_chunk, &host_compressed_ptrs[ix_chunk]);
    }

[+] device_compressed_ptrs: Copy from host_compressed_ptrs to device_compressed_ptrs
[+] on DEVICE : device_compressed_ptrs[0] = not allowed ; &device_compressed_ptrs[0] = 0000000B02DF5800 ; host_compressed_ptrs[0] = 0000000B02D74800
[+] on DEVICE : device_compressed_ptrs[1] = not allowed ; &device_compressed_ptrs[1] = 0000000B02DF5808 ; host_compressed_ptrs[1] = 0000000B02D84A00
[+] on DEVICE : device_compressed_ptrs[2] = not allowed ; &device_compressed_ptrs[2] = 0000000B02DF5810 ; host_compressed_ptrs[2] = 0000000B02D94C00
[+] on DEVICE : device_compressed_ptrs[3] = not allowed ; &device_compressed_ptrs[3] = 0000000B02DF5818 ; host_compressed_ptrs[3] = 0000000B02DA4E00
[+] on DEVICE : device_compressed_ptrs[4] = not allowed ; &device_compressed_ptrs[4] = 0000000B02DF5820 ; host_compressed_ptrs[4] = 0000000B02DB5000
[+] on DEVICE : device_compressed_ptrs[5] = not allowed ; &device_compressed_ptrs[5] = 0000000B02DF5828 ; host_compressed_ptrs[5] = 0000000B02DC5200
[+] on DEVICE : device_compressed_ptrs[6] = not allowed ; &device_compressed_ptrs[6] = 0000000B02DF5830 ; host_compressed_ptrs[6] = 0000000B02DD5400
[+] on DEVICE : device_compressed_ptrs[7] = not allowed ; &device_compressed_ptrs[7] = 0000000B02DF5838 ; host_compressed_ptrs[7] = 0000000B02DE5600
[+] on DEVICE : device_compressed_ptrs[8] = not allowed ; &device_compressed_ptrs[8] = 0000000B02DF5840 ; host_compressed_ptrs[8] = 0000000B02E00000
[+] on DEVICE : device_compressed_ptrs[9] = not allowed ; &device_compressed_ptrs[9] = 0000000B02DF5848 ; host_compressed_ptrs[9] = 0000000B02E10200
[+] on DEVICE : device_compressed_ptrs[10] = not allowed ; &device_compressed_ptrs[10] = 0000000B02DF5850 ; host_compressed_ptrs[10] = 0000000B02E20400
[+] on DEVICE : device_compressed_ptrs[11] = not allowed ; &device_compressed_ptrs[11] = 0000000B02DF5858 ; host_compressed_ptrs[11] = 0000000B02E30600
[+] on DEVICE : device_compressed_ptrs[12] = not allowed ; &device_compressed_ptrs[12] = 0000000B02DF5860 ; host_compressed_ptrs[12] = 0000000B02E40800
[+] on DEVICE : device_compressed_ptrs[13] = not allowed ; &device_compressed_ptrs[13] = 0000000B02DF5868 ; host_compressed_ptrs[13] = 0000000B02E50A00
[+] on DEVICE : device_compressed_ptrs[14] = not allowed ; &device_compressed_ptrs[14] = 0000000B02DF5870 ; host_compressed_ptrs[14] = 0000000B02E60C00
[+] on DEVICE : device_compressed_ptrs[15] = not allowed ; &device_compressed_ptrs[15] = 0000000B02DF5878 ; host_compressed_ptrs[15] = 0000000B02E70E00

Taking host_compressed_ptrs[0] = 0000000B02D74800 as an example, my understanding is that
host_compressed_ptrs[0] holds the device address where the CUDA compression algorithm
writes the compressed data for chunk 0 (and so on for the other chunks).

With this assumption, I am able to get the data back from the device as follows:

First, I create an array of pointers on the host, named h_read_back_compressed_ptrs:

    printf("\n\t[+] Create compressed header ptrs array (h_read_back_compressed_ptrs) of %lu elements into HOST.\n", batch_size);
    char** h_read_back_compressed_ptrs;
    gpuErrchk(cudaMallocHost(&h_read_back_compressed_ptrs, sizeof(char*) * batch_size));

Then, for each chunk, I allocate host memory of size read_back_compressed_bytes[i]:

    printf("\n[+] List of chunk compressed addresses into host h_read_back_compressed_ptrs.\n");
    for (size_t i = 0; i < batch_size; i++)
    {
        size_t chunkCompressedSize = read_back_compressed_bytes[i];
        gpuErrchk(cudaMallocHost(&h_read_back_compressed_ptrs[i], sizeof(char) * chunkCompressedSize));
        printf("\t[+] [%lu] Created compressed chunk (h_read_back_compressed_ptrs[%lu]) [%p] in host of size %lu\n", i, i, h_read_back_compressed_ptrs[i], chunkCompressedSize);
    }

Finally, I copy the data from device to host, using host_compressed_ptrs[i] as the device source address:

    printf("\n[+] Copy chunk compressed data from device (device_compressed_ptrs) to host (h_read_back_compressed_ptrs)\n");
    for (size_t i = 0; i < batch_size; i++)
    {
        size_t chunkCompressedSize = read_back_compressed_bytes[i];
        void* h_dst = h_read_back_compressed_ptrs[i];
        const void* d_src = host_compressed_ptrs[i];

        printf("\t[+] [%lu] Copy compressed chunk from device [%p] to host [%p] of size %lu\n", i, d_src, h_dst, chunkCompressedSize);
        gpuErrchk(cudaMemcpy(h_dst, d_src, sizeof(char) * chunkCompressedSize, cudaMemcpyDeviceToHost));
    }

[+] Get back compressed data in h_read_back_compressed_ptrs (leading 20 and trailing 20 bytes):
h_read_back_compressed_ptrs[0] = FC FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF . . . 43 B7 68 BF E7 E0 87 C4 A0 F8 98 C6 6E 86 3A 20 28 A0 1C FE
h_read_back_compressed_ptrs[1] = FC FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF . . . 05 8E 6C AC 51 50 BB E0 63 6A 58 72 9B 4C 1F AF 92 1E CE BB
h_read_back_compressed_ptrs[2] = FC FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF . . . BA 75 65 04 4D 43 6E 82 1F D7 72 3D D7 1D 30 8B 58 B1 D1 91
h_read_back_compressed_ptrs[3] = FC FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF . . . 6D AB 40 8F 4B 4A C9 F3 82 5C E1 5E 43 1F FE 6E 82 AC B1 E3
h_read_back_compressed_ptrs[4] = FC FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF . . . 98 A7 82 DC 17 4B 6C FB 0C 03 73 C5 21 72 72 D8 19 7E D8 1D
h_read_back_compressed_ptrs[5] = FC FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF . . . 21 24 15 1A 40 7A E2 A5 DD 40 0C 2D EC EE 2A 8B 32 06 5A 71
h_read_back_compressed_ptrs[6] = FC FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF . . . E9 37 CE F3 8F 8F C5 39 7C 16 31 48 64 3D 93 13 CB 87 51 FD
h_read_back_compressed_ptrs[7] = F0 FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF . . . CC 93 D3 A4 07 09 59 F1 26 7A 7D 28 FD 00 91 44 20 A1 4E 3F
h_read_back_compressed_ptrs[8] = FC FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF . . . D2 27 35 C9 4C 2D 20 FB F6 EC A4 E3 07 89 67 9B 16 44 9F 94
h_read_back_compressed_ptrs[9] = F0 FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF . . . 4D 45 E6 26 FE AD 0D 0F C4 2A F5 9F D0 F3 17 82 12 B7 A4 81
h_read_back_compressed_ptrs[10] = FC FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF . . . 0C 5A F7 FF 4C D2 CB 40 15 18 55 E0 24 20 EA 14 C4 02 E1 BD
h_read_back_compressed_ptrs[11] = FC FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF . . . 0D F5 AF E0 F2 56 2C C4 A4 B8 DE E6 A0 5C B3 4C EF 09 EA F9
h_read_back_compressed_ptrs[12] = FC FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF . . . 05 C9 82 4E CC BA 16 5A D1 7B 2E 19 79 DD 54 06 E5 E7 E0 B1
h_read_back_compressed_ptrs[13] = FC FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF . . . E3 4C DF 0A A5 58 78 53 09 79 A2 5D 0E 42 BF 6B E3 14 F1 05
h_read_back_compressed_ptrs[14] = FC FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF . . . FD 81 9B 9E DC E6 DB 0F FF 5C AB 2D 5F 99 D4 77 50 2D B9 74
h_read_back_compressed_ptrs[15] = FC FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF . . . 4D 60 2A 5D 8A 28 56 F5 37 F2 80 DD 38 D7 3F 03 AF BF 17 8C

At the moment I don't know why the first K bytes are the same for each chunk; maybe my assumption contains errors…

Anyway, I'm still not sure this is the right approach. I'm going to decompress the compressed data and compare it against the original.

@martini.andrea

I admit I do not understand your idea, especially these lines:

size_t chunkCompressedSize = read_back_compressed_bytes[i];
void* h_dst = h_read_back_compressed_ptrs[i];
const void* d_src = host_compressed_ptrs[i];

printf("\t[+]  [%lu] Copy compressed chunk from device [%p] to host [%p] of size %lu\n", i, d_src, h_dst, chunkCompressedSize);
gpuErrchk(cudaMemcpy(h_dst, d_src, sizeof(char) * chunkCompressedSize, cudaMemcpyDeviceToHost));

Are you sure that you are actually referring to (and copying from) a proper address on the GPU memory?
And the void* h_dst should have memory allocated by cudaMallocHost before trying to copy to it.

I also cannot figure out why you are printing things like:

&device_compressed_ptrs[i]

This way, you are printing the address where the pointer is located. If you want to print the address stored in the pointer, then you should not use the ampersand:

&device_compressed_ptrs[i] - the memory address where the compiler put the pointer variable
 device_compressed_ptrs[i] - the memory address that is stored in the pointer variable (in other words, where the pointer points to)

UPDATE:
I tried with your idea, in a limited scope - to read 100 and 1000 bytes from the first “row” of the compressed data from the device. When I call:

cudaMemcpy(read_back_compressed_data, device_compressed_ptrs, total_compressed_bytes_len, cudaMemcpyDeviceToHost);

For 100 bytes - I get some data (No errors from the compute-sanitizer)
For 1000 bytes - only zeroes (like in my previous post) (with cudaMemcpy error from the compute-sanitizer)

The length of the buffer is much larger than 1000 bytes (65794 actually):

read_back_compressed_bytes[0]=65794

Anyway, let's first wait for NVIDIA's response about the idea behind nvCOMP and whether it is possible to get the compressed bytes in a more natural way.

You really do need to wrap all CUDA API calls with error checking, as Robert mentioned.

The issue here is:

cudaMallocHost(&read_back_compressed_data, total_compressed_bytes_len);
  cudaMemcpy(read_back_compressed_data, device_compressed_ptrs, total_compressed_bytes_len, cudaMemcpyDeviceToHost);

This memcpy doesn't make sense. device_compressed_ptrs is an array of pointers to the start of each compressed chunk, but you're treating it as the starting address of the compressed data itself.

Note, in the low-level API you’re using, there’s no requirement that the compressed / uncompressed buffers are contiguous.

To decompress, you’d need to maintain the offset to the start of each compressed chunk and the size of each decompressed chunk.

I'd recommend looking at our high-level interface. It avoids the need for managing the pointer arrays and is intended for compressing / decompressing a single large buffer with an easy-to-use interface.

So… after combining all of the above suggestions, I am now heading towards the following (I guess it is quite similar or exactly the same as the proposal made by @martini.andrea):

  nvcompStatus_t comp_res = nvcompBatchedLZ4CompressAsync(
      device_uncompressed_ptrs,
      device_uncompressed_bytes,
      chunk_size,
      batch_size,               // Will yield 16 buffers ("rows") with compressed data
      device_temp_ptr,
      temp_bytes,
      device_compressed_ptrs,   // ---> Copy to host_copy_of_device_compressed_ptrs on the host
      device_compressed_bytes,  // ---> Copy to host_copy_of_device_compressed_bytes on the host
      nvcompBatchedLZ4DefaultOpts,
      stream);

  if (comp_res != nvcompSuccess)
  {
    std::cerr << "Failed compression!" << std::endl;
    assert(comp_res == nvcompSuccess);
  }
  CUDA_CHECK(cudaStreamSynchronize(stream));

The new attempt to copy the compressed data from the GPU to the host RAM:

  // Copy the device_compressed_ptrs from device to host
  void** host_copy_of_device_compressed_ptrs = NULL;
  CUDA_CHECK(cudaMallocHost(&host_copy_of_device_compressed_ptrs, sizeof(void *) * batch_size));
  CUDA_CHECK(cudaMemcpy(host_copy_of_device_compressed_ptrs, device_compressed_ptrs, sizeof(void *) * batch_size, cudaMemcpyDeviceToHost));
  // ---> We now have the device_compressed_ptrs on the host into host_copy_of_device_compressed_ptrs - those are the pointers to the GPU device allocated buffers with compressed data (16 "rows")

  // Copy the device_compressed_bytes from device to host
  size_t* host_copy_of_device_compressed_bytes = NULL;
  CUDA_CHECK(cudaMallocHost(&host_copy_of_device_compressed_bytes, sizeof(size_t) * batch_size));
  CUDA_CHECK(cudaMemcpy(host_copy_of_device_compressed_bytes, device_compressed_bytes, sizeof(size_t) * batch_size, cudaMemcpyDeviceToHost));
  // ---> We now have the device_compressed_bytes on the host into host_copy_of_device_compressed_bytes - those are the lengths of the GPU device buffers with compressed data (16 "rows")

  // Allocate host buffers where we will copy the compressed data from device to host
  uint8_t** host_copy_of_compressed_data = NULL;
  CUDA_CHECK(cudaMallocHost(&host_copy_of_compressed_data, sizeof(uint8_t *) * batch_size));

  for (size_t i = 0; i < batch_size; i++)
  {
      // Allocate buffers on the host where the GPU compressed bytes will be copied (the lengths are already known from the host_copy_of_device_compressed_bytes array)
      CUDA_CHECK(cudaMallocHost(&host_copy_of_compressed_data[i], host_copy_of_device_compressed_bytes[i]));

      // Copy the compressed bytes for the current batch (row)
      CUDA_CHECK(cudaMemcpy(host_copy_of_compressed_data[i], host_copy_of_device_compressed_ptrs[i], host_copy_of_device_compressed_bytes[i], cudaMemcpyDeviceToHost));
  }
  // ---> We now have all the compressed data bytes copied from device to host

And the output after running under compute-sanitizer:

    [+] Content #1 (leading 5 and trailing 5 bytes):
        host_uncompressed_ptrs_read[0] = 5F CB F3 2E BB  . . .  C1 FA 13 E6 27 
        host_uncompressed_ptrs_read[1] = E4 14 C9 51 B1  . . .  FD E0 0A 75 0D 
        host_uncompressed_ptrs_read[2] = 57 30 A1 A8 9E  . . .  CC 30 04 78 54 
        host_uncompressed_ptrs_read[3] = 70 73 31 8D DE  . . .  5D 09 B4 54 FC 
        host_uncompressed_ptrs_read[4] = 69 DE 3B F2 F8  . . .  97 40 45 90 D4 
        host_uncompressed_ptrs_read[5] = CF E6 F5 6B 02  . . .  7A 52 D1 F1 7F 
        host_uncompressed_ptrs_read[6] = 92 EB 2F D5 6F  . . .  E8 EB B2 86 49 
        host_uncompressed_ptrs_read[7] = 67 4D 3A F7 33  . . .  53 01 5A 41 7C 
        host_uncompressed_ptrs_read[8] = 2F B9 87 FB E5  . . .  B6 B3 C9 4A 9B 
        host_uncompressed_ptrs_read[9] = 83 E4 5B 85 BD  . . .  5A 58 54 0F 6E 
        host_uncompressed_ptrs_read[10] = FB A9 EA 41 F0  . . .  49 EB 0A E5 E8 
        host_uncompressed_ptrs_read[11] = D1 89 AE 15 ED  . . .  61 2F C2 8A F5 
        host_uncompressed_ptrs_read[12] = 31 1B 5F 38 F2  . . .  55 C9 5A 5C 0B 
        host_uncompressed_ptrs_read[13] = E4 E3 75 A8 12  . . .  65 78 03 88 92 
        host_uncompressed_ptrs_read[14] = 13 60 03 1F 37  . . .  33 FC 44 18 D6 
        host_uncompressed_ptrs_read[15] = 6C 83 B6 0B 13  . . .  89 AA A6 A0 37 
    [+] Content #2 (leading 5 and trailing 5 bytes):
        host_copy_of_compressed_data[0] = FC FF FF FF FF  . . .  C1 FA 13 E6 27 
        host_copy_of_compressed_data[1] = FC FF FF FF FF  . . .  FD E0 0A 75 0D 
        host_copy_of_compressed_data[2] = FC FF FF FF FF  . . .  CC 30 04 78 54 
        host_copy_of_compressed_data[3] = F0 FF FF FF FF  . . .  5D 09 B4 54 FC 
        host_copy_of_compressed_data[4] = F0 FF FF FF FF  . . .  97 40 45 90 D4 
        host_copy_of_compressed_data[5] = FC FF FF FF FF  . . .  7A 52 D1 F1 7F 
        host_copy_of_compressed_data[6] = FC FF FF FF FF  . . .  E8 EB B2 86 49 
        host_copy_of_compressed_data[7] = FC FF FF FF FF  . . .  53 01 5A 41 7C 
        host_copy_of_compressed_data[8] = FC FF FF FF FF  . . .  B6 B3 C9 4A 9B 
        host_copy_of_compressed_data[9] = FC FF FF FF FF  . . .  5A 58 54 0F 6E 
        host_copy_of_compressed_data[10] = FC FF FF FF FF  . . .  49 EB 0A E5 E8 
        host_copy_of_compressed_data[11] = F0 FF FF FF FF  . . .  61 2F C2 8A F5 
        host_copy_of_compressed_data[12] = FC FF FF FF FF  . . .  55 C9 5A 5C 0B 
        host_copy_of_compressed_data[13] = FC FF FF FF FF  . . .  65 78 03 88 92 
        host_copy_of_compressed_data[14] = F0 FF FF FF FF  . . .  33 FC 44 18 D6 
        host_copy_of_compressed_data[15] = FC FF FF FF FF  . . .  89 AA A6 A0 37 
========= ERROR SUMMARY: 0 errors

Could you please confirm if this is a correct approach now?

You could also copy the compressed data to a single buffer, offsetting into the buffer based on the compressed size of each chunk. I.e.:

  size_t offset = 0;
  for (size_t ix_chunk = 0; ix_chunk < batch_size; ix_chunk++)
  {
      CUDA_CHECK(cudaMemcpy(read_back_compressed_data + offset, read_back_compressed_ptrs[ix_chunk], read_back_compressed_bytes[ix_chunk], cudaMemcpyDeviceToHost));
      offset += read_back_compressed_bytes[ix_chunk];
  }

Again though, if the goal is, as you discussed, to compress on a node with GPU A and transfer to a node with GPU B, I'd strongly recommend using the high-level API, as in high_level_quickstart_example.cpp. Otherwise you'd need to separately transfer each of these buffers and do a lot of setup on the other node.
