How to perform a basic 2D convolution of 2D images with CUTLASS?

I have a hard time understanding CUTLASS. I have found examples here and there, but I am not able to perform a simple convolution of a 2D image of size WxH with a row filter of size 1xK.
The code compiles and runs without errors, but the result is garbage.
The error almost certainly lies in the coords/strides, which are terribly documented.
Can you help?

For context:
src is a width x height image with a row stride (in bytes) of srcStride
dst is a width x height image with a row stride (in bytes) of dstStride
kernelData is a (2*kernelRadius+1) x 1 image
Actually, I am testing with contiguous images of size 10x10 and a kernel of size 1x3 filled with 1’s.
dst seems to receive some data from src, not even multiplied, and most of dst is 0’s.

#include "cudaCUTLASS.hpp"

#include <cutlass/cutlass.h>
#include <cutlass/conv/kernel/default_conv2d_fprop.h>
#include <cutlass/conv/device/implicit_gemm_convolution.h>
#include <cutlass/util/device_memory.h>

//src, dst, kernelData are all in device memory
void convolutionCUTLASSRow(const float* src, size_t srcStride, float* dst, size_t dstStride, int width, int height, const float* kernelData, int kernelRadius, cudaStream_t stream)
{
  using ElementA = float;
  using ElementB = float;
  using ElementC = float;
  using ElementAccumulator = float;
  using ElementCompute = float;

  using Epilogue = cutlass::epilogue::thread::LinearCombination<
    ElementC,
    1,
    ElementAccumulator,
    ElementCompute
  >;
  using Conv2dFpropKernel = typename cutlass::conv::kernel::DefaultConv2dFprop <
    ElementA,
    cutlass::layout::TensorNHWC,
    ElementB,
    cutlass::layout::TensorNHWC,
    ElementC,
    cutlass::layout::TensorNHWC,
    ElementAccumulator,
    cutlass::arch::OpClassSimt,
    cutlass::arch::Sm50,
    cutlass::gemm::GemmShape<128, 128, 8>,
    cutlass::gemm::GemmShape<64, 64, 8>,
    cutlass::gemm::GemmShape<1, 1, 1>,
    Epilogue,
    cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<>,
    2,
    cutlass::arch::OpMultiplyAdd,
    cutlass::conv::IteratorAlgorithm::kAnalytic
  >::Kernel;

  using Conv2dFprop = cutlass::conv::device::ImplicitGemmConvolution<Conv2dFpropKernel>;

  Conv2dFprop implicit_gemm_op;

  const int kernelDiameter = 2 * kernelRadius + 1;
  cutlass::Tensor4DCoord input_size(1, height, width, 1);
  cutlass::Tensor4DCoord filter_size(1, kernelDiameter, 1, 1);
  cutlass::Tensor4DCoord output_size(1, height, width, 1);

  cutlass::conv::Conv2dProblemSize problem_size(
    input_size,
    filter_size,
    cutlass::Tensor4DCoord(1, 1, 1, 1),
    cutlass::MatrixCoord(1, 1),
    cutlass::MatrixCoord(1, 1),
    output_size,
    cutlass::conv::Mode::kCrossCorrelation,
    1
  );

  const int srcStrideInElements = static_cast<int>(srcStride / sizeof(float));
  cutlass::layout::TensorNHWC src_layout(1, srcStrideInElements, height* srcStrideInElements);
  auto tensor_src = cutlass::make_TensorRef(const_cast<float*>(src), src_layout);

  cutlass::layout::TensorNHWC ker_layout(1, kernelDiameter, kernelDiameter);
  auto tensor_ker = cutlass::make_TensorRef(const_cast<float*>(kernelData), ker_layout);

  const int dstStrideInElements = static_cast<int>(dstStride / sizeof(float));
  cutlass::layout::TensorNHWC dst_layout(1, dstStrideInElements, height * dstStrideInElements);
  auto tensor_dst = cutlass::make_TensorRef(dst, dst_layout);

  using Arguments = typename Conv2dFprop::Arguments;
  Arguments arguments = Arguments(
    problem_size,
    tensor_src,
    tensor_ker,
    tensor_dst,
    tensor_dst,
    { 1.f, 0.f },
    cutlass::conv::SplitKMode::kSerial
  );

  cutlass::Status status = implicit_gemm_op.can_implement(arguments);
  if (status != cutlass::Status::kSuccess)
    return;

  size_t workspace_size = implicit_gemm_op.get_workspace_size(arguments);
  cutlass::device_memory::allocation<uint8_t> workspace(workspace_size);

  status = implicit_gemm_op.initialize(arguments, workspace.get(), stream);
  if (status != cutlass::Status::kSuccess)
    return;

  // run on the caller's stream, not the default one
  status = implicit_gemm_op(stream);

}

This is way outside my experience, but if you haven’t already found it, the PDF here may be helpful.

Thanks, but it does not really help. I have already looked at sessions about CUTLASS and GEMM, but the problem is that the documentation of CUTLASS itself is basically non-existent, or at least not written for humans.
There is no sample code that does something useful, only thousands of unit tests that implement and run kernels. Meanwhile, the parameters of basic structures like Layout or Stride are not detailed with meaningful examples.
And ultimately, when you just want to perform the first step of what CUTLASS is designed for (some linear algebra), you hit a huge stumbling block, even in the simplest case of 2D images (rather than a batch of 4D tensors from the inner layer of a DNN).

There is plenty of documentation under /media/docs/, for example for convolution: cutlass/media/docs/implicit_gemm_convolution.md at main · NVIDIA/cutlass · GitHub

And with that “doc” I was able to produce the above code that does not work and for which I need help :-)

Really, that all-template library misses the point by not giving pragmatic usage samples. For instance, good luck understanding how to create a Tensor from existing device memory: I finally found the make_TensorRef helper, but only by digging through the CUTLASS source code; I could not find any documented usage.

I have made some progress.

The code below, with (as expected) a few fixes to the stride usage, almost works.
It works with a contiguous or non-contiguous src matrix and a contiguous dst matrix.
It fails with a contiguous or non-contiguous src matrix and a non-contiguous dst matrix (“fails” here meaning “produces garbage with a lot of 0s”).
[edit]
dst is filled within the extra pixels of the first row, so the data is there but mis-strided.
That’s why there are lots of 0s: the other rows are untouched.
It means that the stride is ignored in the dst tensor.
[/edit]

I don’t understand why dst is affected by its stride, while src handles it perfectly.

#include "cudaCUTLASS.hpp"

#include <cutlass/cutlass.h>
#include <cutlass/conv/kernel/default_conv2d_fprop.h>
#include <cutlass/conv/device/implicit_gemm_convolution.h>
#include <cutlass/util/device_memory.h>

//src, dst, kernelData are all in device memory
void convolutionCUTLASSRow(const float* src, size_t srcStride, float* dst, size_t dstStride, int width, int height, const float* kernelData, int kernelRadius, cudaStream_t stream)
{
  using ElementA = float;
  using ElementB = float;
  using ElementC = float;
  using ElementAccumulator = float;
  using ElementCompute = float;

  using Epilogue = cutlass::epilogue::thread::LinearCombination<
    ElementC,
    1,
    ElementAccumulator,
    ElementCompute
  >;
  using Conv2dFpropKernel = typename cutlass::conv::kernel::DefaultConv2dFprop <
    ElementA,
    cutlass::layout::TensorNHWC,
    ElementB,
    cutlass::layout::TensorNHWC,
    ElementC,
    cutlass::layout::TensorNHWC,
    ElementAccumulator,
    cutlass::arch::OpClassSimt,
    //cutlass::arch::Sm50,
    cutlass::arch::Sm80,
    cutlass::gemm::GemmShape<128, 128, 8>,
    cutlass::gemm::GemmShape<64, 64, 8>,
    cutlass::gemm::GemmShape<1, 1, 1>,
    Epilogue,
    cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<>,
    2,
    cutlass::arch::OpMultiplyAdd,
    cutlass::conv::IteratorAlgorithm::kAnalytic
  >::Kernel;

  using Conv2dFprop = cutlass::conv::device::ImplicitGemmConvolution<Conv2dFpropKernel>;

  Conv2dFprop implicit_gemm_op;

  const int kernelDiameter = 2 * kernelRadius + 1;
  cutlass::Tensor4DCoord input_size(1, height, width, 1);       // NHWC: N=1, H, W, C=1
  cutlass::Tensor4DCoord filter_size(1, 1, kernelDiameter, 1);  // KRSC: K=1, R=1, S=kernelDiameter (1xK row filter)
  cutlass::Tensor4DCoord output_size(1, height, width, 1);

  cutlass::conv::Conv2dProblemSize problem_size(
    input_size,
    filter_size,
    cutlass::Tensor4DCoord(0, 1, 1, 1),
    cutlass::MatrixCoord(1, 1),
    cutlass::MatrixCoord(1, 1),
    output_size,
    cutlass::conv::Mode::kConvolution,
    1
  );

  const int srcStrideInElements = static_cast<int>(srcStride / sizeof(float));
  // TensorNHWC strides (in elements) between consecutive w, h and n coordinates
  cutlass::layout::TensorNHWC src_layout(1, srcStrideInElements, height * srcStrideInElements);
  auto tensor_src = cutlass::make_TensorRef(const_cast<float*>(src), src_layout);

  cutlass::layout::TensorNHWC ker_layout(1, kernelDiameter, kernelDiameter);
  auto tensor_ker = cutlass::make_TensorRef(const_cast<float*>(kernelData), ker_layout);

  const int dstStrideInElements = static_cast<int>(dstStride / sizeof(float));
  cutlass::layout::TensorNHWC dst_layout(1, dstStrideInElements, height * dstStrideInElements);
  auto tensor_dst = cutlass::make_TensorRef(dst, dst_layout);

  using Arguments = typename Conv2dFprop::Arguments;
  Arguments arguments = Arguments(
    problem_size,
    tensor_src,
    tensor_ker,
    tensor_dst,
    tensor_dst,
    { 1.f, 0.f },
    cutlass::conv::SplitKMode::kSerial
  );

  cutlass::Status status = implicit_gemm_op.can_implement(arguments);
  if (status != cutlass::Status::kSuccess)
    return;

  size_t workspace_size = implicit_gemm_op.get_workspace_size(arguments);
  cutlass::device_memory::allocation<uint8_t> workspace(workspace_size);

  status = implicit_gemm_op.initialize(arguments, workspace.get(), stream);
  if (status != cutlass::Status::kSuccess)
    return;

  // run on the caller's stream, not the default one
  status = implicit_gemm_op(stream);

}

I have escalated that as a bug here : https://github.com/NVIDIA/cutlass/issues/1323