Hello,
I use the cuFFT library with multiple GPUs (the Xt API) to accelerate my program, and I've run into an issue for which I unfortunately cannot see any good solution. The program simulates wave propagation in a loop that uses R2C and C2R transformations.
First I create R2C and C2R cufftXt plans. Then I create three cudaLibXtDesc descriptors using cufftXtMalloc:
- X with format INPLACE (will be initialized with data),
- Y with format INPLACE_SHUFFLED and
- Z with format INPLACE_SHUFFLED
After I copy the initial data into X using cufftXtMemcpy, I begin the main loop.
First an R2C transformation is performed on X. Then a kernel is launched that reads the complex data in X and computes new values for X, Y and Z. Finally I perform a C2R transformation on X, Y and Z and use the data from all three descriptors.
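For reference, here is a simplified sketch of that flow (not my exact code: a 3-D single-precision transform on two GPUs is assumed; the sizes NX/NY/NZ, hostData, numSteps, the update kernel and all error checking are placeholders):
int gpus[2] = {0, 1};
size_t workSizes[2];

// R2C and C2R multi-GPU plans of the same geometry
cufftHandle planR2C, planC2R;
cufftCreate(&planR2C);
cufftXtSetGPUs(planR2C, 2, gpus);
cufftMakePlan3d(planR2C, NZ, NY, NX, CUFFT_R2C, workSizes);
cufftCreate(&planC2R);
cufftXtSetGPUs(planC2R, 2, gpus);
cufftMakePlan3d(planC2R, NZ, NY, NX, CUFFT_C2R, workSizes);

// X starts in natural order, Y and Z are expected in shuffled order
cudaLibXtDesc *X, *Y, *Z;
cufftXtMalloc(planR2C, &X, CUFFT_XT_FORMAT_INPLACE);
cufftXtMalloc(planR2C, &Y, CUFFT_XT_FORMAT_INPLACE_SHUFFLED);
cufftXtMalloc(planR2C, &Z, CUFFT_XT_FORMAT_INPLACE_SHUFFLED);

cufftXtMemcpy(planR2C, X, hostData, CUFFT_COPY_HOST_TO_DEVICE);

for (int step = 0; step < numSteps; ++step)
{
    // X: real (INPLACE) -> complex (INPLACE_SHUFFLED)
    cufftXtExecDescriptorR2C(planR2C, X, X);

    // per-GPU kernel: read the complex X and write new complex values to X, Y, Z
    // ...

    // back to real; afterwards X, Y and Z all report the INPLACE format
    cufftXtExecDescriptorC2R(planC2R, X, X);
    cufftXtExecDescriptorC2R(planC2R, Y, Y);
    cufftXtExecDescriptorC2R(planC2R, Z, Z);

    // ... use the real data in X, Y and Z ...
}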
The problem is that after each iteration X ends up in INPLACE format (which is fine), but Y and Z end up in INPLACE format as well, while I need them in INPLACE_SHUFFLED format for the next iteration. I don't want to perform an R2C transformation on them just to put them into that format; I would like to simply change their current state. I know their data would be invalid after that, but I overwrite it anyway.
One workaround that came to my mind is to allocate new descriptors for Y and Z every iteration with the INPLACE_SHUFFLED format, but that is not a good enough solution for me.
So I would like to ask whether there is another way, or whether one could possibly be added, for example some cufftXtSetDescriptorState function.
Thank you very much.
Dave
Hi, I think I have solved the problem. It seems that the current data distribution is only tracked in the subFormat member of struct cudaLibXtDesc. It flips between INPLACE (real) and INPLACE_SHUFFLED (complex) after each transformation. If you manually change it to the desired format, it works.
WARNING: this only overwrites the format; it does not perform the data exchange among the GPUs!
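In code it is just this (the subFormat field mentioned above; there is no official API for this, so it may break in a future cuFFT release):
// After the C2R transforms, Y and Z report CUFFT_XT_FORMAT_INPLACE.
// Flip the bookkeeping field so the next iteration accepts them as shuffled
// complex input again; their contents are stale at this point, but the kernel
// overwrites them before the next C2R anyway.
Y->subFormat = CUFFT_XT_FORMAT_INPLACE_SHUFFLED;
Z->subFormat = CUFFT_XT_FORMAT_INPLACE_SHUFFLED;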
Hopefully this behavior will not change in a future cuFFT release.
Thanks.
Dave
Hi,
Do you know if there is a way to exchange the data?
My problem is similar.
Here is the simplified version of my code:
// A simplified version of the kernel
__global__ void modify_arr(const double* desc_data, double* arr) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    arr[idx] = desc_data[idx] * A + B; // A and B are constants
}
cufftXtSubFormat_t FORMAT = CUFFT_XT_FORMAT_INPLACE;
cufftHandle plan;
// ... Initialize 2D plan
cudaLibXtDesc* desc;
// ... Allocate desc on GPUs using cufftXtMalloc with given FORMAT
double* arr[num_gpus];
for (int gpu_id = 0; gpu_id < num_gpus; gpu_id++) {
    cudaSetDevice(gpu_id);
    // Allocate as many doubles on this GPU as desc holds there
    size_t numElements = desc->descriptor->size[gpu_id] / sizeof(double);
    cudaMalloc(&arr[gpu_id], numElements * sizeof(double));
}
// Perform 2D FFT
cufftXtExecDescriptor(plan, desc, desc, CUFFT_FORWARD);
// Modify arr using the data of `desc` on the same GPU
for (int gpu_id = 0; gpu_id < num_gpus; gpu_id++) {
    cudaSetDevice(gpu_id);
    modify_arr<<<...>>>((double*)desc->descriptor->data[gpu_id], arr[gpu_id]);
}
Here, I want to modify the idx-th value of arr using the idx-th value of the data in the desc descriptor. But after the FFT, the data is exchanged between the GPUs and the original order is lost.
Specifically, is there any way to perform a (2D/3D) FFT on multiple GPUs such that if x[i] was present on GPU g, then y[i] (where y is the result of the FFT) will also be present on the same GPU at the same location?
Thank you very much
Aaryan
Hi,
I wrote a function that reorders the data from CUFFT_XT_FORMAT_INPLACE_SHUFFLED to CUFFT_XT_FORMAT_INPLACE after an R2C transform, using cudaMemcpy3DPeer, for the 3D case. I didn't test it, so there may be a bug, but it should give you an idea of what to do. You can implement the 2D case the same way; just work with the X and Y axes and set the Z size to 1.
#include <cstddef>
#include <algorithm>
#include <cuda_runtime.h>
#include <cufftXt.h>
void shuffledToNatural3D(void* const* src,
                         void* const* dst,
                         const int* devices,
                         const std::size_t deviceCount,
                         const std::size_t* n)
{
    const std::size_t z = n[0];         // z dimension size
    const std::size_t y = n[1];         // y dimension size
    const std::size_t x = n[2];         // x dimension size
    const std::size_t xRed = x / 2 + 1; // reduced x dimension size

    for (std::size_t i = 0; i < deviceCount; ++i)
    {
        // devices[i]'s local y dimension size
        const std::size_t locSrcY = y / deviceCount + (i < y % deviceCount);

        for (std::size_t j = 0; j < deviceCount; ++j)
        {
            // devices[j]'s local z dimension size
            const std::size_t locDstZ = z / deviceCount + (j < z % deviceCount);
            // z position from which to read the slab on the source device
            const std::size_t srcZPos = j * (z / deviceCount) + std::min(j, z % deviceCount);
            // y position at which to write the slab on the destination device
            const std::size_t dstYPos = i * (y / deviceCount) + std::min(i, y % deviceCount);

            cudaMemcpy3DPeerParms params{};
            params.srcPos = make_cudaPos(0, 0, srcZPos);
            params.srcPtr.ptr = src[i];
            params.srcPtr.pitch = xRed * sizeof(cufftComplex);
            params.srcPtr.xsize = xRed;
            params.srcPtr.ysize = locSrcY;
            params.srcDevice = devices[i];
            params.dstPos = make_cudaPos(0, dstYPos, 0);
            params.dstPtr.ptr = dst[j];
            params.dstPtr.pitch = xRed * sizeof(cufftComplex);
            params.dstPtr.xsize = xRed;
            params.dstPtr.ysize = y;
            params.dstDevice = devices[j];
            params.extent = make_cudaExtent(xRed * sizeof(cufftComplex), locSrcY, locDstZ);

            if (cudaMemcpy3DPeer(&params) != cudaSuccess)
            {
                // handle error
            }
        }
    }
}
Best regards
David
Hi
Can this code be used for shuffling the data in a cudaLibXtDesc descriptor? What would the values of src and dst be in that case?
Thank you very much
Aaryan