I’m trying to multiply many vectors (1x4) by many matrices (4x4) using a cuBLAS GEMM operation.
I have them all in column-major order (pretty sure, though I think the ordering is irrelevant to this bug), in one contiguous memory allocation. So as far as I can see, a single-precision strided batched GEMM (cublasSgemmStridedBatched) is a perfect fit for what I’m trying to achieve.
I’ve double-checked all of my parameters, but I’m getting really strange results. If I write out a sample 1x4 vector and 4x4 matrix and compute the product by hand, the answer comes out as expected, but CUDA fills the output with strange values.
The code I am using (I have renamed some variables for clarity; there are no other syntax or variable-name issues, it compiles and runs, and the RaiseCuda/HandleCublasStatus calls simply check the return status of each call):
int batches = 1000;
int floatsPerVector = 4;
int floatsPerMatrix = 16;
float *positions_output = new float[batches * floatsPerVector];
float *dev_mat_points = 0;
float *dev_mat_transform = 0;
float *dev_mat_result = 0;
int count_mat_points = batches * floatsPerVector;
int count_mat_transform = batches * floatsPerMatrix;
// reserve mem
RaiseCuda(cudaMalloc((void**)&dev_mat_points, count_mat_points * sizeof(float)));
RaiseCuda(cudaMalloc((void**)&dev_mat_transform, count_mat_transform * sizeof(float)));
RaiseCuda(cudaMalloc((void**)&dev_mat_result, count_mat_points * sizeof(float)));
// copy in data
RaiseCuda(cudaMemcpy(dev_mat_points, positions_flattened, count_mat_points * sizeof(float), cudaMemcpyHostToDevice));
RaiseCuda(cudaMemcpy(dev_mat_transform, tfs_flattened, count_mat_transform * sizeof(float), cudaMemcpyHostToDevice));
float alpha = 1.f;
float beta = 0.f;
HandleCublasStatus(cublasSgemmStridedBatched(cublasHandle, CUBLAS_OP_N, CUBLAS_OP_N,
    1 /* m */, 4 /* n */, 4 /* k */, &alpha,
    dev_mat_transform /* B */, floatsPerVector /* ldb */, floatsPerMatrix /* strideB */,
    dev_mat_points /* A */, floatsPerVector /* lda */, floatsPerVector /* strideA */,
    &beta,
    dev_mat_result /* C */, floatsPerVector /* ldc */, floatsPerVector /* strideC */,
    batches /* batchCount */));
// wait for the result
RaiseCuda(cudaDeviceSynchronize());
// copy result back to host
RaiseCuda(cudaMemcpy(positions_output, dev_mat_result, count_mat_points * sizeof(float), cudaMemcpyDeviceToHost));
// clean up memory code here
However, these are the kinds of unexpected results I am getting when I check a random vector’s result:
As far as I can tell, they all follow the same pattern, [N, 0, 0, 0]. Is this indicative of some sort of user error in copying back or interpreting the output? If I work out the matrix multiplication by hand, it does not come out to those answers.
Running on: NVIDIA Jetson AGX Xavier
CUDA info:
Built on Wed_Oct_23_21:14:42_PDT_2019
Cuda compilation tools, release 10.2, V10.2.89