When I’m not using MPS, ncu works fine as follows:
However, when I turn on MPS (Multi-Process Service), I can’t profile on any device. I’ve used the --mps control option from version 2025.2 and still get this error. What’s the problem?
Hi, @2023055089
Can you please try another simple sample to see whether this still reproduces?
Hi. I tried a simple program as follows:
#include <stdio.h>

__global__ void kernel_A(double* A, int N, int M)
{
    double d = 0.0;
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    // printf("Kernel A\n");
    if (idx < N) {
#pragma unroll(100)
        for (int j = 0; j < M; ++j) {
            d += A[idx];
        }
        A[idx] = d;
    }
}

__global__ void kernel_B(double* A, int N, int M)
{
    double d = 0.0;
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < N) {
#pragma unroll(100)
        for (int j = 0; j < M; ++j) {
            d += A[idx];
        }
        A[idx] = d;
    }
}

__global__ void kernel_C(double* A, const double* B, int N)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    // printf("Kernel C\n");
    // Strided memory access: warp 0 accesses (0, stride, 2*stride, ...), warp 1 accesses
    // (1, stride + 1, 2*stride + 1, ...).
    const int stride = 16;
    int strided_idx = threadIdx.x * stride + blockIdx.x % stride + (blockIdx.x / stride) * stride * blockDim.x;
    if (strided_idx < N) {
        A[idx] = B[strided_idx] + B[strided_idx];
    }
}

int main() {
    double* A;
    double* B;
    int N = 80 * 2048 * 100;
    size_t sz = N * sizeof(double);
    cudaMalloc((void**) &A, sz);
    cudaMalloc((void**) &B, sz);
    cudaMemset(A, 0, sz);
    cudaMemset(B, 0, sz);
    int threadsPerBlock = 64;
    int numBlocks = (N + threadsPerBlock - 1) / threadsPerBlock;
    int M = 10000;
    kernel_A<<<numBlocks, threadsPerBlock>>>(A, N, M);
    cudaFuncSetAttribute(kernel_B, cudaFuncAttributeMaxDynamicSharedMemorySize, 48 * 1024);
    kernel_B<<<numBlocks, threadsPerBlock, 48 * 1024>>>(A, N, M);
    kernel_C<<<numBlocks, threadsPerBlock>>>(A, B, N);
    cudaDeviceSynchronize();
}
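The sample above ignores the return codes of its CUDA calls, so a failure under the profiler and a failure in the program itself look the same. A minimal sketch of return-code checking follows. The helper `cuda_ok` and the kernel `scale` are hypothetical names introduced for illustration and are not part of the original post.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not from the original sample): report a failed CUDA
// runtime call and return false so the caller can stop early.
static bool cuda_ok(cudaError_t err, const char* what) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// Hypothetical stand-in kernel, kept trivial on purpose.
__global__ void scale(double* a, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n) {
        a[idx] *= 2.0;
    }
}

int main() {
    const int n = 1 << 20;
    const size_t sz = n * sizeof(double);
    double* a = nullptr;

    if (!cuda_ok(cudaMalloc((void**)&a, sz), "cudaMalloc")) return 1;
    if (!cuda_ok(cudaMemset(a, 0, sz), "cudaMemset")) return 1;

    const int threads = 64;
    const int blocks = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(a, n);

    // cudaGetLastError reports launch failures; cudaDeviceSynchronize reports
    // errors raised while the kernel was running.
    if (!cuda_ok(cudaGetLastError(), "kernel launch")) return 1;
    if (!cuda_ok(cudaDeviceSynchronize(), "cudaDeviceSynchronize")) return 1;
    if (!cuda_ok(cudaFree(a), "cudaFree")) return 1;

    printf("all CUDA calls succeeded\n");
    return 0;
}
```

If a program written this way runs cleanly on its own but fails under ncu with MPS enabled, that points toward the profiling setup rather than the application code.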
When I profile without MPS, it executes successfully, and the metric values look normal:
However, when I profile with MPS, there is an error:
Once I use MPS for profiling, I can’t get the right results. What’s the problem? How can I solve it?
Thanks.
Can you also tell us the driver version and the GPU you used?
The driver version is 570.124.06 and the GPU used is Tesla V100.
Hi. I seem to have found one possible reason: in version 2025.2, the CLI description of --devices says that devices cannot be specified when using MPS. GPU 0 on my server was recently unavailable because it was running another program, so I was profiling on the 7th GPU, selected with the export CUDA_VISIBLE_DEVICES=7 command, and that is when I get the error. So I would like to ask: in version 2025.2, does profiling with MPS default to the GPU with index 0?
Hi, @2023055089
Your analysis is correct. --devices is not supported when profiling with MPS.
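When CUDA_VISIBLE_DEVICES is set, the CUDA runtime renumbers the visible GPUs starting from 0 inside the process, so it can help to check which physical GPU each index refers to. The sketch below lists the devices the current process can see together with their PCI addresses, which can be compared against the output of nvidia-smi. The helper name `visible_device_count` is hypothetical. The listing only reflects the CUDA runtime's view of the process and, as an assumption to keep in mind, says nothing about how ncu or the MPS control daemon enumerates devices.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: number of devices visible to this process,
// or 0 if the query itself fails (for example, when no driver is loaded).
static int visible_device_count() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) {
        return 0;
    }
    return count;
}

int main() {
    const int count = visible_device_count();
    printf("visible devices: %d\n", count);

    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, i) != cudaSuccess) {
            continue;  // skip devices whose properties cannot be read
        }
        // The PCI domain/bus/device triple identifies the physical GPU.
        printf("index %d: %s (PCI %04x:%02x:%02x)\n",
               i, prop.name, prop.pciDomainID, prop.pciBusID, prop.pciDeviceID);
    }
    return 0;
}
```

Running it once with and once without CUDA_VISIBLE_DEVICES set shows how the index seen by the application changes.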
Yes, I’ve solved the problem: as long as the profiling is done on the GPU with index 0, there is no error. But I found another problem. In version 2025.2, when I use --mps control, the profiling reports L2 utilization and DRAM utilization as nan. How can I solve this problem?
Please use the 575.57.08 driver: Driver Details | NVIDIA
As you have posted a separate topic, “Using --mps control in version2025.2 gets nan”, I will close this topic. Thanks!