Can still use OMP_NUM_THREADS without OpenMP compilation

Hello,

Something very strange happened on my end.

  1. I have a program with multiple source files that uses OpenACC. On our university's supercomputer I requested 1 CPU core, 1 GB of memory, and 1 Tesla V100 GPU, then compiled and ran the code, using OpenACC to put some parts on the GPU (there are no OpenMP directives or libraries in that code). However, it performed about the same as my OpenMP single-thread CPU code (which has no OpenACC directives or libraries in it). That’s weird; it seems as if the OpenMP single-thread code automatically offloads some parts to the GPU to accelerate the program.

  2. After I compiled the OpenACC GPU code, I tried setting OMP_NUM_THREADS=2, 4, 8, 16. It worked! The code was accelerated by these settings even though I did not implement anything related to OpenMP.

  3. After finding this, I switched to a CPU-only request of 1 core and 1 GB of memory, so there was no GPU, and the OpenMP code ran much slower than before because there was no GPU to use.

I have no idea what’s going on here; I hope somebody can help.

Thanks!

Hi dacongi,

What flags did you use to compile?

By default, the compiler will create a unified binary which contains parallelized code targeting either the GPU or the host. If at runtime the GPU isn’t found, it will revert to running in parallel on the host. So one possibility is that your V100 isn’t getting detected and the code is falling back to the host.

If you run the PGI utility “pgaccelinfo”, does it detect your V100?

You can also try compiling with “-ta=tesla” so only GPU code is created. If it can’t find a GPU, it should fail at runtime.

Another possibility is that you used the flag “-ta=multicore” so the OpenACC code targeted the CPU only.

Also, can you post the output from the compile with compiler feedback enabled (-Minfo=accel)?

Finally, set the environment variable “PGI_ACC_NOTIFY=1”. This will have the runtime give a message each time a kernel is launched on the GPU so we can see if it is indeed running on the GPU or not.
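To make those suggestions concrete, here’s the kind of compile-and-run sequence I mean. This is just a minimal sketch of mine; the file name, the loop, and the commands in the comments are assumptions, not taken from your code:

! acc_check.f90 -- minimal OpenACC check (illustrative sketch only; the file
! name, loop, and commands below are assumptions, not the poster's code).
!
!   pgaccelinfo                                   # should list the Tesla V100
!   pgfortran -fast -ta=tesla -Minfo=accel -o acc_check acc_check.f90
!   PGI_ACC_NOTIFY=1 ./acc_check                  # one line per kernel launch
!
! With -ta=tesla (GPU-only target) the run should fail outright if no GPU is
! visible, rather than silently falling back to the host.
program acc_check
  implicit none
  integer, parameter :: n = 1000000
  real, allocatable :: x(:)
  integer :: i

  allocate(x(n))

  !$acc parallel loop copyout(x)
  do i = 1, n
     x(i) = sqrt(real(i))
  end do
  !$acc end parallel loop

  print *, 'x(n) =', x(n)
end program acc_check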

-Mat

Hi Mat,

Thank you for your response. My OpenACC code was built with the following makefile:

FC = pgfortran 
CC = 

FFLAGS =  -fast -ta=nvidia -Minfo=accel -Mfree  
CFLAGS =  -fast -ta=nvidia -Minfo=accel -Mfree 

LFLAGS =  -fast -ta=nvidia -Minfo=accel -Mfree  
OBJS = main.o inputmodule.o readinputsize.o alloc.o yswitch.o red_ybus.o y_sparse.o mac_em.o i_simu_innerloop.o

all: allo ds
allo: inputmodule.o
	$(FC) $(FFLAGS) -c inputmodule.f

ds: $(OBJS)
	$(FC) $(LFLAGS) -o ds $(OBJS) -lm -lblas -llapack 

clean: 
	rm -rf *.o *.mod *.out *.chk

After compiling the code, if I set OMP_NUM_THREADS=2, 4, 8, …, I get a speedup. That’s weird, right, since no OpenMP is enabled by this makefile?

As you said, if I request a node with no GPU, the OpenACC program cannot run. I also tried my OpenMP code there with a single thread, and it took about 700 seconds.

But with a GPU, the OpenACC program and the OpenMP program both give very similar performance, even though neither one has the other’s directives or flags enabled. For example, the OpenACC program can use OMP_NUM_THREADS to speed up; with the default of 1 thread it runs in about 144 s. The OpenMP program in single-thread mode also takes about 140 s, which is much better than the 700 s I get without a GPU.

The compiler feedback looked right, showing the gang/vector (block/thread) scheduling information. When I export PGI_ACC_NOTIFY=1, the CUDA kernel launch info is shown while the code runs. Also, pgaccelinfo detects the V100.

Some outputs from the runtime are here:

launch CUDA kernel  file=/home/cong2/DS/v3_acc/y_sparse.f function=y_sparse line=105 device=0 threadid=1 num_gangs=25 num_workers=1 vector_length=128 grid=25 block=128
launch CUDA kernel  file=/home/cong2/DS/v3_acc/y_sparse.f function=y_sparse line=119 device=0 threadid=1 num_gangs=25 num_workers=1 vector_length=128 grid=25 block=128
launch CUDA kernel  file=/home/cong2/DS/v3_acc/y_sparse.f function=y_sparse line=124 device=0 threadid=1 num_gangs=25 num_workers=1 vector_length=128 grid=25 block=128
launch CUDA kernel  file=/home/cong2/DS/v3_acc/y_sparse.f function=y_sparse line=135 device=0 threadid=1 num_gangs=29 num_workers=1 vector_length=128 grid=29 block=128
launch CUDA kernel  file=/home/cong2/DS/v3_acc/y_sparse.f function=y_sparse line=160 device=0 threadid=1 num_gangs=29 num_workers=1 vector_length=128 grid=29 block=128
launch CUDA kernel  file=/home/cong2/DS/v3_acc/y_sparse.f function=y_sparse line=172 device=0 threadid=1 num_gangs=29 num_workers=1 vector_length=128 grid=29 block=128
launch CUDA kernel  file=/home/cong2/DS/v3_acc/y_sparse.f function=y_sparse line=201 device=0 threadid=1 num_gangs=3120 num_workers=1 vector_length=128 grid=3120 block=128
launch CUDA kernel  file=/home/cong2/DS/v3_acc/y_sparse.f functi

“After compiling the code, if I set OMP_NUM_THREADS=2, 4, 8, …, I get a speedup. That’s weird, right, since no OpenMP is enabled by this makefile?”

This is weird and I’m not sure what’s going on. Since you’re using “-ta=nvidia” (the older name for “-ta=tesla”), no parallel OpenACC CPU code should be generated, and hence OMP_NUM_THREADS should have no effect. The only thing I can think of is that the BLAS and/or LAPACK libraries you’re linking could be OpenMP-enabled.
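One quick way to test that hypothesis is to time a bare BLAS call from a program that contains no OpenMP or OpenACC at all and see whether OMP_NUM_THREADS changes the runtime. This is just a sketch of mine; the file name, matrix size, and build line are assumptions, and whether the time actually changes depends on which BLAS build gets linked:

! check_blas_threads.f90 -- hedged sketch, not from the original code: time a
! plain DGEMM call and vary OMP_NUM_THREADS.  If the linked BLAS is an
! OpenMP-threaded build, the wall-clock time should change with the setting
! even though this program has no OpenMP directives of its own.
!
!   pgfortran -fast -o check_blas_threads check_blas_threads.f90 -lblas
!   OMP_NUM_THREADS=1  ./check_blas_threads
!   OMP_NUM_THREADS=16 ./check_blas_threads
program check_blas_threads
  implicit none
  integer, parameter :: n = 2000
  double precision, allocatable :: a(:,:), b(:,:), c(:,:)
  integer :: t0, t1, rate

  allocate(a(n,n), b(n,n), c(n,n))
  call random_number(a)
  call random_number(b)
  c = 0.0d0

  call system_clock(t0, rate)
  ! Standard BLAS level-3 call: C = 1.0*A*B + 0.0*C
  call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
  call system_clock(t1)

  print '(a,f8.3,a)', 'dgemm time: ', real(t1 - t0) / real(rate), ' s'
end program check_blas_threads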

Another question is why the GPU code is slower than expected. Here, I do see that you might not have enough work to fully utilize the GPU. PGI_ACC_NOTIFY shows that the gang size (grid) is only 25 or 29 for most of the kernels. Only the kernel at line 201 is significant at 3120 gangs. Are you able to increase the problem size?
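For reference on why those numbers are small: with the default schedule the compiler typically maps one loop iteration to each vector lane, so a kernel reporting num_gangs=25 at vector_length=128 corresponds to a loop of only about 25*128 = 3200 iterations. A toy illustration (mine, not your code; the file name and loop are assumptions):

! gang_count.f90 -- illustration only (not the poster's code): a 3200-iteration
! loop launches about 3200/128 = 25 gangs at vector_length=128, comparable to
! the small grid sizes seen in the PGI_ACC_NOTIFY output above.
!
!   pgfortran -fast -ta=tesla -Minfo=accel -o gang_count gang_count.f90
!   PGI_ACC_NOTIFY=1 ./gang_count
program gang_count
  implicit none
  integer, parameter :: n = 3200      ! 3200 / 128 = 25 gangs
  real :: x(n)
  integer :: i

  !$acc parallel loop copyout(x)
  do i = 1, n
     x(i) = 2.0 * real(i)
  end do
  !$acc end parallel loop

  print *, 'x(n) =', x(n)
end program gang_count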

Have you profiled your code? If not, you may want to use either pgprof or set the environment variable “PGI_ACC_TIME=1”.

-Mat

Yes, exactly. My code has not been optimized yet; when I found this OMP issue, I stopped making progress. Now I think you are correct: the libraries may be OpenMP-enabled, so even though no OpenMP is enabled in my build, just setting the OMP_NUM_THREADS environment variable makes some function calls into these libraries accelerate automatically.

With this behavior, by using OpenACC together with the OMP_NUM_THREADS variable, the most time-consuming function calls for matrix manipulation are sped up by the threaded library, and I can move other parts of the code to the GPU to see the benefits and do some optimization.

In my case, compared to CPU single-thread mode (700 s), enabling OpenACC and then setting OMP_NUM_THREADS=16 brings the computational time down to 12.9 s, approximately a 55X speedup. Not sure if it’s a new finding, but it does save me tons of time with the same, desired result output.

However, I assume some of these function calls can be optimized on the GPU as well, which means an OpenACC implementation of this time-consuming matrix manipulation may perform even better than the OpenMP threads and reach a 60X or 70X speedup. I’ll do more research on that in the future.