How to specify the number of GPU cores used by my code instead of all of them

Hi,
When I run my CUDA Fortran program, can I specify the number of GPU cores it uses, instead of all of them?

Thank you very much.
Yu Zhang

Hi Yu Zhang,

You can’t restrict the number of cores, but you can set the number of blocks and the number of threads per block, and set their product to the number of cores you want to use. However, you won’t be able to control where the blocks are scheduled, so they may or may not be packed onto the same SMs. In other words, some SMs may have idle cores if the blocks are spread out across the SMs.

In general, for performance it’s best not to restrict the total number of blocks and instead base it on the problem size, so this kind of restriction is questionable. However, if you’re trying to partition the GPU so that multiple binaries can run concurrently, you might consider virtualizing the GPU. I have no experience with vGPUs, so I can’t offer guidance and can only offer this link as a starting point: What Is a Virtual GPU? | NVIDIA Blog
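Basing the grid on the problem size usually means ceiling division: enough blocks of a fixed size to cover every element. The sketch below is plain Fortran (no GPU needed) and assumes one thread per element; the module and function names are hypothetical, not part of any CUDA Fortran API.

```fortran
module launch_config
  implicit none
contains
  ! Number of blocks needed so that blocks_for * threads >= n.
  ! Ceiling division; assumes n >= 0 and threads > 0.
  pure integer function blocks_for(n, threads)
    integer, intent(in) :: n, threads
    blocks_for = (n + threads - 1) / threads
  end function blocks_for
end module launch_config
```

In a CUDA Fortran host program this would be used as, e.g., `call kernel<<<blocks_for(n, 256), 256>>>(a_d, n)`, where `kernel` is a hypothetical kernel that guards its work with `if (i <= n)`, since the last block may be only partially filled.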

That said, if you can give more detail about why you want to restrict the number of cores, I may be able to help find a solution.

-Mat

Hi, Mat

Thank you for your reply!

Suppose I specify the number of blocks and the block size in the kernel invocation as follows:
call kernel<<<10000, 256>>>(a_d,b)
My GPU has 46 SMs with 64 SPs per SM, 2944 cores in total. How does the kernel launch and organize these 2,560,000 threads to use the GPU cores? Can I use only 2000 GPU cores for these threads?

Another question: if I call kernel<<<1, 256>>>(a_d,b), how does the kernel organize these 256 threads across the SMs?

Thank you very much again.
Yu Zhang

How does the kernel launch and organize these 2,560,000 threads to use the GPU cores? Can I use only 2000 GPU cores for these threads?

The number of cores isn’t very relevant. To hide memory latency, the SMs switch between warps (groups of 32 threads): while one warp waits on a memory fetch, another uses the cores to execute. In other words, the cores are oversubscribed.
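To put numbers on the oversubscription, the sketch below counts warps for the two launches in the question and bounds how many threads can be resident at once. It assumes 46 SMs (from the question) and 2048 resident threads per SM (the figure used later in this thread; the real limit depends on the GPU’s compute capability), and it ignores register and shared-memory limits. A single block always runs on one SM, so `<<<1, 256>>>` gives 8 warps on one SM while the other SMs sit idle. Names are illustrative only.

```fortran
module warp_math
  implicit none
  integer, parameter :: warp_size = 32
  integer, parameter :: i8 = selected_int_kind(18)
contains
  ! Warps needed for one block; a partial warp still occupies a full warp slot.
  pure integer function warps_per_block(threads)
    integer, intent(in) :: threads
    warps_per_block = (threads + warp_size - 1) / warp_size
  end function warps_per_block

  ! Total warps in a launch of `blocks` blocks of `threads` threads each.
  pure integer(i8) function total_warps(blocks, threads)
    integer, intent(in) :: blocks, threads
    total_warps = int(blocks, i8) * warps_per_block(threads)
  end function total_warps

  ! Upper bound on threads resident on the whole GPU at the same time.
  pure integer function resident_thread_limit(n_sm, max_threads_per_sm)
    integer, intent(in) :: n_sm, max_threads_per_sm
    resident_thread_limit = n_sm * max_threads_per_sm
  end function resident_thread_limit
end module warp_math
```

With these assumptions, `<<<10000, 256>>>` is 80,000 warps but at most 94,208 threads (2,944 warps) can be resident at once, so blocks are scheduled onto SMs as earlier blocks finish.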

A better question to ask is: what is the kernel’s occupancy? Occupancy is the ratio of active warps on an SM to the maximum number of warps the SM supports for a given device. While it’s deprecated and its functionality has moved into Nsight Compute, I find the CUDA Occupancy Calculator spreadsheet handy: CUDA Occupancy Calculator :: CUDA Toolkit Documentation. The main factors that determine occupancy are having enough threads to fill the GPU, the number of registers used per thread, and the amount of shared memory used per block.

Each SM can run up to 2048 concurrent threads (the exact limit depends on the GPU’s compute capability) and has 64K registers. Hence, to achieve 100% occupancy, a thread can use at most 32 registers (65536 / 2048 = 32). More registers per thread means fewer threads per SM and lower occupancy. Register allocation is done by the device assembler (ptxas), based mostly on the number of local variables used by the kernel, though other operations such as address computation can use registers as well. You can see how many registers were allocated per thread by adding the compiler flag “-gpu=ptxinfo”.
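The register arithmetic above can be sketched as a small function. This is a simplification, assuming registers are the only limiter and rounding only to whole warps; the real hardware allocates registers in larger chunks and also has block and shared-memory limits, so the Occupancy Calculator or Nsight Compute can report lower numbers. The 65536 registers and 2048 threads per SM are the figures from the reply and vary by GPU.

```fortran
module occupancy_sketch
  implicit none
  integer, parameter :: warp_size = 32
contains
  ! Theoretical occupancy (percent) when only register usage limits residency.
  ! Assumes all arguments are positive.
  pure real function register_limited_occupancy(regs_per_thread, regs_per_sm, max_threads_per_sm)
    integer, intent(in) :: regs_per_thread, regs_per_sm, max_threads_per_sm
    integer :: threads
    ! Threads whose registers fit in the SM's register file, capped by the thread limit.
    threads = min(max_threads_per_sm, regs_per_sm / regs_per_thread)
    ! Warps are scheduled whole, so round down to a multiple of the warp size.
    threads = (threads / warp_size) * warp_size
    register_limited_occupancy = 100.0 * real(threads) / real(max_threads_per_sm)
  end function register_limited_occupancy
end module occupancy_sketch
```

For example, `register_limited_occupancy(64, 65536, 2048)` gives 50%, matching the point that more registers per thread means fewer resident threads.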

Note that having lower occupancy does not always mean poor performance. Many kernels see 50% occupancy and still run well.

Also, this is theoretical occupancy. Achieved occupancy may be slightly lower depending on warp stalls, though you will need the Nsight Compute profiler to see it.

OK, many thanks for your kind reply!

Hi, Mat

How can I link my own static library (for example, *.lib) and dynamic library (for example, *.dll) into my CUDA Fortran program?
Besides putting the *.lib and *.dll files in the same directory as the source code, what do I need to add to the command line when I compile my code with the PGI compiler on Windows?
I tried this: pgfortran -Mcuda *.f90 C:\cuda\lib* .lib, but it failed with error LNK2019: Unresolved external symbol

Thank you very much!
Yu Zhang

Hi Yu Zhang,

What shell are you using? Cygwin bash or DOS cmd?

In bash, the backslash (“\”) is an escape character, so you need to add another backslash so that the backslash itself is escaped and passed through. Though if this were the case, I would expect the linker to report that it can’t find the library. Are you also seeing a message that the linker can’t find the library?

If not, then you may be missing a library. Which symbols are missing? Something from the compiler runtime or from one of your own libraries?

Note that typically one would use the “-L<path_to_lib_dir>” and “-l” flags, which apply to libraries whose file names have a “lib” prefix. Directly adding the full library path is usually only done for DLL import libraries.

-Mat

Hi, Mat

Thank you for your reply!

I use the DOS cmd shell to compile my CUDA Fortran code with PGI compiler version 19.10 on Windows.

In my code, I call a subroutine named userdefined_boundary_1(a,b,c), and this subroutine is defined in the library named userdefined_boundary.lib.

In the thread “How to use (or call) cublas library”, you said: ‘On Windows, libraries can sometimes be called “cublas.lib” (i.e. no “lib”). If this is the case, then instead of using “-l” and “-L”, just add the full path and library name to the link line.’

So I tried to compile my code with ‘pgfortran -Mcuda hello.f90 C:\cuda\userdefined_boundary.lib’.

But it seems the library isn’t being linked into my code. The error message is error LNK2019: Unresolved external symbol userdefined_boundary_1.

Yu Zhang

‘On Windows, libraries can sometimes be called “cublas.lib” (i.e. no “lib”). If this is the case, then instead of using “-l” and “-L”, just add the full path and library name to the link line.’

Still true. Libraries without the “lib” prefix, typically DLL import libraries, need to be directly added to the link line.

Maybe, but more likely something else is going on. You can add the flag “-v” (verbose) to the link step to see the full link line and confirm that the library is being added.

Though, let’s explore whether it’s a symbol mismatch, i.e. whether the symbol name expected by main matches what’s actually in the library.

Is “userdefined_boundary.lib” a DLL import library? If so, did you export the symbol?

What language did you use to create the library? If C++, did you add “extern "C" { … }” around the routine prototypes so the symbol names aren’t mangled? Note that you can use the “pgnm” utility on the library if you want to see the symbol names.

Since there’s no underscore at the end of “userdefined_boundary_1”, are you using ISO_C_BINDING in the Fortran code to create a C interface? Does this symbol match what’s in the library?
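For reference, this is roughly what an ISO_C_BINDING interface looks like. `bind(C, name=...)` fixes the linker symbol to exactly `userdefined_boundary_1`, with no compiler-specific underscore. The routine name is from the thread, but its argument types (three `real(c_double)` scalars) and its body are hypothetical; the external subroutine below only stands in for the code that would live in `userdefined_boundary.lib`, so the sketch is self-contained.

```fortran
! Hypothetical stand-in for the routine in userdefined_boundary.lib.
! Argument types and body are assumptions for illustration only.
subroutine userdefined_boundary_1(a, b, c) bind(C, name="userdefined_boundary_1")
  use iso_c_binding, only: c_double
  implicit none
  real(c_double), intent(in)  :: a, b
  real(c_double), intent(out) :: c
  c = a + b
end subroutine userdefined_boundary_1

! What the calling code would declare: an explicit interface whose
! binding label must match the symbol name actually in the library.
module boundary_iface
  use iso_c_binding, only: c_double
  implicit none
  interface
    subroutine userdefined_boundary_1(a, b, c) bind(C, name="userdefined_boundary_1")
      import :: c_double
      real(c_double), intent(in)  :: a, b
      real(c_double), intent(out) :: c
    end subroutine userdefined_boundary_1
  end interface
end module boundary_iface
```

Because the scalars are not marked `value`, they are passed by reference, so the matching C prototype would be `void userdefined_boundary_1(double *a, double *b, double *c);`.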

-Mat

Hi, Mat

Following your suggestions, I just solved this problem!

Many thanks for your kind reply!

Yu Zhang

Excellent! What was the problem and which suggestion solved it?

Hi, Mat

If the static library is compiled with another compiler, e.g. Intel Visual Fortran, the link reports that error.

After recompiling the static library with the PGI compiler, it links into my code successfully!

Yu Zhang

OK. Unless you’re using F77-style APIs, mixing Fortran implementations won’t work. The Fortran standard leaves things like module files and allocatable array descriptors implementation-defined, so they are typically incompatible between Fortran compilers. F77 is OK since it doesn’t use these features, but even then interoperability can be tricky because symbol name mangling can differ.
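If rebuilding everything with one compiler isn’t possible, a sketch of a routine that sidesteps the implementation-defined parts is shown below: an explicit-shape array (passed as a plain address, no descriptor), the size passed explicitly, and `bind(C)` to fix the symbol name. The routine `scale_array` is hypothetical. Callers built with another compiler would write their own interface block instead of using this `.mod` file, since module files are compiler-specific. This only reduces the problem; both compilers’ runtime libraries still have to coexist at link time.

```fortran
module portable_api
  use iso_c_binding, only: c_int, c_double
  implicit none
contains
  ! Hypothetical routine: scales x(1:n) in place by factor.
  ! Explicit-shape x(n) avoids compiler-specific array descriptors;
  ! bind(C) gives the linker symbol exactly "scale_array".
  subroutine scale_array(n, x, factor) bind(C, name="scale_array")
    integer(c_int), value         :: n
    real(c_double), intent(inout) :: x(n)
    real(c_double), value         :: factor
    x = factor * x
  end subroutine scale_array
end module portable_api
```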

Best to use the same Fortran compiler for all Fortran code.

Yes, that’s true.

Many thanks to you, Mat