Win32 API issue and unified binary question

Hello Mat,

I have an issue with the code below and a few questions about the unified binary technology.

  • The code below compiles and runs when no optimisation is enabled. However, when -O or higher is enabled, the program hangs on startup. I believe it is related to the ‘StringCchPrintf’ function (when it is commented out, there are no problems). _stprintf has the same issue. Is there a way to solve it?

  • How many simultaneous targets are supported with the -tp compiler option? I have the impression that when a lot of targets are specified, no target-specific code is generated for some of them.

  • How does the ‘pragma routine tp’ directive work? In the example below, the ‘smooth’ function is marked with “#pragma routine tp p7 k8 core2 sandybridge” in an effort to generate target-specific code for four different targets. However, the compiler output only reports generated code for penryn (which wasn’t even specified). Can you explain what I’m doing wrong? I’m using PGI 13.4 on Windows.

Thanks,

Ruben

pgcpp main.cpp kernel32.lib user32.lib gdi32.lib -Minfo -fastsse



        5362, PGI Unified Binary version for -tp=penryn-64
WinMain:
     51, Loop not vectorized/parallelized: contains call
smooth__FPfT1fN23iN26:
    109, Loop not vectorized: data dependency
         Loop unrolled 2 times



#include <windows.h>
#include <Strsafe.h>
#include <math.h>

#ifndef WINVER
#define WINVER 0x0502
#endif

// Declare the main WndProc prototype
LRESULT CALLBACK Main_WndProc(HWND, UINT, WPARAM, LPARAM);

//
// Define the WinMain
//
int WINAPI WinMain(HINSTANCE hInstance, HINSTANCE hPrevInstance, PSTR szCmdLine, int iCmdShow)
{
	static TCHAR szWndClassName[] = TEXT("WinApp1");	// Window classname to be used
	HWND hwnd;											// handle to main window
	MSG msg;											// message from message queue
	WNDCLASS wndclass; 									// our main window class
	
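	// Define window class. Note: wndclass is not zero-initialised and its hInstance
	// field is never assigned below, so RegisterClass sees an indeterminate value there.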
	wndclass.style			= CS_HREDRAW | CS_VREDRAW;		// redraw (WM_PAINT)on H or V resize
	wndclass.lpfnWndProc 	= Main_WndProc;
	wndclass.cbClsExtra		= 0;
	wndclass.cbWndExtra    	= 0 ;
	wndclass.hIcon         	= LoadIcon (NULL, IDI_APPLICATION) ;
	wndclass.hCursor       	= LoadCursor (NULL, IDC_ARROW) ;
	wndclass.hbrBackground 	= (HBRUSH) GetStockObject (WHITE_BRUSH) ;
	wndclass.lpszMenuName  	= NULL ;
	wndclass.lpszClassName 	= szWndClassName;
	// register windowclass
	if (!RegisterClass (&wndclass))
	{
		MessageBox (NULL, TEXT ("Program requires Windows NT!"), szWndClassName, MB_ICONERROR) ;
		return 0 ;
	}
	
	// Create the window
	hwnd = CreateWindow(szWndClassName, TEXT("Test WinApp1"), 				// LPCTSTR lpClassName, LPCTSTR lpWindowName,
						WS_OVERLAPPEDWINDOW | WS_VSCROLL | WS_HSCROLL, 		// DWORD dwStyle,
						CW_USEDEFAULT, CW_USEDEFAULT,						// int x, int y,
						CW_USEDEFAULT, CW_USEDEFAULT,						// int width, int height,
						NULL, NULL, hInstance, NULL);						// HWND hWndparent, HMENU hMenu, HINSTANCE hInstance, LPVOID lpParam
						
	// Show and repaint window immediately (WM_PAINT)
	ShowWindow(hwnd, iCmdShow);
	UpdateWindow(hwnd);
	
	// Start message pump
	while (GetMessage(&msg, NULL, 0, 0) > 0)		// LPMSG lpMsg, HWND hWnd, UINT wMsgFilterMin, UINT wMsgFilterMax
	{
		TranslateMessage(&msg);
		DispatchMessage(&msg);
	}

	return (int) msg.wParam;	// exit code passed to PostQuitMessage
}

//
// Main  Window Procedure
//
LRESULT CALLBACK Main_WndProc (HWND hwnd, UINT msg, WPARAM wParam, LPARAM lParam)
{
	HDC hdc;	
	PAINTSTRUCT ps;	
	float sTemp = 3.1415;
	TCHAR szBuffer[20]=TEXT("Empty");	
	
	switch(msg)
	{	
	case WM_CREATE:
		// Get device context
		hdc = GetDC (hwnd);		
		
		ReleaseDC (hwnd, hdc);
		return 0;
		
	case WM_PAINT:		
		// Get device context
		hdc = BeginPaint (hwnd, &ps) ;
		
		// Problematic function
		StringCchPrintf(szBuffer, 20, TEXT("%f"), sTemp);
		
		// Paint text
		TextOut(hdc, 0, 0, szBuffer, lstrlen(szBuffer));

		EndPaint (hwnd, &ps) ;
		return 0 ;		  

	case WM_DESTROY:
		PostQuitMessage (0) ;
		return 0 ;

	// Default handler
	default: return DefWindowProc (hwnd, msg, wParam, lParam);
	}
}

#pragma routine tp p7 k8 core2 sandybridge
void smooth( float* a, float* b, float w0, float w1, float w2, int n, int m, int niters )
{
    int i, j, iter;
    float* tmp;
    for( iter = 1; iter <= niters; ++iter ){
	#pragma acc kernels loop copyin(b[0:n*m]) copy(a[0:n*m]) independent
	for( i = 1; i < n-1; ++i )
	    for( j = 1; j < m-1; ++j )
		a[i*m+j] = w0 * b[i*m+j] + 
		    w1*(b[(i-1)*m+j] + b[(i+1)*m+j] + b[i*m+j-1] + b[i*m+j+1]) +
		    w2*(b[(i-1)*m+j-1] + b[(i-1)*m+j+1] + b[(i+1)*m+j-1] + b[(i+1)*m+j+1]);
	tmp = a;  a = b;  b = tmp;
    }
}

Hi vam,

I had forgotten that we even had a “tp” pragma, and from what I can tell our engineers had too. It doesn’t appear to have been updated in many years, so it only contains older targets. I’ll submit a bug report, but since you’re the first person to use this feature, my guess is we will just remove it.

Instead, use the “-tp” command-line switch with multiple 64-bit targets to create a Unified Binary (32-bit targets are not supported). You can list as many targets as you wish.

$ pgcpp main.cpp kernel32.lib user32.lib gdi32.lib -Minfo -fastsse -tp=sandybridge-64,bulldozer-64,core2-64
StringCchPrintfA__FPcULPCce:
   5365, [local to main_cpp]::StringValidateDestA(const char *, unsigned long long, unsigned long long) inlined, size=4 (inline) file main.cpp (10220)
   5373, [local to main_cpp]::StringVPrintfWorkerA(char *, unsigned long long, unsigned long long *, const char *, char *) inlined, size=15 (inline) file main.cpp (10572)
WinMain:
     16, PGI Unified Binary version for -tp=sandybridge-64
     51, Loop not vectorized/parallelized: contains call
WinMain:
     16, PGI Unified Binary version for -tp=bulldozer-64
     51, Loop not vectorized/parallelized: contains call
WinMain:
     16, PGI Unified Binary version for -tp=core2-64
     51, Loop not vectorized/parallelized: contains call
Main_WndProc__FP6HWND__UiULL:
     64, PGI Unified Binary version for -tp=sandybridge-64
Main_WndProc__FP6HWND__UiULL:
     64, PGI Unified Binary version for -tp=bulldozer-64
Main_WndProc__FP6HWND__UiULL:
     64, PGI Unified Binary version for -tp=core2-64
smooth__FPfT1fN23iN26:
    103, PGI Unified Binary version for -tp=sandybridge-64
    109, Loop not vectorized: data dependency
         Loop unrolled 2 times
smooth__FPfT1fN23iN26:
    103, PGI Unified Binary version for -tp=bulldozer-64
    109, Loop not vectorized: data dependency
         Loop unrolled 2 times
smooth__FPfT1fN23iN26:
    103, PGI Unified Binary version for -tp=core2-64
    109, Loop not vectorized: data dependency
         Loop unrolled 2 times
main.cpp:
  • Mat

Hi Mat,

thank you for your reply.

  • Would you know why the “StringCchPrintf” might cause hanging on application startup? It’s part of the WIN32 api (a safer replacement for _stprintf which seems to have the same problem). If you don’t enable optimization, the code runs and shows a simple window. If optimization is enabled, no window is shown and the process remains in task manager.

  • How does the unified binary dispatch code work? Do you check for a certain vendor string or processor architecture, or is it based on the supported feature set of the processor? If code is optimized for piledriver, sandybridge and generic x64, what path will, for instance, an Ivy Bridge or future processor take? Will it use AVX if available?

IMHO your unified binary technology is one of the strongest points of the compiler, since it’s able to specifically optimize for both Intel and AMD (whereas the Intel compiler optimizes for Intel only and uses a less-optimized path for non-Intel chips). AFAIK, they check for their own vendor string and then select a code path based on the supported instruction set. Processors with other vendor strings just run the slower path.
I’d like to understand how the PGI compiler selects its code path and whether there are generic recommendations/remarks to make applications run as well as possible on future architectures.

Thank you.
Best regards,

Ruben

Hi Ruben,

Would you know why the “StringCchPrintf” might cause hanging on application startup? It’s part of the WIN32 api (a safer replacement for _stprintf which seems to have the same problem). If you don’t enable optimization, the code runs and shows a simple window. If optimization is enabled, no window is shown and the process remains in task manager.

Not offhand, and your program works for me at high optimization levels.

Are you sure it’s hanging on “StringCchPrintf”? One thing that comes to mind is that early releases of Win7 didn’t support AVX instructions, and a program using them would hang on start-up. To test this, compile with optimization targeting Penryn, which has no AVX (i.e. -fast -tp=penryn-64). If that build runs, the solution is to install Win7 SP1.

How does the unified binary dispatch code work? Do you check for a certain vendor string or processor architecture, or is it based on the supported feature set of the processor? If code is optimized for piledriver, sandybridge and generic x64, what path will, for instance, an Ivy Bridge or future processor take? Will it use AVX if available?

At start-up, the runtime checks the feature set of the processor and then selects the appropriate code path. So yes, Ivy Bridge would use AVX. You can use the utility “pgcpuid” to see the feature list for your processor.
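As a rough illustration of feature-based dispatch (not PGI's actual runtime, whose internals are not described in this thread), the idea can be sketched in C++. Everything here is an assumption made for illustration: the names sum_generic, sum_avx, cpu_has_avx and select_sum are hypothetical, and the feature test uses the GCC/Clang builtin __builtin_cpu_supports rather than anything PGI provides. On other compilers or non-x86 targets the sketch simply falls back to the generic path.

```cpp
#include <cassert>

// Hypothetical kernel variants. In a real unified binary the compiler emits
// one version per -tp target; here both share the same plain C++ body.
static float sum_generic(const float* x, int n)
{
    assert(n >= 0);
    float s = 0.0f;
    for (int i = 0; i < n; ++i) s += x[i];
    return s;
}

// Stand-in for a version that would be compiled with AVX enabled.
static float sum_avx(const float* x, int n)
{
    assert(n >= 0);
    float s = 0.0f;
    for (int i = 0; i < n; ++i) s += x[i];
    return s;
}

typedef float (*SumFn)(const float*, int);

// Ask the CPU which features it supports instead of checking a vendor string.
// Assumption: GCC or Clang on x86; anywhere else we report "no AVX".
static bool cpu_has_avx()
{
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("avx") != 0;
#else
    return false;
#endif
}

// Pick the code path once; callers then go through the returned pointer.
static SumFn select_sum()
{
    return cpu_has_avx() ? sum_avx : sum_generic;
}
```

A caller would run select_sum() once at start-up and keep the returned pointer. Because the choice depends only on the reported feature set, a newer processor that still reports AVX takes the AVX path even though it was never named as a target.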

IMHO your unified binary technology is one of the strongest points of the compiler, since it’s able to specifically optimize for both Intel and AMD

Thank you. It’s one of the advantages of being independent. We don’t play favorites and are more interested in getting the fastest performance across all targets.

I’d like to understand how the PGI compiler selects its code path and whether there are generic recommendations/remarks to make applications run as well as possible on future architectures.

When we first implemented Unified Binary, there was a wide performance gap between running a binary on the architecture it was built for and running it on another. For example, running an Intel-targeted binary on AMD hardware would slow the code by 20% versus a natively targeted binary. As the x86-64 architectures have matured, this has become less of a problem. Now the main issue is building portable binaries that can take advantage of new instructions (like AVX) on systems that support them, but still run on older processors.

  • Mat

Hi Mat,

thanks for your answers.

Not off hand and your program seems to work for me at high opt.

Just to verify, did you compile using pgcc or pgcpp? If I remember correctly, pgcc also ran OK for me, but the hang occurred with the C++ compiler.

Thank you.
Best regards,

Ruben

I was using pgcpp version 13.4 with -fast, in both 32-bit and 64-bit modes. In 64-bit, I also added multiple targets to create a unified binary. My system is Win7 with SP1 on a Core i5-2400, which supports AVX.

  • Mat