The Carmel cores emulate the ARM Architecture version 8.2, executing both 64-bit AArch64 code and 32-bit AArch32 code.
Further inputs:
The most important things are: -march flags that match the processor's capabilities (something like -march=armv8.2-a+fp16+simd+crypto+predres); an appropriate optimization level (-O2 in general, -O3 for very hot loops); -ffast-math where possible, that is, where the code tolerates relaxed floating-point semantics; and the latest JetPack release. We also see better performance with newer compiler revisions, and from recompiling some base libraries so that they make better use of the available processor features (especially v8.1 LSE atomics).
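One way to confirm that the -march flags took effect is to ask the compiler which ACLE feature macros it defined. The following is only a minimal sketch: the file name, struct, and function names (feature_check.c, print_feature_report, record_hit) are made up for illustration, and whether a given macro is defined also depends on the compiler version.

```c
/* feature_check.c (hypothetical name): report which Arm feature macros
 * the compiler defined for the flags in use. Example build on the target:
 *   gcc -O2 -march=armv8.2-a+fp16+simd+crypto -c feature_check.c
 */
#include <assert.h>
#include <stdatomic.h>
#include <stdio.h>

/* ACLE feature macros are defined only when the matching feature is
 * enabled for the current target, so map each one to 0 or 1. */
#ifdef __aarch64__
#define HAS_AARCH64 1
#else
#define HAS_AARCH64 0
#endif
#ifdef __ARM_NEON
#define HAS_SIMD 1
#else
#define HAS_SIMD 0
#endif
#ifdef __ARM_FEATURE_ATOMICS   /* v8.1 LSE; older compilers may not define it */
#define HAS_LSE 1
#else
#define HAS_LSE 0
#endif
#ifdef __ARM_FEATURE_CRYPTO    /* newer compilers may prefer per-algorithm macros */
#define HAS_CRYPTO 1
#else
#define HAS_CRYPTO 0
#endif
#ifdef __ARM_FEATURE_FP16_VECTOR_ARITHMETIC
#define HAS_FP16 1
#else
#define HAS_FP16 0
#endif

struct feature { const char *name; int enabled; };

static const struct feature features[] = {
    {"AArch64 target",        HAS_AARCH64},
    {"Advanced SIMD (+simd)", HAS_SIMD},
    {"LSE atomics (v8.1)",    HAS_LSE},
    {"Crypto (+crypto)",      HAS_CRYPTO},
    {"FP16 vectors (+fp16)",  HAS_FP16},
};

/* Number of features this sketch knows how to check. */
int feature_count(void) {
    return (int)(sizeof features / sizeof features[0]);
}

/* Number of checked features that the build flags enabled. */
int count_enabled_features(void) {
    int enabled = 0;
    for (int i = 0; i < feature_count(); i++)
        enabled += features[i].enabled;
    return enabled;
}

/* Print one line per feature; returns the number that are enabled. */
int print_feature_report(FILE *out) {
    assert(out != NULL);
    for (int i = 0; i < feature_count(); i++)
        fprintf(out, "%-24s %s\n", features[i].name,
                features[i].enabled ? "yes" : "no");
    return count_enabled_features();
}

/* A small atomic counter whose generated code is worth inspecting:
 * returns the counter value after the increment. */
static atomic_long hit_count;

long record_hit(void) {
    return atomic_fetch_add(&hit_count, 1) + 1;
}
```

Calling print_feature_report(stdout) from a small main on the target shows what the chosen flags enabled. Compiling the same file with -S and reading the assembly for record_hit is a quick check on the atomics: with LSE enabled, the compiler would typically be expected to emit a single atomic add instruction rather than a load-exclusive/store-exclusive loop or a call to an out-of-line helper.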
In terms of scheduling (-mtune), cortex-a75 and cortex-a76 should both be good starting points, as should -mtune=generic-armv8.2-a. The processor's dynamic code optimization can compensate for some shortcomings in instruction scheduling and selection, so we believe -mtune should be a secondary or tertiary concern in most cases.
If you have questions about the performance of code sequences that you can share, we would be happy to provide some further help or analysis.