For kernel builds you really need to cross-compile (the next L4T release may change this). I believe CUDA development in user space is probably easiest directly on the JTX1 (my opinion…others may be set up for convenient cross-development from a desktop host).
Something you may find useful: the current L4T user space is 32-bit while the kernel is 64-bit, which is why two compilers are used during a kernel build but only one during a user-space build. Add to this that the next L4T release (I don’t know when that is) will move everything to 64-bit (both kernel space and user space), and you’ll be back to using a single compiler for everything (I imagine there will be significant performance improvements too).
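For concreteness, here is roughly what that two-toolchain kernel build looks like. The toolchain prefixes, the defconfig name, and the CROSS32CC variable are from memory of the L4T build docs, so treat every name below as a placeholder and check the instructions for your release; the sketch only prints the commands rather than running them:

```shell
#!/bin/sh
# Sketch only: compose the two-toolchain TX1 kernel build commands.
# The 64-bit toolchain builds the kernel proper; CROSS32CC points at a
# 32-bit toolchain for the 32-bit-facing pieces. All names are assumptions.
ARCH=arm64
CROSS_COMPILE=aarch64-linux-gnu-      # 64-bit kernel compiler prefix
CROSS32CC=arm-linux-gnueabihf-gcc     # 32-bit compiler for compat bits
echo "make ARCH=$ARCH CROSS_COMPILE=$CROSS_COMPILE CROSS32CC=$CROSS32CC tegra21_defconfig"
echo "make ARCH=$ARCH CROSS_COMPILE=$CROSS_COMPILE CROSS32CC=$CROSS32CC -j4 zImage dtbs modules"
```

Once user space goes 64-bit in the next release, the CROSS32CC line should simply go away.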
Thank you for that information! It is helpful to know that current user space is 32-bit.
We had an Ubuntu host cross-compilation setup for the TK1 through Nsight CUDA and were able to reuse that setup for the TX1. Nsight has difficulty pushing the locally built executable to the remote platform, though, so we just scp it to the TX1 for now.
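The manual push step can be sketched like this — the board address and the build output path are placeholders for whatever your Nsight project and network produce, and the commands are only printed here rather than executed:

```shell
#!/bin/sh
# Sketch only: print the manual deploy commands used instead of
# Nsight's remote push. Hostname and binary path are placeholders.
BOARD=ubuntu@tegra-ubuntu.local     # placeholder board address
APP=./Debug/myapp                   # placeholder Nsight build output
echo "scp $APP $BOARD:~/"
echo "ssh -t $BOARD ~/$(basename "$APP")"
```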
Currently Ubuntu’s graphical features are up and running, but for accurate benchmarking of CUDA applications it would be better for us to go back to a headless configuration. Is it possible to do so in a revertible fashion?
If you can run your program remotely without forwarding an X11 DISPLAY to your desktop, headless should be fine. If the program errors out and complains when no DISPLAY variable is set (such as over ssh with no “-X” and no “-Y”), then native execution and display on the Jetson is probably required for accurate benchmarking.
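A quick way to test this without logging out of X is to run the binary with DISPLAY explicitly unset, mimicking an ssh session without -X/-Y. A small sketch (the echo stands in for your real binary, which is an assumption of your setup):

```shell
#!/bin/sh
# Sketch: run a command with DISPLAY unset, as a plain ssh session would.
# Substitute your actual CUDA binary for the echo below.
msg="$(env -u DISPLAY sh -c 'echo "DISPLAY is: ${DISPLAY:-<unset>}"')"
echo "$msg"
# env -u DISPLAY ./your_cuda_app   # placeholder name: if this runs
#                                  # cleanly, headless should be fine
```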
We mostly run non-graphical applications, hence headless is preferred. I was able to add a Screen entry to xorg.conf to get a terminal-based interface. When I run tegrastats, I can see the GPU staying at 0% until our application kicks in.
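For scripted benchmarking it can be handy to pull just the GPU load field out of tegrastats. A sketch — the sample line below mimics TX1 tegrastats output from memory (the GR3D field is the GPU load), so check the field names on your release; on the board you would pipe the live tool (e.g. `sudo ~/tegrastats`) instead of the canned string:

```shell
#!/bin/sh
# Sketch: extract the GPU load (GR3D field) from a tegrastats line.
# The sample is a stand-in for live output, whose format may differ.
sample='RAM 1132/3995MB (lfb 421x4MB) cpu [0%,0%,0%,0%]@518 EMC 4%@408 GR3D 0%@76'
gpu="$(echo "$sample" | grep -o 'GR3D [0-9]*%')"
echo "$gpu"
```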
I recently got my first TX1, and I’m finding that compiling locally on the TX1, or even cross-compiling on an eight-core Linux box, is prohibitively slow. Sometime this week I’ll be setting up a proper cross-compilation environment on AWS, so I can fire up a 32-core instance, build, transfer, then shut it down.
I’ve done this for the Raspberry Pi, and it has been a life saver; the whole kernel (+modules, of course) compiles in about 4 minutes.
If there is interest, I’ll share the AMI and/or instructions on how to set one up.