I am pretty impressed by scottgray’s Maxwell Assembler. Here is the link: https://code.google.com/p/maxas/. The only thing is I don’t have a Maxwell GPU; I only have GT200 and Fermi Nvidia GPUs. I tried running the Maxwell Assembler on Fermi while I was starting to understand its contents, and it obviously didn’t work. I was wondering if there is a remote Maxwell GPU available to use.
I did all my development for the assembler on a 750 Ti. If you can’t find a remote machine, they’re pretty reasonably priced, or you might find one used for even cheaper.
Or if you want to play with an assembler for Fermi there’s asfermi:
I know it sounds like a shameless plug to recommend the 750 Ti, but I second the recommendation. You can get one for around $150, and it’s a fantastic card if you’re serious about CUDA. It’s also pretty decent for gaming.
I just hate doing free advertising T_T
I thought there might be some kind of public cluster accessible to outside users. I checked out the asfermi assembler, and it’s useful for writing Fermi assembly code while avoiding ptxas compiler optimizations. I guess I will buy a 750 Ti card and dig into the assembler Scott developed. :)
It’s still in a somewhat raw state, and you will likely have to do a lot of exploration to get proficient with it. The big feature I haven’t added yet is fully automatic register allocation. It’s partially there, but to be complete it needs to know the live time of every register in the program, and to do that you need to understand the conditions under which the hardware decides to reorder memory requests. I think I now understand that well enough to finish the feature; I just need the time to build it.
In the meantime you can still specify the allowed register ranges for variables and manage the register allocation yourself. But for writing complex code, you really want that to be fully automatic. It should also produce the most compact register allocations if you’re trying to squeeze into a certain occupancy.
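To give a rough idea of what the allocator needs to do, here’s a toy liveness-and-reuse sketch in python. This is not maxas code, just the concept: a register is live from its definition to its last use, and a physical register can be handed out again once the virtual register it held goes dead.

```python
# Toy register allocation sketch (NOT maxas code): compute live ranges,
# then greedily reuse physical registers whose live range has ended.

def live_ranges(instructions):
    """Map each virtual register to its (first_def, last_use) indices."""
    ranges = {}
    for i, (dest, srcs) in enumerate(instructions):
        for r in [dest] + srcs:
            start, _ = ranges.get(r, (i, i))
            ranges[r] = (start, i)
    return ranges

def allocate(instructions):
    """Greedy linear-scan style allocation over the live ranges."""
    ranges = live_ranges(instructions)
    order = sorted(ranges, key=lambda r: ranges[r][0])
    free, next_reg, assignment, active = [], 0, {}, []
    for v in order:
        start, _ = ranges[v]
        # Expire live ranges that ended before this one starts,
        # returning their physical registers to the free pool.
        for a in active[:]:
            if ranges[a][1] < start:
                active.remove(a)
                free.append(assignment[a])
        phys = free.pop() if free else next_reg
        if phys == next_reg:
            next_reg += 1
        assignment[v] = phys
        active.append(v)
    return assignment

prog = [
    ("a", []),          # a = load
    ("b", []),          # b = load
    ("c", ["a", "b"]),  # c = a + b   (last use of a and b)
    ("d", ["c"]),       # d = c * 2   (d can reuse a dead register)
]
print(allocate(prog))   # {'a': 0, 'b': 1, 'c': 2, 'd': 1}
```

The hard part on real hardware (and the part maxas still needs) is that “last use” isn’t obvious from program order when memory requests can be reordered, which is exactly the condition Scott mentions above.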
I also need to put in better support for 64-bit math. It works as is, but it could be better.
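For anyone wondering why 64-bit math needs special handling at all: the integer datapath is 32 bits wide, so a 64-bit add is built from an add-with-carry pair (IADD ... .CC followed by IADD.X in the SASS, if I’m reading the disassembly right). A toy model of that instruction pair:

```python
# Toy model of a 64-bit add built from two 32-bit adds with carry,
# mirroring the IADD .CC / IADD.X pair seen in SASS (my reading of it).

MASK32 = 0xFFFFFFFF

def iadd_cc(a, b):
    """Low-word add; returns (result, carry), like IADD with .CC set."""
    s = a + b
    return s & MASK32, s >> 32

def iadd_x(a, b, carry):
    """High-word add that consumes the carry flag, like IADD.X."""
    return (a + b + carry) & MASK32

def add64(x, y):
    lo, carry = iadd_cc(x & MASK32, y & MASK32)
    hi = iadd_x(x >> 32, y >> 32, carry)
    return (hi << 32) | lo

print(hex(add64(0xFFFFFFFF, 1)))  # 0x100000000: carry propagates
```

So every 64-bit integer op costs at least two instructions and ties up register pairs, which is why the assembler has to track it carefully.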
And finally, I need to port it to python. I don’t know why I didn’t just bite the bullet and start it in that language. It would make a really nice complement to pycuda.
Oh, and another thing I should mention: working from the maxas codebase, it really wouldn’t take a lot of effort to build in Kepler support. I just don’t have a Kepler card to test with, and I think the Maxwell architecture is much better balanced. But there are still a lot of Kepler cards out there to target…
Thank you for your work on the assembler. It definitely helps us understand the internal architecture details that Nvidia is not willing to disclose. Although I am not able to understand all of your implementation right now, I hope I will get the gist of your ideas once I have a Maxwell GPU. One thing I noticed in your old posts is that you mentioned “context switch is cheap but not free”. You also mention it in the stall counts section of the ControlCodes wiki page: https://code.google.com/p/maxas/wiki/ControlCodes. Isn’t context switching supposed to be zero-overhead? I wondered why you said it is “cheap but not free”.
That statement was made before my full investigation of context switching. Context switches (within kernels, at least) are mostly free. There are edge cases where they’re not, and those were the ones confusing me originally.