I’ve been working on asfermi, an assembler for Fermi, on and off for more than a month now. I think some people on the forum might be interested in programming Fermi GPUs in the native ISA, so I’m bringing this to your attention.
What it does
For now, asfermi can output an assembled cubin directly, or it can be used to replace instructions in existing cubin kernels.
Instructions that are currently supported include: MOV, MOV32I, LD, LDU, LDL, LDC, LDS, ST, STL, STS, FADD, FADD32I, FMUL, FFMA, IADD, IADD32I, IMUL, IMAD, ISETP, FSETP, S2R, LOP, I2F, F2I, EXIT, RET, PRET, BRA, CAL, NOP. Most of the modifiers supported by those instructions are also supported by asfermi.
Right now only 32-bit cubin is supported. Supported architectures include sm_20 and sm_21.
The source format of asfermi is similar to the output of cuobjdump, with the addition of custom directives that specify kernel names, parameters, shared/local/constant memory objects and so on.
What’s to be done
Support 64-bit cubin generation
Support insertion of instructions into existing cubin kernels
Thorough testing and debugging
Why do it
I started asfermi in order to probe Fermi’s various architectural features. This is already within reach. asfermi, in its current state, should already be able to probe many interesting things, such as L1/L2 associativity, replacement policy, L2 structure, instruction cache size, instruction latency, warp scheduling pattern, replay pattern, register file characteristics and so on.
Manual optimization. With a native ISA assembler and knowledge of the underlying architectural features, you have full control over what the code does. You can build stronger ILP to hide latency, you can optimize your code in a way that’s specific to your desired launch configuration, and so on.
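To make the ILP point a bit more concrete, here is a toy model in Python. It is only a sketch, not asfermi code and not a model of Fermi's real scheduler: it assumes a single in-order pipeline that issues one instruction per cycle, with each instruction depending on the previous instruction of its own chain. The function name `issue_cycles` and the latency of 20 cycles are placeholders for illustration, not measured Fermi figures.

```python
def issue_cycles(num_chains, ops_per_chain, latency):
    """Cycles until the last result is ready in a toy in-order, single-issue pipeline.

    Instructions are issued round-robin across independent chains; each
    instruction depends on the previous instruction of its own chain.
    """
    ready = [0] * num_chains  # cycle at which each chain's latest result is available
    cycle = 0                 # earliest cycle at which the next instruction may issue
    for _ in range(ops_per_chain):
        for c in range(num_chains):
            issue = max(cycle, ready[c])  # stall until this chain's operand is ready
            ready[c] = issue + latency    # result becomes available `latency` cycles later
            cycle = issue + 1             # at most one instruction issued per cycle
    return max(ready)


# The same 64 instructions, arranged with different amounts of ILP
# (latency of 20 cycles is an assumed placeholder value).
print(issue_cycles(1, 64, 20))  # one dependent chain: 1280
print(issue_cycles(4, 16, 20))  # four independent chains: 323
```

In this model, splitting one dependent chain into four independent ones cuts the cycle count by roughly a factor of four, because the pipeline always has another chain to issue from while it waits on a result. On real hardware the gain also depends on how many other warps are resident to cover the latency, which this sketch ignores.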
However, I must say that asfermi is not even in beta. Due to lack of time, I haven’t been able to thoroughly test and debug it. There are even known issues that I haven’t been able to fix. Worse still, my work is piling up again and I won’t have much time for asfermi until the end of the year. That’s why I’m hoping that someone else who is interested could join this project and speed up the development of asfermi.
I understand that not everyone is interested in playing with the architectural features and manual assembly coding. After all, for the sake of development only, CUDA C is the much easier way to go and ptxas doesn’t do such a bad job at optimizing. However, if you just happen to be highly interested in the Fermi architecture, or if the speed of your application is more important than anything else, you may want to take a look at asfermi and maybe join the development as well.
Please email me if you are interested. Alternatively, you can directly join the asfermi Google Group that I’ve set up. In a few days I’ll put up a page on the Google Code site to show you how you can get started with the various todos.
Note: currently, manual assembly coding for complex algorithms is very difficult due to the need to keep track of a large number of registers. To ease this problem, I intend to write a GUI editor (C#.NET, usable in Linux/Win/OS X) that keeps track of the registers for the developer and provides optimization hints based on the instruction latencies and launch configuration. However, this may not happen any time soon if no one joins the development of asfermi.