Model Interpretability (attention mechanism) for Megatron models


I’m working on model interpretability (specifically visualizing the attention flow) for Megatron models like biomegatron345m_biovocab_30k_cased and biomegatron-bert-345m-cased. Has anyone worked on this topic before and can offer me some advice?
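In case it helps frame the question: since BioMegatron uses a BERT-style architecture, one common starting point for visualizing attention flow is the attention-rollout technique (Abnar & Zuidema, 2020), which composes the per-layer attention matrices (averaged over heads, with the residual connection folded in) to estimate how much each output position attends to each input token across the whole stack. Below is a minimal NumPy sketch of that idea; it assumes you have already extracted per-layer attention matrices from the model (e.g. via `output_attentions=True` if you load the checkpoint through Hugging Face Transformers), and the toy random inputs are just for illustration:

```python
import numpy as np

def attention_rollout(attentions):
    """Attention rollout over a list of per-layer attention matrices,
    each of shape (num_heads, seq_len, seq_len), ordered bottom to top."""
    rollout = None
    for layer_attn in attentions:
        # Average over heads, add the identity to account for the
        # residual connection, then re-normalize rows to sum to 1.
        attn = layer_attn.mean(axis=0)
        attn = attn + np.eye(attn.shape[0])
        attn = attn / attn.sum(axis=-1, keepdims=True)
        # Compose with the rollout accumulated from the layers below.
        rollout = attn if rollout is None else attn @ rollout
    return rollout

# Toy example: 2 layers, 2 heads, sequence length 4, random attention.
rng = np.random.default_rng(0)
attns = [rng.random((2, 4, 4)) for _ in range(2)]
attns = [a / a.sum(axis=-1, keepdims=True) for a in attns]  # row-stochastic
rollout = attention_rollout(attns)
print(rollout.shape)  # (4, 4); each row sums to 1
```

The resulting matrix can be rendered as a heatmap over token pairs; tools like BertViz also offer interactive head-level views if you can get the checkpoint into a Transformers-compatible format.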

Also, does anyone know where I can find the model scripts for the BioMegatron models?

Thank you!