Q: I typically use torchrun to launch my distributed training jobs, which is slightly different from launching with python directly. Will the Omnistat wrapper still work in this case?
A: Regarding torchrun, it should run fine since Omnistat processes are separate from your Python tasks (but let us know if you run into issues).
A: I've used both torchrun and srun for distributed training, and both work fine with Omnistat.

Q: A follow-up question: the figures' x-axis is time, but is there more detail on where exactly my program is executing? For example, whether my training is in an attention layer or an MLP layer?
A: You can leverage the "annotation" capability mentioned earlier to provide some high-level granularity as to what the application is doing (we have a Python helper module for this that we can follow up with more info); see the illustrative sketch at the end of this Q&A.

Q: Is integer profiling supported, too?
A: It should be, yes. You just need to register different counters to monitor.

Q: Can one use Omnistat in a container? If yes, should Omnistat be installed on the host or in the container?
A: As long as the GPU is visible and the SMI library is installed, yes (and we use it inside VMs).
A: For a containerized workload, I imagine it could go either way. I would likely run Omnistat on the host and add annotations around each container launched within a job.

Q: How is that energy measured? Is it an estimate based on instructions, or a real physical measurement by the PSU?
A: amd-smi provides the GPU energy information used by Omnistat (a minimal read-out example is sketched below).
A: The Cray PM counter methods shown earlier come from dedicated sensors on the mainboard. The rocm-smi/amd-smi GPU measurements also come from hardware, although on a different side of some connections than the Cray PM counters. There is a talk from earlier this year by Bruno and Ashesh Sharma with details on energy measurement.
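
Illustrative sketch for the annotation question above: the exact API of Omnistat's Python helper module was not shown here, so the `annotate` context manager below is a hypothetical stand-in. It only demonstrates the idea of bracketing phases of a training step (e.g. attention vs. MLP) with begin/end markers so that time-series GPU metrics can later be segmented by phase.

```python
# NOTE: 'annotate' and the log format are hypothetical, for illustration only;
# they are not Omnistat's actual helper API. The point is simply to record
# labeled begin/end timestamps around phases of the training step.
import time
from contextlib import contextmanager

@contextmanager
def annotate(label: str, log_path: str = "annotations.log"):
    """Record begin/end timestamps for a named program phase."""
    with open(log_path, "a") as f:
        f.write(f"{time.time():.6f} BEGIN {label}\n")
    try:
        yield
    finally:
        with open(log_path, "a") as f:
            f.write(f"{time.time():.6f} END {label}\n")

# Usage inside a training step (attention_block/mlp_block are placeholders):
#
# with annotate("attention"):
#     h = attention_block(x)
# with annotate("mlp"):
#     h = mlp_block(h)
```

The markers can then be overlaid on the time-based x-axis of the Omnistat plots to see which phase the program was in at any given sample.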
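Illustrative sketch for the energy question above, assuming the `amdsmi` Python bindings that ship with ROCm are available. The energy accumulator is read from the GPU's own hardware counters; the exact keys in the returned dictionary vary between ROCm releases, so the raw result is printed rather than assuming specific field names.

```python
# Minimal sketch: read the per-GPU energy counter via the amdsmi Python bindings.
# Dictionary field names differ across ROCm versions, so print the full result.
import amdsmi

amdsmi.amdsmi_init()
try:
    for handle in amdsmi.amdsmi_get_processor_handles():
        # Energy accumulator and its resolution, as reported by GPU hardware.
        info = amdsmi.amdsmi_get_energy_count(handle)
        print(info)
finally:
    amdsmi.amdsmi_shut_down()
```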