Perf Tool
- Usage
The most commonly used perf commands are:
annotate Read perf.data (created by perf record) and display annotated code
archive Create archive with object files with build-ids found in perf.data file
bench General framework for benchmark suites
buildid-cache Manage build-id cache.
buildid-list List the buildids in a perf.data file
data Data file related processing
diff Read perf.data files and display the differential profile
evlist List the event names in a perf.data file
inject Filter to augment the events stream with additional information
kmem Tool to trace/measure kernel memory properties
kvm Tool to trace/measure kvm guest os
list List all symbolic event types
lock Analyze lock events
mem Profile memory accesses
record Run a command and record its profile into perf.data
report Read perf.data (created by perf record) and display the profile
sched Tool to trace/measure scheduler properties (latencies)
script Read perf.data (created by perf record) and display trace output
stat Run a command and gather performance counter statistics
test Runs sanity tests.
timechart Tool to visualize total system behavior during a workload
top System profiling tool.
probe Define new dynamic tracepoints
trace strace inspired tool
- Event
Hardware cache event
Hardware event
Kernel PMU event
Raw hardware event descriptor
Software event
Tracepoint event
Perf Example
perf stat -d -a -g -- sleep 5
Performance counter stats for 'system wide':
213687.051292 task-clock (msec) # 42.730 CPUs utilized (3.19%)
11,088 context-switches # 0.052 K/sec (3.19%)
62 cpu-migrations # 0.000 K/sec (3.22%)
6,663 page-faults # 0.031 K/sec (3.27%)
4,164,015,753 cycles # 0.019 GHz (3.30%)
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
4,083,498,652 instructions # 0.98 insns per cycle (0.32%)
1,018,073,891 branches # 4.764 M/sec (0.32%)
12,445,678 branch-misses # 1.22% of all branches (0.33%)
1,329,661,336 L1-dcache-loads # 6.222 M/sec (0.33%)
59,291,150 L1-dcache-load-misses # 4.46% of all L1-dcache hits (0.28%)
<not supported> LLC-loads:HG
<not supported> LLC-load-misses:HG
5.000903836 seconds time elapsed
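The counter lines above share a simple layout: a value (or `<not supported>`), then the event name, then an optional `#` comment. As a minimal sketch, assuming that layout holds for your perf version, the output can be turned into a dictionary for further processing. The names `LINE_RE` and `parse_perf_stat` are hypothetical helpers made up for this example and are not part of perf.

```python
import re

# Value column (digits with thousands separators, or "<not supported>"),
# followed by the event name. Assumes the layout shown in the sample output.
LINE_RE = re.compile(r"^\s*(<not supported>|[\d,.]+)\s+(\S+)")

def parse_perf_stat(text):
    """Map event name -> float value, or None when the event is unsupported."""
    counters = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # header lines such as "Performance counter stats for ..."
        raw, name = match.groups()
        if name == "seconds":
            name = "seconds-elapsed"  # the trailing "N seconds time elapsed" line
        counters[name] = None if raw == "<not supported>" else float(raw.replace(",", ""))
    return counters

sample = """Performance counter stats for 'system wide':
213687.051292 task-clock (msec) # 42.730 CPUs utilized (3.19%)
4,164,015,753 cycles # 0.019 GHz (3.30%)
<not supported> stalled-cycles-frontend
4,083,498,652 instructions # 0.98 insns per cycle (0.32%)
5.000903836 seconds time elapsed"""

counters = parse_perf_stat(sample)
print(counters["instructions"] / counters["cycles"])
```

For scripted use, perf stat also has a CSV-style output mode (`-x SEP`), which is usually easier to parse than the human-readable format.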
The key metric here is instructions per cycle (shown as "insns per cycle", abbreviated IPC), which is the average number of instructions completed per CPU clock cycle. Higher is generally better, although this is a simplification.
If your IPC is < 1.0, you are likely memory stalled. Software tuning strategies include reducing memory I/O and improving CPU caching and memory locality, especially on NUMA systems. Hardware tuning includes using processors with larger CPU caches and faster memory, buses, and interconnects.
If your IPC is > 1.0, you are likely instruction bound. Look for ways to reduce code execution: eliminate unnecessary work, cache operations, etc. CPU flame graphs are a great tool for this investigation. For hardware tuning, try a faster clock rate, and more cores/hyperthreads.
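The IPC rule of thumb above can be sketched in a few lines, using the cycles and instructions counts from the sample output. The function names and the wording of the returned labels are assumptions made for this illustration; the 1.0 threshold is the rough guide from the text, not a hard limit.

```python
def ipc(instructions: int, cycles: int) -> float:
    """Instructions per cycle: average instructions completed per clock cycle."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return instructions / cycles

def classify(ipc_value: float, threshold: float = 1.0) -> str:
    """Apply the rule of thumb: below the threshold suggests memory stalls,
    above it suggests an instruction-bound workload."""
    if ipc_value < threshold:
        return "likely memory stalled"
    if ipc_value > threshold:
        return "likely instruction bound"
    return "borderline"

# Counts taken from the sample `perf stat` output.
sample_ipc = ipc(instructions=4_083_498_652, cycles=4_164_015_753)
print(f"IPC = {sample_ipc:.2f}: {classify(sample_ipc)}")
```

For the sample run this gives an IPC of about 0.98, matching the "insns per cycle" column, so the rule of thumb points toward memory stalls.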
- Reference
task-clock: CPU time, in milliseconds, that the measured tasks spent running.
context-switches: number of times the processor switched from running one thread to another.
cpu-migrations: number of times a process was moved from one core to another.
page-faults: number of page faults. A page fault occurs when a program needs a memory page that is not yet mapped into the process's virtual memory.
cycles: number of CPU clock cycles. Cycles accumulate at the clock frequency unless the processor is deliberately set idle.
instructions: raw count of instructions executed.
branches: number of branch instructions executed during the whole run.
branch-misses: number of branches for which branch prediction failed to guess the execution path correctly.
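The figures that perf stat prints after the `#` are ratios of these raw counters. The following sketch recomputes three of them from the sample output; the variable names and the `percent` helper are illustrative assumptions, and the formulas are inferred from the column labels rather than taken from perf's source.

```python
# Raw counts copied from the sample `perf stat` output.
task_clock_msec = 213687.051292
elapsed_sec = 5.000903836
branches = 1_018_073_891
branch_misses = 12_445_678
l1d_loads = 1_329_661_336
l1d_load_misses = 59_291_150

def percent(part: float, whole: float) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole

# CPU time consumed divided by wall-clock time (both in milliseconds).
cpus_utilized = task_clock_msec / (elapsed_sec * 1000.0)
# Share of branches that were mispredicted.
branch_miss_pct = percent(branch_misses, branches)
# Share of L1 data cache loads that missed.
l1d_miss_pct = percent(l1d_load_misses, l1d_loads)

print(f"CPUs utilized:      {cpus_utilized:.3f}")
print(f"branch-misses:      {branch_miss_pct:.2f}% of all branches")
print(f"L1-dcache misses:   {l1d_miss_pct:.2f}% of all L1-dcache loads")
```

The results (about 42.730, 1.22% and 4.46%) agree with the values in the sample output, which is a quick way to check that a counter means what you think it means.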
Perf Tools
https://github.com/brendangregg/perf-tools