To use this tool, you must specify --tool=callgrind on the Valgrind command line.
Callgrind is a profiling tool that records the call history among functions in a program's run as a call-graph. By default, the collected data consists of the number of instructions executed, their relationship to source lines, the caller/callee relationship between functions, and the numbers of such calls. Optionally, cache simulation and/or branch prediction (similar to Cachegrind) can produce further information about the runtime behavior of an application.
The profile data is written out to a file at program termination. For presentation of the data, and interactive control of the profiling, two command line tools are provided:
callgrind_annotate: This command reads in the profile data and prints a sorted list of functions, optionally with source annotation.
For graphical visualization of the data, try KCachegrind, which is a KDE/Qt based GUI that makes it easy to navigate the large amount of data that Callgrind produces.
callgrind_control: This command enables you to interactively observe and control the status of a program currently running under Callgrind's control, without stopping the program. You can get statistics as well as the current stack trace, and you can request zeroing of counters or dumping of profile data.
Cachegrind collects flat profile data: event counts (data reads, cache misses, etc.) are attributed directly to the function they occurred in. This cost attribution mechanism is called self or exclusive attribution.
Callgrind extends this functionality by propagating costs across function call boundaries. If function foo calls bar, the costs from bar are added into foo's costs. When applied to the program as a whole, this builds up a picture of so-called inclusive costs, that is, where the cost of each function includes the costs of all functions it called, directly or indirectly.
As an example, the inclusive cost of main should be almost 100 percent of the total program cost. Because of costs arising before main is run, such as initialization of the run time linker and construction of global C++ objects, the inclusive cost of main is not exactly 100 percent of the total program cost.
Together with the call graph, this allows you to find the specific call chains starting from main in which the majority of the program's costs occur. Caller/callee cost attribution is also useful for profiling functions called from multiple call sites, and where optimization opportunities depend on changing code in the callers, in particular by reducing the call count.
Callgrind's cache simulation is based on that of Cachegrind. Read the documentation for Cachegrind: a cache and branch-prediction profiler first. The material below describes the features supported in addition to Cachegrind's features.
Callgrind's ability to detect function calls and returns depends on the instruction set of the platform it is run on. It works best on x86 and amd64, and unfortunately currently does not work so well on PowerPC, ARM, Thumb or MIPS code. This is because there are no explicit call or return instructions in these instruction sets, so Callgrind has to rely on heuristics to detect calls and returns.
As with Cachegrind, you probably want to compile with debugging info (the -g option) and with optimization turned on.
To start a profile run for a program, execute:
valgrind --tool=callgrind [callgrind options] your-program [program options]
While the simulation is running, you can observe execution with:
callgrind_control -b
This will print out the current backtrace. To annotate the backtrace with event counts, run
callgrind_control -e -b
After program termination, a profile data file named callgrind.out.<pid> is generated, where pid is the process ID of the program being profiled. The data file contains information about the calls made in the program among the functions executed, together with Instruction Read (Ir) event counts.
To generate a function-by-function summary from the profile data file, use
callgrind_annotate [options] callgrind.out.<pid>
This summary is similar to the output you get from a Cachegrind run with cg_annotate: the list of functions is ordered by the exclusive cost of functions, and exclusive costs are also what is shown. The following two options are important for the additional features of Callgrind:
--inclusive=yes: Instead of using exclusive cost of functions as sorting order, use and show inclusive cost.
--tree=both: Interleave into the top level list of functions, information on the callers and the callees of each function. In these lines, which represent executed calls, the cost gives the number of events spent in the call. Indented, above each function, there is the list of callers, and below, the list of callees. The sum of events in calls to a given function (caller lines), as well as the sum of events in calls from the function (callee lines) together with the self cost, gives the total inclusive cost of the function.
By default, you will also get annotated source code for all relevant functions for which the source can be found. In addition to source annotation as produced by cg_annotate, you will see the annotated call sites with call counts. For all other options, consult the (Cachegrind) documentation for cg_annotate.
For a better call graph browsing experience, it is highly recommended to use KCachegrind. If your code has a significant fraction of its cost in cycles (sets of functions calling each other in a recursive manner), you have to use KCachegrind, as callgrind_annotate currently does not do any cycle detection, which is important to get correct results in this case.
If you are additionally interested in measuring the cache behavior of your program, use Callgrind with the option --cache-sim=yes. For branch prediction simulation, use --branch-sim=yes. Expect a further slowdown by approximately a factor of 2.
If the program section you want to profile is somewhere in the middle of the run, it is beneficial to fast forward to this section without any profiling, and then enable profiling. This is achieved by using the command line option --instr-atstart=no and running, in a shell:
callgrind_control -i on
just before the interesting code section is executed. To exactly specify the code position where profiling should start, use the client request CALLGRIND_START_INSTRUMENTATION.
If you want to be able to see assembly code level annotation, specify --dump-instr=yes. This will produce profile data at instruction granularity. Note that the resulting profile data can only be viewed with KCachegrind. For assembly annotation, it is also interesting to see more details of the control flow inside of functions, i.e. (conditional) jumps. This will be collected by further specifying --collect-jumps=yes.
Sometimes you are not interested in characteristics of a full program run, but only of a small part of it, for example execution of one algorithm. If there are multiple algorithms, or one algorithm running with different input data, it may even be useful to get different profile information for different parts of a single program run.
Profile data files have names of the form callgrind.out.pid.part-threadID where pid is the PID of the running program, part is a number incremented on each dump (".part" is skipped for the dump at program termination), and threadID is a thread identification ("-threadID" is only used if you request dumps of individual threads with --separate-threads=yes).
There are different ways to generate multiple profile dumps while a program is running under Callgrind's supervision. Nevertheless, all methods trigger the same action, which is "dump all profile information since the last dump or program start, and zero cost counters afterwards". To allow for zeroing cost counters without dumping, there is a second action "zero all cost counters now". The different methods are:
Dump on program termination. This method is the standard way and doesn't need any special action on your part.
Spontaneous, interactive dumping. Use callgrind_control -d [hint [PID/Name]] to request the dumping of profile information of the supervised application with PID or Name. hint is an arbitrary string you can optionally specify to later be able to distinguish profile dumps. The control program will not terminate before the dump is completely written. Note that the application must be actively running for detection of the dump command. So, for a GUI application, resize the window, or for a server, send a request.
If you are using KCachegrind for browsing of profile information, you can use the toolbar button Force dump. This will request a dump and trigger a reload after the dump is written.
Periodic dumping after execution of a specified number of basic blocks. For this, use the command line option --dump-every-bb=count.
Dumping at enter/leave of specified functions. Use the options --dump-before=function and --dump-after=function. To zero cost counters before entering a function, use --zero-before=function. You can specify these options multiple times for different functions. Function specifications support wildcards: e.g. use --dump-before='foo*' to generate dumps before entering any function starting with foo.
Program controlled dumping. Insert CALLGRIND_DUMP_STATS; at the position in your code where you want a profile dump to happen. Use CALLGRIND_ZERO_STATS; to only zero profile counters. See Client request reference for more information on Callgrind specific client requests.
If you are running a multi-threaded application and specify the command line option --separate-threads=yes, every thread will be profiled on its own and will create its own profile dump. Thus, the last two methods will only generate one dump of the currently running thread. With the other methods, you will get multiple dumps (one for each thread) on a dump request.
By default, whenever events are happening (such as an instruction execution or cache hit/miss), Callgrind is aggregating them into event counters. However, you may be interested only in what is happening within a given function or starting from a given program phase. To this end, you can disable event aggregation for uninteresting program parts. While attribution of events to functions as well as producing separate output per program phase can be done by other means (see previous section), there are two benefits of disabling aggregation. First, this is very fine-grained (e.g. just for a loop within a function). Second, disabling event aggregation for complete program phases allows you to switch off time-consuming cache simulation and allows Callgrind to progress at much higher speed, with a slowdown of around a factor of 2 (identical to valgrind --tool=none).
There are two aspects which influence whether Callgrind is aggregating events at some point in time of program execution. First, there is the collection state. If this is off, no aggregation will be done. By changing the collection state, you can control event aggregation at a very fine granularity. However, there is not much difference in regard to execution speed of Callgrind. By default, collection is switched on, but can be disabled by different means (see below).
Second, there is the instrumentation mode in which Callgrind is running. This mode can be either on or off. If instrumentation is off, no observation of actions in the program will be done and thus, no actions will be forwarded to the simulator which could trigger events. In the end, no events will be aggregated. The huge benefit is the much higher speed with instrumentation switched off. However, this should only be used with care and in a coarse fashion: every mode change resets the simulator state (i.e. whether a memory block is cached or not) and flushes Valgrind's internal cache of instrumented code blocks, resulting in a latency penalty at switching time. Also, cache simulator results directly after switching on instrumentation will be skewed due to identified cache misses which would not happen in reality (if you care about this warm-up effect, you should make sure to temporarily have collection state switched off directly after turning instrumentation mode on). However, switching instrumentation state is very useful to skip larger program phases such as an initialization phase. By default, instrumentation is switched on, but as with the collection state, can be changed by various means.
Callgrind can start with instrumentation mode switched off by specifying option --instr-atstart=no. Afterwards, instrumentation can be controlled in two ways: first, interactively with:
callgrind_control -i on
(and switching off again by specifying "off" instead of "on"). Second, instrumentation state can be programmatically changed with the macros CALLGRIND_START_INSTRUMENTATION; and CALLGRIND_STOP_INSTRUMENTATION;.
Similarly, the collection state at program start can be switched off by --collect-atstart=no. During execution, it can be controlled programmatically with the macro CALLGRIND_TOGGLE_COLLECT;.
Further, you can limit event collection to a specific function by using --toggle-collect=function. This will toggle the collection state on entering and leaving the specified function. When this option is in effect, the default collection state at program start is "off". Only events happening while running inside of the given function will be collected. Recursive calls of the given function do not trigger any action. This option can be given multiple times to specify different functions of interest.
For access to shared data among threads in multithreaded code, synchronization is required to avoid race conditions. Synchronization primitives are usually implemented via atomic instructions. However, excessive use of such instructions can lead to performance issues.
To enable analysis of this problem, Callgrind optionally can count the number of atomic instructions executed. More precisely, for x86/x86_64, these are instructions using a lock prefix. For architectures supporting LL/SC, these are the number of SC instructions executed. For both, the term "global bus events" is used.
The short name of the event type used for global bus events is "Ge".
To count global bus events, use --collect-bus=yes.
Informally speaking, a cycle is a group of functions which call each other in a recursive way.
Formally speaking, a cycle is a nonempty set S of functions, such that for every pair of functions F and G in S, it is possible to call from F to G (possibly via intermediate functions) and also from G to F. Furthermore, S must be maximal -- that is, be the largest set of functions satisfying this property. For example, if a third function H is called from inside S and calls back into S, then H is also part of the cycle and should be included in S.
Recursion is quite usual in programs, and therefore, cycles sometimes appear in the call graph output of Callgrind. However, the title of this chapter should raise two questions: What is bad about cycles which makes you want to avoid them? And: How can cycles be avoided without changing program code?
Cycles are not bad in themselves, but they tend to make performance analysis of your code harder. This is because inclusive costs for calls inside of a cycle are meaningless. The definition of inclusive cost, i.e. self cost of a function plus inclusive cost of its callees, needs a topological order among functions. For cycles, this does not hold true: callees of a function in a cycle include the function itself. Therefore, KCachegrind does cycle detection and skips visualization of any inclusive cost for calls inside of cycles. Further, all functions in a cycle are collapsed into artificial functions with names like Cycle 1.
Now, when a program exposes really big cycles (as is true for some GUI code, or in general code using an event or callback based programming style), you lose the nice property that lets you pinpoint the bottlenecks by following call chains from main, guided by inclusive cost. In addition, KCachegrind loses its ability to show interesting parts of the call graph, as it uses inclusive costs to cut off uninteresting areas.
Despite the meaninglessness of inclusive costs in cycles, the big drawback for visualization motivates the possibility to temporarily switch off cycle detection in KCachegrind, which can lead to misleading visualization. However, cycles often appear because of an unlucky superposition of independent call chains in a way that the profile result will see a cycle. Neglecting uninteresting calls with very small measured inclusive cost would break these cycles. In such cases, incorrect handling of cycles by not detecting them still gives meaningful profiling visualization.
Note that callgrind_annotate currently does not do any cycle detection at all. For program executions with function recursion, it can, for example, print nonsensical inclusive costs way above 100%.
After describing why cycles are bad for profiling, it is worth talking about cycle avoidance. The key insight here is that symbols in the profile data do not have to exactly match the symbols found in the program. Instead, the symbol name could encode additional information from the current execution context such as recursion level of the current function, or even some part of the call chain leading to the function. While encoding of additional information into symbols is quite capable of avoiding cycles, it has to be used carefully to not cause symbol explosion. The latter imposes a large memory requirement on Callgrind, with possible out-of-memory conditions, and produces big profile data files.
A further possibility to avoid cycles in Callgrind's profile data output is to simply leave out given functions in the call graph. Of course, this also skips any call information from and to an ignored function, and thus can break a cycle. Candidates for this typically are dispatcher functions in event driven code. The option to ignore calls to a function is --fn-skip=function. Aside from possibly breaking cycles, this is used in Callgrind to skip trampoline functions in the PLT sections for calls to functions in shared libraries. You can see the difference if you profile with --skip-plt=no. If a call is ignored, its cost events will be propagated to the enclosing function.
If you have a recursive function, you can distinguish the first 10 recursion levels by specifying --separate-recs10=function. Or for all functions with --separate-recs=10, but this will give you much bigger profile data files. In the profile data, you will see the recursion levels of "func" as the different functions with names "func", "func'2", "func'3" and so on.
If you have call chains "A > B > C" and "A > C > B" in your program, you usually get a "false" cycle "B <> C". Use --separate-callers2=B --separate-callers2=C, and functions "B" and "C" will be treated as different functions depending on the direct caller. Using the apostrophe for appending this "context" to the function name, you get "A > B'A > C'B" and "A > C'A > B'C", and there will be no cycle. Use --separate-callers=2 to get a 2-caller dependency for all functions. Note that doing this will increase the size of profile data files.
If your program forks, the child will inherit all the profiling data that has been gathered for the parent. To start with empty profile counter values in the child, the client request CALLGRIND_ZERO_STATS; can be inserted into code to be executed by the child, directly after fork. However, you will have to make sure that the output file format string (controlled by --callgrind-out-file) does contain %p (which is true by default). Otherwise, the outputs from the parent and child will overwrite each other or will be intermingled, which almost certainly is not what you want.
You will be able to control the new child independently from the parent via callgrind_control.
In the following, options are grouped into classes.
Some options allow the specification of a function/symbol name, such as --dump-before=function, or --fn-skip=function. All these options can be specified multiple times for different functions. In addition, the function specifications actually are patterns by supporting the use of wildcards '*' (zero or more arbitrary characters) and '?' (exactly one arbitrary character), similar to file name globbing in the shell. This feature is important especially for C++, as without wildcard usage, the function would have to be specified in full extent, including parameter signature.
These options influence the name and format of the profile data files.
--callgrind-out-file=<file>
Write the profile data to file rather than to the default output file, callgrind.out.<pid>. The %p and %q format specifiers can be used to embed the process ID and/or the contents of an environment variable in the name, as is the case for the core option --log-file. When multiple dumps are made, the file name is modified further; see below.
--dump-line=<no|yes> [default: yes]
This specifies that event counting should be performed at source line granularity. This allows source annotation for sources which are compiled with debug information (-g).
--dump-instr=<no|yes> [default: no]
This specifies that event counting should be performed at per-instruction granularity. This allows for assembly code annotation. Currently the results can only be displayed by KCachegrind.
--compress-strings=<no|yes> [default: yes]
This option influences the output format of the profile data. It specifies whether strings (file and function names) should be identified by numbers. This shrinks the file, but makes it more difficult for humans to read (which is not recommended in any case).
--compress-pos=<no|yes> [default: yes]
This option influences the output format of the profile data. It specifies whether numerical positions are always specified as absolute values or are allowed to be relative to previous numbers. This shrinks the file size.
--combine-dumps=<no|yes> [default: no]
When enabled and multiple profile data parts are to be generated, these parts are appended to the same output file. Not recommended.
These options specify when actions relating to event counts are to be executed. For interactive control use callgrind_control.
--dump-every-bb=<count> [default: 0, never]
Dump profile data every count basic blocks. Whether a dump is needed is only checked when Valgrind's internal scheduler is run. Therefore, the minimum setting useful is about 100000. The count is a 64-bit value to make long dump periods possible.
--dump-before=<function>
Dump when entering function.
--zero-before=<function>
Zero all costs when entering function.
--dump-after=<function>
Dump when leaving function.
These options specify when events are to be aggregated into event counts. Also see Limiting range of event collection.
--instr-atstart=<yes|no> [default: yes]
Specify if you want Callgrind to start simulation and profiling from the beginning of the program. When set to no, Callgrind will not be able to collect any information, including calls, but it will have at most a slowdown of around 4, which is the minimum Valgrind overhead. Instrumentation can be interactively enabled via callgrind_control -i on.
Note that the resulting call graph will most probably not contain main, but will contain all the functions executed after instrumentation was enabled. Instrumentation can also be programmatically enabled/disabled. See the Callgrind include file callgrind.h for the macro you have to use in your source code.
For cache simulation, results will be less accurate when switching on instrumentation later in the program run, as the simulator starts with an empty cache at that moment. Switch on event collection later to cope with this error.
--collect-atstart=<yes|no> [default: yes]
Specify whether event collection is enabled at beginning of the profile run.
To only look at parts of your program, you have two possibilities:
Zero event counters before entering the program part you want to profile, and dump the event counters to a file after leaving that program part.
Switch on/off collection state as needed to only see event counters happening while inside of the program part you want to profile.
The second option can be used if the program part you want to profile is called many times. Option 1, i.e. creating a lot of dumps is not practical here.
Collection state can be toggled at entry and exit of a given function with the option --toggle-collect. If you use this option, collection state should be disabled at the beginning. Note that the specification of --toggle-collect implicitly sets --collect-atstart=no.
Collection state can be toggled also by inserting the client request CALLGRIND_TOGGLE_COLLECT; at the needed code positions.
--toggle-collect=<function>
Toggle collection on entry/exit of function.
--collect-jumps=<no|yes> [default: no]
This specifies whether information for (conditional) jumps should be collected. As above, callgrind_annotate currently is not able to show you the data. You have to use KCachegrind to get jump arrows in the annotated code.
--collect-systime=<no|yes|msec|usec|nsec> [default: no]
This specifies whether information for system call times should be collected.
The value no indicates to record no system call information.
The other values indicate to record the number of system calls done (sysCount event) and the elapsed time (sysTime event) spent in system calls. The --collect-systime value gives the unit used for sysTime: milliseconds, microseconds or nanoseconds. With the value nsec, callgrind also records the cpu time spent during system calls (sysCpuTime).
The value yes is a synonym of msec. The value nsec is not supported on Darwin.
--collect-bus=<no|yes> [default: no]
This specifies whether the number of global bus events executed should be collected. The event type "Ge" is used for these events.
These options specify how event counts should be attributed to execution contexts. For example, they specify whether the recursion level or the call chain leading to a function should be taken into account, and whether the thread ID should be considered. Also see Avoiding cycles.
--separate-threads=<no|yes> [default: no]
This option specifies whether profile data should be generated separately for every thread. If yes, the file names get "-threadID" appended.
--separate-callers=<callers> [default: 0]
Separate contexts by at most <callers> functions in the call chain. See Avoiding cycles.
--separate-callers<number>=<function>
Separate number callers for function. See Avoiding cycles.
--separate-recs=<level> [default: 2]
Separate function recursions by at most level levels. See Avoiding cycles.
--separate-recs<number>=<function>
Separate number recursions for function. See Avoiding cycles.
--skip-plt=<no|yes> [default: yes]
Ignore calls to/from PLT sections.
--skip-direct-rec=<no|yes> [default: yes]
Ignore direct recursions.
--fn-skip=<function>
Ignore calls to/from a given function. E.g. if you have a call chain A > B > C, and you specify function B to be ignored, you will only see A > C.
This is very convenient to skip functions handling callback behaviour. For example, with the signal/slot mechanism in the Qt graphics library, you only want to see the function emitting a signal to call the slots connected to that signal. First, determine the real call chain to see the functions needed to be skipped, then use this option.
--cache-sim=<yes|no> [default: no]
Specify if you want to do full cache simulation. By default, only instruction read accesses will be counted ("Ir"). With cache simulation, further event counters are enabled: Cache misses on instruction reads ("I1mr"/"ILmr"), data read accesses ("Dr") and related cache misses ("D1mr"/"DLmr"), data write accesses ("Dw") and related cache misses ("D1mw"/"DLmw"). For more information, see Cachegrind: a cache and branch-prediction profiler.
--branch-sim=<yes|no> [default: no]
Specify if you want to do branch prediction simulation. Further event counters are enabled: Number of executed conditional branches and related predictor misses ("Bc"/"Bcm"), executed indirect jumps and related misses of the jump address predictor ("Bi"/"Bim").
--simulate-wb=<yes|no> [default: no]
Specify whether write-back behavior should be simulated, which makes it possible to distinguish LL cache misses with and without write backs. The cache model of Cachegrind/Callgrind does not specify write-through vs. write-back behavior, and this also is not relevant for the number of generated miss counts. However, with explicit write-back simulation it can be decided whether a miss triggers not only the loading of a new cache line, but also whether a write back of a dirty cache line had to take place before. The new dirty miss events are ILdmr, DLdmr, and DLdmw, for misses because of instruction read, data read, and data write, respectively. As they produce two memory transactions, they should account for a doubled time estimation in relation to a normal miss.
--simulate-hwpref=<yes|no> [default: no]
Specify whether simulation of a hardware prefetcher should be added which is able to detect stream access in the second level cache by comparing accesses separately for each page. As the simulation cannot decide about any timing issues of prefetching, it is assumed that any hardware prefetch triggered succeeds before a real access is done. Thus, this gives a best-case scenario by covering all possible stream accesses.
--cacheuse=<yes|no> [default: no]
Specify whether cache line use should be collected. For every cache line, from loading to it being evicted, the number of accesses as well as the number of actually used bytes is determined. This behavior is related to the code which triggered loading of the cache line. In contrast to miss counters, which show the position where the symptoms of bad cache behavior (i.e. latencies) happen, the use counters try to pinpoint the reason (i.e. the code with the bad access behavior). The new counters are defined in a way such that worse behavior results in higher cost.
AcCost1 and AcCost2 are counters showing bad temporal locality for L1 and LL caches, respectively. This is done by summing up reciprocal values of the numbers of accesses of each cache line, multiplied by 1000 (as only integer costs are allowed). E.g. for a given source line with 5 read accesses, a value of 5000 AcCost means that for every access, a new cache line was loaded and directly evicted afterwards without further accesses.
Similarly, SpLoss1/2 shows bad spatial locality for L1 and LL caches, respectively. It gives the spatial loss count of bytes which were loaded into cache but never accessed. It pinpoints code accessing data in a way such that cache space is wasted. This hints at bad layout of data structures in memory. Assuming a cache line size of 64 bytes and 100 L1 misses for a given source line, the loading of 6400 bytes into L1 was triggered. If SpLoss1 shows a value of 3200 for this line, this means that half of the loaded data was never used, or that with a better data layout, only half of the cache space would have been needed.
Please note that for cache line use counters, it currently is not possible to provide meaningful inclusive costs. Therefore, inclusive cost of these counters should be ignored.
--I1=<size>,<associativity>,<line size>
Specify the size, associativity and line size of the level 1 instruction cache.
--D1=<size>,<associativity>,<line size>
Specify the size, associativity and line size of the level 1 data cache.
--LL=<size>,<associativity>,<line size>
Specify the size, associativity and line size of the last-level cache.
The Callgrind tool provides monitor commands handled by the Valgrind gdbserver (see Monitor command handling by the Valgrind gdbserver).
dump [<dump_hint>] requests to dump the profile data.
zero requests to zero the profile data counters.
instrumentation [on|off] requests to set (if parameter on/off is given) or get the current instrumentation state.
status requests to print out some status information.
Callgrind provides the following specific client requests in callgrind.h. See that file for the exact details of their arguments.
CALLGRIND_DUMP_STATS
Force generation of a profile dump at the specified position in the code, for the current thread only. The counters that are written out are reset to zero.
CALLGRIND_DUMP_STATS_AT(string)
Same as CALLGRIND_DUMP_STATS, but lets you specify a string to distinguish profile dumps.
CALLGRIND_ZERO_STATS
Reset the profile counters for the current thread to zero.
CALLGRIND_TOGGLE_COLLECT
Toggle the collection state. This makes it possible to exclude events from the profile counters. See also options --collect-atstart and --toggle-collect.
CALLGRIND_START_INSTRUMENTATION
Start full Callgrind instrumentation if not already enabled. When cache simulation is done, this will flush the simulated cache and lead to an artificial cache warmup phase afterwards with cache misses which would not have happened in reality. See also option --instr-atstart.
CALLGRIND_STOP_INSTRUMENTATION
Stop full Callgrind instrumentation if not already disabled. This flushes Valgrind's translation cache, and does no additional instrumentation afterwards: it effectively will run at the same speed as Nulgrind, i.e. at minimal slowdown. Use this to speed up the Callgrind run for uninteresting code parts. Use CALLGRIND_START_INSTRUMENTATION to enable instrumentation again. See also option --instr-atstart.
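As an illustration of how the client requests above might be combined, the following sketch restricts profiling to one function. The workload `hot_loop` and the wrapper `profile_hot_loop` are hypothetical names. The include path `<valgrind/callgrind.h>` assumes the usual installed location of the header, and the fallback definitions exist only so that the example still compiles where the Valgrind headers are not installed.

```c
#include <assert.h>

#if defined(__has_include)
#  if __has_include(<valgrind/callgrind.h>)
#    include <valgrind/callgrind.h>
#    define HAVE_CALLGRIND_H 1
#  endif
#endif

#ifndef HAVE_CALLGRIND_H
/* Fallback no-ops (assumption: Valgrind headers are unavailable). */
#  define CALLGRIND_START_INSTRUMENTATION do { } while (0)
#  define CALLGRIND_STOP_INSTRUMENTATION  do { } while (0)
#  define CALLGRIND_ZERO_STATS            do { } while (0)
#  define CALLGRIND_DUMP_STATS_AT(s)      do { (void)(s); } while (0)
#endif

/* Hypothetical workload: sum of squares of 0 .. n-1. */
static long hot_loop(long n)
{
    long sum = 0;
    for (long i = 0; i < n; i++)
        sum += i * i;
    return sum;
}

/* Profile only hot_loop(): enable instrumentation, clear the counters,
 * run the workload, dump a labelled profile, and disable instrumentation. */
long profile_hot_loop(long n)
{
    assert(n >= 0);

    CALLGRIND_START_INSTRUMENTATION;            /* no effect if already enabled */
    CALLGRIND_ZERO_STATS;                       /* reset this thread's counters */

    long result = hot_loop(n);

    CALLGRIND_DUMP_STATS_AT("after hot_loop");  /* dump, counters reset to zero */
    CALLGRIND_STOP_INSTRUMENTATION;             /* rest runs at minimal slowdown */
    return result;
}
```

A program using this pattern would typically be started with `valgrind --tool=callgrind --instr-atstart=no`, so that nothing is instrumented until the first request is reached. Outside of Valgrind the requests have no effect and the function simply returns its result.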
-h --help
Show summary of options.
--version
Show version of callgrind_annotate.
--show=A,B,C [default: all]
Only show figures for events A,B,C.
--threshold=<0--100> [default: 99%]
Percentage of counts (of the primary sort event) to be shown.
callgrind_annotate stops printing functions when the sum of the cost percentages of the printed functions is greater than or equal to the given threshold percentage.
--sort=A,B,C
Sort columns by events A,B,C. The default is the event column order.
Optionally, each event is followed by a : and a threshold, to specify different thresholds depending on the event.
callgrind_annotate stops printing functions when the sum of the cost percentages of the printed functions is, for every event, greater than or equal to that event's threshold percentage.
When one or more thresholds are given via this option, the value of --threshold is ignored.
--show-percs=<no|yes> [default: no]
When enabled, a percentage is printed next to all event counts. This helps gauge the relative importance of each function and line.
--auto=<yes|no> [default: yes]
Annotate all source files containing functions that helped reach the event count threshold.
--context=N [default: 8]
Print N lines of context before and after annotated lines.
--inclusive=<yes|no> [default: no]
Add subroutine costs to function calls.
--tree=<none|caller|calling|both> [default: none]
For each function, print its callers, the functions it calls, or both.
-I, --include=<dir>
Add dir to the list of directories to search for source files.
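The stopping rule described for --threshold above can be sketched as a cumulative sum over the sorted function list. This is not callgrind_annotate's actual code: the function name is hypothetical, and the sketch assumes that each function's share of the primary sort event is already available as a percentage, sorted in descending order.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not taken from callgrind_annotate.
 * `percent` holds each function's share of the primary sort event,
 * sorted in descending order. Returns how many functions are printed
 * before the cumulative share reaches `threshold`. */
size_t functions_printed(const double percent[], size_t n, double threshold)
{
    assert(threshold >= 0.0 && threshold <= 100.0);

    double cumulative = 0.0;
    size_t printed = 0;

    /* Keep printing while the printed functions cover less than
     * the requested percentage of the total cost. */
    while (printed < n && cumulative < threshold) {
        cumulative += percent[printed];
        printed++;
    }
    return printed;
}
```

For shares of 60, 25, 10, 4 and 1 percent, the default threshold of 99 prints the first four functions, while a threshold of 90 prints only three. Lowering the threshold is therefore a quick way to shorten the output to the most expensive functions.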
By default, callgrind_control acts on all programs run by the current user under Callgrind. It is possible to limit the actions to specific Callgrind runs by providing a list of pids or program names as arguments. The default action is to give some brief information about the applications being run under Callgrind.
-h --help
Show a short description, usage, and summary of options.
--version
Show version of callgrind_control.
-l --long
Also show the working directory, in addition to the brief information given by default.
-s --stat
Show statistics information about active Callgrind runs.
-b --back
Show stack/back traces of each thread in active Callgrind runs. For each active function in the stack trace, the number of invocations since program start (or the last dump) is also shown. This option can be combined with -e to show the inclusive cost of active functions.
-e [A,B,...] (default: all)
Show the current per-thread, exclusive cost values of event counters. If no explicit event names are given, figures for all event types which are collected in the given Callgrind run are shown. Otherwise, only figures for event types A, B, ... are shown. If this option is combined with -b, inclusive cost for the functions of each active stack frame is provided, too.
--dump[=<desc>] (default: no description)
Request the dumping of profile information. Optionally, a description can be specified, which is written into the dump as part of the information stating what triggered the dump. This can be used to distinguish multiple dumps.
-z --zero
Zero all event counters.
-k --kill
Force a Callgrind run to be terminated.
--instr=<on|off>
Switch instrumentation mode on or off. If a Callgrind run has instrumentation disabled, no simulation is done and no events are counted. This is useful to skip uninteresting program parts, as there is much less slowdown (same as with the Valgrind tool "none"). See also the Callgrind option --instr-atstart.
--vgdb-prefix=<prefix>
Specify the vgdb prefix to be used by callgrind_control. callgrind_control internally uses vgdb to find and control the active Callgrind runs. If the --vgdb-prefix option was used for launching valgrind, then the same option must be given to callgrind_control.