Creating a real-world utt tracing application basically involves three steps: instrumenting the kernel to generate and log the trace data, writing a userspace tracer to capture it, and writing a userspace parser to display it.
For this tutorial, we'll create a tracing application called 'vfstrace' which will trace vfs read, write, open, close, and sendfile system calls.
We'll actually start the tutorial in the middle, with the userspace tracer code. Normally, we'd want to start with the code that originates the data in the kernel, but since this is a tutorial about how to use the userspace library, we'll cover the userspace tracing and parsing parts first and then as a matter of completeness and because there are many ways to write the kernel part, finish things out with a detailed look at how the example instrumentation works.
First though, a brief overview. utt is actually composed of two separate sets of functions and callbacks, both provided by the libutt library: the tracing functions, which are prefixed with utt_, and the parsing functions, which are prefixed with utp_.
utt itself doesn't implement any tracing or parsing programs; at this point it's just a library for developers to write their own. This tutorial shows in detail one way to do that.
Though tracing and parsing could be implemented in a single executable, there's a benefit in flexibility gained from implementing them as separate executables. Aside from making the code modular, it allows the user of both the tracer and parser to specify at run-time the destination and source of the trace data, depending on conditions. For instance, if tracing disk activity, it might make more sense to send the data over the network to avoid perturbing the trace results with the tracer's own disk activity. In other cases, it might be more convenient to simply trace the data to a local disk. Or the same executables can be pipelined together, with the tracer sending all data to stdout and the parser reading that data from stdin and displaying it in real time to the user ('live tracing' mode).
libutt supports all of these modes of operation; the following tutorial, along with the example vfstrace code, shows how to write these two pieces as easily as possible.
Note also that blktrace itself has been ported to libutt and can also be used as an example of a real-world utt application. Because it does quite a bit more than simple tracing, it's more complex in some details, but the code also basically follows the same pattern as the vfstrace example.
So, onward to the tracing code.
For the tracer, there's actually very little code to write; most of it
can be lifted as-is from the vfstrace.c
code, but we'll
go over most of that code to show the API functions used and how the
tracing code basically works. We'll first go over the tracer code
that you'll need to write for any application, assuming that that's
what someone writing a new application would be interested in right
away. After that, we'll cover how the rest of the API is used in the
boilerplate code, for those interested.
An utt tracer can operate in different modes, e.g. disk mode, where the incoming data is saved directly to disk; network client mode, where the data is sent to the destination using sendfile; non-sendfile network mode; and 'live mode', where the data is sent to stdout and the other end of the pipe is a parser or other post-processing utility interpreting the trace stream in real time.
For 'live mode' and non-sendfile network mode, in order to make it easy for the post-processing utilities to make sense of the incoming data, the tracer only sends data when it knows it has complete events to hand over. To make this possible, the tracing engine requires the application to define a callback function called verify_event() (if the application isn't interested in supporting live/non-sendfile network mode, it isn't necessary to implement verify_event() and this section can be skipped). The tracer passes this function pointers to the start and end of the buffer containing the data. The application's implementation of this function should verify that the buffer contains a valid event and, if it does, return the length of the event for the next time around the loop.
The application's implementation of verify_event() is passed to the tracer when the tracer is created using utt_alloc_trace() (see below) via a set of user-defined callbacks contained in an utt_ops struct. Currently verify_event() is the only callback defined for tracing, and it's only used for pipe mode.
So here's basically all the code you need to write to get a functional tracer, when combined with the boilerplate code in the example:
/*
 * vfstrace_verify_event - utt userspace API callback verify implementation
 */
static int vfstrace_verify_event(struct trace_info *ti, const void *event,
                                 const void *buf_end)
{
        const struct vfs_common_trace_rec *t = event;
        int len;

        if (event + sizeof(*t) > buf_end)
                return 0;

        if (verify_trace(&ti->di, t->magic))
                return -1;

        switch (t->id) {
        case __TRACE_OPEN:
                len = sizeof(struct vfs_open_trace_rec) + t->pdu_len;
                break;
        case __TRACE_CLOSE:
                len = sizeof(struct vfs_close_trace_rec);
                break;
        case __TRACE_READ:
        case __TRACE_WRITE:
                len = sizeof(struct vfs_readwrite_trace_rec);
                break;
        case __TRACE_SENDFILE:
                len = sizeof(struct vfs_sendfile_trace_rec);
                break;
        default:
                return -1;
        }

        if (event + len > buf_end)
                return 0;

        return len;
}

static struct utt_ops ops = {
        .verify_event = vfstrace_verify_event,
};
The event structs for this application are defined in vfstrace_api.h and are discussed in a little more detail in the kernel instrumentation section. Of course, your application will be logging different event types and therefore would need to adapt this function to use its own event types.
The only other thing an application would need to do outside of the boilerplate code of this example would be to change the call to utt_alloc_trace(), described below.
We'll now go over main(), where most of the API calls happen, but where basically only one line of code needs to be changed for a new application (it isn't necessary to understand how all the other utt_ API functions are used here in order to write a tracer using the boilerplate code - this section is for anyone interested in the complete API):
int main(int argc, char *argv[])
{
        int i, c;
        int stop_watch = 0;
        int event_mask_tmp = 0;

        /*
         * New apps should change this line.  Channels will appear in
         * /sys/kernel/debug/vfs
         */
        ti = utt_alloc_trace("vfs", VFS_TRACE_MAGIC, VFS_TRACE_VERSION, &ops);

The first line of interest is the call to utt_alloc_trace(). This function allocates a
trace_info struct, which contains the state info for the trace and is
passed along to later API calls. The first param, "vfs", is both the
name of the application and the name of the directory that will be
created for the trace in debugfs to hold the relay and 'control'
files. It's also included as the middle element in the filenames of
the per-cpu output files, the first being the channel name, and the
third being the cpu id e.g. for vfstrace, which creates a single
channel called "tracer", the per-cpu output files would have the form
"tracer.vfs.0", "tracer.vfs.1", etc. As another example, blktrace
creates a channel for each device, and its app name is "block", so its
output files would be "sda.block.X", "hdb.block.X" etc.
The next two params are or'ed together to make up the magic number for the application. The magic number serves several purposes in utt: it's used by the tracer and parser to verify data integrity, it's used by the parser to determine the endianness of the trace data, and in the future it will be used to identify the application an event belongs to, to aid in merging traces from different utt tracers. The magic number is 32 bits long and should be defined by the application to be unique to it, with the last byte being 0. The last byte will contain the one-byte trace version, which is used to determine whether the tracer or parser can handle the input data stream; the application should make sure the version changes whenever the format of the data changes. NOTE: It isn't strictly necessary to use a magic or version number in order to do tracing - for a barebones tracer, 0 can be passed in for either.
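For example, the vfstrace application defines its magic and version like this (these are the same definitions shown later in the kernel instrumentation section):

/* magic or'ed with version, last byte of magic should be 00 */
#define VFS_TRACE_MAGIC         0x65617500
#define VFS_TRACE_VERSION       0x01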
The final param is the address of the utt_ops struct defined for the application. The utt_ops contains application-specific callbacks that the application can or must define; in the case of utt, that's just one at this point, and will be covered later. If the application isn't interested in implementing any utt_ops, NULL can be passed for this param.
The next part of the code deals with command-line options. These of course are application-specific, but will probably be the same for most applications. We won't discuss them all, just the few that need explanation:
        while ((c = getopt_long(argc, argv, S_OPTS, l_opts, NULL)) >= 0) {
                switch (c) {
                case 'a':
                        i = find_mask_map(optarg, mask_maps,
                                          sizeof(mask_maps)/sizeof(mask_maps[0]));
                        if (i < 0) {
                                fprintf(stderr, "Invalid action mask %s\n",
                                        optarg);
                                return 1;
                        }
                        event_mask_tmp |= i;
                        break;
                case 'A':
                        if ((sscanf(optarg, "%x", &i) != 1) ||
                            !valid_act_opt(i, 5)) {
                                fprintf(stderr,
                                        "Invalid set action mask %s/0x%x\n",
                                        optarg, i);
                                return 1;
                        }
                        event_mask_tmp = i;
                        break;

The above is boilerplate code used to parse event masks. Event masks can be used to tell the tracer to trace only an explicit set of events. Using these two options, the user can specify either an explicit event mask as a hex value or a series of events to trace using the event id names or short names. See the file
event_mask.h
for an example of how to define the
short names. See vfstrace_api.h
for the event id
definitions for vfstrace. Basically if you set your events up in the
same way, your users will be able to specify which events to trace
using -a options e.g. -a READ -a OPEN or -a __TRACE_READ -a
__TRACE_OPEN.
Of course, if masks aren't used or implemented, everything will be traced.
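As an illustration only, here's a rough sketch of what the short-name table in event_mask.h might look like; the struct layout and initializers below are guesses patterned on the blktrace equivalent, not code lifted from the vfstrace example, and they assume the __TRACE_* event ids are single-bit values that can be or'ed into a mask:

/* Hypothetical sketch - not the actual event_mask.h contents */
struct mask_map {
        int mask;               /* event id bit, e.g. __TRACE_READ */
        char *short_form;       /* name accepted by -a, e.g. "READ" */
        char *long_form;        /* full name, e.g. "__TRACE_READ" */
};

static struct mask_map mask_maps[] = {
        { __TRACE_OPEN,     "OPEN",     "__TRACE_OPEN"     },
        { __TRACE_CLOSE,    "CLOSE",    "__TRACE_CLOSE"    },
        { __TRACE_READ,     "READ",     "__TRACE_READ"     },
        { __TRACE_WRITE,    "WRITE",    "__TRACE_WRITE"    },
        { __TRACE_SENDFILE, "SENDFILE", "__TRACE_SENDFILE" },
};

find_mask_map() would then look the user's -a argument up in this table and return the corresponding mask bit.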
                case 'r':
                        ti->debugfs_path = optarg;
                        break;
                case 'o':
                        ti->output_name = optarg;
                        break;
The -o option allows the user to specify the output file name. The most interesting use of this option is to force all output to stdout:
./vfstrace -o -
This allows the utt tracer to pipe its output to an utt parser, for 'live tracing':
./vfstrace -o - | ./vfsparse -i -
In this case, instead of being saved to disk, the trace data is immediately displayed to the user. Of course, that doesn't have to be the end of the pipeline - yet another post-processing script could search the parser output and alert the user if unusual conditions appear in the output, or the data could be fed into a visualization application for graphical display for instance. Making the tracer and parser relatively simple and modular allows for almost endless possibilities.
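For example, a hypothetical pipeline (the grep stage is just an illustration, not part of the vfstrace example) could filter the live output on the fly:

./vfstrace -o - | ./vfsparse -i - | grep OPEN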
The rest of the options are basically just simple attribute options for the trace.
        . . .
        }

        setlocale(LC_NUMERIC, "en_US");

        if (ti->net_mode == Net_server) {
                if (ti->output_name) {
                        fprintf(stderr, "-o ignored in server mode\n");
                        ti->output_name = NULL;
                }
                return utt_net_server(ti);
        }

One of the most powerful features of an utt tracer is the ability to send the trace data to another utt tracer over the network. To do this, the tracer that's the destination for the trace data acts as the server, e.g.
./vfstrace -l
The other side is the system being traced:
./vfstrace -h servername
All of the trace data generated by the system being traced will be sent to the server, which saves the trace files in a dynamically generated directory that looks something like this:
127.0.0.1-2008-06-19-02:11:17

which will contain the trace files from the system being traced, when the trace has completed. The parser can then be used on these trace files, just as it would on the local system.
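For instance, given the per-cpu file naming described earlier, the contents of that directory on a two-cpu traced system would look something like this (hypothetical listing):

ls 127.0.0.1-2008-06-19-02:11:17/
tracer.vfs.0  tracer.vfs.1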
The code above basically sets up the tracer to act as the server.
        if (event_mask_tmp != 0)
                event_mask = event_mask_tmp;

        if (utt_check_debugfs_path(ti))
                return 1;

        if (kill_running_trace) {
                stop_trace();
                return 0;
        }

        ti->ncpus = sysconf(_SC_NPROCESSORS_ONLN);
        if (ti->ncpus < 0) {
                fprintf(stderr, "sysconf(_SC_NPROCESSORS_ONLN) failed\n");
                return 1;
        }

        signal(SIGINT, handle_sigint);
        signal(SIGHUP, handle_sigint);
        signal(SIGTERM, handle_sigint);
        signal(SIGALRM, handle_sigint);
        signal(SIGPIPE, SIG_IGN);

        if (ti->net_mode == Net_client && utt_net_setup_client(ti))
                return 1;

If we're running in client network mode, the utt_net_setup_client() call is used to make the connection to the server.
        if (start_trace(ti) != 0)
                return 1;

The start_trace() call does several things to start the trace. It first adds a channel to the tracer using utt_add_channel(). A tracer can handle per-cpu trace input from any number of channels at the same time; in the case of vfstrace it uses only one, named "tracer". Thus the trace files will be named tracer.vfs.0, tracer.vfs.1, etc. It then writes values to debugfs 'control' files that have been set up by the kernel module for the purpose of parameterizing the trace. When it writes to the "create" file, the kernel module creates the relay channel buffers and the kernel is ready to start tracing. Before we actually kick off the tracing, though, we create our application's per-cpu relay reading threads using utt_start_threads(). Finally, when we write to the "start" file, the kernel starts tracing and therefore filling the relay buffers. Just after that, we set our own internal tracing flag and the trace has officially started. Here's the start_trace() function in its entirety:
static int start_trace(struct trace_info *ti)
{
        int err;

        err = utt_add_channel(ti, "tracer", buf_size, buf_nr);
        if (err)
                goto out;
        err = control_write("buf_size", buf_size);
        if (err)
                goto out;
        err = control_write("buf_nr", buf_nr);
        if (err)
                goto out;
        err = control_write("event_mask", event_mask);
        if (err)
                goto out;
        err = control_write("create", 1);
        if (err)
                goto out;
        err = utt_start_threads(ti);
        if (err)
                goto out;
        err = control_write("start", 1);
        if (err)
                goto out;
        set_tracing(1);
out:
        return err;
}

Back to the main() code:
        atexit(stop_trace);

        if (stop_watch)
                alarm(stop_watch);

        utt_wait_for_threads(ti);

Here, we call utt_wait_for_threads(), which waits until tracing is done and all of the relay reading threads have exited, after the user has pressed ctrl-c. When the user presses ctrl-c, the "stop" control file is written, which stops the tracing in the kernel. This is followed by a call to utt_set_tracing_done(), which causes the reading threads to exit their processing loops.
        utt_exit_trace(ti);
        stop_trace();
        utt_show_stats(ti);
        utt_free_trace(ti);
        destroy_trace();

utt_exit_trace() closes the threads; stop_trace() is there in case the trace wasn't actually stopped yet. utt_show_stats() displays the number of events and the size of the event data logged for each cpu in each channel. It also displays the number of events that were dropped because the tracer received too much data to log to disk or elsewhere (e.g. the user needs to increase the buffer size or turn off journaling, etc.). Finally, the trace object is destroyed and the tracer exits.
        return 0;
}
The parser requires more application-specific information than the
tracer does in order to make sense of the data, so the parser code is
by necessity somewhat more involved. It is however pretty
straightforward - the parts that can't be lifted as-is from the vfsparse.c code can be easily adapted from its application-specific parts.
As with the tracer, we'll go over most of the parser code to show the API functions used and how the parsing engine basically works, but first we'll go over the parser code that you'll need to write for any application, assuming that that's what someone writing a new application would be interested in right away.
But first, an overview of the parsing process:
The parser basically reads the per-cpu trace files for all the channels generated by the tracer (or the trace data piped from the tracer in live mode) into sorted batches of event objects which are later passed back in the correct order to the application for display. Both the reading of event data into event objects and their sorting and display are implemented or helped out by application-specific callbacks defined in a utp_ops struct object.
For a particular per-cpu file or pipe, while data remains in the trace stream, utp_do_parse() calls the utp_ops read_event() callback function, whose responsibility it is to capture the next event's data in a dynamically allocated event object of the client's choosing. The parser saves up batches of these events and sometime later invokes client callbacks to display the data using the utp_ops print_event() callback.
The client's read_event() implementation uses its knowledge of the application's events to allocate event objects that hold the event data and uses the utp_read_data() API function to read the correct amount of data into the allocated object. It's also responsible for checking the integrity of the retrieved data (by checking the magic value in the event using the bad_magic() helper function) and that the version number contained in the magic value is supported by the parser (by using the verify_trace() helper function). Finally, if the trace data is being read on a system with a different endianness from the system it was generated on, read_event() also needs to implement and call a function that will switch the endianness of the data contained in the event object it passes back. As you can see, the client's read_event() needs to do quite a lot, but it's all pretty straightforward and can be easily adapted from existing examples such as the vfstrace example parser, vfsparse.c.
When the parser receives an event from read_event(), it puts it into an internal list or tree data structure along with the other currently outstanding events, which it will later process and send to the application in the proper order. In order to determine the correct order for an event, the parser will or may invoke other utp_ops callbacks such as get_event_time() (always required), get_event_cpu() (always required), get_event_seq() (optional), and get_channel_id() (optional) to determine the event's position relative to other events. It may also call the set_event_time() callback (required) to adjust the time contained within an event. These callbacks are very simple to implement - basically they return (or, in the case of set_event_time(), set) the value of the respective field within the event struct - and are documented elsewhere.
To display the data, the parser calls the client's utp_ops print_event() implementation for each event in the batch of events it's ready to display. The processing print_event does should be relatively simple e.g. cast the void * passed in to the proper event type and print the fields to the output file (in the parse_info.ofp member). One callback of interest related to printing is print_channel_id(). This is optional for the client to implement and normally does nothing, but some apps may wish to specially encode a channel id and print it in a special way e.g. the blktrace implementation encodes device major and minor numbers in the channel id.
Here's the vfsparse application's implementation of read_event(). Basically, when the parsing engine calls this, it's asking for the next event in the given event stream in the form of an opaque user-allocated data object. This object will at later times in the parse be passed back to the application for it to print or satisfy queries about the data contained within it.
The read_event() implementation is passed the file descriptor to read from, and at this point knows how much to read to determine the size of the object and in turn create the object. To actually read the data from the current stream, the implementation should use the utp_read_data() API function. For the vfstrace application, every event starts with a common header, described by vfs_common_trace_rec. Our implementation first attempts to read that amount of data; if it's not yet available, it just returns and tries again next time. If it's successful, it casts the buffer into a vfs_common_trace_rec and checks the endianness of the data. If necessary, it later converts it to the local system's endianness (note that the parser has to also implement that conversion, but it's simple and will be covered shortly). It then checks the magic number to make sure we have valid data at this point, and then goes on to determine the exact length of the event the data corresponds to. It does this by checking the id field in the common header, which is unique to each event. Given the event type, it can then determine the length of the event. If the event doesn't have any variable-length data in it, it's simply the length of the corresponding event struct. If it does have variable-length data, the application should read the variable data as well so that it can be included in the returned event object. In the case of vfstrace, the __TRACE_OPEN event contains a variable-length filename whose length is contained in the pdu_len field of the common record. We use this length to make another utp_read_data() call to retrieve the rest of the data for the event. Now that we know the length of the event, we allocate an object to give to the parser, fill it in with the data we've read, do the endianness conversion and integrity check, and hand it over to the parse engine.
Here's the vfsparse read_event() implementation in its entirety:
void *vfsparse_read_event(struct parse_info *pi, int fd, int should_block,
                          int *fdblock)
{
        int pdu_len, ret;
        int len, var_len, common_len = sizeof(struct vfs_common_trace_rec);
        struct vfs_common_trace_rec *r;
        void *event;
        struct data_info *di = &pi->di;

        ret = utp_read_data(fd, recbuf, common_len, should_block, fdblock);
        if (ret)
                return NULL;

        r = (struct vfs_common_trace_rec *)recbuf;

        /*
         * look at first trace to check whether we need to convert
         * data in the future
         */
        if (di->data_is_native == -1 && check_data_endianness(di, r->magic))
                return NULL;

        if (bad_magic(di, r->magic))
                return NULL;

        switch (r->id) {
        case __TRACE_OPEN:
                pdu_len = get_pdulen(pi, r);
                len = sizeof(struct vfs_open_trace_rec) + pdu_len;
                break;
        case __TRACE_CLOSE:
                len = sizeof(struct vfs_close_trace_rec);
                break;
        case __TRACE_READ:
        case __TRACE_WRITE:
                len = sizeof(struct vfs_readwrite_trace_rec);
                break;
        case __TRACE_SENDFILE:
                len = sizeof(struct vfs_sendfile_trace_rec);
                break;
        default:
                return NULL;
        }

        event = malloc(len);
        memcpy(event, r, common_len);

        var_len = len - common_len;
        if (utp_read_data(fd, event + common_len, var_len, 1, fdblock)) {
                free(event);
                return NULL;
        }

        trace_to_cpu(di->data_is_native, event);

        if (verify_trace(di, r->magic)) {
                free(event);
                return NULL;
        }

        return event;
}

When the parser is done with the event we allocated, it will call us back with the event object for us to free, using the utp_ops free_event() function. Here's our implementation:
void vfsparse_free_event(void *event)
{
        free(event);
}

As mentioned in the read_event() implementation, the event passed back to the parser should be converted to the endianness of the system we're running on. Since the application is the only one that knows about the data contained in the event, it needs to do this itself. Below is the helper function vfsparse.c uses to accomplish this for all the events it knows about; other applications should be able to adapt it to do the same for their events.
If the endianness of the event data is the same as that of the system the parser is running on, there's nothing to do and the function just returns. If the data's byte order differs from the local system's, however, each field that needs it must be converted using the endian macros defined in utt-common.h. These take the form beXX_to_cpu and convert big-endian (network byte-order) data to the local CPU's byte order. The vfstrace conversion helper function is shown below; other applications can easily follow the same pattern for their own event types.
static char recbuf[1024];

static void trace_to_cpu(int data_is_native, void *rec)
{
        struct vfs_common_trace_rec *t = rec;

        if (data_is_native)
                return;

        t->magic = be32_to_cpu(t->magic);
        t->id = be16_to_cpu(t->id);
        t->cpu = be32_to_cpu(t->cpu);
        t->sequence = be32_to_cpu(t->sequence);
        t->time = be64_to_cpu(t->time);
        t->pid = be32_to_cpu(t->pid);
        t->pdu_len = be16_to_cpu(t->pdu_len);

        switch (t->id) {
        case __TRACE_OPEN: {
                struct vfs_open_trace_rec *r = rec;
                r->fd = be32_to_cpu(r->fd);
                break;
        }
        case __TRACE_CLOSE: {
                struct vfs_close_trace_rec *r = rec;
                r->fd = be32_to_cpu(r->fd);
                break;
        }
        case __TRACE_READ:
        case __TRACE_WRITE: {
                struct vfs_readwrite_trace_rec *r = rec;
                r->fd = be32_to_cpu(r->fd);
                r->count = be32_to_cpu(r->count);
                r->ret = be32_to_cpu(r->ret);
                break;
        }
        case __TRACE_SENDFILE: {
                struct vfs_sendfile_trace_rec *r = rec;
                r->out_fd = be32_to_cpu(r->out_fd);
                r->in_fd = be32_to_cpu(r->in_fd);
                r->start_ppos = be64_to_cpu(r->start_ppos);
                r->start_count = be32_to_cpu(r->start_count);
                r->start_max = be64_to_cpu(r->start_max);
                r->end_ppos = be64_to_cpu(r->end_ppos);
                r->end_count = be32_to_cpu(r->end_count);
                r->end_max = be64_to_cpu(r->end_max);
                r->ret = be32_to_cpu(r->ret);
                break;
        }
        default:
                return;
        }
}

When the parser is ready to have the events printed, it sends them one by one to the application's print_event() utp_ops callback function. At this point the application should display the event data as it sees fit. In the case of vfstrace, it formats the event and prints it to the output file using fprintf:
static int vfsparse_print_event(struct parse_info *pi,
                                void *parse_private __attribute ((unused)),
                                void *channel_private __attribute ((unused)),
                                void *cpu_private __attribute ((unused)),
                                void *event, int text_output)
{
        struct vfs_common_trace_rec *common = event;

        if (!text_output)
                return 0;

        vfsparse_print_common(pi, common);

        switch (common->id) {
        case __TRACE_OPEN: {
                struct vfs_open_trace_rec *r = event;
                fprintf(pi->ofp, "fd: %8d fname: %s\n", r->fd,
                        (char *)r + sizeof(*r));
                break;
        }
        case __TRACE_CLOSE: {
                struct vfs_close_trace_rec *r = event;
                fprintf(pi->ofp, "fd: %8d\n", r->fd);
                break;
        }
        case __TRACE_READ:
        case __TRACE_WRITE: {
                struct vfs_readwrite_trace_rec *r = event;
                fprintf(pi->ofp, "fd: %8d cnt: %8d ret: %8d\n",
                        r->fd, r->count, r->ret);
                break;
        }
        case __TRACE_SENDFILE: {
                struct vfs_sendfile_trace_rec *r = event;
                fprintf(pi->ofp, "out_fd: %8d in_fd: %8d start_ppos: %8llu"
                        " start_cnt: %8d start_max: %8llu end_ppos: %8llu"
                        " end_cnt: %8d end_max: %8llu ret: %8d\n",
                        r->out_fd, r->in_fd, r->start_ppos, r->start_count,
                        r->start_max, r->end_ppos, r->end_count, r->end_max,
                        r->ret);
                break;
        }
        default:
                break;
        }

        return 0;
}

static void vfsparse_print_common(struct parse_info *pi,
                                  struct vfs_common_trace_rec *r)
{
        fprintf(pi->ofp, "%7s %2d %8d %5d.%09lu %5u %15s ",
                event_name(r->id), r->cpu, r->sequence,
                (int)SECONDS(r->time), (unsigned long)NANO_SECONDS(r->time),
                r->pid, r->comm);
}

static char *event_name(int id)
{
        char *name;

        switch (id) {
        case __TRACE_OPEN:
                name = "OPEN";
                break;
        case __TRACE_CLOSE:
                name = "CLOSE";
                break;
        case __TRACE_READ:
                name = "READ";
                break;
        case __TRACE_WRITE:
                name = "WRITE";
                break;
        case __TRACE_SENDFILE:
                name = "SENDFIL";
                break;
        default:
                name = "UNKNOWN";
                break;
        }

        return name;
}

The vfsparse print_event() implementation first casts the incoming pointer to the common header type, prints out the common fields, and uses the id field to determine the actual event type to cast to and print the variable event fields.
That's basically it - at this point we've achieved what we set out to do - display the binary trace stream in a human-readable form.
There are a few more minor application-specific things the parser needs to do to be complete - we'll take a moment to describe those before we move on to a description of the rest of the utp API by looking at the boilerplate code.
To support the binary option, which instead of (or in addition to) displaying the human-readable form results in the event data being saved to disk as binary data, the application should provide an implementation of the output_binary() utp_ops callback function, which does little more than determine the length of the data in the event it's passed and hand it to utp_output_binary(). This code also can be easily adapted by a new parser to dump the application's events.
static void vfsparse_output_binary(struct parse_info *pi, void *event)
{
        struct vfs_common_trace_rec *r = event;
        int pdu_len, len = 0;

        switch (r->id) {
        case __TRACE_OPEN:
                pdu_len = get_pdulen(pi, r);
                len = sizeof(struct vfs_open_trace_rec) + pdu_len;
                break;
        case __TRACE_CLOSE:
                len = sizeof(struct vfs_close_trace_rec);
                break;
        case __TRACE_READ:
        case __TRACE_WRITE:
                len = sizeof(struct vfs_readwrite_trace_rec);
                break;
        case __TRACE_SENDFILE:
                len = sizeof(struct vfs_sendfile_trace_rec);
                break;
        default:
                break;
        }

        if (len)
                utp_output_binary(pi, r, len);
}

Finally, as mentioned before, there are several callbacks called at various times throughout the parse which basically just answer the question, "here's the event object, can you use it to give me the value for this field?". For now, the values the parser needs to retrieve from an event are the event's timestamp and cpu id. These are retrieved using the utp_ops get_event_time() and get_event_cpu() callbacks. It also needs to be able to modify the timestamp and have it saved in the event, which is accomplished via the set_event_time() callback. For debugging purposes, the parser will call the get_event_seq() callback if defined. If not, it will call a default version; this callback is therefore optional.
static inline __u32 vfsparse_get_event_cpu(void *event)
{
        struct vfs_common_trace_rec *r = event;
        return r->cpu;
}

static inline __u64 vfsparse_get_event_time(void *event)
{
        struct vfs_common_trace_rec *r = event;
        return r->time;
}

static inline void vfsparse_set_event_time(void *event, __u64 time)
{
        struct vfs_common_trace_rec *r = event;
        r->time = time;
}

static inline __u32 vfsparse_get_event_seq(void *event)
{
        struct vfs_common_trace_rec *r = event;
        return r->sequence;
}

And finally, here are the utp_ops for the vfsparse example, for completeness.
static struct utp_ops ops = {
        .alloc_channel_private  = NULL,
        .alloc_cpu_private      = NULL,
        .free_event             = vfsparse_free_event,
        .get_channel_id         = NULL,
        .print_channel_id       = NULL,
        .get_event_cpu          = vfsparse_get_event_cpu,
        .get_event_time         = vfsparse_get_event_time,
        .set_event_time         = vfsparse_set_event_time,
        .get_event_seq          = vfsparse_get_event_seq,
        .read_event             = vfsparse_read_event,
        .print_event            = vfsparse_print_event,
        .output_binary          = vfsparse_output_binary,
        .show_channel_stats     = NULL,
        .show_cpu_stats         = NULL,
};
The only other thing an application would need to do outside of the boilerplate code of this example would be to change the call to utp_alloc_parse(), described below.
We'll now go over main(), where most of the API calls happen, but where basically only one line of code needs to be changed for a new application (it isn't necessary to understand how all the other utp_ API functions are used here in order to write a parser using the boilerplate code - this section is for anyone interested in the complete API):
int main(int argc, char *argv[])
{
        int c, ret;

        /* New apps should change this line. */
        pi = utp_alloc_parse("vfs", VFS_TRACE_MAGIC, VFS_TRACE_VERSION,
                             NULL, &ops);

The first line of interest is the call to utp_alloc_parse(). This function allocates a
parse_info struct, which contains the state info for the parse and is
passed along to later API calls. The first param, "vfs", is the name
of the application, and is included as the middle element in the
filenames of the per-cpu output files, the first being the channel
name, and the third being the cpu id e.g. for vfstrace, which creates
a single channel called "tracer", the per-cpu output files would have
the form "tracer.vfs.0", "tracer.vfs.1", etc. As another example,
blktrace creates a channel for each device, and its app name is
"block", so its output files would be "sda.block.X", "hdb.block.X"
etc.
The next two params are or'ed together to make up the magic number for the application, as described previously in the tracer section.
The final param is the address of the utp_ops struct defined for the application. The utp_ops contains application-specific callbacks that the application can or must define; in the case of utp, there are several, most of which were discussed previously; see the utt library README for descriptions of the others.
The next part of the code deals with command-line options. These of course are application-specific, but will probably be the same for most applications. We won't discuss them all, just the few that need explanation:
        while ((c = getopt_long(argc, argv, S_OPTS, l_opts, NULL)) != -1) {
                switch (c) {
                case 'i':
                        if (is_pipe(optarg) && !pi->pipeline) {
                                pi->pipeline = 1;
                                pi->pipename = strdup(optarg);
                        } else if (utp_add_channel(pi, optarg) != 0)
                                return 1;
                        break;
                case 'D':
                        pi->input_dir = optarg;
                        break;
                case 'o':
                        pi->output_name = optarg;
                        break;
                case 'O':
                        pi->text_output = 0;
                        break;
                case 'b':
                        pi->rb_batch = atoi(optarg);
                        if (pi->rb_batch <= 0)
                                pi->rb_batch = RB_BATCH_DEFAULT;
                        break;
                case 'q':
                        pi->per_channel_and_cpu_stats = 0;
                        break;
                case 'w':
                        if (utp_find_stopwatch_interval(pi, optarg) != 0)
                                return 1;
                        break;
                case 'v':
                        pi->verbose++;
                        break;
                case 'V':
                        printf("%s version %s\n", argv[0], vfsparse_version);
                        return 0;
                case 'd':
                        pi->dump_binary = optarg;
                        break;
                default:
                        usage(argv[0]);
                        return 1;
                }
        }

        while (optind < argc) {
                if (is_pipe(argv[optind]) && !pi->pipeline) {
                        pi->pipeline = 1;
                        pi->pipename = strdup(argv[optind]);
                } else if (utp_add_channel(pi, argv[optind]) != 0)
                        return 1;
                optind++;
        }

For each channel that we'll be parsing (if from files), we need to add the channel to the parser. Since a parser can handle multiple channels, there can be multiple channels specified on the commandline. In the case of vfsparse, there's only one channel, named "tracer" (tracer.vfs.0, etc), so our commandline would look something like this:
./vfsparse tracer
        if (!pi->pipeline && !pi->nchannels) {
                usage(argv[0]);
                return 1;
        }

        signal(SIGINT, handle_sigint);
        signal(SIGHUP, handle_sigint);
        signal(SIGTERM, handle_sigint);

        setlocale(LC_NUMERIC, "en_US");

        ret = utp_do_parse(pi);

utp_do_parse() is the heart of the parse engine and doesn't return until the parse is finished and the data has been displayed to the user.
        if (!ret)
                show_stats(pi);
At the end of the trace, it's often useful to display statistics for the parse. utp_show_channel_and_cpu_stats() can be used for this purpose.
For each channel in the trace, this function invokes the utp_ops callback function show_channel_stats() to display stats for the channel. If this function in utp_ops is null, nothing will be displayed. Typically applications will use the channel private data they allocated when the trace was created to store and update channel stats.
For each cpu in each channel in the trace, this function invokes the utp_ops callback function show_cpu_stats() to display stats for the cpu. If this function in utp_ops is null, nothing will be displayed. Typically applications will use the cpu private data they allocated when the trace was created to store and update cpu stats.
Even if the application doesn't show per-cpu or per-channel stats, this function will print skip information, which displays how many records, if any, were missing from the input. Normally this would be 0, but if the tracer couldn't keep up with the volume of data being logged, there could be missing records.
        utp_cleanup_parse(pi);

        return ret;
}

Finally, we clean up the parse and exit - we're done.
The first thing we'd normally do is instrument the kernel code to log the data we're interested in. I'll describe how the vfstrace example kernel code does that, but since there are many ways of skinning this cat, it shouldn't be taken as the only or even best way to write kernel instrumentation to be consumed by utt-based applications.
As mentioned, we'll be tracing 5 vfs 'events'; we'll show a couple of them here - see the code itself for the others.
linux-2.6/fs/read_write.c:

asmlinkage ssize_t sys_read(unsigned int fd, char __user * buf, size_t count)
{
        struct file *file;
        ssize_t ret = -EBADF;
        int fput_needed;

        file = fget_light(fd, &fput_needed);
        if (file) {
                loff_t pos = file_pos_read(file);
                ret = vfs_read(file, buf, count, &pos);
                file_pos_write(file, pos);
                fput_light(file, fput_needed);
        }

        if (fd > 2)
                vfs_read_trace(fd, count, ret);

        return ret;
}
The code we're adding is the if (fd > 2) check and the vfs_read_trace() call just before the return. We're interested in tracing all read system calls, but to avoid being overwhelmed by the standard file streams, we'll say we're only interested in fd's greater than 2. For any read we're interested in, we call a function called vfs_read_trace() and pass it the variables we're interested in logging: fd, count, and ret.
vfs_read_trace() is a macro defined in our own header file in include/linux:
linux-2.6/include/linux/vfstrace_api.h:

#define vfs_read_trace(fd, count, ret) \
        trace_mark(vfs_readwrite, "%u %u %u %d", __TRACE_READ, fd, count, ret);

As you can see, it simply creates a trace marker (see Documentation/markers.txt for info about kernel markers) using the data items we passed in, along with an 'event id', __TRACE_READ, which is a simple integer defined to identify each event type, also defined in vfstrace_api.h. vfs_readwrite is the name of the marker; it's what the probe registration code (shown below) uses to associate a handler function with the marker, and that handler will be invoked whenever the marker is activated.
As another example, we'll show the instrumentation for the open syscall, which logs the fd and the filename of the file being opened:
linux-2.6/fs/open.c:

long do_sys_open(int dfd, const char __user *filename, int flags, int mode)
{
        char *tmp = getname(filename);
        int fd = PTR_ERR(tmp);

        if (!IS_ERR(tmp)) {
                fd = get_unused_fd_flags(flags);
                if (fd >= 0) {
                        struct file *f = do_filp_open(dfd, tmp, flags, mode);
                        if (IS_ERR(f)) {
                                put_unused_fd(fd);
                                fd = PTR_ERR(f);
                        } else {
                                fsnotify_open(f->f_path.dentry);
                                fd_install(fd, f);
                        }
                }
                vfs_open_trace(tmp, fd);
                putname(tmp);
        }
        return fd;
}

And here's the vfs_open_trace macro defining the kernel marker:
linux-2.6/include/linux/vfstrace_api.h:

#define vfs_open_trace(filename, fd) \
        trace_mark(vfs_open, "%s %u", filename, fd);

Note that in this case we don't pass an explicit event id - because the vfs_open marker is only ever used for the open event, the probe handler can supply the id (__TRACE_OPEN) itself, as we'll see below.
That's pretty much the extent of the instrumentation needed in the kernel for each item of interest being logged.
Of course, the code so far only says what to trace, not how or where to trace it. That code, as well as the overall 'trace control' (start/stop tracing, etc), is contained in a separate piece of kernel code - in our case, a kernel module named vfstrace_mod.
Starting with the module init code, probe_init()
, the first thing we do is call create_trace_controls()
. This and everything related to 'control' or 'controls' is just a bunch of standard boilerplate code that you can copy and use without thinking about it. It's responsible for setting up a group of debugfs files used for managing communication with the userspace code that controls the trace. The debugfs files it creates allow the userspace code (the details of which can also be ignored) to do things like set buffer sizes, set event masks, and start and stop tracing sessions, among other things.
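For reference only, and purely as an illustration: assuming debugfs is mounted at the usual /sys/kernel/debug, and going by the control_write() calls used by start_trace() and stop_trace() earlier in this tutorial, the control directory might contain files along these lines once the module is loaded (the exact file set is a guess, not output from the example module):

ls /sys/kernel/debug/vfs
buf_nr  buf_size  create  event_mask  start  stop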
The important part of the probe_init() code is the set of marker probe registration calls, which associate handler code with each of our events. After these have completed, any execution of the instrumentation code we inserted in the kernel will result in the corresponding handler function being invoked.
vfstrace-utt-example/vfstrace-mod.c:

static int __init probe_init(void)
{
        int ret;

        ret = create_trace_controls();
        if (ret) {
                printk("Unable to create trace controls\n");
                goto err;
        }

        ret = marker_probe_register("vfs_open", "%s %u",
                                    vfs_open_probe, NULL);
        if (ret) {
                printk("Unable to register vfstrace probe vfs_open\n");
                goto err;
        }

        ret = marker_probe_register("vfs_readwrite", "%u %u %u %d",
                                    vfs_readwrite_probe, NULL);
        if (ret) {
                printk("Unable to register vfstrace probe vfs_readwrite\n");
                goto err;
        }
        . . .
}
For example, whenever any file is read or written, the handler function vfs_readwrite_probe()
will be called with the parameters we specified. vfs_readwrite_probe()
is shown below. It receives the parameters from the marker code in the form of a va_list, from which we extract the individual members we specified in the kernel marker:
void vfs_readwrite_probe(void *probe_data, void *call_data,
                         const char *format, va_list *args)
{
        unsigned int id = va_arg(*args, typeof(id));
        unsigned int fd = va_arg(*args, typeof(fd));
        unsigned int count = va_arg(*args, typeof(count));
        int ret = va_arg(*args, typeof(ret));

        if (unlikely(!(tracing & id)))
                return;

        __vfs_readwrite_trace(channel, id, fd, count, ret);
}
It then checks whether or not we're tracing this particular event (as specified by the mask passed in from userspace) and if so, calls a worker function, __vfs_readwrite_trace()
with the cooked param values:
static void __vfs_readwrite_trace(struct rchan *channel, u16 id, u32 fd,
                                  u32 count, s32 ret)
{
        struct vfs_readwrite_trace_rec *t;
        unsigned long flags;

        local_irq_save(flags);
        t = relay_reserve(channel, sizeof(*t));
        if (t) {
                __vfs_common_trace(&t->common, id, 0, 0, NULL);
                /* event-specific fields */
                t->fd = fd;
                t->count = count;
                t->ret = ret;
        }
        local_irq_restore(flags);
}
__vfs_readwrite_trace()
is where the actual logging occurs. We're doing per-cpu logging, so we first disable local irqs, then reserve a spot in the relay channel for our event, a vfs_readwrite_trace_rec
:
struct vfs_readwrite_trace_rec {
        struct vfs_common_trace_rec common;
        __u32 fd;       /* file descriptor to read/write */
        __u32 count;    /* count of bytes to read/write */
        __s32 ret;      /* count of bytes read/written, or err */
};

As you can see, it contains the fd, count and ret val that we specified in the kernel marker instrumentation, as expected, but it also contains an embedded struct, vfs_common_trace_rec, which holds data that gets logged for every event:
/*
 * VFS trace, common part of each trace record
 */
struct vfs_common_trace_rec {
        __u32 magic;              /* MAGIC << 8 | version */
        __u16 id;                 /* event id */
        __u32 cpu;                /* on what cpu did it happen */
        __u32 sequence;           /* event number */
        __u64 time;               /* in nanoseconds */
        __u32 pid;                /* who did it */
        char comm[TASK_COMM_LEN]; /* who did it (executable) */
        __u16 pdu_len;            /* length of variable data at end */
};

As you can see, the common part of the record contains several fields, which need some explanation. Some are required by utt, some are merely used by utt if present, while others are required only by the application. The fields required by utt are:
magic: This is a 32-bit number that's used to both verify the integrity of a trace record and determine, in the parser, the endianness of the data. It will also be used in future versions to determine the application responsible for parsing a given record, so it's important that each application define its own unique magic number, as such:
/* magic or'ed with version, last byte of magic should be 00 */
#define VFS_TRACE_MAGIC         0x65617500
#define VFS_TRACE_VERSION       0x01

The last byte is actually a version number which gets or'ed with the magic number - it's important for proper operation that the last byte of the magic value be 00.
time: This is a 64-bit timestamp associated with each event. The parser will use this to order events. In future versions, it will be possible to merge records from different applications into the same trace stream, so it's in the interest of applications to use the same timestamping method. Currently, that would be:
t->time = ktime_to_ns(ktime_get());
The relative ordering of any of these fields in a trace record is unimportant - part of the task of creating an utt tracer or parser is to specify where exactly within the trace record these fields can be found, which we'll do in the next steps.
There are some other fields, not required by utt, but used by utt if present:
sequence: This is a debugging field which is useful to display in traces most of the time. The 'boilerplate' code in this module sets up the sequence counters; your code can just copy the code below for filling in that field:
sequence = per_cpu_ptr(trace_sequence, cpu);
t->sequence = ++(*sequence);
cpu: The cpu the event was logged on; also useful to have displayed in traces.
The other common fields are just things we want to display in the output for this application, or are needed for userspace processing:
id: The event id, needed for us to figure out what this event is.
pid: The process id of the process that made the syscall, for display.
comm: The name (executable) of the process that made the syscall, for display.
static inline void __vfs_common_trace(struct vfs_common_trace_rec *t, u16 id,
                                      u16 pdu_off, u16 pdu_len, void *pdu_data)
{
        struct task_struct *tsk = current;
        unsigned long *sequence;
        pid_t pid = tsk->pid;
        int cpu;

        cpu = smp_processor_id();
        sequence = per_cpu_ptr(trace_sequence, cpu);

        t->magic = VFS_TRACE_MAGIC | VFS_TRACE_VERSION;
        t->id = id;
        t->cpu = cpu;
        t->sequence = ++(*sequence);
        t->time = ktime_to_ns(ktime_get());
        t->pid = pid;
        memcpy(t->comm, tsk->comm, sizeof(tsk->comm));
        t->pdu_len = pdu_len;

        if (pdu_len)
                memcpy((void *)t + pdu_off, pdu_data, pdu_len);
}

We'll discuss the pdu_len, etc. below.
After we log the common part of the trace, we simply fill in the remaining three event-specific fields of our record, re-enable irqs, and then we're done - the event has been logged:
static void __vfs_readwrite_trace(struct rchan *channel, u16 id, u32 fd,
                                  u32 count, s32 ret)
{
        struct vfs_readwrite_trace_rec *t;
        unsigned long flags;

        local_irq_save(flags);
        t = relay_reserve(channel, sizeof(*t));
        if (t) {
                __vfs_common_trace(&t->common, id, 0, 0, NULL);
                /* event-specific fields */
                t->fd = fd;
                t->count = count;
                t->ret = ret;
        }
        local_irq_restore(flags);
}

It would be instructive to also show the logging of the open event, since it's an example of logging a variable-length event.
Here's the open trace record:
struct vfs_open_trace_rec {
        struct vfs_common_trace_rec common;
        __u32 fd;       /* file descriptor opened, or err */
};

It's pretty simple, just adding the fd to the common data. The common data, however, will in this case point, via its pdu_len member, to the variable-length filename at the end of the event.
void vfs_open_probe(void *probe_data, void *call_data,
                    const char *format, va_list *args)
{
        unsigned int id = __TRACE_OPEN;
        char *filename = va_arg(*args, typeof(filename));
        unsigned int fd = va_arg(*args, typeof(fd));

        if (unlikely(!(tracing & id)))
                return;

        __vfs_open_trace(channel, id, fd, strlen(filename) + 1, filename);
}
Starting at the beginning, the vfs_open_probe() handler extracts the filename and fd from the marker, figures out the length of the filename (including the terminating null), and passes it and the length to __vfs_open_trace():
static void __vfs_open_trace(struct rchan *channel, u16 id, u32 fd,
                             u16 pdu_len, void *pdu_data)
{
        struct vfs_open_trace_rec *t;
        unsigned long flags;
        u16 pdu_off = sizeof(*t);

        local_irq_save(flags);
        t = relay_reserve(channel, sizeof(*t) + pdu_len);
        if (t) {
                __vfs_common_trace(&t->common, id,
                                   pdu_off, pdu_len, pdu_data);
                /* event-specific fields */
                t->fd = fd;
        }
        local_irq_restore(flags);
}
__vfs_open_trace() disables irqs, reserves an event that includes the pdu_len of the filename, and passes both the length and the filename to __vfs_common_trace(), which simply copies the filename to the end of the open event. The offset of the end of the fixed-size open record is also passed in, as pdu_off.
That's pretty much it for the kernel side of the tracer - the other events logged follow the same pattern, and the only thing left to describe is the cleanup and teardown of the module and controls; again, these are just the reverse of the init code and are all boilerplate.

The next development step would be to write the userspace tracing part of the code, but since we've already covered that, we're done.

Capturing the trace data in userspace and writing to disk, sending over the network, etc, barebones version

Same as normal, but don't implement any ops, i.e. the verify_event op (pass NULL for the ops param), and don't use a MAGIC number or VERSION number. Also remove anything to do with masks and write ~0L to the event_mask control file.

Example code coming soon.
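Until that example appears, here's a minimal sketch of what the barebones setup might look like, pieced together from the calls shown earlier in this tutorial (utt_alloc_trace(), utt_add_channel(), control_write(), etc.); treat it as an untested illustration rather than working code:

        /* no magic, no version, no utt_ops */
        ti = utt_alloc_trace("vfs", 0, 0, NULL);

        /* same control-file sequence as start_trace(), but trace everything */
        err = utt_add_channel(ti, "tracer", buf_size, buf_nr);
        err = control_write("event_mask", ~0L);
        err = control_write("create", 1);
        err = utt_start_threads(ti);
        err = control_write("start", 1);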