This is based on my first-hand involvement in the audit attempt at structured logging, and on the results of documenting the audit records that were actually sent by various programs. It is rather long, I’m afraid—but please read at least the “issues” section; these lessons were learned the hard way, and any plan that ignores them will fail.
- Any generic log parsing tool WILL need to handle unstructured text. Always.
- Any structured logging format WILL be used inconsistently.
- There WILL NOT be a universal field name namespace.
- You can only “trust” data that always comes from the same piece of code.
- Most log analysis code must specifically know what record it is working with.
- Structured logging formats cannot deliver what they promise.
- A potential recommendation:
- Do not define new structured logging formats.
- Define a universal field syntax. Make it trivial to use.
- Treat unstructured text as the primary format, and optimize its use.
- … but is it really worth it?
A. Any generic log parsing tool WILL need to handle unstructured text. Always.
Right now we have thousands of Linux software packages that output unstructured text. This, realistically, will never change, and note that incremental moves to structured logging do not matter at all (even if only 5% of the important log messages are unstructured text, the tools still need to handle unstructured text).
Even if we magically managed to convert every package we have to use a specific structured logging format within the next 5 years, a few other new logging infrastructures would appear in the meantime (as we currently have Java’s log4j, Python’s logging module, or the proposed CEE standard); these infrastructures would certainly not be 100% compatible with the chosen structured logging format (carrying either more or less information). In practice, the only format everybody can agree on is unstructured text.
Another aspect is logging in “can’t happen” situations, which is one of the places where logging is most valuable. Programmers just won’t use a structured logging API in that situation; you can probably persuade a person to replace
assert(index >= 0);
with
my_assert(index >= 0, "index file corrupted");
but we can’t expect that they will write 5 lines of code to properly log this in a structured format; they’re, if anything, more likely to return to the plain assert()—after all, the situation “cannot happen”, can it?
B. Any structured logging format WILL be used inconsistently.
Bitter experience indicates that even programmers that use a structured logging API instead of unstructured text WILL use it in ways that break the promises of structured logging:
- Using a single “message” string instead of breaking it down to pieces (so even a reader of “structured” log information must handle arbitrary text!).
- Writing UIDs into “user name” fields.
- Using three different field names for a single conceptual field (even though a helper function existed that writes the field correctly for an application).
All of this happened while audit patches were being written by a relatively small group of people (rather than by various upstreams reading documentation and doing their best), and many of the patches went to a single mailing list for review. It is possible that another group, or another time, would be different, but I don’t really believe it.
C. There WILL NOT be a universal field name namespace.
The only point of structured logging is to be able to parse the messages and extract some useful data (or at least useful statistics) without encoding specific knowledge about the message into the program. For example, one would like to list all messages related to a specific host name.
On a purely structural level, there will be records that simply specify HOSTNAME=foo. Then there will be records that specify SERVER_HOSTNAME=bar. Or perhaps even records that specify MULTICAST_DESTINATION_HOSTNAMES=[bar1,bar2,bar3] (using some convention or mechanism to express multi-valued fields). And, looking at “B.” above, there will also be records that specify HOST=foo. Obviously, doing any log aggregation based on field names will fail. (Adding “field types” doesn’t solve this problem—field types will have to be user-extensible, and thus will also be used inconsistently.)
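To make the failure concrete, here is a small sketch (the field names are the illustrative ones from the examples above, not any real standard) of trying to aggregate “all records about host foo” by field name:

```python
# Records already parsed into field dictionaries; field names are invented.
records = [
    {"HOSTNAME": "foo", "MSG": "login ok"},
    {"SERVER_HOSTNAME": "foo", "MSG": "connection accepted"},
    {"HOST": "foo", "MSG": "ping"},
    {"MULTICAST_DESTINATION_HOSTNAMES": ["foo", "bar"], "MSG": "announce"},
]

# A naive aggregator that only knows the "obvious" field name:
naive = [r for r in records if r.get("HOSTNAME") == "foo"]
assert len(naive) == 1  # three of the four relevant records are silently missed

# A correct aggregator must enumerate every known spelling and convention:
def mentions_host(record, host):
    for name in ("HOSTNAME", "SERVER_HOSTNAME", "HOST"):
        if record.get(name) == host:
            return True
    return host in record.get("MULTICAST_DESTINATION_HOSTNAMES", [])

assert sum(mentions_host(r, "foo") for r in records) == 4
```

Note that `mentions_host()` is exactly the kind of record-specific knowledge that structured logging was supposed to make unnecessary.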
Looking at the host name example from a semantic level, there will be records that specify HOSTNAME=foo, and records that specify the same machine by its IP address (say, HOSTNAME=192.0.2.1); if DNS is temporarily down, the same program may actually generate both of these records in response to the same kind of event! Then there’s the whole mess of IP addresses that change over time, or are over time reused by different hosts.
D. You can only “trust” data that always comes from the same piece of code.
There are three general categories of data that are usually “trustable”:
- Fields that are always attached to the log record by the logging mechanism and recorded “out of band”.
- Fields in records generated within a single application (only if you can identify the application that has generated the record)
- Fields generated by a library helper (but people WILL emulate the helper instead of using it, sooner or later)
Any other conventions and de-facto standards will, sooner or later, be violated in a way that makes the result of parsing the logs incorrect.
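The first category can be illustrated with a minimal sketch, assuming a hypothetical logging mechanism that attaches metadata itself (the field names here are invented for illustration):

```python
import os
import time

def emit(message):
    """Sketch of a logging mechanism that records out-of-band metadata.

    The application supplies only `message`; the timestamp and process ID
    are attached by the logger itself, so a parser can trust them even if
    the message text lies about them.
    """
    return {
        "_TIMESTAMP": time.time(),  # recorded by the logger, not the app
        "_PID": os.getpid(),
        "MESSAGE": message,         # arbitrary, untrusted text
    }

record = emit("pid=99999 something happened")  # the text claims a bogus PID
assert record["_PID"] == os.getpid()           # the out-of-band field is authoritative
```

The key property is that the trusted fields come from a single piece of code (the logger), never from the applications writing the messages.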
E. Most log analysis code must specifically know what record it is working with.
The set of “trustable” data in logs is very limited (“D.”), so log records will be inconsistent (“B.”, “C.”). Any application that wants to make sense out of the data will have to know the specifics—both on the structural level, e.g. field names, and on the semantic level, e.g. which fields with the same “type” are relevant (should MULTICAST_DESTINATION_HOSTNAMES above be included or excluded?), and what alternative sources of the data exist.
F. Structured logging formats cannot deliver what they promise.
Because of “E.”, log analysis is not as much a matter of code as a matter of information—what are the record formats, which fields are relevant, what transformations must be applied. This is a lot of work that isn’t automatable, and the major part is collecting the data (finding all relevant programs and looking at the records they generate), not parsing; once you find the code that generates a specific record, writing a parser is usually quite simple.
Having a simpler or more efficient parser is somewhat valuable (see below), but that’s nowhere close to the idea of logging “directly into a database”: the database will contain nonsensical tuples instead of structured data.
A potential recommendation
Do not define new structured logging formats.
See “F.” above. The logging format change is costly, and does not deliver the desired benefit. Having two parallel logging mechanisms is particularly damaging.
Define a universal field syntax. Make it trivial to use.
The above argues that “structured logging” in its full extent (new formats, field semantics) does not make sense. What does make sense, however, is to define a convention for record formatting (while always admitting that it is only a convention, see “A.”) that allows simple parsing into fields (using a single-pass lexer instead of record-specific, and potentially backtracking, regexps).
There are two components to this:
The field syntax must be truly universal—i.e. it must be possible to use it in any existing logging system. In particular, ASCII-only and no embedded newlines.
The field syntax should be trivial to use—writing it must not be more difficult than using syslog(), and parsing it must not be more difficult than indexing a hash value. All we really need is something like:
- A record contains only name=value pairs
- A syntax for “name” is $some_simple_regexp.
- Some simple syntax for “value” that allows escaping of non-ASCII/binary data, and is not corrupted or re-escaped by any existing logging mechanism.
Perhaps the easiest write-side interface would be to extend syslog(), or to provide a similar alternative, e.g.
UFsyslog("foo=%b bar=%B", stringvalue, binaryvalue, binaryvaluesize);
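As a sketch of what such a write-side helper might do internally, here is a minimal Python version; the escaping convention (`\xNN` for non-printable bytes, quoted values) is invented for illustration, not a proposal:

```python
# Sketch of the write side, assuming one possible escaping convention:
# printable ASCII other than '"' and '\' passes through; everything else
# becomes \xNN. The convention itself is an illustration, not a spec.
def escape_value(data: bytes) -> str:
    out = []
    for b in data:
        if 0x20 <= b < 0x7F and b not in (0x22, 0x5C):  # printable, not '"' or '\'
            out.append(chr(b))
        else:
            out.append("\\x%02x" % b)
    return '"' + "".join(out) + '"'

def format_record(**fields) -> str:
    # One line, ASCII only, no embedded newlines: safe to hand to syslog()
    return " ".join("%s=%s" % (name, escape_value(value))
                    for name, value in fields.items())

line = format_record(foo=b"hello", bar=b"\x00\nbinary")
assert line == 'foo="hello" bar="\\x00\\x0abinary"'
```

The point is that the writer never has to think about escaping or record layout; it hands over names and raw bytes.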
The parser-side interface is even more important; probably the best way to persuade log writers to use the field syntax is to make this field syntax most convenient to use in parsers, and to let parser developers pressure authors of applications that write logs to use this syntax.
However, the caveat from E. still applies: parsing is not the difficult part, and persuading anyone to choose a syntax only to make parsing easier will be a hard sell.
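To illustrate how simple the read side can be under such a convention, here is a sketch of a single, generic parser for a hypothetical name="value" syntax with `\xNN` escapes (the syntax is invented for illustration):

```python
import re

# One generic pattern for every record, instead of per-record regexps:
# a name, then a quoted value whose characters are either plain
# (not '"' or '\') or a \xNN escape.
PAIR = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\x[0-9a-f]{2})*)"')

def parse_record(line: str) -> dict:
    fields = {}
    for name, raw in PAIR.findall(line):
        # Undo the \xNN escaping to recover the original bytes/characters.
        fields[name] = re.sub(r'\\x([0-9a-f]{2})',
                              lambda m: chr(int(m.group(1), 16)), raw)
    return fields

parsed = parse_record('foo="hello" bar="\\x00\\x0abinary"')
assert parsed == {"foo": "hello", "bar": "\x00\nbinary"}
```

This is the entire parser; compare that with maintaining one hand-written regexp per record format.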
Treat unstructured text as the primary format, and optimize its use.
As argued above, any log parser will have to handle arbitrary text. The system’s logging mechanism will similarly have to handle arbitrary text, both on input (compatibility with syslog()) and output (human-readable output, compatibility with existing parsers, probably compatibility with existing remote logging protocols). The only reasonable way to store text is as text, so it is natural to treat the text as the primary information format.
However, nothing prevents the logging mechanism from being smart about its use of text:
- It can store out-of-band information about the records (e.g. a reliable timestamp, user and process ID).
- It can compress the text, either using generic data compression algorithms or by using specific strategies (e.g. if a message follows the “universal field syntax”, replace repeating field-value pairs with a backreference).
- It can build search indexes over the text: full-text indexes, indexes of the out-of-band information, and indexes of structured information for records that match the “universal field syntax”. Note that the structured indexes are necessarily advisory—looking up a specific value in the HOSTNAME index will not return “all relevant records”.
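A minimal sketch of why such an index is only advisory (the log lines and field name are invented for illustration):

```python
# An advisory HOSTNAME index over a mixed log. Only records that happen
# to use the field syntax *and* the HOSTNAME name get indexed, so an
# index lookup is a fast pre-filter, never a complete answer.
log = [
    'HOSTNAME=foo user logged in',   # indexed
    'connection from foo refused',   # free text: invisible to the index
    'HOST=foo rebooted',             # different field name: also invisible
]

index = {}
for pos, line in enumerate(log):
    for token in line.split():
        if token.startswith("HOSTNAME="):
            index.setdefault(token[len("HOSTNAME="):], []).append(pos)

assert index["foo"] == [0]                 # the index finds one record...
assert sum("foo" in l for l in log) == 3   # ...out of three that mention foo
```

An analysis tool can use the index to narrow its search, but must still be prepared to scan the text for records the index cannot see.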
In this approach, we do not come anywhere close to fulfilling the hope of “structured logging”—to get meaningful data out of the logs it is still necessary to know exactly what record we are looking for.
Most of the records ever written will remain essentially “write-only”—programs will write them, no programs will specifically analyze them, a human may look at them from time to time when there is a problem.
For the records that are not “write-only”, however, application (log writer/parser) developers will have an easier life: they will be able to write data without worrying about escaping, they will have a library that does the log parsing for them, and they may be able to take advantage of an index that allows them to find the relevant records quickly.
… but is it really worth it?
So, where are we? The grand idea of “structured logging” deserves to be abandoned. In its place, we have a suggested convention that may or may not be universally adopted, and an idea for a library that may be used by a few applications.
Is implementing it worth it? That really depends on the number and complexity of applications that want to regularly parse logs. How many are there? I would argue that there are very few within Linux distributions: most data that needs to be parsed is actually stored in specialized databases; offhand I can only think of tools like denyhosts, which have already built the necessary regexps to handle existing logs and are unlikely to grow much over time, so the benefit of the new field formatting convention and library would be minimal, if any. The logging daemons would perhaps benefit from some additional metadata (for remote logging, log rotation and so on), but that metadata can be pretty much invisible to the rest of the system.
Outside of Linux distributions, there is large value in log aggregation (e.g. splunk)—but, as argued above, “structured logging” is not the answer, and the field formatting convention is only a minor improvement when compared to the amount of work necessary to collect information about record semantics. Also, any log aggregation tool shipped separately from a Linux distribution will very likely have to handle Windows and other platforms, which restricts the benefits from any Linux-only log format.
Given all this, is there a good reason to work on the field syntax convention and libraries? I don’t know, but I’m leaning towards “not really.”