Testing with Traces?

By: on November 30, 2016

Most APIs and type signatures are hopelessly inadequate for capturing and describing a model. For example, consider a map and the signatures for put and get. Even if you have pure functional type signatures, the signatures on their own convey no information about what they do with a key and value during put. For this reason we have documentation, an awful lot of convention, and tests.

Tests fundamentally are trying to demonstrate an equivalence between two models. One is the implementation code, the other is test code. We may well have multiple different models in test code that focus on different attributes of the implementation code. For example one model might test the statefulness of a map implementation – that a key-value pair, once added, can be found, and once removed, cannot be found. Another model may test the relationship between the map and components the map depends on. For example, if the map wrote its contents to disk, then you might expect a put to provoke a call to some IO routine.

Equivalences between the model as achieved by implementation code and these test models is never normally proven. But if structured the right way then we can at least make reasonable soak tests that, as we run them for longer and longer, give us greater confidence that the test model is satisfied by the implementation model (the inverse is obviously not true, so this is not a symmetric equivalence). This is pretty close to property based testing, but is possibly better suited for stateful code. A property test might take some randomly generated and populated map, an operation to perform on the map, and verify that if a set of properties hold before the operation, then they also hold after the operation.

There have been a couple of cases recently where I’ve been able to do these sorts of equivalence soak tests as mini-interpreters: if you have your test model and your implementation model, then you can treat the API as an instruction set. You can then randomly generate instructions, transform both models equivalently, and make sure you get the same result from both. This keeps concerns nicely separated: you have your two models, you have a stream of instructions. And of course the stream of instructions is never truly random: just seed the random number generator with the current time and record that seed before each run and everything should be deterministic. I recently did an implementation of a LinearHash map which uses this approach to demonstrate an equivalence with Go’s native map.

I really like this approach because: a) nice separation of concerns; b) you don’t have to write thousands of lines of code: a small amount of code will go a long way, and less code means fewer bugs; c) it should still be deterministic; d) the longer you run it, the more code paths will be covered, so the more confidence you can have in the equivalence.

But as the implementation code that’s being tested gets bigger and more complex, building a mini model in your test code that captures any useful behaviour becomes harder and harder. Particularly in large distributed systems, which certainly include micro-services. So I’ve come to the conclusion that what I’m after in such systems is that the model is the communication patterns and the causal relationship between such communications. So, in order to provoke events, you would need a load generator, and probably some sort of chaos monkey to kill different things off from time to time, but then you need a lot of tracing throughout your code base, and you need to be able to capture the causal relationship between messages. In a simple and slow single-threaded thing, physical time may be sufficient, but for distributed systems you probably want to track causality directly.

Some of the tracing systems that are appearing look interesting (e.g. zipkin, appdash, lightstep etc) and might be viable for use in this way, but they tend to be aimed more at tracing the arrival of a request with a micro-service, and less about just one method calling another. Several of them seem to only go down to microsecond resolution, and if you’re doing method calls then you’ll want nanoseconds. Also, we don’t necessarily require real-time collection and collation of trace logs and the volume may well be too high to achieve that. Instead, we could run a load generator for five minutes, and then collect and collate after the event.

So then the test would be a set of properties that you would verify over the global trace logs. They would make statements essentially about how information is expected to propagate: the fact that x has occurred should provoke both y and z to happen as a result; or if x has happened then y should definitely not happen as a result. The fact that this is pretty similar to how jepsen analysis the results of tests against databases is probably no surprise given how much time I spend thinking about and working on databases.

You can then think of feedback mechanisms which are similar to some of the fuzz testing techniques: we could observe which test input provoked which communications, and so by varying that try to find ways to break the properties. This is where it all slightly links back to the earlier discussion of a stream of instructions being interpreted. This is also all getting quite close too to concolic testing but without the symbolic execution bits.

Now I would be keen for such a system to exist but I’ve not found anything that matches my desires just yet. There could be good reasons why such a system would not work well or would not be useful but I’m struggling to find them. Please comment if you can think of them. The reason I think it would be useful is because you could chain together quite naturally sequences of actions that represent the user’s path. For example, logging in, accessing various resources, adding items to a shopping basket etc. These actions then compose nicely and you could have parallel streams of instructions for different users, so hopefully the load will be both reasonably representative but also more exhaustive than actual production. Given I’m already committed to boiling one ocean in my spare time, I doubt I’ll be able to start down this path on my own, but this approach does strike me as having the potential to be generally and widely of use.


Leave a Reply

Your email address will not be published.

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>