Comparing Theories to more traditional testing

My old work colleague Tim has recently blogged about using NSpec to specify a stack.

NSpec has the same sort of functionality as a unit testing framework such as NUnit. The terminology has been changed to get over the roadblock that some people have in adopting tests.

Theories give us something over and above normal unit testing, and that’s what I’m going to look at in this blog post. I’ll take Tim’s example and show how using theories differs from his more traditional approach.

The stack interface for which the implementation was arrived at via speccing is as follows:


public class Stack<T>
{
    public Stack();
    public void Clear();
    public bool Contains(T item);
    public T Peek();
    public T Pop();
    public void Push(T item);

    // Properties
    public int Count { get; }
}
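For reference, and so the later snippets have something concrete behind them, a minimal implementation satisfying this interface might look like the following. This is my own list-backed sketch, not Tim's implementation (his was backed by a fixed-size array):

```csharp
using System;
using System.Collections.Generic;

// A minimal sketch of an implementation satisfying the interface above.
// My own illustration, not Tim's array-backed version.
public class Stack<T>
{
    private readonly List<T> _items = new List<T>();

    public int Count { get { return _items.Count; } }

    public void Clear() { _items.Clear(); }

    public bool Contains(T item) { return _items.Contains(item); }

    public void Push(T item) { _items.Add(item); }

    public T Peek()
    {
        if (_items.Count == 0)
            throw new InvalidOperationException("Stack is empty");
        return _items[_items.Count - 1];
    }

    public T Pop()
    {
        T top = Peek(); // throws on an empty stack
        _items.RemoveAt(_items.Count - 1);
        return top;
    }
}
```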

The following tests were arrived at:


namespace Stack.Specs
{
    [Context]
    public class WhenTheStackIsEmpty
    {
        Stack<int> _stack = new Stack<int>();

        [Specification]
        public void CountShouldBeZero()
        {
            Specify.That(_stack.Count).ShouldEqual(0);
        }

        [Specification]
        public void PeekShouldThrowException()
        {
            MethodThatThrows mtt = delegate()
            {
                _stack.Peek();
            };

            Specify.ThrownBy(mtt).ShouldBeOfType(typeof(InvalidOperationException));
        }
    }
}

That’s ample for us to discuss the difference between theories and more normal testing.

For the PeekShouldThrowException test/specification, we can see from the naming of the context that the developer intends to show that for an empty stack, the Peek operation throws an exception. However, what the developer has actually shown is that calling Peek on a newly-created stack throws an exception.

Developers tend to think in fairly general terms, and express this generality by using more specific cases. However, some of this generality can get lost. Theories aim to keep more of that generality.

We can demonstrate this in a theory (don’t take much note of the syntax, just the concepts):


[Theory]
public void PeekOnEmptyStackShouldThrow(Stack<int> stack)
{
    try
    {
        stack.Peek();
        Assert.Fail(ExpectedExceptionNotThrown);
    }
    catch (InvalidOperationException) { }
}

As written, this states that calling Peek() on ANY stack should fail; we need to show that it holds only for an empty stack. We could do this by simply checking inside the theory:


[Theory]
public void PeekOnEmptyStackShouldThrow(Stack<int> stack)
{
    try
    {
        if (stack.Count == 0)
        {
            stack.Peek();
            Assert.Fail(ExpectedExceptionNotThrown);
        }
    }
    catch (InvalidOperationException) { }
}

But as we’ll see in a bit, using assumptions gives us some extra feedback (again, don’t focus on the syntax).


[Theory]
[Assumption("AssumeStackIsEmpty")]
public void PeekOnEmptyStackShouldThrow(Stack<int> stack)
{
    try
    {
        stack.Peek();
        Assert.Fail(ExpectedExceptionNotThrown);
    }
    catch (InvalidOperationException) { }
}

public bool AssumeStackIsEmpty(Stack<int> stack)
{
    return stack.Count == 0;
}

This is a much more general statement than the original specification/test: we’re saying that Peek should fail for ANY empty stack.

We don’t care whether it is a newly-created stack, or a stack which has been manipulated via its public interface. Also, the Liskov Substitution Principle states that we should be able to use any class derived from Stack, and the theory should still hold.

We validate this theory with example data, in much the same way as when we’re doing test-driven development. The extra power comes from the generality in the way that the theory is written – we can imagine a tool that performs static code analysis on the Stack class to confirm that it obeys this.

However, the literature mentions that the most likely way to validate a theory is via an exploration phase, via a plug-in tool that will try various combinations of input data to look for anything that fails the theory.
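To make the idea of an exploration phase concrete, here is a rough sketch of what such a tool could do for our Peek theory: generate random sequences of public operations, and for any stack that ends up satisfying the assumption (empty), check that Peek() throws. The `StackExplorer` class and its structure are my own invention, not an existing tool; it runs against the BCL stack, which does throw InvalidOperationException on an empty Peek:

```csharp
using System;

public static class StackExplorer
{
    // Crude exploration sketch: build stacks via random sequences of public
    // operations; for every stack meeting the assumption (Count == 0),
    // check the theory that Peek() throws InvalidOperationException.
    // Returns the number of cases that actually met the assumption.
    public static int Explore(int iterations, int seed)
    {
        Random random = new Random(seed);
        int casesChecked = 0;
        for (int i = 0; i < iterations; i++)
        {
            var stack = new System.Collections.Generic.Stack<int>();
            int operations = random.Next(0, 10);
            for (int op = 0; op < operations; op++)
            {
                if (stack.Count > 0 && random.Next(2) == 0)
                    stack.Pop();
                else
                    stack.Push(random.Next(100));
            }

            if (stack.Count != 0)
                continue; // assumption not met for this input; skip it

            casesChecked++;
            try
            {
                stack.Peek();
                throw new Exception("Theory falsified: Peek() did not throw on an empty stack");
            }
            catch (InvalidOperationException) { /* theory holds for this case */ }
        }
        return casesChecked;
    }
}
```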

It is prohibitively expensive to explore every possible combination of inputs: imagine all the possible values of a double, or, in our example, the infinite number of operation sequences that could have happened to the stack that gets passed in.

This fits in nicely with the name “theory” and its parallels with science: it’s not feasible to prove a theory, but we can look for data to disprove it.

The example data is important for the red-green-refactor cycle. The exploration phase sits outside that – it finds which input data doesn’t fit the theory, allowing the theory to be modified. There are exploration tools in Java; I haven’t looked too much into it, but it may be possible to use Microsoft’s Pex as an exploration tool?

Before I forget, this is a possible way to specify the example data for our stack:


[Theory]
[Assumption("AssumeStackIsEmpty")]
[InlineData("EmptyStack", new Stack<int>())]
[PropertyData("EmptiedStack")]
public void PeekOnEmptyStackShouldThrow(Stack<int> stack)
{
    try
    {
        stack.Peek();
        Assert.Fail(ExpectedExceptionNotThrown);
    }
    catch (InvalidOperationException) { }
}

public List<ExampleData> EmptiedStack
{
    get
    {
        List<ExampleData> data = new List<ExampleData>();
        Stack<int> stack = new Stack<int>();
        stack.Push(2);
        stack.Push(3);
        stack.Pop();
        stack.Pop();
        data.Add(stack);
        return data;
    }
}

In my prototype extension, the assumptions are important and are validated, as they tell us something vital about the code. I think that all the information about the behaviour of the system is vital, and should be documented and validated, but there are varied opinions on the list. That’s why I’m blogging – give me your feedback 🙂

If the user changed the behaviour of Peek() such that it was valid on an empty stack (it might return a Null Object for certain generic types), then our assumption would not detect this if it was simply filtering the data: the assumption would say “Peek() fails, but only on empty stacks”, whereas Peek() would no longer fail on empty stacks. See my previous post for the behaviours I have implemented.

Notice in Tim’s implementation how his stack is hardcoded to have at most 10 items. When TDDing, we may write implementations whose limitations are less obvious in order to get our tests to pass, but forget to add the extra test cases that would expose those limitations (the process of progressively adding more and more general test cases is called triangulation).

When writing theories, the same process happens, but writing the theory as a more general statement means that a code reviewer or automated tool can see that the developer intended that we can push a new item onto ANY stack, not just a stack that contained 9 or fewer items.
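For example, a theory expressing that Push works for ANY stack would have exposed the 10-item limit, given suitable example or explored data. The sketch below is my own; it is written as a plain method against the BCL stack so that it stands alone, but in a theory framework it would carry a [Theory] attribute and example data:

```csharp
using System.Collections.Generic;

public static class StackTheories
{
    // Theory: pushing an item onto ANY stack increments Count by one and
    // makes the item the new top. No assumption restricts the prior
    // contents, so a stack already holding 10 items is a legitimate input,
    // and a hardcoded 10-item limit would be caught.
    public static void PushShouldIncrementCountAndBecomeTop(Stack<int> stack, int item)
    {
        int countBefore = stack.Count;

        stack.Push(item);

        if (stack.Count != countBefore + 1)
            throw new System.Exception("Push did not increment Count");
        if (stack.Peek() != item)
            throw new System.Exception("Pushed item is not the new top");
    }
}
```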

Any thoughts? Have I got the wrong end of the stick? If anyone found this post useful, I might fully flesh out the equivalent of Tim’s example.

Sample Theory Implementation as NUnit Extension.

There have been lots of comments bouncing around on the NUnit mailing list about what exactly constitutes a Theory, and what the desired features are, so I’ve created an NUnit extension with a sample Theory implementation – you can get it, Maslina version 1.0.0.0, from www.taumuon.co.uk/rakija

xUnit.Net implements theories but does not have any in-built Assumption mechanism (you can effectively filter out bad data, which is the same as a filtering assumption). JUnit 4.4, I think, only filters out data – it doesn’t tell us anything about the state of an assumption.

Anyway, from reading the literature on theories (see my previous blog posting), I quite like the idea of having assumptions tell us something about the code, and of having those assumptions validated.

The syntax of my addin is quite poor, and there’s not really enough validation of user input, but I’m aiming to try to do some theory-driven development (theorizing?) using it, to see what feels good and what grates.

Any feedback gratefully received (especially: is it valid to say that this is an implementation of a Theory, and is validation of assumptions useful or unnecessary fluff?)

Here is the syntax of my extension.


[TestFixture]
public class TheorySampleFixture
{
    [Theory]
    [PropertyData("MyTestMethodData")]
    [InlineData("Parity", new object[] { 1.0, 1.0, 1.0 })]
    [InlineData("Parity 2", new object[] { 2.0, 2.0, 1.0 })]
    [InlineData("Double Euros", new object[] { 2.0, 1.0, 2.0 })]
    // This does not match the assumption, and will cause this specific
    // theory Assert to fail, in which case we will get a pass overall.
    // If the unit under test were changed to somehow handle a zero exchange
    // rate, the body of the theory method would pass, but the assumption
    // would still not be met and overall we would register a failure.
    [InlineData("ExchangeRate Assumption Check", new object[] { 2.0, 1.0, 0.0 })]
    // This case will fail: there is an assumption that the dollar value is
    // not three, but passing in a value of 3 doesn't cause a failure in the
    // code, demonstrating that the assumption serves no purpose.
    [InlineData("This should fail, assumption met but no failure in method", new object[] { 3.0, 1.0, 3.0 })]
    [Assumption("ConvertToEurosAndBackExchangeRateIsNotZero")]
    [Assumption("DollarsNotThree")]
    public void MyTheoryCanConvertToFromEuros(double amountDollars, double amountEuros, double exchangeRateDollarsPerEuro)
    {
        // Should check equivalence within a tolerance.
        // Calls static methods on the Converter class.
        Assert.AreEqual(amountDollars, Converter.ConvertEurosToDollars(
            Converter.ConvertDollarsToEuros(amountDollars, exchangeRateDollarsPerEuro),
            exchangeRateDollarsPerEuro));
    }

    // Assumption is that the exchange rate is not zero.
    public bool ConvertToEurosAndBackExchangeRateIsNotZero(double amountDollars, double amountEuros, double exchangeRateDollarsPerEuro)
    {
        // Should have a tolerance on this.
        return exchangeRateDollarsPerEuro != 0.0;
    }

    // Assume that the dollar value is not equal to three.
    // This is just to demonstrate that an invalid assumption results in a failure.
    public bool DollarsNotThree(double amountDollars, double amountEuros, double exchangeRateDollarsPerEuro)
    {
        return amountDollars != 3.0;
    }

    /// <summary>
    /// Returns the example data for the theory method.
    /// </summary>
    public IList MyTestMethodData
    {
        get
        {
            List<TheoryExampleDataDetail> details = new List<TheoryExampleDataDetail>();
            details.Add(new TheoryExampleDataDetail("Some other case should pass", new object[] { 2.0, 20.0, 5.0 }));
            return details;
        }
    }
}

public static class Converter
{
    public static double ConvertEurosToDollars(double amountEuros,
        double dollarsPerEuro)
    {
        return amountEuros * dollarsPerEuro;
    }

    public static double ConvertDollarsToEuros(double amountDollars,
        double dollarsPerEuro)
    {
        return amountDollars / dollarsPerEuro;
    }
}

A nicer syntax/API would be to have the assumptions inline:


public void CanConvertToEurosAndBack(double amountDollars, double amountEuros, double exchangeRateDollarsPerEuro)
{
    Assume.That(exchangeRateDollarsPerEuro != 0.0);
    Assume.That(amountDollars != 0.0);

    // Checks are equivalent within a tolerance.
    // Calls static methods on the Converter class.
    Assert.AreEqual(amountDollars, Converter.ConvertEurosToDollars(
        Converter.ConvertDollarsToEuros(amountDollars, exchangeRateDollarsPerEuro),
        exchangeRateDollarsPerEuro));
}

Here are the rules of my Theory implementation:

If there is no example data, the theory passes (we may want to change this in the future).
If there are no assumptions for a theory, then each set of example data is executed against the theory, each producing its own pass or fail.

If assumptions exist, then each set of data is first validated against the assumptions – if it meets them, the test proceeds and any test failure is flagged as an error.
If the example data does not meet the assumptions, then if the test passes it indicates that the assumption is invalid, and that case is marked as a failure with a specific message, “AssumptionFailed”. Any assertion failures or exceptions in the actual theory code are treated as passes. (In the future, would we want to mark the specific exception expected in the test method if an assumption is not met?)

NOTE: we may want to mark as a failure any theory for which ALL example data fails the assumptions, as a check that the actual body of the theory is actually being executed. I’ve not done this for now as it would be trickier with the current NUnit implementation.

Similarly, I was thinking of failing if any of the assumptions weren’t actually executed, but again, this is tricky in the current NUnit implementation (and may not give us much).

Automated exploration would not follow the last two suggested rules. The automation API would need to generate its data and execute it as if it were inline data. It may be helpful for the automated tool to be able to retrieve the user-supplied example data, so it doesn’t report a failure for any known case, but this is probably not necessary.

Feedback on these rules would be most welcome. If you want to change the behaviour of the assumptions (i.e. have assumptions only filter and nothing more), then the behaviour can be changed in TheoryMethod.RunTestMethod()
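The assumption-handling rules above can be summarised in code roughly as follows. This is my own paraphrase of the rules, not the extension’s actual TheoryMethod.RunTestMethod() source; “bodyFailed” means the theory body threw an exception or an assertion failed:

```csharp
// My own paraphrase of the assumption-handling rules, not the actual
// extension source. One Evaluate call corresponds to one set of example data.
public enum TheoryOutcome { Pass, Fail }

public static class TheoryRules
{
    public static TheoryOutcome Evaluate(bool hasAssumptions, bool assumptionsMet, bool bodyFailed)
    {
        if (!hasAssumptions || assumptionsMet)
        {
            // Ordinary case: the data is in scope, so the body must pass.
            return bodyFailed ? TheoryOutcome.Fail : TheoryOutcome.Pass;
        }

        // Data that does NOT meet the assumptions is expected to make the
        // body fail; if the body passes anyway, the assumption told us
        // nothing true about the code, and is itself flagged as a failure
        // ("AssumptionFailed").
        return bodyFailed ? TheoryOutcome.Pass : TheoryOutcome.Fail;
    }
}
```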

Here’s the output of the above theory:

Theories

I’ve just released a slightly updated version of my NUnit extension for data-driven unit testing.

There’s been a lot of discussion on the NUnit developer list recently regarding Theories – something new in JUnit and xUnit.Net – and it’s taken a while to discover why they’re so powerful (they’re superficially very similar to data-driven unit tests, and a lot of the differences are a matter of semantics).

First, there’s some good background on theories written by David Saff:
http://shareandenjoy.saff.net/tdd-specifications.pdf
http://shareandenjoy.saff.net/2007/04/popper-and-junitfactory.html
http://dspace.mit.edu/bitstream/1721.1/40090/1/MIT-CSAIL-TR-2008-002.pdf

Theories on first glance look like data-driven unit tests, but I think that the most important difference is this:

Theories are, in theory (excuse the pun), supposed to pass for ANY POSSIBLE parameters, whereas data-driven tests only express the behaviour examples that the developer has provided (they are nothing new in unit testing – just a way for a developer to more clearly group parameters together, or get the parameterized data from an external data source without recompiling tests).

Theories are a generalized statement of how the program should run, whereas in TDDing a very explicit statement of intent is made, which can be made to pass by coding that specific case in the implementation; the program is then made to work by triangulation – expressing the generalization by giving more inputs. However, the theory literature points out that as we have only passed in a limited number of data points, we can’t be sure whether we’ve actually expressed what we meant.

Theories, by forcing us to write our tests such that they take any inputs, make a much more powerful statement, and allow the possible inputs to be explored with external tools.

As an aside, one question I posted to the NUnit developer list regarding theories: “One thing that comes to mind, is that theories are written such that all possible inputs should pass. Apart from using a tool such as agitator, is there a way to test that the tests are written in a general way (I mean, if you had a theory that took parameters, but it totally ignored those parameters and worked as a vanilla unit test – i.e. created its own input), then it’s not really a valid theory – is there a way to detect these cases? Probably not, but I was just idly wondering.” Answers on a postcard to… well, I’d prefer a reply comment 😉