In this chapter, we introduce statistical debugging – the idea that specific events during execution could be statistically correlated with failures. We start with coverage of individual lines and then proceed towards further execution features.
Prerequisites
The idea behind statistical debugging is fairly simple. We have a program that sometimes passes and sometimes fails. This outcome can be correlated with events that precede it – properties of the input, properties of the execution, properties of the program state. If we, for instance, can find that "the program always fails when Line 123 is executed, and it always passes when Line 123 is not executed", then we have a strong correlation between Line 123 being executed and failure.
Such correlation does not necessarily mean causation. For this, we would have to prove that executing Line 123 always leads to failure, and that not executing it does not lead to (this) failure. Also, a correlation (or even a causation) does not mean that Line 123 contains the defect – for this, we would have to show that it actually is an error. Still, correlations make excellent hints when it comes to searching for failure causes: if you let your search be guided by events that correlate with failures, you are more likely to find important hints on how the failure comes to be.
How can we determine events that correlate with failure? We start with a general mechanism to actually collect events during execution. The abstract `Collector` class provides a `collect()` method for recording events, called from the `traceit()` tracer, and an `events()` method for retrieving these events. Both methods are abstract and will be defined further in subclasses.
# ignore
import html
import inspect
from types import FrameType, TracebackType
from typing import Any, Callable, Optional, Type, Tuple
from typing import Dict, Set, List, TypeVar, Union
# `Tracer` and `StackInspector`, used below, are introduced in earlier chapters of this book.
class Collector(Tracer):
"""A class to record events during execution."""
def collect(self, frame: FrameType, event: str, arg: Any) -> None:
"""Collecting function. To be overridden in subclasses."""
pass
def events(self) -> Set:
"""Return a collection of events. To be overridden in subclasses."""
return set()
def traceit(self, frame: FrameType, event: str, arg: Any) -> None:
self.collect(frame, event, arg)
A `Collector` is used like a `Tracer`, in a `with` statement. Let us apply it on the buggy variant of `remove_html_markup()` from the Introduction to Debugging:
def remove_html_markup(s): # type: ignore
tag = False
quote = False
out = ""
for c in s:
if c == '<' and not quote:
tag = True
elif c == '>' and not quote:
tag = False
elif c == '"' or c == "'" and tag:
quote = not quote
elif not tag:
out = out + c
return out
with Collector() as c:
out = remove_html_markup('"abc"')
out
'abc'
There's not much we can do with our collector yet, as the `collect()` and `events()` methods are still empty. However, we can introduce an `id()` method which returns a string identifying the collector. This string is derived from the first function call encountered.
Coverage = Set[Tuple[Callable, int]]
class Collector(Collector):
def __init__(self) -> None:
"""Constructor."""
self._function: Optional[Callable] = None
self._args: Optional[Dict[str, Any]] = None
self._argstring: Optional[str] = None
self._exception: Optional[Type] = None
self.items_to_ignore: List[Union[Type, Callable]] = [self.__class__]
def traceit(self, frame: FrameType, event: str, arg: Any) -> None:
"""
Tracing function.
Saves the first function and calls collect().
"""
for item in self.items_to_ignore:
if (isinstance(item, type) and 'self' in frame.f_locals and
isinstance(frame.f_locals['self'], item)):
# Ignore this class
return
if item.__name__ == frame.f_code.co_name:
# Ignore this function
return
if self._function is None and event == 'call':
# Save function
self._function = self.create_function(frame)
self._args = frame.f_locals.copy()
self._argstring = ", ".join([f"{var}={repr(self._args[var])}"
for var in self._args])
self.collect(frame, event, arg)
def collect(self, frame: FrameType, event: str, arg: Any) -> None:
"""Collector function. To be overloaded in subclasses."""
pass
def id(self) -> str:
"""Return an identifier for the collector,
created from the first call"""
return f"{self.function().__name__}({self.argstring()})"
def function(self) -> Callable:
"""Return the function from the first call, as a function object"""
if not self._function:
raise ValueError("No call collected")
return self._function
def argstring(self) -> str:
"""
Return the list of arguments from the first call,
as a printable string
"""
if not self._argstring:
raise ValueError("No call collected")
return self._argstring
def args(self) -> Dict[str, Any]:
"""Return a dict of argument names and values from the first call"""
if not self._args:
raise ValueError("No call collected")
return self._args
def exception(self) -> Optional[Type]:
"""Return the exception class from the first call,
or None if no exception was raised."""
return self._exception
def __repr__(self) -> str:
"""Return a string representation of the collector"""
# We use the ID as default representation when printed
return self.id()
def covered_functions(self) -> Set[Callable]:
"""Set of covered functions. To be overloaded in subclasses."""
return set()
def coverage(self) -> Coverage:
"""
Return a set (function, lineno) with locations covered.
To be overloaded in subclasses.
"""
return set()
Here's how the collector works. We use a `with` block to collect details on a function call:
with Collector() as c:
remove_html_markup('abc')
We can now retrieve details such as the function called...
c.function()
<function __main__.remove_html_markup(s)>
... or its arguments, as a name/value dictionary.
c.args()
{'s': 'abc'}
The `id()` method returns a printable representation of the call:
c.id()
"remove_html_markup(s='abc')"
The `argstring()` method does the same, but for the argument string only.
c.argstring()
"s='abc'"
With this, we can collect the basic information to identify calls – such that we can later correlate their events with success or failure.
While collecting, we'd like to avoid collecting events in the collection infrastructure itself. The `items_to_ignore` attribute takes care of this.
class Collector(Collector):
def add_items_to_ignore(self,
items_to_ignore: List[Union[Type, Callable]]) \
-> None:
"""
Define additional classes and functions to ignore during collection
(typically `Debugger` classes using these collectors).
"""
self.items_to_ignore += items_to_ignore
If we exit a block without having collected anything, that's likely an error.
class Collector(Collector):
def __exit__(self, exc_tp: Type, exc_value: BaseException,
exc_traceback: TracebackType) -> Optional[bool]:
"""Exit the `with` block."""
ret = super().__exit__(exc_tp, exc_value, exc_traceback)
if not self._function:
if exc_tp:
return False # re-raise exception
else:
raise ValueError("No call collected")
return ret
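For example, a `with` block in which no function call gets traced now raises this error upon exit. Here is a minimal sketch (assuming the block is traced as above and contains no call):

```python
# Sketch: exiting a Collector block without any traced call raises ValueError
try:
    with Collector() as c:
        x = 1 + 1  # no function is called, so no call is collected
except ValueError as err:
    print(err)  # "No call collected"
```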
So far, our `Collector` class does not collect any events. Let us extend it such that it collects coverage information – that is, the set of locations executed. To this end, we introduce a `CoverageCollector` subclass which saves the coverage in a set containing functions and line numbers.
class CoverageCollector(Collector, StackInspector):
"""A class to record covered locations during execution."""
def __init__(self) -> None:
"""Constructor."""
super().__init__()
self._coverage: Coverage = set()
def collect(self, frame: FrameType, event: str, arg: Any) -> None:
"""
Save coverage for an observed event.
"""
name = frame.f_code.co_name
function = self.search_func(name, frame)
if function is None:
function = self.create_function(frame)
location = (function, frame.f_lineno)
self._coverage.add(location)
We also override `events()` such that it returns the set of covered locations.
class CoverageCollector(CoverageCollector):
def events(self) -> Set[Tuple[str, int]]:
"""
Return the set of locations covered.
Each location comes as a pair (`function_name`, `lineno`).
"""
return {(func.__name__, lineno) for func, lineno in self._coverage}
The methods `coverage()` and `covered_functions()` allow precise access to the coverage obtained.
class CoverageCollector(CoverageCollector):
def covered_functions(self) -> Set[Callable]:
"""Return a set with all functions covered."""
return {func for func, lineno in self._coverage}
def coverage(self) -> Coverage:
"""Return a set (function, lineno) with all locations covered."""
return self._coverage
Here is how we can use `CoverageCollector` to determine the lines executed during a run of `remove_html_markup()`:
with CoverageCollector() as c:
remove_html_markup('abc')
c.events()
{('remove_html_markup', 1), ('remove_html_markup', 2), ('remove_html_markup', 3), ('remove_html_markup', 4), ('remove_html_markup', 6), ('remove_html_markup', 7), ('remove_html_markup', 9), ('remove_html_markup', 11), ('remove_html_markup', 13), ('remove_html_markup', 14), ('remove_html_markup', 16)}
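The same run also gives us access to the raw coverage data as function objects rather than names. A short usage sketch, continuing the run above:

```python
# Sketch: covered_functions() and coverage() expose the raw coverage data
print(c.covered_functions())                            # {<function remove_html_markup>}
print(sorted(lineno for func, lineno in c.coverage()))  # covered line numbers
```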
Sets of line numbers alone are not too revealing. They provide more insights if we actually list the code, highlighting these numbers:
from inspect import getsourcelines  # used for listing source code

def code_with_coverage(function: Callable, coverage: Coverage) -> None:
source_lines, starting_line_number = \
getsourcelines(function)
line_number = starting_line_number
for line in source_lines:
marker = '*' if (function, line_number) in coverage else ' '
print(f"{line_number:4} {marker} {line}", end='')
line_number += 1
code_with_coverage(remove_html_markup, c.coverage())
   1 * def remove_html_markup(s): # type: ignore
   2 *     tag = False
   3 *     quote = False
   4 *     out = ""
   5
   6 *     for c in s:
   7 *         if c == '<' and not quote:
   8               tag = True
   9 *         elif c == '>' and not quote:
  10               tag = False
  11 *         elif c == '"' or c == "'" and tag:
  12               quote = not quote
  13 *         elif not tag:
  14 *             out = out + c
  15
  16 *     return out
Remember that the input `s` was `"abc"`? In this listing, we can see which lines were covered and which were not. From the listing alone, we can already tell that `s` has neither tags nor quotes.
Such coverage computation plays a big role in testing, as one wants tests to cover as many different aspects of program execution (and notably code) as possible. But also during debugging, code coverage is essential: If some code was not even executed in the failing run, then any change to it will have no effect.
quiz('Let the input be `"<b>Don\'t do this!</b>"`. '
"Which of these lines are executed? Use the code to find out!",
[
"`tag = True`",
"`tag = False`",
"`quote = not quote`",
"`out = out + c`"
], "[ord(c) - ord('a') - 1 for c in 'cdf']")
"<b>Don't do this!</b>"
. Which of these lines are executed? Use the code to find out!
To find the solution, try this out yourself:
with CoverageCollector() as c:
remove_html_markup("<b>Don't do this!</b>")
# code_with_coverage(remove_html_markup, c.coverage)
Let us get back to the idea that we want to correlate events with passing and failing outcomes. For this, we need to examine events in both passing and failing runs, and determine their differences – since it is these differences we want to associate with their respective outcome.
The `StatisticalDebugger` base class takes a collector class (such as `CoverageCollector`). Its `collect()` method creates a new collector of that very class, which will be maintained by the debugger. As argument, `collect()` takes a string characterizing the outcome (such as `'PASS'` or `'FAIL'`). This is how one would use it:
debugger = StatisticalDebugger()
with debugger.collect('PASS'):
some_passing_run()
with debugger.collect('PASS'):
another_passing_run()
with debugger.collect('FAIL'):
some_failing_run()
Let us implement `StatisticalDebugger`. The base class gets a collector class as argument:
class StatisticalDebugger:
"""A class to collect events for multiple outcomes."""
def __init__(self, collector_class: Type = CoverageCollector, log: bool = False):
"""Constructor. Use instances of `collector_class` to collect events."""
self.collector_class = collector_class
self.collectors: Dict[str, List[Collector]] = {}
self.log = log
The `collect()` method creates (and stores) a collector for the given outcome, using the given outcome to characterize the run. Any additional arguments are passed to the collector.
class StatisticalDebugger(StatisticalDebugger):
def collect(self, outcome: str, *args: Any, **kwargs: Any) -> Collector:
"""Return a collector for the given outcome.
Additional args are passed to the collector."""
collector = self.collector_class(*args, **kwargs)
collector.add_items_to_ignore([self.__class__])
return self.add_collector(outcome, collector)
def add_collector(self, outcome: str, collector: Collector) -> Collector:
if outcome not in self.collectors:
self.collectors[outcome] = []
self.collectors[outcome].append(collector)
return collector
The `all_events()` method produces a union of all events observed. If an outcome is given, it produces a union of all events with that outcome:
class StatisticalDebugger(StatisticalDebugger):
def all_events(self, outcome: Optional[str] = None) -> Set[Any]:
"""Return a set of all events observed."""
all_events = set()
if outcome:
if outcome in self.collectors:
for collector in self.collectors[outcome]:
all_events.update(collector.events())
else:
for outcome in self.collectors:
for collector in self.collectors[outcome]:
all_events.update(collector.events())
return all_events
Here's a simple example of `StatisticalDebugger` in action:
s = StatisticalDebugger()
with s.collect('PASS'):
remove_html_markup("abc")
with s.collect('PASS'):
remove_html_markup('<b>abc</b>')
with s.collect('FAIL'):
remove_html_markup('"abc"')
The method `all_events()` returns all events collected:
s.all_events()
{('remove_html_markup', 1), ('remove_html_markup', 2), ('remove_html_markup', 3), ('remove_html_markup', 4), ('remove_html_markup', 6), ('remove_html_markup', 7), ('remove_html_markup', 8), ('remove_html_markup', 9), ('remove_html_markup', 10), ('remove_html_markup', 11), ('remove_html_markup', 12), ('remove_html_markup', 13), ('remove_html_markup', 14), ('remove_html_markup', 16)}
If given an outcome as argument, we obtain all events with the given outcome.
s.all_events('FAIL')
{('remove_html_markup', 1), ('remove_html_markup', 2), ('remove_html_markup', 3), ('remove_html_markup', 4), ('remove_html_markup', 6), ('remove_html_markup', 7), ('remove_html_markup', 9), ('remove_html_markup', 11), ('remove_html_markup', 12), ('remove_html_markup', 13), ('remove_html_markup', 14), ('remove_html_markup', 16)}
The attribute `collectors` maps outcomes to lists of collectors:
s.collectors
{'PASS': [remove_html_markup(s='abc'), remove_html_markup(s='<b>abc</b>')], 'FAIL': [remove_html_markup(s='"abc"')]}
Here's the collector of the first passing run:
s.collectors['PASS'][0].id()
"remove_html_markup(s='abc')"
s.collectors['PASS'][0].events()
{('remove_html_markup', 1), ('remove_html_markup', 2), ('remove_html_markup', 3), ('remove_html_markup', 4), ('remove_html_markup', 6), ('remove_html_markup', 7), ('remove_html_markup', 9), ('remove_html_markup', 11), ('remove_html_markup', 13), ('remove_html_markup', 14), ('remove_html_markup', 16)}
To better highlight the differences between the collected events, we introduce a method `event_table()` that prints out whether an event took place in a run.
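The `event_table()` implementation itself is part of the book's infrastructure and not shown here; the following hypothetical helper is a minimal sketch of how such a table could be computed from the collectors (plain-text output only):

```python
# Hypothetical, simplified stand-in for event_table(): one row per event,
# one column per collected run ('X' = event observed, '-' = not observed)
def simple_event_table(debugger: StatisticalDebugger) -> None:
    runs = [collector for collectors in debugger.collectors.values()
            for collector in collectors]
    print('event'.ljust(30), *[run.id() for run in runs])
    for event in sorted(debugger.all_events()):
        row = ['X' if event in run.events() else '-' for run in runs]
        print(str(event).ljust(30), *row)
```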
s = StatisticalDebugger()
with s.collect('PASS'):
remove_html_markup("abc")
with s.collect('PASS'):
remove_html_markup('<b>abc</b>')
with s.collect('FAIL'):
remove_html_markup('"abc"')
s.event_table(args=True)
remove_html_markup |
s='abc' |
s='<b>abc</b>' |
s='"abc"' |
---|---|---|---|
remove_html_markup:1 | X | X | X |
remove_html_markup:2 | X | X | X |
remove_html_markup:3 | X | X | X |
remove_html_markup:4 | X | X | X |
remove_html_markup:6 | X | X | X |
remove_html_markup:7 | X | X | X |
remove_html_markup:8 | - | X | - |
remove_html_markup:9 | X | X | X |
remove_html_markup:10 | - | X | - |
remove_html_markup:11 | X | X | X |
remove_html_markup:12 | - | - | X |
remove_html_markup:13 | X | X | X |
remove_html_markup:14 | X | X | X |
remove_html_markup:16 | X | X | X |
quiz("How many lines are executed in the failing run only?",
[
"One",
"Two",
"Three"
], 'len([12])')
Indeed, Line 12, which is executed in the failing run only, is exactly the kind of correlation we are looking for.
While our `StatisticalDebugger` class allows arbitrary outcomes, we are typically only interested in two outcomes, namely passing vs. failing runs. We therefore introduce a specialized `DifferenceDebugger` class that provides customized methods to collect and access passing and failing runs.
class DifferenceDebugger(StatisticalDebugger):
"""A class to collect events for passing and failing outcomes."""
PASS = 'PASS'
FAIL = 'FAIL'
def collect_pass(self, *args: Any, **kwargs: Any) -> Collector:
"""Return a collector for passing runs."""
return self.collect(self.PASS, *args, **kwargs)
def collect_fail(self, *args: Any, **kwargs: Any) -> Collector:
"""Return a collector for failing runs."""
return self.collect(self.FAIL, *args, **kwargs)
def pass_collectors(self) -> List[Collector]:
return self.collectors[self.PASS]
def fail_collectors(self) -> List[Collector]:
return self.collectors[self.FAIL]
def all_fail_events(self) -> Set[Any]:
"""Return all events observed in failing runs."""
return self.all_events(self.FAIL)
def all_pass_events(self) -> Set[Any]:
"""Return all events observed in passing runs."""
return self.all_events(self.PASS)
def only_fail_events(self) -> Set[Any]:
"""Return all events observed only in failing runs."""
return self.all_fail_events() - self.all_pass_events()
def only_pass_events(self) -> Set[Any]:
"""Return all events observed only in passing runs."""
return self.all_pass_events() - self.all_fail_events()
We can use `DifferenceDebugger` just as a `StatisticalDebugger`:
# ignore
T1 = TypeVar('T1', bound='DifferenceDebugger')
def test_debugger_html_simple(debugger: T1) -> T1:
with debugger.collect_pass():
remove_html_markup('abc')
with debugger.collect_pass():
remove_html_markup('<b>abc</b>')
with debugger.collect_fail():
remove_html_markup('"abc"')
return debugger
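A quick usage sketch; the "only in fail" line found above should come out again:

```python
# Sketch: run the three test inputs and query the difference sets
debugger = test_debugger_html_simple(DifferenceDebugger())
print(debugger.only_fail_events())   # expected: {('remove_html_markup', 12)}
```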
However, since the outcome of tests may not always be predetermined, we provide a simpler interface for tests that can fail (= raise an exception) or pass (not raise an exception).
class DifferenceDebugger(DifferenceDebugger):
def __enter__(self) -> Any:
"""Enter a `with` block. Collect coverage and outcome;
classify as FAIL if the block raises an exception,
and PASS if it does not.
"""
self.collector = self.collector_class()
self.collector.add_items_to_ignore([self.__class__])
self.collector.__enter__()
return self
def __exit__(self, exc_tp: Type, exc_value: BaseException,
exc_traceback: TracebackType) -> Optional[bool]:
"""Exit the `with` block."""
status = self.collector.__exit__(exc_tp, exc_value, exc_traceback)
if status is None:
pass
else:
return False # Internal error; re-raise exception
if exc_tp is None:
outcome = self.PASS
else:
outcome = self.FAIL
self.add_collector(outcome, self.collector)
return True # Ignore exception, if any
Using this interface, we can rewrite `test_debugger_html()`:
# ignore
T2 = TypeVar('T2', bound='DifferenceDebugger')
def test_debugger_html(debugger: T2) -> T2:
with debugger:
remove_html_markup('abc')
with debugger:
remove_html_markup('<b>abc</b>')
with debugger:
remove_html_markup('"abc"')
assert False # Mark test as failing
return debugger
test_debugger_html(DifferenceDebugger())
remove_html_markup |
s='abc' |
s='<b>abc</b>' |
s='"abc"' |
---|---|---|---|
remove_html_markup:1 | X | X | X |
remove_html_markup:2 | X | X | X |
remove_html_markup:3 | X | X | X |
remove_html_markup:4 | X | X | X |
remove_html_markup:6 | X | X | X |
remove_html_markup:7 | X | X | X |
remove_html_markup:8 | - | X | - |
remove_html_markup:9 | X | X | X |
remove_html_markup:10 | - | X | - |
remove_html_markup:11 | X | X | X |
remove_html_markup:12 | - | - | X |
remove_html_markup:13 | X | X | X |
remove_html_markup:14 | X | X | X |
remove_html_markup:16 | X | X | X |
Let us now focus on analyzing the events collected. Since events come back as sets, we can compute unions and differences between these sets. For instance, we can compute which lines were executed in any of the passing runs of `test_debugger_html()`, above:
debugger = test_debugger_html(DifferenceDebugger())
pass_1_events = debugger.pass_collectors()[0].events()
pass_2_events = debugger.pass_collectors()[1].events()
in_any_pass = pass_1_events | pass_2_events
in_any_pass
{('remove_html_markup', 1), ('remove_html_markup', 2), ('remove_html_markup', 3), ('remove_html_markup', 4), ('remove_html_markup', 6), ('remove_html_markup', 7), ('remove_html_markup', 8), ('remove_html_markup', 9), ('remove_html_markup', 10), ('remove_html_markup', 11), ('remove_html_markup', 13), ('remove_html_markup', 14), ('remove_html_markup', 16)}
Likewise, we can determine which lines were only executed in the failing run:
fail_events = debugger.fail_collectors()[0].events()
only_in_fail = fail_events - in_any_pass
only_in_fail
{('remove_html_markup', 12)}
And we see that the "failing" run is characterized by processing quotes:
code_with_coverage(remove_html_markup, only_in_fail)
   1   def remove_html_markup(s): # type: ignore
   2       tag = False
   3       quote = False
   4       out = ""
   5
   6       for c in s:
   7           if c == '<' and not quote:
   8               tag = True
   9           elif c == '>' and not quote:
  10               tag = False
  11           elif c == '"' or c == "'" and tag:
  12 *             quote = not quote
  13           elif not tag:
  14               out = out + c
  15
  16       return out
debugger = test_debugger_html(DifferenceDebugger())
debugger.all_events()
{('remove_html_markup', 1), ('remove_html_markup', 2), ('remove_html_markup', 3), ('remove_html_markup', 4), ('remove_html_markup', 6), ('remove_html_markup', 7), ('remove_html_markup', 8), ('remove_html_markup', 9), ('remove_html_markup', 10), ('remove_html_markup', 11), ('remove_html_markup', 12), ('remove_html_markup', 13), ('remove_html_markup', 14), ('remove_html_markup', 16)}
These are the lines executed only in the failing run:
debugger.only_fail_events()
{('remove_html_markup', 12)}
These are the lines executed only in the passing runs:
debugger.only_pass_events()
{('remove_html_markup', 8), ('remove_html_markup', 10)}
Again, having these lines individually is neat, but things become much more interesting if we can see the associated code lines just as well. That's what we will do in the next section.
To show correlations of line coverage in context, we introduce a number of visualization techniques that highlight code with different colors.
The first idea is to use a discrete spectrum of three colors:

* Code that is executed only in failing runs is highlighted in red – it is most suspicious.
* Code that is executed in both passing and failing runs is highlighted in yellow.
* Code that is executed only in passing runs is highlighted in green – it is least suspicious.

Code that is not executed stays unhighlighted.
We first introduce an abstract class `SpectrumDebugger` that provides the essential functions. `suspiciousness()` returns a value between 0 and 1 indicating the suspiciousness of the given event – or `None` if unknown.
class SpectrumDebugger(DifferenceDebugger):
def suspiciousness(self, event: Any) -> Optional[float]:
"""
Return a suspiciousness value in the range [0, 1.0]
for the given event, or `None` if unknown.
To be overloaded in subclasses.
"""
return None
The `tooltip()` and `percentage()` methods convert the suspiciousness into a human-readable form.
class SpectrumDebugger(SpectrumDebugger):
def tooltip(self, event: Any) -> str:
"""
Return a tooltip for the given event (default: percentage).
To be overloaded in subclasses.
"""
return self.percentage(event)
def percentage(self, event: Any) -> str:
"""
Return the suspiciousness for the given event as percentage string.
"""
suspiciousness = self.suspiciousness(event)
if suspiciousness is not None:
return str(int(suspiciousness * 100)).rjust(3) + '%'
else:
return ' ' * len('100%')
The `code()` method takes a set of functions (default: all covered functions) and lists their source code lines, using the given spectrum and (optionally) HTML markup:
class SpectrumDebugger(SpectrumDebugger):
def code(self, functions: Optional[Set[Callable]] = None, *,
color: bool = False, suspiciousness: bool = False,
line_numbers: bool = True) -> str:
"""
Return a listing of `functions` (default: covered functions).
If `color` is True, render as HTML, using suspiciousness colors.
If `suspiciousness` is True, include suspiciousness values.
If `line_numbers` is True (default), include line numbers.
"""
if not functions:
functions = self.covered_functions()
out = ""
seen = set()
for function in functions:
source_lines, starting_line_number = \
inspect.getsourcelines(function)
if (function.__name__, starting_line_number) in seen:
continue
seen.add((function.__name__, starting_line_number))
if out:
out += '\n'
if color:
out += '<p/>'
line_number = starting_line_number
for line in source_lines:
if color:
line = html.escape(line)
if line.strip() == '':
line = ' '
location = (function.__name__, line_number)
location_suspiciousness = self.suspiciousness(location)
if location_suspiciousness is not None:
tooltip = f"Line {line_number}: {self.tooltip(location)}"
else:
tooltip = f"Line {line_number}: not executed"
if suspiciousness:
line = self.percentage(location) + ' ' + line
if line_numbers:
line = str(line_number).rjust(4) + ' ' + line
line_color = self.color(location)
if color and line_color:
line = f'''<pre style="background-color:{line_color}"
title="{tooltip}">{line.rstrip()}</pre>'''
elif color:
line = f'<pre title="{tooltip}">{line}</pre>'
else:
line = line.rstrip()
out += line + '\n'
line_number += 1
return out
We introduce a few helper methods to visualize the code with colors in various forms.
class SpectrumDebugger(SpectrumDebugger):
def _repr_html_(self) -> str:
"""When output in Jupyter, visualize as HTML"""
return self.code(color=True)
def __str__(self) -> str:
"""Show code as string"""
return self.code(color=False, suspiciousness=True)
def __repr__(self) -> str:
"""Show code as string"""
return self.code(color=False, suspiciousness=True)
So far, however, central methods like `suspiciousness()` or `color()` were abstract – that is, to be defined in subclasses. Our `DiscreteSpectrumDebugger` subclass provides concrete implementations for these, with `color()` returning one of three discrete colors depending on the suspiciousness of the event:
class DiscreteSpectrumDebugger(SpectrumDebugger):
"""Visualize differences between executions using three discrete colors"""
def suspiciousness(self, event: Any) -> Optional[float]:
"""
Return a suspiciousness value [0, 1.0]
for the given event, or `None` if unknown.
"""
passing = self.all_pass_events()
failing = self.all_fail_events()
if event in passing and event in failing:
return 0.5
elif event in failing:
return 1.0
elif event in passing:
return 0.0
else:
return None
def color(self, event: Any) -> Optional[str]:
"""
Return a HTML color for the given event.
"""
suspiciousness = self.suspiciousness(event)
if suspiciousness is None:
return None
if suspiciousness > 0.8:
return 'mistyrose'
if suspiciousness >= 0.5:
return 'lightyellow'
return 'honeydew'
def tooltip(self, event: Any) -> str:
"""Return a tooltip for the given event."""
passing = self.all_pass_events()
failing = self.all_fail_events()
if event in passing and event in failing:
return "in passing and failing runs"
elif event in failing:
return "only in failing runs"
elif event in passing:
return "only in passing runs"
else:
return "never"
This is how the `only_pass_events()` and `only_fail_events()` sets look when visualized with code. The "culprit" line is well highlighted:
debugger = test_debugger_html(DiscreteSpectrumDebugger())
debugger
1 def remove_html_markup(s): # type: ignore
2 tag = False
3 quote = False
4 out = ""
5
6 for c in s:
7 if c == '<' and not quote:
8 tag = True
9 elif c == '>' and not quote:
10 tag = False
11 elif c == '"' or c == "'" and tag:
12 quote = not quote
13 elif not tag:
14 out = out + c
15
16 return out
We can clearly see that the failure is correlated with the presence of quotes in the input string (which is an important hint!). But does this also show us immediately where the defect to be fixed is?
quiz("Does the line `quote = not quote` actually contain the defect?",
[
"Yes, it should be fixed",
"No, the defect is elsewhere"
], '164 * 2 % 326')
Indeed, it is the governing condition that is wrong – that is, the condition that caused Line 12 to be executed in the first place. In order to fix a program, we have to find a location that not only causes the failure, but also actually is a defect.
In our example above, the highlighted code line is a symptom for the error. To some extent, it is also a cause, since, say, commenting it out would also resolve the given failure, at the cost of causing other failures. However, the preceding condition also is a cause, as is the presence of quotes in the input.
Only one of these also is a defect, though, and that is the preceding condition. Hence, while correlations can provide important hints, they do not necessarily locate defects.
For those of us who may not have color HTML output ready, simply printing the debugger lists suspiciousness values as percentages.
print(debugger)
   1  50% def remove_html_markup(s): # type: ignore
   2  50%     tag = False
   3  50%     quote = False
   4  50%     out = ""
   5
   6  50%     for c in s:
   7  50%         if c == '<' and not quote:
   8   0%             tag = True
   9  50%         elif c == '>' and not quote:
  10   0%             tag = False
  11  50%         elif c == '"' or c == "'" and tag:
  12 100%             quote = not quote
  13  50%         elif not tag:
  14  50%             out = out + c
  15
  16  50%     return out
The criterion that an event should only occur in failing runs (and not in passing runs) can be too aggressive. In particular, if we have another run that executes the "culprit" lines, but does not fail, our "only in fail" criterion will no longer be helpful.
Here is an example. The input `<b color="blue">text</b>` will trigger the "culprit" line `quote = not quote`, but actually produces an output in which the tags are properly stripped:
remove_html_markup('<b color="blue">text</b>')
'text'
As a consequence, we no longer have lines that are being executed only in failing runs:
debugger = test_debugger_html(DiscreteSpectrumDebugger())
with debugger.collect_pass():
remove_html_markup('<b link="blue"></b>')
debugger.only_fail_events()
set()
In our spectrum output, the effect now is that the "culprit" line is as yellow as all others.
debugger
1 def remove_html_markup(s): # type: ignore
2 tag = False
3 quote = False
4 out = ""
5
6 for c in s:
7 if c == '<' and not quote:
8 tag = True
9 elif c == '>' and not quote:
10 tag = False
11 elif c == '"' or c == "'" and tag:
12 quote = not quote
13 elif not tag:
14 out = out + c
15
16 return out
We therefore introduce a different method for highlighting lines, based on their relative occurrence with respect to all runs: If a line has been mostly executed in failing runs, its color should shift towards red; if a line has been mostly executed in passing runs, its color should shift towards green.
This continuous spectrum has been introduced by the seminal Tarantula tool \cite{Jones2002}. In Tarantula, the color hue for each line is defined as follows:

$$\textit{color hue}(\textit{line}) = \frac{\%\textit{passed}(\textit{line})}{\%\textit{passed}(\textit{line}) + \%\textit{failed}(\textit{line})}$$

Here, $\%\textit{passed}$ and $\%\textit{failed}$ denote the percentage at which a line has been executed in passing and failing runs, respectively. A hue of 0.0 stands for red, a hue of 1.0 stands for green, and a hue of 0.5 stands for equal fractions of red and green, yielding yellow.
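As a quick worked example, using the fractions from our three `remove_html_markup()` runs above (two passing, one failing):

```python
# Worked example of the Tarantula hue (fractions taken from the event table above)
def tarantula_hue(percent_passed: float, percent_failed: float) -> float:
    return percent_passed / (percent_passed + percent_failed)

print(tarantula_hue(0.0, 1.0))  # Line 12: only in the failing run -> 0.0 (red)
print(tarantula_hue(1.0, 1.0))  # Line 7: in all runs -> 0.5 (yellow)
print(tarantula_hue(0.5, 0.0))  # Line 8: in one of two passing runs only -> 1.0 (green)
```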
We can implement these measures right away as methods in a new `ContinuousSpectrumDebugger` class:
class ContinuousSpectrumDebugger(DiscreteSpectrumDebugger):
"""Visualize differences between executions using a color spectrum"""
def collectors_with_event(self, event: Any, category: str) -> Set[Collector]:
"""
Return all collectors in a category
that observed the given event.
"""
all_runs = self.collectors[category]
collectors_with_event = set(collector for collector in all_runs
if event in collector.events())
return collectors_with_event
def collectors_without_event(self, event: Any, category: str) -> Set[Collector]:
"""
Return all collectors in a category
that did not observe the given event.
"""
all_runs = self.collectors[category]
collectors_without_event = set(collector for collector in all_runs
if event not in collector.events())
return collectors_without_event
def event_fraction(self, event: Any, category: str) -> float:
if category not in self.collectors:
return 0.0
all_collectors = self.collectors[category]
collectors_with_event = self.collectors_with_event(event, category)
fraction = len(collectors_with_event) / len(all_collectors)
# print(f"%{category}({event}) = {fraction}")
return fraction
def passed_fraction(self, event: Any) -> float:
return self.event_fraction(event, self.PASS)
def failed_fraction(self, event: Any) -> float:
return self.event_fraction(event, self.FAIL)
def hue(self, event: Any) -> Optional[float]:
"""Return a color hue from 0.0 (red) to 1.0 (green)."""
passed = self.passed_fraction(event)
failed = self.failed_fraction(event)
if passed + failed > 0:
return passed / (passed + failed)
else:
return None
Having a continuous hue also implies a continuous suspiciousness and associated tooltips:
class ContinuousSpectrumDebugger(ContinuousSpectrumDebugger):
def suspiciousness(self, event: Any) -> Optional[float]:
hue = self.hue(event)
if hue is None:
return None
return 1 - hue
def tooltip(self, event: Any) -> str:
return self.percentage(event)
The hue for lines executed only in failing runs is (deep) red, as expected:
debugger = test_debugger_html(ContinuousSpectrumDebugger())
for location in debugger.only_fail_events():
print(location, debugger.hue(location))
('remove_html_markup', 12) 0.0
Likewise, the hue for lines executed in passing runs is (deep) green:
for location in debugger.only_pass_events():
print(location, debugger.hue(location))
('remove_html_markup', 8) 1.0
('remove_html_markup', 10) 1.0
The Tarantula tool not only sets the hue for a line, but also uses brightness as a measure of support – that is, how often the line was executed at all. The brighter a line, the stronger the correlation with a passing or failing outcome.

The brightness is defined as follows:

$$\textit{brightness}(\textit{line}) = \max(\%\textit{passed}(\textit{line}), \%\textit{failed}(\textit{line}))$$

and it is easily implemented, too:
class ContinuousSpectrumDebugger(ContinuousSpectrumDebugger):
def brightness(self, event: Any) -> float:
return max(self.passed_fraction(event), self.failed_fraction(event))
Our single "only in fail" line has a brightness of 1.0 (the maximum).
debugger = test_debugger_html(ContinuousSpectrumDebugger())
for location in debugger.only_fail_events():
print(location, debugger.brightness(location))
('remove_html_markup', 12) 1.0
With this, we can now define a color for each line. To this end, we override the (previously discrete) `color()` method such that it returns a color specification giving hue and brightness. We use the HTML format `hsl(hue, saturation, lightness)`, where the hue is given as a value between 0 and 360 (0 is red, 120 is green), and saturation and lightness are provided as percentages.
class ContinuousSpectrumDebugger(ContinuousSpectrumDebugger):
def color(self, event: Any) -> Optional[str]:
hue = self.hue(event)
if hue is None:
return None
saturation = self.brightness(event)
# HSL color values are specified with:
# hsl(hue, saturation, lightness).
return f"hsl({hue * 120}, {saturation * 100}%, 80%)"
debugger = test_debugger_html(ContinuousSpectrumDebugger())
Lines executed only in failing runs are still shown in red:
for location in debugger.only_fail_events():
print(location, debugger.color(location))
('remove_html_markup', 12) hsl(0.0, 100.0%, 80%)
... whereas lines executed only in passing runs are still shown in green:
for location in debugger.only_pass_events():
print(location, debugger.color(location))
('remove_html_markup', 8) hsl(120.0, 50.0%, 80%)
('remove_html_markup', 10) hsl(120.0, 50.0%, 80%)
debugger
1 def remove_html_markup(s): # type: ignore
2 tag = False
3 quote = False
4 out = ""
5
6 for c in s:
7 if c == '<' and not quote:
8 tag = True
9 elif c == '>' and not quote:
10 tag = False
11 elif c == '"' or c == "'" and tag:
12 quote = not quote
13 elif not tag:
14 out = out + c
15
16 return out
What happens with our `quote = not quote` "culprit" line if it is executed in passing runs, too?
with debugger.collect_pass():
out = remove_html_markup('<b link="blue"></b>')
quiz('In which color will the `quote = not quote` "culprit" line '
'be shown after executing the above code?',
[
'<span style="background-color: hsl(120.0, 50.0%, 80%)">Green</span>',
'<span style="background-color: hsl(60.0, 100.0%, 80%)">Yellow</span>',
'<span style="background-color: hsl(30.0, 100.0%, 80%)">Orange</span>',
'<span style="background-color: hsl(0.0, 100.0%, 80%)">Red</span>'
], '999 // 333')
We see that it still is shown with an orange-red tint.
debugger
1 def remove_html_markup(s): # type: ignore
2 tag = False
3 quote = False
4 out = ""
5
6 for c in s:
7 if c == '<' and not quote:
8 tag = True
9 elif c == '>' and not quote:
10 tag = False
11 elif c == '"' or c == "'" and tag:
12 quote = not quote
13 elif not tag:
14 out = out + c
15
16 return out
Here's another example, coming right from the Tarantula paper. The `middle()` function takes three numbers `x`, `y`, and `z`, and returns the one that is neither the minimum nor the maximum of the three:
def middle(x, y, z): # type: ignore
if y < z:
if x < y:
return y
elif x < z:
return y
else:
if x > y:
return y
elif x > z:
return x
return z
middle(1, 2, 3)
2
Unfortunately, `middle()` can fail:
middle(2, 1, 3)
1
Let us see whether we can find the bug with a few additional test cases:
# ignore
T3 = TypeVar('T3', bound='DifferenceDebugger')
def test_debugger_middle(debugger: T3) -> T3:
with debugger.collect_pass():
middle(3, 3, 5)
with debugger.collect_pass():
middle(1, 2, 3)
with debugger.collect_pass():
middle(3, 2, 1)
with debugger.collect_pass():
middle(5, 5, 5)
with debugger.collect_pass():
middle(5, 3, 4)
with debugger.collect_fail():
middle(2, 1, 3)
return debugger
Note that in order to collect data from multiple function invocations, you need a separate `with` block for every invocation. The following will not work correctly:
with debugger.collect_pass():
middle(3, 3, 5)
middle(1, 2, 3)
...
debugger = test_debugger_middle(ContinuousSpectrumDebugger())
debugger.event_table(args=True)
middle |
x=3, y=3, z=5 |
x=1, y=2, z=3 |
x=3, y=2, z=1 |
x=5, y=5, z=5 |
x=5, y=3, z=4 |
x=2, y=1, z=3 |
---|---|---|---|---|---|---|
middle:1 | X | X | X | X | X | X |
middle:2 | X | X | X | X | X | X |
middle:3 | X | X | - | - | X | X |
middle:4 | - | X | - | - | - | - |
middle:5 | X | - | - | - | X | X |
middle:6 | X | - | - | - | - | X |
middle:8 | - | - | X | X | - | - |
middle:9 | - | - | X | - | - | - |
middle:10 | - | - | - | X | - | - |
middle:12 | - | - | - | X | X | - |
Here comes the visualization. We see that the `return y` line is the culprit here – and actually also the one to be fixed.
debugger
1 def middle(x, y, z): # type: ignore
2 if y < z:
3 if x < y:
4 return y
5 elif x < z:
6 return y
7 else:
8 if x > y:
9 return y
10 elif x > z:
11 return x
12 return z
quiz("Which of the above lines should be fixed?",
[
'<span style="background-color: hsl(45.0, 100%, 80%)">Line 3: `if x < y`</span>',
'<span style="background-color: hsl(34.28571428571429, 100.0%, 80%)">Line 5: `elif x < z`</span>',
'<span style="background-color: hsl(20.000000000000004, 100.0%, 80%)">Line 6: `return y`</span>',
'<span style="background-color: hsl(120.0, 20.0%, 80%)">Line 9: `return y`</span>',
], r'len(" middle ".strip()[:3])')
Indeed, in the `middle()` example, the "reddest" line is also the one to be fixed. Here is the fixed version:
def middle_fixed(x, y, z): # type: ignore
if y < z:
if x < y:
return y
elif x < z:
return x
else:
if x > y:
return y
elif x > z:
return x
return z
middle_fixed(2, 1, 3)
2
In a large program, there can be several locations (and events) that could be flagged as suspicious. It suffices that some large code block of, say, 1,000 lines is mostly executed in failing runs, and all of this code block will be visualized in some shade of red.

To further highlight the "most suspicious" events, one idea is to use a ranking – that is, to come up with a list of events in which those most correlated with failures are shown at the top. The programmer would then examine these events one by one and proceed down the list. We will show how this works for two "correlation" metrics – first the Tarantula metric, as introduced above, and then the Ochiai metric, which has been shown to be one of the most effective "ranking" metrics.
We introduce a base class `RankingDebugger` with an abstract method `suspiciousness()` to be overloaded in subclasses. The method `rank()` returns a list of all events observed, sorted by suspiciousness, highest first.
class RankingDebugger(DiscreteSpectrumDebugger):
"""Rank events by their suspiciousness"""
def rank(self) -> List[Any]:
"""Return a list of events, sorted by suspiciousness, highest first."""
def susp(event: Any) -> float:
suspiciousness = self.suspiciousness(event)
assert suspiciousness is not None
return suspiciousness
events = list(self.all_events())
events.sort(key=susp, reverse=True)
return events
def __repr__(self) -> str:
return repr(self.rank())
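Since `RankingDebugger` inherits the discrete suspiciousness values from `DiscreteSpectrumDebugger`, we can already give it a try. A quick sketch; the dedicated metrics follow below:

```python
# Sketch: ranking events with the (discrete) suspiciousness inherited so far
ranking_debugger = test_debugger_html(RankingDebugger())
print(ranking_debugger.rank()[:3])   # the most suspicious events come first
```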
We can use the Tarantula metric to sort lines according to their suspiciousness. The "redder" a line (a hue of 0.0), the more suspicious it is. We can simply define

$$\textit{suspiciousness}(\textit{line}) = 1 - \textit{color hue}(\textit{line})$$

where $\textit{color hue}$ is as defined above.
This is exactly the `suspiciousness()` function as already implemented in our `ContinuousSpectrumDebugger`.
We introduce the `TarantulaDebugger` class, inheriting the visualization capabilities from the `ContinuousSpectrumDebugger` class as well as the ranking features from the `RankingDebugger` class.
class TarantulaDebugger(ContinuousSpectrumDebugger, RankingDebugger):
"""Spectrum-based Debugger using the Tarantula metric for suspiciousness"""
pass
Let us list `remove_html_markup()` with highlighted lines again:
tarantula_html = test_debugger_html(TarantulaDebugger())
tarantula_html
1 def remove_html_markup(s): # type: ignore
2 tag = False
3 quote = False
4 out = ""
5
6 for c in s:
7 if c == '<' and not quote:
8 tag = True
9 elif c == '>' and not quote:
10 tag = False
11 elif c == '"' or c == "'" and tag:
12 quote = not quote
13 elif not tag:
14 out = out + c
15
16 return out
Here's our ranking of lines, from most suspicious to least suspicious:
tarantula_html.rank()
[('remove_html_markup', 12), ('remove_html_markup', 14), ('remove_html_markup', 3), ('remove_html_markup', 9), ('remove_html_markup', 6), ('remove_html_markup', 4), ('remove_html_markup', 1), ('remove_html_markup', 7), ('remove_html_markup', 16), ('remove_html_markup', 13), ('remove_html_markup', 2), ('remove_html_markup', 11), ('remove_html_markup', 8), ('remove_html_markup', 10)]
tarantula_html.suspiciousness(tarantula_html.rank()[0])
1.0
We see that the first line in the list is indeed the most suspicious; the two "green" lines come at the very end.
For the `middle()` function, we also obtain a ranking from "reddest" to "greenest".
tarantula_middle = test_debugger_middle(TarantulaDebugger())
tarantula_middle
1 def middle(x, y, z): # type: ignore
2 if y < z:
3 if x < y:
4 return y
5 elif x < z:
6 return y
7 else:
8 if x > y:
9 return y
10 elif x > z:
11 return x
12 return z
tarantula_middle.rank()
[('middle', 6), ('middle', 5), ('middle', 3), ('middle', 2), ('middle', 1), ('middle', 12), ('middle', 8), ('middle', 4), ('middle', 10), ('middle', 9)]
tarantula_middle.suspiciousness(tarantula_middle.rank()[0])
0.8333333333333333
The Ochiai metric \cite{Ochiai1957}, first introduced in the biology domain \cite{daSilvaMeyer2004} and later applied to fault localization by Abreu et al. \cite{Abreu2009}, is defined as follows:

$$\textit{suspiciousness}(\textit{event}) = \frac{\textit{failed}(\textit{event})}{\sqrt{\textit{totalfailed} \times (\textit{failed}(\textit{event}) + \textit{passed}(\textit{event}))}}$$

where

* $\textit{failed}(\textit{event})$ is the number of failing runs in which the event occurred,
* $\textit{totalfailed}$ is the total number of failing runs, and
* $\textit{passed}(\textit{event})$ is the number of passing runs in which the event occurred.

We can easily implement this formula:
import math  # for math.sqrt() below

class OchiaiDebugger(ContinuousSpectrumDebugger, RankingDebugger):
"""Spectrum-based Debugger using the Ochiai metric for suspiciousness"""
def suspiciousness(self, event: Any) -> Optional[float]:
failed = len(self.collectors_with_event(event, self.FAIL))
not_in_failed = len(self.collectors_without_event(event, self.FAIL))
passed = len(self.collectors_with_event(event, self.PASS))
try:
return failed / math.sqrt((failed + not_in_failed) * (failed + passed))
except ZeroDivisionError:
return None
def hue(self, event: Any) -> Optional[float]:
suspiciousness = self.suspiciousness(event)
if suspiciousness is None:
return None
return 1 - suspiciousness
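As a quick sanity check, we can compute the metric by hand for two lines of `remove_html_markup()`, using the three runs from above (one failing, two passing):

```python
import math

# Ochiai suspiciousness, computed by hand
failed, not_in_failed, passed = 1, 0, 0   # Line 12: only in the failing run
print(failed / math.sqrt((failed + not_in_failed) * (failed + passed)))  # 1.0

failed, not_in_failed, passed = 1, 0, 2   # Line 14: in all three runs
print(failed / math.sqrt((failed + not_in_failed) * (failed + passed)))  # ~0.577
```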
Applied on the `remove_html_markup()` function, the individual suspiciousness scores differ from Tarantula. However, we obtain a very similar visualization, and the same ranking.
ochiai_html = test_debugger_html(OchiaiDebugger())
ochiai_html
1 def remove_html_markup(s): # type: ignore
2 tag = False
3 quote = False
4 out = ""
5
6 for c in s:
7 if c == '<' and not quote:
8 tag = True
9 elif c == '>' and not quote:
10 tag = False
11 elif c == '"' or c == "'" and tag:
12 quote = not quote
13 elif not tag:
14 out = out + c
15
16 return out
ochiai_html.rank()
[('remove_html_markup', 12), ('remove_html_markup', 14), ('remove_html_markup', 3), ('remove_html_markup', 9), ('remove_html_markup', 6), ('remove_html_markup', 4), ('remove_html_markup', 1), ('remove_html_markup', 7), ('remove_html_markup', 16), ('remove_html_markup', 13), ('remove_html_markup', 2), ('remove_html_markup', 11), ('remove_html_markup', 8), ('remove_html_markup', 10)]
ochiai_html.suspiciousness(ochiai_html.rank()[0])
1.0
The same observations also apply to the `middle()` function.
ochiai_middle = test_debugger_middle(OchiaiDebugger())
ochiai_middle
1 def middle(x, y, z): # type: ignore
2 if y < z:
3 if x < y:
4 return y
5 elif x < z:
6 return y
7 else:
8 if x > y:
9 return y
10 elif x > z:
11 return x
12 return z
ochiai_middle.rank()
[('middle', 6), ('middle', 5), ('middle', 3), ('middle', 2), ('middle', 1), ('middle', 12), ('middle', 8), ('middle', 4), ('middle', 10), ('middle', 9)]
ochiai_middle.suspiciousness(ochiai_middle.rank()[0])
0.7071067811865475
So, which metric is better? The standard method to evaluate such rankings is to determine a ground truth – that is, the set of locations that eventually are fixed – and to check at which point in the ranking any such location occurs – the earlier, the better. In our `remove_html_markup()` and `middle()` examples, both the Tarantula and the Ochiai metric perform flawlessly, as the "culprit" line is always ranked at the top. However, this need not always be the case; the exact performance depends on the nature of the code and the observed runs. (Also, the question of whether there always is exactly one possible location where the program can be fixed is open for discussion.)
It may surprise you that over time, several dozen metrics have been proposed \cite{Wong2016}, each performing somewhat better or somewhat worse depending on the benchmark they were applied on. The two metrics discussed above each have their merits – the Tarantula metric was among the first such metrics, and the Ochiai metric is generally shown to be among the most effective ones \cite{Abreu2009}.
While rankings can be easily evaluated, it is not necessarily clear whether and how much they serve programmers. As stated above, the assumption of rankings is that developers examine one potentially defective statement after another until they find the actually defective one. However, in a series of human studies with developers, Parnin and Orso \cite{Parnin2011} found that this assumption may not hold:
It is unclear whether developers can actually determine the faulty nature of a statement by simply looking at it, without any additional information (e.g., the state of the program when the statement was executed or the statements that were executed before or after that one).
In their study, they found that rankings could help complete a task faster, but this effect was limited to experienced developers and simpler code. Artificially changing the rank of faulty statements had little to no effect, implying that developers would not strictly follow the ranked list of statements, but rather search through the code to understand it. At this point, a visualization as in the Tarantula tool can be helpful to programmers as it guides the search, but a ranking that defines where to search may be less useful.
Having said that, ranking has its merits – notably when it comes to informing automated debugging techniques. In the chapter on program repair, we will see how ranked lists of potentially faulty statements tell automated repair techniques where to try to repair the program first. And once such a repair is successful, we have a very strong indication of where and how the program could be fixed!
In fault localization, the larger and the more thorough the test suite, the higher the precision. Let us try out what happens if we extend the `middle()` test suite with additional test cases.
The function `middle_testcase()` returns a random input for `middle()`:
import random

def middle_testcase() -> Tuple[int, int, int]:
x = random.randrange(10)
y = random.randrange(10)
z = random.randrange(10)
return x, y, z
[middle_testcase() for i in range(5)]
[(9, 0, 5), (0, 1, 9), (9, 0, 9), (2, 9, 0), (4, 7, 2)]
The function `middle_test()` simply checks if `middle()` operates correctly – by placing `x`, `y`, and `z` in a list, sorting it, and checking the middle element. If `middle()` fails, `middle_test()` raises an exception.
def middle_test(x: int, y: int, z: int) -> None:
m = middle(x, y, z)
assert m == sorted([x, y, z])[1]
middle_test(4, 5, 6)
with ExpectError():
middle_test(2, 1, 3)
Traceback (most recent call last):
  File "/var/folders/n2/xd9445p97rb3xh7m1dfx8_4h0006ts/T/ipykernel_10290/3661663124.py", line 2, in <cell line: 1>
    middle_test(2, 1, 3)
  File "/var/folders/n2/xd9445p97rb3xh7m1dfx8_4h0006ts/T/ipykernel_10290/40742806.py", line 3, in middle_test
    assert m == sorted([x, y, z])[1]
AssertionError (expected)
The function `middle_passing_testcase()` searches for and returns a triple `x`, `y`, `z` that causes `middle_test()` to pass.
def middle_passing_testcase() -> Tuple[int, int, int]:
while True:
try:
x, y, z = middle_testcase()
middle_test(x, y, z)
return x, y, z
except AssertionError:
pass
(x, y, z) = middle_passing_testcase()
m = middle(x, y, z)
print(f"middle({x}, {y}, {z}) = {m}")
middle(1, 6, 1) = 1
The function `middle_failing_testcase()` does the same, but its triple `x`, `y`, `z` causes `middle_test()` to fail.
def middle_failing_testcase() -> Tuple[int, int, int]:
while True:
try:
x, y, z = middle_testcase()
middle_test(x, y, z)
except AssertionError:
return x, y, z
(x, y, z) = middle_failing_testcase()
m = middle(x, y, z)
print(f"middle({x}, {y}, {z}) = {m}")
middle(5, 2, 6) = 2
With these, we can define two sets of test cases, each with 100 inputs.
MIDDLE_TESTS = 100
MIDDLE_PASSING_TESTCASES = [middle_passing_testcase()
for i in range(MIDDLE_TESTS)]
MIDDLE_FAILING_TESTCASES = [middle_failing_testcase()
for i in range(MIDDLE_TESTS)]
Let us run the `OchiaiDebugger` with these two test sets.
ochiai_middle = OchiaiDebugger()
for x, y, z in MIDDLE_PASSING_TESTCASES:
with ochiai_middle.collect_pass():
middle(x, y, z)
for x, y, z in MIDDLE_FAILING_TESTCASES:
with ochiai_middle.collect_fail():
middle(x, y, z)
ochiai_middle
1 def middle(x, y, z): # type: ignore
2 if y < z:
3 if x < y:
4 return y
5 elif x < z:
6 return y
7 else:
8 if x > y:
9 return y
10 elif x > z:
11 return x
12 return z
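To see the numbers behind the colors, we can again rank the lines by suspiciousness (a quick sketch):

```python
# Sketch: list the middle() locations from most to least suspicious
for location in ochiai_middle.rank():
    print(location, ochiai_middle.suspiciousness(location))
```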
We see that the "culprit" line is still the most likely to be fixed, but the two conditions leading to the error (x < y
and x < z
) are also listed as potentially faulty. That is because the error might also be fixed be changing these conditions – although this would result in a more complex fix.
We close this chapter with two directions for further thought. If you wondered why, in the above code, we were mostly talking about events rather than lines covered, that is because our framework allows for tracking arbitrary events, not just coverage. In fact, any data item a collector can extract from the execution can be used for correlation analysis. (It may not be so easily visualized, though.)
Here's an example. We define a `ValueCollector` class that collects pairs of (local) variables and their values during execution. Its `events()` method then returns the set of all these pairs.
class ValueCollector(Collector):
""""A class to collect local variables and their values."""
def __init__(self) -> None:
"""Constructor."""
super().__init__()
self.vars: Set[str] = set()
def collect(self, frame: FrameType, event: str, arg: Any) -> None:
local_vars = frame.f_locals
for var in local_vars:
value = local_vars[var]
self.vars.add(f"{var} = {repr(value)}")
def events(self) -> Set[str]:
"""A set of (variable, value) pairs observed"""
return self.vars
If we apply this collector on our set of HTML test cases, these are all the events that we obtain – essentially all variables and all values ever seen:
debugger = test_debugger_html(ContinuousSpectrumDebugger(ValueCollector))
for event in debugger.all_events():
print(event)
out = ''
c = 'a'
out = 'ab'
c = '"'
c = '/'
s = '"abc"'
tag = True
out = 'a'
c = '>'
c = '<'
s = 'abc'
out = 'abc'
c = 'b'
quote = True
tag = False
quote = False
s = '<b>abc</b>'
c = 'c'
However, some of these events only occur in the failing run:
for event in debugger.only_fail_events():
print(event)
s = '"abc"' quote = True c = '"'
Some of these differences are spurious – the string `"abc"` (with quotes) only occurs in the failing run – but others, such as `quote` being True and `c` holding a quote character, are actually relevant for explaining when the failure comes to be.
We can even visualize the suspiciousness of the individual events, setting the (so far undiscussed) `color` flag for producing an event table:
debugger.event_table(color=True, args=True)
remove_html_markup |
s='abc' |
s='<b>abc</b>' |
s='"abc"' |
---|---|---|---|
c = '"' | - | - | X |
c = '/' | - | X | - |
c = '<' | - | X | - |
c = '>' | - | X | - |
c = 'a' | X | X | X |
c = 'b' | X | X | X |
c = 'c' | X | X | X |
out = '' | X | X | X |
out = 'a' | X | X | X |
out = 'ab' | X | X | X |
out = 'abc' | X | X | X |
quote = False | X | X | X |
quote = True | - | - | X |
s = '"abc"' | - | - | X |
s = '<b>abc</b>' | - | X | - |
s = 'abc' | X | - | - |
tag = False | X | X | X |
tag = True | - | X | - |
There are many ways one can continue from here.
There are lots of properties that could all be related to failures – and if we happen to check for the right one, we may obtain a much crisper definition of what causes the failure. We will come up with more ideas on properties to check when it comes to mining specifications.
The metrics we have discussed so far are pretty generic – that is, they are fixed no matter how the actual event space is structured. The field of machine learning has come up with techniques that learn classifiers from a given set of data – classifiers that are trained from labeled data and then can predict labels for new data sets. In our case, the labels are test outcomes (PASS and FAIL), whereas the data would be features of the events observed.
A classifier by itself is not immediately useful for debugging (although it could predict whether future inputs will fail or not). Some classifiers, however, have great diagnostic quality; that is, they can explain how their classification comes to be. Decision trees fall into this very category.
A decision tree contains a number of nodes, each one associated with a predicate. Depending on whether the predicate is true or false, we follow the given "true" or "false" branch to end up in the next node, which again contains a predicate. Eventually, we end up in the outcome predicted by the tree. The neat thing is that the node predicates actually give important hints on the circumstances that are most relevant for deciding the outcome.
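As a tiny standalone illustration (assuming scikit-learn is available, as used below), here is a decision tree trained on two artificial boolean features; the resulting tree immediately reveals which feature decides the label:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy example: only the second feature determines the label
X = [[0, 0], [1, 0], [0, 1], [1, 1]]
y = [0, 0, 1, 1]

tree = DecisionTreeClassifier().fit(X, y)
print(export_text(tree, feature_names=['feature_a', 'feature_b']))
```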
Let us illustrate this with an example. We build a class `ClassifyingDebugger` that trains a decision tree from the events collected. To this end, we need to set up our input data such that it can be fed into a classifier.
We start with identifying our samples (runs) and the respective labels (outcomes). All values have to be encoded into numerical values.
class ClassifyingDebugger(DifferenceDebugger):
"""A debugger implementing a decision tree for events"""
PASS_VALUE = +1.0
FAIL_VALUE = -1.0
def samples(self) -> Dict[str, float]:
samples = {}
for collector in self.pass_collectors():
samples[collector.id()] = self.PASS_VALUE
for collector in self.fail_collectors():
samples[collector.id()] = self.FAIL_VALUE
return samples
debugger = test_debugger_html(ClassifyingDebugger())
debugger.samples()
{"remove_html_markup(s='abc')": 1.0, "remove_html_markup(s='<b>abc</b>')": 1.0, 'remove_html_markup(s=\'"abc"\')': -1.0}
Next, we identify the features, which in our case is the set of lines executed in each sample:
class ClassifyingDebugger(ClassifyingDebugger):
def features(self) -> Dict[str, Any]:
features = {}
for collector in self.pass_collectors():
features[collector.id()] = collector.events()
for collector in self.fail_collectors():
features[collector.id()] = collector.events()
return features
debugger = test_debugger_html(ClassifyingDebugger())
debugger.features()
{"remove_html_markup(s='abc')": {('remove_html_markup', 1), ('remove_html_markup', 2), ('remove_html_markup', 3), ('remove_html_markup', 4), ('remove_html_markup', 6), ('remove_html_markup', 7), ('remove_html_markup', 9), ('remove_html_markup', 11), ('remove_html_markup', 13), ('remove_html_markup', 14), ('remove_html_markup', 16)}, "remove_html_markup(s='<b>abc</b>')": {('remove_html_markup', 1), ('remove_html_markup', 2), ('remove_html_markup', 3), ('remove_html_markup', 4), ('remove_html_markup', 6), ('remove_html_markup', 7), ('remove_html_markup', 8), ('remove_html_markup', 9), ('remove_html_markup', 10), ('remove_html_markup', 11), ('remove_html_markup', 13), ('remove_html_markup', 14), ('remove_html_markup', 16)}, 'remove_html_markup(s=\'"abc"\')': {('remove_html_markup', 1), ('remove_html_markup', 2), ('remove_html_markup', 3), ('remove_html_markup', 4), ('remove_html_markup', 6), ('remove_html_markup', 7), ('remove_html_markup', 9), ('remove_html_markup', 11), ('remove_html_markup', 12), ('remove_html_markup', 13), ('remove_html_markup', 14), ('remove_html_markup', 16)}}
All our features have names, which must be strings.
class ClassifyingDebugger(ClassifyingDebugger):
    def feature_names(self) -> List[str]:
        return [repr(feature) for feature in self.all_events()]
debugger = test_debugger_html(ClassifyingDebugger())
debugger.feature_names()
["('remove_html_markup', 8)", "('remove_html_markup', 14)", "('remove_html_markup', 3)", "('remove_html_markup', 9)", "('remove_html_markup', 6)", "('remove_html_markup', 12)", "('remove_html_markup', 4)", "('remove_html_markup', 1)", "('remove_html_markup', 7)", "('remove_html_markup', 10)", "('remove_html_markup', 16)", "('remove_html_markup', 13)", "('remove_html_markup', 2)", "('remove_html_markup', 11)"]
Next, we define the shape for an individual sample, which is a value of +1 or -1 for each feature seen (i.e., +1 if the line was covered, -1 if not).
class ClassifyingDebugger(ClassifyingDebugger):
    def shape(self, sample: str) -> List[float]:
        x = []
        features = self.features()
        for f in self.all_events():
            if f in features[sample]:
                x += [+1.0]
            else:
                x += [-1.0]
        return x
debugger = test_debugger_html(ClassifyingDebugger())
debugger.shape("remove_html_markup(s='abc')")
[-1.0, 1.0, 1.0, 1.0, 1.0, -1.0, 1.0, 1.0, 1.0, -1.0, 1.0, 1.0, 1.0, 1.0]
Our input X for the classifier now is a list of such shapes, one for each sample.
class ClassifyingDebugger(ClassifyingDebugger):
    def X(self) -> List[List[float]]:
        X = []
        samples = self.samples()
        for key in samples:
            X += [self.shape(key)]
        return X
debugger = test_debugger_html(ClassifyingDebugger())
debugger.X()
[[-1.0, 1.0, 1.0, 1.0, 1.0, -1.0, 1.0, 1.0, 1.0, -1.0, 1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0, 1.0, -1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], [-1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, -1.0, 1.0, 1.0, 1.0, 1.0]]
Our input Y for the classifier, in contrast, is the list of labels, again indexed by sample.
class ClassifyingDebugger(ClassifyingDebugger):
    def Y(self) -> List[float]:
        Y = []
        samples = self.samples()
        for key in samples:
            Y += [samples[key]]
        return Y
debugger = test_debugger_html(ClassifyingDebugger())
debugger.Y()
[1.0, 1.0, -1.0]
We now have all our data ready to be fed into a tree classifier. The method classifier() creates and returns the (tree) classifier for the observed runs.
from sklearn.tree import DecisionTreeClassifier  # required import, if not already imported earlier

class ClassifyingDebugger(ClassifyingDebugger):
    def classifier(self) -> DecisionTreeClassifier:
        classifier = DecisionTreeClassifier()
        classifier = classifier.fit(self.X(), self.Y())
        return classifier
We define a special method to show classifiers:
from sklearn.tree import export_graphviz  # required imports, if not already imported earlier
import graphviz

class ClassifyingDebugger(ClassifyingDebugger):
    def show_classifier(self, classifier: DecisionTreeClassifier) -> Any:
        dot_data = export_graphviz(classifier, out_file=None,
                                   filled=False, rounded=True,
                                   feature_names=self.feature_names(),
                                   class_names=["FAIL", "PASS"],
                                   label='none',
                                   node_ids=False,
                                   impurity=False,
                                   proportion=True,
                                   special_characters=True)
        return graphviz.Source(dot_data)
This is the tree we get for our remove_html_markup() tests. The top predicate is whether the "culprit" line was executed (-1 means no, +1 means yes). If not (-1), the outcome is PASS. Otherwise, the outcome is FAIL.
debugger = test_debugger_html(ClassifyingDebugger())
classifier = debugger.classifier()
debugger.show_classifier(classifier)
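Since the rendered graph cannot be reproduced here in text, a plain-text view of the same tree can be obtained with scikit-learn's export_text() (a convenience sketch, not part of the chapter's own code):
from sklearn.tree import export_text

# Show the learned predicates as indented text, using our feature names
print(export_text(classifier, feature_names=debugger.feature_names()))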
We can even use our classifier to predict the outcome of additional runs. If, for instance, we feed it a feature vector (in the order given by feature_names()) in which a few lines are marked as not executed (-1), but the "culprit" Line 12 is executed (+1), our tree classifier predicts failure.
classifier.predict([[1, 1, 1, 1, 1, 1, -1, 1, -1, 1, -1, 1, 1, 1]])
array([-1.])
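Hand-counting feature positions is error-prone. A small helper (hypothetical, not from this chapter) can build such a vector from a set of covered line numbers, using the same event order as shape() does:
def coverage_vector(debugger: ClassifyingDebugger, covered_lines: Set[int]) -> List[float]:
    """Return a +1/-1 feature vector for the given covered line numbers.
    Assumes all_events() iterates in the same order as when the classifier was trained."""
    return [+1.0 if line in covered_lines else -1.0
            for (function_name, line) in debugger.all_events()]

# A simulated run that covers everything except the "culprit" Line 12
# should be predicted as passing (+1).
vector = coverage_vector(debugger, {1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 13, 14, 16})
classifier.predict([vector])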
Again, there are many ways to continue from here. Which events should we train the classifier from? How do classifiers compare in their performance and diagnostic quality? There are lots of possibilities left to explore, and we are only beginning to realize the potential for automated debugging.
This chapter introduces classes and techniques for statistical debugging – that is, correlating specific events, such as lines covered, with passing and failing outcomes.
To make use of the code in this chapter, use one of the provided StatisticalDebugger subclasses, such as TarantulaDebugger or OchiaiDebugger. Both are instantiated with a Collector denoting the type of events you want to correlate outcomes with; the default is CoverageCollector, which collects line coverage.
To collect events from calls that are labeled manually, use
debugger = TarantulaDebugger()
with debugger.collect_pass():
    remove_html_markup("abc")
with debugger.collect_pass():
    remove_html_markup('<b>abc</b>')
with debugger.collect_fail():
    remove_html_markup('"abc"')
Within each with block, the first function call is collected and tracked for coverage. (Note that only the first call is tracked.)
To collect events from tests that use exceptions to indicate failure, use the simpler with form:
debugger = TarantulaDebugger()
with debugger:
    remove_html_markup("abc")
with debugger:
    remove_html_markup('<b>abc</b>')
with debugger:
    remove_html_markup('"abc"')
    assert False  # raise an exception
with blocks that raise an exception will be classified as failing; blocks that do not will be classified as passing. Note that exceptions raised are "swallowed" by the debugger.
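The collected runs can also be queried directly. For instance, the only_fail_events() method from this chapter yields the events that occur in failing runs only:
debugger.only_fail_events()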
After collecting events, you can print out the observed events – in this case, line numbers – in a table, showing in which runs they occurred (X), and with colors highlighting the suspiciousness of the event. A "red" event means that the event predominantly occurs in failing runs.
debugger.event_table(args=True, color=True)
remove_html_markup | s='abc' | s='<b>abc</b>' | s='"abc"' |
---|---|---|---|
remove_html_markup:1 | X | X | X |
remove_html_markup:2 | X | X | X |
remove_html_markup:3 | X | X | X |
remove_html_markup:4 | X | X | X |
remove_html_markup:6 | X | X | X |
remove_html_markup:7 | X | X | X |
remove_html_markup:8 | - | X | - |
remove_html_markup:9 | X | X | X |
remove_html_markup:10 | - | X | - |
remove_html_markup:11 | X | X | X |
remove_html_markup:12 | - | - | X |
remove_html_markup:13 | X | X | X |
remove_html_markup:14 | X | X | X |
remove_html_markup:16 | X | X | X |
If you collected coverage with CoverageCollector, you can also visualize the code with similar colors, highlighting suspicious lines:
debugger
1 def remove_html_markup(s): # type: ignore
2 tag = False
3 quote = False
4 out = ""
5
6 for c in s:
7 if c == '<' and not quote:
8 tag = True
9 elif c == '>' and not quote:
10 tag = False
11 elif c == '"' or c == "'" and tag:
12 quote = not quote
13 elif not tag:
14 out = out + c
15
16 return out
The method rank() returns a ranked list of events, starting with the most suspicious. This is useful for automated techniques that need potential defect locations.
debugger.rank()
[('remove_html_markup', 12), ('remove_html_markup', 14), ('remove_html_markup', 3), ('remove_html_markup', 9), ('remove_html_markup', 6), ('remove_html_markup', 4), ('remove_html_markup', 1), ('remove_html_markup', 7), ('remove_html_markup', 16), ('remove_html_markup', 13), ('remove_html_markup', 2), ('remove_html_markup', 11), ('remove_html_markup', 8), ('remove_html_markup', 10)]
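As a quick usage sketch, one can iterate over the top of the ranking to obtain the most suspicious locations:
# The three most suspicious locations, formatted as function:line
for function_name, line_number in debugger.rank()[:3]:
    print(f"{function_name}:{line_number}")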
Here are all classes defined in this chapter:
# ignore
from ClassDiagram import display_class_hierarchy
# ignore
display_class_hierarchy([TarantulaDebugger, OchiaiDebugger],
                        abstract_classes=[
                            StatisticalDebugger,
                            DifferenceDebugger,
                            RankingDebugger
                        ],
                        public_methods=[
                            StatisticalDebugger.__init__,
                            StatisticalDebugger.all_events,
                            StatisticalDebugger.event_table,
                            StatisticalDebugger.function,
                            StatisticalDebugger.coverage,
                            StatisticalDebugger.covered_functions,
                            DifferenceDebugger.__enter__,
                            DifferenceDebugger.__exit__,
                            DifferenceDebugger.all_pass_events,
                            DifferenceDebugger.all_fail_events,
                            DifferenceDebugger.collect_pass,
                            DifferenceDebugger.collect_fail,
                            DifferenceDebugger.only_pass_events,
                            DifferenceDebugger.only_fail_events,
                            SpectrumDebugger.code,
                            SpectrumDebugger.__repr__,
                            SpectrumDebugger.__str__,
                            SpectrumDebugger._repr_html_,
                            ContinuousSpectrumDebugger.code,
                            ContinuousSpectrumDebugger.__repr__,
                            RankingDebugger.rank
                        ],
                        project='debuggingbook')
# ignore
display_class_hierarchy([CoverageCollector, ValueCollector],
                        public_methods=[
                            Tracer.__init__,
                            Tracer.__enter__,
                            Tracer.__exit__,
                            Tracer.changed_vars,  # type: ignore
                            Collector.__init__,
                            Collector.__repr__,
                            Collector.function,
                            Collector.args,
                            Collector.argstring,
                            Collector.exception,
                            Collector.id,
                            Collector.collect,
                            CoverageCollector.coverage,
                            CoverageCollector.covered_functions,
                            CoverageCollector.events,
                            ValueCollector.__init__,
                            ValueCollector.events
                        ],
                        project='debuggingbook')
Chapters that build on this one include
The seminal works on statistical debugging are two papers: the Tarantula work by Jones, Harrold, and Stasko, which introduced visualizing the suspiciousness of code locations derived from passing and failing test runs, and the work of Liblit et al. on statistical bug isolation from sampled executions of deployed software.
The Ochiai metric for fault localization was introduced by \cite{Abreu2009}. The survey by Wong et al. \cite{Wong2016} gives a comprehensive overview of the field of statistical fault localization.
The study by Parnin and Orso \cite{Parnin2011} is a must-read for understanding the limitations of the technique.
What would be a postcondition for middle()? How can you check it?
Using the dependencies from the chapter on slicing, can you determine which specific data or control dependencies correlate with failure?
In HTML, attribute values may also be enclosed in single quotes, as in
<b class='extrabold'>abc</b>
Such attributes are actually handled correctly by our remove_html_markup() code.
What happens if test inputs with single quote attributes become part of our test set? How does statistical fault localization change?
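As a starting point (a sketch, not a full solution), one could add such an input as an additional passing run and re-inspect the event table:
# Add a passing run with a single-quoted attribute to the standard test set
debugger = test_debugger_html(TarantulaDebugger())
with debugger.collect_pass():
    remove_html_markup("<b class='extrabold'>abc</b>")
debugger.event_table(args=True, color=True)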
Instead of aiming for lines that correlate with failures, we can look at individual branch conditions such as c == "'", c == '"', or tag that correlate with failures. In the above case, the condition c == "'" would correlate with failure, whereas c == '"' would not.
Reusing the code instrumentation from the chapter on slicing, collect the individual values of Boolean conditions during test execution in a ConditionCollector class. Show the event table. (A simplified sketch of the idea follows below.)
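A minimal sketch of the general idea, using plain ast instrumentation instead of the slicing chapter's machinery (the names ConditionLogger, _log_condition, and OBSERVED_CONDITIONS are ours, not the book's): wrap each if condition in a call that records its source text and the value it evaluated to.
import ast
import inspect

OBSERVED_CONDITIONS: Dict[str, Set[bool]] = {}  # condition source -> observed values

def _log_condition(source: str, value: bool) -> bool:
    """Record the value a condition evaluates to; pass the value through."""
    OBSERVED_CONDITIONS.setdefault(source, set()).add(bool(value))
    return value

class ConditionLogger(ast.NodeTransformer):
    """Wrap every `if`/`elif` test in a call to _log_condition()."""

    def visit_If(self, node: ast.If) -> ast.If:
        self.generic_visit(node)
        node.test = ast.Call(
            func=ast.Name(id='_log_condition', ctx=ast.Load()),
            args=[ast.Constant(value=ast.unparse(node.test)), node.test],
            keywords=[])
        return node

# Instrument remove_html_markup() and run it on the failing input
tree = ast.fix_missing_locations(
    ConditionLogger().visit(ast.parse(inspect.getsource(remove_html_markup))))
exec(compile(tree, filename='<instrumented>', mode='exec'))

remove_html_markup('"abc"')
OBSERVED_CONDITIONS
This simplified sketch records whole if conditions only; splitting compound conditions such as c == '"' or c == "'" and tag into their atomic comparisons, and turning the observations into a proper ConditionCollector that feeds the event table, is the actual exercise.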