How About Defect Tracking With Expert Systems?

20 07 2010

In an earlier post I discussed an architecture for defect tracking tools which would separate the concepts of symptoms and causes, thus allowing the tool to more closely resemble how problems manifest in real software and how engineers typically approach those problems. I mentioned the need for such a tool to link to a scientific database, or electronic lab notebook. This is necessary because the way we organize test data during the investigation of a macro problem is different from the way we track those problems. I also mentioned that the architecture would make it easier to roll the results of defect tracking into a larger knowledge base, allowing engineers to research issues they’re having by referring to a collection of previously observed symptoms. It occurs to me now that what I was describing was the beginning of a merger between defect tracking tools and expert systems.
Is this done anywhere? There are lots of expert systems used for diagnostics and there are lots of bug trackers, but are there systems that combine features of the two? The more I think about it, the more it seems like this is the only reasonable way to practice software engineering.
One way to approach the issue is to integrate a defect tracker and an expert system database as two distinct but closely coupled tools. If I were using the architecture I described earlier, I would start by going to my expert system first. I would investigate my symptoms and determine whether or not there’s already a solution for my problem. That is, maybe my symptoms do not represent a software defect at all. Rather, diagnostics reveal a fault in hardware or installation parameters. If I don’t find a known solution, then the process of exercising the expert system has got me started on the path of breaking down my problem into well defined pieces. This will help me to design tests for further investigation and it will provide the basis for a new addition to the expert knowledge base once I’ve diagnosed my bug the hard way.
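To give a rough feel for that first step, here is a minimal Python sketch of looking up observed symptoms in a knowledge base. The names (KnownCause, diagnose) and the sample entries are made up for illustration, and a real expert system would use a proper rule engine rather than simple set overlap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KnownCause:
    """A previously diagnosed cause and the symptoms it is known to produce."""
    name: str
    kind: str            # e.g. "software", "hardware", "installation"
    symptoms: frozenset


def diagnose(observed, knowledge_base):
    """Rank known causes by how many of the observed symptoms they explain.

    Returns (cause, matched_symptoms) pairs, best match first. An empty list
    means the knowledge base has nothing to offer yet.
    """
    observed = frozenset(observed)
    matches = []
    for cause in knowledge_base:
        matched = observed & cause.symptoms
        if matched:
            matches.append((cause, matched))
    matches.sort(key=lambda pair: len(pair[1]), reverse=True)
    return matches


# Hypothetical knowledge base entries, for illustration only.
KB = [
    KnownCause("bad network cable", "hardware",
               frozenset({"dropped packets", "slow transfers"})),
    KnownCause("buffer size misconfigured", "installation",
               frozenset({"slow transfers"})),
]

results = diagnose({"dropped packets", "slow transfers"}, KB)
best_cause, explained = results[0]
```

An empty result is the signal to move on to the bug tracker, and the symptoms I had to spell out for the lookup are already broken into pieces I can record there.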
Upon exiting the expert system without a solution, I would turn to the bug tracker to document what I know. Maybe when I started I thought I had a single symptom, but my tour through the knowledgebase gave me a new perspective and I create multiple symptoms. This will make it easier to migrate the information back into the knowledgebase when I’ve fixed the bug, and I can still link all of these symptoms to a single task. Remember, the primary goal of a defect tracker is to allow me to manage my work.
To make life easier, maybe my defect tracker has a means of storing symptom data as expert system entities. That way I can move back and forth between the two tools in native format rather than, say, cutting and pasting text between them. That doesn’t mean I should shy away from free form text. My defect tracker should have room for this too, since in the early stages of problem investigation I might want to brain dump a lot of random thoughts and observations. The same can be said for the objects storing the causes.
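Here is one way that might look, sketched in Python. I am assuming JSON as the shared format, and the field names are hypothetical; the point is only that structured facts and free form notes can live in the same entity and pass between the tools intact.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class SymptomEntity:
    """A symptom with structured facts plus room for a free form brain dump."""
    symptom_id: str
    summary: str
    facts: dict = field(default_factory=dict)   # structured, machine-readable
    notes: str = ""                             # free form text


def export_symptom(symptom):
    """Serialize to JSON so either tool can read it without cut and paste."""
    return json.dumps(asdict(symptom), sort_keys=True)


def import_symptom(payload):
    """Rebuild the entity from the shared JSON form."""
    return SymptomEntity(**json.loads(payload))


# Hypothetical symptom, for illustration only.
s2 = SymptomEntity(
    symptom_id="S-2",
    summary="output file truncated",
    facts={"platform": "AIX", "reproducible": False},
    notes="Seen twice after long runs; may be related to disk quota?",
)
round_tripped = import_symptom(export_symptom(s2))
```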
What I’ve constructed here is a trio of inter-related tools: a defect tracker, an expert system knowledgebase and an electronic lab notebook. All three share a two-way relationship with the other two. Clearly, the tracker is closely related to the knowledgebase, but test results in the notebook may be merged into the knowledge base too without necessarily going through the tracker first. After all, you may stumble upon some properties of the system that don’t contribute much to your particular investigation but which are not currently represented in the knowledgebase and might be of use in the future. Likewise, the knowledge base could aid in interpreting test results and iterating over test designs throughout the investigation process.
It occurs to me to ask the question: are these really three separate tools, or three views on the same model? I think ideally they are the latter. Practical concerns may cause us to deploy our environment with separate tools because we might be able to cobble them together from existing software. But I could imagine a single vendor building a defect tracker and lab notebook interface on top of an expert system which actually manages everything.


Why is the AIX real time kernel not real time?

16 07 2010

This is a minor quibble, but why is the AIX real time kernel called that? It’s not real time at all. To enable this feature, you must alter some bosboot settings, reboot the machine and set the RT_MPC environment variable. Its function consists of using on-demand scheduling with a global run queue. That is, a light weight process generates a schedule interrupt as soon as it becomes runnable and the OS executes it on any processor with lower priority tasks currently running, if they exist. This does have the effect of reducing processing latency standard deviations while increasing the average, which is superficially similar to real time processing characteristics, but real time isn’t a subjective thing. It is defined specifically in terms of meeting scheduling guarantees using mathematically provable algorithms. Scheduling guarantees cannot be met simply by trying to schedule tasks as quickly as possible based on priority. It requires a comprehensive analysis of all tasks, their priorities (both semantic priorities as inputs and algorithmic priorities as outputs of the analysis) and their periodicities. And if I’m not mistaken, scheduling guarantee analysis is an NP-complete task over N processors. That is, tasks must be fixed to processors if they are to be scheduled with any of the common real time algorithms. That means the AIX use of the global run queue is completely opposed to the definition of real time analysis.
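To make "mathematically provable" concrete, one classic example is the Liu and Layland utilization bound for rate-monotonic scheduling on a single processor. A set of n periodic tasks whose deadlines equal their periods is guaranteed to meet every deadline if total utilization does not exceed n(2^(1/n) - 1). The Python sketch below applies that test to two made-up task sets. It illustrates what a scheduling guarantee looks like and has nothing to do with what AIX actually does.

```python
def rm_utilization_bound(n):
    """Liu and Layland bound for n periodic tasks under rate-monotonic priorities."""
    return n * (2 ** (1.0 / n) - 1)


def passes_rm_test(tasks):
    """tasks is a list of (worst_case_execution_time, period) pairs.

    True means every deadline is guaranteed on one processor. False means
    this particular test is inconclusive, not that the set must fail.
    """
    utilization = sum(c / t for c, t in tasks)
    return utilization <= rm_utilization_bound(len(tasks))


# Hypothetical task sets, times in milliseconds.
light = [(1, 4), (1, 5), (2, 10)]   # utilization 0.65
heavy = [(2, 4), (2, 5), (2, 10)]   # utilization 1.10
```

Note that the test needs every task's execution time and period up front, and it assumes the tasks are pinned to one processor. That is the kind of whole-system analysis that priority-driven dispatch from a global run queue does not provide.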

Incidentally, the on-demand scheduling seems to be an obsolete feature as core density goes up. AIX offsets the periodic scheduling interrupts on each core, meaning that higher core density leads to smaller intervals between interrupts.

How To Make Defect Trackers Better

15 07 2010

In a previous post, I discussed how the defect tracking tools with which I’m familiar don’t work well with approaching problems scientifically. A part of the problem is organization for test results, which I may deal with in another entry, but I think the larger issue is that the terms “defect” or “bug” confound several separate concepts. I don’t know if there are any defect tracking tools out there which handle these ideas appropriately, but I will detail my own approach to the problem. If anyone knows of a good tool which does likewise or does it better, then I’d be happy to hear about it.

The first objective is to define three independent terms: symptoms, causes and tasks. A symptom is some output or aggregation of outputs from your software. Outputs in this case may include crashes and other side effects from running the software that are not included in the intended vector of output (e.g. stream I/O or GUI). These are what most people would call bugs. A cause is the root cause of a symptom, such as a coding error or an incorrect parameter. A cause may result in more than one symptom, and multiple causes may have the same symptom. A task is something on your to-do list. Most tasks probably begin life as an investigation into the cause of a particular symptom. As tasks mature, they become well defined prescriptions for repairing a known cause.
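A minimal sketch of these three objects in Python might look like the following. The class and field names are my own invention for illustration, and the starting state (three symptoms, one investigation task each, no causes) is an assumed example rather than data from a real project.

```python
from dataclasses import dataclass, field


@dataclass
class Symptom:
    symptom_id: str
    description: str


@dataclass
class Cause:
    cause_id: str
    description: str
    # One cause may produce many symptoms, and a symptom id may appear
    # under more than one cause.
    symptom_ids: set = field(default_factory=set)


@dataclass
class Task:
    task_id: str
    kind: str        # "investigate" or "repair"
    target_id: str   # a symptom id while investigating, a cause id once repairing


# Hypothetical starting point: nothing is understood yet.
symptoms = {sid: Symptom(sid, "observed misbehavior") for sid in ("S-1", "S-2", "S-3")}
tasks = {
    "T-1": Task("T-1", "investigate", "S-1"),
    "T-2": Task("T-2", "investigate", "S-2"),
    "T-3": Task("T-3", "investigate", "S-3"),
}
causes = {}
```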

Consider the following graph:

The “S” terms represent symptoms and the “T” represent tasks. We don’t know why whatever’s happening is happening yet, so there are no causes. Since we also don’t know how any of the symptoms might be related, we create a separate task to investigate each. After some investigation, I discover a cause that explains some of my symptoms, and my graph looks like this:

Notice that I re-linked T-1 so that it refers to the cause rather than the symptoms.  I could’ve handled this in other ways.  For example, I could’ve linked T-1 to each of the symptoms and if I wanted to trace T-1 back to the root cause I could’ve done so via the link between the symptoms and the causes.  However, it seems more appropriate that once my task has changed from an investigation to a repair job, it should be linked directly to the problem which it is repairing.  Reasons for this should become apparent later, beginning with what we see in the next graph:

There’s no reason why multiple causes can’t have the same symptom. I might repair C-1 and still see S-2 pop up from time to time. Therefore I still need a task with which to investigate this symptom. I suppose there’s nothing wrong with allowing both T-1 and T-2 to link to S-2, but it’s cleaner to link repair tasks directly to causes and investigation tasks directly to symptoms. This will probably make my search queries a lot easier too, especially if I tend to automate them to produce reports.
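The linking rule and the kind of report queries it enables can be sketched in a few lines of Python. The function names are hypothetical, and a real tracker would run these as database queries rather than list comprehensions.

```python
from dataclasses import dataclass


@dataclass
class Task:
    task_id: str
    kind: str        # "investigate" or "repair"
    target_id: str   # symptom id for investigations, cause id for repairs


def promote_to_repair(task, cause_id):
    """An investigation found its cause: relink the task to that cause."""
    task.kind = "repair"
    task.target_id = cause_id


def open_investigations(tasks):
    """Symptom ids that still have someone looking into them."""
    return sorted(t.target_id for t in tasks if t.kind == "investigate")


def repairs_for(tasks, cause_id):
    """Ids of the repair tasks linked directly to the given cause."""
    return [t.task_id for t in tasks if t.kind == "repair" and t.target_id == cause_id]


t1 = Task("T-1", "investigate", "S-1")
t2 = Task("T-2", "investigate", "S-2")
tasks = [t1, t2]

promote_to_repair(t1, "C-1")   # T-1 now repairs C-1; T-2 keeps watching S-2
```

Because each task points at exactly one kind of object, neither query has to walk through symptoms to reach causes or the other way around.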

Note that it’s possible that when I first find C-1, I withdraw T-2 from my task queue because I think I’m going to kill two birds with one stone  and I don’t need another task.  After all, I don’t know that there’s another cause for S-2 until I fix C-1 and see that S-2 still happens.  That’s okay though.  I can just create a new task if I have to.

With a traditional defect tracker, which places symptoms, tasks and causes all in one entry, doing this may have been trickier.  For example, maybe I’d just withdraw (or worse, delete) T-2 and then when I found that S-2 was still around, I’d either have to go find T-2 in my database again (a difficult task given the poor search function in many trackers) or I’d have to create a new task and copy over all the relevant data from T-1.  Of course, when I first discovered C-1, I would’ve copied the data from T-2 over to T-1 before withdrawing T-2.  Or, maybe I would’ve just linked T-1 to T-2 when I discovered C-1, leaving me again with the job of creating a copy of T-2 to track S-2 back to C-2 independently from all the stuff that’s in T-1 and its data.  This is all very messy.  It wastes the time of engineers and it makes following a chain of events or searching defect history all the more complicated.  These problems are solved by doing something very simple, which is to separate symptoms, causes and tasks into separate objects.

I can further ease my bug fixing pain by marking my symptoms with red/yellow/green indicators of some sort. A red symptom is one with no known cause. A yellow symptom is one with a known cause but for which that cause has not been fixed or for which we have insufficient test data to convince ourselves that the problem has really been resolved. A green symptom is one that no longer occurs in the system. Of course, there’s always the problem of proving the negative. I know we can never color a symptom green if we want perfection, but nobody’s perfect. We all have some threshold marking the point at which we’re willing to call a problem solved. This color coding allows us to easily see the status of our defect tracking efforts. I can withdraw T-2 and unlink it from S-2 if I want. If solving C-1 doesn’t make S-2 go away, I don’t have to go searching for S-2 because it’s probably still on my list of yellow symptoms and is easy to find. Otherwise, no sweat. It’s easier to write search tools for symptoms without all the extra chatter from causes and test data, so I should be able to find it if I really have to.
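The color rule is simple enough to write down directly. In this Python sketch the function name and its arguments are my own, and the "verified" flag stands in for whatever threshold of test evidence a team decides is good enough.

```python
def symptom_status(cause_ids, fixed_cause_ids, verified):
    """Color a symptom from what is known about its causes.

    cause_ids:       ids of causes linked to the symptom
    fixed_cause_ids: ids of causes whose repair work is complete
    verified:        True once test data has convinced us it no longer occurs
    """
    if not cause_ids:
        return "red"      # no known cause
    if set(cause_ids) <= set(fixed_cause_ids) and verified:
        return "green"    # every known cause fixed and we trust the evidence
    return "yellow"       # known cause, but unfixed or not yet proven gone
```

In the S-2 scenario, fixing C-1 without convincing test data leaves S-2 yellow, which is exactly why it stays easy to find if it shows up again.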

There’s one further complexity that comes to mind. While causes have precise definitions, symptoms often don’t. Since I don’t know what conditions might be relevant to a symptom (if I did, I’d know the cause!), I’m probably only going to record the most complete description of the problem I have at the time. I’m bound to leave out some details. What happens when I think that S-2 is happening again and upon further investigation I discover that it’s really a slight variation on S-2? It’s possible that a variation on S-2 has a completely separate cause. You’d hope not, since good software design should separate control structures enough that similar symptoms are related to a single locus of control, but then again the more poorly designed the system, the more you need a sophisticated defect tracker to help you dig yourself out of the hole you’re in.

There are other variations on this theme too. For example, the new not-quite-S-2 symptom may very well have a cause that’s closely related to C-1, but because of your particular project’s process cycle you need to open a new issue. You could always create a new task for C-1 and fill out some more data explaining the broader problem, but there’s probably nothing wrong with creating a new cause and linking it to the new symptom with a new task. You could always refer back to C-1 if you wanted, and one could imagine an even more complex approach where causes are grouped into families. But why go there if this approach gives us what we need?

Whatever the case, we’re faced with the prospect of merging and splitting symptoms. We may decide that one symptom has two variations or that two symptoms may need to be merged. Merging symptoms is easy. It’s always easy to simplify. All causes and tasks linked to the original symptoms can be automatically linked by the tool to the new merged symptom. Splitting them may be more complex. If I already have a structure built up around a symptom, I may have to manually iterate through each object that’s linked to it and decide what stays linked to the old symptom and what should be moved to the new one. In most cases this should be simple. There should be one task and at most a couple of causes. There’s probably no good way to automate this. It takes human intelligence to notice the split in the first place, and it takes further intelligence to figure out what that means for the data you’ve already built up. Figuring out how to re-organize the data is what engineers are paid for. The tool is there to make things easier.
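The automatic half of that, merging, can be sketched in Python as follows. The link table and the ids in it are hypothetical. Splitting is deliberately left out, since that is the part that needs a human.

```python
def merge_symptoms(links, old_ids, merged_id):
    """Relink everything that pointed at any of old_ids to merged_id.

    links maps an object id (cause or task) to the set of symptom ids it
    is linked to. The sets are updated in place.
    """
    old_ids = set(old_ids)
    for symptom_ids in links.values():
        if symptom_ids & old_ids:
            symptom_ids -= old_ids
            symptom_ids.add(merged_id)


# Hypothetical links: S-2 and S-4 turn out to be the same symptom.
links = {
    "C-1": {"S-1", "S-2"},
    "C-2": {"S-4"},
    "T-2": {"S-2"},
    "T-3": {"S-3"},
}
merge_symptoms(links, ["S-2", "S-4"], "S-2")
```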

Something else that this structure makes easier is organizing test data.  The test data will likely be stored in some repository outside this tool.  When I’m investigating a symptom, I may run lots of tests and store them in my scientific notebook.  I can link these tests to my task, since that’s really what the task stands for.  The data moves with the task when it becomes a repair task and I can append to it with test data that’s intended to prove that the problem no longer exists.

Note that I can use this system for more than defect tracking. I can also use it to build up a “help” database. Many times, what users interpret as a defect is just a misunderstanding of how the software is supposed to work. By users, I generally mean integrators, since this tool is a development tool and not something for the end user. I can now create symptoms and causes, linking them with any complexity I want, without creating tasks (because there’s nothing for me to fix). With a front end search tool, a user can search for the symptoms she’s seeing and find out if it’s just an input problem or if it’s a known bug.
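A front end search of that kind could start out as simply as the Python sketch below. The entries and the function name are made up for illustration, and plain keyword matching stands in for whatever search technology a real tool would use.

```python
def search_help(entries, query):
    """Return explanations for symptoms whose description mentions every query word.

    entries is a list of (symptom_description, explanation, is_known_bug) tuples.
    """
    words = query.lower().split()
    hits = []
    for description, explanation, is_known_bug in entries:
        text = description.lower()
        if all(word in text for word in words):
            label = "known bug" if is_known_bug else "usage issue"
            hits.append((label, explanation))
    return hits


# Hypothetical help entries: symptoms linked to causes, with no tasks attached.
HELP = [
    ("Import fails with empty output file",
     "The input path must be absolute.", False),
    ("Import hangs on files larger than 2 GB",
     "Tracked as a defect; split the file as a workaround.", True),
]

hits = search_help(HELP, "import empty")
```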

You Need A Futurist on Your Staff

28 05 2010

It should be obvious that most companies need to be aware of changing trends in technology and society. From there it’s not a big leap to say that most companies would benefit from someone who is not only monitoring the new stuff coming out of the pipeline, but thinking deeply about what has yet to arrive. It sounds like fun work, but it’s hard to do when you’ve got to put your money where your mouth is. Some firms might get by with writing memos about the future, but anyone whose business is technology or whose business relies upon significant technology resources will at some point in their career find themselves in a multi-year technological planning project. When you find yourself here, you find yourself planning a major future investment in technology. And the planning isn’t done once that investment becomes shovel ready. If it is large enough, technology will be changing enough during construction that you may have to assess the risks and benefits of modifying certain features mid stream.

Accomplishing this means more than reading blogs and trade magazines to see what others are thinking and what companies are offering. Futurism requires a special set of skills. At this point you are asking yourself questions like: How mature is the technology? How mature is the manufacturing process? Will the technology be a cost effective and reliable replacement for its predecessor and will production rates be able to keep up with demand? How does the technology fit in with the array of existing industry standards? Will it force clients to commit to a proprietary standard, or will it fit in well with their existing infrastructure?

One of the most important points these questions will emphasize is that the most advanced technology may not win out. An example is InfiniBand, which despite fantastic throughput and advanced built in features such as packet routing and remote DMA, falls within the domain of a few specialty computing applications. Support for the protocol among server and network component vendors is good but not great and it’s difficult to get other kinds of cutting edge technology (such as solid state storage) that make use of it. Vendors will always support Ethernet first and then roll out an IB version once their product lines are mature. Because of limited investment, IB may not be able to hold its high performance computing lead for long. There is so much more money behind Ethernet that the standard and the technology evolve rapidly. What may be even more important than all of these issues is the lack of qualified people who know how to build and maintain IB networks or write software that makes the most of the protocol’s special properties.

This of course is only the first set of issues that any futurist must face. However, they show that the discipline is more than a hobby. It requires experience, and I hope that in the future we find a greater emphasis on formal training in projection techniques. For some more in depth discussion, you may want to check out this old but still valuable article in the Tech Republic.

Could Video Games Accumulate Institutional Intelligence?

13 05 2010

As knowledge workers, we march to the constant drum beat of never ending change. Few people in the intellectual disciplines know this quite as well as technology grunts like myself. As a result, training always consumes a certain percentage of my time, and I’ve been on both sides of the equation. I find myself in the student’s seat most often, but several times a year I also teach other professionals. This experience has taught me that being an instructor is little different than being a student. Every course requires frequent revisions to both the content and presentation. This means staying on top of the state of the art so that the courses do not become stale. And formal training is only the most visible form of education. Employers also expect their minions to keep up to date through print publications, blogs and podcasts (I highly recommend Software Engineering Radio as an example of the latter).

All of these vectors have their place, but they can often lead to disjointed and un-directed learning. This leads to an emphasis on certification programs. A certification program relies on an integrated curriculum that immerses students in a body of knowledge and tests their retention. A company may develop its own certification program that is tailored to its own domain, or it may rely on industry certifications. Internal certifications are often the most interesting, because they usually move beyond the mere development and application of skills. They incorporate techniques and philosophies developed by an organization over time. They encompass a proprietary body of knowledge which only that organization possesses and which it believes is critical to its competitive advantage. They are the mechanism by which a company can pass its institutional intelligence down from one generation of employees to the next.

Since internal curricula are so central to corporate survival, then it is essential that their training methods be effective.  Unfortunately, they can sometimes suffer from data leakage in the brains of all too human employees.  Working professionals do not have the leisure of college students and must often accumulate vast amounts of information in short periods of time.  They may cram well enough to pass exams and interviews, but future retention is likely spotty and difficult to track.  Like any skill, institutional intelligence is best learned (and relearned) by doing.  For this reason, I wonder if corporations may benefit from developing more simulations.

Simulations do not have to take electronic form.  Role playing is often an effective way of using all of our learning apparatuses to acquire knowledge.  However, I think electronic games are particularly powerful.  They offer the potential for endless complexity.  They can integrate audio, visual and textual information along with the game logic.  Employees can pursue them on their own time and select the sub-sections that are specific to their job roles.  Electronic simulations also make it easier to revisit lessons for purposes of review and retesting.

I think this last case is similar to the way a software shop implements regression testing.  A suite of well written regression tests documents better than any textual artifact what the designers (or testers) think the system ought to do.  It also catalogs known pitfalls which developers have accumulated over years of experience.  This helps to validate new features as well as maintain old ones.  Regression tests are also a repository of institutional intelligence, and simulations intended to test humans may be no less effective.

Simulations are becoming more common. They are found all over the place outside of the corporate world, from children’s learning games to the military. They are also found in training courseware for industry wide topics. Companies may make use of internal simulations, but I wonder if it is less common because it is more difficult to develop. The other examples of productive video games benefit from economies of scale. It is expensive for a company, even a large one, to fund the development of a large body of simulator code. It may also be inefficient, since most companies do not possess the expertise.

What I wonder is if there is a courseware vendor that makes a framework for easy development of such simulations. There are lots of software packages for adolescent video game developers who want to skip all the nitty gritty graphics control stuff and just work the game logic. Does such a package exist for corporations? Such a software suite would not only have to make it easy for a company to develop the simulation in the first place, but offer an efficient means of updating the product in order to keep it relevant. A simulation is more than just a repository, it is an interactive encyclopedia. Indeed, it could become something like a next-generation Wikipedia: a vast living document that combines information with a powerful means of delivering it.

Conservation of Process, or What Congress Could Learn from Software Engineering

8 05 2010

Every time there’s a disaster, politicians vow to ensure that it will never happen again so that our children will not have to experience our pain.  We have to look no further than the reaction in the United States and other countries to the financial fubar of 2008.  When law makers only look at regulations when they’re looking for a way to show that they’re doing something, the result is a fragmented regulatory system that accumulates new laws year by year without regard for how the entire bureaucracy works as a system.  Perhaps we should prevent our legislators from changing laws without a comprehensive review.  Even better, perhaps they should be required to hold comprehensive reviews of the entire regulatory scheme for major industries every few years.  The industries could be staggered so that every year we would be entering the review cycle for a different one.  I wouldn’t be surprised if this would yield simpler but more powerful regulations, not to mention more stable legislative cycles where total overhauls such as the recent health care and financial reform bills are less frequent.

I think the same ideas apply to engineering, and most software shops actually do a better job of managing their processes than does the US government. If they’re large enough, they have a committee that periodically reviews the entire set of engineering guidelines and looks for ways to both improve process control and streamline the directives. These two are often complementary rather than mutually exclusive.

That is, a naive political approach to regulation assumes that if something goes wrong, the correct solution is to add some new rule. Typically, this is something along the lines of “next time, don’t do that.” After all, how intuitive is it that reducing the number of lines in your handbook would yield more process control? But once the number of regulations reaches a certain level of complexity, following them becomes like driving a car with a steering wheel for each tire. Sure, if you were an octopus you’d have more control, but you’re not – so you’re dead. Businesses understand this. The rank and file are more likely to follow a simpler process, and the regulators are more likely to catch breaches.

Businesses also understand that every line in the manual costs something.  So why don’t they just eliminate all the lines not mandated by the government?  Because they also recognize the competitive advantage of quality control.  The result is that these periodic process reviews also produce cost / benefit reports.  They force us to ask questions about what we really want to get out of our regulations and if the costs are worth while.  By undertaking a comprehensive review, we also get to see the ancillary costs of process reform.  There are many ways in which a new line meant to patch a hole in one spot might interact with other processes in unexpected ways.  Are the costs imposed on those processes worth the overall benefits?  There’s also the other side of the coin:  Can a single regulation replace two regulations that we would never have thought were related without performing the comprehensive review?  Of course, implementing cross-concern regulation requires actually having centralized control over disparate oversight boards.  This is something the federal government frequently lacks.

I call this approach conservation of process, after the laws of conservation of matter and energy from thermodynamics. The conservation laws propose that while you can transform among different forms of matter and energy and between matter and energy, you can never create any more than the sum total of matter and energy in the universe. The law of conservation of process proposes a constant cost / benefit function ratio, scaled by the total financial output of an organization. This forces that organization to consider making room for new processes by either removing obsolete (or less beneficial) ones or by somehow streamlining them to achieve the same results with less effort.

All this is possible by employing data-driven management.  One of the hallmarks of a CMMI Level 5 engineering shop is continuous process improvement.  We collect data and use that data to improve processes and improve the techniques for data collection.  As we get better at collecting data, we gain more visibility into what’s going on in our organizations.  This opens up new domains of process efficiency.  The only thing that stops us from using such an approach everywhere is a failure to recognize the need for both quality control and efficiency.  It’s something that business learned a long time ago and I don’t think it’s a lesson that our own governing agencies can ignore for long without impacting the competitiveness of our country.

Arbitrary Taxonomies and the Illusion of Precision (Pt 3)

3 05 2010

In Part One and Part Two of this blog post, which has now become an essay, I talked about how the books Getting Things Done and The Thinker’s Toolkit inspired me to do more with less, intellectually. Although the train of thought meandered a bit through dales and valleys, my main point was that we have to be careful about assuming that we know more than we do. Our work often requires precision, and sometimes we pursue the illusion of precision when we cannot access the real thing. Like desperate cops who put naming a suspect above solving the case, we embark on a mission for an answer that is more convincing than correct. Everyone from investors to software quality assurance analysts is impressed by evaluations that sound quantitative. But science and statistics cannot unlock all doors. In this case, the best approach is not to make stuff up, but to fess up to our limitations. Be skeptical of decimal points when you know damn well that they cannot be obtained.

Everything that I’ve discussed so far has been relevant to the present. However, these principles are even more relevant to our assessments of the future. An example from software engineering is the tendency to over-design. This happens in other fields of engineering too, but it is way too easy to over-design software. The costs and consequences are often hidden from view until it’s too late to correct our errors. It also doesn’t help that over-design is encouraged by a naive interpretation of common software quality attributes (i.e. the “-ilities”). We hold up squishy words like “scalability” and “maintainability” as virtues, leading to design reviews where someone defends some completely superfluous and non-mandated feature as visionary.

The Agile programming crowd has a rule against over design.  For the project types they are engaged in, experience shows that refactoring is usually less work than building in hooks for feature expansion.  The reason is simple: you won’t know what you need until you need it.  You’ll probably waste time designing for something that never comes to pass while still spending what you would have anyway to implement the new functionality once it becomes necessary.   Not every program is amenable to this philosophy, because rework costs do tend to grow exponentially with program size.  But even for larger projects, you have to have some very concrete ideas about the dimensions of future growth before even attempting to design for them.  By concrete ideas, I mean ideas that are well formed enough to construct requirements that are no less rigorous than your other requirements.

This is the topic that brings me back to Getting Things Done. The book recommends that everyone use a flat alphabetic filing system.  If you’re using paper filing, that means you’re not creating special cabinets or drawers to organize your reference material by high level topics.  Instead, try to group papers by as low a level as possible and then file the folders alphabetically.  That’s it.  Don’t get fancy.  No matter how many drawers you have, you should only have one system distributed across all of them.

You might not think this principle applies to electronic filing, but it does.  If you know you have structured data, then of course you need your system to represent that structure.  But when I say structured data, I mean something you could put into a relational database.  For everything else, filing is never as easy as just hitting the search button.  Most of our data is not organized in a relational database. It’s composed of PDF’s, Excel spreadsheets and all of their companions.  We store them in folders and it is those folders that will lead to the death of our civilization, as we spend endless hours waiting for that Microsoft cartoon dog to find our stuff.  I’m not more hopeful (or trusting) of Google Desktop either.

Why do we do this to ourselves?  Computers make it too easy to make new folders.  They should use that dialogue box that asks you if you’re sure you want to delete something to ask you if you seriously think creating a new folder will clear up your electronic clutter or make it easier to find your stuff later.  How many times has someone asked you to send them that budget report again because they know they have it, but they just can’t find it right now?  If that same person has a folder tree on his or her computer that’s more than ten levels deep, then it’s time to schedule an intervention.

Our electronic folders are the arbitrary taxonomies of the title. We think we understand the relationships among all that data that’s swimming on our hard drives – but we’re wrong. We’re just as wrong as when we think we can score a movie using fractional stars. We’re just as wrong as when we think we can rank ten objects front to back. We’re just as wrong as those guys who built tethers for zeppelins on the Empire State Building because they were designing for the future. Companies like Google spend millions of dollars developing algorithms via which vast server farms can discover the taxonomies linking data. You don’t stand a chance of doing that using your own brain. It’s laughable that I even think I know how my data is related right now, let alone how I will use it in the future. Save yourself a headache. Use just a few folders. Group things together only if you know for sure they are strongly related. Be careful how you name your files, so that it’s easy to find what you’re looking for.

In other words, KISS.