Friday 29 April 2011

Multithreading

It all started with one of those late night conversations between a couple of
old programmers who think they could do things better. In reality we never
have the time to do anything more than what we are paid to do, but we can still
dream...

We were discussing the move to multi-core CPUs and how software development was
not keeping pace with it. As experienced programmers we were both convinced
that multi-threaded programming is not something the average programmer
can deal with reliably. All the articles being published by Intel and the like seemed
to push the idea that making it easier to write multi-threaded applications
would somehow boost the acceptance of their new CPU designs.

My view is that there are only a few tasks that really need to be written for a
multi-threaded environment. Operating systems, virtual machines and some intensive
graphics processing are obvious candidates. The average business application
should not need this extra complication.

A multi-threaded application is inherently non-deterministic, and that has some
important implications. For example, just because it worked once in test does not
mean that it will always work correctly. A change in load or timing can alter the
behaviour radically. A passing test is therefore just a statement that it is
possible for the application to work correctly; it says little about the
probability of it always doing so.

In my career, most of the really hard-to-track-down bugs have been the result of
threading issues. In one case a programmer thought he could improve the
performance of an application by using a global variable for a loop counter. It
took months before the timing lined up and the program crashed: the first thread
was interrupted part way through its loop, and the second thread left the shared
counter past the end of the array the first thread was indexing.
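
To make the failure mode concrete, here is a minimal sketch of that kind of bug (my own reconstruction, not the original code): two threads share a single global counter, so either thread can leave it pointing past the end of the array the other is still indexing.

    #include <pthread.h>

    #define N 1000

    int i;                          /* the "optimisation": one global counter shared by both loops */
    int a[N];
    int b[2 * N];

    void *worker_a(void *arg)
    {
        for (i = 0; i < N; i++)     /* may be pre-empted here...                          */
            a[i] = i;               /* ...and resume with i already advanced by worker_b  */
        return NULL;
    }

    void *worker_b(void *arg)
    {
        for (i = 0; i < 2 * N; i++) /* can leave i anywhere up to 2*N - 1 */
            b[i] = i;
        return NULL;
    }

    int main(void)
    {
        pthread_t ta, tb;
        pthread_create(&ta, NULL, worker_a, NULL);
        pthread_create(&tb, NULL, worker_b, NULL);
        pthread_join(ta, NULL);
        pthread_join(tb, NULL);
        return 0;
    }

Most runs complete without incident; only a rare interleaving makes worker_a write past the end of its array, which is exactly why the crash took months to appear.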

In another case an X Windows GUI program was "improved" by adding an
event dispatching loop deep inside an event handler, presumably to stop the
program becoming unresponsive during a lengthy calculation. Occasionally
a fast typist would cause the program to re-enter functions that were still in
use for the previous event, and much chaos would then ensue.

In yet another case an RPC server process that used a shared memory area for its
data did not have sufficient semaphore locks to prevent the occasional collision
when updating the data values. The sequence of events to trigger this situation
was spread out over several months, so you can imagine how hard it was to track
that one down - good logging is essential for this sort of investigation.
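
A sketch of the locking that this kind of server needs (my own illustration, not the original code): every read-modify-write of the shared values has to happen between acquiring and releasing a process-shared semaphore, otherwise two processes can interleave their updates.

    #include <semaphore.h>

    struct shared_data {
        sem_t lock;                 /* created process-shared: sem_init(&lock, 1, 1) */
        long  values[64];
    };

    void update_value(struct shared_data *shm, int idx, long delta)
    {
        sem_wait(&shm->lock);          /* without this pair of calls the +=          */
        shm->values[idx] += delta;     /* (a read, an add and a write) can be        */
        sem_post(&shm->lock);          /* interleaved with another process's update  */
    }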

The lesson to take from these examples is that even good programmers find it
difficult to design an algorithm involving multiple threads. For most people it
is hard enough just to cover all the alternative cases. Adding the possibility
that another thread will change the state at some arbitrary point in the sequence
is generally not something that can be reasoned about easily.

Agents

One possible approach to using multi-core machines efficiently, without adding
the complication of multi-threading, is an "agent" based design. The
idea is that the system is constructed from high-level objects that interact only
through one-way messages. The code within each agent becomes standard event
handling code with no need for more than one thread: each message is processed
to completion before the next is pulled from the queue.
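
As a sketch of what the application programmer would write (my own illustration, with invented message and agent types), the whole of an agent reduces to a handler that is given one message at a time:

    #include <stdio.h>

    enum msg_type { MSG_DEPOSIT, MSG_WITHDRAW, MSG_QUIT };

    struct message {
        enum msg_type type;
        int amount;
    };

    struct agent {
        int balance;                /* private state: no other code ever touches it */
    };

    /* Each message is handled to completion before the next one is looked at,
       so no locking is needed anywhere in the agent's code. */
    static int handle(struct agent *self, const struct message *m)
    {
        switch (m->type) {
        case MSG_DEPOSIT:  self->balance += m->amount; break;
        case MSG_WITHDRAW: self->balance -= m->amount; break;
        case MSG_QUIT:     return 0;
        }
        printf("balance is now %d\n", self->balance);
        return 1;
    }

    int main(void)
    {
        /* A canned mailbox stands in for the queue the runtime would feed. */
        struct message mailbox[] = {
            { MSG_DEPOSIT, 100 }, { MSG_WITHDRAW, 30 },
            { MSG_DEPOSIT, 5 },   { MSG_QUIT, 0 },
        };
        struct agent account = { 0 };
        int n = (int)(sizeof mailbox / sizeof mailbox[0]);

        for (int i = 0; i < n; i++)
            if (!handle(&account, &mailbox[i]))
                break;
        return 0;
    }

All of the concurrency lives in the runtime that delivers messages between agents; none of it appears in code like this.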

Of course the virtual machine that runs this system is multi-threaded, but the
code for each agent is not. This makes the application programmer's job much
simpler (at the expense of the one-off effort of creating the virtual machine).

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Thursday 28 April 2011

Using COTS Products

Initial Build

During the initial build of a system the use of commercial off-the-shelf (COTS)
products can dramatically reduce the time required to deliver the system. This is
the case if the products are well understood by the developers, or if a product
provides functionality that would be expensive to reproduce from scratch.
Conversely, the gains can quickly be lost if the products turn out to be poorly
documented or of poor quality.

When there are several products involved it may be more expensive than initially
expected to get the products to correctly interact with each other. It will often be the
case that some bespoke code is required to smooth over the differences.

It is important during the initial build to document exactly what features of each
product are being used. It is also strongly advisable to create a set of tests that can
ensure that new versions of the COTS software still perform the required actions as
expected.

Licenses

COTS products are usually sold with some form of license, which will often
include a lock that prevents the software from being used on other hardware.

The cost of managing the licenses and the constraints placed on how the system can
be deployed must be taken into account.

Data Formats

One of the aims of commercial software vendors is to keep their customers for as long
as possible, and to have them buy new versions of the product as they are released. A
common tactic is to use a data format which is difficult for other products to use. (A
typical example is Microsoft Word, which has long been renowned for having an
indecipherable format that few other products could interpret.)

The problem with this approach is that it can be difficult to use the product as a
component in a larger system. And, of course, it also makes it difficult to swap the
product out and replace it with another one at a later date.

Upgrade Cycles

When a system consists of several COTS products the maintainers of the system will
eventually be faced with the decision to upgrade to new versions.

The products are unlikely to all require upgrading at the same time, yet an upgrade
to one product may force the upgrade of others, since each supplier only makes some
attempt to stay compatible with the current versions of the rest.

There is usually some point at which upgrades cannot be postponed, such as when the
operating system needs to be upgraded.

Our experience is that this point arrives about every four years for the UNIX
environment. It may be more often in the Windows and Linux environments.

Product Evolution

The suppliers of COTS products will usually evolve their products over time, with
nice new features as an incentive for customers to upgrade. By the time it comes to
upgrade the system, the COTS products may no longer provide the features or
interfaces that they did during the initial build.

COTS products have, almost by definition, a short lifespan. Any such product that is
at all useful is now faced with competition from other vendors, and also from the open
source community. Hence the product must evolve quickly to provide new
functionality or soon be rendered worthless.

It should also be noted that COTS products are actually a rather small (~10%) niche
in the software industry as a whole (most software is written to control some hardware
or business system). The idea of a software component as a commercial product is
quite strange from the economic point of view as it has a near zero cost of production
(but a high cost of development).

Product Obsolescence

COTS products can become obsolete. For example, a package that was used as part of
the system may be withdrawn from sale as a separate product because the vendor
wants to sell it as a part of a larger product.

The vendor may decide to no longer support the platform that the system is running
on. Building a system that runs across multiple platforms is a complication most
support teams could do without.

The vendor can also go out of business, making the acquisition of a new version
impossible.

The immediate impact of this obsolescence will depend on the license mechanism
being used. If the product must contact the vendor's server to operate then the impact
will be immediate. If the license involves a lock to the hardware or network
configuration then you may be able to keep operating for some time. If the license is
time-limited then the crisis point will at least be predictable.

When to Use COTS

From our experience in maintaining a system for a very long period our advice would
be "as infrequently as possible".

If a system is expected to have a short lifetime before it is completely replaced then a
COTS solution may be viable, especially if it gets the system into production quickly.
If the entire system is a single COTS product, such as a complete HR system, then it
may also be worth considering. The main issue with that approach is: how far do you
modify the business to satisfy the product?

If the COTS product provides some function which would be very difficult to
implement yourself, then it may be the only realistic option.

COTS to Avoid

Our experience has demonstrated that building a system from a suite of small COTS
products is not a viable long-term solution, since the products tend to evolve in
different directions and some may cease to exist.

Products that have onerous licensing requirements may be more expensive to
manage than they are worth.

Products that do not provide well-documented or open data formats can lead to
vendor lock-in and huge data conversion costs in the future.

Alternatives

The obvious alternative is to build it yourself. This may not be viable in the short
term, but if the system is expected to last for many years then a process of continuous
gradual redevelopment can result in a system which very closely matches the
requirements of the business.

The Open Source Software movement may be able to provide alternatives to the
commercial products. The advantages of OSS include the absence of license costs, the
use of open data formats and access to the source code. OSS can give the same
benefits as COTS during the initial development - perhaps even more, since the source
code can be examined to get the fine details of interfaces and data formats.
Maintenance is simplified with OSS as the maintainers can, if they wish, skip
upgrades and simply recompile for the new environment. The emphasis on openness
also avoids later data conversion costs and problems interfacing the software with
other components.

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Wednesday 27 April 2011

Chaotic Programming

Introduction

It has been known for many years that developing large software systems is far more
difficult than developing small ones, but why is this the case?

Over the years we have developed many improvements to the process and many new
techniques designed to allow the construction of more complex systems, but all that
seems to happen is that we plan even bigger systems and end up with budget overruns
and failed systems, sometimes on a truly monumental scale. We have moved
from unstructured assembly language, to structured languages (Cobol, C), to
object-oriented languages (Java, C++), and now to SOA techniques, but at each stage
we have projects that seem to be too large to succeed.

The Model

We can describe the work done by a software development team as a function of the
requirements for the project.
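
A plausible way to write this, in the simplest first-order form (the notation here is mine: w(t) is the work done at time t and R stands for the requirements):

    dw/dt = f(R - w(t))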

or, the rate of development (dw/dt) is a function of the difference between the
requirements and the amount of work done so far. This is just a statement that work
begins quickly and tapers off as it approaches the final state.

If we characterise a large system as one that has several development teams working
in unison then we can generalise the above as applying to each team, but with an
additional factor that depends on the state of all the other projects.
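
Keeping the same notation, with a subscript for each team, a coupling constant a and a publication delay T (again, the exact form is my own sketch), this looks something like:

    dw_i/dt = f(R_i - w_i(t)) + a * sum over j != i of g(w_j(t - T))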

where the second term expresses that there is some coupling between this component
project and each other component project, and we only know the state of the other
project as it was at some previous time.

The above expression is a system of "Delay Differential Equations" or DDEs.

In software development terms, what appears to happen is that as each other
component project publishes its state, our project has to modify its code to satisfy
the new interfaces. The larger the interval between publications, the more out of
date our view of the other projects' interfaces becomes, and the more work we need
to do to correct the situation.

Chaos

A DDE will often have a solution governed by Chaos Theory. As the delay increases,
the solution will initially settle to a single state, then switch between an increasing
number of states, and finally become seemingly random.

It seems reasonable to assume that similar behaviour can be expected from an entire
system of DDEs, such as we have in a large software development project.

For small projects where the delay is near zero (i.e. the developers have good
communications) the solution will quickly settle to the desired state. On huge projects
with large delays in communication the solution may never settle to a stable state.
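
A toy numerical experiment makes the point. The sketch below is my own illustration, a deliberately simplified version of the model above in which a single team reacts to a delayed snapshot of its own state, dw/dt = k * (R - w(t - T)). For k = 1 it converges nicely when the delay T is small, but oscillates ever more wildly once k*T passes roughly 1.57 (pi/2); nothing changes between the two runs except the delay.

    #include <stdio.h>

    #define DT    0.01             /* Euler integration step        */
    #define STEPS 4000             /* simulate 40 units of "time"   */

    /* dw/dt = k * (R - w(t - tau)): the team works against a snapshot of its
       own state that is tau old.  Prints the final value of w. */
    static void simulate(double k, double R, double tau)
    {
        static double w[STEPS];
        int lag = (int)(tau / DT);

        w[0] = 0.0;                                   /* nothing built at t = 0 */
        for (int t = 1; t < STEPS; t++) {
            double snapshot = (t - lag >= 0) ? w[t - lag] : 0.0;
            w[t] = w[t - 1] + DT * k * (R - snapshot);
        }
        printf("delay %.1f: final w = %10.1f (target %.1f)\n", tau, w[STEPS - 1], R);
    }

    int main(void)
    {
        simulate(1.0, 100.0, 1.0);   /* k*tau < pi/2: settles close to R           */
        simulate(1.0, 100.0, 2.0);   /* k*tau > pi/2: oscillates and never settles */
        return 0;
    }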

Mitigation

The above formula provides clues to several ways to handle large projects.

Reduce the work required: By using efficient development techniques the amount
of work required to implement the project can be reduced. This lessens the impact of
the other component projects, and can reduce the number of other component projects
required. However, what often happens is that more efficient development techniques
just lead to more ambitious projects.

Reduce the delay: If every component project publishes its state at frequent intervals
then the delay is reduced, and hence the chances of a chaotic solution are reduced.
Some large projects do nightly builds in an attempt to minimise the delay. However,
the delay is not just the time to publish, but also the time for the other developers to
absorb the changes, and understand the implications.

Decouple the projects: If the coupling between the component projects ('a' in the
above expression) can be reduced the chances of falling into a chaotic solution will be
reduced. This can be achieved by using well defined, and stable, interfaces between
the component projects.

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Monday 25 April 2011

The Spaghetti Has Escaped

Once upon a time programs were written in assembler and other languages which
used the GOTO statement as their means of controlling the flow of execution. For
anything but the simplest of programs they also had to be documented with a flow
chart so that you could get a visual representation of the structure of the program.
Complex programs were likened to a bowl of spaghetti since they had a similar
"structure".

Eventually we came to recognise that the "structured programming" techniques of
languages like Algol could untangle the spaghetti. By restricting ourselves to a
small set of conditional, loop and subroutine constructs we were able to build
more complex programs.

As time went by it became clear that a new limit on complexity was emerging.
Since any code could access any data, large programs became difficult to
manage as different parts of the program treated the same data in different ways, or
attributed different meanings to it. The spaghetti had seemingly moved into the data
structures.

The next evolution was the object-oriented paradigm, in which access to data was
localised. This worked well for a while, but as programs grew in size it became
difficult to visualise the flow of control between the objects. For any large program,
sequence and collaboration diagrams became a necessity. The spaghetti had moved
back into the flow of control.

The latest evolution, termed Service Oriented Architecture, aims to overcome this
complexity by dividing the system into smaller, simpler programs interconnected by a
standardised message passing infrastructure. The smaller programs should be easier
to understand and test.

Unfortunately the spaghetti has now escaped onto the network, and we are unlikely to
ever be able to get it back under control. We are now destined to create systems
which are non-deterministic and which are likely to have unexpected emergent
behaviours.

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.