The Eight Fallacies of Distributed Computing

The eight fallacies of distributed computing are common misconceptions that were first identified at Sun Microsystems in the 1990s, though they were well known even before then. With the passage of time, awareness of these fallacies may have faded amongst IT people, so I’d like to remind you of them. They are:

  1. The network is reliable.
  2. Latency is zero.
  3. Bandwidth is infinite.
  4. The network is secure.
  5. Topology doesn’t change.
  6. There is one administrator.
  7. Transport cost is zero.
  8. The network is homogeneous.

I have personal experience of major projects where these strange assumptions were made by senior management and the result was expensive disaster. Many application designers have a touching faith in the network they use, and reality tends to come as a shock.

One might think that, because networks are now faster and cheaper than they were in those far-off days when we first caught on to the idea of distributed systems, these assumptions are no longer fallacies, since technology has turned them into reality. However, despite the marketing spin, the reality of networks continues to trip up the unwary who design distributed systems on the basis of the eight fallacies.

Distributed systems work well within special networks designed and specified for the purpose, and subject to rigorous controls and monitoring, but the chances are that the production environment for your distributed system won’t cut the mustard. Why not? Read on.

The network is reliable

Within a single domain, networks probably seem rock-solid. After all, how often does a network component fail nowadays? Even if a single component should fail, there is plenty of redundancy, surely? Well, as networks get more complex, so network administration becomes more prone to error, mostly through mistakes in configuration. In some cases, up to a third of network changes lead to errors that affect the reliability of the network. Both software and hardware can fail, especially routers, which account for around a quarter of all failures. ‘Uninterruptible’ power supplies can get interrupted, people can make ill-advised device configuration changes, and there can be network congestion, denial of service (DoS) attacks and failed software and firmware upgrades or patches. Networks are subject to disaster both natural and unnatural, and it takes skill to design a network that is resilient to this sort of thing. The wide-area links are outside your control and can easily go wrong.

The incidents on Azure alone, in recent months, make painful reading, and this rate of failure is typical of the major cloud-services providers. For mobile applications, all sorts of things can go wrong: network requests will fail at unpredictable intervals, the destination will be unavailable, data will reach the destination but the confirmation will fail to come back, or data will be corrupted in transmission or arrive incomplete. Mobile applications sit at the scary end of the network-reliability spectrum, but all distributed applications must cope with these eventualities, and network nodes must be able to cope with server failures.
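
The standard defence is to treat every call across the network as something that can fail, and to retry transient failures with a backoff. Here is a minimal sketch in Python, using only the standard library; the URL, timeout and retry parameters are illustrative assumptions, not recommendations:

```python
import random
import time
import urllib.error
import urllib.request

def fetch_with_retries(url, max_attempts=5, base_delay=0.5):
    """Fetch a URL, retrying transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            with urllib.request.urlopen(url, timeout=5) as response:
                return response.read()
        except (urllib.error.URLError, TimeoutError):
            if attempt == max_attempts:
                raise  # out of attempts: let the caller decide what to do
            # Back off exponentially, with a little jitter so that many
            # clients retrying at once don't hammer the server in lock-step.
            time.sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.1))
```

Note that a retry is only safe if the request is idempotent: a request that succeeded but whose confirmation was lost will otherwise be applied twice.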

Latency is zero

Latency is not the same as bandwidth. Latency is the time spent waiting for a response. The cause? As well as the obvious processing delay, there will be network latency, consisting of propagation delay, node delay and congestion delay. Propagation delay increases with distance: it is around 30 ms between Europe and the US. The number of nodes in the path determines the node delay.

Often, developers build distributed systems within an intranet that has insignificant latency, and so there is almost no penalty for making frequent fine-grained network calls. This design mistake only becomes apparent when the system goes live.

One of the disconcerting effects of high latency is that it isn’t constant. On a poor network, it can occasionally be counted in seconds. By its nature, a network gives no guarantee of the order in which individual packets will be serviced, or even that the requesting process still exists; latency just makes things worse. Moreover, where applications compensate by sending several simultaneous requests, a temporarily high latency can be exacerbated by the very response to it.
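
To see why chatty, fine-grained calls hurt, it helps to do the arithmetic. The sketch below assumes a 100 ms round trip (roughly 50 ms each way, a plausible transatlantic figure); the numbers are illustrative:

```python
import math

ROUND_TRIP = 0.100  # seconds: an assumed 50 ms each way

def chatty_seconds(n_items):
    # One network call per item: n round trips of pure waiting.
    return n_items * ROUND_TRIP

def batched_seconds(n_items, batch_size=100):
    # One network call per batch of items.
    return math.ceil(n_items / batch_size) * ROUND_TRIP

print(chatty_seconds(1000))   # 100.0 seconds spent doing nothing but waiting
print(batched_seconds(1000))  # 1.0 second for exactly the same data
```

No increase in bandwidth fixes this; only fewer round trips do.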

Bandwidth is infinite

Whereas modern cables can carry enormous bandwidth, we’ve yet to work out how to build interconnection devices (hubs, switches, routers and so on) that are fast enough to guarantee high bandwidth to all connected users. The typical corporate intranet will still have areas where bandwidth is throttled.

As fast as bandwidth increases over public networks, so too does the usage of the network for services using video and audio, which once employed broadcast technologies. New uses, such as social media, tend to soak up the increased bandwidth. Also, there are the constraints of the “last mile” in many locations outside the major cities, and the increasing likelihood of packet loss.

In general, we need to be careful in assuming that high bandwidth is a universal experience. However impressive the network bandwidth, it is not going to get close to the speed at which co-hosted processes can communicate.
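
One practical mitigation is simply to send fewer bytes. As a minimal sketch, here is how a repetitive JSON payload (typical of the kind that compresses well) might be gzipped before transmission; the payload itself is invented for illustration:

```python
import gzip
import json

# A verbose, repetitive payload of the kind that compresses well.
records = [{"id": i, "status": "OK", "detail": "no anomalies detected"}
           for i in range(1000)]
payload = json.dumps(records).encode("utf-8")
compressed = gzip.compress(payload)

print(f"raw: {len(payload):,} bytes; gzipped: {len(compressed):,} bytes")
```

The saving is not free: compression spends CPU at both ends to save bytes on the wire, a trade-off that reappears under ‘Transport cost is zero’ below.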

The network is secure

It is strange to still come across network-based systems that have fundamental security weaknesses. Network attacks have increased year on year, and have moved way beyond their original roots in curiosity, malice and crime to be adopted as part of international conflict and political ‘action’. Network attacks are part of life in IT: boring to developers, but essential to prevent. Part of the problem is that network-intrusion detection tends to be a low priority, and so we just aren’t always aware of successful network breaches.

Traditionally, breaches were generally the consequence of poorly-configured firewalls. Most firewalls are routinely probed for weaknesses, as you’ll immediately find if you foolishly disable one. However, this is just one of a number of ways of breaching a network, and a firewall is only part of the defense in depth. Wi-Fi is often a weakness; bring your own device (BYOD) can allow intrusion via a compromised device, as can virtualization and software-defined networking (SDN). The increasing DevOps demand for rapidly-changing infrastructure has made it harder to keep essential controls in step. Botnets within enterprise networks are a continuing problem, as are intrusions via business partners.

You need to assume that the network is hostile, and that security must be in depth. This means building security into the basic design of distributed applications and their hosts.

With defense in depth, any part of a distributed system will need to have secure ways of accessing other networked resources.
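
In practice, this means encrypting and authenticating traffic between the parts of the system, not just at the perimeter. As a minimal sketch, Python’s standard library will verify the server’s certificate for you; the URL is a stand-in:

```python
import ssl
import urllib.request

# create_default_context() verifies the server's certificate chain and
# hostname. Resist any temptation to disable verification just to 'make
# the error go away': that is how man-in-the-middle attacks get in.
context = ssl.create_default_context()
context.minimum_version = ssl.TLSVersion.TLSv1_2  # refuse obsolete protocols

with urllib.request.urlopen("https://example.com/", context=context) as resp:
    body = resp.read()
```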

Security brings its own complications, which come from the administrative overhead of maintaining the various user accounts, privileges, certificates and so on. One major cloud-network outage was caused by a permission expiring before it could be renewed.

Topology doesn’t change

Network topologies change constantly, and with remarkable speed. This is inevitable, given the increasing pressure for “network agility” to keep the network in step with rapidly changing business requirements.

Wherever you deploy an application, you must assume that much of the network topology is likely to be out of your control. Network admins will make changes at a time and for a reason that may not be in your interests. They will move servers and change the networking topology to gain performance or security, and make routing changes as a consequence of server and network faults.

It is therefore a mistake to rely on the permanence of specific endpoints or routes. The physical structure of the network must always be abstracted out of any distributed design.
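
In code, that abstraction can be as simple as resolving a service by name at the moment you need it, rather than pinning an address that an administrator may reassign tomorrow. A minimal sketch, with example.com standing in for your service’s DNS name:

```python
import socket

SERVICE_NAME = "example.com"  # a stand-in for your service's DNS name

def current_endpoints(port=443):
    """Return the addresses the service name resolves to right now."""
    infos = socket.getaddrinfo(SERVICE_NAME, port, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # the address is the first element of sockaddr.
    return sorted({info[4][0] for info in infos})

print(current_endpoints())  # may legitimately differ from yesterday's answer
```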

There is one administrator

Unless the system exists entirely within a small LAN, there will be different administrators associated with the various components of the network. They will have different degrees of expertise, and different responsibilities and priorities.

This will matter if something goes wrong that causes your service to fail. Your service-level agreement will require a response within a definite time, and the first stage will be to identify the problem. This may not be easy unless the administrator for the part of the network that has problems is part of your development team. Unfortunately, this isn’t likely. In many networks, the problem could be the responsibility of a different organization entirely. If a cloud component is an essential part of your application, and the cloud has an outage, you are helpless to assert your priority. All you can do is wait.

If there are many administrators of the network, then it is more difficult to coordinate upgrades to either networks or applications. Upgrades and deployments have to be done in coordination, and the more busy people are involved, the more difficult this becomes!

Transport cost is zero

By transport cost, we mean the overall cost of transporting data across the network, whether measured in time and computing resources or in money.

Transferring data from the application layer to the transport layer requires CPU and other resources. Structured information needs to be serialized (marshalled) to get data onto the wire, and parsed at the other end. The performance impact of this can be greater than that of bandwidth and latency hiccoughs, with XML taking twice as long as JSON to process because of its verbosity and complexity.
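
It is worth measuring the marshalling cost in isolation, since it is pure CPU time with no network involved. A minimal sketch using Python’s json and timeit modules; the record shape and repetition count are arbitrary:

```python
import json
import timeit

record = {"id": 12345, "name": "widget", "tags": ["a", "b", "c"],
          "measurements": list(range(100))}

# Time serialization and parsing separately: both happen on every hop.
encoded = json.dumps(record)
serialize = timeit.timeit(lambda: json.dumps(record), number=10_000)
parse = timeit.timeit(lambda: json.loads(encoded), number=10_000)

print(f"serialize: {serialize:.3f}s, parse: {parse:.3f}s for 10,000 records")
```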

The financial transport cost includes not only the hardware and installation costs of creating the network, but also the cost of monitoring and maintaining network servers, services and infrastructure, and the cost of upgrading the network if you discover that the bandwidth is insufficient, or that your servers can’t actually handle enough concurrent requests. We also need to consider the cost of leased lines and cloud services, which are paid for according to the bandwidth used.

The network is homogeneous

A homogeneous network today is rare, even rarer than when the fallacy was first identified! A network is likely to connect computers and other devices, each with different operating systems, different data transfer protocols, and all connected with network components from a variety of suppliers.

There is nothing particularly wrong with heterogeneous networks, though, unless they involve proprietary data-transfer protocols that require specialized support, devices or drivers. From an application perspective, it helps a lot if data is transferred in an open-standard format such as CSV, XML or JSON, and if industry-standard methods for querying data, such as ODBC, are used.
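
As a small illustration, the same rows can be written in two open formats that practically anything can read; which to choose depends on whether consumers need nested structure or just tabular data:

```python
import csv
import io
import json

rows = [{"id": 1, "status": "OK"}, {"id": 2, "status": "FAIL"}]

# JSON: self-describing, and nested structures are easy.
as_json = json.dumps(rows)

# CSV: flat, but readable by spreadsheets and tools you don't control.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "status"])
writer.writeheader()
writer.writerows(rows)

print(as_json)
print(buffer.getvalue())
```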

Where all components come from one supplier, there is a better chance of reliability because test coverage can be greater, but the reality is a rich mix of components. This means that interoperability should be built in from the start of the design of any distributed system.

Conclusions

Data moves more slowly, and less reliably, outside a server, even with modern gigabit networks. This is why we have traditionally preferred to scale up the hardware rather than scale out into network-based commodity hardware. The trouble is that this preference ossified into a golden rule. When it proved possible to control the network to the point where the eight fallacies got closer to reality, the ‘best practice’ could be turned on its head, and the distributed model became more attractive. The problem is that distributed systems, service-oriented architectures and microservices are only effective where the network has been tamed of all its vices. Even today, the breakdown or ‘outage’ of a cloud service happens surprisingly frequently.

When you are planning or developing a distributed application, it is a bad idea to assume attributes and qualities in your network that aren’t necessarily there: far better to plan on the assumption that your network will be costly, and will occasionally be unreliable and insecure. Assume that you’ll face high latency, insufficient bandwidth and a changing topology. When you need changes, you’ll be likely to face the prospect of having to contact many administrators, and a bewildering range of network servers and protocols. You may then be pleasantly surprised, but don’t count on it!