How IPF Handles the Reality

Introduction

Possibly the most quoted definition of a distributed system was given by one of the founders of the field:

A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable.
— Leslie Lamport

Unfortunately, that definition still stands — distributed computing is very much a complex field that often leads to misconceptions and false assumptions.

The now legendary fallacies that L. Peter Deutsch and others at Sun Microsystems identified still remain prevalent and can lead to significant issues if not properly addressed. This document introduces the fallacies and explains how you can try not to fall victim to them when using the IPF SDK.

The Fallacies of Distributed Computing

  1. The network is reliable.

  2. Latency is zero.

  3. Bandwidth is infinite.

  4. The network is secure.

  5. Topology doesn’t change.

  6. There is one administrator.

  7. Transport cost is zero.

  8. The network is homogeneous.

Addressing the Fallacies with IPF SDK

1. The Network is Reliable

Software applications are sometimes written with little to no consideration of networking errors. During a network outage, such applications may stall or infinitely wait for an answer packet, permanently consuming memory or other resources. When the failed network becomes available, those applications may also fail to retry any stalled operations or require a (manual) restart.

The IPF Approach

IPF uses Akka and, as the official Akka documentation states, Akka follows Erlang in making the fallibility of communication explicit through message passing rather than trying to hide it behind a leaky abstraction. The IPF SDK has taken this philosophy to heart.

When communicating with flows built with the IPF DSL, IPF developers should always consider network failures. The code generated by the IPF DSL provides configurable retries and idempotency controls by default, giving developers the mechanisms for building exactly-once delivery semantics where flow inputs are concerned.
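
One way to picture the receiving side of this, deduplication keyed on a message identifier so that retried deliveries produce a single effect, is the following in-memory sketch. The class and its store are purely illustrative and are not part of the IPF API, which persists this state durably:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only: a minimal in-memory idempotency check, standing in for
// the persistent deduplication the IPF-generated code would provide.
public class IdempotentReceiver {
    private final Map<String, String> processed = new ConcurrentHashMap<>();

    /**
     * Returns the stored result if this messageId was already handled,
     * otherwise processes the payload and records the result. Retried
     * deliveries of the same message therefore cause exactly one effect.
     */
    public String handle(String messageId, String payload) {
        return processed.computeIfAbsent(messageId, id -> process(payload));
    }

    private String process(String payload) {
        return "processed:" + payload; // the side effect happens once per messageId
    }
}
```

A sender can then retry freely on timeouts: duplicate deliveries are absorbed by the receiver rather than producing duplicate effects.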

Additionally, IPF DSL allows flow transitions on timeouts, thus promoting timeout management to a first-class citizen — more often than not, what to do in case of timeouts caused by no response from an external system is a business decision and not a technical one. By configuring timeouts developers can craft various timeout policies and thus ensure that business demands are met when faced with network issues.

Since in a typical IPF orchestration application the flows integrate closely with IPF Connector Framework, leveraging retries and timeouts in both send and receive connectors may help you deal with temporary network outages, while for longer outages a retry strategy built around messages sent to a dead letter appender may be more applicable.

Akka Cluster also comes with built-in support for handling network partitions (and machine crashes, which are treated the same way as nodes that are unreachable via the network):

  • Akka Failure Detector monitors the health of remote nodes and detects failures using the phi accrual detection algorithm, allowing the system to respond appropriately. The failure detector uses heartbeat messages to check the availability of nodes.

  • If a node fails to respond to heartbeats within a certain timeframe, it is marked as unreachable and the Akka Split Brain Resolver takes appropriate action, such as shutting down the nodes on the minority side of the partition, allowing messages to be rerouted and tasks redistributed to the remaining nodes.
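
The intuition behind phi accrual detection can be sketched in a few lines. This is not Akka's implementation; it uses the well-known exponential-distribution simplification (phi = elapsed / mean interval × log10 e) rather than the full normal-distribution model, but it shows the key property: suspicion is a continuously growing value, not a binary timeout.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Simplified sketch of phi accrual failure detection (the idea behind the
// Akka failure detector). Uses the exponential approximation
// phi = elapsed / meanInterval * log10(e); not Akka's actual implementation.
public class PhiAccrualSketch {
    private static final double LOG10_E = 0.4342944819;
    private final Deque<Long> intervals = new ArrayDeque<>();
    private final int maxSamples;
    private long lastHeartbeat = -1;

    public PhiAccrualSketch(int maxSamples) { this.maxSamples = maxSamples; }

    /** Record a received heartbeat, keeping a sliding window of intervals. */
    public void heartbeat(long nowMillis) {
        if (lastHeartbeat >= 0) {
            intervals.addLast(nowMillis - lastHeartbeat);
            if (intervals.size() > maxSamples) intervals.removeFirst();
        }
        lastHeartbeat = nowMillis;
    }

    /** Higher phi = higher suspicion that the node is down. */
    public double phi(long nowMillis) {
        if (intervals.isEmpty()) return 0.0;
        double mean = intervals.stream().mapToLong(Long::longValue).average().orElse(1.0);
        return ((nowMillis - lastHeartbeat) / mean) * LOG10_E;
    }
}
```

A node is declared unreachable once phi crosses a configured threshold (Akka's default is 8), which trades off detection speed against tolerance for transient network hiccups.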

Finally, you should apply the same development practices to your flows as you would with your functions: prefer shorter orchestration flows that perform one thing well. Complex flows — flows that consist of many states and raise a lot of domain events throughout their lifetime — often take longer to recover when Akka distributes them to new nodes. A good rule of thumb would be to aim for flows with no more than 20 states — for instance, if performing a sanctions check in your payment flow consists of 8 states, consider extracting that logic into a flow of its own, especially if it’s something that can be reused by other flows.

With all the retry and timeout configuration options available at different levels, it becomes easy to create configuration that causes more harm than good. As a good rule of thumb you should:

  • start with no timeouts on your receiving connectors

  • ensure a sending connector’s attempt timeout is sufficiently shorter than its call timeout if you want the connector to perform any retries

  • ensure the appropriate sending connector’s call timeout is shorter than the smallest action retry interval — otherwise the flow may give up on an attempt too soon and cause your application to create too many downstream requests
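
The ordering rules above lend themselves to a startup sanity check. The sketch below is hypothetical, the parameter names are illustrative rather than actual IPF configuration keys, but it captures the two invariants: attempt timeout < call timeout < smallest retry interval.

```java
import java.time.Duration;

// Hypothetical startup sanity check capturing the rules of thumb above.
// The parameter names are illustrative, not actual IPF configuration keys.
public class TimeoutConfigCheck {
    public static void validate(Duration attemptTimeout,
                                Duration callTimeout,
                                Duration smallestRetryInterval) {
        if (attemptTimeout.compareTo(callTimeout) >= 0) {
            throw new IllegalStateException(
                "attempt timeout must be shorter than call timeout for retries to fire");
        }
        if (callTimeout.compareTo(smallestRetryInterval) >= 0) {
            throw new IllegalStateException(
                "call timeout must be shorter than the smallest retry interval, "
                + "or the flow may give up while a call is still in flight");
        }
    }
}
```

Failing fast at application startup with a clear message is far cheaper than discovering a misordered timeout configuration through duplicate downstream requests in production.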

2. Latency is Zero

Developers often write software that assumes network communication happens instantaneously, with no delay. In reality, latency is always present due to the time it takes for data to travel across the network, influenced by factors such as distance, network congestion, and the processing time of intermediate devices and target external systems. This can lead to performance bottlenecks when the system is deployed in a distributed environment: applications may hold onto critical system resources such as threads while waiting for operations to complete, or initiate more work than can be completed in a given amount of time, consuming ever more resources until everything is exhausted and the application grinds to a halt or eventually crashes.

The IPF Approach

IPF Connector Framework is built on top of Akka Streams which, like other Reactive Streams implementations, provides back-pressure mechanisms to handle varying latencies and ensure a smooth data flow. Akka Streams allows for the definition of data processing pipelines that can adapt to the speed of the slowest component. This prevents overwhelming any part of the system and ensures that data is processed efficiently, even in the presence of network latency.
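
The essence of back-pressure can be demonstrated with plain JDK primitives (this is a stand-in for illustration, not Akka Streams itself): a bounded buffer forces a fast producer down to the consumer's pace instead of queueing work without limit.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Plain-JDK sketch of the back-pressure idea (not Akka Streams itself):
// a bounded buffer makes the producer block when the consumer falls behind,
// keeping memory usage constant regardless of message volume.
public class BackPressureSketch {
    public static int runPipeline(int messages, int bufferSize) throws InterruptedException {
        BlockingQueue<Integer> buffer = new ArrayBlockingQueue<>(bufferSize);
        Thread producer = new Thread(() -> {
            for (int i = 0; i < messages; i++) {
                try { buffer.put(i); }                 // blocks when the buffer is full
                catch (InterruptedException e) { Thread.currentThread().interrupt(); return; }
            }
        });
        final int[] consumed = {0};
        Thread consumer = new Thread(() -> {
            for (int i = 0; i < messages; i++) {
                try { buffer.take(); consumed[0]++; }  // drains at its own pace
                catch (InterruptedException e) { Thread.currentThread().interrupt(); return; }
            }
        });
        producer.start(); consumer.start();
        producer.join(); consumer.join();
        return consumed[0];
    }
}
```

Reactive Streams generalizes this beyond a single process: demand is signalled across asynchronous (and network) boundaries, so a slow downstream system throttles the whole pipeline rather than blowing up a queue somewhere in the middle.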

Additionally, both Akka and IPF SDK have been built around non-blocking operations, ensuring no system resources are being wasted on waiting for I/O operations to complete but are instead used to process pending work.

Since business operations will rarely want to wait forever for an operation to complete, when considering latency in your solutions you should apply the same retry and timeout thinking already outlined in The Network is Reliable.

Finally, often the best way to deal with network latency is not to use the network at all — caching frequently used data locally instead of making repeated remote calls is often a great way to reduce the impact of latency on your application. IPF provides its own in-memory caching library but doesn’t prevent you from using any other caching technology you may prefer. Beware of the old saying, though:

There are only two hard things in Computer Science: cache invalidation and naming things.
— Phil Karlton
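
As a rough illustration of the trade-off, here is a minimal local TTL cache. It does not mirror the IPF caching library's API; it simply shows the simplest invalidation strategy, letting entries expire after a fixed time-to-live, which bounds how stale a cached value can get:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Minimal sketch of a local TTL cache, standing in for the IPF in-memory
// caching library (whose API this does not attempt to mirror). Entries
// expire after a fixed time-to-live: one crude answer to invalidation.
public class TtlCache<K, V> {
    private record Entry<V>(V value, long expiresAt) {}
    private final Map<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final long ttlMillis;

    public TtlCache(long ttlMillis) { this.ttlMillis = ttlMillis; }

    /** Returns the cached value, invoking the loader only on a miss or expiry. */
    public V get(K key, Function<K, V> loader, long nowMillis) {
        Entry<V> e = entries.get(key);
        if (e == null || nowMillis >= e.expiresAt) {        // miss or expired
            e = new Entry<>(loader.apply(key), nowMillis + ttlMillis);
            entries.put(key, e);
        }
        return e.value();
    }
}
```

The TTL is the business knob: a reference-data lookup might tolerate hours of staleness, while an account balance might tolerate none, in which case caching is the wrong tool.
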

IPF comes with built-in support for telemetry, the easiest to use being the Akka, application, and connector metrics, all of which include latencies for different parts of the system.

3. Bandwidth is Infinite

Software developers often allow unbounded traffic to flow to and from their applications with little to no regard for bandwidth limits, in the worst case causing congestion which adds latency into the network, potentially even wasting bandwidth by increasing dropped packets due to buffer overflows.

The IPF Approach

As with Latency is Zero, the existing back-pressure mechanisms will enable IPF SDK solutions to slow down when latency increases because bandwidth demands are pushing the network to the brink. But the IPF SDK also offers proactive approaches to controlling bandwidth before it starts causing issues:

  • Send connector circuit breakers can be used to proactively shed traffic to downstream systems — once the latency grows past the connector’s configured attempt timeout, circuit breakers will fail-fast and skip transmitting anything over the network.

  • Message throttling in both the sending and receiving connectors allows you to set arbitrary limits on the amount of messages you are willing to send or receive within a given time interval, applying back-pressure when the limits are exceeded.

  • IPF Connector conversion stages allow you to use any serialization/deserialization protocol that suits your needs and picking a more efficient protocol than the one you are currently using may help cut some of the bandwidth. The popular options include CBOR, Avro, Protocol Buffers and Ion, but even moving to a binary JSON representation like Smile may yield results.

  • Review your flows and find opportunities to reduce the amount of data included in domain events:

    • Ensure there’s no data redundancy in the domain events themselves, e.g. don’t declare a business data element on a domain event if its value doesn’t really change.

    • Only declare business data elements that your flows are actively using, e.g. don’t try to include data for audit purposes, as that is a concern usually handled outside of the orchestration flows themselves.

  • Enable compression at different levels:

    • When persisting IPF DSL domain events or when communicating with different Akka actors (including IPF DSL flows), Akka Serialization is used to serialize data before transport and some of the available serializers, like the IPF-default Jackson Serializer, offer compression capabilities.

    • MongoDB, Kafka as well as many Java HTTP servers support transport level compression that you can enable in order to reduce the size of the payloads exchanged over the network.

  • Leverage caching. As mentioned in the previous section, IPF comes with its own in-memory caching library but you are free to use your own. To help reduce the overall bandwidth, however, you can apply caching outside of IPF as well — load balancers and service meshes offer caching support, as do CDNs.

  • Re-architect some of the components in your application to be local-first. As described elsewhere, IPF SDK relies on event sourcing so it is possible to subscribe to the domain events emitted by the flows and build your (eventually consistent) local materialized views and projections.

    While the local-first approach may be taken to improve your application’s resiliency and availability, if you are adopting it solely for the bandwidth improvements you have to ensure the amount of data that you need to read from external services is larger than the event data you are subscribing to — otherwise by going local-first you may actually end up increasing the bandwidth.
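
The message throttling described above is typically implemented with a token bucket, which allows short bursts up to a capacity while enforcing a sustained rate. The sketch below is illustrative, not the IPF implementation:

```java
// Illustrative token-bucket throttle, the mechanism typically behind
// "at most N messages per interval" limits like the connector throttling
// described above. Not the IPF implementation.
public class TokenBucket {
    private final int capacity;          // maximum burst size
    private final double refillPerMilli; // sustained rate, tokens per millisecond
    private double tokens;
    private long lastRefill;

    public TokenBucket(int capacity, int tokensPerSecond, long nowMillis) {
        this.capacity = capacity;
        this.refillPerMilli = tokensPerSecond / 1000.0;
        this.tokens = capacity;
        this.lastRefill = nowMillis;
    }

    /** True if a message may be sent now; false means apply back-pressure. */
    public synchronized boolean tryAcquire(long nowMillis) {
        tokens = Math.min(capacity, tokens + (nowMillis - lastRefill) * refillPerMilli);
        lastRefill = nowMillis;
        if (tokens >= 1.0) { tokens -= 1.0; return true; }
        return false;
    }
}
```

In a streaming setting, a `false` result translates into back-pressure on the upstream stage rather than a dropped message, so the limit propagates instead of losing data.
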

Make sure to include detailed network metrics when you are configuring observability of your IPF application deployments.

4. The Network is Secure

Unless security is one of the main selling features of the system being developed, it is often an afterthought in the software development cycle, which frequently leads to vulnerabilities.

The IPF Approach

Assuming the network is secure often leads to neglecting encryption of data in transit, and without encryption, sensitive information can be intercepted by malicious actors. While a typical application built with the IPF SDK will usually not be running within a DMZ, there are still security-conscious decisions you can make to reduce the security risks.

Akka Remoting — the Akka component used by Akka Cluster internally and by IPF to communicate with the IPF SDK flows — supports transport-level encryption, with an added bonus of supporting mTLS via frequently rotated certificates when running in Kubernetes.

In addition to supporting transport-level encryption in Akka, the IPF SDK also lets you add transport-level encryption to other components.

Finally, if the target consumers support it, IPF Connectors allows sending and consuming encrypted payloads via encryption stages. Using encrypted payloads with JMS and Kafka also ensures data is encrypted at rest.
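
As a sketch of what an encryption stage conceptually does, the JDK's own crypto APIs suffice for authenticated symmetric encryption with AES-GCM. This is an illustration of the transformation, not the IPF encryption stage itself, and key management (rotation, distribution, HSMs) is deliberately out of scope:

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import java.security.SecureRandom;

// Sketch of symmetric payload encryption with AES-GCM from the JDK: the kind
// of transformation an encryption stage could apply before a payload is
// handed to JMS or Kafka. Key management is deliberately out of scope here.
public class PayloadCrypto {
    private static final int IV_BYTES = 12, TAG_BITS = 128;

    public static SecretKey newKey() throws Exception {
        KeyGenerator gen = KeyGenerator.getInstance("AES");
        gen.init(256);
        return gen.generateKey();
    }

    /** Returns the random IV prepended to the ciphertext, so decrypt() is self-contained. */
    public static byte[] encrypt(SecretKey key, byte[] plaintext) throws Exception {
        byte[] iv = new byte[IV_BYTES];
        new SecureRandom().nextBytes(iv);
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
        byte[] ct = cipher.doFinal(plaintext);
        byte[] out = new byte[IV_BYTES + ct.length];
        System.arraycopy(iv, 0, out, 0, IV_BYTES);
        System.arraycopy(ct, 0, out, IV_BYTES, ct.length);
        return out;
    }

    public static byte[] decrypt(SecretKey key, byte[] message) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.DECRYPT_MODE, key,
                new GCMParameterSpec(TAG_BITS, message, 0, IV_BYTES));
        return cipher.doFinal(message, IV_BYTES, message.length - IV_BYTES);
    }
}
```

GCM also authenticates the payload, so a broker or intermediary tampering with the ciphertext causes decryption to fail rather than yielding corrupted data.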

Including security and penetration tests in your CD pipeline is a good idea but always consult security experts (internal or external) to periodically do a proper security audit — it could save you from making headlines the bad way!

5. Topology Doesn’t Change

With the widespread virtualization of server machines and containerization of applications, the Pets vs. Cattle analogy is shifting heavily in favour of the cattle approach: machines or application instances being started, stopped, and disappearing are a daily fact of life. Poorly designed systems that do not account for redundancy and fault tolerance are more vulnerable to changes in network and machine topology. If a single node or link fails, the entire system can be brought to a halt.

The IPF Approach

As mentioned before, IPF SDK relies on Akka Cluster and its plugins to bootstrap the service cluster and then dynamically adapt to changes in the network and machine topology, ensuring continuous availability and scalability. Akka Cluster and its discovery mechanisms (including the IPF-owned MongoDB-based discovery) can automatically detect when nodes join or leave the cluster and redistribute data and tasks accordingly. This ensures that the system remains operational even as the network topology changes.

While the mentioned components cover graceful topology changes, when dealing with non-graceful changes — i.e. crashes and failures — message loss, no response or increased latency are all common symptoms so they are dealt with in much the same way as outlined in Network is Reliable and Latency is Zero.

Additionally, since IPF SDK aims to support writing cloud-native applications by design, it offers some tools to help you when deploying into the cloud:

  • HTTP APIs for health, liveness and readiness checks via the Spring Boot Actuator and Akka Management HTTP endpoints that can be used by the various platforms to know when to safely start routing traffic to your application instance or when to remove or restart an unresponsive or crashed instance.

  • Connector Operations API to allow you to build automation around starting and stopping receiving traffic.

  • Rolling upgrade support to let you avoid downtime when deploying new versions of your applications.

6. There is One Administrator

Particularly common when there is a lack of communication between infrastructure and development teams, the assumption that there is a single administrator with whom all knowledge is centralized and consistent can lead to unexpected failures when conflicts and misconfigurations happen.

The IPF Approach

Unlike with the previous fallacies, the IPF SDK plays only a supporting role in helping you overcome the issues presented by this one — the main act is the observability and monitoring tools and practices you have adopted. As mentioned in Topology Doesn’t Change and Latency is Zero, IPF comes with extensive observability support, as well as automation-friendly and cloud-native HTTP endpoints. These not only enable you to keep track of your SLAs and KPIs but can also be used by alerting and monitoring tools to help you add robust automation into CI/CD pipelines. When your CI/CD includes all your applications and also encompasses infrastructure, it becomes possible to perform automated rollbacks of breaking changes. Even without automation, the IPF and Akka telemetry should allow you to build alerting rules and operational dashboards that can help identify issues.

As part of your operational practices, you should strive to automate everything that can be automated. For areas that cannot be, build good alerting and dashboards so that your operators have a simpler time pinpointing issues.

7. Transport Cost is Zero

Whether you’re building and managing your own datacenters or using one of the cloud providers, there is always a cost associated with network communication. From a software perspective that cost is paid by computational overhead and latency of remote calls, but from an operational perspective the cost is also financial. Forgetting about this cost factor during design and development of your application may turn a great business improvement idea into a huge money pit.

The IPF Approach

Your focus here will be on reducing the number of network calls and the size of the payloads sent over the wire, and on improving data locality. If all that sounds familiar, you’ve been paying attention! The approach taken here is the same as for Bandwidth is Infinite.

The only difference between the approaches is what you track in your monitoring and reporting tools — you will want to identify the metrics that correlate to your costs, and use them first to guide your optimizations, and then to keep a careful eye on the budget. For cloud deployments, you may want to focus on dedicated networking infrastructure like NAT gateways and cross-datacenter communication since those usually incur significant data transfer costs.

8. The Network is Homogeneous

Developers frequently assume that all parts of a network are uniform in terms of performance, reliability, and configuration, which may be in stark contrast with reality even when small-to-medium-sized local networks are concerned, let alone the huge private networks managed by various cloud providers.

The IPF Approach

This fallacy was a late addition to the list by James Gosling and in a way it binds all the previous fallacies together. The reality is that networks can include a mix of different hardware, software, protocols, configurations, and associated costs and these can all change at any point in your application’s lifetime.

Therefore, the applications you build need to be able to handle network partitions or destinations becoming unreachable, changing latencies and sudden congestions, or security requirements becoming stricter overnight.

Conclusion

The IPF SDK, with its reliance on various Akka modules, provides a comprehensive solution to the realities of working with modern distributed architectures. By leveraging these technologies, developers can more easily build robust, scalable, and reliable systems that address the inherent challenges of distributed computing.

As has been mentioned several times throughout this document, the IPF SDK was not made to work in isolation. It makes building 12-factor and cloud-native applications easier, but to get the most out of it, it needs to be fully integrated into your existing operational ecosystem and included in the relevant alerting and monitoring tools that you use.