Move Fast, Break Fewer Things

Jun 18

The etiquette to achieve more stability in your applications

Maintaining production applications that are scaling is a giant challenge. All the big tech companies have gone down this path. Facebook itself had to publicly walk back its famous motto “move fast and break things” after the developer community started asking for more stable releases.

The fact is that there are plenty of ways to minimize trouble as new releases arrive. The key here is not to pursue a 0% error rate, but to handle exceptions and unexpected flows in order to remove the end user's perception of error and minimize financial loss.

Hope Is Not A Pattern

The first step is to accept that things are going to break and that winter is coming. There is no such thing as “let’s trust …” whatever it is. You need to be skeptical about your user, environment, integrations and beyond.

A great example is an application that does not take care of timeouts. There is no reason to wait indefinitely for an API request to respond. Hope is not a development pattern that you can use to wait for things to happen.
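A minimal sketch of the idea in Python, using only the standard library. The function name `fetch_or_give_up` and the two-second default are assumptions made for illustration; pick a limit that matches what your users can tolerate.

```python
import urllib.request


def fetch_or_give_up(url, timeout_seconds=2.0):
    """Return the response body, or None if the call fails or takes too long."""
    try:
        # Without an explicit timeout, the global socket default applies,
        # which is to wait with no limit unless someone changed it.
        with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
            return response.read()
    except OSError:
        # URLError and socket timeouts are both OSError subclasses.
        return None
```

Returning `None` is just one choice. The point is that the caller gets an answer within a bounded time and can decide what to show the user instead of hanging.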

Bulkheads

If you have watched Titanic, you may remember the scene where Rose is shown how the ship sank. Bulkheads are the partitions that divide the hull into isolated compartments, which can be sealed off to stop water from flowing to other parts of the ship.

You can apply the same concept when defining usage limits for APIs or computational resources. When the system is overloaded, for example, you will need enough resources held in reserve so that you can still log in to your server as an admin user.
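One way to sketch a bulkhead in code is a small concurrency limit per resource. The `Bulkhead` class and the pool sizes below are hypothetical, chosen only to show the shape of the idea.

```python
import threading


class Bulkhead:
    """Caps how many callers may use one resource at the same time."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, func, *args, **kwargs):
        # Reject immediately instead of queueing, so an overloaded
        # compartment cannot drag the rest of the system down with it.
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("bulkhead full, request rejected")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()


# Hypothetical split: public traffic and admin access get separate pools,
# so a flood of public requests never uses up the admin slots.
public_api = Bulkhead(max_concurrent=8)
admin_api = Bulkhead(max_concurrent=2)
```

Because each pool has its own semaphore, exhausting `public_api` leaves `admin_api` untouched, which is the "sealed compartment" behavior described above.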

Middlewares

Middleware is a fancy name for tools that integrate systems that were not built to work together. If you happen to need to create one, keep it as decoupled as you can from both systems.

This allows both systems to work truly independently, and in case of any trouble, the messages exchanged between them can be handled once the failing system is ready again.
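A toy version of that decoupling, assuming the receiving system signals that it is unavailable by raising `ConnectionError`. The class name and the in-memory buffer are illustrative only; a real middleware would most likely sit on a durable queue so that messages survive a restart.

```python
from collections import deque


class BufferingMiddleware:
    """Holds messages from one system until the other system accepts them."""

    def __init__(self, deliver):
        self._deliver = deliver  # callable that hands one message to the receiver
        self._pending = deque()

    def publish(self, message):
        # The sender only ever talks to the buffer, never to the receiver.
        self._pending.append(message)

    def flush(self):
        """Try to deliver pending messages in order; return how many were delivered."""
        delivered = 0
        while self._pending:
            try:
                self._deliver(self._pending[0])
            except ConnectionError:
                break  # receiver is down: keep the message for later
            self._pending.popleft()
            delivered += 1
        return delivered
```

The sender never learns whether the receiver is up. It publishes and moves on, and `flush` can be retried on a schedule until the backlog is empty.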

Paranoia is Good Engineering

Production environments are much more than one or two computers or smartphones accessing your application. Everything is different.

Applying stability patterns is not enough. Read all the use case scenarios for your application to get new insights into the creative ways that flaws may get you in trouble. Users will not always use your product for the purpose you created it for.

If you, like me, believe that things will turn chaotic as soon as you step away from your computer, you are right! It happens to everybody who takes care of a system that is changing continuously.

But since we need to have a life too, the best you can do is invest time and resources in making system recovery as simple, fast, and automated as possible.

Continuous Integration

As your applications grow, migrate to a continuous integration approach as soon as you can. Create your builds automatically, on the continuous integration servers. Developers' computers are messy and full of risks.

Once the build is automated, the “it worked at home” situation no longer happens.

Finally, have your CI server deliver the build to a place no one else can write to.

Immutable Server

Now that cloud computing has made servers simple and cheap, there is no reason to work with the same server instance forever. You should think of your instances as immutable.

For every new release, instead of updating your instance, you generate a new one from your image and the new automated build, release it, and kill the old instance.

This guarantees that your instance is always clean and in a known state. Deprecated or unused libraries will not linger on your server, since you always create a new one with only the resources and setup you need.

In a crisis, it is fast to deploy new instances or to recover old versions too, since you can quickly generate them from your images.

Config Files

Many teams do not commit their config files to repositories for security reasons. The ideal scenario is to keep your config files inside a repository, where you can control their versions, and to restrict access to the files to the users who actually need it.

Log Is Not Another Word For Error

Not every mistake your user makes must be logged as an error. Log usage errors, or problems the user runs into with business rules (like trying to execute a task without permission), as a warning or another lower level. Keep the ERROR category for critical problems such as database communication failures, system outages, etc.
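With Python's standard `logging` module, the split could look like the sketch below. `UsageError` and `run_task` are hypothetical names; the only real API used is `logging` itself.

```python
import logging

logger = logging.getLogger("app")


class UsageError(Exception):
    """Hypothetical: the user broke a business rule (e.g. missing permission)."""


def run_task(task):
    try:
        return task()
    except UsageError as exc:
        # Expected user mistake: worth counting, not worth paging anyone.
        logger.warning("usage problem: %s", exc)
    except Exception:
        # Anything else is treated as a system problem; logger.exception
        # logs at ERROR level and keeps the traceback.
        logger.exception("task failed")
    return None
```

Alerts can then be wired to the ERROR level only, so a user who mistypes a password at 3 a.m. does not wake anybody up.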

Useful Metrics For Monitoring

Read every number below as relative to a time window (like “last n minutes”):

Traffic: visits, page requests, number of transactions, simultaneous visits.
Business transactions (for each type): number processed and aborted, total value processed, average time, conversion rate.
Users: percentage of registered users, number of users, usage patterns, user usage mistakes, successful logins, unsuccessful logins.
Database: number of SQL exceptions, number of queries, average response time.
Integration: number of timeouts, number of requests, average response time, number of successful responses, number of network errors, number of application errors, number of simultaneous requests.
Cache: number of items, total memory used, usage count, number of items evicted by the garbage collector, time spent creating an item.
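To make the “last n minutes” reading concrete, here is a small sliding-window counter. It is a sketch only: the class name, the injectable clock, and the five-minute example are assumptions, and a production setup would normally lean on a metrics system rather than hand-rolled counters.

```python
import time
from collections import deque


class WindowCounter:
    """Counts events that happened within the last `window_seconds`."""

    def __init__(self, window_seconds, clock=time.monotonic):
        self._window = window_seconds
        self._clock = clock  # injectable so the counter can be tested
        self._events = deque()

    def record(self):
        self._events.append(self._clock())

    def count(self):
        cutoff = self._clock() - self._window
        # Drop timestamps that have fallen out of the window.
        while self._events and self._events[0] <= cutoff:
            self._events.popleft()
        return len(self._events)


# Hypothetical metric: integration timeouts in the last five minutes.
timeouts_last_5_min = WindowCounter(window_seconds=300)
```

Calling `record()` each time a timeout happens and reading `count()` from a dashboard gives a number that rises and falls with the current state of the system, instead of a total that only ever grows.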

Database

Migrations

If you have an application up and running with tons of data being stored and processed, it is probably the right time to think about a migration tool such as Phinx.

Shim

For very large database migrations, consider using “shims”: small tools that run part of the migration, filling the new, still-empty fields or tables before you completely migrate your application.

Most of them are best implemented with database triggers on actions such as insert, update, and delete.
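A trigger-based shim might look like the following sketch, shown here with SQLite through Python's `sqlite3` module so it can run anywhere. The `users` table, the `email_normalized` column, and the trigger names are all made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (
    id INTEGER PRIMARY KEY,
    email TEXT NOT NULL,          -- written by the old application version
    email_normalized TEXT         -- new column, still unknown to the old version
);

-- Shim: fill the new column whenever the old code inserts a row
CREATE TRIGGER users_shim_insert AFTER INSERT ON users
BEGIN
    UPDATE users SET email_normalized = lower(NEW.email) WHERE id = NEW.id;
END;

-- and whenever the old code changes the old column.
CREATE TRIGGER users_shim_update AFTER UPDATE OF email ON users
BEGIN
    UPDATE users SET email_normalized = lower(NEW.email) WHERE id = NEW.id;
END;
""")

# The old application keeps writing only the column it knows about.
conn.execute("INSERT INTO users (email) VALUES (?)", ("Ana@Example.COM",))
row = conn.execute("SELECT email_normalized FROM users").fetchone()
```

Once every application version writes the new column on its own, the triggers have done their job and can be dropped.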

Non-Relational Databases

The problem with migrating databases that are too large is the amount of time and resources required. Sometimes it means that more than one version of your application will be hitting different versions of your documents.

Some teams handle large migrations by updating the documents as they are used by the new application version. It may require more development effort, and may add some latency to your responses, but it guarantees that the most used documents will be up to date as soon as possible.

Some time after this rollout, you can run a batch migration on the data that has not been updated yet. Not only will it run on a smaller part of the database, but you can also let it run slowly, because the most important data is already migrated.
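The two phases, upgrade on read first and batch afterwards, can be sketched with plain dictionaries standing in for a document store. The `schema_version` field, the name-splitting rule, and the function names are assumptions made for the example.

```python
CURRENT_VERSION = 2


def upgrade(doc):
    """Return the document in the current shape; v1 stored a single 'name' field."""
    if doc.get("schema_version", 1) >= CURRENT_VERSION:
        return doc
    first, _, last = doc.get("name", "").partition(" ")
    upgraded = {k: v for k, v in doc.items() if k != "name"}
    upgraded.update(first_name=first, last_name=last, schema_version=CURRENT_VERSION)
    return upgraded


def load(store, doc_id):
    """Read path of the new application: upgrade on access and write back."""
    doc = upgrade(store[doc_id])
    store[doc_id] = doc
    return doc


def batch_migrate(store):
    """Later, slow pass over whatever was never read; returns how many changed."""
    stale = [k for k, d in store.items() if d.get("schema_version", 1) < CURRENT_VERSION]
    for key in stale:
        store[key] = upgrade(store[key])
    return len(stale)
```

Because `upgrade` leaves current documents alone, it is safe to run both paths against the same data, and running the batch twice does no harm.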

Applications should always accept what they accepted before, and they should not break when they receive more data.

What Worked Before Keeps Working

If you need to update the API of an existing feature, it is always safe to:

Require a subset of the previously required parameters;
Accept more parameters than before;
Return more parameters than before.

Anything outside those rules may break your application or any other application integrated with yours. What is running and working in production is your real documentation, not what you wrote on your website or in your docs.
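The three rules above are simple enough to check mechanically. The sketch below assumes each API version is described as a dictionary with three sets of field names; that description format is invented for the example.

```python
def breaking_changes(old, new):
    """Compare two API descriptions; each is a dict of three sets of field names."""
    problems = []
    # Rule 1: the new version may only require what was already required.
    extra_required = new["required"] - old["required"]
    if extra_required:
        problems.append(f"newly required: {sorted(extra_required)}")
    # Rule 2: everything accepted before must still be accepted.
    no_longer_accepted = old["accepted"] - new["accepted"]
    if no_longer_accepted:
        problems.append(f"no longer accepted: {sorted(no_longer_accepted)}")
    # Rule 3: everything returned before must still be returned.
    no_longer_returned = old["returned"] - new["returned"]
    if no_longer_returned:
        problems.append(f"no longer returned: {sorted(no_longer_returned)}")
    return problems
```

An empty list means the change respects all three rules. Keep in mind that the check only compares field names: stricter validation of a field that already exists is also a breaking change, and this sketch will not detect it.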

Let’s say your API has a field called “URL” that accepts any string. At some point, you decide to validate the field so that it accepts valid URLs only. Adding validation at this point may break something and should be treated as a risk, even if your documentation already said the field was specifically for URLs.

Releasing New Versions

Test coverage must not be limited to new versions and releases. It must cover every feature that is in production, regardless of its version. If you are working on an API, create integration tests for the endpoints of both versions.

When you release a new version, you may break something that was already working, and the tests will make you aware of it.
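A minimal sketch of what “tests for both versions” can look like with Python's `unittest`. The two handler functions are hypothetical stand-ins for real endpoints; in an actual integration test they would be HTTP calls to each live version.

```python
import unittest


# Hypothetical handlers standing in for two API versions that are both live.
def get_user_v1(user_id):
    return {"id": user_id, "name": "Ada Lovelace"}


def get_user_v2(user_id):
    # v2 returns more fields but keeps everything v1 returned.
    return {"id": user_id, "name": "Ada Lovelace", "first_name": "Ada"}


class UserEndpointContract(unittest.TestCase):
    def test_every_live_version_keeps_the_v1_fields(self):
        for version, handler in (("v1", get_user_v1), ("v2", get_user_v2)):
            with self.subTest(version=version):
                response = handler(42)
                self.assertEqual(response["id"], 42)
                self.assertIn("name", response)


def run_contract_tests():
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(UserEndpointContract)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Using `subTest` means a failure is reported per version, so you can tell at a glance whether a change broke the old contract, the new one, or both.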

Conway's Law

And now the last tip, but a very important one, which comes from Melvin Conway:

The way two applications are integrated will mirror the way the people at the company communicate.

It means that taking care of your application is not enough. You need to take care of the communication flows in your team and how everyone works together. Otherwise, your system may fail not because of its structure, but because of miscommunication between your peers.

This text was written with references from “Release It!”, a book written by Michael Nygard.


Marcelo Bissuh