Wednesday 12 June 2019

Just another acronym TBD

Working at scale, I am constantly aware of how much we decide upfront. Before a piece of work gets anywhere near a team, a lot of time goes into looking at what it is, what will change and who will be involved. In some cases, whole designs are considered before a team even sees it.

On the face of it, there is good reason. It costs a lot of money to build things: better make sure it will give a good return. Things taking longer costs even more: better make sure we know what we are getting into. It takes a lot of people to build stuff: better make sure we know who is involved so we can check they can actually make it.

We lose something important in doing this - our competitive edge. Every week we spend understanding the risk and cost is another week our customers don't have our product and our competitors have a window of opportunity.

Working with teams, I am aware of how much we assume. We build architectures based on our understanding at the time, which often includes a lot of assumptions. I like assumptions because we can actually prove them out - but we usually don't.

We often build more than we actually need because we don't or can't prove out these assumptions. After a service has been running for a while, I have seen teams reduce the system in lots of different ways: sometimes by removing caches or services that store data, sometimes by switching to different scaling patterns that we discover over time.

We would benefit massively from building something small because we can see how it responds in the real world. We get feedback from monitoring and telemetry to understand what is going on, and we can make better decisions on its design and architecture based on that information.

But we need to start somewhere... so how about we just focus on getting this data in the easiest way we can?

Imagine we took our best guess at an architecture that would suit our intended audience, built the services and deployed them. We make sure we add no logic whatsoever, only the bare minimum to allow the services to interact, and we focus only on the monitoring and telemetry.
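
As a rough illustration of what such a skeleton might look like, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `orders_service` and `inventory_service` names, the `instrumented` decorator and the in-memory `LATENCIES` store are all hypothetical, and a real system would send its measurements to whatever monitoring tooling the team already uses.

```python
import time
from collections import defaultdict
from functools import wraps

# Hypothetical in-memory telemetry store: service name -> call latencies (seconds).
LATENCIES = defaultdict(list)


def instrumented(name):
    """Record how long every call to the wrapped handler takes."""
    def decorator(handler):
        @wraps(handler)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return handler(*args, **kwargs)
            finally:
                # Recorded even if the handler raises.
                LATENCIES[name].append(time.perf_counter() - start)
        return wrapper
    return decorator


@instrumented("orders")
def orders_service(request):
    # No business logic: call the next hop and return a fixed reply.
    inventory_service(request)
    return {"status": "ok"}


@instrumented("inventory")
def inventory_service(request):
    # No business logic and no data access, just a fixed reply.
    return {"status": "ok"}
```

Calling `orders_service({})` a few times fills `LATENCIES` for both services, which is the only thing this skeleton is meant to produce.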

We can then load test this in a live environment and we could call this a 'best case' system. Without the logic this is the fastest it could operate - anything we add will slow it down. See it as an extreme case, where we are looking at the skinniest skeleton we could possibly get away with.

We can load the system and see what happens. We could also introduce waits in areas where we anticipate more logic and see what happens under load. We can add more monitoring where we have poor visibility. We can stub third parties and make them 'misbehave' to see what happens.
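
To make the waits and the 'misbehaving' third party a bit more concrete, here is a small Python sketch. The `MisbehavingStub` class and the `run_load` helper are hypothetical names invented for this example, and the load is sent sequentially for simplicity, so it is a toy rather than a real load test.

```python
import random
import time


class MisbehavingStub:
    """Hypothetical stand-in for a third party that can be slow or fail."""

    def __init__(self, delay_s=0.0, failure_rate=0.0, seed=None):
        self.delay_s = delay_s            # artificial wait added to every call
        self.failure_rate = failure_rate  # fraction of calls that should fail
        self._rng = random.Random(seed)

    def call(self):
        time.sleep(self.delay_s)
        if self._rng.random() < self.failure_rate:
            raise TimeoutError("stubbed third party timed out")
        return {"status": "ok"}


def run_load(stub, requests):
    """Call the stub sequentially; return (latencies in seconds, failure count)."""
    latencies, failures = [], 0
    for _ in range(requests):
        start = time.perf_counter()
        try:
            stub.call()
        except TimeoutError:
            failures += 1
        latencies.append(time.perf_counter() - start)
    return latencies, failures
```

For example, `run_load(MisbehavingStub(delay_s=0.01, failure_rate=0.2, seed=1), 50)` would show how much latency and how many failures a slow, flaky dependency adds before any real logic exists.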

Since there is not much to this, we can quickly move things around, see what happens and fix the problems we can see - essentially we are searching for a baseline test that we are happy with before we add anything else. We can easily remove things that don't have a measurable impact in the scenarios we are testing.

Since we don't have any logic there is no need for unit tests, meaning changes can be quick. As it does not do anything and contains and accesses no data, it is benign in a live environment, so it should not constitute a security risk either.

When we do start to add logic, we have a baseline we can compare to and a suite of tests for KPIs that we have been able to monitor from the beginning. We also have an architecture that is better than a guess - it's already got some data to support why this is the right place to start.
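
Comparing against the baseline could be as simple as the Python sketch below. The choice of the 95th percentile, the 10% tolerance and the function names are all assumptions made for illustration; the right KPI and threshold would depend on the system.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def compare_to_baseline(baseline, candidate, pct=95, tolerance=0.10):
    """Return (regressed, baseline_value, candidate_value).

    'regressed' is True when the candidate percentile is more than
    'tolerance' (as a fraction) above the baseline percentile.
    """
    base = percentile(baseline, pct)
    cand = percentile(candidate, pct)
    return cand > base * (1 + tolerance), base, cand
```

Running the same load test before and after a change and passing both sets of latencies to `compare_to_baseline` gives a quick signal of whether the new logic has moved the KPI further than we are willing to accept.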

I call this Telemetry Biased Design, but that is really just a cool-sounding name for making sure you are starting with just the right amount of architecture to solve the problem you have.

In full disclosure: at the time of writing, I have never tried this. I am no longer an engineer and I work with smart people who get things done in their own way. It's just an idea.
