Microservices pitfalls

Update: Slides from a talk of my experiences with Microservices on AWS

..do not think of microservices until you have a people problem

I find it difficult to recommend a microservice architecture. Here are some issues that I've run across.

Very tricky to co-ordinate sweeping changes

Is it deployed? What version are you running? You need to be able to answer these questions quickly and easily amongst all services (not just Lambda functions) across different environments and accounts.

It's too easy to make mistakes. Now I honestly have a fear of deploying to production since there are so many spinning plates!

Do not co-ordinate using integers

Are you using an incrementing integer ID counter to identify your records? Don't! Consider using UUIDs instead, or you will run into co-ordination issues: an atomic counter shared between services is harder than it sounds.

Ideally the identifier (ID) is prefixed so that you can better understand the payload type.
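A prefixed identifier can be sketched with nothing but the standard library. The `order_` prefix scheme and the helper names `newID` and `kindOf` are assumptions for illustration; the UUID itself is a random (version 4) one built from `crypto/rand`.

```go
package main

import (
	"crypto/rand"
	"fmt"
	"strings"
)

// newID returns a random version 4 UUID with a type prefix, e.g. "order_<uuid>".
// The prefix convention is an illustrative assumption.
func newID(prefix string) (string, error) {
	var b [16]byte
	if _, err := rand.Read(b[:]); err != nil {
		return "", err
	}
	b[6] = (b[6] & 0x0f) | 0x40 // set version 4
	b[8] = (b[8] & 0x3f) | 0x80 // set RFC 4122 variant
	return fmt.Sprintf("%s_%x-%x-%x-%x-%x",
		prefix, b[0:4], b[4:6], b[6:8], b[8:10], b[10:16]), nil
}

// kindOf extracts the payload type from a prefixed ID.
func kindOf(id string) string {
	kind, _, _ := strings.Cut(id, "_")
	return kind
}

func main() {
	id, err := newID("order")
	if err != nil {
		panic(err)
	}
	fmt.Println(id, "is of kind", kindOf(id))
}
```

No service has to ask another for the next number, and anyone reading a log line can tell what kind of record the ID refers to.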

When using unique IDs, realise you probably need to build an index/map to tell whether the payload is a replay, so that you don't perform unnecessary (expensive) operations!
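The index can be as simple as a set of IDs already processed. The sketch below keeps it in memory, which is an assumption made to keep the example runnable: across separate Lambda invocations the set would have to live in shared storage such as a database table. `SeenIndex` and `FirstTime` are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// SeenIndex remembers which payload IDs have already been processed.
// In-memory only, for illustration; real services need shared storage.
type SeenIndex struct {
	mu   sync.Mutex
	seen map[string]struct{}
}

func NewSeenIndex() *SeenIndex {
	return &SeenIndex{seen: make(map[string]struct{})}
}

// FirstTime records id and reports whether it had not been seen before.
func (s *SeenIndex) FirstTime(id string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.seen[id]; ok {
		return false
	}
	s.seen[id] = struct{}{}
	return true
}

func main() {
	idx := NewSeenIndex()
	for _, id := range []string{"order_1", "order_2", "order_1"} {
		if idx.FirstTime(id) {
			fmt.Println("processing", id)
		} else {
			fmt.Println("skipping replay of", id)
		}
	}
}
```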

Request IDs

You need to be able to identify your payloads. AWS Lambda has Request IDs you could work with:

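// Tag every log line with the Lambda request ID.
// Assumes log is github.com/sirupsen/logrus and logWithRequestID is declared earlier.
ctxObj, ok := lambdacontext.FromContext(ctx)
if ok {
	logWithRequestID = log.WithFields(log.Fields{
		"requestID": ctxObj.AwsRequestID,
	})
}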

But you need to make sure they are incorporated into your logging and often your payload itself, else you will have difficulty querying your logs and tracking events.

Logging is hard

Say you use CALL mysql.lambda_async (an AWS Aurora feature I recommend you DO NOT USE) to deliver a payload to a Lambda function. That function relays it to a front end service, which then interacts with another microservice, which in turn interacts with your database via MySQL procedures. All sorts of things can go wrong in this pipeline, like Chinese whispers and back pressure.

Can you get a good view of what is happening with the changing state between these disparate services? It's extremely difficult to do so without consistent logging.

Even if you invest time in getting a great logging setup with $AWS_PROFILEs and saw, be prepared to hit:

Error ThrottlingException: Rate exceeded

Error ThrottlingException: Rate exceeded support answer

There may be third party solutions for making sense of CloudWatch logging.

Idempotency

When an error occurs, your code might have run completely, partially, or not at all. In most cases, the client or service that invokes your function retries if it encounters an error, so your code must be able to process the same event repeatedly without unwanted effects. If your function manages resources or writes to a database, you need to handle cases where the same request is made several times.

You have errors in your logs: A record cannot be created because it has been created before. Why? Because the message has been seen before. Is this even an error?

It probably shouldn't be in a microservice architecture where duplicate messages could be received.
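In code, that means a duplicate create is handled as "already done" rather than logged as a failure. A minimal sketch, with an in-memory `store` standing in for a table with a unique key and `ErrAlreadyExists` standing in for the database's duplicate-key error (both hypothetical); it assumes a replayed message carries the same content as the original.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrAlreadyExists stands in for a duplicate-key error from the database.
var ErrAlreadyExists = errors.New("record already exists")

// store is an in-memory stand-in for a table with a unique key.
type store map[string]string

func (s store) Create(id, value string) error {
	if _, ok := s[id]; ok {
		return ErrAlreadyExists
	}
	s[id] = value
	return nil
}

// handleCreate treats a duplicate as a replay that was already applied, not a failure.
func handleCreate(s store, id, value string) (created bool, err error) {
	err = s.Create(id, value)
	switch {
	case err == nil:
		return true, nil
	case errors.Is(err, ErrAlreadyExists):
		return false, nil // seen before: nothing to do, nothing to alert on
	default:
		return false, err
	}
}

func main() {
	s := store{}
	for i := 0; i < 2; i++ {
		created, err := handleCreate(s, "order_1", "paid")
		fmt.Println("created:", created, "err:", err)
	}
}
```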

Asynchronous

Sooner or later you will find synchronicity too hard or too slow, or a language like JavaScript can't accommodate it. You will then architect for asynchronous events throughout, with error feedback via "Dead letter queue" subscriptions. Without a synchronous request and response, you will find it challenging to collate and reason about events.

Test failure: I recommend you run through some failure cases so that your team is familiar with debugging what's going on.

DoS yourself

Microservice architecture generally treats each action or job as its own isolated payload. Imagine a bulk edit changing the state of 100 records. Traditionally in a monolith (non-microservice architecture) a single database connection is opened and, let's say, 100 records are inserted, perhaps in a transaction.

Typically in a microservice architecture the 100 messages for a 100 record edit are not batched and are fired asynchronously. That means 100 concurrent invocations of Lambda, each potentially opening several database connections and contending with the others to perform SQL. Deadlocks can easily result, and the load spike can kill connectivity to your database.

Your database likely will not handle these load spikes well at all, and typically you need to run a much larger instance than you really should need!

Limiting Lambda concurrency and using SQS can smooth out these spikes, but these mechanisms make your architecture extra complex and tricky to orchestrate. For best performance, you need to "right size" concurrency limits to what your database is capable of, which is hit and miss.

Same problem in different microservices

The way languages like Go and SQL handle null values in JSON payloads can be ambiguous. Do all teams/microservices agree on when certain attributes can be omitted and what that means?

utf8 or utf8mb4? You need to be strongly consistent and it's really not very easy. A misaligned collation in the connection string to your database could render indexes useless and cause huge performance issues that are hard to EXPLAIN!
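To make the Go side of the ambiguity concrete: a plain `*string` field ends up nil both when the key is omitted and when it is explicitly null. A sketch of one way to tell the three cases apart, using a hypothetical `OptionalString` type with a custom `UnmarshalJSON` (the `Customer` type and `nickname` field are made up for the example):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// OptionalString distinguishes an omitted key, an explicit null and a value.
type OptionalString struct {
	Present bool // the key appeared in the payload
	Null    bool // the key appeared with an explicit null
	Value   string
}

// UnmarshalJSON is only called when the key is present, including for null.
func (o *OptionalString) UnmarshalJSON(b []byte) error {
	o.Present = true
	if string(b) == "null" {
		o.Null = true
		return nil
	}
	return json.Unmarshal(b, &o.Value)
}

type Customer struct {
	Nickname OptionalString `json:"nickname"`
}

// describe reports which of the three cases a payload falls into.
func describe(payload string) (string, error) {
	var c Customer
	if err := json.Unmarshal([]byte(payload), &c); err != nil {
		return "", err
	}
	switch {
	case !c.Nickname.Present:
		return "omitted", nil
	case c.Nickname.Null:
		return "null", nil
	default:
		return "value " + c.Nickname.Value, nil
	}
}

func main() {
	for _, p := range []string{`{}`, `{"nickname":null}`, `{"nickname":"Bob"}`} {
		d, err := describe(p)
		fmt.Println(p, "->", d, err)
	}
}
```

Whether "omitted" means "leave unchanged" and "null" means "clear the column" is exactly the kind of thing every team has to agree on.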

When comparing columns/constants with different character set or collation, indexes can not be used.

Specific to AWS Lambda implementations of microservices

Timeouts, queues, dead letters

API Gateway timing out? Convert the logic to a Lambda and add a queue! Now you have ratcheted up the complexity and locked yourself into the AWS ecosystem. You will no longer be able to sanely run tests on a local laptop, because your microservice depends on a queue.

Serverless (Lambda) makes no guarantees. You need to be aware of timeouts and complex failure cases and how they might be retried or not. Can your team analyse failed payloads or invocations? Do they know what DeadLetterErrors means? Did they test and understand the difference when running async: aws lambda invoke-async? Do they have the experience to read the copious metrics and interpret them accordingly?

..request ID would be the same if the request was unsuccessful previously and it's now under retry. This could be a way to identify if the request was retried. DLQ can be applied to ensure unsuccessful events that will be sent to a SNS topic or SQS queue after retried twice if those events still cannot be successful.

I still have not figured out how to analyse a queue between SNS invoking lambda when concurrency is limited. There is very little visibility. You could rely on CloudWatch metrics maybe to give you a gist, but often you need to inspect payloads for that to be useful.

FIFO Queue

At the time of writing, you can't trigger Lambdas from an SQS FIFO queue. Bizarre. And you need FIFO for handling messages which need to be in order!

https://stackoverflow.com/questions/57666833/how-to-handle-out-of-order-microservice-messages/57759755#57759755

💩Cold starts, fine grained permissions, VPCs

Error 1040: Too many connections

Lambda and SQL is a pain since the setup cost can be very high, and it requires multiple connections to a database. Credentials & fine grained permissions with IAM roles are a pain. VPCs add extra pain. Pain multiplied by the number of microservices and environments you have.
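Part of the connection pain can be contained by creating the pool once per container, outside the handler, and capping it hard. A sketch using the real `database/sql` pool settings; the `stubConnector` exists only so the example runs without a database, and the limits shown are assumptions to tune, not recommendations.

```go
package main

import (
	"context"
	"database/sql"
	"database/sql/driver"
	"errors"
	"fmt"
	"time"
)

// stubConnector stands in for a real driver so the sketch runs without a database.
type stubConnector struct{}

func (stubConnector) Connect(context.Context) (driver.Conn, error) {
	return nil, errors.New("stub: no database")
}
func (stubConnector) Driver() driver.Driver { return nil }

// db is created once per container, outside the handler,
// so that warm invocations reuse it instead of reconnecting.
var db = newPool(stubConnector{})

// newPool caps how many connections a single container may hold.
func newPool(c driver.Connector) *sql.DB {
	pool := sql.OpenDB(c)
	pool.SetMaxOpenConns(1)                  // illustrative: one connection per container
	pool.SetMaxIdleConns(1)                  // keep it around between invocations
	pool.SetConnMaxLifetime(5 * time.Minute) // recycle before the server drops it
	return pool
}

func main() {
	fmt.Println("max open connections:", db.Stats().MaxOpenConnections)
}
```

The total number of connections the database sees is then roughly the per-container cap multiplied by the number of concurrent containers, which is the figure to compare against the server's connection limit.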

Conclusion

Adopting microservices often brings exponentially growing co-ordination and consistency challenges: problems you would rarely see, or see just once, in monolithic architectures. If you integrate with AWS, you will make it practically impossible to reproduce your environment on a standalone computer. YMMV.

Update: AWS have a Microservices whitepaper which is worth reading.

Update: Good parody on Microservices
