Maybe Step Functions are not what I need 🤔

Because there are no silver bullets for async operations

Throughout my career I have systematically tested and experimented with new things, looking for better ways to work and for higher quality in the software I write. I call this the "I'm going to fail soon" mode. Luckily, I've been paid for that.

If, like me, you decided to go into the computer science world, you probably know what I'm talking about. Technology patterns are ephemeral these days, which means we need to keep hydrating ourselves with information to be able to meet market demand. Shit is simple: people demand new and better services, they want them faster, and they want them cheaper.

Having said all this, after several years I've found that most software projects share the same constraints: capacity, time, and cost (duh). To evolve the paradigm, the market has pointed to several solutions meant to make our lives simpler (or not) while keeping market value and ensuring the shareholders keep their pockets full at the end of the year. So they came up with Functions as a Service (FaaS), which has become a hot topic in the last couple of years.

OK, enough introduction. The FaaS paradigm means, among other things, that operations are stateless, so we can cut the cost of provisioning systems and focus on the code. This line is used a lot to sell these services. However, there is a problem: at some point you need to keep your state somewhere besides the client.

To solve this problem, AWS came out with Step Functions, so we can manage state as a service and fill the gap with the missing piece: a technology that handles transitions between states and makes the system easy to scale.

When creating a microservices architecture, the principle of keeping responsibilities isolated comes with the responsibility of being able to roll back transactions when needed. Due to the curse of knowledge, my mind goes directly to the saga pattern to solve most of these problems.

I'm going to talk about my own findings. For very good, detailed comments on the state of the art in Lambda orchestration, look here, where Ben Kehoe makes some very interesting points. I will focus on one very simple detail that I think is useful to explain, with a single use case, when you should not go directly to Step Functions to orchestrate a workflow.

The use case

I've talked about the saga pattern in another context and made the case for how useful it can be for managing sequences of operations and orchestrating state management in Redux-based applications. You can read about it here.

I came up with the idea that illustrating this same scenario in a serverless application could show what I spent days finding out. Remember what I said about failing soon? Soon doesn't mean fast 🤷🏻‍♂️.

Let’s see the following case:

Image by Thomas Burleson

Just for the sake of the example, imagine you've been told that, given two microservices, getFlight and getForecast, you have to create a third service that orchestrates the two of them (if you want to add more spice, also imagine they are private services).

Piece of cake, I said. The first thing that came to my mind was a sequence of steps, so the saga pattern can do the trick and we can express it as a Step Functions state machine.

Step Function First Approach

First, we use the API Gateway service to trigger our getDeparture lambda. It takes care of the rest with the power of Step Functions and comes back later with the orchestrated result of both services. The diagram looks dope.

This is what the state machine looks like in the serverless.yml of the Serverless Framework:

Serverless.yml file
Step Functions UI
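For readers who cannot see the screenshots, here is a rough sketch of what such a state machine could look like, written as an Amazon States Language definition built from Python. The state names, the ARNs, and the trailing Pass state are assumptions made for illustration, not the exact contents of the original serverless.yml.

```python
import json

# Hypothetical ARNs; replace them with the ARNs of the real functions.
GET_FLIGHT_ARN = "arn:aws:lambda:us-east-1:123456789012:function:getFlight"
GET_FORECAST_ARN = "arn:aws:lambda:us-east-1:123456789012:function:getForecast"


def build_definition(flight_arn, forecast_arn):
    """Return a two-task sequence followed by a final Pass state."""
    return {
        "Comment": "Sketch: getFlight, then getForecast, then done",
        "StartAt": "GetFlight",
        "States": {
            "GetFlight": {
                "Type": "Task",
                "Resource": flight_arn,
                "Next": "GetForecast",
            },
            "GetForecast": {
                "Type": "Task",
                "Resource": forecast_arn,
                "Next": "Done",
            },
            # Assumed final state that only marks the end of the flow.
            "Done": {"Type": "Pass", "End": True},
        },
    }


if __name__ == "__main__":
    definition = build_definition(GET_FLIGHT_ARN, GET_FORECAST_ARN)
    print(json.dumps(definition, indent=2))
```

Each Task state feeds its output to the next one, which is the whole "sequence of steps" idea in three states.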

There is a problem: microservices work asynchronously. Let's see what these lambdas look like in code:

getDeparture lambda definition

On line 35, we can see how we start the execution of the state machine. However, as I said, this is an async call, so it returns an executionArn and nothing else: no final result. So, what do we return? The operation is not completed… yet.
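To make the problem concrete, here is a minimal sketch of a handler that starts an execution and returns immediately. It is not the original getDeparture code. The client is passed in as a parameter (in a real Lambda it would be something like `boto3.client("stepfunctions")`), and the 202 status code and response shape are assumptions.

```python
import json


def get_departure(event, sfn_client, state_machine_arn):
    """Start the state machine and return right away with the execution id."""
    response = sfn_client.start_execution(
        stateMachineArn=state_machine_arn,
        input=json.dumps(event.get("queryStringParameters") or {}),
    )
    # start_execution does not wait for the workflow to finish, so the
    # executionArn is the only useful thing we can hand back at this point.
    return {
        "statusCode": 202,
        "body": json.dumps({"executionArn": response["executionArn"]}),
    }
```

The caller gets an identifier, not a departure, which is exactly the limbo described above.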

Let’s talk about some limits that we have:

  • API Gateway is limited to 29 seconds.
  • Lambdas are limited to 5 minutes (300,000 ms) at the time of writing; the limit has since been raised to 15 minutes.
  • Step Functions cost $0.025 per 1,000 state transitions (125 times more expensive than Lambda invocations).
  • Lambda invocation pricing is more complicated; take a look here.
  • The default throttling limit for a state machine is two executions per second.
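The 125x figure in the list above can be checked with a bit of arithmetic. The Lambda price used here ($0.20 per million requests, ignoring duration charges) is an assumption based on the public list price, so treat the result as a rough comparison rather than a bill.

```python
STEP_FUNCTIONS_USD_PER_1000 = 0.025    # from the list above
LAMBDA_USD_PER_MILLION_REQUESTS = 0.20  # assumption: request price only


def price_ratio():
    """How many times more a Step Functions unit costs than a Lambda request."""
    lambda_usd_per_1000 = LAMBDA_USD_PER_MILLION_REQUESTS / 1000
    return STEP_FUNCTIONS_USD_PER_1000 / lambda_usd_per_1000


def step_functions_cost(runs, transitions_per_run):
    """Rough Step Functions cost in USD for a number of workflow runs."""
    return runs * transitions_per_run * STEP_FUNCTIONS_USD_PER_1000 / 1000


if __name__ == "__main__":
    print(round(price_ratio()))  # 125
    print(step_functions_cost(runs=1_000_000, transitions_per_run=3))
```

Note that the Lambda invocations made by each Task state are billed on top of this.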

So far, the getDeparture lambda is in limbo, waiting for something that might or might not come. What can we do?

According to the docs, we can use GetExecutionHistory to poll until the execution reaches its final state. This is how it looks:

Execution History

There is a caveat: we cannot ensure that our services finish their operations in less than 29 seconds, and we are still being billed while waiting. This looks horrible af.
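A hedged sketch of that polling loop, with a time budget kept under the 29-second API Gateway limit, might look like this. The client is injected (for example `boto3.client("stepfunctions")`), and the budget and interval values are assumptions.

```python
import time

# Event types that mark the end of an execution.
TERMINAL_EVENTS = {
    "ExecutionSucceeded",
    "ExecutionFailed",
    "ExecutionAborted",
    "ExecutionTimedOut",
}


def wait_for_final_event(sfn_client, execution_arn, budget_seconds=25.0,
                         interval_seconds=1.0, clock=time.monotonic,
                         sleep=time.sleep):
    """Poll the execution history until a terminal event or the budget ends.

    Returns the terminal event, or None if the budget ran out first.
    """
    deadline = clock() + budget_seconds
    while True:
        history = sfn_client.get_execution_history(
            executionArn=execution_arn,
            reverseOrder=True,  # newest event first
            maxResults=1,
        )
        events = history["events"]
        if events and events[0]["type"] in TERMINAL_EVENTS:
            return events[0]
        if clock() + interval_seconds > deadline:
            return None  # give up before API Gateway cuts the request
        sleep(interval_seconds)
```

Every second spent in that loop is billed Lambda time, and a `None` result still leaves the caller without an answer.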

So, I went for a different approach:

Step functions Approach 2

I added another API Gateway endpoint, so the first one, /departure/, triggers the operation and returns the executionArn, and the second one, /departure/status, returns the final ExecutionHistory based on that executionArn. Let's see some code.

getDeparture-status

Problem solved, as long as we ask for the status after the PassStateEntered event has completed. Now we are invoiced twice 😬.
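As an illustration of the second endpoint, a status handler could be sketched as follows. This is not the original getDeparture-status code: the injected client, the query parameter name, and the choice of status codes are all assumptions.

```python
import json

FAILED_EVENTS = {"ExecutionFailed", "ExecutionAborted", "ExecutionTimedOut"}


def get_departure_status(event, sfn_client):
    """Report where an execution stands, based on its most recent event."""
    params = event.get("queryStringParameters") or {}
    execution_arn = params.get("executionArn")
    if not execution_arn:
        return {"statusCode": 400,
                "body": json.dumps({"error": "executionArn is required"})}

    history = sfn_client.get_execution_history(
        executionArn=execution_arn, reverseOrder=True, maxResults=1)
    events = history["events"]
    latest = events[0] if events else {"type": "Unknown"}

    if latest["type"] == "ExecutionSucceeded":
        # The output is already a JSON string produced by the state machine.
        details = latest.get("executionSucceededEventDetails", {})
        return {"statusCode": 200, "body": details.get("output", "null")}
    if latest["type"] in FAILED_EVENTS:
        return {"statusCode": 502,
                "body": json.dumps({"status": latest["type"]})}
    # Still running: the client has to come back and ask again.
    return {"statusCode": 202, "body": json.dumps({"status": "RUNNING"})}
```

The client now owns the retry loop, and every retry is another API Gateway request plus another Lambda invocation.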

Keep it simple, asshole.

Well, maybe it was not quite like that. I needed a different approach, less expensive and a better fit for the purpose. I came up with a few different solutions to the async problem, and the best one I found was this:

Approach 3, not step functions.

Not Step Functions at all, only a Lambda that triggers other Lambdas, and it looks like this:

Lambda sagas.

Once again the saga pattern comes to the rescue. It does not solve the problem of long executions, since we are still tied to the 29 seconds of API Gateway, but at least we are using a solution that is fit for use and fit for purpose. And if we make the user wait that long for a response, we have another problem.
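A rough sketch of a Lambda that calls the other two Lambdas synchronously could look like this. It is not the original code: the client is injected (for example `boto3.client("lambda")`), and the payload shapes and status codes are assumptions.

```python
import json


def invoke_lambda(lambda_client, function_name, payload):
    """Invoke a Lambda synchronously and return its decoded JSON result."""
    response = lambda_client.invoke(
        FunctionName=function_name,
        InvocationType="RequestResponse",  # wait for the result
        Payload=json.dumps(payload).encode("utf-8"),
    )
    if "FunctionError" in response:
        raise RuntimeError(f"{function_name} failed")
    return json.loads(response["Payload"].read())


def get_departure(event, lambda_client):
    """Run getFlight, then getForecast, and merge both results."""
    try:
        flight = invoke_lambda(lambda_client, "getFlight", event)
        # Hypothetical payload: the forecast service receives the flight.
        forecast = invoke_lambda(lambda_client, "getForecast",
                                 {"flight": flight})
    except Exception as error:
        # Both calls look read-only, so there is nothing to undo here; a
        # saga with writes would run its compensating steps at this point.
        return {"statusCode": 502,
                "body": json.dumps({"error": str(error)})}
    return {"statusCode": 200,
            "body": json.dumps({"flight": flight, "forecast": forecast})}
```

The orchestrating Lambda is billed while it waits for the other two, but there is one request, one response, and no extra state machine to pay for.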

For a real-life problem, I would go for this approach. Of course, in true serverless fashion, each service works independently, without an orchestration layer, until we have a better solution for it.

Conclusion

Architecting serverless applications is complex. I just tried to use this sample to illustrate a few of the challenges you might find when orchestrating microservices. Again, asynchronous operations are the stone in the shoe.

I hope this helps you understand concurrency limitations a little better when designing services. If you want to go deeper into the topic, I found this lecture really interesting.

Also, if you want to run other experiments and prove me wrong, please do! Here is the code.

I do serverless stuff 🤷🏻‍♂️