A story of task engine evolution in business software

From fire-and-forget to Temporal

Posted by 587 on Wednesday, May 11, 2022

Background

I work at a company that builds business software. One key component of our product is the "task engine", which manages all kinds of tasks in our system.

What counts as a task in our system? Our definition: any operation that may take longer than 30 seconds.

As our company's business has grown, our task engine has evolved along with it.

Stage 1: fire and forget

At the beginning, while building the MVP, we simply wrote the task logic directly in the request thread. But we soon hit a problem: if an HTTP request takes too long to respond (longer than about 1 minute), several things go wrong:

  • Some browsers may resend the HTTP GET request.
  • Nginx may decide the upstream server is dead and return a 504 error code.

So we made a simple change: run those long operations on another thread and return success to the client immediately.

The problem is that the asynchronous task may or may not finish successfully. We logged exceptions to the error log, but nobody ever checked it.
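For illustration, the Stage 1 pattern looked roughly like this (a minimal sketch, assuming Scala Futures and an slf4j logger; the operation passed in is just a placeholder):

    import org.slf4j.LoggerFactory
    import scala.concurrent.{ExecutionContext, Future}
    import scala.util.{Failure, Success}

    object FireAndForget {
      private val log = LoggerFactory.getLogger(getClass)

      // Run the slow operation on another thread and let the HTTP request
      // return success immediately. Failures only end up in the error log,
      // and nothing ever re-checks them.
      def submit[T](operation: => T)(implicit ec: ExecutionContext): Unit =
        Future(operation).onComplete {
          case Failure(e) => log.error("background task failed", e)
          case Success(_) => () // nobody looks at the result either
        }
    }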

Stage 2: learn from ha-jobs

Next, we looked for existing open source projects that solve this kind of problem.

Since our product is written in Scala, the search focused mainly on projects written in Java/Scala/Kotlin.

Eventually we found https://github.com/Galeria-Kaufhof/ha-jobs . The project is not well known (fewer than 30 stars at that time), which made it a bit risky for us, but it has the following advantages:

  • it is written in Scala, on top of the Play framework
  • it records task status in Apache Cassandra
  • the source code is well written, and the basic ideas are easy to understand

Since we were also using Scala + Apache Cassandra, we built the second version of our task engine by modifying ha-jobs.

The basic idea of this task engine is (see the sketch after this list):

  • We define many kinds of Tasks, each focused on a single time-consuming operation.
  • We submit those tasks to ha-jobs' JobManager.
  • The JobManager first marks the Task as Submitted (creating a record in the Cassandra database), and then begins to schedule it.
  • Each Task can periodically update its status (by updating Cassandra tables).
  • When the task ends, ha-jobs records either its result or its failure exception in a Cassandra table.
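To give a feel for the model, here is an illustrative sketch of the data we kept per task (made-up names, not the actual ha-jobs API; in our case the record lives in a Cassandra table):

    import java.time.Instant
    import java.util.UUID

    // Hypothetical status values, roughly what we stored per task.
    sealed trait TaskStatus
    object TaskStatus {
      case object Submitted extends TaskStatus
      case object Running   extends TaskStatus
      case object Finished  extends TaskStatus
      case object Failed    extends TaskStatus
    }

    final case class TaskRecord(
      taskId:    UUID,
      status:    TaskStatus,
      progress:  Option[String], // periodic updates written by the task itself
      result:    Option[String], // final result, serialized
      error:     Option[String], // failure exception message, if any
      updatedAt: Instant
    )

    // Each concrete task focuses on one time-consuming operation and reports
    // progress back to the store through a callback.
    trait EngineTask {
      def run(taskId: UUID, reportProgress: String => Unit): String
    }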

Taking "jdbc data fetch" as an example, the main application flow becomes the following (a sketch of the server-side part follows the list):

  • The user sends a getData HTTP request.
  • Our server assigns a UUID as the taskId and submits the task to the ha-jobs JobManager.
  • Our server waits around 20 seconds, in case the query is fast, so that we can respond to the client as quickly as possible.
  • If the query does not finish within 20 seconds, we return the taskId to the client, in this case the user's web browser.
  • Our frontend (JavaScript) then polls the status of that taskId until the task finishes. (On the server, we simply check the Cassandra table to answer each queryTaskStatus request.)
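The "answer fast if possible, otherwise hand back a taskId" step can be sketched like this (illustrative only; `runQuery` stands in for the submitted task's result, and the handler and response types are made up):

    import java.util.UUID
    import scala.concurrent.Future
    import scala.concurrent.duration._
    import akka.actor.ActorSystem
    import akka.pattern.after

    // Either the data comes back quickly, or the client gets a taskId to poll.
    sealed trait FetchResponse
    final case class Data(rows: String)    extends FetchResponse
    final case class Pending(taskId: UUID) extends FetchResponse

    class GetDataHandler(system: ActorSystem) {
      import system.dispatcher

      def getData(taskId: UUID, runQuery: Future[String]): Future[FetchResponse] = {
        val quick = runQuery.map(rows => Data(rows))
        // If the query has not finished within ~20 seconds, fall back to the
        // taskId and let the frontend poll queryTaskStatus until it completes.
        val fallback = after(20.seconds, system.scheduler)(Future.successful(Pending(taskId)))
        Future.firstCompletedOf(Seq(quick, fallback))
      }
    }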

This worked well for a long time (around 3 years), until our customer base grew larger.

Stage 3: Temporal, a game changer

Version 2 of our Task Engine had some problems we had not encountered before.

Most of the issues appeared when we began to support clustering. In the first years, our customers' user counts were not that large (below 10,000 users). On the backend, most of the time-consuming computation happens in other components (either an Apache Spark cluster or the customer's OLAP warehouse), and we use Play Framework's async support, Scala's Future, and Akka everywhere, so we could achieve good performance even with only ONE backend server. The mix of async and sync code caused us a lot of maintenance trouble, but that's another story.

Then we finally reached the point where we needed to support server clustering. In the Java world, clustering usually means Apache ZooKeeper (with Apache Curator). But since most of our customers still run our software on premises, we need to keep the maintenance cost as low as possible, and adding a 3-node ZooKeeper ensemble is too heavy.

So we built server clustering on the Akka Cluster framework (https://doc.akka.io/docs/akka/current/typed/cluster.html). And since, for customers with multiple machines, we run our software on Kubernetes, we use kubernetes-api as the Akka discovery method to deal with split-brain issues.
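For reference, the relevant settings look roughly like this (a sketch of application.conf, assuming Akka Management with Cluster Bootstrap and the Split Brain Resolver that ships with Akka 2.6.6+; the pod label selector is a made-up example):

    akka {
      cluster {
        # Use the Split Brain Resolver shipped with Akka as the downing provider.
        downing-provider-class = "akka.cluster.sbr.SplitBrainResolverProvider"
      }

      discovery {
        kubernetes-api {
          # Example selector; it must match the labels on the backend pods.
          pod-label-selector = "app=our-backend"
        }
      }

      management.cluster.bootstrap {
        contact-point-discovery {
          discovery-method = kubernetes-api
        }
      }
    }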

Akka Cluster works well. But the major challenge was our Task Engine, mainly because programming in a distributed environment requires a different mindset than programming on a single node. We also wanted to take this opportunity to redesign the Task Engine to overcome several issues:

First, timeout control. We set timeouts in each individual place: when we call an external HTTP server we set one timeout, when we write to the database we set another. What we were missing was a total timeout for the whole workflow.

Second, task persistence and task chaining. As it turns out, a business operation is usually not a single atomic step but a chain of several smaller tasks. Between those steps many things can happen: the Java process may crash, an unexpected exception may be thrown. We can end up leaving the operation in a half-done state.

Third, in many cases we need to control the retry logic. For example, when we transfer a large dataset from the customer's warehouse to our computing engine, the network connection may drop, so we want to retry once. But when we already know the warehouse rejected our request because of a wrong password, we should not retry, to avoid locking the account.
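As a rough sketch of the kind of control we wanted (exception types and the transfer call are made up for illustration):

    import scala.util.{Failure, Try}

    // Made-up exception types for illustration.
    class NetworkInterruptedException(msg: String)      extends RuntimeException(msg)
    class AuthenticationRejectedException(msg: String)  extends RuntimeException(msg)

    object RetryControl {
      // Retry a block a limited number of times, but only for errors that are
      // worth retrying (e.g. a dropped connection). A wrong password should
      // fail immediately, otherwise we risk locking the warehouse account.
      @annotation.tailrec
      def withRetry[T](attempts: Int)(isRetryable: Throwable => Boolean)(block: => T): Try[T] =
        Try(block) match {
          case Failure(e) if attempts > 1 && isRetryable(e) =>
            withRetry(attempts - 1)(isRetryable)(block)
          case other => other
        }
    }

    // e.g. RetryControl.withRetry(2)(_.isInstanceOf[NetworkInterruptedException]) {
    //   transferDataset() // hypothetical transfer call
    // }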

There were many other problems we didn't anticipate until we ran into them. In general, the Task Engine turned out to be more complicated than we expected.

So: could we do this job more elegantly? Could we learn from more experienced people in this task engine area?

Finally, we came across the Temporal project, https://github.com/temporalio/temporal . It was an eye-opening moment; it seemed to solve exactly the issues we had. So we began building version 3 of our Task Engine with Temporal.
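To show why it fit so well, here is a minimal sketch in Scala against Temporal's Java SDK (workflow, activity, and exception names are all made up, and worker registration is omitted). It maps the three issues above onto Temporal concepts: a workflow-level execution timeout, a durable multi-step workflow, and per-activity retry options that skip non-retryable errors:

    import java.time.Duration
    import io.temporal.activity.{ActivityInterface, ActivityOptions}
    import io.temporal.client.{WorkflowClient, WorkflowOptions, WorkflowStub}
    import io.temporal.common.RetryOptions
    import io.temporal.serviceclient.WorkflowServiceStubs
    import io.temporal.workflow.{Workflow, WorkflowInterface, WorkflowMethod}

    // Made-up workflow: fetch data from the customer warehouse, then load it
    // into our computing engine, as two durable steps.
    @WorkflowInterface
    trait DataFetchWorkflow {
      @WorkflowMethod
      def fetch(query: String): String
    }

    @ActivityInterface
    trait WarehouseActivities {
      def runQuery(query: String): String
      def loadIntoEngine(data: String): String
    }

    // Made-up exception for a rejected warehouse login.
    class WarehouseAuthException(msg: String) extends RuntimeException(msg)

    class DataFetchWorkflowImpl extends DataFetchWorkflow {
      // Retry transient failures once, but never retry an authentication
      // error, so we do not lock the customer's warehouse account.
      private val retry = RetryOptions.newBuilder()
        .setMaximumAttempts(2)
        .setDoNotRetry(classOf[WarehouseAuthException].getName)
        .build()

      private val activities = Workflow.newActivityStub(
        classOf[WarehouseActivities],
        ActivityOptions.newBuilder()
          .setStartToCloseTimeout(Duration.ofMinutes(30)) // per-step timeout
          .setRetryOptions(retry)
          .build())

      // If the process crashes between the two steps, Temporal replays the
      // workflow and resumes it instead of leaving the operation half-done.
      override def fetch(query: String): String = {
        val data = activities.runQuery(query)
        activities.loadIntoEngine(data)
      }
    }

    object TaskEngineClient {
      def start(query: String): Unit = {
        val service = WorkflowServiceStubs.newInstance()
        val client  = WorkflowClient.newInstance(service)
        val workflow = client.newWorkflowStub(
          classOf[DataFetchWorkflow],
          WorkflowOptions.newBuilder()
            .setTaskQueue("task-engine")                      // made-up queue name
            .setWorkflowExecutionTimeout(Duration.ofHours(2)) // total timeout for the whole workflow
            .build())
        WorkflowStub.fromTyped(workflow).start(query)         // start asynchronously; keeps running across restarts
      }
    }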

To be continued.