Reduce All the Regions: How Earth Engine Sped Up Large-Scale Zonal Statistics Computations
by Chris Welhoelter, Software Engineer, Earth Engine
Computing statistics about regions of an image is a core Earth Engine capability. For instance, say you have an image that represents tree canopy coverage. You might want to calculate the average coverage for each census tract in your city in order to join the output with census data and study the health benefits of urban trees.
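If you haven't run this kind of computation before, it looks roughly like the snippet below in the Earth Engine Python client. The asset IDs are illustrative placeholders, so swap in your own canopy image and tract boundaries.

```python
import ee

ee.Initialize()

# Illustrative inputs: substitute your own canopy image and tract boundaries.
canopy = ee.Image('projects/my-project/assets/tree_canopy_fraction')  # hypothetical asset
tracts = ee.FeatureCollection('TIGER/2010/Tracts_DP1')                # US census tracts

# Compute the mean canopy value within each tract.
mean_canopy = canopy.reduceRegions(
    collection=tracts,
    reducer=ee.Reducer.mean(),
    scale=1,  # 1 meter scale
)

# Run the whole thing as a batch export task.
task = ee.batch.Export.table.toDrive(
    collection=mean_canopy,
    description='tract_mean_canopy',
    fileFormat='CSV',
)
task.start()
```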
Computing zonal statistics like this on a very large scale has important applications in machine learning pipelines and elsewhere but has historically been difficult or impossible because the Earth Engine infrastructure was primarily optimized for raster operations. For example, if you wanted to calculate the average tree canopy coverage of each of the more than 80,000 census tracts in the continental United States, your job might run for a few days or a week before ultimately timing out. Determined users could try splitting the job into smaller batches of census tracts, but figuring out the right batch size is labor intensive. Splitting up a job also consumes precious concurrent batch task quota. Others might try to run their export at a lower resolution, but sometimes the resulting imprecision is unacceptable. In short, there were no convenient workarounds for Earth Engine’s limited ability to process very large zonal statistics.
The Earth Engine team has worked hard to better support workloads like this and we’re pleased to announce that large-scale zonal statistics exports are running on average seven times faster than this time last year. Our benchmarking now shows that users can calculate the average tree canopy coverage of every census tract in the continental United States at 1 meter scale in about 7 hours. This post dives deep into the backend changes that made it happen.
The Earth Engine computation model
Let’s start by going over how Earth Engine processes a zonal statistics computation. First, it pulls smaller, parallelizable subcomponents out of the computation. In our tree canopy example, each subcomponent might calculate coverage for one individual census tract. Then it breaks those subcomponents into even smaller pieces, say 256x256 pixel tiles in a census tract. It continues the process recursively until it can’t break things down any further. When it’s done, it winds up with a directed acyclic graph representing the entire computation¹:
Each node in the graph represents a subresult that Earth Engine needs in order to compute the node’s parent. For example, to calculate the coverage for tract 1, it first needs to compute the coverage for each of the tiles the tract contains. And to calculate the coverage for all tracts, each individual one must be calculated first. To execute the computation, the graph is traversed from bottom to top, processing sub-computations as their children finish.
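To make the shape of that graph concrete, here's a toy Python sketch of the same bottom-up pattern. It has nothing to do with Earth Engine's actual internals; it just shows tile results rolling up into tract results, which roll up into the final answer.

```python
# A toy sketch of the decompose-then-aggregate pattern described above.
# It deliberately ignores Earth Engine internals; it only illustrates how
# tile-level results roll up into tract-level results and then a final answer.

def mean_coverage_of_tile(tile):
    """Leaf node: average canopy coverage over one tile's pixels."""
    return sum(tile) / len(tile)

def mean_coverage_of_tract(tract):
    """Internal node: aggregate the tile-level results for one tract."""
    tile_means = [mean_coverage_of_tile(tile) for tile in tract['tiles']]
    return sum(tile_means) / len(tile_means)

def mean_coverage_of_all_tracts(tracts):
    """Root node: aggregate the per-tract results."""
    return {tract['id']: mean_coverage_of_tract(tract) for tract in tracts}

tracts = [
    {'id': 'tract-1', 'tiles': [[0.25, 0.75], [0.5, 1.0]]},  # two tiny "tiles"
    {'id': 'tract-2', 'tiles': [[0.25, 0.25]]},
]
print(mean_coverage_of_all_tracts(tracts))  # {'tract-1': 0.625, 'tract-2': 0.25}
```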
Distributed computation in Earth Engine
The continental United States encompasses 8 million square kilometers, which makes 8 trillion pixels at 1 square meter per pixel. Let’s say Earth Engine can process 100k pixels per second on a single core. If it tried to run our calculation on a single core, it would take 2.5 years. Obviously impractical.
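Here's a quick back-of-the-envelope check of those numbers:

```python
# Back-of-the-envelope check of the single-core estimate above.
area_km2 = 8_000_000             # continental United States, roughly
pixels = area_km2 * 1_000_000    # 1 m^2 per pixel -> 1,000,000 pixels per km^2
pixels_per_second = 100_000      # assumed single-core throughput

seconds = pixels / pixels_per_second
years = seconds / (60 * 60 * 24 * 365)
print(f'{pixels:.1e} pixels, about {years:.1f} years on one core')  # ~2.5 years
```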
But Earth Engine doesn’t run workloads on a single core. Since the beginning, Earth Engine has leveraged Google-scale distributed processing power. Earth Engine’s compute fleet includes hundreds of thousands of CPUs running on tens of thousands of machines. Roughly speaking, each node in the task graph above runs on a separate one of those machines. In fact, the task graph itself is distributed. Each node only knows about its children, even though those children might themselves be the roots of deep subtrees².
Fairness and throttling
The tree canopy example will generate millions of remote calls over the course of its evaluation. Earth Engine has many thousands of users, each of whom might be running similarly huge jobs. That adds up to billions of sub-computations that need to be processed. Earth Engine’s compute fleet is huge, but it isn’t infinite. It can’t run all that work at once. So it needs to limit throughput. And it needs to do so fairly, so that all users get their fair share of the fleet.
Until very recently, each compute node was responsible for limiting its own contribution to the overall load on the fleet. Since it didn't know the current load or how broadly its tree of subtasks would spread, it applied heuristic throttling rules based on the limited information available. A top-level task might be allowed to send hundreds of concurrent tile aggregation requests, e.g. "calculate mean tree canopy coverage for census tract 123, tile 456", based on the historical observation that those tasks don't typically produce many children. Or a deeply recursive task might be forced to run all of its work locally, on the assumption that its depth suggests a broad tree of already-running siblings.
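To give a flavor of what those rules looked like, here's a hypothetical sketch; the rule names and thresholds are invented for illustration.

```python
# A hypothetical sketch of per-node throttling heuristics like those described
# above. The rule names and thresholds here are invented for illustration; the
# real rules were tuned from historical observations of production workloads.

def max_concurrent_remote_requests(task_kind: str, depth: int) -> int:
    if task_kind == 'tile_aggregation':
        # Historically these produce few children, so allow a wide fan-out.
        return 500
    if depth > 8:
        # Deep recursion suggests a broad tree of already-running siblings,
        # so run everything locally rather than fanning out further.
        return 0
    # Otherwise fall back to a conservative default.
    return 20

print(max_concurrent_remote_requests('tile_aggregation', depth=2))    # 500
print(max_concurrent_remote_requests('feature_reduction', depth=12))  # 0
```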
These heuristics worked all right, but there were problems. When the fleet is perfectly loaded, all processors are active and none are overburdened. As more work arrives, Earth Engine can gracefully degrade service to a point, but beyond its maximum capacity, the system becomes overloaded. The goal is to keep load as close as possible to perfect while almost never exceeding maximum capacity. Hand-tuned heuristics were crude tools for the job. To avoid overloading the system, the rules were set conservatively. As a result, Earth Engine operated below perfect load, meaning users weren't allocated as much concurrent compute as they could have been. The tree canopy computation might be held to a modest number of concurrent processors even while other processors sat idle. On other days, organic load was simply heavy enough to exceed maximum capacity, and servers would get overwhelmed.
Coordinating compute
To solve these problems, the Earth Engine team built a new high-throughput, low-latency, fairness-aware global load balancer for Earth Engine compute workloads. Internally, and in this blog post, I refer to this new scheduler as the Compute Manager. The Compute Manager schedules all computations, including both top-level user requests and all subnodes in the computation graph. When Earth Engine has capacity available, work is scheduled immediately. When capacity isn't immediately available, the Compute Manager queues the work to run as soon as it frees up. This greatly increases the amount of parallel compute a user's task has available when the fleet is under regular load. Calculating tree canopy coverage over every census tract in the continental United States used over a year's worth of compute in its first 6 hours! Under heavier load, the Compute Manager slows down each task proportionally so everyone gets a fair share and the fleet doesn't get overwhelmed. The result is an average 7x improvement in latency for large batch tasks. Tasks that weren't feasible before are now comfortably runnable.
Challenges
Building this system wasn’t as straightforward as it might seem.
Task graph traversal
Unlike the individual compute nodes, the Compute Manager is aware of the shape of each task graph and can use that information to schedule the highest-priority work. Efficiently traversing a dynamic task graph is tricky, though. The Compute Manager knows that internal nodes are blocked by their children and may therefore be sitting on idle cores. To minimize idle CPU time, it tries to maximize the ratio of leaf nodes to internal nodes running at any given time. If it knew the shape of the graph in advance, it could just do a depth-first search.
But of course every internal node starts as a leaf and every leaf can always spawn children and become an internal node. So what’s optimal at one moment might not be a moment later. When the system detects that it is significantly divergent from optimal, it preempts tasks mid-execution in order to rebalance. That, of course, introduces its own set of tradeoffs since the work of the preempted task is thrown away (note that users are only charged for tasks that finish successfully).
If the Compute Manager is not careful, it can also become deadlocked. If it is at per-user capacity and all of the currently running tasks are idly waiting for children, it is stuck. This and a few other flavors of deadlock do, in fact, crop up every once in a while. When the system detects deadlock, it either temporarily gives the blocked job more than its fair share of the compute fleet to get it out of trouble, or it forces the parent of the deadlocking task to run the task locally.
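Here's a simplified sketch of that first flavor of deadlock check, using invented bookkeeping classes:

```python
# A simplified sketch of the deadlock check described above: a job is stuck if
# it's at its concurrency cap and every running task is idly waiting on
# children. The bookkeeping classes here are invented for illustration.

from dataclasses import dataclass, field

@dataclass
class Task:
    waiting_on_children: bool = False

@dataclass
class Job:
    concurrency_limit: int
    running_tasks: list = field(default_factory=list)

def is_deadlocked(job: Job) -> bool:
    at_capacity = len(job.running_tasks) >= job.concurrency_limit
    all_blocked = bool(job.running_tasks) and all(
        task.waiting_on_children for task in job.running_tasks)
    return at_capacity and all_blocked

# Two running tasks, both blocked, and a per-user limit of two: deadlock.
job = Job(concurrency_limit=2,
          running_tasks=[Task(waiting_on_children=True),
                         Task(waiting_on_children=True)])
print(is_deadlocked(job))  # True -> grant extra capacity or run a child locally
```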
Scheduling latency
Previously, when a task wanted to send a sub-computation out for remote execution, it would send a request directly to the worker that would process it. The Compute Manager is meant to improve the overall throughput of the system by increasing the amount of parallelism each task gets. But it also introduces a new source of latency: the time it takes to prioritize each computation. In implementing the system, it was important that the benefits of increased parallelism greatly outweigh the penalty paid in scheduling latency. To that end, we set a target of 1 millisecond median scheduling latency. To achieve this, Earth Engine uses a nested updateable priority queue so that it can always pop the highest priority task in constant time. Inserts and updates to the queue are O(log(n)). Profiling shows that our scheduling algorithm is almost always significantly faster than our 1 millisecond target.
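The queue itself isn't described in detail here, but one textbook way to get roughly these properties is a binary heap with lazy invalidation of stale entries, sketched below. Treat it as an illustration of the idea rather than our actual implementation.

```python
import heapq
import itertools

class UpdateablePriorityQueue:
    """A textbook updateable priority queue: a binary heap plus lazy
    invalidation of stale entries. Peeking at the best entry is O(1), and
    pushes, updates, and (amortized) pops are O(log n). This is a sketch of
    one way to get these properties, not Earth Engine's actual structure."""

    _REMOVED = object()  # sentinel marking an invalidated heap entry

    def __init__(self):
        self._heap = []                    # entries: [priority, tie_break, item]
        self._live = {}                    # item -> its current live entry
        self._counter = itertools.count()  # tie-breaker for equal priorities

    def push_or_update(self, item, priority):
        if item in self._live:
            self._live[item][2] = self._REMOVED  # invalidate the old entry
        entry = [priority, next(self._counter), item]
        self._live[item] = entry
        heapq.heappush(self._heap, entry)

    def pop(self):
        while self._heap:
            priority, _, item = heapq.heappop(self._heap)
            if item is not self._REMOVED:
                del self._live[item]
                return item, priority
        raise KeyError('pop from an empty queue')

# Lower numbers mean higher priority in this sketch.
queue = UpdateablePriorityQueue()
queue.push_or_update('tract-123/tile-456', priority=5)
queue.push_or_update('tract-123/tile-456', priority=1)  # reprioritize
print(queue.pop())  # ('tract-123/tile-456', 1)
```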
Fairness
The structure of the nested task queue is also responsible for maintaining fairness. The outer queue keeps track of top-level jobs and always gives priority to the one with the fewest currently executing tasks. So say Earth Engine can handle 100,000 concurrent tasks and there are 100 jobs underway, each running 1000 concurrent sub-computations. When the tree canopy job starts executing, it’s the 101st job in the system. It will queue up a bunch of tasks and the Compute Manager will preferentially schedule them for execution as processors become available until all 101 currently active jobs are running about 990 concurrent tasks. Note that this system maintains fairness even when task runtimes are non-uniform. Another interesting subtlety is that a slightly exotic updateable priority queue is necessary here because it needs to maintain the heap property not only when work is queued or dequeued, but also when a task finishes processing. The inner queue aims to approximate depth-first search by always choosing the deepest task for a given top-level job.
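Putting the two levels together, the scheduling decision is conceptually something like the sketch below. The real system does this bookkeeping with the updateable priority queues described above; plain min/max is used here just to keep the example short, and the field names are invented.

```python
# A simplified sketch of the two-level scheduling decision described above.
# The real system does this bookkeeping with updateable priority queues; plain
# min/max keeps this example short, and the field names are invented.

from dataclasses import dataclass, field

@dataclass
class QueuedTask:
    name: str
    depth: int  # depth in the task graph; preferring deeper approximates DFS

@dataclass
class Job:
    name: str
    running: int                     # currently executing tasks for this job
    queued: list = field(default_factory=list)

def pick_next_task(jobs):
    # Outer level: favor the job with the fewest currently executing tasks.
    candidates = [job for job in jobs if job.queued]
    if not candidates:
        return None
    job = min(candidates, key=lambda j: j.running)
    # Inner level: within that job, favor the deepest queued task.
    task = max(job.queued, key=lambda t: t.depth)
    job.queued.remove(task)
    job.running += 1
    return job.name, task.name

jobs = [
    Job('established-job', running=1000, queued=[QueuedTask('tile-a', depth=2)]),
    Job('tree-canopy', running=3,
        queued=[QueuedTask('tract-123', depth=1),
                QueuedTask('tract-123/tile-456', depth=2)]),
]
print(pick_next_task(jobs))  # ('tree-canopy', 'tract-123/tile-456')
```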
Scale
One of the biggest challenges in implementing this system was just pushing all the bytes through the new centralized service. A single Compute Manager cannot handle all of Earth Engine’s internal traffic so the system is sharded by batch job: all requests for a given job are routed to the same Compute Manager via a deterministic routing hash; different jobs might be handled by different managers. The fairness and efficiency of the system are both inversely proportional to the number of shards, though, so it’s important to minimize the number of Compute Managers.
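The routing itself can be as simple as a stable hash of the job ID modulo the number of shards, along the lines of this sketch (the hash choice and shard count are stand-ins):

```python
import hashlib

# A minimal sketch of deterministic job-to-shard routing. The hash function
# and shard count are assumptions; the real system also has to cope with
# shard resizing and failover.

NUM_SHARDS = 16  # assumed shard count

def shard_for_job(job_id: str) -> int:
    digest = hashlib.sha256(job_id.encode('utf-8')).digest()
    return int.from_bytes(digest[:8], 'big') % NUM_SHARDS

# Every request belonging to the same batch job lands on the same shard.
print(shard_for_job('tree-canopy-export'))
print(shard_for_job('tree-canopy-export'))  # same value as above
```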
To make the tension between fairness, efficiency, and shard count a little clearer, consider a system that has 20 compute workers and needs to handle 10 concurrent batch tasks. Ideally, there would be one Compute Manager instance, i.e. the entire system would be a single shard. In this scenario, the Compute Manager has a global view of the system and can distribute worker capacity perfectly fairly among the batch tasks:
But in reality there is too much traffic to pump through a single Compute Manager, so the load is sharded among a few Compute Manager instances:
In general, the greater the number of shards, the less fair and efficient the system is overall. On the other hand, the fewer the shards, the heavier the load that each Compute Manager needs to handle. In order to optimize for fairness and efficiency, each Compute Manager needs to be able to handle as large a load as possible.
At peak times, a single Compute Manager task is forwarding huge volumes of computation results. This alone is a strain on both Earth Engine software and the network hardware. Worse, though, any unexpected pause in processing (due to, e.g., network weather or hardware issues) will cause the system to exceed its memory allocation within seconds if it isn't extremely responsive to overload. Multiple levels of bespoke application-level load shedding, in addition to shared Google infrastructure at both the application and network levels, make sure the system can weather the heaviest of loads.
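As a toy illustration of what application-level load shedding can look like, a server might track how many result bytes it has buffered and reject new work as that number approaches its memory budget. The thresholds and policy below are entirely invented:

```python
# A toy illustration of application-level load shedding: track how many result
# bytes are buffered and start rejecting new work as memory pressure rises.
# The thresholds and the policy itself are invented for illustration.

SOFT_LIMIT_BYTES = 8 * 1024**3   # assumed: begin shedding here
HARD_LIMIT_BYTES = 10 * 1024**3  # assumed: never buffer beyond this

def should_accept(request_bytes: int, buffered_bytes: int, is_retry: bool) -> bool:
    projected = buffered_bytes + request_bytes
    if projected >= HARD_LIMIT_BYTES:
        return False      # shed everything; callers retry with backoff
    if projected >= SOFT_LIMIT_BYTES:
        return is_retry   # under pressure, only accept work that has already retried
    return True

# 9 GiB already buffered: a fresh 64 MiB request gets shed, a retry squeaks in.
print(should_accept(64 * 1024**2, buffered_bytes=9 * 1024**3, is_retry=False))  # False
print(should_accept(64 * 1024**2, buffered_bytes=9 * 1024**3, is_retry=True))   # True
```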
Parting thoughts
We hope you enjoyed this deep dive into our computation engine and the work we've been doing recently. There's still more to be done in areas like fleet utilization, room for growth, and task resumption, so look out for even more in 2025! But for now, go check out the improvements for yourself and generate some big datasets that will change the world!
[1] Astute readers will notice that some processing is needed before assembling the DAG, e.g. getting the relevant census tract boundaries and calculating how many tiles are in each. Graph construction is actually a very complicated and dynamic process that’s been greatly simplified here. The important thing to know is that at many points in the process, a DAG of subtasks needs to be processed.
[2] Why, you might ask? Earth Engine lets users express arbitrarily complicated computations, so just describing, let alone evaluating, the task graph on a single machine can be very computationally expensive. Instead, the task graph is expanded lazily.