
Boost Training Goodput: How Continuous Checkpointing Optimizes Reliability in Orbax and MaxText

Google Developers Blog · March 31, 2026

The newly introduced continuous checkpointing feature in Orbax and MaxText is designed to optimize the balance between reliability and performance during model training, addressing issues with conventional fixed-frequency checkpointing. Unlike fixed intervals—which can either compromise reliability or bottleneck performance—continuous checkpointing maximizes I/O bandwidth and minimizes failure risk by asynchronously initiating a new save operation only after the previous one successfully completes. Benchmarks demonstrate that this approach significantly reduces checkpoint intervals and results in substantial resource conservation, especially in large-scale training jobs where mean-time-between-failure (MTBF) is short.


The newly introduced continuous checkpointing feature in Orbax and MaxText is engineered to help your training job strike the optimal balance between reliability and performance.

The periodicity of checkpoint generation during model training is conventionally fixed, be it every X training steps or every Y minutes. Selecting an appropriate checkpoint frequency is far from trivial, as an incorrect setting typically leads to one of two failure modes:

  • Infrequent checkpointing: reliability is compromised. A low frequency can waste substantial resources, since hardware failures and preemption events are common during extended training runs.
  • Frequent checkpointing: performance becomes the constraint. Although checkpointing is typically executed asynchronously, initiating it too often can block or bottleneck training, especially over unstable networks.

Continuous checkpointing, by contrast, makes full use of the host machine and available I/O bandwidth while minimizing the risk associated with hardware failures. It does so with minimal performance degradation: Orbax initiates a new asynchronous checkpoint save only once the preceding save operation has completed successfully.
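The decision rule is simple enough to sketch in a few lines. The toy class below (hypothetical names, not the Orbax implementation) captures the core idea: a new save is started only when no previous save is still in flight, so the training loop is never blocked waiting on I/O.

```python
import threading
import time


class ContinuousCheckpointer:
    """Toy sketch of the continuous-checkpointing rule (not Orbax code):
    start a new asynchronous save only after the previous one finishes."""

    def __init__(self, save_fn):
        self._save_fn = save_fn   # the actual (slow) checkpoint write
        self._inflight = None     # thread running the current save, if any
        self.saved_steps = []     # steps whose saves have completed

    def maybe_save(self, step, state):
        # A save is still running: skip, so training is never blocked.
        if self._inflight is not None and self._inflight.is_alive():
            return False
        # The previous save (if any) has completed: kick off the next one.
        def _save():
            self._save_fn(step, state)
            self.saved_steps.append(step)
        self._inflight = threading.Thread(target=_save)
        self._inflight.start()
        return True

    def wait(self):
        # Drain the last in-flight save (e.g. at the end of training).
        if self._inflight is not None:
            self._inflight.join()


# Simulated training loop: each "save" is slower than a training step,
# so only a subset of steps actually produce checkpoints.
ckpt = ContinuousCheckpointer(lambda step, state: time.sleep(0.05))
for step in range(10):
    time.sleep(0.01)              # one training step
    ckpt.maybe_save(step, state={})
ckpt.wait()
```

Because a skipped save costs nothing, the effective checkpoint interval automatically tracks how fast the storage backend can absorb writes.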

Get Started

Take MaxText as an example: it takes only one step to enable continuous checkpointing. Simply configure the following flags for the training task:

```yaml
# Enable asynchronous checkpointing
enable_checkpointing: True
async_checkpointing: True
...

# Enable continuous checkpointing
enable_continuous_checkpointing: True

# Keep the latest 10 checkpoints to avoid excessive storage consumption
max_num_checkpoints_to_keep: 10
...
```

Note: continuous checkpointing will override checkpoint_period, which was previously used to control the checkpointing frequency.

MaxText will attempt to save a checkpoint as soon as the previous save request has completed in the background. Take a llama-3.1-70B continuous pre-training (CPT) task as an example: on two slices of a v5p-128 cluster, we compare two configurations: (a) continuous checkpointing enabled, and (b) checkpointing every 100 steps.

Checkpoint frequency and average training step time with different checkpointing strategies.

As demonstrated by the benchmark findings, the P50 checkpoint intervals are markedly smaller when continuous checkpointing is activated. This is accompanied by an anticipated increase in the average training step time, primarily attributed to the more frequent device-to-host data transfer operations.

To quantify the tangible benefits of more frequent checkpointing, we can reasonably assume a mean-time-between-failure (MTBF), where a failure is any event that terminates the job, such as a hardware malfunction or a preemption event.
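To make the arithmetic concrete (with hypothetical numbers, not the benchmark's): each failure loses, on average, half a checkpoint interval of work, and a job of length T sees roughly T / MTBF failures, so the expected waste is about (T / MTBF) × interval / 2:

```python
def expected_wasted_hours(job_hours, mtbf_hours, ckpt_interval_minutes):
    """Expected compute lost to failures: ~job_hours / mtbf_hours failures,
    each losing on average half a checkpoint interval of progress."""
    failures = job_hours / mtbf_hours
    return failures * (ckpt_interval_minutes / 60.0) / 2.0


# Hypothetical: a 30-day (720 h) job with a 12 h MTBF.
waste_fixed = expected_wasted_hours(720, 12, 30)      # checkpoint every 30 min
waste_continuous = expected_wasted_hours(720, 12, 2)  # ~2 min between saves
```

With these made-up numbers, the fixed strategy loses about 15 hours of compute versus about 1 hour with the shorter interval; the gap widens as MTBF shrinks.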

Estimated resource savings with continuous checkpointing, as a function of MTBF.

The benchmark was conducted on a relatively modest cluster configuration, specifically featuring 64 chips per slice; yet, it demonstrates substantial resource conservation. Moreover, the efficiency gains realized through continuous checkpointing are amplified during large-scale training initiatives for the following compelling reasons:

  • The model state is sharded into smaller fragments per host, which notably reduces the device-to-host blocking time.
  • The mean time between failures (MTBF) shrinks roughly linearly as the job scales up: the more hardware involved, the sooner the first failure arrives.
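The MTBF point follows from a standard reliability argument (assuming independent, exponentially distributed failures, which is an assumption on our part, not something the benchmark measures): if any slice failing kills the job, the first failure across N slices arrives about N times sooner than on one slice.

```python
def cluster_mtbf(per_slice_mtbf_hours, num_slices):
    """MTBF of a job that dies when any of N independent slices fails
    (exponential failure model): N times shorter than a single slice."""
    return per_slice_mtbf_hours / num_slices


# E.g. a 24 h per-slice MTBF across 8 slices gives a 3 h job-level MTBF.
job_mtbf = cluster_mtbf(24.0, 8)  # -> 3.0
```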

More Comprehensive Use Cases

Orbax also offers more flexible options for saving and preserving checkpoints beyond what MaxText exposes today. These can be defined as highly customizable policies for more complex use cases:

  • Continuous checkpoint with minimum interval between checkpoints:

```python
continuous_checkpointing_policy_with_minimum_interval = (
    save_decision_policy.ContinuousCheckpointingPolicy(minimum_interval_secs=30)
)
```

Each training step might be very short when working with lightweight models, so checkpointing too frequently can create unwanted I/O overhead. Setting minimum_interval_secs enforces a cool-down period between checkpoints.

  • Preserve the checkpoint based on some customized logics, for example, keep at least one checkpoint every X seconds:

```python
every_n_seconds_preservation_policy = preservation_policy.EveryNSeconds(180)
```

In the above example, Orbax will preserve at least one checkpoint every 180 seconds, unless no checkpoint was produced within that window. This can be used to prune checkpoints while retaining the ability to evaluate or restore from an earlier one.

  • Abstract policies for saving and preserving checkpoints are also provided:

```python
@dataclasses.dataclass
class CustomizedPreservationPolicy(PreservationPolicy):
  """Implement your own policy for preserving checkpoints."""

  def should_preserve(
      self,
      checkpoints: Sequence[PolicyCheckpointInfo],
      *,
      context: PreservationContext,
  ) -> Sequence[bool]:
    result = [
        is_checkpoint_preservable(checkpoint) for checkpoint in checkpoints
    ]
    _log_preservation_decision(
        "Customized Preservation Policy", checkpoints, result
    )
    return result
```

The above example could be used to preserve checkpoints based on evaluation results.
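To use such policies end to end, they are passed to the checkpoint manager. The sketch below assumes the `save_decision_policy` and `preservation_policy` fields on `CheckpointManagerOptions` found in recent Orbax releases, and a hypothetical `/tmp/ckpts` directory; the import paths are a best guess, so verify both against your installed Orbax version:

```python
import orbax.checkpoint as ocp
from orbax.checkpoint.checkpoint_managers import (
    preservation_policy,
    save_decision_policy,
)

# Sketch under the assumptions above, not a verified recipe.
options = ocp.CheckpointManagerOptions(
    save_decision_policy=save_decision_policy.ContinuousCheckpointingPolicy(
        minimum_interval_secs=30
    ),
    preservation_policy=preservation_policy.EveryNSeconds(180),
)
manager = ocp.CheckpointManager("/tmp/ckpts", options=options)
```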

Things to note

Why can't we just increase the checkpoint frequency?

Simply reducing the checkpoint interval does not achieve the expected result, for the following reasons:

  • The conventional checkpoint interval maintains a constant value, despite the inherently non-constant nature of the operating environment. Implementing an overly aggressive checkpointing strategy risks blocking the entire training process, leading to idle TPUs and significantly greater resource wastage.
  • The selection of an appropriate checkpoint interval is non-trivial, as precise prediction is often impossible. Therefore, continuous checkpointing serves as an accessible and reliable methodology to guarantee optimal utilization of both I/O and TPU resources.

Multi-slice I/O concerns

A prevalent concern when conducting training jobs across multiple slices is the potential bottleneck imposed by Data Center Network (DCN) bandwidth. In a multi-slice configuration, DCN bandwidth is used for both model weight updates and checkpointing operations. However, within the Orbax framework, the bulk of the checkpointing work remains asynchronous and is confined to communication between the storage server and a single slice (typically slice 0). Crucially, this design ensures that inter-slice communication remains unblocked and unaffected by the checkpointing process. In our benchmark over multiple slices, we see no significant slowdown from enabling continuous checkpointing.

Prefer Co-Located Training and Storage Cluster

It is strongly recommended that you ensure the storage bucket is co-located with your training cluster. The efficacy of continuous checkpointing is heavily reliant on network bandwidth. Utilizing a cross-metro network can significantly degrade checkpointing speed, which, in turn, introduces substantial reliability risks to the overall process.
