Deploy a model to Cloud TPU VMs

Google Cloud provides access to custom-designed machine learning accelerators called Tensor Processing Units (TPUs). TPUs are optimized to accelerate the training and inference of machine learning models, making them ideal for a variety of applications, including natural language processing, computer vision, and speech recognition.

This page describes how to deploy your models to a single-host Cloud TPU v5e or v6e for online inference in Vertex AI.

Only Cloud TPU versions v5e and v6e are supported. Other Cloud TPU generations are not supported.

To learn which locations Cloud TPU v5e and v6e are available in, see locations.

Import your model

For deployment on Cloud TPUs, you must import your model to Vertex AI and configure it to use one of the following containers:

Prebuilt optimized TensorFlow runtime container

To import and run a TensorFlow SavedModel on a Cloud TPU, the model must be TPU-optimized. If your TensorFlow SavedModel isn't already TPU-optimized, you can have it optimized automatically: when you import the model, Vertex AI optimizes it by using an automatic partitioning algorithm. This optimization doesn't work on all models. If optimization fails, you must manually optimize your model.

The following sample code demonstrates how to use automatic model optimization with automatic partitioning:

model = aiplatform.Model.upload(
    display_name='TPU optimized model with automatic partitioning',
    artifact_uri="gs://model-artifact-uri",
    serving_container_image_uri="us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-tpu.2-15:latest",
    serving_container_args=[],
)

For more information on importing models, see importing models to Vertex AI.

Prebuilt PyTorch container

The instructions to import and run a PyTorch model on Cloud TPU are the same as the instructions to import and run any other PyTorch model.

For example, see TorchServe for Cloud TPU v5e Inference. Then, upload the model artifacts to your Cloud Storage folder and upload your model as shown:

model = aiplatform.Model.upload(
    display_name='DenseNet TPU model from SDK PyTorch 2.1',
    artifact_uri="gs://model-artifact-uri",
    serving_container_image_uri="us-docker.pkg.dev/vertex-ai/prediction/pytorch-tpu.2-1:latest",
    serving_container_args=[],
    serving_container_predict_route="/predictions/model",
    serving_container_health_route="/ping",
    serving_container_ports=[8080]
)

For more information, see export model artifacts for PyTorch and the tutorial notebook for Serve a PyTorch model using a prebuilt container.

Custom container

For custom containers, your model does not need to be a TensorFlow model, but it must be TPU-optimized. For information on producing a TPU-optimized model, see the following guides for common ML frameworks:

For information on serving models trained with JAX, TensorFlow, or PyTorch on Cloud TPU v5e, see Cloud TPU v5e Inference.

Make sure your custom container meets the custom container requirements.

You must raise the locked memory limit so the driver can communicate with the TPU chips over direct memory access (DMA). For example:

Command line

ulimit -l 68719476736

Python

import resource

resource.setrlimit(
    resource.RLIMIT_MEMLOCK,
    (
        68_719_476_736_000,  # soft limit
        68_719_476_736_000,  # hard limit
    ),
)

Then, see Use a custom container for inference for information on importing a model with a custom container. If you want to implement pre- or post-processing logic, consider using Custom inference routines.
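
The following is a minimal sketch of a custom inference routine built on the Vertex AI SDK's Predictor base class. The class name MyTpuPredictor and the pre- and post-processing logic are hypothetical placeholders; replace the stubbed load and predict methods with your framework's own model loading and inference code.

from typing import Any

from google.cloud.aiplatform.prediction.predictor import Predictor


class MyTpuPredictor(Predictor):
    """Hypothetical predictor that wraps pre- and post-processing around a model."""

    def load(self, artifacts_uri: str) -> None:
        # Load your TPU-optimized model from the artifacts directory.
        # Stubbed here; replace with your framework's loading code.
        self._artifacts_uri = artifacts_uri

    def preprocess(self, prediction_input: dict) -> Any:
        # Example preprocessing: extract the instances from the request body.
        return prediction_input["instances"]

    def predict(self, instances: Any) -> Any:
        # Run inference with the loaded model; stubbed to echo the inputs.
        return instances

    def postprocess(self, prediction_results: Any) -> dict:
        # Example postprocessing: wrap the results in the expected response shape.
        return {"predictions": prediction_results}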

Create an endpoint

The instructions for creating an endpoint for Cloud TPUs are the same as the instructions for creating any endpoint.

For example, the following command creates an endpoint resource:

endpoint = aiplatform.Endpoint.create(display_name='My endpoint')

The response contains the new endpoint's ID, which you use in subsequent steps.
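
For example, assuming the endpoint object created above, the following sketch shows how you might read the ID from the SDK object and look the endpoint up again later. The properties used here are standard Vertex AI SDK attributes, but verify them against your SDK version.

from google.cloud import aiplatform

# The endpoint ID is the numeric suffix of the full resource name, for example:
# projects/PROJECT_NUMBER/locations/REGION/endpoints/ENDPOINT_ID
endpoint_id = endpoint.name                       # ID only
endpoint_resource_name = endpoint.resource_name   # full resource name

# Later, you can reconstruct the Endpoint object from its ID or resource name.
endpoint = aiplatform.Endpoint(endpoint_name=endpoint_id)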

For more information on creating an endpoint, see deploy a model to an endpoint.

Deploy a model

The instructions for deploying a model to Cloud TPUs are the same as the instructions for deploying any model, except you specify one of the following supported Cloud TPU machine types:

Machine type        Number of TPU chips
ct6e-standard-1t    1
ct6e-standard-4t    4
ct6e-standard-8t    8
ct5lp-hightpu-1t    1
ct5lp-hightpu-4t    4
ct5lp-hightpu-8t    8

TPU accelerators are built into the machine type. You don't need to specify an accelerator type or accelerator count.

For example, the following command deploys a model by calling deployModel:

machine_type = 'ct5lp-hightpu-1t'

deployed_model = model.deploy(
    endpoint=endpoint,
    deployed_model_display_name='My deployed model',
    machine_type=machine_type,
    traffic_percentage=100,
    min_replica_count=1,
    sync=True,
)

For more information, see deploy a model to an endpoint.

Get online inferences

The instructions for getting online inferences from a model deployed to a Cloud TPU are the same as the instructions for getting any other online inference.

For example, the following command sends an online inference request by calling predict:

deployed_model.predict(...)
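
As a slightly fuller sketch, the following call sends a hypothetical payload. The shape of instances depends entirely on your model's serving signature, so treat these values as placeholders.

# Hypothetical input: replace with instances that match your model's signature.
instances = [[0.1, 0.2, 0.3, 0.4]]

response = deployed_model.predict(instances=instances)
print(response.predictions)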

For custom containers, see the inference request and response requirements for custom containers.

Securing capacity

For most regions, the TPU v5e and v6e cores per region quota for custom model serving is 0. In some regions, it is limited.

To request a quota increase, see Request a quota adjustment.

Pricing

TPU machine types are billed per hour, just like all other machine types in Vertex Prediction. For more information, see Prediction pricing.

What's next