The AI Inference Single Method Transform (SMT) lets you get inferences on Pub/Sub messages from Vertex AI models. You can use your own custom models deployed on Vertex AI endpoints, or use any of the Google and partner models available through Vertex AI. The model's inferences are added to each message, making them available for downstream processing along with the original message data.
Use cases for the AI Inference SMT include the following:
Real-time enrichment: Add context, classifications, predictions, sentiments, or embeddings to event data as it flows through Pub/Sub.
Simplified AI pipelines: Eliminate the need for intermediary services to get inferences from AI models. Pub/Sub handles calling the AI model and enriching the message with the inference.
Reduced latency for AI pipelines: Remove extra network hops in your architecture to achieve lower end-to-end latency.
Enhanced flow control: To avoid overloading model endpoints, Pub/Sub optimizes the rate of requests to the AI model. For more information, see Message flow in this document.
The AI Inference SMT supports the following types of model:
Self-deployed models. Open, partner, and custom models deployed to a shared or dedicated public Vertex AI endpoint.
Model-as-a-Service (MaaS) models. Models offered as a service through the Model Garden, such as Gemini and Claude, that don't require you to manage the deployment. For a list of MaaS models that are compatible with the AI Inference SMT, see Compatible MaaS models.
Required roles and permissions
To get the permissions that
you need to create a topic or subscription with SMTs,
ask your administrator to grant you the
Pub/Sub Editor (roles/pubsub.editor)
IAM role on your project.
For more information about granting roles, see Manage access to projects, folders, and organizations.
This predefined role contains the permissions required to create a topic or subscription with SMTs. To see the exact permissions that are required, expand the Required permissions section:
Required permissions
The following permissions are required to create a topic or subscription with SMTs:
- Create a topic: pubsub.topics.create on the project
- Create a subscription: pubsub.subscriptions.create on the project
You might also be able to get these permissions with custom roles or other predefined roles.
Service account permissions
The AI Inference SMT uses an IAM service account to call the
Vertex AI endpoint. By default, it uses the
Cloud Pub/Sub Service Agent account
(service-PROJECT_NUMBER@gcp-sa-pubsub.s3ns-system.iam.gserviceaccount.com).
You can also provide your own service account.
The service account needs the following permissions on the Cloud de Confiance project that contains the Vertex AI endpoint:
- aiplatform.endpoints.get
- aiplatform.endpoints.predict
To give these permissions, grant the following IAM role to the service account:
If you are using the Cloud Pub/Sub Service Agent service account, grant the Vertex AI Service Agent role.
If you are using a different service account, grant the Vertex AI User role.
Message processing
This section describes how the AI Inference SMT processes Pub/Sub messages.
Input
The Pub/Sub message data must be a request to send to the AI model, as a JSON string. You can also specify additional model parameters to send with each request. The SMT merges these parameters with the message data and sends the merged JSON to the model endpoint.
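To illustrate the merge step, the following sketch combines configured model parameters with a message's JSON payload. This is hypothetical Python for illustration only; Pub/Sub performs the actual merge server-side.

```python
import json

def merge_request(message_data: str, model_parameters: dict) -> str:
    """Illustrative sketch: combine configured model parameters with the
    JSON request carried in a Pub/Sub message's data field."""
    request = json.loads(message_data)  # the message data must be a JSON string
    request.update(model_parameters)    # configured parameters are merged in
    return json.dumps(request)

# A request body merged with a temperature parameter:
merged = merge_request(
    '{"messages": [{"role": "user", "content": "Hi"}]}',
    {"temperature": 0.5},
)
```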
The following table shows which API the SMT calls to get the inference, based on the type of model.
| Model deployment | Model type | API |
|---|---|---|
| Self-deployed | All | rawPredict |
| Model-as-a-Service (MaaS) | Gemini foundational model. Example: google/gemini-2.5-flash | Chat Completions API |
| Model-as-a-Service (MaaS) | Other Gemini models. Example: google/gemini-embedding-001 | rawPredict |
| Model-as-a-Service (MaaS) | Anthropic, Mistral AI, or AI21 | rawPredict |
| Model-as-a-Service (MaaS) | All other MaaS models | Chat Completions API |
To format the message data and model parameters correctly, consult the documentation for your model. For example, for Gemini foundational models, see the Chat Completions API examples in the Vertex AI documentation.
Output
If the call to the model endpoint succeeds, the SMT enriches the original
Pub/Sub message with the model response. The enriched message is a
JSON string like the following, where ORIGINAL_MESSAGE
is the original message data and INFERENCE_RESULT is
the response from the model:
{
"original_message": { ORIGINAL_MESSAGE },
"model_output": { INFERENCE_RESULT }
}
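Downstream code can split this wrapper back apart. The following is a minimal sketch with a hypothetical helper name, assuming the JSON structure shown above.

```python
import json

def split_enriched(enriched: str) -> tuple:
    """Separate an enriched message into the original payload and the
    model's inference, following the wrapper structure shown above."""
    doc = json.loads(enriched)
    return doc["original_message"], doc["model_output"]

# Example enriched message with a hypothetical classification result:
original, inference = split_enriched(
    '{"original_message": {"id": 1}, "model_output": {"label": "spam"}}'
)
```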
Message flow
Topic SMTs: When you define an AI Inference SMT on a topic, Pub/Sub handles incoming messages as follows:
A publisher application sends a message to a Pub/Sub topic.
The message is sent to the configured model endpoint for inference. The enriched message, containing the original data and the model's inference, is written to Pub/Sub's internal storage.
Pub/Sub delivers the enriched message to all attached subscriptions.
Subscription SMTs: When you define an AI Inference SMT on a subscription, Pub/Sub handles incoming messages as follows:
A publisher application sends a message to a Pub/Sub topic.
Pub/Sub delivers the message to the subscription.
The message is sent to the configured model endpoint for inference.
The subscription sends the enriched message to the subscriber application.
Pub/Sub optimizes the rate of requests to the AI model to maximize throughput, based on your deployment's latency and quota. Note: This capability isn't supported when using the unary pull API.
You can chain an AI Inference SMT with one or more JavaScript UDF SMTs. Use this pattern to pre-process a message to fit your model's expected input format, or post-process the model's output before it is delivered to subscribers.
Create an AI Inference SMT
SMTs can be configured on Pub/Sub topics or subscriptions.
- Topic SMTs are executed before Pub/Sub stores the message, and the results are available to all subscribers.
- Subscription SMTs are executed before the message is delivered, and the results are only available for that subscription.
Console
In the Cloud de Confiance console, go to the Pub/Sub Topics page.
Create either a topic or a subscription.
To create a topic, click Create topic. The Create topic page opens.
To create a subscription:
Click the name of the topic where you want the subscription.
Click Create subscription. The Add subscription to topic page opens.
Under Transforms, click Add a transform.
For Transform type, select AI Inference.
For Endpoint, enter the full resource name of your model endpoint:
- Self-deployed model: projects/PROJECT/locations/LOCATION/endpoints/ENDPOINT
- Model Garden model: projects/PROJECT/locations/LOCATION/publishers/PUBLISHER/models/MODEL_NAME
Optional. Select a Service account to use when calling the Vertex AI endpoint. For more information, see Service account permissions.
Optional. In the Parameters field, enter model parameters as a JSON object. The SMT merges these parameters with each message before calling the model. Example:
{ "temperature": 0.5, "max_tokens": 1000 }

To create the topic or subscription, click Create.
gcloud
Create a definition file
Create a YAML or JSON file that defines the AI Inference SMT.
YAML
- aiInference:
    endpoint: "ENDPOINT_RESOURCE"
    unstructuredInference: {
      parameters: MODEL_PARAMETERS
    }
    service_account_email: SERVICE_ACCOUNT
JSON
{
  "aiInference": {
    "endpoint": "ENDPOINT_RESOURCE",
    "unstructuredInference": {
      "parameters": {
        MODEL_PARAMETERS
      }
    },
    "service_account_email": SERVICE_ACCOUNT
  }
}
Replace the following:
ENDPOINT_RESOURCE: The full resource name of the model endpoint. Use the following format:

- Self-deployed model: projects/PROJECT/locations/LOCATION/endpoints/ENDPOINT
- Model Garden model: projects/PROJECT/locations/LOCATION/publishers/PUBLISHER/models/MODEL_NAME

MODEL_PARAMETERS: Optional. Model parameters, specified as a JSON object. The SMT merges these parameters with each message before calling the model. Example:

{ "temperature": 0.5, "max_tokens": 1000 }

SERVICE_ACCOUNT: Optional. A service account email to use when calling the endpoint. For more information, see Service account permissions.
Create a topic or subscription
To create a topic, run the
gcloud pubsub topics create
command.
gcloud pubsub topics create TOPIC_ID \
--message-transforms-file=TRANSFORMS_FILE
Replace the following:
- TOPIC_ID: The ID or name of the topic you want to create.
- TRANSFORMS_FILE: The path to the definition file.
To create a subscription, run the
gcloud pubsub subscriptions create
command.
gcloud pubsub subscriptions create SUBSCRIPTION_ID \
--topic=projects/PROJECT_ID/topics/TOPIC_ID \
--message-transforms-file=TRANSFORMS_FILE
Replace the following:
- SUBSCRIPTION_ID: The ID or name of the subscription to create.
- PROJECT_ID: The ID of the project that contains the topic.
- TOPIC_ID: The ID of the topic to subscribe to.
- TRANSFORMS_FILE: The path to the definition file.
Validate and test
Optionally, you can validate and test the configured SMT before you create the topic or subscription.
Example: Using the AI Inference SMT
The following example shows how to create a subscription with an AI Inference SMT and then use it to send a prompt to Gemini.
gcloud
Using a text editor, create a file named ai-smt.yaml and paste in the following text:

- aiInference:
    endpoint: projects/PROJECT_ID/locations/LOCATION/publishers/google/models/gemini-2.5-flash
    unstructuredInference: {
      parameters: { "max_tokens": 25000 }
    }

Replace the following:

- PROJECT_ID: The ID of your Cloud de Confiance project.
- LOCATION: The location of the endpoint to call. Example: us-central1.

Create a new Pub/Sub topic.

gcloud pubsub topics create TOPIC_ID

Replace TOPIC_ID with the name of the topic to create. Example: topic-1.

Create a subscription that has an AI Inference SMT.

gcloud pubsub subscriptions create TOPIC_ID-sub \
  --ack-deadline=600 \
  --topic TOPIC_ID \
  --message-transforms-file ai-smt.yaml

Publish a message to the topic. The message contains a prompt that is formatted for the Chat Completions API.

gcloud pubsub topics publish TOPIC_ID --message=$'{ "model":"google/gemini-2.5-flash","messages":[{ "role": "user", "content": "Explain how AI works in a few words" }] }'

Receive a message from the subscription.

gcloud pubsub subscriptions pull TOPIC_ID-sub

If the call to Vertex AI succeeds, the message is enriched with the output from the prompt.
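Because this example calls the model through the Chat Completions API, the model_output field of the enriched message holds a Chat Completions-style response, where the generated text sits at choices[0].message.content. The following sketch extracts it from a pulled message; the sample payload and helper name are hypothetical.

```python
import json

# Hypothetical enriched message, following the wrapper described in Output,
# with a Chat Completions-style response in model_output.
enriched = json.dumps({
    "original_message": {
        "messages": [{"role": "user", "content": "Explain how AI works in a few words"}]
    },
    "model_output": {
        "choices": [{"message": {"role": "assistant",
                                 "content": "Pattern recognition learned from data."}}]
    },
})

def generated_text(enriched_message: str) -> str:
    """Extract the assistant's reply, assuming a Chat Completions response."""
    doc = json.loads(enriched_message)
    return doc["model_output"]["choices"][0]["message"]["content"]
```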
Compatible MaaS models
The following table lists the Model-as-a-Service (MaaS) models that Google has tested with the AI Inference SMT and that are known to be compatible. This list is subject to change as models are deprecated or new MaaS models are added.
| Model | API called |
|---|---|
| google/gemini-2.0-flash-001 | Chat Completions API |
| google/gemini-2.0-flash-lite-001 | Chat Completions API |
| google/gemini-2.5-flash | Chat Completions API |
| google/gemini-2.5-flash-lite | Chat Completions API |
| google/gemini-2.5-pro | Chat Completions API |
| google/gemini-2.5-flash-image | Chat Completions API |
| google/gemini-3-pro-preview | Chat Completions API |
| google/gemini-3-pro-image-preview | Chat Completions API |
| google/gemini-3-flash-preview | Chat Completions API |
| google/gemini-3.1-pro-preview | Chat Completions API |
| google/gemini-3.1-flash-image-preview | Chat Completions API |
| google/gemini-3.1-flash-lite-preview | Chat Completions API |
| meta/llama-3.3-70b-instruct-maas | Chat Completions API |
| meta/llama-4-maverick-17b-128e-instruct-maas | Chat Completions API |
| meta/llama-4-scout-17b-16e-instruct-maas | Chat Completions API |
| deepseek-ai/deepseek-r1-0528-maas | Chat Completions API |
| deepseek-ai/deepseek-v3.1-maas | Chat Completions API |
| qwen/qwen3-235b-a22b-instruct-2507-maas | Chat Completions API |
| qwen/qwen3-coder-480b-a35b-instruct-maas | Chat Completions API |
| openai/gpt-oss-20b-maas | Chat Completions API |
| openai/gpt-oss-120b-maas | Chat Completions API |
| google/text-multilingual-embedding-002 | rawPredict |
| google/text-embedding-005 | rawPredict |
| google/text-embedding-large-exp-03-07 | rawPredict |
| google/gemini-embedding-001 | rawPredict |
| google/multimodalembedding | rawPredict |
| anthropic/claude-sonnet-4 | rawPredict |
| anthropic/claude-sonnet-4-5 | rawPredict |
| anthropic/claude-sonnet-4-6 | rawPredict |
| anthropic/claude-opus-4 | rawPredict |
| anthropic/claude-opus-4-1 | rawPredict |
| anthropic/claude-opus-4-5 | rawPredict |
| anthropic/claude-opus-4-6 | rawPredict |
| anthropic/claude-haiku-4-5 | rawPredict |
| mistralai/mistral-small-2503 | rawPredict |
| mistralai/mistral-medium-3 | rawPredict |
| mistralai/mistral-ocr-2505 | rawPredict |
| mistralai/codestral-2 | rawPredict |
Limitations
Only one AI Inference SMT is allowed per topic or subscription.
Private endpoints are not supported. Self-deployed models must be hosted on public Vertex AI endpoints.
The global endpoint is only supported for Gemini foundation models. For other models, you must use a regional endpoint.
Pub/Sub does not validate the input message data. You are responsible for ensuring the data format is correct.
The transform sends one inference request per Pub/Sub message. Client-side batching is not performed.
Asynchronous batch inferences are not supported.
The inference must not take longer than 60 seconds. If it exceeds 60 seconds, the delivery attempt times out and Pub/Sub retries it, subject to the configured message retention duration and retry policy settings. If attempts continue to time out, the message is forwarded to the dead-letter topic, if one is configured.
Unsupported models
The AI Inference SMT doesn't support the following MaaS models. Many of these models have self-deployed versions available that you can use instead.
- deepseek-ai/deepseek-ocr-maas
- deepseek-ai/deepseek-v3.2-maas
- google/gemini-embedding-2-preview
- google/lyria-002
- google/lyria-3-clip-preview
- google/lyria-3-pro-preview
- google/veo-3.1-fast-generate-001
- google/veo-3.1-generate-001
- intfloat/multilingual-e5-large-instruct-maas
- intfloat/multilingual-e5-small-instruct-maas
- minimaxai/minimax-m2-maas
- moonshotai/kimi-k2-thinking-maas
- qwen/qwen3-next-80b-a3b-instruct-maas
- qwen/qwen3-next-80b-a3b-thinking-maas
- zai-org/glm-4.7-maas
- zai-org/glm-5-maas
Regional constraints
The following constraints apply to AI Inference SMTs based on the region of the Vertex AI endpoint.
If an AI Inference SMT is defined on a topic, then the endpoint region must be within the regions allowed by the topic's message storage policy.
This constraint also applies to subscription SMTs if the Enforce in-transit regions for Pub/Sub messages organization policy constraint is in effect.
If an AI Inference SMT is defined on an export subscription, then the endpoint region must be in the region of the associated resource:
- For a BigQuery subscription, the region of the destination table.
- For a Cloud Storage subscription, the region of the Cloud Storage bucket.
If a publish request is made to a region other than the endpoint region, then Pub/Sub automatically redirects the request to the endpoint region.
If you pull from a subscription with an AI Inference SMT, and the pull request is made to a region other than the endpoint region, then Pub/Sub rejects the request. We recommend using a locational endpoint for pull subscriptions. This constraint applies to both streaming pull and unary pull.
When a push subscription has an AI Inference SMT, the subscription pushes messages from the endpoint region. If a regional constraint violation occurs, then Pub/Sub stops pushing messages from that subscription.
Troubleshooting
This section provides troubleshooting tips for the AI Inference SMT.
Topic SMT errors. If the inference fails when the message is published, the entire publish request fails. The error information is returned to the publisher client.
Subscription SMT errors. If the inference fails when the message is delivered, the message can be forwarded to a dead-letter topic. We recommend setting up a dead-letter topic when using SMTs on a subscription.
Model inference errors. If the inference fails and returns an error, check the following:
Verify that the configured endpoint is correct.
Verify that the Pub/Sub message data contains a valid inference request for your model.
Verify that all model parameters are valid.
The inference might fail for other reasons, such as connectivity issues.
Permission or endpoint errors. If the configured service account loses permission to the endpoint, or the endpoint is deleted, the SMT fails.
Quotas and limits
In addition to Pub/Sub quotas and limits, the AI Inference SMT is subject to the quotas and rate limits of the Vertex AI endpoint. Pub/Sub's built-in flow control automatically adjusts the request rate to avoid overloading the endpoint, but the rate can't exceed the model's quota.
The final transformed message size, including the original message and the inference output, must be less than the Pub/Sub message size limit. If the transformed message exceeds the limit, the transform fails.
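The documented limit for a Pub/Sub message is 10 MB. A rough client-side estimate of whether an enriched message will fit can be sketched as follows; the helper name is hypothetical, and the actual enforced size is computed by Pub/Sub, not by this approximation.

```python
import json

PUBSUB_MAX_MESSAGE_BYTES = 10 * 1024 * 1024  # Pub/Sub's documented 10 MB limit

def fits_message_limit(original_message: dict, inference_result: dict) -> bool:
    """Rough estimate of whether the enriched message (original data
    plus model output, in the documented wrapper) stays under the limit."""
    enriched = json.dumps({
        "original_message": original_message,
        "model_output": inference_result,
    })
    return len(enriched.encode("utf-8")) < PUBSUB_MAX_MESSAGE_BYTES
```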