Manual feature preprocessing
You can use the TRANSFORM clause of the CREATE MODEL statement in combination
with manual preprocessing functions to define custom data preprocessing. You can
also use these manual preprocessing functions outside of the TRANSFORM clause.
If you want to decouple data preprocessing from model training, you can create a
transform-only model that only performs data transformations by using the
TRANSFORM clause.
You can use the ML.TRANSFORM function to increase the transparency of feature
preprocessing. This function lets you return the preprocessed data from a
model's TRANSFORM clause, so that you can see the actual training data that goes
into the model training, as well as the actual prediction data that goes into
model serving.
For information about feature preprocessing support in
BigQuery ML, see
Feature preprocessing overview.
For information about the supported SQL statements and functions for each model
type, see End-to-end user journey for each model.
Types of preprocessing functions
There are several types of manual preprocessing functions:
- Scalar functions operate on a single row. For example, ML.BUCKETIZE.
- Table-valued functions operate on all rows and output a table. For example,
  ML.FEATURES_AT_TIME.
- Analytic functions operate on all rows, and output the result for each
  row based on the statistics collected across all rows. For example,
  ML.QUANTILE_BUCKETIZE.
  You must always use an empty OVER() clause with ML analytic functions.
  When you use ML analytic functions inside the TRANSFORM clause
  during training, the same statistics are automatically applied to
  the input in prediction.
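To make the difference between scalar and analytic functions concrete, here is a
minimal plain-Python sketch. It is a conceptual illustration only, not
BigQuery ML's implementation: the helper names (bucketize, fit_min_max,
apply_min_max), the bin labels, the boundary handling, and the sample data are
all assumptions made for this example.

```python
from bisect import bisect_right


def bucketize(value, split_points):
    """Scalar-style: the result depends only on the value in the current row."""
    # bisect_right counts the split points that are <= value; bins are 1-based here.
    return f"bin_{bisect_right(split_points, value) + 1}"


def fit_min_max(column):
    """Analytic-style, step 1: collect statistics across all rows."""
    return {"min": min(column), "max": max(column)}


def apply_min_max(value, stats):
    """Analytic-style, step 2: transform one row using the collected statistics."""
    span = stats["max"] - stats["min"]
    # Returning 0.0 for a constant column is a choice made for this sketch.
    return (value - stats["min"]) / span if span else 0.0


training = [10.0, 20.0, 30.0, 50.0]
stats = fit_min_max(training)
scaled_training = [apply_min_max(v, stats) for v in training]

# At prediction time the training statistics are reused, not recomputed.
scaled_prediction = apply_min_max(40.0, stats)

bins = [bucketize(v, [20, 40]) for v in training]
```

The scalar helper needs no knowledge of other rows, whereas the scaler has to
see the whole training column first. Keeping the statistics in a separate value
is what allows the same transformation to be applied to prediction input.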
The following sections describe the available preprocessing functions.
General functions
Use the following function on string or numerical expressions to do data cleanup:
- ML.IMPUTER
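As a rough illustration of this kind of data cleanup, the following
plain-Python sketch fills in missing values with a statistic computed from the
values that are present. It is not BigQuery ML's implementation; the helper
names (fit_imputer, impute), the strategy names, and the sample data are
assumptions made for this example.

```python
from collections import Counter
from statistics import mean, median


def fit_imputer(column, strategy="mean"):
    """Compute the replacement value from the non-missing entries."""
    present = [v for v in column if v is not None]
    if strategy == "mean":
        return mean(present)
    if strategy == "median":
        return median(present)
    if strategy == "most_frequent":
        return Counter(present).most_common(1)[0][0]
    raise ValueError(f"unknown strategy: {strategy}")


def impute(column, fill_value):
    """Replace every missing entry (None) with the precomputed fill value."""
    return [fill_value if v is None else v for v in column]


# Numerical column: missing ages are replaced with the mean of the known ages.
ages = [30, None, 40, 50, None]
filled_ages = impute(ages, fit_imputer(ages, "mean"))

# String column: missing cities are replaced with the most frequent city.
cities = ["Paris", None, "Lima", "Paris"]
filled_cities = impute(cities, fit_imputer(cities, "most_frequent"))
```

Mean and median only make sense for numerical columns, while the most-frequent
strategy works for both strings and numbers.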
Numerical functions
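Use the following functions on numerical expressions to regularize data:
- ML.BUCKETIZE
- ML.MAX_ABS_SCALER
- ML.MIN_MAX_SCALER
- ML.NORMALIZER
- ML.POLYNOMIAL_EXPAND
- ML.QUANTILE_BUCKETIZE
- ML.ROBUST_SCALER
- ML.STANDARD_SCALER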
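The arithmetic behind two common ways of regularizing a numerical column,
max-abs scaling and standard (z-score) scaling, can be sketched in plain Python
as follows. This is an illustration under stated assumptions, not BigQuery ML's
implementation: the helper names and data are hypothetical, and the use of the
population standard deviation is a simplification chosen for this example that
may differ from what the BigQuery ML functions do.

```python
from statistics import mean, pstdev


def max_abs_scale(column):
    """Divide by the largest absolute value so results fall in [-1, 1]."""
    largest = max(abs(v) for v in column)
    return [v / largest if largest else 0.0 for v in column]


def standard_scale(column):
    """Center on the mean and divide by the standard deviation (z-score)."""
    # Assumption: population standard deviation, used here for simplicity.
    mu, sigma = mean(column), pstdev(column)
    return [(v - mu) / sigma if sigma else 0.0 for v in column]


max_abs_scaled = max_abs_scale([-4.0, 1.0, 2.0, 8.0])
z_scores = standard_scale([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```

Both transformations need statistics from the whole column (the maximum
absolute value, or the mean and standard deviation) before any single row can be
scaled.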
Categorical functions
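Use the following functions on categorical data:
- ML.FEATURE_CROSS
- ML.HASH_BUCKETIZE
- ML.LABEL_ENCODER
- ML.MULTI_HOT_ENCODER
- ML.ONE_HOT_ENCODER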
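For readers unfamiliar with categorical encodings, the following plain-Python
sketch shows the general ideas of label encoding, one-hot encoding, and feature
crossing. It is a conceptual sketch, not BigQuery ML's implementation: the
helper names, the alphabetical ordering of the vocabulary, the reserved index 0
for unseen values, and the underscore separator are all assumptions made for
this example.

```python
def fit_vocabulary(column):
    """Map each distinct category to an index; 0 is reserved for unseen values."""
    return {category: i for i, category in enumerate(sorted(set(column)), start=1)}


def label_encode(value, vocabulary):
    """Return the integer index of a category, or 0 if it was never seen."""
    return vocabulary.get(value, 0)


def one_hot_encode(value, vocabulary):
    """Return a vector with a single 1 at the category's index."""
    vector = [0] * (len(vocabulary) + 1)
    vector[label_encode(value, vocabulary)] = 1
    return vector


def feature_cross(*values):
    """Combine several categorical values into one composite category."""
    return "_".join(values)


colors = ["red", "green", "blue", "green"]
vocabulary = fit_vocabulary(colors)

labels = [label_encode(c, vocabulary) for c in colors]
unseen = one_hot_encode("purple", vocabulary)
crossed = feature_cross("green", "large")
```

The vocabulary is built once from the training column and then reused, so a
category that first appears at prediction time falls into the reserved slot
instead of shifting the other indexes.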
Text functions
Use the following functions on text string expressions:
- ML.NGRAMS
- ML.BAG_OF_WORDS
- ML.TF_IDF
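Two common text representations, n-grams and bag-of-words counts, can be
sketched in plain Python as follows. This is an illustration only, not
BigQuery ML's implementation; the helper names, the whitespace tokenization, the
default separator, and the output shapes are assumptions made for this example.

```python
from collections import Counter


def ngrams(tokens, low, high, separator=" "):
    """Return every run of low..high consecutive tokens, joined by separator."""
    return [
        separator.join(tokens[i:i + n])
        for n in range(low, high + 1)
        for i in range(len(tokens) - n + 1)
    ]


def bag_of_words(tokens):
    """Count how many times each token occurs, ignoring word order."""
    return dict(Counter(tokens))


# Tokenize by whitespace; real text usually needs more careful tokenization.
tokens = "the cat saw the dog".split()

grams = ngrams(tokens, 1, 2)
counts = bag_of_words(tokens)
```

N-grams preserve some local word order, while bag-of-words counts discard order
entirely and keep only term frequencies.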
Image functions
Use the following functions on image data:
- ML.CONVERT_COLOR_SPACE
- ML.CONVERT_IMAGE_TYPE
- ML.DECODE_IMAGE
- ML.RESIZE_IMAGE
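Known limitations
- BigQuery ML supports both automatic preprocessing and manual preprocessing in
  the model export. See the supported data types and functions for exporting
  models trained with the BigQuery ML TRANSFORM clause.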
Last updated 2025-08-25 UTC.