Apache Iceberg managed tables
Apache Iceberg managed tables (formerly BigLake tables for Apache Iceberg in BigQuery) provide the foundation for building open-format lakehouses on Cloud de Confiance by S3NS. Iceberg managed tables offer the same fully managed experience as standard BigQuery tables, but store data in customer-owned storage buckets. Iceberg managed tables support the open Iceberg table format for better interoperability with open-source and third-party compute engines on a single copy of data.
Iceberg managed tables support the following features:
- Table mutations using GoogleSQL data manipulation language (DML).
- Unified batch and high-throughput streaming using the BigQuery Storage Write API, through connectors for engines such as Spark and Dataflow.
- Export of Iceberg V2 snapshot and automatic refresh on each table mutation for direct query access with open-source and third-party query engines, such as Spark.
- Schema evolution, which lets you add, drop, and rename columns to suit your needs. This feature also lets you change an existing column's data type and mode. For more information, see Conversion rules.
- Automatic storage optimization, including adaptive file sizing, automatic clustering, garbage collection, and metadata optimization.
- Time travel for historical data access in BigQuery.
- Column-level security and data masking.
- Multi-statement transactions (in Preview).
- Table partitioning (in Preview).
- Table creation in Dataform workflows.
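For instance, table mutations use ordinary GoogleSQL DML. The following sketch assumes a hypothetical table `mydataset.mytable` with `status` and `created_at` columns:

```sql
-- Hypothetical example: the table and column names are placeholders
UPDATE mydataset.mytable
SET status = 'archived'
WHERE created_at < TIMESTAMP '2024-01-01';
```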
Architecture
Iceberg managed tables bring the convenience of BigQuery resource management to tables that reside in your own cloud buckets. You can use BigQuery and open-source compute engines on these tables without moving the data out of the buckets that you control. You must configure a Cloud Storage bucket before you start using Iceberg managed tables.
Using Iceberg managed tables has the following implications on your bucket:
- BigQuery creates new data files in the bucket in response to write requests and background storage optimizations, such as DML statements and streaming.
- Automatic compaction and clustering are performed on the data files in the bucket. After the expiration of the time travel window, data files are garbage collected. However, if the table is deleted, the associated data files aren't garbage collected. For more information, see Storage optimization.
Creating an Iceberg managed table is similar to creating BigQuery tables. Because these tables store data in open formats on Cloud Storage, you must do the following:
- Specify the Cloud resource connection with WITH CONNECTION to configure the connection credentials for BigQuery to access Cloud Storage.
- Specify the file format of data storage as PARQUET with the file_format = PARQUET statement.
- Specify the open-source metadata table format as ICEBERG with the table_format = ICEBERG statement.
Best practices
Directly changing or adding files to the bucket outside of BigQuery can lead to data loss or unrecoverable errors. The following table describes possible scenarios:
| Operation | Consequences | Prevention |
|---|---|---|
| Add new files to the bucket outside BigQuery. | Data loss: New files or objects added outside of BigQuery are not tracked by BigQuery. Untracked files are deleted by background garbage collection processes. | Add data exclusively through BigQuery. This lets BigQuery track the files and prevent them from being garbage collected. To prevent accidental additions and data loss, we also recommend restricting external tool write permissions on buckets containing Iceberg managed tables. |
| Create a new Iceberg managed table in a non-empty prefix. | Data loss: Extant data isn't tracked by BigQuery, so these files are considered untracked, and deleted by background garbage collection processes. | Only create new Iceberg managed tables in empty prefixes. |
| Modify or replace Iceberg managed table data files. | Data loss: On external modification or replacement, the table fails a consistency check and becomes unreadable. Queries against the table fail. There is no self-serve way to recover from this point. Contact support for data recovery assistance. | Modify data exclusively through BigQuery. This lets BigQuery track the files and prevent them from being garbage collected. To prevent accidental additions and data loss, we also recommend restricting external tool write permissions on buckets containing Iceberg managed tables. |
| Create two Iceberg managed tables on the same or overlapping URIs. | Data loss: BigQuery doesn't bridge identical URI instances of Iceberg managed tables. Background garbage collection processes for each table will consider the opposite table's files as untracked, and delete them, causing data loss. | Use unique URIs for each Iceberg managed table. |
Cloud Storage bucket configuration best practices
The configuration of your Cloud Storage bucket and its connection with BigQuery have a direct impact on the performance, cost, data integrity, security, and governance of your Iceberg managed tables. The following are best practices to help with this configuration:
- Select a name that clearly indicates that the bucket is only meant for Iceberg managed tables.
- Choose single-region Cloud Storage buckets that are co-located in the same region as your BigQuery dataset. This coordination improves performance and lowers costs by avoiding data transfer charges.
- By default, Cloud Storage stores data in the Standard storage class, which provides sufficient performance. To optimize data storage costs, you can enable Autoclass to automatically manage storage class transitions. Autoclass starts with the Standard storage class and moves objects that aren't accessed to progressively colder classes to reduce storage costs. When the object is read again, it's moved back to the Standard class.
- Enable uniform bucket-level access and public access prevention.
- Verify that the required roles are assigned to the correct users and service accounts.
- To prevent accidental data deletion or corruption in your Cloud Storage bucket, restrict write and delete permissions for most users in your organization. You can do this by setting a bucket permission policy with conditions that deny PUT and DELETE requests for all users, except those that you specify.
- Apply Google-managed or customer-managed encryption keys for extra protection of sensitive data.
- Enable audit logging for operational transparency, troubleshooting, and monitoring data access.
- Keep the default soft delete policy (7-day retention) to protect against accidental deletions. However, if you find that data has been deleted, engage with support rather than restoring objects manually, because objects that are added or modified outside of BigQuery aren't tracked by BigQuery metadata.
- Adaptive file sizing, automatic clustering, and garbage collection are enabled automatically and help with optimizing file performance and cost.
Avoid the following Cloud Storage features, as they are unsupported for Iceberg managed tables:
- Hierarchical namespaces
- Object access control lists (ACLs)
- Customer-supplied encryption keys
- Object versioning
- Object lock
- Bucket lock
- Restoring soft-deleted objects with the BigQuery API or bq CLI
You can implement these best practices by creating your bucket with the following command:
gcloud storage buckets create gs://BUCKET_NAME \
    --project=PROJECT_ID \
    --location=LOCATION \
    --enable-autoclass \
    --public-access-prevention \
    --uniform-bucket-level-access
Replace the following:
- BUCKET_NAME: the name for your new bucket
- PROJECT_ID: the ID of your project
- LOCATION: the location for your new bucket
Iceberg managed table workflows
The following sections describe how to create, load, manage, and query Iceberg managed tables.
Before you begin
Before creating and using Iceberg managed tables, ensure that you have set up a Cloud resource connection to a storage bucket. Your connection needs write permissions on the storage bucket, as specified in the following Required roles section. For more information about required roles and permissions for connections, see Manage connections.
Required roles
To get the permissions that you need to let BigQuery manage tables in your project, ask your administrator to grant you the following IAM roles:
- To create Iceberg managed tables:
  - BigQuery Data Owner (roles/bigquery.dataOwner) on your project
  - BigQuery Connection Admin (roles/bigquery.connectionAdmin) on your project
- To query Iceberg managed tables:
  - BigQuery Data Viewer (roles/bigquery.dataViewer) on your project
  - BigQuery User (roles/bigquery.user) on your project
- Grant the connection service account the following roles so it can read and write data in Cloud Storage:
  - Storage Object User (roles/storage.objectUser) on the bucket
  - Storage Legacy Bucket Reader (roles/storage.legacyBucketReader) on the bucket
For more information about granting roles, see Manage access to projects, folders, and organizations.
These predefined roles contain the permissions required to let BigQuery manage tables in your project. To see the exact permissions that are required, expand the Required permissions section:
Required permissions
The following permissions are required to let BigQuery manage tables in your project:
- bigquery.connections.delegate on your project
- bigquery.jobs.create on your project
- bigquery.readsessions.create on your project
- bigquery.tables.create on your project
- bigquery.tables.get on your project
- bigquery.tables.getData on your project
- storage.buckets.get on your bucket
- storage.objects.create on your bucket
- storage.objects.delete on your bucket
- storage.objects.get on your bucket
- storage.objects.list on your bucket
You might also be able to get these permissions with custom roles or other predefined roles.
Create Iceberg managed tables
To create an Iceberg managed table, select one of the following methods:
SQL
CREATE TABLE [PROJECT_ID.]DATASET_ID.TABLE_NAME (
  COLUMN DATA_TYPE[, ...]
)
CLUSTER BY CLUSTER_COLUMN_LIST
WITH CONNECTION {CONNECTION_NAME | DEFAULT}
OPTIONS (
  file_format = 'PARQUET',
  table_format = 'ICEBERG',
  storage_uri = 'STORAGE_URI');
Replace the following:
- PROJECT_ID: the project containing the dataset. If undefined, the command assumes the default project.
- DATASET_ID: an existing dataset.
- TABLE_NAME: the name of the table you're creating.
- DATA_TYPE: the data type of the information that is contained in the column.
- CLUSTER_COLUMN_LIST (optional): a comma-separated list containing up to four columns. They must be top-level, non-repeated columns.
- CONNECTION_NAME: the name of the connection. For example, myproject.us.myconnection. To use a default connection, specify DEFAULT instead of the connection string containing PROJECT_ID.REGION.CONNECTION_ID.
- STORAGE_URI: a fully qualified Cloud Storage URI. For example, gs://mybucket/table.
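For illustration, a filled-in version of the statement above might look like the following sketch; all identifiers and the bucket path are assumptions:

```sql
-- Hypothetical example: all identifiers and the bucket path are placeholders
CREATE TABLE mydataset.orders (
  order_id INT64,
  customer STRING,
  order_ts TIMESTAMP
)
CLUSTER BY customer
WITH CONNECTION `myproject.us.myconnection`
OPTIONS (
  file_format = 'PARQUET',
  table_format = 'ICEBERG',
  storage_uri = 'gs://mybucket/orders');
```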
bq
bq --project_id=PROJECT_ID mk \
    --table \
    --file_format=PARQUET \
    --table_format=ICEBERG \
    --connection_id=CONNECTION_NAME \
    --storage_uri=STORAGE_URI \
    --schema=COLUMN_NAME:DATA_TYPE[, ...] \
    --clustering_fields=CLUSTER_COLUMN_LIST \
    DATASET_ID.MANAGED_TABLE_NAME
Replace the following:
- PROJECT_ID: the project containing the dataset. If undefined, the command assumes the default project.
- CONNECTION_NAME: the name of the connection. For example, myproject.us.myconnection.
- STORAGE_URI: a fully qualified Cloud Storage URI. For example, gs://mybucket/table.
- COLUMN_NAME: the column name.
- DATA_TYPE: the data type of the information contained in the column.
- CLUSTER_COLUMN_LIST (optional): a comma-separated list containing up to four columns. They must be top-level, non-repeated columns.
- DATASET_ID: the ID of an existing dataset.
- MANAGED_TABLE_NAME: the name of the table you're creating.
API
Call the tables.insert method with a defined table resource, similar to the following:

{
  "tableReference": {
    "tableId": "TABLE_NAME"
  },
  "biglakeConfiguration": {
    "connectionId": "CONNECTION_NAME",
    "fileFormat": "PARQUET",
    "tableFormat": "ICEBERG",
    "storageUri": "STORAGE_URI"
  },
  "schema": {
    "fields": [
      {
        "name": "COLUMN_NAME",
        "type": "DATA_TYPE"
      }
      [, ...]
    ]
  }
}
Replace the following:
- TABLE_NAME: the name of the table that you're creating.
- CONNECTION_NAME: the name of the connection. For example, myproject.us.myconnection.
- STORAGE_URI: a fully qualified Cloud Storage URI. Wildcards are also supported. For example, gs://mybucket/table.
- COLUMN_NAME: the column name.
- DATA_TYPE: the data type of the information contained in the column.
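As a sketch, you might send the table resource to the tables.insert endpoint with curl; the file name `table.json` and the placeholder values are assumptions:

```shell
# Hypothetical sketch: assumes the table resource above is saved as table.json
curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -d @table.json \
    "https://bigquery.googleapis.com/bigquery/v2/projects/PROJECT_ID/datasets/DATASET_ID/tables"
```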
Import data into Iceberg managed tables
The following sections describe how to import data from various table formats into Iceberg managed tables.
Standard load data from flat files
Iceberg managed tables use BigQuery load jobs to load external files into Iceberg managed tables. If you have an existing Iceberg managed table, follow the bq load CLI guide or the LOAD SQL guide to load external data. After loading the data, new Parquet files are written into the STORAGE_URI/data folder.

If the prior instructions are used without an existing Iceberg managed table, a BigQuery table is created instead.
See the following for tool-specific examples of batch loads into Iceberg managed tables:
SQL
LOAD DATA INTO MANAGED_TABLE_NAME
FROM FILES (
  uris=['STORAGE_URI'],
  format='FILE_FORMAT');
Replace the following:
- MANAGED_TABLE_NAME: the name of an existing Iceberg managed table.
- STORAGE_URI: a fully qualified Cloud Storage URI or a comma-separated list of URIs. Wildcards are also supported. For example, gs://mybucket/table.
- FILE_FORMAT: the source table format. For supported formats, see the format row of load_option_list.
bq
bq load \
    --source_format=FILE_FORMAT \
    MANAGED_TABLE_NAME \
    STORAGE_URI
Replace the following:
- FILE_FORMAT: the source table format. For supported formats, see the format row of load_option_list.
- MANAGED_TABLE_NAME: the name of an existing Iceberg managed table.
- STORAGE_URI: a fully qualified Cloud Storage URI or a comma-separated list of URIs. Wildcards are also supported. For example, gs://mybucket/table.
Standard load from Apache Hive-partitioned files
You can load Hive-partitioned files into Iceberg managed tables using standard BigQuery load jobs. For more information, see Loading externally partitioned data.
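For example, a LOAD DATA statement for Hive-partitioned Parquet files might look like the following sketch; the bucket layout, table name, and the hive_partition_uri_prefix value are assumptions:

```sql
-- Hypothetical example: the paths and table name are placeholders
LOAD DATA INTO mydataset.mytable
FROM FILES (
  uris = ['gs://mybucket/events/*'],
  format = 'PARQUET',
  hive_partition_uri_prefix = 'gs://mybucket/events');
```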
Load streaming data from Pub/Sub
You can load streaming data into Iceberg managed tables by using a Pub/Sub BigQuery subscription.
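A minimal sketch of creating such a subscription with the gcloud CLI, assuming an existing topic with a schema and an existing Iceberg managed table (all names are placeholders):

```shell
# Hypothetical names: mysub, mytopic, and the table path are placeholders
gcloud pubsub subscriptions create mysub \
    --topic=mytopic \
    --bigquery-table=myproject:mydataset.mytable \
    --use-topic-schema
```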
Export data from Iceberg managed tables
The following sections describe how to export data from Iceberg managed tables into various table formats.
Export data into flat formats
To export an Iceberg managed table into a flat format, use the EXPORT DATA statement and select a destination format. For more information, see Exporting data.
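As a sketch, the following exports query results to Parquet files in Cloud Storage; the bucket path and table name are assumptions:

```sql
-- Hypothetical example: the bucket path and table name are placeholders
EXPORT DATA
  OPTIONS (
    uri = 'gs://mybucket/exports/*.parquet',
    format = 'PARQUET')
AS (
  SELECT * FROM mydataset.mytable
);
```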
Create Iceberg managed table metadata snapshots
To create an Iceberg managed table metadata snapshot, follow these steps:

1. Export the metadata into the Iceberg V2 format with the EXPORT TABLE METADATA SQL statement.
2. Optional: Schedule Iceberg metadata snapshot refresh. To refresh an Iceberg metadata snapshot based on a set time interval, use a scheduled query.
3. Optional: Enable metadata auto-refresh for your project to automatically update your Iceberg table metadata snapshot on each table mutation. To enable metadata auto-refresh, contact bigquery-tables-for-apache-iceberg-help@google.com.

EXPORT METADATA costs are applied on each refresh operation.
The following example creates a scheduled query named My Scheduled Snapshot Refresh Query using the DDL statement EXPORT TABLE METADATA FROM mydataset.test. The DDL statement runs every 24 hours.

bq query \
    --use_legacy_sql=false \
    --display_name='My Scheduled Snapshot Refresh Query' \
    --schedule='every 24 hours' \
    'EXPORT TABLE METADATA FROM mydataset.test'
View Iceberg managed table metadata snapshot
After you refresh the Iceberg managed table metadata snapshot, you can find the snapshot in the Cloud Storage URI that the Iceberg managed table was originally created in. The /data folder contains the Parquet file data shards, and the /metadata folder contains the Iceberg managed table metadata snapshot.

To find a table's storage URI, query the INFORMATION_SCHEMA.TABLES view:

SELECT
  table_name,
  REGEXP_EXTRACT(ddl, r"storage_uri\s*=\s*\"([^\"]+)\"") AS storage_uri
FROM `mydataset`.INFORMATION_SCHEMA.TABLES;

Note that mydataset and table_name are placeholders for your actual dataset and table.
Read Iceberg managed tables with Spark
The following sample sets up your environment to use Spark SQL with Apache Iceberg, and then executes a query to fetch data from a specified Iceberg managed table.

spark-sql \
    --packages org.apache.iceberg:iceberg-spark-runtime-ICEBERG_VERSION_NUMBER \
    --conf spark.sql.catalog.CATALOG_NAME=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.CATALOG_NAME.type=hadoop \
    --conf spark.sql.catalog.CATALOG_NAME.warehouse='BUCKET_PATH'

# Query the table
SELECT * FROM CATALOG_NAME.FOLDER_NAME;
Replace the following:
- ICEBERG_VERSION_NUMBER: the current runtime version. Download the latest version from Iceberg releases.
- CATALOG_NAME: the catalog to reference your Iceberg managed table.
- BUCKET_PATH: the path to the bucket containing the table files. For example, gs://mybucket/.
- FOLDER_NAME: the folder containing the table files. For example, myfolder.
Modify Iceberg managed tables
To modify an Iceberg managed table, follow the steps shown in Modifying table schemas.
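For example, adding a column uses the standard schema modification syntax; the table and column names here are assumptions:

```sql
-- Hypothetical example: the table and column names are placeholders
ALTER TABLE mydataset.mytable
ADD COLUMN notes STRING;
```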
Use multi-statement transactions
To gain access to multi-statement transactions for Iceberg managed tables, fill out the sign-up form.
Use partitioning
To gain access to partitioning for Iceberg managed tables, fill out the sign-up form.
You partition a table by specifying a partition column, which is used to segment the table. The following column types are supported for Iceberg managed tables:
- DATE
- DATETIME
- TIMESTAMP
Partitioning a table on a DATE, DATETIME, or TIMESTAMP column is known as time-unit column partitioning. You choose whether the partitions have hourly, daily, monthly, or yearly granularity.
Iceberg managed tables also support clustering and combining clustered and partitioned tables.
Partitioning limitations
- All BigQuery partitioned table limitations apply.
- Partitioning column types other than DATE, DATETIME, or TIMESTAMP aren't supported.
- Partition expiration isn't supported.
- Partition evolution isn't supported.
Create a partitioned Iceberg managed table
To create a partitioned Iceberg managed table, follow the instructions to create a standard Iceberg managed table, and include one of the following, depending on your environment:
- The PARTITION BY clause
- The --time_partitioning_field and --time_partitioning_type flags
- The timePartitioning property
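Combining these, a partitioned table creation in SQL might look like the following sketch; all identifiers and the bucket path are assumptions:

```sql
-- Hypothetical example: all identifiers and the bucket path are placeholders
CREATE TABLE mydataset.events (
  event_id INT64,
  event_ts TIMESTAMP
)
PARTITION BY DATE(event_ts)
WITH CONNECTION `myproject.us.myconnection`
OPTIONS (
  file_format = 'PARQUET',
  table_format = 'ICEBERG',
  storage_uri = 'gs://mybucket/events');
```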
Modify and query partitioned Iceberg managed tables
BigQuery data manipulation language (DML) statements and queries for partitioned Iceberg managed tables are the same as for standard Iceberg managed tables. BigQuery automatically scopes the job to the right partitions, similar to Iceberg hidden partitioning. Additionally, any new data that you add to the table is automatically partitioned.
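For example, a filter on the partitioning column lets BigQuery prune partitions automatically; the table and column names in this sketch are assumptions:

```sql
-- Hypothetical example: the table and column names are placeholders
SELECT event_id
FROM mydataset.events
WHERE event_ts >= TIMESTAMP '2024-06-01'
  AND event_ts < TIMESTAMP '2024-07-01';
```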
You can also query partitioned Iceberg managed tables with other engines in the same way as standard Iceberg managed tables. We recommend enabling metadata snapshots for the best experience.
For enhanced security, partitioning information for Iceberg managed tables is decoupled from the data path and is managed entirely by the metadata layer.
Pricing
Iceberg managed table pricing consists of storage, storage optimization, and queries and jobs.
Storage
Iceberg managed tables store all data in Cloud Storage. You are charged for all data stored, including historical table data. Cloud Storage data processing and transfer charges might also apply. Some Cloud Storage operation fees might be waived for operations that are processed through BigQuery or the BigQuery Storage API. There are no BigQuery-specific storage fees. For more information, see Cloud Storage Pricing.
Storage optimization
Iceberg managed tables perform automatic table management, including compaction, clustering, garbage collection, and BigQuery metadata generation and refresh, to optimize query performance and reduce storage costs. Compute resource usage for table management is billed in Data Compute Units (DCUs) over time, in per-second increments. For more details, see Iceberg managed table pricing.
Data export operations taking place while streaming through the Storage Write API are included in Storage Write API pricing and are not charged as background maintenance. For more information, see Data ingestion pricing.
To view the logs and compute usage for these background operations, query the INFORMATION_SCHEMA.JOBS view.
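As a sketch, the following query surfaces recent jobs and their slot usage from that view; the region qualifier and the time filter are assumptions you should adjust for your dataset location:

```sql
-- Hypothetical example: adjust the region qualifier for your dataset location
SELECT
  job_id,
  job_type,
  total_slot_ms
FROM `region-us`.INFORMATION_SCHEMA.JOBS
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
ORDER BY total_slot_ms DESC;
```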
Queries and jobs
Similar to BigQuery tables, you are charged for queries and bytes read (per TiB) if you are using BigQuery on-demand pricing, or slot consumption (per slot hour) if you are using BigQuery capacity compute pricing.
BigQuery pricing also applies to the BigQuery Storage Read API and the Storage Write API.
Load and export operations (such as EXPORT METADATA) use Enterprise edition pay-as-you-go slots. This differs from BigQuery tables, which are not charged for these operations. If PIPELINE reservations with Enterprise or Enterprise Plus slots are available, load and export operations preferentially use these reservation slots instead.
Limitations
Iceberg managed tables have the following limitations:
- Iceberg managed tables don't support renaming operations or ALTER TABLE RENAME TO statements.
- Iceberg managed tables don't support table copies or CREATE TABLE COPY statements.
- Iceberg managed tables don't support table clones or CREATE TABLE CLONE statements.
- Iceberg managed tables don't support table snapshots or CREATE SNAPSHOT TABLE statements.
- Iceberg managed tables don't support the following table schemas:
  - Empty schema
  - Schema with BIGNUMERIC, INTERVAL, JSON, RANGE, or GEOGRAPHY data types
  - Schema with field collations
  - Schema with default value expressions
- Iceberg managed tables don't support the following schema evolution cases:
  - NUMERIC to FLOAT type coercions
  - INT to FLOAT type coercions
  - Adding new nested fields to existing RECORD columns using SQL DDL statements
- Iceberg managed tables display a 0-byte storage size when queried by the console or APIs.
- Iceberg managed tables don't support materialized views.
- Iceberg managed tables don't support authorized views, but column-level access control is supported.
- Iceberg managed tables don't support change data capture (CDC) updates.
- Iceberg managed tables don't support managed disaster recovery.
- Iceberg managed tables don't support row-level security.
- Iceberg managed tables don't support fail-safe windows.
- Iceberg managed tables don't support extract jobs.
- The INFORMATION_SCHEMA.TABLE_STORAGE view doesn't include Iceberg managed tables.
- Iceberg managed tables aren't supported as query result destinations. You can instead use the CREATE TABLE statement with the AS query_statement argument to create a table as the query result destination.
- CREATE OR REPLACE doesn't support replacing standard tables with Iceberg managed tables, or Iceberg managed tables with standard tables.
- Batch loading and LOAD DATA statements only support appending data to existing Iceberg managed tables.
- Batch loading and LOAD DATA statements don't support schema updates.
- TRUNCATE TABLE doesn't support Iceberg managed tables. There are two alternatives:
  - CREATE OR REPLACE TABLE, using the same table creation options.
  - DELETE FROM table WHERE true
- The APPENDS table-valued function (TVF) doesn't support Iceberg managed tables.
- Iceberg metadata might not contain data that was streamed to BigQuery by the Storage Write API within the last 90 minutes.
- Record-based paginated access using tabledata.list doesn't support Iceberg managed tables.
- Only one concurrent mutating DML statement (UPDATE, DELETE, or MERGE) runs at a time for each Iceberg managed table. Additional mutating DML statements are queued.