Migrating data and metadata from Hadoop

You can use the BigQuery Data Transfer Service for Hadoop connector to migrate data and metadata from your Hadoop clusters to Trusted Cloud.

The following diagram provides an overview of the Hadoop data and metadata migration process.

Figure: Overview of a Hadoop table and metadata migration to BigQuery.

The following sections show how you can migrate your Hadoop data and metadata to BigQuery.

Limitations

Hadoop transfers are subject to the following limitations:

  • Metadata migration is limited to BigLake Iceberg tables for BigLake Metastore. Apache Hive table migration to BigLake Metastore is not supported.
  • The Hadoop connector is only supported with the bq command-line tool.

Before you begin

Before you schedule a Hadoop transfer, you must complete the following tasks:

Create a Cloud Storage bucket for migrated files

Create a Cloud Storage bucket that will be the destination for your migrated Hadoop files. This bucket is referred to in this document as MIGRATION_BUCKET.
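
For example, you can create the bucket with the gcloud CLI. MIGRATION_BUCKET and LOCATION are placeholders for your own bucket name and location:

  gcloud storage buckets create gs://MIGRATION_BUCKET --location=LOCATION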

Required files

You must have the following migration files in a Cloud Storage bucket before you can schedule a Hadoop transfer:

  • The extracted metadata file (hive-dumper-output.zip)
  • The translation configuration YAML file (*.config.yaml)
  • The tables mapping YAML files

The following sections describe how to create these files.

hive-dumper-output.zip

Run the dwh-migration-dumper tool to extract metadata for Apache Hive. The tool generates a file named hive-dumper-output.zip in a Cloud Storage bucket, referred to in this document as DUMPER_BUCKET.
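
The following command is a sketch of the metadata extraction step, assuming a Hive metastore reachable at HIVE_METASTORE_HOST (a placeholder). The exact connector flags depend on your Hive deployment and the tool version, so check the dwh-migration-dumper documentation for the full set:

  # Illustrative only: extract Hive metadata and write the dump to the dumper bucket.
  dwh-migration-dumper \
      --connector hiveql \
      --host HIVE_METASTORE_HOST \
      --port 9083 \
      --output gs://DUMPER_BUCKET/hive-dumper-output.zip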

Translation configuration YAML file

Create a translation configuration YAML file with a name that ends in .config.yaml (for example, translation.config.yaml), and upload it to the same bucket that contains hive-dumper-output.zip. Configure the file to map HDFS paths to Cloud Storage managed folders, similar to the following example:

type: object_rewriter
relation:
- match:
    relationRegex: ".*"
  external:
    location_expression: "'gs://MIGRATION_BUCKET/' + table.schema + '/' + table.name"

Replace MIGRATION_BUCKET with the name of the Cloud Storage bucket that is the destination for your migrated files.

The location_expression field is a Common Expression Language (CEL) expression.

For more information about this configuration YAML, see Guidelines to create a configuration YAML file.
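
For example, after you create translation.config.yaml, you can upload it to the dumper bucket with the gcloud CLI:

  gcloud storage cp translation.config.yaml gs://DUMPER_BUCKET/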

Generate tables mapping YAML files

To generate a tables mapping YAML file, run the following command:

  curl -d '{
    "tasks": {
      "string": {
        "type": "HiveQL2BigQuery_Translation",
        "translation_details": {
          "target_base_uri": "TRANSLATION_OUTPUT_BUCKET",
          "source_target_mapping": {
            "source_spec": {
              "base_uri": "DUMPER_BUCKET"
            }
          },
          "target_types": ["dts-mapping", "metadata"]
        }
      }
    }
  }' \
  -H "Content-Type:application/json" \
  -H "Authorization: Bearer TOKEN" -X POST https://bigquerymigration.googleapis.com/v2alpha/projects/PROJECT_ID/locations/LOCATION/workflows

Replace the following:

  • TRANSLATION_OUTPUT_BUCKET: the base URI of the Cloud Storage bucket that will contain the tables mapping YAML files. For example, gs://output_bucket/tables/.
  • DUMPER_BUCKET: the base URI of the Cloud Storage bucket that contains the hive-dumper-output.zip file and the translation configuration YAML file.
  • TOKEN: the OAuth token. You can generate this in the command line with the command gcloud auth print-access-token.
  • PROJECT_ID: the project to process the translation.
  • LOCATION: the location where the job is processed. For example, eu or us.

When you run this command, the translation service API returns a WORKFLOW_ID and starts an asynchronous background job. You can monitor the status of this job using the following command:

  curl \
  -H "Content-Type:application/json" \
  -H "Authorization:Bearer TOKEN" -X GET https://bigquerymigration.googleapis.com/v2alpha/projects/PROJECT_ID/locations/LOCATION/workflows/WORKFLOW_ID

When the job completes, your tables mapping YAML files are created in TRANSLATION_OUTPUT_BUCKET. The output might consist of several mapping files, one for each table, stored in a Cloud Storage folder.
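
For example, you can confirm that the mapping files were generated by listing the output bucket. The exact folder layout, such as the metadata/config/ prefix per database used later in this document, depends on your source databases:

  gcloud storage ls --recursive gs://TRANSLATION_OUTPUT_BUCKET/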

Enable APIs

Enable the following APIs in your Trusted Cloud project:

  • Data Transfer API
  • Storage Transfer API

A service agent is created when you enable the Data Transfer API.
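
For example, you can enable both APIs with the gcloud CLI. The service names shown are the standard ones; confirm the names that apply to your Trusted Cloud by S3NS project:

  gcloud services enable bigquerydatatransfer.googleapis.com storagetransfer.googleapis.com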

Configure permissions

  1. Create a service account and grant it the BigQuery Admin role (roles/bigquery.admin). This service account is used to create the transfer configuration.
  2. Grant the following roles to the service agent (P4SA) that is created when you enable the Data Transfer API (see the example command after this list):
    • roles/metastore.metadataOwner
    • roles/storagetransfer.admin
    • roles/serviceusage.serviceUsageConsumer
    • roles/storage.objectViewer
      • If you are migrating metadata for BigLake Iceberg tables, grant it the roles/storage.objectAdmin and roles/bigquery.admin roles instead of roles/storage.objectViewer.
  3. Grant the service agent the roles/iam.serviceAccountTokenCreator role with the following command:

    gcloud iam service-accounts add-iam-policy-binding SERVICE_ACCOUNT \
        --member serviceAccount:service-PROJECT_NUMBER@gcp-sa-bigquerydatatransfer.s3ns-system.iam.gserviceaccount.com \
        --role roles/iam.serviceAccountTokenCreator
  4. Verify that the user or service account running the Storage Transfer Service agent on-premises is granted the following roles:

    • roles/pubsub.editor
    • roles/storage.admin
    • roles/storagetransfer.serviceAgent
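
For example, you can grant one of the service agent roles from step 2 with a project-level IAM binding similar to the following. PROJECT_ID and PROJECT_NUMBER are placeholders for your own values:

  gcloud projects add-iam-policy-binding PROJECT_ID \
      --member="serviceAccount:service-PROJECT_NUMBER@gcp-sa-bigquerydatatransfer.s3ns-system.iam.gserviceaccount.com" \
      --role="roles/storagetransfer.admin"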

Configure your Storage Transfer Agent

To set up the storage transfer agent required for a Hadoop transfer, do the following:

  1. Install Docker on on-premises agent machines.
  2. Create a Storage Transfer Service agent pool in your Trusted Cloud by S3NS project.
  3. Install agents on your on-premises agent machines, as shown in the example after this list.
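
For example, with the gcloud CLI you can create the agent pool and install agents with commands similar to the following. The flags shown are illustrative; see the Storage Transfer Service documentation for the full prerequisites, such as agent credentials:

  # Create an agent pool in your project.
  gcloud transfer agent-pools create AGENT_POOL_NAME

  # Install and start transfer agents (Docker containers) on the on-premises machine.
  gcloud transfer agents install --pool=AGENT_POOL_NAME --count=3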

Schedule a Hadoop transfer

To schedule a Hadoop transfer, enter the bq mk command and supply the transfer creation flag --transfer_config:

  bq mk --transfer_config \
  --data_source=hadoop \
  --display_name='TRANSFER_NAME' \
  --service_account='SERVICE_ACCOUNT' \
  --project_id='PROJECT_ID' \
  --location='REGION' \
  --params='{"table_name_patterns":"LIST_OF_TABLES",
    "agent_pool_name":"AGENT_POOL_NAME",
    "destination_dataproc_metastore":"DATAPROC_METASTORE",
    "translation_output_gcs_path":"gs://TRANSLATION_OUTPUT_BUCKET/metadata/config/default_database/",
    "table_metadata_path":"gs://DUMPER_BUCKET/hive-dumper-output.zip"}'

Replace the following:

  • TRANSFER_NAME: the display name for the transfer configuration. The transfer name can be any value that lets you identify the transfer if you need to modify it later.
  • SERVICE_ACCOUNT: the service account name used to authenticate your transfer. The service account should be owned by the same project_id used to create the transfer and it should have all of the required permissions.
  • PROJECT_ID: your Trusted Cloud by S3NS project ID. If --project_id isn't supplied to specify a particular project, the default project is used.
  • REGION: location of this transfer configuration.
  • LIST_OF_TABLES: a list of entities to be transferred. Use the hierarchical naming specification database.table. This field supports RE2 regular expressions to specify tables. For example:
    • db1..*: specifies all tables in the db1 database
    • db1.table1;db2.table2: a list of tables
  • AGENT_POOL_NAME: the name of the agent pool used for creating agents.
  • DATAPROC_METASTORE: the destination Dataproc Metastore for managed OSS destination. To use BigLake Metastore instead, you can omit this field from this transfer configuration. For more information about using BigLake Metastore to migrate metadata, see Metadata migration.

Run this command to create the transfer configuration and start the Hadoop transfer. Hadoop transfers are scheduled to run every 24 hours by default, but can be configured with transfer scheduling options.

When the transfer is complete, your Hadoop data is migrated to MIGRATION_BUCKET.

Data ingestion options

The following sections provide more information about how you can configure your Hadoop transfers.

Metadata migration

Metadata can be migrated to either Dataproc Metastore or BigLake Metastore with the underlying data stored in Cloud Storage.

To transfer metadata to Dataproc Metastore, specify the URL to your metastore in the destination_dataproc_metastore field.

To transfer metadata to BigLake Metastore, you don't need to specify the destination_dataproc_metastore field in your transfer configuration. The system automatically determines the destination BigQuery dataset from the targetName field within the generated YAML mapping files.

The targetName field is formatted as a two-part identifier, for example, bigquery_dataset_name.target_table_name. By default, the first part (bigquery_dataset_name in this example) is the name of the source database. You must manually create a BigQuery dataset with this source database name before running the transfer.
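
For example, if the source database is named my_hive_db, you can create the matching dataset with the bq command-line tool before you run the transfer. The dataset location generally needs to match the location you use for the transfer:

  bq mk --dataset --location=LOCATION PROJECT_ID:my_hive_db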

To use another BigQuery dataset, you must provide an additional configuration YAML file (suffixed with config.yaml) in the DUMPER_BUCKET containing an object rewriter ruleset and then generate the translation mappings. The following example is a ruleset that maps the source database named my_hive_db to a BigQuery dataset named my_bq_dataset:

relation:
  - match:
      schema: my_hive_db
    outputName:
      schema: my_bq_dataset

The schema parameter must correspond to the BigQuery dataset name and the relation parameter must correspond to the table name. The database parameter must also be set to null. For more information, see Output name mapping.

Incremental transfers

When a transfer configuration is set up with a recurring schedule, every subsequent transfer updates the table on Trusted Cloud by S3NS with the latest updates made to the source table. For example, all insert, delete, or update operations, along with schema changes, are reflected in Trusted Cloud by S3NS with each transfer.

Transfer scheduling options

By default, Hadoop transfers are scheduled to run every 24 hours. To configure how often transfers are run, add the --schedule flag to the transfer configuration, and specify a transfer schedule using the schedule syntax. Hadoop transfers must have a minimum of 24 hours between transfer runs.

For one-time transfers, you can add the end_time flag to the transfer configuration to only run the transfer once.
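
For example, you can change the schedule of an existing transfer configuration with the bq update command, where RESOURCE_NAME is the full resource name of the transfer configuration, such as projects/PROJECT_ID/locations/LOCATION/transferConfigs/CONFIG_ID:

  bq update --transfer_config --schedule='every 48 hours' RESOURCE_NAME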

Monitor Hadoop transfers

Once you have scheduled a Hadoop transfer, you can monitor the transfer job with bq command-line tool commands. For information about monitoring your transfer jobs, see View your transfers.

Track table migration status

You can run the dwh-dts-status tool to monitor the status of all transferred tables within a transfer configuration or a particular database. You can also use the tool to list all transfer configurations in a project.

Before you begin

Before you can use the dwh-dts-status tool, do the following:

  1. Get the dwh-dts-status tool by downloading the dwh-migration-tool package from the dwh-migration-tools GitHub repository.

  2. Authenticate your account to Trusted Cloud by S3NS with the following command:

    gcloud auth application-default login
    

    For more information, see How Application Default Credentials work.

  3. Verify that the user has the bigquery.admin and logging.viewer roles. For more information about IAM roles, see Access control reference.

List all transfer configurations in a project

To list all transfer configurations in a project, use the following command:

  ./dwh-dts-status --list-transfer-configs --project-id=[PROJECT_ID] --location=[LOCATION]

Replace the following:

  • PROJECT_ID: the Trusted Cloud by S3NS project ID that is running the transfers.
  • LOCATION: the location where the transfer configuration was created.

This command outputs a table with a list of transfer configuration names and IDs.

View statuses of all tables in a configuration

To view the status of all tables included in a transfer configuration, use the following command:

  ./dwh-dts-status --list-status-for-config --project-id=[PROJECT_ID] --config-id=[CONFIG_ID] --location=[LOCATION]

Replace the following:

  • PROJECT_ID: the Trusted Cloud by S3NS project ID that is running the transfers.
  • LOCATION: the location where the transfer configuration was created.
  • CONFIG_ID: the ID of the specified transfer configuration.

This command outputs a table with a list of tables, and their transfer status, in the specified transfer configuration. The transfer status can be one of the following values: PENDING, RUNNING, SUCCEEDED, FAILED, CANCELLED.

View statuses of all tables in a database

To view the status of all tables transferred from a specific database, use the following command:

  ./dwh-dts-status --list-status-for-database --project-id=[PROJECT_ID] --database=[DATABASE]

Replace the following:

  • PROJECT_ID: the Trusted Cloud by S3NS project ID that is running the transfers.
  • DATABASE: the name of the specified database.

This command outputs a table with a list of tables, and their transfer status, in the specified database. The transfer status can be one of the following values: PENDING, RUNNING, SUCCEEDED, FAILED, CANCELLED.