Migrating permissions from Hadoop
This document describes how you can migrate permissions from Apache Hadoop Distributed File System (HDFS), Ranger HDFS, and Apache Hive into IAM roles in Cloud Storage or BigQuery.
The permissions migration process consists of the following steps:
- Generate a principals mapping file by first creating a principal ruleset YAML configuration file. Then, run the permissions migration tool with the principal ruleset YAML file and the HDFS or Ranger metadata files to generate a principals mapping file.
- Generate a target permissions mapping file by first creating a permissions ruleset YAML file. Then, run the permissions migration tool with the permissions ruleset YAML file, the table mapping configuration files, and the HDFS or Ranger metadata files to generate a target permissions mapping file.
- Run the permissions migration tool with the target permissions file to apply permissions to Cloud Storage or BigQuery. You can also use the provided Python script to generate a Terraform file that you can use to apply permissions on your own.
Before you begin
Before you migrate permissions, verify that you have done the following:
- Install the dwh-migration-dumper tool.
- Run the dwh-migration-dumper tool to generate the necessary metadata for your data source.
You can also find the Terraform generator script in the terraform.zip file inside the release package.
Generate a principals mapping file
A principals mapping file defines mapping rules that map principals from your source to Trusted Cloud IAM principals.
To generate a principals mapping file, you must first manually create a principal ruleset YAML file to define how principals are mapped from your source to Trusted Cloud IAM principals. In the principals ruleset YAML file, define mapping rules for each of your sources: ranger, hdfs, or both.
The following example shows a principals ruleset YAML file that maps Apache Ranger groups to service accounts in Trusted Cloud by S3NS:
ranger:
  user_rules:
    - skip: true
  group_rules:
    # Skip internal Ranger groups.
    - skip: true
      when: "group.groupSource == 0"
    # Map all other groups to Trusted Cloud service accounts.
    - map:
        type:
          value: serviceAccount
        email_address:
          expression: "group.name + '@my-project.s3ns-system.iam.gserviceaccount.com'"
  role_rules:
    - skip: true
hdfs:
  user_rules:
    - skip: true
  group_rules:
    - skip: true
  other_rules:
    - skip: true
The following example shows a principals ruleset YAML file that maps HDFS users to specific Trusted Cloud users:
ranger:
  user_rules:
    - skip: true
  group_rules:
    - skip: true
  role_rules:
    - skip: true
hdfs:
  user_rules:
    # Skip user named 'example'
    - when: "user.name == 'example'"
      skip: true
    # Map all other users to their name at google.com
    - when: "true"
      map:
        type:
          value: user
        email_address:
          expression: "user.name + '@google.com'"
  group_rules:
    - skip: true
  other_rules:
    - skip: true
For more information about the syntax for creating a principals ruleset YAML file, see Ruleset YAML files.
Once you have created a principals ruleset YAML file, upload it to a Cloud Storage bucket. You must also upload the HDFS metadata file, the Apache Ranger metadata file generated by the dwh-migration-dumper tool, or both, depending on which source you are migrating permissions from. You can then run the permissions migration tool to generate the principals mapping file.
The following example shows how you can run the permissions migration tool to migrate from both HDFS and Apache Ranger, resulting in a principals mapping file named principals.yaml:
./dwh-permissions-migration expand \
  --principal-ruleset gs://MIGRATION_BUCKET/principals-ruleset.yaml \
  --hdfs-dumper-output gs://MIGRATION_BUCKET/hdfs-dumper-output.zip \
  --ranger-dumper-output gs://MIGRATION_BUCKET/ranger-dumper-output.zip \
  --output-principals gs://MIGRATION_BUCKET/principals.yaml
Replace MIGRATION_BUCKET with the name of the Cloud Storage bucket that contains your migration files.
Once you've run the tool, inspect the generated principals.yaml file to verify that it contains principals from your source mapped to Trusted Cloud IAM principals. You can edit the file manually before the next steps.
Generate a target permissions file
The target permissions file contains information about the mapping of source permissions set in the Hadoop cluster to IAM roles for BigQuery or Cloud Storage managed folders. To generate a target permissions file, you must first manually create a permissions ruleset YAML file that specifies how permissions from Ranger or HDFS map to Cloud Storage or BigQuery.
The following example accepts all Ranger Hive permissions for Cloud Storage:
gcs:
  ranger_hive_rules:
    - map: {}
      log: true
The following example accepts all HDFS permissions except those of the hadoop principal:
gcs:
  hdfs_rules:
    - when: "source_principal.name == 'hadoop'"
      skip: true
    - map: {}
The following example overrides the default role mapping for the table tab0, and uses defaults for all other permissions:
gcs:
  ranger_hive_rules:
    - when: "table.name == 'tab0'"
      map:
        role:
          value: "roles/customRole"
    - map: {}
For more information about the syntax for creating a permissions ruleset YAML file, see Ruleset YAML files.
Once you have created a permissions ruleset YAML file, upload it to a Cloud Storage bucket. You must also include the HDFS metadata file, the Apache Ranger metadata file generated by the dwh-migration-dumper tool, or both, depending on which source you are migrating permissions from. You must also include the tables mapping configuration files and the principals mapping file. You can then run the permissions migration tool to generate the target permissions file.
The following example shows how you can run the permissions migration tool to migrate from both HDFS and Apache Ranger, with the tables mapping configuration files and the principals mapping file named principals.yaml, resulting in a target permissions file named permissions.yaml:
./dwh-permissions-migration build \
  --permissions-ruleset gs://MIGRATION_BUCKET/permissions-config.yaml \
  --tables gs://MIGRATION_BUCKET/tables/ \
  --principals gs://MIGRATION_BUCKET/principals.yaml \
  --ranger-dumper-output gs://MIGRATION_BUCKET/ranger-dumper-output.zip \
  --hdfs-dumper-output gs://MIGRATION_BUCKET/hdfs-dumper-output.zip \
  --output-permissions gs://MIGRATION_BUCKET/permissions.yaml
Replace MIGRATION_BUCKET with the name of the Cloud Storage bucket that contains your migration files.
Once you've run the tool, inspect the generated permissions.yaml file to verify that it contains permissions from your source mapped to Cloud Storage or BigQuery IAM bindings. You can edit the file manually before the next steps.
Apply permissions
Once you have generated a target permissions file, you can then run the permissions migration tool to apply the IAM permissions to Cloud Storage or BigQuery.
Before you run the permissions migration tool, verify that you have met the following prerequisites:
- You have created the required principals (users, groups, service accounts) in Trusted Cloud.
- You have created the Cloud Storage managed folders or tables that will host the migrated data.
- The user running the tool has permissions to manage roles for the Cloud Storage managed folders or tables.
You can apply permissions by running the following command:
./dwh-permissions-migration apply \
  --permissions gs://MIGRATION_BUCKET/permissions.yaml
Replace MIGRATION_BUCKET with the name of the Cloud Storage bucket that contains your migration files.
Apply permissions as a Terraform configuration
To apply the migrated permissions, you can also convert the target permissions file into a Terraform Infrastructure-as-Code (IaC) configuration and apply it to Cloud Storage.
- Verify that you have Python 3.7 or higher.
- Create a new virtual environment and activate it.
- From the permissions-migration/terraform directory, install the dependencies from the requirements.txt file using the following command:
  python -m pip install -r requirements.txt
- Run the generator command:
  python tf_generator PATH LOCATION OUTPUT
  Replace the following:
  - PATH: the path to the generated permissions.yaml file.
  - LOCATION: the location of your Cloud Storage bucket where the script checks and creates folders based on the permission configuration.
  - OUTPUT: the path to the output file, main.tf.
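After the script finishes, you can review the generated main.tf file and apply it using the standard Terraform workflow (terraform init, terraform plan, and terraform apply).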
Ruleset YAML files
Ruleset YAML files are used to map principals and roles when migrating permissions from HDFS or Apache Ranger to Trusted Cloud. Ruleset YAML files use Common Expression Language (CEL) for specifying predicates (where the result is boolean) and expressions (where the result is string).
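For example, the rulesets later in this document use both kinds of CEL: a predicate in a when clause, and a string expression in an email_address field. The following fragments are taken from those examples:

when: "group.groupSource == 0"             # predicate: evaluates to true or false
expression: "user.name + '@google.com'"    # expression: evaluates to a string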
Ruleset YAML files have the following characteristics:
- Mapping rules of each type are executed sequentially from top to bottom for each input object (see the sketch after this list).
- CEL expressions have access to context variables, and context variables depend on the section of the ruleset. For example, you can use the user variable to map source user objects, and you can use the group variable to map groups.
- You can use CEL expressions or static values to change default values. For example, when mapping a group, you can override the output value type from the default value group to another value like serviceAccount.
- There must be at least one rule that matches every input object.
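The following sketch illustrates the first and third points using the ranger group rules syntax described later in this document; the group name and domains are hypothetical. Rules are tried top to bottom, and the final catch-all rule guarantees that every group matches at least one rule:

ranger:
  group_rules:
    # Tried first: map the 'etl' group (hypothetical name) to a service account,
    # overriding the default type value of group.
    - when: "group.name == 'etl'"
      map:
        type:
          value: serviceAccount
        email_address:
          expression: "group.name + '@my-project.s3ns-system.iam.gserviceaccount.com'"
    # Tried last: catch-all rule so that every remaining group matches.
    - map:
        type:
          value: group
        email_address:
          expression: "group.name + '@example.com'"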
In an HDFS or Apache Ranger permissions migration, a ruleset YAML file can be used to define either a principals mapping file or a target permissions file.
Mapping rules in ruleset YAML files
The ruleset YAML file consists of mapping rules that specify how objects match from your source to your target during a permissions migration. A mapping rule can contain the following sections or clauses:
- when: A predicate clause that limits the applicability of the rule.
  - A string that represents a boolean CEL expression. Values can be true or false.
  - The rule applies only if the when clause evaluates to true.
  - Default value is true.
- map: A clause that specifies the contents of a result field. The value for this clause depends on the type of object processed and can be defined as:
  - expression to evaluate as a string
  - value for a constant string
- skip: Specifies that the input object shouldn't be mapped.
  - Can be either true or false.
- log: A predicate clause that helps debug or develop rules.
  - A string that represents a boolean CEL expression. Values can be true or false.
  - Default value is false.
  - If set to true, the output contains an execution log that can be used to monitor or diagnose issues in the execution.
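As a minimal sketch of how these clauses combine (reusing the hdfs user rules from the examples in this document, and assuming the log clause is accepted in principal rules as it is in permission rules):

hdfs:
  user_rules:
    # when limits the rule, and skip drops the matching input object.
    - when: "user.name == 'example'"
      skip: true
    # log: true adds an execution log entry for objects this rule processes.
    - when: "true"
      log: true
      map:
        type:
          value: user
        email_address:
          expression: "user.name + '@google.com'"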
Creating a principal ruleset YAML file
A principal ruleset YAML file is used to generate principal identifiers by providing a value for email_address and type.
- Use email_address to specify the email for the Trusted Cloud principal.
- Use type to specify the nature of the principal in Trusted Cloud. The value for type can be user, group, or serviceAccount.
Any CEL expression used in the rules has access to variables which represent the processed object. The fields in the variables are based on the contents of the HDFS or Apache Ranger metadata files. The available variables depend on the section of the ruleset:
- For user_rules, use the variable user
- For group_rules, use the variable group
- For other_rules, use the variable other
- For role_rules, use the variable role
The following example maps users from HDFS to users in Trusted Cloud with their username, followed by @google.com, as their email address:
hdfs:
  user_rules:
    # Skip user named 'example'
    - when: "user.name == 'example'"
      skip: true
    # Map all other users to their name at google.com
    - when: "true"
      map:
        type:
          value: user
        email_address:
          expression: "user.name + '@google.com'"
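The other rule sections follow the same pattern. For example, the following hedged sketch maps Ranger roles to Trusted Cloud groups; it assumes role objects expose a name field, and the prefix and domain are hypothetical:

ranger:
  role_rules:
    # Map each Ranger role to a Trusted Cloud group (hypothetical domain).
    - map:
        type:
          value: group
        email_address:
          expression: "'ranger-role-' + role.name + '@example.com'"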
Override default principal mapping
To use non-default principals, you can either skip or modify the default principal mappings using the ruleset files.
The following example shows how you can skip a section of rules:
hdfs:
  user_rules:
    - skip: true
  group_rules:
    - skip: true
  other_rules:
    - skip: true
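The following sketch modifies a mapping instead of skipping it, sending every HDFS group to a Trusted Cloud group; it assumes HDFS group rules accept the same map fields as user rules, and the domain is hypothetical:

hdfs:
  group_rules:
    # Map each HDFS group to a Trusted Cloud group (hypothetical domain).
    - map:
        type:
          value: group
        email_address:
          expression: "group.name + '@example.com'"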
Creating a permissions ruleset YAML file
A permissions ruleset YAML file is used to generate a target permissions mapping file. To create a permissions ruleset YAML file, use CEL expressions in your permissions ruleset YAML to map HDFS or Apache Ranger permissions to Cloud Storage or BigQuery roles.
Default role mapping
HDFS file roles are determined by source file permissions:
- If the w bit is set, then the default role is writer
- If the r bit is set, then the default role is reader
- If neither bit is set, then the role is empty
For example, for a file with mode rw-r-----, the owner's bits (rw) yield the writer role and the group's bits (r--) yield the reader role.
Ranger HDFS:
- If the access set contains write, then the default role is writer
- If the access set contains read, then the default role is reader
- If the access set contains neither, then the role is empty
Ranger:
- If the access set contains update, create, drop, alter, index, lock, all, write, or refresh, then the default role is writer
- If the access set contains select or read, then the default role is reader
- If the access set contains none of the preceding permissions, then the role is empty
Cloud Storage:
- Writer: roles/storage.objectUser
- Reader: roles/storage.objectViewer
BigQuery:
- Writer: roles/bigquery.dataOwner
- Reader: roles/bigquery.dataViewer
The following example shows how you can accept default mappings without any changes in the ruleset YAML file:
ranger_hdfs_rules:
  - map: {}
Override default role mapping
To use non-default roles, you can either skip or modify the default role mappings using the ruleset files.
The following example shows how you can override a default role mapping by using a map clause with a role field that contains a value clause:
ranger_hdfs_rules:
  - map:
      role:
        value: "roles/customRole"
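You can also combine a when clause with a map clause to override the role only for specific matches. The following sketch grants a custom role to one principal and falls through to the defaults for everything else; the group name is hypothetical, and it assumes the source_principal variable shown in the earlier HDFS example is also available for these rules:

ranger_hdfs_rules:
  # Grant a custom role when the source principal is the 'analysts' group.
  - when: "source_principal.name == 'analysts'"
    map:
      role:
        value: "roles/customRole"
  # Use the default mapping for all other permissions.
  - map: {}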
Merging permission mappings
If multiple permission mappings are generated for the same targeted resource, the mapping with the widest access is used. For example, if an HDFS rule gives a reader role to principal pa1 on an HDFS location, and a Ranger rule gives a writer role to the same principal on the same location, then the writer role is assigned.
String quoting in CEL expressions
Use quotation marks "" to wrap the entire CEL expression in YAML. Within the CEL expression, use single quotes '' for quoting strings. For example:
"'permissions-migration-' + group.name + '@google.com'"