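This page shows you how to investigate and resolve issues related to GKE logging.
Cluster logs are missing in Cloud Logging
Verify that logging is enabled in the project
List the enabled services: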
gcloud services list --enabled --filter="NAME=logging.googleapis.com"
The following output indicates that logging is enabled for the project:
NAME                    TITLE
logging.googleapis.com  Cloud Logging API
Optional: Check the logs in the Logs Viewer to determine who disabled the API and when:
protoPayload.methodName="google.api.serviceusage.v1.ServiceUsage.DisableService"
protoPayload.response.services="logging.googleapis.com"
If logging is disabled, enable it:
gcloud services enable logging.googleapis.com
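The check-and-enable flow above can be sketched as a small shell helper. This is a minimal sketch, not an official tool: the function name logging_api_listed is hypothetical, and it assumes that its input is the tabular output of the gcloud services list command shown earlier.

```shell
#!/bin/bash
# Hypothetical helper: succeeds if the Cloud Logging API appears in the
# output of `gcloud services list --enabled`, which is read from stdin.
logging_api_listed() {
  # Compare the whole first column to avoid partial matches.
  awk '$1 == "logging.googleapis.com" { found = 1 } END { exit !found }'
}

# Sample input in the format shown above (captured text, not a live call).
sample_output='NAME                    TITLE
logging.googleapis.com  Cloud Logging API'

if printf '%s\n' "$sample_output" | logging_api_listed; then
  echo "Cloud Logging API is enabled"
else
  echo "Cloud Logging API is disabled"
fi

# In practice the input would come from the command shown above, for example:
#   gcloud services list --enabled --filter="NAME=logging.googleapis.com" \
#     | logging_api_listed || gcloud services enable logging.googleapis.com
```

Reading from stdin keeps the helper independent of how the service list was obtained, so it can be tried on saved output before it is wired to a live project.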
Verify that logging is enabled on the cluster
List the clusters:
gcloud container clusters list \
    --project=PROJECT_ID \
    '--format=value(name,loggingConfig.componentConfig.enableComponents)' \
    --sort-by=name | column -t
Replace the following:
PROJECT_ID: your Trusted Cloud by S3NS project ID.
The output is similar to the following:
cluster-1  SYSTEM_COMPONENTS
cluster-2  SYSTEM_COMPONENTS;WORKLOADS
cluster-3
If the value for a cluster is empty, logging is disabled for that cluster. For example, cluster-3 in this output has logging disabled.
If logging is disabled or set to NONE, enable cluster logging:
gcloud container clusters update CLUSTER_NAME \
    --logging=SYSTEM,WORKLOAD \
    --location=COMPUTE_LOCATION
Replace the following:
CLUSTER_NAME: the name of the cluster.
COMPUTE_LOCATION: the Compute Engine location of the cluster.
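To scan many clusters at once, the listing output can be filtered for clusters whose logging component list is empty. The following is a minimal sketch under stated assumptions: the function name clusters_without_logging is hypothetical, the input is the two-column output of the list command above, and a literal NONE in the second column is treated as disabled.

```shell
#!/bin/bash
# Hypothetical helper: reads lines of "<cluster> [<components>]" from stdin
# and prints the names of clusters whose component list is empty or NONE.
clusters_without_logging() {
  # A line with a single field has no logging components configured.
  awk 'NF == 1 || $2 == "NONE" { print $1 }'
}

# Sample input copied from the example output above.
sample_output='cluster-1  SYSTEM_COMPONENTS
cluster-2  SYSTEM_COMPONENTS;WORKLOADS
cluster-3'

printf '%s\n' "$sample_output" | clusters_without_logging
# Prints: cluster-3
```

Each name that the helper prints is a candidate for the gcloud container clusters update command shown above.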
Verify that nodes in the node pools have the Cloud Logging access scope
Nodes need one of the following scopes to write logs to Cloud Logging:
https://www.googleapis.com/auth/logging.write
https://www.googleapis.com/auth/cloud-platform
https://www.googleapis.com/auth/logging.admin
Check the scopes configured on each node pool in the cluster:
gcloud container node-pools list --cluster=CLUSTER_NAME \
    --format="table(name,config.oauthScopes)" \
    --location=COMPUTE_LOCATION
Replace the following:
CLUSTER_NAME: the name of the cluster.
COMPUTE_LOCATION: the Compute Engine location of the cluster.
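The scope check can also be automated. The following is a minimal sketch, assuming that the scopes of one node pool are available as a semicolon-separated string (the form that the value() format uses for lists, as in the cluster listing output earlier on this page). The function name has_logging_scope is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: succeeds if any argument is one of the scopes that
# let nodes write logs to Cloud Logging (the three scopes listed above).
has_logging_scope() {
  local scope
  for scope in "$@"; do
    case "$scope" in
      https://www.googleapis.com/auth/logging.write) return 0 ;;
      https://www.googleapis.com/auth/cloud-platform) return 0 ;;
      https://www.googleapis.com/auth/logging.admin) return 0 ;;
    esac
  done
  return 1
}

# Assumed sample value for one node pool, for illustration only.
scopes='https://www.googleapis.com/auth/devstorage.read_only;https://www.googleapis.com/auth/logging.write'

# Split the semicolon-separated string into an array of scopes.
IFS=';' read -r -a scope_list <<< "$scopes"

if has_logging_scope "${scope_list[@]}"; then
  echo "logging scope present"
else
  echo "logging scope missing"
fi
```

Exact string comparison is used on purpose, so that a scope that merely contains one of the required URLs as a substring does not count as a match.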
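If a node pool has none of the required scopes, create a new node pool with the correct logging scope:
gcloud container node-pools create NODE_POOL_NAME \
    --cluster=CLUSTER_NAME \
    --location=COMPUTE_LOCATION \
    --scopes="gke-default"
Replace the following:
NODE_POOL_NAME: the name of the new node pool.
CLUSTER_NAME: the name of the cluster.
COMPUTE_LOCATION: the Compute Engine location of the cluster.
Then migrate your workloads from the old node pool to the newly created node pool, and monitor the progress.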
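Identify clusters with node service accounts that are missing critical permissions
To identify clusters with node service accounts that are missing critical permissions, use GKE recommendations of the NODE_SA_MISSING_PERMISSIONS recommender subtype:
- Use the Trusted Cloud console. Go to the Kubernetes clusters page. In the Notifications column for specific clusters, check for the Grant critical permissions recommendation.
- Use the gcloud CLI or the Recommender API, specifying the NODE_SA_MISSING_PERMISSIONS recommender subtype. To query for this recommendation, run the following command:
gcloud recommender recommendations list \
    --recommender=google.container.DiagnosisRecommender \
    --location LOCATION \
    --project PROJECT_ID \
    --format yaml \
    --filter="recommenderSubtype:NODE_SA_MISSING_PERMISSIONS"
To implement this recommendation, grant the roles/container.defaultNodeServiceAccount role to the node's service account.
You can run a script that searches the node pools in your project's Standard and Autopilot clusters for any node service accounts that don't have the required permissions for GKE. This script uses the gcloud CLI and the jq utility: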
#!/bin/bash
# Set your project ID
project_id=PROJECT_ID
project_number=$(gcloud projects describe "$project_id" --format="value(projectNumber)")
declare -a all_service_accounts
declare -a sa_missing_permissions
# Function to check if a service account has a specific permission
# $1: project_id
# $2: service_account
# $3: permission
service_account_has_permission() {
  local project_id="$1"
  local service_account="$2"
  local permission="$3"
  local roles=$(gcloud projects get-iam-policy "$project_id" \
    --flatten="bindings[].members" \
    --format="table[no-heading](bindings.role)" \
    --filter="bindings.members:\"$service_account\"")
  for role in $roles; do
    if role_has_permission "$role" "$permission"; then
      echo "Yes" # Has permission
      return
    fi
  done
  echo "No" # Does not have permission
}
# Function to check if a role has the specific permission
# $1: role
# $2: permission
role_has_permission() {
  local role="$1"
  local permission="$2"
  gcloud iam roles describe "$role" --format="json" | \
    jq -r ".includedPermissions" | \
    grep -q "$permission"
}
# Function to add $1 into the service account array all_service_accounts
# $1: service account
add_service_account() {
  local service_account="$1"
  all_service_accounts+=( "${service_account}" )
}
# Function to add service accounts into the global array all_service_accounts for a Standard GKE cluster
# $1: project_id
# $2: location
# $3: cluster_name
add_service_accounts_for_standard() {
  local project_id="$1"
  local cluster_location="$2"
  local cluster_name="$3"
  # --location is used instead of --zone, consistent with the other commands
  # on this page; it accepts both zones and regions.
  while read -r nodepool; do
    nodepool_name=$(echo "$nodepool" | awk '{print $1}')
    if [[ "$nodepool_name" == "" ]]; then
      # skip the empty line which is from running `gcloud container node-pools list` in GCP console
      continue
    fi
    while read -r nodepool_details; do
      service_account=$(echo "$nodepool_details" | awk '{print $1}')
      if [[ "$service_account" == "default" ]]; then
        service_account="${project_number}-compute@developer.s3ns-system.iam.gserviceaccount.com"
      fi
      if [[ -n "$service_account" ]]; then
        printf "%-60s| %-40s| %-40s| %-10s| %-20s\n" "$service_account" "$project_id" "$cluster_name" "$cluster_location" "$nodepool_name"
        add_service_account "${service_account}"
      else
        # printf is used so that the tab separators are actually expanded
        printf "cannot find service account for node pool %s\t%s\t%s\t%s\n" "$project_id" "$cluster_name" "$cluster_location" "$nodepool_name"
      fi
    done <<< "$(gcloud container node-pools describe "$nodepool_name" --cluster "$cluster_name" --location "$cluster_location" --project "$project_id" --format="table[no-heading](config.serviceAccount)")"
  done <<< "$(gcloud container node-pools list --cluster "$cluster_name" --location "$cluster_location" --project "$project_id" --format="table[no-heading](name)")"
}
# Function to add service accounts into the global array all_service_accounts for an Autopilot GKE cluster
# Autopilot cluster only has one node service account.
# $1: project_id
# $2: location
# $3: cluster_name
add_service_account_for_autopilot() {
  local project_id="$1"
  local cluster_location="$2"
  local cluster_name="$3"
  while read -r service_account; do
    if [[ "$service_account" == "default" ]]; then
      service_account="${project_number}-compute@developer.s3ns-system.iam.gserviceaccount.com"
    fi
    if [[ -n "$service_account" ]]; then
      # No node pool is queried here, so "-" fills the nodepool_name column
      # instead of a value left over from a previously processed cluster.
      printf "%-60s| %-40s| %-40s| %-10s| %-20s\n" "$service_account" "$project_id" "$cluster_name" "$cluster_location" "-"
      add_service_account "${service_account}"
    else
      printf "cannot find service account for cluster %s\t%s\t%s\n" "$project_id" "$cluster_name" "$cluster_location"
    fi
  done <<< "$(gcloud container clusters describe "$cluster_name" --location "$cluster_location" --project "$project_id" --format="table[no-heading](autoscaling.autoprovisioningNodePoolDefaults.serviceAccount)")"
}
# Function to check whether the cluster is an Autopilot cluster or not
# $1: project_id
# $2: location
# $3: cluster_name
is_autopilot_cluster() {
  local project_id="$1"
  local cluster_location="$2"
  local cluster_name="$3"
  # Pass --project so that the lookup uses the same project as the rest of the script
  autopilot=$(gcloud container clusters describe "$cluster_name" --location "$cluster_location" --project "$project_id" --format="table[no-heading](autopilot.enabled)")
  echo "$autopilot"
}
echo "--- 1. List all service accounts in all GKE node pools"
printf "%-60s| %-40s| %-40s| %-10s| %-20s\n" "service_account" "project_id" "cluster_name" "cluster_location" "nodepool_name"
while read -r cluster; do
  cluster_name=$(echo "$cluster" | awk '{print $1}')
  cluster_location=$(echo "$cluster" | awk '{print $2}')
  # Determine whether the cluster is a Standard cluster or an Autopilot cluster
  autopilot=$(is_autopilot_cluster "$project_id" "$cluster_location" "$cluster_name")
  if [[ "$autopilot" == "True" ]]; then
    add_service_account_for_autopilot "$project_id" "$cluster_location" "$cluster_name"
  else
    add_service_accounts_for_standard "$project_id" "$cluster_location" "$cluster_name"
  fi
done <<< "$(gcloud container clusters list --project "$project_id" --format="value(name,location)")"
echo "--- 2. Check if service accounts have permissions"
unique_service_accounts=($(echo "${all_service_accounts[@]}" | tr ' ' '\n' | sort -u | tr '\n' ' '))
echo "Service accounts: ${unique_service_accounts[@]}"
printf "%-60s| %-40s| %-40s| %-20s\n" "service_account" "has_logging_permission" "has_monitoring_permission" "has_performance_hpa_metric_write_permission"
for sa in "${unique_service_accounts[@]}"; do
  logging_permission=$(service_account_has_permission "$project_id" "$sa" "logging.logEntries.create")
  time_series_create_permission=$(service_account_has_permission "$project_id" "$sa" "monitoring.timeSeries.create")
  metric_descriptors_create_permission=$(service_account_has_permission "$project_id" "$sa" "monitoring.metricDescriptors.create")
  if [[ "$time_series_create_permission" == "No" || "$metric_descriptors_create_permission" == "No" ]]; then
    monitoring_permission="No"
  else
    monitoring_permission="Yes"
  fi
  performance_hpa_metric_write_permission=$(service_account_has_permission "$project_id" "$sa" "autoscaling.sites.writeMetrics")
  printf "%-60s| %-40s| %-40s| %-20s\n" "$sa" "$logging_permission" "$monitoring_permission" "$performance_hpa_metric_write_permission"
  if [[ "$logging_permission" == "No" || "$monitoring_permission" == "No" || "$performance_hpa_metric_write_permission" == "No" ]]; then
    sa_missing_permissions+=( "${sa}" )
  fi
done
echo "--- 3. List all service accounts that don't have the above permissions"
if [[ "${#sa_missing_permissions[@]}" -gt 0 ]]; then
  printf "Grant roles/container.defaultNodeServiceAccount to the following service accounts: %s\n" "${sa_missing_permissions[@]}"
else
  echo "All service accounts have the above permissions"
fi
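Identify node service accounts that are missing critical permissions in a cluster
GKE uses IAM service accounts that are attached to your nodes to run system tasks like logging and monitoring. At a minimum, these node service accounts must have the Kubernetes Engine Default Node Service Account (roles/container.defaultNodeServiceAccount) role on your project. By default, GKE uses the Compute Engine default service account, which is automatically created in your project, as the node service account.
If your organization enforces the iam.automaticIamGrantsForDefaultServiceAccounts organization policy constraint, the default Compute Engine service account in your project might not automatically get the required permissions for GKE.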
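- To check whether logging permissions are missing, look for 401 errors in your cluster's logs:
[[ $(kubectl logs -l k8s-app=fluentbit-gke -n kube-system -c fluentbit-gke | grep -cw "Received 401") -gt 0 ]] && echo "true" || echo "false"
If the output is true, the system workload is experiencing 401 errors, which indicate missing permissions. If the output is false, skip the remaining steps and try a different troubleshooting procedure. To identify all of the missing critical permissions, run the script that checks node service accounts for missing permissions.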
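- Find the name of the service account that your nodes use:
Console
- Go to the Kubernetes clusters page.
- In the cluster list, click the name of the cluster that you want to inspect.
- Depending on the cluster mode of operation, do one of the following:
  - For Autopilot mode clusters, in the Security section, find the Service account field.
  - For Standard mode clusters, do the following:
    - Click the Nodes tab.
    - In the Node pools table, click a node pool name. The Node pool details page opens.
    - In the Security section, find the Service account field.
If the value in the Service account field is default, your nodes use the Compute Engine default service account. If the value in this field is not default, your nodes use a custom service account. To grant the required role to a custom service account, see Use least privilege IAM service accounts.
gcloud
For Autopilot mode clusters, run the following command:
gcloud container clusters describe CLUSTER_NAME \
    --location=LOCATION \
    --flatten=autoscaling.autoprovisioningNodePoolDefaults.serviceAccount
For Standard mode clusters, run the following command:
gcloud container clusters describe CLUSTER_NAME \
    --location=LOCATION \
    --format="table(nodePools.name,nodePools.config.serviceAccount)"
If the output is default, your nodes use the Compute Engine default service account. If the output is not default, your nodes use a custom service account. To grant the required role to a custom service account, see Use least privilege IAM service accounts.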
- To grant the roles/container.defaultNodeServiceAccount role to the Compute Engine default service account, complete the following steps:
Console
- Go to the Welcome page.
- In the Project number field, click Copy to clipboard.
- Go to the IAM page.
- Click Grant access.
- In the New principals field, specify the following value:
PROJECT_NUMBER-compute@developer.s3ns-system.iam.gserviceaccount.com
Replace PROJECT_NUMBER with the project number that you copied.
- In the Select a role menu, select the Kubernetes Engine Default Node Service Account role.
- Click Save.
gcloud
- Find your Trusted Cloud project number:
gcloud projects describe PROJECT_ID \
    --format="value(projectNumber)"
Replace PROJECT_ID with your project ID.
The output is similar to the following:
12345678901
- Grant the roles/container.defaultNodeServiceAccount role to the Compute Engine default service account:
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:PROJECT_NUMBER-compute@developer.s3ns-system.iam.gserviceaccount.com" \
    --role="roles/container.defaultNodeServiceAccount"
Replace PROJECT_NUMBER with the project number from the previous step.
- Verify that the node service account has the required permissions by running the script that checks node service accounts for missing permissions.
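The steps above treat the value default as shorthand for the Compute Engine default service account. The following is a minimal sketch of that mapping: the function name resolve_node_service_account is hypothetical, and the email pattern is the one used in the grant step above.

```shell
#!/bin/bash
# Hypothetical helper: maps the service account value reported for a node
# pool to a full email address. The value "default" stands for the
# Compute Engine default service account of the project.
# $1: service account value (an email address or "default")
# $2: project number
resolve_node_service_account() {
  local service_account="$1"
  local project_number="$2"
  if [[ "$service_account" == "default" ]]; then
    printf '%s-compute@developer.s3ns-system.iam.gserviceaccount.com\n' "$project_number"
  else
    printf '%s\n' "$service_account"
  fi
}

# 12345678901 is the sample project number from the steps above.
member="serviceAccount:$(resolve_node_service_account "default" "12345678901")"
echo "$member"
# Prints: serviceAccount:12345678901-compute@developer.s3ns-system.iam.gserviceaccount.com
```

The resulting string has the form that the --member flag of the add-iam-policy-binding command above expects. A custom service account email passes through unchanged.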
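Verify that Cloud Logging write API quotas have not been reached
Verify that you haven't reached the API write quotas for Cloud Logging.
- Go to the Quotas page in the Trusted Cloud console.
- Filter the table by Cloud Logging API.
- Confirm that you haven't reached any of the quotas.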