Version: main 🚧

Troubleshooting External Database

info

This feature is available from the Platform version v4.8.0

Modify the following with your specific values to replace on the whole page and generate copyable commands:

PLATFORM_DOMAIN

ACCOUNT_ID

AWS_REGION

DB_VPC_ID

DB_SG_ID

CLUSTER_CONTEXT

Platform pods are crashlooping

Check pod logs for errors:

kubectl --context arn:aws:eks:us-east-1:123456789012:cluster/platform-ha \
logs -n vcluster-platform -l app=loft --tail=50

Common causes:

Database connectivity errors: Verify that VPC peering is active and routes exist on the correct route tables. Include public route tables if nodes are in public subnets. Confirm the database security group allows port 3306 from the cluster VPC CIDR.
IAM authentication failures: Confirm the Pod Identity association exists, the Pod Identity Agent add-on is ACTIVE, and the RDSIAMAuthKine IAM policy contains the correct DbiResourceId.
Resource exhaustion: Check node resource usage with kubectl top nodes and pod resource usage with kubectl top pods -n vcluster-platform.

Leader election is stuck

If the platform is unresponsive but pods are running, the leader lease may be stuck. Check the lease status:

kubectl --context arn:aws:eks:us-east-1:123456789012:cluster/platform-ha \
get lease -n vcluster-platform -o wide

If the lease holder is a pod that no longer exists, restart the deployment to force a new leader election:

kubectl --context arn:aws:eks:us-east-1:123456789012:cluster/platform-ha \
rollout restart deployment/loft -n vcluster-platform

ALB health checks failing

Check the ingress status:

kubectl --context arn:aws:eks:us-east-1:123456789012:cluster/platform-ha \
get ingress -n vcluster-platform loft

Verify the ALB target group health in the AWS console or with:

Modify the following with your specific values to generate a copyable command:

TARGET_GROUP_ARN

aws elbv2 describe-target-health \
--target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/k8s-vcluster-loft-xxxxxxxxxx/xxxxxxxxxx

Common causes:

The platform pods are not ready (check kubectl get pods).
The /healthz endpoint is not reachable from the ALB (check security groups and the ALB target type matches the service configuration).

Database connection timeouts

If pods start but log database connection errors:

Verify VPC peering is active:

aws ec2 describe-vpc-peering-connections \
--filters "Name=status-code,Values=active" \
--region us-east-1 \
--query 'VpcPeeringConnections[].{ID:VpcPeeringConnectionId,Requester:RequesterVpcInfo.VpcId,Accepter:AccepterVpcInfo.VpcId}'

Verify routes from the EKS VPC to the database VPC exist:

Modify the following with your specific values to generate a copyable command:

EKS_VPC_ID

aws ec2 describe-route-tables \
--filters "Name=vpc-id,Values=vpc-yyyyyyyyy" \
--region us-east-1 \
--query 'RouteTables[].Routes[?DestinationCidrBlock==`10.1.0.0/16`]'

Test connectivity from within the cluster:

Modify the following with your specific values to generate a copyable command:

RDS_ENDPOINT

kubectl run dbtest --image=busybox --restart=Never --rm -it -- \
nc -zv mariadb-ha-platform.xxxxxxxxxxxx.us-east-1.rds.amazonaws.com 3306

Kerberos authentication failures

When the platform authenticates to a PostgreSQL backend over Kerberos, the GSSAPI handshake output appears in the platform pod logs. Check the current logs with:

kubectl --context arn:aws:eks:us-east-1:123456789012:cluster/platform-ha \
logs -n vcluster-platform -l app=loft --tail=100

If the pod restarted, check the previous container logs as well:

kubectl --context arn:aws:eks:us-east-1:123456789012:cluster/platform-ha \
logs -n vcluster-platform -l app=loft --tail=100 --previous

Common causes:

KDC_ERR_C_PRINCIPAL_UNKNOWN: The KDC has no such principal. The principal in the datasource does not match the one exported into the mounted keytab, or it was never created in the KDC. Confirm both name and realm.
no GSSAPI provider registered: The platform build predates GSSAPI support. Upgrade to v4.10.0 or later.
Handshake fails despite a valid principal: The datasource host does not match the host in the PostgreSQL service principal (postgres/<host>@REALM) baked into the server keytab. Make the datasource host and the service principal host identical, or disable DNS canonicalization in krb5.conf (dns_canonicalize_hostname = false, rdns = false).
gss authentication failed from PostgreSQL: The pg_hba.conf rule did not map the principal to a login role, or PostgreSQL matched a broader host rule before the Kerberos rule. Verify the gss rule appears before broader rules, the role name matches the principal short name, and include_realm=0 and krb_realm are set correctly.
TLS or certificate errors: Kerberos authenticates the database connection, but TLS is configured separately. Verify the datasource sslmode and any database CA or client certificate values match your PostgreSQL server requirements.

Platform pods are crashlooping​

Leader election is stuck​

ALB health checks failing​

Database connection timeouts​

Kerberos authentication failures​

Platform pods are crashlooping

Leader election is stuck

ALB health checks failing

Database connection timeouts

Kerberos authentication failures