Stop the ‘Out of Memory’ Crashes: Scaling Your BI
When I first started with Metabase, a single small EC2 instance was plenty. But as more team members started building dashboards and the query volume grew, I hit a wall. If you’ve seen the dreaded ‘504 Gateway Timeout’ or experienced sudden crashes during peak morning reporting hours, you’re probably wondering how to scale Metabase on AWS effectively.
The core problem is that the default Metabase setup uses an H2 embedded database. To scale, we must decouple the application logic from the data storage. This allows us to run multiple instances of Metabase behind a load balancer, creating a high-availability (HA) cluster. If you’re still deciding between Metabase vs Superset, keep in mind that both require similar infrastructure shifts as you scale, but Metabase’s scaling path on AWS is particularly streamlined if you use containers.
Prerequisites
- An AWS Account with IAM permissions for ECS, RDS, and VPC.
- A Dockerized Metabase image (the official metabase/metabase image is recommended).
- A basic understanding of VPCs, Security Groups, and Target Groups.
- Experience with self-hosted business intelligence tools is helpful but not required.
Step 1: Move to an External Application Database
You cannot scale Metabase horizontally if you are using the embedded H2 database, as each instance would have its own separate state. The first step in scaling is migrating your application database to Amazon RDS (PostgreSQL is the gold standard here).
# Example environment variables for your Metabase container
MB_DB_TYPE=postgres
MB_DB_DBNAME=metabase
MB_DB_PORT=5432
MB_DB_USER=metabase_user
MB_DB_PASS=your_secure_password
MB_DB_HOST=metabase-db.xxxx.us-east-1.rds.amazonaws.com
I recommend using a Multi-AZ deployment for RDS to ensure that your BI tool doesn’t go down just because one AWS availability zone has a hiccup.
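Once those MB_DB_* variables point at RDS, Metabase can copy your existing H2 data across with its built-in load-from-h2 command. A rough one-time sketch (the host, credentials, and H2 path are placeholders; run it with Metabase stopped):

```shell
# Point Metabase at the new Postgres application database
export MB_DB_TYPE=postgres
export MB_DB_DBNAME=metabase
export MB_DB_PORT=5432
export MB_DB_USER=metabase_user
export MB_DB_PASS=your_secure_password
export MB_DB_HOST=metabase-db.xxxx.us-east-1.rds.amazonaws.com

# One-time migration of the old H2 file into Postgres
# (pass the H2 path without the .mv.db extension)
java -jar metabase.jar load-from-h2 /path/to/metabase.db
```

After the migration succeeds, point every container at the RDS host and never mount the old H2 file again.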
Step 2: Containerize with Amazon ECS (Fargate)
Instead of managing individual EC2 VMs, I use Amazon ECS with Fargate. This removes the need to manage the underlying servers and allows for seamless auto-scaling based on CPU or Memory usage.
- Create a Task Definition: Use the metabase/metabase image. Set your memory limits; I’ve found that 4GB of RAM is the sweet spot for medium-sized teams.
- Configure Environment Variables: Inject the RDS credentials mentioned in Step 1.
- Set Up a Service: Create an ECS Service that maintains a desired count of 2 or more tasks across different Availability Zones.
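Putting those three settings together, a minimal Fargate task definition might look like the sketch below. The family, image tag, and Secrets Manager ARN are placeholders (I’m assuming the DB password lives in Secrets Manager rather than plain environment variables); Metabase itself listens on port 3000.

```json
{
  "family": "metabase",
  "requiresCompatibilities": ["FARGATE"],
  "networkMode": "awsvpc",
  "cpu": "1024",
  "memory": "4096",
  "containerDefinitions": [
    {
      "name": "metabase",
      "image": "metabase/metabase:latest",
      "portMappings": [{ "containerPort": 3000, "protocol": "tcp" }],
      "environment": [
        { "name": "MB_DB_TYPE", "value": "postgres" },
        { "name": "MB_DB_HOST", "value": "metabase-db.xxxx.us-east-1.rds.amazonaws.com" },
        { "name": "MB_DB_PORT", "value": "5432" },
        { "name": "MB_DB_DBNAME", "value": "metabase" },
        { "name": "MB_DB_USER", "value": "metabase_user" }
      ],
      "secrets": [
        { "name": "MB_DB_PASS", "valueFrom": "arn:aws:secretsmanager:...:secret:metabase-db-pass" }
      ]
    }
  ]
}
```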
In this architecture the ECS tasks are stateless: all state lives in the RDS application database, so the containers store nothing locally, which makes them ideal for horizontal scaling.
Step 3: Implementing the Application Load Balancer (ALB)
To distribute traffic across your Metabase containers, you need an ALB. This provides a single DNS entry for your users and handles health checks.
In the AWS Console, create a Target Group for your ECS service. Ensure the health check path is set to /api/health. If a container becomes unresponsive due to a heavy query, the ALB will automatically stop sending traffic to it and ECS will replace it.
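If you prefer the CLI to the console, creating that target group might look like the following sketch (the VPC ID and name are placeholders; target-type ip is required for Fargate tasks in awsvpc mode):

```shell
aws elbv2 create-target-group \
  --name metabase-tg \
  --protocol HTTP \
  --port 3000 \
  --vpc-id vpc-xxxxxxxx \
  --target-type ip \
  --health-check-path /api/health \
  --health-check-interval-seconds 30 \
  --healthy-threshold-count 2 \
  --unhealthy-threshold-count 3
```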
Step 4: Configuring Auto-Scaling Policies
Scaling manually is a chore. I set up a Target Tracking Scaling policy in ECS. I typically target 70% average CPU utilization. When the Monday morning dashboard rush hits, AWS will automatically spin up more Metabase containers to handle the load.
# Conceptual CLI command to add a scaling policy
aws application-autoscaling put-scaling-policy \
--policy-name metabase-cpu-scaling \
--service-namespace ecs \
--resource-id service/metabase-cluster/metabase-service \
--scalable-dimension ecs:service:DesiredCount \
--policy-type TargetTrackingScaling \
--target-tracking-scaling-policy-configuration "TargetValue=70.0,PredefinedMetricSpecification={PredefinedMetricType=ECSServiceAverageCPUUtilization}"
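One gotcha: put-scaling-policy only works after the service has been registered as a scalable target, so a command like this has to run first (cluster and service names match the ones assumed above; the capacity bounds are my suggestion, not a requirement):

```shell
aws application-autoscaling register-scalable-target \
  --service-namespace ecs \
  --resource-id service/metabase-cluster/metabase-service \
  --scalable-dimension ecs:service:DesiredCount \
  --min-capacity 2 \
  --max-capacity 10
```

Keeping --min-capacity at 2 preserves the multi-AZ redundancy from Step 2 even during quiet hours.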
Pro Tips for Metabase Performance
- Enable Caching: Don’t let Metabase hit your data warehouse for every single page load. Set up caching in the Admin settings to reduce database load.
- Optimize your JVM: Set JAVA_OPTS to manage heap size. For a 4GB container, I use -Xmx3g to leave room for the OS.
- Use Read Replicas: If your Metabase queries are slowing down your production DB, connect Metabase to an RDS Read Replica instead.
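The heap-size rule of thumb can be sketched as a tiny helper: give the JVM roughly 75% of the task memory and leave the rest for off-heap usage and the OS (the ratio is my assumption, not official guidance; 4096 MB matches the task size suggested earlier):

```shell
# Derive a JAVA_OPTS heap flag from the ECS task memory.
# Assumption: ~75% of task memory goes to the heap, the rest is headroom.
task_mem_mb=4096                   # matches the 4 GB task definition above
heap_mb=$((task_mem_mb * 3 / 4))   # 3072 MB for a 4096 MB task
java_opts="-Xmx${heap_mb}m"
echo "JAVA_OPTS=${java_opts}"      # prints JAVA_OPTS=-Xmx3072m
```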
Troubleshooting Common Scaling Issues
Issue: Containers are restarting in a crash loop (ECS keeps stopping and relaunching tasks).
Fix: This is almost always an Out-of-Memory (OOM) error. Check the stopped task’s status reason and your CloudWatch logs, then increase the task memory or lower the -Xmx value in JAVA_OPTS.
Issue: Users are being logged out randomly.
Fix: Metabase stores session tokens in the application database, so you don’t need sticky sessions once every container shares the same RDS instance. If logouts persist, confirm that all tasks use identical MB_DB_* settings and that your ALB idle timeout is long enough for your slowest queries.
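If long-running questions are being cut off at the load balancer, raising the ALB idle timeout is the usual lever. A command sketch (the load balancer ARN is a placeholder; 300 seconds is an example value, not a recommendation):

```shell
aws elbv2 modify-load-balancer-attributes \
  --load-balancer-arn arn:aws:elasticloadbalancing:... \
  --attributes Key=idle_timeout.timeout_seconds,Value=300
```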
What’s Next?
Now that you’ve mastered how to scale Metabase on AWS, you should look into implementing a more robust monitoring stack. I suggest using Amazon CloudWatch with custom dashboards to track your Metabase query latency. If you’re looking to automate your entire data pipeline, check out my other guides on automation and productivity tools here at ajmani.dev.