Over the past 18 months I have been working with organizations and the Zscaler team to help deliver security for workloads running on public cloud platforms such as AWS, Azure, and GCP. As many of you know and have been doing for many years, it's possible to send your internet-bound traffic to the Zscaler Internet Access (ZIA) platform using a variety of methods. Options for getting traffic to ZIA Service Edges include GRE tunnels, IPSec tunnels (including SD-WAN integrations), Client Connector, and even PAC files. This flexibility is one of the many things that make the Zscaler platform so amazing.
Enter public cloud. AWS. Azure. GCP. You can just configure some GRE or IPSec tunnels and forward internet-bound workload traffic to ZIA easily, right? Well, not really. Some cloud providers don't support GRE tunnels, and some of the native VPN/IPSec tunnel capabilities do not provide the resiliency/HA many organizations require. There are third-party solutions, such as deploying virtual cloud routers and then setting up IPSec tunnels to ZIA. This can work, but in most cases we see that it does not scale, both from a throughput perspective and operationally.
Zscaler for Workloads offers a component called Cloud Connectors: Zscaler purpose-built gateways that can be deployed into public cloud platforms and forward traffic to both the Zscaler Internet Access (ZIA) and Zscaler Private Access (ZPA) platforms. Cloud Connectors are EC2 instances/VMs, integrate with the cloud providers' native load balancers, scale horizontally, and are deployed with IaC tools such as Terraform and CloudFormation. Cloud Connectors securely forward traffic to Zscaler using DTLS/TLS tunnels, something many customers will be familiar with because it is the same underlying tunneling technology Zscaler offers with Client Connector. If you want to learn more, please visit our page here.
What is a Workload, anyway? Any service or machine that communicates on the network and typically does not have a user logged into it: EC2 instances, RDS instances, EKS (container) nodes, Lambda functions, etc.
Cloud Connectors on AWS
Let's quickly cover the Cloud Connector component to provide more familiarity and context for the rest of the article. In this example, we have decided to create a Zscaler VPC in the AWS Account that has the regional Transit Gateway. Zscaler can automate the creation of the VPC using Terraform, but many organizations utilize existing code or processes for the underlying network. Zscaler generally recommends a minimum of 2 AZs and deploying the components into private subnets, because no inbound connectivity from the internet is needed. In this case, we have:
- Deployed public subnets with 1 NAT Gateway per subnet/AZ, and configured a default route from the public subnets to the Internet Gateway
- Deployed private subnets with a default route to the NAT Gateways. *Note: NAT Gateways are not required, but this is the recommended deployment from Zscaler. Please contact Zscaler if you prefer to deploy Cloud Connectors into a public subnet and remove the need for NAT Gateways. This is supported but not our recommendation.
- Deployed private subnets that have the Transit Gateway attachments. Prior to deploying the Cloud Connectors, the default route will most likely point to the NAT Gateways
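To make that underlying network concrete, here is a minimal Terraform sketch of the routing described above. All resource names (`public`, `private_az1`, `nat_az1`, etc.) are hypothetical; in practice the Zscaler-provided modules or your existing network code would create these.

```hcl
# Public subnets default route to the Internet Gateway.
resource "aws_route" "public_to_igw" {
  route_table_id         = aws_route_table.public.id
  destination_cidr_block = "0.0.0.0/0"
  gateway_id             = aws_internet_gateway.this.id
}

# One NAT Gateway per public subnet/AZ.
resource "aws_nat_gateway" "az1" {
  allocation_id = aws_eip.nat_az1.id
  subnet_id     = aws_subnet.public_az1.id
}

# Private subnets (where Cloud Connectors live) default route to the NAT Gateway.
resource "aws_route" "private_to_nat" {
  route_table_id         = aws_route_table.private_az1.id
  destination_cidr_block = "0.0.0.0/0"
  nat_gateway_id         = aws_nat_gateway.az1.id
}
```

The same pattern repeats per AZ, which is why deploying across 2+ AZs stays "cookie-cutter".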
Zscaler Cloud Connectors with AWS GWLB Example
Once the underlying network is in place, we deploy using Terraform or CloudFormation. In this example, the default Zscaler TF/CFT templates will deploy a Lambda Macro, one Cloud Connector per Subnet/AZ (m5.large), a GWLB Service, a Target Group including the Cloud Connector service ENIs, and a GWLB VPC Endpoint in the same subnets as the Cloud Connectors.
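As a rough illustration of what those templates wire up, a hand-written version of the GWLB pieces in Terraform might look like the following. The names, subnets, and health-check port are illustrative assumptions; the actual Zscaler TF/CFT templates define the real values.

```hcl
# Gateway Load Balancer spanning the Cloud Connector subnets.
resource "aws_lb" "gwlb" {
  name               = "zs-cc-gwlb" # hypothetical name
  load_balancer_type = "gateway"
  subnets            = [aws_subnet.cc_az1.id, aws_subnet.cc_az2.id]
}

# GWLB hands raw packets over GENEVE (port 6081) to the
# Cloud Connector service ENI IPs registered in this target group.
resource "aws_lb_target_group" "cc" {
  name        = "zs-cc-targets" # hypothetical name
  protocol    = "GENEVE"
  port        = 6081
  vpc_id      = aws_vpc.zscaler.id
  target_type = "ip"

  health_check {
    protocol = "TCP"
    port     = "80" # illustrative; the Zscaler templates define the real probe
  }
}

# Endpoint Service so GWLB VPC Endpoints (local or distributed) can attach.
resource "aws_vpc_endpoint_service" "gwlb" {
  acceptance_required        = false
  gateway_load_balancer_arns = [aws_lb.gwlb.arn]
}
```

The GWLB VPC Endpoints deployed into the Cloud Connector subnets then attach to this Endpoint Service, which becomes important again later for the distributed endpoint model.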
Upon successful enrollment with Zscaler, each Cloud Connector will, by default, discover and establish 2 outbound tunnels (unencrypted, DTLS, or TLS) to the closest optimal Zscaler Service Edges. Each Cloud Connector will have an active tunnel for forwarding workload traffic to Zscaler, while the secondary/backup tunnel remains in standby mode.
Please note this is the default configuration and behavior. Organizations can utilize forwarding rules to send traffic to different destinations and service edges. This means that when a Cloud Connector processes traffic, it is possible for it to have multiple active tunnels established to forward the traffic to different destinations per the forwarding rules:
- Select which Zscaler Service Edges to utilize. This includes public edges, private/virtual edges, and ZIA subclouds
- Utilize different Zscaler Service Edges for different types of traffic based on criteria. This includes but is not limited to: Cloud Connectors, Network Services, Source IP/CIDR, Destination IP/CIDR/FQDN.
For more information please read more here.
That should be a good enough introduction to what and how the Cloud Connectors work in AWS for the purposes of this article. Let's move on to the design and topology decisions!
Many organizations have massive AWS footprints. Surprisingly, the complexities I see are not tied to the quantity or types of workloads running in AWS: EC2 instances, RDS instances, Lambda functions, EKS nodes, etc. Operationalizing granular security policies does take time, but it is not at the forefront of most conversations.
So what is? Cloud Network Topology & Design Decisions. There are so many considerations to account for and many are not Zscaler-specific. The right design will optimize costs, reduce operational overhead, and not compromise security. I hope to share some insight and my experiences having designed and deployed this with many organizations.
Let me reiterate one more thing: there is no single answer or design that works for every deployment, but I am starting this blog series with what I have seen to be the most common on AWS: Regional Security Hubs using Transit Gateway (TGW). In a nutshell, this is a hub-and-spoke model in which the workload Spoke VPCs are connected to a regional TGW, routing can be centralized, and security products can be deployed into a "security", "inspection", or "egress" VPC to service all the Spoke VPCs. There are enough configuration options here to fill a 50-page document, so I will keep to the point and not cover every single nuance. I'll refer to these hubs as being per-region, but it is possible to have multiple hubs/TGWs per region; large organizations may separate the workloads across environments such as Test/Dev, QA, and Production. So just keep in mind your exact design will vary.
There are many benefits of using TGW, and you can learn more at https://docs.aws.amazon.com/prescriptive-guidance/latest/integrate-third-party-services/architecture-3.html and https://aws.amazon.com/transit-gateway/features/ or just use your favorite search engine.
Is this model right for us?
Let's make it simple. When you are wondering if this is the best or possible topology when it comes to Zscaler, ask yourself these 3 questions:
- Are we already using Transit Gateways?
- Have we already made a decision to migrate to using Transit Gateways?
- Do we have hundreds or thousands of VPCs spread across AWS Accounts and/or Regions?
If you answered yes to at least one of these questions, then this might be the best option for you as an enterprise standard. Does that mean you will be "all or nothing"? Nope. Thanks to AWS innovation, the AWS Gateway Load Balancer (GWLB) offering enables security services and vendors like Zscaler to utilize what is called a Distributed GWLB Endpoint model. This means you can have centralized regional Security VPCs with Zscaler Cloud Connectors and still secure Isolated VPCs that are not peered or connected to the Security VPC via TGW! If this concept is new to you, don't worry, I will explain it a bit more if you keep reading...
What about other design options? We'll cover that in Part 2, but to give you a sneak peek... it's a fully decentralized model where each VPC is isolated without any peering or TGW connectivity. For this article we are only talking about the centralized hub model!
Regional Hub with TGW and Distributed Endpoint connectivity to Zscaler example
Let's take a look at this high-level design example where the organization has decided to deploy Zscaler Cloud Connectors into regional hubs because most of the workload Spoke VPCs are already connected via Transit Gateway. However, the organization also has a few Isolated VPCs that do not require access to private resources or applications but do need internet egress protection. Instead of deploying Cloud Connectors directly into each Isolated VPC, the organization simply connected to the GWLB Service fronting the Zscaler Cloud Connectors by deploying a GWLB VPC Endpoint into the Isolated VPC. This hybrid model provides the benefits of:
- Lowered costs. There is no need to deploy Cloud Connector EC2 instances in each VPC, just a few Cloud Connectors per regional hub. There is always give and take with vendor and cloud provider costs (networking, compute, storage, etc.), but generally this benefit applies today.
- Less operational overhead. Only needing to deploy a group of Cloud Connectors and related components per region is much easier and faster than per VPC. It is important to note that organizations further along in the IaC journey will not see much extra overhead when deploying per VPC, as it is very "cookie-cutter" and can be automated. Organizations that have not deployed all their VPCs, network configurations, etc. across AWS using IaC such as Terraform will find that deploying per VPC requires more manual steps than doing it centrally.
- Less complex routing. Most of the VPCs already default route to the TGW, so it is possible to route essentially all outbound traffic to the Cloud Connectors by adjusting a TGW route table. In reality you will cut over VPCs in phases, but the point is simplification.
The diagram depicts a hybrid approach but there are some variables I want to call out just because a single diagram can't account for every possibility:
- Availability Zones. Zscaler recommends deploying Cloud Connectors across a minimum of 2 AZs, but usually it's best to match your enterprise standard.
- GWLB. Zscaler deploys a GWLB Service and respective GWLB VPC Endpoints into the subnets of each Cloud Connector by default. Connecting additional GWLB VPC Endpoints (distributed) can be done too using the same GWLB Service.
- Routing. Instead of routing to the Cloud Connector Service ENIs, you will always route to the respective GWLB VPC Endpoints. A default route is most common, but some organizations only forward proxy-aware web traffic (PAC file / explicit proxy) to Zscaler.
- Cloud-Native or Virtual Firewalls. Many organizations might have existing firewalls in AWS that are used for "east-west" inspection and control. The firewalls might be the next hop from a routing perspective. If so, we generally talk through the options of where to deploy Cloud Connectors. We can deploy into a separate VPC, deploy into the same VPC, and have workloads next hop point to the Cloud Connectors or firewalls first. This topic alone will be a blog as there are many options here depending on your organization's requirements.
Brief note: Many customers bring up questions around traffic they do not want to send to Zscaler. The implementation details vary based on the use case but with the use of routing, forwarding rules, and other configurations it is possible to send all or some traffic to Cloud Connectors and/or Zscaler. We will not cover this topic in this article but it's an important consideration that Zscaler is aware of!
Spoke VPCs to Zscaler via TGW
In many cases, most of the VPCs will be connected to the Transit Gateway where the Cloud Connectors are deployed. As we zoom into this portion of the network diagram, we can see the Spoke will route all traffic destined outside of its own VPC to the TGW. If the TGW route table is configured to send the default route to the Zscaler VPC, the route tables of the TGW attachment subnets in the Zscaler VPC will then default route to the GWLB VPC Endpoints fronting the Cloud Connectors.
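The routing chain just described can be sketched in Terraform roughly as follows. Route table and attachment names are hypothetical, and only one AZ is shown for brevity:

```hcl
# Spoke VPC: send everything non-local to the Transit Gateway.
resource "aws_route" "spoke_default_to_tgw" {
  route_table_id         = aws_route_table.spoke_private.id
  destination_cidr_block = "0.0.0.0/0"
  transit_gateway_id     = aws_ec2_transit_gateway.this.id
}

# TGW route table associated with the spokes:
# default route to the Zscaler VPC attachment.
resource "aws_ec2_transit_gateway_route" "default_to_zscaler" {
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.spokes.id
  destination_cidr_block         = "0.0.0.0/0"
  transit_gateway_attachment_id  = aws_ec2_transit_gateway_vpc_attachment.zscaler.id
}

# Zscaler VPC, TGW attachment subnet: default route to the
# GWLB VPC Endpoint fronting the Cloud Connectors.
resource "aws_route" "zscaler_default_to_gwlbe" {
  route_table_id         = aws_route_table.zscaler_tgw_subnet.id
  destination_cidr_block = "0.0.0.0/0"
  vpc_endpoint_id        = aws_vpc_endpoint.gwlbe_az1.id
}
```

Cutting a new Spoke VPC over to Zscaler is then mostly a matter of associating its TGW attachment with the route table that carries the default route above.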
Spoke VPC connected via Transit Gateway to Zscaler
The Cloud Connectors will then forward the traffic appropriately, such as using the established DTLS tunnels to the Zscaler Service Edges as depicted in the diagram below. When GWLB Cross-Zone Load Balancing is Enabled (which is our recommendation), GWLB will be able to send traffic to Cloud Connectors across all AZs instead of only the workload source AZ. This is important from an HA/resilience perspective because if the Cloud Connector(s) in AZ1 are unable to tunnel the traffic to Zscaler, the healthy Cloud Connector(s) in AZ2 can forward that traffic without interruption. It is also important to note that if a primary tunnel fails to connect to Zscaler from a Cloud Connector, a secondary tunnel will be marked as active and used to forward traffic (as depicted in red below).
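Cross-zone load balancing is a single flag on the GWLB itself. In Terraform it is set on the gateway load balancer definition (shown here as a standalone sketch with a hypothetical name):

```hcl
resource "aws_lb" "gwlb" {
  name                             = "zs-cc-gwlb" # hypothetical name
  load_balancer_type               = "gateway"
  subnets                          = [aws_subnet.cc_az1.id, aws_subnet.cc_az2.id]

  # Allow GWLB to send flows to healthy Cloud Connectors in any AZ,
  # not just the AZ where the workload traffic originated.
  enable_cross_zone_load_balancing = true
}
```

Note that cross-AZ traffic can carry a data transfer cost, which is part of the usual cost-versus-resilience trade-off.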
Now we have traffic routing from a Spoke VPC to Zscaler so the workloads are protected. Although I am simplifying this for the purpose of this article, you would start associating other workload Spoke VPCs in this region to the TGW Route Table that is pointing to the Zscaler VPC to protect them as well. It is mostly "rinse and repeat" at this point for all VPCs in the region, and then the next region, etc.
Isolated VPCs to Zscaler with Distributed Endpoints
Last but not least, what about those Isolated VPCs in the same region that have no peering or TGW connectivity to the Zscaler VPC? This is where we zoom into this portion of the diagram and see that the connectivity looks almost identical to TGW from a diagram perspective. A minor but critical detail is that instead of a TGW attachment, we have simply deployed a GWLB VPC Endpoint that connects to the existing GWLB Service fronting the Cloud Connectors (from the TGW diagram). This connection uses AWS PrivateLink to stay on the AWS backbone/network, but allows for the same connectivity out to the Internet with Zscaler protection!
Isolated VPC with Distributed GWLB Endpoint to Zscaler
So in the above diagram you'll notice the architecture/topology is still centralized, but the AWS GWLB Service enables connectivity to Zscaler without VPC connectivity! From a routing perspective the differences in this method are:
- Workload VPC routes to the local GWLB VPC Endpoint instead of TGW
- The GWLB Endpoint connects through the GWLB Service to the Cloud Connectors using AWS PrivateLink (not depicted in diagram) instead of TGW
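In Terraform, wiring an Isolated VPC into the central GWLB Service might look like the sketch below. Names are hypothetical, and if the Isolated VPC lives in another account, the central endpoint service would also need to allow that account as a principal:

```hcl
# GWLB VPC Endpoint in the Isolated VPC, attached to the existing
# central GWLB Endpoint Service via AWS PrivateLink.
resource "aws_vpc_endpoint" "isolated_gwlbe" {
  vpc_id            = aws_vpc.isolated.id
  service_name      = aws_vpc_endpoint_service.gwlb.service_name
  vpc_endpoint_type = "GatewayLoadBalancer"
  subnet_ids        = [aws_subnet.isolated_endpoint.id]
}

# Workload subnet default route targets the local endpoint, not a TGW.
resource "aws_route" "isolated_default_to_gwlbe" {
  route_table_id         = aws_route_table.isolated_workloads.id
  destination_cidr_block = "0.0.0.0/0"
  vpc_endpoint_id        = aws_vpc_endpoint.isolated_gwlbe.id
}
```

Because the endpoint, not a routable attachment, is the only link to the Security VPC, this is also why overlapping CIDRs are fine in this model.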
Are there any advantages or disadvantages to the Distributed Endpoint model instead of just attaching them to TGW?
- Biggest Advantage: Support for overlapping VPC CIDRs. The Isolated VPCs can all have the same CIDR and utilize the same centralized Cloud Connectors. VPCs attaching to the same TGW cannot have overlapping CIDRs.
- Biggest Disadvantage: This is "one-way" communication for the Isolated Workload VPC getting to the internet through Zscaler. Yes, return traffic is inspected and sent back to the workloads, but they must initiate all communication. If another VPC or service outside of the Isolated VPC needs to communicate inbound to this VPC, it will not work, since GWLB Endpoints only support the "inbound" side. However, this disadvantage is actually a good reason to explore the ZPA for Workloads offering from Zscaler :) If workload-to-app and workload-to-workload communication is needed, you can simply deploy ZPA App Connectors into the Isolated VPC and publish the required application(s) through Zscaler Private Access! Keep in mind this bullet point is a disadvantage specifically in the context of this article's focus on internet egress security use cases.
In Part 2 of this article series, we will cover a fully decentralized AWS model where you deploy the Cloud Connectors into each Workload VPC with direct secure internet access. Don't worry, the next articles will cover Azure, GCP, and then ZPA-specific use cases too. I plan to write a new part every few weeks!
Now, you might have some questions. Please don't hesitate to reach out to your Zscaler Customer Success Manager or Account Team and ask for a Workload Communications Discovery Workshop. Nothing beats some diagrams, digital or in-person white boarding, and talking through all the details of the design.