Networking & Content Delivery

DNS best practices for Amazon Route 53

Most web services rely on DNS to resolve names to IP addresses and sometimes other pieces of information. Amazon Route 53 provides highly available and scalable recursive DNS resolution, domain registration, and authoritative DNS-hosted zones that include health check capabilities and a broad array of routing capabilities. When using Amazon Route 53, you can scale your web services in a performant and reliable manner, taking advantage of the Amazon Web Services (AWS) global footprint.

Since launching our first service, Amazon Simple Queue Service (Amazon SQS), in 2006, we have launched hundreds of services, and they all have one thing in common: they require the use of DNS. Over the years, we have learned many best practices in managing DNS for highly scalable and reliable web services. The following best practices combine what has been published in our documentation, blogs, videos, and presentations along with our vast experience operating web services using Route 53 for DNS management.

A day in the life of a DNS query

The Domain Name System (DNS) is the phonebook of the internet. Applications, browsers, and people access information online through domain names, such as amazon.co.uk or amazon.com. Web browsers interact through Internet Protocol (IP) addresses, and DNS translates domain names to IP addresses so that browsers can load internet resources and applications can communicate with one another.

The DNS resolution process is extensive and involves many name servers, as described in RFC 1035. To make domains easily accessible and manageable, DNS administrators break the structure of a domain name down into smaller names, often called subdomains. The following figure shows the DNS tree structure:

Figure 1: DNS tree Structure

When a browser or an application tries to connect to a domain name, several steps happen behind the scenes to allow the website or application to connect to a server. In the following figure we have outlined the steps taken during a DNS query:

Figure 2: Image of recursive DNS resolution steps
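
As a rough illustration of the hierarchy in Figure 1 and the top-down walk in Figure 2, the following Python sketch lists the names a recursive resolver may pass through, starting at the root. The function name is hypothetical, and the sketch assumes every label boundary could be a delegation point. In practice, only some of these levels are separate zones.

```python
def delegation_chain(name: str) -> list[str]:
    """Return the names from the root down to `name`, one per label.

    Not every entry is necessarily its own zone; a zone exists only
    where the parent actually delegates with NS records.
    """
    labels = [label for label in name.rstrip(".").split(".") if label]
    chain = ["."]  # the root zone
    # Walk right to left: com, amazonaws.com, eu-west-1.amazonaws.com, ...
    for i in range(len(labels) - 1, -1, -1):
        chain.append(".".join(labels[i:]))
    return chain


for step in delegation_chain("ec2.eu-west-1.amazonaws.com"):
    print(step)
```

Each extra entry in the chain is a point where the resolver might need another query, which is why the depth of your delegations matters later in this post.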

This blog post covers best practices for managing your domain name infrastructure, to help you understand it and prevent outages caused by misconfiguration.

We will talk about the following best practices:

1) Route 53 namespace account control

2) Hosted Zone management

3) DNS Delegation Management

4) TTL Management

1) Route 53 namespace account control

The first and most important aspect of managing DNS is managing your namespace (the hierarchical design of the zones and names you will use).

Route 53 allows you to register a domain using the console or API. We recommend registering domains in a tightly controlled AWS account to prevent unwanted actions that could lead to a lost domain or an outage, such as deleting the domain name or disabling auto-renew. You can control who has access to make such changes (see IAM policy controls in Route 53 for an example).
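
One possible shape for such a control is an explicit deny on the domain registration actions that can lead to a lost domain. The following Python sketch only builds and prints the policy document. The statement ID and the choice of which actions to deny are assumptions for illustration, so check the action names against the current Route 53 documentation before using them.

```python
import json

# Route 53 Domains actions that could lead to losing a domain.
# Which actions belong on this list is an assumption; adjust it to your needs.
RISKY_DOMAIN_ACTIONS = [
    "route53domains:DeleteDomain",
    "route53domains:DisableDomainAutoRenew",
    "route53domains:DisableDomainTransferLock",
    "route53domains:TransferDomainToAnotherAwsAccount",
]


def deny_risky_domain_actions() -> dict:
    """Build an IAM policy document that denies the risky domain actions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyRiskyDomainChanges",  # hypothetical statement ID
                "Effect": "Deny",
                "Action": RISKY_DOMAIN_ACTIONS,
                "Resource": "*",
            }
        ],
    }


print(json.dumps(deny_risky_domain_actions(), indent=2))
```

An explicit deny takes precedence over any allow, so attaching a policy like this to everyday roles leaves these actions available only to the identities you deliberately exclude from it.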

2) Hosted Zone management

Teams managing their hosted zones should have access controls in place to prevent unwanted actions that can cause outages. These controls should guard against deletion of a hosted zone, deletion of records, and deletion of DNS delegations. If you delete a hosted zone, you cannot restore it. You will need to recreate the hosted zone, and this can cause issues, especially for clients that receive a “no such domain” (NXDOMAIN) error while the DNS records are missing. Negative answers are also cached by recursive resolvers, so even if you recreate the hosted zone quickly, customers might still experience inconsistent results for the duration of the negative caching time-to-live (TTL).
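
A guardrail along these lines could be expressed as an IAM policy scoped to a single hosted zone. The sketch below is illustrative only: the function name, statement IDs, and zone ID are hypothetical, and the condition key used to single out DELETE changes should be verified against the current Route 53 documentation before you rely on it.

```python
import json


def protect_hosted_zone(zone_id: str) -> dict:
    """Build a policy that denies deleting a hosted zone or its records."""
    zone_arn = f"arn:aws:route53:::hostedzone/{zone_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyZoneDeletion",
                "Effect": "Deny",
                "Action": "route53:DeleteHostedZone",
                "Resource": zone_arn,
            },
            {
                # Denies any change batch that contains a DELETE action.
                # The condition key is an assumption to verify in the docs.
                "Sid": "DenyRecordDeletion",
                "Effect": "Deny",
                "Action": "route53:ChangeResourceRecordSets",
                "Resource": zone_arn,
                "Condition": {
                    "ForAnyValue:StringEquals": {
                        "route53:ChangeResourceRecordSetsActions": ["DELETE"]
                    }
                },
            },
        ],
    }


# "Z0123456789EXAMPLE" is a placeholder hosted zone ID.
print(json.dumps(protect_hosted_zone("Z0123456789EXAMPLE"), indent=2))
```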

While it is usually easy to break everything down into smaller subzones, be cautious not to go too deep. Each delegation can require an additional query from the resolver and creates additional dependencies to manage. You must find the right balance between delegating to reduce the scope of impact of a single zone and having too many zones to manage and navigate. At AWS, we have chosen to delegate by Region and then by service. You will see this in domains like ec2.eu-west-1.amazonaws.com, where amazonaws.com is a zone that delegates to the eu-west-1.amazonaws.com zone, which then delegates to ec2.eu-west-1.amazonaws.com. Your situation may be different. For per-Region delegations, you should also consider using separate AWS accounts to further limit the scope of potential impact, but weigh this against the additional operational overhead of separation.

When vending software-as-a-service (SaaS) offerings with per-customer endpoints, you should also consider creating a different DNS name for each service consumer or instance of your service endpoints (for example, Your-service-xyz123.myservice.eu-west-1.example.com). In this example, we appended a random character set (xyz123) to the service name. This helps prevent overlaps in the namespace when an instance of your service is requested.
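
A minimal sketch of how such names might be generated follows. The function name, the six-character suffix length, and the lowercase alphanumeric alphabet are assumptions for illustration. A random suffix makes overlaps unlikely but not impossible, so a real implementation would still check that the name is unused before creating the record.

```python
import secrets
import string

# Lowercase letters and digits are safe in DNS labels (assumed alphabet).
ALPHABET = string.ascii_lowercase + string.digits


def endpoint_name(service: str, zone: str, suffix_len: int = 6) -> str:
    """Return a per-consumer name such as your-service-xyz123.<zone>."""
    suffix = "".join(secrets.choice(ALPHABET) for _ in range(suffix_len))
    label = f"{service}-{suffix}"
    if len(label) > 63:
        # A single DNS label cannot exceed 63 characters.
        raise ValueError("DNS label too long")
    return f"{label}.{zone}"


print(endpoint_name("your-service", "myservice.eu-west-1.example.com"))
```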

Creating a name per service/component gives you the flexibility to migrate a single consumer to a different endpoint without requiring any action on the consumer’s part or impact on other consumers. This is important for balancing load in different sharding strategies. Note that adding the Region name into external DNS names may or may not be right for your business based on the need to migrate between Regions or load balance between Regions, and thus you should weigh the risk versus reward.

Domains are at risk of being crawled and are susceptible to queries for records that do not exist. To reduce cost and provide a better experience for user interfaces, consider creating a wildcard DNS record for your zone that aliases to an Amazon CloudFront distribution serving either a long-cached error page or an HTTP redirect. Make sure you have a record at the apex of your domain (example.com) in addition to the wildcard (*.example.com), because the wildcard does not cover the apex. Additionally, if your endpoint supports IPv6, make sure to have AAAA records in place, or queries from your customers may incur additional NODATA query cost. You can read more about preventing unwanted NXDOMAIN and NODATA query charges in the AWS Best Practices for DDoS Resiliency Whitepaper.
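
The apex, wildcard, and AAAA advice above can be sketched as a single Route 53 change batch. The following Python code only builds the request body and makes no API call. The function names and the distribution domain d111111abcdef8.cloudfront.net are placeholders, and the sketch assumes the distribution has IPv6 enabled and lists both names as alternate domain names.

```python
# Hosted zone ID that Route 53 uses for alias records targeting CloudFront.
CLOUDFRONT_ALIAS_ZONE_ID = "Z2FDTNDATAQYW2"


def alias_change(name: str, record_type: str, distribution_domain: str) -> dict:
    """Build one UPSERT change for an alias record to CloudFront."""
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": name,
            "Type": record_type,
            "AliasTarget": {
                "HostedZoneId": CLOUDFRONT_ALIAS_ZONE_ID,
                "DNSName": distribution_domain,
                "EvaluateTargetHealth": False,
            },
        },
    }


def apex_and_wildcard_batch(domain: str, distribution_domain: str,
                            ipv6: bool = True) -> dict:
    """Cover both the apex and the wildcard, with A and optional AAAA."""
    record_types = ["A", "AAAA"] if ipv6 else ["A"]
    names = [domain, f"*.{domain}"]  # the wildcard does not cover the apex
    return {
        "Comment": "Apex and wildcard aliases to CloudFront",
        "Changes": [
            alias_change(name, record_type, distribution_domain)
            for name in names
            for record_type in record_types
        ],
    }


batch = apex_and_wildcard_batch("example.com", "d111111abcdef8.cloudfront.net")
for change in batch["Changes"]:
    record = change["ResourceRecordSet"]
    print(record["Name"], record["Type"])
```

With boto3, a batch like this could be passed as the ChangeBatch argument of the Route 53 client's change_resource_record_sets call, together with the HostedZoneId of your own zone.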

3) DNS Delegation Management

You should carefully consider the level of DNS delegation you want to create. As explained above, try to avoid deep subdomain delegations, which can create undesirable maintenance overhead. Consider defining a hierarchy that gives authority over each hosted zone to the right owners. If you have a single team managing the entire namespace, a less granular hierarchy is better. However, if you have multiple teams and you want them to manage their own hosted zones, create delegations so that each team can manage its own hosted zones.

A common mistake is expanding your domain name structure by creating subdomains without following a consistent structure. An incorrect domain name structure can cause inconsistent DNS responses, leading to outages in your applications.

Suppose you create two public hosted zones that are parent and child:

example.com -> ApexDomain -> NS Delegation to sub.example.com

sub.example.com -> Public Hosted Zone

Then you create an NS record in the example.com hosted zone to delegate responsibility for sub.example.com records to the sub.example.com hosted zone.

If you have any sub.example.com records in the example.com hosted zone (the apex domain hosted zone), you will get inconsistent DNS responses from the authoritative name servers, and you will need to recreate these records in the sub.example.com hosted zone. Figure 3 shows an incorrect setup:

Figure 3: Incorrect delegation setup

Having sub.example.com records in both hosted zones creates ambiguity. When a DNS resolver resolves a record such as api.sub.example.com, the resolver will query either the sub.example.com name servers directly or the example.com name servers, depending on which NS (name server) records the resolver has cached at the time:

– If the resolver queries example.com name servers for api.sub.example.com, those name servers will return an answer from the example.com zone if one exists or the NS records for sub.example.com if one does not.

– If the resolver queries the sub.example.com name servers, those name servers will return the answer from the sub.example.com zone, or a negative response (NXDOMAIN or NODATA) if no record exists. The answer may also differ from the one expected, as shown in Figure 3.

To ensure consistency, create all sub.example.com records in the sub.example.com zone, create an NS record in the example.com hosted zone to delegate responsibility to the sub.example.com hosted zone, then remove any sub.example.com records from the example.com zone.
This ensures that resolvers always get answers for sub.example.com from the sub.example.com hosted zone and that DNS queries are always answered unambiguously.
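
One way to audit a parent zone for this problem is to look for records at or below the delegated name that are not part of the delegation itself. The following Python sketch works on a plain list of (name, type) pairs. The function name and input format are assumptions for illustration, and in practice you would feed it the record sets listed from your parent hosted zone.

```python
def conflicting_records(parent_records, delegated_zone):
    """Return parent-zone records that belong in the delegated child zone.

    parent_records: iterable of (record_name, record_type) pairs.
    delegated_zone: the delegated name, for example "sub.example.com".
    """
    zone = delegated_zone.lower().rstrip(".")
    conflicts = []
    for name, record_type in parent_records:
        normalized = name.lower().rstrip(".")
        at_cut = normalized == zone
        below_cut = normalized.endswith("." + zone)
        # NS (and DS, with DNSSEC) at the delegation point belong in the parent.
        if at_cut and record_type in ("NS", "DS"):
            continue
        if at_cut or below_cut:
            conflicts.append((name, record_type))
    return conflicts


parent = [
    ("example.com.", "NS"),
    ("example.com.", "SOA"),
    ("www.example.com.", "A"),
    ("sub.example.com.", "NS"),      # the delegation itself, expected
    ("api.sub.example.com.", "A"),   # should live in the child zone
]
print(conflicting_records(parent, "sub.example.com"))
```

Any record the function returns should be recreated in the sub.example.com hosted zone and then removed from the parent.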

4) TTL Management

DNS uses caching to improve performance and reduce load on each part of the recursion tree. To allow DNS administrators to control that caching, each record specifies a time-to-live (TTL) value, and the start of authority (SOA) record specifies a TTL for caching negative answers such as NXDOMAIN. Although not all resolvers or clients honor the TTL, most do, so it is important to choose an appropriate TTL for your workloads. We recommend leaving the negative caching TTL at 900 seconds, which is the default when creating a hosted zone in Route 53.
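
Under RFC 2308, the negative caching TTL that resolvers apply is the lower of two values: the TTL of the SOA record itself and the minimum field inside the SOA record. The short Python sketch below shows that calculation. The function name is hypothetical, and the sample values (a record TTL of 900 and a minimum field of 86,400) are assumed to match a newly created Route 53 hosted zone, so confirm them against your own SOA record.

```python
def negative_cache_ttl(soa_record_ttl: int, soa_minimum: int) -> int:
    """Return the negative caching TTL in seconds, per RFC 2308."""
    return min(soa_record_ttl, soa_minimum)


# Assumed defaults: SOA record TTL of 900, SOA minimum field of 86400.
print(negative_cache_ttl(soa_record_ttl=900, soa_minimum=86_400))
```

This means that lowering only one of the two values is enough to shorten negative caching, while raising it requires raising both.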

Choosing the TTL for individual records is a trade-off with a number of factors. The first is how often you are likely to change the record. NS records do not change very often, so you can keep a long TTL, such as the default of 172,800 seconds (2 days), to reduce the cost of queries from clients and improve performance. For other records that point to a production endpoint, you will probably want a TTL between 60 seconds and 300 seconds (5 minutes). The lower the TTL, the faster you can redirect clients to a different endpoint. However, you will see more queries from resolvers, which can result in higher latency for clients reaching your application. It is also worth noting that clients typically only perform a DNS request when initiating a session. Thus, if a client has a long-lived TCP session that is active or has an extended timeout period, you may not see that client move to a different endpoint even after you have updated the DNS record, failed over, or weighted away, and the TTL has expired.
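
To make the trade-off concrete, the following Python sketch uses a deliberately simple model: every resolver cache honors the TTL and re-queries as soon as its entry expires. The function name and the figure of 1,000 resolver caches are assumptions for illustration, and real query volume depends on client behavior and on how resolvers share caches.

```python
import math

SECONDS_PER_DAY = 86_400


def ttl_tradeoff(ttl_seconds: int, resolver_caches: int = 1) -> dict:
    """Estimate staleness and query volume for one record under a given TTL."""
    if ttl_seconds <= 0:
        raise ValueError("ttl_seconds must be positive")
    return {
        # A cache filled just before a change serves the old answer this long.
        "max_staleness_seconds": ttl_seconds,
        # Upper bound if every cache refreshes immediately after each expiry.
        "max_queries_per_day": resolver_caches
        * math.ceil(SECONDS_PER_DAY / ttl_seconds),
    }


for ttl in (60, 300, 172_800):
    print(ttl, ttl_tradeoff(ttl, resolver_caches=1000))
```

In this model, moving from a 300-second TTL to a 60-second TTL cuts the worst-case redirect time by a factor of five and multiplies the query upper bound by the same factor.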

Conclusion

With Amazon Route 53, you can register, host, and resolve your domains in a scalable, reliable way. This post reviewed the best practices you can use to further enhance your experience on Route 53. Review your existing domains and consider migrating them to Route 53 if you haven’t already, and adopting the best practices outlined here.

About the Authors

Scott Morrison

Scott is a senior specialist solutions architect for networking at AWS, where he helps customers design resilient and cost-effective networks. Scott loves to code in his spare working hours to solve unique problems. When not working, Scott is often found either in the desert outside of Las Vegas off-roading or occasionally playing in a poker tournament.

Renato Gentil

Renato is a Senior Technical Account Manager based in Ireland with expertise in Route 53. Renato holds the AWS Networking Specialty certification and has been working on large-scale Route 53 and resilience scenarios with different customers around the globe. In his spare time, Renato loves playing Uno with his children, watching a movie, or playing bass guitar.