Sunday, September 18, 2016

Network Infrastructure in Multiple Regions and Impacted Dependent Services. West Europe

I got a warning in my Azure subscription with the heading "Network Infrastructure in Multiple Regions and Impacted Dependent Services". The detailed message is:




Azure SQL Database services were able to handle the large number of requests from our customers quickly enough to process them seamlessly, and Azure SQL Database customers would have seen recovery in every region except Central US. Unfortunately, Azure SQL Databases in the Central US region were overwhelmed by requests that came in at a higher rate than expected, which resulted in availability impact to the Azure SQL Database service in that region. The Azure SQL Database service team was engaged promptly and identified that the high rate of incoming requests was preventing the Azure SQL Database services from recovering. The team limited the volume of requests to a level the service could handle seamlessly, and confirmed that all requests were being processed normally by 17:15 UTC. Affected HDInsight and Media Services in the Central US region were fully recovered shortly after.

CUSTOMER / SLA IMPACT: Customers may have experienced degraded service availability for multiple Azure services listed in “Impacted Services” above when connecting to resources or services that have a dependency on the recursive DNS services. We estimated that the availability of Azure SQL Database and DW, and of HDInsight and Media Services that depend on them, was reduced by approximately 60% due to the impact of the recursive DNS issue. After the recursive DNS issue was mitigated, a subset of our customers using Azure SQL Database and DW resources in the Central US region, and services that depend on Azure SQL Database and DW in that region, may have continued to experience impact.

WORKAROUND: No workaround was available during the initial impact period from 11:18 UTC to 13:00 UTC. For the subsequent outage of Azure SQL Database and DW in the Central US region, customers who had configured active geo-replication could have minimized downtime by failing over to a geo-secondary, at a cost of less than 5 seconds of lost transactions.
Please visit https://azure.microsoft.com/en-us/documentation/articles/sql-database-business-continuity/ for more information on these capabilities.

AFFECTED SUB REGIONS: All Regions

ROOT CAUSE: The root cause of the initial impact was a software bug in a class of network device used in multiple regions, which incorrectly handled a spike in network traffic. This resulted in legitimate DNS requests being incorrectly identified as malformed, including requests from Azure services to resolve the DNS names of any internal endpoint or external endpoint to Azure from within Azure. The subsequent Azure SQL Database issue in the Central US region was triggered by a large volume of requests arriving before the Azure SQL Database service had fully recovered and could process them, which resulted in availability impact to the Azure SQL Database service in that region.

Azure SQL Database and DW and their customers make extensive use of DNS, because the connection path to Azure SQL Database and DW requires 2 DNS lookups. All Azure SQL Database and DW connection requests are initially handled by an Azure hosted service called the control ring. This is the IP address referenced by the DNS record .database.windows.net. The control ring tracks which Azure hosted service currently hosts the requested database or data warehouse, and returns the DNS name of that service to the client in the form ...worker.database.windows.net. The client then performs a DNS lookup to connect to that location. For some customers (those connecting from outside Azure), the control plane proxies the entire connection, and thus performs the DNS lookup itself. Internal connections to databases and data warehouses, for instance to perform service management operations and geo-replicate transactions, act as clients to other databases and data warehouses and thus go through the same 2 lookups.
We estimate that during the outage, DNS lookups failed at a rate of approximately 75%; for Azure SQL Database and DW this meant that approximately 6% of connections succeeded on the first try.

NEXT STEPS: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future, and in this case it includes (but is not limited to):

1) Azure Network Infrastructure: The network device bug fix will be released in all regions once testing and validation are completed [Status – in progress]
2) Azure Network Infrastructure: Improve alerting to detect DNS service failures more quickly and minimize the time to resolve [Status – in progress]
3) Azure Network Infrastructure: Set new configurations to bypass the network device bug [Status – Completed]
4) Azure SQL Database/DW: Reduce dependency on DNS by increasing the TTL for most records maintained by Azure SQL Database and DW (instance and server names change rarely, and only on service management operations, so the low TTL is unnecessary) [Status – in progress]
5) Improve resiliency options for our customers to be able to minimize downtime. This includes Azure services that have a dependency on the DNS services used by Azure services [Status – in review]

In addition, we continue working on the following remediation actions that were identified during the Azure SQL Database and DW incident on September 12th. We are committed to completing these items as soon as possible to help avoid any further interruption. We again apologize for any impact you may have experienced due to this issue.

1) Run multiple active/active control rings to avoid a single point of failure in a region. [Status – in progress]
2) Document additional control ring IPs and later provide an easy-to-manage IP tagging mechanism using Azure Network security groups. [Status – in progress]
3) Automate detection of full control ring health and failover to standby. After item #1 this becomes a move of traffic to the healthy rings. [Status – in progress]
4) Evaluate enhancements to Quality-of-Service traffic management scenarios [Status – in progress]
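
The two figures in the root cause (about 75% of DNS lookups failing, about 6% of connections succeeding on the first try) are consistent with each other if each connection needs 2 lookups and the lookups fail independently. Here is a minimal sketch of that arithmetic. The function names are my own, and the independence assumption (for lookups and for retries) is mine, not something the message states.

```python
def first_try_success(lookup_failure_rate: float, lookups_per_connection: int = 2) -> float:
    """Probability that every DNS lookup on the connection path succeeds.

    Assumes lookups fail independently (an assumption, not stated in the notice).
    """
    per_lookup_success = 1.0 - lookup_failure_rate
    return per_lookup_success ** lookups_per_connection


def success_within(attempts: int, per_attempt_success: float) -> float:
    """Probability that at least one of `attempts` tries connects.

    Assumes retries are independent, which a real outage may not satisfy.
    """
    return 1.0 - (1.0 - per_attempt_success) ** attempts


p = first_try_success(0.75)  # 0.25 * 0.25
print(f"first-try success: {p:.4f}")  # 0.0625, i.e. roughly the 6% quoted
print(f"success within 10 tries: {success_within(10, p):.2f}")
```

Under the same assumptions, a client would need about 1 / 0.0625 = 16 attempts on average to get one connection through, which gives a feel for why retries piled up on the service during the outage.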


 
