20180917: Appendices – Cloud-ready DCs

Last updated: November 15, 2023
Audience: IT Staff / Technical Decision Makers

This page holds appendices for the ‘NETID DCs ready for cloud’ analysis.

Assumptions

The following are assumed:

  • NETID domain controllers are a high-value target which if compromised would represent a severity=1 incident and reputation loss for the UW, i.e. compromise is to be avoided if at all possible
  • Design changes to NETID domain controllers potentially impact a lot of systems and applications, some of whose configuration we can not know
  • Some imagined/desired use cases can be met without involving the NETID domain controllers
  • Having a reproducible central solution is highly desired because it eliminates duplication and improves the quality of the solution
  • We are not yet to the point where the capabilities provided by the NETID domain controllers have lost their value, but we can imagine a point in the future where we retire the NETID domain
  • The primary capability and value provided by the NETID domain is device management. Active Directory authentication and authorization are secondary capabilities which are leveraged because an involved device needs to be managed. For this reason, Active Directory authentication and authorization capabilities are not considered strategic to the UW, but rather of tactical importance. To illustrate, if an application needed to integrate & it supported Active Directory or SAML claims based authentication & authorization, the UW would recommend Shibboleth, not Active Directory.

Background

In 9/2006, UW-IT deployed the NETID domain as a central Active Directory for the UW. Initially, the capabilities provided were limited to authentication and authorization via trust. In 2/2009, the NETID domain was moved from the public internet to private IP address space for two reasons. The first reason was to improve the user authentication experience when the user’s computer was off-campus–the experience took 5+ minutes for a logon. The second reason was to limit remote attacks on the domain controllers from public internet addresses and to align with Microsoft recommendations about not running AD on the public internet. In 8/2010, the broader set of usual Active Directory capabilities was released via Delegated OUs. Delegated OU adoption has been steady for 8 years, with ~150 departments with OUs today, and more than 23K computers joined to the domain.

The expected lifetime of this problem and the needed solution is also worth considering. Microsoft has not indicated any expected lifetime for Active Directory, but anxiety about its lifetime is present in the marketplace. Indirect indicators such as the level of investment in new Active Directory features (none in WS2019), Microsoft’s strong push to cloud-based MDM workstation management, and shifts toward cloud-friendly architectures for Microsoft’s own Windows Server based application workloads are good examples of reasons why customers have this anxiety. As mentioned in the assumptions section, we can imagine a point in the future where the capabilities provided by the NETID domain controllers have lost their value. At the time this analysis is being written, a best estimate of when that might be is 7 years from now in 2025, with an estimated 50% confidence. However, the indicators previously mentioned suggest that the amount of use should start to decrease because Microsoft is subtly shifting its core use cases away from Active Directory. So this solution may only be necessary for 5-10 years, and there should be some recurring analysis of the use cases that need it to determine whether it should be turned off and alternate solutions implemented for those use cases. In other words, carefully managing the onboarding, lifecycle, and expectations of customers who use any solution to this problem seems like a wise course of action.

Possible Solutions & Analysis

  • Table 1 summarizes which of the following possible solutions address the three representative use cases.
  • Table 2 contains a summary of the key traits of each of the possible solutions.

Footnotes for the tables are as follows:

1 Computers can be joined to a domain but by itself, this solution does not allow you to authenticate to the NETID domain. Whether a P2S VPN can solve all issues for the use case depends on specifics we can’t generally speculate on.

2 Microsoft considers running DCs on the public internet very poor security practice. Whether this makes it unsupported is arguable, but we’ve gone ahead and considered it unsupported.

3 There is a small increase in security risk by extending the size of the UW network, but that is considered negligible here.

4 There is a small increase in risk by placing a firewalled NETID DC on the public internet, but that is considered negligible here.

5 While Microsoft has a whitepaper covering this option, it requires configuration that is not supported.

6 This is insecure like a), but it is less insecure because folks on the internet can not directly get at the NETID domain.

7 There is some added cost: at least 2 DCs and some ongoing maintenance, but this is considered negligible here.

8 There are some significant customer support impacts. Adding IPv6 global addresses is supported by Microsoft, but presumably is considered not best or good practice. But for our purposes here, we’ve chosen to list this as Supported: Yes.

Analysis

The following are possible solutions which address one or more of the 3 representative use cases.

  1. The most direct solution would be to place NETID DCs directly on the internet. However, it is not generally believed to be best practice to put your Active Directory domain controllers on the public internet, and in some cases Microsoft product managers and documented guidance have stated this point of view. The reasons behind this point of view are:
    1. Kerberos, NTLM and LDAP protocols, especially in combination with lateral escalation, are not generally considered strong enough to withstand persistent attack on the internet
    2. The volume of security log audits increases by several factors, and this makes getting useful information from those logs much harder
    3. The volume of activity increases, meaning you may run into more operational problems, all to effectively service the same workload
    4. Some client platforms poorly handle AD authentication, leading to 5+ minute login delays that very significantly affect valid users connecting over the internet

    Some of these issues might be mitigated, but only at considerable cost and increased risk. Further discussion of the possible mitigations, associated costs, and risks happens below.

  2. Microsoft generally recommends a site to site (S2S) VPN for this problem. ExpressRoute is one such S2S VPN solution specific to Azure, but there are others. One of the problems with a VPN solution is that it doesn’t necessarily scale well–you are likely to need at least one S2S VPN for each cloud vendor, and possibly multiple per cloud vendor, because in the higher education environment, you often have a distributed IT management model with dozens to hundreds of Azure subscriptions for a single university.

    There are a variety of variations within this solution, including siting NETID DCs in the cloud, siting other AD domains in the cloud with a trust to NETID, and others. We won’t dive too deeply into those variations as they all assume a S2S VPN. The nice quality about this solution is that once it is in place it is as if the cloud services are on-premises–this translates to a good user experience.

    It is worth noting that some of the problems with scaling could be mitigated with an Azure subscription strategy change (and likewise for other cloud vendors), making cloud vendor subscriptions shared instead of distributed. There is also some possibility of mitigating scaling issues by leveraging cloud vendor network routing between cloud vendor subscriptions to share centrally-provided S2S VPNs.
  3. Historically there are many large enterprises with geographically diverse locations without private connectivity between them, and a solution for this scenario is what we will call “replication bridgehead DCs in a DMZ”. In other words, siting a heavily locked down domain controller on public IP space which is not generally usable, but which can replicate to other similar DCs. This solution allows discrete sets of usable domain controllers (called AD Sites) on otherwise disparate networks without incurring a significant security or usability risk. The earliest mention by Microsoft of this solution (and a partial endorsement of it) is at https://docs.microsoft.com/en-us/previous-versions/windows/it-pro/windows-2000-server/bb727063(v%3dtechnet.10).

    If the UW were to use this solution, we’d need one partially firewalled NETID DC on a public IP in Seattle, at least one partially firewalled NETID DC in each cloud vendor plus an additional 1 or 2 NETID DCs in each cloud vendor. This solution avoids the need for S2S VPNs, replacing them with additional NETID DCs. This solution may be better than multiple S2S VPNs each run by multiple university IT departments, but it is not necessarily any better than a centrally-provided & shared S2S VPN solution. It has the same scaling problems: you trade VPN servers for DC servers, and you complicate your AD design significantly.

    Another significant issue to overcome which is not addressed by Microsoft’s paper is the risk associated with having a DC with multiple IP addresses–this is discouraged for a variety of reasons. Some of the contortions needed to support dual-homed domain controllers are covered at https://blogs.msmvps.com/acefekay/2009/08/17/multihomed-dcs-with-dns-rras-and-or-pppoe-adapters/. Further analysis of possible mitigations will not be attempted here as that resource goes into great depth on why this potential solution is not really viable.
  4. Microsoft provides Entra ID Domain Services as a general hybrid bridge solution for this problem. This solution is not always suitable for problems of class #3, but is otherwise very promising. Some use cases of class #3 really do require joining to the NETID domain, and the AAD-DS domain is not the NETID domain. But some use cases of class #3 would be fine with the AAD-DS domain because the users & groups in that domain would include sidHistory of the respective NETID users and groups.

    One issue with this solution is that it is not generally possible for a tenant of our size, but Microsoft has offered a solution for this problem which can not be publicly discussed.

    With this solution we could choose to recover costs from customers who need it.
  5. In some cases, use of alternative technologies such as federated authentication or hybrid authentication can avoid the need for direct NETID integration from the cloud. Whether the application is designed ready for SAML or more modern authentication protocols depends on the application. This solution is highly dependent on the flexibility of the scenario & applications, but it is very cost-effective, and from a strategy perspective is very desirable because architecturally it is very cloud-ready.

    Put simply, this should always be the first solution proposed, with the recognition that it will have a low likelihood of being viable.

    Depending on the specifics, there are other flavors of federated authentication or hybrid authentication that are not SAML/Shibboleth integration.

    SQL is a core Microsoft technology which frequently underlies a perceived need to join to the NETID AD. Azure SQL has almost 100% feature compatibility with SQL Server. Azure SQL supports Entra ID authentication via modern or password (legacy) authentication, and when ADFS is part of your Entra ID tenant authentication architecture it also supports Integrated Windows Authentication. SQL Linked Servers from on-premises SQL servers to Azure SQL are also supported, although it appears that you must use a service account to query Azure SQL (i.e. trusted sub-system model). Where possible, when a cloud-based use case involves SQL and believes it needs to join the NETID AD, Azure SQL should be considered instead.

    Where the use case requires access to an on-premises website using Integrated Windows Authentication, use of the Entra ID Application Proxy can provide a cloud-based endpoint (with optional additional AAD security protections) to cloud-based components that need to interact with it. With this solution, an Entra ID token is obtained and used to get a Windows-based token and access to the on-premises web application via the proxy.
  6. One might use what Microsoft calls the forest trust model to join computers which need to be joined to an Active Directory, but not necessarily to the NETID AD. Note that this solution can not directly solve use case #2 because you can not authenticate to the NETID domain without direct connectivity. So this solution, by itself, is incapable of solving any of the use cases. But in combination with other solutions, it can. For example, with a point to site VPN such as Husky OnNet, the computer joined to this publicly available domain can authenticate to the NETID domain. Whether this constitutes a working solution for use case #2 depends heavily on the specifics of the use case.

    If the UW were to use this solution, we’d need one or more ADs (each with 2+ DCs) on the internet. Those ADs would need to trust the NETID domain, meaning they’d need at least one DC on the campus network in public IP space. Whether there was only one “public” AD or many would depend on the need to segregate risks from various use cases. Whether the public AD(s) were run by departments or centrally is dependent on university strategy. With this solution there would be fixed costs and risks, meaning it may be more attractive than many of the other solutions, but it is not suitable for problems of class #1 or 3, and whether it is suitable for use case #2 depends on specifics.

    This solution is similar to d). It differs from d) in the following ways:
    1. It can not directly solve use case #1 because it will not have all UW NetID users and Microsoft AD-DS LDAP does not support chained requests or binding
    2. Microsoft runs and secures d). The related operational costs/risks may be significant and Microsoft has better insight and staffing to support those operational costs/risks.
    3. We have a lot more control of the AD-DS in this solution. The solution in d) has a number of limitations which may make it non-workable for specific scenarios.
  7. We might deploy AD-LDS on the public internet in a “user proxy” configuration. This solution would amount to having an LDAP directory which uses the NETID domain for authentication, but which can hold unique directory data different than what is in the NETID domain. More about this configuration is documented here: https://technet.microsoft.com/en-us/library/2008.12.proxy.aspx.

    This solution could meet use case 1. This solution would require minor additional operational costs. Since authentication is being proxied, there is a minor increase in security risk, but this would be no different than any other computer joined to the NETID domain which is located on the public internet. Whether this solution would meet the specific instances of use case 1 depends on whether only authentication integration is required. If authorization or additional directory information is required, then this solution is unlikely to help. In most cases solution d) is likely to be superior because we don’t have to expend any labor to support it, the overall costs are comparable, and a greater amount of functionality is provided by solution d).
  8. We might assign global IPv6 addresses to the NETID DCs, but leave the IPv4 addresses as is. Global IPv6 addresses are reachable/located on the public internet.

    However, Windows Firewall enforces the same restrictions on IPv6 as it does on IPv4, so those IPv6 addresses would not be useful, unless we also removed and superseded the built-in/vendor firewall rules for Active Directory. Making those changes is highly inadvisable and not supported by the vendor.

    There is another significant problem with this idea–Windows computers will prefer IPv6 if present. As soon as we add an IPv6 address to the NETID DCs, all Windows computers that interact with the NETID DCs will prefer those addresses. This makes the IPv4 addresses–and specifically that they remain private–a moot point.

    So in the end, this idea is effectively the same as possible solution a), just a more convoluted path with a higher likelihood of customer confusion and support problems. For this reason, further analysis of possible mitigations to improve the characteristics of this potential solution will not be attempted.
  9. Virtual directories are a capability highly lauded by Gartner about 5 years ago. RadiantLogic, OptimalIdM, Oracle, and other vendors provide these solutions. They provide a traditional LDAP protocol endpoint which proxies connection to one or more directories (and in some cases SQL and other technologies), making the other directories appear to be a single endpoint. Virtual directories provide additional capabilities, like filtering or searching across multiple directories via what appears to be a single query from the client’s perspective.

    We might deploy a virtual directory on the internet in front of the NETID domain as a possible solution. Unfortunately, this would only meet use case #1, because virtual directories do not support AD domain join or the usual AD member computer operations. So this solution is very similar to g) (AD-LDS), but with additional costs to the UW to purchase the solution.

    If one of these vendors added support for domain join, this solution would become very interesting.
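
The IPv6 preference problem described for the global IPv6 solution can be illustrated with a minimal sketch. It assumes a simplified rule in which a dual-stack client tries any IPv6 address before any IPv4 address; real Windows address selection is more nuanced than this. The function name and the addresses (a private IPv4 address and an IPv6 documentation-range address) are hypothetical placeholders, not NETID values.

```python
import ipaddress

def assumed_client_order(addresses):
    """Order candidate DC addresses under the simplifying assumption
    that a dual-stack client tries IPv6 addresses before IPv4 ones."""
    parsed = [ipaddress.ip_address(a) for a in addresses]
    # Stable sort: IPv6 (key 0) ahead of IPv4 (key 1); ties keep input order.
    ordered = sorted(parsed, key=lambda a: 0 if a.version == 6 else 1)
    return [str(a) for a in ordered]

# Hypothetical DC with its existing private IPv4 address plus a new global IPv6 address.
dc_addresses = ["10.0.0.5", "2001:db8::5"]
print(assumed_client_order(dc_addresses))  # ['2001:db8::5', '10.0.0.5']
```

Under that assumption the private IPv4 address is never the first choice once an IPv6 address exists, which is why keeping IPv4 private stops mattering.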

Analysis Round-up

Only 4 possible solutions are capable of meeting all of the representative use cases–solutions a, b, c, and h. Each of those options has downsides, with cost as a consistent trait. Differentiators among those solutions are whether they are secure and supported, with only solution b) having both those favorable qualities. So from a simple analysis, the existing policy of recommending the use of a S2S VPN appears to be valid, although centrally provided S2S VPN(s) would improve the situation.

Given the goal desired by UW’s CTO, we need to dig deeper to determine whether a more complex combination of solutions is superior, or whether acceptable mitigations would turn one of these solutions into a more favorable one.

If we analyze the desirability of solutions by use case, using the traits captured and noted characteristics of each solution, the following ordering emerges:

For use case #1, the solutions in order of preference are: Federated integration (e), AAD-DS (d), AD-LDS (g), Virtual Directory (i), VPN (b), DCs on internet (a), IPv6 global (h)

For use case #2, the solutions in order of preference are: AAD-DS (d), VPN (b), forest trust model (f), bridgehead (c), DCs on internet (a), IPv6 global (h)

For use case #3, the solutions in order of preference are: VPN (b), bridgehead (c), DCs on internet (a), IPv6 global (h)

If we amalgamate the solutions based on this threading, the following generalized order of preference emerges: Federated integration (e), AAD-DS (d), AD-LDS (g), Virtual directory (i), VPN (b), forest trust model (f), bridgehead (c), DCs on internet (a), IPv6 global (h). In other words, when presented with any use case, we might try this list of solutions in this order. If the solution isn’t valid, then we proceed to the next. A takeaway from this seems to be that given demonstrated customer interest we should explore centrally enabling Entra ID Domain Services (AAD-DS).
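
The "try the list in order" approach can be sketched in a few lines. This is a minimal illustration, not a decision tool: the per-use-case orderings are copied from the lists above, while the function name and the viability check are hypothetical stand-ins for the case-by-case judgment this analysis describes.

```python
# Preference order per use case, copied from the analysis above.
PREFERENCE = {
    1: ["e", "d", "g", "i", "b", "a", "h"],
    2: ["d", "b", "f", "c", "a", "h"],
    3: ["b", "c", "a", "h"],
}

NAMES = {
    "a": "DCs on internet", "b": "VPN", "c": "bridgehead",
    "d": "AAD-DS", "e": "Federated integration", "f": "forest trust model",
    "g": "AD-LDS", "h": "IPv6 global", "i": "Virtual directory",
}

def first_viable(use_case, is_viable):
    """Return the first solution letter, in preference order, that the
    caller-supplied is_viable check accepts, or None if none qualifies."""
    for letter in PREFERENCE[use_case]:
        if is_viable(letter):
            return letter
    return None

# Hypothetical example: a use case #1 application that cannot use
# federated authentication (e) and for which AAD-DS (d) is unavailable.
ruled_out = {"e", "d"}
choice = first_viable(1, lambda s: s not in ruled_out)
print(choice, NAMES[choice])  # g AD-LDS
```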

Next we turn to whether mitigations might improve the characteristics of the two most favorable solutions that cover all use cases. Since this does require a greater amount of detail, we have chosen only those 2 solutions which have the greatest perceived potential, namely a) (DCs on the internet) and b) (S2S VPNs).

Possible mitigations for DCs on the Internet – more in-depth analysis

The most undesirable trait of this solution is security impact. Almost no one is willing to speculate on what protections would be adequate to place domain controllers on the internet, but that is the chief task of this section. There are several other undesirable traits of this potential solution, like cost, and unfortunately, any attempt to mitigate the security issues greatly increases that cost.

Security

A common security-oriented suggestion is to only use Read Only Domain Controllers (RODCs) in your DMZ network. The thinking is that this limits the exposure when an attacker finds a way to compromise the read-only domain controller. Unfortunately, this kind of thinking isn’t really about maintaining security but rather about reducing the losses of an inevitable compromise. In reality, it is usually possible to compromise the writable domain controllers from a compromised RODC, so using an RODC is not a significant mitigation, especially if we must allow all users to have access to the RODC. But using RODCs does provide some protections, so we’ll use it along with a long list of others. Here is a list of mitigations that would be needed:

  • Deploy RODC on public IP
    • Limit DNS publishing of RODC to internal only DNS views. Require that all clients of the RODC use a hosts file for routing.
    • Require that every user who needs access to the RODC go through some kind of process to get provisioned, with a lifecycle attestation process for each user to remain provisioned, e.g. you automatically lose RODC access after 1 year if you don’t do something to stay provisioned
      • This process will require some custom solution to make it self-service
    • Administration of the RODC must be handled via separate admin accounts, so we’ll need a new admin UW NetID type: zadm. To limit lateral compromise risks, the zadm credential must only ever be present on the RODC
  • Implement full PAWS infrastructure for tier/ring zero of our NETID AD. This is required to prevent accidental exposure of credentials that would lead to full compromise, because our risk profile is heightened via exposing a RODC on the internet
  • Implement full PAM for AD privileged groups, with extremely limited or no exceptions. Requiring just-in-time activation of privileged AD group roles will help to prevent full compromise with our heightened risk profile
  • Implement PAWS and PAM infrastructure for tier/ring one to mitigate risks to Managed Servers. This work is not yet analyzed so estimates have low accuracy. The scope of individuals impacted is also larger than for tier zero, but much of the infrastructure needed for tier zero can be re-used.
  • Locate RODC on network with firewall independent of the host. Allow access from public IPs only via a gating mechanism with a lifecycle attestation process for each IP to remain permitted. This requires customers who want to use the RODC to register their IP and attest annually to retain access.
    • This process will require some custom solution to make it self-service
    • Experiment with blocking port 389/3268 completely–if possible, don’t allow connections to these ports because the traffic is not encrypted
  • Monitor and analyze failed authentications to the RODC to identify possible attacks. When an attack is detected, automatically remove access from that client.
    • This process will require some custom solution to analyze and automatically remove access
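
Two of the custom pieces named in the list above, annual attestation expiry and removal of access after repeated failed authentications, can be sketched as follows. This is a rough illustration under stated assumptions, not a design: the one-year period comes from the list above, but the failure threshold, the data shapes, the function names, and the documentation-range IP addresses are all hypothetical.

```python
from collections import Counter
from datetime import date, timedelta

ATTESTATION_PERIOD = timedelta(days=365)  # "lose access after 1 year", per the list above
FAILED_AUTH_THRESHOLD = 20                # hypothetical; the analysis names no threshold

def expired_registrations(last_attested, today):
    """Return registered entries (users or IPs) whose last attestation
    is more than one attestation period old."""
    return sorted(k for k, when in last_attested.items()
                  if today - when > ATTESTATION_PERIOD)

def clients_to_block(failed_auth_events, threshold=FAILED_AUTH_THRESHOLD):
    """Return client IPs with at least `threshold` failed authentications.
    Each event is a (client_ip, username) pair."""
    counts = Counter(ip for ip, _user in failed_auth_events)
    return sorted(ip for ip, n in counts.items() if n >= threshold)

# Hypothetical registrations and failed-authentication events.
registrations = {"192.0.2.10": date(2017, 12, 1), "198.51.100.7": date(2018, 6, 1)}
events = [("192.0.2.10", "user1")] * 25 + [("198.51.100.7", "user2")] * 3

print(expired_registrations(registrations, date(2019, 1, 1)))  # ['192.0.2.10']
print(clients_to_block(events))                                # ['192.0.2.10']
```

A production version would need to read the RODC security logs and drive the firewall gating mechanism, both of which are outside what this sketch covers.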

Cost

The cost to implement the security mitigations noted above is extremely large. The PAWS + PAM for tier/ring 0 are estimated separately (internal link) and are large enough that resourcing has not yet been identified despite their strategic value. In addition, we’d need to implement PAWS + PAM for tier/ring 1 and the 3 custom solutions noted above. Security mitigation costs are estimated to be very considerable.

Non-labor costs

Cost         Item                       Recurring
$10,000      RODCs (2)                  One-time
$400         SMS management of RODCs    Monthly
$1,314.24    Managed Firewall (2)       One-time
$43,000      PAWS and PAM (tier 0)      One-time
$14,628.94   PAWS and PAM (tier 0)      Annual
$58,000      PAWS and PAM (tier 1)      One-time
$11,000      PAWS and PAM (tier 1)      Annual

Costs above presume Managed Firewall can adjust to allow on-demand programmatic blocking and that we need two separate Managed Firewall instances to cover the two RODCs. If Managed Firewall can’t provide programmatic blocking, then an independent firewall solution and personnel who can support that solution will be needed which will undoubtedly be more costly.

PAWS and PAM costs are further broken out in the write-up for that work.

Labor costs

Cost       Item                                                      Recurring
$17,290    RODC operational support – 130h (.1 FTE)                  Annual
$10,640    Firewall gating mechanism support – 80h (.05 FTE)         Annual
$13,300    Failed authentication detection support – 100h (.1 FTE)   Annual
$23,940    User provisioning support – 180h (.15 FTE)                Annual
$79,800    PAWS and PAM (tier 0) – 600h                              One-time
$93,100    PAWS and PAM (tier 0) – 700h (.5 FTE)                     Annual
$40,000    PAWS and PAM (tier 1) – 300h                              One-time
$38,000    PAWS and PAM (tier 1) – 3500h (.2 FTE)                    Annual

Total cost

One-time: $232,114.24

Annual: $226,698.94
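
As a cross-check, the totals above can be reproduced from the non-labor and labor line items. The sketch below copies the amounts and recurrence from the two tables and assumes monthly items are annualized by multiplying by 12; the row layout and function name are illustrative only.

```python
from decimal import Decimal

# (cost, item, recurrence) rows copied from the non-labor and labor tables above.
ROWS = [
    ("10000",    "RODCs (2)",                               "one-time"),
    ("400",      "SMS management of RODCs",                 "monthly"),
    ("1314.24",  "Managed Firewall (2)",                    "one-time"),
    ("43000",    "PAWS and PAM (tier 0)",                   "one-time"),
    ("14628.94", "PAWS and PAM (tier 0)",                   "annual"),
    ("58000",    "PAWS and PAM (tier 1)",                   "one-time"),
    ("11000",    "PAWS and PAM (tier 1)",                   "annual"),
    ("17290",    "RODC operational support",                "annual"),
    ("10640",    "Firewall gating mechanism support",       "annual"),
    ("13300",    "Failed authentication detection support", "annual"),
    ("23940",    "User provisioning support",               "annual"),
    ("79800",    "PAWS and PAM (tier 0) labor",             "one-time"),
    ("93100",    "PAWS and PAM (tier 0) labor",             "annual"),
    ("40000",    "PAWS and PAM (tier 1) labor",             "one-time"),
    ("38000",    "PAWS and PAM (tier 1) labor",             "annual"),
]

def totals(rows):
    """Return (one_time_total, annual_total); monthly items count 12 times a year."""
    one_time = sum(Decimal(c) for c, _, r in rows if r == "one-time")
    annual = sum(Decimal(c) for c, _, r in rows if r == "annual")
    annual += 12 * sum(Decimal(c) for c, _, r in rows if r == "monthly")
    return one_time, annual

one_time, annual = totals(ROWS)
print(one_time, annual)  # 232114.24 226698.94
```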

Summary

At greatly increased cost, these mitigations would partially neutralize the additional risks introduced by running a domain controller on the internet. However, there is definitely still an increase in the overall risk profile. This increased risk arises partly because the security risks are only partially mitigated (e.g. these mitigations do nothing to mitigate increased risk to student data) and partly because this solution introduces greater complexity, which may allow new attack vectors in unforeseen failure modes.

Even with these considerable mitigations, this solution can not be recommended.

Possible mitigations for S2S VPNs – more in-depth analysis

Cost is the chief trait which makes S2S VPNs undesirable, so mitigating overall cost is the focus here.

There are several ways to mitigate S2S VPN costs. These are:

  • Most IaaS cloud providers allow the virtual networks within their ecosystem to route between them. This would permit a centrally-provided S2S VPN with an endpoint in that vendor’s virtual networks to be shared across UW customers. There would still need to be one S2S VPN per IaaS cloud vendor.
  • Limiting the IaaS cloud providers from which you can connect to the NETID DCs is an obvious way to limit the costs associated with this solution. The best cloud provider to support would be Azure, since it provides the best Microsoft-based functionality for use cases dependent on the NETID DCs.
  • There may be ways to share the costs of a S2S VPN with those which use it. Whether this would be a flat cost per computer which needs the solution or a cost related to the bandwidth used would need more analysis. This doesn’t eliminate the cost to the university, but it would be a way to share the pain of those costs and encourage alternate solutions where they are possible.

All of these possible mitigations appear to fit with scenarios 2 or 3, although there may be some customers who are not happy with support for a single IaaS provider. Choosing a 2nd provider or reminding customers that they can set up their own S2S VPN could address that issue.

Cost

Example costs for an ExpressRoute S2S VPN with Azure are as follows (info provided by Scott Hansen):

Cost       Item                                               Recurring
$2,200     ExpressRoute provider startup                      One-time
$13,000    Layer 2 switches (2)                               One-time
$10,640    Labor to design, install & deploy switches (80h)   One-time
$11,400    2 Gbps unlimited (no metering)                     Monthly
$2,660     Labor operations for switches (20h)                Monthly

One-time Total: $25,840

Annual Costs: $168,720

If cost recovery were implemented as a mitigation, with a modest 6 customers equally sharing the annual costs, the monthly cost to each customer would be $2,343.33. This seems like a reasonable amount for a customer with a strong business need to have a hybrid cloud architecture. Speaking as a potential customer of such a hypothetical offering, both Managed Workstation and Microsoft Infrastructure would gladly pay that amount for this solution.
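
The cost-sharing arithmetic can be reproduced with a short sketch. The amounts are copied from the ExpressRoute example table above; the function name is hypothetical, and an equal split with rounding to the nearest cent is assumed.

```python
from decimal import Decimal, ROUND_HALF_UP

# Amounts copied from the ExpressRoute example table above.
ONE_TIME = [Decimal("2200"), Decimal("13000"), Decimal("10640")]
MONTHLY = [Decimal("11400"), Decimal("2660")]

def vpn_costs(customers):
    """Return (one_time_total, annual_total, monthly_share_per_customer),
    assuming the annual cost is split equally among the customers."""
    one_time = sum(ONE_TIME)
    annual = 12 * sum(MONTHLY)
    share = (annual / customers / 12).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    return one_time, annual, share

print(vpn_costs(6))  # (Decimal('25840'), Decimal('168720'), Decimal('2343.33'))
```

Changing the customer count shows how sensitive the per-customer charge is to adoption; with 12 customers, for example, the same split would give half the monthly share.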

Summary

A centrally-provided S2S VPN for Azure seems to be a plausible solution that covers all the use cases. Whether there is interest in providing a central cost-sharing infrastructure which does not align with other desired architecture directions is a decision for stakeholders.