I visited a customer recently that had an issue with RID depletion. While there is fairly good documentation on this already (see below for all the links you need), I decided to write this post to document some of the more practical aspects and to serve as a starting place for managing RID depletion, particularly in environments where there are down-level domain controllers (e.g., Windows Server 2008, 2008 R2, etc.). This post ties it all together and fills in the gaps.
If you’ve found your way here because you suspect you have run out of RIDs in a domain, then you’ve come to the right place (to start). However, the existing documentation on this topic is actually really good. I’ve compiled a list of the top reads on the topic below.
- Managing RID Pool Depletion – Ask DS Blog – Covers an overview of the problem, and a good snapshot of the issues pre-Windows Server 2012.
- Managing RID Issuance – High-level information about the changes to RID issuance in Windows Server 2012 and the capability that was (largely) back-ported to Windows Server 2008 R2 via a KB update
- KB2618669 – KB and update to prevent the rIDSetReferences issue causing unexpected depletion
- KB2642658 – KB and update for RID issuance changes on Windows Server 2008 R2 (particularly, allowing global unlock)
These are the major ones you need to know.
The other thing to mention before I get going is that if you are experiencing this, and especially if you have actually run out of RIDs, you should get in touch with us immediately as this is an extremely serious condition. Prior to the unlock capabilities in Windows Server 2012, this required you to either perform a full forest recovery back to a time before the problem occurred (if you can – often the issue has been going on unchecked for a while) or abandon your domain and start over. That’s bad and I bet those are not things you want to suffer through alone.
So let’s get started – there are three sections below, with the key points summarised at the end of each.
Discovering the problem
Bob arrives to work one day, and discovers something terrible. It all started when the helpdesk reported that they can’t create users in the domain anymore. When they do, they get something like this:
Confused, and suspecting something may be wrong with Active Directory (after trying this for himself), Bob goes ahead and runs dcdiag and notices something unusual: (emphasis mine)
As per the AskDS post linked above, rIDAvailablePool is the attribute on the RIDManager$ object that stores the current domain-wide pool state. Curious, Bob decides to check on the attribute to find out what’s going on.
He fires up LDP.exe, Connects to a DC (any DC) and then Binds (both options on the Connection menu). He then chooses View > Tree and opts to view the domain NC (where the RIDManager$ object lives):
Then, he navigates to the right spot, and double clicks RIDManager$:
Oh, wow – 4611686015206162431 – that’s not good, and here’s why: the attribute is actually a large integer comprising a high part and a low part. The high part is the maximum number of RIDs in the domain (1073741823), and the low part is the actual place the pool is up to – think of it like a bookmark recording where the RID pool allocations are at. So the low part should definitely be way less than 1073741823; if it reaches that number, the RID master has issued all available RIDs for the domain.
You can find this out for yourself using the Large Integer Converter, handily built right into LDP: (Utilities > Large Integer Converter)
Yikes! Again, I want to reiterate that if this is you, contact support for advice.
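If you'd rather not click through LDP, the same split can be sketched in PowerShell. This is a minimal sketch, assuming PowerShell 3.0 or later (for the -shr operator). The function name ConvertFrom-RidAvailablePool and its output property names are my own invention, not anything built in. It treats the value as a 64-bit integer, takes the upper 32 bits as the ceiling and the lower 32 bits as the bookmark.

```powershell
# Split a rIDAvailablePool value into its two 32-bit halves.
#   High part: the maximum number of RIDs in the domain.
#   Low part:  where the RID pool allocations are up to (the "bookmark").
# Function and property names are hypothetical, chosen for this example.
function ConvertFrom-RidAvailablePool {
    param([int64]$Value)

    $high = $Value -shr 32            # upper 32 bits
    $low  = $Value -band 4294967295   # lower 32 bits (mask is 2^32 - 1)

    [pscustomobject]@{
        TotalRids     = $high
        NextRid       = $low
        RemainingRids = $high - $low
    }
}

# The value Bob found in LDP:
ConvertFrom-RidAvailablePool -Value 4611686015206162431

# To read the live value instead (needs the ActiveDirectory module; the DN below
# assumes the object lives at CN=RID Manager$,CN=System in the domain NC):
# $dn = 'CN=RID Manager$,CN=System,' + (Get-ADDomain).DistinguishedName
# ConvertFrom-RidAvailablePool -Value (Get-ADObject $dn -Properties rIDAvailablePool).rIDAvailablePool
```

For Bob's value, both halves come out as 1073741823, so RemainingRids is 0, which matches what the Large Integer Converter shows.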
So now we know what’s wrong with AD: the domain has experienced RID pool depletion for one reason or another.
Key Points:
- DCDiag will report on RID master information. You can run dcdiag /test:ridmanager to just run that test.
- However, it will report that the value is invalid as it doesn’t expect the pool to be exhausted (even in Windows Server 2012)
- The RIDManager$ object and its rIDAvailablePool attribute are the authoritative place to check
- Using LDP allows you to view the value, and use the built-in Large Integer Converter
Investigating the Cause
Bob sits down and realises he has run out of RIDs in one of his domains. He goes ahead and checks the rIDAvailablePool attribute in all domains and finds this is the only domain with the problem.
The domains in the environment are as follows:
There is a forest root domain, with two child domains – luckily the child domains aren’t affected; the bad news is that the forest root domain is.
Bob (and you if you’re in this position) needs to understand what caused the depletion before taking any corrective action around unlocking additional RIDs for issuance. There are a few things to consider and time may be of the essence. Since doing nothing would be a CLM (Career Limiting Move), Bob decides to leap into action:
The first thing Bob does is notify those that are likely to need to create accounts in this domain not to do so, and he suspends any automated scripts that might create users as well. If there are any unused portions of RID blocks available on DCs (even though the global pool is depleted) he’ll want to save these, as they are now a precious non-renewable resource we may need later. This is the reason why you may still be able to create accounts even though the pool is exhausted: individual DCs might have unused parts of previously issued RID blocks that they can still use.
So what could be the cause of something like this? Typically (again, as per the AskDS post) there are a bunch of normal and abnormal reasons for this. If you haven’t been creating seriously large numbers of security principals, or been doing massive domain controller promotions/demotions, you need to look towards more exotic causes. I’ve compiled a list which includes some from the AskDS post, and others from my experience below:
- Crazy provisioning scripts on spring break (lots of users, attempting to create users that don’t meet password policy, DCs being provisioned, torn down and provisioned again millions of times, etc)
- Performing a forest recovery (where an important part is artificially increasing the RID pool (we recommend 100,000) so as not to recycle any RIDs after the restore)
- Increasing the RID Block Size registry value – ahhh, my favourite. This setting controls how big the RID block is that a DC requests from the RID master. Prior to Windows Server 2012, you could set this to whatever you’d like (say, 536870911) and deplete the entire domain’s RIDs just by having two DCs with this setting. We now limit the value to 15,000 regardless of what you jam in the registry key.
- If you want to check this, the RID Block Size key lives under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\NTDS\RID Values\
- Some issue in the way that RIDs are issued or requested by DCs. Typically this would be the dreaded ‘RidSetReferences’ issue.
So, back to Bob. Let’s say he knows about the 4 points above. What he really needs to do is rule out some of these (divide and conquer) so he can be more focussed in his attempts to resolve the problem.
As part of my efforts to sit down with Bob and assist with this, I’d be interested in the following:
- What does ‘normal operation’ for security principal creation look like? (Probing for crazy scripts as per above)
- Major changes lately (e.g., last 6 months) that could be related? (Looking for major DC refresh projects, security compromise, forest recovery operations etc.)
- Have you modified the RID Block Size on any DC, ever? If so, what did you/have you set it to? (HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\NTDS\RID Values\) If unsure it may pay to go ahead and check each DC.
- What machine hosts the RID master FSMO role for the domain? What OS, how long has it held the role?
In this specific instance, this forest root domain is largely empty and doesn’t have many security principals created, and there are no provisioning scripts that operate against this domain. Bob doesn’t know if anyone has modified the RID Block Size, so he goes ahead and checks each DC’s registry key – everything looks normal.
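Checking that registry value DC by DC gets tedious with more than a handful of DCs, so here is a rough sketch of how the comparison could be scripted. The function name Find-UnusualRidBlockSize is made up for this post. The default of 500 and the idea that a value of 0 means "use the default" are my assumptions, so verify them against your own DCs. The collection step (commented out) additionally assumes the ActiveDirectory module and PowerShell remoting are available, which may not be true on older DCs.

```powershell
# Flag DCs whose 'RID Block Size' registry value differs from the default.
# ASSUMPTIONS: 500 is the default block size, and a missing or 0 value means
# the DC is using that default. Function name is hypothetical.
function Find-UnusualRidBlockSize {
    param(
        [hashtable]$BlockSizeByDc,      # DC name -> registry value
        [int]$DefaultBlockSize = 500
    )
    foreach ($dc in ($BlockSizeByDc.Keys | Sort-Object)) {
        $size = $BlockSizeByDc[$dc]
        # Skip empty/zero values; report anything else that is not the default.
        if ($size -and $size -ne $DefaultBlockSize) {
            [pscustomobject]@{ DomainController = $dc; RidBlockSize = $size }
        }
    }
}

# Collecting the values from every DC (requires the AD module and remoting):
# $sizes = @{}
# foreach ($dc in Get-ADDomainController -Filter *) {
#     $sizes[$dc.HostName] = Invoke-Command -ComputerName $dc.HostName -ScriptBlock {
#         (Get-ItemProperty 'HKLM:\SYSTEM\CurrentControlSet\Services\NTDS\RID Values').'RID Block Size'
#     }
# }
# Find-UnusualRidBlockSize -BlockSizeByDc $sizes
```

An empty result is what Bob saw here: nothing unusual. Anything that does come back deserves a conversation with whoever set it.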
However, if the environment (like Bob’s) has a large number of administrators, it can be hard to know for sure that none of these weird scenarios have happened. To rule out the possibility that the RIDs were being used in the creation of users (as opposed to some RID-wasting problem), I ask Bob to get a list of all users in the domain with their SIDs for some simple analysis: (unceremoniously stolen from Ned’s post)
Get-ADObject -Filter 'objectclass -eq "user" -or objectclass -eq "computer" -or objectclass -eq "group"' -properties objectclass,samaccountname,whencreated,objectsid,uSNCreated -includeDeletedObjects | select-object objectclass,samaccountname,whencreated,objectsid,uSNCreated | Export-CSV riduse.csv -NoTypeInformation -Encoding UTF8
This pushes out a list of principals to riduse.csv (or whatever you change it to) to give you a feel for whether users have been created. This CSV file will help you in two ways:
- Understanding if there are massive creations of principals (either planned or otherwise) occurring in the environment
- Finding out (based on the whencreated field) if there is a rough date where the RIDs that got assigned begin to accelerate suspiciously
Looking over the output, we notice the following: (note dates are DD/MM/YYYY)
One of the things that is obvious at this point is how rare new principal creation is in this domain – there are typically weeks between object creations. This makes things a little complex as we only get data points in this way when principals are created. So what does this all mean? Well, we can draw some conclusions from the above:
- In May 2012, no suspiciously high numbered RID had been issued. This doesn’t mean the problem didn't occur before this time, just that no DC had yet managed to obtain an unusual RID block and issue from it
- However, by June, at least one domain controller had started to issue RIDs with a huge jump. This, combined with our knowledge that no mass object creation has occurred, leads to the conclusion that massive depletion of the RID pool had already been occurring before this date.
- As time goes on, (off the end of the screenshot) more and more high-numbered RIDs get issued as more DCs get through their older RID blocks and request and use new ones from the RID master.
As we look further down in the CSV, you can see the number jumping wildly before finally issuing really high RIDs such as 917870602, signalling the end is nigh.
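Eyeballing the CSV works for a small domain, but the same check can be sketched in PowerShell. This is only an illustration: the function names Get-RidFromSid and Find-RidJump are mine, the threshold of 100,000 is an arbitrary number to tune, and the date parsing assumes the CSV is read under the same regional settings it was written with (remember the DD/MM/YYYY dates). It takes the RID as the last dash-separated part of each SID, walks the principals in creation order, and reports any RID that leaps far beyond the highest one seen so far.

```powershell
# Return the RID (the last sub-authority) of a SID string.
function Get-RidFromSid {
    param([string]$Sid)
    $last = ($Sid -split '-')[-1]
    [int64]$last
}

# Walk principals in creation order and report RIDs that exceed the highest
# RID seen so far by more than $Threshold (an arbitrary illustration value).
# Expects objects with whencreated, samaccountname and objectsid properties,
# such as the rows of riduse.csv. Function names are hypothetical.
function Find-RidJump {
    param(
        [object[]]$Principal,
        [int64]$Threshold = 100000
    )
    $highest = [int64]0
    # Parse() uses the current culture, matching how Export-CSV wrote the dates.
    $ordered = $Principal | Sort-Object { [datetime]::Parse($_.whencreated) }
    foreach ($p in $ordered) {
        if (-not $p.objectsid) { continue }
        $rid = Get-RidFromSid $p.objectsid
        if ($highest -gt 0 -and ($rid - $highest) -gt $Threshold) {
            [pscustomobject]@{
                WhenCreated = $p.whencreated
                Name        = $p.samaccountname
                Rid         = $rid
                PreviousMax = $highest
            }
        }
        if ($rid -gt $highest) { $highest = $rid }
    }
}

# Usage against the export from earlier:
# Find-RidJump -Principal (Import-Csv riduse.csv) | Format-Table -AutoSize
```

The first row it reports gives you a rough "no later than" date for when the depletion was already under way, which is the same conclusion we reached by reading the CSV.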
So what CAUSED this? Well – at this point the one thing from the list above that we haven’t ruled out is the RidSetReferences issue. This issue occurs in situations where the RidSetReferences attribute on domain controller computer objects is invalid or blank. When this attribute is invalid, the DC doesn’t have a pointer to its DC-specific RID Set object in AD, causing the DC to request a new RID block every 30 seconds or so. This nasty problem is documented here:
http://support.microsoft.com/kb/2618669
If you have a read of that KB, you’ll find that the behaviour explained is exactly as Bob has noticed in our example, and fits the profile based on what else is going on. How do we know if we’re affected? Well, you can manually (or programmatically) check the attribute on DCs:
The above screenshot is what a ‘good one’ should look like for my DC called BX-DC. In this instance, the rIDSetReferences attribute has a value of “CN=RID Set,CN=BX-DC,OU=Domain Controllers,DC=contoso,DC=com”, which is a reference to the RID Set object at the top of the screenshot.
However, if a DC is experiencing the problem, then it might look like this:
So here we have our broken DC – a DC missing this attribute (it’s blank). Bob has not applied this update to his 2008 R2 DCs, and is experiencing this problem. This may have been caused by someone clearing the attribute maliciously (unlikely, but possible) or by some scenario during a recovery or DC computer object deletion gone wrong. There is a situation where, if you delete a DC’s computer object from another DC, the deleted DC will reverse the deletion when replicating the change (kind of an anti-destruction mechanism), but may not restore its RidSetReferences attribute, leaving it blank.
Regardless of that, Bob needs to urgently apply this hotfix to his fleet (he, and you, should apply it to all Windows Server 2008 R2 DCs in the environment, don’t wait for the problem – just proactively apply it).
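For the programmatic route, here is a minimal sketch of the check. The function name Test-RidSetReference is hypothetical. It assumes the RID Set object sits directly under the DC's computer object with the name "RID Set", as in the BX-DC example. The commented-out loop also assumes the ActiveDirectory module and queries each DC about its own computer object.

```powershell
# Compare a DC's rIDSetReferences value with the DN we would expect:
# the "RID Set" child object directly under the DC's computer object.
# Function name is hypothetical; -eq on strings is case-insensitive.
function Test-RidSetReference {
    param(
        [string]$ComputerDn,
        [string]$RidSetReferences   # pass '' or $null when the attribute is blank
    )
    $expected = "CN=RID Set,$ComputerDn"
    [pscustomobject]@{
        ComputerDn = $ComputerDn
        Expected   = $expected
        Actual     = $RidSetReferences
        IsValid    = ($RidSetReferences -eq $expected)
    }
}

# Checking every DC in the domain (requires the ActiveDirectory module):
# foreach ($dc in Get-ADDomainController -Filter *) {
#     $c = Get-ADComputer $dc.ComputerObjectDN -Server $dc.HostName -Properties rIDSetReferences
#     $value = $c.rIDSetReferences | Select-Object -First 1
#     Test-RidSetReference -ComputerDn $c.DistinguishedName -RidSetReferences $value
# }
```

Any row with IsValid set to False is a DC worth looking at in LDP or adsiedit.msc before you do anything else. The Expected column is the DN the attribute would normally hold.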
Key Points:
- Start asking questions based on the likely causes
- Dump out security principals (even deleted ones) to look for trends and try to get a feel for when the problem started
- Proactively apply KB2618669 to all Windows Server 2008 R2 DCs NOW!
How to fix this
So, we now know what the problem is, and have a fair idea of what happened – but what’s next? The first thing you want to do is ensure that the problem is understood and won’t reoccur. There’s no point in fixing this problem if it is only going to happen again!
Next, we need to consider our options for making the domain, you know, work again. In older versions of Windows, the only options were to either:
- Perform a domain / forest recovery back to a time when you had RIDs left (in this case, 2012 – not a desirable option, since 2012 is well beyond the tombstone lifetime, which is the maximum supportable restore period)
- Abandon your domain / migrate to a new domain – only really viable if you have a RID left to create a trust with your new, non-destroyed domain.
However, in the Windows Server 2012 timeframe, we introduced some new features that were thankfully backported to Windows Server 2008 R2 (but not Windows Server 2008). Many of these are quite significant; check out http://technet.microsoft.com/en-us/library/jj574229.aspx for a good overview of these features.
The features we’ll need to get this domain working again are the global RID unlock and the periodic consumption warnings to ensure this isn’t happening again when additional RIDs are unlocked.
The goal for this domain is for us to unlock an additional bit of addressing for RIDs so we can continue allocating RIDs. However, as per the diagram in the previous section, this particular environment still has Windows Server 2003 and Windows Server 2008 domain controllers present, which don’t (at all) support RIDs above 1073741823, making the unlock worthless until they are gone.
Step One: Remove all older DCs from the domain
This means getting rid (pun intended) of any DCs that are of an earlier version than 2008 R2. A couple of caveats here though:
- If you’re out of RIDs completely and there aren’t even any left in unused RID blocks on the DCs you aren’t going to demote in this step, you will be unable to add additional domain controllers to the domain to support the load no longer serviced by the DCs you want to take out. Simply put – creating a DC requires a RID.
- You may still be able to promote DCs if you still have RIDs left on some other DCs that you haven’t uninstalled ADDS from yet
- If push comes to shove, you can always upgrade your existing DCs to Windows Server 2012, provided your existing DCs are 64-bit. For details on supported upgrade paths, check out this link.
Step Two: Ready the domain for global RID unlock
After ensuring you only have Windows Server 2008 R2 DCs and later in the domain (or perhaps you already do and can skip straight to this step), the next task is to ensure the environment is ready.
- Ensure the RID master FSMO is now hosted on a Windows Server 2008 R2 DC or later (ideally, just put it on the latest OS you can)
- Apply KB2618669 and KB2642658 to ALL Windows Server 2008 R2 DCs left in the domain – this will fix any chance of future RidSetReferences issues as well as give you the Windows Server 2012 RID issuance awesomeness described above.
- If the problem has been the RidSetReferences problem (and, perhaps even if it’s not – for good measure) check and resolve any issues with the RidSetReferences attribute in the domain.
- This means checking the attribute by connecting to each DC, as the value is not replicated
- If it’s blank, set the value to the DN of the RID Set object, which is a child object underneath the DC. You can obtain the DN of the RID Set object by viewing its properties in adsiedit.msc. Then, just set the RidSetReferences attribute on the domain controller computer object to that. For example, for the DC called BX-DC shown earlier, the value would be CN=RID Set,CN=BX-DC,OU=Domain Controllers,DC=contoso,DC=com.
Step Three: Unlock and monitor
Now, you're ready. It’s time for those managers to stop breathing down your neck.
- Take a backup of Active Directory (I’d recommend using the in-box Windows Server Backup for this, but you can use any VSS-aware backup software that can take a viable system state backup that you’re comfortable restoring from)
- Identify the machine hosting the RID master role (you’ll need this for the last step)
- Follow the steps in KB2642658, however the steps are not very clear, so I’ve documented them below:
- On the domain controller, click Start, click Run, type ldp.exe, and then click OK.
- On the Connection menu, click Connect, and then connect locally by using an enterprise administrator account
- Click Modify on the Browse menu.
- Change the sidCompatibilityVersion attribute to 1 by adding the following entry to the Edit Entry Attribute box:
- [Add] sidCompatibilityVersion: 1
- Press Enter, and then click Run
Just a note: yes, you’ll be leaving the DN field blank. Here’s a visual reference of what to do on that last window (the Modify window) as it’s confusing:
- Check for Event ID 16655
Now, and forever, closely monitor the System event log on your RID master for RID consumption events that may indicate faster-than-usual consumption of RIDs. Ideally you want these plugged into your monitoring system so that the cavalry is called in should one of those thresholds be reached. Each and every occurrence is worth investigating.
Key Points:
- Remove all DCs running versions of Windows older than Windows Server 2008 R2
- As with any change, you should back up Active Directory first
- Ready the domain by ensuring all Windows Server 2008 R2 DCs have both updates applied
- Unlock the global RID pool, and monitor carefully.
Has this post been helpful to you? Let me know below!
Until next time.
Ash.