This error message: I do not think it means what you think it means...or why AD replication is kind of important

Since it seems like we are on a roll with Kerberos-related posts, I figured I would add one to the mix as well (and use the opportunity to title the post with a line from one of the greatest movies of all time). It's also a good illustration of how useless error messages can sometimes be.

In any case, I'm sure that at some point we've all seen the dreaded "The Kerberos client received a KRB_AP_ERR_MODIFIED error from the server SERVER01$" message in our logs, followed by some stuff about SPNs (full error at bottom), and have spent time tearing our hair out trying to figure it out. The problem with that message is that it's pretty generic and doesn't really offer any insight into what actually went wrong. We know it's something Kerberos related, and the text itself makes it seem like something is being modified, but we know better than to just blindly trust a Microsoft error message.

As with many things, it's not really the resolution, but the journey that is the most interesting aspect of getting to the root cause of the issue in an environment with which you are unfamiliar. Read on past the jump.

This particular message involved an Exchange server at a DR site and a few Client Access Servers (CASs) at the main datacenter. The logs on each of the CASs were showing this error, and it was occurring on a regular basis...every hour exactly. Before we get into the usual suspects and how this error came about, let's get a little bit of insight into Kerberos and what this message means.

So how does Kerberos work, exactly, you ask? First, we have to know that Kerberos relies on three parts: the KDC (Key Distribution Center, usually your Domain Controller; it is actually two components in itself, but if you want the really nitty gritty details, you can look at this TechNet article), the Client, and the Server that the Client is trying to access. The CliffsNotes version is as follows:
1. Client authenticates to the KDC. KDC creates a TGT (ticket-granting ticket, i.e. a ticket to get tickets) for Client and sends it over.
2. Client then sends its TGT back to the KDC, tells it that it wants to access Server, and gets a brand spanking new Service Ticket, which contains information that both the Client and Server will be able to read. This Service Ticket also contains timestamp information so that it can expire at some point and not be re-used.
3. Client sends the Service Ticket over to the Server to get authenticated to its resources.

It seems like a step is being missed here, doesn't it? How does the Server know that the Service Ticket it was sent is valid? There is no step 2A that says "Server talks to the KDC to verify ticket", is there? It turns out the Server doesn't need to ask, because it already shares a secret with the KDC. Each machine that is joined to the domain has a long term key that is used in Step 2, and that key is refreshed on a somewhat regular basis (by default, a domain member changes its computer account password every 30 days). What the KDC sends back in Step 2 is encrypted in two parts: the part meant for the Client is encrypted with a session key the Client received along with its TGT (which was itself protected with the client's password hash), and the part that the Server will read is encrypted with the Server's long term key. This long term key is (in a roundabout way) derived from the password of the Server's Domain Trust Account. That account is the very reason there are Computer objects in Active Directory, and why you see "SERVER01$" in the log: the "$" at the end signifies the trust account of the Server. This will be important later.
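To make the "no step 2A" point concrete, here is a minimal sketch in Python. It is a toy and not real Kerberos: the names `ToyKDC`, `derive_key`, `seal` and `unseal` are made up for illustration, the passwords are placeholders, and an HMAC tag stands in for actual ticket encryption. It only shows that the Server can validate a ticket by itself, as long as it holds the same long term key the KDC used.

```python
import hashlib
import hmac
import json


def derive_key(password: str) -> bytes:
    """Toy stand-in for Kerberos string-to-key: just hash the password."""
    return hashlib.sha256(password.encode()).digest()


def seal(key: bytes, payload: dict) -> bytes:
    """Pretend to encrypt: attach an HMAC tag so only a key holder can validate."""
    body = json.dumps(payload, sort_keys=True).encode()
    tag = hmac.new(key, body, hashlib.sha256).hexdigest().encode()
    return tag + b"." + body


def unseal(key: bytes, blob: bytes) -> dict:
    """Validate the tag with our own key; a mismatch is the toy KRB_AP_ERR_MODIFIED."""
    tag, body = blob.split(b".", 1)
    expected = hmac.new(key, body, hashlib.sha256).hexdigest().encode()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("KRB_AP_ERR_MODIFIED")
    return json.loads(body)


class ToyKDC:
    """Holds one long term key per account, like a DC holding computer accounts."""

    def __init__(self) -> None:
        self.keys = {}

    def set_password(self, account: str, password: str) -> None:
        self.keys[account] = derive_key(password)

    def issue_service_ticket(self, client: str, server_account: str) -> bytes:
        # The server part of the ticket is sealed with the SERVER's key.
        payload = {"client": client, "service": server_account}
        return seal(self.keys[server_account], payload)


kdc = ToyKDC()
kdc.set_password("SERVER01$", "machine-password-v1")  # hypothetical password

# The server knows its own password, so it can check the ticket without the KDC.
server_key = derive_key("machine-password-v1")
ticket = kdc.issue_service_ticket("alice", "SERVER01$")
print(unseal(server_key, ticket)["client"])  # prints: alice

# A server holding a different key than the KDC used cannot validate the ticket.
try:
    unseal(derive_key("some-other-password"), ticket)
except ValueError as err:
    print(err)  # prints: KRB_AP_ERR_MODIFIED
```

The second half is the interesting bit: nothing was "modified" in transit, the two sides simply disagree about the key.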

So, going back to our cryptic Kerberos Error message, we can search around our brains and the internet and gather a list of the usual suspects:

* DNS is incorrect: we are trying to authenticate with SERVER01$ but the DNS for server01.domain.local points to some other machine.
* The SPN is a duplicate on the domain: the same SPN is registered on more than one account, so a lookup for it comes back with multiple matches and the ticket can end up encrypted for the wrong account. Kerberos itself is not even to blame. (See bottom for how to query for duplicate SPNs.)
* The server needs to be rejoined to the domain.

In our case, it was none of those. DNS was set correctly, there was a single SPN, and I wasn't about to rebuild an Exchange server, seeing as everything else seemed to be working, since I was able to RDP into the server and authenticate. Or was it?

Another post I found had me try something so seemingly simple that I overlooked it: try to connect to it from my machine directly. I fired up a command prompt window and did "net use \\server01", and it came back with a very different error message. "The Target account name is incorrect". I tried the FQDN: "net use \\server01.domain.local" and got the same error message. This at least tells us that it IS in fact authentication related, so back to blaming our favorite hound of Hades.

Next up was testing that all the domain controllers were replicating, to make sure that the DC I hit for RDP had the same info as the one that rejected my NET USE request (assuming they were even different DCs). REPADMIN and DCDIAG came back clean, with successful replications all over the place. So I logged on to a DC and tried NET USE from the domain controller directly, and still no go. I then fired up Sites and Services and saw that there were in fact two different domain controllers at the site where SERVER01 lives, and that they had replication partners over in the main datacenter (where I was). I RDPed to a DC at the same location as SERVER01, and NET USE succeeded from there. Well, now that's VERY strange. I put on my monocle, got out my magnifying glass, and looked into their AD architecture a bit more closely. With as many DCs as this organization had, it would have been easy to miss the fact that while there was replication FROM the main datacenter to the DR site (where SERVER01 is), there was no replication FROM the DR site to the main datacenter.

Remember a while back I said that the long term key is regenerated on a regular basis and is associated with the trust account of the server? Well, that key is stored on the Domain Controllers. SERVER01 had generated a new key, and the DC at its site knew about it, but that information never replicated back to the main datacenter. The discrepancy between the key held by the DC I was using and the key held by the DR site's DC was causing Kerberos authentication to fail. Since the DR DCs had not replicated back...well...ever, the datacenter DCs considered the DR DCs' info tombstoned and didn't want to accept it. There was some magic to be done with changing the AD tombstone lifetime and allowing replication with divergent partners, but in the end all was well.
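The failure mode is easier to see as a small model. The sketch below is a simplification I made up for illustration, not how AD replication is actually implemented: the `DomainController` class, the `replicate` and `can_authenticate` helpers, the DC names, and the `(version, key)` tuples are all hypothetical. It assumes only that replication copies the newer version of an account's key in one direction per call.

```python
from dataclasses import dataclass, field


@dataclass
class DomainController:
    name: str
    # account name -> (version, key); a higher version means a newer password
    machine_keys: dict = field(default_factory=dict)


def replicate(source: DomainController, target: DomainController) -> None:
    """One-way replication: target takes any entry that is newer on source."""
    for account, (version, key) in source.machine_keys.items():
        current = target.machine_keys.get(account)
        if current is None or version > current[0]:
            target.machine_keys[account] = (version, key)


def can_authenticate(issuing_dc: DomainController, account: str, server_key: str) -> bool:
    """A ticket is only readable if the DC used the key the server currently holds."""
    entry = issuing_dc.machine_keys.get(account)
    return entry is not None and entry[1] == server_key


main_dc = DomainController("MAIN-DC01", {"SERVER01$": (1, "key-v1")})
dr_dc = DomainController("DR-DC01", {"SERVER01$": (1, "key-v1")})

# SERVER01 changes its machine password against the DC at its own (DR) site.
dr_dc.machine_keys["SERVER01$"] = (2, "key-v2")
server_key = "key-v2"

# Replication only runs main -> DR, so the new key never travels back.
replicate(main_dc, dr_dc)

print(can_authenticate(dr_dc, "SERVER01$", server_key))    # prints: True
print(can_authenticate(main_dc, "SERVER01$", server_key))  # prints: False

# Once DR -> main replication works, the main datacenter catches up.
replicate(dr_dc, main_dc)
print(can_authenticate(main_dc, "SERVER01$", server_key))  # prints: True
```

This matches what I saw: NET USE worked from a DC at the DR site and failed from the main datacenter, even though every replication check I ran from the main side reported success.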

So remember kids: make sure you are properly replicating your AD infrastructure, or you might just have random problems that don't even show up right away.

Full Message:
"The Kerberos client received a KRB_AP_ERR_MODIFIED error from the server server01$. The target name used was host/server01.local.domain. This indicates that the target server failed to decrypt the ticket provided by the client. This can occur when the target server principal name (SPN) is registered on an account other than the account the target service is using. Please ensure that the target SPN is registered on, and only registered on, the account used by the server. This error can also happen when the target service is using a different password for the target service account than what the Kerberos Key Distribution Center (KDC) has for the target service account. Please ensure that the service on the server and the KDC are both updated to use the current password. If the server name is not fully qualified, and the target domain (local.domain) is different from the client domain (local.domain), check if there are identically named server accounts in these two domains, or use the fully-qualified name to identify the server."

How to find duplicate SPNs:
1. Open up "ldp.exe" (comes by default on Win 7, Server 2008+)
2. Connection -> Connect. Select any domain controller.
3. Connection -> Bind. Use either your own credentials or any service account.  It only needs read permissions.
4. View -> Tree. Select the BaseDN to be your main domain.
5. Browse -> Search. Under filter, put in "serviceprincipalname=[what the error message said]", in this case "serviceprincipalname=host/SERVER01.domain.local".
6. Select "subtree", then hit run. Then close the window.
7. In the main window, you should see something like "Getting 1 entries:" followed by the matching entries. If that number is more than 1, then you have a duplicate SPN, and you'll need to use setspn.exe (part of the Resource Kit tools, or built in on the latest OSs) to remove the SPN from the account that shouldn't have it.
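If you have already exported account names and their SPNs (from ldp or anything else), the duplicate check itself is simple to script. Here is a rough sketch in Python; the function name `find_duplicate_spns`, the input shape (a dict of account name to SPN list), and the sample accounts are all assumptions for illustration, and it compares SPNs case-insensitively on the assumption that AD treats them that way.

```python
from collections import defaultdict


def find_duplicate_spns(accounts: dict) -> dict:
    """Return {spn: [accounts]} for every SPN registered on more than one account."""
    owners = defaultdict(set)
    for account, spns in accounts.items():
        for spn in spns:
            # Normalize case so HOST/Server01 and host/server01 count as the same SPN.
            owners[spn.lower()].add(account)
    return {spn: sorted(names) for spn, names in owners.items() if len(names) > 1}


# Hypothetical export: account name -> servicePrincipalName values.
exported = {
    "SERVER01$": ["HOST/SERVER01", "HOST/server01.domain.local"],
    "svc-legacy": ["host/SERVER01.domain.local"],
    "SERVER02$": ["HOST/SERVER02"],
}

print(find_duplicate_spns(exported))
# prints: {'host/server01.domain.local': ['SERVER01$', 'svc-legacy']}
```

An empty result means every SPN in the export lives on exactly one account, which is the same answer as "Getting 1 entries:" in ldp.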
