Over the last few weeks, I’ve been working on a Microsoft Lync pilot at work. One of the requirements was external federation. This feature basically allows us to use instant messenger (IM) between users in both locations. So for example, you are CompanyA and you do business on a regular basis with CompanyB and are both using Lync. Federation allows you to add each other to your Lync clients, and talk to each other.
The configuration and implementation went pretty smoothly, but I was having intermittent issues with federation. The problem came up when adding an external company that hadn’t explicitly added us to their federated domains list. Initially we had dismissed the issue as a firewall issue because we got federation working with some consultants, however I was later asked to add a vendor and started seeing the same issues.
After some testing, I wasn’t getting any closer, so I enabled the client logging options in the Lync client. Those are found under Options, General, and “Turn on Logging in Lync”. This writes a log file to a Tracing folder under your user profile directory (C:\Users\username\Tracing). When I started digging into the logs, some errors popped out at me.
The other error that popped out at me was:
Without doing much digging, the first suggests the request I’d sent to the vendor couldn’t be routed, and the second reports that no error explaining why it couldn’t be routed was returned from the remote side. This made me think it was a potential firewall issue again. Doing the basic testing of validating that our Edge server was accepting incoming connections, and I could connect to the vendors edge server, I eliminted the firewall being an issue. This got me really scratching my brain.
I ran a SIP stack trace from the Lync Edge server and saw more unusual errors, such as “504 Server time-out”. This was beginning to frustrate me, I had confirmed that both servers could talk to each other, why were they getting timeouts?
I decided to go back to basics, start at the very bottom. First thing was connectivity, that we already established by using telnet to the servers. The next was DNS. Lync, like a lot of Microsoft products, takes advantage of Service Records (SRV) in DNS. This record tells the requesting client the protocol, port, and host to connect to. In this case, the Edge server is looking for the SRV record for the entry _sipfederatiltls._tcp.sipdomain.com. The response should look something like this:
So the protocol is TCP, the port is 5061, and the server I need to connect to is sipexternal.sipdomain.com. I ran a check against our domain, and the vendors domain, and both came back with records. Except, with them being so close together on the screen, I immediately spotted an issue.
Ignoring the different in 300 and 3600, the Time-To-Live of the record, the next difference was the port numbers. Looks like I made a simple transposition of numbers. I did a quick test from outside the firewall, and confirmed that 5601 was not open. I went back through the firewall change tickets, and confirmed I had requested 5061, and the Lync configuration was also set to 5061.
A quick DNS change for the SRV record, fixing the port, and within 10 minutes I received 3 notifications from the vendor that they had staff adding me to their contact lists.
One of the things I’ve come to learn over the years, whenever there is something awfully quirky going on, and you cannot quite figure out what’s causing it, take a look at DNS. I’ve had a number of issues that have resulted in being a simple DNS issue. In this case, it was simple human error, but boiled down to be DNS saying one thing, and it should have been something else.