Wednesday, December 03, 2008

Facebook vs DNS

Sometime back, about a couple of weeks ago, my Facebook page loads all of a sudden started getting very slow (like 20 seconds or so before the data started loading, but once it did start loading it was fast). It was only happening at Facebook (Google, WheresGeorge, Blogger, pretty much any other site) was working fine, so I thought the problem had to be at Facebook rather than on my side.

However, after it kept up for a week, I started to get irritated enough to dig into it. First I turned off my web proxy and went directly to the sites from my browser. Things worked fine then, so clearly it was an issue in my proxy. I run a Fedora Linux server at home that serves as my web proxy using the Apache HTTP daemon.

This past weekend, I started digging into the problem and spent several hours debugging, testing, searching the web and while I still don't have a clear reason as to the why, I do understand the what and have put together a somewhat nasty hack around the problem. Hopefully I will dig around and find or figure out what the problem is so that I can put in a good fix.

My first look at the server didn't show anything amiss. The httpd logs showed the accesses to Facebook with no errors. That led me to consider DNS as this felt like what you get when your DNS is timing out.

My /etc/resolv.conf file was clean and correct. Using the nslookup or dig tools, I was able to look up the names without problems and quite quickly on both my own name server as well as the name servers provided by my ISP. The system logs didn't show any problems in named or anything that looked like the firewall could be getting in the way.

However, using any other tool (telnet, wget, httpd) the name look ups would go through several failures before succeeding -- causing a substantial delay in accessing the site. This only happened with Facebook related sites (www.facebook.com and apps.facebook.com to mention two of them). The same tools, accessing any other site that I tried, had no problems and no delays.

Using strace, I could see that the first pass at the name service look ups were failing and each timing out after so many seconds before trying the next. Eventually, the tools go back and try again and the second time, the response comes back almost immediately and the tool continues. For example, "wget http://www.facebook.com" returned the following:

01     0.000106 socket(PF_INET, SOCK_DGRAM, IPPROTO_IP) = 3
02     0.000068 connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("127.0.0.1")}, 28) = 0
03     0.000076 fcntl64(3, F_GETFL)       = 0x2 (flags O_RDWR)
04     0.000054 fcntl64(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0
05     0.000042 gettimeofday({1227974358, 62163}, NULL) = 0
06     0.000048 poll([{fd=3, events=POLLOUT, revents=POLLOUT}], 1, 0) = 1
07     0.000059 send(3, "\0079\1\0\0\1\0\0\0\0\0\0\3www\10facebook\3com\0\0\34"..., 34, MSG_NOSIGNAL) = 34
08     0.000861 poll([{fd=3, events=POLLIN}], 1, 5000) = 0
09     4.998266 socket(PF_INET, SOCK_DGRAM, IPPROTO_IP) = 4
10     0.000065 connect(4, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("66.36.226.50")}, 28) = 0
11     0.000071 fcntl64(4, F_GETFL)       = 0x2 (flags O_RDWR)
12     0.000046 fcntl64(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0
13     0.000041 gettimeofday({1227974363, 61621}, NULL) = 0
14     0.000046 poll([{fd=4, events=POLLOUT, revents=POLLOUT}], 1, 0) = 1
15     0.000053 send(4, "\0079\1\0\0\1\0\0\0\0\0\0\3www\10facebook\3com\0\0\34"..., 34, MSG_NOSIGNAL) = 34
16     0.000098 poll([{fd=4, events=POLLIN}], 1, 3000) = 0
17     2.998500 socket(PF_INET, SOCK_DGRAM, IPPROTO_IP) = 5
18     0.000070 connect(5, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("207.228.225.50")}, 28) = 0
19     0.000073 fcntl64(5, F_GETFL)       = 0x2 (flags O_RDWR)
20     0.000045 fcntl64(5, F_SETFL, O_RDWR|O_NONBLOCK) = 0
21     0.000043 gettimeofday({1227974366, 60548}, NULL) = 0
22     0.000045 poll([{fd=5, events=POLLOUT, revents=POLLOUT}], 1, 0) = 1
23     0.000052 send(5, "\0079\1\0\0\1\0\0\0\0\0\0\3www\10facebook\3com\0\0\34"..., 34, MSG_NOSIGNAL) = 34
24     0.000118 poll([{fd=5, events=POLLIN}], 1, 6000) = 0
25     5.997342 gettimeofday({1227974372, 58108}, NULL) = 0
26     0.000050 poll([{fd=3, events=POLLOUT, revents=POLLOUT}], 1, 0) = 1
27     0.000054 send(3, "\0079\1\0\0\1\0\0\0\0\0\0\3www\10facebook\3com\0\0\34"..., 34, MSG_NOSIGNAL) = 34
28     0.000416 poll([{fd=3, events=POLLIN}], 1, 5000) = 0
29     4.997778 gettimeofday({1227974377, 56418}, NULL) = 0
30     0.000063 poll([{fd=4, events=POLLOUT, revents=POLLOUT}], 1, 0) = 1
31     0.000055 send(4, "\0079\1\0\0\1\0\0\0\0\0\0\3www\10facebook\3com\0\0\34"..., 34, MSG_NOSIGNAL) = 34
32     0.000106 poll([{fd=4, events=POLLIN, revents=POLLIN}], 1, 3000) = 1
33     0.001235 ioctl(4, FIONREAD, [34])  = 0
34     0.000065 recvfrom(4, "\0079\201\202\0\1\0\0\0\0\0\0\3www\10facebook\3com\0\0\34"..., 1024, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("66.36.226.50")}, [16]) = 34

As you can see, the delays come waiting for a response from the nameserver and it's not until the second try on the second name server (lines 31-34 before we get a response. You might think that this has something to do with my name server on 127.0.0.1, but that wasn't originally in my /etc/resolv.conf file until I started the debugging and the problem still occurs when I remove it.

A similar trace of the dig command shows that the first name server (whether it be 127.0.0.1 or my ISPs) resolves the name almost immediately (though dig uses a different communications method (sendmsg vs send) and different networking libraries.

Traces for wget with other host names return successfully on the first lookup.

I haven't (yet) figured out what exactly is causing this. But I have figured out two workarounds (neither of which are all that nice):

  • Set one of Facebook's name servers as the first name server in my resolv.conf file (so my applications use that name server to resolve all host names.

    This does work (name resolutions worked first try and in very reasonable times). However, name servers are core trusted parties in your network access and I really don't like setting things up so that I totally trust Facebook's server for all of my outgoing name service look ups. Call me paranoid, but this one just isn't right for me.

  • Add www.facebook.com and apps.facebook.com host entries to my /etc/hosts file (which is checked before name service look ups.

    This definitely works, though it does remove the usefulness of DNS from my access to Facebook (like if they change their IP address I won't know). However, it is the lesser evil of the two solutions I have found so far and so this is what I've done for now.

I'll post an update if I figure out exactly what's wrong (which I'm very unhappy about not being able to figure out so far -- I like being able to understand things and spent several hours after I had workarounds trying to figure it out to no avail).

Tags : / /

2 comments:

rdeacon said...

I run an apache proxy too and ran into the same issue. The issue can be resolved by disabling the IPv6 on your box. Seems that the facebook DNS servers have an issue with IPv6.

Conor P. Cahill said...

Yup. That fixes the problem, though I don't like that I still don't know exactly what was going on, nor why it worked fine for nslookup and dig but not for other tools (presumably it was something in the dns library that they were using).

If you're wondering how to do this, here is a clear set of directions on how to disable ipV6 in Fedora.

Thanks!