While configuring some of the internal services that we host for external access through our NGINX proxy VM, I started noticing some strange behaviour. Every once in a while, when requesting a page that was being passed through the proxy, the proxy server would respond with a `502 Bad Gateway`. It turned out that there were some issues with the resolver module for NGINX. I'll detail how I fixed it below.
We have two nameservers at SANBI that handle our internal and external records. The external records are synced off to Cloudfloor. The IP addresses of both nameservers were used as the resolver in each of the `server` blocks, under the `location` block for our services:
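As a rough illustration of this kind of per-location setup, here is a minimal sketch. The addresses and hostnames are placeholders, not the real SANBI values, and the `set`/`proxy_pass` pattern is an assumption about how the upstream was referenced. NGINX consults `resolver` at request time when the `proxy_pass` target contains a variable.

```nginx
server {
    listen 80;
    server_name service.example.org;             # placeholder public hostname

    location / {
        # Both nameservers listed directly in the location block
        # (192.0.2.x are documentation addresses, not real ones).
        resolver 192.0.2.10 192.0.2.11;

        # Hypothetical internal hostname; using a variable makes NGINX
        # resolve it at request time via the resolver above.
        set $upstream_host internal-service.example.internal;
        proxy_pass http://$upstream_host;
    }
}
```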
This seemed to be a fairly normal configuration from what I'd been reading online. However, I was seeing these `502` errors intermittently. For example, after restarting the `nginx` service, the first request to a hostname would result in a `502`, the next few would not, and then a `502` would show up again after some time had passed. I dug into the error logs of the proxy server and saw a suspicious-looking line:
You know that old haiku about DNS, right?
I tried a couple of things, mainly with the proxy settings and the way upstream servers were specified. In a nutshell, what didn't work for me was:
- Setting the resolver globally.
  - The exact same issue existed when this was done.
- Setting the IP addresses of the upstream (internal) server to connect to manually.
  - This alleviated the issue of connecting, but we can't assume that these IP addresses will stay the same, since they can belong to short-lived VMs.
- Various bloody proxy module settings for NGINX.
This was clearly an issue with the way that NGINX was resolving hostnames from the nameservers that we have. It seemed that the server was failing to resolve the name on the first attempt and then resolving it correctly afterwards, caching the entry (from what I understand) for the record's TTL by default.
The stupid solution would be to just set `resolver <ns address> valid=10000000s;` or something dumb, but I was looking for a proper answer to this problem. What I ended up doing was installing `dnsmasq` on the proxy server. I explicitly disabled the DHCP stuff to be safe and specified my internal nameservers as its upstream servers.
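A minimal sketch of what such a `dnsmasq` configuration might look like is below. It assumes the config lives in `/etc/dnsmasq.conf`, that `dnsmasq` should only listen on the loopback address, and it uses placeholder nameserver addresses. The exact options used on the real proxy server may well have differed.

```conf
# /etc/dnsmasq.conf (sketch; addresses are placeholders)

# Only answer DNS queries on the loopback address.
listen-address=127.0.0.1
bind-interfaces

# Do not offer DHCP or TFTP on this interface. DHCP is also inactive
# unless a dhcp-range is configured, so this is belt and braces.
no-dhcp-interface=lo

# Ignore /etc/resolv.conf and forward only to the servers listed here.
no-resolv
server=192.0.2.10
server=192.0.2.11
```

With something like this in place, `dnsmasq` acts as a small local caching forwarder in front of the internal nameservers.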
I then changed the global resolver in the `/etc/nginx/nginx.conf` file to point to `127.0.0.1` and made sure that none of my `server` blocks had any residual `resolver` directives. Once the `dnsmasq` and `nginx` services were restarted, everything worked without a hitch 😃
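Put together, the NGINX side of this change might look roughly like the following sketch. The hostnames are placeholders, and the `set`/`proxy_pass` pattern is an assumption rather than the actual production config. The point is just that the only `resolver` line lives at the `http` level and points at the local `dnsmasq` instance.

```nginx
# /etc/nginx/nginx.conf (sketch)
events {}

http {
    # Single global resolver: the dnsmasq instance on this machine.
    resolver 127.0.0.1;

    server {
        listen 80;
        server_name service.example.org;         # placeholder public hostname

        location / {
            # No resolver directive here; the http-level one is inherited.
            set $upstream_host internal-service.example.internal;
            proxy_pass http://$upstream_host;
        }
    }
}
```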
I suspect that it has something to do with latency of reaching the name server, but I haven’t investigated thoroughly so take that with a pinch of salt!