Diagnosing and Fixing a Failing HTTPS API Request

Introduction

This post documents how I recently diagnosed and fixed a failing HTTPS API request.

Background

A user reported problems with the Android app. The app appears to have difficulty communicating with the server (the API gateway), and every API call shows the user a generic connection error.

Observation 0

  1. It cannot be reproduced locally. None of my devices (phones and browsers) has any trouble communicating with the server.
  2. The log server shows no entries from either nginx or the backend API servers for the user’s IP.

Hypothesis 1

Hypothesis: There is a problem with the user’s environment. Let’s say it is the network layer.

Reasoning: It is possible that the API gateway’s AWS Elastic IP (public IP) is on a blacklist at the user’s ISP, or that the two networks cannot reach each other for some other reason.

Observation 1

The user is on WiFi and I have his outgoing public IP. ping and traceroute from the API server’s network to the user’s network show that the IP is reachable and the latency is very consistent. The network connection is stable. Hypothesis 1 is wrong.

Hypothesis 2

Hypothesis: There’s a problem at the transport layer that is causing the TCP connection from the user’s device to the API server’s port 443 to fail.

Reasoning: I once ran into a problem where my ISP blocked TCP packets solely because of the target port, which I found out using tcptraceroute. An ISP or a firewall can block TCP packets for certain ports, although it would be quite unreasonable to block legitimate HTTPS traffic.
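Whether a TCP handshake to port 443 completes at all can be checked with a few lines run from the affected network. This is only a sketch under assumptions: tcp_port_open is a hypothetical helper name, and unlike tcptraceroute it does not show where packets are dropped, only whether the connection succeeds.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP handshake with host:port completes within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, DNS failures and unreachable networks.
        return False
```

For example, tcp_port_open("api.example.com", 443) would be called with the real gateway hostname in place of the placeholder api.example.com.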

Observation 2

I ask the user to connect to the API server using his mobile browser. The browser has no issue connecting, and the corresponding logs show up in the log server. TCP connections from his network to port 443 therefore work. Hypothesis 2 is wrong.

Hypothesis 3

Hypothesis: There’s a problem with DNS.

Assumption: The mobile browser uses DoH (DNS over HTTPS), while the app uses the OS’s DNS configuration, which probably points to the router’s DNS server when the device is connected to a WiFi network.

Reasoning: An ISP or another institution may censor internet access through DNS. That would be a possible explanation for observation 2. Although it is unlikely that an average user has DoH enabled, it is always good to rule out DNS issues.

Observation 3

I use dig to send a DNS query to the server at the user’s outgoing public IP. It returns the correct IP of the API gateway.

(This does not conclusively rule out DNS as the user’s problem, but it is good enough for now.)
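The same kind of check, asking one specific DNS server for an A record and comparing the answer with the expected gateway IP, can be sketched with the standard library. This is a simplified illustration rather than a replacement for dig: it speaks plain UDP only, does not verify the response ID, and ignores truncated responses. The function names are hypothetical.

```python
import socket
import struct
from typing import List


def build_query(hostname: str, query_id: int = 0x1234) -> bytes:
    """Build a DNS query for the A record of hostname (recursion desired)."""
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = hostname.rstrip(".").split(".")
    qname = b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN


def _skip_name(msg: bytes, offset: int) -> int:
    """Return the offset just past a (possibly compressed) domain name."""
    while True:
        length = msg[offset]
        if length == 0:
            return offset + 1
        if length & 0xC0 == 0xC0:  # compression pointer, always 2 bytes
            return offset + 2
        offset += 1 + length


def parse_a_records(msg: bytes) -> List[str]:
    """Extract the IPv4 addresses from the answer section of a DNS response."""
    qdcount, ancount = struct.unpack("!HH", msg[4:8])
    offset = 12
    for _ in range(qdcount):
        offset = _skip_name(msg, offset) + 4  # skip QTYPE and QCLASS
    addresses = []
    for _ in range(ancount):
        offset = _skip_name(msg, offset)
        rtype, rclass, _ttl, rdlength = struct.unpack("!HHIH", msg[offset:offset + 10])
        offset += 10
        if rtype == 1 and rclass == 1 and rdlength == 4:
            addresses.append(socket.inet_ntoa(msg[offset:offset + 4]))
        offset += rdlength
    return addresses


def query_a(hostname: str, server: str, timeout: float = 3.0) -> List[str]:
    """Ask one specific DNS server for the A records of hostname."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(hostname), (server, 53))
        response, _ = sock.recvfrom(512)
    return parse_a_records(response)
```

Calling query_a with the API hostname and the address of the resolver under suspicion returns the list of IPs that resolver hands out, which can then be compared with the gateway's real IP.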

Hypothesis 4

Hypothesis: There’s a problem with SSL.

Reasoning: Modern browsers ship their own list of CAs instead of using the OS’s preloaded CAs, which would explain why the browser can connect while the app cannot.

I did not consider this earlier because there is no SSL error with my app, my mobile browser, or my desktop browser, and I have a Nagios monitor for my SSL certificate’s expiry.

However, I once faced a problem where some clients had SSL errors and some others didn’t, because I installed a bad CA bundle in nginx. Therefore, it is possible that only some users experience SSL errors.
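To see whether a failure comes from the network or from certificate verification, a probe can attempt the TLS handshake and classify the exception. The sketch below is an assumption-laden illustration: probe_tls and classify_failure are hypothetical names, and passing an older CA bundle as cafile only approximates what a client with an outdated trust store would experience.

```python
import socket
import ssl
from typing import Optional


def classify_failure(exc: BaseException) -> str:
    """Map an exception from a connection attempt to a coarse failure category."""
    # Order matters: SSLCertVerificationError subclasses SSLError,
    # and SSLError subclasses OSError.
    if isinstance(exc, ssl.SSLCertVerificationError):
        return "certificate"
    if isinstance(exc, ssl.SSLError):
        return "tls"
    if isinstance(exc, OSError):
        return "network"
    return "unknown"


def probe_tls(host: str, port: int = 443, cafile: Optional[str] = None,
              timeout: float = 5.0) -> str:
    """Try a verified TLS handshake and return "ok" or a failure category.

    With cafile=None the system trust store is used; with a path, only the
    CA certificates in that file are trusted.
    """
    context = ssl.create_default_context(cafile=cafile)
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host):
                return "ok"
    except Exception as exc:
        return classify_failure(exc)
```

A result of "ok" with the system trust store but "certificate" with an older bundle would point to the situation described above, where only some clients see SSL errors.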

Observation 4

I test the site with SSL Labs’ SSL Server Test and find a problem with certain certification paths, which contain an expired root CA certificate. At this point the connection between this and the reported issue is unclear, but at least it is a clue.

After I install an updated CA bundle, the SSL server test reports no errors.
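Once the notAfter date of each certificate in a bundle is known (for example from the certificate details shown by the SSL server test), checking for expired entries is straightforward. A minimal sketch, assuming the dates are in the "Mon DD HH:MM:SS YYYY GMT" form that Python's ssl module uses; the function names are hypothetical.

```python
import ssl
import time
from typing import Optional


def seconds_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Return the seconds left before a certificate's notAfter time.

    not_after uses the format of SSLSocket.getpeercert()["notAfter"],
    e.g. "Jan  5 09:34:43 2018 GMT". A negative result means expired.
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return expires_at - current


def is_expired(not_after: str, now: Optional[float] = None) -> bool:
    """Return True if the notAfter time is in the past (or exactly now)."""
    return seconds_until_expiry(not_after, now) <= 0
```

For instance, is_expired("May 30 12:00:00 2020 GMT") returns True when run today; the time of day in that string is a placeholder, not the real certificate's value.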

Hypothesis 5

Hypothesis: The user’s device is so old that its OS does not come with an up-to-date CA list pre-installed.

Observation 5

The user agent in the HTTP logs shows that the user’s device is a 2014 model. I also find a Namecheap blog post about a Sectigo SSL root certificate that expired on May 30, 2020. The expiry affects clients older than 2015, such as the user’s Android 5 device.
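The same user-agent check can be applied to the rest of the HTTP logs to find other clients that might be affected. The sketch below is an assumption, not a rule: the cutoff of Android 5 comes only from this one user's device, so newest_affected_major is a hypothetical parameter to adjust, and the function names are made up for illustration.

```python
import re
from typing import Optional

# Matches "Android 5", "Android 5.0" or "Android 5.0.2" inside a user-agent string.
ANDROID_VERSION = re.compile(r"Android (\d+)(?:\.\d+)*")


def android_major_version(user_agent: str) -> Optional[int]:
    """Return the Android major version from a user agent, or None if absent."""
    match = ANDROID_VERSION.search(user_agent)
    return int(match.group(1)) if match else None


def may_have_stale_ca_store(user_agent: str, newest_affected_major: int = 5) -> bool:
    """Flag user agents whose Android version is at or below the assumed cutoff."""
    major = android_major_version(user_agent)
    return major is not None and major <= newest_affected_major
```

Running may_have_stale_ca_store over each logged user agent gives a rough list of clients worth contacting or testing.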

Confirming the fix

Now that the cause of the problem is clear and the fix is deployed, the user confirms that the app no longer shows an error.

Conclusion

Despite a lot of suspicion about the user’s device and environment, the root cause is tracked down cleanly with a step-by-step approach and careful reasoning. This is a good reminder to myself that before blaming the user, I should make sure the fault is not mine. More often than not, it is.