WordPress on Azure: troubleshooting “Error establishing a database connection”
This is a bit of an aide-memoir for my future self, but might also be useful for others.
Recently, my blog had some unexpected downtime. And not just a few seconds either, people started noticing. Not good. I’d recently published a blog post about a new Teams App I’d written, and so plenty of people were landing on my site to see what it was all about. And all I was giving them was this nasty error:
Error establishing a database connection
Panic, Troubleshoot
It’s a classic reaction to seeing a bug in the wild: unconsidered panic. Something must be wrong with the database, or with the database connection string.
I host this blog in Microsoft Azure, using hosted WordPress. WordPress uses a mySql database to store content, which Azure doesn’t (didn’t?) support, but they use a trusted third-party (ClearDB) to host for them.
So, I log into my ClearDB instance. There’s definitely still a database, and it’s online, and I can see from the gauges that it’s doing something. It’s not pegged for performance or memory or anything but it’s definitely working.
I go and check the database connection strings in the WordPress configuration. I don’t know why I did this: they haven’t changed in years. But I do it anyway, because, panic.
All this time I’m madly hitting F5, clearing my cache, using different browsers to try and make the site come back.
After a while it does. I don’t know why. Then it goes again. Then it comes back. And, it’s inconsistent. It might work on one browser and not another. It felt like some limit was being placed on connections, but why, where?
And, breath…
I go and get a drink and start to more rationally think through what’s happening. The database is there, but there’s an error establishing a connection to it. What would cause that? Maybe a network connectivity issue between Azure and ClearDB. I check Azure Status – there’s nothing there.
Second thought, maybe something to do with contention. I go back to the ClearDB dashboard and find the list of connections. There are 15 simultaneous connections. 15 is a suspiciously nice, round number.
Something clicked in my brain (note to ClearDB: it would be cool if you could make this bit more obvious!). I check my account plan. For my class of database, I’m limited to…15 connections!
So, that’s the problem.
The easy fix would be to increase the plan and the number of connections, but there was a fairly significant price hike between the two plans. Besides, I’m not fixing the problem, I’m just pushing it down the road a bit.
Now, I don’t actually know very much about WordPress. I see 15 concurrent database connections. I don’t know if that means there are 15 people concurrent looking at my site, and WordPress is using one connection per visitor, or whether WordPress uses more than one but uses them across visitors. Based purely on what I saw happening, I’m thinking the second – it seems that just running my site uses around 6 database connections.
Going to back Azure, I was making use of the Scale-Out feature. This clever, money-saving idea means that you can set CPU thresholds on your App Service. When the limit is reached, another instance of your Service is created to handle the increased throughput. Once the load drops, the instance count drops back to 1. This scale up and down happens automatically and means that you only pay for the increased power when you need it.
The history of scale actions is all recorded as well, making it really handy for troubleshooting problems like this, where you want to look at lots of different sets of data to try and figure out what happened.
What was happening (I think) is that as load increased, a new instance would be created. A new App Service instance meant another 6 database connections. Now there are 12 concurrent connections. I scale to 2 instances fairly regularly and don’t see a problem. However, scaling to 3 (which I don’t think has happened to me before) meant 18 connections, breaching my 15 connection limit and likely meaning that all requests going through that App Service instance are refused.
How to fix?
Fixing this problem will be different for each person. I could have researched into why WordPress was using so many connections and whether it’s possible to reduce that number. I could have looked to move the database somewhere that allowed more connections or reconsidered changing my database plan.
What I did instead, though, was this. Firstly, I changed my auto-scale settings to not scale and to always use 1 instance. By looking at the last 3 months of Azure data I worked out how much money that would save me, and plowed that money into Scale-Up – changing the configuration of the App Service to be more powerful (more CPU, memory etc). I also made sure that I’d get alerts if the memory or CPU were getting critical because without auto-scale I would need to do something manually if they were.
Secondly, I enabled caching in WordPress. I used a plugin which makes it easy to setup file-based caching. For a bit more load on the server, I’m trading a big reduction in database load (or, specifically for me, database connections), which should enable me to scale more effectively and respond better to surges in demand.
In a way, I’m resetting my options. In the future, if I need to, I think I could auto-scale back up to 2 instances without causing any database connection issues. I’m writing this blog post to remind me of the perils of going to 3, or more generally, of enabling options without fully understanding all the knock-on effects they might have!
So, I think that’s fixed things. Things certainly seem quieter and I’ve not seen the error happening again. The people who were good enough to let me know there were problems have been told and they all report it’s working again. And, it’s prompted me to update my Pingdom settings so I can get early warning on any future problems.