The Challenge
Microsoft SQL Server performance tuning is not a core capability of the Tarbh Tech team, as we focus on Microsoft cloud-based infrastructure. That said, Tarbh Tech was recently asked to assist an electronics company with a significant manufacturing production impact, which was traced back to an on-premises SQL Server performance issue. The SQL Server was configured as a cluster and ran many databases that supported the production lines. It had been running without issue for several years but had begun to experience random, unexplained slowdowns in response times.
The client’s IT team could restore service by failing over the SQL Server instance, but the respite was only temporary, with the performance issue returning after several days. The problems did not correlate with patching cycles, though the development team had recently introduced some smaller databases and their corresponding apps. Additionally, with global supply chain issues in the news, the client needed to run production 24×7 to keep up with demand, so the slowdowns resulted in late-night callouts that adversely affected the support teams.
The Specifics
Tarbh Tech performed the role of SQL Server database administration, augmenting the client’s internal development and infrastructure skills. The SQL Server cluster ran an older release, outside vendor support, on Windows Server 2012 R2, all virtualised in VMware on SAN storage. The cluster event log and those of the Windows servers exhibited the usual volume of entries, none of which proved conclusive. Server monitoring showed no excessive CPU usage, memory pressure, or storage latency, and the underlying hardware was practically idle, as it was dedicated to the SQL Server cluster.
The Solution(s)
Initial consultations covered various possible causes, from backups to operating system corruption, but the root cause ultimately proved to be several linked issues.
Issue One
Server monitoring showed that CPU usage never exceeded 50%, no matter how busy the SQL Server was. However, a closer examination showed that although the Windows VM had eight vCPUs, only four were ever actively used.
A review of the SQL Server logs following a failover provided the first hint as to the cause: SQL Server saw all eight vCPUs, but Standard edition licensing limits it to the lesser of four sockets or 24 cores. This limitation had constrained performance since the VM was first created years earlier but had never been noticed. Because VMware SMP scheduling has improved significantly over the years, new VMs are created with “wide” vCPU configurations by default; i.e. eight vCPUs are presented logically as eight sockets, each with a single core. Unfortunately, this fell foul of the SQL Server Standard socket limit and restricted the instance to only four of the eight vCPUs.
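A quick way to confirm this condition from SQL Server itself is to compare the schedulers it can see with those it is allowed to use. A minimal diagnostic sketch using standard DMVs; the quoted error-log text in the comment is only illustrative of the startup message format:

```sql
-- Schedulers marked VISIBLE OFFLINE are CPUs SQL Server can see
-- but is not allowed to use, e.g. because of edition licensing limits.
SELECT [status],
       COUNT(*) AS scheduler_count
FROM sys.dm_os_schedulers
WHERE [status] IN ('VISIBLE ONLINE', 'VISIBLE OFFLINE')
GROUP BY [status];

-- The error log also records the detected topology at startup,
-- e.g. "SQL Server detected 8 sockets with 1 cores per socket ...".
EXEC sys.xp_readerrorlog 0, 1, N'sockets';
```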
This issue was quickly resolved by shutting down the passive cluster node and updating its vCPU configuration to two sockets, each with four cores. After a failover to the reconfigured node, the additional vCPU resources were immediately visible and in use.
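The new topology can also be verified from within SQL Server after the failover; a small sketch using sys.dm_os_sys_info:

```sql
-- cpu_count is the number of logical CPUs SQL Server can see, and
-- hyperthread_ratio reflects logical CPUs per socket, so a
-- two-socket, four-core layout should report cpu_count = 8 and
-- hyperthread_ratio = 4, with scheduler_count matching cpu_count.
SELECT cpu_count,
       hyperthread_ratio,
       scheduler_count
FROM sys.dm_os_sys_info;
```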
Issue Two
With the ceiling on vCPU resources lifted, the random slowdowns became less frequent but were not eradicated. Fortunately, for occasional SQL DBAs, the internet offers a wealth of helpful guidance and diagnostic tooling; in particular, the First Responder Kit from Brent Ozar and the SQL Server Maintenance Solution from Ola Hallengren were especially useful. For example, reviewing the configuration of the SQL Server cluster with sp_blitz from the First Responder Kit produced a list of potential performance-impacting configuration issues to address (typical remediations are sketched after the list), including:
- Out-of-support SQL Server version
- Configuration of minimum and maximum server memory for SQL Server
- Configuration of Max Degree of Parallelism and corresponding Cost Threshold for Parallelism
- Outdated defaults for Auto-close, Auto-shrink, and Recovery Interval on most databases
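The corresponding remediations are standard sp_configure and ALTER DATABASE changes. The sketch below shows their general shape; the memory, MAXDOP, and cost-threshold values are purely illustrative and must be sized for the actual server, and the database name is a placeholder:

```sql
-- Cap SQL Server memory so the operating system retains headroom
-- (values are illustrative, not recommendations).
EXEC sys.sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sys.sp_configure 'min server memory (MB)', 8192;
EXEC sys.sp_configure 'max server memory (MB)', 57344;

-- Size MAXDOP to the core layout and raise the cost threshold so
-- only genuinely expensive queries go parallel.
EXEC sys.sp_configure 'max degree of parallelism', 4;
EXEC sys.sp_configure 'cost threshold for parallelism', 50;

-- Return the recovery interval to its self-tuning default.
EXEC sys.sp_configure 'recovery interval (min)', 0;
RECONFIGURE;

-- Remove the outdated per-database defaults flagged by sp_Blitz
-- ([ExampleDatabase] is a placeholder name).
ALTER DATABASE [ExampleDatabase] SET AUTO_CLOSE OFF;
ALTER DATABASE [ExampleDatabase] SET AUTO_SHRINK OFF;
```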
Ola Hallengren’s maintenance solution (a typical invocation is sketched below) was implemented for:
- Consistency Checks, Index Maintenance, and Statistics Updates on most databases
These changes ensured the client worked from a known-good configuration should any further performance issues arise.
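For reference, these maintenance tasks boil down to scheduled calls to Ola Hallengren’s stored procedures. A minimal sketch, with fragmentation thresholds shown explicitly even though the solution ships sensible defaults; the weekly and nightly cadence in the comments is a common pattern rather than the client’s actual schedule:

```sql
-- Consistency checks across all user databases (e.g. weekly).
EXECUTE dbo.DatabaseIntegrityCheck
    @Databases = 'USER_DATABASES',
    @CheckCommands = 'CHECKDB',
    @LogToTable = 'Y';

-- Index maintenance and statistics updates (e.g. nightly).
EXECUTE dbo.IndexOptimize
    @Databases = 'USER_DATABASES',
    @FragmentationLevel1 = 5,
    @FragmentationLevel2 = 30,
    @UpdateStatistics = 'ALL',
    @OnlyModifiedStatistics = 'Y',
    @LogToTable = 'Y';
```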
Issue Three
Not all databases running in the cluster were equally important; one stood out above all others, as it was responsible for all production planning. Most of the investigation had focused on optimising the performance of that critical database, yet the random slowdowns recurred. With the various third-party toolkits now in place, we could run further diagnostics, such as sp_blitzfirst from Brent Ozar, which indicated the presence of some costly queries causing high CPU usage. So we dug deeper with sp_blitzcache, “a fast tool to find the worst queries in the SQL Server plan cache, tell you why they’re bad, and even tell you what you can do about them”. The client’s development team was surprised when we discovered that a simple database and app used to update timing data on all manufacturing equipment consumed over 40% of all server resources, executing a query over 5000 times per minute.
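The diagnostic pattern itself is straightforward; a sketch of the kind of calls used, with illustrative parameter values:

```sql
-- Sample server activity over a 30-second window to see what is
-- consuming resources right now.
EXEC dbo.sp_BlitzFirst @Seconds = 30;

-- Rank the plan cache by CPU and by execution count to surface the
-- most expensive and most frequently executed queries.
EXEC dbo.sp_BlitzCache @SortOrder = 'cpu', @Top = 10;
EXEC dbo.sp_BlitzCache @SortOrder = 'executions', @Top = 10;
```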
This simple app proved to be the root cause of the performance issues, pushing overall CPU usage to its limit as more and more manufacturing systems came online. The impact was confirmed by temporarily disabling the app, which was then re-architected to reduce the volume of queries hitting the database.
The Result
Following all the troubleshooting, the client’s critical SQL Server cluster:
- had 100% more usable CPU resources, with all eight vCPUs in use,
- saw CPU consumption by SQL queries reduced by nearly 40%,
- was updated to vendor-supported versions,
- had dozens of minor misconfigurations corrected or removed,
- had improved tools and instrumentation in place to allow faster recovery from issues.
Overall, the client regained confidence in their critical production environment, and the development and IT teams gained a greater knowledge of SQL Server configuration best practices and performance tuning techniques. In addition, system availability improved, reducing or eliminating overnight callouts.