Performance Testing Series
This post is part of a series of blog posts about my performance tests in Azure SQL Database. For the first post in this series (and links to all of the other posts) please see here. For a summary of all the results please see here.
For a general overview of the test architecture, test components and test types, please see here.
Scale Test Overview
This test combines the activities of the earlier Inserts and Deletes, Direct Selects, Indirect Selects and Updates tests. Please refer to those posts for a more detailed description of these activities.
The type of request generated by each worker thread at any point in time is randomly selected, based on the following probabilities:
- Inserts – 20%
- Selects – 30% – which breaks down as 75% Direct Selects and 25% Indirect Selects
- Updates – 30%
- Deletes – 20%
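The weighted random choice of request type described above can be sketched as follows (the function and constant names are illustrative, not the actual test harness code):

```python
import random

# Request mix from the test definition: action -> probability
REQUEST_MIX = {
    "Insert": 0.20,
    "Select": 0.30,   # split 75% Direct / 25% Indirect below
    "Update": 0.30,
    "Delete": 0.20,
}

def next_request_type(rng: random.Random) -> str:
    """Pick the next request type for a worker thread."""
    action = rng.choices(list(REQUEST_MIX), weights=list(REQUEST_MIX.values()))[0]
    if action == "Select":
        # 75% of selects are direct (by key), 25% indirect (by key range)
        return "SelectDirect" if rng.random() < 0.75 else "SelectIndirect"
    return action
```

Each worker thread simply loops, drawing the next action from this distribution and executing it against the database.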
As in those earlier tests, the worker threads in these tests are not limited to a specified request rate.
The earlier tests only ran on a relatively small set of data in memory. These tests run against a newly restored and much larger 36 GB database (except for Basic edition, where a 2 GB database was used). The size of the database and the fact it is newly restored means that none of the data exists in the buffer pool at the start of the test and the test therefore includes the effect of read rates from physical disk for each tier.
Using the same workload against each tier is, on one hand, not a smart thing to do – clearly the lower tiers/performance levels cannot be expected to perform as well as the top end. On the other hand, running the same workload is a handy way of illustrating exactly what those differences look like.
The previous Insert, Select, Update and Delete tests randomly picked a row from any point in the database with equal probability. Since those tests were working with data entirely in the buffer pool, that was fine. In most databases, certain data is more likely to be hit than other data, e.g. typically more recent data is more “active”.
To at least loosely replicate that, two distribution functions were used in these tests, meaning that data towards the end (highest row ID values) of each test table was more likely to be accessed than data at the start. The first distribution function, used for the single row operations (i.e. Select Direct, Update and Delete), had a high probability of choosing data in the last 5% of each table. The second distribution function, used for row ranges (i.e. Select Indirect), had a slightly higher probability of selecting slightly older data – i.e. the function was less peaked at the end of the table.
Probability distribution charts for these functions can be found in an appendix at the end of this post.
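One simple way to produce such an end-weighted distribution is a power-law transform of a uniform variate. To be clear, this is an illustrative stand-in, not the exact functions used in these tests (which are only shown in the charts):

```python
import random

def pick_row(n_rows: int, peak: float, rng: random.Random) -> int:
    """Pick a 1-based row ID, skewed towards the end of the table.

    Higher `peak` values concentrate more probability on the highest
    row IDs; peak = 1 would give a uniform distribution.
    """
    u = rng.random()
    return min(n_rows, int(n_rows * u ** (1.0 / peak)) + 1)

# peak ~= 13.5 gives roughly a 50% chance of landing in the last 5%
# of the table, similar in spirit to the single-row function described.
```

The row-range function for Select Indirect would use a lower `peak` value, spreading more probability over slightly older data.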
I am not at all claiming that this workload and these distributions are representative of the majority of databases in use. They are merely one example workload and data access pattern. The single row distribution function is in particular quite “peaked”, however the choice was deliberate – having such a peak should allow some data to be relatively quickly cached in the buffer pool.
Each test lasted 120 minutes and was executed three times against each tier. Tests were performed against Business Edition with 1 thread generating requests, 2 threads, 4 threads and 8 threads. This is due to the opportunistic nature of that edition (which means more threads = more throughput, to some extent). This is less true of the new service tiers, where the different quotas come into play and additional threads don’t improve performance once a quota limit is reached (which generally happens in the tests here).
More details of the test parameters and configurations can be found in the appendices at the end of this post.
Results from the 30 Scale Tests are shown in the charts below. In these charts, the “T” values are the number of worker threads generating requests against the database, e.g. “Business 4T” = a test against Business edition, running with 4 threads continuously generating requests.
Clearly the Standard Edition has not performed well under this workload. Business 4T outperforms P1 4T. Prem P2 8T outperforms Business 8T. Both of these comparisons are like-for-like in terms of Cloud Service configuration (i.e. both 4T tests were executed from an A2 instance and both 8T tests from an A3 instance).
Plotting the same data on a logarithmic scale:
P1 has outperformed S2 by a factor of around 100. P1 has outperformed S1 by a factor of around 1000. Basic has performed well, though it should be remembered that the Basic Edition test was conducted against a 2 GB database, whilst all the other tests were performed against a 36 GB database. Still, this difference in performance is interesting.
The charts below show the performance profiles over the lifetime of the tests in the charts above. Each test has been plotted on a separate chart to allow throughput and load % to be easily compared. The first test of the three on each tier (i.e. ST1) is plotted and discussed below. Charts for the ST2 and ST3 tests are generally similar and presented in an appendix at the bottom of this post.
It is also worth noting that, due to the way Azure SQL Database logs into sys.resource_stats, data is only available at five minute intervals, and the last data point in each test is not useful and so is omitted. There also appeared to be a bug or infrastructure fault occurring during these tests that prevented SQL Database from capturing data for every five minute period – the gaps appear to follow quite a regular pattern but fortunately don’t prevent the general trends from being observed.
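The gaps can be found mechanically: given the sorted five-minute period start times returned from sys.resource_stats, any missing periods show up as jumps of more than five minutes. A small sketch (the query text in the docstring is illustrative):

```python
from datetime import datetime, timedelta

def missing_intervals(times):
    """Return the five-minute period start times absent from `times`.

    `times` is a sorted list of datetimes, e.g. as returned by
    SELECT start_time FROM sys.resource_stats WHERE database_name = '...'.
    """
    gaps = []
    step = timedelta(minutes=5)
    for prev, cur in zip(times, times[1:]):
        t = prev + step
        while t < cur:          # every skipped period is a gap
            gaps.append(t)
            t += step
    return gaps
```

Running this over the captured data makes the regular pattern of missing periods easy to see.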
For the Basic edition test, after a five to ten minute warm-up period, performance quickly reaches a good level. We can see the read rate decreasing over the course of the test as more of the database appears to be cached. This result is rather in contrast to all of the other results below and so should be considered as something of a special case.
For the majority of this test, performance is limited by the log write quota.
Response times are good after the first couple of minutes. Since this database is write-limited, unsurprisingly the Insert, Update and Delete operations take the most time. Again, this is something of a special case compared to the other tests.
The S1 profile is not pretty, at all. Performance is terrible at the start and only improves, perhaps very slightly, at the very end of the test. The test is read constrained throughout.
There were never more than two requests pending at any one time (the test was run with two threads). The chart below shows the average response times. Actually, since so few requests were processed in this test, each data point generally represents an individual request:
There are so few requests processed in this test, we can easily list them, by start time. Here are all 37 requests from the first hour:
| Minute | Action | Requests | Avg Resp Time (sec) | Avg Resp Time (m:ss) | Rows | Bytes |
|---|---|---|---|---|---|---|
“Bytes” values above represent the length of the rows sent to SQL Database (for inserts and updates) or the length of the rows read from SQL Database (for selects).
It would be interesting to see just how many pages SQL Server is processing for each of these requests. In theory, it should be relatively low, since all requests search by the clustered index key – i.e. WHERE RowId = … for the SelectDirect, Update and Delete requests, or WHERE RowId between… for the SelectIndirect requests (which cover an average range of 660 rows).
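A very rough estimate of that page count can be written down directly. This is a back-of-envelope model under stated assumptions (cold cache, no fragmentation, an assumed rows-per-page figure and index depth), not a measurement:

```python
import math

def seek_page_reads(rows_returned: int, rows_per_page: int, index_depth: int) -> int:
    """Rough page-read estimate for a clustered index seek + range scan.

    Assumes a cold cache and no fragmentation: the seek descends
    `index_depth` non-leaf levels, then scans enough leaf pages to
    cover the requested rows. Parameter values are illustrative.
    """
    leaf_pages = math.ceil(rows_returned / rows_per_page)
    return index_depth + leaf_pages

# e.g. a SelectIndirect covering ~660 rows at an assumed ~50 rows per
# 8 KB page, with index depth 3, touches only 3 + 14 = 17 pages
```

Even under these generous assumptions the page counts are small, which makes the very long response times seen on S1 all the more striking.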
The database was not defragmented after importing which certainly won’t help performance – the bacpac import process tends to leave clustered indexes relatively fragmented, e.g. these tables typically had over 75% fragmentation (yet another reason to dislike bacpacs).
It is interesting to note that later in the test, after minute 90, generally all of the request types except SelectIndirect respond sub-second. But those SelectIndirects are still performing very badly and contributing most to the poor performance.
Clearly, being read limited is not a happy place to be.
The second test shows a similar pattern:
The improvement in performance after 90 minutes in all but the SelectIndirect requests is easier to see in this test. Perhaps, given a test over 24 hours, performance would have continued to improve further, though the S2 result below suggests probably not.
The S2 chart clearly shows a higher throughput compared to S1. Again the test is read constrained throughout.
A stable state of performance is reached after around 35 minutes, vs. 90 minutes for S1. Again, the SelectIndirect requests have the slowest response. Response times for all request types are roughly an order of magnitude better than S1.
The P1 test clearly shows a database performing much more efficiently than under S2. The database reaches a reasonably stable level of performance after only ten minutes. After another 30 minutes a significant proportion of the database workload has been cached (judging by the reduction in physical disk utilisation) and performance has increased by another order of magnitude.
Throughput in the stable state is roughly 100 times that of S2 (for what is in effect only a doubling of DTUs between S2 and P1).
This is the first of the new tiers able to operate this database workload without being either read or write limited.
The response time comparison to S2 is interesting. The average response time for the single row operations is the same order of magnitude (P1 perhaps half that of S2). However, the multiple row SelectIndirect operation response time is roughly two to three orders of magnitude (i.e. 100 to 1000 times) better. Clearly, a higher read quota has brought about a huge benefit.
There also appear to be regular peaks in performance (e.g. the response time dips at roughly 58, 68, 77, 87 and 97 minutes). In the Select Sequential post I speculated that read quota may be allocated in blocks per unit of time. Perhaps these dips represent additional “premium performance” quota being allocated, then quickly consumed (giving a read rate boost) before read performance falls back to the lower default level for the tier.
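The speculated behaviour – quota granted in blocks per unit of time – is essentially a token-bucket scheme. A sketch of that model (purely an illustration of the speculation, not Microsoft's actual mechanism):

```python
class ReadQuotaBucket:
    """Token-bucket model of a read quota allocated in blocks per period.

    Every `period` seconds the bucket is topped up with `block` units
    (capped at `capacity`); reads drain it. A freshly granted block
    allows a short burst of fast reads before the rate falls back.
    """
    def __init__(self, block: float, period: float, capacity: float):
        self.block, self.period, self.capacity = block, period, capacity
        self.tokens = block
        self.last_grant = 0.0

    def try_read(self, now: float, units: float) -> bool:
        # Grant any whole blocks that have accrued since the last top-up.
        blocks = int((now - self.last_grant) // self.period)
        if blocks:
            self.tokens = min(self.capacity, self.tokens + blocks * self.block)
            self.last_grant += blocks * self.period
        if self.tokens >= units:
            self.tokens -= units
            return True
        return False        # read-limited until the next block arrives
```

Under such a scheme, each ten-minute grant would produce exactly the pattern observed: a brief dip in response times while the fresh block is consumed, then a return to the steady limited rate.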
For this workload, P1 would clearly be the optimum service tier and performance level choice based on a balance of cost and performance.
On P2 the workload is operating efficiently within around five minutes, and the database is read limited for less than five minutes. Throughput is around three to four times that seen in P1, again for what is only a doubling of the DTUs.
Response times are probably half those seen in P1. Again the same pattern of periodic peaks in performance (i.e. dips in response time) is apparent, again at intervals of around ten minutes.
In the Business 1T test, it is interesting to note the obvious dip around minute 20 – an example of the variability in performance that can be encountered with this edition. We are reported to be write limited, though we know from the earlier tests this limit means little in practice for this edition.
Response times are very good. It is interesting to note that the SelectIndirect requests in this edition do not have a response time that is noticeably longer than the other request types. This is in comparison to the newer service tiers where this part of the workload performed significantly worse throughout. This suggests read performance is noticeably better in Web/Business than the newer service tiers.
The 2T test is very similar to the 1T test.
The 4T test is again very similar, with a noticeable blip around minute 70.
With eight threads, we reach a performance that is comparable to P2 (that test also ran with eight threads).
Scale Test Conclusions
The Scale Tests have illustrated how one example workload performs across the different editions, tiers and performance levels of SQL Database. In particular, they have shown how the nature of the workload can generate differences in performance that diverge vastly from simple expectations based on the DTU figures for the different tiers and performance levels.
From the figures gathered in these tests, we can create yet another benchmark figure (CTU, Chris’s Throughput Units, for this rough test database workload here) as follows:
| Service Tier | Benchmark Rate | Chris’ Throughput Units (CTU) |
|---|---|---|
| Web / Business (4T) | 40 MB per minute | 100 CTU (36 GB DB) |
| Basic | 15 MB per minute | 38 CTU (2 GB DB) |
| Standard S1 | 0.03 MB per minute | 0.1 CTU (36 GB DB) |
| Standard S2 | 0.08 MB per minute | 0.2 CTU (36 GB DB) |
| Premium P1 | 20 MB per minute | 50 CTU (36 GB DB) |
| Premium P2 | 76 MB per minute | 134 CTU (36 GB DB) |
Now, I have put the CTUs here half jokingly (actually, 90% jokingly, even though they do accurately reflect my test results). I don’t claim for a second that they are any more useful than the figures provided by Microsoft, who have undoubtedly put substantially more effort into creating a rate that better reflects a wider range of workloads and scales.
My point really is that DTUs, CTUs or whatever else aren’t a reliable guide to planning / design. They are perhaps a very rough starting point but SQL Server performance doesn’t scale nicely and simply according to one simple blended benchmark number.
Some basic read and write rates can be helpful though. By the way – the CTU benchmark comes supplied with these figures, in contrast to other xTU style benchmarks that may be available! (Though, disclaimer: the CTU rates are of course estimates, not official, subject to change, etc.) These figures are summarised in the next post and given individually in the earlier posts about each test type.
The CTU benchmark also compares the new service tiers to Web/Business, though again it is noted that Web/Business performance is more variable. That still doesn’t stop averages being captured nor does it mean readers aren’t intelligent enough to understand that any Web/Business average has a wider standard deviation than for the new Service Tiers.
These tests have also illustrated how read rate appears to quite heavily dominate SQL Database performance under the new service tiers – and how unpleasant life is when that quota is running short. Of course, I could have reduced the fragmentation in the test database, used a more optimal design, etc. But, just in case you missed it everywhere else I said it, I don’t claim this is the best workload to benchmark with. I am sure, though, that there are plenty of other databases in Web/Business today with similar or bigger problems – databases that are currently running perfectly or moderately happily (just like this workload did in Web/Business) – whose owners may be in for a surprise when they try running in an environment that is constrained on read rate. For these databases, Standard edition as it stands today is not going to be a happy place to be.
Which neatly brings me to one aspect I haven’t mentioned thus far: cost. For this workload, clearly Premium P1 would be needed. That would represent a cost change from 65 UKP (100 USD) to 295 UKP (465 USD) per month, i.e. a multiple of 4.6! (And that’s just the preview pricing!) So, that database that is running happily today in Web/Business (despite its “issues”) is potentially going to cost quite a lot more this time next year to keep running…
Which leaves me all the way back with my original thought I posted way back in April/May time – that these new Service Tiers actually represent quite an increase in the cost of SQL Database. Even a 150 GB database (i.e. at the top end of SQL Database per GB pricing in Business Edition today) will cost twice as much to run in P1 in future (and again, that’s just the preview pricing, GA pricing is four times today’s price!).
Finally, a couple of other footnotes comparing Web/Business to the new service tiers:
- The sys.resource_stats DMV recently appeared in the Web/Business edition, presumably to allow comparison between the existing tiers and the new tiers.
- Firstly, from these tests, we have seen that 100% values in the log write and physical disk read columns actually cover a VERY wide range of actual performance levels, often multiples of 3, 4 or even 5. So, if you are seeing 100%, that doesn’t tell you the true maximum.
- Secondly, I believe I read in a blog post that the read and write percentages in this DMV in Web/Business have been scaled at half of S2, i.e. 100% in Business = half of S2, or equivalently, S2 has twice the performance of Business. From these tests, the opposite actually appears to be true: e.g. the average physical disk read rate measured for S2 was 210 MB per minute, versus 550 MB per minute for Business.
This conclusion has turned into something of a stream of consciousness. It may sound overly critical. Actually, I do really like the concepts behind the changes Microsoft is bringing to SQL Database. They bring an increased level of robustness and are probably how SQL Database should have been designed originally. There are many good solid technical enhancements. In practice, P2 significantly outperforms Web/Business in almost all tests (P1 is a more mixed picture). So my real concerns are about the transition from Web/Business and what appear from here to be quite significant associated cost increases in order to maintain the same performance. Standard edition in particular seems a “somewhat” poor relation to Web/Business due to the constrained read quota. And, great though Premium is, it costs significantly more.
Is this the aviation equivalent of moving from a rough-around-the-edges, sometimes unreliable but generally quite high performing, economy-only budget airline to a national flag carrier, with its multiple cabins and seating classes, all generally more expensive than the budget carrier?
As a postscript, I should point out that these tests were conducted in the Western Europe data centre. It is conceivable that there is plenty of spare Web/Business capacity there (which may not exist in other Azure data centres worldwide), so Web/Business may perform better there than elsewhere. If that is the case, the general conclusions here would be less widely applicable. Time and the experiences of others will tell whether these conclusions are generally applicable…
Appendix – Test Configuration
| | Business 1T | Business 2T | Business 4T | Business 8T | Basic 2T | Std S1 2T | Std S2 2T | Prem P1 4T | Prem P2 8T |
|---|---|---|---|---|---|---|---|---|---|
| Cloud Svc Inst. Size | A2 | A2 | A2 | A3 | A1 | A1 | A2 | A2 | A3 |
| Req. Gen. Thread Count | 1 | 2 | 4 | 8 | 2 | 2 | 2 | 4 | 8 |
| Initial Test Data (GB) | 36 | 36 | 36 | 36 | 2 | 36 | 36 | 36 | 36 |
Appendix – Test Results
| Configuration | Avg Ops Per Second (ST1 / ST2 / ST3) | Avg Rows Per Second (ST1 / ST2 / ST3) | Avg MB Per Minute (ST1 / ST2 / ST3) |
|---|---|---|---|
| Std S1 2T | 0 / 0 / 0 | 0.6 / 1.3 / 1.4 | 0.01 / 0.03 / 0.03 |
| Std S2 2T | 0 / 0 / 0 | 3.6 / 3.8 / 3.9 | 0.08 / 0.08 / 0.08 |
| Prem P1 4T | 95 / 94 / 103 | 902.2 / 895.4 / 985.4 | 18.98 / 18.83 / 20.73 |
| Prem P2 8T | 387 / 381 / 381 | 3710.0 / 3602.5 / 3595.6 | 78.04 / 75.76 / 75.61 |
| Configuration | SQL Avg Disk Read % (ST1 / ST2 / ST3) | SQL Avg Log Write % (ST1 / ST2 / ST3) | SQL Avg CPU % (ST1 / ST2 / ST3) |
|---|---|---|---|
| Std S1 2T | 100.0 / 98.1 / 100.0 | 0.0 / 0.0 / 0.0 | 2.8 / 3.6 / 2.6 |
| Std S2 2T | 100.0 / 100.0 / 99.1 | 0.1 / 0.1 / 0.1 | 2.9 / 3.2 / 3.3 |
| Prem P1 4T | 72.3 / 71.6 / 73.6 | 22.7 / 21.5 / 23.7 | 8.6 / 7.3 / 8.0 |
| Prem P2 8T | 47.9 / 50.2 / 49.3 | 44.5 / 43.2 / 42.4 | 10.4 / 11.7 / 10.2 |
| Configuration | Cloud Svc Avg CPU % (ST1 / ST2 / ST3) | Error Count (ST1 / ST2 / ST3) |
|---|---|---|
| Std S1 2T | 0.3 / 0.3 / 0.3 | 0 / 0 / 2 |
| Std S2 2T | 0.3 / 0.3 / 0.3 | 0 / 0 / 0 |
| Prem P1 4T | 3.5 / 3.4 / 3.7 | 0 / 0 / 0 |
| Prem P2 8T | 7.0 / 6.9 / 6.8 | 0 / 0 / 0 |
Appendix – Row Selection Probability Distribution Functions
The chart below illustrates the probability distribution functions used for selecting single rows (for the Direct Select, Update and Delete operations) and for selecting multiple rows (for the Select Indirect operation). The x-axis represents a theoretical table with 1000 rows. The y-axis represents the probability of each row being accessed for a single operation.
Zooming into the lower portion of the chart:
In practice, this meant a data access operation had the following probabilities of hitting different regions within a table (again, using a table of 1000 rows as an example):
| From Row | To Row | Single Row % | Row Range % |
|---|---|---|---|
So, for example, a single row select (Select Direct) request had approximately 50% chance of hitting the last 5% of the table (i.e. between rows 950 and 1000). The probability that a multiple row select (Select Indirect) would hit that part of the table is approximately 30%.
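With an end-weighted power-law distribution of the kind described earlier (an assumption on my part – the exact functions used are only shown in the charts), region probabilities like these can be computed directly from the CDF rather than simulated:

```python
def region_probability(from_row: int, to_row: int, n_rows: int, peak: float) -> float:
    """P(from_row < R <= to_row) when R = n_rows * U**(1/peak), U ~ Uniform(0,1).

    P(R <= r) = (r / n_rows) ** peak, so a region's probability is the
    difference of the CDF at its endpoints. `peak` is an illustrative
    shape parameter, not a value taken from the actual test functions.
    """
    cdf = lambda r: (r / n_rows) ** peak
    return cdf(to_row) - cdf(from_row)

# With peak ~13.5, roughly half of all single-row operations land in
# the last 5% of a 1000-row table:
p_last_5pct = region_probability(950, 1000, 1000, 13.5)
```

A lower `peak` value reproduces the flatter row-range behaviour, where only around 30% of operations hit that final region.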
Appendix – ST2 and ST3 Charts
The main text above includes only the performance profile for the first of the three scale tests conducted against each tier / edition (to avoid overloading the main text with even more charts than are already present!). The performance profiles for the second and third tests against each tier are included below for those that really love this kind of thing.
Generally speaking, the main patterns of behaviour are consistent within each performance level / tier / edition.