This week I already had two serious performance regression cases when upgrading from MySQL 4.0 and 4.1 to MySQL 5.0. By serious I mean several times performance difference not just 5-10% you often see for simple queries due to generally fatter code.

The problem in both cases was MySQL 5.0 broken group commit bug.

First I should note I am extremely unhappy how MySQL handled this problem. While working for MySQL we spotted this problem early in MySQL 5.0 release cycle as it was introduced and reported it to everyone we could inside the company - this was over 2 years ago. Few months later I created a bug for this issue to get more public attention to the problem and giving extra motivation to MySQL to fix it. Few months later I blogged about this problem with more performance results but as we can see the bug is still in Verified stage and there is no indication any work is going to have it fixed.

I can agree this may be fundamental issue which is not easy to fix, But why is not it mentioned in MySQL 4.1 to 5.0 upgrade notes ? ?

Furthermore if there were no good ideas how to make XA to work with group commit why would not you keep old working code path if XA is disabled ? Many customers do not flush binary log anyway and use single transactional storage engine so they do not care anyway.

Anyway. Enough complains. We have this problem and we have to live with it, most likely MySQL 5.0 and 5.1 would not get any fixes for this problem, so lets see who is affected, how to check you’re affected and how to fix it.

Who is affected ? The good thing is only cheap guys which care about their data are typically affected, meaning you have to have innodb_flush_logs_at_trx_commit=1 so transactions are truly durable. You have to have log-bin enabled to get replication or point in time recovery, but at the same time you should not have hardware RAID or have one without battery backed up cache (BBU unit). I guess this is one of the reason why this bug did not get so much traction inside MySQL - because paying customers would normally have enough money to get BBU unit which is great for performance anyway. Of course you also have to have plenty of concurrent transactions so group commit would trigger in MySQL 4.0 and large number of transactions in total so serializing them would make MySQL unable to keep up. Disks can do 80-150 single page fsyncs per second to get you an idea bout number.

How to spot you’re affected This one is interesting. If you have update prevailing load you would see very strange behavior on 5.0 of MySQL being slow but few queries being “inside innodb” and potentially even empty queue. This is because bottleneck happens in commit phase which is not counted as “inside innodb”. I wish there would be some stats for number of queries waiting to be committed but there is not one easily readable. You can see it from other symptoms though. You would see queries in “SHOW PROCESSLIST” stuck in “end” stage or have “commit” queries in the processlist for multiple statement transactions. Looking at SHOW INNODB STATUS you would notice large amount of log writes and fsyncs per second which matches your hard drive capacity. Plus you would normally see single outstanding log write all the time. There are other ways you can spot the problem as well but these are probably most obvious and easy to use.

How to fix the problem ? Assuming getting back to MySQL 4.1 is not the option you can do one of 3 things to get some of your performance back. XA support has its overhead anyway so you may not get same performance as with MySQL 4.1

  1. Disable Binary Logging This allows to get group commit back but obviously you loose point in time recovery and replication.
  2. Use innodb_flush_log_at_trx_commit=2 This is probably best solution. In many cases this would be good change to do with MySQL 4 also because 100% durable transactions are not required anyway and it would allow to get some extra speed. It is often left at default value without a good reason.
  3. Get BBU If you can’t use any of first two workarounds you better to get battery backed up cache unit and make sure you set your RAID cache policy to “write back”. One of the customers I worked with indeed had battery backed up cache on their system… it just was in “write through” cache policy, so was basically disabled. Note getting BBU is often good idea anyway so you can use this together with other workarounds. Also it is worth to mention BBU does not fix the problem but dramatically raises number of update transactions per second needed to trigger the problem. Without BBU 200 per sec may be well enough with BBU you may only see it if you have 2000 update transactions per second or so, which few people reach.

The other way to approach the problem is of course to work on the application - in large amount of cases the problem happens in case there are away to many updates outside of transactions in auto_commit mode. Wrap them in transactions and reduce number of commits if you can. It is great optimization idea anyway.

Also please do not read this post as MySQL 5.0 is junk in terms of performance and you should stay on MySQL 4.1 until MySQL has taken it away from you as already happened with MySQL 4.0. MySQL 5.0 can offer substantial performance improvements in variety of cases as well as has other benefits. This is simply important regression which you better to know about.