On 3 Dec 2019, we released a new agent version (2.6.4) to fix the cert update bug in the SSL Terminating module. Unfortunately, some of its dependencies caused serious issues (kernel panics, redirection problems) on some CentOS 6 and CentOS 7 servers. Only 2% of BitNinja-protected servers were affected, because the problem occurred only with specific kernel versions.
The agent version was reverted and released as 2.6.6. We started a bulk update on the servers and we are in contact with the affected users.
To make sure that the right version of BitNinja is running on your server, we recommend checking the agent version and updating it if necessary.
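As a minimal sketch of that check, assuming you already have the installed agent version as a dotted string (how you obtain it depends on your setup and is not shown here), the comparison against 2.6.6 could look like this. The function names are hypothetical and are not part of the BitNinja agent.

```python
def parse_version(text):
    """Turn a dotted version string such as '2.6.6' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def needs_update(installed, minimum="2.6.6"):
    """Return True when the installed version is older than `minimum`.

    Tuples compare element by element, so '2.10.0' is correctly treated
    as newer than '2.6.6' (a plain string comparison would get this wrong).
    """
    return parse_version(installed) < parse_version(minimum)


if __name__ == "__main__":
    for version in ("2.6.4", "2.6.6"):
        print(version, "needs update:", needs_update(version))
```

Comparing parsed tuples rather than raw strings avoids mis-ordering versions once a component reaches two digits.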
We are truly sorry for the inconvenience caused, and we would like to sincerely apologize publicly too. We fully understand how serious this problem was for the affected users. That's why we are dropping all our current projects now, and the whole team will work on setting up more than 100 servers with different kernel versions and testing BitNinja thoroughly on all of them. A separate public page will be dedicated to showing the results, so everyone can follow the investigation process.
As soon as we have more accurate technical details, we will share them immediately (within 7 workdays). Until then, Zsolt Varga, our Product Manager, has summarized the information we have so far:
Of course, we tested on every supported version; this is standard procedure for every release. So far we have only been able to collect enough information to narrow the issue down to the kernel versions. The BitNinja package itself never runs close enough to the kernel to cause an issue like this, but our dependency, HAProxy, does, and this is where we made the mistake. We ran tests on CentOS 6, but our test system didn't test different kernel versions. We never needed to do this before, since this is our first real HAProxy update since its release.
Sadly, we made a huge mistake in assuming that every kernel would accept this pre-packaged version in the same way. To be honest, we had to roll back because this incompatibility appears only in a limited range of kernel versions; neither the newer nor the older ones have the problem.
Before we can push this update, we need to extend our setup to run every combination for every supported operating system. Until now we ran the agent tests on 52 common and 8 special setups, but our colleagues are now scaling this up to test every possible kernel version for each release.
We are terribly sorry for the extra work and maintenance this caused. Our team did their best to mitigate the issue once we got the first report about it. We know it took longer than expected: our error reporting system could not send the error report because the agent itself had gone down, which had never happened before. Once we had enough information, we issued a full rollback with a version update and tested the current combination in a quickly built test environment in which the issue can be reproduced.
The newest version (2.6.6) is the one that had run without issues for 2 weeks, and we will not release any new version until we have scaled our test servers to the level where this cannot happen again.
On behalf of our whole team, I'd like to sincerely apologize for the problems this caused.
P.S. We'll keep updating this article with further details during the investigation.
UPDATE (6th Dec 2019 16:05 UTC+01:00)
Open letter from our Product Manager:
Aye aye, skipper!
I hope this email finds You well. As You know, we released a buggy patch that caused issues, and this is why I am writing this email: to tell You the whole story and what will change.
We spent the past 2 days doing a post-mortem on the issue around the 2.6.4 patch. We wanted to learn from it and go through every detail where we made mistakes. It wasn't just a single line of code, but rather the procedure. Until now we trusted our test servers and spent a considerable amount of money to ensure really thorough testing on the most common setups. It's no news to anyone in the software development world that we have failing tests all the time, since our agent runs on thousands of unique configurations and some of our customers run custom-compiled kernels… Yep, this is the point where we realized we can't test for every setup. But here is what we can do, and what we are working on right now. Even though the issue affected only a small percentage of our community, we have grown to a size where that is already too much. We can't stop apologizing for what happened. This was our first time, and our last as well!
What is going to change? You ask, rightly!
To make sure we do not cause a kernel-level issue, we are scaling our test farm to a matrix of over 170 distro / kernel version combinations as part of a general CI improvement. Previously we tested only the mainline kernel versions on every distro, so this improvement already helps a lot to prevent serious bugs like the one we had now.
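To illustrate what such a distro / kernel matrix means in practice, here is a small sketch. The distro names and kernel strings are placeholders chosen for illustration, not our actual test list, and the helper names are hypothetical.

```python
# Placeholder inputs: these are NOT the real contents of the test farm.
KERNELS_BY_DISTRO = {
    "centos6": ["2.6.32-696", "2.6.32-754"],
    "centos7": ["3.10.0-957", "3.10.0-1062"],
}


def build_matrix(kernels_by_distro):
    """Return every (distro, kernel) pair that should get its own test run."""
    return [
        (distro, kernel)
        for distro, kernels in sorted(kernels_by_distro.items())
        for kernel in kernels
    ]


def missing_runs(matrix, completed):
    """Return the combinations in `matrix` that have no completed run yet."""
    done = set(completed)
    return [combo for combo in matrix if combo not in done]


if __name__ == "__main__":
    matrix = build_matrix(KERNELS_BY_DISTRO)
    # Pretend only one combination has been tested so far.
    for distro, kernel in missing_runs(matrix, [("centos6", "2.6.32-696")]):
        print("still untested:", distro, kernel)
```

Listing the untested combinations explicitly is the point: a release is only as well covered as the pairs that actually ran, not the distros that were nominally tested.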
But this is not enough. We have gone back to the drawing board and reworked the release procedure from zero to hero. We will introduce a canary release system and stop the updates immediately if any client reports an error, which will ensure that we never fully release a patch that causes issues. Why didn't we have this before? Because until now we always tried to patch any zero-day attack as soon as possible, and this process will slow releases down a bit, but we have to accept that.
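The general idea of a canary release can be sketched as follows. This is a simplified illustration, not our actual release system: the stage sizes and the `update` callable are assumptions. `update` is expected to return True only when the server reports healthy after the update, so a server that goes silent should be treated as a failure too.

```python
def canary_rollout(servers, update, stages=(0.01, 0.10, 1.0)):
    """Update servers in growing batches and halt at the first failure.

    `stages` are cumulative fractions of the fleet (hypothetical values).
    Returns (updated, failed): the servers updated successfully, and the
    server that stopped the rollout, or None if everything went through.
    """
    updated = []
    for fraction in stages:
        # Cumulative number of servers that should be updated by this stage.
        target = max(1, int(len(servers) * fraction))
        for server in servers[len(updated):target]:
            if not update(server):
                return updated, server  # stop immediately, nobody else is touched
            updated.append(server)
    return updated, None


if __name__ == "__main__":
    fleet = ["s%d" % i for i in range(10)]
    done, failed = canary_rollout(fleet, lambda server: server != "s3")
    print("updated:", done, "halted on:", failed)
```

The trade-off is the one described above: each stage adds delay, but a bad release reaches only the servers updated before the first failure.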
Not enough just yet! We still want to keep the ability to fight zero-day attacks within minutes! This feature has been in development since this summer: we are moving the protection rules into our cloud database, and instead of code-level rules we will simply stream protection rules to the agents. This way we can still fight those insomniac hackers!
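To show the difference between code-level rules and streamed rules, here is a rough sketch in which rules arrive as data and take effect without a new agent release. The JSON message format, the field names and the `RuleSet` class are all invented for this illustration and do not describe the real protocol.

```python
import json
import re


class RuleSet:
    """Holds protection rules that arrive as data instead of shipped code."""

    def __init__(self):
        self._rules = {}  # rule id -> compiled regular expression

    def apply_message(self, raw):
        """Apply one streamed message (hypothetical format) to the rule set."""
        message = json.loads(raw)
        if message["action"] == "upsert":
            self._rules[message["id"]] = re.compile(message["pattern"])
        elif message["action"] == "delete":
            self._rules.pop(message["id"], None)

    def matches(self, request_line):
        """Return the ids of all rules whose pattern occurs in the request."""
        return sorted(
            rule_id
            for rule_id, pattern in self._rules.items()
            if pattern.search(request_line)
        )


if __name__ == "__main__":
    rules = RuleSet()
    rules.apply_message(json.dumps(
        {"action": "upsert", "id": "r1", "pattern": r"wp-login\.php"}
    ))
    print(rules.matches("GET /wp-login.php HTTP/1.1"))
```

Because a new rule is only a message, it can reach the agents in minutes while agent code updates go through the slower canary process.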
Is there anything left? Yes! Many of You rightly told us: "It happens to everyone, but Your communication was terrible." First of all, thank You for your understanding. We appreciate the feedback and the help coming from our community; it helped us a lot in fixing the issue! And yes, we worked on the patch, but gosh, we missed the communication part… This is why our COM department will expand to provide 24/7 support, and we are working on a feature that lets us push global messages to our cloud admin to notify everyone about what we are doing.
To be honest, we started working on customer satisfaction this summer by introducing a public status page for our cloud services, and our UI team has already deployed a staging version of our test farm's status page, where You can see how many servers we are using to run our tests and what state they are in.
And this is my personal promise: I am the product manager, many of You have already met me, and with some of You I have even had way too many drinks as well… Now, I have stopped our feature sprint and asked every developer to help with quality assurance. I will not schedule any feature development until this release procedure meets my expectations for quality. This is a promise, and I will keep it.
Thank You for your understanding, and I wish You a happy Christmas.