While the network situation was out of our control, we could have done better. Here is a complete summary of our actions including a mea culpa and what we plan to do in the future.
What did we do?
When the problems started occurring on Thursday, we contacted our colocation provider. They told us the problem was with their upstream network provider. The problem was intermittent, and we thought they were going to be able to fix it. On Friday, the problem continued, and got worse.
With the worsening situation, we had to make a tough decision: let the site hobble along or put up a "site down" notice. We decided on the latter, having assessed (correctly, as it turned out) that our primary provider did not have the situation under control.
We created a simple "site down" server at another provider, and redirected bivio.com to this server. We also communicated on this list that there was a problem and that we were working on it. Furthermore, we ensured that we still had access to all your data. We allowed mail to continue to flow, because mail operates in a "store and forward" mode, which makes it reliable over unreliable networks. It also allowed us to communicate with you.
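For the technically curious, a "site down" server is about as simple as servers get: it answers every request with a maintenance notice and an HTTP 503 (Service Unavailable) status. Here is a rough sketch of the idea in Python; it is purely illustrative, the message text is made up, and our actual setup differs in detail.

    # Minimal "site down" server sketch (illustration only, not our production code).
    # Every request gets an HTTP 503 plus a short maintenance notice.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    MESSAGE = b"bivio.com is temporarily down for maintenance. Mail continues to flow."

    class SiteDownHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            self.send_response(503)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Retry-After", "3600")  # hint to clients: try again later
            self.end_headers()
            self.wfile.write(MESSAGE)

        # Answer POSTs the same way so form submissions fail gracefully, too.
        do_POST = do_GET

    if __name__ == "__main__":
        # DNS for the site is pointed at the host running this process.
        HTTPServer(("0.0.0.0", 80), SiteDownHandler).serve_forever()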
As the afternoon progressed, we decided it was necessary to migrate the data to a different provider, Linode, with which we have over a decade of experience. This involved two steps: moving the bulk of the data and getting a fresh snapshot.
We maintain two backup servers: one at the colocation facility in Ft Collins and another in our office in Boulder. These contain nightly snapshots. We started migrating the nightly snapshot to Linode's Dallas facility from Boulder, because the connection from Boulder was stable.
During brief periods of network connectivity with Ft Collins, we were able to bring down the site completely (including mail), make a complete backup of the transactional data, and ensure the copy of your mail and files in Boulder was completely up to date. As the afternoon progressed into evening, we completed the copy to Dallas, and then made another to Atlanta this morning. (Atlanta is newer, faster, and has more security features than Dallas. For the technically curious, visit linode.com to learn about VLANs and NVMe block storage.)
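For those wondering what "completely up to date" means in practice: one way to be sure a copy matches the original is to compare checksums of every file on both sides. The sketch below illustrates that kind of check in Python; the directory paths are placeholders and it is not our actual tooling, but the idea is the same.

    # Sketch: confirm a copy of a nightly snapshot matches the original by
    # comparing per-file SHA-256 checksums. Paths are illustrative placeholders.
    import hashlib
    from pathlib import Path

    def checksums(root: Path) -> dict[str, str]:
        """Map each file's path (relative to root) to its SHA-256 digest."""
        result = {}
        for path in sorted(root.rglob("*")):
            if path.is_file():
                result[str(path.relative_to(root))] = hashlib.sha256(path.read_bytes()).hexdigest()
        return result

    def verify(original: Path, copy: Path) -> bool:
        a, b = checksums(original), checksums(copy)
        missing = sorted(set(a) - set(b))
        changed = sorted(name for name in a.keys() & b.keys() if a[name] != b[name])
        for name in missing:
            print("MISSING in copy:", name)
        for name in changed:
            print("DIFFERS:", name)
        return not missing and not changed

    if __name__ == "__main__":
        # Hypothetical mount points for the local snapshot and the remote copy.
        ok = verify(Path("/backup/nightly"), Path("/mnt/remote/nightly"))
        print("copy is up to date" if ok else "copy needs another sync pass")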
The next step was bringing up our configuration. We started rebuilding our configuration server this morning in Atlanta even though our primary colocation facility in Ft Collins was fairly stable. We were in constant contact with our primary provider, and they seemed to think that the system would be fully stable in the morning (see Facebook).
Around 11:30 today, we decided to bring the site back up. The network had been stable for a couple of hours. We had numerous calls with our primary provider to understand their perception of the risks. The network seems to be reliable at this point.
What did we do wrong?
We failed to communicate on social media. This was entirely my fault. Laurie made sure I had access to our social media accounts on Thursday. I was so involved in fixing the problem that I failed to communicate with you or delegate the responsibility to someone else on the team. Please accept my apologies for the lack of communication.
We delayed the decision on bringing the site down. We had a suspicion that our primary provider did not have the situation under control. We should have begun migrating the site earlier on Friday.
We were not as prepared as we could have been for a serious outage of this kind. We assumed we could simply copy the data and bring up a server on another provider. Again, this is my responsibility, and I failed to ensure the procedure was well documented and practiced.
What will we do?
This event was the kind of Black Swan that all website owners dread. We are too small a company to run a reliable, replicated version of our service at another provider. We do rely on multiple cold backups (see above), which worked perfectly, but it takes time to turn a cold backup into a running service.
To remedy this, we are going to maintain a parallel configuration at an alternate provider. Rebuilding the configuration is the slowest part of turning a cold backup into a live service; with a parallel configuration ready to go, it is a simple matter of copying the data to the alternate provider and starting the service. We will practice bringing up the site with production data on the alternate provider.
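Part of practicing is verifying that the restored site actually answers. A drill can end with a smoke test as simple as the sketch below; the hostname and page paths are placeholders, not our real ones.

    # Sketch of a post-restore smoke test: check that the alternate site answers
    # and that a few key pages return HTTP 200. Hostname and paths are placeholders.
    import sys
    import urllib.request

    BASE = "https://alternate.example.com"   # hypothetical drill hostname
    PATHS = ["/", "/login", "/accounting"]   # illustrative pages to probe

    def check(path: str) -> bool:
        url = BASE + path
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                status = resp.status
        except Exception as exc:
            print(f"FAIL {url}: {exc}")
            return False
        print(f"{'OK' if status == 200 else 'FAIL'} {url} ({status})")
        return status == 200

    if __name__ == "__main__":
        results = [check(p) for p in PATHS]
        sys.exit(0 if all(results) else 1)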
We are also going to re-evaluate our reliance on our primary colocation provider. This was a significant incident, which, frankly, they did not handle as well as they could have. Black Swan events are real, which is why we also maintain additional offline backups of your data in two separate vaults. This would allow us to recover from an even more serious event than losing one of our online backup servers. We expect the same level of care from our vendors.
Finally, we will improve communications during an outage. We will keep you informed via social media.
Please accept our apologies for this major site outage and the serious inconvenience it has caused.
Thank you for using Bivio. Your loyalty is what keeps this business going. Hopefully, this note will help restore some of your trust in us.
Sincerely,
Rob
CEO
Bivio Inc.
Norman Gee on
Will account sync bring in the data from April 1st? Or is that lost, and will we have to transfer the transactions from that date?
Rob Nagler on
Hi Norman,
AccountSync always catches up.
Cheers,
Rob
John Munn on
Rob.... Thank you for bringing us up to date. From my perspective, considering how often I back up my personal data and how difficult it is to restore a system after a failure, I know my club's data is more secure at bivio than it would be if it were on my local hard drive. I think you did a great job getting things back to normal with only a day's outage.
When I worked in data processing about 35 years ago (the dark ages in computer terms), a hard drive crash would take at least a day to reload from backup and rebuild the file by entering the current transactions made since the backup was created. If a system crash happened to me I'd expect at least a few days to get back up to speed, so I'm very thankful bivio has my back.
Thanks again;
John Munn
Daniel Williams on
Rob, your candor is a real breath of fresh air! Cost? Priceless!! Thank you!