H5P.org Easter outage
On April 1st at 9:10 AM CET, four core team members received a phone call from a robot telling us that H5P.org was down. More details were sent to us via Slack and e-mail. We quickly found that the third-party storage service H5P.org uses had gone down. Technical details about what happened and how the supplier handled it are posted here.
Outages of the third-party storage service have happened before. They have normally lasted a few minutes, and never longer than an hour or two. We had the option of restoring our daily backup, but that would have meant losing up to 24 hours of data. Since previous storage service outages had been minor, and since it was a Sunday in the middle of Easter, we decided to wait and monitor the situation.
Assuming that this block storage outage would be minor like the previous incidents was a mistake. As the day went on, the updates we received from the supplier gave no indication that the downtime would be long, so we kept expecting the service to be back up soon, which compounded our original mistake.
Luckily, the vast majority of H5P users and content in the world were largely unaffected. H5P is currently designed exclusively for self-hosting. H5P.org was never meant to be a free service for creating and embedding H5P content on external web sites. We are, however, aware that many use H5P.org this way, and to those who rely on H5P.org, especially for embedded content, we are incredibly sorry for this service outage.
H5P.org was primarily created as a community web site with a forum, examples, documentation and features for letting users try out H5P. There is currently no external funding for it, and thus we’re not using a high-end setup with automatic failovers etc. The H5P.org infrastructure and routines are designed to be cost efficient. They are not designed for high availability, since that would require a lot more funding, both for the infrastructure and for making sure that people are always available if something goes wrong. A long service outage like the one at Easter is, however, completely unacceptable for H5P.org.
We’re now going to make some changes to reduce the risk of long outages, including:
We have already made it clearer on H5P.org that the content creation features offered there are mainly meant for test driving H5P. H5P.com will be available very soon for those who need a high availability SaaS solution. It has replicas for almost every component with automatic failovers and auto-scaling, supports updates with no downtime, and uses high-end software and hardware, with personnel available around the clock to resolve any issues instantly.
The most critical parts currently being hosted together with H5P.org are the H5P Content Type Hub APIs. These will be moved to the same kind of infrastructure and service level that H5P.com is running on.
We will reconsider the infrastructure and suppliers for H5P.org. We will at least make sure that the block storage can’t cause the entire site to go down.
We will quickly replace the front page of H5P.org with information about the downtime if and when anything like this happens again, and we will also be faster to post updates on social media. We won’t assume that a third-party service will be fixed quickly.
We will consider restoring backups faster if something like this happens again.
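One way to avoid repeating the "wait and see" mistake is to decide the escalation steps before an outage starts. The following is only a minimal sketch of that idea, not a description of our actual routines. The names `EscalationPolicy` and `next_action` are hypothetical, and the thresholds are assumptions chosen for illustration (the two hour limit mirrors the longest previous storage outage mentioned above).

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class EscalationPolicy:
    # Hypothetical thresholds, chosen only for illustration.
    show_notice_after: timedelta = timedelta(minutes=15)
    restore_backup_after: timedelta = timedelta(hours=2)


def next_action(outage: timedelta,
                policy: EscalationPolicy = EscalationPolicy()) -> str:
    """Return the step to take for an outage that has lasted `outage`."""
    # Check the most severe threshold first so long outages escalate fully.
    if outage >= policy.restore_backup_after:
        return "restore_backup"
    if outage >= policy.show_notice_after:
        return "show_downtime_notice"
    return "monitor"
```

With these assumed thresholds, a five minute outage is only monitored, a thirty minute outage puts up the downtime notice, and anything past two hours triggers a backup restore instead of waiting for the supplier.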