Service status updates
MySQL downtime, 2025-06-01
Our MySQL server, squirrel, was showing poor performance, impacting user applications as well as the control panel.
13:00: MySQL service restart initiated
15:00: MySQL service restart completes (yes, two hours to restart)
16:25: Reboot initiated after performance not improving
16:35: Reboot completes, MySQL begins startup
17:25: MySQL service startup completes
18:15: Performance seemingly back to normal levels
Services downtime, 2025-02-22
Overnight system updates seem to have caused some disk access and processes on sinkhole (aka. webserver.srcf.net) to lock up from around 6:30am. Investigation ongoing…
Update 10:30am: After some failed attempts to unwedge or diagnose it, sinkhole has been rebooted.
Update 10:45am: Looks like this is an issue with shared disk space (i.e. user/group home directories) available on all user-facing machines – shell.srcf.net and doom.srcf.net are experiencing similar downtime.
Update 3:20pm: Services were fully restored at 2:00pm. It seems overnight updates caused pip aka. shell.srcf.net to lock up the shared disk space (exactly how remains unclear), in turn locking up several other machines. Rebooting pip fixed the problem.
Webserver downtime, 2024-12-18
The SRCF webserver sinkhole (aka. webserver.srcf.net) experienced a freeze from around 4:30am this morning, taking out all web hosting and remote access. This was noticed just after 8am, and the server was rebooted a little later after some failed attempts to thaw it.
The server itself was back online by 9am, though some user services timed out and failed to start – these were caught and retried by 10am, at which point service should have resumed as normal.
Stay warm out there this holiday season! ☃
Brief network disconnection, 2024-09-25 03:00-03:30
Our services will experience a brief disconnection from the outside world from around 3am (BST) on Wednesday 25 September, due to essential maintenance on the University network point-of-presence (PoP) switch connecting us and Cambridge SU to the outside world.
srcf.ucam.org domains were temporarily nonexistent, 2023-11-10
The ucam.org domain (link may not work) under which ‘srcf.ucam.org’ exists temporarily disappeared from the Domain Name System today. During that time, we regret that emails to @srcf.ucam.org addresses will have bounced (reported delivery failures to the sender), and websites weren’t accessible via the www.srcf.ucam.org redirect service, for older accounts with that feature.
webserver.srcf.net systemd services not launched, 2023-10-14
The SRCF web server sinkhole / webserver.srcf.net was rebooted during our scheduled vulnerable period on Saturday 14th October, but a handful of users’ systemd services were not launched when the server booted back up. This is likely due to the server startup timing out and systemd giving up launching the remaining user tasks.
If you were affected, attempts to control existing services with systemctl would have resulted in “Failed to connect to bus” errors.
Unfortunately, due to the small number of accounts affected, this wasn’t noticed until 9 days later, with the remaining tasks launched around 11:45pm on Monday 23rd October. All user services and service management should be back to normal now.
Ancillary services offline - 2023-10-18 16:05-
Due to a loss of power at the West Cambridge Data Centre (WCDC), some non-user-facing services, including backup storage and one of our monitoring systems, have gone down.
Mailman delivery delays, 2023-09-18 and 2023-09-19
The queue processor for Mailman, which runs user and group account mailing lists, quietly became stuck and stopped handling incoming emails. This meant emails were being accepted by our mail server but not being processed.
The logs suggest the problems started around 8am on Monday 18th, with messages backing up until 7pm on Tuesday 19th when the stuck runner was noticed and restarted.
Queued messages were all released together, initially reaching the sending limits of our upstream email relay, ppsw, so some queued messages have been deferred and may take a few hours to make it through.
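A stuck runner like this can be caught sooner by watching how long the oldest entry has been sitting in a queue directory. The following is a minimal sketch of that idea, not what we actually run: the function names are hypothetical, and the queue directory path and the age threshold would need to be chosen for the real installation.

```python
import time
from pathlib import Path
from typing import Optional


def oldest_entry_age(queue_dir: str, now: Optional[float] = None) -> Optional[float]:
    """Return the age in seconds of the oldest file in queue_dir, or None if it is empty."""
    now = time.time() if now is None else now
    # Only regular files count as queued entries; subdirectories are ignored.
    mtimes = [p.stat().st_mtime for p in Path(queue_dir).iterdir() if p.is_file()]
    if not mtimes:
        return None
    return now - min(mtimes)


def queue_looks_stuck(queue_dir: str, max_age_seconds: float,
                      now: Optional[float] = None) -> bool:
    """Report True when the oldest queued file is older than max_age_seconds."""
    age = oldest_entry_age(queue_dir, now)
    return age is not None and age > max_age_seconds
```

Run periodically against the incoming queue directory with a threshold of, say, fifteen minutes, a check along these lines would raise an alert within the hour rather than after a day and a half.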
Poor website performance, Friday 2023-09-15
The SRCF webserver sinkhole was seeing a large number of incoming requests from various IP addresses and servers of a particular cloud provider, likely being used for a denial-of-service attack, which caused performance to drop significantly as the machine became overloaded.
Alerts started at around 4am BST. Initial attempts to block problematic IP ranges from making requests were made at 10:30am, but performance continued to vary until about 7pm as the blocking was adjusted.
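For anyone curious what range-based blocking amounts to, here is a small sketch using Python's standard `ipaddress` module. It is illustrative only and not the mechanism used on sinkhole: the function names are hypothetical, and the ranges in the usage example are reserved documentation prefixes rather than the provider's real ones.

```python
import ipaddress
from typing import Iterable, List, Union

Network = Union[ipaddress.IPv4Network, ipaddress.IPv6Network]


def parse_blocklist(cidrs: Iterable[str]) -> List[Network]:
    """Parse CIDR strings such as '203.0.113.0/24' into network objects."""
    # strict=False tolerates entries with host bits set, e.g. '203.0.113.5/24'.
    return [ipaddress.ip_network(cidr, strict=False) for cidr in cidrs]


def is_blocked(address: str, blocklist: List[Network]) -> bool:
    """Return True if the address falls inside any blocked range."""
    ip = ipaddress.ip_address(address)
    # Compare IP versions explicitly so IPv4 addresses are never tested against IPv6 ranges.
    return any(ip.version == net.version and ip in net for net in blocklist)


if __name__ == "__main__":
    blocklist = parse_blocklist(["203.0.113.0/24", "2001:db8::/32"])
    for candidate in ("203.0.113.7", "198.51.100.7", "2001:db8::1"):
        print(candidate, is_blocked(candidate, blocklist))
```

Blocking whole ranges is a blunt tool, since legitimate traffic from the same provider is caught too, which is part of why the set of ranges tends to need adjusting over several hours.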
Total service outage, 2023-02-05 01:58 to 11:20
The SRCF experienced a total outage of its main server cluster (“thunder”), which our monitoring systems noticed from 01:58 onwards tonight.
Real-time updates from the investigation follow:
- 02:25 – corrected the year in the title (it’s 2023 now!). Signs point to this being a networking failure, either in our upstream network connection to the outside world or in an intermediate network switch that we rely on for this connection. A physical visit to the datacentre would be necessary to confirm this, which we can conduct in the morning.
- 11:57 – we sent someone on site and discovered that a single electrical circuit breaker (technically an RCBO) had tripped. The intermediate switch carrying our network connection, mentioned at 02:25, had a single electrical feed on that circuit, so the trip cut power to the switch and took down our network connection.
We have moved this switch over to the alternate power feed, and services have been reachable again since 11:20.
We will continue to monitor the situation remotely and are liaising with building services to resolve any electrical issues. There are opportunities to improve redundancy of power feeds and network uplinks, to eliminate them as single points of failure, which we aim to pursue in due course.