SRCF status

Service status updates

1 Jun 2025 13:00

MySQL downtime, 2025-06-01

Our MySQL server squirrel was showing some poor performance, impacting user applications as well as the control panel.

13:00: MySQL service restart initiated

15:00: MySQL service restart completes (yes, two hours to restart)

16:25: Reboot initiated after performance not improving

16:35: Reboot completes, MySQL begins startup

17:25: MySQL service startup completes

18:15: Performance seemingly back to normal levels
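
For reference, the length of each phase can be tallied from the timestamps above. This is only a rough sketch: the times are copied from the timeline, while the event labels and function names are our own shorthand, not anything from the server itself.

```python
from datetime import datetime

# Timestamps copied from the timeline above (all on 2025-06-01).
# The labels are paraphrased shorthand, not log output.
EVENTS = [
    ("13:00", "restart initiated"),
    ("15:00", "restart completes"),
    ("16:25", "reboot initiated"),
    ("16:35", "reboot completes, MySQL begins startup"),
    ("17:25", "MySQL startup completes"),
    ("18:15", "performance back to normal"),
]


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def phase_durations(events):
    """Pair each event with the next one and return (label, minutes) tuples."""
    return [
        (f"{a[1]} -> {b[1]}", minutes_between(a[0], b[0]))
        for a, b in zip(events, events[1:])
    ]


if __name__ == "__main__":
    for label, mins in phase_durations(EVENTS):
        print(f"{mins:4d} min  {label}")
    print(f"{minutes_between(EVENTS[0][0], EVENTS[-1][0]):4d} min  total")
```

By this tally the incident ran for 315 minutes end to end, of which 120 were the initial service restart.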

22 Feb 2025 09:45

Services downtime, 2025-02-22

Overnight system updates seem to have caused some disk access and processes on sinkhole aka. webserver.srcf.net to lock up from around 6:30am. Investigation ongoing…

Update 10:30am: After some failed attempts to unwedge or diagnose it, sinkhole has been rebooted.

Update 10:45am: Looks like this is an issue with shared disk space (i.e. user/group home directories) available on all user-facing machines – shell.srcf.net and doom.srcf.net are experiencing similar downtime.

Update 3:20pm: Services were fully restored at 2:00pm. It seems overnight updates caused pip aka. shell.srcf.net to lock up the shared disk space (exactly how remains unclear), in turn locking up several other machines. Rebooting pip fixed the problem.

18 Dec 2024 10:00

Webserver downtime, 2024-12-18

The SRCF webserver sinkhole (aka. webserver.srcf.net) experienced a freeze from around 4:30am this morning, taking out all web hosting and remote access. This was noticed just after 8am, and the server was rebooted a little later after some failed attempts to thaw it.

The server itself was back online by 9am, though some user services timed out and failed to start – these were caught and retried by 10am, at which point service should have resumed as normal.

Stay warm out there this holiday season! ☃

30 Aug 2024 11:17

Brief network disconnection, 2024-09-25 03:00-03:30

Our services will experience a brief disconnection from the outside world from around 3am (BST) on Wednesday 25 September, due to essential maintenance on the University network point-of-presence (PoP) switch connecting us and Cambridge SU to the outside world.

10 Nov 2023 12:49

srcf.ucam.org domains were temporarily nonexistent, 2023-11-10

The ucam.org domain (link may not work) under which ‘srcf.ucam.org’ exists temporarily disappeared from the Domain Name System today. During that time, we regret that emails to @srcf.ucam.org addresses will have bounced (reported delivery failures to the sender), and websites weren’t accessible via the www.srcf.ucam.org redirect service, for older accounts with that feature.

24 Oct 2023 00:00

webserver.srcf.net systemd services not launched, 2023-10-14

The SRCF web server sinkhole / webserver.srcf.net was rebooted during our scheduled vulnerable period on Saturday 14th October, but a handful of users’ systemd services were not launched when the server booted back up. This is likely due to the server startup timing out and systemd giving up launching the remaining user tasks.

If you were affected, attempts to control existing services with systemctl would have resulted in “Failed to connect to bus” errors.

Unfortunately, due to the small number of accounts affected, this wasn’t noticed until 9 days later, with the remaining tasks launched around 11:45pm on Monday 23rd October. All user services and service management should be back to normal now.

18 Oct 2023 17:14

Ancillary services offline - 2023-10-18 16:05-

Due to a loss of power at the West Cambridge Data Centre (WCDC), some non-user-facing services, including backup storage and one of our monitoring systems, have gone down.

19 Sep 2023 22:51

Mailman delivery delays, 2023-09-18 and 2023-09-19

The queue processor for Mailman, which runs user and group account mailing lists, quietly became stuck and stopped handling incoming emails. This meant emails were being accepted by our mail server but not being processed.

The logs suggest the problems started around 8am on Monday 18th, with messages backing up until 7pm on Tuesday 19th when the stuck runner was noticed and restarted.

Queued messages were all released together, initially reaching the sending limits of our upstream email relay ppsw, so some existing messages have been deferred and may take a few hours before they make it through.
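
A stuck runner like this is the kind of thing a simple age check on the queue can catch. The following is a minimal sketch of that idea, not the monitoring we actually run: it assumes queued messages sit as files in a single directory, and the function names and the 15-minute threshold are hypothetical.

```python
import time
from pathlib import Path


def oldest_entry_age(queue_dir, now=None):
    """Age in seconds of the oldest file in queue_dir, or 0.0 if it is empty."""
    now = time.time() if now is None else now
    mtimes = [p.stat().st_mtime for p in Path(queue_dir).iterdir() if p.is_file()]
    return max(0.0, now - min(mtimes)) if mtimes else 0.0


def queue_looks_stuck(queue_dir, max_age_seconds=900.0, now=None):
    """True if something has been waiting longer than max_age_seconds.

    The 900-second default is an arbitrary illustrative threshold.
    """
    return oldest_entry_age(queue_dir, now) > max_age_seconds
```

Run periodically against the incoming queue directory, a check along these lines would flag messages that are being accepted but never picked up.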

15 Sep 2023 19:00

Poor website performance, Friday 2023-09-15

The SRCF webserver sinkhole was seeing a large number of incoming requests from various IP addresses and servers of a particular cloud provider, likely being used for a denial-of-service attack. This caused performance to drop significantly as the machine became overloaded.

Alerts started at around 4am BST. Initial attempts to block problematic IP ranges from making requests were made at 10:30am, but performance continued to vary until about 7pm as the blocking was adjusted.
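
To illustrate what blocking by IP range means in practice, here is a small sketch using Python's standard `ipaddress` module. The ranges shown are reserved documentation ranges standing in for the real ones, and `BLOCKED_RANGES` and `is_blocked` are hypothetical names; this is not the configuration that was applied on sinkhole.

```python
import ipaddress

# Hypothetical blocklist. These are documentation-only ranges (RFC 5737),
# used as placeholders for the ranges that were actually blocked.
BLOCKED_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in ("203.0.113.0/24", "198.51.100.0/25")
]


def is_blocked(client_ip: str, blocked=BLOCKED_RANGES) -> bool:
    """Return True if client_ip falls inside any blocked network."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in blocked)
```

Blocking whole ranges rather than single addresses is what makes the adjustment period tricky: too narrow and the traffic moves to a neighbouring address, too wide and legitimate visitors on the same provider are cut off.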

5 Feb 2023 02:10

Total service outage, 2023-02-05 01:58 to 11:20

The SRCF experienced a total outage of its main server cluster (“thunder”), which our monitoring systems noticed from 01:58 onwards tonight.

Real-time updates from the investigation follow:

02:25 – corrected the year in the title (it’s 2023 now!). Signs point to this being a networking failure, either in our upstream network connection to the outside world or in an intermediate network switch that we rely on for this connection. A physical visit to the datacentre would be necessary to confirm this, which we can conduct in the morning.
11:57 – we sent someone on site and discovered that a single electrical circuit breaker (technically an RCBO) had tripped. The intermediate switch carrying our network connection, mentioned at 02:25, had a single electrical feed on that circuit, causing disruption to our network connection.
We have moved this switch over to the alternate power feed, and services have been reachable again since 11:20.

We will continue to monitor the situation remotely and are liaising with building services to resolve any electrical issues. There are opportunities to improve redundancy of power feeds and network uplinks, to eliminate them as single points of failure, which we aim to pursue in due course.