<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Tyler Finethy on Medium]]></title>
        <description><![CDATA[Stories by Tyler Finethy on Medium]]></description>
        <link>https://medium.com/@tylfin?source=rss-e95ef4be7a2e------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*0o7_rNc5q-AebMvK48usyQ.jpeg</url>
            <title>Stories by Tyler Finethy on Medium</title>
            <link>https://medium.com/@tylfin?source=rss-e95ef4be7a2e------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Wed, 08 Apr 2026 18:29:54 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@tylfin/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[That Time Facebook Vanished]]></title>
            <link>https://medium.com/@tylfin/that-time-facebook-vanished-1e742f917185?source=rss-e95ef4be7a2e------2</link>
            <guid isPermaLink="false">https://medium.com/p/1e742f917185</guid>
            <category><![CDATA[postmortem]]></category>
            <category><![CDATA[startup]]></category>
            <category><![CDATA[software]]></category>
            <category><![CDATA[technology]]></category>
            <category><![CDATA[outage]]></category>
            <dc:creator><![CDATA[Tyler Finethy]]></dc:creator>
            <pubDate>Sun, 16 Jan 2022 19:12:15 GMT</pubDate>
            <atom:updated>2022-01-17T00:51:10.675Z</atom:updated>
<content:encoded><![CDATA[<h4>What we need to take away from the incident as engineers</h4><figure><img alt="Picture of the Facebook logo being erased" src="https://cdn-images-1.medium.com/max/1024/0*mDx-Y5afy_VlzfVs" /><figcaption>Image courtesy of <a href="https://www.pexels.com/photo/silver-iphone-2228555/">pexels.com</a></figcaption></figure><p>As many of us remember, on October 4th, 2021, Facebook, Instagram, WhatsApp, and many other applications under the umbrella of what is now Meta disappeared from the internet. The sites were down for six to seven hours, and <a href="https://www.bloomberg.com/news/articles/2021-10-04/zuckerberg-loses-7-billion-in-hours-as-facebook-plunges">shares in the company plunged 5%</a> due to the mistake. This outage looked like a worst-case-scenario, code-red, I’m-going-to-lose-my-job engineering failure.</p><p>While it’s always entertaining to poke fun, it’s important to remember that this is a multi-billion-dollar organization with thousands of engineers, more than <a href="https://datacenterfrontier.com/meta-adds-more-data-centers-in-iowa-pushing-campus-to-5-million-sf/">47 global data centers</a>, and over twenty years of experience operating these systems. If an engineer there could press enter and watch every alert in the company go off, should we expect to fare better? Postmortems should be treated like open-source software; by making them free and accessible, we can all build better software by learning from others’ mistakes.</p><h3>The Start</h3><p>After folks started noticing, <a href="https://engineering.fb.com/2021/10/04/networking-traffic/outage/">Facebook wrote a post</a> apologizing for the inconvenience. They also explained that a change to the <strong>backbone routers</strong> that coordinate traffic between data centers caused the problem. 
Backbone routers connect <strong>autonomous systems (ASs)</strong> in large inter-networks.</p><blockquote>Check out <a href="https://www.cloudflare.com/learning/">CloudFlare’s learning center</a> for an overview of ASs, DNS, and/or BGP, as a proper explanation of each deserves an entire article.</blockquote><p>This configuration change resulted in cascading failures that caused <strong>Domain Name System (DNS)</strong> providers like <a href="https://blog.cloudflare.com/october-2021-facebook-outage/">CloudFlare to think they had a problem</a> as their path to Facebook vanished. After an investigation, CloudFlare and others confirmed that, for whatever reason, Facebook had unregistered itself from the internet. This “unregistering” happened through <strong>Border Gateway Protocol (BGP)</strong> updates.</p><h3>Mystery Solved?</h3><p>The next day, Facebook put out <a href="https://engineering.fb.com/2021/10/05/networking-traffic/outage-details/">another message</a> with a full report. Their DNS servers disable BGP routes for locations that are deemed unhealthy. This policy makes a lot of sense, especially with at least 47 fail-over locations.</p><figure><img alt="Graph showing two spikes when Facebook sent out BGP updates removing the site from the internet" src="https://cdn-images-1.medium.com/max/1024/0*vJJpDA-hGM1_9sIY.png" /><figcaption>Image courtesy of <a href="https://blog.cloudflare.com/content/images/2021/10/image4-11.png">CloudFlare</a> showing the BGP withdrawn announcements from Facebook</figcaption></figure><p>The problem here goes back to the backbone router change. Since the entire backbone was removed, the data centers registered themselves as unhealthy, and thus the BGP routes were withdrawn. 
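To make that policy concrete, here is a minimal Go sketch of the per-location rule described above. The types and names are hypothetical (this is not Meta's code), but they capture why a sensible per-location rule produced a global outage:

```go
package main

import "fmt"

// datacenter models a location's health as seen by the DNS servers
// (a hypothetical simplification, not Meta's actual implementation).
type datacenter struct {
	name       string
	backboneUp bool
}

// routesToWithdraw applies the policy from the postmortem: any location
// that cannot reach the backbone has its BGP routes pulled.
func routesToWithdraw(dcs []datacenter) []string {
	var withdrawn []string
	for _, dc := range dcs {
		if !dc.backboneUp {
			withdrawn = append(withdrawn, dc.name)
		}
	}
	return withdrawn
}

func main() {
	// When the backbone change takes every location offline at once,
	// the per-location policy withdraws every route: the site vanishes.
	dcs := []datacenter{{"dc-1", false}, {"dc-2", false}, {"dc-3", false}}
	fmt.Println(routesToWithdraw(dcs))
}
```

The rule is reasonable in isolation; the failure mode only appears when a shared dependency (the backbone) takes every location down at once.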
While they had an audit tool to prevent mistakes like this, the change made it through due to a bug that was only discovered after the outage.</p><blockquote>Our systems are designed to audit commands like these to prevent mistakes like this, but a bug in that audit tool prevented it from properly stopping the command — Meta</blockquote><p>During the outage, engineers were unable to access the data centers through normal means because authentication was tied to the unavailable servers, and the outage prevented standard operating procedure since services like Messenger were unavailable. Eventually, however, on-site engineers were able to get into the facilities and restore the backbone connectivity.</p><p>There’s a lot to unpack here, like: did they need a blowtorch to break into their own data centers? Also, at least from my perspective, this story feels too abstract. I want to simulate these networks and see if we can recreate both what was happening inside Facebook and what was observed by the rest of the internet.</p><h3>The Laboratory</h3><p>To try to recreate what happened, I’m using a tool called <a href="https://ipmininet.readthedocs.io/en/latest/"><strong>IPMininet</strong></a>.</p><p>As an aside, while I’ve used <a href="http://mininet.org/"><strong>Mininet</strong></a> before in a graduate networking course, I was surprised I couldn’t find a more modern or user-friendly solution to simulate networks.</p><p>Anyways! Let’s start with a simplified example using DNS to explore how these technologies interact. When things are working, a client queries Cloudflare for the Facebook DNS records to make the subsequent requests.</p><figure><img alt="Flow graph showing DNS and HTTP queries" src="https://cdn-images-1.medium.com/max/354/1*Db6npavzBQamGOGp-QWHvA.png" /><figcaption>Clients first need to make DNS requests to determine how to access Facebook.</figcaption></figure><p>IPMininet lets users describe topologies like this using the Python programming language. 
For example, I’ve used this API to simulate the situation described:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/3dae82e8a458266f7c126602673dadc0/href">https://medium.com/media/3dae82e8a458266f7c126602673dadc0/href</a></iframe><p>With this topology, we can query the DNS server and ping the API server:</p><pre>mininet&gt; client dig @cloudflare api.facebook.com +short<br>192.168.0.2<br>mininet&gt; client ping 192.168.0.2<br>PING 192.168.0.2 (192.168.0.2) 56(84) bytes of data.<br>64 bytes from 192.168.0.2: icmp_seq=1 ttl=62 time=0.062 ms<br>...</pre><p>These records have a time-to-live (TTL) of 60s, so our service should re-query the server every minute. An actual DNS query will be much more complex, traveling through the root server and traversing the DNS hierarchy until the <a href="https://www.cloudflare.com/learning/dns/dns-server-types/#authoritative-nameserver"><strong>authoritative nameserver</strong></a> is found. Running dig locally against facebook.com gives a real Facebook IP:</p><pre>&gt; dig @1.1.1.1 facebook.com +short<br>157.240.241.35</pre><p>This example gives us something to experiment with, but a lot of us are already familiar with DNS. Setting up a network that includes multiple ASs, communicates via eBGP, and performs the DNS traversal gets us a lot closer to reality:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/717/1*wnQ4kO1AuoWiUdqR7UTYdg.png" /><figcaption>A client in some AS makes the DNS query to Cloudflare, which forwards the query to Facebook’s DNS servers available via eBGP.</figcaption></figure><p>I tried using IPMininet again for this task and found it incredibly difficult. 
So, starting as simple as possible, I tried setting up two ASs that communicate via eBGP:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/5ddfebd7b2a02c55c33416c7c8bd6a51/href">https://medium.com/media/5ddfebd7b2a02c55c33416c7c8bd6a51/href</a></iframe><p>Unfortunately, when I went to confirm the eBGP connection, I ran into a number of errors that prevented me from doing so:</p><pre>mininet&gt; pingall<br>*** Ping: testing reachability over IPv4 and IPv6<br>has1r1 --IPv4--&gt; X <br>has1r1 --IPv6--&gt; X <br>has2r1 --IPv4--&gt; X <br>has2r1 --IPv6--&gt; X <br>*** Results: 100% dropped (0/4 received)<br>mininet&gt; noecho as2r1<br>*** Enter a command for node: as2r1 &lt;cmd&gt;mininet&gt;telnet localhost bgpd<br>*** Unknown command: telnet localhost bgpd</pre><p>It doesn’t appear to be a topology issue, as the advanced BGP example complains about the same problems. I suspect a missing utility or package, but restarting from scratch didn’t help. Nonetheless, I’ll continue to play with it and write a follow-up with a complete recreation if I figure it out.</p><h3>Lessons Learned</h3><p>The second report from Meta lists several follow-up action items, including increasing their testing, drills, and overall resilience. Having the capacity to run outage drills, find the weak points in the chain, and come up with action plans can make a huge difference. Netflix has gone to the extreme, implementing a <a href="https://netflixtechblog.com/the-netflix-simian-army-16e57fbab116">Chaos Monkey tool</a> that automatically takes down parts of its infrastructure to reveal where failures can occur.</p><p>A lesson for the rest of us to think about is how inaccessible some of these technologies are. While few of us will work with BGP daily, experimenting with software is critical to identifying points of failure and making informed decisions. 
Julia Evans wrote a great post titled <a href="https://jvns.ca/blog/2021/10/05/tools-to-look-at-bgp-routes/">Tools to explore BGP</a>, where she notes that you can’t publish routes or broadcast updates without owning an ASN yourself.</p><blockquote>But with BGP, I think that unless you own your own ASN, you can’t publish routes yourself — Julia Evans</blockquote><h3>References</h3><ol><li><a href="https://www.bloomberg.com/news/articles/2021-10-04/zuckerberg-loses-7-billion-in-hours-as-facebook-plunges">https://www.bloomberg.com/news/articles/2021-10-04/zuckerberg-loses-7-billion-in-hours-as-facebook-plunges</a></li><li><a href="https://datacenterfrontier.com/meta-adds-more-data-centers-in-iowa-pushing-campus-to-5-million-sf/">https://datacenterfrontier.com/meta-adds-more-data-centers-in-iowa-pushing-campus-to-5-million-sf/</a></li><li><a href="https://engineering.fb.com/2021/10/04/networking-traffic/outage/">https://engineering.fb.com/2021/10/04/networking-traffic/outage/</a></li><li><a href="https://blog.cloudflare.com/october-2021-facebook-outage/">https://blog.cloudflare.com/october-2021-facebook-outage/</a></li><li><a href="https://engineering.fb.com/2021/10/05/networking-traffic/outage-details/">https://engineering.fb.com/2021/10/05/networking-traffic/outage-details/</a></li><li><a href="https://www.cloudflare.com/learning/">https://www.cloudflare.com/learning/</a></li><li><a href="https://ipmininet.readthedocs.io/en/latest/">https://ipmininet.readthedocs.io/en/latest/</a></li><li><a href="http://mininet.org/">http://mininet.org/</a></li><li><a href="https://www.cloudflare.com/learning/dns/dns-server-types/#authoritative-nameserver">https://www.cloudflare.com/learning/dns/dns-server-types/#authoritative-nameserver</a></li><li><a href="https://netflixtechblog.com/the-netflix-simian-army-16e57fbab116">https://netflixtechblog.com/the-netflix-simian-army-16e57fbab116</a></li><li><a 
href="https://jvns.ca/blog/2021/10/05/tools-to-look-at-bgp-routes/">https://jvns.ca/blog/2021/10/05/tools-to-look-at-bgp-routes/</a></li></ol><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=1e742f917185" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[The Cloud Isn’t Developer-Friendly Anymore]]></title>
            <link>https://medium.com/better-programming/the-cloud-isnt-developer-friendly-anymore-9f57ad55d6be?source=rss-e95ef4be7a2e------2</link>
            <guid isPermaLink="false">https://medium.com/p/9f57ad55d6be</guid>
            <category><![CDATA[technology]]></category>
            <category><![CDATA[cloud]]></category>
            <category><![CDATA[big-data]]></category>
            <category><![CDATA[programming]]></category>
            <category><![CDATA[data-science]]></category>
            <dc:creator><![CDATA[Tyler Finethy]]></dc:creator>
            <pubDate>Thu, 11 Mar 2021 18:25:15 GMT</pubDate>
            <atom:updated>2021-03-11T18:25:15.644Z</atom:updated>
<content:encoded><![CDATA[<h4>And it’s a primary reason for production outages</h4><figure><img alt="Notes stuck to computer monitor" src="https://cdn-images-1.medium.com/max/1024/0*zDHWranjiAFe3Qzx" /><figcaption>Photo by <a href="https://unsplash.com/@sigmund?utm_source=medium&amp;utm_medium=referral">Sigmund</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a>.</figcaption></figure><p>When I mention configuration management, I know the battle-worn developers out there shiver. I can’t count the number of failed releases I’ve seen as a result of a seemingly innocuous config change. Even Google has published seven postmortem reports related to config errors. <a href="https://github.com/danluu/post-mortems#config-errors">During one incident</a>, they deny-listed every URL, prompting each search result to show a warning.</p><p>Now, with modern deployment strategies, our releases should fail well before they reach production. So why on earth do these issues persist? The harsh truth is that lazy instantiation is, well, lazy, and it’s time for us to address the problem from all sides.</p><h3>Let’s Start Simple</h3><p>So, what the heck is lazy instantiation, and what does it have to do with production outages caused by configuration changes? The easiest way to explain this is through an example.</p><p>Here’s a snippet of code written in Go where an application is using an external MySQL database:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/dc85fe6fa332611150a4ded328f35783/href">https://medium.com/media/dc85fe6fa332611150a4ded328f35783/href</a></iframe><p>If the variable connStr is incorrect, will this code panic?</p><p>The answer in this case is actually no. While it looks like the code opens a connection to the database, it actually lazily instantiates a database pool object and defers connecting until the first request.</p><p>This can differ based on the database driver. 
The <a href="https://golang.org/pkg/database/sql/#Open">sql.Open documentation</a> recommends actually confirming that the source is valid:</p><blockquote>“To verify that the data source name is valid, call Ping.”</blockquote><p>Let’s say you have an API server that passes this database object to the request handlers. In that case, the server will boot and appear healthy while every incoming request subsequently fails.</p><p>In simple terms, lazy instantiation is the practice of deferring work (in this case, opening a connection to the database) until it’s first needed. This creates a big problem: There’s no way of knowing if our service is broken until it fails. The lesson here is that services should verify configuration options on boot. Otherwise, we postpone error detection until production out of sheer laziness or ignorance.</p><p>What happens when it’s measurably harder than just running Ping to confirm? Let’s take a look at verifying S3 bucket policies, where this problem rears its ugly head.</p><h3>Bucket Policies</h3><p>If you don’t know, S3 is <a href="https://aws.amazon.com/s3/">an object storage service</a> by Amazon that makes the problem of cloud file storage ridiculously easy. The kicker here is that it requires a fair bit of setup and configuration in order to be utilized correctly.</p><p>In general, when interacting with S3, we have some credential that can have various permissions like ListBucket, GetObject, PutObject, or DeleteObject.</p><p>Say our application relies on a credential that needs to get, put, and delete objects. How can we ensure a given credential is correct at boot time?</p><figure><img alt="Example picture of our application communicating with S3" src="https://cdn-images-1.medium.com/max/618/1*CXSlMUSjZ2iQMgecclDRWA.png" /><figcaption>The application needs permissions to perform different actions on an S3 bucket across the internet. 
Diagram designed using <a href="https://app.diagrams.net/">app.diagrams.net</a>.</figcaption></figure><p>If you guessed that we need to try getting, putting, and deleting an object, you’re right! Unfortunately, S3 doesn’t support any actions to test a credential. Here’s an example of what the validation has to look like in this case:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/6626cfc1d0a284f412056b27203bfee2/href">https://medium.com/media/6626cfc1d0a284f412056b27203bfee2/href</a></iframe><p>The takeaway here should really be this: When writing developer software, we should try to support the full needs of our users. Yes, a working API with “11 9s” of durability is great, but how does it fit into the whole picture when you consider things like continuous integration and delivery?</p><p>You might think that this is just an Amazon problem or a problem with SaaS or PaaS providers, but let’s look at one last example that should change your mind.</p><h3>What Proxy? Where?</h3><p>I faced this problem just recently. Say your application supports an <a href="https://en.wikipedia.org/wiki/Proxy_server">outbound HTTP(S) proxy</a>. This means your application sends outgoing HTTP requests to the proxy, which then communicates with the internet. How can you ensure the proxy is configured correctly <em>without </em>making a request to the internet?</p><figure><img alt="Example topology of an application that sends all HTTP(S) traffic through a proxy" src="https://cdn-images-1.medium.com/max/904/1*qJpJkgQectdYtP-ujiNQOw.png" /><figcaption>Example topology of an application that sends all HTTP(S) traffic through a proxy. 
Diagram designed using <a href="https://app.diagrams.net/">app.diagrams.net</a>.</figcaption></figure><p>Maybe we can make a <a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods/HEAD">HEAD request</a> to the proxy with the correct headers, and if we get a <a href="https://httpstatuses.com/">200 OK</a> back, everything is all good?</p><p>Without knowing more about the specific proxy or the implementation, that might not work. We might have bad credentials and get a 200 OK anyway! We can confirm this by using <a href="https://hackage.haskell.org/package/hprox">hprox</a>:</p><pre>&gt; hprox -p 1122 -a userpass.txt &amp;<br>&gt; curl -I localhost:1122<br>HTTP/1.1 200 OK<br>Date: Thu, 04 Mar 2021 21:03:10 GMT<br>Server: Apache<br>Vary: Accept-Encoding<br>Content-Type: text/html</pre><p>So that’s not going to work. In lieu of an existing standard, we’ll have to make do on our own. <a href="https://tools.ietf.org/html/rfc7231#section-4.3.6">RFC 7231 section 4.3.6</a> describes the CONNECT method and even mentions the <a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Proxy-Authorization">Proxy-Authorization</a> header. When we do that:</p><pre>&gt; hprox -p 1122 -a userpass.txt &amp;<br>&gt; curl -X CONNECT -v localhost:1122<br>...<br>&lt; HTTP/1.1 407 Proxy Authentication Required<br>&lt; Transfer-Encoding: chunked<br>&lt; Date: Thu, 04 Mar 2021 21:06:37 GMT<br>&lt; Server: Apache<br>&lt; Vary: Accept-Encoding<br>&lt; Proxy-Authenticate: Basic realm=&quot;hprox&quot;<br>...</pre><p>This gives us the correct behavior: a 407 until we provide the right credentials. 
Now using this, we can craft some validation code to use at boot time to ensure that any bad configurations don’t make it to production:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/9ed0a918a529b5569e8e280149dca4c8/href">https://medium.com/media/9ed0a918a529b5569e8e280149dca4c8/href</a></iframe><p>It’s time to make an effort to support configuration validation from the get-go. I think the easiest way to do this is through more thorough contract testing.</p><blockquote>“Contract tests assert that inter-application messages conform to a shared understanding that is documented in a contract.” — <a href="https://docs.pact.io/">Pact docs</a></blockquote><p>One of the simplest contract tests is Connect or Ping, so by implementing more contract testing across our applications, we’d solve the problem implicitly.</p><h3>Conclusion</h3><p>Until the world changes and every service exposes an API intended to help applications validate configuration options and ensure runtime usability, we’ll have to come up with tricks to ensure everything is correct. If we want to reduce the number of times our apps crash after deploying, we’ll need to:</p><ul><li>Have applications that verify every configuration option on boot.</li><li>Write developer software that supports the full CI/CD methodology.</li><li>Introduce contract testing as a standard practice to identify holes in our services.</li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=9f57ad55d6be" width="1" height="1" alt=""><hr><p><a href="https://medium.com/better-programming/the-cloud-isnt-developer-friendly-anymore-9f57ad55d6be">The Cloud Isn’t Developer-Friendly Anymore</a> was originally published in <a href="https://betterprogramming.pub">Better Programming</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Debug Go Like a Pro]]></title>
            <link>https://medium.com/better-programming/debug-go-like-a-pro-213d4d74e940?source=rss-e95ef4be7a2e------2</link>
            <guid isPermaLink="false">https://medium.com/p/213d4d74e940</guid>
            <category><![CDATA[programming]]></category>
            <category><![CDATA[software-development]]></category>
            <category><![CDATA[web-development]]></category>
            <category><![CDATA[debugging]]></category>
            <category><![CDATA[golang]]></category>
            <dc:creator><![CDATA[Tyler Finethy]]></dc:creator>
            <pubDate>Thu, 06 Feb 2020 21:23:36 GMT</pubDate>
            <atom:updated>2020-02-06T21:23:36.251Z</atom:updated>
            <content:encoded><![CDATA[<h4>From profiling to debugging and everything in between</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Njhaxg33Sg6mwTyEB-XUiQ.jpeg" /><figcaption>Photo by <a href="https://unsplash.com/@zanilic?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Zan</a> on <a href="https://unsplash.com/s/photos/debug?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></figcaption></figure><p>Once you understand the basics, Golang can make you more productive than ever before. But what do you do when things go wrong?</p><p>You may not know this, but Go natively includes <a href="https://golang.org/pkg/runtime/pprof/">pprof</a> for recording and visualizing run-time profiling data. Third-party tools like <a href="https://github.com/go-delve/delve">delve</a> add support for line-by-line debugging. Leak and race detectors can defend against non-deterministic behavior.</p><p>If you haven’t seen or used these tools before, they will quickly become powerful additions to your arsenal of Golang tools.</p><h3>Why Don’t I Just Print Everything?</h3><p>I’ve met a lot of developers who rarely open the debugger when they run into an issue with their code. I don’t think that’s wrong. If you’re writing unit tests, linting your code, and refactoring along the way, then the quick-and-dirty approach can work for a majority of cases.</p><p>Conversely, I’ve been in the throes of troubleshooting problems and realized it’s quicker to pepper in some breakpoints and open an interactive debugger than continually add assertions and print statements.</p><figure><img alt="Graph of a memory leak slowly building up, then the fix leveling out." 
src="https://cdn-images-1.medium.com/max/1024/0*5vvk22KdDquCQrga.png" /><figcaption>Example of a graph showing a memory leak being fixed by <a href="https://medium.com/@KentGruber/tracking-down-a-golang-memory-leak-with-grmon-74569a00a177">Kent Gruber</a></figcaption></figure><p>For example, one day I was looking at the memory graph for a web application I helped maintain. Every day the total memory usage slowly increased to the point that the server needed to be restarted to remain stable. This is a classic example of a memory leak.</p><p>The quick-and-dirty approach suggests that we read through the code ensuring spawned <a href="https://gobyexample.com/goroutines">goroutines</a> exit, allocated variables get garbage collected, connections properly close, etc. Instead, we profiled the application and found the memory leak in a matter of minutes. An elusive, single statement caused it — usually the case for this class of error.</p><p>This overview will introduce you to some of the tools I use almost every day to solve problems like this one.</p><h3>Profiling Recording and Visualization</h3><p>To get started, let&#39;s take a basic Golang web server with a graceful shutdown and send some artificial traffic. 
Then we’ll use the pprof tool to gather as much information as possible.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/c4c2b6e35de405559780446e3c7a0821/href">https://medium.com/media/c4c2b6e35de405559780446e3c7a0821/href</a></iframe><p>We can ensure this works by doing:</p><pre>$ go run main.go &amp;<br>$ curl localhost:8080<br>Hello World!</pre><p>Now we can profile the CPU by including this snippet:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/8864b90a91dba0525b5d1d7efecc4b67/href">https://medium.com/media/8864b90a91dba0525b5d1d7efecc4b67/href</a></iframe><p>We’ll use a load testing tool to exercise the web server thoroughly to simulate normal to heavy traffic. I used the testing tool <a href="https://github.com/tsenart/vegeta">vegeta</a> to accomplish this:</p><pre>$ echo &quot;GET http://localhost:8080&quot; | vegeta attack -duration=5s<br>Hello world!<br>...</pre><p>When we shut down the Go web server, we’ll see a file, cpu.prof, that contains the CPU profile. This profile can then be visualized with the pprof tool:</p><pre>$ go tool pprof cpu.prof<br>Type: cpu<br>Time: Jan 16, 2020 at 4:51pm (EST)<br>Duration: 9.43s, Total samples = 50ms ( 0.53%)<br>Entering interactive mode (type &quot;help&quot; for commands, &quot;o&quot; for options)<br>(pprof) top 10<br>Showing nodes accounting for 50ms, 100% of 50ms total<br>Showing top 10 nodes out of 24<br>      flat  flat%   sum%        cum   cum%<br>      20ms 40.00% 40.00%       20ms 40.00%  syscall.syscall<br>...<br></pre><p>This is a good start, but Go can do better. We want to profile our application as it receives live traffic so that we don’t have to rely on simulated traffic or add additional code to write profiles to file. 
Adding the net/http/pprof import will automatically add additional handlers to our web server:</p><pre>import _ &quot;net/http/pprof&quot;</pre><p>With that added, we can hit the /debug/pprof/ route through our web browser and see the pprof page teeming with information.</p><figure><img alt="Example of the Golang profiler pprof’s web interface" src="https://cdn-images-1.medium.com/max/1024/1*E_3_jOh9b1pMvKAue07asQ.png" /><figcaption>Example of what you’ll see when navigating to the /debug/pprof/ route</figcaption></figure><p>We can get the same kind of information from the command line by running:</p><pre>$ go tool pprof -top http://localhost:8080/debug/pprof/heap</pre><p>You can also:</p><ul><li>Generate images based on the type of profile.</li><li>Create <a href="http://www.brendangregg.com/flamegraphs.html">Flame Graphs</a> to visualize the time spent by the application.</li><li>Track Goroutines to detect leaks before they cause degraded service.</li></ul><p>Note that for production web servers, we want to avoid exposing this information to the world and should instead bind these handlers to a separate internal port.</p><h3>Delve the Interactive Debugger</h3><p>Delve is advertised as:</p><blockquote>… a simple, full featured debugging tool for Go. Delve should be easy to invoke and easy to use. Chances are if you’re using a debugger, things aren’t going your way. With that in mind, Delve should stay out of your way as much as possible.</blockquote><p>To that end, it works really well when you have an issue that’s taking just a little too long to figure out.</p><p>Getting started with the tool is fairly easy: just follow the <a href="https://github.com/go-delve/delve/tree/master/Documentation/installation">installation steps</a>. 
Add a runtime.Breakpoint() statement and run your code using dlv:</p><pre>$ dlv debug main.go<br>Type &#39;help&#39; for list of commands.<br>(dlv) continue</pre><p>Once you hit the breakpoint, you’ll see the surrounding block of code. For example, in the web server above, I put the breakpoint in the handler:</p><pre>&gt; main.handler() ./main.go:20 (PC: 0x1495476)<br>    15:         _ &quot;net/http/pprof&quot;<br>    16: )<br>    17:<br>    18: func handler(w http.ResponseWriter, r *http.Request) {<br>    19:         runtime.Breakpoint()<br>=&gt;  20:         fmt.Fprintf(w, &quot;Hello World!\n&quot;)<br>    21: }<br>    22:<br>    23: func main() {<br>    24:         srv := http.Server{<br>    25:                 Addr:         &quot;:8080&quot;,<br>(dlv)</pre><p>Now you can go line by line using the next or n command or dig deeper into a function using the step or s command.</p><figure><img alt="Screenshot of interacting with VS Code test and debug buttons" src="https://cdn-images-1.medium.com/max/1024/1*CfR1iPgKhmdYMjQO7melDQ.png" /><figcaption>Example of VS Code with the Golang extension showing the debug test button</figcaption></figure><p>If you’re a fan of a nice UI and clicking buttons instead of using your keyboard, <a href="https://code.visualstudio.com/">VS Code</a> has great delve support. When writing unit tests using the native testing library, you’ll see a debug test button, which initializes delve and lets you step through the code via VS Code in an interactive session.</p><p>For more information on debugging Go code using VS Code, check out the <a href="https://github.com/Microsoft/vscode-go/wiki/Debugging-Go-code-using-VS-Code">Microsoft wiki</a> on it.</p><p>Delve can make adding breakpoints, testing assertions, and diving deep into packages a breeze. 
Don’t be afraid to use it the next time you get stuck on a problem and want to know more about what’s happening.</p><h3>Leak and Race Detectors</h3><p>The last topic I’m going to cover is how to add Golang leak and race detectors to your tests. If you haven’t encountered a race condition or experienced a Goroutine memory leak, consider yourself lucky.</p><p>In 2017 Uber open-sourced the <a href="http://github.com/uber-go/goleak">goleak</a> package, a simple tool that marks the given TestingT as failed if any extra goroutines are found.</p><p>It looks like this:</p><pre>func TestA(t *testing.T) {<br>   defer goleak.VerifyNone(t)<br>   // test logic here.<br>}</pre><p>When you’re doing complex asynchronous work, this check ensures that you both avoid regressions and follow the fifth tenet of <a href="https://the-zen-of-go.netlify.com/"><em>The Zen of Go</em></a>:</p><blockquote>Before you launch a goroutine, know when it will stop.</blockquote><p>Finally, after ensuring you have no Goleaks, you’ll want to protect against race conditions. Thankfully the <a href="https://golang.org/doc/articles/race_detector.html">data race detector</a> is built-in. Consider the example from the race detector’s documentation:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/20905fac6606642d253b2a5f2f63cc81/href">https://medium.com/media/20905fac6606642d253b2a5f2f63cc81/href</a></iframe><p>This is a data race that can lead to crashes and memory corruption. 
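The embed above doesn’t render in every feed reader, so here is my reconstruction of the documentation’s example (comments added): two goroutines write to the same map with no synchronization between the writes.

```go
package main

import "fmt"

func main() {
	c := make(chan bool)
	m := make(map[string]string)
	go func() {
		m["1"] = "a" // first conflicting write, from the new goroutine
		c <- true
	}()
	m["2"] = "b" // second conflicting write, from the main goroutine
	<-c
	for k, v := range m {
		fmt.Println(k, v)
	}
}
```

The channel receive synchronizes the *end* of the goroutine with main, but nothing orders the two map writes relative to each other.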
Running this snippet with the -race flag produces a data race report that pinpoints the conflicting accesses:</p><pre>go run -race main.go <br>==================<br>WARNING: DATA RACE<br>Write at 0x00c0000e2210 by goroutine 8:<br>  runtime.mapassign_faststr()<br>      /usr/local/Cellar/go/1.13.6/libexec/src/runtime/map_faststr.go:202 +0x0<br>  main.main.func1()<br>      /PATH/main.go:19 +0x5d</pre><pre>Previous write at 0x00c0000e2210 by main goroutine:<br>  runtime.mapassign_faststr()<br>      /usr/local/Cellar/go/1.13.6/libexec/src/runtime/map_faststr.go:202 +0x0<br>  main.main()<br>      /PATH/main.go:22 +0xc6</pre><pre>Goroutine 8 (running) created at:<br>  main.main()<br>      /PATH/main.go:18 +0x97<br>==================<br>2 b<br>1 a<br>Found 1 data race(s)</pre><p>While you can use the flag during normal execution of your code, it’s most helpful to add it to your go test command so you detect races as you write tests.</p><h3>Conclusion</h3><p>These are just some of the great tools available in the Golang ecosystem to aid in observing, debugging, and preventing production failures in your codebases. 
If you’re looking to go further, I recommend taking a look at:</p><ul><li>Distributed tracing with <a href="https://opentracing.io/">OpenTracing</a>.</li><li>Time-series monitoring using a tool like <a href="https://prometheus.io/">Prometheus</a>.</li><li>Structured logging using <a href="https://github.com/sirupsen/logrus">logrus</a>.</li></ul><p>For more information on any of the tools listed above, check out the resources section for full documentation and manuals.</p><h3>Resources</h3><ul><li><a href="https://golang.org/pkg/runtime/pprof/">Runtime pprof</a> and <a href="https://golang.org/pkg/net/http/pprof/">net pprof</a></li><li><a href="https://github.com/go-delve/delve">Go delve</a></li><li><a href="https://medium.com/@KentGruber/tracking-down-a-golang-memory-leak-with-grmon-74569a00a177">Tracking down a Golang Leak</a></li><li><a href="https://gobyexample.com/goroutines">Goroutines</a></li><li><a href="https://github.com/tsenart/vegeta">Vegeta load testing tool</a></li><li><a href="http://www.brendangregg.com/flamegraphs.html">Flame Graphs</a></li><li><a href="https://github.com/go-delve/delve/tree/master/Documentation/installation">Delve installation steps</a></li><li><a href="https://code.visualstudio.com/">VS Code</a></li><li><a href="https://github.com/Microsoft/vscode-go/wiki/Debugging-Go-code-using-VS-Code">Debugging Go Code using VS Code by Microsoft</a></li><li><a href="http://github.com/uber-go/goleak">Goleak</a></li><li><a href="https://the-zen-of-go.netlify.com/">The Zen of Go</a></li><li><a href="https://golang.org/doc/articles/race_detector.html">The Go Race Detector</a></li><li><a href="https://opentracing.io/">Open Tracing</a></li><li><a href="https://prometheus.io/">Prometheus</a></li><li><a href="https://github.com/sirupsen/logrus">Logrus</a></li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=213d4d74e940" width="1" height="1" alt=""><hr><p><a 
href="https://medium.com/better-programming/debug-go-like-a-pro-213d4d74e940">Debug Go Like a Pro</a> was originally published in <a href="https://betterprogramming.pub">Better Programming</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Windows Containers on Kubernetes]]></title>
            <link>https://medium.com/better-programming/windows-containers-on-kubernetes-643c2356622a?source=rss-e95ef4be7a2e------2</link>
            <guid isPermaLink="false">https://medium.com/p/643c2356622a</guid>
            <category><![CDATA[programming]]></category>
            <category><![CDATA[docker]]></category>
            <category><![CDATA[windows]]></category>
            <category><![CDATA[kubernetes]]></category>
            <category><![CDATA[devops]]></category>
            <dc:creator><![CDATA[Tyler Finethy]]></dc:creator>
            <pubDate>Mon, 06 Jan 2020 13:20:30 GMT</pubDate>
            <atom:updated>2020-01-07T22:01:03.699Z</atom:updated>
<content:encoded><![CDATA[<h4>Is the technology ready for prime time?</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*GVB6YPIbNNEQclEUWfs7Cg.png" /><figcaption>Open for preview — Windows Server container on an Azure Kubernetes Service (AKS) cluster. How does it match up to the longstanding Linux-based container alternative?</figcaption></figure><p>While still in preview, <a href="https://azure.microsoft.com/en-us/services/kubernetes-service/">Azure Kubernetes Service</a> (AKS) <a href="https://azure.microsoft.com/en-us/blog/announcing-the-preview-of-windows-server-containers-support-in-azure-kubernetes-service/">recently announced</a>¹ support for Windows Server containers.</p><p>You can deploy ASP.NET applications, run <a href="https://docs.microsoft.com/en-us/powershell/scripting/overview">PowerShell</a> or Linux subsystem scripts, and enable autoscaling to meet customer demand. I’ve been using the service for about a month now; these are my key takeaways.</p><h3>Windows Containers Are Slower and Clunky</h3><p>Container-based development has a number of benefits like consistent environments, isolation, and a run-anywhere mentality².</p><p>When using Linux-based containers on Kubernetes, it usually comes with the advantage of being lightweight enough to scale and quickly meet traffic needs. Even the <em>About Windows Containers</em> page speaks to these benefits:</p><blockquote>“Containers provide a lightweight, isolated environment that makes apps easier to develop, deploy, and manage. 
Containers start and stop quickly, making them ideal for apps that need to rapidly adapt to changing demand.”³</blockquote><p>Coming from the Linux-based container world, I was surprised at both the larger size of the images and the amount of time it takes to scale new Kubernetes nodes and deploy more pods.</p><p>The default ltsc2019 image is over 850 megabytes, so I don’t know if we’re ready to use the word “lightweight” just yet, but everything else still rings true.</p><p>Windows containers also require more resources than their Linux counterparts, making the cost higher: at least two times as much if you’re comparing Standard_DS2_v2 to the Standard_DS3_v2 on Azure⁴ (the required minimum size for Windows node pools).</p><p>I’ve only used the Standard_DS3_v2 nodes when using Windows containers, so take this with a grain of salt, but I’ve noticed a high amount of latency and slowdown when interacting with running pods.</p><p>For example, when debugging systems, I often run a command like:</p><pre>$ kubectl exec -it [pod-name] /bin/bash</pre><p>On Windows this becomes:</p><pre>$ kubectl exec -it [pod-name] powershell.exe<br>Windows PowerShell<br>Copyright (C) Microsoft Corporation. All rights reserved.<br><br>PS C:\&gt;</pre><p>Running commands from here, like downloading files or installing executables, can take some time, although it’s pretty awesome to hop into a Windows container on my Mac. 
It’s definitely a huge step forward for developers like me who usually run from any Windows development work.</p><p>Combine this functionality with the recent Windows Subsystem for Linux (WSL) and you have a shared tool-chain across operating systems.</p><p>I’m now using a previously tested bash script on Windows through Kubernetes with minimal effort. This support further lowers the barrier of entry for more developers interested in cross-platform support.</p><h3>Getting Everything Right Is an Art</h3><p>While there are both Windows Server Core and Nanoserver base images available for Windows containers, as far as I can tell, AKS only supports the Windows Server Core images. This is because:</p><blockquote>“Windows Server containers and the underlying host share a single kernel, [so] the container’s base image OS version must match that of the host.”⁵</blockquote><p>If you don’t get this right, expect to see “The operating system of the container does not match the operating system of the host.”</p><p>I saw this a lot. I used Azure Pipelines⁶ to build the container image from a custom Dockerfile. I then needed to make my base image’s Windows version match the Azure Pipelines VM version.</p><p>Next, I matched the container image to the Kubernetes node. 
I wasn’t able to find the exact Windows version in the Azure docs, but running kubectl get nodes -o wide gives you something like:</p><pre>NAME            OS-IMAGE                         KERNEL-VERSION<br>linux-node      Ubuntu 16.04.6 LTS               4.15.0-1061-azure<br>windows-node    Windows Server 2019 Datacenter   10.0.17763.737</pre><p>That translated to a Docker base image of mcr.microsoft.com/windows/servercore:ltsc2019 and Azure VM image vmImage: ‘windows-2019’.</p><p>Hopefully, it’ll eventually be easier to match and configure these options, and to run Nanoserver base images on AKS.</p><h3>AKS Is Not Ready for Production Traffic</h3><p>I’m currently using AKS to run production compute jobs that are asynchronous and don’t directly impact customers. I think this is a good, safe use-case right now.</p><p>The reason I don’t think it’s ready for production web-traffic or anything synchronous is that I’ve run into more than a few issues with the actual cluster. Here are a few:</p><ul><li>Nodes have become unexpectedly unavailable with containers stuck in the ContainerCreating state.</li><li>Deleted jobs did not clean up the associated pods (Kubernetes API breakage).</li><li>Unsetting the node autoscaling option is impossible through the UI, which leads to some manual scaling complications.</li></ul><p>I expected that issues like the Kubernetes API breakage could be quickly resolved through Azure support, but without paying extra I couldn’t escalate the issue directly.</p><p>I believe there is a path through GitHub to submit issues like this, but it seems insane that during a preview they don’t provide a direct channel for feedback and bug reports.</p><h3>References</h3><ol><li><a href="https://azure.microsoft.com/en-us/blog/announcing-the-preview-of-windows-server-containers-support-in-azure-kubernetes-service/">Announcing the preview of Windows Server Containers support in Azure Kubernetes service.</a></li><li><a href="https://cloud.google.com/containers/">Google 
Cloud — Containers</a></li><li><a href="https://docs.microsoft.com/en-us/virtualization/windowscontainers/about/">Windows Containers</a></li><li><a href="https://azureprice.net/">https://azureprice.net/</a></li><li><a href="https://docs.microsoft.com/en-us/virtualization/windowscontainers/deploy-containers/version-compatibility">Containers version compatibility</a></li><li><a href="https://azure.microsoft.com/en-us/services/devops/pipelines/">Azure — Pipelines</a></li></ol><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=643c2356622a" width="1" height="1" alt=""><hr><p><a href="https://medium.com/better-programming/windows-containers-on-kubernetes-643c2356622a">Windows Containers on Kubernetes</a> was originally published in <a href="https://betterprogramming.pub">Better Programming</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Working fully remote: the dream or a nightmare]]></title>
            <link>https://medium.com/swlh/working-full-remote-the-dream-or-a-nightmare-13f6064564d1?source=rss-e95ef4be7a2e------2</link>
            <guid isPermaLink="false">https://medium.com/p/13f6064564d1</guid>
            <category><![CDATA[productivity]]></category>
            <category><![CDATA[business]]></category>
            <category><![CDATA[remote-working]]></category>
            <category><![CDATA[startup]]></category>
            <category><![CDATA[entrepreneurship]]></category>
            <dc:creator><![CDATA[Tyler Finethy]]></dc:creator>
            <pubDate>Mon, 11 Nov 2019 11:55:55 GMT</pubDate>
            <atom:updated>2019-11-11T13:25:38.318Z</atom:updated>
<content:encoded><![CDATA[<h4>Will working from home make you a happier, more engaged employee? Maybe.</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*fHNw0moKbup-snCzZ4Nfdg.png" /><figcaption>Working from home can be a double-edged sword, but I prefer it to the routine commute and sometimes distracting office life</figcaption></figure><p>When I tell people I’m fully remote, they have one of two reactions: they imagine some nightmarish trapped-at-home scenario where human interaction is nonexistent, or they dream of a life unchained from “the morning routine”. After about six months as a fully-remote employee, I’ve found the reality is somewhere in the middle.</p><h3>How did I get here?</h3><p>Like most newly graduated folks, I had no idea what I wanted to do next. I thrashed between pathways forward, from graduate programs to interviews with Google and the like. I almost ended up in the United Kingdom at business school. I even considered staying at school to work for my data science professor, after a scarring interview with a hotel reservation company sent me reeling.</p><p>After applying to some 50-odd companies, which I now realize is typical, I landed at a startup in Boston and had that “I have no idea what I’m doing” moment. It took me about six months to fully understand the job and become a functioning member of a team. The soft skills of working in an industry setting, like properly wording an email or hitting the <em>reply all</em> button, aren’t taught in academia and certainly took time to master.</p><p>After my first job, searching for my next engineering job was a completely different experience. This time, interviews felt more like a two-way street, and I had now learned what to look for and what to avoid. I ended up picking a company based on location, salary, and a better work-life balance. 
However, I was still lukewarm on office culture: the dreaded productivity metrics, staying long hours to be a “real developer”, and wasting valuable time commuting to and from work.</p><p>Through the years I worked with a few remote employees and knew that being a remote employee on a colocated team is an uphill battle. The freedom to work wherever you want and lack of distractions sounded amazing. On the other hand, I worried about missing out on water-cooler chats, company culture, and putting a lot of my fate in the hands of others. When I was offered the opportunity to work at a fully-remote organization with a great mission, it sounded like the best of both worlds, so I took the plunge and have been remote ever since.</p><h3>A day in the life</h3><p>Before I talk about expectations and realities, here’s a rough overview of what my daily work schedule looks like:</p><p><strong>7:30–9:00 am</strong></p><ul><li>Start my day with coffee, breakfast, daily reading, etc.</li><li>Respond to any urgent emails or Slack messages</li><li>Prepare for any morning meetings</li></ul><p><strong>9:00–11:00 am</strong></p><ul><li><strong>Do not disturb</strong> set for at least an hour of focus time</li><li>Respond to any code reviews, feedback, or design/planning documents</li><li>Merge any accepted code changes</li></ul><p><strong>11:00–2:00 pm</strong></p><ul><li>Respond to any email or communicate in Slack</li><li>Work on issues that require collaboration</li><li>Eat lunch and do some sort of physical activity or errand to get out of the house</li></ul><p><strong>2:00–4:30 pm</strong></p><ul><li><strong>Do not disturb</strong> set for at least another hour</li><li>Attend any afternoon meetings, pair coding sessions, or collaborate with other members of my team</li></ul><p><strong>4:30–6:00 pm</strong></p><ul><li>Finish or polish any outstanding work</li><li>Perform less thought-intensive (busy) work</li><li>Wind down responding to Slack, email, 
etc.</li></ul><p>There’s a lot that I didn’t include in my daily work schedule just because it’s less frequent. This includes customer issues, releases, or <a href="https://landing.google.com/sre/sre-book/chapters/eliminating-toil/">related toil work</a>.</p><p>The first thing you’ll probably notice as strange is the <strong>Do not disturb</strong> blocks. I’m a big believer in focus time and <a href="https://www.calnewport.com/books/deep-work/">deep work</a>. I had a colleague who said:</p><blockquote>“Coding and [software engineering work] is what happens in the space between the meetings on our calendars”</blockquote><p>While this is the case for a lot of software engineers, it’s insane to think that we prioritize our job only when we can fit it in. This can also lead to bad behaviors like working late, long after your peak productivity hours have passed.</p><p>The other thing you might notice is that communication is clustered. This is for a few reasons: it reduces the number of distractions, <a href="https://m.signalvnoise.com/is-group-chat-making-you-sweat/#.groximv54">group chat can negatively impact your workday</a>, and I’m on a distributed team across many time zones, so communicating later in the day makes sense. This has also made the entire team value the meetings we do have a lot more.</p><p>The final crucial ingredient to the day is the physical activity or errand to break up the day and get out of the house. This is vital when working from home. Unlike in a colocated environment, you won’t get your daily dose of human interaction from your co-workers, which matters if you need that sort of thing (hint: you probably do, even if you don’t know it). Try to avoid cabin fever at all costs.</p><h3>Myths and Half-Truths</h3><p>I’ve read several blog posts about the liberation and freedom you get from being a remote employee. People have also written about the loneliness and struggles of creating a strong work-life balance. 
<a href="https://open.buffer.com/state-remote-work-2018">Loneliness rates as the number one struggle with working from home</a>. Before you take the plunge, you should understand what it’s actually like beyond the myths and hype.</p><h4>Fully autonomous</h4><p>If you’ve ever read <a href="https://www.calnewport.com/books/so-good/"><em>So Good They Can’t Ignore You</em></a>, then you know that autonomy over your workday can be a major factor in career satisfaction. This especially rings true in creative work because banging your head against a problem or design decision causes more strife than speed.</p><blockquote>“[control] turns out to be one of the most important traits you can acquire with [experience]… something so powerful and essential to the quest for work you love” — Cal Newport</blockquote><p>One of the best outcomes from remote work is the freedom to take a walk, grab a snack, and generally plan when and how you’ll get work done. You reduce the anxiety associated with wanting your manager to see you toiling away. Like I mentioned before, it increases engagement in meetings, written documents, and anything else that involves collaborating with your peers — you have limited opportunities to build relationships when you don’t see everyone every day.</p><h4>Traveling becomes the norm</h4><p>It’s true that as a remote employee you can work from anywhere, depending on your company’s time zone and country restrictions. This doesn’t mean your life is going to turn into a Jack Kerouac novel, living out of a van and following the road wherever it takes you. You still want to remain grounded and delineate a space for work.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*cP7M8RV349dYWqBUxtWw4g.jpeg" /><figcaption>The Amazon Loft in NYC, a co-working space free if you’re an existing AWS customer</figcaption></figure><p>I’ve found that finding <em>a good place to work</em> is difficult. 
Coffeehouses can be hit-or-miss depending on the WiFi signal, noise levels, and the amount of time the staff will let you sit in the space. Shared co-working spaces can be expensive, dark and dreary, or farther than you’re willing to travel. I went to one in New York City that was charging $550/mo for a shared desk with no natural lighting and lots of noise.</p><p>On the other hand, I spent the summer working from a lake with my family and I travel to work from warmer climates whenever I can. The dream of moving to the woods and escaping the rat-race might sound enticing, but you’re also setting yourself up to further reduce the number of day-to-day interactions with people. When work friends no longer fill your social circle, you need to find places where you can build a strong group of friends outside of work. The point here is that you need a balance between both worlds.</p><h4>No peer pressure</h4><p>Lastly, let’s tackle the “no peer pressure” myth. For the most part, you can decide your schedule, you don’t have a manager looking over your shoulder, and you can work from anywhere. You should’ve escaped the pressure to work crazy hours and produce quality work, right? Not quite.</p><p>Remote work comes with unique hurdles and challenges. For one, if you are used to getting daily recognition for a job well done, you’ll have to find that satisfaction elsewhere. When you run into issues, you can no longer tap a colleague on the shoulder and ask for help. You’re an island, and if you struggle to know when working through an issue has turned into spinning your wheels, this will be tough.</p><p>Communication becomes a delicate balance between over-communicating so everyone is on the same page and under-communicating to avoid bothering someone in a focus-time block. The best solution I’ve seen is to communicate regularly, but expect asynchronous, delayed responses. 
You don’t want radio silence at the risk of having the police show up to check if you’re still alive.</p><h3>Conclusion</h3><p>While I’m still relatively new to remote-only work, I’ve been at it for long enough to understand the reality versus the hype. It’s not going to be a fix-all if you’re unhappy doing the work you already do. At the same time, it’s going to provide you the flexibility to decide how and when to do the work that matters. To recap:</p><ul><li>Fully autonomous work can supercharge your productivity if you stick to a strict schedule and respect your focus time hours</li><li>Working-from-anywhere is great but finding the right places where you can be free of distractions and do your best work can be challenging</li><li>While you no longer have a manager looking over your shoulder, communication and reassurance that you’re doing the right thing can be hard for those of us that need it</li><li>Avoid radio silence and cabin fever for both your own and your coworker’s sanity</li></ul><p>With all that said, unless you’re already far along in your career I would stick to the classic office work life. 
Early on you might need intense mentoring, and autonomy requires a high level of job proficiency.</p><h3>Resources</h3><ul><li><a href="https://landing.google.com/sre/sre-book/chapters/eliminating-toil/">https://landing.google.com/sre/sre-book/chapters/eliminating-toil/</a></li><li><a href="https://www.calnewport.com/books/deep-work/">https://www.calnewport.com/books/deep-work/</a></li><li><a href="https://m.signalvnoise.com/is-group-chat-making-you-sweat/#.groximv54">https://m.signalvnoise.com/is-group-chat-making-you-sweat/#.groximv54</a></li><li><a href="https://open.buffer.com/state-remote-work-2018/#benefits">https://open.buffer.com/state-remote-work-2018</a></li><li><a href="https://www.calnewport.com/books/so-good/">https://www.calnewport.com/books/so-good/</a></li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=13f6064564d1" width="1" height="1" alt=""><hr><p><a href="https://medium.com/swlh/working-full-remote-the-dream-or-a-nightmare-13f6064564d1">Working fully remote: the dream or a nightmare</a> was originally published in <a href="https://medium.com/swlh">The Startup</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[On-premise Kubernetes Clusters]]></title>
            <link>https://medium.com/swlh/on-premise-kubernetes-clusters-b36660ca6914?source=rss-e95ef4be7a2e------2</link>
            <guid isPermaLink="false">https://medium.com/p/b36660ca6914</guid>
            <category><![CDATA[technology]]></category>
            <category><![CDATA[docker]]></category>
            <category><![CDATA[devops]]></category>
            <category><![CDATA[infrastructure]]></category>
            <category><![CDATA[kubernetes]]></category>
            <dc:creator><![CDATA[Tyler Finethy]]></dc:creator>
            <pubDate>Tue, 29 Oct 2019 13:04:06 GMT</pubDate>
            <atom:updated>2019-10-29T23:33:23.324Z</atom:updated>
<content:encoded><![CDATA[<h4>What you need to know when deploying Kubernetes yourself</h4><figure><img alt="Picture of the Kubernetes logo in a data center" src="https://cdn-images-1.medium.com/max/1024/1*V14u4skSNsIbl7Ab98zbeQ.png" /><figcaption>Running Kubernetes on-premise gives developers a cloud-native experience and sets your organization up to be cloud-agnostic.</figcaption></figure><p>Whether you have your own on-premise data center, have decided to forego the various managed cloud solutions, or are developing software for a company that has done so — there are a few things you should know when getting started with on-premise K8s.</p><p>If you’re already familiar with Kubernetes, you know that the <a href="https://kubernetes.io/docs/concepts/overview/components/#master-components">control plane</a> consists of the kube-apiserver, kube-scheduler, kube-controller-manager, and an etcd datastore. For managed cloud solutions like <a href="https://cloud.google.com/kubernetes-engine/">Google’s Kubernetes Engine (GKE)</a> or <a href="https://azure.microsoft.com/en-us/services/kubernetes-service/">Azure’s Kubernetes Service (AKS)</a>, it also includes the cloud-controller-manager. This is the component that connects the cluster to the external cloud services to provide networking, storage, authentication, and other feature support.</p><p>To successfully deploy a bespoke Kubernetes cluster and achieve a cloud-like experience you’ll need to replicate all the same features you get with a managed solution. 
At a high level, this means you’ll probably want to:</p><ul><li>Automate the deployment process</li><li>Choose a networking solution</li><li>Choose a storage solution</li><li>Handle security and authentication</li></ul><p>Let’s look at each of these challenges individually; I’ll try to provide enough of an overview to help you get started.</p><h3>Automating the deployment process</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/570/0*LEAgyUNzvGOp5ejw.jpg" /><figcaption>Using a tool like Ansible can make deploying Kubernetes clusters on-premise trivial.</figcaption></figure><p>When deciding to manage your own Kubernetes clusters you’ll want to set up a few proof-of-concept (PoC) clusters to learn how everything works, perform performance and conformance tests, and try out different configuration options.</p><p>After this phase, automating the deployment process is an important if not necessary step to ensure consistency across any clusters you build. For this you have a few options, but the most popular are:</p><ul><li><a href="https://kubernetes.io/docs/reference/setup-tools/kubeadm/kubeadm/">kubeadm</a>: a low-level tool that helps you bootstrap a minimum viable Kubernetes cluster that conforms to best practices</li><li><a href="https://github.com/kubernetes-sigs/kubespray">kubespray</a>: an Ansible playbook that helps deploy production-ready clusters</li></ul><p>If you already use Ansible, kubespray is a great option; otherwise, I recommend writing automation around kubeadm using your preferred playbook tool after using it a few times. This will also increase your confidence and knowledge in the tooling surrounding Kubernetes.</p><h3>Choosing a network solution</h3><p>When designing clusters, choosing the right container networking interface (CNI) plugin can be the hardest part, because picking a CNI that will work well with an existing network topology can be tough. Do you need BGP peering capabilities? 
Do you want an overlay network using vxlan? How close to bare-metal performance are you trying to get?</p><p>There are a lot of articles that compare the various CNI provider solutions (calico, weave, flannel, kube-router, etc.) that are must-reads, like the <a href="https://itnext.io/benchmark-results-of-kubernetes-network-plugins-cni-over-10gbit-s-network-updated-april-2019-4a9886efe9c4"><em>benchmark results of Kubernetes network plugins</em></a> article. I usually recommend Project Calico for its maturity, continued support, and large feature set, or flannel for its simplicity.</p><p>For ingress traffic you’ll need to pick a load-balancer solution. For a simple configuration you can use MetalLB, but if you’re lucky enough to have F5 hardware load-balancers available I recommend checking out the <a href="https://clouddocs.f5.com/containers/v2/kubernetes/">K8s F5 BIG-IP Controller</a>. The controller supports connecting your network plugin to the F5 through either vxlan or BGP peering. This gives the controller full visibility into pod health and provides the best performance.</p><h3>Choosing a storage solution</h3><p>Kubernetes provides a number of <a href="https://kubernetes.io/docs/concepts/storage/storage-classes/#provisioner">included storage volume plugins</a>. If you’re going on-premise you’ll probably want to use a network-attached storage (NAS) option to avoid forcing pods to be pinned to specific nodes.</p><p>For a cloud-like experience, you’ll need to add a plugin to dynamically create persistent volume objects that match the user’s persistent volume claims. 
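To sketch what that looks like in practice: dynamic provisioning hangs off a StorageClass that claims reference by name. The manifest below is a hypothetical example (the provisioner string and names are placeholders for whatever NAS plugin you deploy, not from any specific product):

```yaml
# Hypothetical StorageClass; "provisioner" is a placeholder for your NAS plugin.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nas-standard
provisioner: example.com/nas   # placeholder driver name
reclaimPolicy: Delete          # tear the volume down once its claim is deleted
---
# A claim like this triggers the plugin to provision a matching volume.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-claim
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: nas-standard
  resources:
    requests:
      storage: 10Gi
```

With reclaimPolicy set to Delete, deleting the claim lets the provisioner clean up the backing volume automatically, which is exactly the cloud-like reclaim behavior users expect.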
You can use dynamic provisioning to reclaim these volume objects after a resource has been deleted.</p><p>Pure Storage has a great example Helm chart, the <a href="https://github.com/purestorage/helm-charts"><em>Pure Service Orchestrator </em>(PSO)</a>, that provides smart provisioning, although it only works for Pure Storage products.</p><h3>Handling security and authentication</h3><p>As anyone familiar with security knows, this is a rabbit hole. You can always make your infrastructure more secure, and you should be investing in continual improvements.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*mmukI0LzUK0ez4Xd.png" /><figcaption>Including different Kubernetes plugins can help build a secure, cloud-like experience for your users</figcaption></figure><p>When designing on-premise clusters, you’ll have to decide where to draw the line. To really harden your cluster’s security, you can add plugins like:</p><ul><li><a href="https://istio.io/">Istio</a>: provides the underlying secure communication channel, and manages authentication, authorization, and encryption of service communication at scale</li><li><a href="https://github.com/google/gvisor">gVisor</a>: a user-space kernel, written in Go, that implements a substantial portion of the Linux system surface</li><li><a href="https://www.vaultproject.io/docs/">Vault</a>: securely store and tightly control access to tokens, passwords, certificates, and encryption keys for protecting secrets and other sensitive data</li></ul><p>For user authentication, I recommend checking out <a href="https://github.com/appscode/guard">guard</a>, which integrates with an existing authentication provider. If you’re already using GitHub teams, this could be a no-brainer.</p><h3>Other Considerations</h3><p>I hope this has given you a good enough overview of deployment, networking, storage, and security to take the leap into deploying your own on-premise Kubernetes clusters. 
As I mentioned above, your team will want to build proof-of-concept clusters, run conformance and performance tests, and really become experts on Kubernetes if you’re going to be using it to run production software.</p><p>I’ll leave you with a few other things your team should be thinking about:</p><ul><li>Externally backing up Kubernetes YAML, namespaces, and configuration files</li><li>Running applications across clusters in an active-active configuration to allow for zero-downtime updates</li><li>Running game days, like deleting the CNI, to measure and improve time-to-recovery</li></ul><p>This article is an adaptation of a presentation I gave for <a href="https://bisontrails.co/">BisonTrails</a> in New York City. Feel free to reach out for the original.</p><h3>Resources</h3><ul><li><a href="https://kubernetes.io/docs/concepts/overview/components/#master-components">https://kubernetes.io/docs/concepts/overview/components/#master-components</a></li><li><a href="https://cloud.google.com/kubernetes-engine/">https://cloud.google.com/kubernetes-engine/</a></li><li><a href="https://azure.microsoft.com/en-us/services/kubernetes-service/">https://azure.microsoft.com/en-us/services/kubernetes-service/</a></li><li><a href="https://kubernetes.io/docs/reference/setup-tools/kubeadm/kubeadm/">https://kubernetes.io/docs/reference/setup-tools/kubeadm/kubeadm/</a></li><li><a href="https://github.com/kubernetes-sigs/kubespray">https://github.com/kubernetes-sigs/kubespray</a></li><li><a href="https://kubernetes.io/docs/concepts/storage/storage-classes/#provisioner">https://kubernetes.io/docs/concepts/storage/storage-classes/#provisioner</a></li><li><a href="https://clouddocs.f5.com/containers/v2/kubernetes/">https://clouddocs.f5.com/containers/v2/kubernetes/</a></li><li><a href="https://github.com/purestorage/helm-charts">https://github.com/purestorage/helm-charts</a></li><li><a href="https://istio.io/">https://istio.io/</a></li><li><a 
href="https://github.com/google/gvisor">https://github.com/google/gvisor</a></li><li><a href="https://www.vaultproject.io/docs/">https://www.vaultproject.io/docs/</a></li><li><a href="https://github.com/appscode/guard">https://github.com/appscode/guard</a></li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=b36660ca6914" width="1" height="1" alt=""><hr><p><a href="https://medium.com/swlh/on-premise-kubernetes-clusters-b36660ca6914">On-premise Kubernetes Clusters</a> was originally published in <a href="https://medium.com/swlh">The Startup</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Common Go Pitfalls]]></title>
            <link>https://medium.com/better-programming/common-go-pitfalls-a92197cd96d2?source=rss-e95ef4be7a2e------2</link>
            <guid isPermaLink="false">https://medium.com/p/a92197cd96d2</guid>
            <category><![CDATA[software-engineering]]></category>
            <category><![CDATA[programming]]></category>
            <category><![CDATA[golang]]></category>
            <category><![CDATA[database]]></category>
            <category><![CDATA[memory-leak]]></category>
            <dc:creator><![CDATA[Tyler Finethy]]></dc:creator>
            <pubDate>Thu, 24 Oct 2019 13:01:01 GMT</pubDate>
            <atom:updated>2019-10-25T15:48:12.073Z</atom:updated>
            <content:encoded><![CDATA[<h4>A few common mistakes and how to diagnose and fix them</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*8XTRgJ-MFix4WF627cFCeA.png" /><figcaption>Avoid pitfalls while writing simple, reliable Go code</figcaption></figure><p>There’s a few reasons I love Golang:</p><ul><li>It’s a super small language (it has <a href="https://golang.org/ref/spec#Keywords">only 25 reserved keywords</a>)</li><li>Cross-compilation is a breeze</li><li>Creating a reliable HTTP(s) server is natively supported</li></ul><p>At its core, it’s a boring language, which is probably why awesome projects like <a href="https://github.com/avelino/awesome-go">Docker and Kubernetes</a> are written in it and companies with high performance and resiliency requirements, like <a href="https://blog.cloudflare.com/tag/go/">Cloudflare</a>, are using it.</p><p>Despite its ease of use, Go really requires attention to detail. If you don’t use the language as it’s intended it can break. It can be hard to diagnose and challenging to fix the mistake.</p><p>Here are a few common mistakes I’ve witnessed in production codebases, during code reviews, and made myself. Hopefully, this will make it easier for you to diagnose the same issues as you encounter them.</p><h3>HTTP Timeouts</h3><p>This first issue has an entire <a href="https://medium.com/@nate510/don-t-use-go-s-default-http-client-4804cb19f779">article written about it</a> but it’s still worth mentioning because the optimal solution can require some thought. 
It has to do with making outgoing HTTP requests using the default HTTP client.</p><p>To illustrate the problem, here’s a basic example of making a GET request to google.com:</p><pre>package main</pre><pre>import (<br>    &quot;io/ioutil&quot;<br>    &quot;log&quot;<br>    &quot;net/http&quot;<br>)</pre><pre>var (<br>    c = &amp;http.Client{}<br>)</pre><pre>func main() {<br>    req, err := http.NewRequest(&quot;GET&quot;, &quot;https://google.com&quot;, nil)<br>    if err != nil {<br>        log.Fatal(err)<br>    }</pre><pre>    res, err := c.Do(req)<br>    if err != nil {<br>        log.Fatal(err)<br>    }<br>    defer res.Body.Close()<br>    b, _ := ioutil.ReadAll(res.Body)<br>    ...<br>}<br></pre><p>As pointed out in the Don’t use Go’s default HTTP Client article, the default client doesn’t actually have a timeout. This means that code could hang indefinitely depending on the server or until the application is restarted.</p><p>So what’s the best way to resolve this issue?</p><p>While it’s always a good idea to define your HTTP client with a sensible timeout, e.g. &amp;http.Client{Timeout: time.Minute}, you might also consider attaching a context to your request for a few added benefits:</p><ul><li>The ability to cancel ongoing requests</li><li>The ability to tune the timeout for specific requests</li></ul><p>The second benefit is especially important because if you have a few requests that you know are going to take a long time, say over an hour, you don’t want every request to wait an hour before timing out.</p><p>In the example above, adding a context would look something like this:</p><pre>ctx, cancel := context.WithTimeout(context.Background(), time.Minute)<br>defer cancel()</pre><pre>req = req.WithContext(ctx)</pre><pre>res, err := c.Do(req)<br>...</pre><p>If the allotted time is exceeded, the call to c.Do will result in a context.DeadlineExceeded error, making it easy to handle or retry. 
For more information on the context package, check out the <a href="https://golang.org/pkg/context/">documentation</a>.</p><h3>Database Connections</h3><p>I’ve had database connection issues crop up in almost every Go project I’ve been on. I think the hard thing for new gophers to wrap their heads around is that the sql.DB object is a concurrency-safe pool of connections instead of a single database connection. This means that if you forget to return your connections to the pool, you can easily exhaust the number of connections and your application can grind to a halt.</p><p>The connection pool contains both open and idle connections, which are configured through:</p><ul><li><a href="https://golang.org/pkg/database/sql/#DB.SetConnMaxLifetime">SetConnMaxLifetime</a>: the maximum amount of time a connection may be reused</li><li><a href="https://golang.org/pkg/database/sql/#DB.SetMaxIdleConns">SetMaxIdleConns</a>: the maximum number of connections in the idle connection pool</li><li><a href="https://golang.org/pkg/database/sql/#DB.SetMaxOpenConns">SetMaxOpenConns</a>: the maximum number of open connections to the database</li></ul><p>Note that even if you configure the max open connections to 200, the application can still exhaust the number of connections the database itself will accept, making a shutdown or restart necessary. Check the database settings, or coordinate with whoever has the permissions, to ensure you’re setting these limits correctly.</p><p>If you don’t configure a limit, your application can easily use all the connections the database will accept.</p><p>Back to exhausting the connection pool. When querying the database, a lot of developers forget to close the *sql.Rows object. This leads to hitting the max connections limit and causes deadlock or high latency. 
Here’s a snippet of code showing this:</p><pre>package main</pre><pre>import (<br>    &quot;context&quot;<br>    &quot;database/sql&quot;<br>    &quot;fmt&quot;<br>    &quot;log&quot;<br>    &quot;time&quot;<br>)</pre><pre>var (<br>    db *sql.DB<br>)</pre><pre>func main() {<br>    age := 27<br>    ctx, cancel := context.WithTimeout(context.Background(), time.Minute)<br>    defer cancel()</pre><pre>    rows, err := db.QueryContext(ctx, &quot;SELECT name FROM users WHERE age=?&quot;, age)<br>    if err != nil {<br>        log.Fatal(err)<br>    }</pre><pre>    for rows.Next() {<br>        var name string<br>        if err := rows.Scan(&amp;name); err != nil {<br>            log.Fatal(err)<br>        }<br>        fmt.Println(name)<br>    }<br>    ...</pre><pre>}</pre><p>You’ll notice, just as you can add a context to an HTTP request, you can also add a context with a timeout to a database query (or an execution of a prepared statement, a ping, etc.). But that’s not the problem.</p><p>As mentioned above, we need to close the rows object to prevent further enumeration and release the connection back to the connection pool:</p><pre>rows, err := db.QueryContext(ctx, &quot;SELECT name FROM users WHERE age=?&quot;, age)<br>if err != nil {<br>    log.Fatal(err)<br>}<br>defer rows.Close()</pre><p>This becomes particularly difficult to spot if you’re passing open connections across functions and packages.</p><h3>Goroutine or Memory Leaks</h3><p>The last common mistake I’m going to cover here is Goroutine leaks. These can be tricky to detect but are usually caused by user error.</p><p>This happens often when using channels. 
For example:</p><pre>package main<br></pre><pre>func main() {<br>    c := make(chan error)<br>    go func() {<br>        for err := range c {<br>            if err != nil {<br>                panic(err)<br>            }<br>        }<br>    }()</pre><pre>    c &lt;- someFunc()<br>    ...</pre><pre>}</pre><p>If we never close the channel c, the Goroutine we have initialized will hang until the program terminates.</p><p>Instead of enumerating every case that can cause Goroutine leaks, here are two methods I commonly use to detect and eliminate them.</p><p>The first method is to use a leak detector in your tests, like <a href="https://github.com/uber-go/goleak">Uber’s goleak library</a>. In practice, this looks like:</p><pre>func TestA(t *testing.T) {<br>    defer goleak.VerifyNone(t)<br>    // test logic here.<br>}</pre><p>This will verify, after a short grace period to allow for graceful shutdown, that there are no unexpected Goroutines running at the end of the test.</p><p>The other method is to use the <a href="https://blog.golang.org/profiling-go-programs">Go profiler</a> on a running instance of your application and look at the number of active Goroutines. One way to do this is to add the <a href="https://golang.org/pkg/net/http/pprof/">net/http/pprof library</a> and open the Goroutine profile.</p><p>You can enable it by adding this:</p><pre>import _ &quot;net/http/pprof&quot;</pre><pre>func someFunc() {<br>    go func() {<br>        log.Println(http.ListenAndServe(&quot;localhost:6060&quot;, nil))<br>    }()<br>}</pre><p>This will enable pprof on port 6060. For especially bad leaks, you can refresh and see the number of goroutines increase. For more subtle leaks, read through the profile and look for instances of functions sticking around when they shouldn’t. 
The profile page will look something like this:</p><pre>goroutine profile: total 39<br>2 @ 0x43cf10 0x44ca6b 0x980600 0x46b301<br>#	0x9805ff	database/sql.(*DB).connectionCleaner+0x36f	/usr/local/go/src/database/sql/sql.go:950<br><br>2 @ 0x43cf10 0x44ca6b 0x980b18 0x46b301<br>#	0x980b17	database/sql.(*DB).connectionOpener+0xe7	/usr/local/go/src/database/sql/sql.go:1052<br><br>2 @ 0x43cf10 0x44ca6b 0x980c4b 0x46b301<br>#	0x980c4a	database/sql.(*DB).connectionResetter+0xfa	/usr/local/go/src/database/sql/sql.go:1065</pre><pre>...</pre><p>If your application is idle and you’re seeing a lot of total Goroutines, that’s a good indication that something is going wrong. After identifying where the leak is, I still recommend using a leak detector in the tests to ensure the issue is resolved.</p><h3>Conclusion</h3><p>Hopefully knowing about and seeing some examples of these common mistakes will help you identify and fix them more quickly. Of course, there are a number of other common mistakes, such as:</p><ul><li>Race conditions</li><li>Deadlocks</li><li>Error swallowing</li></ul><p>These can be found and fixed through similar techniques, like using the <a href="https://blog.golang.org/race-detector">Go race detector</a>, writing tests, or using the Go profiler.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=a92197cd96d2" width="1" height="1" alt=""><hr><p><a href="https://medium.com/better-programming/common-go-pitfalls-a92197cd96d2">Common Go Pitfalls</a> was originally published in <a href="https://betterprogramming.pub">Better Programming</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[The GOPATH is for everyone]]></title>
            <link>https://medium.com/swlh/the-gopath-is-for-everyone-e17206216dfa?source=rss-e95ef4be7a2e------2</link>
            <guid isPermaLink="false">https://medium.com/p/e17206216dfa</guid>
            <category><![CDATA[golang]]></category>
            <category><![CDATA[development]]></category>
            <category><![CDATA[gopath]]></category>
            <category><![CDATA[configuration-software]]></category>
            <category><![CDATA[software-engineering]]></category>
            <dc:creator><![CDATA[Tyler Finethy]]></dc:creator>
            <pubDate>Thu, 17 Oct 2019 13:21:01 GMT</pubDate>
            <atom:updated>2019-10-24T13:27:48.514Z</atom:updated>
<content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*5TChMOpqjoswD1j1.png" /><figcaption>The Go toolchain can be useful even outside of Go development</figcaption></figure><h3>TL;DR</h3><p>The command go get will organize your code into an easy-to-follow directory structure that you can use across all your workspaces. Give it a try even if you’re not a gopher.</p><p><strong>Example:</strong> go get github.com/golang/go will clone the repository and put it at $GOPATH/src/github.com/golang/go, where GOPATH defaults to $HOME/go.</p><h3>Background</h3><p>When I started using Golang 3 years ago, I came from a world of Python, PHP, and NodeJS, where every time I created a new repository I had to think about where it was going to go. This would usually result in directory structures like:</p><pre>.<br>├── repo1<br>├── myCode<br>│   ├── repo2<br>│   ├── repo3<br>│   └── repo4<br>├── myOrg<br>│   ├── repo5<br>│   ├── repo6<br>│   └── repo7<br>├── opensourceOrg<br>│   ├── repo10<br>│   ├── repo8<br>│   └── repo9<br>└── scratch<br>    ├── repo11<br>    ├── repo12<br>    └── repo13</pre><p>Maybe I’m not disciplined enough, but this tree could look wildly different depending on my mood, the amount of coffee I’ve had, or what task I’m thinking about and working on.</p><p>There’s actually information lost here:</p><ul><li>What organization/user does repo1 belong to?</li><li>Are we sure that every repository in myCode is ours?</li><li>What version control system do these repositories use?</li></ul><p>As I started learning Go, I absolutely hated the idea of the GOPATH: some language is going to tell me how to organize my code? Ridiculous! But I set it up to get up and running with Golang. 
Now, even across my non-Go projects, I can’t imagine doing it any other way.</p><h3>How does the GOPATH help?</h3><p>The GOPATH is an environment variable that specifies where the Go toolchain should put and look for Go code (before the adoption of <a href="https://golang.org/cmd/go/#hdr-Vendor_Directories">vendoring</a> and <a href="https://github.com/golang/go/wiki/Modules">gomod</a>). There are a lot of reasons why the community didn’t like the GOPATH as a global package management system, but I won’t get into that here.</p><p>If you have Go installed, the get command comes out of the box. For a non-Go project, this usually looks like:</p><pre>&gt; go get github.com/tylfin/geospy<br>package github.com/tylfin/geospy: no Go files in /Users/tylerfinethy/go/src/github.com/tylfin/geospy</pre><p>This will create the tree structure:</p><pre>go/<br>├── bin<br>├── pkg<br>└── src<br>    └── github.com<br>        └── tylfin<br>            └── geospy</pre><p>For multiple repositories, organizations, and version control systems, this will look like:</p><pre>go/<br>├── bin<br>├── pkg<br>└── src<br>    ├── github.com<br>    │   ├── tylfin<br>    │   │   ├── dynatomic<br>    │   │   └── geospy<br>    │   └── uudashr<br>    │       └── gopkgs<br>    └── golang.org<br>        └── x<br>            └── tools</pre><p>That’s it. Go will create a deterministic structure that answers all the questions above, and it will be the same across every system.</p><p>While I understand it might be suspect to use an entire programming language to organize your code, it’s a quick install and works well enough for me that I’ll be using it for the foreseeable future. 
I recommend it to all my colleagues as well.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=e17206216dfa" width="1" height="1" alt=""><hr><p><a href="https://medium.com/swlh/the-gopath-is-for-everyone-e17206216dfa">The GOPATH is for everyone</a> was originally published in <a href="https://medium.com/swlh">The Startup</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>