Libelli

Implementing Function Calling LLMs without Fear

Wed, 16 Apr 2025 14:06:23 -0400

Implementing Function Calling LLMs without Fear is a talk that I gave at a C4AI/RealmOne Happy Hour Tech Meetup in Columbia, Maryland. The slides of the talk are below:

Abstract

Description: For an AI system to be an agent rather than a simple chatbot, it needs to be able to do work on behalf of its users, often accomplished through the use of Function Calling LLMs. Instruction-based models can identify external functions to call for additional input or context before creating a final response without the need for any additional training. However, giving an AI system access to databases, APIs, or even tools like our calendars is fraught with security concerns and task validation nightmares. In this talk, we’ll discuss the basics of how Function Calling works and think through the best practices and techniques to ensure that your agents work for you, not against you!

Privacy and Security in the Age of Generative AI

Wed, 30 Oct 2024 11:35:00 -0700

Privacy and Security in the Age of Generative AI is a talk that I gave at ODSC West 2024 in Burlingame, California. The slides of the talk are below:

An updated presentation that I gave at C4AI on April 15, 2025 in Columbia, Maryland is below:

Abstract

From sensitive data leakage to prompt injection and zero-click worms, LLMs and generative models are the new cyber battleground for hackers. As more AI models are deployed in production, data scientists and ML engineers can’t ignore these problems. The good news is that we can influence privacy and security in the machine learning lifecycle using data specific techniques. In this talk, we’ll review some of the newest security concerns affecting LLMs and deep learning models and learn how to embed privacy into model training with ACLs and differential privacy, secure text generation and function-calling interfaces, and even leverage models to defend other models.

Smart Global Replication Using Reinforcement Learning

Tue, 07 Nov 2023 12:00:00 -0500

Smart Global Replication using Reinforcement Learning is a talk that I gave at KubeCon + CloudNative North America 2023 in Chicago, IL. The video of the talk is below:

Description

There are many great reasons to replicate data across Kubernetes clusters in different geographic regions: e.g. for disaster recovery and to ensure the best possible user experiences. Unfortunately, global replication is not easy; not just because of the difficulty in consistency reasoning that it introduces, but also due to the increased cost of provisioning multiple volumes that exponentially duplicate ingress and egress. Wouldn’t it be great if our systems could learn the optimal placement of storage blocks so that total replication was not necessary? Wouldn’t it be even better if our replication messaging was reduced ensuring communication only between the minimally necessary set of storage nodes? We show a system that uses multi-armed bandits to perform such an optimization; dynamically adjusting how data is replicated based on usage. We demonstrate the savings achieved and system performance using a real world system: the TRISA Global Travel Rule Compliance Directory.

DIY Consensus: Crafting Your Own Distributed Code (with Benjamin Bengfort)

Wed, 30 Aug 2023 12:00:00 -0500

DIY Consensus: Crafting Your Own Distributed Code (with Benjamin Bengfort)

Description

How do distributed systems work? If you’ve got a database spread over three servers, how do they elect a leader? How does that change when we spread those machines out across data centers, situated around the globe? Do we even need to understand how it works, or can we relegate those problems to an off the shelf tool like Zookeeper? Joining me this week is Distributed Systems Doctor—Benjamin Bengfort—for a deep dive into consensus algorithms. We start off by discussing how much of “the clustering problem” is your problem, and how much can be handled by a library. We go through many of the constraints and tradeoffs that you need to understand either way. And we eventually reach Benjamin’s surprising message - maybe the time is ripe to roll your own. Should we be writing our own bespoke Raft implementations? And if so, how hard would that be? What guidance can he offer us? Somewhere in the recording of this episode, I decided I want to sit down and try to implement a leader election protocol. Maybe you will too. And if not, you’ll at least have a better appreciation for what it takes. Distributed systems used to be rocket science, but they’re becoming deployment as usual. This episode should help us all to keep up!

Faster Protocol Buffer Serialization

Wed, 03 May 2023 10:30:16 -0500

Performance is key when building streaming gRPC services. When you’re trying to maximize throughput (e.g. messages per second) benchmarking is essential to understanding where the bottlenecks in your application are.

However, as a start, you can pretty much guarantee that one bottleneck is going to be the serialization (marshaling) and deserialization (unmarshaling) of protocol buffer messages.

We have a use case where the server does not need all of the information in the message in order to process the message. E.g. we have header information such as IDs and client information that the server does need to update as part of processing. The other part of the message is data that needs to be saved to disk and does not have to be unmarshaled until it’s read. However, our protocol buffer schema right now is “flat” — meaning that all fields whether they are required for processing or not are defined by a single protocol buffer message.

Atomic vs Mutex

Sat, 26 Nov 2022 12:24:25 -0600

When implementing Go code, I find myself chasing increased concurrency performance by trying to reduce the number of locks in my code. Often I wonder if using the sync/atomic package is a better choice because I know (as proved by this blog post) that atomics have far more performance than mutexes. The issue is that reading on the internet, including the package documentation itself strongly recommends relying on channels, then mutexes, and finally atomics only if you know what you’re doing.

Nonlinear Workflow for Planning Software Projects

Sun, 14 Mar 2021 09:53:49 -0400

Good software development achieves complexity by describing the interactions between simpler components. Although we tend to think of software processes as step-by-step “wizards”, design and decoupling of components often means that the interactions are non-linear. So why should our software project planning be defined in a linear progression of steps with time estimates? Can we plan projects using a non-linear workflow that mirrors how we think about component design?

Go Closures & Interfaces

Tue, 23 Feb 2021 08:28:22 -0500

Strict typing in the Go programming language provides safety and performance that is valuable even if it does increase the verbosity of code. If there is a drawback to be found with strict typing, it is usually felt by library developers who require flexibility to cover different use cases, and most often appears as a suite of type-named functions such as lib.HandleString, lib.HandleUint64, lib.HandleBool and so on. Go does provide two important language tools that do provide a lot of flexibility in library development: closures and interfaces, which we will explore in this post.

New Hugo Theme

Sun, 24 Jan 2021 17:12:16 -0500

A facelift for Libelli today! I moved from Jekyll to Hugo for static site generation, a move that has been long overdue — and I’m very happy I’ve done it. Not only can I take advantage of a new theme with extra functionality (PaperMod in this case) but also because Hugo is written in Go, I feel like I have more control over how the site gets generated.

A lot has been said on this topic, if you’re thinking about migrating from Jekyll to Hugo, I recommend Sara Soueidan’s blog post — the notes here are Libelli specific and are listed here more as notes than anything else.

Documenting a gRPC API with OpenAPI

Thu, 21 Jan 2021 17:45:35 +0000

gRPC makes the specification and implementation of networked APIs a snap. But what is the simplest way to document a gRPC API? There seem to be some hosted providers by Google, e.g. SmartDocs, but I have yet to find a gRPC-specific tool. For REST API frameworks, documentation is commonly generated along with live examples using OpenAPI (formerly swagger). By using grpc-gateway it appears to be pretty straight forward to generate a REST/gRPC API combo from protocol buffers and then hook into the OpenAPI specification.

Self Signed CA

Wed, 30 Dec 2020 15:51:06 +0000

I went on a brief adventure looking into creating a lightweight certificate authority (CA) in Go to issue certificates for mTLS connections between peers in a network. The CA was a simple command line program and the idea was that the certificate would initialize its own self-generated certs whose public key would be included in the code base of the peer-to-peer servers, then it could generate TLS x.509 key pairs signed by the CA. Of course you could do this with openssl, but I wanted to keep a self-coded Go version around for posterity.

OS X Cleanup

Tue, 24 Nov 2020 14:26:25 +0000

Developer computers often get a lot of cruft built up in non-standard places because of compiled binaries, assets, packages, and other tools that we install over time then forget about as we move onto other projects. In general, I like to reinstall my OS and wipe my disk every year or so to prevent crud from accumulating. As an interemediate step, this post compiles several maintenance caommands that I run fairly routinely.

Managing Multi-Errors in Go

Thu, 22 Oct 2020 11:45:41 +0000

This post is a response to Go: Multiple Errors Management. I’ve dealt with a multiple error contexts in a few places in my Go code but never created a subpackage for it in github.com/bbengfort/x and so I thought this post was a good motivation to explore it in slightly more detail. I’d also like to make error contexts for routine cancellation a part of my standard programming practice, so this post also investigates multiple error handling in a single routine or multiple routines like the original post.

Writing JSON into a Zip file with Python

Thu, 20 Aug 2020 11:41:14 +0000

For scientific reproducibility, it has become common for me to output experimental results as zip files that contain both configurations and inputs as well as one or more output results files. This is similar to .epub or .docx formats which are just specialized zip files - and allows me to easily rerun experiments for comparison purposes. Recently I tried to dump some json data into a zip file using Python 3.8 and was surprised when the code errored as it seemed pretty standard. This is the story of the crazy loophole that I had to go into as a result.

Read mprofile Output into Pandas

Mon, 27 Jul 2020 18:16:50 +0000

When benchmarking Python programs, it is very common for me to use memory_profiler from the command line - e.g. mprof run python myscript.py. This creates a .dat file in the current working directory which you can view with mprof show. More often than not, though I want to compare two different runs for their memory profiles or do things like annotate the graphs with different timing benchmarks. This requires generating my own figures, which requires loading the memory profiler data myself.

Basic Python Profiling

Tue, 14 Jul 2020 18:01:08 +0000

I’m getting started on some projects that will make use of extensive Python performance profiling, unfortunately Python doesn’t focus on performance and so doesn’t have benchmark tools like I might find in Go. I’ve noticed that the two most important usages I’m looking at when profiling are speed and memory usage. For the latter, I simply use memory_profiler from the command line - which is pretty straight forward. However for speed usage, I did find a snippet that I thought would be useful to include and update depending on how my usage changes.

Launching a JupyterHub Instance

Wed, 09 Oct 2019 16:40:08 +0000

In this post I walk through the steps of creating a multi-user JupyterHub sever running on an AWS Ubuntu 18.04 instance. There are many ways of setting up JupyterHub including using Docker and Kubernetes - but this is a pretty staight forward mechanism that doesn’t have too many moving parts such as TLS termination proxies etc. I think of this as the baseline setup.

Note that this setup has a few pros or cons depending on how you look at them.

Mount an EBS volume

Tue, 05 Feb 2019 12:48:18 +0000

Once the EBS volume has been created and attached to the instance, ssh into the instance and list the available disks:

$ lsblk
NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
loop0         7:0    0 86.9M  1 loop /snap/core/4917
loop1         7:1    0 12.6M  1 loop /snap/amazon-ssm-agent/295
loop2         7:2    0   91M  1 loop /snap/core/6350
loop3         7:3    0   18M  1 loop /snap/amazon-ssm-agent/930
nvme0n1     259:0    0  300G  0 disk
nvme1n1     259:1    0    8G  0 disk
└─nvme1n1p1 259:2    0    8G  0 part /

In the above case we want to attach nvme0n1 - a 300GB gp2 EBS volume. Check if the volume already has data in it (e.g. created from a snapshot or being attached to a new instance):

Visual Diagnostics for More Effective Machine Learning

Thu, 10 Jan 2019 12:00:00 -0500

Visual Diagnostics for More Effective Machine Learning

Description

Modeling is often treated as a search activity: find some combination of features, algorithm, and hyperparameters that yields the best score after cross-validation. In this talk, we will explore how to steer the model selection process with visual diagnostics and the Yellowbrick library, leading to more effective and more interpretable results and faster experimental workflows.

Blast Throughput

Wed, 26 Sep 2018 17:06:24 +0000

Blast throughput is what we call a throughput measurement such that N requests are simultaneously sent to the server and the duration to receive responses for all N requests is recorded. The throughput is computed as N/duration where duration is in seconds. This is the typical and potentially correct way to measure throughput from a client to a server, however issues do arise in distributed systems land:

the requests must all originate from a single client
high latency response outliers can skew results
you must be confident that N is big enough to max out the server
N mustn’t be so big as to create non-server related bottlenecks.

In this post I’ll discuss my implementation of the blast workload as well as an issue that came up with many concurrent connections in gRPC. This led me down the path to use one connection to do blast throughput testing, which led to other issues, which I’ll discuss later.

Go Testing Notes

Sat, 22 Sep 2018 09:58:12 +0000

In this post I’m just going to maintain a list of notes for Go testing that I seem to commonly need to reference. It will also serve as an index for the posts related to testing that I have to commonly look up as well. Here is a quick listing of the table of contents:

Basics

Just a quick reminder of how to write tests, benchmarks, and examples. A test is written as follows:

Streaming Remote Throughput

Tue, 11 Sep 2018 15:19:17 +0000

In order to improve the performance of asynchronous message passing in Alia, I’m using gRPC bidirectional streaming to create the peer to peer connections. When the replica is initialized it creates a remote connection to each of its peers that lives in its own go routine; any other thread can send messages by passing them to that go routine through a channel, replies are then dispatched via another channel, directed to the thread via an actor dispatching model.

Future Date Script

Wed, 05 Sep 2018 17:42:50 +0000

This is kind of a dumb post, but it’s something I’m sure I’ll look up in the future. I have a lot of emails where I have to send a date that’s sometime in the future, e.g. six weeks from the end of a class to specify a deadline … I’ve just been opening a Python terminal and importing datetime and timedelta but I figured this quick script on the command line would make my life a bit easier:

Aggregating Reads from a Go Channel

Sat, 25 Aug 2018 08:28:59 +0000

Here’s the scenario: we have a buffered channel that’s being read by a single Go routine and is written to by multiple go routines. For simplicity, we’ll say that the channel accepts events and that the other routines generate events of specific types, A, B, and C. If there are more of one type of event generator (or some producers are faster than others) we may end up in the situation where there are a series of the same events on the buffered channel. What we would like to do is read all of the same type of event that is on the buffered channel at once, handling them all simultaneously; e.g. aggregating the read of our events.

The Actor Model

Fri, 03 Aug 2018 07:27:36 +0000

Building correct concurrent programs in a distributed system with multiple threads and processes can quickly become very complex to reason about. For performance, we want each thread in a single process to operate as independently as possible; however anytime the shared state of the system is modified synchronization is required. Primitives like mutexes can [ensure structs are thread-safe]({% post_url 2017-02-21-synchronizing-structs %}), however in Go, the strong preference for synchronization is communication. In either case Go programs can quickly become locks upon locks or morasses of channels, incurring performance penalties at each synchronization point.

Syntax Parsing with CoreNLP and NLTK

Fri, 22 Jun 2018 14:38:21 +0000

Syntactic parsing is a technique by which segmented, tokenized, and part-of-speech tagged text is assigned a structure that reveals the relationships between tokens governed by syntax rules, e.g. by grammars. Consider the sentence:

The factory employs 12.8 percent of Bradford County.

A syntax parse produces a tree that might help us understand that the subject of the sentence is “the factory”, the predicate is “employs”, and the target is “12.8 percent”, which in turn is modified by “Bradford County”. Syntax parses are often a first step toward deep information extraction or semantic understanding of text. Note however, that syntax parsing methods suffer from structural ambiguity, that is the possibility that there exists more than one correct parse for a given sentence. Attempting to select the most likely parse for a sentence is incredibly difficult.

Understanding Machine Learning Through Visualizations with Benjamin Bengfort and Rebecca Bilbro - Episode 166

Sun, 17 Jun 2018 12:00:00 -0500

Understanding Machine Learning Through Visualizations with Benjamin Bengfort and Rebecca Bilbro - Episode 166

Description

Machine learning models are often inscrutable and it can be difficult to know whether you are making progress. To improve feedback and speed up iteration

Continuing Outer Loops with for/else

Thu, 17 May 2018 09:02:43 +0000

When you have an outer and an inner loop, how do you continue the outer loop from a condition inside the inner loop? Consider the following code:

for i in range(10):
    for j in range(9):
        if i <= j:
            # break out of inner loop
            # continue outer loop
        print(i,j)

    # don't print unless inner loop completes,
    # e.g. outer loop is not continued
    print("inner complete!")

Here, we want to print for all i ∈ [0,10) all numbers j ∈ [0,9) that are less than or equal to i and we want to print complete once we’ve found an entire list of j that meets the criteria. While this seems like a fairly contrived example, I’ve actually encountered this exact situation in several places in code this week, and I’ll provide a real example in a bit.

Predicted Class Balance

Thu, 08 Mar 2018 09:18:37 +0000

This is a follow on to the [prediction distribution]({{ site.base_url }}{% link _posts/2018-02-28-prediction-distribution.md %}) visualization presented in the last post. This visualization shows a bar chart with the number of predicted and number of actual values for each class, e.g. a class balance chart with predicted balance as well.

This visualization actually came before the prior visualization, but I was more excited about that one because it showed where error was occurring similar to a classification report or confusion matrix. I’ve recently been using this chart for initial spot checking more however, since it gives me a general feel for how balanced both the class and the classifier is with respect to each other. It has also helped diagnose what is being displayed in the heat map chart of the other post.

Class Balance Prediction Distribution

Wed, 28 Feb 2018 12:52:11 +0000

In this quick snippet I present an alternative to the confusion matrix or classification report visualizations in order to judge the efficacy of multi-class classifiers:

The base of the visualization is a class balance chart, the x-axis is the actual (or true class) and the height of the bar chart is the number of instances that match that class in the dataset. The difference here is that each bar is a stacked chart representing the percentage of the predicted class given the actual value. If the predicted color matches the actual color then the classifier was correct, otherwise it was wrong.

Synchronization in Write Throughput

Tue, 13 Feb 2018 07:10:06 +0000

This post serves as a reminder of how to perform benchmarks when accounting for synchronized writing in Go. The normal benchmarking process involves running a command a large number of times and determining the average amount of time that operation took. When threads come into play, we consider throughput - that is the number of operations that can be conducted per second. However, in order to successfully measure this without duplicating time, the throughput must be measured from the server’s perspective.

Thread and Non-Thread Safe Go Set

Fri, 26 Jan 2018 09:15:13 +0000

I came across this now archived project that implements a set data structure in Go and was intrigued by the implementation of both thread-safe and non-thread-safe implementations of the same data structure. Recently I’ve been attempting to get rid of locks in my code in favor of one master data structure that does all of the synchronization, having multiple options for thread safety is useful. Previously I did this by having a lower-case method name (a private method) that was non-thread-safe and an upper-case method name (public) that did implement thread-safety. However, as I’ve started to reorganize my packages this no longer works.

Git-Style File Editing in CLI

Sat, 06 Jan 2018 09:30:58 +0000

A recent application I was working on required the management of several configuration and list files that needed to be validated. Rather than have the user find and edit these files directly, I wanted to create an editing workflow similar to crontab -e or git commit — the user would call the application, which would redirect to a text editor like vim, then when editing was complete, the application would take over again.

Transaction Handling with Psycopg2

Wed, 06 Dec 2017 13:58:16 +0000

Databases are essential to most applications, however most database interaction is often overlooked by Python developers who use higher level libraries like Django or SQLAlchemy. We use and love PostgreSQL with Psycopg2, but I recently realized that I didn’t have a good grasp on how exactly psycopg2 implemented core database concepts: particularly transaction isolation and thread safety.

Here’s what the documentation says regarding transactions:

Transactions are handled by the connection class. By default, the first time a command is sent to the database (using one of the cursors created by the connection), a new transaction is created. The following database commands will be executed in the context of the same transaction – not only the commands issued by the first cursor, but the ones issued by all the cursors created by the same connection. Should any command fail, the transaction will be aborted and no further command will be executed until a call to the rollback() method.

Lock Diagnostics in Go

Thu, 28 Sep 2017 10:44:30 +0000

By now it’s pretty clear that I’ve just had a bear of a time with locks and synchronization inside of multi-threaded environments with Go. Probably most gophers would simply tell me that I should share memory by communicating rather than to communication by sharing memory — and frankly I’m in that camp too. The issue is that:

Mutexes can be more expressive than channels
Channels are fairly heavyweight

So to be honest, there are situations where a mutex is a better choice than a channel. I believe that one of those situations is when dealing with replicated state machines … which is what I’ve been working on the past few months. The issue is that the state of the replica has to be consistent across a variety of events: timers and remote messages. The problem is that the timers and network traffic are all go routines, and there can be a lot of them running in the system at a time.

Lock Queuing in Go

Fri, 08 Sep 2017 11:31:19 +0000

In Go, you can use sync.Mutex and sync.RWMutex objects to create thread-safe data structures in memory as discussed in [“Synchronizing Structs for Safe Concurrency in Go”]({% post_url 2017-02-21-synchronizing-structs %}). When using the sync.RWMutex in Go, there are two kinds of locks: read locks and write locks. The basic difference is that many read locks can be acquired at the same time, but only one write lock can be acquired at at time.

Messaging Throughput gRPC vs. ZMQ

Mon, 04 Sep 2017 17:20:06 +0000

Building distributed systems in Go requires an RPC or message framework of some sort. In the systems I build I prefer to pass messages serialized with protocol buffers therefore a natural choice for me is grpc. The grpc library uses HTTP2 as a transport layer and provides a code generator based on the protocol buffer syntax making it very simple to use.

For more detailed control, the ZMQ library is an excellent, low latency socket framework. ZMQ provides several communication patterns from basic REQ/REP (request/reply) to PUB/SUB (publish/subscribe). ZMQ is used at a lower level though, so more infrastructure per app needs to be built.

Online Distribution

Mon, 28 Aug 2017 12:49:46 +0000

This post started out as a discussion of a struct in Go that could keep track of online statistics without keeping an array of values. It ended up being a lesson on over-engineering for concurrency.

The spec of the routine was to build a data structure that could keep track of internal statistics of values over time in a space-saving fashion. The primary interface was a method, Update(sample float64), so that a new sample could be passed to the structure, updating internal parameters. At conclusion, the structure should be able to describe the mean, variance, and range of all values passed to the update method. I created two versions:

Rapid FS Walks with ErrGroup

Fri, 18 Aug 2017 15:33:35 +0000

I’ve been looking for a way to quickly scan a file system and gather information about the files in directories contained within. I had been doing this with multiprocessing in Python, but figured Go could speed up my performance by a lot. What I discovered when I went down this path was the sync.ErrGroup, an extension of the sync.WaitGroup that helps manage the complexity of multiple go routines but also includes error handling!

Buffered Write Performance

Thu, 03 Aug 2017 09:48:19 +0000

This is just a quick note on the performance of writing to a file on disk using Go, and reveals a question about a common programming paradigm that I am now suspicious of. I discovered that when I wrapped the open file object with a bufio.Writer that the performance of my writes to disk significantly increased. Ok, so this isn’t about simple file writing to disk, this is about a complex writer that does some seeking in the file writing to different positions and maintains the overall state of what’s on disk in memory, however the question remains:

Event Dispatcher in Go

Fri, 21 Jul 2017 06:28:45 +0000

The event dispatcher pattern is extremely common in software design, particularly in languages like JavaScript that are primarily used for user interface work. The dispatcher is an object (usually a mixin to other objects) that can register callback functions for particular events. Then when a dispatch method is called with an event, the dispatcher calls each callback function in order of their registration and passes them a copy of the event. In fact, I’ve already written a version of this pattern in Python: [Implementing Observers with Events]({% post_url 2016-02-16-observer-pattern %}) In this snippet, I’m presenting a version in Go that has been incredibly stable and useful in my code.

Lazy Pirate Client

Fri, 14 Jul 2017 10:24:15 +0000

In the [last post]({% post_url 2017-07-13-zmq-basic %}) I discussed a simple REQ/REP pattern for ZMQ. However, by itself REQ/REP is pretty fragile. First, every REQ requires a REP and a server can only handle one request at a time. Moreover, if the server fails in the middle of a reply, then everything is hung. We need more reliable REQ/REP, which is actually the subject of an entire chapter in the ZMQ book.

Simple ZMQ Message Passing

Thu, 13 Jul 2017 11:00:27 +0000

There are many ways to create RPCs and send messages between nodes in a distributed system. Typically when we think about messaging, we think about a transport layer (TCP, IP) and a protocol layer (HTTP) along with some message serialization. Perhaps best known are RESTful APIs which allow us to GET, POST, PUT, and DELETE JSON data to a server. Other methods include gRPC which uses HTTP and protocol buffers for interprocess communication.

PID File Management

Tue, 11 Jul 2017 09:10:44 +0000

In this discussion, I want to propose some code to perform PID file management in a Go program. When a program is backgrounded or daemonized we need some way to communicate with it in order to stop it. All active processes are assigned a unique process id by the operating system and that ID can be used to send signals to the program. Therefore a PID file:

The pid files contains the process id (a number) of a given program. For example, Apache HTTPD may write it’s main process number to a pid file - which is a regular text file, nothing more than that - and later use the information there contained to stop itself. You can also use that information (just do a cat filename.pid) to kill the process yourself, using echo filename.pid | xargs kill.

Public IP Address Discovery

Sun, 09 Jul 2017 13:14:46 +0000

When doing research on peer-to-peer networks, addressing can become pretty complex pretty quickly. Not everyone has the resources to allocate static, public facing IP addresses to machines. A machine that is in a home network for example only has a single public-facing IP address, usually assigned to the router. The router then performs NAT (network address translation) forwarding requests to internal devices.

In order to get a service running on an internal network, you can port forward external requests to a specific port to a specific device. Requests are made to the router’s IP address, and the router passes it on. But how do you know the IP address of the device? Moreover, what happens if the router is assigned a new IP address? Static IP addresses generally cost more.

On the Tracks with Rails

Thu, 06 Jul 2017 08:15:13 +0000

I’m preparing to move into a new job when I finish my dissertation hopefully later this summer. The new role involves web application development with Rails and so I needed to get up to speed. I had a web application requirement for my research so I figured I’d knock out two birds with one stone and build that app with Rails (a screenshot of the app is above, though of course this is just a front-end and doesn’t really tell you it was built with Rails).

Visual Pipelines for Text Analysis

Sat, 24 Jun 2017 12:00:00 -0500

Visual Pipelines for Text Analysis

Description

Employing machine learning in practice is half search, half expertise, and half blind luck. In this talk we will explore how to make the luck half less blind by using visual pipelines to steer model selection from raw input to operational prediction. We will look specifically at extending transformer pipelines with visualizers for sentiment analysis and topic modeling text corpora.

Concurrent Subprocesses and Fabric

Wed, 14 Jun 2017 15:56:24 +0000

I’ve ben using Fabric to concurrently start multiple processes on several machines. These processes have to run at the same time (since they are experimental processes and are interacting with each other) and shut down at more or less the same time so that I can collect results and immediately execute the next sample in the experiment. However, I was having a some difficulties directly using Fabric:

Fabric can parallelize one task across multiple hosts accordint to roles.
Fabric can be hacked to run multiple tasks on multiple hosts by setting env.dedupe_hosts = False
Fabric can only parallelize one type of task, not multiple types
Fabric can’t handle large numbers of SSH connections

In this post we’ll explore my approach with Fabric and my current solution.

Appending Results to a File

Mon, 12 Jun 2017 16:04:24 +0000

In my current experimental setup, each process is a single instance of sample, from start to finish. This means that I need to aggregate results across multiple process runs that are running concurrently. Moreover, I may need to aggregate those results between machines.

The most compact format to store results in is CSV. This was my first approach and it had some benefits including:

small file sizes
readability
CSV files can just be concatenated together

The problems were:

Compression Benchmarks

Wed, 07 Jun 2017 10:45:35 +0000

One of the projects I’m currently working on is the ingestion of RSS feeds into a Mongo database. It’s been running for the past year, and as of this post has collected 1,575,987 posts for 373 feeds after 8,126 jobs. This equates to about 585GB of raw data, and a firm requirement for compression in order to exchange data.

Recently, @ojedatony1616 downloaded the compressed zip file (53GB) onto a 1TB external hard disk and attempted to decompress it. After three days, he tried to cancel it and ended up restarting his computer because it wouldn’t cancel. His approach was simply to double click the file on OS X, but that got me to thinking – it shouldn’t have taken that long; why did it choke? Inspecting the export logs on the server, I noted that it took 137 minutes to compress the directory; shouldn’t it take that long to decompress as well?

Decorating Nose Tests

Mon, 22 May 2017 13:05:08 +0000

Was introduced to an interesting problem today when decorating tests that need to be discovered by the nose runner. By default, nose explores a directory looking for things named test or tests and then executes those functions, classes, modules, etc. as tests. A standard test suite for me looks something like:

import unittest


class MyTests(unittest.TestCase):

    def test_undecorated(self):
        """
        assert undecorated works
        """
        self.assertEqual(2+2, 4)

The problem came up when we wanted to decorate a test with some extra functionality, for example loading a fixture:

In Process Cacheing

Wed, 17 May 2017 08:16:34 +0000

I have had some recent discussions regarding cacheing to improve application performance that I wanted to share. Most of the time those conversations go something like this: “have you heard of Redis?” I’m fascinated by the fact that an independent, distributed key-value store has won the market to this degree. However, as I’ve pointed out in these conversations, cacheing is a hierarchy (heck, even the processor has varying levels of cacheing). Especially when considering micro-service architectures that require extremely low latency responses, cacheing should be a critical part of the design, not just a bolt-on after thought!

Unique Values in Python: A Benchmark

Tue, 02 May 2017 14:24:18 +0000

An interesting question came up in the development of Yellowbrick: given a vector of values, what is the quickest way to get the unique values? Ok, so maybe this isn’t a terribly interesting question, however the results surprised us and may surprise you as well. First we’ll do a little background, then I’ll give the results and then discuss the benchmarking method.

The problem comes up in Yellowbrick when we want to get the discrete values for a target vector, y — a problem that comes up in classification tasks. By getting the unique set of values we know the number of classes, as well as the class names. This information is necessary during visualization because it is vital in assigning colors to individual classes. Therefore in a Visualizer we might have a method as follows:

Measuring Throughput

Fri, 28 Apr 2017 15:22:40 +0000

Part of my research is taking me down a path where I want to measure the number of reads and writes from a client to a storage server. A key metric that we’re looking for is throughput — the number of accesses per second that a system supports. As I discovered in a very simple test to get some baseline metrics, even this simple metric can have some interesting complications.

OAuth Tokens on the Command Line

Thu, 20 Apr 2017 10:26:32 +0000

This week I discovered I had a problem with my Google Calendar — events accidentally got duplicated or deleted and I needed a way to verify that my primary calendar was correct. Rather than painstakingly go through the web interface and spot check every event, I instead wrote a Go console program using the Google Calendar API to retrieve events and save them in a CSV so I could inspect them all at once. This was great, and very easy using Google’s Go libraries for their APIs, and the quick start was very handy.

Gmail Notifications with Python

Mon, 17 Apr 2017 12:26:55 +0000

I routinely have long-running scripts (e.g. for a data processing task) that I want to know when they’re complete. It seems like it should be simple for me to add in a little snippet of code that will send an email using Gmail to notify me, right? Unfortunately, it isn’t quite that simple for a lot of reasons, including security, attachment handling, configuration, etc. In this snippet, I’ve attached my constant copy and paste notify() function, written into a command line script for easy sending on the command line.

A Benchmark of Grumpy Transpiling

Thu, 23 Mar 2017 08:47:04 +0000

On Tuesday evening I attended a Django District meetup on Grumpy, a transpiler from Python to Go. Because it was a Python meetup, the talk naturally focused on introducing Go to a Python audience, and because it was a Django meetup, we also focused on web services. The premise for Grumpy, as discussed in the announcing Google blog post, is also a web focused one — to take YouTube’s API that’s primarily written in Python and transpile it to Go to improve the overall performance and stability of YouTube’s front-end services.

Sanely gRPC Dial a Remote

Tue, 21 Mar 2017 16:27:39 +0000

In my systems I need to handle failure; so unlike in a typical client-server relationship, I’m prepared for the remote I’m dialing to not be available. Unfortunately when you do this with gRPC-Go there are a couple of annoyances you have to address. They are (in order of solutions):

Verbose connection logging
Background and back-off for reconnection attempts
Errors are not returned on demand.
There is no ability to keep track of statistics

So first the logging. When you dial an unavailable remote as follows:

Contributing a Multiprocess Memory Profiler

Mon, 20 Mar 2017 11:42:58 +0000

In this post I wanted to catalog the process of an open source contribution I was a part of, which added a feature to the memory profiler Python library by Fabian Pedregosa and Philippe Gervais. It’s a quick story to tell but took over a year to complete, and I learned a lot from the process. I hope that the story is revealing, particularly to first time contributors and shows that even folks that have been doing this for a long time still have to find ways to positively approach collaboration in an open source environment. I also think it’s a fairly standard example of how contributions work in practice and perhaps this story will help us all think about how to better approach the pull request process.

Pseudo Merkle Tree

Thu, 16 Mar 2017 12:23:21 +0000

A Merkle tree is a data structure in which every non-leaf node is labeled with the hash of its child nodes. This makes them particular useful for comparing large data structures quickly and efficiently. Given trees a and b, if the root hash of either is different, it means that part of the tree below is different (if they are identical, they are probably also identical). You can then proceed in a a breadth first fashion, pruning nodes with identical hashes to directly identify the differences.

Using Select in Go

Wed, 08 Mar 2017 10:52:39 +0000

Ask a Go programmer what makes Go special and they will immediately say “concurrency is baked into the language”. Go’s concurrency model is one of communication (as opposed to locks) and so concurrency primitives are implemented using channels. In order to synchronize across multiple channels, go provides the select statement.

A common pattern for me has become to use a select to manage broadcasted work (either in a publisher/subscriber model or a fanout model) by initializing go routines and passing them directional channels for synchronization and communication. In the example below, I create a buffered channel for output (so that the workers don’t block waiting for the receiver to collect data), a channel for errors (first error kills the program) and a timer to update the state of my process on a routine basis. The select waits for the first channel to receive a message and then continues processing. By keeping the select in a for loop, I can continually read of the channels until I’m done.

Benchmarking Secure gRPC

Sun, 05 Mar 2017 17:26:24 +0000

A natural question to ask after the previous post is “how much overhead does security add?” So I’ve benchmarked the three methods discussed; mutual TLS, server-side TLS, and no encryption. The results are below:

Here are the numeric results for one of the runs:

BenchmarkMutualTLS-8   	     200	   9331850 ns/op
BenchmarkServerTLS-8   	     300	   5004505 ns/op
BenchmarkInsecure-8    	    2000	   1179252 ns/op
PASS
ok  	github.com/bbengfort/sping	7.364s

Here is the code for the benchmarking for reference:

Secure gRPC with TLS/SSL

Fri, 03 Mar 2017 09:41:39 +0000

One of the primary requirements for the systems we build is something we call the “minimum security requirement”. Although our systems are not designed specifically for high security applications, they must use minimum standards of encryption and authentication. For example, it seems obvious to me that a web application that stores passwords or credit card information would encrypt their data on disk on a per-record basis with a salted hash. In the same way, a distributed system must be able to handle encrypted blobs, encrypt all inter-node communication, and authenticate and sign all messages. This adds some overhead to the system but the cost of overhead is far smaller than the cost of a breach, and if minimum security is the baseline then the overhead is just an accepted part of doing business.

Synchronizing Structs for Safe Concurrency in Go

Tue, 21 Feb 2017 10:48:24 +0000

Go is built for concurrency by providing language features that allow developers to embed complex concurrency patterns into their applications. These language features can be intuitive and a lot of safety is built in (for example a race detector) but developers still need to be aware of the interactions between various threads in their programs.

In any shared memory system the biggest concern is synchronization: ensuring that separate go routines operate in the correct order and that no race conditions occur. The primary way to handle synchronization is the use of channels. Channels synchronize execution by forcing sends on the channel to block until the value on the channel is received. In this way, channels act as a barrier since the go routine can not progress while being blocked by the channel and enforce a specific ordering to execution, the ordering of routines arriving at the barrier.

Fixed vs. Variable Length Chunking

Wed, 08 Feb 2017 19:51:28 +0000

FluidFS and other file systems break large files into recipes of hash-identified blobs of binary data. Blobs can then be replicated with far more ease than a single file, as well as streamed from disk in a memory safe manner. Blobs are treated as single, independent units so the underlying data store doesn’t grow as files are duplicated. Finally, blobs can be encrypted individually and provide more opportunities for privacy.

Extracting a TOC from Markup

Sun, 05 Feb 2017 09:11:27 +0000

In today’s addition of “really simple things that come in handy all the time” I present a simple script to extract the table of contents from markdown or asciidoc files:

So this is pretty simple, just use regular expressions to look for lines that start with one or more "#" or "=" (for markdown and asciidoc, respectively) and print them out with an indent according to their depth (e.g. indent ## heading 2 one block). Because this script goes from top to bottom, you get a quick view of the document structure without creating a nested data structure under the hood. I’ve also implemented some simple type detection using common extensions to decide which regex to use.

In-Memory File System with FUSE

Mon, 30 Jan 2017 16:17:26 +0000

The Filesystem in Userspace (FUSE) software interface allows developers to create file systems without editing kernel code. This is especially useful when creating replicated file systems, file protocols, backup systems, or other computer systems that require intervention for FS operations but not an entire operating system. FUSE works by running the FS code as a user process while FUSE provides a bridge through a request/response protocol to the kernel.

In Go, the FUSE library is implemented by bazil.org/fuse. It is a from-scratch implementation of the kernel-userspace communication protocol and does not use the C library. The library has been excellent for research implementations, particularly because Go is such an excellent language (named programming language of 2016). However, it does lead to some questions (particularly because of the questions in the Go documentation):

FUSE Calls on Go Writes

Thu, 26 Jan 2017 20:04:40 +0000

For close-to-open consistency, we need to be able to implement a file system that can detect atomic changes to a single file. Most programming languages implement open() and close() methods for files - but what they are really modifying is the access of a handle to an open file that the operating system provides. Writes are buffered in an asynchronous fashion so that the operating system and user program don’t have to wait for the spinning disk to figure itself out before carrying on. Additional file calls such as sync() and flush() give the user the ability to hint to the OS about what should happen relative to the state of data and the disk, but the OS provides no guarantees that will happen.

Error Descriptions for System Calls

Mon, 23 Jan 2017 14:29:50 +0000

Working with FUSE to build file systems means inevitably you have to deal with (or return) system call errors. The Go FUSE implementation includes helpers and constants for returning these errors, but simply wraps them around the syscall error numbers. I needed descriptions to better understand what was doing what. Pete saved the day by pointing me towards the errno.h header file on my Macbook. Some Python later and we had the descriptions:

Run Until Error with Go Channels

Thu, 19 Jan 2017 11:00:40 +0000

Writing systems means the heavy use of go routines to support concurrent operations. My current architecture employs several go routines to run a server for a simple web interface as well as command line app, file system servers, replica servers, consensus coordination, etc. Using multiple go routines (threads) instead of processes allows for easier development and shared resources, such as a database that can support transactions. However, management of all these threads can be tricky.

Generic JSON Serialization with Go

Wed, 18 Jan 2017 11:31:06 +0000

This post is just a reminder as I work through handling JSON data with Go. Go provides first class JSON support through its standard library json package. The interface is simple, primarily through json.Marshal and json.Unmarshal functions which are analagous to typed versions of json.load and json.dump. Type safety is the trick, however, and generally speaking you define a struct to serialize and deserialize as follows:

type Person struct {
    Name   string `json:"name,omitempty"`
    Age    int    `json:"age,omitempty"`
    Salary int    `json:"-"`
}

op := &Person{"John Doe", 42}
data, _ := json.Marshal(op)

var np Person
json.Unmarshall(data, &np)

So this is all well and good, until you start wanting to just send around arbirtray data. Luckly the json package will allow you to do that using reflection to load data into a map[string]interface{}, e.g. a dictionary whose keys are strings and whose values are any arbitrary type (anything that implements the null interface, that is has zero or more methods, which all Go types do). So you might see code like this:

Resolving Matplotlib Colors

Tue, 17 Jan 2017 14:52:50 +0000

One of the challenges we’ve been dealing with in the Yellowbrick library is the proper resolution of colors, a problem that seems to have parallels in matplotlib as well. The issue is that colors can be described by the user in a variety of ways, then that description has to be parsed and rendered as specific colors. To name a few color specifications that exist in matplotlib:

None: choose a reasonable default color
The name of the color, e.g. "b" or "blue"
The hex code of the color e.g. "#377eb8"
The RGB or RGBA tuples of the color, e.g. (0.0078, 0.4470, 0.6353)
A greyscale intensity string, e.g. "0.76".

The pyplot api documentation sums it up as follows:

Benchmarking Readline Iterators

Fri, 23 Dec 2016 10:18:01 +0000

I’m starting to get serious about programming in Go, trying to move from an intermediate level to an advanced/expert level as I start to build larger systems. Right now I’m working on a problem that involves on demand iteration, and I don’t want to pass around entire arrays and instead be a bit more frugal about my memory usage. Yesterday, I discussed using [channels to yield iterators from functions]({% post_url 2016-12-22-yielding-functions-for-iteration-golang %}) and was a big fan of the API, but had some questions about memory usage. So today I created a package, iterfile to benchmark and profile various iteration constructs in Go.

Yielding Functions for Iteration in Go

Thu, 22 Dec 2016 06:54:26 +0000

It is very common for me to design code that expects functions to return an iterable context, particularly because I have been developing in Python with the yield statement. The yield statement allows functions to “return” the execution context to the caller while still maintaining state such that the caller can return state to the function and continue to iterate. It does this by actually returning a generator, iterable object constructed from the local state of the closure.

Data Product Architectures: O'Reilly Webinar

Wed, 07 Dec 2016 12:00:00 -0500

Data Product Architectures: O’Reilly Webinar

Description

Data products derive their value from data and generate new data in return. As a result, machine-learning techniques must be applied to their architecture and development. Machine learning fits models to make predictions on unknown inputs and must be generalizable and adaptable. As such, fitted models cannot exist in isolation; they must be operationalized and user facing so that applications can benefit from the new data, respond to it, and feed it back into the data product.

Exception Handling

Mon, 21 Nov 2016 12:53:30 +0000

This short tutorial is intended to demonstrate the basics of exception handling and the use of context management in order to handle standard cases. These notes were originally created for a training I gave, and the notebook can be found at Exception Handling. I’m happy for any comments or pull requests on the notebook.

Exceptions

Exceptions are a tool that programmers use to describe errors or faults that are fatal to the program; e.g. the program cannot or should not continue when an exception occurs. Exceptions can occur due to programming errors, user errors, or simply unexpected conditions like no internet access. Exceptions themselves are simply objects that contain information about what went wrong. Exceptions are usually defined by their type - which describes broadly the class of exception that occurred, and by a message that says specifically what happened. Here are a few common exception types:

SVG Vertex with a Timer

Fri, 04 Nov 2016 10:30:29 +0000

In order to promote the use of graph data structures for data analysis, I’ve recently given talks on dynamic graphs: embedding time into graph structures to analyze change. In order to embed time into a graph there are two primary mechanisms: make time a graph element (a vertex or an edge) or have multiple subgraphs where each graph represents a discrete time step. By using either of these techniques, opportunities exist to perform a structural analysis using graph algorithms on time; for example - asking what time is most central to a particular set of relationships.

Message Latency: Ping vs. gRPC

Wed, 02 Nov 2016 15:46:31 +0000

Building distributed systems means passing messages between devices over a network connection. My research specifically considers networks that have extremely variable latencies or that can be partition prone. This led me to the natural question, “how variable are real world networks?” In order to get real numbers, I built a simple echo protocol using Go and gRPC called Orca.

I ran Orca for a few days and got some latency measurements as I traveled around with my laptop. Orca does a lot of work, including GeoIP look ups, IP address resolution, and database queries and storage. This post, however, is not about Orca. The latencies I was getting were very high relative to the round-trip latencies reported by the simple ping command that implements the ICMP protocol.

Computing Reading Speed

Fri, 28 Oct 2016 13:16:24 +0000

Ashley and I have been going over the District Data Labs Blog trying to figure out a method to make it more accessible both to readers (who are at various levels) and to encourage writers to contribute. To that end, she’s been exploring other blogs to see if we can put multiple forms of content up; long form tutorials (the bulk of what’s there) and shorter idea articles, possibly even as short as the posts I put on my dev journal. One interesting suggestion she had was to mark the reading time of each post, something that the Longreads Blog does. This may help give readers a better sense of the time committment and be able to engage more easily.

Dynamics in Graph Analysis Adding Time as a Structure for Visual and Statistical Insight

Mon, 24 Oct 2016 12:00:00 -0500

I gave this talk twice, both at PyData DC on October 24, 2016 and at PyData Carolinas on September 15, 2016. Both videos are below if you feel like figuring out which presentation was better!

PyData DC

PyData Carolinas

Slides

Description

Network analyses are powerful methods for both visual analytics and machine learning but can suffer as their complexity increases. By embedding time as a structural element rather than a property, we will explore how time series and interactive analysis can be improved on Graph structures. Primarily we will look at decomposition in NLP-extracted concept graphs using NetworkX and Graph Tool.

Modifying an Image's Aspect Ratio

Tue, 13 Sep 2016 14:19:14 +0000

When making slides, I generally like to use Flickr to search for images that are licensed via Creative Commons to use as backgrounds. My slide deck tools of choice are either Reveal.js or Google Slides. Both tools allow you to specify an image as a background for the slide, but for Google Slides in particular, if the aspect ratio of the image doesn’t match the aspect ratio of the slide deck, then weird things can happen.

Serializing GraphML

Fri, 09 Sep 2016 17:13:13 +0000

This is mostly a post of annoyance. I’ve been working with graphs in Python via NetworkX and trying to serialize them to GraphML for use in Gephi and graph-tool. Unfortunately the following error is really starting to get on my nerves:

networkx.exception.NetworkXError: GraphML writer does not support <class 'datetime.datetime'> as data values.

Also it doesn’t support <type NoneType> or list or dict or …

So I have to do something about it:

Parallel Enqueue and Workers

Wed, 07 Sep 2016 14:29:51 +0000

I was recently asked about the parallelization of both the enqueuing of tasks and their processing. This is a tricky subject because there are a lot of factors that come into play. For example do you have two parallel phases, e.g. a map and a reduce phase that need to be synchronized, or is there some sort of data parallelism that requires multiple tasks to be applied to the data (e.g. Storm-style topology). While there are a lot of tools for parallel processing in batch for large data sets, how do you take care of simple problems with large datasets (say hundreds of gigabytes) on a single machine with a quad core or hyperthreading multiprocessor?

Parallel NLP Preprocessing

Fri, 12 Aug 2016 22:09:25 +0000

A common source of natural language corpora comes from the web, usually in the form of HTML documents. However, in order to actually build models on the natural language, the structured HTML needs to be transformed into units of discourse that can then be used for learning. In particular, we need to strip away extraneous material such as navigation or advertisements, targeting exactly the content we’re looking for. Once done, we need to split paragraphs into sentences, sentences into tokens, and assign part-of-speech tags to each token. The preprocessing therefore transforms HTML documents to a list of paragraphs, which are themselves a list of sentences, which are lists of token, tag tuples.

Pretty Print Directories

Mon, 01 Aug 2016 12:17:47 +0000

It feels like there are many questions like this one on Stack Overflow: Representing Directory & File Structure in Markdown Syntax, basically asking “how can we represent a directory structure in text in a pleasant way?” I too use these types of text representations in slides, blog posts, books, etc. It would be very helpful if I had an automatic way of doing this so I didn’t have to create it from scratch.

Interview - Ben Bengfort of District Data Labs

Thu, 28 Jul 2016 12:00:00 -0500

Description

We talk to Benjamin Bengfort about his Data Day Seattle talks, District Data Labs, and Ben’s popular O’Reilly books.

Visualizing the Model Selection Process

Sat, 23 Jul 2016 12:00:00 -0500

Description

Machine learning is the hacker art of describing the features of instances that we want to make predictions about, then fitting the data that describes those instances to a model form. Applied machine learning has come a long way from it’s beginnings in academia, and with tools like Scikit-Learn, it’s easier than ever to generate operational models for a wide variety of applications. Thanks to the ease and variety of the tools in Scikit-Learn, the primary job of the data scientist is model selection. Model selection involves performing feature engineering, hyperparameter tuning, and algorithm selection. These dimensions of machine learning often lead computer scientists towards automatic model selection via optimization (maximization) of a model’s evaluation metric. However, the search space is large, and grid search approaches to machine learning can easily lead to failure and frustration. Human intuition is still essential to machine learning, and visual analysis in concert with automatic methods can allow data scientists to steer model selection towards better fitted models, faster. In this talk, we will discuss interactive visual methods for better understanding, steering, and tuning machine learning models.

Data Product Architectures: Seattle Data Day

Thu, 21 Jul 2016 12:00:00 -0500

Description

Data products derive their value from data and generate new data in return; as a result, machine learning techniques must be applied to their architecture and their development. Machine learning fits models to make predictions on unknown inputs and must be generalizable and adaptable. As such, fitted models cannot exist in isolation; they must be operationalized and user facing so that applications can benefit from the new data, respond to it, and feed it back in to the data product. Data product architectures are therefore life cycles and understanding the data product life cycle will enable architects to develop robust, failure free workflows and applications. In this talk we will discuss the data product life cycle, explore how to engage a model build, evaluation, and selection phase with an operation and interaction phase. Following the lambda architecture, we will investigate wrapping a central computational store for speed and querying, as well as incorporating a discussion of monitoring, management, and data exploration for hypothesis driven development. From web applications to big data appliances; this architecture serves as a blueprint for handling data services of all sizes!

Color Map Utility

Fri, 15 Jul 2016 16:28:11 +0000

Many of us are spoiled by the use of matplotlib’s colormaps which allow you to specify a string or object name of a color map (e.g. Blues) then simply pass in a range of nearly continuous values which are spread along the color map. However, using these color maps for categorical or discrete values (like the colors of nodes) can pose challenges as the colors may not be distinct enough for the representation you’re looking for.

Visualizing Normal Distributions

Mon, 27 Jun 2016 08:28:07 +0000

Normal distributions are the backbone of random number generation for simulation. By selecting a mean (μ) and standard deviation (σ) you can generate simulated data representative of the types of models you’re trying to build (and certainly better than simple uniform random number generators). However, you might already be able to tell that selecting μ and σ is a little backward! Typically these metrics are computed from data, not used to describe data. As a result, utilities for tuning the behavior of your random number generators are simply not discussed.

Background Work with Goroutines on a Timer

Sun, 26 Jun 2016 06:52:38 +0000

As I’m moving deeper into my PhD, I’m getting into more Go programming for the systems that I’m building. One thing that I’m constantly doing is trying to create a background process that runs forever, and does some work at an interval. Concurrency in Go is native and therefore the use of threads and parallel processing is very simple, syntax-wise. However I am still solving problems that I wanted to make sure I recorded here.

Converting NetworkX to Graph-Tool

Thu, 23 Jun 2016 19:21:58 +0000

This week I discovered graph-tool, a Python library for network analysis and visualization that is implemented in C++ with Boost. As a result, it can quickly and efficiently perform manipulations, statistical analyses of Graphs, and draw them in a visual pleasing style. It’s like using Python with the performance of C++, and I was rightly excited:

It's a bear to get setup, but once you do things get pretty nice. Moving my network viz over to it now!

Natural Language Processing with NLTK and Gensim

Mon, 30 May 2016 12:00:00 -0500

Natural Language Processing with NLTK and Gensim

Description

Natural Language Processing (NLP) is often taught at the academic level from the perspective of computational linguists. However, as data scientists, we have a richer view of the natural language world - unstructured data that by its very nature has latent information that is important to humans. NLP practitioners have benefited from machine learning techniques to unlock meaning from large corpora, and in this class we’ll explore how to do that particularly with Python, Gensim, and the Natural Language Toolkit (NLTK).

Text Classification with NLTK and Scikit-Learn

Thu, 19 May 2016 08:06:40 +0000

This post is an early draft of expanded work that will eventually appear on the District Data Labs Blog. Your feedback is welcome, and you can submit your comments on the draft GitHub issue.

I’ve often been asked which is better for text processing, NLTK or Scikit-Learn (and sometimes Gensim). The answer is that I use all three tools on a regular basis, but I often have a problem mixing and matching them or combining them in meaningful ways. In this post, I want to show how I use NLTK for preprocessing and tokenization, but then apply machine learning techniques (e.g. building a linear SVM using stochastic gradient descent) using Scikit-Learn. In a follow on post, I’ll talk about vectorizing text with word2vec for machine learning in Scikit-Learn.

Creating a Microservice in Go

Wed, 11 May 2016 09:52:13 +0000

Yesterday I built my first microservice (a RESTful API) using Go, and I wanted to collect a few of my thoughts on the experience here before I forgot them. The project, Scribo, is intended to aid in my research by collecting data about a specific network that I’m looking to build distributed systems for. I do have something running, which will need to evolve a lot, and it could be helpful to know where it started.

Extracting Diffs from Git with Python

Fri, 06 May 2016 08:43:29 +0000

One of the first steps to performing analysis of Git repositories is extracting the changes over time, e.g. the Git log. This seems like it should be a very simple thing to do, as visualizations on GitHub and elsewhere show file change analyses through history on a commit by commit basis. Moreover, by using the GitPython library you have direct access to Git repositories that is scriptable. Unfortunately, things aren’t as simple as that, so I present a snippet for extracting change information from a Repository.

Visualizing Distributed Systems

Tue, 26 Apr 2016 11:34:42 +0000

As I’ve dug into my distributed systems research, one question keeps coming up: “How do you visualize distributed systems?” Distributed systems are hard, so it feels like being able to visualize the data flow would go a long way to understanding them in detail and avoiding bugs. Unfortunately, the same things that make architecting distributed systems difficult also make them hard to visualize.

I don’t have an answer to this question, unfortunately. However, in this post I’d like to state my requirements and highlight some visualizations that I think are important. Hopefully this will be the start of a more complete investigation or at least allow others to comment on what they’re doing and whether or not visualization is important.

Scikit-Learn Data Management: Bunches

Tue, 19 Apr 2016 11:29:30 +0000

One large issue that I encounter in development with machine learning is the need to structure our data on disk in a way that we can load into Scikit-Learn in a repeatable fashion for continued analysis. My proposal is to use the sklearn.datasets.base.Bunch object to load the data into data and target attributes respectively, similar to how Scikit-Learn’s toy datasets are structured. Using this object to manage our data will mirror the native API and allow us to easily copy and paste code that demonstrates classifiers and techniques with the built in datasets. Importantly, this API will also allow us to communicate to other developers and our future-selves exactly how to use the data.

Lessons in Discrete Event Simulation

Fri, 15 Apr 2016 06:26:42 +0000

Part of my research involves the creation of large scale distributed systems, and while we do build these systems and deploy them, we do find that simulating them for development and research gives us an advantage in trying new things out. To that end, I employ discrete event simulation (DES) using Python’s SimPy library to build very large simulations of distributed systems, such as the one I’ve built to inspect consistency patterns in variable latency, heterogenous, partition prone networks: CloudScope.

NLTK Corpus Reader for Extracted Corpus

Mon, 11 Apr 2016 21:03:18 +0000

Yesterday I wrote a blog about [extracting a corpus]({% post_url 2016-04-10-extract-ddl-corpus %}) from a directory containing Markdown, such as for a blog that is deployed with Silvrback or Jekyll. In this post, I’ll briefly show how to use the built in CorpusReader objects in nltk for streaming the data to the segmentation and tokenization preprocessing functions that are built into NLTK for performing analytics.

The dataset that I’ll be working with is the District Data Labs Blog, in particular the state of the blog as of today. The dataset can be downloaded from the ddl corpus, which also has the code in this post for you to use to perform other analytics.

Extracting the DDL Blog Corpus

Sun, 10 Apr 2016 06:44:28 +0000

We have some simple text analyses coming up and as an example, I thought it might be nice to use the DDL blog corpus as a data set. There are relatively few DDL blogs, but they all are long with a lot of significant text and discourse. It might be interesting to try to do some lightweight analysis on them.

So, how to extract the corpus? The DDL blog is currently hosted on Silvrback which is designed for text-forward, distraction-free blogging. As a result, there isn’t a lot of cruft on the page. I considered doing a scraper that pulled the web pages down or using the RSS feed to do the data ingestion. After all, I wouldn’t have to do a lot of HTML cleaning.

Dispatching Types to Handler Methods

Tue, 05 Apr 2016 08:58:32 +0000

A while I ago, I discussed the [observer pattern]({% post_url 2016-02-16-observer-pattern %}) for dispatching events based on a series of registered callbacks. In this post, I take a look at a similar, but very different methodology for dispatching based on type with pre-assigned handlers. For me, this is actually the more common pattern because the observer pattern is usually implemented as an API to outsider code. On the other hand, this type of dispatcher is usually a programmer’s pattern, used for development and decoupling.

Class Variables

Mon, 04 Apr 2016 19:52:46 +0000

These snippets are just a short reminder of how class variables work in Python. I understand this topic a bit too well, I think; I always remember the gotchas and can’t remember which gotcha belongs to which important detail. I generally come up with the right answer then convince myself I’m wrong until I write a bit of code and experiment. Hopefully this snippet will shortcut that process.

Consider the following class hierarchy:

Simple Password Generation

Wed, 30 Mar 2016 09:07:21 +0000

I was talking with @looselycoupled the other day about how we generate passwords for use on websites. We both agree that every single domain should have its own password (to prevent one crack ruling all your Internets). However, we’ve both evolved on the method over time, and I’ve written a simple script that allows me to generate passwords using methodologies discussed in this post.

In particular I use the generator to create passwords for pwSafe, the tool I currently use for password management (due to its use of the open source database format created by Bruce Schneier). It is my hope that this script can be embedded directly into pwSafe, or at least allow me to write directly to the database; but for now I just copy and paste with the pbcopy utility.

Visualizing Pi with matplotlib

Mon, 14 Mar 2016 10:56:57 +0000

Happy Pi day! As is the tradition at the University of Maryland (and to a certain extent, in my family) we are celebrating March 14 with pie and Pi. A shoutout to @konstantinosx who, during last year’s Pi day, requested blueberry pie, which was the strangest pie request I’ve received for Pi day. Not that blueberry pie is strange, just that someone would want one so badly for Pi day (he got a mixed berry pie).

Adding a Git Commit to Header Comments

Tue, 08 Mar 2016 13:57:56 +0000

You may have seen the following type of header at the top of my source code:

# main
# short description
#
# Author:   Benjamin Bengfort <benjamin@bengfort.com>
# Created:  Tue Mar 08 14:07:24 2016 -0500
#
# Copyright (C) 2016 Bengfort.com
# For license information, see LICENSE.txt
#
# ID: main.py [] benjamin@bengfort.com $

All of this is pretty self explanatory with the exception of the final line. This final line is a throw back to Subversion actually, when you could add a $Id$ tag to your code, and Subversion would automatically populate it with something that looks like:

The Bengfort Toolkit

Mon, 07 Mar 2016 13:32:14 +0000

Programming life has finally caused me to give into something that I’ve resisted for a while: the creation of a Bengfort Toolkit and specifically a benlib. This post is mostly a reminder that this toolkit now exists and that I spent valuable time creating it against my better judgement. And as a result, I should probably use it and update it.

I’ve already written (whoops, I almost said “you’ve already read” but I know no one reads this) posts about tools that I use frequently including [clock.py]({% post_url 2016-01-12-codetime-and-clock %}) and [requires]({% post_url 2016-01-21-freezing-requirements %}). These things have been simply Python scripts that I’ve put in ~/bin, which is part of my $PATH. These are too small or simple to require full blown repositories and PyPI listings on their own merit. Plus, I honestly believe that I’m the only one that uses them.

Anonymizing User Profile Data with Faker

Thu, 25 Feb 2016 12:32:54 +0000

This post is an early draft of expanded work that will eventually appear on the District Data Labs Blog. Your feedback is welcome, and you can submit your comments on the draft GitHub issue.

In order to learn (or teach) data science you need data (surprise!). The best libraries often come with a toy dataset to show examples and how the code works. However, nothing can replace an actual, non-trivial dataset for a tutorial or lesson because it provides for deep and meaningful further exploration. Non-trivial datasets can provide surprise and intuition in a way that toy datasets just cannot. Unfortunately, non-trivial datasets can be hard to find for a few reasons, but one common reason is that the dataset contains personally identifying information (PII).

Implementing the Observer Pattern with an Event System

Tue, 16 Feb 2016 07:24:04 +0000

I was looking back through some old code (hoping to find a quick post before I got back to work) when I ran across a project I worked on called Mortar. Mortar was a simple daemon that ran in the background and watched a particular directory. When a file was added or removed from that directory, Mortar would notify other services or perform some other task (e.g. if it was integrated into a library). At the time, we used Mortar to keep an eye on FTP directories, and when a file was uploaded Mortar would move it to a staging directory based on who uploaded it, then do some work on the file.

Running on Schedule

Wed, 10 Feb 2016 09:50:33 +0000

Automation with Python is a lovely thing, particularly for very repetitive or long running tasks; but unfortunately someone still has to press the button to make it go. It feels like there should be an easy way to set up a program such that it runs routinely, in the background, without much human intervention. Daemonized services are the route to go in server land; but how do you routinely schedule a process to run on your local computer, which may or may not be turned off1? Moreover, long running daemon processes seem expensive when you just want a quick job to execute routinely.

Iterators and Generators

Fri, 05 Feb 2016 22:47:15 +0000

This post is an attempt to explain what iterators and generators are in Python, defend the yield statement, and reveal why a library like SimPy is possible. But first some terminology (that specifically targets my friends who Java). Iteration is a syntactic construct that implements a loop over an iterable object. The for statement provides iteration, the while statement may provide iteration. An iterable object is something that implements the iteration protocol (Java folks, read interface). A generator is a function that produces a sequence of results instead of a single value and is designed to make writing iterable objects easier.

On Interval Calls with Threading

Tue, 02 Feb 2016 20:43:07 +0000

Event driven programming can be a wonderful thing, particularly when the execution of your code is dependent on user input. It is for this reason that JavaScript and other user facing languages implement very strong event based semantics. Many times event driven semantics depends on elapsed time (e.g. wait then execute). Python, however, does not provide a native setTimeout or setInterval that will allow you to call a function after a specific amount of time, or to call a function again and again at a specific interval.

Timeline Visualization with Matplotlib

Thu, 28 Jan 2016 22:24:47 +0000

Several times it’s come up that I’ve needed to visualize a time sequence for a collection of events across multiple sources. Unlike a normal time series, events don’t necessarily have a magnitude, e.g. a stock market series is a graph with a time and a price. Events simply have times, and possibly types.

A one dimensional number line is still interesting in this case, because the frequency or density of events reveal patterns that might not easily be analyzed with non-visual methods. Moreover, if you have multiple sources, overlaying a timeline on each can show which is busier, when and possibly also demonstrate some effect or causality.

Building a Console Utility with Commis

Sat, 23 Jan 2016 08:46:18 +0000

Applications like Git or Django’s management utility provide a rich interaction between a software library and their users by exposing many subcommands from a single root command. This style of what is essentially better argument parsing simplifies the user experience by only forcing them to remember one primary command, and allows the exploration of the utility hierarchy by using --help and other visibility mechanisms. Moreover, it allows the utility writer to decouple different commands or actions from each other.

Freezing Package Requirements

Thu, 21 Jan 2016 10:23:06 +0000

I have a minor issue with freezing requirements, and so I put together a very complex solution. One that is documented here. Not 100% sure why this week is all about packaging, but there you go.

First up, what is a requirement file? Basically they are a list of items that can be installed with pip using the following command:

$ pip install -r requirements.txt

The file therefore mostly serves as a list of arguments to the pip install command. The requirements file itself has a very specific format and can be created by hand, but generally the pip freeze command is used to dump out the requirements as follows:

Packaging Python Libraries with PyPI

Wed, 20 Jan 2016 15:33:06 +0000

Package deployment is something that is so completely necessary, but such a pain in the butt that I avoid it a little bit. However to reuse code in Python and to do awesome things like pip install mycode, you need to package it up and stick it on to PyPI (pronounced /pīˈpēˈī/ according to one site I read, though I still prefer /pīˈpī/). This process should be easy, but it’s detail oriented and there are only two good walk throughs (see links below).

Better JSON Encoding

Tue, 19 Jan 2016 14:26:27 +0000

The topic of the day is a simple one: JSON serialization. Here is my question, if you have a data structure like this:

import json
import datetime

data = {
    "now": datetime.datetime.now(),
    "range": xrange(42),
}

Why can’t you do something as simple as: print json.dumps(data)? These are simple Python datetypes from the standard library. Granted serializing a datetime might have some complications, but JSON does have a datetime specification. Moreover, a generator is just an iterable, which can be put into memory as a list, which is exactly the kind of thing that JSON likes to serialize. It feels like this should just work. Luckily, there is a solution to the problem as shown in the Gist below:

Simple SQL Query Wrapper

Mon, 18 Jan 2016 10:52:00 +0000

Programming with databases is a fact of life for any seasoned programmer (read, “worth their salt”). From embedded databases like SQLite and LevelDB to server databases like PostgreSQL, data management is a fundamental part of any significant project. The first thing I should say here is skip the ORM and learn SQL. SQL is such a powerful tool to query and manage a database, and is far more performant thanks to 40 years of research and development.

The codetime and clock Commands

Tue, 12 Jan 2016 17:02:51 +0000

If you’ve pair programmed with me, you might have seen me type something to the following effect on my terminal, particularly if I have just created a new file:

$ codetime

Then somehow I can magically paste a formatted timestamp into the file! Well it’s not a mystery, in fact, it’s just a simple alias:

alias codetime="clock.py code | pbcopy"

Oh, well that’s easy — why the blog post? Hey, what’s clock.py? A great question! This Python script is the dumbest thing that I have ever written, that has become the most useful tool that I use on a daily basis. Whenever there is a dumb to useful ratio like that, it’s blogging time. Here is clock.py:

Wrapping the Logging Module

Mon, 11 Jan 2016 08:15:05 +0000

The standard library logging module is excellent. It is also quite tedious if you want to use it in a production system. In particular you have to figure out the following:

configuration of the formatters, handlers, and loggers
object management throughout the script (e.g. the logging.getLogger function)
adding extra context to log messages for more complex formatters
handling and logging warnings (and to a lesser extent, exceptions)

The logging module actually does all of these things. The problem is that it doesn’t do them all at once for you, or with one single API. Therefore we typically go the route that we want to wrap the logging module so that we can provide extra context on demand, as well as handle warnings with ease. Moreover, once we have a wrapped logger, we can do fun things like create mixins to put together classes that have loggers inside of them.

Simple CLI Script with Argparse

Sun, 10 Jan 2016 14:48:14 +0000

Let’s face it, most of the Python programs we write are going to be used from the command line. There are tons of command line interface helper libraries out there. My preferred CLI method is the style of Django’s management utility. More on this later, when we hopefully publish a library that gives us that out of the box (we use it in many of our projects already).

Sometimes though, you just want a simple CLI script. These days we use the standard library argparse module to parse commands off the command line. Here is my basic script that I use for most of my projects:

Basic Python Project Files

Sat, 09 Jan 2016 14:01:59 +0000

I don’t use project templates like cookiecutter. I’m sure they’re fine, but when I start a new project I like to get a cup of coffee, go to my zen place and manually create the workspace. It gets me in the right place to code. Here’s the thing, there is a right way to set up a Python project. Plus, I have a particular style for my repositories — particularly how I use Creative Commons Flickr photos as the header for my README files.

Frequently Copied and Pasted

Fri, 08 Jan 2016 23:14:57 +0000

I have a bit of catch up to do — and I think that this notepad and development journal is the perfect resource to do it. You see, I am constantly copy and pasting code from other projects into the current project that I’m working on. Usually this takes the form of a problem that I had solved previously that has a similar domain to a new problem, but requires a slight amount of tweaking. Other times I am just doing the same task over and over again.

One Big Gift Selection Algorithm

Fri, 25 Dec 2015 11:54:12 +0000

My family does “one big gift” every Christmas; that is instead of everyone simply buying everyone else a smaller gift; every person is assigned to one other person to give them a single large gift. Selection of who gives what to who is a place of some (minor) conflict. Therefore we simply use a random algorithm. Unfortunately, apparently a uniform random sample of pairs is not enough, therefore we take 100 samples to vote for each combination to see who gets what as follows:

Natural Language Processing and Hadoop

Sat, 02 Nov 2013 12:00:00 -0500

Natural Language Processing and Hadoop

Description

Benjamin Bengfort and Sean Murphy discuss how NLP can be integrated with Hadoop to gain insights in big data.

Unbound Concepts: Columbia TechBreakfast Dec. 2012

Wed, 12 Dec 2012 12:00:00 -0500

Description

Presentation by the CTO of Unbound Concepts, Benjamin Bengfort, at the Columbia TechBreakfast 2012.

About

Mon, 01 Jan 0001 00:00:00 +0000

about