[{"content":"Implementing Function Calling LLMs without Fear is a talk that I gave at a C4AI/RealmOne Happy Hour Tech Meetup in Columbia, Maryland. The slides of the talk are below:\nAbstract Description: For an AI system to be an agent rather than a simple chatbot, it needs to be able to do work on behalf of its users, often accomplished through the use of Function Calling LLMs. Instruction-based models can identify external functions to call for additional input or context before creating a final response without the need for any additional training. However, giving an AI system access to databases, APIs, or even tools like our calendars is fraught with security concerns and task validation nightmares. In this talk, we\u0026rsquo;ll discuss the basics of how Function Calling works and think through the best practices and techniques to ensure that your agents work for you, not against you!\n","permalink":"https://bbengfort.github.io/2025/04/function-calling-without-fear/","summary":"\u003cp\u003e\u003ca href=\"https://info.umbctraining.com/effective-ai-systems\"\u003eImplementing Function Calling LLMs without Fear\u003c/a\u003e is a talk that I gave at a C4AI/RealmOne Happy Hour Tech Meetup in Columbia, Maryland. The slides of the talk are below:\u003c/p\u003e\n\u003ciframe\n  style=\"width: 100%; height: 500px;\" frameborder=\"0\" marginwidth=\"0\" marginheight=\"0\" scrolling=\"no\"\n  src=\"https://www.slideshare.net/slideshow/embed_code/278029409?rel=0\" allowfullscreen webkitallowfullscreen mozallowfullscreen\u003e \u003c/iframe\u003e\n\u003cbr\u003e\u003cbr\u003e\n\u003ch3 id=\"abstract\"\u003eAbstract\u003c/h3\u003e\n\u003cp\u003eDescription: For an AI system to be an agent rather than a simple chatbot, it needs to be able to do work on behalf of its users, often accomplished through the use of Function Calling LLMs. 
Instruction-based models can identify external functions to call for additional input or context before creating a final response without the need for any additional training. However, giving an AI system access to databases, APIs, or even tools like our calendars is fraught with security concerns and task validation nightmares. In this talk, we\u0026rsquo;ll discuss the basics of how Function Calling works and think through the best practices and techniques to ensure that your agents work for you, not against you!\u003c/p\u003e","title":"Implementing Function Calling LLMs without Fear"},{"content":"Privacy and Security in the Age of Generative AI is a talk that I gave at ODSC West 2024 in Burlingame, California. The slides of the talk are below:\nAn updated presentation that I gave at C4AI on April 15, 2025 in Columbia, Maryland is below:\nAbstract From sensitive data leakage to prompt injection and zero-click worms, LLMs and generative models are the new cyber battleground for hackers. As more AI models are deployed in production, data scientists and ML engineers can\u0026rsquo;t ignore these problems. The good news is that we can influence privacy and security in the machine learning lifecycle using data specific techniques. In this talk, we\u0026rsquo;ll review some of the newest security concerns affecting LLMs and deep learning models and learn how to embed privacy into model training with ACLs and differential privacy, secure text generation and function-calling interfaces, and even leverage models to defend other models.\nSession Outline: Security Concerns in Generative AI (6 minutes) Data Access Controls for MLOps (6 minutes) Building Privacy into Models (4 minutes) LLM Model Evaluation (4 minutes) Security Context for TGI and Function Calling (6 minutes) Can Models Secure Models? (4 minutes) Learning Objectives: Generative AI is an amazing interface for human users to better access the huge amounts of data with natural language queries. 
However, as AI becomes more important in automating repetitive, inconsistent, and boring tasks it has also become a target for hackers and malicious actors.\nImage modification attacks such as modifying pixels in an image or adding stickers to signs can dramatically influence the output of computer vision systems and classifiers: usually to cause harmful actions to occur (e.g. to cause a vehicle to change lanes or for a fake driver\u0026rsquo;s license to be recognized). Prompt injection attacks are used to manipulate LLMs into leaking sensitive data or spreading misinformation. Researchers have even recently shown that AI worms are possible that target generative AI systems through adversarial self-replicating prompts.\nAs data scientists, we already have a lot of concerns from model quality and generalization to bias and fairness in our outputs; do we really need to take on security and privacy also? Data scientists and machine learning engineers use data to build data products for our users. Generative AI attacks are based on data, and therefore it is squarely in our purview as data scientists to ensure that we create high quality data pipelines and models that preserve the security and privacy of our users and organizations when used in combination with application security techniques.\nIn this talk we will explore data-driven techniques for privacy and security that will augment the security best practices of the application and product teams we belong to.\nWe know that the quality of a model is based on the data, so too is the security of the model. We\u0026rsquo;ll explore the use of data access controls to influence model training and inferencing. We\u0026rsquo;ll also look at algorithmic approaches such as differential privacy to prevent model leakage. 
Finally we\u0026rsquo;ll explore how to combine security context and awareness in text generation inferences and function calling LLMs.\nAt the end of the talk we\u0026rsquo;ll touch on an open question for security researchers: can AI models be used to enhance the security of other models and more rapidly detect emergent threats?\n","permalink":"https://bbengfort.github.io/2024/10/privacy-security-generative-ai/","summary":"\u003cp\u003e\u003ca href=\"https://odsc.com/speakers/privacy-and-security-in-the-age-of-generative-ai/\"\u003ePrivacy and Security in the Age of Generative AI\u003c/a\u003e is a talk that I gave at ODSC West 2024 in Burlingame, California. The slides of the talk are below:\u003c/p\u003e\n\u003ciframe\n  style=\"width: 100%; height: 500px;\" frameborder=\"0\" marginwidth=\"0\" marginheight=\"0\" scrolling=\"no\"\n  src=\"https://www.slideshare.net/slideshow/embed_code/272867721?rel=0\" allowfullscreen webkitallowfullscreen mozallowfullscreen\u003e \u003c/iframe\u003e\n\u003cbr\u003e\u003cbr\u003e\n\u003cp\u003eAn updated presentation that I gave at C4AI on April 15, 2025 in Columbia, Maryland is below:\u003c/p\u003e\n\u003ciframe\n  style=\"width: 100%; height: 500px;\" frameborder=\"0\" marginwidth=\"0\" marginheight=\"0\" scrolling=\"no\"\n  src=\"https://www.slideshare.net/slideshow/embed_code/278029566?rel=0\" allowfullscreen webkitallowfullscreen mozallowfullscreen\u003e \u003c/iframe\u003e\n\u003cbr\u003e\u003cbr\u003e\n\u003ch3 id=\"abstract\"\u003eAbstract\u003c/h3\u003e\n\u003cp\u003eFrom sensitive data leakage to prompt injection and zero-click worms, LLMs and generative models are the new cyber battleground for hackers. As more AI models are deployed in production, data scientists and ML engineers can\u0026rsquo;t ignore these problems. The good news is that we can influence privacy and security in the machine learning lifecycle using data specific techniques. 
In this talk, we\u0026rsquo;ll review some of the newest security concerns affecting LLMs and deep learning models and learn how to embed privacy into model training with ACLs and differential privacy, secure text generation and function-calling interfaces, and even leverage models to defend other models.\u003c/p\u003e","title":"Privacy and Security in the Age of Generative AI"},{"content":"Smart Global Replication using Reinforcement Learning is a talk that I gave at KubeCon + CloudNative North America 2023 in Chicago, IL. The video of the talk is below:\nDescription There are many great reasons to replicate data across Kubernetes clusters in different geographic regions: e.g. for disaster recovery and to ensure the best possible user experiences. Unfortunately, global replication is not easy; not just because of the difficulty in consistency reasoning that it introduces, but also due to the increased cost of provisioning multiple volumes that exponentially duplicate ingress and egress. Wouldn\u0026rsquo;t it be great if our systems could learn the optimal placement of storage blocks so that total replication was not necessary? Wouldn\u0026rsquo;t it be even better if our replication messaging was reduced ensuring communication only between the minimally necessary set of storage nodes? We show a system that uses multi-armed bandits to perform such an optimization; dynamically adjusting how data is replicated based on usage. We demonstrate the savings achieved and system performance using a real world system: the TRISA Global Travel Rule Compliance Directory.\n","permalink":"https://bbengfort.github.io/2023/11/smart-global-replication-using-reinforcement-learning/","summary":"\u003cp\u003e\u003ca href=\"https://kccncna2023.sched.com/event/21d4640540a0961019d201de8ec2fd5e\"\u003eSmart Global Replication using Reinforcement Learning\u003c/a\u003e is a talk that I gave at KubeCon + CloudNative North America 2023 in Chicago, IL. 
The video of the talk is below:\u003c/p\u003e\n\n\n    \n    \u003cdiv style=\"position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;\"\u003e\n      \u003ciframe allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" allowfullscreen=\"allowfullscreen\" loading=\"eager\" referrerpolicy=\"strict-origin-when-cross-origin\" src=\"https://www.youtube.com/embed/YTF2dXNhFzI?autoplay=0\u0026controls=1\u0026end=0\u0026loop=0\u0026mute=0\u0026start=0\" style=\"position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;\" title=\"YouTube video\"\n      \u003e\u003c/iframe\u003e\n    \u003c/div\u003e\n\n\u003ch3 id=\"description\"\u003eDescription\u003c/h3\u003e\n\u003cp\u003eThere are many great reasons to replicate data across Kubernetes clusters in different geographic regions: e.g. for disaster recovery and to ensure the best possible user experiences. Unfortunately, global replication is not easy; not just because of the difficulty in consistency reasoning that it introduces, but also due to the increased cost of provisioning multiple volumes that exponentially duplicate ingress and egress. Wouldn\u0026rsquo;t it be great if our systems could learn the optimal placement of storage blocks so that total replication was not necessary? Wouldn\u0026rsquo;t it be even better if our replication messaging was reduced ensuring communication only between the minimally necessary set of storage nodes? We show a system that uses multi-armed bandits to perform such an optimization; dynamically adjusting how data is replicated based on usage. We demonstrate the savings achieved and system performance using a real world system: the TRISA Global Travel Rule Compliance Directory.\u003c/p\u003e","title":"Smart Global Replication Using Reinforcement Learning"},{"content":"DIY Consensus: Crafting Your Own Distributed Code (with Benjamin Bengfort)\nDescription How do distributed systems work? 
If you\u0026rsquo;ve got a database spread over three servers, how do they elect a leader? How does that change when we spread those machines out across data centers, situated around the globe? Do we even need to understand how it works, or can we relegate those problems to an off the shelf tool like Zookeeper? Joining me this week is Distributed Systems Doctor—Benjamin Bengfort—for a deep dive into consensus algorithms. We start off by discussing how much of \u0026ldquo;the clustering problem\u0026rdquo; is your problem, and how much can be handled by a library. We go through many of the constraints and tradeoffs that you need to understand either way. And we eventually reach Benjamin\u0026rsquo;s surprising message - maybe the time is ripe to roll your own. Should we be writing our own bespoke Raft implementations? And if so, how hard would that be? What guidance can he offer us? Somewhere in the recording of this episode, I decided I want to sit down and try to implement a leader election protocol. Maybe you will too. And if not, you\u0026rsquo;ll at least have a better appreciation for what it takes. Distributed systems used to be rocket science, but they\u0026rsquo;re becoming deployment as usual. 
This episode should help us all to keep up!\n","permalink":"https://bbengfort.github.io/2023/08/diy-consensus-crafting-your-own-distributed-code/","summary":"\u003cp\u003e\u003ca href=\"https://pod.link/developer-voices/episode/938292a4e4b2c1ca82a66d4674dd8d97\"\u003eDIY Consensus: Crafting Your Own Distributed Code (with Benjamin Bengfort)\u003c/a\u003e\u003c/p\u003e\n\n\n    \n    \u003cdiv style=\"position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;\"\u003e\n      \u003ciframe allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" allowfullscreen=\"allowfullscreen\" loading=\"eager\" referrerpolicy=\"strict-origin-when-cross-origin\" src=\"https://www.youtube.com/embed/Ij_PBvocf5c?autoplay=0\u0026controls=1\u0026end=0\u0026loop=0\u0026mute=0\u0026start=0\" style=\"position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;\" title=\"YouTube video\"\n      \u003e\u003c/iframe\u003e\n    \u003c/div\u003e\n\n\u003ch3 id=\"description\"\u003eDescription\u003c/h3\u003e\n\u003cp\u003eHow do distributed systems work? If you\u0026rsquo;ve got a database spread over three servers, how do they elect a leader? How does that change when we spread those machines out across data centers, situated around the globe? Do we even need to understand how it works, or can we relegate those problems to an off the shelf tool like Zookeeper? Joining me this week is Distributed Systems Doctor—Benjamin Bengfort—for a deep dive into consensus algorithms. We start off by discussing how much of \u0026ldquo;the clustering problem\u0026rdquo; is your problem, and how much can be handled by a library. We go through many of the constraints and tradeoffs that you need to understand either way. And we eventually reach Benjamin\u0026rsquo;s surprising message - maybe the time is ripe to roll your own. Should we be writing our own bespoke Raft implementations? And if so, how hard would that be? 
What guidance can he offer us?  Somewhere in the recording of this episode, I decided I want to sit down and try to implement a leader election protocol. Maybe you will too. And if not, you\u0026rsquo;ll at least have a better appreciation for what it takes. Distributed systems used to be rocket science, but they\u0026rsquo;re becoming deployment as usual. This episode should help us all to keep up!\u003c/p\u003e","title":"DIY Consensus: Crafting Your Own Distributed Code (with Benjamin Bengfort)"},{"content":"Performance is key when building streaming gRPC services. When you\u0026rsquo;re trying to maximize throughput (e.g. messages per second) benchmarking is essential to understanding where the bottlenecks in your application are.\nHowever, as a start, you can pretty much guarantee that one bottleneck is going to be the serialization (marshaling) and deserialization (unmarshaling) of protocol buffer messages.\nWe have a use case where the server does not need all of the information in the message in order to process the message. E.g. we have header information such as IDs and client information that the server does need to update as part of processing. The other part of the message is data that needs to be saved to disk and does not have to be unmarshaled until it\u0026rsquo;s read. However, our protocol buffer schema right now is \u0026ldquo;flat\u0026rdquo; — meaning that all fields whether they are required for processing or not are defined by a single protocol buffer message.\nSo we thought - could we break the flat protocol buffer message into two parts with one part wrapping the other? E.g. the outer message contains just the information the server needs for processing and the inner message remains marshalled until it is needed? Would this increase the performance of marshaling and unmarshaling?\nThe answer surprised me — yes, sort of. In the figure below, smaller throughput is better (e.g. 
it is faster):\nThe event size in this case is the size of the inner message which is just bytes. When the event size is \u0026ldquo;small\u0026rdquo; (less than 16KiB) then having a wrapped message outperforms the flat message for both marshaling and unmarshaling. However, as the inner message gets larger, serialization of the flat message gets faster.\nMy hypothesis for this is that serializing a fixed-size outer wrapper is a constant cost; but the serializer does still have to read the entire data field into memory. At some point the time that reading the data field into memory takes starts to outweigh the benefits of the wrapper object.\nAlso, don\u0026rsquo;t get me wrong; overall it will take longer to serialize the entire message. The client will have to first serialize the inner message, then the outer message, which will take longer on the client side; and when reading, the inner message will have to be deserialized again. However, having the client do the work does increase the throughput of the server, so it\u0026rsquo;s worth it to us.\nAlso, in the case of unmarshaling when you have nested types, the number of allocs falls by the number of nested types in the inner struct - another bonus!\nThe full code for this benchmark can be found here:\nprotocol buffers benchmarks More detailed results:\nTiny/FlatMarshal-10 1945454\t607.1 ns/op\t1536 B/op\t1 allocs/op Tiny/WrappedMarshal-10 2869138\t418.6 ns/op\t1536 B/op\t1 allocs/op XSmall/FlatMarshal-10 1202126\t919.8 ns/op\t4864 B/op\t1 allocs/op XSmall/WrappedMarshal-10 1773110\t797.0 ns/op\t4864 B/op\t1 allocs/op Small/FlatMarshal-10 1000000\t1217 ns/op\t9472 B/op\t1 allocs/op Small/WrappedMarshal-10 1000000\t1076 ns/op\t9472 B/op\t1 allocs/op Medium/FlatMarshal-10 331611\t3792 ns/op\t40960 B/op\t1 allocs/op Medium/WrappedMarshal-10 335064\t3512 ns/op\t40960 B/op\t1 allocs/op Large/FlatMarshal-10 18306\t64584 ns/op\t1474560 B/op\t1 allocs/op Large/WrappedMarshal-10 16376\t69056 ns/op\t1474561 B/op\t1 
allocs/op XLarge/FlatMarshal-10 7090\t167709 ns/op\t5251074 B/op\t1 allocs/op XLarge/WrappedMarshal-10 6192\t174831 ns/op\t5251075 B/op\t1 allocs/op Tiny/FlatUnmarshal-10 1000000\t1102 ns/op\t2280 B/op\t24 allocs/op Tiny/WrappedUnmarshal-10 1783885\t691.9 ns/op\t2008 B/op\t14 allocs/op XSmall/FlatUnmarshal-10 794436\t1528 ns/op\t5352 B/op\t24 allocs/op XSmall/WrappedUnmarshal-10 1256288\t942.3 ns/op\t5464 B/op\t14 allocs/op Small/FlatUnmarshal-10 631819\t1936 ns/op\t9448 B/op\t24 allocs/op Small/WrappedUnmarshal-10 986050\t1243 ns/op\t10072 B/op\t14 allocs/op Medium/FlatUnmarshal-10 320212\t3653 ns/op\t34024 B/op\t24 allocs/op Medium/WrappedUnmarshal-10 322702\t3714 ns/op\t41560 B/op\t14 allocs/op Large/FlatUnmarshal-10 21088\t52541 ns/op\t1475816 B/op\t24 allocs/op Large/WrappedUnmarshal-10 19327\t56723 ns/op\t1475160 B/op\t14 allocs/op XLarge/FlatUnmarshal-10 8012\t131589 ns/op\t5252329 B/op\t24 allocs/op XLarge/WrappedUnmarshal-10 8575\t146534 ns/op\t5251674 B/op\t14 allocs/op ","permalink":"https://bbengfort.github.io/2023/05/faster-protocol-buffer-serialization/","summary":"\u003cp\u003ePerformance is key when building streaming gRPC services. When you\u0026rsquo;re trying to maximize throughput (e.g. messages per second) benchmarking is essential to understanding where the bottlenecks in your application are.\u003c/p\u003e\n\u003cp\u003eHowever, as a start, you can pretty much guarantee that one bottleneck is going to be the serialization (marshaling) and deserialization (unmarshaling) of protocol buffer messages.\u003c/p\u003e\n\u003cp\u003eWe have a use case where the server does not need all of the information in the message in order to process the message. E.g. we have header information such as IDs and client information that the server does need to update as part of processing. The other part of the message is data that needs to be saved to disk and does not have to be unmarshaled until it\u0026rsquo;s read. 
However, our protocol buffer schema right now is \u0026ldquo;flat\u0026rdquo; — meaning that all fields whether they are required for processing or not are defined by a single protocol buffer message.\u003c/p\u003e","title":"Faster Protocol Buffer Serialization"},{"content":"When implementing Go code, I find myself chasing increased concurrency performance by trying to reduce the number of locks in my code. Often I wonder if using the sync/atomic package is a better choice because I know (as proved by this blog post) that atomics have far more performance than mutexes. The issue is that reading on the internet, including the package documentation itself strongly recommends relying on channels, then mutexes, and finally atomics only if you know what you\u0026rsquo;re doing.\nThe primary difference is that the sync/atomic package uses low level atomic memory primitives provided directly by CPU instructions but without any ordering guarantees. Channels and mutexes guarantee the strict order of accesses to values being shared by go routines, and since these semantics are what we expect, it is often the better choice to use mutexes and channels. 
However, if you\u0026rsquo;re just trying to ensure that a single operation happens correctly in isolation (such as tracking statistics), or if you\u0026rsquo;re building concurrency primitives from scratch for advanced algorithms, then using atomics makes sense.\nAnd here\u0026rsquo;s why it makes sense:\nBenchmarkCounterInc/Atomic-10 170998743\t6.881 ns/op\t1162.54 MB/s\t0 B/op\t0 allocs/op BenchmarkCounterInc/Mutex-10 65349984\t18.50 ns/op\t432.34 MB/s\t0 B/op\t0 allocs/op BenchmarkCounterLoad/Atomic-10 1000000000\t0.5131 ns/op\t15590.98 MB/s\t0 B/op\t0 allocs/op BenchmarkCounterLoad/Mutex-10 87413383\t13.72 ns/op\t583.05 MB/s\t0 B/op\t0 allocs/op On my MacBook Pro, using atomics to keep track of a counter is 3x faster for writes and 26x faster for reads.\nSources Atomic Package Documentation StackOverflow: Is there a difference in Go between a counter using atomic operations and one using a mutex? Complete Code The complete code and benchmark results on gist can be found below:\n","permalink":"https://bbengfort.github.io/2022/11/atomic-vs-mutex/","summary":"\u003cp\u003eWhen implementing Go code, I find myself chasing increased concurrency performance by trying to reduce the number of locks in my code. Often I wonder if using the \u003ccode\u003esync/atomic\u003c/code\u003e package is a better choice because I know (as proved by this blog post) that atomics have far more performance than mutexes. The issue is that reading on the internet, including the \u003ca href=\"https://pkg.go.dev/sync/atomic\"\u003epackage documentation\u003c/a\u003e itself strongly recommends relying on channels, then mutexes, and finally atomics \u003cem\u003eonly if you know what you\u0026rsquo;re doing\u003c/em\u003e.\u003c/p\u003e","title":"Atomic vs Mutex"},{"content":"Good software development achieves complexity by describing the interactions between simpler components. 
Although we tend to think of software processes as step-by-step \u0026ldquo;wizards\u0026rdquo;, design and decoupling of components often means that the interactions are non-linear. So why should our software project planning be defined in a linear progression of steps with time estimates? Can we plan projects using a non-linear workflow that mirrors how we think about component design?\nThe figure above is an experiment in task planning that I recently used to try to describe the complex dependencies between different tasks in a project. On the left is a 2-level hierarchy of tasks that represent decomposition in the planning process - first major tasks are described, then subtasks for each major task can be planned out. Each task is assigned a level of work estimate - in this case using Fibonacci numbers, though I think t-shirt sizes would also work well. So far, so normal - the novelty comes from the Gantt-esque chart on the right.\nThe temporal chart shows blocks of time that correspond to one unit of estimated work. If a task is estimated at 3 points, then it takes three blocks in the temporal chart. Unlike Gantt charts, however, the blocks of time do not represent clock or calendar time, but instead represent \u0026ldquo;happens before\u0026rdquo; relationships. E.g. in the green section, \u0026ldquo;Create schema for PostgreSQL database\u0026rdquo; and \u0026ldquo;Create manifests and deploy DB to staging\u0026rdquo; must be completed before \u0026ldquo;Add test data to the db (unblocks service implementation)\u0026rdquo;. Tasks that are coupled or related are also shown by overlapping time, e.g. 
\u0026ldquo;Create schema\u0026rdquo; and \u0026ldquo;Create manifests\u0026rdquo; are overlapping tasks - there is work that can be done in parallel between the schema and the manifests, but at some point the manifests need the schema in order to be complete.\nBy describing the relationships between tasks in this way, dependencies between tasks become quickly apparent without specifying a DAG or other relationship diagram. Tasks that can be completed in parallel allow team members to collaborate effectively, while overlapping tasks need the more focused effort of an individual contributor. Although this temporal chart seems to describe a waterfall of development effort - I believe that its flexibility lends itself to agile development workflows since these dependencies can make sprint planning simpler.\nConsider that in an agile software development workflow, requirements and tasks are added to a backlog, then during sprint planning, those tasks are prioritized and added to a discrete sprint. Dependencies between tasks are often described by \u0026ldquo;blockers\u0026rdquo;. This process works well for more mature projects where each task tends to be self-contained. However, I\u0026rsquo;ve observed two issues with this process: first, team-members tend to focus on their own projects and second, it\u0026rsquo;s very difficult to plan projects beyond a single sprint, particularly right at the beginning of a project. Instead, I envision more focused sprint planning where the only tasks that are considered are the tasks that are available at the point in the temporal flow chart at the time of the sprint, and as many tasks as there are points available in the sprint are collected into it.\nThere are some issues: dependencies between subtasks are easy to describe, but dependencies across subtasks in different major tasks are more difficult. 
To show this I\u0026rsquo;ve added \u0026ldquo;x\u0026rdquo; marks to dependent subtasks across major tasks, but it would be better if the relationships across major tasks could be trusted. Also, if used in sprint planning, this temporal flow chart would have to be dynamic; showing the actual vs predicted relationships and able to add new major tasks and subtasks as the project continued.\nI think it would be interesting to prototype a planning tool that allowed you to easily add major tasks and subtasks with time estimates. Then by simply clicking on a block the time would be filled in. The major task would automatically be updated with the full estimate and the time window it\u0026rsquo;s in. Groups of tasks could be moved left or right together. This level of interactivity would make planning much more effective!\n","permalink":"https://bbengfort.github.io/2021/03/nonlinear-workflow-planning-software-projects/","summary":"\u003cp\u003eGood software development achieves complexity by describing the interactions between simpler components. Although we tend to think of software processes as step-by-step \u0026ldquo;wizards\u0026rdquo;, design and decoupling of components often means that the interactions are non-linear. So why should our software project planning be defined in a linear progression of steps with time estimates? Can we plan projects using a non-linear workflow that mirrors how we think about component design?\u003c/p\u003e\n\u003cp\u003e\u003cimg loading=\"lazy\" src=\"/images/2021-03-14-blocks-dependencies.png\" alt=\"Blocks and Dependencies\"  /\u003e\n\u003c/p\u003e","title":"Nonlinear Workflow for Planning Software Projects"},{"content":"Strict typing in the Go programming language provides safety and performance that is valuable even if it does increase the verbosity of code. 
If there is a drawback to be found with strict typing, it is usually felt by library developers who require flexibility to cover different use cases, and most often appears as a suite of type-named functions such as lib.HandleString, lib.HandleUint64, lib.HandleBool and so on. Go does provide two important language tools that do provide a lot of flexibility in library development: closures and interfaces, which we will explore in this post.\nClosures Closures may be one of the most misunderstood concepts in computer science because the term is rooted in language-specific constructs such as scope, stack vs heap allocation, and anonymous functions. Perhaps the most precise definition of a closure is:\nThe combination of a function with an environment where the environment is a mapping of all free variables available to the function to the value or reference of those variables at the time the closure was created.\nSaid a bit more simply, closures can be thought of as a packaging of specific state with a specific function in an isolated (closed) bundle. If this sounds similar to object-oriented programming, then you\u0026rsquo;d be right - in fact, programming languages that allow closures typically implement them with a special data structure analogous to an object or by implementing function objects. This has implications for stack vs heap allocation and defines what kinds of programming languages could support closures (not all can). Note that languages that natively support closures also tend to use garbage collection.\nSo why would you use a closure instead of an object? 
One very common answer is simplicity - compare the following code that implements a closure with code that implements a struct with a method:\npackage main import \u0026#34;fmt\u0026#34; func Counter() func () int { var i int return func() int { i++ return i } } func main() { counter := Counter() fmt.Println(counter()) fmt.Println(counter()) other := Counter() fmt.Println(other()) fmt.Println(counter()) } In the closure code, the Counter function returns the closure, the func() int, whose environment is the variable i. When the closure is called, it increments its state variable i and maintains that state across multiple calls. If a new closure, other, is created its state is separate from the state of the original counter.\npackage main import \u0026#34;fmt\u0026#34; type Counter struct { i int } func (c *Counter) Incr() int { c.i++ return c.i } func main() { counter := new(Counter) fmt.Println(counter.Incr()) fmt.Println(counter.Incr()) other := new(Counter) fmt.Println(other.Incr()) fmt.Println(counter.Incr()) } The struct code defines a Counter struct with an internal int i as its state, then provides a method Incr to modify and return that state in a similar way to the closure code.\nThe simplicity argument is that the counter only uses a single definition instead of two (e.g. a definition for the closure function instead of one each for the struct and method definitions) and reduces the lexical complexity - e.g. removing the need for creating and initializing a new struct and the dotted method call. However, they are the same number of lines of code and in my opinion the struct code is more flexible. More importantly, the struct code is more performant:\nBenchmarkCounterStruct-8 1000000000\t0.278 ns/op BenchmarkCounterClosure-8 27671817\t39.6 ns/op These benchmarks test the allocation/creation of either the struct or the closure and one call to increment. 
The closure is 142x slower, likely because of the overhead of creating the additional closure structure and heap-allocating the environment that it captures.\nA better argument for the use of closures is to allow flexibility in code.\nUnlike a struct, the state of a closure need not be defined prior to its use since its state is all variables in the enclosing scope. This is why anonymous or lambda functions are inextricably tied with closures, since in Go an anonymous function is required to create a closure. Anonymous functions are not necessarily closures, however; they\u0026rsquo;re simply functions that are not bound to a name. Still, it\u0026rsquo;s tough to come up with a use case in Go where an anonymous function is not a closure.\nCapturing variables in the enclosing scope in the environment of the closure brings us to the primary use case for closures in Golang: kicking off go routines to do work. Consider the following example that creates and calls 11 closures as go routines:\npackage main import ( \u0026#34;fmt\u0026#34; \u0026#34;log\u0026#34; ) func main() { // Create the done and errc channels in the scope of the main function. These // channels are the primary local variables that we\u0026#39;re going to enclose in the // environment of all 11 closures created in this function. errc := make(chan error, 1) done := make(chan int, 1) // Instantiate 10 worker closures that use the errc and done channels. for i := 0; i \u0026lt; 10; i++ { // Note that this closure takes as input an integer j, and the for loop // variable, i, is passed to this function as an argument rather than the // closure accessing the enclosed i. More on this later. go func(j int) { if err := doWork(); err != nil { errc \u0026lt;- err } done \u0026lt;- j }(i) } // The 11th closure reads off of the done channel and marks progress, closing // the error channel when all work is complete to signal to the main routine // that the work is finished. 
go func() { for i := 0; i \u0026lt; 10; i++ { j := \u0026lt;-done fmt.Printf(\u0026#34;worker %d done in %d position\\n\u0026#34;, j, i+1) } // When done, close the error channel close(errc) }() // The main routine continues, blocking on the error channel to allow the // workers to do their thing rather than simply exiting. If it receives an // error, the program is terminated early. for err := range errc { if err != nil { log.Fatal(err) } } // When the error channel is closed by the 11th closure, we get to this point // and exit the program with a success message! fmt.Println(\u0026#34;done successfully!\u0026#34;) } In this snippet of code the calls to go func() {}() are the simultaneous definition of an anonymous function, the creation of a closure, and the execution of a go routine. In this context, the anonymous function is what is allowing the closure to be created. Consider the valid execution of a named function with a similar function body, go doWork(i), to kick off the go routine; in this case the compiler would error with undefined: errc or undefined: done since the named function is not a closure and the two channels would have to be passed explicitly to the function.\nThis also brings us to the major gotcha with closures - the state that they enclose is a shared state that is readable and writable by all routines. Note that all routines in the snippet above could easily send on or read from any of the channels in the enclosing state. This is precisely why it is so important to use channels or lower level atomics/mutexes for synchronization across go routines. One key place that is a common error for Go programmers is the for loop that creates the 10 workers. Although the closure has access to the variable i, as the loop changes, so does the variable. Therefore if it were to done \u0026lt;- i rather than j, then it\u0026rsquo;s likely only 9 would be printed to the console. 
By creating an argument j and passing i into the function, the current value of i is copied into the closure\u0026rsquo;s own parameter rather than shared with the outer scope.\nAnd finally, a last comment on closures in Go - they can be typed. One common pattern is to create a factory function that has an outer scope to do some work and returns a closure to do the handling. For example, the go routines from above could be refactored as follows:\n// Specify the type of the worker function, that is a function that takes an int type worker func(i int) // Worker factory creates a new worker function enclosing the outer scope that // contains the done and errc channels. Note that the directionality is now // specified, the worker can only send on the channel, not recv. func workerFactory(done chan\u0026lt;- int, errc chan\u0026lt;- error) (worker, error) { // Check to make sure the worker can be created, otherwise return error // Create the closure and return it return func(i int) { if err := doWork(); err != nil { errc \u0026lt;- err } done \u0026lt;- i }, nil } func main() { errc := make(chan error, 1) done := make(chan int, 1) // Instantiate 10 workers to do the work, first checking to ensure the worker // can be created by the factory. for i := 0; i \u0026lt; 10; i++ { worker, err := workerFactory(done, errc) if err != nil { log.Fatal(err) } go worker(i) } ... } Although it appears that there is no closure created here, in fact there is. Instead of enclosing the scope of the main function, it is the scope of workerFactory that is enclosed in the closure. This also gives us the opportunity to specify the directionality of the channels so that the worker can only send and not recv on the done and errc channels. This pattern is useful when some work needs to be done to set up the worker(s) (e.g. connect to a database) or check to make sure other constraints are satisfied (e.g. 
\u0026ldquo;only launch 100 workers\u0026rdquo;).\nClosures and function types are one of the two primary ways Go provides flexibility for library code; the other is interfaces.\nInterfaces Interfaces are types in programming that specify the behavior of an entity.\nI\u0026rsquo;ve chosen this definition carefully because interfaces are commonly used in object-oriented programming languages such as Java and may have flexible interpretations for duck-typed languages, e.g. \u0026ldquo;if it looks like a duck, it quacks like a duck, it walks like a duck - then it\u0026rsquo;s a duck\u0026rdquo;. In a strictly typed language like Go, an interface must be a type. More specifically in Go, an interface is \u0026ldquo;a named collection of method signatures\u0026rdquo;.\nThe most common interface that we use in Go is the error interface, which is defined as follows:\ntype error interface { Error() string } When we declare a function that returns an error, we are actually specifying that the function returns a value whose type implements the error interface, i.e. it has an Error method that returns a string. The two most common ways we create errors in Go are via fmt.Errorf and errors.New - both of these functions actually return a *errors.errorString type (use fmt.Printf(\u0026quot;%T\\n\u0026quot;, err) to print the type of an object or an error). We could just as easily do the following (which is also very common):\ntype Error struct { code uint32 msg string } func (e *Error) Error() string { return fmt.Sprintf(\u0026#34;[%d] %s\u0026#34;, e.code, e.msg) } func myfunc() error { return \u0026amp;Error{code: 32, msg: \u0026#34;it didn\u0026#39;t work\u0026#34;} } Interfaces therefore give us the opportunity to pass around different base types to different functions even in a strictly typed environment, because the function will know what method it can call using that type. 
One of the best examples in the Go standard library of the flexibility of types is the io library and the io.Reader and io.Writer interfaces:\ntype Reader interface { Read(p []byte) (n int, err error) } type Writer interface { Write(p []byte) (n int, err error) } These interfaces mean that you can create a function that accepts either an *os.File or a *bytes.Buffer or a socket connection, etc. and read or write to it using the same interface. You can also compose interfaces into a single interface, e.g.\ntype ReadWriteCloser interface { io.Reader io.Writer io.Closer } This interface requires objects to have all of the Read([]byte) (int, error), Write([]byte) (int, error), and Close() error methods. In fact, the io.ReadWriteCloser interface is such a composed interface in the standard library. The best practice for flexibility with interfaces is to define the minimum number of methods required for a single interface, and to build them up into larger interfaces as needed (without letting it get out of hand). For example:\ntype Worker interface { Handle(Task) error Name() string } type Preparer interface { Before() error } type Janitor interface { Cleanup() error } type Workflow interface { Preparer Worker Janitor } Library code might have a single method that does all of the workflow using a Workflow object, but can be made more flexible by calling individual functions for the other interfaces. By requiring only the interface that is needed for the input of the function, the testing becomes easier and the library more robust.\nLet\u0026rsquo;s get into some common gotchas. 
Consider the following interface:\ntype Handler interface { Handle() error } First, arguments to functions that accept Handler types are defined as follows:\nfunc Do(h Handler) error { return h.Handle() } This can be confusing because you might be passing a pointer to a struct that implements Handler, but if you create the following definition:\nfunc Do(h *Handler) error {} You\u0026rsquo;ll receive the following error:\ncannot use valueHandler literal (type valueHandler) as type *Handler in argument to Do: *Handler is pointer to interface, not interface That\u0026rsquo;s because Handler is an interface, and the struct that implements Handler could be either a value or a pointer. Whether it is a value or pointer depends on how the method is implemented:\ntype pointerHandler struct {} func (*pointerHandler) Handle() error { return nil } type valueHandler struct {} func (valueHandler) Handle() error { return nil } Given these methods the following are valid:\nDo(\u0026amp;pointerHandler{}) Do(valueHandler{}) But the following is not valid:\nDo(pointerHandler{}) Which will cause the following error to be raised:\ncannot use pointerHandler literal (type pointerHandler) as type Handler in argument to Do: pointerHandler does not implement Handler (Handle method has pointer receiver) Somewhat confusingly, Do(\u0026amp;valueHandler{}) is allowed, but when h.Handle is called it will be called with a copy of the value passed in, which means that the handler will not be able to change the data on the original referenced struct. I tend to try to avoid this situation because of the confusion it might cause.\nThis leads us finally to interface{} and nil. The empty interface, interface{}, is an interface type that specifies no methods. Because it specifies no methods, it may hold values of any type since every type implements at least zero methods. 
Practically speaking, interface{} is used as a catch-all, usually as a preliminary to custom type checking, for example:\nfunc serialize(v interface{}) string { switch t := v.(type) { case int: return fmt.Sprintf(\u0026#34;%X\u0026#34;, v) case bool: if t { return \u0026#34;true\u0026#34; } else { return \u0026#34;false\u0026#34; } case string: return t case error: return t.Error() default: return fmt.Sprintf(\u0026#34;unknown type %T\u0026#34;, v) } } Here, a type switch is used to test the type of the value v; the case statements must use types to test and the assigned value t is a variable of the type matched with the value of v. Using type assertions also allows you to unpack the base type of an interface. Using our Error example from earlier:\nif e, ok := err.(*Error); ok { // e is an *Error from our code, so we can access it as needed. if e.code == 32 { ... } } This kind of type assertion is only able to be performed on an interface type; otherwise the type assertion would be irrelevant.\nThe final question we should ask ourselves is why we can use nil as a value for an interface, e.g. return nil instead of an error. All types in Go have a \u0026ldquo;zero value\u0026rdquo; since there is no None or null in Go. The zero value for an int is 0, for a bool is false, for a string is \u0026quot;\u0026quot;. If you define a struct value without a pointer, e.g. when you do time.Now() you get back a time.Time not a *time.Time, then the zero value is an empty struct, e.g. time.Time{}, whose internal values are all zero values for their types. The zero value for maps, slices, channels, pointers, functions, and interfaces is nil.\nConsider the following code:\nvar ( i int s string t time.Time m map[string]string ) These variables are declared but not assigned to, therefore they will all be allocated with their zero value, 0, \u0026quot;\u0026quot;, time.Time{}, and nil respectively. 
Because nil is the zero value for pointers and also for interface types themselves, passing nil in for an interface is effectively the same as saying \u0026ldquo;pass the zero value of the interface\u0026rdquo;. For example, return nil returns the zero value of the error interface. This also means that the nil check can work directly with interfaces:\ntype Handler interface { Handle() error } func Do(h Handler) error { if h == nil { return errors.New(\u0026#34;nil handler\u0026#34;) } return h.Handle() } Be careful with this type of check, however, as non-nil zero value types can also be used for an interface:\ntype IntHandler int func (IntHandler) Handle() error { return nil } func main() { fmt.Println(Do(nil)) fmt.Println(Do(IntHandler(0))) } In this case, only Do(nil) is caught by the nil check, whereas the zero-valued IntHandler is not nil and therefore its Handle method is called.\nnet/http Example Closures and Interfaces might not seem like a natural pairing for discussion, however they are both essential to effective and flexible library development. One of the best examples of this is in Go\u0026rsquo;s standard library net/http package which allows Go developers to quickly and easily spin up an http server.\nThe documentation describes how to get a simple server going with the DefaultServeMux using http.Handle and http.HandleFunc. A \u0026ldquo;mux\u0026rdquo; (short for \u0026ldquo;multiplexer\u0026rdquo;) passes incoming requests to the correct handler based on information from the request. By default, the path of the URL is used to determine which handler to use. In the following example the mux ensures that requests to http://localhost:8080/foo are handled by the fooHandler and to http://localhost:8080/bar are handled by the function passed to http.HandleFunc. 
Here is the example from the documentation:\nhttp.Handle(\u0026#34;/foo\u0026#34;, fooHandler) http.HandleFunc(\u0026#34;/bar\u0026#34;, func(w http.ResponseWriter, r *http.Request) { fmt.Fprintf(w, \u0026#34;Hello, %q\u0026#34;, html.EscapeString(r.URL.Path)) }) log.Fatal(http.ListenAndServe(\u0026#34;:8080\u0026#34;, nil)) This example shows both the interface and closure use cases for setting up an http server. In the first line of code, the http.Handle function takes a struct that implements the http.Handler interface as its second argument. In the second, the http.HandleFunc takes as its second argument a function of type http.HandlerFunc, which is a perfect use case for a closure. Finally the http.ListenAndServe method is called to serve requests on the localhost on port :8080; if it returns an error, it will be passed to log.Fatal. Let\u0026rsquo;s break this down:\nThe http.Handler interface is defined as follows:\ntype Handler interface { ServeHTTP(http.ResponseWriter, *http.Request) } Therefore, to implement the fooHandler we might do something as follows:\ntype FooContext struct { ID int Name string } type Foo struct { content *template.Template } func (f *Foo) ServeHTTP(w http.ResponseWriter, r *http.Request) { context := FooContext{ID: 1, Name: \u0026#34;Lydia\u0026#34;} f.content.Execute(w, context) } func main() { // Create a new Foo handler with the specified template loaded from disk. fooHandler := \u0026amp;Foo{ content: template.New(\u0026#34;foo.html\u0026#34;), } fooHandler.content = template.Must(fooHandler.content.ParseFiles(\u0026#34;foo.html\u0026#34;)) http.Handle(\u0026#34;/foo\u0026#34;, fooHandler) } Here, the Foo struct implements Handler with its ServeHTTP method. In this case, it maintains state with an HTML template that is loaded from a file and that template is executed with a context and written into the response body. 
You could see how the ServeHTTP method might use the request to look up information in a database to use for the context or do other processing on behalf of the user.\nThe http.HandlerFunc type is defined as follows:\ntype HandlerFunc func(http.ResponseWriter, *http.Request) The function signature is identical to the method signature in the interface. In both cases the library expects the user to handle the incoming request and write the response into the response writer. Using http.HandleFunc provides the opportunity to create a closure to maintain simple state. One very common use case for this is to create a chain of middleware functions that wrap other handlers. Consider the following middleware stub code:\nfunc middleware(next http.HandlerFunc) http.HandlerFunc { // One-time setup/initialization of the middleware coded here. ... // Create the closure to return as the handler function return func(w http.ResponseWriter, r *http.Request) { // Before the next request is handled, e.g. handle the incoming request // If this function returns before the next handler is called, then it short // circuits the request and no downstream handlers will be called. ... // Execute the next handler in the chain next(w, r) // After the request has been handled, e.g. handle the outgoing response ... } } This allows you to create a flow of control such that the outer middleware are handled first until some final middleware, then the response goes back out through the layers until the response is returned to the user. Common middleware steps for an http server include tracing, logging, authentication, content negotiation, and more! 
This might look something like the following:\nfunc logging(next http.HandlerFunc) http.HandlerFunc { log := logger.New() return func(w http.ResponseWriter, r *http.Request) { next(w, r) log.Info(\u0026#34;request handled\u0026#34;) } } func authentication(username, password string, next http.HandlerFunc) http.HandlerFunc { token := fmt.Sprintf(\u0026#34;%s:%s\u0026#34;, username, password) return func(w http.ResponseWriter, r *http.Request) { header := r.Header.Get(\u0026#34;Authorization\u0026#34;) if header != token { http.Error(w, \u0026#34;username/password not recognized\u0026#34;, http.StatusForbidden) return } next(w, r) } } func final(w http.ResponseWriter, r *http.Request) { w.Write([]byte(\u0026#34;Hello World!\u0026#34;)) } func main() { // Create middleware handler := logging(authentication(username, password, final)) } The closures allow us to easily construct handler-specific chains of http request handling that are intuitive and easily adaptable!\nWrap-Up Strict typing helps programmers write safe and highly-performant code and language constructs like interfaces and closures help them write flexible code even with strict typing. Careful consideration of how to define interfaces and libraries that use closures can make working with Go libraries more productive and enjoyable!\n","permalink":"https://bbengfort.github.io/2021/02/go-closures-interfaces/","summary":"\u003cp\u003eStrict typing in the Go programming language provides safety and performance that is valuable even if it does increase the verbosity of code. If there is a drawback to be found with strict typing, it is usually felt by library developers who require flexibility to cover different use cases, and most often appears as a suite of type-named functions such as \u003ccode\u003elib.HandleString\u003c/code\u003e, \u003ccode\u003elib.HandleUint64\u003c/code\u003e, \u003ccode\u003elib.HandleBool\u003c/code\u003e and so on. 
Go does provide two important language tools that do provide a lot of flexibility in library development: \u003cem\u003eclosures\u003c/em\u003e and \u003cem\u003einterfaces\u003c/em\u003e, which we will explore in this post.\u003c/p\u003e","title":"Go Closures \u0026 Interfaces"},{"content":"A facelift for Libelli today! I moved from Jekyll to Hugo for static site generation, a move that has been long overdue - and I\u0026rsquo;m very happy I\u0026rsquo;ve done it. Not only can I take advantage of a new theme with extra functionality (PaperMod in this case) but also because Hugo is written in Go, I feel like I have more control over how the site gets generated.\nA lot has been said on this topic; if you\u0026rsquo;re thinking about migrating from Jekyll to Hugo, I recommend Sara Soueidan\u0026rsquo;s blog post - the notes here are Libelli specific and are listed here more as notes than anything else.\nReasons for Switching These are my personal reasons:\nJekyll was getting very difficult to work with; I primarily used the bundle exec command, but that would fail more often than not, and because of the frequency that I was writing posts - every time I wanted to write a post I had to update a slew of dependencies. Travis-CI was failing - I had some Jekyll CI integration, but it would fail even though my pages would be built ok; it wasn\u0026rsquo;t inspiring a lot of confidence. I wanted to make changes to the site, but it seemed like I had a giant obstacle of having to update Jekyll and the theme first and I\u0026rsquo;m not very good with Ruby … Jekyll was taking a long time to build - I had 106 posts before I decided to switch and it was crawling. I wanted to use Hugo for other projects that I am working on. 
Considerations A few key considerations while making the switch:\nI really like Hugo\u0026rsquo;s content organization system; unfortunately that would break the permalink structure from Jekyll - the solution was aliases which allowed me to redirect the old post to the new link which I liked better. Along those lines, I really liked page-bundles for organizing content. It\u0026rsquo;s a bit tricky to switch with where I\u0026rsquo;m at now, but I\u0026rsquo;m going to use it in the future I think. I selected PaperMod primarily because of the search and archive functionality \u0026ndash; which will make finding posts on Libelli much easier. I hope to create a theme of my own someday though. Deploying on GitHub pages - moving away from Jekyll means I don\u0026rsquo;t get builds for free. I\u0026rsquo;m working on using Travis CI for deployments - we\u0026rsquo;ll see if it works after this post! I had to write a script to deal with making changes to the frontmatter to fit my new configuration; the hugo import jekyll command did seem to do a lot of work, but I still had to modify the slug and the aliases configurations as well as bring over defaults from my posts archetype. Tricky Bits Unfortunately (or fortunately), \u0026lt;script\u0026gt; tags are cleaned by the Hugo markdownify where they weren\u0026rsquo;t in Jekyll. This meant that all my gist embeddings had to be updated to the following shortcode:\n{{\u0026lt; gist bbengfort foo \u0026gt;}} Rather than writing a script, I went through this with find and replace.\nIn SVG Vertex with a Timer, however, I had actual JavaScript that I wanted rendered. So I created a layouts/shortcodes/vertex_timer.html script and embedded it as a shortcode in the file with:\n{{\u0026lt; vertex_timer \u0026gt;}} Finally, I had to replace all of the images that used the wrong shortcode format from Jekyll and replace them with absolute references to their location in the static folder. 
Most of this was copy and paste, but I\u0026rsquo;m worried I missed something, hence the tricky bits!\nThoughts It was very straight forward to move over to Hugo from Jekyll particularly because I was switching themes. I\u0026rsquo;m very much looking forward to using Hugo from here on out!\n","permalink":"https://bbengfort.github.io/2021/01/new-hugo-theme/","summary":"\u003cp\u003eA facelift for Libelli today! I moved from \u003ca href=\"https://jekyllrb.com/\"\u003eJekyll\u003c/a\u003e to \u003ca href=\"https://gohugo.io/\"\u003eHugo\u003c/a\u003e for static site generation, a move that has been long overdue — and I\u0026rsquo;m very happy I\u0026rsquo;ve done it. Not only can I take advantage of a new theme with extra functionality (\u003ca href=\"https://themes.gohugo.io/hugo-papermod/\"\u003ePaperMod\u003c/a\u003e in this case) but also because Hugo is written in Go, I feel like I have more control over how the site gets generated.\u003c/p\u003e\n\u003cp\u003eA lot has been said on this topic, if you\u0026rsquo;re thinking about migrating from Jekyll to Hugo, I recommend \u003ca href=\"https://www.sarasoueidan.com/blog/jekyll-ghpages-to-hugo-netlify/\"\u003eSara Soueidan\u0026rsquo;s blog post\u003c/a\u003e — the notes here are Libelli specific and are listed here more as notes than anything else.\u003c/p\u003e","title":"New Hugo Theme"},{"content":"gRPC makes the specification and implementation of networked APIs a snap. But what is the simplest way to document a gRPC API? There seem to be some hosted providers by Google, e.g. SmartDocs, but I have yet to find a gRPC-specific tool. For REST API frameworks, documentation is commonly generated along with live examples using OpenAPI (formerly swagger). By using grpc-gateway it appears to be pretty straight forward to generate a REST/gRPC API combo from protocol buffers and then hook into the OpenAPI specification.\nIn this post, I\u0026rsquo;ll go through the creation of docs from gRPC protocol buffers. 
In a following post, I\u0026rsquo;ll go through the creation of a live gRPC/REST service with Swagger documentation.\nStep 1: Define the service.\nWe\u0026rsquo;ll create a simple \u0026ldquo;notes\u0026rdquo; service that has two endpoints: create a note and fetch notes, optionally filtered.\nsyntax = \u0026#34;proto3\u0026#34;; package notes.v1; option go_package = \u0026#34;github.com/bbengfort/notes/v1\u0026#34;; service NoteService { rpc Fetch(NoteFilter) returns (Notebook) {}; rpc Create(Note) returns (Notebook) {}; } message Note { uint64 id = 1; string timestamp = 2; string author = 3; string text = 4; bool private = 5; } message NoteFilter { repeated uint64 ids = 1; repeated string author = 2; string before = 3; string after = 4; bool private = 5; } message Notebook { Error error = 1; repeated Note notes = 2; } message Error { uint32 code = 1; string message = 2; } So far, this is just a gRPC service definition. If we\u0026rsquo;re working in a Go project, we can version our API and structure our project as follows:\nWorkspace/go/src/github.com/bbengfort/notes └── cmd | └── notes | | └── main.go ├── go.mod ├── go.sum └── proto | └── notes | | └── v1 | | | └── api.proto Using this directory structure, generate the struct code and server and client interfaces using protoc with go and grpc plugins:\n$ protoc -I ./proto/ \\ --go_out=. --go_opt=module=github.com/bbengfort/notes \\ --go-grpc_out=. --go-grpc_opt=module=github.com/bbengfort/notes \\ proto/notes/v1/*.proto NOTE: You\u0026rsquo;ll have to install protoc (I did so with brew) and the go and grpc plugins (I used go get). See the gRPC Go Quickstart for more information on the installation process.\nIn this command the -I flag specifies where protoc can look for included protocol buffer files (e.g. if they\u0026rsquo;re imported) - more on this later. 
The --go_out and --go-grpc_out flags specify where to write the generated Go code and the --go_opt=module= and --go-grpc_opt=module= flags specify the root Go module. If you run this in the project root, e.g. github.com/bbengfort/notes then a v1 directory will be created with api.pb.go and api_grpc.pb.go inside of it. This is because of the option go_package = \u0026quot;github.com/bbengfort/notes/v1\u0026quot;; directive at the top of api.proto which resolves the output path based on all the module directives.\nStep 2: download include files and install dependencies\nIn order to use the grpc-gateway and openapiv2 protocol buffer plugins, we\u0026rsquo;ll have to modify our proto file with options that allow us to specify how the REST API is defined and to supply information to the swagger.json generated OpenAPI v2 specification. Custom options are described in third party protocol buffer files that must be included when we generate our protocol buffers using the -I flag.\nNOTE: I believe there is a way to download and \u0026ldquo;install\u0026rdquo; third party libraries into a global includes path, e.g. /usr/local/include/google/protobuf but I have to investigate this further.\nTo simplify the use of protoc and to prevent dependency management issues, I just downloaded the needed third-party files from grpc-gateway-boilerplate. This appears to be a pattern in some of the repos I\u0026rsquo;ve seen, adding them to a third_party directory, though I prefer to add them to an include directory. 
Your directory should now look like:\nWorkspace/go/src/github.com/bbengfort/notes └── cmd | └── notes | | └── main.go ├── go.mod ├── go.sum └── include | └── googleapis | | ├── LICENSE | | └── google | | | └── api | | | | ├── annotations.proto | | | | └── http.proto | | | └── rpc | | | | ├── code.proto | | | | ├── error_details.proto | | | | └── status.proto | └── grpc-gateway | | ├── LICENSE.txt | | └── protoc-gen-openapiv2 | | | └── options | | | | ├── annotations.proto | | | | └── openapiv2.proto └── proto | └── notes | | └── v1 | | | └── api.proto └── v1 | ├── api.pb.go | └── api_grpc.pb.go Finally install the required grpc-gateway plugins so that you have protoc-gen-grpc-gateway and protoc-gen-openapiv2 in your $GOBIN.\nStep 3 (optional): tell vscode where includes are\nIf you\u0026rsquo;re using VSCode and the vscode-proto3 extension, then I like to add the following directives to my workspace settings (.vscode/settings.json):\n{ \u0026#34;protoc\u0026#34;: { \u0026#34;path\u0026#34;: \u0026#34;/usr/local/bin/protoc\u0026#34;, \u0026#34;compile_on_save\u0026#34;: false, \u0026#34;options\u0026#34;: [ \u0026#34;-I=${workspaceRoot}/proto\u0026#34;, \u0026#34;-I=${workspaceRoot}/include/googleapis\u0026#34;, \u0026#34;-I=${workspaceRoot}/include/grpc-gateway\u0026#34; ] } } This prevents import error messages in your protocol buffers and enables autocomplete.\nStep 4: annotate our service\nFinally, we\u0026rsquo;ve reached the part we\u0026rsquo;ve been waiting for - annotating our proto file with the REST and OpenAPI v2 options. 
At the beginning of api.proto add the following:\nsyntax = \u0026#34;proto3\u0026#34;; package notes.v1; option go_package = \u0026#34;github.com/bbengfort/notes/v1\u0026#34;; import \u0026#34;google/api/annotations.proto\u0026#34;; import \u0026#34;protoc-gen-openapiv2/options/annotations.proto\u0026#34;; option (grpc.gateway.protoc_gen_openapiv2.options.openapiv2_swagger) = { info: { title: \u0026#34;Notes\u0026#34;; version: \u0026#34;1.0\u0026#34;; contact: { name: \u0026#34;bbengfort\u0026#34;; url: \u0026#34;https://github.com/bbengfort/notes\u0026#34;; email: \u0026#34;info@bengfort.com\u0026#34;; }; license: { name: \u0026#34;BSD 3-Clause License\u0026#34;; url: \u0026#34;https://github.com/bbengfort/notes/LICENSE\u0026#34;; }; }; schemes: HTTP; schemes: HTTPS; consumes: \u0026#34;application/json\u0026#34;; produces: \u0026#34;application/json\u0026#34;; }; When the openapiv2 plugin generates a swagger.json file, the information in this option will be used to populate the info, schemes, consumes, and produces fields of the specification. This both influences the information in the generated documentation and makes it easier to create a live server.\nNext we must update the service definition to map gRPC services to REST API calls:\nservice NoteService { rpc Fetch(NoteFilter) returns (Notebook) { option (google.api.http) = { get: \u0026#34;/api/v1/notes\u0026#34; }; }; rpc Create(Note) returns (Notebook) { option (google.api.http) = { post: \u0026#34;/api/v1/notes\u0026#34; body: \u0026#34;*\u0026#34; }; }; } These options specify that the Fetch RPC can be accessed with a GET request to /api/v1/notes and that the Create RPC uses POST to the same endpoint. 
Note that the body: \u0026quot;*\u0026quot; flag ensures that the request body is included in endpoint.\nStep 5: generate swagger spec and serve\nIn the final step of this post, we\u0026rsquo;ll use the openapiv2 plugin to generate the swagger json specification and use the swagger-ui docker image to serve some static documentation.\nprotoc -I ./proto/ \\ -I include/googleapis -I include/grpc-gateway \\ --go_out=. --go_opt=module=github.com/bbengfort/notes \\ --go-grpc_out=. --go-grpc_opt=module=github.com/bbengfort/notes \\ --openapiv2_out ./openapiv2 --openapiv2_opt logtostderr=true \\ proto/notes/v1/*.proto This protoc command has been updated to include the third party protocol buffer files and also adds the openapiv2 plugin, writing a specification file at openapiv2/notes/v1/api.swagger.json (note you may have to make the openapiv2 directory before running this command).\nTo serve the static swagger-ui docs, I used Docker as follows:\n$ docker run -p 80:8080 \\ -e SWAGGER_JSON=/openapiv2/notes/v1/api.swagger.json \\ -v $PWD/openapiv2/:/openapiv2 \\ swaggerapi/swagger-ui This will pull the swaggerapi/swagger-ui image from DockerHub when you run it for the first time. You can then view the docs at http://localhost/:\nThe next steps are to use grpc-gateway to create a server that does both gRPC hosting and a JSON REST API - complete with live Swagger documentation and styling.\n","permalink":"https://bbengfort.github.io/2021/01/grpc-openapi-docs/","summary":"\u003cp\u003egRPC makes the specification and implementation of networked APIs a snap. But what is the simplest way to \u003cem\u003edocument\u003c/em\u003e a gRPC API? There seem to be some hosted providers by Google, e.g. \u003ca href=\"https://cloud.google.com/endpoints/docs/grpc/dev-portal-update-ref-docs\"\u003eSmartDocs\u003c/a\u003e, but I have yet to find a gRPC-specific tool. 
For REST API frameworks, documentation is commonly generated along with live examples using \u003ca href=\"https://swagger.io/resources/open-api/\"\u003eOpenAPI (formerly swagger)\u003c/a\u003e. By using \u003ca href=\"https://github.com/grpc-ecosystem/grpc-gateway\"\u003egrpc-gateway\u003c/a\u003e it appears to be pretty straight forward to generate a REST/gRPC API combo from protocol buffers and then hook into the OpenAPI specification.\u003c/p\u003e","title":"Documenting a gRPC API with OpenAPI"},{"content":"I went on a brief adventure looking into creating a lightweight certificate authority (CA) in Go to issue certificates for mTLS connections between peers in a network. The CA was a simple command line program and the idea was that the certificate would initialize its own self-generated certs whose public key would be included in the code base of the peer-to-peer servers, then it could generate TLS x.509 key pairs signed by the CA. Of course you could do this with openssl, but I wanted to keep a self-coded Go version around for posterity.\nUsage:\n$ ca init -o \u0026#34;My P2P Network\u0026#34; -C \u0026#34;United States\u0026#34; $ ca issue -o \u0026#34;Peer 1\u0026#34; -C \u0026#34;United States\u0026#34; -p \u0026#34;California\u0026#34; $ ca issue -o \u0026#34;Peer 2\u0026#34; -C \u0026#34;France\u0026#34; -l \u0026#34;Paris\u0026#34; The gist is as follows:\nAfter usage there are a couple of key things that came up:\nHow do you generate serial numbers for the certificates? Can you PEM encode the certificate along with the CA public key in a single CA file? Can you PKCS12 encrypt the issued certificates for emailing? ","permalink":"https://bbengfort.github.io/2020/12/self-signed-ca/","summary":"\u003cp\u003eI went on a brief adventure looking into creating a lightweight certificate authority (CA) in Go to issue certificates for mTLS connections between peers in a network. 
The CA was a simple command line program and the idea was that the certificate would initialize its own self-generated certs whose public key would be included in the code base of the peer-to-peer servers, then it could generate TLS x.509 key pairs signed by the CA. Of course you could do this with \u003ccode\u003eopenssl\u003c/code\u003e, but I wanted to keep a self-coded Go version around for posterity.\u003c/p\u003e","title":"Self Signed CA"},{"content":"Developer computers often get a lot of cruft built up in non-standard places because of compiled binaries, assets, packages, and other tools that we install over time then forget about as we move onto other projects. In general, I like to reinstall my OS and wipe my disk every year or so to prevent crud from accumulating. As an intermediate step, this post compiles several maintenance commands that I run fairly routinely.\nHomebrew Update homebrew and remove old formulae and their folders often, otherwise it\u0026rsquo;s going to take a long time to run if you forget about it!\n$ brew update $ brew upgrade $ brew cleanup Linking and casks are also often problems; run brew doctor to inspect the warnings and take care of anything that needs cleaning up. Note that I usually run update and upgrade multiple times to make sure everything was correctly grabbed.\nFinally, a routine brew list will show what has been installed; although it\u0026rsquo;s tough to figure out dependencies from brew, if I recognize something that I installed and am not using anymore, I usually uninstall it.\nDocker I fairly routinely clean up docker containers, images, volumes, networks, etc. from my machine since I don\u0026rsquo;t generally rely on a local build process except for testing.\n$ docker system prune --all Python I manage my Python system and environments with pyenv, which means it\u0026rsquo;s easy to build up a cruft of old environments and duplicate copies of Python packages. 
Using pyenv versions, list your environments and versions routinely and then use either pyenv uninstall to remove a version of Python or pyenv virtualenv-delete to remove a virtual environment.\nRuby I don\u0026rsquo;t use Ruby very much, but similar to Python I have rbenv for a few projects that require Ruby components. Use rbenv versions to list Ruby installs and then rbenv uninstall to remove the environments.\nAlso routinely run:\n$ gem cleanup --dryrun $ gem cleanup To clean up old versions of gems.\nNode Modules Node and npm download a lot of packages into project directories. The following command searches for any node_modules directories that are older than 4 months.\n$ find . -name \u0026#34;node_modules\u0026#34; -mtime +120 -type d Clean them up as follows:\n$ find . -name \u0026#34;node_modules\u0026#34; -mtime +120 -type d | xargs rm -rf I usually run this in my workspaces directory before I engage in any web development.\nGit I usually do a pretty good job deleting branches that have been merged from my local computer, but occasionally I\u0026rsquo;ll clone a repository and do a fetch that pulls down a bunch of branches. The following command lists the branches that have been merged:\n$ git branch --merged main The issue with this is that we normally squash and merge our branches, so this may not catch the branches you\u0026rsquo;re looking for. However, thinking about how many git objects are on your system is important!\n","permalink":"https://bbengfort.github.io/2020/11/mac-cleanup/","summary":"\u003cp\u003eDeveloper computers often get a lot of cruft built up in non-standard places because of compiled binaries, assets, packages, and other tools that we install over time then forget about as we move onto other projects. In general, I like to reinstall my OS and wipe my disk every year or so to prevent crud from accumulating. 
As an intermediate step, this post compiles several maintenance commands that I run fairly routinely.\u003c/p\u003e","title":"OS X Cleanup"},{"content":"This post is a response to Go: Multiple Errors Management. I\u0026rsquo;ve dealt with multiple error contexts in a few places in my Go code but never created a subpackage for it in github.com/bbengfort/x and so I thought this post was a good motivation to explore it in slightly more detail. I\u0026rsquo;d also like to make error contexts for routine cancellation a part of my standard programming practice, so this post also investigates multiple error handling in a single routine or multiple routines like the original post.\nMulti-error management for me usually comes in the form of a Shutdown or Close method where I\u0026rsquo;m cleaning up a lot of things and would like to do everything before I handle errors:\nfunc (s *Server) Shutdown() (err error) { errs := make([]error, 0, 4) if err = s.router.GracefulStop(); err != nil { errs = append(errs, err) } if err = s.db.Close(); err != nil { errs = append(errs, err) } if err = s.meta.Flush(); err != nil { errs = append(errs, err) } if err = s.meta.Close(); err != nil { errs = append(errs, err) } // Best case scenario first if len(errs) == 0 { return nil } if len(errs) == 1 { return errs[0] } return fmt.Errorf(\u0026#34;%d errors occurred during shutdown\u0026#34;, len(errs)) } Obviously this is less than ideal in a lot of ways and using go-multierror by HashiCorp or multierr by Uber cleans things up nicely. Better yet, we could implement a simple type to handle reporting and appending:\n// MultiError implements the error interface so it can be used as an error while also // wrapping multiple errors and easily appending them during execution. type MultiError struct { errors []error } // Error prints a semicolon separated list of the errors that occurred. The Report // method returns an error with a newline separated bulleted list if that\u0026#39;s better. 
func (m *MultiError) Error() string { report := make([]string, 0, len(m.errors)+1) report = append(report, fmt.Sprintf(\u0026#34;%d errors occurred\u0026#34;, len(m.errors))) for _, err := range m.errors { report = append(report, err.Error()) } return strings.Join(report, \u0026#34;; \u0026#34;) } // Append adds more errors onto a MultiError, ignoring nil errors for ease of use. If the // MultiError hasn\u0026#39;t been initialized, it is in this function. If any of the errs are // MultiErrors themselves, they are flattened into the top-level multi error. func (m *MultiError) Append(errs ...error) { if m.errors == nil { m.errors = make([]error, 0, len(errs)) } for _, err := range errs { // ignore nil errors for quick appends. if err == nil { continue } switch e := err.(type) { // flatten multi-error to the top level. case *MultiError: if len(e.errors) \u0026gt; 0 { m.errors = append(m.errors, e.errors...) } default: m.errors = append(m.errors, err) } } } // Get returns nil if no errors have been added, the unique error if only one error // has been added, or the multi-error if multiple errors have been added. func (m *MultiError) Get() error { switch len(m.errors) { case 0: return nil case 1: return m.errors[0] default: return m } } This code simplifies the process a bit and adds more helper functionality, but I haven\u0026rsquo;t benchmarked it yet. New usage would be as follows:\nfunc (s *Server) Shutdown() (err error) { var merr MultiError merr.Append(s.router.GracefulStop()) merr.Append(s.db.Close()) merr.Append(s.meta.Flush()) merr.Append(s.meta.Close()) return merr.Get() } In real code, though, I think I might prefer to use go-multierror as it has a lot more functionality and a slightly more intuitive implementation. This code was mostly for commentary purposes.\nThe real thing I need to remember is goroutine cancellation contexts using errgroup:\nfunc action(ctx context.Context) (err error) { // Note that the action must listen for the cancellation! 
timer := time.NewTimer(time.Duration(rand.Int63n(4000)) * time.Millisecond) select { case \u0026lt;-timer.C: if rand.Float64() \u0026lt; 0.2 { return errors.New(\u0026#34;something bad happened\u0026#34;) } case \u0026lt;-ctx.Done(): return nil } return nil } func main() { g, ctx := errgroup.WithContext(context.Background()) for i := 0; i \u0026lt; 3; i++ { g.Go(func() (err error) { for j := 0; j \u0026lt; 3; j++ { if err = action(ctx); err != nil { return err } } return nil }) } if err := g.Wait(); err != nil { log.Fatal(err) } } The thing the blog post forgot to mention is that the goroutine must be able to actively cancel its operation by listening on the ctx.Done() channel in addition to a channel that signals the operation is done (in the above example, the timer channel that is just causing the routine to sleep). If the action function does not listen on the ctx.Done() channel, then even though the first error cancels the context, g.Wait() does not return until every goroutine has finished, so the program will not terminate \u0026ldquo;early\u0026rdquo; because no action is waiting for the cancellation signal.\n","permalink":"https://bbengfort.github.io/2020/10/go-multiple-errors/","summary":"\u003cp\u003eThis post is a response to \u003ca href=\"https://medium.com/a-journey-with-go/go-multiple-errors-management-a67477628cf1\"\u003eGo: Multiple Errors Management\u003c/a\u003e. I\u0026rsquo;ve dealt with multiple error contexts in a few places in my Go code but never created a subpackage for it in \u003ccode\u003egithub.com/bbengfort/x\u003c/code\u003e and so I thought this post was a good motivation to explore it in slightly more detail. 
I\u0026rsquo;d also like to make error contexts for routine cancellation a part of my standard programming practice, so this post also investigates multiple error handling in a single routine or multiple routines like the original post.\u003c/p\u003e","title":"Managing Multi-Errors in Go"},{"content":"For scientific reproducibility, it has become common for me to output experimental results as zip files that contain both configurations and inputs as well as one or more output results files. This is similar to .epub or .docx formats which are just specialized zip files - and allows me to easily rerun experiments for comparison purposes. Recently I tried to dump some json data into a zip file using Python 3.8 and was surprised when the code errored as it seemed pretty standard. This is the story of the crazy loophole that I had to go into as a result.\nFirst off, here is the code that didn\u0026rsquo;t work:\ndef make_archive_bad(path=\u0026#34;test.zip\u0026#34;): with zipfile.ZipFile(path, \u0026#34;x\u0026#34;) as z: with z.open(\u0026#34;config.json\u0026#34;, \u0026#34;w\u0026#34;) as c: # This doesn\u0026#39;t work json.dump(config, c, indent=2) with z.open(\u0026#34;data.json\u0026#34;, \u0026#34;w\u0026#34;) as d: for row in data(config): # This doesn\u0026#39;t work d.write(json.dumps(row)) d.write(\u0026#34;\\n\u0026#34;) The exception is located in the interaction between json.dump and zipfile._ZipWriteFile.write as you can see in the traceback below.\nFile \u0026#34;./zipr.py\u0026#34;, line 144, in \u0026lt;module\u0026gt; make_archive_bad() File \u0026#34;./zipr.py\u0026#34;, line 32, in make_archive_bad json.dump(config, c, indent=2) File \u0026#34;3.7/lib/python3.7/json/__init__.py\u0026#34;, line 180, in dump fp.write(chunk) File \u0026#34;3.7/lib/python3.7/zipfile.py\u0026#34;, line 1094, in write self._crc = crc32(data, self._crc) TypeError: a bytes-like object is required, not \u0026#39;str\u0026#39; The json module always produces str objects, 
not bytes objects. Therefore, fp.write() must support str input.\n— json documentation\nSo the issue is really with the zipfile library and to make it work, you\u0026rsquo;ll have to json.dumps and then encode your data yourself. Which is annoying:\ndef make_archive_annoying(path=\u0026#34;test.zip\u0026#34;): # This does work but is annoying # Also note, this will write to the root of the archive, and when unzipped will not # unzip to a directory but rather into the same directory as the zip file with zipfile.ZipFile(path, \u0026#34;x\u0026#34;) as z: with z.open(\u0026#34;config.json\u0026#34;, \u0026#34;w\u0026#34;) as c: c.write(json.dumps(config, indent=2).encode(\u0026#34;utf-8\u0026#34;)) with z.open(\u0026#34;data.json\u0026#34;, \u0026#34;w\u0026#34;) as d: for row in data(config): d.write(json.dumps(row).encode(\u0026#34;utf-8\u0026#34;)) d.write(\u0026#34;\\n\u0026#34;.encode(\u0026#34;utf-8\u0026#34;)) For 99% of people, the above solution is the way to go. However, as soon as I realized that this was also going to write into the root of the zip file so that when you extract the contents they are in the same directory as the zip file instead of a subdirectory \u0026hellip; I went a little overboard. 
So here is a solution that is not annoying but requires some wrapper utility code in your library.\nI\u0026rsquo;m sorry.\nclass Zipr(object): def __enter__(self): # Makes this thing a context manager return self def __exit__(self, exc_type, exc_value, traceback): # Makes this thing a context manager self.close() def close(self): self.zobj.close() class ZipArchive(Zipr): def __init__(self, path, mode=\u0026#34;r\u0026#34;): self.zobj = zipfile.ZipFile( path, mode, compression=zipfile.ZIP_STORED, allowZip64=True, compresslevel=None ) self.root, _ = os.path.splitext(os.path.basename(path)) def open(self, path, mode=\u0026#39;r\u0026#39;): # Write into a directory instead of the root of the zip file path = os.path.join(self.root, path) return ZipArchiveFile(self.zobj.open(path, mode)) class ZipArchiveFile(Zipr): def __init__(self, zobj, encoding=\u0026#34;utf-8\u0026#34;): self.zobj = zobj self.encoding = encoding def write(self, data): if isinstance(data, str): data = data.encode(self.encoding) self.zobj.write(data) These classes are essentially just wrappers for ZipFile and _ZipWriteFile, so potentially it would just be easier to subclass these classes and override the open and write methods - but I\u0026rsquo;ll leave that to the reader to make a decision. This is enough to have less annoying code:\ndef make_archive(path=\u0026#34;test.zip\u0026#34;): # Less annoying make archive with workaround classes with ZipArchive(path, \u0026#34;x\u0026#34;) as z: with z.open(\u0026#34;config.json\u0026#34;, \u0026#34;w\u0026#34;) as c: json.dump(config, c, indent=2) with z.open(\u0026#34;data.json\u0026#34;, \u0026#34;w\u0026#34;) as d: for row in data(config): d.write(json.dumps(row)) d.write(\u0026#34;\\n\u0026#34;) There are still issues, however. For example append ('a') mode does not work with _ZipWriteFile, so if you try to stream data by opening for appending, you\u0026rsquo;ll run into issues. 
See below:\ndef make_archive_stream(path=\u0026#34;test.zip\u0026#34;): # Attempts to open the internal zip file for appending, to stream data in. # But this doesn\u0026#39;t work because you can\u0026#39;t open an internal zip object for appending with ZipArchive(path, \u0026#34;x\u0026#34;) as z: with z.open(\u0026#34;config.json\u0026#34;, \u0026#34;w\u0026#34;) as c: json.dump(config, c, indent=2) cache = [] for i, row in enumerate(data(config)): cache.append(json.dumps(row)) # dump cache every 5 rows if i%5 == 0: with z.open(\u0026#34;data.json\u0026#34;, \u0026#34;a\u0026#34;) as d: for row in cache: d.write(row+\u0026#34;\\n\u0026#34;) cache = [] if len(cache) \u0026gt; 0: with z.open(\u0026#34;data.json\u0026#34;, \u0026#34;a\u0026#34;) as d: for row in cache: d.write(row+\u0026#34;\\n\u0026#34;) The full code can be found on this gist.\n","permalink":"https://bbengfort.github.io/2020/08/zipfiles-json/","summary":"\u003cp\u003eFor scientific reproducibility, it has become common for me to output experimental results as zip files that contain both configurations and inputs as well as one or more output results files. This is similar to .epub or .docx formats which are just specialized zip files - and allows me to easily rerun experiments for comparison purposes. Recently I tried to dump some json data into a zip file using Python 3.8 and was surprised when the code errored as it seemed pretty standard. This is the story of the crazy loophole that I had to go into as a result.\u003c/p\u003e","title":"Writing JSON into a Zip file with Python"},{"content":"When benchmarking Python programs, it is very common for me to use memory_profiler from the command line - e.g. mprof run python myscript.py. This creates a .dat file in the current working directory which you can view with mprof show. More often than not, though I want to compare two different runs for their memory profiles or do things like annotate the graphs with different timing benchmarks. 
This requires generating my own figures, which requires loading the memory profiler data myself.\nI\u0026rsquo;m sure that the memory_profiler library probably has some utility functions to do this, but the simplest for me is to load things into a Pandas series. The mprof command keeps track of real timestamps, so in order to do comparisons, I have to reindex the time series based on the starting timestamp reference. The code snippet is as follows:\nimport pandas as pd def load_mprofile(path, name=None): ref = None times, values = [], [] with open(path, \u0026#39;r\u0026#39;) as f: for line in f: if line.startswith(\u0026#34;CMDLINE\u0026#34;): if name is None: name = line[len(\u0026#34;CMDLINE\u0026#34;):].strip() if line.startswith(\u0026#34;MEM\u0026#34;): parts = line.split() val, ts = float(parts[1]), float(parts[2]) if ref is None: ref = ts times.append(ts-ref) values.append(val) return pd.Series(values, index=times, name=name) Using this loader, I can compare two memory profiling sequences as follows:\nimport os import matplotlib.pyplot as plt def plot_mprofiles(directory=\u0026#34;.\u0026#34;, ax=None): if ax is None: _, ax = plt.subplots(figsize=(9,6)) for name in os.listdir(directory): s = load_mprofile(os.path.join(directory, name)) s.plot(ax=ax) ax.legend() return ax Pretty straightforward, but a useful snippet to be able to look up at a glance.\n","permalink":"https://bbengfort.github.io/2020/07/read-mprofile-into-pandas/","summary":"\u003cp\u003eWhen benchmarking Python programs, it is very common for me to use \u003ca href=\"https://pypi.org/project/memory-profiler/\"\u003e\u003ccode\u003ememory_profiler\u003c/code\u003e\u003c/a\u003e from the command line - e.g. \u003ccode\u003emprof run python myscript.py\u003c/code\u003e. This creates a .dat file in the current working directory which you can view with \u003ccode\u003emprof show\u003c/code\u003e. 
More often than not, though I want to compare two different runs for their memory profiles or do things like annotate the graphs with different timing benchmarks. This requires generating my own figures, which requires loading the memory profiler data myself.\u003c/p\u003e","title":"Read mprofile Output into Pandas"},{"content":"I\u0026rsquo;m getting started on some projects that will make use of extensive Python performance profiling, unfortunately Python doesn\u0026rsquo;t focus on performance and so doesn\u0026rsquo;t have benchmark tools like I might find in Go. I\u0026rsquo;ve noticed that the two most important usages I\u0026rsquo;m looking at when profiling are speed and memory usage. For the latter, I simply use memory_profiler from the command line - which is pretty straight forward. However for speed usage, I did find a snippet that I thought would be useful to include and update depending on how my usage changes.\nimport cProfile from pstats import Stats from functools import wraps def sprofile(func): @wraps(func) def wrapper(*args, **kwargs): pr = cProfile.Profile() pr.enable() result = func(*args, **kwargs) pr.disable() Stats(pr).strip_dirs().sort_stats(\u0026#39;cumulative\u0026#39;).print_stats(20) return result return wrapper This decorator allows you to profile the speed performance of functions on the stack below the function being decorated. It uses standard library dependencies, which is great, and you can change the way the stats are printed out to suit your needs (e.g. 
this is formatted well for my analysis style).\nThe report it prints out is as follows:\n7636523 function calls (7636479 primitive calls) in 14.669 seconds Ordered by: cumulative time List reduced from 306 to 20 due to restriction \u0026lt;20\u0026gt; ncalls tottime percall cumtime percall filename:lineno(function) 1 0.000 0.000 14.669 14.669 sequential.py:107(run) 150 2.584 0.017 14.633 0.098 sequential.py:75(step) 843750 10.228 0.000 10.335 0.000 grid.py:72(neighborhood_sum) 843750 0.765 0.000 0.988 0.000 grid.py:129(__setitem__) 843750 0.529 0.000 0.726 0.000 grid.py:124(__getitem__) 2531454 0.275 0.000 0.275 0.000 {built-in method builtins.isinstance} 1687783 0.145 0.000 0.145 0.000 {built-in method builtins.len} 843750 0.107 0.000 0.107 0.000 grid.py:57(adjacency) 151 0.001 0.000 0.020 0.000 std.py:1099(__iter__) 82 0.000 0.000 0.019 0.000 std.py:1317(refresh) 83 0.000 0.000 0.017 0.000 std.py:1447(display) 83 0.000 0.000 0.015 0.000 std.py:1089(__repr__) 9/4 0.000 0.000 0.014 0.004 \u0026lt;frozen importlib._bootstrap\u0026gt;:978(_find_and_load) 9/4 0.000 0.000 0.014 0.004 \u0026lt;frozen importlib._bootstrap\u0026gt;:948(_find_and_load_unlocked) 83 0.002 0.000 0.014 0.000 std.py:310(format_meter) 9/4 0.000 0.000 0.013 0.003 \u0026lt;frozen importlib._bootstrap\u0026gt;:663(_load_unlocked) 17/6 0.000 0.000 0.011 0.002 \u0026lt;frozen importlib._bootstrap\u0026gt;:211(_call_with_frames_removed) 1 0.000 0.000 0.010 0.010 std.py:511(__new__) 1 0.000 0.000 0.009 0.009 std.py:623(get_lock) 1 0.000 0.000 0.009 0.009 std.py:79(__init__) You can see in this report that the majority of time is being spent in the neighborhood_sum function from line 3 and that the step function calls it nearly 5,625 times!\n","permalink":"https://bbengfort.github.io/2020/07/basic-python-profiling/","summary":"\u003cp\u003eI\u0026rsquo;m getting started on some projects that will make use of extensive Python performance profiling, unfortunately Python doesn\u0026rsquo;t focus on 
performance and so doesn\u0026rsquo;t have benchmark tools like I might find in Go. I\u0026rsquo;ve noticed that the two most important usages I\u0026rsquo;m looking at when profiling are speed and memory usage. For the latter, I simply use \u003ca href=\"https://pypi.org/project/memory-profiler/\"\u003e\u003ccode\u003ememory_profiler\u003c/code\u003e\u003c/a\u003e from the command line - which is pretty straight forward. However for speed usage, I did find a snippet that I thought would be useful to include and update depending on how my usage changes.\u003c/p\u003e","title":"Basic Python Profiling"},{"content":"In this post I walk through the steps of creating a multi-user JupyterHub server running on an AWS Ubuntu 18.04 instance. There are many ways of setting up JupyterHub including using Docker and Kubernetes - but this is a pretty straightforward mechanism that doesn\u0026rsquo;t have too many moving parts such as TLS termination proxies etc. I think of this as the baseline setup.\nNote that this setup has a few pros or cons depending on how you look at them.\nJupyterHub is responsible for TLS and you need to create certificates for it. Users cannot install packages using !conda install or !pip install. Users cannot create environments and use them. PAM users are created, so users can SSH into the machine. This post serves as a general sketch of how to get a production ready JupyterHub server up and running for multiple users.\nStep 1: Launch an instance Use the method of your choice to create an AWS instance in a VPC that has access to the Internet. A couple of notes on the instance creation process:\nI used the Ubuntu 18.04 HVM LTS as the base AMI Ensure the instance is in a security group that allows ports 22 and 443 Ensure you have SSH access to the instance Ensure the instance has enough memory, compute, and disk for your intended workload Next ensure the instance can be reachable by a DNS name. This is required for TLS to work. 
There are a few ways to do this, but the way I did it was to:\nCreate an elastic IP address (EIP) and assign it to the instance Create an A record in route53 mapping the domain name to the EIP At this point you should be able to SSH into your instance via the domain name. If so, you\u0026rsquo;re good to go for the next steps.\nStep 2: Install Anaconda I chose to use Anaconda to facilitate data science workloads for this installation. Anaconda has its pros and cons on a system level install, but one of the implications was that users could not install their own packages. Additionally, I had to jump through some hoops to get the server running with systemd. After this experience, vanilla Python might be a better choice, to be frank.\nFirst create a system user for Anaconda and add the ubuntu user to the group:\n$ sudo useradd -r anaconda $ sudo usermod -a -G anaconda ubuntu Next install Anaconda as follows:\nSelect distribution from Anaconda Distributions\nCopy 64-bit (x86) installer URL\nDownload, verify integrity, and execute the script\n$ curl -O https://repo.anaconda.com/archive/Anaconda3-2019.07-Linux-x86_64.sh $ sha256sum Anaconda3-2019.07-Linux-x86_64.sh 69581cf739365ec7fb95608eef694ba959d7d33b36eb961953f2b82cb25bdf5a Anaconda3-2019.07-Linux-x86_64.sh $ sudo bash Anaconda3-2019.07-Linux-x86_64.sh Accept the user agreement\nInstall to /opt/anaconda\nUpdate permissions of /opt/anaconda\n$ sudo chown -R anaconda:anaconda /opt/anaconda $ sudo chmod -R 775 /opt/anaconda These permissions should give the ubuntu user the ability to install packages to anaconda, but not other users. Other users should be able to execute anaconda commands but not modify the anaconda install. 
If they want to install their own packages, they\u0026rsquo;ll have to create a virtual environment in their home directory.\nOptional step: ensure Anaconda is available for new users This is an optional step, but because Anaconda relies heavily on the shell for configuration, I ensured that any new users would have access to Anaconda by appending the following to /etc/skel/.bashrc:\n# \u0026gt;\u0026gt;\u0026gt; conda initialize \u0026gt;\u0026gt;\u0026gt; # !! Contents within this block are managed by \u0026#39;conda init\u0026#39; !! __conda_setup=\u0026#34;$(\u0026#39;/opt/anaconda/bin/conda\u0026#39; \u0026#39;shell.bash\u0026#39; \u0026#39;hook\u0026#39; 2\u0026gt; /dev/null)\u0026#34; if [ $? -eq 0 ]; then eval \u0026#34;$__conda_setup\u0026#34; else if [ -f \u0026#34;/opt/anaconda/etc/profile.d/conda.sh\u0026#34; ]; then . \u0026#34;/opt/anaconda/etc/profile.d/conda.sh\u0026#34; else export PATH=\u0026#34;/opt/anaconda/bin:$PATH\u0026#34; fi fi unset __conda_setup # \u0026lt;\u0026lt;\u0026lt; conda initialize \u0026lt;\u0026lt;\u0026lt; Note that I also added this to /root/.bashrc so that when I executed commands with sudo su, Anaconda was available to the root user.\nStep 3: Install JupyterHub Install the required packages using conda and pip:\n(base) $ conda install -c conda-forge jupyterhub (base) $ conda install notebook (base) $ pip install jupyterhub-systemdspawner Make sure that pip is associated with the conda environment and not with the default Python installation!\nAs with Anaconda, we\u0026rsquo;ll create a jupyterhub system user and working directory for the server.\n$ sudo useradd -r jupyterhub $ sudo usermod -a -G jupyterhub ubuntu $ sudo mkdir /srv/jupyterhub $ sudo chown -R jupyterhub:jupyterhub /srv/jupyterhub Note, however, that we\u0026rsquo;ll run JupyterHub as a privileged user rather than as this user, but it simplifies the management of files a bit to do it this way.\nStep 4: Create LetsEncrypt certs At this point you\u0026rsquo;ll 
have to create certificates for the TLS endpoint. I prefer to use certbot and LetsEncrypt since it makes things so easy. Note that you\u0026rsquo;ll also have to implement a verification method to be granted the certificates (to prove you own the domain) which will also be true for renewal. I\u0026rsquo;ll be using route53 verification in this example.\nInstall certbot (instructions from the certbot website)\n$ sudo apt-get update $ sudo apt-get install software-properties-common $ sudo add-apt-repository universe $ sudo add-apt-repository ppa:certbot/certbot $ sudo apt-get update $ sudo apt-get install certbot python3-certbot-dns-route53 Ensure that your AWS credentials are correctly configured for boto3 and that sudo can access them (e.g. not in the environment), then perform the verification. You\u0026rsquo;ll need to submit an email address, agree to the license, and choose if you want the EFF electronic newsletter.\n$ sudo certbot certonly --dns-route53 -d jupyter.mydomain.com Ensure that crontab is setup to automatically renew the certs.\n$ sudo certbot renew --dry-run $ cat /etc/cron.d/certbot Note that we could also setup JupyterHub behind a proxy like Traefik or nginx, which would terminate the TLS itself and is perhaps a bit more easy to automatically configure than route53. 
I recommend this method particularly if you\u0026rsquo;re not on AWS.\nStep 5: Configure JupyterHub Create the jupyterhub configuration file and move it to the recommended system configuration location as follows:\n$ jupyterhub --generate-config $ sudo mkdir /etc/jupyterhub $ sudo mv jupyterhub_config.py /etc/jupyterhub There are a lot of commented-out configuration details, but the configurations that matter to me are as follows:\n# Set the JupyterHub bind URL, protocol and port c.JupyterHub.bind_url = \u0026#39;https://0.0.0.0:443\u0026#39; # Save the cookie secret file in the jupyterhub working directory c.JupyterHub.cookie_secret_file = \u0026#39;/srv/jupyterhub/cookie_secret\u0026#39; # Save the database in the jupyterhub working directory c.JupyterHub.db_url = \u0026#39;sqlite:////srv/jupyterhub/jupyterhub.sqlite\u0026#39; # Set the hub bind url for all jupyter-single user instances c.JupyterHub.hub_bind_url = \u0026#39;http://127.0.0.1:8081\u0026#39; # Store the pid file in the jupyterhub working directory c.JupyterHub.pid_file = \u0026#39;/srv/jupyterhub/jupyterhub.pid\u0026#39; # Ensure that notebooks are shut down when users log out so that notebooks # are cleaned up, releasing their memory. Note this is also a helpful way # to do support: just have them log out and log back in again! 
c.JupyterHub.shutdown_on_logout = True # We\u0026#39;re using the SystemdSpawner c.JupyterHub.spawner_class = \u0026#39;systemdspawner.SystemdSpawner\u0026#39; # Specify the locations of the letsencrypt certs c.JupyterHub.ssl_cert = \u0026#39;/etc/letsencrypt/live/jupyter.mydomain.com/fullchain.pem\u0026#39; c.JupyterHub.ssl_key = \u0026#39;/etc/letsencrypt/live/jupyter.mydomain.com/privkey.pem\u0026#39; # Set limits on the compute resources users have access to c.SystemdSpawner.cpu_limit = 3.0 c.SystemdSpawner.isolate_tmp = True c.SystemdSpawner.isolate_devices = True c.SystemdSpawner.disable_user_sudo = True c.SystemdSpawner.mem_limit = \u0026#39;8G\u0026#39; # Set any environment variables you\u0026#39;d like your users to have access to. # Very helpful for things like $NLTK_DATA or other Python resources. c.Spawner.environment = { \u0026#34;MY_ENV_VAR\u0026#34;: \u0026#34;fo\u0026#34; } # Determine who can create new user accounts c.Authenticator.admin_users = set([\u0026#39;ubuntu\u0026#39;, \u0026#39;admin\u0026#39;]) Create a new user and password for the admin user:\n$ sudo adduser admin Enter the password and details for the admin user \u0026ndash; this will give you admin access to the JupyterHub server and allow you to create new users online.\nAt this point, as the root user you should be able to run:\n$ jupyterhub -f /etc/jupyterhub/jupyterhub_config.py You should be able to see the login screen and login as the admin user. You should also be able to create new users from the admin page.\nStep 6: Run JupyterHub as a systemd service Because anaconda needs to be initialized and the environment modified to support it, the systemd service needs to be run from a bash script. Add the following to /usr/local/bin/run_jupyterhub.sh and make it executable.\n#!/bin/bash # \u0026gt;\u0026gt;\u0026gt; conda initialize \u0026gt;\u0026gt;\u0026gt; # !! Contents within this block are managed by \u0026#39;conda init\u0026#39; !! 
__conda_setup=\u0026#34;$(\u0026#39;/opt/anaconda/bin/conda\u0026#39; \u0026#39;shell.bash\u0026#39; \u0026#39;hook\u0026#39; 2\u0026gt; /dev/null)\u0026#34; if [ $? -eq 0 ]; then eval \u0026#34;$__conda_setup\u0026#34; else if [ -f \u0026#34;/opt/anaconda/etc/profile.d/conda.sh\u0026#34; ]; then . \u0026#34;/opt/anaconda/etc/profile.d/conda.sh\u0026#34; else export PATH=\u0026#34;/opt/anaconda/bin:$PATH\u0026#34; fi fi unset __conda_setup # \u0026lt;\u0026lt;\u0026lt; conda initialize \u0026lt;\u0026lt;\u0026lt; # Run the JupyterHub service with the system configuration jupyterhub -f /etc/jupyterhub/jupyterhub_config.py Create a systemd service by writing the following into /etc/systemd/system/jupyterhub.service:\n[Unit] Description=JupyterHub Documentation=https://jupyterhub.readthedocs.io/en/stable/ After=network.target [Service] Type=simple Restart=always RestartSec=10 WorkingDirectory=/srv/jupyterhub ExecStart=/usr/local/bin/run_jupyterhub.sh StandardOutput=syslog StandardError=syslog SyslogIdentifier=jupyterhub [Install] WantedBy=multi-user.target You can now enable and start the server as follows:\n$ sudo systemctl enable jupyterhub.service $ sudo systemctl start jupyterhub.service You should now be able to go to the server and access JupyterHub without running it manually as the root user. To diagnose issues, use journalctl as follows:\n$ sudo journalctl -u jupyterhub.service This will open up the log file and tell you if anything went wrong.\nStep 7: Handle Logging Our systemd service writes all output to the syslog. By default we can access the logs using journalctl. However, to make debugging easier, we can use rsyslog to write the logs into a file. Add the following in /etc/rsyslog.d/jupyterhub.conf:\nif $programname == \u0026#39;jupyterhub\u0026#39; then /var/log/jupyterhub/access.log \u0026amp; stop Note also that you might have to add a priority prefix to the configuration file name, e.g. 22-jupyterhub.conf, to ensure that rsyslog executes it correctly. 
Create the logging directory and give it the correct permissions, then restart rsyslog:\n$ sudo mkdir /var/log/jupyterhub $ sudo chown root:syslog /var/log/jupyterhub $ sudo chmod 775 /var/log/jupyterhub $ sudo systemctl restart rsyslog.service To make sure the logs don\u0026rsquo;t get too large, add the following to /etc/logrotate.d/jupyterhub.conf:\n/var/log/jupyterhub/access.log { daily rotate 15 size 50M missingok notifempty compress delaycompress dateext dateformat -%Y-%m-%d create 0644 root root } This will ensure that the logs are compressed and rotated daily and that they will be deleted after 15 days.\n","permalink":"https://bbengfort.github.io/2019/10/launch-jupyterhub-server/","summary":"\u003cp\u003eIn this post I walk through the steps of creating a multi-user JupyterHub server running on an AWS Ubuntu 18.04 instance. There are many ways of setting up JupyterHub including using Docker and Kubernetes - but this is a pretty straightforward mechanism that doesn\u0026rsquo;t have too many moving parts such as TLS termination proxies etc. I think of this as the baseline setup.\u003c/p\u003e\n\u003cp\u003eNote that this setup has a few pros or cons depending on how you look at them.\u003c/p\u003e","title":"Launching a JupyterHub Instance"},{"content":"Once the EBS volume has been created and attached to the instance, ssh into the instance and list the available disks:\n$ lsblk NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT loop0 7:0 0 86.9M 1 loop /snap/core/4917 loop1 7:1 0 12.6M 1 loop /snap/amazon-ssm-agent/295 loop2 7:2 0 91M 1 loop /snap/core/6350 loop3 7:3 0 18M 1 loop /snap/amazon-ssm-agent/930 nvme0n1 259:0 0 300G 0 disk nvme1n1 259:1 0 8G 0 disk └─nvme1n1p1 259:2 0 8G 0 part / In the above case we want to attach nvme0n1 - a 300GB gp2 EBS volume. Check if the volume already has data in it (e.g. created from a snapshot or being attached to a new instance):\n$ sudo file -s /dev/nvme0n1 /dev/nvme0n1: data If the above command shows only data, then the volume has no file system and is empty. 
Format the file system as follows:\n$ sudo mkfs -t ext4 /dev/nvme0n1 mke2fs 1.44.1 (24-Mar-2018) Creating filesystem with 78643200 4k blocks and 19660800 inodes Filesystem UUID: 42a9004d-7d79-4113-8d36-2daaaaa63c87 Superblock backups stored on blocks: 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 4096000, 7962624, 11239424, 20480000, 23887872, 71663616 Allocating group tables: done Writing inode tables: done Creating journal (262144 blocks): done Writing superblocks and filesystem accounting information: done Create a directory to mount the volume to:\n$ sudo mkdir /data $ sudo chown ubuntu:ubuntu /data Mount the volume to the directory:\n$ sudo mount /dev/nvme0n1 /data/ The volume should now be available for access:\n$ cd /data $ df -h . Filesystem Size Used Avail Use% Mounted on /dev/nvme0n1 295G 65M 280G 1% /data To unmount the volume:\n$ sudo umount /dev/nvme0n1 Automount To mount the EBS volume on reboot you need to make an fstab entry. First create a backup of the fstab configuration:\n$ sudo cp /etc/fstab /etc/fstab.bak Then add the entry to /etc/fstab as follows:\n# device_name mount_point file_system_type fs_mntops fs_freq fs_passno /dev/nvme0n1 /data ext4 defaults,nofail To check that the fstab entry has been written correctly, run:\n$ sudo mount -a ","permalink":"https://bbengfort.github.io/2019/02/mount-ebs-volume/","summary":"\u003cp\u003eOnce the EBS volume has been created and attached to the instance, ssh into the instance and list the available disks:\u003c/p\u003e\n\u003cpre tabindex=\"0\"\u003e\u003ccode\u003e$ lsblk\nNAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT\nloop0         7:0    0 86.9M  1 loop /snap/core/4917\nloop1         7:1    0 12.6M  1 loop /snap/amazon-ssm-agent/295\nloop2         7:2    0   91M  1 loop /snap/core/6350\nloop3         7:3    0   18M  1 loop /snap/amazon-ssm-agent/930\nnvme0n1     259:0    0  300G  0 disk\nnvme1n1     259:1    0    8G  0 disk\n└─nvme1n1p1 259:2    0    8G  0 part 
/\n\u003c/code\u003e\u003c/pre\u003e\u003cp\u003eIn the above case we want to attach nvme0n1 - a 300GB gp2 EBS volume. Check if the volume already has data in it (e.g. created from a snapshot or being attached to a new instance):\u003c/p\u003e","title":"Mount an EBS volume"},{"content":"Visual Diagnostics for More Effective Machine Learning\nDescription Modeling is often treated as a search activity: find some combination of features, algorithm, and hyperparameters that yields the best score after cross-validation. In this talk, we will explore how to steer the model selection process with visual diagnostics and the Yellowbrick library, leading to more effective and more interpretable results and faster experimental workflows.\n","permalink":"https://bbengfort.github.io/2019/01/visual-diagnostics-for-more-effective-machine-learning/","summary":"\u003cp\u003e\u003ca href=\"https://pydata.org/miami2019/schedule/presentation/8/\"\u003eVisual Diagnostics for More Effective Machine Learning\u003c/a\u003e\u003c/p\u003e\n\n\n    \n    \u003cdiv style=\"position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;\"\u003e\n      \u003ciframe allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" allowfullscreen=\"allowfullscreen\" loading=\"eager\" referrerpolicy=\"strict-origin-when-cross-origin\" src=\"https://www.youtube.com/embed/2kZ38ysHDzM?autoplay=0\u0026controls=1\u0026end=0\u0026loop=0\u0026mute=0\u0026start=0\" style=\"position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;\" title=\"YouTube video\"\n      \u003e\u003c/iframe\u003e\n    \u003c/div\u003e\n\n\u003ciframe\n  style=\"width: 100%; height: 500px;\" frameborder=\"0\" marginwidth=\"0\" marginheight=\"0\" scrolling=\"no\"\n  src=\"https://www.slideshare.net/slideshow/embed_code/127690054?rel=0\" allowfullscreen webkitallowfullscreen mozallowfullscreen\u003e \u003c/iframe\u003e\n\u003cbr\u003e\u003cbr\u003e\n\u003ch3 
id=\"description\"\u003eDescription\u003c/h3\u003e\n\u003cp\u003eModeling is often treated as a search activity: find some combination of features, algorithm, and hyperparameters that yields the best score after cross-validation. In this talk, we will explore how to steer the model selection process with visual diagnostics and the Yellowbrick library, leading to more effective and more interpretable results and faster experimental workflows.\u003c/p\u003e","title":"Visual Diagnostics for More Effective Machine Learning"},{"content":"Blast throughput is what we call a throughput measurement such that N requests are simultaneously sent to the server and the duration to receive responses for all N requests is recorded. The throughput is computed as N/duration where duration is in seconds. This is the typical and potentially correct way to measure throughput from a client to a server, however issues do arise in distributed systems land:\nthe requests must all originate from a single client high latency response outliers can skew results you must be confident that N is big enough to max out the server N mustn\u0026rsquo;t be so big as to create non-server related bottlenecks. In this post I\u0026rsquo;ll discuss my implementation of the blast workload as well as an issue that came up with many concurrent connections in gRPC. This led me down the path to use one connection to do blast throughput testing, which led to other issues, which I\u0026rsquo;ll discuss later.\nFirst, let\u0026rsquo;s suppose that we have a gRPC service that defines a unary RPC with a Put() interface that allows the storage of a string key and a bytes value with protocol buffers. The blast throughput implementation is as follows:\nThis is a lot of code to go through but the key parts of this are as follows:\nAs much work as possible is done before executing the blast, e.g. creating request objects and connecting the clients to the server. 
Synchronization is achieved through arrays of length N - no channels or locks are used for reporting purposes. The only thing each blast operation executes is the creation of a context and sending the request to the server. I created a simple server that wrapped a map[string][]byte with a sync.RWMutex and implemented the Put service. It\u0026rsquo;s not high performance, sure, but it should highlight how well Blast works as well as the performance of a minimal gRPC server. The results surprised me:\nThe top graph shows the throughput, a terrible 4500 ops/second for only 250 blasted requests, and worse, after 250 requests the throughput drops to nothing, because as you can see from the bottom graph, the failures start to increase.\nPrinting out the errors showed that I was getting rpc error: code = Unavailable desc = transport is closing errors from gRPC. All 1000 clients successfully connected, but then could not make requests.\nThe fix, as mentioned on line 41, was to replace the client per request with a single client (or possibly a handful of clients that are used in a round-robin fashion by each request). This improved things significantly:\nNow we\u0026rsquo;re getting 30,000 puts per second, which is closer to what I would expect from gRPC\u0026rsquo;s Unary RPC. However, using a single client does pose some issues:\nThe client must be thread safe when making requests, which could add additional overhead to the throughput computation. Dealing with redirects or other server-side errors may become impossible with a single client blast throughput measurement. How do you balance the blast against multiple servers? 
The complete implementation of Blast and the server can be found at github.com/bbengfort/speedmap in the server-blast branch in the server folder.\nNote that I just found strest-grpc, and I\u0026rsquo;m interested in figuring out how it matches up with this assessment and blog post.\nIn a later post, I\u0026rsquo;ll discuss how we implement sustained throughput - where we have multiple clients continuously writing to the system and we measure throughput server-side.\n","permalink":"https://bbengfort.github.io/2018/09/blast-throughput/","summary":"\u003cp\u003eBlast throughput is what we call a throughput measurement such that N requests are simultaneously sent to the server and the duration to receive responses for all N requests is recorded. The throughput is computed as \u003ccode\u003eN/duration\u003c/code\u003e where duration is in seconds. This is the typical and potentially correct way to measure throughput from a client to a server, however issues do arise in distributed systems land:\u003c/p\u003e\n\u003col\u003e\n\u003cli\u003ethe requests must all originate from a single client\u003c/li\u003e\n\u003cli\u003ehigh latency response outliers can skew results\u003c/li\u003e\n\u003cli\u003eyou must be confident that N is big enough to max out the server\u003c/li\u003e\n\u003cli\u003eN mustn\u0026rsquo;t be so big as to create non-server related bottlenecks.\u003c/li\u003e\n\u003c/ol\u003e\n\u003cp\u003eIn this post I\u0026rsquo;ll discuss my implementation of the blast workload as well as an issue that came up with many concurrent connections in gRPC. This led me down the path to use one connection to do blast throughput testing, which led to other issues, which I\u0026rsquo;ll discuss later.\u003c/p\u003e","title":"Blast Throughput"},{"content":"In this post I\u0026rsquo;m just going to maintain a list of notes for Go testing that I seem to commonly need to reference. 
It will also serve as an index for the posts related to testing that I have to commonly look up as well. Here is a quick listing of the table of contents:\nBasics Table Driven Tests Fixtures Golden Files Frameworks No Framework Ginkgo \u0026amp; Gomega Helpers Temporary Directories Sources and References Basics Just a quick reminder of how to write tests, benchmarks, and examples. A test is written as follows:\nfunc TestThing(t *testing.T) {} Benchmarks are written as follows:\nfunc BenchmarkThing(b *testing.B) { for i := 0; i \u0026lt; b.N; i++ { // Do Thing } } To run benchmarks, ensure you use the bench flag: go test -bench=. with the directory that contains the benchmarks specified. Examples are written as follows:\nfunc ExampleThing() { Thing() // Output: // thing happened } Test assertions and descriptions are as follows (most commonly used):\nt.Error/t.Errorf: equivalent to t.Log followed by t.Fail, which means that the failure message is printed out, but the test continues running. t.Fatal/t.Fatalf: equivalent to t.Log followed by t.FailNow, which means the failure message is printed but the test is canceled at that point (all deferred calls will be executed after this step). t.Helper: marks the calling function as a test helper; when printing file and line info, the function will be skipped. Usually used to make common assertions or perform setup or tear down. t.Skip/t.Skipf: marks the test as skipped, though the test will still fail if any Error was called before the skip. Ref: Package testing\nTable Driven Tests Table driven tests use several language features including composite literals and anonymous structs to write related tests in a compact form. The most compact form of the tests looks like this:\nvar fibTests = []struct{ n int //input expected int // expected result }{ {1, 1}, {2, 1}, {3, 2}, {4, 3}, {5, 5}, {6, 8}, {7, 13}, } Of course it is also possible to define an internal struct in the test package (if using pkg_test) for reusable test construction. 
Hooking it up is as simple as:\nfunc TestFib(t *testing.T) { for _, tt := range fibTests { actual := Fib(tt.n) if actual != tt.expected { t.Errorf(\u0026#34;Fib(%d): expected %d, actual %d\u0026#34;, tt.n, tt.expected, actual) } } } Ref: Dave Cheney — Writing table driven tests in Go\nFixtures When using the Go testing package, the test binary will be executed with its working directory set to the source directory of the package being tested. Additionally, the Go tool will ignore directories that start with a period or an underscore, or that match the word testdata. This means that you can create a directory called testdata and store fixtures there. You can then load data as follows:\nfunc loadFixture(t *testing.T, name string) []byte { path := filepath.Join(\u0026#34;testdata\u0026#34;, name) bytes, err := ioutil.ReadFile(path) if err != nil { t.Fatalf(\u0026#34;could not open test fixture %s: %s\u0026#34;, name, err) } return bytes } Ref: Dave Cheney — Test fixtures in Go\nGolden Files When testing complicated or large output, you can save the data as an output file named .golden and provide a flag for updating it:\nvar update = flag.Bool(\u0026#34;update\u0026#34;, false, \u0026#34;update .golden files\u0026#34;) func TestSomething(t *testing.T) { actual := doSomething() golden := filepath.Join(\u0026#34;testdata\u0026#34;, t.Name()+\u0026#34;.golden\u0026#34;) if *update { ioutil.WriteFile(golden, actual, 0644) } expected, _ := ioutil.ReadFile(golden) if !bytes.Equal(actual, expected) { // Fail! } } Ref: Povilas Versockas — Go advanced testing tips \u0026amp; tricks\nFrameworks No Framework Ben Johnson makes a good argument for not using a testing framework. Go has a simple yet powerful testing framework. Frameworks are a barrier to entry for contributors to code. Frameworks require more dependencies to be fetched and managed. 
To reduce the verbosity, you can include simple test assertions as follows:\nfunc assert(tb testing.TB, condition bool, msg string) func ok(tb testing.TB, err error) func equals(tb testing.TB, exp, act interface{}) This way you can write tests as:\nfunc TestSomething(t *testing.T) { value, err := DoSomething() ok(t, err) equals(t, 100, value) } I certainly like the simplicity of this idea and on many of my small packages I simply write tests like this. However, in larger projects, it feels like test organization can quickly get out of control and I don\u0026rsquo;t know what I\u0026rsquo;ve tested and where.\nRef: Ben Johnson — Structuring Tests in Go Ref: Testing Functions for Go\nGinkgo \u0026amp; Gomega Many of my projects have started off using Ginkgo and Gomega for testing. Ginkgo provides BDD-style testing to write expressive and well-organized tests. Gomega provides a matching library for performing test-related assertions.\nTo bootstrap a test suite (after installing the libraries with go get) you would run ginkgo bootstrap. This creates the test suite which runs the tests. You can then generate tests by running ginkgo generate thing to create thing_test.go with the test stub already inside it:\npackage thing_test import ( . \u0026#34;/path/to/thing\u0026#34; . \u0026#34;github.com/onsi/ginkgo\u0026#34; . \u0026#34;github.com/onsi/gomega\u0026#34; ) var _ = Describe(\u0026#34;Thing\u0026#34;, func() { var thing *Thing BeforeEach(func() { thing = new(Thing) }) It(\u0026#34;should do something\u0026#34;, func() { Ω(thing.Something()).Should(Succeed()) }) }) While I do like the use of this test framework, it\u0026rsquo;s primarily for the organization of the tests and the runner.\nHelpers Helper functions can be marked with t.Helper(), which excludes their line and signature information from the test error traceback. 
They can be used to do setup for the test case, unrelated error checks, and can even clean up after themselves!\nTemporary Directories Often, I need a temporary directory to store a database in or write files to. I can create the temporary directory with this helper function, which also returns a function to clean up the temporary directory.\nconst tmpDirPrefix = \u0026#34;mytests\u0026#34; func tempDir(t *testing.T, name string) (path string, cleanup func()) { t.Helper() tmpDir, err := ioutil.TempDir(\u0026#34;\u0026#34;, tmpDirPrefix) if err != nil { t.Fatalf(\u0026#34;could not create temporary directory: %s\u0026#34;, err) } return filepath.Join(tmpDir, name), func() { err = os.RemoveAll(tmpDir) if err != nil { t.Errorf(\u0026#34;could not remove temporary directory: %s\u0026#34;, err) } } } This can be used with the cleanup function pretty simply:\nfunc TestThing(t *testing.T) { dir, cleanup := tempDir(t, \u0026#34;db\u0026#34;) defer cleanup() ... } Another version that I have in some of my tests creates a single temporary directory for all of the tests, stored in a variable at the top level. Any caller asking for a directory can create it, but it won\u0026rsquo;t be overridden if it already exists; any test that cleans up will then clean up that directory.\nSources and References Package testing Dave Cheney — Writing table driven tests in Go Dave Cheney — Test fixtures in Go Ben Johnson — Structuring Tests in Go Povilas Versockas — Go advanced testing tips \u0026amp; tricks Not related to testing, but saved for reference for a later godoc notes post:\nElliot Chance — Godoc: Tips \u0026amp; Tricks Andrew Gerrand — Godoc: documenting Go code ","permalink":"https://bbengfort.github.io/2018/09/go-testing-notes/","summary":"\u003cp\u003eIn this post I\u0026rsquo;m just going to maintain a list of notes for Go testing that I seem to commonly need to reference. It will also serve as an index for the posts related to testing that I have to commonly look up as well. 
Here is a quick listing of the table of contents:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003ca href=\"#basics\"\u003eBasics\u003c/a\u003e\u003c/li\u003e\n\u003cli\u003e\u003ca href=\"#table-driven-tests\"\u003eTable Driven Tests\u003c/a\u003e\u003c/li\u003e\n\u003cli\u003e\u003ca href=\"#fixtures\"\u003eFixtures\u003c/a\u003e\n\u003cul\u003e\n\u003cli\u003e\u003ca href=\"#golden-files\"\u003eGolden Files\u003c/a\u003e\u003c/li\u003e\n\u003c/ul\u003e\n\u003c/li\u003e\n\u003cli\u003e\u003ca href=\"#frameworks\"\u003eFrameworks\u003c/a\u003e\n\u003cul\u003e\n\u003cli\u003e\u003ca href=\"#no-framework\"\u003eNo Framework\u003c/a\u003e\u003c/li\u003e\n\u003cli\u003e\u003ca href=\"#ginkgo--gomega\"\u003eGinkgo \u0026amp; Gomega\u003c/a\u003e\u003c/li\u003e\n\u003c/ul\u003e\n\u003c/li\u003e\n\u003cli\u003e\u003ca href=\"#helpers\"\u003eHelpers\u003c/a\u003e\n\u003cul\u003e\n\u003cli\u003e\u003ca href=\"#temporary-directories\"\u003eTemporary Directories\u003c/a\u003e\u003c/li\u003e\n\u003c/ul\u003e\n\u003c/li\u003e\n\u003cli\u003e\u003ca href=\"#sources-and-references\"\u003eSources and References\u003c/a\u003e\u003c/li\u003e\n\u003c/ul\u003e\n\u003ch2 id=\"basics\"\u003eBasics\u003c/h2\u003e\n\u003cp\u003eJust a quick reminder of how to write tests, benchmarks, and examples. A test is written as follows:\u003c/p\u003e","title":"Go Testing Notes"},{"content":"In order to improve the performance of asynchronous message passing in Alia, I\u0026rsquo;m using gRPC bidirectional streaming to create the peer to peer connections. When the replica is initialized it creates a remote connection to each of its peers that lives in its own go routine; any other thread can send messages by passing them to that go routine through a channel, replies are then dispatched via another channel, directed to the thread via an actor dispatching model.\nThis post is about the performance of the remote sending go routine, particularly with respect to how many threads that routine is. 
Here is some basic stub code for the messenger go routine that listens for incoming messages on a buffered channel, and sends them to the remote via the stream:\nfunc (r *Remote) messenger() { // Attempt to establish a connection to the remote peer var err error if err = r.connect(); err != nil { out.Warn(err.Error()) } // Send all messages in the order they arrive on the channel for msg := range r.messages { // If we\u0026#39;re not online try to re-establish the connection if !r.online { if err = r.connect(); err != nil { out.Warn( \u0026#34;dropped %s message to %s (%s): could not connect\u0026#34;, msg.Type, r.Name, r.Endpoint() ) // close the connection and go to the next message r.close() continue } } // Send the message on the remote stream if err = r.stream.Send(msg); err != nil { out.Warn( \u0026#34;dropped %s message to %s (%s): could not send: %s\u0026#34;, msg.Type, r.Name, r.Endpoint(), err.Error() ) // go offline if there was an error sending a message r.close() continue } // But now how do we receive the reply? } } The question is, how do we receive the reply from the remote?\nIn sync mode, we can simply receive the reply before we send the next message. 
This has the benefit of ensuring that there is no further synchronization required on connect and close, however as shown in the graph below, it does not perform well at all.\nIn async mode, we can launch another go routine to handle all the incoming requests and dispatch them:\nfunc (r *Remote) listener() { for { if r.online { var ( err error rep *pb.PeerReply ) if rep, err = r.stream.Recv(); err != nil { out.Warn( \u0026#34;no response from %s (%s): %s\u0026#34;, r.Name, r.Endpoint(), err ) return } r.Dispatcher.Dispatch(events.New(rep.EventType(), r, rep)) } } } This does much better in terms of performance, however there is a race condition on the access to r.online before the access to r.stream which may be made nil by messenger routine closing.\nTo test this, I ran a benchmark, sending 5000 messages each in their own go routine and waiting until all responses were dispatched before computing the throughput. The iorder mode is to prove that even when in async if the messages are sent one at a time (e.g. not in a go routine) the order is preserved.\nAt first, I thought the size of the message buffer might be causing the bottleneck (hence the x-axis). The buffer prevents back-pressure from the message sender, and it does appear to have some influence on sync and async mode (but less of an impact in iorder mode). From these numbers, however, it\u0026rsquo;s clear that we need to run the listener in its own routine.\nNotes:\nWith sender and receiver go routines, the message order is preserved There is a race condition between sender and receiver Buffer size only has a small impact ","permalink":"https://bbengfort.github.io/2018/09/streaming-remote-throughput/","summary":"\u003cp\u003eIn order to improve the performance of asynchronous message passing in Alia, I\u0026rsquo;m using gRPC bidirectional streaming to create the peer to peer connections. 
When the replica is initialized it creates a remote connection to each of its peers that lives in its own go routine; any other thread can send messages by passing them to that go routine through a channel, replies are then dispatched via another channel, directed to the thread via an actor dispatching model.\u003c/p\u003e","title":"Streaming Remote Throughput"},{"content":"This is kind of a dumb post, but it\u0026rsquo;s something I\u0026rsquo;m sure I\u0026rsquo;ll look up in the future. I have a lot of emails where I have to send a date that\u0026rsquo;s sometime in the future, e.g. six weeks from the end of a class to specify a deadline … I\u0026rsquo;ve just been opening a Python terminal and importing datetime and timedelta but I figured this quick script on the command line would make my life a bit easier:\nAnd that\u0026rsquo;s all there is to it, not very interesting, but something I will probably have in my bin for the rest of my life.\n","permalink":"https://bbengfort.github.io/2018/09/future-date/","summary":"\u003cp\u003eThis is kind of a dumb post, but it\u0026rsquo;s something I\u0026rsquo;m sure I\u0026rsquo;ll look up in the future. I have a lot of emails where I have to send a date that\u0026rsquo;s sometime in the future, e.g. six weeks from the end of a class to specify a deadline … I\u0026rsquo;ve just been opening a Python terminal and importing \u003ccode\u003edatetime\u003c/code\u003e and \u003ccode\u003etimedelta\u003c/code\u003e but I figured this quick script on the command line would make my life a bit easier:\u003c/p\u003e","title":"Future Date Script"},{"content":"Here\u0026rsquo;s the scenario: we have a buffered channel that\u0026rsquo;s being read by a single Go routine and is written to by multiple go routines. For simplicity, we\u0026rsquo;ll say that the channel accepts events and that the other routines generate events of specific types, A, B, and C. 
If there are more of one type of event generator (or some producers are faster than others) we may end up in the situation where there are a series of the same events on the buffered channel. What we would like to do is read all of the same type of event that is on the buffered channel at once, handling them all simultaneously; e.g. aggregating the read of our events.\nAn initial solution is composed of two loops; the first loop has a select that performs a blocking read of either the msgs or a done channel to determine when to exit the go routine. If a msg is received, a second loop labeled grouper is initiated with a non-blocking read of the msgs channel. The loop keeps track of a current and next value. If next and current are the same, it continues reading off the channel until they are different or there is nothing to read, at which point it handles both next and current.\nfunc consumeAggregate(msgs \u0026lt;-chan string, done \u0026lt;-chan bool) { var current, next string var count int for { // Outer select does a blocking read on both channels select { case current = \u0026lt;-msgs: // count our current event count = 1 grouper: // continue reading events off the msgs channel for { select { case next = \u0026lt;-msgs: if next != current { // exit grouper loop and handle next and current break grouper } else { // keep track of the number of similar events count++ } default: // nothing is on the channel, break grouper and // only handle current by setting next to empty next = \u0026#34;\u0026#34; break grouper } } case \u0026lt;-done: // done consuming exit go routine return } // This section happens after select is complete // handle the current messages with the aggregate count handle(current, count) // handle next if one exists if next != \u0026#34;\u0026#34; { handle(next, 1) } } } This solution does have one obvious problem; the next value is not aggregated with similar values that happen after. E.g. 
in the event stream aaaabbb, the calls to handle will be (a, 4), (b, 1), (b, 2). The good news though is that testing with the race and deadlock detector show that this method is correct. Possible improvements for a future post include:\nAggregate the next value Read val, ok from the channel to detect if it\u0026rsquo;s closed to exit convert the outer loop to a range to complete when the channel is closed Here is the Aggregating Channel Gist that contains the complete code and tests.\n","permalink":"https://bbengfort.github.io/2018/08/aggregating-go-channels/","summary":"\u003cp\u003eHere\u0026rsquo;s the scenario: we have a buffered channel that\u0026rsquo;s being read by a single Go routine and is written to by multiple go routines. For simplicity, we\u0026rsquo;ll say that the channel accepts events and that the other routines generate events of specific types, \u003ccode\u003eA\u003c/code\u003e, \u003ccode\u003eB\u003c/code\u003e, and \u003ccode\u003eC\u003c/code\u003e. If there are more of one type of event generator (or some producers are faster than others) we may end up in the situation where there are a series of the same events on the buffered channel. What we would like to do is read \u003cem\u003eall\u003c/em\u003e of the same type of event that is on the buffered channel at once, handling them all simultaneously; e.g. aggregating the read of our events.\u003c/p\u003e","title":"Aggregating Reads from a Go Channel"},{"content":"Building correct concurrent programs in a distributed system with multiple threads and processes can quickly become very complex to reason about. For performance, we want each thread in a single process to operate as independently as possible; however anytime the shared state of the system is modified synchronization is required. Primitives like mutexes can [ensure structs are thread-safe]({% post_url 2017-02-21-synchronizing-structs %}), however in Go, the strong preference for synchronization is communication. 
In either case Go programs can quickly become locks upon locks or morasses of channels, incurring performance penalties at each synchronization point.\nThe Actor Model is a solution for reasoning about concurrency in distributed systems that helps eliminate unnecessary synchronization. In the actor model, we consider our system to be composed of actors, computational primitives that have a private state, can send and receive messages, and perform computations based on those messages. The key is that a system is composed of many actors and actors do not share memory; they have to communicate by messaging. Although Go does not provide first class actor primitives the way Erlang or the Akka toolkit do, this does fit in well with the CSP principle.\nIn the next few posts, I\u0026rsquo;ll explore implementing the Actor model in Go for a simple distributed system that allows clients to make requests and periodically synchronizes its state to its peers. The model is shown below:\nActors An actor is a process or a thread that has the ability to send and receive messages. When an actor receives a message it can do one of three things:\nCreate new actors Send messages to known actors Designate how to handle the next message At first glance we may think that actors are only created at the beginning of a program, e.g. the \u0026ldquo;main\u0026rdquo; actor or the instantiation of a program-long ticker actor that sends periodic messages and can receive start and stop messages. However, anytime a go programmer executes a new go routine, there is the possibility of a new actor being created. In our example, we\u0026rsquo;ll explore how a server creates temporary actors to handle single requests from clients.\nSending messages to known actors allows an actor to synchronize or share state with other go routines in the same process, other processes on the same machine, or even processes on other machines. 
As a result, actors are a natural framework for creating distributed systems. In our example we\u0026rsquo;ll send messages both over channels and using gRPC for network communications.\nThe most important thing to understand about actor communication is that although actors run concurrently, they will only process messages sequentially in the order in which they are received. Actors send messages asynchronously (i.e. an actor isn\u0026rsquo;t blocked while waiting for another actor to receive the message). This means that messages need to be stored while the actor is processing other messages; this storage is usually called a \u0026ldquo;mailbox\u0026rdquo;. We\u0026rsquo;ll implement mailboxes with buffered channels in this post.\nDeciding how to handle the next message is a general way of saying that actors \u0026ldquo;do something\u0026rdquo; with messages, usually by modifying their state, and that it is something \u0026ldquo;interesting enough\u0026rdquo; that it impacts how the next message is handled. This implies a couple of things:\nActors have an internal state and memory Actors mutate their state based on messages How an actor responds depends on the order of messages received Actors can shut down or stop For the rest of the posts, we\u0026rsquo;ll consider a simple service that hands out monotonically increasing, unique identities to clients called Ipseity. If the actor receives a next() message, it increments its local counter (mutating its internal state) ensuring that the next message always returns a monotonically increasing number. 
If it receives an update(id) message, it updates its internal state to the specified id if it is larger than its internal id, allowing it to synchronize with remote peers (in an eventually consistent fashion).\nEvent Model In order to reduce confusion between network messages and actor messages, I prefer to use the term \u0026ldquo;event\u0026rdquo; when referring to messages sent between actors. This also allows us to reason about actors as implementing an event loop, another common distributed systems design paradigm. It is important to note that “actors are a specialized, opinionated implementation of an event driven architecture”, which means the actor model is a subset of event architectures, such as the [dispatcher model]({% post_url 2017-07-21-event-dispatcher %}) described earlier in this journal.\nI realize this does cause a bit of cognitive overhead, but this pays off when complex systems with many event types can be traced, showing a serial order of events handled by an actor. So for now, we\u0026rsquo;ll consider an event a message that can be \u0026ldquo;dispatched\u0026rdquo; (sent) to other actors, and \u0026ldquo;handled\u0026rdquo; (received) by an actor, one at a time.\nEvents are described by their type, which determines what data the event contains and how it should be handled by the actor. 
In Go, event types can be implemented as an enumeration by extending the uint16 type as follows:\n// Event types represented in Ipseity const ( UnknownEvent EventType = iota IdentityRequest SyncTimeout SyncRequest SyncReply ) // String names of event types var eventTypeStrings = [...]string{ \u0026#34;unknown\u0026#34;, \u0026#34;identityRequest\u0026#34;, \u0026#34;syncTimeout\u0026#34;, \u0026#34;syncRequest\u0026#34;, \u0026#34;syncReply\u0026#34;, } // EventType is an enumeration of the kind of events that actors handle type EventType uint16 // String returns the human readable name of the event type func (t EventType) String() string { if int(t) \u0026lt; len(eventTypeStrings) { return eventTypeStrings[t] } return eventTypeStrings[0] } Events themselves are usually represented by an interface to allow for multiple event types with specialized functionality to be created in code. For simplicity here, however, I\u0026rsquo;ll simply define a single event struct and we\u0026rsquo;ll use type casting later in the code:\ntype Event struct { Type EventType Source interface{} Value interface{} } The Source of the event is the actor that is dispatching the event, and we\u0026rsquo;ll primarily use this to store channels so that we can send messages (events) back to the actor. The Value of the event is any associated data that needs to be used by the actor processing the event.\nActor Interface There are a lot of different types of actors including:\nActors that run for the duration of the program Actors that generate events but do not receive them Actors that exist ephemerally to handle one or few events As a result it is difficult to describe an interface that handles all of these types generically. 
Instead we\u0026rsquo;ll focus on the central actor of our application (called the \u0026ldquo;Local Actor\u0026rdquo; in the diagram above), which fulfills the first role (runs the duration of the program) and most completely describes the actor design.\ntype Actor interface { Listen(addr string) error // Run the actor to listen for messages Dispatch(Event) error // Outside callers dispatch events to actor Handle(Event) error // Handle each event sequentially } As noted in the introduction and throughput appendix below, there are a number of ways to implement the actor interface that ensure events received by the Dispatch method are handled one at a time, in sequential order. Here, we\u0026rsquo;ll use a buffered channel as a mailbox of a fixed size, so that other actors that are dispatching events to this actor aren\u0026rsquo;t blocked while the actor is handling other messages.\ntype ActorServer struct { pid int64 // unique identity of the actor events chan Event // mailbox to receive event dispatches sequence int64 // internal state, monotonically increasing identity } The Listen method starts the actor (as well as a gRPC server on the specified addr, which we\u0026rsquo;ll discuss later) and reads messages off the channel one at a time, executing the Handle method for each message before moving to the next message. Listen runs forever until the events channel is closed, e.g. when the program exits.\nfunc (a *ActorServer) Listen(addr string) error { // Initialize the events channel able to buffer 1024 messages a.events = make(chan Event, 1024) // Read events off of the channel sequentially for event := range a.events { if err := a.Handle(event); err != nil { return err } } return nil } The Handle method can create new actors, send messages, and determine how to respond to the next event. 
Generally it is just a jump table, passing the event to the correct event handling method:\nfunc (a *ActorServer) Handle(e Event) error { switch e.Type { case IdentityRequest: return a.onIdentityRequest(e) case SyncTimeout: return a.onSyncTimeout(e) case SyncRequest: return a.onSyncRequest(e) case SyncReply: return a.onSyncReply(e) default: return fmt.Errorf(\u0026#34;no handler identified for event %s\u0026#34;, e.Type) } } The Dispatch method allows other actors to send events to the actor, by simply putting the event on the channel. When other go routines call Dispatch they won\u0026rsquo;t be blocked, waiting for the actor to handle the event because of the buffer … unless the actor has been backed up so the buffer is full.\nfunc (a *ActorServer) Dispatch(e Event) error { a.events \u0026lt;- e return nil } Next Steps In the next post (or two) we\u0026rsquo;ll hook up a gRPC server to the actor so that it can serve identity requests to clients as well as send and respond to synchronization requests for remote actors. We\u0026rsquo;ll also create a second go routine next to the actor process that issues synchronization timeouts on a periodic interval. Together, the complete system will be able to issue monotonically increasing identities in an eventually consistent fashion.\nOther Resources For any discussion of Actors, it seems obligatory to include this very entertaining video of Carl Hewitt, the inventor of the actor model, describing them on a white board with Erik Meijer and Clemens Szyperski.\nOther blog posts:\nThe actor model in 10 minutes Why has the actor model not succeeded? Understanding reactive architecture through the actor model Appendix: Throughput One of the biggest questions I had was whether or not the actor model introduced any performance issues over a regular mutex by serializing a wrapper event over a channel instead of directly locking the actor state. 
I tested the throughput for the following types of ipseity servers:\nSimple: locks the whole server to increment the sequence and create the response to the client. Sequence: creates a sequence struct that is locked when incremented, but not when creating the response to the client. Actor: Uses the buffered channel actor model as described in this post. Locker: Implements the actor interface but instead of a buffered channel uses a mutex to serialize events. As you can see from the above benchmark, it does not appear that the actor model described in these posts adds overhead that penalizes performance.\nThe code for both the benchmark and the implementations of the servers above can be found at: https://github.com/bbengfort/ipseity/tree/multiactor\n","permalink":"https://bbengfort.github.io/2018/08/actor-model/","summary":"\u003cp\u003eBuilding correct concurrent programs in a distributed system with multiple threads and processes can quickly become very complex to reason about. For performance, we want each thread in a single process to operate as independently as possible; however anytime the shared state of the system is modified synchronization is required. Primitives like mutexes can [ensure structs are thread-safe]({% post_url 2017-02-21-synchronizing-structs %}), however in Go, the strong preference for \u003ca href=\"https://blog.golang.org/share-memory-by-communicating\"\u003esynchronization is communication\u003c/a\u003e. In either case Go programs can quickly become locks upon locks or morasses of channels, incurring performance penalties at each synchronization point.\u003c/p\u003e","title":"The Actor Model"},{"content":"Syntactic parsing is a technique by which segmented, tokenized, and part-of-speech tagged text is assigned a structure that reveals the relationships between tokens governed by syntax rules, e.g. by grammars. 
Consider the sentence:\nThe factory employs 12.8 percent of Bradford County.\nA syntax parse produces a tree that might help us understand that the subject of the sentence is \u0026ldquo;the factory\u0026rdquo;, the predicate is \u0026ldquo;employs\u0026rdquo;, and the target is \u0026ldquo;12.8 percent\u0026rdquo;, which in turn is modified by \u0026ldquo;Bradford County\u0026rdquo;. Syntax parses are often a first step toward deep information extraction or semantic understanding of text. Note however, that syntax parsing methods suffer from structural ambiguity, that is the possibility that there exists more than one correct parse for a given sentence. Attempting to select the most likely parse for a sentence is incredibly difficult.\nThe best general syntax parser that exists for English, Arabic, Chinese, French, German, and Spanish is currently the blackbox parser found in Stanford\u0026rsquo;s CoreNLP library. This parser is a Java library, however, and requires Java 1.8 to be installed. Luckily it also comes with a server that can be run and accessed from Python using NLTK 3.2.3 or later. 
Once you have downloaded the JAR files from the CoreNLP download page and installed Java 1.8 as well as pip installed nltk, you can run the server as follows:\nfrom nltk.parse.corenlp import CoreNLPServer # The server needs to know the location of the following files: # - stanford-corenlp-X.X.X.jar # - stanford-corenlp-X.X.X-models.jar STANFORD = os.path.join(\u0026#34;models\u0026#34;, \u0026#34;stanford-corenlp-full-2018-02-27\u0026#34;) # Create the server server = CoreNLPServer( os.path.join(STANFORD, \u0026#34;stanford-corenlp-3.9.1.jar\u0026#34;), os.path.join(STANFORD, \u0026#34;stanford-corenlp-3.9.1-models.jar\u0026#34;), ) # Start the server in the background server.start() The server needs to know the location of the JAR files you downloaded, either by adding them to your Java $CLASSPATH or like me, storing them in a models directory that you can access from your project. When you start the server, it runs in the background, ready for parsing.\nTo get constituency parses from the server, instantiate a CoreNLPParser and parse raw text as follows:\nfrom nltk.parse.corenlp import CoreNLPParser parser = CoreNLPParser() parse = next(parser.raw_parse(\u0026#34;I put the book in the box on the table.\u0026#34;)) If you\u0026rsquo;re in a Jupyter notebook, the tree will be drawn as above. Note that the CoreNLPParser can take a URL to the CoreNLP server, so if you\u0026rsquo;re deploying this in production, you can run the server in a docker container, etc. and access it for multiple parses. The raw_parse method expects a single sentence as a string; you can also use the parse method to pass in tokenized and tagged text using other NLTK methods. 
Parses are also handy for identifying questions:\nnext(parser.raw_parse(\u0026#34;What is the longest river in the world?\u0026#34;)) Note the SBARQ representing the question; this data can be used to create a classifier that can detect what type of question is being asked, which can then in turn be used to transform the question into a database query!\nI should also point out why we\u0026rsquo;re using next(); the parser actually returns a generator of parses, starting with the most likely. By using next, we\u0026rsquo;re selecting only the first, most likely parse.\nConstituency parses are deep and contain a lot of information, but often dependency parses are more useful for text analytics and information extraction. To get a Stanford dependency parse with Python:\nfrom nltk.parse.corenlp import CoreNLPDependencyParser parser = CoreNLPDependencyParser() parse = next(parser.raw_parse(\u0026#34;I put the book in the box on the table.\u0026#34;)) Once you\u0026rsquo;re done parsing, don\u0026rsquo;t forget to stop the server!\n# Stop the CoreNLP server server.stop() To ensure that the server is stopped even when an exception occurs, you can also use the CoreNLPServer context manager as follows:\njars = ( \u0026#34;stanford-corenlp-3.9.1.jar\u0026#34;, \u0026#34;stanford-corenlp-3.9.1-models.jar\u0026#34; ) with CoreNLPServer(*jars): parser = CoreNLPParser() text = \u0026#34;The runner scored from second on a base hit\u0026#34; parse = next(parser.parse_text(text)) parse.draw() Note that the parse_text function in the above code allows a string to be passed that might contain multiple sentences and returns a parse for each sentence it segments. 
Additionally the tokenize and tag methods can be used on the parser to get the Stanford part of speech tags from the text.\nUnfortunately there isn\u0026rsquo;t much documentation on this, but for more check out the NLTK CoreNLP API documentation.\n","permalink":"https://bbengfort.github.io/2018/06/corenlp-nltk-parses/","summary":"\u003cp\u003eSyntactic parsing is a technique by which segmented, tokenized, and part-of-speech tagged text is assigned a structure that reveals the relationships between tokens governed by syntax rules, e.g. by grammars. Consider the sentence:\u003c/p\u003e\n\u003cblockquote\u003e\n\u003cp\u003eThe factory employs 12.8 percent of Bradford County.\u003c/p\u003e\n\u003c/blockquote\u003e\n\u003cp\u003eA syntax parse produces a tree that might help us understand that the subject of the sentence is \u0026ldquo;the factory\u0026rdquo;, the predicate is \u0026ldquo;employs\u0026rdquo;, and the target is \u0026ldquo;12.8 percent\u0026rdquo;, which in turn is modified by \u0026ldquo;Bradford County\u0026rdquo;. Syntax parses are often a first step toward deep information extraction or semantic understanding of text. Note however, that syntax parsing methods suffer from \u003cem\u003estructural ambiguity\u003c/em\u003e, that is the possibility that there exists more than one correct parse for a given sentence. Attempting to select the most likely parse for a sentence is incredibly difficult.\u003c/p\u003e","title":"Syntax Parsing with CoreNLP and NLTK"},{"content":"Understanding Machine Learning Through Visualizations with Benjamin Bengfort and Rebecca Bilbro - Episode 166\nDescription Machine learning models are often inscrutable and it can be difficult to know whether you are making progress. 
To improve feedback and speed up iteration\n","permalink":"https://bbengfort.github.io/2018/06/understanding-machine-learning-through-visualizations-with-benjamin-bengfort-and-rebecca-bilbro-episode-166/","summary":"\u003cp\u003e\u003ca href=\"https://www.pythonpodcast.com/yellowbrick-with-bejnamin-bengfort-and-rebecca-bilbro-episode-166/\"\u003eUnderstanding Machine Learning Through Visualizations with Benjamin Bengfort and Rebecca Bilbro - Episode 166\u003c/a\u003e\u003c/p\u003e\n\u003ch3 id=\"description\"\u003eDescription\u003c/h3\u003e\n\u003cp\u003eMachine learning models are often inscrutable and it can be difficult to know whether you are making progress. To improve feedback and speed up iteration\u003c/p\u003e","title":"Understanding Machine Learning Through Visualizations with Benjamin Bengfort and Rebecca Bilbro - Episode 166"},{"content":"When you have an outer and an inner loop, how do you continue the outer loop from a condition inside the inner loop? Consider the following code:\nfor i in range(10): for j in range(9): if i \u0026lt;= j: # break out of inner loop # continue outer loop print(i,j) # don\u0026#39;t print unless inner loop completes, # e.g. outer loop is not continued print(\u0026#34;inner complete!\u0026#34;) Here, we want to print for all i ∈ [0,10) all numbers j ∈ [0,9) that are less than or equal to i and we want to print complete once we\u0026rsquo;ve found an entire list of j that meets the criteria. While this seems like a fairly contrived example, I\u0026rsquo;ve actually encountered this exact situation in several places in code this week, and I\u0026rsquo;ll provide a real example in a bit.\nMy first instinct simply uses a function to use return to do a \u0026ldquo;hard break\u0026rdquo; out of the loop. This allows us to short-circuit functionality by exiting the function, but doesn\u0026rsquo;t actually provide continue functionality, which is the goal in the above example. 
The technique does work, however, and in multi-loop situations is probably the best bet.\ndef inner(i): for j in range(9): if i \u0026lt;= j: # Note if this was break, the print statement would execute return print(i,j) print(\u0026#34;inner complete\u0026#34;) for i in range(10): inner(i) Much neater, however is using for/else. The else block fires iff the for loop it is connected with completes. This was very weird to me at first, I thought else should trigger if break. Think of it this way though:\nYou\u0026rsquo;re searching through a list of things, for item in collection and you plan to break when you\u0026rsquo;ve found the item you\u0026rsquo;re looking for, else you do something if you exhaust the collection and didn\u0026rsquo;t find what you were looking for.\nTherefore we can code our loop as follows:\nfor i in range(10): for j in range(9): if i \u0026lt;= j: break print(i,j) else: # Outer loop is continued continue print(\u0026#34;inner complete!\u0026#34;) This is a little strange, because it is probably more appropriate to put our print in the else block, but this was the spec, continue the outer loop if the inner loop gets broken.\nHere\u0026rsquo;s a better example with date parsing:\n# Try to parse a timestamp with a bunch of formats for fmt in (JSON, PG, ISO, RFC, HUMAN): try: ts = datetime.strptime(ts, fmt) break except ValueError: continue else: # Could not parse with any of the formats required raise ValueError(\u0026#34;could not parse timestamp\u0026#34;) Is this better or worse than the function version of this?\ndef parse_timestamp(ts): for fmt in (JSON, PG, ISO, RFC, HUMAN): try: return datetime.strptime(ts, fmt) except ValueError: continue raise ValueError(\u0026#34;could not parse timestamp\u0026#34;) ts = parse_timestamp(ts) Let\u0026rsquo;s go to the benchmarks:\nSo basically, there is no meaningful difference, but depending on the context of implementation, using for/else may be a bit more meaningful or easy to test than having to 
implement another function.\nBenchmark code can be found here.\n","permalink":"https://bbengfort.github.io/2018/05/continuing-outer-loops-for-else/","summary":"\u003cp\u003eWhen you have an outer and an inner loop, how do you continue the outer loop from a condition inside the inner loop? Consider the following code:\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-python\" data-lang=\"python\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003efor\u003c/span\u003e \u003cspan class=\"n\"\u003ei\u003c/span\u003e \u003cspan class=\"ow\"\u003ein\u003c/span\u003e \u003cspan class=\"nb\"\u003erange\u003c/span\u003e\u003cspan class=\"p\"\u003e(\u003c/span\u003e\u003cspan class=\"mi\"\u003e10\u003c/span\u003e\u003cspan class=\"p\"\u003e):\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    \u003cspan class=\"k\"\u003efor\u003c/span\u003e \u003cspan class=\"n\"\u003ej\u003c/span\u003e \u003cspan class=\"ow\"\u003ein\u003c/span\u003e \u003cspan class=\"nb\"\u003erange\u003c/span\u003e\u003cspan class=\"p\"\u003e(\u003c/span\u003e\u003cspan class=\"mi\"\u003e9\u003c/span\u003e\u003cspan class=\"p\"\u003e):\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e        \u003cspan class=\"k\"\u003eif\u003c/span\u003e \u003cspan class=\"n\"\u003ei\u003c/span\u003e \u003cspan class=\"o\"\u003e\u0026lt;=\u003c/span\u003e \u003cspan class=\"n\"\u003ej\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e            \u003cspan class=\"c1\"\u003e# break out of inner loop\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e            \u003cspan class=\"c1\"\u003e# continue outer 
loop\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e        \u003cspan class=\"nb\"\u003eprint\u003c/span\u003e\u003cspan class=\"p\"\u003e(\u003c/span\u003e\u003cspan class=\"n\"\u003ei\u003c/span\u003e\u003cspan class=\"p\"\u003e,\u003c/span\u003e\u003cspan class=\"n\"\u003ej\u003c/span\u003e\u003cspan class=\"p\"\u003e)\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    \u003cspan class=\"c1\"\u003e# don\u0026#39;t print unless inner loop completes,\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    \u003cspan class=\"c1\"\u003e# e.g. outer loop is not continued\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    \u003cspan class=\"nb\"\u003eprint\u003c/span\u003e\u003cspan class=\"p\"\u003e(\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;inner complete!\u0026#34;\u003c/span\u003e\u003cspan class=\"p\"\u003e)\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eHere, we want to print for all \u003ccode\u003ei\u003c/code\u003e ∈ \u003ccode\u003e[0,10)\u003c/code\u003e all numbers \u003ccode\u003ej\u003c/code\u003e ∈ \u003ccode\u003e[0,9)\u003c/code\u003e that are less than or equal to i and we want to print complete once we\u0026rsquo;ve found an entire list of \u003ccode\u003ej\u003c/code\u003e that meets the criteria. 
While this seems like a fairly contrived example, I\u0026rsquo;ve actually encountered this exact situation in several places in code this week, and I\u0026rsquo;ll provide a real example in a bit.\u003c/p\u003e","title":"Continuing Outer Loops with for/else"},{"content":"This is a follow on to the [prediction distribution]({{ site.base_url }}{% link _posts/2018-02-28-prediction-distribution.md %}) visualization presented in the last post. This visualization shows a bar chart with the number of predicted and number of actual values for each class, e.g. a class balance chart with predicted balance as well.\nThis visualization actually came before the prior visualization, but I was more excited about that one because it showed where error was occurring similar to a classification report or confusion matrix. I\u0026rsquo;ve recently been using this chart for initial spot checking more however, since it gives me a general feel for how balanced both the class and the classifier is with respect to each other. It has also helped diagnose what is being displayed in the heat map chart of the other post.\nThe code follows, again prototype code. However in this code I made an effort to use more scikit-learn tooling in the visualization, including their validation and checking code. Hopefully this will help us eliminate problems with various types of input.\nThis code also shows a cross-validation strategy for getting y_true and y_pred from a classifier. I think this type of code will become a cornerstone in Yellowbrick, so please let us know in the YB issues if you see anything fishy with this methodology!\n","permalink":"https://bbengfort.github.io/2018/03/prediction-balance/","summary":"\u003cp\u003eThis is a follow on to the [prediction distribution]({{ site.base_url }}{% link _posts/2018-02-28-prediction-distribution.md %}) visualization presented in the last post. 
This visualization shows a bar chart with the number of predicted and number of actual values for each class, e.g. a class balance chart with predicted balance as well.\u003c/p\u003e\n\u003cp\u003e\u003cimg loading=\"lazy\" src=\"/images/2018-03-08-cb-preds.png\" alt=\"Class Balance of Actual vs. Predictions\"  /\u003e\n\u003c/p\u003e\n\u003cp\u003eThis visualization actually came before the prior visualization, but I was more excited about that one because it showed where error was occurring similar to a classification report or confusion matrix. I\u0026rsquo;ve recently been using this chart for initial spot checking more however, since it gives me a general feel for how balanced both the class and the classifier is with respect to each other. It has also helped diagnose what is being displayed in the heat map chart of the other post.\u003c/p\u003e","title":"Predicted Class Balance"},{"content":"In this quick snippet I present an alternative to the confusion matrix or classification report visualizations in order to judge the efficacy of multi-class classifiers:\nThe base of the visualization is a class balance chart, the x-axis is the actual (or true class) and the height of the bar chart is the number of instances that match that class in the dataset. The difference here is that each bar is a stacked chart representing the percentage of the predicted class given the actual value. If the predicted color matches the actual color then the classifier was correct, otherwise it was wrong.\nThe code to do this follows. 
This is simple prototype code that we\u0026rsquo;ll be including in Yellowbrick soon and may not work in all cases; nor does it include features for doing cross-validation and putting together the two vectors required for visualization.\nOther interesting things that can be done with this: make the x axis the predicted class instead of the actual class, if classes are ordinal use a heatmap to show predictions over or under the specified class, or find a better way to show \u0026ldquo;correct\u0026rdquo; values with out discrete color values. More investigation on this and an implementation in Yellowbrick soon!\nUpdate: Thanks @lwgray for putting together the pull request for this!\n","permalink":"https://bbengfort.github.io/2018/02/prediction-distribution/","summary":"\u003cp\u003eIn this quick snippet I present an alternative to the \u003ca href=\"http://www.scikit-yb.org/en/latest/api/classifier/confusion_matrix.html\"\u003econfusion matrix\u003c/a\u003e or \u003ca href=\"http://www.scikit-yb.org/en/latest/api/classifier/classification_report.html\"\u003eclassification report\u003c/a\u003e visualizations in order to judge the efficacy of multi-class classifiers:\u003c/p\u003e\n\u003cp\u003e\u003cimg loading=\"lazy\" src=\"/images/2018-02-28-cb-preds-dist.png\" alt=\"Class Balance of Actual vs. Predictions\"  /\u003e\n\u003c/p\u003e\n\u003cp\u003eThe base of the visualization is a class balance chart, the x-axis is the actual (or true class) and the height of the bar chart is the number of instances that match that class in the dataset. The difference here is that each bar is a stacked chart representing the percentage of the predicted class given the actual value. If the predicted color matches the actual color then the classifier was correct, otherwise it was wrong.\u003c/p\u003e","title":"Class Balance Prediction Distribution"},{"content":"This post serves as a reminder of how to perform benchmarks when accounting for synchronized writing in Go. 
The normal benchmarking process involves running a command a large number of times and determining the average amount of time that operation took. When threads come into play, we consider throughput - that is the number of operations that can be conducted per second. However, in order to successfully measure this without duplicating time, the throughput must be measured from the server\u0026rsquo;s perspective.\nLet w be the amount of time a single operation takes and n be the number of operations per thread. Given t threads, the cost for each operation from the perspective of the client thread will be t*w because the server is synchronizing writes as shown in the figure above, e.g. the thread has to wait for t-1 other writes to complete before conducting its write. This means that each thread returns a latency of n*t*w from its perspective. If this is aggregated, the total time is computed as n*w*t^2, even though the real time that has passed is actually n*t*w as shown in the single threaded case.\nFor normal server-client throughput we measure the start time of the first access on the server end and the end time of the last access and the duration as the difference between these two timestamps. We then compute the number of operations conducted in that time to measure throughput. This only works if the server is pegged, e.g. it is not waiting for requests.\nHowever, for synchronization we can\u0026rsquo;t measure at the server since we\u0026rsquo;re trying to determine the cost of locks and scheduling. We\u0026rsquo;ve used an external benchmarking procedure that may underestimate throughput but allows us to measure from the client-side rather than at the server. I\u0026rsquo;ve put up a simple library called syncwrite to benchmark this code.\nHere is the use case: consider a system that has multiple goroutines, each of which want to append to a log file on disk. 
The append-only log determines the sequence or ordering of events in the system, so appends must be atomic and reads to an index in the log, idempotent. The interface of the Log is as follows:\ntype Log interface { Open(path string) error Append(value []byte) error Get(index int) (*Entry, error) Close() error } The log embeds sync.RWMutex to ensure no race conditions occur and that our read/write invariants are met. The Open, Append, and Close methods are all protected by a write lock, and the Get method is protected with a read lock. The structs that implement Log all deal with the disk and in-memory storage in different ways:\nInMemoryLog: appends to an in-memory slice and does not write to disk. FileLog: on open, reads entries from file into in-memory slice and reads from it, writes append to both the slice and the file. LevelDBLog: both writes and reads go to a LevelDB database. In the future (probably in the next post), I will also implement an AsyncLog that wraps a Log and causes write operations to be asynchronous by storing them on a channel and allowing the goroutine to immediately return.\nBenchmarks The benchmarks are associated with an action, which can be one or more operations to disk. In this benchmark we simply evaluate an action that calls the Append method of a log with \u0026quot;foo\u0026quot; as the value. Per-action benchmarks are computed using go-bench, which computes the average time it takes to run the action once:\nBenchmarkInMemoryLog-8 10000000\t210 ns/op BenchmarkFileLog-8 200000\t7456 ns/op BenchmarkLevelDBLog-8 100000\t12379 ns/op Writing to the in-memory log is by far the fastest, while writing to the LevelDB log is by far the slowest operation. We expect throughput, that is the number of operations per second, to be equivalent with these per-action benchmarks in a single thread. The theoretical throughput is simply 1/(w*1e-9), where w is measured in nanoseconds and the 1e-9 factor converts it to seconds. 
The question is how throughput changes with more threads and in a real workload.\nThroughput benchmarks are conducted by running t threads, each of which run n actions and returns the amount of time it takes all threads to run n*t actions. As t increases, the workload stays static, e.g. n becomes smaller to keep n*t constant. The throughput is the number of operations divided by the duration in seconds.\nNote that the y-axis is on a logarithmic scale, and because of the magnitude of in-memory writes, the chart is a bit difficult to read. Therefore the next chart shows the percentage of the theoretical throughput (as computed by w) the real system achieves:\nObservations Looking at the percent theoretical chart, I believe the reason that both In-Memory and LevelDB achieve \u0026gt;100% for up to 4 threads is because the benchmark that computes w has such a high variability; though this does not explain why the File log has dramatically lower throughput.\nBecause the benchmarks were run on a 4 core machine, up to 4 threads for In-Memory and LevelDB can operate without a noticeable decrease in throughput since there is no scheduling issue. However, at 8 threads and above, there is a noticeable drop in the percent of theoretical throughput, probably due to synchronization or scheduling issues. This same drop does not occur in the File log because it was already below its theoretical maximum throughput.\n","permalink":"https://bbengfort.github.io/2018/02/sync-write-throughput/","summary":"\u003cp\u003eThis post serves as a reminder of how to perform benchmarks when accounting for synchronized writing in Go. The normal benchmarking process involves running a command a large number of times and determining the average amount of time that operation took. When threads come into play, we consider \u003cem\u003ethroughput\u003c/em\u003e - that is the number of operations that can be conducted per second. 
However, in order to successfully measure this without duplicating time, the throughput must be measured from the server\u0026rsquo;s perspective.\u003c/p\u003e","title":"Synchronization in Write Throughput"},{"content":"I came across this now archived project that implements a set data structure in Go and was intrigued by the implementation of both thread-safe and non-thread-safe implementations of the same data structure. Recently I\u0026rsquo;ve been attempting to get rid of locks in my code in favor of one master data structure that does all of the synchronization, so having multiple options for thread safety is useful. Previously I did this by having a lower-case method name (a private method) that was non-thread-safe and an upper-case method name (public) that did implement thread-safety. However, as I\u0026rsquo;ve started to reorganize my packages this no longer works.\nThe way that the Set implementation works is that it defines a base data structure that is private, set, as well as an interface (set.Interface) that describes the methods a set is expected to have. The set methods are all private, then two data structures are composed that embed the set (Set and SetNonTS), the thread and non-thread safe versions of set. In this snippet I\u0026rsquo;ll just show a bit of boiler plate code that does this for reference, see the full set implementation for more detail.\nIn the implementation above, the set object provides four internal methods: init() creates the internal map data structure, add updates the map with one or more items, remove deletes one or more items from the map, and contains does a simple check to see if the item is in the internal map. All of these methods are private to the set package.\nThe SetNonTS and Set types embed the set object and add some additional functionality. Both implement a constructor, NewNonTS and New respectively, which call the internal init functions. 
Both also implement Add and Remove, which silently exit if no items are added, the difference being that Set write locks the data structure after performing that check. Contains is also implemented, which the Set data structure read locks before checking.\nThe only small problem with this implementation is that there is a little bit of code duplication (e.g. the checks for no items in the Add and Remove methods). However, I\u0026rsquo;ve noticed in my code that often there are tasks that are done in either thread-safe or non-thread safe versions but not both (like marking a flag or sending data to a channel). Because of this, it\u0026rsquo;s often better to keep those methods separate rather than relying solely on embedding.\n","permalink":"https://bbengfort.github.io/2018/01/go-set/","summary":"\u003cp\u003eI came across this now archived project that implements a \u003ca href=\"https://github.com/fatih/set\"\u003eset data structure in Go\u003c/a\u003e and was intrigued by the implementation of both thread-safe and non-thread-safe implementations of the same data structure. Recently I\u0026rsquo;ve been attempting to get rid of locks in my code in favor of one master data structure that does all of the synchronization, so having multiple options for thread safety is useful. Previously I did this by having a lower-case method name (a private method) that was non-thread-safe and an upper-case method name (public) that did implement thread-safety. However, as I\u0026rsquo;ve started to reorganize my packages this no longer works.\u003c/p\u003e","title":"Thread and Non-Thread Safe Go Set"},{"content":"A recent application I was working on required the management of several configuration and list files that needed to be validated. 
Rather than have the user find and edit these files directly, I wanted to create an editing workflow similar to crontab -e or git commit: the user would call the application, which would redirect to a text editor like vim, then when editing was complete, the application would take over again.\nThis happened to be a Go app, so the following code is in Go, but it would work with any programming language. The workflow is as follows:\nFind an editor executable Copy the original to a temporary file Exec the editor on the temporary file Wait for the editor to be done Validate the temporary file Copy the temporary file to the original location This worked surprisingly well especially for things like YAML files which are structured enough to be validated easily, but human readable enough to edit.\nFirst up, finding an editor executable. I used a three part strategy; first the user could specify the path to an editor in the configuration file (like git), second, the user could set the $EDITOR environment variable, and third, I look for common editors. Here\u0026rsquo;s the code:\nvar editors = []string{\u0026#34;vim\u0026#34;, \u0026#34;emacs\u0026#34;, \u0026#34;nano\u0026#34;} func findEditor() (string, error) { config, err := LoadConfig() if err != nil { return \u0026#34;\u0026#34;, err } if config.Editor != \u0026#34;\u0026#34; { return config.Editor, nil } if editor := os.Getenv(\u0026#34;EDITOR\u0026#34;); editor != \u0026#34;\u0026#34; { return editor, nil } for _, name := range editors { path, err := exec.LookPath(name) if err == nil { return path, nil } } return \u0026#34;\u0026#34;, errors.New(\u0026#34;no editor found\u0026#34;) } The crucial part of this is exec.LookPath which searches the $PATH for the editor and returns the full path to exec it. 
Next up is copying the file:\nfunc copyFile(src, dst string) error { in, err := os.Open(src) if err != nil { return err } defer in.Close() out, err := os.Create(dst) if err != nil { return err } defer out.Close() if _, err = io.Copy(out, in); err != nil { return err } return nil } Finally the full editor workflow:\nfunc EditFile(path string) error { // Find the editor to use editor, err := findEditor() if err != nil { return err } // Create a new temporary directory and ensure we clean up when done. tmpDir, err := os.MkdirTemp(\u0026#34;\u0026#34;, \u0026#34;edit\u0026#34;) if err != nil { return err } defer os.RemoveAll(tmpDir) // Get the temporary file location tmpFile := filepath.Join(tmpDir, filepath.Base(path)) // Copy the original file to the tmpFile if err = copyFile(path, tmpFile); err != nil { return err } // Create the editor command cmd := exec.Command(editor, tmpFile) cmd.Stdin = os.Stdin cmd.Stdout = os.Stdout cmd.Stderr = os.Stderr // Start the editor command and wait for it to finish if err = cmd.Start(); err != nil { return err } if err = cmd.Wait(); err != nil { return err } // Copy the tmp file back to the original file return copyFile(tmpFile, path) } This workflow assumes that the file being edited already exists, but of course you could modify it any number of ways. For example, you could use a template to populate the temporary file (similar to what git does for a commit message), or you could add more validation around input and output.\n","permalink":"https://bbengfort.github.io/2018/01/cli-editor-app/","summary":"\u003cp\u003eA recent application I was working on required the management of several configuration and list files that needed to be validated. 
Rather than have the user find and edit these files directly, I wanted to create an editing workflow similar to \u003ccode\u003ecrontab -e\u003c/code\u003e or \u003ccode\u003egit commit\u003c/code\u003e: the user would call the application, which would redirect to a text editor like vim, then when editing was complete, the application would take over again.\u003c/p\u003e","title":"Git-Style File Editing in CLI"},{"content":"Databases are essential to most applications; however, database interaction is often overlooked by Python developers who use higher level libraries like Django or SQLAlchemy. We use and love PostgreSQL with Psycopg2, but I recently realized that I didn\u0026rsquo;t have a good grasp on how exactly psycopg2 implemented core database concepts: particularly transaction isolation and thread safety.\nHere\u0026rsquo;s what the documentation says regarding transactions:\nTransactions are handled by the connection class. By default, the first time a command is sent to the database (using one of the cursors created by the connection), a new transaction is created. The following database commands will be executed in the context of the same transaction – not only the commands issued by the first cursor, but the ones issued by all the cursors created by the same connection. Should any command fail, the transaction will be aborted and no further command will be executed until a call to the rollback() method.\nTransactions are therefore connection specific. When you create a connection, you can create multiple cursors, the transaction begins when the first cursor issues an execute \u0026ndash; all commands executed by all cursors after that are part of the same transaction until commit or rollback. 
After either of these methods is called, the next transaction is started on the next execute call.\nThis brings up a very important point:\nBy default even a simple SELECT will start a transaction: in long-running programs, if no further action is taken, the session will remain “idle in transaction”, an undesirable condition for several reasons (locks are held by the session, tables bloat…). For long lived scripts, either make sure to terminate a transaction as soon as possible or use an autocommit connection.\nThis seems to indicate that when working directly with psycopg2, understanding transactions is essential to writing stable scripts. This post therefore details my notes and techniques for working more effectively with PostgreSQL from Python.\nDatabase Preliminaries In order to demonstrate the code in this blog post, we need a database. The classic database example taught to undergraduates is that of a bank account, so we\u0026rsquo;ll continue with that theme here! Sorry if this part is tedious, feel free to skip ahead. 
In a file, schema.sql, I defined the following schema as DDL (data definition language):\nDROP TABLE IF EXISTS users CASCADE; CREATE TABLE users ( id SERIAL PRIMARY KEY, username VARCHAR(255) UNIQUE, pin SMALLINT NOT NULL ); DROP TYPE IF EXISTS account_type CASCADE; CREATE TYPE account_type AS ENUM (\u0026#39;checking\u0026#39;, \u0026#39;savings\u0026#39;); DROP TABLE IF EXISTS accounts CASCADE; CREATE TABLE accounts ( id SERIAL PRIMARY KEY, type account_type, owner_id INTEGER NOT NULL, balance NUMERIC DEFAULT 0.0, CONSTRAINT positive_balance CHECK (balance \u0026gt;= 0), FOREIGN KEY (owner_id) REFERENCES users (id) ); DROP TYPE IF EXISTS ledger_type CASCADE; CREATE TYPE ledger_type AS ENUM (\u0026#39;credit\u0026#39;, \u0026#39;debit\u0026#39;); DROP TABLE IF EXISTS ledger; CREATE TABLE ledger ( id SERIAL PRIMARY KEY, account_id INTEGER NOT NULL, date DATE NOT NULL DEFAULT CURRENT_DATE, type ledger_type NOT NULL, amount NUMERIC NOT NULL, FOREIGN KEY (account_id) REFERENCES accounts (id) ); This creates a simple database with three tables: users, accounts, and ledger. The users table contains a PIN code for verification. Users can have one or more accounts, and accounts have the constraint that the balance can never fall below $0.00. The ledger table records each credit or debit applied to an account. 
We can also seed the database with some initial data:\nINSERT INTO users (id, username, pin) VALUES (1, \u0026#39;alice\u0026#39;, 1234), (2, \u0026#39;bob\u0026#39;, 9999); INSERT INTO accounts (type, owner_id, balance) VALUES (\u0026#39;checking\u0026#39;, 1, 250.0), (\u0026#39;savings\u0026#39;, 1, 5.00), (\u0026#39;checking\u0026#39;, 2, 100.0), (\u0026#39;savings\u0026#39;, 2, 2342.13); Moving to Python code we can add some template code to allow us to connect to the database and execute the SQL in our file above:\nimport os import psycopg2 as pg def connect(env=\u0026#34;DATABASE_URL\u0026#34;): url = os.getenv(env) if not url: raise ValueError(\u0026#34;no database url specified\u0026#34;) return pg.connect(url) def createdb(conn, schema=\u0026#34;schema.sql\u0026#34;): with open(schema, \u0026#39;r\u0026#39;) as f: sql = f.read() try: with conn.cursor() as curs: curs.execute(sql) conn.commit() except Exception as e: conn.rollback() raise e The connect function looks for the database connection string in the environment variable $DATABASE_URL. Because database configuration code can contain passwords and network information it is always best to store it in the environment or in a local, secure configuration file that can only be accessed by the process and not checked in with code. The connection string should look something like: postgresql://user@localhost:5432/dbname.\nThe createdb function reads the SQL from the schema.sql file and executes it against the database. Note this is why we have the DROP TABLE IF EXISTS statements, so we can guarantee we always start with a fresh database when we run this script. This function also gives us our first glance at transactions and database interaction with Python.\nComplying with PEP 249 we create a connection to the database, then create a cursor from the connection. Cursors manage the execution of SQL against the database as well as data retrieval. 
We execute the SQL in our schema file, committing the transaction if no exceptions are raised, and rolling back if it fails. We will explore this more in the next section.\nTransaction Management A transaction consists of one or more related operations that represent a single unit of work. For example, in the bank account example you might have a deposit transaction that executes queries to look up the account and verify the user, add a record to a list of daily deposits, check if the daily deposit limit has been reached, then modify the account balance. Together, these operations represent all of the steps required to perform a deposit.\nThe goal of a transaction is that when the transaction is complete, the database remains in a single consistent state. Consistency is often defined by invariants or constraints that describe at a higher level how the database should maintain information. From a programming perspective, if those constraints are violated an exception is raised. For example, the database has a positive_balance constraint, if the balance for an account goes below zero an exception is raised. When this constraint is violated the database must remain unchanged and all operations performed by the transaction must be rolled back. If the transaction was successful we can then commit the changes, which guarantees that the database has successfully applied our operation.\nSo why do we need to manage transactions? 
Consider the following code:\nconn = connect() curs = conn.cursor() try: # Execute a command that will raise a constraint curs.execute(\u0026#34;UPDATE accounts SET balance=%s\u0026#34;, (-130.935,)) except Exception as e: print(e) # Constraint exception # Execute another command, but because of the previous exception: curs = conn.cursor() try: curs.execute(\u0026#34;SELECT id, type FROM accounts WHERE owner_id=%s\u0026#34;, (1,)) except pg.InternalError as e: print(e) The first curs.execute triggers the constraint exception, which is caught and printed. However, the database is now in an inconsistent state. When you try to execute the second query, a psycopg2.InternalError is raised: \u0026quot;current transaction is aborted, commands ignored until end of transaction block\u0026quot;. In order to continue with the application, conn.rollback() needs to be called to end the transaction and start a new one.\nNOTE: Using with conn.cursor() as curs: causes the same behavior, the context manager does not automatically clean up the state of the transaction.\nThis essentially means all transactions can be wrapped in a try block, if they conclude successfully they can be committed, however if they raise an exception, they must be rolled back. A basic decorator that does this is as follows:\nfrom functools import wraps def transaction(func): @wraps(func) def inner(*args, **kwargs): conn = connect() try: func(conn, *args, **kwargs) conn.commit() except Exception as e: conn.rollback() log.error(\u0026#34;{} error: {}\u0026#34;.format(func.__name__, e)) finally: conn.close() return inner This decorator wraps the specified function, returning an inner function that injects a new connection as the first argument to the decorated function. If the decorated function raises an exception, the transaction is rolled back and the error is logged.\nThe decorator method is nice but the connection injection can be a bit weird. 
An alternative is a context manager that ensures the connection is committed or rolled back in a similar fashion:\nfrom contextlib import contextmanager @contextmanager def transaction(): try: conn = connect() yield conn conn.commit() except Exception as e: conn.rollback() log.error(\u0026#34;db error: {}\u0026#34;.format(e)) finally: conn.close() This allows you to write code using with as follows:\nwith transaction() as conn: # do transaction The context manager allows you to easily compose two transactions inside a single function — of course this may be against the point. However, it is no problem to combine both the decorator and the context manager methods into two steps (more on this in isolation levels).\nATM Application So let\u0026rsquo;s talk about two specific transactions for an imaginary database application: deposit and withdraw. Each of these operations has several steps:\nValidate the user with the associated PIN Ensure the user owns the account being modified Write a ledger record with the credit or debit being applied On credit, ensure the daily deposit limit isn\u0026rsquo;t reached Modify the balance of the account Fetch the current balance to display to the user Each transaction will perform 6-7 distinct SQL queries: SELECT, INSERT, and UPDATE. If any of them fails, then the database should remain completely unchanged. Failure in this case is that an exception is raised, which is potentially the easiest thing to do when you have a stack of functions calling other functions. 
Let\u0026rsquo;s look at deposit first:\n@transaction def deposit(conn, user, pin, account, amount): # Step 1: authenticate the user via pin and verify account ownership authenticate(conn, user, pin, account) # Step 2: add the ledger record with the credit ledger(conn, account, \u0026#34;credit\u0026#34;, amount) # Step 3: update the account value by adding the amount update_balance(conn, account, amount) # Fetch the current balance in the account and log it record = \u0026#34;withdraw ${:0.2f} from account {} | current balance: ${:0.2f}\u0026#34; log.info(record.format(amount, account, balance(conn, account))) This function simply calls other functions, passing the transaction context (in this case a connection as well as input details) to other functions which may or may not raise exceptions. Here are the two authenticate methods:\ndef authenticate(conn, user, pin, account=None): \u0026#34;\u0026#34;\u0026#34; Returns an account id if the name is found and if the pin matches. \u0026#34;\u0026#34;\u0026#34; with conn.cursor() as curs: sql = \u0026#34;SELECT 1 AS authd FROM users WHERE username=%s AND pin=%s\u0026#34; curs.execute(sql, (user, pin)) if curs.fetchone() is None: raise ValueError(\u0026#34;could not validate user via PIN\u0026#34;) return True if account: # Verify account ownership if account is provided verify_account(conn, user, account) def verify_account(conn, user, account): \u0026#34;\u0026#34;\u0026#34; Verify that the account is held by the user. 
\u0026#34;\u0026#34;\u0026#34; with conn.cursor() as curs: sql = ( \u0026#34;SELECT 1 AS verified FROM accounts a \u0026#34; \u0026#34;JOIN users u on u.id = a.owner_id \u0026#34; \u0026#34;WHERE u.username=%s AND a.id=%s\u0026#34; ) curs.execute(sql, (user, account)) if curs.fetchone() is None: raise ValueError(\u0026#34;account belonging to user not found\u0026#34;) return True The authenticate and verify_account functions basically look in the database to see if there is a record that matches the conditions — a user with a matching PIN in authenticate and a (user, account_id) pair in verify_account. Both of these functions rely on the UNIQUE constraint in the database for usernames and account ids. This example shows how the function call stack can get arbitrarily deep; verify_account is called by authenticate which is called by deposit. By raising an exception at any point in the stack, the transaction will proceed no further, protecting us from harm later in the transaction.\nNote also that neither of these functions have an @transaction decorator, this is because it is expected that they are called from within another transaction. They are independent operations, but they can be called independently in a transaction with the context manager.\nNext we insert a ledger record:\nMAX_DEPOSIT_LIMIT = 1000.00 def ledger(conn, account, record, amount): \u0026#34;\u0026#34;\u0026#34; Add a ledger record with the amount being credited or debited. \u0026#34;\u0026#34;\u0026#34; # Perform the insert with conn.cursor() as curs: sql = \u0026#34;INSERT INTO ledger (account_id, type, amount) VALUES (%s, %s, %s)\u0026#34; curs.execute(sql, (account, record, amount)) # If we are crediting the account, perform daily deposit verification if record == \u0026#34;credit\u0026#34;: check_daily_deposit(conn, account) def check_daily_deposit(conn, account): \u0026#34;\u0026#34;\u0026#34; Raise an exception if the deposit limit has been exceeded. 
\u0026#34;\u0026#34;\u0026#34; with conn.cursor() as curs: sql = ( \u0026#34;SELECT amount FROM ledger \u0026#34; \u0026#34;WHERE date=now()::date AND type=\u0026#39;credit\u0026#39; AND account_id=%s\u0026#34; ) curs.execute(sql, (account,)) total = sum(row[0] for row in curs.fetchall()) if total \u0026gt; MAX_DEPOSIT_LIMIT: raise Exception(\u0026#34;daily deposit limit has been exceeded!\u0026#34;) This is the first place that we modify the state of the database by inserting a ledger record. If, when we check_daily_deposit, we discover that our deposit limit has been exceeded for the day, an exception is raised that will roll back the transaction. This will ensure that the ledger record is not accidentally stored on disk. Finally we update the account balance:\nfrom decimal import Decimal def update_balance(conn, account, amount): \u0026#34;\u0026#34;\u0026#34; Add the amount (or subtract if negative) to the account balance. \u0026#34;\u0026#34;\u0026#34; amount = Decimal(amount) with conn.cursor() as curs: current = balance(conn, account) sql = \u0026#34;UPDATE accounts SET balance=%s WHERE id=%s\u0026#34; curs.execute(sql, (current+amount, account)) def balance(conn, account): with conn.cursor() as curs: curs.execute(\u0026#34;SELECT balance FROM accounts WHERE id=%s\u0026#34;, (account,)) return curs.fetchone()[0] I\u0026rsquo;ll have more to say on update_balance when we discuss isolation levels, but suffice it to say, this is another place where if the transaction fails we want to ensure that our account is not modified! 
In order to complete the example, here is the withdraw transaction:\n@transaction def withdraw(conn, user, pin, account, amount): # Step 1: authenticate the user via pin and verify account ownership authenticate(conn, user, pin, account) # Step 2: add the ledger record with the debit ledger(conn, account, \u0026#34;debit\u0026#34;, amount) # Step 3: update the account value by subtracting the amount update_balance(conn, account, amount * -1) # Fetch the current balance in the account and log it record = \u0026#34;withdraw ${:0.2f} from account {} | current balance: ${:0.2f}\u0026#34; log.info(record.format(amount, account, balance(conn, account))) This is similar but modifies the inputs to the various operations to decrease the amount of the account by a debit ledger record. We can run:\nif __name__ == \u0026#39;__main__\u0026#39;: conn = connect() createdb(conn) # Successful deposit deposit(\u0026#39;alice\u0026#39;, 1234, 1, 785.0) # Successful withdrawal withdraw(\u0026#39;alice\u0026#39;, 1234, 1, 230.0) # Unsuccessful deposit deposit(\u0026#39;alice\u0026#39;, 1234, 1, 489.0) # Successful deposit deposit(\u0026#39;bob\u0026#39;, 9999, 2, 220.23) And we should see the following log records:\n2017-12-06 20:01:00,086 withdraw $785.00 from account 1 | current balance: $1035.00 2017-12-06 20:01:00,094 withdraw error: could not validate user via PIN 2017-12-06 20:01:00,103 withdraw $230.00 from account 1 | current balance: $805.00 2017-12-06 20:01:00,118 deposit error: daily deposit limit has been exceeded! 2017-12-06 20:01:00,130 withdraw $220.23 from account 2 | current balance: $225.23 This should set a baseline for creating simple and easy to use transactions in Python. However, if you remember your databases class as an undergraduate, things get more interesting when two transactions are occurring at the same time. 
We\u0026rsquo;ll explore that from a single process by looking at multi-threaded database connections.\nThreads Let\u0026rsquo;s consider how to run two transactions at the same time from within the same application. The simplest way to do this is to use the threading library to execute transactions simultaneously. How do you achieve thread safety when accessing the database? Back to the docs:\nConnection objects are thread-safe: many threads can access the same database either using separate sessions and creating a connection per thread or using the same connection and creating separate cursors. In DB API 2.0 parlance, Psycopg is level 2 thread safe.\nThis means that every thread must have its own conn object (which we explore in the connection pool section). Any cursor created from the same connection object will be in the same transaction no matter the thread. We also want to consider how each transaction influences each other, and we\u0026rsquo;ll take a look at that first by exploring isolation levels and session state.\nSession State Let\u0026rsquo;s say that Alice and Charlie have a joint account, under Alice\u0026rsquo;s name. They both show up to ATMs at the same time, Alice tries to deposit $75 and then withdraw $25 and Charlie attempts to withdraw $300. We can simulate this with threads as follows:\nimport time import random import threading def op1(): time.sleep(random.random()) withdraw(\u0026#39;alice\u0026#39;, 1234, 1, 300.0) def op2(): time.sleep(random.random()) deposit(\u0026#39;alice\u0026#39;, 1234, 1, 75.0) withdraw(\u0026#39;alice\u0026#39;, 1234, 1, 25.0) threads = [ threading.Thread(target=op1), threading.Thread(target=op2), ] for t in threads: t.start() for t in threads: t.join() Depending on the timing, one of two things can happen. 
Charlie can get rejected as not having enough money in his account, and the final state of the database can be $300 or all transactions can succeed with the final state of the database set to $0. 
There are three transactions happening, two withdraw transactions and a deposit. Each of these transactions runs in isolation, meaning that they see the database how they started and any changes that they make; so if Charlie\u0026rsquo;s withdraw and Alice\u0026rsquo;s deposit happen simultaneously, Charlie will be rejected since it doesn\u0026rsquo;t know about the deposit until it\u0026rsquo;s finished. No matter what, the database will be left in the same state.\nHowever, for performance reasons, you may want to modify the isolation level for a particular transaction. Possible levels are as follows:\nREAD UNCOMMITTED: lowest isolation level, transaction may read values that are not yet committed (and may never be committed). READ COMMITTED: write locks are maintained but read locks are released after select, meaning two different values can be read in different parts of the transaction. REPEATABLE READ: keep both read and write locks so multiple reads return same values but phantom reads can occur. SERIALIZABLE: the highest isolation level: read, write, and range locks are maintained until the end of the transaction. DEFAULT: set by server configuration not Python, usually READ COMMITTED. Note that as the isolation level increases, the number of locks being maintained also increases, which severely impacts performance if there is lock contention or deadlocks. It is possible to set the isolation level on a per-transaction basis in order to improve performance of all transactions happening concurrently. To do this we must modify the session parameters on the connection, which modify the behavior of the transaction or statements that follow in that particular session. Additionally we can set the session to readonly, which does not allow writes to temporary tables (for performance and security) or to deferrable.\nDeferrability is very interesting in a transaction, because it modifies how database constraints are checked. 
Non-deferrable transactions immediately check the constraint after a statement is executed. This means that UPDATE accounts SET balance=-5.45 will immediately raise an exception. Deferrable transactions however wait until the transaction is concluded before checking the constraints. This allows you to write multiple overlapping operations that may put the database into a correct state by the end of the transaction, but potentially not during the transaction (this also overlaps with the performance of various isolation levels).\nIn order to change the session, we\u0026rsquo;ll use a context manager as we did before to modify the session for the transaction, then reset the session back to the defaults:\n@contextmanager def session(conn, isolation_level=None, readonly=None, deferrable=None): try: conn.set_session( isolation_level=isolation_level, readonly=readonly, deferrable=deferrable ) yield conn finally: # Reset the session to defaults conn.set_session(None, None, None, None) We can then use with to conduct transactions with different isolation levels:\nwith transaction() as conn: with session(conn, isolation_level=\u0026#34;READ COMMITTED\u0026#34;) as conn: # Do transaction NOTE: There cannot be an ongoing transaction when the session is set therefore it is more common for me to set the isolation level, readonly, and deferrable inside of the transaction decorator, rather than using two separate context managers as shown above. Frankly, it is also common to set these properties on a per-process basis rather than on a per-transaction basis, therefore the session is set in connect.\nConnection Pools Connections cannot be shared across threads. 
Suppose that in the threading example above we remove the @transaction decorator and pass the same connection into both operations as follows:\nconn = connect() def op1(): time.sleep(random.random()) withdraw(conn, \u0026#39;alice\u0026#39;, 1234, 1, 300.0) def op2(): time.sleep(random.random()) deposit(conn, \u0026#39;alice\u0026#39;, 1234, 1, 75.0) withdraw(conn, \u0026#39;alice\u0026#39;, 1234, 1, 25.0) If the op1 withdraw fires first, the exception will cause all of the op2 statements to also fail, since it\u0026rsquo;s in the same transaction. This essentially means that both op1 and op2 are in the same transaction even though they are in different threads!\nWe\u0026rsquo;ve avoided this so far by creating a new connection every time a transaction runs. However, connecting to the database can be expensive and in high-transaction workloads we may want to simply keep connections open, but ensure each one is only used by one transaction at a time. The solution is to use connection pools. We can modify our connect function as follows:\nfrom psycopg2.pool import ThreadedConnectionPool def connect(env=\u0026#34;DATABASE_URL\u0026#34;, connections=2): \u0026#34;\u0026#34;\u0026#34; Connect to the database using an environment variable. \u0026#34;\u0026#34;\u0026#34; url = os.getenv(env) if not url: raise ValueError(\u0026#34;no database url specified\u0026#34;) minconns = connections maxconns = connections * 2 return ThreadedConnectionPool(minconns, maxconns, url) This creates a thread-safe connection pool that establishes at least 2 connections and will go up to a maximum of 4 connections on demand. 
In order to use the pool object in our transaction decorator, we will have to connect when the decorator is imported, creating a global pool object:\npool = connect() @contextmanager def transaction(name=\u0026#34;transaction\u0026#34;, **kwargs): # Get the session parameters from the kwargs options = { \u0026#34;isolation_level\u0026#34;: kwargs.get(\u0026#34;isolation_level\u0026#34;, None), \u0026#34;readonly\u0026#34;: kwargs.get(\u0026#34;readonly\u0026#34;, None), \u0026#34;deferrable\u0026#34;: kwargs.get(\u0026#34;deferrable\u0026#34;, None), } try: conn = pool.getconn() conn.set_session(**options) yield conn conn.commit() except Exception as e: conn.rollback() log.error(\u0026#34;{} error: {}\u0026#34;.format(name, e)) finally: conn.reset() pool.putconn(conn) Using pool.getconn retrieves a connection from the pool (if one is available, blocking until one is ready), then when we\u0026rsquo;re done we can pool.putconn to release the connection object.\nConclusion This has been a ton of notes on more direct usage of psycopg2. Sorry I couldn\u0026rsquo;t write a more conclusive conclusion but it\u0026rsquo;s late and this post is now close to 4k words. Time to go get dinner!\nNotes I used logging as the primary output to this application. The logging was set up as follows:\nimport logging LOG_FORMAT = \u0026#34;%(asctime)s %(message)s\u0026#34; logging.basicConfig(level=logging.INFO, format=LOG_FORMAT) log = logging.getLogger(\u0026#39;balance\u0026#39;) For the complete code, see this gist.\n","permalink":"https://bbengfort.github.io/2017/12/psycopg2-transactions/","summary":"\u003cp\u003eDatabases are essential to most applications, however most database interaction is often overlooked by Python developers who use higher level libraries like Django or SQLAlchemy. 
We use and love PostgreSQL with Psycopg2, but I recently realized that I didn\u0026rsquo;t have a good grasp on how exactly psycopg2 implemented core database concepts: particularly transaction isolation and thread safety.\u003c/p\u003e\n\u003cp\u003eHere\u0026rsquo;s what the documentation says regarding transactions:\u003c/p\u003e\n\u003cblockquote\u003e\n\u003cp\u003eTransactions are handled by the connection class. By default, the first time a command is sent to the database (using one of the cursors created by the connection), a new transaction is created. The following database commands will be executed in the context of the same transaction – not only the commands issued by the first cursor, but the ones issued by all the cursors created by the same connection. Should any command fail, the transaction will be aborted and no further command will be executed until a call to the rollback() method.\u003c/p\u003e","title":"Transaction Handling with Psycopg2"},{"content":"By now it\u0026rsquo;s pretty clear that I\u0026rsquo;ve just had a bear of a time with locks and synchronization inside of multi-threaded environments with Go. Probably most gophers would simply tell me that I should share memory by communicating rather than communicate by sharing memory — and frankly I\u0026rsquo;m in that camp too. The issue is that:\nMutexes can be more expressive than channels Channels are fairly heavyweight So to be honest, there are situations where a mutex is a better choice than a channel. I believe that one of those situations is when dealing with replicated state machines … which is what I\u0026rsquo;ve been working on the past few months. The issue is that the state of the replica has to be consistent across a variety of events: timers and remote messages. 
The problem is that the timers and network traffic are all go routines, and there can be a lot of them running in the system at a time.\nOf course you could simply create a channel and push event objects to it to serialize all events. The problem with that is that events generate other events that have to be ordered with respect to the parent event. For example, one event might require the generation of messages to be sent to remote replicas, which requires a per-remote state read that is variable. Said another way, the state can be read locked for all go routines operating at that time, but no write locks can be acquired. Hence the last post.\nThings got complicated. Lock contention was a thing.\nSo I had to diagnose who was trying to acquire locks and when and why they were contending. For reference, the most common issues were:\nA global read lock was being released before all sub read locks were finished. A struct with an embedded RWMutex was then embedded by another object with only a Mutex but it still had RLock() methods as a result (or vice versa). The wrong lock was being called on embedded structs. The primary lesson I learned was this: when embedding synchronized objects, only embed the mutex on the child object. Hopefully that rule of thumb lasts.\nI learned these lessons using a handy little diagnostic tool that this snippet is about. Basically I wanted to track who was acquiring locks and who was waiting on locks. I could then print out a report when I thought something was contending (e.g. on an Interrupt signal) and figure things out.\nFirst step, figure out the name of the calling method:\n// caller returns the name of the function that called the function which // called the caller function. 
func caller() string { pc, _, _, ok := runtime.Caller(2) details := runtime.FuncForPC(pc) if ok \u0026amp;\u0026amp; details != nil { return details.Name() } return UnknownCaller } This handy little snippet uses the runtime package to detect the caller two steps above the caller() function in the stack. This allows you to call caller() inside of a function to get the name of the function that\u0026rsquo;s calling the function calling caller(). Confusing? Try this:\nfunc outer() string { return inner() } func inner() string { return caller() } Calling outer() will return something like main.outer — the function that called the inner() function. Here is a runnable example.\nWith that in hand we can simply create a map[string]int64 and increment any calls by caller name before Lock() and decrement any calls by caller name after Unlock(). Here is the example:\nBut … that\u0026rsquo;s actually a little more complicated than I let on!\nThe problem is that we definitely have multiple go routines calling locks on the lockable struct. However, if we simply try to access the map in the MutexD, then we can have a panic for concurrent map reads and writes. So now, I use the share memory by communicating technique and pass signals via an internal channel, which is read by a go routine ranging over it.\nHow to use it? 
Well, do something like this:\ntype StateMachine struct { MutexD } func (s *StateMachine) Alpha() { s.Lock() defer s.Unlock() time.Sleep(1*time.Second) } func (s *StateMachine) Bravo() { s.Lock() defer s.Unlock() time.Sleep(100*time.Millisecond) } func main() { m := new(StateMachine) go m.Alpha() time.Sleep(100*time.Millisecond) for i:=0; i \u0026lt; 2; i++ { go m.Bravo() } fmt.Println(m.MutexD.String()) } You should see something like:\n1 locks requested by main.(*StateMachine).Alpha 2 locks requested by main.(*StateMachine).Bravo Obviously you can do the same thing for RWMutex objects, and it\u0026rsquo;s easy to swap them in and out of code by changing the package and adding or removing a \u0026ldquo;D\u0026rdquo;. My implementation is here: github.com/bbengfort/x/lock.\n","permalink":"https://bbengfort.github.io/2017/09/lock-diagnostics/","summary":"\u003cp\u003eBy now it\u0026rsquo;s pretty clear that I\u0026rsquo;ve just had a bear of a time with locks and synchronization inside of multi-threaded environments with Go. Probably most gophers would simply tell me that I should share memory by communicating rather than communicate by sharing memory — and frankly I\u0026rsquo;m in that camp too. The issue is that:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eMutexes can be more expressive than channels\u003c/li\u003e\n\u003cli\u003eChannels are fairly heavyweight\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eSo to be honest, there are situations where a mutex is a better choice than a channel. I believe that one of those situations is when dealing with replicated state machines … which is what I\u0026rsquo;ve been working on the past few months. The issue is that the state of the replica has to be consistent across a variety of events: timers and remote messages. 
The problem is that the timers and network traffic are all go routines, and there can be a lot of them running in the system at a time.\u003c/p\u003e","title":"Lock Diagnostics in Go"},{"content":"In Go, you can use sync.Mutex and sync.RWMutex objects to create thread-safe data structures in memory as discussed in [“Synchronizing Structs for Safe Concurrency in Go”]({% post_url 2017-02-21-synchronizing-structs %}). When using the sync.RWMutex in Go, there are two kinds of locks: read locks and write locks. The basic difference is that many read locks can be acquired at the same time, but only one write lock can be acquired at a time.\nThis means that if a thread attempts to acquire a read lock on an object that is already read locked, then it will not block and it will acquire its own read lock. If a thread attempts to acquire a read or a write lock on a write locked object, then it will block until it is unlocked (as will a write lock acquisition on a read locked object).\nGranting a lock can be prioritized depending on different policies for accesses. Priorities balance the trade-off between concurrency and starvation as follows:\nRead-Preferring RW allows new read locks to be acquired as long as the lock is read-locked, forcing the write-lock acquirer to wait until there are no more read-locks. In high contention environments, this might lead to write-starvation.\nWrite-Preferring RW prevents a read-lock acquisition if a writer is queued and waiting for the lock. This reduces concurrency, because new read locks have to wait for the write lock, but prevents starvation.\nSo which of these does Go implement? According to the documentation:\nIf a goroutine holds a RWMutex for reading and another goroutine might call Lock, no goroutine should expect to be able to acquire a read lock until the initial read lock is released. In particular, this prohibits recursive read locking. 
This is to ensure that the lock eventually becomes available; a blocked Lock call excludes new readers from acquiring the lock. — godoc\nMy initial read of this made me think that Go implements write-preferring mutexes. However, this was not the behavior that I observed.\nConsider the following locker:\nvar delay time.Duration var started time.Time // Locker holds values that are threadsafe type Locker struct { sync.RWMutex value uint64 // the current value of the locker access time.Time // time of the last access } // Write to the value of the locker in a threadsafe fashion. func (l *Locker) Write(value uint64) { l.Lock() defer l.Unlock() // Arbitrary amount of work time.Sleep(delay) l.value = value l.access = time.Now() l.log(\u0026#34;written\u0026#34;) } // Read the value of the locker in a threadsafe fashion. func (l *Locker) Read() uint64 { l.RLock() defer l.RUnlock() // Arbitrary amount of work time.Sleep(delay / 2) l.access = time.Now() l.log(\u0026#34;read\u0026#34;) return l.value } // Log the access (not thread-safe) func (l *Locker) log(method string) { after := l.access.Sub(started) log.Printf( \u0026#34;%d %s after %s\\n\u0026#34;, l.value, method, after, ) } This locker holds a value and logs all accesses to it after the start time. 
If we run a few threads to read and write to it we can see concurrent reads in action:\nfunc main() { delay = 1 * time.Second started = time.Now() group := new(errgroup.Group) locker := new(Locker) // Straight forward, write three reads and a write group.Go(func() error { locker.Write(42); return nil }) group.Go(func() error { locker.Read(); return nil }) group.Go(func() error { locker.Read(); return nil }) group.Go(func() error { locker.Read(); return nil }) group.Go(func() error { locker.Write(101); return nil }) group.Wait() } The output is as follows\n$ go run locker.go 2017/09/08 12:26:32 101 written after 1.005058824s 2017/09/08 12:26:33 101 read after 1.50770225s 2017/09/08 12:26:33 101 read after 1.507769109s 2017/09/08 12:26:33 101 read after 1.50773587s 2017/09/08 12:26:34 42 written after 2.511968581s Note that the last go routine actually managed to acquire the lock first, after which the three readers managed to acquire the lock, then finally the last writer. Now if we interleave the read and write access, adding a sleep between the kick-off of each go routine to ensure that the preceding thread has time to acquire the lock:\nfunc main() { delay = 1 * time.Second started = time.Now() group := new(errgroup.Group) locker := new(Locker) // Straight forward, write three reads and a write group.Go(func() error { locker.Write(42); return nil }) time.Sleep(10 * time.Millisecond) group.Go(func() error { locker.Read(); return nil }) time.Sleep(10 * time.Millisecond) group.Go(func() error { locker.Write(101); return nil }) time.Sleep(10 * time.Millisecond) group.Go(func() error { locker.Read(); return nil }) time.Sleep(10 * time.Millisecond) group.Go(func() error { locker.Write(3); return nil }) time.Sleep(10 * time.Millisecond) group.Go(func() error { locker.Read(); return nil }) time.Sleep(10 * time.Millisecond) group.Go(func() error { locker.Write(18); return nil }) time.Sleep(10 * time.Millisecond) group.Go(func() error { locker.Read(); return nil }) 
group.Wait() } We get the following output:\ngo run locker.go 2017/09/08 12:29:28 42 written after 1.000178155s 2017/09/08 12:29:28 42 read after 1.500703007s 2017/09/08 12:29:28 42 read after 1.500691088s 2017/09/08 12:29:28 42 read after 1.500756144s 2017/09/08 12:29:28 42 read after 1.500648159s 2017/09/08 12:29:28 42 read after 1.500762323s 2017/09/08 12:29:28 42 read after 1.500679533s 2017/09/08 12:29:28 42 read after 1.500795204s 2017/09/08 12:29:29 101 written after 2.500971593s 2017/09/08 12:29:30 3 written after 3.505325487s 2017/09/08 12:29:31 18 written after 4.50594131s This suggests that the reads continue to acquire locks as long as the Locker is read locked, forcing the writes to happen at the end.\nI found one Stack Overflow post: “Read preferring RW mutex lock in Golang” that seems to suggest that sync.RWMutex can implement both read and write preferred locking, but doesn\u0026rsquo;t really give an explanation about how external callers can implement it.\nFinally consider the following:\nfunc main() { delay = 1 * time.Second started = time.Now() group := new(errgroup.Group) locker := new(Locker) // Straight forward, write three reads and a write group.Go(func() error { locker.Write(42); return nil }) group.Go(func() error { locker.Write(101); return nil }) group.Go(func() error { locker.Write(3); return nil }) group.Go(func() error { locker.Write(18); return nil }) for i := 0; i \u0026lt; 22; i++ { group.Go(func() error { locker.Read() return nil }) time.Sleep(delay / 4) } group.Wait() } Given the loop issuing 22 read locks that sleep only a quarter of the time of the write lock, we might expect that this code will issue 22 read locks then all the write locks will occur at the end (and if we put this in a forever loop, then the writes would never occur). 
However, the output of this is as follows:\n2017/09/08 12:43:40 18 written after 1.004461829s 2017/09/08 12:43:40 18 read after 1.508343716s 2017/09/08 12:43:40 18 read after 1.50842899s 2017/09/08 12:43:40 18 read after 1.508362345s 2017/09/08 12:43:40 18 read after 1.508339659s 2017/09/08 12:43:40 18 read after 1.50852229s 2017/09/08 12:43:41 42 written after 2.513789339s 2017/09/08 12:43:42 42 read after 3.0163191s 2017/09/08 12:43:42 42 read after 3.016330534s 2017/09/08 12:43:42 42 read after 3.016355628s 2017/09/08 12:43:42 42 read after 3.016371381s 2017/09/08 12:43:42 42 read after 3.016316992s 2017/09/08 12:43:43 3 written after 4.017954589s 2017/09/08 12:43:43 3 read after 4.518495233s 2017/09/08 12:43:43 3 read after 4.518523255s 2017/09/08 12:43:43 3 read after 4.518537387s 2017/09/08 12:43:43 3 read after 4.518540397s 2017/09/08 12:43:43 3 read after 4.518543262s 2017/09/08 12:43:43 3 read after 4.51863128s 2017/09/08 12:43:44 101 written after 5.521872765s 2017/09/08 12:43:45 101 read after 6.023207828s 2017/09/08 12:43:45 101 read after 6.023225272s 2017/09/08 12:43:45 101 read after 6.023249529s 2017/09/08 12:43:45 101 read after 6.023190828s 2017/09/08 12:43:45 101 read after 6.023243032s 2017/09/08 12:43:45 101 read after 6.023190457s 2017/09/08 12:43:45 101 read after 6.04455716s 2017/09/08 12:43:45 101 read after 6.29457923s What Go implements is actually something else: read locks can only be acquired so long as the original read lock is maintained (the word “initial” being critical in the documentation). As soon as the first read lock is released, then the queued write-lock gets priority. 
The first read lock lasts approximately 500ms; this means that there is enough time for 4-5 other readers to acquire a read lock. As soon as the first read lock completes, the write is given priority.\n","permalink":"https://bbengfort.github.io/2017/09/lock-queueing/","summary":"\u003cp\u003eIn Go, you can use \u003ccode\u003esync.Mutex\u003c/code\u003e and \u003ccode\u003esync.RWMutex\u003c/code\u003e objects to create thread-safe data structures in memory as discussed in [“Synchronizing Structs for Safe Concurrency in Go”]({% post_url 2017-02-21-synchronizing-structs %}). When using the \u003ccode\u003esync.RWMutex\u003c/code\u003e in Go, there are two kinds of locks: read locks and write locks. The basic difference is that many read locks can be acquired at the same time, but only one write lock can be acquired at a time.\u003c/p\u003e","title":"Lock Queuing in Go"},{"content":"Building distributed systems in Go requires an RPC or message framework of some sort. In the systems I build I prefer to pass messages serialized with protocol buffers therefore a natural choice for me is grpc. The grpc library uses HTTP2 as a transport layer and provides a code generator based on the protocol buffer syntax making it very simple to use.\nFor more detailed control, the ZMQ library is an excellent, low latency socket framework. ZMQ provides several communication patterns from basic REQ/REP (request/reply) to PUB/SUB (publish/subscribe). ZMQ is used at a lower level though, so more infrastructure per app needs to be built.\nThis leads to the obvious question: which RPC framework is faster? Here are the results:\nThese results show the message throughput of three echo servers that respond to a simple message with a response including a sequence number. Each server is running on its own EC2 micro instance with 1GB of memory and 1 vCPU. Each client is running on an EC2 nano instance with 0.5GB of memory and 1 vCPU and is constantly sending messages to the server. 
The throughput is the number of messages per second the server can handle.\nThe servers are as follows:\nrep: a server that implements a REQ/REP socket and simple handler. router: a server that implements a REQ/ROUTER socket along with a DEALER/REP socket for 16 workers, connected via a proxy. grpc: implements a gRPC service. The runner and results can be found here.\nDiscussion All the figures exhibit a standard shape for throughput - namely as more clients are added the throughput increases, but begins to tail off toward an asymptote. The asymptote represents the maximum number of messages a server can respond to without message latency. Generally speaking if a server can handle multiple clients at once, the throughput is higher.\nThe ZMQ REQ/ROUTER/PROXY/DEALER/REP server with 16 workers outperforms the gRPC server (it has a higher overall throughput). I hypothesize that this is because ZMQ does not have the overhead of HTTP and is in fact lighter weight code than gRPC since none of it is generated. It\u0026rsquo;s unclear if adding more workers would improve the throughput of the ZMQ router server.\nThe performance of the REQ/REP server is a mystery. It\u0026rsquo;s doing way better than the other two. This socket has very little overhead, so for fewer clients it should be performing better. However, this socket also blocks on a per-client basis. Both grpc and router are asynchronous and can handle multiple clients at a time suggesting that they should be much faster.\n","permalink":"https://bbengfort.github.io/2017/09/message-throughput/","summary":"\u003cp\u003eBuilding distributed systems in Go requires an RPC or message framework of some sort. In the systems I build I prefer to pass messages serialized with \u003ca href=\"https://developers.google.com/protocol-buffers/\"\u003eprotocol buffers\u003c/a\u003e therefore a natural choice for me is \u003ca href=\"https://grpc.io/\"\u003egrpc\u003c/a\u003e. 
The grpc library uses HTTP2 as a transport layer and provides a code generator based on the protocol buffer syntax making it very simple to use.\u003c/p\u003e\n\u003cp\u003eFor more detailed control, the \u003ca href=\"http://zeromq.org/\"\u003eZMQ\u003c/a\u003e library is an excellent, low latency socket framework. ZMQ provides several communication patterns from basic REQ/REP (request/reply) to PUB/SUB (publish/subscribe). ZMQ is used at a lower level though, so more infrastructure per app needs to be built.\u003c/p\u003e","title":"Messaging Throughput gRPC vs. ZMQ"},{"content":"This post started out as a discussion of a struct in Go that could keep track of online statistics without keeping an array of values. It ended up being a lesson on over-engineering for concurrency.\nThe spec of the routine was to build a data structure that could keep track of internal statistics of values over time in a space-saving fashion. The primary interface was a method, Update(sample float64), so that a new sample could be passed to the structure, updating internal parameters. At conclusion, the structure should be able to describe the mean, variance, and range of all values passed to the update method. I created two versions:\nA thread-safe version using mutexes, but blocking on Update() A thread-safe version using a channel and a go routine so that Update() was non-blocking. I ran some benchmarking, and discovered that the blocking implementation of Update was actually far faster than the non-blocking version. Here are the numbers:\nBenchmarkBlocking-8 20000000 81.1 ns/op BenchmarkNonBlocking-8 10000000\t140 ns/op Apparently, putting a float on a channel, even a buffered channel, incurs some overhead that is more expensive than simply incrementing and summing a few integers and floats. 
I will present both methods here, but note that the first method (blocking update) should be implemented in production.\nYou can find this code at github.com/bbengfort/x/stats if you would like to use it in your work.\nOnline Descriptive Statistics (Blocking) To track statistics in an online fashion, you need to keep track of the various aggregates that are used to compute the final descriptive statistics of the distribution. For simple statistics such as the minimum, maximum, standard deviation, and mean you need to track the number of samples, the sum of samples, and the sum of the squares of all samples (along with the minimum and maximum value seen). Here is how you do that:\nI use this data structure as a lightweight mechanism to keep track of online statistics for experimental results or latency. It gives a good overall view of incoming values at very little expense.\nNon-blocking Update In an attempt to improve the performance of this method, I envisioned a mechanism where I could simply dump values into a buffered channel then run an updater go routine to collect values and perform the online computation. The updater function can simply range over the channel, and the channel can be closed to stop the goroutine and finalize anything still on the channel. This is written as follows:\nThe lesson was that this is actually less performant, no matter how large the buffer is. I increased the buffer size to 10000001 to ensure that the sender could not block, but I still received 116 ns/op benchmarks. Generally, this style is what I use when the function being implemented is actually pretty heavy (e.g. writes to disk). In this case, the function was too lightweight to matter!\n","permalink":"https://bbengfort.github.io/2017/08/online-distribution/","summary":"\u003cp\u003eThis post started out as a discussion of a \u003ccode\u003estruct\u003c/code\u003e in Go that could keep track of online statistics without keeping an array of values. 
It ended up being a lesson on over-engineering for concurrency.\u003c/p\u003e\n\u003cp\u003eThe spec of the routine was to build a data structure that could keep track of internal statistics of values over time in a space-saving fashion. The primary interface was a method, \u003ccode\u003eUpdate(sample float64)\u003c/code\u003e, so that a new sample could be passed to the structure, updating internal parameters. At conclusion, the structure should be able to describe the mean, variance, and range of all values passed to the update method. I created two versions:\u003c/p\u003e","title":"Online Distribution"},{"content":"I\u0026rsquo;ve been looking for a way to quickly scan a file system and gather information about the files in directories contained within. I had been doing this with multiprocessing in Python, but figured Go could speed up my performance by a lot. What I discovered when I went down this path was the errgroup.Group (from the golang.org/x/sync/errgroup package), an extension of the sync.WaitGroup that helps manage the complexity of multiple go routines but also includes error handling!\nThe end result of this exploration was a utility called urfs — which you can install on your system to take a uniform random sample of files in a directory or to compute the number of files and bytes per directory. This utility is also extensible to a wide range of functionality that requires rapid walking of a file system, like search or other utilities.\nThis post is therefore a bit of a walkthrough on using errgroup.Group for scanning a file system and applying arbitrary functions. First a couple of types:\ntype WalkFunc func(path string) (string, error) type FSWalker struct { Workers int SkipHidden bool SkipDirs bool Match string root string paths chan string nPaths uint64 results chan string nResults uint64 group *errgroup.Group ctx context.Context started time.Time duration time.Duration } The first type is a generic function that can be passed to the Walk method of the FSWalker. 
The FSWalker maintains state with a variety of channels, and of course the errgroup.Group object. The SkipHidden, SkipDirs, and Match properties allow us to filter path types being passed to Walk.\nTo initialize FSWalker:\nfunc (fs *FSWalker) Init(ctx context.Context) { // Set up FSWalker defaults fs.Workers = DefaultWorkers fs.SkipHidden = true fs.SkipDirs = true fs.Match = \u0026#34;*\u0026#34; // Create the context for the errgroup if ctx == nil { // Create a new context ctx = context.Background() // Carry over a deadline from a previous context, if one was set if fs.ctx != nil { if deadline, ok := fs.ctx.Deadline(); ok { ctx, _ = context.WithDeadline(ctx, deadline) } } } // Create the err group fs.group, fs.ctx = errgroup.WithContext(ctx) // Create channels and instantiate other statistics variables fs.paths = make(chan string, DefaultBuffer) fs.results = make(chan string, DefaultBuffer) fs.nPaths = 0 fs.nResults = 0 fs.started = time.Time{} fs.duration = time.Duration(0) } Ok, so we\u0026rsquo;re doing a lot of work here, but it pays off in the Walk function where we keep track of the number of paths we\u0026rsquo;ve seen at a root directory, passing them off to a WalkFunc using a variety of Go routines:\nfunc (fs *FSWalker) Walk(path string, walkFn WalkFunc) error { // Compute the duration of the walk fs.started = time.Now() defer func() { fs.duration = time.Since(fs.started) }() // Set the root path for the walk fs.root = path // Launch the goroutine that populates the paths fs.group.Go(func() error { // Ensure that the channel is closed when all paths loaded defer close(fs.paths) // Apply the path filter to the filepath.Walk function return filepath.Walk(fs.root, fs.filterPaths) }) // Create the worker function and allocate pool worker := fs.worker(walkFn) for w := 0; w \u0026lt; fs.Workers; w++ { fs.group.Go(worker) } // Wait for the workers to complete, then close the results channel go func() { fs.group.Wait() close(fs.results) }() // Start gathering the results for range fs.results { fs.nResults++ } return fs.group.Wait() } So 
this is a lot of code, let\u0026rsquo;s step through it. The first thing we do is set the started time to now, and defer a function to compute the duration as the difference between the time at the end of the function and the start time. We also set the root value. We then launch a go routine in the ErrGroup by using fs.group.Go(func); this function must have the signature func() error, so we use an anonymous function to kick off the filepath.Walk, which starts walking the directory structure, adding paths that match the filter criteria to a buffered channel called fs.paths, more on this later. This channel must be closed on completion so that our worker go routines complete, more on that later.\nNext we create a worker function using our worker method and walk function. The workers read paths off the fs.paths channel, and apply the walkFn to each path individually. Note that we use a pool-like structure here, limiting the number of workers to 5000; this is so we don\u0026rsquo;t get a \u0026ldquo;too many files open\u0026rdquo; error when we exhaust the number of file descriptors since Go has unlimited go routines. The worker definition is here:\nfunc (fs *FSWalker) worker(walkFn WalkFunc) func() error { return func() error { // Apply the function to all paths in the channel for path := range fs.paths { // avoid race condition p := path // apply the walk function to the path and return errors r, err := walkFn(p) if err != nil { return err } // store the result and check the context if r != \u0026#34;\u0026#34; { select { case fs.results \u0026lt;- r: case \u0026lt;-fs.ctx.Done(): return fs.ctx.Err() } } } return nil } } As you can see, the worker function just creates a closure with the signature of our ErrGroup function, so that we can pass it to the group. 
All the worker function does is range over the paths channel, applying the walkFn to each path.\nFinally, we kick off another go routine that waits until all the workers have stopped, and when they have, closes our results channel. We do this so that we can start gathering results immediately; we don\u0026rsquo;t have to wait. We can do this by simply ranging over the results channel and counting the number of results. A final wait at the end means that we can wait for all go routines to complete.\nLastly the filter function. We want to ignore files and directories that are hidden, e.g. those that start with a \u0026ldquo;.\u0026rdquo; or a \u0026ldquo;~\u0026rdquo; on Unix systems. We also want to be able to pass a glob-like matcher, e.g. \u0026quot;*.txt\u0026quot; to only match text files. The filter function is here:\n// Internal filter paths function that is passed to filepath.Walk func (fs *FSWalker) filterPaths(path string, info os.FileInfo, err error) error { // Propagate any errors if err != nil { return err } // Skip anything that is not a regular file (no mode type bits set) if !info.Mode().IsRegular() { return nil } // Get the name of the file without the complete path name := info.Name() // Skip hidden files or directories if required. if fs.SkipHidden { if strings.HasPrefix(name, \u0026#34;.\u0026#34;) || strings.HasPrefix(name, \u0026#34;~\u0026#34;) { return nil } } // Skip directories if required if fs.SkipDirs { if info.IsDir() { return nil } } // Check to see if the pattern matches the file match, err := filepath.Match(fs.Match, name) if err != nil { return err } else if !match { return nil } // Increment the total number of paths we\u0026#39;ve seen. atomic.AddUint64(\u0026amp;fs.nPaths, 1) select { case fs.paths \u0026lt;- path: case \u0026lt;-fs.ctx.Done(): return fs.ctx.Err() } return nil } And that\u0026rsquo;s it, with this simple framework, you can apply an arbitrary walkFn to all paths in a directory, matching specific criteria. 
The big win here is to manage all of the go routines using the ErrGroup and a context.Context object.\nThe following post: Run strikingly fast parallel file searches in Go with sync.ErrGroup by Brian Ketelsen was the primary inspiration for the use of sync.ErrGroup.\n","permalink":"https://bbengfort.github.io/2017/08/rapid-fs-walk/","summary":"\u003cp\u003eI\u0026rsquo;ve been looking for a way to quickly scan a file system and gather information about the files in directories contained within. I had been doing this with multiprocessing in Python, but figured Go could speed up my performance by a lot. What I discovered when I went down this path was the \u003ca href=\"https://godoc.org/golang.org/x/sync/errgroup\"\u003e\u003ccode\u003esync.ErrGroup\u003c/code\u003e\u003c/a\u003e, an extension of the \u003ccode\u003esync.WaitGroup\u003c/code\u003e that helps manage the complexity of multiple go routines but also includes error handling!\u003c/p\u003e","title":"Rapid FS Walks with ErrGroup"},{"content":"This is just a quick note on the performance of writing to a file on disk using Go, and reveals a question about a common programming paradigm that I am now suspicious of. I discovered that when I wrapped the open file object with a bufio.Writer that the performance of my writes to disk significantly decreased. 
Ok, so this isn\u0026rsquo;t about simple file writing to disk, this is about a complex writer that does some seeking in the file writing to different positions and maintains the overall state of what\u0026rsquo;s on disk in memory, however the question remains:\nWhy do we buffer our writes to a file?\nA couple of answers come to mind: safety, the buffer ensures that writes to the underlying writer are not flushed when an error occurs; helpers, there may be some methods in the buffer struct not available to a native writer; concurrency, the buffer can be appended to concurrently with another part of the buffer being flushed.\nHowever, we determined that in performance critical applications (file systems, databases) the buffer abstraction adds an unacceptable performance overhead. Here are the results.\nResults First, we\u0026rsquo;re not doing a simple write - we\u0026rsquo;re appending to a write-ahead log that has fixed length metadata at the top of the file. This means that a single operation to append data to the log consists of the following steps:\nMarshal data to bytes Write data to end of the log file (possibly sync the file) Seek to the top of the file Marshal and write fixed length metadata header Seek to the bottom of the file Sync the file to disk So there is a bit more work here than simply throwing data at disk. We can see in the following graph that the performance of the machine (CPU, Memory, and Disk) plays a huge role in determining the performance of these operations in terms of the number of these writes the machine is able to do per second:\nIn the above graph, Hyperion and Lagoon are Dell Optiplex servers and Antigua, Curacao, and Nevis are Intel NUCs. They all have different processors and SSDs, but all have 16GB memory. For throughput, bigger is better (you can do more operations per second). 
As you can see on all of the servers, there is about a 1.6x increase in throughput using unbuffered writes to the file over buffered writes to the file.\nWe can inspect the distribution of the latency of each individual operation as follows (with latency, smaller is better; you\u0026rsquo;re doing operations faster):\nThe boxplot shows the distribution of latency such that the box is between the 25th and 75th percentile (with a bisecting line at the median) - the lines are from the 5th to the 95th percentile, and anything outside the lines is considered an outlier and is visualized as a diamond.\nWe can see the shift not just in the mean, but also the median; a 1.6x increase in speed (decrease in latency) from buffered to unbuffered writes. More importantly, we can see that unbuffered writes are more consistent; i.e. they have a tighter distribution and less variable operational latency. I suspect this means that while both types of writes are bound by disk accesses from other processes, buffered writes are also bound by CPU whereas unbuffered writes are less so.\nMethod The idea here is that we are going to open a file and append data to it, tracking what we\u0026rsquo;re doing with a fixed length metadata header at the beginning of the file. 
Creating a struct to wrap the file and open, sync, and close it is pretty straightforward:\ntype Log struct { path string file *os.File } func (l *Log) Open(path string) (err error) { l.path = path l.file, err = os.OpenFile(path, os.O_WRONLY|os.O_CREATE, 0644) if err != nil { return err } return nil } func (l *Log) Close() error { err := l.file.Close() l.file = nil return err } func (l *Log) Sync() error { return l.file.Sync() } Now let\u0026rsquo;s say that we have an entry that knows how to write itself to an io.Writer interface as follows:\ntype Entry struct { Version uint64 `json:\u0026#34;version\u0026#34;` Key string `json:\u0026#34;key\u0026#34;` Value []byte `json:\u0026#34;value\u0026#34;` Created time.Time `json:\u0026#34;created\u0026#34;` } func (e *Entry) Dump(w io.Writer) (int64, error) { // Encode entry as JSON data (base64 encoded bytes value) data, err := json.Marshal(e) if err != nil { return -1, err } // Add a newline to the data for json lines format data = append(data, byte(\u0026#39;\\n\u0026#39;)) // Write the data to the writer and return. 
n, err := w.Write(data) return int64(n), err } So the question is, if we have a list of entries we want to append to the log, how do we pass the io.Writer to the Entry.Dump method in order to write them one at a time?\nThe first method is the standard method, buffered, using bufio.Writer:\nfunc (l *Log) Append(entries ...*Entry) (size int64, err error) { // Create the buffer and define the bytes var bytes int64 buffer := bufio.NewWriter(l.file) // Write each entry keeping track of the amount of data written for _, entry := range entries { if bytes, err = entry.Dump(buffer); err != nil { return -1, err } else { size += bytes } } // Flush the buffer if err = buffer.Flush(); err != nil { return -1, err } // Sync the underlying file if err = l.Sync(); err != nil { return -1, err } return size, nil } As you can see, even though we\u0026rsquo;re getting a buffered write to disk, we\u0026rsquo;re not actually leveraging any of the benefits of the buffered write. By eliminating the middleman with an unbuffered approach:\nfunc (l *Log) Append(entries ...*Entry) (size int64, err error) { // Write each entry directly to the file, keeping track of the amount of data written for _, entry := range entries { if bytes, err := entry.Dump(l.file); err != nil { return -1, err } else { size += bytes } } // Sync the underlying file if err = l.Sync(); err != nil { return -1, err } return size, nil } We get the performance benefit as shown above. Now, I\u0026rsquo;m not sure if this is obvious or not; but I do know that it\u0026rsquo;s commonly taught to wrap the file object with the buffer; the unbuffered approach may be simpler and faster but it may also be less safe, it depends on your use case.\n","permalink":"https://bbengfort.github.io/2017/08/buffered-writes/","summary":"\u003cp\u003eThis is just a quick note on the performance of writing to a file on disk using Go, and reveals a question about a common programming paradigm that I am now suspicious of.  
I discovered that when I wrapped the open file object with a \u003ca href=\"https://golang.org/pkg/bufio/#Writer\"\u003e\u003ccode\u003ebufio.Writer\u003c/code\u003e\u003c/a\u003e that the performance of my writes to disk significantly decreased. Ok, so this isn\u0026rsquo;t about simple file writing to disk, this is about a complex writer that does some seeking in the file writing to different positions and maintains the overall state of what\u0026rsquo;s on disk in memory, however the question remains:\u003c/p\u003e","title":"Buffered Write Performance"},{"content":"The event dispatcher pattern is extremely common in software design, particularly in languages like JavaScript that are primarily used for user interface work. The dispatcher is an object (usually a mixin to other objects) that can register callback functions for particular events. Then when a dispatch method is called with an event, the dispatcher calls each callback function in order of their registration and passes them a copy of the event. In fact, I\u0026rsquo;ve already written a version of this pattern in Python: [Implementing Observers with Events]({% post_url 2016-02-16-observer-pattern %}) In this snippet, I\u0026rsquo;m presenting a version in Go that has been incredibly stable and useful in my code.\nThere are four types in the snippet below:\nEventType is a uint16 that represents the type of event that occurs, several constants in the code declare event types along with a string method for human readability. Typing constants this way improves performance in the dispatcher environment. Callback defines the signature of a function that can be registered. Dispatcher is the core of the code and wraps a source, that is, the actual object that is doing the dispatching. Event is an interface for event types that has a type, source, and value. 
For example, if you want to watch a directory for new files being created, you could do something like this:\ntype DirectoryWatcher struct { Dispatcher path string // path to directory on disk } func (w *DirectoryWatcher) Init(path string) { w.path = path w.Dispatcher.Init(w) } // Watch the given directory and dispatch new file events func (w *DirectoryWatcher) Watch() error { for { // Stop watching if the directory cannot be read files, err := ioutil.ReadDir(w.path) if err != nil { return err } for _, file := range files { if w.Unseen(file) { w.Dispatch(NewFile, file) } } time.Sleep(100 * time.Millisecond) } } This initializes the DirectoryWatcher dispatcher with the source as the watcher (so you can refer to exactly which directory was being watched). Then as the watcher looks at the directory for new data every 100 milliseconds, if it sees any files that were Unseen() then it dispatches the event.\nThe dispatcher code is as follows:\nSo this works very well but there are a couple of key points:\nWhen dispatching the event, a single error terminates all event handling. It might be better to create a specific error type that terminates event handling (e.g. do not propagate) and then collect all other errors into a slice and return them from the dispatcher. The event can technically be modified by callback functions since it\u0026rsquo;s a pointer. It might be better to pass by value to guarantee that all callbacks see the original event. Callback handling is in order of registration, which gets to point number one about canceling event propagation. An alternative is to do all the callbacks concurrently using Go routines; which is something I want to investigate further. ","permalink":"https://bbengfort.github.io/2017/07/event-dispatcher/","summary":"\u003cp\u003eThe event dispatcher pattern is extremely common in software design, particularly in languages like JavaScript that are primarily used for user interface work. 
The dispatcher is an object (usually a mixin to other objects) that can \u003cem\u003eregister\u003c/em\u003e callback functions for particular events. Then when a \u003cem\u003edispatch\u003c/em\u003e method is called with an event, the dispatcher calls each callback function in order of their registration and passes them a copy of the event. In fact, I\u0026rsquo;ve already written a version of this pattern in Python: [Implementing Observers with Events]({% post_url 2016-02-16-observer-pattern %})  In this snippet, I\u0026rsquo;m presenting a version in Go that has been incredibly stable and useful in my code.\u003c/p\u003e","title":"Event Dispatcher in Go"},{"content":"In the [last post]({% post_url 2017-07-13-zmq-basic %}) I discussed a simple REQ/REP pattern for ZMQ. However, by itself REQ/REP is pretty fragile. First, every REQ requires a REP and a server can only handle one request at a time. Moreover, if the server fails in the middle of a reply, then everything is hung. We need more reliable REQ/REP, which is actually the subject of an entire chapter in the ZMQ book.\nFor my purposes, I want to ensure that repliers (servers) can fail without taking out the client. The server can simply sock.Send(zmq.DONTWAIT) to deal with clients that dropout before the communication is complete. Server failure is a bit more difficult to deal with, however. Client side reliability is based on timeouts and retries, dealing with failed messages. ZMQ calls this the Lazy Pirate Pattern.\nThis is a pretty big chunk of code, but it creates a Client object that wraps a socket and performs lazy pirate sends. The primary code is in the Reset() and Send() methods. The Reset() method sets the linger to zero in order to close the connection immediately without errors; it then closes the connection and reconnects thereby resetting the state to be able to send messages again. 
This is \u0026ldquo;brute force but effective and reliable\u0026rdquo;.\nThe Send() method fires off a message then uses a zmq.Poller with a timeout to keep checking if a message has been received in that time limit. If it was successful, then great! Otherwise we decrement our retries and try again. If we\u0026rsquo;re out of retries there is nothing to do but return an error. The code is here:\nThis code is fairly lengthy, but as it turns out, most of the content for both clients and servers on either side of REQ/REP have similar wrapper code for context, socket, and connection/bind wrapping. So far it\u0026rsquo;s been very reliable in my code to allow servers to drop out and fail without blocking clients or other nodes in the network.\n","permalink":"https://bbengfort.github.io/2017/07/lazy-pirate/","summary":"\u003cp\u003eIn the [last post]({% post_url 2017-07-13-zmq-basic %}) I discussed a simple REQ/REP pattern for ZMQ. However, by itself \u003ca href=\"http://dbeck.github.io/5-lessons-learnt-from-choosing-zeromq-and-protobuf/\"\u003eREQ/REP is pretty fragile\u003c/a\u003e. First, every REQ requires a REP and a server can only handle one request at a time. Moreover, if the server fails in the middle of a reply, then everything is hung. We need more reliable REQ/REP, which is actually the subject of \u003ca href=\"http://zguide.zeromq.org/page:all#toc86\"\u003ean entire chapter\u003c/a\u003e in the ZMQ book.\u003c/p\u003e","title":"Lazy Pirate Client"},{"content":"There are many ways to create RPCs and send messages between nodes in a distributed system. Typically when we think about messaging, we think about a transport layer (TCP, IP) and a protocol layer (HTTP) along with some message serialization. Perhaps best known are RESTful APIs which allow us to GET, POST, PUT, and DELETE JSON data to a server. Other methods include gRPC which uses HTTP and protocol buffers for interprocess communication.\nZMQ is a bit different. 
It provides an abstraction for sockets that look like embedded networking but can actually be used for in- and inter-process channels, multicast, TCP, and more. ZMQ has many patterns, starting with simple REQ/REP (request/reply) where a client connects to a socket that a server is bound on; the client sends a REQ and waits for a response, the REP from the server.\nThe interesting thing about this (pretty standard) network communication is that the server doesn\u0026rsquo;t have to be up for the client to connect, it will just wait until the server is available. Moreover, there is no need for multiplexing because ZMQ buffers messages under the hood. The pattern is incredibly failure resistant. ZMQ is not HTTP, ZMQ is something different with its own protocol, and even though it\u0026rsquo;s a lower level networking abstraction, it can be used for very powerful distributed systems design.\nThis is just a snippet with a bare bones REQ/REP message server and client that passes strings back and forth.\nTo use this code, download the gist and run the server and client in two different terminal windows with go run. To run the server:\n$ go run zmqmsg.go serve And to send messages:\n$ go run zmqmsg.go send \u0026#34;first message\u0026#34; \u0026#34;second message\u0026#34; \u0026#34;third message\u0026#34; You should see messages received at the server and replies sent back to the client. Of course this is pretty much the hello world of the ZMQ REQ/REP model and there are many other networking patterns and sockets provided by ZMQ to check out. In particular, there is a PUB/SUB pattern where clients can connect to a publisher to receive updates pushed to them. More to come!\n","permalink":"https://bbengfort.github.io/2017/07/zmq-basic/","summary":"\u003cp\u003eThere are many ways to create RPCs and send messages between nodes in a distributed system. 
Typically when we think about messaging, we think about a transport layer (TCP, IP) and a protocol layer (HTTP) along with some message serialization. Perhaps best known are RESTful APIs which allow us to GET, POST, PUT, and DELETE JSON data to a server. Other methods include gRPC which uses HTTP and protocol buffers for interprocess communication.\u003c/p\u003e","title":"Simple ZMQ Message Passing"},{"content":"In this discussion, I want to propose some code to perform PID file management in a Go program. When a program is backgrounded or daemonized we need some way to communicate with it in order to stop it. All active processes are assigned a unique process id by the operating system and that ID can be used to send signals to the program. Therefore a PID file:\nThe pid files contains the process id (a number) of a given program. For example, Apache HTTPD may write it\u0026rsquo;s main process number to a pid file - which is a regular text file, nothing more than that - and later use the information there contained to stop itself. You can also use that information (just do a cat filename.pid) to kill the process yourself, using echo filename.pid | xargs kill.\n— Rafael Steil\nFrom a Go program we can use the PID to get access to the program and send a signal, such as SIGTERM - terminate the program!\nimport ( \u0026#34;os\u0026#34; \u0026#34;syscall\u0026#34; \u0026#34;github.com/bbengfort/x/pid\u0026#34; \u0026#34;github.com/urfave/cli\u0026#34; ) // Send a kill signal to the process defined by the PID func stop(c *cli.Context) error { pid := pid.New() if err := pid.Load(); err != nil { return cli.NewExitError(err.Error(), 1) } // Get the process from the os proc, err := os.FindProcess(pid.PID) if err != nil { return cli.NewExitError(err.Error(), 1) } // Kill the process if err := proc.Signal(syscall.SIGTERM); err != nil { return cli.NewExitError(err.Error(), 1) } return nil } Using the PID file within a program requires a bit of forethought. 
Where do you store the PID file? Do you only allow one running instance of the program? If so, the program needs to throw an error if it starts up and a PID file exists, if not, how do you name multiple PID files? When exiting, how do you make sure that the PID file is deleted?\nSome of these questions are addressed by my initial implementation of the PID file in the github.com/bbengfort/x/pid package. The stub of that implementation is as follows:\nThis implementation stores both the PID and the parent PID (if the process forks) in the PID file in JSON format. JSON is not necessarily required, but it does make the format a bit simpler to understand and also allows the addition of other process information.\nSo why talk about PIDs? Well I\u0026rsquo;m writing some programs that need to be run in the background and always started up. I\u0026rsquo;m investigating systemd for Ubuntu and launchctl for OS X in order to manage the processes. But more on that in a future post.\n","permalink":"https://bbengfort.github.io/2017/07/pid-management/","summary":"\u003cp\u003eIn this discussion, I want to propose some code to perform PID file management in a Go program. When a program is backgrounded or daemonized we need some way to communicate with it in order to stop it. All active processes are assigned a \u003ca href=\"https://en.wikipedia.org/wiki/Process_identifier\"\u003eunique process id\u003c/a\u003e by the operating system and that ID can be used to send signals to the program. Therefore a PID file:\u003c/p\u003e\n\u003cblockquote\u003e\n\u003cp\u003eThe pid files contains the process id (a number) of a given program. For example, Apache HTTPD may write it\u0026rsquo;s main process number to a pid file - which is a regular text file, nothing more than that - and later use the information there contained to stop itself. 
You can also use that information (just do a \u003ccode\u003ecat filename.pid\u003c/code\u003e) to kill the process yourself, using \u003ccode\u003eecho filename.pid | xargs kill\u003c/code\u003e.\u003c/p\u003e","title":"PID File Management"},{"content":"When doing research on peer-to-peer networks, addressing can become pretty complex pretty quickly. Not everyone has the resources to allocate static, public facing IP addresses to machines. A machine that is in a home network for example only has a single public-facing IP address, usually assigned to the router. The router then performs NAT (network address translation) forwarding requests to internal devices.\nIn order to get a service running on an internal network, you can port forward external requests to a specific port to a specific device. Requests are made to the router\u0026rsquo;s IP address, and the router passes it on. But how do you know the IP address of the device? Moreover, what happens if the router is assigned a new IP address? Static IP addresses generally cost more.\nIt seems like services such as DynDNS and DDNS are no longer a default on the routers that are being shipped with broadband services like Xfinity or Fios. I therefore had to create my own, using the excellent service provided by myexternalip.com. The wrapper in Go is as follows:\nWhen making a request to an external server like myexternalip.com, the public IP address of the router is used in the connection. The external server therefore can respond with what it sees as your public facing IP address, and that\u0026rsquo;s exactly what happens here.\nI tried to make the PublicIP() function a bit robust, using a timeout of 5 seconds so it couldn\u0026rsquo;t hang up any calling programs, and performing a lot of error handling. For example, a 429 response from myexternalip.com means that the rate limit has been exceeded (30 requests per minute). 
As I like the service, I wanted to make sure this was maintained so I ensured an error was thrown if this was breached. Additionally I used the json format rather than the raw format which meant I had to do some parsing, but I think it lends the code a bit more stability.\nIf you\u0026rsquo;re looking for a raw option, check out: get_external_ip.go. But I hope you see that my version is a tad more robust.\n","permalink":"https://bbengfort.github.io/2017/07/public-ip/","summary":"\u003cp\u003eWhen doing research on peer-to-peer networks, addressing can become pretty complex pretty quickly. Not everyone has the resources to allocate static, public facing IP addresses to machines. A machine that is in a home network for example only has a single public-facing IP address, usually assigned to the router. The router then performs NAT (network address translation) forwarding requests to internal devices.\u003c/p\u003e\n\u003cp\u003eIn order to get a service running on an internal network, you can port forward external requests to a specific port to a specific device. Requests are made to the router\u0026rsquo;s IP address, and the router passes it on. But how do you know the IP address of the device? Moreover, what happens if the router is assigned a new IP address? Static IP addresses generally cost more.\u003c/p\u003e","title":"Public IP Address Discovery"},{"content":" I\u0026rsquo;m preparing to move into a new job when I finish my dissertation hopefully later this summer. The new role involves web application development with Rails and so I needed to get up to speed. 
I had a web application requirement for my research so I figured I\u0026rsquo;d kill two birds with one stone and build that app with Rails (a screenshot of the app is above, though of course this is just a front-end and doesn\u0026rsquo;t really tell you it was built with Rails).\nNow I\u0026rsquo;m by no means a web developer; I can create simple apps as universal UIs using Bootstrap and simple frameworks. I\u0026rsquo;m more likely to generate databases or APIs that much better front-end developers access. That said, I knew that if I did this project using Django, it would have probably taken me a day, maybe a day and a half to get everything the way I wanted it. I knew there would be some learning curve to Rails, and so I figured I\u0026rsquo;d take three days over the July 4th holiday to knock this out. A week later \u0026hellip; here are my impressions and lessons learned.\nThe Good I\u0026rsquo;m not one of those evangelical developers that focuses on one tool or technology. My understanding of Ruby (thanks to @looselycoupled) was that Matz created it to be fun for developers to code in. I totally get this from the pure Ruby side of things. Rails also seems like it is designed to get relative novice programmers up and running, building professional websites as quickly as possible. In no particular order, here are some of my good vibes about Ruby and Rails.\nRSpec and tests in tutorials I\u0026rsquo;m not exactly a BDD or TDD guy, but I do write tests when I code (it\u0026rsquo;s not just because I\u0026rsquo;m a professional, tests allow me to program fearlessly). I\u0026rsquo;ve always enjoyed the RSpec style testing that has a natural DSL for describing tests, their contexts, and matching. 
In fact, I use Ginkgo for RSpec style testing in Go (haven\u0026rsquo;t found one for Python yet, but if anyone has anything, let me know!)\nEven more importantly than RSpec being cool, is the fact that Rails itself lends itself to generating tests and even more than that, most tutorials I encountered also included tests. I\u0026rsquo;ve always had a hard time testing web apps with Django, but I found it extremely easy and even enjoyable to test the Rails app.\n5.minutes.ago This astounded me, and it took me a while to figure out what was happening. Basically this line of code returns a timestamp that is 5 minutes in the past. How?! Well, everything in Ruby is an object including numbers. Therefore the number 5 has a method called minutes (also minute for 1.minute.ago) that converts the number into some kind of time delta. Then that thing has a method that subtracts it from Time.now. Mind blown. This reveals the fact that most objects in Ruby have a huge number of instance methods, presumably added via a huge number of mixins. This system is pretty neat, if possibly not the most performant thing in the world.\nAssets and the front-end One thing that always bothered me about Django was that static assets were only very loosely related to the application; Django focused on the backend details. Rails apps on the other hand put the front-end first, generating stylesheets and javascript on demand and building them with the asset pipeline for delivery. The front-end feels like a major component of a Rails app, not just the HTML rendering bits.\nThis also has a lot to do with the built-in compiling for SASS and CoffeeScript in the asset pipeline by the Rails app. Unlike a Python app, gems are available for the front-end tools I use every day. Rather than download JavaScripts and stylesheets, I instead gem installed them and included them in my requirements. It was much easier to get jQuery, Bootstrap, Underscore, etc. this way. 
The big win was really gmaps4rails — it was a snap to get those maps up and running in the app!\nSecrets I don\u0026rsquo;t really know how secrets.yml works, but I\u0026rsquo;m glad it\u0026rsquo;s there and I hope that it\u0026rsquo;s doing some fancy hiding of variables. I have my API keys and passwords etc in the environment, and I\u0026rsquo;m used to loading them into the configuration using ENV (or rather, os.environ). Something about secrets.yml just rubs me the right way though.\nRESTful by design The resources route configuration is amazing. While I\u0026rsquo;m very used to creating RESTful APIs, I recognize that REST was originally intended for HTML documents and resources. Couple this directly with controller methods such as index, create, update, destroy it made the application extremely intuitive to create and design.\nThe Bad At the risk of sounding simply annoyed because Ruby isn\u0026rsquo;t my favorite programming language or Rails isn\u0026rsquo;t my web framework of choice, I do want to point out some struggles I had that I don\u0026rsquo;t think were related to the learning curve.\nAutoloader and requiring files By far my biggest challenge was creating the geoip component of the app, which was a client that queried another API for the latitude and longitude of a given IP address. Here was the problem: I built that component in plain Ruby in about 20 minutes. I could run the Ruby script from the command line. Then I tried to add it to Rails and … it couldn\u0026rsquo;t find my dependencies.\nSo first off, I knew that in order to get the app to find my library file or anything outside of a directory I needed to add it to the configuration. E.g. if I was going to create a directory, app/services then in config/application.rb I needed to do something like:\nconfig.autoload_paths += [ \u0026#34;#{config.root}/app/services\u0026#34;, ] Additionally, I have to name the files as the lower snake case version of the class name. E.g. 
put GeoipService in a file called services/geoip.rb. So this is a bit annoying, and I think using require is much more obvious.\nHowever, when the app gives you a NameError: uninitialized constant Faraday or NameError: uninitialized constant HTTParty (the two libraries I tried to use to make web requests), things get annoying. Was it in my Gemfile? Yes. Did I run bundle install? Yes. Did I run bundle exec bin/rails server? Yes. Do I have any idea how to deal with the autoloader? No.\nI finally got it by putting my client script in lib and having that script require the library. This seemed to make Rails happy enough. Why? No idea.\nIs it Ruby? Having been warned by @looselycoupled that Rails and Ruby are different, I learned Ruby first, using the Codecademy Ruby Tutorial (mostly because this is what we tell our students to do and I wanted to try it out). That went well and I think I got a pretty good grasp on Ruby. In fact, I felt comfortable enough with Ruby to write some scripts \u0026ndash; at this point I just need to know more about the standard library and useful third party libraries to be effective.\nOn to Rails — wait is that Ruby? Rails describes itself as a “Ruby-like domain specific language for developing web applications” and I think they\u0026rsquo;re right. Much of the syntax in a Rails app is sugar that exploits a number of nice qualities about Ruby. I can easily see how it may be difficult for a Rails developer to write a Ruby library, or even move on to other programming languages. I can also see how it is super difficult for a programmer in another language to figure out web applications with Rails.\nWhat\u0026rsquo;s in the model? While I do like the migrations database management in Rails, I constantly finding myself asking where the model definition was. Properties are not specified explicitly in an ActiveRecord subclass, instead they\u0026rsquo;re reflected from the database (as far as I can tell). 
So once a migration was created I had to remember if I created the is_admin or admin boolean field since it wasn\u0026rsquo;t on the model I was working in. I became very reliant on db/schema.rb to tell me about the database!\nAuth from scratch I used the Clearance gem by Thoughtbot for user authentication, but it still felt like I had to roll a lot of the authentication from scratch. I\u0026rsquo;m sure Rails has some sort of CMS gem (and a lot of my choices were informed by the web app that I have to learn to hack on), but I was surprised that there was nothing there by default. I think that in the Rails world, if you\u0026rsquo;re building web apps all the time you have very specific preferences about what you want to do with Auth; unfortunately I believe this is one place things should be standardized. It took a lot of my time and created anxiety that someone was going to hack into my app and ruin my research.\nThe Ugly Last a few comments that aren\u0026rsquo;t necessarily bad and not necessarily good.\nWizardry is mysterious Based on my conversations with other Rails developers and reading StackOverflow questions, it seems that everyone agrees that Rails is magic and does a lot of magical things. Unfortunately, magic is, by its nature, mysterious. There were many times I couldn\u0026rsquo;t figure out what was going on because there were obfuscations or methods designed to create simple syntax.\nHere is what I figured out: everything is a method, but a method doesn\u0026rsquo;t have to have parentheses to be called, and hashes don\u0026rsquo;t have to have braces. Therefore a line of code like\ncommand something with this and that Is easily possible, but something with this and that could be either:\narguments to the command method a hash that\u0026rsquo;s being passed to the command a block or a description of a block And really the only way to tell is to pay close attention to the commas and colons in the line of code (which is also not helped by symbols). 
This combined with monkey patching meant that I couldn\u0026rsquo;t easily find method definitions to override them or ways to create my own methods.\nI think experience will help, and certainly the compact, expressiveness of code is nice, but magic is mysterious.\nWhere is that file again? All I have to say is this: my Rails code base seems to be 300 files with 50 lines of code in each file. My directory structure is so big (and things are so similarly named) that it was hard to find stuff.\nConclusion It\u0026rsquo;s been a while since I learned a new programming language, and I love to learn new things. Ruby is really a joy to program in, though I think Python suits my scripting requirements a bit better. Rails is a solid framework for web development. I think that I\u0026rsquo;ll do fine working with Rails at the new company.\nHowever, I would probably not encourage my students to start with Ruby as a first language or Rails for web development (unless they intended to be professional web developers, which my students rarely are). I think once you\u0026rsquo;re in that world it\u0026rsquo;s a bit tricky to get out of it and the magic means that you\u0026rsquo;re not necessarily learning programming fundamentals. Still, it is a professional grade tool for web developers, and I\u0026rsquo;m glad it exists.\n","permalink":"https://bbengfort.github.io/2017/07/on-track-with-rails/","summary":"\u003cp\u003e\u003ca href=\"/images/2017-07-06-kahu-screenshot.png\"\u003e\u003cimg loading=\"lazy\" src=\"/images/2017-07-06-kahu-screenshot.png\" alt=\"Kahu Screenshot\"  /\u003e\n\u003c/a\u003e\u003c/p\u003e\n\u003cp\u003eI\u0026rsquo;m preparing to move into a new job when I finish my dissertation hopefully later this summer. The new role involves web application development with Rails and so I needed to get up to speed. 
I had a web application requirement for my research so I figured I\u0026rsquo;d knock out two birds with one stone and build that \u003ca href=\"https://github.com/bbengfort/kahu\"\u003eapp with Rails\u003c/a\u003e (a screenshot of the app is above, though of course this is just a front-end and doesn\u0026rsquo;t really tell you it was built with Rails).\u003c/p\u003e","title":"On the Tracks with Rails"},{"content":"Visual Pipelines for Text Analysis\nDescription Employing machine learning in practice is half search, half expertise, and half blind luck. In this talk we will explore how to make the luck half less blind by using visual pipelines to steer model selection from raw input to operational prediction. We will look specifically at extending transformer pipelines with visualizers for sentiment analysis and topic modeling text corpora.\n","permalink":"https://bbengfort.github.io/2017/06/visual-pipelines-for-text-analysis/","summary":"\u003cp\u003e\u003ca href=\"http://data-intelligence.ai/presentations/13\"\u003eVisual Pipelines for Text Analysis\u003c/a\u003e\u003c/p\u003e\n\u003ch3 id=\"description\"\u003eDescription\u003c/h3\u003e\n\u003cp\u003eEmploying machine learning in practice is half search, half expertise, and half blind luck. In this talk we will explore how to make the luck half less blind by using visual pipelines to steer model selection from raw input to operational prediction. We will look specifically at extending transformer pipelines with visualizers for sentiment analysis and topic modeling text corpora.\u003c/p\u003e","title":"Visual Pipelines for Text Analysis"},{"content":"I\u0026rsquo;ve been using Fabric to concurrently start multiple processes on several machines. These processes have to run at the same time (since they are experimental processes and are interacting with each other) and shut down at more or less the same time so that I can collect results and immediately execute the next sample in the experiment. 
However, I was having some difficulties directly using Fabric:\nFabric can parallelize one task across multiple hosts according to roles. Fabric can be hacked to run multiple tasks on multiple hosts by setting env.dedupe_hosts = False Fabric can only parallelize one type of task, not multiple types Fabric can\u0026rsquo;t handle large numbers of SSH connections In this post we\u0026rsquo;ll explore my approach with Fabric and my current solution.\nFabric Consider the following problem: I want to run a Honu replica server on four different hosts. This is pretty easy using fabric as follows:\nfrom itertools import count from fabric.api import env, parallel, run # assign unique pids to servers counter = count(1,1) # Set the hosts environment env.hosts = [\u0026#39;user@hostA:22\u0026#39;, \u0026#39;user@hostB:22\u0026#39;, \u0026#39;user@hostC:22\u0026#39;, \u0026#39;user@hostD:22\u0026#39;] @parallel def serve(pid=None): pid = pid or next(counter) run(\u0026#34;honu serve -i {}\u0026#34;.format(pid)) Note that this uses a global variable, counter to assign a unique id to each process (more on this later). What if I want to run four replica processes on four hosts? We can hack that as follows:\nfrom fabric.api import execute, settings def multiexecute(task, n, host, *args, **kwargs): \u0026#34;\u0026#34;\u0026#34; Execute the task n times on the specified host. If the task is parallel then this will be parallel as well. All other args are passed to execute. 
\u0026#34;\u0026#34;\u0026#34; # Do nothing if n is zero or less if n \u0026lt; 1: return # Return one execution of the task with the given host if n == 1: return execute(task, host=host, *args, **kwargs) # Otherwise create a lists of hosts, don\u0026#39;t dedupe them, and execute hosts = [host]*n with settings(dedupe_hosts=False): execute(task, hosts=hosts, *args, **kwargs) # Note the removal of the decorator def serve(pid=None): pid = pid or next(counter) run(\u0026#34;honu serve -i {}\u0026#34;.format(pid)) @parallel def serveall(): multiexecute(serve, 4, env.host) Here, we create a multiexecute() function that temporarily sets dedupe_hosts=False using the settings context manager, then creates a host list that duplicates the original host n times, executing the task in parallel. By parallelizing the serveall task, each host is passed into the task once, then branched out 4 times by multiexecute.\nNow, what if I want to run 4 serve() and 4 work() tasks with different arguments to each in parallel? Well, here\u0026rsquo;s where things fall apart, it can\u0026rsquo;t be done. If we write:\n@parallel def serveall(): multiexecute(serve, 4, env.host) multiexecute(work, 4, env.host) Then the second multiexecute() will happen sequentially after the first multiexecute(). Unfortunately there seems to be no solution. Moreover, each of the additional tasks opens up a new SSH connection and many SSH connections quickly become untenable as you reach file descriptor limits in Python.\nConcurrent Subprocess Ok, so let\u0026rsquo;s step back - Fabric is great for one task to one host, let\u0026rsquo;s continue to use that to our advantage. What can we put on each host that will be able to spawn multiple processes of different types? My first thought was a custom script, but after a tiny bit of research I found a StackOverflow question: Python subprocess in parallel.\nThe long and short of this is that creating a list of subprocess.Popen objects allows them to run concurrently. 
By polling them to see if they\u0026rsquo;re done and using select to buffer IO across multiple processes, you can collect stdout on demand, managing the execution of multiple subprocesses.\nSo now the plan is:\nFabric sends a list of commands per host to pproc pproc coordinates the execution of processes per host pproc sends Fabric serialized stdout Fabric quits when pproc exits I\u0026rsquo;ve created a command line script called pproc.py that wraps this and takes any number of commands and their arguments (so long as they are surrounded by quotes) and executes the pproc functionality described above. Consider the following \u0026ldquo;child process\u0026rdquo;:\n#!/usr/bin/env python3 import os import sys import time import random import argparse def fprint(s): \u0026#34;\u0026#34;\u0026#34; Performs a flush after print and prepends the pid. \u0026#34;\u0026#34;\u0026#34; msg = \u0026#34;proc {}: {}\u0026#34;.format(os.getpid(), s) print(msg) sys.stdout.flush() if __name__ == \u0026#39;__main__\u0026#39;: parser = argparse.ArgumentParser() parser.add_argument(\u0026#34;-l\u0026#34;, \u0026#34;--limit\u0026#34;, type=int, default=5) args = parser.parse_args() for idx in range(5): worked = random.random() * args.limit time.sleep(worked) fprint(\u0026#34;task {} lasted {:0.2f} seconds\u0026#34;.format(idx, worked)) This script is just simulating work by sleeping, but crucially, takes an argument on the command line. 
If we run pproc as follows:\n$ pproc \u0026#34;./child.py -l 5\u0026#34; \u0026#34;./child.py -l 6\u0026#34; \u0026#34;./child.py -l 4\u0026#34; Then we get the following serialized output:\nproc 46145: task 0 lasted 2.68 seconds proc 46146: task 0 lasted 3.13 seconds proc 46145: task 1 lasted 0.95 seconds proc 46144: task 0 lasted 3.70 seconds proc 46144: task 1 lasted 0.15 seconds proc 46146: task 1 lasted 1.12 seconds proc 46145: task 2 lasted 2.90 seconds proc 46146: task 2 lasted 2.80 seconds proc 46144: task 2 lasted 3.67 seconds proc 46146: task 3 lasted 0.59 seconds proc 46144: task 3 lasted 2.30 seconds proc 46146: task 4 lasted 2.23 seconds proc 46145: task 3 lasted 4.65 seconds proc 46144: task 4 lasted 3.06 seconds proc 46145: task 4 lasted 4.05 seconds Sweet! Things are happening concurrently and we can specify any arbitrary commands with their arguments on the command line! Win! The complete listing of the pproc script is as follows:\nExperiments So what was this all for? Well, I\u0026rsquo;m running distributed systems experiments, and it\u0026rsquo;s very tricky to coordinate everything and get results. A datapoint for an experiment runs the entire system with a specific workload and a specific configuration for a fixed amount of time, then dumps the numbers to disk.\nProblem: For a single datapoint I need to concurrently start up 48 processes: 24 replicas and 24 workload generators on 4 machines. Each process requires a slightly different configuration. An experiment is composed of multiple data points, usually between 40-200 individual runs of samples that take approximately 45 - 480 seconds each.\nThe solutions I had proposed were as follows:\nSolution 1 (by hand): open up 48 terminals and type simultaneously into them using iTerm. Each configuration is handled by the environment of each terminal session. 
Experiments take about 4-5 hours using this method, which is prone to user error.\nSolution 2 (ssh push): use fabric to parallelize the opening of 48 ssh sessions and run a command on the remote host. Experiment run times go down to about 1.5 hours, but each script has to be written by hand, and I am also noticing SSH failures from too many connections at the higher levels; it\u0026rsquo;s also pretty hacky.\nSolution 3 (amqp pull): write a daemon on all machines that listens to an amqp service (AWS SQS is $0.40 for 1M requests) and starts up processes on the local machine. This would solve the coordination issue and could even aggregate results, but would require extra coding and involve another process running on the machines.\nThe solution described in this post would hopefully modify Solution 2 (ssh push) to actually make it tenable.\n","permalink":"https://bbengfort.github.io/2017/06/concurrent-subprocesses-fabric/","summary":"\u003cp\u003eI\u0026rsquo;ve been using \u003ca href=\"http://docs.fabfile.org/\"\u003eFabric\u003c/a\u003e to concurrently start multiple processes on several machines. These processes have to run at the same time (since they are experimental processes and are interacting with each other) and shut down at more or less the same time so that I can collect results and immediately execute the next sample in the experiment. 
However, I was having some difficulties directly using Fabric:\u003c/p\u003e\n\u003col\u003e\n\u003cli\u003eFabric can parallelize one task across multiple hosts according to roles.\u003c/li\u003e\n\u003cli\u003eFabric can be hacked to run multiple tasks on multiple hosts by setting \u003ccode\u003eenv.dedupe_hosts = False\u003c/code\u003e\u003c/li\u003e\n\u003cli\u003eFabric can only parallelize one type of task, not multiple types\u003c/li\u003e\n\u003cli\u003eFabric can\u0026rsquo;t handle large numbers of SSH connections\u003c/li\u003e\n\u003c/ol\u003e\n\u003cp\u003eIn this post we\u0026rsquo;ll explore my approach with Fabric and my current solution.\u003c/p\u003e","title":"Concurrent Subprocesses and Fabric"},{"content":"In my current experimental setup, each process is a single instance of sample, from start to finish. This means that I need to aggregate results across multiple process runs that are running concurrently. Moreover, I may need to aggregate those results between machines.\nThe most compact format to store results in is CSV. This was my first approach and it had some benefits including:\nsmall file sizes readability CSV files can just be concatenated together The problems were:\nheaders become very difficult everything is a string, no int or float types without parsing The headers problem is really the biggest problem, since I need future me to be able to read the results files and understand what\u0026rsquo;s going on in them. I therefore opted instead for .jsonl format, where each object is newline delimited JSON. Though way more verbose a format than CSV, it does preclude the headers problem and allows me to aggregate different results versions with ease. 
Again, I can just concatenate the results from different files together.\nThis is becoming so common in my Go code, here is a simple function that takes a path to append to as input as well as the JSON value (the interface) and appends the marshaled data to disk:\nNow my current worry is atomic appends from multiple processes (is this possible?!) I was hoping that the file system would lock the file between writes, but I\u0026rsquo;m not sure it does: Is file append atomic in UNIX?. Anyway, more on that later.\n","permalink":"https://bbengfort.github.io/2017/06/append-json-results/","summary":"\u003cp\u003eIn my current experimental setup, each process is a single instance of sample, from start to finish. This means that I need to aggregate results across multiple process runs that are running concurrently. Moreover, I may need to aggregate those results between machines.\u003c/p\u003e\n\u003cp\u003eThe most compact format to store results in is CSV. This was my first approach and it had some benefits including:\u003c/p\u003e\n\u003col\u003e\n\u003cli\u003esmall file sizes\u003c/li\u003e\n\u003cli\u003ereadability\u003c/li\u003e\n\u003cli\u003eCSV files can just be concatenated together\u003c/li\u003e\n\u003c/ol\u003e\n\u003cp\u003eThe problems were:\u003c/p\u003e","title":"Appending Results to a File"},{"content":"One of the projects I\u0026rsquo;m currently working on is the ingestion of RSS feeds into a Mongo database. It\u0026rsquo;s been running for the past year, and as of this post has collected 1,575,987 posts for 373 feeds after 8,126 jobs. This equates to about 585GB of raw data, and a firm requirement for compression in order to exchange data.\nRecently, @ojedatony1616 downloaded the compressed zip file (53GB) onto a 1TB external hard disk and attempted to decompress it. After three days, he tried to cancel it and ended up restarting his computer because it wouldn\u0026rsquo;t cancel. 
His approach was simply to double click the file on OS X, but that got me to thinking \u0026ndash; it shouldn\u0026rsquo;t have taken that long; why did it choke? Inspecting the export logs on the server, I noted that it took 137 minutes to compress the directory; shouldn\u0026rsquo;t it take that long to decompress as well?\nA quick Google revealed A Quick Benchmark: Gzip vs. Bzip2 vs. LZMA, written in 2005 to explore the performance of Gzip, Bzip2, and LZMA. This post cited Gzip as having the largest final compression size, but the fastest compression speed. That was 12 years ago, however, so I wanted to get more modern numbers for the compression of a directory of many intermediately sized files. Hopefully this will help us make better decisions about data management and compression in the future.\nIn particular, these observations explore the compression ratio and speed of Tar Gzip, Tar Bzip2, and Zip on directories containing many intermediate sized files from 1MB to 10MB.\nResults The following results were recorded on the following platform:\n2.8GHz Intel Core i7 Macbook Pro 16GB DDR3 Memory and 750GB Flash Storage Disk OS X El Capitan Version 10.11.6 bsdtar 2.8.3 - libarchive 2.8.3 Apple gzip 251 bzip2 version 1.0.6, 6-Sept-2010 Zip 3.0 (July 5th 2008), by Info-ZIP As always, performance measurements are determined by a number of factors; use these results as a guide rather than as strict truth!\nIn the first chart we explore the amount of time it takes to compress a large directory. There is a linear relationship between the size of the directory and the amount of time it takes to compress it, which makes sense. BZip2 takes the longest, and Zip and GZip are comparable in terms of the overall amount of time.\nWe get a similar result for extraction time, though clearly extraction is much faster than compression. 
BZip2 is once again the slowest, but although Zip and GZip are still comparable at lower file sizes, GZip appears to be taking an advantage at the larger archives. We\u0026rsquo;ll have to explore this more with much larger archives.\nCompression to extraction times appear to have a nearly linear relationship. When plotted against each other, we can see that indeed the slope of Zip is slightly larger than that of GZip and in fact there will be a measurable difference for larger file sizes!\nThe above graph simply shows both the compression and extraction times and their relationship to each other.\nLooking at how much we\u0026rsquo;ve compressed, we can compute the compression ratio: plotting the size of the original data to the archive size. This is a log-log scale, and we can see that BZip2 creates smaller archives at the cost of the time performance hit. BZip2 appears to be parallel with GZip, but GZip appears to have a slightly larger slope than Zip, doing better at smaller archive sizes and may eventually do even better at much larger file sizes.\nAll compression algorithms of course reduce huge amounts of dataset space when reducing text, around 80% reductions for Zip and GZip and over 90% reduction for BZip2.\nBecause of this result, it\u0026rsquo;s clear that instead of compressing the entire directory, we should instead compress each individual file, extracting them only as necessary as we need to read them in.\nMethod The goal of this benchmark was to explore compression and extraction of a directory containing many small files (similar to the corpus dataset we are dealing with). The files in question are text, json, or html, which compress pretty well. Therefore I created a dataset generation script that used the lorem package to create random text files of various sizes (1MiB and 2MiB files to start).\nEach directory contained 8 subdirectories with n files in each directory, which determines the total size of the dataset. 
For example, the 64MiB dataset of 1MiB files contained 8 files per subdirectory. The benchmark script first walked the data directory to get an exact file size, then compressed it using the specified tool. It computed the archive size to get the percent compression, then extracted the file to a temporary directory. Both compression and extraction were timed.\nFor more details, please see the script used to generate test data sets and run benchmarks on Gist: zipbench.py.\nFor future work I\u0026rsquo;d like to build this up to much larger corpus sizes, but that will probably require AWS or some dedicated hardware other than my MacBook Pro, and a lot more time!\n","permalink":"https://bbengfort.github.io/2017/06/compression-benchmarks/","summary":"\u003cp\u003eOne of the projects I\u0026rsquo;m currently working on is the \u003ca href=\"http://baleen.districtdatalabs.com/\"\u003eingestion of RSS feeds into a Mongo database\u003c/a\u003e. It\u0026rsquo;s been running for the past year, and as of this post has collected 1,575,987 posts for 373 feeds after 8,126 jobs. This equates to about 585GB of raw data, and a firm requirement for compression in order to exchange data.\u003c/p\u003e\n\u003cp\u003eRecently, \u003ca href=\"https://github.com/ojedatony1616\"\u003e@ojedatony1616\u003c/a\u003e downloaded the compressed zip file (53GB) onto a 1TB external hard disk and attempted to decompress it. After three days, he tried to cancel it and ended up restarting his computer because it wouldn\u0026rsquo;t cancel. His approach was simply to double click the file on OS X, but that got me to thinking \u0026ndash; it shouldn\u0026rsquo;t have taken that long; why did it choke? 
Inspecting the export logs on the server, I noted that it took 137 minutes to compress the directory; shouldn\u0026rsquo;t it take that long to decompress as well?\u003c/p\u003e","title":"Compression Benchmarks"},{"content":"Was introduced to an interesting problem today when decorating tests that need to be discovered by the nose runner. By default, nose explores a directory looking for things named test or tests and then executes those functions, classes, modules, etc. as tests. A standard test suite for me looks something like:\nimport unittest class MyTests(unittest.TestCase): def test_undecorated(self): \u0026#34;\u0026#34;\u0026#34; assert undecorated works \u0026#34;\u0026#34;\u0026#34; self.assertEqual(2+2, 4) The problem came up when we wanted to decorate a test with some extra functionality, for example loading a fixture:\ndef load_fixture(func): def wrapper(*args, **kwargs): # Load a fixture return func(*args, **kwargs) return wrapper class MyTests(unittest.TestCase): @load_fixture def test_decorated(self): \u0026#34;\u0026#34;\u0026#34; assert a decorated test works \u0026#34;\u0026#34;\u0026#34; self.assertEqual(2+2, 4) The key to remember is that you must wrap the function so that the name and docstring are added to the internal wrapper, thus allowing the nose test discovery function to work:\nfrom functools import wraps def load_fixture(func): @wraps(func) def wrapper(*args, **kwargs): # Load a fixture return func(*args, **kwargs) return wrapper Thanks to @ndanielsen for pointing this out, it\u0026rsquo;s going to save me a bit of trouble in the future, I expect.\n","permalink":"https://bbengfort.github.io/2017/05/test-decorators/","summary":"\u003cp\u003eWas introduced to an interesting problem today when decorating tests that need to be discovered by the \u003ca href=\"https://pypi.python.org/pypi/nose/1.3.7\"\u003enose\u003c/a\u003e runner. 
By default, nose explores a directory looking for things named \u003ccode\u003etest\u003c/code\u003e or \u003ccode\u003etests\u003c/code\u003e and then executes those functions, classes, modules, etc. as tests. A standard test suite for me looks something like:\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-python\" data-lang=\"python\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"kn\"\u003eimport\u003c/span\u003e \u003cspan class=\"nn\"\u003eunittest\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"k\"\u003eclass\u003c/span\u003e \u003cspan class=\"nc\"\u003eMyTests\u003c/span\u003e\u003cspan class=\"p\"\u003e(\u003c/span\u003e\u003cspan class=\"n\"\u003eunittest\u003c/span\u003e\u003cspan class=\"o\"\u003e.\u003c/span\u003e\u003cspan class=\"n\"\u003eTestCase\u003c/span\u003e\u003cspan class=\"p\"\u003e):\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    \u003cspan class=\"k\"\u003edef\u003c/span\u003e \u003cspan class=\"nf\"\u003etest_undecorated\u003c/span\u003e\u003cspan class=\"p\"\u003e(\u003c/span\u003e\u003cspan class=\"bp\"\u003eself\u003c/span\u003e\u003cspan class=\"p\"\u003e):\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e        \u003cspan class=\"s2\"\u003e\u0026#34;\u0026#34;\u0026#34;\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"s2\"\u003e        assert 
undecorated works\n\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"s2\"\u003e        \u0026#34;\u0026#34;\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e        \u003cspan class=\"bp\"\u003eself\u003c/span\u003e\u003cspan class=\"o\"\u003e.\u003c/span\u003e\u003cspan class=\"n\"\u003eassertEqual\u003c/span\u003e\u003cspan class=\"p\"\u003e(\u003c/span\u003e\u003cspan class=\"mi\"\u003e2\u003c/span\u003e\u003cspan class=\"o\"\u003e+\u003c/span\u003e\u003cspan class=\"mi\"\u003e2\u003c/span\u003e\u003cspan class=\"p\"\u003e,\u003c/span\u003e \u003cspan class=\"mi\"\u003e4\u003c/span\u003e\u003cspan class=\"p\"\u003e)\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eThe problem came up when we wanted to decorate a test with some extra functionality, for example loading a fixture:\u003c/p\u003e","title":"Decorating Nose Tests"},{"content":"I have had some recent discussions regarding cacheing to improve application performance that I wanted to share. Most of the time those conversations go something like this: “have you heard of Redis?” I\u0026rsquo;m fascinated by the fact that an independent, distributed key-value store has won the market to this degree. However, as I\u0026rsquo;ve pointed out in these conversations, cacheing is a hierarchy (heck, even the processor has varying levels of cacheing). Especially when considering micro-service architectures that require extremely low latency responses, cacheing should be a critical part of the design, not just a bolt-on after thought!\nSo here are the tools I consider when implementing cacheing, in a hierarchy from single process to distributed processes:\nIn a later post, I may review embedded, multi-threaded, or external multi-process cacheing. 
In this post, however, I\u0026rsquo;m focused on component based single thread cacheing. But before we discuss that let’s review why cacheing is important. A definition:\nCacheing: storing a computed value in a quickly readable data structure (usually in memory) to reduce the amount of time to respond to API calls usually by minimizing the need for repeated computation.\nThe idea is that computing a value takes a measurable amount of time, either from processor cycles or I/O from the disk or another data source. By storing the computed value, repeated calls with a similar input can benefit from fast lookups in memory. Let\u0026rsquo;s look at a simple example of this, a process called memoization:\nfrom functools import wraps def memoized(fget): attr_name = \u0026#39;_{0}\u0026#39;.format(fget.__name__) @wraps(fget) def fget_memoized(self): if not hasattr(self, attr_name): setattr(self, attr_name, fget(self)) return getattr(self, attr_name) return property(fget_memoized) This snippet of code is so common that it is seen in a utils module in almost every larger piece of software I write. The memoized function is a method decorator (for classes) that acts similarly to the @property decorator. When a class attribute is accessed, its value corresponds to the return value of the fget function. When used with fget_memoized, however, the fget function is called, stored on the object, and instead of calling fget repeatedly, the cached value is returned. For example:\nclass Thing(object): @memoized def prime(self): print(\u0026#34;long running computation!\u0026#34;) return 31 The print statement will only occur once, on the first access to thing.prime. After that, all calls will return the value of thing._prime. To force a recomputation you can simply del thing._prime.\nThis is great, and extremely commonly used — but what if you want to cache the response by input or timeout the cache after a fixed period? 
The answer to the first is the lru_cache, which caches values, discarding the \u0026ldquo;least recently used\u0026rdquo; first. Therefore if you have a function that accepts an argument:\nfrom functools import lru_cache @lru_cache(maxsize=256) def fib(n): if n \u0026lt; 2: return n return fib(n-1) + fib(n-2) Then the cache will store values for that argument until maxsize is reached, at which point values used least recently will be discarded. Note that a maxsize value that is a power of 2 gives the best performance. You can also inspect the cache as follows:\n\u0026gt;\u0026gt;\u0026gt; fib(31) 1346269 \u0026gt;\u0026gt;\u0026gt; fib.cache_info() CacheInfo(hits=29, misses=32, maxsize=256, currsize=32) To expire a value after a specific amount of time, I recommend using an ExpiringDict as follows:\nfrom expiringdict import ExpiringDict cache = ExpiringDict(max_len=256, max_age_seconds=2) You can now get and put values into the cache:\nimport time cache[\u0026#34;foo\u0026#34;] = \u0026#34;bar\u0026#34; cache.get(\u0026#34;foo\u0026#34;) # bar time.sleep(2) cache.get(\u0026#34;foo\u0026#34;) # None On get, the ExpiringDict checks the number of seconds since the value was inserted into the dictionary. If it is longer than max_age_seconds, the value is deleted and None is returned. Note that the cache is only managed on access, therefore without a max_len, it can grow to infinite size if not cleaned up. One way to manage this is with a routine garbage collection thread that just performs a get on all values, locking the dictionary as it does.\nNeither of these types of cache is persistent. In order to persist a cache to disk, you can simply pickle the object to disk. However, a better option might be the Python shelve module.\nA \u0026ldquo;shelf\u0026rdquo; is a persistent, dictionary-like object that stores Python objects to disk. By itself, it is not a cache per se, but with the writeback flag set to True, it can be used as a durable cache. 
In this case, entries are cached and accessed in memory, and only snapshotted to disk on sync and close.\nimport shelve class DurableCache(object): def __init__(self, path): self.db = shelve.open(path, writeback=True) def put(self, key, val): self.db[key] = val def get(self, key): return self.db[key] def close(self): self.db.close() def sync(self): self.db.sync() With a little creativity, these caches can be extremely effective local durable storage. However, note that the shelf does not know when an object has been mutated, so it keeps every accessed entry in memory and writes them all back, which means it can consume a lot of memory or take a long time to sync or close. Advanced in-memory caches that use the shelve module add logic to detect these things and background routines to clean up and periodically checkpoint to disk for recovery.\nThere is still a long way to go with cacheing options, including embedded and in-memory databases as well as external caches for multi-process or distributed cacheing. I may discuss these in another post.\n","permalink":"https://bbengfort.github.io/2017/05/in-process-caches/","summary":"\u003cp\u003eI have had some recent discussions regarding cacheing to improve application performance that I wanted to share. Most of the time those conversations go something like this: “have you heard of Redis?” I\u0026rsquo;m fascinated by the fact that an independent, distributed key-value store has won the market to this degree. However, as I\u0026rsquo;ve pointed out in these conversations, cacheing is a hierarchy (heck, even the processor has varying levels of cacheing). Especially when considering micro-service architectures that require extremely low latency responses, cacheing should be a critical part of the design, not just a bolt-on after thought!\u003c/p\u003e","title":"In Process Cacheing"},{"content":"An interesting question came up in the development of Yellowbrick: given a vector of values, what is the quickest way to get the unique values? 
Ok, so maybe this isn\u0026rsquo;t a terribly interesting question, however the results surprised us and may surprise you as well. First we\u0026rsquo;ll do a little background, then I\u0026rsquo;ll give the results and then discuss the benchmarking method.\nThe problem comes up in Yellowbrick when we want to get the discrete values for a target vector, y — a problem that comes up in classification tasks. By getting the unique set of values we know the number of classes, as well as the class names. This information is necessary during visualization because it is vital in assigning colors to individual classes. Therefore in a Visualizer we might have a method as follows:\nclass ScatterVisualizer(Visualizer): def fit(self, X, y=None): labels = [ str(item) for item in set(y) ] colors = dict(zip(labels, resolve_colors(len(labels)))) ... NOTE: a related question is how can we automatically distinguish a continuous vector y (a regression problem) from a categorical vector y (a classification problem)? This allows us to assign sequential vs. discrete colors to the target variable.\nTo make a short story even shorter, when I reviewed the above code, my response was: “isn\u0026rsquo;t something like np.unique faster?”. I was returned a simple “yep, sure is” answer, and the code was changed to np.unique — job done, right? When the commit was pushed, a few tests didn\u0026rsquo;t pass; it looked like there was an issue converting a Python data type into the numpy data type to pass to the unique function (turns out this was not the issue), but that caused me to investigate the input type to the uniqueness method. Whether set or np.unique is faster depends on whether the input type is a Python list or a Numpy array, as we\u0026rsquo;ll see shortly.\nSo let\u0026rsquo;s get into results. 
We proposed three methods of getting the unique items from our target vector:\nimport numpy as np from sklearn.preprocessing import LabelEncoder def py_unique(data): return list(set(data)) def np_unique(data): return np.unique(data) def sk_unique(data): encoder = LabelEncoder() encoder.fit(data) return encoder.classes_ The first converts the data into a Python set and then into a list, returning the unsorted list of unique values. The second uses numpy and converts the input into a np.array; it actually returns a sorted array of values. The third option is more directly related to Scikit-Learn, fitting a LabelEncoder transformer and getting the unique classes from that.\nBefore getting into the benchmarking methodology, the results are as follows:\nThe results in the above figure show that by far the fastest unique computation is using set on a Python list. This is especially surprising given the fact that numpy arrays are C implementations, and are therefore expected to be blazingly fast. Using np.unique is on average faster than everything else, and it certainly gives the best performance on array data structures out of all the methods. It does slightly worse with Python lists, but not as badly as Python does with array structures. Scikit-Learn clearly adds some overhead, especially when it comes to Python lists, but performs fairly well for array structs.\nIn the end, we chose to stick with np.unique in Yellowbrick, primarily because the expected input is in fact a np.array, either from data loaded from np.loadtxt or from a Pandas Series or DataFrame. If a Python list is passed in, then the performance is adequate for our needs. Still, the performance gaps based on input type were a surprise and I would encourage you, as always, to benchmark code and not just rely on traditional assumptions!\nNOTE: If you believe that our implementation or benchmarking can be improved, please let me know!\nBenchmarking Notes Benchmarking, especially in Python, is a tricky task. 
Therefore, in order to be as transparent as possible in the claims made above and to quickly catch any mistakes, I present the benchmarking methodology here. The complete script and notebook can be found on Gist.\nFirst, I will say that I did explore the timeit module for benchmarking, but couldn\u0026rsquo;t make these particular tests work with it. Instead, I wrote a simple timing function that returns the time delta in microseconds (μs). I also wrote a benchmark function that applied the unique method to a dataset n=10000 times and returned the average time for an operation.\ndef timeit(func): start = time.time() func() return ((time.time() - start) * 1000000.0) def benchmark(func, data, n=10000): delta = sum([ timeit(lambda: func(data)) for _ in range(n) ]) return (float(delta) / float(n)) Because a set operation is at worst O(n) and therefore depends on the length of the dataset, I created a function to make a dataset of a variable length with between 1 and 52 unique elements. This data was then stored as a Python list or as a Numpy array object depending on the input tested.\ndef make_data(uniques=10, length=10000): chars = string.ascii_letters if uniques \u0026gt; len(chars): raise ValueError(\u0026#34;too many uniques for the choices\u0026#34;) return [ random.choice(chars[:uniques]) for idx in range(length) ] The actual test protocol ran on datasets whose length went from 10 to 100,000 items by a factor of ten each time (e.g. 10, 100, 1000, etc.). The test also factored different numbers of unique values from 1 to 40 by 5. Each dataset was then benchmarked as a list and an array against the three _unique methods for a total of 195 benchmarks.\nAs you can see, the amount of time per operation increases exponentially as the length of the dataset increases:\nAnd it appears (as expected) that the number of unique values per dataset does not have a meaningful impact on the operation time:\nHopefully these timing numbers and approach to benchmarking seem valid. 
They certainly work to highlight interesting places where our coding assumptions might fail us.\n","permalink":"https://bbengfort.github.io/2017/05/python-unique-benchmark/","summary":"\u003cp\u003eAn interesting question came up in the development of \u003ca href=\"http://www.scikit-yb.org/\"\u003eYellowbrick\u003c/a\u003e: given a vector of values, what is the quickest way to get the unique values? Ok, so maybe this isn\u0026rsquo;t a terribly interesting question, however the results surprised us and may surprise you as well. First we\u0026rsquo;ll do a little background, then I\u0026rsquo;ll give the results and then discuss the benchmarking method.\u003c/p\u003e\n\u003cp\u003eThe problem comes up in Yellowbrick when we want to get the discrete values for a target vector, \u003ccode\u003ey\u003c/code\u003e — a problem that comes up in classification tasks. By getting the unique set of values we know the number of classes, as well as the class names. This information is necessary during visualization because it is vital in assigning colors to individual classes. Therefore in a Visualizer we might have a method as follows:\u003c/p\u003e","title":"Unique Values in Python: A Benchmark"},{"content":"Part of my research is taking me down a path where I want to measure the number of reads and writes from a client to a storage server. A key metric that we\u0026rsquo;re looking for is throughput — the number of accesses per second that a system supports. As I discovered in a very simple test to get some baseline metrics, even this simple metric can have some interesting complications.\nSo let\u0026rsquo;s start with the model. Consider a single server that maintains an in-memory key/value store. Clients can Get (read) values for a particular key and Put (write) values associated with a single key. Every Put creates a version associated with the value that orders the writes as they come in. 
This model has implications for consistency, even though there is a single server, and we\u0026rsquo;ll get into that later.\nThe server handles multiple clients concurrently, each client in its own goroutine. In order to maintain thread safety, a Put request must lock the store while it\u0026rsquo;s modifying it, ensuring that the value is correctly ordered and not corrupted. On the other hand, a Get request requires only a read lock; multiple goroutines can have a read lock but wait for a write lock to finish. The way the locks work can also inform consistency.\nThe server and client command line apps are implemented and can be found at github.com/bbengfort/honu.\nThroughput is measured by pushing the server to a steady-state of requests. Each client issues a Put request to the server, measuring how long it takes to get a response. As soon as the Put request returns, the client immediately sends another request, and continues to do so for some predetermined amount of time. As the number of clients increases, the server reaches a maximum capacity of requests it can handle in a second, and that is the maximum throughput.\nIn the first round of experiments, each client is writing to its own key, meaning that there is no conflict on the server end. I utilized the Horvitz Research Cluster to create 25 clients and a single server with low latency connections. Each client runs for 30 seconds sending as many messages sequentially as it possibly can. The throughput is measured as the number of messages divided by the latency of the RPC (and does not include the latency of creating or handling messages at the client end).\nThe first results, displayed in the figure above, show the average client throughput as the number of clients increases. This graph met my expectations, in that as the number of clients increases, the throughput goes down. 
However, when I showed this graph to my advisor, his first response was that it was off in two ways:\nA server should be able to handle far more than 1200 messages per second, probably closer to 10,000 messages per second. The throughput should actually increase as the number of clients goes up because the server spends most of its time waiting. To the first point, I noted that the server was on a VM, potentially the resource scheduling at the hypervisor layer was causing the throughput to be artificially less than a typical server environment. To test this, I ran a client and server as separate processes on the same machine, connecting over the local loopback address to minimize the noise of network constraints. I then compared the VM (bbc29) to a box in my office (lagoon).\nClearly my advisor was right on the money. On the hardware, the application (with the exact same configuration) performs slightly under 10x better than on the virtual machine. I also tested to see if trace messages (print statements that log connections) affected performance, and they do (blue is without trace, green is with trace) — but not to the amount that can be reconciled with the difference between virtual and hardware performance.\nThis was a surprise, and made me question whether or not I should rethink using virtual machines in the cloud. However, it was pointed out to me that cloud services do their best not to oversubscribe their hardware, and in an academic setting that may not be the case.\nTo the second point, the first graph is actually measuring latency at the client, not at the server. So although the server is actually sitting around with spare capacity when there are fewer clients, the throughput can only go as fast as the client does. I think the first graph does show that until about 9 clients or so the performance is plateaued, meaning that the server has capacity to handle all clients at their particular rates. 
After 9 clients, however, the server is no longer primarily waiting for requests, but is constantly handling requests, and the locks become a factor.\nIn order to explore this, I instead measured throughput at the server-side. The server records the timestamp of the first message it receives, then maintains the timestamp of the last message it receives, counting the number of messages. It then divides the number of messages by the delta of the last message to the first. The graph above shows the measurements back in the virtual machine cluster of the server-side throughput. This graph is the familiar one, the one it\u0026rsquo;s “supposed to be” — as the number of clients increases, the throughput increases linearly, until about 9 clients or so when the capacity plateaus at around 16,000 writes per second.\nLatency variability, message ordering, and other factors can come into play in a geographic environment — and it is certainly my intention to explore those factors in detail. However, I think it was an important systems lesson to learn the expected shape of baseline environments, so that I will be able to immediately compare graphs I\u0026rsquo;m getting with the expected form.\n","permalink":"https://bbengfort.github.io/2017/04/throughput/","summary":"\u003cp\u003ePart of my research is taking me down a path where I want to measure the number of reads and writes from a client to a storage server. A key metric that we\u0026rsquo;re looking for is \u003cem\u003ethroughput\u003c/em\u003e — the number of accesses per second that a system supports. As I discovered in a very simple test to get some baseline metrics, even this simple metric can have some interesting complications.\u003c/p\u003e","title":"Measuring Throughput"},{"content":"This week I discovered I had a problem with my Google Calendar — events accidentally got duplicated or deleted and I needed a way to verify that my primary calendar was correct. 
Rather than painstakingly go through the web interface and spot check every event, I instead wrote a Go console program using the Google Calendar API to retrieve events and save them in a CSV so I could inspect them all at once. This was great, and very easy using Google\u0026rsquo;s Go libraries for their APIs, and the quick start was very handy.\nMy calendar is private, therefore in order to access it from the command line, I had to authenticate with Google using OAuth2. This is an external application workflow that is browser based: an application that wants to authenticate with Google\u0026rsquo;s service first redirects the client to Google with a token that allows Google to verify the application. The user logs in with Google, accepts the level of access the application wants, and Google sends the user back to the application with a token. That token, signed with the application secret, allows the application to access Google on behalf of the user.\nSo, how do you do this type of authentication in the terminal? Basically, the console program prints out the link (or uses the $GOOS specific open command) and the user manually goes to the website. Google then provides the token in the browser, which the user has to copy and paste back into stdin on the command line. The good news is that if this token is cached somewhere, then this only has to be done once for multiple requests until the token expires or the user deletes it.\nThe Calendar API quickstart provided several functions for this, first looking to see if the token was cached on disk in a specific place in the user\u0026rsquo;s home directory; and then if not available, performed the web authentication and cached the token locally. There are, however, a lot of moving parts to this including the configuration for where to store the cached token, as well as the application credentials stored in a file called client_secret.json. 
Rather than hardcode these things, I created an Authentication struct that managed all aspects of authentication and token gathering, and I present it to you here:\nThe primary entry point to this struct is the auth.Token() method, which retrieves the token from the cache, or starts the web authentication process to cache the token if it doesn\u0026rsquo;t exist. This revolves around the key methods auth.CachePath() and auth.ConfigPath(), which compute the default locations for the token cache and the client_secret.json file in a hidden directory in the user\u0026rsquo;s home directory. The Authentication struct also provides Load(), Save() and Delete() functions for managing the cache directly.\nThis can be used to create an API client as follows:\n// Initialize authentication auth := new(Authentication) // Load the configuration from client_secret.json config, err := auth.Config() if err != nil { log.Fatal(err.Error()) } // Load the token from the cache or force authentication token, err := auth.Token() if err != nil { log.Fatal(err.Error()) } // Create the API client with a background context. ctx := context.Background() client := config.Client(ctx, token) // Create the google calendar service gcal, err := calendar.New(client) if err != nil { log.Fatal(\u0026#34;could not create the google calendar service\u0026#34;) } And the service can be used to get the next 10 events on the calendar like so:\n// Create the time to get events from. now := time.Now().Format(time.RFC3339) // Get the events list from the calendar service. events, err := gcal.Events.List(\u0026#34;primary\u0026#34;).ShowDeleted(false).SingleEvents(true).TimeMin(now).MaxResults(10).OrderBy(\u0026#34;startTime\u0026#34;).Do() if err != nil { log.Fatalf(\u0026#34;unable to retrieve upcoming events: %v\u0026#34;, err) } // Loop over the events and print them out. 
for _, i := range events.Items { var when string // If the DateTime is an empty string, // the event is an all day event if i.Start.DateTime != \u0026#34;\u0026#34; { when = i.Start.DateTime } else { when = i.Start.Date } fmt.Printf(\u0026#34;%s (%s)\\n\u0026#34;, i.Summary, when) } This is actually a complete example of using the Calendar API from the quickstart guide — most of the work comes from the interaction with OAuth2. But the good news is that the Authentication struct will work with most Google APIs, so long as you download the correct client_secret.json!\n","permalink":"https://bbengfort.github.io/2017/04/oauth-token-command-line/","summary":"\u003cp\u003eThis week I discovered I had a problem with my Google Calendar — events accidentally got duplicated or deleted and I needed a way to verify that my primary calendar was correct. Rather than painstakingly go through the web interface and spot check every event, I instead wrote a Go console program using the \u003ca href=\"https://developers.google.com/google-apps/calendar/quickstart/go\"\u003eGoogle Calendar API\u003c/a\u003e to retrieve events and save them in a CSV so I could inspect them all at once. This was great, and very easy using Google\u0026rsquo;s Go libraries for their APIs, and the quick start was very handy.\u003c/p\u003e","title":"OAuth Tokens on the Command Line"},{"content":"I routinely have long-running scripts (e.g. for a data processing task) that I want to know when they\u0026rsquo;re complete. It seems like it should be simple for me to add in a little snippet of code that will send an email using Gmail to notify me, right? Unfortunately, it isn\u0026rsquo;t quite that simple for a lot of reasons, including security, attachment handling, configuration, etc. 
In this snippet, I\u0026rsquo;ve attached my constant copy and paste notify() function, written into a command line script for easy sending on the command line.\nGmail Setup If you\u0026rsquo;re like me, you have a gmail account with 2-factor authentication (and if you don\u0026rsquo;t, you should get that set up). In order to use this account to send email from, you\u0026rsquo;re going to have to configure gmail as follows:\nAllow less secure apps to access your account Create a Sign in using App Passwords Alternatively you could create an account to solely send notifications from and not give it two factor authentication, but you\u0026rsquo;d still have to do step 1. Even if you do all this stuff, Google Apps can still get in the way, so be sure to inspect any errors you get carefully!\nEnvironment Setup This script and most of my Python scripts contain configuration and security information in the environment. Therefore, open up your .profile or other shell environment and add the following variables.\n## Notify Environment export EMAIL_USERNAME=you@gmail.com export EMAIL_PASSWORD=supersecret export EMAIL_HOST=smtp.gmail.com export EMAIL_PORT=587 export EMAIL_FAIL_SILENT=False I\u0026rsquo;ve also used YAML configuration, dotenv files, and all sorts of other configuration for this as well. Choose what suits your application\nNotify Script And here is a command line version of the script that wraps the notify() function. 
Note that its basic functionality is to send a simple alert and maybe attach some log or results files to the email, not to routinely send large amounts of HTML formatted messages!\nUsage So now you can send a simple notification as follows:\n$ notify.py -r jdoe@example.com Or you can set the subject and add attachments:\n$ notify.py -r jdoe@example.com -s \u0026#34;computation complete\u0026#34; results.csv Future versions of this script will allow you to pipe the message in via stdin so that you can chain the emailer along the command line. I also plan to do a better configuration, similar to how AWS CLI configures itself in a simple file in the home directory.\n","permalink":"https://bbengfort.github.io/2017/04/gmail-notifications-python/","summary":"\u003cp\u003eI routinely have long-running scripts (e.g. for a data processing task) that I  want to know when they\u0026rsquo;re complete. It seems like it should be simple for me to add in a little snippet of code that will send an email using Gmail to notify me, right? Unfortunately, it isn\u0026rsquo;t quite that simple for a lot of reasons, including security, attachment handling, configuration, etc. In this snippet, I\u0026rsquo;ve attached my constant copy and paste \u003ccode\u003enotify()\u003c/code\u003e function, written into a command line script for easy sending on the command line.\u003c/p\u003e","title":"Gmail Notifications with Python"},{"content":"On Tuesday evening I attended a Django District meetup on Grumpy, a transpiler from Python to Go. Because it was a Python meetup, the talk naturally focused on introducing Go to a Python audience, and because it was a Django meetup, we also focused on web services. 
The premise for Grumpy, as discussed in the announcing Google blog post, is also a web focused one — to take YouTube\u0026rsquo;s API that\u0026rsquo;s primarily written in Python and transpile it to Go to improve the overall performance and stability of YouTube\u0026rsquo;s front-end services.\nWhile still in experimental mode, they show a benchmarking graph in the blog post that shows as the number of threads increases, the number of Grumpy transpiled operations per second also increases linearly, whereas the CPython ops/sec actually decreases to a floor. This is fascinating stuff and actually kind of makes sense; potentially the opportunities for concurrency in Go defeat the GIL in Python and can give Python code deployable scalability.\nStill, I wanted to know, if it\u0026rsquo;s faster, how much faster is it? (skip ahead to results)\nIn both the meetup talk and the blog post, the fibonacci benchmark is discussed. Unfortunately, neither had raw numbers and since I wanted to try it out on my own anyway, I thought I would. In this post I\u0026rsquo;ll review the steps I took to use Grumpy then the benchmarking numbers that I came up with.\nGetting Started Transpiling Because the package is in experimental mode, you must download or clone the Grumpy repository and do all your work in the project root directory. This is because relative paths and a couple of special environment variables are required in order to make things work. First clone the repository and change your working directory to the project root:\n$ git clone https://github.com/google/grumpy.git $ cd grumpy At this point you need to build the grumpy tools and set a couple of environment variables to make things work.\n$ make $ export GOPATH=$PWD/build $ export PYTHONPATH=$PWD/build/lib/python2.7/site-packages Note that the make process actually took quite a bit of time on my MacBook, so be patient! 
I also added the export statements to an .env file locally so that I could easily set the environment for this directory in the future.\nThe hello world of Grumpy transpiling is quite simple. First create a python file, hello.py:\n#!/usr/bin/env python if __name__ == \u0026#39;__main__\u0026#39;: print \u0026#34;hello world!\u0026#34; You then transpile it and build a binary executable as follows:\n$ build/bin/grumpc hello.py \u0026gt; hello.go $ go build -o hello hello.go The first step uses the grumpc transpiler to create Go code from the Python code, and outputs it to the Go source code file, hello.go. The second step uses the go build tool (which requires the $GOPATH to be set correctly) to compile the hello.go program into a binary executable. You can now execute the file directly:\n$ ./hello hello world! Fibonacci In order to benchmark the code for time I want to compare three executables:\nA Python 2.7 implementation with recursion (fib.py) A pure Go implementation with similar characteristics (fib.go) The transpiled Python implementation (fibpy.go) Note: Obligatory Py2/3 comment: Grumpy is about making the YouTube API better, which is written in Python 2.7; so tough luck Python 3 folks, I guess.\nThe hypothesis is that the Python implementation will be the slowest, the transpiled one slightly faster and the Go implementation will blaze. For reference, here are my implementations:\n#!/usr/bin/env python import sys def fib(i): if i \u0026lt; 2: return 1 return fib(i-1) + fib(i-2) if __name__ == \u0026#39;__main__\u0026#39;: try: idx = sys.argv[1] print fib(int(idx)) except IndexError: print \u0026#34;please specify a fibonacci index\u0026#34; except ValueError: print \u0026#34;please specify an integer\u0026#34; The Python implementation is compact and understandable, coming in at 14 lines of code. 
The Go implementation is slightly longer at 24 lines of code:\npackage main import ( \u0026#34;fmt\u0026#34; \u0026#34;os\u0026#34; \u0026#34;strconv\u0026#34; ) func fib(i uint64) uint64 { if i \u0026lt; 2 { return uint64(1) } return fib(i-1) + fib(i-2) } func main() { if len(os.Args) != 2 { fmt.Println(\u0026#34;please specify a fibonacci index\u0026#34;) os.Exit(1) } idx, err := strconv.ParseUint(os.Args[1], 10, 64) if err != nil { fmt.Println(\u0026#34;please specify an integer\u0026#34;) os.Exit(1) } fmt.Println(fib(idx)) } In order to transpile the code, build it as follows:\n$ build/bin/grumpc fib.py \u0026gt; fibpy.go $ go build -o fibpy fibpy.go And of course build the go code as well:\n$ go build -o fib fib.go The transpiled code comes in at a whopping 255 lines of code, so I\u0026rsquo;ll not show it here, but if you\u0026rsquo;re interested you can find it at this gist.\nOne interesting thing about Grumpy is it uses a π symbol for variable names that reference Python, for example, the grumpy package is imported into the namespace πg.\nSo in terms of code, we have the following characteristics:\nBut frankly that\u0026rsquo;s fair — Grumpy has to do a lot of work to bring over the sys package from Python, handle exceptions in the try/except, handle the builtins and deal with objects and function definitions. I actually think Grumpy is doing pretty well in the translation in terms of LOC.\nBenchmarking Typically I would use Go benchmarking to measure the performance of an operation — it is both formal and does a good job of doing micro-measurements in terms of number of operations per second. However, I can\u0026rsquo;t use this technique for the Python code and I want to make sure that we can capture the benchmarks for the complete executable including imports like the sys module. 
Therefore the benchmarks are timings of complete runs of the executables, the equivalent of:\n$ time ./fib 40 $ time ./fibpy 40 $ time python fib.py 40 Because the recursive fibonacci implementation does not use memoization or dynamic programming, the computational time increases exponentially as the index gets higher. Therefore the benchmarks are several runs at moderately high indices to push the performance. In order to operationalize this, I wrote a small Python script to execute the benchmarks. You can find the benchmark script on Gist (it is a bit too large to include in this post).\nNOTE: I hope that I have provided everything needed to repeat these benchmarks. If you find a hole in the methodology or different results, I\u0026rsquo;d certainly be interested.\nAfter the timing benchmarks I also wanted to run resource usage benchmarks. Since the fibonacci implementation currently doesn\u0026rsquo;t use multiple threads, I can\u0026rsquo;t compare run times across increasing number of processes (TODO!). Instead, using the memory profiler library I simply measured memory usage. In the results section, I run each process using mprof independently in order to precisely track what is running where. However, using the new multiprocess feature of the memory profiler library you could create a bash script as follows:\n#!/bin/bash ./fib $1 \u0026amp; ./fibpy $1 \u0026amp; python fib.py $1 \u0026amp; wait And run the memory profiler on each of the processes:\n$ mprof run -M ./fibmem.sh 40 $ mprof plot This will background each of the processes so that they are plotted as child processes of the main bash script. Unfortunately they are plotted by index, so it\u0026rsquo;s hard to know which child is which, but I believe that child 0 is the go implementation, child 1 is the transpiled implementation, and child 2 is the Python implementation. 
Ok, so after that long description of methods, let\u0026rsquo;s get into findings.\nResults For 20 runs of each executable for fibonacci arguments 25, 30, 35, and 40, I recorded the following average times for the various executables shown in the next figure. Note that the amount of time for the next argument increases exponentially, opening up the performance gap between executables.\nUnsurprisingly, the pure Go implementation was blazing fast, about 42 times faster than the Python implementation on average. The real surprise, however, is that the transpiled Go was actually 1.5 times slower than the Python implementation. I actually cannot explain why this might be — I\u0026rsquo;m hugely curious if anyone has an answer.\nIn order to give a clearer picture, here are the log-scaled results with a fifth timing for the 45th fibonacci number computation:\nIn order to track memory usage, I used mprof to track memory for each executable, run independently in its own process; here are the results:\nAnd so that you can actually see the pure Go implementation as well as memory usage initialization and start up, here is a zoomed in version to the first few milliseconds of execution:\nThe memory usage profiling reveals yet another surprise: not only does the transpiled version take longer to execute, but it also uses more memory. Meanwhile, the pure Go implementation is so lightweight as to blow away with a stiff breeze.\nConclusions Transpiling is hard.\nGrumpy is still only experimental, and there does seem to be some real promise particularly with concurrency gains. 
However, I\u0026rsquo;m not sold on transpiling as an approach to squeezing more performance out of a system.\n","permalink":"https://bbengfort.github.io/2017/03/grumpy-transpiling-fib-benchmark/","summary":"\u003cp\u003eOn Tuesday evening I attended a \u003ca href=\"https://www.meetup.com/django-district/events/238128100/\"\u003eDjango District meetup\u003c/a\u003e on \u003ca href=\"https://github.com/google/grumpy\"\u003eGrumpy\u003c/a\u003e, a \u003ca href=\"https://www.stevefenton.co.uk/2012/11/compiling-vs-transpiling/\"\u003etranspiler\u003c/a\u003e from Python to Go. Because it was a Python meetup, the talk naturally focused on introducing Go to a Python audience, and because it was a Django meetup, we also focused on web services. The premise for Grumpy, as discussed in the announcing \u003ca href=\"https://opensource.googleblog.com/2017/01/grumpy-go-running-python.html\"\u003eGoogle blog post\u003c/a\u003e, is also a web focused one — to take YouTube\u0026rsquo;s API that\u0026rsquo;s primarily written in Python and transpile it to Go to improve the overall performance and stability of YouTube\u0026rsquo;s front-end services.\u003c/p\u003e","title":"A Benchmark of Grumpy Transpiling"},{"content":"In my systems I need to handle failure; so unlike in a typical client-server relationship, I\u0026rsquo;m prepared for the remote I\u0026rsquo;m dialing to not be available. Unfortunately when you do this with gRPC-Go there are a couple of annoyances you have to address. They are (in order of solutions):\nVerbose connection logging Background and back-off for reconnection attempts Errors are not returned on demand. There is no ability to keep track of statistics So first the logging. 
When you dial an unavailable remote as follows:\nconn, err := grpc.Dial(addr, grpc.WithInsecure()) if err != nil { return err } client := pb.NewServiceClient(conn) resp = client.RPC() You will get a lot of log messages in the form of:\n2017/03/20 16:36:16 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = \u0026#34;transport: dial tcp 192.168.1.1:port: getsockopt: connection refused\u0026#34;; Reconnecting to {addr:port \u0026lt;nil\u0026gt;} 2017/03/20 16:36:16 grpc: addrConn.resetTransport failed to create client 2017/03/20 16:36:16 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = \u0026#34;transport: dial tcp 192.168.1.1:port: getsockopt: connection refused\u0026#34;; Reconnecting to {addr:port \u0026lt;nil\u0026gt;} 2017/03/20 16:36:16 grpc: addrConn.resetTransport failed to create client ... And by a lot, I mean \u0026hellip; a lot; they will continue to spew for a while (probably at least 30 seconds). So to tackle that issue, we\u0026rsquo;ll turn off the logging by creating a noop (nop, no-op) logger that doesn\u0026rsquo;t do anything, and set it as the logger for grpclog. First the logger:\nAs you can see, this logger meets the interface for a SetLogger() function, and we can set the grpc logger in our library\u0026rsquo;s init as follows:\nfunc init() { // Set the random seed to something different each time. rand.Seed(time.Now().Unix()) // Stop the grpc verbose logging grpclog.SetLogger(noplog) } Ok, onto the next two problems that are both solved with context. First, the call to grpc.Dial() happens in the background by default. This can cause panics due to nil dereference errors if you\u0026rsquo;re not careful. Block until connected as follows:\nconn, err := grpc.Dial(addr, grpc.WithInsecure(), grpc.WithBlock()) Now it\u0026rsquo;s up to you to handle concurrency with the connections. Of course blocking doesn\u0026rsquo;t make a whole lot of sense until you limit it. 
And in fact, no err will be returned from the function unless you cause it to error with a timeout.\nconn, err := grpc.Dial( addr, grpc.WithInsecure(), grpc.WithBlock(), grpc.WithTimeout(1 * time.Second) ) Note that the WithTimeout option does not do anything if WithBlock is not used as well.\nComing Soon: using WithStatsHandler() to address the fourth issue.\nAnd there is my basic start to managing the grpc.Dial function for scenarios when the remote may not be reachable. I\u0026rsquo;m sure there will be a lot more on this later.\n","permalink":"https://bbengfort.github.io/2017/03/sanely-grpc-dial-a-remote/","summary":"\u003cp\u003eIn my systems I need to handle failure; so unlike in a typical client-server relationship, I\u0026rsquo;m prepared for the remote I\u0026rsquo;m dialing to not be available. Unfortunately when you do this with \u003ca href=\"https://godoc.org/google.golang.org/grpc\"\u003egRPC-Go\u003c/a\u003e there are a couple of annoyances you have to address. They are (in order of solutions):\u003c/p\u003e\n\u003col\u003e\n\u003cli\u003eVerbose connection logging\u003c/li\u003e\n\u003cli\u003eBackground and back-off for reconnection attempts\u003c/li\u003e\n\u003cli\u003eErrors are not returned on demand.\u003c/li\u003e\n\u003cli\u003eThere is no ability to keep track of statistics\u003c/li\u003e\n\u003c/ol\u003e\n\u003cp\u003eSo first the logging. When you dial an unavailable remote as follows:\u003c/p\u003e","title":"Sanely gRPC Dial a Remote"},{"content":"In this post I wanted to catalog the process of an open source contribution I was a part of, which added a feature to the memory profiler Python library by Fabian Pedregosa and Philippe Gervais. It\u0026rsquo;s a quick story to tell but took over a year to complete, and I learned a lot from the process. 
I hope that the story is revealing, particularly to first-time contributors, and shows that even folks that have been doing this for a long time still have to find ways to positively approach collaboration in an open source environment. I also think it\u0026rsquo;s a fairly standard example of how contributions work in practice and perhaps this story will help us all think about how to better approach the pull request process.\nThe bottom line is that a feature that was relatively quick to prototype took a long time to get included into the main code, even though there was a lot of interest. The hangup involved all the normal excuses (too busy, worried the code wasn\u0026rsquo;t good enough, etc.) but in the end it was effective, clear, and positive communication that finally made things come together. Here\u0026rsquo;s how it went down in timeline form:\nJuly 13, 2016: asked a Stack Overflow question: How to profile multiple subprocesses using Python multiprocessing and memory_profiler?\nUnfortunately I can\u0026rsquo;t remember exactly why I was asking this, but my best guess is that I was trying to determine memory usage either for the minke parallel NLP application or to do some benchmarking for my research simulations. There are unfortunately no blog posts around that time that hint at what I was doing.\nJuly 14, 2016: submitted a feature request, mprof each child process independently, to the memory_profiler repository.\nAt this point, I received some feedback from @fabianp directing me to some specific locations in code where I might start making changes. Unfortunately I don\u0026rsquo;t know where those comments were added, potentially in another issue? I forked the project and began a proof of concept.\nJuly 16, 2016: proof of concept, mpmprof created in my repository fork.\nI submitted a comment on the issue to ask @fabianp to take a look at my fork. He (correctly) asked for a pull request. 
However, I was unsure that my proof of concept was good enough for a PR and asked for help and got a minor comment in return. I decided to try to fix it and I made a critical mistake: I didn\u0026rsquo;t submit the PR.\nAugust 4, 2016 - March 18, 2017: the contribution silence occurs.\nPings and plus ones from @cachedout and @davidgbe bring the project to my attention again, but it feels like a daunting amount of work, so things stay silent.\nMarch 18, 2017: finally I submit a “work in progress (WIP)” pull request, WIP: Independent child process monitoring #118.\nThis pull request is very brief and simply has my original contribution along with a massive fork update to get to the latest code. However, it is finally at this point that @fabianp takes a look at my code. He asks me to merge my proof of concept into the codebase.\nMarch 20, 2017: I address the merge request with a very simple implementation, code review begins.\nThe code review is a back and forth conversation between @fabianp and me. He tests and runs the example code on his machine, and takes a look at the modifications I made specifically. Any changes or updates requested I can commit to my fork and they are automatically included in the pull request.\nMarch 21, 2017: my submitted pull request is merged.\nWe ended up going back and forth a few times, discussing the impact of multiprocessing on various components and a pickle error that cropped up. The conversation was very good and it led to quite a few updates to the code, and even a couple of changes from @fabianp. Throughout I became more confident since he was looking at the PR and testing it.\nMarch 22, 2017: new release of memory_profiler on PyPI.\nThe release was posted on PyPI along with a nice thank you on Twitter. 
I can finally answer my own question on Stack Overflow!\nThanks to @bbengfort memory_profiler can now separately track memory usage of forked processes https://t.co/LCOMLgNzM8 pic.twitter.com/Lc46lf0xs8\n\u0026mdash; Fabian Pedregosa (@fpedregosa) March 22, 2017 So let me break down what happened here and do a bit of a post-mortem. First, I had a problem that I wanted to solve with an existing, popular, and well-used codebase (namely track the memory usage of child processes independently to the main process). I thought there must be a way to do this, and while there was a solution to a variant of my problem, there was no direct solution.\nNext, I decided to fix the problem and start a conversation. I was able to (relatively quickly) create a concept that solved my problem. In fact, it worked so well that I used that solution for a little under a year. I thought that by maintaining my solution in my fork, other folks were able to leverage it.\nHowever, there was a problem: I wasn\u0026rsquo;t able to contribute back to the main library. So let\u0026rsquo;s look at what held me back:\nThe changes to the primary module were modest but the changes to the implementation were drastic Fear that I had broken something unrelated since there weren\u0026rsquo;t a lot of tests Style clash: how I write code is different from how this module is constructed. It was easier for me to write my proof of concept outside the original module Specifically, I was able to make the modifications to memory_profiler.py (the library for the code base) by adding a function and modifying the control flow of the primary entry point. This felt relatively safe and non-invasive. However, modifying the command-line script, mprof required a lot more work. 
It was simpler and faster for me to write my own command line script, mpmprof rather than modify the original version.\nFrankly, if you compare mprof and mpmprof I think it\u0026rsquo;s pretty obvious that there are two drastically different coding styles at work here. I use the argparse library, have things structured functionally rather than with if/else control syntax, have a different docstring implementation, more intermediate functions, use regular expressions for parsing, and have a bit more exception handling (just to name a few notable differences). However, I also did not have a complete implementation from the other code, nor did I completely understand all the problems the original code was trying to solve.\nI thought I faced a problem about whether I should update the code to use argparse and “more modern” syntax (there was even an related pull request) or to potentially introduce breaking changes by trying to stay as close to the original as possible. I even considered forking the project and creating my own, potentially more easily maintained-by-me version. I worried that I was being a jerk by overhauling the code, or not contributing “the right way”. But really the problem was that I wasn\u0026rsquo;t engaging the authors of the library in a meaningful discussion.\nSo what would I do next time to solve the problem? Open a pull request as soon as possible.\nMaybe I thought Fabian would go checkout my fork or maybe I let the list of barriers hold me back, but whatever the case not submitting a PR meant that I couldn\u0026rsquo;t engage the authors in a discussion about my contribution. I had heard the PR ASAP advice before, but it hasn\u0026rsquo;t been until recently that I have fully understood what GitHub and the code review tools allow you to do. 
Contribution is collaboration and the PR workflow helps you get there!\nI haven\u0026rsquo;t fully implemented all of my changes to the code base (again, for the reasons outlined above) but now, if you run:\n$ pip install -U memory_profiler $ mprof run -M python examples/multiprocessing_example.py $ mprof plot You\u0026rsquo;ll get a figure that looks something similar to:\nThis is great news for an oft-requested feature of a library that is well used and well maintained. For reference, if you\u0026rsquo;d like to see an example of my proof of concept, you can check out my fork, or see my version of the mprof script on Gist. However, you don\u0026rsquo;t have to worry about that gist, and can instead simply pip install memory_profiler to get access to this feature!\n","permalink":"https://bbengfort.github.io/2017/03/contributing-a-multiprocess-memory-profiler/","summary":"\u003cp\u003eIn this post I wanted to catalog the process of an open source contribution I was a part of, which added a feature to the \u003ca href=\"https://pypi.python.org/pypi/memory_profiler/\"\u003ememory profiler\u003c/a\u003e Python library by \u003ca href=\"http://fseoane.net/\"\u003eFabian Pedregosa\u003c/a\u003e and \u003ca href=\"https://github.com/pgervais\"\u003ePhilippe Gervais\u003c/a\u003e. It\u0026rsquo;s a quick story to tell but took over a year to complete, and I learned a lot from the process. I hope that the story is revealing, particularly to first time contributors and shows that even folks that have been doing this for a long time still have to find ways to positively approach collaboration in an open source environment. 
I also think it\u0026rsquo;s a fairly standard example of how contributions work in practice and perhaps this story will help us all think about how to better approach the pull request process.\u003c/p\u003e","title":"Contributing a Multiprocess Memory Profiler"},{"content":"A Merkle tree is a data structure in which every non-leaf node is labeled with the hash of its child nodes. This makes them particularly useful for comparing large data structures quickly and efficiently. Given trees a and b, if their root hashes differ, some part of the trees below the root differs (if the root hashes are identical, the trees are almost certainly identical as well). You can then proceed in a breadth-first fashion, pruning nodes with identical hashes to directly identify the differences.\nThese structures are widely used with file systems or directory trees; for example, Git uses them to identify the file tree structure of a commit so that two commits can be compared easily even for extremely large directory tree structures. Files are leaf nodes, identified by the hash of their contents. Directories are the non-terminal nodes (and this is part of the reason that Git doesn\u0026rsquo;t track directories). The hash of a directory is the hash of the hashes of the files and directories that make up that node\u0026rsquo;s children.\nThe trade-off for fast comparison is that a Merkle tree is time consuming to build and to maintain. Adding a file means computing the hash of the file, then recomputing the hash of the directory that the file is in, then recomputing the hash of that directory\u0026rsquo;s parent and so on to the root. Generally speaking hash computations are expensive, particularly ones that decrease the likelihood of collisions (e.g. something stronger than MD5).\nA simpler data structure that may do the same thing is one that maintains counts of the leaf nodes under it. 
Instead of computing hashes, the computational work is to simply increment the counter as files are added, all the way to the root node. I can\u0026rsquo;t imagine this type of tree doesn\u0026rsquo;t already exist, and it does suffer from several problems. First, and most harmfully, if the same number of files are added to both trees then the counts will be the same and the trees declared identical. Additionally, if the contents of the files change, this type of tree won\u0026rsquo;t be updated. However, for some applications, particularly those that simply need to identify if changes are occurring with a high likelihood, this structure can be effective.\nThe code is as follows:\nIn principle the API is fairly thread-safe. Simply initialize a Tree with the Build function by supplying a path. The Tree will walk the directory and construct child directories and increment the counts of files. It uses the AddFile method to do this, which locks the tree at the root node, and updates it in a top-down fashion. I say \u0026ldquo;fairly thread-safe\u0026rdquo; because child nodes are not locked as they\u0026rsquo;re being updated, nor is the tree locked on AddChild. So long as the user interacts only with the root node and the AddFile function (the principal use) then it can be used concurrently.\n","permalink":"https://bbengfort.github.io/2017/03/pseudo-merkle-tree/","summary":"\u003cp\u003eA \u003ca href=\"https://en.wikipedia.org/wiki/Merkle_tree\"\u003eMerkle tree\u003c/a\u003e is a data structure in which every non-leaf node is labeled with the hash of its child nodes. This makes them particularly useful for comparing large data structures quickly and efficiently. Given trees \u003ccode\u003ea\u003c/code\u003e and \u003ccode\u003eb\u003c/code\u003e, if their root hashes differ, some part of the trees below the root differs (if the root hashes are identical, the trees are almost certainly identical as well). 
You can then proceed in a breadth-first fashion, pruning nodes with identical hashes to directly identify the differences.\u003c/p\u003e","title":"Pseudo Merkle Tree"},{"content":"Ask a Go programmer what makes Go special and they will immediately say “concurrency is baked into the language”. Go\u0026rsquo;s concurrency model is one of communication (as opposed to locks) and so concurrency primitives are implemented using channels. In order to synchronize across multiple channels, Go provides the select statement.\nA common pattern for me has become to use a select to manage broadcasted work (either in a publisher/subscriber model or a fanout model) by initializing go routines and passing them directional channels for synchronization and communication. In the example below, I create a buffered channel for output (so that the workers don\u0026rsquo;t block waiting for the receiver to collect data), a channel for errors (first error kills the program) and a timer to update the state of my process on a routine basis. The select waits for the first channel to receive a message and then continues processing. By keeping the select in a for loop, I can continually read off the channels until I\u0026rsquo;m done.\nThe pattern code is below:\nThe worker function does not return anything (since it\u0026rsquo;s a go routine) but instead takes as input an id, and two directional channels — meaning that the go routines can only send on the channel and not receive. The first channel is the output channel and the second is for errors. The worker pretends to work with a random sleep then just reports back that it has been awakened.\nThe main function creates the output and error channels as well as a ticker, which has a timer channel on it. We then launch the go routines (keeping track of how many are running, similar to a WaitGroup). The for loop is basically while True — it loops until break or return. 
The select waits until a value comes in on one of the channels, at which point it handles that case and exits from that block (at which point we check if we should break, and if not we continue to block until we receive data on the channel). Even for long running processes, the ticker will cause the loop to iterate once per second, allowing us to manage our state or update the user. If an error occurs on any of the workers we kill the entire process rather than risk anything else.\n","permalink":"https://bbengfort.github.io/2017/03/channel-select/","summary":"\u003cp\u003eAsk a Go programmer what makes Go special and they will immediately say “concurrency is baked into the language”. Go\u0026rsquo;s concurrency model is one of communication (as opposed to locks) and so concurrency primitives are implemented using \u003cem\u003echannels\u003c/em\u003e. In order to synchronize across multiple channels, go provides the \u003ccode\u003eselect\u003c/code\u003e statement.\u003c/p\u003e\n\u003cp\u003eA common pattern for me has become to use a \u003ccode\u003eselect\u003c/code\u003e to manage broadcasted work (either in a publisher/subscriber model or a fanout model) by initializing go routines and passing them \u003cem\u003edirectional channels\u003c/em\u003e for synchronization and communication. In the example below, I create a buffered channel for output (so that the workers don\u0026rsquo;t block waiting for the receiver to collect data), a channel for errors (first error kills the program) and a timer to update the state of my process on a routine basis. The \u003ccode\u003eselect\u003c/code\u003e waits for the first channel to receive a message and then continues processing. 
By keeping the \u003ccode\u003eselect\u003c/code\u003e in a \u003ccode\u003efor\u003c/code\u003e loop, I can continually read off the channels until I\u0026rsquo;m done.\u003c/p\u003e","title":"Using Select in Go"},{"content":"A natural question to ask after the previous post is “how much overhead does security add?” So I\u0026rsquo;ve benchmarked the three methods discussed: mutual TLS, server-side TLS, and no encryption. The results are below:\nHere are the numeric results for one of the runs:\nBenchmarkMutualTLS-8 200\t9331850 ns/op BenchmarkServerTLS-8 300\t5004505 ns/op BenchmarkInsecure-8 2000\t1179252 ns/op PASS ok github.com/bbengfort/sping\t7.364s Here is the code for the benchmarking for reference:\nvar ( server *PingServer client *PingClient ) func BenchmarkMutualTLS(b *testing.B) { logmsgs = false server = NewServer() client = NewClient(\u0026#34;tester\u0026#34;, 100, 8) go server.ServeMutualTLS(50051) b.ResetTimer() for i := 0; i \u0026lt; b.N; i++ { _, err := client.PingMutualTLS(\u0026#34;localhost:50051\u0026#34;) if err != nil { fmt.Println(err) break } } } It\u0026rsquo;s all pretty straightforward; the other two functions use server.ServeTLS and server.ServeInsecure for the server side and client.PingTLS and client.PingInsecure for the client. The only note is that because the server is running throughout the tests, each benchmark runs with a different port number.\n","permalink":"https://bbengfort.github.io/2017/03/tls-grpc-benchmarks/","summary":"\u003cp\u003eA natural question to ask after the previous post is “how much overhead does security add?” So I\u0026rsquo;ve benchmarked the three methods discussed: mutual TLS, server-side TLS, and no encryption. 
The results are below:\u003c/p\u003e\n\u003cp\u003e\u003ca href=\"/images/2017-03-05-benchmark.png\"\u003e\u003cimg loading=\"lazy\" src=\"/images/2017-03-05-benchmark.png\" alt=\"Secure gRPC Benchmarks\"  /\u003e\n\u003c/a\u003e\u003c/p\u003e\n\u003cp\u003eHere are the numeric results for one of the runs:\u003c/p\u003e\n\u003cpre tabindex=\"0\"\u003e\u003ccode\u003eBenchmarkMutualTLS-8   \t     200\t   9331850 ns/op\nBenchmarkServerTLS-8   \t     300\t   5004505 ns/op\nBenchmarkInsecure-8    \t    2000\t   1179252 ns/op\nPASS\nok  \tgithub.com/bbengfort/sping\t7.364s\n\u003c/code\u003e\u003c/pre\u003e\u003cp\u003eHere is the code for the benchmarking for reference:\u003c/p\u003e","title":"Benchmarking Secure gRPC"},{"content":"One of the primary requirements for the systems we build is something we call the “minimum security requirement”. Although our systems are not designed specifically for high security applications, they must use minimum standards of encryption and authentication. For example, it seems obvious to me that a web application that stores passwords or credit card information would encrypt their data on disk on a per-record basis with a salted hash. In the same way, a distributed system must be able to handle encrypted blobs, encrypt all inter-node communication, and authenticate and sign all messages. This adds some overhead to the system but the cost of overhead is far smaller than the cost of a breach, and if minimum security is the baseline then the overhead is just an accepted part of doing business.\nFor inter-replica communication we are currently using gRPC, an multi-platform RPC framework that uses protocol buffers for message serialization (we have also used zeromq in the past). The nice part about gRPC is that it has authentication baked-in and promotes the use of SSL/TLS to authenticate and encrypt exchanges. 
The not so nice part is that while the gRPC tutorial has examples in Ruby, C++, C#, Python, Java, Node.js, and PHP there is no guide for Go (at the time of this post). This post is my attempt to figure it out.\nFrom the documentation:\ngRPC has SSL/TLS integration and promotes the use of SSL/TLS to authenticate the server, and encrypt all the data exchanged between the client and the server. Optional mechanisms are available for clients to provide certificates for mutual authentication.\nI\u0026rsquo;m primarily interested in the first part — authenticate the server and encrypt the data exchanged. However, the idea of mutual TLS is something I hadn\u0026rsquo;t considered before this investigation. My original plan was to use Hawk client authentication and message signatures. But potentially that\u0026rsquo;s not something I have to do. So this post has two phases:\nEncrypted communication using TLS/SSL from the server Authenticated, mutual TLS using a certificate authority Since all replicas in my system are both servers and clients, I think that it wouldn\u0026rsquo;t make much sense not to do mutual TLS. After all, we\u0026rsquo;re already creating certificates and exchanging keys and whatnot.\nCreating SSL/TLS Certificates It seems like step one is to generate certificates and key files for encrypting communication. I thought this would be fairly straightforward using openssl from the command line, and it is (kind of) though there are a lot of things to consider. First, the files we need to generate:\nserver.key: a private RSA key to sign and authenticate the public key server.pem/server.crt: self-signed X.509 public keys for distribution rootca.crt: a certificate authority public key for signing .csr files host.csr: a certificate signing request to access the CA So there are a lot of files and a lot of extensions, many of which are duplicates or synonyms (or simply different encodings). 
I think that\u0026rsquo;s primarily what\u0026rsquo;s made this process so difficult. So to generate some simple .key/.crt pairs using openssl:\n$ openssl genrsa -out server.key 2048 $ openssl req -new -x509 -sha256 -key server.key \\ -out server.crt -days 3650 The first command will generate a 2048 bit RSA key (stronger keys are available as well). The second command will generate the certificate, and will also prompt you for some questions about the location, organization, and contact of the certificate holder. These fields are pretty straight forward, but probably the most important field is the \u0026ldquo;Common Name\u0026rdquo; which is typically composed of the host, domain, or IP address related to the certificate. The name is then used during verification and if the host doesn\u0026rsquo;t match the common name a warning is raised.\nFinally, to generate a certificate signing request (.csr) using openssl:\n$ openssl req -new -sha256 -key server.key -out server.csr $ openssl x509 -req -sha256 -in server.csr -signkey server.key \\ -out server.crt -days 3650 So this is pretty straightforward on the command line. However, it may be simpler to use certstrap, a simple certificate manager written in Go by the folks at Square. 
The app avoids dealing with openssl (and therefore raises questions about security in implementation), but has a very simple workflow: create a certificate authority, sign certificates with it.\nTo create a new certificate authority:\n$ certstrap init --common-name \u0026#34;umd.fluidfs.com\u0026#34; Created out/umd.fluidfs.com.key Created out/umd.fluidfs.com.crt Created out/umd.fluidfs.com.crl To request a certificate for a specific host:\n$ certstrap request-cert -ip 192.168.1.18 Created out/192.168.1.18.key Created out/192.168.1.18.csr And finally to generate the certificate for the host:\n$ certstrap sign 192.168.1.18 --CA umd.fluidfs.com Created out/192.168.1.18.crt from out/192.168.1.18.csr signed by out/umd.fluidfs.com.key Probably the most interesting opportunity for me is the ability to use certstrap programmatically to automatically generate keys. However, some review will have to be done into how safe it is.\nEncrypted Server The simplest method to encrypt communication using gRPC is to use server-side TLS. This means that the server needs to be initialized with a public/private key pair and the client needs to have the server\u0026rsquo;s public key in order to make the connection. I\u0026rsquo;ve created a small application called sping (secure ping) that basically does an echo request from a client to a server (example repository). 
The server code is as follows:\nvar ( crt = \u0026#34;server.crt\u0026#34; key = \u0026#34;server.key\u0026#34; ) func (s *PingServer) Serve(addr string) error { // Create the channel to listen on lis, err := net.Listen(\u0026#34;tcp\u0026#34;, addr) if err != nil { return fmt.Errorf(\u0026#34;could not listen on %s: %s\u0026#34;, addr, err) } // Create the TLS credentials creds, err := credentials.NewServerTLSFromFile(crt, key) if err != nil { return fmt.Errorf(\u0026#34;could not load TLS keys: %s\u0026#34;, err) } // Create the gRPC server with the credentials srv := grpc.NewServer(grpc.Creds(creds)) // Register the handler object pb.RegisterSecurePingServer(srv, s) // Serve and Listen if err := srv.Serve(lis); err != nil { return fmt.Errorf(\u0026#34;grpc serve error: %s\u0026#34;, err) } return nil } So the steps to the server are pretty straightforward. First, create a TCP listener on the desired address (e.g. pass in \u0026quot;:3264\u0026quot; to listen on the external address on port 3264). Second, load the TLS credentials from their respective key files (both the private and the public keys), then initialize the gRPC server with the credentials. Finally, register the handler for the service you implemented (here I\u0026rsquo;m using a method call on a struct that does implement the handler) and serve.\nTo get the client connected, you need to give it the server.crt (or server.pem) public key. 
In normal operation, this key can be fetched from a certificate authority, but since we\u0026rsquo;re doing internal RPC, the public key must be shipped with the application.\nvar cert = \u0026#34;server.crt\u0026#34; func (c *PingClient) Ping(addr string, ping *pb.Ping) error { // Create the client TLS credentials creds, err := credentials.NewClientTLSFromFile(cert, \u0026#34;\u0026#34;) if err != nil { return fmt.Errorf(\u0026#34;could not load tls cert: %s\u0026#34;, err) } // Create a connection with the TLS credentials conn, err := grpc.Dial(addr, grpc.WithTransportCredentials(creds)) if err != nil { return fmt.Errorf(\u0026#34;could not dial %s: %s\u0026#34;, addr, err) } // Initialize the client and make the request client := pb.NewSecurePingClient(conn) pong, err := client.Echo(context.Background(), ping) if err != nil { return fmt.Errorf(\u0026#34;could not ping %s: %s\u0026#34;, addr, err) } // Log the ping log.Printf(\u0026#34;%s\\n\u0026#34;, pong.String()) return nil } Again, this is a fairly straightforward process that adds only three lines and modifies one from the original code. First load the server public key from a file into the credentials object, then pass the transport credentials into the gRPC dialer. This will cause gRPC to initiate the TLS handshake every time it sends an echo RPC.\nMutual TLS with Certificate Authority The real problem with using the above method and HAWK authentication is that every single replica will have to maintain both a server public key and a HAWK key for every other node in the system. That frankly sounds like a headache to me. Instead, we\u0026rsquo;ll have every replica (client and server both) load their own public/private key pairs, then load the public keys of a CA (certificate authority) .crt file. 
Because all client public keys are signed by the CA key, the server and client can exchange and authenticate each other\u0026rsquo;s certificates during communication.\nCAVEAT: when a client connects to a server, it must know the ServerName property to pass into the tls.Config object. This ServerName apparently must agree with the common name in the certificate.\nThe server code is now modified to create X.509 key pairs directly and to create a certificate pool based on the certificate authority public key.\nvar ( crt = \u0026#34;server.crt\u0026#34; key = \u0026#34;server.key\u0026#34; ca = \u0026#34;ca.crt\u0026#34; ) func (s *PingServer) Serve(addr string) error { // Load the certificates from disk certificate, err := tls.LoadX509KeyPair(crt, key) if err != nil { return fmt.Errorf(\u0026#34;could not load server key pair: %s\u0026#34;, err) } // Create a certificate pool from the certificate authority certPool := x509.NewCertPool() ca, err := ioutil.ReadFile(ca) if err != nil { return fmt.Errorf(\u0026#34;could not read ca certificate: %s\u0026#34;, err) } // Append the client certificates from the CA if ok := certPool.AppendCertsFromPEM(ca); !ok { return errors.New(\u0026#34;failed to append client certs\u0026#34;) } // Create the channel to listen on lis, err := net.Listen(\u0026#34;tcp\u0026#34;, addr) if err != nil { return fmt.Errorf(\u0026#34;could not listen on %s: %s\u0026#34;, addr, err) } // Create the TLS credentials creds := credentials.NewTLS(\u0026amp;tls.Config{ ClientAuth: tls.RequireAndVerifyClientCert, Certificates: []tls.Certificate{certificate}, ClientCAs: certPool, }) // Create the gRPC server with the credentials srv := grpc.NewServer(grpc.Creds(creds)) // Register the handler object pb.RegisterSecurePingServer(srv, s) // Serve and Listen if err := srv.Serve(lis); err != nil { return fmt.Errorf(\u0026#34;grpc serve error: %s\u0026#34;, err) } return nil } So quite a bit more work here than in the first version. 
First, we load the server key pair from disk into a tls.Certificate struct. Then we create a certificate pool, read the CA certificate from disk and append it to the pool. That done, we can create our TLS credentials. Importantly, our server will require client certificates for verification, and we specify the pool as our client certificate authority. Finally we pass our certificates into the configuration and create new TLS grpc server options, passing them into the grpc.NewServer function. The client code is very similar:\nvar ( crt = \u0026#34;client.crt\u0026#34; key = \u0026#34;client.key\u0026#34; ca = \u0026#34;ca.crt\u0026#34; ) func (c *PingClient) Ping(addr string, ping *pb.Ping) error { // Load the client certificates from disk certificate, err := tls.LoadX509KeyPair(crt, key) if err != nil { return fmt.Errorf(\u0026#34;could not load client key pair: %s\u0026#34;, err) } // Create a certificate pool from the certificate authority certPool := x509.NewCertPool() ca, err := ioutil.ReadFile(ca) if err != nil { return fmt.Errorf(\u0026#34;could not read ca certificate: %s\u0026#34;, err) } // Append the certificates from the CA if ok := certPool.AppendCertsFromPEM(ca); !ok { return errors.New(\u0026#34;failed to append ca certs\u0026#34;) } creds := credentials.NewTLS(\u0026amp;tls.Config{ ServerName: addr, // NOTE: this is required! 
Certificates: []tls.Certificate{certificate}, RootCAs: certPool, }) // Create a connection with the TLS credentials conn, err := grpc.Dial(addr, grpc.WithTransportCredentials(creds)) if err != nil { return fmt.Errorf(\u0026#34;could not dial %s: %s\u0026#34;, addr, err) } // Initialize the client and make the request client := pb.NewSecurePingClient(conn) pong, err := client.Echo(context.Background(), ping) if err != nil { return fmt.Errorf(\u0026#34;could not ping %s: %s\u0026#34;, addr, err) } // Log the ping log.Printf(\u0026#34;%s\\n\u0026#34;, pong.String()) return nil } The primary difference here is that we load client certificates as opposed to the server certificate and that we specify RootCAs instead of ClientCAs in the TLS config. One final, important point is that we also must specify the ServerName, whose value must match the common name on the certificate.\nGo Client In this section, I will describe the method for a client connecting to a secure RPC in the same style as the gRPC authentication examples. These examples use the greeter quick start code and perhaps they can be contributed back to the grpc.io documentation. 
Frankly, though, they\u0026rsquo;re just a guess so hopefully the PR I submitted gets reviewed thoroughly.\nBase case - No encryption or authentication import ( \u0026#34;google.golang.org/grpc\u0026#34; pb \u0026#34;google.golang.org/grpc/examples/helloworld/helloworld\u0026#34; ) channel, _ := grpc.Dial(\u0026#34;localhost:50051\u0026#34;, grpc.WithInsecure()) client := pb.NewGreeterClient(channel) With server authentication SSL/TLS import \u0026#34;google.golang.org/grpc/credentials\u0026#34; creds, _ := credentials.NewClientTLSFromFile(\u0026#34;roots.pem\u0026#34;, \u0026#34;\u0026#34;) channel, _ := grpc.Dial( \u0026#34;localhost:443\u0026#34;, grpc.WithTransportCredentials(creds) ) client := pb.NewGreeterClient(channel) Authenticate with Google import \u0026#34;google.golang.org/grpc/credentials/oauth\u0026#34; auth, _ := oauth.NewApplicationDefault(context.Background(), \u0026#34;\u0026#34;) channel, _ := grpc.Dial( \u0026#34;greeter.googleapis.com\u0026#34;, grpc.WithPerRPCCredentials(auth) ) client := pb.NewGreeterClient(channel) Conclusion Always use SSL/TLS to encrypt communications and authenticate nodes. How to manage certificates in a larger system is an open question, but potentially an internal certificate authority resolves these problems. Getting secure communications up and running isn\u0026rsquo;t necessarily the easiest part of distributed systems, but it is worth taking the time out to do it right. And finally, gRPC, please update your documentation.\nOther Resources:\nSecure Ping on GitHub Using gRPC with Mutual TLS in Golang Simple Golang HTTPS/TLS Examples ","permalink":"https://bbengfort.github.io/2017/03/secure-grpc/","summary":"\u003cp\u003eOne of the primary requirements for the systems we build is something we call the “minimum security requirement”. Although our systems are not designed specifically for high security applications, they must use minimum standards of encryption and authentication. 
For example, it seems obvious to me that a web application that \u003ca href=\"https://docs.djangoproject.com/en/1.10/topics/auth/passwords/\"\u003estores passwords\u003c/a\u003e or \u003ca href=\"https://www.pcisecuritystandards.org/\"\u003ecredit card information\u003c/a\u003e would encrypt their data on disk on a per-record basis with a \u003ca href=\"https://www.codeproject.com/Articles/704865/Salted-Password-Hashing-Doing-it-Right\"\u003esalted hash\u003c/a\u003e. In the same way, a distributed system must be able to handle \u003ca href=\"https://www.usenix.org/legacy/event/osdi04/tech/full_papers/li_j/li_j.pdf\"\u003eencrypted blobs\u003c/a\u003e, \u003ca href=\"http://blog.cloudera.com/blog/2013/03/how-to-set-up-a-hadoop-cluster-with-network-encryption/\"\u003eencrypt all inter-node communication\u003c/a\u003e, and \u003ca href=\"https://alexbilbie.com/2012/11/hawk-a-new-http-authentication-scheme/\"\u003eauthenticate and sign all messages\u003c/a\u003e. This adds some overhead to the system but the cost of overhead is far smaller than the cost of a breach, and if minimum security is the baseline then the overhead is just an accepted part of doing business.\u003c/p\u003e","title":"Secure gRPC with TLS/SSL"},{"content":"Go is built for concurrency by providing language features that allow developers to embed complex concurrency patterns into their applications. These language features can be intuitive and a lot of safety is built in (for example a race detector) but developers still need to be aware of the interactions between various threads in their programs.\nIn any shared memory system the biggest concern is synchronization: ensuring that separate go routines operate in the correct order and that no race conditions occur. The primary way to handle synchronization is the use of channels. Channels synchronize execution by forcing sends on the channel to block until the value on the channel is received. 
In this way, channels act as a barrier since the go routine can not progress while being blocked by the channel and enforce a specific ordering to execution, the ordering of routines arriving at the barrier.\nChannels are made to implement CSP, but there are other concurrency primitives like mutexes (locks designed to enforce mutual exclusion concurrency control). In fact, channels use locks behind the scenes to serialize access, and you\u0026rsquo;re likely going to have to use other concurrency primitives anyway. I\u0026rsquo;ve encountered this problem, and have started using mutexes in a very specific way, which this post is about.\nConsider an operation that is not commutative or not associative (operations that are can be implemented with CRDTs), for example concatenating data to a buffer. This operation must be synchronized because the original state must be preserved during the operation. A simple explanation of this is the += which (for the purpose of our discussion) fetches the original value of the variable, performs the operation and stores the result back to the value. If two processes attempt to += concurrently a race condition occurs because whichever process is first to complete will have its answer overridden. In the following example, the final result of the variable will be \u0026quot;hello Bob\u0026quot; or \u0026quot;hello Alice\u0026quot; depending on which process gets there last, an undesirable state (the second operation may have preferred the concatenation to be \u0026quot;hello Bob and Alice\u0026quot; or \u0026quot;hello Alice and Bob\u0026quot;).\nThe solution is to lock the variable whenever the first process accesses it and then release it when it\u0026rsquo;s done, that way the process is guaranteed the state of the variable for the duration of the operation. 
Here\u0026rsquo;s how I implement this with a struct in Go:\ntype Buffer struct { sync.Mutex // wraps a synchronization flag buf string // the string being concatenated to } By embedding the sync.Mutex into the struct, it can now be locked and unlocked. Even more powerfully, you can write methods that lock and defer unlock for very easy thread safe synchronization. Here is an example of safe and unsafe concatenation to the buffer:\nfunc (b *Buffer) Concat(s string) { b.buf += s } func (b *Buffer) SafeConcat(s string) { b.Lock() defer b.Unlock() b.Concat(s) } It is important to note that safety does not mean that you\u0026rsquo;re guaranteed some other arbitrary order of operations when using goroutines. Consider the following concurrent concatenate example that injects some sleep into the concat function (find the complete code on Gist):\nvar ( safe bool start time.Time group *sync.WaitGroup buffer *Buffer alphas []string ) func write(idx int, safe bool) { defer group.Done() if idx \u0026gt;= len(alphas) { return } if safe { buffer.SafeConcat(alphas[idx]) } else { buffer.Concat(alphas[idx]) } } group = new(sync.WaitGroup) alphas = []string{\u0026#34;a\u0026#34;, \u0026#34;b\u0026#34;, \u0026#34;c\u0026#34;, \u0026#34;d\u0026#34;, \u0026#34;e\u0026#34;, \u0026#34;f\u0026#34;, \u0026#34;g\u0026#34;, \u0026#34;h\u0026#34;, \u0026#34;i\u0026#34;,} buffer = new(Buffer) start = time.Now() for i := 0; i \u0026lt; len(alphas); i++ { group.Add(1) go write(i, safe) } group.Wait() fmt.Printf(\u0026#34;\\nresult: %s in %s (safe=%t)\\n\u0026#34;, buffer, time.Since(start), safe) Here, we\u0026rsquo;re using a sync.WaitGroup to determine when all the go routines are complete (e.g. join on the collection of routines) and have them write the letter of their index to the buffer. 
The output is as follows:\nresult: fiedhcjgab in 1.004835942s (safe=false) result: kbahgifjced in 11.020241668s (safe=true) Note that in the unsafe case, one of the letters is missing because of incorrect synchronization and that the safe case took 11 seconds to complete. This is because each goroutine had to wait (for a second) until it could access the buffer since it was locked. However, it\u0026rsquo;s also important to note that neither method (safe or unsafe) produced \u0026quot;abcdefghijk\u0026quot;, since the locking order is about which routine got to the lock first, not about what order the goroutine was started.\nAnd honestly, that\u0026rsquo;s the prime lesson from this post (most of which are my notes from implementing this in a production system).\nBut of course, I have another question - given the sequential case, how much overhead do the locks add? So benchmarking \u0026hellip;\nBenchmarkUnsafeConcat-8 1000000\t47287 ns/op BenchmarkSafeConcat-8 1000000\t53170 ns/op Clearly having locks adds some overhead and if you\u0026rsquo;re not going to do any concurrent programming, then the 6 microseconds it takes to lock and unlock is probably not worth it. On the other hand, if there is the chance that you\u0026rsquo;ll have any concurrency at all - using the sync.Mutex embedding is a very clear and understandable way to go about things.\n","permalink":"https://bbengfort.github.io/2017/02/synchronizing-structs/","summary":"\u003cp\u003eGo is \u003ca href=\"https://divan.github.io/posts/go_concurrency_visualize/\"\u003ebuilt for concurrency\u003c/a\u003e by providing language features that allow developers to embed complex concurrency patterns into their applications. 
These language features can be intuitive and a lot of safety is built in (for example a \u003ca href=\"https://blog.golang.org/race-detector\"\u003erace detector\u003c/a\u003e) but developers still need to be aware of the interactions between various threads in their programs.\u003c/p\u003e\n\u003cp\u003eIn any shared memory system the biggest concern is \u003ca href=\"https://en.wikipedia.org/wiki/Synchronization_(computer_science)\"\u003esynchronization\u003c/a\u003e: ensuring that separate go routines operate in the correct order and that no race conditions occur. The primary way to handle synchronization is the use of \u003ca href=\"https://gobyexample.com/channels\"\u003echannels\u003c/a\u003e. Channels synchronize execution by forcing sends on the channel to block until the value on the channel is received. In this way, channels act as a \u003ca href=\"https://en.wikipedia.org/wiki/Barrier_(computer_science)\"\u003ebarrier\u003c/a\u003e since the go routine can not progress while being blocked by the channel and enforce a specific ordering to execution, the ordering of routines arriving at the barrier.\u003c/p\u003e","title":"Synchronizing Structs for Safe Concurrency in Go"},{"content":"FluidFS and other file systems break large files into recipes of hash-identified blobs of binary data. Blobs can then be replicated with far more ease than a single file, as well as streamed from disk in a memory safe manner. Blobs are treated as single, independent units so the underlying data store doesn\u0026rsquo;t grow as files are duplicated. Finally, blobs can be encrypted individually and provide more opportunities for privacy.\nChunking files into blobs is a good idea.\nThe question then becomes, how do you meaningfully chunk a file? The most obvious thing to do is simply stride across a file by some block size, generating fixed length chunks. This poses one problem for the last chunk - what if it\u0026rsquo;s only a byte or two? 
We can slightly modify our algorithm to specify a minimum chunk size, and if the remainder is smaller than that size, append it to the last chunk to have a larger than block size piece.\nFixed length chunks of 512 bytes and a minimum blocksize of 92 bytes highlighting an original and updated file. When the file is updated, all chunks after the first are modified.\nIn the above figure each blob created by fixed length chunking is highlighted in a different color. The file is divided into even, well formed chunks \u0026ndash; however a problem occurs when the file is updated. By inserting a paragraph in between the first and second paragraphs, the chunking algorithm shifts all subsequent chunks; in fact no chunk following the first chunk is preserved. Simple, small updates so radically change the blobs that duplication becomes a large issue.\nVariable length chunking uses the content to determine the splits between blocks by scanning for a specific pattern. Because it breaks up the blobs on pattern identification, the blobs don\u0026rsquo;t have a uniform length. Rabin-Karp chunking uses a rolling hash across windows to identify the splits, and is the primary chunking mechanism used in FluidFS.\nRabin-Karp variable length chunks with a target block size of 512 bytes highlighting an original and updated file. When the file is updated, only the chunks surrounding the update are modified.\nIn the above figure you can see that the variable length chunks can be quite small or quite large. However, the key is that when the second paragraph is inserted into the document, only the second chunk is modified. A third chunk is added, but all other chunks are identical. In this way variable length chunking reduces the number of overall blobs that have to be replicated and stored.\nThe visualization method can be found at this gist. 
The offsets were generated using the FluidFS chunks debugger.\n","permalink":"https://bbengfort.github.io/2017/02/chunking/","summary":"\u003cp\u003eFluidFS and other file systems break large files into recipes of hash-identified blobs of binary data. Blobs can then be replicated with far more ease than a single file, as well as streamed from disk in a memory safe manner. Blobs are treated as single, independent units so the underlying data store doesn\u0026rsquo;t grow as files are duplicated. Finally, blobs can be encrypted individually and provide more opportunities for privacy.\u003c/p\u003e","title":"Fixed vs. Variable Length Chunking"},{"content":"In today\u0026rsquo;s edition of “really simple things that come in handy all the time” I present a simple script to extract the table of contents from markdown or asciidoc files:\nSo this is pretty simple, just use regular expressions to look for lines that start with one or more \u0026quot;#\u0026quot; or \u0026quot;=\u0026quot; (for markdown and asciidoc, respectively) and print them out with an indent according to their depth (e.g. indent ## heading 2 one block). Because this script goes from top to bottom, you get a quick view of the document structure without creating a nested data structure under the hood. I\u0026rsquo;ve also implemented some simple type detection using common extensions to decide which regex to use.\nThe result is a quick view of the structure of a markup file, especially when the file gets overly large. From the Markdown of one of my longer blog posts:\n- A Practical Guide to Anonymizing Datasets with Python - Anonymizing CSV Data - Generating Fake Data - Creating A Provider - Maintaining Data Quality - Domain Distribution - Realistic Profiles - Fuzzing Fake Names from Duplicates - Conclusion - Acknowledgments - Footnotes And from the first chapter of Applied Text Analysis with Python:\n- Language and Computation - - - What is Language? 
- Identifying the Basic Units of Language - Formal vs. Natural Languages - Formal Languages - Natural Languages - Language Models - Language Features - Contextual Features - Structural Features - The Academic State of the Art - Tools for Natural Language Processing - Language Aware Data Products - Conclusion Ok, so clearly there are some bugs: those two blank - bullet points come from a note callout, which has the form:\n[NOTE] ==== Insert note text here. ==== The script therefore misidentifies the first and second ==== as level 4 headings. I tried a couple of regular expression fixes for this, but couldn\u0026rsquo;t quite get it working. The next step is to add a simple loop over multiple paths so that I can print out the table of contents for an entire directory (e.g. to get the TOC for the entire book where one chapter == one file).\n","permalink":"https://bbengfort.github.io/2017/02/extract-toc/","summary":"\u003cp\u003eIn today\u0026rsquo;s edition of “really simple things that come in handy all the time” I present a simple script to extract the table of contents from markdown or asciidoc files:\u003c/p\u003e\n\u003cscript src=\"https://gist.github.com/bbengfort/6ab36e0f518fe3e0f92bce6f53bdd80f.js\"\u003e\u003c/script\u003e\n\n\u003cp\u003eSo this is pretty simple, just use regular expressions to look for lines that start with one or more \u003ccode\u003e\u0026quot;#\u0026quot;\u003c/code\u003e or \u003ccode\u003e\u0026quot;=\u0026quot;\u003c/code\u003e (for markdown and asciidoc, respectively) and print them out with an indent according to their depth (e.g. indent \u003ccode\u003e##\u003c/code\u003e heading 2 one block). Because this script goes from top to bottom, you get a quick view of the document structure without creating a nested data structure under the hood. 
I\u0026rsquo;ve also implemented some simple type detection using common extensions to decide which regex to use.\u003c/p\u003e","title":"Extracting a TOC from Markup"},{"content":"The Filesystem in Userspace (FUSE) software interface allows developers to create file systems without editing kernel code. This is especially useful when creating replicated file systems, file protocols, backup systems, or other computer systems that require intervention for FS operations but not an entire operating system. FUSE works by running the FS code as a user process while FUSE provides a bridge through a request/response protocol to the kernel.\nIn Go, the FUSE library is implemented by bazil.org/fuse. It is a from-scratch implementation of the kernel-userspace communication protocol and does not use the C library. The library has been excellent for research implementations, particularly because Go is such an excellent language (named programming language of 2016). However, it does lead to some questions (particularly because of the questions in the Go documentation):\nHow is the performance of bazil.org/fuse? How complete is the implementation of bazil.org/fuse? How does bazil.org/fuse compare to a native system? In order to start taking a look at these questions, I created an in-memory file system using bazil.org/fuse called MemFS. The bazil.org/fuse library works by providing many interfaces to define a File System (fs.FS* interfaces), Nodes (fs.Node* interfaces) that represent files, links, and directories, and Handles (fs.Handle* interfaces) to open files. A file system that uses bazil.org/fuse must create Go objects that implement these interfaces then pass them to the FUSE server which makes calls to the relevant methods. The goal of MemFS was to provide as complete an implementation of every single interface as possible (since I have yet to find a reference implementation that does do this).\nIn order to evaluate the performance of MemFS vs. 
the normal file system on my MacBook Pro, I created the following protocol, implemented by a simple Python script:\nClone a repository into the file system Make/build the software in the repository Traverse the file system and stat every file Collect time and FS meta data information I then compared MemFS performance to normal FS performance for several popular large and small C applications including databases, web servers, and programming languages:\nRedis (76MB, 591 files, 137,475 LOC) Postgres (422MB, 4,807 files, 910,948 LOC) Nginx (62MB, 440 files, 155,056 LOC) Apache Web Server (362MB, 4,059 files, 503,006 LOC) Ruby (197MB, 3,281 files, 918,052 LOC) Python 3 (382MB, 3,570 files, 931,814 LOC) Unfortunately due to vagaries in the build process with Postgres, Apache Httpd, Ruby, and Python I can only report on Redis and Nginx. But more on that later.\nThe workload code can be found at github.com/bbengfort/compile-workload along with a benchmark.sh script that executes the full test-suite. Here are the time results for the OS X file system (OS X) vs. MemFS for both cloning and compiling:\nAs you can see from the graphs, git clone is horribly slow on MemFS. Further investigation revealed that git is writing 4107 bytes of data at a time as it downloads its compressed pack file. This means approximately 255 times more calls to the Write() FUSE method than other file writing mechanisms which typically write 1MB at a time. Because each call to Write() must be handled by the FUSE server and responded to by MemFS (which is allocating a byte slice under the hood), the more calls to Write(), the worse the system performs.\nCompiling, on the other hand, is supposed to be more representative of a workload - containing many reads, writes, and stats in a controlled sequence of events. For both Redis and Nginx, MemFS does add some overhead to compilation, but not nearly as much as git clone did. 
Note that downloading a zip file from GitHub and then building it exhibited a similar shape to the compiling graph.\nMemory usage for MemFS is currently atrocious, however:\nThe dotted lines are the maximum file usage on disk according to a recursive stat of each file. The solid lines are the memory usage of MemFS during clone and build. Although some extra memory overhead is expected to maintain the journal and references to the file system tree, the amount of overhead necessary seems completely out of whack compared to the storage requirements. Some investigation into freeing data is necessary.\nFinally the last, critical lesson. The reason only Redis and Nginx are represented in the graph is that the other builds failed for one reason or another. The build failures were primarily caused by my goal of implementing 100% of the FUSE interface methods. However this is not how bazil.org/fuse works and in fact 100% interface implementation is exactly the wrong thing to do.\nTake for example the ReadAll() vs. Read() methods that implement HandleReadAller and HandleReader respectively. I attempted to implement both and kept receiving different build errors, though I did notice the clone and compile behavior changing as I messed with these two methods. It turns out that the bazil.org/fuse server implementation checks to see if the FS implements HandleReadAller and if so, returns the result from ReadAll, otherwise it enforces HandleReader and sends the Read method the request and response from FUSE.\nMy hypothesis as to why cloning was failing when I had implemented ReadAll is simple. The Read method allows the client to specify an offset in the file to read from and a size of data to respond with. Presumably git clone was attempting to read the last 32 bytes of the compressed pack file (or something like that) so it could perform a CRC check or some other data validation. 
FUSE, however, returned all of the data rather than just the data from the offset because ReadAll was implemented. As a result, git clone choked with a stream error.\nThe bottom line is that FUSE allows some interfaces for convenience only for higher level FS implementations. MemFS, however, needs to support only the low level FUSE serve interactions. As a general rule of thumb, if the interface method takes a request and response object and simply returns an error - then that FUSE method is probably at a bit lower of a level, exactly what MemFS is looking for.\n","permalink":"https://bbengfort.github.io/2017/01/fuse-inmem-fs/","summary":"\u003cp\u003eThe \u003ca href=\"https://en.wikipedia.org/wiki/Filesystem_in_Userspace\"\u003eFilesystem in Userspace (FUSE)\u003c/a\u003e software interface allows developers to create file systems without editing kernel code. This is especially useful when creating replicated file systems, file protocols, backup systems, or other computer systems that require intervention for FS operations but not an entire operating system. FUSE works by running the FS code as a user process while FUSE provides a bridge through a request/response protocol to the kernel.\u003c/p\u003e\n\u003cp\u003eIn Go, the FUSE library is implemented by \u003ca href=\"https://github.com/bazil/fuse\"\u003ebazil.org/fuse\u003c/a\u003e. It is a from-scratch implementation of the kernel-userspace communication protocol and does not use the C library. The library has been excellent for research implementations, particularly because Go is such an excellent language (named \u003ca href=\"http://www.tiobe.com/tiobe-index/\"\u003eprogramming language of 2016\u003c/a\u003e). However, it does lead to some questions (particularly because of the questions in the Go documentation):\u003c/p\u003e","title":"In-Memory File System with FUSE"},{"content":"For close-to-open consistency, we need to be able to implement a file system that can detect atomic changes to a single file. 
Most programming languages implement open() and close() methods for files - but what they are really modifying is the access of a handle to an open file that the operating system provides. Writes are buffered in an asynchronous fashion so that the operating system and user program don\u0026rsquo;t have to wait for the spinning disk to figure itself out before carrying on. Additional file calls such as sync() and flush() give the user the ability to hint to the OS about what should happen relative to the state of data and the disk, but the OS provides no guarantees that will happen.\nWe use the FUSE library provided by bazil.org/fuse to implement a file system in user space. FUSE receives kernel calls to file system methods and passes them to a server - developers can write handlers for requests and return responses. Unfortunately, while there is an Open() method that returns the handle to an open file, there is no equivalent Close() method. Instead FUSE allows external processes to make calls to Read(), Write(), Flush(), and Fsync(). This led us to the obvious question - when reading and writing files, what calls are being made to the file system?\nTo answer this question, we wrote a Go program that wrote random data to a file. There are many ways to write to a file, as explained by Go by Example. So we implemented several methods (discussed below). We then ran the data writer program against a file on an in-memory FUSE server that logged different calls. The results are shown below:\nThe bottom line is that Fsync() is only called when the user program calls it - essential for Vim and Emacs, but a hint only. Flush() is always called at close, and Write() is called many times from open to close. The names on the Y-axis describe the various methods of writing to a file I will discuss next.\nThe first step is to generate random data with n bytes. To do this, I chose to write random alphabetic characters to the file, along with a couple of white space characters. 
The Go function used to implement this is as follows:\nvar letterRunes = []rune(\u0026#34;abcdefghijklmnopqrstuvwxyz\\n \u0026#34;) func randString(n int) string { b := make([]rune, n) // Make a rune slice of length n // For every position in b, assign a random rune for i := range b { b[i] = letterRunes[rand.Intn(len(letterRunes))] } // Convert the rune to a string and return return string(b) } The easiest method to write data to a file is to use the ioutil package - just supply a path, data, and a file mode and ioutil will do all the rest. This is the common mechanism I use for reading and writing files, so we were very interested to see how Go handled files from the file system perspective. Implementing this function is easy:\ndata := []byte(randString(1.049e+8)) err := ioutil.WriteFile(\u0026#34;test.txt\u0026#34;, data, 0644) All we have to do is create a 100MB slice of random data and send it to test.txt - easy! Under the hood, it appears that Go is writing blocks of 1,048,576 (1MB) to the file at a time, then calling access, attrs, and flushing the data. A snippet of the log output shows the sequence of actions:\n... wrote 1048576 bytes offset by 100663296 to file 2 wrote 1048576 bytes offset by 101711872 to file 2 wrote 1048576 bytes offset by 102760448 to file 2 wrote 1048576 bytes offset by 103809024 to file 2 wrote 42400 bytes offset by 104857600 to file 2 access called on node 2 getting attrs on node 2 flush file 2 (dirty: true, contains 104900000) getting attrs on node 2 ... The ioutil package appears to be just a wrapper function around the standard mechanism of opening the file, writing, syncing, and closing the file. We call this the \u0026ldquo;dump\u0026rdquo; method since we\u0026rsquo;re just sticking the data all into disk at once. 
However, even though we call Write() with the complete data slice, only 1MB is passed to the FUSE Write handler at a time.\nfobj, err := os.Create(\u0026#34;test.txt\u0026#34;) check(err) defer fobj.Close() _, err = fobj.Write([]byte(randString(1.049e+8))) check(err) fobj.Sync() Note the last call to fobj.Sync(); if we omit this call (dump no sync), then FUSE never sees an fsync event and all is well. No matter what, though, Flush is called (probably by fobj.Close()). Since Go is clearly doing some chunking and writing, my last thought was to try to do my own chunking, below Go\u0026rsquo;s 1MB chunks, and see if any arbitrary fsync calls occurred as Go was managing the handle to the open file.\nfobj, err := os.Create(path) check(err) defer fobj.Close() nbytes := int(1.049e+8) chunks := 524288 for i := 0; i \u0026lt; nbytes; i += chunks { var n int if nbytes-i \u0026lt; chunks { n = nbytes - i } else { n = chunks } _, err = fobj.Write([]byte(randString(n))) check(err) err = fobj.Sync() check(err) } However, as shown in the graph, fsync was only called if it was directly called by the user code. Note that on our file system, it took between 6 and 10 seconds to write the 100MB file to disk. There was plenty of occasion for Go\u0026rsquo;s routine functionality (garbage collection, etc.) to run during the processing of the file.\nFor more information, or to experiment with different calls, check out the complete write.go command on Gist.\n","permalink":"https://bbengfort.github.io/2017/01/fuse-calls/","summary":"\u003cp\u003eFor close-to-open consistency, we need to be able to implement a file system that can detect atomic changes to a single file. Most programming languages implement \u003ccode\u003eopen()\u003c/code\u003e and \u003ccode\u003eclose()\u003c/code\u003e methods for files - but what they are really modifying is the access of a \u003cem\u003ehandle\u003c/em\u003e to an open file that the operating system provides. 
Writes are buffered in an asynchronous fashion so that the operating system and user program don\u0026rsquo;t have to wait for the spinning disk to figure itself out before carrying on. Additional file calls such as \u003ccode\u003esync()\u003c/code\u003e and \u003ccode\u003eflush()\u003c/code\u003e give the user the ability to hint to the OS about what should happen relative to the state of data and the disk, but the OS provides no guarantees that will happen.\u003c/p\u003e","title":"FUSE Calls on Go Writes"},{"content":"Working with FUSE to build file systems means inevitably you have to deal with (or return) system call errors. The Go FUSE implementation includes helpers and constants for returning these errors, but simply wraps them around the syscall error numbers. I needed descriptions to better understand what was doing what. Pete saved the day by pointing me towards the errno.h header file on my Macbook. Some Python later and we had the descriptions:\nSo that\u0026rsquo;s a good script to have on your local machine, since now I can just do the following:\n$ syserr.py | grep EAGAIN 35: EAGAIN: Resource temporarily unavailable To get descriptions for the various errors. However for Google reference, I\u0026rsquo;ll also provide them here:\nEnd of post.\n","permalink":"https://bbengfort.github.io/2017/01/syscall-errno/","summary":"\u003cp\u003eWorking with \u003ca href=\"https://bazil.org/fuse/\"\u003eFUSE\u003c/a\u003e to build file systems means inevitably you have to deal with (or return) system call errors. The \u003ca href=\"https://godoc.org/bazil.org/fuse#pkg-constants\"\u003eGo FUSE\u003c/a\u003e implementation includes helpers and constants for returning these errors, but simply wraps them around the \u003ca href=\"https://golang.org/pkg/syscall/#pkg-constants\"\u003esyscall\u003c/a\u003e error numbers. I needed descriptions to better understand what was doing what. 
Pete saved the day by pointing me towards the \u003ccode\u003eerrno.h\u003c/code\u003e header file on my Macbook. Some Python later and we had the descriptions:\u003c/p\u003e","title":"Error Descriptions for System Calls"},{"content":"Writing systems means the heavy use of go routines to support concurrent operations. My current architecture employs several go routines to run a server for a simple web interface as well as command line app, file system servers, replica servers, consensus coordination, etc. Using multiple go routines (threads) instead of processes allows for easier development and shared resources, such as a database that can support transactions. However, management of all these threads can be tricky.\nMy current plan is to initialize thread-safe resources in a main thread, then pass those resources to the various go routines that need to do their ListenAndServe work. The main thread then listens on an error channel in case anything bad goes down that requires termination of the entire service. The first error that comes in will shut everything down, otherwise if no errors come in, then the main thread is just sitting there listening and managing everything overall.\nAs a reminder how to do this, here is a simple example:\nBasically the main function acts as the main thread here, initializing the error channel, then running 10 bomb threads who create random delays. Whichever bomb goes off first sends the error on the channel and the entire process quits. Simple!\n","permalink":"https://bbengfort.github.io/2017/01/run-until-err/","summary":"\u003cp\u003eWriting systems means the heavy use of go routines to support concurrent operations. My current architecture employs several go routines to run a server for a simple web interface as well as command line app, file system servers, replica servers, consensus coordination, etc. 
Using multiple go routines (threads) instead of processes allows for easier development and shared resources, such as a database that can support transactions. However, management of all these threads can be tricky.\u003c/p\u003e","title":"Run Until Error with Go Channels"},{"content":"This post is just a reminder as I work through handling JSON data with Go. Go provides first class JSON support through its standard library json package. The interface is simple, primarily through json.Marshal and json.Unmarshal functions which are analogous to typed versions of Python\u0026rsquo;s json.dump and json.load. Type safety is the trick, however, and generally speaking you define a struct to serialize and deserialize as follows:\ntype Person struct { Name string `json:\u0026#34;name,omitempty\u0026#34;` Age int `json:\u0026#34;age,omitempty\u0026#34;` Salary int `json:\u0026#34;-\u0026#34;` } op := \u0026amp;Person{Name: \u0026#34;John Doe\u0026#34;, Age: 42} data, _ := json.Marshal(op) var np Person json.Unmarshal(data, \u0026amp;np) So this is all well and good, until you start wanting to just send around arbitrary data. Luckily the json package will allow you to do that using reflection to load data into a map[string]interface{}, i.e. a dictionary whose keys are strings and whose values are any arbitrary type (anything that implements the empty interface, that is, has zero or more methods, which all Go types do). So you might see code like this:\nDid you catch the surprise? That\u0026rsquo;s right, the age int got deserialized as a float64! Anyway, this whole post is about how long it took me to figure out that brand of reflection and how to avoid errors in the future.\n","permalink":"https://bbengfort.github.io/2017/01/generic-json-serialization-go/","summary":"\u003cp\u003eThis post is just a reminder as I work through handling JSON data with Go. Go provides first class JSON support through its standard library \u003ccode\u003ejson\u003c/code\u003e package. 
The interface is simple, primarily through \u003ccode\u003ejson.Marshal\u003c/code\u003e and \u003ccode\u003ejson.Unmarshal\u003c/code\u003e functions which are analogous to typed versions of Python\u0026rsquo;s \u003ccode\u003ejson.dump\u003c/code\u003e and \u003ccode\u003ejson.load\u003c/code\u003e. Type safety is the trick, however, and generally speaking you define a \u003ccode\u003estruct\u003c/code\u003e to serialize and deserialize as follows:\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-go\" data-lang=\"go\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"kd\"\u003etype\u003c/span\u003e \u003cspan class=\"nx\"\u003ePerson\u003c/span\u003e \u003cspan class=\"kd\"\u003estruct\u003c/span\u003e \u003cspan class=\"p\"\u003e{\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    \u003cspan class=\"nx\"\u003eName\u003c/span\u003e   \u003cspan class=\"kt\"\u003estring\u003c/span\u003e \u003cspan class=\"s\"\u003e`json:\u0026#34;name,omitempty\u0026#34;`\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    \u003cspan class=\"nx\"\u003eAge\u003c/span\u003e    \u003cspan class=\"kt\"\u003eint\u003c/span\u003e    \u003cspan class=\"s\"\u003e`json:\u0026#34;age,omitempty\u0026#34;`\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    \u003cspan class=\"nx\"\u003eSalary\u003c/span\u003e \u003cspan class=\"kt\"\u003eint\u003c/span\u003e    \u003cspan class=\"s\"\u003e`json:\u0026#34;-\u0026#34;`\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"p\"\u003e}\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan 
class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nx\"\u003eop\u003c/span\u003e \u003cspan class=\"o\"\u003e:=\u003c/span\u003e \u003cspan class=\"o\"\u003e\u0026amp;\u003c/span\u003e\u003cspan class=\"nx\"\u003ePerson\u003c/span\u003e\u003cspan class=\"p\"\u003e{\u003c/span\u003e\u003cspan class=\"nx\"\u003eName\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e \u003cspan class=\"s\"\u003e\u0026#34;John Doe\u0026#34;\u003c/span\u003e\u003cspan class=\"p\"\u003e,\u003c/span\u003e \u003cspan class=\"nx\"\u003eAge\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e \u003cspan class=\"mi\"\u003e42\u003c/span\u003e\u003cspan class=\"p\"\u003e}\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nx\"\u003edata\u003c/span\u003e\u003cspan class=\"p\"\u003e,\u003c/span\u003e \u003cspan class=\"nx\"\u003e_\u003c/span\u003e \u003cspan class=\"o\"\u003e:=\u003c/span\u003e \u003cspan class=\"nx\"\u003ejson\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"nf\"\u003eMarshal\u003c/span\u003e\u003cspan class=\"p\"\u003e(\u003c/span\u003e\u003cspan class=\"nx\"\u003eop\u003c/span\u003e\u003cspan class=\"p\"\u003e)\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"kd\"\u003evar\u003c/span\u003e \u003cspan class=\"nx\"\u003enp\u003c/span\u003e \u003cspan class=\"nx\"\u003ePerson\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nx\"\u003ejson\u003c/span\u003e\u003cspan class=\"p\"\u003e.\u003c/span\u003e\u003cspan class=\"nf\"\u003eUnmarshal\u003c/span\u003e\u003cspan class=\"p\"\u003e(\u003c/span\u003e\u003cspan class=\"nx\"\u003edata\u003c/span\u003e\u003cspan class=\"p\"\u003e,\u003c/span\u003e \u003cspan class=\"o\"\u003e\u0026amp;\u003c/span\u003e\u003cspan class=\"nx\"\u003enp\u003c/span\u003e\u003cspan 
class=\"p\"\u003e)\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eSo this is all well and good, until you start wanting to just send around arbitrary data. Luckily the \u003ccode\u003ejson\u003c/code\u003e package will allow you to do that using reflection to load data into a \u003ccode\u003emap[string]interface{}\u003c/code\u003e, i.e. a dictionary whose keys are strings and whose values are any arbitrary type (anything that implements the empty interface, that is, has zero or more methods, which all Go types do). So you might see code like this:\u003c/p\u003e","title":"Generic JSON Serialization with Go"},{"content":"One of the challenges we\u0026rsquo;ve been dealing with in the Yellowbrick library is the proper resolution of colors, a problem that seems to have parallels in matplotlib as well. The issue is that colors can be described by the user in a variety of ways, then that description has to be parsed and rendered as specific colors. To name a few color specifications that exist in matplotlib:\nNone: choose a reasonable default color The name of the color, e.g. \u0026quot;b\u0026quot; or \u0026quot;blue\u0026quot; The hex code of the color e.g. \u0026quot;#377eb8\u0026quot; The RGB or RGBA tuples of the color, e.g. (0.0078, 0.4470, 0.6353) A greyscale intensity string, e.g. \u0026quot;0.76\u0026quot;. The pyplot api documentation sums it up as follows:\nIn addition, you can specify colors in many weird and wonderful ways, including full names (\u0026lsquo;green\u0026rsquo;), hex strings (\u0026rsquo;#008000\u0026rsquo;), RGB or RGBA tuples ((0,1,0,1)) or grayscale intensities as a string (\u0026lsquo;0.8\u0026rsquo;). Of these, the string specifications can be used in place of a fmt group, but the tuple forms can be used only as kwargs.\nThings get even weirder and slightly less wonderful when you need to specify multiple colors. 
To name a few methods:\nA list of colors whose elements are one of the above color representations. The name of a color map object, e.g. \u0026quot;viridis\u0026quot; A color cycle object (e.g. a fixed length group of colors that repeats) Matplotlib Colormap objects resolve scalar values to RGBA mappings and are typically used by name via the matplotlib.cm.get_cmap function. They come in three varieties: Sequential, Diverging, and Qualitative. Sequential and Diverging color maps are used to indicate continuous, ordered data by changing the saturation or hue in incremental steps. Qualitative colormaps are used when no ordering or relationship is required such as in categorical data values.\nTrying to generalize this across methodologies is downright difficult. So instead let\u0026rsquo;s look at a specific problem. Given a dataset, X, whose shape is (n,d) where n is the number of points and d is the number of dimensions, and a target vector, y, create a figure that shows the distribution or relationship of points defined by X, differentiated by their target y. If d is 1 then we can use a histogram, if d is 2 or 3 we can use a scatter plot, and if d \u0026gt; 3, then we need RadViz or Parallel Coordinates. If y is discrete, e.g. classes, then we need a color map whose length is the number of classes, probably a qualitative colormap. If y is continuous, then we need to perform binning or assign values according to a sequential or diverging color map.\nSo, problem number one is detecting if y is discrete or continuous. There is no automatic way of determining this, so besides having the user directly specify the behavior, I have instead created the following rule-based functions:\ndef is_discrete(vec): \u0026#34;\u0026#34;\u0026#34; Returns True if the given vector contains categorical values. \u0026#34;\u0026#34;\u0026#34; # Convert the vector to a numpy array if it isn\u0026#39;t already. 
vec = np.array(vec) if vec.ndim != 1: raise ValueError(\u0026#34;can only handle 1-dimensional vectors\u0026#34;) # Check the array dtype if vec.dtype.kind in {\u0026#39;b\u0026#39;, \u0026#39;S\u0026#39;, \u0026#39;U\u0026#39;}: return True if vec.dtype.kind in {\u0026#39;f\u0026#39;, \u0026#39;c\u0026#39;}: return False # For vectors of \u0026gt;= 50 elements if vec.shape[0] \u0026gt;= 50: if np.unique(vec).shape[0] \u0026lt;= 20: return True return False # For vectors of \u0026lt; 50 elements else: elems = Counter(vec) if len(elems.keys()) \u0026lt;= 20 and all([c \u0026gt; 1 for c in elems.values()]): return True return False # Raise exception if we\u0026#39;ve made it to this point. raise ValueError( \u0026#34;could not determine if vector is discrete or continuous\u0026#34; ) def is_continuous(vec): \u0026#34;\u0026#34;\u0026#34; Returns True if the given vector contains continuous values. To keep things simple, this is currently implemented as not is_discrete(). \u0026#34;\u0026#34;\u0026#34; return not is_discrete(vec) The rules for determining discrete/categorical values are as follows:\nIf it is a string type - True If it\u0026rsquo;s a bool type - True If it is a floating point or complex type - False If \u0026gt;= 50 samples - True if there are 20 or fewer distinct values If \u0026lt; 50 samples - True if there are 20 or fewer distinct values that are each represented more than once. These rules are arbitrary but work on the following test cases:\ndatasets = ( np.random.normal(10, 1, 100), # Normally distributed floats np.random.randint(0, 100, 100), # Random integers np.random.uniform(0, 1, 1000), # Small uniform numbers np.random.randint(0, 2, 100), # Binary data (0 and 1) np.random.randint(1, 4, 100), # Three integer classes (1, 2, 3) np.random.choice(list(\u0026#39;ABC\u0026#39;), 100), # String classes ) for d in datasets: print(is_discrete(d)) The next step is to determine how best to assign colors for continuous vs. discrete values. 
One typical use case is to directly assign color values using the target variable, then provide a colormap for color assignment as shown:\n# Create some data sets. X = np.random.normal(10, 1, (100, 2)) yc = np.random.normal(10, 1, 100) yd = np.random.randint(1, 4, 100) f, (ax1, ax2) = plt.subplots(1, 2, sharey=True, figsize=(9,4)) # Plot the Continuous Target ax1.scatter(X[:,0], X[:,1], c=yc, cmap=\u0026#39;inferno\u0026#39;) # Plot the Discrete Target ax2.scatter(X[:,0], X[:,1], c=yd, cmap=\u0026#39;Set1\u0026#39;) Alternatively, the colors can be directly assigned by creating a list of colors. This brings us to our larger problem - how do we create a list of colors in a meaningful way to assign our colormap appropriately? One solution is to use the matplotlib.colors.ListedColormap object which takes a list of colors and can convert a dataset to that list as follows:\nIf the input data is in (0,1) - then uses a percentage to assign the color If the input data is an integer, then uses it as an index to fetch the color This means that some work has to be done ahead of time, e.g. 
discretizing the values or normalizing them.\nf, (ax1, ax2) = plt.subplots(1, 2, sharey=True, figsize=(9,4)) # Plot the Continuous Target norm = col.Normalize(vmin=yc.min(), vmax=yc.max()) cmap = col.ListedColormap([ \u0026#34;#ffffcc\u0026#34;, \u0026#34;#ffeda0\u0026#34;, \u0026#34;#fed976\u0026#34;, \u0026#34;#feb24c\u0026#34;, \u0026#34;#fd8d3c\u0026#34;, \u0026#34;#fc4e2a\u0026#34;, \u0026#34;#e31a1c\u0026#34;, \u0026#34;#bd0026\u0026#34;, \u0026#34;#800026\u0026#34; ]) ax1.scatter(X[:,0], X[:,1], c=cmap(norm(yc))) # Plot the Discrete Target cmap = col.ListedColormap([ \u0026#34;#34495e\u0026#34;, \u0026#34;#2ecc71\u0026#34;, \u0026#34;#e74c3c\u0026#34;, \u0026#34;#9b59b6\u0026#34;, \u0026#34;#f4d03f\u0026#34;, \u0026#34;#3498db\u0026#34; ]) ax2.scatter(X[:,0], X[:,1], c=cmap(yd), cmap=\u0026#39;Set1\u0026#39;) Note that in the above function, the indices 1-3 are used (not the 0 index) since the classes were 1-ordered.\nClearly color handling is tricky, but hopefully these notes will provide us with a reference when we need to continue to resolve these issues developing yellowbrick.\n","permalink":"https://bbengfort.github.io/2017/01/resolving-matplotlib-colors/","summary":"\u003cp\u003eOne of the challenges we\u0026rsquo;ve been dealing with in the Yellowbrick library is the proper resolution of colors, a problem that seems to have parallels in \u003ccode\u003ematplotlib\u003c/code\u003e as well. The issue is that colors can be described by the user in a variety of ways, then that description has to be parsed and rendered as specific colors. To name a few color specifications that exist in \u003ccode\u003ematplotlib\u003c/code\u003e:\u003c/p\u003e\n\u003col\u003e\n\u003cli\u003eNone: choose a reasonable default color\u003c/li\u003e\n\u003cli\u003eThe name of the color, e.g. \u003ccode\u003e\u0026quot;b\u0026quot;\u003c/code\u003e or \u003ccode\u003e\u0026quot;blue\u0026quot;\u003c/code\u003e\u003c/li\u003e\n\u003cli\u003eThe hex code of the color e.g. 
\u003ccode\u003e\u0026quot;#377eb8\u0026quot;\u003c/code\u003e\u003c/li\u003e\n\u003cli\u003eThe RGB or RGBA tuples of the color, e.g. \u003ccode\u003e(0.0078, 0.4470, 0.6353)\u003c/code\u003e\u003c/li\u003e\n\u003cli\u003eA greyscale intensity string, e.g. \u003ccode\u003e\u0026quot;0.76\u0026quot;\u003c/code\u003e.\u003c/li\u003e\n\u003c/ol\u003e\n\u003cp\u003eThe \u003ca href=\"http://matplotlib.org/api/pyplot_api.html\"\u003epyplot api documentation\u003c/a\u003e sums it up as follows:\u003c/p\u003e","title":"Resolving Matplotlib Colors"},{"content":"I\u0026rsquo;m starting to get serious about programming in Go, trying to move from an intermediate level to an advanced/expert level as I start to build larger systems. Right now I\u0026rsquo;m working on a problem that involves on demand iteration, and I don\u0026rsquo;t want to pass around entire arrays and instead be a bit more frugal about my memory usage. Yesterday, I discussed using channels to yield iterators from functions and was a big fan of the API, but had some questions about memory usage. So today I created a package, iterfile to benchmark and profile various iteration constructs in Go.\nBased on Ewan Cheslack-Postava\u0026rsquo;s Iterators in Go post, I created iteration functions for line-by-line reading of a file (Readlines), including the channel method, a method using callbacks, and a stateful iterator method that uses a struct to keep track of iteration (for funsies, I also added a Python implementation). Without further ado, here are the results:\nI used an external process to sample the memory of the readlines process every 0.01 seconds, using mprof by Fabian Pedregosa and Philippe Gervais. 
The four readlines implementations opened a large text file (3.9GB) with 900,002 lines of text containing random lengths of \u0026ldquo;fizz buzz foo bar baz\u0026rdquo; words, counting the total number of characters by summing the length of each line.\nThe python process took by far the longest and most memory as expected. The channel iterator implementation took almost as long as Python, but surprisingly used the least amount of memory. The callback and iterator implementations were the quickest, each using similar amounts of memory. Go benchmarks (go test -bench=.) for each function (except Python) are as follows:\nBenchmarkChanReadlinesSmall-8 20000 74958 ns/op BenchmarkChallbackReadlinesSmall-8 50000 28836 ns/op BenchmarkIteratorReadlinesSmall-8 50000 29006 ns/op BenchmarkChanReadlinesMedium-8 2000 621716 ns/op BenchmarkChallbackReadlinesMedium-8 10000 216734 ns/op BenchmarkIteratorReadlinesMedium-8 10000 219842 ns/op BenchmarkChanReadlinesLarge-8 200 6250004 ns/op BenchmarkChallbackReadlinesLarge-8 1000 2198904 ns/op BenchmarkIteratorReadlinesLarge-8 1000 2229104 ns/op As a result I\u0026rsquo;ll probably be using the stateful iterator approach more often in my code, reserving the channel method only when performance is not required, but a clear API is. Stay tuned for a post on writing stateful iterators.\n","permalink":"https://bbengfort.github.io/2016/12/benchmarking-readlines/","summary":"\u003cp\u003eI\u0026rsquo;m starting to get serious about programming in Go, trying to move from an intermediate level to an advanced/expert level as I start to build larger systems. Right now I\u0026rsquo;m working on a problem that involves on demand iteration, and I don\u0026rsquo;t want to pass around entire arrays and instead be a bit more frugal about my memory usage. 
Yesterday, I discussed using \u003ca href=\"https://bbengfort.github.io/2016/12/yielding-functions-for-iteration-golang/\"\u003echannels to yield iterators from functions\u003c/a\u003e and was a big fan of the API, but had some questions about memory usage. So today I created a package, \u003ca href=\"https://github.com/bbengfort/iterfile\"\u003eiterfile\u003c/a\u003e to benchmark and profile various iteration constructs in Go.\u003c/p\u003e","title":"Benchmarking Readline Iterators"},{"content":"It is very common for me to design code that expects functions to return an iterable context, particularly because I have been developing in Python with the yield statement. The yield statement allows functions to “return” the execution context to the caller while still maintaining state such that the caller can return state to the function and continue to iterate. It does this by actually returning a generator, an iterable object constructed from the local state of the closure.\nNow that I\u0026rsquo;m programming in Go, I often want to apply the same pattern, but iteration in Go is very different and is conducted at a slightly lower level. Go does have an iteration construct, range, that allows easy iteration over collection data structures, similar to a for each in construct. The good news is that range also works to collect elements from a channel, which means that an opportunity presents itself to create Go functions that yield by combining goroutines and channels.\nConsider the following example that implements similar (but simple) functionality as Python\u0026rsquo;s xrange iterator, allowing us to loop over the numbers from zero to the limit stepping by 1:\nThe function returns a channel of integers, to which range can be applied. We give up the execution context of our inner loop by running the loop in a goroutine, which sends its results to the caller using the channel as a synchronization mechanism. 
So long as we make sure to close the channel after iteration, this function works as expected:\nfor i := range XRange(10) { fmt.Println(i) } This pattern speaks to me; it is exactly how I think about constructing iterable functions. As a result, I have a bit less cognitive load than if I had to design stateful iterators and manage calls to Next() and HasNext() or something like that. This simple programming construct (which is Go idiomatic) does come at some performance cost \u0026ndash; Go now has to manage the goroutine and the communication of the channel. Potentially a solution is to use buffered channels, which will allow the iteration to store more information on the channel if the caller is slow to collect it.\nI do have some questions about this though, that I hope to answer in the future. Consider the following function for reading a file line by line:\nThis is very common utility code for me: pass in a path, open the file, and read the file one line at a time, buffering in memory only the line of text. Particularly for reading large files, we need to ensure that we minimize the amount of memory we use. The way that I use this function is as follows:\nreader, err := Readlines(\u0026#34;myfile.txt\u0026#34;) if err != nil { log.Fatal(err) } for line := range reader { fmt.Println(line) } But it does leave me with a few questions:\nWhat is the memory usage of the goroutine vs. the caller particularly for large files? Is it possible for the goroutine to get ahead of the caller and load huge chunks of data into memory before it can be collected? Speaking of collection, how exactly do lines in the file get cleaned up? I think I\u0026rsquo;d like to do some benchmarking tests with several files and large files using closures for iteration, channels as in this post, and more standard stateful iterator objects; comparing the use of memory and speed of reads. 
But I\u0026rsquo;ll save that for a later post!\n","permalink":"https://bbengfort.github.io/2016/12/yielding-functions-for-iteration-golang/","summary":"\u003cp\u003eIt is very common for me to design code that expects functions to return an iterable context, particularly because I have been developing in Python with the \u003ccode\u003eyield\u003c/code\u003e statement. The \u003ccode\u003eyield\u003c/code\u003e statement allows functions to “return” the execution context to the caller while still maintaining state such that the caller can return state to the function and continue to iterate. It does this by actually returning a \u003ccode\u003egenerator\u003c/code\u003e, iterable object constructed from the local state of the closure.\u003c/p\u003e","title":"Yielding Functions for Iteration in Go"},{"content":"Data Product Architectures: O\u0026rsquo;Reilly Webinar\nDescription Data products derive their value from data and generate new data in return. As a result, machine-learning techniques must be applied to their architecture and development. Machine learning fits models to make predictions on unknown inputs and must be generalizable and adaptable. As such, fitted models cannot exist in isolation; they must be operationalized and user facing so that applications can benefit from the new data, respond to it, and feed it back into the data product.\nData product architectures are, in effect, life-cycles. Understanding the data product life-cycle enables architects to develop robust, failure-free workflows and applications. Benjamin Bengfort discusses the data product life-cycle and outlines the Lambda Architecture, demonstrating how to engage a model build, evaluation, and selection phase with an operation and interaction phase. Benjamin then explores wrapping a central computational store for speed and querying and covers monitoring, management, and data exploration for hypothesis-driven development. 
From web applications to big data appliances, this architecture serves as a blueprint for handling data services of all sizes.\n","permalink":"https://bbengfort.github.io/2016/12/data-product-architectures-oreilly-webinar/","summary":"\u003cp\u003e\u003ca href=\"http://www.oreilly.com/pub/e/3800\"\u003eData Product Architectures: O\u0026rsquo;Reilly Webinar\u003c/a\u003e\u003c/p\u003e\n\u003ch3 id=\"description\"\u003eDescription\u003c/h3\u003e\n\u003cp\u003eData products derive their value from data and generate new data in return. As a result, machine-learning techniques must be applied to their architecture and development. Machine learning fits models to make predictions on unknown inputs and must be generalizable and adaptable. As such, fitted models cannot exist in isolation; they must be operationalized and user facing so that applications can benefit from the new data, respond to it, and feed it back into the data product.\u003c/p\u003e","title":"Data Product Architectures: O'Reilly Webinar"},{"content":"This short tutorial is intended to demonstrate the basics of exception handling and the use of context management in order to handle standard cases. These notes were originally created for a training I gave, and the notebook can be found at Exception Handling. I\u0026rsquo;m happy for any comments or pull requests on the notebook.\nExceptions Exceptions are a tool that programmers use to describe errors or faults that are fatal to the program; e.g. the program cannot or should not continue when an exception occurs. Exceptions can occur due to programming errors, user errors, or simply unexpected conditions like no internet access. Exceptions themselves are simply objects that contain information about what went wrong. Exceptions are usually defined by their type - which describes broadly the class of exception that occurred, and by a message that says specifically what happened. 
Here are a few common exception types:\nSyntaxError: raised when the programmer has made a mistake typing Python code.\nAttributeError: attempting to access an attribute on an object that does not exist\nKeyError: attempting to access a key in a dictionary that does not exist\nTypeError: raised when an argument to a function is not the right type (e.g. a str instead of int)\nValueError: when an argument to a function is the right type but not in the right domain (e.g. an empty string)\nImportError: raised when an import fails\nIOError: raised when Python cannot access a file correctly on disk\nExceptions are defined in a class hierarchy - e.g. every exception is an object whose class defines its type. The base class is the Exception object. All Exception objects are initialized with a message - a string that describes exactly what went wrong. Constructed objects can then be \u0026ldquo;raised\u0026rdquo; or \u0026ldquo;thrown\u0026rdquo; with the raise keyword:\nraise Exception(\u0026#34;Something bad happened!\u0026#34;) The reason the keyword is raise is because Python program execution creates what\u0026rsquo;s called a \u0026ldquo;stack\u0026rdquo; as functions call other functions, which call other functions, etc. When a function (at the bottom of the stack) raises an Exception, it is propagated up through the call stack so that every function gets a chance to \u0026ldquo;handle\u0026rdquo; the exception (more on that later). If the exception reaches the top of the stack, then the program terminates and a traceback is printed to the console. The traceback is meant to help developers identify what went wrong in their code.\nLet\u0026rsquo;s take a look at a simple example:\ndef main(badstep=None, **kwargs): \u0026#34;\u0026#34;\u0026#34; This function is the entry point of the program, it does work on the arguments by calling each step function, which in turn call substep functions. 
Passing in a number for badstep will cause whichever step that is to raise an exception. \u0026#34;\u0026#34;\u0026#34; step = 0 # count the steps # Execute each step one at a time. step = first(step, badstep) step = second(step, badstep) # Return a report return \u0026#34;Successfully executed {} steps\u0026#34;.format(step) def first(step, badstep=None): # Increment the step step += 1 # Check if this is a bad step if badstep == step: raise ValueError(\u0026#34;Failed after {} steps\u0026#34;.format(step)) # Call sub steps in order step = first_task_one(step, badstep) step = first_task_two(step, badstep) # Return the step that we\u0026#39;re on return step def first_task_one(step, badstep=None): # Increment the step step += 1 # Check if this is a bad step if badstep == step: raise ValueError(\u0026#34;Failed after {} steps\u0026#34;.format(step)) # Call sub steps in order step = first_task_one_subtask_one(step, badstep) # Return the step that we\u0026#39;re on return step def first_task_one_subtask_one(step, badstep=None): # Increment the step step += 1 # Check if this is a bad step if badstep == step: raise ValueError(\u0026#34;Failed after {} steps\u0026#34;.format(step)) # Return the step that we\u0026#39;re on return step def first_task_two(step, badstep=None): # Increment the step step += 1 # Check if this is a bad step if badstep == step: raise ValueError(\u0026#34;Failed after {} steps\u0026#34;.format(step)) # Return the step that we\u0026#39;re on return step def second(step, badstep=None): # Increment the step step += 1 # Check if this is a bad step if badstep == step: raise ValueError(\u0026#34;Failed after {} steps\u0026#34;.format(step)) # Call sub steps in order step = second_task_one(step, badstep) # Return the step that we\u0026#39;re on return step def second_task_one(step, badstep=None): # Increment the step step += 1 # Check if this is a bad step if badstep == step: raise ValueError(\u0026#34;Failed after {} steps\u0026#34;.format(step)) # Return 
the step that we\u0026#39;re on return step The above example represents a fairly complex piece of code that has lots of functions that call lots of other functions. The question is then, how do we know where our code went wrong? The answer is the traceback - which delineates exactly the functions that the exception was raised through. Let\u0026rsquo;s trigger the exception and the traceback:\nmain(3) --------------------------------------------------------------------------- ValueError Traceback (most recent call last) \u0026lt;ipython-input-5-e46a77400742\u0026gt; in \u0026lt;module\u0026gt;() ----\u0026gt; 1 main(3) \u0026lt;ipython-input-4-03153844a5cc\u0026gt; in main(badstep, **kwargs) 12 13 # Execute each step one at a time. ---\u0026gt; 14 step = first(step, badstep) 15 step = second(step, badstep) 16 \u0026lt;ipython-input-4-03153844a5cc\u0026gt; in first(step, badstep) 28 29 # Call sub steps in order ---\u0026gt; 30 step = first_task_one(step, badstep) 31 step = first_task_two(step, badstep) 32 \u0026lt;ipython-input-4-03153844a5cc\u0026gt; in first_task_one(step, badstep) 44 45 # Call sub steps in order ---\u0026gt; 46 step = first_task_one_subtask_one(step, badstep) 47 48 # Return the step that we\u0026#39;re on \u0026lt;ipython-input-4-03153844a5cc\u0026gt; in first_task_one_subtask_one(step, badstep) 56 # Check if this is a bad step 57 if badstep == step: ---\u0026gt; 58 raise ValueError(\u0026#34;Failed after {} steps\u0026#34;.format(step)) 59 60 # Return the step that we\u0026#39;re on ValueError: Failed after 3 steps The way to read the traceback is to start at the very bottom. As you can see it indicates the type of the exception, followed by a colon, and then the message that was passed to the exception constructor. Often, this information is enough to figure out what is going wrong. 
However, if we\u0026rsquo;re unsure where the problem occurred, we can step back through the traceback in a bottom to top fashion.\nThe first part of the traceback indicates the exact line of code and file where the exception was raised, as well as the name of the function it was raised in. If you called main(3) then this indicates that first_task_one_subtask_one is the function where the problem occurred. If you wrote this function, then perhaps that is the place to change your code to handle the exception.\nHowever, many times you\u0026rsquo;re using third party libraries or Python standard library modules, meaning the location of the exception raised is not helpful, since you can\u0026rsquo;t change that code. Therefore, you will continue up the call stack until you discover a file/function in the code you wrote. This will provide the surrounding context for why the error was raised, and you can use pdb or even just print statements to debug the variables around that line of code. Alternatively you can simply handle the exception, which we\u0026rsquo;ll discuss shortly. In the example above, we can see that first_task_one_subtask_one was called by first_task_one at line 46, which was called by first at line 30, which was called by main at line 14.\nCatching Exceptions If the exception was caused by a programming error, the developer can simply change the code to make it correct. However, if the exception was created by bad user input or by a bad environmental condition (e.g. the wireless is down), then you don\u0026rsquo;t want to crash the program. Instead you want to provide feedback and allow the user to fix the problem or try again. 
Therefore in your code, you can catch exceptions at the place they occur using the following syntax:\ntry: # Code that may raise an exception except AttributeError as e: # Code to handle the exception case finally: # Code that must run even if there was an exception What we\u0026rsquo;re basically saying is try to do the code in the first block - hopefully it works. If it raises an AttributeError save that exception in a variable called e (the as e syntax) then we will deal with that exception in the except block. Then finally run the code in the finally block even if an exception occurs. By specifying exactly the type of exception we want to catch (AttributeError in this case), we will not catch all exceptions, only those that are of the type specified, including subclasses. If you want to catch all exceptions, you can use one of the following syntaxes:\ntry: # Code that may raise an exception except: # Except all exceptions or\ntry: # Code that may raise an exception except Exception as e: # Except all exceptions and capture in variable e However, it is best practice to capture only the type of exception you expect to happen, because you could accidentally create the situation where you\u0026rsquo;re capturing fatal errors but not handling them appropriately. Here is an example:\nimport random class RandomError(Exception): \u0026#34;\u0026#34;\u0026#34; A custom exception for this code block. \u0026#34;\u0026#34;\u0026#34; pass def randomly_errors(p_error=0.5): if random.random() \u0026lt;= p_error: raise RandomError(\u0026#34;Error raised with {:0.2f} likelihood!\u0026#34;.format(p_error)) try: randomly_errors(0.5) print(\u0026#34;No error occurred!\u0026#34;) except RandomError as e: print(e) finally: print(\u0026#34;This runs no matter what!\u0026#34;) This code snippet demonstrates a couple of things. First you can define your own, program-specific exceptions by defining a class that extends Exception. 
We have done so and created our own RandomError exception class. Next we have a function that raises a RandomError with some likelihood, which is an argument to the function. Then we have our exception handling block that calls the function and handles it.\nTry the following with the code snippet:\nChange the likelihood of the error to see what happens\nexcept Exception instead of RandomError\nexcept TypeError instead of RandomError\nCall randomly_errors again inside of the except block\nCall randomly_errors again inside of the finally block\nMake sure you run the code multiple times since the error does occur randomly!\nLBYL vs. EAFP One quick note on exception handling in Python. You may wonder why you must use a try/except block to handle exceptions, couldn\u0026rsquo;t you simply do a check that the exception won\u0026rsquo;t occur before it does? For example, consider the following code:\nif key in mydict: val = mydict[key] # Do something with val else: # Handle the fact that mydict doesn\u0026#39;t have a required key. This code checks if a key exists in the dictionary before using it, then uses an else block to handle the \u0026ldquo;exception\u0026rdquo;. This is an alternative to the following code:\ntry: val = mydict[key] # Do something with val except KeyError: # Handle the fact that mydict doesn\u0026#39;t have a required key. Both blocks of code are valid. In fact they have names:\nLook Before You Leap (LBYL)\nEasier to Ask Forgiveness than Permission (EAFP)\nFor a variety of reasons, the second example (EAFP) is more pythonic; that is, it is the preferred Python style, commonly accepted by Python developers. For more on this, please see Alex Martelli\u0026rsquo;s excellent PyCon 2016 talk, Exception and error handling in Python 2 and Python 3.\nContext Management Python does provide a syntax for embedding common try/except/finally blocks in an easy to read format called context management. 
To motivate the example, consider the following code snippet:\ntry: fobj = open(\u0026#39;path/to/file.txt\u0026#39;, \u0026#39;r\u0026#39;) data = fobj.read() except FileNotFoundError as e: print(e) print(\u0026#34;Could not find the necessary file!\u0026#34;) finally: fobj.close() This is a very common piece of code that opens a file and reads data from it. If the file doesn\u0026rsquo;t exist, we simply alert the user that the required file is missing. No matter what, the file is closed. This is critical because if the file is not closed properly, it can be corrupted or not available to other parts of the program. Data loss is not acceptable, so we need to ensure that no matter what the file is closed when we\u0026rsquo;re done with it. So we can do the following:\nwith open(\u0026#39;path/to/file.txt\u0026#39;, \u0026#39;r\u0026#39;) as fobj: data = fobj.read() The with as syntax implements context management. On with, a function called the enter function is called to do some work on behalf of the user (in this case open a file), and the return of that function is saved in the fobj variable. When this block is complete, the finally is called by implementing an exit function. (Note that the except part is not implemented in this particular code). In this way, we can ensure that the try/finally for opening and reading files is correctly implemented.\nWriting your own context managers is possible, but beyond the scope of this note (though I may write something on it shortly). Suffice it to say, you should always use the with/as syntax for opening files!\n","permalink":"https://bbengfort.github.io/2016/11/exception-handling/","summary":"\u003cp\u003eThis short tutorial is intended to demonstrate the basics of exception handling and the use of context management in order to handle standard cases. 
These notes were originally created for a training I gave, and the notebook can be found at \u003ca href=\"https://github.com/DistrictDataLabs/ceb-training/blob/master/notes/Exception%20Handling.ipynb\"\u003eException Handling\u003c/a\u003e. I\u0026rsquo;m happy for any comments or pull requests on the notebook.\u003c/p\u003e\n\u003ch2 id=\"exceptions\"\u003eExceptions\u003c/h2\u003e\n\u003cp\u003e\u003cstrong\u003eExceptions\u003c/strong\u003e are a tool that programmers use to describe errors or faults that are \u003cem\u003efatal\u003c/em\u003e to the program; e.g. the program cannot or should not continue when an exception occurs. Exceptions can occur due to programming errors, user errors, or simply unexpected conditions like no internet access. Exceptions themselves are simply objects that contain information about what went wrong. Exceptions are usually defined by their \u003ccode\u003etype\u003c/code\u003e - which describes broadly the class of exception that occurred, and by a \u003ccode\u003emessage\u003c/code\u003e that says specifically what happened. Here are a few common exception types:\u003c/p\u003e","title":"Exception Handling"},{"content":"In order to promote the use of graph data structures for data analysis, I\u0026rsquo;ve recently given talks on dynamic graphs: embedding time into graph structures to analyze change. In order to embed time into a graph there are two primary mechanisms: make time a graph element (a vertex or an edge) or have multiple subgraphs where each graph represents a discrete time step. By using either of these techniques, opportunities exist to perform a structural analysis using graph algorithms on time; for example - asking what time is most central to a particular set of relationships.\nGraphs are primarily useful to simplify modeling and querying, but they are also useful for visual analytics. 
While visualizing static graphs with time embedded as a structure requires only standard graph techniques, visualizing dynamic graphs requires some sort of animation or interaction. We are currently exploring these techniques in the District Data Labs dynamic graphs research group. Towards that research, we are proposing to use D3 and SVG for interaction and visualization.\nAs time moves forward, graph elements (vertices and edges) will change, either being added to the graph or removed from it. To support visual analytics, particularly with layouts that will change depending on the nodes that get added (like force directed layouts), these transitions must not be sudden, but instead give visual clues as to what\u0026rsquo;s going on in the layout. The most obvious choice is to use opacity or size to fade in and out during the transition. However, this does not give the user any sense of how long the node has been on the screen, or how long it has left.\nTherefore, I\u0026rsquo;m interested in creating vertices that have timers associated with them. Inspired by raftscope, I want to create vertices that have a timer that indicates how long they\u0026rsquo;ve been on the screen. Here is my initial attempt:\nThe code to do this uses JavaScript with jQuery as well as CSS but no other libraries. To make this work for graphs, we\u0026rsquo;ll have to find a way to implement this vertex type in D3. 
But for now, we can just look at what\u0026rsquo;s happening.\nFirst I added an SVG element to the body of my HTML:\n\u0026lt;html\u0026gt; \u0026lt;head\u0026gt; \u0026lt;title\u0026gt;Vertex Timer Test\u0026lt;/title\u0026gt; \u0026lt;/head\u0026gt; \u0026lt;body\u0026gt; \u0026lt;svg id=\u0026#34;timer-vertex\u0026#34; xmlns=\u0026#34;http://www.w3.org/2000/svg\u0026#34; version=\u0026#34;1.1\u0026#34; xmlns:xlink=\u0026#34;http://www.w3.org/1999/xlink\u0026#34;\u0026gt; \u0026lt;/svg\u0026gt; \u0026lt;/body\u0026gt; \u0026lt;/html\u0026gt; Then add some simple styles with CSS so that you don\u0026rsquo;t have to manually set them on every single element:\nsvg { width: 100%; height: 120px; } svg .vertex text { text-anchor: middle; dominant-baseline: central; text-align: center; fill: #FEFEFE; } svg .vertex circle { fill: #003F87; } svg .vertex path { fill: none; stroke: #CF0000; } For the rest of the work, we\u0026rsquo;re going to manually add SVG elements with JavaScript, updating their attributes with computed values. To make this easier, a simple function will allow us to create SVG elements in the correct namespace:\nfunction SVG(tag) { var ns = \u0026#39;http://www.w3.org/2000/svg\u0026#39;; return $(document.createElementNS(ns, tag)); } We can now use this function to quickly create the elements of our vertex: the circle representing the node, the text representing the label, and the arc representing the timer. First, let\u0026rsquo;s find the center of the SVG so that we know where to place the vertex, and define other properties like its radius.\n// Set the constant arc width var ARC_WIDTH = 6; // Select the svg to place the vertex into var svg = $(\u0026#34;#timer-vertex\u0026#34;); // Define the vertex center point and radius var vertexSpec = { cx: svg.width() / 2, cy: svg.height() / 2, r: 30, }; Before we can add all of the elements, we need to define the method by which we create the arc. 
To do this we\u0026rsquo;re going to create a path that follows an arc. Creating paths with SVG means defining a d attribute, which contains a series of commands and parameters that define the shape of the path. The first command is the \u0026ldquo;move to\u0026rdquo; command, M, that specifies where the path begins, e.g. M50 210 places a point at the coordinates (50, 210). We then define the arc with the A command. The A command is complex, you have to define the x and y radius, axis rotation, sweep flags and an endpoint. However, it is powerful.\nIn the next snippet we will use the arcSpec function to create the d attribute for our path. It returns a string from the spec defining the vertex (the center and radius) as well as the fraction of the circle we want represented on the arc. It also uses another helper function, circleCoord to determine where points around the circle are located.\nfunction circleCoord(frac, cx, cy, r) { var radians = 2 * Math.PI * (0.75 + frac); return { x: cx + r * Math.cos(radians), y: cy + r * Math.sin(radians), }; } function arcSpec(spec, fraction) { var radius = spec.r + ARC_WIDTH/2; var end = circleCoord(fraction, spec.cx, spec.cy, radius); var s = [\u0026#39;M\u0026#39;, spec.cx, \u0026#39;,\u0026#39;, spec.cy - radius]; if (fraction \u0026gt; 0.5) { s.push(\u0026#39;A\u0026#39;, radius, \u0026#39;,\u0026#39;, radius, \u0026#39;0 0,1\u0026#39;, spec.cx, spec.cy + radius); s.push(\u0026#39;M\u0026#39;, spec.cx, \u0026#39;,\u0026#39;, spec.cy + radius); } s.push(\u0026#39;A\u0026#39;, radius, \u0026#39;,\u0026#39;, radius, \u0026#39;0 0,1\u0026#39;, end.x, end.y); return s.join(\u0026#39; \u0026#39;); } Now that we have these two helper functions in place, we can finally define our elements:\nsvg.append( SVG(\u0026#39;g\u0026#39;) .attr(\u0026#39;id\u0026#39;, \u0026#39;vertex-1\u0026#39;) .attr(\u0026#39;class\u0026#39;, \u0026#39;vertex\u0026#39;) .append(SVG(\u0026#39;a\u0026#39;) .append(SVG(\u0026#39;circle\u0026#39;) 
.attr(\u0026#39;class\u0026#39;, \u0026#39;background\u0026#39;) .attr(vertexSpec)) .append(SVG(\u0026#39;path\u0026#39;) .attr(\u0026#39;class\u0026#39;, \u0026#39;timer-arc\u0026#39;) .attr(\u0026#39;style\u0026#39;, \u0026#39;stroke-width: \u0026#39; + ARC_WIDTH) .attr(\u0026#39;d\u0026#39;, arcSpec(vertexSpec, 1.0))) ) .append(SVG(\u0026#39;text\u0026#39;) .attr(\u0026#39;class\u0026#39;, \u0026#39;vlabel\u0026#39;) .text(\u0026#39;v1\u0026#39;) .attr({x: vertexSpec.cx, y: vertexSpec.cy})) ); This is simply a matter of appending various SVG elements together to create the group of shapes that together make up the vertex.\nNow to animate, I\u0026rsquo;ll simply recompute the path of the ARC for a smaller fraction of the vertex at each time step. To do this I\u0026rsquo;ll use a function that updates the path, then uses setTimeout to schedule the next update once it\u0026rsquo;s complete:\nfunction updateArcTimer(elems, spec, current) { var amt = current - 0.015; if (amt \u0026lt; 0) { amt = 1.0; } elems.attr(\u0026#39;d\u0026#39;, arcSpec(spec, amt)); setTimeout(function() { updateArcTimer(elems, spec, amt) }, 100); } Playing around with the delay between update (100 ms in this example) and the amount of the arc to reduce (0.015 in this example) changes how fast and smooth the timer is. However, making it too granular can cause weird jitters and artifacts to appear. Kick this function off right after creating the vertex as follows:\nupdateArcTimer($(\u0026#34;.timer-arc\u0026#34;), vertexSpec, 1.0); Future work for this project will be to implement this style vertex with D3, and the ability to set timers with a meaningful time measurement. I\u0026rsquo;d also like to look into other styles, for example the circle fill emptying out (like a sand timer) at the rate of the timer or the halo of the vertex flashing slowly or more quickly as it moves to the end of the timer. 
Importantly, these elements should also be able to be paused and hooked into other update mechanisms, such that sliders or other interactive functionality can be used. Finally, I\u0026rsquo;m not sure how edges will interact with the timer halo, but it is also important to consider.\n","permalink":"https://bbengfort.github.io/2016/11/svg-timer-vertex/","summary":"\u003cp\u003eIn order to promote the use of graph data structures for data analysis, I\u0026rsquo;ve recently given talks on \u003ca href=\"https://youtu.be/RgixxVpfXDY\"\u003edynamic graphs\u003c/a\u003e: embedding time into graph structures to analyze change. In order to embed time into a graph there are two primary mechanisms: make time a graph element (a vertex or an edge) or have multiple subgraphs where each graph represents a discrete time step. By using either of these techniques, opportunities exist to perform a structural analysis using graph algorithms on time; for example - asking what time is most central to a particular set of relationships.\u003c/p\u003e","title":"SVG Vertex with a Timer"},{"content":"Building distributed systems means passing messages between devices over a network connection. My research specifically considers networks that have extremely variable latencies or that can be partition prone. This led me to the natural question, “how variable are real world networks?” In order to get real numbers, I built a simple echo protocol using Go and gRPC called Orca.\nI ran Orca for a few days and got some latency measurements as I traveled around with my laptop. Orca does a lot of work, including GeoIP look ups, IP address resolution, and database queries and storage. This post, however, is not about Orca. The latencies I was getting were very high relative to the round-trip latencies reported by the simple ping command that implements the ICMP protocol.\nAt first, I attributed this difference to the database overhead, but it was still far too high. 
In order to measure the difference between ping and the echo protocol I implemented, I created a branch that strips everything except the communications protocol: a protocol buffers service implemented with gRPC. I believe there are two potential places that introduce the overhead, either in the gRPC communications protocol or in Go itself, for example the garbage collector.\nExperiment To see how much of a difference there is in the overhead of the Go implementation, I simultaneously ran both ping and orca from my house to a server at the University for an hour. I collected slightly under 3600 round-trip latencies (RTTs) for each (there were a few dropped packets). The result was that, on average, ping is approximately 16.384 ms faster than the gRPC protocol, and less variable by 4.933 ms! The variability might be explained by language-specific elements like garbage collection and threading, but the ease of use of protocol buffers comes at a cost!\nThe results of the two pings are as follows:\nThis above figure shows a box plot of the dataset with outliers trimmed using the z-score method and 2 passes. The ends of the bar represent the 5th and 95th percentile respectively, and data points outside the 95th percentile are plotted individually. The box goes from the first to the third quartile and the middle line is the median. As you can see from this plot, there is no overlap from the high percentile of the ping protocol to the lower percentile of the echo protocol. Moreover, the majority of the ping points are in a much smaller range than the majority of the echo protocol points.\nThis second image shows the violin plot - such that the curve represents the kernel density estimate (KDE) of the histogram of the data. It then similarly shows the median and the first and third quartiles inside of the violin. 
Both distributions are significantly right skewed, but the ping distribution has a much steeper curve than the more variable echo protocol.\nHere are the raw statistics for the small experiment:\nstatistic | ping | echo\ncount | 3,567 | 3,538\nmean | 13.037 ms | 29.431 ms\nstd | 1.877 ms | 2.908 ms\nmin | 10.616 ms | 23.806 ms\n25% | 12.169 ms | 27.366 ms\n50% | 12.747 ms | 28.989 ms\n75% | 13.422 ms | 31.016 ms\nmax | 42.806 ms | 49.039 ms\nSo what does this mean? Of course, I could do extensive experimentation, moving the laptop and getting different times of day for latency measurements. However, I honestly believe that the one hour test was enough to demonstrate how significant a gap there is between the ping implementation and a gRPC implementation of the communications. In normal systems there will always be some message processing overhead and database accesses; however, right off the bat you do incur a significant overhead.\nMethod To run the experiment to collect data for comparison (and as documentation in case I have to do this again), I did it as follows. First clone the orca repository:\n$ go get github.com/bbengfort/orca/... You\u0026rsquo;ll then have to cd into that directory, which is in your $GOPATH/src location. Check out the ping branch as follows:\n$ git fetch $ git checkout ping You should see a pretty significant change in the amount of code and the README should indicate you\u0026rsquo;re in the ping branch. Set up a server to listen for the ping requests:\n$ go run cmd/orca listen If you want, you can run it in silent mode with the -s flag to further reduce latency as much as possible. In silent mode, the command prints nothing to the console. 
Then run 3600 pings on a different machine as follows:\n$ go run cmd/orca -n 3600 ping 1.2.3.4:3265 Make sure you insert the correct IP address and port! As quickly as you can, also start the ping service:\n$ ping -c 3600 1.2.3.4 After about an hour, the dataset is sitting at your disposal ready to copy and paste into a text file. You can use the ping_vs_echo.ipynb Jupyter Notebook to perform the analysis. It includes regular expressions to parse each type of line output and to aggregate them into the visualizations you saw above.\nLocal Subnet There are many reasons that ping could be faster than gRPC, not just the overhead of serializing and deserializing protocol buffers and HTTP transport. For example, ICMP could be given special routing, ICMP is handled closer to the kernel level, or the fact that ICMP frames are much, much smaller. In order to test this I ran the test from two machines on the same subnet; the violin plot for the distribution is below:\nBoth ping and echo latencies are much smaller, by approximately the same amount. Because the gap between them is approximately the same percentage (though not fixed), I think this graph identifies clearly what is overhead and what is network latency. However, because the gap is also smaller, it shows that bandwidth and other message traffic may be having an influence in the disparity as well (e.g. that ping has preferential routes through wide area networks).\n","permalink":"https://bbengfort.github.io/2016/11/ping-vs-grpc/","summary":"\u003cp\u003eBuilding distributed systems means passing messages between devices over a network connection. My research specifically considers networks that have extremely variable latencies or that can be partition prone. 
This led me to the natural question, “how variable are real world networks?” In order to get real numbers, I built a simple echo protocol using Go and gRPC called \u003ca href=\"https://github.com/bbengfort/orca\"\u003eOrca\u003c/a\u003e.\u003c/p\u003e\n\u003cp\u003eI ran Orca for a few days and got some latency measurements as I traveled around with my laptop. Orca does a lot of work, including GeoIP look ups, IP address resolution, and database queries and storage. This post, however, is not about Orca. The latencies I was getting were very high relative to the round-trip latencies reported by the simple \u003ccode\u003eping\u003c/code\u003e command that implements the \u003ca href=\"https://en.wikipedia.org/wiki/Internet_Control_Message_Protocol\"\u003eICMP protocol\u003c/a\u003e.\u003c/p\u003e","title":"Message Latency: Ping vs. gRPC"},{"content":"Ashley and I have been going over the District Data Labs Blog trying to figure out a method to make it more accessible both to readers (who are at various levels) and to encourage writers to contribute. To that end, she\u0026rsquo;s been exploring other blogs to see if we can put multiple forms of content up; long form tutorials (the bulk of what\u0026rsquo;s there) and shorter idea articles, possibly even as short as the posts I put on my dev journal. One interesting suggestion she had was to mark the reading time of each post, something that the Longreads Blog does. This may help give readers a better sense of the time commitment and be able to engage more easily.\nSo computing the reading time is simple right? Take the number of words in the post divided by the average words per minute reading rate and bam - the number of minutes per post. Also, we\u0026rsquo;re not going to simply split on space, we know better - so we can use NLTK\u0026rsquo;s word_tokenize function. 
Seems like we\u0026rsquo;re good to go, but what\u0026rsquo;s the average words per minute reading rate of the average DDL reader?\nAfter a bit of a search, we first found a study published by Reading Plus that charted the normal reading rate in words per minute against high school grade level. Unfortunately, this led to the question, what level is our content at? Further searching found an LSAT reading speed calculation formula by Graeme Blake, moderator of the Reddit LSAT forum. We figured our content is probably as complex as the LSAT, and moreover, he gave speeds for slow, average, high average, fast, and rare LSAT students.\nWe ran each of these WPM speeds against published articles in the DDL corpus and came up with the following reading times for each title:\nPost LSAT Slow Average Fast Announcing the District Data Labs Blog 26 seconds 23 seconds 18 seconds 15 seconds How to Transition from Excel to R 12 minutes 11 minutes 9 minutes 7 minutes What Are the Odds? 12 minutes 10 minutes 8 minutes 7 minutes How to Develop Quality Python Code 28 minutes 25 minutes 20 minutes 17 minutes Markup for Fast Data Science Publication 16 minutes 14 minutes 11 minutes 9 minutes The Age of the Data Product 27 minutes 24 minutes 19 minutes 16 minutes A Practical Guide to Anonymizing Datasets with Python \u0026amp; Faker 19 minutes 17 minutes 14 minutes 11 minutes Computing a Bayesian Estimate of Star Rating Means 19 minutes 17 minutes 14 minutes 11 minutes Conditional Probability with R 12 minutes 11 minutes 9 minutes 7 minutes Creating a Hadoop Pseudo-Distributed Environment 13 minutes 12 minutes 10 minutes 8 minutes Getting Started with Spark (in Python) 32 minutes 29 minutes 23 minutes 19 minutes Graph Analytics Over Relational Datasets with Python 11 minutes 10 minutes 8 minutes 7 minutes An Introduction to Machine Learning with Python 18 minutes 16 minutes 13 minutes 11 minutes Modern Methods for Sentiment Analysis 12 minutes 11 minutes 9 minutes 7 minutes Parameter 
Tuning with Hyperopt 12 minutes 11 minutes 9 minutes 7 minutes Simple CSV Data Wrangling with Python 18 minutes 16 minutes 13 minutes 11 minutes Time Maps: Visualizing Discrete Events Across Many Timescales 10 minutes 9 minutes 7 minutes 6 minutes We\u0026rsquo;d be happy to have any feedback on whether these times look correct. The code to produce the table follows:\nOf course this is a straight count of words and does not take into account the number of sections or whether or not there are any code blocks. In the future, I hope to do an HTML version of this that takes into account the number of paragraphs, the density of each paragraph and the length of sentences, as well as the frequency of vocabulary words etc. I\u0026rsquo;ll need to gather feedback for a supervised learning algorithm though to train actual WPM on these features!\n","permalink":"https://bbengfort.github.io/2016/10/reading-speed/","summary":"\u003cp\u003eAshley and I have been going over the \u003ca href=\"http://blog.districtdatalabs.com/\"\u003eDistrict Data Labs Blog\u003c/a\u003e trying to figure out a method to make it more accessible both to readers (who are at various levels) and to encourage writers to contribute. To that end, she\u0026rsquo;s been exploring other blogs to see if we can put multiple forms of content up; long form tutorials (the bulk of what\u0026rsquo;s there) and shorter idea articles, possibly even as short as the posts I put on my dev journal. One interesting suggestion she had was to mark the reading time of each post, something that \u003ca href=\"http://bit.ly/2ePtm3z\"\u003ethe Longreads Blog\u003c/a\u003e does. This may help give readers a better sense of the time commitment and help them engage more easily.\u003c/p\u003e","title":"Computing Reading Speed"},{"content":"I gave this talk twice, both at PyData DC on October 24, 2016 and at PyData Carolinas on September 15, 2016. 
Both videos are below if you feel like figuring out which presentation was better!\nPyData DC PyData Carolinas Slides Description Network analyses are powerful methods for both visual analytics and machine learning but can suffer as their complexity increases. By embedding time as a structural element rather than a property, we will explore how time series and interactive analysis can be improved on Graph structures. Primarily we will look at decomposition in NLP-extracted concept graphs using NetworkX and Graph Tool.\nModeling data as networks of relationships between entities can be a powerful method for both visual analytics and machine learning; people are very good at distinguishing patterns from interconnected structures, and machine learning methods get a performance improvement when applied to graph data structures. However, as these structures become more complex or embed more information over time, both visual and algorithmic methods get messy; visual analyses suffer from the \u0026ldquo;hairball\u0026rdquo; effect, and graph algorithms require either more traversal or increased computation at each vertex. A growing area to reduce this complexity and optimize analytics is the use of interactive and subgraph techniques that model how graph structures change over time.\nIn this talk, I demonstrate two practical techniques for embedding time into graphs, not as computational properties, but rather as structural elements. The first technique is to add time as a node to the graph, which allows the graph to remain static and complete, but minimizes traversals and allows filtering. The second is to represent a single graph as multiple subgraphs where each is a snapshot at a particular time. 
This allows us to use time series analytics on our graphs, but perhaps more importantly, to use animation or interactive methodologies to visually explore those changes and provide meaningful dynamics.\n","permalink":"https://bbengfort.github.io/2016/10/dynamics-in-graph-analysis-adding-time-as-a-structure-for-visual-and-statistical-insight/","summary":"\u003cp\u003eI gave this talk twice, both at \u003ca href=\"http://pydata.org/dc2016/schedule/presentation/36/\"\u003ePyData DC\u003c/a\u003e on October 24, 2016 and at \u003ca href=\"http://pydata.org/carolinas2016/schedule/presentation/39/\"\u003ePyData Carolinas\u003c/a\u003e on September 15, 2016. Both videos are below if you feel like figuring out which presentation was better!\u003c/p\u003e\n\u003ch3 id=\"pydata-dc\"\u003ePyData DC\u003c/h3\u003e\n\n\n    \n    \u003cdiv style=\"position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;\"\u003e\n      \u003ciframe allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" allowfullscreen=\"allowfullscreen\" loading=\"eager\" referrerpolicy=\"strict-origin-when-cross-origin\" src=\"https://www.youtube.com/embed/QhMZ1PmlJn4?autoplay=0\u0026controls=1\u0026end=0\u0026loop=0\u0026mute=0\u0026start=0\" style=\"position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;\" title=\"YouTube video\"\n      \u003e\u003c/iframe\u003e\n    \u003c/div\u003e\n\n\u003ch3 id=\"pydata-carolinas\"\u003ePyData Carolinas\u003c/h3\u003e\n\n\n    \n    \u003cdiv style=\"position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;\"\u003e\n      \u003ciframe allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" allowfullscreen=\"allowfullscreen\" loading=\"eager\" referrerpolicy=\"strict-origin-when-cross-origin\" src=\"https://www.youtube.com/embed/RgixxVpfXDY?autoplay=0\u0026controls=1\u0026end=0\u0026loop=0\u0026mute=0\u0026start=0\" 
style=\"position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;\" title=\"YouTube video\"\n      \u003e\u003c/iframe\u003e\n    \u003c/div\u003e\n\n\u003ch3 id=\"slides\"\u003eSlides\u003c/h3\u003e\n\u003ciframe\n  style=\"width: 100%; height: 500px;\" frameborder=\"0\" marginwidth=\"0\" marginheight=\"0\" scrolling=\"no\"\n  src=\"https://www.slideshare.net/slideshow/embed_code/66065281?rel=0\" allowfullscreen webkitallowfullscreen mozallowfullscreen\u003e \u003c/iframe\u003e\n\u003cbr\u003e\u003cbr\u003e\n\u003ch3 id=\"description\"\u003eDescription\u003c/h3\u003e\n\u003cp\u003eNetwork analyses are powerful methods for both visual analytics and machine learning but can suffer as their complexity increases. By embedding time as a structural element rather than a property, we will explore how time series and interactive analysis can be improved on Graph structures. Primarily we will look at decomposition in NLP-extracted concept graphs using NetworkX and Graph Tool.\u003c/p\u003e","title":"Dynamics in Graph Analysis Adding Time as a Structure for Visual and Statistical Insight"},{"content":"When making slides, I generally like to use Flickr to search for images that are licensed via Creative Commons to use as backgrounds. My slide deck tools of choice are either Reveal.js or Google Slides. Both tools allow you to specify an image as a background for the slide, but for Google Slides in particular, if the aspect ratio of the image doesn\u0026rsquo;t match the aspect ratio of the slide deck, then weird things can happen.\nFor a long time, I\u0026rsquo;d been manually cropping images in Preview. I would decide which was the long dimension based on the aspect ratio I was using (e.g. 16x9) and trim the image along that dimension to the desired ratio. I could then upload to the background with no weird scaling or centering occurring. Making my slides for PyData, however, I realized this was inefficient - and anyway I was running out of time! 
Therefore I decided to use the Python Imaging Library (PIL) to do this automatically for me on the command line:\nThis script allows you to pass an aspect ratio as a WxH string, where W is the width integer and H is the height integer. The default is 16x9 as per my slides at the time. It then opens the Image using PIL, computes the image width and height, then determines the width and height difference from the ratio. If the difference is bigger, that amount is cropped off evenly from both sides. The new image is then saved as a copy so the original isn\u0026rsquo;t destroyed.\nHere is an example image of Shanghai city by barnyz, used under a CC BY-NC-ND 2.0 creative commons license:\nIn order to crop this image to the 16x9 aspect ratio we run the script as follows:\n$ python img2aspect.py -a 16x9 city.jpg This saves a file called city-16x9.jpg in the same directory as city.jpg. As you can see the resulting image is cropped evenly from the top and the bottom to bring the height dimension into alignment with the aspect ratio:\nNote that this result is because an analysis of the dimensions of the original with respect to the aspect ratio showed that the height was out of alignment with the desired aspect ratio, not the width. If we had passed in a very long picture, then the width would have been cropped.\n","permalink":"https://bbengfort.github.io/2016/09/image-aspect-ratio/","summary":"\u003cp\u003eWhen making slides, I generally like to use \u003ca href=\"https://www.flickr.com/\"\u003eFlickr\u003c/a\u003e to search for images that are licensed via \u003ca href=\"https://creativecommons.org/\"\u003eCreative Commons\u003c/a\u003e to use as backgrounds. My slide deck tools of choice are either \u003ca href=\"http://lab.hakim.se/reveal-js/#/\"\u003eReveal.js\u003c/a\u003e or \u003ca href=\"https://www.google.com/slides/about/\"\u003eGoogle Slides\u003c/a\u003e. 
Both tools allow you to specify an image as a background for the slide, but for Google Slides in particular, if the aspect ratio of the image doesn\u0026rsquo;t match the aspect ratio of the slide deck, then weird things can happen.\u003c/p\u003e","title":"Modifying an Image's Aspect Ratio"},{"content":"This is mostly a post of annoyance. I\u0026rsquo;ve been working with graphs in Python via NetworkX and trying to serialize them to GraphML for use in Gephi and graph-tool. Unfortunately the following error is really starting to get on my nerves:\nnetworkx.exception.NetworkXError: GraphML writer does not support \u0026lt;class \u0026#39;datetime.datetime\u0026#39;\u0026gt; as data values. Also it doesn\u0026rsquo;t support \u0026lt;type NoneType\u0026gt; or list or dict or \u0026hellip;\nSo I have to do something about it:\nThis is my first attempt, I\u0026rsquo;m simply going through all nodes and edges and directly updating/serializing their data values (note that Graph properties are missing). This pretty much makes the graph worthless after writing to disk. It also means that you have to do the deserialization after reading in the GraphML. There has to be a better way.\n","permalink":"https://bbengfort.github.io/2016/09/serialize-graphml/","summary":"\u003cp\u003eThis is mostly a post of annoyance. I\u0026rsquo;ve been working with graphs in Python via NetworkX and trying to serialize them to GraphML for use in Gephi and graph-tool. 
Unfortunately the following error is really starting to get on my nerves:\u003c/p\u003e\n\u003cpre tabindex=\"0\"\u003e\u003ccode\u003enetworkx.exception.NetworkXError: GraphML writer does not support \u0026lt;class \u0026#39;datetime.datetime\u0026#39;\u0026gt; as data values.\n\u003c/code\u003e\u003c/pre\u003e\u003cp\u003eAlso it doesn\u0026rsquo;t support \u003ccode\u003e\u0026lt;type NoneType\u0026gt;\u003c/code\u003e or \u003ccode\u003elist\u003c/code\u003e or \u003ccode\u003edict\u003c/code\u003e or \u0026hellip;\u003c/p\u003e\n\u003cp\u003eSo I have to do something about it:\u003c/p\u003e","title":"Serializing GraphML"},{"content":"I was recently asked about the parallelization of both the enqueuing of tasks and their processing. This is a tricky subject because there are a lot of factors that come into play. For example, do you have two parallel phases, e.g. a map and a reduce phase that need to be synchronized, or is there some sort of data parallelism that requires multiple tasks to be applied to the data (e.g. Storm-style topology)? While there are a lot of tools for parallel processing in batch for large data sets, how do you take care of simple problems with large datasets (say hundreds of gigabytes) on a single machine with a quad core or hyperthreading multiprocessor?\nFor quick Python scripts, you have to use the multiprocessing module in order to get parallelism. Nowadays, multiprocessing has a very nice interface with the Pool and map_async or apply_async functions. However, consider the following situation:\nYou have several CSV files that you want to process on a row-by-row basis. For each row, you have to do an independent computation that is CPU bound. You want to reduce the results of the per-row computations sequentially. For example, consider the construction of a bloom filter from a list of multiple CSV files; you\u0026rsquo;ll have to do parsing, hashing, filtering, aggregation, etc. 
on each row, then build the bloom filter from the bottom up. To do this, we\u0026rsquo;ll use two parallel stages:\nMultiple processes reading multiple CSV files, parsing each row and enqueuing it. Multiple processes reading the queue of parsed rows and doing computation, then pushing the results to a done queue. I\u0026rsquo;ve had to reuse a bit of code from a few places, and this is untested, but I think it demonstrates what is happening:\nThe enqueue function takes a path to a CSV file as well as a synchronized queue (that uses locks to ensure only one process has access to the queue at a time). It reads each row from the CSV file, parses it, and puts it onto the queue. This type of work is similar to the map phase of MapReduce.\nThe worker function sits and watches an input queue, and attempts to get values off the queue with a timeout of 10 seconds. If the timeout expires or it sees the string 'STOP' then it will break (exiting the forever watching loop) and return. Thus if a row gets added to the input queue within 10 seconds of the last time it fetched a row, the worker will continue working. It then does some computations (e.g. the function could save state and do a reduction, building a partial bloom filter, or other CPU/IO sensitive work). It then puts the results of its computation on the results queue.\nThe parallelize function is the primary process and coordinates both the enqueuing and the workers. It first sets up the two queues, the tasks (parsed rows) and results. It then creates a pool for the enqueue processes and uses map_async which will call the callback once all processes are complete. At that point, we simply put the 'STOP' sentinel into the queue so that the workers know there are no more rows. We then create each worker, not using a pool, but just creating direct processes to watch the input queue and do other work. 
We then join on all these processes to wait until they\u0026rsquo;ve terminated.\nFor simple tasks this workflow can get you a lot of raw performance for free, though if this is a more routine type of workflow, you may want to consider a language with concurrency built in \u0026ndash; like Go.\n","permalink":"https://bbengfort.github.io/2016/09/parallel-enqueue-and-work/","summary":"\u003cp\u003eI was recently asked about the parallelization of both the enqueuing of tasks and their processing. This is a tricky subject because there are a lot of factors that come into play. For example, do you have two parallel phases, e.g. a map and a reduce phase that need to be synchronized, or is there some sort of data parallelism that requires multiple tasks to be applied to the data (e.g. Storm-style topology)? While there are a lot of tools for parallel processing in batch for large data sets, how do you take care of simple problems with large datasets (say hundreds of gigabytes) on a single machine with a quad core or hyperthreading multiprocessor?\u003c/p\u003e","title":"Parallel Enqueue and Workers"},{"content":"A common source of natural language corpora comes from the web, usually in the form of HTML documents. However, in order to actually build models on the natural language, the structured HTML needs to be transformed into units of discourse that can then be used for learning. In particular, we need to strip away extraneous material such as navigation or advertisements, targeting exactly the content we\u0026rsquo;re looking for. Once done, we need to split paragraphs into sentences, sentences into tokens, and assign part-of-speech tags to each token. The preprocessing therefore transforms HTML documents to a list of paragraphs, which are themselves a list of sentences, which are lists of token, tag tuples.\nUnfortunately this preprocessing can take a lot of time, particularly for larger corpora. 
It is therefore efficient to preprocess HTML into these data structures and store them as pickled Python objects, serialized to disk. In order to get the most bang for our buck - we can use multiprocessing to parallelize the preprocessing on each document, increasing the speed of processing due to data parallelism.\nIn this post, we\u0026rsquo;ll focus on the parallelization aspects, rather than on the preprocessing aspects (you\u0026rsquo;ll have to buy our book for that). In the following code snippet we will look at parallel preprocessing html files in a single directory to pickle files in another directory using the builtin multiprocessing library, nltk and beautifulsoup for the actual work, and tqdm to track our progress.\nThe preprocess function takes an input path as well as a directory to write the output to. After reading in the HTML data and creating a parsed Soup object using lxml, we then extract all \u0026lt;p\u0026gt; tags as the paragraphs, apply the nltk.sent_tokenize function to each paragraph, then tokenize and tag each sentence. The final data structure is a list of lists of token, tag tuples \u0026ndash; perfect for downstream NLP preprocessing! We then extract the base name of the input path and separate the .html extension, adding .pickle and creating our output path. From there we can simply open the output file for writing bytes and dump our pickled object to it.\nWe take advantage of data parallelism (applying the preprocess function to each html file) in the parallelize function, which takes an input directory and an output directory, as well as the number of tasks to run, which defaults to the number of cores on the machine. The user interface will be a progress bar that displays how many bytes of HTML data have been preprocessed (an alternative is the number of documents processed). First, we list the input directory to get all the paths, then figure out the total number of input bytes using the operating system stat via os.path.getsize. 
We can then instantiate a progress bar with the total and units, and create a callback function that updates the progress bar from the result of the preprocess function.\nHere is where we get into the parallelism - we create a pool of processes that are ready for work, then use apply_async to queue the work (input paths) to the processes. Each process will pop off an input path, perform the preprocessing, then return the file size of the input path it just processed. It will continue to do so as long as there is work. We have to use apply_async instead of map_async in order to ensure that on_result is called after each process completes (thereby updating the progress bar) otherwise the callback wouldn\u0026rsquo;t be called until all work is done.\n$ python3 parallel.py 100%|██████████████████| 120809/120809 [00:15\u0026lt;00:00, 7237.22Bytes/s] Running this function, you should see a linear speedup in the amount of preprocessing time as the number of processes are increased!\n","permalink":"https://bbengfort.github.io/2016/08/parallel-nlp-preprocessing/","summary":"\u003cp\u003eA common source of natural language corpora comes from the web, usually in the form of HTML documents. However, in order to actually build models on the natural language, the structured HTML needs to be transformed into units of discourse that can then be used for learning. In particular, we need to strip away extraneous material such as navigation or advertisements, targeting exactly the content we\u0026rsquo;re looking for. Once done, we need to split paragraphs into sentences, sentences into tokens, and assign part-of-speech tags to each token. 
The preprocessing therefore transforms HTML documents to a list of paragraphs, which are themselves a list of sentences, which are lists of token, tag tuples.\u003c/p\u003e","title":"Parallel NLP Preprocessing"},{"content":"It feels like there are many questions like this one on Stack Overflow: Representing Directory \u0026amp; File Structure in Markdown Syntax, basically asking \u0026ldquo;how can we represent a directory structure in text in a pleasant way?\u0026rdquo; I too use these types of text representations in slides, blog posts, books, etc. It would be very helpful if I had an automatic way of doing this so I didn\u0026rsquo;t have to create it from scratch.\nOf course, many tools do this, and it should be pretty easy to write a Python script to do this exactly the way you want to. So, I did. The script in Gist form is below. My suggestion is to download this file, stick it into your path, and change its permissions to be executable. You can do this as follows:\n$ mkdir ~/bin/ $ curl -o ~/bin/pdir http://bit.ly/2aDnXtj $ chmod +x ~/bin/pdir This assumes that your ~/bin directory is on your path (which it usually is on Unix systems). 
Then usage is as simple as:\n$ pdir path/to/dir/to/print For example, the website for this blog looks like this:\nbbengfort.github.io/_site/ ├── 404.html ├── Gemfile ├── Gemfile.lock ├── LICENSE ├── README.md ├── about.html ├── archive.html └── assets | ├── 2016-06-23-graph-tool-viz.png | ├── apple-touch-icon-precomposed.png | └── css | | ├── hyde.css | | ├── libelli.css | | ├── poole.css | | └── syntax.css | └── data | | ├── pi-10k.txt | | └── timestepping.csv | ├── favicon.ico | ├── icon.png | └── images | | ├── 2016-01-28-timeline.svg | | ├── 2016-02-16-observer.png | | ├── 2016-03-04-pi-grid.png | | ├── 2016-03-14-matplotlib-segfault.png | | ├── 2016-03-14-pi-grid.png | | ├── 2016-04-15-interact-plot.png | | ├── 2016-04-15-timestepping.png | | ├── 2016-04-19-ml-data-management-workflow.png | | ├── 2016-04-26-cloudscope-consistency-visualization.png | | ├── 2016-04-26-epaxos-message-flow.png | | ├── 2016-04-26-raft-message-flow.png | | ├── 2016-04-26-raftscope-replay-visualization.png | | ├── 2016-04-26-secret-lives-of-data-raft-visualization.png | | ├── 2016-05-10-mora-architecture.png | | ├── 2016-05-19-nltk-sklearn-text-pipeline.png | | ├── 2016-06-23-graph-tool-viz.png | | ├── 2016-06-27-big-sigma-curve.png | | └── 2016-06-27-small-sigma-curve.png ├── feed.xml ├── index.html └── observations | └── 2016 | | └── 04 | | | └── 15 | | | | └── lessons-in-discrete-event-simulation.html | | | └── 26 | | | | └── visualizing-distributed-systems.html ... [snip] ... 
Ok, that\u0026rsquo;s a lot, so it was snipped for brevity, but you get the picture: here\u0026rsquo;s the code:\nEnjoy this code snippet brought to you by procrastination and coffee; also me skipping lunch.\n","permalink":"https://bbengfort.github.io/2016/08/pretty-print-directories/","summary":"\u003cp\u003eIt feels like there are many questions like this one on Stack Overflow: \u003ca href=\"http://stackoverflow.com/questions/19699059/representing-directory-file-structure-in-markdown-syntax\"\u003eRepresenting Directory \u0026amp; File Structure in Markdown Syntax\u003c/a\u003e, basically asking \u0026ldquo;how can we represent a directory structure in text in a pleasant way?\u0026rdquo; I too use these types of text representations in slides, blog posts, books, etc. It would be very helpful if I had an automatic way of doing this so I didn\u0026rsquo;t have to create it from scratch.\u003c/p\u003e","title":"Pretty Print Directories"},{"content":" Description We talk to Benjamin Bengfort about his Data Day Seattle talks, District Data Labs, and Ben\u0026rsquo;s popular O\u0026rsquo;Reilly books.\n","permalink":"https://bbengfort.github.io/2016/07/interview-ben-bengfort-of-district-data-labs/","summary":"\u003cdiv style=\"position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;\"\u003e\n      \u003ciframe allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" allowfullscreen=\"allowfullscreen\" loading=\"eager\" referrerpolicy=\"strict-origin-when-cross-origin\" src=\"https://www.youtube.com/embed/ZiY5tjgg7lU?autoplay=0\u0026controls=1\u0026end=0\u0026loop=0\u0026mute=0\u0026start=0\" style=\"position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;\" title=\"YouTube video\"\n      \u003e\u003c/iframe\u003e\n    \u003c/div\u003e\n\n\u003ch3 id=\"description\"\u003eDescription\u003c/h3\u003e\n\u003cp\u003eWe talk to Benjamin Bengfort about his Data Day Seattle talks, District 
Data Labs, and Ben\u0026rsquo;s popular O\u0026rsquo;Reilly books.\u003c/p\u003e","title":"Interview - Ben Bengfort of District Data Labs"},{"content":" Description Machine learning is the hacker art of describing the features of instances that we want to make predictions about, then fitting the data that describes those instances to a model form. Applied machine learning has come a long way from its beginnings in academia, and with tools like Scikit-Learn, it\u0026rsquo;s easier than ever to generate operational models for a wide variety of applications. Thanks to the ease and variety of the tools in Scikit-Learn, the primary job of the data scientist is model selection. Model selection involves performing feature engineering, hyperparameter tuning, and algorithm selection. These dimensions of machine learning often lead computer scientists towards automatic model selection via optimization (maximization) of a model\u0026rsquo;s evaluation metric. However, the search space is large, and grid search approaches to machine learning can easily lead to failure and frustration. Human intuition is still essential to machine learning, and visual analysis in concert with automatic methods can allow data scientists to steer model selection towards better fitted models, faster. 
In this talk, we will discuss interactive visual methods for better understanding, steering, and tuning machine learning models.\n","permalink":"https://bbengfort.github.io/2016/07/visualizing-the-model-selection-process/","summary":"\u003ciframe\n  style=\"width: 100%; height: 500px;\" frameborder=\"0\" marginwidth=\"0\" marginheight=\"0\" scrolling=\"no\"\n  src=\"https://www.slideshare.net/slideshow/embed_code/64311820?rel=0\" allowfullscreen webkitallowfullscreen mozallowfullscreen\u003e \u003c/iframe\u003e\n\u003cbr\u003e\u003cbr\u003e\n\u003ch3 id=\"description\"\u003eDescription\u003c/h3\u003e\n\u003cp\u003eMachine learning is the hacker art of describing the features of instances that we want to make predictions about, then fitting the data that describes those instances to a model form. Applied machine learning has come a long way from its beginnings in academia, and with tools like Scikit-Learn, it\u0026rsquo;s easier than ever to generate operational models for a wide variety of applications. Thanks to the ease and variety of the tools in Scikit-Learn, the primary job of the data scientist is \u003cem\u003emodel selection\u003c/em\u003e. Model selection involves performing feature engineering, hyperparameter tuning, and algorithm selection. These dimensions of machine learning often lead computer scientists towards automatic model selection via optimization (maximization) of a model\u0026rsquo;s evaluation metric. However, the search space is large, and grid search approaches to machine learning can easily lead to failure and frustration. Human intuition is still essential to machine learning, and visual analysis in concert with automatic methods can allow data scientists to steer model selection towards better fitted models, faster. 
In this talk, we will discuss interactive visual methods for better understanding, steering, and tuning machine learning models.\u003c/p\u003e","title":"Visualizing the Model Selection Process"},{"content":" Description Data products derive their value from data and generate new data in return; as a result, machine learning techniques must be applied to their architecture and their development. Machine learning fits models to make predictions on unknown inputs and must be generalizable and adaptable. As such, fitted models cannot exist in isolation; they must be operationalized and user facing so that applications can benefit from the new data, respond to it, and feed it back in to the data product. Data product architectures are therefore life cycles and understanding the data product life cycle will enable architects to develop robust, failure free workflows and applications. In this talk we will discuss the data product life cycle, explore how to engage a model build, evaluation, and selection phase with an operation and interaction phase. Following the lambda architecture, we will investigate wrapping a central computational store for speed and querying, as well as incorporating a discussion of monitoring, management, and data exploration for hypothesis driven development. 
From web applications to big data appliances; this architecture serves as a blueprint for handling data services of all sizes!\n","permalink":"https://bbengfort.github.io/2016/07/data-product-architectures-seattle-data-day/","summary":"\u003ciframe\n  style=\"width: 100%; height: 500px;\" frameborder=\"0\" marginwidth=\"0\" marginheight=\"0\" scrolling=\"no\"\n  src=\"https://www.slideshare.net/slideshow/embed_code/64265501?rel=0\" allowfullscreen webkitallowfullscreen mozallowfullscreen\u003e \u003c/iframe\u003e\n\u003cbr\u003e\u003cbr\u003e\n\u003ch3 id=\"description\"\u003eDescription\u003c/h3\u003e\n\u003cp\u003eData products derive their value from data and generate new data in return; as a result, machine learning techniques must be applied to their architecture and their development. Machine learning fits models to make predictions on unknown inputs and must be \u003cem\u003egeneralizable\u003c/em\u003e and \u003cem\u003eadaptable\u003c/em\u003e. As such, fitted models cannot exist in isolation; they must be operationalized and user facing so that applications can benefit from the new data, respond to it, and feed it back in to the data product. Data product architectures are therefore \u003cem\u003elife cycles\u003c/em\u003e and understanding the data product life cycle will enable architects to develop robust, failure free workflows and applications. In this talk we will discuss the data product life cycle, explore how to engage a model build, evaluation, and selection phase with an operation and interaction phase. Following the lambda architecture, we will investigate wrapping a central computational store for speed and querying, as well as incorporating a discussion of monitoring, management, and data exploration for hypothesis driven development. 
From web applications to big data appliances; this architecture serves as a blueprint for handling data services of all sizes!\u003c/p\u003e","title":"Data Product Architectures: Seattle Data Day"},{"content":"Many of us are spoiled by the use of matplotlib\u0026rsquo;s colormaps which allow you to specify a string or object name of a color map (e.g. Blues) then simply pass in a range of nearly continuous values which are spread along the color map. However, using these color maps for categorical or discrete values (like the colors of nodes) can pose challenges as the colors may not be distinct enough for the representation you\u0026rsquo;re looking for.\nColor is a very interesting topic, and I\u0026rsquo;m very partial to the work of Cynthia Brewer who suggests color scales for maps in particular that are understandable even by those who are color blind. D3 makes very heavy use of the Brewer palettes and provides Every ColorBrewer Scale as a JSON file. Unfortunately Python doesn\u0026rsquo;t natively have ColorBrewer in any of its projects (with the notable exception of seaborn).\nXKCD also did a color survey and Nathan Yao also has a lot to say about color on Flowing Data. Color is hugely important, and we shouldn\u0026rsquo;t just be forced to stick with the status quo. We should make things with awesome colors. I like to use tools like Paletton and Color Lovers to get unique palettes for my projects.\nSo you need something like the colormap in order to actually use these things. Therefore, for discrete values, I give you the ColorMap:\nSo how do you use this tool? First you instantiate the object with either a list of colors or one of the names I provided in the script (and expand your script with your own names!). You then have a callable object that you can get the color for any hashable object. 
The ColorMap retains the color information and raises an exception if you ask for more colors than you have in the map.\n\u0026gt;\u0026gt;\u0026gt; cmap = Colors(\u0026#39;flatui\u0026#39;) \u0026gt;\u0026gt;\u0026gt; cmap(\u0026#39;A\u0026#39;) #9b59b6 \u0026gt;\u0026gt;\u0026gt; cmap(\u0026#39;B\u0026#39;) #3498db \u0026gt;\u0026gt;\u0026gt; cmap(\u0026#39;A\u0026#39;) #9b59b6 You can then use this tool in Graph Tool graphs or other utilities. For example, here is some standard graph tool code that I use to visualize graphs:\nimport graph_tool.all as gt # Draw the vertices with labels using their name property # and their size according to their degree. vlabel = G.vp[\u0026#39;name\u0026#39;] vsize = G.degree_property_map(\u0026#34;in\u0026#34;) vsize = gt.prop_to_size(vsize, ma=60, mi=20) # Set the vertex color using the color map and the flatui scheme. vcolor = G.new_vertex_property(\u0026#39;string\u0026#39;) vcmap = ColorMap(\u0026#39;flatui\u0026#39;, shuffle=False) # Add the color from the \u0026#39;type\u0026#39; property of the vertex. for vertex in G.vertices(): vcolor[vertex] = vcmap(G.vp[\u0026#39;type\u0026#39;][vertex]) # Set the edge color using the set1 colorbrewer scale ecolor = G.new_edge_property(\u0026#39;string\u0026#39;) ecmap = ColorMap(\u0026#39;set1\u0026#39;, shuffle=False) # Add the color from the \u0026#39;label\u0026#39; property of the edge. for edge in G.edges(): ecolor[edge] = ecmap(G.ep[\u0026#39;label\u0026#39;][edge]) # Label the edge and size it according to the norm and weight elabel = G.ep[\u0026#39;label\u0026#39;] esize = G.ep[\u0026#39;norm\u0026#39;] esize = gt.prop_to_size(esize, mi=.1, ma=3) eweight = G.ep[\u0026#39;weight\u0026#39;] # Draw the graph! 
gt.graph_draw( G, output_size=(1200,1200), output=os.path.join(FIGURES, name), vertex_text=vlabel, vertex_size=vsize, vertex_font_weight=1, vertex_pen_width=1.3, vertex_fill_color=vcolor, edge_pen_width=esize, edge_color=ecolor, edge_text=elabel ) And there you have it, put colors everywhere.\n","permalink":"https://bbengfort.github.io/2016/07/color-mapper/","summary":"\u003cp\u003eMany of us are spoiled by the use of matplotlib\u0026rsquo;s \u003ca href=\"http://matplotlib.org/examples/color/colormaps_reference.html\"\u003ecolormaps\u003c/a\u003e which allow you to specify a string or object name of a color map (e.g. \u003ccode\u003eBlues\u003c/code\u003e) then simply pass in a range of nearly continuous values which are spread along the color map. However, using these color maps for categorical or discrete values (like the colors of nodes) can pose challenges as the colors may not be distinct enough for the representation you\u0026rsquo;re looking for.\u003c/p\u003e","title":"Color Map Utility"},{"content":"Normal distributions are the backbone of random number generation for simulation. By selecting a mean (μ) and standard deviation (σ) you can generate simulated data representative of the types of models you\u0026rsquo;re trying to build (and certainly better than simple uniform random number generators). However, you might already be able to tell that selecting μ and σ is a little backward! Typically these metrics are computed from data, not used to describe data. As a result, utilities for tuning the behavior of your random number generators are simply not discussed.\nI wanted to be able to quickly and easily predict what would happen as I varied μ and σ in my distribution. These are easy to think about: if you picture the normal curve, then μ describes where the middle of the curve is. If you want to have data centered around 100, then you would choose a μ of 100. 
Standard deviation describes how spread out the data is, or how tall or flat the normal curve is above the mean. A standard deviation of zero would be a spike right at the mean, whereas a very high standard deviation will be extremely flat with a wide range.\nRemembering that ±3σ from the mean captures nearly all (about 99.7%) of the data from the random generator, I set about creating a visual way to inspect the properties and behaviors of the normal generators I was creating. In particular, my goal is to visually inspect the range of the data, as well as the density of results. This helps debug issues in simulations. Therefore, I give you a normal distribution simulation:\nBy running this simple Python script:\n$ python norm.py 12.0 2.0 You end up with visuals as follows:\nShifting the mean and increasing the standard deviation gives you the following:\n$ python norm.py 14 12.4 Which, as you can see, definitely changes the scale of the domain of the random number generator!\nIt may be hard to see - but check out the domains of both axes to get a feel for the magnitude of that change! Now you have a simple and effective way to reason about how μ and σ might change the way that random numbers are selected!\n","permalink":"https://bbengfort.github.io/2016/06/normal-distribution-viz/","summary":"\u003cp\u003eNormal distributions are the backbone of random number generation for simulation. By selecting a mean (μ) and standard deviation (σ) you can generate simulated data representative of the types of models you\u0026rsquo;re trying to build (and certainly better than simple uniform random number generators). However, you might already be able to tell that selecting μ and σ is a little backward! Typically these metrics are computed from data, not used to describe data. 
As a result, utilities for tuning the behavior of your random number generators are simply not discussed.\u003c/p\u003e","title":"Visualizing Normal Distributions"},{"content":"As I\u0026rsquo;m moving deeper into my PhD, I\u0026rsquo;m getting into more Go programming for the systems that I\u0026rsquo;m building. One thing that I\u0026rsquo;m constantly doing is trying to create a background process that runs forever, and does some work at an interval. Concurrency in Go is native and therefore the use of threads and parallel processing is very simple, syntax-wise. However I am still solving problems that I wanted to make sure I recorded here.\nToday\u0026rsquo;s problem involves getting a goroutine to execute a function on an interval, say every 5 seconds or something like that. The foreground process will presumably be working until finished, and we want to make sure it can gracefully shut down the background process without a delay. In order to communicate between goroutines in Go, you use a channel. I\u0026rsquo;ve put together the work from Timer Routines And Graceful Shutdowns In Go into a single snippet to remind myself how to do this:\nThe end result is a program that looks like this:\n$ go run main.go 2016/06/25 21:27:51 Main started 2016/06/25 21:27:51 Worker Started 2016/06/25 21:27:58 Action complete! 2016/06/25 21:28:03 Action complete! 2016/06/25 21:28:11 Main out! As you can see it\u0026rsquo;s seven seconds between \u0026ldquo;Worker Started\u0026rdquo; and the first \u0026ldquo;Action complete!\u0026rdquo; (5 second delay plus 2 seconds work). The second \u0026ldquo;Action complete!\u0026rdquo; is only 5 seconds later, however, because the worker waits just 3 seconds to make up for the work time from the previous interval. 
Shutdown is called, and the program gracefully shuts down with no more actions.\n","permalink":"https://bbengfort.github.io/2016/06/background-work-goroutines-timer/","summary":"\u003cp\u003eAs I\u0026rsquo;m moving deeper into my PhD, I\u0026rsquo;m getting into more Go programming for the systems that I\u0026rsquo;m building. One thing that I\u0026rsquo;m constantly doing is trying to create a background process that runs forever, and does some work at an interval. Concurrency in Go is native and therefore the use of threads and parallel processing is very simple, syntax-wise. However I am still solving problems that I wanted to make sure I recorded here.\u003c/p\u003e","title":"Background Work with Goroutines on a Timer"},{"content":" This week I discovered graph-tool, a Python library for network analysis and visualization that is implemented in C++ with Boost. As a result, it can quickly and efficiently perform manipulations, statistical analyses of Graphs, and draw them in a visually pleasing style. It\u0026rsquo;s like using Python with the performance of C++, and I was rightly excited:\nIt\u0026#39;s a bear to get setup, but once you do things get pretty nice. Moving my network viz over to it now!\n\u0026mdash; Benjamin Bengfort (@bbengfort) June 24, 2016 The visualization piece also excited me; as I tweeted, graph-tool sits between matplotlib and Gephi. It does a better job than matplotlib at the visualization, including things like edge curvature and directionality markers that are very difficult to do in native matplotlib. The graphs are comparable to Gephi renderings, though it is probably a lot easier to produce them in Gephi than by coding in graph-tool.\nBecause graph-tool is a C++ implementation, you can\u0026rsquo;t simply pip install it (giant bummer). I used Homebrew to install it, and got it working outside of a virtual environment. 
I am not sure how to add it to a virtualenv or to the requirements of a project yet, but I imagine I will simply lazy load the module and raise an exception if a graph-tool function is called (or gracefully fall back to NetworkX).\nOk, enough loving on graph-tool. Let\u0026rsquo;s do what we came here to do today, and that\u0026rsquo;s convert a NetworkX graph into a graph-tool graph.\nConverting a NetworkX Graph to Graph-Tool Both NetworkX and Graph-Tool support property graphs, a data model that allows graphs, vertices, and edges to have arbitrary key-value pairs associated with them. In NetworkX the properties are implemented as Python dictionaries, and as a result, you can use them just like you\u0026rsquo;d use a dictionary. However, in graph-tool these properties are maintained as a PropertyMap, a typed object that must be defined before the property can be added to a graph element. This and other C++ requirements make graph-tool Graphs harder to generate and work with, though the results are worth it.\nFirst a note:\nimport networkx as nx import graph_tool.all as gt We will refer to networkx as nx and graph-tool as gt and prefix variables accordingly, e.g. nxG and gtG refer to the networkx and graph-tool graphs respectively. The snippet is long, but hopefully well commented for readability.\nAs you can see, converting a networkx graph or indeed even creating a graph-tool graph is very involved, primarily because of the typing requirements of the C++ implementation. We haven\u0026rsquo;t even dealt with pitfalls such as other Python objects like datetimes. In case you didn\u0026rsquo;t want to inspect the code in detail, the phases are as follows:\nCreate a graph-tool graph, using is_directed to determine the type. 
Add all the graph properties from nxG.graph. Iterate through the nodes and add all their properties. Create a special id property for nodes since nx uses any hashable type. Iterate through the edges and add all their properties. Iterate through the nodes (again) and add them and their values to the graph. Iterate through the edges, using a lookup to add them and their values. Potentially there is a mechanism to clean up and make this better, faster or stronger - but I think the main point is to illustrate how to get graph-tool graphs going so that you can use their excellent analytics and visualization tools!\n","permalink":"https://bbengfort.github.io/2016/06/graph-tool-from-networkx/","summary":"\u003cp\u003e\u003cimg loading=\"lazy\" src=\"/images/2016-06-23-graph-tool-viz.png\" alt=\"A Directed Graph Visualization Generated by Graph-Tool\"  /\u003e\n\u003c/p\u003e\n\u003cp\u003eThis week I discovered \u003ca href=\"https://graph-tool.skewed.de/\"\u003egraph-tool\u003c/a\u003e, a Python library for network analysis and visualization that is implemented in C++ with Boost. As a result, it can quickly and efficiently perform manipulations, statistical analyses of Graphs, and draw them in a visually pleasing style. It\u0026rsquo;s like using Python with the performance of C++, and I was rightly excited:\u003c/p\u003e\n\u003cblockquote class=\"twitter-tweet\"\u003e\u003cp lang=\"en\" dir=\"ltr\"\u003eIt\u0026#39;s a bear to get setup, but once you do things get pretty nice. Moving my network viz over to it now!\u003c/p\u003e","title":"Converting NetworkX to Graph-Tool"},{"content":"Natural Language Processing with NLTK and Gensim\nDescription Natural Language Processing (NLP) is often taught at the academic level from the perspective of computational linguists. However, as data scientists, we have a richer view of the natural language world - unstructured data that by its very nature has latent information that is important to humans. 
NLP practitioners have benefited from machine learning techniques to unlock meaning from large corpora, and in this class we’ll explore how to do that particularly with Python, Gensim, and the Natural Language Toolkit (NLTK).\nNLTK is an excellent library for machine-learning based NLP, written in Python by experts from both academia and industry. Python allows you to create rich data applications rapidly, iterating on hypotheses. The combination of Python + NLTK means that you can easily add language-aware data products to your larger analytical workflows and applications.\nIn this tutorial we will begin by exploring NLTK from the view of the corpora that it already comes with, and in this way we will get a feel for the various features and functionality that NLTK has. However, most NLP practitioners want to work on their own corpora, therefore during the second half of the tutorial we will focus on building a language aware data product from a specific corpus - a topic identification and document clustering algorithm from a web crawl of blog sites. 
The clustering algorithm will use a simple Lesk K-Means clustering to start, and then will improve with an LDA analysis using the Gensim library.\n","permalink":"https://bbengfort.github.io/2016/05/natural-language-processing-with-nltk-and-gensim/","summary":"\u003cp\u003e\u003ca href=\"https://us.pycon.org/2016/schedule/presentation/1597/\"\u003eNatural Language Processing with NLTK and Gensim\u003c/a\u003e\u003c/p\u003e\n\n\n    \n    \u003cdiv style=\"position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;\"\u003e\n      \u003ciframe allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" allowfullscreen=\"allowfullscreen\" loading=\"eager\" referrerpolicy=\"strict-origin-when-cross-origin\" src=\"https://www.youtube.com/embed/itKNpCPHq3I?autoplay=0\u0026controls=1\u0026end=0\u0026loop=0\u0026mute=0\u0026start=0\" style=\"position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;\" title=\"YouTube video\"\n      \u003e\u003c/iframe\u003e\n    \u003c/div\u003e\n\n\u003ch3 id=\"description\"\u003eDescription\u003c/h3\u003e\n\u003cp\u003eNatural Language Processing (NLP) is often taught at the academic level from the perspective of computational linguists. However, as data scientists, we have a richer view of the natural language world - unstructured data that by its very nature has latent information that is important to humans. NLP practitioners have benefited from machine learning techniques to unlock meaning from large corpora, and in this class we’ll explore how to do that particularly with Python, Gensim, and the Natural Language Toolkit (NLTK).\u003c/p\u003e","title":"Natural Language Processing with NLTK and Gensim"},{"content":" This post is an early draft of expanded work that will eventually appear on the District Data Labs Blog. 
Your feedback is welcome, and you can submit your comments on the draft GitHub issue.\nI\u0026rsquo;ve often been asked which is better for text processing, NLTK or Scikit-Learn (and sometimes Gensim). The answer is that I use all three tools on a regular basis, but I often have a problem mixing and matching them or combining them in meaningful ways. In this post, I want to show how I use NLTK for preprocessing and tokenization, but then apply machine learning techniques (e.g. building a linear SVM using stochastic gradient descent) using Scikit-Learn. In a follow on post, I\u0026rsquo;ll talk about vectorizing text with word2vec for machine learning in Scikit-Learn.\nAs a note, in this post for the sake of speed, I\u0026rsquo;ll be building a text classifier on the movie reviews corpus that comes with NLTK. Here, movie reviews are classified as either positive or negative reviews and this follows a simple sentiment analysis pattern. In the DDL post, I will build a multi-class classifier using the Baleen corpus.\nIn order to follow along, make sure that you have NLTK and Scikit-Learn installed, and that you have downloaded the NLTK corpus:\n$ pip install nltk scikit-learn $ python -m nltk.downloader all I will also be using a few helper utilities like a timeit decorator and an identity function. The complete code for this project can be found here: sentiment.py. Note that I will also omit imports for the sake of brevity, so please review the complete code if trying to execute the snippets on this tutorial.\nPipelines The heart of building machine learning tools with Scikit-Learn is the Pipeline. Scikit-Learn exposes a standard API for machine learning that has two primary interfaces: Transformer and Estimator. Both transformers and estimators expose a fit method for adapting internal parameters based on data. 
Transformers then expose a transform method to perform feature extraction or modify the data for machine learning, and estimators expose a predict method to generate new data from feature vectors.\nPipelines allow developers to combine a sequential DAG of transformers with an estimator, to ensure that the feature extraction process is associated with the predictive process. This is especially important for text, where raw data is usually in the form of documents on disk or a list of strings. While Scikit-Learn does provide some text based feature extraction mechanisms, NLTK is actually far better suited for this type of text processing. As a result, most of my text processing pipelines have something like this at their core:\nThe CorpusReader reads files one at a time off a structured corpus (usually zipped) on disk and acts as the source of the data (I usually include special methods to make sure that I can also get a vector of targets). The tokenizer splits raw text into sentences, words and punctuation, then tags their part of speech and lemmatizes them using the WordNet lexicon. The vectorizer encodes the tokens in the document as a feature vector, for example as a TF-IDF vector. Finally the classifier is fit to the documents and their labels, pickled to disk and used to make predictions in the future.\nPreprocessing In order to limit the number of features, as well as to provide a high quality representation of the text, I use NLTK\u0026rsquo;s advanced text processing mechanisms including the Punkt segmenter and tokenizer, the Brill tagger, and lemmatization using the WordNet lexicon. This not only reduces the vocabulary (and therefore the size of the feature vectors), it also combines redundant features into a single token (e.g. 
bunny, bunnies, Bunny, bunny!, and _bunny_ all become one feature: bunny).\nIn order to add this type of preprocessing to Scikit-Learn, we must create a Transformer object as follows:\nimport string from nltk.corpus import stopwords as sw from nltk.corpus import wordnet as wn from nltk import wordpunct_tokenize from nltk import WordNetLemmatizer from nltk import sent_tokenize from nltk import pos_tag from sklearn.base import BaseEstimator, TransformerMixin class NLTKPreprocessor(BaseEstimator, TransformerMixin): def __init__(self, stopwords=None, punct=None, lower=True, strip=True): self.lower = lower self.strip = strip self.stopwords = stopwords or set(sw.words(\u0026#39;english\u0026#39;)) self.punct = punct or set(string.punctuation) self.lemmatizer = WordNetLemmatizer() def fit(self, X, y=None): return self def inverse_transform(self, X): return [\u0026#34; \u0026#34;.join(doc) for doc in X] def transform(self, X): return [ list(self.tokenize(doc)) for doc in X ] def tokenize(self, document): # Break the document into sentences for sent in sent_tokenize(document): # Break the sentence into part of speech tagged tokens for token, tag in pos_tag(wordpunct_tokenize(sent)): # Apply preprocessing to the token token = token.lower() if self.lower else token token = token.strip() if self.strip else token token = token.strip(\u0026#39;_\u0026#39;) if self.strip else token token = token.strip(\u0026#39;*\u0026#39;) if self.strip else token # If stopword, ignore token and continue if token in self.stopwords: continue # If punctuation, ignore token and continue if all(char in self.punct for char in token): continue # Lemmatize the token and yield lemma = self.lemmatize(token, tag) yield lemma def lemmatize(self, token, tag): tag = { \u0026#39;N\u0026#39;: wn.NOUN, \u0026#39;V\u0026#39;: wn.VERB, \u0026#39;R\u0026#39;: wn.ADV, \u0026#39;J\u0026#39;: wn.ADJ }.get(tag[0], wn.NOUN) return self.lemmatizer.lemmatize(token, tag) This is a big chunk of code, so we\u0026rsquo;ll go 
through it method by method. First when this transformer is initialized, it loads a variety of corpora and models for use in tokenization. By default the set of English stopwords from NLTK is used, and the WordNetLemmatizer looks up data from the WordNet lexicon. Note that this takes a noticeable amount of time, and should only be done on instantiation of the transformer.\nNext we have the Transformer interface methods: fit, inverse_transform, and transform. The first two are simply pass throughs since there is nothing to fit on this class, nor any real way to do an inverse_transform: how would you take lowercase, lemmatized tokens and come up with the original text? The best we can do is simply join the tokens with a space. The transform method takes a list of documents (given as the variable, X) and returns a new list of tokenized documents, where each document is transformed into a list of ordered tokens.\nThe tokenize method breaks raw strings into sentences, then breaks those sentences into words and punctuation, and applies a part of speech tag. The token is then normalized: made lower case, then stripped of whitespace and other types of punctuation that may be appended. If the token is a stopword or if every character is punctuation, the token is ignored. If it is not ignored, the part of speech is used to lemmatize the token, which is then yielded.\nLemmatization is the process of looking up a single word form from the variety of morphologic affixes that can be applied to indicate tense, plurality, gender, etc. First we need to identify the WordNet tag form based on the Penn Treebank tag, which is returned from NLTK\u0026rsquo;s standard pos_tag function. We simply look to see if the Penn tag starts with \u0026lsquo;N\u0026rsquo;, \u0026lsquo;V\u0026rsquo;, \u0026lsquo;R\u0026rsquo;, or \u0026lsquo;J\u0026rsquo; and can correctly identify whether it\u0026rsquo;s a noun, verb, adverb, or adjective. 
We then use the new tag to look up the lemma in the lexicon.\nBuild and Evaluate The next stage is to create the pipeline, train a classifier, then to evaluate it. Here I present a very simple version of build and evaluate where:\nThe data is shuffled and split into a training set and a testing set. The model is trained on the training set, and evaluated on the testing set. A new model is then fit on all of the data and saved to disk. Elsewhere we can discuss evaluation techniques like k-fold cross validation, grid search for hyperparameter tuning, or visual diagnostics for machine learning. My simple method is as follows:\nimport pickle from sklearn.pipeline import Pipeline from sklearn.preprocessing import LabelEncoder from sklearn.linear_model import SGDClassifier from sklearn.metrics import classification_report as clsr from sklearn.feature_extraction.text import TfidfVectorizer from sklearn.model_selection import train_test_split as tts @timeit def build_and_evaluate(X, y, classifier=SGDClassifier, outpath=None, verbose=True): @timeit def build(classifier, X, y=None): \u0026#34;\u0026#34;\u0026#34; Inner build function that builds a single model. 
\u0026#34;\u0026#34;\u0026#34; if isinstance(classifier, type): classifier = classifier() model = Pipeline([ (\u0026#39;preprocessor\u0026#39;, NLTKPreprocessor()), (\u0026#39;vectorizer\u0026#39;, TfidfVectorizer( tokenizer=identity, preprocessor=None, lowercase=False )), (\u0026#39;classifier\u0026#39;, classifier), ]) model.fit(X, y) return model # Label encode the targets labels = LabelEncoder() y = labels.fit_transform(y) # Begin evaluation if verbose: print(\u0026#34;Building for evaluation\u0026#34;) X_train, X_test, y_train, y_test = tts(X, y, test_size=0.2) model, secs = build(classifier, X_train, y_train) if verbose: print(\u0026#34;Evaluation model fit in {:0.3f} seconds\u0026#34;.format(secs)) print(\u0026#34;Classification Report:\\n\u0026#34;) y_pred = model.predict(X_test) print(clsr(y_test, y_pred, target_names=labels.classes_)) if verbose: print(\u0026#34;Building complete model and saving ...\u0026#34;) model, secs = build(classifier, X, y) model.labels_ = labels if verbose: print(\u0026#34;Complete model fit in {:0.3f} seconds\u0026#34;.format(secs)) if outpath: with open(outpath, \u0026#39;wb\u0026#39;) as f: pickle.dump(model, f) print(\u0026#34;Model written out to {}\u0026#34;.format(outpath)) return model This is a fairly procedural method of going about things. There is an inner function, build that takes a classifier class or instance (if given a class, it instantiates the classifier with the defaults) and creates the pipeline with that classifier and fits it. Note that when using the TfidfVectorizer you must make sure that its default preprocessor, normalizer, and tokenizer are all turned off using the identity function and passing None to the other parameters.\nThe function times the build process, evaluates it via the classification report that reports precision, recall, and F1. Then builds a new model on the complete dataset and writes it out to disk. 
In order to build the model, run the following code:\nfrom nltk.corpus import movie_reviews as reviews X = [reviews.raw(fileid) for fileid in reviews.fileids()] y = [reviews.categories(fileid)[0] for fileid in reviews.fileids()] model = build_and_evaluate(X, y, outpath=PATH) The output is as follows:\nBuilding for evaluation Evaluation model fit in 100.777 seconds Classification Report: precision recall f1-score support neg 0.84 0.84 0.84 193 pos 0.85 0.85 0.85 207 avg / total 0.84 0.84 0.84 400 Building complete model and saving ... Complete model fit in 115.402 seconds Model written out to model.pickle This is certainly not too bad, but consider how much time it took. For much larger corpora, you\u0026rsquo;ll only want to run this once, and in a time saving way. You could also preprocess your corpora in advance; however, if you did so you would not be able to use the Pipeline as given, and would have to create separate feature extraction and modeling steps.\nMost Informative Features In order to use the model you just built, you would load the pickle from disk and use its predict method on new text as follows:\nwith open(PATH, \u0026#39;rb\u0026#39;) as f: model = pickle.load(f) yhat = model.predict([ \u0026#34;This is the worst movie I have ever seen!\u0026#34;, \u0026#34;The movie was action packed and full of adventure!\u0026#34; ]) print(model.labels_.inverse_transform(yhat)) # [\u0026#39;neg\u0026#39; \u0026#39;pos\u0026#39;] In order to better understand how our linear model makes these decisions, we can use the coefficients for each feature (a word) to determine its weight in terms of positivity (and because \u0026lsquo;pos\u0026rsquo; is 1, this will be a positive number) and negativity (because \u0026lsquo;neg\u0026rsquo; is 0 this will be a negative number). 
We can also vectorize a piece of text and see how its features inform the class decision by multiplying its vector against the weights as follows:\ndef show_most_informative_features(model, text=None, n=20): # Extract the vectorizer and the classifier from the pipeline vectorizer = model.named_steps[\u0026#39;vectorizer\u0026#39;] classifier = model.named_steps[\u0026#39;classifier\u0026#39;] # Check to make sure that we can perform this computation if not hasattr(classifier, \u0026#39;coef_\u0026#39;): raise TypeError( \u0026#34;Cannot compute most informative features on {}.\u0026#34;.format( classifier.__class__.__name__ ) ) if text is not None: # Compute the coefficients for the text tvec = model.transform([text]).toarray() else: # Otherwise simply use the coefficients tvec = classifier.coef_ # Zip the feature names with the coefs and sort coefs = sorted( zip(tvec[0], vectorizer.get_feature_names()), key=itemgetter(0), reverse=True ) # Get the top n and bottom n coef, name pairs topn = zip(coefs[:n], coefs[:-(n+1):-1]) # Create the output string to return output = [] # If text, add the predicted value to the output. if text is not None: output.append(\u0026#34;\\\u0026#34;{}\\\u0026#34;\u0026#34;.format(text)) output.append( \u0026#34;Classified as: {}\u0026#34;.format(model.predict([text])) ) output.append(\u0026#34;\u0026#34;) # Create two columns with most positive and most negative features. 
for (cp, fnp), (cn, fnn) in topn: output.append( \u0026#34;{:0.4f}{: \u0026gt;15} {:0.4f}{: \u0026gt;15}\u0026#34;.format( cp, fnp, cn, fnn ) ) return \u0026#34;\\n\u0026#34;.join(output) For the model I trained, this reports the 20 most informative features for both positive and negative coefficients as follows:\n3.4326 fun -6.5962 bad 3.3835 great -3.2906 suppose 3.0014 performance -3.2527 plot 2.7226 see -3.1964 nothing 2.5224 quite -3.1688 attempt 2.5076 matrix -3.1104 unfortunately 2.1876 also -3.0741 waste 2.1336 true -2.5946 poor 2.1140 terrific -2.5943 boring 2.1076 different -2.5043 awful 2.0689 job -2.4893 ridiculous 2.0450 hilarious -2.4519 carpenter 2.0088 trek -2.4446 look 1.9704 memorable -2.2874 stupid 1.9501 well -2.2667 guess 1.9267 excellent -2.1953 even 1.8948 sometimes -2.1946 anyway 1.8939 perfectly -2.1719 lame 1.8506 bulworth -2.1406 reason 1.8453 portray -2.1098 script This seems to make a lot of sense!\nConclusion There are great tools for doing machine learning, topic modeling, and text analysis with Python: Scikit-Learn, Gensim, and NLTK respectively. Unfortunately in order to combine these tools in meaningful ways, you often have to jump through some hoops because they overlap. My approach was to leverage the API model of Scikit-Learn to build Pipelines of transformers that took advantage of other libraries.\nHelpful Links Using Scikit-Learn Pipelines and FeatureUnions Working with Text Data: Sckit-Learn 0.17 ","permalink":"https://bbengfort.github.io/2016/05/text-classification-nltk-sckit-learn/","summary":"\u003cblockquote\u003e\n\u003cp\u003eThis post is an early draft of expanded work that will eventually appear on the \u003ca href=\"http://blog.districtdatalabs.com/\"\u003eDistrict Data Labs Blog\u003c/a\u003e. 
Your feedback is welcome, and you can submit your comments on the \u003ca href=\"https://github.com/bbengfort/bbengfort.github.io/issues/4\"\u003edraft GitHub issue\u003c/a\u003e.\u003c/p\u003e\n\u003c/blockquote\u003e\n\u003cp\u003eI\u0026rsquo;ve often been asked which is better for text processing, NLTK or Scikit-Learn (and sometimes Gensim). The answer is that I use all three tools on a regular basis, but I often have a problem mixing and matching them or combining them in meaningful ways. In this post, I want to show how I use NLTK for preprocessing and tokenization, but then apply machine learning techniques (e.g. building a linear SVM using stochastic gradient descent) using Scikit-Learn. In a follow on post, I\u0026rsquo;ll talk about vectorizing text with word2vec for machine learning in Scikit-Learn.\u003c/p\u003e","title":"Text Classification with NLTK and Scikit-Learn"},{"content":" Yesterday I built my first microservice (a RESTful API) using Go, and I wanted to collect a few of my thoughts on the experience here before I forgot them. The project, Scribo, is intended to aid in my research by collecting data about a specific network that I\u0026rsquo;m looking to build distributed systems for. I do have something running, which will need to evolve a lot, and it could be helpful to know where it started.\nWhen I first sat down to do this project, I thought it was pretty straight forward. I watched “Writing JSON REST APIs in Go (Go from A to Z)” and read through the “Making a RESTful JSON API in Go” tutorial. I was going to deploy the service on Heroku, add testing with Ginkgo and Gomega, and use continuous integration with Travis-CI. I was used to doing the same kind of thing with Flask or Django and figured it couldn\u0026rsquo;t take that long. 
After a full day of coding, I did manage to do all the things I mentioned above, but with a number of angsty decisions that have caused me to write this post.\nThere were three main holdups that caused me to have trouble moving forward quickly:\nThe choice of a RESTful API framework Structuring the project Managing dependencies and versions Briefly I want to go over how each went down and the choices I made.\nFramework At the moment I\u0026rsquo;ve ended up using Gorilla mux though it was in pretty strong contention with go-json-rest. Note that these frameworks were the two proposed in both of the tutorials I mentioned earlier. I saw but did not consider Martini, which is no longer maintained, and Gin which apparently is Martini-like but faster. I was warned off of these frameworks by a post by Stephen Searles even though the majority of tutorials on the first page of Google results mentioned and used these frameworks.\nI think the response by Code Gangsta to Searles\u0026rsquo; criticism highlights the trouble that I had selecting a framework. I was expecting to come in and have to perform some hoop jumping to select a framework sort of like Flask vs. Django or Sinatra vs. Rails. I hoped that I would have been easily steered away from projects like Bottle (not a bad project, just not very popular) simply because of the number of tutorials.\nThe issue is that Go is so new and Go developers come from other communities, that idiomatic Go frameworks are still pretty tough to write because a lot of thought has to go into what that means. Moreover, Go\u0026rsquo;s standard library, namely net/http is so good that you don\u0026rsquo;t really have to build a lot on top of it (whereas you would never build a web app directly on top of Python\u0026rsquo;s HTTPServer).\nGo is intended to be the compilation of small, lightweight packages that are very good at the one thing they do well. It is not intended for large frameworks. 
Even Gorilla seems a bit too large in this context. I guess what I want is some small lightweight Resource API like the one described in “A RESTful Microframework in Go”, which I intend to build for my platform. Since we can\u0026rsquo;t be expected to build all these small components on our own, this led me to the next problem: packaging.\nProject Structure There is actually a lot about how to structure Go code, in fact it is one of the first things discussed in the Go documentation. This is because the Go tools are dependent on how you organize your projects. The src/pkg/bin structuring along with namespaces based on repositories and use of go get to fetch and manage dependencies (see next section) makes Go “open source ready”.\nHowever, at the end of the day, it still feels weird for me to create multiple repositories for a single project \u0026ndash; particularly as it seems that they are suggesting that you create your library in one repository, and your commands and program main.go in a second repository. Moreover, I don\u0026rsquo;t like my code to be at the top level of the repository, I need some organization for large projects that don\u0026rsquo;t span repositories, and would like things to be in a subfolder (maybe I\u0026rsquo;ll get over this and be a better Go programmer).\nBased on Heroku\u0026rsquo;s suggestions, I followed the advice of Ben Johnson in “Structuring Applications in Go”. I put my main.go in a cmd folder so that it wouldn\u0026rsquo;t be built automatically on go get. My library I still forced into a subpackage, which requires me to specify ./... for most go commands to recursively search the directory for Go code. I\u0026rsquo;m decently ok with how things are now, but still not wholly comfortable.\nAlso - this is a web application, so I need to add HTML, CSS, and JavaScript files. Where to put those? Right now they\u0026rsquo;re in the root of the directory, but honestly this doesn\u0026rsquo;t feel right. 
I just wanted to create a small and simple one page app to view the microservice under the hood. The essential problem was that I couldn\u0026rsquo;t find a single example of a web app built using Go. This is a matter of not being able to Google correctly for it, but I still need those examples!\nDependency Management Apparently there was some discussion in Go between when I first started using it (1.3) and when I came back to it (1.6), and during Go 1.5 there was a “vendor experiment”. Vendoring is a mechanism of preserving specific dependency requirements for a project by including them (usually in a subfolder called vendor) in the source version control for your project. This is opposed to other mechanisms where you simply specify the version of the dependency you want and can fetch it (e.g. with go get) during the build process.\nFrom what I can tell, the vendor experiment won, and dependency management tools like Godep and The Vendor tool for Go had to do a bit of reorganizing. Because of Travis-CI and Heroku (which automatically look for a folder in your project called Godeps, created by the godep save command), I went with Godep over anything else.\nStill I\u0026rsquo;m not happy with this solution. I have no guide about what projects to select or use. Moreover, my src/github.com directory is getting filled up with a TON of projects. I feel like more investigation needs to be done here as well.\nConclusion Yesterday I was super excited, today I\u0026rsquo;m nervous but ready. I have a lot of questions, but I hope that I\u0026rsquo;ll be moving forward to doing some serious Go programming in the future. 
I hope to be as good a Go programmer as I am a Python programmer in the future, so that I can naturally create fast, effective systems.\n","permalink":"https://bbengfort.github.io/2016/05/a-microservice-in-go/","summary":"\u003cp\u003e\u003cimg loading=\"lazy\" src=\"/images/2016-05-10-mora-architecture.png\" alt=\"The Mora Architecture Diagram\"  /\u003e\n\u003c/p\u003e\n\u003cp\u003eYesterday I built my first \u003ca href=\"http://martinfowler.com/articles/microservices.html\"\u003emicroservice\u003c/a\u003e (a RESTful API) using \u003ca href=\"https://golang.org/\"\u003eGo\u003c/a\u003e, and I wanted to collect a few of my thoughts on the experience here before I forgot them. The project, \u003ca href=\"https://github.com/bbengfort/scribo\"\u003eScribo\u003c/a\u003e, is intended to aid in my research by collecting data about a specific network that I\u0026rsquo;m looking to build distributed systems for. I do have \u003ca href=\"https://mora-scribo.herokuapp.com/\"\u003esomething running\u003c/a\u003e, which will need to evolve a lot, and it could be helpful to know where it started.\u003c/p\u003e","title":"Creating a Microservice in Go"},{"content":"One of the first steps to performing analysis of Git repositories is extracting the changes over time, e.g. the Git log. This seems like it should be a very simple thing to do, as visualizations on GitHub and elsewhere show file change analyses through history on a commit by commit basis. Moreover, by using the GitPython library you have direct access to Git repositories that is scriptable. Unfortunately, things aren\u0026rsquo;t as simple as that, so I present a snippet for extracting change information from a Repository.\nFirst thing first, dependencies. To use this code you must install GitPython:\n$ pip install gitpython What I\u0026rsquo;m looking for in this example is the change for every single file throughout time for every commit. 
This doesn\u0026rsquo;t necessarily mean the change in the blobs themselves, but metadata about the change that occurred. For example:\nObject: the path or name of the file Commit: the commit in which the file was changed Author: the username or email of the author of the file Timestamp: when the file was changed Size: the number of bytes changed (negative for deletions) Type of change: whether the file was added, deleted, modified, or renamed. Stats: the number of lines changed/inserted/deleted. This pretty straight forward analysis will allow us to build a graph model of how users and files interact inside of a particular project. So here\u0026rsquo;s the snippet:\nThe result from this snippet is a generator that yields dictionaries that look something like:\n{ \u0026#34;deletions\u0026#34;: 0, \u0026#34;insertions\u0026#34;: 18, \u0026#34;author\u0026#34;: \u0026#34;benjamin@bengfort.com\u0026#34;, \u0026#34;timestamp\u0026#34;: \u0026#34;2016-02-23T12:36:59-0500\u0026#34;, \u0026#34;object\u0026#34;: \u0026#34;cloudscope/tests/test_utils/__init__.py\u0026#34;, \u0026#34;lines\u0026#34;: 18, \u0026#34;commit\u0026#34;: \u0026#34;00c5dd71d86f94dce5fd31b254a1c690c5ec1a53\u0026#34;, \u0026#34;type\u0026#34;: \u0026#34;A\u0026#34;, \u0026#34;size\u0026#34;: 509 } This can be used to create a history of file changes, or to create a graph of files that are commonly changed together.\n","permalink":"https://bbengfort.github.io/2016/05/git-diff-extract/","summary":"\u003cp\u003eOne of the first steps to performing analysis of Git repositories is extracting the changes over time, e.g. the Git log. This seems like it should be a very simple thing to do, as visualizations on GitHub and elsewhere show file change analyses through history on a commit by commit basis. Moreover, by using the \u003ca href=\"http://gitpython.readthedocs.io/en/stable/\"\u003eGitPython\u003c/a\u003e library you have direct access to Git repositories that is scriptable. 
Unfortunately, things aren\u0026rsquo;t as simple as that, so I present a snippet for extracting change information from a Repository.\u003c/p\u003e","title":"Extracting Diffs from Git with Python"},{"content":"As I\u0026rsquo;ve dug into my distributed systems research, one question keeps coming up: “How do you visualize distributed systems?” Distributed systems are hard, so it feels like being able to visualize the data flow would go a long way to understanding them in detail and avoiding bugs. Unfortunately, the same things that make architecting distributed systems difficult also make them hard to visualize.\nI don\u0026rsquo;t have an answer to this question, unfortunately. However, in this post I\u0026rsquo;d like to state my requirements and highlight some visualizations that I think are important. Hopefully this will be the start of a more complete investigation or at least allow others to comment on what they\u0026rsquo;re doing and whether or not visualization is important.\nStatic Visualization A distributed system can loosely be described as multiple instances of a software program running on different machines that react to events. These events can be either external (a user making a request) or internal (handling requests from other instances). The collective individual behavior of each node informs how the entire system behaves.\nOne high level view of the design of a system looks at the propagation of events, or messages being sent between nodes in the distributed system. This can be visualized using a message sequence chart which embeds the time flow of a system and displays the interaction between nodes as they generate messages in reaction to received messages.\nIn the message sequence chart, every lane represents a single replica and arrows between them represent message passing and receipt order. Often, crossed arrows represent the difficulty in determining the happens before relationship with respect to message order. 
These charts are good at defining a single situation and the reaction of the system, but do not do a good job at describing the general interaction. How do we describe a system in terms of the decisions it must make in reaction to received events that might be unordered?\nOne method of designing a distributed system is to consider the design of only a single instance. Each instance reacts to events (messages) then can update their state or do some work, and generate messages of their own. The receipt and sending of messages defines the collective behavior. This is a simplification of the actor model of distributed computing. This seems like it might make things a bit easier, because now we only have to visualize the behavior of a single instance, and describe message handling as a flow chart of decision making.\nThe flow chart above represents one of the attempts I\u0026rsquo;ve made to describe how the Raft consensus protocol works from the perspective of a single replica server. Raft is generally understood to be one of the most understandable consensus protocols, and as such it should be easy to describe visually. Here, messages are represented as colored circles. Raft has two primary RPC messages: request vote and append entries, therefore the circles represent the send and receive events of both RPC messages and their responses (8 total message types). Each RPC roughly has their own zone in the flow chart. State changes are represented by the purple boxes, decisions by diamonds, and actions by square boxes. As you can see the flow chart is not completely connected, but hopefully by following from a \u0026ldquo;send\u0026rdquo; node to a \u0026ldquo;recv\u0026rdquo; node, one can track how the system interacts over time as well as individual nodes.\nThis visualization still needs a lot of help, however. 
It is complex, and doesn\u0026rsquo;t necessarily embed all the information of how the complete system handles failure or messages.\nInteractive Visualization The most interesting combination of message traffic and behavior that I\u0026rsquo;ve seen so far requires JavaScript to create a dynamic, interactive visualization. Here, the user can play with different scenarios to see how the distributed system will react to different events or scenarios. It visualizes both the decision making process of the replica servers, as well as the ordering of messages as they\u0026rsquo;re sent and received.\nOne of the first places I encountered this was the RaftScope visualization. Here colored balls with an arrow represent the messages themselves (responses are not filled). The state of each node is shown by the edge color (a timer for followers, dotted for candidates, and solid for leaders). The log of each replica server is also displayed to show how the log repairs itself and commits values.\nMoreover, users can also click on nodes and disable them, make \u0026ldquo;client requests\u0026rdquo;, pause, or otherwise modify their behavior. This allows custom scenarios to be constructed and interpreted similar to the message sequence diagram, but with more flexibility. The problem is that the entire protocol must be implemented in JavaScript in order to ensure correct visualization (and is therefore a non-trivial, non-development approach to explaining how a system works).\nThis idea was taken one step further by The Secret Lives of Data, which uses a tutorial style presentation to show in detail each phase of the Raft algorithm. This allows the visualization to show specific scenarios rather than force the user to design them. I hope to see more tutorials for different algorithms soon!\nThese two examples inspired me to create my own interactive visualization for the work I\u0026rsquo;m doing on consistency fragmentation. 
I use a similar design of circles for messages interacting with nodes in a circular topology. Right now it is still unfinished, but I\u0026rsquo;ve at least put together an MVP of what it might look like.\nMy goal is to feed the visualization actual traces from the backend simulation I\u0026rsquo;m writing using SimPy or from the logs of live systems. The visualization will be less interactive (in the sense you can\u0026rsquo;t create specific scenarios) but will hopefully give insight into what is going on in the real system and allow me easier development and architecture.\nConclusion So I pose to you the following questions:\nIs visualization important to the architecture of distributed systems? How can we implement better static and interactive visualizations? Visualization is not part of my research, but I hope an important part of describing what is happening in the system. Any feedback would be appreciated!\n","permalink":"https://bbengfort.github.io/2016/04/visualizing-distributed-systems/","summary":"\u003cp\u003eAs I\u0026rsquo;ve dug into my distributed systems research, one question keeps coming up:\n\u003cem\u003e“How do you visualize distributed systems?”\u003c/em\u003e Distributed systems are \u003ca href=\"https://www.somethingsimilar.com/2013/01/14/notes-on-distributed-systems-for-young-bloods/\"\u003ehard\u003c/a\u003e, so it feels like being able to visualize the data flow would go a long way to understanding them in detail and avoiding bugs. Unfortunately, the same things that make architecting distributed systems difficult also make them hard to visualize.\u003c/p\u003e\n\u003cp\u003eI don\u0026rsquo;t have an answer to this question, unfortunately. However, in this post I\u0026rsquo;d like to state my requirements and highlight some visualizations that I think are important. 
Hopefully this will be the start of a more complete investigation or at least allow others to comment on what they\u0026rsquo;re doing and whether or not visualization is important.\u003c/p\u003e","title":"Visualizing Distributed Systems"},{"content":"One large issue that I encounter in development with machine learning is the need to structure our data on disk in a way that we can load into Scikit-Learn in a repeatable fashion for continued analysis. My proposal is to use the sklearn.datasets.base.Bunch object to load the data into data and target attributes respectively, similar to how Scikit-Learn\u0026rsquo;s toy datasets are structured. Using this object to manage our data will mirror the native API and allow us to easily copy and paste code that demonstrates classifiers and techniques with the built in datasets. Importantly, this API will also allow us to communicate to other developers and our future-selves exactly how to use the data.\nMoreover, we need to be able to structure more and varied datasets as most projects aren\u0026rsquo;t dedicated to building a single classifier, but rather lots of them. Data is extracted and written to disk through SQL queries, then models are written back into the database. All of these fixtures (for building models) as well as the extraction method, and meta data need to be versioned so that we can have a repeatable process (for science). The workflow is as follows:\nThis post is largely concerned with the “Data Directory” and the “Load and Transform Data” highlighted processes in the flow chart. The first step is to structure a fixtures directory with our data code. The fixtures directory will contain named subdirectories where each name is related to a dataset we want to load. These directories will contain the following files.\nquery.sql: a sql file that can be executed against the database to extract and wrangle the dataset. 
dataset.txt: a numpy whitespace delimited file containing either a dense or sparse matrix of numeric data to pass to the model fit process. (This can be easily adapted to a CSV file of raw data if needed). README.md: a markdown file containing information about the dataset and attribution. Will be exposed by the DESCR attribute. meta.json: a helper file that contains machine readable information about the dataset like target_names and feature_names. A very simple project will therefore have a fixtures directory that looks like:\n$ project . ├── fixtures | ├── energy | | ├── dataset.txt | | ├── meta.json | | ├── README.md | | └── query.sql | └── solar | ├── dataset.txt | ├── meta.json | ├── README.md | └── query.sql └── index.json Dataset utilities code should know about this directory and how to access it by using paths relative to the source code and environment variables as follows:\nimport os SKL_DATA = \u0026#34;SCIKIT_LEARN_DATA\u0026#34; BASE_DIR = os.path.normpath(os.path.join(os.path.dirname(__file__), \u0026#34;..\u0026#34;)) DATA_DIR = os.path.join(BASE_DIR, \u0026#34;fixtures\u0026#34;) def get_data_home(data_home=None): \u0026#34;\u0026#34;\u0026#34; Returns the path of the data directory \u0026#34;\u0026#34;\u0026#34; if data_home is None: data_home = os.environ.get(SKL_DATA, DATA_DIR) data_home = os.path.expanduser(data_home) if not os.path.exists(data_home): os.makedirs(data_home) return data_home The get_data_home function looks for the root directory of the fixtures by accepting a passed in path, or by looking in the environment, finally defaulting to the project fixtures directory. Note that this function creates the directory if it doesn\u0026rsquo;t exist in order for automatic writes to go through without failing.\nThe Bunch object in Scikit-Learn is simply a dictionary that exposes dictionary keys as properties so that you can access them with dot notation. 
This by itself isn\u0026rsquo;t particularly useful, but let\u0026rsquo;s look at how the toy datasets are structured:\n\u0026gt;\u0026gt;\u0026gt; from sklearn.datasets import load_digits, load_boston \u0026gt;\u0026gt;\u0026gt; dataset = load_digits() \u0026gt;\u0026gt;\u0026gt; print dataset.keys() [\u0026#39;images\u0026#39;, \u0026#39;data\u0026#39;, \u0026#39;target_names\u0026#39;, \u0026#39;DESCR\u0026#39;, \u0026#39;target\u0026#39;] \u0026gt;\u0026gt;\u0026gt; print load_boston().keys() [\u0026#39;data\u0026#39;, \u0026#39;feature_names\u0026#39;, \u0026#39;DESCR\u0026#39;, \u0026#39;target\u0026#39;] We can see that the bunch object keeps track of the primary matrix (usually labeled X) in the data attribute and the targets (usually called y) in the target attribute. Moreover, it shows a README with information about the dataset including citations in the DESCR property, as well as other information like names and images. We will create a similar load_data methodology to use in our projects.\nNow that we have everything we need stored on disk, we can create a load_data function, which will accept the name of a dataset, and appropriately look it up using the structure above. Moreover, it extracts the data required for a Bunch object including extracting the target value from the first or last columns of the dataset and using the meta.json file for other important information.\nimport json import numpy as np from sklearn.datasets.base import Bunch def load_data(path, descr=None, target_index=-1): \u0026#34;\u0026#34;\u0026#34; Returns a sklearn dataset Bunch which includes several important attributes that are used in modeling: data: array of shape n_samples * n_features target: array of length n_samples feature_names: names of the features target_names: names of the targets filenames: names of the files that were loaded DESCR: contents of the readme This data therefore has the look and feel of the toy datasets. 
\u0026#34;\u0026#34;\u0026#34; root = os.path.join(get_data_home(), path) filenames = { \u0026#39;meta\u0026#39;: os.path.join(root, \u0026#39;meta.json\u0026#39;), \u0026#39;rdme\u0026#39;: os.path.join(root, \u0026#39;README.md\u0026#39;), \u0026#39;data\u0026#39;: os.path.join(root, \u0026#39;dataset.txt\u0026#39;), } target_names = None feature_names = None DESCR = None with open(filenames[\u0026#39;meta\u0026#39;], \u0026#39;r\u0026#39;) as f: meta = json.load(f) target_names = meta[\u0026#39;target_names\u0026#39;] feature_names = meta[\u0026#39;feature_names\u0026#39;] with open(filenames[\u0026#39;rdme\u0026#39;], \u0026#39;r\u0026#39;) as f: DESCR = f.read() dataset = np.loadtxt(filenames[\u0026#39;data\u0026#39;]) data = None target = None # Target assumed to be either the last or first column if target_index == -1: data = dataset[:, 0:-1] target = dataset[:, -1] elif target_index == 0: data = dataset[:, 1:] target = dataset[:, 0] else: raise ValueError(\u0026#34;Target index must be either -1 or 0\u0026#34;) return Bunch(data=data, target=target, filenames=filenames, target_names=target_names, feature_names=feature_names, DESCR=DESCR) The primary work of the load_data function is to locate the appropriate files on disk, given a root directory that\u0026rsquo;s passed in as an argument (if you saved your data in a different directory, you can modify the root to have it look in the right place). 
The meta data is included with the bunch, and is also used to split the train and test datasets into data and target variables appropriately, such that we can pass them correctly to the Scikit-Learn fit and predict estimator methods.\nNow we can create named aliases for specific datasets as follows:\ndef load_energy(): return load_data(\u0026#39;energy\u0026#39;) def load_solar(): return load_data(\u0026#39;solar\u0026#39;) And we have a system that looks and feels exactly like the datasets that Scikit-Learn ships with.\n","permalink":"https://bbengfort.github.io/2016/04/bunch-data-management/","summary":"\u003cp\u003eOne large issue that I encounter in development with machine learning is the need to structure our data on disk in a way that we can load into Scikit-Learn in a repeatable fashion for continued analysis. My proposal is to use the \u003ccode\u003esklearn.datasets.base.Bunch\u003c/code\u003e object to load the data into data and target attributes respectively, similar to how Scikit-Learn\u0026rsquo;s toy datasets are structured. Using this object to manage our data will mirror the native API and allow us to easily copy and paste code that demonstrates classifiers and techniques with the built in datasets. Importantly, this API will also allow us to communicate to other developers and our future-selves exactly how to use the data.\u003c/p\u003e","title":"Scikit-Learn Data Management: Bunches"},{"content":"Part of my research involves the creation of large scale distributed systems, and while we do build these systems and deploy them, we do find that simulating them for development and research gives us an advantage in trying new things out. 
To that end, I employ discrete event simulation (DES) using Python\u0026rsquo;s SimPy library to build very large simulations of distributed systems, such as the one I\u0026rsquo;ve built to inspect consistency patterns in variable latency, heterogeneous, partition prone networks: CloudScope.\nVery briefly (perhaps I\u0026rsquo;ll do a SimPy tutorial at another date), SimPy creates an environment that dispatches and listens for events. You create these event processes in the environment by implementing generators that yield control of the execution as they\u0026rsquo;re working. As a result, the SimPy environment can call the next() method of your generator to do processing on schedule. Consider the following code:\nimport simpy def wake_and_sleep(env): state = \u0026#39;Awake\u0026#39; while True: # change state and alert state = \u0026#39;Awake\u0026#39; if state == \u0026#39;Asleep\u0026#39; else \u0026#39;Asleep\u0026#39; print \u0026#34;{} at {}\u0026#34;.format(state, env.now) # wait 5 timesteps yield env.timeout(5) if __name__ == \u0026#39;__main__\u0026#39;: env = simpy.Environment() env.process(wake_and_sleep(env)) env.run(until=100) This simple generator function runs forever and constantly switches its state from Awake to Asleep to Awake again every 5 timesteps. If you run this you\u0026rsquo;ll get something that looks as follows:\nAsleep at 0 Awake at 5 Asleep at 10 Awake at 15 Asleep at 20 ... The neat thing is that SimPy doesn\u0026rsquo;t have a counter that simply increments the timestep and checks if any events have gone off \u0026ndash; and actually that\u0026rsquo;s the whole point of discrete event simulation: to simulate events and their interactions as they occur in order without the burden of waiting for a real time simulation. Instead, SimPy has a master schedule that is created by calling the next() method of all its processes. 
When you yield a timeout of 5, your next() method will be called at env.now + 5. SimPy just increments now to the next timestep that has an event in it (or the minimum time of the current schedule).\nThis means that it\u0026rsquo;s a bad idea to constantly yield 1 timestep timeouts, especially if you\u0026rsquo;re doing something like checking a state for work. Instead you should register a callback to the state change that you\u0026rsquo;re looking for, and call the closure there. However, when you\u0026rsquo;re modeling real world systems that do check their state every timestep, it\u0026rsquo;s very difficult. This led me to the question:\nHow bad is it to constantly yield 1 timestep timeouts in large simulations?\nSo of course, we need data.\nI created a SimPy process that would yield a timeout with a specific number of steps, then track how many times the process executes and the amount of real time that passes by. The experiment variables were the maximum number of timesteps allowed in the simulation and the number of steps to yield in the process. I selected a range of 10 maximum times between 1,000,000 and 50,000,000, with a stride of 5,000,000 and a range of steps between 1 and 10. This led to 100 runs of the simulation with each dimension pair. The [results]({{ site.baseurl }}assets/data/timestepping.csv) were as follows:\n![Simulation Timestepping Results]({{ site.baseurl }}assets/images/2016-04-15-timestepping.png)\nAs you can see, there is an exponential decrease in the amount of real time taken by the system as the number of timesteps you yield in your event process increases. 
Even just going from doing a check every single timestep to every other timestep will save you a lot of real time in your simulation process!\nAnd a different view, the interaction plot is as follows:\n![Simulation Interaction Plot]({{ site.baseurl }}assets/images/2016-04-15-interact-plot.png)\nThe interact plot shows every experiment, which is a grid of max simulation time (until) and the number of steps between event process yield. The heatmap shows that the amount of real time is exponentially dependent on steps (the curve around the X axis) and linearly dependent on until (there is a straight line through the center of the curves).\nThe code to reproduce the data and the experiment is as follows:\nThe code does contain some cloudscope utilities, but they\u0026rsquo;re not large and can be found in the cloudscope repository.\nFor the results reported, the code was run on a MacBook Pro (Retina, 15-inch, Early 2013) with a 2.8 GHz Intel Core i7 processor and 16 GB of 1600 MHz DDR3 memory, running OS X El Capitan Version 10.11.4.\nThe visualizations were generated with the following code:\nimport numpy as np import pandas as pd import seaborn as sns # Set the context and style sns.set_context(\u0026#39;talk\u0026#39;) sns.set_style(\u0026#39;whitegrid\u0026#39;) # Load the data data = pd.read_csv(\u0026#39;timestepping.csv\u0026#39;) # Plot the means of the experiment times by step. 
sns.lmplot( \u0026#39;steps\u0026#39;, \u0026#39;time\u0026#39;, size=12, x_estimator=np.mean, fit_reg=False, data=data ) # Plot the interact plot of all experiments sns.interactplot(\u0026#39;steps\u0026#39;, \u0026#39;until\u0026#39;, \u0026#39;time\u0026#39;, size=12, data=data) So the lesson is - yield at least three timesteps in your simulation!\n","permalink":"https://bbengfort.github.io/2016/04/lessons-in-discrete-event-simulation/","summary":"\u003cp\u003ePart of my research involves the creation of large scale distributed systems, and while we do build these systems and deploy them, we do find that simulating them for development and research gives us an advantage in trying new things out. To that end, I employ \u003ca href=\"https://en.wikipedia.org/wiki/Discrete_event_simulation\"\u003ediscrete event simulation\u003c/a\u003e (DES) using Python\u0026rsquo;s \u003ca href=\"https://simpy.readthedocs.org/en/latest/\"\u003eSimPy\u003c/a\u003e library to build very large simulations of distributed systems, such as the one I\u0026rsquo;ve built to inspect consistency patterns in variable latency, heterogenous, partition prone networks: \u003ca href=\"https://github.com/bbengfort/cloudscope\"\u003eCloudScope\u003c/a\u003e.\u003c/p\u003e","title":"Lessons in Discrete Event Simulation"},{"content":"Yesterday I wrote a blog about [extracting a corpus]({% post_url 2016-04-10-extract-ddl-corpus %}) from a directory containing Markdown, such as for a blog that is deployed with Silvrback or Jekyll. In this post, I\u0026rsquo;ll briefly show how to use the built in CorpusReader objects in nltk for streaming the data to the segmentation and tokenization preprocessing functions that are built into NLTK for performing analytics.\nThe dataset that I\u0026rsquo;ll be working with is the District Data Labs Blog, in particular the state of the blog as of today. 
The dataset can be downloaded from the ddl corpus, which also has the code in this post for you to use to perform other analytics.\nThe mdec.py program extracted our corpus in two formats: html and text. It also setup the corpus as follows:\nREADME describing the corpus (no extension) all text files in the same directory with the .txt or .html extension If this had been a categorized corpus, then we would have created subdirectories for each category in the corpus, and placed the correct files there. This organization has important implications for using the base readers without too much extension. Plus it helps others understand how to set up corpora with ease.\nReading Corpora NLTK\u0026rsquo;s CorpusReader objects provide a useful interface to streaming, end-to-end reads of a text corpus from multiple files on disk. To construct a corpus you need to pass the path to the directory containing the corpus, as well as a pattern for a regular expression matching the files that belong to the corpus. By default the CorpusReader opens everything with UTF-8 encoding and generally provides the following descriptive methods:\nreadme(): returns the contents of a README file citation(): returns the contents of a citation.bib file license(): returns the contents of a LICENSE file Generally speaking, your corpora should include all of these meta data files in the root directory in order to be considered complete.\nThere are many types of CorpusReader subclasses available in NLTK. The base classes provide readers for syntax corpora (those that are already structured as parses), bracket corpora (already part of speech tagged), and categorized corpora (documents associated with specific files). There are also a host of readers for the specific corpora that come included with NLTK. 
In general, these readers should provide an API that contains the following methods:\nparas(): returns an iterable of paragraphs (a list of lists of sentences) sents(): returns an iterable of sentences (a list of lists of words) words(): returns an iterable of words (a list of words) raw(): simply returns the raw text from the corpus Most CorpusReader classes can be accessed and filtered by a specific file or category or a list of files or categories. There are two primary methods for listing these if available to the corpus:\nfileids(): lists the names of the files that are in the corpus categories(): lists the names of the categories in the corpus This listing of API methods is by no means comprehensive. However, for most of the text analytics you\u0026rsquo;ll be doing, these methods will do the bulk of the work. I would consider a CorpusReader complete if it contained all of these methods.\nReading the Text Corpus The simplest thing to do is read our plaintext corpus, as we have to write no code to do so. Instead we can simply use the nltk.corpus.PlaintextCorpusReader directly, instantiating it with the correct path and pattern for our files. For the DDL corpus this looks something like the following:\nfrom nltk.corpus.reader.plaintext import PlaintextCorpusReader corpus = PlaintextCorpusReader(CORPUS_TEXT, \u0026#39;.*\\.txt\u0026#39;) That\u0026rsquo;s it! As long as we pass it a correct path to the corpus and a pattern for identifying text files, then we\u0026rsquo;re good to go! Note that the pattern is formatted as a Python regular expression, hence the escaped . \u0026ndash; unfortunately NLTK doesn\u0026rsquo;t use glob or other patterns for file identification.\nWe can now print out some information about our corpus using the reader directly:\nfrom nltk import FreqDist def corpus_info(corpus): \u0026#34;\u0026#34;\u0026#34; Prints out information about the status of a corpus. 
\u0026#34;\u0026#34;\u0026#34; fids = len(corpus.fileids()) paras = len(corpus.paras()) sents = len(corpus.sents()) sperp = sum(len(para) for para in corpus.paras()) / float(paras) tokens = FreqDist(corpus.words()) count = sum(tokens.values()) vocab = len(tokens) lexdiv = float(count) / float(vocab) print(( \u0026#34;Text corpus contains {} files\\n\u0026#34; \u0026#34;Composed of {} paragraphs and {} sentences.\\n\u0026#34; \u0026#34;{:0.3f} sentences per paragraph\\n\u0026#34; \u0026#34;Word count of {} with a vocabulary of {}\\n\u0026#34; \u0026#34;lexical diversity is {:0.3f}\u0026#34; ).format( fids, paras, sents, sperp, count, vocab, lexdiv )) And the result is:\nText corpus contains 17 files Composed of 1367 paragraphs and 2817 sentences. 2.061 sentences per paragraph Word count of 57762 with a vocabulary of 5602 lexical diversity is 10.311 Pretty simple!\nReading the HTML Corpus The PlaintextCorpusReader determined paragraphs as those separated by newlines, something that is not guaranteed for all corpora. HTML documents provide a bit more structure for us to parse, but there is no built in HTML corpus reader, unfortunately. Let\u0026rsquo;s take a look at how to extend our corpus reader to read HTML:\nimport bs4 class HTMLCorpusReader(PlaintextCorpusReader): tags = [ \u0026#39;h1\u0026#39;, \u0026#39;h2\u0026#39;, \u0026#39;h3\u0026#39;, \u0026#39;h4\u0026#39;, \u0026#39;h5\u0026#39;, \u0026#39;h6\u0026#39;, \u0026#39;h7\u0026#39;, \u0026#39;p\u0026#39;, \u0026#39;li\u0026#39; ] def _read_word_block(self, stream): soup = bs4.BeautifulSoup(stream, \u0026#39;lxml\u0026#39;) return self._word_tokenizer.tokenize(soup.get_text()) def _read_para_block(self, stream): \u0026#34;\u0026#34;\u0026#34; The stream is a single block (file) to extract paragraphs from. Method must return list(list(list(str))) of paragraphs, sentences, and words, so all tokenizers must be used here. 
\u0026#34;\u0026#34;\u0026#34; soup = bs4.BeautifulSoup(stream, \u0026#39;lxml\u0026#39;) paras = [] for para in soup.find_all(self.tags): paras.append([ self._word_tokenizer.tokenize(sent) for sent in self._sent_tokenizer.tokenize(para.text) ]) return paras The PlaintextCorpusReader accepts as additional input a word_tokenizer, a sent_tokenizer, and a para_block: functions that deal with tokenizing the text into various chunks. By default these are the wordpunct_tokenize, sent_tokenize, and blank line blocks reader, respectively.\nIn order to add different functionality, you can either pass a callable into the constructor, or you can override some internal methods. Note that you should not override the paras, sents, or words methods \u0026ndash; these methods handle the streaming. Instead you should override the following protected methods:\n_read_word_block: tokenizes 20 lines at a time from the stream. _read_sent_block: passes the file a paragraph at a time into the segmenter. _read_para_block: deals with a file at a time from the stream. Although protected, you can see how easy it is to get access to the block stream and override it. Here we simply look for a variety of tags to call \u0026ldquo;paragraphs\u0026rdquo; by using BeautifulSoup, then correctly return the segmented and tokenized text. Our word block tokenizer simply does an HTML strip tags.\n","permalink":"https://bbengfort.github.io/2016/04/nltk-corpus-reader/","summary":"\u003cp\u003eYesterday I wrote a blog about [extracting a corpus]({% post_url 2016-04-10-extract-ddl-corpus %}) from a directory containing Markdown, such as for a blog that is deployed with Silvrback or Jekyll. 
In this post, I\u0026rsquo;ll briefly show how to use the built in \u003ccode\u003eCorpusReader\u003c/code\u003e objects in \u003ccode\u003enltk\u003c/code\u003e for streaming the data to the segmentation and tokenization preprocessing functions that are built into NLTK for performing analytics.\u003c/p\u003e\n\u003cp\u003eThe dataset that I\u0026rsquo;ll be working with is the \u003ca href=\"http://blog.districtdatalabs.com/\"\u003eDistrict Data Labs Blog\u003c/a\u003e, in particular the state of the blog as of today. The dataset can be downloaded from the \u003ca href=\"http://bit.ly/ddl-blogs-corpus\"\u003eddl corpus\u003c/a\u003e, which also has the code in this post for you to use to perform other analytics.\u003c/p\u003e","title":"NLTK Corpus Reader for Extracted Corpus"},{"content":"We have some simple text analyses coming up and as an example, I thought it might be nice to use the DDL blog corpus as a data set. There are relatively few DDL blogs, but they all are long with a lot of significant text and discourse. It might be interesting to try to do some lightweight analysis on them.\nSo, how to extract the corpus? The DDL blog is currently hosted on Silvrback which is designed for text-forward, distraction-free blogging. As a result, there isn\u0026rsquo;t a lot of cruft on the page. I considered doing a scraper that pulled the web pages down or using the RSS feed to do the data ingestion. After all, I wouldn\u0026rsquo;t have to do a lot of HTML cleaning.\nThen I realized \u0026ndash; hey, we have all the Markdown in a repository!\nBy having everything in one place, as Markdown, I don\u0026rsquo;t have to do a search or a crawl to get all the links. Moreover, I get a bit finer-grained control of what text I want. The question came down to rendering \u0026ndash; do I try to analyze the Markdown, or do I render it into HTML?\nIn the end, I figured rendering the Markdown to HTML with Python would probably provide the best corpus result. 
I\u0026rsquo;ve created a tool that takes a directory of Markdown files, renders them as HTML or text and then creates the corpus organized directory expected by NLTK. Nicely, this also works with Jekyll! Here is the code:\nSorry that was so long, I tried to cut it down a bit, but the argparse stuff really does make it quite verbose. Still the basic methodology is to loop through all the files (recursively going down subdirectories) looking for *.md or *.markdown files. I then use the Python Markdown library with the markdown.extensions.extra package to render HTML, and to render the text from the HTML, I\u0026rsquo;m currently using BeautifulSoup get_text.\nNote also that this tool writes a README with information about the extraction. You can now use the nltk.corpus.PlaintextCorpusReader to get access to this text!\n","permalink":"https://bbengfort.github.io/2016/04/extract-ddl-corpus/","summary":"\u003cp\u003eWe have some simple text analyses coming up and as an example, I thought it might be nice to use the DDL blog corpus as a data set. There are relatively few DDL blogs, but they all are long with a lot of significant text and discourse. It might be interesting to try to do some lightweight analysis on them.\u003c/p\u003e\n\u003cp\u003eSo, how to extract the corpus? The \u003ca href=\"http://blog.districtdatalabs.com\"\u003eDDL blog\u003c/a\u003e is currently hosted on \u003ca href=\"https://www.silvrback.com/\"\u003eSilvrback\u003c/a\u003e which is designed for text-forward, distraction-free blogging. As a result, there isn\u0026rsquo;t a lot of cruft on the page. I considered doing a scraper that pulled the web pages down or using the RSS feed to do the data ingestion. After all, I wouldn\u0026rsquo;t have to do a lot of HTML cleaning.\u003c/p\u003e","title":"Extracting the DDL Blog Corpus"},{"content":"A while ago, I discussed the [observer pattern]({% post_url 2016-02-16-observer-pattern %}) for dispatching events based on a series of registered callbacks. 
In this post, I take a look at a similar, but very different methodology for dispatching based on type with pre-assigned handlers. For me, this is actually the more common pattern because the observer pattern is usually implemented as an API to outsider code. On the other hand, this type of dispatcher is usually a programmer\u0026rsquo;s pattern, used for development and decoupling.\nFor example, the project I\u0026rsquo;m currently working on involves replica servers handling remote procedure calls (RPCs) from remote servers. Each RPC is basically a typed packet of specific data, much like the arguments you would pass to a function. It\u0026rsquo;s completely intended for one single procedure on the local server. I treat RPCs as events because I research distributed systems (and messages are events, but more on that later) and so each replica server needs to route (dispatch) the RPC event to the correct handler.\nHowever, when you\u0026rsquo;re programming \u0026ndash; you\u0026rsquo;re basically naming things. So the question is, why create a mapping of message types to handlers when you already have the name of the event? Isn\u0026rsquo;t there some way to do this automatically? The answer is, yes of course there is. This gives you the following benefits:\nEasy extensibility: create an event type and handler with the same name. No magic strings that may be typo\u0026rsquo;d! Single point of dispatch, no need to subclass your routing. A clear and understandable API for future you. So the strategy is to create types (classes for the point of this discussion) that can be identified by name. Then create a dispatcher that uses that name, automatically looks up the appropriate handler based on that name, and calls it. The code to do so is as follows:\nOk, so there are a couple of extra things here, specifically the need to do things in PEP8 naming style. The type names should be in CamelCase while the method names should be in snake_case. 
It\u0026rsquo;s not trivial to put together helper functions to transform strings to camel case, or to snake case. You can use generators, string processing, regular expressions, transformers, and more.\nIn the snippet I\u0026rsquo;ve included the methods that I prefer (using regular expressions that are compiled in advance for performance). Moreover, since this is so common to add to code, I\u0026rsquo;ve not only included a downloadable Gist of the code, but also tests so that you can easily add it to your code base.\n","permalink":"https://bbengfort.github.io/2016/04/dispatching-types-handler-methods/","summary":"\u003cp\u003eA while ago, I discussed the [observer pattern]({% post_url 2016-02-16-observer-pattern %}) for dispatching events based on a series of registered callbacks. In this post, I take a look at a similar, but very different methodology for dispatching based on type with pre-assigned handlers. For me, this is actually the more common pattern because the observer pattern is usually implemented as an API to outsider code. On the other hand, this type of dispatcher is usually a programmer\u0026rsquo;s pattern, used for development and decoupling.\u003c/p\u003e","title":"Dispatching Types to Handler Methods"},{"content":"These snippets are just a short reminder of how class variables work in Python. I understand this topic a bit too well, I think; I always remember the gotchas and can\u0026rsquo;t remember which gotcha belongs to which important detail. I generally come up with the right answer then convince myself I\u0026rsquo;m wrong until I write a bit of code and experiment. 
Hopefully this snippet will shortcut that process.\nConsider the following class hierarchy:\nfrom itertools import count class Foo(object): counter = count() def __init__(self): self.id = self.counter.next() def __str__(self): return \u0026#34;{} {}\u0026#34;.format(self.__class__.__name__, self.id) class Bar(Foo): pass class Baz(Bar): pass Every instance of Foo will be assigned a unique, automatically incrementing id using the count iterator from itertools. The thing to remember is that instances of Bar and Baz are also instances of Foo:\n\u0026gt;\u0026gt;\u0026gt; isinstance(Baz(), Foo) True Keep that in mind given the following code:\n\u0026gt;\u0026gt;\u0026gt; import random \u0026gt;\u0026gt;\u0026gt; things = (Foo, Bar, Baz) \u0026gt;\u0026gt;\u0026gt; for _ in xrange(10): ... print random.choice(things)() What is the expected result? If you said something like the following:\nBar 0 Baz 1 Foo 2 Baz 3 Foo 4 Bar 5 Bar 6 Bar 7 Foo 8 Bar 9 Then you\u0026rsquo;re on the right track.\nThe problem is that the code above is typically not what is meant by programmers. And while I typically come to the conclusion that what I\u0026rsquo;m actually expressing by the above code is counting instances of Foo, what I actually want to do is count instances of each class (how many of each Foo, Bar, and Baz).\nThen I realize \u0026hellip; oh crap, I\u0026rsquo;ve strayed into metaprogramming land. And that\u0026rsquo;s why I need the reminder of this post. I definitely get that I need a metaclass to make subclass counters work as expected, but I never remember exactly how to do it. So here\u0026rsquo;s how.\nclass Countable(type): def __new__(cls, name, bases, attrs): attrs[\u0026#39;counter\u0026#39;] = count() return super(Countable, cls).__new__(cls, name, bases, attrs) class Foo(object): __metaclass__ = Countable Basically, what we\u0026rsquo;ve done here is tell the Foo class that it should be constructed using Countable instead of type. 
Therefore, when the class is created, it is given the class attribute counter. Now the output is as follows:\nFoo 0 Bar 0 Foo 1 Baz 0 Bar 1 Foo 2 Foo 3 Foo 4 Baz 1 Foo 5 This post isn\u0026rsquo;t about a long discussion on the metaclass in Python or how type is a subclass of type, but simply serves as a reminder for the very rare occasion that I have to rock something other than type. For more information on the subject, a very nice write-up, A Primer on Python Metaclasses by @jakevdp is the way to go.\n","permalink":"https://bbengfort.github.io/2016/04/class-variables/","summary":"\u003cp\u003eThese snippets are just a short reminder of how class variables work in Python. I understand this topic a bit too well, I think; I always remember the gotchas and can\u0026rsquo;t remember which gotcha belongs to which important detail. I generally come up with the right answer then convince myself I\u0026rsquo;m wrong until I write a bit of code and experiment. Hopefully this snippet will shortcut that process.\u003c/p\u003e\n\u003cp\u003eConsider the following class hierarchy:\u003c/p\u003e","title":"Class Variables"},{"content":"I was talking with @looselycoupled the other day about how we generate passwords for use on websites. We both agree that every single domain should have its own password (to prevent one crack ruling all your Internets). However, we\u0026rsquo;ve both evolved on the method over time, and I\u0026rsquo;ve written a simple script that allows me to generate passwords using methodologies discussed in this post.\nIn particular I use the generator to create passwords for pwSafe, the tool I currently use for password management (due to its use of the open source database format created by Bruce Schneier). 
It is my hope that this script can be embedded directly into pwSafe, or at least allow me to write directly to the database; but for now I just copy and paste with the pbcopy utility.\nThe point of this post, though, is not the generator (though I hope it can be useful). The point of the post is that because I do not believe in security through obscurity or obfuscation I want to expose my techniques to criticism and testing publicly. I believe that an approach to security must be ongoing, living, and continuous. While the foundation of security revolves around some shared secret (a key, a password) that must not be made public, the way that secret is used should be open and constantly inspected for flaws.\nSo although this post isn\u0026rsquo;t really about anything and is quite simple, it does put my money where my mouth is, philosophically speaking. If you\u0026rsquo;re reading this, I hope you read with a critical eye and report any flaws you spot in the comments of the snippet on GitHub Gist.\nPrevious Approaches My original approach was pretty simple: keep three versions of the same password in a variety of strengths:\npassword p4ssw0rd p4$$w0rd In order to generate domain specific passwords, just append a few characters from the site. So for example, for Facebook, Gmail, and PayPal you might use passwordfb, p4ssw0rdgmail, and p4$$w0rdpp respectively, in order of increasing \u0026ldquo;security\u0026rdquo;. The benefit of this approach is simple, memorable passwords that meet a variety of security features required by websites.\nOf course, the flaws are numerous here. All passwords should have the same level of security, using a variety of characters. Additionally the use of a discoverable pattern could make a targeted attack easy. The real problem for me, however, was with sites that required a routine password change (which should be done anyway): this method does not lend itself to flexibility. 
It does, however, highlight the trade-off between ease of use and security, which led me to my next approach.\nThe next approach did the same thing, but instead of adding letters and symbols, I used a long, correct sentence inspired by the xkcd Password Strength comic. A characteristic of the domain is simply fit into the base sentence:\nbase: \u0026quot;Silly banana, happy elephant.\u0026quot; facebook: \u0026quot;Silly facebook, happy elephant.\u0026quot; This allowed me to change the password by moving the domain phrase around or performing other manipulations, while also having longer, more secure passwords that I could easily remember. Still, having a template seemed like a bad idea.\nPassword Management After a while, however, I finally started using pwSafe and started generating random passwords for each site. This is clearly the best way to do things but in all honesty - the most important thing to secure is your email and any social auth accounts. To that end, I set up 2 factor authentication using Google Authenticator as much as I could, and text messaging where I couldn\u0026rsquo;t.\nI don\u0026rsquo;t think I\u0026rsquo;m paranoid, this just seems like good Internet hygiene. The appendix includes a list of the sites where I use 2 factor authentication; please let me know if there are any others I should include.\nThe issue, of course, is that I\u0026rsquo;m paranoid that I\u0026rsquo;ll lose or misplace my pwSafe master key, have the database corrupted, or otherwise be locked out of my accounts. I want a methodology that creates passwords that are as secure as possible that I can also reproduce on my own. The password should be domain specific, unique, long, and contain a variety of characters.\nReproducible Methodology So here\u0026rsquo;s the idea:\nCreate a master password Create a password string by concatenating the master, the base url of the site, and some salt or incrementing value. 
Hash the password string Encode the hash and use as password. Now all you have to remember is the master password and the salt or incrementing value on a per-domain basis.\nIssues:\nWhat encoding scheme do you choose? Some sites require special characters and numbers, others cannot accept them. The domain of the characters is now also reduced to 16 chars for hex encoding, or 64 chars for base64 (even if you use one with special characters like hexbin). How do you verify the master password (ensuring there was no typo). How do you remember the salt? Perhaps something device specific like a UUID? I\u0026rsquo;m sure there are more issues here as well.\nHowever, I\u0026rsquo;m hoping this is good enough when combined with 2FA, this password is just a first step, and gives me the ability to reproduce the sequence of characters for crucial sites in case the pwSafe database is corrupted. For most sites, I\u0026rsquo;ll simply use a randomly generated password and rely on the \u0026ldquo;forgot password\u0026rdquo; link.\nConclusion So in this post, I\u0026rsquo;ve presented a simple, reproducible method for generating somewhat secure passwords that I intend to use in combination with two factor authentication. It is my hope that these passwords are long enough, unique to the domain, but able to be generated on demand.\nThis is a list of the sites that I have set up 2 factor authentication with, for reference.\nGoogle and Gmail Accounts Apple ID GitHub Facebook Twitter PayPal Heroku AWS LinkedIn ","permalink":"https://bbengfort.github.io/2016/03/password-generator/","summary":"\u003cp\u003eI was talking with \u003ca href=\"https://github.com/looselycoupled\"\u003e@looselycoupled\u003c/a\u003e the other day about how we generate passwords for use on websites. We both agree that every single domain should have its own password (to prevent one crack ruling all your Internets). 
However, we\u0026rsquo;ve both evolved on the method over time, and I\u0026rsquo;ve written a simple script that allows me to generate passwords using methodologies discussed in this post.\u003c/p\u003e\n\u003cp\u003eIn particular I use the generator to create passwords for \u003ca href=\"http://pwsafe.info/\"\u003epwSafe\u003c/a\u003e, the tool I currently use for password management (due to its use of the \u003ca href=\"https://raw.githubusercontent.com/jpvasquez/PasswordSafe/master/docs/formatV3.txt\"\u003eopen source database format\u003c/a\u003e created by \u003ca href=\"https://www.schneier.com/blog/archives/2005/06/password_safe.html\"\u003eBruce Schneier\u003c/a\u003e). It is my hope that this script can be embedded directly into pwSafe, or at least allow me to write directly to the database; but for now I just copy and paste with the \u003ccode\u003epbcopy\u003c/code\u003e utility.\u003c/p\u003e","title":"Simple Password Generation"},{"content":"Happy Pi day! As is the tradition at the University of Maryland (and to a certain extent, in my family) we are celebrating March 14 with pie and Pi. A shoutout to @konstantinosx who, during last year\u0026rsquo;s Pi day, requested blueberry pie, which was the strangest pie request I\u0026rsquo;ve received for Pi day. Not that blueberry pie is strange, just that someone would want one so badly for Pi day (he got a mixed berry pie).\nIn honor of Pi day, I\u0026rsquo;ve created a simple grid visualization of the digits of Pi, computed using the Chudnovsky algorithm with Python. The grid is displayed using some simple matplotlib code (yes, this is what \u0026ldquo;simple\u0026rdquo; matplotlib code looks like). 
This unfortunately took a bit longer than expected (see updates below), but in the end, I ended up with the following visualizations for 1024, 5625, and 10000 digits of Pi:\nUPDATE 0: 1,024 digits took 16.052 seconds, even though 10,000 and 5,625 digits went into \u0026ldquo;Python not responding\u0026rdquo; mode after 3-4 minutes. Noticed no significant memory or CPU usage. Going to just let the 5k digits run for now.\nUPDATE 1: 5,625 digits took 3053.405 seconds (about 51 minutes) on my Macbook Pro. (10k digits running now).\nUPDATE 2: 10,000 digits took 18661.043 seconds (5 hours, 11 minutes), all the digits printed out, but the visualization caught a SEGFAULT. Seriously, matplotlib? Anyway, a quick copy and paste of the digits saved the viz!\nThe Python script that does the work of generating both Pi and the grid is as follows:\nThe Pi digits computation uses the decimal.Decimal data structure in Python as well as setting a context to ensure that we lose no information by overflowing a float or a double precision data type. This code isn\u0026rsquo;t my own, see the Stack Overflow link in the comment for more on it.\nThe visualization is done using the matshow function in matplotlib, which I\u0026rsquo;ve used in the past to visualize cellular automata and even animate them with matplotlib. It simply assigns a color from the color map to each digit and fills in the appropriate square. I don\u0026rsquo;t like the color map, but I didn\u0026rsquo;t take the time to put together a categorical colormap with a better scheme (sorry).\nAppendix: SEGFAULT So this happened after 5 hours of computing Pi:\nAppendix: Compute Times Compute times were as follows:\ndigits time (sec) accuracy 1024 16.052 TBD 5625 3053.405 TBD 10000 18661.043 TBD ","permalink":"https://bbengfort.github.io/2016/03/pi-day/","summary":"\u003cp\u003eHappy Pi day! As is the tradition at the University of Maryland (and to a certain extent, in my family) we are celebrating March 14 with pie and Pi. 
A shoutout to \u003ca href=\"https://github.com/konstantinosx/\"\u003e@konstantinosx\u003c/a\u003e who, during last year\u0026rsquo;s Pi day, requested blueberry pie, which was the strangest pie request I\u0026rsquo;ve received for Pi day. Not that blueberry pie is strange, just that someone would want one so badly for Pi day (he got a mixed berry pie).\u003c/p\u003e","title":"Visualizing Pi with matplotlib"},{"content":"You may have seen the following type of header at the top of my source code:\n# main # short description # # Author: Benjamin Bengfort \u0026lt;benjamin@bengfort.com\u0026gt; # Created: Tue Mar 08 14:07:24 2016 -0500 # # Copyright (C) 2016 Bengfort.com # For license information, see LICENSE.txt # # ID: main.py [] benjamin@bengfort.com $ All of this is pretty self explanatory with the exception of the final line. This final line is a throw back to Subversion actually, when you could add a $Id$ tag to your code, and Subversion would automatically populate it with something that looks like:\n$Id: test.php 110 2009-04-28 05:20:41Z dordal $ This was pretty cool for tracking who did what in files and for specifically finding the correct version of changes. Back when I first started programming, I used Mercurial for revision control, and we had a pre-commit hook that would automatically populate our files with this line and the first seven characters of the SHA-1 hash for the commit. This line has remained in my code banner ever since, even when I switched to Git, and as a result, much of my code doesn\u0026rsquo;t have this ID!\nThe problem is that this really only works in a centralized version control system - because you know what the next commit will be before you do it. In a decentralized VCS like Git (and Mercurial for that matter), the commit identity is not known until after commit and merge. Moreover, if you change the file in place in a pre-commit hook in Git, then the staging index is modified and the hash changes. 
It\u0026rsquo;s definitely not very easy.\nStill, I really wanted to populate these tags with something - so instead of adding the revision that the file was modified, I changed it to the revision that the file was created. At the very least now you can look at the file and go to the revision in GitHub and follow the changes to the file through revisions. So I came up with the following:\nThe first problem was reading the commits from the local repository. I added the gitpython dependency so I would have access to Git from Python. Trust me, it was not easy to figure out how to create a simple mapping of paths in the repository to commit objects. As you can see in the snippet, I finally figured out I could simply iterate through all the commits, then traverse the tree of that commit. Because the iter_commits goes backward through commit history, it has the effect that the file\u0026rsquo;s creation commit is the last stored in the dictionary. This is, however, at the cost of having to iterate through every tree of every commit to get the mapping. I tried using diffs and other tools, but they wouldn\u0026rsquo;t do exactly what I wanted.\nNow as you can see in the Baleen repository, the files all have version information tagged in their headers as follows:\n# ID: opml.py [b2f890b] benjamin@bengfort.com $ Is it perfect? No, I\u0026rsquo;d still like to have latest commits, but if I commit the files with the new header, then that will be the latest commit, and that\u0026rsquo;s definitely not the effect I want. To really get detailed I\u0026rsquo;d have to check the diff to see if it was only the line it changed, and that seems like too much work. 
However, now at least my ID lines have something meaningful in them, so that\u0026rsquo;s nice.\n","permalink":"https://bbengfort.github.io/2016/03/git-version-id/","summary":"\u003cp\u003eYou may have seen the following type of header at the top of my source code:\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-python\" data-lang=\"python\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# main\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# short description\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e#\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Author:   Benjamin Bengfort \u0026lt;benjamin@bengfort.com\u0026gt;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Created:  Tue Mar 08 14:07:24 2016 -0500\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e#\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# Copyright (C) 2016 Bengfort.com\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e# For license information, see LICENSE.txt\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"c1\"\u003e#\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan 
class=\"c1\"\u003e# ID: main.py [] benjamin@bengfort.com $\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eAll of this is pretty self explanatory with the exception of the final line. This final line is a throw back to Subversion actually, when you could add a \u003ccode\u003e$Id$\u003c/code\u003e tag to your code, and Subversion would \u003ca href=\"http://www.startupcto.com/server-tech/subversion/setting-the-id-tag\"\u003eautomatically populate\u003c/a\u003e it with something that looks like:\u003c/p\u003e","title":"Adding a Git Commit to Header Comments"},{"content":"Programming life has finally caused me to give into something that I\u0026rsquo;ve resisted for a while: the creation of a Bengfort Toolkit and specifically a benlib. This post is mostly a reminder that this toolkit now exists and that I spent valuable time creating it against my better judgement. And as a result, I should probably use it and update it.\nI\u0026rsquo;ve already written (whoops, I almost said “you\u0026rsquo;ve already read” but I know no one reads this) posts about tools that I use frequently including [clock.py]({% post_url 2016-01-12-codetime-and-clock %}) and [requires]({% post_url 2016-01-21-freezing-requirements %}). These things have been simply Python scripts that I\u0026rsquo;ve put in ~/bin, which is part of my $PATH. These are too small or simple to require full blown repositories and PyPI listings on their own merit. Plus, I honestly believe that I\u0026rsquo;m the only one that uses them.\nDependencies are the problem though. For example, clock.py requires python-dateutil. This has resulted in my pretty much installing the dependent packages in every single one of my repositories. You may have noticed and removed extra dependencies on a pull request. However, I could live with that until \u0026hellip;\nI needed gitpython for another quick script to modify the ID line of my codebase. 
Finally, I couldn\u0026rsquo;t just stick the dependency in all of my projects; I needed management. Also, this third script put me over the top on managing all the stuff that I have going on. The toolkit was created.\nI\u0026rsquo;m excited because now I can pip install bengfort-toolkit anywhere I go, including on remote machines. But I\u0026rsquo;m still a little bit wary of having to maintain this toolkit. Particularly when stuff like this happens:\nimport sys import fileinput def comment_file(path): \u0026#34;\u0026#34;\u0026#34; Comments out a Python file so that it can\u0026#39;t be imported. \u0026#34;\u0026#34;\u0026#34; for line in fileinput.input(path, inplace=True): if not line.startswith(\u0026#34;#\u0026#34;): line = \u0026#34;# \u0026#34; + line sys.stdout.write(line) Do you know what this does (I mean other than the obvious of printing out Python code that is commented out)? It literally modifies the file in place by moving the original file to a \u0026ldquo;.bak\u0026rdquo; extension, reading it, then hooking stdout up to the original file descriptor for writing. The module will then remove the \u0026ldquo;.bak\u0026rdquo; if it successfully completes.\nMind. Blown. Unix programmers, am I right?\nAlso, seriously?\nPreviously:\nimport os import shutil import tempfile def comment_file(path): \u0026#34;\u0026#34;\u0026#34; Writes commented file to a temporary file. Then moves the temp file to original location. \u0026#34;\u0026#34;\u0026#34; with open(path, \u0026#39;r\u0026#39;) as f: # mkstemp returns an OS-level file descriptor and a path fd, tmp = tempfile.mkstemp() with os.fdopen(fd, \u0026#39;w\u0026#39;) as o: for line in f: if not line.startswith(\u0026#39;#\u0026#39;): line = \u0026#34;# \u0026#34; + line o.write(line) shutil.copy(tmp, path) os.remove(tmp) And the conflict is that I\u0026rsquo;m not sure what I prefer. When I open source stuff and expect other folks to use it, I feel confident that stuff like this can be resolved by general consensus. 
If others who use open source tools decide that one method is less secure than the other, they will update it. Bengfort Toolkit has none of these safe guards. But hopefully the toolkit will help me manage my own development workflow.\n","permalink":"https://bbengfort.github.io/2016/03/toolkit/","summary":"\u003cp\u003eProgramming life has finally caused me to give into something that I\u0026rsquo;ve resisted for a while: the creation of a \u003ca href=\"https://github.com/bbengfort/toolkit\"\u003eBengfort Toolkit\u003c/a\u003e and specifically a \u003ccode\u003ebenlib\u003c/code\u003e. This post is mostly a reminder that this toolkit now exists and that I spent valuable time creating it against my better judgement. And as a result, I should probably use it and update it.\u003c/p\u003e\n\u003cp\u003eI\u0026rsquo;ve already written (whoops, I almost said “you\u0026rsquo;ve already read” but I know no one reads this) posts about tools that I use frequently including [clock.py]({% post_url 2016-01-12-codetime-and-clock %}) and [requires]({% post_url 2016-01-21-freezing-requirements %}). These things have been simply Python scripts that I\u0026rsquo;ve put in \u003ccode\u003e~/bin\u003c/code\u003e, which is part of my \u003ccode\u003e$PATH\u003c/code\u003e. These are too small or simple to require full blown repositories and PyPI listings on their own merit. Plus, I honestly believe that I\u0026rsquo;m the only one that uses them.\u003c/p\u003e","title":"The Bengfort Toolkit"},{"content":" This post is an early draft of expanded work that will eventually appear on the District Data Labs Blog. Your feedback is welcome, and you can submit your comments on the draft GitHub issue.\nIn order to learn (or teach) data science you need data (surprise!). The best libraries often come with a toy dataset to show examples and how the code works. 
However, nothing can replace an actual, non-trivial dataset for a tutorial or lesson because it provides for deep and meaningful further exploration. Non-trivial datasets can provide surprise and intuition in a way that toy datasets just cannot. Unfortunately, non-trivial datasets can be hard to find for a few reasons, but one common reason is that the dataset contains personally identifying information (PII).\nA possible solution to dealing with PII is to anonymize1 the data set by replacing information that can identify a real individual with information about a fake (but similarly behaving) individual. Unfortunately, this is not as easy as it sounds at a glance. A simple mapping of real data to randomized data is not enough because anonymization needs to preserve the semantics of the dataset in order to be used as a stand-in for analytical purposes. As a result, issues related to entity resolution2 like managing duplicates or producing linkable results come into play.\nThe good news is that we can take a cue from the database community, who routinely generate data sets in order to evaluate the performance of a database system. This community, especially in a web or test-driven development context, has a lot of tools for generating very realistic data for a variety of information types. For this post, I\u0026rsquo;ll explore using the Faker library to generate a realistic, anonymized dataset that can be utilized for downstream analysis.\nThe goal can therefore be summarized as follows: given a target dataset (let\u0026rsquo;s say for simplicity, a CSV file with multiple columns), produce a new dataset such that for each row in the target, the anonymized dataset does not contain any personally identifying information. 
The anonymized dataset should have the same amount of data, as well as maintain its value for analysis.\nAnonymizing CSV Data In this example we\u0026rsquo;re going to do something very simple, we\u0026rsquo;re going to anonymize only two fields: full name and email. Sounds easy, right? The issue is that we want to preserve the semantic relationships and patterns in our target dataset so that we can hand it off to be analyzed or mined for interesting patterns. What happens if there are multiple rows per user? Since CSV data is naturally denormalized (e.g. contains redundant data like rows with repeated full names and emails) we will need to maintain a mapping of profile information.\nNote: Since we\u0026rsquo;re going to be using Python 2.7 in this example, you\u0026rsquo;ll need to install the unicodecsv module with pip. Additionally you\u0026rsquo;ll need the Faker library:\n$ pip install fake-factory unicodecsv The following example shows a simple anonymize_rows function that maintains this mapping and also shows how to generate data with Faker. We\u0026rsquo;ll also go a step further and read the data from a source CSV file and write the anonymized data to a target CSV file. The end result is that the file should be very similar in terms of length, row order, and fields, the only difference being that names and emails have been replaced with fake names and emails.\nimport unicodecsv as csv from faker import Factory from collections import defaultdict def anonymize_rows(rows): \u0026#34;\u0026#34;\u0026#34; Rows is an iterable of dictionaries that contain a name and email field that need to be anonymized. \u0026#34;\u0026#34;\u0026#34; # Load the faker and its providers faker = Factory.create() # Create mappings of names \u0026amp; emails to faked names \u0026amp; emails. names = defaultdict(faker.name) emails = defaultdict(faker.email) # Iterate over the rows and yield anonymized rows. for row in rows: # Replace the name and email fields with faked fields. 
row[\u0026#39;name\u0026#39;] = names[row[\u0026#39;name\u0026#39;]] row[\u0026#39;email\u0026#39;] = emails[row[\u0026#39;email\u0026#39;]] # Yield the row back to the caller yield row def anonymize(source, target): \u0026#34;\u0026#34;\u0026#34; source is a path to a CSV file containing data to anonymize. target is a path to write the anonymized CSV data to. \u0026#34;\u0026#34;\u0026#34; with open(source, \u0026#39;rU\u0026#39;) as f: with open(target, \u0026#39;w\u0026#39;) as o: # Use the DictReader to easily extract fields reader = csv.DictReader(f) writer = csv.DictWriter(o, reader.fieldnames) # Read and anonymize data, writing to target file. for row in anonymize_rows(reader): writer.writerow(row) The entry point for this code is the anonymize function itself. It takes as input the path to two files: the source, where the target data is held in CSV form, and target a path to write out the anonymized data to. Both of these paths are opened for reading and writing respectively, then the unicodecsv module is used to read and parse each row, transforming them into Python dictionaries. Those dictionaries are passed into the anonymize_rows function, which transforms and yields each row to be written by the CSV writer to disk.\nThe anonymize_rows function takes any iterable of dictionaries which contain name and email keys. It loads the fake factory using Factory.create - a class function that loads various providers with methods that generate fake data (more on this later). We then create two defaultdict to map names to fake names and emails to fake emails.\nThe Python collections module provides the defaultdict which is similar to a regular dict except that if the key does not exist in the dictionary, a default value is supplied by the callable passed in at instantiation. For example, d = defaultdict(int) would provide a default value of 0 for every key not already in the dictionary. 
Therefore when we use defaultdict(faker.name) we\u0026rsquo;re saying that for every key not in the dictionary, create a fake name (and similar for email). This allows us to generate a mapping of real data to fake data, and make sure that the real value always maps to the same fake value.\nFrom there we simply iterate through all the rows, replacing data as necessary. If our target CSV file looked like this (imagine clickstream data from an email marketing campaign):\nname,email,value,time,ipaddr James Hinglee,jhinglee@gmail.com,a,1446288248,202.12.32.123 Nancy Smithfield,unicorns4life@yahoo.com,b,1446288250,67.212.123.201 J. Hinglee,jhinglee@gmail.com,b,1446288271,202.12.32.123 It would be transformed to something as follows:\nMr. Sharif Lehner,keion.hilll@gmail.com,a,1446288248,202.12.32.123 Webster Kulas,nienow.finnegan@gmail.com,b,1446288250,67.212.123.201 Maceo Turner MD,keion.hilll@gmail.com,b,1446288271,202.12.32.123 We now have a new wrangling tool in our toolbox that will allow us to transform CSVs with name and email fields into anonymized datasets! This naturally leads us to the question: what else can we anonymize?\nGenerating Fake Data There are two third party libraries for generating fake data with Python that come up on Google search results: Faker by @deepthawtz and Fake Factory by @joke2k, which is also called “Faker”. Faker provides anonymization for user profile data, which is completely generated on a per-instance basis. Fake Factory (used in the example above) uses a providers approach to load many different fake data generators in multiple languages. 
Because Fake Factory supports multiple languages and has a wider array of fake data generators, I typically use it over the more intuitive and simpler Faker library, which only generates fake user profiles. We\u0026rsquo;ll inspect Fake Factory in detail for the rest of this post (everywhere except in this paragraph, when I refer to Faker, I\u0026rsquo;m referring to Fake Factory).\nThe primary interface that Faker provides is called a Generator. Generators are a collection of Provider instances which are responsible for formatting random data for a particular domain. Generators also provide a wrapper around the random module, and allow you to set the random seed and perform other operations. While you could theoretically instantiate your own Generator with your own providers, Faker provides a Factory to automatically load all the providers on your behalf:\n\u0026gt;\u0026gt;\u0026gt; from faker import Factory \u0026gt;\u0026gt;\u0026gt; fake = Factory.create() If you inspect the fake object, you\u0026rsquo;ll see around 158 methods (at the time of this writing) that all generate fake data. Please allow me to highlight a few:\n\u0026gt;\u0026gt;\u0026gt; fake.credit_card_number() u\u0026#39;180029425031151\u0026#39; \u0026gt;\u0026gt;\u0026gt; fake.military_ship() u\u0026#39;USCGC\u0026#39; \u0026gt;\u0026gt;\u0026gt; (fake.latitude(), fake.longitude()) (Decimal(\u0026#39;-39.4682475\u0026#39;), Decimal(\u0026#39;50.449170\u0026#39;)) \u0026gt;\u0026gt;\u0026gt; fake.hex_color() u\u0026#39;#559135\u0026#39; \u0026gt;\u0026gt;\u0026gt; fake.pyset(3) set([u\u0026#39;Et possimus.\u0026#39;, u\u0026#39;Blanditiis vero.\u0026#39;, u\u0026#39;Ad odio ad qui.\u0026#39;, 9855]) Importantly, providers can also be localized using a language code, and this is probably the best reason to use the Factory object: it ensures that localized providers, or subsets of providers, are loaded correctly. 
For example, to load the French localization:\n\u0026gt;\u0026gt;\u0026gt; fake = Factory.create(\u0026#39;fr_FR\u0026#39;) \u0026gt;\u0026gt;\u0026gt; fake.catch_phrase_verb() u\u0026#34;d\u0026#39;atteindre vos buts\u0026#34; And for fun, some Chinese:\n\u0026gt;\u0026gt;\u0026gt; fake = Factory.create(\u0026#39;zh_CN\u0026#39;) \u0026gt;\u0026gt;\u0026gt; print fake.company() 快讯科技有限公司 As you can see, there is a wide variety of tools and techniques to generate fake data for many domains. The best way to explore all the providers in detail is simply to look at the providers package on GitHub.\nCreating A Provider Although the Faker library has a very comprehensive array of providers, occasionally you need a domain-specific fake data generator. In order to add a custom provider, you will need to subclass the BaseProvider and expose custom fake methods as class methods using the @classmethod decorator. One very easy approach is to create a set of random data you\u0026rsquo;d like to expose, and randomly select from it:\nfrom faker.providers import BaseProvider class OceanProvider(BaseProvider): __provider__ = \u0026#34;ocean\u0026#34; __lang__ = \u0026#34;en_US\u0026#34; oceans = [ u\u0026#39;Atlantic\u0026#39;, u\u0026#39;Pacific\u0026#39;, u\u0026#39;Indian\u0026#39;, u\u0026#39;Arctic\u0026#39;, u\u0026#39;Southern\u0026#39;, ] @classmethod def ocean(cls): return cls.random_element(cls.oceans) In order to change the likelihood or distribution of which oceans are selected, simply add duplicates to the oceans list so that each name has the probability of selection that you\u0026rsquo;d like. 
Then add your provider to the Faker object:\n\u0026gt;\u0026gt;\u0026gt; fake = Factory.create() \u0026gt;\u0026gt;\u0026gt; fake.add_provider(OceanProvider) \u0026gt;\u0026gt;\u0026gt; fake.ocean() u\u0026#39;Indian\u0026#39; In routine data wrangling operations, you may create a package structure with localization similar to how Faker is organized and load things on demand. Don\u0026rsquo;t forget — if you come up with a generic provider that may be useful to many people, submit it back as a pull request!\nMaintaining Data Quality Now that we understand the wide variety of fake data we can generate, let\u0026rsquo;s get back to our original example of creating user profile data of just name and email address. First, if you look at the results in the section above, we can make a few observations:\nPro: exact duplicates of name and email are maintained via the mapping. Pro: our user profiles are now fake data and PII is protected. Con: the name and the email are weird and don\u0026rsquo;t match. Con: fuzzy duplicates (e.g. J. Smith vs. John Smith) are blown away. Con: all the domains are \u0026ldquo;free email\u0026rdquo; like Yahoo and Gmail. Basically we want to improve our user profile to include email addresses that are similar to the names (or a non-name based username), and we want to ensure that the domains are a bit more realistic for work addresses. We also want to include aliases, nicknames, or different versions of the name. 
Faker does provide a profile provider:\n\u0026gt;\u0026gt;\u0026gt; fake.simple_profile() u\u0026#39;{ \u0026#34;username\u0026#34;: \u0026#34;autumn.weissnat\u0026#34;, \u0026#34;name\u0026#34;: \u0026#34;Jalyn Crona\u0026#34;, \u0026#34;birthdate\u0026#34;: \u0026#34;1981-01-29\u0026#34;, \u0026#34;sex\u0026#34;: \u0026#34;F\u0026#34;, \u0026#34;address\u0026#34;: \u0026#34;Unit 2875 Box 1477\\nDPO AE 18742-1954\u0026#34;, \u0026#34;mail\u0026#34;: \u0026#34;zollie.schamberger@hotmail.com\u0026#34; }\u0026#39; But as you can see, it suffers from the same problem. In this section, we\u0026rsquo;ll explore different techniques that allow us to pass over the data and modify our fake data generation such that it matches the distributions we\u0026rsquo;re seeing in the original data set. In particular we\u0026rsquo;ll deal with the domain, creating more realistic fake profiles, and adding duplicates to our data set with fuzzy matching.\nDomain Distribution One idea to maintain the distribution of domains is to do a first pass over the data and create a mapping of real domain to fake domain. Moreover, many domains like gmail.com can be whitelisted and mapped directly to itself (we just need a fake username). Additionally, we can also preserve capitalization and spelling via this method, e.g. “Gmail.com” and “GMAIL.com” which might be important for data sets that have been entered by hand.\nIn order to create the domain mapping/whitelist, we\u0026rsquo;ll need to create an object that can load a whitelist from disk, or generate one from our original dataset. I propose the following utility:\nimport csv import json from faker import Factory from collections import Counter from collections import MutableMapping class DomainMapping(MutableMapping): @classmethod def load(cls, fobj): \u0026#34;\u0026#34;\u0026#34; Load the mapping from a JSON file on disk. 
\u0026#34;\u0026#34;\u0026#34; data = json.load(fobj) return cls(**data) @classmethod def generate(cls, emails): \u0026#34;\u0026#34;\u0026#34; Pass through a list of emails and count domains to whitelist. \u0026#34;\u0026#34;\u0026#34; # Count all the domains in each email address counts = Counter([ email.split(\u0026#34;@\u0026#34;)[-1] for email in emails ]) # Create a domain mapping domains = cls() # Ask the user what domains to whitelist based on frequency for idx, (domain, count) in enumerate(counts.most_common()): prompt = \u0026#34;{}/{}: Whitelist {} ({} addresses)?\u0026#34;.format( idx+1, len(counts), domain, count ) print prompt ans = raw_input(\u0026#34;[y/n/q] \u0026gt; \u0026#34;).lower() if ans.startswith(\u0026#39;y\u0026#39;): # Whitelist the domain domains[domain] = domain elif ans.startswith(\u0026#39;n\u0026#39;): # Create a fake domain domains[domain] elif ans.startswith(\u0026#39;q\u0026#39;): break else: continue return domains def __init__(self, whitelist=None, mapping=None): # Create the domain mapping properties; copy the mapping so that # instances never share a mutable default argument. self.fake = Factory.create() self.domains = dict(mapping or {}) # Add the whitelist as a mapping to itself. for domain in whitelist or []: self.domains[domain] = domain def dump(self, fobj): \u0026#34;\u0026#34;\u0026#34; Dump the domain mapping whitelist/mapping to JSON. \u0026#34;\u0026#34;\u0026#34; whitelist = [] mapping = self.domains.copy() # Iterate over a copy of the keys since we pop while looping for key in list(mapping.keys()): if key == mapping[key]: whitelist.append(mapping.pop(key)) json.dump({ \u0026#39;whitelist\u0026#39;: whitelist, \u0026#39;mapping\u0026#39;: mapping }, fobj, indent=2) def __getitem__(self, key): \u0026#34;\u0026#34;\u0026#34; Get a fake domain for a real domain. 
\u0026#34;\u0026#34;\u0026#34; if key not in self.domains: self.domains[key] = self.fake.domain_name() return self.domains[key] def __setitem__(self, key, val): self.domains[key] = val def __delitem__(self, key): del self.domains[key] def __iter__(self): for key in self.domains: yield key def __len__(self): return len(self.domains) Right, that\u0026rsquo;s quite a lot of code all at once, so let\u0026rsquo;s break it down a bit. First, the class extends MutableMapping, which is an abstract base class in the collections module. The ABC gives us the ability to make this class act just like a dict object. All we have to do is provide __getitem__, __setitem__, __delitem__, __iter__, and __len__ methods, and all other dictionary methods like pop or values work on our behalf. Here, we\u0026rsquo;re just wrapping an inner dictionary called domains.\nThe thing to note about our __getitem__ method is that it acts much like a defaultdict: if you try to fetch a key that is not in the mapping, it generates fake data on your behalf. This way, any domains that we don\u0026rsquo;t have in our whitelist or mapping will automatically be anonymized.\nNext, we want to be able to load and dump this data to a JSON file on disk so that we can maintain our mapping between anonymization runs. The load method is fairly straightforward: it takes an open file-like object, parses it using the json module, then instantiates the domain mapping and returns it. The dump method is a bit more complex: it has to break down the whitelist and mapping into separate objects so that we can easily modify the data on disk if needed. 
Together, these methods will allow you to load and save your mapping into a JSON file that will look similar to:\n{ \u0026#34;whitelist\u0026#34;: [ \u0026#34;gmail.com\u0026#34;, \u0026#34;yahoo.com\u0026#34; ], \u0026#34;mapping\u0026#34;: { \u0026#34;districtdatalabs.com\u0026#34;: \u0026#34;fadel.org\u0026#34;, \u0026#34;umd.edu\u0026#34;: \u0026#34;ferrystanton.org\u0026#34; } } The final method of note is the generate method. The generate method allows you to do a first pass through a list of emails, count the frequency of the domains, then ask the user, in order of most frequent domain, whether or not to add each one to the whitelist. For each domain in the emails, the user is prompted as follows:\n1/245: Whitelist \u0026#34;gmail.com\u0026#34; (817 addresses)? [y/n/q] \u0026gt; Note that the prompt includes a progress indicator (this is prompt 1 of 245) as well as a method to quit early. This is especially important for large datasets that have a lot of domains that appear only once; if you quit, the remaining domains will still be faked on lookup, and the user only sees the most frequent examples for whitelisting. The idea behind this mechanism is to read through your CSV once, generate the whitelist, then save it to disk so that you can use it for anonymization on a routine basis. Moreover, you can modify domains in the JSON file to better match any semantics you might have (e.g. include .edu or .gov domains, which are not generated by the internet provider in Faker).\nRealistic Profiles To create realistic profiles, we\u0026rsquo;ll create a provider that uses the domain map from above and generates fake data for every combination we see in the data set. This provider will also provide opportunities for mapping multiple names and email addresses to a single profile so that we can use the profile for creating fuzzy duplicates in the next section. 
Here is the code:\nclass Profile(object): def __init__(self, domains): self.domains = domains self.generator = Factory.create() def fuzzy_profile(self, name=None, email=None): \u0026#34;\u0026#34;\u0026#34; Return an profile that allows for fuzzy names and emails. \u0026#34;\u0026#34;\u0026#34; parts = self.fuzzy_name_parts() return { \u0026#34;names\u0026#34;: {name: self.fuzzy_name(parts, name)}, \u0026#34;emails\u0026#34;: {email: self.fuzzy_email(parts, email)}, } def fuzzy_name_parts(self): \u0026#34;\u0026#34;\u0026#34; Returns first, middle, and last name parts \u0026#34;\u0026#34;\u0026#34; return ( self.generator.first_name(), self.generator.first_name(), self.generator.last_name() ) def fuzzy_name(self, parts, name=None): \u0026#34;\u0026#34;\u0026#34; Creates a name that has similar case to the passed in name. \u0026#34;\u0026#34;\u0026#34; # Extract the first, initial, and last name from the parts. first, middle, last = parts # Create the name, with chance of middle or initial included. chance = self.generator.random_digit() if chance \u0026lt; 2: fname = u\u0026#34;{} {}. {}\u0026#34;.format(first, middle[0], last) elif chance \u0026lt; 4: fname = u\u0026#34;{} {} {}\u0026#34;.format(first, middle, last) else: fname = u\u0026#34;{} {}\u0026#34;.format(first, last) if name is not None: # Match the capitalization of the name if name.isupper(): return fname.upper() if name.islower(): return fname.lower() return fname def fuzzy_email(self, parts, email=None): \u0026#34;\u0026#34;\u0026#34; Creates an email similar to the name and original email. \u0026#34;\u0026#34;\u0026#34; # Extract the first, initial, and last name from the parts. first, middle, last = parts # Use the domain mapping to identify the fake domain. 
if email is not None: domain = self.domains[email.split(\u0026#34;@\u0026#34;)[-1]] else: domain = self.generator.domain_name() # Create the username based on the name parts chance = self.generator.random_digit() if chance \u0026lt; 2: username = u\u0026#34;{}.{}\u0026#34;.format(first, last) elif chance \u0026lt; 3: username = u\u0026#34;{}.{}.{}\u0026#34;.format(first, middle[0], last) elif chance \u0026lt; 6: username = u\u0026#34;{}{}\u0026#34;.format(first[0], last) elif chance \u0026lt; 8: username = last else: username = u\u0026#34;{}{}\u0026#34;.format( first, self.generator.random_number() ) # Match the case of the email if email is not None: if email.islower(): username = username.lower() if email.isupper(): username = username.upper() else: username = username.lower() return u\u0026#34;{}@{}\u0026#34;.format(username, domain) Again, this is a lot of code, so make sure you go through it carefully so you understand what is happening. First off, a profile in this case is the combination of a mapping of names to fake names and emails to fake emails. The key is that the names and emails are related to the original data somehow. In this case, the relationship is through case, such that \u0026ldquo;DANIEL WEBSTER\u0026rdquo; is faked to \u0026ldquo;JAKOB WILCOTT\u0026rdquo; instead of to \u0026ldquo;Jakob Wilcott\u0026rdquo;. Additionally, through our domain mapping, we also maintain the relationship of the original email domain to the fake domain, e.g. everyone with the email domain \u0026ldquo;@districtdatalabs.com\u0026rdquo; will be mapped to the same fake domain.\nIn order to maintain the relationship of names to emails (which is very common), we need to be able to access the name more directly. In this case we have a name parts generator which generates fake first, middle, and last names. We then randomly generate names of the form \u0026ldquo;first last\u0026rdquo;, \u0026ldquo;first middle last\u0026rdquo;, or \u0026ldquo;first i. 
last\u0026rdquo; with random chance. Additionally the email can take a variety of forms based on the name parts as well. Now we get slightly more realistic profiles:\n\u0026gt;\u0026gt;\u0026gt; fake.fuzzy_profile() {\u0026#39;names\u0026#39;: {None: u\u0026#39;Zaire Ebert\u0026#39;}, \u0026#39;emails\u0026#39;: {None: u\u0026#39;ebert@von.com\u0026#39;}} \u0026gt;\u0026gt;\u0026gt; fake.fuzzy_profile( ... name=\u0026#39;Daniel Webster\u0026#39;, email=\u0026#39;dictionaryguy@gmail.com\u0026#39;) {\u0026#39;names\u0026#39;: {\u0026#39;Daniel Webster\u0026#39;: u\u0026#39;Georgia McDermott\u0026#39;}, \u0026#39;emails\u0026#39;: {\u0026#39;dictionaryguy@gmail.com\u0026#39;: u\u0026#39;georgia9@gmail.com\u0026#39;}} Importantly this profile object makes it easy to map multiple names and emails to the same profile object to create \u0026ldquo;fuzzy\u0026rdquo; profiles and duplicates in your dataset. We will discuss how to perform fuzzy matching in the next section.\nFuzzing Fake Names from Duplicates If you noticed in our original data set we had the situation where we had a clear entity duplication: same email, but different names. In fact, the second name was simply the first initial and last name but you could imagine other situations like nicknames (\u0026ldquo;Bill\u0026rdquo; instead of \u0026ldquo;William\u0026rdquo;), or having both work and personal emails in the dataset. The fuzzy profile objects we generated in the last section allow us to maintain a mapping of all name parts to generated fake names, but we need some way to be able to detect duplicates and combine their profile: enter the fuzzywuzzy module.\n$ pip install fuzzywuzzy python-Levenshtein Similar to how we did the domain mapping, we\u0026rsquo;re going to pass through the entire dataset and look for similar name, email pairs and propose them to the user. 
If the user thinks they\u0026rsquo;re duplicates, then we\u0026rsquo;ll merge them together into a single profile, and use the mappings as we anonymize. Although I won\u0026rsquo;t go through an entire object to do this as with the domain map, this is also something you can save to disk and load on demand for multiple anonymization passes and to include user based edits.\nThe first step is to get pairs, and eliminate exact duplicates. To do this we\u0026rsquo;ll create a hashable data structure for our profiles using a namedtuple.\nfrom collections import namedtuple from itertools import combinations Person = namedtuple(\u0026#39;Person\u0026#39;, \u0026#39;name, email\u0026#39;) def pairs_from_rows(rows): \u0026#34;\u0026#34;\u0026#34; Expects rows of dictionaries with name and email keys. \u0026#34;\u0026#34;\u0026#34; # Create a set of person tuples (no exact duplicates) people = set([ Person(row[\u0026#39;name\u0026#39;], row[\u0026#39;email\u0026#39;]) for row in rows ]) # Yield ordered pairs of people objects without replacement for pair in combinations(people, 2): yield pair The namedtuple is an immutable data structure that is compact, efficient, and allows us to access properties by name. Because it is immutable it is also hashable (unlike mutable dictionaries), meaning we can use it as keys in sets and dictionaries. This is important, because the first thing our pairs_from_rows function does is eliminate exact matches by creating a set of Person tuples. We then use the combinations function in itertools to generate every pair without replacement.\nThe next step is to figure out how similar each pair is. To do this we\u0026rsquo;ll use the fuzzywuzzy library to come up with a partial ratio score: the mean of the similarity of the names and the emails for each pair:\nfrom fuzzywuzzy import fuzz from functools import partial def normalize(value, email=False): \u0026#34;\u0026#34;\u0026#34; Make everything lowercase and remove spaces. 
If email, only take the username portion to compare. \u0026#34;\u0026#34;\u0026#34; if email: value = value.split(\u0026#34;@\u0026#34;)[0] return value.lower().replace(\u0026#34; \u0026#34;, \u0026#34;\u0026#34;) def person_similarity(pair): \u0026#34;\u0026#34;\u0026#34; Returns the mean of the normalized partial ratio scores. \u0026#34;\u0026#34;\u0026#34; # Normalize the names and the emails names = map(normalize, [p.name for p in pair]) emails = map( partial(normalize, email=True), [p.email for p in pair] ) # Compute the partial ratio scores for both names and emails scores = [ fuzz.partial_ratio(a, b) for a, b in [names, emails] ] # Return the mean score of the pair return float(sum(scores)) / len(scores) The score will be between 0 (no similarity) and 100 (exact match), though hopefully you won\u0026rsquo;t get any scores of 100 since we eliminated exact matches above. For example:\n\u0026gt;\u0026gt;\u0026gt; person_similarity([ ... Person(\u0026#39;John Lennon\u0026#39;, \u0026#39;john.lennon@gmail.com\u0026#39;), ... Person(\u0026#39;J. Lennon\u0026#39;, \u0026#39;jlennon@example.org\u0026#39;) ... ]) 80.5 The fuzzing process will go through your entire dataset, create pairs of the people it finds, and compute their similarity score. Keep only the pairs whose score meets a threshold (say, 50), then propose them to the user in descending score order to decide if they\u0026rsquo;re duplicates. When a duplicate is found, merge the profile object to map the new names and emails together.\nConclusion Anonymization of datasets is a critical method to promote the exploration and practice of data science through open data. Fake data generators that already exist give us the opportunity to ensure that private data is obfuscated. The issue becomes how to leverage these fake data generators while still maintaining a high quality dataset with semantic relations preserved for further analysis. 
As we\u0026rsquo;ve seen throughout the post, even the anonymization of just two fields, name and email, can lead to potential problems.\nThis problem and the code in this post are associated with a real case study. For District Data Labs\u0026rsquo; Entity Resolution Research Lab3 I wanted to create a dataset that removed PII of DDL members while maintaining duplicates and structure to study entity resolution. The source dataset was 1,343 records in CSV form and contained names and emails that I wanted to anonymize.\nUsing the strategy I mentioned for domain name mapping, the dataset contained 245 distinct domain names, 185 of which were hapax legomena (appeared only once). There was a definite long tail, as the first 20 or so most frequent domains were the majority of the records. Once I generated the whitelist as above, I manually edited the mappings to ensure that there were no duplicates and that major work domains were “professional enough”.\nUsing the fuzzy matching process was also a bear. It took, on average, 28 seconds to compute the pairwise scores. Using a threshold score of 50, I was proposed 5,110 duplicates (out of a possible 901,153 combinations). I went through 354 entries (until the score was below 65) and was satisfied that I covered many of the duplicates in the dataset.\nIn the end the dataset that I anonymized was of a high quality. It obfuscated personally identifying information like name and email and I\u0026rsquo;m happy to make the data set public. Of course, you could reverse some of the information in the dataset. For example, I\u0026rsquo;m listed in the dataset, and one of the records indicates a relationship between a fake user and a blog post, which I\u0026rsquo;m on record as having written. 
However, even though you can figure out who I am and what else I\u0026rsquo;ve done in the dataset, you wouldn\u0026rsquo;t be able to use it to extract my email address, which was the goal.\nIn the end, anonymizing a dataset is a lot of work, with a lot of gotchas and hoops to jump through. However, I hope you will agree that it is invaluable in an open data context. By sharing data, resources, and tools we can use many eyes to provide multiple insights and to drive data science forward.\nFootnotes 1. Anonymize: remove identifying particulars from (test results) for statistical or other purposes.\n2. Entity Resolution: tools or techniques that identify, group, and link digital mentions or manifestations of some object in the real world.\n3. DDL Research Labs is an applied research program intended to develop novel, innovative data science solutions towards practical applications.\n","permalink":"https://bbengfort.github.io/2016/02/anonymizing-profile-data/","summary":"\u003cblockquote\u003e\n\u003cp\u003eThis post is an early draft of expanded work that will eventually appear on the \u003ca href=\"http://blog.districtdatalabs.com/\"\u003eDistrict Data Labs Blog\u003c/a\u003e. Your feedback is welcome, and you can submit your comments on the \u003ca href=\"https://github.com/bbengfort/bbengfort.github.io/issues/3\"\u003edraft GitHub issue\u003c/a\u003e.\u003c/p\u003e\n\u003c/blockquote\u003e\n\u003cp\u003eIn order to learn (or teach) data science you need data (surprise!). The best libraries often come with a toy dataset to show examples and how the code works. However, nothing can replace an actual, non-trivial dataset for a tutorial or lesson because it provides for deep and meaningful further exploration. Non-trivial datasets can provide surprise and intuition in a way that toy datasets just cannot. 
Unfortunately, non-trivial datasets can be hard to find for a few reasons, but one common reason is that the dataset contains personally identifying information (PII).\u003c/p\u003e","title":"Anonymizing User Profile Data with Faker"},{"content":"I was looking back through some old code (hoping to find a quick post before I got back to work) when I ran across a project I worked on called Mortar. Mortar was a simple daemon that ran in the background and watched a particular directory. When a file was added or removed from that directory, Mortar would notify other services or perform some other task (e.g. if it was integrated into a library). At the time, we used Mortar to keep an eye on FTP directories, and when a file was uploaded Mortar would move it to a staging directory based on who uploaded it, then do some work on the file.\nMortar was a specific implementation of the observer design pattern. And while it might seem that this means that Mortar was observing the directory, in fact Mortar was the thing being observed, which for the purposes of a design pattern discussion we will call the subject (or something that implements the observable interface in Java terms). The observers were actually the things that did work when Mortar noticed a change in the file system; e.g. add something to a directory, move the file, do some work on the file, etc.\nOk, so a brief note on the observer pattern, which you should read about somewhere that is not here (like in the link above). The basic pattern is that we have a subject that contains some state. Other objects called observers register themselves with the subject and ask to be notified when the state changes. There are a couple of ways to implement this, but the most common is to give the observers a method called update. 
When the state changes on the subject, it simply calls the update method for each observer in the order that they registered.\nOf course, this brings up a whole host of other issues like synchronization or side-effects. Like I said, explore this pattern in detail! But back to the code snippet I rediscovered.\nComing from an event oriented programming environment like JavaScript or ActionScript, the observer pattern is very easy to understand. In this case the subject is whatever is listening to user actions like mouse clicks or key presses. Rather than calling a single update function on all the observers, observers register callbacks (callables like functions or callable classes) to specific event types. Events themselves are also data, and contain information that is passed to the callback function. Way back in 2010, I wanted to bring this style of event dispatcher to my Python programming, so with some inspiration from Python Event Dispatcher by @makemachine, I came up with the following:\nThe idea here is that you would create (or subclass) the event dispatcher, and then have observers register their callbacks with specific event types (or multiple event types if needed). Event types in this case are just strings that can be compared, and I\u0026rsquo;ve provided several examples as static variables on the Event class itself. The dispatcher guarantees that when an event occurs, all (and only) callbacks that are registered at the time of the event will receive an unmodified copy of the event, no matter the order of their registration. It does this through the deepcopy and clone functions.\nWhile this is not fundamentally different than the observer pattern, it does implement things in a style that I think other data scientists may understand, particularly if they do JavaScript for visualization. 
Moreover, I like the idea of having multiple event types and passing state through a packet.\nIn order to make this thread safe, some mutex would need to be added to the dispatcher class. If you\u0026rsquo;re willing to make that happen, I\u0026rsquo;d love to see it!\n","permalink":"https://bbengfort.github.io/2016/02/observer-pattern/","summary":"\u003cp\u003eI was looking back through some old code (hoping to find a quick post before I got back to work) when I ran across a project I worked on called Mortar. Mortar was a simple daemon that ran in the background and watched a particular directory. When a file was added or removed from that directory, Mortar would notify other services or perform some other task (e.g. if it was integrated into a library). At the time, we used Mortar to keep an eye on FTP directories, and when a file was uploaded Mortar would move it to a staging directory based on who uploaded it, then do some work on the file.\u003c/p\u003e","title":"Implementing the Observer Pattern with an Event System"},{"content":"Automation with Python is a lovely thing, particularly for very repetitive or long running tasks; but unfortunately someone still has to press the button to make it go. It feels like there should be an easy way to set up a program such that it runs routinely, in the background, without much human intervention. Daemonized services are the route to go in server land; but how do you routinely schedule a process to run on your local computer, which may or may not be turned off1? Moreover, long running daemon processes seem expensive when you just want a quick job to execute routinely.\nLet\u0026rsquo;s consider the following use case: you\u0026rsquo;re working on a data analysis project that requires the mashup of two different data sources. The first data source has to be ingested routinely, every hour, and the second has to be fetched sometime after, depending on the result of the first query. 
Obviously, you don\u0026rsquo;t want to have to go to your computer and run your service, so your choices are:\nLet the OS run your program for you (launchd or cron) Let an external daemon service run your program (celery or luigi) Create a long running program that mostly sleeps (schedule or sched) Frankly, these aren\u0026rsquo;t great choices, but they\u0026rsquo;re the best we\u0026rsquo;ve got. In this post, I will explore the first and third options in a bit more detail. The second option is the more services-oriented route that you might expect to see on servers rather than on your local machine. I will probably discuss those options in other posts, as I start to use them more frequently in my work.\nThe Infamous Cron There are actually many versions of cron, which was originally studied in the late 1970\u0026rsquo;s in parallel with research concerning discrete event simulation. The modern version that is typically used is Vixie or ISC cron, named after its original programmer, Paul Vixie who wrote it in 1987. Because of its rich history, maturity, and standard inclusion with most Linux distros, cron is the defacto tool for scheduling periodic tasks in the background.\ncron is a Linux/Unix utility which allows users to execute commands automatically at a specified time and date or periodically on a schedule. While technically cron is a daemon service that is launched when the OS boots, because it is available preinstalled on almost all Linux/Unix systems I believe it is legitimate to talk about it being a part of the operating system. However, it is important to check that the crond daemon is running on your computer, otherwise your scheduled command won\u0026rsquo;t execute.\nCron Voodoo Working with cron means editing crontab (cron configuration) files. System wide jobs can be installed by modifying /etc/crontab, however users should use the crontab tool if available to create local jobs. 
The crontab files can contain variables that modify how cron is used, but the most important part are the entry lines that describe when and what to execute. Consider that we have a file called ingest.py, which is installed on the path, in order to run that every five minutes, we would write an entry similar to the following:\n0-59/5 * * * * $HOME/bin/ingest.py \u0026gt;\u0026gt; $HOME/log/ingest.out 2\u0026gt;\u0026amp;1 There are two parts to the voodoo of this entry, the schedule and the command. The schedule has five fields: minute, hour, day of month, month, day of week. By specifying a single number, you specify exactly when to run the job. For example to run a job on the first of April at 8:15 AM:\n15 8 1 4 * echo \u0026#34;April Fools!\u0026#34; The * stands for “first-last” a short cut for the maximum range. In our first example we used 0-59 to specify that we wanted it to run every minute between the 0th minute and the 59th minute. We could have replaced this with * to shorten the syntax. The / allows us to specify a step, therefore in our example */5 means run every five minutes.\nThe second part is our command. In the ingest example we execute a Python file (which should have a #!/usr/bin/env python at the top of it and have executable permissions) that is in our home directory, in the bin folder. We then append the output to a log file, and redirect the standard error pipe to standard out (so that we can have one log file). It is important to understand where your output is going in order to debug errors and capture messages that are printed to the command line!\nOS X Launchd If you\u0026rsquo;re working on OS X, the preferred method for creating periodic or timed jobs is to use launchd, though cron is technically available2. 
Every launchd job is specified by property list (plist) file in XML format, therefore instead of maintaining a single crontab file with all entries, managing launchd jobs is as simple as adding and removing .plist files!\nConfiguring launchd plist files is more expressive than crontab, and allows you to include a lot of information about your background process; for more information see Creating a launchd Property List File. There are four properties that must be included with each configuration: Label to identify your job, ProgramArguments used to launch your job, inetdCompatibility which is specifically for servers, and KeepAlive which specifies if your job launches on demand or must always be running. Our 5 minute ingest.py command is specified as follows:\n\u0026lt;?xml version=\u0026#34;1.0\u0026#34; encoding=\u0026#34;UTF-8\u0026#34;?\u0026gt; \u0026lt;!DOCTYPE plist PUBLIC \u0026#34;-//Apple//DTD PLIST 1.0//EN\u0026#34; \u0026#34;http://www.apple.com/DTDs/PropertyList-1.0.dtd\u0026#34;\u0026gt; \u0026lt;plist version=\u0026#34;1.0\u0026#34;\u0026gt; \u0026lt;dict\u0026gt; \u0026lt;key\u0026gt;Label\u0026lt;/key\u0026gt; \u0026lt;string\u0026gt;com.districtdatalabs.ingest\u0026lt;/string\u0026gt; \u0026lt;key\u0026gt;ProgramArguments\u0026lt;/key\u0026gt; \u0026lt;array\u0026gt; \u0026lt;string\u0026gt;$HOME/bin/ingest.py\u0026lt;/string\u0026gt; \u0026lt;/array\u0026gt; \u0026lt;key\u0026gt;StandardOutPath\u0026lt;/key\u0026gt; \u0026lt;string\u0026gt;$HOME/log/ingest.out\u0026lt;/string\u0026gt; \u0026lt;key\u0026gt;StandardErrorPath\u0026lt;/key\u0026gt; \u0026lt;string\u0026gt;$HOME/log/ingest.out\u0026lt;/string\u0026gt; \u0026lt;key\u0026gt;StartInterval\u0026lt;/key\u0026gt; \u0026lt;integer\u0026gt;300\u0026lt;/integer\u0026gt; \u0026lt;/dict\u0026gt; \u0026lt;/plist\u0026gt; Although a bit more verbose, launchd configuration gives a bit more flexibility and a bit more readability about what is happening. 
You can also specify calendar based intervals, or even monitor a directory to detect if paths have been changed. After creating the plist file for your ingest command, install it to /Library/LaunchAgents or in the LaunchAgents directory of the user specific Library folder. If you don\u0026rsquo;t want to specify the entire path to the executable, you can symlink ingest.py to /usr/local/libexec as follows:\n$ ln -s $HOME/bin/ingest.py /usr/local/libexec/ingest.py On OS X, the term “daemon” is used to specify system-level background processes, whereas the term “agent” is used to specify per-user background processes3. Note that an agent will not run if its assigned user is not logged in. Similarly by installing a launchd plist to /Library/LaunchDaemons, the service will run at the system level.\nIs the Computer On? For OS X, if your system is off or asleep, cron jobs will not execute; they will not run again until the next scheduled time that occurs while the computer is on. Similarly, most launchd jobs are skipped if the computer is off or asleep as well. However, if a launchd job is specified by the StartCalendarInterval key, and the computer is asleep when the job should have run, it will run when the computer wakes up. This doesn\u0026rsquo;t apply if the computer is off, however.\nIt is important to keep in mind when your computer is on and running, and how it might affect your background services. If the computer is always off or asleep at the job\u0026rsquo;s scheduled time, then it will never run.\nScheduling and Waiting While cron and launchd are great for scheduling jobs that run periodically, they do have some issues4. For example, cron is a per-machine configuration, not an application configuration, which makes it difficult to scale the number of machines that are working together. 
Both cron and launchd are also difficult to debug and finally, bigger problems can be designed with tools like queues and workers that are easier to work with but not suitable for scheduling with cron.\nThe bottom line is that as your program gets more complex, it\u0026rsquo;s better to turn it into a long-running service or daemon with its own built-in scheduler than to let the OS run it every once in a while. Note you\u0026rsquo;ll still use launchd to ensure that the daemon is running in the background, or something like upstart on Linux. In this section we\u0026rsquo;ll look at a program that creates its own delays using the standard library sched and third party schedule utilities.\nPython Event Scheduler The standard library sched module defines a scheduler class that implements general purpose periodic events and callbacks for single process Python programs. The scheduler requires two functions to actually handle the scheduling: a timefunc, which should be a callable without arguments that returns a number that represents the current time and a delayfunc which should accept one argument compatible with the output of the timefunc and should delay that many units. The simplest implementation of our ingest function is as follows:\n#!/usr/bin/env python import sched, time from ingest import ingest scheduler = sched.scheduler(time.time, time.sleep) def ingestion_runner(*args, **kwargs): \u0026#34;\u0026#34;\u0026#34; Runs ingestion every 5 minutes for an hour. 
\u0026#34;\u0026#34;\u0026#34; # Pass arguments to ingest function doingest = lambda: ingest(*args, **kwargs) # Set the scheduler to run doingest for interval in xrange(0, 60, 5): scheduler.enter(interval*60, 1, doingest, ()) # Run the scheduler print \u0026#34;Ingestion started at {}\u0026#34;.format(time.time()) scheduler.run() print \u0026#34;Ingestion finished at {}\u0026#34;.format(time.time()) if __name__ == \u0026#39;__main__\u0026#39;: ingestion_runner() This style of scheduler basically allows you to create a chain of events ahead of time using the enter method. Then when the scheduler is run, it simply calls time.sleep for the number of seconds before its next scheduled event, executes that event, and then sleeps until the next event. The sched module is really nice for creating a complex sequence of events, so that you don\u0026rsquo;t have to do the math about sleeping in between. However, once the scheduler is running, it is completely blocking (because of the sleep call), and your program won\u0026rsquo;t be able to do anything (not even catch signals like KeyboardInterrupt) until the next event occurs.\nSchedule API As an alternative to the standard library sched, the third party schedule library allows you to build an in-process scheduler for periodic jobs, without necessarily blocking. Schedule is designed as a lightweight API that runs a callable at pre-determined intervals, and has the most friendly syntax of any of the tools we\u0026rsquo;ve discussed so far. To use schedule, install it with pip:\n$ pip install schedule We can then convert our ingestion runner from above into something a lot less verbose, and which will allow us to sleep on our own terms, and exit if we want to. 
The schedule ingestion runner is as follows:\n#!/usr/bin/env python import sys import time import schedule from ingest import ingest from functools import partial def ingestion_runner(*args, **kwargs): \u0026#34;\u0026#34;\u0026#34; Runs the ingest function with the given arguments every 5 minutes. \u0026#34;\u0026#34;\u0026#34; # Use partial based method instead of lambda doingest = partial(ingest, *args, **kwargs) # Set the scheduler to do ingest. schedule.every(5).minutes.do(doingest) # Run the scheduler, with the ability to cancel early counter = 0 while True: try: schedule.run_pending() counter += 1 time.sleep(1) except (KeyboardInterrupt, SystemExit): break print \u0026#34;Checked for pending jobs {} times\u0026#34;.format(counter) sys.exit(0) if __name__ == \u0026#39;__main__\u0026#39;: ingestion_runner() The schedule API allows us to only block 1 second at a time, which gives us the opportunity to check if someone is trying to exit. Moreover, we don\u0026rsquo;t have to specify or compute exactly when to schedule our job; the every method just keeps the job running as long as we want!\nConclusion In the context of data science, we\u0026rsquo;re used to saying that we can create automated platforms for performing ingestion, wrangling, model building, etc. However, outside the context of a web application, sometimes it is not clear how to get these tools up and running in an automated fashion. I hope this post presents a simple method for getting routine jobs going on your machine, and that it will enable you to ingest enough data to perform high quality analytics. At the very least, it should serve as a reference to point you towards the tools that you need to know.\nThis post is the first in a series where I discuss “software immortality: daemons, schedulers, and programs that live forever”. I hope to continue this discussion with task queues and workers, and to discuss Celery and other Python projects that let computers do a lot of work on your behalf.\nFootnotes 1. 
Stack Overflow asks: How do I get a Cron like scheduler in Python\n2. Mac OS X Daemons and Services: Scheduling Timed Jobs.\n3. See Daemons and Agents from the Apple Developer Library for more.\n4. Schedule was inspired by Adam Wiggins\u0026rsquo; article, Rethinking Cron.\n","permalink":"https://bbengfort.github.io/2016/02/running-on-schedule/","summary":"\u003cp\u003eAutomation with Python is a lovely thing, particularly for very repetitive or long running tasks; but unfortunately someone still has to press the button to make it go. It feels like there should be an easy way to set up a program such that it runs routinely, in the background, without much human intervention. Daemonized services are the route to go in server land; but how do you routinely schedule a process to run on your local computer, which may or may not be turned off\u003c!-- raw HTML omitted --\u003e\u003c!-- raw HTML omitted --\u003e\u003ca href=\"#ros-footnote-1\"\u003e1\u003c/a\u003e\u003c!-- raw HTML omitted --\u003e\u003c!-- raw HTML omitted --\u003e? Moreover, long running daemon processes seem expensive when you just want a quick job to execute routinely.\u003c/p\u003e","title":"Running on Schedule"},{"content":"This post is an attempt to explain what iterators and generators are in Python, defend the yield statement, and reveal why a library like SimPy is possible. But first some terminology (that specifically targets my friends who Java). Iteration is a syntactic construct that implements a loop over an iterable object. The for statement provides iteration, the while statement may provide iteration. An iterable object is something that implements the iteration protocol (Java folks, read interface). A generator is a function that produces a sequence of results instead of a single value and is designed to make writing iterable objects easier.\nIterables Iterable objects are constructed by the built-in function, iter, which takes an iterable object and returns an iterator. 
The Python data model allows you to define custom objects that implement double underscore methods related to the built-in functions and operators. Therefore if you implement an object with an __iter__ method, your object can be passed to the iter built-in.\nThe __iter__ method must return an iterator, which, if the object is its own iterator, can simply be self. Iterators must have a next method that is called on every pass of the loop. When iteration is complete, the next method should raise StopIteration. Here is an example of a Dealer iterator that shuffles a deck of cards on iter then deals out cards on each call of next, until there are no more cards left in the deck:\nThe thing to note here is that the object keeps track of its own state, through its own pointer value (the \u0026ldquo;shoe\u0026rdquo;). This means that the iterable can be \u0026ldquo;exhausted\u0026rdquo;, after which it returns no more data. Try the following and see what happens:\ndealer = Dealer() for card in dealer: for card in dealer: print card Note that I also used the shorthand and didn\u0026rsquo;t call the iter function directly, but let the syntax of the for loop handle it for me. Also note that other built-in functions consume iterables, like list, which will take the contents of the iterable and store them in memory in a list, or enumerate, which will also provide an index of each value in the iterator.\nGenerators Generators are designed to allow you to easily create iterables without having to deal with the iterator interface. Instead you can create a function that does not return but rather yields values. When the yield keyword is used inside a function, calling that function immediately returns a generator that has a next method. 
Look how simple our dealer is using a generator function:\ndef dealer(): cards = [ u\u0026#34;{: \u0026gt;2}{}\u0026#34;.format(*card) for card in zip(FACES * len(SUITS), SUITS * len(FACES)) ] random.shuffle(cards) for card in cards: yield card The generator allows us to forget about how to implement an iterable, keep track of state, etc. which greatly simplifies the process. You can get access to the generator directly from the function:\ndealer_generator = dealer() print dealer_generator.next() Or you can simply loop over the function as we\u0026rsquo;ve been doing so far:\nfor card in dealer(): print card The yield statement is often mistaken for yielding a value instead of simply returning one. What the generator is actually doing is yielding the execution context back to the caller. Whenever the caller calls next() on the generator, the execution is returned directly to the line where the yield was executed. Consider the following example:\ndef surround(n): for idx in xrange(n): print \u0026#34;above {}\u0026#34;.format(idx) yield idx print \u0026#34;below {}\u0026#34;.format(idx) for idx in surround(4): print \u0026#34;around {}\u0026#34;.format(idx) You get output that appears as follows:\nabove 0 around 0 below 0 above 1 around 1 below 1 above 2 around 2 below 2 above 3 around 3 below 3 What is happening here? On the for loop call, a generator is returned, the \u0026ldquo;above\u0026rdquo; print statement happens, then control is yielded to the executing context, which prints \u0026ldquo;around\u0026rdquo;. That block complete, the loop continues, going to the next cycle, and calls next on the generator, which returns control right after the yield, printing the \u0026ldquo;below\u0026rdquo; statement, continuing to the next \u0026ldquo;above\u0026rdquo; then yielding, so on and so forth.\nSimPy and Context Generators are incredibly handy for things like comprehensions, memory safe iteration, reading from multiple files simultaneously, and more. 
However, I want to talk about their ingenious use in the discrete event simulation library, SimPy.\nSimPy allows you to create processes which are essentially generators. These processes can run forever, but they must yield events that occur in the simulation. One very important event is the timeout event that allows time to pass in the simulation. So how would we implement a simple SimPy environment using generators? Consider a blinking light generator:\ndef blinker(env): while True: print \u0026#39;Blink at {}!\u0026#39;.format(env.now) yield 5 The desired effect is that this prints \u0026ldquo;Blink\u0026rdquo; every 5 time steps in the simulation (env in this case is just a SimPy environment). The offset allows us to start blinking lights that blink at different times. Note that this while loop doesn\u0026rsquo;t terminate, so if we just hit go on this thing, even if we manage to wait 5 (however we do that) then this will go forever, how do we cancel it? Moreover, how do we cancel multiple blinking lights?\nBasically what we can do is we can simply manage the generators for our simulation and call next on them when appropriate, and if we want to terminate, then simply don\u0026rsquo;t call their next method. Here is a simple implementation:\nfrom collections import defaultdict class BlinkerEnvironment(object): def __init__(self, blinkers=4): self.now = 0 self.blinkers = defaultdict(list) for idx in xrange(blinkers): # schedule blinkers by offset self.blinkers[idx].append(blinker(self)) def run(self, until=100): while self.now \u0026lt; until: if self.now in self.blinkers: for blinker in self.blinkers.pop(self.now): timeout = blinker.next() + self.now self.blinkers[timeout].append(blinker) self.now = min(self.blinkers.keys()) As you can see in this code, the blinkers dictionary is a list of blinkers keyed to the time value that they are supposed to be called again. 
The environment keeps track of the current timestamp, and initializes 4 blinkers that are offset so that the blinkers aren\u0026rsquo;t all blinking at the same time.\nThe run method is passed an until argument, which limits how long the simulation goes on. If the current timestamp is in the blinkers schedule, then we go and fetch all the generators for the now value, then call their next method. We reschedule the blinker based on the timeout number that it yields to us, then we increment now by the next scheduled blink to take place (skipping over time steps that don\u0026rsquo;t matter is what gives discrete event simulation its desired properties). And voila, we\u0026rsquo;ve implemented a simple simulation using generators!\n","permalink":"https://bbengfort.github.io/2016/02/iterators-generators/","summary":"\u003cp\u003eThis post is an attempt to explain what iterators and generators are in Python, defend the \u003ccode\u003eyield\u003c/code\u003e statement, and reveal why a library like \u003ca href=\"https://simpy.readthedocs.org/en/latest/\"\u003eSimPy\u003c/a\u003e is possible. But first some terminology (that specifically targets my friends who Java). \u003cem\u003eIteration\u003c/em\u003e is a syntactic construct that implements a loop over an \u003cem\u003eiterable\u003c/em\u003e object. The \u003ccode\u003efor\u003c/code\u003e statement provides \u003cem\u003eiteration\u003c/em\u003e, the \u003ccode\u003ewhile\u003c/code\u003e statement may provide iteration. An \u003cem\u003eiterable\u003c/em\u003e object is something that implements the \u003cem\u003eiteration protocol\u003c/em\u003e (Java folks, read interface). 
A \u003cem\u003egenerator\u003c/em\u003e is a function that produces a sequence of results instead of a single value and is designed to make writing \u003cem\u003eiterable\u003c/em\u003e objects easier.\u003c/p\u003e","title":"Iterators and Generators"},{"content":"Event driven programming can be a wonderful thing, particularly when the execution of your code is dependent on user input. It is for this reason that JavaScript and other user facing languages implement very strong event based semantics. Many times event driven semantics depends on elapsed time (e.g. wait then execute). Python, however, does not provide a native setTimeout or setInterval that will allow you to call a function after a specific amount of time, or to call a function again and again at a specific interval.\nConsider a naive example where the program just waits a specific amount of time then calls a function:\nimport time def wait(delay, func): \u0026#34;\u0026#34;\u0026#34; Waits a certain amount of time, then calls func. \u0026#34;\u0026#34;\u0026#34; time.sleep(delay) func() When this function is called it begins blocking — that is the code cannot continue while we are in the delay. Therefore if you want to listen for user input, it won\u0026rsquo;t be evaluated until after the delay is complete. This is bad.\nIn order to implement something that is nonblocking in Python — that is it runs independently of the main execution of the code, we need to use the threading module. This is not a blog on threading, which is an enormous topic of its own. It should suffice to say for this blog post that the threading module allows us to spin off an independent thread that executes on its own while the main process continues. 
This will allow us to schedule functions to be called at a later date in a non blocking fashion.\nThe threading module has a helpful threading.Timer object that you can use to set a delay and run the function:\nimport threading def wait(delay, func): timer = threading.Timer(delay, func) timer.start() You can cancel the timer, and even pass both positional and keyword arguments directly to the object. This gives us the ability to easily wait a delay then call the function. However, what if you want to run the function multiple times on an interval? The simple answer is have your function, when run, create a new timer object. However, your main thread then loses control of its hook to the timer object, which means that you can\u0026rsquo;t cancel the interval (and your program will never terminate)! My method is personalized from an answer to the Stack Overflow question: “Run certain code every n seconds” and is as follows:\nMy special sauce is the use of functools.partial to create a closure and the __call__ override, which allows me to actually interrupt the interval and execute the function ahead of time, resetting the interval. As you can see, the elapsed time gets printed out every n seconds, without blocking the code waiting for user input (in this case, a KeyboardInterrupt).\nSo how might you use this in practice? Well, I originally was thinking about this to do lightweight routine memory sampling for a quick analysis. Adapting an answer to the Stack Overflow question: “How do I profile memory usage in Python?”, I came up with the following wrapper for the resource module:\nDisclaimer: this is not the best method for memory profiling, there are definitely way better tools out there for this!\nHere you can see that every 5 seconds, the memory usage is written to a CSV file, without interrupting the main code execution! Although this is a simple way to add a lot of rich features to your code; take care - the threading module can be tricky! 
Note that if you don\u0026rsquo;t stop the interval, then your program won\u0026rsquo;t stop! So make sure on exit you do the work of cleaning these things up!\n","permalink":"https://bbengfort.github.io/2016/02/intervals-with-threads/","summary":"\u003cp\u003eEvent driven programming can be a wonderful thing, particularly when the execution of your code is dependent on user input. It is for this reason that JavaScript and other user facing languages implement very strong event based semantics. Many times event driven semantics depends on elapsed time (e.g. wait then execute). Python, however, does not provide a native \u003ccode\u003esetTimeout\u003c/code\u003e or \u003ccode\u003esetInterval\u003c/code\u003e that will allow you to call a function after a specific amount of time, or to call a function again and again at a specific interval.\u003c/p\u003e","title":"On Interval Calls with Threading"},{"content":"Several times it\u0026rsquo;s come up that I\u0026rsquo;ve needed to visualize a time sequence for a collection of events across multiple sources. Unlike a normal time series, events don\u0026rsquo;t necessarily have a magnitude, e.g. a stock market series is a graph with a time and a price. Events simply have times, and possibly types.\nA one dimensional number line is still interesting in this case, because the frequency or density of events reveal patterns that might not easily be analyzed with non-visual methods. Moreover, if you have multiple sources, overlaying a timeline on each can show which is busier, when and possibly also demonstrate some effect or causality.\nThe timelines plot above shows what I mean. Here I have five sensors that can observe different events: red, green, and blue. Each sensor records the time it sees the event from an initial time, zero along with the type and source. To plot this, I simply used Matplotlib to create a scatterplot where the y value was simply the index of the sensor in a sorted list. 
Some careful axis hacking led to the result.\nThe script and a sample of the dataset follow:\nObviously this function is very dataset dependent, though I tried to make it as generic as possible. Still it serves as a guide to create these kind of plots. Again, this is something I\u0026rsquo;ve copy and pasted from former code at least twice now, so it\u0026rsquo;s good to have it in one place!\n","permalink":"https://bbengfort.github.io/2016/01/timeline-visualization/","summary":"\u003cp\u003eSeveral times it\u0026rsquo;s come up that I\u0026rsquo;ve needed to visualize a time sequence for a collection of events across multiple sources. Unlike a normal time series, events don\u0026rsquo;t necessarily have a \u003cem\u003emagnitude\u003c/em\u003e, e.g. a stock market series is a graph with a time and a price. Events simply have times, and possibly types.\u003c/p\u003e\n\u003cp\u003eA one dimensional number line is still interesting in this case, because the frequency or density of events reveal patterns that might not easily be analyzed with non-visual methods. Moreover, if you have multiple sources, overlaying a timeline on each can show which is busier, when and possibly also demonstrate some effect or causality.\u003c/p\u003e","title":"Timeline Visualization with Matplotlib"},{"content":"Applications like Git or Django\u0026rsquo;s management utility provide a rich interaction between a software library and their users by exposing many subcommands from a single root command. This style of what is essentially better argument parsing simplifies the user experience by only forcing them to remember one primary command, and allows the exploration of the utility hierarchy by using --help and other visibility mechanisms. 
Moreover, it allows the utility writer to decouple different commands or actions from each other.\nAnd this is actually not very hard to do: as shown in [“Simple CLI Script with Argparse”]({% post_url 2016-01-10-simple-cli-argparse %}), the argparse module in the standard library will allow you to create subparsers. By setting a default “handling” function associated with each subparser, you can simply execute different functions with different arguments from the command line. Unfortunately, while easy, the organization and definition of the utility quickly gets out of control, particularly as the argparse.add_argument method is so verbose.\nEnter Commis, a library designed to make defining and organizing complex command line utilities easier. Commis was inspired by the Django management utility, and was written specifically to provide similar functionality and code organization in other projects. The design principles are simple:\nMaintain console commands inside of a library. Define arguments simply and extensibly (with better formatting). Easily and automatically add commands to the console utility. Decouple the execution context (argument parsing, output). Compose the most simple executable script possible. In this tutorial we will see how to build a console utility using Commis. This tutorial is applicable both to user facing console tools (e.g. Git) and to library specific tools (e.g. django-admin). We will focus on organization and package management rather than the details of writing command code, as this is where Commis shines. For the purposes of this tutorial, we will consider the building of a console utility that acts like a static site generator and has two primary commands: build and serve.\nCode Organization One of the most important things to understand about Commis is how to organize larger projects in order to manage complex utilities. In our tutorial example we are creating a static site generator called foo with two commands, build and serve. 
Following the basic template for a Python project, a very simple organization for foo would be as follows:\n$ project . ├── foo | ├── __init__.py | ├── console | | ├── commands | | | ├── __init__.py | | | ├── build.py | | | └── serve.py | | ├── __init__.py | | └── app.py ├── foo-app.py ├── LICENSE.txt ├── README.md ├── requirements.txt └── setup.py Our primary code base is in the foo library, which should hold 99% of the Python code. The only other Python modules in this example that are outside of the foo library are foo-app.py and setup.py. The foo-app.py script is the main entry point for our application and is very simple, which we will see shortly. The setup.py script is for packaging and distribution via pip, which we will also discuss in a bit.\nThe foo package includes a foo.console module, which in turn contains the foo.console.commands and foo.console.app modules. The app module will contain a subclass of commis.ConsoleProgram, which defines how our console application should behave. The commands module will organize our various subcommands, and as you can see, the build and serve modules are already listed, in which the build and serve commands will be implemented by extending commis.Command. We will discuss the ConsoleProgram and Command interfaces in detail.\nThe foo-app.py should be incredibly simple, even though it is the main entry point to the application. In fact, all it should do is import the console utility from foo.console.app and execute it, that\u0026rsquo;s it. It will pretty much look as follows:\n#!/usr/bin/env python from foo.console.app import FooApp if __name__ == \u0026#39;__main__\u0026#39;: app = FooApp.load() app.execute() The shebang (#!/usr/bin/env python) ensures that this simple program will execute with Python. Give it executable permissions as follows:\n$ chmod +x foo-app.py This script is the part of your Python project that will eventually get installed into the $PATH of the user. 
Using setuptools (pip) for packaging, you would simply list foo-app.py in the scripts keyword argument of the setup function as follows:\nfrom setuptools import setup if __name__ == \u0026#39;__main__\u0026#39;: setup( name=\u0026#39;foo\u0026#39;, version=\u0026#39;1.0\u0026#39;, py_modules=[\u0026#39;foo\u0026#39;], scripts=[\u0026#39;foo-app.py\u0026#39;] ) For more details on Python code organization see [“Basic Python Project Files”]({% post_url 2016-01-09-project-start %}). For more details on packaging and the setup.py file see [“Packaging Python Libraries with PyPI”]({% post_url 2016-01-20-packaging-with-pypi %}).\nCreating a Console Utility The Commis library utilizes a class-based interface for defining console utilities and commands. The primary usage is to subclass (extend) both the ConsoleProgram and the Command class for your purposes, however this is not required. In fact, given two commands, you could easily build a console utility as follows:\n#!/usr/bin/env python from commis import ConsoleProgram from foo.console.commands import BuildCommand, ServeCommand app = ConsoleProgram( description=\u0026#39;my foo app\u0026#39;, epilog=\u0026#39;postscript\u0026#39;, version=\u0026#39;1.0\u0026#39; ) app.register(BuildCommand) app.register(ServeCommand) app.execute() The ConsoleProgram.register command takes a Command subclass, and registers it to the console utility, building the necessary parser and subparser classes that the argparse module requires. You cannot add a command to a console utility without calling register. While the register method is easy, it does not allow you to manage, extend, or reuse the utility for different purposes. 
Instead, I recommend extending ConsoleProgram and modifying it as follows.\n# foo.console.app # An extended console utility import foo from commis import ConsoleProgram from foo.console.commands import * COMMANDS = [ BuildCommand, ServeCommand, ] class FooApp(ConsoleProgram): description = \u0026#34;my foo app\u0026#34; epilog = \u0026#34;please submit any issues to the bug tracker\u0026#34; version = foo.__version__ @classmethod def load(klass, commands=COMMANDS): utility = klass() for command in commands: utility.register(command) return utility This technique integrates your application with your library in a couple of meaningful ways. First, importing and including commands from your library means that you can easily control and version which commands are part of the utility and which are deprecated. Secondly, the version is tied to the library version, and other constants like the description and epilog are also easily maintained and can be string formatted from other meta information.\nCreating Commands Now that we have the infrastructure in place, it\u0026rsquo;s time to start creating commands for our application. Adding new commands to the utility is as simple as creating a command class, importing it in foo.console.app and adding the command class to the COMMANDS list. This technique means you have an easy way to add, edit, and manage commands without affecting other commands. 
Here is an example serve command:\nfrom commis import Command # From the Python standard library from BaseHTTPServer import HTTPServer from SimpleHTTPServer import SimpleHTTPRequestHandler class ServeCommand(Command): name = \u0026#39;serve\u0026#39; help = \u0026#39;a simple web server which serves files from the working directory\u0026#39; args = { (\u0026#39;-p\u0026#39;, \u0026#39;--port\u0026#39;): { \u0026#39;type\u0026#39;: int, \u0026#39;default\u0026#39;: 8080, \u0026#39;help\u0026#39;: \u0026#39;the port to serve on\u0026#39;, }, (\u0026#39;-a\u0026#39;, \u0026#39;--addr\u0026#39;): { \u0026#39;type\u0026#39;: str, \u0026#39;default\u0026#39;: \u0026#39;localhost\u0026#39;, \u0026#39;help\u0026#39;: \u0026#39;the address to serve on\u0026#39; } } def handle(self, args): \u0026#34;\u0026#34;\u0026#34; Create the web server \u0026#34;\u0026#34;\u0026#34; server = HTTPServer((args.addr, args.port), SimpleHTTPRequestHandler) print \u0026#34;Server started on http://{}:{}\u0026#34;.format(args.addr, args.port) try: server.serve_forever() except (KeyboardInterrupt, SystemExit): return \u0026#34;Server successfully stopped!\u0026#34; The Command subclass basically defines how the command is utilized in the console through four primary attributes: name, help, args, and handle. The name and help arguments are used to describe the command and are passed to the argparse library. When you do:\n$ foo-app.py {name} --help The {name} is the Command.name and the description will be what you listed in Command.help. The args attribute specifies the expected arguments for parsing on the command line. It is a dictionary, whose key is either a string or a tuple, which defines the name of the argument, and whose value is another dictionary representing the keyword arguments that get passed to argparse.add_argument. These arguments are added automatically to the parser and subparser during command registration. 
Specifying commands this way is a clean and easy way of creating argparse subparsers!\nDefault Options There are two default options that are included with every command by default: --traceback and --pythonpath. The --traceback argument specifies that if there is an error, then print out the entire stack trace (similar to what you might expect from a Python program with an exception that is not caught). This is useful for debugging, but often not useful for users. For that reason --traceback is by default False. Instead, the string representation of the error will be printed in red text. In fact, if there is a user related error, it is usually best to raise a commis.ConsoleError with a string message for users in particular:\nfrom commis import Command from commis.exceptions import ConsoleError class MyCommand(Command): name = \u0026#39;open\u0026#39; help = \u0026#39;opens the bay doors.\u0026#39; def handle(self, args): raise ConsoleError(\u0026#34;I\u0026#39;m sorry, I cannot do that, Dave.\u0026#34;) The --pythonpath option allows you to append paths to sys.path to include Python code that is not in your site-packages. 
This is also good for development and for tools that are intended for developers as it helps avoid import errors.\nReusing Options The default options above were created through a subclass of argparse.ArgumentParser as shown:\nimport argparse class DefaultParser(argparse.ArgumentParser): TRACEBACK = { \u0026#39;action\u0026#39;: \u0026#39;store_true\u0026#39;, \u0026#39;default\u0026#39;: False, \u0026#39;help\u0026#39;: \u0026#39;On error, show the Python traceback\u0026#39;, } PYTHONPATH = { \u0026#39;type\u0026#39;: str, \u0026#39;required\u0026#39;: False, \u0026#39;metavar\u0026#39;: \u0026#39;PATH\u0026#39;, \u0026#39;help\u0026#39;: \u0026#39;A directory to add to the Python path\u0026#39;, } def __init__(self, *args, **kwargs): ## Create the parser kwargs[\u0026#39;add_help\u0026#39;] = False super(DefaultParser, self).__init__(*args, **kwargs) ## Add the defaults self.add_default_arguments() def add_default_arguments(self): self.add_argument(\u0026#39;--traceback\u0026#39;, **self.TRACEBACK) self.add_argument(\u0026#39;--pythonpath\u0026#39;, **self.PYTHONPATH) If you have options that you are reusing again and again, you can do something similar for your arguments, e.g. FooParser, then add them to your commands with the parents attribute as follows:\nfrom commis import Command from commis.command import DefaultParser from foo.console import FooParser class BarCommand(Command): name = \u0026#39;bar\u0026#39; help = \u0026#39;an example command\u0026#39; parents = [DefaultParser(), FooParser()] This will ensure that you have both the default arguments as well as the foo arguments that are shared. Additionally if you wish to remove --traceback and --pythonpath then simply set parents to an empty list.\nConclusion In this post we have seen how to build a console utility with Commis - a library designed for easy console programs included with much larger libraries. As you can see, Commis is mostly about code organization and reusability. 
Hopefully this package will allow you to quickly and easily create utilities of your own. I\u0026rsquo;m always interested in feedback, please feel free to submit pull requests to the Commis GitHub repository!\n","permalink":"https://bbengfort.github.io/2016/01/console-utility-commis/","summary":"\u003cp\u003eApplications like \u003ca href=\"https://git-scm.com/\"\u003eGit\u003c/a\u003e or \u003ca href=\"https://docs.djangoproject.com/en/1.9/ref/django-admin/\"\u003eDjango\u0026rsquo;s management utility\u003c/a\u003e provide a rich interaction between a software library and their users by exposing many subcommands from a single root command. This style of what is essentially better argument parsing simplifies the user experience by only forcing them to remember one primary command, and allows the exploration of the utility hierarchy by using \u003ccode\u003e--help\u003c/code\u003e and other visibility mechanisms. Moreover, it allows the utility writer to decouple different commands or actions from each other.\u003c/p\u003e","title":"Building a Console Utility with Commis"},{"content":"I have a minor issue with freezing requirements, and so I put together a very complex solution. One that is documented here. Not 100% sure why this week is all about packaging, but there you go.\nFirst up, what is a requirement file? Basically they are a list of items that can be installed with pip using the following command:\n$ pip install -r requirements.txt The file therefore mostly serves as a list of arguments to the pip install command. The requirements file itself has a very specific format and can be created by hand, but generally the pip freeze command is used to dump out the requirements as follows:\n$ pip freeze \u0026gt; requirements.txt This produces an alphabetically sorted list of requirements and dumps them to your requirements text file. It is particularly useful when you are working in a virtual environment as you can track your project specific dependencies. 
Moreover, your setup.py file can read the requirements.txt and install dependencies via INSTALL_REQUIRES. However, consider the following requirements.txt file from Confire:\n## Confire requirements PyYAML==3.11 ## Testing requirements #coverage==3.7.1 #nose==1.3.3 #coveralls==0.5 ## Added by virtualenv #wsgiref==0.1.2 Here you can see that we have only one true dependency for Confire, PyYAML. The issue, however, is that we also have the development testing dependencies (nose, coverage, etc.) and then some weird other dependencies from virtualenv or from pip. We\u0026rsquo;ve also added comments and whitespace to make this more readable. We have to comment out the testing requirements, because those shouldn\u0026rsquo;t be installed when you pip install confire but they should be listed so contributors know what to expect. Luckily, pip freeze has us covered in terms of ordering, comments, and whitespace:\n$ pip freeze -r requirements.txt \u0026gt; requirements-new.txt $ mv requirements-new.txt requirements.txt However because the commented packages are skipped before review, you end up with the following:\n## Confire requirements PyYAML==3.11 ## Testing requirements #coverage==3.7.1 #nose==1.3.3 #coveralls==0.5 ## Added by virtualenv #wsgiref==0.1.2 ## The following requirements were added by pip --freeze: coveralls==0.5 coverage==3.7.1 nose==1.3.3 wsgiref==0.1.2 Like I said, a minor beef. If you\u0026rsquo;ve added or upgraded a package, then you have to manually deal with all the commented dependencies. Therefore I created a script to help me with this issue as follows.\nBasically I\u0026rsquo;ve stuck this file into ~/bin/requires and now I can simply do the following to get my requirements:\n$ requires -o reqs.txt $ mv reqs.txt requirements.txt The script automatically detects the local requirements.txt file. This script is far from perfect, and the following things I would like to do:\nOverwrite the actual freeze method shown on GitHub to tighten things up. 
Deal with new line parsing, inline comments, and versioning a bit better. Output helpful hints to sys.stderr as pip freeze does. But for me, this is solving a problem, which is great!\n","permalink":"https://bbengfort.github.io/2016/01/freezing-requirements/","summary":"\u003cp\u003eI have a minor issue with freezing requirements, and so I put together a very complex solution. One that is documented here. Not 100% sure why this week is all about packaging, but there you go.\u003c/p\u003e\n\u003cp\u003eFirst up, what is a \u003ca href=\"https://pip.readthedocs.org/en/stable/user_guide/#requirements-files\"\u003erequirement file\u003c/a\u003e? Basically they are a list of items that can be installed with \u003ccode\u003epip\u003c/code\u003e using the following command:\u003c/p\u003e\n\u003cpre\u003e\u003ccode\u003e$ pip install -r requirements.txt\n\u003c/code\u003e\u003c/pre\u003e\n\u003cp\u003eThe file therefore mostly serves as a list of arguments to the \u003ca href=\"https://pip.pypa.io/en/stable/reference/pip_install/\"\u003e\u003ccode\u003epip install\u003c/code\u003e\u003c/a\u003e command. The requirements file itself has a very \u003ca href=\"https://pip.readthedocs.org/en/stable/reference/pip_install/#requirements-file-format\"\u003especific format\u003c/a\u003e and can be created by hand, but generally the \u003ca href=\"https://pip.pypa.io/en/stable/reference/pip_freeze/\"\u003e\u003ccode\u003epip freeze\u003c/code\u003e\u003c/a\u003e command is used to dump out the requirements as follows:\u003c/p\u003e","title":"Freezing Package Requirements"},{"content":"Package deployment is something that is so completely necessary, but such a pain in the butt that I avoid it a little bit. However to reuse code in Python and to do awesome things like pip install mycode, you need to package it up and stick it on to PyPI (pronounced /pīˈpēˈī/ according to one site I read, though I still prefer /pīˈpī/). 
This process should be easy, but it\u0026rsquo;s detail oriented and there are only two good walk throughs (see links below).\nThe Python Package Index or PyPI is the official third-party software repository for the Python programming language. Python developers intend it to be a comprehensive catalog of all open source Python packages. — Wikipedia\nI\u0026rsquo;ve outlined my process for publishing libraries to PyPI in this post. It is mostly for my own future reference, but I am writing an upcoming post about publishing data projects to PyPI on District Data Labs.\nGetting Started Before you can publish a package to PyPI, you need to make sure that you\u0026rsquo;re doing Python right. Mostly this means to ensure that you\u0026rsquo;ve structured your Python package according to the guide: How to Develop Quality Python Code. You should also have several files already part of your project, see [Basic Python Project Files]({% post_url 2016-01-09-project-start %}) for those.\nHowever, there are some things you probably haven\u0026rsquo;t done yet, so here is my checklist of stuff to take care of:\nCreating Accounts You must create accounts on both the PyPI Test and PyPI Live sites in order to upload code. So do that now and log in to your PyPI account. Once you\u0026rsquo;ve done that, create a .pypirc configuration file with your account information. Mine looks like this:\n[distutils] index-servers = pypi pypitest [pypi] repository = https://pypi.python.org/pypi username = bbengfort password = theeaglefliesatmidnight [pypitest] repository = https://testpypi.python.org/pypi username = bbengfort password = shadowofthedawnawaits Make sure this file is in your home directory; whenever you work with pip or a setup.py file, it will use this configuration for interactions with the remote index servers. 
As a side note, you can also build your own internal index servers using S3 or other tools!\nFinal Notes Ok for the purposes of this post, we\u0026rsquo;re going to assume that we\u0026rsquo;re working on a library called “foo” and that the directory structure looks like this:\n$ project . ├── .gitignore ├── .travis.yml ├── DESCRIPTION.rst ├── LICENSE.txt ├── Makefile ├── MANIFEST.in ├── mkdocs.yml ├── README.md ├── requirements.txt ├── setup.py ├── setup.cfg ├── bin | └── app.py ├── docs | ├── images | | └── banner.jpg | └── index.md ├── fixtures ├── foo | ├── __init__.py | └── version.py └── tests └── __init__.py Honestly, I hate that these repos grow to such massive sizes, but honestly, this is a minimal setup for a normal Python project. Or at least, a minimal one the way I do it. Needless to say, I\u0026rsquo;ll be discussing many of these files, in particular, DESCRIPTION.rst, MANIFEST.in, requirements.txt, setup.py, setup.cfg, and version.py in this post. Most of the other files are either self explanatory or contained in another post.\nSetup and Meta The first step is to configure your project with the necessary setup and meta data files. The first and most important of these is the setup.py file which will use the other meta files in the project. Basically, I just copy and paste the following file into all my projects and modify as needed. Apparently this is just a thing Python developers do.\nSo there is a lot going on here, but you can see that the basic meta information is right at the top. I hoped to top load this file so that copy and paste would be as easy as possible. A couple of notes:\nThe license can just be the name of the license like “MIT” or “Apache” — the LICENSE.txt file will spell everything out. The GitHub repository is important; particularly because the download url is formed from a tag, v + the version number. The classifiers must be selected from Python Classifiers. 
The get_version function must be stored in a file called version.py such that the setup.py script can read the file and exec it without accidentally importing any dependencies. Unfortunately, PyPI doesn\u0026rsquo;t display Markdown, so for the long description (which is displayed on the PyPI project page) I have created a file called DESCRIPTION.rst which is in reStructuredText. The setup script uses the find_packages function to discover the contained packages (which allows you to easily create packages with multiple top level modules). Therefore you need to tell it which directories not to look in, as specified by EXCLUDES. The script, bin/app.py, will be installed to the $PATH of the user installing the program, but is not included as a module. I probably do need to break down these notes a bit more, but they are for reference here since I tend to speed write these posts. Check back later, maybe I\u0026rsquo;ll have updated them!\nConfiguration and Manifest The setup.cfg file allows you to specify other configurations. In my case it looks like this (assuming a Python 2 and 3 compatible package):\n[metadata] description-file = README.md [wheel] universal = 1 Basically the metadata tag is an attempt to get the Markdown README into the package, but it doesn\u0026rsquo;t really work (sadly). The manifest lists all the other files that should be included in the package when uploading to PyPI. Mine looks like this:\ninclude *.md include *.txt include *.yml include Makefile recursive-include docs *.md recursive-include docs *.jpg recursive-include tests *.py recursive-include bin *.py Final Notes on Configuration I find it really annoying that you have to create an extra description file for PyPI. Everywhere I read says that you should just put a reStructuredText file in as your README, but then of course GitHub doesn\u0026rsquo;t work. I prefer GitHub working, so I go with Markdown. 
You could write a script to do a conversion with Pandoc, but is it really worth the effort? In the future I\u0026rsquo;ll find a way to manage this a bit better.\nIf you want the files and directories from MANIFEST.in to also be installed (e.g. fixtures or data for machine learning or database setup), you will have to set include_package_data=True in your setup() call.\nBuilding and Submitting Basically there are two phases to submitting a project to PyPI: build and upload. During the upload phase you first send to PyPI Test to make sure everything is good, then send to PyPI live.\nBuild First build the package for distribution along with the binary wheel distribution:\n$ python setup.py sdist bdist_wheel This will create a build directory with the binary distribution, a foo.egg-info directory with packaging information, and finally a dist directory with two packages, the versioned distribution (foo-0.1.tar.gz) and the wheel (foo-0.1-py2-none-any.whl). Note that if you\u0026rsquo;re using my Makefile, make clean will clean up all of this extra stuff, but it should be ignored in your .gitignore already.\nAt this point you can (and should) test both the wheel and the sdist package by creating a virtual environment and attempting to install the package with pip directly as follows:\n$ virtualenv venv ... $ source venv/bin/activate $ pip install dist/foo-0.1.tar.gz $ python \u0026gt;\u0026gt;\u0026gt; import foo \u0026gt;\u0026gt;\u0026gt; print foo.__version__ 0.1 \u0026gt;\u0026gt;\u0026gt; exit() $ deactivate $ rm -rf venv Upload The first step to submitting your package to an index server is to register it.\n$ python setup.py register -r pypitest The -r flag here specifies which index server you wish to use as listed by the .pypirc file. We can then upload the package with twine, which is the currently preferred method of uploading due to its security (TLS) and ability to prebuild and test. 
If you don\u0026rsquo;t have twine set up, simply pip install it.\n$ twine upload -r pypitest dist/foo-0.1* Note that you can also sign the package with a GnuPG key with the -s option, but we will skip that for now. Once again, we should test our packages with a virtual environment as above, but this time downloading them from PyPI Test directly:\n$ pip install -i https://testpypi.python.org/pypi foo Once this is done and everything is ready to rock, you can repeat the process for uploading the package to PyPI, simplified here as follows:\n$ python setup.py register $ twine upload dist/foo-0.1* Documentation Did you know that PyPI hosts documentation? Well, it does, and even though you\u0026rsquo;re mainly hosting on Read the Docs, which gets built on each push, it\u0026rsquo;s pretty handy to upload those same docs to PyPI.\nAssuming you\u0026rsquo;re using MkDocs as recommended, you can upload this documentation as follows:\n$ mkdocs build --clean $ python setup.py upload_docs --upload-dir=site Clean Up You\u0026rsquo;ll probably want to clean up after yourself, which is as simple as make clean if you\u0026rsquo;re using my Makefile. If you\u0026rsquo;d like to do it with bash, it\u0026rsquo;s as follows:\n$ find . -name \u0026#34;*.pyc\u0026#34; -print0 | xargs -0 rm -rf $ rm -rf htmlcov $ rm -rf .coverage $ rm -rf build $ rm -rf dist $ rm -rf foo.egg-info Also you should probably remove that site folder created by the documentation build.\nConclusion Hopefully this post makes your life easier by giving you a simple guide to push new packages to PyPI. I know I shoot fast and loose with some of the stuff, but the post was super long anyway. 
If you\u0026rsquo;re really looking for awesome integrations, check out How to Travis-CI Deploy for automatic deployment after testing.\nVery Helpful Links Official Documentation How to submit a package to PyPI Sharing Your Labor of Love: PyPI Quick and Dirty Packaging and Distributing Projects ","permalink":"https://bbengfort.github.io/2016/01/packaging-with-pypi/","summary":"\u003cp\u003ePackage deployment is something that is so completely necessary, but such a pain in the butt that I avoid it a little bit. However, to reuse code in Python and to do awesome things like \u003ccode\u003epip install mycode\u003c/code\u003e, you need to package it up and stick it on to PyPI (pronounced /pīˈpēˈī/ according to one site I read, though I still prefer /pīˈpī/). This process should be easy, but it\u0026rsquo;s detail-oriented and there are only two good walkthroughs (see links below).\u003c/p\u003e","title":"Packaging Python Libraries with PyPI"},{"content":"The topic of the day is a simple one: JSON serialization. Here is my question, if you have a data structure like this:\nimport json import datetime data = { \u0026#34;now\u0026#34;: datetime.datetime.now(), \u0026#34;range\u0026#34;: xrange(42), } Why can\u0026rsquo;t you do something as simple as: print json.dumps(data)? These are simple Python data types from the standard library. Granted, serializing a datetime might have some complications, but ISO 8601 strings are a well-established convention for datetimes in JSON. Moreover, a generator is just an iterable, which can be put into memory as a list, which is exactly the kind of thing that JSON likes to serialize. It feels like this should just work. Luckily, there is a solution to the problem as shown in the Gist below:\nOk, so basically this encoder replaces the default encoding mechanism by trying the default encoding first and, if that doesn\u0026rsquo;t work, falling back on the following strategy:\nCheck if the object has a serialize method; if so, return the call to that. 
Check if the encoder has an encode_type method, where “type” is the type of the object, and if so, return a call to that. Note that this encoder already has two special encodings: one for datetime, and the other for a generator. Wave the white flag; encoding isn\u0026rsquo;t possible, but the encoder will tell you exactly how to remedy the situation and not just yell at you for trying to encode something impossible. So how do you use this? Well, you can create complex objects like:\nclass Student(object): def __init__(self, name, enrolled): self.name = name # Should be a string self.enrolled = enrolled # Should be a datetime def serialize(self): return { \u0026#34;name\u0026#34;: self.name, \u0026#34;enrolled\u0026#34;: self.enrolled, } class Course(object): def __init__(self, students): self.students = students # Should be a list of students def serialize(self): for student in self.students: yield student And boom, you can now serialize them with the JSON encoder: json.dumps(course, cls=Encoder)! If you have other types that you don\u0026rsquo;t have direct access to, for example, UUID (part of the Python standard library), then simply extend the encoder and add an encode_UUID method.\nNote that extending the json.JSONDecoder is a bit more complicated, but you could do it along the same lines as the encoder methodology.\n","permalink":"https://bbengfort.github.io/2016/01/better-json-encoding/","summary":"\u003cp\u003eThe topic of the day is a simple one: JSON serialization. 
Here is my question, if you have a data structure like this:\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-python\" data-lang=\"python\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"kn\"\u003eimport\u003c/span\u003e \u003cspan class=\"nn\"\u003ejson\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"kn\"\u003eimport\u003c/span\u003e \u003cspan class=\"nn\"\u003edatetime\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"n\"\u003edata\u003c/span\u003e \u003cspan class=\"o\"\u003e=\u003c/span\u003e \u003cspan class=\"p\"\u003e{\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    \u003cspan class=\"s2\"\u003e\u0026#34;now\u0026#34;\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e \u003cspan class=\"n\"\u003edatetime\u003c/span\u003e\u003cspan class=\"o\"\u003e.\u003c/span\u003e\u003cspan class=\"n\"\u003edatetime\u003c/span\u003e\u003cspan class=\"o\"\u003e.\u003c/span\u003e\u003cspan class=\"n\"\u003enow\u003c/span\u003e\u003cspan class=\"p\"\u003e(),\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e    \u003cspan class=\"s2\"\u003e\u0026#34;range\u0026#34;\u003c/span\u003e\u003cspan class=\"p\"\u003e:\u003c/span\u003e \u003cspan class=\"n\"\u003exrange\u003c/span\u003e\u003cspan class=\"p\"\u003e(\u003c/span\u003e\u003cspan class=\"mi\"\u003e42\u003c/span\u003e\u003cspan class=\"p\"\u003e),\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan 
class=\"p\"\u003e}\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eWhy can\u0026rsquo;t you do something as simple as: \u003ccode\u003eprint json.dumps(data)\u003c/code\u003e? These are simple Python data types from the standard library. Granted, serializing a datetime might have some complications, but ISO 8601 strings are a well-established convention for datetimes in JSON. Moreover, a generator is just an iterable, which can be put into memory as a list, which is exactly the kind of thing that JSON \u003cem\u003elikes\u003c/em\u003e to serialize. It feels like this should just work. Luckily, there is a solution to the problem as shown in the Gist below:\u003c/p\u003e","title":"Better JSON Encoding"},{"content":"Programming with databases is a fact of life for any seasoned programmer (read, “worth their salt”). From embedded databases like SQLite and LevelDB to server databases like PostgreSQL, data management is a fundamental part of any significant project. The first thing I should say here is skip the ORM and learn SQL. SQL is such a powerful tool to query and manage a database, and is far more performant thanks to 40 years of research and development.\nOk, now that we\u0026rsquo;ve got that out of the way, the question becomes, how do we embed SQL into our programming language of choice? What you\u0026rsquo;ll typically see in tutorials is the direct embedding of strings into the codebase. While this works, and is nice because now your SQL is also versioned, it can also create many security-related complications that I won\u0026rsquo;t go into, as well as an organizational nightmare. So you\u0026rsquo;ve got to wrap your SQL statements somehow.\nUnfortunately, there is no standard answer for this because there are a lot of questions, including connection management for performance, size and frequency of queries, etc. Each use case has its own optimization. 
Therefore, I\u0026rsquo;d like to look at a simple wrapper for a Query, as shown in the Gist below and discussed after the code.\nAs you can see from the example, we have a routine query where we want to get the orders between a particular time range for a customer identified by their email. Presumably this query will be executed many times in the course of our program, so the factory gives us the ability to run many different queries simultaneously.\nBasically what this method gets us is the wrapping of a parameterized query — e.g. a query that uses PEP 249 string formatting to add arguments on execution. Calls to query\u0026rsquo;s iterator initiate a connection to the database and execute the query, returning the results of fetch row. By using the factory method, this technique basically gives us the ability to execute many queries with different parameters over the course of program execution, such that each query has a separate connection, cursor, and error handling.\nThere are also two techniques involving the engine and the query that I generally use. The engine in this case connects to a particular database. For a SQLite database you have to specify a path on disk, for a PostgreSQL database a url, username, and password. My preference is to use a database url but you\u0026rsquo;ll note that the Query object is database-agnostic. 
Although beyond the scope of this post, a simple Engine can be created as follows:\nimport psycopg2 class PostgreSQLEngine(object): def __init__(self, database, user, password, host, port): self.params = { \u0026#39;database\u0026#39;: database, \u0026#39;user\u0026#39;: user, \u0026#39;password\u0026#39;: password, \u0026#39;host\u0026#39;: host, \u0026#39;port\u0026#39;: port, } def connect(self): return psycopg2.connect(**self.params) def query_factory(sql, **kwargs): def factory(): return Query(sql, PostgreSQLEngine(**kwargs)) return factory You could then create an engine object that reads configuration details from Confire, parses a database URL, or selects from SQLite or PostgreSQL depending on which is available.\nAlso, the Gist uses a query that is embedded as a docstring. I prefer to store my more complex SQL in .sql files and load them from disk. (Smaller queries I might have constants stored in a queries.py or similar). This changes the factory again:\ndef query_factory(path, **kwargs): engine = PostgreSQLEngine(**kwargs) with open(path, \u0026#39;r\u0026#39;) as f: sql = f.read().strip() def factory(): return Query(sql, engine) return factory Advanced implementation of this particular technique will use:\nRow format classes to return Python objects or namedtuples. Context managers to ensure the connection to the database gets closed. A connection pool as the engine to reuse connection objects. Advanced error handling for not found or parameter errors. We do this so much that we plan to create a package called ORMBad which will implement engines and a more advanced query pattern. We just have to get around to doing it!\n","permalink":"https://bbengfort.github.io/2016/01/query-factory/","summary":"\u003cp\u003eProgramming with databases is a fact of life for any seasoned programmer (read, “worth their salt”). 
From embedded databases like SQLite and LevelDB to server databases like PostgreSQL, data management is a fundamental part of any significant project. The first thing I should say here is \u003cem\u003eskip the ORM and learn SQL\u003c/em\u003e. SQL is such a powerful tool to query and manage a database, and is far more performant thanks to 40 years of research and development.\u003c/p\u003e","title":"Simple SQL Query Wrapper"},{"content":"If you\u0026rsquo;ve pair programmed with me, you might have seen me type something to the following effect on my terminal, particularly if I have just created a new file:\n$ codetime Then somehow I can magically paste a formatted timestamp into the file! Well it\u0026rsquo;s not a mystery, in fact, it\u0026rsquo;s just a simple alias:\nalias codetime=\u0026#34;clock.py code | pbcopy\u0026#34; Oh, well that\u0026rsquo;s easy — why the blog post? Hey, what\u0026rsquo;s clock.py? A great question! This Python script is the dumbest thing that I have ever written, that has become the most useful tool that I use on a daily basis. Whenever there is a dumb to useful ratio like that, it\u0026rsquo;s blogging time. Here is clock.py:\nSo that\u0026rsquo;s it. It literally just prints out a string formatted datetime based on a named argument like \u0026ldquo;code\u0026rdquo;. In fact, this Jekyll blog has a date property in the YAML front matter that I can get using clock.py blog! So why do this? Well first, I was tired of aliasing date, particularly because there is a different implementation on OS X and Linux. Secondly, I needed JSON timestamps in UTC rather than my current time. This simple printer does that for me! 
So voila!\n","permalink":"https://bbengfort.github.io/2016/01/codetime-and-clock/","summary":"\u003cp\u003eIf you\u0026rsquo;ve pair programmed with me, you might have seen me type something to the following effect on my terminal, particularly if I have just created a new file:\u003c/p\u003e\n\u003cpre\u003e\u003ccode\u003e$ codetime\n\u003c/code\u003e\u003c/pre\u003e\n\u003cp\u003eThen somehow I can magically paste a formatted timestamp into the file! Well it\u0026rsquo;s not a mystery, in fact, it\u0026rsquo;s just a simple alias:\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" class=\"chroma\"\u003e\u003ccode class=\"language-bash\" data-lang=\"bash\"\u003e\u003cspan class=\"line\"\u003e\u003cspan class=\"cl\"\u003e\u003cspan class=\"nb\"\u003ealias\u003c/span\u003e \u003cspan class=\"nv\"\u003ecodetime\u003c/span\u003e\u003cspan class=\"o\"\u003e=\u003c/span\u003e\u003cspan class=\"s2\"\u003e\u0026#34;clock.py code | pbcopy\u0026#34;\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eOh, well that\u0026rsquo;s easy — why the blog post? Hey, what\u0026rsquo;s \u003ccode\u003eclock.py\u003c/code\u003e? A great question! This Python script is the \u003cem\u003edumbest\u003c/em\u003e thing that I have ever written, that has become the most \u003cem\u003euseful\u003c/em\u003e tool that I use on a daily basis. Whenever there is a dumb to useful ratio like that, it\u0026rsquo;s blogging time. Here is \u003ccode\u003eclock.py\u003c/code\u003e:\u003c/p\u003e","title":"The codetime and clock Commands"},{"content":"The standard library logging module is excellent. It is also quite tedious if you want to use it in a production system. In particular you have to figure out the following:\nconfiguration of the formatters, handlers, and loggers object management throughout the script (e.g. 
the logging.getLogger function) adding extra context to log messages for more complex formatters handling and logging warnings (and to a lesser extent, exceptions) The logging module actually does all of these things. The problem is that it doesn\u0026rsquo;t do them all at once for you, or with one single API. Therefore, we typically wrap the logging module so that we can provide extra context on demand, as well as handle warnings with ease. Moreover, once we have a wrapped logger, we can do fun things like create mixins to put together classes that have loggers inside of them.\nBelow is a very typical example of a logger.py that we use in many of our projects. Note that the configuration is embedded into the module as a dictionary, but uses some configuration values from our settings object. The wrapper class simply takes a class-based logger property, created with logging.getLogger, so that all instances use the same logger. It then provides functions for the various levels, and a generic log method.\nNote that if you call logger.warn on a logger with raise_warnings=True, then it will kick out to the warnings module. Finally, I provide a mixin class for providing loggers on demand as properties.\nThe thing I still haven\u0026rsquo;t figured out is how to put the configuration easily into a YAML file, particularly while using Confire. This is what led to putting the configuration dictionary directly into the utility module as seen above. The primary problem is that unless I create classes for every nested level of the logging configuration, by adding anything to the YAML file you blow away the other keys. I think that I\u0026rsquo;ll have to create a LoggingConfiguration type thing in Confire specifically, and figure it out there.\n","permalink":"https://bbengfort.github.io/2016/01/logging-mixin/","summary":"\u003cp\u003eThe standard library \u003ccode\u003elogging\u003c/code\u003e module is excellent. 
It is also quite tedious if you want to use it in a production system. In particular you have to figure out the following:\u003c/p\u003e\n\u003col\u003e\n\u003cli\u003econfiguration of the formatters, handlers, and loggers\u003c/li\u003e\n\u003cli\u003eobject management throughout the script (e.g. the \u003ccode\u003elogging.getLogger\u003c/code\u003e function)\u003c/li\u003e\n\u003cli\u003eadding extra context to log messages for more complex formatters\u003c/li\u003e\n\u003cli\u003ehandling and logging warnings (and to a lesser extent, exceptions)\u003c/li\u003e\n\u003c/ol\u003e\n\u003cp\u003eThe \u003ccode\u003elogging\u003c/code\u003e module actually does \u003cem\u003eall\u003c/em\u003e of these things. The problem is that it doesn\u0026rsquo;t do them all at once for you, or with one single API. Therefore we typically go the route that we want to \u003cem\u003ewrap\u003c/em\u003e the logging module so that we can provide extra context on demand, as well as handle warnings with ease. Moreover, once we have a wrapped logger, we can do fun things like create mixins to put together classes that have loggers inside of them.\u003c/p\u003e","title":"Wrapping the Logging Module"},{"content":"Let\u0026rsquo;s face it, most of the Python programs we write are going to be used from the command line. There are tons of command line interface helper libraries out there. My preferred CLI method is the style of Django\u0026rsquo;s management utility. More on this later, when we hopefully publish a library that gives us that out of the box (we use it in many of our projects already).\nSometimes though, you just want a simple CLI script. These days we use the standard library argparse module to parse commands off the command line. Here is my basic script that I use for most of my projects:\nSo how do you use this? Well essentially you just add subcommand parsers and their associated helper functions. 
Generally speaking you should do most of the work in the module and simply import that work to be executed here; only the command line context should be managed from your helper functions.\n","permalink":"https://bbengfort.github.io/2016/01/simple-cli-argparse/","summary":"\u003cp\u003eLet\u0026rsquo;s face it, most of the Python programs we write are going to be used from the command line. There are \u003cem\u003etons\u003c/em\u003e of command line interface helper libraries out there. My preferred CLI method is the style of Django\u0026rsquo;s management utility. More on this later, when we hopefully publish a library that gives us that out of the box (we use it in many of our projects already).\u003c/p\u003e\n\u003cp\u003eSometimes though, you just want a simple CLI script. These days we use the standard library \u003ccode\u003eargparse\u003c/code\u003e module to parse commands off the command line. Here is my basic script that I use for most of my projects:\u003c/p\u003e","title":"Simple CLI Script with Argparse"},{"content":"I don\u0026rsquo;t use project templates like cookiecutter. I\u0026rsquo;m sure they\u0026rsquo;re fine, but when I start a new project I like to get a cup of coffee, go to my zen place and manually create the workspace. It gets me in the right place to code. Here\u0026rsquo;s the thing, there is a right way to set up a Python project. Plus, I have a particular style for my repositories — particularly how I use Creative Commons Flickr photos as the header for my README files.\nHere\u0026rsquo;s my primary structure for a project with a primary Python module called “foo”:\n$ project . ├── .gitignore ├── .travis.yml ├── bin | └── app.py ├── docs | ├── images | | └── banner.jpg | └── index.md ├── fixtures ├── foo | └── __init__.py ├── LICENSE.txt ├── Makefile ├── mkdocs.yml ├── README.md ├── requirements.txt ├── setup.py └── tests └── __init__.py So, that\u0026rsquo;s actually a lot of files! 
Maybe I should put myself into copy-and-paste land lest my coffee get cold while I\u0026rsquo;m doing this.\nMake and Dependencies I use a Makefile. I won\u0026rsquo;t apologize. I just like it. I wish there was something similar to rake for Python. There, I said it.\nSo this Makefile essentially shows how I clean up after myself and run tests, as well as publish to GitHub Pages if I have a subdirectory with HTML for that environment. The requirements are just requirements that I have in basically every single project that I create.\nVersioning Python modules should be well versioned, especially as I prefer to have good numbering for GitHub releases. Occasionally I will create an actual version.py in the root of my module, but more often than not, I just stick it into the __init__.py of the module.\nTo “version bump” as it were, I simply modify the information in __version_info__ by updating the release numbers. There is probably also a way to do this automatically or with a version bump script.\nTesting I like to use a combination of Travis-CI and Coveralls to get pretty badges on my README.md file. Here are my basic test cases and a .travis.yml file.\nNote that these files are named for easy location in the Gist; they are not the names of the actual files in the repository.\nGitHub My labels in GitHub Issues are defined in the blog post: How we use labels on GitHub Issues at Mediocre Laboratories. I really like adding both a “type” and “priority” to every one of my cards. Makes issue management so much easier.\nI also now tend to use both a master and a develop branch, such that my branches are set up in a typical production/release/development cycle as described in A Successful Git Branching Model. 
A typical workflow is as follows:\nSelect a card from the dev board - preferably one that is \u0026ldquo;ready\u0026rdquo; then move it to \u0026ldquo;in-progress\u0026rdquo;.\nCreate a branch off of develop called “feature-[feature name]”, work and commit into that branch.\n~$ git checkout -b feature-myfeature develop Once you are done working (and everything is tested) merge your feature into develop.\n~$ git checkout develop ~$ git merge --no-ff feature-myfeature ~$ git branch -d feature-myfeature ~$ git push origin develop Repeat. Releases will be routinely pushed into master via release branches, then deployed to the server.\nPull requests will be reviewed when the Travis-CI tests pass, so including tests with your pull request is ideal!\nREADME.md And finally, here is some Markdown that I typically use for the README:\nOk, that\u0026rsquo;s all the project templates for now!\n","permalink":"https://bbengfort.github.io/2016/01/project-start/","summary":"\u003cp\u003eI don\u0026rsquo;t use project templates like \u003ca href=\"https://cookiecutter.readthedocs.org/en/latest/\"\u003ecookiecutter\u003c/a\u003e. I\u0026rsquo;m sure they\u0026rsquo;re fine, but when I start a new project I like to get a cup of coffee, go to my zen place and manually create the workspace. It gets me in the right place to code. Here\u0026rsquo;s the thing, \u003ca href=\"http://blog.districtdatalabs.com/how-to-develop-quality-python-code\"\u003ethere is a right way to set up a Python project\u003c/a\u003e. Plus, I have a particular style for my repositories — particularly how I use Creative Commons \u003ca href=\"https://www.flickr.com/\"\u003eFlickr\u003c/a\u003e photos as the header for my README files.\u003c/p\u003e","title":"Basic Python Project Files"},{"content":"I have a bit of catch up to do — and I think that this notepad and development journal is the perfect resource to do it. 
You see, I am constantly copying and pasting code from other projects into the current project that I\u0026rsquo;m working on. Usually this takes the form of a problem that I had solved previously that has a similar domain to a new problem, but requires a slight amount of tweaking. Other times I am just doing the same task over and over again.\n“But Ben, if you\u0026rsquo;re repeating yourself, shouldn\u0026rsquo;t you just make an open source module, require it as a dependency and import it?”\nSays you, whose voice sounds strangely like that of @looselycoupled. Well, of course I should, but the problem is that takes time, and how much of that do you think I have? I\u0026rsquo;ve tried to get a benlib going in the past, but upkeep is tough. And anyway I have done that. The prime example is confire, because we kept using the same YAML configuration code over and over again.\nIn fact, there are two massive pieces of code that need to be made into a library, if only for our own sanity:\nconsole utilities: we like to wrap argparse into a Django-like command program. Then all we have to do is write Command subclasses and they\u0026rsquo;re automatically added to our application. This needs to be a library ASAP. While we\u0026rsquo;re at it, we may as well stick our WrappedLogger utility in as well.\nsql query (ormbad): ORMs are such a pain, especially if you\u0026rsquo;re good at SQL (we are). We constantly write this Query class to wrap our SQL statements and load them from disk, etc. In fairness, we actually have started the dependency: ormbad, but we need to finish it.\nHowever, there is also a whole host of stuff that we use in our utilities, like the famous Timer class that we got from (somewhere?) and use all the time.\nBut you know, hunting for Gists is hard, hunting for code in other repositories is hard. So you know what? I\u0026rsquo;m just going to put it all here. Quick and dirty in the hopes that I\u0026rsquo;ll have a one-stop shop for copy and paste. 
Plus, embedding those Gists is very, very handy.\n","permalink":"https://bbengfort.github.io/2016/01/frequently-copy-pasted/","summary":"\u003cp\u003eI have a bit of catching up to do, and I think that this notepad and development journal is the perfect resource to do it. You see, I am \u003cem\u003econstantly\u003c/em\u003e copying and pasting code from other projects into the current project that I\u0026rsquo;m working on. Usually this takes the form of a problem that I had solved previously that has a similar domain to a new problem, but requires a slight amount of tweaking. Other times I am just doing the same task over and over again.\u003c/p\u003e","title":"Frequently Copied and Pasted"},{"content":"My family does \u0026ldquo;one big gift\u0026rdquo; every Christmas; that is, instead of everyone simply buying everyone else a smaller gift, every person is assigned one other person to give a single large gift. Selection of who gives what to whom is a source of some (minor) conflict. Therefore, we simply use a random algorithm. Unfortunately, a single uniform random sample of pairs is apparently not enough, so we take 100 samples that vote for each combination to decide who gets what, as follows:\nfrom random import shuffle from collections import Counter from itertools import combinations from collections import defaultdict def random_combinations(items, iterations=100): \u0026#34;\u0026#34;\u0026#34; Randomly combines items until a group reaches the minimum number of votes. This function will yield both the item voted for and the # of votes. 
\u0026#34;\u0026#34;\u0026#34; votes = defaultdict(Counter) giftees = set([]) for idx in xrange(iterations): shuffle(items) for (giver, giftee) in combinations(items, 2): votes[giver][giftee] += 1 combos = [] for giver, votes in votes.iteritems(): for giftee, vote in votes.most_common(): if giftee not in giftees: combos.append((giver, giftee, vote)) giftees.add(giftee) break if len(combos) != len(items): return random_combinations(items, iterations) return combos def read_names(path): \u0026#34;\u0026#34;\u0026#34; Reads the names from the associated text file (newline delimited). There must be an even number of names otherwise this won\u0026#39;t work very well. \u0026#34;\u0026#34;\u0026#34; with open(path, \u0026#39;r\u0026#39;) as f: return [ name.strip() for name in f if name.strip() ] if __name__ == \u0026#39;__main__\u0026#39;: # Print the names and the votes! for data in random_combinations(read_names(\u0026#39;names.txt\u0026#39;)): print \u0026#34;{} --\u0026gt; {} ({} votes)\u0026#34;.format(*data) Follow Ups The (secret) Gist contains all the code including the data file. Need to implement a method that does not allow for repeat giftee pairs. ","permalink":"https://bbengfort.github.io/2015/12/one-big-gift/","summary":"\u003cp\u003eMy family does \u0026ldquo;one big gift\u0026rdquo; every Christmas; that is instead of everyone simply buying everyone else a smaller gift; every person is assigned to one other person to give them a single large gift. Selection of who gives what to who is a place of some (minor) conflict. Therefore we simply use a random algorithm. 
Unfortunately, apparently a uniform random sample of pairs is not enough, therefore we take 100 samples to vote for each combination to see who gets what as follows:\u003c/p\u003e","title":"One Big Gift Selection Algorithm"},{"content":"Natural Language Processing and Hadoop\nDescription Benjamin Bengfort and Sean Murphy discuss how NLP can be integrated with Hadoop to gain insights in big data.\n","permalink":"https://bbengfort.github.io/2013/11/natural-language-processing-and-hadoop/","summary":"\u003cp\u003e\u003ca href=\"http://strataconf.com/stratany2013/public/schedule/detail/30806\"\u003eNatural Language Processing and Hadoop\u003c/a\u003e\u003c/p\u003e\n\n\n    \n    \u003cdiv style=\"position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;\"\u003e\n      \u003ciframe allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" allowfullscreen=\"allowfullscreen\" loading=\"eager\" referrerpolicy=\"strict-origin-when-cross-origin\" src=\"https://www.youtube.com/embed/2642kr9-cB0?autoplay=0\u0026controls=1\u0026end=0\u0026loop=0\u0026mute=0\u0026start=0\" style=\"position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;\" title=\"YouTube video\"\n      \u003e\u003c/iframe\u003e\n    \u003c/div\u003e\n\n\u003ch3 id=\"description\"\u003eDescription\u003c/h3\u003e\n\u003cp\u003eBenjamin Bengfort and Sean Murphy discuss how NLP can be integrated with Hadoop to gain insights in big data.\u003c/p\u003e","title":"Natural Language Processing and Hadoop"},{"content":" Description Presentation by the CTO of Unbound Concepts, Benjamin Bengfort, at the Columbia TechBreakfast 2012.\n","permalink":"https://bbengfort.github.io/2012/12/unbound-concepts-columbia-techbreakfast/","summary":"\u003cdiv style=\"position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;\"\u003e\n      \u003ciframe allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; 
picture-in-picture; web-share\" allowfullscreen=\"allowfullscreen\" loading=\"eager\" referrerpolicy=\"strict-origin-when-cross-origin\" src=\"https://www.youtube.com/embed/N-Bi_MwvZiY?autoplay=0\u0026controls=1\u0026end=0\u0026loop=0\u0026mute=0\u0026start=0\" style=\"position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;\" title=\"YouTube video\"\n      \u003e\u003c/iframe\u003e\n    \u003c/div\u003e\n\n\u003ch3 id=\"description\"\u003eDescription\u003c/h3\u003e\n\u003cp\u003ePresentation by the CTO of Unbound Concepts, Benjamin Bengfort, at the Columbia TechBreakfast 2012.\u003c/p\u003e","title":"Unbound Concepts: Columbia TechBreakfast Dec. 2012"},{"content":"This page is primarily my development journal and really only contains notes and ramblings for me to refer to as I practice programming. Please feel free to read and use anything you find on this site, but note it is not meant for publication or wide public consumption. If you want to find my more formal writing, check out the District Data Labs Blog where I write about Python, data science, streaming, distributed systems, and more.\nName Origin A libellus (plural libelli) was a document given to a Roman citizen to certify performance of a pagan sacrifice, hence demonstrating loyalty to the authorities of the Roman Empire. They could also mean certificates of indulgence, in which the confessors or martyrs interceded for apostate Christians. —Wikipedia\nSo these notes certify the performance of my programming, demonstrating my loyalty to Open Source development, as well as a confession of my programming sins.\nThanks This site was built with Hugo using the PaperMod theme by Aditya Telange. It is hosted on GitHub and served with GitHub Pages. The logo and icon I\u0026rsquo;ve used is Bear by Gregor Cresnar from the Noun Project.\n","permalink":"https://bbengfort.github.io/about/","summary":"about","title":"About"}]