{"description":"Hi, I’m Christian! I take data and make it useful to people","favicon":"https://www.ctmartin.dev/favicon.ico","feed_url":"https://www.ctmartin.dev/feed.json","home_page_url":"https://www.ctmartin.dev/","icon":"https://www.ctmartin.dev/icon.png","items":[{"authors":[{"name":"Christian Martin","url":"https://www.ctmartin.dev/authors/ctmartin/"}],"content_text":"I did a data analytics project with a nonprofit as part of completing my Master’s degree. This project was both the largest breadth and depth of any analysis done by the organization to date, looking at everything from basic membership statistics to attendance at major events to financial contributions.\nThe largest outcome from this was being able to quantify how financial contributions were changing over time and what that would look like in the next 10-15 years.\nChristian’s help in this critical area is deeply appreciated; we will be benefiting from his analysis for years to come.\nZen Bow, Spring/Summer 2024 Rather than me rambling on, it’d probably be simplest to show the poster and slides I produced; if you’re interested in the behind the scenes, it’ll be just a bit further down 😉\nCapstone Poster Presentation Publicly shared portion of slides Behind the Scenes The data I worked with for this project included:\n33k line-item transactions for financial contributions going back about 20 years 7.4k membership status changes going back about 30 years 3.4k people 180 retreats going back 20 years Website form responses Some of this data was fairly straightforward; the membership database is essentially a single table. Other aspects, however, were a bit more involved. Additionally, this data all served two purposes – first, for the analysis, and second, for importing into the new membership database (since it can handle these additional types of data).\nData Processing The membership status changes were stored in a comments field. 
The previous database (only moved off of about 1.5 years ago) stored these changes in a standardized format of either “Added MM/DD/YYYY” or “To [Status] MM/DD/YYYY” (and of course a couple of odd exceptions).\nSince this was so standardized, PostgreSQL’s handy regexp_matches() could be used to extract all of these annotations via RegEx capture groups. The regexp_matches() function is particularly useful in that it can be added to a SQL FROM clause to return the resulting matches as rows instead of an array. This can be used to easily create a view and then search ahead/behind to calculate duration as well as how users are changing between specific statuses.\nConversion of text comments to analytical table\nThe biggest hurdle however was the event rosters. These were stored in a variety of file formats, including Excel spreadsheets, Word documents, and also the older Word .DOC format. These files were sorted by file extension and a spreadsheet was preferred if available. For the older .DOC files, I used LibreOffice to convert them to the newer .DOCX format, which I could then read using a Python library.\nThankfully, most of the Word documents stored the names in tables, which made them easy to pull out. However, across all of these sources, the columns used were inconsistent and I needed logic to determine what (and how many) columns I needed to pull out.\nAfter pulling out the names, I had to do a lot of processing to link them to records in the member database. While most Western names only contain a First \u0026 Last name, I also had preferred vs legal names and Buddhist practice has a few titles that are used in names as well as often involves someone getting a Dharma name (sort of a religious name that acts like a First name for most purposes). Since most of these files were also hand-crafted, how different people accounted for accented characters was also variable. 
Thus, I also parsed the names using a few different methods of handling accented characters, so that I could link spellings with ö, o, and oe all to the correct person (and thankfully PostgreSQL has a normalize() function and the unaccent extension which can help with this). After doing all of this, I was left with a list of unmatched names that was short enough I could go through it by hand to search the membership database and link some additional cases where misspellings or nicknames were used.\nFinally, other parts of data processing included:\nMapping transactions to a new chart of accounts Cross-referencing contributions from the financial transactions with birthdays from the membership database to calculate age at time of transaction Separating contributions that the organization substantiates for tax purposes vs ones it does not (e.g. was donated through another organization) Inflation-adjusting transactions via CPI Using the Tukey formula to calculate contributions that were statistical outliers Systems To make my life easier, there are a few applications I set up. 
Kubernetes was used as a base because I could deploy everything with just a few YAML files.\nPostgreSQL was used as the primary database MySQL/MariaDB was used for importing website data Minio was used as a quick place to drop files that could be accessed by applications within Kubernetes Airbyte was used as an ETL tool for loading data into PostgreSQL (Airbyte handles a lot of things about cleaning field names and handling data types to be appropriate for the database being imported into; while it’s possible to import a CSV using a myriad of tools, this was significantly less painful) DBT was used for managing the SQL views, including refreshing materialized views Apache Superset was used for most visualizations \u0026 rapid iteration on the data VS Code \u0026 DBeaver for convenience \u0026 debugging Addendum Thanks to everyone at the Rochester Zen Center who supported me throughout this project!\nPostscript After finishing the original project, there was a bit more work that I did to follow up. While direct access to the data was helpful (particularly with a deadline), running a full database limited the reproducibility of the report by others year-over-year.\nThe original project did a lot of work to combine and convert data to live in the new membership database (and use the new Chart of Accounts). 
In the follow-up I took advantage of this, recreating a subset of the original report using Power BI and based on the schema of the new member database.\nWhile this new report is limited to a subset of charts, the approach has a few benefits that weren’t available for the original project:\nHaving most of the data in the new member database means that (apart from CPI) all the data can be pulled from one place Similarly, having transactions mostly filtered and in the new Chart of Accounts means there’s less processing that needs to be done While still non-trivial, M \u0026 DAX bear resemblance to Excel formulas which makes the report easier for others to modify Using Power BI Desktop means that re-running the report with updated data requires significantly less technical knowledge than a SQL database All the documentation fits on a single Power BI slide/page that I can mark as hidden so it doesn’t appear in PDF exports Combined, the simpler package makes it significantly easier to maintain and re-use for regular reporting Despite some rough edges, Power BI was the best tool for this task and was successful at meeting the goals for re-writing the report. It’s a program that’s easy to download and run, handles tables with a changing number of rows cleaner than Excel, doesn’t have the cost of Tableau, and doesn’t require the technical knowledge that running a SQL-based solution requires. It was able to yield all of the important financial reporting I needed it to, and the ease of use was a really important objective for the re-write.\n","date_modified":"2025-09-17T17:00:00-04:00","date_published":"2024-05-04T20:55:00-05:00","id":"https://www.ctmartin.dev/projects/capstone/","image":"https://www.ctmartin.dev/projects/capstone/martin-SP24-capstone-web-1mb.png","summary":"How often do organizations you know make decisions based on gut instinct? 
I analyzed the data to understand how members are engaging.","tags":["Projects","Graduate Works"],"title":"Measuring Membership \u0026 Impact","url":"https://www.ctmartin.dev/projects/capstone/"},{"authors":[{"name":"Christian Martin","url":"https://www.ctmartin.dev/authors/ctmartin/"}],"content_text":"Since getting my Databricks certifications in December, I’ve been working on a small project to demonstrate a variety of ways to do Data Engineering in Databricks.\nIn the GitHub repo for this project is a “Databricks Asset Bundle”, which is the Databricks way of doing GitOps version control for a Databricks workspace.\nSee the code The code contains several ways of doing data engineering tasks, including:\nImporting data from CSV \u0026 JSON data files A Databricks ETL pipeline A dbt project A Spark Declarative Pipeline CDC \u0026 merge into statements to allow quicker vector index rebuilding Plenty of code for normalizing fields, handling missing values, graceful fallback of text fields, etc. Data mart providing Markdown-like text describing objects (created from data fields) Vector index to facilitate searching documents Data quality monitoring using “expectations” constraints Quality of life user-defined functions (UDFs) In Screenshots After ingesting the data files, the ETL workflow goes as follows:\nSeveral tasks are directly executed as SQL files, such as creating UDFs or table schemas. 
Most of the processing happens within two tasks:\nA dbt project handles most processing of data from the Met A Spark Declarative Pipeline creates all other materialized views The Spark Declarative Pipeline does several things, including: processing all data from the Art Institute of Chicago, creating a materialized view for the Met that dbt was having trouble with, and creating Markdown-like document marts for each museum.\nThe declarative pipeline also uses Databricks’ data quality checks, which allows setting expectations like a title existing for each object:\nGoing back to the ETL pipeline, these documents are union-ed and brought into a data mart using a CDC merge statement. After processing, a Python notebook creates or syncs a vector embedding index.\nReports Now that data exists in usable forms, we can create a couple reports. One note is that the museums request mentioning that I’m transforming \u0026 interpreting the data, and that this may not match the museums’ perspective.\nThe first is a Databricks dashboard, which is also managed by the Databricks Asset Bundle:\nWe can also use the PowerBI connector to bring data into PowerBI for processing. This mock report uses the forecasting \u0026 AI analysis features:\nMore Coming There’s still more to add to this project. 
In particular, I’d like to add a demo AI app using Genie spaces and the vector index previously created.\nStay tuned!\n","date_modified":"2026-03-13T17:00:00-04:00","date_published":"2026-02-19T15:30:00-05:00","id":"https://www.ctmartin.dev/articles/write-ups/databricks-museum-demo/","image":"https://www.ctmartin.dev/articles/write-ups/databricks-museum-demo/etl.png","summary":"A demo project showcasing Databricks features for data engineering using open museum datasets.","tags":["Articles/Write-Ups","Articles","Museum Data"],"title":"Databricks Museum Demo","url":"https://www.ctmartin.dev/articles/write-ups/databricks-museum-demo/"},{"authors":[{"name":"Christian Martin","url":"https://www.ctmartin.dev/authors/ctmartin/"}],"content_text":"My laptop was running out of disk space, which prompted me to clean up files consuming a lot of disk space.\nWhen this happens, I usually turn to a program like WizTree or WinDirStat to find the files/directories consuming a large amount of disk space. In doing so, the culprit (this time) was that I had been doing a lot of data processing in WSL and that had caused the virtual disk to expand significantly.\nHere’s my instance of Ubuntu 24.04 consuming 56.6GB:\nWhen searching for how to shrink the disk, I found this blog post by Scott Hanselman to be the most helpful.\nThe TL;DR of the post is to run wsl --shutdown to close WSL and then use optimize-vhd. Note that optimize-vhd is a PowerShell command and uses the Windows Hyper-V tools. 
If you don’t have the Hyper-V tools installed, this Super User (Stack Exchange) thread provides some additional ways to shrink the disk using other tools.\nHere’s an example of using optimize-vhd after shutting down WSL (remember to run PowerShell as Administrator):\nPowerShell optimize-vhd -Path C:\\Users\\ctmartin\\AppData\\Local\\Packages\\CanonicalGroupLimited.Ubuntu24.04LTS_79rhkp1fndgsc\\LocalState\\ext4.vhdx -Mode full After shrinking the disk, it brought the size down to 12GB (about 21% of the original size):\nPart of the reason this change is so dramatic is that the data I was processing was temporary and deleted after I was done with it. A different disk with an old backup only shrank from 24.1GB to 23.6GB (saving about 2% of the disk size). If you’d like to save more space, it would be helpful to run some commands within the distro to clean up system/temporary files before trying to shrink the disk, particularly if you’re using Docker.\nWhile this is a small bit of information, hopefully it’ll be of use.\n","date_modified":"2026-01-12T13:36:00-05:00","date_published":"2026-01-12T13:36:00-05:00","id":"https://www.ctmartin.dev/posts/wsl-reclaim-disk-space/","image":"https://www.ctmartin.dev/posts/wsl-reclaim-disk-space/after-u24.png","summary":"A quick way to reclaim disk space from WSL","tags":["Informal Posts","Home Lab"],"title":"How to reclaim disk space from WSL","url":"https://www.ctmartin.dev/posts/wsl-reclaim-disk-space/"},{"authors":[{"name":"Christian Martin","url":"https://www.ctmartin.dev/authors/ctmartin/"}],"content_text":"Over the past year I’ve been slowly rewriting the Ansible playbook I use for managing my Linux machines. What started as a way to automate some common configurations has gotten significantly more modular and polished, and now it’s ready to show to the world.\nWhat was learned from v1? 
Trying to create an omnibus-style/generic management role didn’t scale well; I ended up with a lot of the following: YAML - include_tasks: tailscale.yml when: \"tailscale is truthy\" Using unique variables to manage most of the control flow also didn’t scale well. Most of what became “stacks” (like in #1) had multiple variables per set of tasks to trigger the conditional logic and then most config options had their own variables, all of which had to be declared at the host level. In the new roles, the variables are simplified to reduce the conditional logic and to use fewer, more effective variables As I brought in more systems, many of the assumptions I had about workflows or how to approach best practices were proven wrong. As an example of this, the roles had been written primarily with web servers (and similar) in mind, and were expected to be run regularly. However, when I brought on my workstation with an NVIDIA GPU, I found that CUDA does not upgrade cleanly across some versions, and because of how it’s tied to so many other packages, this could cause real problems What’s new? Roles are significantly more modular New “Stacks” roles that are used to install/configure a variety of software Stacks-related variables can cleanly be configured in the group vars instead of the host vars A bunch of host/group variables were removed as a result of simpler conditional logic Roles now use dependencies to stay modular and some are even just a list of dependencies Tags were added to speed up partial playbook runs A lot of templates and tasks were rewritten to be more robust and flexible Some roles are now loosely aware of other roles to interoperate better (e.g. 
the firewall is aware of Docker), however, this is still largely a WIP and very light-handed The web role was the largest beneficiary of this, and now heavily uses shared blocks for templates (like listen directives) Stacks The most notable new feature, and why it gets its own section, is what I’m calling “stacks” (the same way you’d use the phrase “LAMP stack”). “Stacks” are a series of roles I’ve written to install and configure a variety of software. Unlike those from Ansible Galaxy, writing the roles myself has allowed me to keep them manageable but also allowed me to be flexible or opinionated when it suits my needs (for example, all incoming web traffic is required to be HTTPS).\nSome of the roles align better with the term (such as stacks/lamp), while others are more like tools I wanted to make easy to install (like stacks/github-cli). Some of the roles have dependencies (such as stacks/nginx depending on stacks/certbot), and some are just dependencies (stacks/lamp has no tasks but depends on stacks/mysql and stacks/php, the latter of which depends on stacks/nginx).\nThere’s also a special role called stacks/fromvars which merges any variables called stacks or that begin with stacks_. This allows working around how variables get overridden; in the previous playbooks, there wasn’t a good way to specify stacks at both a group and host level (since the latter would overwrite the former). This also opens up the door to significantly nicer variable management, as using group vars is now a viable option.\nAs an example, in my inventory, I have the following:\nYAML # group_vars/vpn.yml stacks_vpn: - tailscale --- # group_vars/web.yml stacks_web_all: - nginx additional_mime_types: - \"...\" --- # host_vars/lamp.yml (also part of the web group) stacks_lamp: - mysql - php php_version: - \"...\" Final Thoughts Refactoring a project is often a non-trivial amount of work and this was no exception. 
The improvement was worth the work however; the result is more modular, easier to use, and allows for writing host configurations that are more expressive with less boilerplate. I’ve certainly found this useful, and hopefully you will too.\nView on GitHub ","date_modified":"2025-10-25T08:00:00-04:00","date_published":"2025-10-25T08:00:00-04:00","id":"https://www.ctmartin.dev/posts/ansible-v2/","summary":"Better stacks, smarter variables","tags":["Informal Posts","Home Lab"],"title":"V2 of my Ansible Playbook","url":"https://www.ctmartin.dev/posts/ansible-v2/"},{"authors":[{"name":"Christian Martin","url":"https://www.ctmartin.dev/authors/ctmartin/"}],"content_text":"Notoriously, Docker doesn’t play well with firewalls and will bypass them when port forwarding by default.\nThis problem happened to me when I purchased a VPS from a cloud hosting provider and discovered their cloud firewall, which blocks packets prior to reaching the server, did not work for that product (at least not yet).\nThe problem that occurs is that Docker injects port “publishing” via the DNAT table. Because DNAT happens before ALLOW/DENY rules, setting a rule to deny all incoming traffic will never take effect.\nOnline forums and documentation often recommend disabling iptables in the Docker Configuration - per the firewalld documentation:\nTo have full control of docker containers via firewalld one must first disable iptables in docker.\nThis can be done by adding iptables: false to the daemon configuration.\n(Note that while the documentation refers to iptables specifically, it also applies to nftables via the compatibility layer and to firewall managers like firewalld)\nHowever, doing this creates its own host of problems (pun intended) and there’s a better way to prevent Docker from bypassing the firewall on public servers. 
Docker’s documentation states:\nBut, this option is not appropriate for most users, it is likely to break container networking for the Docker Engine.\nFor example, with Docker’s firewalling disabled and no replacement rules, containers in bridge networks will not be able to access internet hosts by masquerading, but all of their ports will be accessible to hosts on the local network.\nThis seemingly leaves us with two unappealing options:\nDisable iptables in Docker, manually create firewall rules, and use the Docker API to look up container IPs Inject additional rules to the top of the Docker firewall tables, with logic to re-apply these rules every time the firewall gets reloaded However, there’s a better way: we can instead change how Docker binds ports.\nWhat does this mean? By default, Docker binds ports to 0.0.0.0. This is a wildcard IP address that means the port will be bound on all IPv4 addresses. Due to the usage of DNAT mentioned above, if our server has a public IP of 1.2.3.4, then publishing port 80 will cause all incoming traffic to 1.2.3.4:80 to be redirected to the container before any ALLOW/DENY rules have the chance to take effect.\nIn the Docker configuration, we can change the default bind address to 127.0.0.1 (the localhost address), both on the default network bridge (which gets used for ad-hoc invocations like docker run ...) and for all newly created networks (such as those created by Docker Compose).\nJSON { \"default-network-opts\": { \"bridge\": { \"com.docker.network.bridge.host_binding_ipv4\": \"127.0.0.1\" } }, \"ip\": \"127.0.0.1\" } Interestingly, after discovering this solution, I found it’s also recommended by the CIS Security Benchmark for Docker.\n2.4 Ensure Docker is allowed to make changes to iptables\nRationale: …We recommended letting Docker make changes to iptables automatically in order to avoid networking misconfigurations that could affect the communication between containers and with the outside world. 
Additionally, this reduces the administrative overhead of updating iptables every time you add containers or modify networking options.\n5.14 Ensure that incoming container traffic is bound to a specific host interface\nRationale: If you have multiple network interfaces on your host machine, the container can accept connections on exposed ports on any network interface. This might not be desirable and may not be secured.\n…ensure that the exposed container ports are bound to a specific interface and not to the wildcard IP address 0.0.0.0.\nThis solution isn’t 100% perfect - any existing networks will need to be re-created and some applications (inc. CLI tools) expect to be able to access containers without restrictions (and this breaks that assumption). If you’re using Docker on a remote server, like I am, this means you’ll need to use a proxy (like Nginx for web traffic) or something like SSH port forwarding to access those applications directly. Because this binds to localhost, this also means any VPNs you’re using to access the machine will also not work without using one of those solutions.\nDespite those limitations, it opens up the door to a much simpler way of using firewalld. Since Docker is aware of firewalld and will add its interfaces (like docker0) to it, we can now override the default Docker zone \u0026 policy. Unfortunately, these firewalld rules are redundant with changing the default bind address for incoming traffic since, because of the DNAT problem, they won’t change anything if a port is manually published to 0.0.0.0 (e.g. -p 0.0.0.0:80:80). However, declaring the zone and policy does give us an opportunity to put further restrictions on Docker since it won’t change the zone or policy if it already exists.\nDocker uses the following Go code to define the firewalld zone:\nGo dockerZone = \"docker\" // ... 
dz := firewalldZone{ version: \"1.0\", name: dockerZone, description: \"zone for docker bridge network interfaces\", target: \"ACCEPT\", } We can use the following to remove the default “ACCEPT” rule by declaring the following zone:\nXML \u003c?xml version=\"1.0\" encoding=\"utf-8\"?\u003e \u003czone\u003e \u003cshort\u003edocker\u003c/short\u003e \u003cdescription\u003ezone for docker bridge network interfaces\u003c/description\u003e \u003cforward/\u003e \u003c/zone\u003e Similarly, we can override the Docker policy. In the Go code, this is defined as:\nGo policy := map[string]interface{}{ \"version\": \"1.0\", \"description\": \"allow forwarding to the docker zone\", \"ingress_zones\": []string{\"ANY\"}, \"egress_zones\": []string{dockerZone}, \"target\": \"ACCEPT\", } We can use the following policy to change the ingress of “ANY” to “HOST”:\nXML \u003c?xml version=\"1.0\" encoding=\"utf-8\"?\u003e \u003cpolicy version=\"1.0\" target=\"ACCEPT\"\u003e \u003cdescription\u003eallow forwarding to the docker zone\u003c/description\u003e \u003cingress-zone name=\"HOST\"/\u003e \u003cegress-zone name=\"docker\"/\u003e \u003c/policy\u003e If you are using firewalld, you’ll also likely want a rule allowing outgoing traffic from Docker to the internet or any internal service.\nXML \u003c?xml version=\"1.0\" encoding=\"utf-8\"?\u003e \u003cpolicy version=\"1.0\" target=\"ACCEPT\"\u003e \u003cdescription\u003eallow docker to make outgoing connections to the public internet\u003c/description\u003e \u003cingress-zone name=\"docker\"/\u003e \u003cegress-zone name=\"public\"/\u003e \u003c/policy\u003e Additionally, you can add rules to allow select traffic to specific internal services (that aren’t already running in Docker). In the following example, I’m allowing HTTPS traffic to services available over my Tailscale VPN connection. 
Note that target provides the default behavior when rules don’t match; in this case, skip and allow the default deny to block it.\nXML \u003c?xml version=\"1.0\" encoding=\"utf-8\"?\u003e \u003cpolicy version=\"1.0\" target=\"CONTINUE\"\u003e \u003cdescription\u003eallow docker to make outgoing connections to the public internet\u003c/description\u003e \u003cingress-zone name=\"docker\"/\u003e \u003cegress-zone name=\"tailscale\"/\u003e \u003c!-- Allowed services --\u003e \u003cservice name=\"https\"/\u003e \u003cservice name=\"http3\"/\u003e\u003c!-- HTTP3/QUIC is always HTTPS but is over UDP instead of TCP --\u003e \u003c/policy\u003e In summary, for solving the original problem of running Docker on a VPS with a public IP and no provider-managed firewall, this solution provides a really nice way to ensure containers aren’t unintentionally publicly exposed. Additionally, while we can’t prevent manually binding to 0.0.0.0, we can put further limits on Docker’s interaction with other services.\n","date_modified":"2025-10-20T15:00:00-04:00","date_published":"2025-10-20T15:00:00-04:00","id":"https://www.ctmartin.dev/posts/docker-firewalling/","summary":"A better way to firewall Docker (without disabling iptables)","tags":["Informal Posts","Home Lab"],"title":"Docker Firewalling","url":"https://www.ctmartin.dev/posts/docker-firewalling/"},{"authors":[{"name":"Christian Martin","url":"https://www.ctmartin.dev/authors/ctmartin/"}],"content_text":"Looking back on the Power BI part of my Capstone project (“Measuring Membership \u0026 Impact”), I wanted to reflect on some of the things I learned about working with Power BI.\nIn the original report, I heavily used PostgreSQL’s materialized views since they allowed creating a more modular report and re-using intermediate aggregations. 
For example, I had a materialized view to rollup amounts per fund, per person, per year since that table was heavily used and slow to calculate with all the processing that needed to be done to the raw ledger. This view was used more directly for some reports, but for others it gets filtered and rolled up again for just contributions that count as charitable (for tax purposes), and then again to get some specific aggregations that are easier to calculate with some pre-processing.\nThis approach served me well in the original report, and I used it again in Power BI. Keeping slightly more piecemeal “queries” in PowerQuery/M was also beneficial in that it made debugging easier and refreshing data a lot quicker. A lot of the structure of the materialized views transferred well into PowerQuery, and the way PowerQuery handles refreshing data was also convenient in that it functioned similar to a materialized view.\nThere was definitely a learning curve in learning two new languages, M \u0026 DAX, and how to effectively use both. My experience was that DAX was better at doing common aggregations on-the-fly and M was better at more specific/literal calculations when you can afford the time to process. I found this unintuitive at first since I expected DAX to let me use aggregates with the expressiveness of window functions in SQL or Tableau, whereas in practice I found DAX to be less expressive than either. These problems with both filtering and using aggregates also extended into how DAX interacts with page and chart filters vs ones that were in the expression itself.\nIn many cases I found that decomposing an aggregation into an initial, simpler aggregation (or calculation) in M stacked with a second aggregation in DAX was a simpler solution than trying to do everything in one complicated aggregate. This was particularly the case when trying to run a window function-like aggregation over time-series data while also filtering on the date. 
I would also compare this to when you’d use an LOD/Table Calculation in Tableau, where you’re considering when a filter/aggregate is being run relative to filters, but perhaps less well-defined.\nI had to use similar workarounds when using Parameters since DAX doesn’t allow/have access to Parameters. Some of these were solved by moving calculations to M, however, another (albeit imperfect) trick I was able to use was creating a “Query” that was simply the value of the Parameter converted to a table with one row.\nPower BI also doesn’t handle dates as cleanly as either SQL or Tableau; when you’re working with all of M, DAX, chart filters, \u0026 chart formatting, what works well in one doesn’t necessarily work well in others. This is most noticeable when using DAX since it seems to want everything to be an aggregate. However, this also appears in the chart formatting where instead of options like “Date, Month, Quarter, Year” for Display Units, I instead get the same “None, Thousands, Millions, …” options that numeric values get; and even if I specify the Format of the field as the year (yyyy) or use the “Year” part of the Date Hierarchy, the chart axis can ignore what I’ve specified, yielding odd values like “Jul 02” instead of “2015”. In cases where I only needed the year, there were times I found it cleaner to convert the Date column to an integer column for the year.\nThe last main problem I had with Power BI was that its line charts don’t play nicely with missing values in continuous/time-series data. Most of the accounts being reported on have data for every year, however, the Extraordinary Income account (large donations and bequests) has some years without any transactions. In Excel and other tools, I have options for skipping missing values or defaulting to a value like zero. 
However, in Power BI, my options are to either create a custom “Query” using PowerQuery/M or to use a bar chart instead of a line chart.\nWith all of that said, Power BI was the best tool for the job at hand, and I wrote a bit more on the reasons it was a successful rewrite in the Postscript of the project page.\nHowever, if you want the TL;DR: “Power BI is a program that’s easy to download and run, handles tables with a changing number of rows cleaner than Excel, doesn’t have the cost of Tableau, and doesn’t require the technical knowledge that running a SQL-based solution requires. It was able to yield all of the important financial reporting I needed it to, and the ease of use was a really important objective for the re-write.”\n","date_modified":"2025-10-01T12:00:00-04:00","date_published":"2025-10-01T12:00:00-04:00","id":"https://www.ctmartin.dev/posts/powerbi-learnings/","summary":"Reflecting on working with Power BI","tags":["Informal Posts","Graduate Works"],"title":"Power BI Learnings","url":"https://www.ctmartin.dev/posts/powerbi-learnings/"},{"authors":[{"name":"Christian Martin","url":"https://www.ctmartin.dev/authors/ctmartin/"}],"content_text":"I’ve been wanting a personal data lab for several years now so I can practice data engineering; this is the culmination of that research. The goal of this project is to have a personal/home lab-style setup using open-source projects.\nWhy Open Source? Since I have hardware that can run this, the cost is effectively just electricity (aka, basically free). 
Additionally, most of the large cloud vendors have built at least some of their offerings on these open source projects (see the ElasticSearch \u0026 Redis license changes).\nSome of these offerings are fairly obviously using or based on open-source projects, for example:\nAWS ElastiCache (Redis, Memcached) AWS Keyspaces (Apache Cassandra) AWS EMR (Hadoop ecosystem) AWS Managed Service for Prometheus AWS Managed Grafana AWS Managed Workflows for Apache Airflow AWS OpenSearch (fork of ElasticSearch) AWS RDS (PostgreSQL, MySQL/MariaDB) AWS SageMaker Notebooks (Jupyter Notebooks) AWS Streaming Service for Apache Kafka Azure Cache for Redis Azure Database for MySQL Azure Database for PostgreSQL Azure Event Hubs for Apache Kafka Azure Managed Instance for Apache Cassandra GCP AlloyDB (based on PostgreSQL) GCP Cloud Composer (Apache Airflow) GCP Cloud SQL (PostgreSQL, MySQL/MariaDB) GCP Dataflow (Apache Beam) GCP Dataproc (Apache Spark, Hadoop) GCP Managed Service for Prometheus GCP Memorystore (Redis, Memcached) GCP Vertex AI Workbench (Jupyter Notebook) Google Colab (Jupyter Notebook) Databricks (originally built on Apache Spark) Datastax Astra DB (Apache Cassandra) Additionally, there are several open source projects based on vendor offerings, for example:\nApache HBase (based on GCP Bigtable) Minio (based on AWS S3 API) And many, many more, for both lists…\nSo, in many cases, it’s not just that the skills translate, but that the open source project is the actual software being used. It might have a pretty UI or management platform on top of it, but it may be the exact same software under the hood.\nSetup Important Note/Disclaimer: This is not a guide on security. Some of the tools used don’t allow user management in their open-source versions. This was all done on a server that I can access via a software VPN (not open to the public internet) and I highly recommend you do the same. 
That setup is, however, outside the scope of this post.\nThe Base For the base OS \u0026 deployments, I’m using an Ubuntu VM and MicroK8s (the latter since Kubernetes will be the easiest way to manage several of the applications we’re using). MicroK8s was chosen for being a lightweight Kubernetes distribution and well-supported.\nFind the code/configs on GitHub Kubernetes First, I set up MicroK8s per the Getting Started guide (don’t forget to log out and back in, or use su - $USER, to apply new groups).\nFor convenience I used several MicroK8s addons (as opposed to setting up the charts/manifests manually). MicroK8s addons are essentially just pre-packaged Kubernetes manifests/Helm charts, but some of them are important – for example, the hostpath-storage addon allows persistent volumes if you don’t have another storage provider configured (a must-have in our situation).\nShell microk8s enable cert-manager microk8s enable dashboard microk8s enable dns microk8s enable helm microk8s enable hostpath-storage microk8s enable host-access microk8s enable ingress microk8s enable minio microk8s enable metrics-server microk8s enable registry I also set up several useful dev tools on the server so I can use VS Code’s Kubernetes extension:\nShell sudo snap install kubectl --classic sudo snap install docker sudo snap install helm --classic To get (and protect) your ~/.kube/config file, you can use the following commands:\nShell microk8s config \u003e ~/.kube/config chmod 600 ~/.kube/config To have a consistent way to access, I set up a DNS record to the IP of my cluster. In these examples, I’m going to use k8s.example.com as a stand in.\nTo not deal with browser warnings about insecure connections, I configured cert-manager with Let’s Encrypt \u0026 the Cloudflare DNS challenge. 
One tip that was not clear in the documentation is that when using a ClusterIssuer (which allows you to generate certs in multiple namespaces), the issuer and any secrets must be in the namespace cert-manager looks in (by default, cert-manager).\nYAML # manifests/cert-manager-issuer.yml apiVersion: cert-manager.io/v1 kind: ClusterIssuer metadata: name: letsencrypt-prod namespace: cert-manager spec: acme: email: \u003clet's encrypt email\u003e privateKeySecretRef: name: prod-issuer-account-key server: https://acme-v02.api.letsencrypt.org/directory solvers: - selector: {} dns01: cloudflare: apiTokenSecretRef: name: cloudflare-api-token-secret key: api-token YAML # manifests/cert-manager-secret.yml apiVersion: v1 kind: Secret metadata: name: cloudflare-api-token-secret namespace: cert-manager type: Opaque stringData: api-token: \u003ctoken\u003e Then, for my own convenience while debugging, I created an Ingress (reverse proxy config) for the Kubernetes Dashboard, which was installed earlier as a MicroK8s addon:\nYAML # manifests/dashboard-ingress.yml apiVersion: networking.k8s.io/v1 kind: Ingress metadata: name: dashboard-ingress namespace: kube-system annotations: cert-manager.io/cluster-issuer: \"letsencrypt-prod\" kubernetes.io/ingress.class: nginx nginx.ingress.kubernetes.io/backend-protocol: \"HTTPS\" spec: ingressClassName: nginx tls: - hosts: - dashboard.k8s.example.com secretName: dashboard-tls rules: - host: dashboard.k8s.example.com http: paths: - path: / pathType: Prefix backend: service: name: kubernetes-dashboard port: number: 443 Tip: if you forget your token (or it expires), you can use the following command in MicroK8s:\nShell kubectl -n kube-system describe secret microk8s-dashboard-token PostgreSQL The base of the lab I’m building is going to be using PostgreSQL. 
To some extent, any SQL database would be fine; however, PostgreSQL is arguably the gold standard for SQL databases.\nShell helm install pg-release oci://registry-1.docker.io/bitnamicharts/postgresql After that I added a NodePort service so I can access PostgreSQL outside the cluster, which maps the application’s port to a port on the host machine. The better way to do this would be to add a TCP Ingress config, but the documentation for doing that on MicroK8s’ implementation of the Nginx ingress controller seems to be broken right now.\nYAML # manifests/postgresql-ingress.yml apiVersion: v1 kind: Service metadata: name: postgresql-ingress spec: type: NodePort selector: app.kubernetes.io/component: primary app.kubernetes.io/instance: pg-release app.kubernetes.io/name: postgresql ports: - name: tcp-postgresql nodePort: 32032 protocol: TCP port: 5432 targetPort: tcp-postgresql If you didn’t copy the password for PostgreSQL that was printed by the helm chart setup, you can get it with this command:\nShell kubectl get secret --namespace default pg-release-postgresql -o jsonpath=\"{.data.postgres-password}\" | base64 -d Here’s what the connection details look like in DBeaver (a SQL GUI):\nConnection configurationTip: It’ll make life easier if you check the box for “Show all databases” next to the “Database” field (otherwise DBeaver will only show the default “postgres” database).\nWhile we’re at it, let’s also create a database for our project (you can replace postgres with your own user if you’ve created one):\nSQL -- Create project database create database myproject with owner postgres encoding UTF8; Apache Superset For visualization, I’m going to install Apache Superset.\nFirst, as a note, the examples loader in Superset is broken for Kubernetes right now. 
To do a workaround we need to do a few things:\nCreate a database in our PostgreSQL instance called “examples” Create a schema called “main” in the examples database to match the schema name used by the loader code Set the “main” schema as default and delete the “public” schema Configure Superset to use our examples database, and to use a “search path” that looks at the “main” schema instead of the “public” schema To use our existing PostgreSQL instance, I’m going to create a user and two databases for Apache Superset:\nSQL -- As admin/privileged user create user superset with password 'superset'; create database superset with owner superset encoding UTF8; create database examples with owner superset encoding UTF8; -- Connect to the examples database -- e.g. `\\c examples` create schema main authorization superset; drop schema public; --\\c myproject -- Project/Lab Database (owned by me/user) -- Superset Permissions grant select on all tables in schema public to superset; alter default privileges in schema public grant select on tables to superset; Superset demands you create a non-default secret key, which can be done with the following:\nShell openssl rand -base64 42 Since we’re using Kubernetes to install, I need a values file like the following:\nYAML # values/superset.yml configOverrides: secret: | SECRET_KEY = '\u003cgenerated secret, e.g. 
`openssl rand -base64 42`\u003e' examples_config: | SQLALCHEMY_EXAMPLES_URI = 'postgresql://superset:superset@pg-release-postgresql:5432/examples?options=-csearch_path%3Ddbo,main' ingress: enabled: true ingressClassName: public annotations: cert-manager.io/cluster-issuer: \"letsencrypt-prod\" kubernetes.io/ingress.class: nginx hosts: - superset.k8s.example.com tls: - hosts: - superset.k8s.example.com secretName: superset-tls init: loadExamples: true postgresql: enabled: false supersetNode: connections: db_host: 'pg-release-postgresql' And then we can install it with the following command:\nShell helm repo add superset https://apache.github.io/superset helm upgrade --install --values values/superset.yml superset superset/superset We can now log into Superset with the default admin/admin login. You should be able to open the example Sales Dashboard and see the following:\nExample dashboard Concluding Thoughts This lab is already adequate for many data engineering purposes, particularly as it uses PostgreSQL for the base. 
However, there’s still a lot of room to grow this setup and support more use cases, particularly ETL/ELT pipelines.\nFuture posts on this will be coming, so keep an eye out!\n","date_modified":"2023-11-10T14:36:00-06:00","date_published":"2023-11-10T14:36:00-06:00","id":"https://www.ctmartin.dev/articles/write-ups/data-lab-part-1/","image":"https://www.ctmartin.dev/articles/write-ups/data-lab-part-1/example-dashboard.png","summary":"I wanted to get more hands-on while learning data tools, so I built myself a home lab using open-source software designed for data.","tags":["Articles/Write-Ups","Articles","Data Lab","Graduate Works","Independent Study – Rapid Prototyping in Data \u0026 AI"],"title":"Building a local, open-source data lab (Part 1)","url":"https://www.ctmartin.dev/articles/write-ups/data-lab-part-1/"},{"authors":[{"name":"Christian Martin","url":"https://www.ctmartin.dev/authors/ctmartin/"}],"content_text":"A few years ago, Google released a neat little product called Coral, a “tensor processing unit” (TPU), aka, an AI accelerator. Targeted at IoT/embedded devices, such as a Raspberry Pi, Coral can run models using TensorFlow Lite and has enough performance to allow these devices to do some AI in a reasonable amount of time.\nOf course, this isn’t without its limitations. TensorFlow Lite only supports models quantized to/using int8 (signed 8-bit integers), so no floating point numbers. Additionally, not all types of neural network operations are allowed. For example, Recurrent Neural Networks (RNNs) are limited to using LSTM, and even still has limitations \u0026 known issues on converting.\nFinally, the biggest limitation is that the hardware isn’t that powerful. However, that’s somewhat by design; Coral is designed to run efficiently with very little power, and TensorFlow Lite also targets mobile devices. So, if you have a GPU, even something simple will be able to run circles around the Coral. 
But if you have an embedded/IoT use case, Coral will perform significantly better than the CPU will (also, back when it was released, ARM chips didn’t have any sort of neural processing and while some do now in 2023, it’s still not common).\nRecently, I got my hands on the USB version of the Coral, so let’s try out the getting started guide as a way to learn about the Coral.\nFollowing the guide, I spun up a VM running Debian 10, since that’s one of the supported operating systems. After that I did a USB pass-through to access the Coral from the VM. This was the first hiccup I encountered – for some weird reason the Coral will switch between two hardware IDs:\n  1a6e:089a Global Unichip Corp. 18d1:9302 Google Inc. Thankfully the solution seems to be as simple as just passing both IDs through to the VM.\nThe second issue I encountered was that the kernel driver wants privileged access, and I was using an unprivileged user. A quick search yielded this GitHub issue, and rather than always running commands using sudo, I opted for the simpler (and more secure) option of adding my user to the plugdev group.\nShell sudo usermod -aG plugdev $USER After dealing with those two quirks, the rest of the process was seamless and I was able to run the MobileNet image classification example:\nShell Session $ python3 examples/classify_image.py --model test_data/mobilenet_v2_1.0_224_inat_bird_quant_edgetpu.tflite --labels test_data/inat_bird_labels.txt --input test_data/parrot.jpg ----INFERENCE TIME---- Note: The first inference on Edge TPU is slow because it includes loading the model into Edge TPU memory. 120.3ms 14.9ms 15.4ms 15.3ms 15.3ms -------RESULTS-------- Ara macao (Scarlet Macaw): 0.75781 Oddly, this is a lot slower than the getting started guide’s results. The accuracy was about the same, but their times were ~12ms for the first run and ~3ms for the subsequent runs. 
Google does provide an overclocked driver for better performance, but warns that it causes the hardware to get very hot.\nInterestingly, identical results are given for the M.2/PCIe variant of the Coral, which is able to perform better due to higher bandwidth (USB has more overhead and Coral has the USB variant categorized as a “prototyping” device) and align with the benchmarks page, which is noted as using C++ instead of Python due to C++ having lower overhead. So, part of me wonders if they copy-pasted and the getting started guide’s performance is not accurate.\nEither way, this was a fun little experiment and gives me a new toy to play with. I was also pleasantly surprised that the getting started guide worked with minimal issues, given that it’s been a few years since the product was released and the operating system support is a bit dated.\nCatch you next time!\n","date_modified":"2023-09-22T19:14:00-05:00","date_published":"2023-09-22T19:14:00-05:00","id":"https://www.ctmartin.dev/posts/coral-getting-started-test/","image":"https://www.ctmartin.dev/posts/coral-getting-started-test/feature.png","summary":"A few years ago, Google released a neat little product called Coral, a small AI accelerator. Let's try it out!","tags":["Informal Posts","Graduate Works","Independent Study – Rapid Prototyping in Data \u0026 AI"],"title":"Trying out a Coral TPU","url":"https://www.ctmartin.dev/posts/coral-getting-started-test/"},{"authors":[{"name":"Christian Martin","url":"https://www.ctmartin.dev/authors/ctmartin/"}],"content_text":"Continuing from the last post on searching the Linux manual (“man”) pages, this week I’m going to use ElasticSearch and see how well it works.\nWhy ElasticSearch? 
Well, as much as I have an issue with its licensing, it’s nearly synonymous with searching text documents.\nIronically, for being a well-deployed application, it goes on my list of “software whose quick start guide doesn’t work” – after a couple hours of debugging, the best I can tell is that the Docker image now defaults to trying to start in clustered mode instead of the single-node mode.\nTo remedy this, I used the Docker compose file for the multi-node quickstart, deleted the other nodes, and then set the remaining node to discovery.type=single-node\nWe can also use similar code to last time for loading the man pages into the database:\nPython from elasticsearch import Elasticsearch import gzip import os # Configuration manpath = \"/usr/share/man/\" # Global vars es = Elasticsearch( \"https://127.0.0.1:9200\", verify_certs=False, basic_auth=(\"elastic\",\"changeme\") ) # Print connection info to check it's working print(es.info()) # ... (same get_* functions as last post) data = [] # Loop through all sections and get pages for section in get_sections(): # Loop through pages \u0026 add content for page in get_section_pages(section): content = get_page_contents(section, page) #data.append((section, page, content)) es.index( index='man', document={ 'section': section, 'page': page, 'content': content } ) # Reindex es.indices.refresh(index='man') Searching is also fairly easy:\nPython from elasticsearch import Elasticsearch # Global vars es = Elasticsearch( \"https://127.0.0.1:9200\", verify_certs=False, basic_auth=(\"elastic\",\"changeme\") ) # Print connection info to check it's working #print(es.info()) results = es.search(index=\"man\",q=\"password requirements\") for doc in results[\"hits\"][\"hits\"]: print(doc[\"_score\"], doc[\"_source\"][\"section\"], doc[\"_source\"][\"page\"]) Let’s see the results!\nShell Session $ python3 search.py 8.162808 man3 getpass.3 8.019694 man3 endspent.3 8.019694 man3 lckpwdf.3 8.019694 man3 ulckpwdf.3 8.019694 man3 
fgetspent.3 8.019694 man3 setspent.3 8.019694 man3 sgetspent.3 8.019694 man3 sgetspent_r.3 8.019694 man3 getspent.3 8.019694 man3 getspnam.3 And it looks like it’s about the same as using the word match search in SQLite. A look at the documentation seems to confirm this:\nq – Query in the Lucene query string syntax using query parameter search. Query parameter searches do not support the full Elasticsearch Query DSL but are handy for testing.\nsearch() Since it looks like the Python library is not the easiest way to query, let’s jump over to Kibana since it was included in the Docker compose file.\nHere’s what the same query looks like in the Kibana console:\nHTTP POST /man/_search?pretty { \"query\": { \"query_string\": { \"query\": \"password requirements\" } }, \"_source\": [\"section\",\"page\"] } Using the Standard analyzer appears to get the same results:\nHTTP POST /man/_search?pretty { \"query\": { \"match\": { \"content\":{ \"query\": \"password requirements\", \"analyzer\": \"standard\" } } }, \"_source\": [\"section\",\"page\"] } Using the English text analyzer is promising though!\nHTTP POST /man/_search?pretty { \"query\": { \"match\": { \"content\":{ \"query\": \"password requirements\", \"analyzer\": \"english\" } } }, \"_source\": [\"section\",\"page\"] } Since the results are a giant JSON document, let’s use a little jq to simplify the results:\nShell Session $ jq '.hits.hits[]._source' en_analyzer.json { \"section\": \"man1\", \"page\": \"passwd.1\" } { \"section\": \"man5\", \"page\": \"shadow.5\" } { \"section\": \"man8\", \"page\": \"pam_pwquality.8\" } { \"section\": \"man8\", \"page\": \"pam_unix.8\" } { \"section\": \"man8\", \"page\": \"pam_extrausers.8\" } { \"section\": \"man1\", \"page\": \"apg.1\" } { \"section\": \"man1\", \"page\": \"systemd-ask-password.1\" } { \"section\": \"man1\", \"page\": \"systemd-tty-ask-password-agent.1\" } { \"section\": \"man5\", \"page\": \"pwquality.conf.5\" } { \"section\": \"man8\", \"page\": 
\"systemd-ask-password-console.service.8\" } { \"section\": \"man1\", \"page\": \"passwd.1\" } { \"section\": \"man5\", \"page\": \"shadow.5\" } { \"section\": \"man8\", \"page\": \"pam_pwquality.8\" } { \"section\": \"man8\", \"page\": \"pam_unix.8\" } { \"section\": \"man8\", \"page\": \"pam_extrausers.8\" } { \"section\": \"man1\", \"page\": \"apg.1\" } { \"section\": \"man1\", \"page\": \"systemd-ask-password.1\" } { \"section\": \"man1\", \"page\": \"systemd-tty-ask-password-agent.1\" } { \"section\": \"man5\", \"page\": \"pwquality.conf.5\" } { \"section\": \"man8\", \"page\": \"systemd-ask-password-console.service.8\" } pwquality.conf is the 9th result, but it’s not as bad as it looks – pam_pwquality is the 3rd result! This is the PAM module rather than the actual configuration, but it will explain a lot of the options and the “see also” section will send the user to the correct place:\nSEE ALSO pwscore(1), pwquality.conf(5), pam_pwquality(8), pam.conf(5), PAM(8)\nThis still doesn’t beat SQLite’s 2nd place for the correct place, but there’s a bit more to look at in the results. For sake of simplicity, I’m going to put the page names for the top results side-by-side:\nRank ElasticSearch SQLite Full-Text Search 1 passwd.1 getpass.3 2 shadow.5 pwquality.conf.5 3 pam_pwquality.8 putspent.3 4 pam_unix.8 endspent.3 5 pam_extrausers.8 lckpwdf.3 6 apg.1 ulckpwdf.3 7 systemd-ask-password.1 fgetspent.3 8 systemd-tty-ask-password-agent.1 getspent.3 9 pwquality.conf.5 getspnam.3 10 systemd-ask-password-console.service.8 setspent.3 If we read through the top results on the ElasticSearch side, we get passwd, the command for changing passwords, “shadow”, which is where passwords are stored on modern Linux distributions, the PAM module for password requirements (the module itself and then later the config), other PAM modules, a password generating command, and a few ways systemd can prompt the user to enter their password. 
Meanwhile, in the SQLite results, the password requirements config is second but all of the other top 10 results are internal Linux APIs. So, while what we wanted ranked lower on ElasticSearch, the results overall were much closer to what we’d want.\nThat’s going to be it for this time; keep an eye out since sometime in the future I intend to try this with a vector database.\n","date_modified":"2023-09-08T22:08:00-05:00","date_published":"2023-09-08T22:08:00-05:00","id":"https://www.ctmartin.dev/posts/searching-manuals-with-elasticsearch/","image":"https://www.ctmartin.dev/posts/searching-manuals-with-elasticsearch/feature.png","summary":"Creating a better way to search Linux manual pages","tags":["Informal Posts","Graduate Works","Independent Study – Rapid Prototyping in Data \u0026 AI"],"title":"Searching Manuals with ElasticSearch","url":"https://www.ctmartin.dev/posts/searching-manuals-with-elasticsearch/"},{"authors":[{"name":"Christian Martin","url":"https://www.ctmartin.dev/authors/ctmartin/"}],"content_text":"Linux distributions come with a built-in documentation function through what are called “man” (manual) pages. However, reading the manual generally requires knowing the name of the program or function you’re working with. So, let’s see if we can do a little better.\nThe example I’m going to run with for this mini-series of posts is going to be password requirements. The reason I’m picking this example is that it’s built-in, involves a complex part of the Linux ecosystem, and isn’t readily findable in the manual using common terms. We’re also going to assume that you don’t have access to a search engine, or that it should be much more timely to be able to access the manual locally.\nSpecifically, password requirements on Linux (assuming the system is managing its own passwords) involves an area called “Pluggable Authentication Modules” (or PAM for short). 
PAM controls many aspects of logins, including things like two-factor authentication, but one of the functions is changing passwords. The module handling the policy aspect is called pwquality.conf (“password quality configuration”). To access the manual pages, you’d run the man pwquality.conf command.\nNow, let’s take a look at why this isn’t easy to find (beyond needing to know the name to access the manual). Here are some of the word counts in the man pwquality.conf page, using common words associated with password complexity:\nWord Count “password” 23 “requirement” 1 “complex” 0 “policy” 0 “strong” and “strength” 0 Now, why do these word counts matter? Because there’s one use of “requirement” but almost no other common terms are used. If you’re looking at organizational password policies, or coming from the Microsoft Active Directory world, you’re not going to find it easily in the manual; meanwhile if you search for the Active Directory equivalent, you’ll get a page called “Password must meet complexity requirements.”\nConversely, the manual page has the word “quality” 9 times.\nSo, to try to make this better, let’s first figure out how the manual pages are stored. If we run man man it will tell us:\nManual pages are normally stored in nroff(1) format under a directory such as /usr/share/man. In some installations, there may also be preformatted cat pages to improve performance. See man‐path(5) for details of where these files are stored.\n(And, for a touch of irony, man man-path gives “No manual entry for man-path”)\nEditorial note: while the output I was given said the page was man-path(5), the correct name is “manpath” (e.g. man manpath). 
The manpath returns all of the locations the system has been configured to search, similar to the PATH environment variable, and may include locations other than /usr/share/man.\nHowever, there’s also one other important piece of information we’ll need in a minute – that the manual has sections:\n1 Executable programs or shell commands\n2 System calls (functions provided by the kernel)\n3 Library calls (functions within program libraries)\n4 Special files (usually found in /dev)\n5 File formats and conventions, e.g. /etc/passwd\n6 Games\n7 Miscellaneous (including macro packages and conventions), e.g. man(7), groff(7), man-pages(7)\n8 System administration commands (usually only for root)\n9 Kernel routines [Non standard]\nOk, so we can see a lot of localization directories and then some directories matching man* – since there’s no en directory (and Linux development is primarily in English), we can assume that the man directories are in English by default (and we’ll find this to be true in a bit).\nNow, if we look in the directories, we’ll find files named by their name, the section, and that they’re gzipped:\nShell Session $ ls /usr/share/man/man1 '[.1.gz' docker-service-scale.1.gz gnome-extensions.1.gz The first file name is a bit odd, but it’s for a bash thing and it’s just a string so we can handle it fine. However, as we start digging deeper we’ll notice there are a lot of section suffixes:\n  openssl-rsa.1ssl.gz Git.3pm.gz rwarray.3am.gz The easiest way to deal with this will be simply including it in the page name. 
So, pulling all of this together, here’s what some Python code looks like:\nPython import gzip import os import sqlite3 # Configuration manpath = \"/usr/share/man/\" # Get list of (English) sections def get_sections(): # Array for returning sections = [] # Loop through man sections for file in os.listdir(manpath): filename = os.fsdecode(file) # Use English files only if filename.startswith(\"man\"): sections.append(filename) return sections # Get pages in an individual section def get_section_pages(section): # Array for returning pages = [] # Path of man section path = os.path.join(manpath, section) for file in os.listdir(path): filename = os.fsdecode(file) # On Ubuntu all but a couple misc files are gzipped # Ignore files that aren't .gz if filename.endswith(\".gz\"): pagename = filename.removesuffix('.gz').removesuffix('.'+section) pages.append(pagename) return pages def get_page_contents(section, page): # Get all the rest of the details to make the path #section = \"man\" + section.removeprefix(\"man\") # normalize section_number = section.removeprefix(\"man\") filename = page + \".\" + section_number + \".gz\" # fallback for pages in some sections having additional prefixes if \".\" in page: filename = page + \".gz\" path = os.path.join(manpath, section, filename) with open(path, 'rb') as file: content = gzip.decompress(file.read()).decode() return content data = [] # Loop through all sections and get pages for section in get_sections(): # Loop through pages \u0026 add content for page in get_section_pages(section): content = get_page_contents(section, page) data.append((section, page, content)) Now, let’s load all of this into a format that’s easy to work with. 
I’m going to use a SQLite database for a couple reasons:\nIt’s a well-supported format It allows easy text matching To take this a little further, it supports Full-Text Search So, here’s some Python to load all of that into SQLite:\nPython # Configuration vars con = sqlite3.connect(\"man.db\") cur = con.cursor() # Create data table con.execute(\"CREATE TABLE manpages(section text, page text, content text)\") # Create full-text search table # (has different indexing that's focused on search) con.execute(\"CREATE VIRTUAL TABLE man_fts USING FTS5(section, page, content)\") # Add the data cur.executemany(\"INSERT INTO manpages VALUES(?, ?, ?)\", data) cur.executemany(\"INSERT INTO man_fts VALUES(?, ?, ?)\", data) con.commit() Now that that’s in SQL, let’s try the most simple sort of searching – matching specific words.\nLet’s again assume that I’m searching based on a set of organizational standards and I’m not familiar with PAM. The first search I might try is “password” and “complexity”:\nSQL select * from manpages where content like '%password%' and content like '%complexity%' Query ResultsUh oh, none of that is what we want. passwd is close since that’s the command for changing passwords, but we’ll find no mention of pwquality.conf in there. Stemming the word “complexity” to “complex” will give more results but still not what we’re looking for.\nIf we change the query to look for “requirements” instead of “complexity”, we’ll get the correct file, but it’s the 51st result. If we exclude section 3, it will become the 11th result.\nHowever, we should be able to do better. Let’s look at the Full-Text Search (FTS) table we created previously. 
Also, the full-text search is based on this TIL (“Today I Learned”) post by Simon Willison.\nSQL select * from man_fts where man_fts match \"password requirements\" order by man_fts.rank Query ResultsSecond result, that’s a massive improvement!\nFor convenience, here’s how we can turn this into a Bash function (for example, to include in a .bashrc file):\nShell # Creates a command called \"manfts\" that searches our manual database function manfts { # (Also, yes, this is vulnerable to a SQL injection and you shouldn't do this in production) sqlite3 ~/man.db \"select page from man_fts where man_fts match \\\"$@\\\" order by man_fts.rank limit 5\" } Shell Session $ manfts \"password requirements\" getpass.3 pwquality.conf.5 putspent.3 endspent.3 lckpwdf.3 For this week, this is as far as I got. However, we’ll pick this up to see how more powerful/feature-ful databases fare, starting with ElasticSearch.\n","date_modified":"2023-09-01T20:50:00-05:00","date_published":"2023-09-01T20:50:00-05:00","id":"https://www.ctmartin.dev/posts/searching-linux-man-pages/","image":"https://www.ctmartin.dev/posts/searching-linux-man-pages/sql-fts-output.png","summary":"Creating a better way to search Linux manual pages","tags":["Informal Posts","Graduate Works","Independent Study – Rapid Prototyping in Data \u0026 AI"],"title":"Searching Linux \"man\" pages","url":"https://www.ctmartin.dev/posts/searching-linux-man-pages/"},{"authors":[{"name":"Christian Martin","url":"https://www.ctmartin.dev/authors/ctmartin/"}],"content_text":"This project is an interactive dashboard visualization of the Smithsonian Institution’s Smithsonian Open Access dataset, which contained 11.9 million records at the time of writing. The dashboard allows insight into the composition of the Smithsonian’s collections, including what, when, and where items come from. 
Specifically, the visualization looks at the unit (such as the National Museum of American History or the Human Studies Film Archives), country, and age/time items come from (particularly if within the last couple centuries). Interaction enables filtering to a specific unit, allowing comparison in trends between units in addition to the whole.\nSince the Open Access dataset contains almost 12 million records, the data is, in its own way, opaque. When I started working on this, there was no published documentation on the API (thankfully, Matt Miller had a blog post that helped me get started, and shout-out to Dr. Decker who spotted it). So, I had to fall back on the tried-and-true approach: randomly sample, compute statistics, and log anomalies.\nAdditionally, at 26GB uncompressed, it was too large for me to load at once, much less interactively search through. With the laptop I had at the time, doing a simple keyword search using grep took about 5 minutes to complete. To accomplish this larger project, the data was sampled for basic structure and then processed with Python and Jupyter Lab. After logging summaries and anomalies, string processing was used to clean up typos, inconsistencies, and similar issues. Finally, a JSON file was created with the aggregated data.\nAt a technical level, the data was processed using command-line tools (grep, head, tail, awk), Python, Jupyter Lab, and Regex. 
The visualization was created with d3 and a fork of Semantic UI.\nThis project was also the third and final of the museum visualization projects that I did during my undergrad program at RIT.\nView Project Code (GitHub) ","date_modified":"2020-04-30T12:00:00-04:00","date_published":"2020-04-30T12:00:00-04:00","id":"https://www.ctmartin.dev/projects/smithsonian-collections/","image":"https://www.ctmartin.dev/projects/smithsonian-collections/si.png","summary":"An interactive dashboard of the Smithsonian Institution's collections, enabling insight into what, when, and where items come from.","tags":["Projects","Museum Data","Undergraduate Works"],"title":"Smithsonian Collections","url":"https://www.ctmartin.dev/projects/smithsonian-collections/"},{"authors":[{"name":"Christian Martin","url":"https://www.ctmartin.dev/authors/ctmartin/"}],"content_text":"This project is a narrative visualization seeking to interpret some of the information that can be obtained by looking at a museum’s collections. The dataset contains about 62,000 records including data like department, type (such as “print” or “jewelry”), and whether the item is on display.\nOpenRefine was used to do initial analysis of the data and understand what’s in it. All data processing in the final version is done in the browser using JavaScript and d3. This project was also the second of the museum visualizations I did during my undergraduate program at RIT.\nOne of the most interesting charts in this data was where I started exploring collections growth via accessions. An accession is the formal acquisition (regardless of reason, such as donation or purchase) of a museum item when it enters the collections. 
While there are a lot of reasons this data may not be fully reliable, it can show some general trends about how actively items came into the museum collections.\nView Project Code (GitHub) ","date_modified":"2020-01-23T12:00:00-05:00","date_published":"2020-01-23T12:00:00-05:00","id":"https://www.ctmartin.dev/projects/cma/","image":"https://www.ctmartin.dev/projects/cma/cma.png","summary":"Narrative visualization seeking to interpret information that can be obtained from a museum's collection data","tags":["Projects","Museum Data","Undergraduate Works"],"title":"Cleveland Museum of Art Collections","url":"https://www.ctmartin.dev/projects/cma/"},{"authors":[{"name":"Christian Martin","url":"https://www.ctmartin.dev/authors/ctmartin/"}],"content_text":"Talk about some of the layers that make up “The Cloud” \u0026 the security concerns at those layers. Includes OSI Layers 1-4 (Physical-Transport), Service-Oriented Architecture, Containers, \u0026 Infrastructure/Platform/Software/X as a Service. Originally presented to the RIT Linux Users Group (RITlug) in 2019.\nSlides","date_modified":"2019-02-01T18:00:00-05:00","date_published":"2019-02-01T18:00:00-05:00","id":"https://www.ctmartin.dev/talks/ritlug-intro-cloud-cloud-security/","summary":"Talk about some of the layers that make up “The Cloud” \u0026 the security concerns at those layers","tags":["Talks","RITlug"],"title":"Intro to the Cloud \u0026 Cloud Security","url":"https://www.ctmartin.dev/talks/ritlug-intro-cloud-cloud-security/"},{"authors":[{"name":"Christian Martin","url":"https://www.ctmartin.dev/authors/ctmartin/"}],"content_text":"How the internet is built on trust and common information security problems that can arise from violation of that trust. Originally presented in 2018 for a Science, Technology, \u0026 Values group project. 
Sections on Trust in Personal Data \u0026 Trust in Information Integrity.\nSlides","date_modified":"2018-12-03T12:00:00-05:00","date_published":"2018-12-03T12:00:00-05:00","id":"https://www.ctmartin.dev/talks/stv-internet-trust/","summary":"How the internet is built on trust and common information security problems that can arise from violation of that trust","tags":["Talks"],"title":"Trust in the Internet Age","url":"https://www.ctmartin.dev/talks/stv-internet-trust/"},{"authors":[{"name":"Christian Martin","url":"https://www.ctmartin.dev/authors/ctmartin/"}],"content_text":"An overview of the evolution in how clusters have been managed. Originally presented to the RIT Linux Users Group (RITlug) in 2017.\nSlides","date_modified":"2017-11-03T18:00:00-05:00","date_published":"2017-11-03T18:00:00-05:00","id":"https://www.ctmartin.dev/talks/ritlug-overview-distributed-computing/","summary":"An overview of the evolution in how clusters have been managed","tags":["Talks","RITlug"],"title":"Overview of Distributed Computing","url":"https://www.ctmartin.dev/talks/ritlug-overview-distributed-computing/"},{"content_text":"I love making data useful to people!\nI regularly float between languages and tools depending on what I need to do and what’s the best tool for the job. In a given week you might find me writing data science workflows in Python \u0026 SQL, doing web dev in JavaScript, or Linux systems administration.\nI got introduced to data science while doing my bachelor’s in interactive software development at Rochester Institute of Technology. 
During this time I also got heavily involved in the RIT Linux Users Group and the open source community.\nMy data science skills were further refined by getting my master’s at The University of Texas at Austin, where I specialized in Data Analytics and AI \u0026 Machine Learning.\nProfessionally, I’ve used my skills to do comprehensive migrations \u0026 updates to data, generate reports for director \u0026 board-level stakeholders, and test \u0026 validate solutions to problems.\nI’ve been a life-long self-learner; I taught myself how to code starting in middle school and have always kept pushing myself to learn how the software \u0026 libraries I use work and what it takes to run them. I’ve run a “home lab” for ~5 years to teach myself virtualization, containerization, and how to run/deploy the software I use. This has included what I’ve called my “data lab,” which I’ve used to orchestrate data science \u0026 analytical workflows on Kubernetes.\nOutside of work and school I enjoy tea, immersive experiences blending the real world with technology, and my “data hobby” is datasets of cultural institutions 😊\nFor more about my professional experience, check out my Resume.\n","id":"https://www.ctmartin.dev/about/","summary":"I love making data useful to people!\n","title":"Hey There!","url":"https://www.ctmartin.dev/about/"},{"content_text":" Download ","id":"https://www.ctmartin.dev/resume/","title":"Resume","url":"https://www.ctmartin.dev/resume/"},{"id":"https://www.ctmartin.dev/search/","summary":"Search","title":"Search","url":"https://www.ctmartin.dev/search/"}],"language":"en-us","title":"Christian Martin","version":"https://jsonfeed.org/version/1.1"}