<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by David Ohana on Medium]]></title>
        <description><![CDATA[Stories by David Ohana on Medium]]></description>
        <link>https://medium.com/@davidoha?source=rss-418b440883c6------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/0*huzZYgPCHsdlMFMW</url>
            <title>Stories by David Ohana on Medium</title>
            <link>https://medium.com/@davidoha?source=rss-418b440883c6------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sun, 05 Apr 2026 11:59:41 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@davidoha/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[On the Safe Side: Log your 3rd-party Package Versions]]></title>
            <link>https://itnext.io/on-the-safe-side-log-your-3rd-party-package-versions-8b70ebdc9e1d?source=rss-418b440883c6------2</link>
            <guid isPermaLink="false">https://medium.com/p/8b70ebdc9e1d</guid>
            <category><![CDATA[kotlin]]></category>
            <category><![CDATA[maven]]></category>
            <category><![CDATA[java]]></category>
            <category><![CDATA[version]]></category>
            <category><![CDATA[security]]></category>
            <dc:creator><![CDATA[David Ohana]]></dc:creator>
            <pubDate>Wed, 22 Dec 2021 15:50:16 GMT</pubDate>
            <atom:updated>2022-01-09T14:06:22.140Z</atom:updated>
<content:encoded><![CDATA[<h3>Better safe than sorry: log the versions of all dependencies at runtime</h3><p><em>In short: A tutorial on how to easily get the versions of log4j2 and the rest of your Java dependencies </em><strong><em>at runtime</em></strong><em>.</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/960/0*V26PTWtprt5grsu3.jpg" /></figure><p>In light of the recent <a href="https://logging.apache.org/log4j/2.x/security.html">log4j2 vulnerability</a>, you may find yourself having to upgrade the version of some 3rd-party dependency in order to mitigate a recently discovered exploit.</p><p>But how can you be sure that the version update actually happened? With today’s complex dependency rules, even when you specify an explicit package version somewhere, your package manager of choice (e.g. Maven, Gradle) might resolve your package to some lower version in order to satisfy a dependency shared with another package you use.</p><h3>Listing your dependencies at build time</h3><p>If you use Maven, you can run mvn dependency:resolve and the maven-dependency-plugin will list your dependencies. But this applies only at build time.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/928f9b409a4f58d848ce3fb0896d59c4/href">https://medium.com/media/928f9b409a4f58d848ce3fb0896d59c4/href</a></iframe><p>It is recommended to add this command to your build workflow so that all dependencies are clearly logged at build time.</p><h3>Listing your dependencies at run time</h3><p>It’s a good practice for your application to print/log the versions of all <strong>loaded</strong> packages <strong>at</strong> <strong>run-time as well </strong>so that you can always look at the logs of the production environment and be sure that the proper version is there.</p><h4>Kotlin/Java Implementation</h4><p>It is possible to use reflection to get all loaded packages.
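A minimal sketch of the idea might look like this (in Java rather than Kotlin, and much simplified compared to the sample code embedded in this post; the class name and package filtering are illustrative):

```java
import java.util.Arrays;
import java.util.Comparator;

// Sketch: print implementation metadata (read from JAR manifests) for all
// packages currently loaded, excluding built-in Java/JDK packages.
public class LoadedPackageVersions {
    public static void main(String[] args) {
        Arrays.stream(Package.getPackages())
                .filter(p -> !p.getName().startsWith("java.") && !p.getName().startsWith("jdk."))
                .sorted(Comparator.comparing(Package::getName))
                .forEach(p -> System.out.println(p.getName() + ": "
                        + p.getImplementationTitle() + " " + p.getImplementationVersion()));
    }
}
```

Note that getImplementationTitle()/getImplementationVersion() return null when the owning JAR manifest lacks the Implementation-* entries, which is discussed further below.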
Then, for each package, read and print implementation metadata from its JAR manifest. Here is sample Kotlin code that does this. In the demo, I excluded built-in Java packages.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/ee4f56f72e5ae64eed927db039666c6a/href">https://medium.com/media/ee4f56f72e5ae64eed927db039666c6a/href</a></iframe><p>This is the output:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/414ce53ffb8f46a55c2a6f51d2f21275/href">https://medium.com/media/414ce53ffb8f46a55c2a6f51d2f21275/href</a></iframe><p><strong>An important note</strong>: The code above prints only packages that are loaded into the JVM at the time of the call. So, you want to call it <strong>not</strong> at the very beginning of your app, but only after 3rd-party components are already initialized or started.</p><h4>Better Formatting</h4><p>Printing all loaded package versions can be too verbose, as you may have hundreds of loaded packages. I also prepared another flavor, which groups all packages by their product (title) and version.
It also lets you decide whether to print all the packages in each product.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/7dfe8d68d878d349bf4c5d3f20f843a0/href">https://medium.com/media/7dfe8d68d878d349bf4c5d3f20f843a0/href</a></iframe><p>Output — no details:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/0c481a096c38cc55a94039be2df236f3/href">https://medium.com/media/0c481a096c38cc55a94039be2df236f3/href</a></iframe><p>Output — with details (clipped):</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/88a164c55947ec5a45f48828e4cfb6fe/href">https://medium.com/media/88a164c55947ec5a45f48828e4cfb6fe/href</a></iframe><h4>Preserving Manifests of Dependencies</h4><p>Since the approach above relies on JAR manifest files, it will not work properly if you pack your application into a single JAR file with maven-assembly-plugin (“big-jar”). Instead, you will get null values for all implementation* fields. That is because each JAR can contain only a single manifest file, and the original manifests of each dependency are omitted in the packing process.<br>Workaround: Instead of producing a big jar, use the maven-dependency-plugin to copy the JARs of your dependencies to a lib folder below your app-only JAR file. In addition, use maven-jar-plugin to add the lib folder to the classpath, as demonstrated below.<br>If you have a big-jar solution, I would be happy to hear about it.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/b5c890a88ad8ec3b4189bd698c30204e/href">https://medium.com/media/b5c890a88ad8ec3b4189bd698c30204e/href</a></iframe><p>It is also recommended to set the &lt;addDefaultImplementationEntries&gt;true&lt;/addDefaultImplementationEntries&gt; directive as shown above for maven-jar-plugin.
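For reference, the relevant pom.xml fragment might look roughly like this (a sketch only; plugin configuration details vary, and the classpathPrefix assumes the lib folder layout described above):

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-jar-plugin</artifactId>
  <configuration>
    <archive>
      <manifest>
        <!-- put the copied dependency JARs under lib/ on the classpath -->
        <addClasspath>true</addClasspath>
        <classpathPrefix>lib/</classpathPrefix>
        <!-- generate Implementation-* manifest entries from the POM -->
        <addDefaultImplementationEntries>true</addDefaultImplementationEntries>
      </manifest>
    </archive>
  </configuration>
</plugin>
```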
This will update the Implementation-* fields in the MANIFEST.MF file in your own JAR from the pom.xml file — so that your own packages will be properly versioned as well.</p><h4>Simpler Solution</h4><p>But wait — if we already copy each dependency JAR below our app.jar using maven-dependency-plugin, there is a much simpler solution! Just log the filenames of the .jar files. When using a dependency manager, those already contain the version number. This also has other advantages over the previous method:</p><ul><li>It logs the version of any JAR, regardless of whether it&#39;s loaded.</li><li>Many authors don’t bother to set the implementation fields in their JAR manifest, but every library in Maven Central has a version in its filename, which is properly preserved in the directory structure.</li></ul><p>Here is the implementation:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/70718d03eeebcda2b5b1c34161a745cb/href">https://medium.com/media/70718d03eeebcda2b5b1c34161a745cb/href</a></iframe><p>The output (clipped):</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/7e5bc1fb97df774be573f33a6062da25/href">https://medium.com/media/7e5bc1fb97df774be573f33a6062da25/href</a></iframe><h4>Summary</h4><p>Regardless of your language, it&#39;s a good practice to log the versions of all your dependencies to be sure known vulnerabilities are mitigated properly. I demonstrated a few implementation alternatives in Kotlin. The first approach is based on printing the versions of packages loaded into the JVM. Using it, you can be absolutely sure what version of each package is in use. However, it is not as robust as the latter approach of simply copying dependency JARs to a known location and printing their filenames.</p><p>So don’t wait for the next vulnerability to be discovered!
Add dependency version logging to your bootstrapping code now, and you will be able to determine instantly if mitigation is required.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=8b70ebdc9e1d" width="1" height="1" alt=""><hr><p><a href="https://itnext.io/on-the-safe-side-log-your-3rd-party-package-versions-8b70ebdc9e1d">On the Safe Side: Log your 3rd-party Package Versions</a> was originally published in <a href="https://itnext.io">ITNEXT</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[10 Tips for Machine Learning Experiment Tracking and Reproducibility — Do It Yourself Approach…]]></title>
            <link>https://davidoha.medium.com/10-tips-for-machine-learning-experiment-tracking-and-reproducibility-do-it-yourself-approach-f7c31c533d94?source=rss-418b440883c6------2</link>
            <guid isPermaLink="false">https://medium.com/p/f7c31c533d94</guid>
            <category><![CDATA[experiment]]></category>
            <category><![CDATA[python]]></category>
            <category><![CDATA[research]]></category>
            <category><![CDATA[reproducibility]]></category>
            <category><![CDATA[machine-learning]]></category>
            <dc:creator><![CDATA[David Ohana]]></dc:creator>
            <pubDate>Sun, 21 Nov 2021 08:48:37 GMT</pubDate>
            <atom:updated>2021-12-09T13:26:46.808Z</atom:updated>
<content:encoded><![CDATA[<h3><strong>10 “Do It Yourself” Tips for Machine Learning Experiment Tracking and Reproducibility</strong></h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*4xA37_vVejD1_2mGasGeDA.jpeg" /><figcaption>Image credit: <a href="https://pixabay.com/users/wiredsmartio-14931632/?utm_source=link-attribution&amp;amp;utm_medium=referral&amp;amp;utm_campaign=image&amp;amp;utm_content=5090797">wiredsmartio/pixabay</a></figcaption></figure><p>This post was also published in the <a href="https://developer.ibm.com/blogs/10-diy-tips-for-machine-learning-experiment-tracking-and-reproducibility">IBM Developer Blog</a>.</p><p>As machine-learning practitioners, we invest significant time and effort to improve our models. We usually do this iteratively and experimentally: repeatedly changing the model, running an experiment, examining the results, and deciding whether the recent model change was positive and should be kept or discarded.</p><p>Changes in each iteration may involve, for example, changing the value of a hyper-parameter, adding a new input feature, changing the underlying ML model (e.g., using a Gradient Boosting classifier instead of a Random Forest classifier), trying a new heuristic, or trying an entirely new approach.</p><p>Experimentation cycles can cause a great deal of confusion. It is easy to get lost, forgetting what changes you made in recent experiments and whether the latest results are indeed better than before. A single experiment can take hours or even longer to complete. So, you try to optimize your time and execute multiple experiments simultaneously. This makes things even less manageable, and the confusion gets even worse.</p><p>In this article, I will share lessons and good practices I learned in my recent machine learning projects.
Although I call it a “Do it yourself” approach, some may call it “<a href="https://hadyelsahar.medium.com/how-do-you-manage-your-machine-learning-experiments-ab87508348ac">The caveman way</a>”. I am fully aware that nowadays there are many experiment tracking and management platforms, but it is not always possible or convenient to use them. Some platforms require that you execute your experiments on their infrastructure. Sometimes you can’t share any sensitive information outside of your organization — not just the datasets but also results and code. Many platforms require a paid subscription, which can also be a problem in some cases. Sometimes you just want full control of your experiment management approach and data.</p><p>The practices described below are easy to implement and do not require additional tooling. They are mostly suitable for small to medium ML projects with a single researcher or a small team. Most of the artifacts are saved locally, and adaptations may be required if you want to use shared storage. As a seasoned developer of production systems, I’m aware that a few of the tips below might be considered ‘code smells’ or bad practices in the traditional development approach for such systems. However, I believe they have their place and are justified for short-term research projects. I would like to emphasize that the tips below reflect my personal journey and point of view, and not necessarily any official views or practices. So, here I am, waiting for your stones :-)</p><h3><strong>Tracking what you did</strong></h3><h4><strong>1. Use source control</strong></h4><p>It goes without saying that your experimentation code should be source-controlled. That said, when using modern interactive environments like Jupyter Notebooks, it is easy to be tempted to run quick experiments on the fly without committing changes to Git or any other source control system. Try to avoid that as much as possible.
Maybe it is only me, but I prefer using a decent IDE and plain Python scripts to run experiments. I may use a notebook for the initial data exploration, but soon after an initial model skeleton is ready, I switch to a full-fledged Python script, which also allows debugging, refactoring, etc.</p><h4>2. <strong>Use identifiable experiments</strong></h4><p>But you know what? Source control isn’t enough. Even if everything is source-controlled, it can be tedious to browse the repository’s history and understand what source was used for running some experiment 12 days ago. I would like to suggest an additional practice that I call “Copy on Write”. Duplicate your latest experiment script file/folder before each new experiment and make the changes on the new file. Make your experiments identifiable by adding a sequential number to each experiment in the source file name. For example, <em>animal_classifier_009.py</em> for experiment #9. And yes, this works also for notebooks: you can create a notebook per experiment. This means you need only a file diff to understand what changed between experiment #9 and #12. Storage is cheap and the size of all of your experiments’ source code is probably dwarfed by the size of your data.</p><h4>3. <strong>Automatic source code snapshots</strong></h4><p>Another tip is to automatically take a snapshot of your experiment code for each run. You can do this easily inside the experiment script itself, by bootstrapping code that copies the source file/folder to a directory with the experiment start timestamp in its name. This will make your experiment tracking strategy robust even if you were tempted to make on-the-fly experiments without committing or copy-on-write above (a.k.a “Dirty Commits”). 
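A minimal sketch of such a snapshot bootstrap might look like this (the out-folder layout is illustrative, and the function name and optional script_path parameter are my own):

```python
import shutil
import sys
from datetime import datetime
from pathlib import Path

def snapshot_source(script_path=None, out_root="out"):
    """Copy the running experiment script into out/<experiment>/<start-timestamp>/source."""
    script = Path(script_path or sys.argv[0]).resolve()
    stamp = datetime.now().strftime("%Y_%m_%d-%H_%M_%S")
    target = Path(out_root) / script.stem / stamp / "source"
    target.mkdir(parents=True, exist_ok=True)
    shutil.copy2(script, target / script.name)  # copy2 preserves file timestamps
    return target
```

Calling a helper like this at the top of every experiment script records exactly what code produced each run, even for uncommitted changes.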
<br>For example, when running the experiment <em>animal_classifier_009.py</em>, we create the folder <em>out/animal_classifier_009/2021_11_03–12_34_12/source</em> and store a snapshot of the relevant source code inside.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/363/1*qZFyjq103SY6ye3z5jv0sA.png" /><figcaption>Source code snapshot on disk</figcaption></figure><h4>4. <strong>Treat configuration parameters the same as source code</strong></h4><p>Avoid tuning experiment parameters or hyper-parameters via the command line, environment variables, or any other external means that are not part of the source code. Otherwise, you risk losing traceability for changes if you forget to log the parameter values.</p><p>To embed experiment configuration, you can use plain Python variables, dictionaries, JSON, YAML, or any other format you find convenient. Just make sure you commit the configuration files together with the experiment code. Does hard-coding stuff seem like a code smell? Well, not in this case. If you do accept external run-time parameters — be sure to log their values!</p><p>Each configuration changeset should be treated as a unique experiment. It should be committed to source control, the configuration files should be included in the experiment code snapshot, and it should get its own experiment ID. <br> <br>The advantage of embedding configuration in source control is that you can be sure you reproduced the same experiment just by running the program file, with no other moving parts that you may forget to set.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/640/1*9nACJU4qT_2GAaim-_vHMw.png" /><figcaption>Using plain Python variables for configuration tracking</figcaption></figure><h4>5. <strong>Track experiment evolution tree</strong></h4><p>One of the things that helps me a lot is to keep track of the reference experiment — the predecessor baseline that I am trying to improve upon.
This is easy to do if your experiments are identifiable. When you create a new experiment by duplicating, keep track of the parent experiment ID + the essence of what you’ve tried in this experiment that is different from the parent. This information will help you quickly recall what you did, without relying on code diffs. It also makes it possible to traverse back in the experiment tree and quickly get the full picture. You can track the parent experiment inside the source code itself, as a code comment.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/468/1*OYHp6ONi2Zu_9lyIwmiC2w.png" /><figcaption>Experiment notes in code comments</figcaption></figure><p>However, this might cause a problem if you forget to update the notes before running the experiment. I suggest a simple spreadsheet like this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/647/1*PX_nV3mnaLUaRZbVyXNCWw.png" /><figcaption>Experiment tracking in a spreadsheet</figcaption></figure><p>In the sheet, you can also capture other information or parameters that you used in this experiment, and of course — experiment results. But we’ll touch on that later.</p><h3>Tracking what happened</h3><h4>6. <strong>Keep console/log output</strong></h4><p>Be generous with logging statements that track what happened in the experiment. Track many metrics and types of information, like dataset size, label count, date ranges, experiment execution time, and more. These can help you detect issues and mistakes. <em>Be paranoid!</em> Every unexplained change in a metric could be caused by some mistake in the experiment setup. This will help you understand its root cause.</p><p>Any experiment output should be persisted. I recommend using the Python <a href="https://realpython.com/python-logging/">logging</a> module instead of plain console prints, so you can redirect logging messages to both stdout and a file. 
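A minimal setup along these lines might look like this (a sketch; the logger name and format string are arbitrary choices of mine):

```python
import logging
import sys

def setup_logging(log_path):
    """Route log records to both stdout and a per-run log file, with timestamps."""
    logger = logging.getLogger("experiment")
    logger.setLevel(logging.INFO)
    formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    # One handler for the console, one for the persisted log file.
    for handler in (logging.StreamHandler(sys.stdout), logging.FileHandler(log_path)):
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger
```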
In addition, you will get timestamps for each log event, which can help you diagnose performance bottlenecks. You can store the log file under a folder correlated to the experiment ID and execution time:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/359/1*vXKA6AjQRU6V5yY15CRwUQ.png" /><figcaption>Experiment log on disk</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*lDoBKiwGGKUMnZIcgjwP_Q.png" /><figcaption>Experiment log output</figcaption></figure><h4>7. <strong>Track experiment results</strong></h4><p>You might use multiple metrics that quantify the quality of your model. For example, accuracy, precision, recall, F-score, AUC. Make sure you track these in a separate, structured results file that you may automatically process later to show charts, etc.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/367/1*k8bnly16TSqDd-rZvp6GbA.png" /><figcaption>Experiment results on disk</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/263/1*4SDl8PU79uvkevYpgBlwhg.png" /><figcaption>result.json — structured results file.</figcaption></figure><p>It’s also a good idea to track your most important metrics in the experiment spreadsheet so you can get the full picture quickly and decide on future directions. I like using colors to mark results (green=improved, red=got worse, yellow=not sure).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/933/1*D-0Zh0FpGOL2iy6eVxQxDw.png" /><figcaption>Tracking experiment results in a spreadsheet</figcaption></figure><h4>8. <strong>Do multiple repeats for stochastic models</strong></h4><p>You want your results to be reproducible, but you still want to avoid misleading results that occur by chance. The solution lies in repetition with varying random seeds. Avoid using fixed <a href="https://newbedev.com/what-is-random-state-in-sklearn-model-selection-train-test-split-example">random seeds</a> if your models are stochastic.
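In code, the repeat-and-average practice might be sketched like this (run_once is a hypothetical callable that performs one unseeded experiment run and returns a score):

```python
import statistics

def repeat_experiment(run_once, n_repeats=10):
    """Run a stochastic experiment several times; return the mean score
    and its standard error of the mean (SEM = stdev / sqrt(n))."""
    scores = [run_once() for _ in range(n_repeats)]
    mean = statistics.mean(scores)
    sem = statistics.stdev(scores) / (len(scores) ** 0.5)
    return mean, sem
```

A small SEM relative to the difference between two experiments gives confidence that an improvement is real rather than luck.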
The same applies to shuffling, downsampling, or any other operation that contains a random element. If you use scikit-learn, for example, always run your models with <em>random_state=None</em>. Perform <a href="https://machinelearningmastery.com/estimate-number-experiment-repeats-stochastic-machine-learning-algorithms/">multiple repeats</a> in each experiment and average the results of your optimization target metrics across all repeats, so you get stable numbers. You can use metrics like the <em>Standard Error of the Mean</em> (<a href="https://www.investopedia.com/ask/answers/042415/what-difference-between-standard-error-means-and-standard-deviation.asp">SEM</a>) to estimate how close your repeats’ mean is to the true mean of the population (if you could run an infinite number of repeats). The SEM value decreases as you increase the number of repeats. This will help you gain confidence and understand whether your latest results are indeed better, or whether it was just luck and you should increase the repeat count to be sure. In general, as your model gets more mature/stable, your optimizations will probably have a smaller impact and you might need to increase the repeat count.</p><h3>Tracking input and intermediate datasets</h3><h4>9. <strong>Track input datasets</strong></h4><p>Remember to version the datasets that are used as input to your model and include the version identifier in their names. Input datasets tend to be big, so I wouldn’t recommend duplicating them into each experiment’s tracking folder. Just make sure to log the filenames/URIs of the input datasets you used. You can also find these filenames in the source code snapshots for the relevant experiment. You can add another safety layer here by computing and logging a hash/digest of the contents of each input dataset. Also log the basic characteristics of the data, such as its dimensions and sample counts for each class.</p><h4>10.
<strong>Avoid or track intermediate datasets</strong></h4><p>Some of your code may carry out heavy preprocessing of datasets. This can sometimes take a long time, so you may do it once and then use the output in later steps. If your preprocessing has a stochastic nature (shuffling, train/test splitting, etc.), try to avoid creating intermediate datasets unless the preprocessing really saves a lot of experiment time. Otherwise, you may introduce an inherent bias into your data, similar to when using a fixed seed. Instead, you can invest in optimizing the execution time of the preprocessing steps. <br>If you do generate intermediate datasets, treat the source code you wrote for that purpose just like a normal experiment, using the practices described so far. Use version numbers for the source file, track the source code, track the logs, etc. It’s a good idea to save the output intermediate datasets in the out folder of each experiment. This will make the datasets inherently identifiable.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/371/1*TM2Y74R8p_Ovgi11qPzquw.png" /><figcaption>Tracking intermediate datasets</figcaption></figure><h4>Summary</h4><p>In short, experiment management is essential and pretty easy to do if you adopt some simple techniques. Whether you do it yourself or use an experiment management platform — just do it!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=f7c31c533d94" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Avoiding Bash frustration — Use Python for shell scripts]]></title>
            <link>https://medium.com/swlh/avoiding-bash-frustration-use-python-for-shell-scripts-44bba8ba1e9e?source=rss-418b440883c6------2</link>
            <guid isPermaLink="false">https://medium.com/p/44bba8ba1e9e</guid>
            <category><![CDATA[bash]]></category>
            <category><![CDATA[shell-script]]></category>
            <category><![CDATA[python]]></category>
            <category><![CDATA[shell]]></category>
            <category><![CDATA[os]]></category>
            <dc:creator><![CDATA[David Ohana]]></dc:creator>
            <pubDate>Sun, 08 Nov 2020 15:37:27 GMT</pubDate>
            <atom:updated>2020-11-10T11:13:53.494Z</atom:updated>
<content:encoded><![CDATA[<h3>Avoiding Bash frustration — Use Python for shell scripting</h3><p>I never got used to Bash programming syntax. Whenever I have to write a more-than-trivial Bash script, the strange syntax annoys me, and I have to Google every little thing I need to do, from how to do comparisons in if statements to how to use sed.</p><p>For me, using Python as a shell scripting language seems like a better choice. <br>Python is a more expressive language. It is relatively concise. It has a massive built-in library that lets you perform many tasks without even using shell commands, it is cross-platform, and it is preinstalled or easily installed on many OSes. <br>I am aware that some other dynamic languages (e.g. Perl, Lua) might also be very suitable for shell programming, but I (and my team) work with Python daily and are familiar with it, and it gets the job done.</p><p>In my last project, after some Bash frustration, I decided to refactor a bloated set of Bash scripts into a CLI-style Python script. A side result of this work is a small, single helper file, which I am going to share with you here, with a few utilities that bridge the gap and easily let you use Python for shell scripting.</p><p>In an effort to make this utility runnable ubiquitously, I made it compatible with both Python 2.7 and Python 3.5+ (Python 2.7 is preinstalled in Ubuntu since version 14). There are no 3rd-party requirements — only Python’s standard library is used, so no pip install is required. <br>I tested it on Mac and Ubuntu.</p><p>For shell programming, we need to be able to execute shell commands conveniently. The most extensive Python function for that is subprocess.Popen(). However, Popen() might feel too raw to use easily. It also has some compatibility changes between Python 2/3 and some missing features in Python 2.</p><p>The core function I provide here is sh(), which is a friendly wrapper around subprocess.Popen().
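To give a flavor, a heavily stripped-down sketch of such a wrapper might look like this (Python 3 only for brevity, unlike the real peasyshell; the parameter names here are illustrative, not the actual sh() signature):

```python
import subprocess

def sh(cmd, capture=False, timeout=None, check=True, echo=True):
    """Tiny illustrative wrapper around a shell call (not the real peasyshell sh())."""
    if echo:
        print("+ " + cmd)  # echo the command string before running it
    result = subprocess.run(
        cmd,
        shell=True,  # convenient (pipes, env vars, &&) but insecure with untrusted input
        timeout=timeout,
        stdout=subprocess.PIPE if capture else None,
        universal_newlines=True,  # decode captured output to str
    )
    if check and result.returncode != 0:
        raise SystemExit(result.returncode)  # terminate the calling script on failure
    return result.stdout
```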
It lets you execute a shell call from Python, deciding whether to:</p><ul><li>Capture stdout/stderr to a string or print it.</li><li>Time-out the call after x seconds.</li><li>Terminate the calling Python script on failure of the shell call.</li><li>Terminate the calling Python script on timeout of the shell call.</li><li>Echo the command string of the shell call.</li><li>Run the command inside a shell [ like subprocess.Popen(shell=True) ]. This is considered an <a href="https://docs.python.org/2/library/subprocess.html#frequently-used-arguments"><strong>insecure</strong></a> practice due to the possibility of shell injection, but it allows many convenient features in shell calls, like pipes (|), environment variable interpolation, executing multiple statements with &amp;&amp; or ; in a single call, and more. So if your script gets no user input, or you trust your input, you may opt to use it.</li><li>Apply Pythonic formatting arguments to the shell command before executing it.</li><li>And more …</li></ul><p>Other utilities let you:</p><ul><li>Log/print to stdout with ANSI colors according to the logging level.</li><li>Prompt for user input from stdin, with compatibility for both Python 2.7 and Python 3+.</li></ul><p>An example script which:</p><ul><li>Asks the user whether to pull a new Docker image</li><li>Removes the running container for this image (if any)</li><li>Runs a new container for the image.</li><li>Outputs the first 5 seconds of the new container log.</li></ul><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/612c618cc729403ecab361ca24709e93/href">https://medium.com/media/612c618cc729403ecab361ca24709e93/href</a></iframe><p>Output:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*oDvZiCxj-bjEwEYleZlxBA.png" /></figure><p>A caveat I discovered with using Python for shell scripting is that child processes are not terminated when the parent Python process dies.
A solution I found for that, which is embedded in the utility, is using an exit hook [ atexit ] to kill any child processes that are not terminated yet. This approach will work for soft kills like Ctrl-C, but not for a more aggressive kill like when your Python script is terminated using kill -9, and it may leave the child shell command running. I am open to new ideas on how to work around this drawback. However, since most shell scripts execute short-lived commands, I don&#39;t see this as a showstopper.</p><p>How to consume:</p><p>Just copy the <a href="https://github.com/davidohana/peasyshell/blob/main/peasyshell.py">peasyshell.py</a> file next to your Python shell script, and import it.</p><ul><li><a href="https://github.com/davidohana/peasyshell/blob/main/sample_app.py">Another usage sample</a></li><li>GitHub <a href="https://github.com/davidohana/peasyshell">repo</a></li></ul><p>The code is distributed under the Apache v2 OSS license.</p><p><strong>Tip</strong>: You can make your Python script executable:</p><ol><li>Add a shebang line at the top of your script:</li></ol><pre>#!/usr/bin/env python2</pre><pre>from peasyshell import *<br>...</pre><p>2. Make your script executable:</p><pre>chmod +x my_app.py</pre><p>3. Run:</p><pre>./my_app.py</pre><p>Have fun scripting.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=44bba8ba1e9e" width="1" height="1" alt=""><hr><p><a href="https://medium.com/swlh/avoiding-bash-frustration-use-python-for-shell-scripts-44bba8ba1e9e">Avoiding Bash frustration — Use Python for shell scripts</a> was originally published in <a href="https://medium.com/swlh">The Startup</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Simple Runtime Profiling of Batch Jobs in Production]]></title>
            <link>https://medium.com/swlh/simple-runtime-profiling-of-batch-jobs-in-production-ddd59e192924?source=rss-418b440883c6------2</link>
            <guid isPermaLink="false">https://medium.com/p/ddd59e192924</guid>
            <category><![CDATA[batch-processing]]></category>
            <category><![CDATA[python]]></category>
            <category><![CDATA[profiling]]></category>
            <category><![CDATA[performance]]></category>
            <category><![CDATA[kotlin]]></category>
            <dc:creator><![CDATA[David Ohana]]></dc:creator>
            <pubDate>Mon, 02 Nov 2020 14:21:49 GMT</pubDate>
            <atom:updated>2020-11-05T11:42:48.611Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*EoTN-O9P4jZNGDI_k_CTLQ.jpeg" /></figure><p>So, you have a batch processing / ETL task that receives some data in a loop or per request and crunches it. It might even be running in production. All is working well, but now you need to optimize it a bit in order to increase the rate of data you can handle.</p><p>In this post, I will introduce a simple yet effective approach to do so, which you can even run in production to measure the performance of real-world workloads. I will supply a short (single file, no dependencies) implementation of the profiler for both <strong>Kotlin</strong> and <strong>Python.</strong></p><p>So why <em>not</em> use widely-available profiling tools like VisualVM / JProfiler / cProfile?</p><ul><li>It is not easy to instrument those tools for running in a production environment. You may need to add extra dependencies, different bootstrapping code, etc.</li><li>Some of the profilers add significant runtime overhead, since they measure every method and code line (even in sampling mode). You don&#39;t want to deploy such a profiled application to production, as it will affect your processing rate significantly.</li><li>The profiling output is way too verbose, again, since every function or code-line is measured.</li><li>Most of the profilers provide a summary output at the end of the run. What if you want to log/print profiling results periodically while still running?</li><li>How do you deal with warm-up times and changes in workloads while the application is running?</li><li>Some profilers use stack-trace sampling as a data-source, therefore they measure run times for whole functions only. This may force you to refactor code blocks to sub-functions.</li><li>Some of the available profiling tools are commercial.</li></ul><h3>SimpleProfiler</h3><p>The approach I suggest is very simple. 
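For illustration, here is a minimal sketch of such a section profiler in Python. This is a heavily simplified, hypothetical implementation of mine, not the real SimpleProfiler linked below; only the start_section/end_section/report idea is taken from the article.

```python
import time
from collections import defaultdict


class SectionProfiler:
    """Minimal sketch: accumulate wall-clock time per named code section."""

    def __init__(self):
        self.totals = defaultdict(float)  # section name -> cumulative seconds
        self.counts = defaultdict(int)    # section name -> execution count
        self._stack = []                  # currently open sections (supports nesting)

    def start_section(self, name):
        # record when this section started
        self._stack.append((name, time.perf_counter()))

    def end_section(self):
        # close the most recently opened section
        name, started = self._stack.pop()
        self.totals[name] += time.perf_counter() - started
        self.counts[name] += 1

    def report(self):
        # one line per section: cumulative time, count, average time per call
        return "\n".join(
            f"{name}: took {total:.3f}s, count: {self.counts[name]}, "
            f"{total / self.counts[name] * 1000:.4f} ms avg"
            for name, total in sorted(self.totals.items(), key=lambda kv: -kv[1])
        )
```

A stack keeps nested sections cheap to track; the real profiler also supports overlapping sections and the enclosing "total" section described below.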
Your task probably runs in a loop or on a per-call basis. Call the profiler’s start_section(section_name) before any code block you want to measure, and end_section() after the code block. The profiler will collect running-time statistics for this code block.</p><p>You may call report() to get a profiling summary like this:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/2e2460cda27aaf6ab463f6428d070b15/href">https://medium.com/media/2e2460cda27aaf6ab463f6428d070b15/href</a></iframe><p>We can see a line per section, measuring:<br>(1) The cumulative runtime of all code inside that section. <br>(2) The contribution (in %) of this section to the total runtime.<br>(3) The count of times this section was executed.<br>(4) The average runtime of 1000 executions of this section.<br>(5) Frequency/Rate — the count of times this section can be executed in 1 second.</p><p>Note that sections can overlap or be nested. In the example output above, the parser section includes the split, <a href="https://github.com/IBM/Drain3">drain</a> and mask sections.</p><p>The profiler also supports an optional <em>enclosing section</em>. This section (total above) typically covers a full single iteration/request, and all other sections are sub-sections of it. It serves as a reference for what is 100% of the run-time, and by summing the runtime of all nested sections, you can determine whether there are other code blocks that take significant run-time but are not enclosed in any section.</p><h4><strong>Dealing with warm-up time and changing workloads</strong></h4><p>Sometimes you want to exclude from profiling the first iterations of batch processing, until everything is cached properly. There are also cases, in a long-running application, where workloads/request content changes over time, for example, if the work to do depends on a data-structure that is growing gradually. 
<br>In order to deal with such cases, the profiler supports an optional resetAfterSampleCount argument (0 by default), which will also calculate rate statistics for up to the <em>n</em> last executions of each section. Those rates (runtime per 1000 executions and frequency) will be displayed in the report in parentheses, next to the cumulative global rates.</p><p><strong>Notes</strong></p><p>The printer argument of the SimpleProfiler class allows you to supply a function which writes report output to a destination other than the default <em>stdout</em> (for example, a log).</p><p>SimpleProfiler also implements a general Profiler interface, so that you can replace it with NullProfiler in order to disable profiling without removing profiling sections from your code, avoiding if statements and null-checks.</p><p>Overhead — the profiler overhead mainly depends on the duration of a single workload cycle/iteration/request. The Kotlin profiler added about 0.8% overhead when a single cycle run took 0.1 ms, but 28.2% overhead when a single cycle run took 0.00036 ms. In Python, overheads were 3.4% and 31.3% respectively. Conclusion — this profiling approach is less suitable for profiling micro workloads.</p><p>The profiler is not (yet) thread-safe. This means it is suitable for profiling tasks that run in a single thread only. I intend to add multithreading support soon; it should not be a big deal.</p><h3>Demo Use-Case: comparing hashing algorithm performance, Python vs. Java</h3><p>The following code snippets profile the calculation of md5, sha1 and sha256 hashes for a 10000-byte array using the standard libraries java.security.MessageDigest and Python’s hashlib. 
I also added calculation of the square root of a random number.</p><p>In <strong>Kotlin</strong>:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/8ac32d3617ed7ba355ad16eef5c4ec45/href">https://medium.com/media/8ac32d3617ed7ba355ad16eef5c4ec45/href</a></iframe><p>In <strong>Python</strong>:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/817b73da212fb665e37fdb6eeb70fa30/href">https://medium.com/media/817b73da212fb665e37fdb6eeb70fa30/href</a></iframe><p>Output (Kotlin):</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/022bf1213db250a919292437bd8abf80/href">https://medium.com/media/022bf1213db250a919292437bd8abf80/href</a></iframe><p>Output (Python):</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/342db05c114d42f551b42167153325a6/href">https://medium.com/media/342db05c114d42f551b42167153325a6/href</a></iframe><p>And a few insights from this little experiment:</p><ul><li>Hashing in Python is about twice as fast as in Java (well, in Python it is actually a <a href="https://github.com/python/cpython/blob/master/Modules/sha256module.c">C implementation</a>)</li><li>sha256 is the slowest in Python and sha1 is the slowest in Java.</li><li>Initialization of hashing algorithms in Java is very slow.</li><li>sqrt and random are much slower in Python.</li><li>Generally, pure Python code runs slower than JVM code (we can learn that from the 0.09 ms profiling overhead in Kotlin vs. the 0.32 ms profiling overhead in Python).</li></ul><h3><strong>Profiler Implementation</strong></h3><p>A single file with fewer than 150 lines and no 3rd-party dependencies.<br>It&#39;s under the Apache v2 license; just copy the file to your project to use it.</p><ul><li><a href="https://github.com/davidohana/SimpleProfiler/blob/main/src/main/kotlin/davidoh/profiling/SimpleProfiler.kt">In 
Kotlin</a></li><li><a href="https://github.com/davidohana/SimpleProfiler/blob/main/python/simple_profiler.py">In Python</a> (3.6 or later)</li></ul><p>GitHub <a href="https://github.com/davidohana/SimpleProfiler">repository</a> with full code and samples. Star it if you like it :-)</p><p>— Happy profiling!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=ddd59e192924" width="1" height="1" alt=""><hr><p><a href="https://medium.com/swlh/simple-runtime-profiling-of-batch-jobs-in-production-ddd59e192924">Simple Runtime Profiling of Batch Jobs in Production</a> was originally published in <a href="https://medium.com/swlh">The Startup</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Code-First Configuration Library for Kotlin]]></title>
            <link>https://medium.com/swlh/code-first-configuration-library-for-kotlin-75721310f37c?source=rss-418b440883c6------2</link>
            <guid isPermaLink="false">https://medium.com/p/75721310f37c</guid>
            <category><![CDATA[ini]]></category>
            <category><![CDATA[libraries]]></category>
            <category><![CDATA[json]]></category>
            <category><![CDATA[kotlin]]></category>
            <category><![CDATA[configuration]]></category>
            <dc:creator><![CDATA[David Ohana]]></dc:creator>
            <pubDate>Thu, 15 Oct 2020 13:18:20 GMT</pubDate>
            <atom:updated>2020-10-16T08:45:00.818Z</atom:updated>
            <content:encoded><![CDATA[<p>GitHub repo: <a href="https://github.com/davidohana/kofiko-kotlin">https://github.com/davidohana/kofiko-kotlin</a></p><h3>Preface</h3><p>Kotlin and Python are my favorite programming languages. After publishing the <a href="https://medium.com/swlh/code-first-configuration-approach-for-python-f975469433b9">Kofiko configuration library for Python</a>, I decided to work on a port of it for Kotlin. The porting to Kotlin actually took significantly more effort, for several reasons: I wanted to introduce a better extensibility architecture this time, I wanted the library to support more formats, and there are many conceptual differences between Kotlin and Python. For example, Kotlin annotations can contain metadata only, while Python decorators can contain logic.</p><p>Other challenges involved were how to discover configuration objects, how to design a fluent API for adding configuration layers (providers), and a lot of reflection work.</p><p>Though I tried to keep the library clean of any external dependency, I finally settled on a single dependency, the well-known <em>jackson.core</em> (ObjectMapper) library, in order to reuse Jackson&#39;s string-parsing abilities without having to write many type converters.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/564/1*w3NpMt76wfPl9qFEQscxKg.png" /></figure><p>Kofiko (Kode-First Konfiguration) is a lightweight, simple, minimal-boilerplate configuration library for Kotlin.</p><ul><li>Supported formats: .json, .ini, .properties, .env (more to come)</li><li>A layered (cascading) and extensible design allows overriding the configuration from environment variables, command-line arguments, JVM system properties (-D) and Java Maps, in any precedence order you like.</li></ul><h3>Demo</h3><p><strong>Define application configuration as Kotlin classes/objects:</strong></p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a 
href="https://medium.com/media/e77ac8909b94de242ab580faf0aa8ab8/href">https://medium.com/media/e77ac8909b94de242ab580faf0aa8ab8/href</a></iframe><p>Each config section is represented by a class / object, so that configuration consumers may receive only the configuration of interest. Configuration options should be declared as var properties (read/write) with baked-in defaults. By using a Kotlin <em>object</em>, you may easily access configuration as a singleton without injection. However, instances of configuration classes may be configured as well.</p><p><strong>Override default values at run-time:</strong></p><p>For example, from a JSON file:</p><pre>{<br>  &quot;database&quot;: {<br>    &quot;user&quot;: &quot;davidoh&quot;,<br>    &quot;db_size_limits&quot;: {<br>      &quot;logs&quot;: 1,<br>      &quot;events&quot;: 120<br>    }<br>  }<br>}</pre><p>or using <em>env. vars</em>:</p><pre>DATABASE_user=davidoh \<br>DATABASE_password=reallysecret! \<br>DATABASE_endpoints=prod1,prod2 \<br>LOG_level=WARNING \<br>DATABASE_DB_SIZE_LIMITS=logs:5,events:120 \<br>java -cp my_app.jar</pre><p>Kofiko uses out-of-the-box (configurable) conventions to search for matching configuration entries, looking for lowercase, uppercase, camel-case, snake-case and kebab-case matches.</p><p><strong>Initialize Kofiko with the desired configuration sources:</strong></p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/793633347a58eb0f8bfacffc589344f1/href">https://medium.com/media/793633347a58eb0f8bfacffc589344f1/href</a></iframe><p>Program output:</p><pre>LogConfig.level was changed from &lt;INFO&gt; to &lt;WARNING&gt; by IniConfigProvider<br>WARNING: Hello Kofiko<br>DatabaseConfig.user was changed from &lt;default_user&gt; to &lt;davidoh&gt; by IniConfigProvider<br>DatabaseConfig.password was changed from &lt;[hidden]&gt; to &lt;[hidden]&gt; by IniConfigProvider<br>DatabaseConfig.endpoints was changed from &lt;[http://localhost:1234]&gt; to &lt;[prod1, 
prod2]&gt; by IniConfigProvider<br>DatabaseConfig.dbSizeLimits was changed from &lt;{alerts=50, logs=200}&gt; to &lt;{alerts=2, logs=1}&gt; by JsonConfigProvider<br>Database user is davidoh</pre><p>Kofiko can print/log the effective configuration overrides, omitting secret info like passwords.</p><p>The source code for Kofiko is available on <a href="https://github.com/davidohana/kofiko">GitHub</a> under the Apache-2.0 license. <br>For further details, please refer to the GitHub repository of the project at <a href="https://github.com/davidohana/kofiko-kotlin">https://github.com/davidohana/kofiko-kotlin</a>.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=75721310f37c" width="1" height="1" alt=""><hr><p><a href="https://medium.com/swlh/code-first-configuration-library-for-kotlin-75721310f37c">Code-First Configuration Library for Kotlin</a> was originally published in <a href="https://medium.com/swlh">The Startup</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Code-First Configuration approach for Python]]></title>
            <link>https://medium.com/swlh/code-first-configuration-approach-for-python-f975469433b9?source=rss-418b440883c6------2</link>
            <guid isPermaLink="false">https://medium.com/p/f975469433b9</guid>
            <category><![CDATA[ini]]></category>
            <category><![CDATA[code-first]]></category>
            <category><![CDATA[environment-variables]]></category>
            <category><![CDATA[python]]></category>
            <category><![CDATA[config]]></category>
            <dc:creator><![CDATA[David Ohana]]></dc:creator>
            <pubDate>Sun, 28 Jun 2020 14:52:33 GMT</pubDate>
            <atom:updated>2020-10-15T12:13:22.434Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/576/1*qMT09bz5T7AgCzgCYhzSmg.png" /></figure><p>GitHub repo: <a href="https://github.com/davidohana/kofiko-python">https://github.com/davidohana/kofiko-python</a><br>PyPi package: <a href="https://pypi.org/project/kofiko/">https://pypi.org/project/kofiko</a></p><p>In the Code-First approach, you first define your data-model in plain code. You can start working with that model immediately, and only later worry about schema definitions, bindings, and other necessities. The mapping between the domain model and external entities like database tables/fields usually relies on <a href="https://en.wikipedia.org/wiki/Convention_over_configuration">conventions</a>.</p><p>What I like about this approach is that it lets you focus on the most important things first, and, no less important, you have all the convenience of a modern IDE when defining your model — refactorings, code completion, type checks, etc.</p><p>After my <a href="https://medium.com/@davidoha/layered-python-configparser-wrapper-with-support-for-environment-vars-965555a58f0b">last attempt at creating a configuration library</a> for Python, I was still not satisfied and looked for a way to implement the code-first approach for configuration. The outcome is a Python package named <a href="https://github.com/davidohana/kofiko">kofiko</a> — “<strong>Ko</strong>de <strong>Fi</strong>rst <strong>Ko</strong>nfiguration” (which is also a funny ape from an old Israeli children&#39;s book series). I am pretty satisfied with the outcome, which is described next.</p><p>First, you define the desired configuration settings in plain Python code. Configuration entries are defined as static attributes of a class, with a default value which also defines the type of each entry. You may define many configuration classes — each class serves as a different config section. 
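For illustration, a config section in this style might look like this (the section and attribute names below are hypothetical examples of mine, not taken from the Kofiko samples):

```python
# Hypothetical config sections: plain classes whose static attributes
# carry both the default value and, implicitly, the type of each entry.

class LogConfig:
    level = "INFO"                 # str default
    file_out_folder = "../log"     # str default
    max_file_size_mb = 100         # int default


class DatabaseConfig:
    endpoints = ["localhost"]      # list default; overrides parsed as comma-delimited
    quotas = {"logs": 100}         # dict default; overrides parsed as key:val pairs


# Consumers read the attributes directly, with full IDE support:
def bootstrap_logging():
    return f"logging to {LogConfig.file_out_folder} at level {LogConfig.level}"
```

In real Kofiko code each class would also carry the @config_section decorator described below; the plain classes above are just to show the shape of the model.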
It can be located anywhere in your code.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/26728c13f0b03e76fffe097bd332a847/href">https://medium.com/media/26728c13f0b03e76fffe097bd332a847/href</a></iframe><p>In the example above, the log bootstrapping code that uses the config class is not interested in any configuration options other than log-related ones, so it&#39;s reasonable to define a dedicated config class for logging and locate it side-by-side with the bootstrapping code.</p><p>The @config_section decorator above actually registers this class with Kofiko. Once the kofiko.configure() function is called, it will look up configuration overrides in various sources and set the values of the relevant attributes of the class accordingly.</p><p>Currently, Kofiko supports the following configuration sources: <br>(1) Customization functions<br>(2) .INI files<br>(3) Env. vars</p><p>Customization functions allow you to stay in the code-first approach, by defining simple Python functions that override selected attributes in the default configuration classes. This is useful, for example, in cases where you have multiple deployments. You can create a customization function for each deployment.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/2c537efeeac38e2ab4200bd04a481540/href">https://medium.com/media/2c537efeeac38e2ab4200bd04a481540/href</a></iframe><p>You have to register the customization function with Kofiko using the @config_custom decorator, the same way we did with the configuration class.</p><p>Kofiko also supports the familiar .INI format. You can specify one or more ini filenames to search for overrides. 
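For instance, an override file might look like this (hypothetical sections and values, matching the env-var example later in the post):

```ini
; staging.ini - hypothetical Kofiko overrides.
; Section name maps to a config class, option name to an attribute.
[log]
file_out_folder = ../log-staging

[elastic]
env_name = staging

[mirror]
batch_hours = 1
```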
Kofiko maps the config class name to an ini section and the attribute name to an ini option:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/7ec00ae8dd87f70ca9b79b4c757b7a56/href">https://medium.com/media/7ec00ae8dd87f70ca9b79b4c757b7a56/href</a></iframe><p>Note that you can even omit the Config keyword from the section name.</p><p>The last supported override source is environment variables. Kofiko will look up env keys that match the convention app-prefix_section_option and override config attributes accordingly. For example, we can run our app like this:</p><pre>log_file_out_folder=../log-staging elastic_env_name=staging mirror_batch_hours=1 python my_app.py</pre><p>In this case, we didn’t use any app-specific lookup prefix, but we can always opt to use one, in order to prevent collisions with other env vars.</p><p>Also, note that the lookups in ini and env-vars are case-insensitive by default. And if you don&#39;t like my default conventions for override lookups, Kofiko also allows you to customize those with your own.</p><p>One of the nicest things about Kofiko is that you don&#39;t have to do type-conversion anymore. Kofiko will use the default value of each config attribute as a type-hint and will try to convert the text value read from untyped override sources (ini and env) to the same type. It will fall back to string only when unable to convert. <br>In addition to the basics (str, int, float &amp; bool), I added support for parsing list (comma-delimited by default) and dict in the format key1:val1,key2:val2. 
The type conversion for list values and dict keys and values is done using the first element in the default value of the relevant config attribute (if one exists).</p><pre>database_endpoints=host1,host2 database_quotas=logs:300,alerts:50 python my_db_client.py</pre><p><strong>Bootstrapping</strong></p><p>In the initialization code of your Python app, all you have to do is call the static kofiko.configure() function. You may specify a customization and/or ini files for lookup like this:</p><pre>overrides = kofiko.configure(<br>   customization_name=&quot;prod&quot;, <br>   ini_file_names=&quot;../cfg/prod.ini&quot;)</pre><p>After this call, attribute values in all config classes are overridden from the relevant override sources, and you can use those config classes directly in your code.</p><p>The overrides return value holds a dictionary of all values that were changed from their defaults. It might be useful to log or print this.</p><p><strong>How to get it</strong></p><p>The source code for Kofiko is available on <a href="https://github.com/davidohana/kofiko">GitHub</a> under the Apache-2.0 license. You can also install it from <a href="https://pypi.org/project/kofiko/">PyPi</a>:</p><pre>pip install kofiko</pre><p>I hope you will like this little configuration monkey, and please comment and tell me what you think about it.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=f975469433b9" width="1" height="1" alt=""><hr><p><a href="https://medium.com/swlh/code-first-configuration-approach-for-python-f975469433b9">Code-First Configuration approach for Python</a> was originally published in <a href="https://medium.com/swlh">The Startup</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to restrict user access to Grafana with Generic OAuth]]></title>
            <link>https://davidoha.medium.com/how-to-restrict-user-access-with-to-grafana-with-generic-oauth-656a1a660a7b?source=rss-418b440883c6------2</link>
            <guid isPermaLink="false">https://medium.com/p/656a1a660a7b</guid>
            <category><![CDATA[grafana]]></category>
            <category><![CDATA[authentication]]></category>
            <category><![CDATA[oauth]]></category>
            <category><![CDATA[api]]></category>
            <dc:creator><![CDATA[David Ohana]]></dc:creator>
            <pubDate>Wed, 17 Jun 2020 12:47:14 GMT</pubDate>
            <atom:updated>2020-06-17T13:32:31.931Z</atom:updated>
            <content:encoded><![CDATA[<h3>Grafana &amp; Generic OAuth — How to restrict access to specific authenticated users?</h3><p>If you have successfully integrated <a href="https://grafana.com/docs/grafana/latest/auth/generic-oauth/">Generic OAuth</a> with Grafana, you might wonder, as I did: <strong>how do you allow only specific authenticated users</strong> from your organization <strong>to access Grafana?</strong> And <strong>how do you set different access rights (admin, editor, viewer) for those users?</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*PsqfnJ2Dtr4asTDgpwy0vA.png" /></figure><p>The first thing we need to do is disable sign-up of new users and anonymous access. This can be done by editing the grafana.ini file or by setting env vars.</p><pre>[auth.generic_oauth]<br>allow_sign_up = false</pre><pre>[auth.anonymous]<br>enabled = false</pre><p>This will allow only users that are already listed in Grafana’s user database to sign in.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/217/1*mlcMnUWKMNAf5aK8ZuUSHg.png" /></figure><p>But how do we add users to Grafana?</p><p>Inviting a user using Grafana’s user management dashboard will <strong>not</strong> do the trick, as an invited user is not an actual user until the first sign-up.</p><p>But we can add a user programmatically using the <a href="https://grafana.com/docs/grafana/latest/http_api/admin/#global-users">Admin HTTP API</a> with an HTTP POST call:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/da487972ac2ee3eaec192b99bd7a271b/href">https://medium.com/media/da487972ac2ee3eaec192b99bd7a271b/href</a></iframe><p>Now all users added like this will be able to access Grafana. <br>Note that there is a <strong>tricky part</strong> here: Grafana is case-sensitive about emails. 
So the case of the user&#39;s email address must exactly match the case returned from the OAuth endpoint.</p><p>Now, how do we set permissions for specific users, such that some are viewers, some are editors and some are admins?</p><p>After you have added a user, her default access level is <em>viewer</em>. You can use the <a href="https://grafana.com/docs/grafana/latest/http_api/admin/#permissions">permissions</a> API call to set a user as a global Grafana admin:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/f86b7540e2eea295be66a57da104b298/href">https://medium.com/media/f86b7540e2eea295be66a57da104b298/href</a></iframe><p>You will probably also want to set a user as an organization admin/editor, to allow her to edit some dashboards (a global admin cannot do this without explicitly setting herself as an org admin as well). This can be done using the <a href="https://grafana.com/docs/grafana/latest/http_api/org/#updates-the-given-user">organization API</a>:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/f5229b8f10ed7ad853b6cc901cb8b118/href">https://medium.com/media/f5229b8f10ed7ad853b6cc901cb8b118/href</a></iframe><p>That&#39;s it. Wrapping it all up, here is a sample Python script that reads a user list from a CSV file and adds all of the entries as Grafana users. Users with * as the first character will be added as admins.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/0cbce7904550d45282e5119335598de7/href">https://medium.com/media/0cbce7904550d45282e5119335598de7/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/974/1*SzsaKipGKVV2V-hZin5Klw.png" /><figcaption>User list after running add_grafana_users.py</figcaption></figure><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=656a1a660a7b" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Python Retry on Exception]]></title>
            <link>https://davidoha.medium.com/python-retry-on-exception-d36fa58df4e1?source=rss-418b440883c6------2</link>
            <guid isPermaLink="false">https://medium.com/p/d36fa58df4e1</guid>
            <category><![CDATA[functional-programming]]></category>
            <category><![CDATA[retry]]></category>
            <category><![CDATA[exception]]></category>
            <category><![CDATA[python]]></category>
            <dc:creator><![CDATA[David Ohana]]></dc:creator>
            <pubDate>Mon, 25 May 2020 15:35:38 GMT</pubDate>
            <atom:updated>2020-06-28T11:45:17.805Z</atom:updated>
            <content:encoded><![CDATA[<h3>Never stop trying: Retry on Exception in Python</h3><p>(Full code and samples for this post at my <a href="https://github.com/davidohana/python-retry-func">GitHub Repo</a>)</p><p>Suppose you have the following code, which invokes a gRPC request and may fail due to various network conditions.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/b64dff89ac49dc3cc14736656bc68a97/href">https://medium.com/media/b64dff89ac49dc3cc14736656bc68a97/href</a></iframe><p>How do we retry this call until no exception is raised? Wrap the call in an inner (inline) named function and use the provided <strong>retry</strong> function.<br>Thanks to closures, we can use any variable from the scope enclosing the inner function.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/29017d4392c5f89ab6b39f31d0c2b545/href">https://medium.com/media/29017d4392c5f89ab6b39f31d0c2b545/href</a></iframe><p>This retry function supports the following features:</p><ul><li>Returns the value of the invoked function when it succeeds</li><li>Raises the exception of the invoked function if attempts are exhausted</li><li>Limit on the number of attempts (0 for unlimited)</li><li>Wait (linear or exponential) between attempts</li><li>Retry only if the exception is an instance of a specific exception type</li><li>Optional logging of attempts</li></ul><p><strong>Retry function code:</strong></p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/bab6831d4732ec6403bfb6e95197deb9/href">https://medium.com/media/bab6831d4732ec6403bfb6e95197deb9/href</a></iframe><p>So, keep trying!</p><p>David</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=d36fa58df4e1" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Python Logging: Colorize Your Arguments!]]></title>
            <link>https://medium.com/analytics-vidhya/python-logging-colorize-your-arguments-41567a754ac?source=rss-418b440883c6------2</link>
            <guid isPermaLink="false">https://medium.com/p/41567a754ac</guid>
            <category><![CDATA[colors]]></category>
            <category><![CDATA[ansi]]></category>
            <category><![CDATA[python]]></category>
            <category><![CDATA[formatting]]></category>
            <category><![CDATA[logging]]></category>
            <dc:creator><![CDATA[David Ohana]]></dc:creator>
            <pubDate>Sun, 24 May 2020 16:05:03 GMT</pubDate>
            <atom:updated>2020-06-28T11:44:56.779Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ZaOru_rZsnfNCQaPJ5D60g.png" /><figcaption>Logging in alternating colors for message arguments</figcaption></figure><p>The latest full code and samples for this article are available under the Apache-2.0 license at my <a href="https://github.com/davidohana/colargulog">GitHub Repo</a>.</p><p>Yes, we love logging in colors.<br>Yes, there are many Python libraries and code samples that show you how to colorize your stdout log messages by logging level.</p><p>But I am going to show you something better — log messages with <strong>alternating colors</strong> for each argument in the format string, in addition to <strong>colorization by log level</strong>. And a bonus — arguments are formatted using the new “<strong>brace-style</strong>” formatting introduced in Python 3.2, for example:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/9d139840234f34558c44f9d38417d94e/href">https://medium.com/media/9d139840234f34558c44f9d38417d94e/href</a></iframe><p>The implementation is simple, and no 3rd-party dependencies are introduced. To apply colors, all you need to do is set the formatter of the StreamHandler to an instance of <strong>ColorizedArgsFormatter:</strong></p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/43256d6df8a1f48d86e0257ff0442537/href">https://medium.com/media/43256d6df8a1f48d86e0257ff0442537/href</a></iframe><h4>How does it work?</h4><ul><li>Each log record is inspected to determine if brace-style formatting is used. 
I decided to use brace-style formatting because it makes it easier to identify the start and end of a formatting parameter: {param}</li><li>An ANSI escape code for the current alternating color is added before each formatting parameter, and a reset escape code is added after it</li><li>The LogRecord is updated so that the <strong>message</strong> field is formatted using str.format() and the <strong>args</strong> field is set to empty.</li><li>An ANSI escape code for the specific level is added before the <strong>levelname</strong> and <strong>levelno</strong> formatting placeholders, and a reset escape code is added after the placeholder.</li></ul><h4>Compatibility Challenges</h4><ul><li><strong>We still need to support legacy-style string formatting</strong>, e.g.:<br><em>logger.info(“My name is %s and my age is %d”, “Dave”, 12)<br></em>as many 3rd-party dependencies of our code might log in the old format.<br>My solution is a few simple heuristics to decide whether brace-style formatting should be used: the string contains no ‘%’ character, and the number of curly-brace pairs matches the number of log record arguments. Otherwise, we fall back to legacy formatting, but without argument colorization.</li><li><strong>We also have to support brace-style formatting when logging to other mediums, like files</strong>. Otherwise, the logger will fail to format the message, raising “TypeError: not all arguments converted during string formatting”. The solution is another simple formatter named <strong>BraceFormatStyleFormatter</strong> for file logging handlers and other mediums that do not support colors. This formatter is very similar to <strong>ColorizedArgsFormatter</strong>; however, it only rewrites the log record with the formatted message and does not add ANSI color escape codes.</li></ul><h4>Other Tips</h4><ul><li>By default, there are two alternating colors. 
You can change the colors or add more alternating colors by changing the <em>arg_colors</em> list.</li><li>It is possible to alter the mapping of log levels to colors by changing the <em>level_to_color</em> dictionary.</li><li>Currently, curly-brace message formatting with kwargs mapping is not supported in logging, e.g.: <em>“My name is {name}”.format(name=”David”)</em></li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*bWzogOG0_V597SKjFnAF7A.png" /><figcaption>Customization — more alternating colors and a different color for DEBUG messages</figcaption></figure><h4>Full code</h4><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/d66f6af59bbfb1ddc223775c0008609e/href">https://medium.com/media/d66f6af59bbfb1ddc223775c0008609e/href</a></iframe><h4>Bootstrapping and Logging — Full Sample App:</h4><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/13ca0591227dab5514ef9d10a6f97f91/href">https://medium.com/media/13ca0591227dab5514ef9d10a6f97f91/href</a></iframe><p>The latest full code and samples are available under the Apache-2.0 license at my <a href="https://github.com/davidohana/colargulog">GitHub Repo</a>.</p><p>Have a colorful day,<br>David</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=41567a754ac" width="1" height="1" alt=""><hr><p><a href="https://medium.com/analytics-vidhya/python-logging-colorize-your-arguments-41567a754ac">Python Logging: Colorize Your Arguments!</a> was originally published in <a href="https://medium.com/analytics-vidhya">Analytics Vidhya</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
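The brace-vs-percent detection heuristic and the record-rewriting step described in the article can be sketched in a few lines of Python. This is a simplified illustration only, not the actual colargulog code: the names `is_brace_format` and `AlternatingColorFormatter` are invented here, and the real implementation handles more cases.

```python
import logging
import re

# Matches a single {} or {name} placeholder (a sketch; colargulog's real
# detection logic may differ in details).
PLACEHOLDER_RE = re.compile(r"\{[^{}]*\}")


def is_brace_format(fmt: str, arg_count: int) -> bool:
    """Heuristic from the article: no '%' in the format string, and the
    number of curly-brace pairs matches the number of log arguments."""
    if "%" in fmt:
        return False
    return len(PLACEHOLDER_RE.findall(fmt)) == arg_count


class AlternatingColorFormatter(logging.Formatter):
    """Hypothetical minimal formatter: colorize each argument with an
    alternating ANSI color when brace-style formatting is detected;
    otherwise fall back to legacy %-style formatting, uncolored."""

    RESET = "\033[0m"
    ARG_COLORS = ["\033[33m", "\033[36m"]  # yellow, cyan

    def format(self, record):
        args = record.args or ()
        if isinstance(args, tuple) and args and is_brace_format(str(record.msg), len(args)):
            colored = [
                self.ARG_COLORS[i % len(self.ARG_COLORS)] + str(a) + self.RESET
                for i, a in enumerate(args)
            ]
            # Rewrite the record: merge the args into the message via
            # str.format(), then clear them so the base Formatter does
            # not attempt %-style formatting on the result.
            record.msg = str(record.msg).format(*colored)
            record.args = ()
        return super().format(record)
```

Attaching it works like any formatter, e.g. `handler.setFormatter(AlternatingColorFormatter("%(levelname)s %(message)s"))`.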
        </item>
        <item>
            <title><![CDATA[Layered Python ConfigParser wrapper with support for environment vars]]></title>
            <link>https://davidoha.medium.com/layered-python-configparser-wrapper-with-support-for-environment-vars-965555a58f0b?source=rss-418b440883c6------2</link>
            <guid isPermaLink="false">https://medium.com/p/965555a58f0b</guid>
            <category><![CDATA[configuration]]></category>
            <category><![CDATA[package]]></category>
            <category><![CDATA[python]]></category>
            <dc:creator><![CDATA[David Ohana]]></dc:creator>
            <pubDate>Wed, 20 May 2020 15:14:59 GMT</pubDate>
            <atom:updated>2020-06-28T11:46:09.292Z</atom:updated>
            <content:encoded><![CDATA[<p>The full code (library + example + .ini files) for this article is available on GitHub: <a href="https://github.com/davidohana/LayConf">https://github.com/davidohana/LayConf</a></p><p>In every programming language I use, one of the first things I need is a decent configuration library.</p><p>My requirements are usually:</p><ul><li>Default (hard-coded) configuration</li><li>Custom configuration file</li><li>Ability to override configuration entries from environment variables and command-line arguments</li></ul><p>In Python, the existing packages I found seem to be overkill, or it might just be my <a href="https://en.wikipedia.org/wiki/Not_invented_here">NIH syndrome</a>. Anyway, after some searching, I shamelessly copied and modified <a href="https://medium.com/@unpluggedcoder/python-config-parser-compatible-with-environment-variable-8c5c46145a46">this</a>.</p><p>The result is a small and simple class that supports most of what I need.</p><p>Configuration options are first looked up in the environment variable {section}_{option}, then in a custom .ini file, then in the default .ini file. 
It&#39;s also possible to add inline defaults if the entry does not exist in the default .ini file.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/466/1*mAcBWwbXKBG4ALOwDqeVaA.png" /></figure><p>Example usage:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/ed17c88ef9c87bdb8b52f7a0277865ab/href">https://medium.com/media/ed17c88ef9c87bdb8b52f7a0277865ab/href</a></iframe><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/475e11c7a08dccb55408c4c297931554/href">https://medium.com/media/475e11c7a08dccb55408c4c297931554/href</a></iframe><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/53e887f82544e3e85acafc2b2cc29147/href">https://medium.com/media/53e887f82544e3e85acafc2b2cc29147/href</a></iframe><p>Now, let&#39;s run it while overriding one config option:</p><pre>% example_LOG_file_backup_count=300 python example_app.py</pre><p>and the output would be:</p><pre>config env prefix: example<br>config default: cfg/default.ini<br>config custom: cfg/staging.ini<br>env_name: staging<br>console_enabled: True<br>file_rotation_size_mb: 10<br>&#39;foo&#39; not found<br>foo: bar<br>foo_number: 33<br>file_enabled: true<br>file_backup_count: 300</pre><p>The full code (library + example + .ini files) is available on GitHub: <a href="https://github.com/davidohana/LayConf">https://github.com/davidohana/LayConf</a></p><p>Happy Coding!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=965555a58f0b" width="1" height="1" alt="">]]></content:encoded>
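The layered lookup order described above (environment variable, then custom .ini, then default .ini, then inline fallback) can be sketched with the standard-library configparser. This is an illustrative sketch only: `LayeredConfig` and its method names are invented here and are not the actual LayConf API, and the real library reads .ini files from disk and uses its own env-var naming scheme.

```python
import configparser
import os


class LayeredConfig:
    """Hypothetical minimal resolver for the lookup order described in
    the article. For simplicity this sketch takes .ini content as
    strings; the real library reads files."""

    def __init__(self, env_prefix, default_ini="", custom_ini=""):
        self.env_prefix = env_prefix
        self.default = configparser.ConfigParser()
        self.default.read_string(default_ini)
        self.custom = configparser.ConfigParser()
        self.custom.read_string(custom_ini)

    def get(self, section, option, fallback=None):
        # 1. Environment variable named prefix_section_option (the real
        #    naming/casing scheme may differ).
        env_key = "{}_{}_{}".format(self.env_prefix, section, option)
        if env_key in os.environ:
            return os.environ[env_key]
        # 2. Custom .ini file.
        if self.custom.has_option(section, option):
            return self.custom.get(section, option)
        # 3. Default .ini file.
        if self.default.has_option(section, option):
            return self.default.get(section, option)
        # 4. Inline fallback.
        return fallback
```

With a default .ini defining `level = INFO` and a custom .ini overriding `file`, setting the env var `app_log_level=DEBUG` would win for `get("log", "level")`, while `get("log", "file")` would come from the custom file.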
        </item>
    </channel>
</rss>