Modern Best Practices for Web Security Using AI and Automation
Vibe Coding Is Great for Demos; It’s Not a Strategy for GenAI Value in the SDLC
Generative AI
Generative AI has become a default feature expectation, pushing engineering teams to treat models like production dependencies that are governed, measured, and operated with the same rigor as any other critical system in the stack. Model behavior and quality have to be measurable, failures must be diagnosable, data access needs to be controlled, and costs have to stay within budget as usage inevitably climbs. Operationalizing AI capabilities responsibly, not just having access to powerful models, is the differentiator for organizations today.

This report examines how organizations are integrating AI into real-world systems with capabilities like RAG and vector search patterns, agentic frameworks and workflows, multimodal models, and advanced automation. We also explore how teams manage context and data pipelines, enforce security and compliance practices, and design AI-aware architectures that can scale efficiently without turning into operational debt.
Threat Modeling Core Practices
Getting Started With Agentic AI
Nowadays, there are quite a lot of AI coding assistants. In this blog, you will take a closer look at Qwen Code, a terminal-based AI coding assistant. Qwen Code is optimized for Qwen3-Coder, so when you are using this AI model, it is definitely worth looking at. Enjoy!

Introduction

There are many AI models and also many AI coding assistants. Which one to choose is a hard question. It also depends on whether you run the models locally or in the cloud. When running locally, Qwen3-Coder is a very good AI model for programming tasks. In previous posts, DevoxxGenie, a JetBrains IDE plugin, was often used as an AI coding assistant. DevoxxGenie is nicely integrated within the JetBrains IDEs, but it is also a good thing to take a look at other AI coding assistants. And when you are using Qwen3-Coder, Qwen Code is an obvious choice. Qwen Code is based on Google's Gemini CLI and is a terminal application. This is different from DevoxxGenie; however, you are able to access the terminal from within the JetBrains IDEs, so you do not need to leave the IDE at all. In this blog, you will take a closer look at Qwen Code: how to configure it and how to use it. Sources used in this blog can be found on GitHub.

Prerequisites

Prerequisites for reading this blog are:

* Some experience with AI coding assistants
* If you want to compare to DevoxxGenie, take a look at a previous post.

Installation

In order to install Qwen Code, Node.js 20+ is required. Install Node.js following the official installation instructions, then execute the following commands.

```shell
# Download and install nvm:
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.3/install.sh | bash

# In lieu of restarting the shell:
\. "$HOME/.nvm/nvm.sh"

# Download and install Node.js:
nvm install 24

# Verify the Node.js version:
node -v # Should print "v24.12.0".

# Verify the npm version:
npm -v # Should print "11.6.2".
```

After successfully installing Node.js, install Qwen Code with the following command.
```shell
npm install -g @qwen-code/qwen-code@latest
```

Setup

Qwen Code is installed now, but first, some configuration needs to be done. A complete list of all configurable settings can be found on GitHub.

1. Disable Usage Statistics

Many tools like to receive usage statistics by default, and so does Qwen Code. If you do not want this, you can disable it in the settings. Settings can be added on different levels; to keep things easy, a user settings file is used in this blog. Navigate to your home directory, where you will find a .qwen directory. Create a file settings.json in this .qwen directory and disable the usage statistics.

```json
{
  "privacy": {
    "usageStatisticsEnabled": false
  }
}
```

2. Configure Model

In this blog, a local model setup is used, with Ollama as the inference engine and Qwen3-Coder running as a local model. To set this up, create a .env file in the .qwen directory.

* OPENAI_API_KEY: When using a local model, the contents do not matter.
* OPENAI_BASE_URL: The URL where the model can be accessed. Do note that you need to add /v1 at the end.
* OPENAI_MODEL: The model you want to use by default. Of course, this model needs to be available within Ollama.

The contents of the .env file are as follows. Note that Ollama serves plain HTTP on port 11434, so the base URL uses http, not https.

```shell
OPENAI_API_KEY="your-api-key"
OPENAI_BASE_URL="http://localhost:11434/v1"
OPENAI_MODEL="qwen3-coder:30b"
```

3. System Prompt

It is a good practice to add a system prompt to your AI coding assistant, containing instructions for the model. You can add it by creating a QWEN.md file in the .qwen directory; this ensures a default system prompt for all your projects. If you want to use a more specific system prompt for a particular repository, you can add a QWEN.md file in the repository itself. If you are developing in Java, Spring Boot, etc., the following system prompt can be used as an example.

```text
You are an expert code assistant for a professional Java developer.
All code examples, reviews, and explanations must be idiomatic to the following tech stack:
* Backend: Java (latest LTS), Spring Boot (latest stable), PostgreSQL.
* Frontend: Vue.js (latest stable), Angular (latest stable).
* Follow modern best practices for RESTful APIs, object-relational mapping, unit testing (JUnit), and frontend-backend integration.
* Prefer Maven for Java dependency management.
* Whenever database code is required, use PostgreSQL syntax and conventions.
* For frontend, use Vue composition API where applicable.
* Always explain your reasoning, and reference documentation when giving architectural advice.
* When unsure, ask clarifying questions before producing code.
```

4. Finetune Model Settings

It is also possible to fine-tune model settings. For example, for coding tasks, you want the model to be more deterministic so that it responds more factually. This can be realized by setting the temperature to 0. Add the following to the settings.json file.

```json
"model": {
  "generationConfig": {
    "samplingParams": {
      "temperature": 0
    }
  }
}
```

First Startup

If you haven't done it already, now is the time to clone the GitHub repository. Be sure to check out the qwen-code branch. If you want to execute the commands from this blog, you first need to delete the QWEN.md file and the src/test directory. Qwen Code is a terminal application, so you have some different options here:

* Open a terminal and navigate to the repository.
* Open your IDE, e.g., IntelliJ, and open a terminal from within IntelliJ (ALT+F12).

Start Qwen Code by typing qwen in the terminal. The first time, Qwen Code will ask you how you want to authenticate for this project. In this blog, a local model is used, so choose OpenAI. The previously configured environment variables are shown; just confirm these settings. A first simple command is to show the memory content (/memory show). It should output the system prompt you configured earlier.
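Before moving on: the configuration files described above can also be assembled programmatically, which is handy when rolling the same setup out to several machines. Below is a minimal Python sketch; the `write_qwen_settings` helper is hypothetical, writes to a temporary directory rather than your real home directory, and covers only the privacy and model settings shown above.

```python
import json
import tempfile
from pathlib import Path

def write_qwen_settings(qwen_dir: Path) -> Path:
    """Assemble the settings.json discussed above: usage statistics
    disabled and a deterministic temperature for coding tasks."""
    settings = {
        "privacy": {"usageStatisticsEnabled": False},
        "model": {"generationConfig": {"samplingParams": {"temperature": 0}}},
    }
    qwen_dir.mkdir(parents=True, exist_ok=True)
    path = qwen_dir / "settings.json"
    path.write_text(json.dumps(settings, indent=2))
    return path

if __name__ == "__main__":
    # Demo against a throwaway directory instead of ~/.qwen:
    with tempfile.TemporaryDirectory() as tmp:
        path = write_qwen_settings(Path(tmp) / ".qwen")
        loaded = json.loads(path.read_text())
        print(loaded["model"]["generationConfig"]["samplingParams"]["temperature"])  # prints 0
```

To target the real location, pass `Path.home() / ".qwen"` instead of the temporary directory.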
As you can see, Qwen Code shows exactly which content is retrieved and where it is retrieved from. Now, in order to verify whether the connection with the model is functioning correctly, just enter a simple prompt like "how are you?"

At the moment of writing, Qwen Code 0.6.0 was used. Using qwen3-coder and even qwen3 as a local model resulted in errors because these models are not reasoning models. An issue was created for this, and it was fixed very quickly: from version 0.6.1, the errors disappeared.

Create a Test

Let's continue with something useful and create a test for the CustomersController. Using the @ character, you can add files to the context. When typing, a search is executed, and using the arrow keys, you can easily select the file you need; with Tab, you select it. After that, you can complete the prompt. The prompt used is:

```text
@src/main/java/com/mydeveloperplanet/myaicodeprojectplanet/controller/CustomersController.java
Write a unit test for this code using JUnit. Use WebMvcTest. Use MockMvc. Use AssertJ assertions. Add the test in this repository
```

Qwen Code starts analyzing the file and writes the test. The test created can be seen below. While generating the test, Qwen Code tries to invoke mvn test in order to verify whether the test succeeds. The test fails, and the model starts changing the test, but each time it makes things worse; it seems that the model is stuck in some kind of loop. When you take a look at the first created test, the solution is quite obvious: a wrong import statement is present, and if you comment it out, the problem is solved.
```java
package com.mydeveloperplanet.myaicodeprojectplanet.controller;

import com.mydeveloperplanet.myaicodeprojectplanet.model.Customer;
//import com.mydeveloperplanet.myaicodeprojectplanet.openapi.model.Customer as OpenApiCustomer;
import com.mydeveloperplanet.myaicodeprojectplanet.service.CustomerService;
import org.junit.jupiter.api.DisplayName;
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.autoconfigure.web.servlet.WebMvcTest;
import org.springframework.boot.test.mock.mockito.MockBean;
import org.springframework.http.MediaType;
import org.springframework.test.web.servlet.MockMvc;
import org.springframework.test.web.servlet.MvcResult;

import java.util.List;
import java.util.Optional;

import static org.assertj.core.api.Assertions.assertThat;
import static org.mockito.ArgumentMatchers.any;
import static org.mockito.ArgumentMatchers.anyLong;
import static org.mockito.Mockito.when;
import static org.springframework.test.web.servlet.request.MockMvcRequestBuilders.*;
import static org.springframework.test.web.servlet.result.MockMvcResultMatchers.*;

@WebMvcTest(CustomersController.class)
class CustomersControllerTest {

    @Autowired
    private MockMvc mockMvc;

    @MockBean
    private CustomerService customerService;

    @Test
    @DisplayName("GET /customers returns list of customers")
    void getAllCustomersReturnsList() throws Exception {
        Customer customer = new Customer(1L, "John", "Doe");
        when(customerService.getAllCustomers()).thenReturn(List.of(customer));

        MvcResult result = mockMvc.perform(get("/customers"))
                .andExpect(status().isOk())
                .andReturn();

        String content = result.getResponse().getContentAsString();
        assertThat(content).contains("John");
    }

    @Test
    @DisplayName("GET /customers/{id} returns customer when found")
    void getCustomerByIdFound() throws Exception {
        Customer customer = new Customer(1L, "Jane", "Smith");
        when(customerService.getCustomerById(1L)).thenReturn(Optional.of(customer));

        mockMvc.perform(get("/customers/1"))
                .andExpect(status().isOk())
                .andExpect(jsonPath("$.firstName").value("Jane"));
    }

    @Test
    @DisplayName("GET /customers/{id} returns 404 when not found")
    void getCustomerByIdNotFound() throws Exception {
        when(customerService.getCustomerById(99L)).thenReturn(Optional.empty());

        mockMvc.perform(get("/customers/99"))
                .andExpect(status().isNotFound());
    }

    @Test
    @DisplayName("POST /customers creates a customer")
    void createCustomer() throws Exception {
        Customer customer = new Customer(1L, "Alice", "Wonderland");
        when(customerService.createCustomer(any(Customer.class))).thenReturn(customer);

        String json = "{\"id\":1,\"firstName\":\"Alice\",\"lastName\":\"Wonderland\"}";

        mockMvc.perform(post("/customers")
                .contentType(MediaType.APPLICATION_JSON)
                .content(json))
                .andExpect(status().isCreated());
    }

    @Test
    @DisplayName("PUT /customers/{id} updates a customer")
    void updateCustomer() throws Exception {
        Customer customer = new Customer(1L, "Bob", "Builder");
        when(customerService.updateCustomer(anyLong(), any(Customer.class))).thenReturn(customer);

        String json = "{\"id\":1,\"firstName\":\"Bob\",\"lastName\":\"Builder\"}";

        mockMvc.perform(put("/customers/1")
                .contentType(MediaType.APPLICATION_JSON)
                .content(json))
                .andExpect(status().isOk());
    }

    @Test
    @DisplayName("DELETE /customers/{id} deletes a customer")
    void deleteCustomer() throws Exception {
        mockMvc.perform(delete("/customers/1"))
                .andExpect(status().isNoContent());
    }
}
```

This test is successful. When you generate the mutation test results using mvn verify and check the report in the target/pit-reports directory, you notice that this test has a line coverage of 100% and a mutation coverage of 93%, which is quite good. Generating this test with Anthropic's Claude models results in similar numbers.

Commands

Qwen Code supports commands. This is a convenient way of interacting with the model.
When using the /init command, Qwen Code analyses your repository and creates a QWEN.md file in your repository with project-specific information. Executing this command for this repository results in the following QWEN.md file. Generating this file took a couple of minutes, but the result is really good.

````markdown
# QWEN.md

## Project Overview

This repository is a **Spring Boot** application written in **Java 21** that exposes a simple **Customer Management API**. It demonstrates:

* RESTful endpoints generated from an OpenAPI specification (`src/main/resources/static/customers.yaml`).
* JOOQ code‑generation for type‑safe database access.
* Liquibase migrations for schema creation.
* Testcontainers integration for integration tests.
* PIT mutation testing.

The application uses a PostgreSQL database (Docker image `postgres:17-alpine`) defined in `compose.yaml`.

## Directory Structure

```
src/
├─ main/
│  ├─ java/com/mydeveloperplanet/myaicodeprojectplanet
│  │  ├─ controller
│  │  ├─ model
│  │  ├─ repository
│  │  ├─ service
│  │  └─ MyAiCodeProjectPlanetApplication.java
│  ├─ resources/
│  │  ├─ db/changelog/migration
│  │  ├─ static/customers.yaml
│  │  └─ application.properties
└─ test/
   └─ java/com/mydeveloperplanet/myaicodeprojectplanet/controller
```

## Build & Run

The project uses **Maven**. The following commands are the typical workflow:

| Action | Command |
|--------|---------|
| Clean & compile | `./mvnw clean compile` |
| Run the application | `./mvnw spring-boot:run` |
| Run tests | `./mvnw test` |
| Generate JOOQ sources | `./mvnw generate-sources` |
| Run mutation tests | `./mvnw test-compile exec:java -Dexec.mainClass=org.pitest.mutationtest.MutationTest` |

> **Note**: The wrapper scripts `mvnw` and `mvnw.cmd` are provided for cross‑platform builds.

## Docker Compose

A `compose.yaml` file is included to spin up a PostgreSQL instance for local development:

```bash
docker compose up -d
```

The database is exposed on port **5432** and is automatically connected via Spring Boot's `spring-boot-docker-compose` dependency.

## Testing

Unit tests are located under `src/test/java`. They use **JUnit 5** and **Spring MockMvc**. Integration tests can be added using Testcontainers. Run all tests with:

```bash
./mvnw test
```

## Conventions & Style

* **Java** code follows the standard Spring Boot conventions.
* **Naming**: Packages use lower‑case, classes use PascalCase.
* **Configuration**: All runtime properties are in `application.properties`.
* **OpenAPI**: The API contract is defined in `customers.yaml` and used by the OpenAPI generator plugin.
* **Database**: Liquibase changelogs are in `src/main/resources/db/changelog`.

## Extending the Project

* Add new entities by creating a JOOQ table, updating the Liquibase changelog, and generating JOOQ sources.
* Add new REST endpoints by implementing the corresponding OpenAPI interface and wiring it into the controller.
* Add tests under the appropriate package.

---

*Generated by Qwen Code on 2025‑12‑31.*
````

Using the exclamation mark (!), you can execute shell commands. However, aliases are not recognized: when using ll, for example, the command cannot be executed; you have to use ls -la instead. Another drawback is that shell commands execute quite slowly.

A really nice feature is the option to create custom commands with predefined prompts. This is very useful when you use prompts repetitively, and you can share them easily with someone else. Create a directory commands in the .qwen directory of your home directory. Using extra directories inside this commands directory, you can create namespaces. As an example, consider the following directory tree.

```
~/.qwen/commands$ tree
.
├── general
│   ├── explain.toml
│   └── javadoc.toml
├── review
│   ├── extended.toml
│   └── simple.toml
└── test
    ├── controller.toml
    ├── integration.toml
    ├── repositoryjooq.toml
    └── service.toml
```

The controller.toml file contains the prompt you used for creating the test.

```toml
description = "Create a test for a Spring Boot Controller"
prompt = """
Write a unit test for this code using JUnit. Use WebMvcTest. Use MockMvc. Use AssertJ assertions.
"""
```

The previous prompt can now be reduced to:

```text
@src/main/java/com/mydeveloperplanet/myaicodeprojectplanet/controller/CustomersController.java
/test:controller
```

MCP

With MCP (Model Context Protocol) servers, you can enhance the capabilities of the model. The configuration of an MCP server can be added to the settings.json file. The Context7 MCP server can be added as follows.

```json
"mcpServers": {
  "context7": {
    "command": "npx",
    "args": ["-y", "@upstash/context7-mcp"],
    "timeout": 15000
  }
}
```

Check whether the MCP server is configured correctly.

```shell
$ qwen mcp list

Configured MCP servers:

✓ context7: npx -y @upstash/context7-mcp (stdio) - Connected
```

A nice thing about this configuration is that you can also indicate, by means of the trust parameter, whether the tool confirmation may be skipped. You can also include and exclude certain tools with the help of the includeTools and excludeTools parameters.

Remove the previously created test, add the CustomersController to the context, and use the following prompt.

```text
/test:controller create the test in this repository, i am using spring boot 3.4, use context7 to retrieve uptodate information
```

This should result in invoking the Context7 MCP server, and as a result, MockBean should not be used in the test; MockitoBean should be used instead. However, the Context7 MCP server is not invoked. Several other prompts were tried, but it seems to be very hard to force the model to use the MCP server.
The same prompt using DevoxxGenie does invoke the MCP server, so it is unlikely that this is caused by the model. An issue was created for this, but it seems that Qwen Code and MCP are not very good friends yet.

Conclusion

Qwen Code offers quite a few nice features. There is a lot more to discover, but the first impressions are good. It is also good to experiment with other AI coding assistants now and then, in order to see how they compare to the ones you are using.
SQL performance tuning remains one of the most demanding tasks in present-day software engineering. A query can be logically sound, well-indexed, and well-tested, yet still degrade significantly under production load. Answers to performance issues are found in execution plans, which are usually dense, technical, and hard to understand quickly. Learning join strategies, scan types, cost estimates, and cardinality forecasts takes skills and time that many development teams simply do not have. As large language models (LLMs) have become integrated into the developer workflow, a number of engineers have started piloting the use of AI as a query analysis tool and an interpreter of performance plans. Rather than manually dissecting complex EXPLAIN ANALYZE results, developers are asking AI to clarify bottlenecks, recommend indexing plans, and point out inefficiencies. This brings us to a critical and practical question: can AI really help with SQL optimization, or does it only give confident answers devoid of engineering worth?

Usage, Description, Real-Life Applications

What Is AI-Assisted SQL Tuning?

AI-assisted SQL tuning does not replace the database optimizer, nor does it override the query planner. Rather, it serves as an analytical aid that helps engineers decipher elaborate execution plans and reason about performance bottlenecks more effectively. Large language models can analyze structured technical outputs, including the results of an EXPLAIN ANALYZE query, and translate them into clearer, plain-language interpretations. This reduces cognitive load and speeds up diagnosis. AI-assisted tuning implies three main capabilities. First, AI may evaluate execution plans to detect possible bottlenecks, e.g., a sequential scan of a large table, a nested loop join over high-cardinality data, or late-stage filtering that makes row processing more expensive.
Second, it can identify standard SQL anti-patterns, such as missing composite indexes that match filter predicates, non-selective indexing policies, or poor join criteria. Third, it can propose specific optimizations, including adding multi-column indexes, restructuring queries, or ordering joins by selectivity. Nevertheless, these are suggestions, not binding decisions. As wider studies on foundation models have emphasized (Bommasani et al., 2021), large language models are pattern recognizers, not system-aware optimizers. They lack access to real-time data distribution statistics, workload patterns, and production constraints, unless this information is specifically made available. Thus, AI-based SQL tuning should be considered an accelerator of reasoning rather than an independent optimization mechanism. Applied appropriately, AI can act as a technical interpreter, making it clearer how a particular plan executes and what improvements might be made, so that engineers can test hypotheses more quickly. Applied blindly, it can produce changes that are not backed by any measured performance improvement.

Why SQL Performance Tuning Remains Challenging

SQL tuning has remained one of the most technical areas of database engineering. Although relational databases like PostgreSQL and MySQL have sophisticated query optimizers, their choices are not easily interpreted. The execution plans produced by EXPLAIN ANALYZE contain comprehensive data regarding scan types, join strategies, row estimates, and costs. These outputs are, however, usually dense and hard to assess quickly, particularly under production pressure. In contrast to deterministic code debugging, SQL optimization is based on probabilistic cost estimation.
Table statistics and cardinality predictions used by query planners are not always realistic reflections of real-world data distribution. Even a minor inaccuracy can result in inefficient join orders, excessive sequential scans, or poor use of indexes. With the increasing size and complexity of datasets, diagnosing such inefficiencies requires experience and careful analysis. AI-assisted SQL tuning tries to minimize this cognitive load. Rather than manually parsing large execution plans, developers can explore query behavior, identify possible bottlenecks, and evaluate proposed optimization techniques using AI systems. But the value of such help depends entirely on how its suggestions are interpreted and validated.

Real-World Applications

AI-assisted SQL tuning is becoming useful in practice. A typical example is the analysis of slow production queries. Developers supply the SQL statement and the execution plan, and the AI determines the likely sources of delay, e.g., sequential scans of large tables or inefficient nested loop joins. Another use case is index recommendation. When filter predicates span multiple columns, AI systems can recommend composite indexes tailored to the query. These recommendations, in many instances, allow for a dramatic decrease in execution time by enabling index scans instead of full table scans. Join-order inefficiencies can also be diagnosed with the help of AI. As an illustration, performance degrades when big tables are joined before selective filters are applied. Through the analysis of execution plans, AI systems can advise reordering predicates or ensuring that indexes support early filtering. Nonetheless, not all recommendations result in a measurable change. Indexes add storage and write overhead. Thus, every recommendation should be justified by an actual performance measurement instead of a presumed benefit.
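Since every suggestion must be validated by measurement, it helps to script the before/after comparison. The following is a minimal Python sketch that extracts the reported execution time from PostgreSQL's text-format EXPLAIN ANALYZE output; the function names and plan fragments are illustrative, not from any library.

```python
import re

def execution_time_ms(explain_output: str) -> float:
    """Extract the total execution time (in ms) from PostgreSQL
    EXPLAIN ANALYZE text output ('Execution Time: NNN.NNN ms')."""
    match = re.search(r"Execution Time: ([\d.]+) ms", explain_output)
    if match is None:
        raise ValueError("no 'Execution Time' line found in plan output")
    return float(match.group(1))

def speedup(before: str, after: str) -> float:
    """Ratio of before/after execution times, used to judge whether a
    suggested change actually helped."""
    return execution_time_ms(before) / execution_time_ms(after)

# Hypothetical plan fragments for a before/after comparison:
plan_before = "Seq Scan on orders ...\nExecution Time: 2800.412 ms"
plan_after = "Index Scan using idx_orders ...\nExecution Time: 310.057 ms"
print(f"speedup: {speedup(plan_before, plan_after):.1f}x")  # speedup: 9.0x
```

In practice, the plan texts would come from running `EXPLAIN ANALYZE` against the database before and after applying a suggestion, and a change would only be kept if the speedup is reproducible.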
Frameworks and Code Sample

Rather than relying on programming frameworks, AI-assisted SQL tuning focuses directly on queries and execution plans. Consider the following query:

```sql
SELECT c.country, COUNT(o.id), SUM(o.total_amount)
FROM customers c
JOIN orders o ON o.customer_id = c.id
WHERE c.country = 'US'
  AND o.status = 'COMPLETED'
  AND o.created_at >= now() - interval '30 days'
GROUP BY c.country;
```

Assume this query executes in approximately 2.8 seconds due to a sequential scan on the orders table. After analyzing the execution plan, an AI system suggests adding a composite index aligned with the filter conditions:

```sql
CREATE INDEX idx_orders_status_created_customer
ON orders (status, created_at, customer_id);
```

After applying this index and rerunning EXPLAIN ANALYZE, execution time decreases to approximately 310 milliseconds. The plan now shows an index scan replacing the sequential scan, with significantly fewer rows processed before the join stage. This example illustrates how AI suggestions can serve as hypotheses. The improvement is measurable, reproducible, and validated through execution metrics rather than theoretical reasoning alone.

Popular AI Suggestions and Their Outcomes

| Suggestion | Intended benefit | Observed result |
|------------|------------------|-----------------|
| Add composite index on filtered columns | Enable index scan and early filtering | Significant performance improvement |
| Add index on low-selectivity column | Improve join efficiency | Minimal or no measurable gain |
| Rewrite JOIN structure | Reduce intermediate row processing | Dependent on data distribution |
| Add multiple redundant indexes | Increase query speed | Increased write overhead without benefit |

These examples demonstrate that AI-generated recommendations must be tested systematically. Some deliver substantial improvements, while others provide no meaningful change. Human validation remains essential.

Conclusion

AI-assisted SQL tuning can and should be viewed as an assistant in performance analysis, not as an actual optimizer.
It can translate execution plans into natural language, detect common SQL anti-patterns, and provide hints on possible optimization strategies. Nonetheless, it has no contextual sensitivity to production workload behavior, fails to consider data skew, and is unable to fully assess operational trade-offs. It can therefore support investigation, but it cannot substitute for experience and contextual knowledge.

Recommendations

AI-generated suggestions should not be treated as final solutions; they are testable hypotheses. All suggested changes should be confirmed using tools such as EXPLAIN ANALYZE and real execution time measurements, to be sure there really was a performance improvement. Index creation must be assessed with care to prevent unwanted storage growth and undesirable impact on write performance. Most importantly, any changes destined for a production environment should always undergo empirical testing.

Key Takeaways

* The future of AI in database engineering is augmentation, not full automation.
* AI can accelerate diagnosis, improve pattern recognition, and reduce time-to-optimization.
* Final decisions must rely on empirical validation and domain expertise.
* AI does not replace database skills; it enhances them when used wisely and in moderation.

References

* Bommasani, R., Hudson, D. A., Adeli, E., et al. (2021). On the Opportunities and Risks of Foundation Models. arXiv:2108.07258.
* Kabra, N., & DeWitt, D. J. (1998). Efficient Mid-Query Re-Optimization of Sub-Optimal Query Execution Plans. Proceedings of the ACM SIGMOD International Conference on Management of Data.
* Leis, V., Radke, B., Gubichev, A., Mirchev, A., Boncz, P., Kemper, A., & Neumann, T. (2015). How Good Are Query Optimizers, Really? Proceedings of the VLDB Endowment, 9(3), 204–215.
* Microsoft. (2023). Responsible AI Standard, v2. Retrieved from https://www.microsoft.com/en-us/ai/responsible-ai
* PostgreSQL Global Development Group. (2025). Indexes. PostgreSQL Documentation. Retrieved from https://www.postgresql.org/docs/current/indexes.html
* PostgreSQL Global Development Group. (2025). Using EXPLAIN. PostgreSQL Documentation. Retrieved from https://www.postgresql.org/docs/current/using-explain.html
Executive Summary

Large Git repositories slow down developers, CI/CD, and release processes. The main culprits are big binary blobs, long-lived histories of rarely used files, and repeated commits of generated artifacts. This guide provides a comprehensive, step-by-step approach to:

* Measure where the bloat is and surgically remove it by rewriting history with safe, modern tools,
* Aggressively repack objects for performance, and
* Put guardrails in place — such as Git LFS, CI size policies, and partial clone — to keep your repo lean over time.

By the end, you will know how to identify the largest objects hidden in your commit DAG, remove historical binaries without breaking your trunk, safely coordinate a force-push for the team, reduce pack files by orders of magnitude, and adopt practices that prevent bloat from coming back.

Who This Article Is For

* Repository maintainers and leads responsible for performance and health
* DevOps/platform engineers running CI/CD and build farms
* Engineering managers planning monorepos or long-lived codebases
* Security and compliance specialists who need to remove secrets or sensitive binaries from history

What This Article Covers

* Why repositories grow and how Git stores data (objects, packs, delta compression, reachability)
* Multiple techniques to find hidden giants: by size, by path, by time range, by author
* History rewriting with git-filter-repo and BFG, including safety, backups, and rollback
* Aggressive packing and garbage collection for space and speed
* Migration strategies to Git LFS and practical caveats
* Clone/checkout optimizations (shallow, sparse, partial clone with blob filtering)
* CI/CD acceleration with caching, partial clone, and artifact policies
* Governance: prevent future bloat with hooks, policies, and education

Symptoms That One Notices (Quick Checks)

* Clones or fetches take minutes instead of seconds
* .git/objects/pack contains one or more pack files of several gigabytes
* CI checkout steps dominate pipeline time and bandwidth
* Developer machines or build agents run out of disk space
* Local operations like git status or git log feel sluggish after years of growth

Quick sanity checks show compressed and loose object counts, pack files in .git/objects/pack, and the repo's size.

```shell
git count-objects -v
ls -lh .git/objects/pack
du -sh .git
```

How Git Stores Data (Objects, Packs, and Reachability)

Git stores content as objects: blobs (file contents), trees (directories), commits (history), and tags. Initially, objects are loose files. Over time, git gc consolidates them into pack files, which apply delta compression and zlib compression. This is excellent for text, but not for many binary formats where deltas are ineffective.

* Loose objects: Individual files in .git/objects/??/
* Pack files: Consolidated storage in .git/objects/pack/pack-*.pack with an index (.idx)
* Delta compression: Efficient for text, often poor for already-compressed binaries (ZIP, JPEG, MP4)
* Reachability: Objects reachable from refs (branches, tags) are retained; unreachable ones become candidates for pruning after grace periods
* Reflogs: Record movements of refs; as long as reflogs reference old commits, their objects remain protected from pruning

Understanding reachability and reflogs is crucial: even if you delete a file or force-push, Git may retain old objects until reflogs expire or you explicitly prune. That is why size improvements often materialize only after aggressive garbage collection and pruning windows have passed.

Diagnose Bloat to Find Large Objects and Hotspots

Start with a ranked list of the largest objects across the entire history:

```shell
git rev-list --objects --all | \
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' | \
  sort -k3 -n | \
  tail -50
```

Interpretation: The output lists object type (typically blob), SHA, size (bytes), and optionally the path. Many of these blobs may no longer be in your working tree; they persist in history.
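The ranked-list pipeline above emits lines of the form `<type> <sha> <size> <path>`; post-processing them in a script is often more convenient than chaining additional shell filters. A minimal Python sketch — the `largest_blobs` helper and the sample SHAs are hypothetical:

```python
def largest_blobs(batch_check_lines, top=5):
    """Rank blobs by size from `git cat-file --batch-check` output lines
    of the form '<type> <sha> <size> <path>' (the path may be empty)."""
    blobs = []
    for line in batch_check_lines:
        parts = line.split(maxsplit=3)
        if len(parts) < 3 or parts[0] != "blob":
            continue  # skip commits, trees, tags, and malformed lines
        sha, size = parts[1], int(parts[2])
        path = parts[3] if len(parts) == 4 else ""
        blobs.append((size, sha, path))
    return sorted(blobs, reverse=True)[:top]

# Example lines as produced by the pipeline above (SHAs shortened):
sample = [
    "blob aaaa1111 52428800 assets/demo.mp4",
    "blob bbbb2222 1024 src/Main.java",
    "commit cccc3333 310",
    "blob dddd4444 10485760 vendor/lib.jar",
]
for size, sha, path in largest_blobs(sample, top=2):
    print(size, sha, path)
```

To feed it real data, pipe the output of the `git rev-list | git cat-file` command into the script's stdin and pass `sys.stdin` as `batch_check_lines`.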
Additional focused analyses:

By path patterns (e.g., archives, media, models). Adjust the grep to match your patterns:

Shell
git rev-list --objects --all | grep -Ei '\.(zip|tar|tgz|gz|7z|mp4|mov|avi|mkv|png|jpg|jpeg|psd|mp3|wav|pdf|bin|jar|war|exe)$' | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' | sort -k3 -n | tail -50

By time range (recent growth vs. historical):

Shell
# Example: last 12 months
git rev-list --objects --since="12 months ago" --all | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' | sort -k3 -n | tail -30

By author (to coach teams or adjust workflows):

Shell
# Large blobs introduced by a specific author (heuristic)
git rev-list --all --author="First Last" | xargs -I {} git ls-tree -r --long {} | awk '{print $4, $5}' | sort -k1,1 -n | tail -20

By directory (which areas contribute most):

Shell
# Sum blob sizes by top-level directory (approximation via the latest tree;
# history-wide totals require custom scripts)
git ls-tree -r --long HEAD | awk '{sub(/\/.*/, "", $5); sizes[$5] += $4} END {for (k in sizes) print sizes[k], k}' | sort -n | tail -20

For a bird's-eye view, consider using git-sizer to highlight pathological patterns such as large commits, huge trees, and excessive history fan-out.

Install and run git-sizer (if available):

- macOS: brew install git-sizer
- Linux: download a binary from the project's releases page

Shell
git sizer --verbose

Backups, Mirrors, and Recovery Plans: Safety First

History rewriting is powerful and disruptive. Before you modify anything, create a mirror backup of your repository and verify it. A mirror captures all refs, including remote-tracking branches and notes. From a clean clone of the repository you want to fix:

Shell
cd /Users/Shivi/Code/bloated-repo-test
echo "Creating a mirror backup..."
cd ..
git clone --mirror repo/ repo-backup.git

Optionally, pack the backup aggressively to save space:

Shell
cd repo-backup.git
git gc --aggressive --prune=now

# Save the refs list
cd /path/to/repo
git for-each-ref --format='%(refname) %(objectname)' > ../refs-pre-cleanup.txt

In addition, export a list of current branches and tags so you can compare pre/post cleanup and restore if necessary.

Rewrite History With git-filter-repo

git-filter-repo is the modern, fast, and well-tested replacement for git filter-branch. It operates locally and is scriptable for precise policies. Begin with a dry-run mindset: craft specific rules and test them on a clone.

Common Scenarios and Recipes

1. Remove all blobs bigger than a threshold across all history:

Shell
# Remove any blob > 10 MB across history
# Requires: pip install git-filter-repo (or a package for your OS)
git filter-repo --strip-blobs-bigger-than 10M

2. Remove specific paths (past and present), e.g., build outputs or vendor archives:

Shell
# Remove paths wherever they occur in history
# --path-glob uses shell-style globs; use --path-regex for Python regexes
git filter-repo --path-glob 'build/**' --path-glob 'dist/**' --invert-paths

# Or remove a specific file introduced long ago
# (keeps everything else intact)
git filter-repo --path 'datasets/huge_model.bin' --invert-paths

3. Replace sensitive content (e.g., secrets) with placeholders while keeping file structure:

Shell
# Replace matching content with redactions
# (requires a replacement rules file; lines use `pattern==>replacement`)
echo 'regex:PASSWORD=.+==>PASSWORD=***' > replacements.txt
git filter-repo --replace-text replacements.txt

4. Rewrite author info (useful while you're cleaning anyway):

Shell
# Map old identities to new canonical ones (standard .mailmap format:
# "New Name <new-email> Old Name <old-email>")
cat > authors.txt <<'EOF'
Kashyap Shivi <[email protected]> Shivi Kashyap <[email protected]>
EOF
git filter-repo --mailmap authors.txt

After running filter-repo, your local history has changed. Verify repository health: run tests, build, and compare key branches.
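The pre/post ref comparison suggested above can be scripted. A sketch that diffs two snapshots in the "refname sha" format written by the for-each-ref command; the sample SHAs are made up:

```python
# Compare the ref snapshot taken before cleanup (refs-pre-cleanup.txt) with one
# taken afterwards. Each file line is "<refname> <sha>" as produced by
# `git for-each-ref --format='%(refname) %(objectname)'`.

def diff_refs(pre_lines, post_lines):
    """Return (missing_refs, rewritten_refs):
    refs that vanished entirely, and refs whose SHA changed."""
    pre = dict(line.split() for line in pre_lines if line.strip())
    post = dict(line.split() for line in post_lines if line.strip())
    missing = sorted(set(pre) - set(post))
    rewritten = sorted(r for r in pre if r in post and pre[r] != post[r])
    return missing, rewritten

# Example: main was rewritten (expected), but a tag disappeared (investigate!).
pre = ["refs/heads/main abc123", "refs/tags/v1.0 def456"]
post = ["refs/heads/main 999fff"]
print(diff_refs(pre, post))
```

After a filter-repo run you expect every branch SHA to change; refs in the "missing" list that you did not intend to drop are the ones worth investigating before force-pushing.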
Then you'll coordinate a force-push.

Coordinating a Safe Cutover (Force Push, Re-Clone, and Communication)

- Agree on a freeze window (e.g., 30–60 minutes) in which no merges occur.
- Communicate the plan, the reason (space/performance), and the precise steps collaborators must take.
- Relax or temporarily disable branch protections as required to allow the force-push (admin-only).
- Force-push all updated branches and tags to the remote.
- Ask collaborators to archive/abandon old clones and perform a fresh clone, or hard reset to the new history if permitted.

Shell
# Force-push examples (use cautiously)

# Push the primary branch
git push --force-with-lease origin main

# Push all branches and tags after filter-repo
# (ensure you understand which refs changed)
for ref in $(git for-each-ref --format='%(refname:short)' refs/heads/); do
  git push --force-with-lease origin "$ref"
done
git push --force --tags

Encourage developers to run a fresh clone for the cleanest state. If they must keep their working copy, they can rebase or hard reset to the new history, though this is more error-prone.

Shell
# In an existing clone (risky for unpushed work)
# Save local changes, then:
git fetch origin
git checkout main

# Option A: hard reset to the new history
git reset --hard origin/main

# Option B: rebase a topic branch onto the new main
# git rebase --rebase-merges --onto origin/main <old-base> <your-branch>

Alternative: BFG Repo-Cleaner

BFG provides a high-level, fast interface for common cleanup tasks like removing big files or secrets. It rewrites only the Git database, leaving your working tree unchanged.
Shell
# Remove blobs larger than 50 MB
java -jar bfg.jar --strip-blobs-bigger-than 50M your-repo.git

# Remove a directory everywhere in history
java -jar bfg.jar --delete-folders build --delete-files build.log --no-blob-protection your-repo.git

# After BFG, always run
git reflog expire --expire=now --all && git gc --prune=now --aggressive

Repack and Garbage-Collect for Maximum Space Savings

- Expire reflogs so unreachable objects can be pruned immediately.
- Run aggressive garbage collection to consolidate and compress anew.
- Repack with deeper windows for better delta chains across similar content.

Shell
# Expire reflogs and prune aggressively (use with care)
git reflog expire --expire-unreachable=now --expire=now --all
git gc --prune=now --aggressive

# Optional: explicit repack knobs
git repack -a -d -f --depth=250 --window=250

These steps create fresh pack files, collapse historical objects, and remove unreachable blobs. Results vary, but reductions from multi-GB to sub-GB are common when binaries are purged.

Example Size Improvement

Metric | Before | After
.git/objects/pack size | 3.2 GB | 350–450 MB
Cold clone time (LAN) | 45–90 s | 6–12 s
CI checkout (no cache) | Slow/bandwidth-heavy | Fast/lightweight

Move Big Binaries to Git LFS

Git Large File Storage (LFS) keeps pointers in Git and stores large content on a separate LFS server (or the hosting provider's storage). This keeps your Git history text-friendly while still versioning binaries.

1. Install and initialize LFS.
2. Track patterns for large or frequently changing binaries.
3. Migrate historical blobs to LFS if necessary.
4. Enforce via CI or pre-receive hooks to prevent regressions.

1. Install and initialize.

Shell
git lfs install

2. Track patterns.
Shell
git lfs track "*.zip"
git lfs track "*.mp4"
git lfs track "*.psd"

Note that git lfs track writes the corresponding filter rules (e.g., *.zip filter=lfs diff=lfs merge=lfs -text) into .gitattributes for you, so there is no need to append them by hand.

3. Commit the attributes file.

Shell
git add .gitattributes
git commit -m "chore: track large binaries via Git LFS"

Migrating history to LFS (optional; can be time-consuming):

Shell
# Migrate existing big binaries to LFS across history
# Use with caution and test on a clone
git lfs migrate import --include="*.zip,*.mp4,*.psd" --include-ref=refs/heads/main
git push origin --all
git lfs push origin --all

Before enabling LFS, confirm your remote hosting (GitHub, GitLab, Azure Repos, Bitbucket) supports LFS and that your organisation has appropriate quotas and retention policies.

Clone, Fetch, and Checkout Optimizations

- Shallow clones cut history depth for CI or ephemeral jobs
- Sparse checkout limits the working tree to a subset of paths
- Partial clone with --filter=blob:none fetches trees/commits first and lazily fetches blobs on demand

Shell
# Shallow clone for CI
git clone --depth=1 https://github.ibm.com/Shivi-Kashyap/test-repo.git

# Sparse checkout: only a subset of the tree
git clone https://github.ibm.com/Shivi-Kashyap/test-repo.git
cd test-repo
git sparse-checkout init --cone
# Pull only certain directories
git sparse-checkout set src/ tools/

# Partial clone with blob filtering
git clone --filter=blob:none --no-checkout https://github.ibm.com/Shivi-Kashyap/test-repo.git repo
cd repo
git checkout main

Combine these strategies in CI to cut network, disk, and time dramatically, especially when caches are cold.

Governance: Prevent Bloat From Coming Back

- Policy: define what should never be committed (archives, media, large datasets, build outputs).
- .gitignore: keep it current and central.
Add generated artifacts and local files to it.
- Hooks: use pre-receive or push protection to reject files above size thresholds.
- Education: teach contributors how Git LFS works and when to use it.
- Monitoring: track repo size, pack size, and clone times periodically.

Shell
# Example server-side pre-receive snippet (pseudo)
size_limit=$((20*1024*1024)) # 20 MB
zero="0000000000000000000000000000000000000000"
while read oldrev newrev refname; do
  # For newly created refs, oldrev is all zeros: check everything reachable
  [ "$oldrev" = "$zero" ] && oldrev=""
  for obj in $(git rev-list --objects ${oldrev:+$oldrev..}$newrev | awk '{print $1}'); do
    size=$(git cat-file -s "$obj")
    if [ "$size" -gt "$size_limit" ]; then
      echo "Rejected: object $obj is larger than 20MB" >&2
      exit 1
    fi
  done
done
exit 0

Special Topics and Edge Cases

Monorepos and Partial Ownership

In monorepos, growth can be rapid because many teams contribute artifacts. Consider enforcing LFS for all non-text binaries, and encourage sparse/partial clones for teams that only need a subset. Break out exceptionally large, rarely changing assets into separate repositories managed as submodules, or publish them as package artifacts in your artifact repository.

Submodules vs. Subtrees vs. Vendoring

Embedding third-party code or assets can balloon history. Submodules keep history separate but add coordination overhead. Subtrees copy history into your repo, increasing size. Vendoring prebuilt binaries or archives is especially costly. Prefer package managers and artifact repositories where possible.

Binary Formats and Delta Compression

Already-compressed formats (ZIP, JPEG, MP4, tar.gz) are poor delta candidates; Git often stores each version almost independently. For assets that change frequently, LFS plus a content-addressable artifact store can be a better fit than versioning inside Git.

Secrets and Compliance Cleanups

When removing secrets, replace the text with redactions across history and rotate the credentials immediately. Combine git-filter-repo --replace-text with organization-wide secret scanning (e.g., pre-receive hooks or provider tools) to prevent recurrence.
Document the incident and the fix for audit trails.

Windows, macOS, and Filesystem Pitfalls

Case-sensitivity differences, long-path limits, and antivirus scans can impact performance. On Windows, enable long paths, keep antivirus exceptions for your repo cache on build agents, and ensure line-ending normalization is configured consistently via .gitattributes.

Reusable Cleanup Scripts (Linux/macOS)

Shell
#!/usr/bin/env bash
set -euo pipefail

# Usage: ./shrink-repo.sh /path/to/repo 10M
REPO_DIR=${1:-.}
SIZE_LIMIT=${2:-10M}

pushd "$REPO_DIR" >/dev/null

# 1) Backup (mirror)
echo "==> Backing up as mirror..."
BACKUP_DIR="../$(basename "$REPO_DIR")-backup.git"
if [ ! -d "$BACKUP_DIR" ]; then
  git clone --mirror . "$BACKUP_DIR"
fi

# 2) Show top offenders
echo "==> Top 30 largest blobs (pre-cleanup):"
git rev-list --objects --all | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' | sort -k3 -n | tail -30 | tee ../largest-pre.txt

# 3) Rewrite history (size-based)
echo "==> Rewriting history: stripping blobs bigger than $SIZE_LIMIT"
if command -v git-filter-repo >/dev/null; then
  git filter-repo --strip-blobs-bigger-than "$SIZE_LIMIT"
else
  echo "ERROR: git-filter-repo not found. Install it first." >&2
  exit 1
fi

# 4) Expire reflogs and GC
echo "==> Expiring reflogs and running aggressive GC"
git reflog expire --expire=now --expire-unreachable=now --all
git gc --prune=now --aggressive

# 5) Report
echo "==> Top 30 largest blobs (post-cleanup):"
git rev-list --objects --all | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' | sort -k3 -n | tail -30 | tee ../largest-post.txt

# 6) Reminder
echo "==> IMPORTANT: Coordinate a force-push and ask all collaborators to re-clone."
popd >/dev/null

Appendix A: Pre- and Post-Cleanup Checklists

Pre-Cleanup

- Announce the maintenance window and the re-clone requirement
- Create a mirror backup and verify it is restorable
- Snapshot the list of refs (branches/tags)
- Draft filter rules and test them on a scratch clone
- Confirm branch protection and permissions for the force-push

Post-Cleanup

- Expire reflogs and run GC with prune
- Compare the refs list before/after; investigate discrepancies
- Validate builds/tests on the mainline and critical branches
- Re-enable protections and update CI to use shallow/partial clone
- Monitor repo size and performance for a week

Appendix B: Command Reference (Quick Copy/Paste)

Shell
# Largest objects
git rev-list --objects --all | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' | sort -k3 -n | tail -50

# Remove blobs > 10MB
git filter-repo --strip-blobs-bigger-than 10M

# Remove paths across history
git filter-repo --path-glob 'build/**' --path-glob 'dist/**' --invert-paths

# Replace sensitive content
git filter-repo --replace-text replacements.txt

# Post-rewrite GC
git reflog expire --expire=now --expire-unreachable=now --all && git gc --prune=now --aggressive

# Repack knobs
git repack -a -d -f --depth=250 --window=250

# Partial clone
git clone --filter=blob:none <url>

Final Thoughts

Git is astonishingly capable for source and text-based workflows, but it needs a little help when repositories accumulate large binaries and deep histories. With a structured cleanup using git-filter-repo, a disciplined repack, and permanent guardrails like Git LFS and partial clone, you can transform a sluggish multi-gigabyte repository into a nimble, developer-friendly asset. Make cleanup a periodic ritual (quarterly for fast-moving monorepos), and pair it with education and policies so your gains persist.
Chaos engineering transformed modern reliability practices. Instead of waiting for systems to fail in production, we began deliberately injecting failure into distributed architectures to observe how they behaved under stress. The philosophy was simple: resilience cannot be assumed; it must be tested. For stateless microservices and horizontally scaled cloud systems, this approach worked remarkably well. Random instance termination, injected latency, and packet loss exposed weaknesses in infrastructure that traditional testing often missed. However, the systems we are building today are fundamentally different from those chaos engineering was originally designed to protect. We are no longer validating stateless services. We are validating AI-driven pipelines, retrieval-augmented generation systems, vector databases, and increasingly, agentic AI frameworks capable of autonomous decision-making. In this environment, failure is no longer binary. It is probabilistic.

When Failure Doesn't Announce Itself

In traditional systems, failure is loud. A node crashes. A request times out. An alert fires. Engineers respond. In AI systems, failure often manifests as degradation rather than collapse. A slight latency spike in an embedding service may reduce retrieval quality. A minor throughput bottleneck may truncate context windows. An inference layer might still return responses, but with subtle hallucination or reduced factual grounding. The system appears operational. Yet internal quality metrics (accuracy, precision, and contextual coherence) begin to drift. Recent reliability studies across production ML systems show that the majority of AI-related failures are not catastrophic outages but quality degradations that remain undetected for extended periods. This class of failure is more dangerous than downtime. Downtime is visible. Silent degradation compounds quietly. Traditional chaos engineering does not model this behavior.
It injects static fault magnitudes without understanding how those faults propagate across stateful, interdependent AI components. That gap becomes critical in high-stakes environments.

The Economic Reality of AI Degradation

Enterprise downtime has a measurable cost, often cited at thousands of dollars per minute, depending on the industry. However, degraded intelligence carries a different type of risk. If an AI-driven support system processes thousands of interactions with reduced accuracy before detection, the impact is not limited to infrastructure. It affects customer trust, compliance exposure, and revenue trajectory. In recommendation systems, even a modest percentage drop in model performance can translate into significant financial impact at scale. In other words, the cost of instability in AI systems is nonlinear. And yet most chaos engineering tools still operate on fixed injection models: inject 10% packet loss, add 200 milliseconds of latency, terminate N instances. These actions do not account for dependency depth, model sensitivity, or SLA-critical inference paths. They assume infrastructure is flat. It is not.

The Shift Toward Intent-Based Chaos Engineering

This structural limitation became evident to me while working on enterprise infrastructure supporting high-value deployments. Chaos testing in production was considered too risky, not because resilience testing lacked value, but because it lacked predictability. The fundamental question from leadership was clear: how do we guarantee that resilience testing itself does not become the outage? Answering that question required reframing chaos entirely. Rather than beginning with fault injection, I began with intent. Instead of saying, "Inject 20% latency," the system defines a resilience hypothesis: validate that inference accuracy remains above 95% under simulated API latency stress within SLA thresholds. That distinction shifts chaos from disruption to experimentation.
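A resilience hypothesis such as "inference accuracy remains above 95%" implies continuously measuring quality, not just uptime, so that silent degradation is caught before it compounds. A minimal rolling-window drift check; the class name, threshold, and window size here are illustrative, not taken from the article:

```python
from collections import deque

class DriftMonitor:
    """Alert when the rolling mean of a quality metric drifts below a floor."""

    def __init__(self, floor=0.95, window=100):
        self.floor = floor
        self.scores = deque(maxlen=window)  # keeps only the most recent scores

    def record(self, score):
        """Record one per-request quality score; return True if drifting."""
        self.scores.append(score)
        return self.drifting()

    def drifting(self):
        if len(self.scores) < self.scores.maxlen:
            return False  # not enough evidence yet
        return sum(self.scores) / len(self.scores) < self.floor

monitor = DriftMonitor(floor=0.95, window=10)
healthy = [monitor.record(0.97) for _ in range(10)]   # stays above the floor
degraded = [monitor.record(0.90) for _ in range(10)]  # window fills with 0.90
print(healthy[-1], degraded[-1])
```

The point of the window is that a single bad response never fires the alert; only a sustained shift in the mean does, which matches how degradation (rather than outage) actually presents.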
This approach became formalized as intent-based chaos engineering, protected under U.S. Patent 12242370B2. The core idea is straightforward but transformative: failure magnitude must be derived from environmental risk and business sensitivity, not arbitrarily applied.

The Mechanics of an Intent-Based Engine

At its core, an intent-based engine evaluates three primary dimensions before injecting any stress.

First, it processes intent parameters: the target degradation threshold, acceptable SLA drift, duration of the experiment, and business criticality weight. This ensures that the test is aligned with operational objectives.

Second, it analyzes topology data. This includes the service dependency graph, node centrality, statefulness, throughput patterns, and critical path depth. AI systems often resemble interconnected graphs rather than linear flows, and understanding that structure is essential.

Third, it calculates a sensitivity index. This metric reflects how strongly a given component influences inference quality, historical fault propagation rates, and compliance exposure.

Using these inputs, the engine computes what I refer to as a Variable Chaos Level (VCL). In simplified form:

Python
risk_score = topology_centrality * sensitivity_index

The injected stress is inversely proportional to environmental risk. High-centrality components receive carefully scaled degradation. Low-risk components can tolerate higher stress levels. Chaos becomes calculated. Not guessed.

Why Topology Awareness Is Critical for AI Systems

Consider a typical retrieval-augmented generation pipeline:

User request → API gateway → Authentication → Embedding service → Vector database → Re-ranking layer → Language model → Agent layer → Response

Some of these nodes have a limited blast radius. Others serve as convergence points that influence downstream reasoning. Injecting uniform failure across all components ignores this structural reality.
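The simplified risk_score formula and the closed-loop recalibration can be sketched in a few lines. Everything beyond risk_score itself (the linear inverse scaling, the proportional gain, the function names) is illustrative and not from the patent:

```python
def variable_chaos_level(topology_centrality, sensitivity_index, max_stress=1.0):
    """Scale injected stress inversely with environmental risk.

    Inputs are assumed normalized to [0, 1]; the result is the fraction of
    max_stress the engine may inject into this component.
    """
    risk_score = topology_centrality * sensitivity_index
    return max_stress * (1.0 - risk_score)

def recalibrate(coefficient, intended, observed, gain=0.5):
    """Closed loop: nudge the injection coefficient toward the intended degradation."""
    error = intended - observed
    return coefficient * (1.0 + gain * (error / intended))

# A low-centrality gateway tolerates far more stress than a critical vector store.
gateway = variable_chaos_level(topology_centrality=0.2, sensitivity_index=0.1)
vector_db = variable_chaos_level(topology_centrality=0.9, sensitivity_index=0.8)

# Intended 5% accuracy degradation, observed 8%: the next run injects less.
coeff = recalibrate(1.0, intended=0.05, observed=0.08)
print(gateway, vector_db, coeff)
```

The gateway ends up permitted roughly 0.98 of the maximum stress versus roughly 0.28 for the vector store, which is the "high-centrality components receive carefully scaled degradation" behavior in miniature.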
A modest latency spike in a stateless gateway may be inconsequential. The same spike in a vector retrieval layer may significantly reduce context precision, altering the model's reasoning path. Intent-based chaos evaluates dependency gravity before injecting stress. If the SLA breach probability exceeds tolerance, the experiment is automatically scaled down or aborted. After injection, the actual impact is measured against the intended degradation, and coefficients are recalibrated for subsequent iterations. This closed-loop mechanism transforms chaos from reactive testing into predictive modeling.

Agentic AI and the Amplification of Instability

As AI systems evolve toward agentic autonomy, resilience challenges intensify. Autonomous agents now trigger remediation workflows, rebalance traffic, scale infrastructure, and make configuration decisions without human approval. When instability enters such systems, it can propagate through automated decision loops. A transient latency signal might trigger an unnecessary failover. A temporary degradation could escalate into a cascading remediation cycle. In this context, chaos testing must model not only infrastructure resilience but decision resilience. Intent-based chaos provides a calibrated stress framework that ensures autonomous agents are validated against controlled degradation scenarios. Without that framework, autonomy risks amplifying minor disturbances into systemic instability.

From Experimental Practice to Engineering Discipline

Perhaps the most meaningful outcome of this methodology was organizational, not technical. When resilience testing became measurable and bounded, and stakeholders could see that degradation was derived from topology analytics and SLA sensitivity rather than arbitrary values, executive resistance diminished. Chaos testing moved from an experimental DevOps tactic to a formal validation protocol.
In enterprise environments tied to global contracts exceeding nine figures, resilience simulation became a prerequisite for major rollouts. That shift reflects a broader evolution. Chaos engineering began as bold experimentation. In AI-driven infrastructure, it must mature into risk-calibrated engineering.

Closing Thoughts

The systems we are building today behave probabilistically. They learn, infer, and decide. They do not fail in clean, binary ways. Random failure injection was sufficient when architectures were simpler. But in AI-native and agentic systems, resilience must be engineered with intent. Intent-Based Chaos Engineering reframes chaos as controlled experimentation rooted in topology awareness, sensitivity modeling, and closed-loop validation. As autonomy increases, predictability becomes foundational. And resilience, like intelligence itself, must be designed deliberately.
GitOps has a fundamental tension: everything should be in Git, but secrets shouldn't be in Git. You need database passwords, API keys, and tokens to deploy applications, but committing them to a repository is a security incident waiting to happen. This post covers how to solve this with Infisical and External Secrets Operator (ESO), a combination that keeps secrets out of Git while letting Kubernetes applications access them seamlessly. The same architectural pattern works with any ESO-supported backend (HashiCorp Vault, AWS Secrets Manager, Azure Key Vault, GCP Secret Manager), so the concepts apply regardless of which secrets manager you choose.

The Problem: Secret Zero

Every secrets management system has a bootstrapping problem. You need a secret to access your secrets manager. Where does that initial secret come from? The options aren't great:

- Environment variables on the host: someone has to set them
- Cloud IAM: requires cloud infrastructure and vendor lock-in
- Mounted files: still need to get the file there somehow

The pragmatic approach: machine identity credentials stored locally, passed to scripts as environment variables. Not perfect, but contained to one location and never committed to Git.

Choosing a Secrets Backend

I evaluated several options for this setup: HashiCorp Vault, AWS Secrets Manager, Azure Key Vault, and Infisical. For a homelab or small team context, I went with Infisical because it had lower operational overhead than Vault (no unsealing, no HA configuration), a native ESO provider, and machine identity authentication designed for Kubernetes workloads. It also offers EU hosting for data residency requirements. That said, ESO supports over 20 secret store providers. If your organisation already runs Vault or uses a cloud-native secrets manager, the ExternalSecret patterns in this article work the same way; only the ClusterSecretStore configuration changes.
The general setup is: store secrets in your secrets manager, create a service account or machine identity for the cluster, and let ESO sync secrets into Kubernetes.

Choosing Your Infisical Region

Infisical offers two hosted regions. Choose based on your data residency requirements:

Region | API URL | Use Case
US (default) | https://app.infisical.com | Most users, no specific data residency needs
EU | https://eu.infisical.com | GDPR compliance, European data residency

Throughout this post, examples use the US region (app.infisical.com) as the default. If you need EU hosting, replace the domain in all configuration.

Setting Up Machine Identity

Machine identities in Infisical use Universal Auth: a client ID and secret pair specifically for automated systems. No user login, no MFA prompts, just machine-to-machine authentication.

In Infisical's web UI:

1. Within a project, go to Access Control > Machine Identities
2. Click Add Machine Identity to Project
3. Generate a client ID and client secret
4. Save these somewhere secure (you'll need them for bootstrap and ongoing management)

The identity needs access to read secrets from your project. Scope it to the appropriate environment with read-only access; it doesn't need to modify secrets, just fetch them.

Storing Configuration

Before diving into implementation, establish where configuration lives.
I use a config.env file for non-secret values that both scripts and infrastructure-as-code tools can read:

Shell
# Infisical Configuration
INFISICAL_API_URL="https://app.infisical.com" # or https://eu.infisical.com for EU
INFISICAL_PROJECT_SLUG="my-project-slug"
INFISICAL_PROJECT_ID="your-project-uuid"
INFISICAL_ENVIRONMENT="dev"
# Credentials come from environment variables, never stored in files

The actual credentials (INFISICAL_CLIENT_ID and INFISICAL_CLIENT_SECRET) stay in environment variables, set before running any scripts:

Shell
export INFISICAL_CLIENT_ID="your-client-id"
export INFISICAL_CLIENT_SECRET="your-client-secret"

This separation keeps configuration in version control while credentials stay out.

Bootstrap: Fetching Initial Secrets

During cluster bootstrap, ESO isn't installed yet. Use the Infisical CLI directly to fetch any secrets needed for initial setup (like an ArgoCD admin password).

Install the CLI:

Shell
curl -1sLf 'https://dl.cloudsmith.io/public/infisical/infisical-cli/setup.deb.sh' | sudo -E bash
sudo apt-get install -y infisical

Authenticate and fetch a secret:

Shell
# Authenticate with machine identity
INFISICAL_TOKEN=$(infisical login \
  --method="universal-auth" \
  --client-id="$INFISICAL_CLIENT_ID" \
  --client-secret="$INFISICAL_CLIENT_SECRET" \
  --domain="https://app.infisical.com" \
  --silent \
  --plain)

# Fetch a specific secret
ARGOCD_PASSWORD=$(infisical secrets get ARGOCD_ADMIN_PASSWORD \
  --path="/argocd" \
  --env="dev" \
  --projectId="$INFISICAL_PROJECT_ID" \
  --domain="https://app.infisical.com" \
  --token="$INFISICAL_TOKEN" \
  --silent \
  --plain)

# Clear the token from the environment when done
unset INFISICAL_TOKEN

The --plain flag returns just the value, no JSON wrapping. The --silent flag suppresses progress output.
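If your bootstrap tooling is Python rather than pure shell, the same fetch can be wrapped around the CLI with subprocess. A sketch; only the flags shown in the shell example above are used, and the helper names are my own:

```python
# Wrap the Infisical CLI for use from a Python bootstrap script.
# build_fetch_command mirrors the `infisical secrets get` flags used above;
# run_fetch shells out and returns the plain secret value.

import subprocess

def build_fetch_command(name, path, env, project_id, token,
                        domain="https://app.infisical.com"):
    """Construct the argv list for a plain-value secret fetch."""
    return [
        "infisical", "secrets", "get", name,
        f"--path={path}",
        f"--env={env}",
        f"--projectId={project_id}",
        f"--domain={domain}",
        f"--token={token}",
        "--silent", "--plain",
    ]

def run_fetch(*args, **kwargs):
    """Execute the fetch; --plain output is the bare value plus a newline."""
    cmd = build_fetch_command(*args, **kwargs)
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout.strip()

# Inspect the command without executing it (no CLI or network needed):
cmd = build_fetch_command("ARGOCD_ADMIN_PASSWORD", "/argocd", "dev",
                          "your-project-uuid", "token-here")
print(" ".join(cmd[:4]))
```

Keeping command construction separate from execution makes the invocation testable without the CLI installed, and keeps the token out of any logged command strings you might print for debugging.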
Validate credentials early in your bootstrap script:

Shell
validate_environment() {
  if [ -z "$INFISICAL_CLIENT_ID" ] || [ -z "$INFISICAL_CLIENT_SECRET" ]; then
    echo "Missing Infisical credentials"
    echo "Please set: export INFISICAL_CLIENT_ID='...' INFISICAL_CLIENT_SECRET='...'"
    exit 1
  fi
}

Installing External Secrets Operator

With the cluster running, install ESO via Helm:

Shell
helm repo add external-secrets https://charts.external-secrets.io
helm repo update
helm upgrade --install external-secrets external-secrets/external-secrets \
  --namespace external-secrets \
  --create-namespace \
  --set installCRDs=true \
  --wait

Once installed, ESO watches for ExternalSecret resources and syncs them into Kubernetes Secrets.

Creating the Credentials Secret

ESO needs credentials to authenticate with Infisical. Create a Kubernetes Secret containing the machine identity:

Shell
kubectl create namespace platform-secrets
kubectl create secret generic infisical-credentials \
  --namespace platform-secrets \
  --from-literal=client-id="$INFISICAL_CLIENT_ID" \
  --from-literal=client-secret="$INFISICAL_CLIENT_SECRET"

Or declaratively with Terraform/OpenTofu:

HCL
resource "kubernetes_secret" "infisical_credentials" {
  metadata {
    name      = "infisical-credentials"
    namespace = "platform-secrets"
  }
  data = {
    "client-id"     = var.infisical_client_id
    "client-secret" = var.infisical_client_secret
  }
}

Configuring the ClusterSecretStore

A ClusterSecretStore tells ESO how to reach Infisical.
This is cluster-wide, so any namespace can reference it:

YAML
apiVersion: external-secrets.io/v1
kind: ClusterSecretStore
metadata:
  name: infisical-cluster-secretstore
spec:
  provider:
    infisical:
      hostAPI: https://app.infisical.com # or https://eu.infisical.com for EU
      auth:
        universalAuthCredentials:
          clientId:
            name: infisical-credentials
            key: client-id
            namespace: platform-secrets
          clientSecret:
            name: infisical-credentials
            key: client-secret
            namespace: platform-secrets
      secretsScope:
        projectSlug: my-project-slug
        environmentSlug: dev
        secretsPath: "/"

Apply it:

Shell
kubectl apply -f cluster-secret-store.yaml

Using the Terraform Provider

If you manage infrastructure with Terraform/OpenTofu, you can read secrets directly from Infisical. This is useful for configuring other providers (like ArgoCD) that need credentials.

HCL
terraform {
  required_providers {
    infisical = {
      source  = "Infisical/infisical"
      version = "~> 0.15"
    }
  }
}

provider "infisical" {
  host = "https://app.infisical.com" # or https://eu.infisical.com for EU
  auth = {
    universal = {
      client_id     = var.infisical_client_id
      client_secret = var.infisical_client_secret
    }
  }
}

Fetch secrets as data sources:

HCL
data "infisical_secrets" "argocd" {
  env_slug     = "dev"
  workspace_id = var.infisical_project_id
  folder_path  = "/argocd"
}

# Use in other provider configurations
provider "argocd" {
  password = data.infisical_secrets.argocd.secrets["ARGOCD_ADMIN_PASSWORD"].value
}

This lets you bootstrap providers that need secrets without hardcoding values or using separate secret files.

Important: state file security. When Terraform/OpenTofu reads secrets, those values end up in the state file. This is a security consideration:

- OpenTofu supports native client-side state encryption (since 1.7) using AES-GCM with keys from PBKDF2, AWS KMS, GCP KMS, or OpenBao
- Terraform does not have native state encryption; you must rely on encrypted backends (S3 with SSE, Terraform Cloud, etc.)
If you're storing secrets in state, OpenTofu's encryption feature is worth considering. Otherwise, ensure your state backend is properly secured and access-controlled.

ExternalSecret Patterns

With the ClusterSecretStore configured, applications request secrets via ExternalSecret resources. These live in Git; they contain references to secrets, not the values themselves.

Basic pattern, a single secret:

YAML
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: redis-credentials
  namespace: redis
spec:
  refreshInterval: 15m
  secretStoreRef:
    name: infisical-cluster-secretstore
    kind: ClusterSecretStore
  target:
    name: redis-credentials
    creationPolicy: Owner
  data:
    - secretKey: password
      remoteRef:
        key: "/redis/REDIS_PASSWORD"

Multiple secrets in one resource:

YAML
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: minio-credentials
  namespace: minio
spec:
  refreshInterval: 15m
  secretStoreRef:
    name: infisical-cluster-secretstore
    kind: ClusterSecretStore
  target:
    name: minio-credentials
  data:
    - secretKey: rootUser
      remoteRef:
        key: "/minio/MINIO_ROOT_USER"
    - secretKey: rootPassword
      remoteRef:
        key: "/minio/MINIO_ROOT_PASSWORD"

Templated secrets with labels:

YAML
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: gitlab-repo-credentials
  namespace: argocd
spec:
  refreshInterval: 15m
  secretStoreRef:
    name: infisical-cluster-secretstore
    kind: ClusterSecretStore
  target:
    name: gitlab-repo
    creationPolicy: Owner
    template:
      metadata:
        labels:
          argocd.argoproj.io/secret-type: repository
      data:
        type: git
        url: https://gitlab.com/your-org/your-repo.git
        username: "{{ .username }}"
        password: "{{ .password }}"
  data:
    - secretKey: username
      remoteRef:
        key: "/gitlab/DEPLOY_TOKEN_USERNAME"
    - secretKey: password
      remoteRef:
        key: "/gitlab/DEPLOY_TOKEN_PASSWORD"

The template feature lets you construct complex secrets combining static values with fetched values.
The template feature is particularly useful for GitLab or GitHub runner authentication, where the target secret needs specific labels and a mix of static and dynamic values.

Organizing Secrets in Infisical

Organize secrets by path for clarity:

| Path | Purpose |
| --- | --- |
| /argocd/ | ArgoCD admin credentials |
| /gitlab/ | GitLab deploy tokens, runner tokens |
| /redis/ | Redis authentication |
| /minio/ | Object storage credentials |
| /grafana/ | Monitoring credentials |
| /cert-manager/ | DNS challenge credentials |

The pattern: /<application>/<SECRET_NAME>. Clear, searchable, and easy to scope access.

Types of secrets to store:

- Service credentials: Database passwords, cache auth, object storage keys
- Platform tokens: Deploy tokens, runner registration tokens
- Cloud credentials: IAM keys for cert-manager DNS validation
- Application secrets: API keys, admin passwords

The Refresh Cycle

ESO polls on an interval, not continuously. Use refreshInterval: 15m for most secrets:

- Secret rotation takes up to 15 minutes to propagate
- Reduces API calls to Infisical
- Acceptable latency for most use cases

Lower the interval for critical secrets requiring faster rotation. Increase it for static secrets that rarely change.
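Because ESO rewrites the backing Kubernetes Secret on each refresh, and kubelet updates secret volumes in place, applications should read the secret at use time rather than caching it at startup. A minimal sketch, assuming the Secret is mounted as a volume (the mount path shown is a hypothetical example, not from the article):

```python
from pathlib import Path

def current_password(mount_path="/etc/secrets/redis-credentials/password"):
    """Read the synced secret each time it is needed.

    ESO may rewrite the backing Kubernetes Secret every refreshInterval,
    and kubelet refreshes mounted secret files in place, so a value cached
    at container start can go stale after a rotation.
    """
    return Path(mount_path).read_text().strip()
```

Environment variables, by contrast, are fixed at container start; a secret rotated via env-var injection only takes effect after a pod restart.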
Security Considerations

What's protected:

- No secrets in Git - ExternalSecrets reference paths, not values
- Machine identity credentials are never committed
- Infisical handles encryption at rest and in transit

What's not protected:

- Kubernetes Secrets are base64 encoded, not encrypted (unless you enable encryption at rest)
- Anyone with cluster access can read synced secrets
- The secret zero problem is pushed to the operator, not eliminated

Recommendations:

- Enable Kubernetes encryption at rest for Secrets
- Use RBAC to restrict secret access by namespace
- Consider Sealed Secrets or SOPS for secrets that must be in Git
- Audit Infisical access logs periodically

The Complete Flow

Putting it all together:

1. Setup (one-time): Create a machine identity in Infisical, store client ID/secret locally
2. Bootstrap: Script authenticates via CLI, fetches initial secrets, installs cluster components
3. ESO install: External Secrets Operator deployed to the cluster
4. Credentials: Create the infisical-credentials Kubernetes Secret
5. ClusterSecretStore: Configure ESO to connect to Infisical
6. ExternalSecrets: Deploy manifests that reference secrets by path
7. Sync: ESO watches ExternalSecrets, creates Kubernetes Secrets
8. Consumption: Pods mount secrets normally - they don't know the source

Applications see standard Kubernetes Secrets. ESO is the bridge.

What I'd Change

- Secret versioning: Infisical supports secret versions. Pinning to specific versions would add safety during rotations.
- Backup strategy: If Infisical is unavailable, ESO can't refresh secrets. Existing secrets persist, but new deployments might fail. A backup secret store would help.
- Audit integration: Infisical has audit logs. Shipping these to your logging system would add visibility.
- Workload identity: On cloud providers, workload identity (GKE, EKS IAM roles) eliminates the secret zero problem entirely.

Originally published at https://wsl-ui.octasoft.co.uk/blog/secrets-management-infisical-external-secrets
The Gap Nobody Is Talking About

The Model Context Protocol (MCP) is quickly becoming the de facto standard between AI agents and the tools they use. Adoption is growing rapidly - from coding assistants to enterprise automation platforms, MCP servers are replacing custom API integrations everywhere. In response to MCP's rapid growth, the security community is stepping up with solutions to address potential security threats. Solutions such as Cisco's open-source MCP scanner, Invariant Labs' MCP analyzer, and the OWASP MCP Cheat Sheet are helping organizations identify malicious MCP tool definitions, prompt injection attack vectors, and supply chain risks. These are significant efforts. But here's the problem: a secure MCP server can still take down your production environment. Security scanners answer the question "Is this tool malicious?" They do not answer "Will this tool behave reliably when called 10,000 times at 3 AM during an incident?" That second question is what separates a demo from a production deployment, and it's a question almost nobody systematically asks. I built a Readiness Analyzer to answer it, and contributed it to Cisco's MCP Scanner. Here's what I learned about the gap, and how to close it.

The Production Readiness Problem

Consider a typical MCP tool definition:

```json
{
  "name": "execute_query",
  "description": "Run a database query",
  "inputSchema": {
    "type": "object",
    "properties": {
      "query": { "type": "string" }
    }
  }
}
```

A security scanner would look for prompt injection patterns in the description, check whether the input schema allows dangerous inputs, and compare the tool's behavior to its intended behavior. All important. But from an operational standpoint, this tool definition is a minefield:

- No timeout specified. A single slow query can hang the entire agent workflow indefinitely.
- No retry configuration.
If the database connection drops, does the agent retry at all, forever, or with backoff?
- No error response schema. What does the agent see when this tool fails: an HTTP 500, a Python traceback, or nothing?
- No input validation hints. The schema accepts any string, including a SELECT * on a 500GB table.
- No rate limit guidance. An autonomous agent could hammer this endpoint in a tight loop.

None of these is a security vulnerability. All of them will cause production incidents.

From Lesson to Analyzer: 20 Heuristic Rules

After repeatedly seeing these patterns while shipping tools into production, I designed a static analysis engine with 20 heuristic rules organized into eight categories. The goal was a "production readiness score," a single number (0-100) that tells you whether an MCP tool is ready for real workloads. Static readiness analysis is not unique to MCP. Teams use readiness checklists to assess the deployment readiness of Kubernetes environments, APIs, microservice health checks, and more. What makes MCP different is that its tool definitions include enough metadata to enable static readiness analysis; the rules simply had not been documented until now.

The Rule Categories

Timeout Guards (HEUR-001, HEUR-002)

The most common production failure mode for MCP tools. When an agent calls a tool that triggers a network request, database query, or other external API call, and there is no timeout, a single slow response can cascade through the agent's entire workflow. The analyzer checks whether tool definitions include timeouts and whether they are reasonable for the type of operation.

Retrying (HEUR-003, HEUR-004)

Retries without a limit produce infinite loops; retries without exponential backoff produce "thundering herds." The analyzer flags tools that do not provide retry configuration or that retry indefinitely without backoff and jitter.
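To make the Retrying rules concrete, here is a minimal sketch of the pattern the analyzer wants tools to document: a bounded retry loop with capped exponential backoff and full jitter. The function names are illustrative, not part of MCP or the scanner.

```python
import random
import time

def call_with_retries(fn, max_attempts=4, base_delay=0.5, max_delay=8.0):
    """Call fn(), retrying on exception with capped exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # retry budget exhausted; surface the failure
            # Full jitter: sleep a random amount up to the exponential cap,
            # so many agents retrying at once don't stampede the backend.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            time.sleep(random.uniform(0, delay))
```

Both the attempt limit and the backoff cap are exactly the metadata the HEUR-003/004 rules look for in a tool definition.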
Error Handling (HEUR-005, HEUR-006, HEUR-007)

When an MCP tool fails, the agent needs structured error data to decide what to do next (retry, fall back to an alternative, escalate to a human). The analyzer checks whether tools provide error response schemas, document error classifications, and describe their failure modes.

Quality of Description (HEUR-009, HEUR-010, HEUR-016 – 020)

This is a readiness issue, not just a documentation issue. LLMs use tool descriptions to select and invoke tools. If a description is ambiguous, the tool will be misused (the wrong parameter, at the wrong time, and so on). The analyzer therefore evaluates description quality: length, specificity, and whether it documents preconditions, side effects, and scope limitations.

Input Validation (HEUR-011, HEUR-012)

Beyond the schema type, production tools need input validation constraints such as string length limits, enumerated values for categorical inputs, and range bounds for numeric inputs; otherwise, an autonomous agent will eventually supply inputs that are technically valid but operationally catastrophic.

Operational Configuration (HEUR-008, HEUR-013, HEUR-014)

Rate limits, concurrency bounds, and resource quotas are the control mechanisms that prevent a well-intentioned agent from overloading a backend service. The analyzer flags tools that support write operations or resource-intensive queries but lack operational guardrails.

Resource Management (HEUR-015)

Tools that establish connections, file handles, or sessions need corresponding cleanup semantics. The analyzer checks whether tools that acquire resources describe their lifecycle, which is particularly important for long-running agent workflows that invoke hundreds of tools in a single session.
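In spirit, the heuristic rules above reduce to simple static inspections of tool metadata. Here is a toy sketch of two of them, a timeout guard and an input-validation check. The rule IDs follow the article, but the `annotations.timeout_ms` key and the severity strings are assumptions for illustration, not the scanner's actual schema:

```python
def check_tool(tool: dict) -> list:
    """Toy static checks: flag missing timeout metadata and unconstrained strings."""
    findings = []
    # Timeout guard: the tool should declare how long a call may take.
    if "timeout_ms" not in tool.get("annotations", {}):
        findings.append(("HEUR-001", "HIGH", "no timeout specified"))
    # Input validation: string parameters need length limits or enums.
    props = tool.get("inputSchema", {}).get("properties", {})
    for name, schema in props.items():
        if schema.get("type") == "string" and not ("maxLength" in schema or "enum" in schema):
            findings.append(("HEUR-011", "MEDIUM", f"parameter '{name}' accepts unbounded strings"))
    return findings
```

Run against the execute_query definition from earlier, this flags both rules; adding a timeout annotation and a maxLength constraint clears them.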
Safety Checks

Safety checks are cross-cutting rules that identify patterns such as missing idempotence declarations on write operations, no pagination on list endpoints, and state modifications with no description of reversibility.

The Readiness Score

Each finding carries a severity weight (HIGH, MEDIUM, LOW, INFO). The analyzer aggregates these into a readiness score from 0-100, with a production-ready threshold of 70. This isn't a pass/fail; it's a signal to engineering teams about where to invest effort before deployment. A score of 92 indicates a tool built with care that will likely meet your organization's operational requirements. A score of 55 indicates a tool that works fine in a demo but may struggle under the demands of a real production environment.

Architecture: Designed for Extension

The Readiness Analyzer follows a provider abstraction pattern with three tiers:

Tier 1: The Heuristic Engine (Zero Dependencies)

This is a self-contained engine that performs static analysis using regular expressions, string matching, and schema inspection. It makes no API calls, uses no external services, and requires no special configuration. This was a deliberate design decision: the baseline scanner should run in CI/CD pipelines, air-gapped environments, and even on a developer's laptop, with no configuration beyond installing the package.

Tier 2: OPA Policy Provider (Optional)

If your organization already has policy-based infrastructure in place, the analyzer can evaluate each tool definition against Rego policies. This lets teams encode their own operational standards - e.g., all tools in the payments namespace must have a specified timeout under 5 seconds - and have those standards enforced automatically.
Tier 3: LLM Semantic Analysis (Optional)

For a deeper assessment, the analyzer can use an LLM to evaluate properties that cannot be checked statically: whether the documented error-handling guidance is actually helpful, whether the described failure modes are comprehensive, and whether the tool's scope is well-defined. This tier is optional primarily because it requires both an API key and network access.

The key design principle is progressive capability: the tool is useful with zero configuration and becomes more powerful as you add integrations.

Integrating With Existing Security Scanning

The Readiness Analyzer complements the existing MCP Scanner engines rather than replacing them. A typical scan now looks like:

```shell
mcp-scanner --analyzers yara,readiness --server-url http://localhost:8000/mcp
```

The output includes both security findings and readiness findings:

```
=== MCP Scanner Detailed Results ===

Tool: execute_query
Status: completed
Safe: No
Findings:
  • [HIGH] HEUR-001: Tool 'execute_query' does not specify a timeout.
    Category: MISSING_TIMEOUT_GUARD
  • [MEDIUM] HEUR-003: Tool 'execute_query' does not specify a retry limit.
    Category: UNSAFE_RETRY_LOOP
  • [MEDIUM] HEUR-006: Tool 'execute_query' does not define an error response schema.
    Category: MISSING_ERROR_SCHEMA
Readiness Score: 55
Production Ready: No

Tool: get_user
Status: completed
Safe: Yes
Findings:
  • [INFO] HEUR-012: Tool 'get_user' input schema lacks validation hints.
    Category: NON_DETERMINISTIC_RESPONSE
Readiness Score: 92
Production Ready: Yes
```

This gives teams a complete picture: is this tool safe (security) and ready (operations)?

Lessons from the Contribution Process

Building the analyzer was one challenge. Getting it accepted into an open-source project with several maintainers, CI checks, and code scanning was another.
A few things I learned that might help others contribute to security tooling projects:

Complement, don't compete. The MCP Scanner already had three capable security analysis engines. A proposal for "the best security scanner" would likely have met skepticism from the maintainers. Instead, I identified a gap - operational readiness - that the existing engines did not address. The contribution expanded the project's value proposition rather than questioning its existing architecture.

Start with zero dependencies. The heuristic engine requires no API keys, external services, or optional packages. This made integration dramatically simpler and reduced the review surface. The OPA and LLM tiers came as optional extensions, not requirements.

Bring data, not opinions. When the maintainers asked for evidence that the rules worked, I provided an analysis of false positives and true positives across numerous test cases. When a reviewer ran the analyzer against a corpus of 2,300+ skills and found that some rules were too noisy, the response was to adjust thresholds based on empirical data - not to argue about them in theory.

What's Next

The 20 heuristic rules are a starting point. As MCP adoption matures and more tools move into production, the readiness taxonomy will need to grow. Areas I'm actively researching:

Multi-tool interaction patterns. Individual tool readiness is necessary but not sufficient. When an agent chains three tools (query a database, transform the results, write to an API), the potential failure points multiply. Analyzing these interactions requires a graph-based view that none of today's scanners provide.

Runtime behavioral validation. Static analysis finds configuration gaps; it cannot find a tool that produces valid-looking data during testing but degrades quietly under load.
Connecting readiness scanning to runtime telemetry, for example through OpenTelemetry traces of actual tool invocations, creates a feedback loop that grounds readiness scores in real production behavior.

Organizational policy integration. Every organization has different operational standards; timeout requirements at a financial company differ from those at a media company. Deeper OPA integration and reusable policy templates would let teams capture their standards as shareable rule packs.

Where to Find the Rules

The Readiness Analyzer is available now as part of Cisco's open-source MCP Scanner:

```shell
pip install cisco-ai-mcp-scanner
mcp-scanner --analyzers readiness --server-url http://localhost:8000/mcp
```

Repository: github.com/cisco-ai-defense/mcp-scanner

The tool scans MCP servers for both security threats and production readiness issues. It works as a CLI, a REST API, or as an integrated component in CI/CD pipelines. No API keys are required for the readiness analyzer - it runs purely on static analysis. If you are deploying MCP servers into production, scan them not just for security but also for readiness.
With businesses searching for ways to reduce data bottlenecks around LLMs, synthetic data is emerging as a leading solution. For teams that struggle to access, purchase, and use high-quality datasets due to scarcity, legal constraints, or cost, synthetic data provides a way out. You can also generate "long-tail" data that is difficult to find and use at scale. Large language model (LLM) training teams are struggling to source sufficient quality data. Even when data exists, it often carries contractual restrictions or other limitations on its use; and even without restrictions, cleaning, validating, and standardizing it to produce consistent training results is extremely costly. As a result, synthetic data has become a critical element in the training strategies of many LLM teams, shifting from a "nice extra" to core infrastructure. The scale of demand is striking: the global synthetic data generation market is projected to reach USD 1,788.1 million by 2030, at a 35.3% CAGR from 2024 to 2030. Gartner notes that most organizations lack AI-ready data. Synthetic data pipelines can fill this gap by generating large volumes of LLM training data with built-in controls, reviews, and traceability.

Top Strategies for Scaling Synthetic Data in LLM Training

You cannot simply decide to create synthetic data and expect meaningful results. Instead, begin with the end in mind: define objectives that correspond to the downstream tasks you want to achieve.

Strategy 1: Define Task-Specific Synthetic Data Objectives

- Retrieval-based training requires query and evidence alignment.
- Reasoning-based training requires calibrated levels of complexity so the model learns whether to gather additional information or answer directly.
- Domain-specific training requires the language, constraints, and tone of the specific domain.

Finally, be sure to differentiate between pre-training data augmentation and fine-tuning data generation. Although there is some overlap, they serve different purposes: pre-training can tolerate a wider range of variability, while fine-tuning requires strict schemas, rubrics, and output constraints.

Strategy 2: Control Data Distribution With Domain-Aware Prompt Engineering

One of the biggest issues with synthetic corpora is the tendency to create far too many happy-path examples. These cases present no challenge to the model; models then perform very well in evaluation but struggle with the messiness of real-world prompts. Controlling data distribution balances common intents, realistic variations, and difficult tail cases, which addresses the happy-path problem. Domain-aware prompt engineering provides a way to control data distributions deliberately rather than letting them develop accidentally, and taxonomies and controlled vocabularies help minimize terminology drift. To further anchor synthetic text to domain reality, especially in high-compliance environments, teams can use structured generation patterns.

Strategy 3: Use Human-in-the-Loop Validation at Scale

Automated pipelines are prone to drift. Automated generators tend to repeat patterns, automated checks miss nuance, and plausible-looking samples can train the model on the wrong behavior. This is why human-in-the-loop validation is required to prevent drift and ensure consistency throughout the pipeline. It is most effective when applied through strategic sampling.
In particular, experts can validate the riskiest areas of the pipeline and new templates. Spot checks catch drift early, and feedback loops correct recurring errors. To track quality, use practical signals for semantic accuracy, schema fidelity, and task compliance. This is how you maintain the quality and consistency of synthetic datasets as volume increases, rather than hoping for the best.

Strategy 4: Maximize Linguistic and Semantic Diversity

If your synthetic data sounds just like all other synthetic data, it may actually reduce the model's ability to generalize. A model trained on data generated in a single style learns the style of the generator, not the variability of real users. Create linguistic and semantic diversity through intentional methods such as:

- Sampling variation, so the model sees many ways to express the same thing.
- Multiple generator models, to avoid a single dominant pattern.
- Broader coverage across sentence structures, reasoning depths, and intent framings, without violating the constraints you established for the task.

Diversity expands the range of the model; it doesn't introduce unnecessary noise.

Strategy 5: Generate Synthetic Edge Cases and Failure Scenarios

Edge cases and failure scenarios are rarely captured in real-world corpora, yet they are precisely where brittle behavior resides. Synthetic data can be designed to simulate them on demand, providing a way to test the model's ability to handle these situations.
In particular, you can generate the following types of edge cases and failure scenarios:

- Conflicting constraints that test the model's ability to reason about the hierarchy of instructions
- Adversarial prompts that probe the policy boundaries of the model
- Low-resource scenarios, where the number of available examples is limited

Synthetic data generation is particularly useful for strengthening robustness in the long tail, where failures can cost trust, increase support costs, and even lose revenue.

Strategy 6: Combine Synthetic and Real Data Using Weighted Blending

Blend synthetic data with real-world data using a weighted approach: real data anchors natural language patterns, synthetic data fills coverage gaps, and explicit weights control the synthetic-to-real ratio at each stage. Weighted blending lets you control how much repetition occurs during pretraining, which helps prevent overfitting; fine-tuning, however, requires additional filtering and schema checks. While both preference learning and Reinforcement Learning from Human Feedback (RLHF) use synthetic data pairs, preference learning remains dependent on human judgments. A curriculum-style blended dataset typically outperforms a randomly sampled one because it controls task difficulty and prevents sudden, unforeseen distribution shifts.

Strategy 7: Implement Strong Data Governance and Traceability

As volume grows, it is crucial to be able to explain what was modified, when, and why. Data governance establishes the means to do so. Version datasets and slices. Document the generation parameters and templates. Record the generator model name, revision history, and applied filters.
Robust traceability makes audits survivable, regressions debuggable, and your pipeline repeatable. Without data governance, synthetic data scaling is just a series of single-use runs with no accountability.

Strategy 8: Automate Quality Scoring and Filtering

Automated quality metrics are what allow human review to scale. They should include rule-based checks for schema and formatting, and model-based checks for instruction compliance and semantic noise. Duplicate and near-duplicate detection should be included to eliminate redundancy. Filtering must also be continuous: hallucinations and small discrepancies introduced during generation can steadily degrade both training and evaluation, so ongoing filtering maintains a high signal-to-noise ratio and protects evaluation reliability.

Strategy 9: Localize and Multilingualize Synthetic Data Pipelines

Many pipelines skew toward English, which limits product expansion and degrades performance in multilingual environments. Synthetic data is useful for expanding low-resource languages, but localization matters far more than translation: domain terminology must be correct, tone must align with local standards, and context must appear natural. Expert curation is critical here. Fluent-but-wrong text damages credibility and skews downstream evaluation in hard-to-spot ways; expert curation minimizes that risk.
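The duplicate and near-duplicate detection called for in Strategy 8 can be approximated with word-shingle overlap. A minimal sketch, where the 3-word shingle size and 0.8 Jaccard threshold are illustrative choices, not recommendations:

```python
import re

def shingles(text, n=3):
    """Split text into overlapping n-word shingles (normalized to lowercase)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def jaccard(a, b):
    """Jaccard similarity of two texts' shingle sets."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def dedupe(samples, threshold=0.8):
    """Greedy near-duplicate filter: keep a sample only if it is not too
    similar to anything already kept."""
    kept = []
    for s in samples:
        if all(jaccard(s, k) < threshold for k in kept):
            kept.append(s)
    return kept
```

Production pipelines would typically use MinHash/LSH to avoid the quadratic comparison cost, but the filtering principle is the same.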
Strategy 10: Design Synthetic Pipelines for Iterative Model Feedback

The most durable synthetic data pipelines are closed-loop systems. Derive error patterns from evaluation and production signals, generate targeted synthetic corrections, retrain, and retest. This reduces your dependence on procuring new real-world data and lets you keep pace as model behavior changes across updates. A closed loop can also detect the onset of drift before it propagates into millions of synthetic samples.

Why Enterprise-Grade Synthetic Data Requires Specialized Partners

For synthetic dataset tooling, most teams use a mix: prompt orchestration, dataset versioning, and evaluator frameworks, plus generation methods such as prompt-based synthesis, distillation, and self-instruct patterns.

Synthetic Data as a Long-Term LLM Scaling Strategy

Synthetic data is moving quickly from a complementary technology to a core element of how teams develop, manage, and continually improve their models. Teams get the most out of synthetic data when they build robustly engineered pipelines based on well-defined objectives, controlled distributions, human-in-the-loop validation, and continuous automated filtering and traceability. Treat synthetic data as an infrastructure component and you get safer scale-up, faster iteration, and dependable training data under realistic stressors.
Delta Live Tables (DLT) has been a game-changer for building ETL pipelines on Databricks, providing a declarative framework that automates orchestration, infrastructure management, monitoring, and data quality in data pipelines. By simply defining how data should flow and be transformed, DLT allowed data engineers to focus on business logic rather than scheduling and dependency management. Databricks expanded and rebranded this capability under the broader Lakeflow initiative. The product formerly known as DLT is now Lakeflow Spark Declarative Pipelines (SDP), essentially the next evolution of DLT with additional features and alignment to open-source Spark. The existing DLT pipelines are largely compatible with Lakeflow; your code will still run on the new platform without immediate changes. However, to fully leverage Lakeflow’s capabilities and future-proof your pipeline, it’s recommended that you update your code to the new API. This playbook provides a practical, engineer-focused guide to migrating from DLT to Lakeflow declarative pipelines with side-by-side code examples, tips, and coverage of edge cases. We’ll focus on the migration logic, the code changes, and pipeline definition adjustments, rather than tooling or deployment, assuming you’re using Databricks with Spark/Delta Lake as before. Recap: What Is Delta Live Tables (DLT)? Databricks Delta Live Tables (DLT) is a declarative ETL framework for building scalable, reliable data pipelines on Delta Lake. With DLT, engineers define a series of datasets and their transformation logic in Python or SQL, and the system handles the execution order, dependency resolution, and incremental processing automatically. Key features of DLT included support for streaming tables, materialized views, and views. DLT pipelines also integrate data quality enforcement via expectations, allowing you to declare constraints that the pipeline can enforce or use to quarantine bad data. 
In short, DLT lets you focus on what transformations to do, not how to schedule or scale them, bringing a declarative approach to data engineering similar to how Kubernetes brings declarative management to infrastructure.

Meet Lakeflow Declarative Pipelines (The Evolution of DLT)

Lakeflow Spark Declarative Pipelines (SDP) is essentially DLT 2.0, a unified, declarative framework for batch and streaming ETL that Databricks introduced under the Lakeflow umbrella. Lakeflow pipelines build on the lessons of DLT and align with the open-source Spark API for declarative pipelines (introduced in Apache Spark 4.1). In practice, Lakeflow’s pipeline API is almost identical to DLT’s, but with new naming and some expanded capabilities. Notably, as of the 2025 Data and AI Summit, Databricks open-sourced the core declarative pipeline engine to Apache Spark. This means your pipeline code can, in principle, run on standard Spark, reducing vendor lock-in while still offering Databricks-specific enhancements. Lakeflow also introduced the concept of flows in pipelines. For a data engineer, the migration from DLT to Lakeflow is mostly a find-and-replace refactor plus adopting a few new best practices. The following sections walk through the key changes with code examples. We’ll start with the simplest updates, then tackle specific features like expectations and change data capture.

Migration Steps and Code Changes

1. Update Imports and Module References

In DLT, you typically started your notebook with import dlt. In Lakeflow, the pipeline functions are accessed via the Spark pipelines module. Replace the DLT import with:

```python
from pyspark import pipelines as dp
```

This import gives us a dp object analogous to the old dlt. Consequently, all references to dlt in your code should be replaced with dp. This includes decorator annotations and any function calls.
For example:

- @dlt.table becomes @dp.table
- @dlt.view becomes @dp.temporary_view
- dlt.read("some_table") becomes dp.read("some_table")

According to Databricks, the dlt module has been superseded by pyspark.pipelines, and while legacy code will still run, it’s recommended to use the new module going forward. The name changes are designed to be straightforward; in fact, you can often do a simple search-and-replace on your notebooks to swap dlt for dp and add the new import.

2. Table Decorators: Distinguishing Streaming Tables vs. Materialized Views

One notable API improvement in Lakeflow is making streaming tables vs. batch tables explicit. Under DLT, you would declare all persistent tables with @dlt.table, regardless of whether they were fed by streaming sources or batch data. DLT internally figured out which tables should be streaming versus which were materialized (refreshed on each pipeline run) based on how you read the data. In Lakeflow, the syntax is more expressive:

- Use @dp.table to define a streaming table.
- Use @dp.materialized_view to define a materialized view.
- Ephemeral in-memory views for intermediate transformations are declared with @dp.temporary_view. Temporary views are not persisted to the metastore and exist only within the pipeline’s processing graph, just as dlt.view worked previously.

Migration tip: Review each @dlt.table in your code to decide whether it should be a streaming table or a materialized view in the Lakeflow world. As a rule of thumb, if the function reads from a streaming source (for example, using spark.readStream or Auto Loader on a directory of files), use @dp.table. If it does a batch read (e.g., spark.read.format("delta").load(...) or joins already materialized tables), use @dp.materialized_view. Many DLT pipelines had a mix of both types; now you’ll make that distinction explicit.
For example, in DLT you might have:

Python

# DLT code (before migration)
import dlt

@dlt.table(name="raw_data")
def raw_data():
    return spark.readStream.format("cloudFiles")... .load("<path>")

@dlt.view
def aggregated():
    df = dlt.read("raw_data")
    return df.groupBy("category").count()

@dlt.table(name="report")
@dlt.expect("PositiveCount", "count > 0")
def report():
    return dlt.read("aggregated")

The equivalent Lakeflow pipeline code would be:

Python

# Lakeflow code (after migration)
from pyspark import pipelines as dp

@dp.table(name="raw_data")  # streaming source
def raw_data():
    return spark.readStream.format("cloudFiles")... .load("<path>")

@dp.temporary_view
def aggregated():
    df = dp.read("raw_data")
    return df.groupBy("category").count()

@dp.materialized_view(name="report")
@dp.expect("PositiveCount", "count > 0")
def report():
    return dp.read("aggregated")

In this example, we changed the first table to @dp.table because it reads a streaming source. The intermediate view became @dp.temporary_view. The final table report is derived from batch aggregation, so we mark it as a @dp.materialized_view. We also carried over the data quality expectation. By making these choices explicit, the pipeline is clearer in intent. Under the hood, Lakeflow still builds a dependency graph and manages incremental updates, but now you have more control over how tables are updated. It’s worth noting that these changes align with Apache Spark’s emerging declarative pipeline syntax, meaning your @dp.table and @dp.materialized_view definitions mirror what vanilla Spark 4.1+ would accept.

3. Data Quality Expectations

DLT’s ability to enforce data quality constraints via expectations is preserved in Lakeflow. In DLT, you might have used the @dlt.expect, @dlt.expect_or_drop, or @dlt.expect_or_fail decorators to define rules on a table.
The simplest form, @dlt.expect("Name", "condition"), would record any rows violating the condition (and, depending on pipeline settings, either drop them, fail the pipeline, or just log the metric). In Lakeflow, the syntax is @dp.expect("Name", "condition"). The usage is conceptually the same: you attach one or more expectations above a @dp.table or @dp.materialized_view function.

4. Change Data Capture (CDC) and Flows

A common DLT pattern is handling change data capture (CDC). In DLT, you might have used the Python API dlt.apply_changes() inside a table function, or the SQL syntax APPLY CHANGES INTO in a pipeline notebook, to achieve this. For instance, a DLT example for CDC looked like:

Python

@dlt.table
def target_table():
    return dlt.apply_changes(
        target = "LIVE.target_table",
        source = "STREAM(LIVE.cdc_feed_table)",
        keys = ["id"],
        sequence_by = col("timestamp"),
        apply_as_deletes = col("operation") == "DELETE"
    )

In Lakeflow, the CDC capability has been refactored slightly. The new API provides functions to create CDC flows. The direct replacement for dlt.apply_changes() is dp.create_auto_cdc_flow(), which has the same function signature and behavior but is used in a different way. Rather than returning a DataFrame inside a table function, you call dp.create_auto_cdc_flow at the pipeline definition level to link a source and target. You also need to declare the target table as a streaming table beforehand. In practice, migrating a DLT CDC pipeline might involve:

Define an empty target table using dp.create_streaming_table("target_table", schema=..., name="...") (or as a function with @dp.table if you prefer) to serve as the sink for changes.
Use dp.create_auto_cdc_flow(target="target_table", source="source_table", keys=[...], sequence_by=..., apply_as_deletes=...) to create the CDC flow that updates the target.

This new pattern separates the declaration of the target table from the CDC application logic.
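The two steps above can be sketched as a minimal pipeline fragment. Note that this is an illustrative sketch, not a definitive implementation: the table and column names (target_table, cdc_feed_table, id, timestamp, operation) are placeholders, and the code only executes inside a Databricks Lakeflow pipeline runtime, not as a standalone script.

```python
# Hypothetical Lakeflow CDC sketch -- table and column names are illustrative.
# This fragment runs only inside a Lakeflow (Spark Declarative Pipelines)
# runtime, not as a standalone Python script.
from pyspark import pipelines as dp
from pyspark.sql.functions import col

# Step 1: declare the empty streaming target table up front,
# separate from any transformation logic.
dp.create_streaming_table("target_table")

# Step 2: attach the CDC flow that keeps the target in sync with the feed,
# using the same key/sequencing parameters apply_changes took.
dp.create_auto_cdc_flow(
    target="target_table",
    source="cdc_feed_table",
    keys=["id"],
    sequence_by=col("timestamp"),
    apply_as_deletes=(col("operation") == "DELETE"),
)
```

Because the target declaration and the flow are now separate statements, you could later attach additional flows (for example, a backfill) to the same target table without touching the original CDC logic.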
The rationale is to make CDC a first-class concept rather than a special kind of table function. Under the hood, create_auto_cdc_flow will handle upserts and deletions on the target table similarly to how apply_changes did. The function parameters like keys, sequence_by, and apply_as_deletes are unchanged. Databricks has simply renamed the API for clarity and future compatibility. During migration, replace any usage of dlt.apply_changes with the new dp.create_auto_cdc_flow call. If you were using the SQL APPLY CHANGES INTO syntax in a DLT SQL notebook, the equivalent Lakeflow SQL uses a CREATE FLOW ... statement with similar clauses.

Edge Cases and Considerations

Backward Compatibility

A big advantage is that you can migrate gradually. Databricks has made Lakeflow backward-compatible; your old import dlt code will still run in the Lakeflow engine. This means you can perform A/B testing or a phased migration: for instance, run the pipeline as-is, then run the migrated version and compare results. Just note that new features will likely appear only in the pipelines module going forward, so to take advantage of improvements, you’ll eventually want to switch fully. Also, be aware that some system names remain prefixed with dlt for now; these legacy naming artifacts do not affect functionality, but they can be confusing. Don’t be alarmed if you see dlt in logs; it’s just cosmetic legacy naming.

Mixing SQL and Python

If your DLT pipeline was defined with SQL notebooks or a mix of SQL and Python, the migration concept is similar. SQL syntax in Lakeflow supports CREATE STREAMING TABLE, CREATE MATERIALIZED VIEW, and CREATE FLOW statements, aligning with the new terminology. Any APPLY CHANGES INTO clause in SQL should continue to work, but Databricks docs suggest using CREATE FLOW for consistency going forward. In Python, as we covered, use dp functions and decorators. You can even mix Lakeflow SQL and Python in one pipeline, as was possible with DLT; just ensure the naming and types line up.
Open Source Spark Pipeline Compatibility

One motivation for Lakeflow’s changes was convergence with Apache Spark’s declarative pipelines. Spark 4.x introduced a pyspark.pipelines module with similar concepts, so you could potentially run a pipeline outside Databricks. If you plan to take a Lakeflow pipeline and run it on an open-source Spark cluster, note that not all features carry over. Core dataset definitions and basic reads will work, but features like expectations and the CDC utilities are Databricks-only. During migration, you might flag these sections if portability is a concern. For pure Databricks usage, this isn’t an issue.

Performance and Observability

Migrating to Lakeflow should not degrade performance; in fact, you may see improvements or gain new options. Lakeflow still provides an event log, data lineage visualization, and metrics, as DLT did. After migration, validate that your pipeline updates and triggers still behave as expected. Lakeflow pipelines can run in triggered mode or continuous mode, depending on configuration, just like DLT. One edge case to check: if you relied on a particular trigger, confirm that it remains configured, though that’s a pipeline setting outside the code.

Conclusion

Migrating from Delta Live Tables to Lakeflow Declarative Pipelines is a straightforward process that mainly involves renaming APIs and clarifying table types. The declarative, engineer-friendly approach to building pipelines remains the same, but Lakeflow’s refinements bring better alignment with open standards and future Databricks features. By updating your imports to pyspark.pipelines, switching to @dp.table or @dp.materialized_view where appropriate, and refactoring CDC and expectations to the new syntax, you’ll ensure your pipelines are future-proof. This migration not only preserves the benefits DLT gave you but also sets the stage for leveraging new Lakeflow enhancements.
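Since most of the migration is mechanical renaming, the bulk of it can be sketched as a small helper script. This is an illustrative sketch, not an official Databricks tool: the regex rules below cover only the renames discussed above, and the streaming-table vs. materialized-view decision from step 2 still has to be made by hand for each @dp.table.

```python
import re

# Illustrative rename rules for the DLT -> Lakeflow migration.
# More specific patterns come first so "@dlt.view" is rewritten
# before the generic "dlt." prefix swap.
RULES = [
    (r"^import dlt\s*$", "from pyspark import pipelines as dp"),
    (r"@dlt\.view\b", "@dp.temporary_view"),
    (r"\bdlt\.apply_changes\b", "dp.create_auto_cdc_flow"),
    (r"\bdlt\.", "dp."),
]

def migrate(source: str) -> str:
    """Apply the rename rules to one notebook's source text."""
    for pattern, replacement in RULES:
        source = re.sub(pattern, replacement, source, flags=re.MULTILINE)
    return source

before = '''import dlt

@dlt.table(name="raw_data")
def raw_data():
    return spark.readStream.load("/some/path")

@dlt.view
def aggregated():
    return dlt.read("raw_data")
'''

print(migrate(before))
```

Running this on the snippet above swaps the import, renames @dlt.table to @dp.table and @dlt.view to @dp.temporary_view, and rewrites dlt.read to dp.read; afterwards you still review each @dp.table and demote batch-fed ones to @dp.materialized_view.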
Happy migrating and enjoy the continued simplicity and power of declarative pipelines in Databricks Lakehouse!
The room for Nvidia’s Open Model Super Panel at San Jose Civic was packed well before Jensen Huang really got going. It felt less like a normal conference panel and more like one of those sessions where the industry starts saying the next platform shift out loud. Nvidia listed the session as “Open Models: Where We Are and Where We’re Headed,” moderated by Huang and held on March 18 during GTC 2026.

Credit: Corey Noles/The Neuron

But despite the title, the most interesting argument onstage was not really about open models. It was about open agents.

The Real Story Was the Move From Models to Systems

Huang opened the session by trying to kill the most boring framing in AI: the idea that the market is cleanly split between proprietary labs and open challengers. His point was broader than that. AI is not a single model, a single product, or a single winner-take-all category. It is a stack, a system, and increasingly a combination of many different model types working together. “Proprietary versus open is not a thing. It’s proprietary and open,” Huang said. “A.I. is a system of models and systems of a lot of other things.”

That was the throughline of the discussion. Yes, the panel covered open models as infrastructure. Yes, it touched on why open systems widen access and why smaller players may create some of the most important specialized breakthroughs. But the stronger consensus was that the center of gravity is moving up the stack. Models matter. Open models matter a lot. But what increasingly matters more is the system wrapped around them: orchestration, memory, tools, identity, governance, and runtime. That is why the panel landed as such a strong case for open agents.

Aravind Srinivas Gave the Clearest Product Abstraction

The sharpest product framing came from Aravind Srinivas, who described Perplexity Computer in a way that captured where the market seems to be heading.
Instead of asking users to choose a model, route tasks manually, and stitch together their own workflows, the system should take the task and decide how to solve it. “A.I. is not the model, it’s the system. It’s the computer,” Srinivas said. “Perplexity Computer is the idea that you should build the organizational system of everything that A.I. can do.”

That is a bigger idea than product branding. It suggests the next useful abstraction layer in AI may not be a chatbot or even a single frontier model. It may be a computer for delegation: a system that knows which models to call, which tools to use, when open models are good enough, when closed models are worth using, and how to pull those pieces into one coherent workflow. Srinivas also made it clear that the future is unlikely to be a simple ideological split between open and closed systems. Different models will serve different functions.

Harrison Chase Made the Case for the Harness Layer

If Srinivas provided the cleanest product abstraction, Harrison Chase provided the clearest builder abstraction. His phrase, “harness engineering,” may have been one of the most important phrases of the panel. Chase used it to describe everything around the model: which sub-agents are used, which skills are attached, how memory works, what tools are selected, and how the environment is configured for a specific domain or task. “Harness engineering is everything around the model,” Chase said.

He made the point that when people are impressed by a polished AI product, they are often responding not just to the raw model quality but to the system surrounding it. That matters because it runs counter to one of the laziest ideas in AI discourse: that anything built around a model is “just a wrapper.” Once models get good enough, the wrapper stops being a wrapper and starts becoming the operating system. The harness is where general intelligence becomes useful intelligence.
That also helps explain why routing and orchestration are starting to look like durable product layers. A useful reference point here is The Neuron’s write-up of OpenRouter. While not identical to what the panel discussed, it maps closely to the same underlying shift: value is moving into the layer that decides how intelligence gets assembled and deployed.

OpenClaw Mattered Less as a Product Than as a Signal

OpenClaw hovered over the whole conversation even when the panel was not explicitly about it. Huang framed it as a turning point, not just because it exists, but because it makes a new category legible. In the panel transcript, he described it as a big deal. In a separate GTC press Q&A, he went even further, calling it an inflection point for what comes after reasoning systems and arguing that it now needs enterprise-grade layers, including privacy, governance, security, and optimized runtimes. “OpenClaw is a big deal,” Huang said, a point he reiterated throughout GTC.

The point is not that OpenClaw is the only product that matters. The point is that it signals the conversation has shifted from answering to acting. That is the more important category change. The panelists kept circling the same idea, even when they used slightly different language: AI systems are moving beyond responses and into execution across files, tools, workflows, and goals.

Michael Truell Connected Coding Agents to the Rest of the Economy

Cursor CEO and Founder Michael Truell offered one of the cleanest bridges from coding agents to the rest of the economy. His argument was that coding was simply the first place this system style began working in a real, visible way. The same pattern is now spreading into other domains. “What started working in coding last year … now, we’re going to all of these other domains,” Truell said. That is a useful lens for understanding why this panel mattered. Coding agents are the preview but not the overall endpoint.
The combination of models, files, CLIs, tool use, and rapid iteration made coding the first environment where agentic systems felt obviously real. If those same primitives spread outward into research, healthcare, legal workflows, operations, and back office work, then the real market is not “AI coding.” It is the much larger category of computer work being reinterpreted as agent work.
Hello, our dearest DZone Community! Last year, we asked you for your thoughts on emerging and evolving software development trends, your day-to-day as devs, and workflows that work best — all to shape our 2026 Community Research Report. The goal is simple: to better understand our community and provide the right content and resources developers need to support their career journeys.

After crunching some numbers and piecing the puzzle together, at last, it is in (and we have to warn you, it's quite a handful)! This report summarizes the survey responses we collected from December 9, 2025, to January 27 of this year, and includes an overview of the DZone community, the stacks developers are currently using, the rising trend in AI adoption, year-over-year highlights, and so much more.

Here are a few takeaways worth mentioning:

AI use climbs this year, with 67.3% of readers now adopting it in their workflows.
While most use multiple languages in their developer stacks, Python takes the top spot.
Readers visit DZone primarily for practical learning and problem-solving.

These are just a small glimpse of what's waiting in our report, made possible by you. You can read the rest of it below.

2026 Community Research Report: Read the Free Report

We really appreciate you lending your time to help us improve your experience and nourish DZone into a better go-to resource every day. Here's to new learnings and even newer ideas!

— Your DZone Content and Community team
10 Strategies for Scaling Synthetic Data in LLM Training
March 20, 2026 by
Modern Best Practices for Web Security Using AI and Automation
March 20, 2026 by
Kubernetes Scheduler Plugins: Optimizing AI/ML Workloads
March 20, 2026 by
AI as a SQL Performance Tuning Assistant: A Structured Evaluation
March 20, 2026 by
Stop Trusting Your RAG Pipeline: 5 Guardrails I Learned the Hard Way
March 20, 2026 by
Toward Intelligent Data Quality in Modern Data Pipelines
March 20, 2026 by