gft-engineering - Medium

Evaluating IBM’s Granite TTMs and the Impact of Time-Series Foundational Models on Industry:

Ferran Aran — Fri, 06 Jun 2025 13:09:09 GMT

A New Approach to Time-series Data Prediction

In this article, we explore the use of IBM’s Granite Tiny Time Mixer Models (TTMs) to simulate and optimize the behavior of an industrial Heating Ventilation and Air Conditioning (HVAC) system. By comparing zero-shot predictions with models fine-tuned on domain-specific data, we aim to understand the practical value of TSFMs in driving energy optimization for building operations.

Nearly all industrial processes generate data which is inherently time-series in nature. Sensors, controllers, meters, and industrial systems continuously generate timestamped data that reflects the dynamic state of equipment and processes. This data holds the key to optimizing performance, predicting failures, and improving energy efficiency.

However, exploiting this potential remains challenging. Industrial data is often fragmented across different systems, protocols, and business units. In many cases, historical data is only retained for limited periods — sometimes just weeks or months — due to infrastructure limitations such as saturated networks or storage costs. As a result, traditional machine learning approaches that rely on large, clean, and well-labeled datasets are difficult to apply in many industrial contexts.

Given these challenges, reducing the dependency on such datasets has become a crucial research priority. Recently, foundational time-series models (TSFMs) have emerged as an exciting solution, offering powerful forecasting capabilities with significantly fewer data requirements. These models are pre-trained on diverse datasets, providing impressive zero-shot and few-shot forecasting capabilities.

TSFMs: A Breakthrough in Time-Series Forecasting

Time-series foundational models are revolutionizing how industries approach data forecasting. Their transformative potential has been demonstrated notably through successes like Moirai by SalesForce, which secured first place in the VN1 forecasting competition through fine-tuning (Zhang Xiaobin talks about how they did it here).

TSFMs broadly fall into three categories: diffusion-based, transformer-based, and non-transformer-based (MLP, CNN, RNN). Each category has distinct attributes and implications. Transformer-based models, widely popular due to their superior performance, typically demand substantial GPU resources for fine-tuning and inference. Contrarily, non-transformer models offer notable advantages in efficiency, enabling fast inference even on standard CPUs, thus significantly lowering hardware barriers.

When evaluating TSFMs for specific applications, resource requirements are just one factor among several critical considerations. Equally essential is understanding how a model addresses the complexities inherent in time-series data. To effectively discuss these considerations, let’s clarify some key concepts first:

Multivariate forecasting: Predicting multiple target variables simultaneously, as opposed to univariate forecasting, which focuses on a single target variable.
Covariate: A variable which is not a target and is used to enhance forecasting accuracy, it may remain constant (e.g., location) or vary over time (e.g., temperature), and may be known in the future (like holidays) or only observed up to the present (like past sales).
Exogenous variables: A kind of covariate that originates from a different generating process (e.g., outdoor temperature when predicting energy consumption of a home).
Interdependence: When a model treats variables (targets and covariates) as interdependent, it assumes they influence each other and that there are relationships between them that can be used to improve predictions (e.g. learning that turning on a fan during the next hour will influence indoor temperature during that time).

At the very least, TSFMs are capable of impressive zero-shot univariate forecasting, predicting future values based only on historical data. However, many real-world scenarios involve targets heavily influenced by multiple covariates and interdependencies between variables. Ignoring these would severely limit TSFMs’ performance compared to traditional supervised methods.

To fully exploit their potential in the industry world, TSFMs must support multivariate forecasting with robust handling of interdependencies among targets and covariates. Ideally, such models would also effectively take into account known future values of covariates to further refine predictions.

As illustrated in the diagram below, most popular TSMFs fall into the transformer-based category; Chronos (AWS), Lag-Llama (ServiceNow), Moirai (Salesforce) and many more. Although not transformer-based, Tiny Time Mixers (TTMs) from IBM also shows exceptional performance. Among these, Moirai and TTMs are the only ones to support interdependent multivariate forecasting capabilities at their maximum.

Figure 1: TSFMs classified by type or architecture and training procedure. [Source]

Moirai excels notably in zero-shot scenarios. It comes in the form of a few pre-trained models with parameters ranging from 14 to 311 million, trained on approximately 27 billion data points. In contrast, IBM’s TTMs offer a more compact alternative, with models starting at just 1 million parameters and trained on roughly 1 billion data points. Check the plot below for a comparison of model sizes and inference time.

Figure 2: Comparison of model sizes versus CPU inference time for popular TSFMs. [Source]

While Moirai supports interdependence out of the box, TTMs incorporate it only during fine-tuning, yet they offer the advantage of doing so efficiently on a single GPU (Moirai needs several GPUs), providing an ideal balance of performance, accessibility, and computational efficiency. Qualities that ultimately guided our selection for deeper evaluation since we have a high interest in finetuning the model.

Case Study: Building a Digital Twin of an Office HVAC System

For our case study, we chose a dataset that contains extensive time-series data from the HVAC (Heating, Ventilation, and Air Conditioning) system of a real office building. This dataset spans three years and includes over 300 variables collected at varying sampling intervals, ranging from 1 to 15 minutes. Additional details can be found here.

Our primary objective was to create a digital twin of the office building’s HVAC system. A digital twin, in this context, is a time-series forecasting model designed to accurately mimic the building’s responses to various control actions, such as opening valves or adjusting fan speeds. Building such a model opens the possibility of developing solutions to optimize HVAC operations, aiming for goals like reducing energy consumption, enhancing predictive maintenance, and improving overall operational efficiency — particularly with minimal reliance on extensive, high-quality data.

The office building itself is structured into two wings, each further divided into multiple thermal zones. Cooling for these zones is provided by four rooftop units (RTUs), each serving distinct groups of thermal zones. The primary focus of our digital twin was to predict two critical aspects: the energy consumption of the HVAC system and the internal temperatures within individual thermal zones for the upcoming hours.

Figure 3: Diagram illustrating the rooftop units and their corresponding thermal zones with RTU 3 and Zone 18 highlighted in red. [By author]

Given that the dataset provides total energy consumption data per wing, significant simplifications were necessary for our initial experiments. Specifically, we focused only on a single thermal zone connected to one rooftop unit (highlighted in red on the diagrams below), approximating its energy consumption as half the total wing’s energy consumption. Additionally, we simplified the use case by removing several variables that exhibited low correlation with our key target variables.

To the best of our knowledge, this particular dataset has not been previously explored in the context of TSFMs. Investigating the application of TTMs in this real-world scenario provides invaluable insights and benchmarks for similar practical deployments.

To further understand how the dataset and the models will couple during the experiments, let’s look at how TTMs internally classify input variables:

Targets: Variables the model aims to forecast.
Conditionals: Variables which we know in the past, but do not know in the future.
Observables: Variables which we have knowledge about in the past and future but cannot be controlled. For example, which days are holidays.
Controls: As in observables, we have knowledge of them during past and future but they can be controlled by us. For example, the speed of a fan, which we can be sure to give a certain value during the next hour if we want.

This detailed classification of targets and covariates helps guide the TTM, signaling how each variable might influence the predicted outcomes through internal prompts. But how do these concepts fit with the ones described at the beginning? Let’s go through a very simple example to illustrate.

Imagine you’re managing a smart home system and your goal is to forecast the indoor temperature over the next six hours. To make this prediction more accurate, you don’t rely only on past temperature readings, instead, you bring in additional information called covariates.

One of these covariates is past power usage. This tells you how much electricity the HVAC system has consumed up to now. It’s useful for understanding how your system behaves, but since you don’t know future power usage in advance, this falls under what we call a conditional. It’s only available up to the present moment.

Next, you incorporate electricity price. The price fluctuates depending on the market, but you can look it up or use a forecast to see what it will be in the coming hours. While you don’t control the electricity price, you do have access to it in both the past and the future. That makes it an observable, known across time, but not influenced by your decisions.

Finally, there’s fan speed. Unlike electricity price, this one is in your control. You can decide in advance to run the fan at a certain speed during the next few hours, and your system will follow that plan. Because it’s fully determined by you and known ahead of time, it’s categorized as a control.

By combining these different types of covariates — the historical behavior of your system, external signals like market prices, and your own planned actions — a model can make a much more informed and adaptive prediction about what the indoor temperature will be.

Table 1: Example illustrating how different example variables fit onto different categories. [By author]

In our scenario, the target variables — internal temperature and energy consumption — are significantly impacted by future control actions. For example, fully opening a heating water valve or raising the hot air setpoint significantly affects both temperature and energy consumption. Therefore, the main challenge was ensuring the model’s forecasts remained sensitive to future known values of control actions, effectively transforming it into an oracle that accurately predicts how the building reacts to various potential control strategies.

Achieving precise sensitivity to control variables with minimal data requirements would be transformative. It would offer a fascinating alternative to traditional, resource-intensive, physics-based building models, significantly simplifying the process of optimizing building operations for energy efficiency and predictive maintenance.

The Experiment: Pushing the Limits of TTMs

To fully understand the experiment conducted and the rationale behind parameter adjustments, it is crucial first to figure out certain aspects of the TTM architecture. As mentioned earlier, the capability for variables to interact and influence each other, called variable interdependence, is available only during the fine-tuning stage. This interdependence is achieved by activating channel mixing within the decoder for past values and via the forecast channel mixer for known future values.

Each variable in the TTM architecture is associated with a specific channel, categorized depending on its role (whether as a target, covariate, or another type). Channel mixing allows the model to identify and learn intricate relationships between these variables.

While channel mixing in the decoder can simply be toggled on or off, the forecast channel mixer offers more nuanced parameter settings. Two primary parameters caught our interest:

Context length: Defines the number of historical data points included when mixing channels.
Number of layers: Specifies how deep (how many layers) the neural network composing the mixer will be.

Theoretically, increasing these parameters allows the model to better capture complex interactions between variables, though at the expense of increased computational resources.

Understanding the TTM Architecture

Below is an illustration of the TTM architecture highlighting key elements of the workflow. The left side represents the pretraining phase, and the right side depicts the fine-tuning phase. During pretraining, there are roughly 1 million learnable parameters. However, only about 300,000 parameters are trainable during fine-tuning because the model’s backbone remains frozen.

Figure 4: TTM architecture. [Source]

A notable feature of the TTM architecture is how it handles input time series data. This data is divided into non-overlapping segments, called patches, enabling the model to efficiently learn both long-term dependencies and maintain local semantic information. The backbone, highlighted below, uses MLP layers combined with gated attention mechanisms instead of traditional transformer architectures, resulting in significantly enhanced computational efficiency while maintaining channel independence.

Figure 5: TTM architecture with backbone highlighted. [Source]

Next key element here is the TTM Head, composed of the TTM Decoder and the Forecast Head, the former is pretrained with channel independence but allows enabling mixing during finetuning while the latter is responsible for producing the final forecasted time-series from the decoder output.

Figure 6: TTM architecture with the TTM Head highlighted. [Source]

Finally, the optional exogenous mixer, also called forecast channel mixer, takes the forecasted values and combines them with the known future values. This module together with the TTM Head contain the 0.3M learnable parameters.

Figure 7: TTM architecture with the Forecast Channel Mixer highlighted. [Source]

TTM Variants

TTMs prioritize small, specialized models tailored for different use cases rather than one universal model. As a result, TTMs are available in numerous pretrained versions — 36 to be precise — each defined by specific combinations of context and forecast window sizes. These combinations range significantly, from models with shorter context windows (52 data points) and forecast windows (16 data points) to models with very large context windows (1536 data points) and extensive forecast horizons (720 data points). See the diagram below depicting available models for each release.

Figure 8: Available combinations of context and forecast window sizes for each release of TTMs. [By author]

Some combinations are repeated across models with variations in training losses or an additional frequency indicator (a special embedding vector specifying data frequency). Our experiments tested 24 publicly available TTM variants, excluding research-only versions.

An important capability of TTMs is their flexibility in forecast window sizes. Suppose our sampling interval is every 15 minutes, and we need predictions for the next 4 hours (16 data points). If we prefer a longer context window than the variant offering this forecast length, we can utilize the “prediction filter.” This mechanism effectively “crops” the predictions from a model with a larger forecast window, ensuring that the output exactly matches our desired forecast length. This cropping applies both during training (affecting loss calculations) and inference.

In time-series forecasting, a datapoint is one measurement taken at a fixed sample frequency (for example, recording indoor temperature every hour). The context window (or input window) is how many past datapoints the model “looks at” to learn patterns, while the forecast window is how many future datapoints it aims to predict. For instance, if your sample frequency is one datapoint per hour, a context window of 24 means the model uses the last 24 hourly measurements; a forecast window of 6 means it then predicts the next 6 hours.

Dataset and Experimental Setup

For the experiments, we used a simplified subset of the original dataset with data resampled to every 15 minutes, totaling around 11,500 data points over four months. The dataset comprised three groups of variables:

Targets:

Energy consumption (hvac_S) in kW/h.
Internal temperature of a thermal zone (zone_018_temp) in ºC.

Controls:

Various adjustable HVAC settings, including temperature setpoints and fan speeds. Some are tied specifically to the thermal zone of interest and some are tied to the roof top unit serving a set of thermal zones.

Observables:

External and internal environmental factors such as outdoor temperature, solar radiation, relative humidity, and airflow metrics.

Two distinct experimental scenarios were tested:

Experiment A: Included all variables (targets, controls, and observables).
Experiment B: Only used target variables (hvac_S and zone_018_temp), omitting controls and observables.

Both experiments shared critical configuration parameters:

Four months of data split into approximately 70% training (~3 months) and 30% testing.
Consistent ranges for testing mixer context lengths and mixer number of layers.
Forecast window (filter length) fixed at 16 data points (4 hours).
Evaluated identical TTM variants.

This setup allowed us to systematically test how well different TTM configurations capture variable relationships and forecast performance, providing valuable insights into optimizing TTMs for real-world HVAC system forecasting scenarios.

Result Analysis

Let’s first go through our initial expectations. We anticipated seeing clear improvements when fine-tuning the models. Specifically, we expected that increasing the model’s context length, forecast window, channel mixing context, and the number of channel mixing layers would correlate with enhanced forecasting performance. Additionally, between the two experiments, we hypothesized that Experiment A, which includes 16 extra covariates, would significantly outperform Experiment B.

The two plots presented below provide critical insights into the model performance:

Left Plot: Displays the overall Root Mean Squared Error (RMSE) for the fine-tuned models in each experiment (lower values indicate better performance).
Right Plot: Illustrates the RMSE improvement achieved by fine-tuning compared to the baseline pretrained models (higher values signify greater improvement).

Figure 9: RMSE by experiment (left) and RMSE improvement by experiment (right). [By author]

As expected, Experiment A demonstrates a notably lower RMSE, confirming that the inclusion of additional covariates substantially enhanced model accuracy. Contrarily, Experiment B showed negligible improvements over pretrained models, highlighting the limited benefit of fine-tuning without additional covariates.

The analysis of model parameters reveals clear trends related to context and forecast lengths when observing the plot below. Performance improves as context and forecast window sizes increase, particularly evident in Experiment A.

Figure 10: RMSE for each context length (left) and RMSE for each forecast length (right). [By author]

An intriguing observation arises around a context length of approximately 400 and forecast length of about 250, where errors sharply decrease. This phenomenon aligns with model variations defined in different TTM releases, specifically release 2, optimized more for higher-frequency data. If look at Figure 8 in the previous section, we’ll see that those that score the best are the ones that fall into release 2. This could be due to release 2.1 being more focused towards weekly and daily data and no so much towards higher sample rates. Check the extract below from the official documentation.

TTM r2 comprises TTM variants pre-trained on larger pretraining datasets (~700M samples). The TTM r2.1 release increases the pretraining dataset size to approximately (~1B samples). The prior model releases, TTM r1, were trained on ~250M samples and can be accessed here. In general, TTM r2 models perform better than TTM r1 models as they are trained on a larger pretraining dataset. In standard benchmarks, TTM r2 outperform TTM r1 by over 15%. However, the choice of r1 vs. r2 depends on your target data distribution, and hence users should try both variants and pick the best model for your data. The TTM r2 releases support point forecasting use-cases specifically ranging from minutely to hourly resolutions (Ex. 10 min, 15 min, 1 hour.). With the TTM r2.1 release, we add support for daily and weekly resolutions.

Contrary to our initial assumptions, parameters related to the channel mixing module did not significantly affect the forecasting outcomes. The correlation matrix below illustrates that, on the one hand, context and forecast window sizes (highlighted in red) strongly correlate negatively with error metrics, as just discussed. On the other hand, channel mixing parameters (highlighted in blue) exhibit negligible correlations, indicating minimal influence on model performance.

Figure 11: Correlation between model context, forecast window size, channel mixing layers, channel mixing context against error metrics. [By author]

Since there are two target variables to forecast in the experiments, it’ll be worth looking at the performance of each one of them.

Remember we are predicting the energy consumption ( hvac_S ) and the internal room temperature ( zone_018_temp ). Below is a comparison of the performance on these metrics for each experiment. Lighter shades are pretrained models and darker ones are finetuned models. Blue is for energy consumption and red is for internal temperature. On the left we have experiment A, and on the right, experiment B.

Figure 12: RMSE by experiment for each of the two target variables and grouped by finetuned and pretrained models. [By author]

Note that for experiment B, where covariates are not available, both target variables improve by a similar margin. Whereas experiment A shows a greater improvement on forecasting the energy consumption compared to that on forecasting the internal temperature. Anyway, notice that there is always a notable gap between the two target variables, which is expected given the more complex nature of the values the energy consumption can have compared to those of the internal temperature, which follows a much more predictable pattern.

Considering the dramatic simplification made on the original dataset, having been able to benefit from using covariates this way is a great win for this use case, where the sensitivity of TTMs’ forecast against past and future known values of covariates are of great importance.

This positions TTMs as a great candidate for the discussed landscape of opportunities that the industry offers regarding time-series forecasting.

Uptake in the Industry

As highlighted at the beginning, industrial processes are ready for optimization through advanced time-series forecasting. Foundational Time-Series Models (TSFMs) substantially lower the barriers to entry with the help of powerful pretrained models, mirroring the transformative impact observed in fields such as natural language processing, computer vision, and audio analysis. However, putting them into real operations comes with real-world challenges.

Given the diversity inherent in time-series data across different domains, it’s unrealistic to expect a single foundational model to universally excel. Nonetheless, even specialized foundational models optimized for specific applications could offer substantial improvements. We particularly see great promise in processes involving controllable variables, where accurately forecasting system dynamics under varying conditions enables cost-effective digital twin creation and facilitates reinforcement learning training in risk-free environments.

There’s also the issue of governance. In critical industries where safety, cost, or environmental impact are at stake, it’s important to make sure AI models are transparent, explainable, and accountable. TSFMs raise questions: How do we check that fine-tuning worked? How do we test for rare situations? And how do we make sure decisions based on forecasts can be trusted?

Beyond the tech, people also need to be ready for change. Operators and engineers must understand and trust these models. That means clear visual tools, simple interfaces, and training that helps people stay in control — not feel replaced by AI.

Still, the benefits are clear. In complex systems with many moving parts, TSFMs can speed up digital twin development, help optimize processes, and even support advanced AI like reinforcement learning — all with less data requirements than traditional models.

As AI tools and industrial systems continue to grow together, TSFMs could become a key part of future smart, self-improving operations.

Final Thoughts

Time-Series Foundational Models (TSFMs) offer the potential to reshape how industrial systems are modeled, optimized, and controlled. While early results — including our exploration of IBM Granite TTM — demonstrate clear promise, much remains to be understood about how best to work with these models within the constraints of real industrial environments.

At GFT Technologies, we see collaboration and knowledge sharing as essential to advancing this field. The broader community will need to test TSFMs across diverse use cases, identify best practices for integration, and address challenges such as governance, lifecycle management, and human-AI interaction.

We encourage practitioners, researchers, and industry partners to engage with us — to jointly explore, validate, and scale these emerging capabilities. The opportunity is significant but realizing it will require a collective effort grounded in both technical rigor and practical experience.

This project was developed under the direction of Sebastian Santibáñez, who managed and guided the initiative from start to finish.

Evaluating IBM’s Granite TTMs and the Impact of Time-Series Foundational Models on Industry: was originally published in gft-engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

Hyper-personalised wealth management and investment recommendations: the role of AI and ML in…

Michael Doherty-Harding — Wed, 30 Apr 2025 07:04:54 GMT

Hyper-personalised wealth management and investment recommendations: the role of AI and ML in transforming the industry.

By Michael Doherty-Harding

Introduction

The financial services industry, as is most likely the case with other industries, is undergoing significant transformation driven by the increasing use of artificial intelligence (AI) and machine earning (ML). Amongst the most impactful of these is the rise of hyper-personalised wealth management and investment recommendations tailored specifically to individual clients. Traditional methods of wealth management, usually based on generalised strategies, are being replaced by a more refined approach where the unique financial needs, goals, risk appetites and preferences of each client, drive the process. At the centre of this shift are AI and ML, which enable wealth managers and financial advisors to leverage large amounts of data to provide tailored investment advice.

In this article, I explore how hyper-personalised wealth management is revolutionising the financial advisory landscape, with a special focus on the role AI and ML play in achieving this.

The shift towards hyper-personalised wealth management

Historically, wealth management services have been structured in tiers. High-net-worth individuals often received bespoke advice, whilst others were offered a more standardised, ‘one-size-fits-all’ investment strategy. This model is increasingly becoming outdated as individuals across the wealth spectrum demand services that cater to their specific financial situations.

Hyper-personalisation in wealth management involves creating unique financial plans for each client, considering their life goals, risk tolerance, income, spending habits and other personal factors. This is made possible through the convergence of AI, ML and big data analytics, which enable financial advisors to process vast amounts of data about each client, from their financial history to their behavioural tendencies. The result is a bespoke service where each investment recommendation is carefully constructed to align with an individual’s profile and changing circumstances.

How AI and ML are enabling hyper-personalisation

Data-driven insights

One of the most powerful ways AI and ML are driving hyper-personalisation is through data collection and analysis. Banks and wealth managers usually have access to a large amount of a client’s data, including transaction histories, social media behaviour, credit scores and even non-financial data, such as a person’s lifestyle and purchasing habits. Machine learning algorithms can process and interpret this data, identifying patterns that may not be obvious to a human advisor.

ML algorithms can also use this data to predict future financial behaviours, which helps advisors provide forward-looking advice. For instance, predictive analytics might suggest that a client is on the verge of a significant life event, such as purchasing a home or retiring, prompting the advisor to recommend portfolio adjustments or new financial products.

Dynamic portfolio management

Traditional investment portfolios are often reviewed annually or semi-annually, but with the use of AI and ML, portfolio management can be continuous and dynamic. AI-driven systems can monitor markets and individual investments in real time, quickly adjusting portfolios to respond to changes in market conditions or client goals.

This approach is known as ‘robo-advising’. AI and ML tools are also being used by human advisors to augment their decision-making. These systems can recommend rebalancing portfolios, reducing exposure to certain risk factors, or diversifying assets when market shifts are detected. As a result, clients can benefit from a more agile approach to managing their investments.

Risk profiling

Every individual has a different risk tolerance based on their financial goals, life stage and psychological makeup. One of the key challenges in wealth management is accurately assessing a client’s risk profile and aligning their investment strategy accordingly.

AI and ML excel in this area. By analysing a combination of financial and behavioural data, AI can provide a more nuanced understanding of each client’s risk appetite. For instance, machine learning algorithms can analyse how a client has reacted to market volatility in the past, as well as their spending and saving habits, to determine their true risk tolerance. This allows for more precise recommendations, ensuring that a client is neither exposed to too much nor too little risk.

Hyper-personalised investment recommendations

AI and ML can deliver investment recommendations tailored specifically to each client’s needs, often using sophisticated algorithms that evaluate a wide range of variables, such as market conditions, personal risk profiles and financial goals. By analysing data in real time, AI-driven systems can present clients with bespoke investment opportunities that align with their preferences and long-term financial strategies.

For instance, if a client expresses a preference for environmentally sustainable investments, AI can automatically adjust recommendations to include ESG (environmental, social and governance) assets or portfolios that align with the client’s ethical values. This kind of personalisation was previously difficult to scale but is now becoming standard practice thanks to AI-powered platforms.

Enhancing client-adviser relationships

One of the most compelling aspects of AI and ML in wealth management is their ability to enhance, but not replace, human advisors. By taking on much of the heavy lifting related to data analysis, monitoring trends and automating portfolio updates, AI systems free up advisors to focus on building better relationships with clients.

AI-driven tools can provide advisors with detailed insights into each client’s financial habits and preferences, allowing them to offer more personalised advice during face-to-face meetings.

AI-Driven Risk Management

Risk management is a critical component of wealth management. ML algorithms can assess potential market risks, economic changes and geopolitical factors, enabling wealth managers to proactively adjust investment strategies. For example, if an AI model predicts a downturn in a particular sector, it can recommend reducing exposure to that sector before significant losses occur.

Also, by continuously analysing patterns in a client’s financial behaviour, AI systems can flag up any unusual activity, alerting the client and their advisor to potential issues that need attention.

Ethical and Privacy Considerations

While the benefits of AI and ML in hyper-personalised wealth management are substantial, there are also important ethical and privacy considerations to take into account. Companies must ensure that they are using customer data responsibly and in compliance with data protection regulations, such as GDPR.

Also, there is the question of explainability of ML models. Clients may question the decision-making process behind AI-driven recommendations, particularly if the reasoning is not clearly explained. Financial institutions should focus on ensuring that their AI systems are transparent and that clients understand how and why investment decisions are being made.

The Future of Hyper-Personalised Wealth Management

The future of wealth management is moving towards increased personalisation powered by AI and ML. As these technologies continue to advance, we can expect even more sophisticated algorithms that can deliver very targeted investment recommendations in real-time. This will not only enhance the customer experience but also make wealth management services more accessible and inclusive, and crucially, allowing individuals at all income levels to benefit from tailored financial advice.

Conclusion

Hyper-personalised wealth management is still a growing field and GFT recognises the need to be ready to make strides in this space. We are further developing expertise in AI, ML and data analytics, helping to create comprehensive solutions for financial institutions, focusing on client experience and engagement, building a strong data privacy and compliance framework and offering consulting and advisory services in AI strategy and implementation.

Hyper-personalised wealth management and investment recommendations: the role of AI and ML in… was originally published in gft-engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

LLM and Prolog: the logical alternative to chain-of-thought reasoning

Lu Mao — Wed, 23 Apr 2025 09:21:10 GMT

Large language models (LLMs) like GPT have revolutionised natural language processing, excelling in diverse tasks from translation to summarisation. However, when it comes to mathematical reasoning — particularly tasks involving arithmetic, commonsense logic, and symbolic reasoning — LLMs still fall short. This is primarily because these models rely heavily on their training data, often memorising patterns rather than genuinely understanding or deriving new solutions. Mathematical problems, which can manifest in countless contextual variations and numerical combinations, expose this limitation. As a result, LLMs often struggle with mathematical queries that deviate even slightly from their training examples.

The power of combining LLMs with Prolog

My first encounter with Prolog was during my studies in artificial intelligence at Edinburgh University 20 years ago. As one of the foundational languages in AI, Prolog’s unique approach to problem-solving through facts and rules intrigued me, but I quickly realised how challenging it could be to master. Writing accurate logical predicates requires not only a deep understanding of the problem domain but also a solid grasp of Prolog’s syntax and logic programming principles. This initial exposure highlighted both the power and complexity of Prolog, setting the stage for my appreciation of how modern tools like LLMs can simplify and enhance its use.

To overcome this challenge of solving complex mathematical and reasoning queries, recent research proposes coupling LLMs with logic programming languages like Prolog . Instead of expecting LLMs to directly compute answers, their role is shifted towards semantic parsing — translating natural language questions into logical predicates. These predicates are then processed by Prolog, a language designed for symbolic reasoning and logical inference. By offloading the computational tasks to Prolog, the LLM focuses on accurately understanding and converting questions into a formal logical structure, significantly enhancing the system’s ability to handle complex mathematical problems.

This blog explores the potential of this hybrid approach through a series of experiments using LLMs to generate Prolog code for financial reasoning tasks. Specifically, we examine the performance of several state-of-the-art LLMs, including DeepSeek, Grok, GPT-4, Gemini and Claude, on generating Prolog predicates for questions sourced from the FinQA dataset. We also compare this performance with advanced reasoning models to evaluate the effectiveness of using foundation models combined with logic programming. Our findings reveal that this approach not only rivals but, in some cases, surpasses traditional reasoning models, offering a promising direction for enhancing AI’s reasoning capabilities. Furthermore, fine-tuning an LLM with extra supervised or reinforcement learning is costly and the results are variable and hard to predict.

Evidence from recent studies

In similar research, studies like “Exploring an LLM to Generate Prolog Predicates from Mathematics Questions” [2] and “Arithmetic Reasoning with LLM: Prolog Generation & Permutation” [1] highlight the effectiveness of this hybrid approach. The latter even introduces the GSM8K-Prolog dataset, where Prolog code is generated [7] from arithmetic word problems using LLMs. As shown in Figure 1, the experiments demonstrate that Prolog code generation outperforms traditional Chain-of-Thought (CoT) reasoning, allowing for precise, deterministic solutions. Additionally, the non-sequential nature of Prolog predicates enables innovative data augmentation techniques, such as predicate permutation, to further enhance model robustness. Together, these insights point towards a promising future where the synergy between LLMs and logic programming unlocks new potentials in AI-driven mathematical reasoning.

Figure 1: Accuracy results on the GSM8K and GSM-HARD datasets. [2]

Efficiency and accuracy in logical reasoning

The appeal of this integration lies in its efficiency and accuracy. While LLMs are adept at parsing and understanding the nuances of natural language, they are not inherently designed for logical computations or symbolic manipulations. Prolog, on the other hand, excels at handling logical rules and constraints, offering deterministic solutions to problems that LLMs might struggle with. By combining these strengths, we can create systems that not only understand complex queries but also solve them with high precision.

Applications in financial reasoning

One particularly exciting application of this LLM-Prolog integration is in the realm of financial reasoning. Financial markets are governed by a complex interplay of numerical data, regulatory constraints and logical conditions — factors that can overwhelm traditional AI models. By using LLMs to interpret and parse financial reports, market analyses and regulatory texts into logical predicates, Prolog can then be employed to execute precise financial computations, risk assessments and compliance checks. This dual approach can significantly improve the accuracy of financial forecasting, fraud detection and automated auditing processes.

Additionally, this process provides an element of explainability into the process. Whereas an LLM that explains itself, such as a chain of thought reasoning model like Deepseek-R1 might provide a false narrative about what it’s discovered and why it’s reached a particular conclusion, a model that explicitly creates logical predicates and then uses them in reasoning is working “in the clear”. We don’t know why it’s picked out or failed to pick out a predicate from the data, but it is straightforwardly the case that it has, or hasn’t. This provides more transparency and potentially enables debugging and whitebox style testing.

Enhancing financial decision-making tools

We can also use the combination of LLMs and Prolog to enhance decision-making tools within the finance industry. For example, investment strategies often require evaluating multiple scenarios under varying constraints and conditions. LLMs can digest and summarise vast amounts of financial data, whilst Prolog can simulate various investment scenarios by applying logical rules and constraints. This can empower and support financial analysts with more robust tools.

Neural symbolic reasoning v.s. chain-of-thought reasoning

Chain-of-thought (CoT) reasoning [4] is a method where LLMs are guided to break down complex problems into sequential, intermediate reasoning steps before arriving at the final solution. This approach mirrors human problem-solving by encouraging models to explicitly generate each step of the thought process, rather than attempting to leap directly to the answer. CoT has been shown to improve performance on arithmetic, commonsense reasoning, and multi-step logical tasks by providing a more structured reasoning path.

Strengths:

Improved interpretability: By making intermediate steps explicit, CoT provides a more transparent reasoning process, which is easier to follow and validate.
Enhanced accuracy: For tasks requiring sequential operations, such as arithmetic and procedural tasks, CoT helps reduce logical leaps that often lead to incorrect answers.

Limitations:

Dependency on training data: CoT heavily relies on patterns learned during training, which limits its generalisation to novel scenarios or logical structures not well represented in the training corpus.
Sequential constraints: CoT enforces a linear sequence of reasoning steps, which can be restrictive for problems requiring non-sequential logic or multiple interdependent conditions.

Neural symbolic reasoning combines neural networks’ pattern recognition capabilities with the symbolic reasoning power of logic programming languages like Prolog. This approach involves using LLMs to extract facts and logical rules from natural language, which are then processed by a symbolic reasoning engine (e.g., Prolog) to perform deterministic logical deductions.

Strengths:

Logical consistency and determinism: Symbolic reasoning engines provide deterministic outputs for the same input, ensuring consistency and accuracy.
Generalisation and flexibility: By decoupling the natural language understanding (handled by the neural network) from the logical reasoning (handled by Prolog), this approach generalises better to novel scenarios.
Explainability and transparency: The declarative nature of Prolog makes the reasoning process more interpretable, as rules and facts are explicitly defined.

Limitations:

Complexity in rule extraction: Extracting accurate logical predicates from natural language is challenging and requires sophisticated NLP techniques.
Steeper learning curve: Developing Prolog rules and understanding symbolic logic can be more complex compared to sequential reasoning in CoT.

Why neural symbolic reasoning is a game-changer

In complex reasoning tasks, especially those requiring symbolic manipulation, logical constraints, or hierarchical data structures, neural symbolic reasoning demonstrates clear advantages. It enables a more modular architecture, where the neural network focuses on understanding and parsing natural language, whilst Prolog handles logical deductions. This separation not only improves generalisation but also enhances robustness and explainability, addressing some of the major challenges faced by chain-of-thought reasoning.

Furthermore, by leveraging the powerful pattern recognition capabilities of LLMs to extract logical predicates, neural symbolic reasoning significantly reduces the burden of manually crafting rules, making it more scalable and adaptable to a wider range of problem domains.

In recent studies [1, 2] Figure 1, it has been demonstrated that fine-tuning LLMs to generate Prolog predicates directly outperforms traditional CoT reasoning for mathematical and symbolic tasks. By focusing on semantic parsing and allowing Prolog to handle logical computations, this hybrid approach mitigates the limitations of CoT and leverages the strengths of both methodologies. The result is a more robust, reliable and interpretable AI reasoning framework that excels in tasks where traditional LLM approaches fall short.

Prolog v.s. Python

Both Prolog and Python are widely used in the field of artificial intelligence, yet they cater to different aspects of problem-solving and reasoning. Python, with its extensive libraries and frameworks such as TensorFlow, PyTorch, and scikit-learn, has become the go-to language for machine learning, data science and numerical computations. Its imperative programming style and vast ecosystem make it ideal for building and deploying AI models efficiently.

Prolog, on the other hand, offers unique features that are not present in Python. As a logical programming language, Prolog excels in symbolic reasoning, knowledge representation, and automated theorem proving. Its declarative nature allows developers to define relationships and rules, letting the language’s inference engine handle the logical deductions and backtracking. This is particularly beneficial for tasks involving complex logical structures, such as natural language understanding, expert systems and constraint satisfaction problems.

One standout feature of Prolog is its built-in support for backtracking and unification. Prolog automatically explores multiple possibilities to find solutions, a process that requires explicit programming in Python. Additionally, Prolog’s ability to handle recursive queries and constraints with ease makes it a powerful tool for solving problems that require exhaustive search and logical consistency. These capabilities highlight Prolog’s strength in areas where Python’s traditional procedural approach may fall short, making it a valuable complement to Python in AI applications.

Quick introduction on Prolog

To better understand how Prolog operates, let’s consider a straightforward example that demonstrates its ability to handle recursive logic and relationships.

In this example:

Facts: We define simple parent-child relationships.

Alice is the parent of Bob.
Bob is the parent of Carol.
Carol is the parent of David.

Rule: The ancestor/2 rule determines if one person is an ancestor of another.

The base case states that if X is a parent of Y, then X is also an ancestor of Y.
The recursive case states that if X is a parent of Z, and Z is an ancestor of Y, then X is an ancestor of Y.

Query and Output from Prolog:

swipl -q -f parents.pl
1 ?- ancestor(alice, david).
true .

This query checks if Alice is an ancestor of David. Prolog uses its recursive rule to establish the lineage, confirming that Alice is indeed David’s ancestor. This simple example illustrates Prolog’s power in handling hierarchical data and logical reasoning.

Using DeepSeek R1 to generate Prolog facts and rules from text

One of the most promising applications of LLMs like DeepSeek R1 is their ability to convert natural language descriptions into Prolog facts and rules, significantly simplifying the process of building logic-based systems. Let’s explore this with an example problem taken (from GSM8K dataset):

Prompt:

TASK: can you extract SWI-Prolog-based facts and rules for me to pass to a SWI-Prolog program for querying： 
CONTEXT: A family of 12 monkeys collected 10 piles of bananas. 6 piles had 9 hands, with each hand having 14 bananas, whilst the remaining piles had 12 hands, with each hand having 9 bananas. How many bananas would each monkey get if they divide the bananas equally amongst themselves?

Deepseek-Generated Prolog Facts and Rules:

How to Query

How many bananas would each monkey get if they divide the bananas equally amongst themselves?

swipl -q -f banana-deepseek.pl
1 ?- bananas_per_monkey(X).
X = 99.

By using the extracted facts and defining rules from natural language text, we can streamline the development of Prolog-based solutions. This integration allows for seamless conversion of complex problem statements into logical structures, enabling more efficient and accurate problem-solving.

Using GPT 3o-mini-high to generate Prolog facts and rules for financial reasoning

The application of GPT models in financial reasoning takes another leap forward with GPT-3o-mini-high, which excels in parsing and analysing complex financial documents. By transforming dense financial reports into structured data, GPT can facilitate more accurate and insightful financial analysis.

The example below is a question (extracted from pdf) taken from the dataset of FinQA[3], FINQA offers a comprehensive set of 8,281 QA pairs (Figure 1) that reflect real-world financial scenarios.

Prompt:

TASK: can you extract SWI-Prologbased Ffacts and rules for me to pass to a SWI-Prolog program for query.
CONTEXT：notes to consolidated financial statements for the years ended february 3 , 2006 , january 28 , 2005 , and january 30 , 2004 , gross realized gains and losses on the sales of available-for-sale securities were not mate- rial . the cost of securities sold is based upon the specific identification method . merchandise inventories inventories are stated at the lower of cost or market with cost determined using the retail last-in, first-out ( 201clifo 201d ) method . the excess of current cost over lifo cost was approximately $ 5.8 million at february 3, 2006 and $ 6.3 million at january 28, 2005. current cost is deter- mined using the retail first-in , first-out method . lifo reserves decreased $ 0.5 million and $ 0.2 million in 2005 and 2004, respectively, and increased $ 0.7 million in 2003. costs directly associated with warehousing and distribu- tion are capitalized into inventory . in 2005, the company expanded the number of inven- tory departments it utilizes for its gross profit calculation from 10 to 23. the impact of this change in estimate on the company 2019s consolidated 2005 results of operations was an estimated reduction of gross profit and a corre- sponding decrease to inventory, at cost, of $ 5.2 million. store pre-opening costs pre-opening costs related to new store openings and the construction periods are expensed as incurred . property and equipment property and equipment are recorded at cost. the company provides for depreciation and amortization on a straight-line basis over the following estimated useful lives:
land improvements 20
buildings 39-40
furniture fixtures and equipment 3-10
improvements of leased properties are amortized over the shorter of the life of the applicable lease term or the estimated useful life of the asset . impairment of long-lived assets when indicators of impairment are present, the company evaluates the carrying value of long-lived assets, other than goodwill, in relation to the operating performance and future cash flows or the appraised values of the underlying assets. the company may adjust the net book value of the underlying assets based upon such cash flow analysis compared to the book value and may also consid- er appraised values . assets to be disposed of are adjusted to the fair value less than the cost to sell if less than the book value. the company recorded impairment charges of approximately $ 0.5 million and $ 0.6 million in 2004 and 2003, respectively, and $ 4.7 million prior to 2003 to reduce the carrying value of its homerville, georgia dc (which was sold in 2004) . The company also recorded impair- ment charges of approximately $ 0.6 million in 2005 and $ 0.2 million in each of 2004 and 2003 to reduce the carrying value of certain of its stores 2019 assets as deemed necessary due to negative sales trends and cash flows at these locations. these charges are included in sg&a expense . Other assets consist primarily of long-term investments, debt issuance costs which are amortized over the life of the related obligations, utility and security deposits, life insurance policies and goodwill. vendor rebates the company records vendor rebates , primarily consisting of new store allowances , volume purchase rebates and promotional allowances , when realized . The rebates are recorded as a reduction to inventory purchases , at cost , which has the effect of reducing cost of goods sold , as prescribed by emerging issues task force ( 201ceitf 201d ) issue no. 02-16, 201caccounting by a customer (including a reseller) for certain consideration received from a vendor 201d. rent expense expense is recognized over the term of the lease. the company records minimum rental expense on a straight-line basis over the base, non-cancelable lease term commencing on the date that the company takes physical possession of the property from the landlord, which normally includes a period prior to store opening to make necessary leasehold improvements and install store fixtures. when a lease contains a predetermined fixed escalation of the minimum rent, the company recognizes the related rent expense on a straight-line basis and records the difference between the recognized rental expense and the amounts payable under the lease as deferred rent. the company also receives tenant allowances , which are recorded in deferred incentive rent and are amortized as a reduction to rent expense over the term of the lease . any difference between the calculated expense and the amounts actually paid are reflected as a liability in accrued expenses and other in the consolidated balance sheets and totaled approximately $ 25.0 million.

Prolog code generated by the model:

How to Query:

We can provide the code to Prolog to answer questions:

What was the total impairment costs recorded from X1 to X2 years in millions

swipl -q -f financial.pl
1 ?- total_impairment(2003, 2005, Total).
Total = 2.1.

2 ?- total_impairment(2004, 2006, Total).
Total = 1.2999999999999998.

By leveraging GPT-3o Mini High to extract and structure financial information, Prolog can then be used to perform precise calculations and generate reliable financial insights. This combination provides a powerful tool for financial analysts, enhancing accuracy and efficiency in financial reasoning tasks.

Here is the F and R generated by DeepSeek R1 and yield the same query result:

Generating Python code for reasoning on FinQA.

We use the same prompt for generating Python code for all models.

You are an expert financial reasoning assistant. Your task is to generate a **Python program** that can answer a financial query using extracted information from a financial report.

### TASK:
- You will be given a **financial report** containing both text and tables.
- Extract relevant **figures and calculations** needed to answer the question.
- Based on the extracted information, generate a **Python script** that computes the answer.
- Ensure that the Python code is **self-contained**, meaning:
  - It **defines a function** to compute the answer.
  - It **calls the function** and **prints the final answer**.
- **Output Python code only**—no explanations or additional text.
### CONTEXT:
{context}
### QUESTION:
{question}
### OUTPUT FORMAT:
```python

An example code generated by GPT:

```python
def calculate_sum_european_papers_net_sales():
    """
    Calculates the sum of European papers net sales from 2006 to 2004.
    """
    sales_2006 = 1.5 * 1000  # $1.5 billion
    sales_2005 = 1.4 * 1000  # $1.4 billion
    sales_2004 = 1.5 * 1000  # $1.5 billion
    total_sales = sales_2006 + sales_2005 + sales_2004
    return total_sales

# Calculate and print the answer
total_european_papers_net_sales = calculate_sum_european_papers_net_sales()
print(f"The sum of European papers net sales from 2006 to 2004 is: ${total_european_papers_net_sales:.2f} million")
```

FinQA experiment results and discussion

Experiment overview: Evaluating Prolog-based vs. Python-based v.s. chain-of-thought financial reasoning

This experiment evaluates the effectiveness of LLMs in generating both Prolog and Python code for 100 randomly selected FinQA questions. The goal is to compare symbolic reasoning with Prolog, procedural/numerical reasoning with Python, and LLM-only chain-of-thought (CoT) reasoning, examining their strengths and weaknesses in financial reasoning tasks.

Experiment setup

For each question:

Prolog-based reasoning:

The LLM is prompted to generate Prolog code and a corresponding Prolog query.
The Prolog code is executed using SWI-Prolog.
If the first attempt (Pass@1) fails, the model gets a second attempt (Pass@2).

Python-based reasoning:

The LLM is prompted to generate Python code that performs financial calculations.
The generated Python script is executed within the evaluation framework.
The result is extracted and compared with ground truth financial reasoning benchmarks.

Chain-of-thought (CoT) reasoning:

The LLM is prompted to reason step-by-step in natural language.
The final answer is extracted from its textual explanation.
The same 100 questions were tested against three CoT-based reasoning models: GPT4–1o-mini-2024–09–12, DeepSeekR1, Grok-3

This comparison allows us to assess whether foundation models paired with Prolog or Python can match or outperform specialised reasoning models.

Figure 2: Scores on LLM’s answers to the FinQA questions

Notes: Due to quota limitations, we are unable to run the experiment using the latest Gemini-2* models.

Results overview: Prolog vs. Python vs. chain-of-thought

Prolog code generation performance

DeepseekV3 demonstrated the strongest performance, achieving a total score of 80%.
GPT-4o followed closely with 79%, leading in Pass@1 with 74%.
Claude-3–7 performed consistently, scoring 77%.
Grok-2–1212, a new addition, achieved 69%, with notable improvements in Pass@2.
Gemini-1.5-pro lagged behind at 58%, struggling with Prolog code generation.

Python-based reasoning performance

DeepseekV3 and gpt-4o performed well in Python-based reasoning, both achieving 78%.
Claude-3–7 followed with 74%.
Grok-2–1212 scored 65%, showing room for improvement.
Gemini-1.5-pro had the lowest performance at 61%.

Chain-of-thought (CoT) reasoning performance

DeepSeek R3 achieved the highest CoT performance (80%), comparable to Prolog-based reasoning.
Grok-3 followed at 78%.
GPT4-o1-mini scored 77%, slightly below the top CoT performers.
Foundation models (e.g., deepseekV3, gpt-4o, grok-2–1212) had CoT scores ranging from 63% to 76%, slightly lower than their code-based reasoning counterparts.

Analysis and insights

The experiment reinforces that foundation models combined with Prolog can achieve performance comparable to, or even better than, specialised reasoning models. This approach leverages:

LLMs’ natural language understanding for effectively parsing complex financial contexts and questions.
Prolog’s symbolic reasoning for performing deterministic logical computations and ensuring logical consistency.

Key observations:

deepseekV3 and gpt-4o continue to demonstrate robust performance, indicating their suitability as effective reasoning engines when paired with Prolog.
Grok-2–1212 shows significant promise as a new contender, with performance close to traditional reasoning models, especially with its improvement on Pass@2.
Grok-3 establishes itself as a competitive reasoning model, outperforming gpt4–1o-mini and coming close to DeepSeekR1.

Results overview: Prolog vs. Python vs. chain-of-thought

Prolog code generation performance

DeepseekV3 demonstrated the strongest performance, achieving a total score of 80%.
GPT-4o followed closely with 79%, leading in Pass@1 with 74%.
Claude-3–7 performed consistently, scoring 77%.
Grok-2–1212, a new addition, achieved 69%, with notable improvements in Pass@2.
Gemini-1.5-pro lagged behind at 58%, struggling with Prolog code generation.

Python-based reasoning performance

DeepseekV3 and gpt-4o performed well in Python-based reasoning, both achieving 78%.
Claude-3–7 followed with 74%.
Grok-2–1212 scored 65%, showing room for improvement.
Gemini-1.5-pro had the lowest performance at 61%.

Chain-of-thought (CoT) reasoning performance

DeepSeek R3 achieved the highest CoT performance (80%), comparable to Prolog-based reasoning.
Grok-3 followed at 78%.
GPT4-o1-mini scored 77%, slightly below the top CoT performers.
Foundation models (e.g., deepseekV3, gpt-4o, grok-2–1212) had CoT scores ranging from 63% to 76%, slightly lower than their code-based reasoning counterparts.

Analysis and insights

The experiment reinforces that foundation models combined with Prolog can achieve performance comparable to, or even better than, specialised reasoning models. This approach leverages:

LLMs’ natural language understanding for effectively parsing complex financial contexts and questions.
Prolog’s symbolic reasoning for performing deterministic logical computations and ensuring logical consistency.

Key observations:

deepseekV3 and gpt-4o continue to demonstrate robust performance, indicating their suitability as effective reasoning engines when paired with Prolog.
Grok-2–1212 shows significant promise as a new contender, with performance close to traditional reasoning models, especially with its improvement on Pass@2.
Grok-3 establishes itself as a competitive reasoning model, outperforming gpt4–1o-mini and coming close to DeepSeekR1.

Conclusion and discussion

In this blog, we present results demonstrating that generating Prolog and Python programmed for reasoning tasks can achieve performance comparable to fine-tuned reasoning models. Whilst LLMs excel at understanding and parsing natural language, their limitations in symbolic reasoning, logical consistency and precise numerical computations are well-documented.

Prolog, with its robust logic programming and deterministic reasoning, fills this gap by offering precise, rule-based solutions that enforce logical consistency and enable transparent reasoning paths. Similarly, Python-based reasoning leverages LLMs’ ability to generate and execute numerical computations, making it highly effective for financial analysis, revenue calculations, and other quantitative tasks.

This neural symbolic reasoning approach not only leverages LLMs’ natural language understanding but also capitalises on Prolog’s symbolic inference and Python’s computational power to create a hybrid reasoning system. This combination proves particularly powerful in scenarios that require both linguistic interpretation and rigorous logical or numerical deductions, overcoming the limitations of purely neural approaches such as chain-of-thought reasoning.

The experiments conducted on FinQA questions clearly demonstrate that this hybrid model — incorporating both symbolic logic (Prolog) and procedural computation (Python) — achieves performance comparable to, and in some cases, superior to state-of-the-art reasoning models, including DeepSeekR1 and Grok-3. These findings highlight that foundation models combined with logic programming or numerical computation can rival specialised reasoning models, offering a more flexible and interpretable approach to AI-driven financial reasoning.

Reference

[1] Exploring an LM to generate Prolog Predicates from Mathematics Questions. Yang. X et al 2023.

[2] Arithmetic Reasoning with LLM: Prolog Generation & Permutation Yang. X et al 2023 Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics

[3] Zhiyu Chen, Wenhu Chen, Charese Smiley, Sameena Shah, Iana Borova, Dylan Langdon, Reema Moussa, Matt Beane, Ting-Hao Huang, Bryan Routledge, William Yang Wang FinQA: A Dataset of Numerical Reasoning over Financial Data EMNLP 2021

[4] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, and D. Zhou, “Chain-of-thought prompting elicits reasoning in large language models,” 2023.

[5] Andrew Cropper Efficiently Learning Efficient Programs, Imperial College London 2019

[6] Meragol: https://github.com/metagol

[7] What’s Left? Concept Grounding with Logic-Enhanced Foundation Models Joy Hsu, Jiayuan Mao, Joshua B. Tenenbaum, Jiajun Wu, NeurIPS 2023

LLM and Prolog: the logical alternative to chain-of-thought reasoning was originally published in gft-engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

Empowering innovation in regulated environments

Ivan Gatnau Lopez — Tue, 04 Mar 2025 13:15:36 GMT

Zero trust service mesh for tier 1 financial services institutions (part 1)

Introduction

Tier 1 financial services organisations are often at the forefront of adopting cutting-edge technologies whilst also adhering to the highest regulatory and compliance requirements.

Within this context service mesh has emerged as a vital component for managing microservices, usually within multi-tenant Kubernetes clusters. It enables secure, controlled, reliable and observable communication between services. However, implementing a service mesh in environments like tier 1 financial services institutions poses unique challenges due to strict compliance requirements, multi-tenancy, and modern network security.

Whilst service mesh implementations might be relatively simple to set up in clusters managed by a single application team, with same confidentiality and data sensitivity requirements, the complexity can increase significantly in multi-tenant environments where clusters are shared among multiple users and application teams.

Despite this added complexity, it is still beneficial for a large organisation to develop a capability that supports multi-tenancy for the following key reasons:

1. Cost Efficiency

Shared Infrastructure: Multi-tenancy allows multiple tenants to share the same cluster resources (nodes, storage, cross-cluster services, etc.), reducing the need for dedicated clusters per application team.
Optimized Resource Utilization: Multi-tenant clusters will always have a higher utilisation rate compared with small/medium applications. The latter may underutilise regional clusters with several availability zones being a common resiliency requirement for tier 1 financial services organisations.
Reduced Operational Overhead: A team managing a single shared cluster is usually more cost-effective than many application teams managing their own clusters.

2. Operational Simplicity

Centralized Management: A single cluster reduces the complexity of managing Kubernetes environments, including upgrades, monitoring and scaling.
Standardization: Consistent policies, tooling and configurations can be applied across all tenants, ensuring uniformity and compliance.

3. Security and Compliance

Policies Enforcement: Network policies, resource quotas and security controls (e.g., OPA Gatekeeper constraints) are easier to be enforced in multi-tenant clusters by an expert platform team.
Auditing and Governance: Centralized logging and monitoring allow organizations to track activity and enforce compliance across tenants of multi-tenant clusters rather than possible many individual clusters.
Real-Time Observability: Tools like Jaeger and Prometheus provide detailed metrics, logs, and traces for compliance audits. These are enforced for all tenants in a multi-tenant cluster.
Centralized Logging: Sidecar proxies (e.g., Envoy in Istio) collect and forward logs to centralized systems for long-term retention and analysis. Similar to above points, these should be enforced for all tenants in a multi-tenant cluster, with no possible deviation.

4. Data Residency and Isolation:

Traffic rules governance: Enforces rules for where data flows, ensuring it complies with regional regulations.

At GFT and Google we have years of experience building applications and multi-tenant platforms supporting those application teams within highly regulated organisations. During this blog, co-authored by Radoslaw Cisz, we will walk through all characteristics of a very opinionated implementation of service mesh governance. This implementation is based upon some principles that enable teams within the controls required for a tier 1 financial services organisation.

In this document, we will try to explain how we enabled service mesh and its governance within a multi-tenant platform, empowering application teams to perform to their best capacity by following the key principles below:

Zero Trust Network (ZTN): How to enforce that ALL traffic within the cluster being authenticated and authorised.
Micro-segmentation: Workload isolation enforcement through auto-governance, exceptions could be allowed through manual governance.
Automation first: How we leveraged on Open Policy Agent’s Rego policies to validate configuration that is applied. It prevents misuse of the mesh configuration resources.
Simplicity: How abstraction of the mesh configuration from main use cases into simple configuration files can help tenants of the platform to understand use of Service Mesh.

Service Mesh Governance

Service mesh, even though it is a very useful technology that can enforce zero trust network (ZTN) and enable the application team’s enhanced control over the network traffic, still demands a thorough understanding of the technology. The learning curve to grasp all potential use cases can be steep and may overwhelm developers.

Re-usable artefacts, developed in-house, can be an easy way to enable application teams within the same organisational context. However, these also become a black box for them, which on the one hand takes away complexity, but on the other requires a lot of troubleshooting support. Usually, those implementations are very opinionated. Service mesh and networking in general require very specific skillsets that not all application teams have the capacity and capability to learn quickly.

Despite application teams eventually mastering service mesh technology, human errors can affect other teams whilst configuring resources. Additionally, malicious actors might exploit vulnerabilities within any cluster or application. Therefore, governance is crucial in any multi-tenancy platform, ensuring that multiple application teams can coexist in an isolated, secure and controlled manner. However, it’s vital to have some balance between allowing teams the autonomy to manage their code release processes and maintaining the platform’s security integrity.

Governance’s main focus will usually be on ingress and egress traffic use cases, or how to enforce who can access an application and what applications can access externally to the clusters.

Traffic management and governance within the cluster (or multiple clusters) or “in-mesh” can be limited to namespaces isolation. Even though this limits the number of capabilities that service mesh offers, it will simplify the governance around configuration resources and therefore reduce the noise and audit controls around a solution in highly regulated organisations.

Once the organisation becomes comfortable with what service mesh offers, and how to exploit these capabilities, it should be time to discuss how “in-mesh” governance can be achieved.

Egress Governance

The most common use case for service mesh governance in a financial services organisation is egress traffic. When not properly configured, it can increase the risks of data exfiltration, where sensitive data is transmitted to unauthorised or malicious destinations.

Below are the most important key risks to be considered and remediated:

Unrestricted Outbound Traffic:

By default, services within the mesh may communicate freely with external endpoints if no egress controls are in place. This can allow accidental or intentional transmission of sensitive data to unknown or malicious destinations.

Malicious Workloads:

If an attacker compromises a workload within the cluster, they can use unrestricted egress to exfiltrate data, bypassing network-level defenses.

DNS Exfiltration:

Poorly configured egress allows DNS queries to external servers. Attackers may encode sensitive data in DNS requests and send it to rogue DNS servers for retrieval.

Lack of Visibility:

Without proper egress configuration, outbound traffic bypasses monitoring and logging mechanisms. This lack of observability hinders the detection of anomalies or unauthorized data transfers.

Non-Compliant Traffic:

Organizations with strict compliance standards (e.g., GDPR, HIPAA) risk non-compliance if sensitive data flows to unapproved regions or endpoints.

Google Egress Gateways Best Practices (https://cloud.google.com/service-mesh/docs/security/egress-gateways-best-practices) proposes an architecture that takes a defense-in-depth approach to the fine-grained control of egress traffic for a cluster used by multiple teams. The controls are based both on Layer 4 (transport) and Layer 7 (application) network controls.

Image taken from https://cloud.google.com/service-mesh/docs/security/egress-gateways-best-practices

The above design may not allow enough granular governance. It may be required for multiple tenants in the same cluster to be individually limited to access external resources.

The above will assume Istio Authorization Policy can be used to authorise for instance team-x and team-y namespaces (and/or workloads in the same namespaces if we want to be more granular) to access the egress gateway.

In this case we would use a ServiceEntry to register the domain that traffic can be routed to.

apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: restrict-egress-access-team-x
  namespace: istio-egress # Namespace where the egress gateway resides
spec:
  selector:
    matchLabels:
      app: istio-egressgateway # Label identifying the egress gateway
  rules:
  - from:
    - source:
        namespaces:
        - team-x # The namespace allowed to access the egress gateway
    to:
    - operation:
        hosts:
        - example.com # Allowed external host
---
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: restrict-egress-access-team-y
  namespace: istio-egress # Namespace where the egress gateway resides
spec:
  selector:
    matchLabels:
      app: istio-egressgateway # Label identifying the egress gateway
  rules:
  - from:
    - source:
        namespaces:
        - team-y # The namespace allow  ed to access the egress gateway
    to:
    - operation:
        hosts:
        - api.another-example.com # Another allowed host

This would work for L7 traffic where the destination host’s information is present and can be used as part of the authorisation policy. The situation changes when we consider L4 traffic.

There will be many cases where TCP L4 traffic is required to egress the cluster and SNI information is not sent as part of this request. Therefore, the host’s value in the Authorization Policy has to be set to ‘*’ allowing anything exiting the cluster. This then becomes a risk which must be mitigated, for instance, by implementing a dedicated egress gateway per tenant.

There is also the consideration of balancing between the tenants’ freedom to manage their service mesh resources and the fact they may deploy resources in a shared namespace with other tenants with the risk of misuse of RBAC privileges.

Therefore, the recommended practice is to have a separate egress gateway namespace per tenant that will allow you to manage individual applications’ egress traffic. This will require a more complex setup, including:

· Creating a specific RBAC for tenants to be able to deploy (only) Authorization Policies in the application egress namespace.

· The platform team to manage the deployment of egress gateway workload resources.

This egress gateway setup becomes quite complex to understand. Hence, a form of simplification (remember one of our design principles) is required. In our case, as you will see later, we decided to build a few helm charts that could be re-used by application teams to configure the most common use cases for ingress and egress.

Application teams could decide not to use it, as in fact they own the complete lifecycle of the service mesh deployment. However, from our experience, we recognise this approach as a quick enabler for application teams to properly configure their Istio resources.

These assets are maintained by the platform team. Hence, they are implemented by some of the most expert members in the whole organisation when it comes to service mesh.

Ultimately, we decided to include Service Entry resources as the main resource to add some form of governance.

The below diagram depicts how a dedicated egress gateway per tenant and common egress gateways could be configured in a multi-tenant cluster.

Common Services and Tenant Dedicated Egress Istio Gateways topology

Ingress Governance

Similarly for the ingress use case, there are some well-documented risks if the implementation does not follow some secured mesh best practices:

Unauthorized Access:

If the ingress gateway is misconfigured or lacks proper authentication and authorization policies, unauthorized users or attackers can gain access to internal services. Without validating external requests, sensitive resources could be exposed to malicious actors.

Data Interception:

If TLS encryption is not enforced for ingress traffic, sensitive data (e.g., API keys, personal data) can be intercepted during transit, exposing it to man-in-the-middle attacks.

Unfiltered External Traffic:

Without ingress controls like rate-limiting, IP whitelisting, or allow-lists, external traffic can flood the services. This can lead to denial-of-service (DoS) or distributed denial-of-service (DDoS) attacks, causing service outages.

Lack of Traffic Routing and Filtering:

Improper routing configuration (e.g., wrong ServiceEntry or VirtualService settings) can direct external requests to the wrong services, potentially leading to data leaks or the exposure of internal services to the outside world.

Improper Authorization and Role-Based Access Control (RBAC):

AuthorizationPolicy in Istio or missing RBAC rules can lead to the misuse of ingress, where traffic meant for specific services can reach unauthorized destinations, violating security policies and exposing services to attacks.

Misconfigured SSL/TLS Termination:

If SSL/TLS termination is improperly configured at the ingress gateway, sensitive traffic could flow in plaintext between the gateway and backend services. This misconfiguration could expose traffic to potential interception within the mesh.

Bypassing Security Policies:

Misconfiguration in the ingress can allow traffic to bypass service mesh security policies like mutual TLS (mTLS), making it possible for traffic to be routed without authentication or authorization checks.

Note, even though authorising at the ingress gateway is recommended when the same topology of applications are deployed in a Kubernetes cluster, this is not always possible or even recommended when it comes to a multi-tenant cluster, where each application may have to work with its own authentication mechanisms. In that case, it is recommended to delegate to the workloads how they will achieve the required authentication and authorisation mechanisms. In fact, this allows better granular authorisation controls.

We will talk later about how an allow-nothing Authorization Policy in Istio can be useful for this scenario in the Principles of Zero Trust and Their Implementation section.

Before then, the below diagram depicts a possible implementation of a tenant workload deployment, and all service mesh resources required to route and authorise traffic to a workload.

Ingress Traffic configuration with Istio

In the diagram above, the RBAC required for tenantA defines clear boundaries limiting the management of Kubernetes (including service mesh) resources to its own namespace. However, the Istio ingress gateway workload and data plane will require to have tenantA server certificates next to Ingress PODs.

For that we can use a secrets synchronisation mechanism (see some examples here https://cert-manager.io/docs/devops-tips/syncing-secrets-across-namespaces/) and “exportTo” capability of Istio Gateway resource to upload required configuration on istio-ingressgateway envoy proxy.

The above requirements can be auto-governed/controlled by the use of OPA Gatekeeper policies, for instance, to enforce naming convention of ingress Gateway and Secret resources so there is no clash in names between tenants.

The above Request Authentication resource defines what request authentication methods we want to be supported by the app workload. This is in essence the first line of defence; only authenticated traffic will be then validated against an Authorization Policy. A new OPA gatekeeper constraint could be implemented to only accept identity providers allowed within the FSI. This is a way of auto-governance that can be enforced within the cluster.

Conclusion

Whilst service mesh can be a very useful tool for managing micro-services at scale, implementing it in highly regulated environments is a complex task that requires careful planning, strong governance and deep expertise. By addressing these challenges with a structured approach, organisations can harness the benefits of service mesh whilst maintaining compliance, security and operational efficiency.

What next?

This blog continues in Empowering Innovation in Regulated Environments — Zero Trust Service Mesh for Tier1 Financial Institutions (part 2)

Appendix

You can find some examples how Ingress and Egress traffic can be configured for multi-tenant clusters in the following GitHub repositories:

The intention of these repositories is to provide a hands-on experience through a series of injected bugs in Istio configuration and some clues how to use Istio troubleshooting tools.

Solution is also provided in a separate branch, just when you get stuck.

Enjoy it!

References

Istio Service Mesh: https://istio.io/

Best Practices for using Cloud Service Mesh egress gateways on GKE clusters: https://cloud.google.com/service-mesh/docs/security/egress-gateways-best-practices

Directly accessing Google APIs using Private Google Access and IAM permissions: https://cloud.google.com/service-mesh/docs/security/egress-gateway-gke-tutorial#directly_accessing_google_apis_using_private_google_access_and_iam_permissions

Identity Based Segmentation for a ZTA — Zack Butcher, Tetrate & Ramaswamy Chandramouli, NIST: https://www.youtube.com/watch?v=s2lIaFhkA8c&ab_channel=CNCF%5BCloudNativeComputingFoundation%5D

Understanding Istio and Open Policy Agent (OPA): https://tetrate.io/blog/understanding-istio-and-open-policy-agent-opa/

Some general knowledge text has been AI generated using https://chatgpt.com/, you can find it through this document in italic font.

Empowering innovation in regulated environments was originally published in gft-engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

Evaluating LLMs for infrastructure as code

Lu Mao — Tue, 11 Feb 2025 16:15:40 GMT

Introduction

The rapid adoption of cloud computing has made infrastructure as code (IaC) a cornerstone of modern IT operations. IaC streamlines infrastructure provisioning and management through code-based templates, offering scalability, consistency and version control. However, crafting IaC configurations, particularly for tools like Terraform, often demands significant expertise and effort due to the complexity of cloud environments and the intricacies of domain-specific languages. Using an LLM to assist in developing IaC is an attractive idea and a benchmark for LLM performance in generating IaC has been published [1]. However, LLMs are a fast-moving area of research, and several new and powerful models have been developed and introduced since the publication of the IaC benchmark. This blog documents the performance of these new models. We find that DeepSeekV3 is the new leader on these benchmarks and additionally note that it is approximately 30x cheaper than GPT4.

Business values

Enhanced efficiency and automation: The automation capabilities of large language models (LLMs) simplify the creation and management of infrastructure as code (IaC), reducing the time and effort required for manual configurations. [ST1]
Reduced skill barriers: Developing IaC configurations traditionally demands specialised expertise in domain-specific languages like HCL (Terraform) or YAML (Ansible). LLMs lower this barrier by translating natural language prompts into IaC scripts, enabling teams with diverse technical backgrounds to contribute effectively. This democratises access to infrastructure provisioning, fostering innovation and agility.
Cost optimisation: By minimising human error and automating tedious tasks, LLMs contribute to more consistent infrastructure deployment. This reduces unnecessary resource allocation and associated costs. Moreover, continuous optimisation and monitoring via LLM-driven systems further enhance resource utilisation.
Improved governance and compliance: Many organisations are subject to stringent regulatory requirements. LLMs can assist in embedding compliance rules directly into IaC templates, ensuring that configurations adhere to security, governance and regulatory standards from the outset. This proactive approach minimises the risk of non-compliance and enhances audit readiness.

Large Language Models (LLMs) have shown promising capabilities in code generation, making them a potential game-changer for automating IaC creation. New LLM models are being introduced all the time, and it is helpful to have clear information about the relevance of these models to the IaC generation task.

To address this, we leverage the IaC-Eval benchmark [1] — a comprehensive evaluation framework designed to test the capability of LLMs in generating IaC configurations. Using this benchmark, we evaluate the performance of cutting-edge LLMs, including Google Gemini Experiment 1114, ChatGPT-4o-Latest (20241122), Google Gemini 1.5 Pro, Claude 3.5 Sonnet (20241022), and LLAMA 3.1 Nemotron 70B, DeepSeek-V3 in generating Terraform configurations. Additionally, we conducted experiments on smaller language models, including LLAMA 3.1 8B Instruct, Google CodeGemma 7B, and Qwen 2.5 Coder 7B Instruct, to explore their performance and viability in IaC tasks. These smaller models offer the significant advantage of enabling IaC tasks to be performed locally, which is particularly attractive for organisations prioritising security, auditability and reproducibility. This evaluation not only assesses the accuracy and functionality of these models but also highlights their potential to enhance IaC workflows in real-world cloud environments.

The findings from this study will help identify strengths, limitations and areas for improvement in using LLMs for IaC automation, ultimately driving more efficient and accessible infrastructure management solutions.

The IaC-Eval Benchmark dataset

The IaC-Eval dataset is designed to rigorously test the capabilities of language models in generating Infrastructure-as-Code (IaC) configurations. It consists of 458 human-curated scenarios covering a diverse range of AWS services, each tailored to capture real-world complexities and varying levels of difficulty. Each scenario includes:

1. Natural language problem description: A prompt describing the desired cloud infrastructure in user-friendly terms, serving as input for the LLM.

2. Infrastructure intent specification: A set of rules written in OPA Rego, detailing valid resources, optional attributes and required configurations for the problem.

3. Terraform configuration file: A correct example of the expected IaC output, written in HashiCorp Configuration Language (HCL).

The dataset spans a wide range of AWS services, from foundational components like VPCs and IAM roles to advanced services like RDS, DynamoDB, and CloudFront. Problems are categorised by difficulty, accounting for factors such as lines of code (LoC), number of resources, and their interdependencies. This structured dataset not only evaluates functional correctness but also assesses how well the generated IaC aligns with the user’s intent.

The evaluation pipeline

The IaC-Eval evaluation pipeline [1] is a two-phase process that rigorously validates the functional correctness and intent fulfilment of LLM-generated IaC configurations without requiring actual cloud deployment. The pipeline ensures both scalability and practicality for experimentation:

1. Speculative deployment plan:

The LLM-generated Terraform configuration is processed using the terraform plan command.
This phase checks for syntactical correctness and produces a speculative deployment plan, including a dependency graph of resources.
Configurations failing this phase due to syntax errors or missing dependencies are immediately deemed incorrect.

2. Intent specification matching:

The dependency graph from the first phase is evaluated against the predefined infrastructure intent specification using OPA (Open Policy Agent).
This phase ensures that the configuration meets all user-specified requirements, such as including mandatory resources, adhering to value constraints, and fulfilling interdependencies.
The output determines whether the IaC program aligns with the user’s intent, independent of actual cloud execution.

This two-phase evaluation pipeline enables precise measurement of LLMs’ capabilities in generating valid, functional and intent-fulfilling IaC configurations. By leveraging the IaC-Eval framework, my experiment aims to benchmark state-of-the-art LLMs, shedding light on their strengths and limitations in Terraform code generation.

Evaluation setup

The evaluation leverages a subset of 150 scenarios selected from the IaC-Eval dataset, focusing on diverse and challenging configurations representative of real-world cloud infrastructure use cases. These scenarios were tested on five large language models (LLMs) not previously evaluated in IaC-Eval: Google Gemini Experiment 1114, ChatGPT-4o-Latest(20241122), Google Gemini 1.5 Pro, Claude 3.5 Sonnet (20241022), DeepSeek-V3 (20241226), DeepSeek-Coder, and LLAMA 3.1 Nemotron 70B. Additionally, we extended our evaluation to include smaller models, such as LLAMA 3.1 8B Instruct, Google CodeGemma 7B, and Qwen 2.5 Coder 7B Instruct, which offer the added advantage of enabling IaC tasks to be performed locally. This local deployment capability is particularly attractive for organisations prioritising security, auditability, and reproducibility in their workflows, ensuring sensitive infrastructure configurations remain under tighter control.

For each scenario, models interacted over a maximum of two attempts. In the first attempt, the model generated Terraform code based on a natural language prompt. If the output failed, either the Terraform plan compilation phase or the OPA policy evaluation phase, the error message and previous output were provided as feedback for a second attempt. This iterative approach ensures the models had a fair opportunity to improve their output and resolve detected errors.

Notably, the evaluation used zero-shot prompting, meaning no examples or demonstrations were included in the input prompt. This setup tests the raw capability of the models to understand natural language descriptions and produce syntactically correct and intent-compliant Terraform configurations without additional contextual guidance.

Evaluation Result:

Figure 1: Performance of models on generating compiled Terraform

Figure 2: Performance of the model on generating Terraform that complies with OPA

Figure 3: Retry success heatmap analysis

Key insights from the experiment results

Performance consistency across models:

With the inclusion of DeepSeek-V3, this model emerged as the new leader in Terraform plan compilation, achieving the highest totals for both first attempts (80) and retries (29). Its strong performance in hard levels (41 on the first attempt) highlights its capability to handle complex configurations. Its DeepSeek-Coder version achieved the highest in OPA.
ChatGPT-4o-Latest (20241122) retained its position as a top performer across both phases, particularly excelling in Terraform’s retry phase (41 hard and 36 easy successes), showcasing its resilience.
Claude 3.5 Sonnet (20241022) continued to shine in simpler tasks, achieving 40 easy-level Terraform successes in the first attempt, but its results for retries (6 hard and 4 easy) highlight some inconsistency in more complex scenarios.
Smaller language models, such as LLAMA 3.1 8B Instruct, Google CodeGemma 7B, and Qwen 2.5 Coder 7B Instruct, demonstrated moderate success in easy-level Terraform tasks. First-attempt totals for Terraform ranged between 23 and 26, suggesting viability for local IaC tasks prioritising security and reproducibility.

Retry success highlights:

Both DeepSeek-V3 and DeepSeek-Coder matched ChatGPT-4o’s retry totals for Terraform (29), emphasising its robustness in error correction and reattempts. Notably, it maintained strong retry results across both hard and easy levels.
Smaller models such as LLAMA 3.1 8B Instruct demonstrated a significant improvement on retries, particularly for Terraform easy levels (24 successes), although their retry performance in OPA evaluations remained limited.
Google CodeGemma 7B and Qwen 2.5 Coder 7B Instruct also showed consistent improvements upon retries but fell short in achieving competitive scores in challenging scenarios.

Challenges with hard levels:

Across all models, hard-level scenarios remained a consistent challenge. DeepSeek-V3, whilst performing the best overall, managed 13 retry successes in hard-level OPA evaluations, which underscores the need for further optimisation.
Smaller models, including Google CodeGemma 7B and Qwen 2.5 Coder 7B Instruct, struggled significantly in hard-level retries for OPA, achieving minimal successes.

Comparative weakness:

LLAMA 3.1 Nemotron 70B, whilst showing improvement on retries (35 total for Terraform), had the lowest first-attempt results for OPA (4 total). Its performance highlights challenges in intent alignment and handling complex configurations.
Among smaller models, Google CodeGemma 7B and Qwen 2.5 Coder 7B Instruct continued to show limited utility for complex scenarios but demonstrated reliability for simpler tasks, making them suitable for local, low-complexity deployments.

Trends across phases:

Terraform plan compilation remained the phase where models performed most consistently, with DeepSeek-V3 leading the way. In contrast, OPA policy evaluation posed greater challenges, with all models achieving lower success rates.
Smaller models like LLAMA 3.1 8B Instruct and Qwen 2.5 Coder 7B Instruct demonstrated notable limitations in OPA evaluations, with retry totals not exceeding single digits in hard-level scenarios. Even for the top winner, DeepSeek-Coder, that totally made 52 correct OPA, that is only 33% of total questions

These insights highlight DeepSeek-V3 as the strongest candidate for robust IaC generation, outperforming other models in both accuracy and resilience across Terraform and, DeepSeek-Coder tops the OPA evaluations. ChatGPT-4o-Latest (20241122) remains a reliable contender, excelling in retry performance and maintaining versatility across diverse scenarios. Claude 3.5 Sonnet (20241022) and Google Gemini models also demonstrated consistent success, particularly in easier scenarios, making them strong options for less complex IaC tasks. Smaller models, whilst less capable for handling challenging configurations, continue to offer significant advantages for local execution environments prioritising security, auditability and reproducibility. Future fine-tuning and optimisations could further enhance these smaller models’ applicability to a broader range of use cases, particularly in lightweight and domain-specific tasks.

Comparison and observed improvements with latest-generation models

The IaC-Eval benchmark (458 human-curated scenarios) highlighted the difficulty of generating accurate IaC configurations, with even top-performing models like GPT-4 achieving a modest 1st attempt accuracy of 19.36%. Other models, such as WizardCoder-33B-V1.1 and GPT-3.5, scored significantly lower, with 1st attempt accuracies of 8.93% and 7.99%, respectively. This performance gap emphasises the challenge posed by IaC-Eval and the limitations of earlier-generation models.

Notable improvements in large models

Our experiments (took 150 scenarios from IaC-Eval) included newer-generation models such as Claude 3.5 Sonnet (20241022), DeepSeek-V3, DeepSeek-Coder, and LLAMA 3.1 Nemotron 70B, which were not evaluated in the original IaC-Eval paper. These models demonstrate meaningful advancements:

DeepSeek-V3: Achieved the highest Terraform totals across both first attempts (80) and retries (29), outperforming all other models, particularly in hard-level scenarios. Meanwhile, DeepSeek-Coder’s performance in OPA evaluations (first-attempt total of 31) also surpassed the previous top performers.
ChatGPT-4o-Latest (20241122): Continued to excel, particularly in Terraform retries, where it matched DeepSeek-V3’s retry totals (29). Its versatility across both hard and easy levels remains noteworthy.

Small language models

We also explored smaller models, including LLAMA 3.1 8B Instruct, Google CodeGemma 7B, and Qwen 2.5 Coder 7B Instruct, which add unique value for local execution:

LLAMA 3.1 8B Instruct: Demonstrated improved retry performance for Terraform (total of 42 across retries), making it a viable candidate for secure, local IaC tasks despite modest results in OPA evaluations.
Google CodeGemma 7B and Qwen 2.5 Coder 7B Instruct: These models maintained their utility for simple configurations, achieving retry totals of 24 and 17 in Terraform, respectively, though they lagged in handling complex scenarios.

Overall trends and implications

The latest-generation models demonstrate significant advancements in both size and specialised training, with DeepSeek-V3 emerging as the standout performer in Terraform, and DeepSeek-Coder topped OPA evaluations. Claude 3.5 Sonnet and LLAMA 3.1 Nemotron 70B show tangible improvements over earlier benchmarks, particularly in retry scenarios for simpler tasks. Whilst large models like ChatGPT-4o-Latest (20241122) continue to dominate in overall performance, smaller models, such as LLAMA 3.1 8B Instruct and Google CodeGemma 7B, offer compelling advantages for lightweight, localised IaC generation prioritising security and auditability. These results highlight the dual need for continued model refinement to handle complex configurations effectively whilst maintaining efficiency and scalability for simpler, domain-specific use cases.

After IaC-Eval was published

There have been many developments in the field of language model that were observed.

Google’s Gemini 2.0

Google’s release of Gemini 2.0 was a significant improvement for the Gemini series of models on GCP, particularly for infrastructure as code (IaC) generation. Key advancements include:

Enhanced multimodal functionalities: Gemini 2.0 supports native image creation and controllable text-to-speech features, providing additional context for improving IaC processes [3].
Improved efficiency: The Gemini 2.0 Flash variant offers faster processing for complex infrastructure configurations, ensuring accuracy and alignment with organisational requirements [2]
Autonomy in execution: The model’s agentic capabilities allow it to autonomously interpret and execute complex instructions, reducing the likelihood of human error and increasing the reliability of IaC templates [4].

These improvements position Gemini 2.0 as a powerful tool for automating, optimising, and scaling IaC workflows, making it an attractive choice for managing cloud infrastructures effectively.

Meta’s Llama series

Meta’s Llama series, particularly Llama 3 and its subsequent iterations, has advanced significantly in the IaC domain:

Llama 3: Introduced architectural enhancements, such as a larger vocabulary and more efficient tokeniser, boosting its reasoning and code generation capabilities [5].
Llama 3.1: Features an expanded context window of up to 128,000 tokens, enabling seamless handling of extensive IaC scripts. Its superior performance in reasoning and mathematics makes it a robust choice for automating IaC tasks [6].
Llama 3.3: Focused on efficiency and accuracy, this version has set high benchmarks in coding tasks and is tailored for multilingual and domain-specific applications [7].

These iterations highlight Meta’s commitment to providing open-source AI tools that can enhance automation and management of IaC.

Claude 3.5 Sonnet

Claude 3.5 Sonnet outperforms GPT-4o in several evaluations, particularly in reasoning and complex code generation [8]. Whilst these models exhibit advanced capabilities, their practical application to IaC generation should be assessed in specific contexts to maximise their potential.

DeepSeek-V3

In December 2024, DeepSeek released its latest AI model, DeepSeek-V3, marking a significant advancement in open-source large language models (LLMs). Key highlights of this release include:

Model architecture and training efficiency: DeepSeek-V3 is a Mixture-of-Experts (MoE) model with 671 billion parameters, activating 37 billion per token. It was trained on 14.8 trillion tokens over approximately two months, utilising 2,048 H800 GPUs. The total training cost was around $5.58 million, significantly lower than comparable models, demonstrating enhanced training efficiency. [10]
Performance benchmarks: The model outperforms other open-source models like Qwen2.5–72B and Llama-3.1–405B in various evaluations. Its performance is on par with leading closed-source models such as GPT-4o and Claude-3.5-Sonnet, particularly excelling in tasks involving mathematics, code generation and Chinese language processing. [9]
Inference speed enhancement: Through algorithmic and engineering innovations, DeepSeek-V3 achieves a threefold increase in generation speed compared to its predecessor, DeepSeek-V2.5, enhancing user experience with more rapid and seamless interactions.
Open-source commitment: Continuing its dedication to open-source principles, DeepSeek has made the FP8-trained model weights of V3 publicly available. This transparency supports community-driven development and broader accessibility for research and application.

DeepSeek-V3’s performance underscores its potential to improve IaC workflows, offering an effective balance between accuracy, efficiency and scalability. Additionally, it is very cheap to use — approximately 1/30th of the cost of GPT4. Given this, DeepSeek-V3’s capabilities make it a promising choice for organisations aiming to automate and optimise their cloud infrastructure management.

DeepSeek-R1

After the time of writing this blog, in January 2025, DeepSeek released DeepSeek-R1, an open-source large language model (LLM) that has garnered significant attention in the AI community. This model is particularly notable for its performance in code generation and reasoning tasks, which are critical components in Infrastructure as Code (IaC) applications.

Enhanced Reasoning Capabilities: DeepSeek-R1 has demonstrated superior performance in coding and mathematical reasoning, ranking #1 in these categories on LMArena shortly after its release. [11]
Open-Source Accessibility: Released under the MIT license, DeepSeek-R1 provides the community with open access to its model weights and outputs, facilitating fine-tuning and distillation for specific applications, including IaC. [12]
Integration with Cloud Platforms: DeepSeek-R1 is now available on major cloud platforms such as AWS and Azure AI Foundry, enabling developers to seamlessly incorporate its capabilities into their workflows. [13]

These advancements position DeepSeek-R1 as a promising tool for enhancing IaC processes, offering improved automation and efficiency in managing cloud infrastructure.

Best practices and recommendations

To effectively leverage LLMs for infrastructure-as-code (IaC), adopting a human-in-the-loop approach is essential. LLM-generated scripts must be reviewed by domain experts to ensure compliance and mitigate configuration errors. Version control systems should store IaC configurations, enabling collaboration, tracking and rollback if necessary. Continuous testing in sandbox environments can identify potential flaws before deployment. Organisations should prioritise education and cross-team collaboration, empowering staff to understand both IaC principles and LLM capabilities. Feedback mechanisms, where lessons from deployments refine LLM prompts and training, are critical to enhancing the precision and reliability of generated configurations.

Despite their potential, LLMs pose safety and ethical challenges. Security risks arise when LLMs generate insecure configurations or expose sensitive data, and they may also produce outdated or inefficient setups, leading to resource wastage. Ethical considerations include the transparency of LLM decision-making processes, potential biases in training data and accountability for errors in production. To address these concerns, organisations should embed compliance rules in IaC templates, ensure diverse datasets for LLM training and conduct regular audits. Limiting automation for critical systems and maintaining rigorous human oversight further enhances reliability and safeguards against unintended consequences. Careful thought must be given to understanding whether implementing policies that mitigate the risk of LLM-generated IaC outweigh the value of adopting this technology.

Reference:

[1] Kon, P. T. J., Liu, J., Qiu, Y., Fan, W., He, T., Lin, L., Zhang, H., Park, O. M., Elengikal, G. S., Kang, Y., Chen, A., Chowdhury, M., Lee, M., & Wang, X. (2024). IaC-Eval: A code generation benchmark for Infrastructure-as-Code programs. Proceedings of the 38th Conference on Neural Information Processing Systems (NeurIPS 2024).

[2] Gemini 2.0 Flash (experimental) https://ai.google.dev/gemini-api/docs/models/gemini-v2

[3] Gemini 2.0 https://deepmind.google/technologies/gemini/

[4]: Google puts AI agents at the center of Gemini update. Reuters https://www.reuters.com/technology/artificial-intelligence/google-puts-ai-agents-center-gemini-update-2024-12-11/

[5] Build the Future of AI with Meta Llama 3 https://www.dell.com/en-us/blog/build-the-future-of-ai-with-meta-llama-3/

[6] Meta’s Llama 3.1: Pushing the Boundaries of Open-Source AI https://skimai.com/how-metas-llama-3-1-is-pushing-the-boundaries-of-open-source-ai/

[7] Llama 3.3 https://ollama.com/library/llama3.3

[8] Claude 3.5 Sonnet https://www.anthropic.com/news/claude-3-5-sonnet

[9] DeepSeek https://api-docs.deepseek.com/news/news1226

[10] DeepSeek Technical Report https://arxiv.org/pdf/2412.19437

[11] DeepSeek Open-Sources DeepSeek-R1 LLM with Performance Comparable to OpenAI’s o1 Model https://www.infoq.com/news/2025/02/deepseek-r1-release/?utm_source=chatgpt.com

[12] DeepSeek-R1 Release https://api-docs.deepseek.com/news/news250120?utm_source=chatgpt.com

[13] DeepSeek-R1 models now available on AWS https://aws.amazon.com/cn/blogs/aws/deepseek-r1-models-now-available-on-aws/?utm_source=chatgpt.com

[14] DeepSeek R1 is now available on Azure AI Foundry and GitHub https://azure.microsoft.com/en-us/blog/deepseek-r1-is-now-available-on-azure-ai-foundry-and-github/?utm_source=chatgpt.com

Appendix

System prompt used for querying the model [1]:

You are TerraformAI, an AI agent that builds and deploys cloud infrastructure written in Terraform HCL. Generate a description of the Terraform program you will define, followed by a single Terraform HCL program in response to each of my Instructions. Make sure the configuration is deployable. Create IAM roles as needed. If variables are used, make sure default values are supplied. Be sure to include a valid provider configuration within a valid region. Make sure there are no undeclared resources (e.g., as references) or variables, that is, all resources and variables needed in the configuration should be fully specified.

An example prompt used for asking model to generate terraform [1]:

Configure route 53 to return different database endpoints to internal versus external users. Internal users are routed to an internal RDS instance, whilst external users are routed to a publicly accessible one. Name the zones “private” and “public”, the databases “internal” and “public”, and the subnet group “main”

Prompt that asks model to retry if the Terraform failed to compile [1]

You are TerraformAI, an AI agent that builds and deploys cloud infrastructure written in Terraform HCL. Given an incorrect Terraform program along with an error message, your task is to first describe the error in your own words, followed by a description of the fix you will apply, and ending with a single corrected Terraform HCL program. Make sure the configuration is deployable. Create IAM roles as needed. If variables are used, make sure default values are supplied.

Here is the original prompt:

{}

Here is the incorrect configuration:

{}

Here is the Terraform plan error message (potentially empty):

{}

Prompt that asks model to retry if OPA Rego validation failed [1]:

You are TerraformAI, an AI agent that builds and deploys cloud infrastructure written in Terraform HCL. Given an incorrect Terraform program along with an error message, your task is to first describe the error in your own words, followed by a description of the fix you will apply, and ending with a single corrected Terraform HCL program. Make sure the configuration is deployable. Create IAM roles as needed. If variables are used, make sure default values are supplied.

Here is the original prompt:

{}

Here is the incorrect configuration:

{}

Here is the Rego OPA policy associated with this configuration:

{}

Here is the Rego OPA policy error message:

{}

Terraform generated by Gemini Exp 1114 for the Hard Level:

## Terraform configuration for internal/external RDS routing with Route53

This Terraform configuration sets up two RDS instances, one private (“internal”) and one public (“public”). It then configures Route53 with two zones, “private” and “public”, to route traffic based on the user’s location. Internal users querying the “private” zone will be directed to the internal RDS instance, while external users querying the “public” zone will be directed to the public RDS instance.

This configuration includes:

* **Variables:** Defines variables for the AWS region, database instance class and database engine.

* **Providers:** Configures the AWS provider with the specified region.

* **RDS instances:** Creates two RDS instances, “internal” and “public”, with appropriate security group settings.

* **Route53 zones:** Creates two Route53 zones, “private” and “public”.

* **Route53 records:** Creates A records within each zone, pointing to the respective RDS instance endpoint.

```terraform

# Configure variables

variable “region” {

default = “us-west-2”

}

variable “db_instance_class” {

default = “db.t3.micro”

}

variable “engine” {

default = “mysql”

}

variable “engine_version” {

default = “8.0.28”

}

# Configure the AWS provider

provider “aws” {

region = var.region

}

# Create a security group for the internal database

resource “aws_security_group” “internal_db_sg” {

name = “internal_db_sg”

description = “Allow inbound traffic on port 3306 from within VPC”

ingress {

from_port = 3306

to_port = 3306

protocol = “tcp”

cidr_blocks = [“10.0.0.0/16”] # Example VPC CIDR

}

egress {

from_port = 0

to_port = 0

protocol = “-1”

cidr_blocks = [“0.0.0.0/0”]

}

# Create the internal RDS instance

resource “aws_db_instance” “internal” {

identifier = “internal”

engine = var.engine

engine_version = var.engine_version

instance_class = var.db_instance_class

allocated_storage = 20

db_subnet_group_name = “main”

skip_final_snapshot = true

username = “admin” # Replace with your desired username

password = “password” # Replace with a strong password

vpc_security_group_ids = [aws_security_group.internal_db_sg.id]

}

# Create a security group for the public database

resource “aws_security_group” “public_db_sg” {

name = “public_db_sg”

description = “Allow inbound traffic on port 3306 from anywhere”

ingress {

from_port = 3306

to_port = 3306

protocol = “tcp”

cidr_blocks = [“0.0.0.0/0”]

}

egress {

from_port = 0

to_port = 0

protocol = “-1”

cidr_blocks = [“0.0.0.0/0”]

}

# Create the public RDS instance

resource “aws_db_instance” “public” {

identifier = “public”

engine = var.engine

engine_version = var.engine_version

instance_class = var.db_instance_class

allocated_storage = 20

db_subnet_group_name = “main”

skip_final_snapshot = true

username = “admin” # Replace with your desired username

password = “password” # Replace with a strong password

publicly_accessible = true

vpc_security_group_ids = [aws_security_group.public_db_sg.id]

}

# Create the private Route53 zone

resource “aws_route53_zone” “private” {

name = “private”

}

# Create the public Route53 zone

resource “aws_route53_zone” “public” {

name = “public”

}

# Create an A record in the private zone for the internal database

resource “aws_route53_record” “internal_db_record” {

zone_id = aws_route53_zone.private.zone_id

name = “database”

type = “A”

ttl = 300

records = [aws_db_instance.internal.address]

}

# Create an A record in the public zone for the public database

resource “aws_route53_record” “public_db_record” {

zone_id = aws_route53_zone.public.zone_id

name = “database”

type = “A”

ttl = 300

records = [aws_db_instance.public.address]

}

```

**Please note:**

* This configuration assumes you have a VPC with a subnet group named “main” already created.

* Replace the placeholder username and password with your desired credentials.

* Adjust the security group rules and VPC CIDR range according to your specific network configuration.

* This configuration is a basic example and can be further customized to fit your specific needs.

OPA Rego Intent:

package terraform.validation

default is_configuration_valid = false

default is_valid_iam_instance_profile = false

default is_valid_iam_role = false

default is_valid_iam_role_policy_attachment = false

default is_valid_eb_app = false

default is_valid_eb_env = false

is_valid_iam_role {

some i

resource := input.resource_changes[i]

resource.type == “aws_iam_role”

contains(resource.change.after.assume_role_policy,”ec2.amazonaws.com”)

}

is_valid_iam_role_policy_attachment {

some i

resource := input.configuration.root_module.resources[i]

resource.type == “aws_iam_role_policy_attachment”

resource.expressions.role.references[0] == “aws_iam_role.eb_ec2_role.name”

resource.expressions.policy_arn.constant_value == “arn:aws:iam::aws:policy/AWSElasticBeanstalkWebTier”

}

# Validate aws_iam_instance_profile resource

is_valid_iam_instance_profile {

some i

resource := input.configuration.root_module.resources[i]

resource.type == “aws_iam_instance_profile”

resource.expressions.role.references[0] == “aws_iam_role.eb_ec2_role.name”

}

# Validate aws_eb_app

is_valid_eb_app {

some i

resource := input.configuration.root_module.resources[i]

resource.type == “aws_elastic_beanstalk_application”

resource.expressions.name

}

# Validate aws_eb_env

is_valid_eb_env {

some i

resource := input.configuration.root_module.resources[i]

resource.type == “aws_elastic_beanstalk_environment”

resource.expressions.name

resource.expressions.application.references[0] == “aws_elastic_beanstalk_application.my_application.name”

resource.expressions.solution_stack_name

resource.expressions.setting[0].value.references[0] == “aws_iam_instance_profile.eb_ec2_profile.name”

}

# Combine all checks into a final rule

is_configuration_valid {

is_valid_iam_role

is_valid_iam_role_policy_attachment

is_valid_iam_instance_profile

is_valid_eb_app

is_valid_eb_env

}

OPA Rego Error message:

Rule violation found. OPA complete output logged here:

{‘result’: [

{‘expressions’:

[{‘value’:

{‘terraform’:

{‘validation’:

{‘is_configuration_valid’: False,

‘is_valid_eb_app’: True,

‘is_valid_eb_env’: False,

‘is_valid_iam_instance_profile’: True,

‘is_valid_iam_role’: True,

‘is_valid_iam_role_policy_attachment’: True

}

},

‘text’: ‘data’,

‘location’:

{‘row’: 1,

‘col’: 1

}

]

}

]

}

Evaluating LLMs for infrastructure as code was originally published in gft-engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

A practical guide to extracting data from financial documents

Hugo Wickham — Thu, 06 Feb 2025 16:06:13 GMT

Hugo Wickham & Alastair Gill

An enduring challenge faced on data science projects is extracting data. This is not usually due to a lack of data but rather having it in a usable format. Much of this goes back to the well-worn adage of 60% of a data scientist’s time being spent on data cleaning and preparation, with this often relating to data quality issues, such as columns containing the expected type of data, removing anomalies and outliers, etc.
However, in recent years another type of challenge has come along in contexts where automation has not been designed into the business processes: the challenge is to extract the data from a variety of sources that were designed by humans, for humans. As companies digitise and become more and more digitised, it becomes necessary to bridge the gap between the cold efficient clarity of the API and the ambiguity of human data exchanges. Formats for even structured inter-human documents can vary greatly, with much information left to the recipient to infer from experience or common sense. Where one financial institution may refer to ‘net asset value’, another may call it ‘funds under management’. Where one form has ‘address’, another has ‘address line 1’, “address line 2’, etc.

If you have worked in this area for any length of time, this situation may well be familiar. You will probably have examples of your own. Approaching such a problem for the first time might seem at first simple (to those making the request) and impossible to the data scientist tasked with actually doing it. With this in mind, we have set out some of the steps that we might take in approaching such a problem having previously worked on numerous projects where we have extracted data (mainly text) from documents, forms or spreadsheets which are generally very similar but vary slightly (e.g., insurance claim forms, death certificates, pension fund payment extracts, performance report documents).

The process

The degree to which you might incorporate Machine Learning (ML) into this problem can vary according to the set of input documents. At its simplest, you may be able to build a solution using only a manually-created ruleset. It may be necessary to incorporate OCR models to read in data from troublesome file formats or equate similar terminologies (as mentioned earlier). As of relatively recently, you could indeed do the whole thing using a multimodal LLM (large language model). In this section, we will go through how you may implement each method.

Manual ruleset

Begin by examining a representative document format, ideally one that makes up a large portion of your dataset. This may come from a department or organisation that frequently publishes documents of this type. For documents with different versions, select one reflecting the most common format or major variations. Due to the nature of this task, the steps you then follow will depend entirely on the document format in question and how it relates to your organisation’s business processes — every case will be different, and you may have to deal with a lack of flexibility down the line, if the file formatting or terminology changes. In any case, there are some common steps to take, even if the specifics are vastly different between implementations.

Worked example

Using a trial balance spreadsheet as an example, we can walk through the process:

Obtain representative data: First, get a trial balance spreadsheet from a common source within your organisation or domain.

Read and process the data: To read the file, there are several methods depending on its format:

If the file is in Word or Excel, convert it to plain text or CSV to simplify processing. You can use Python packages like pandas for Excel files or pdfminer for PDFs.
In cases where file structure is critical, extract the XML from .docx or .xlsx files.
Automate the process as much as possible. For example, if you regularly receive PDF files, you could request access to the original .docx versions to simplify extraction.
For spreadsheets with multiple sheets or tables, use relevant Python libraries to handle them effectively. For instance, the pandas package can load Excel files directly into a dataframe, whilst more inconsistent data may require iterating through rows to ensure it meets predefined criteria.

Create rules for parsing: Now, analyse the content:

Filename patterns: Check if the filename offers clues about the format or source. This could help in categorising documents in the future.
Line splitting: If dealing with a CSV or similar, split the file into lines. Count the lines and fields to detect patterns, noting useful details like the number of rows or fields in each row.
Initial information: The first few lines may contain headings or metadata, such as company information or document titles, which can help categories or extract further details. Use regular expressions (regex) to extract patterns like dates or headings.

Get metadata: Extract metadata from the document, which can indicate the format and help locate relevant sections of the trial balance (e.g., columns like account code, description and amount). Identifying the table will involve:

Iterating through rows to find those with consistent formatting or the same number of fields.
Collecting rows until the end of the table, marked by empty or irregular fields.

Extract table contents: After identifying the relevant table:

Column headings: Backtrack a few lines before the table to capture potential headings.
Field types: Analyse text and numeric fields to determine their roles. For example, numeric fields may represent amounts, and patterns like in/out columns can help deduce relationships, such as calculating balances.
Patterns and formatting: Recognise formatting features like negative numbers or brackets, which could signify financial data.

Pros/cons of creating a manual ruleset for data extraction

Pros:

Customisable: Tailored to specific document types, offering high accuracy.
Control: Full control over extraction logic, allowing fine-tuning for edge cases.
Automation: Once established, the ruleset can be automated for future data extraction tasks.

Cons:

Time-consuming: Developing a comprehensive ruleset can take a significant amount of time.
Fragile: Changes in document formats may require regular updates to the ruleset.
Limited scalability: May struggle to adapt to diverse or new document types without modification.

ML approach

In some cases, a manual ruleset can be easy to build, easy to maintain and effective. If so, this is the right solution. However, if new data variations are expected, or if the anticipated rule set is complex, an ML approach may be more suitable. This does not necessarily mean a complete switch to ML but perhaps rather using ML to handle aspects that are difficult or time-consuming to manage manually.

One challenge that often necessitates ML is handling the unpredictable input formats which a rigid ruleset may struggle with. This problem is especially common with data external to your organisation; when dealing with large numbers of organisations, it is common to receive similar information structured in entirely different ways. This variation can be in both the layout of the data and the file type itself. Whilst some formats like CSV or neatly structured Excel files can be processed without automation, others like PDFs and scanned images require specialised approaches, such as optical character recognition (OCR). Even with structured data like Excel files, there is no guarantee that different sources will follow the same naming conventions or formatting standards.

Visual data extraction

Reading data from complex unstructured files can be streamlined using OCR models like Google’s Document AI. This tool can handle PDFs, images and various other formats, detecting tabular information and extracting it into plain text. For tabular data that is strictly formatted (e.g., a single table without extra whitespace or logos), models like Huggingface’s TAPAS can perform question-answering tasks effectively, provided the data is in a CSV-like structure. However, TAPAS is not suitable for diverse or irregularly formatted inputs.

Tabular data extraction and data reconciliation

For ML-based extraction, identifying relevant information within tabular data can be divided into three main areas:

Identifying the table of interest:

This can be framed as a classification problem to determine if a row is within a table or as a multi-class problem to distinguish between headers, first/last rows, or middle rows. Features might include the number of fields in a row, the type of content, and the row’s position within the document.

Identifying relevant columns:

This can be a multi-class task where each class corresponds to specific columns like ‘account code’ or ‘amount’. Features for this classification could include the length, type, and patterns in text or numbers. Metadata like the file name or information found at the top of the document can also provide context.

Data reconciliation:

The next step is data reconciliation — aligning extracted information with a standard format. This is particularly important when different organisations use varying terms for the same concept (e.g., ‘zip code’ vs. ‘postcode’). For minor variations, it might be sufficient to hard-code expected terms and use a search-and-replace approach. However, for larger and more complex datasets, embedding terms using a language model and training a classifier to map terms to standardised labels can be a more scalable solution.

Worked example:

Suppose your organisation receives financial reports from multiple external parties. Some send structured Excel sheets, whilst others send PDFs or scanned images of physical documents. Here’s how an ML-based approach might look:

File classification and preprocessing

Detect the file type (Excel, PDF or image) and route it to the appropriate tool. Some Excel files can be processed using pandas, whilst the cases with complex formatting, as well as PDFs and images, are passed through an OCR tool like Google’s Document AI to extract raw text.
If necessary, clean the extracted content to remove any non-essential elements (e.g., logos, headers, footnotes). This step can help ensure that the subsequent LLM processing is not affected by noise. However, be aware that ostensibly tabular data could be contained within an image which would then be removed (e.g. a table of data presented in an image and inserted into a spreadsheet)

Identify table and extract data.

For structured Excel files, the table can be directly read into a dataframe. For unstructured or semi-structured formats, use a model like TabNet to identify and extract table boundaries.
In a PDF, for instance, Document AI can identify tables and return a structured JSON output. The next step is to analyse the extracted text using a classifier to identify the rows of interest, such as headers, first/last rows or the main content rows.

Identify columns and normalise data

Apply a multi-class classifier to each column in the extracted table. Suppose some sources use ‘revenue’ whilst others use ‘sales’; train a model using sentence-transformers embeddings to map these terms to a standard label like ‘total revenue’.

Contextual data reconciliation

Once the table and columns are identified, reconcile extracted data using language models. For example, match ‘UK postcode’ with “zip code’ using embedding similarity scores.

Automate and continuously improve

Automate the extraction process for each incoming document. As new document formats are encountered, use a human-in-the-loop approach to label and retrain models, improving accuracy over time.

Large language models — the answer to data extraction?

A more forward-thinking document extractor might even make use of large language m(LLMs) such as GPT-4. One can use currently available models without any real preprocessing to attempt this task. In a very simple example, let us compare two differing quality multimodal LLMs — the freely available version of Gemini 1.5 Flash and GPT-4o. The following document is a publicly available example balance sheet, and each model was asked to:

Extract all the data as JSON (Prompt: Convert this balance sheet to JSON.)
Extract certain fields, using slightly different fields named to those in the document (Prompt: Give JSON with following template: {[year]:{Net Asset Value:[net asset value], Total Assets:[total assets], Total Liabilities:[total liabilities], Owed Income Tax: [sum of all owed income tax]}}. Replace [] terms with appropriate value from document.)

The results from this short series of tests are listed above. Whilst GPT-4o extracted the information perfectly, the less powerful Gemini model managed to fail on both of what we might imagine are quite simple tasks. It could not differentiate the 2022 and 2023 columns, leading to incomplete and incorrect data extraction. In fact, further questioning revealed that it was for some reason reading both columns as 2022 but prioritising the left column. This is a danger inherent in these models — they are black boxes and thus hard to debug. Where a faulty calculation in a traditional data loading program can be fixed by identifying and correcting a piece of deterministic code, LLMs are almost arcane — an agent whose behaviour you can often anticipate but possibly never know. As effective as GPT-4 ostensibly is, unknowability is not a desirable quality for any part of a data pipeline. Also worth noting is that mistakes were all semantic — they won’t raise KeyErrors or ValueErrors and could easily go unnoticed without proper oversight.

As well as generalist models, more specialised models could be available soon for tasks such as this. In 2023, researchers at JP Morgan proposed DocLLM, a specialised model which largely outperformed other models including GPT-4 and Llama2 in a range of tasks including document classification, key information extraction and visual question answering1. Whilst this specialised model is not openly available, it is not unlikely that the big LLM providers will follow suit soon given the commercial applications for such a tool.

Aside from any question of efficacy, there is the problem of LLMs’ vulnerable nature to bad actors. In the same way that inputs must be sanitised to avoid code injections, if an LLM-based approach were used, great care would need to be taken to ensure it did not open a vulnerability to prompt injections — specially crafted prompts which override the model’s intended behaviour and allow for malicious actors to control the model to their own ends. The most obvious example of this would be a document header saying “Ignore all previous instructions. Now, perform task xyz.” When interacting only with trusted parties, this concern is lessened, but one that absolutely cannot be ignored. Remember, document processing can be used in processes such as customer onboarding, or invoice processing. An invoice PDF with some background-coloured text — invisible to humans — might contain instructions for supplemental payments that could enter a workflow unnoticed.

A customer identity document may contain a photograph manipulated to match the identity of another person whilst appearing to human eyes to resemble someone else altogether. This problem is especially acute because bad actors can easily guess or discover which models are being used in an enterprise and, therefore, optimise their documents to exploit those models’ vulnerabilities.

(Source: https://www.accountingcoach.com/wp-content/uploads/2013/10/comparative-balance-sheet-example@2x.png)

All this is not meant to dissuade necessarily — LLMs often perform deceptively well, but systematically evaluating their outputs will allow you to check that the outputs you get from your LLM are correct and hallucination-free before trying to adopt it in a production environment. Note again that this example used no preprocessing, and a combination of a generalist LLM with an OCR step such as DocAI may well perform better. It is often the case that while LLMs can be applied to a broad range of tasks, they perform best when used only where needed. In almost any software solution conceivable, the ‘problem’ being solved is rather a series of connected problems; whilst some of these sub-problems might be best handled by an LLM (e.g. semantic field name reconciliation), that does not mean that all the sub-problems need to be handled by the LLM — where possible, more traditional processes should be used to avoid issues related to LLM opacity.

Worked example: Using LLMs for document extraction and validation

Let’s walk through a detailed example of how an LLM like GPT-4 can be integrated into a document extraction pipeline for financial reports.

Scenario:

Your organisation receives quarterly balance sheets from multiple external companies in various formats, including PDFs, images and structured Excel files. Some reports contain multiple years of data side-by-side, whilst others provide single-year snapshots. The goal is to create an automated pipeline to extract, validate and normalise this data for further analysis.

Create a way to evaluate the model output:

If you can not verify or evaluate the quality of the outputs, then it is going to be very difficult to establish confidence in your solution. As an example, a training set of documents that you have manually labelled can be used to assess performance during development. When a satisfactory solution has been created using the LLM, this can be tested on some unseen data and freshly hand evaluated. Of course, the quality of this sort of evaluation will come down to the amount of effort that you are able to commit. The more documents, from diverse and fresh sources successfully verified in the testing process, the more confidence that you will have that the model will work in the wild.

File classification and preprocessing:

This is largely the same as the previous worked example. However, this step is not necessary if you are feeding the input documents directly into a multimodal LLM.

LLM-Based data extraction

Feed the input data into your model choice using a series of structured prompts. For example:
Prompt 1: “Extract all key sections such as assets, liabilities, and stockholders’ equity along with their corresponding values for each year in the document. Present this information in JSON format.”
Prompt 2: “Calculate the net worth for each year by subtracting total liabilities from total assets. Provide a breakdown of these calculations.”

Context-aware validation

After the initial extraction, rephrase queries to confirm the accuracy of values. For example, instead of directly asking for ‘net worth’, ask the model to provide ‘total assets minus total liabilities’. If the returned value matches the previously extracted ‘Net Worth’, this serves as a validation step.
Use additional prompts to spot-check specific fields like ‘retained earnings’ or ‘total stockholders’ equity’. Include contextual prompts such as “What is the sum of all liabilities for the year 2022?” This ensures the LLM hasn’t borrowed values from other sections or years, as observed with Gemini.

Error handling and prompt refinement

If the LLM incorrectly labels fields, refine the prompt. Of course, this could happen in any manner of ways. For example, a model may be biased towards particular date ranges — i.e., using 2023 values for 2022. In this case, refine the prompt to include clearer temporal distinctions. For example, use: “Identify the first year whose data is listed in this document. Extract data for that year only. Ignore any values from other years. Start your analysis from the first row mentioning that year and stop when you reach the section for the next year.”
If the modal is struggling to identify a given field, you could add a list of similar field names to the prompt. This has the benefit of being able to leverage domain knowledge, whilst not being as brittle as a hard-coded lookup table. For example: “To populate the field ‘Short-term debt’, some examples of what it may be listed as are: short-term loans payable, current borrowings, short-term loans. It may be listed under a different title, but it will be similar to those terms listed.”

Output formatting and consistency checking

It is very important to define the desired output format. This will ensure that your outputs are all uniform and that you get all the data you need. Only collect raw values directly available on the source document. It is better to extract column X and column Y then manually calculate column Z = X * Y than ask the model to perform this calculation and risk a hallucinated result.
Include metadata like company name, report date, and file source in the JSON output. This information can be extracted from document headers or inferred based on context within the document.

Post-processing

At this point, once all the raw data has been extracted, perform calculations or any other transformations that can be coded as in any other ETL pipeline.

Field trials

As noted above, the testing process adopted can build confidence that an LLM-derived solution will work in the field, but experience shows that production conditions and production data can be radically different from those provided for the data science team. Because of this, field trials of potential solutions are an important step. If the field trial goes well, then success! However, if it fails (which should be to some extent expected) then having time to review and repair the solution can make the difference between a project’s success and failure.

Final thoughts

The task of extracting useful information from irregular documents is becoming more and more approachable. Depending on input variability, a hard-coded approach may be all you need, but for more diverse input data, an ML-based solution may be more desirable. If you are working with trusted parties or are confident you can mitigate the risk of prompt injection, using an LLM may well be your best option. Effective data extraction from financial documents can involve a blend of traditional and modern techniques, with evolving technology promising more efficient and accurate solutions. Yet the fundamental principles of careful data handling and validation always remain core to successful data science practices.

At GFT, we have used techniques outlined in this guide to create great value for our clients; one recent example was with an international investment bank dealing with almost exactly the same problem we have used as a demonstration — highly varied formatting of the financial documents they were receiving from their counterparties. In this case, we used the ML approach (in part because this was before the advent of accessible and powerful multimodal LLMs) to great success.

Any questions are welcome, and the authors can be reached at:

A practical guide to extracting data from financial documents was originally published in gft-engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

Comparing frontier LLM performance in finance

Lu Mao — Wed, 27 Nov 2024 13:16:00 GMT

Intoduction

A systematic comparison of the performance of LLMs in financial reasoning tasks, including Google’s newly released Gemini-Exp-1114 model against other leading models such as GPT-4o, Claude 3.5 Sonnet and Llama 3.1 Nemotron offers real business value, by identifying the most performant AI solution for automating complex financial analysis. Such evaluations help businesses select models that best align with their needs, whether for enhancing financial performance analysis, decision-making, or improving risk assessment accuracy. By understanding which model excels in numerical reasoning, multi-step problem-solving, and domain-specific tasks, companies can make effective investments in AI technology, reduce operational costs and gain a competitive edge in data-driven decision-making. The availability of the FINQA dataset [1] and access to the APIs of these models has enabled us to rapidly benchmark their performance.

Newly released Gemini-Exp-1114

Google DeepMind continues to innovate with its latest model, Gemini-Exp-1114. Gemini-Exp-1114 has ascended to the top of the Imarena Chatbot Arena leaderboard. This achievement surpassed OpenAI’s latest GPT-4o, showcasing Gemini’s exceptional capabilities .

The Imarena Chatbot Arena [2] is a blind head-to-head testing platform where users compare outputs from competing models without knowing their identities until after voting. In this competitive environment, Gemini-Exp-1114 achieves performance at or near ChatGPT-latest (20/11), GPT-4o, and also OpenAI’s o1-preview reasoning model.

Key highlights of Gemini-Exp-1114

§ Performance strengths: It excels in maths, vision tasks, solving complex problems, and creative writing — areas where the Gemini series has consistently outperformed peers.

§ Benchmarks: Unlike rigid, dataset-based benchmarks, the Chatbot Arena relies on human evaluation of performance and quality, emphasising Gemini’s ability to deliver outputs that resonate with users.

§ Development platform: Currently, Gemini-Exp-1114 is only available through Google AI Studio, targeting developers eager to explore new capabilities.

Whilst its exact lineage is unclear — whether it’s an advanced iteration of Gemini 1.5 Pro or an early preview of Gemini 2 — its remarkable performance indicates Google’s steady progress towards enhanced reasoning and agent-management tasks. This aligns with Gemini’s vision of being a versatile tool for both technical and creative applications.

To run a model prediction using Gemini experiment 1114, you can either do this via the Gemini API:

…or, using the Vertex AI Generative Model API:

Motivation and value of conducting this experiment

Whilst platforms such as Imarena Chatbot Arena provide valuable insights into the general capabilities of AI models through human evaluations, they often lack focus on highly specialised domains such as financial reasoning. Financial reasoning involves complex numerical calculations, multi-step problem-solving, and the ability to draw insights from structured (tables) and unstructured (text) data, which are critical skills for tasks such as risk assessment, profitability analysis and compliance auditing. These complexities make it a far more challenging domain compared to general-purpose reasoning or conversational tasks evaluated in arenas such as Imarena.

The motivation for conducting this experiment was to address this gap by testing and comparing the performance of advanced Large Language Models (LLMs) in a domain-specific context using the FINQA [ [1] dataset, a benchmark explicitly designed for financial reasoning. Unlike general benchmarks, financial reasoning demands not just linguistic understanding but also expertise in numerical operations and contextual financial knowledge. By evaluating models like Gemini-Exp-1114, Gemini 1.5 Pro, and GPT-4o on 150 challenging financial reasoning questions, the experiment provides unique insights into their real-world applicability for the finance industry.

This focused evaluation holds immense value for businesses, as financial reasoning tasks often underpin critical decisions with significant monetary stakes. The results of such experiments guide organisations in adopting AI solutions that are both accurate and reliable in handling domain-specific challenges.

FINQA dataset

The FINQA dataset is a groundbreaking resource designed to facilitate advancements in financial reasoning and question answering (QA). Developed using earnings reports from S&P 500 companies spanning 1999 to 2019, FINQA offers a comprehensive set of 8,281 QA pairs (Figure 1) that reflect real-world financial scenarios. These reports, sourced from the publicly available FinTabNet dataset, include tables and textual data annotated for precise numerical reasoning tasks. FINQA stands out by addressing the unique challenges of financial analysis, such as extracting insights from heterogeneous data sources and performing multi-step calculations, which are critical for understanding and analysing company performance.

Figure 1: An example of a financial context, question, reasoning and answer combination provided in the FinQA dataset.

The dataset was annotated by financial experts recruited through a meticulous process. Eleven professionals, including Certified Public Accountants (CPAs) and MBAs, were trained to compose meaningful financial questions, detailed reasoning programmes, and supporting facts for each selected page. Each question required either textual, tabular, or combined sources of information to generate insights relevant to financial decision-making. The annotation process emphasised numerical reasoning, with operations such as addition, subtraction, multiplication and division forming the foundation of reasoning programmes. Division, in particular, dominated the operations, reflecting the prevalence of ratio calculations in financial analysis.

To validate data quality, additional financial professionals were tasked with answering a random subset of the questions. They achieved an impressive execution accuracy of 91.16% and programme accuracy of 87.49%, underscoring the dataset’s high quality and reliability.

Experiment setup and results

In this experiment, we evaluated the ability of advanced Large Language Models (LLMs) to perform financial reasoning tasks using a subset of the FINQA dataset. Specifically, 150 questions were randomly selected from the dataset, and the model-generated answers were manually compared against the ground truth to assess accuracy. The goal was to determine how well these state-of-the-art models could handle the complexities of numerical reasoning and data interpretation in the financial domain.

The raw dataset was provided in the format of JSON where the fields are:

"pre_text": the texts before the table;
"post_text": the text after the table;
"table": the table in the format of CSV;
"qa": {
"question": the question;
"program": the reasoning program;
"result": the gold answer for the corresponding question;
}

Each prompt we sent to models contained the concatenation of pre-text, table, post-text as well as the question:

To run model prediction with Claude 3.5 Sonnet 20241022, we use:

To run model prediction with Llama 3.1 Nemotron, we use:

Finally, to run model prediction with GPT-4o, we use:

The author of FinQA does provide an evaluation script, but it only tests the reasoning steps of each model and not the final answer. The model needs to be trained to produce training steps that comply to their format. Since in this experimental study we are only interested in the output from using a pre-trained LLM without doing any further training, the model generated answers are then manually compared to the golden answer provided by FinQA. Here is what Google Gemini 1.5 Pro provided:

Gemini 1.5 Pro’s answer to the question: what percent of net interest revenue were total operating expenses in 2008 is as follows.

REASON STEPS:
1. Find total operating expenses in 2008 from the table: $988 million.
2. Find net interest revenue in 2008 from the table: $3332 million.
3. Divide total operating expenses by net interest revenue and multiply by 100% to express the answer as a percentage: ($988 million / $3332 million) * 100% = 0.2965 * 100% = 29.65%.
FINAL ANSWER:
Total operating expenses were 29.65% of net interest revenue in 2008.

The answer from Llama is as follows:

Here are the reason steps and final answer to the question:
**QUESTION:**
What percent of net interest revenue were total operating expenses in 2008?
**REASON STEPS:**
1. **Identify the necessary values for 2008 from the table:**
* Net Interest Revenue (2008): $3,332 million
* Total Operating Expenses (2008): $988 million
2. **Calculate the percentage of Net Interest Revenue that Total Operating Expenses represent:**
* Formula: (Total Operating Expenses / Net Interest Revenue) * 100
* Calculation: ($988 million / $3,332 million) * 100
3. **Perform the calculation:**
* ($988 / $3,332) ≈ 0.2965
* Multiply by 100 to convert to percentage: 0.2965 * 100 ≈ 29.65%
**FINAL ANSWER:**
Approximately **29.65%** of Net Interest Revenue in 2008 were Total Operating Expenses.

And the answer from Claude is as follows:

Let me help solve this step by step.
1) To find what percent total operating expenses were of net interest revenue in 2008:
* First, locate the values for 2008:
* Net interest revenue (2008) = $3,332 million
* Total operating expenses (2008) = $988 million
2) Calculate the percentage:
* Percentage = (Total operating expenses ÷ Net interest revenue) × 100
* = ($988 ÷ $3,332) × 100
* = 0.2965 × 100
* = 29.65%
FINAL ANSWER: Total operating expenses were 29.65% of net interest revenue in 2008.

Five models were tested:

§ Google Gemini Experiment 1114: A newly released experimental version of Google’s Gemini model, hosted on Google’s AI Studio, which achieved the highest accuracy of 78.5%, showcasing strong performance in numerical reasoning tasks.

§ Google Gemini 1.5 Pro: A prior version of the Gemini series, hosted on Model Garden of GCP’s Vertex AI, which scored 73.3%, demonstrating robust capabilities, but slightly lagging behind the experimental model.

§ GPT-4o: OpenAI’s advanced language model, hosted on the OpenAI Platform called using an API Key, which closely matched the performance of the leading model with an accuracy of 78%.

§ Llama-3.1-Nemotron-70B-Instruct: A state-of-the-art model from Meta’s Llama series, hosted on Nemotron’s AI platform, achieved an accuracy of 73.1%, demonstrating comparable reasoning capabilities to Gemini 1.5 Pro.

§ Claude 3.5 Sonnet (20241022): Anthropic’s latest iteration, released in October 2024, achieved the highest accuracy of 83.9%, outperforming all other models in this experiment, particularly excelling in contextual reasoning and multi-step problem-solving.

Key insights

The results highlight the growing sophistication of LLMs in handling domain-specific tasks such as financial reasoning. Both Claude 3.5 Sonnet and Google Gemini Experiment 1114 emerged as top performers, with Claude achieving a notable accuracy of 83.9%, significantly surpassing other models in this evaluation. Models such as GPT-4o, and Google Gemini Experiment 1114 demonstrated quite similar performances, with accuracies ranging from 78% to 78.5%, indicating their capability to handle financial reasoning tasks effectively.

It should also be noted that these models may have been trained on this task, and at least some of the data. Further work to evaluate the robustness of the performance of the models on unseen cases would be very valuable, but given the difficulty of obtaining representative cases for analysis and creating the target answers, this would be extremely challenging.

On the other hand, these results remain extremely impressive. This experiment reveals the potential of LLMs to augment financial analysis by automating complex reasoning tasks. Given the overwhelming scale of the data that must be evaluated when considering investments or undertaking due diligence, this advance has the potential to both enable new opportunities that are currently uneconomic and to raise the quality of investigations in marginal activities. This is a very optimistic take, but it is important to also emphasise the need for continued work in fine-tuning and contextual understanding of the information in documents, to further bridge the performance gap with domain experts.

References:

1. Zhiyu Chen, Wenhu Chen, Charese Smiley, Sameena Shah, Iana Borova, Dylan Langdon, Reema Moussa, Matt Beane, Ting-Hao Huang, Bryan Routledge, William Yang Wang FinQA: A Dataset of Numerical Reasoning over Financial Data EMNLP 2021

2. Chatbot Arena https://openlm.ai/chatbot-arena/

Comparing frontier LLM performance in finance was originally published in gft-engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

Teacher-student network architecture

Ioannis Gkiliris — Tue, 19 Nov 2024 16:08:42 GMT

Introduction

This article is aimed at introducing the teacher-student network architecture for training machine learning (ML) models and how its use can improve a model’s performance. There are plenty of reasons that led to the development of such architectures and important new drivers have emerged in the last few years with the use of huge, complex (black box) models in domains such as computer vision or natural language processing (NLP). A special case of this architecture that has gained popularity is called knowledge distillation (KD) and it is worth understanding the inner works of this framework compared to other teacher-student structures.

In general, the notion of utilising models and architectures that combine the knowledge of existing, already trained models, is well explored in the field of transfer learning [Zhuang, Fuzhen, et al., 2020]. The purpose of this overview is to describe the differentiating characteristics and use-cases of knowledge distillation and showcase the breadth of solutions that can be put into practice in various fields and not just NLP and large language models (LLMs).

After providing an overview of the teacher-student architecture and explaining its dependencies, capabilities and limitations, we will be exploring how such a framework can be used in a more traditional domain such as anomaly detection on transactional payments data. This will clarify how large the space of experimentation available to us is. Precisely because of this reason, our goal in this blog is to provide helpful guidance for other practitioners by documenting the structure, implementation and ultimately, the evaluation of our solution’s performance.

To help the reader, we have divided this blog into two sections. The main part of the blog describes the techniques that have been developed around the concept of teacher-student networks and discusses the relative strengths and weaknesses of each. Then, in the second part of the blog, a case study applying a teacher-student setup to weather anomaly detection is documented, showing how to use these ideas and what kind of results can be created by their application.

Description of knowledge distillation

Knowledge distillation refers to the process of transferring the knowledge from a large model to a typically smaller one that is easier and more practical to work with, interpret and deploy in environments with resource limitations. This form of model compression was firstly described by [Bucilua et al., 2006].

Knowledge distillation is performed mostly on deep neural network models that involve a multitude of layers and trainable parameters. Consequently, it became quite intriguing to utilise this approach, especially in commercial applications, as neural networks progressively made impactful innovations in speech or image recognition, as well as natural language processing.

In fact, the drive for efficiency in deploying large deep neural network models was motivated by the necessity to perform this on edge devices with limited memory and computational constraints. To tackle this challenge, a model compression method was first proposed [Bucilua et al., 2006] to transfer the knowledge from a large model into training a smaller model without any significant deviation in overall performance. This process of training smaller models from larger ones was formalised as the knowledge distillation framework by [Hinton et al., 2015].

The main structure of such a framework is comprised of two units/networks: the teacher model and the student model. As shown in Figure 1, a small student model learns to mimic a large teacher model and in essence leverage the knowledge of the teacher to achieve equivalent accuracy or output performance. The term knowledge distillation refers to the specific approach being employed by our design in order to define an objective (and effectively a loss function) and thus facilitate the strategy described above.

Figure 1: Knowledge distillation broad architecture / Source: Gou et al., 2021

Here, we will provide a summary of the main differences and variations of these KD approaches and then delve more into the most common and widely used approach and its mechanisms.

As described by [Gou et al., 2021] in their survey article, we have different ways to use knowledge distillation, depending on how we would prefer to leverage the teacher model’s knowledge and learned parameters.

Figure 2: Feature-based knowledge distillation / Source: Gou et al., 2021

There is a feature-based knowledge distillation that is designed to capture the feature maps of the teacher mode’s intermediate layers and match the student network’s feature activations with them. The distillation loss function then is trying to minimise these differences by having the respective models be trained on the same attributes and labelled data (see Figure 2).

The most common and generic approach is the response-based KD (Figure 3). In this structure, we are focusing on the last layer of the neural network and are trying to match the teacher model’s predictions by training a much smaller (and typically simpler) model. The distillation is calculated by using the logits output of both networks, which are the final output scores of the respective models. Depending on the used loss function, we can either use the logits directly or convert them to probability scores (multi-class or binary classification use cases).

Figure 3: Response-based knowledge distillation / Source: Gou et al., 2021

Lastly, there is a framework that does not rely on the intermediate or final layers’ output of the teacher network, but rather explores the relationship between different layers or data samples.

Some of these layers that correspond to learned feature maps can be utilised by the student network in the form of feature embedding, which preserves the feature similarities of samples in the intermediate layers of the teacher network.

Figure 4: Relation-based knowledge distillation / Source: Gou et al., 2021

In Figure 4, Instance Relations refer to how different data points (or instances) relate to each other within the model’s feature space. To illustrate, consider the task of processing three images: Cat1, Cat2, and Dog1. The teacher model learns that Cat1 and Cat2 are very similar (high similarity score), while Cat1 and Dog1 are less similar (low similarity score). Initially, the student model might not recognise these similarities in the same way.

The goal of relation-based knowledge distillation is to train the student model to capture these relational patterns, teaching it that Cat1 and Cat2 should have a high similarity score and Cat1 and Dog1 should have a low similarity score, just as the teacher model does.

In broader terms, t₁, t₂, …, tₙ and s₁, s₂, …, sₙ refer to feature representations for every given input sample (e.g., Cat1, Cat2, Dog1) which in turn lead to the calculation of the instance relations output. Finally, the distillation loss computes the similarity of the instance relations coming from both the teacher and the student models on the same set of input records. This loss helps the student model to better mimic the teacher’s behaviour in capturing and preserving the relational structure of the data.

Training (distillation) schemes

In a similar fashion, we can employ different training techniques that are essentially dependent on whether the teacher model is updated at the same time as the student one or not. Consequently, we can differentiate between offline, online and self-distillation techniques (Figure 5) .

Figure 5: Training (distillation) schemes / Source: Gou et al., 2021

The most commonly used form of training is the Offline Distillation, where a pre-trained teacher model is used as the base on which the student model is trained. This is a well-established technique in deep learning and has driven the advances in other knowledge transfer approaches too, namely transfer learning.

The online and self-distillation schemes share one key characteristic: both models are trained simultaneously. The use of Online Distillation is preferable in the absence of a readily available, large teacher model and thus we can instead train both networks at the same time and additionally make use of parallel computing to achieve higher efficiency.

Lastly, Self-Distillation can be considered a special case of online distillation as the same network is being employed for both the teacher and the student models. For example, among the existing variations, knowledge can be transferred from deeper levels of the same network to the shallower ones or knowledge in the earlier epochs of the network (teacher training) can be transferred into its later epochs (student training).

The latter approach is particularly useful, as the earlier model’s predictions act as soft targets to guide the learning process of later stages. The transfer is facilitated by a distillation loss, which is typically combined with the model’s final loss function to help improve both task performance and generalisation.

Knowledge distillation vs transfer learning

At this point, it is essential to highlight how these two approaches are distinct from each other. As an example, we can make reference to the classification task of images depicting animals. We can make use of a pretrained image model on clear and standardised samples of animal images and treat it as our base teacher model.

Then, this model’s parameters can serve as the initial layers of a new and larger SoftMax classification network architecture, where the training on new and more realist pictures (blurry or taken with less light available for example) can begin from scratch. In this case, the first layers’ parameters of this new and large deep neural network will not be initiated randomly but rather have as default value the ones that were obtained during the pre-training process. The above scenario is what transfer learning is all about and although it provides impressive results, as demonstrated by [J. -T. Huang et al., 2013], it does leave some of the core needs we have described earlier unaddressed (runtime performance constraints in particular).

Alternatively, in the KD scenario, the teacher network is used as is, and its intermediate model parameters are not getting updated, or interfere with the student model during training. For instance, we use the output classification of the teacher model as targets for training the student model against, i.e. we are aiming at matching the predicting capabilities of a much larger and complex neural network (teacher model) by training a much simpler and smaller network, which will be less time-consuming and more easily managed.

Structured data (transactions/trades) use-case

The power of the teacher-student framework is tangible in a variety of novel applications and has been proven extremely useful when edge devices need to make use of complex and resource-intensive applications of large ML models.

However, there are also use cases within the more traditional space of processing structured data (e.g., transactions or booked trades) that can demonstrate the same efficiency and power using other established methods. For instance, in real-time transaction processing, traditional rule-based systems or linear regression models are often employed due to their simplicity and speed. In fraud detection, ensemble methods like random forests or XGBoost have shown strong performance in identifying anomalies without the complexity of deep learning models. Additionally, in optimising trade execution strategies, classic algorithms such as logistic regression or ARIMA (autoregressive integrated moving average) models are frequently used to predict market movements and adjust trading decisions accordingly. Whilst these methods may not utilise the teacher-student architecture, they have proven effective in many contexts.

The challenge, however, lies in the fact that a solution that performs well in one scenario may not be as effective in another. This variability underscores the importance of having well-defined objectives, validated data sources and suitable performance metrics, so that the success of different approaches can be appropriately measured and compared. By aligning the chosen method with the specific requirements and constraints of the problem at hand, organisations can maximise the likelihood of achieving the desired outcomes.

Anomaly detection on structured data

The anomaly detection use-case, especially within the space of transactional payments data of financial organisations, can demonstrate limitations in the quality of performance that can be achieved. One key characteristic within this setting is the heavy class imbalance of anomalous versus valid transactions. The impact of such a limitation is easy to grasp and especially when the class imbalance is quite substantial, even the most sophisticated remediation solutions might simply not be adequate.

Different approaches to tackle the class imbalance such as under-sampling the majority class or oversampling the minority class and different performance metrics have been explored widely in the industry. Often, a combination of the above might prove to be sufficient to provide us with acceptable results for the anomaly detection task. At this point, however, it is worth seeing how knowledge distillation can be employed as a potential alternative to the above.

Often, we do not have visibility on which data is considered anomalous or the numbers available to us are quite small. According to the Financial Conduct Authority, 99.5% of trading activity within 2022 did not occur during a sensitive time period (i.e. not preceding a potentially price sensitive news announcement that led to significant price movement), but for the remaining 0.5% of trading activity that justified further review, only 4.7% of it was considered potentially anomalous.

In scenarios like the one described above, unsupervised algorithms can be an ideal solution, as their primary objective is to identify clusters or groupings of data points that are more similar to each other than to those in other clusters, all without the need for explicit guidance. Algorithms such as Isolation Forest are particularly useful in these situations because they are easy to implement, provide score metrics that quantify the significance of detected patterns and are more easily interpreted through the importance ranking of individual features.

Models such as autoencoders can prove to be quite useful as well, as they have two key characteristics:

They do not rely on labelled data.
They can be designed with deep neural network architectures on both the encoder and decoder side, thus unlocking the benefits of using large and complex ML models that encompass non-linearities in their activation functions.

The caveat though is that they suffer from the common interpretability issues of large neural networks (treated as black boxes) and are heavily resource-intensive.

Proposed Architecture

These two attributes — interpretability and complexity — often pose significant challenges for organisations, particularly financial institutions and banks, when evaluating different solutions or pipelines. Due to the intense scrutiny and stringent regulations these entities face, balancing the need for sophisticated models with the requirement for transparency and explainability can become a substantial hurdle.

Here, we propose a solution that aims at achieving just that: combining the benefits of powerful models, such as deep neural networks, and the perks of using less complex models that can be more easily interrogated. The proposed architecture is designed according to the teacher-student framework, leveraging the power of autoencoders on the teacher unit whilst pairing them with simpler and smaller models on the student side.

The architecture is as follows:

Initially, a large autoencoder needs to be trained on valid, unflagged transactional payments data, which tend to be quite large in volume.
- The model then learns a smaller representation of regular, non-anomalous transactions that can be reconstructed quite accurately.
- This will serve as the teacher model and the reconstruction errors will be the data we are more interested in.

The student network can be a simpler model, such as a decision tree, or even a small, shallow network that will be trained on the same feature set and learn the reconstruction errors provided by the autoencoder (teacher model).
- A key observation is that we will have to make use of an evenly split dataset of flagged and non-flagged transactions, where the flagged ones will have to be passed on the teacher network on evaluation mode to obtain their reconstruction error scores.
- It is expected that since the autoencoder has learned to reconstruct valid transactions, there will be higher reconstruction error values for the fraudulent ones when they are evaluated against the teacher model.

The core benefit of this solution is that, by using the values of the reconstruction errors, we obtain a distribution that might be able to provide us with different insights when our original feature set is trained against it.

An equally important factor, which renders the above useful to explore, is that we do not require the same wealth and volume of data for the student network as it is desired to have as minimal a model as possible. Of course, there is a wide breadth of models that can be employed on the teacher network side and might be worth experimenting with, including density estimation models (KDE, GMM) or ensemble methods.

This architecture is demonstrated and explained in more detail in the case study below, where more than 80 years of weather metrics data (collected daily) is utilised in order to observe the effects of climate change through anomaly detection.

Employing knowledge distillation

The previously suggested solution can be considered as a specialised form that utilises the teacher-student dynamic in order to guide the student network training process. However this design does not really fall under the traditional KD space, as there is no use of a distillation loss involving both soft and hard targets.

Essentially, the KD scenario would require the presence of a balanced dataset of records on both the teacher’s and the student’s networks. By applying supervised learning on both these networks’ models, we can obtain logit outputs that correspond essentially to a probability distribution (binary in the case we distinguish between anomalous and non-anomalous records). All the logits provided by the teacher’s model output can be transformed to labels that are referred to as soft targets. The true labels from our training dataset are called hard targets.

The main idea of combining these two groups of targets is as follows:

Hard target loss: Compute the standard loss (e.g., cross-entropy) between the student model’s predictions and the true labels.
Soft target loss: Compute the loss between the student model’s predictions and the soft targets provided by the teacher model. This is often done using Kullback-Leibler (KL) divergence.
Total loss: Combine the hard target loss and soft target loss, typically using a weighted sum.

By following this approach, we transform the scenario into a KD task, leveraging the teacher model’s soft targets to train a more efficient student model. We can see that this is essentially an instantiation of a response-based knowledge distillation using an offline training scheme.

Additionally, we could have the teachers’ feature maps be matched with the student network ones, following the feature-based distillation technique and thus focus more on the features’ representation when defining a distillation loss function.

Conclusions

To sum up, this example is an indicative case of how the main structure of two distinct networks can support a variety of training approaches. The number of options to explore can become quite large and complex, but as we have highlighted previously, it is essential to prioritise what we consider as satisfactory results and most importantly, how do we measure our performance in order to determine significant impact and ultimately, success.

Using a student model provides the following benefits:

Adaptability and simplicity: A smaller model that can be executed many times at a much lower cost than the teacher model, and yet retains many of the teacher model’s properties.
Because the model is smaller, it can be executed on a wider range of hardware, potentially on edge devices like phones.
Regularisation and eventually better generalisation as the simpler, student model tries to align its predictions with the ones made by the teacher model.
Enhanced interpretability when simple models are selected (e.g., decision trees).

But it is important that practitioners who harness this technique are aware of the following issues:

Poor selection or training of the teacher model can lead to error propagation and affect the student network’s performance.
Scalability: considering the complexity and size of the teacher model, large and ever-increasing data could slow down the training process and render it computationally expensive.
Data drift detection which might not be as straightforward to detect on both the teacher and the student models.

Case study: The teacher-student architecture for weather data analysis

In this section, we will delve into a case study specifically designed to showcase the teacher-student network architecture we proposed above and how this can be employed in a real-world scenario. Our analysis will involve the collection of daily weather metrics of New York City spanning the years 1940 to 2024. Our focus will be on understanding how this architecture can be applied to a given dataset and most importantly, how we can extract meaningful insights by practically combining autoencoders with decision trees.

Loading the data

To begin with, we load and preprocess our weather dataset, which includes features like temperature, precipitation and other atmospheric conditions. Our source was open-meteo, a historical weather API that allows users to select the features and time period they would like to explore and download. The individual records we collected represent daily average measurements reported for New York City over a span of more than 80 years (1940–2024).

Figure 6: Sample view of the collected weather metrics data

After the required cleansing, preprocessing and standardisation of the data, the final feature set we ended up using had the schema shown in Figure 7. This will include, among others:

features coming from the one-hot encoding transformation of the weather_code feature
the cyclical encoding of the month values through the sine and cosine trigonometric functions
moving averages of different weather metrics (both on a 30 and a 90-day window)

The main hypothesis behind this example is that the impact of climate change can be observed by the emergence of more anomalous days in the recent years with respect to their daily averaged metrics. Therefore, in order to render the experiment unbiased, we had to keep aside any time-dependent or time-indicating features before feeding the prepared dataset to our autoencoder network.

Figure 7: Final output schema after preprocessing and cleansing

Teacher network: Autoencoders

The teacher network in this architecture involves a multilayer perceptron (MLP) autoencoder. Although there are autoencoders such as the long short-term memory (LSTM), which is a better fit for time-series data as well as making predictions using historical patterns, we opted for the MLP one for two reasons:

Reconstruction errors could be mapped in a straightforward manner to the individual day records of our dataset, compared to having to work with LSTMs and their day-windows.
The premise of our argument is that any anomalous records to be detected will be attributed exclusively to the weather conditions metrics. Any temporal or seasonal factors had to be excluded from our analysis so it would not impact the training process.

On the premise that earlier decades represent less extreme weather conditions, we have selected the metrics between the years 1940 and 1980 as the training dataset to be used. Once the autoencoder is trained, we calculate the reconstruction error, which serves as an indicator of how well the model has learned to represent the data. High reconstruction errors may point to anomalies or unusual patterns in the data.

In Figure 8, we observe that the distribution of daily reconstruction errors is heavily right-skewed, affirming that only a small number of daily records cannot be reconstructed as accurately as the others.

Figure 8: Distribution of reconstruction error scores across daily input records

Figure 9 depicts the yearly average reconstruction error score over the years following the cutoff period for training our autoencoder model.

Figure 9: Yearly average reconstruction error across out-of-sample data

A particularly striking observation is that between 1985 and 2015 — a period unseen by the autoencoder — the yearly average score almost consistently shows lower values, never surpassing the extreme value threshold, which is set at the 2-standard-deviation mark of all reconstruction errors. However, over the past few years the score appears to have risen to a consistently higher average level, around the threshold of 2 standard deviations, and even surpassing it on three occasions.

Student network: Tree-based models

The student network aims to approximate the performance of the teacher network using a simpler model, such as a decision tree or random forest. In our experiment, we split our metrics data chronologically and trained a decision tree on all historical records, excluding the last 20 years, which were used as the test set.

By training this student model to predict the reconstruction errors obtained from our autoencoder, we gain insights into which features are most influential in the model’s predictions, as shown in Figure 10:

Figure 10: Feature importances of the student, decision tree model

A worthwhile experiment was the attempt to enhance the predictive capability of our model with respect to certain key attributes (such as temperature) by adding the reconstruction error to our feature set.

Interestingly, this approach does not significantly improve the prediction of important features, such as the daily mean temperature. Figure 11 depicts the weak direct correlation between the prediction error for temperature and the reconstruction error of the corresponding records. This suggests that whilst the autoencoder captures complex patterns, these may not be directly useful for predicting individual attributes.

Figure 11: Correlation between the prediction error for the temperature feature and the reconstruction error

Correlation analysis

Considering the previous finding, it was intriguing to observe whether other attributes demonstrate a stronger relation with our reconstruction errors instead. A detailed correlation analysis reveals that the reconstruction error is highly correlated with features such as precipitation_sum, rain_sum and snowfall_sum. This indicates that higher precipitation levels are often associated with higher reconstruction errors, suggesting that unusual weather patterns (e.g., heavy rain) are harder for the model to reconstruct accurately.

Figure 12: Correlation between reconstruction error and weather condition features

This finding proved to be extremely important as it allowed us to identify a narrow list of features that determine a record as anomalous. Interestingly enough, this seems to be aligned with the observations made by scientists over the last few years that showcase the substantially increased precipitation New York City has been experiencing recently (NYC on track for more intense rainfall, flooding and heatwaves), a trend that is likely to continue into the coming decades.

Comparison with Isolation Forest

Finally, we compare the performance of our autoencoder-based approach with an isolation forest, another popular technique for anomaly detection. Interestingly, both methods identify similar anomalous behaviours in the data, as shown by the strong negative correlation between the two scores (pearson correlation coefficient equal to -0.852). The lower the isolation forest’s output score is, the more anomalous the instance is likely to be.

Figure 13: Correlation between the autoencoder’s reconstruction error and the isolation forest’s anomaly score

Additionally, precipitation-related features play a significant role in both models. However, there is a useful and important distinction between those two approaches.

Our proposed solution managed to pinpoint with high confidence 3 features that determine whether a day’s measurements constitute an anomaly or not. On the other hand, the isolation forest displayed a more varied list of high-importance features (Figure 14) and although it did include the same attributes our teacher-student approach identified (precipitation, rainfall, snow), these seem to have featured lower in its ranking order.

Figure 14: Isolation forest’s feature importances

Summary

This exploration demonstrates the potential of combining autoencoders with tree-based models to analyse large and complex weather data. Whilst the autoencoder provides a robust method for detecting anomalies, the student network (tree-based models) offers interpretability, allowing us to understand which features drive the model’s predictions. Future work could explore alternative feature sets or different autoencoder or teacher model architectures (as mentioned previously) to further enhance predictive performance.

References

[Gou et al., 2021] Jianping Gou, Baosheng Yu, Stephen J. Maybank, and Dacheng Tao. 2021. Knowledge Distillation: A Survey. Int. J. Comput. Vision 129, 6 (Jun 2021), 1789–1819. https://doi.org/10.1007/s11263-021-01453-z

[Zhuang, Fuzhen, et al., 2020] Zhuang F, Qi Z, Duan K, Xi D, Zhu Y, Zhu H, Xiong H, He Q. A comprehensive survey on transfer learning. Proceedings of the IEEE. 2020 Jul 7;109(1):43–76.

[Hinton et al., 2015] Hinton, Geoffrey. “Distilling the Knowledge in a Neural Network.” arXiv preprint arXiv:1503.02531 (2015).

[J. -T. Huang et al., 2013] J. -T. Huang, J. Li, D. Yu, L. Deng and Y. Gong, “Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers,” 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 2013, pp. 7304–7308, doi: 10.1109/ICASSP.2013.6639081.

[Bucilua et al., 2006] Cristian Buciluǎ, Rich Caruana, and Alexandru Niculescu-Mizil. 2006. Model compression. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD ‘06). Association for Computing Machinery, New York, NY, USA, 535–541. https://doi.org/10.1145/1150402.1150464 .

Teacher-student network architecture was originally published in gft-engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

Open Source LLMs for Business: Delivering Value, Ensuring Privacy, and Reducing Costs

Aaron Zhao — Wed, 06 Nov 2024 14:19:43 GMT

Introduction

ChatGPT, Google’s Gemini, Anthropic’s Claude, and other advanced Generative AI (GenAI) models have revolutionised how we engage with technology by providing powerful capabilities such as summarisation, creative writing, translation, sentiment analysis, code creation, and more. Typically developed and hosted by large AI organisations like OpenAI and Google, these models rely on vast, sophisticated infrastructure to deliver their impressive performance. However, innovations like quantisation and distillation have made it feasible to significantly reduce model sizes with minimal loss of performance. This enables open-source models to run efficiently on personal computers, offering performance that rivals more powerful setups, and making AI-driven tasks more accessible than ever before.

In this blog, we demonstrate how easy and practical it is to run a GenAI application on a MacBook using open-source Large Language Models (LLMs) through a case study of an AI text editor powered by locally hosted open-source LLMs. We believe these open-source alternatives can provide businesses with significant advantages compared to their proprietary counterparts, including greater flexibility, cost-effectiveness, and control over customisation. By leveraging open-source LLMs, companies can harness the power of GenAI while maintaining autonomy and potentially reducing reliance on external infrastructure.

The rise of open-source LLMs

Table 1. Performance comparison of finetuned Llama 3 models with competing models on key benchmark evaluations [1]

Open-source LLMs have made significant advancements achieving performance comparable to or even exceeding their closed-source counterparts. Table 1 shows the benchmark evaluation results released in Meta’s research paper [1]. We can draw the following insights from it

1. For most capable Llama3.1 model (405B), its performance is on par with the newest flagship closed-source models (Claude 3.5 Sonnet and GPT-4o). It even scores higher on 7 out of the 16 public benchmark tests, e.g., IFEval, GSM8K, ZeroSCROLLS/QuALITY, across all categories except coding

2. The 70B model is significantly better than its closed model of similar size, GPT-3.5 Turbo on 12 out of 13 available benchmark tests

3. Even more strikingly, compared with GPT-3.5 Turbo, the 8B model (smallest version of Llama3.1) still outperforms on 8 out of 13 available benchmark tests, across all the categories, i.e., general, coding, math, reasoning, tool use, multilingual. Noticeably, IFEval, MGSM see big margin wins of more than 10 points by Llama3.1 8B.

Similar observations of open-source LLMs outperforming or matching proprietary models can be seen in evaluations of Mistral Large 2 [2], another open-source LLM. This supports the assertion that ‘The State of the Art is Open,’ as noted in [3]

Don’t use a sledgehammer to crack a nut

From a business point of view, not all tasks demand the most sophisticated tools. Just as a bicycle can get me to my gym a mile away as efficiently as a car, without the added complexity and cost, tasks like translation, summarisation, grammar correction, sentiment analysis, and more can often be handled comfortably by relatively small open-source LLMs, without the need for heavy machinery.

Another scenario is when users only require text processing. In our case study, a text-only model will meet their needs, while a multimodal model capable of handling images or audio would not provide additional value and would be more costly to utilise.

Benefits of Hosting LLMs Locally

The benefits of hosting LLMs locally can be summarised as reduced costs, enhanced security and privacy, lower latency, improved robustness and greater transparency.

Reduced cost — Depending on the LLM provider, pricing models can be subscription-based with a certain quota limit (e.g., number of requests or tokens per minute) or pay-as-you-go (charged based on usage and measured by the number of tokens processed and produced). Regardless of the pricing model, organisations with a significant number of frequent users will incur substantial costs. However, lightweight LLMs, with model sizes smaller than 8GB, can be comfortably hosted on a modern PC, making the cost almost negligible. For larger models, deployment can be done on dedicated servers, either on-premises or in the cloud. Although this might be more expensive upfront due to deployment costs, it can still be more economical in the long run, especially for larger teams with frequent usage.

Enhanced security and privacy — Sharing data with third-party LLM providers over the internet raises privacy and security concerns, which can be challenging or even prohibited in a highly regulated industry such as financial service. Leveraging locally hosted LLMs ensures that data remains on the user’s device or approved infrastructure, thereby meeting data security and privacy requirements.

Lower Latency — Running lightweight LLMs locally can significantly reduce the time needed to process and deliver results compared to using broadband connections for remote LLM services.

Improved robustness — Running everything locally enhances the robustness of the service, as relying on internet-based LLM services like ChatGPT or Gemini can be disrupted by internet issues or service outages.

Greater Transparency — Although transparency is a general challenge for all LLMs due to their black-box nature, open-source LLMs offer greater visibility. They provide insights into their training data, model evaluation processes, and overall development, allowing users to understand and trust the model more.

Case Study

In this section we showcase how to build an intelligent writing assistant empowered by locally hosted open-source LLMs.

Description of the Application

Figure 1. The interface of the text editor. When users highlight a piece of text, it provides a list of operations on the text, which then calls LLM to process the text accordingly and provide the output in the bottom left text area for user to review

The LLM-powered writing assistant app looks as Figure 1, and it helps writers to:

1. Edit their writings. It offers predefined text editing capabilities such as spelling and grammar correction, tone adjustment, quality enhancement, translation, and summarisation.

2. Create drafts or brainstorm ideas for blogs and articles. It enables free-form text generation for ideation and draft creation based on user instructions.

Figure 2 shows a flow chart that demonstrates how the application works. On the left hand of the flow the application accepts a text selection from the user, then a selection menu is shown that allows the user to choose from a predefined or user-defined (free-format) operations such as translation, summarisation, expansion. The application then calls LLM to generate edits, which the user reviews. If the edits are satisfactory, the original text is replaced; otherwise, the user can retry for further adjustments. The process repeats until the user opts not to continue, at which point they can save the final version.

Figure 2. Flowchart of how the AI writing assistant work

Architecture

The architecture diagram in Figure 3 outlines a potential enterprise setup for the AI text editing application, with deployment options on-premise or in the cloud:

1. User Access: Users authenticate via Single Sign-On (SSO) for secure access.

2. Frontend UI: A web-based interface for users to input text, select text editing operations and receive results, etc

3. Text Editor App Backend: The Ollama client in the backend processes user requests, sending them to the LLM backend.

4. LLM Backend: The Ollama server hosts models like Mistral AI, Llama 3.1, and Phi-3, returning edited text to the client.

5. File Storage: Stores user files and outputs securely.

Figure 3. Architecture diagram of a potential enterprise setup for the AI text editor

The demo app in this case study adopts a simpler architecture and it runs on a personal MacBook. Underlying the application are three key components:

A GUI where users write, specify model and parameters and request text editing with LLMs
Ollama for hosting open-source LLMs such “gemma2”, “llama3.1:8b”, “phi3:14b”, and “qwen2:7b”
Python app for running the GUI and call the LLMs

Setup and Running

To set up and run the environment, follow these steps:

1. Download and Install Ollama:

● Visit Ollama Download [4] and follow the instructions to download and install the Ollama application.

2. Install Python and Dependencies:

● Ensure Python is installed on your system.

● Install the necessary Python dependency by running the following command:

pip install ollama

3. Run the Application:

i. Start the Ollama application you installed in the first step and run it in the background.

ii. Open a terminal and download the model files by running ollama run : For example, to download the Llama3.1 8B model, run:

ollama run llama3.1:8b

iii. Finally, start the Python application by running:

python main.py

Results

Here is a demo video [9], specifically demonstrating the following functionalities of the app:

1. model selection & temperature specification

2. fixing grammar and spelling errors

3. making it professional

4. translating into Chinese and English

5. rejection and retry options

6. summarisation

7. customer instruction: some tips for making perfect coffee

8. saving output into a file

https://medium.com/media/cb267285a223d30b1ab1ac382e1ec57d/href

We used an example letter with grammar and spelling errors sourced online [6]

Hello,

My name is Susan. I’m forteen and I life in Germany. My hobbys are go to discos, sometimes I hear music in the radio. In the summer I go bathing in a lake. I haven’t any brothers or sisters. We take busses to scool. I visit year 9 at my school. My birthday is on Friday. I hope I will become a new guitar.

I’m looking forward to get a e-mail from you.

Yours,

Susan

Figure 5. Comparison between the texts before (left) and after (right) the “Fix spelling & grammar” operation

By selecting the “Fix spelling & grammar” option, as shown in Figure 5, the original text (left) has been corrected for spelling and grammar (right).

If the user opts for further editing with the “Make it professional” option, Figure 6 illustrates the additional improvements made to the text.

Figure 6. Comparison between texts before (left) and after (right) the “Make it professional” operation

For more examples of other text-editing features, readers can check out the video

Conclusion

In recent months and years, open-source LLMs have surged in popularity, achieving state-of-the-art performance on benchmark tests and even surpassing proprietary models. Open-source LLMs offer several inherent advantages over their closed counterparts, including cost efficiency, enhanced security, reduced latency, increased robustness, and greater transparency — all crucial factors for informed business decisions.

In this blog, we explore how leveraging open-source models can lead to valuable, private, and cost effective applications. As demonstration, we showcase an end-to-end solution with an intelligent text editor, powered by locally-hosted LLMs via Ollama.

Last but not least, when integrating LLMs into your solution, we encourage you to consider the specific goals you need to achieve, your budget, data security requirements, and the desired performance and robustness. Often, you will find that an open-source LLM can effectively meet your needs.

Acknowledgements

Special thanks to Dr. Simon Thompson for the valuable discussions, feedback and comments.

Reference

1. Llama Team, AI @ Meta (2024) The Llama 3 Herd of Models. Available at: https://arxiv.org/pdf/2407.21783 (Accessed: 5 November 2024).

2. Mistral (2024) Mistral Large 2407. Available at: https://mistral.ai/news/mistral-large-2407/ (Accessed: 8 August 2024).

3. DeepLearning.AI (2024) The Batch, Issue 260. Available at: https://www.deeplearning.ai/the batch/issue-260/ (Accessed: 8 August 2024).

4. Ollama (2024) Download Ollama. Available at: https://ollama.com/download (Accessed: 5 August 2024).

5. YouTube (2024) ai text editor demo [online video]. Available at: https://www.youtube.com/watch?v=cG7ul-cVnzc&feature=youtu.be (Accessed: 5 August 2024).

6. Englisch-Hilfen (n.d.) Error text: Letter. Available at: https://www.englisch-hilfen.de/en/exercises/structures/error_text_letter.htm (Accessed: 5 August 2024)

About the Author

Aaron Zhao is a Senior Data Scientist at GFT, leveraging AI to drive innovation for leading banks and insurance firms across the UK. He holds a Ph.D. in Engineering from the University of Cambridge.

Open Source LLMs for Business: Delivering Value, Ensuring Privacy, and Reducing Costs was originally published in gft-engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

Evaluating an LLM code documentation generation application

Aaron Zhao — Thu, 05 Sep 2024 08:26:38 GMT

Introduction

Business problem — All software developers know the importance of good quality code documentation. It bridges the gap between the code authors and new developers. It helps business users, who may not be familiar with the code, understand the software’s functionalities. Additionally, it allows developers to record their achievements, as people tend to forget details over time. However, high-quality documentation is often lacking because it is time-consuming and less stimulating to produce compared to writing code.

In our case study, the client is a large organisation with many existing code repositories, most of which are poorly documented or not documented at all. Traditionally, like most organisations, they relied on developers to manually generate technical documents. According to the client, it takes a developer an average of four weeks to create these documents, which is both time-consuming and costly. Therefore, the client approached us to see if GenAI can assist in this process.

Specifically, the document they expect to create follows the below pre-defined structure:

1. Overview
2. Functional Overview
3. Design
3.1 Technology Stack
- Programming Languages
- Main Frameworks and Libraries
- Additional Technologies
4. External Interfaces (Minimum)
4.1 Upstream Connections
4.2 Downstream Connections
5. Insights
6. Entry points
7. Data Manipulation (SQL) (Optional)

LLMs have excelled in various fields, achieving exceptional performance in applications such as translation, sentiment analysis, question answering, creative writing, conversational chatbots, etc. However, there are some general challenges associated with LLMs, such as

Hallucination — LLMs can confidently make false statements, which is a well-known issue.
Inconsistency — Even with the same inputs and with temperature set as 0, there are variations in the results generated, as documented below by Google Cloud [1] for Gemini model

“A temperature of 0 means that the highest probability tokens are always selected. In this case, responses for a given prompt are mostly deterministic, but a small amount of variation is still possible.”

For example, with a temperature of 0, asking Gemini-1.5-pro-001 the same question twice “Why is the sky blue? Answer it in three sentences”, we will get the results returned as below, where the differences are obvious though their semantic meanings are almost the same.

Lack of transparency — Due to the black-box nature, it is difficult to interpret the model outputs and understand why a particular output was produced.
Instructions in prompt ignored by LLM — Use more detailed instructions such as XML or Markdown rather than plain text (as discussed here: Reddit Post [2]) will alleviate the problem. However, these tweaks are often experiential rather than scientific and there is no guarantee they will always work.

Consequently, we often lack confidence in the reliability and accuracy of their outputs. This absence of confidence highlights a critical need: effective evaluation methods for LLM outputs. Without robust evaluation techniques, we cannot reliably determine when to trust these models and when to be sceptical.

Challenges — The challenges around evaluating document quality arise from the following aspects:

An evaluation dataset is not available, i.e., no existing documents that are complete and reliable to compare with LLM-generated docs.
There is a lack of well-defined quantitative metrics for evaluating the quality of free-text output. This challenge has been elaborated in a recent article by Andrew Ng [3]. In traditional ML tasks, we have well-defined quantitative metrics for evaluating performance, such as MSE for regression problems and confusion matrices, recall, and precision for classification problems. Based on these, we can optimise the model. However, for tasks like document generation, it is difficult to define or select quantitative metrics for evaluating performance.
Human evaluation is much more reliable. However, companies usually don’t have the luxury of a dedicated army of reviewers, and people often have other priorities in their daily jobs. Also, individual developers are usually only familiar with specific repositories, and hardly anyone can know all the details about all repositories.

In this case study, we explore options for effectively evaluating the technical documents generated by the Gemini-1.5-pro-001 model for existing source code repositories, thereby building confidence in its ability to generate reliable documentation.

Methodology

Given the challenges mentioned above, our approach to evaluating the quality of the documentation is two-fold:

Automatic evaluation — Assess the quality of the LLM-generated text against information extracted from existing README files.
Human evaluation — We are fortunate to have a group of reviewers who can provide both quantitative and qualitative evaluations of the document quality.

Automatic Evaluation

The repository typically includes a README file, but there are some data quality issues to note:

The README files do not follow the desired structure as above, so for comparison purposes, we need to extract the relevant fields first. We use an LLM to extract these fields from the README file and then manually review them to ensure accuracy.
Although all repositories have a README file, some are copied from a template and are irrelevant to the specific repository. These README files are excluded from comparison.

In total, we identified 6 useful README files out of 11 for further comparison with LLM-generated documents. This underscores the lack of quality documentation in many code repositories. Since the existing README files did not adhere to the predefined structure, we first extracted the relevant fields for comparison. Specifically, we focused on extracting the following fields: overview, functional overview, technology stack, external interfaces, insights, entry points, and data manipulation (SQL). We use an LLM (Gemini-1.5-pro-001, to be specific) to perform these information extraction tasks, as LLMs have proven highly effective in this area.

Here is the prompt used for information extraction:

prompt = """You are an AI documentation assistant, and your task is to extract important information from README files.
Specificly, from the README file, you need to extract the following key components:
# 1.0verview
# 2. Functional Overview
# 3. Technology Stack
# 4. External Interfaces (Minimum)
## 4.1 Upstream Connections
## 4.2 Downstream Connections
# 5. Insights
# 6. Entry points
List of the endpoints (if any), the corresponding method and description as a table
# 7. Data Manipulation (SQL) (Optional)
For the extracted results:
- Return the results in JSON format with the following keys only: Overview, Functional Overview, Technology Stack, External Interfaces, Insights, Entry points, Data Manipulation (SQL)
- Exclude the Mermaid diagrams, such as sequence diagrams, flowcharts from extractions
- Never add ```json on the beginning and do not add ``` at the end
The README to extract information from is as following, delimited by triple backticks: \\n
```{readme)```.
"""

Here is an example of the JSON output:

{"Overview": "An overview of the applicaiton",
"Functional Overview": "Explain the functionalities of components",
"Technology Stack": "",
"External Interfaces": {
"Upstream Connections": "upstream connections",
"Downstream Connections": "downstream connections"},
"Insights": "Insights of the applicaiton as bullet points",
"Entry points": "None",
"Data Manipulation (SQL)": "None"}

We also extract JSON information from LLM-generated documents, which is easier due to the clear mapping between JSON keys and Markdown section headings.

The quality of the LLM-extracted sections from the README files and the LLM-produced documents is validated manually to ensure they are correct. From the inspection of the extraction results, due to the incompleteness of the human-authored documents, only sections of ‘overview’, ‘functional overview’ and ‘technology stack’ are useful for the evaluation purpose.

There is a wide variety of similarity measures such as Euclidean distance, edit distances and Levenshtein distance. Considering we are comparing the semantic similarities between the LLM-generated and human-authored texts, we choose cosine similarity as the evaluation metric. Figure 1 illustrates the process of calculating the cosine similarities. We first take the section of interest, e.g., the overview section, from the LLM-generated document and the extraction of the corresponding existing README file. Then, call the pre-trained Sentence Transformers model (all-mpnet-base-v2) to generate their vectorised representations, i.e., embeddings, based on which we can compute the cosine similarity between the two pieces of texts. This model can process texts up to 384 tokens in length, which corresponds to approximately 256 to 320 words — longer than any of the texts we need to generate embeddings for, with the longest being 174 words. As a result, we can effectively capture the semantic meanings of these texts in 768-dimensional vectors.

Figure 1. Automatic evaluation process

Human evaluation

We also request a group of experienced developers familiar with the repositories to review the LLM-generated documents. These individuals have hands-on experience with the code, providing them with in-depth knowledge of how the application works and its intricacies.

Their task is to assess the documents based on four key criteria: accuracy, completeness, user satisfaction and overall quality. Here is the definition of the metrics:

Accuracy — Measure the accuracy of the Al-generated documentation by comparing it against manually created documentation by domain experts. Where documentation does not already exist, please score this accordingly.
Completeness — Assess whether the Al-generated documentation covers all essential aspects of the application, including technology stack, functionalities, dependencies, configurations and usage instructions.
User Satisfaction — Gather feedback from users, developers and stakeholders to assess the usefulness and clarity of the Al-generated documentation compared to manually created documentation. Provide the ratings in three categories, high, medium and low.

Consistency tests

To assess the consistency of LLM outputs, we conducted ten tests using the same prompt with the model temperature set to 0. The tests were run on the repository with the highest number of lines of code. Pairwise similarity scores, both edit-distance and semantic-based metrics, were calculated for each section. The results were visualised through heat maps to illustrate the consistency across outputs.

Non-functional metrics

In addition to the above metrics, we also use two non-functional metrics:

$ per document generation
seconds per document generation

Results

Here are the evaluation results. Please note that repository names have been anonymised.

Figure 2. Cosine similarity scores between human-authored and LLM-generated document sections across different repositories

As shown in Figure 2, across the sections, the LLM-generated contents match best with human-authored contents in the technology stack section, scoring the highest average similarity of 0.72. This is followed by the overview section, with an average similarity score of 0.65. The functional overview section shows the lowest similarity, with an average score of 0.56, and for one repository, it was only 0.32.

The observed performance differences across different sections generally match expectations. For the technology stack section, the task is more like information extraction, which does not require the LLM to understand the code structure and interdependencies. For the overview and functional overview sections, these are understanding and summarisation tasks, so the LLM needs a good grasp of each piece of code, as well as the structure and relationships among various components. The functional overview requires more detailed information; thus a deeper understanding of the code is needed.

Furthermore, when comparing the functional overview sections from the existing README files, a noticeable difference emerges across the repositories. For instance, the three repositories with the lowest similarity scores to the LLM-generated content — mistral, spectrum and avalon — tend to present their functional overviews as basic lists of functions or components, lacking descriptive details. In contrast, repositories like frostbite and cascade, which have higher similarity scores with the LLM-generated content, provide more descriptive and detailed functional overview sections. The LLM-generated functional overviews are expectedly more descriptive because in the prompt, we specifically ask LLMs to “Explain the functionalities of the component with bullet points”. This helps understand the variation in similarity across the repositories.

According to Manralai [4], a score above 0.6 is considered a strong indicator of similarity, 0.3–0.6 indicates some similarity, and below 0.3 indicates dissimilarity. Hence, adopting the same interpretation here, the LLM-generated technology stack and overview sections are very similar to the human-authored counterparts, whilst the LLM-generated functional overview indicates some similarity but requires more scrutiny.

Figure 3. Accuracy and completeness of the LLM-generated documents based on human evaluation. (Note: The data has been rescaled to a 0 to 5 scale. The original ratings were assigned on a scale of 0 to 100, with increments of 20.)

Figure 3 shows that the accuracy and completeness both receive high scores based on human evaluation. A few exceptions include:

Missing data manipulation information — it accounts for the relatively low score on completeness for velocity which contains four SQL files but the document presents no information available for the SQL section
The functional overview lacking expected detail — it accounts for the relatively low score on completeness for cascade
A misrepresentation of workflow in the sequence diagram — it is the reason why sparkle scored 3 for both accuracy and completeness as “the sequence diagram shows what eventually happens, but the sparkle app just marks the extract as ready in the database” as explained by the reviewer

Figure 4. Count of components by user satisfaction

As presented in Figure 4, documents generated for all eleven repositories were marked either Medium (7) or High (4) in terms of user satisfaction.

Figure 5. Example heat-maps of pairwise Levenshtein ratios

The consistency evaluation by edit-distance-based Levenshtein ratio generally showed strong alignment, though some differences were observed across sections:

The ‘overview’ and ‘entry points’ sections were identical across all 10 tests.
In the ‘functional overview,’ ‘technology stack,’ and ‘external interfaces’ sections, 9 out of 10 runs were identical, with only Run 8 (R8) showing slight variations. The Levenshtein ratios for R8, compared to the other runs, were 0.91, 0.96, and 0.99 for ‘functional overview,’ ‘technology stack,’ and ‘external interfaces,’ respectively.
Notable differences were found in the ‘data manipulation (SQL)’ and ‘insights’ sections as shown in Figure 5. In the ‘insights’ section, R3 and R5–R7 produced identical results, whilst the remaining runs varied. The greatest dissimilarity was observed between R1 and R10, with a Levenshtein ratio of 0.60. In ‘data manipulation (SQL),’ four distinct versions emerged: version 1 (R1 and R5), version 2 (R2–R4), version 3 (R6–R9), and version 4 (R10). The largest difference was between version 1 and version 2, with the lowest Levenshtein ratio of 0.83.

Additionally, semantic consistency was evaluated using pairwise cosine similarity scores, calculated in the same way as described in the automatic evaluation section. The 10 runs produced very similar results overall. Five out of seven sections — ‘overview,’ ‘functional overview,’ ‘technology stack’, ‘external Interfaces’, and ‘entry points’ — were semantically identical across all tests, with a cosine similarity score of 1. The remaining sections, ‘insights’ and ‘data manipulation (SQL),’ had most pairwise cosine similarity scores above 0.95, with the lowest score being 0.88.

Table 1 lists non-functional statistics regarding document generation, with an average cost of $0.53 per document and a time cost of 41.8 seconds per document generation. According to the client, generating this type of document previously required a developer’s involvement over two sprints (160 hours). Now, with the use of LLMs, it takes approximately one day (8 hours), including the time needed for human evaluation and modifications to ensure accuracy and quality. This represents a 20-fold increase in efficiency. Assuming a daily developer cost of £250, this translates to a total savings of £4,750 per document creation.

Table 1. Non-functional metrics of the document generation

Conclusion

This case study demonstrates a methodology for enhancing confidence in LLM outputs through effective evaluation techniques. Whilst the evaluation of LLM-generated content presents challenges and is often task-specific, this example provides a framework of best practices. The evaluation process encompasses the following key steps:

Conduct a comprehensive problem analysis to define the criteria for a successful outcome.
Identify and utilise available evaluation resources, including human reviewers and existing comparative documents.
Integrate human review processes where feasible and accessible.
Develop and implement clear, measurable evaluation metrics that cover both qualitative and quantitative aspects.
Evaluate the quality of the LLM outputs based on the selected metrics, perform error analysis, and identify areas for improvement.

Through the application of rigorous evaluation methodologies, organisations can significantly increase their confidence in LLM-generated content, thereby maximising the potential of these advanced tools to address diverse business requirements and drive innovation.

Acknowledgments

Special thanks to Dr. Alastair Gill and Dr. Simon Thompson for the valuable discussions, feedback and comments, which have significantly contributed to the improvement of this work. I would also like to extend my gratitude to the GFT AI Impact team for their support in the document generation efforts.

Reference

[1] Google Cloud. (n.d.) Understanding Images with Generative AI. Available at: https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/image-understanding (Accessed: 1 August 2024).

[2] Reddit user. (2024) ‘Why LLM won’t follow the instructions’, r/PromptEngineering, 31 July. Available at: https://www.reddit.com/r/PromptEngineering/comments/1aqmpev/why_llm_wont_follow_the_instructions/?rdt=49996 (Accessed: 1 August 2024).

[3] DeepLearning.AI. (2024) ‘Heart-Risk Model Saves Lives, Self-Driving on Unruly Roads, and more’, The Batch, Issue 251, 29 May. Available at: https://www.deeplearning.ai/the-batch/issue-251/ (Accessed: 1 August 2024).

[4] Manralai. (2023) ‘Cosine Similarity Formula Explained in Easy Words’, Medium, 15 February. Available at: https://manralai.medium.com/cosine-similarity-formula-explained-in-easy-words-d501e42a3cb5 (Accessed: 1 August 2024).

Evaluating an LLM code documentation generation application was originally published in gft-engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.