Engineering AI Systems
Architecture and DevOps Essentials
Prof. Dr. Ingo Weber
Chair of Information System Development and Operation
Technical University of Munich, School of CIT, CS Department
Fraunhofer-Gesellschaft, Director for AI & Innovation
Coauthors:
Len Bass
Q. Lu, L. Zhu
Book website:
https://research.csiro.au/ss/
team/se4ai/ai-engineering/
Unless otherwise specified, figures are from the book and subject to copyright
Introduction
AI & Foundation Models
AI System Life Cycle
Qualities
Summary
Engineering AI Systems | Prof. Dr. Ingo Weber
2024 study:
• 80% of AI projects do not go into production
− https://www.rand.org/pubs/research_reports/RRA2680-1.html (2024)
− Included ML projects but not projects that used pre-trained LLMs (prompt engineering)
− This is twice the failure rate of non-AI projects
− Or, from the opposite viewpoint: the success rate is only a third of that of regular software projects (20% vs. 60%)
Optimism about AI systems persists
Quality of software: how well does it do its job – along all relevant dimensions
→ Does the software fulfil the (implicit and explicit) requirements in terms of functionality and quality attributes?
ISO/IEC 25010:2011 Quality Model
Software quality
AI systems with high quality: meet quality
expectations in terms of conventional software
portions, AI model(s), and the combination
DevOps is a set of practices intended to reduce the time between committing a
change to a system and the change being placed into normal production, while
ensuring high quality. [DevOps book, 2015]
AI engineering is the application of software engineering principles and
techniques to the design, development, and operation of AI systems.
[AI Engineering book, 2025]
The term MLOps, analogous to DevOps, encompasses the processes and tools not
only for managing and deploying AI and ML models, but also for cleaning,
organizing, and efficiently handling data throughout the model life cycle.
[AI Engineering book, 2025]
Terminology
Introduction
AI & Foundation Models
AI System Life Cycle
Qualities
Summary
The specifics of the system architecture depend on the type of AI being used, nowadays mostly:
• Narrow ML Models or
• Foundation Models (or combinations)
Types of AI systems
Overview of Model Types
▪ Narrow ML Models – Designed for specific tasks using custom-collected and labeled datasets.
▪ Foundation Models (FMs) – Large, pretrained on extensive datasets for general purposes, adaptable to a variety of
specific tasks.
• Large language models (LLMs) are a subtype of FMs
▪ Symbolic AI (a.k.a. Good-old-fashioned AI, GOFAI): rule-based systems, symbolic reasoning, AI planning,
knowledge graphs, etc.
• Not discussed in depth here
Terminology: Narrow ML Models and Foundation Models
Artificial Intelligence
Symbolic AI
Machine Learning
Narrow ML Foundation Models
Large Language Models
… are statistical models intended for a specific task.
Input to a narrow ML model is a data item consisting of a set of values of labelled independent variables.
Output is a generated value of the dependent variable.
What are narrow ML models used for?
Suitable for three main purposes
• Classification – assigns a category to an input. E.g., this email is spam
• Regression – returns a continuous value for an input. E.g.,
− this particular process will take 3 days to complete,
− this pipe will break in the next 6–24 months (and should be replaced)
• Clustering – groups similar items. E.g., this data set has 10 groups, clustered by age
Narrow ML models
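To make the first of the three purposes concrete, here is a minimal, illustrative sketch (not from the book) of classification with a 1-nearest-neighbour model in plain Python; the "spam"/"ham" labels and the data points are invented for illustration:

```python
import math

# Toy training set: (feature vector, label) pairs -- invented example data
train = [((1.0, 1.0), "spam"), ((8.0, 9.0), "ham"), ((7.5, 8.0), "ham")]

def classify(query):
    """Assign a category to an input: the label of the nearest training point."""
    nearest = min(train, key=lambda item: math.dist(item[0], query))
    return nearest[1]

print(classify((1.2, 0.9)))  # nearest point is (1.0, 1.0) -> "spam"
```

Regression and clustering follow the same fit-then-apply shape, just with a continuous output or group assignments instead of a label.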
Ethical concerns
Interpretability and explainability
Generalization and overfitting
Robustness and adversarial attacks
What are the problems with narrow models?
What are Foundation Models?
A Foundation Model (FM) is trained on an extensive and diverse dataset, often comprising billions or even trillions of data points.
The training data is largely unlabeled.
FMs are general purpose but can be customized for particular applications.
Large language models (LLMs) are a type of FM.
AI-generated figure
▪ Natural Language Processing (text)
• Machine translation
• Question answering
• Summarization
▪ Image generation and classification
▪ Object detection
▪ Code generation
▪ Video / audio generation
▪ Brainstorming
▪ Rapid prototyping
…and many other applications
What are Foundation Models used for?
Data Privacy and Security
Sensitive Data Leakage / Exposure
Misuse and Misinformation
Deepfakes and Fake Content
Hallucination? Confabulation?
What are some problems with Foundation Models?
What are some problems with Foundation Models?
Hallucination vs. Confabulation
“Create a photo of the moon surface”
(not an actual example) → Hallucination
…vs. confabulation
AI-generated figures
▪ Recap: Compact Overview of Model Types
• Narrow ML Models – Designed for specific tasks using custom-collected and labeled datasets
• Foundation Models (FMs) – Pretrained for general purposes, i.e. not one specific purpose; adaptable to a variety of specific tasks
▪ Current Use in AI Systems
• Most AI systems from before 2024 integrate narrow ML models and non-AI components
that handle specialized tasks and system functionalities
• FMs are being increasingly explored and utilized for their adaptability and broad
application potential
Central design decision in AI system:
Choosing between Narrow ML Models and FMs
▪ Advantages of Narrow ML Models
• Tailored to specific tasks, potentially offering higher precision for particular applications
• Greater control over the model development process, allowing for customized
adjustments and integration of safety measures (guardrails)
▪ Challenges with Narrow ML Models
• High cost of data collection, labeling, and computation during training
• Potentially limited by the quality and quantity of available training data
Advantages / Challenges with Narrow ML Models
▪ Advantages of Foundation Models:
• Can reduce the amount of human labor and time required for model development.
• Offer a flexible base that can be fine-tuned for accuracy improvements across diverse
tasks.
• Economical in terms of reuse and scalability, especially when shared within an
organization or accessed via API.
▪ Challenges of FMs:
• Training effort infeasible for many organizations
• Potentially high cost / compute requirements for inference (i.e. runtime use)
• Updating
• Limited grounding
• Hallucination / confabulation
Benefits / challenges of using Foundation Models
▪ Choosing between model types depends on specific project needs, resource availability,
and desired control over the model characteristics
▪ Organizations must weigh the ease of implementation and broad applicability of FMs
against the specificity and customizable aspects of narrow ML models
Trade-offs and Decision Factors
▪ Cost is a key factor in deciding between Narrow ML models and FMs
▪ Cost factors include:
• Development cost – similar for the overall system; lower when using existing FMs
• Maintenance cost – depends on customization / model size
• Operation costs – typically higher for FMs
Costs of Narrow ML Models and FMs
AI-generated figure
Narrow ML Models can Co-exist with FMs
▪ Adapting foundation models (FMs) to specific applications involves various strategies to refine
their functionality and ensure alignment with use case requirements.
▪ Techniques for Customizing FMs
• Prompt Engineering - Crafting specific prompts or input sequences that guide the FM towards
producing desired outputs.
• Retrieval-Augmented Generation (RAG) - Enhancing the FM's responses by dynamically
incorporating external knowledge or data during the generation process.
• Fine-Tuning - Adjusting the model’s parameters on a specific dataset to improve performance
for particular tasks.
• Distilling - Simplifying and compressing the model to enhance efficiency while retaining performance; useful for deployment in resource-constrained environments.
Customizing Foundation Models
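A minimal sketch of the RAG idea from the list above, assuming keyword-overlap retrieval as a stand-in for a real vector store; the document texts and the prompt template are invented for illustration:

```python
def retrieve(query, documents, k=2):
    """Rank documents by word overlap with the query (stand-in for embedding search)."""
    q_words = set(query.lower().split())
    ranked = sorted(documents,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return ranked[:k]

def build_prompt(query, documents):
    """Prepend the retrieved context so the FM can ground its answer in it."""
    context = "\n".join(retrieve(query, documents))
    return f"Context:\n{context}\n\nQuestion: {query}"

docs = ["the warranty covers two years",
        "shipping takes five days",
        "returns require a receipt"]
print(build_prompt("how long is the warranty", docs))
```

In a production system, the overlap scoring would be replaced by embedding similarity against an indexed document store; the prompt-assembly step stays essentially the same.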
Complexity vs. time/organizational maturity:
• Prompt engineering is the easiest for organizations
• Implementing RAG is more difficult
• Fine-tuning and distilling require sophistication
• Creating a bespoke FM requires a high degree of sophistication
▪ Guardrails are designed to monitor and control the
inputs and outputs of
• foundation models
• users
• RAG
• external tools
▪ Goal is to meet specific requirements, including
• function
• accuracy
• aspects required by policy (e.g. the AI Act), standards, and laws
Guardrails
Stock image, CC licence
▪ Input guardrails are applied to the inputs received from users, and their possible effects include
refusing or modifying user prompts.
▪ Output guardrails focus on the output generated by the foundation model, and may modify the
output of the foundation model or prevent certain outputs from being returned to the user.
▪ RAG guardrails are used to ensure the retrieved data is appropriate, either by validating or
modifying the retrieved data where needed.
▪ Execution guardrails ensure that the called tools or models do not have any known
vulnerabilities and the actions only run on the intended target environment and do not have
negative side-effects.
▪ Intermediate guardrails can be used to assert that each intermediate step meets the necessary
criteria.
Different Types of Guardrails
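The input and output guardrail types above can be sketched as a chain around a model call; the blocked terms, the redaction rule, and the fake model are invented placeholders, not real policy or a real API:

```python
def input_guardrail(prompt: str) -> str:
    """Refuse prompts that match a (hypothetical) block list."""
    blocked = {"credit card number", "password"}
    if any(term in prompt.lower() for term in blocked):
        raise ValueError("prompt refused by input guardrail")
    return prompt

def output_guardrail(text: str) -> str:
    """Modify model output before it reaches the user (here: redact a marker)."""
    return text.replace("INTERNAL", "[redacted]")

def guarded_call(model, prompt: str) -> str:
    """Run the model only on vetted input, and vet its output."""
    return output_guardrail(model(input_guardrail(prompt)))

def fake_model(prompt):
    # Stand-in for the FM call
    return f"Answer to '{prompt}' (INTERNAL note attached)"

print(guarded_call(fake_model, "What is the refund policy?"))
```

RAG, execution, and intermediate guardrails fit the same pattern: each wraps one step of the pipeline with a validate-or-modify check.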
Types of guardrails in NVIDIA NeMo
“NeMo Guardrails is an open-source toolkit for easily adding programmable
guardrails to LLM-based conversational applications.”
Slight changes in terminology: guardrails → rails; RAG guardrails → retrieval rails;
Intermediate guardrails → dialog rails.
Source: https://github.com/NVIDIA/NeMo-Guardrails
Introduction
AI & Foundation Models
AI System Life Cycle
Qualities
Summary
Quality is determined by:
• Software architecture
• Development processes
• Code quality
• Model quality
Quality Influences in AI Systems
AI System Life Cycle
System Build
Create an executable
artifact
System Release
Approve for deployment
System Test
Ensure high test coverage
& automate tests as much
as possible
Design
Design architecture to fulfill
requirements, achieve goals, and
support other activities
System Dev
Perform normal development
activities & create scripts for
other activities
Deploy
Move into production environment
Operate & Monitor
Execute system,
gather measurements,
display metrics,
detect anomalies
Analyze
Analyze measurements taken
during operation, user feedback,
new developments, etc.
Model Build
Create an executable artifact
Model Release
Approve for integration in system
Model Test
Test accuracy, risks, biases, etc.
AI Model Dev
Model selection, exploration,
hyperparameter tuning, data
management, model preparation
and training, model evaluation
(Life-cycle figure: Design → Dev → Build → Test → Release → Deploy → Operate & Monitor → Analyze, with an inner loop for the AI model: Dev → Build → Test → Rel.)
AI models are just one component (each)
The AI model processes are for developing and testing the AI model
AI System Life Cycle
Sufficient training data – rule of thumb: roughly 10x the number of parameters in the model
Distribution of training data
• Reflect distribution of requests
• Corrected for biases
• Adjusted to reflect security concerns
Data quality is important
Data drift – training data is no longer representative of actual data after some period of time. E.g. housing prices have changed
Environmental change – e.g. training data represents housing prices in urban areas, but the model is applied to rural areas
Regulatory changes – evolving regulations can affect issues such as fairness, transparency, accountability, and human oversight
− Imagine the road rules changing…
Problems with data
AI models are just one component (each)
The AI model processes are for developing and testing the AI model
The remaining processes are for both the AI and non-AI components
• Create non-AI components
• Build the AI system from both AI components and non-AI components
• Test it, deploy it, and operate it
AI System Life Cycle
▪ Interdisciplinary teams should be set up to jointly develop the portion of the design that affects
both the AI and the non-AI modules
▪ Interdisciplinary teams can frequently have problems with vocabulary and cultural differences
▪ Bruce Tuckman characterized team formation as
“forming, storming, norming, performing”
▪ Different expertise may be needed at different times, e.g., UX, security, …
Interdisciplinary Teams
Co-design and development
• Allows negotiation about how functions should be divided
• Ideally, some developers will have expertise across both fields
Using microservices
• Some components may not make good microservices, e.g. due to communication issues (latency, etc.)
Design for modifiability
• Intermediary between the client of a component and the component itself, e.g. translating clients’
invocations into the new model interface
Design
Development of non-AI modules is similar to non-AI systems; testing (and developing tests) is not
A module is a coherent piece of functionality
A service is constructed through the integration of multiple modules and their supporting libraries during the build step.
• Using the language and technologies the team chooses
Unit test:
• Tests each module with defined inputs and expected outputs
• During development and again during the testing phase
• Unit tests cover
− Common cases: typical scenarios
− Edge cases: boundary of input ranges or special conditions
− Negative cases: invalid inputs where the module should fail gracefully
Develop Non-AI Modules
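As an illustration of the three case types above, a hypothetical module together with its unit tests; the function, its contract, and the test values are invented for illustration:

```python
def parse_percentage(text: str) -> float:
    """Hypothetical module: convert '42%' to 0.42, validating the input."""
    if not text.endswith("%"):
        raise ValueError("expected a percentage string like '42%'")
    value = float(text[:-1])
    if not 0 <= value <= 100:
        raise ValueError("percentage out of range")
    return value / 100

# Common case: a typical scenario
assert parse_percentage("42%") == 0.42
# Edge cases: boundaries of the input range
assert parse_percentage("0%") == 0.0
assert parse_percentage("100%") == 1.0
# Negative case: invalid input should fail gracefully
try:
    parse_percentage("abc")
except ValueError:
    pass
else:
    raise AssertionError("invalid input was accepted")
```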
CI is triggered when releasing a new AI model or committing a change
CI retrieves all elements, builds a new image, and runs tests
If tests pass, the image is deployed to a staging server
The CI server also creates build metadata, such as a version manifest, dependency graph, and package manifest
Build
Repeat model/unit tests from AI/non-AI module development
System-wide test for end-to-end execution of the system
• User stories
• Regression testing
• Compatibility testing
• Efficiency/performance testing
• Security testing
• Compliance testing
Non-repeatable testing results, caused e.g. by
• Test database has not been restored
• Background tasks are running
• Probabilistic AI models may produce different results at different times
→ need to control all sources of randomness to achieve repeatable tests
Test
Have a single, well-defined release and deployment process – unchanged DevOps principle
• May require a human to release the system to deployment for regulatory/legal reasons
− Frequency of new releases: occur on a defined schedule or continuously
• Continuous release/deployment patterns
− Blue/green deployment: deploy multiple instances of a new service simultaneously and route all new requests to
instances of the new service while the instances of the old service are gradually drained and terminated
− Rolling upgrade: gradually replace instances of old service with new service, typically one at a time
− Canary testing: release a new version to a selected group of users to test the new version
− A/B testing: compare two versions and determine which one is better
− Roll back: revert to a previous version
Release and Deploy (Cont’d)
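Canary testing, from the pattern list above, can be sketched as deterministic, hash-based request routing, so that a given user consistently sees the same version; the percentage and the version names are illustrative assumptions:

```python
import hashlib

def route(user_id: str, canary_percent: int = 5) -> str:
    """Map a user to a stable bucket in [0, 100) and send a slice to the canary."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "canary" if bucket < canary_percent else "stable"

# The same user always lands on the same version:
assert route("user-123") == route("user-123")
```

A/B testing uses the same routing idea with two full versions and a comparison of outcome metrics per bucket; rolling back simply means setting the canary fraction to zero.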
Additional reasons for updating AI systems
• Data drift
− Data distribution the model encounters in production can differ from the data it was originally trained on
• Improved data
− The model may encounter additional/updated data in production that wasn’t available during the initial training
• Changing requirements
− The requirements for the model may change over time, e.g., add new features to the model or improve the
accuracy
• Biased/vulnerable models
Release and Deploy
Collect and analyze data to inform the next round of development
Monitoring Sources and metrics
• Infrastructure Metrics
− resource utilization including CPU usage, memory consumption, I/O, and network traffic of VMs and
containers
• System code metrics
− Application-specific metrics such as number of active users, number of requests created, API response
time
− Need higher-level metrics to assess AI systems, e.g., sentiment, precision of answers, number of
retry prompts
• Logs
− Text records generated by services and stored locally or moved to databases for analysis, such as events, errors, and state changes
Operate, Monitor, and Analyze
Introduction
AI & Foundation Models
AI System Life Cycle
Qualities
Summary
A system is constructed to satisfy organizational objectives, which can be manifested as a set of quality requirements.
Quality requirements for a system are formalized as quality attributes
(QAs)
• A QA is a measurable or testable property of a system that is used to
indicate how well the system satisfies the needs of its stakeholders beyond
the basic function of the system
• Stated for both AI and non-AI portions
Qualities
Reliability
• Operate over time under various specified conditions and without failure.
• AI model performs effectively when introduced to new data
− This new data should be within or similar to the distribution of the training data, not out of distribution (OOD).
Robustness
• Operate despite unexpected inputs
• AI model is able to handle out-of-distribution (OOD) data and adversarial input
Resilience
• Operate despite changes in the environment
• In AI, this is characterized by the model’s intrinsic ability to continue functioning effectively amidst significant disruptions.
Reliability, Robustness, Resilience
Performance
Efficiency: utilize resources effectively while maintaining acceptable speed and
responsiveness, in addition to fulfilling its functional requirements.
• Throughput: number of tasks processed within a specific time frame
• Latency: time it takes for a system to complete a task
• Scalability: handle increasing workloads without degrading performance
− Training scalability: train AI models as training task size increases (e.g. number of
model parameters, data volumes)
− Inference scalability: deploy and operate AI systems for inference as the model size
or the number of requests increases
• Resource utilization: how efficiently a system uses its computational resources
− Training time resource utilization: the effective use of computational resources for
training
− Inference time resource utilization: the model’s resource efficiency in making
predictions or generating outputs
Efficiency
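Throughput and mean latency, as defined above, can be measured with a simple timing harness; the squaring workload is an invented stand-in for a model's inference call:

```python
import time

def benchmark(fn, inputs):
    """Run fn over all inputs; report throughput (items/s) and mean latency (s)."""
    start = time.perf_counter()
    for x in inputs:
        fn(x)
    elapsed = time.perf_counter() - start
    return {"throughput": len(inputs) / elapsed,
            "mean_latency": elapsed / len(inputs)}

stats = benchmark(lambda x: x * x, range(10_000))  # stand-in workload
print(stats)
```

Note that mean latency over a sequential loop hides tail behaviour; real benchmarks also record per-request latencies and report percentiles.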
Architecture Design
• Use an interface style like GraphQL
• Optimal placement of containers and VMs
• Caching
• Edge computing
• Mode switcher
− Selecting components/models
− FM as a software connector
Architecture Approaches to Improving Efficiency
Model Choice
• Assess resources required for both training and inference
• Compared to deep neural networks, simpler models like decision trees or naive Bayes are more resource friendly for
both training and execution.
Hyperparameters
• Adjusting hyperparameters (e.g. learning rate) to reduce the number of training iterations
• Employing regularization hyperparameters (like L1 or L2 penalties) to simplify the model
• Utilizing early stopping to halt the training process once validation performance stops improving
• Reducing the precision of the model parameters
Data sampling
• Using random sampling to select a representative sample of the dataset for training
Feature engineering
• e.g. reducing the number of features or using hashing to handle high cardinality features
Operation
• Process multiple data points simultaneously during inference
• Use lazy loading or model checkpointing to load only the necessary parts of the model at runtime
Process Approaches to Improving Efficiency
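Early stopping, mentioned under hyperparameters above, can be sketched over a recorded validation-loss curve; the loss values and the patience setting are invented for illustration:

```python
def early_stop_epoch(val_losses, patience=2):
    """Return the epoch of the best model, stopping once the validation loss
    has failed to improve for `patience` consecutive epochs."""
    best_loss, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_epoch

print(early_stop_epoch([1.0, 0.8, 0.7, 0.75, 0.8, 0.9]))  # best model at epoch 2
```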
Accuracy refers to a system’s ability to produce expected or true results.
• AI: how well the AI system’s predictions or outputs align with the expected or true results.
Accuracy Metrics for Classification Models
• Accuracy Rate
• Precision
• Recall
• F1 Score
Accuracy Metrics for Regression Models
• Mean Squared Error (MSE)
• Root Mean Squared Error (RMSE)
• Mean Absolute Error (MAE)
• R-squared (R2)
Accuracy Metrics for Foundation Models
• Bilingual Evaluation Understudy (BLEU) Score
• Recall-Oriented Understudy for Gisting Evaluation (ROUGE) Score
Accuracy
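For binary classification, the listed metrics follow directly from the true and predicted labels; the example labels below are invented:

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the given positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(precision_recall_f1([1, 1, 0, 0], [1, 0, 1, 0]))  # (0.5, 0.5, 0.5)
```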
Repeatability: produce the same output for the same input under identical conditions.
Some systems may not be deterministic because of factors like multi-threaded design or other background activities.
AI systems may be non-deterministic
• Rely on randomness during training, such as initializing weights or selecting optimization paths.
Output variability is desirable in certain AI applications, like creative writing or generating various strategic
recommendations.
Performance measurement, such as with benchmarks, requires multiple executions and statistical analysis.
Repeatability
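One way to control a source of randomness, as the slide suggests, is to route all random draws through an explicitly seeded, isolated generator; the weight-initialization framing is an invented example:

```python
import random

def init_weights(n, seed=42):
    """Draw n Gaussian initial weights from an isolated, seeded generator."""
    rng = random.Random(seed)  # does not disturb the global random state
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

# Identical seed -> identical weights -> repeatable experiments
assert init_weights(5) == init_weights(5)
```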
Architecture design
• Multi-model decision-making
− Using different models to perform the same task or make a single decision.
− Consensus protocols can be defined to make a final decision, such as taking the majority decision or accepting only
unanimous results from all models.
− The end user or operator reviews the output from multiple models and makes the final decision.
Approaches to Improving Accuracy
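The consensus protocols mentioned above can be sketched as follows; the decision labels are invented:

```python
from collections import Counter

def majority_vote(predictions):
    """Final decision = most common prediction across models (ties: first seen)."""
    return Counter(predictions).most_common(1)[0][0]

def unanimous(predictions):
    """Accept a decision only if every model agrees; otherwise defer (None)."""
    return predictions[0] if len(set(predictions)) == 1 else None

print(majority_vote(["approve", "approve", "reject"]))  # approve
print(unanimous(["approve", "reject"]))                 # None -> human review
```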
Choice of hyperparameters
• Learning rate
− Too low: slow progress
− Too high: diverging training, disrupting the learning process
• Number of trees in a random forest
− too many trees may cause overfitting
− too few may lead to underfitting
Approaches to Improving Accuracy
Data preparation
• Sufficient and representative training data
• Bias-free data
• Incorporating samples of OOD data to handle unexpected inputs
• Addressing Missing Data
− Use imputation techniques to estimate and fill in missing values or remove rows/columns that contain excessive
missingness
• Managing Outliers
− Clip outliers to a certain range or eliminate them entirely
• Splitting the data into three main sets with the same distribution (80%, 10%, 10%)
− Use cross-validation techniques to split the data if the training dataset is too small
Approaches to Improving Accuracy
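The 80/10/10 split above can be sketched as shuffle-then-slice; seeding the shuffle also makes the split repeatable (the ratios and seed are illustrative):

```python
import random

def split_dataset(data, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle, then slice into train/validation/test sets."""
    items = list(data)
    random.Random(seed).shuffle(items)  # shuffling keeps the distributions similar
    n = len(items)
    n_train, n_val = int(n * ratios[0]), int(n * ratios[1])
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]

train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))  # 80 10 10
```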
Model preparation - feature engineering
• Scaling features to a common range so they contribute equally to training
− e.g. using standardization or normalization techniques for larger-range or high-value features
• Converting categorical data to numerical values so the model can process the data
− e.g. using one-hot encoding or label encoding
• Deriving new features from existing data
− e.g. calculating "time since last purchase"
• Applying transformations like log or square root to make the relationship between
features and target variables more linear
• Selecting the most influential features for the model
− e.g., using chi-square tests for categorical targets or deriving feature importance
scores from ensemble methods like random forests.
• Reducing the number of features using techniques like Principal Component
Analysis (PCA)
Approaches to Improving Accuracy
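Two of the feature-engineering steps above, standardization and one-hot encoding, sketched in plain Python; the category list is invented:

```python
def standardize(values):
    """Scale a numeric feature to mean 0 and standard deviation 1."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values] if std else [0.0] * len(values)

def one_hot(value, categories):
    """Convert a categorical value into a 0/1 indicator vector."""
    return [1 if value == c else 0 for c in categories]

print(one_hot("red", ["red", "green", "blue"]))  # [1, 0, 0]
```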
Model generation – model evaluation provides insights into how well the model is likely to perform on unseen data
• Evaluation metrics: evaluate them once the model is fully trained, and monitor them throughout the training phase
− Metrics for classification models: accuracy rate, precision, recall, F1 score
− Metrics for regression models: Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute
Error (MAE), R-squared (R2)
Approaches to Improving Accuracy
Operation
• The accuracy of a model is assessed based on known ground truth values.
• It's essential to periodically test the model for accuracy to ensure that it remains effective over time.
− If accuracy issues are detected, the model may need to be retrained with new data or adjusted to address any
identified issues.
− The updated model should undergo a rigorous deployment process to ensure that it meets the operational
standards and accuracy requirements before being fully implemented.
• Accuracy Testing Techniques
− Confusion Matrix - visualizing the performance of a model across different classes - helps identify imbalances
− Statistical tests like the Kolmogorov-Smirnov test can be used to compare the distribution of new data against the
training data – detecting data/concept drift
Approaches to Improving Accuracy
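A confusion matrix, as mentioned in the testing techniques above, can be built directly from labels; the class names are invented:

```python
def confusion_matrix(y_true, y_pred, labels):
    """matrix[true_label][predicted_label] = count of such (true, predicted) pairs."""
    matrix = {t: {p: 0 for p in labels} for t in labels}
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

m = confusion_matrix(["cat", "cat", "dog"], ["cat", "dog", "dog"], ["cat", "dog"])
print(m)  # {'cat': {'cat': 1, 'dog': 1}, 'dog': {'cat': 0, 'dog': 1}}
```

Off-diagonal counts show which classes are being confused, which also helps spot class imbalances.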
FM-based Systems: general-purpose nature makes it hard to achieve consistent accuracy across various domains
without specific adjustments
• Fine-tuning: Adapting the model on a more specific dataset
• Retrieval-Augmented Generation (RAG) - Integrating external data during generation
• Prompt Engineering - Crafting inputs (prompts) that guide the model to generate more accurate and contextually
appropriate outputs.
• Reinforcement Learning from Human Feedback (RLHF) - Adjusting the model's parameters based on feedback from
human evaluators.
• Adversarial Testing and Benchmarks - Testing the model against carefully crafted adversarial inputs or established
benchmarks to ensure robustness and accuracy.
• Acceptance Testing and Test Case Assessment - Systematically testing the model/system to verify performance
before full-scale deployment.
Approaches to Improving Accuracy
CIA Properties
• Confidentiality – only authorized parties can access data or resources.
• Integrity - data is accurate, complete, and has not been altered or modified by unauthorized parties
• Availability – there is timely and reliable access to data, systems, and services.
Authentication means that the persons or systems trying to access a service are who they claim to be.
Authorization refers to the rights the user has, i.e., they are authorized to perform certain operations like accessing a file or
approving an invoice.
Security
Attacks in an AI System
Architectural Approaches – based on zero trust
• Authorization, encryption, least functionality, limiting access, privilege minimization
Process Approaches
• Threat modeling and risk assessment for identifying/assessing vulnerabilities
• New attacks on AI/FM systems: use AI to track and analyze attacks
• Maintaining versions of data items as well as their lineage to ensure the integrity of the model
• Adversarial testing to detect vulnerabilities
• Logging and auditing can help to detect anomalies
Approaches to Mitigating Security Concerns
More complex than in conventional systems
• Not only data access and storage, but also the outputs and decisions of the AI system can potentially reveal sensitive information
Organizational approaches: appoint executives such as Chief Privacy Officers (CPOs) or Data Protection Officers
(DPOs). Their responsibilities include
• Developing and enforcing data privacy policies
• Monitoring organizational compliance with privacy laws
• Handling data subject requests, e.g. access or deletion
• Managing data breaches
• Educating and training employees on data privacy best practices
Process and architecture: Implementing LOCKED rights—Limit, Opt-out, Correct, Know, Equal, Delete
• E.g., machine unlearning, guardrails
Privacy in AI Systems
Fairness is not only about the outcomes of an AI system but also about the decision processes and model development processes, including data collection, algorithm design, and ongoing management of the system.
• For instance, in recruitment AI tools, fairness is challenged by historical biases present in the training data. An AI
system might inadvertently learn to prefer candidates from a certain demographic background, etc.
Organizational strategies
• Fairness teams: comprising members from data science, engineering, legal, and ethics departments to identify
fairness concerns at various stages.
• Third-Party Expertise: engaging external experts to audit AI systems. These experts provide impartial assessments and recommend strategies to address potential biases and ensure fairness.
• Employee Training: increasing awareness about the impact of bias and the importance of fairness.
Architecture
• Guardrails and monitoring
• Logging
Process Approaches
• Fairness Metrics: Demographic Parity, Equal Opportunity, Equalized Odds, Predictive Parity, Treatment Equality
• Fairness tools: AI Fairness 360 by IBM, Fairlearn by Microsoft, PAIR Tools by Google
Fairness
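Demographic parity, the first fairness metric listed above, compares positive-outcome rates across groups; the decision and group data below are invented:

```python
def demographic_parity_gap(decisions, groups):
    """Largest difference in positive-outcome rate between any two groups.
    decisions: 1 = positive outcome, 0 = negative; groups: group id per person."""
    rates = {}
    for g in set(groups):
        outcomes = [d for d, gg in zip(decisions, groups) if gg == g]
        rates[g] = sum(outcomes) / len(outcomes)
    return max(rates.values()) - min(rates.values())

# Group "a" gets 100% positive decisions, group "b" gets 50%: gap = 0.5
print(demographic_parity_gap([1, 1, 1, 0], ["a", "a", "b", "b"]))  # 0.5
```

Toolkits such as AI Fairness 360 and Fairlearn provide this and the other listed metrics with proper statistical treatment.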
Monitorability: ability to monitor and track predefined AI system/model quality metrics
• More reactive: identifying issues as they occur
• Solely relying on predefined metrics
Observability: ability to gain insights into the inner workings of the AI system at system level, the model level, and the
model development and deployment pipeline level, and into its operating infrastructure.
• More proactive: understanding an AI system, detecting data, model and concept drift early, and preventing incidents
• Leverages broader data sources, including metrics, logs, traces, events generated by the AI system
• Tools: Repositories for code and data, system logs, resource utilization monitoring infrastructure
Monitorability and Observability
Maintain logs to record system activity, ensuring all relevant events are captured for later analysis.
• Data Lineage Tools: track the history and flow of data used in model predictions
• Software Bill of Materials (SBOM): provide a detailed lineage of non-AI components
Explainable AI techniques
• Local explanation techniques: instance-based explanations for understanding the feature importance and correlations
that led to a specific output, e.g., LIME and SHAP
• Global explanation techniques: understand the general behavior of an AI system by using a set of data instances to
produce explanations
− Visualizing the relationship between the input features and the model’s output over a range of values
− Global surrogate models such as tree-based models or rule-based models
• Foundation model-based systems: think aloud vs. think silently
Co-versioning registry
Independent overseeing agents
Approaches to Ensure Observability
Engineering AI Systems | Prof. Dr. Ingo Weber 66
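The logging and co-versioning ideas above can be sketched as a structured log record that ties every prediction to the model and data versions that produced it, so later analysis can trace any output back to its lineage (all names and field values here are illustrative):

```python
import json
from datetime import datetime, timezone

def prediction_record(model_version, data_version, features, output):
    """Build a structured log entry linking a prediction to the
    model and data versions that produced it (co-versioning)."""
    return {
        "event": "prediction",
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_version": model_version,
        "data_version": data_version,
        "features": features,
        "output": output,
    }

record = prediction_record("churn-model:2.3.1", "customers:2024-05",
                           {"tenure_months": 14}, 0.82)
print(json.dumps(record))  # append to a log file or ship to a log store
```

Emitting such records as JSON lines makes them easy to ingest into the repositories and monitoring infrastructure listed above.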
Introduction
AI & Foundation Models
AI System Life Cycle
Qualities
Summary
67
Engineering AI Systems | Prof. Dr. Ingo Weber
Notes for instructors:
• Slides for the whole book are available
• All chapters include discussion questions
68
Engineering AI Systems | Prof. Dr. Ingo Weber
Book chapter overview
1. Introduction
2. Software Engineering Background
3. AI Background
4. Foundation Models
5. AI Model Lifecycle
6. System Lifecycle
7. Reliability
8. Performance
9. Security
10. Privacy and Fairness
11. Observability
12. Case Study: Using a Pretrained Language Model for Tendering
13. Case Study: Chatbots for Small and Medium-Sized Australian Enterprises
14. Case Study: Predicting Customer Churn in Banks
15. The Future of AI Engineering
Summary
69
Engineering AI Systems | Prof. Dr. Ingo Weber
Engineering AI Systems
Architecture and DevOps Essentials
Prof. Dr. Ingo Weber
Book website:
https://research.csiro.au/ss/
team/se4ai/ai-engineering/

Engineering AI Systems - A Summary of Architecture and DevOps Essentials

  • 1.
    Engineering AI Systems Architectureand DevOps Essentials Prof. Dr. Ingo Weber Chair of Information System Development and Operation Technical University of Munich, School of CIT, CS Department Fraunhofer-Gesellschaft, Director for AI & Innovation Coauthors: Len Bass Q.Lu, L. Zhu Book website: https://research.csiro.au/ss/ team/se4ai/ai-engineering/ Unless otherwise specified, figures are from the book and subject to copyright
  • 2.
    Introduction AI & FoundationModels AI System Life Cycle Qualities Summary 2 Engineering AI Systems | Prof. Dr. Ingo Weber
  • 3.
    2024 study: • 80%of AI projects do not go into production − https://www.rand.org/pubs/research_reports/RRA2680-1.html (2024) − Included ML projects but not projects that used pre-trained LLMs (prompt engineering) − This is twice the rate for non-AI projects −or, from the opposite viewpoint: the success rate is only a third of that of regular software projects (20% vs. 60%) Optimism about AI systems persists 3 Engineering AI Systems | Prof. Dr. Ingo Weber
  • 4.
    Quality of software:how well does it do its job – along all relevant dimensions → Does the software fulfil the (implicit and explicit) requirements in terms of functionality and quality attributes? ISO/IEC 25010:2011 Quality Model Software quality 4 Engineering AI Systems | Prof. Dr. Ingo Weber AI systems with high quality: meet quality expectations in terms of conventional software portions, AI model(s), and the combination
  • 5.
    DevOps is aset of practices intended to reduce the time between committing a change to a system and the change being placed into normal production, while ensuring high quality. [DevOps book, 2015] AI engineering is the application of software engineering principles and techniques to the design, development, and operation of AI systems. [AI Engineering book, 2025] The term MLOps, analogous to DevOps, encompasses the processes and tools not only for managing and deploying AI and ML models, but also for cleaning, organizing, and efficiently handling data throughout the model life cycle. [AI Engineering book, 2025] Terminology 5 Engineering AI Systems | Prof. Dr. Ingo Weber
  • 6.
    Introduction AI & FoundationModels AI System Life Cycle Qualities Summary 6 Engineering AI Systems | Prof. Dr. Ingo Weber
  • 7.
    The specifics ofthe system architecture depend on the type of AI being used, nowadays mostly: • Narrow ML Models or • Foundation Models (or combinations) Types of AI systems Engineering AI Systems | Prof. Dr. Ingo Weber 7
  • 8.
    Overview of ModelTypes ▪ Narrow ML Models – Designed for specific tasks using custom-collected and labeled datasets. ▪ Foundation Models (FMs) – Large, pretrained on extensive datasets for general purposes, adaptable to a variety of specific tasks. • Large language models (LLMs) are a subtype of FMs ▪ Symbolic AI (a.k.a. Good-old-fashioned AI, GOFAI): rule-based systems, symbolic reasoning, AI planning, knowledge graphs, etc. • Not discussed in depth here Terminology: Narrow ML Models and Foundation Models 8 Engineering AI Systems | Prof. Dr. Ingo Weber Artificial Intelligence Symbolic AI Machine Learning Narrow ML Foundation Models Large Language Models
  • 9.
    … are statisticalmodels Intended for a specific task Input to a narrow ML model is a data item consisting of a set values of labelled independent variables. Output is a generated value of a dependent value. What are narrow ML models used for? Suitable for three main purposes • Classification – assigns a category to an input. e.g. this email is spam • Regression – returns a continuous value to an input. E.g., − this particular process will take 3 days to complete, − this pipe will break in the next 6-24 months (and should be replaced) • Clustering – groups similar items. E.g. this data set has 10 groups, clustered by age Narrow ML models Engineering AI Systems | Prof. Dr. Ingo Weber 9
  • 10.
    Ethical concerns Interpretability andexplainability Generalization and overfitting Robustness and adversarial attacks What are the problems with narrow models? Engineering AI Systems | Prof. Dr. Ingo Weber 10
  • 11.
    What are FoundationModels? A Foundation Model (FM) is trained on an extensive and diverse dataset, often comprising billions or even trillions of data points. The training data is largely unlabeled, FMs are general purpose but can be customized for particular applications. Large language models (LLMs) are a type of FMs. Engineering AI Systems | Prof. Dr. Ingo Weber 11 AI-generated figure
  • 12.
    Natural Language Processing TextMachine translation Question answering Summarization Image generation and classification Object detection Code Generation Video / Audio generation Brainstorming Rapid prototyping …and many other applications What are Foundation Models used for? Engineering AI Systems | Prof. Dr. Ingo Weber 12
  • 13.
    Data Privacy andSecurity Sensitive Data Leakage / Exposure Misuse and Misinformation Deepfakes and Fake Content Hallucination? Confabulation? What are some problems with Foundation Models? 13 Engineering AI Systems | Prof. Dr. Ingo Weber
  • 14.
    What are someproblems with Foundation Models? Hallucination vs. Confabulation 14 Engineering AI Systems | Prof. Dr. Ingo Weber „Create a photo of the moon surface“ (not an actual example) → Hallucination …vs. confabulation AI-generated figures
  • 15.
    ▪ Recap: CompactOverview of Model Types • Narrow ML Models - Designed for specific tasks using custom-collected and labeled datasets • Foundation Models (FMs) - pretrained for general purposes, i.e.: not one specific purpose, adaptable to a variety of specific tasks ▪ Current Use in AI Systems • Most AI systems from before 2024 integrate narrow ML models and non-AI components that handle specialized tasks and system functionalities • FMs are being increasingly explored and utilized for their adaptability and broad application potential Central design decision in AI system: Choosing between Narrow ML Models and FMs 15 Engineering AI Systems | Prof. Dr. Ingo Weber
  • 16.
    ▪ Advantages ofNarrow ML Models • Tailored to specific tasks, potentially offering higher precision for particular applications • Greater control over the model development process, allowing for customized adjustments and integration of safety measures (guardrails) ▪ Challenges with Narrow ML Models • High cost of data collection, labeling, and computation during training • Potentially limited by the quality and quantity of available training data Advantages / Challenges with Narrow ML Models 16 Engineering AI Systems | Prof. Dr. Ingo Weber
  • 17.
    ▪ Advantages ofFoundation Models: • Can reduce the amount of human labor and time required for model development. • Offer a flexible base that can be fine-tuned for accuracy improvements across diverse tasks. • Economical in terms of reuse and scalability, especially when shared within an organization or accessed via API. ▪ Challenges of FMs: • Training effort infeasible for many organizations • Potentially high cost / compute requirements for inference (i.e. runtime use) • Updating • Limited grounding • Hallucination / confabulation Benefits / challenges of using Foundation Models 17 Engineering AI Systems | Prof. Dr. Ingo Weber
  • 18.
    ▪ Choosing betweenmodel types depends on specific project needs, resource availability, and desired control over the model characteristics ▪ Organizations must weigh the ease of implementation and broad applicability of FMs against the specificity and customizable aspects of narrow ML models Trade-offs and Decision Factors 18 Engineering AI Systems | Prof. Dr. Ingo Weber
  • 19.
    ▪ Cost isa key factor in deciding between Narrow ML models and FMs ▪ Cost factors include: • Development cost – similar for system, lower for using existing FMs • Maintenance cost – depends on customization / model size • Operation costs – typically higher for FMs Costs of Narrow ML Models and FMs 19 Engineering AI Systems | Prof. Dr. Ingo Weber AI-generated figure
  • 20.
    Narrow ML Modelscan Co-exist with FMs 20 Engineering AI Systems | Prof. Dr. Ingo Weber
  • 21.
    Narrow ML Modelscan Co-exist with FMs 21 Engineering AI Systems | Prof. Dr. Ingo Weber
  • 22.
    ▪ Adapting foundationmodels (FMs) to specific applications involves various strategies to refine their functionality and ensure alignment with use case requirements. ▪ Techniques for Customizing FMs • Prompt Engineering - Crafting specific prompts or input sequences that guide the FM towards producing desired outputs. • Retrieval-Augmented Generation (RAG) - Enhancing the FM's responses by dynamically incorporating external knowledge or data during the generation process. • Fine-Tuning - Adjusting the model’s parameters on a specific dataset to improve performance for particular tasks. • Distilling - Simplifying and compressing the model to enhance efficiency while retaining performance, useful for deployment on resource-constrained environments. Customizing Foundation Models 22 Engineering AI Systems | Prof. Dr. Ingo Weber
  • 23.
    Customizing Foundation Models 23 EngineeringAI Systems | Prof. Dr. Ingo Weber
  • 24.
    Prompt engineering isthe easiest for organizations Implementing RAG is more difficult Fine tuning and distilling require sophistication Creating a bespoke FM requires a high degree of sophistication Engineering AI Systems | Prof. Dr. Ingo Weber 24 Complexity Complexity vs time/organizational maturity
  • 25.
    ▪ Guardrails aredesigned to monitor and control the inputs and outputs of • foundation models • users • RAG • external tools ▪ Goal is to meet specific requirements, including • function • accuracy • aspects needed due to policy (AI Act, etc.), standards and laws. Guardrails 25 Engineering AI Systems | Prof. Dr. Ingo Weber Stock image, CC licence
  • 26.
    ▪ Input guardrailsare applied to the inputs received from users, and their possible effects include refusing or modifying user prompts. ▪ Output guardrails focus on the output generated by the foundation model, and may modify the output of the foundation model or prevent certain outputs from being returned to the user. ▪ RAG guardrails are used to ensure the retrieved data is appropriate, either by validating or modifying the retrieved data where needed. ▪ Execution guardrails ensure that the called tools or models do not have any known vulnerabilities and the actions only run on the intended target environment and do not have negative side-effects. ▪ Intermediate guardrails can be used to assert that each intermediate step meets the necessary criteria. Different Types of Guardrails 26 Engineering AI Systems | Prof. Dr. Ingo Weber
  • 27.
    Types of guardrailsin NVIDIA NeMo 27 Engineering AI Systems | Prof. Dr. Ingo Weber “NeMo Guardrails is an open-source toolkit for easily adding programmable guardrails to LLM-based conversational applications.” Slight changes in terminology: guardrails → rails; RAG guardrails → retrieval rails; Intermediate guardrails → dialog rails. Source: https://github.com/NVIDIA/NeMo-Guardrails
  • 28.
    Introduction AI & FoundationModels AI System Life Cycle Qualities Summary 28 Engineering AI Systems | Prof. Dr. Ingo Weber
  • 29.
    Quality is determinedby: • Software architecture • Development processes • Code quality • Model quality Quality Influences in AI Systems System qualities Development processes Engineering AI Systems | Prof. Dr. Ingo Weber 29
  • 30.
    AI System LifeCycle 30 Engineering AI Systems | Prof. Dr. Ingo Weber System Build Create an executable artifact System Release Approve for deployment System Test Ensure high test coverage & automate tests as much as possible Design Design architecture to fulfill requirements, achieve goals, and support other activities System Dev Perform normal development activities & create scripts for other activities Deploy Move into production environment Operate & Monitor Execute system, gather measurements, display metrics, detect anomalies Analyze Analyze measurements taken during operation, user feedback, new developments, etc. Model Build Create an executable artifact Model Release Approve for integration in system Model Test Test accuracy, risks, biases, etc. AI Model Dev Model selection, exploration, hyperparameter tuning, data management, model preparation and training, model evaluation AI Model Dev Deploy Operate & Monitor Build Release Analyze Test Build Dev Rel. Test
  • 31.
    AI models arejust one component (each) The AI model processes are for developing and testing the AI model AI System Life Cycle 32 Engineering AI Systems | Prof. Dr. Ingo Weber AI Model Dev Deploy Operate & Monitor Build Release Analyze Test Build Dev Rel. Test
  • 32.
    Sufficient training data– 10x number of parameters in the model Distribution of training data • Reflect distribution of requests • Corrected for biases • Adjusted to reflect security concerns Data quality is important Engineering AI Systems | Prof. Dr. Ingo Weber 33
  • 33.
    Data drift –training data is not representative of actual data after some period of time. E.g. housing prices have changed Environmental change – e.g. training data represents housing prices in urban areas, attempt to use model for rural areas. Regulatory changes – ongoing regulations can affect issues such as fairness, transparency, accountability, and human oversight - Imagine the road rules changing… Problems with data Engineering AI Systems | Prof. Dr. Ingo Weber 34
  • 34.
    AI models arejust one component (each) The AI model processes are for developing and testing the AI model The remaining processes are for both the AI and non-AI components • Create non-AI components • Build the AI system with both AI components and non-AI components • Test it, deployment it, and operate it AI System Life Cycle 35 Engineering AI Systems | Prof. Dr. Ingo Weber AI Model Dev Deploy Operate & Monitor Build Release Analyze Test Build Dev Rel. Test
  • 35.
    ▪ Interdisciplinary teamsshould be set up to jointly develop the portion of the design that affects both the AI and the non-AI modules ▪ Interdisciplinary teams can frequently have problems with vocabulary and cultural differences ▪ Bruce Tuckman characterized team formation as “forming, storming, norming, performing” ▪ Different expertise may be needed at different times, e.g., UX, security, … Interdisciplinary Teams 36 Engineering AI Systems | Prof. Dr. Ingo Weber
  • 36.
    Co-design and development •Allows negotiation about how functions should be divided • Ideally, some developers will have expertise across both fields Using microservices • Some components may not make good microservices, e.g. due to communication issues (latency, etc) Design for modifiability • Intermediary between the client of a component and the component itself, e.g. translating clients’ invocations into the new model interface Design Engineering AI Systems | Prof. Dr. Ingo Weber 37
  • 37.
    Development of non-AImodules is similar to non-AI systems; testing (and developing tests) is not A module is a coherent piece of functionality A service is constructed through the integration of multiple modules and their supporting libraries during the build step. • Using the language and technologies the team chooses Unit test: • Tests each module with defined inputs and expected outputs • During development and again during the testing phase • Unit tests cover − Common cases: typical scenarios − Edge cases: boundary of input ranges or special conditions − Negative cases: invalid inputs where the module should fail gracefully Develop Non-AI Modules Engineering AI Systems | Prof. Dr. Ingo Weber 38
  • 38.
    CI triggered whenreleasing a new AI model or committing a change CI retrieves all elements, build a new image, run tests If tests pass, the image is deployed into a staging server CI server also creates build meta data, such as version manifest, dependency graph, package manifest Build Engineering AI Systems | Prof. Dr. Ingo Weber 39
  • 39.
    Repeat model/unit testsfrom AI/non-AI module development System-wide test for end-to-end execution of the system • User stories • Regression testing • Compatibility testing • Efficiency/performance testing • Security testing • Compliance testing Non-repeatable testing results • Test database has not been restored • Have background tasks • Probabilistic AI models may produce different results at different times → need to control all sources of randomness to achieve repeatable tests Test Engineering AI Systems | Prof. Dr. Ingo Weber 40
  • 40.
    Have a single,well-defined release and deployment process – unchanged DevOps principle • May require a human to release the system to deployment for regulatory/legal reason − Frequency of new releases: occur on a defined schedule or continuously • Continuous release/deployment patterns − Blue/green deployment: deploy multiple instances of a new service simultaneously and route all new requests to instances of the new service while the instances of the old service are gradually drained and terminated − Rolling upgrade: gradually replace instances of old service with new service, typically one at a time − Canary testing: release a new version to a selected group of users to test the new version − A/B testing: compare two versions and determine which one is better − Roll back: revert to a previous version Release and Deploy (Cont’d) Engineering AI Systems | Prof. Dr. Ingo Weber 41
  • 41.
    Additional reasons forupdating AI systems • Data drift − Data distribution the model encounters in production can be different from the data it was originally trained • Improved data − The model may encounter additional/updated data in production that wasn’t available during the initial training • Changing requirements − The requirements for the model may change over time, e.g., add new features to the model or improve the accuracy • Biased/vulnerable models Release and Deploy Engineering AI Systems | Prof. Dr. Ingo Weber 42
  • 42.
    Collect and analyzedata to inform the next round of development Monitoring Sources and metrics • Infrastructure Metrics − resource utilization including CPU usage, memory consumption, I/O, and network traffic of VMs and containers • System code metrics − Application-specific metrics such as number of active users, number of requests created, API response time − Need higher-level metrics to assess AI systems, e.g., sentiment, precision of answers, number of retry prompts • Logs − Text records generated by services and stored locally or moved to databases for analysis, such events, errors, state changes Operate, Monitor, and Analyze Engineering AI Systems | Prof. Dr. Ingo Weber 43
  • 43.
    Introduction AI & FoundationModels AI System Life Cycle Qualities Summary 44 Engineering AI Systems | Prof. Dr. Ingo Weber
  • 44.
    A system isconstructed to satisfy organizational objectives, which can be manifested as a set quality requirements. Quality requirements for a system are formalized as quality attributes (QAs) • A QA is a measurable or testable property of a system that is used to indicate how well the system satisfies the needs of its stakeholders beyond the basic function of the system • Stated for both AI and non-AI portions Qualities Engineering AI Systems | Prof. Dr. Ingo Weber 45
  • 45.
    Reliability • Operate overtime under various specified conditions and without failure. • AI model performs effectively when introduced to new data − This new data should be within or similar to the distribution of the training data, not out of distribution (OOD). Robustness • Operate despite unexpected inputs • AI model is able to handle out-of-distribution (OOD) data and adversarial input Resilience • Operate despite changes in the environment • In AI, this is characterized by the model’s intrinsic ability to continue functioning effectively amidst significant disruptions. Reliability, Robustness, Resilience Engineering AI Systems | Prof. Dr. Ingo Weber 46
  • 46.
    Performance Engineering AI Systems| Prof. Dr. Ingo Weber 47
  • 47.
    Efficiency: utilize resourceseffectively while maintaining acceptable speed and responsiveness, in addition to fulfilling its functional requirements. • Throughput: number of tasks processed within a specific time frame • Latency: time it takes for a system to complete a task • Scalability: handle increasing workloads without degrading performance − Training scalability: train AI models as training task size increases (e.g. number of model parameters, data volumes) − Inference scalability: deploy and operate AI systems for inference as the model size or the number of requests increases • Resource utilization: how efficiently a system uses its computational resources − Training time resource utilization: the effective use of computational resources for training − Inference time resource utilization: the model’s resource efficiency in making predictions or generating outputs Efficiency Engineering AI Systems | Prof. Dr. Ingo Weber 48
  • 48.
    Architecture Design • Usean interface style like GraphQL • Optimal placement of containers and VMs • Caching • Edge computing • Mode switcher − Selecting components/models − FM as a software connector Architecture Approaches to Improving Efficiency Engineering AI Systems | Prof. Dr. Ingo Weber 49
  • 49.
    Model Choice • Assessresource required for both training and inference • Compared to deep neural networks, simpler models like decision trees or naive Bayes are more resource friendly for both training and execution. Hyperparameters • Adjusting hyperparameters (e.g. learning rate) to reduce the number of training iterations • Employing regularization hyperparameters (like L1 or L2 penalties) to simplify the model • Utilizing early stopping involves halting the training process • Reducing the precision of the model parameters Data sampling • Using random sampling to select a representative sample of dataset for training Feature engineering • e.g. reducing the number of features or using hashing to handle high cardinality features Operation • Process multiple data points simultaneously during inference • Use lazy loading or model checkpointing to load only the necessary parts of the model at runtime Process Approaches to Improving Efficiency Engineering AI Systems | Prof. Dr. Ingo Weber 50
  • 50.
    Accuracy refers toa system’s ability to produce expected or true results. • AI: how well the AI system’s predictions or outputs align with the expected or true results. Accuracy Metrics for Classification Models • Accuracy Rate • Precision • Recall • F1 Score Accuracy Metrics for Regression Models • Mean Squared Error (MSE) • Root Mean Squared Error (RMSE) • Mean Absolute Error (MAE) • R-squared (R2) Accuracy Metrics for Foundation Models • Bilingual Evaluation Understudy (BLEU) Score • Recall-Oriented Understudy for Gisting Evaluation (ROUGE) Score Accuracy Engineering AI Systems | Prof. Dr. Ingo Weber 51
  • 51.
    Repeatability: produce thesame output for the same input under identical conditions. Some systems may not be deterministic because of factors like multi-thread design or other background activities. AI systems maybe non-deterministic • Rely on randomness during training, such as initializing weights or selecting optimization paths. Output variability is desirable in certain AI applications, like creative writing or generating various strategic recommendations. Performance measurement, such as with benchmarks, requires multiple executions and statistical analysis. Repeatability Engineering AI Systems | Prof. Dr. Ingo Weber 52
  • 52.
    Architecture design • Multi-modeldecision-making − Using different models to perform the same task or make a single decision. − Consensus protocols can be defined to make a final decision, such as taking the majority decision or accepting only unanimous results from all models. − The end user or operator reviews the output from multiple models and makes the final decision. Approaches to Improving Accuracy Engineering AI Systems | Prof. Dr. Ingo Weber 53
  • 53.
    Choice of hyperparameters •Learning rate − Too low rate: slow progress − Too high rate: diverging training - disrupt the learning process • Number of trees in a random forest − too many trees may cause overfitting − too few may lead to underfitting Approaches to Improving Accuracy Engineering AI Systems | Prof. Dr. Ingo Weber 54
  • 54.
    Data preparation • Sufficientand representative training data • Bias-free data • Incorporating samples of OOD data - handle unexpected inputs • Addressing Missing Data − Use imputation techniques to estimate and fill in missing values or remove rows/columns that contain excessive missingness • Managing Outliers − Clip outliers to a certain range or eliminate them entirely • Splitting the data into three main sets with same distribution (80%, 10%, 10%) − Use cross-validation techniques to split the data if the training dataset is too small Approaches to Improving Accuracy Engineering AI Systems | Prof. Dr. Ingo Weber 55
  • 55.
    Model preparation -feature engineering • Scaling features to a common range so they contribute equally to training − e.g. using standardization or normalization technique for larger range or high value features • Converting categorical data to numerical values so model can understand the data − e.g. using one-hot encoding or label encoding • Deriving new features from existing data − e.g. calculating "time since last purchase" • Applying transformations like log or square root to make the relationship between features and target variables more linear • Selecting the most influential features for the model − e.g., using chi-square tests for categorical targets or deriving feature importance scores from ensemble methods like random forests. • Reducing the number of features using techniques like Principal Component Analysis (PCA) Approaches to Improving Accuracy Engineering AI Systems | Prof. Dr. Ingo Weber 56
  • 56.
    Model generation -model evaluation provides insights into how well the model is likely to perform on unseen data • Evaluation metrics: Evaluate them once the model is full trained and monitor them through the training phase − Metrics for classification models: accuracy rate, precision, recall, F1 score − Metrics for regression models: Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), R-squared (R2) Approaches to Improving Accuracy Engineering AI Systems | Prof. Dr. Ingo Weber 57
  • 57.
    Operation • The accuracyof a model is assessed based on known ground truth values. • It's essential to periodically test the model for accuracy to ensure that it remains effective over time. − If accuracy issues are detected, the model may need to be retrained with new data or adjusted to address any identified issues. − The updated model should undergo a rigorous deployment process to ensure that it meets the operational standards and accuracy requirements before being fully implemented. • Accuracy Testing Techniques − Confusion Matrix - visualizing the performance of a model across different classes - helps identify imbalances − Statistical tests like the Kolmogorov-Smirnov test can be used to compare the distribution of new data against the training data – detecting data/concept drift Approaches to Improving Accuracy Engineering AI Systems | Prof. Dr. Ingo Weber 58
  • 58.
    FM-based Systems: general-purposenature makes it hard to achieve consistent accuracy across various domains without specific adjustments • Fine-tuning: Adapting the model on a more specific dataset • Retrieval-Augmented Generation (RAG) - Integrating external data during generation • Prompt Engineering - Crafting inputs (prompts) that guide the model to generate more accurate and contextually appropriate outputs. • Reinforcement Learning from Human Feedback (RLHF) - Adjusting the model's parameters based on feedback from human evaluators. • Adversarial Testing and Benchmarks - Testing the model against carefully crafted adversarial inputs or established benchmarks to ensure robustness and accuracy. • Acceptance Testing and Test Case Assessment - Systematically testing the model/system to verify performance before full-scale deployment. Approaches to Improving Accuracy Engineering AI Systems | Prof. Dr. Ingo Weber 59
  • 59.
    CIA Properties • Confidentiality– only authorized parties can access data or resources. • Integrity - data is accurate, complete, and has not been altered or modified by unauthorized parties • Availability – there is timely and reliable access to data, systems, and services. Authentication means that the persons or systems trying to access a service are who they claim to be. Authorization refers to the rights the user has, i.e., they are authorized to perform certain operations like accessing a file or approving an invoice. Security Engineering AI Systems | Prof. Dr. Ingo Weber 60
  • 60.
    Attacks in anAI System Engineering AI Systems | Prof. Dr. Ingo Weber 61
  • 61.
    Architectural Approaches –based on zero trust • Authorization, encryption, least functionality, limiting access, privilege minimization Process Approaches • Threat modeling and risk assessment for identifying/assessing vulnerabilities • New attacks to AI/FM systems: use AI to track and analyze attacks • Maintaining versions of data items as well as their lineage to ensure the integrity of the model • Adversarial testing to detect vulnerabilities • Logging and auditing can help to detect anomalies Approaches to Mitigating Security Concerns Engineering AI Systems | Prof. Dr. Ingo Weber 62
  • 62.
    More complex • Notonly data access and storage but also the outputs and decisions of the AI systems which can potentially reveal sensitive information Organizational approaches: appoint executives such as Chief Privacy Officers (CPOs) or Data Protection Officers (DPOs). Their responsibilities include • Developing and enforcing data privacy policies • Monitoring organizational compliance with privacy laws • Handling data subject requests e.g. access or deletion • Managing data breaches • Educating and training employees on data privacy best practices Process and architecture: Implementing LOCKED rights—Limit, Opt-out, Correct, Know, Equal, Delete • E.g., machine unlearning, guardrails Privacy in AI Systems Engineering AI Systems | Prof. Dr. Ingo Weber 63
Fairness

Fairness is not only about the outcomes of an AI system but also about the decision processes and model development processes, including data collection, algorithm design, and ongoing management of the system.
• For instance, in recruitment AI tools, fairness is challenged by historical biases present in the training data. An AI system might inadvertently learn to prefer candidates from a certain demographic background.

Organizational strategies
• Fairness teams: comprising members from data science, engineering, legal, and ethics departments to identify fairness concerns at various stages
• Third-party expertise: external experts audit the AI systems, provide impartial assessments, and recommend strategies to address potential biases and ensure fairness
• Employee training: increasing awareness about the impact of bias and the importance of fairness

Architecture approaches
• Guardrails and monitoring
• Logging

Process approaches
• Fairness metrics: Demographic Parity, Equal Opportunity, Equalized Odds, Predictive Parity, Treatment Equality
• Fairness tools: AI Fairness 360 by IBM, Fairlearn by Microsoft, PAIR Tools by Google

Engineering AI Systems | Prof. Dr. Ingo Weber 64
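Demographic parity, the first metric in the list above, compares positive-prediction rates across groups defined by a protected attribute; a gap near zero indicates parity. A minimal sketch with synthetic hiring predictions (the data and names are illustrative):

```python
def positive_rate(predictions) -> float:
    """Fraction of positive (e.g., 'hired') predictions."""
    return sum(predictions) / len(predictions)

def demographic_parity_gap(preds_by_group) -> float:
    """Largest difference in positive-prediction rate between groups.
    0.0 means perfect demographic parity under this metric."""
    rates = [positive_rate(p) for p in preds_by_group.values()]
    return max(rates) - min(rates)

# Synthetic recruitment predictions: 1 = hired, 0 = rejected,
# grouped by a protected attribute.
preds = {
    "group_a": [1, 1, 0, 1],  # 75% positive rate
    "group_b": [1, 0, 0, 1],  # 50% positive rate
}
gap = demographic_parity_gap(preds)  # 0.25
```

Libraries such as Fairlearn and AI Fairness 360 provide this and the other listed metrics out of the box; the hand-rolled version here is only meant to make the definition concrete.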
Monitorability and Observability

Monitorability: the ability to monitor and track predefined AI system/model quality metrics
• More reactive: identifying issues as they occur
• Relies solely on predefined metrics

Observability: the ability to gain insights into the inner workings of the AI system at the system level, the model level, and the model development and deployment pipeline level, as well as into its operating infrastructure
• More proactive: understanding an AI system, detecting data, model, and concept drift early, and preventing incidents
• Leverages broader data sources, including metrics, logs, traces, and events generated by the AI system
• Tools: repositories for code and data, system logs, resource utilization monitoring infrastructure

Engineering AI Systems | Prof. Dr. Ingo Weber 65
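A simple form of the drift detection mentioned above is comparing a live statistic of incoming data against a training-time baseline. The sketch below flags a window of inputs whose mean has drifted; the baseline, threshold, and window size are illustrative values, not recommendations:

```python
import statistics

# Baseline captured at training time (hypothetical value).
BASELINE_MEAN = 0.50
# Alert threshold on mean shift (illustrative, tuned per metric in practice).
THRESHOLD = 0.10

def check_drift(window) -> bool:
    """Return True if the live window's mean drifts beyond the threshold
    from the training-time baseline (a crude data-drift signal)."""
    return abs(statistics.mean(window) - BASELINE_MEAN) > THRESHOLD
```

This is monitorability in the slide's sense: a predefined metric checked reactively. Observability would additionally correlate such an alert with logs, traces, and pipeline events to explain *why* the inputs shifted.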
Approaches to Ensure Observability

Maintain logs to record system activity, ensuring all relevant events are captured for later analysis.
• Data lineage tools: track the history and flow of data used in model predictions
• Software Bill of Materials (SBOM): provides a detailed lineage of non-AI components

Explainable AI techniques
• Local explanation techniques: instance-based explanations for understanding the feature importance and correlations that led to specific outputs, e.g., LIME and SHAP
• Global explanation techniques: understand the general behavior of an AI system by using a set of data instances to produce explanations
− Visualizing the relationship between the input features and the model's output over a range of values
− Global surrogate models, such as tree-based or rule-based models
• Foundation-model-based systems: think aloud vs. think silently

Co-versioning registry
Independent overseeing agents

Engineering AI Systems | Prof. Dr. Ingo Weber 66
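The global-surrogate idea can be shown end to end in plain Python: probe a black-box model over many inputs, then fit a simple interpretable model to its outputs and read the explanation off the surrogate. Here the black box is a stand-in function and the surrogate is a one-feature threshold rule (a "decision stump") rather than a full tree; all names are illustrative:

```python
def black_box(x):
    """Stand-in for an opaque trained model over two features in [0, 1]."""
    return 1 if 0.3 * x[0] + 0.7 * x[1] > 0.5 else 0

# Probe the black box over a grid of inputs: global behavior, not one instance.
samples = [(a / 10, b / 10) for a in range(11) for b in range(11)]
labels = [black_box(s) for s in samples]

def fit_stump(X, y):
    """Fit the single-feature threshold rule that best reproduces the
    black box's outputs. Returns (fidelity, feature index, threshold)."""
    best = None
    for f in range(2):
        for cut in sorted({x[f] for x in X}):
            acc = sum((x[f] > cut) == bool(t) for x, t in zip(X, y)) / len(y)
            if best is None or acc > best[0]:
                best = (acc, f, cut)
    return best

fidelity, feature, threshold = fit_stump(samples, labels)
# The surrogate reveals that feature 1 dominates the black box's decision,
# and fidelity tells us how faithfully the simple rule mimics the model.
```

Real surrogate workflows use tree- or rule-based models from standard libraries in place of the stump, but the structure is the same: the explanation is read from the interpretable model, and its fidelity to the black box must always be reported alongside it.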
Introduction
AI & Foundation Models
AI System Life Cycle
Qualities
Summary

67
Engineering AI Systems | Prof. Dr. Ingo Weber
Notes for instructors:
• Slides for the whole book are available
• All chapters include discussion questions

Book chapter overview
1. Introduction
2. Software Engineering Background
3. AI Background
4. Foundation Models
5. AI Model Lifecycle
6. System Lifecycle
7. Reliability
8. Performance
9. Security
10. Privacy and Fairness
11. Observability
12. Case Study: Using a Pretrained Language Model for Tendering
13. Case Study: Chatbots for Small and Medium-Sized Australian Enterprises
14. Case Study: Predicting Customer Churn in Banks
15. The Future of AI Engineering

68
Engineering AI Systems | Prof. Dr. Ingo Weber
Summary

69
Engineering AI Systems | Prof. Dr. Ingo Weber
Engineering AI Systems
Architecture and DevOps Essentials
Prof. Dr. Ingo Weber

Book website:
https://research.csiro.au/ss/team/se4ai/ai-engineering/