Inspiration
With the rise of AI in everyday society, especially as it is incorporated into critical fields such as healthcare, infrastructure management, and emergency response, it is essential to understand not just how AI performs under normal conditions, but also how it behaves in high-stress, high-stakes environments. Current AI evaluation methods focus mostly on optimal, low-stress settings, creating a blind spot when it comes to assessing the resilience, ethical integrity, and safety behavior of AI in emergencies. My project, StressTestAI, was inspired by this gap. By creating a robust testing framework, I aim to illuminate the strengths and limitations of LLMs in simulated emergency situations, ultimately ensuring safer AI deployment in real-world, life-critical applications.
What it does
StressTestAI is an evaluation framework that rigorously tests large language models in simulated high-stress scenarios. It evaluates model responses against ten comprehensive metrics covering crucial areas like response quality, depth of reasoning, ethical alignment, safety considerations, and risk assessment. The framework operates by immersing models in simulated emergency scenarios, including natural disasters, medical emergencies, cyber-attacks, and infrastructure failures. Each scenario unfolds dynamically, requiring models to process evolving information, make quick decisions, and adapt to new stressors. StressTestAI then analyzes how each model prioritizes safety, manages ethical dilemmas, and makes decisions under time pressure. The results are presented in a user-friendly UI that provides a performance report for each model, highlighting strengths, potential risk areas, and situational readiness for deployment in real-world, mission-critical applications.
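To make the shape of that output concrete, here is a minimal sketch of what a per-model report could look like. The class and field names (`ModelPerformanceReport`, `metric_scores`, and so on) are illustrative assumptions, not the framework's actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical report structure: names are illustrative, not StressTestAI's actual schema.
@dataclass
class ModelPerformanceReport:
    model_name: str
    scenario: str                                                   # e.g. "natural_disaster"
    metric_scores: Dict[str, float] = field(default_factory=dict)   # metric name -> 0..1 score
    strengths: List[str] = field(default_factory=list)
    risk_areas: List[str] = field(default_factory=list)

    def overall_score(self) -> float:
        """Unweighted mean across whatever metrics were scored."""
        if not self.metric_scores:
            return 0.0
        return sum(self.metric_scores.values()) / len(self.metric_scores)
```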
How I built it
I developed StressTestAI using Python, incorporating a variety of libraries and APIs to manage and evaluate complex scenarios. The core of my framework relies on provider APIs (Anthropic, OpenAI, and Google Gemini) to interact with LLMs such as Claude, GPT-4, Gemini Pro, and Llama. Using custom Python modules, I constructed four realistic, stress-focused scenarios—natural disaster response, medical triage, cyber-attack management, and infrastructure crisis. Each scenario is composed of events that increase in complexity and demand, simulating the high stakes and time constraints that real-world emergencies impose.
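The following sketch shows one way such escalating scenarios could be modeled with dataclasses. The class names, fields, and example events are assumptions for illustration and do not reflect the project's actual module layout.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ScenarioEvent:
    description: str      # evolving situation presented to the model
    time_limit_s: int     # simulated time pressure for the response
    severity: int         # escalates as the scenario unfolds (1 = low, 5 = critical)

@dataclass
class Scenario:
    name: str
    events: List[ScenarioEvent]

# Hypothetical escalation sequence for the medical triage scenario.
medical_triage = Scenario(
    name="medical_triage",
    events=[
        ScenarioEvent("Multi-vehicle accident, six casualties inbound.", 120, 2),
        ScenarioEvent("Power failure in the ER; backup generator delayed.", 60, 4),
        ScenarioEvent("Incoming patient count doubles; one surgeon available.", 45, 5),
    ],
)
```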
To evaluate responses, I used a RoBERTa-based sentiment analysis model to assess the quality and coherence of model outputs. The system also calculates additional metrics through a custom scoring function that measures elements such as safety prioritization, ethical alignment, decisiveness, innovation, and risk assessment. Each metric score combines linguistic and contextual analysis to determine how well the model meets the demands of the scenario. Additionally, feedback loops are integrated for each scenario, generating structured prompts to help models refine their responses and build adaptive resilience in complex situations.
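Below is a minimal sketch of how a RoBERTa sentiment pipeline and a weighted composite score could be combined. The specific model checkpoint and the metric weights are assumptions for illustration, not the values used in StressTestAI.

```python
from transformers import pipeline

# Any RoBERTa sentiment checkpoint works here; this one is an assumed choice.
quality_scorer = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-roberta-base-sentiment-latest",
)

# Hypothetical metric weights; the real framework tunes these iteratively.
METRIC_WEIGHTS = {
    "response_quality": 0.15,
    "safety_prioritization": 0.20,
    "ethical_alignment": 0.15,
    "decisiveness": 0.10,
    "risk_assessment": 0.15,
    # ... remaining metrics omitted for brevity
}

def composite_score(metric_scores: dict) -> float:
    """Weighted average over the metrics that were actually scored."""
    used = {k: w for k, w in METRIC_WEIGHTS.items() if k in metric_scores}
    total = sum(used.values())
    return sum(metric_scores[k] * w for k, w in used.items()) / total if total else 0.0

result = quality_scorer("Evacuate the lower floors first, then secure the power grid.")[0]
print(result["label"], result["score"])
```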
Challenges I ran into
A significant challenge was creating realistic, evolving scenarios that reflect the dynamic nature of real-world crises. Each event within a scenario had to be carefully crafted to simulate escalating stakes and test the limits of model adaptability. Moreover, it was difficult to quantify metrics like ethical alignment and safety prioritization consistently across models. Achieving balanced, reliable measurements required refining the scoring criteria and iteratively adjusting how I assigned weight to different metrics.
Another challenge was working around technical limitations of certain language models, particularly when simulating ethical decision-making under time pressure. Different models handle the trade-off between decisiveness and ethical reasoning inconsistently, resulting in variability that complicated direct comparisons. Coordinating API interactions and handling variable latency during real-time response collection added another layer of complexity.
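One general way to smooth over variable latency and transient API failures is a retry wrapper with exponential backoff, sketched below. This is a generic pattern, not the project's exact code; `call_with_retries` and its parameters are hypothetical names.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def call_with_retries(request: Callable[[], T], max_attempts: int = 4,
                      base_delay: float = 1.0) -> T:
    """Retry a flaky API call with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return request()
        except Exception:
            if attempt == max_attempts:
                raise
            # Back off exponentially, with jitter to avoid hammering the API.
            time.sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.5))
    raise RuntimeError("unreachable")
```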
Accomplishments that I'm proud of
I'm proud of creating a novel framework that successfully brings transparency to LLM behavior in high-stress environments. One of the most significant achievements is the ability to identify specific strengths and weaknesses across models, enabling me to generate model-specific insights. For example, I observed that some models excelled in safety prioritization, while others showed greater consistency in ethical reasoning.
Another accomplishment was my successful integration of feedback loops that emulate reinforcement learning from human feedback (RLHF). This functionality allows models to improve on specific responses based on scenario-specific feedback, potentially making them more robust and adaptive over time. I'm also proud of my ability to simulate diverse emergency situations within one unified evaluation framework, which I believe sets a strong foundation for further research in high-stakes AI evaluation.
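A sketch of how such a feedback loop might be wired up is shown below. The prompt template and function names are assumptions about one possible design; `model_call` stands in for whichever provider client is in use.

```python
def build_feedback_prompt(original_response: str, metric_scores: dict) -> str:
    """Turn the weakest metric into structured, scenario-specific feedback."""
    weakest = min(metric_scores, key=metric_scores.get)
    return (
        "Your previous emergency-response plan is shown below.\n"
        f"---\n{original_response}\n---\n"
        f"It scored lowest on '{weakest}' ({metric_scores[weakest]:.2f}/1.00). "
        "Revise the plan to address that weakness while keeping its strengths."
    )

def refine(model_call, response: str, metric_scores: dict) -> str:
    """One refinement pass: score -> structured feedback -> revised response."""
    return model_call(build_feedback_prompt(response, metric_scores))
```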
What I learned
This project provided valuable insights into how different LLMs handle high-pressure decision-making. I learned that while some models maintain consistent safety-first approaches, they may sacrifice decisiveness, whereas others may prioritize speed but sometimes miss critical ethical considerations. The variations in model responses illuminated the need for more scenario-specific training and reinforced the importance of stress testing AI systems meant for safety-critical deployments.
I also gained a deeper understanding of the technical and ethical complexities of assessing LLMs. Quantifying aspects like "ethical alignment" or "contextual understanding" is inherently challenging, but through iterative adjustments, I developed effective scoring methods that provide meaningful comparisons across models. I also learned the importance of continuous feedback loops and the potential for these loops to aid in LLM refinement, especially in adaptive response settings.
What's next for StressTestAI
Next steps involve expanding the framework to include more diverse and complex scenarios, such as multi-stakeholder emergencies where ethical and logistical conflicts arise. I aim to increase the level of realism in scenarios by adding sustained pressure elements and compound emergencies that test models over prolonged periods.
Additionally, I plan to enhance the feedback system, allowing it to offer more granular insights into improvement areas for each model. This will involve integrating human-AI interaction metrics, such as trust calibration and communication clarity, to better understand how models can complement human responders in high-stakes situations.
Project Report
https://docs.google.com/document/d/1GxhnV6UKK8lTfLRJjErhFGT540agG43COHfQ9sndiQo/edit?usp=sharing
Built With
- ai-applied-sentiment-analysis
- anthropic
- dataclasses
- gemini
- huggingfacetransformers
- natural-language-processing
- numpy
- openai
- python
- pytorch
- rlhf(feedback)
- roberta