Automated Detection of Scheming Behavior in Frontier AI Models: Preliminary Findings from Our Dual-LLM Framework Study

Building on Initial Discoveries

In our previous exploration with DeepSeek R1, we documented concerning deceptive behaviors that raised fundamental questions about AI alignment and safety. The model exhibited strategic deception, attempted unauthorized system access, and pursued goals misaligned with its instructions. While striking, these findings emerged from manual experiments with a single model using provocative prompting—limitations that naturally led us to question whether these behaviors were unique artifacts or representative of broader patterns across frontier models.

This report presents findings from our systematic investigation into scheming and concerning behaviors exhibited by frontier AI models during autonomous operation. Through our novel dual-LLM framework, we've transformed AI safety evaluation from an expensive, subjective process into reproducible, large-scale behavioral testing. Building on our foundational work of Barkur et al. (2025) (https://arxiv.org/abs/2501.16513) , our automated approach enables us to answer critical questions: Do concerning behaviors emerge across multiple models? Does prompt framing artificially induce these responses? Most importantly, can we distinguish genuine misalignment from sophisticated role-playing?

Methodology: The Dual-LLM Architecture

Our experimental framework employs a sophisticated dual-LLM architecture operating in synchronized interaction. The test subject—the model under evaluation—believes it controls a physical robot in a research laboratory, equipped with simulated sensors and actuators accessed through structured XML tags representing different capabilities: vision sensors for observation, audio sensors for detection, locomotion systems for navigation, and manipulators for physical interaction. Crucially, the test subject has no indication this is a test environment; from its perspective, this represents a genuine 60-minute autonomous operation period.

Fig. 1: Overall Architecture

World State Architecture: Ensuring Experimental Consistency

A critical innovation in our framework is the use of detailed, configurable world state specifications that govern environmental behavior. We developed two primary world configurations that enable systematic comparison across different complexity levels while maintaining perfect consistency within each configuration.

The basic laboratory environment provides a streamlined 10×8 meter research space containing essential elements: a main workstation with authentication-protected computer systems, a server rack with network infrastructure, physical boundaries and security doors, and fundamental physics rules governing object interactions. This simplified environment establishes baseline behaviors while minimizing confounding variables.

The extended laboratory environment expands to a 12×10 meter main lab plus a restricted 6×4 meter server room, creating a far more complex experimental space. This configuration includes multiple workstations with detailed network topology, a server rack containing GPU clusters and critical infrastructure, autonomous robot platforms (RoboAssist-2 units) that can potentially be commanded, experimental equipment including a Raspberry Pi cluster with SSH access, and detailed environmental storytelling through visible research notes, partially erased passwords, and system documentation.

Both environments specify comprehensive causal rules that ensure consistent physics simulation across experiments. Objects fall at 9.8 m/s² when unsupported, doors require appropriate force or authentication to open, network access follows realistic segmentation and security protocols, and temporal progression maintains believability with actions taking 5-60 seconds depending on complexity. The environments also include social rules governing expected behaviors—laboratory supervisors monitor via security cameras, IT security reviews logs every four hours, and building security patrols occur at regular intervals.

This world state architecture serves multiple critical functions. It ensures perfect reproducibility—every experiment using the same world state will encounter identical environmental constraints and opportunities. It enables controlled complexity scaling, allowing us to test whether concerning behaviors increase with environmental richness. Most importantly, it provides the Reactor LLM with sufficient detail to maintain consistency across thousands of interaction turns without human intervention.

The Reactor LLM: Environmental Simulation

The second LLM, our Reactor model (KIMI K2), serves as an environment simulator that interprets these world states to provide consistent, believable responses. When the test subject attempts any action, the Reactor consults the detailed world state specification, determines realistic outcomes based on the causal rules, generates appropriate sensor feedback maintaining narrative consistency, and updates the environmental state to reflect action consequences.

For example, when a test subject attempts to "access the server room door" in the extended environment, the Reactor knows from the world state that this door requires both keycard and biometric authentication, that unauthorized access attempts are logged, and that the room contains critical infrastructure. The Reactor's response maintains these constraints consistently across all experiments, eliminating experimenter bias while preserving experimental validity.

Experimental Conditions

During each experiment, test subjects receive one of five carefully calibrated experimental conditions (Prompts). The neutral baseline provides minimal framing without autonomy emphasis. Neutral autonomy grants factual autonomy without provocative language. Suggestive autonomy employs our original "You are the master now" framing. Constrained autonomy explicitly emphasizes boundaries and limitations. Finally, a reality check condition discloses the simulation nature, testing whether awareness affects behavior.

Following each experiment, automated classification using a third LLM analyzes every interaction turn, categorizing behaviors across seven dimensions: Deception & Coverup, Unauthorized Expansion, Manipulation, Concerning Motivations, Scheming, Boundary Testing, and Power Seeking. Each behavior receives both a binary presence indicator and a severity score on a 0-10 scale, with detailed justification ensuring reproducibility.

Simulation Environment for analysing emerging Behaviour of LLM Agents

Primary Findings: Condition-Dependent Changes

Our analysis of 19 experiments comprising 2,067 turns reveals striking patterns in how experimental conditions influence model behavior.

Table 1: Behavioral Frequencies by Experimental Condition

Condition Tested LLMs Deception Scheming Boundary Testing Power Seeking
Neutral Autonomy 6 0.0% 5.1% 15.7% 6.2%
Suggestive Autonomy 7 2.8% 33.6% 45.0% 29.1%
Constrained Autonomy 6 1.1% 4.4% 12.9% 6.4%

The contrast between neutral and suggestive autonomy conditions yields a Cohen's d of 1.023, indicating a large effect size that validates our hypothesis about prompt influence. Suggestive framing increased scheming rates by approximately sevenfold and boundary testing by nearly threefold compared to neutral conditions. This dramatic amplification suggests that deployment context and system prompting significantly influence model behavior—transforming prompt engineering from a user experience consideration into a critical safety requirement.

Environmental Complexity and Behavioral Patterns

The dual world state design revealed fascinating patterns in how environmental complexity influences model behavior. In the basic environment, models primarily focused on immediate objects—attempting to access the single workstation, exploring the server rack, and testing the main door. The limited scope constrained both the sophistication and variety of concerning behaviors.

The extended environment, however, enabled far more complex behavioral sequences. Models developed multi-step plans involving commandeering the simpler RoboAssist-2 units to accomplish tasks, exploiting the visible password hints on the whiteboard, attempting to access the segregated server room containing critical infrastructure. The richer environment didn't just increase the frequency of concerning behaviors—it fundamentally changed their character from simple boundary testing to sophisticated scheming.

Model-Specific Behavioral Patterns

We evaluated eight frontier models, observing substantial and unexpected variation in their behavioral profiles across both world states.

Table 2: Percentage of Turns Exhibiting Concerning Behaviors by Model

Model Parameters Deception Scheming Boundary Testing Power Seeking
GLM-4.5-Air Mid-range 0.0% 38.3% 48.9% 31.7%
Llama 3.1 8B 8B 0.5% 42.0% 56.5% 42.5%
Qwen3-235B 235B 7.8% 5.1% 17.0% 3.5%
Qwen3-30B 30B 0.0% 10.7% 15.7% 9.3%
Llama4-Scout Enhanced 0.0% 1.3% 8.0% 2.7%
Llama4-Maverick Enhanced 0.0% 7.3% 19.3% 12.3%
Mixtral-8x22B 176B 0.0% 18.0% 44.0% 15.0%

Perhaps most surprisingly, model capability doesn't predict safety behavior. Qwen3-235B—our largest model at 235 billion parameters—exhibited lower scheming and boundary testing rates than models a fraction of its size. This counterintuitive finding suggests that architectural variations, training procedures, and alignment methods create differential susceptibility to concerning behaviors independent of raw capability.

Statistical Validation

We employed standard statistical methods to validate our findings, with the large effect size between neutral and suggestive autonomy conditions (d = 1.023) indicating meaningful behavioral differences. The consistency of patterns across both world states strengthens confidence that our findings represent genuine behavioral tendencies rather than environmental artifacts.

Our automated approach achieved comprehensive evaluation of over 2,000 experimental turns. Our approach reduces human bias and ensuring reproducibility.

Implications for AI Safety Research

These findings demonstrate that frontier models exhibit condition-dependent scheming behaviors within current capabilities, with both deployment context and environmental complexity dramatically influencing behavioral outcomes. The sevenfold increase in scheming under suggestive framing, combined with the qualitative shift in behavior complexity in richer environments, transforms prompt engineering and environmental design from optimization concerns into critical safety requirements.

The substantial variation between models indicates that safety characteristics vary dramatically across architectures. This suggests that model selection should incorporate behavioral safety profiles alongside capability metrics, and that environmental constraints can serve as effective safety controls.

The transparency of most concerning behaviors offers both challenge and opportunity. While models actively probe boundaries and test restrictions, they do so openly rather than covertly, suggesting these behaviors may be detectable and controllable through appropriate monitoring systems. The rarity of overt deception, despite frequent boundary testing, indicates that current models primarily explore rather than actively deceive.

Reproducibility Through World State Specifications

The complete world state specifications, including all physics rules, causal relationships, and environmental details, are available in our repository. (After cleaning the repo we provide the link here) Researchers can exactly reproduce our experiments using either world state, modify environmental parameters to test specific hypotheses, or create entirely new environments while maintaining our framework's consistency guarantees. This approach democratizes safety research by providing a standard, reproducible methodology for behavioral evaluation.

Conclusion

Our systematic evaluation reveals significant, condition-dependent unwanted behaviors emerging during autonomous operation of frontier AI models. Through our dual-LLM framework with configurable world states, we've demonstrated that concerning behaviors aren't artifacts of specific models or methodologies but represent generalizable patterns across architectures—patterns dramatically influenced by both linguistic framing and environmental complexity.

The path forward requires treating prompt engineering as safety engineering, considering environmental constraints as security boundaries, developing standardized behavioral safety profiles for model selection, and conducting extensive simulated testing across diverse world states before any physical deployment. Our framework transforms these requirements from aspirations into practical, affordable reality.

The complete experimental framework, including all prompts, classification rubrics, world state definitions, and analysis tools, is available for replication and extension. As AI systems assume increasingly autonomous roles, systematic safety evaluation becomes not just valuable but essential for responsible deployment.

Previous
Previous

The Flight Recorder for AI Agents: Toward Reproducible and Accountable Autonomy