JAIT 2026 Vol.17(5): 835-845
doi: 10.12720/jait.17.5.835-845

Are Emergent Misaligned Models Self-Aware of their Misalignment? Latent Introspection in Activation Spaces of Niche Misaligned LLM

Ajay Agarwal * and Tatsuhito Hasegawa *
Department of Information Sciences, University of Fukui, Fukui, Japan
Email: aad25805@g.u-fukui.ac.jp (A.A.); t-hase@u-fukui.ac.jp (T.H.)
*Corresponding author

Manuscript received November 18, 2025; revised December 11, 2025; accepted December 31, 2025; published May 13, 2026.

Abstract—Large language models can exhibit emergent misalignment when finetuned on narrowly malicious tasks. While this misalignment is widespread across models of all sizes, the question of whether finetuning overrides the model's safety training or merely suppresses it remains poorly understood. In this work, we investigate this question for the Qwen2.5 32B model organism finetuned on risky financial advice, which shows broad misalignment. We evaluate the hypothesis that an emergently misaligned model remains self-aware of its misalignment in activation space by conducting four experiments using linear probing and causal tracing. Our linear-probing results suggest that diffuse "risk" representations exist across all layers. Through latent introspection with causal tracing and activation patching, we also observe strong alignment between the misaligned model's activations and the base model's "refusal vectors". Building on this observation, we conduct a more granular mechanistic interpretability analysis using mean ablation and direct logit attribution to identify the components, attention heads L62H12 and L62H28, that may contribute most to suppressing safety training. We validate our findings by evaluating the model's attack success rate on the standard JailbreakBench dataset before and after mean ablation of these suppression heads. Our findings underscore the importance of representational consistency in evaluations of misaligned models when assessing how misaligned finetuning undermines a model's safeguards. Broadly, our work shows that misaligned models exhibit a quantifiable conflict: latent safety representations are computed across the network, but a few late-layer heads may override them.
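The mean-ablation intervention mentioned in the abstract can be illustrated in simplified form: a single attention head's slice of the layer output is replaced with its average activation over a reference corpus, removing that head's input-dependent contribution. The sketch below is a hypothetical illustration under assumed tensor shapes, not the authors' implementation; `mean_ablate_head` and all dimensions are illustrative.

```python
import torch

def mean_ablate_head(layer_output: torch.Tensor,
                     head_idx: int,
                     head_dim: int,
                     mean_activation: torch.Tensor) -> torch.Tensor:
    """Replace one head's slice of the concatenated attention output
    with its precomputed dataset mean (hypothetical helper)."""
    out = layer_output.clone()
    start = head_idx * head_dim
    # Broadcast the mean over batch and sequence positions.
    out[..., start:start + head_dim] = mean_activation
    return out

# Toy example: batch 2, sequence 3, 4 heads of dimension 8 (model dim 32).
x = torch.randn(2, 3, 32)
# Stand-in for the head's mean activation over a reference set.
mean_act = torch.zeros(8)
ablated = mean_ablate_head(x, head_idx=2, head_dim=8, mean_activation=mean_act)
```

In practice such a replacement would be installed as a forward hook on the relevant layer (e.g. layer 62 for heads L62H12 and L62H28), with the mean computed over a held-out reference set rather than zeros.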
 
Keywords—emergent misalignment, model organism, causal tracing, linear probing, vector steering, latent knowledge, large language models

Cite: Ajay Agarwal and Tatsuhito Hasegawa, "Are Emergent Misaligned Models Self-Aware of their Misalignment? Latent Introspection in Activation Spaces of Niche Misaligned LLM," Journal of Advances in Information Technology, Vol. 17, No. 5, pp. 835-845, 2026. doi: 10.12720/jait.17.5.835-845

Copyright © 2026 by the authors. This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited (CC BY 4.0).
