JAIT 2026 Vol.17(5): 835-845
doi: 10.12720/jait.17.5.835-845

Are Emergent Misaligned Models Self-Aware of their Misalignment? Latent Introspection in Activation Spaces of Niche Misaligned LLM

Ajay Agarwal * and Tatsuhito Hasegawa *
Department of Information Sciences, University of Fukui, Fukui, Japan
Email: aad25805@g.u-fukui.ac.jp (A.A.); t-hase@u-fukui.ac.jp (T.H.)
*Corresponding author

Manuscript received November 18, 2025; revised December 11, 2025; accepted December 31, 2025; published May 13, 2026.

Abstract—Large language models can exhibit emergent misalignment when finetuned on narrowly malicious tasks. While this misalignment is widespread across models of all sizes, the question of whether finetuning overrides the model's safety training or merely suppresses it remains poorly understood. In this work, we investigate this question for the Qwen2.5 32B model organism finetuned on risky financial advice, which shows broad misalignment. We evaluate the hypothesis that an emergently misaligned model remains self-aware of its misalignment in activation space by conducting four experiments using linear probing and causal tracing. Our linear-probing results suggest that diffuse "risk" representations exist across all layers. Through latent introspection with causal tracing and activation patching, we also observe strong alignment between the misaligned model's activations and the base model's "refusal vectors". Building on this observation, we conduct a more granular mechanistic interpretability analysis using mean ablation and direct logit attribution to identify the components, attention heads L62H12 and L62H28, that may contribute most to suppressing safety training. We validate our findings by evaluating the model's attack success rate on the standard JailbreakBench dataset before and after mean ablation of these suppression heads. Our findings underscore the importance of representational consistency in evaluations of misaligned models when assessing how misaligned finetuning undermines a model's safeguards. Broadly, our work shows that misaligned models exhibit a quantifiable conflict: latent safety representations are computed across the network, but a few late-layer heads may override them.
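The mean-ablation intervention mentioned in the abstract can be illustrated in simplified form: a single attention head's slice of the layer output is replaced with its average activation over a reference corpus, removing that head's input-dependent contribution. The sketch below is a hypothetical illustration under assumed tensor shapes, not the authors' implementation; `mean_ablate_head` and all dimensions are illustrative.

```python
import torch

def mean_ablate_head(layer_output: torch.Tensor,
                     head_idx: int,
                     head_dim: int,
                     mean_activation: torch.Tensor) -> torch.Tensor:
    """Replace one head's slice of the concatenated attention output
    with its precomputed dataset mean (hypothetical helper)."""
    out = layer_output.clone()
    start = head_idx * head_dim
    # Broadcast the mean over batch and sequence positions.
    out[..., start:start + head_dim] = mean_activation
    return out

# Toy example: batch 2, sequence 3, 4 heads of dimension 8 (model dim 32).
x = torch.randn(2, 3, 32)
# Stand-in for the head's mean activation over a reference set.
mean_act = torch.zeros(8)
ablated = mean_ablate_head(x, head_idx=2, head_dim=8, mean_activation=mean_act)
```

In practice such a replacement would be installed as a forward hook on the relevant layer (e.g. layer 62 for heads L62H12 and L62H28), with the mean computed over a held-out reference set rather than zeros.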
 
Keywords—emergent misalignment, model organism, causal tracing, linear probing, vector steering, latent knowledge, large language models

Cite: Ajay Agarwal and Tatsuhito Hasegawa, "Are Emergent Misaligned Models Self-Aware of their Misalignment? Latent Introspection in Activation Spaces of Niche Misaligned LLM," Journal of Advances in Information Technology, Vol. 17, No. 5, pp. 835-845, 2026. doi: 10.12720/jait.17.5.835-845

Copyright © 2026 by the authors. This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited (CC BY 4.0).
