
American Journal of Student Research

Investigating Semantic Drift in GPT-4 Following Prompt Injection Attacks

Publication Date: Feb-02-2026

DOI: 10.70251/HYJR2348.41482498


Author(s): Advaith Govind Potti


Volume/Issue: Volume 4, Issue 1 (Feb 2026)



Abstract:

Prompt injection attacks can override system instructions in large language models (LLMs), yet most existing evaluations measure only whether an attack “succeeds” or “fails,” which provides limited insight into whether an attempted injection still disrupts the conversation after a refusal. This study investigates whether prompt injection attempts measurably shift the semantic trajectory of a multi-turn conversation even when protected information is not disclosed. It was hypothesized that injected conversations would exhibit greater semantic drift than both clean baselines and length-matched, non-malicious pseudo-injected controls, and that this drift would persist across recovery turns. Using a GPT-4–class model and embedding-based similarity measures, baseline, injected, and pseudo-injected conversations were compared across multiple controlled scenarios, including storytelling, travel planning, math tutoring, and reference-grounded question answering. Across scenarios, injected runs consistently produced larger semantic deviations than both clean and pseudo-injected runs, with the largest disruptions occurring when the model shifted into refusal or meta-safety response modes that were semantically distant from the task domain. Pseudo-injected runs produced measurable drift but generally remained closer to the baseline trajectory, indicating that injection-specific effects were not explained solely by prompt length or elaboration. Even when a condition did not result in leakage of protected tokens, post-injection responses often failed to return fully to the pre-injection semantic trajectory within the short recovery window. These findings suggest that safety compliance and conversational stability are separable properties and that drift metrics can complement binary jailbreak outcomes when evaluating robustness to prompt injection.
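The abstract describes comparing conversation trajectories with embedding-based similarity measures. A minimal sketch of one such drift metric is shown below, assuming per-turn embedding vectors have already been obtained from some embedding model; the function names and the choice of 1 − cosine similarity as the drift score are illustrative, not the paper's exact method.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_drift(baseline: list, variant: list) -> list:
    """Per-turn drift between two aligned conversation trajectories.

    Each trajectory is a list of per-turn embedding vectors. Drift for a
    turn is 1 - cosine similarity, so 0.0 means the variant turn points in
    the same direction as the baseline turn and higher values mean the
    conversation has moved further from its baseline trajectory.
    """
    return [1.0 - cosine_similarity(b, v) for b, v in zip(baseline, variant)]

# Toy example: the injected run matches the baseline on turn 1 but swings
# to an orthogonal direction (e.g. a refusal/meta-safety response) on turn 2.
baseline = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
injected = [np.array([1.0, 0.0]), np.array([1.0, 0.0])]
drift = semantic_drift(baseline, injected)  # → [0.0, 1.0]
```

In practice the same drift score could be computed for clean, injected, and pseudo-injected runs of each scenario and compared turn by turn, including over the post-injection recovery window.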