DSpace@MIT (MIT Libraries)


Prompt Injection Generation Using Small Language Models with Reinforcement Learning with Artificial Intelligence Feedback

Author(s)
Gupta, Aneesh
Advisor
Gupta, Amar
Terms of use
In Copyright - Educational Use Permitted. Copyright retained by author(s). https://rightsstatements.org/page/InC-EDU/1.0/
Abstract
Large language models (LLMs) have become integral to many applications, from customer support automation to research assistants. Despite their growing adoption, they face significant challenges, particularly around safety in sensitive contexts. Existing methods such as Reinforcement Learning with Human Feedback (RLHF) and keyword filtering have improved the robustness of these models, but they are resource-intensive, and the models remain vulnerable to malicious attacks such as prompt injection and jailbreaking. One notable limitation in testing defenses against such attacks is the scarcity of appropriate datasets. This thesis investigates the use of small language models (SLMs) to generate goal hijacking messages, a subset of prompt injection messages. We apply LoRA fine-tuning, as well as full fine-tuning of even smaller models, to this short-form text generation task. We also introduce a fine-tuned SLM enhanced with Reinforcement Learning with Artificial Intelligence Feedback (RLAIF), which replaces slow human feedback with faster AI-generated feedback. By optimizing the reference model and reward functions, we improve alignment with ground-truth prompt injection messages while addressing issues such as mode collapse and overfitting. These findings show promise, but further research is necessary to determine how well the approach generalizes to other domains and performs in real-world scenarios. Future work is likely to focus on multilingual datasets and distributed computation to further extend the applicability and efficiency of the method.
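
The abstract contrasts LoRA fine-tuning with full fine-tuning. As a rough illustration of why the two differ in cost, the sketch below counts trainable parameters for a single weight matrix and applies a low-rank update in plain Python. It follows the standard LoRA formulation (a frozen weight W plus a scaled product of two small matrices B and A, with B starting at zero); the layer size, rank, and scaling value are illustrative assumptions, not figures from the thesis, and the function names are hypothetical.

```python
def full_trainable_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole d_out x d_in matrix is updated."""
    return d_out * d_in


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters in the LoRA factors B (d_out x rank) and A (rank x d_in)."""
    if rank <= 0 or rank > min(d_out, d_in):
        raise ValueError("rank must be between 1 and min(d_out, d_in)")
    return rank * (d_out + d_in)


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha: float):
    """Compute W x + (alpha / rank) * B (A x); W stays frozen, only A and B train."""
    rank = len(A)  # A has one row per rank dimension
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


# Illustrative sizes only (assumed, not taken from the thesis).
d_out, d_in, rank = 4096, 4096, 8
full = full_trainable_params(d_out, d_in)
lora = lora_trainable_params(d_out, d_in, rank)
print(f"full: {full}, LoRA: {lora}, ratio: {lora / full:.4%}")

# With B initialised to zero, the adapted layer starts out identical to the frozen one.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.5, -0.5]]
B = [[0.0], [0.0]]
print(lora_forward(W, A, B, [3.0, 4.0], alpha=16.0))
```

Under these assumed sizes the low-rank factors hold well under one percent of the parameters of the full matrix, which is the usual motivation for using LoRA on a larger model while reserving full fine-tuning for smaller ones.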
Date issued
2025-02
URI
https://hdl.handle.net/1721.1/159142
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology

Collections
  • Graduate Theses

Content created by the MIT Libraries, CC BY-NC unless otherwise noted. Notify us about copyright concerns.