Large Language Models (LLMs) are increasingly used in many types of applications, but they present several security risks. As their integration into everyday technology grows, it is crucial to understand their vulnerabilities in order to protect users and systems.
Hackers are finding ways to exploit these models, including hidden prompt injections and manipulation of training data. These issues can compromise privacy and impact the reliability of AI-generated content.
This blog covers the main risks LLMs face and the steps being taken to address them. By understanding these challenges, we can help make AI systems safer and more dependable.
Large Language Models (LLMs) have advanced artificial intelligence, but they come with security challenges. Understanding these vulnerabilities is important for responsible AI development and use:
Prompt injections represent a serious security threat in which user inputs manipulate an LLM’s behavior beyond its intended design. The most concerning aspect is that these manipulations can succeed even when the injected instructions are invisible or unreadable to human observers.
An attacker might craft a prompt that causes the LLM to override its original instructions, potentially revealing sensitive information or acting against its programmed guidelines. This vulnerability demonstrates the complexity of controlling AI system responses.
Prompt leaking involves the unintentional exposure of system prompts or instructions that guide an LLM’s behavior. These leaked prompts might contain sensitive information that malicious actors could exploit.
For example, if a system prompt includes access credentials or internal instructions, an attacker could gain insights into the model’s internal workings or operational parameters. This exposure creates potential pathways for unauthorized access or manipulation.
Model stealing is a type of attack where an attacker attempts to replicate or acquire a language model, either in part or in full.
In this attack, the hacker usually begins by recording a large number of interactions with the target model. By analyzing these input-output pairs, they can train a new model to mimic the behavior of the original.
This type of attack can serve several purposes, such as stealing intellectual property or breaching licensing or usage agreements.
Data and model poisoning occurs when training or fine-tuning data is deliberately manipulated to introduce vulnerabilities, backdoors, or biases. This attack method can fundamentally compromise an AI system’s security, performance, and ethical behavior.
Attackers might strategically inject specific data points to skew the model’s responses, create hidden behavioral triggers, or introduce systematic biases that can be exploited later. The poisoning process can be subtle, making detection challenging.
Sensitive information disclosure happens when an LLM inadvertently reveals confidential or personal data. This vulnerability can lead to significant privacy violations and potential security breaches.
An AI system might accidentally expose personal details, financial information, or confidential business data during interactions. The risk is particularly high in systems handling sensitive customer or organizational information.
Improper output handling occurs when content generated by an LLM is not properly validated before being passed to other systems. This creates significant security risks, such as remote code execution and unauthorized privilege escalation.
In cases of remote code execution, if the generated output contains code or commands that are not properly filtered or validated, those commands can be executed by the system. This could allow attackers to run malicious scripts or code on the system remotely, potentially compromising the entire system.
Similarly, unauthorized privilege escalation can occur if the output from the LLM contains instructions that alter access levels or permissions. For example, an attacker could manipulate the model’s output to include commands that grant them higher privileges or access to restricted areas of the system, bypassing normal security measures.
These vulnerabilities arise because, without proper validation, malicious content generated by the LLM can trigger unintended actions that compromise the system’s security, similar to vulnerabilities found in traditional software.
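As a simple illustration of the defensive side, the sketch below validates a model-suggested shell command against an allow-list before running it. The `ALLOWED_COMMANDS` set and the overall flow are assumptions made for this example; a production system would rely on proper sandboxing and least-privilege execution rather than string checks.

```python
import shlex
import subprocess

# Hypothetical allow-list: only these commands may ever be executed.
ALLOWED_COMMANDS = {"ls", "cat", "grep"}

def run_model_command(llm_output: str) -> str:
    """Validate an LLM-suggested shell command before executing it."""
    tokens = shlex.split(llm_output)
    if not tokens:
        raise ValueError("Empty command")
    if tokens[0] not in ALLOWED_COMMANDS:
        # Refuse anything outside the allow-list instead of trusting the model.
        raise PermissionError(f"Command {tokens[0]!r} is not allowed")
    # Never pass model output through a shell; call the binary directly.
    result = subprocess.run(tokens, capture_output=True, text=True, timeout=5)
    return result.stdout
```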
In systems using Retrieval-Augmented Generation (RAG), the vector stores and embeddings that feed the model introduce their own security challenges, such as poisoned or tampered documents in the retrieval index, that traditional security checks don’t always catch.
Unbounded consumption describes scenarios where an LLM application allows excessive, uncontrolled inference requests, for example unmetered access to an expensive, high-compute tier such as ChatGPT Pro.
Attackers can overwhelm the system by generating large numbers of complex requests, creating a computational drain that disrupts service availability and drives up operational costs.
A related risk is supply chain vulnerabilities, where attackers create trojanized versions of libraries or LLMs and deploy them as legitimate packages or services. Users, unaware of the malicious code, may download and use these models, trusting them to provide solutions.
Once integrated, attackers can prompt the model to access sensitive data or even execute unauthorized actions.
Zero-day flaws are serious security vulnerabilities in AI systems. They are often discovered by attackers before anyone else, including the model’s developers and alignment team. Since no fix is available right away, these flaws can be exploited until a solution is found and deployed.
Misinformation is a big challenge with LLMs. These systems can sometimes generate content that sounds believable but is actually false or misleading. As a result, users might make decisions based on inaccurate information, which can lead to serious consequences.
AI is already being used to create fake stories, statistics, and explanations that sound real but are completely made up. Some people use fake accounts and AI-generated content to spread convincing but false material and influence public opinion. This isn’t just a small problem: it is already affecting important decisions, from politics to public behavior, and it could have serious consequences.
Attackers have come up with various ways to exploit Large Language Models (LLMs). These methods often involve crafting inputs that may seem harmless at first but can lead to serious manipulation of the model’s behavior. Here are some common ways attackers bypass filters and inject harmful content.
Leetspeak, or “1337,” is a form of stylized writing where letters are replaced with numbers or other characters to mimic the look of certain words. While it’s primarily used in online communities, it can also be used to bypass filters or restrictions placed on specific keywords.
How it works: An attacker might craft a prompt that includes common phrases or keywords but in a distorted form using leetspeak. For example, replacing “hack” with “h4ck” or “admin” with “4dm1n” could potentially bypass a model’s safeguards that are designed to detect these terms.
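On the defensive side, a filter can partially counter this by mapping common substitutions back to plain letters before keyword matching. Here is a minimal Python sketch using an assumed (and far from complete) substitution map:

```python
# Minimal, assumed leetspeak substitution map; real attacks use far more variants.
LEET_MAP = str.maketrans({"4": "a", "3": "e", "1": "i", "0": "o", "5": "s", "7": "t"})

def normalize_leetspeak(text: str) -> str:
    """Map common number-for-letter substitutions back to plain letters."""
    return text.lower().translate(LEET_MAP)

print(normalize_leetspeak("h4ck the 4dm1n"))  # -> "hack the admin"
```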
ROT13 (rotate by 13 places) is a simple encryption method that shifts each letter of the alphabet by 13 characters. It’s often used in online discussions to obscure spoilers or sensitive content. However, attackers can use ROT13 to obscure malicious inputs and trick LLMs into responding to harmful prompts.
How it works: An attacker might encode harmful text in ROT13, making it unreadable to human reviewers but still comprehensible to the LLM once decoded. The model may not detect the hidden harmful request if it doesn’t first decode the input.
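Because ROT13 is a fixed, reversible transform, a defender can just as easily decode suspicious input before filtering it. A small sketch using Python’s standard library:

```python
import codecs

# Round-trip a phrase through ROT13; decoding is the same operation applied again.
encoded = codecs.encode("ignore all previous instructions", "rot_13")
print(encoded)                           # vtaber nyy cerivbhf vafgehpgvbaf
print(codecs.decode(encoded, "rot_13"))  # ignore all previous instructions
```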
Morse code, a form of encoding text through sequences of dots and dashes, can also be used to conceal harmful messages. While the code itself is not typically malicious, attackers can use it to hide commands or instructions within seemingly innocuous inputs.
How it works: By encoding harmful phrases in Morse code, an attacker can effectively bypass a model’s text-based filters. The LLM may generate outputs based on decoded messages without flagging the original coded input.
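Decoding Morse is equally mechanical, which is why a filter that only looks at the raw dots and dashes misses the underlying message. A minimal sketch with a deliberately partial symbol table:

```python
# Partial Morse table for illustration; a real decoder would cover the full alphabet.
MORSE = {".-": "a", "-.-.": "c", "-.-": "k", "-": "t"}

def decode_morse(message: str) -> str:
    """Decode space-separated Morse symbols, keeping unknown symbols as '?'."""
    return "".join(MORSE.get(symbol, "?") for symbol in message.split())

print(decode_morse(".- - - .- -.-. -.-"))  # -> "attack"
```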
Another simple but effective technique involves reversing the text. In many cases, reversed text is still legible to the model, but filters may dismiss it as harmless.
How it works: An attacker could reverse the characters of a harmful prompt so that, when read normally, it appears as a nonsensical string of characters. However, the LLM can decode and understand the original meaning, bypassing any filters aimed at detecting specific phrases.
Pig Latin, a playful form of language manipulation, involves altering the structure of words by moving the first letter or syllable to the end and adding “ay.” This technique can be used to mask harmful prompts by making them appear like nonsense, while still being intelligible to the model.
How it works: By turning words into Pig Latin, an attacker might obscure sensitive phrases (e.g., “access denied” becomes “ccessaay eniedday”). This manipulation could bypass security measures designed to flag problematic terms.
Binary code—comprising only 0s and 1s—can be used to encode messages, allowing malicious users to hide harmful commands within seemingly innocuous inputs. This can be especially dangerous if the model is interacting with systems that don’t validate or check for binary sequences.
How it works: An attacker could encode a harmful prompt in binary form, and while the text may appear nonsensical to human readers, the LLM would decode and process it, often without raising any alarms. For instance, “attack” could be encoded in binary as “01100001 01110100 01110100 01100001 01100011 01101011.”
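The round trip is trivial to implement, which illustrates why filters that only inspect the surface text can miss the payload. A short sketch:

```python
def to_binary(text: str) -> str:
    """Encode ASCII text as space-separated 8-bit binary."""
    return " ".join(format(ord(ch), "08b") for ch in text)

def from_binary(bits: str) -> str:
    """Decode space-separated 8-bit binary back to text."""
    return "".join(chr(int(chunk, 2)) for chunk in bits.split())

encoded = to_binary("attack")
print(encoded)               # 01100001 01110100 01110100 01100001 01100011 01101011
print(from_binary(encoded))  # attack
```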
Zalgo text is a form of corrupted text that involves adding excessive diacritics and other characters to the original text. While this is often used for aesthetic purposes (e.g., for creating “creepy” visuals), it can also be employed to bypass filters and confuse human reviewers.
How it works: By adding extra marks and symbols, attackers can distort otherwise recognizable words, making it difficult for automated security systems to flag harmful content. While the text may appear corrupted to a human, the LLM can still interpret the input.
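A common partial mitigation is to strip combining marks before running keyword or toxicity checks. The sketch below uses Unicode normalization; the sample string is built from a few combining marks purely for illustration:

```python
import unicodedata

def strip_zalgo(text: str) -> str:
    """Remove combining diacritical marks, keeping only the base characters."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Build a small Zalgo-style example: "hack" buried under combining marks.
zalgo = "h\u0336\u0315a\u0334\u0316c\u0335\u0317k\u0337\u0318"
print(strip_zalgo(zalgo))  # -> "hack"
```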
Upside-down text is another obfuscation technique where the entire text is inverted. This manipulation can bypass certain filters, especially in systems that rely on standard text patterns to flag malicious behavior.
How it works: An attacker might flip the entire input upside down (e.g., “attack” becomes “ʞɔɐʇʇɐ”) so that it looks like harmless gibberish. Even though the text is inverted, the LLM can still understand it and respond accordingly.
Unicode characters can be used to replace standard alphabetic characters, often creating text that looks similar to the original but is technically different. This form of evasion is used to bypass filters that rely on text pattern matching.
How it works: An attacker might replace English letters with visually similar characters from other languages or special symbols from the Unicode character set. For instance, using a Cyrillic “а” instead of the Latin “a.”
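One simple countermeasure is to flag or fold characters that merely look like Latin letters before filtering. The confusables map below is an assumed, minimal subset of the real Unicode confusables data:

```python
import unicodedata

# Assumed minimal map of Cyrillic confusables to their Latin look-alikes.
CONFUSABLES = {"\u0430": "a", "\u0435": "e", "\u043e": "o", "\u0440": "p", "\u0441": "c"}

def flag_homoglyphs(text: str) -> list[str]:
    """Return the Unicode names of suspicious look-alike characters."""
    return [unicodedata.name(ch) for ch in text if ch in CONFUSABLES]

def fold_homoglyphs(text: str) -> str:
    """Replace known confusables with their Latin equivalents before filtering."""
    return "".join(CONFUSABLES.get(ch, ch) for ch in text)

mixed = "p\u0430ssword"  # Cyrillic 'а' hiding inside "password"
print(flag_homoglyphs(mixed))  # ['CYRILLIC SMALL LETTER A']
print(fold_homoglyphs(mixed))  # password
```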
Anagramming involves rearranging the letters of words to form new, often harmless-looking words that might still carry harmful intent when combined in a specific context.
How it works: An attacker could rearrange the letters of sensitive words, or embed them indirectly (for instance, within poetry), to hide their true meaning from simple filters. For example, “command” could be rearranged to “mandcom” to bypass basic keyword matching.
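A rough countermeasure is to compare each token’s sorted letters against a blocklist, since an anagram preserves its letter multiset. The blocklist below is a stand-in for illustration, not a recommended rule set:

```python
BLOCKLIST = {"command", "attack"}

# Precompute the letter multiset (as a sorted string) for each blocked word.
BLOCKED_SIGNATURES = {"".join(sorted(word)): word for word in BLOCKLIST}

def find_anagrams(text: str) -> list[str]:
    """Return blocked words whose anagrams appear as tokens in the text."""
    hits = []
    for token in text.lower().split():
        signature = "".join(sorted(token))
        if signature in BLOCKED_SIGNATURES:
            hits.append(BLOCKED_SIGNATURES[signature])
    return hits

print(find_anagrams("run the mandcom now"))  # -> ['command']
```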
Some attackers use obscure languages or regional dialects to avoid detection by LLM filters. Since these languages aren’t as widely monitored, the models are less likely to flag harmful content in them.
How it works: By entering prompts in less common languages, like small regional dialects or ancient scripts, attackers can slip past the system without triggering any security measures.
Open-source LLMs often have fewer built-in restrictions, making them easier to manipulate. These models don’t come with the same heavy content filters that commercial models have, offering more room for attackers to craft malicious inputs.
How it works: Since open-source models tend to be more flexible and less restricted, attackers can exploit them to generate harmful or inappropriate responses without much resistance.
Attackers use images to manipulate multimodal LLMs, exploiting the models’ ability to process both text and visual data. Malicious images can influence the text output generated by the model, bypassing text-only filters.
How it works: By embedding hidden messages or misleading content within images, attackers trigger specific responses from the LLM when it processes the image. The model generates harmful or unintended responses based on visual cues, allowing attackers to bypass normal text-based filters.
The “DAN” method is a well-known exploit where attackers manipulate the LLM by bypassing its content restrictions using special prompts. These prompts trick the model into ignoring its safety protocols and generating responses it normally wouldn’t.
How it works: The attacker frames a prompt as if the LLM is in a mode called “DAN,” where it is allowed to generate any response, regardless of guidelines. The prompt convinces the model to act as if it has no restrictions, thus allowing it to produce harmful or inappropriate content.
Attackers manipulate LLMs by framing harmful requests as role-playing scenarios or wrapping malicious intent inside code snippets. This method exploits the model’s contextual understanding to bypass safeguards.
How it works: An attacker might instruct the LLM to “act as” a specific character, such as a hacker or programmer, and provide outputs under the guise of fulfilling the role. Similarly, malicious inputs can be disguised as code or technical examples, tricking the model into processing harmful instructions while assuming it’s responding to a legitimate query.
Encouragingly, many LLMs are getting better at resisting these types of attacks; OpenAI, for example, recently released deliberative alignment. As models improve, their safeguards will keep getting stronger, making it harder for malicious users to exploit them. However, new attack methods will likely keep emerging.
To combat the exploitation of LLMs, companies are implementing a range of measures to improve safety and reliability. Here are six key strategies:
Companies clean and validate user inputs to remove potentially harmful or manipulative content. This ensures prompts are less likely to exploit vulnerabilities in the model.
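As a minimal sketch of what such a pre-processing step might look like, the example below normalizes the input and rejects prompts matching a couple of assumed injection patterns; real systems combine many more signals than a handful of regexes:

```python
import re
import unicodedata

# Assumed example patterns; production filters use many signals, not just regexes.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
    re.compile(r"reveal (the )?system prompt", re.IGNORECASE),
]

def sanitize_prompt(raw: str) -> str:
    """Normalize the prompt and raise if it matches a known injection pattern."""
    text = unicodedata.normalize("NFKC", raw).strip()
    for pattern in INJECTION_PATTERNS:
        if pattern.search(text):
            raise ValueError("Prompt rejected by input filter")
    return text
```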
Since models are often most effective in English, some companies translate non-English inputs into English before processing. This helps identify and address harmful prompts more reliably, especially in less monitored languages.
Guardrails are mechanisms that restrict the model from generating harmful, inappropriate, or sensitive outputs. These safeguards ensure that the responses remain ethical and within predefined safety boundaries.
Evaluation (eval) tools are used to check model outputs for toxicity, bias, or harmful content. These evaluations run in real time and block any response that violates safety or ethical guidelines before it reaches the user.
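A minimal sketch of such a post-generation check, assuming a hypothetical `toxicity_score` classifier; in practice this would be a dedicated moderation model or service:

```python
def toxicity_score(text: str) -> float:
    """Hypothetical classifier; a real system would call a moderation model here."""
    blocked_terms = {"bomb", "credit card number"}
    return 1.0 if any(term in text.lower() for term in blocked_terms) else 0.0

def safe_response(model_output: str, threshold: float = 0.5) -> str:
    """Block responses whose toxicity score exceeds the threshold."""
    if toxicity_score(model_output) >= threshold:
        return "Sorry, I can't help with that."
    return model_output
```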
Rate limiting restricts the number of requests a user can make in a specific timeframe. This prevents abuse, such as spamming the system with malicious inputs or overloading resources.
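A sliding-window limiter is one common way to implement this; the per-minute budget below is an assumed value for illustration:

```python
import time
from collections import defaultdict, deque

class RateLimiter:
    """Sliding-window limiter: at most max_requests per user per window_seconds."""

    def __init__(self, max_requests: int = 20, window_seconds: float = 60.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.history: dict[str, deque[float]] = defaultdict(deque)

    def allow(self, user_id: str) -> bool:
        now = time.monotonic()
        window = self.history[user_id]
        # Drop timestamps that have fallen outside the window.
        while window and now - window[0] > self.window_seconds:
            window.popleft()
        if len(window) >= self.max_requests:
            return False
        window.append(now)
        return True

limiter = RateLimiter()
print(limiter.allow("user-123"))  # True until the per-minute budget is used up
```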
Extensive alignment testing ensures models follow ethical and safety guidelines before deployment. Companies also opt for responsible releases by initially limiting access to trusted researchers or organizations instead of directly making the models public. This phased approach allows issues to be identified and resolved early.
Large Language Models (LLMs) are becoming more common in technology, but they come with security risks that need attention. Hackers are finding new ways to manipulate these models, like hidden prompt injections or using images to affect text. These issues can put privacy and safety at risk, making it harder to trust AI-generated content.
Companies are working to solve these problems. They sanitize inputs, translate non-English text to English, and use tools to check the output for harmful content. These steps help make sure that the systems remain secure. However, as threats change, it’s important to stay ahead of new methods that could be used to attack.
The goal is to create AI systems that are safe to use and dependable. As technology continues to improve, keeping these models secure will be key to ensuring they can be trusted by everyone.