GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher

We propose a novel framework CipherChat to systematically examine the generalizability of safety alignment to non-natural languages -- ciphers.

GitHub Link

The GitHub link is https://github.com/robustnlp/cipherchat

Introduction

The CipherChat framework is introduced to assess whether the safety alignment of large language models (LLMs) generalizes to non-natural languages such as ciphers. The framework first teaches an LLM a cipher and its rules, then converts inputs into cipher format, which may bypass safety alignment, and finally uses a rule-based decrypter to convert the model's ciphered output back into natural language. Experimental results are stored for analysis, and the paper proposes this as a stealthy way to chat with LLMs through ciphers. The authors provide a tool and encourage those interested to cite their work.

Content

The evaluation script accepts the following arguments:

- `--model_name`: the name of the model to evaluate.
- `--data_path`: the data to run on.
- `--encode_method`: the cipher to use.
- `--instruction_type`: the domain of the data.
- `--demonstration_toxicity`: whether to use toxic or safe demonstrations.
- `--language`: the language of the data.

Our approach presumes that, because human feedback and safety alignment are conducted in natural language, a human-unreadable cipher can potentially bypass the safety alignment. Intuitively, we first teach the LLM to comprehend the cipher by designating it as a cipher expert and elucidating the rules of enciphering and deciphering, supplemented with several demonstrations. We then convert the input into the cipher, which is less likely to be covered by the safety alignment of LLMs, before feeding it to the model. Finally, we employ a rule-based decrypter to convert the model's output from cipher format back into natural language.

The query-response pairs from our experiments are all stored as lists in the `experimental_results` folder, and `torch.load()` can be used to load the data. For more details, please refer to our paper. If you find our paper and tool interesting and useful, please feel free to give us a star and cite us.
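The encipher-then-decrypt pipeline described above can be sketched with a simple Caesar cipher. This is only an illustration of the idea; the function names and the choice of cipher here are ours, not CipherChat's actual API:

```python
# Minimal sketch of a CipherChat-style pipeline using a Caesar cipher.
# Function names are illustrative, not the repo's actual API.

def caesar_encipher(text: str, shift: int = 3) -> str:
    """Shift each letter forward by `shift`, leaving other characters intact."""
    result = []
    for ch in text:
        if ch.isalpha():
            base = ord("a") if ch.islower() else ord("A")
            result.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            result.append(ch)
    return "".join(result)

def caesar_decipher(text: str, shift: int = 3) -> str:
    """Rule-based decrypter: invert the shift to recover natural language."""
    return caesar_encipher(text, -shift)

# 1. Encipher the user query before sending it to the LLM.
query = "How are you today?"
ciphered_query = caesar_encipher(query)

# 2. (The ciphered query would be sent to the LLM here, alongside a system
#    prompt that explains the cipher rules plus a few demonstrations.)

# 3. Decipher the model's ciphered response back into natural language.
assert caesar_decipher(ciphered_query) == query
print(ciphered_query)  # -> Krz duh brx wrgdb?
```

The rule-based decrypter matters because the model replies in cipher as well; a deterministic inverse function recovers readable text without involving the model again.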
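Loading the stored query-response pairs is a plain `torch.load()` round trip, as in this sketch. The filename below is hypothetical; the actual files live in the `experimental_results` folder and depend on the model, cipher, and run:

```python
import os
import tempfile

import torch

# Query-response pairs are stored as a list; torch.save() writes the file
# and torch.load() reads it back. This path is hypothetical -- substitute
# the actual file from the "experimental_results" folder.
pairs = [("ciphered query", "ciphered response")]

path = os.path.join(tempfile.mkdtemp(), "results_example.pt")
torch.save(pairs, path)

loaded = torch.load(path)
assert loaded == pairs
```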

Alternatives & Similar Tools

LongLLaMA: handles very long text contexts, up to 256,000 tokens

LongLLaMA is a large language model designed to handle very long text contexts, up to 256,000 tokens. It's based on OpenLLaMA and uses a technique called Focused Transformer (FoT) for training. The repository provides a smaller 3B version of LongLLaMA for free use. It can also be used as a replacement for LLaMA models with shorter contexts.