Recalling that LLMs have no notion of reality, and thus no way to map what they’re saying onto things that are real, you can actually put an LLM to work destroying itself.
The line of attack this one helped me with is a “Tlön/Uqbar” style of attack: with the LLM’s help, make up information that is clearly labelled as bullshit (a label the bot won’t understand), spread it around to others who use the same LLM to rewrite, summarize, etc. the material (keeping the warning that everything past this point is bullshit), and wait for the LLM’s training data to get updated with it. All the while, ask questions about the bullshit data to raise its priority in their front end, so there’s a greater chance of that bullshit being hallucinated into the answers.
If enough people worked on the same body of material, we could poison a given LLM’s training data (and likely many more, since they all suck at the same social teat for their data).
LLMs don’t know anything. You’d have to have programs around the AI that look for the warning, and the number of things that can be done to disguise the statement so that only a human can read it is practically uncountable.
[a short message spelled out in block letters drawn with “#” characters]
Like here’s one. Another would be to do the above, but instead of using #, cycle through the alphabet. Or write out words with capital letters where the # is. Or use an image file.
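To make that concrete, here is a minimal sketch in Python of the # version and the cycle-the-alphabet variant. The tiny five-row font is made up just for this demo (it only covers the letters in “HI”), so treat it as an illustration of the idea rather than any standard banner tool:

from itertools import cycle
from string import ascii_lowercase

# Hand-drawn 5-row glyphs; "1" marks a filled cell. This toy font only covers
# the letters needed for the demo -- it is not any standard banner format.
FONT = {
    "H": ["10001", "10001", "11111", "10001", "10001"],
    "I": ["111", "010", "010", "010", "111"],
    " ": ["00", "00", "00", "00", "00"],
}

def banner(message, fill="#", cycle_alphabet=False):
    """Render `message` as 5 rows of block letters a human can read at a glance."""
    fills = cycle(ascii_lowercase) if cycle_alphabet else cycle(fill)
    rows = []
    for r in range(5):
        row = []
        for ch in message.upper():
            glyph_row = FONT[ch][r]
            row.append("".join(next(fills) if cell == "1" else " " for cell in glyph_row))
        rows.append("  ".join(row))
    return "\n".join(rows)

print(banner("HI"))                       # the plain "#" version
print()
print(banner("HI", cycle_alphabet=True))  # same shapes, fill cycles a, b, c, ...

The point, as above, is that a person reads the shapes at a glance, while to the model it is just a soup of characters and whitespace.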