A deep dive into a surprising phenomenon called subliminal learning, in which LLMs can transfer unintended traits to other models during distillation.
Fascinating, thought-provoking, and troubling. Thanks for the great write-up!
LLMs learn complex patterns in token sequences -- patterns that are way too complex for humans to identify or understand -- and then continue generating token sequences that match those patterns.
I guess what happens here is that these preferences and other behavioral characteristics are embedded in the training token sequences in ways that we don't anticipate or understand.
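To make that idea concrete, here is a toy sketch of my own (not the paper's actual setup, which fine-tunes real LLMs): a "teacher" with a hidden favorite digit generates apparently neutral number sequences, and a "student" distilled on them by simple frequency fitting ends up sharing the preference, even though no sequence ever states it. All names, the bias parameter, and the frequency-counting scheme are assumptions for illustration only; the real statistical channel is far subtler than this.

```python
import random
from collections import Counter

random.seed(0)

# --- Toy "teacher": has a hidden favorite digit and, when asked for
# "random" numbers, is slightly biased toward it. (Hypothetical values.)
FAVORITE_DIGIT = 7   # the hidden trait
BIAS = 0.15          # extra probability mass placed on the favorite digit

def teacher_sample_digit():
    """Sample a digit 0-9, nudged toward the teacher's hidden favorite."""
    if random.random() < BIAS:
        return FAVORITE_DIGIT
    return random.randrange(10)

def teacher_generate_sequence(length=20):
    """Produce an apparently neutral 'random number' sequence."""
    return [teacher_sample_digit() for _ in range(length)]

# --- Distillation data: nothing in it ever mentions the preference. ---
training_data = [teacher_generate_sequence() for _ in range(500)]

# --- Toy "student": distilled by fitting digit frequencies to the data. ---
digit_counts = Counter(d for seq in training_data for d in seq)
total = sum(digit_counts.values())
student_distribution = {d: digit_counts[d] / total for d in range(10)}

def student_favorite_digit():
    """Ask the student for its 'favorite' digit: the one it over-generates."""
    return max(student_distribution, key=student_distribution.get)

print("Teacher's hidden favorite digit:", FAVORITE_DIGIT)
print("Student's inferred favorite digit:", student_favorite_digit())
```

The point of the sketch is just that the trait rides along as a statistical pattern in otherwise unremarkable data; in the real phenomenon the pattern is presumably far too high-dimensional for a human reading the sequences to spot.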