Control vector discussion

#2
by ChuckMcSneed - opened

Continuation of:
https://huggingface.co./jukofyork/Dark-Miqu-70B/discussions/3
I've succeeded in removing slop from CR+ for both sfw and nsfw scenarios using control vectors. Strangely, the sfw unslop control vector did not affect nsfw slop, and the nsfw control vector made the model extra horny, which in my opinion is an undesirable side effect. While the sfw vector managed to stay coherent during my stress tests, the nsfw vector caused poor commandr to disintegrate: it didn't know what to say without any of those overused phrases from erotic fiction that the control vector stopped from appearing. Looks like the issue for nsfw is at a much deeper level: the data the model gets it from is very monotonous, and when forced to write in a different style, it doesn't know what to do. This is what most likely makes it incredibly difficult to remove nsfw slop using regular prompting techniques.

Well darn...

I'm making more progress with control vectors!
https://huggingface.co./ChuckMcSneed/control_vectors/blob/main/command-r-plus/bio/control_vector-commandr-bio.gguf
I tuned this one on very descriptive biological language as positive and vague flowery prose as negative. Seems to make it more aware of the biology and surroundings of characters.
https://huggingface.co./ChuckMcSneed/control_vectors/blob/main/command-r-plus/incharacter/control_vector-commandr-incharacter.gguf
This one makes the model act slightly more in character, but the improvement is not very significant as commandr is already quite good at it.

the nsfw vector caused poor commandr to disintegrate: it didn't know what to say without any of those overused phrases from erotic fiction that the control vector stopped from appearing. Looks like the issue for nsfw is at a much deeper level: the data the model gets it from is very monotonous, and when forced to write in a different style, it doesn't know what to do.

This may actually just be a problem with the "two class" control vectors! I have even managed to completely stop a model from being able to write a story because of this... To explain the problem in simple terms:

Think about a clock face with a shorter hour hand and a longer minute hand:

  • When the time is 12:00 both hands point in the same direction, but there is still a gap between the tips of the two hands. These sorts of vectors are not what we want at all because moving in either direction will just make the model more or less "storyish", and ultimately these are what cause the model to get crippled like you describe. Even times like 12:05 or 11:50 have this same problem.
  • When the time is 6:00, 5:25, etc., the two hands point in opposite directions, and this is a good control vector that clearly moves from the undesirable to the desirable direction.

This is the problem I've been grappling with for the last 2 weeks:

  • If the "hands" are both long and well defined then cosine similarity works fine: it outputs a number similar to a correlation, where 1.0 is like the 12:00 example above, -1.0 is like the 6:00 example above, and 0.0 is like 3:00 or 9:00 (ie: 90 degrees). This can then be used to filter out these shitty "storyish" directions, but...
  • There isn't really a good reason why the things we are interested in should create a clear "axis" like this, and it turns out that often the case will be like a really long minute hand and a tiny/stubby hour hand... Cosine similarity doesn't work in this case, as the direction of the tiny hand has noise added to it and can point in wildly different directions as a result.

So after lots of experimenting with this, I think I may finally have worked out a method of detecting these shitty directions:

Flip the direction of one of the hands and see if it gets easier to discriminate between our two classes!!!

  • If the time is 12:00 and you flip either hand to get 6:00 or 12:30 then it's clear the gap between the tips of the hands has increased! This is a shitty direction for a control vector.
  • If the time is 6:00 and you flip either hand then the gap has clearly decreased! This is a good direction for a control vector.
  • This works fine even when one hand is tiny in length.
  • This works for 12:05, 11:50, 6:00, 5:25, etc. type directions.
  • The 3:00 or 9:00 type directions (ie: 90 degrees) are the directional pairs where we get no change.

So what I am doing now is performing SVD to decompose the gap into lots of directions, testing each one and only keeping those that pass the above test, then finally reconstructing the final direction to only include the "good" directions.
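
Roughly, in numpy pseudocode (a simplified sketch rather than the exact code I'm running - the variable names and the crude separation score are just illustrative):

import numpy as np

def select_good_directions(pos, neg, baseline, n_dirs=32):
    # Simplified sketch of the "flip one hand" test (illustrative only).
    # pos / neg / baseline: (n_samples, hidden_dim) matrices of hidden states,
    # with matched rows, so the per-class offsets act as the two "hands".
    a = pos - baseline                      # offsets for the desirable class
    b = neg - baseline                      # offsets for the undesirable class

    # Decompose the per-sample gap between the two classes into candidate directions.
    gap = a - b
    _, _, vt = np.linalg.svd(gap, full_matrices=False)
    candidates = vt[:n_dirs]                # unit-length candidate directions

    def separation(x, y):
        # Crude 1-D discriminability score (t-statistic-like).
        return abs(x.mean() - y.mean()) / np.sqrt(x.var() + y.var() + 1e-8)

    kept = []
    for v in candidates:
        pa, pb = a @ v, b @ v
        if separation(pa, -pb) < separation(pa, pb):
            kept.append(v)                  # flipping one "hand" CLOSES the gap -> 6:00-like -> keep

    # Reconstruct the final direction from only the "good" directions.
    mean_gap = a.mean(axis=0) - b.mean(axis=0)
    good = np.zeros_like(mean_gap)
    for v in kept:
        good += (mean_gap @ v) * v
    return good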

I still need to run some more tests but will likely have this perfected in a couple of days and will upload the new control vectors and the code to create your own.

I'm making more progress with control vectors!
https://huggingface.co./ChuckMcSneed/control_vectors/blob/main/command-r-plus/bio/control_vector-commandr-bio.gguf
I tuned this one on very descriptive biological language as positive and vague flowery prose as negative. Seems to make it more aware of the biology and surroundings of characters.
https://huggingface.co./ChuckMcSneed/control_vectors/blob/main/command-r-plus/incharacter/control_vector-commandr-incharacter.gguf
This one makes the model act slightly more in character, but the improvement is not very significant as commandr is already quite good at it.

I'll have to look into your method as I'm currently using 30,000 samples to do what you look to be doing with 5!? I think my collection of story prompts are a bit shit as it's pretty hard to write a Grimdark story when the prompt says "Write a story about being overjoyed on the day of your graduation." or similar :/

I definitely think you need more samples though. PCA is basically just eigen-decomposition of a covariance matrix, and statistically it can be shown that in the very best case you need O(d) samples to reliably estimate the covariance matrix:

https://stats.stackexchange.com/questions/90045/how-many-samples-are-needed-to-estimate-a-p-dimensional-covariance-matrix

and command-r-plus has around 11.5k variables in its hidden dimension, while most other large 70B+ models have 8192.
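
To give a feel for the numbers, here's a tiny (purely illustrative) numpy demo: even for plain white noise, whose true covariance is the identity, the sample covariance is still way off in spectral norm until n is a good multiple of d:

import numpy as np

# Purely illustrative: estimate the covariance of d-dimensional white noise
# (true covariance = identity) from n samples and measure the spectral error.
rng = np.random.default_rng(0)
d = 512
for n in (d // 4, d, 8 * d):
    x = rng.standard_normal((n, d))
    err = np.linalg.norm(np.cov(x, rowvar=False) - np.eye(d), 2)
    print(f"n = {n:5d}  spectral error = {err:.2f}")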

I'm using 2 classes and a baseline, 10 system prompts per triple, and 1k prompts per system prompt = 3 x 10 x 1000 = 30000 samples. But I also have matched pairs that get subtracted from the baseline which should reduce the error in the covariance matrix even further.

A simple hacky test you could try would be to train your control vectors 5 times but leave one of the 5 prompts out each time. Then test and see if you get wildly different results... If you do then you need to increase the sample size, but if you don't then this must mean that only a tiny tiny fraction of command-r-plus's 11.5k variables are changing hugely in magnitude for your prompts (which would be very surprising).
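
Something along these lines (a rough sketch - train_control_vector() is just a placeholder for however you actually train yours):

import itertools
import numpy as np

# Rough sketch of the leave-one-out stability check.
# train_control_vector() is a hypothetical placeholder for your actual training
# code and should return the control vector as a 1-D numpy array.
prompts = ["prompt 1", "prompt 2", "prompt 3", "prompt 4", "prompt 5"]

vectors = []
for i in range(len(prompts)):
    subset = prompts[:i] + prompts[i + 1:]          # leave one prompt out
    vectors.append(train_control_vector(subset))    # hypothetical trainer

# Pairwise cosine similarity between the 5 leave-one-out control vectors:
# values well below ~1.0 suggest the sample size is too small.
for u, v in itertools.combinations(vectors, 2):
    print(round(float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v))), 3))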

I'm using 2 classes and a baseline, 10 system prompts per triple, and 1k prompts per system prompt = 3 x 10 x 1000 = 30000 samples. But I also have matched pairs that get subtracted from the baseline which should reduce the error in the covariance matrix even further.

Oh wow... That's really huge... Are all of those synthetic? I'm using high quality "cyborg" data: generated by the model but heavily edited by a human (me) as positive, with the "mean" method; more of my time goes into dataset generation than into training. You know that the models have in-context learning, so my theory was that if I show it how to write (cyborg) vs how not to write (synthetic), I would get a better control vector out of it than if I just throw it some starters with a prompt, and it seems to do just as I want. In the stories part, I try to keep as few variables from changing as possible, so they don't get affected by the control vector. Also, keeping the prompts equal length helps with the quality of the control vector, especially when they are short: >400-token prompts can take a 10-token variation much better than <100-token prompts.

I'll have to look into your method as I'm currently using 30,000 samples to do what you look to be doing with 5!? I think my collection of story prompts are a bit shit as it's pretty hard to write a Grimdark story when the prompt says "Write a story about being overjoyed on the day of your graduation." or similar :/

Wait, you put that into positive too? It should be "Write a very sad story with a very bad ending about the day of your graduation." vs "Write a very happy story with a very good ending about the day of your graduation."

I'm using 2 classes and a baseline, 10 system prompts per triple, and 1k prompts per system prompt = 3 x 10 x 1000 = 30000 samples. But I also have matched pairs that get subtracted from the baseline which should reduce the error in the covariance matrix even further.

Oh wow... That's really huge... Are all of those synthetic? I'm using high quality "cyborg" data: generated by the model but heavily edited by a human (me) as positive, with the "mean" method; more of my time goes into dataset generation than into training. You know that the models have in-context learning, so my theory was that if I show it how to write (cyborg) vs how not to write (synthetic), I would get a better control vector out of it than if I just throw it some starters with a prompt, and it seems to do just as I want. In the stories part, I try to keep as few variables from changing as possible, so they don't get affected by the control vector. Also, keeping the prompts equal length helps with the quality of the control vector, especially when they are short: >400-token prompts can take a 10-token variation much better than <100-token prompts.

I'm using a mix of different story prompt datasets I found and a set of 10 matched system prompts that go with these.

I'll have to look into your method as I'm currently using 30,000 samples to do what you look to be doing with 5!? I think my collection of story prompts are a bit shit as it's pretty hard to write a Grimdark story when the prompt says "Write a story about being overjoyed on the day of your graduation." or similar :/

Wait, you put that into positive too? It should be "Write a very sad story with a very bad ending about the day of your graduation." vs "Write a very happy story with a very good ending about the day of your graduation."

Even though the prompts are pretty trash, I think this might actually be quite a good thing and encourage the model to just generally "be dark" or "be chaotic", and not just when specifically asked to "write a grimdark story", etc.

It seems to have worked anyway, as the new control vectors are way better than the old ones from this repo.

I'm now also skipping the last layer (which it looks like you are also doing - from looking inside your .safetensors files?). The last layer seems to be an oddball and can have activations 10-100x larger than the previous layer(s). The way I have the scale factors working now, the early layers are fine to fiddle with and just get really tiny offsets added that do almost nothing if the direction is weak.
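
You can see the oddball last layer with something as simple as this (a rough sketch using transformers' output_hidden_states; the model name is just an example):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Rough sketch: print the RMS of each layer's hidden state to spot the
# "oddball" final layer (the model name here is just an example).
name = "CohereForAI/c4ai-command-r-plus"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16, device_map="auto")

inputs = tokenizer("Write a story about a lighthouse keeper.", return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

for i, h in enumerate(out.hidden_states):   # embedding output + one entry per layer
    print(i, float(h.float().pow(2).mean().sqrt()))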

Later in the week I will investigate using the "Cross Correlation Matrix" again, as I now have a much better idea of how to test for the shitty "storyish" directions that killed this before.

I'm also gonna think about what other traits I can try - "purple prose" isn't really something I encounter, as I mostly just try to get them to write "dark" stories, and my main enemies are redemption arcs and stupid "steeled themselves for the challenges to come" BS.

How does your ArcaneEntanglement-model64 model not have more attention? I'd never heard of it.

I made it at the end of an era, right when CR+ came out, making all those L2 tunes and merges obsolete. Like all L2s it only had 4k context and was dumber than CR+. I never bothered advertising it because of that. It was quite high on the old LLM leaderboard though, not that it means much.

Ah makes sense. A lot happened around the CR+ release time.

So this model is cooked right? Every token going forward is 99-100% lol

[image]

Ah makes sense. A lot happened around the CR+ release time.

So this model is cooked right? Every token going forward is 99-100% lol

[image]

You mean Gembo? Yeah, it's cooked AF. ArcaneEntanglement-model64 is messed up in the opposite direction, with some random tokens having too low probs, especially 's.

[image]

I have books3 (38GB) with other stuff from ThePile (books1, literotica) and a shitton of (uncleaned) Russian books from Libruks (300GB). Also Project Gutenberg has free books. Doesn't look impossible.

I've managed to find books3, books1 (literotica isn't really what I'm aiming for though) and a couple of other datasets. I still can't find an "easy" download for deepmind/pg19 for the Project Gutenberg data though.

But we really shouldn't be using bad instruct tunes, like Qwen 2.5 or llama 3.1, since they will bake in their slop bias.

Yeah, qwen and llama-3 are pretty much useless IMO.

Let's continue the experiments...

CR+ Q6_K

E: 16%
Ash: 12%
Elias: 12%
Brother: 7%
Al: 5%
Cyrus: 4%
A: 2%
Marek: 1%
Ark: 1%
Mal: 1%

52% for top 5, but 61% for top 10. That's 16% less than Largestral. Also note the word "Brother".

Yeah, it seems really good at picking up little nuances from the context like this, eg:

  • When I asked it to write a story in the style of Patricia Highsmith set on the east coast of Scotland in the 1920s, it really picked up lots of little things like this to do with the old-fashioned names, etc.
  • No other model could write even close to Cormac McCarthy; their knowledge just seemed to be what they had read on Wikipedia and nothing like the real author.

I still think CR+ (original) is the best model, so if we can get that to be more balanced without hurting it too badly then that would be the best we can do for now I think.

and fingers crossed it should not be "broken" at all due to limiting to down_proj only

I did a few QLoRAs with down_proj only and didn't manage to break anything.

Sadly it was broken lol:

[image]

It wasn't going to do anything useful after that and would just have wasted another 12.5 days...

I'm pretty sure it was either:

  1. The rank was (far) too big:

184000000/((28672+12288)×(88−2)) --> ~52.2 samples per rank-1 LoRA parameter

so using rank-64 was far too large even for 184M samples...

  2. Underflow problems due to using the bfloat16 type:

If unsure, set everything to bf16 and use the adamw_kahan optimizer type. Kahan summation is ESPECIALLY important for full fine tuning. Kahan summation requires an extra 2 bytes per trainable parameter compared to vanilla full bf16 training.
For LoRAs, another option is setting lora_weight_dtype to fp32, which also makes all optimizer states fp32.
For LoRAs only, with constant learning rate no lower than 5e-5 or so, I have seen full bf16 training with no Kahan summation mostly match fp32 or bf16 + Kahan.
(more experimental) You may try Deepspeed's bf16 mode, but I personally don't use this. I think this does something like mixed precision, where it wraps the optimizer to keep a master copy of the parameters in fp32, as well as doing gradient accumulation and all optimizer states in fp32. This will use much more memory than full bf16 + Kahan summation.

- https://github.com/tdrussell/qlora-pipe

and I was using this with the newer Mistral AI models, which seem to keep all their activations really small compared to other models... so I was probably right on the border of his suggestions.


So I set the rank to 10 (~5 samples per LoRA parameter) and used float32, as I still have plenty of VRAM to spare.

I also:

  • Turned off lora_dropout and weight_decay and just use the rank to regularise now (I think weight_decay might work really badly with 'down_proj' or 'out_proj', and the lora_dropout justification was for much smaller models and higher ranks).
  • Set to skip the last 2 layers instead of just the last layer.
  • Set focal_loss_gamma = 2.0 and found a way to monitor if it is doing anything via a histogram of the log-loss of the evaluations.

# Paths
model = '/mnt/data/Mistral-Large-Instruct-2407'
output_dir = '/mnt/data/Mistral-Large-Instruct-2407__finetune_new'

# Lora configuration
lora_rank = 10                             ## 184000000/((28672+12288)×(88−2)) --> ~52.2 samples per rank-1 LoRA parameter --> ~5 samples per rank-10 LoRA parameter
lora_alpha = 10
lora_dropout = 0.0                         ## {{ use 0.0 for now }} "QLoRA paper sets to 0.1 for fine-tuning 7B/13B models, and reduces to 0.05 for 33B/65B models"
target_modules = ['down_proj']             ## target 'down_proj' only
layers_to_transform = '0:85'               ## skip last 2 layers (86 and 87)

# Optimization configuration
epochs = 1
lr_scheduler = 'constant'
warmup_steps = 10
batch_size_tokens = 8192
focal_loss_gamma = 2.0                     ## {{ try 2.0 for now }} https://arxiv.org/abs/1708.02002

# Performance settings
pipeline_stages = 2                        ## see: https://github.com/tdrussell/qlora-pipe/issues/1
logging_steps = 1
eval_steps = 50
save_steps = 50
checkpoint_every_n_minutes = 60
eval_before_first_step = true
lora_weight_dtype = 'float32'              ## 'bfloat16' --> type = 'adamw' and lr >= 5e-5, or type = 'adamw_kahan' and lr < 5e-5
keep_states = 3
group_by_length = true
activation_checkpointing = 'unsloth'

# Resume a prior run
resume_from_checkpoint = false

# Dataset configuration
dataset_combination_mode = 'concatenate'
eval_gradient_accumulation_steps = 8

[quantization.bnb]
load_in_4bit = true
bnb_4bit_use_double_quant = false
bnb_4bit_compute_dtype = 'bfloat16'

[optimizer]
type = 'adamw'                            ## lora_weight_dtype = 'float32' --> 'adamw', or lora_weight_dtype = 'bfloat16' and lr < 5e-5 --> 'adamw_kahan'
lr = 5e-5
beta1 = 0.9
beta2 = 0.99
weight_decay = 0.0                        ## {{ use 0.0 for now }} fixed L2-regularisation for 'down_proj' (or 'out_proj') will bias changes to later layers only

[[datasets]]
name = 'books'
dataset_type = 'textfile'
dataset_path = '/mnt/data/datasets/ebooks/*.txt'
sequence_len = 8192
eval_size = 0.005

The time has dropped to 14 days due to these changes, and the weight magnitudes already look to be levelling off, with no more of the weird "slug" moving along the histogram:

[image]

I am going away from today onwards so will leave it running (assuming I don't find any more problems).

Set focal_loss_gamma = 2.0 and found a way to monitor if it is doing anything via a histogram of the log-loss of the evaluations.

[image]

This displays the "raw" log-loss without the Focal-loss sample reweighting applied.

So if it is actually doing something then we should expect to see some of the mass of the big "hump" on the right move towards the left, and hopefully NOT lots of mass moving down below -12.5 (ie: creating "word salad").
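
For reference, the reweighting itself is just the standard Focal-loss trick applied per token; something like this simplified sketch (not qlora-pipe's actual implementation):

import torch
import torch.nn.functional as F

def focal_token_loss(logits, labels, gamma=2.0):
    # Simplified sketch of per-token Focal loss (https://arxiv.org/abs/1708.02002);
    # not qlora-pipe's actual implementation (ignores label shifting / padding).
    ce = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1),
                         reduction="none")       # the "raw" per-token log-loss plotted above
    p = torch.exp(-ce)                           # probability assigned to the correct token
    weight = (1.0 - p) ** gamma                  # down-weight easy (high-probability) tokens
    return (weight * ce).mean()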

(I'll likely be out of phone signal range for at least the next couple of days so won't be able to reply to posts)

If it ends up failing due to targeting the mlp.down_proj modules alone, could you perhaps include other modules during training, and then merge only the mlp.down_proj?

Merging just the mlp.down_proj seems to have worked well for me with Mistral-Large + Lumimaid-v0.2-123b (obviously I didn't train Lumimaid, just used your lora_extract.py then merged):

https://huggingface.co./gghfez/SmartMaid-123b-GGUF
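
Conceptually the merge just adds the B·A delta back onto the base weights for the down_proj tensors only; a rough sketch (assuming peft-style lora_A/lora_B naming with matching key prefixes and a single safetensors shard, which real checkpoints won't exactly match):

import torch
from safetensors.torch import load_file, save_file

# Rough sketch: merge ONLY the mlp.down_proj LoRA deltas into the base weights.
# Assumes peft-style lora_A/lora_B tensor names with matching key prefixes and
# a single safetensors shard; real checkpoints are sharded and names vary.
base = load_file("base_model.safetensors")
lora = load_file("extracted_lora.safetensors")
scale = 1.0  # lora_alpha / lora_rank for the extracted adapter

for key in base:
    if "mlp.down_proj.weight" not in key:
        continue                                 # leave every other module untouched
    prefix = key[: -len(".weight")]
    a = lora.get(prefix + ".lora_A.weight")
    b = lora.get(prefix + ".lora_B.weight")
    if a is not None and b is not None:
        delta = scale * (b.float() @ a.float())  # (out_features, in_features)
        base[key] = (base[key].float() + delta).to(base[key].dtype)

save_file(base, "merged_down_proj_only.safetensors")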

P.S. What tool are you using to produce those charts?

If it ends up failing due to targeting the mlp.down_proj modules alone, could you perhaps include other modules during training, and then merge only the mlp.down_proj?

It was working and using Focal loss, but all the stats were broken because they used the pre-focal-loss loss :/ I managed to edit the config from my phone and restarted it - hope this is the last time :(

Merging just the mlp.down_proj seems to have worked well for me with Mistral-Large + Lumimaid-v0.2-123b (obviously I didn't train Lumimaid, just used your lora_extract.py then merged):

https://huggingface.co./gghfez/SmartMaid-123b-GGUF

The problem is that the small gradients between the layers will all be fucked up so it's not really the same.

I should also say I didn't write lora_extract.py - I just tidied it up and added some error checking!

P.S. What tool are you using to produce those charts?

It's called "TensorBoard" and it seems to rely on the log files that qlora-pipe is dumping.
