Usage
Hi,
I really like your work, and I've generated all of these control vectors for my own model. But now I have a few questions.
- It seems that at the moment only llama.cpp can handle these control vectors? Which also means that some samplers can't be used, like DRY, smoothing factor, etc.
- After some testing I feel the model gets pretty dumb with the control vectors applied. The conversation starts running in circles, and even a slightly stronger dialogue bias means endless dialogues with 100 questions, jumping between random topics in the same message.
Do you have similar results, or am I doing it wrong?
> It seems that at the moment only llama.cpp can handle these control vectors? Which also means that some samplers can't be used, like DRY, smoothing factor, etc.
Yeah, sadly unless the idea of using control vectors becomes popular enough to encourage other inference back-ends to implement them, we're stuck with only being able to use these with llama.cpp (I think they are on the verge of merging the DRY code soon, though).
> After some testing I feel the model gets pretty dumb with the control vectors applied. The conversation starts running in circles, and even a slightly stronger dialogue bias means endless dialogues with 100 questions, jumping between random topics in the same message.
> Do you have similar results, or am I doing it wrong?
They definitely can cause problems with the model, and the first thing to try is to reduce the scale-factor of the "dialogue" side of the "character focus" control vector, e.g.:

```
--control-vector xxx-character_focus__debias.gguf \
--control-vector-scaled xxx-character_focus__dialogue.gguf 0.5
```
and if that fails, you can even try using just the "character focus" de-bias alone with a reduced scale-factor, e.g.:

```
--control-vector-scaled xxx-character_focus__debias.gguf 0.5
```
In general I try to use as little as possible to get the desired effect, and sometimes this might even mean starting your story with a higher scale-factor and then restarting llama.cpp with a lower (or no) scale-factor for the rest of your story.
After one more day of testing I have a better feeling now. Thanks.
I would love to try more vectors, but generating these vectors takes forever. Is there any way to speed it up by using more VRAM?
> After one more day of testing I have a better feeling now. Thanks.
> I would love to try more vectors, but generating these vectors takes forever. Is there any way to speed it up by using more VRAM?
Yeah, it's possible somebody more knowledgeable with PyTorch could make it use batch processing, but there is some chance this would screw things up due to the <PAD> tokens :/
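Just to illustrate what batching would involve (a rough plain-transformers sketch, not the repo's own script; the model name and prompts are placeholders), the idea would be to left-pad the batch and rely on the attention mask, so the <PAD> tokens never sit between the prompt and the position the hidden states are sampled from. Even then, things like position ids could still subtly change, which is exactly the worry above:

```python
# Minimal sketch (not the repo's training code): batch several prompts through
# the model and collect the last-token hidden state of every layer, using
# left-padding plus an attention mask so <PAD> tokens stay out of the way.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"   # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"           # keep the real tokens flush with the end

model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

prompts = ["You are a storyteller. Continue:", "You are a narrator. Continue:"]
batch = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

with torch.no_grad():
    out = model(**batch, output_hidden_states=True)

# out.hidden_states is a tuple of (n_layers + 1) tensors, each [batch, seq, hidden].
# With left-padding, the final position of every row is a real token, so we can
# simply take index -1 for each layer.
last_token_states = [h[:, -1, :].float().cpu() for h in out.hidden_states]
```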
It works best if you can run on a single card: I could train a set of 8 control vectors on a 70B model every 12 hours on a single A6000 with 48GB VRAM (so every 6 hours using a pair of A6000s working separately), but the larger models took ~36 hours to do the same using 2x A6000s.
You can also try just reducing the '--num_prompt_samples' option to 1/2 or 1/4 of the hidden-state dimension (e.g. for a 70B model with a hidden dimension of 8192, that would be 4096 or 2048). The directions you find might not be quite as good, but I think using 1x the hidden-state dimension is probably overkill, and in practice you can get away with a lower value (at least whilst you are refining/testing your "continuations" prompts).
I have seen that llama.cpp brings its own tool to generate control vectors, which differs a great deal from yours. I like your approach much more, but one difference seems important to me: the llama.cpp tool seems to use the smaller GGUF quants to generate the vectors. Would this be possible with your tool too?
FYI: the vectors generated with the llama.cpp tool don't work very well.
Yeah, the llama.cpp tool is actually completely broken currently: each layer has a 50% chance of finding a direction that is the exact opposite of what you want :/
So what you train up will just end up a mixture of, say, "dialogue focus" vs "narration focus" all mixed up for every layer, and it's pure luck if it does anything useful...
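To be clear about where the "50% chance" comes from: an eigenvector v and -v are equally valid solutions, so without an extra check each layer's sign is essentially a coin flip. Here's a small numpy sketch (synthetic data, not the llama.cpp code) of one common way to disambiguate the sign, by pointing each direction from the "negative" class mean towards the "positive" class mean:

```python
# Sketch of the sign-ambiguity problem and one fix (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
pos = rng.normal(loc=+1.0, size=(200, 64))   # hidden states for the "positive" class
neg = rng.normal(loc=-1.0, size=(200, 64))   # hidden states for the "negative" class

diffs = pos - neg
cov = diffs.T @ diffs / len(diffs)           # "covariance of differences"
eigvals, eigvecs = np.linalg.eigh(cov)
v = eigvecs[:, -1]                           # principal eigenvector (sign is arbitrary)

# Disambiguate: make v point from the negative mean towards the positive mean.
if np.dot(v, pos.mean(axis=0) - neg.mean(axis=0)) < 0:
    v = -v
```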
It's also very hard to refactor the current code to use my method:
- The "covariance of differences" calculation is all hard-coded using ggml C-code, making it very hard for me to change it to keep separate A and B matrices to calculate the cross-covariance matrix from.
- The Power Iteration code can only calculate the principal eigenvector (and again, this is all locked up in a single ggml compute graph), whereas my method often finds that the direction it needs is the 2nd, 3rd or 4th eigenvector instead (see the sketch below).
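For reference, here's a rough numpy sketch of the idea as described above: keep separate A and B matrices, build a cross-covariance matrix from them, and take a full eigendecomposition so the 2nd/3rd/4th eigenvectors are available too. The data is synthetic, and the symmetrisation step is just one way of keeping the eigenvectors real; it's not a drop-in copy of the actual code:

```python
# Rough sketch: cross-covariance of two classes of hidden states, with a full
# eigendecomposition instead of Power Iteration (which only gives the 1st vector).
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(500, 64))          # hidden states for class A (e.g. "dialogue")
B = rng.normal(size=(500, 64))          # paired hidden states for class B (e.g. "narration")

A_c = A - A.mean(axis=0)
B_c = B - B.mean(axis=0)

cross_cov = A_c.T @ B_c / len(A_c)      # cross-covariance between the two classes
sym = 0.5 * (cross_cov + cross_cov.T)   # symmetrise so the eigenvectors are real
                                        # (this step is an assumption on my part)

eigvals, eigvecs = np.linalg.eigh(sym)
candidates = eigvecs[:, ::-1]           # eigh sorts ascending; strongest direction first
second_direction = candidates[:, 1]     # e.g. pick the 2nd eigenvector rather than the 1st
```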
I have (hopefully) one last question. In llama.cpp, what would I have to do to UNAPPLY the control vectors? Merge request #6289 will probably stall for the near future, so I want to dabble with my own things. My idea is:
- make a new control vector with new settings
- get the diff (new - old)
- apply the diff
How wrong am I with this idea?
Well, right now the control vectors are applied during initialization. If I want to change the scale of a vector, or remove or add a de-bias, I need to apply the new vectors and remove the old.
> Well, right now the control vectors are applied during initialization. If I want to change the scale of a vector, or remove or add a de-bias, I need to apply the new vectors and remove the old.
I think the only way would be via that PR (or similar), as the current llama.cpp code adds the control vectors into the compute graph and there isn't really an easy way to change this at runtime AFAIK.
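That said, the arithmetic behind your "apply the diff" idea is fine in principle, since the control vectors are just added to the hidden states; a toy numpy check (made-up values):

```python
# Toy check of the "apply the diff" idea (illustrative only): because control
# vectors are simply added to the hidden state, applying (new - old) on top of a
# state that already has the old vector applied is the same as applying new alone.
import numpy as np

rng = np.random.default_rng(0)
hidden = rng.normal(size=64)     # some layer's hidden state
v_old = rng.normal(size=64)      # currently-applied control vector (including its scale)
v_new = rng.normal(size=64)      # the vector you actually want applied

state_with_old = hidden + v_old
state_after_diff = state_with_old + (v_new - v_old)

assert np.allclose(state_after_diff, hidden + v_new)
```

The problem is purely practical: with the current code there's no hook to add that diff once the compute graph has been built.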
OK, thanks... I'll dive deeper into the PR then.
IIRC, it was the lack of test cases that made the PR get stuck.
Also I refactored these two functions:
```cpp
static llama_control_vector_data llama_control_vector_load_one(const llama_control_vector_load_info & load_info) {
    auto start = ggml_time_ms();
    printf("control vector load_one...\n");

    int32_t n_tensors;
    size_t n_bytes = 0;
    uint32_t max_direction_layer = 0;

    llama_control_vector_data result = { -1, {} };

    // calculate size of ctx needed for tensors, ensure tensors are f32, and find max layer
    {
        ggml_context * meta_ctx = nullptr;
        struct gguf_init_params meta_gguf_params = {
            /* .no_alloc = */ true,
            /* .ctx      = */ &meta_ctx,
        };
        struct gguf_context * meta_ctx_gguf = gguf_init_from_file(load_info.fname.c_str(), meta_gguf_params);
        if (!meta_ctx_gguf) {
            fprintf(stderr, "%s: failed to load control vector from %s\n", __func__, load_info.fname.c_str());
            ggml_free(meta_ctx);
            return result;
        }

        n_tensors = gguf_get_n_tensors(meta_ctx_gguf);
        for (int i = 0; i < n_tensors; i++) {
            std::string name = gguf_get_tensor_name(meta_ctx_gguf, i);

            // split on '.'
            size_t dotpos = name.find('.');
            if (dotpos != std::string::npos && name.substr(0, dotpos) == "direction") {
                try {
                    uint32_t layer = std::stoi(name.substr(dotpos + 1));
                    if (layer == 0) {
                        fprintf(stderr, "%s: direction tensor invalid in %s\n", __func__, load_info.fname.c_str());
                        gguf_free(meta_ctx_gguf);
                        ggml_free(meta_ctx);
                        return result;
                    }
                    if (layer > max_direction_layer) {
                        max_direction_layer = layer;
                    }
                } catch (...) {
                    fprintf(stderr, "%s: direction tensor invalid in %s\n", __func__, load_info.fname.c_str());
                    gguf_free(meta_ctx_gguf);
                    ggml_free(meta_ctx);
                    return result;
                }
            }

            struct ggml_tensor * tensor_meta = ggml_get_tensor(meta_ctx, name.c_str());
            if (tensor_meta->type != GGML_TYPE_F32 || ggml_n_dims(tensor_meta) != 1) {
                fprintf(stderr, "%s: direction tensor invalid in %s\n", __func__, load_info.fname.c_str());
                gguf_free(meta_ctx_gguf);
                ggml_free(meta_ctx);
                return result;
            }
            if (result.n_embd == -1) {
                result.n_embd = ggml_nelements(tensor_meta);
            } else if (ggml_nelements(tensor_meta) != result.n_embd) {
                fprintf(stderr, "%s: direction tensor sizes mismatched in %s\n", __func__, load_info.fname.c_str());
                gguf_free(meta_ctx_gguf);
                ggml_free(meta_ctx);
                return result;
            }
            n_bytes += ggml_nbytes(tensor_meta);
        }
        gguf_free(meta_ctx_gguf);
        ggml_free(meta_ctx);
    }

    if (n_tensors == 0) {
        fprintf(stderr, "%s: no direction tensors found in %s\n", __func__, load_info.fname.c_str());
        return result;
    }

    // load and scale tensors into final control vector context
    struct ggml_context * ctx = nullptr;
    struct gguf_init_params params = {
        /*.no_alloc = */ false,
        /*.ctx      = */ &ctx,
    };
    struct gguf_context * ctx_gguf = gguf_init_from_file(load_info.fname.c_str(), params);
    if (!ctx_gguf) {
        fprintf(stderr, "%s: failed to load control vector from %s\n", __func__, load_info.fname.c_str());
        ggml_free(ctx);
        return result;
    }

    // do not store data for layer 0 (it's not used)
    result.data.resize(result.n_embd * max_direction_layer);

    for (uint32_t il = 1; il <= max_direction_layer; il++) {
        const std::string name = "direction." + std::to_string(il);
        const ggml_tensor * tensor = ggml_get_tensor(ctx, name.c_str());

        float * dst = result.data.data() + result.n_embd * (il - 1);

        if (tensor) {
            const float * src = (const float *) tensor->data;
            for (int j = 0; j < result.n_embd; j++) {
                dst[j] = src[j] * load_info.strength;
            }
        } else {
            for (int j = 0; j < result.n_embd; j++) {
                dst[j] = 0.0f;
            }
        }
    }

    gguf_free(ctx_gguf);
    ggml_free(ctx);

    auto end = ggml_time_ms();
    printf("control vector load_one took %ums\n", (unsigned) (end - start));

    return result;
}

llama_control_vector_data llama_control_vector_load(const std::vector<llama_control_vector_load_info> & load_infos) {
    auto start = ggml_time_ms();
    printf("control vector load...\n");

    llama_control_vector_data result = { -1, {} };

    for (const auto & info : load_infos) {
        auto cur = llama_control_vector_load_one(info);

        if (cur.n_embd == -1) {
            return result;
        }
        if (result.n_embd != -1 && (result.n_embd != cur.n_embd || result.data.size() != cur.data.size())) {
            printf("%s: control vector in %s does not match previous vector dimensions\n", __func__, info.fname.c_str());
            return result;
        }

        if (result.n_embd == -1) {
            result = std::move(cur);
        } else {
            for (size_t i = 0; i < cur.data.size(); i++) {
                result.data[i] += cur.data[i];
            }
        }
    }

    if (result.n_embd == -1) {
        printf("%s: no vectors passed\n", __func__);
    }

    auto end = ggml_time_ms();
    printf("control vector load time: %ums\n", (unsigned) (end - start));

    return result;
}
```
So they are much simpler now in the main branch (and also fixed a lot of memory leaks by having a single point of exit), but the limitation of llama.h only allowing C code means it still has to pass the control vectors as a zero-padded block of floats.
It really should be passing a std::vector<std::vector<float>> IMO, but via C only that's very painful to do. llama.cpp can handle C++ fine if you can manage to pass it via llama.h, and there is a (very) convoluted example of passing the LoRA structs via forward declaration and "wrapping" a pointer in a struct, but I didn't get any further.
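For anyone wanting to work with that zero-padded block directly, this is the layout the loader above produces (a small numpy illustration with made-up sizes; layer 0 is skipped, and layers without a direction tensor stay all-zero):

```python
# Flat, zero-padded layout: layer il occupies data[(il - 1) * n_embd : il * n_embd].
import numpy as np

n_embd = 8
max_layer = 4
directions = {2: np.full(n_embd, 0.5), 4: np.full(n_embd, -1.0)}  # made-up per-layer vectors

data = np.zeros(n_embd * max_layer, dtype=np.float32)
for il, vec in directions.items():
    data[(il - 1) * n_embd : il * n_embd] = vec  # layers with no tensor stay zero
```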