LoRA vs. Full Fine-Tuning: An Illusion of Equivalence
(arxiv.org) | 235 points by timbilt 6 days ago | 59 comments
six_four_eight 6 days ago | root | parent | next |
I wonder how this compares to 'catastrophic forgetting', which can be a problem with full fine-tuning. Or at least that's what I've just been reading as a case _for_ using LoRA, since it's supposedly not susceptible to that. I guess this paper shows LoRA causes forgetting in a different way.
Are there good general principles yet for what fine tuning method to use in certain situations? It still seems quite difficult to know ahead of time what's going to happen.
K0balt 4 days ago | root | parent |
Catastrophic forgetting or “psychosis” seems to happen when I overtrain. It’s easy to make it happen to models that have been extensively tuned already, but the base models hold up much better. I’m pretty sure there is a point in the n-dimensional space where x discrete vectors with n dimensions stop encoding usefully distinct patterns.
ismailmaj 6 days ago | root | parent | prev | next |
How does it compare to partially fine-tuning the model by freezing most of the network except the last few layers?
K0balt 4 days ago | root | parent |
Idk, but if I were guessing, I would guess that that process would be likely to create intruder dimensions in those layers… but it's hard to say how impactful that would be. Intuitively I would think it would tend to channel a lot of irrelevant outputs towards the semantic space of the new training data, but idk how well that intuition would hold up to reality.
Mockapapella 6 days ago | root | parent | prev |
Thank you for this layman explanation
pwillia7 6 days ago | prev | next |
This tracks with my feelings making and using Stable Diffusion LoRAs and fine-tunes. Still, given how fast they are to train and use, LoRAs have worked for me in most use cases and it hasn't been worth fine-tuning the entire model.
K0balt 6 days ago | root | parent |
Yeah, it reflects the “feel” I get from LoRA as well, especially if I overdo it. The new data becomes the preferred output even for unrelated inputs. I always felt like it was bludgeoning the model to some extent vs fine-tuning.
Also, LoRA-tuning an extensively tuned model occasionally provokes full-on delusional “insanity” or gibberish seizures.
I have had really good luck, though, using a highly tuned model as the training basis for a LoRA and then applying that LoRA mask to the base version of that model. I’m not sure why that seems to work better than training the same LoRA directly on the base model.
cheald 6 days ago | root | parent |
I've done a lot of tinkering with the internals of LoRA training, specifically investigating why fine-tuning and LoRA training produce such different results, and I'm no academic, but I have found that there are definitely some issues with the SOTA, at least WRT Stable Diffusion.
I've had significant success with alternate init mechanisms (the standard technique of init'ing B to zeros really does hurt gradient flow), training alpha as a separate parameter (especially if you bootstrap the process with alphas learned from a previous run), and altering the per-layer learning rates (because (lr * B) @ (lr * A) produces an update of a fundamentally different magnitude than the fine-tune-style update W += lr * (B @ A)).
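For concreteness, here is a minimal PyTorch sketch of that kind of setup (non-zero B init and a learnable per-layer alpha); the class name and init scales are illustrative, not taken from any particular training repo:

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        # Wraps a frozen nn.Linear with a low-rank adapter. Unlike standard LoRA,
        # B gets a small non-zero init and alpha is a trainable per-layer scale.
        def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 32.0):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False          # pretrained weights stay frozen

            in_f, out_f = base.in_features, base.out_features
            self.A = nn.Parameter(torch.randn(rank, in_f) / rank ** 0.5)
            self.B = nn.Parameter(torch.randn(out_f, rank) * 1e-4)  # non-zero init
            self.alpha = nn.Parameter(torch.tensor(alpha))          # learned, not fixed
            self.rank = rank

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            delta = (x @ self.A.T) @ self.B.T                        # low-rank path
            return self.base(x) + (self.alpha / self.rank) * delta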
In the context of Stable Diffusion specifically, as well, there's some really pathological stuff that happens when training text encoders alongside the unet; for SD-1.5, the norm of "good" embeddings settles right around 28.0, but the model learns that it can reduce loss by pushing the embeddings away from that value. However, this comes at the cost of de-generalizing your outputs! Adding a second loss term which penalizes the network for drifting away from the L1 norm of the untrained embeddings for a given text substantially reduces the "insanity" tendencies. There's a more complete writeup at https://github.com/kohya-ss/sd-scripts/discussions/294#discu...
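A rough sketch of the extra loss term described above (assuming PyTorch; the L1 norm and the weighting follow the comment, and the tensor names are made up for illustration):

    import torch

    def embedding_drift_penalty(trained_emb: torch.Tensor,
                                frozen_emb: torch.Tensor,
                                weight: float = 0.1) -> torch.Tensor:
        # Penalize the trainable text encoder for letting its embedding norms
        # drift away from those of the frozen, untrained encoder on the same text.
        drift = trained_emb.norm(p=1, dim=-1) - frozen_emb.norm(p=1, dim=-1)
        return weight * drift.abs().mean()

    # total_loss = diffusion_loss + embedding_drift_penalty(emb, emb_reference)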
You also have the fact that the current SOTA training tools just straight up don't train some layers that fine-tunes do.
I do think there's a huge amount of ground to be gained in diffusion LoRA training, but most of the existing techniques work well enough that people settle for "good enough".
doctorpangloss 5 days ago | root | parent |
Most people are using LoRAs as a solution for IP transfer.
Thing is, Ideogram v2 has already achieved IP transfer without fine-tuning or adapters. So we know those aren't needed.
Is Ideogram v2 an exotic architecture? No, I don't think so.
Are there exotic architectures that will solve IP transfer and other tasks? The Chameleon and OmniGen architectures. Lots of expertise went into SD3 and Flux dataset prep, but: the multimodal architectures are so much more flexible and expressive.
Flow matching models are maybe the last we will see before multi-modal goes big.
What to make of things in the community? How is it possible that random hyperparameters and 30 minute long fine tunings produce good results?
(1) Dreambooth effect: if it's like, a dog, you won't notice the flaws.
(2) File drawer problem. Nobody publishes the 99 things that didn't work.
(3) SD <3 struggled with IP transfer on image content that could not have possibly been in its datasets. But laypeople are not doing that. They don't have access to art content that Stability and BFL also don't have access to.
(4) Faces: of course SD family saw celebrity images. Faces are over-represented in its datasets. So yeah, it's going to be good at IP transfer of photographic faces. Most are in-sample.
jey 5 days ago | root | parent |
What's "IP transfer" in this context?
Der_Einzige 5 days ago | prev | next |
This paper seems dubious, because it flies in the face of what the ReFT/pyreft paper is showing (you can train 0.0001% of the parameters for 100 epochs to personalize on a small dataset):
https://github.com/stanfordnlp/pyreft
https://arxiv.org/abs/2404.03592
Note that the OP paper is not peer-reviewed yet, and while the one I linked isn't either, it has Christopher Manning (yes, the one you know from YouTube), the head of AI at Stanford, as a co-author.
In general, I think that LoRA and especially ReFT should be more resistant to catastrophic forgetting, since they literally don't touch most of the model.
The Stable Diffusion community has literally tens of thousands of LoRAs that don't cripple a model at small rank.
chompychop 5 days ago | root | parent |
I don't see how the authorship by Christopher Manning shifts favour towards the other paper; this paper has Antonio Torralba as a co-author, who's also one of the big shots in AI.
Eisenstein 6 days ago | prev | next |
Is this just spelling out what has been known, that LoRAs skew heavily towards the new training data and are not 'more intelligent', just 'more targeted', and become less intelligent the more targeted they are? Or is it proposing something else? I'm having a difficult time understanding exactly what 'intruder dimensions' are.
sorenjan 6 days ago | prev | next |
> We randomly initialize A such that it has singular values of 1, freeze it, and only train B. When we do this, we see a sharp reduction in high ranking intruder dimensions in comparison to those in normal LoRA
This sounds interesting, but I can't see that they do much with this result. Are they saving it for a follow-up paper? I would think that if their whole paper is about a big problem with LoRAs and they then find what looks like an easy solution to that problem, that would warrant more than a paragraph just before the conclusion.
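For reference, the variant they quote boils down to something like this (a PyTorch sketch; only B would be registered as trainable):

    import torch
    import torch.nn as nn

    def frozen_unit_singular_A(rank: int, in_features: int) -> nn.Parameter:
        # Random A whose singular values are all 1 (orthonormal rows),
        # frozen so that only B is trained on top of it.
        A = torch.empty(rank, in_features)
        nn.init.orthogonal_(A)
        return nn.Parameter(A, requires_grad=False)

    # B starts at zero and is the only trained factor:
    # B = nn.Parameter(torch.zeros(out_features, rank))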
It would also have been interesting if they had included the DoRA method; they reference it briefly, and that paper claims to resemble fine-tuning's learning behavior.
But perhaps this paper is focused on LoRA behavior, and a separate paper comparing various improvements is better.
liuliu 5 days ago | root | parent |
Yeah, honestly not too surprising. Happy someone ran the experiments though.
I think we know that NNs trained on limited data tend to overfit, so to train a LoRA you need stronger regularization mechanisms, including:
* Fixing A as a projection matrix so it doesn't rotate into an "easier" orientation for B to learn.
* Periodically merging AB into W_tuned to simulate full-model fine-tuning behavior (rough sketch below).
I think fundamentally LoRA is sound, because the gradient matrix is low-rank by nature.
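The second bullet might look something like this (a sketch assuming PyTorch conventions and the usual LoRA shapes; call it every N optimizer steps):

    import math
    import torch

    @torch.no_grad()
    def merge_and_reset(W: torch.Tensor, A: torch.Tensor, B: torch.Tensor,
                        alpha: float, rank: int) -> None:
        # Fold the current low-rank update into the tuned weight, then restart
        # the adapter so subsequent LoRA steps build on the merged weights.
        W += (alpha / rank) * (B @ A)
        B.zero_()                                          # merged model is unchanged
        torch.nn.init.kaiming_uniform_(A, a=math.sqrt(5))  # re-draw A as at init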
deskr 5 days ago | prev | next |
What an unfortunate choice of name. LoRa is already a big project.
DidYaWipe 5 days ago | root | parent | next |
Yep. People don't bother to even check anymore.
https://www.youtube.com/watch?v=YQ7aLHCTeeE
And Amazon named its voice assistant after a well-known camera. And... and...
DidYaWipe 5 days ago | root | parent |
[flagged]
BoorishBears 4 days ago | root | parent |
Or they felt like vastly different fields are allowed to share acronyms.
greenavocado 5 days ago | root | parent | next |
Just watch: Pretty soon there will be an LLM optimization called Windows
drweevil 5 days ago | root | parent | prev |
True enough, but it makes searching (on HN, Google, etc.) a PITA.
pclmulqdq 5 days ago | root | parent | prev |
Welcome to ML/AI project naming.
viktour19 6 days ago | prev | next |
> LoRA and full fine-tuning, with equal performance on the fine-tuning task, can have solutions with very different generalization behaviors outside the fine-tuning task distribution.
The ability of neural nets to generalize is inherently tied to their trainable parameter count, via mechanisms we don't fully understand, but we know parameter count is key. When you fine-tune with LoRA, you're updating maybe 5% of the parameters, so I really don't think there is an illusion of equivalence in the field.
kelseyfrog 5 days ago | root | parent | next |
> When you fine-tune with LoRA, you're updating maybe 5% of the parameters
I'm not sure I understand this comment. The LoRA paper[1] specifically says that all of the pretrained weights remain frozen.
> keeping the pre-trained weights frozen
Specifically, the LoRA paper differentiates itself from updating some parameters by stating
> Many sought to mitigate this by adapting only some parameters or learning external modules for new tasks.
viktour19 5 days ago | root | parent |
The effective parameters of the model are the parameters of the original model plus the LoRA parameters, i.e. LoRA updates only the LoRA parameters, and full fine-tuning updates only the original model parameters.
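In other words (a toy illustration; the shapes and the alpha/rank scaling are just the usual LoRA conventions, not anything from the paper):

    import torch

    out_f, in_f, rank, alpha = 768, 768, 8, 16
    W = torch.randn(out_f, in_f)        # original weights: frozen under LoRA
    A = torch.randn(rank, in_f)         # LoRA parameter: trained
    B = torch.zeros(out_f, rank)        # LoRA parameter: trained

    # What the adapted model effectively applies at inference time:
    W_effective = W + (alpha / rank) * (B @ A)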
abhgh 5 days ago | root | parent | prev | next |
More magnitude than count [1] I think, but I haven't kept up in a while.
[1] https://proceedings.neurips.cc/paper_files/paper/1996/file/f...
wrs 6 days ago | root | parent | prev |
Well, I think it depends on who you talk to. I suspect quite a few practitioners (as opposed to researchers) regard LoRA as a valid shortcut without fully considering the difference.
blacklion 5 days ago | prev | next |
Each time I see "LoRA" in a title I want to click it, until I realize that it is about DNNs and not LoRa long-distance radio modulation.
danielhanchen 5 days ago | prev | next |
TLDR:
1. Use alpha = 2*rank (see the config sketch below)
2. Don't use ranks that are too small (rank = 1 to 8)
3. Sensational title. Better title: "LoRA works if done right"
4. They didn't test SVD init
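One way points 1 and 2 translate into a config, assuming the Hugging Face peft library (the target_modules names are typical for LLaMA-style models and would need adjusting per model):

    from peft import LoraConfig, get_peft_model

    rank = 16                                   # not too small (point 2)
    config = LoraConfig(
        r=rank,
        lora_alpha=2 * rank,                    # alpha = 2 * rank (point 1)
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        lora_dropout=0.05,
    )
    # model = get_peft_model(base_model, config)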
Tostino 5 days ago | root | parent |
Thanks for the TLDR. Yeah, it pretty much fits my experience, though I mainly cared about performance on the specific task I was training for rather than about regressing unrelated tasks.
danielhanchen 5 days ago | root | parent |
:) Oh yeah, from the paper it looks like if one uses alpha = 2*rank, sometimes LoRA does even better than full fine-tuning.
bArray 6 days ago | prev | next |
[flagged]
HPsquared 6 days ago | root | parent | next |
Nor with LORAN, a WW2 navigation system: https://en.m.wikipedia.org/wiki/LORAN
poizan42 6 days ago | root | parent | next |
Why is GP flagged? I was pretty confused about the title as well. This is the first time I have heard "LoRa" meaning anything other than the wireless protocol, and that reading was further strengthened by the talk of tuning. What is it with the AI crowd reusing long-established terminology and then getting mad at people who are confused by them usurping terms that have meant something else for a long time? I could understand posting this title on a forum dedicated to AI, but Hacker News is not that, and LoRa has had another meaning that has been commonly known in hacker circles for a decade now.
bArray 18 hours ago | root | parent | next |
> I was pretty confused about the title as well.
As somebody who lurks Hacker News, I think advances in either LoRA or LoRa are pretty interesting and both have appeared here; it's not unreasonable that somebody gets confused.
> What is it with the AI crowd reusing long established terminology and then getting mad at people being confused by them usurping terms that have been used for a long time for something else?
I'm going to start calling controllers with FPGAs in them "AI": Augmented ICs. Then downvote anybody that gets confused.
Tomte 6 days ago | root | parent | prev |
Because we have this discussion every single time LoRA comes up.
Also, neural networks and niche radio technologies are far enough apart that name clashes are to be expected and not a problem.
oytis 5 days ago | root | parent | next |
Niche? When I google "lora", the AI stuff is not even on the first page.
ianbutler 5 days ago | root | parent |
The LoRA paper for LLMs is my first result. Remember, Google [and other search providers] personalize web results.
The point is that this kind of conversation is unproductive; on an LLM topic, at this point, it's common jargon.
deskr 5 days ago | root | parent | prev | next |
> every single time
That's a pretty big indicator that it's true.
poizan42 6 days ago | root | parent | prev |
And how should they know that? Do you expect people to read every single post on Hacker News? For me it's the first time I have heard LoRa being used to mean something else. Obviously new people are going to keep being confused for as long as posts with confusing titles are being posted.
And yes, the name clash may not matter in either of those circles, but it does matter right here on Hacker News where those circles overlap.
Also, LoRa devices exist in the consumer space, so it seems a bit disingenuous to call it a niche radio technology.
Tomte 6 days ago | root | parent |
You don't have to know it, and I will still flag your comment. Get a grip! Nobody is doing anything serious to you.
poizan42 6 days ago | root | parent |
Nobody has flagged my comment? Are you confusing me with someone else? No-one is doing anything to me at all?
AstroJetson 6 days ago | prev | next |
[flagged]
Tenoke 6 days ago | root | parent | next |
If the issue is reusing names, then they also shouldn't have given the radio the same name as my aunt.
sorenjan 6 days ago | root | parent | prev | next |
> I was excited to click the link to see how fine tuning LoRA frequencies I was using on my Mesh network would work.
You're thinking of LoRa radio, from "Long Range". There's one of you in every LoRA comment section; I have a hard time believing it's an actual good-faith mistake anymore.
samuellavoie90 6 days ago | root | parent |
LoRa radio things are a lot more common than the newly popular AI stuff.
sorenjan 6 days ago | root | parent | next |
That depends on what your particular filter bubble contains. Just browsing HN should inform you that LoRA and fine tuning are common terms in AI, and even if you genuinely thought the article was about the radio technique there's really no reason to leave the same comment that's on every LoRA post.
https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...
poizan42 6 days ago | root | parent |
If every post has a similarly confusing title, then why don't they deserve the same comment? The obvious solution is for people to stop using confusing titles. Something as simple as putting (AI) at the end of the title would go a long way towards fixing the problem.
poizan42 6 days ago | root | parent | prev |
Yes exactly. The AI crowd are the ones not arguing in good faith here. The title is very confusing as it stands right now.
mdp2021 6 days ago | root | parent | prev | next |
'LoRa' ("Long Range") and 'LoRA' ("Low Rank Adaptation") have different capitalization.
AstroJetson 6 days ago | root | parent |
Most search engines are capitalization-agnostic, even the one here on HN. I'll have a word with my lawyer: I want to call something iphone, but he says that's taken. I'll see if he likes IpHoNe any better, to fix the capitalization. :-)
rjsw 6 days ago | root | parent | prev |
We could switch to just referring to everything as "the thing" or "the idea".
drweevil 6 days ago | root | parent |
Or the AI guys could respect the namespace and call it LRA, a la RAG. What's next, WiFI?
idorosen 5 days ago | prev |
Jacob Andreas is one of the smartest people I’ve ever met.
K0balt 6 days ago | next |
So, in layman’s terms, LoRA appears to “traumatize” the model to some degree, connecting the vector space with strong “jumpers” (intruder dimensions) to change its behavior, instead of subtly conforming the entire model into a shape that accommodates the new data.
These jumpers or shortcuts do create connections between the relevant new concepts in the model, but by directly connecting them instead of associating them through the existing network of concepts, nuance is lost and the bypassed areas become de-emphasized, leading to forgetting of previously held associations.
Because of this, full fine-tuning generally produces better results than LoRA, especially when forgetting existing training is detrimental.
Or, to further oversimplify the issue in SE terms, LoRA == monkeypatching. (Is this a kind of intruder dimension?)