The problem isn't on-device training. The problem is that if I "free up space" by deleting my local copies, and I have 10 years of photos backed up in the cloud, then any update to the ML model requires redownloading all of those photos and rerunning inference on them.
It doesn't matter how clever you are: if you disallow processing the photos on the server, then any cached index on the client is invalidated by an algorithm update, and there is no way to rebuild that index without downloading all the photos again.
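To make the invalidation argument concrete, here's a minimal sketch. The names (`PhotoIndex`, `embed`, `rebuild`) are hypothetical, not anyone's real API; the point is just that index entries are only meaningful for the model version that produced them, and rebuilding them requires the raw pixels.

```python
# Sketch: why a client-side semantic index can't survive a model update.
# All names here are illustrative, not a real photo-library API.

class PhotoIndex:
    def __init__(self, model_version):
        self.model_version = model_version
        self.entries = {}          # photo_id -> embedding vector

    def add(self, photo_id, pixels, embed):
        # Embeddings can only be computed from the raw pixels.
        self.entries[photo_id] = embed(pixels)

    def usable_with(self, current_model_version):
        # Embeddings from model v1 are meaningless to a v2 query encoder,
        # so a version bump invalidates the entire cache at once.
        return self.model_version == current_model_version


def rebuild(current_version, local_photos, embed):
    # Rebuilding needs every photo's pixels, so anything "freed up" to
    # the cloud has to be redownloaded first.
    fresh = PhotoIndex(current_version)
    for photo_id, pixels in local_photos.items():
        fresh.add(photo_id, pixels, embed)
    return fresh
```

There's no incremental path here: the old embeddings can't be converted into new ones, so the cost of an update is always proportional to the size of the whole library.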
I'm not talking about federated learning; that's a different problem. Let's say we solved that one with on-device learning and differential privacy. The problem remains that a retrained or updated model on the device has to process all the photos locally again.
So if, for example, you can now detect "hugs" in photos and I search for "photos of me hugging my daughter", Google Photos gives me what I want, while your local-only model means I have to wait a long time to redownload and reindex my entire collection. That's a huge user problem.
A workable solution to this problem has to entail something like homomorphic encryption, if you want to keep the photos encrypted on the server.
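To be clear about what homomorphic encryption buys you: the server can compute on ciphertexts without ever decrypting them. Running a vision model this way is far beyond a toy example, but the core property is easy to show with textbook Paillier (additively homomorphic, here with deliberately tiny, insecure parameters):

```python
# Toy Paillier cryptosystem. INSECURE tiny primes, purely illustrative:
# multiplying two ciphertexts yields a ciphertext of the plaintext sum,
# so the server can aggregate values it cannot read.
import math
import random

p, q = 17, 19                  # real deployments use ~1024-bit primes
n = p * q
n2 = n * n
g = n + 1                      # standard choice of generator
lam = math.lcm(p - 1, q - 1)
mu = pow(lam, -1, n)           # valid precisely because g = n + 1

def encrypt(m):
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    l = (pow(c, lam, n2) - 1) // n
    return (l * mu) % n

# Additive homomorphism: multiply ciphertexts, plaintexts add.
c_sum = (encrypt(3) * encrypt(4)) % n2
assert decrypt(c_sum) == 7
```

That said, going from "add encrypted numbers" to "run a CNN over encrypted photos" is an enormous leap in cost, which is why nobody ships this for photo search today.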