I'm not part of their team, but I lived with them for a couple of months. They've been working on it for ~5 months, and the team is made up of 16-to-20-year-olds who are too smart for university.
Moshi is a good model to build chat applications on; this is designed to be more of a proper base model, with all the quirkiness, naturalness, and researcher-friendliness of base models.
So essentially this is voice input to voice output? Can you change gender/age/accent? Does it track prosodic information? I've been waiting for something like this.
Hertz-dev is a base model, meaning it's just trained to predict the next token of audio. If your prompt is an old male voice with a British accent, the model will most likely continue speaking in an old male voice with a British accent. Being a base model, hertz-dev is easily fine-tunable for specific tasks - it would be a simple change to add manual controls for gender/age/accent.
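The "continues in the prompt's style" behavior is just a property of autoregressive next-token prediction. As a toy illustration (nothing to do with hertz-dev's actual architecture - this uses characters instead of audio tokens, and a trivial bigram model instead of a transformer), note how the continuation inherits whatever statistics govern the end of the prompt:

```python
import random
from collections import defaultdict

def train_bigram(corpus: str) -> dict:
    # Count next-character frequencies for each character.
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in zip(corpus, corpus[1:]):
        counts[a][b] += 1
    return counts

def continue_prompt(model: dict, prompt: str, n: int, seed: int = 0) -> str:
    # Sample next tokens one at a time, conditioned on the prompt so far.
    # A base model has no "voice setting" - the prompt itself is the control.
    rng = random.Random(seed)
    out = prompt
    for _ in range(n):
        nexts = model.get(out[-1])
        if not nexts:
            break
        chars, weights = zip(*nexts.items())
        out += rng.choices(chars, weights=weights)[0]
    return out

model = train_bigram("hello hello hello world world")
print(continue_prompt(model, "hel", 10))
```

The same mechanism at audio-token scale is why a British-accented prompt yields a British-accented continuation: the model is only ever asked "what comes next?".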
I assume this mirroring is due to symmetry being more typical than not among the training data, and that if it were instead trained with contrived diversity (e.g., males only ever conversing with females), the output of the base model would follow suit without pulling any levers?
It's interesting to think about what complete diversity (i.e., no tendencies toward homogeneous conversation partners whatsoever among training data) would yield, given that it's trying to deliver whatever is most probable.
I'm interested to hear more detail about approaches to adding manual controls for speaker characteristics or emotion or other things you might want to vary. What techniques do you have in mind?
I'll jump in here - as a former New Englander, I find the cheerful helping tone of all modern voice LLMs infuriating. And the slow speed. And the over-explanations. ChatGPT Advanced Voice can be induced to talk more quickly, less sycophantically, and, if I like, in a not-bad regional accent; essentially I want it to mirror my tone better. But those inducements don't stick between sessions.
On the technical side, having some sort of continuation or summarization loop seems interesting to me as a product feature. It's not enough to build a company off of, though. But it would be nice.
Oh, you've built the project I was planning. Right now, do you think the bottleneck in improving the model is voice data, computing power, or algorithm optimization?
I personally think that if you want to push quality to the limit, you shouldn't remove background sound from the original audio. Outputting audio mixed with background sound may just mean the generations include background music. If you use completely unprocessed speech data (including speech with background music, e.g., from YouTube), I think the potential is higher, but the demands on your computing power are very high. If you don't have the money to buy GPUs, just run voice noise reduction on the data first.
It seems like if you choose to record transactions to the blockchain, you end up posting every transaction to the blockchain, which is quite expensive. Why not store only the hash of the HTTP request and use a Merkle tree to batch multiple requests into a single Bitcoin transaction? The only reason I can see not to is that the actual request data could then be lost or never revealed, but since this protocol saves all requests to an internal database as well, I don't see why that would be a problem.
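The batching idea above is cheap to sketch. A minimal Bitcoin-style Merkle root (double SHA-256, duplicating the last hash on odd levels) lets any number of request hashes be committed in one 32-byte value, which is all a single transaction needs to carry; the request bodies and sample names here are made up for illustration:

```python
import hashlib

def sha256d(b: bytes) -> bytes:
    # Bitcoin-style double SHA-256.
    return hashlib.sha256(hashlib.sha256(b).digest()).digest()

def merkle_root(leaves: list[bytes]) -> bytes:
    """Fold leaf hashes pairwise into a single 32-byte commitment."""
    assert leaves, "need at least one leaf"
    level = [sha256d(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            # Odd number of nodes: duplicate the last one, as Bitcoin does.
            level.append(level[-1])
        level = [sha256d(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]

# Many HTTP requests, one on-chain commitment:
requests = [b"GET /a HTTP/1.1", b"POST /b HTTP/1.1", b"GET /c HTTP/1.1"]
print(merkle_root(requests).hex())
```

Anyone holding the full requests (e.g., from the internal database) can recompute the root and verify inclusion of any single request with a logarithmic-size proof path, which is the usual trade-off: cheap on-chain footprint in exchange for keeping the raw data available off-chain.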