I'm not part of their team, but I lived with them for a couple of months. They've been working on it for ~5 months, and the team is made up of 16-to-20-year-olds who are too smart for university.
Moshi is a good model to build chat applications on; this is designed to be more of a proper base model, with all the quirkiness, naturalness, and researcher-friendliness of base models.
So essentially this is voice input to voice output? Can you change gender/age/accent? Does it track prosodic information? I've been waiting for something like this.
Hertz-dev is a base model, meaning it's just trained to predict the next token of audio. If your prompt is an old male voice with a British accent, the model will most likely continue speaking in an old male voice with a British accent. Being a base model, hertz-dev is easily fine-tunable for specific tasks - it would be a simple change to add manual controls for gender/age/accent.
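The "continues in the prompt's style" behavior is just a property of autoregressive next-token prediction. As a toy illustration (nothing to do with hertz-dev's actual architecture - this uses characters instead of audio tokens, and a trivial bigram model instead of a transformer), note how the continuation inherits whatever statistics govern the end of the prompt:

```python
import random
from collections import defaultdict

def train_bigram(corpus: str) -> dict:
    # Count next-character frequencies for each character.
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in zip(corpus, corpus[1:]):
        counts[a][b] += 1
    return counts

def continue_prompt(model: dict, prompt: str, n: int, seed: int = 0) -> str:
    # Sample next tokens one at a time, conditioned on the prompt so far.
    # A base model has no "voice setting" - the prompt itself is the control.
    rng = random.Random(seed)
    out = prompt
    for _ in range(n):
        nexts = model.get(out[-1])
        if not nexts:
            break
        chars, weights = zip(*nexts.items())
        out += rng.choices(chars, weights=weights)[0]
    return out

model = train_bigram("hello hello hello world world")
print(continue_prompt(model, "hel", 10))
```

The same mechanism at audio-token scale is why a British-accented prompt yields a British-accented continuation: the model is only ever asked "what comes next?".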
I assume this mirroring is due to symmetry being more typical than not among the training data, and that if it were instead trained with contrived diversity (e.g., males only ever conversing with females), the output of the base model would follow suit without pulling any levers?
It's interesting to think about what complete diversity (i.e., no tendencies toward homogeneous conversation partners whatsoever among training data) would yield, given that it's trying to deliver whatever is most probable.
I'm interested to hear more detail about approaches to adding manual controls for speaker characteristics or emotion or other things you might want to vary. What techniques do you have in mind?
I'll jump in here - as a former New Englander, I find the cheerful helping tone of all modern voice LLMs infuriating. And the slow speed. And the over-explanations. ChatGPT Advanced Voice can be induced to talk more quickly, less sycophantically, and, if I like, in a not-bad regional accent; essentially I want it to mirror my tone better. But those inducements don't stick between sessions.
On the technical side, having some sort of continuation or summarization loop seems interesting to me as a product feature. It's not enough to build a company off of, though. But it would be nice.
Oh, you've built the project I was planning. Right now, do you think the bottleneck in improving the model is voice data, computing power, or algorithm optimization?
I personally think that if you want to push quality to the limit, you shouldn't remove background sound from the original audio. Outputting audio mixed with background sound may just mean the generations include background music. If you use completely unprocessed speech data (including speech with background music, e.g., from YouTube), I think the potential is higher, but the demands on your computing power are very high. If you don't have the money to buy GPUs, just run voice noise reduction on the data first.
It seems like if you choose to record transactions to the blockchain, you end up posting every transaction to the blockchain, which is quite expensive. Why not store only the hash of the HTTP request and use a Merkle tree to batch multiple requests into a single Bitcoin transaction? The only reason I can see not to is that the actual request data could then be lost or never revealed, but since this protocol saves all requests to an internal database as well, I don't see why that would be a problem.
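The batching idea above is cheap to sketch. A minimal Bitcoin-style Merkle root (double SHA-256, duplicating the last hash on odd levels) lets any number of request hashes be committed in one 32-byte value, which is all a single transaction needs to carry; the request bodies and sample names here are made up for illustration:

```python
import hashlib

def sha256d(b: bytes) -> bytes:
    # Bitcoin-style double SHA-256.
    return hashlib.sha256(hashlib.sha256(b).digest()).digest()

def merkle_root(leaves: list[bytes]) -> bytes:
    """Fold leaf hashes pairwise into a single 32-byte commitment."""
    assert leaves, "need at least one leaf"
    level = [sha256d(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            # Odd number of nodes: duplicate the last one, as Bitcoin does.
            level.append(level[-1])
        level = [sha256d(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]

# Many HTTP requests, one on-chain commitment:
requests = [b"GET /a HTTP/1.1", b"POST /b HTTP/1.1", b"GET /c HTTP/1.1"]
print(merkle_root(requests).hex())
```

Anyone holding the full requests (e.g., from the internal database) can recompute the root and verify inclusion of any single request with a logarithmic-size proof path, which is the usual trade-off: cheap on-chain footprint in exchange for keeping the raw data available off-chain.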