Interesting! Are you saying you would first want tools to really design your character, and only after start making videos with the character you built? That's interesting.
HeyGen (and our V1 model) literally uses the user on-boarding video in the final output. See here for a demonstration of this (https://toinfinityai.github.io/v2-launch-page/#comparisons). We are not talking about that in this thread. We are trying to solve a quirk of our Diffusion Transformer model (V2 model).
Our V2 model is trained on specific durations of audio (2s, 5s, 10s, etc) as input. So, if we give the model a 7s audio clip during inference, it will generate lower-quality video than at 5s or 10s. Instead, we buffer the audio up to the nearest training bucket (10s in this case). We have tried buffering with a zero array, with white noise, and by concatenating the input audio (inverted) onto the end. The drawback is that the last real frame (the one at 7s) has a higher likelihood of failing. We need to solve this.
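To make the bucketing concrete, here is a minimal sketch of what that padding step might look like. Everything here is an assumption based on the description above: the bucket list, the sample rate, and the function name are hypothetical, and the "inverted audio" strategy is implemented as a time-reversed copy of the clip.

```python
import numpy as np

# Hypothetical values: bucket durations and sample rate are assumptions,
# not the actual training configuration.
BUCKETS_S = [2, 5, 10, 20]
SAMPLE_RATE = 16000

def pad_to_bucket(audio: np.ndarray, sample_rate: int = SAMPLE_RATE) -> np.ndarray:
    """Pad a mono audio array up to the nearest training bucket (>= its length).

    Padding strategy here: append a time-reversed copy of the input, one of
    the approaches mentioned above (alternatives: zeros, white noise).
    """
    duration = len(audio) / sample_rate
    bucket = next(b for b in BUCKETS_S if b >= duration)  # nearest bucket above
    target_len = bucket * sample_rate
    deficit = target_len - len(audio)
    if deficit == 0:
        return audio
    filler = audio[::-1]  # "inverted" (time-reversed) copy of the input
    # Tile the reversed clip in case the deficit exceeds the clip length.
    reps = int(np.ceil(deficit / len(filler)))
    padding = np.tile(filler, reps)[:deficit]
    return np.concatenate([audio, padding])
```

With a 7s clip this pads to the 10s bucket, so the generated video then has to be trimmed back to 7s, which is exactly where the last-frame failure mode shows up.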
And, no shade on HeyGen. It's literally what we did before. And their videos look hyper realistic, which is great for B2B content. The drawback is you are always constrained to the hand motions and environment of the on-boarding video, which is more limiting for entertainment content.
The first and second pictures are profile pictures that were generated years ago, before OpenAI went on stage. I keep them around for when I need profile pics for templates. The third one has been in my random-pictures folder for years.
Can you share an example of this happening? I am curious. We can get static videos if our model doesn't recognize it as a face (e.g. an Apple with a face, or sketches). Here is an example: https://toinfinityai.github.io/v2-launch-page/static/videos/...
I would be curious if you are getting this with more normal images.
I got it with a more normal image, two frames from a TV show[1]: with "crop face" on, your model finds the face and animates it[2], and with crop face off the picture was static... I just tried to reproduce it to show you, and now it animates both faces instead.
That's how I see it as well. Very soon, people will assume most videos are AI generated, and the burden of proof will be on people claiming videos are real. We plan to embed some kind of hash to indicate our video is AI generated, but people will be able to get around this. Google/Apple/Samsung seem to be in the best position to solve this: whenever their devices record a real video, they can generate a hash for that video directly in hardware, which can be used to verify that it was actually recorded by that phone.
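A rough sketch of that hardware-attestation idea, with a big caveat: a real phone would use an asymmetric signature produced inside a secure element (so the private key never leaves the hardware, and anyone can verify with the public key). The HMAC with a hypothetical device key below is just a stand-in to show the hash-then-sign-at-record-time flow.

```python
import hashlib
import hmac

# Hypothetical key: in a real design this would be a private key sealed
# inside the phone's secure element, never exposed to software.
DEVICE_KEY = b"key-burned-into-secure-hardware"

def attest_video(video_bytes: bytes) -> bytes:
    """Hash the raw recording and sign the digest at record time."""
    digest = hashlib.sha256(video_bytes).digest()
    return hmac.new(DEVICE_KEY, digest, hashlib.sha256).digest()

def verify_video(video_bytes: bytes, signature: bytes) -> bool:
    """Check that the video bytes are unmodified since recording."""
    expected = attest_video(video_bytes)
    return hmac.compare_digest(expected, signature)
```

The point of the sketch: any post-hoc edit (including replacing the video with an AI-generated one) changes the digest and breaks verification, so "verifiably real" becomes the default for new recordings even though nothing stops unsigned video from existing.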
Also, I think within 1-2 years it will cost around $100k to train a model at this quality level, and it will only go down from there. So, the genie is out of the bottle.
That makes sense. It isn’t reasonable to expect malicious users to helpfully set the “evil bit,” but you can at least add a little speedbump by hashing your own AI generated content (and the presence of videos that are verifiably AI generated will at least probably catch some particularly lazy/incompetent bad actors, which will destroy their credibility and also be really funny).
In the end though, the incentive and the capability lie in the hands of camera manufacturers. It is unfortunate that videos from the pre-AI era had no real reason to be made verifiable…
Anyway, recordings of politicians saying some pretty heinous things haven’t derailed some of their campaigns anyway, so maybe none of this is really worth worrying about in the first place.