Nvidia Uses AI to Slash Bandwidth on Video Calls (petapixel.com)
293 points by srirangr on Oct 9, 2020 | 192 comments


> they have managed to reduce the required bandwidth for a video call by an order of magnitude. In one example, the required data rate fell from 97.28 KB/frame to a measly 0.1165 KB/frame – a reduction to 0.1% of required bandwidth.

A nitpick, perhaps, but isn't that three orders of magnitude?

We've already seen people use outlandish backgrounds in calls; now it's going to be possible to design similarly outlandish appearances and actually inhabit them in real time. There's been a lot of discussion centered around deep fakes and their problems; this is essentially deep faking yourself into whatever you want.

Video calls are a very important form of communication at the moment. If this becomes as accepted as background modification, it would open the societal door to a whole range of self-presentation that up to now was restricted to in-game virtual characters.

I wonder what kind of implications that could have. Would people come to identify themselves strongly with a virtual avatar, perhaps stronger than their real life "avatar"? It is an awesome freedom to have, to remake yourself.


I think you are right on the money with your thoughts on virtual avatars. I've already noticed this phenomenon cropping up in some niches.

1. the phenomenon of VTubers https://en.m.wikipedia.org/wiki/Virtual_YouTuber

2. in the virtual Animal Crossing late night show, Animal Talking, the presenter's (Gary Whitta) avatar doesn't really resemble how the presenter looks in real life https://en.m.wikipedia.org/wiki/Animal_Talking_with_Gary_Whi...

3. I watch a lot of interviews with people in VR Chat and it's very interesting how people seem to find it easier(?) to open up while they are embodying a character. https://youtu.be/KZWOXgc7PA4

Being able to experiment with identity in this way is really interesting to me, and I hope it becomes more mainstream with the proliferation of this technology


This sounds a lot like the movie Surrogates, where at one point the protagonist notices the badge of an android surrogate is completely different from the human behind the surrogate (as printed on the badge).


There's a Webtoon (it's okay but not great) that had a premise I think will turn out to be prescient and reminds me of your last bit. The gist is that it's a future where everyone wears VR goggles. As a result, the teens in the comic all have personalized visuals mapped to their bodies that you can see if you also have your goggles on when you look at them. The cooler kids even have full-blown avatars that cover up their entire body and make them look like everything from aliens to elves.


I'm reminded of Permutation City where they talk to one another with virtual avatars that are able to mask emotional responses and such.


Jaron Lanier’s book on VR went in-depth on the importance of avatars, and experiences people had embodying different avatars — particularly in the early days of first-wave VR.


You just need to connect GPT3, and the dialogue is taken care of. Lyrebird’s API will take care of the speech synthesis.

Viola! My deep fake can stand in at meetings now while I code.


GPT will also code for you.


Great, I can spend more time gratifying my limbic system while accumulating resources to survive my impending obsolescence.


But who's playing the viola?


Sorry, misspelled Voila!


Imagine if everyone did that


Reminds me of this classic montage from Real Genius.

https://youtu.be/wB1X4o-MV6o


This is one of the themes brilliantly explored (in my opinion), in the (largely hard) scifi book "Lady of Mazes" by Karl Schroeder.


I posit that no productivity would be lost. Just have a summarizing AI email you the outcomes.


Hello, isoprophlex, I'm your Assistant AI

Today's meeting was an hour-and-a-half spent on bike-shedding the position--and color--of the "logout" button on our product page.

15 minutes were spent debating the resident usability expert who suggested white text on a dark blue background would be more readable for people with low vision. The department manager insisted on retaining pastel blue as it is his favorite color.

45 minutes were spent arguing over whether "logout" or "log out" comprises proper semantics. Our linguistic expert was unfortunately not able to attend as she was sent on a business trip earlier in the week.

The last 30 minutes were focused on team-building exercises as Bob doodled on his tablet and Susan smiled politely at the speaker whilst screaming internally as she had 3 more meetings to attend before her department could move forward.


That was... believable, haha.

WFH hasn't lessened the amount of bullshit, but it's become more tolerable. I mute my mic and put on a playlist with elevator music.

Now all I need is an AI standin and this summarizer and we've basically achieved universal basic income.


Next version would be:

45 minutes were spent by humans arguing over whether "logout" or "log out", while I created both buttons (GPT-3 can already do that) and A/B tested it.


> A nitpick, perhaps, but isn't that three orders of magnitude?

Perhaps the example was a best-case, and the usual improvement is about 10x. (That or 'order of magnitude' has gone the way of 'exponential' in popular use. I don't think I've noticed that elsewhere, though.)


As these trends sort of become more and more prevalent, I am so shocked at how David Foster Wallace had nailed this prediction in his book Infinite Jest.

Humans becoming more and more dependent on virtual face-to-face meetings and also relying on embellishment of their supposed appearance through the screen. It reminds me of how SciFi authors predicted technology, but with a complementary commentary on human psychology.

Sorry if it isn't directly related to the post, but it is so striking to me.


The avatar thing isn't one-sided either: it'd be an awesome power to have to remake others!

Real-time silly hats for people I talk to and I'm sold.


"Visualize your audience naked." they said. "Helps calm the nerves."

They had no idea.


When there's a will there's a way


I imagine people in home office situations would like to use this not only for the background, but for themselves. I mean, if you are indoors all day, you might not be perfectly groomed for the day, so faking that would probably appeal to many people.


I think the ability to, as someone mentioned, have yourself look a bit tidier than you actually are (working from home) could be a huge benefit.

I mean taking the focus away from things that don't matter in a virtual meeting, such as where you are sitting (via a virtual background) or the daily state of your hair or a pimple on your nose (via NVIDIA's AI showcased here). That would be great.

Though replacing yourself with a "digital" avatar I think takes away many of the benefits an actual live meeting provides.


It depends on how accurately the avatar is able to represent important information: emotion, attention, state of mind, etc. There's a lot of bandwidth in looking at a face (and body language as well); that's where the value in face-to-face meetings is.


>isn't that three orders of magnitude?

sure is .. if you stream your face at >30-50Mbit/s. For contrast, the highest bitrate available on Twitch, used for streaming high-motion, full-screen-updating, twitchy 1080@60 gaming, is ~6-8Mbit/s.
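For anyone who wants to check the bitrate implied by the article's per-frame figures, here is a minimal sketch. The frame rates (30 and 60 fps) and the 1 KB = 1000 bytes convention are assumptions, since the article states neither, and the function name is made up for illustration.

```python
def frame_size_to_mbit_per_s(kb_per_frame, fps, bytes_per_kb=1000):
    """Convert a per-frame payload in KB to a stream bitrate in Mbit/s."""
    bits_per_frame = kb_per_frame * bytes_per_kb * 8
    return bits_per_frame * fps / 1_000_000


# Per-frame sizes quoted in the article; frame rates are assumed.
for label, kb in (("conventional", 97.28), ("keypoint-based", 0.1165)):
    for fps in (30, 60):
        rate = frame_size_to_mbit_per_s(kb, fps)
        print(f"{label:>15} @ {fps} fps: {rate:.4f} Mbit/s")
```

Under those assumptions the 97.28 KB/frame figure works out to roughly 23 Mbit/s at 30 fps and 47 Mbit/s at 60 fps, so the "before" number looks closer to a near-raw stream than to a typical video call.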


David Foster Wallace predicted this in his novel Infinite Jest. Except they were static images inserted over a video phone, and the user had to keep their head positioned just right to make them work.


It's possible that the "order of magnitude" statement was the majority case, and the 0.1% statement was a best case scenario. So, 1 magnitude is to be expected, but 3 is possible.


My prediction is that people will just change their avatars as often as they change their personal fashion. For some that’s never and for others it’s every season or even more often.


"Its just easier to apply the <deep fake of myself> than it is to apply foundation. Its how I'd look anyway" - delusional early adopters probably


This reminds me of a sci-fi novel I read in the nineties. The premise had something to do with actors who took on roles in virtual reality where their bodies are fit with sense-points. They're cast in live-action role-plays with wealthy remote clients. They're basically deep-fakes in VR.


Neal Stephenson's Diamond Age had an element like that. It was even possible that the actors ("ractors") didn't necessarily know what they were acting in, just the general parameters and the next set of lines and actions necessary to continue the performance.


> A nitpick, perhaps, but isn't that three orders of magnitude?

I dunno that I'd call it three orders - it's close at about 830x - but it's definitely not even close to being one order either.


I think it is three orders of magnitude.

"An order-of-magnitude difference between two values is a factor of 10. For example, the mass of the planet Saturn is 95 times that of Earth, so Saturn is two orders of magnitude more massive than Earth."

https://en.wikipedia.org/wiki/Order_of_magnitude


> "More precisely, the order of magnitude of a number can be defined in terms of the common logarithm, usually as the integer part of the logarithm, obtained by truncation."

    $ bc -l
    l(835)/l(10)
    2.92168647548360208478

That would make it 2 orders of magnitude by that method. Happy to accept that it's 3 orders of magnitude by the N=a*10^b method though. Either way, it's definitely not one.


I think the whole point of orders of magnitude is to be a back-of-the-napkin estimation of what's what.

"a new car is an order of magnitude difference in price compared to a used car" is appropriate even if a new car is 40k and a used car is 5k

Electric cars have two orders of magnitude less energy storage than gasoline cars, but newer ones are only one order of magnitude.


Since orders of magnitude are multiplicative, anything from ~300× to ~3000× rounds to three orders of magnitude when rounded to the nearest whole order.
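The two conventions argued over in this subthread (truncating the logarithm versus rounding it) can be compared directly. A small sketch using only the per-frame figures quoted from the article; the variable names are mine:

```python
import math

# Per-frame sizes quoted in the article, in KB.
before_kb = 97.28
after_kb = 0.1165

ratio = before_kb / after_kb      # size reduction factor
log_ratio = math.log10(ratio)     # orders of magnitude, unrounded

print(f"reduction factor: {ratio:.1f}x")
print(f"log10 of factor:  {log_ratio:.3f}")
print(f"truncated:        {math.floor(log_ratio)} orders of magnitude")
print(f"rounded:          {round(log_ratio)} orders of magnitude")
```

Truncation gives two and rounding gives three; neither convention gives one.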


I think it’s an unnecessary nitpick. One statement is general and the other is a specific, extreme example. “usually 10x, sometimes 1000x”, that sort of thing


A technology very similar to this plays a plot point in Vernor Vinge's 1992 novel A Fire Upon the Deep.

In his universe, both the interstellar net and combat links between ships are low bandwidth. Hence, video is interpolated between sync frames or recreated from old footage. Vinge calls the resulting video "evocations".


I was thinking of exactly this when I read the article.

The plot point being that when the bandwidth gets too low, the interpolation AI has to make lots of stuff up, so you are not quite sure exactly what was said.

I seem to remember the bandwidth in the book was very tiny, small number of bits per second (?) so the AI was taking the speech and compressing it into something more compressed than text then decompressing it at the other end into something that was more or less the same.


Wow, that’s a fascinating concept. Effectively a bitmap index of possible terms said and synthesized speech back on the other end based on old footage of how that person talks.


I highly recommend A Fire Upon the Deep. It's a rare mix of really interesting hard scifi with an actually good story and characters. Hard scifi often has very flat characters but this is not a book which suffers from it.

It has a very very cool twist to explain the Fermi Paradox and is a really good example of a universe with one modified rule.


There is also a similar technology in Rob Reid's book After On. The AI in that book has the ability to "refocus" the person so that they are looking into the camera.

I believe this is huge and would create higher engagement if everybody was actually looking into the camera instead of to the side or up all the time, creating a more human and emotional connection with the people you are talking to.



Fundamentally, I don't know if people realise what we're on the verge of here.

It's effectively motion-mapped keypoints of the person projected onto a simulated model. I'm assuming the cartoonish avatar was used as an example partly to avoid drawing direct lines to the full implications.

- There's no reason this couldn't extend to voice modelling as well. (much clearer speaking at much lower bandwidth)

- There's no reason this couldn't extend to replacing your sent projection with another image (or person)

- Professional looking suit wearing presentation when you're nude/hungover/unshaven. Hell, why even stop at using your real gender or visage? Imagine a job interview where every candidate, by definition, visually looked the same :)

- There's no reason you couldn't replace other people's avatars with ones of your own choosing as well.

- Why couldn't we model the rest of the environment?

Not there today, but this future is closer than many realise.


>>- Professional looking suit wearing presentation when you're nude/hungover/unshaven. Hell, why even stop at using your real gender or visage? Imagine a job interview where every candidate, by definition, visually looked the same :)<<

This would be interesting as an upgrade to the “name on resume” test.

Could also see a future company policy that runs people's data through a “sameness” filter before letting them into the company to scrub bias.


There are limitations, but you can do this today: https://github.com/eyaler/avatars4all


This is a lot like Framefree.[1] That was developed around 2005 at Kerner Optical, which was a spinoff from Lucasfilm. The system finds a set of morph points in successive keyframes and morphs between them. This can do slow motion without jerkiness, and increase frame rate. Any modern GPU can do morphing in real time, so playback is cheap. There used to be a browser plug-in for playing Framefree-compressed video.

Compression was expensive, because finding good morph points is hard. But hardware has now caught up, and it can be done in real time on cheap hardware.

As a compression method, it's great for talking heads with a fixed camera. You're just sending morph point moves, and rarely need a new keyframe.
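To make "morphing between keyframes" concrete, here is a minimal sketch that generates in-between positions for a set of matched morph points. Plain linear interpolation is my assumption for illustration; I don't know what Framefree actually used, and the function names and sample coordinates are hypothetical.

```python
def interpolate_points(points_a, points_b, t):
    """Blend two matched lists of (x, y) morph points; t=0 gives A, t=1 gives B."""
    return [
        (ax + (bx - ax) * t, ay + (by - ay) * t)
        for (ax, ay), (bx, by) in zip(points_a, points_b)
    ]


def in_between_frames(points_a, points_b, count):
    """Point sets for `count` evenly spaced frames strictly between two keyframes."""
    return [
        interpolate_points(points_a, points_b, step / (count + 1))
        for step in range(1, count + 1)
    ]


# Two hypothetical keyframes, each with the same two tracked points.
key_a = [(0, 0), (10, 20)]
key_b = [(10, 0), (20, 40)]

for frame in in_between_frames(key_a, key_b, 3):
    print(frame)
```

Raising `count` is what gives slow motion or a higher frame rate: the sender only transmits the keyframe points, and the receiver synthesizes as many intermediate frames as it wants.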

You can be too early. Kerner Optical went bust a decade ago.

[1] https://youtu.be/VBfss0AaNaU


This may be projecting expectations but the example compressed video looks very slightly fake in a way that is just a little uncanny valley type unsettling.

Perhaps the nets they’re using are compressing out facial microexpressions, and when we see it, it seems just a little unnatural. Compression artifacts might be preferable because the information they’re missing is more obvious and less artificial. In other words, I’d rather be presented with something obviously flawed than something where I can’t quite tell what is wrong.


What I don't like about AI-processed images is that they are not real. I can't get past the fact that I am not looking at the picture as it looks in reality, but at some smart approximation of the world that is not necessarily true.


Video is not real either. The perspective is different and the colors are wrong.

It is even less real after it went through lossy compression. The whole point of lossy compression is to remove details that you perceive as unimportant. For example, leaves on a tree may look like a green mess, but that's fine: from afar, you can't tell the difference.

Using neural networks for compression is far from being a new concept, and the result is not more or less real than any other technique. It is just that Nvidia's implementation is really good at keeping the most important details in a small size.

If you want a more "real" image, you can just use the AI as a predictor and use the remaining bits to encode the difference, like in a traditional MPEG-style codec.


The distinction that makes AI compression creepy is that it can use prior knowledge of what a face is.

Traditional compression has no high-level knowledge of what a video call looks like; the algorithms are about patterns in how pixels change in time and space, so the artifacts from the algorithm are pixel-based effects (blockiness, blurriness, etc.).

A neural net that has been trained on a million faces and is set up to draw a face may draw a perfectly clear image of a face on low bandwidth... but when it doesn't have the bandwidth to be accurate, it doesn't blur things, it fills in details learned from looking at strangers who aren't on the call.


You're sitting at your computer, watching your colleagues discuss something in detail, and slowly you start to notice that your two black colleagues are starting to literally look the same. That's weird, are you slowly getting more racist? Oh right, your bandwidth dropped and your AI upscaler has decided to upscale all your black colleagues to some stereotype it got trained with.

I'm afraid I suspect this isn't that far fetched.


In some ways, of course, I agree, but as some other commenters have pointed out, this is essentially a "deepfake" of yourself...in principle there's no restriction on what one could use as their "avatar." As much was demonstrated in the "Different Output Styles" slide.

So I'd have a hard time saying this is not dramatically less "real" than some lossy compression technique. Is there some way to formalize "realness?" Maybe it would be inversely proportional to the hardness of manipulating the medium as a user..?


Maybe simply using an objective video quality measurement against the original picture would do the trick. Something like PSNR or SSIM. A "deepfake" is likely to score low on such a score that doesn't depend on high level visual perception.

Also know that you can "deepfake" yourself using a traditional video encoder, just change the keyframe to someone else's face. Of course, it will look broken and totally unconvincing but because of motion compensation, you can sort of map the movement of your face on someone else's face.

The technique in the paper simply has way better motion compensation, so good that it still works if you change the keyframe. Traditional video compression algorithms don't work like that because they are not just for talking faces and can't use such advanced techniques for performance and ease of implementation reasons.
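For reference, PSNR as mentioned above is just a log-scaled mean squared error against the original frame. A minimal sketch, assuming 8-bit samples passed as flat sequences of equal length; the function name and signature are mine, not from any particular library:

```python
import math


def psnr(reference, test, max_value=255):
    """Peak signal-to-noise ratio in dB between two equal-length sample sequences."""
    if len(reference) != len(test) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs: no noise at all
    return 10 * math.log10(max_value ** 2 / mse)


original = [10, 20, 30, 40]
slightly_off = [11, 21, 31, 41]   # every sample off by one
print(psnr(original, original))
print(psnr(original, slightly_off))
```

Because it compares pixel values position by position, a face that is plausible but shifted or subtly different from the original is penalized even if a viewer would not notice, which is the property being relied on here.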


I think the point is that for very low bandwidth, the "deepfake" version will score higher than the heavily compressed video stream for PSNR.


I understand what the parent means. Usually processing was too linear to alter the meaning of what you see: it's distorted, altered, maybe filtered and enhanced, but in general it's the same. With ML/DL you can really be sent... whatever.


"the result is not more or less real than any other technique"

I disagree. If real / not real is binary, sure, it's not real. But if we allow "real" to be a range of values, it is less real.


come on man, there's a difference between a GAN hallucinating 90% of the image and a very predictable compression algorithm where both parties understand what's going on


That's how your brain works as well.

Adding a secondary, uncontrolled layer of perception is definitely not the same thing, but in general, what you "see" is barely "real".


If the sending machine (that films my face) does the same decoding I know will be on the other end, then diffs the raw video with the decoded video and finally sends both things, then the receiver should be able to always piece together a 100% reproduction of the actual video feed on my end. The transmitter can always predict exactly what the receiver will decode, so the correct amount of data to send is the amount of data that makes the receiver see what I want. If that means I send 0.1 Mbps key point data and 1 Mbps diff pixels then that’s what I want to send.

Is that how this works? Because it should be how it works...

Not sure if this is a viable way of compressing the video stream or if a transmission of the “diff” would be too costly. If this method would give 1/2 the original bandwidth I’d think it’s more impressive than cutting 3 orders of magnitude via “avatars”.


You’re thinking along the right lines, but the challenge is that a raw diff will have the same number of pixels as the raw image, so no compression in bandwidth. So, how do we represent the diff/residue also with fewer numbers? At that point it’s the same as choosing better parameters within some clever encoding (be it pre-designed like JPEG or H.264 or learned via ML).


I was thinking that subtracting the predicted image would give an image that has more zeroes and compresses better (much like dct+quantization for jpeg). After all, any time the neural network would predict an area of the image almost exactly, it can be omitted from the diff stream completely too.
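The predict-then-send-the-difference idea discussed in this subthread can be sketched with a toy example. Everything here is an assumption for illustration: the "frame" is random bytes, the "prediction" is made exact on about 95% of pixels by construction (which a real model's residual need not be), and zlib stands in for a proper entropy coder.

```python
import random
import zlib

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Stand-in for one raw grayscale frame: noisy pixel values in 0..255.
actual = bytes(rng.randrange(256) for _ in range(10_000))

# Stand-in for the model's prediction: exact for most pixels,
# off by a small amount for roughly 5% of them.
predicted = bytes(
    (p + rng.choice((-2, -1, 1, 2))) % 256 if rng.random() < 0.05 else p
    for p in actual
)

# Sender: residual = actual - predicted (mod 256), then compress it.
residual = bytes((a - p) % 256 for a, p in zip(actual, predicted))
payload = zlib.compress(residual, 9)

# Receiver: the same prediction plus the decoded residual restores the frame.
decoded = zlib.decompress(payload)
reconstructed = bytes((p + r) % 256 for p, r in zip(predicted, decoded))

print("raw frame, compressed:", len(zlib.compress(actual, 9)), "bytes")
print("residual, compressed: ", len(payload), "bytes")
print("lossless round trip:  ", reconstructed == actual)
```

The residual only compresses well here because it is mostly zeros. If the prediction were slightly off everywhere, the residual would be dense and the saving would largely disappear.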


The diff will be low norm (for some suitable norm), but it needn’t be sparse in pixel space. It could be sparse in some other spaces (Eg: wavelet basis), but that’s right where the challenge lies — finding a basis where the data (and/or the residue) to be transmitted is sparse.


I suspect that this isn't feasible. Otherwise lossy video codecs wouldn't exist (or would be very niche). Because if you sent a perfect diff you would essentially have lossless compression again.

So either your idea will revolutionize video compression or the "diff" would bring you back to the ballpark of lossless codecs.


It is a lossless codec, yes, but it has a “base” image for the key frames (which are the most expensive) which can be transmitted as the key points. The question is how much that actually helps the end result vs how much it costs. As I said, it would likely kill at least nearly all the gains, but again - a small improvement in the state of the art for “normal” looking video would be more impressive than three orders of magnitude vs sending instructions to my head puppet.


It could be bad for telemedicine, since the NN may gloss over some "weird" pathological features not found in its training set. The current setup involving an initial image and keypoints should be fine though, as long as the initial image (keyframe?) is not messed with and is transmitted frequently enough.


your brain does lots of preprocessing to create a stable view of "reality", glossing over blind spots, retinal blood flow, etc. in a way we are always hallucinating, and it seems natural that computers should learn what we are actually paying attention to in video, and focus on that


Yes! E.g. AAA games render the entire 4k screen with the highest graphics, but only a small spot hits the part of retina that can perceive jagged edges or lighting details.


The raw sensor data for a digital camera looks something like this https://commons.wikimedia.org/wiki/File:Normal_and_Bayer_Fil...

In addition to the colour grading and interpolation, lens defects might be 'fixed' before you ever get data on disk.

ML is just really good interpolation.


any lossy video-compression algorithm faces the same challenge. what you are seeing is artificial and is constructed by the algorithm to minimize the perceptual difference between the real and the constructed video feed.


You'd get used to it very quickly. Like you got used to TV and film, which differ substantially from an in-person view in many ways.


Wonderful technical achievement but I think I’d rather squint through garbled video to see a real human.

Now if I can use it to add a Klingon skull ridge and hollow eyes to my boss or scribble notes on my scrum master’s generous forehead we might be on to something.


I see a lot of people being alienated by the fact that people could take on different avatars during their meeting. I would honestly accept that with no question.

In a work environment, I would expect the person I'm talking to to be presentable, ie their avatar would be presentable, so no goofy backgrounds or annoying accessories.

But the key for me is, I'd actually have something to see. So often in my work I'm in meetings where three people have cameras on and the rest don't. I don't really care what they look like; I care whether they're engaged, nodding their heads, their facial reactions.

I don't always have my video on either, I don't have great upload speeds so I usually appear as a big blob anyway. I'd happily have whatever representation of me be in my place if it meant people could see my reactions


This is why I keep my camera on, so that whoever is talking sees me nodding away whilst they talk, and when they go "any questions" you can pause and say "None from me". It's also why you always unmute yourself when someone joins the call - yes, we can hear you, yes your mic is fine, no you're not muted, it's just that no one is replying because they're all muted and doing something else because we haven't started yet.

It feels like people really haven't put any thought into how to handle video conferences.


I assume it's because people are awaiting their return to physical work environments. If your employer or team hasn't done anything to facilitate working remotely other than moving meetings to Zoom/WebEx/Meet, then you're right: people haven't put thought into video conferences.

There are numerous articles on why this impromptu remote environment isn't the same as traditional remote environments. Are people turning on their camera during meetings? Are people actually responding when talked to during meetings? Is there some type of plan outlined for team members who have kids in virtual school or are unexpected caretakers? Have teams been given the proper collaborative tools to work together remotely? Are working hours being respected?

There are lots of things people didn't think about when places went remote. The problem is that they never went back to address them either.


The camera on or off brings up another question: When do you look straight into the camera and when do you look at the screen to see the person speaking? I can see arguments for both.


My first thought was about the diversity of faces used in the demo and how ten years ago, computers didn't think black people were humans.

https://www.youtube.com/watch?v=t4DT3tQqgRM

But after that, I was reminded of the paranoia (or not?) around Zoom and that, for an extreme example, the CCP was mining and generating facial fingerprints and social networks using video calls. It seems like this technology is the same concept except put to a useful purpose.


> ten years ago, computers didn't think black people were humans

And they still don't often enough (see recent ExamSoft issues discussed e.g. here: https://news.ycombinator.com/item?id=24641063 )


If I had to guess, the issue around "black people" is that photos are 2D.

We don't really understand just how little information is actually in a photo (we add huge amounts of info in our perception).

My guess is that predictive systems are using contrast as a guide to essentially 3D structures which, simply, just cannot be reconstructed from 2D. And therefore, probably struggle more on dark faces which have different contrast properties.


While contrast is almost certainly part of it, I’d hazard a guess that the training set is also partly to blame.

Now, I don’t know much about neural networks (AI), but my understanding is that if you provide a training set representative of the population makeup, (in America at least) it’ll be biased towards white people as it hasn’t “seen” enough black person images. My limited understanding would then make me think one would need equal white person photos as well as black person photos.


Sure, but were those datasets of 3D snapshots of white people, it wouldn't matter.

Black faces and white faces are "statistically equivalent" in 3D.

The issue is more, in my view, the hubris of calling this system "facial recognition". It isn't: it's pixel pattern color recognition which sometimes coincides with certain facial patterns.


I imagine you're right on both counts. Contrast is certainly an issue. You definitely have to make adjustments to photograph dark skin (or animals with black fur).


If the "Free View" really works well, that sounds like possibly the most important part. The missing feeling of eye contact is a significant unsolved problem in video calls.


Apple has just implemented their own version of this in FaceTime in iOS 14.


I would imagine Apple doing this with FaceTime soon.

Using their own NPU (Neural Processing Unit), you can now make a FaceTime call with ridiculously low bandwidth. From the Nvidia example, 0.1165 KB/frame even at buttery smooth 60fps (I could literally hear Apple market the crap out of this) is 7KBps or 56Kbps! Remember when the industry were trying to compress CD Audio quality (aka 128Kbps MP3) down to 64Kbps? This FaceTime video call is using even less!

And since the NPU and FaceTime are all part of Apple's platform and not available anywhere else, they now have an even better excuse not to open it up and to further lock customers into their ecosystem. (Not such a good thing with how Apple are acting right now.)

Not so sure where Nvidia is heading for this since not everyone will have a CUDA GPU.


> I would imagine Apple doing this with FaceTime soon.

What's this claim based on? Last I looked into FaceTime tech, they didn't do anything special - their quality comes from use of H.265 and the fact that iOS devices have good quality HW encoding blocks which provide good compression at low bandwidths.

FaceTime stream is usually also low motion / change so it's possible to achieve very good compression even with basic quality. Although they still don't quite match the quality of AV1 powered Google Duo on very poor connections.


I'm not the person you're replying to, but I think the main reason you'd expect to see this on iOS is due to the additional Face ID sensors available on most of their devices. Presumably Apple could offer a version of this tech that features higher quality tracking and thus a more realistic and natural-looking final output.


The latest NPU can't deliver this kind of "deep compression" at 25fps, not even 10fps. But in the future they can just send the Facemesh vertices and streaming text of the speech (and classically compress non-speech audio, if it's even desired as most people are happy just using it to talk) so it will be less data than 1Kbps.


I think Face ID could be used to create the point map, instead of generating it from an image. They could also use Face ID to protect the tech against malicious deep fakes, e.g. only allow people to use this feature when Face ID confirms the user and the manipulated photo are the same person.


Isn’t the point map supposedly encrypted on the Secure Enclave?[a] If it is, you can’t access it from the main CPU; you can only ask it to compare a provided one with the stored one.

[a]: I know the password is stored on it, but idk about the face mesh


The verifier point cloud used for unlocking the phone must be kept in the secure element.

That doesn’t mean you can’t read a point cloud for other purposes.

They currently expose ARPointCloud API but I don’t know what sensors it uses to produce it;

https://twitter.com/nobbis/status/1292262455490629633?lang=e...


> Remember when the industry were trying to compress CD Audio quality (aka 128Kbps MP3) down to 64Kbps?

128kb mp3 is good enough for most people most of the time, but it isn't CD quality. Having said that, 64kb Opus is almost or about as good as 128kb mp3.

I wonder how well these techniques can be applied to audio.


I doubt the general technique described here would have any applicability to audio, unless the idea is to dynamically create a text-to-speech model of the speaker and transmit only text. I don’t see that being practical any time soon.


> Not so sure where Nvidia is heading for this since not everyone will have a CUDA GPU.

Maybe it will be exclusive to Android devices.

Or maybe it will work on any device (consuming CPU or GPU depending on the hardware) but only on Nvidia's communication app.


> I would imagine Apple doing this with FaceTime soon.

And we'll notice it because somehow our and other people's faces in FaceTime will start to subtly convey strangely aggravating emotions, as current Memoji do.


The problem is that with a phone the background changes more, and you show more things than just a static face, because it is easier to move around.


Probably just turning an R&D project into a marketing feature.


I would go to the original announcement rather than this reproduction with ads https://developer.nvidia.com/maxine?ncid=so-yout-26905#cid=d...


What ads? I can't see any. That damn uBlock Origin must be the culprit in my case :D


Isn’t this just like Apple's animated emoji (Animoji), where your face is mapped to an emoji character? Except instead of a cartoon it’s mapped to your actual face.

https://blog.emojipedia.org/apples-new-animoji/

And how well does that work when you switch to screen sharing?


I’m actually somewhat surprised someone hasn’t already built video chat just based on this ^


Nvidia Uses AI to Slash Bandwidth on Video Calls... But only if everybody in the call has a 600$ Nvidia GPU


Depends on how big the network is. You can process neural networks on CPU with acceptable speed in certain cases. And nicer smart phones tend to have a decent GPU.


Indeed, making this tool usable for consumers is a challenge in and of itself.


Facemesh[0] already runs on a mobile phone. This doesn't seem much more complex.

[0] https://github.com/tensorflow/tfjs-models/tree/master/faceme...


$600 today, but in 3-5 years an equivalent component will be cheaper, perhaps significantly so.

This is a brilliant idea, even if the hardware is pricey today.

Disclaimer: no affiliation, but I use Zoom/Teams/Slack/FaceTime/YouTube.


That would still have major advantages. You could use the extra bandwidth to stabilize everything. Even on the best of networks you have buffer pauses. And on wifi or cell you have a lot of those. If this can be used to improve those ... you'll have improved quality greatly.


I wonder how weird it gets when you turn your head too much. This is very cool though - I was expecting to be able to tell a difference and maybe slip into uncanny valley territory but it looks good.

Big question though - is this just substituting the problem of not having good internet with not having a really fast nVidia graphics card?


If I understand correctly the sender just uses classic object detection/tracking. So the question would be how bad does it look if the receiver just tried to distort the image using that tracking data without having a trained model to smooth out the distortions.


They have an example of head orientation on their site. It moves the head to look at the camera.


Now the person you are speaking to is going to be n% (partially) emulated. n is going to increase in future. One day there will be a paid feature letting you emulate 100% of yourself to respond to video calls when you are not available. And finally, they will replace yourself even without you knowing, and even after you die.


That won't be possible unless full-brain emulation with live data sync gets invented.

Knowledge workers are all about the knowledge that they build up over time when working with a particular environment (be it an industry, a system, or a person/group of people). That knowledge is non-deterministically synthesised in the brain based on the experiences of that person, and being non-deterministic, no AI will come to the same conclusions about every item as this particular human would.

In that case, an emulated personality that is meant to make themselves available as your replacement will be an impostor. One that is less of an expert than yourself at best, and at worst one that is misinformed or misled on various issues (which in turn causes other people to be misled or misinformed).


Perhaps. In his Revelation Space universe, Alastair Reynolds categorises AI as Alphas (full brain simulations), Betas (non-conscious mimics which are based on a person’s behaviour over an unspecified but very long time period), and Gammas (which I read as being mere apps with voice interfaces).

If the goal is to make it seem like you're present when you aren’t, I can believe we’re only a few years away from Reynolds' Betas: a Markov chain can’t mimic a human well, but it can mimic a human; GPT3 can do better, and while it still isn’t great, the main reason it feels like it might not be enough for public figures is how easy it is to get it to answer as if it were someone else rather than as The Right Honourable Sir Obvious Madeupname, MP for Oxbridge-upon-Wells who is paying for the chatbot to mimic him in particular.


This was part of what drove me off Twitter after the last election. I realized through my links and tweets that I had already provided enough training data for a bot that could not only sound like me, but share content from the future that I might be interested in.


That's an interesting perspective.

The bandwidth and requirement to participate correlate.


At what resolution? And also, does the output actually resemble the original image? Are there examples with a background other than a uniform one? It would be nice if they provided more than just screenshots.

It's not uncommon to see video calls at 100kbps-150kbps, which is ~10KB/s, and this is for 7fps or so, including audio. So "per frame" that would be 1KB or so (more for key frames, less for P frames).

So they say it can be 0.1KB, so better than that... Exciting, if realistic.

Also, add on top audio, and packet overhead :-) there is at least 0.1KB overhead for sending the packet (bundle it with audio if possible!)
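For anyone who wants to redo the arithmetic above, here is a minimal sketch. The function names are made up for illustration, and it takes 1 KB as 1000 bytes and ignores how the budget is split between audio and video.

```python
def kb_per_frame(bitrate_kbps: float, fps: float) -> float:
    """Average payload per frame, in kilobytes, for a stream of bitrate_kbps kilobits/s."""
    kilobytes_per_second = bitrate_kbps / 8  # 8 bits per byte
    return kilobytes_per_second / fps


def overhead_fraction(payload_kb: float, overhead_kb: float) -> float:
    """Share of each packet that is header overhead rather than payload."""
    return overhead_kb / (payload_kb + overhead_kb)


for rate in (100, 150):
    print(f"{rate} kbps at 7 fps -> {kb_per_frame(rate, 7):.2f} KB/frame")

# A 0.1 KB payload with 0.1 KB of packet overhead is half overhead.
print(f"overhead share: {overhead_fraction(0.1, 0.1):.0%}")
```

That works out to roughly 1.8 to 2.7 KB per frame with audio included, the same order of magnitude as the ~1KB guess. It also shows why the packet overhead matters so much at 0.1KB per frame.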


This is probably first-order-model[1] using keyframes. You send only one image every 2 seconds and the face mesh 30 times per second.

Then use first-order-model to extrapolate 2 seconds of video from the keyframe.

Rinse, repeat.

Very doable. AMAZING!

The original first-order-model could not do 30 frames per second, but maybe this Nvidia model has some improvements.

1 - https://aliaksandrsiarohin.github.io/first-order-model-websi...
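A back-of-the-envelope sketch of what that scheme would cost on the wire. Every number here is an assumption picked for illustration (a 20 KB keyframe, 68 two-dimensional landmarks, 2 bytes per coordinate, no entropy coding); none of it comes from Nvidia or from the first-order-model code.

```python
def scheme_bitrate_kbps(
    keyframe_kb: float,
    keyframe_interval_s: float,
    landmarks: int,
    bytes_per_coord: int,
    fps: float,
) -> float:
    """Estimated bitrate (kilobits/s) of 'periodic keyframe + per-frame landmarks'.

    Assumes 1 KB = 1000 bytes and 2D landmarks (x, y) sent uncompressed.
    """
    keyframe_bytes_per_s = keyframe_kb * 1000 / keyframe_interval_s
    landmark_bytes_per_s = landmarks * 2 * bytes_per_coord * fps
    return (keyframe_bytes_per_s + landmark_bytes_per_s) * 8 / 1000


# Hypothetical inputs: 20 KB keyframe every 2 s, 68 landmarks, 2 bytes each, 30 fps.
print(scheme_bitrate_kbps(20, 2, 68, 2, 30))
```

With these made-up inputs the keyframes cost 80 kbps and the raw landmarks about 65 kbps, so both the keyframe interval and the landmark encoding would need tuning to get near the figures in the article.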


Now we are just guessing. This is a press release, which does not contain enough technical info. So one can't say how groundbreaking this is.

It's like those "a new battery type has been discovered" news stories, with very little actual data, just guesswork.


Can't wait to see the bugs! GANs are famous for some... interesting reconstructions. And better still, Nvidia will have no way to debug it since the model is essentially a black box.


People are extremely sensitive to subtleties in mouth articulation which facial landmark tracking tends to have trouble capturing. I question whether just a keyframe and facial landmarks are enough to generate convincing lip sync or gaze. I suspect that this is why the majority of the samples in the video are muted, which is a trick commonly used by facial performance capture researchers to hide bad lip sync results.


> extremely sensitive to subtleties in mouth articulation

Agree, the woman's mouth in the video looks _very_ off at 1:03 in the video.


Stills OK, it would be interesting to see it move. Risk for uncanny valley?

Petapixel is a blog spam site btw. Why not go to the source that is linked in the post?


I suppose one can put boundaries on movement. If the face changes significantly, just send another keyframe.
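A minimal sketch of that rule, assuming landmarks are (x, y) pairs normalized to the frame size and that "changes significantly" means the mean landmark displacement since the last keyframe exceeds a threshold. The 0.15 threshold and the function names are hypothetical, not anything Nvidia has described.

```python
import math


def mean_displacement(reference, current):
    """Mean Euclidean distance between matching (x, y) landmarks."""
    if not reference or len(reference) != len(current):
        raise ValueError("landmark lists must be non-empty and the same length")
    total = sum(math.dist(a, b) for a, b in zip(reference, current))
    return total / len(reference)


def needs_keyframe(reference, current, threshold=0.15):
    """True when the face has drifted far enough from the last keyframe."""
    return mean_displacement(reference, current) > threshold


last_keyframe = [(0.40, 0.50), (0.60, 0.50), (0.50, 0.70)]
slight_nod = [(0.41, 0.52), (0.61, 0.52), (0.51, 0.72)]
head_turn = [(0.70, 0.50), (0.85, 0.50), (0.80, 0.70)]

print(needs_keyframe(last_keyframe, slight_nod))
print(needs_keyframe(last_keyframe, head_turn))
```

A real system would probably also want a cap on keyframe frequency so that a lot of head movement does not wipe out the bandwidth savings.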


There is a video at the top of the article.


See it now, Firefox Klar wasn't willing to play it.


We're all deepfakes now, it seems.


Same as it ever was.


So it’s a deep fake, not video. These headlines man.


In Vernor Vinge's book A Fire Upon the Deep he describes how interstellar calls work in the future. In the lowest bandwidth tier you are watching an animated static 2D photo with text-to-speech. The book also touches on topics like translation and the different spectra and senses that different species want from the call.


Is AI cheaper than bandwidth?


I think computation is always cheaper than bandwidth, so in theory if AI can approach its computational limit, it ought to be.


Depends what you’re measuring, given both AI and bandwidth are sliding scales.

Even JPEG takes advantage of human perception, throwing away first what we can’t perceive.


Sometimes... Sometimes not...

spotty mobile connections with a powerful device (iPhone Android etc...) it might make sense.


Looks neat. I wonder what the system requirements / license for the software will be?

There's a real network effect with things like codecs - unless some significant proportion of calls can use it, it'll remain a cool but obscure experiment.

I hope Nvidia have the foresight to release something that'll run on any hardware, and under a permissive license, but I suspect not.

The idea is out there already (it's basically deep fake tech, right?), and I'm sure it won't be so long before some open source version of it gets released. Nvidia would be wise to get out in front of that and at least have their brand associated with a widely used variant on the theme.


I find it very nice that for once tech will be used to lower cost of communication.


Great concept. With higher quality input, keyframe, point tracking, and ML model the output should improve significantly with similar bandwidth improvement. I think this could also smooth between dropped frames and offer higher bandwidth for the audio feed.

The issues are social. I would hope that the receiver is the one able to choose between original or AI stream, as I can understand some people being uncomfortable with the artifacts, gaze, expressions, and other edge cases. But when the quality is higher I could see a lot of people preferring this option as a default.


Huge if true.

Recall the promise of 5G is an order of magnitude increase in speed (among other things like low latency).

If we can get there by reducing bandwidth requirements by an order, that will be great. Wonder if it applies to Netflix...


TL;DR: You can't reconstruct a generic video stream from a dozen detected facial data points.

Not sure if I understood your post correctly, but you're slightly misled here.

Watch the video. This is not a new general purpose video codec, it is basically Deep Fake - taking a still image (key frame) and superimposing detected facial expression/movement on this key frame (leaving out some technical details).

This is an improvement over (not-anymore-)state-of-the-art h264 since transmission of only a few coordinates mapping your facial expression is significantly less data to transmit than a delta of arbitrary video and periodic keyframes (again generously leaving out important technical details of modern video codecs). Trying to reconstruct a moving car/background/etc. from these facial expression key points will lead nowhere.


one of the comments [1] in the article (not mine) is excellent tongue-in-cheek but thought provoking:

> Next step would be to just predict both sides of the conversation and sever the real-life link entirely.

Gmail already does a little bit of this. Google books appointments over the phone on your behalf.

We're on the road to this...

[1] http://disq.us/p/2ccckuy


To reproduce something akin, use this Colab notebook: https://colab.research.google.com/github/eyaler/avatars4all/...

In the final form, use "You=" as the reference and press it once every second or so to simulate a keyframe.

AMAZING!


Not shown: 12U quarter rack stuffed full of GPUs


No way. This probably runs on a single consumer graphics card.


This will probably run in a 2020 CPU at 20 frames per second.


This includes a feature:

> Called “Free View,” this would allow someone who has a separate camera off-screen to seemingly keep eye contact with those on a video call.

Am I the only one who thinks eye contact on video calls feels creepy? I think I would prefer this feature to remove eye contact on video calls rather than add it.


Apparently researchers have measured this, and for people meeting in person it's normal to maintain eye contact 30%-60% of the time in a conversation [1] with each contact lasting ~3 seconds [2]

So a system that maintained eye contact continuously would indeed risk looking creepy!

[1] https://www.forbes.com/sites/carolkinseygoman/2014/08/21/fac... [2] https://www.businessinsider.com/heres-how-long-you-should-ho...


Indeed, have a look at what too much eye contact looks like: https://samharris.org/look-into-my-eyes/


If you have a laptop and an external screen this is useful to not look like someone looking away from people


I can steal all your video data of 'you' and call someone else back, as you?

I could even get an accomplice to do it while I'm talking to you. They would have your today clothes on and you'd be tied up talking to me.

I'm dubious on the tech being as good as they say now. But it's getting exciting.


> instead of sending a stream of pixel-packed images, it sends specific reference points on the image around the eyes, nose, and mouth

So they're trading bandwidth for CPU load at either end. I wonder what the tradeoff is in terms of energy? Would this result in higher CO2 emissions?


Maybe but you're discounting the energy required to send those 99% of pixels over the network. Just because it's not your CPU doesn't mean it's not doing work.


Indeed, in the case of LTE comms and mobile chips the difference in power usage is greater than six orders of magnitude.


Or maybe lower since less data is transmitted?


> So they're trading bandwidth for CPU load at either end

Given it's Nvidia I would imagine it's more likely going to be GPU load rather than CPU load. Don't underestimate the current computational overhead of existing lossy video compression.


My tests already take twice as long to run when I’m on a Google hangout. For my case, I’d honestly rather use more bandwidth and do minimal local processing. If my machine is slowed down any more then I might have to stop working completely and focus on the meeting I’m in!


Admittedly I have no idea how resource-intensive this image processing is, but it's possible that it's less of a hit than the fact that all video call programs today run on HTML+JS. If Hangouts was a native program, there would be plenty of power to spare.


Modern machines aren't really designed for real computation like running tests or compiling stuff. They're made to render webpages and load up games. Run your computation tasks on a VM in the cloud...


Are remote development environments common? There’s a CI server doing the heavy lifting but I still run tests constantly on a docker box while developing locally.


I develop entirely remotely. Save the files on a network mounted filesystem, run shell commands over SSH/mosh, etc.

The server is in the same city as me to avoid excessive latency.

The benefits of being able to develop anywhere (including from a phone in a pinch), and being able to add extra CPUs and RAM at the click of a button, outweigh the need for a network connection for my use cases.


Essentially deepfaking yourself. There's no way to know that the nuances of the emotions conveyed will be reliably transmitted, as everything but the face lines is hallucinated. And then, it's so lifelike that you have no plausible deniability.


Hmm... interesting. I think it looks really good. I wonder how soon it will be before a LEO agency uses tech similar to this to alter bodycam footage.

(Yes, I know this is realtime webcam footage, not recorded footage, I'm just curious).


Is it just me or do the videos no longer look natural? I feel like I see highly non-linear movements (parts of the video moving when they shouldn't, or vice-versa), and facial expressions don't really look quite the same.


They look perfectly natural to me

I tried to pay attention to what you and others flagged (parts not moving, facial expressions, shoulders, etc.) But I cannot spot anything out of place.

That said, looking at other comments here, you're definitely not the only one


Fair, in that they aren't perfect photorealism, but if their comparisons with the regular codecs and techniques are correct, I'd take their modelled faces over the comparable level of digital artifacts for the same bandwidth.

After all, it doesn't need to be said that when you're on a regular videoconferencing call and bandwidth starts to suffer, the resulting images don't really look anything like a photorealistic person either. I think this is actually a really good use of NN.


The thing is you only want to make this trade-off when the bandwidth is actually starting to suffer. It'd be nice if there was a nice way to make this adaptive and use NNs only when throughput is low, but the nonlinearity of the distortions makes me think this would be really hard. [1] I know what I don't want is for a normal conversation in an uncongested network to look unnatural or for facial expressions to get distorted unnecessarily.

Edit: [1] I meant to say doing a mixture of these (with the NN image as the "base", with H.264 to improve accuracy) seems really hard. On the other hand, just a hard switch from H.264 to NN when quality degrades is probably quite practical?
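For what it's worth, the "hard switch" variant can be sketched as a small state machine with hysteresis, so the call doesn't flap between codecs when throughput hovers around one threshold. The class name, the mode labels and the 150/300 kbps thresholds are all assumptions for illustration.

```python
class CodecSwitch:
    """Chooses between plain H.264 and the NN codec from measured throughput."""

    def __init__(self, low_kbps: float = 150.0, high_kbps: float = 300.0):
        if low_kbps >= high_kbps:
            raise ValueError("low_kbps must be below high_kbps")
        self.low_kbps = low_kbps    # drop to the NN codec below this
        self.high_kbps = high_kbps  # return to H.264 above this
        self.mode = "h264"

    def update(self, measured_kbps: float) -> str:
        """Feed one throughput sample and return the mode to use next."""
        if self.mode == "h264" and measured_kbps < self.low_kbps:
            self.mode = "nn"
        elif self.mode == "nn" and measured_kbps > self.high_kbps:
            self.mode = "h264"
        return self.mode


switch = CodecSwitch()
modes = [switch.update(kbps) for kbps in (400, 140, 200, 310)]
print(modes)
```

The gap between the two thresholds is the design choice here: at 200 kbps the call stays in whichever mode it was already in, so an uncongested call never touches the NN path.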


Perhaps it's just me, and my philosophical bent, but I can actually see coming at it from the exact opposite end.

I don't want bandwidth and things spent/wasted without it providing a significant benefit (I'm probably one of those 1080p/720p is good enough for most things type guys).

I definitely don't want work making large bandwidth or resource claims on my connection when I'm working at home. And if any of this remote working has taught me anything, it's that most of my colleagues don't have steady/reliable tech or connections, so I'd almost want it used pre-emptively as a default so we can spend the rest of those resources on robustness or other qualities. (I realise of course that at the moment none of them have high-grade Nvidia graphics cards, but I'm talking hypothetically in the far off future).

In short, I want a world where the cost to benefit ratio of things is orders of magnitude larger, because things like this let us spend network/resources on things which matter.

Yes, when I'm calling my parent/grandparent one on one I might want to upgrade the signal, but I don't need to see a random colleague's face in all its HD glory, or remote people when I have no idea what they look or sound like anyway (I believe that's also been one of the findings with deepfakes: you don't notice the eeriness/falseness as much if it's a face you have no prior knowledge of).


Adaptive use would be awesome. Especially on conference calls at work there is always that one (or ten) person whose connection is absolutely awful and looks like a giant pixelhead.


This is perceptible to some degree, but in my opinion a highly practical way of applying this technology is to intentionally lower/shift the quality on the fly just a bit so as to "blend" the layer boundaries where the stitching happens.

That way the doctoring isn't apparent, you still benefit from the massive bandwidth savings (which I consider most important anyway), and it appears more believable in the real-world context of variable bitrates.


Agreed, it's especially visible on the shoulders.


How soon before this incorporates GPT3 and guesses what we were going to say anyway, so we no longer need to say it? Or doesn't quite guess right, and says something that gets you fired!


Do we actually need GPUs to run this? There is no training involved, only inference, and CPUs (or low-end GPUs) should comfortably run the workload, at least for a couple of faces.


When will we get VR goggles + this tech for couples so we can shape shift during sex, edit out the VR goggles and explore scenes together while still viewing our partner?


Perhaps they are overdoing it (if you have a hammer, ...). I would think that the most useful way to use AI in this context would be to predict who is going to speak next.


What is the product? Is this going to be licensed to Zoom, Skype, Teams, etc? Or is this a distinct product? Does it depend on specific hardware?


For something a bit closer to traditional compression while using NNs / "AI", there's wave.one


The biggest application that I can see is being able to send video messages from Mars and beyond.


You can get pretty good network speeds in space. Some Ka-band satellites (SpaceX Starlink for example) have gigabit networks between them. All you need to extend that out to Mars would be high gain directional antennas on things. If you used lasers instead it could be even higher bandwidth.


I have to disagree with you there. I owned a satellite uplink/downlink in Afghanistan and it wasn’t cheap per megabit (supply/demand issues) and I highly doubt you’d be able to get a gigabit connection to Mars, especially when it’s on the other side of the sun. Also, the earth is rather noisy, so it may be easier to get a fast connection from Mars, the SNR from earth will be much lower.


> it wasn’t cheap per megabit

I made no claims about the cost. :)

I guess I'm also talking about the theoretical maximums in ideal conditions as well, which is sort of cheating on my part. There would be times when you can't get the optimal speed if, like you say, the sun is in the way.


I think I saw Apple had a patent with a similar idea when they first launched FaceTime.


Maybe Zoom could use this so their video quality doesn't look like it's from 1999.


Using middle out compression?


How does this work when I show my back garden through the video stream?


It doesn't, and just falls back to using full bandwidth as usual.


Watching the video it no longer feels like you're looking at a real person, but instead just another npc. It no longer feels as personal. The last thing remote relationships need is more impersonality. I hope this is used only when it's needed.


I'm really confused. I just watched the video again, and I cannot really see the effect that you're talking about.

In fact, I struggle to see any difference from the original. It totally feels genuine, I'm not sure why you perceive them to be npcs


I think it would be easier to see if they had a comparison between a high bit rate traditionally compressed video and their neural net compressed video. To me the facial shapes and movement seem just a little bit wrong like you’re looking at a 3D animation of a person instead of a video.


They do this using GANs so I don’t think it is compressed in any way. Instead, the points on the face are sent across the net and the face is actually generated. I’m sure it can get better with time. There is a site thispersondoesnotexist.com that uses the same tech but looks real.


> Instead, the points on the face are sent across the net and the face is actually generated.

This is not fundamentally different from other lossy compression algorithms.

The bits that are missing are reconstructed using a bias built into the compression algorithm. Deep learning based algorithms simply have a more realistic bias, so the artifacts that are introduced to replace the missing bits are less noticeable.


I want to see if people notice this effect when they don't know it's artificially generated. I have a feeling that the uncanniness is at least part because you know it's fake.


That is a cool use of AI!

Sidenote: I always had this idea for "video compression", though I'm by no means an expert in compression.

1) Take like the top 10 most "VISUALLY DIVERSE" movies (imagine Forrest Gump (slow drama) vs. Toy Story (animation) vs. Rambo (fast-paced visuals))

2) "Encode/compress"(I know these terms are not interchangeable) the movie as a "diff/ref" to the "most similar" movie from step 1

2b) The "diff/ref" can take many forms, e.g. a "sliding window" over x section of y amount of frames.

3) The "end-user" or "destination" have these "10-master-movies" locally on hdd and together with the local-data can construct the original "frame" or "movie" from the compression and local-movie on disk.

Tl;DR Try to compress a new movie by saying "the top corner of frame 1-120" is very similar to MasterMovie-2-FrameXYZ-Frame-ABC

4)
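The closest existing thing I know of to the "shared master data on both ends" idea above is preset-dictionary compression, which zlib supports through its `zdict` parameter. The sketch below is only an analogy on raw bytes, not a video codec: the "master movie" is a made-up byte string, and DEFLATE's window is at most 32 KB, so real movies would need a very different mechanism.

```python
import zlib


def compress_with_reference(data: bytes, reference: bytes) -> bytes:
    """Compress data so that repeats of the shared reference cost almost nothing."""
    compressor = zlib.compressobj(level=9, zdict=reference)
    return compressor.compress(data) + compressor.flush()


def decompress_with_reference(blob: bytes, reference: bytes) -> bytes:
    """Inverse of compress_with_reference; needs the same reference bytes."""
    decompressor = zlib.decompressobj(zdict=reference)
    return decompressor.decompress(blob) + decompressor.flush()


# Stand-in for a "master movie" that sender and receiver both have on disk.
reference = bytes((i * 197 + 31) % 251 for i in range(4000))
# A "new movie" that mostly reuses material from the master.
data = reference[500:1500] + b" plus a little new material"

with_reference = compress_with_reference(data, reference)
without_reference = zlib.compress(data, 9)

print(len(data), len(without_reference), len(with_reference))
assert decompress_with_reference(with_reference, reference) == data
```

Only the compressed blob travels over the network; the receiver rebuilds the original from the blob plus its local copy of the reference.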



