> they have managed to reduce the required bandwidth for a video call by an order of magnitude. In one example, the required data rate fell from 97.28 KB/frame to a measly 0.1165 KB/frame – a reduction to 0.1% of required bandwidth.
A nitpick, perhaps, but isn't that three orders of magnitude?
We've already seen people use outlandish backgrounds in calls; now it's going to be possible to design similarly outlandish views of yourself and actually inhabit them in real time with this new invention. There's been a lot of discussion centered around deep fakes and their problems; this is essentially deep faking yourself into whatever you want.
Video calls are a very important form of communication at the moment, if this becomes as accepted as background modification, that would open the societal door to a whole range of self presentation that up till now was restricted to in game virtual characters.
I wonder what kind of implications that could have. Would people come to identify themselves strongly with a virtual avatar, perhaps stronger than their real life "avatar"? It is an awesome freedom to have, to remake yourself.
I watch a lot of interviews with people in VRChat and it's very interesting how people seem to find it easier(?) to open up while they are embodying a character. https://youtu.be/KZWOXgc7PA4
Being able to experiment with identity in this way is really interesting to me, and I hope it becomes more mainstream with the proliferation of this technology
This sounds a lot like the movie Surrogates, where at one point the protagonist notices the badge of an android surrogate is completely different from the human behind the surrogate (as printed on the badge).
There's a Webtoon (it's okay but not great) that had a premise I think will turn out to be prescient and reminds me of your last bit. The gist is that it's a future where everyone wears VR goggles. As a result, the teens in the comic all have personalized visuals mapped to their bodies that you can see if you also have your goggles on when you look at them. The cooler kids even have full blown avatars that cover up their entire body and make them look like everything from aliens to elves.
Jaron Lanier’s book on VR went in-depth on the importance of avatars, and experiences people had embodying different avatars — particularly in the early days of first-wave VR.
Today's meeting was an hour-and-a-half spent on bike-shedding the position--and color--of the "logout" button on our product page.
15 minutes were spent debating the resident usability expert who suggested white text on a dark blue background would be more readable for people with low vision. The department manager insisted on retaining pastel blue as it is his favorite color.
45 minutes were spent arguing over whether "logout" or "log out" comprises proper semantics. Our linguistic expert was unfortunately not able to attend as she was sent on a business trip earlier in the week.
The last 30 minutes were focused on team-building exercises as Bob doodled on his tablet and Susan smiled politely at the speaker whilst screaming internally as she had 3 more meetings to attend before her department could move forward.
45 minutes were spent by humans arguing over whether "logout" or "log out", while I created both buttons (GPT-3 can already do that) and A/B tested it.
> A nitpick, perhaps, but isn't that three orders of magnitude?
Perhaps the example was a best-case, and the usual improvement is about 10x. (That or 'order of magnitude' has gone the way of 'exponential' in popular use. I don't think I've noticed that elsewhere, though.)
As these trends sort of become more and more prevalent, I am so shocked at how David Foster Wallace had nailed this prediction in his book Infinite Jest.
Humans becoming more and more dependent on virtual face-to-face meetings and also relying on embellishment of their supposed appearance through the screen. It reminds me of how SciFi authors predicted technology, but with a complementary commentary on human psychology.
Sorry if it isn't directly related to the post, but it is so striking to me.
I imagine people in home office situations would like to use this not only for the background, but for themselves. I mean, if you are indoors all day, you might not be perfectly groomed for the day - so faking that would probably appeal to many people.
I think the ability to, as someone mentioned, have yourself look a bit tidier than you actually are (working from home) could be a huge benefit.
I mean taking away focus on things that don't matter in a virtual meeting such as:
Where you are sitting - via Virtual Background
Your daily hair style status or if you have a nose pimple - via NVIDIA's AI showcased here.
Would be great.
Though replacing yourself with a "digital" avatar I think takes away many of the benefits an actual live meeting provides.
It depends on how accurately the avatar is able to represent important information: emotion, attention, state of mind, etc. There's a lot of bandwidth in looking at a face (and body language as well), that's where the value in face-to-face meetings is.
sure is .. if you stream your face at >30-50Mbit/s. For contrast highest bitrate available on Twitch, used for streaming high motion full screen updating twitchy 1080@60 gaming, is ~6-8Mbit/s.
David Foster Wallace predicted this in his novel Infinite Jest. Except they were static images inserted over a video phone, and the user had to keep their head positioned just right to make them work.
It's possible that the "order of magnitude" statement was the majority case, and the 0.1% statement was a best case scenario. So, 1 magnitude is to be expected, but 3 is possible.
My prediction is that people will just change their avatars as often as they change their personal fashion. For some that’s never and for others it’s every season or even more often.
This reminds me of a sci-fi novel I read in the nineties. The premise had something to do with actors who took on roles in virtual reality where their bodies are fit with sense-points. They're cast in live-action role-plays with wealthy remote clients. They're basically deep-fakes in VR.
Neal Stephenson's Diamond Age had an element like that. It was even possible that the actors ("ractors") didn't necessarily know what they were acting in, just the general parameters and the next set of lines and actions necessary to continue the performance.
"An order-of-magnitude difference between two values is a factor of 10. For example, the mass of the planet Saturn is 95 times that of Earth, so Saturn is two orders of magnitude more massive than Earth."
> "More precisely, the order of magnitude of a number can be defined in terms of the common logarithm, usually as the integer part of the logarithm, obtained by truncation."
$ bc -l
l(835)/l(10)
2.92168647548360208478
That would make it 2 orders of magnitude by that method. Happy to accept that it's 3 orders of magnitude by the N=a*10^b method though. Either way, it's definitely not one.
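For what it's worth, both readings can be checked in a few lines. This is only a sketch using the two per-frame figures quoted from the article; the function names are made up, and treating the N=a*10^b method as "round to the nearest power of ten" is my assumption about what was meant, not something the article defines.

```python
import math

KB_PER_FRAME_BEFORE = 97.28   # figure quoted from the article
KB_PER_FRAME_AFTER = 0.1165   # figure quoted from the article

def orders_truncated(ratio: float) -> int:
    """Integer part of log10(ratio): the 'truncation' definition."""
    return math.trunc(math.log10(ratio))

def orders_rounded(ratio: float) -> int:
    """Nearest power of ten (assumed reading of the N = a*10^b method)."""
    return round(math.log10(ratio))

ratio = KB_PER_FRAME_BEFORE / KB_PER_FRAME_AFTER
print(f"ratio = {ratio:.1f}, log10 = {math.log10(ratio):.3f}")
print("truncated:", orders_truncated(ratio), "rounded:", orders_rounded(ratio))
```

Neither definition lands on one order of magnitude for this particular example.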
I think it’s an unnecessary nitpick. One statement is general and the other is a specific, extreme example. “usually 10x, sometimes 1000x”, that sort of thing
A technology very similar to this plays a plot point in Vernor Vinge's 1992 novel A Fire Upon the Deep.
In his universe, both the interstellar net and combat links between ships are low bandwidth. Hence, video is interpolated between sync frames or recreated from old footage. Vinge calls the resulting video "evocations".
I was thinking of exactly this when I read the article.
The plot point being that when the bandwidth gets too low, the interpolation AI has to make lots of stuff up, so you are not quite sure exactly what was said.
I seem to remember the bandwidth in the book was very tiny, small number of bits per second (?) so the AI was taking the speech and compressing it into something more compressed than text then decompressing it at the other end into something that was more or less the same.
Wow, that’s a fascinating concept. Effectively a bitmap index of possible terms said and synthesized speech back on the other end based on old footage of how that person talks.
I highly recommend A Fire Upon the Deep. It's a rare mix of really interesting hard scifi with an actually good story and characters. Hard scifi often has very flat characters but this is not a book which suffers from it.
It has a very very cool twist to explain the Fermi Paradox and is a really good example of a universe with one modified rule.
There is also a similar technology in Rob Reid's book After On. The AI in that book has the ability to "refocus" the person so that they are looking into the camera.
I believe this is huge and would create higher engagement if everybody was actually looking into the camera instead of to the side or up all the time. Creating a more human and emotional relation with the people you are talking to.
Fundamentally, I don't know if people realise what we're on the verge of here.
It's effectively motion-mapped keypoints of the person projected onto a simulated model. I'm assuming the cartoonish avatar was used as an example to partly avoid drawing direct lines to the full implications.
- There's no reason this couldn't extend to voice modelling as well. (much clearer speaking at much lower bandwidth)
- There's no reason this couldn't extend to replacing your sent projection with another image (or person)
- Professional looking suit wearing presentation when you're nude/hungover/unshaven. Hell, why even stop at using your real gender or visage? Imagine a job interview where every candidate, by definition, visually looked the same :)
- There's no reason you couldn't replace other people's avatars with ones of your own choosing as well.
- Why couldn't we model the rest of the environment?
Not there today, but this future is closer than many realise.
>>- Professional looking suit wearing presentation when you're nude/hungover/unshaven. Hell, why even stop at using your real gender or visage? Imagine a job interview where every candidate, by definition, visually looked the same :)<<
This would be interesting as an upgrade to the “name on resume” test.
Could also see a future company policy that runs people's data through a “sameness” filter before letting them into the company to scrub bias.
This is a lot like Framefree.[1] That was developed around 2005 at Kerner Optical, which was a spinoff from Lucasfilm. The system finds a set of morph points in successive keyframes and morphs between them. This can do slow motion without jerkiness, and increase frame rate. Any modern GPU can do morphing in real time, so playback is cheap. There used to be a browser plug-in for playing Framefree-compressed video.
Compression was expensive, because finding good morph points is hard. But now hardware has caught up to doing it in real time on cheap hardware.
As a compression method, it's great for talking heads with a fixed camera. You're just sending morph point moves, and rarely need a new keyframe.
You can be too early. Kerner Optical went bust a decade ago.
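To make the morph-between-keyframes idea above a bit more concrete, here is a rough sketch of the cheap half: generating in-between positions for a set of morph points. It is not Framefree's actual algorithm (which I haven't seen); the function names and sample coordinates are hypothetical, it assumes plain linear interpolation, and it skips the expensive parts entirely (finding good morph points and warping pixels).

```python
from typing import List, Tuple

Point = Tuple[float, float]

def interpolate_points(start: List[Point], end: List[Point], t: float) -> List[Point]:
    """Linearly blend matching morph points; t=0 gives start, t=1 gives end."""
    if len(start) != len(end):
        raise ValueError("keyframes must have the same number of morph points")
    return [(x0 + (x1 - x0) * t, y0 + (y1 - y0) * t)
            for (x0, y0), (x1, y1) in zip(start, end)]

def in_between_frames(start: List[Point], end: List[Point], count: int) -> List[List[Point]]:
    """Point sets for `count` evenly spaced frames strictly between two keyframes."""
    return [interpolate_points(start, end, i / (count + 1))
            for i in range(1, count + 1)]

# Hypothetical morph points (say, two mouth corners) in successive keyframes.
key_a = [(100.0, 200.0), (140.0, 200.0)]
key_b = [(104.0, 208.0), (136.0, 208.0)]
for frame in in_between_frames(key_a, key_b, 3):
    print(frame)
```

Asking for more in-between frames is how you would get slow motion or a higher frame rate out of the same two keyframes.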
This may be projecting expectations but the example compressed video looks very slightly fake in a way that is just a little uncanny valley type unsettling.
Perhaps the nets they’re using are compressing out facial microexpressions and when we see it, it seems just a little unnatural. Compression artifacts might be preferable because the information they’re missing is more obvious and less artificial. In other words I’d rather be presented with something obviously flawed than something where I can’t quite tell what is wrong.
What I don't like about AI processed images is that they are not real. I can't get past the fact that I am not looking at the picture as it looks in reality but at some smart approximation of the world that is not necessarily true.
Video is not real either. The perspective is different and the colors are wrong.
It is even less real after it went through lossy compression. The whole point of a lossy compression is to remove details that you perceive as unimportant. For example, leaves on a tree may look like a greenish mess, but that's fine: from afar, you can't tell the difference.
Using neural networks for compression is far from being a new concept, and the result is not more or less real than any other technique. It is just that Nvidia's implementation is really good at keeping the most important details in a small size.
If you want a more "real" image, you can just use the AI as a predictor and use the remaining bits to encode the difference, like in a traditional MPEG-style codec.
The distinction that makes ai compression creepy is that it can use prior knowledge of what a face is.
Traditional compression has no high level knowledge of what a video call looks like, the algorithms are about patterns in how pixels change in time and space, so the artifacts from the algorithm are pixel-based effects (blockiness, blurriness, etc).
A neural net that has been trained on a million faces and is set up to draw a face may draw a perfectly clear image of a face on low bandwidth...but when it doesn't have bandwidth to be accurate, it doesn't blur things, it fills in details learned from looking at strangers who aren't on the call.
You're sitting at your computer, watching your colleagues discuss something in detail, and slowly you start to notice: your two black colleagues are starting to literally look the same. That's weird, are you slowly getting more racist? Oh right, your bandwidth dropped and your AI upscaler has decided to upscale all your black colleagues to some stereotype it got trained with.
In some ways, of course, I agree, but as some other commenters have pointed out, this is essentially a "deepfake" of yourself...in principle there's no restriction on what one could use as their "avatar." As much was demonstrated in the "Different Output Styles" slide.
So I'd have a hard time saying this is not dramatically less "real" than some lossy compression technique. Is there some way to formalize "realness?" Maybe it would be inversely proportional to the hardness of manipulating the medium as a user..?
Maybe simply using an objective video quality measurement against the original picture would do the trick. Something like PSNR or SSIM. A "deepfake" is likely to score low on such a score that doesn't depend on high level visual perception.
Also know that you can "deepfake" yourself using a traditional video encoder, just change the keyframe to someone else's face. Of course, it will look broken and totally unconvincing but because of motion compensation, you can sort of map the movement of your face on someone else's face.
The technique in the paper simply has way better motion compensation, so good that it still works if you change the keyframe. Traditional video compression algorithms don't work like that because they are not just for talking faces and can't use such advanced techniques for performance and ease of implementation reasons.
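The PSNR suggestion above is easy to try on toy data. A minimal sketch, assuming 8-bit grayscale pixels given as flat lists; the sample values are invented, and a real comparison would run on full frames (and probably use SSIM as well).

```python
import math
from typing import Sequence

def psnr(original: Sequence[int], decoded: Sequence[int], max_value: int = 255) -> float:
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(original) != len(decoded) or not original:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((a - b) ** 2 for a, b in zip(original, decoded)) / len(original)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(max_value ** 2 / mse)

# Hypothetical pixels: a mild codec error vs. a swapped-in "face".
source = [10, 50, 90, 130, 170, 210]
mild_error = [11, 49, 91, 129, 171, 209]
different_face = [200, 20, 160, 40, 90, 120]
print(psnr(source, mild_error), psnr(source, different_face))
```

The swapped-in version scores far lower because PSNR only compares pixel values and has no notion of "still looks like a plausible face".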
I understand what parent means: usually processing was too linear to alter the meaning of what you see. It's distorted, altered, maybe filtered and enhanced, but in general it's the same. With ML/DL you can really be sent... whatever.
come on man there's a difference between a GAN hallucinating 90 % of the image and a very predictable compression algorithm where both parties understand what's going on
If the sending machine (that films my face) does the same decoding I know will be on the other end, then diffs the raw video with the decoded video and finally sends both things, then the receiver should be able to always piece together a 100% reproduction of the actual video feed on my end. The transmitter can always predict exactly what the receiver will decode, so the correct amount of data to send is the amount of data that makes the receiver see what I want. If that means I send 0.1 Mbps of key point data and 1 Mbps of diff pixels then that’s what I want to send.
Is that how this works? Because it should be how it works...
Not sure if this is a viable way of compressing the video stream or if a transmission of the “diff” would be too costly. If this method would give 1/2 the original bandwidth I’d think it’s more impressive than cutting 3 orders of magnitude via “avatars”.
You’re thinking along the right lines, but the challenge is that a raw diff will have the same number of pixels as the raw image, so no compression in bandwidth. So, how do we represent the diff/residue also with fewer numbers? At that point it’s the same as choosing better parameters within some clever encoding (be it pre-designed like JPEG or H.264 or learned via ML).
I was thinking that subtracting the predicted image would give an image that has more zeroes and compresses better (much like dct+quantization for jpeg). After all, any time the neural network would predict an area of the image almost exactly, it can be omitted from the diff stream completely too.
The diff will be low norm (for some suitable norm), but it needn’t be sparse in pixel space. It could be sparse in some other spaces (Eg: wavelet basis), but that’s right where the challenge lies — finding a basis where the data (and/or the residue) to be transmitted is sparse.
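Here is a toy version of the predict-then-send-the-diff scheme being discussed, in case it helps. Everything in it is an assumption for illustration: the "frame" is random bytes, the "prediction" is the frame with small errors in about 5% of pixels (the friendly case where the model is exact almost everywhere), and zlib stands in for a real entropy coder. When the residual is small everywhere but not actually zero, it will compress far worse than this.

```python
import random
import zlib

def encode_residual(frame: bytes, predicted: bytes) -> bytes:
    """Compress the per-pixel difference (mod 256) between frame and prediction."""
    residual = bytes((f - p) % 256 for f, p in zip(frame, predicted))
    return zlib.compress(residual, 9)

def decode_residual(payload: bytes, predicted: bytes) -> bytes:
    """Rebuild the exact frame from the prediction plus the residual."""
    residual = zlib.decompress(payload)
    return bytes((p + r) % 256 for p, r in zip(predicted, residual))

rng = random.Random(0)
# Toy stand-ins: a noisy 64x64 grayscale "frame" and a prediction that is
# exact except for small errors in roughly 5% of the pixels (an assumption).
frame = bytes(rng.randrange(256) for _ in range(64 * 64))
predicted = bytes((v + rng.choice((-2, -1, 1, 2))) % 256 if rng.random() < 0.05 else v
                  for v in frame)

payload = encode_residual(frame, predicted)
assert decode_residual(payload, predicted) == frame  # lossless round trip
print(len(zlib.compress(frame, 9)), "bytes raw vs", len(payload), "bytes residual")
```

Both ends have to run the identical predictor for this to work, which is the same requirement as in any MPEG-style codec.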
I suspect that this isn't feasible. Otherwise lossy video codecs wouldn't exist (or would be very niche). Because if you sent a perfect diff you would essentially have lossless compression again.
So either your idea will revolutionize video compression or the "diff" would bring you back to the ballpark of lossless codecs.
It is a lossless codec yes, but it has a “base” image for the key frames (which are the most expensive) which can be transmitted as the key points. The question is how much that actually helps the end result vs how much it costs. As I said it would likely kill at least nearly all the gains, but again - a small improvement in the state of the art for “normal” looking video would be more impressive than three orders of magnitude vs sending instructions to my head puppet.
It could be bad for telemedicine, since the NN may gloss over some "weird" pathological features not found in its training set. The current setup involving an initial image and keypoints should be fine though, as long as the initial image (keyframe?) is not messed with and is transmitted frequently enough.
your brain does lots of preprocessing to create a stable view of "reality", glossing over blind spots, retinal blood flow, etc. in a way we are always hallucinating, and it seems natural that computers should learn what we are actually paying attention to in video, and focus on that
Yes! E.g. AAA games render the entire 4k screen with the highest graphics, but only a small spot hits the part of retina that can perceive jagged edges or lighting details.
Any lossy video-compression algorithm faces the same challenge. What you are seeing is artificial and is constructed by the algorithm to minimize the perceptual difference between the real and the constructed video feed.
Wonderful technical achievement but I think I’d rather squint through garbled video to see a real human.
Now if I can use it to add a Klingon skull ridge and hollow eyes to my boss or scribble notes on my scrum master’s generous forehead we might be on to something.
I see a lot of people being alienated by the fact that people could take on different avatars during their meeting. I would honestly accept that with no question.
In a work environment, I would expect the person I'm talking to to be presentable, ie their avatar would be presentable, so no goofy backgrounds or annoying accessories.
But the key for me is, I'd actually have something to see. So often at work I'm in meetings where three people have cameras on and the rest don't. I don't really care what they look like, I care if they're engaged, nodding their heads, their facial reactions.
I don't always have my video on either, I don't have great upload speeds so I usually appear as a big blob anyway. I'd happily have whatever representation of me be in my place if it meant people could see my reactions
This is why I keep my camera on, so that whoever is talking sees me nodding away whilst they talk and when they go "any questions" you can pause and say "None from me". It's also why you always unmute yourself when someone joins the call - yes, we can hear you, yes your mic is fine, no you're not muted, it's just no one is replying because they're all muted and doing something else because we haven't started yet.
It feels like people really haven't put any thought into how to handle video conferences.
I assume it's because people are awaiting their return to physical work environments. If your employer or team hasn't done anything to facilitate working remotely other than moving meetings to Zoom/WebEx/Meet, then you're right: people haven't put thought into video conferences.
There are numerous articles on why this impromptu remote environment isn't the same as traditional remote environments. Are people turning on their camera during meetings? Are people actually responding when talked to during meetings? Is there some type of plan outlined for team members who have kids in virtual school or are unexpected caretakers? Have teams been given the proper collaborative tools to work together remotely? Are working hours being respected?
There are lots of things people didn't think about when places went remote. The problem is that they never went back to address them either.
The camera on or off brings up another question: When do you look straight into the camera and when do you look at the screen to see the person speaking? I can see argument for both.
But after that, I was reminded of the paranoia (or not?) around Zoom and that, for an extreme example, the CCP was mining and generating facial fingerprints and social networks using video calls. It seems like this technology is the same concept except put to a useful purpose.
If I had to guess, the issue around "black people" is that photos are 2D.
We don't really understand just how little information is actually in a photo (we add huge amounts of info in our perception).
My guess is that predictive systems are using contrast as a guide to essentially 3D structures which, simply, just cannot be reconstructed from 2D. And therefore, probably struggle more on dark faces which have different contrast properties.
While contrast is almost certainly part of it, I’d hazard a guess that the training set is also partly to blame.
Now, I don’t know much about neural networks (AI), but my understanding is that if you provide a training set representative of the population makeup, (in America at least) it’ll be biased towards white people as it hasn’t “seen” enough black person images. My limited understanding would then make me think one would need equal white person photos as well as black person photos.
Sure, but were those datasets of 3D snapshots of white people, it wouldn't matter.
Black faces and white faces are "statistically equivalent" in 3D.
The issue is more, in my view, the hubris of calling this system "facial recognition". It isn't: it's pixel pattern color recognition which sometimes coincides with certain facial patterns.
I imagine you're right on both counts. Contrast is certainly an issue. You definitely have to make adjustments to photograph dark skin (or animals with black fur).
If the "Free View" really works well, that sounds like possibly the most important part. The missing feeling of eye contact is a significant unsolved problem in video calls.
I would imagine Apple doing this with FaceTime soon.
Using their own NPU ( Neural processing unit ), you can now make FaceTime call with ridiculously low bandwidth. From the Nvidia example, 0.1165 KB/frame even at buttery smooth 60fps ( I could literally hear Apple market the crap out of this ), that is 7KBps or 56Kbps! Remember when the industry were trying to compress CD Audio quality ( aka 128Kbps MP3 ) down to 64Kbps? This FaceTime Video Call is using even less!
And since the NPU and FaceTime are all part of Apple's platform and not available anywhere else. They now have an even better excuse not to open it up and further lock customers into their ecosystem. ( Not such a good thing with how Apple are acting right now )
Not so sure where Nvidia is heading with this, since not everyone will have a CUDA GPU.
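The 56Kbps figure above checks out if you take 1 KB as 1000 bytes (my assumption; the article doesn't say). A quick sketch, with a made-up function name, covering video payload only and no audio or packet overhead:

```python
def bitrate_kbps(kb_per_frame: float, fps: float) -> float:
    """Kilobits per second for a given per-frame payload (1 KB = 1000 bytes assumed)."""
    return kb_per_frame * fps * 8

for fps in (30, 60):
    print(f"{fps} fps -> {bitrate_kbps(0.1165, fps):.2f} kbps")
```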
> I would imagine Apple doing this with FaceTime soon.
What's this claim based on? Last I looked into FaceTime tech, they didn't do anything special - their quality comes from use of H.265 and the fact that iOS devices have good quality HW encoding blocks which provide good compression at low bandwidths.
FaceTime stream is usually also low motion / change so it's possible to achieve very good compression even with basic quality. Although they still don't quite match the quality of AV1 powered Google Duo on very poor connections.
I'm not the person you're replying to, but I think the main reason you'd expect to see this on iOS is due to the additional Face ID sensors available on most of their devices. Presumably Apple could offer a version of this tech that features higher quality tracking and thus a more realistic and natural-looking final output.
The latest NPU can't deliver this kind of "deep compression" at 25fps, not even 10fps. But in the future they can just send the Facemesh vertices and streaming text of the speech (and classically compress non-speech audio, if it's even desired as most people are happy just using it to talk) so it will be less data than 1Kbps.
I think Face ID could be used to create the point map, instead of generating it from an image. They could also use Face ID to prevent the tech from malicious deep fakes, e.g. only allow people to use this feature when Face ID confirms the user and the manipulated photo are the same person.
Isn’t the point map supposedly encrypted on the Secure Enclave?[a] If it is, you can’t access it from the main CPU; you can only ask it to compare a provided one with the stored one.
[a]: I know the password is stored on it, but idk about the face mesh
> Remember when the industry were trying to compress CD Audio quality (aka 128Kbps MP3) down to 64Kbps?
128kb mp3 is good enough for most people most of the time, but it isn't CD quality. Having said that, 64kb Opus is almost or about as good as 128kb mp3.
I wonder how well these techniques can be applied to audio.
I doubt the general technique described here would have any applicability to audio, unless the idea is to dynamically create a text-to-speech model of the speaker and transmit only text. I don’t see that being practical any time soon.
> I would imagine Apple doing this with FaceTime soon.
And we'll notice it because somehow our and other people's faces in FaceTime will start to subtly convey strangely aggravating emotions, as current Memoji do.
Isn’t this just like Apple's animated emoji (Animoji), where your face is mapped to an emoji character? Except instead of a cartoon it’s mapped to your actual face.
Depends on how big the network is. You can process neural networks on CPU with acceptable speed in certain cases. And nicer smart phones tend to have a decent GPU.
That would still have major advantages. You could use the extra bandwidth to stabilize everything. Even on the best of networks you have buffer pauses. And on wifi or cell you have a lot of those. If this can be used to improve those ... you'll have improved quality greatly.
I wonder how weird it gets when you turn your head too much. This is very cool though - I was expecting to be able to tell a difference and maybe slip into uncanny valley territory but it looks good.
Big question though - is this just substituting the problem of not having good internet with not having a really fast nVidia graphics card?
If I understand correctly the sender just uses classic object detection/tracking. So the question would be how bad does it look if the receiver just tried to distort the image using that tracking data without having a trained model to smooth out the distortions.
Now the person you are speaking to is going to be n% (partially) emulated. n is going to increase in future. One day there will be a paid feature letting you emulate 100% of yourself to respond to video calls when you are not available. And finally, they will replace yourself even without you knowing, and even after you die.
That won't be possible unless full-brain emulation with live data sync gets invented.
Knowledge workers are all about the knowledge that they build up over time when working with a particular environment (be it an industry, a system, or a person/group of people). That knowledge is non-deterministically synthesised in the brain based on the experiences of that person, and being non-deterministic, no AI will come to the same conclusions about every item as this particular human would.
In that case, an emulated personality that is meant to make themselves available as your replacement will be an impostor. One that is less of an expert than yourself at best, and at worst one that is misinformed or misled on various issues (which in turn causes other people to be misled or misinformed).
Perhaps. In his Revelation Space universe, Alastair Reynolds categorises AI as Alphas (full brain simulations), Betas (non-conscious mimics which are based on a person’s behaviour over an unspecified but very long time period), and Gammas (which I read as being mere apps with voice interfaces).
If the goal is to make it seem like you're present when you aren’t, I can believe we’re only a few years away from Reynolds' Betas: a Markov chain can’t mimic a human well, but it can mimic a human; GPT-3 can do better, and while it still isn’t great, the main reason it feels like it might not be enough for public figures is how easy it is to get it to answer as if it were someone else rather than as The Right Honourable Sir Obvious Madeupname, MP for Oxbridge-upon-Wells who is paying for the chatbot to mimic him in particular.
This was part of what drove me off Twitter after the last election. I realized through my links and tweets that I had already provided enough training data for a bot that could not only sound like me, but share content from the future that I might be interested in.
At what resolution? And also, does the output actually resemble the original image? Are there examples with a non-uniform background? It would be nice if they provided more than just screenshots.
It's not uncommon to see video calls at 100kbps-150kbps, which is ~12-19KB/s, and this is for 7fps or so, including audio. So "per frame" that would be 1-2KB or so (more for key frames, less for P frames).
So they say it can be 0.1KB, so better than that... Exciting, if realistic.
Also, add on top audio, and packet overhead :-) there is at least 0.1KB overhead for sending the packet (bundle it with audio if possible!)
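To put a number on the packet overhead point above: a small sketch assuming one packet per frame and the roughly 0.1KB per-packet overhead mentioned there (a rough figure from the comment, not a measured one). The function names are made up, 1 KB is taken as 1000 bytes, and audio is ignored.

```python
def on_wire_kbps(payload_kb: float, overhead_kb: float, fps: float) -> float:
    """Bitrate when each frame travels in its own packet with fixed overhead."""
    return (payload_kb + overhead_kb) * fps * 8

def overhead_share(payload_kb: float, overhead_kb: float) -> float:
    """Fraction of each packet that is overhead rather than frame data."""
    return overhead_kb / (payload_kb + overhead_kb)

payload = 0.1165   # KB per frame, figure quoted from the article
overhead = 0.1     # KB per packet, rough assumption taken from the comment above
print(f"{on_wire_kbps(payload, overhead, 30):.2f} kbps at 30 fps")
print(f"{overhead_share(payload, overhead):.0%} of each packet is overhead")
```

At payloads this small the headers are nearly half of what goes over the wire, which is the argument for bundling frames with audio.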
Can't wait to see the bugs! GANs are famous for some... interesting reconstructions. And better still, nvidia will have no way to debug it since the model is essentially a black box.
People are extremely sensitive to subtleties in mouth articulation which facial landmark tracking tends to have trouble capturing. I question whether just a keyframe and facial landmarks are enough to generate convincing lip sync or gaze. I suspect that this is why the majority of the samples in the video are muted, which is a trick commonly used by facial performance capture researchers to hide bad lip sync results.
In Vernor Vinge's book A Fire Upon the Deep he describes how interstellar calls work in the future. In the lowest bandwidth tier you are watching an animated static 2D photo with text-to-speech. The book also touches on topics like translation, and the different spectra and senses that different species want from a call.
Looks neat. I wonder what the system requirements / license for the software will be?
There's a real network effect with things like codecs - unless some significant proportion of calls can use it, it'll remain a cool but obscure experiment.
I hope Nvidia have the foresight to release something that'll run on any hardware, and under a permissive license, but I suspect not.
The idea is out there already (it's basically deep fake tech, right?), and I'm sure it won't be so long before some open source version of it gets released. Nvidia would be wise to get out in front of that and at least have their brand associated with a widely used variant on the theme.
Great concept. With higher quality input, keyframe, point tracking, and ML model the output should improve significantly with similar bandwidth improvement. I think this could also smooth between dropped frames and offer higher bandwidth for the audio feed.
The issues are social. I would hope that the receiver is the one able to choose between original or AI stream, as I can understand some people being uncomfortable with the artifacts, gaze, expressions, and other edge cases. But when the quality is higher I could see a lot of people preferring this option as a default.
TL;DR: You can't reconstruct a generic video stream from a dozen detected facial data points.
Not sure if I understood your post correctly, but you're slightly misled here.
Watch the video. This is not a new general purpose video codec, it is basically Deep Fake - taking a still image (key frame) and superimposing detected facial expression/movement on this key frame (leaving out some technical details).
This is an improvement over (not-anymore-)state-of-the-art h264 since transmission of only a few coordinates mapping your facial expression is significantly less data to transmit than a delta of arbitrary video and periodic keyframes (again generously leaving out important technical details of modern video codecs). Trying to reconstruct a moving car/background/etc. from these facial expression key points will lead nowhere.
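To get a feel for how small a "few coordinates" payload is, here is a minimal sketch. The dozen keypoints and the 16-bit integer quantization are assumptions made for illustration, not details of Nvidia's actual wire format, and the helper names are hypothetical.

```python
import struct


def pack_keypoints(points):
    """Serialize (x, y) pixel coordinates as unsigned 16-bit little-endian integers."""
    flat = [coord for point in points for coord in point]
    return struct.pack(f"<{len(flat)}H", *flat)


def unpack_keypoints(payload):
    """Inverse of pack_keypoints: rebuild the list of (x, y) tuples."""
    count = len(payload) // 2  # two bytes per coordinate
    flat = struct.unpack(f"<{count}H", payload)
    return list(zip(flat[0::2], flat[1::2]))


# Twelve made-up keypoints standing in for detected facial landmarks.
keypoints = [(100 + 10 * i, 200 + 5 * i) for i in range(12)]
payload = pack_keypoints(keypoints)
print(len(payload), "bytes per frame")
```

Under these assumptions each frame costs 48 bytes before packet overhead, compared with the kilobyte or more a conventional codec spends on a frame.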
> Called “Free View,” this would allow someone who has a separate camera off-screen to seemingly keep eye contact with those on a video call.
Am I the only one who thinks eye contact on video calls feels creepy? I think I would prefer this feature to remove eye contact on video calls rather than add it.
Apparently researchers have measured this, and for people meeting in person it's normal to maintain eye contact 30%-60% of the time in a conversation [1] with each contact lasting ~3 seconds [2]
So a system that maintained eye contact continuously would indeed risk looking creepy!
Maybe but you're discounting the energy required to send those 99% of pixels over the network. Just because it's not your CPU doesn't mean it's not doing work.
> So they're trading bandwidth for CPU load at either end
Given it's Nvidia I would imagine it's more likely going to be GPU load rather than CPU load. Don't underestimate the current computational overhead of existing lossy video compression.
My tests already take twice as long to run when I’m on a Google hangout. For my case, I’d honestly rather use more bandwidth and do minimal local processing. If my machine is slowed down any more then I might have to stop working completely and focus on the meeting I’m in!
Admittedly I have no idea how resource-intensive this image processing is, but it's possible that it's less of a hit than the fact that all video call programs today run on HTML+JS. If Hangouts was a native program, there would be plenty of power to spare.
Modern machines aren't really designed for real computation like running tests or compiling stuff. They're made to render webpages and load up games. Run your computation tasks on a VM in the cloud...
Are remote development environments common?
There’s a CI server doing the heavy lifting but I still run tests constantly on a docker box while developing locally.
I develop entirely remotely. Save the files on a network mounted filesystem, run shell commands over SSH/mosh, etc.
The server is in the same city as me to avoid excessive latency.
The benefits of being able to develop anywhere (including from a phone in a pinch), and being able to add extra CPUs and RAM at the click of a button outweigh the need for a network connection for my use cases.
Essentially deepfaking yourself. There's no way to know that the nuances of the emotions conveyed will be reliably passed on, as everything but the face lines is hallucinated. And then, it's so lifelike that you have no plausible deniability.
Is it just me or do the videos no longer look natural? I feel like I see highly non-linear movements (parts of the video moving when they shouldn't, or vice-versa), and facial expressions don't really look quite the same.
I tried to pay attention to what you and others flagged (parts not moving, facial expressions, shoulders, etc.), but I cannot spot anything out of place.
That said, looking at other comments here, you're definitely not the only one
Fair, in that they aren't perfect photorealism, but if their comparisons with the regular codecs and techniques are correct, I'd take their modelled faces over the comparable level of digital artifacts for the same bandwidth.
After all, it doesn't need to be said that when you're on a regular videoconferencing call and bandwidth starts to suffer, the resulting images don't really look anything like a photorealistic person either. I think this is actually a really good use of NN.
The thing is you only want to make this trade-off when the bandwidth is actually starting to suffer. It'd be nice if there was a nice way to make this adaptive and use NNs only when throughput is low, but the nonlinearity of the distortions makes me think this would be really hard. [1] I know what I don't want is for a normal conversation in an uncongested network to look unnatural or for facial expressions to get distorted unnecessarily.
Edit: [1] I meant to say doing a mixture of these (with the NN image as the "base", with H.264 to improve accuracy) seems really hard. On the other hand, just a hard switch from H.264 to NN when quality degrades is probably quite practical?
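A hard switch like that could be sketched roughly as below. The thresholds, the mode names, and the class itself are all hypothetical; the two separate thresholds (hysteresis) are my own addition, so the stream doesn't flip back and forth when throughput hovers around a single cutoff.

```python
from dataclasses import dataclass


@dataclass
class CodecSwitch:
    low_kbps: float = 50.0    # hypothetical: below this, fall back to the NN stream
    high_kbps: float = 100.0  # hypothetical: above this, return to H.264
    mode: str = "h264"

    def update(self, measured_kbps: float) -> str:
        """Pick the codec for the next interval from the measured throughput."""
        if self.mode == "h264" and measured_kbps < self.low_kbps:
            self.mode = "nn"
        elif self.mode == "nn" and measured_kbps > self.high_kbps:
            self.mode = "h264"
        return self.mode


switch = CodecSwitch()
history = [switch.update(kbps) for kbps in (300, 80, 40, 70, 90, 120)]
print(history)
```

In this run the stream stays on H.264 until throughput drops to 40 kbps, then stays on the NN stream until it recovers past 100 kbps.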
Perhaps it's just me, and my philosophical bent, but I can actually see coming at it from the exact opposite end.
I don't want bandwidth and things spent/wasted without it providing a significant benefit (I'm probably one of those 1080p/720p is good enough for most things type guys).
I definitely don't want work making large bandwidth or resource claims on my connection when I'm working at home. And if any of this remote working has taught me anything, it's that most of my colleagues don't have steady/reliable tech or connections, so I'd almost want it used pre-emptively as a default so we can spend the rest of those resources on robustness or other qualities. (I realise of course that at the moment none of them have high-grade Nvidia graphics cards, but I'm talking hypothetically in the far off future).
In short, I want a world where the cost to benefit ratio of things is orders of magnitude larger, because things like this let us spend network/resources on things which matter.
Yes, when I'm calling my parent/grandparent one on one I might want to upgrade the signal, but I don't need to see a random colleague's face in all its HD glory, or remote people when I have no idea what they look or sound like anyway (I believe that's also been one of the findings with deepfakes, that you don't notice the eeriness/falseness as much if it's a face that you don't have pre-determined knowledge of).
Adaptive use would be awesome. Especially on conference calls at work there is always that one (or ten) person whose connection is absolutely awful and looks like a giant pixelhead.
This is perceptible to some degree, but in my opinion a highly practical way of applying this technology is to intentionally lower/shift the quality on the fly just a bit so as to "blend" the layer boundaries where the stitching happens.
That way the doctoring isn't apparent, you still benefit from the massive bandwidth savings (which I consider most important anyway), and it appears more believable in the real-world context of variable bitrates.
How soon before this incorporates GPT3 and guesses what we were going to say anyway, so we no longer need to say it? Or doesn't quite guess right, and says something that gets you fired!
Do we actually need GPUs to run this? There is no training involved, only inference, and CPUs (or low-end GPUs) should comfortably run the workload, at least for a couple of faces.
When will we get VR goggles + this tech for couples so we can shape shift during sex, edit out the VR goggles and explore scenes together while still viewing our partner?
Perhaps they are overdoing it (if you have a hammer, ...). I would think that the most useful way to use AI in this context would be to predict who is going to speak next.
You can get pretty good network speeds in space. Some Ka-band satellites (SpaceX Starlink for example) have gigabit networks between them. All you need to extend that out to Mars would be high gain directional antennas on things. If you used lasers instead it could be even higher bandwidth.
I have to disagree with you there. I owned a satellite uplink/downlink in Afghanistan and it wasn’t cheap per megabit (supply/demand issues) and I highly doubt you’d be able to get a gigabit connection to Mars, especially when it’s on the other side of the sun. Also, the earth is rather noisy, so it may be easier to get a fast connection from Mars; the SNR from earth will be much lower.
I guess I'm also talking about the theoretical maximums in ideal conditions as well, which is sort of cheating on my part. There would be times when you can't get the optimal speed if, like you say, the sun is in the way.
Watching the video it no longer feels like you're looking at a real person, but instead just another NPC. It no longer feels as personal. The last thing remote relationships need is more impersonality. I hope this is used only when it's needed.
I think it would be easier to see if they had a comparison between a high bit rate traditionally compressed video and their neural net compressed video. To me the facial shapes and movement seem just a little bit wrong like you’re looking at a 3D animation of a person instead of a video.
They do this using GANs so I don’t think it is compressed in any way. Instead, the points on the face are sent across the net and the face is actually generated. I’m sure it can get better with time. There is a site thispersondoesnotexist.com that uses the same tech but looks real.
> Instead, the points on the face are sent across the net and the face is actually generated.
This is not fundamentally different from other lossy compression algorithms.
The bits that are missing are reconstructed using a bias built into the compression algorithm. Deep learning based algorithms simply have a more realistic bias, so the artifacts that are introduced to replace the missing bits are less noticeable.
I want to see if people notice this effect when they don't know it's artificially generated. I have a feeling that the uncanniness is at least partly because you know it's fake.
Sidenote: I always had this idea for "video compression", though I'm by no means an expert in compression.
1) Take something like the top 10 most "VISUALLY DIVERSE" movies (imagine Forrest Gump (slow drama) vs Toy Story (animation) vs Rambo (fast-paced visuals))
2) "Encode/compress" (I know these terms are not interchangeable) the new movie as a "diff/ref" against the "most similar" movie from step 1
2b) The "diff/ref" can take many forms, e.g. a "sliding window" over section x of y frames.
3) The "end-user" or "destination" has these "10 master movies" locally on disk and, together with the compressed data, can reconstruct the original "frame" or "movie" from the compression and the local movie on disk.
TL;DR
Try to compress a new movie by saying "the top corner of frames 1-120" is very similar to MasterMovie-2-FrameXYZ-Frame-ABC
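The general shape of this idea (both ends hold shared reference data, only the differences travel) already exists for byte streams as preset-dictionary compression. Below is a toy sketch using Python's zlib with its `zdict` parameter. The "master" bytes are random stand-ins, not real video, and deflate only looks back 32 KB, so this illustrates the principle rather than a workable movie codec.

```python
import random
import zlib


def compress_with_reference(data: bytes, reference: bytes) -> bytes:
    """Compress `data`, letting deflate point back into the shared `reference`."""
    compressor = zlib.compressobj(level=9, zdict=reference)
    return compressor.compress(data) + compressor.flush()


def decompress_with_reference(blob: bytes, reference: bytes) -> bytes:
    """Decompress a blob made by compress_with_reference with the same reference."""
    decompressor = zlib.decompressobj(zdict=reference)
    return decompressor.decompress(blob) + decompressor.flush()


# Stand-in for a "master movie" that sender and receiver both already have.
rng = random.Random(0)
master = bytes(rng.getrandbits(8) for _ in range(4096))

# A "new movie" that mostly reuses stretches of the master.
new_movie = master[1000:3000] + b"new scene" + master[:500]

plain = zlib.compress(new_movie, 9)
with_ref = compress_with_reference(new_movie, master)
print(len(new_movie), len(plain), len(with_ref))
```

Because the reused stretches are random noise, plain zlib cannot shrink them, while the version with the shared reference only has to encode pointers into the master plus the few new bytes.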