Hi, I'm a developer at NexOptic[0], a company that was deeply inspired by this paper when it was first published. We had early success replicating the results on our own, ran with it, and extended it into our own product line under our ALIIS brand of AI-powered solutions.
For those curious, our current approach differs from the author's implementation in some significant ways, such as performing our denoising and enhancement on a raw Bayer -> raw Bayer basis, with a separate pipeline for tone mapping, white balance, and HDR enhancement. We also explored a fair number of different CNN architectures and concluded that a heavily mixed multi-resolution layering approach produces superior results.
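For illustration only (the details of our network are proprietary; this is a generic sketch, not our implementation): the usual first step in feeding a raw -> raw model is packing the Bayer mosaic into per-color planes, since each plane is spatially coherent while the interleaved mosaic is not.

```python
import numpy as np

def pack_bayer(raw):
    """Pack an RGGB Bayer mosaic (H, W) into a (H/2, W/2, 4) plane stack."""
    return np.stack([raw[0::2, 0::2],   # R
                     raw[0::2, 1::2],   # G1
                     raw[1::2, 0::2],   # G2
                     raw[1::2, 1::2]],  # B
                    axis=-1)

def unpack_bayer(planes):
    """Inverse of pack_bayer: rebuild the (H, W) RGGB mosaic."""
    h, w, _ = planes.shape
    raw = np.empty((h * 2, w * 2), dtype=planes.dtype)
    raw[0::2, 0::2] = planes[..., 0]
    raw[0::2, 1::2] = planes[..., 1]
    raw[1::2, 0::2] = planes[..., 2]
    raw[1::2, 1::2] = planes[..., 3]
    return raw
```

The denoiser then operates on the packed planes and the result is unpacked back to a mosaic, so tone mapping, white balance, and HDR can run as a separate downstream pipeline on clean raw data.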
As other commenters have pointed out, the most interesting part is really coming to terms with the fact that, as war1025 put it, "The message has an entropy limit, but the message isn't the whole dataset." It is remarkable what can be accomplished with even extraordinarily noisy information, as long as one has an extremely "knowledge packed" prior.
If anyone has any questions about our research in this space, please feel free to ask.
It would be really cool if you could feed the network a photo taken with flash, which it could use to gather more information, but then have it recreate a flash-free photo from the non-flash raw.
Often flash is not the look people are going for, but they would be okay with the flash firing in order to improve the non-flash photo.
Absolutely! We recently rebranded our AI solutions from ALLIS (Advanced Low Light Imaging Solution) to ALIIS (All Light Intelligent Imaging Solution) specifically because we are beginning to branch out to handle use cases like this one!
As a proof of concept that this task can be tackled directly, a quick search brought up "DeepFlash: Turning a Flash Selfie into a Studio Portrait"[0]
Beyond denoising, we are already running experiments with very promising results on haze, lens flare, and reflection removal; super resolution; region adaptive white balancing; single exposure HDR; and a fair bit more.
One of the other cooler things we are doing is putting together a unified SDK where our algorithms and neural nets will be able to run pretty much anywhere, on any hardware, using transparent backend switching (e.g. CPU, GPU, TPU, NPU, DSP, other accelerator ASICs, etc.).
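As a purely illustrative sketch (the backend names and probe functions below are made up for this comment, not our SDK's API), transparent backend switching at its simplest boils down to probing runtimes in priority order and dispatching to the first one that answers:

```python
def _cpu_available():
    # A CPU path is always present as the universal fallback.
    return True

# Probes ordered from most to least preferred. Real probes would
# query vendor drivers (NPU), CUDA/OpenCL/Metal (GPU), and so on;
# here they are stubbed out to always report "unavailable".
BACKENDS = [
    ("npu", lambda: False),
    ("gpu", lambda: False),
    ("cpu", _cpu_available),
]

def select_backend():
    """Return the name of the first available backend."""
    for name, available in BACKENDS:
        if available():
            return name
    raise RuntimeError("no usable backend")
```

The application code calls one inference API and never needs to know which accelerator actually ran the network.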
Before reading your reply to OP's comment I got to thinking about how the super-resolution process and flash photography might interact (https://news.ycombinator.com/item?id=22905317). I get the impression you left the point I got to a long time ago :)
The way I mistakenly initially parsed this comment gave rise to a potentially-dumb idea/question:
What would happen if you
- begin capturing video (unsure of fps) on a phone-quality sensor in a near-dark environment
- pulse the phone's flash LED(s) like you're taking a photo
- do super-resolution on the resulting video to extract a photo...
- ...while factoring in the decay in brightness/saturation in consecutive video frames produced by the flash pulse?
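To make the idea concrete, here's a rough (purely hypothetical) NumPy sketch of the per-frame exposure compensation such a pipeline would need before it could align and merge frames whose flash illumination decays between captures:

```python
import numpy as np

def normalize_burst(frames, eps=1e-6):
    """Scale each frame of a flash-decay burst to a common mean brightness.

    frames: (N, H, W) float array in [0, 1]. This crude global gain is a
    stand-in for the per-frame exposure compensation a real multi-frame
    super-resolution pipeline would need before aligning frames captured
    under changing illumination.
    """
    means = frames.mean(axis=(1, 2), keepdims=True)
    target = means.max()  # normalize toward the brightest frame
    return frames * (target / (means + eps))
```

Whether super-resolution can then actually exploit the decay sequence (rather than merely tolerate it) is the open question.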
I vaguely recall reading somewhere that oversaturated photos have more signal in them and are easier to fix than undersaturated. Hmm.
IIRC super-resolution worked with 30fps source video for better quality; I wonder if 60fps or 120fps source video would produce better brightness decay data, or whether super-resolution could actually help extract more signal out of the decay sequence too.
On the other hand, I'm not sure if super-resolution fundamentally requires largely consistent brightness in order to work as well as it does. :/
Perhaps individual networks could be trained/tuned to specific slices/windows of the brightness gradient. I also wonder if it would be useful to factor the super-resolution process into each of the brightness-specific stages, or just to do it at the end.
For the most part, our effort has been focused on single exposure image enhancement, however we are beginning to use recurrent models to improve quality when video information is available.
Nonetheless, it's kind of a neat idea, so I tried testing the feasibility of it. I set up a recent flagship phone that claims 960fps super-slow-motion video capture next to another phone running a strobe app at 12Hz with a short delay between pulses.
There are definitely a few frames where the LED is at an intermediate brightness; however, teasing out the exact timing relationship between the flash and the camera may prove difficult to synchronize correctly.
As for over-saturated images having more signal... although the PSNR calculation may give you a better number, in practice a region that is over-saturated is just a blob of 1s on the image (assuming float pixel values in [0, 1]), and there is no information there to extract. With a black level near but not at 0, we've found there is often more information hidden in the 'dark noise' than the human eye alone can discern.
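A quick way to see this (a toy NumPy simulation, not real sensor data): clip a noisy gradient as a sensor would, and the saturated region collapses to a constant with zero variance, while the dark region still carries structure under the noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# A smooth gradient is the "scene"; the top half exceeds full well (1.0).
scene = np.linspace(0.0, 2.0, 1000)
noisy = scene + rng.normal(0.0, 0.02, 1000)
captured = np.clip(noisy, 0.0, 1.0)

# In the saturated region every sample is exactly 1.0: zero variance,
# so there is nothing left for any prior to latch onto.
saturated = captured[scene > 1.2]

# In the dark end the signal is buried in noise but still present.
dark = captured[scene < 0.1]

print(saturated.std())   # 0.0
print(dark.std() > 0)    # True
```

No amount of modeling recovers the clipped region, whereas a strong prior can still pull structure out of the dark one.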
Wow, cool, you actually tested it! And an effective test too.
Stepping back and forth throughout the frames (using mpv), the flash clearly enhances several spots of localized brightness where contrast pops out into clear relief.
The effect is clearest at the very bottom of the image which goes from "shadow blob" to "adequately discernible", but I think the area just above that (the 3rd vertical quarter of the image) is most interesting; the detail visible in frames 24-29 (immediately before 00:00:01 / 30.030fps) is excellent, and that's with the flash LED at peak brightness.
Flash synchronization would be effectively impossible to achieve (the camera would need to stream LED status information inside each frame), and even with "LED is on" information available it may provide no net gain. The exact moment the hardware says "LED is off" will not necessarily correspond to the moment the light actually decays to zero: at 1/960 ≈ 1.04 milliseconds per frame, the video suggests the light takes about 2 frames, or ~2.08 milliseconds, to decay, and that figure will never be the same across the arbitrarily different environments the flash illuminates. I can't help but wonder if calibration references for everything from Vantablack to mirrors would be needed... for each camera sensor... and then there would be the problem of figuring out which reference(s?) to select.
Staring at the video frames some more, two ideas come to mind: 1) analyze all the frames to identify areas of significant difference in brightness, then 2) for each (perhaps non-rectangular) region of difference, figure out the "best" source reference for that specific region. As an example, I'd generally use frame 13 for most of the image, and frame 44 or so (out of many, many possible candidates) for the bits that, as you say, become float 1.00 :). Obviously a nontrivial amount of normalization would then be needed.
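A rough NumPy sketch of ideas (1) and (2) combined (entirely hypothetical; a real implementation would align frames and blend across region boundaries rather than taking a hard argmax):

```python
import numpy as np

def best_frame_per_region(frames, block=8, sat=0.999):
    """For each block x block tile, pick the burst frame with the highest
    local contrast (std) among frames not saturated in that tile.

    frames: (N, H, W) floats in [0, 1], H and W divisible by `block`.
    Returns an (H//block, W//block) array of source-frame indices.
    """
    n, h, w = frames.shape
    tiles = frames.reshape(n, h // block, block, w // block, block)
    std = tiles.std(axis=(2, 4))                      # (N, H/b, W/b)
    clipped = (tiles >= sat).mean(axis=(2, 4)) > 0.5  # mostly-saturated tiles
    std = np.where(clipped, -1.0, std)                # disqualify them
    return std.argmax(axis=0)
```

Using per-tile standard deviation as the "best contrast" score is a stand-in; something perceptual (butteraugli-style) would presumably do better.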
I'm not aware of how you'd do either of these neurally :) but the idea for (1) came from https://en.wikipedia.org/wiki/Seam_carving (although just basic edge detection may be more correct for this scenario), while the idea for (2) came from https://github.com/google/butteraugli, which "estimates the psychovisual similarity of two images"; perhaps there's something out there that can identify "best contrast"? I'm not sure.
Trivial aside: I wondered why mpv kept saying "Inserting rotation filter." and also why the frame numbers appeared sideways. Then I realized the video has rotation metadata in it, presumably so the device doesn't need to do landscape-to-portrait frame buffering at 960fps (heh). I then realized the left-to-right rolling shutter effect I was seeing was actually a bottom-to-top rolling shutter. I... think that's unusual? I'm curious - after Googling then reading (or, more accurately, digging signal out of) https://www.androidauthority.com/real-960fps-super-slow-moti... - was the device an Xperia 1?
(And just to write it down for future reference: --vf 'drawtext=fontcolor=white:fontsize=100:text="%{n}"' adds frame numbers to mpv. Yay.)
Sounds like you have taken this pretty far. Do you have any example outputs? The only one I found via your website was a PDF with a low-res image and no context.
Sure, we have a short deck[0] that gives an intro to our noise reduction, and here is a folder[1] showing a calibration target we captured with an actual camera (20 ms, f/22) in low-light conditions: original, 100x gain, and 100x gain + ALIIS.
We also have some more raw data[2] where the original Bayer data is available as .npy files with 40 dB of analog gain applied; however, I think the calibration targets show off what we are able to accomplish more dramatically. Finally, we have a short YouTube video[3] that shows how it works when applied to video.
[0] https://nexoptic.com/artificialintelligence/