Wednesday, April 25, 2018

A tale of multiple BC7 encoders

To get going with BC7, I found an open source block decompressor and first created a BC7 struct (using bitfields) to create a single mode 6 block. I filled in the fields for a simple mode 6 block, called the decompressor, and examined the output. (Turns out this decompressor has some bugs which I've reported to the author.)

For my first encoder I started in C++ and wrote a straightforward but feature complete encoder. It supports exhaustive partition scanning or it can estimate a single partition to evaluate which is around 12x faster. It's also pretty generic, so you can call the endpoint optimizer with any number of pixels (not just 16). About the only thing good you can say about this encoder is that it's high quality and easy to follow. Its performance is dreadful (even worse then DirectXTex, believe it or not) unless you force it to evaluate only a single partition per mode in which case it's ~2.6x faster than DirectXTex.

I then wrote a new, simpler C++ encoder, then quickly ported this to C. For this simpler encoder, I focused on just the modes that mattered most to performance vs. quality. The endpoint optimizer was less generic and more focused, and I removed some less useful features in perceptual mode. I then added in additional modes to this encoder to bring its quality up vs. Intel's ispc_texcomp. After this, I took the C encoder and got it building with iscp, then profiled and optimized this over a period of a week or so.

1. Ridiculously slow C++ encoder, evaluate all partitions:
46286.7 secs, 51.56 Luma PSNR

2. Ridiculously slow C++ encoder, evaluate the single best partition only:
4615.5 secs, 51.35 Luma PSNR

3. Faster C encoder:
458.7 secs, 50.77 Luma PSNR

4. Vectorized encoder - C encoder ported to ispc then further optimized and enhanced:
76.4 secs, 51.16 Luma PSNR

Notice each time I got roughly a 6-10x performance increase at similar quality. The iscp version is currently 6x faster than the C version. To get it there I had to examine the code generated for all the inner loops and tweak the iscp code for decent code generation. Without doing this it was only around 2x faster.

I used the previous encoder as a reference design during each step of this optimization process, which was invaluable for checking for regressions.

Some things about BC7 that are commonly believed but aren't really true, and other things I've learned along the way (along with some light ranting but I mean well):

- "BC7 has this massive search space - it takes forever to encode!" Actually, with a properly written encoder it's not that big of a deal. You can very quickly estimate which partition(s) to examine (which is an extremely vectorizable operation), you can drop half the modes for alpha textures, mode 7 is useless with opaque textures, the 3 subset modes can be dropped unless you want absolute max quality, and many fast endpoint optimization methods from BC1 are usable with BC7. BC1-style cluster fit isn't a good match for BC7 (because it has 3 and 4 bit indices), but you can easily approximate it.

For a fast opaque-only encoder, my statistics show that supporting just modes 1 and 6, or 0/1/6, results in reasonably high quality without noticeable block artifacts.

- CPU encoders are too slow, so let's use the GPU!
Actually, a properly vectorized CPU encoder can be extremely fast. There is nothing special about BC7 that requires compute shaders if you do it correctly. Using gradient descent or calling rand() in your encoder's inner loop are red flags that you are going off the rails. The search space is massive but most of it is useless.

You need the right algorithms, and strong heuristics matter in GPU texture compression. Write a BC1 encoder that can compete against squish first, then tackle BC7.

- Graphics dev BC7 folklore: "3 subset modes are useless!"
No, they aren't. iscp_texcomp and my encoder make heavy use of these modes. The perf reduction for checking the 3 subset modes is minimal but the PSNR gain from using them is high:

Across my 31 texture corpus, all modes (but 7) gets 46.72 dB, disabling the two 3 subset modes I get 45.96 dB. This is a significant difference. Some timings: 487 secs (3 subsets enabled) vs. 476 secs (3 subsets disabled). In another test at lower quality: 215.9 secs (3 subsets enabled) vs. 199.5 secs (disabled).

Of the publicly available BC7 encoders I've tested or looked at so far none of them have strong or properly working mode 0 encoders. Mode 0 is particularly tough to handle correctly because of its 4-bit/component endpoints with pbits. Most encoders mess up their pbit handling so mode 0 appears much weaker than it really is. 

- Excluding iscp_texcomp, I've been amazed at how weak the open source CPU/GPU BC7 encoders are. Every single encoder I've looked at except for iscp_texcomp has pbit handling issues or bugs holding them back. DirectXTex, Volition, NVTT - all use pbit voting/parity without properly rounding the component value, which is incorrect and introduces significant error. DirectXTex had outright pbit bugs causing it to return lower quality textures when 3 subset modes were enabled.

Note to encoder authors: You can't estimate error for a trial solution without factoring in the pbits, or your encoder will make very bad choices that actually increase the solution error. Those LSB's are valuable, and you can't flip them without adjusting the components too.

- None of the available encoders support perceptual colorspace metrics (such as computing weighted error in YCbCr space like etc2comp does), which is hugely valuable in BC7. You can get up to 2 dB gain from switching to perceptual metrics, which means your encoder doesn't have to work nearly as hard to compete against a non-perceptual encoder.

Before you write a BC1 or BC7 encoder, you should probably write a simple JPEG compressor first to learn all about chroma subsampling, etc.

- SSIM and PSNR are highly correlated for basic 4x4 block compression. I'll blog about this soon. I test with both, but honestly it's just not valuable to do so because they are so correlated and SSIM is more expensive to compute.

- Whoever wrote iscp_texcomp at Intel should be given a big bonus because it's overall very good. Without it the BC7/BC6H situation would be pretty bad.

No comments:

Post a Comment