Monday, June 19, 2017

Basis's RDO DXTc compression API

This is a work in progress, but here's the API to the new rate distortion optimizing DXTc codec I've been working on for Basis. There's only one function (excluding basis_get_version()): basis_rdo_dxt_encode(). You call it with some encoding parameters and an array of input images (or "slices"), and it gives you back a blob of DXTc blocks which you then feed to any LZ codec like zlib, zstd, LZHAM, Oodle, etc.

The output DXTc blocks are organized in simple raster order, with slice 0's blocks first, then slice 1's, etc. The slices could be mipmap levels, or cubemap faces, etc. For highest compression, it's very important to feed the output blocks to the LZ codec in the order that this function gives them back to you.
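
As a rough illustration of that layout, here is a minimal sketch for sizing the output buffer and finding where each slice's blocks start. It is not part of the Basis API: slice_dims, blocks_in_slice and slice_offsets are hypothetical helpers, and the sketch assumes slice dimensions are rounded up to whole 4x4 blocks. The block sizes are the standard ones (8 bytes for DXT1 and DXT5A, 16 bytes for DXT5 and DXN).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper type: just the pixel dimensions of one slice.
struct slice_dims { uint32_t width, height; };

// Number of 4x4 blocks covering a slice (assumption: partial blocks round up).
inline size_t blocks_in_slice(const slice_dims &s)
{
   return static_cast<size_t>((s.width + 3) / 4) * ((s.height + 3) / 4);
}

// Returns the byte offset of each slice's first block in the output blob,
// in slice order, with the total blob size appended as the last element.
inline std::vector<size_t> slice_offsets(const std::vector<slice_dims> &slices, uint32_t block_bytes)
{
   assert(block_bytes == 8 || block_bytes == 16);

   std::vector<size_t> offsets;
   size_t cur = 0;
   for (const slice_dims &s : slices)
   {
      offsets.push_back(cur);
      cur += blocks_in_slice(s) * block_bytes;
   }
   offsets.push_back(cur);
   return offsets;
}
```

For a DXT1 mip chain, the last element of the returned vector would be the value to allocate and pass as output_blocks_size_in_bytes, under the assumptions above.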

On my near-term TODO list is to allow the user to specify custom per-channel weightings, and to add more color distance functions. Right now it supports either uniform weights, or a custom model for sRGB colorspace photos/textures. Also, I may expose optional per-slice weightings (for mipmaps).

I'm shipping the first version (as a Windows DLL) tomorrow.

// File: basis_rdo_dxt_public.h
#pragma once

#include <stdlib.h>
#include <string.h>

#ifdef BASIS_DLL_EXPORTS
#define BASIS_DLL_EXPORT __declspec(dllexport)  
#else  
#define BASIS_DLL_EXPORT
#endif  

#if defined(_MSC_VER)
#define BASIS_CDECL __cdecl
#else
#define BASIS_CDECL
#endif

namespace basis
{
   const int BASIS_VERSION = 0x0100;

   typedef unsigned int basis_uint;
   typedef basis_uint rdo_dxt_bool;

   enum rdo_dxt_format
   {
      cRDO_DXT1 = 0,
      cRDO_DXT5,
      cRDO_DXN,
      cRDO_DXT5A,

      cRDO_DXT_FORCE_DWORD = 0xFFFFFFFF
   };

   const basis_uint RDO_DXT_STRUCT_VERSION = 0xABCD0001;

   const basis_uint RDO_QUALITY_MIN = 1;
   const basis_uint RDO_QUALITY_MAX = 255;

   struct rdo_dxt_params
   {
      basis_uint m_struct_size;
      basis_uint m_struct_version;

      rdo_dxt_format m_format;

      basis_uint m_quality;

      basis_uint m_alpha_component_indices[2];

      basis_uint m_lz_max_match_dist;
      basis_uint m_output_block_size;

      basis_uint m_num_color_endpoint_clusters;
      basis_uint m_num_color_selector_clusters;

      basis_uint m_num_alpha_endpoint_clusters;
      basis_uint m_num_alpha_selector_clusters;

      float m_l;
      float m_selector_rdo_quality_threshold;
      float m_endpoint_selector_rdo_quality_threshold;

      float m_selector_rdo_quality_threshold_low;
      float m_endpoint_selector_rdo_quality_threshold_low;

      float m_block_max_y_std_dev_rdo_quality_scaler;

      basis_uint m_endpoint_refinement_steps;
      basis_uint m_selector_refinement_steps;
      basis_uint m_final_block_refinement_steps;

      float m_adaptive_tile_color_psnr_derating;
      float m_adaptive_tile_alpha_psnr_derating;

      basis_uint m_endpoint_rdo_max_search_distance;

      rdo_dxt_bool m_optimize_final_endpoint_clusters;
      rdo_dxt_bool m_optimize_final_selector_clusters;

      rdo_dxt_bool m_srgb_metrics;
      rdo_dxt_bool m_debugging;
      rdo_dxt_bool m_debug_output;
      rdo_dxt_bool m_hierarchical_mode;
      rdo_dxt_bool m_multithreaded;
   };

   inline void rdo_dxt_params_set_to_defaults(rdo_dxt_params *p)
   {
      memset(p, 0, sizeof(rdo_dxt_params));

      p->m_struct_size = sizeof(rdo_dxt_params);
      p->m_struct_version = RDO_DXT_STRUCT_VERSION;

      p->m_format = cRDO_DXT1;

      p->m_quality = 128;

      p->m_alpha_component_indices[0] = 0;
      p->m_alpha_component_indices[1] = 1;

      p->m_l = .001f;

      p->m_selector_rdo_quality_threshold = 1.75f;
      p->m_endpoint_selector_rdo_quality_threshold = 1.75f;

      p->m_selector_rdo_quality_threshold_low = 1.3f;
      p->m_endpoint_selector_rdo_quality_threshold_low = 1.3f;

      p->m_block_max_y_std_dev_rdo_quality_scaler = 8.0f;

      p->m_lz_max_match_dist = 32768;
      p->m_output_block_size = 8;

      p->m_endpoint_refinement_steps = 2;
      p->m_selector_refinement_steps = 2;
      p->m_final_block_refinement_steps = 1;

      p->m_adaptive_tile_color_psnr_derating = 1.5f;
      p->m_adaptive_tile_alpha_psnr_derating = 1.5f;
      p->m_endpoint_rdo_max_search_distance = 8;

      p->m_optimize_final_endpoint_clusters = true;
      p->m_optimize_final_selector_clusters = true;

      p->m_hierarchical_mode = true;

      p->m_multithreaded = true;
   }

   const basis_uint RDO_DXT_MAX_IMAGE_DIMENSION = 16384;

   struct rdo_dxt_slice_desc
   {
      // Pixel dimensions of this slice. A slice may be a mipmap level, a cubemap face, a video frame, or whatever.
      basis_uint m_image_width;
      basis_uint m_image_height;
      basis_uint m_image_pitch_in_pixels;

      // Pointer to 32-bit raster image. Format in memory: RGBA (R is first byte, A is last)
      const void *m_pImage_pixels;
   };

} // namespace basis

extern "C" BASIS_DLL_EXPORT basis::basis_uint BASIS_CDECL basis_get_version();

extern "C" BASIS_DLL_EXPORT bool BASIS_CDECL basis_rdo_dxt_encode(
   const basis::rdo_dxt_params *pEncoder_params,
   basis::basis_uint total_input_image_slices, const basis::rdo_dxt_slice_desc *pInput_image_slices,
   void *pOutput_blocks, basis::basis_uint output_blocks_size_in_bytes);

Sunday, April 30, 2017

Binomial stuff

One MS employee recently said to Stephanie (my partner) that (paraphrasing) "your company isn't stable and can't possibly last". My reply: We've been in business for over a year now, and our business is just a natural extension and continuation of our careers. I've been programming since 1985, and developing commercial data compression and other software since 1993. I've been doing this for a while and I'm not going to stop anytime soon.

Having my own small consulting company vs. just working full-time for a single corporation is just a natural next step to me. One thing I really liked about working at Valve was the ability to wheel my desk to virtually anywhere in the company and start adding value. I can now "wheel my desk" to anywhere in the world, and the freedom this gives us is amazing.

Binomial is a self-funded startup. We work on both development contracts and our current product (Basis). We haven't taken any investment money. Our "runway" is basically infinite.

Wednesday, April 19, 2017

Basis status

Just a small update. We've put like 99% of our effort into ETC1 and ETC1+DXT1 over the previous 5-6 months. Our ETC1 encoder supports RDO and an intermediate format, and has shipped on OSX/Linux/Windows. I've been modifying the ETC1 encoder to also support DXT1 (for our universal format) over the previous few weeks.

Our ETC1 encoder was written almost from scratch. The next major step is to roll all the improvements and things I've learned while implementing our ETC1 encoder back into our DXT-specific encoder. crunch's support for DXT has a bunch of deficiencies which hurt ratio. (Roy Eltham and Fabian Giesen have recently pointed this issue out to me. I've actually been aware of inefficiencies in crunch's codebook generator for a few months, since working on the new codebook generator for ETC1.) I'm definitely fixing this problem (and others!) in Basis.

Saturday, March 18, 2017

Probiotic yogurt making

Just got back from a wonderful business trip to Portland, Maine, visiting ForeFlight. Making more probiotic yogurt tonight because I ate up almost my entire stock on the trip. (It didn't help that we got stuck in a blizzard while there, but that turned out to be really fun.) The food in Portland is amazing!

The pot of boiling water is for sterilizing the growth medium, in this case 2% organic grassfed milk+raw sugar. After the milk is boiled (repasteurized) and cooled I inoculate it using a 10-strain probiotic blend from Safeway. I tried a bunch of probiotics before finding this particular brand, which seems magical for me. Without this extremely strong yogurt I have no idea how long it would have taken my gut to heal after the antibiotics I had to take in 2015.

Yogurt making like this is tricky. Early on, around 30% of my attempts failed in spectacular ways. These days my success rate is almost 100%. Sterilization of basically everything (including tools, spoons, etc.) over and over throughout the process is critical to success.



Nerd/Brogrammer spectrum

Okay, I've been watching Silicon Valley. This show is so realistic (and so exaggerated) that I find it painful to watch at times (especially the first couple episodes), but it's also funny as hell. It really helps to put things into perspective about why I'm now a consultant and not "exclusive" to a single corporation anymore.


Some observations:

- Notice that the interpersonal relationships between basically all the characters are pretty toxic. Everyone seems to be exploiting everyone else in some way, and money/wealth is a large motivator to most characters.

- Look at the above frame. Where's the programmer in the middle in this series? The person who works out, is empathetic, loves to program, but isn't a total and complete asshole.

Friday, March 10, 2017

Virtual selector codebook example image

I'm going to post some examples over the next few days. Note these examples are not .basis compressed images. They are just ETC1 compressed images where each block's selectors have been replaced by the "best" corresponding virtual selector codebook entry. Each block was processed independently, using a brute force search on a 20 core workstation.

I used an 8-bit seed dictionary and 11 bits to control the amplification functions, so the virtual CB size was 512K (2^19) entries. The seed dictionary and functions were tuned on the set of photos, test images, and textures used to tune crunch. My current implementation seems to work best on photos and textures, and worst on synthetic images and high-contrast text.

In .basis, this method is currently only used to compress selector codebooks, not entire images, so a brute force search over the entire virtual codebook isn't too insane. Also, basis supports a hybrid codebook mode, so it can select between virtual CB entries and "raw" entries depending on quality.

(Posting links so blogger doesn't resample the images.)

ETC1 compressed with selectors chosen from a 512K virtual codebook - .977149 Luma SSIM

ETC1 compressed - .996428 Luma SSIM

Original

Other ETC1 examples using virtual selector CBs (sorry, no SSIMs, although I do have this data):

Blues Brothers





Sunday, March 5, 2017

Virtual global selector codebooks in Basis

Much of this information will be present in .basis's open source transcoder, so here's a little brain dump of how it works.

Selectors in DXT1 and ETC1 are 2 bits each. These values select which block color is used to represent each texel in a block. The block size in these formats is 4x4 texels, so a block's selectors can be treated as a 16D vector. The uncompressed size of a selector vector is 32 bits (4*4*2).
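
To make the 32-bit figure concrete, here is a small sketch that packs sixteen 2-bit selectors into one 32-bit word and back. The bit order (selector i in bits 2i and 2i+1, raster order) is an assumption chosen for illustration only; the real DXT1 and ETC1 block layouts arrange their selector bits differently, and selector_block, pack_selectors and unpack_selectors are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// 4x4 selectors in raster order, each value in 0..3.
using selector_block = std::array<uint8_t, 16>;

// Packs 16 two-bit selectors into 32 bits (illustrative bit order).
inline uint32_t pack_selectors(const selector_block &s)
{
   uint32_t bits = 0;
   for (int i = 0; i < 16; i++)
   {
      assert(s[i] <= 3);
      bits |= static_cast<uint32_t>(s[i]) << (i * 2);
   }
   return bits;
}

// Inverse of pack_selectors().
inline selector_block unpack_selectors(uint32_t bits)
{
   selector_block s{};
   for (int i = 0; i < 16; i++)
      s[i] = static_cast<uint8_t>((bits >> (i * 2)) & 3u);
   return s;
}
```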

One way of compressing these selector vectors is vector quantization (VQ). One problem with straightforward VQ is the expense of codebook storage. crunch stores selector codebooks in a compressed form, using order-1 delta compression on each vector's component, and it tries to rearrange the codebook entry order to minimize these deltas. With large codebooks the number of bits needed to store the compressed selector vectors becomes prohibitive relative to the compressed texture data.

An alternative technique is to switch to a multi-level codebook scheme, where each .basis file has a "local" selector codebook which refers into a much larger constant "global" selector codebook. (This is somewhat inspired by Graeme Devine's ROQ video file format used in Quake 3, which used multilevel codebooks.) The catch is a serious memory problem, because the global (constant) selector codebook is going to require several megabytes of memory. A 1024^2 entry global selector codebook, at 4 bytes per entry, requires a 4MB table, which is undesirable on many platforms (e.g. WebGL).

.basis works around this problem by using a very small (256 entry) global selector vector codebook which is procedurally "amplified" using a series of small functions. These functions can rotate the vector by 90, 180, or 270 degrees, vertically flip the vector, change its contrast, add Gaussian noise to the vector, invert the vector, etc. It turns out that this method is surprisingly powerful and simple, and lends itself well to a hardware implementation.

To select a "virtual" selector codebook entry in this scheme, the encoder first selects a "seed" vector from the small global codebook (which requires 8 bits), then it specifies a series of control bits (typically 6-12) to select which procedural routines are used to modify the codebook entry. The control bits can be optionally arithmetically coded, or stored as-is. It's also possible to dispense with the seed codebook entirely and just use a PRNG (with some post processing) to generate the initial selector entries. In this case, the seed bits are used to prime the PRNG.
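
One way to picture a virtual codebook entry is as a single integer made of the seed index plus the control bits. The sketch below assumes the seed sits in the low 8 bits with the control bits above it; that layout and the helper names are hypothetical, not the .basis bitstream format. Only the 8-bit seed size comes from the text.

```cpp
#include <cassert>
#include <cstdint>

const uint32_t SEED_BITS = 8; // 256-entry seed codebook, per the text

// Total number of addressable virtual entries for a given control bit count.
inline uint32_t virtual_codebook_size(uint32_t num_control_bits)
{
   assert(num_control_bits <= 23);
   return 1u << (SEED_BITS + num_control_bits);
}

// Combines a seed index and control bits into one virtual entry index
// (assumed layout: control bits above the seed bits).
inline uint32_t make_virtual_index(uint32_t seed, uint32_t control, uint32_t num_control_bits)
{
   assert(num_control_bits <= 23);
   assert(seed < (1u << SEED_BITS));
   assert(control < (1u << num_control_bits));
   return (control << SEED_BITS) | seed;
}

inline uint32_t seed_of(uint32_t virtual_index) { return virtual_index & ((1u << SEED_BITS) - 1u); }
inline uint32_t control_of(uint32_t virtual_index) { return virtual_index >> SEED_BITS; }
```

With 11 control bits this gives 2^19 = 524288 virtual entries addressed by a 19-bit index, while only the 256 seed vectors need to be stored.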

Function usage statistics in kodim18 (on each ETC1 block):

shift_x: Samples: 24576, Total: 12737.000000, Avg: 0.518270, Std Dev: 0.499666
shift_y: Samples: 24576, Total: 10472.000000, Avg: 0.426107, Std Dev: 0.494510
flip: Samples: 24576, Total: 10811.000000, Avg: 0.439901, Std Dev: 0.496375
rot: Samples: 24576, Total: 18679.000000, Avg: 0.760050, Std Dev: 0.427052
erode: Samples: 24576, Total: 6115.000000, Avg: 0.248820, Std Dev: 0.432329
dilate: Samples: 24576, Total: 7067.000000, Avg: 0.287557, Std Dev: 0.452623
high_pass: Samples: 24576, Total: 10076.000000, Avg: 0.409993, Std Dev: 0.491832
rand: Samples: 24576, Total: 12522.000000, Avg: 0.509521, Std Dev: 0.499909
div: Samples: 24576, Total: 10053.000000, Avg: 0.409058, Std Dev: 0.491660
shift: Samples: 24576, Total: 8016.000000, Avg: 0.326172, Std Dev: 0.468811
contrast: Samples: 24576, Total: 6080.000000, Avg: 0.247396, Std Dev: 0.431499
inv: Samples: 24576, Total: 12075.000000, Avg: 0.491333, Std Dev: 0.499925
median: Samples: 24576, Total: 7947.000000, Avg: 0.323364, Std Dev: 0.467760

rot's usage was ~76% because three of its four settings (90, 180, and 270 degrees) count as a use; the remaining ~24% of the time the vector wasn't rotated. I've been continually surprised how easy it has been to find useful functions.

Here's a description of the current set of procedural functions in .basis:

  • shift_x/y: Shifts the block's selectors left or up by 1 column/row
  • flip: Vertical flip
  • rot: Rotates by 0, 90, 180, or 270 degrees
  • erode: 3x3 erosion morphological operator
  • dilate: 3x3 dilation morphological operator
  • high_pass: 3x3 high-pass filter
  • rand: Adds Gaussian noise to the selectors, using the selectors themselves as a seed for the PRNG.
  • div: Selector remapping through table { 2, 0, 3, 1 }
  • shift: Adds 1 to the selectors with clamping
  • contrast: Boosts contrast of selectors by remapping them through 1 of 3 tables: { 0, 0, 3, 3 }, { 1, 1, 2, 2 }, { 1, 1, 3, 3 }
  • inv: Inverts the selectors
  • median: 3x3 selector median filter

The order that these functions are applied matters, and I'm still figuring out the optimal order. The control bits select which combination of the above functions is used to modify the selectors, and for a couple functions (like rot and contrast) multiple control bits are needed.
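
A few of the simpler functions in the list can be sketched as follows. This is an illustration under assumptions, not the .basis implementation: the rotation direction (clockwise), the reading of inv as 3 - s, and all names are guesses. The div and contrast tables are copied from the list above.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// 4x4 selectors in raster order: s[y * 4 + x], each value in 0..3.
using selector_block = std::array<uint8_t, 16>;

// flip: vertical flip (row y swaps with row 3 - y).
inline selector_block flip_vertical(const selector_block &s)
{
   selector_block r{};
   for (int y = 0; y < 4; y++)
      for (int x = 0; x < 4; x++)
         r[y * 4 + x] = s[(3 - y) * 4 + x];
   return r;
}

// rot: rotates by quarter_turns * 90 degrees (assumed clockwise).
inline selector_block rotate(selector_block s, int quarter_turns)
{
   assert(quarter_turns >= 0 && quarter_turns <= 3);
   for (int t = 0; t < quarter_turns; t++)
   {
      selector_block r{};
      for (int y = 0; y < 4; y++)
         for (int x = 0; x < 4; x++)
            r[x * 4 + (3 - y)] = s[y * 4 + x]; // texel (x, y) moves to (3 - y, x)
      s = r;
   }
   return s;
}

// Remaps every selector through a 4-entry table.
inline selector_block remap(const selector_block &s, const std::array<uint8_t, 4> &table)
{
   selector_block r{};
   for (int i = 0; i < 16; i++)
   {
      assert(s[i] <= 3);
      r[i] = table[s[i]];
   }
   return r;
}

const std::array<uint8_t, 4> DIV_TABLE = {{ 2, 0, 3, 1 }};   // from the list above
const std::array<uint8_t, 4> INV_TABLE = {{ 3, 2, 1, 0 }};   // assumption: inv(s) = 3 - s
const std::array<uint8_t, 4> SHIFT_TABLE = {{ 1, 2, 3, 3 }}; // add 1, clamp at 3
const std::array<uint8_t, 4> CONTRAST_TABLES[3] = {          // from the list above
   {{ 0, 0, 3, 3 }}, {{ 1, 1, 2, 2 }}, {{ 1, 1, 3, 3 }}
};
```

Expressing div, inv, shift, and contrast as table lookups keeps each one to 16 small reads per block, which fits the remark about the method lending itself to hardware.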

With a method like this it's possible to compress a selector vector down to 14-16 bits. Quality is extremely good. The biggest problem I'm working on solving now is how to efficiently search such large virtual codebooks during encoding without sacrificing too much quality or introducing artifacts. Full codebook searches are very slow.
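
For reference, a full search has roughly this shape: try every seed with every control bit combination and keep the entry with the lowest error. The sketch below is a toy under stated assumptions. It uses only three functions with a hypothetical 4-bit control layout and application order, and it scores candidates by squared selector difference, which is a stand-in for whatever error metric a real encoder would use on the decoded texels.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

using selector_block = std::array<uint8_t, 16>; // raster order, values 0..3

// Hypothetical control layout: bit 0 = flip, bits 1-2 = rot, bit 3 = inv.
const uint32_t NUM_CONTROL_BITS = 4;

// Procedurally modifies a seed vector according to the control bits.
inline selector_block amplify(selector_block s, uint32_t control)
{
   assert(control < (1u << NUM_CONTROL_BITS));
   selector_block r{};
   if (control & 1u) // vertical flip
   {
      for (int i = 0; i < 16; i++) r[i] = s[(3 - i / 4) * 4 + (i % 4)];
      s = r;
   }
   for (uint32_t t = 0; t < ((control >> 1) & 3u); t++) // 90-degree clockwise turns
   {
      for (int i = 0; i < 16; i++) r[(i % 4) * 4 + (3 - i / 4)] = s[i];
      s = r;
   }
   if (control & 8u) // invert (assumed to mean 3 - s)
      for (int i = 0; i < 16; i++) s[i] = static_cast<uint8_t>(3 - s[i]);
   return s;
}

// Proxy error metric: sum of squared selector differences.
inline uint32_t selector_error(const selector_block &a, const selector_block &b)
{
   uint32_t err = 0;
   for (int i = 0; i < 16; i++)
   {
      const int d = static_cast<int>(a[i]) - static_cast<int>(b[i]);
      err += static_cast<uint32_t>(d * d);
   }
   return err;
}

struct search_result { uint32_t seed; uint32_t control; uint32_t error; };

// Exhaustive search over seeds.size() * 2^NUM_CONTROL_BITS virtual entries.
inline search_result brute_force_search(const std::vector<selector_block> &seeds, const selector_block &target)
{
   assert(!seeds.empty());
   search_result best = { 0, 0, UINT32_MAX };
   for (uint32_t seed = 0; seed < seeds.size(); seed++)
      for (uint32_t control = 0; control < (1u << NUM_CONTROL_BITS); control++)
      {
         const uint32_t err = selector_error(amplify(seeds[seed], control), target);
         if (err < best.error)
            best = { seed, control, err };
      }
   return best;
}
```

Even in this toy the cost is one amplify() plus one error evaluation per virtual entry per block. At 512K virtual entries that is over half a million candidates per block, which is why a faster search matters.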