
Toward Textual Transform Coding¹

¹Based on a lecture entitled “Learning from Humans How to Improve Lossy Data Compression”, given at the International Symposium on Information Theory in Espoo, Finland, summer of 2022.

Tsachy Weissman
Department of Electrical Engineering
Stanford University
[email protected]
Abstract

Inspired by recent work on compression with and for young humans, the success of transform-based approaches to information processing, and the rise of powerful language-based AI, we propose textual transform coding. It shares some of its key properties with traditional transform-based coding underlying much of our current multimedia compression technologies. It can form the basis for compression at bit rates until recently considered uselessly low, and for boosting human satisfaction from reconstructions at more traditional bit rates.

1 Context and Motivation

1.1 State of our Compressors

Are our technologies for compression and streaming of audio, image and video approaching some kind of a rate-distortion-complexity Pareto front, or is there substantial room for improvement? This question was brewing in my group and elsewhere throughout the last decade, in light of the growing appetite for capturing, storing and communicating such data, along with the emergence of even more voluminous data types such as 3D point clouds and multi-sensor data from self-driving vehicles.

1.2 What would Shannon do?

Following the publication of [Sha48], Shannon directed his attention to the compressibility of English. Among other insights, [Sha51] suggested that English text should be compressible to about 1.3 bits/character once our algorithmic predictive models catch up with those in Mrs. Mary E. Shannon’s brain. Subsequent work [CK78, RTT19] has shown that number to be largely consistent across English speaking humans. It took our technologies over seven decades to catch up and deliver [Bel21, Kno21].

Inspired by this progression, we wanted to understand from humans what might be achievable in the context of multimedia data compression once:

  • our algorithmic models catch up to those in our brains

  • our codebooks/encoding/decoding make good use of humanity’s publicly available “side information”

  • we tailor the reconstructions to what humans actually care about

In the summer of 2018 we were fortunate to host 3 high school students for an internship dedicated to an attempt at addressing these questions.

1.3 Image Compression with Human Encoders, Decoders and Scorers

One intern (human encoder) was given a photo they hadn’t previously seen and that was not available online. Their task was to describe it, in text, to another intern (human decoder) using the inbuilt Skype text chat (turning off their outgoing audio/video). The decoder could share partial, in-progress reconstructions with the human encoder using Skype’s screen share feature². We compared the quality of the reconstructions under the human compression system to the quality under WebP, using human scorers. The human compressors achieved higher quality than WebP operating at a similar bit rate on most image types [Bho+19, Bho+19a, Fis+21].

²The intermediate reconstructions are based on bits flowing from the encoder, and thus such “feedback” schemes are legitimate lossy compressors.

1.4 SHTEM

Motivated by that project, in the summer of 2019 we launched the SHTEM summer internship program for high school and community college students [SHT23]. The H stands for both the Humanities and the Human element, with an emphasis on multi-disciplinary projects that combine the humanities with STEM and that use humans to assess, guide and evaluate what our technologies should strive to achieve.

A couple of the SHTEM 2019 projects built on the project from the preceding summer. One was dedicated to gleaning insight into the potential for better facial image compression from the way in which a police sketch artist creates an image [Cha+19]. Another eliminated the human at the decoder from the setting of [Bho+19] while reducing the bit rate by another order of magnitude without compromising human satisfaction. This was done by tasking the human encoder with describing the image in Python code, in a way that would allow a computer running the code to create the reconstruction [Mah+19].

1.5 Video

One of the SHTEM projects from the summer of 2020 was instrumental to what became StageCast [Ro21], a project born during the early days of the pandemic which explored and developed technology enabling real-time performances with geographically distributed members of the cast and audience. The idea was to detect and stream the locations of key points in the face and body of a human and have their digital puppet reconstructed at the decoder [Pra+21]. The substantial reduction in the required bit rate enabled essentially real-time interactions that were not previously achievable with existing (full video streaming) technologies. That project also suggested the potential for a similarly low bit rate scheme streaming real video, replacing the digital puppet with a deep fake [PS21] of the original person in the reconstruction. This suggestion was evidently heeded by Nvidia in [Max].

With appetites whetted, in [Tan+23] we pursued a framework for video-conferencing compression at yet lower bit rate regimes. The gist was the realization that in a typical teleconference video, once the background and the person speaking have been learnt, inclusive of how that person sounds and moves according to the content of their words, what information-theoretically remains is mainly the content of their words. We devised a video streaming system exploiting existing technologies (such as for extracting text from audio and for synthesizing speech from text), which achieved user satisfaction levels similar to existing standards while requiring three orders of magnitude lower bit rates. Our SHTEMers implemented a version that runs in a standard web browser [TW23].
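To make the architecture concrete, here is a minimal Python sketch of that kind of pipeline; the function names (speech_to_text, text_to_speech, animate_speaker) are placeholders standing in for existing off-the-shelf components, not the actual Txt2Vid implementation.

def encoder_side(audio_frame, speech_to_text) -> str:
    # Only the words being spoken are extracted and transmitted; the (already
    # learnt) appearance and voice of the speaker are not re-sent.
    return speech_to_text(audio_frame)

def decoder_side(words: str, speaker_profile, text_to_speech, animate_speaker):
    # The decoder re-synthesizes audio in the speaker's voice from the text alone,
    # and drives a video of the speaker from that audio.
    audio = text_to_speech(words, speaker_profile)
    video = animate_speaker(speaker_profile, audio)
    return audio, video

Since the transmitted payload is just a text stream, the required rate drops by orders of magnitude relative to streaming the audio and video themselves.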

In [Per+22] we used human input to teach a small deep net to anticipate regions of importance in a video, guiding the bit rate allocation to boost the performance of an existing video codec (x264). Naturally, the importance regions learned corresponded to objects that can be tagged by one or a couple of words.

1.6 Emergence of Text

This work, performed largely by SHTEMers and their mentors, has revealed much potential for improvements over our existing technologies for multimedia data compression, with human language emerging as key for assessing this potential and delivering on it. Meanwhile, natural language processing and machine learning have been progressing dramatically, with the rise of powerful language models (such as ChatGPT, LaMDA, Bard, and Copilot), generators of multimedia data based on textual descriptions (such as DALL-E2), generators of text descriptions of multimedia data (such as in GPT4), etc.

2 Quest for a New Transform

Our most widely used information processing technologies are transform based: transform, process-in-the-transform-domain, inverse-transform. Most widely used have been the linear transforms (FFT, wavelets, etc.), with non-linear transforms recently playing an increasingly prominent role [Bal+21, Vas+17].

2.1 Effective Transforms

An effective transform in a given application has all or most of the following characteristics:

  • coefficients in the transform domain correspond to elements (basis functions) that are meaningful (biologically, physically, perceptually, or conceptually)

  • sparsity of and simple relationships between transform coefficients (uncorrelated, independent, weakly dependent, etc.)

  • smoothness of the forward and inverse transforms with respect to relevant metrics (i.e., two “similar” inputs remain “similar” in the transform domain, and vice versa)

  • low complexity (of both the forward and inverse transform)

An added benefit of such transforms is that they induce wieldy yet realistic data models.

2.2 FFT as an Allegory

The most celebrated example of an effective transform is the one named after Fourier. A seasoned audio engineer, at a quick glance at the Fourier transform of a segment of audio, can tell whether it’s a musical piece and, if so, which instruments are playing what notes, the level of the background noise, etc. Peaks in the transform domain correspond to dominant frequencies, in turn corresponding to “harmonics”, i.e., signal components resonant with our auditory system.

A harmonic in the Fourier context is meaningful to us: we know what it looks like on an oscilloscope, what it feels (sounds) like to our auditory system, and we have a (numerical) way of labeling it (frequency). What are our conceptual “harmonics”? What “thing” lights up non-trivial subsets of our neurons?

The answer, one might argue, is our words, which refer to and describe concepts or things that resonate with us. In the context of image data, our quest might benefit from the following analogy:

Fourier Transform                  | New Transform
-----------------------------------|---------------------------------------------
harmonics                          | objects for which we have words
frequencies                        | words
amplitude and phase of harmonics   | size, location, and orientation of objects
precision of amplitude and phase   | precision of size and location
frequency resolution               | number of words per object
component-wise fidelity            | fidelity to the story being told
FFT                                | computationally efficient image captioning

Words are as central to the transform we seek as frequencies are to that of Fourier.

2.3 Qualitative Examples

It’s instructive to contemplate what the transform we seek might look like in a couple of familiar edge cases. Let’s begin with white noise. It remains white noise under Fourier or any of the other orthonormal transforms such as wavelets. But in the new transform domain it would be totally concentrated (a “delta function”) on one element: white noise.

Figure 1: White noise (a) in a traditional transform domain (where it remains white noise) and (b) in a textual transform domain.

Next let’s consider the South Park image in Figure 2 and a cartoon of what a version of it might look like in such a transform domain.

Figure 2: The South Park image (a) in the original domain and (b) in a textual transform domain.

The point is that there are image types – where the story being told is more important than the pixel-wise fidelity – that would have a sparse approximate representation under this transform. These might include holiday greeting cards, wedding photos, cartoons, computer-generated images, propaganda posters and ads.

2.4 A Textual Transform

Moving from the qualitative to the concrete, a possible version of a (lossy) representation of an image in a transform domain of the type alluded to in Figure 1 and Figure 2 would comprise the following elements:

  • a description of how many objects there are (up to a prescribed number L, for cognitive load)

  • a list of what they are (one word each)

  • their sizes, locations and orientations (at a prescribed physical resolution R)

  • words of description for each and how they relate to each other (at a prescribed ‘textual resolution’ W)

Different choices of (L, R, W) correspond to different bit rates and description qualities. Each of these elements can be described exclusively in text, yielding what boils down to a verbal description of the image, confined to a given budget of words per object, sentences, overall length, etc.; one possible data structure for such a representation is sketched below.
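As a minimal illustrative sketch (ours, not a specification from any existing system), the representation above could be held in a small Python data structure parameterized by (L, R, W); all names below are hypothetical.

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class TextualObject:
    name: str                      # one word for what the object is
    size: float                    # fraction of image area, quantized at resolution R
    location: Tuple[float, float]  # (x, y) center, quantized at resolution R
    orientation: float             # degrees, quantized at resolution R
    description: List[str]         # at most W words describing the object and its relations

@dataclass
class TextualTransform:
    L: int                                      # maximum number of objects (cognitive load)
    R: float                                    # physical resolution for size/location/orientation
    W: int                                      # textual resolution: words per object
    objects: List[TextualObject] = field(default_factory=list)

    def to_text(self) -> str:
        # Serialize the representation as plain text; this string is the lossy
        # compressed description, to which any lossless text compressor can be applied.
        lines = [f"{len(self.objects)} objects"]
        for o in self.objects[: self.L]:
            lines.append(
                f"{o.name}: size {o.size:.2f}, at {o.location}, rotated {o.orientation:.0f} deg; "
                + " ".join(o.description[: self.W])
            )
        return "\n".join(lines)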

Such a transform seems to check the boxes of Subsection 2.1, inclusive of “smoothness” with respect to textual similarity metrics (cf.  [WD20] and references therein) and “low complexity”. Technology for implementing versions of such transforms (image captioning) and their inverses (generating images from text) is rapidly evolving [Osm+23, GCV21].

Such a transform is particularly useful for lossy compression: its output constitutes the compressed representation, as in the setting of Subsection 1.3, sans the humans in the loop. Beyond checking the boxes of a good transform, this one comes with another level of meaningfulness of the compressed representation, which is human readable and queryable. Here the encoding and the decoding are each of standalone value: the encoding is a human readable textual summary of the data, and the decoding is the generation of a point in the original data domain based on its textual description.
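Schematically, and assuming off-the-shelf captioning and text-to-image models (caption_image and generate_image below are placeholder callables, not real APIs), the resulting codec could be as simple as the following sketch:

import zlib

def encode(image, caption_image) -> bytes:
    # Forward textual transform (image captioning) followed by
    # lossless compression of the resulting text.
    caption = caption_image(image)
    return zlib.compress(caption.encode("utf-8"))

def decode(bitstream: bytes, generate_image):
    # Inverse textual transform: regenerate an image from the decoded caption.
    caption = zlib.decompress(bitstream).decode("utf-8")
    return generate_image(caption)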

Figure 3 depicts an extreme version of this idea, implementable with existing technologies, highlighting that the compression rate, in the traditional sense of bits per symbol (pixel), can be made ludicrously low with reconstructions that might be satisfactory in some applications. Perhaps more meaningful is that the overall number of bits required for the compressed representation, along with the complexity of the implementation, scale with the amount of nuance in the story being told rather than the physical (pixel) resolution of the image.

Figure 3: Extreme compression via textual transform coding: (a) source data, (b) textual representation, (c) DALL-E2 reconstructions.

2.5 Classical vs. Textual regimes

Shannon’s Rate-Distortion theory has focused on symbol-by-symbol distortion criteria. Figure 4(a) depicts the Distortion-Rate function of a memoryless Gaussian source under squared error distortion; it is representative, in that the curve looks qualitatively similar for any non-degenerate source and symbol-by-symbol distortion criterion. Rate is measured in bits per source symbol, commensurate with a regime in which the overall number of bits is linear in the number of source symbols being represented. A sufficiently large rate achieves arbitrarily small distortion (in fact zero for discrete sources when reaching or exceeding the entropy rate). Equivalently, one might define fidelity as the negation of distortion, with zero distortion corresponding to maximal fidelity. Figure 4(b) depicts the Fidelity-Rate function that results from the Distortion-Rate function of Figure 4(a), with maximal fidelity set in this example to the variance of the source. Yet another equivalent representation is depicted in Figure 4(c), with a logarithmic scale for the rate. It elucidates the fact that, in this framework of symbol-by-symbol distortion, extremely small information rates can at best only negligibly boost the fidelity from its minimal value.
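For concreteness, the standard formulas underlying Figure 4 (a textbook fact, recalled here only to make the scaling explicit) are, for a memoryless Gaussian source of variance σ² under squared-error distortion,

D(R) = σ²·2^(−2R),    F(R) = σ² − D(R) = σ²(1 − 2^(−2R)) ≈ (2 ln 2)·σ²·R for small R,

so that reaching even half of the maximal fidelity requires R = 1/2 bit per symbol, i.e., a number of bits growing linearly with the number of pixels; on the logarithmic rate axis of Figure 4(c), the curve accordingly hugs its minimum until the rate reaches a sizable fraction of a bit per symbol.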

Figure 4: Distortion-Rate and Fidelity-Rate functions under classical symbol- or pixel-wise criteria: (a) the Distortion-Rate function, (b) the Fidelity-Rate function, (c) Fidelity vs. log₁₀(Rate).

This behavior is quite different from what we have been observing of human satisfaction from text-based reconstructions in multimedia data compression, as depicted in Figure 5. Our initial experiment discussed in Subsection 1.3 showed that extremely low bit rates – well to the left in a plot with logarithmic rate on its x-axis – suffice for substantially boosting human satisfaction levels beyond the minimal value. Subsequent work mentioned in Subsection 1.4 showed similar reconstruction quality achievable while reducing the rate by another order of magnitude. Our current experiments (not yet public) suggest that non-trivially positive human satisfaction levels are achievable with further orders-of-magnitude reductions in rate via textual transform coding, at regimes of paragraphs, a handful of sentences (as in Figure 3), a handful of words, etc.

Figure 5: Achievable human satisfaction levels.

3 Next steps

3.1 A Fidelity Measure Aligned with Human Satisfaction

We conjecture that human satisfaction, given a source and its possible reconstruction, in the context of multimedia data, is mainly determined by two factors. The first is the degree of fidelity to the story being told, as would be captured by similarity in the textual domain. The second is fidelity in the traditional sense. In other words, human satisfaction can be well approximated/predicted by a function of the form

s(x, x̂) = g(T_f, P_f),    (1)

where x is the source, x̂ its reconstruction, T_f(x, x̂) is the similarity measured between the respective textual transforms (T_f standing for “textual fidelity”), and P_f(x, x̂) is a traditional similarity measure such as PSNR or SSIM (P_f standing for “pixel-wise fidelity”). Figure 5 suggests that high satisfaction can be achieved primarily via high traditional fidelity, and would be largely determined by it in that regime. At the other extreme, when traditional fidelity is low, the level of satisfaction would be largely determined by the level of fidelity in the textual domain. In other words, assuming without loss of generality that T_f and P_f are normalized to reside in [0, 1], g(α, β) would be largely dominated/determined by α when β is low and by β when β is high. We plan to dedicate a couple of SHTEM 2023 projects to gauging and quantifying human satisfaction, and to experimentally assessing the validity of our conjecture along with the form of the function g.
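As a purely illustrative candidate (an assumption of ours for exposition, not a form supported by experiments), one function exhibiting the conjectured limiting behavior is g(α, β) = 1 − (1 − α)(1 − β), sketched here in Python:

def satisfaction(textual_fidelity, pixel_fidelity):
    # Both arguments are assumed normalized to [0, 1].
    # For pixel_fidelity near 0 the value reduces to textual_fidelity;
    # for pixel_fidelity near 1 the value is close to 1 and nearly insensitive
    # to textual_fidelity, matching the conjectured behavior of g.
    a, b = textual_fidelity, pixel_fidelity
    return 1.0 - (1.0 - a) * (1.0 - b)

# Sanity checks of the limiting behavior:
assert abs(satisfaction(0.7, 0.0) - 0.7) < 1e-9   # low pixel fidelity: g ≈ α
assert satisfaction(0.1, 0.95) > 0.9              # high pixel fidelity: g is high regardless of α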

3.2 Characterizing and Approaching S(R)

As alluded to above, the textual transform would be a useful framework for modeling the data. Unlike symbol-by-symbol approaches, which tend to yield models either too simplistic for the actual data or too complicated to be useful, one could start in the textual domain, which is amenable to fairly simple and realistic modeling, followed by a generative model going from the text to the original domain. With such an approach, and with the satisfaction function discussed in the preceding subsection identified, it would be realistic to characterize the best achievable trade-offs between satisfaction and (the log of the) rate by applying the well-developed tools from the Shannon theoretic arsenal in both the textual and generative domains.

On the constructive side, characterizing S(R) would accompany and guide the development of new practical schemes. Progress on textual technologies of the type mentioned in Subsection 1.6 will directly translate to improved schemes for the ultra low rate region. No less interesting would be the largely unexplored region bridging the ultra-low and the traditional symbol-by-symbol (pixel-wise) fidelity region where most current technologies reside. This region will likely benefit from new approaches for effectively leveraging and combining information from the textual domain with low resolution versions from the more traditional ones, as suggested in Figure 6.

Figure 6: S(R), represented by the solid curve; black points are from Figure 5, green points represent schemes to be developed. A is the ultra-low rate region where textual transform coding is key, C is the region where most compression technologies currently operate, and B is a largely unexplored intermediate region necessitating new approaches for bridging the textual with the traditional regimes.

3.3 Beyond Compression

3.3.1 Textual Transform and Privacy

The ultra low-rate regime enabled by textual transform coding could be beneficial not only for the dramatic space/bandwidth savings but also for privacy. Representing the source data via text that can be compressed into a handful of k bits translates trivially to a privacy guarantee: the mutual information between the image and its representation is upper bounded by k. A variant is to let those k bits represent answers to a set of queries about the source data that would be agreed upon as non-private, naturally extractable from the textual domain. Recent work in this direction has been reported in [Guo+22].
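The guarantee is immediate from standard identities (spelled out here only for completeness): if the textual representation T of the image X is encoded into at most k bits, so that T takes at most 2^k values, then

I(X; T) ≤ H(T) ≤ log₂(2^k) = k bits.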

3.3.2 Denoising

One could envision applications and data types where denoising can be performed particularly dramatically and effectively in the textual domain. E.g., simply removing the word ‘noisy’ from the textual description or replacing it with the word ‘crisp’ could result in a substantial quality boost, as can be measured and optimized for under the satisfaction function of Subsection 3.1.

4 Conclusion

The textual transform is a conceptually useful and increasingly practical framework for multimedia information processing based on a code optimized over many years of human evolution. It yields human readable and searchable representations, with bounded implementation complexity that does not grow with traditional physical characteristics such as pixel resolution. It may be key to enabling compression and streaming at unprecedentedly low bit rates, and to characterizing fundamental trade-offs between bit rate and human satisfaction.

Our focus here was images for illustrative purposes, but the ideas are similarly applicable to other traditional and new multimedia data types. Our work joins other recent activity attempting to incorporate notions of perception and semantics into traditional compression and communication paradigms, cf.  [BM19, Zha+21, Wan+22, Guo+22a, Gun+23, Agu+23] and references therein. Further progress in this area will likely come from multidisciplinary collaborations between information scientists, engineers, neuroscientists and psychologists.

5 Acknowledgement

I am grateful to the many who have been making our SHTEM program a growing success. In particular, to Cindy Nguyen, Sylvia Chin and Suzanne Sims for their tireless selfless work running everything since the summer of 2019; to Dr. Shubham Chandak, Dr. Kedar Tatwawadi, Dr. Pulkit Tandon, Dr. Irena Hwang, Dr. Yanjun Han, Devon Baur, Noah Huffman, Dr. Dmitri Pavlichin, Jay Mardia and many other Stanford faculty and students for their exquisite mentorship of the projects; and to the SHTEMers for their inspiring work. Thanks to Lara Arikan for the example in Figure 3 and stimulating discussions.

References

  • [Sha48] C. E. Shannon “A mathematical theory of communication” In The Bell System Technical Journal, 1948
  • [Sha51] C. E. Shannon “Prediction and entropy of printed English” In The Bell System Technical Journal 30.1, 1951, pp. 50–64 DOI: 10.1002/j.1538-7305.1951.tb01366.x
  • [CK78] T. Cover and R. King “A convergent gambling estimate of the entropy of English” In IEEE Transactions on Information Theory 24.4, 1978, pp. 413–421 DOI: 10.1109/TIT.1978.1055912
  • [RTT19] Geng Ren, Shuntaro Takahashi and Kumiko Tanaka-Ishii “Entropy Rate Estimation for English via a Large Cognitive Experiment Using Mechanical Turk” In Entropy 21.12 MDPI AG, 2019, pp. 1201 DOI: 10.3390/e21121201
  • [Bel21] F. Bellard “NNCP”, https://bellard.org/nncp/, 2021
  • [Kno21] B. Knoll “CMIX”, http://www.byronknoll.com/cmix.html, 2021
  • [Bho+19] Ashutosh Bhown, Soham Mukherjee, Sean Yang, Shubham Chandak, Irena Fischer-Hwang, Kedar Tatwawadi and Tsachy Weissman “Humans are Still the Best Lossy Image Compressors” In 2019 Data Compression Conference (DCC), 2019 DOI: 10.1109/DCC.2019.00070
  • [Bho+19a] Ashutosh Bhown, Soham Mukherjee, Sean Yang, Shubham Chandak, Irena Fischer-Hwang, Kedar Tatwawadi, Judith Fan and Tsachy Weissman “Towards improved lossy image compression: Human image reconstruction with public-domain images”, 2019 arXiv:1810.11137 [eess.IV]
  • [Fis+21] Irena Fischer-Hwang, Shubham Chandak, Kedar Tatwawadi and Tsachy Weissman “Forget JPEG, How Would a Person Compress a Picture?” In IEEE Spectrum, 2021
  • [SHT23] SHTEM “Summer Internship Program”, https://compression.stanford.edu/outreach/shtem-summer-internships-high-schoolers-and-community-college-students, 2019-2023
  • [Cha+19] H. Chau, T. Pauly, D. Phan, R. Prabhakar, L. Quach and C. Vo “Building a Human-Centric Lossy Compressor for Facial Images” In The Informaticists Journal for High Schoolers, 2019 URL: https://theinformaticists.com/2019/08/28/building-a-human-centric-lossy-compressor-for-facial-images/
  • [Mah+19] V. Mahtab, G. Pimpale, J. Aldama and P. Truong “Human-Based Image Compression; Using a Deterministic Computer Algorithm to Reconstruct Pre-Segmented Images” In The Informaticists Journal for High Schoolers, 2019 URL: https://theinformaticists.com/2019/08/29/human-based-image-compression-using-a-deterministic-computer-algorithm-to-reconstruct-pre-segmented-images/
  • [Ro21] M. Rau and others “StageCast: Experiments in Performance and Technology”, https://taps.stanford.edu/stagecast/, 2021
  • [Pra+21] Roshan Prabhakar, Shubham Chandak, Carina Chiu, Renee Liang, Huong Nguyen, Kedar Tatwawadi and Tsachy Weissman “Reducing latency and bandwidth for video streaming using keypoint extraction and digital puppetry” In 2021 Data Compression Conference (DCC), 2021 DOI: 10.1109/DCC50243.2021.00057
  • [PS21] Swathi P and Saritha Sk “DeepFake Creation and Detection: A Survey” In 2021 Third International Conference on Inventive Research in Computing Applications (ICIRCA), 2021, pp. 584–588 DOI: 10.1109/ICIRCA51532.2021.9544522
  • [Max] NVIDIA Maxine, https://developer.nvidia.com/maxine
  • [Tan+23] Pulkit Tandon, Shubham Chandak, Pat Pataranutaporn, Yimeng Liu, Anesu M. Mapuranga, Pattie Maes, Tsachy Weissman and Misha Sra “Txt2Vid: Ultra-Low Bitrate Compression of Talking-Head Videos via Text” In IEEE Journal on Selected Areas in Communications 41.1, 2023, pp. 107–118 DOI: 10.1109/JSAC.2022.3221953
  • [TW23] Arjun Barrett, Laura Gomezjurado, Shuvam Mukherjee, Arz Bshara, Sahasrajit Sarmasarkar, Pulkit Tandon and Tsachy Weissman “Txt2Vid-Web: Web-based, Text-to-Video, Video Conferencing Pipeline” IEEE Signal Processing Society SigPort, 2023 URL: https://sigport.org/documents/txt2vid-web-web-based-text-video-video-conferencing-pipeline
  • [Per+22] Evgenya Pergament, Pulkit Tandon, Oren Rippel, Lubomir Bourdev, Alexander Anderson, Bruno Olshausen, Tsachy Weissman, Sachin Katti and Kedar Tatwawadi “PIM: Video Coding using Perceptual Importance Maps”, 2022 DOI: 10.48550/arXiv.2212.10674
  • [Bal+21] Johannes Ballé, Philip A. Chou, David Minnen, Saurabh Singh, Nick Johnston, Eirikur Agustsson, Sung Jin Hwang and George Toderici “Nonlinear Transform Coding” In IEEE Journal of Selected Topics in Signal Processing 15.2, 2021, pp. 339–353 DOI: 10.1109/JSTSP.2020.3034501
  • [Vas+17] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Lukasz Kaiser and Illia Polosukhin “Attention Is All You Need” In Advances in Neural Information Processing Systems, 2017
  • [WD20] J. Wang and Y. Dong “Measurement of Text Similarity: A Survey” In Information 11.9, 2020 DOI: 10.3390/info11090421
  • [Osm+23] Asmaa A. Osman, A. Wahby Shalaby, Mona M. Soliman and Khaled M. Elsayed “A Survey on Attention-Based Models for Image Captioning” In International Journal of Advanced Computer Science and Applications 14.2, 2023 URL: https://www.proquest.com/scholarly-journals/survey-on-attention-based-models-image-captioning/docview/2791786135/se-2
  • [GCV21] Federico Galatolo, Mario Cimino and Gigliola Vaglini “Generating Images from Caption and Vice Versa via CLIP-Guided Generative Latent Space Search” In Proceedings of the International Conference on Image Processing and Vision Engineering, SCITEPRESS - Science and Technology Publications, 2021 DOI: 10.5220/0010503701660174
  • [Guo+22] Tao Guo, Jie Han, Huihui Wu, Yizhu Wang, Bo Bai and Wei Han “Protecting Semantic Information Using An Efficient Secret Key” In 2022 IEEE International Symposium on Information Theory (ISIT), 2022, pp. 2660–2665 DOI: 10.1109/ISIT50566.2022.9834462
  • [BM19] Yochai Blau and Tomer Michaeli “Rethinking Lossy Compression: The Rate-Distortion-Perception Tradeoff”, 2019 arXiv:1901.07821 [cs.LG]
  • [Zha+21] George Zhang, Jingjing Qian, Jun Chen and Ashish J Khisti “Universal Rate-Distortion-Perception Representations for Lossy Compression” In Advances in Neural Information Processing Systems, 2021 URL: https://openreview.net/forum?id=_wdgJCH-Jf
  • [Wan+22] Yizhu Wang, Tao Guo, Bo Bai and Wei Han “The Estimation-Compression Separation in Semantic Communication Systems” In 2022 IEEE Information Theory Workshop (ITW), 2022, pp. 315–320 DOI: 10.1109/ITW54588.2022.9965794
  • [Guo+22a] Tao Guo, Yizhu Wang, Jie Han, Huihui Wu, Bo Bai and Wei Han “Semantic Compression with Side Information: A Rate-Distortion Perspective”, 2022 arXiv:2208.06094 [cs.IT]
  • [Gun+23] D. Gunduz, Z. Qin, I. E. Aguerri, H. S. Dhillon, Z. Yang, A. Yener, K. K. Wong and C.-B. Chae “Beyond Transmitting Bits: Context, Semantics, and Task-Oriented Communications” In IEEE Journal on Selected Areas in Communications 41, 2023, pp. 5–41 DOI: 10.1109/JSAC.2022.3223408
  • [Agu+23] Eirikur Agustsson, David Minnen, George Toderici and Fabian Mentzer “Multi-Realism Image Compression with a Conditional Generator”, 2023 arXiv:2212.13824 [cs.CV]