« Du sperme, j'en veux encore! | Main | All joking aside... »
February 02, 2005
Automatic Ontology
What follows is a gloss on and interpretation of Rudi Cilibrasi and Paul Vitanyi's paper 'Automatic Meaning Discovery Using Google'. Prepared for my own notes, it might be useful as a relatively de-equationed digestible version of the research, and secondarily as a discussion of how it links to other of my own (and others') research themes.
The paper has its background in the researchers' previous work which used 'feature-free' compression-based methods to identify similarities amongst 'texts' in varied domains including genomes, computer programs and music. In the most accessible and widely-reported research result, a compression-based use of information theory was used to define a similarity measure for pieces of music expressed as MIDI files: Given a number of pieces by Beethoven, the method was able successfully to identify further pieces by the same composer.
The idea of 'feature-free' compression methods takes us right to the heart of important problems of information, definition, explanation, and meaning.
Compressibility can be thought of as a measure of information. (We treat a binary string as the generalised form into which all information can in principle be translated.) If you can compress a string of binary digits S of x bits into a smaller string S' of x' bits, and subsequently recover S from S', it is reasonable to assume that the information content of S is measured by x', not x - or, in other words, that the articulation of x in S is redundant. Compression equals the elimination of redundancy.
Discussion of compression techniques brings into play the question of the level of contextual assumption about the data to be compressed: a compressor such as mp3, developed specifically for use with digital audio data, is successful because its compression algorithm relies on certain empricially-derived assumptions about the type of data it will be asked to compress. In fact, mp3 works by throwing away information it 'knows' to be redundant to the ultimate musical 'message' (a message which, unusually, is not to be identified with the digital source code itself) - complex frequency groupings are simplified by taking out those which the user would not hear anyway.
No true (information-theoretically speaking) compression algorithm is afforded such an extravagant disposition. A general-use compressor must be able to compress and recover in full a binary string about which no assumptions as to usage can be made. They work, therefore, on a 'pure' definition of redundancy with indifference to contextual meaning. "Such compressors implicitly assume that the data has no global, structured meaning," looking only for "statistical biases" to exploit.
Kolmogorov Complexity;
Kolmogorov complexity represents the maximum compressibility of a string. K(x); stands for the shortest possible algorithm that will produce x; as an output (like Chaitin's program-length complexity). K-complexity can be understood best as a regulative ideal; we cannot actually discover or define the maximum compressibility of x;, but we can approximate it from above with the use of any number of programming languages or compression methods. So that for any computing system C;, we can say that:
K(x) < C(x) < x;
As should be clear, programming languages and compressors are theoretically identical. The difference is wholly on the side of practicality for the user, where programming languages need to present to the progammer some reassuring kinship with natural language in order to be usable. Actually, all programming languages consist of a syntax and lexicon together with a compiler that transforms the program into machine-code, constituting two distinct stages of compression, one facing towards natural language, the other towards the pure binary seriality of the processor opcodes.
So, K(x); is approximated from above by C(x); where C; is some compression mechanism, programming language, or computing system whereby x; can be transformed into x'; with the guarantee of complete recovery. "The choice of computing system C; changes the value of K(x); by at most an additive fixed constant" that corresponds to the weight of assumptions built into C;. Regardless of this, C(x); can always be considered a linear approximation of the unknown K(x);. The authors give the practical example of the familiar compression program gzip, and the lesser-known but more powerful bzip2 and PPMZ, where gzip(x); gives xg; where |xg| < |x|; (meaning that the bit-length of xg; is less than that of x;); but the other two programs b; and p; give:
|xp| < |xb| < |xg| < |x|;
By definition, |x|; and |Kx|; given by K(x); limit the series:
|Kx| < .... < |xp| < |xb| < |xg| < .... < |x|;
- "and our task in designing better and better compressors is to approach this lower bound as closely as possible." To amplify and generalise, then: our task in writing computer programs is generally to approach optimum efficiency or source-code to output ratio;and our task in any rational scientific endeavour is to account for as wide a field of phenomena as possible with the simplest rule possible.
Normalized Information Distance;
The authors next (leaving aside a great deal of technical detail) define the shortest program that will compute output y from input x as the Information Distance E(x,y);. Defined in terms of K; this is shown to be:
E(x,y) = K(x,y) - min{K(x),K(y)};
So the quickest way to get from x; to y; algorithmically is found by the difference between the shortest program for producing x; and y;, and the lesser of the lengths of the shortest program for producing either x; or y;.
Normalized to yields a value between 0 and 1, E; is proportional to K; and has the same limiting properties - all ways of getting from x; to y; using a given computational system C(x,y); will converge at the lower limit of E(x,y);. Normalized Information Distance; (NID) "discovers for every pair of strings the feature in which they are most similar, and expresses that similarity on a scale of 0 to 1" where 0 = identical and 1 = sharing no features.
The Normalized Compression Distance; (NCD) is straightforwardly an implementation of NID for a given C;. "Thus the NCD is actually a family of compression functions parameterized by the given data compressor C;0".
Now, the method used to compare MIDI files analysed (with no assumptions) all statistical features (all measures of NCD between two MIDI pieces), just as if the purpose were to find the most effective compression method, and used the most predominant of the statistical features as comparison operators. The likelihood of piece c; being by the same composer as piece y; was defined as NCD C(x,y); - this latter assumed to approximate NID E(x,y); which would name the optimal way of determining the 'Beethoven-ness' of a piece (and hence, arguably would encode the meaning of Beethoven-ness) Similarity, and thus relatedness, were deduced from Kolgomorov-complexity or compressibility.
This was a method that dealt directly with literal objects (pieces of music) in order to discover mathematically the distinctions that a musician would intuitively make between pieces of music. However, the proposition is to apply a similar technique to more abstract meanings "that cannot be given literally, but only by name and acquire their meaning from their contexts in background common knowledge in humankind." Two obvious intuitions stand out with regard to such semantic; analysis, roughly corresponding to the founding intuitions of structural linguistics: that context and usage is as or more important than the literal content of an object (a word), and that linguistic meaning is actually a function of the relationship between 'names' for objects, rather than between objects themselves.
So an analysis of the statistical features of the names of objects; will yield more significant results in this field than one based on words themselves as concrete informational entities, and this is what the Google method sets out to achieve.
Normalized Google Distance;
In effect the researchers use Google as a set of assumptions for the building a compressor. Their initial intuition is that, encoded in the probability distribution matrix constituted by the Google database, is a mechanism which can 'define', 'explain' or compress the semantic relatedness of millions of words (the 'low level' basic semantic knowledge embedded in the web) into a relatively short program-length (the Google database).
Thus "in essence, the choice of compressor brings a particular set of assumptions to bear upon an incoming data stream, and if these assumptions turn out to be accurate for the data in question, then compression is achieved. This is the same as saying that the probability mass function defined by the compressor concentrates high probability on these data. If a pair of files share information in a way that matches the assumptions of a particular compressor, then we obtain a low NCD."
Remembering what was said above concerning the shortcomings of generalised compressors such as gzip, it would be unlikely for them to attain optimal compression for a characteristically-featured (non-ergodic) domain such as natural language. Kolmogorov asks "what real meaning is there, for example, in [applying a generalised compression approach] to 'War and Peace'?" Generalised compressors cannot take advantage of any feature in the source material, no matter how obviously such features may seem, to us, to distinguish that material from random noise. We might conjecture that for a feature-rich source x;, |x'|; given by C(x); with a generalised C; is bound to converge to |x'| = |x|;, that is, no, or minimal, compression is achieved, compression being inversely proportional to the 'featuredness' or information in x;.
"Everywhere gzip looks it only finds a loaded die or biased coin, resolute in its objective and foolish consistency...Thus instead of starting with a standard data compression program, we start from a probability mass function that reflects knowledge" - namely, the function that maps search terms to number of pages returned in Google - giving Normalized Google Distance; or NGD. "We have replaced the obstinate objectivity of classical compressors with an anthropomorphic subjectivity derived from the efforts of millions of people worldwide." giving a highly adaptable, non-domain-restricted measure that uses names of objects to discover the structural relationships between those objects.
Given their simple example it is easy to see how superior Google is as a predictor for texts which are not indistinct and ergodic but contextual: given 'the quick brown', a purely probablistic compressor would give the most likely NCD outputs own, row, she, the;; where Google gives fox,vet,jot,hex;.
Of course, using a 10-billion page database as a compressor is bound to make things a little inconvenient! But, theoretically speaking, since the size of the computing system increases C(x); by a constant (C; is static as x; increases), at larger program-length sizes this will become less important, and we will still converge to K(x);. The trade-off between compressor-assumptions and compressive power changes as a function of |x|;...
The distinction the researchers make is between a literal-object based NCD and one based on contexts and names; of objects, where information about the context is built into C;. And the germane historical fact is that the Google system has unprecedented latent semantic properties;. Despite the fact that many projects have been set up to construct just such semantic webs, Google has produced a superior one 'for free' (and keep in mind that with this research we are only dealing with using the pagecounts in this research, not even the page contents!)
"Google events [the mapping between a search term and the number of pages it returns] capture in a particular sense all background knowledge about the search terms concerned available (to Google) on the web. Therefore it is natural to consider code words for those events as coding this background knowledge" This last sentence is suggesting the consideration of the "Google compressor function" as a unique mapping between all search-results S; for given terms and background-knowledges, or if you will, contextual meanings; of those terms M; so:
G(S) -> M;
or, once more, Google constitutes a computing system C; such that the function C(x); mapping a word to its meaning approaches the supposed Kolmogorov-complexity function K(x); giving the optimally-compressed definition of x;'s meaning. Google would then be the closest turing-compatible system we have to the optimal understanding or definition of a term such as (or better than) that which a human being builds up in the context of lived experience.
Experiments;
The experiments proceed to give some examples of how this latent semantic content can be extracted, using feature-free knowledge-extraction methods. This approach basically takes a set of terms, and on the basis of the NGD between all terms builds a 'best-guess' topological model of their relatedness. The elegant and impressive results are shown in Figure 2 in the paper reproduced here.
<img alt="tree.gif" src="http://blog.urbanomic.com/dread/archives/tree.gif" width="520" height="641" />
The authors refer to this as a 'best-matching unrooted ternary tree' (unrooted tree? rhizome, surely!) and refer to the result as an "example of automatic ontology creation". Here we have an interesting analog to the question of 'popular numerics', and also of its compromise with anthropology, insofar as one is tempted to question whether, in building in such a vast amount of information into the compressor, we do not simply minimise the influence of the source-code on the result, and merely reap the benefits of an overloaded computational language or mapping that has become more like a gigantic lookup dictionary than an operative algorithm? The equivalent situation arises with programming languages, where the compromise is measured in the ratio of programming time-to-runtime-cost: in 'high-level' languages where the compiler must translate from near-natural-language constructs, runtime and compiler size is increased, whereas with 'low-level' languages the programmer must invest greatly in coming to terms with the unnatural syntax of the inner workings of the processor.
However let us rehearse again the fact that since the complexity of C; increases C(x); only by a constant, the difference between a low and a high level C; will matter less in proportion to a larger |x|;. In computing terms, you only have to load the compiler into memory once, no matter the length of your program.
Next, the authors show a series of more elaborate tests where a training procedure is used to 'teach' a system by providing a set of terms divided into two categories as a basis for a web of pairings that can then be run through Google yielding NCDs (or rather, NGDs). Here where the system is given a 'headstart' (analogous, perhaps to the certain amount of learning-by-rote a child has to do before being able to spontaneously add to their knowledge) the results are even more impressive (the experiments section of the paper should provide no difficulties to the non-mathematical mind). Lastly, and rather cruelly, they compare their results to those from WordNet, a semantic model that has been built up deliberately over years with the contributions of experts, and find that the Google-derived results fall within the standard deviation from the WordNet results.
As they conclude, it is generally assumed that there is a "many-to-many relationship between...utterances and their many possible meanings," and what Google is spontaneously providing wil converge to a ready-to-use model of the entirety of this massive web of interconnections. Whereas a "context-free statistical compressor" such as gzip must assume the source code to be essentially random, its probability mass function corresponding to "the probability that the reference universal Turing machine outputs the string if its input program is generated by fair coin flips" - the universal distribution - assumes, then, a totally featureless 'web' of equal probabilities between all source-code-words (absolutely-decoded space), the NGD "is a comparable notion, in the context of the world-wide-web background information, to the universal distribution" i.e. it offers an actual; model of the quasi-universal distribution of probabilities rather than an idealised pure one. What is important to consider is that the historical transformation in computational circumstances have only now made it possible, through the collection of massive-enough sets of data, to analyse such a real distribution. The relation between the fixed constant |C|; and source-code |x|; indexes the impact of increased computational power on the applicative distinction between source-text and coding-machinery. This suggests the need to radicalise our everyday notions of the distinction between data and program in line with the theoretical model, and, along with it, the distinction between what intelligence is, what it does, and what it numbers or names.
Finally, to put it at its most speculative, what has been shown is that it would be possible for an alien with no foreknowledge or asumptions whatsoever, and no background in human culture or 'ontological assumptions', to analyse the data returned by Google as a response to random ASCII-strings, and eventually to discover the structural properties of any domain of natural language (the examples given in the paper are of automatic ontologies of whole numbers, colours, painters, and religious beliefs).
We could imagine such an alien, accidentally tuning some outlandish binary radio device into some stray emanations from earth, and being filled with wonder at the strange but palpably meaningful patterns returned by this binary oracle, intimations of an occult ordering from across the cosmos, an intelligence from outside...we also should share its extra-terrestrial wonder!
Posted by robin at February 2, 2005 06:55 PM