« melancholic science | Main | All joking aside... »

February 02, 2005

Automatic Ontology

What follows is a gloss on and interpretation of Rudi Cilibrasi and Paul Vitanyi's paper 'Automatic Meaning Discovery Using Google'. Prepared for my own notes, it might be useful as a relatively de-equationed digestible version of the research, and secondarily as a discussion of how it links to other of my own (and others') research themes.

The paper has its background in the researchers' previous work which used 'feature-free' compression-based methods to identify similarities amongst 'texts' in varied domains including genomes, computer programs and music. In the most accessible and widely-reported research result, a compression-based use of information theory was used to define a similarity measure for pieces of music expressed as MIDI files: Given a number of pieces by Beethoven, the method was able successfully to identify further pieces by the same composer.

The idea of 'feature-free' compression methods takes us right to the heart of important problems of information, definition, explanation, and meaning.

Compressibility can be thought of as a measure of information. (We treat a binary string as the generalised form into which all information can in principle be translated.) If you can compress a string of binary digits S of x bits into a smaller string S' of x' bits, and subsequently recover S from S', it is reasonable to assume that the information content of S is measured by x', not x - or, in other words, that the articulation of x in S is redundant. Compression equals the elimination of redundancy.

Discussion of compression techniques brings into play the question of the level of contextual assumption about the data to be compressed: a compressor such as mp3, developed specifically for use with digital audio data, is successful because its compression algorithm relies on certain empricially-derived assumptions about the type of data it will be asked to compress. In fact, mp3 works by throwing away information it 'knows' to be redundant to the ultimate musical 'message' (a message which, unusually, is not to be identified with the digital source code itself) - complex frequency groupings are simplified by taking out those which the user would not hear anyway.

No true (information-theoretically speaking) compression algorithm is afforded such an extravagant disposition. A general-use compressor must be able to compress and recover in full a binary string about which no assumptions as to usage can be made. They work, therefore, on a 'pure' definition of redundancy with indifference to contextual meaning. "Such compressors implicitly assume that the data has no global, structured meaning," looking only for "statistical biases" to exploit.

Kolmogorov Complexity

Kolmogorov complexity represents the maximum compressibility of a string. K(x) stands for the shortest possible algorithm that will produce x as an output (like Chaitin's program-length complexity). K-complexity can be understood best as a regulative ideal; we cannot actually discover or define the maximum compressibility of x, but we can approximate it from above with the use of any number of programming languages or compression methods. So that for any computing system C, we can say that:

K(x) < C(x) < x

As should be clear, programming languages and compressors are theoretically identical. The difference is wholly on the side of practicality for the user, where programming languages need to present to the progammer some reassuring kinship with natural language in order to be usable. Actually, all programming languages consist of a syntax and lexicon together with a compiler that transforms the program into machine-code, constituting two distinct stages of compression, one facing towards natural language, the other towards the pure binary seriality of the processor opcodes.

So, K(x) is approximated from above by C(x) where C is some compression mechanism, programming language, or computing system whereby x can be transformed into x' with the guarantee of complete recovery. "The choice of computing system C changes the value of K(x) by at most an additive fixed constant" that corresponds to the weight of assumptions built into C. Regardless of this, C(x) can always be considered a linear approximation of the unknown K(x). The authors give the practical example of the familiar compression program gzip, and the lesser-known but more powerful bzip2 and PPMZ, where gzip(x) gives xg where |xg| < |x| (meaning that the bit-length of xg is less than that of x); but the other two programs b and p give:

|xp| < |xb| < |xg| < |x|

By definition, |x| and |Kx| given by K(x) limit the series:

|Kx| < .... < |xp| < |xb| < |xg| < .... < |x|

- "and our task in designing better and better compressors is to approach this lower bound as closely as possible." To amplify and generalise, then: our task in writing computer programs is generally to approach optimum efficiency or source-code to output ratio;and our task in any rational scientific endeavour is to account for as wide a field of phenomena as possible with the simplest rule possible.

Normalized Information Distance

The authors next (leaving aside a great deal of technical detail) define the shortest program that will compute output y from input x as the Information Distance E(x,y). Defined in terms of K this is shown to be:

E(x,y) = K(x,y) - min{K(x),K(y)}

So the quickest way to get from x to y algorithmically is found by the difference between the shortest program for producing x and y, and the lesser of the lengths of the shortest program for producing either x or y.

Normalized to yields a value between 0 and 1, E is proportional to K and has the same limiting properties - all ways of getting from x to y using a given computational system C(x,y) will converge at the lower limit of E(x,y). Normalized Information Distance (NID) "discovers for every pair of strings the feature in which they are most similar, and expresses that similarity on a scale of 0 to 1" where 0 = identical and 1 = sharing no features.

The Normalized Compression Distance (NCD) is straightforwardly an implementation of NID for a given C. "Thus the NCD is actually a family of compression functions parameterized by the given data compressor C0".

Now, the method used to compare MIDI files analysed (with no assumptions) all statistical features (all measures of NCD between two MIDI pieces), just as if the purpose were to find the most effective compression method, and used the most predominant of the statistical features as comparison operators. The likelihood of piece c being by the same composer as piece y was defined as NCD C(x,y) - this latter assumed to approximate NID E(x,y) which would name the optimal way of determining the 'Beethoven-ness' of a piece (and hence, arguably would encode the meaning of Beethoven-ness) Similarity, and thus relatedness, were deduced from Kolgomorov-complexity or compressibility.

This was a method that dealt directly with literal objects (pieces of music) in order to discover mathematically the distinctions that a musician would intuitively make between pieces of music. However, the proposition is to apply a similar technique to more abstract meanings "that cannot be given literally, but only by name and acquire their meaning from their contexts in background common knowledge in humankind." Two obvious intuitions stand out with regard to such semantic analysis, roughly corresponding to the founding intuitions of structural linguistics: that context and usage is as or more important than the literal content of an object (a word), and that linguistic meaning is actually a function of the relationship between 'names' for objects, rather than between objects themselves.

So an analysis of the statistical features of the names of objects will yield more significant results in this field than one based on words themselves as concrete informational entities, and this is what the Google method sets out to achieve.

Normalized Google Distance

In effect the researchers use Google as a set of assumptions for the building a compressor. Their initial intuition is that, encoded in the probability distribution matrix constituted by the Google database, is a mechanism which can 'define', 'explain' or compress the semantic relatedness of millions of words (the 'low level' basic semantic knowledge embedded in the web) into a relatively short program-length (the Google database).

Thus "in essence, the choice of compressor brings a particular set of assumptions to bear upon an incoming data stream, and if these assumptions turn out to be accurate for the data in question, then compression is achieved. This is the same as saying that the probability mass function defined by the compressor concentrates high probability on these data. If a pair of files share information in a way that matches the assumptions of a particular compressor, then we obtain a low NCD."

Remembering what was said above concerning the shortcomings of generalised compressors such as gzip, it would be unlikely for them to attain optimal compression for a characteristically-featured (non-ergodic) domain such as natural language. Kolmogorov asks "what real meaning is there, for example, in [applying a generalised compression approach] to 'War and Peace'?" Generalised compressors cannot take advantage of any feature in the source material, no matter how obviously such features may seem, to us, to distinguish that material from random noise. We might conjecture that for a feature-rich source x, |x'| given by C(x) with a generalised C is bound to converge to |x'| = |x|, that is, no, or minimal, compression is achieved, compression being inversely proportional to the 'featuredness' or information in x.

"Everywhere gzip looks it only finds a loaded die or biased coin, resolute in its objective and foolish consistency...Thus instead of starting with a standard data compression program, we start from a probability mass function that reflects knowledge" - namely, the function that maps search terms to number of pages returned in Google - giving Normalized Google Distance or NGD. "We have replaced the obstinate objectivity of classical compressors with an anthropomorphic subjectivity derived from the efforts of millions of people worldwide." giving a highly adaptable, non-domain-restricted measure that uses names of objects to discover the structural relationships between those objects.

Given their simple example it is easy to see how superior Google is as a predictor for texts which are not indistinct and ergodic but contextual: given 'the quick brown', a purely probablistic compressor would give the most likely NCD outputs own, row, she, the; where Google gives fox,vet,jot,hex.

Of course, using a 10-billion page database as a compressor is bound to make things a little inconvenient! But, theoretically speaking, since the size of the computing system increases C(x) by a constant (C is static as x increases), at larger program-length sizes this will become less important, and we will still converge to K(x). The trade-off between compressor-assumptions and compressive power changes as a function of |x|...

The distinction the researchers make is between a literal-object based NCD and one based on contexts and names of objects, where information about the context is built into C. And the germane historical fact is that the Google system has unprecedented latent semantic properties. Despite the fact that many projects have been set up to construct just such semantic webs, Google has produced a superior one 'for free' (and keep in mind that with this research we are only dealing with using the pagecounts in this research, not even the page contents!)

"Google events [the mapping between a search term and the number of pages it returns] capture in a particular sense all background knowledge about the search terms concerned available (to Google) on the web. Therefore it is natural to consider code words for those events as coding this background knowledge" This last sentence is suggesting the consideration of the "Google compressor function" as a unique mapping between all search-results S for given terms and background-knowledges, or if you will, contextual meanings of those terms M so:

G(S) -> M

or, once more, Google constitutes a computing system C such that the function C(x) mapping a word to its meaning approaches the supposed Kolmogorov-complexity function K(x) giving the optimally-compressed definition of x's meaning. Google would then be the closest turing-compatible system we have to the optimal understanding or definition of a term such as (or better than) that which a human being builds up in the context of lived experience.

Experiments

The experiments proceed to give some examples of how this latent semantic content can be extracted, using feature-free knowledge-extraction methods. This approach basically takes a set of terms, and on the basis of the NGD between all terms builds a 'best-guess' topological model of their relatedness. The elegant and impressive results are shown in Figure 2 in the paper reproduced here.

tree.gif

The authors refer to this as a 'best-matching unrooted ternary tree' (unrooted tree? rhizome, surely!) and refer to the result as an "example of automatic ontology creation". Here we have an interesting analog to the question of 'popular numerics', and also of its compromise with anthropology, insofar as one is tempted to question whether, in building in such a vast amount of information into the compressor, we do not simply minimise the influence of the source-code on the result, and merely reap the benefits of an overloaded computational language or mapping that has become more like a gigantic lookup dictionary than an operative algorithm? The equivalent situation arises with programming languages, where the compromise is measured in the ratio of programming time-to-runtime-cost: in 'high-level' languages where the compiler must translate from near-natural-language constructs, runtime and compiler size is increased, whereas with 'low-level' languages the programmer must invest greatly in coming to terms with the unnatural syntax of the inner workings of the processor.

However let us rehearse again the fact that since the complexity of C increases C(x) only by a constant, the difference between a low and a high level C will matter less in proportion to a larger |x|. In computing terms, you only have to load the compiler into memory once, no matter the length of your program.

Next, the authors show a series of more elaborate tests where a training procedure is used to 'teach' a system by providing a set of terms divided into two categories as a basis for a web of pairings that can then be run through Google yielding NCDs (or rather, NGDs). Here where the system is given a 'headstart' (analogous, perhaps to the certain amount of learning-by-rote a child has to do before being able to spontaneously add to their knowledge) the results are even more impressive (the experiments section of the paper should provide no difficulties to the non-mathematical mind). Lastly, and rather cruelly, they compare their results to those from WordNet, a semantic model that has been built up deliberately over years with the contributions of experts, and find that the Google-derived results fall within the standard deviation from the WordNet results.

As they conclude, it is generally assumed that there is a "many-to-many relationship between...utterances and their many possible meanings," and what Google is spontaneously providing wil converge to a ready-to-use model of the entirety of this massive web of interconnections. Whereas a "context-free statistical compressor" such as gzip must assume the source code to be essentially random, its probability mass function corresponding to "the probability that the reference universal Turing machine outputs the string if its input program is generated by fair coin flips" - the universal distribution - assumes, then, a totally featureless 'web' of equal probabilities between all source-code-words (absolutely-decoded space), the NGD "is a comparable notion, in the context of the world-wide-web background information, to the universal distribution" i.e. it offers an actual model of the quasi-universal distribution of probabilities rather than an idealised pure one. What is important to consider is that the historical transformation in computational circumstances have only now made it possible, through the collection of massive-enough sets of data, to analyse such a real distribution. The relation between the fixed constant |C| and source-code |x| indexes the impact of increased computational power on the applicative distinction between source-text and coding-machinery. This suggests the need to radicalise our everyday notions of the distinction between data and program in line with the theoretical model, and, along with it, the distinction between what intelligence is, what it does, and what it numbers or names.

Finally, to put it at its most speculative, what has been shown is that it would be possible for an alien with no foreknowledge or asumptions whatsoever, and no background in human culture or 'ontological assumptions', to analyse the data returned by Google as a response to random ASCII-strings, and eventually to discover the structural properties of any domain of natural language (the examples given in the paper are of automatic ontologies of whole numbers, colours, painters, and religious beliefs).

We could imagine such an alien, accidentally tuning some outlandish binary radio device into some stray emanations from earth, and being filled with wonder at the strange but palpably meaningful patterns returned by this binary oracle, intimations of an occult ordering from across the cosmos, an intelligence from outside...we also should share its extra-terrestrial wonder!

Posted by undercurrent at February 2, 2005 06:55 PM

Comments

footnote - can't resist quickly commenting on the obviously beautiful features of the automatic ontology in the diagram - love how zero is grouped at the end of the line together with transparent, three with 'small' (as per Ifrah's exposition of 'intensive' numerical thinking 1, 2, some, many).

Posted by: u/c at February 2, 2005 08:16 PM

you're right - this is incredibly fascinating. The sheer (flat?) empiricism of it is amazingly refreshing, completely breaking with stuffy conceptual cathedral building.
Google is proving to be an astounding piece of cyberspace architecture - the Economist had an article recently about linguists using it as a research domain. To (kind of) repeat, it re-distributes knowledge in a mode of exteriority, allowing it to be experimented with rather than cognized - true 'thought experiments' at last

Posted by: nick at February 4, 2005 07:05 AM

it's also a wonderful example of the power of open systems. About a year ago, Google released an open-access API that allows programmers to tap in directly to the Google system, and since that Google has got more attention than ever, and its every machinic potential is being extracted in ways we probably can't imagine...I've done some work using the Google API and it's pretty amazing, with a couple of lines of code you can call on the resources of 10 billion pages.

Also, the power of 'blind' collectivity in building robust low-level structures a lovely riposte to 'stupidity of the masses'-type arguments - the 'bog-standard' knowledge they discover from google actually turns out to 'work' better than the experts' finely-honed topdown systems...

'thought experiments' - love it!

Posted by: u/c at February 4, 2005 09:44 AM

Hi everybody,

Some interesting comments here. I'm linking you up to my page if you don't mind. I've got more interesting experiments along the same lines as the numbers - colors ones you have already seen up at

http://diesel.ins.cwi.nl:8080/ncd/show/NGD+with+SVD


Maybe you will enjoy trying the 2D or 3D VRML NGD - SVD visualizations. Cheers

Rudi

Posted by: Rudi Cilibrasi at February 8, 2005 05:49 PM

wow, thanks...I sort of feel a little embarrassed and foolish now, caught in my quasiphilosophical ramblings by someone who knows what they're talking about (obviously, feel free to correct me).

...and I hope you don't mind me stealing your beautiful diagram!

Posted by: u/c at February 8, 2005 07:54 PM

btw Rudi - all I see on the VRML is the cube, with nothing in it (may be my fault tho', I'm using a Mac...)

Posted by: u/c at February 8, 2005 08:51 PM

I've run this Algorithm and, yes it does work (to an extent): The real 'magic' is in the Clustering! But, - you can use PMI or Dice Coefficient measures (or simply use Bayesian probabilities ... which is, essentially, what NGD is doing). There's some great ideas in the paper though.

Posted by: DeeDee at February 11, 2005 06:37 PM

you mean the complearn software is the clever bit? Obviously, it is - but in an empirical sense the fact that Google exists, its breadth of content, and the latent descriptive power of that content, is also what's being demonstrated, isn't it...? ie. The choice of statistical probability technique used to create the matrix is a separate issue from that of the 'compressor'.
Interested in the relation to Bayes too: where is the measure of uncertainty located?

Posted by: u/c at February 11, 2005 08:11 PM

Applying the idea of 'Compression'/Kolgomorov Complexity' here is just one way of doing this kind of thing. In essence, -- the NGD is simply:
max(P(X|Y), P(Y|X) / max(P(X), P(Y)
Also if 'variances' in the data items (words/phrases) do not 'scale' with Google's increase in index size ... where is the stability in the measure (Rudi does point this out .. and anyway - it's a problem in general with trying to 'count' Language). Peter Turney's work is sort of similar to the NGD. It all stacks up to: CONTEXT: If X and Y occur together a lot -- you 'assume' a relationship (although this is not necessarily so)

Posted by: DeeDee at February 12, 2005 12:48 PM

Sorry, I meant ... Kolmorogov .. I always do that!

Posted by: DeeDee at February 12, 2005 12:52 PM

Right, so it really _is_ just the basic principle as latent semantic analysis (ie co-occurrence), but with a particular measure of 'closeness', and a particularly large dataset in which the latency is supposed to dwell. But surely these two choices as to measure and dataset are not insignificant?

Can you explain further the point about scaling (sorry for being dim, I'm just not sure I get it).

Given that we are assuming as a 'regulative principle' for experimentation, that language/meaning _can_ somehow be counted or quantified, what other ways are there of approaching it other than 'context'? If we imagine that we've access to every single utterance in the language and matricised their co-occurrence, what would still be missing...?

Posted by: u/c at February 12, 2005 01:11 PM

isn't it kolmogorov ? :D

Posted by: u/c at February 12, 2005 01:12 PM

Kolmogorov ... sorry it's a form of 'Theory Dyslexia' I've got:

Simple Scale Invariancy Example:
(1)
WORD 1: 100, WORD2: 100, W1W2:50, Google: 1000
(2)
WORD 1: 200, WORD2: 200, W1W2:100, Google: 2000
(2)
WORD 1: 150, WORD2: 150, W1W2:75, Google: 2000d
(1) NGD = 0.30
(2) NGD = 0.30
(3) NGD = 0.27 <-- The Measure Changes because the Frequency of the words does not scale with the Google index size.

Posted by: DeeDee at February 12, 2005 03:43 PM

(again forgive me if I'm being stupid, I'll go back and review the paper again)
Isn't that just because the information in the Google index is not homogenous? Couldn't the NGD be expected to converge and stabilise the higher the number of pages considered?

Posted by: u/c at February 12, 2005 03:52 PM

It's often not about what's 'missing' in the data ( -- the attraction of Google as a Corpus is that we think it will 'average out' to give us coherent patterns' -- ) it's about how 'good' the stuff we are looking at is for finding (possibly) useful 'patterns'. Unfortunately, the web is full of junk patterns that do not mean much at all!

Posted by: DeeDee at February 12, 2005 03:53 PM

Sorry, that example was not that great. It's not you .. it's my poor explanation.

Posted by: DeeDee at February 12, 2005 03:57 PM