January 31, 2005
Matrix of meaning
[NOTE: Don't read this, read this one instead ]
I mentioned elsewhere that the urbanomic research dept. had been looking into latent semantic indexing; the simple principle behind this approach to indexing unstructured documents is that meaning can be defined mathematically in terms of patterns or clusters of connected words, and is discoverable through a 'feature-free' analysis ('feature-free' meaning, roughly speaking, arbitrarily mechanistic, not directed toward any specific feature) of co-occurrence (a co-occurrence matrix with as many dimensions as distinct words considered). By analysing the words which occur close together in large 'unstructured' (=natural language) texts, it is possible to build up a co-occurrence matrix that 'defines' the meaning of words by their relation to a cluster of other words (before protesting circularity, think about dictionaries!). There are several variable parameters to be considered - the 'window' of words one considers (2, 3, 4, 100 words?) and the 'shape' of the window (does co-occurrence 3 words apart matter less than co-occurrence 1 word apart, and if so how much less?). Although not attaining the ever-unattainable dreams of AI, the results have been good enough to find commercial use (Autonomy, one of the most highly-capitalised companies of the dotcom boom, produces a system for semantically indexing large amounts of documents).
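To make the 'window' and 'shape' parameters concrete, here is a minimal sketch in Python. It is an illustration only, not how any commercial system works: the function name `cooccurrence`, the default window of 3 and the 1/distance weighting are all assumptions chosen for the example.

```python
from collections import defaultdict

def cooccurrence(tokens, window=3, weight=lambda d: 1.0 / d):
    """Build a symmetric, distance-weighted co-occurrence table.

    window: how many words ahead count as 'close together' (assumed default).
    weight: the 'shape' of the window; 1/d is one arbitrary choice, so that
            words 3 apart count a third as much as adjacent words.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(tokens):
        # Look ahead only, so each pair of positions is visited once.
        for d in range(1, window + 1):
            if i + d >= len(tokens):
                break
            other = tokens[i + d]
            w = weight(d)
            # Record the pair in both directions to keep the table symmetric.
            counts[word][other] += w
            counts[other][word] += w
    return counts

# Toy corpus; a real matrix needs vastly more text than this.
text = "the cat sat on the mat the dog sat on the rug"
matrix = cooccurrence(text.split(), window=2)
print(dict(matrix["sat"]))
```

Each row of the table is the 'definition' of a word in the sense above: a weighted list of the words it keeps company with. Swapping `weight` for `lambda d: 1.0` gives a flat window in which distance does not matter.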
A dense-enough semantic matrix holds out the possibility of the simulation of meaning (under what condition would it cease to be a simulation?). A possibly apocryphal story of the 90s had it that 'bots based on word-sequences generated from semantic matrices of the entire archive of newsgroup postings had survived on newsgroups for a matter of months before it was detected that they were non-human. (The fact that the streams of bizarre but half-comprehensible text they produce resemble the writing of schizophrenics may actually aid camouflage in certain newsgroups. Over at Hyperstition the temptation is to conclude that they are in the majority ;) however we might take heart in Cilibrasi and Vitanyi's statement that "The sheer mass of the information available [on the web] about almost every conceivable topic makes it likely that extremes will cancel and the majority or average is meaningful in a low-quality approximate sense." !)
Semantic indexing has long been a much-invested-in 'holy grail' for the web (a common problem is that in searching for one meaning of a word, you are inevitably bombarded with documents pertaining to the most-commonly-used meaning - thus the French family cialis' woes upon trying to research their family tree online). Perhaps a feature-free mechanistic indexing approach will in future be utilised so that our searches will be less feature-free.
One problem with semantic indexing echoes the problems of 'expert systems', in many ways their (far more linear, top-down) predecessors: their effectiveness is invariably context-limited. The systems are generally good at creating matrices for delimited subjects, where the important keywords are easily detected and linked, but not so good at the selection of multi-contextual documents that most of us come across every day.
However some new research (Cilibrasi and Vitanyi's paper "Automatic Meaning Discovery Using Google", see link below) shows that, whereas a 'semantic search engine' to replace Google has long been a pet project, Google itself can be used as a latent co-occurrence matrix. "Normalised Google Distance" between word A and B is defined as a function of the number of results returned when searching for A alone, for B alone, and for A and B together; the web itself as the ultimate collection-quelconque of English-language documents would obviously provide the most generalised available basis for a semantic co-occurrence matrix. Thus, the only way to build a semantic search engine is to ask the web itself (some interesting connection to be made with Chaitin's Omega here, as well as the Spinoza post below - Google as the "web of the web"...). Interestingly the authors of the Google research claim the novelty of the approach is not only in "its unrestricted problem domain [and] simplicity of implementation" but also in its "manifestly ontological underpinnings".
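As a rough sketch of what such a distance looks like, the formula usually quoted from the paper is NGD(x, y) = (max(log f(x), log f(y)) - log f(x, y)) / (log N - min(log f(x), log f(y))), where f(x) is the number of pages containing x, f(x, y) the number containing both, and N the total number of pages indexed. The Python below simply evaluates that expression; the hit counts and the value of N are invented for illustration and are not real search results.

```python
import math

def ngd(fx, fy, fxy, n):
    """Normalised Google Distance from raw page counts.

    fx, fy: pages containing each term alone
    fxy:    pages containing both terms
    n:      total pages indexed (a hypothetical figure in this example)
    """
    if fxy == 0:
        # The terms never co-occur: treat them as infinitely far apart.
        return math.inf
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))

# Made-up counts, purely to show the behaviour of the formula.
N = 10**10
print(ngd(fx=10**6, fy=10**6, fxy=10**6, n=N))  # terms that always co-occur
print(ngd(fx=10**6, fy=10**6, fxy=10**2, n=N))  # terms that rarely co-occur
```

Terms that always appear together come out at distance 0, and the distance grows as the joint count shrinks relative to the individual counts. Nothing here queries a search engine; fetching real counts would be a separate step with its own caveats.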
How much longer will it be before classical and neoclassical AI researchers stop trying to build intelligence in a box, and realise it's already escaped...?
More on this when I've read the Google paper more closely...
Posted by undercurrent at January 31, 2005 01:35 PM
Comments
That's good, but would be helped if writing tools enabled the author, as they are writing, to compose semantic maps of the words that they are using. For example, if they could simply highlight a set of keywords in the text as significant and related. You could imagine a document consisting of both text and a set of semantic maps painted on top of the text. In this way it would be less dependent upon word frequency, which in many texts just doesn't give enough data for a decent map to be constructed.
We might get to build such a tool in the next version of our blog system:
http://blogs.warwick.ac.uk/rbotoole/entry/semantic_cartography_to/
The word 'ontology' seems to get abused in the semantic mapping world, often being used in place of taxonomy. Is that the case in these papers?
Posted by: RobO at February 2, 2005 12:14 AM
no, it'd be no good for microsoft-paperclip-like conceptual helpers: you need a _lot_ of material to work on, otherwise you'd just be getting granularity artefacts. Unless you'd already written 100,000 words, in which case you'd probably have quite a good idea what you were doing anyway...
Ontology, taxonomy - well, what they're doing with their 'automatic ontologies' isn't filing things away under pre-arranged species, genera, etc., the system actually discovers degrees of difference and connectivity. Not sure whether it matters whether you call it an ontology, I think they call it that because the differentiation thus discovered is presumed to derive from an ontological differentiation in 'the real' (of language, at least)
Posted by: u/c at February 3, 2005 02:59 PM
Are you saying that the search engine in some way discovers a meaning that is in some way superior, as a result of its ability to abstract statistical aggregates from a much larger body of text?
I'm interested in something quite different, that is, how meaning only exists in the process of actively creating a map, of writing and creating differences, uniqueness - at the point of undecidability and divergence. Once it becomes statistically significant it's too late, and it's likely that someone has already written a definitive statement anyway, so there's no need for this semantic mapping engine. It's the creativity that matters, making the creativity happen. That's why I'm so interested in people building and breaking their own semantic maps while they are writing.
Posted by: RobO at February 3, 2005 08:16 PM
If there's any reality or ontology in the taxonomy it's in its relation to those breakages/irreversibilities.
Posted by: RobO at February 3, 2005 08:17 PM
So you're only interested in meaning insofar as it can't be explicitly articulated and is always in the process of fleeing before the creative act? Sounds horribly Derridean (in a bad way) to me!
The whole point is that no, no-one _has_ been able to write a definitive statement, and yes, the argument is precisely that the massive statistical density allows us to extract meaning.
Think (see comment on the other post above) there are some really key anthropic prejudices being shattered by work like this - namely, the reflex valorization of ineffable/ungraspable (and invariably subject-correlated) epiphenomena in the face of small-but-palpable steps of machinic intelligence-from-outside, and the condescension towards pure connective density as somehow 'qualitatively insufficient' to produce anything more than "glorified sums" ("if you can mathematise it, then someone's probably already done it and it can't be interesting anyway" - what more glorious expression of the human spirit can there be :).
So how do you create a map of something that is not statistically-significant? Surely the problem of probability-distributions and granularity artefacts is utterly germane to this, and the only possible way to treat it procedurally? The point of undecidability and divergence of what? If it's undecidable what are you mapping?
Posted by: u/c at February 4, 2005 09:56 AM