February 17, 2005
On the Trail of the Oracle
I have devised a simple but fascinating experiment with Google: a recursive mapping through the Google database that constitutes a sort of monstrous misapplication of Vitanyi and Cilibrasi's intuition about the database's latent semantic properties. Essentially, the aim is to use the database in a way that is rigorous but does not presuppose the utility of the experimental outcome. The function I describe maps words to series of words using the Google database in a very rough-and-ready way.
Some technical parameters need to be set to control the experiment properly, and I will post a program to automatically calculate the output, but for now you can try it manually.
1. For the purposes of the experiment, we limit SearchTerms to single words. We understand these SearchTerms as numbers (which, of course, they are).
2. By the Trail function, or T, we designate the function which maps one SearchTerm number to a larger number by the following extremely simple algorithmic process:
- Search for the Searchterm (using syntax +"searchterm");
- Each main Google result consists of a Title (shown in blue) followed by what is called a 'snippet', a small extract from the page. Take the first word of the first snippet;
- Using this first word as the new SearchTerm, iterate the process;
- T for SearchTerm s is defined as the concatenation of the whole series of results, and may be finite or infinite.
3. As will be obvious, T(s) will be one of several types, comparable to the types of real number:
- Numbers which terminate (when a SearchTerm fails to return any results)
- Numbers which cycle (when the last of a series of SearchTerms outputs one of the previous SearchTerms in the series)
- Numbers which recur on one term (when a SearchTerm returns itself). This is a special case of the above. In any case, a SearchTerm-as-number really represents a multibyte binary number.
4. The first experiments with this method demonstrated the existence of recurring trails, but as yet no cyclical or terminating trails. Several observations follow:
- It is not necessarily the case that terminating results for T exist, since this would mean the existence of a word in a snippet that was not indexed for searching (this is a technical question concerning the Google database structure). However, there is no reason why we should not find cyclical terms.
- The progress of a SearchTerm along the series T owes very little to common semantic usage, because of the 'arbitrary' selection of the first word. Arguably the result owes more to large-granularity artefacts than anything else; all the same, it does represent a thoroughly rigorous mapping.
- As a corollary to this, words which are not at all cognate or related in 'normal' usage converge on the same attractors.
- One of the most easily identifiable groupings of 'basin attractors' for T, resulting in one-place recurring SearchTerm numbers, is that of tradenames. It is likely that once you reach a tradename (at least, we might say, if the e-marketing dept. has done its job properly) you will never escape.
- In real language some words, we might say, represent 'dead ends'. Some have a limited semantic connectivity. Some ('a','our','is') can connect to almost anything. This hyperconnectivity is heightened by the 'arbitrary' use of the first word of the first result. This could obviously be modulated in various ways but the present experiment provides an interesting and simple limit-case.
- There seems to be a very limited number of attractors: we were surprised to arrive in the same 'loop' several times within the space of a very few experiments. It should be possible to map the 'landscape' of T by experiment.
- The Google database is a dynamic entity, so this landscape is not static.
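As a rough illustration only, the procedure in step 2 and the types in step 3 might be sketched in Python as follows. The sketch makes several assumptions that are not part of the experiment as described: the search itself is left to a caller-supplied function (here called `first_snippet_word`, a hypothetical name) which returns the first word of the first snippet or `None` when there are no results; words are lower-cased before comparison; and a `max_steps` cut-off is added so that a very long trail is reported as "undetermined" rather than followed forever.

```python
from typing import Callable, List, Optional, Tuple


def trail(
    start: str,
    first_snippet_word: Callable[[str], Optional[str]],
    max_steps: int = 50,
) -> Tuple[List[str], str]:
    """Follow T from `start` and classify the resulting trail.

    `first_snippet_word` is a hypothetical lookup supplied by the caller;
    no particular search API is assumed here.
    """
    current = start.lower()  # assumption: case is ignored
    words = [current]
    seen = {current}
    for _ in range(max_steps):
        nxt = first_snippet_word(current)
        if nxt is None:
            # The SearchTerm returned no results: a terminating trail.
            return words, "terminating"
        nxt = nxt.lower()
        words.append(nxt)
        if nxt in seen:
            # Returning itself is the one-term special case of a cycle.
            kind = "recurring" if nxt == current else "cyclical"
            return words, kind
        seen.add(nxt)
        current = nxt
    # The cut-off was reached before the trail settled.
    return words, "undetermined"


# A stand-in lookup table transcribed from one of the trails recorded
# in this post ('smell'), so the sketch can run without any search.
RECORDED = {
    "smell": "enter",
    "enter": "magazyn",
    "magazyn": "miesiecznik",
    "miesiecznik": "peryskop",
    "peryskop": "peryskop",
}
```

Calling `trail("smell", RECORDED.get)` follows the recorded steps and classifies the trail as recurring on 'peryskop'. To run the experiment for real, `RECORDED.get` would be replaced by a function that performs the search and extracts the first word of the first snippet.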
More to report later, but for now I leave you with a few more examples. Please post any other interesting results below.
Tradename trail basin:
know iknowthat.com iknowthat.com ...
london discounts travel 3 welcome whitehouse.gov whitehouse.gov
Genre 'ownership' through trail:
porn penisbot's penisbot's ...
Massive coverage through 'ownership' of a trail of common connective linguistic parts:
test cleveland governor's homepage The A our anne designer all alltheweb alltheweb
calculus tutorials the a our anne designer all alltheweb alltheweb ...
robin us the a our anne designer all alltheweb alltheweb ...
ruth home the a our anne designer all alltheweb alltheweb ...
randomness writings eben academia next excellence copyright US The our anne designer all alltheweb alltheweb ...
Unexpected cognates (no comments please ;) via interlinguistic trails:
smell enter magazyn miesiecznik peryskop peryskop ...
dread enter magazyn miesiecznik peryskop peryskop ...
'Capture' of one tradename by another:
hyperstition further furthurnet furthurnet...
norton more in you free nedstat nedstat
'Capture' and partial escape into a subcategory:
lovecraft dedicated UK Yahoo! Finance Finance ...
'Singletons':
eleutheria eleutheria ...
liberty liberty ...
apocalypse apocalypse...
derrida derrida ...
Cognates, quickly-settling recurrences:
roulette outlines executable executable ...
anybody executable executable ...
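As a first step towards mapping the 'landscape' of T, the trails listed above can be grouped by the attractor they settle on. The following is only a minimal sketch: it assumes each trail is written as a line of space-separated words ending on its recurring term, it ignores case, and the names `TRAILS` and `basins` are invented for the illustration. The data is a subset of the trails recorded in this post, with the trailing "..." omitted.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

# A subset of the trails recorded in this post, one per line.
TRAILS = [
    "calculus tutorials the a our anne designer all alltheweb alltheweb",
    "robin us the a our anne designer all alltheweb alltheweb",
    "ruth home the a our anne designer all alltheweb alltheweb",
    "smell enter magazyn miesiecznik peryskop peryskop",
    "dread enter magazyn miesiecznik peryskop peryskop",
    "roulette outlines executable executable",
    "anybody executable executable",
]


def basins(trails: Iterable[str]) -> Dict[str, List[str]]:
    """Map each attractor (last word) to the words whose trails reach it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for line in trails:
        words = line.lower().split()
        if not words:
            continue  # skip blank lines
        # Assumption: the trail ends on a one-term recurrence, so the
        # final word is the attractor and the first is the starting term.
        groups[words[-1]].append(words[0])
    return dict(groups)
```

With the data above, `basins(TRAILS)` puts 'calculus', 'robin' and 'ruth' in the basin of 'alltheweb', 'smell' and 'dread' in that of 'peryskop', and 'roulette' and 'anybody' in that of 'executable'. Since the database is dynamic, any such grouping is a snapshot.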
Posted by robin at February 17, 2005 01:01 PM