The lining of forgetting: raw questions
Here follows a list of the raw questions we have about archiving (in progress).
*Words:
For some time yet, digital searching and archival storage will remain a question of words: for a computer, interpreting meaningful patterns in images or sounds is still very problematic.
Classifying sounds and images still relies on textual description: keywords and term hierarchies (thesauri). How can we take into account and assess the limits of this way of describing these different media?
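As a purely illustrative sketch, the fragment below shows what this reliance on words looks like in practice: a tiny invented thesaurus of broader and narrower terms, and a sound file that exists for the machine only through its keywords. All names and terms are made up for the example.

    # Minimal, illustrative sketch: describing sounds and images through words alone.
    # A tiny invented thesaurus maps broader terms to narrower ones.
    thesaurus = {
        "sound": ["speech", "music", "field recording"],
        "image": ["photograph", "drawing", "scan"],
        "photograph": ["portrait", "landscape"],
    }

    def broader_terms(term, thesaurus):
        # Walk back up the invented hierarchy to find every broader term.
        return [b for b, narrower in thesaurus.items() if term in narrower]

    # The archived item is described, and therefore retrieved, only through these words;
    # the recording itself plays no part in the search.
    record = {
        "file": "interview_1987.wav",
        "keywords": ["speech", "field recording"],
    }

    for kw in record["keywords"]:
        print(kw, "->", broader_terms(kw, thesaurus))

Everything the keywords leave out (tone, hesitation, background noise) is simply absent from the archive's point of view.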
*Existing structures:
In order to archive media and texts, there are standards and methods of classification (Dublin Core, RDF); a minimal example is sketched after these questions.
What do these systems introduce into the content?
For instance, to which idea of a text or of an author do they refer?
What are the consequences of formatting information to fit classification criteria? What becomes apparent, what disappears?
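To make these questions concrete, here is a minimal, hypothetical record using a handful of Dublin Core element names; the document, field values and helper function are invented for illustration. Whatever fits the fields becomes findable; whatever has no field leaves no trace in the description.

    # A hypothetical record built from a few Dublin Core element names.
    record = {
        "dc:title": "Notes on forgetting",
        "dc:creator": "Anonymous collective",
        "dc:date": "1998",
        "dc:format": "audio/mpeg",
        "dc:subject": ["memory", "archive"],
    }

    # Retrieval can only ask the questions the schema anticipates.
    def find_by_subject(records, term):
        return [r for r in records if term in r.get("dc:subject", [])]

    print(find_by_subject([record], "memory"))
    # Anything without a slot in the schema disappears from the description.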
*Supply and demand:
Is a record an offer of information, displayed in optimal fashion in relation to the supposed request? Or is it a narrative, a mythology recreating the identity of what is being archived?
Is a search for data to be preferred because it gets us straight to the information, or because the path to reach it is the most interesting?
*Hypervisibility:
Is a society with a faultless and complete memory conceivable?
If human beings never forgot anything, they would go mad. What would happen to a society whose members never forgot anything?
Can we learn from the errors of imperfect archival systems?
*Database politics:
Producing data filed in precise forms can give certain groups visibility; it also has an exchange value. For instance, databases set up by parents of children with genetic disorders collect large amounts of annotated information on analysed tissue. Researchers' access to this data has been negotiated so that the parents are involved in the decision-making process that directs the research.
*Algorithms and ontologies:
So far, the two favoured methods for searching archival data on the net are Google and "ontologies", sets of terms intended to describe digital documents structurally.
Google's success is based on exploiting the mixture of content and form found in HTML: the more chaotic the web, the more Google's algorithms are worth. On the other hand, standards such as XML are being developed which postulate that content and form can be separated. Content could then be described structurally via sets of categories belonging to ontologies. The latter are developed within the W3C consortium, which implies that the whole world is capable of agreeing on a way to describe the structure of the contents exchanged on the web. In reality, only certain types of content are profusely described: those produced by universities and industry, and technical documents.
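The contrast can be sketched in a few lines; the data and the mini-vocabulary below are chosen purely for illustration. On one side, content and form arrive mixed together in HTML and meaning has to be guessed by an algorithm; on the other, the document is described through agreed categories and a query becomes trivial, provided everyone accepts those categories.

    # What Google scrapes: content and form entangled in HTML,
    # meaning must be inferred from tags, links and surrounding text.
    html_fragment = "<td><b>Duras, M.</b></td><td>Moderato cantabile</td><td>1958</td>"

    # What the ontology approach postulates: the same document described
    # through categories from a shared (here invented) vocabulary.
    structured = {
        "type": "Novel",
        "author": "Duras, M.",
        "title": "Moderato cantabile",
        "year": 1958,
    }

    # With structure, the query needs no clever algorithm at all...
    novels_1958 = [structured] if structured["type"] == "Novel" and structured["year"] == 1958 else []
    print(novels_1958)
    # ...but it only works if everyone agrees on "Novel", "author", "title" and "year".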
Many observers fear this principle might be diverted by the main software producers, who could implement "false" versions of these ontologies and gradually introduce confusion between universal standards and their own firm's idiom.
*Questions arising from this are:
What is at stake if Google becomes not only a method of access but also the controller of the archives?
What are the consequences of opaque algorithms?
And:
What will the effective power structure be when consortia set up ontologies with "universal" application?
What ideology could underpin the idea that the content and form of a document can be completely separated?
"Joi Ito, presumably in response to this news, wrote the following about a possible Microsoft strategy as regards to Google, searching, and metadata:
Google likes scraping html, mixing it with their secret sauce and creating the all-mighty page ranking. Anything that detracts value from this rocket science or makes things complicated for Google or easy for other people is probably a bad thing for Google.
...
I have a feeling they (Microsoft) will embrace a lot of the open standards that we are creating in the blog space now, but that they will add their usual garbage afterwards in the name spaces and metadata so that at the end of the day it all turns funky and Microsoft.
That's a good read. The power behind Google is that the company owns the algorithms used to find data from the featureless mess of HTML that exists today. The more sophisticated the data storage, the less important the algorithms, and the less edge that Google has. Microsoft, by controlling the origination of much of this data can build in the missing knowledge about the data and basically undercut the ground on which the House of Google is written.
I also agree with Joi -- Microsoft will then make it proprietary by their own funkiness. For those who think RDF is bad, try working with MS generated XML. If you've ever seen what the company can do to HTML generated from Word, I rest my case." (Shelley Powers RDF Manual)