12 June 2010

Semantic Web - Part 1 (What is it?)

The science fiction future that futurologists love predicting is some way away because meaningful data is currently not well integrated. You cannot have anything like the rampant “Imagine...” scenarios (typically using “Agents” and RFIDs) that futurologists describe, at least not on a wide scale, until a Single Version of Truth (SVOT) is agreed upon for the data we are looking at (disambiguation, as it is called) and that data is integrated wholesale.

Many of these scenarios will happen since they portray a more efficient, opportunity-filled or simply fun lifestyle; one enabled by information. Someone in the future will find a way to monetize it (maybe in a Minority Report/advertising kind of way) because market forces always apply.

Let us try one out in the next paragraph to illustrate:

Imagine you are in a bar. A distributed agent *knows* your location through your mobile. It also *knows* that the CEO of a large company in your industry is in the same bar and that that company is hiring for a role one level above your current one. Of course, it *knows* that you have been in your current role for a while. It does not *know* whether the CEO has any people with him at the moment (or is waiting for them) as he has restricted this information by his personal privacy settings. The agent suggests though (by texting or calling you) that it is worth you going up to the CEO and introducing yourself but not before it has already emailed him your resume and informed you that his drink of choice is a dry Martini.

This scenario is by turns fantastically cool, moderately disturbing, highly efficient, opportunistically enabling and culturally changing. Some version of it will likely happen. What it is not is artificially intelligent. The technology to do it all exists right now (mine GPS location data in real-time to determine matches against pre-set scenarios, e.g. connecting people for jobs, then check social encounter rules and privacy settings), and the rule-based logic involved is straightforward (if people are in the same location and the same industry and a networking opportunity exists then...). What will prevent our vista from happening is that the data required to fulfil this scenario is held in different formats and is in any case secured (since there is little reason for the owners to share it). A lesser inhibitor is the rule-based logic: straightforward, certainly, but the types of scenarios we are talking about require a lot of rules and it is unclear who will maintain them.

The future agent does not *know* anything; it has simply traversed the Internet (using look-up tables or schemas) to find an SVOT (because data is well integrated, your location is stored unambiguously) and acted upon it as directed by predefined rule-based logic. Basically it has acted like any program around today (but on better data).
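To make the "any program around today" point concrete, here is a minimal Python sketch of the bar rule written as ordinary hard-coded logic. The `Person` fields, the `networking_suggestions` function and the sample people are all hypothetical names invented for illustration; no real agent or service is implied.

```python
from dataclasses import dataclass


@dataclass
class Person:
    # All fields are illustrative assumptions, not a real schema.
    name: str
    location: str
    industry: str
    hiring: bool = False          # this person's company has an open role
    open_to_roles: bool = False   # this person might want a new role


def networking_suggestions(people):
    """Return (seeker, contact) name pairs that satisfy the hard-coded rule:
    same location, same industry, and a networking opportunity exists."""
    suggestions = []
    for seeker in people:
        for contact in people:
            if seeker is contact:
                continue
            if (seeker.open_to_roles and contact.hiring
                    and seeker.location == contact.location
                    and seeker.industry == contact.industry):
                suggestions.append((seeker.name, contact.name))
    return suggestions


people = [
    Person("you", "O'Malley's bar", "software", open_to_roles=True),
    Person("ceo", "O'Malley's bar", "software", hiring=True),
    Person("stranger", "O'Malley's bar", "retail", hiring=True),
]

print(networking_suggestions(people))
```

The rule itself is trivial. The hard part is getting trustworthy `location` and `industry` values for everyone in the first place, and every new scenario needs another rule like this one that somebody has to write and maintain.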

To fully integrate data you need senders and receivers to agree on a standard for both storage and communication (storing data in one format and communicating it in another defeats the purpose of data integration). This standard needs to be simple (since we also want to exchange data with mobile and embedded devices and generally want rapid and broad diffusion of the format) and must not restrict others from building niche and more complex standards on top. The simplest standard is a fact (sales, personal etc.). A fact can be condensed down to three parts: something that the fact is about (the subject), something about the subject (the predicate) and something that is related to the subject through the predicate (the object). Examples are:

You (subject) located in (predicate) bar (object)
bar (subject) place of (predicate) socializing (object)

You cannot decompose facts any more than this; otherwise they would not tell us anything. It is conceptually akin to storing data at a physical level as 0s and 1s. Any type of information in the world can ultimately be stored as a list of linked facts in this way.

I (subject) purchase (predicate) Jack Daniels (object)

What is missing here is (you might ask) the timestamp and location; don’t we have to add them as columns four and five? Surely our future scenario needs that information? No – the idea is that we stick with the simple triple representation and it becomes:

I (subject) purchase (predicate) Jack Daniels (object)
My purchase (subject) was timed at (predicate) 1430HRS (object)
My purchase (subject) was located at (predicate) O'Malley's bar (object)
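As a rough illustration, the three facts above can be held as plain (subject, predicate, object) tuples in Python and queried by pattern. The `match` helper is a hypothetical name made up for this sketch; it is not part of RDF or of any particular library.

```python
# Each fact is a (subject, predicate, object) tuple.
facts = [
    ("I", "purchase", "Jack Daniels"),
    ("My purchase", "was timed at", "1430HRS"),
    ("My purchase", "was located at", "O'Malley's bar"),
]


def match(facts, subject=None, predicate=None, obj=None):
    """Return the facts that fit the pattern; None acts as a wildcard."""
    return [
        (s, p, o) for (s, p, o) in facts
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


# Everything we know about the purchase, without a fourth or fifth column.
about_purchase = match(facts, subject="My purchase")
print(about_purchase)
```

Adding the time and place meant adding two more rows, not two more columns, so the shape of every fact stays the same.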

While it is certainly true that much rule-based logic will always be required to fulfil the type of scenarios above, the amount of it is significantly reduced by the ability to make inferences using facts. Consider our facts:

You (subject) located in (predicate) bar (object)

This fact is automatically generated by your phone broadcasting its GPS location, that location being matched to commercial premises, and the premises finally being cross-referenced against their business type.

bar (subject) place of (predicate) socializing (object)

This is a core fact that never changes and was created by some internationally accepted community effort. Because we have a like-term (bar), we can now infer that:

You (subject) are currently (predicate) socializing (object).

You have not specifically told anyone that you are socializing. It has not been encoded. Indeed, it may be the middle of the afternoon so, absent further information, anyone might otherwise have assumed you were at work. We could have built in rule-based logic to achieve the same result (if you are in a bar then you are socializing) but we have been saved the trouble by inference. Performance has been maintained as the inference was done in memory. This type of logic, the syllogism, has been around since at least the Ancient Greeks. The implicit knowledge that both you and the CEO are physically in the same informal situation at the same time allows an opportunistic suggestion to be made; opening and closing a loop in real-time without having written a rule for it.
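The inference above can be sketched in a few lines of Python. This is only a toy: it applies one generic pattern (X located in P, P place of A, therefore X are currently A) written by hand, and `infer_activities` is a hypothetical name. A real reasoner would work from a shared vocabulary and not from predicate strings hard-coded like this.

```python
facts = {
    ("You", "located in", "bar"),
    ("bar", "place of", "socializing"),
}


def infer_activities(facts):
    """Derive (x, 'are currently', activity) by joining 'located in'
    facts to 'place of' facts on the shared place."""
    inferred = set()
    for (s1, p1, o1) in facts:
        if p1 != "located in":
            continue
        for (s2, p2, o2) in facts:
            # The like-term: the object of the first fact is the
            # subject of the second.
            if p2 == "place of" and s2 == o1:
                inferred.add((s1, "are currently", o2))
    return inferred


print(infer_activities(facts))
```

Nothing in the function mentions bars. Add a fact such as (office, place of, working) and the same join covers it, which is where the saving over one hand-written rule per venue comes from.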

If everyone used the fact format (and its inferencing, managed by reasoners) for data storage and communication then we should all be able to resign from our jobs and hang out in bars; secure in the knowledge that we have reached a technological plateau and an agent will at some point fix us up with a new role. Imagine.

The existing Internet is still very page focussed. You go to a job search site and have to trawl through pages of job descriptions, applying your own experience to decide which ones are interesting e.g. is a “Sales Executive” the same as a “Business Development Executive”? Does that term differ by industry category? If so, should I follow these links instead? You have to do a lot of work to find the things you want. So much so that you either give up or end up with things that you don’t want. Using the fact format at Internet scale with disambiguation removes the necessity for humans to contextualise the data and so enables machines to better process it; which in turn leads to more pervasive automation and those Imagine scenarios. This is what is meant by the Semantic Web (Web 3.0/SemWeb).

Is the Semantic Web inevitable? Opinion is divided to say the least. The Imagine scenarios (or variations of them) are inevitable. They absolutely require disambiguation and wholesale data integration. This in turn necessitates a standard for storage and communication of facts. The Semantic Web is an attempt (the only one in town) to deliver that fact standard. Inferencing must be considered an optional component of the Semantic Web. It may uncover previously unknown information or simply be required to make it work at scale. It may require too much effort to be practicable for many due to its current reliance on an Open World Assumption (OWA).

The core promise of the Semantic Web is disambiguation and wholesale data integration (Linked Data). It is the primary enabler for data-mashups. There are certain parallels with the early days of the Object Orientated Programming (OOP) movement. The Semantic Web is inevitable but it won't be called such. It will still be the Web.

There is an established fact format right now – RDF (Resource Description Framework). Many of the supporting systems are also in place e.g. query and ontology languages (SPARQL and OWL respectively). They have been quietly developed over the last eight years or so and all focus around the core premise of the simple fact (or triple as it is known [subject/predicate/object]). The next post will explore why we have yet to see widespread adoption of these technologies.

UPDATE: Great source of further reading links here.
