13 June 2010

Semantic Web - Part 2 (Where is it?)

The enterprise in general has barely given these technologies a second thought to date, and the consumer has little idea about them (short of a vague notion that Web 3.0 will make the Web more intelligent). The concept of storing more data in a graph/RDF format remains disputed. MSFT, for example (as a populist bridge between the two), could be said to have a less-than-enthusiastic approach (all the main APIs are in Java, with .NET versions maintained only by enthusiasts). Few MSFT products use RDF internally (Media Management being one); none (including SQL Server) use OWL/SPARQL. Google have their recent Rich Snippets initiative, but their Chart API currently only works with spreadsheets (rather than RDF). Facebook are actively pursuing the graph format - it's even on a secret logo inside Mark Zuckerberg's hoodie. Twitter have recently announced Annotations - a way to add meta-data to tweets (which could be used semantically in future). Some pioneering sites use RDF, e.g. Glue, Drupal and Tripit, but there are no killer apps yet.
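
For anyone who has not met the format, the model is simple: every fact is a (subject, predicate, object) triple. Below is a minimal sketch using the Python rdflib library; the example.org namespace and the people in it are invented for illustration:

```python
from rdflib import Graph, Namespace
from rdflib.namespace import RDF, FOAF

# A hypothetical namespace for our own resources
EX = Namespace("http://example.org/")

g = Graph()
# Each fact is one (subject, predicate, object) triple
g.add((EX.alice, RDF.type, FOAF.Person))
g.add((EX.alice, FOAF.knows, EX.bob))
g.add((EX.bob, RDF.type, FOAF.Person))

# Serialise to Turtle, one of the standard RDF formats
print(g.serialize(format="turtle"))
```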

The world awaits an application that inherently makes a tool of the Semantic Web. It will likely be focussed on disambiguation, since wholesale data integration is a tougher nut to crack.

The reasons follow, in descending order of importance to date:

1) Openness. There are few clear reasons for the enterprise (the people who manage the vast majority of data) to be more open with its data, especially its raw data (as opposed to its massaged/reporting data). Government has a remit of transparency, so it has more data in this format.
2) Federation. Of business processes, that is. A host of facts (and the rule-based logic linking them together) is required to make the above scenario (and anything like it) function, all working across several different organisations with different remits, each taking a revenue share. Building federated applications using other people's data is also rife with SLA and legal issues.
3) Performance. Storing data as facts (triples or a graph) results, in most cases, in a poorly performing back-end. Relational databases are widely recognised as the most efficient option for most scenarios, and are therefore what most organisations use. Tests indicate triple queries are on average 2-20 times slower than an equivalent relational query (see the sketch after this list for why); that alone instantly rules out OWL/RDF and SPARQL for widespread use in the enterprise. There are also huge uncertainties over how distributed SPARQL queries and reasoners will work in practice - at Internet scale.
4) Ambiguity. Many believe that the real world is just too complex to be mapped:
a. It is simply not able to be boxed and named in terms of categories and relationships (or an ontology, as it is called) - essentially the equivalent of the schema or Entity Relationship Diagram (ERD) of a relational database.
b. Linking ontologies together accurately (differing owners, world-views, drivers) is impracticable at Internet scale. Related to this is the almost philosophical issue around the use of a URI to identify a resource: it is hard to give a crisp definition of what 'representation' (of a resource) means in order to justify an assertion, e.g. an image of Texas does not 'represent' Texas.
c. The recombination of facts necessary for inference is too simplistic, e.g. adding the triple drunks (subject) frequent (predicate) bars (object) to our scenario might allow our agent to infer that the CEO is a drunk. This may or may not be true, but given the known facts it is a rash statement to make (especially if you are considering networking with him). Worse, you might not even know that the agent considers the CEO a drunk (it will just be one of the many factors it uses to propose actions for you), which makes bad decisions difficult to debug or improve.
5) Validation. Graph data is less structured than the relational format, which makes validating data integrity challenging. Data quality is a huge issue for the enterprise: many CIOs will look to the ceiling when you ask them about their data quality, and meta-data (which the Semantic Web needs to function) takes a back seat in priority. Existing relational databases have much better tools and processes.
6) Semantics. The Semantic Web community have not helped themselves by casually using unclear terms, bolting reasoning/inference onto the Semantic Web definition and generally setting up camps about, well, semantics. Running parallel to Semantic Web development has been Natural Language Processing (NLP), which, by contrast, has a clearer mission statement, can achieve some of the same goals and is actually more about human-language semantics than the Semantic Web is.
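
To make the performance point (3) concrete, here is a sketch of the same lookup in both worlds, again using rdflib; the data file, table and property names are invented. The relational version reads one wide row, while the triple version must match one graph pattern per attribute, effectively a self-join on the triple table for each predicate - a major source of the measured slowdown:

```python
from rdflib import Graph

g = Graph()
g.parse("people.ttl", format="turtle")  # hypothetical data file

# Relational equivalent - one indexed read of a single wide row:
#   SELECT name, employer FROM people WHERE person_id = 42;
#
# The triple version below needs one pattern match (one join)
# per predicate requested.
query = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX ex:   <http://example.org/>
SELECT ?name ?employer
WHERE {
    ?person foaf:name ?name .
    ?person ex:worksFor ?employer .
}
"""
for row in g.query(query):
    print(row.name, row.employer)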

The first two above are related. There is simply an absence of reason for the enterprise to be more open with its data and to link its transactions and business processes with other enterprises right now. However, it is fair to say there is a small revolution going on in respect of consumer personal openness, or privacy. Consumers are seriously considering the benefits of publishing their purchases and sharing their screens for mass consumption. At the same time, Facebook is criticised for being too open and free-wheeling with personal data. What this tells us is that the consumer wants to be in control of their data. This desire to control their data, all of it, all the time, is essentially a form of self-actualisation: managing their on-line presence - facts, hopes, desires, history and image. Self-actualisation, as popularised by Maslow, is an established end-game for individuals and society in general. This will happen initially due to consumer demand. The enterprise will then be dragged into the process by the more powerful forces of expectation and socialisation (Web 2.0) - market forces. They will have little choice but to serve up their data and integrate with others. This could happen surprisingly quickly - within a year: six months of observed consumer demand, three months to get an enterprise pilot going and another three for everyone to assess their new situation (slightly faster than Web 2.0 enterprise adoption) and bang - Web 3.0 is a competitive differentiator.

It is tempting to suggest that performance will become a non-issue thanks to better future technology (and the network is more of a bottleneck than the database for many applications), but the reality is that the information explosion is so potent that it will broadly keep pace with Moore’s law and keep performance in play. Poor performance simply blocks enterprise adoption right now. Creative solutions have been mooted, e.g. swarm-based reasoners, and similarly high-concept solutions are likely to be necessary, since the performance gap is huge.

Ambiguity removal is essentially what the Semantic Web is all about. Objections around whether the world can be mapped in general are valid concerns; they can certainly be well illustrated by examples of literally stupid inferences. Such examples are built on minuscule datasets, though. With the real live Internet (as with all complex systems), outliers even out at scale.
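
It is also worth being concrete about what 'linking ontologies together' (point 4b) actually involves: typically a single equivalence assertion per pair of identifiers. A sketch with two invented vocabularies, again in rdflib:

```python
from rdflib import Graph, Namespace
from rdflib.namespace import OWL

# Two hypothetical vocabularies that describe the same company
ACME = Namespace("http://acme.example/ontology/")
CRM = Namespace("http://crm.example/schema/")

g = Graph()
# One owl:sameAs assertion merges the two identities, so facts
# recorded against either URI now describe the same resource.
g.add((ACME.AcmeCorp, OWL.sameAs, CRM.customer_1042))
```

Each such link is a human judgement call; whether millions of them, made by owners with differing world-views, converge or conflict is exactly the scale question raised above.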

It is easy to imagine a massively connected system refining ontologies in real time based on user input (although people would certainly need some carrot to do this). It is less easy to imagine this happening for reasoning, but a great algorithm could well emerge that will simply not be questioned (or even understood by the majority of the population) because it just works most of the time and anyway, when it doesn’t, you can’t really tell - PageRank being the obvious precedent.
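
To see why reasoning is the harder half, here is a deliberately naive sketch of the kind of rule application behind the drunks/bars example above - a hand-rolled rule over plain tuples, not a real OWL reasoner, with all facts invented:

```python
# Naive forward-chaining over (subject, predicate, object) facts.
facts = {
    ("drunks", "frequent", "bars"),
    ("the_ceo", "frequents", "bars"),
}

# Rule: if a class frequents a place and an individual frequents
# the same place, conclude the individual belongs to the class.
# This is exactly the rash leap described in point 4c.
def infer(facts):
    inferred = set()
    for cls, p1, place in facts:
        if p1 != "frequent":
            continue
        for person, p2, place2 in facts:
            if p2 == "frequents" and place2 == place:
                inferred.add((person, "is_a", cls))
    return inferred

print(infer(facts))  # {('the_ceo', 'is_a', 'drunks')}
```

The rule fires with complete confidence on just two facts; a reasoner at Internet scale would apply thousands of such rules over billions of facts, and unwinding a bad conclusion means debugging that entire chain.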

Tool interest will be stimulated once the inhibitors above start to be addressed. Tangential to this, a general consumer focus on personal data will mean organisations are compelled to improve their customer data in order to meet new consumer-driven activity (personal profiles, reputation management).

The semantics point is ultimately an artefact of Semantic Web technologies not having been commercialised. Once the enterprise gets involved, efficiency will drive out any hint of academic ownership. NLP is actually complementary to the Semantic Web, since it can act as a front-end providing access to the semantic data.
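
As a toy illustration of that front-end role, a natural-language question can be reduced to an entity and slotted into a SPARQL template. The extraction below is deliberately crude and the ex: vocabulary invented; real NLP pipelines are far richer:

```python
import re

# Crude stand-in for NLP: pull the entity out of one question shape.
def question_to_sparql(question):
    match = re.search(r"who does (\w+) work for", question, re.IGNORECASE)
    if not match:
        raise ValueError("unsupported question")
    person = match.group(1).lower()
    # Slot the extracted entity into a SPARQL template.
    return f"""
PREFIX ex: <http://example.org/>
SELECT ?employer
WHERE {{ ex:{person} ex:worksFor ?employer . }}
"""

print(question_to_sparql("Who does Alice work for?"))
```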

None of the inhibitors above are outright Semantic Web deal breakers; it is not conceptually flawed. If RDF/OWL or SPARQL lack features that implementations require (grouping, BI/aggregation, lineage, etc.), they can change; they are still evolving. Collectively, though, the inhibitors are assuredly deal breakers for widespread adoption in the five-to-ten-year range. Before that time, as it seeps gradually into the collective consciousness, performance and reasoning visibility will likely become the main inhibitors. It is not an all-or-nothing idea, though: a little semantics goes a long way, as they say. The next post will explore how elements of the Semantic Web can be utilised now.
