13 June 2010

Semantic Web - Part 2 (Where is it?)

The enterprise in general has barely given these technologies a second thought to date and the consumer has little idea about them (beyond a vague sense that Web 3.0 will make the Web more intelligent). The concept of storing more data in a graph/RDF format remains disputed. MSFT, for example (as a populist bridge between the two camps), could be said to have a less-than-enthusiastic approach (all the main APIs are in Java, with .NET versions maintained only by enthusiasts). Few MSFT products use RDF internally (Media Management does). None (including SQL Server) use OWL/SPARQL. Google have their recent Rich Snippets initiative, but their Chart API currently only works with spreadsheets (rather than RDF). Facebook are actively pursuing the graph format - it's even on a secret logo inside Mark Zuckerberg's hoodie. Twitter have recently announced annotations, a way to add meta-data to tweets (which could be used semantically in future). Some pioneering sites use RDF, e.g. Glue, Drupal and TripIt, but there are no killer apps yet.

The world awaits an application that inherently makes a tool of the Semantic Web. This will likely be focussed around disambiguation since wholesale data integration is a tougher nut to crack.

Reasons follow (in descending order of importance to date):

1) Openness. There are few clear reasons for the enterprise (which manages the vast majority of data) to be more open with its data, especially its raw data (as opposed to its massaged/reporting data). Government has a remit of transparency, so more of its data is available in this format.
2) Federation of business processes. A host of facts (and the rule-based logic linking them together) is required to make the above scenario (and anything like it) function, all working across several different organisations with different remits, each taking a revenue share. Building federated applications using other people's data is also rife with SLA and legal issues.
3) Performance. Storing data as facts (triples or a graph) results, in most cases, in a poorly performing back-end. Relational databases are widely recognised as the most efficient for most scenarios and are therefore what most organisations use. Tests indicate triple queries are on average 2-20 times slower than an equivalent relational query. This alone instantly rules out OWL/RDF and SPARQL for widespread use in the enterprise. There are also huge uncertainties over how distributed SPARQL queries and reasoners will work in practice - at Internet scale.
4) Ambiguity. Many believe that the real world is just too complex to be mapped:
a. The world is just not able to be boxed and named in terms of categorisation and relationship (or an ontology, as it is called). An ontology is essentially the same as the schema or Entity Relationship Diagram (ERD) for a relational database.
b. Linking ontologies together accurately (differing owners, world-views, drivers) is impracticable at Internet scale. Related to this is the almost philosophical issue around using a URI to identify a resource. It is hard to give a crisp definition of what a 'representation' (of a resource) means in order to justify an assertion, e.g. an image of Texas does not 'represent' Texas.
c. The recombination of facts necessary for inference is too simplistic, e.g. adding drunks (subject) frequent (predicate) bars (object) to our scenario might allow our agent to infer that the CEO is therefore a drunk (see the sketch after this list). This may or may not be true but, given the known facts, it is a rash statement to make (especially if you are considering networking with him). You might not even know that the agent considers the CEO to be a drunk (it will just be one of the many factors it uses to provide actions for you). This makes the situation much worse, since bad decisions are then difficult to debug/improve.
5) Validation. Graph data is less structured than a relational format, which makes validating data integrity challenging. Data quality is a huge issue for the enterprise. Many CIOs will look to the ceiling when you ask them about their data quality, and meta-data (which the Semantic Web needs in order to function) takes a back seat in priority. Existing relational databases have much better tools and processes.
6) Semantics. The Semantic Web community has not helped itself by casually using unclear terms, bolting reasoning/inference onto the Semantic Web definition and generally setting up camps about, well, semantics. Running parallel to Semantic Web development has been Natural Language Processing (NLP) which, by contrast, has a clearer mission statement, can achieve some of the same goals and is actually more about human-language semantics than the Semantic Web is.
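
To make point 4c concrete, here is a minimal, hypothetical sketch (plain Python rather than a real reasoner, with invented facts and an invented rule) of how a naive rule over class-level facts produces exactly that kind of rash conclusion:

```python
# Asserted facts as (subject, predicate, object) triples - illustrative only.
facts = {
    ("drunks", "frequent", "bars"),    # class-level generalisation
    ("CEO", "located in", "bar"),      # instance-level observation
    ("bar", "instance of", "bars"),
}

def naive_infer(facts):
    """Naively treat 'X frequent Y' as 'anyone found in a Y is an X'."""
    inferred = set()
    for (who, pred, place_class) in facts:
        if pred != "frequent":
            continue
        for (person, pred2, place) in facts:
            if pred2 == "located in" and (place, "instance of", place_class) in facts:
                inferred.add((person, "is a", who))
    return inferred

print(naive_infer(facts))  # {('CEO', 'is a', 'drunks')} - plausible-looking, rash in practice
```

Each individual step looks reasonable, which is precisely why the bad conclusion is hard to spot and debug once it is buried among thousands of other inferred facts.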

The first two above are related. There is simply an absence of reasons for the enterprise to be more open with its data and to link its transactions and business processes with other enterprises right now. However, it is fair to say there is a small revolution going on in respect of consumer openness and privacy. Consumers are seriously considering the benefits of publishing their purchases and sharing their screens for mass consumption. At the same time, Facebook is criticized for being too open and free-wheeling with personal data. What this tells us is that the consumer wants to be in control of their data. This desire to control their data, all of it, all the time, is essentially a form of self-actualisation: managing their on-line presence - facts, hopes, desires, history and image. Self-actualisation, as popularised by Maslow, is an established end-game for individuals and society in general. This openness will happen initially due to consumer demand. The enterprise will then be dragged into the process by more powerful forces of expectation and socialization (Web 2.0) - market forces. They will have little choice but to serve up their data and integrate with others. This could happen surprisingly quickly - within a year: six months of observed consumer demand, three months to get an enterprise pilot going and another three for everyone to assess their new situation (slightly faster than Web 2.0 enterprise adoption) and bang - Web 3.0 is a competitive differentiator.

It is tempting to suggest that performance will become a non-issue thanks to better future technology (and the network is more of a bottleneck than the database for many applications), but the reality is that the information explosion is so potent that it will broadly keep pace with Moore's law, keeping performance in play. Poor performance simply blocks enterprise adoption right now. There are creative solutions mooted, e.g. swarm-based reasoners. Similar high-concept solutions are likely necessary since the performance gap is huge.

Ambiguity removal is essentially what the Semantic Web is all about. Objections around whether the world can be mapped in general terms are valid concerns; they can certainly be well illustrated by examples of frankly stupid inferences that can be made. Such examples are based on minuscule datasets though. With the real, live Internet (as with all complex systems), outliers even out at scale.

It is easy to imagine a massively connected system refining ontologies in real time based on user input (although people would certainly need some carrot to do this). It is less easy to imagine this happening for reasoning, but a great algorithm could well emerge that will simply not be questioned (or even understood by the majority of the population) because it just works most of the time and, anyway, when it doesn't you can't really tell, e.g. PageRank.

Tool interest will be stimulated once the inhibitors above start to be addressed. Tangential to this, a general consumer focus on their personal data will mean organisations are compelled to improve their customer data in order to meet new consumer-driven activity (personal profiles, reputation management).

The semantics point is ultimately an artefact of Semantic Web technologies not yet having been commercialized. Once the enterprise gets involved, efficiency will drive out any hint of academic ownership. NLP is actually complementary to the Semantic Web since it provides a front-end for accessing semantic data.

None of the inhibitors above are outright Semantic Web deal breakers. It is not conceptually flawed. If RDF/OWL or SPARQL fall short of implementation requirements (grouping, BI/aggregation, lineage etc.), they can change; they are still evolving. Collectively though, the inhibitors are assuredly deal breakers for widespread adoption in the five to ten year range. Before that time, as it seeps gradually into the collective consciousness, performance and reasoning visibility will likely become the main inhibitors. It is not an all-or-nothing idea though. A little semantics goes a long way, as they say. The next post will explore how elements of the Semantic Web can be utilised now.

12 June 2010

Semantic Web - Part 1 (What is it?)

The science fiction future that futurologists love predicting is some way off because meaningful data is currently not well integrated. You cannot have anything like the rampant "Imagine..." scenarios (typically involving "agents" and RFIDs) that futurologists describe on a wide scale until, in terms of data, a Single Version of Truth (SVOT) is agreed upon for the data we are looking at (or disambiguation, as it is called) and that data is integrated wholesale.

Many of these scenarios will happen since they portray a more efficient, opportunity-filled or simply fun lifestyle; one enabled by information. Someone in the future will find a way to monetize it (maybe in a Minority Report/advertising kind of way) because market forces always apply.

Let us try one out in the next paragraph to illustrate:

Imagine you are in a bar. A distributed agent *knows* your location through your mobile. It also *knows* that the CEO of a large company in your industry is in the same bar and that that company is hiring for a role one level above your current one. Of course, it *knows* that you have been in your current role for a while. It does not *know* whether the CEO has any people with him at the moment (or is waiting for them) as he has restricted this information via his personal privacy settings. The agent suggests, though (by texting or calling you), that it is worth going up to the CEO and introducing yourself - but not before it has already emailed him your resume and informed you that his drink of choice is a dry Martini.

This scenario is by turns fantastically cool, moderately disturbing, highly efficient, opportunistically enabling and culturally changing. Some version of it will likely happen. What it is not is artificially intelligent. The technology to do it all exists right now (mine GPS location data in real time to determine matches against pre-set scenarios, e.g. connecting people for jobs, then check social-encounter rules and privacy settings) and the rule-based logic involved is straightforward (if people are in the same location and the same industry and a networking opportunity exists then...). What will prevent our vista from happening is that the data required to fulfil this scenario is in different formats and, in any case, secured (since there is little reason for the owners to share it). A lesser inhibitor is the rule-based logic: straightforward certainly, but the types of scenarios we are talking about require a lot of rules and it is unclear who will maintain them.

The future agent does not *know* anything; it has simply traversed the Internet (using look-up tables or schemas) to find an SVOT (because the data is well integrated, your location is stored unambiguously) and acted upon it as directed by predefined rule-based logic. Basically it has acted like any program around today (but on better data).

To fully integrate data you need senders and receivers to agree on a standard for both storage and communication (storing data in one format and communicating it in another defeats the purpose of data integration). This standard needs to be simple (since we also want to exchange data with mobile and embedded devices and generally want rapid and broad diffusion of the format) and must not restrict others from building niche and more complex standards on top. The simplest standard is a fact (sales, personal etc.). Facts, of course, condense down to: something that the fact is about (the subject), something about the subject (the predicate) and something that is related to the subject through the predicate (the object). Examples are:

You (subject) located in (predicate) bar (object)
bar (subject) place of (predicate) socializing (object)

You cannot decompose facts any further than this; otherwise they would not tell us anything. It is conceptually akin to storing data at a physical level as 0s and 1s. Any type of information in the world can ultimately be stored as a list of linked facts in this way.

I (subject) purchase (predicate) Jack Daniels (object)

What is missing here (you might ask) is the timestamp and location; don't we have to add them as columns four and five? Surely our future scenario needs that information? No - the idea is that we stick with the simple triple representation and it becomes:

I (subject) purchase (predicate) Jack Daniels (object)
My purchase (subject) was timed at (predicate) 1430HRS (object)
My purchase (subject) was located at (predicate) O'Malley's bar (object)
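
As an illustrative sketch only (the post does not prescribe any particular toolkit), those three facts could be written as RDF triples using Python's rdflib; the example.org namespace and the resource/predicate names are invented for the example.

```python
from rdflib import Graph, Namespace, Literal

# Hypothetical namespace for the example - not a real vocabulary.
EX = Namespace("http://example.org/")

g = Graph()

# I (subject) purchase (predicate) Jack Daniels (object)
g.add((EX.me, EX.purchase, EX.JackDaniels))

# My purchase (subject) was timed at (predicate) 1430HRS (object)
g.add((EX.myPurchase, EX.wasTimedAt, Literal("1430HRS")))

# My purchase (subject) was located at (predicate) O'Malley's bar (object)
g.add((EX.myPurchase, EX.wasLocatedAt, EX.OMalleysBar))

print(g.serialize(format="turtle"))
```

Each call to g.add() stores exactly one subject/predicate/object triple; nothing else is needed to extend the record with time and place.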

While it is certainly true that much rule-based logic will always be required to fulfil the type of scenarios above, the amount of it is significantly reduced by the ability to make inferences using facts. Consider our facts:

You (subject) located in (predicate) bar (object)

This fact is automatically generated by your phone broadcasting its GPS location, which is matched to a commercial premises at that location and finally cross-referenced against its business type.

bar (subject) place of (predicate) socializing (object)

This is a core fact that never changes and was created by some internationally accepted community effort. Because we have a like-term (bar), we can now infer that:

You (subject) are currently (predicate) socializing (object).

You have not specifically told anyone that you are socializing. It has not been encoded. Indeed, it may be the middle of the afternoon so, in lieu of further information, anyone might otherwise have assumed you were at work. We could have built rule-based logic to achieve the same result (if you are in a bar then you are socialising) but we have been saved the trouble by inference. Performance has been maintained because the inference was made in memory. This type of logic - the syllogism - has been around since at least the Ancient Greeks. The implicit knowledge that both you and the CEO are physically in the same informal situation at the same time allows an opportunistic suggestion to be made; opening and closing a loop in real time without a rule having been written for it.
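
Here is a minimal, hypothetical sketch of that syllogism as a one-rule inference over the two triples above (plain Python rather than a real reasoner, with the predicates written exactly as in the prose):

```python
# Asserted facts as (subject, predicate, object) triples.
facts = {
    ("You", "located in", "bar"),
    ("bar", "place of", "socializing"),
}

def infer_activity(facts):
    """If X is located in Y and Y is a place of Z, infer X is currently Z."""
    inferred = set()
    for (x, p1, y) in facts:
        if p1 != "located in":
            continue
        for (y2, p2, z) in facts:
            if p2 == "place of" and y2 == y:
                inferred.add((x, "are currently", z))
    return inferred

print(infer_activity(facts))  # {('You', 'are currently', 'socializing')}
```

No "if you are in a bar then you are socialising" rule was written; the conclusion falls out of chaining the two facts on their shared term.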

If everyone used the fact format (and its inferencing, managed by reasoners) for data storage and communication then we should all be able to resign from our jobs and hang out in bars, secure in the knowledge that we have reached a technological plateau and an agent will at some point fix us up with a new role. Imagine.

The existing Internet is still very page focussed. You go to a job search site and have to trawl through pages of job descriptions, applying your own experience to decide which ones are interesting, e.g. is a "Sales Executive" the same as a "Business Development Executive"? Does that term differ by industry category? If so, should I follow these links instead? You have to do a lot of work to find the things you want. So much so that you either give up or end up with things that you don't want. Using the fact format at Internet scale, with disambiguation, removes the need for humans to contextualise the data and so enables machines to better process it; which in turn leads to more pervasive automation and those Imagine scenarios. This is what is meant by the Semantic Web (Web 3.0/SemWeb).
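
As a hedged sketch of what that disambiguation could look like in practice (the job-title identifiers and the example.org namespace are invented, and SKOS exactMatch is just one possible way to declare the equivalence), a mapping published once lets a machine treat the two titles as the same thing:

```python
from rdflib import Graph, Namespace
from rdflib.namespace import SKOS

EX = Namespace("http://example.org/jobs/")

g = Graph()
# Someone has published that the two titles mean the same role.
g.add((EX.SalesExecutive, SKOS.exactMatch, EX.BusinessDevelopmentExecutive))
# A job site advertises a vacancy under the second title.
g.add((EX.vacancy42, EX.advertises, EX.BusinessDevelopmentExecutive))

wanted = EX.SalesExecutive

# Expand the search term with its declared equivalents before matching vacancies.
equivalents = {wanted}
equivalents |= set(g.objects(wanted, SKOS.exactMatch))
equivalents |= set(g.subjects(SKOS.exactMatch, wanted))

for vacancy in g.subjects(EX.advertises, None):
    if set(g.objects(vacancy, EX.advertises)) & equivalents:
        print(vacancy)  # http://example.org/jobs/vacancy42
```

The human no longer has to know (or guess) that the two labels refer to the same role; the published equivalence does that work once, for everyone.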

Is the Semantic Web inevitable? Opinion is divided, to say the least. The Imagine scenarios (or variations of them) are inevitable. They absolutely require disambiguation and wholesale data integration. This in turn necessitates a standard for the storage and communication of facts. The Semantic Web is an attempt (the only one in town) to deliver that fact standard. Inferencing must be considered an optional component of the Semantic Web. It may uncover previously unknown information, or it may simply be required to make the whole thing work at scale. It may also require too much effort to be practicable for many, due to its current reliance on an Open World Assumption (OWA).

The core promise of the Semantic Web is disambiguation and wholesale data integration (Linked Data). It is the primary enabler of data mash-ups. There are certain parallels with the early days of the Object Oriented Programming (OOP) movement. The Semantic Web is inevitable but it won't be called that. It will still be the Web.

There is an established fact format right now - RDF (Resource Description Framework). Many of the supporting systems are also in place, e.g. query and ontology languages (SPARQL and OWL respectively). They have been quietly developed over the last eight years or so and all focus around the core premise of the simple fact (or triple, as it is known: subject/predicate/object). The next post will explore why we have yet to see widespread adoption of these technologies.
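
As a hedged illustration of what querying those facts looks like, here is a small SPARQL query run through Python's rdflib against the bar/socializing facts from earlier; the example.org namespace and predicate names are again invented for the sketch:

```python
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/")

g = Graph()
g.add((EX.You, EX.locatedIn, EX.bar))
g.add((EX.bar, EX.placeOf, EX.socializing))

# SPARQL pattern: who is in a place, and what is that place a place of?
query = """
PREFIX ex: <http://example.org/>
SELECT ?person ?activity
WHERE {
    ?person ex:locatedIn ?place .
    ?place  ex:placeOf   ?activity .
}
"""

for person, activity in g.query(query):
    print(person, activity)  # http://example.org/You http://example.org/socializing
```

The graph pattern in the WHERE clause is itself just a pair of triples with variables, which is why the query language follows so directly from the fact format.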

UPDATE: Great source of further reading links here.

05 June 2010

Beware the IDEs? Not so much

Whether or not to run an Integrated Development Environment (IDE) in a browser can be a surprisingly emotive subject. The majority of developers, if they do not already have it, want a well-appointed workstation running Visual Studio (for .NET), Eclipse (for Java), Dreamweaver (for JavaScript/HTML/CSS) or similar. They baulk at the idea of using a browser-based IDE despite building browser-based applications themselves and despite clear advantages to working this way. It is worth looking at the situation logically and dispassionately.

Key advantages are:

1) Portability. Developers can work from anywhere with a web connection. Does this really happen? Offshore developers will typically have a work desktop and maybe a personal laptop. They may also come onshore for a short period and use a client desktop. If they work through an outsourcer/consultancy they may have another. This counts, but it is a Dropbox-like aspect of portability. Its main advantage is in supporting those lifestyle situations when you were not scheduled to develop but, thinking about it, you can; either to get ahead or to react to real-time issues, even when you are on vacation, travelling or visiting friends. There are also humanitarian reasons for being able to learn a trade and contribute without having to own even a $100 laptop, but let us save that for a later post.
2) Collaboration. Developers can let others debug their code by sharing it via a unique URL. Anyone navigating to that link will receive a separate, fully modifiable and executable version of the code. That means no API version inconsistencies come compile time. Real-time collaborative coding is also easier.
3) Efficiency. Hours, and on larger projects days, are wasted setting up workstations and supporting environments, e.g. source control/configuration management, at the start of each project. Even if it has been done several times before, at some point it will fail because there are just too many variables on a desktop. This just goes away.
4) Cost. Older workstations can be used since compilation and anything else heavy is performed on the server. Cost is also improved by the collaboration and efficiency gains (2 and 3 above).

Key disadvantages are:

1) Usability. The browser is not perceived as being rich enough to accommodate a responsive editor, class/library management and debugging. It is also seen as impracticable for designing and testing a GUI due to the greater drag/drop precision required.
2) Connectivity. You need to be connected to the Internet in order to run your IDE and therefore develop. This limits your portability (1 above).

There are other points on both sides but above are the key ones. Let us hold those two disadvantages up to the tiniest bit of scrutiny:

1) Usability. Large text-file editing in a responsive (next to no latency) way - the main criticism - can now be achieved using HTML5 Canvas/JavaScript. See Bespin and also Kodingen (which extends Bespin and integrates with other services) for examples of this using Python/PHP/RoR. See CodeRun for an example of full-on code management using .NET/JavaScript. All are free, quick and have clean, efficient interfaces, although Bespin is more of a work-in-progress and does not yet support all browsers. GUI design is admittedly more of an issue right now but:
a. People already successfully use graphical editors in a browser, e.g. Splashup, SUMO Paint and the recently Google-acquired Picnik.
b. HTML5 adoption is affording more options here.
c. In both consumer and enterprise spaces, we are moving toward a widget-based UX making designing GUIs from scratch less common.
2) Connectivity. By this, offline development is meant, i.e. those circumstances where the developer is using a laptop (if they were using a desktop, surely it would be connected to a network?) and is in an area with no Wi-Fi coverage (since otherwise they would have network access). Granted, this is a situation that occurs, but consider further that it also means:
a. There are no collaboration or research possibilities available (no IM/no Google). If you get stuck when developing or need to clarify a technical point – you’re on your own.
b. You need a single, professional and automated solution that synchronises all code, images, configuration files and media (that have previously been unit and integration tested and checked in) and also synchronises test data and potentially business rules (since it is good practice to keep these out of code). Either it will synchronise actual data/business rules, in which case you need high-grade encryption on your laptop (as you are likely using customer data), or your solution needs to de-sensitize the data/business rules somehow (and you need to have agreed this process with any customer). What kind of developers will be happy with these two restrictions? Only sole developers working on their own project.

Staying with the logical analysis, there are four decent reasons in favour of widespread use of browser-based IDEs and two against. There is also enough mitigation to mostly address the two negatives. There is clearly a significant net gain to be made. Side points along the lines of "developers won't stand for it", "development is an art (it's not!)" or "you just do not understand" are emotional and really do not have a place in the decision. It is understandable (in a carpenter-cherishing-his-chisel kind of way), but this noise is a real contributory factor in why browser-based IDEs have not made more of an impact to date.

When their new OS comes out later this year, are Google really going to say - buy Chrome laptops, they can do everything your regular laptop can, unless you are developing? Given their long history of developer-friendliness, this would appear a peculiar move. Unless, of course, it is precisely because they are developer-friendly that they will pander to the populist developer belief and treat them as artisans needing powerful, magical workstations? If they do this, though, they risk confusing consumers, and certainly the non-developing IT community, as to their strategy at a time when both are already perplexed by what is happening with Chrome OS, Chrome and Android as a run-time environment.

Google have a new programming language, Go, which currently needs OS X or Linux. Like the majority of languages today, it is C-based. It has been out for nearly a year but has not received a great deal of press. It will need a differentiator other than speed to compete (how many web applications really have a processor bottleneck these days?). Surely there is an opportunity to build a browser-based IDE for Go and enable a new generation of more casual (but also more open) developers around the world?

04 June 2010

Where did those mash-up tools go?

Just eighteen months ago, mash-up tools were big. The promise of building applications quickly, with minimal development and with context directly reflected within the application (since they are made by SMEs rather than IT resources) remains appealing. They looked to be the perfect tool for civic activists and knowledge workers alike. They were high in Gartner's top ten technologies to watch. The enterprise was starting to take them seriously as a mechanism to reduce crippling data integration challenges, and consumers - bored on a diet of pushing links - thought they would be fun and/or a showcase for themselves in much the same way as blogs have been.

Now the wind has shifted and MSFT's Popfly and Google's Mash-up Editor are both gone. Other niche vendors, e.g. Sprout Builder, have similarly disappeared. Of the big players, Intel Mash-up Maker and Yahoo Pipes continue. All of them attempt, or have attempted, to straddle the void between the consumer and enterprise spaces. This is an important distinction since mash-ups, even within the enterprise, rely upon an ad-hoc, passionate approach rather than formal development. They are typically built at home by passionate non-programmers who want to invest their time in a single, non-niche tool so that whatever they learn is portable (work, social, other organisations etc.). Even more so than with blogging (which also uses one tool across both domains), a single tool is required, as more learning investment is needed. With the exception of SAP's Visual Composer (if you are an SAP shop), none of the tools mentioned have been particularly successful in either space, let alone both. Why is this?

1) No UX standards. For both enterprise and consumer, there are no standards for widgets (or gadgets or web-parts or whatever else you call discrete, self-contained UX functions). There are standards for business cards (vCard) - why not widgets?
2) Slow linked open data adoption. More of a consumer inhibitor at the moment. Linked Data is a core component of the Semantic Web vision that uses a specific set of current technologies. It provides a way to readily mix and match data in a meaningful way and so is a key enabling technology for mash-ups. Sig.ma is a simple mash-up tool for RDF data. Unlike other data-based mash-ups, which tend to be query-based, Sig.ma is search-based. You enter a search term, the search engine gets your data, you remove the bits that are not relevant and (if you like) re-publish the data again as RDF (or other formats). This is perhaps too simplistic for users right now, but it is evolving and could become a potent research tool. Any mash-ups that rely upon open linked data (ideally the best data of all) suffer from a lack of it, although this is changing as Government initiatives in particular publish RDF data for transparency reasons.
3) Insular data integration. Although the various flavours (SOA/ETL/EII/EAI) have been core CIO agenda topics for over five years, they have mainly been confined to the particular enterprise itself, especially in the narrow form of web services, and have been of limited success even there. Extranet take-up, where common data is shared between parties in the supply chain, has been leisurely, and this is precisely where mash-ups are needed. Very few organisations treat meta-data with the same focus as data. This means mash-ups have trouble vouching for data currency and lineage, which detracts from user take-on. It is possible that the solution to data integration will be the Semantic Web and a greater openness of organisations to share data. If so, we will be waiting some years yet.
4) Industry standards. These have been slow to be adopted. A notable exception here is XBRL for common reporting.
5) SSO. This is not exactly an issue for consumer mash-ups, assuming you are using open linked data, but it is still a huge inhibitor to data integration for many organisations.
6) Blogging comparison. Although superficially similar to blogging, mashing (if you are going to do it properly) requires a thorough understanding of the data and a lot more effort than committing stream-of-consciousness thoughts before they float away into the ether (or linking to other people's work). Blogging is simply an easier way to achieve microcelebrity and also, because the majority of posts are written in the first person (I think...), they can be defended (if need be) by the simple statement "These are my opinions". This is a segue into a whole minefield of philosophy, politics and culture that is best left alone. People mainly do. Only a small proportion will directly challenge someone's written thoughts. Publishing a mash-up, however, where you are vouching for the legitimacy of the data, opens you up to direct challenge (people may have provably better data) and so people resist it. Only when the number of single versions of truth in the world becomes smaller and more consolidated will this situation change.

Popfly showed early promise as a learning tool but never really got past being a Silverlight showcase. Its focus was on looking good (geo mash-ups and slick drag/drop) rather than data integration. It did not use RDF at all. MSFT have been slow to utilise Semantic Web approaches in general. Some of their media management technologies use RDF in the background, but their main focus has been on semantic search through the Semantic Engine. This initiative uses the recently acquired Powerset technologies and will be released through SQL Server. PowerPivot has been significantly downsized from its original Project Gemini remit, which would have provided not just a potent reporting mash-up environment but also the management and support processes and infrastructure to QA and promote mash-ups throughout the enterprise. This latter point is a key inhibitor to mash-up growth in general.

There are still signs of life in the mash-up tool space. Dapper is advertising focussed. NetVibes is portal focussed. Alchemy API takes a content management/annotation approach (similar to Intel Mash-up Maker). Birst takes an analytic portal approach. JackBe looks interesting; it appears to take a sales analytics approach. SnapLogic is not exactly a mash-up tool but it certainly takes a non-technical approach to data integration. None of them really play in that sweet spot between enterprise and consumer though.

The parallel economic downturn has influenced mash-up take-on. The enterprise essentially stopped unproven development and consumers have yet to be sold on the concept - but let's face it, the focus on one or two configuration drop-downs and Google Maps didn't help either. The future of mash-ups is secure because it is the future of building useful applications quickly by SMEs, and that will always be desirable. Five back-end data sources (database, RSS etc.) linked to five middleware components (aggregation, integration etc.) and five front-end components (analytics, data entry etc.) generates 125 possible combinations of application straight off the bat. Adding tailoring through filtering, personalisation and general configuration takes it into the thousands. This simple logic guarantees a future at least in the enterprise.

Whether the name "mash-up" has been tainted by its recent hiatus and, like its raison d'etre, will need to resurface as part of something new remains to be seen. We can be sure, though, that the next generation of mash-up activity needs to be three things in order to stick around: interactive, data focused and usable in both enterprise and consumer spaces.