18 March 2010

Text breeds data. Data breeds information.

(Originally posted 7 March 2009).

Received wisdom tells us that unstructured information is 80% of the data in an organisation. Reporting, BI and PM systems are still tied, in the main, to structured information in transactional systems (obtained through ETL, staging, dimensional modelling, what-if modelling and data mining). An opportunity has existed for some time to incorporate unstructured data. The inhibitor is technology. Product sales figures can be extracted over years and then extrapolated to determine likely sales next period, but how does a BI solution use the dozens of sales reports, emails, blogs, unstructured data embedded within database fields, call centre logs, reviews and correspondence describing the product as outmoded, expensive or unsafe? Organisations whose BI solutions cannot use it may lose out, since this information (the 80%) can affect the decision of how much of the product to produce.

Most organisations currently handle the general need to unlock unstructured data by market sampling through interviews, questionnaires and group discussions. They attempt to apply structure to the data by categorising it, e.g. “On a scale of one to ten – how satisfied are you with this product?” They either ignore the information already there or manually transpose it, typically by outsourcing. A minority use the only technology that can truly unlock unstructured data within the enterprise right now: text mining. Note that this is different from both Sentiment Analysis (too interpretive right now) and the Semantic Web (too much data integration required right now).

Many of these organisations, however, believe text mining simply makes information easier to find. This is a function of currently available products. The principal MSFT text mining capability is in its high-end search platform Fast. Such products use text mining techniques to cluster related unstructured content. It is not enough to loosely link data, however; the data needs to be linked at an entity (ERM) level so that it is subject to identical policies of governance and accountability and, crucially, the same decision-making criteria. It is worth stating the pedestrian: text mining is like data mining (except it’s for text!). It establishes relationships between content and links this information together to produce new content; in this case new data (whereas data mining produces new information).

Consider a scenario where an order management application retrieves all orders for a customer. At the same time, search technologies return all policy documents relating to that customer segment, together with scanned correspondence stating that dozens of the orders were returned due to defects. The user now has to perform a series of manual steps: read and understand the policy documents, determine which returned order fields identify policy adherence, check the returned orders against these fields, read and understand the correspondence, make a list of all orders that were returned (perhaps by logging them in a spreadsheet), calculate the client value by subtracting the value of the returned orders from the value displayed in the application, and finally modify their behaviour towards the customer based upon the customer's value to the organisation. Human error at any step can adversely affect customer value and experience. Much better to build and propagate an ERM on-the-fly (by establishing “Policy” as a new entity with a relationship to “Customer Segment”), grouping the orders as they are displayed while at the same time removing returned orders. I’m not aware of any organisations currently working in this way. This is the technology inhibitor and where text mining needs to go next.
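
To make the last step of that scenario concrete, here is a minimal sketch in Python of what the application could do once text mining has turned the scanned correspondence into data. Everything named here is an assumption for illustration only: the `Order` and `ReturnNotice` entities, their fields and the `net_customer_value` function are hypothetical, and the hard part (extracting the return notices from the correspondence) is taken as already done.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Order:
    """An order as held in the (structured) order management application."""
    order_id: str
    customer_id: str
    value: float


@dataclass(frozen=True)
class ReturnNotice:
    """Hypothetical entity: one returned order, as extracted by text mining
    from scanned correspondence. The extraction itself is not shown."""
    order_id: str
    reason: str


def net_customer_value(orders, returns, customer_id):
    """Drop returned orders for a customer and total the value of the rest."""
    returned_ids = {notice.order_id for notice in returns}
    kept = [
        order
        for order in orders
        if order.customer_id == customer_id
        and order.order_id not in returned_ids
    ]
    return kept, sum(order.value for order in kept)


# Illustrative data only; not taken from any real system.
orders = [
    Order("A1", "C1", 100.0),
    Order("A2", "C1", 250.0),
    Order("A3", "C1", 50.0),
    Order("B1", "C2", 75.0),
]
returns = [ReturnNotice("A2", "defect")]

kept, value = net_customer_value(orders, returns, "C1")
print([order.order_id for order in kept], value)
```

The point of the sketch is that once the return is an entity keyed on the same order identifier as the transactional data, the spreadsheet and the manual subtraction disappear: the returned order is simply filtered out before the customer value is displayed.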
