
How To Whip Unstructured Content Into Shape

By Gautam Desai and Linda Andrews, CRN
April 30, 2004    1:13 PM ET

Your clients' unstructured information is cause for concern. All of those word-processing documents, spreadsheets, e-mails and images make up the majority of an organization's business-critical content and, as such, constitute the wealth of its information. The problem? This information treasure trove is largely inaccessible without the tools to identify and extract it. If tapped, it could improve employee productivity and knowledge transfer, and even lead to innovation. What you need are strategies for accessing, sharing and making optimal use of this content through service-oriented architectures (SOAs) and Web services.

Gain Control
Structured data makes up a small percentage of all data that is typically found within a large enterprise. In fact, approximately 15 percent of data in the average enterprise is structured, and that generally consists of the information stored in relational databases and other organized systems.

Then there's the other 85 percent--unstructured data--which is typically difficult to find and organize unless an organization makes the data more easily accessible and retrievable. Methods for doing so include putting it into a content-management system and tagging it with metadata. To help manage unstructured data, SOAs based on Web services have been developed. With the proper security credentials, systems and applications across a wide geography can access and execute Web services. In this way, any information exposed through a Web service will be accessible across an organization using standard Web protocols and browsers.

The approach we recommend consists of using the Web-services standards of the SOA running on top of the standards you have outlined in your internally defined Enterprise Reference Architecture, a framework that clearly defines the ideal, logical architecture for a particular organization.

Blasting Or Digging?
Before you get started, there are two issues that need to be addressed: management of the content life cycle, and the definition of a taxonomy for unstructured content. The former is important because your company's unstructured information is likely to include both active content and fixed content.

Active content refers to document types whose actual content changes frequently, such as legal contracts and other documents that are created collaboratively. In contrast, fixed content includes documents such as e-mail or check images, for which the only thing that changes is the metadata. For many types of unstructured content, active and fixed content represent points on a life-cycle continuum that runs from creation to retention to destruction. Depending on whether a given document represents active or fixed content, it will have very different management requirements over its life cycle. The challenge for solution providers is matching management policies and related technologies to each of these documents at the various stages of the content's life cycle.
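The matching of policies to content type and life-cycle stage described above can be pictured as a simple lookup. The Python below is a minimal sketch only: the stage names follow the creation-to-retention-to-destruction continuum mentioned here, while the policy descriptions and the policy_for helper are hypothetical and not drawn from any particular product.

```python
from enum import Enum


class ContentKind(Enum):
    ACTIVE = "active"  # body changes often, e.g. a contract being drafted
    FIXED = "fixed"    # body never changes, e.g. a check image; only metadata does


class Stage(Enum):
    CREATION = "creation"
    RETENTION = "retention"
    DESTRUCTION = "destruction"


# Hypothetical policy table: (kind, stage) -> management policy.
POLICIES = {
    (ContentKind.ACTIVE, Stage.CREATION): "version every edit; allow collaborative check-out",
    (ContentKind.ACTIVE, Stage.RETENTION): "freeze final version; keep revision history",
    (ContentKind.ACTIVE, Stage.DESTRUCTION): "purge all versions after the retention period",
    (ContentKind.FIXED, Stage.CREATION): "capture once; store on write-once media",
    (ContentKind.FIXED, Stage.RETENTION): "permit metadata updates only",
    (ContentKind.FIXED, Stage.DESTRUCTION): "purge the object and its metadata together",
}


def policy_for(kind: ContentKind, stage: Stage) -> str:
    """Return the management policy for a document at a given stage."""
    return POLICIES[(kind, stage)]


print(policy_for(ContentKind.FIXED, Stage.RETENTION))
```

Keeping the table as data rather than branching logic means a policy can be revised without touching the code that applies it.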

Access to the right information is imperative to making better-informed decisions. To accomplish this, key data systems and applications must be integrated so that organizations can request and get data on demand, regardless of where the data is stored.

The first and most critical step in structuring unstructured content is defining an information taxonomy. This is different from those enterprisewide data-structure reference manuals that sit on shelves, collecting dust.

An enterprise-information taxonomy does not define every data element that exists within a company; rather, it is a hierarchical guide to the groups of related information that are used to drive the organization. These groups may include customer information, employee-benefits information, order-processing data, life-insurance policy information and pharmaceutical drug trial documents. The taxonomy attempts to codify the relationships among these key process-driving data elements that are used on a daily basis within a company. The taxonomy is ultimately stored in an easily accessible repository within the enterprise. The information that makes up the taxonomy is typically stored in XML format and is called metadata.
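To make the idea of a hierarchical taxonomy stored as XML concrete, here is a minimal Python sketch. The element and attribute names (taxonomy, category, name) and the grouping of categories are assumptions made for illustration; the category names simply echo the examples given above, and a real taxonomy would follow whatever schema the organization defines.

```python
import xml.etree.ElementTree as ET

# Hypothetical taxonomy fragment; category names echo the examples in the text.
TAXONOMY_XML = """
<taxonomy version="1">
  <category name="customer">
    <category name="customer-information"/>
    <category name="order-processing"/>
  </category>
  <category name="human-resources">
    <category name="employee-benefits"/>
  </category>
  <category name="insurance">
    <category name="life-insurance-policy"/>
  </category>
</taxonomy>
"""


def category_paths(element, prefix=()):
    """Yield every category as a tuple path from the root, depth first."""
    for child in element.findall("category"):
        path = prefix + (child.get("name"),)
        yield path
        yield from category_paths(child, path)


root = ET.fromstring(TAXONOMY_XML)
for path in category_paths(root):
    print("/".join(path))
```

Flattening the tree into paths such as customer/order-processing gives each group a stable label that can later be attached to documents.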

The result of defining this taxonomy is that the organization now has a unified approach to tagging existing and new unstructured data to make it more structured by attaching metadata that is defined via the taxonomy.
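A unified approach to tagging can be sketched as a small check that every tag comes from the taxonomy. This Python example is illustrative only: the Document class, the tag function, the file path and the category labels are all hypothetical, and a content-management product would expose its own interface for this.

```python
from dataclasses import dataclass, field

# Hypothetical set of category paths that a taxonomy might define.
TAXONOMY = {
    "customer/customer-information",
    "customer/order-processing",
    "human-resources/employee-benefits",
}


@dataclass
class Document:
    path: str                               # where the file lives
    tags: set = field(default_factory=set)  # taxonomy categories attached so far


def tag(doc: Document, category: str) -> None:
    """Attach a category, rejecting anything the taxonomy does not define."""
    if category not in TAXONOMY:
        raise ValueError(f"unknown taxonomy category: {category}")
    doc.tags.add(category)


memo = Document("shared/benefits-memo.doc")
tag(memo, "human-resources/employee-benefits")
print(memo.tags)
```

Rejecting unknown categories at tagging time is one way to keep ad hoc labels from creeping in alongside the agreed taxonomy.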

A key failing of many organizations is attempting to define a perfect taxonomy on the first try. Instead, taxonomies should be regarded as living documents, subject to modification, particularly as the business grows or otherwise changes.

The next step is to create a unified metadata repository to house all taxonomy data. The metadata repository will play a critical role in allowing users to find and access information across an organization's disparate data stores.

Going forward, solution providers must ensure that any new data has some structure to it when it is collected. To do this, it is important to define policies for information creation and storage, as well as to put technologies in place that make those policies easy to follow. Most modern enterprise content-management solutions provide the technology needed to ensure that all types of content creators properly tag new information as it is entered into the system. Some of these products also come with robust auto-categorization technology that will attempt to understand and pretag content to help ensure compliance.

One step that can be conducted in parallel and over the course of time is the tagging of pre-existing unstructured content--the backfile of unstructured content. Using the taxonomy, solution providers can begin tagging customers' vast unstructured data resources. This particular task is tedious and time-consuming; however, auto-categorization technologies (such as the Autonomy auto-categorization engine) can be used to help with this task.
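As a rough picture of what pretagging a backfile involves, the Python sketch below suggests a category by counting keyword matches. It is a deliberately naive stand-in and makes no claim about how Autonomy or any other commercial engine works; the keyword lists, the suggest_category function and the min_hits threshold are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical keyword lists per taxonomy category.
KEYWORDS = {
    "employee-benefits": {"benefits", "enrollment", "pension", "dental"},
    "order-processing": {"order", "invoice", "shipment", "quantity"},
}


def suggest_category(text, min_hits=2):
    """Return the category whose keywords occur most often, or None if too few hits."""
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    scores = {
        category: sum(words[word] for word in keywords)
        for category, keywords in KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] >= min_hits else None


print(suggest_category("Invoice 1182: order quantity changed before shipment."))
print(suggest_category("Lunch on Friday?"))
```

Returning None for weak matches leaves those documents for a person to review instead of forcing a guess.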

The new content should then be indexed to make it easier to search and retrieve information. Organizations will typically have one or more indexing engines. Ultimately, the metadata repository can be used to route search requests to the proper indices based on the content being requested.
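Routing a search request through the metadata repository can be sketched as a lookup from taxonomy category to index. The Python below is a minimal illustration under stated assumptions: the index names, the category labels and the route function are hypothetical, and the mapping would in practice live in the repository itself.

```python
# Hypothetical repository: top-level taxonomy category -> name of the index covering it.
INDEX_FOR_CATEGORY = {
    "customer": "crm-index",
    "human-resources": "hr-index",
    "insurance": "policy-index",
}


def route(category_path):
    """Pick the index for a category path such as 'customer/order-processing'."""
    top_level = category_path.split("/", 1)[0]
    try:
        return INDEX_FOR_CATEGORY[top_level]
    except KeyError:
        raise LookupError(f"no index registered for {top_level!r}") from None


print(route("customer/order-processing"))
```

With a table like this, adding another indexing engine means registering one more entry, and callers do not need to know which engine holds which content.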

The Last Step
The final step in this standards-based approach is to access and use the information that has now been exposed. With the underlying standards of an SOA and the preparatory work spent in defining your client's taxonomy and set of policies around fixed and active content, it's now a matter of educating your end-user community on the business value of all your efforts. That's why it's a good idea to involve the ultimate end users of the information early and to keep them involved through every step of this approach.

Once you have gotten control of your unstructured content, take some time to understand the possible benefits. From this point forward, it's a matter of enhancing or building applications that can take advantage of the structured data stores, keeping in mind your enterprise reference architecture, of course. Innovative organizations are even beginning to build common access services that unify content; online banking applications, for example, allow access to structured account and balance information along with check images.
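The online banking case can be pictured as one function that answers from both kinds of store. This Python sketch is illustrative only: the two dictionaries stand in for a relational account database and an image archive, and the account number, file names and account_view function are invented for the example.

```python
# Hypothetical stand-ins for a relational account database and an image archive.
ACCOUNTS = {"12345": {"owner": "A. Customer", "balance": 1520.75}}
CHECK_IMAGES = {"12345": ["check-0101.tif", "check-0102.tif"]}


def account_view(account_id):
    """Merge structured account data with references to related check images."""
    account = ACCOUNTS[account_id]
    return {
        "account_id": account_id,
        "owner": account["owner"],
        "balance": account["balance"],
        "check_images": list(CHECK_IMAGES.get(account_id, [])),
    }


print(account_view("12345"))
```

In an SOA, a function like this would sit behind a Web service so that any application with the right credentials gets the same unified view.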

The problem seems to get more complex by the minute, but you have to start somewhere. Mining your own valuable data stores is a great place to start.

Gautam Desai and Linda Andrews are analysts at Doculabs (www.doculabs.com), a research and consulting firm based in Chicago.
