The Xerox DocuShare CPX Extensible Database: Real Time Connection of XML Content
White Paper
April 2007

With the majority of today’s business content passing through Internet-based networks, XML has become the de facto language for transferring structured information among many business applications and processes. XML is now the informational fluid for managing the movement of data across an organization in ways not previously possible. Its inherent flexibility allows that information to be formatted and structured in myriad ways. As a consequence, however, the structural variety of the files being passed makes them impossible to connect without either a precise match of the XML structure (schema) or some intermediary translation.

Effective integration requires advance knowledge of the detailed application or process schema, so that connection points can be pre-determined and accommodated in the XML code. Without that “prior alignment,” conversion steps are often needed, involving costly and time-consuming human intervention.

Harnessing the true potential of XML as the conduit of informational connectivity requires a seamless mapping of structured content regardless of the source file’s XML construction. Previously, this capability did not exist. Now, Xerox DocuShare CPX offers an extensible database (XDB) that enables simple, direct XML-to-XML connection, quickly and automatically linking diverse organizational content to accelerate business processes and productivity.

DocuShare CPX Takes XML to a New Level

Unlike many XML information passing systems, the DocuShare CPX extensible database retains an original document (such as a Microsoft Word, Microsoft Excel, Adobe PDF, or Adobe FrameMaker file) while also providing direct access to the information contained within the document.
The XDB summarizes that information into XML and then uses the converted XML to extract and share relevant data for other organizational needs, such as quickly creating reports that pull from multiple source documents. This capability applies not only to new documents created after business processes are defined but also to archived or legacy documents that are already associated with a process. CPX XDB spans all structured information to identify common process touch points, eliminating manual intervention in mapping source document structures. To ensure adherence to established security policies, once information is brought into DocuShare CPX, its access permissions are enforced whether the information is accessed in XML or in the original source format.

Over 80% of data within enterprises is estimated to be in unstructured formats like Microsoft Word and Excel as well as Adobe PDF file formats. There are 300 million Excel installations worldwide, 200 million PDF documents on the Web, and 100 million new Microsoft Office documents created every day.
Source: Informatica, Inc., November 2006

[Figure: XDB Intake and Indexing Process. Labels: Source Content Files (.xls, .pdf, .doc); Content Parsers; XML Data; XDB Indexer; XML Index Data; Relational Database; Document Renditions + Metadata; DocuShare File Store; DocuShare Repository; Extensible Database (XDB) Technology.]

Files submitted to DocuShare are added to the DocuShare File Store, and associated metadata is indexed into a relational database. When XDB processing is enabled, incoming files are also processed through a format-specific file parser that creates an XML file representing the structure and hierarchy of information components. Each new XML file is then processed using a patented schema-less mapping algorithm that indexes the hierarchical content structure into a relational database.
Once the XML representation is indexed, it is added to the DocuShare repository as an alternative rendition of the original content.

Intake and Indexing, with a Twist

DocuShare’s extensible database accomplishes these connectivity goals through its unique intake and indexing process. The process begins with the source content files, including today’s most common formats. With the standard DocuShare CPX content management system, source files are added to the DocuShare repository, where they are stored and where metadata is added to facilitate content management.

However, when the XDB is enabled, an additional process is performed on the incoming content in tandem. The original source file is passed through a content parser that creates an associated XML file. The XML file is stored in the DocuShare repository as a second rendition of the original document. The XML rendition is then passed through the CPX XDB Indexer, a technology used by DocuShare that indexes the content into a relational database management system (either Oracle or Microsoft SQL Server). The resulting XDB index in the database co-exists with the metadata attached in the standard DocuShare CPX process, becoming part of a flexible DocuShare knowledge network through which users can easily search for and retrieve stored content.

The process manager uses the Excel Summarization Template to design an automated summarization spreadsheet in Excel. This spreadsheet can then be retained on the desktop or uploaded to DocuShare. The process manager creates a form in Excel using the Excel Submission Template and uploads it to DocuShare. Process participants download and complete the forms, then upload them to DocuShare by clicking the Submit button. XDB intakes, processes, and indexes the information from each submission.
The process manager views an aggregated report of all inputs in the Excel summarization spreadsheet. Information in this spreadsheet is automatically updated based on dynamic queries to the XDB. Because incoming content is converted to XML and indexed based on its structural organization, XDB reports can access information that is embedded in virtually any type of document, including Microsoft Word and E-form responses.

[Figure: XDB In Action. Labels: Excel Pathway; Word Pathway; E-form Pathway; DocuShare Repository; XML Indexer; Query; Updated Information.]

Because the information is indexed based on contextual identifiers, the XDB can easily access information represented in the content and summarize it across documents whenever needed. These optional processes are easily enabled by the DocuShare CPX administrator, who specifies what types of content should be subject to XML conversion and XDB indexing.

Find and Repurpose Specific Content Strings

One of the greatest strengths of the XDB technology is its ability to find and assimilate specific content components from similarly structured business documents like contracts, presentations, or spreadsheets. The individual components can be retrieved and re-assembled by the XDB to create concise summaries of relevant information across multiple source documents. The components are found based on the XML context that is associated with them.

For instance, a company may use standard contracts created in Microsoft Word as part of a specific business process. Each contract has a unique termination clause as part of its content, which is identified based on textual markers (headlines, bolding, underlining, page position, etc.) and tagged as context during the XML conversion. The XDB has a simple-to-use search function that queries the context information to find all occurrences of the termination heading.
It then returns the text in the paragraphs of just that clause for each contract file within a designated DocuShare collection. The retrieved content summary can be either saved to a report format or repurposed, wholly or in part, into another document through a simple cut and paste.

“Enterprises must recognize that the thousands of uncontrolled spreadsheets their employees use every day represent a significant risk. Poorly managed spreadsheets may, through negligence, incompetence or deliberate criminal conduct, result in significant business losses, exposure to legal liability, damage to reputation and unwelcome regulatory attention.”
Source: Gartner, Symposium/ITXpo 2006, “The Information Explosion and What to Do About It,” Toby Bell, October 2006

This capability is especially useful for highly complex content structures such as those found in Excel spreadsheets. Excel content can vary from basic names of columns or rows to detailed cell ranges. The XDB content parser identifies spreadsheet content based on range names, attaches XML context data, and then passes it on to the XDB indexing process and into the RDBMS. The content is then readily retrieved and shared on demand. Because Excel information so frequently drives corporate business processes, the XDB can be a particularly powerful tool for integrating systems around Excel spreadsheets or quickly accessing summarizations of Excel data from disparate sources.

Furthermore, the extensible database is agnostic to the original source format of stored data. Once content has passed through the XDB indexing process, identically named data from varying source documents and formats, such as Word and Excel, can be retrieved into the same report. For example, a column labeled ‘location of travel’ from Excel-based expense reports can be combined with ‘location of travel’ information contained in standard Word-based sales trip reports.
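The idea described in this section, indexing XML renditions by their structure and then retrieving identically named content across source formats, can be illustrated with a small sketch. It is not the patented XDB mapping algorithm, whose details this paper does not give. It is a minimal illustration that assumes each element's tag name serves as its context, uses Python's standard `xml.etree.ElementTree` and `sqlite3` modules in place of the DocuShare parsers and an Oracle or SQL Server database, and invents the table name `xdb_index`, the function names, and the sample XML for demonstration only.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical XML renditions standing in for content parser output from an
# Excel expense report and a Word trip report. All tag names are assumptions.
RENDITIONS = {
    "expense_report.xls": """
        <workbook>
          <sheet>
            <location_of_travel>Chicago</location_of_travel>
            <amount>412.50</amount>
          </sheet>
        </workbook>""",
    "trip_report.doc": """
        <document>
          <section>
            <location_of_travel>Denver</location_of_travel>
            <summary>Met two prospective customers.</summary>
          </section>
        </document>""",
}


def index_rendition(conn, doc_name, xml_text):
    """Store one row per text-bearing element; no schema is declared up front."""
    root = ET.fromstring(xml_text.strip())

    def walk(element, parent_path):
        path = f"{parent_path}/{element.tag}"
        text = (element.text or "").strip()
        if text:
            conn.execute(
                "INSERT INTO xdb_index (doc, path, context, content) "
                "VALUES (?, ?, ?, ?)",
                (doc_name, path, element.tag, text),
            )
        for child in element:
            walk(child, path)

    walk(root, "")


def build_index(renditions):
    """Create an in-memory index table and load every rendition into it."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE xdb_index (doc TEXT, path TEXT, context TEXT, content TEXT)"
    )
    for doc_name, xml_text in renditions.items():
        index_rendition(conn, doc_name, xml_text)
    return conn


def find_by_context(conn, context):
    """Return (doc, content) pairs for every element tagged with this context."""
    rows = conn.execute(
        "SELECT doc, content FROM xdb_index WHERE context = ? ORDER BY doc",
        (context,),
    )
    return rows.fetchall()


if __name__ == "__main__":
    conn = build_index(RENDITIONS)
    for doc, content in find_by_context(conn, "location_of_travel"):
        print(doc, content)
```

Because each row records the element's full path as well as its context name, the same table can answer both broad queries (every ‘location of travel’ value, whatever the source format) and narrower, structure-aware ones.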
Accumulated studies by audit firms since 1998 show that as many as 94% of corporate spreadsheets may have some form of error, ranging from negligible to extremely serious.
Source: Results of research by R. Panko, “What We Know About Spreadsheet Errors,” University of Hawaii, January 2005

Faster, More Accurate Business Intelligence with XML Submission and Summary: Universities Space Research Association

The Universities Space Research Association (USRA), a non-profit research organization chartered to foster cooperative research, development, and education associated with space science and technology, helps the National Aeronautics and Space Administration (NASA) manage its business intelligently.

With billions of dollars worth of research and development projects currently underway, certain centers within NASA faced a growing effort to manage the information resources behind their financial and project performance reports. Managers were required to manually copy and paste detailed financial and project information from many disparate sources into numerous reports. This resulted in valuable time being spent consolidating data rather than analyzing it. For example, one report alone took up to  60 person hours each month to capture and collate the necessary information into useful reports. This manual process also generated a high number of transcription errors: an audit of one $700M program with over  00 milestones revealed a  0% discrepancy rate.

USRA addressed this growing problem by creating a performance management tool with NASA that leveraged the XML submission and rendering capabilities built into the DocuShare CPX extensible database. USRA uses XDB as an XML hub for managing, storing, and synchronizing project data among source documents, including integration with the organization’s core systems from Oracle and SAP. The solution enables project managers to automate submission of content through XDB-enabled source documents, such as Excel spreadsheets. XDB then reassembles the XML content as required by each manager into accurate summary documents. $  .B of internal activity is now managed using the tool.

The resulting time and labor efficiencies have made project performance information available to managers in a much more timely, accurate, and effective manner. By automating the process, report creation time was significantly reduced, from  60 person hours down to  for example, and discrepancies were virtually eliminated. Now managers and analysts can spend time actually analyzing and using data rather than consolidating it.

For more information, contact USRA’s Research Institute for Advanced Computer Science (www.riacs.edu) at info@riacs.edu.

Would You Like to Learn More?

For more information on DocuShare CPX XDB, please call 1.800.735.7749 or visit docushare.xerox.com.

About DocuShare CPX

Xerox DocuShare CPX, a highly intuitive and secure Enterprise Content Management (ECM) application, enables document-intensive organizations to dynamically capture, manage, retrieve, and distribute information easily, regardless of skill level or location. Part of the Xerox DocuShare family of ECM products, DocuShare CPX helps customers significantly improve productivity, streamline business processes, and reduce the time and cost of managing routine business documents and information. Leading the industry in speed of deployment and ease of administration and use, DocuShare CPX significantly reduces installation effort and complexity and extends flexibly into an existing infrastructure, resulting in lower total cost of ownership and faster return on investment. Tightly integrated with Xerox Document Centre and WorkCentre Pro, DocuShare CPX can manage both hard copy and electronic content with unsurpassed ease and convenience.
Xerox DocuShare Business Unit
A Division of Xerox Global Services
3400 Hillview Avenue
Palo Alto, California 94304 U.S.A.
1.800.735.7749

© 2007 Xerox Corporation. All rights reserved. Copyright protection claimed includes all forms and matters of copyrightable material and information now allowed by statutory or judicial law or hereinafter granted. Xerox, DocuShare, and WorkCentre are registered trademarks of Xerox Corporation. All other trademarks are the property of their respective companies and are recognized as such.