XML-based Pattern Base Management System

Pattern management is an important issue in many disciplines and domains where large collections of data are available and knowledge must be extracted from them and manipulated. Patterns are defined as compact, semantically rich representations of data. Examples include the patterns extracted by Data Mining techniques, such as association rules, clusters, decision trees or time sequences. Patterns can be defined in many different ways, depending on the application and the domain. In any case, extracting or defining patterns from a collection of data is not enough: patterns should also be stored, queried, compared and combined in order to exploit the knowledge they represent.

A Pattern-Base Management System (PBMS) is a system for storing and retrieving patterns, just as a Data-Base Management System (DBMS) stores and retrieves data. Several approaches to building a PBMS are under development. Patterns can be stored in a database along with the data they are extracted from (the inductive database approach) or, alternatively, in a separate repository, which may be a relational, an object-relational or an XML database. Because of its extensibility and wide acceptance, XML is a promising alternative. PMML, a standard for exchanging Data Mining patterns, is defined in XML and is increasingly accepted by database software vendors.

XML-based PBMS architecture

Our proposed PBMS architecture is an XML-based system that encapsulates the pattern model developed in the context of the European Project PANDA (IST/FET Working Group, IST-2001-33058) “Patterns for Next-Generation Database Systems”. This model defines three basic components: pattern, pattern type and class. A pattern type is a description of the pattern structure. It consists of five elements: pattern type name, structure, source, measure and formula. The “structure” element is the structure schema that describes the structure of the pattern type (for an association rule, for example, the structure would consist of head and body). The “source” element is the source schema that describes the dataset from which patterns of this pattern type are constructed. The “measure” element is the measure schema that defines how well patterns of this pattern type represent the source data. The “formula” element describes the relationship between the source space and the pattern space. A pattern is an instance of the corresponding pattern type, and a class is a set of semantically related patterns. The relation between these three components of the pattern model is illustrated below:
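
Independently of the figure, the three components can also be sketched as plain data structures. The Python sketch below is only an illustration of the model as described here, not the PANDA implementation: the class names, the use of dictionaries for the schemas, the textual formula and the association rule example values are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class PatternType:
    """Describes the structure shared by all patterns of one kind."""
    name: str
    structure: Dict[str, str]  # structure schema, e.g. head and body
    source: str                # source schema: the dataset patterns come from
    measure: Dict[str, str]    # measure schema: quality of the representation
    formula: str               # relates the source space to the pattern space


@dataclass
class Pattern:
    """An instance of a pattern type."""
    pattern_type: PatternType
    structure: Dict[str, Any]
    measure: Dict[str, float]

    def __post_init__(self) -> None:
        # Assumption: an instance fills exactly the fields its type declares.
        if set(self.structure) != set(self.pattern_type.structure):
            raise ValueError("structure does not match the pattern type")
        if set(self.measure) != set(self.pattern_type.measure):
            raise ValueError("measure does not match the pattern type")


@dataclass
class PatternClass:
    """A set of semantically related patterns."""
    name: str
    patterns: List[Pattern] = field(default_factory=list)

    def add(self, pattern: Pattern) -> None:
        self.patterns.append(pattern)


# Hypothetical pattern type for association rules.
association_rule = PatternType(
    name="AssociationRule",
    structure={"head": "set of items", "body": "set of items"},
    source="transactions(id, items)",
    measure={"support": "real", "confidence": "real"},
    formula="head and body are both contained in transaction.items",
)

# Hypothetical pattern (made-up measure values) and class.
rule = Pattern(
    pattern_type=association_rule,
    structure={"head": {"milk"}, "body": {"bread"}},
    measure={"support": 0.4, "confidence": 0.9},
)
strong_rules = PatternClass(name="StrongRules")
strong_rules.add(rule)
```

Keeping the pattern type as a separate object means that every pattern can be checked against the schemas of its type when it is created.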

The proposed PBMS architecture is illustrated in the following figure. Data can reside in any database system, and patterns may be extracted by a Data Mining engine (WEKA, SPSS Clementine, Oracle DM, etc.) or defined by the user. The PBMS consists of the Pattern Definition, Manipulation and Query Languages (PDL, PML and PQL) and the pattern query processor.

Users may ask the system to run a data mining algorithm with specific parameters on a specific dataset. The system then “translates” the task into the commands of the appropriate data mining engine and, once the results have been produced, stores them in the pattern base in a pre-defined XML format (PMML), along with other information (the algorithm and parameters of the data mining task, the session timestamp, etc.).
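
To give a feel for what such a stored record might contain, the following Python sketch serializes the results of one mining session together with its metadata, using only the standard library. It is a simplified illustration: the element and attribute names (MiningSession, Task, Parameter, Patterns, AssociationRule, Body, Head) are hypothetical and do not follow the actual PMML schema, and the dataset name, parameters and rule values are made up.

```python
import xml.etree.ElementTree as ET


def build_session_record(algorithm, parameters, dataset, timestamp, rules):
    """Build one XML record holding mined rules and the task that produced them."""
    session = ET.Element("MiningSession", {"timestamp": timestamp})

    # Metadata of the data mining task: algorithm, dataset and parameters.
    task = ET.SubElement(session, "Task", {"algorithm": algorithm, "dataset": dataset})
    for name, value in parameters.items():
        ET.SubElement(task, "Parameter", {"name": name, "value": str(value)})

    # The extracted patterns themselves (association rules in this sketch).
    patterns = ET.SubElement(session, "Patterns")
    for rule in rules:
        element = ET.SubElement(
            patterns,
            "AssociationRule",
            {"support": str(rule["support"]), "confidence": str(rule["confidence"])},
        )
        ET.SubElement(element, "Body").text = " ".join(rule["body"])
        ET.SubElement(element, "Head").text = " ".join(rule["head"])
    return session


# Hypothetical session with made-up values.
record = build_session_record(
    algorithm="Apriori",
    parameters={"min_support": 0.3, "min_confidence": 0.8},
    dataset="transactions",
    timestamp="2004-01-01T00:00:00+00:00",
    rules=[{"body": ["bread"], "head": ["milk"], "support": 0.4, "confidence": 0.9}],
)
xml_text = ET.tostring(record, encoding="unicode")
```

Storing the task metadata next to the patterns keeps each result traceable to the algorithm, parameters and session that produced it.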

Users may also issue cross-over queries, that is, queries that access the database as well as the pattern base by using the formula element of patterns (see the pattern model above). The query processor queries the database and the pattern base transparently to the user and combines the results.
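
The idea behind a cross-over query can be sketched in a few lines of Python. This is a toy illustration under strong assumptions, not the actual query processor: the database is an in-memory list of transactions, the pattern base is a list of dictionaries, the formula of an association rule is modeled as a Python predicate over a transaction, and all names and values are hypothetical.

```python
def rule_formula(body, head):
    """Formula of an association rule, modeled as a predicate over a transaction."""
    items = set(body) | set(head)
    return lambda transaction: items <= set(transaction)


def cross_over_query(pattern_base, database, min_confidence):
    """For each sufficiently confident pattern, return the data rows it represents."""
    results = []
    for pattern in pattern_base:
        # Step 1: query the pattern base on a measure.
        if pattern["confidence"] < min_confidence:
            continue
        # Step 2: use the formula to reach the matching rows in the database.
        matches = [row for row in database if pattern["formula"](row)]
        # Step 3: combine both sides into a single answer.
        results.append((pattern["name"], matches))
    return results


# Hypothetical data and patterns with made-up values.
database = [
    ["bread", "butter", "milk"],
    ["bread", "milk"],
    ["butter", "jam"],
]
pattern_base = [
    {"name": "bread => milk", "confidence": 0.9,
     "formula": rule_formula(["bread"], ["milk"])},
    {"name": "butter => jam", "confidence": 0.5,
     "formula": rule_formula(["butter"], ["jam"])},
]

answer = cross_over_query(pattern_base, database, min_confidence=0.8)
```

The caller issues a single query, while the function consults both stores and merges the results, which is the transparency described above.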