Statistical modeling of large distribution sets
Description
This paper addresses a ubiquitous problem in data management: hierarchical model estimation for large distribution sets. The problem arises in many applications, including classification, top-k query processing, clustering, and outlier detection. Our aim is to continuously and incrementally estimate the model parameters of 'typical' distributions that describe the characteristics of a database. Our approach handles arbitrary types of data (e.g., categorical and numerical data) incrementally, quickly, and with little resource consumption. Beyond incremental algorithms for model fitting, we propose a modeling framework in which the learning approach recognizes hierarchical groups of distributions with similar characteristics and updates the model parameters of each group separately, without scanning all the distributions in the database. It can therefore return the parameters of typical distribution models at an arbitrary level of granularity, at any time. We also demonstrate the utility of our approach by applying it to two specific problems that arise in data management.
Published in
Proceedings of the Fourth SIGMOD PhD Workshop on Innovative Database Research, pp. 43-48, 2010-06-11
ACM