A number have of folks have asked me about my principles behind managing data. For me, I always apply first principles:
Types of data –
- Reference data – major business entities and their properties; For e.g. and investment bank may consider client, securities and product information to be reference data; A pharmacy may consider Drug, Product, Provider, Patient to be reference data
- Master data – this is core business reference data of broad interest to a number of stakeholders; In this case an organization may want to master this data to identify entities that are the same but referenced differently by different systems or stakeholders; Typically you would use an MDM tool to achieve this.
- Transaction data – Events, Interactions and transactions. This measures the granular transactions that different entities do either for trade, or transaction or service. For e.g. an investment bank may have deal transactions, or trade data from sales and trading desk that may be another example. A pharmacy may similarly have Rx fulfillment records as transaction data.
- Analytic data – Inference or analysis results derived from transaction data combined with reference data through various means such as correlation or regression etc.
- Meta data – data about data such as its, definition, form and purpose
- Rules data – Information governing system or human behavior of a process
Types of data storage paradigms –
- Flat files – usually used for unstructured data like log files
- DBMS – data base management systems
- Hierarchical – a scheme that stores data as a tree of records, with each record having a parent and multiple children for e.g. IBM IMS (Information Management System). Any data access begins from the root record. My first experience with this was while programming at CSX and British Telecom.
- Network – a modification of the hierarchical scheme above to allow for multiple parents and multiple children thus forming a generalized graph structure invented by Charles Bachman. It allows for a better modeling of real life relationships between natural entities. Examples of such implementations are IDS and CA IDMS (Integrated Database Management Systems). Again saw a few implementations at CSX.
- Relational or RDBMS – based on the relational model invented by Edgar F Codd at IBM. The general structure of a DB server consists of a storage model (data on disk organized in tables (rows and columns) and indexes, logs and control files), a memory model (similar to storage but consisting only of a portion of most frequently accessed data cached in memory + meta code, plan and SQL statements for accessing that data) and a process model (consisting of a reader, writer, logging and checkpoints). Most modern relational databases like DB2, Oracle, Sybase, My SQL, SQL Server etc. follow a variation of the above. This is by far the most prevalent DBMS model.
- Object or ODBMS is where information is stored in the form of objects as used in OOP. These databases are not table oriented. Examples are Gemstone products (which are now available as Gemfire object cache which is notable for complex event processing, distributed caching, data virtualization and stream event processing) and Realm available as an open source ODBMS.
- Object-Relational DBMS which aim to bridge the gap between relational databases and object oriented modeling techniques used in OOP via allowing complex data, type inheritance and object behavior; examples include Illustra and PostgreSQL. Although most modern RDBMS like DB2, Oracle DB, SQL Server now claim to support ORDBMS via compliance to SQL:1999 via structured types.
- NoSQL databases allow storage and retrieval of data modeled outside of tabular relations. Reasons for using them stem from scaling via clusters, simplicity of design and finer control over availability. Many compromise consistency in favor of availability, speed and partition tolerance. A partial list of such databases from wikipedia is as follows –
- Column: Accumulo, Cassandra, Druid, HBase, Vertica
- Document: Apache CouchDB, ArangoDB, BaseX, Clusterpoint, Couchbase, Cosmos DB, IBM Domino, MarkLogic, MongoDB, OrientDB, Qizx, RethinkDB
- Key-value: Aerospike, Apache Ignite, ArangoDB, Couchbase, Dynamo, FairCom c-treeACE, FoundationDB, InfinityDB, MemcacheDB, MUMPS, Oracle NoSQL Database, OrientDB, Redis, Risk, Berkley DB, SDBM/Flat File dbm
- Graph: AllegroGraph, ArangoDB, InfiniteGraph, Apache Giraph, MarkLogic, Neo4J, OrientDB, Virtuoso
- Multi-model: Apache Ignite, ArangoDB, Couchbase, FoundationDB, InfinityDB, Marklogic, OrientDB
Common Data Processing Paradigms and Questions about why we perform certain operations –
- Transaction management – OLTP
- Primarily read access – Is when a system is responsible for reading reference data but not maintaining it. A number of techniques can be utilized for this but primarily the approach is read-only services providing data realtime, readonly stored procs for batch or file access. A number of times people prefer to replicate a read only copy of the master database to reduce contention on the master.
- Update access – Usually driven off a single master database to maintain consistency with a single set of services or jdbc/odbc drivers providing create/update/delete access, but certain use cases may warrant a multi-master setup.
- Replication solutions – unidirectional (master-slave), bi-directional, multi-directional (multi-master) or on-premises to cloud replication solutions are available.
- Analytics – OLAP
- Star schema – a single large central fact table and one table for each dimension.
- Snow flake – Is a variant of the star schema model where there still is a single large central fact table and one or more dimension tables, but the dimension tables are normalized in to additional tables.
- Fact constellation or Galaxy: A collection of star schemas, which is a modification of the above where multiple fact tables share dimension tables.
- When to create data warehouses vs. data marts vs. data lakes?
- Data Warehouse – A data warehouse stores data that has been modeled and structured. It holds multiple subject areas with very detailed information and works to integrate all these data sources. It is available for business professionals for running analytics to support their business optimizations. It is fixed configuration, less agile and expensive for large data volumes with a mature security configuration. May not necessarily use a dimension model but feeds other dimension models.
- Data marts – a mart that contains a single subject area, often with rolled up or summary information with the primary purpose of integrating information from a given subject area or source systems.
- Data lakes – On the other hand contain structured, semi-structured, unstructured and raw data that is designed for low cost storage whose main consumers are data scientists who want to figure out new and innovative ways to use this data
Defining what your needs are for each of the above dimensions will usually allow you to choose the right product or implementation pattern for getting the most out of your system.