Setting Up a Data Science Platform

What is a Data Science Platform?

It’s a platform that supports data collection, extraction, storage, exploration, visualization, inference and analytics, and ultimately modeling to provide predictions and classifications.

Why do you need a data science platform and what do you plan to achieve from it?

Usually the business need is to predict the future and establish causation. To predict the future one must identify independent and dependent variables, distinguish causation from correlation, and thus come up with models that predict what the future will look like when we tweak certain causative factors.

What are the key elements needed?

There are three key elements of data management –

1. Data engineering – the set of practices that convert raw data into processed data by building data pipelines, applications and APIs. This process is concerned with how data is captured, moved, stored, secured, processed (transformed, cleansed and aggregated) and finally utilized.

The different stages within data engineering are: 

Data acquisition is the process of acquiring data, with concerns around the format of the source data, the interfaces that already exist or must be built, security (including authentication, authorization and encryption), reliability and, finally, latency.

Data transport manages reliability and integrity in transit, security, latency, cost and bandwidth.

Data storage deals with flexibility of storage, choices between schema-based and schemaless storage, high availability and redundancy, and the cost of storage.

Data processing deals with transforming, cleansing, filtering, enriching and aggregating data, as well as applying machine learning models for prediction.

Finally, data servicing is how data is made available to end consumers, dealing with latency, redundancy and availability, consumers’ competency in understanding and using these data sets, flexibility of schema and, ultimately, the APIs for consumer applications.

2. Data Analytics – the practice of converting the data produced by data engineering into insights and information. The tools we currently use in this space are the Tableau and QlikView reporting packages.

3. Data Science – the ability to use analytics and insights to predict the future using data and patterns observed in the past. The work includes integrating and exploring data, building models on such aggregated data, extracting patterns from past data and finally presenting results either through reports or model-powered applications. Some of the key tools we have used on our platform are Jupyter and RStudio for ongoing algorithm development, Spark for distributed execution and Kafka for messaging and data acquisition. (A simplified end-to-end sketch follows this list.)
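As a rough illustration of how these three elements fit together, here is a minimal Python sketch that strings acquisition, processing, storage and a simple predictive model into one script. The file name, column names and table name are hypothetical placeholders, not part of any particular platform; in practice acquisition might come from an API or a Kafka topic, and storage would be a proper warehouse rather than SQLite.

    # Minimal sketch: acquire raw data, process it, store it, then model it.
    # "raw_transactions.csv", the column names and "analytics.db" are all
    # hypothetical placeholders used only for illustration.
    import sqlite3

    import pandas as pd
    from sklearn.linear_model import LinearRegression

    # Acquisition: read a raw extract (in a real platform this might arrive
    # through an API or a Kafka topic rather than a flat file).
    raw = pd.read_csv("raw_transactions.csv")

    # Processing: cleanse and aggregate.
    clean = raw.dropna(subset=["amount"])
    daily = clean.groupby("trade_date", as_index=False)["amount"].sum()

    # Storage / servicing: persist the processed set for downstream consumers.
    with sqlite3.connect("analytics.db") as conn:
        daily.to_sql("daily_totals", conn, if_exists="replace", index=False)

    # Data science: extract a simple pattern from past data and project forward.
    daily["day_index"] = range(len(daily))
    model = LinearRegression().fit(daily[["day_index"]], daily["amount"])
    pred = model.predict(pd.DataFrame({"day_index": [len(daily)]}))
    print("projected next-day total:", float(pred[0]))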

How do big data patterns complicate this exercise?

Big data introduces the following additional complexities in data processing:

1. Volume: the size of the data sets to be managed and processed is usually a few orders of magnitude larger than in typical OLTP scenarios. This means additional resources, the ability to scale horizontally and careful management of latency requirements.

2. Velocity: requires real-time data and event handling designed to be fast and to avoid bottlenecks.

3. Variety: with data collection augmented by IoT devices in addition to traditional data collection mechanisms, there is a need to manage text, images, audio and video.

4. Variability: data may arrive in fits and starts, implying the need to deal with spikes, an architecture that allows for decoupling and buffering, and finally the ability to maintain latency requirements (see the sketch following this list).
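To make the decoupling point concrete, here is a minimal single-process sketch in Python: a bursty producer and a steady consumer communicate only through a bounded buffer. In a real platform the buffer would be a message broker such as Kafka rather than an in-memory queue, and the sizes and sleep times here are arbitrary.

    # Sketch: absorb arrival spikes by decoupling producer and consumer
    # with a bounded in-memory buffer (a stand-in for a message broker).
    import queue
    import random
    import threading
    import time

    buffer = queue.Queue(maxsize=1000)  # bounded buffer absorbs short bursts

    def producer():
        # Data arrives in fits and starts: bursts followed by quiet periods.
        for i in range(5000):
            buffer.put({"event_id": i})   # blocks when full -> back-pressure
            if random.random() < 0.01:
                time.sleep(0.2)

    def consumer():
        # Drains the buffer at its own steady pace, independent of arrival rate.
        while True:
            event = buffer.get()
            # ... process(event) would go here ...
            buffer.task_done()

    prod = threading.Thread(target=producer)
    cons = threading.Thread(target=consumer, daemon=True)
    prod.start()
    cons.start()
    prod.join()     # wait for the bursty producer to finish
    buffer.join()   # then wait until every buffered event has been processed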

Happy to share real life experiences on the above… please reach out if there is interest.

Why Set Up A Data Warehouse?

What is a Data Warehouse? And why would you need one?

A data warehouse is a central repository that aggregates data from all transactional and other data sources within a firm, to create a historical archive of all of the firm’s data even when transactional systems have hard data retention constraints.

It provides for the following capabilities:

  • Aggregate data from disparate data sources into a single database, so a single query engine can be used to query, join, transform/transpose and present the data
  • Mitigate lock contention caused by isolation levels in transactional systems when large analytical queries are run
  • Maintain data history even when source systems do not, and provide a temporal view of the data
  • Create trend reports for senior management comparing year-over-year (YoY) or quarter-over-quarter (QoQ) performance (see the sketch after this list)
  • Improve data quality and drive consistency in organizational information – consistent codes, descriptions, reference names and values, etc. – and allow data issues to be flagged and fixed
  • Provide a single data model for all data regardless of source
  • Restructure data so that it makes sense to business users
  • Restructure data to improve query performance
  • Add contextual value to operational systems and enterprise apps like CRMs or ERPs.
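As an example of the trend reporting mentioned above, a year-over-year comparison becomes a short query once the data sits in one consolidated store. The sketch below assumes a warehouse table named fact_sales with sale_date and amount columns; the database file and all names are invented for illustration.

    # Sketch: year-over-year growth from a single consolidated store.
    # "warehouse.db", "fact_sales", "sale_date" and "amount" are hypothetical.
    import sqlite3

    import pandas as pd

    with sqlite3.connect("warehouse.db") as conn:
        sales = pd.read_sql("SELECT sale_date, amount FROM fact_sales", conn)

    sales["year"] = pd.to_datetime(sales["sale_date"]).dt.year
    by_year = sales.groupby("year")["amount"].sum()
    print(by_year.pct_change().rename("yoy_growth"))  # YoY growth per year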

What is an Operational Data Store (ODS)?

An ODS is a database designed to integrate data from multiple sources. It allows for cleansing, resolution of redundancy and integrity checking before additional operations are performed. The data is then passed back to the operational systems and on to the DWH for storage and reporting. An ODS is usually designed to contain atomic or low-level data, such as transactions and prices, and holds only limited history, captured in real time or near real time. A much greater volume of data is stored in the DWH, generally loaded on a less frequent basis.

Why do we add Data/Strategic Marts to most modern data management platforms?

Data marts are fit-for-purpose access layers that support specific reporting uses for individual teams or use cases – for example, a sales and operations data mart, or a marketing strategy data mart. A data mart is usually a subset of the DWH, tightly focused on the elements needed for the purpose it is designed for. The usual reasons to create data marts are –

  • Easy access to frequently needed data without contention on the warehouse
  • Creates a collective view for a group of users
  • Improves end-user response times
  • Easier and cheaper to create than a full DWH
  • Potential users are more clearly defined than in a full DWH
  • Less cluttered, as it contains only business-essential data

And finally what are Data Lakes and Swamps?

A data lake is a single store of all the data in the enterprise in its raw form. It is a method of storing data within a system or repository in its natural format, and it facilitates the colocation of data in various schemas, structured and unstructured, in files, object blobs or databases. A deteriorated data lake, inaccessible to its intended users and of no value, is called a “data swamp”.

A Primer on Data Management

A number of folks have asked me about my principles for managing data. For me, I always apply first principles:

Types of data –

  1. Reference data – major business entities and their properties. For example, an investment bank may consider client, securities and product information to be reference data; a pharmacy may consider Drug, Product, Provider and Patient to be reference data
  2. Master data – core business reference data of broad interest to a number of stakeholders. An organization may want to master this data to identify entities that are the same but are referenced differently by different systems or stakeholders; typically you would use an MDM (master data management) tool to achieve this
  3. Transaction data – events, interactions and transactions. This captures the granular transactions that different entities perform, whether trades, deals or services. For example, an investment bank may have deal transactions or trade data from its sales and trading desks; a pharmacy may similarly have Rx fulfillment records as transaction data (a toy sketch contrasting reference and transaction data follows this list)
  4. Analytic data – inference or analysis results derived from transaction data combined with reference data, through means such as correlation or regression
  5. Metadata – data about data, such as its definition, form and purpose
  6. Rules data – information governing system or human behavior within a process
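To make the reference-versus-transaction distinction tangible, here is a toy Python sketch using the pharmacy example above; the class and field names, and the sample values, are all invented for illustration.

    # Toy sketch: reference data (a slowly changing business entity) versus
    # transaction data (a granular, dated event that points at reference data).
    from dataclasses import dataclass
    from datetime import date

    @dataclass
    class Drug:                  # reference data
        ndc_code: str
        name: str

    @dataclass
    class RxFulfillment:         # transaction data
        rx_number: str
        ndc_code: str            # reference back to the Drug entity
        fill_date: date
        quantity: int

    drug = Drug(ndc_code="0000-0000", name="Amoxicillin 500mg")   # made-up values
    fill = RxFulfillment(rx_number="RX-1001", ndc_code=drug.ndc_code,
                         fill_date=date(2020, 1, 15), quantity=30)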

Types of data storage paradigms –

  1. Flat files – usually used for unstructured data like log files
  2. DBMS – database management systems
    1. Hierarchical – a scheme that stores data as a tree of records, with each record having one parent and multiple children, e.g. IBM IMS (Information Management System). Any data access begins from the root record. My first experience with this was while programming at CSX and British Telecom.
    2. Network – a modification of the hierarchical scheme above, invented by Charles Bachman, that allows for multiple parents and multiple children, thus forming a generalized graph structure. It allows for better modeling of real-life relationships between natural entities. Examples of such implementations are IDS and CA IDMS (Integrated Database Management System). Again, I saw a few implementations at CSX.
    3. Relational or RDBMS – based on the relational model invented by Edgar F. Codd at IBM. The general structure of a DB server consists of a storage model (data on disk organized in tables (rows and columns) plus indexes, logs and control files), a memory model (similar to storage but holding only the most frequently accessed data cached in memory, plus the meta code, plans and SQL statements for accessing that data) and a process model (consisting of readers, writers, logging and checkpoints). Most modern relational databases like DB2, Oracle, Sybase, MySQL, SQL Server etc. follow a variation of the above. This is by far the most prevalent DBMS model.
    4. Object or ODBMS – information is stored in the form of objects, as used in OOP. These databases are not table-oriented. Examples are the GemStone products (now available as the GemFire object cache, notable for complex event processing, distributed caching, data virtualization and stream event processing) and Realm, available as an open-source ODBMS.
    5. Object-relational DBMS – aims to bridge the gap between relational databases and the object-oriented modeling techniques used in OOP by allowing complex data, type inheritance and object behavior; examples include Illustra and PostgreSQL, although most modern RDBMS like DB2, Oracle DB and SQL Server now claim ORDBMS support via compliance with SQL:1999 structured types.
    6. NoSQL – databases that allow storage and retrieval of data modeled outside of tabular relations. Reasons for using them stem from scaling via clusters, simplicity of design and finer control over availability. Many compromise consistency in favor of availability, speed and partition tolerance. A partial list of such databases from Wikipedia is as follows (a brief sketch contrasting relational and key-value access appears after this list) –
      1. Column: Accumulo, Cassandra, Druid, HBase, Vertica
      2. Document: Apache CouchDB, ArangoDB, BaseX, Clusterpoint, Couchbase, Cosmos DB, IBM Domino, MarkLogic, MongoDB, OrientDB, Qizx, RethinkDB
      3. Key-value: Aerospike, Apache Ignite, ArangoDB, Couchbase, Dynamo, FairCom c-treeACE, FoundationDB, InfinityDB, MemcacheDB, MUMPS, Oracle NoSQL Database, OrientDB, Redis, Riak, Berkeley DB, SDBM/Flat File dbm
      4. Graph: AllegroGraph, ArangoDB, InfiniteGraph, Apache Giraph, MarkLogic, Neo4j, OrientDB, Virtuoso
      5. Multi-model: Apache Ignite, ArangoDB, Couchbase, FoundationDB, InfinityDB, MarkLogic, OrientDB
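For a feel of how the access patterns differ, the short sketch below contrasts a relational query (using Python’s built-in SQLite as a stand-in for an RDBMS) with a key-value lookup (a plain dictionary standing in for a Redis- or Dynamo-style store); the client record is made up.

    # Sketch: relational access (schema + SQL) versus key-value access (lookup by key).
    import sqlite3

    # Relational: schema declared up front, declarative SQL over tables.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE clients (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("INSERT INTO clients VALUES (1, 'Acme Corp')")
    print(conn.execute("SELECT name FROM clients WHERE id = 1").fetchone())

    # Key-value: no fixed schema, retrieval only by key (as in Redis or Dynamo).
    kv_store = {}
    kv_store["client:1"] = {"name": "Acme Corp"}
    print(kv_store["client:1"]["name"])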

Common data processing paradigms, and questions about why we perform certain operations –

  • Transaction management – OLTP
    • Primarily read access – when a system is responsible for reading reference data but not maintaining it. A number of techniques can be used, but the primary approaches are read-only services providing data in real time and read-only stored procedures for batch or file access. Often a read-only replica of the master database is maintained to reduce contention on the master.
    • Update access – usually driven off a single master database to maintain consistency, with a single set of services or JDBC/ODBC drivers providing create/update/delete access, though certain use cases may warrant a multi-master setup.
    • Replication solutions – unidirectional (master-slave), bidirectional, multi-directional (multi-master) and on-premises-to-cloud replication solutions are available.
  • Analytics – OLAP
    • Star schema – a single large central fact table and one table for each dimension (a toy example follows this list).
    • Snowflake – a variant of the star schema model where there is still a single large central fact table and one or more dimension tables, but the dimension tables are normalized into additional tables.
    • Fact constellation or galaxy – a collection of star schemas, a modification of the above where multiple fact tables share dimension tables.
  • When to create data warehouses vs. data marts vs. data lakes?
    • Data warehouse – stores data that has been modeled and structured. It holds multiple subject areas with very detailed information and works to integrate all of these data sources. It is available to business professionals for running the analytics that support their business optimizations. It has a fixed configuration, is less agile, is expensive for large data volumes and has a mature security configuration. It may not necessarily use a dimensional model, but it feeds other dimensional models.
    • Data marts – contain a single subject area, often with rolled-up or summary information, with the primary purpose of integrating information from a given subject area or set of source systems.
    • Data lakes – on the other hand, contain structured, semi-structured, unstructured and raw data, are designed for low-cost storage, and their main consumers are data scientists who want to figure out new and innovative ways to use this data.
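As a toy illustration of the star schema idea referenced above, the sketch below joins a small central fact table to two dimension tables and aggregates the result; all of the data is made up.

    # Sketch: a central fact table joined to dimension tables, then aggregated.
    import pandas as pd

    fact_sales = pd.DataFrame({
        "date_id":    [1, 1, 2],
        "product_id": [10, 11, 10],
        "amount":     [100.0, 250.0, 175.0],
    })
    dim_date = pd.DataFrame({"date_id": [1, 2], "quarter": ["Q1", "Q2"]})
    dim_product = pd.DataFrame({"product_id": [10, 11],
                                "category": ["Retail", "Wholesale"]})

    report = (fact_sales
              .merge(dim_date, on="date_id")          # fact -> date dimension
              .merge(dim_product, on="product_id")    # fact -> product dimension
              .groupby(["quarter", "category"], as_index=False)["amount"].sum())
    print(report)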

Defining what your needs are for each of the above dimensions will usually allow you to choose the right product or implementation pattern for getting the most out of your system.