IT Management: Technology review of Data warehousing

Data flow processes before the advent of data warehousing
Prior to data warehousing, data flow within an organization was slow and cumbersome. Redundant copies of data were needed to support multiple decision-making environments. Although each environment served a different set of users, the information they required was largely the same, which led to substantial duplication of data. Gathering and integrating data also consumed considerable time and, like the data itself, these procedures were replicated for each environment. Furthermore, the operational systems in place had to be re-examined whenever requirements changed.

The advent of data warehousing
Data warehousing dates back to the 1980s. In 1983, Teradata introduced a database management system designed specifically for decision support.
In 1988, IBM researchers Barry Devlin and Paul Murphy introduced the concept of the business data warehouse, which was intended to facilitate the flow of data from the operational level to the strategic level. Their proposed architecture also helped to develop the data warehousing concept more fully.

In 1990, Red Brick Systems launched Red Brick Warehouse, a complete database management system built specifically for data warehousing.

Historical evolution
Bill Inmon: father of data warehousing
A year later, in 1991, Prism Solutions, under the leadership of Bill Inmon, introduced Prism Warehouse Manager, software for developing data warehouses.
Bill Inmon is a prolific writer on data warehousing, business intelligence and database management, and has written more than 40 books and a thousand articles on these topics. He has contributed significantly both to the concept of data warehousing itself and to awareness of it. In 1999, he created a website named Corporate Information Factory, which was aimed at educating professionals and company executives on the effective use of data warehousing.

His contributions to this website include methodologies that proved decisive in the development of data warehousing over the years. In 2007, Bill Inmon was named among the most influential people in the first 40 years of the IT industry. Today, he acts as a consultant to a large number of Fortune 1000 companies on data warehousing issues, alongside managing his own companies.

What is it
Data warehouse
A data warehouse is primarily a secondary store (repository) of an entity's electronic data. Beyond storage, a warehouse also supports reporting and analysis. Data stored in a data warehouse can be retrieved, modified and analyzed as and when needed.

Data warehouse designing
Designing a data warehouse is a complex process, since the design must support the various types of queries that the data warehouse will be used for.
The process requires a clear understanding of the database to be developed, and therefore interaction with its future users. Data warehouse design is iterative, since a model needs considerable modification before it becomes stable. Development at this stage has to be done with great care, because once the data model has been populated with data it cannot easily be modified; in fact, some of it would be very difficult to recreate.

Although many different methodologies are available for data warehouse design, the two major approaches are bottom-up and top-down.
Bottom-up approach
Ralph Kimball, an acclaimed author on data warehousing, recommends a bottom-up approach. Under this method, data marts are developed first.

Data marts
Data marts are subsets of data, usually summarized, that are intended for a specific purpose. Some of the reasons for and benefits of creating data marts are as follows:
They provide easier access to frequently used data, since such data can be summarized and stored in a data mart.
They help in creating a collective view of the data among a group of users.
Since the purpose of a data mart is known at the outset, it can be designed to improve end-user response time.
They can be developed relatively easily.
Because they hold smaller amounts of data, they cost less to build.
The users of a data mart are known much more clearly than those of a complete data warehouse.
Once the data marts have been developed, they are combined, and their integration forms the data warehouse. This is achieved through the data warehouse bus architecture. The process has to be managed effectively if an organization intends to maintain data integrity.

One of the key aspects of this process is to make sure that dimensions among various data marts are consistent.
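One way to picture this consistency check is the minimal Python sketch below. The mart names, the customer dimension and its contents are all hypothetical and invented for illustration; the only point is that a dimension shared by several data marts should carry the same keys and attribute values in each of them.

```python
# Hypothetical copies of a customer dimension held by two data marts
# (illustrative data only, not taken from any real system).
sales_mart_customer = {
    1: {"name": "Acme Ltd", "region": "North"},
    2: {"name": "Bolt Inc", "region": "South"},
}
billing_mart_customer = {
    1: {"name": "Acme Ltd", "region": "North"},
    2: {"name": "Bolt Inc", "region": "West"},   # disagrees with the sales mart
    3: {"name": "Cord LLC", "region": "East"},   # missing from the sales mart
}


def dimension_conflicts(dim_a, dim_b):
    """Return the keys on which two copies of a dimension disagree.

    A key is reported when it exists in only one copy, or when its
    attributes differ between the copies.
    """
    conflicts = []
    for key in sorted(set(dim_a) | set(dim_b)):
        if dim_a.get(key) != dim_b.get(key):
            conflicts.append(key)
    return conflicts


conflicts = dimension_conflicts(sales_mart_customer, billing_mart_customer)
print(conflicts)
```

An empty result would suggest that the two marts can safely be combined on this dimension; any reported key would need to be reconciled first.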
Top-down
Bill Inmon, the influential author on data warehousing, put forward another methodology, which is the opposite of Kimball's bottom-up design. Inmon defines a data warehouse as the central store of an organization's data.
All data, including the lowest level of detail, known as atomic data, is stored in the data warehouse. Data marts, which serve a specific purpose or a particular application's requirements, are then derived from the data warehouse.
Moreover, Inmon holds that a data warehouse has the following characteristics:
Data in a data warehouse is highly organized; data elements relating to the same subject are linked together.
Data, once stored in a data warehouse, is never modified or deleted; it remains static (read-only) and is retained to support future reporting.
A data warehouse contains data from all the operational systems of an organization, and this data is made consistent.
Given these characteristics, a top-down approach yields highly consistent dimensional data and is robust against future changes in business needs.
This approach is also considered simpler than bottom-up, since data marts can be created from the data warehouse quite easily.
However, a major disadvantage of this methodology is that it was developed with larger organizations in mind; it is costly and therefore impractical for smaller entities. Moreover, since it results in a very large project, it is quite time-consuming as well.
Another problem with the top-down approach is that it can be unresponsive to changing departmental and organizational needs, both during implementation and in the long run.

Hybrid
Over the years, a third design methodology has come to the forefront as well. The hybrid design aims to employ the benefits of both the bottom-up and top-down approaches and to overcome their disadvantages. For instance, it combines the fast retrieval of the bottom-up design with the organization-wide data consistency of the top-down design.
Data storage approaches
There are two major approaches to storing data in a data warehouse: the dimensional approach and the normalized approach.

Dimensional approach
In this approach, data is divided into facts and dimensions. Facts are generally numeric measures, and dimensions are the reference information that gives context to the facts. This approach employs the following concepts:
Dimension: a category of data.
Attribute: a level of data within a dimension.
Hierarchy: the relationship between the various attributes in a dimension.
A major benefit of the dimensional approach is that it makes the data warehouse easier for users to understand and use. Moreover, data retrieval from the data warehouse is faster under this approach.

However, the dimensional approach has some disadvantages as well, for instance:
Maintaining data integrity is challenging, since loading data from different operational systems into the data warehouse is quite complicated.
Similarly, the data warehouse becomes difficult to modify if an organization that has adopted the dimensional approach changes the way it carries out its day-to-day business.

Normalized approach
Under this approach, data in the data warehouse is stored according to data normalization rules. Data normalization is a systematic approach which ensures that the database structure is suitable for general-purpose queries and is free from anomalies which could undermine data integrity.
In this approach, tables are grouped together by subject areas, which reflect general categories of data. One advantage of the normalized approach is that it is quite straightforward to add information to the database. However, there are some disadvantages as well, for instance:
Since a large number of tables is involved, it is difficult to join data from different sources into a single meaningful form.
Accessing information is also difficult without a precise understanding of the data sources and the data structure.
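The join burden described above can be sketched with Python's built-in sqlite3 module. The four tables and their contents are hypothetical, invented for illustration; the sketch only shows that, in a normalized layout, even a simple question such as revenue per customer has to pass through several tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer   (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE product    (product_id  INTEGER PRIMARY KEY, unit_price REAL);
    CREATE TABLE orders     (order_id    INTEGER PRIMARY KEY,
                             customer_id INTEGER REFERENCES customer(customer_id));
    CREATE TABLE order_line (order_id    INTEGER REFERENCES orders(order_id),
                             product_id  INTEGER REFERENCES product(product_id),
                             quantity    INTEGER);

    INSERT INTO customer   VALUES (1, 'Acme Ltd'), (2, 'Bolt Inc');
    INSERT INTO product    VALUES (10, 2.5), (11, 4.0);
    INSERT INTO orders     VALUES (100, 1), (101, 2);
    INSERT INTO order_line VALUES (100, 10, 4), (100, 11, 1), (101, 11, 3);
""")

# Revenue per customer needs three joins across four tables.
revenue_per_customer = conn.execute("""
    SELECT c.name, SUM(l.quantity * p.unit_price)
    FROM customer AS c
    JOIN orders     AS o ON o.customer_id = c.customer_id
    JOIN order_line AS l ON l.order_id    = o.order_id
    JOIN product    AS p ON p.product_id  = l.product_id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
print(revenue_per_customer)
```

Adding a new customer or product, by contrast, is a single-row insert into one table, which reflects the advantage noted above.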

How does it work
The functioning of a data warehouse is determined by its architecture, that is, the conceptualization of how the data warehouse is built. The appropriateness of an architecture depends on how the data warehouse is developed, how it is intended to be used and how it will be maintained.

A simple data warehouse architecture consists of several interconnected layers.
Operational layer
This layer holds the source data used to build the data warehouse. It typically includes the ERP systems of an organization.
ERPs
Enterprise resource planning (ERP) systems are used to manage the information and various functions of an organization. They integrate operational-level information by integrating all operational departments.
Data migration
Data access layer
This layer works as the interface between the operational layer and the informational access layer. It consists of the tools needed to extract, transform and load data into the warehouse.
Extract, transform and load (ETL)

It is a process which involves the following:
Extracting data: the process of retrieving data from multiple data sources. Besides retrieval, it also involves analyzing the extracted data to check whether it conforms to a pre-determined pattern or structure. Extracted data that fails to meet the requirement is rejected.
Data transformation: a set of rules which are applied to the extracted data. There are various types of data transformation:
Data cleansing: data records are checked for inaccuracy and corruption, and where such discrepancies are found the data is corrected (i.e. cleansed).
Filtering
Sorting
Merging data from various sources
Aggregation/disaggregation
Load: in this phase, data is loaded into the data warehouse. This process and its requirements vary from organization to organization.
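The three phases can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two source lists, the field names and the customer_sales table are hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Hypothetical rows extracted from two operational sources.
source_a = [
    {"customer": " acme ", "amount": "100.50"},
    {"customer": "Bolt", "amount": "n/a"},   # does not match the expected pattern
]
source_b = [
    {"customer": "ACME", "amount": "20"},
    {"customer": "Cord", "amount": "0"},
]


def extract(*sources):
    """Extract: pull rows from every source, rejecting malformed ones."""
    accepted, rejected = [], []
    for source in sources:
        for row in source:
            try:
                float(row["amount"])
            except (KeyError, ValueError):
                rejected.append(row)
            else:
                accepted.append(row)
    return accepted, rejected


def transform(rows):
    """Transform: cleanse, filter, merge, aggregate and sort the rows."""
    totals = {}
    for row in rows:
        name = row["customer"].strip().title()         # cleansing
        amount = float(row["amount"])
        if amount <= 0:                                # filtering
            continue
        totals[name] = totals.get(name, 0.0) + amount  # merging and aggregation
    return sorted(totals.items())                      # sorting


def load(conn, records):
    """Load: write the transformed records into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customer_sales (customer TEXT, total REAL)"
    )
    conn.executemany("INSERT INTO customer_sales VALUES (?, ?)", records)
    conn.commit()


accepted, rejected = extract(source_a, source_b)
records = transform(accepted)
warehouse = sqlite3.connect(":memory:")
load(warehouse, records)
loaded = warehouse.execute(
    "SELECT customer, total FROM customer_sales ORDER BY customer"
).fetchall()
print(loaded)
print(rejected)
```

Rejected rows are kept in a separate list rather than silently dropped, so that they can be inspected or corrected at the source.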

Metadata layer
This layer of the data warehouse architecture handles metadata. Metadata can be defined as data about data. Metadata management ensures that metadata is appropriately created, stored and controlled, so that inconsistencies and unnecessary redundancies are avoided. Metadata is stored in a repository and can provide essential information on the various tools used. Moreover, it can be used to analyze the impact of changes in database tables on the data itself.
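As a rough illustration of impact analysis, the Python sketch below treats the repository as a simple mapping from downstream objects to the tables they read. The report and table names are hypothetical; a real metadata repository would hold far richer information.

```python
# Hypothetical metadata repository: the source tables that feed
# each downstream report or dashboard.
LINEAGE = {
    "monthly_sales_report": ["fact_sales", "dim_date"],
    "customer_dashboard": ["dim_customer", "fact_sales"],
    "inventory_report": ["fact_stock"],
}


def impacted_by(table):
    """Return the downstream objects that read from the given table."""
    return sorted(name for name, sources in LINEAGE.items() if table in sources)


affected = impacted_by("fact_sales")
print(affected)
```

Before a table is changed, a lookup of this kind shows which reports would need to be retested.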
Key considerations when designing a metadata management system
Metadata comes in various forms and has various uses. Determining the scope of the repository is therefore considered key when defining a new project.
Similarly, handling the diversity of data is crucial, since it supports consistency and information retrieval.
A flexible and resilient metadata management system has to be developed, one which not only responds to both technological and business changes but also ensures that data and access to it are protected.
Rationalization is another key consideration in making data retrieval more useful. It is the process of making sense of raw (unprocessed) metadata, which is usually obtained from legacy systems. It is therefore critical in creating business value from stored information.

Data dictionary
A data dictionary is an example of metadata management. It is a centralized collection of information on the meaning, attributes, origin, use and format of data, and therefore describes a database or a group of databases.
A data dictionary can also function as middleware, that is, software which links various software components. Usually this is done by introducing a further data dictionary layer which is linked with the basic DBMS (database management system) data dictionary. This new layer is considered the high-level data dictionary, since it provides many more features than the low-level data dictionary, whose sole purpose is to describe the various data elements of a DBMS.
A high-level data dictionary can help to fulfil the requirements of new applications. It can also help to optimize queries across the various databases.
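To make the idea concrete, here is a minimal Python sketch of a data dictionary held as a plain mapping. The element names, origins and formats are hypothetical; the sketch shows how entries that record meaning, type, origin and format can be used both to describe a data element and to check an incoming record.

```python
# Hypothetical data dictionary: one entry per data element.
DATA_DICTIONARY = {
    "customer_id": {
        "meaning": "Unique identifier of a customer",
        "type": int,
        "origin": "CRM system",
        "format": "positive integer",
    },
    "order_date": {
        "meaning": "Date on which the order was placed",
        "type": str,
        "origin": "Order entry system",
        "format": "YYYY-MM-DD",
    },
}


def describe(element):
    """Return a one-line description of a data element."""
    entry = DATA_DICTIONARY[element]
    return "{}: {} (from {}, format {})".format(
        element, entry["meaning"], entry["origin"], entry["format"]
    )


def undocumented_or_mistyped(record):
    """List the fields of a record that the dictionary cannot vouch for."""
    problems = []
    for field, value in record.items():
        entry = DATA_DICTIONARY.get(field)
        if entry is None:
            problems.append(field + ": not in dictionary")
        elif not isinstance(value, entry["type"]):
            problems.append(field + ": expected " + entry["type"].__name__)
    return problems


record = {"customer_id": "42", "order_date": "2013-12-18", "colour": "red"}
problems = undocumented_or_mistyped(record)
print(describe("customer_id"))
print(problems)
```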

A data dictionary has considerable advantages:
It helps in maintaining the consistency of data used across the organization.
It provides much-needed clarity about the data, which makes the data usable for both users and developers. Much of the data used within an organization is non-standardized, and a lack of clarity can cause different users a lot of trouble; it can lead to misunderstandings, erroneous reporting and, ultimately, poor decisions at the management level.
A data dictionary also encourages the reuse of data. Because it ensures consistency and clarity, data can readily be reused in future projects that are similar in nature to existing ones. Reuse is crucial in this information age if an entity intends to save money on reinventing data.
Data properties need to be complete if improper use and misunderstanding of data are to be avoided. A data dictionary helps to ensure that the data being stored is complete in all respects.
With the above advantages in place, a data dictionary makes data more presentable and easier for a user to understand.

Benefits of Metadata management
As organizations worldwide face increasing compliance requirements, measures for the proper retention, protection and disposal of data objects have to be built into each object's metadata.
Metadata also plays a key role in providing archived information, for example to customers and suppliers. This can result in improved efficiency and a reduction in costs.
Most knowledge worker applications, for instance decision support systems and real-time collaboration systems, are dependent on metadata.
Informational access layer
In this layer, data is accessed largely for the purposes of reporting and analysis, for which numerous tools are available.

Business intelligence tools
Business intelligence (BI) is a broad term which encompasses the different skills, methods, technologies and applications which play a significant role in facilitating decision making.
BI tools can develop both historic views and future projections from the data obtained. Some of the more commonly used BI tools are spreadsheets, OLAP, data mining and reporting/querying software.

Online Analytical Processing (OLAP)
It is an approach for quickly answering multi-dimensional analytical queries. Typical applications of OLAP are business reporting, management reporting and business process management (BPM).
Databases configured for OLAP use a multi-dimensional data model, which allows complex analytical queries to execute faster than they would on a database designed for online transaction processing (OLTP). They borrow aspects of navigational, hierarchical and relational databases.
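A multi-dimensional query can be pictured with the small Python sketch below. The cube, its three dimensions and its figures are hypothetical; the sketch shows two common operations, rolling up (aggregating away dimensions) and slicing (fixing one dimension to a single value), and is not a description of how any particular OLAP product is implemented.

```python
from collections import defaultdict

# Hypothetical sales cube: each cell is keyed by (region, product, quarter).
DIMENSIONS = ("region", "product", "quarter")
cube = {
    ("North", "Widget", "Q1"): 120,
    ("North", "Widget", "Q2"): 80,
    ("North", "Gadget", "Q1"): 50,
    ("South", "Widget", "Q1"): 70,
    ("South", "Gadget", "Q2"): 30,
}


def roll_up(cells, keep):
    """Aggregate the cube, keeping only the named dimensions."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for key, value in cells.items():
        totals[tuple(key[i] for i in positions)] += value
    return dict(totals)


def slice_cube(cells, dimension, member):
    """Keep only the cells where one dimension has a fixed value."""
    position = DIMENSIONS.index(dimension)
    return {key: value for key, value in cells.items() if key[position] == member}


by_region = roll_up(cube, ["region"])
q1_by_product = roll_up(slice_cube(cube, "quarter", "Q1"), ["product"])
print(by_region)
print(q1_by_product)
```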

Data mining
It is a process which focuses on extracting patterns from selected data. Since data mining often carries out its analysis on samples of data, it is essential that these samples accurately reflect the larger pool of data. It involves four sub-tasks:
Classification: arranging data into pre-defined groups.
Clustering: grouping similar data items together, without pre-defined groups.
Regression: finding a function that models the data with the least error.
Association rule learning: searching for relationships between variables in a data record.
Over the years, data mining has gained significant popularity as a tool for transforming data into information.
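As a small illustration of the last sub-task, searching for relationships between variables, the Python sketch below computes two standard measures for a candidate rule: support (how often the items occur together) and confidence (how often the second item occurs when the first does). The purchase records are hypothetical and far smaller than anything a real mining tool would work on.

```python
from fractions import Fraction

# Hypothetical purchase records; each record is the set of items bought together.
records = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread", "milk"},
    {"milk"},
]


def support(itemset):
    """Share of records that contain every item in the itemset."""
    hits = sum(1 for record in records if itemset <= record)
    return Fraction(hits, len(records))


def confidence(antecedent, consequent):
    """Among records with the antecedent, the share that also hold the consequent."""
    return support(antecedent | consequent) / support(antecedent)


rule_support = support({"bread", "butter"})
rule_confidence = confidence({"bread"}, {"butter"})
print(rule_support, rule_confidence)
```

Exact fractions are used instead of floating-point numbers so that the two measures can be read off and compared without rounding.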

Situation Analysis
Business benefits
Data warehouses use a common data model for all of an organization's data, regardless of the data type or its source, which makes it easier to both report on and analyze information.
Any inconsistencies in the data are identified and resolved before it is loaded into the data warehouse. This further simplifies reporting and can save an organization substantial costs.
Since a data warehouse runs on a platform separate from an organization's operational systems, data retrieval is swift and does not slow down those operational systems.
Information stored in a data warehouse serves as a secondary copy and can therefore prove helpful if the source system somehow gets corrupted.

Market trends
Data warehouses are playing an increasingly major role in the data management of organizations. However, the current recession has forced many organizations worldwide to cut back on their data warehouse systems.
On the other hand, data warehouses are projected to face increased demand five years from now, according to James Kobielus, an expert on data warehousing.
General-purpose enterprise-wide data warehouses (EDWs) are projected to replace the more common data warehouses built from application-specific data marts. The latter are used largely for tactical purposes, whereas EDWs incorporate data marts which are more general-purpose and serve the whole organization. Further rationalization, streamlining and consolidation of data marts is also a projected trend.
More and more data warehouses are projected to be based on a hub-and-spoke architecture, which will save an organization both software and hardware costs.
Social software is expected to be combined with business intelligence tools, which will encourage collaborative decision making.

Usage models
The usage of data warehouses evolves with time. Organizations start with a simpler form of data warehousing, which becomes more sophisticated over time. The stages of usage a data warehouse goes through in an organization are described below.
Offline operational data warehouse
This form of data warehouse simply copies the operational data of an organization onto another server.
Offline data warehouse
This form goes a step beyond the offline operational data warehouse, since it also supports reporting.
Real time data warehouse
At this stage, data stored in the data warehouse is updated whenever a transaction is processed.

Integrated data warehouse
At this level, the data warehouse is updated whenever the operational systems perform a transaction, and transactions are in turn passed back into the operational systems as well.
At an organizational level, data warehouses are used for validation, tactical reporting and exploration. Sample applications of data warehouses include insurance fraud analysis and logistics management.

Solutions and Application Vendors
Organizations worldwide increasingly use data warehousing to secure and properly manage their data, which is crucial for an organization's survival.
Build a data warehouse
Developing a data warehouse in-house can be expensive. Initial setup costs are quite high, so usually only large organizations can afford it; for smaller entities it is unaffordable.

Buy a data warehouse
Organizations can buy data warehouse software to fulfil their data storage needs. This not only comes at a competitive price but also spares them the hassle of setting it up themselves. Some of the major vendors of data warehouse software are Teradata, Vertica Systems, EMC Corporation, FileTek Inc. and Greenplum.

Challenges and Opportunities
Impact of data warehouses in the current business world
Data warehousing has redefined the way organizations worldwide manage their data. Although it is still largely considered a data backup mechanism, it has shown immense potential to serve many other purposes in an organization.

Data warehouses have facilitated increased reporting and analysis in organizations over time. Moreover, data warehouses actively support decision support systems, which means that they have started to influence strategic decision making in organizations.

Future usage
As discussed earlier, data warehouses are largely under-utilized. Even today, more than 50% of organizations' usage of data warehouses is only for tactical purposes and validation.
Data warehouses will see an elevation of their role in organizations. Since efficient data management and utilization has become key to achieving competitive advantage, data warehouses will play an integral role in decision making in the near future.

Data warehousing encompasses a broad spectrum of stages which evolve with the passage of time and the growth of an organization. Data warehousing has considerable advantages, such as simpler reporting and analysis across an organization's data. However, data warehouses do have their share of disadvantages; for example, data has to be extracted, transformed and loaded, which can be quite time-consuming. Similarly, data warehousing can have high initial costs and costly maintenance thereafter.

Although data warehousing is complex to manage, the potential for higher returns on investment and possible competitive advantage makes it worth the risk.

Wednesday, December 18, 2013
