Business Intelligence is a term introduced by Howard Dresner of Gartner Group in 1989. He described Business Intelligence as a set of concepts and methodologies to improve decision making in business through the use of facts and fact-based systems. Over time, as the use of Business Intelligence has become mainstream, more definitions have emerged. Business intelligence (BI) is a broad category of applications and technologies for gathering, storing, analyzing, and providing access to data to help enterprise users make better business decisions. BI applications include the activities of decision support systems, query and reporting, online analytical processing (OLAP), statistical analysis, forecasting, and data mining.
Business intelligence applications can be:
* Mission-critical and integral to an enterprise's operations or occasional to meet a special requirement
* Enterprise-wide or local to one division, department, or project
* Centrally initiated or driven by user demand
ETL - Extraction, Transformation, and Loading in Data Warehousing
The extract-transform-load (ETL) system, or more informally, the "back room," is often estimated to consume 70 percent of the time and effort of building a data warehouse. But there hasn't been enough careful thinking about just why the ETL system is so complex and resource intensive. Everyone understands the three letters: You get the data out of its original source location (E), you do something to it (T), and then you load it (L) into a final set of tables for the users to query.
When asked about breaking down the three big steps, many designers say, "Well, that depends." It depends on the source, it depends on funny data idiosyncrasies, it depends on the scripting languages and ETL tools available, it depends on the skills of the in-house staff, and it depends on the query and reporting tools the end users have.
The "it depends" response is dangerous because it becomes an excuse to roll your own ETL system, which in the worst-case scenario results in an undifferentiated spaghetti-mess of tables, modules, processes, scripts, triggers, alerts, and job schedules. Maybe this kind of creative design approach was appropriate a few years ago when everyone was struggling to understand the ETL task, but with the benefit of thousands of successful data warehouses, a set of best practices is ready to emerge.
I have spent the last 18 months intensively studying ETL practices and ETL products. I have identified a list of 38 subsystems that are needed in almost every data warehouse back room. That's the bad news. No wonder the ETL system takes such a large fraction of the data warehouse resources. But the good news is that if you study the list, you'll recognize almost all of them, and you'll be on the way to leveraging your experience in each of these subsystems as you build successive data warehouses.
The 38 Subsystems
1. Extract system. Source data adapters, push/pull/dribble job schedulers, filtering and sorting at the source, proprietary data format conversions, and data staging after transfer to ETL environment.
2. Change data capture system. Source log file readers, source date and sequence number filters, and CRC-based record comparison in ETL system.
3. Data profiling system. Column property analysis including discovery of inferred domains, and structure analysis including candidate foreign key to primary key relationships, data rule analysis, and value rule analysis.
4. Data cleansing system. Typically a dictionary-driven system for complete parsing of names and addresses of individuals and organizations, possibly also products or locations. "De-duplication," including identification and removal of duplicates, usually for individuals and organizations, possibly products or locations; often uses fuzzy logic. "Surviving," using specialized data merge logic that preserves specified fields from certain sources as the final saved versions. Maintains back references (such as natural keys) to all participating original sources.
5. Data conformer. Identification and enforcement of special conformed dimension attributes and conformed fact table measures as the basis for data integration across multiple data sources.
6. Audit dimension assembler. Assembly of metadata context surrounding each fact table load in such a way that the metadata context can be attached to the fact table as a normal dimension.
7. Quality screen handler. In line ETL tests applied systematically to all data flows checking for data quality issues. One of the feeds to the error event handler (see subsystem 8).
8. Error event handler. Comprehensive system for reporting and responding to all ETL error events. Includes branching logic to handle various classes of errors, and includes real-time monitoring of ETL data quality.
9. Surrogate key creation system. Robust mechanism for producing stream of surrogate keys, independently for every dimension. Independent of database instance, able to serve distributed clients.
10. Slowly Changing Dimension (SCD) processor. Transformation logic for handling three types of time variance possible for a dimension attribute: Type 1 (overwrite), Type 2 (create new record), and Type 3 (create new field).
11. Late arriving dimension handler. Insertion and update logic for dimension changes that have been delayed in arriving at the data warehouse.
12. Fixed hierarchy dimension builder. Data validity checking and maintenance system for all forms of many-to-one hierarchies in a dimension.
13. Variable hierarchy dimension builder. Data validity checking and maintenance system for all forms of ragged hierarchies of indeterminate depth, such as organization charts and parts explosions.
14. Multivalued dimension bridge table builder. Creation and maintenance of associative (bridge) table used to describe a many-to-many relationship between dimensions. May include weighting factors used for allocations and situational role descriptions.
15. Junk dimension builder. Creation and maintenance of dimensions consisting of miscellaneous low cardinality flags and indicators found in most production data sources.
16. Transaction grain fact table loader. System for updating transaction grain fact tables including manipulation of indexes and partitions. Normally append mode for most recent data. Uses surrogate key pipeline (see subsystem 19).
17. Periodic snapshot grain fact table loader. System for updating periodic snapshot grain fact tables including manipulation of indexes and partitions. Includes frequent overwrite strategy for incremental update of current period facts. Uses surrogate key pipeline (see subsystem 19).
18. Accumulating snapshot grain fact table loader. System for updating accumulating snapshot grain fact tables including manipulation of indexes and partitions, and updates to both dimension foreign keys and accumulating measures. Uses surrogate key pipeline (see subsystem 19).
19. Surrogate key pipeline. Pipelined, multithreaded process for replacing natural keys of incoming data with data warehouse surrogate keys (a minimal sketch appears after this list).
20. Late arriving fact handler. Insertion and update logic for fact records that have been delayed in arriving at the data warehouse.
21. Aggregate builder. Creation and maintenance of physical database structures, known as aggregates, that are used in conjunction with a query-rewrite facility, to improve query performance. Includes stand-alone aggregate tables and materialized views.
22. Multidimensional cube builder. Creation and maintenance of star schema foundation for loading multidimensional (OLAP) cubes, including special preparation of dimension hierarchies as dictated by the specific cube technology.
23. Real-time partition builder. Special logic for each of the three fact table types (see subsystems 16, 17, and 18) that maintains a "hot partition" in memory containing only the data that has arrived since the last update of the static data warehouse tables.
24. Dimension manager system. Administration system for the "dimension manager" who replicates conformed dimensions from a centralized location to fact table providers. Paired with subsystem 25.
25. Fact table provider system. Administration system for the "fact table provider" who receives conformed dimensions sent by the dimension manager. Includes local key substitution, dimension version checking, and aggregate table change management.
26. Job scheduler. System for scheduling and launching all ETL jobs. Able to wait for a wide variety of system conditions including dependencies of prior jobs completing successfully. Able to post alerts.
27. Workflow monitor. Dashboard and reporting system for all job runs initiated by the Job Scheduler. Includes number of records processed, summaries of errors, and actions taken.
28. Recovery and restart system. Common system for resuming a job that has halted, or for backing out a whole job and restarting. Significant dependency on backup system (see subsystem 36).
29. Parallelizing/pipelining system. Common system for taking advantage of multiple processors, or grid computing resources, and common system for implementing streaming data flows. Highly desirable (eventually necessary) that parallelizing and pipelining be invoked automatically for any ETL process that meets certain conditions, such as not writing to the disk or waiting on a condition in the middle of the process.
30. Problem escalation system. Automatic plus manual system for raising an error condition to the appropriate level for resolution and tracking. Includes simple error log entries, operator notification, supervisor notification, and system developer notification.
31. Version control system. Consistent "snapshotting" capability for archiving and recovering all the metadata in the ETL pipeline. Check-out and check-in of all ETL modules and jobs. Source comparison capability to reveal differences between different versions.
32. Version migration system. Moves a complete ETL pipeline implementation out of development, into test, and then into production. Interface to version control system to back out a migration. Single interface for setting connection information for the entire version. Independence from database location for surrogate key generation.
33. Lineage and dependency analyzer. Display the ultimate physical sources and all subsequent transformations of any selected data element, chosen either from the middle of the ETL pipeline, or chosen on a final delivered report (lineage). Display all affected downstream data elements and final report fields affected by a potential change in any selected data element, chosen either in the middle of the ETL pipeline, or in an original source (dependency).
34. Compliance reporter. Comply with regulatory statutes to prove the lineage of key reported operating results. Prove that the data and the transformations haven't been changed. Show who has accessed or changed any such data.
35. Security system. Administer role-based security on all data and metadata in the ETL pipeline. Prove that a version of a module hasn't been changed. Show who has made changes.
36. Backup system. Backup data and metadata for recovery, restart, security, and compliance requirements.
37. Metadata repository manager. Comprehensive system for capturing and maintaining all ETL metadata, including all transformation logic. Includes process metadata, technical metadata, and business metadata.
38. Project management system. Comprehensive system for keeping track of all ETL development.
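As an illustration of subsystem 19, here is a minimal sketch of a surrogate key lookup written as a Base SAS sort/merge. The dataset and column names (stage_sales, dim_customer, customer_natural_key, customer_sk) are hypothetical; a production pipeline would typically be built in an ETL tool, cover every dimension key on the fact row, and run pipelined and multithreaded rather than as a chain of sorts and merges.

proc sort data=stage_sales;  by customer_natural_key; run;
proc sort data=dim_customer; by customer_natural_key; run;

data fact_sales;
   merge stage_sales  (in=in_fact)
         dim_customer (in=in_dim keep=customer_natural_key customer_sk);
   by customer_natural_key;
   if in_fact;                            /* keep only incoming fact rows */
   if not in_dim then customer_sk = -1;   /* assign an unknown-member key when the dimension row is late */
   drop customer_natural_key;             /* the fact table carries only the surrogate key */
run;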
Data Warehouse Architecture

A data warehouse architecture is primarily based on the business processes of the enterprise. It takes into consideration data consolidation across the enterprise, adequate security, data modeling and organization, the extent of query requirements, metadata management and application, staging area planning for optimum bandwidth utilization, and full technology implementation.
The Data Warehouse Architecture includes many facets. Some of these are listed as follows:
1. Process Architecture
2. Data Model Architecture
3. Technology Architecture
4. Information Architecture
5. Resource Architecture
6. Various Architectures
Process Architecture
Process architecture describes the number of stages and how data is processed to convert raw, transactional data into information for end users.
The data staging process includes three main areas of concern, or sub-processes, to plan for in the data warehouse architecture: “Extract”, “Transform”, and “Load”.
These interrelated sub-processes are collectively referred to as the “ETL” process.
1) Extract - Since data for the data warehouse can come from different sources and may be of different types, the plan to extract the data, along with appropriate compression and encryption techniques, is an important consideration.
2) Transform - Transformation of data with appropriate conversion, aggregation, and cleansing, besides de-normalization and surrogate key management, is another important process to plan for when building a data warehouse.
3) Load - Planning how to load the data efficiently, considering the multiple areas where the data will be loaded and retrieved, is also an important part of the data warehouse architecture plan.
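The following sketch shows the three sub-processes in miniature using Base SAS. The library, file path, and column layout (dw, orders.csv, order_id, customer_id, order_date, amount) are hypothetical and stand in for whatever the real sources and targets are.

libname dw 'C:\warehouse';                          /* hypothetical warehouse library */

/* Extract: read the raw source file into a staging table */
data work.orders_stage;
   infile 'C:\source\orders.csv' dsd firstobs=2;    /* hypothetical comma-delimited source */
   input order_id customer_id order_date :yymmdd10. amount;
   format order_date date9.;
run;

/* Transform: apply a simple data quality rule and derive a column */
data work.orders_clean;
   set work.orders_stage;
   if amount < 0 then delete;                       /* reject obviously bad rows */
   order_year = year(order_date);                   /* derived attribute */
run;

/* Load: append the cleansed rows to the warehouse fact table */
proc append base=dw.fact_orders data=work.orders_clean force;
run;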
Data Model Architecture
In Data Model Architecture (also known as Dimensional Data Model), there are 3 main data modeling styles for enterprise warehouses:
1. 3rd Normal Form - Top Down Architecture, Top Down Implementation
2. Federated Star Schemas - Bottom Up Architecture, Bottom Up Implementation
3. Data Vault - Top Down Architecture, Bottom Up Implementation
Technology Architecture
Scalability and flexibility are required in all facets. The extent of these features largely depends upon organizational size, business requirements, the nature of the business, and so on.
Technology (or technical) architecture is derived primarily from the process architecture, from metadata management requirements based on business rules and security-level implementations, and from evaluation of specific technology tools.
Besides these, the technology architecture also looks into the various technology implementation standards in database management, database connectivity protocols (ODBC, JDBC, OLE DB, etc.), middleware (based on ORB, RMI, COM/DCOM, etc.), network protocols (DNS, LDAP, etc.), and other related technologies.
Information Architecture
Information architecture is the process of translating information from one form to another in a step-by-step sequence so as to manage the storage, retrieval, modification, and deletion of data in the data warehouse.
Resource Architecture
Resource architecture is related to software architecture in that many resources come from software resources. Resources are important because they help determine performance. Workload is the other part of the equation. If you have enough resources to complete the workload in the right amount of time, then performance will be high. If there are not enough resources for the workload, then performance will be low.
Various Architectures
Notice that among these architectures one stands out: the Data Model Architecture. What is happening in the integration industry at large is that the ability to integrate information across the enterprise is becoming dependent on the quality of the data model architecture underneath.
The ability to be compliant, consistent, and repeatable depends on how the data model is built under the covers.
There are 3 main data modeling styles for enterprise warehouses:
3rd Normal Form - Top Down Architecture, Top Down Implementation
Federated Star Schemas - Bottom Up Architecture, Bottom Up Implementation
Data Vault - Top Down Architecture, Bottom Up Implementation
You can read more about the Data Vault by searching for "Data Vault Data Model" on the web.
The point of data warehousing architecture is that it is not just a data warehouse anymore. It is now a full-scale data integration platform, including right-time (real-time) data and batch or strategic data sets in a single, auditable, integrated data store.
Requirements of a Data Warehouse
The data warehouse must make an organization’s information easily accessible.
The contents of the data warehouse must be understandable. The data must be intuitive and obvious to the business user, not merely the developer. Understandability implies legibility; the contents of the data warehouse need to be labeled meaningfully. Business users want to separate and combine the data in the warehouse in endless combinations, a process commonly referred to as slicing and dicing. The tools that access the data warehouse must be simple and easy to use. They also must return query results to the user with minimal wait times.
The data warehouse must present the organization’s information consistently.
The data in the warehouse must be credible. Data must be carefully assembled from a variety of sources around the organization, cleansed, quality assured, and released only when it is fit for user consumption. Information from one business process should match with information from another. If two performance measures have the same name, then they must mean the same thing. Conversely, if two measures don’t mean the
same thing, then they should be labeled differently. Consistent information means high-quality information. It means that all the data is accounted for and complete. Consistency also implies that common definitions for the contents of the data warehouse are available for users.
The data warehouse must be adaptive and resilient to change.
We simply can’t avoid change. User needs, business conditions, data, and technology are all subject to the shifting sands of time. The data warehouse must be designed to handle this inevitable change. Changes to the data warehouse should be graceful, meaning that they don’t invalidate existing data or applications. The existing data and applications should not be changed or disrupted when the business community asks new questions or new data is added to the warehouse. If descriptive data in the warehouse is modified, we must account for the changes appropriately.
The data warehouse must be a secure bastion that protects our information assets.
An organization’s informational crown jewels are stored in the data warehouse. At a minimum, the warehouse likely contains information about what we’re selling to whom at what price—potentially harmful details in the hands of the wrong people. The data warehouse must effectively control access to the organization’s confidential information.
The data warehouse must serve as the foundation for improved decision making.
The data warehouse must have the right data in it to support decision making. There is only one true output from a data warehouse: the decisions that are made after the data warehouse has presented its evidence. These decisions deliver the business impact and value attributable to the warehouse. The original label that predates the data warehouse is still the best description of what we are designing: a decision support system.
The business community must accept the data warehouse if it is to be deemed successful.
It doesn’t matter that we’ve built an elegant solution using best-of-breed products and platforms. If the business community has not embraced the data warehouse and continued to use it actively six months after training, then we have failed the acceptance test. Unlike an operational system rewrite, where business users have no choice but to use the new system, data warehouse usage is sometimes optional. Business user acceptance has more to do with simplicity than anything else.
As this list illustrates, successful data warehousing demands much more than being a stellar DBA or technician. With a data warehousing initiative, we have one foot in our information technology (IT) comfort zone, while our other foot is on the unfamiliar turf of business users. We must straddle the two, modifying some of our tried-and-true skills to adapt to the unique demands of data warehousing. Clearly, we need to bring a bevy of skills to the party to behave like we’re a hybrid DBA/MBA.
SAS DATA Step
A SAS program is composed of two parts: data steps that deal with data cleaning and data format, and procedures that perform required statistical analyses and/or graphically present the results. Data steps are important for several reasons. First, the dataset may not be in a SAS compatible format, although this is usually not the case for the datasets in class examples or exercises. Second, sometimes you need to extract some of the variables or some of the observations from the dataset to perform analysis. Third, different procedures may require the same dataset in different format. A data step is needed to transform the dataset into the appropriate format for a procedure.
Mathematical operations in a data step use the standard SAS arithmetic operators: + (addition), - (subtraction), * (multiplication), / (division), and ** (exponentiation).
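A short data step illustrating these operators (the variable names are arbitrary):

data ops;
x = 10;
y = 3;
total = x + y;     /* addition       */
diff  = x - y;     /* subtraction    */
prod  = x * y;     /* multiplication */
quot  = x / y;     /* division       */
pow   = x ** y;    /* exponentiation */
run;

proc print data=ops;
run;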
Manipulating variables in a data step (recoding, if/then statements)
To illustrate the data manipulation, let’s take a sample data set:
data a1;
input gender $ age weight;
cards;
M 13 143
M 16 132
F 19 140
M 20 120
M 15 110
F 18 95
F 22 105
;
Suppose you want a data set of females only. The following SAS code will create a new data set called aa and store those observations whose value for the variable gender is not 'M'. The set a1 statement after the data aa statement tells SAS to make a copy of the dataset a1 and save it as aa. The if/then statement deletes the observations in dataset aa whose gender variable has the value 'M'. Quotation marks are used around M because gender is a categorical variable. The dollar sign ($) is used when you have a text variable rather than a numerical variable (i.e., gender coded as M, F rather than as 1 denoting male and 2 denoting female).
data aa;
set a1;
if gender eq 'M' then delete;
run;

or, equivalently:

data aa;
set a1;
if gender eq 'F';
run;
If you want to include those who are 16 years or older, you can do:
data ab;
set a1;
if age lt 16 then delete;
run;
You can also select variables from a dataset for analysis. The statement is keep or drop. For example, if you do not need the variable age in your analysis, you can do:
data ac;
set a1;
drop age;
run;

or

data ac;
set a1;
keep gender weight;
run;
This last statement will create a dataset that only contains the two variables specified, gender and weight.
SAS Issues
Save the file
Now you are familiar with the program editor window, log window, and output window. If you want to save the work you've done in a session, you'll need to save the contents of each window separately. Usually, you only need to save the program; you can always rerun the program to regenerate the log and output. To save a program file, first make sure the program editor is the active window, then go to the File menu and select the Save command. Similarly, you can save a log file when the log window is active, or an output file when the output window is active. You do not have to run the entire program every time you make a correction to your SAS program. Each SAS procedure is relatively independent of other procedures. As long as the dataset you need for a procedure is already in SAS, you can run only part of the program by highlighting the part you want to run and then clicking the run button in the toolbar.
Missing values
a. Dots for missing observations. If your data set has missing values, you'll need to specify them as a dot in the SAS dataset.
b. What if the data set does not have dots? You can add a dot to the corresponding missing value locations using a data step. For example, suppose you have two variables, X and Y, in your data set, with 10 observations, and the ninth value of Y is missing. The following code with an if statement will do:
data a2;
set a1;
if _n_ eq 9 then Y = .;
run;
c. Reading in data with @@. You've already learned that when you input your dataset after a CARDS or DATALINES statement, every observation needs to be on an individual line. If you want to make better use of the window and have more than one observation per line, the trailing @@ tells SAS to hold the current line so it can keep reading observations from it. For example:
data b1;
input x y z @@;
cards;
1.1 2.2 3.3 4.4 5.5 6.5
;
It may be that your variables are data strings instead of numbers, for example gender or disease type. We call these variables categorical. In this case, SAS wants you to specify which variables are categorical by adding a $ sign right after the name of the variable in the input statement. Sample code follows:
data b1;
input state $ county $ name $ gender $ weight;
cards;
indiana tipp brown female 125
;
d. What if my Excel data file is not reading properly into SAS or not at all? If the Excel data file is not reading into SAS at all, most likely it’s because your Excel data file is open. The Excel file must be closed before you import it into SAS. There are other reasons that the Excel data file is not reading in properly. It could be that the data type of your Excel cells is not correctly defined. Inappropriate reading also happens when you do not have a header in the first row, since the import procedure takes the first row as header by default. However, this can be changed during the import procedure under options.
How do you know if SAS is reading your dataset correctly? Use the proc print procedure and see if the dataset in SAS is what you expected.
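For example, assuming the imported dataset ended up in the WORK library under the name b1 (the name is whatever you chose during the import):

proc contents data=b1;        /* lists the variables and their types */
run;
proc print data=b1 (obs=10);  /* prints only the first ten observations */
run;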
Exporting to Excel, Access, or SPSS (.txt, .xls, .prn)
Exporting a data set to Excel is the opposite of the import process. If you go to FILE and then select EXPORT DATA, an export wizard window pops up. Then just follow the wizard through the following steps.
Step 1: Choose a data set that you created in the WORK library (where SAS datasets are stored automatically by SAS). Click the Next button when you are done.
Step 2: Choose the file type you want to export to. Available types include Excel, Access, dBase, delimited files, and many others. Choose Excel 2000 and then click Next.
Step 3: Type in the directory path where you want to save your data file. If you are not sure of the path, click the Browse button and find the location. At this point, you may click the OK button to finish the export. To export the data to Access or SPSS, the procedure is basically the same except that you choose the Access or SPSS type in step 2.
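If you prefer code to the wizard, PROC EXPORT does the same job. A sketch, assuming a dataset a1 in the WORK library and a hypothetical output path; the Excel format requires the SAS/ACCESS Interface to PC Files, while delimited files need only Base SAS:

proc export data=work.a1
outfile='C:\mydata\a1.xls'
dbms=excel
replace;
run;

proc export data=work.a1
outfile='C:\mydata\a1.txt'
dbms=tab
replace;
run;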
How to use the help menu
The SAS help menu is helpful if you want to improve your knowledge of SAS procedures on your own. There are two ways of getting SAS help. One is to go to the Help menu and then SAS System Help. Then go to the Index tab and type in the name of the procedure. SAS will give you the syntax of the procedure as well as some examples. If you have a specific question, you can use the Search tab and type in the key words of your question. The other way of getting help is to go to Books and Training and then Online Doc. The online documentation is easier to browse.
SAS Procedure
A SAS program is composed of one or more (statistical) procedures. Each procedure is a unit, although some are needed to run others. Some often-used procedures for statistical analysis are explained in detail.
Proc print
The output of this procedure is the data set that you specify with the data=dataname option after the print keyword. This data= option is common to almost every SAS procedure. It is a good habit to use this option all the time so that you know which dataset you are working with. This is especially helpful when there are multiple datasets, which is usually the case when you are performing statistical analysis using SAS. Here's an example of how proc print works. In the data step section, we created a data set called a1 with three variables (gender, age, weight) and seven observations. It's a good idea to always check whether SAS has read in your dataset correctly before performing any analyses on the data.
proc print data=a1;
run;
If you highlight this section of code and click on the run button, you’ll see the dataset in the output window
as follows:
Obs gender age weight
1 M 13 143
2 M 16 132
3 F 19 140
4 M 20 120
5 M 15 110
6 F 18 95
7 F 22 105
If you want to see only some variables in the data set, you could add a statement after the proc print line in the format of var gender age;. This would generate output similar to that shown above except the weight variable would not be included.
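For instance, to print only gender and age from a1:

proc print data=a1;
var gender age;
run;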
Proc univariate
It is one of the most important procedures for elementary statistical analysis. It outputs the basic statistics of one or more variables, and has optional statements to generate qqplots and histograms. Sample code follows:
proc univariate data=a1;
var weight;
qqplot;
histogram;
run;
The var statement is optional. Without this statement, a univariate analysis is performed for all numeric variables in the order they appear in the dataset.
Proc capability
It has a variety of functions including creating a normal qq plot, histogram, and probability plots, although it is often used to create a normal qq plot in elementary statistical analysis. A normal qq plot and a histogram can be created using the code in the univariate example, just replacing univariate with
capability.
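For example (proc capability is part of the SAS/QC module):

proc capability data=a1;
var weight;
qqplot;
histogram;
run;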
Proc sort
Proc sort sorts the observations in a dataset by some variables in either ascending or descending order. For example:
proc sort data=a1 out=a2;
by gender;
run;
The observations of dataset a1 are sorted in ascending order, by default, of the variable gender, and the sorted data is saved in a dataset named a2. Without the out=a2 option, the unsorted dataset a1 will be replaced by the sorted dataset. You can also sort the observations in descending order of some variable by placing the descending keyword before that variable in the by statement, e.g. by descending gender. If you need to sort by more than one variable, list all the variables in the by statement. For example, by gender age will sort in ascending order by gender, and then the observations with the same gender value will be sorted in ascending order by the values of age.
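For example, the following sorts a1 with males first (descending gender) and then by ascending age within each gender; the output name a4 is arbitrary:

proc sort data=a1 out=a4;
by descending gender age;
run;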
Proc means
This procedure produces simple univariate descriptive statistics for numeric variables. It also calculates confidence limits for the mean, and identifies extreme values and quartiles. Here’s an example for mean and its confidence limit calculation:
proc means data=a2 alpha=0.05 clm mean median n min max;
run;
The mean, median, sample size, minimum value, maximum value, and 95% confidence limits will be computed for the variables age and weight. The alpha= option sets the significance level (alpha=0.05 gives 95% confidence limits), and clm tells SAS to calculate the confidence interval of the mean. Since gender is a categorical variable, no mean will be computed for it.
If you have a lot of variables and you only want to calculate the mean for some of them, use the var option and list the variables after the keyword var. If you want the means of the variables by group, use the by option. For example,
proc means data=a2 alpha=0.05 clm mean;
var weight;
by gender;
run;
tells SAS to compute the mean and confidence interval of weight for each value of gender, i.e. male and female. If the by statement is used, the observations need to be sorted by the same variable before the
proc means procedure. Note that a2, the sorted dataset, was used in our proc means example.
Proc summary
It computes descriptive statistics on numeric variables in a SAS dataset and outputs the results to a new SAS dataset. The syntax of proc summary is the same as that of proc means. An example follows:
proc summary data=a2 print;
var weight;
by gender;
output out=a3;
run;
Proc summary will not run without either the print option or the output statement.
Proc corr
This procedure is used for calculating the correlation between numeric variables. For example, the Pearson correlation coefficient and its P-value can be computed.
proc corr data=a1;
var age weight;
run;
A correlation coefficient matrix is created:
Pearson Correlation Coefficients, N = 7
Prob > |r| under H0: Rho=0

                 age       weight
age          1.00000     -0.43017
                           0.3354
weight      -0.43017      1.00000
              0.3354
The correlation coefficient between age and weight in this example is -0.43017, and 0.3354 is the P-value for testing the null hypothesis that the coefficient is zero. In this case, the P-value is greater than 0.05, and the null hypothesis of zero coefficient cannot be rejected.
Proc glm
It performs simple and multiple regression, analysis of variance (ANOVA), analysis of covariance, multivariate analysis of variance, and repeated measures analysis of variance.
proc glm data=a1;
model weight=age;
output out=a3 p=pred r=resid;
run;
performs a simple linear regression with weight as the dependent variable and age the independent variable. The predicted values of weight (the dependent variable) and the residuals are saved in a new dataset called a3 using the output statement. For multiple regression where you have more than one independent
variable, simply list in the model statement all the variables on the right hand side of the equal sign with one space in between, e.g.
model weight=age height;
In the case of ANOVA, a class statement is needed for categorical variables before the model statement. The following code is an ANOVA analyzing the effect of gender on weight. It tests whether the mean weight is the same for females and males.
proc glm data=a1;
class gender;
model weight=gender;
run;
Proc reg
Proc reg is a procedure for regression. It is capable of more regression tasks than proc glm. It allows multiple model statements in one procedure, can do model selection, and can even plot summary statistics and normal qq-plots. You can specify several PLOT statements for each MODEL statement, and you can specify more than one plot in each PLOT statement.
proc reg data=a1;
model weight=age;
plot weight*age;
plot predicted.*age;
plot residual.*age;
plot nqq.*residual.;
run;
In the above example, a simple regression is performed with weight as the response and age as the explanatory variable. The plot statements request four plots: weight versus age, predicted values of weight versus age, residuals versus age, and normal qq plot versus residuals. Predicted., residual., and nqq. are keywords that SAS recognizes. Make sure you keep a dot after the word.
Proc print
The output of this procedure is the data set that you specified by writing data=dataname option after the print key word. This data= option is common for almost every SAS procedure. It is a good habit to use this option all the time so that you know with which dataset you are working. This is helpful especially when there are multiple datasets, which is usually the case when you are performing statistical analysis using SAS. Here’s an example of how proc print works. In the data step section, we created a data set called a1 with three variables (gender, age, weight), and seven observations. It’s a good idea to always check if SAS has read in your dataset correctly before performing any analyses on the data.
proc print data=a1;
run;
If you highlight this section of code and click on the run button, you’ll see the dataset in the output window
as follows:
Obs gender age weight
1 M 13 143
2 M 16 132
3 F 19 140
4 M 20 120
5 M 15 110
6 F 18 95
7 F 22 105
6
If you want to see only some variables in the data set, you could add a statement after the proc print line in the format of var gender age;. This would generate output similar to that shown above except the weight variable would not be included.
Proc univariate
It is one of the most important procedures for elementary statistical analysis. It outputs the basic statistics of one or more variables, and has optional statements to generate qqplots and histograms. Sample code follows:
proc univariate data=a1;
var weight;
qqplot;
histogram;
run;
The var statement is optional. Without this statement, a univariate analysis is performed for all numeric variables in the order they appear in the dataset.
Proc capability
It has a variety of functions including creating a normal qq plot, histogram, and probability plots, although it is often used to create a normal qq plot in elementary statistical analysis. A normal qq plot and a histogram can be created using the code in the univariate example, just replacing univariate with
capability.
Proc sort
Proc sort sorts the observations in a dataset by some variables in either ascending or descending order. Forexample:
proc sort data=a1 out=a2;
by gender;
run;
The observations of dataset a1 are sorted, in ascending order by default, by the variable gender, and the sorted data are saved in a dataset named a2. Without the out=a2 option, the original dataset a1 is replaced by the sorted dataset. You can also sort in descending order of a variable by placing the descending option before that variable in the by statement, e.g. by descending gender. If you need to sort by more than one variable, list all the variables in the by statement. For example, by gender age sorts in ascending order by gender, and then observations with the same gender value are sorted in ascending order by age.
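As a combined sketch, the following sorts by gender in ascending order and, within each gender, by age in descending order (reusing the a2 output name from above):
proc sort data=a1 out=a2;
by gender descending age; /* gender ascending, age descending within gender */
run;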
Proc means
This procedure produces simple univariate descriptive statistics for numeric variables. It also calculates confidence limits for the mean, and identifies extreme values and quartiles. Here’s an example for mean and its confidence limit calculation:
proc means data=a2 alpha=0.05 clm mean median n min max;
run;
The mean, median, sample size, minimum value, maximum value, and 95% confidence limits of the mean are computed for the numeric variables age and weight. The alpha= option specifies the confidence level, and clm tells SAS to calculate the confidence limits of the mean. Since gender is a categorical variable, no statistics are computed for it.
If you have many variables and only want statistics for some of them, use the var statement and list those variables after the keyword var. If you want the statistics by group, use the by statement. For example,
proc means data=a2 alpha=0.05 clm mean;
var weight;
by gender;
run;
tells SAS to compute the mean and confidence interval of weight for each value of gender, i.e. male and female. If the by statement is used, the observations must be sorted by the same variable before running proc means; note that a2, the sorted dataset, was used in our proc means example.
Proc summary
It computes descriptive statistics on numeric variables in a SAS dataset and outputs the results to a new SAS dataset. The syntax of proc summary is the same as that of proc means. An example follows:
proc summary data=a2 print;
var weight;
by gender;
output out=a3;
run;
Proc summary produces no printed or saved output unless you specify either the print option or an output statement.
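As a hedged sketch of the output statement with explicit statistic keywords (the output dataset name wtstats and the new variable names are illustrative, not from the original example):
proc summary data=a2;
var weight;
by gender;
output out=wtstats mean=mean_weight n=n_weight; /* one row per gender with mean and count */
run;
proc print data=wtstats; /* view the summarized dataset */
run;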
Proc corr
This procedure calculates correlations between numeric variables. By default it computes the Pearson correlation coefficient and the P-value for testing that the correlation is zero.
proc corr data=a1;
var age weight;
run;
A correlation coefficient matrix is created:
Pearson Correlation Coefficients, N = 7
Prob > |r| under H0: Rho=0

              age      weight
age       1.00000    -0.43017
                       0.3354
weight   -0.43017     1.00000
           0.3354
The correlation coefficient between age and weight in this example is -0.43017, and 0.3354 is the P-value for testing the null hypothesis that the coefficient is zero. In this case, the P-value is greater than 0.05, and the null hypothesis of zero coefficient cannot be rejected.
Proc glm
It performs simple and multiple regression, analysis of variance (ANOVA), analysis of covariance, multivariate analysis of variance, and repeated measures analysis of variance.
proc glm data=a1;
model weight=age;
output out=a3 p=pred r=resid;
run;
performs a simple linear regression with weight as the dependent variable and age as the independent variable. The predicted values of weight (the dependent variable) and the residuals are saved in a new dataset called a3 using the output statement. For multiple regression, where there is more than one independent variable, simply list all the independent variables on the right-hand side of the equal sign in the model statement, separated by spaces, e.g.
model weight=age height;
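For instance, here is a sketch of a multiple regression with proc glm; it assumes the dataset also contains a height variable, which is not in the a1 example data, and the output name a4 is illustrative:
proc glm data=a1;
model weight=age height; /* two independent variables */
output out=a4 p=pred r=resid; /* save predicted values and residuals */
run;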
In the case of ANOVA, a class statement for the categorical variables is needed before the model statement. The following code is an ANOVA analyzing the effect of gender on weight; it tests whether mean weight is the same for females and males.
proc glm data=a1;
class gender;
model weight=gender;
run;
Proc reg
Proc reg is a procedure dedicated to regression and is capable of more regression tasks than proc glm. It allows multiple model statements in one procedure, can perform model selection, and can plot summary statistics and normal qq plots. You can specify several plot statements for each model statement, and you can request more than one plot in each plot statement.
proc reg data=a1;
model weight=age;
plot weight*age;
plot predicted.*age;
plot residual.*age;
plot nqq.*residual.;
run;
In the above example, a simple regression is performed with weight as the response and age as the explanatory variable. The plot statements request four plots: weight versus age, predicted values of weight versus age, residuals versus age, and a normal qq plot of the residuals. Predicted., residual., and nqq. are keywords that SAS recognizes; make sure to keep the dot after each keyword.
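Since proc reg can also perform model selection, here is a hedged sketch; the extra predictors x1 and x2 are illustrative and not part of the a1 example data:
proc reg data=a1;
model weight=age x1 x2 / selection=stepwise; /* stepwise model selection among the listed predictors */
run;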
SAS Introduction
What is SAS?
SAS is a statistical software package that allows the user to manipulate and analyze data in many different ways. Because of its capabilities, this software package is used in many disciplines (not just statistics!), including the medical sciences, biological sciences, and social sciences.
After you open SAS, you will see three windows:
1. Program editor
2. Explorer window
3. Log window
Also, there is an output window that is hidden until you actually have output.
Program Editor
The program editor is where you will type the program that you will eventually run. It works almost exactly like Microsoft Word (you can cut, paste, move the cursor, etc.). The enhanced program editor gives you color-coded procedures, statements, and options (more on these later) that help you find errors in your program before you even run it.
Explorer Window
With the explorer window, you can open and view data you have read into SAS. Click on Libraries, then the Work folder, and this will show you any datasets you have read into or created in SAS during that session.
Log Window
The log window will inform you of any errors in your program and the reason for the errors. This window is EXTREMELY IMPORTANT if you hope to figure out what is wrong with your program. Always check it first to see if your program ran properly! The output window is where your output appears once you run your program from the program editor. (Note: You can also cut, paste, etc. from the log and output windows.)
Informatica PowerCenter 9.0 Mapping Analyst for Excel Guide
Some of the topics covered in this tutorial are:
1. Understanding Mapping Analyst for Excel
2. Configuring One Mapping
3. Configuring Multiple Mappings
4. Importing and Exporting Mapping Specifications
5. Customizing Mapping Specification Templates
Download Informatica PowerCenter 9.0 Mapping Analyst for Excel Guide
How to implement a nested query in Informatica
You can use this in a Source Qualifier or Lookup SQL override, depending on your requirements. I slightly changed the SQL:
select nvl(a.RATING,'NULL') vRating
from W_MBS_AM_BLDG_INSP_HDR_F a
where (W_UPDATE_DT, INSPECTION_DATE_WID) =
      (select max(W_UPDATE_DT), max(INSPECTION_DATE_WID)
       from W_MBS_AM_BLDG_INSP_HDR_F b
       where b.PROJECT_WID = a.PROJECT_WID);
How to implement a nested query in Informatica
Based on the syntax, I am assuming that you picked this code up from a PL/SQL block and are trying to reuse it in Informatica.
While you can use the exact query in the Source Qualifier (after removing the "INTO" clause), I'd suggest rewriting it as follows to make the query more efficient, let the code speak for itself, and avoid so many references to the same table.
select max(nvl(RATING,'NULL')) keep
       (dense_rank first order by W_UPDATE_DT desc,
        INSPECTION_DATE_WID desc) rating
from W_MBS_AM_BLDG_INSP_HDR_F
where PROJECT_WID = vPROJECT_WID;
@Nick,
If they need to see the latest rating for a given project very frequently, then a view would make sense. Hopefully that is what you meant.
Otherwise, creating a view just for this one case, and without project_wid in the view definition, would be far less useful.