IBM WebSphere DataStage Developer Blog

Thursday, June 5, 2008

DataStage .. Sparse VS Normal Lookup

A Normal look up tries to fit your reference data into memory,and then does an in-memory look up. And so if your reference data is huge, this can cause performane issues in your job.

Based on your reference data volume - if it is very high - you can either opt for a Merge or a Join stage.

A sparse look up is used when your input or main stream data is very less compared to your reference data. In this case, DataStage executes an SQL query on the reference DB for each incoming record. So,if the input is very high compared to the reference - this type of look up will be a huge hit to performance. A sparse look up is only supported in the Oracle Enterprise, DB2 Enterprise and the ODBC Enterprise stages - and that when directly linked to a Lookup Stage.

Data read by any database stage can serve as the reference input to a Lookup stage. By default, this reference data is loaded into memory like any other reference link.When directly connected as the reference link to a Lookup stage, bothDB2/UDB Enterprise and Oracle Enterprise stages allow the lookup type to be changed to Sparse and send individual SQL statements to the reference database for each incoming Lookup row. Sparse Lookup is only available when the database stage is directly connected to the reference link, with no intermediate stages.It is important to note that the individual SQL statements required by a Sparse Lookup are an expensive operation from a performance perspective. In most cases, it is faster to use a DataStage Join stage between the input and DB2 reference data than it is to perform a Sparse Lookup.
For scenarios where the number of input rows is significantly smaller (1:100 or more) than the number of reference rows in a DB2 or Oracle table, a Sparse Lookup may be appropriate

Tuesday, June 3, 2008

Hi Guys
Please check out he comments made Aneesh ... use ful for more understanding ...
Thnx Anees

First would like to congratulate you on your blog! Its a good effort and am sure well appreciated.

Regarding the Sort Stage, you have said - "The Sort Stage offers a variety of options of retaining first or last records when removing duplicate records...."

Just wanted to point out that the option of retaining the first or last while de-duplication is only available in the Remove Duplicate Stage.

Probably another point that i would like to call out is regarding the usage of the Transformer stage. This stage works faster than some of the others like like the Filter stage. The "evilness" of the Transformer I believe was broken open to the public during the IBM IOD Conference,2007. Vincent Mcburney had called this out in his blogs a couple of times.

Friday, April 11, 2008

Data Specialist-ETL-DataStage Job Openings

Hi Guys
There is job openins for Data Specialist-ETL-DataStage Job Openings at IBM

Skill Set :
3+ year experience with DataStage Server Edition/DS Parallel Extender
3+ year experience with SQL and DB2 Server
3+ year experience with relational database development (DB2 and Oracle)

Job Locatoio :

Pune/Banglore/Kolkata/Pune

Thursday, January 24, 2008

Major business and technical advantages and disadvantages of using DataStage ETL tool

Business advantages of using DataStage as an ETL tool:
Significant ROI (return of investment) over hand-coding

Learning curve - quick development and reduced maintenance with GUI tool

Development Partnerships - easy integration with top market products interfaced with the datawarehouse, such as SAP, Cognos, Oracle, Teradata, SAS

Single vendor solution for bulk data transfer and complex transformations (DataStage versus DataStage TX)

Transparent and wide range of licensing options

Technical advantages of using DataStage tool to implement the ETL processes :

Single interface to integrate heterogeneous applications

Flexible development environment - it enables developers to work in their desired style, reduces training needs and enhances reuse. ETL developers can follow data integrations quickly through a graphical work-as-you-think solution which comes by default with a wide range of extensible objects and functions

Team communication and documentation of the jobs is supported by data flows and transformations self-documenting engine in HTML format.

Ability to join data both at the source, and at the integration server and to apply any business rule from within a single interface without having to write any procedural code.

Common data infrastructure for data movement and data quality (metadata repository, parallel processing framework, development environment) With Datastage Enterprise Edition users can use the parallel processing engine which provides unlimited performance and scalability. It helps get most out of hardware investment and resources.

The datastage server performs very well on both Windows and unix servers.

Major Datastage weaknesses and disadvantages:

Big architectural differences in the Server and Enterprise edition which results in the fact that migration from server to enterprise edition may require vast time and resources effort.

There is no automated error handling and recovery mechanism - for example no way to automatically time out zombie jobs or kill locking processes. However, on the operator level, these errors can be easily resolved.

No Unix Datastage client - the Client software available only under Windows and there are different clients for different datastage versions. The good thing is that they still can be installed on the same windows pc and switched with the Multi-Client Manager program.

Might be expensive as a solution for a small or mid-sized company.

DataStage Interview Question ...

Hi Guys
Lets share DataStage Interview question you have faced while your job interview......
It would be helpful to others......

Features and Benefits ... IBM® WebSphere® DataStage®

IBM® WebSphere® DataStage® and IBM WebSphere DataStage for Linux on System z provide these unique capabilities:

The powerful ETL solution supports the collection, integration and transformation of large volumes of data, with data structures ranging from simple to highly complex. IBM WebSphere DataStage manages data arriving in real-time as well as data received on a periodic or scheduled basis.

The scalable platform enables companies to solve large-scale business problems through high-performance processing of massive data volumes. By leveraging the parallel processing capabilities of multiprocessor hardware platforms, IBM WebSphere DataStage Enterprise Edition can scale to satisfy the demands of ever-growing data volumes, stringent real-time requirements, and ever shrinking batch windows.

Comprehensive source and target support for a virtually unlimited number of heterogeneous data sources and targets in a single job includes text files; complex data structures in XML; ERP systems such as SAP and PeopleSoft; almost any database (including partitioned databases); web services; and business intelligence tools like SAS.

Real-time data integration support operates in real-time. It captures messages from Message Oriented Middleware (MOM) queues using JMS or WebSphere MQ adapters to seamlessly combine data into conforming operational and historical analysis perspectives. IBM WebSphere Information Services Director provides a service-oriented architecture (SOA) for publishing data integration logic as shared services that can be reused across the enterprise. These services are capable of simultaneously supporting high-speed, high reliability requirements of transactional processing and the high volume bulk data requirements of batch processing.

Advanced maintenance and development enables developers to maximize speed, flexibility and effectiveness in building, deploying, updating and managing their data integration infrastructure. Full data integration reduces the development and maintenance cycle for data integration projects by simplifying administration and maximizing development resources.

Complete connectivity between any data source and any application ensures that the most relevant, complete and accurate data is integrated and used by the most popular enterprise application software brands, including SAP, Siebel, Oracle, and PeopleSoft.
Flexibility to perform information integration directly on the mainframe. WebSphere DataStage for Linux on System z, provides:

Ability to leverage existing mainframe resources in order to maximize the value of your IT investments
Scalability, security, manageability and reliability of the mainframe
Ability to add mainframe information integration work load without added z/OS operational costs

Somthing about WebSphere DataStage(ETL/DataIntegration Tool)

IBM® WebSphere® DataStage® integrates data across multiple and high volumes of data sources and target applications.
It integrates data on demand with a high performance parallel framework, extended metadata management, and enterprise connectivity.

Supports the collection, integration and transformation of large volumes of data, with data structures ranging from simple to highly complex.
Offers scalable platform that enables companies to solve large-scale business problems through high-performance processing of massive data volumes
Supports real-time data integration.
Enables developers to maximize speed, flexibility and effectiveness in building, deploying, updating and managing their data integration infrastructure.
Completes connectivity between any data source and any application