Monday, November 26, 2018

How to Identify Throughput Issues


How do you determine whether a session's performance issue is due to read or write throughput? Below are a few typical issues that result in poor performance (low throughput values).

Target Commit Point is too small

The default target commit interval of 10,000 is often left in place even when processing large data volumes. This results in a commit every 10,000 rows even if the mapping is processing millions or hundreds of millions of rows. Setting a commit interval of 100,000 or even 1 million rows is not unreasonable as data volumes continue to grow. There is a point where rollback cost and commit frequency should be balanced, but if processing 40 million rows, a commit interval under 500,000 is probably too small.
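As a rough illustration of the difference (Oracle syntax, using the 40-million-row example above):

  -- Number of commits issued for 40 million rows at each commit interval
  SELECT CEIL (40000000 / 10000)  AS commits_at_default,   -- 4,000 commits
         CEIL (40000000 / 500000) AS commits_at_500k       -- 80 commits
    FROM DUAL;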

Auto Memory settings are too small

Adding memory to the session is the single change with the biggest impact if the throughput values in the Workflow Monitor (or in the query results) are relatively small in relation to the number of rows being processed. Testing is needed to find the sweet spot: add memory until it no longer helps, while keeping in mind the resource requirements of other jobs running concurrently. Adjusting Auto Memory to 1 GB or 2 GB can usually make a difference without being out of line. Be sure to balance the Maximum value for the auto memory with the Maximum Percentage of available memory, since the smaller of the two is used.

Multi-pass Sorts

Review the session log for messages about multi-pass sorts. These indicate that more memory is required. Sorter transformations default to Auto for the sorter cache size (in the past they defaulted to 200 for this value). Based on the message in the log, increase the Auto Memory setting or explicitly set the sorter cache to a larger value so that the sort can be performed in one or two passes. Note that the Auto Memory setting is shared across all transformations that cache data, so you may need to adjust it to an appropriate value for the whole session.


Not Sorting before Aggregators and Joiners

Review the session log for messages about nested-loop algorithms for Aggregators and Joiners. These indicate that you did not sort before them, or did not mark the transformation as having sorted input. Sorting does take some cycles, but the sort-merge algorithm is much faster and makes the other cache files much smaller, so the effort is worth it to improve performance.

Complex SQL Overrides

Using SQL Overrides pushes the logic to the database when initially querying the data or in lookups, etc. This is not a best practice for a number of reasons:
a. It hides the metadata of the mapping
b. It hides transformation logic being applied to the data
c. It does not allow the Informatica server to take some of the load of transforming the data

It is a good practice to take the SQL that will be run against the database from the session log and to run an explain plan and tune it appropriately in the database by adding the right indexes, etc., that assist in the query performance.
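For example, the plan can be generated and displayed in Oracle as follows (the ORDERS table and its columns are hypothetical stand-ins for the SQL copied from the session log):

  -- Paste the SQL taken from the session log after EXPLAIN PLAN FOR
  EXPLAIN PLAN FOR
  SELECT cust_id, order_dt
    FROM orders
   WHERE order_dt > SYSDATE - 30;

  -- Display the plan and look for full scans that an index could avoid
  SELECT * FROM TABLE (DBMS_XPLAN.DISPLAY);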

Using a Lookup instead of a Join

Both Lookups and Joins can be used to reference data in another table based on data in the main source and specified logic. Lookup results are cached either in memory, if they are small enough, or to a file on disk. If the results are large, then a lookup does not make sense and a Join is likely more efficient. This can be a join performed in a Source Qualifier if the tables come from the same database instance as the source table, or a Joiner transformation if they are disparate database sources.

It was observed that some mappings were caching lookup results for 40 million or more rows. This is a clear indication that a Joiner should have been used instead. Generally, for any table with more than a million rows it is better to use a Join instead of a Lookup. Lookups flagged in the session log with very large cached row counts are the ones that should become joins.

Unsorted Input in a Lookup

The session log might contain messages with error code TT_11195 regarding unsorted input for a lookup transformation.
This occurs when the SQL Override eliminated the sort, when the ports in the lookup transformation do not match the datatype and exact precision of the definition in the database, or when there is binary data in the lookup results and the mapping's sort order is ASCII rather than binary.

Sessions with poor throughput were flagged, and we reviewed those mappings, sessions, and session logs, looking at Auto Memory settings, DTM Buffer Size settings, session log errors and bottlenecks, and mapping logic regarding SQL Overrides, lookup size, sorting before Aggregators and Joiners, etc. Candidates for this review were selected because the execution time was longer than 1 hour, the throughput was less than 10,000 rows per second, and the number of rows processed was greater than 10,000. This was a starting point to find some of the worst jobs and review them. We did not get input from the development or production support teams to guide our efforts.
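These selection criteria can be expressed as a filter against REP_SESS_TBL_LOG (same columns as the write-throughput query, thresholds as described above):

  SELECT l.subject_area,
         l.session_name,
         l.successful_rows,
         l.throughput,
         ROUND ( (l.end_time - l.start_time) * 24 * 60 * 60) AS run_time_secs
    FROM rep_sess_tbl_log l
   WHERE ROUND ( (l.end_time - l.start_time) * 24 * 60 * 60) > 3600   -- ran longer than 1 hour
         AND l.throughput < 10000                                     -- under 10,000 rows/sec
         AND l.successful_rows > 10000                                -- but processed > 10,000 rows
ORDER BY l.throughput;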

List "WRITE" Throughput of the session


  SELECT l.subject_area,
         workflow_name,
         l.session_name,
         l.table_name,
         l.successful_rows,
         l.successful_affected_rows,
         l.throughput AS write_thruput,
         l.start_time,
         l.end_time,
         ROUND ( (l.end_time - l.START_time) * 24 * 60 * 60) Run_Time
    FROM rep_sess_tbl_log l, rep_workflows w
   WHERE     l.start_time > TO_DATE ('20181101', 'YYYYMMDD')
         AND last_error_code IN (0, 7004)         -- last one without an error
         AND (l.start_time, l.session_name) IN
                 (  SELECT MAX (start_time),       -- latest run per session
                           session_name
                      FROM rep_sess_tbl_log
                  GROUP BY session_name)
         AND l.successful_rows > 0
         AND l.workflow_id = w.workflow_id
ORDER BY l.subject_area,
         w.workflow_name,
         l.session_name,
         l.start_time


List "READ" Throughput of the session

  SELECT t.task_name       AS workflow_name,
         ti.instance_name AS session_name,
         swl.instance_name AS table_name,
         swl.applied_rows,
         swl.affected_rows,
         swl.thruput   AS read_thruput,
         swl.start_time,
         swl.end_time,
         ROUND ( (swl.end_time - swl.START_time) * 24 * 60 * 60) Run_Time
    FROM opb_swidginst_log swl, opb_task_inst ti, opb_task t
   WHERE     swl.last_err_code IN (0, 7004)
         AND (swl.start_time, swl.instance_name) IN
                 (  SELECT MAX (start_time), instance_name
                      FROM opb_swidginst_log
                  GROUP BY instance_name)
         AND swl.session_id = ti.task_id
         AND TI.WORKFLOW_ID = T.TASK_ID
         AND (applied_rows > 0 OR affected_rows > 0)
GROUP BY t.task_name,
         ti.instance_name,
         swl.instance_name,
         swl.applied_rows,
         swl.affected_rows,
         swl.thruput,
         swl.start_time,
         swl.end_time,
         ROUND ( (swl.end_time - swl.START_time) * 24 * 60 * 60)
ORDER BY t.task_name,
         ti.instance_name,
         swl.instance_name,
         swl.start_time


Average Run Time and Throughput


  SELECT l.subject_area,
         w.workflow_name,
         l.session_name,
         l.table_name,
         AVG (l.throughput) AS write_thruput,
         AVG (ROUND ( (l.end_time - l.START_time) * 24 * 60 * 60)) Run_Time,
         AVG (l.successful_rows)
             AS avg_success_rows,
         AVG (l.successful_affected_rows)
             AS avg_affected_rows
    FROM rep_sess_tbl_log l, rep_workflows w
   WHERE     l.start_time > TO_DATE ('20180210', 'YYYYMMDD')
         --and subject_area='FOLDERName'
         AND l.workflow_id = w.workflow_id
GROUP BY l.subject_area,
         w.workflow_name,
         l.session_name,
         l.table_name
ORDER BY l.subject_area,
         w.workflow_name,
         l.session_name,
         l.table_name