Bulk Update Methods
What I love about writing SQL Tuning articles is that I very rarely end up publishing the findings I set out to achieve. With this one, I set out to demonstrate the advantages of PARALLEL DML, didn't find what I thought I would, and ended up testing 8 different techniques to find out how they differed. And guess what? I still didn't get the results I expected. Hey, at least I learned something.
As an ETL designer, I hate updates. They are just plain nasty. I spend an inordinate proportion of design time of an ETL system worrying about the relative proportion of rows inserted vs updated. I worry about how ETL tools apply updates (did you know DataStage applys updates singly, but batches inserts in arrays?), how I might cluster rows together that are subject to updates, and what I might do if I just get too many updates to handle.
It would be fair to say I obsess about them. A little bit.
The two most common forms of Bulk Updates are:
The two most common forms of Bulk Updates are:
- Update (almost) every row in the table. This is common when applying data patches and adding new columns.
- Updating a small proportion of rows in a very large table.
Case 1 is uninteresting. The fastest way to update every row in the table is to rebuild the table from scratch. All of these methods below will perform worse.
Case 2 is common in Data Warehouses and overnight batch jobs. We have a table containing years worth of data, most of which is static; we are updating selected rows that were recently inserted and are still volatile. This case is the subject of our test. For the purposes of the test, we will assume that the target table of the update is arbitrarily large, and we want to avoid things like full-scans and index rebuilds.
And the nominees are...
The methods covered include both PL/SQL and SQL approaches. I want to test on a level playing field and remove special factors that unfairly favour one method, so there are some rules:
- Accumulating data for the update can be arbitrarily complex. SQL updates can have joins with grouping and sub-queries and what-not; PL/SQL can have cursor loops with nested calls to other procedures. I'm not testing the relative merits of how to accumulate the data, so each test will use pre-preared update data residing in a Global Temporary Table.
- Some methods - such as MERGE - allow the data source to be joined to the update target using SQL. Other methods don't have this capability and must use Primary Key lookups on the update target. To make these methods comparable, the "joinable" techniques will use a Nested Loops join to most closely mimic the Primary Key lookup of the other methods. Even though a Hash join may be faster than Nested Loops for some distributions of data, that is not always the case and - once again - we're assuming an arbitraily large target table, so a full scan is not necessarily feasible.
- Having said that we're not comparing factors outside of the UPDATE itself, some of the methods do have differences unrelated to the UPDATE. I have included these deliberately because they are reasonably common and have different performance profiles; I wouldn't want anyone to think that because their statements were *similar* to those shown here that they have the same performance profile.
The 8 methods I am benchmarking here are as follows (in rough order of complexity):
- Explicit Cursor Loop
- Implicit Cursor Loop
- UPDATE with nested SET subquery
- BULK COLLECT / FORALL UPDATE
- Updateable Join View
- MERGE
- Parallel DML MERGE
- Parallel PL/SQL
For all of the tests, the following table structures will be used:
TEST{n} (Update Source) - 100K rows TEST (Update target) - 10M rows Name Type Name Type ------------------------------ ------------ ------------------------------ ------------ PK NUMBER PK NUMBER FK NUMBER FK NUMBER FILL VARCHAR2(40) FILL VARCHAR2(40)
The data has the following characteristics:
- TEST.PK will contain values 0 .. 9999999, but not in that order.
- TEST.PK is poorly clustered. It is generated by reversing the digits in LPAD(ROWNUM, '0', 7). PK values of 1,2, and 3 are adjacent in the primary key index but one-million rows apart in the table.
- TEST.FK will contain values 0 .. 99, unevenly distributed to favour lower numbers.
- For the first round of testing, the column FK will be indexed with a simple b-tree index.
Method 1: Explicit Cursor Loop
Not many people code this way, but there are some Pro*C programmers out there who are used to Explicit Cursor Loops (OPEN, FETCH and CLOSE commands) and translate these techniques directly to PL/SQL. The UPDATE portion of the code works in an identical fashion to the Implicit Cursor Loop, so this is not really a separate "UPDATE" method as such. The interesting thing about this method is that it performs a context-switch between PL/SQL and SQL for every FETCH; this is less efficient. I include it here because it allows us to compare the cost of context-switches to the cost of updates.
DECLARE CURSOR c1 IS SELECT * FROM test6; rec_cur c1%rowtype; BEGIN OPEN c1; LOOP FETCH c1 INTO rec_cur; EXIT WHEN c1%notfound; UPDATE test SET fk = rec_cur.fk , fill = rec_cur.fill WHERE pk = rec_cur.pk; END LOOP; CLOSE C1; END; /
Method 2: Implicit Cursor Loop
This is the simplest PL/SQL method and very common in hand-coded PL/SQL applications. Update-wise, it looks as though it should perform the same as the Explicit Cursor Loop. The difference is that the Implicit Cursor internally performs bulk fetches, which should be faster than the Explicit Cursor because of the reduced context switches.
BEGIN FOR rec_cur IN ( SELECT * FROM test3 ) LOOP UPDATE test SET fk = rec_cur.fk , fill = rec_cur.fill WHERE pk = rec_cur.pk; END LOOP; END; /
Method 3: UPDATE with nested SET subquery
This method is pretty common. I generally recommend against it for high-volume updates because the SET sub-query is nested, meaning it is performed once for each row updated. To support this method, I needed to create an index on TEST8.PK.
UPDATE test SET (fk, fill) = ( SELECT test8.fk, test8.fill FROM test8 WHERE pk = test.pk ) WHERE pk IN ( SELECT pk FROM test8 );
Method 4: BULK COLLECT / FORALL UPDATE
This one is gaining in popularity. Using BULK COLLECT and FORALL statements is the new de-facto standard for PL/SQL programmers concerned about performance because it reduces context switching overheads between the PL/SQL and SQL engines.
The biggest drawback to this method is readability. Since Oracle does not yet provide support for record collections in FORALL, we need to use scalar collections, making for long declarations, INTO clauses, and SET clauses.
DECLARE CURSOR rec_cur IS SELECT * FROM test4; TYPE num_tab_t IS TABLE OF NUMBER(38); TYPE vc2_tab_t IS TABLE OF VARCHAR2(4000); pk_tab NUM_TAB_T; fk_tab NUM_TAB_T; fill_tab VC2_TAB_T; BEGIN OPEN rec_cur; LOOP FETCH rec_cur BULK COLLECT INTO pk_tab, fk_tab, fill_tab LIMIT 1000; EXIT WHEN pk_tab.COUNT() = 0; FORALL i IN pk_tab.FIRST .. pk_tab.LAST UPDATE test SET fk = fk_tab(i) , fill = fill_tab(i) WHERE pk = pk_tab(i); END LOOP; CLOSE rec_cur; END; /
Method 5: Updateable Join View
This is really a deprecated pre-9i method; the modern equivalent is the MERGE statement. This needs a unique index on TEST1.PK in order to enforce key preservation.
UPDATE ( SELECT /*+ ordered use_nl(old)*/ new.pk as new_pk , new.fk as new_fk , new.fill as new_fill , old.* FROM test1 new JOIN test old ON (old.pk = new.pk) ) SET fk = new_fk , fill = new_fill /
Method 6: MERGE
The modern equivalent of the Updateable Join View. Gaining in popularity due to its combination of brevity and performance, it is primarily used to INSERT and UPDATE in a single statement. We are using the update-only version here. Note that I have included a FIRST_ROWS hint to force an indexed nested loops plan. This is to keep the playing field level when comparing to the other methods, which also perform primary key lookups on the target table. A Hash join may or may not be faster, that's not the point - I could increase the size of the target TEST table to 500M rows and Hash would be slower for sure.
MERGE /*+ FIRST_ROWS*/ INTO test USING test2 new ON (test.pk = new.pk) WHEN MATCHED THEN UPDATE SET fk = new.fk , fill = new.fill /
Here is the Explain Plan
------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| ------------------------------------------------------------------------------- | 0 | MERGE STATEMENT | | 130K| 9921K| 258K (1)| | 1 | MERGE | TEST | | | | | 2 | VIEW | | | | | | 3 | NESTED LOOPS | | 130K| 11M| 258K (1)| | 4 | TABLE ACCESS FULL | TEST2 | 128K| 6032K| 172 (5)| | 5 | TABLE ACCESS BY INDEX ROWID| TEST | 1 | 48 | 2 (0)| | 6 | INDEX UNIQUE SCAN | TEST_PK | 1 | | 1 (0)| -------------------------------------------------------------------------------
Method 7: Parallel DML MERGE
Now we're getting clever... This is the MERGE example on steroids. It uses Oracle's Parallel DML capability to spread the load over multiple slave threads.
ALTER SESSION ENABLE PARALLEL DML; MERGE /*+ first_rows parallel(test) parallel(test2) */ INTO test USING test5 new ON (test.pk = new.pk) WHEN MATCHED THEN UPDATE SET fk = new.fk , fill = new.fill /
Note the differences in the Explain Plan.
-------------------------------------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| TQ |IN-OUT| PQ Distrib | -------------------------------------------------------------------------------------------------------------------- | 0 | MERGE STATEMENT | | 109K| 8325K| 1880 (1)| | | | | 1 | PX COORDINATOR | | | | | | | | | 2 | PX SEND QC (RANDOM) | :TQ10002 | 109K| 10M| 1880 (1)| Q1,02 | P->S | QC (RAND) | | 3 | INDEX MAINTENANCE | TEST | | | | Q1,02 | PCWP | | | 4 | PX RECEIVE | | 109K| 10M| 1880 (1)| Q1,02 | PCWP | | | 5 | PX SEND RANGE | :TQ10001 | 109K| 10M| 1880 (1)| Q1,01 | P->P | RANGE | | 6 | MERGE | TEST | | | | Q1,01 | PCWP | | | 7 | PX RECEIVE | | 109K| 10M| 1880 (1)| Q1,01 | PCWP | | | 8 | PX SEND HYBRID (ROWID PKEY) | :TQ10000 | 109K| 10M| 1880 (1)| Q1,00 | P->P | HYBRID (ROW| | 9 | VIEW | | | | | Q1,00 | PCWP | | | 10 | NESTED LOOPS | | 109K| 10M| 1880 (1)| Q1,00 | PCWP | | | 11 | PX BLOCK ITERATOR | | | | | Q1,00 | PCWC | | | 12 | TABLE ACCESS FULL | TEST2 | 107K| 5062K| 2 (0)| Q1,00 | PCWP | | | 13 | TABLE ACCESS BY INDEX ROWID| TEST | 1 | 48 | 0 (0)| Q1,00 | PCWP | | | 14 | INDEX UNIQUE SCAN | TEST_PK | 1 | | 0 (0)| Q1,00 | PCWP | | --------------------------------------------------------------------------------------------------------------------
Method 8: Parallel PL/SQL
This is much easier to do with DataStage than with native PL/SQL. The goal is to have several separate sessions applying UPDATE statements at once, rather than using the sometimes restrictive PARALLEL DML alternative. It's a bit of a kludge, but we can do this in PL/SQL using a Parallel Enable Table Function. Here's the function:
CREATE OR REPLACE FUNCTION test_parallel_update ( test_cur IN SYS_REFCURSOR ) RETURN test_num_arr PARALLEL_ENABLE (PARTITION test_cur BY ANY) PIPELINED IS PRAGMA AUTONOMOUS_TRANSACTION; test_rec TEST%ROWTYPE; TYPE num_tab_t IS TABLE OF NUMBER(38); TYPE vc2_tab_t IS TABLE OF VARCHAR2(4000); pk_tab NUM_TAB_T; fk_tab NUM_TAB_T; fill_tab VC2_TAB_T; cnt INTEGER := 0; BEGIN LOOP FETCH test_cur BULK COLLECT INTO pk_tab, fk_tab, fill_tab LIMIT 1000; EXIT WHEN pk_tab.COUNT() = 0; FORALL i IN pk_tab.FIRST .. pk_tab.LAST UPDATE test SET fk = fk_tab(i) , fill = fill_tab(i) WHERE pk = pk_tab(i); cnt := cnt + pk_tab.COUNT; END LOOP; CLOSE test_cur; COMMIT; PIPE ROW(cnt); RETURN; END; /
Note that it receives its data via a Ref Cursor parameter. This is a feature of Oracle's parallel-enabled functions; they will apportion the rows of a single Ref Cursor amongst many parallel slaves, with each slave running over a different subset of the input data set.
Here is the statement that calls the Parallel Enabled Table Function:
SELECT sum(column_value) FROM TABLE(test_parallel_update(CURSOR(SELECT * FROM test7)));
Note that we are using a SELECT statement to call a function that performs an UPDATE. Yeah, I know, it's nasty. You need to make the function an AUTONOMOUS TRANSACTION to stop it from throwing an error. But just bear with me, it is the closest PL/SQL equivalent I can make to a third-party ETL Tool such as DataStage with native parallelism.
And on to the testing....
ROUND 1
In this test, we apply the 100K updated rows in Global Temporary Table TEST{n} to permanent table TEST. There are 3 runs:
- Run 1: The buffer cache is flushed and about 1 hour of unrelated statistics gathering has been used to age out the disk cache.
- Run 2: The buffer cache is flushed and the disk cache has been aged out with about 5-10mins of indexed reads. Timings indicate that the disk cache is still partially populated with blocks used by the query.
- Run 3: The buffer cache is pre-salted with the table and blocks it will need. It should perform very little disk IO.
RUN 1 RUN 2 RUN 3 ----------------------------------- ----- ----- ----- 1. Explicit Cursor Loop 931.3 783.2 49.3 2. Implicit Cursor Loop 952.7 672.8 40.2 3. UPDATE with nested SET subquery 941.4 891.5 31.5 4. BULK COLLECT / FORALL UPDATE 935.2 826.0 27.9 5. Updateable Join View 933.2 741.0 28.8 6. MERGE 854.6 838.5 28.4 7. Parallel DML MERGE 55.7 46.1 47.7 8. Parallel PL/SQL 28.2 27.2 6.3
The things I found interesting from these results are:
- Amongst the non-parallel methods (1-6), context switches only make a significant and noticable difference with cached data. With uncached data, the cost of disk reads so far outweighs the context switches that they are barely noticable. Context Switching - whilst important - is not really a game-changer. This tells me that you should avoid methods 1 and 2 as a best practice, but it is probably not cost-effective to re-engineer existing method 1/2 code unless your buffer cache hit ratio is 99+% (ie. like RUN 3).
- There were no significant differences between the 6 non-parallel methods, however this is not to suggest that it is not important which one you choose. All of these benchmarks perform Primary Key lookups of the updated table, however it is possible to run methods 5 and 6 as hash joins with full table scans. If the proportion of blocks updated is high enough, the hash join can make an enormous difference in the run time. See Appendix 1 for an example.
- Parallel updates are a game changer. The reason for this is disk latency. Almost all of the time for RUN 1 and RUN 2 of the non-parallel methods is spent waiting for reads and writes on disk. The IO system of most computers is designed to serve many requests at a time, but no ONE request can utilise ALL of the resources. So when an operation runs serially, it only uses a small proportion of the available resources. If there are no other jobs running then we get poor utilisation. The parallel methods 7 and 8 allow us to tap into these under-utilised resources. Instead of issuing 100,000 disk IO requests one after the other, these methods allow (say) 100 parallel threads to perform just 1000 sequential IO requests.
- Method 8, which is the equivalent of running many concurrent versions of Method 4 with different data, is consistently faster than Oracle's Parallel DML. This is worth exploring. Since the non-parallel equivalents (Methods 4 and 6) show no significant performance difference, it is reasonable to expect that parallelising these two methods will yield similar results. I ran some traces (see Appendix 2) and found that the Parallel Merge was creating too many parallel threads and suffering from latch contention. Manually reducing the number of parallel threads made it perform similarly to the Parallel PL/SQL method. The lesson here is that too much parallelism is a bad thing.
ROUND 2
Let's see how a Foreign Key constraint affects things. For this round, I have created a parent table and a Foreign Key on the FK column.
For brevity, this time we'll just flush the buffer cache and run about 5 minutes worth of indexed reads to cycle the disk cache.
RUN 1 RUN 2 ----------------------------------- ----- ----- 1. Explicit Cursor Loop 887.1 874.6 2. Implicit Cursor Loop 967.0 752.1 3. UPDATE with nested SET subquery 920.1 795.2 4. BULK COLLECT / FORALL UPDATE 840.9 759.2 5. Updateable Join View 727.5 851.8 6. MERGE 807.8 833.6 7. Parallel DML MERGE 26.8 29.2 8. Parallel PL/SQL 25.3 23.8
Summary of findings:
- It looks as though there is a small premium associated with checking the foreign key, although it does not appear to be significant. It's worth noting that the parent table in this case is very small and quickly cached. A very large parent table would result in considerably greater number of cache misses and resultant disk IO. Foreign keys are often blamed for bad performance; whilst they can be limiting in some circumstances (e.g. direct path loads), updates are not greatly affected when the parent tables are small.
- I was expecting the Parallel DML MERGE to be slower. According to the Oracle® Database Data Warehousing Guide - 10g Release 2, INSERT and MERGE are "Not Parallelized" when issed against the child of a Foreign Key constraint, whereas parallel UPDATE is "supported". As a test, I issued a similar MERGE statement and redundantly included the WHEN NOT MATCHED THEN INSERT clause: it was not parallelized and ran slower. The lesson here: there may be merit in applying an upsert (insert else update) as an update-only MERGE followed by an INSERT instead of using a single MERGE.
ROUND 3
The two things I hear most about Bitmap indexes is that:
- They are inappropriate for tables that undergo concurrent updates, and
- They are slow to update.
Surely no comparison of update methods could possibly be complete without a test of Bitmap index maintenance.
In this round, I have removed the Foreign Key used in Round 2, and included a Bitmap index on TEST.FK
RUN 1 RUN 2 ----------------------------------- ----- ----- 1. Explicit Cursor Loop 826.0 951.2 2. Implicit Cursor Loop 898.7 877.2 3. UPDATE with nested SET subquery 588.9 633.4 4. BULK COLLECT / FORALL UPDATE 898.0 926.7 5. Updateable Join View 547.8 687.1 6. MERGE 689.3 763.4 7. Parallel DML MERGE 30.2 28.4 8. Parallel PL/SQL ORA-00060: deadlock detected
Well, if further proof was needed that Bitmap indexes are inappropriate for tables that are maintained by multiple concurrent sessions, surely this is it. The Deadlock error raised by Method 8 occurred because bitmap indexes are locked at the block-level, not the row level. With hundreds of rows represented by each block in the index, the chances of two sessions attempting to lock the same block are quite high. The very clear lesson here: don't update bitmap indexed tables in parallel sessions; the only safe parallel method is PARALLEL DML.
The other intesting outcome is the differing impact of the bitmap index on SET-based updates vs transactional updates (SQL solutions vs PL/SQL solutions). PL/SQL solutions seem to incur a penalty when updating bitmap indexed tables. A single bitmap index has added around 10% to the overall runtime of PL/SQL solutions, whereas the set-based (SQL-based) solutions run faster than the B-Tree indexes case (above). Although not shown here, this effect is magnified with each additional bitmap index. Given that most bitmap-indexed tables would have several such indexes (as bitmap indexes are designed to be of most use in combination), this shows that PL/SQL is virtually non-viable as a means of updating a large number of rows.
SUMMARY OF FINDINGS
- Context Switches in cursor loops have greatest impact when data is well cached. For updates with buffer cache hit-ratio >99%, convert to BULK COLLECT or MERGE.
- Use MERGE with a Hash Join when updating a significant proportion of blocks (not rows!) in a segment.
- Parallelize large updates for a massive performance improvement.
- Tune the number of parallel query servers used by looking for latch contention thread startup waits.
- Don't rashly drop Foreign Keys without benchmarking; they may not be costing very much to maintain.
- MERGE statements that UPDATE and INSERT cannot be parallelised when a Foreign Key is present. If you want to keep the Foreign Key, you will need to use multiple concurrent sessions (insert/update variant of Method 8) to achieve parallelism.
- Don't use PL/SQL to maintain bitmap indexed tables; not even with BULK COLLECT / FORALL. Instead, INSERT transactions into a Global Temporary Table and apply a MERGE.
APPENDIX 1 - Nested Loops MERGE vs. Hash Join MERGE
Although we are updating only 1% of the rows in the table, those rows are almost perfectly distributed throughout the table. As a result, we end up updating almost 100% of the blocks. This makes it a good candidate for hash joins and full scans to out-perform indexed nested loops. Of course, as you decrease the percentage of blocks updated, the balance will swing in favour of Nested Loops; but this trace demonstrates that MERGE definitely has it's place in high-volume updates.
MERGE /*+ FIRST_ROWS*/ INTO test USING test2 new ON (test.pk = new.pk) WHEN MATCHED THEN UPDATE SET fk = new.fk , fill = new.fill ------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| ------------------------------------------------------------------------------- | 0 | MERGE STATEMENT | | 95331 | 7261K| 191K (1)| | 1 | MERGE | TEST | | | | | 2 | VIEW | | | | | | 3 | NESTED LOOPS | | 95331 | 8937K| 191K (1)| | 4 | TABLE ACCESS FULL | TEST2 | 95331 | 4468K| 170 (3)| | 5 | TABLE ACCESS BY INDEX ROWID| TEST | 1 | 48 | 2 (0)| | 6 | INDEX UNIQUE SCAN | TEST_PK | 1 | | 1 (0)| ------------------------------------------------------------------------------- call count cpu elapsed disk query current rows ------- ------ -------- ---------- ---------- ---------- ---------- ---------- Parse 1 0.01 0.01 0 4 1 0 Execute 1 57.67 829.77 95323 383225 533245 100000 Fetch 0 0.00 0.00 0 0 0 0 ------- ------ -------- ---------- ---------- ---------- ---------- ---------- total 2 57.68 829.78 95323 383229 533246 100000 Misses in library cache during parse: 1 Optimizer mode: FIRST_ROWS Parsing user id: 140 Rows Row Source Operation ------- --------------------------------------------------- 1 MERGE TEST (cr=383225 pr=95323 pw=0 time=127458586 us) 100000 VIEW (cr=371028 pr=75353 pw=0 time=619853020 us) 100000 NESTED LOOPS (cr=371028 pr=75353 pw=0 time=619653018 us) 100000 TABLE ACCESS FULL TEST2 (cr=750 pr=386 pw=0 time=505310 us) 100000 TABLE ACCESS BY INDEX ROWID TEST (cr=370278 pr=74967 pw=0 time=615942540 us) 100000 INDEX UNIQUE SCAN TEST_PK (cr=200015 pr=227 pw=0 time=4528703 us)(object id 141439) Elapsed times include waiting on following events: Event waited on Times Max. Wait Total Waited ---------------------------------------- Waited ---------- ------------ db file scattered read 37 0.20 0.72 db file sequential read 94936 0.39 781.52 buffer exterminate 1 0.97 0.97 SQL*Net message to client 1 0.00 0.00 SQL*Net message from client 1 0.05 0.05 ********************************************************************************
MERGE INTO test USING test2 new ON (test.pk = new.pk) WHEN MATCHED THEN UPDATE SET fk = new.fk , fill = new.fill --------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| --------------------------------------------------------------------------- | 0 | MERGE STATEMENT | | 95331 | 7261K| | 46318 (3)| | 1 | MERGE | TEST | | | | | | 2 | VIEW | | | | | | | 3 | HASH JOIN | | 95331 | 8937K| 5592K| 46318 (3)| | 4 | TABLE ACCESS FULL| TEST2 | 95331 | 4468K| | 170 (3)| | 5 | TABLE ACCESS FULL| TEST | 10M| 458M| | 16949 (4)| --------------------------------------------------------------------------- call count cpu elapsed disk query current rows ------- ------ -------- ---------- ---------- ---------- ---------- ---------- Parse 1 0.03 0.40 1 4 1 0 Execute 1 54.50 123.48 94547 82411 533095 100000 Fetch 0 0.00 0.00 0 0 0 0 ------- ------ -------- ---------- ---------- ---------- ---------- ---------- total 2 54.53 123.88 94548 82415 533096 100000 Misses in library cache during parse: 1 Optimizer mode: ALL_ROWS Parsing user id: 140 Rows Row Source Operation ------- --------------------------------------------------- 1 MERGE TEST (cr=82411 pr=94547 pw=0 time=123480418 us) 100000 VIEW (cr=75424 pr=74949 pw=0 time=48081374 us) 100000 HASH JOIN (cr=75424 pr=74949 pw=0 time=47981370 us) 100000 TABLE ACCESS FULL TEST2 (cr=750 pr=335 pw=0 time=1207771 us) 9999999 TABLE ACCESS FULL TEST (cr=74674 pr=74614 pw=0 time=10033917 us) Elapsed times include waiting on following events: Event waited on Times Max. Wait Total Waited ---------------------------------------- Waited ---------- ------------ db file sequential read 19606 0.37 41.24 db file scattered read 4720 0.52 34.20 SQL*Net message to client 1 0.00 0.00 SQL*Net message from client 1 0.03 0.03
That's a pretty significant difference: the same method (MERGE) is 6-7 times faster when performed as a Hash Join. Although the number of physical disk blocks and Current Mode Gets are about the same in each test, the Hash Join method performs multi-block reads, resulting in fewer visits to the disk.
All 8 methods above were benchmarked on the assumption that the target table is arbitrarily large and the subset of rows/blocks to be updated are relatively small. If the proportion of updated blocks increases, then the average cost of finding those rows decreases; the exercise becomes one of tuning the data access rather than tuning the update.
APPENDIX 2 - Parallel DML vs. PARALLEL PL/SQL
Why is the Parallel PL/SQL (Method 8) approach much faster than the Parallel DML MERGE (Method 7)? To shed some light, here are some traces. Below we see the trace from the Parallel Coordinator session of Method 7:
MERGE /*+ first_rows */ INTO test USING test5 new ON (test.pk = new.pk) WHEN MATCHED THEN UPDATE SET fk = new.fk , fill = new.fill call count cpu elapsed disk query current rows ------- ------ -------- ---------- ---------- ---------- ---------- ---------- Parse 1 0.02 0.02 0 4 1 0 Execute 1 1.85 57.91 1 7 2 100000 Fetch 0 0.00 0.00 0 0 0 0 ------- ------ -------- ---------- ---------- ---------- ---------- ---------- total 2 1.87 57.94 1 11 3 100000 Misses in library cache during parse: 1 Optimizer mode: FIRST_ROWS Parsing user id: 140 Rows Row Source Operation ------- --------------------------------------------------- 128 PX COORDINATOR (cr=7 pr=1 pw=0 time=57912088 us) 0 PX SEND QC (RANDOM) :TQ10002 (cr=0 pr=0 pw=0 time=0 us) 0 INDEX MAINTENANCE TEST (cr=0 pr=0 pw=0 time=0 us)(object id 0) 0 PX RECEIVE (cr=0 pr=0 pw=0 time=0 us) 0 PX SEND RANGE :TQ10001 (cr=0 pr=0 pw=0 time=0 us) 0 MERGE TEST (cr=0 pr=0 pw=0 time=0 us) 0 PX RECEIVE (cr=0 pr=0 pw=0 time=0 us) 0 PX SEND HYBRID (ROWID PKEY) :TQ10000 (cr=0 pr=0 pw=0 time=0 us) 0 VIEW (cr=0 pr=0 pw=0 time=0 us) 0 NESTED LOOPS (cr=0 pr=0 pw=0 time=0 us) 0 PX BLOCK ITERATOR (cr=0 pr=0 pw=0 time=0 us) 0 TABLE ACCESS FULL TEST5 (cr=0 pr=0 pw=0 time=0 us) 0 TABLE ACCESS BY INDEX ROWID TEST (cr=0 pr=0 pw=0 time=0 us) 0 INDEX UNIQUE SCAN TEST_PK (cr=0 pr=0 pw=0 time=0 us)(object id 141439) Elapsed times include waiting on following events: Event waited on Times Max. Wait Total Waited ---------------------------------------- Waited ---------- ------------ db file sequential read 1 0.02 0.02 reliable message 1 0.00 0.00 enq: RO - fast object reuse 1 0.00 0.00 os thread startup 256 0.09 23.61 PX Deq: Join ACK 7 0.00 0.00 PX Deq: Parse Reply 15 0.09 0.19 PX Deq Credit: send blkd 35 0.00 0.00 PX qref latch 5 0.00 0.00 PX Deq: Execute Reply 1141 1.96 30.30 SQL*Net message to client 1 0.00 0.00 SQL*Net message from client 1 0.05 0.05
We can see here that the Parallel Co-ordinator spent 23.61 seconds (of the 57.94 elapsed) simply starting up the parallel threads, and 30.3 seconds waiting for them to do their stuff.
And here are the wait events for just ONE of the parallel threads from the same test case:
Elapsed times include waiting on following events: Event waited on Times Max. Wait Total Waited ---------------------------------------- Waited ---------- ------------ cursor: pin S wait on X 3 0.02 0.06 PX Deq: Execution Msg 16 1.96 10.94 PX Deq: Msg Fragment 2 0.00 0.00 latch: parallel query alloc buffer 7 5.89 7.52 db file sequential read 825 0.10 12.00 read by other session 17 0.06 0.18 log buffer space 1 0.03 0.03 PX Deq Credit: send blkd 1 0.02 0.02 PX Deq: Table Q Normal 28 0.19 0.35 latch: cache buffers chains 1 0.01 0.01 db file parallel read 1 0.11 0.11
From this, we can see that of the 30.3 seconds the Co-ordinator spent waiting for the parallel threads, this one spent 7.52 waiting for shared resources (latches) held by other parallel threads, and just 12 seconds reading blocks from disk.
For comparison, here is the trace of the Co-ordinator session of a Parallel PL/SQL run:
SELECT sum(column_value) FROM TABLE(test_parallel_update( CURSOR(SELECT * FROM TEST7) )) call count cpu elapsed disk query current rows ------- ------ -------- ---------- ---------- ---------- ---------- ---------- Parse 1 0.05 0.13 7 87 1 0 Execute 1 0.20 12.47 0 3 0 0 Fetch 2 0.21 20.13 0 0 0 1 ------- ------ -------- ---------- ---------- ---------- ---------- ---------- total 4 0.46 32.74 7 90 1 1 Misses in library cache during parse: 1 Optimizer mode: ALL_ROWS Parsing user id: 140 Rows Row Source Operation ------- --------------------------------------------------- 1 SORT AGGREGATE (cr=3 pr=0 pw=0 time=32609316 us) 128 PX COORDINATOR (cr=3 pr=0 pw=0 time=252152371 us) 0 PX SEND QC (RANDOM) :TQ10000 (cr=0 pr=0 pw=0 time=0 us) 0 SORT AGGREGATE (cr=0 pr=0 pw=0 time=0 us) 0 VIEW (cr=0 pr=0 pw=0 time=0 us) 0 COLLECTION ITERATOR PICKLER FETCH TEST_PARALLEL_UPDATE (cr=0 pr=0 pw=0 time=0 us) 0 PX BLOCK ITERATOR (cr=0 pr=0 pw=0 time=0 us) 0 TABLE ACCESS FULL TEST7 (cr=0 pr=0 pw=0 time=0 us) Elapsed times include waiting on following events: Event waited on Times Max. Wait Total Waited ---------------------------------------- Waited ---------- ------------ reliable message 1 0.00 0.00 enq: RO - fast object reuse 1 0.00 0.00 os thread startup 128 0.10 11.85 PX Deq: Join ACK 4 0.00 0.00 PX Deq: Parse Reply 46 0.00 0.10 PX Deq Credit: send blkd 128 0.00 0.06 SQL*Net message to client 2 0.00 0.00 PX Deq: Execute Reply 143 1.96 19.86 PX Deq: Signal ACK 4 0.00 0.00 enq: PS - contention 2 0.00 0.00 SQL*Net message from client 2 0.03 0.06
And the wait events for a single parallel thread:
Elapsed times include waiting on following events: Event waited on Times Max. Wait Total Waited ---------------------------------------- Waited ---------- ------------ PX Deq: Execution Msg 8 0.13 0.22 PX Deq: Msg Fragment 1 0.00 0.00 library cache load lock 3 0.19 0.30 db file sequential read 872 0.11 16.47 read by other session 13 0.06 0.25 latch: cache buffers chains 3 0.01 0.02
The Parallel PL/SQL spent just 11.85 seconds starting parallel threads, compared to 23.61 seconds for PARALLEL DML. I noticed from the trace that PARALLEL DML used 256 parallel threads, whereas the PL/SQL method used just 128. Looking more closely at the trace files I suspect that the PARALLEL DML used 128 readers and 128 writers, although it hard to be sure. Whatever Oracle is doing here, it seems there is certainly a significant cost of opening parallel threads.
Also, looking at the wait events for the Parallel PL/SQL slave thread, we see no evidence of resource contention as we did in the PARALLEL DML example.
In theory, we should be able to reduce the cost of thread startup and also reduce contention by reducing the number of parallel threads. Knowing from above that the parallel methods were 10-20 time faster than the non-parallel methods, I suspect that benefits of parallelism diminish after no more than 32 parallel threads. In support of that theory, here is a trace of a PARALLEL DML test case with 32 parallel threads:
First the Parallel Co-ordinator:
MERGE /*+ first_rows parallel(test5 32) parallel(test 32) */ INTO test USING test5 new ON (test.pk = new.pk) WHEN MATCHED THEN UPDATE SET fk = new.fk , fill = new.fill call count cpu elapsed disk query current rows ------- ------ -------- ---------- ---------- ---------- ---------- ---------- Parse 1 0.02 0.02 0 4 1 0 Execute 1 0.55 31.14 0 7 2 100000 Fetch 0 0.00 0.00 0 0 0 0 ------- ------ -------- ---------- ---------- ---------- ---------- ---------- total 2 0.57 31.17 0 11 3 100000 Misses in library cache during parse: 1 Optimizer mode: FIRST_ROWS Parsing user id: 140 Rows Row Source Operation ------- --------------------------------------------------- 32 PX COORDINATOR (cr=7 pr=0 pw=0 time=30266841 us) 0 PX SEND QC (RANDOM) :TQ10002 (cr=0 pr=0 pw=0 time=0 us) 0 INDEX MAINTENANCE TEST (cr=0 pr=0 pw=0 time=0 us)(object id 0) 0 PX RECEIVE (cr=0 pr=0 pw=0 time=0 us) 0 PX SEND RANGE :TQ10001 (cr=0 pr=0 pw=0 time=0 us) 0 MERGE TEST (cr=0 pr=0 pw=0 time=0 us) 0 PX RECEIVE (cr=0 pr=0 pw=0 time=0 us) 0 PX SEND HYBRID (ROWID PKEY) :TQ10000 (cr=0 pr=0 pw=0 time=0 us) 0 VIEW (cr=0 pr=0 pw=0 time=0 us) 0 NESTED LOOPS (cr=0 pr=0 pw=0 time=0 us) 0 PX BLOCK ITERATOR (cr=0 pr=0 pw=0 time=0 us) 0 TABLE ACCESS FULL TEST5 (cr=0 pr=0 pw=0 time=0 us) 0 TABLE ACCESS BY INDEX ROWID TEST (cr=0 pr=0 pw=0 time=0 us) 0 INDEX UNIQUE SCAN TEST_PK (cr=0 pr=0 pw=0 time=0 us)(object id 141439) Elapsed times include waiting on following events: Event waited on Times Max. Wait Total Waited ---------------------------------------- Waited ---------- ------------ reliable message 1 0.00 0.00 enq: RO - fast object reuse 1 0.00 0.00 os thread startup 64 0.09 5.89 PX Deq: Join ACK 9 0.00 0.00 PX Deq: Parse Reply 18 0.04 0.06 PX Deq: Execute Reply 891 1.78 24.18 SQL*Net message to client 1 0.00 0.00 SQL*Net message from client 1 0.03 0.03
And the wait events for one of those threads:
Elapsed times include waiting on following events: Event waited on Times Max. Wait Total Waited ---------------------------------------- Waited ---------- ------------ cursor: pin S wait on X 1 0.02 0.02 PX Deq: Execution Msg 34 0.11 0.16 db file sequential read 2419 0.30 22.21 read by other session 5 0.02 0.05 buffer busy waits 1 0.00 0.00 PX Deq Credit: send blkd 1 0.00 0.00 PX Deq: Table Q Normal 4 0.18 0.19
Note in this case
- OS thread startup of just 5.89 seconds
- No more resource contention
- Performance now in line with that of the PARALLEL PL/SQL solution
==================================================
Stolen from ?
ReplyDeletehttp://www.orafaq.com/node/2450