Oracle deadlock problem: enq: TX-allocate ITL entry

I encountered a very interesting problem before and I learned a lot. Share with everyone. Before reading the following story, please understand the following concepts

Deadlock : refers to a blocking phenomenon caused by competition for resources or due to communication between two or more processes in the execution process. If there is no external force, they will not be able to advance. At this time, the system is said to be in a deadlock state or the system has a deadlock. These processes that are always waiting for each other are called deadlock processes.
Database deadlock : If you need to "modify" a piece of data, first the database will lock it to ensure that only one transaction can modify it at the same time. Locking occurs when a transaction obtains a "lock" on a certain resource. At this time, other transactions cannot change the resource. This mechanism exists to ensure data consistency
ITL : (Interested Transaction List) is an internal part of the Oracle data block located in the database header (block header), is composed of a series of ITS (Interested Transaction Slot, transaction slot) lists, ITL is used to record all occurrences of the block Transaction, an its can be regarded as a transaction record. If the transaction has been committed, then the itL location can be used repeatedly, because ITL is similar to a record, so sometimes it is also called an itL slot. If a transaction has not been committed, then this transaction will always occupy an ITL slot. ITL records transaction information, the entry of the rollback segment, the transaction type, and so on. If the transaction has been committed, then the SCN number when the transaction was committed is also stored in the ITL slot.

Problem phenomenon

The story happened like this, a long, long time ago...

Soon after the new version of the system was launched in the prefectures and cities, on-site personnel reported that when performing batch business operations, business processing would be stuck, blocking normal business. Colleagues on the scene also reported that the probability of the occurrence of the problem is not necessarily present. The batch operation is sometimes normal, sometimes it will start, and it will only appear under certain circumstances, and no law of the problem has been found.

It's another occasional problem.....It's another headache problem...

Problem handling

Investigation process

Obtained the log from the scene. Use the UEditor tool to split the log to obtain the information of the batch operation. From the log, 2018-11-19 14:22:27 Thread 5 began to die, and after 1800 seconds (30 minutes) in the waiting state, the suspended animation ended.

2018-11-19 14:22:26,391 INFO [com.star.sms.business.payment.BatchPaymentServiceImpl$1] pool-882-thread-5start|size=32018-11-19 14:22:27,008 INFO [com.star.sms.business.accept2.billing.AcceptSheetBillContext] Current Thread:pool-882-thread-5 ######caclulate halfMonth fee total time : 73  #####2018-11-19 14:22:27,569 INFO [com.star.sms.business.accept2.billing.AcceptSheetBillContext] Current Thread:pool-882-thread-5 ######caclulate halfMonth fee total time : 72  #####2018-11-19 14:52:26,916 INFO [com.star.sms.business.accept2.billing.AcceptSheetBillContext] Current Thread:pool-882-thread-5 ######caclulate halfMonth fee total time : 58  #####2018-11-19 14:52:27,112 INFO [com.star.sms.business.payment.BatchPaymentServiceImpl$1] pool-882-thread-5end|time=1800.721 s2018-11-19 14:52:26,672 INFO [com.star.sms.business.payment.BatchPaymentServiceImpl$1] pool-882-thread-3end|time=1800.281 s2018-11-19 14:52:26,683 INFO [com.star.sms.business.payment.BatchPaymentServiceImpl$1] pool-882-thread-1end|time=1800.292 s2018-11-19 14:52:26,916 INFO [com.star.sms.business.accept2.billing.AcceptSheetBillContext] Current Thread:pool-882-thread-5 ######caclulate halfMonth fee total time : 58  #####2018-11-19 14:52:27,032 INFO [com.star.sms.business.armgmt.ChargeCauseWriteOffListener] Received write offevent,accountid = 83224252018-11-19 14:52:27,033 INFO [com.star.sms.business.armgmt.ArInnerServiceImpl]account id :8322425,there is not owe bill to write off2018-11-19 14:52:27,036 INFO [com.star.sms.business.armgmt.ArInnerServiceImpl]query subscriber by accountId is empty(id:8322425 operWay:2)2018-11-19 14:52:27,038 INFO [com.star.sms.business.armgmt.ArInnerServiceImpl]query subscriber by accountId is empty(id:8322425 operWay:2)2018-11-19 14:52:27,038 INFO [com.star.sms.business.armgmt.ArInnerServiceImpl]account id :8322425,there is not owe bill to write off2018-11-19 14:52:27,112 INFO [com.star.sms.business.payment.BatchPaymentServiceImpl$1] pool-882-thread-5 end|time=1800.721 s

Use jstack to view the thread stack information and analyze the thread stack information: it is found that several sub-threads in the batch thread pool are stopped (line 13). When the $Proxy953.modifyDiscountInstance database modifies the discount case, this method has only one role in the program UPDATE operation

org.springframework.aop.support.DelegatingIntroductionInterceptor.doProceed(DelegatingIntroductionInterceptor.java:132)org.springframework.aop.support.DelegatingIntroductionInterceptor.invoke(DelegatingIntroductionInterceptor.java:120)org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172)org.eclipse.gemini.blueprint.service.util.internal.aop.ServiceTCCLInterceptor.invokeUnprivileged(ServiceTCCLInterceptor.java:70)org.eclipse.gemini.blueprint.service.util.internal.aop.ServiceTCCLInterceptor.invoke(ServiceTCCLInterceptor.java:53)org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172)org.eclipse.gemini.blueprint.service.importer.support.LocalBundleContextAdvice.invoke(LocalBundleContextAdvice.java:57)org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172)org.springframework.aop.support.DelegatingIntroductionInterceptor.doProceed(DelegatingIntroductionInterceptor.java:132)org.springframework.aop.support.DelegatingIntroductionInterceptor.invoke(DelegatingIntroductionInterceptor.java:120)org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172)org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:204)$Proxy953.modifyDiscountInstance(Unknown Source)com.star.sms.business.accept2.instance.PromotionInstanceContext.cancel(PromotionInstanceContext.java:262)com.star.sms.business.accept2.instance.PricePlanInstanceContext.delete(PricePlanInstanceContext.java:443)com.star.sms.business.accept2.utils.RenewalFeeHelper.changePricePlan(RenewalFeeHelper.java:92)com.star.sms.business.accept2.instance.ProductInstanceContextAfterNew.changePricePlanByStrategy(ProductInstanceContextAfterNew.java:965)com.star.sms.business.accept2.order.ProductRenewalFeeOrderContext.doComplete(ProductRenewalFeeOrderContext.java:132)

Suspected that there was a deadlock in the database, so I executed the following SQL to check whether there was mutual blocking and interlocking in the SQL execution. I found that the guessing result was different from what I expected, and there was no SQL waiting because of the lock.

with vw_lock AS (SELECT * FROM v$lock)selecta.sid,'is blocking' blocking,(select 'sid:'||s.sid||' object:'||do.object_name||' rowid:'||    dbms_rowid.rowid_create ( 1, ROW_WAIT_OBJ#, ROW_WAIT_FILE#,ROW_WAIT_BLOCK#, ROW_WAIT_ROW# )    ||' sql_id:'||s.sql_id   from v$session s, dba_objects do    where s.sid=b.sid    and s.ROW_WAIT_OBJ# = do.OBJECT_ID) blockee,b.sid sid2,b.id1,b.id2from vw_lock a, vw_lock bwhere a.block = 1and b.request > 0and a.id1 = b.id1and a.id2 = b.id2;

Check the SQL information being executed in the database again, and find that the SQL for modifying the product instance is blocked by enq: TX-allocate ITL entry, then what is enq: TX-allocate ITL entry?

Take a chestnut to illustrate: Suppose the data block has only 2 itl slots, and now 3 transactions have occurred, and because the block is filled with data or reaches the configured maximum value, there is no remaining space at all. Allocate a new itl, so there is a wait, this is enq: TX-allocate ITL entry

select username,       sid,       serial#,       sql_id,       event,       machine,       program,       blocking_session,       status,       module  from gv$session where wait_class# <> 6 order by blocking_session;

So the direct cause of the problem is that the transaction has not been submitted, the database has been in the ITL card slot allocation waiting state, and finally the transaction timed out. This is also the origin of the 1800-second log in the previous log.

problem causes

The direct cause of the business system stuck has been found, but the problem has come again. Why does the system have the enq: TX-allocate ITL entry problem?

After analysis, it is found that multi-threaded processing is used according to the amount of data in batch business processing. The program allocates threads according to the number of business tasks. When all threads are processed, the transaction is submitted uniformly. Think about how many operators are doing batch processing. The system allocates threads. Processing, child thread UPDATE table data, waiting for transaction, waiting for transaction, let me go! !

  • The business program processes a large amount of data, and the main thread waits for the completion of all threads before being awakened to execute the commit transaction
  • When the UPDATE in the program is executed in the child thread, there are many transactions. ORACLE allocates the ITL transaction card slot, the card slot of the discountInstance data block has been allocated, and other transactions that use the card slot are released, but other transactions wait for notification. Because it is waiting for the main thread to notify commit;

Eventually cause the main thread of the system program to interlock with the ITL card slot allocation in the Oracle transaction

    try {        result = processByCall(datas);            if (CollectionUtils.isEmpty(result.getFailList())) {            result.setSucess(true);            transactionGuarded.hold(Thread.currentThread().getName());        } else {            result.setSucess(false);    	    transactionGuarded.setFailed();        }    } catch (Throwable ex) {    	pringlog(ex);    	result.setSucess(false);    	transactionGuarded.setFailed();    	throw new RuntimeException(ex);    } finally {    	if (transactionGuarded.isSuccess()) {	    	transactionManager.commit(status);    	} else {	    	transactionManager.rollback(status);    	}    }

Solution

The space of ITL is determined by INITRANS. Several parameters when the table is initialized determine ITL related information

  • INITRANS refers to the space (ITLs) initially pre-allocated to parallel transaction control on a BLOCK
  • MAXTRANS means that if the INITRANS space is not enough, it will automatically expand ITL until the maximum value is the MAXTRANS value, the default is 255. However, if the BLOCK space is insufficient, it may not be able to continuously expand to 255 ITS spaces
SQL> select table_name,tablespace_name,ini_trans,max_trans from user_tables where table_name='DISCOUNTINSTANCEEN'; TABLE_NAME                     TABLESPACE_NAME                 INI_TRANS  MAX_TRANS------------------------------ ------------------------------ ---------- ----------DISCOUNTINSTANCEEN             TBLSMSUSER                             10        255

There are 2 situations in which ITL waits

  • ITL has reached the maximum value of MAXTRANS

This situation may be caused by high concurrency, you can consider reducing the number of concurrent business operations and expanding the value of MAXTRANS to solve

  • ITL has not reached the maximum, but there is no room for its expansion

The table where this happens will usually be updated frequently, causing the reserved space (PCTFREE) to be filled. It can be solved by increasing the INITRANS or PCTFREE of the table (depending on the amount of concurrent transactions on the table. Generally, if the amount of concurrency is high, it is recommended to increase  
 INITRANS first, otherwise, consider increasing PCTFREE first)