RecoverPoint 4.0, VNX High CPU, and Thin LUN Extender mechanism

During a recent RecoverPoint 4.0 implementation we ran into an odd issue. The environment was dual VNX 5600s with RecoverPoint SE (4.0 SP2 m29, physical RPAs) replicating four consistency groups: two small 4TB groups and two larger groups (24TB and 30TB). Setup went without issue and all best practices were followed. Once setup was complete and the first consistency group (a single 4TB group) was configured, we noticed several errors and warnings. During initial synchronization it is not uncommon to see a few warnings, but they normally clear once the group finishes syncing and goes active. The warnings we were receiving kept clearing and then coming back. We also noticed that SP B utilization was very high, above 60% and peaking at 90%. We immediately disabled the group and began investigating.

We double-checked the fabric, the RPAs, journals, trespassed LUNs, LUN/SP placement, the VNX itself, and the storage pools for the source and destination LUNs. Everything checked out and was consistent with a best-practice setup. We decided to destroy the consistency group and set it up again. Sure enough, the same issues persisted. We tried the other 4TB LUN and the larger 24TB and 30TB LUNs, and the same thing occurred. The warnings and SP utilization were even worse on the larger LUNs. We suspected an SP issue, so we rebooted SP B. Things seemed to improve, with the 4TB LUN finishing its sync and going active. The warnings were still present, but SP utilization was back to normal. We went ahead and set up the 24TB LUN. After a few hours of replication the problems returned. SP B utilization climbed so high that the SP actually went offline and had to be manually rebooted again. Not good! Time to open a support case with EMC.

After EMC reviewed the SP collects, they were confident they knew what the issue was. It turns out RecoverPoint has a special way of reading thin LUNs from a VNX: the Thin LUN Extender. This method was implemented in RP 3.2 for CX/VNX storage and splitters and has been used in all RP versions since. When it is disabled, RecoverPoint falls back to the traditional method of reading thin LUNs. Disabling this feature does NOT stop RecoverPoint from replicating thin LUNs. RecoverPoint only reads from LUNs (thin or thick) during initialization, so disabling the thin extent reader has little effect on the environment overall and no effect during active replication. In the traditional method, all blocks are read on both the production LUN and the remote LUN, hashes of the blocks are compared, and a block is copied only when the hashes show the two sides are inconsistent. The Thin LUN Extender instead issues a special SCSI command, "read buffer", against both the local and remote site LUNs to determine which blocks in the LUN pools are actually allocated. RecoverPoint can then skip block addresses that are not allocated, which shortens initialization.
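To make the difference between the two initialization methods concrete, here is a toy Python model. It is only a sketch of the idea described above, not how RecoverPoint is actually implemented: the thin LUN is modeled as a dict of allocated block addresses, unallocated addresses are assumed to read back as zeros, and every function name here is made up for illustration.

```python
import hashlib

# Assumption for this toy model: an unallocated address reads back as zeros.
ZERO = b"\x00" * 8


def digest(block: bytes) -> bytes:
    """Hash used to compare a source block with its target counterpart."""
    return hashlib.sha256(block).digest()


def full_sweep_init(source: dict, target: dict, total_blocks: int):
    """Traditional method: read every address on both sides, copy on mismatch."""
    reads = copies = 0
    for addr in range(total_blocks):
        src = source.get(addr, ZERO)
        dst = target.get(addr, ZERO)
        reads += 2  # one read on each side
        if digest(src) != digest(dst):
            target[addr] = src
            copies += 1
    return reads, copies


def thin_aware_init(source: dict, target: dict):
    """Thin-aware method: only visit addresses allocated on either side."""
    reads = copies = 0
    # Stand-in for asking each array which addresses are allocated.
    allocated = sorted(set(source) | set(target))
    for addr in allocated:
        src = source.get(addr, ZERO)
        dst = target.get(addr, ZERO)
        reads += 2
        if digest(src) != digest(dst):
            target[addr] = src
            copies += 1
    return reads, copies


def logical_view(lun: dict, total_blocks: int):
    """What a host would see if it read the whole LUN front to back."""
    return [lun.get(addr, ZERO) for addr in range(total_blocks)]


if __name__ == "__main__":
    total = 1000  # LUN size in blocks, mostly unallocated
    src = {0: b"A" * 8, 5: b"B" * 8}
    tgt_full = {5: b"X" * 8, 7: b"Y" * 8}
    tgt_thin = dict(tgt_full)

    print("full sweep (reads, copies):", full_sweep_init(src, tgt_full, total))
    print("thin aware (reads, copies):", thin_aware_init(src, tgt_thin))
    print("same result:", logical_view(tgt_full, total) == logical_view(tgt_thin, total))
```

Both functions leave the target logically identical to the source. The only difference is how many reads it takes to get there, which is why turning the thin-aware path off costs initialization time but not correctness.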

Support advised us to disable the Thin LUN Extender and test. We followed ETA emc316719 and rebuilt the 4TB consistency group. Everything worked perfectly, with no errors or warnings generated, and the consistency group went active. We then tested the larger 24TB group, and again everything worked perfectly with no errors or warnings. We built consistency groups for the remaining LUNs and started synchronization. Once synchronization finished, both groups went active. The Thin LUN Extender was clearly the issue. The ETA above lists the problem as fixed in RecoverPoint 3.5.2 and 4.0, but on our 4.0 SP2 install it obviously was not.