During a recent RecoverPoint 4.0 implementation we ran into an odd issue. The
environment was dual VNX 5600s with RecoverPoint SE (4.0 SP2 m29, physical
RPAs) replicating 4 consistency groups: two small 4TB groups and 2 larger
groups (24TB and 30TB).
Setup went without issue and all best practices were followed. Once setup was
complete and the first consistency group was configured (single 4TB) we noticed
several errors and warnings. During initial synchronization it is not uncommon
to see a few warnings, but they normally clear once the group syncs and goes
active. The warnings we were receiving kept clearing and coming back. We also
noticed SP B had very high utilization, above 60% and peaking at 90%. We
immediately disabled the group and began investigating.
We double-checked the fabric, the RPAs, journals, trespassed LUNs, LUN/SP
placement, the VNX
itself, and the storage pools for the source and destination LUNs. Everything
checked out normal and was consistent with best-practice setup. We decided to
destroy the consistency group and set it up again. Sure enough, the same issues
persisted. We tried the other 4TB LUN and the larger 24TB and 30TB LUNs, and
the same thing occurred. The warnings and SP utilization were even worse on the
larger LUNs. We felt it could be an SP issue, so we decided to reboot SP B.
Things seemed to improve, with the 4TB LUN finishing sync and going active. The
warnings were still present but SP utilization was normal. We went ahead and
set up the 24TB LUN. After a few hours of replication the problems returned.
SP B utilization climbed so high that the SP actually went offline and needed
to be manually rebooted again.
Not good! Time to open a support case with EMC.
After EMC reviewed the SP collects, they were confident they knew what the
issue was.
It turns out RecoverPoint has a special way of reading thin LUNs from a VNX,
the THIN LUN Extender. This method was implemented in RP 3.2 for CX/VNX storage
and splitters and has been used in all RP versions since. When it is disabled,
RecoverPoint uses the traditional method to read thin LUNs. Disabling
this feature does NOT stop RecoverPoint from replicating thin LUNs.
RecoverPoint only reads from LUNs (thin or thick) during initialization, so
disabling the thin extent reader (the THIN LUN Extender) has little effect on
the environment and no effect during active replication. In the traditional
method all blocks
are read on the production LUN pool and on the remote LUN pool. Blocks are
copied only when they are hash checked and found to be inconsistent. The THIN
LUN Extender issues a special SCSI command, "read buffer", on both the local
and remote site LUNs. The command determines which blocks in the LUN pools are
actually allocated, so RecoverPoint can avoid reading block addresses that are
not allocated. This reduces initialization duration.
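
To make the difference between the two read methods concrete, here is a small
toy model. It is only a sketch under stated assumptions, not RecoverPoint's
actual implementation: the LUNs are plain Python lists, the allocation maps
(src_alloc, tgt_alloc) are hypothetical stand-ins for whatever the array
reports back, and all function names are made up for illustration. It assumes
unallocated thin blocks read back as zeros.

```python
import hashlib

BLOCK_SIZE = 4                # bytes per block; tiny, for illustration only
ZERO = b"\x00" * BLOCK_SIZE   # assumption: unallocated thin blocks read as zeros


def digest(block: bytes) -> bytes:
    """Hash one block so the two sides can be compared without shipping data."""
    return hashlib.sha256(block).digest()


def full_sweep(source, target):
    """Traditional method: read and hash every block address on both sides."""
    reads, copied = 0, []
    for i in range(len(source)):
        reads += 2  # one read on the production side, one on the remote side
        if digest(source[i]) != digest(target[i]):
            target[i] = source[i]  # copy only blocks found inconsistent
            copied.append(i)
    return reads, copied


def allocation_aware_sweep(source, target, src_alloc, tgt_alloc):
    """Thin-aware method: only visit addresses allocated on either side."""
    reads, copied = 0, []
    for i in sorted(src_alloc | tgt_alloc):
        reads += 2
        if digest(source[i]) != digest(target[i]):
            target[i] = source[i]
            copied.append(i)
    return reads, copied


# Hypothetical 8-block thin LUNs: production has data in blocks 1 and 5,
# the remote copy only has block 5 so far.
source = [ZERO] * 8
source[1], source[5] = b"AAAA", b"BBBB"
src_alloc = {1, 5}

remote = [ZERO] * 8
remote[5] = b"BBBB"
tgt_alloc = {5}

full_target = list(remote)
thin_target = list(remote)

full_reads, full_copied = full_sweep(source, full_target)
thin_reads, thin_copied = allocation_aware_sweep(
    source, thin_target, src_alloc, tgt_alloc
)

print(full_reads, full_copied)  # 16 [1]
print(thin_reads, thin_copied)  # 4 [1]
```

Both sweeps copy the same single block and end with identical targets; the
thin-aware sweep simply gets there with 4 block reads instead of 16. On a
mostly empty multi-terabyte thin LUN that gap is what shortens initialization.
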
Support advised us to disable the THIN LUN Extender and test. We went ahead and
followed ETA emc316719 and rebuilt the 4TB consistency group. Everything worked
perfectly, with no errors or warnings generated, and the consistency group went
into active mode. We decided to test the larger 24TB group and again everything
worked perfectly with no errors or warnings generated. Consistency groups were
built for the remaining LUNs and set to begin synchronization. Once finished,
both groups went active. The THIN LUN Extender was clearly the issue. The ETA
above lists the problem as being solved in RecoverPoint 3.5.2 and 4.0, but on
our 4.0 SP2 system it obviously is not.