Backup Repository HA using Windows Storage Replica

Introduction

This page explains how to use Windows Storage Replica together with Veeam Backup & Replication to build a highly available backup repository based on Windows Server 2016 or Windows Server 2019 and the ReFS file system.

Windows Storage Replica can transparently replicate an entire volume to a secondary server located at a DR site, providing customers with geographical redundancy. The replication process is ReFS aware: synthetic full backups do not need to be replicated in their entirety, because only the changes are transferred to the replica destination.

If the source volume fails, the replication direction can be reversed, allowing Veeam Backup & Replication to continue the existing backup chains on the target (which has become the new source). Once the original source has been brought back online, the failback process is the same as the failover.

Like every storage-based replication technology, Windows Storage Replica does not protect the volume content from logical corruption. If corruption occurs on the source, it is replicated to the destination as well. To cover this case, Veeam Backup & Replication has its native tool, the Backup Copy Job, which can be used to copy restore points across different backup repositories and meet the 3-2-1 rule.

Architecture

[Figure: Storage Replica architecture]

Requirements

  • Windows Server 2016 or Windows Server 2019, Datacenter edition

  • Active Directory Domain Services forest

  • Storage Spaces with SAS JBODs, Storage Spaces Direct, Fibre Channel SAN, shared VHDX, iSCSI Target, or local SAS/SCSI/SATA storage. SSD or faster is recommended for replication log drives. Microsoft recommends that the log storage be faster than the data storage. Log volumes must never be used for other workloads

  • At least one Ethernet/TCP connection on each server for synchronous replication, but preferably RDMA

  • At least 2 GB of RAM and two cores per server

  • A network between the servers with enough bandwidth to carry your IO write workload and, for synchronous replication, an average round-trip latency of 5 ms or lower. Asynchronous replication does not have a latency recommendation

  • Source and target volumes must have the same size

  • Log volumes must have the same size

  • Log volumes must be initialized as GPT, not MBR
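
The symmetric requirements above (matching data volume sizes, matching log volume sizes, GPT log volumes, minimum RAM and cores) can be captured in a small planning check. The following is only a sketch: the `ReplicaNode` class and `check_pair` function are hypothetical helpers introduced for illustration, and the values must be collected by hand or by your own inventory tooling. It does not query Windows or Storage Replica in any way.

```python
from dataclasses import dataclass
from typing import List

GIB = 1024 ** 3


@dataclass
class ReplicaNode:
    """Hypothetical description of one server in the partnership."""
    name: str
    ram_bytes: int
    cores: int
    data_volume_bytes: int
    log_volume_bytes: int
    log_partition_style: str  # "GPT" or "MBR"


def check_pair(source: ReplicaNode, target: ReplicaNode) -> List[str]:
    """Return the list of violated requirements (empty if none were found)."""
    problems = []
    for node in (source, target):
        if node.ram_bytes < 2 * GIB:
            problems.append(f"{node.name}: less than 2 GB of RAM")
        if node.cores < 2:
            problems.append(f"{node.name}: fewer than two cores")
        if node.log_partition_style.upper() != "GPT":
            problems.append(f"{node.name}: log volume is not GPT")
    if source.data_volume_bytes != target.data_volume_bytes:
        problems.append("data volumes differ in size")
    if source.log_volume_bytes != target.log_volume_bytes:
        problems.append("log volumes differ in size")
    return problems


# Example with made-up values: the target has a smaller, MBR log volume.
src = ReplicaNode("src", 8 * GIB, 4, 10 * 1024 * GIB, 200 * GIB, "GPT")
dst = ReplicaNode("dst", 8 * GIB, 4, 10 * 1024 * GIB, 100 * GIB, "MBR")
for problem in check_pair(src, dst):
    print(problem)
```

A check like this is only a pre-flight aid for planning; the list above remains the reference for what must be in place.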

Firewall

The replication process requires ICMP and TCP ports 445 (SMB) and 5445 (SMB Direct) to be open between the servers.
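
A quick way to verify the TCP part of this rule from the other server is a plain connection test. The sketch below uses only the Python standard library; the function names are hypothetical, and it deliberately does not test ICMP, which needs raw sockets and elevated privileges. Note that a closed port 5445 may simply mean nothing is listening there yet, not necessarily that a firewall is blocking it.

```python
import socket

# TCP ports named in the firewall section above.
SR_TCP_PORTS = (445, 5445)


def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def check_replica_ports(host: str, ports=SR_TCP_PORTS, timeout: float = 2.0) -> dict:
    """Map each port to True (reachable) or False (not reachable)."""
    return {port: tcp_port_open(host, port, timeout) for port in ports}
```

Calling `check_replica_ports("dr-repo01.example.local")` from the source server, where the host name is a placeholder for your own DR server, returns a dictionary keyed by port number. Run it in both directions, since replication can be reversed after a failover.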

Best Practices

When designing this kind of architecture there are some concepts to keep in mind:

  • ReFS best practices are the same as for a standard repository
  • Prefer thin-provisioned volumes (S2D preferred), as they can dramatically reduce the initial sync time. Without thin provisioning, a block is a block from a storage-based replication perspective: whether or not it is empty, it must be replicated.
  • Use per-VM backup chains, as they can help to achieve a better RPO. More files to copy means finer granularity: if a failure occurs during replication, having multiple smaller files to work with increases the chance that each one is copied in a timely fashion.

According to Microsoft's recommendations, there should not be any performance drop when using thin volumes compared to thick ones.
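
To get a feel for why thin provisioning matters for the initial sync, here is a back-of-the-envelope estimate. It is only a sketch under stated assumptions: the initial sync time is taken to scale with the amount of data that has to cross the link, a thick volume is assumed to transfer its full size while a thin volume transfers only its allocated blocks, and the `efficiency` factor (the usable share of the nominal link speed) is a made-up parameter. The function name and the sample figures are illustrative, not measurements.

```python
def initial_sync_hours(bytes_to_copy: int, link_mbit_per_s: float,
                       efficiency: float = 0.8) -> float:
    """Estimate the transfer time in hours for a given amount of data."""
    # Convert the nominal link speed to usable bytes per second.
    usable_bytes_per_s = link_mbit_per_s * 1_000_000 / 8 * efficiency
    return bytes_to_copy / usable_bytes_per_s / 3600


TB = 10 ** 12

# Assumed scenario: a 10 TB volume holding 2 TB of data, over a 1 Gbit/s link.
thick = initial_sync_hours(10 * TB, 1000)  # every block is transferred
thin = initial_sync_hours(2 * TB, 1000)    # only allocated blocks (assumption)

print(f"thick: {thick:.1f} h")  # thick: 27.8 h
print(f"thin:  {thin:.1f} h")   # thin:  5.6 h
```

The real numbers depend on the storage, the network and how the volume is populated, so treat the result as an order-of-magnitude guide only.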

While the replication partnership is being created, the latency between source and destination is tested. If it is below 5 ms, replication is automatically set to synchronous mode. I suggest manually switching it to asynchronous, because that mode is more appropriate for handling backup files. Also, the DR site might be geographically distant from the main site, in which case a latency below 5 ms might not be achievable.

Windows Storage Replica requires a log volume on both servers. The optimal size of the log varies widely depending on the environment and how much write IO is generated during the backup session.

  • A larger or smaller log does not make the repository any faster or slower
  • A larger or smaller log does not have any bearing on a 10GB data volume versus a 10TB data volume, for instance

A larger log simply collects and retains more write IOs before they are wrapped out. This allows an interruption in service between the source and destination computer, such as a network outage or the destination being offline, to last longer. Covering a long interruption is hardly practical when using this architecture to replicate backup files, as it would require a log volume nearly as large as the incremental size of an entire backup session. That said, 100 to 200 GB should fit most environments.
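
The relationship between log size and tolerated interruption can be illustrated with simple arithmetic. The sketch below assumes a constant write rate and ignores any log overhead, so it is a simplification rather than how Storage Replica accounts for log space internally. The function names and the sample write rate are hypothetical.

```python
def outage_tolerance_minutes(log_bytes: int, write_bytes_per_s: float) -> float:
    """Minutes of sustained writes the log can hold before it wraps."""
    return log_bytes / write_bytes_per_s / 60


def log_bytes_for_outage(outage_minutes: float, write_bytes_per_s: float) -> int:
    """Log size needed to buffer writes for the given interruption."""
    return int(outage_minutes * 60 * write_bytes_per_s)


GB = 10 ** 9
MB = 10 ** 6

# Assumed workload: a backup session writing 500 MB/s to the repository.
minutes = outage_tolerance_minutes(200 * GB, 500 * MB)
print(f"200 GB log buffers about {minutes:.1f} minutes")  # about 6.7 minutes

# Log size needed to ride out a one-hour interruption at the same rate.
needed = log_bytes_for_outage(60, 500 * MB)
print(f"one hour needs about {needed / GB:.0f} GB of log")  # about 1800 GB
```

Under these assumptions, buffering even a one-hour interruption would take a log in the terabyte range, which is why sizing the log to bridge long outages is not a realistic goal for backup workloads.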

Storage Replica relies on the log to speed up writes on data disks, so log performance is critical to replication performance. You must ensure that the log volume performs better than the data volume, as the log will serialize and write all IO sequentially.


References