Windows Server 2016 and SQL Server 2016 – A match made in heaven!

For all these years Windows Server 2016 is the Windows version I was waiting for, and this blog post will explain why I’m super excited about this!

With Windows Server 2016, you can now create a Windows Cluster with nodes in multiple domains. Or even without a domain (Workgroup).

With Windows Server 2016 and SQL Server 2016, you can now create an Always On Availability Group configuration with nodes in multiple domain, and this is exactly why I’m super excited as it gives more flexibility when architecting complex/robust systems.

Think about the below scenario (Simplified):


In the above picture, there are 3 separate domains, and before Windows Server 2016 and SQL Server 2016, there was no way to create an Always On Availability Group configuration with these 3 nodes participating.

With Windows Server 2016/SQL Server 2016 it’s possible now to add nodes from different domains to a cluster and configure an AG on top of it:



There are some other nice benefits of Windows Server 2016, and they are:

Seamless Upgrades:

Upgrading to Win Server 2016 is going to be easy and less painful going forward. If you have an existing Win 2012 R2 cluster, then you can opt to do a rolling upgrade. Assume that it’s a 3 node cluster, then you can evict a node and install Win 2016 on it and add it back to the cluster.

If you don’t want to go through the hassle of evicting the node, upgrading and adding it back, then you can also opt for an in-place upgrade of an existing node. Cluster will work in mixed mode.

Cluster Site Awareness:

We can now set policies to allow a SQL instance to failover to a node within the same site before failing over to a node in a different site. More control and more flexibility.

Quarantine nodes:

Let’s assume that in a cluster you have a node which is in a flapping state (not stable and it’s joining in and out frequently). With Win 2016, this node will be Quarantined. Flapping nodes can impact the whole cluster negatively and quarantine will avoid this. The quarantined nodes are not allowed to join the cluster for 2 hours.


Win 2016/SQL 2016 supports 24 TB of memory. Yes! Let’s cache all those databases out there! (No more IO operations).

Win 2016/SQL 2016 supports 640 cores!  <Cough> license per core $ amount! </Cough>

Storage Spaces Direct:

Win 2016 support storage spaces direct and we can now run SQL FCI and AG on storage spaces direct. Storage Spaces Direct is nothing but DAS storage replicated across all nodes.


If I’m running SQL 2016, then I would definitely want to run it on Win 2016. The features and capabilities are too good to ignore.

Thanks for reading, and please watch this space for more!

AlwaysOn Availability Group Database Level Health Detection – SQL Server 2016

SQL Server 2016 AlwaysOn AG got many improvements, and I’m pretty impressed with the log throughput improvements and redo improvements based on some of my test cases(A detailed blog post on that will follow soon). In this blog post we will look at another key improvement, ie Database Level Health Detection.

SQL Server 2016 enhances the AlwaysOn health diagnostics with database health detection. If AlwaysOn availability group database health detection has been selected for your availability group, and an availability database transitions out of the ONLINE state (sys.databases.state_desc), then entire availability group will failover automatically.

Let’s look at a demo to demonstrate the new behaviour –

I have enabled Database Level Health Detection for my AG as shown below –

AG20161Now, in theory if one of the disks which is hosting my database tpcc is gone(both data/log file), then it should trigger a failover.

Currently node 001 is the primary replica. Let’s see, if theory matches with reality.

I went ahead and inserted data continuously to one of the tables in the database, and took the disk(which hosts both data/log files) offline via Disk Management –


That caused the database to go into Recovery Pending State –


Note – This controlled failure simulation was not a straightforward process. Initially the database was not going to a degraded state. After my interaction with some of the top SQL Server Experts out there in our community, I was able to force the DB go into a degraded state.

Ref this twitter thread for more details –

This immediately caused a failover, and now 002 is the primary replica.


Conclusion – 

This proves that Database level health detection works seamlessly and a failover is triggered when the state of a database changes from Online to other degraded states.

Thanks for reading, and keep watching this space for more!

Encrypting AlwaysOn Availability Group Transport- Some thoughts!

Transport security for AlwaysOn Availability Group is an interesting topic, and any day it’s a topic which is debatable.

The transport security involves encryption of packets exchanged between the servers involved, and for AlwaysOn AG the encryption is made possible on the Hadr_endpoint.

By default, its encrypted using AES standard. If you are interested to know more about AES, then please ref here.

Here is the code to create the Hadr_endpoint:

USE [master]

/****** Object: Endpoint [Hadr_endpoint] Script Date: 1/27/2016 3:25:01 PM ******/
CREATE ENDPOINT [Hadr_endpoint] 

This means that by default the end point will ensure that data is encrypted when its sent over. The end point can only connect to end points which uses encryption.

You also have an option to turn off encryption. Yes, you heard that correct. You have an option to disable encryption by using the ALTER END POINT command.

If you closely observe the ALTER command for end point, there are 3 options:


Disabling the end point is where I have some thoughts –

Think about this scenario:

What if you have a very well secured network infrastructure, and there is no way something bad like a sniffing can happen. In that case is there a need for turning on encryption?

Yes, as stated earlier this is a debatable topic and the best answer to this question will be “It depends”.

The best answer for this question is to understand the overhead of AES encryption (AES encryption is usually combined with padding) to the data transfer, and to analyze if there are any potential benefits by disabling it.

There might be some edge cases where in disabling encryption in a controlled fashion yielded benefits.

This definitely is food for thought, and I’m really interested to know your thoughts around this.

Thanks for reading, and keep watching this space for more!

Exploring AlwaysOn Availability Group Template in Azure

This is one of those posts which is due for a long time. I was really excited when Microsoft announced the release of an AlwaysOn AG template for the Azure Portal back in August 2014.

I had decided to explore this AG template soon after the launch, and this post will explain how easy it is to deploy a 2 node AlwaysOn AG deployment in Azure. Trust me, it’s just a matter of few clicks.

Sit back and relax,we are now going to deploy AG in Azure.Here are the steps :

After you login to the Azure portal, you can access the Market Place


You will be surprised to see the number of options which are available in market place now.


Search for AlwaysOn ( I really hope MSFT will correct this, it should be rather be AlwaysOn AG) in the search box.



Choose SQL Server 2014 AlwaysOn and proceed by clicking Create.



Enter details for the required fields and also select the details needed for SQL Server settings. For the purpose of this post, I will be choosing default settings for the domain controller,virtual network,storage etc.


Note : Ignore the warning messages in the picture above.Warning for Resource Group is shown as I already have a RG with the same name.

That’s it, and you click create.

It takes around 40 – 45 minutes for the whole process to complete and once it was completed, I had the below set of virtual machines.



2 machines for the domain controllers, 1 for the file share witness and 2 for the AG nodes.

Here is a view of the Resource Group and its resources :


If you login to one of the SQL nodes and open up SSMS, you can see that AG is fully configured and its ready for use.


Here is the dashboard view and the quorum info :



Conclusion :

Templates are interesting, and this is an excellent way of deploying AGs in large-scale. I’m pretty sure that MSFT will be working more to optimize this, and will add more and features to it.Looking forward to it.

Thanks for reading, and keep watching this space for more.

This one enhancement makes AlwaysOn Availability Group better in SQL Server 2014 !

I keep getting questions on what is improved other than the increased number of secondaries(8 now) with AlwaysOn Availability Group(AG) in SQL 2014. In this blog post  I would like to talk about one major enhancement around the increased readable secondaries availability in SQL Server 2014.

In SQL Server 2012 there is a good chance that your read workloads can be killed during network failures. Typically this happens in a multi subnet environment where you have AG nodes in 2 or more data centers.If network failure results a communication failure between primary and secondary replicas, then the secondaries will go into resolving state.In resolving state, the read workloads are basically killed. (Yes, yet another reason why you need to test your AG deployments)

With SQL Server 2014, there is good news. Read workloads are not impacted during network failures, primary replica down or cluster quorum issues.

Yes, it stays up all the time, and will be serving your read workloads. The replica will be in resolving state, however the databases are still available for your read connections/workloads.

Keep in mind : Read-only routing is not yet supported for this, and you cannot use listener for your reporting workloads.Brent Ozar calls out this in his blog post here.

Lets now walk though this enhancement :

I will be using a 2 node AG setup and the data transfer mode(Availability Mode) is asynchronous.

If I query the DMV sys.dm_hadr_availability_replica_states I can see the replica states.

SELECT role_desc,connected_state_desc  FROM sys.dm_hadr_availability_replica_states

role_desc               connected_state_desc

Lets double-check if the data transfer is set as asynchronous

SELECT  is_primary_replica, synchronization_state_desc FROM sys.dm_hadr_database_replica_states

is_primary_replica       synchronization_state_desc
0                                  SYNCHRONIZING
1                                  SYNCHRONIZED

Everything looks good, and now lets take down the primary replica.

Note : Dont try this at home ! (I mean production)

Okay, As this is a test instance I really don’t care and what I did was shutdown the SQL Services to take the primary down.

If we query the DMV again, we can see

SELECT role_desc,connected_state_desc  FROM sys.dm_hadr_availability_replica_states

role_desc                connected_state_desc

And note, there is no entry for primary as we I have taken it down.

Even though the secondary replica is in resolving state, read connections just works.

SELECT COUNT (*) AS [TotalRowCount] FROM [dbo].[InsertCount]


The secondary database is in NOT SYNCHRONIZING, NOT_HEALTHY state, however its ONLINE.

SELECT synchronization_state_desc,synchronization_health_desc,database_state_desc FROM sys.dm_hadr_database_replica_states

synchronization_state_desc         synchronization_health_desc       database_state_desc
NOT SYNCHRONIZING               NOT_HEALTHY                           ONLINE

Here is a view from Management Studio :

AG Resolving


Yes, this is a very good enhancement which will make sure your readable secondaries are highly available and I’m pretty sure that this is one of those enhancement which will force you to deploy AGs in SQL 2014 rather than opting SQL 2012.

Thanks for reading and keep watching this space for more !

SQLServer Log Shipping Work File ! What is this all about ?

A short and sweet post for a gorgeous morning –

Have you seen a file with extension .wrk in the log shipping copy folder ?

Logshipping work file

What is this all about, Curious? Then here is the answer –

When the transaction log backup files are copied from the primary server to the secondary server the file is named as a work file(.wrk) till the copy operation is completed.

This will ensure that the restore job in the secondary will not pick the file and try a restore.

Once the file is completely copied, then it will be named as .trn, the one which is very familiar to us.

Logshipping work file2

Normally when the log backups are very small you wont really observe the .wrk file. In a flash they are converted to a .trn file. When the log backups are relatively bigger, then you can see .wrk file available in the copy folder for a longer period of time.

Thanks for reading and keep watching this space for more.

Tale of detaching a database which was enabled for replication!

In development environments I prefer to use ‘detach’ database option to move around databases. It’s simple, fast and easy.

Almost all the time when I detached a database earlier the feeling was like something which I mentioned before “Simple, fast and easy”, however today it was more sort of a fun experience. Here is that tale for you all.

Today I had to ‘detach’ a database from one of my development environments and straight away got this error !



[Pic courtesy : ]

The database was setup for replication, hence it cannot be dropped.

Apparently, my next step was to remove the publication which was setup for this database and the tale begins from this point.

Note – This was a very controlled dev environment and action like detaching, removing replication was not going to harm anything. In case you are going to repro this demo, then I would recommend you to do this on your lab environment.

I selected the required publication and went ahead with delete operation. The wizard was double checking with me if I was sure. I said Yes to it !

Bang! here comes the next error

The error was pointing that there are some connection issues to the Distributor or to the Subscriber. 

The distributor and the subscriber for this publication were already retired servers and not online and removing publication requires a connection to be made to the distributor/subscriber servers too. As those servers were not reachable, the wizard couldn’t proceed.

There is no way you could delete this publication directly using SSMS and the best way to do this is via system stored procedures.

I had to use 3 system stored procedures to cleanup this publication and those are –




The scripts used for the clean up processes are mentioned below –

--Cleans up the subscriptions for a publisher

USE <Your Database>
EXEC sp_dropsubscription @publication = N'Your_Publication', 
@article = N'all', @subscriber = N'all', @destination_db = N'all', @ignore_distributor = 1

--Cleans up the publication

USE <Your Database>
EXEC sp_droppublication @publication = N'Your_Publication', @ignore_distributor = 1
--Set Replication database option to false

USE Master
EXEC sp_replicationdboption @dbname = N'Your_Publication', @optname = N'publish', @value = N'false'
,@ignore_distributor = 1

If you note the scripts you can find that they all have an argument @ignore_distributor passed as 1.All the three scripts were failing without this as the connection to distributor was not available as it was offline.

I was curious to check what exactly @ignore_distributor does, however the below BOL entry didnt gave me any clue –

@ignore_distributor = ] ignore_distributor

Identified for informational purposes only. Not supported. Future compatibility is not guaranteed.

It’s very much clear that removing a publication will check connections to the Distributor and in case it’s not available for connection, there will be connection errors.

After running the above mentioned stored procedures I could proceed and detach the database.

Thanks for reading.