Windows Server 2016 and SQL Server 2016 – A match made in heaven!


For all these years Windows Server 2016 is the Windows version I was waiting for, and this blog post will explain why I’m super excited about this!

With Windows Server 2016, you can now create a Windows Cluster with nodes in multiple domains. Or even without a domain (Workgroup).

With Windows Server 2016 and SQL Server 2016, you can now create an Always On Availability Group configuration with nodes in multiple domain, and this is exactly why I’m super excited as it gives more flexibility when architecting complex/robust systems.

Think about the below scenario (Simplified):

win1

In the above picture, there are 3 separate domains, and before Windows Server 2016 and SQL Server 2016, there was no way to create an Always On Availability Group configuration with these 3 nodes participating.

With Windows Server 2016/SQL Server 2016 it’s possible now to add nodes from different domains to a cluster and configure an AG on top of it:

win2

 

There are some other nice benefits of Windows Server 2016, and they are:

Seamless Upgrades:

Upgrading to Win Server 2016 is going to be easy and less painful going forward. If you have an existing Win 2012 R2 cluster, then you can opt to do a rolling upgrade. Assume that it’s a 3 node cluster, then you can evict a node and install Win 2016 on it and add it back to the cluster.

If you don’t want to go through the hassle of evicting the node, upgrading and adding it back, then you can also opt for an in-place upgrade of an existing node. Cluster will work in mixed mode.

Cluster Site Awareness:

We can now set policies to allow a SQL instance to failover to a node within the same site before failing over to a node in a different site. More control and more flexibility.

Quarantine nodes:

Let’s assume that in a cluster you have a node which is in a flapping state (not stable and it’s joining in and out frequently). With Win 2016, this node will be Quarantined. Flapping nodes can impact the whole cluster negatively and quarantine will avoid this. The quarantined nodes are not allowed to join the cluster for 2 hours.

Scaling:

Win 2016/SQL 2016 supports 24 TB of memory. Yes! Let’s cache all those databases out there! (No more IO operations).

Win 2016/SQL 2016 supports 640 cores!  <Cough> license per core $ amount! </Cough>

Storage Spaces Direct:

Win 2016 support storage spaces direct and we can now run SQL FCI and AG on storage spaces direct. Storage Spaces Direct is nothing but DAS storage replicated across all nodes.

Conclusion:

If I’m running SQL 2016, then I would definitely want to run it on Win 2016. The features and capabilities are too good to ignore.

Thanks for reading, and please watch this space for more!

AlwaysOn Availability Group Database Level Health Detection – SQL Server 2016


SQL Server 2016 AlwaysOn AG got many improvements, and I’m pretty impressed with the log throughput improvements and redo improvements based on some of my test cases(A detailed blog post on that will follow soon). In this blog post we will look at another key improvement, ie Database Level Health Detection.

SQL Server 2016 enhances the AlwaysOn health diagnostics with database health detection. If AlwaysOn availability group database health detection has been selected for your availability group, and an availability database transitions out of the ONLINE state (sys.databases.state_desc), then entire availability group will failover automatically.

Let’s look at a demo to demonstrate the new behaviour –

I have enabled Database Level Health Detection for my AG as shown below –

AG20161Now, in theory if one of the disks which is hosting my database tpcc is gone(both data/log file), then it should trigger a failover.

Currently node 001 is the primary replica. Let’s see, if theory matches with reality.

I went ahead and inserted data continuously to one of the tables in the database, and took the disk(which hosts both data/log files) offline via Disk Management –

AG20162

That caused the database to go into Recovery Pending State –

AG20163

Note – This controlled failure simulation was not a straightforward process. Initially the database was not going to a degraded state. After my interaction with some of the top SQL Server Experts out there in our community, I was able to force the DB go into a degraded state.

Ref this twitter thread for more details – https://twitter.com/AnupWarrier/status/748364014682578949

This immediately caused a failover, and now 002 is the primary replica.

AG20164

Conclusion – 

This proves that Database level health detection works seamlessly and a failover is triggered when the state of a database changes from Online to other degraded states.

Thanks for reading, and keep watching this space for more!

Encrypting AlwaysOn Availability Group Transport- Some thoughts!


Transport security for AlwaysOn Availability Group is an interesting topic, and any day it’s a topic which is debatable.

The transport security involves encryption of packets exchanged between the servers involved, and for AlwaysOn AG the encryption is made possible on the Hadr_endpoint.

By default, its encrypted using AES standard. If you are interested to know more about AES, then please ref here.

Here is the code to create the Hadr_endpoint:

USE [master]
GO

/****** Object: Endpoint [Hadr_endpoint] Script Date: 1/27/2016 3:25:01 PM ******/
CREATE ENDPOINT [Hadr_endpoint] 
 STATE=STARTED
 AS TCP (LISTENER_PORT = 5022, LISTENER_IP = ALL)
 FOR DATA_MIRRORING (ROLE = ALL, AUTHENTICATION = WINDOWS NEGOTIATE
, ENCRYPTION = REQUIRED ALGORITHM AES)
GO

This means that by default the end point will ensure that data is encrypted when its sent over. The end point can only connect to end points which uses encryption.

You also have an option to turn off encryption. Yes, you heard that correct. You have an option to disable encryption by using the ALTER END POINT command.

If you closely observe the ALTER command for end point, there are 3 options:

[ , ENCRYPTION = { DISABLED | {{SUPPORTED | REQUIRED } [ ALGORITHM { RC4 | AES | AES RC4 | RC4 AES } ] } ]

Disabling the end point is where I have some thoughts –

Think about this scenario:

What if you have a very well secured network infrastructure, and there is no way something bad like a sniffing can happen. In that case is there a need for turning on encryption?

Yes, as stated earlier this is a debatable topic and the best answer to this question will be “It depends”.

The best answer for this question is to understand the overhead of AES encryption (AES encryption is usually combined with padding) to the data transfer, and to analyze if there are any potential benefits by disabling it.

There might be some edge cases where in disabling encryption in a controlled fashion yielded benefits.

This definitely is food for thought, and I’m really interested to know your thoughts around this.

Thanks for reading, and keep watching this space for more!

Exploring AlwaysOn Availability Group Template in Azure


This is one of those posts which is due for a long time. I was really excited when Microsoft announced the release of an AlwaysOn AG template for the Azure Portal back in August 2014.

I had decided to explore this AG template soon after the launch, and this post will explain how easy it is to deploy a 2 node AlwaysOn AG deployment in Azure. Trust me, it’s just a matter of few clicks.

Sit back and relax,we are now going to deploy AG in Azure.Here are the steps :

After you login to the Azure portal, you can access the Market Place

AGTemplate1

You will be surprised to see the number of options which are available in market place now.

AGTemplate2

Search for AlwaysOn ( I really hope MSFT will correct this, it should be rather be AlwaysOn AG) in the search box.

AGTemplate3

 

Choose SQL Server 2014 AlwaysOn and proceed by clicking Create.

AGTemplate4

 

Enter details for the required fields and also select the details needed for SQL Server settings. For the purpose of this post, I will be choosing default settings for the domain controller,virtual network,storage etc.

AG6

Note : Ignore the warning messages in the picture above.Warning for Resource Group is shown as I already have a RG with the same name.

That’s it, and you click create.

It takes around 40 – 45 minutes for the whole process to complete and once it was completed, I had the below set of virtual machines.

AG5

 

2 machines for the domain controllers, 1 for the file share witness and 2 for the AG nodes.

Here is a view of the Resource Group and its resources :

AG4

If you login to one of the SQL nodes and open up SSMS, you can see that AG is fully configured and its ready for use.

AG1

Here is the dashboard view and the quorum info :

AG2

AGQ

Conclusion :

Templates are interesting, and this is an excellent way of deploying AGs in large-scale. I’m pretty sure that MSFT will be working more to optimize this, and will add more and features to it.Looking forward to it.

Thanks for reading, and keep watching this space for more.

This one enhancement makes AlwaysOn Availability Group better in SQL Server 2014 !


I keep getting questions on what is improved other than the increased number of secondaries(8 now) with AlwaysOn Availability Group(AG) in SQL 2014. In this blog post  I would like to talk about one major enhancement around the increased readable secondaries availability in SQL Server 2014.

In SQL Server 2012 there is a good chance that your read workloads can be killed during network failures. Typically this happens in a multi subnet environment where you have AG nodes in 2 or more data centers.If network failure results a communication failure between primary and secondary replicas, then the secondaries will go into resolving state.In resolving state, the read workloads are basically killed. (Yes, yet another reason why you need to test your AG deployments)

With SQL Server 2014, there is good news. Read workloads are not impacted during network failures, primary replica down or cluster quorum issues.

Yes, it stays up all the time, and will be serving your read workloads. The replica will be in resolving state, however the databases are still available for your read connections/workloads.

Keep in mind : Read-only routing is not yet supported for this, and you cannot use listener for your reporting workloads.Brent Ozar calls out this in his blog post here.

Lets now walk though this enhancement :

I will be using a 2 node AG setup and the data transfer mode(Availability Mode) is asynchronous.

If I query the DMV sys.dm_hadr_availability_replica_states I can see the replica states.

SELECT role_desc,connected_state_desc  FROM sys.dm_hadr_availability_replica_states

role_desc               connected_state_desc
PRIMARY              CONNECTED
SECONDARY        CONNECTED

Lets double-check if the data transfer is set as asynchronous

SELECT  is_primary_replica, synchronization_state_desc FROM sys.dm_hadr_database_replica_states

is_primary_replica       synchronization_state_desc
0                                  SYNCHRONIZING
1                                  SYNCHRONIZED

Everything looks good, and now lets take down the primary replica.

Note : Dont try this at home ! (I mean production)

Okay, As this is a test instance I really don’t care and what I did was shutdown the SQL Services to take the primary down.

If we query the DMV again, we can see

SELECT role_desc,connected_state_desc  FROM sys.dm_hadr_availability_replica_states

role_desc                connected_state_desc
RESOLVING           DISCONNECTED

And note, there is no entry for primary as we I have taken it down.

Even though the secondary replica is in resolving state, read connections just works.

USE DemoAG
GO
SELECT COUNT (*) AS [TotalRowCount] FROM [dbo].[InsertCount]

TotalRowCount
6886

The secondary database is in NOT SYNCHRONIZING, NOT_HEALTHY state, however its ONLINE.

SELECT synchronization_state_desc,synchronization_health_desc,database_state_desc FROM sys.dm_hadr_database_replica_states

synchronization_state_desc         synchronization_health_desc       database_state_desc
NOT SYNCHRONIZING               NOT_HEALTHY                           ONLINE

Here is a view from Management Studio :

AG Resolving

Conclusion:

Yes, this is a very good enhancement which will make sure your readable secondaries are highly available and I’m pretty sure that this is one of those enhancement which will force you to deploy AGs in SQL 2014 rather than opting SQL 2012.

Thanks for reading and keep watching this space for more !

SQLServer Log Shipping Work File ! What is this all about ?


A short and sweet post for a gorgeous morning –

Have you seen a file with extension .wrk in the log shipping copy folder ?

Logshipping work file

What is this all about, Curious? Then here is the answer –

When the transaction log backup files are copied from the primary server to the secondary server the file is named as a work file(.wrk) till the copy operation is completed.

This will ensure that the restore job in the secondary will not pick the file and try a restore.

Once the file is completely copied, then it will be named as .trn, the one which is very familiar to us.

Logshipping work file2

Normally when the log backups are very small you wont really observe the .wrk file. In a flash they are converted to a .trn file. When the log backups are relatively bigger, then you can see .wrk file available in the copy folder for a longer period of time.

Thanks for reading and keep watching this space for more.

Tale of detaching a database which was enabled for replication!


In development environments I prefer to use ‘detach’ database option to move around databases. It’s simple, fast and easy.

Almost all the time when I detached a database earlier the feeling was like something which I mentioned before “Simple, fast and easy”, however today it was more sort of a fun experience. Here is that tale for you all.

Today I had to ‘detach’ a database from one of my development environments and straight away got this error !

 

 

[Pic courtesy : temzlandia.blogspot.com ]

The database was setup for replication, hence it cannot be dropped.

Apparently, my next step was to remove the publication which was setup for this database and the tale begins from this point.

Note – This was a very controlled dev environment and action like detaching, removing replication was not going to harm anything. In case you are going to repro this demo, then I would recommend you to do this on your lab environment.

I selected the required publication and went ahead with delete operation. The wizard was double checking with me if I was sure. I said Yes to it !

Bang! here comes the next error

The error was pointing that there are some connection issues to the Distributor or to the Subscriber. 

The distributor and the subscriber for this publication were already retired servers and not online and removing publication requires a connection to be made to the distributor/subscriber servers too. As those servers were not reachable, the wizard couldn’t proceed.

There is no way you could delete this publication directly using SSMS and the best way to do this is via system stored procedures.

I had to use 3 system stored procedures to cleanup this publication and those are –

sp_dropsubscription

sp_droppublication

sp_replicationdboption

The scripts used for the clean up processes are mentioned below –

--Cleans up the subscriptions for a publisher

USE <Your Database>
GO
EXEC sp_dropsubscription @publication = N'Your_Publication', 
@article = N'all', @subscriber = N'all', @destination_db = N'all', @ignore_distributor = 1

--Cleans up the publication

USE <Your Database>
GO
EXEC sp_droppublication @publication = N'Your_Publication', @ignore_distributor = 1
--Set Replication database option to false

USE Master
GO
EXEC sp_replicationdboption @dbname = N'Your_Publication', @optname = N'publish', @value = N'false'
,@ignore_distributor = 1

If you note the scripts you can find that they all have an argument @ignore_distributor passed as 1.All the three scripts were failing without this as the connection to distributor was not available as it was offline.

I was curious to check what exactly @ignore_distributor does, however the below BOL entry didnt gave me any clue –

@ignore_distributor = ] ignore_distributor

Identified for informational purposes only. Not supported. Future compatibility is not guaranteed.

It’s very much clear that removing a publication will check connections to the Distributor and in case it’s not available for connection, there will be connection errors.

After running the above mentioned stored procedures I could proceed and detach the database.

Thanks for reading.

Max Text Replication Size – When you might have to care about this number !


One of the advanced Server level options for SQL Server is Max Text Replication Size and this is really not a common one. In most of cases the default value is not changed.

This configuration specifies the maximum size of data that can be added to a replicated column in a single INSERT, UPDATE, WRITETEXT or UPDATETEXT statement. This applies to data type text, ntext, varchar (max), nvarchar (max), varbinary (max), xml, and image.

The default value for this configuration is 65536 bytes = 0.0625 MB.

Why we care about this value? Here is a reason why we might have to carefully change this value according to our requirements.

Recently I had setup a database for Citrix XenApp 6.5 farm data store requirements and configured transactional replication as part of the DR requirement.

After configuring replication the Citrix team where able to read data out of the database, however nothing was getting published. They were consistently getting errors like “Unknown error occurred: error code 0x82060035”

Things where working great before and this error started popping up after replication was setup. The only change made was setting replication for this database.

Upon checking the database closely I found out one table called dbo.KEYTABLE which had a column called data which was of type Varbinary (max).

As this column was part of transactional replication the value of Max Text Replication Size came into picture and anything above 65536 bytes on a single Insert was not allowed and Citrix faced issues while publishing new apps.

Carefully choosing the best possible value for this configuration setting fixed the problem.

It has a max value of 2147483647 bytes which is 2 GB.

I really don’t recommend directly increasing the value to 2147483647 bytes, and I would always test a good value which works fine for the environment.

High value will allow huge Inserts,Update in a single statement and can bring in network latency while replicating.

I had a twitter discussion with SQLServer expert Robert L Davis(B/T) to double check if changing this server level value has any other direct impact, and he confirmed that there aren’t any. Many thanks to Robert!

Thanks for reading.

Why AD level permissions are important – The cluster resource ‘SQL Server’ could not be brought online !


When your cluster install fails, then there is lot to learn!!!

Today I am writing about my very recent experience working on a clustering deployment. It was for a two node cluster with single SQL Instance.

I stopped using Active/Passive terminology long back as it is not the right usage. Clustering MVP Allen Hirt (B/T) has pointed out this fact much time via his blog posts and through SQL forums.

There were no errors returned during the initial stages (Rule checks) of SQL cluster install. The setup apparently gave the below error at one point during the final configuration process and the Database Engine Install was failed.

The cluster resource ‘SQL Server’ could not be brought online.  

Error: The resource failed to come online due to the failure of one or more provider resources.

(Exception from HRESULT: 0x80071736)

There were no specific details on the SQL error log (Available under the Setup Bootstrap folder) which I could observe which eventually will lead me to find the reason for the error.

I kept checking the Windows error logs and hit this event right away –

[Click the picture for full view]

The reason for the error is the CNO (cluster computer account) don’t have the create computer perms at OU level.

We can test this by doing a simple Client Access Point Test

We can provide a Name and an IP (which gets picked automatically).This will create a computer object just the same way SQL Server does.

In some cases the Cluster service account are blocked from creating a computer object. In that situation you will need to work with the domain administrator and they should pre-create the virtual server computer object, and then grant certain access rights to the Cluster service account on the pre-created computer object.

In my case the domain services team created the computer object manually and then granted the cluster account full permissions for the same.

Conclusion

Domain level permissions are really important during cluster deployments, hence the person responsible for setting up the SQL cluster should closely interact with both windows team and domain services team(In most of the cases, both operations are handled by one single team) to understand what level of permissions are required or closely work together to isolate and fix potential problems.

Automatic Page Repair – Smart fellow who does its work behind the scenes!


Automatic Page Repair is one feature which is not really famous, however pretty much known to most of us.

I wanted to write about Automatic Page Repair since a very long time. Today I decided to test this feature when I had to deploy and do some test cases with DBM.

In simple words Auto Page Repair feature will replace the corrupt page by requesting a readable page from the partner mirrored database.

We also need to take into consideration that not all pages can be repaired. Read more about this feature here.

Let’s now do some corruptions and see if this feature is smart enough to repair them!

Note  – Don’t try this at home (Production Environment) !

Stage 1

For the purpose of the demo we have a Mirroring Setup for the database AdventureWorks2012 (Downloaded from codeplex)

For the purpose of corrupting the page we will need to make the database offline first, hence will need to remove mirroring for a while as the database which is taking part in a mirroring session cannot be taken offline. We will re-establish DBM once corruption is completed.

We will pick one table from the database and will corrupt the index page of the same. The table which we are going to choose here is HumanResources.Employee

We will now pick one Index Page for the table

The page which we are going to corrupt is Page with ID 875

Note I have noted to take required backups to ensure that proper rollback is possible.

 Using HexEdit tool we can enforce corruptions for page ID 875,and this action requires the database to be offline.

Once the corruptions are made(Ref to this post to understand how can we corrupt a page) we can bring the database Online and identify the corruptions using DBCC CHECKDB command

DBCC CHECKDB(AdventureWorks2012) WITH NO_INFOMSGS
 
Msg 8980, Level 16, State 1, Line 1
Table error: Object ID 1237579447, index ID 1, partition ID 72057594045136896, 
alloc unit ID 72057594050838528 (type In-row data). 
Index node page (0:0), slot 0 refers to child page (1:875) 
and previous child (0:0), but they were not encountered.

The above mentioned error message is taken from the DBCC result set.

Stage 2

Now that we have a corrupted page, we will proceed and re-establish mirroring.

After mirroring is re-established, we can try running DBCC CHECKDB once again on the same corrupted database.

DBCC CHECKDB(AdventureWorks2012) WITH NO_INFOMSGS

This time we will get a magical confirmation as below

Command(s) completed successfully.

Wow!!! Where did that corruption gone?

Yes, you guessed it right. Auto Page Repair is so smart that it replaced a clean page to the Principal Database from the Mirroring Partner.

No restore or what so ever, everything happened behind the scenes. Very neat, very smart !

Principal and Mirrored database is in sync and they help each other’s too.

We have a view called [sys].[dm_db_mirroring_auto_page_repair] which keeps track of all repair attempts which done behind the scenes.

Let’s quickly query it and see what’s in there

SELECT * FROM [sys].[dm_db_mirroring_auto_page_repair]

Bingo!!! The result came as

 The result is very clear. It says that page 875 was corrupted and it was replaced/repaired.The action was successful and the page is reusable

BOL talks about the below page_status possible values  –

2 = Queued for request from partner.

3 = Request sent to partner.

4 = Queued for automatic page repair (response received from partner).

5 = Automatic page repair succeeded and the page should be usable.

6 = Irreparable. This indicates that an error occurred during page-repair attempt, for example, because the page is also corrupted on the partner, the partner is disconnected, or a network problem occurred. This state is not terminal; if corruption is encountered again on the page, the page will be requested again from the partner.

In SQL 2012 we have an additional view for Always ON AG’s and it is

[sys].[dm_hadr_auto_page_repair]

Conclusion

In case you have a mirroring setup,then its worth querying the view and see how much the Auto Page Repair feature have helped you.

Thanks for reading.