SQLServer Log Shipping Work File ! What is this all about ?


A short and sweet post for a gorgeous morning -

Have you seen a file with extension .wrk in the log shipping copy folder ?

Logshipping work file

What is this all about, Curious? Then here is the answer -

When the transaction log backup files are copied from the primary server to the secondary server the file is named as a work file(.wrk) till the copy operation is completed.

This will ensure that the restore job in the secondary will not pick the file and try a restore.

Once the file is completely copied, then it will be named as .trn, the one which is very familiar to us.

Logshipping work file2

Normally when the log backups are very small you wont really observe the .wrk file. In a flash they are converted to a .trn file. When the log backups are relatively bigger, then you can see .wrk file available in the copy folder for a longer period of time.

Thanks for reading and keep watching this space for more.

Tale of detaching a database which was enabled for replication!


In development environments I prefer to use ‘detach’ database option to move around databases. It’s simple, fast and easy.

Almost all the time when I detached a database earlier the feeling was like something which I mentioned before “Simple, fast and easy”, however today it was more sort of a fun experience. Here is that tale for you all.

Today I had to ‘detach’ a database from one of my development environments and straight away got this error !

 

 

[Pic courtesy : temzlandia.blogspot.com ]

The database was setup for replication, hence it cannot be dropped.

Apparently, my next step was to remove the publication which was setup for this database and the tale begins from this point.

Note - This was a very controlled dev environment and action like detaching, removing replication was not going to harm anything. In case you are going to repro this demo, then I would recommend you to do this on your lab environment.

I selected the required publication and went ahead with delete operation. The wizard was double checking with me if I was sure. I said Yes to it !

Bang! here comes the next error

The error was pointing that there are some connection issues to the Distributor or to the Subscriber. 

The distributor and the subscriber for this publication were already retired servers and not online and removing publication requires a connection to be made to the distributor/subscriber servers too. As those servers were not reachable, the wizard couldn’t proceed.

There is no way you could delete this publication directly using SSMS and the best way to do this is via system stored procedures.

I had to use 3 system stored procedures to cleanup this publication and those are -

sp_dropsubscription

sp_droppublication

sp_replicationdboption

The scripts used for the clean up processes are mentioned below -

--Cleans up the subscriptions for a publisher

USE <Your Database>
GO
EXEC sp_dropsubscription @publication = N'Your_Publication', 
@article = N'all', @subscriber = N'all', @destination_db = N'all', @ignore_distributor = 1

--Cleans up the publication

USE <Your Database>
GO
EXEC sp_droppublication @publication = N'Your_Publication', @ignore_distributor = 1
--Set Replication database option to false

USE Master
GO
EXEC sp_replicationdboption @dbname = N'Your_Publication', @optname = N'publish', @value = N'false'
,@ignore_distributor = 1

If you note the scripts you can find that they all have an argument @ignore_distributor passed as 1.All the three scripts were failing without this as the connection to distributor was not available as it was offline.

I was curious to check what exactly @ignore_distributor does, however the below BOL entry didnt gave me any clue -

@ignore_distributor = ] ignore_distributor

Identified for informational purposes only. Not supported. Future compatibility is not guaranteed.

It’s very much clear that removing a publication will check connections to the Distributor and in case it’s not available for connection, there will be connection errors.

After running the above mentioned stored procedures I could proceed and detach the database.

Thanks for reading.

Max Text Replication Size – When you might have to care about this number !


One of the advanced Server level options for SQL Server is Max Text Replication Size and this is really not a common one. In most of cases the default value is not changed.

This configuration specifies the maximum size of data that can be added to a replicated column in a single INSERT, UPDATE, WRITETEXT or UPDATETEXT statement. This applies to data type text, ntext, varchar (max), nvarchar (max), varbinary (max), xml, and image.

The default value for this configuration is 65536 bytes = 0.0625 MB.

Why we care about this value? Here is a reason why we might have to carefully change this value according to our requirements.

Recently I had setup a database for Citrix XenApp 6.5 farm data store requirements and configured transactional replication as part of the DR requirement.

After configuring replication the Citrix team where able to read data out of the database, however nothing was getting published. They were consistently getting errors like “Unknown error occurred: error code 0×82060035″

Things where working great before and this error started popping up after replication was setup. The only change made was setting replication for this database.

Upon checking the database closely I found out one table called dbo.KEYTABLE which had a column called data which was of type Varbinary (max).

As this column was part of transactional replication the value of Max Text Replication Size came into picture and anything above 65536 bytes on a single Insert was not allowed and Citrix faced issues while publishing new apps.

Carefully choosing the best possible value for this configuration setting fixed the problem.

It has a max value of 2147483647 bytes which is 2 GB.

I really don’t recommend directly increasing the value to 2147483647 bytes, and I would always test a good value which works fine for the environment.

High value will allow huge Inserts,Update in a single statement and can bring in network latency while replicating.

I had a twitter discussion with SQLServer expert Robert L Davis(B/T) to double check if changing this server level value has any other direct impact, and he confirmed that there aren’t any. Many thanks to Robert!

Thanks for reading.

Automatic Page Repair – Smart fellow who does its work behind the scenes!


Automatic Page Repair is one feature which is not really famous, however pretty much known to most of us.

I wanted to write about Automatic Page Repair since a very long time. Today I decided to test this feature when I had to deploy and do some test cases with DBM.

In simple words Auto Page Repair feature will replace the corrupt page by requesting a readable page from the partner mirrored database.

We also need to take into consideration that not all pages can be repaired. Read more about this feature here.

Let’s now do some corruptions and see if this feature is smart enough to repair them!

Note  - Don’t try this at home (Production Environment) !

Stage 1

For the purpose of the demo we have a Mirroring Setup for the database AdventureWorks2012 (Downloaded from codeplex)

For the purpose of corrupting the page we will need to make the database offline first, hence will need to remove mirroring for a while as the database which is taking part in a mirroring session cannot be taken offline. We will re-establish DBM once corruption is completed.

We will pick one table from the database and will corrupt the index page of the same. The table which we are going to choose here is HumanResources.Employee

We will now pick one Index Page for the table

The page which we are going to corrupt is Page with ID 875

Note - I have noted to take required backups to ensure that proper rollback is possible.

 Using HexEdit tool we can enforce corruptions for page ID 875,and this action requires the database to be offline.

Once the corruptions are made(Ref to this post to understand how can we corrupt a page) we can bring the database Online and identify the corruptions using DBCC CHECKDB command

DBCC CHECKDB(AdventureWorks2012) WITH NO_INFOMSGS
 
Msg 8980, Level 16, State 1, Line 1
Table error: Object ID 1237579447, index ID 1, partition ID 72057594045136896, 
alloc unit ID 72057594050838528 (type In-row data). 
Index node page (0:0), slot 0 refers to child page (1:875) 
and previous child (0:0), but they were not encountered.

The above mentioned error message is taken from the DBCC result set.

Stage 2

Now that we have a corrupted page, we will proceed and re-establish mirroring.

After mirroring is re-established, we can try running DBCC CHECKDB once again on the same corrupted database.

DBCC CHECKDB(AdventureWorks2012) WITH NO_INFOMSGS

This time we will get a magical confirmation as below

Command(s) completed successfully.

Wow!!! Where did that corruption gone?

Yes, you guessed it right. Auto Page Repair is so smart that it replaced a clean page to the Principal Database from the Mirroring Partner.

No restore or what so ever, everything happened behind the scenes. Very neat, very smart !

Principal and Mirrored database is in sync and they help each other’s too.

We have a view called [sys].[dm_db_mirroring_auto_page_repair] which keeps track of all repair attempts which done behind the scenes.

Let’s quickly query it and see what’s in there

SELECT * FROM [sys].[dm_db_mirroring_auto_page_repair]

Bingo!!! The result came as

 The result is very clear. It says that page 875 was corrupted and it was replaced/repaired.The action was successful and the page is reusable

BOL talks about the below page_status possible values  -

2 = Queued for request from partner.

3 = Request sent to partner.

4 = Queued for automatic page repair (response received from partner).

5 = Automatic page repair succeeded and the page should be usable.

6 = Irreparable. This indicates that an error occurred during page-repair attempt, for example, because the page is also corrupted on the partner, the partner is disconnected, or a network problem occurred. This state is not terminal; if corruption is encountered again on the page, the page will be requested again from the partner.

In SQL 2012 we have an additional view for Always ON AG’s and it is

[sys].[dm_hadr_auto_page_repair]

Conclusion

In case you have a mirroring setup,then its worth querying the view and see how much the Auto Page Repair feature have helped you.

Thanks for reading.

Distribution clean up: distribution job failing with error – Could not remove directory !


Today I came across an error for the Distribution clean up: distribution job in one of the environments

Message - 
Executed as user: Domain\Sqlagentaccount. Could not remove directory ‘\\MachineName\ReplData\unc\MachineName_Replica_Replica10_PUB\20120403180393\’. Check the security context of xp_cmdshell and close other processes that may be accessing the directory. [SQLSTATE 42000] (Error 20015). The step failed.

[Note - I have altered the error message contents for security reasons]

This job was running fine and it started giving issues suddenly.

Troubleshooting steps - 

1. As a first step I checked if xp_cmdshell is configured or not.It was found that xp_cmdshell was indeed enabled.

2.I started to dig into the job to see what it runs.The job runs a stored procedure

EXEC dbo.sp_MSdistribution_cleanup @min_distretention = 0, @max_distretention = 72

3.When this is triggered from a job the Agent account is used,hence I decided to run this from SSMS query window.

I got the same error message as mentioned above,along with this

Replication-@rowcount_only parameter must be the value 0,1, or 2. 0=7.0 compatible checksum. 1=only check rowcou: agent distribution@rowcount_only parameter must be the value 0,1, or 2. 0=7.0 compatible checksum. 1=only  scheduled for retry. Could not clean up the distribution transaction tables.

4. I was pretty sure that these error messages are little misleading type and decided to go ahead with verifying the security permission of the UNC share.

\\MachineName\ReplData

5. Initially I was focusing on the security permissions of the unc share,and assigned both the Agent account and the Database Engine account full_control

6. Ran the job again and it failed yet again.

7. I decided to do some R&D via web and found this blog post from SQL Server Support team.This post was pointing that the SQL Account should also have full_control for the UNC share.

8. I went ahead and granted full_control to SQL Account for the UNC share and the issue was resolved.

This was indeed a strange behavior because the job was running fine before with the SQL Account being part of the UNC share full_control list.

The only change which had happened within the environment is SP4 upgrade and this should not have caused this trouble.

As a test case I removed the permission of the SQL Account  once again for the UNC share and tried to run the job.This was successful,which was yet again a strange behaviour.

Conclusion

This particular behavior is not documented anywhere nor this has been noticed by many people within the SQL Family,hence in case you face the same situation,then you might have to double check the permissions for the UNC share to isolate the issue and get a quick solution.

Thanks for reading.

Deleted the TUF file!!! Boy, that’s trouble


Just 2 days back I wrote a post of TUF files related to log shipping. You can read the post here 

Today we will see what is going to happen if someone deleted the TUF file accidentally or by any chance it got missed.

I tried to simulate this on my test machine which had log shipping configured. Below are the steps which I followed -

1. Deleted the TUF file which was available in the secondary server.

2. The delete operation was successful.

3. Checked log shipping status and found that the health is ‘Good’

4. Both primary and secondary databases are synced and both have got same set of data. Row by row,Col by Col.

Note - Ideally deleting the TUF file should also cause issues to log shipping secondary restores, however my simulation did not faced that behavior.

All looks good, and you might be wondering that deleting a TUF file is easy and it’s not going to hurt me much!!!

Now, let’s assume that we lost our primary database server due to Memory burn(Short circuit) and we are in need of the Secondary database.

The RTO and RPO matrix is quite okay and we are allowed to bring the secondary database up within 30 minutes. Walk in the park right? We just have to bring the database up, the users/jobs/other objects are already taken care and just the database needs to be up.

Let’s write this simple 6 word TSQL to bring our database up.

RESTORE DATABASE [XenDevDS] WITH RECOVERY

XenDevDS is my test database which is available in the secondary server and its primary copy was the one which was residing on the server which just went for a trip(Memory burn!)

As soon as we execute this command with a big smile assuming that the database will be up, we will get this message -

Msg 3013, Level 16, State 1, Line 1
RESTORE DATABASE is terminating abnormally.
Msg 3441, Level 17, State 1, Line 1
During startup of warm standby database ‘XenDevDS’ (database ID 7), its standby file (‘C:\Program Files\Microsoft SQL Server\MSSQL11.SERVER2012B\MSSQL\DATA\XenDevDS_20120112191505.tuf’) was inaccessible to the RESTORE statement. The operating system error was ’2(The system cannot find the file specified.)’. Diagnose the operating system error, correct the problem, and retry startup.

What does it mean – It simply means that you have done a good job by deleting the TUF file and now please bring it back.

TUF file is required for the Stand by database to recover and we will not be able to bring the database up without the same.

As the simulation was in a very controlled environment, I brought back the TUF file and ran the restore command once again.

RESTORE DATABASE [XenDevDS] WITH RECOVERY

RESTORE DATABASE successfully processed 0 pages in 0.908 seconds (0.000 MB/sec).

The database was recovered and was accepting new connections.

Conclusion – TUF file is a very important part of recovery of a stand by database and we have to educate server ops team or anyone who is responsible for cleaning up files and make sure that this is un-touched.

Do you have any ways to recover a stand by database in log shipping secondary without TUF file.If Yes,then please share your experience in the comments section of this post.

Thanks for reading.

TUF File – Not a very famous member,but does his job pretty well!


I have seen various questions related to TUF files,and one of the discussion was interesting and it was something like below-

<Start>

John  - I don’t understand why we need this TUF file in SQL Server, what does it do? I have been looking around for more information, but seems there is no great information around the same.

Kim – Are you talking about .TRN files?

John – No, I am talking about .TUF files. Trust me it’s there!

Kim – Oh, then I am missing something. Let me check that out.

</End of discussion>

So what is this TUF file is all about?

I was also not very sure of what TUF file deals with, however after some research I was able to understand the concept of TUF files and decided to write this post.

TUF file or a Transaction Undo File is created when performing log shipping to a server in Standby mode. This file contains information on all the modifications performed at the time backup is taken.

This file is important in Standby mode of log shipping were you can access the secondary database. Database recovery is done in standby mode when log is restored.

While restoring the log backup, un-committed transactions will be recorded to the undo file and only committed transactions will be written to disk there by making users to read the database. When we restore next Tlog backup SQL server will fetch the un-committed transactions from undo file and check with the new Tlog backup whether the same is committed or not. If its committed the transactions will be written to disk else it will be stored in undo file until it gets committed or rolled back.

A small graphical representation of the above statement is shown below -

I configured log shipping to test TUF file and created a scenario like below -

1. Created a primary database.

2. Configured log shipping to another Instance within the same box.

3. Backup, Copy and Restore to happen every 15 minutes.

4. Continuously inserted data to the primary database to simulate TUF creation.

5. I was able to find TUF file created under the same path were I had placed my system databases files.

There seems to be changes in this path were we can find the TUF files. It will be available in the root as mentioned above for SQL Server 2008 above and used to be in the LS_Copy folder for earlier versions.

 

 

Coming up next - What happens when I delete this file? So please stay tuned my friends.

Thanks for reading.