Distribution clean up: distribution job failing with error – Could not remove directory!


Today I came across an error for the Distribution clean up: distribution job in one of our environments.

Message – 
Executed as user: Domain\Sqlagentaccount. Could not remove directory ‘\\MachineName\ReplData\unc\MachineName_Replica_Replica10_PUB\20120403180393\’. Check the security context of xp_cmdshell and close other processes that may be accessing the directory. [SQLSTATE 42000] (Error 20015). The step failed.

[Note – I have altered the error message contents for security reasons]

This job had been running fine and suddenly started failing.

Troubleshooting steps – 

1. As a first step, I checked whether xp_cmdshell was configured. It was found that xp_cmdshell was indeed enabled.
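For reference, a quick generic way to check this is via sp_configure (a sketch, not necessarily the exact commands I ran):

```sql
-- Show advanced options so xp_cmdshell appears in sp_configure output
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

-- run_value = 1 means xp_cmdshell is enabled
EXEC sp_configure 'xp_cmdshell';
```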

2. I started to dig into the job to see what it runs. The job executes a stored procedure:

EXEC dbo.sp_MSdistribution_cleanup @min_distretention = 0, @max_distretention = 72

3. When this is triggered from a job, the SQL Server Agent account is used, so I decided to run it from an SSMS query window.

I got the same error message as above, along with this:

Replication-@rowcount_only parameter must be the value 0,1, or 2. 0=7.0 compatible checksum. 1=only check rowcou: agent distribution@rowcount_only parameter must be the value 0,1, or 2. 0=7.0 compatible checksum. 1=only  scheduled for retry. Could not clean up the distribution transaction tables.

4. I was fairly sure these error messages were misleading, so I decided to go ahead and verify the security permissions on the UNC share.

\\MachineName\ReplData

5. Initially I focused on the security permissions of the UNC share and granted both the Agent account and the Database Engine service account Full Control.

6. I ran the job again, and it failed yet again.

7. I did some research on the web and found a blog post from the SQL Server Support team. It pointed out that the SQL Account should also have Full Control on the UNC share.

8. I went ahead and granted the SQL Account Full Control on the UNC share, and the issue was resolved.

This was indeed strange behavior, because the job had been running fine before without the SQL Account being on the UNC share's Full Control list.

The only change that had happened within the environment was an SP4 upgrade, and that should not have caused this trouble.

As a test case I removed the SQL Account's permission on the UNC share once again and ran the job. It succeeded, which was yet again strange behavior.

Conclusion

This particular behavior is not documented anywhere, nor has it been noticed by many people within the SQL family. In case you face the same situation, double-check the permissions on the UNC share to isolate the issue and get to a quick solution.

Thanks for reading.

Deleted the TUF file!!! Boy, that’s trouble


Just two days back I wrote a post on TUF files related to log shipping. You can read the post here.

Today we will see what happens if someone deletes the TUF file accidentally, or it somehow goes missing.

I tried to simulate this on my test machine, which had log shipping configured. Below are the steps I followed –

1. Deleted the TUF file which was available in the secondary server.

2. The delete operation was successful.

3. Checked the log shipping status and found that the health was 'Good'.

4. Both primary and secondary databases were in sync, with the same set of data – row by row, column by column.

Note – Ideally, deleting the TUF file should also cause issues with log shipping restores on the secondary; however, my simulation did not hit that behavior.
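One generic way to confirm that restores are still happening on the secondary is to query msdb's restore history; this is a sketch, not necessarily the exact check I ran:

```sql
-- Most recent restores on the secondary server;
-- filter destination_database_name to your log-shipped database
SELECT TOP (10)
       destination_database_name,
       restore_date,
       restore_type          -- D = database, L = log
FROM   msdb.dbo.restorehistory
ORDER  BY restore_date DESC;
```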

All looks good, and you might be thinking that deleting a TUF file is easy and is not going to hurt much!!!

Now, let's assume that we lost our primary database server to a memory burn (short circuit) and we need the secondary database.

The RTO and RPO targets are quite okay, and we are allowed 30 minutes to bring the secondary database up. A walk in the park, right? We just have to bring the database online; the users/jobs/other objects are already taken care of, and only the database needs to come up.

Let's write this simple one-line T-SQL to bring our database up.

RESTORE DATABASE [XenDevDS] WITH RECOVERY

XenDevDS is my test database, available on the secondary server; its primary copy resided on the server that just went for a trip (memory burn!).

As soon as we execute this command, with a big smile, assuming the database will come up, we get this message –

Msg 3013, Level 16, State 1, Line 1
RESTORE DATABASE is terminating abnormally.
Msg 3441, Level 17, State 1, Line 1
During startup of warm standby database ‘XenDevDS’ (database ID 7), its standby file (‘C:\Program Files\Microsoft SQL Server\MSSQL11.SERVER2012B\MSSQL\DATA\XenDevDS_20120112191505.tuf’) was inaccessible to the RESTORE statement. The operating system error was ‘2(The system cannot find the file specified.)’. Diagnose the operating system error, correct the problem, and retry startup.

What does it mean – it simply means that you have done a good job deleting the TUF file, and now you had better bring it back.

The TUF file is required for a standby database to recover, and we will not be able to bring the database up without it.

As the simulation was in a very controlled environment, I brought back the TUF file and ran the restore command once again.

RESTORE DATABASE [XenDevDS] WITH RECOVERY

RESTORE DATABASE successfully processed 0 pages in 0.908 seconds (0.000 MB/sec).

The database was recovered and was accepting new connections.

Conclusion – The TUF file is a very important part of recovering a standby database. We have to educate the server ops team, or anyone else responsible for cleaning up files, to make sure it stays untouched.

Do you have any way to recover a standby database on a log shipping secondary without the TUF file? If yes, please share your experience in the comments section of this post.

Thanks for reading.

TUF File – Not a very famous member, but does its job pretty well!


I have seen various questions related to TUF files, and one discussion was interesting; it went something like this –

<Start>

John  – I don’t understand why we need this TUF file in SQL Server, what does it do? I have been looking around for more information, but seems there is no great information around the same.

Kim – Are you talking about .TRN files?

John – No, I am talking about .TUF files. Trust me it’s there!

Kim – Oh, then I am missing something. Let me check that out.

<End>

So what is this TUF file all about?

I was also not very sure what a TUF file deals with; however, after some research I was able to understand the concept and decided to write this post.

A TUF file, or Transaction Undo File, is created when log shipping to a server in Standby mode. It holds information about the transactions that were still open (uncommitted) at the time the log backup was taken.

This file is important in the Standby mode of log shipping, where you can read the secondary database. In standby mode, database recovery is performed each time a log backup is restored.

While restoring a log backup, uncommitted transactions are recorded in the undo file and only committed transactions are written to disk, thereby letting users read the database. When the next T-log backup is restored, SQL Server fetches the uncommitted transactions from the undo file and checks the new T-log backup to see whether they have since committed. If committed, the transactions are written to disk; otherwise they stay in the undo file until they commit or roll back.
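This is the same mechanism you exercise when restoring a log backup manually in standby mode: the STANDBY clause names the undo (TUF) file. The database name and paths below are illustrative:

```sql
-- Restore a log backup in standby mode; the file named in STANDBY
-- is the undo (TUF) file that holds the uncommitted transactions
RESTORE LOG [MyDB]
FROM DISK = N'C:\LSCopy\MyDB_log.trn'
WITH STANDBY = N'C:\LSCopy\MyDB_undo.tuf';
```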

A small graphical representation of the above statement is shown below –

I configured log shipping to test TUF file and created a scenario like below –

1. Created a primary database.

2. Configured log shipping to another Instance within the same box.

3. Backup, Copy and Restore to happen every 15 minutes.

4. Continuously inserted data to the primary database to simulate TUF creation.

5. I was able to find the TUF file created under the same path where I had placed my system database files.

There seem to have been changes over the versions in the path where the TUF file lives: in SQL Server 2008 and above it sits in the data root as mentioned above, while in earlier versions it used to be in the LS_Copy folder.


Coming up next – what happens when I delete this file? Please stay tuned, my friends.

Thanks for reading.

Partial database availability – A walk through


Partial database availability is an exciting feature, and I decided to write this blog post after observing many doubts related to it in the forums.

Let's assume a situation like the one below –

We have a database with multiple filegroups, with data files residing separately in their respective filegroups. Now assume a severe disk failure where the drive holding one of the .ndf files is corrupted!

This will make the database inaccessible. We have multiple options to recover from this situation, one of which is to restore the database from backup sets. Now think of a situation where our database is super large and a restore will take around 30–45 minutes.

Do we really want our database users to wait until we complete the restore? What if we bring a portion of the database online while we work on the recovery and bring everything else back gradually?

Wow!!! (The business will just love this idea as soon as I tell them.) However, this solution is not so simple: it requires a lot of planning and testing, and the application must be able to work without a portion of the data.

Let's do a demo of this situation and understand how we can achieve partial database availability –

1. We will create a demo database

--Created a Database
CREATE DATABASE TEST_FILEGROUP

2. Create a new file group

--Create a new FileGroup
ALTER DATABASE TEST_FILEGROUP
ADD FILEGROUP ADDITIONAL

3. Add one additional data file to the database

--Add an additional data file to the database
ALTER DATABASE TEST_FILEGROUP
ADD FILE (NAME='NEW_DATA_FILE',
FILENAME='D:\Program Files\Microsoft SQL Server\MSSQL10_50.SQL2008R2RD\MSSQL\DATA\NEW_DATA_FILE.ndf')
TO FILEGROUP ADDITIONAL

4. Validate them

sp_helpdb TEST_FILEGROUP

Name
------------------
TEST_FILEGROUP
TEST_FILEGROUP_log
NEW_DATA_FILE

5. Create a table [Employee] on primary file group and insert some data

--Create a table on the primary filegroup and insert some data rows
USE [TEST_FILEGROUP]
CREATE TABLE Employee(ID Int Identity(1000,1),Name Varchar(20))

INSERT INTO Employee (Name)
SELECT 'John'
UNION ALL
SELECT 'Tim'
UNION ALL
SELECT 'Tracy'
UNION ALL
SELECT 'Jim'
UNION ALL
SELECT 'Ancy'

6. Create another table [HRRECORDS] on the additional filegroup and insert some data (the rows below match the data queried later in this post)

--Create another table on the additional filegroup and insert some data
USE [TEST_FILEGROUP]
CREATE TABLE HRRECORDS(ID Int Identity(1000,1),Description Varchar(20))
ON ADDITIONAL

INSERT INTO HRRECORDS (Description)
VALUES ('IT Spec'),('DBA'),('Developer'),('Java Guy'),('.NetSpec')

7. Now we will take a backup of the ADDITIONAL filegroup for the purposes of this demo

--Take a backup of the ADDITIONAL filegroup for this demo
BACKUP DATABASE TEST_FILEGROUP
FILEGROUP='ADDITIONAL'
TO DISK='C:\TestBackup\Additional_FileGroup.bak'

Processed 16 pages for database 'TEST_FILEGROUP', file 'NEW_DATA_FILE' on file 1.
Processed 5 pages for database 'TEST_FILEGROUP', file 'TEST_FILEGROUP_log' on file 1.
BACKUP DATABASE...FILE=<name> successfully processed 21 pages in 0.318 seconds (0.495 MB/sec).

Now comes the really interesting part of this demo. We are going to simulate an error situation –

We are going to stop the SQL Server Engine service so that we can delete the additional data file (.ndf). Once the service is stopped, we will be able to delete the .ndf file.

Note – This is for demo purposes only and should not be attempted in a real production environment. [A word of caution before the CTO/manager gives you surprises!]

Once the .ndf file is deleted, start the engine; you will observe the below error straight away if you try to access our demo database.

This was expected; it simply means that deleting the .ndf file took the database down. What are we going to do now to bring this database back up and running?
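You can also confirm the database's state from another connection; with a data file missing it will typically no longer show ONLINE (the exact state can vary, so this is just a sanity-check sketch):

```sql
-- Check the database state after the missing .ndf is detected
SELECT name, state_desc
FROM sys.databases
WHERE name = 'TEST_FILEGROUP';
```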

We could certainly restore from backup to bring it up; however, think about this situation: our database backups are huge, as we may be dealing with a huge database, and users would have to wait until the whole backup set is restored.

Do we have an RTO of around 45 minutes to 1 hour? Do we really need to wait for the whole restore to complete, to fix an issue with another filegroup, before users can connect to the database and access tables residing in the primary filegroup?

The short and sweet answer to this question is – NO. Starting with SQL Server 2005, the database can be made available to users as soon as its primary filegroup is up and running.

Now let's go back to our situation, where the database is offline because of a corrupted/missing .ndf file.

The users have agreed (this should actually be part of the DR strategy, not decided at the last minute) that they can work without the HRRECORDS table – the one residing in the additional filegroup that just failed. They just need the Employee table to continue their work.

Wow!!! Now we can feel some fresh air to breath.

8. We can achieve this by taking the additional data file offline

Note – Once we take this file offline, we can bring it back only by using a file/filegroup backup or a regular full database backup.

--Take the additional file offline
ALTER DATABASE TEST_FILEGROUP
MODIFY FILE (NAME='NEW_DATA_FILE', OFFLINE);

We will need to recycle the service once again for the change to take effect, and we can verify it by checking the sys.database_files catalog view:

SELECT name, state_desc FROM sys.database_files

Name                 state_desc
TEST_FILEGROUP       ONLINE
TEST_FILEGROUP_log   ONLINE
NEW_DATA_FILE        OFFLINE

9. Now that the file is offline, the database is accessible.

Our query against the Employee table returns:

SELECT TOP 100 [ID], [Name]
FROM [TEST_FILEGROUP].[dbo].[Employee]

ID    Name
1000  John
1001  Tim
1002  Tracy
1003  Jim
1004  Ancy

If we attempt to query the table on the additional data file, we get an error:

SELECT TOP 100 [ID]
 ,[Description]
 FROM [TEST_FILEGROUP].[dbo].[HRRECORDS]

Msg 8653, Level 16, State 1, Line 1
The query processor is unable to produce a plan for the table or view ‘HRRECORDS’ because the table resides in a filegroup which is not online.

This is partial database availability, where the database is available minus some tables, and we have achieved it using filegroups/files.
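To plan for this in advance, it helps to know which tables live on which filegroup, and hence which ones would survive a filegroup outage. A catalog query along these lines works for non-partitioned tables (a generic sketch, not part of the original demo):

```sql
-- Map each table (heap or clustered index) to its filegroup
USE TEST_FILEGROUP;

SELECT t.name  AS table_name,
       fg.name AS filegroup_name
FROM sys.tables t
JOIN sys.indexes i
     ON i.object_id = t.object_id
    AND i.index_id IN (0, 1)      -- 0 = heap, 1 = clustered index
JOIN sys.filegroups fg
     ON fg.data_space_id = i.data_space_id;
```

In our demo this would list Employee on PRIMARY and HRRECORDS on ADDITIONAL.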

Now, how can we bring this .ndf file back from the backup? Here is the process –

1. We will restore the filegroup from the backup we took earlier

RESTORE DATABASE TEST_FILEGROUP
FILEGROUP = 'ADDITIONAL' FROM DISK = 'C:\TestBackup\Additional_FileGroup.bak' WITH RECOVERY

Oops we are missing something here –

/*Msg 3159, Level 16, State 1, Line 1
The tail of the log for the database “TEST_FILEGROUP” has not been backed up. Use BACKUP LOG WITH NORECOVERY to backup the log if it contains work you do not want to lose. Use the WITH REPLACE or WITH STOPAT clause of the RESTORE statement to just overwrite the contents of the log.
Msg 3013, Level 16, State 1, Line 1
RESTORE DATABASE is terminating abnormally. */

What does the message say – it says that the tail of the log has not been backed up, and this must be done before we restore the filegroup.

2. So let's go ahead and back up the tail of the log

BACKUP LOG TEST_FILEGROUP TO DISK='C:\TestBackup\Tail.trn' WITH NORECOVERY
/*Processed 10 pages for database 'TEST_FILEGROUP', file 'TEST_FILEGROUP_log' on file 1.
BACKUP LOG successfully processed 10 pages in 0.223 seconds (0.345 MB/sec).*/

3. Let's try to restore the filegroup now.

RESTORE DATABASE TEST_FILEGROUP
FILEGROUP = 'ADDITIONAL' FROM DISK = 'C:\TestBackup\Additional_FileGroup.bak' WITH RECOVERY

I specifically used RECOVERY in the restore command here to show the error message and illustrate why the tail-of-log backup is needed.

As soon as we run the above command, we get another message:

Processed 16 pages for database 'TEST_FILEGROUP', file 'NEW_DATA_FILE' on file 1.
Processed 5 pages for database 'TEST_FILEGROUP', file 'TEST_FILEGROUP_log' on file 1.
The roll forward start point is now at log sequence number (LSN) 21000000019100001. 
Additional roll forward past LSN 21000000024300001 is required to complete the restore sequence.
This RESTORE statement successfully performed some actions, 
but the database could not be brought online because one or more RESTORE steps are needed. 
Previous messages indicate reasons why recovery cannot occur at this point.
RESTORE DATABASE ... FILE=<name> successfully processed 21 pages in 0.214 seconds (0.736 MB/sec).

4. Finally, we will bring the database up by restoring the tail-of-log backup

RESTORE DATABASE TEST_FILEGROUP
FROM DISK = 'C:\TestBackup\Tail.trn' WITH RECOVERY
Processed 0 pages for database 'TEST_FILEGROUP', file 'TEST_FILEGROUP' on file 1.
Processed 0 pages for database 'TEST_FILEGROUP', file 'NEW_DATA_FILE' on file 1.
Processed 7 pages for database 'TEST_FILEGROUP', file 'TEST_FILEGROUP_log' on file 1.
RESTORE LOG successfully processed 7 pages in 0.117 seconds (0.463 MB/sec).

Now our database is completely available, with both tables. As a test case, we can query the HRRECORDS table to validate the data:

SELECT TOP 100 [ID], [Description]
FROM [TEST_FILEGROUP].[dbo].[HRRECORDS]

ID    Description
1000  IT Spec
1001  DBA
1002  Developer
1003  Java Guy
1004  .NetSpec

Conclusion – Partial database availability is very useful for huge databases where the secondary filegroups store historical data and the primary filegroup is critical for the business.

Backup and restore of files/filegroups is a very interesting topic; I will simulate this feature in SQL Server 2012 to see if there are any changes and will come back with more details.

I would love to hear about your experience dealing with filegroups. Thanks for reading.