SharePoint PowerShell Script to Extract All Documents and Their Versions

Hey! Listen: This script doesn’t extract documents that suffer from Longurlitis (a URL greater than the SharePoint maximum of 260 characters). So you may also want to run the PowerShell Script To Find and Extract Files From SharePoint That Have A URL Longer Than 260 Characters.
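If you just want a quick inventory of the affected files before running that script, here’s a minimal sketch (assuming the same SharePoint snap-in and the $site variable used in the main script below) that lists every document whose full URL is longer than 260 characters:

# Minimal sketch: report documents whose full URL exceeds 260 characters.
# Assumes the Microsoft.SharePoint.PowerShell snap-in is loaded and $site
# is set as in the main script below.
$web = Get-SPWeb -Identity $site
foreach ($list in ($web.Lists | Where-Object { $_.BaseType -eq "DocumentLibrary" }))
{
	foreach ($item in $list.Items)
	{
		$fullUrl = $web.Url + "/" + $item.Url
		if ($fullUrl.Length -gt 260)
		{
			Write-Host $fullUrl.Length $fullUrl
		}
	}
}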

Recently a client asked us to extract all content from a SharePoint site for archival. A CMP file was out of the question, because this had to be a SharePoint-independent solution. PowerShell to the rescue! The script below will extract all documents and their versions, and will export all list data and document library metadata to CSV files.

The DownloadSite function will download all the documents and their versions into folders named after their respective document libraries. Versions will be named [filename]_v[version#].[extension].

The DownloadMetadata function will export all of the site’s document library metadata and list data as CSV files. If you don’t need the metadata/lists, just comment out the call to DownloadMetadata at the bottom of the script.

There’s also ample commenting in case someone wants to modify or expand upon the script!

# This script will extract all of the documents and their versions from a site. It will also
# download all of the list data and document library metadata as a CSV file.
 
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue
# 
# $destination: Where the files will be downloaded to
# $webUrl: The URL of the website containing the document library for download
# $listUrl: The URL of the document library to download
 
#Where to Download the files to. Sub-folders will be created for the documents and lists, respectively.
$destination = "C:\Export"
 
#The site to extract from. Make sure there is no trailing slash.
$site = "http://yoursitecollection/yoursite"
 
# Function: HTTPDownloadFile
# Description: Downloads a file using webclient
# Variables
# $ServerFileLocation: Where the source file is located on the web
# $DownloadPath: The destination to download to
 
function HTTPDownloadFile($ServerFileLocation, $DownloadPath)
{
	$webClient = New-Object System.Net.WebClient
	$webClient.UseDefaultCredentials = $true
	$webClient.DownloadFile($ServerFileLocation,$DownloadPath)
}
 
function DownloadMetadata($sourceweb, $metadatadestination)
{
	Write-Host "Creating Lists and Metadata"
	$sourceSPweb = Get-SPWeb -Identity $sourceweb
	#Build the metadata folder under the destination passed in (rather than the global $destination)
	$metadataFolder = $metadatadestination+"\"+$sourceSPweb.Title+" Lists and Metadata"
	$createMetaDataFolder = New-Item $metadataFolder -type directory 
	$metadatadestination = $metadataFolder
 
	foreach($list in $sourceSPweb.Lists)
	{
		Write-Host "Exporting List MetaData: " $list.Title
		$ListItems = $list.Items 
		$Listlocation = $metadatadestination+"\"+$list.Title+".csv"
		$ListItems | Select * | Export-Csv $Listlocation -NoTypeInformation -Force
	}
}
 
# Function: GetFileVersions
# Description: Downloads all previous versions of a file; called by DownloadDocLib for each file
# Variables
# $file: The SPFile whose version history will be downloaded
 
function GetFileVersions($file)
{
	foreach($version in $file.Versions)
	{
		#Add version label to file in format: [Filename]_v[version#].[extension]
		#Use the .NET Path helpers so file names containing extra dots keep their full name
		$fullname = [System.IO.Path]::GetFileNameWithoutExtension($file.Name)
		$fileext = [System.IO.Path]::GetExtension($file.Name) #includes the leading dot
		$FullFileName = $fullname+"_v"+$version.VersionLabel+$fileext			
 
		#Can't create an SPFile object from historical versions, but CAN download via HTTP
		#Create the full File URL using the Website URL and version's URL
		$fileURL = $webUrl+"/"+$version.Url
 
		#Full Download path including filename
		$DownloadPath = $destinationfolder+"\"+$FullFileName
 
		#Download the file from the version's URL, download to the $DownloadPath location
		HTTPDownloadFile "$fileURL" "$DownloadPath"
	}
}
 
# Function: DownloadDocLib
# Description: Downloads a document library’s files; calls GetFileVersions to download each file’s versions.
# Credit 
# Used Varun Malhotra's script to download a document library
# as a starting point: http://blogs.msdn.com/b/varun_malhotra/archive/2012/02/13/10265370.aspx
# Variables
# $folderUrl: The Document Library to Download
# $DownloadPath: The destination to download to
function DownloadDocLib($folderUrl)
{
    $folder = $web.GetFolder($folderUrl)
    foreach ($file in $folder.Files) 
	{
        #Ensure destination directory
		$destinationfolder = $destination + "\" + $folder.Url 
        if (!(Test-Path -path $destinationfolder))
        {
            $dest = New-Item $destinationfolder -type directory 
        }
 
        #Download file
        $binary = $file.OpenBinary()
        $stream = New-Object System.IO.FileStream(($destinationfolder + "\" + $file.Name), [System.IO.FileMode]::Create)
        $writer = New-Object System.IO.BinaryWriter($stream)
        $writer.write($binary)
        $writer.Close()
 
		#Download file versions. If you don't need versions, comment the line below.
		GetFileVersions $file
	}
}
 
# Function: DownloadSite
# Description: Calls DownloadDocLib for the root folder and each subfolder of every document library in a site.
# Variables
# $webUrl: The URL of the site to download all document libraries
function DownloadSite($webUrl)
{
	$web = Get-SPWeb -Identity $webUrl
 
	#Create a folder using the site's name
	$siteFolder = $destination + "\" +$web.Title+" Documents"
	$createSiteFolder = New-Item $siteFolder -type directory 
	$destination = $siteFolder
 
	foreach($list in $web.Lists)
	{
		if($list.BaseType -eq "DocumentLibrary")
		{
			Write-Host "Downloading Document Library: " $list.Title
			$listUrl = $web.Url +"/"+ $list.RootFolder.Url
			#Download root files
			DownloadDocLib $list.RootFolder.Url
			#Download files in folders
			foreach ($folder in $list.Folders) 
			{
    			DownloadDocLib $folder.Url
			}
		}
	}
}
 
#Download Site Documents + Versions
DownloadSite "$site"
 
#Download Site Lists and Document Library Metadata
DownloadMetadata $site $destination
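
The script targets a single site, but the functions are easy to reuse. As a hypothetical example of expanding it, this sketch could replace the two calls above to export every web in a site collection (folder names come from each web’s Title, so webs with duplicate titles would collide – adjust the naming if that applies to you):

# Hypothetical expansion: export every web in a site collection.
$siteCollection = Get-SPSite "http://yoursitecollection"
foreach ($subWeb in $siteCollection.AllWebs)
{
	DownloadSite $subWeb.Url
	DownloadMetadata $subWeb.Url $destination
	$subWeb.Dispose()
}
$siteCollection.Dispose()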

SharePoint Cleanup: Inventory of All Documents In Site Collection

The Problem

Most of our SharePoint sites are long-running projects or continuing programs that have no end date – meaning site decommissioning will never happen. As such, document libraries can get pretty cluttered over the years with junk, making it difficult for users to find what they need (especially if metadata isn’t used!) and bloating SharePoint storage requirements. A lot of people will say that you can pick up a 2-terabyte drive at Futureshop for $100, but:

  1. Isn’t it better to keep things neat and well organized?
  2. Most SharePoint farms run on storage that costs way more than $100/TB, and
  3. Don’t throw hardware at a people problem! Just clean up already, it will save time, money and sanity!

Unfortunately, cleaning up these sites can be a gargantuan job as thousands of files and hundreds of folders (shudder) pile up. Site owners are loath to tackle it because it takes a fair bit of time, and most of the time they don’t even know where to start.

The Solution

Luckily, Gary LaPointe of STSADM (and PowerShell) fame has a script to list all documents in the entire farm. This guy is a fountain of knowledge wrapped in a library inside a crunchy chipotle taco.

Gary’s script will spit out a CSV file of every document in the farm. From there it’s simple to pop it into Excel and do a bit of sorting and conditional formatting to produce reports for your site collection administrators and site owners. With this report in hand, it’s easy to get them to clean things up, because all the problem documents are laid out for them.

We developed criteria for ‘the usual suspects’, i.e. red flags that indicate something may need to be archived or deleted:

  1. Item Creation Date > 1 year ago
    Maybe these documents are still relevant, maybe not. In many cases site owners and users had forgotten they existed.
  2. Item Modified Date > 6 months ago
    As with #1, this is kind of a ‘yellow flag’ – maybe it’s worth keeping, maybe it’s junk.
  3. File Size > 40MB
    This, to me, is an indication that a file needs to be looked at. It’s fairly rare in our case for an Office document to get this large.
  4. Number of Versions > 5
    Our governance model limits the number of stored versions to 10. Anything with more than 5 may need a look. In some cases site owners had turned on versioning ‘just to see what it looks like’ and forgotten to turn it off – the actual functionality wasn’t used.
  5. Total file size (file size x # of versions) > 50MB
    This nabbed a lot of problem files. Some notable examples were BMP (uncompressed image) files that were 40MB and had 10 versions – so 400MB just for one file. By compressing the BMP to a GIF or JPG, we took the file size down to 10KB, making the total potential file size 100KB.
  6. Item contains the word ‘Archive’, ‘Temp’/‘Temporary’, ‘Old’, etc.
    Big red flag right here. Site members and owners will often ‘cleanup’ their libraries by taking all stuff they aren’t sure about and clicking and dragging it all into a folder called “Archive” (thanks, Explorer View!). A lot of times things were dragged here and completely forgotten about.
  7. ZIP, RAR, TAR and other compressed file types
    This might be a contentious issue, but to me it’s rare that compressed files should be stored in SharePoint. A ZIP file containing 100 Word documents has little value in SharePoint, simply because users must first download the entire ZIP file, get the doc they need, re-zip it, re-upload it, and replace the old file. Sounds a lot like the pre-SharePoint document collaboration days.
    My other issue with compressed files is that they’re often used to circumvent the banned file types (e.g. .bat, .cab, .exe).
  8. Files with extensions > 4 characters
    Again, another tactic used to circumvent the banned file types. Some users would take a .bat file and rename it from MyScript.bat to MyScript.bat_rename. Users, please don’t do this.

The above are just ideas on where to start, and they totally depend on your governance model, policies, and your farm. Maybe you’re OK with having .exe files in ZIP files on your farm – that’s cool! The point is to work out what works for you; a sketch for automating these checks against the inventory CSV follows below.
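To make the triage repeatable, here’s a minimal sketch that applies these red flags to the inventory CSV in PowerShell instead of Excel. The column names (FileName, FileSize, Created, Modified, Versions) and paths are assumptions – match them to the headers your inventory actually contains:

# Minimal sketch: flag 'usual suspects' in the inventory CSV.
# Column names and paths are assumptions; adjust to your report.
$inventory = Import-Csv "C:\Reports\AllDocuments.csv"
$flagged = $inventory | Where-Object {
	[datetime]$_.Created -lt (Get-Date).AddYears(-1) -or
	[datetime]$_.Modified -lt (Get-Date).AddMonths(-6) -or
	[long]$_.FileSize -gt 40MB -or
	[int]$_.Versions -gt 5 -or
	([long]$_.FileSize * [int]$_.Versions) -gt 50MB -or
	$_.FileName -match 'archive|temp|old' -or          # rough substring match; tune to taste
	$_.FileName -match '\.(zip|rar|tar|7z)$' -or       # compressed file types
	$_.FileName -match '\.[^.\\/]{5,}$'                # extension longer than 4 characters
}
$flagged | Export-Csv "C:\Reports\FlaggedDocuments.csv" -NoTypeInformation

From there, a sort on total size (file size x number of versions) usually surfaces the worst offenders first.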

The Result

We’ve already shaved about 20GB of junk off our farm. Our site collection administrators, site owners, and users were actually really happy to go through and clean things up. In a few cases, I was flat-out told that they had wanted to do it for months but had no idea where to start. We now have plans to automate this reporting and make it a regular part of our farm maintenance.