SharePoint Cleanup: Inventory of All Documents In Site Collection

The Problem

Most of our SharePoint sites are long-running projects or continuing programs that have no end date – meaning site decommissioning will never happen. As such, document libraries can get pretty cluttered with junk over the years, making it difficult for users to find what they need (especially if metadata isn’t used!) and bloating SharePoint storage requirements. A lot of people will say that you can pick up a 2 terabyte drive at Futureshop for $100, but:

  1. Isn’t it better to keep things neat and well organized?
  2. Most SharePoint farms run on storage that costs way more than $100/TB, and
  3. Don’t throw hardware at a people problem! Just clean up already, it will save time, money and sanity!

Unfortunately cleaning up these sites can be a gargantuan job as thousands of files and hundreds of folders (shudder) pile up. The site owners are loath to tackle the big job of cleaning up, because it takes a fair bit of time, and most of the time they don’t even know where to start.

The Solution

Luckily, Gary LaPointe of STSADM (and PowerShell) fame has a script to list all documents in the entire farm. This guy is a fountain of knowledge wrapped in a library inside a crunchy chipotle taco.

Gary’s script will spit out a CSV file of every document in the farm. From there it’s simple to pop it into Excel and do a bit of sorting and conditional formatting to produce reports for your site collection administrators and site owners. With this report in hand, it’s easy to get them to clean things up, because all the problem documents are laid out for them.
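If you would rather split the CSV up before it ever reaches Excel, here is a minimal sketch of grouping the rows by site so each owner gets only their own documents. The column names (`SiteUrl`, `FileName`) are assumptions for illustration, not necessarily what Gary’s script emits; check the header row of your own CSV and adjust.

```python
import csv
from collections import defaultdict


def rows_by_site(lines, site_column="SiteUrl"):
    """Group CSV rows by the site they belong to.

    `lines` is any iterable of CSV lines (an open file works).
    `site_column` is a hypothetical column name; use your CSV's real one.
    """
    groups = defaultdict(list)
    for row in csv.DictReader(lines):
        groups[row[site_column]].append(row)
    return dict(groups)


# Tiny made-up sample standing in for the real farm-wide CSV.
sample = (
    "SiteUrl,FileName\n"
    "http://a,x.docx\n"
    "http://b,y.docx\n"
    "http://a,z.docx\n"
)
groups = rows_by_site(sample.splitlines())
for site, rows in groups.items():
    print(site, len(rows))
```

With a real file you would pass an open file handle instead of the sample, for example `open("documents.csv", newline="")`, where the filename is again just a placeholder.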

We developed criteria for ‘the usual suspects’, i.e., red flags that indicate something may need to be archived or deleted:

  1. Item Creation Date > 1 year ago
    Maybe these documents are still relevant, maybe not. In many cases site owners and users had forgotten they existed.
  2. Item Modified Date > 6 months ago
    Like with #1, this is kind of a ‘yellow flag’ – maybe it’s worth keeping, maybe it’s junk.
  3. File Size > 40MB
    This, to me, is an indication that a file needs to be looked at. It’s fairly rare in our case for an Office document to get this large.
  4. Number of Versions > 5
    Our governance model limits the number of stored versions to 10. Anything with more than 5 may need a look. In some cases site owners had turned on versioning ‘just to see what it looks like’ and forgotten to turn it off – the actual functionality wasn’t used.
  5. Total file size (file size x # of versions) > 50MB
    This nabbed a lot of problem files. Some notable examples were BMP (uncompressed image) files that were 40MB and had 10 versions – so 400MB just for one file. By compressing the BMP to a GIF or JPG, we took the file size down to 10KB, making the total potential file size 100KB.
  6. Item contains the word ‘Archive’, ‘Temp’/ ‘Temporary’, ‘Old’, etc.
    Big red flag right here. Site members and owners will often ‘clean up’ their libraries by taking all the stuff they aren’t sure about and dragging it into a folder called “Archive” (thanks, Explorer View!). A lot of times things were dragged here and completely forgotten about.
  7. ZIP, RAR, TAR and other compressed file types
    This might be a contentious issue, but to me it’s rare that compressed files should be stored in SharePoint. A ZIP file containing 100 Word documents has little value in SharePoint simply because users must first download the entire ZIP file, get the doc they need, edit it, re-zip it, re-upload it and replace the old file. Sounds a lot like the pre-SharePoint document collaboration days.
    My other issue with compressed files is that they’re often used to circumvent the banned file types (e.g., .bat, .cab, .exe).
  8. Files with extensions > 4 characters
    Again, another tactic used to circumvent the banned file types. Some users would take a .bat file and rename it from MyScript.bat to MyScript.bat_rename. Users, please don’t do this.

The above are just ideas on where to start, and totally depend on your Governance model, policies, and your farm. Maybe you’re ok with having .exe files in ZIP files on your farm – that’s cool! The point is to work out what works for you and your farm.
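For anyone who wants to automate the flagging instead of doing it with conditional formatting, the eight criteria can be sketched roughly like this. It is only a starting point: the dictionary keys (`url`, `created`, `modified`, `size_bytes`, `versions`) are made-up names, the list of compressed extensions is my own guess at "other compressed file types", and six months is approximated as 182 days.

```python
import posixpath
import re
from datetime import datetime, timedelta

MB = 1024 * 1024
# Assumed list of compressed extensions; extend to suit your farm.
COMPRESSED = {".zip", ".rar", ".tar", ".gz", ".7z"}
KEYWORDS = {"archive", "temp", "temporary", "old"}


def red_flags(doc, now):
    """Return the names of the criteria a single document trips."""
    flags = []
    if now - doc["created"] > timedelta(days=365):
        flags.append("created > 1 year ago")
    if now - doc["modified"] > timedelta(days=182):
        flags.append("modified > 6 months ago")
    if doc["size_bytes"] > 40 * MB:
        flags.append("file size > 40MB")
    if doc["versions"] > 5:
        flags.append("versions > 5")
    if doc["size_bytes"] * doc["versions"] > 50 * MB:
        flags.append("total size > 50MB")

    path = doc["url"].lower()
    # Whole-word match on path segments, so "Folder" does not count as "old".
    words = set(re.split(r"[^a-z0-9]+", path))
    if words & KEYWORDS:
        flags.append("name contains archive/temp/old")

    ext = posixpath.splitext(path)[1]  # includes the leading dot
    if ext in COMPRESSED:
        flags.append("compressed file type")
    if len(ext) - 1 > 4:
        flags.append("extension > 4 characters")
    return flags


now = datetime(2024, 1, 1)
doc = {
    "url": "/sites/proj/Archive/MyScript.bat_rename",
    "created": now - timedelta(days=800),
    "modified": now - timedelta(days=10),
    "size_bytes": 1000,
    "versions": 1,
}
print(red_flags(doc, now))
```

The whole-word keyword match is deliberately conservative: it catches a folder called "Archive" but misses something like "OldReports", so loosen it if your naming habits call for that.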

The Result

We’ve already shaved off about 20GB of junk from our farm. Our site collection administrators, site owners, and users were actually really happy to go through and clean stuff up. In a few cases, I was flat out told that they had wanted to do it for months but had no idea where to start. We now have plans to automate this reporting and make it a regular part of our farm maintenance.