Introduction
Here's a post about indexing ZIP archives, in the same style as the one I did on PDF indexing. The search engine uses IFilters to read the specific structure of a file type and extract the content that it puts in the index. When you perform a search query, the results come from that index. Without an IFilter for a file type, you can only search on file name and metadata.
[Indexing Server]: the server(s) in the SharePoint Farm that has/have the "Indexing" Role assigned. In a small farm this can be a single server for all roles.
[Web Front End Server]: the server(s) in the SharePoint Farm that has/have the "Web Front End" Role assigned. In a small farm this can be a single server for all roles.
Windows SharePoint Services 3.0
[Indexing Server]
- Install the ZIP IFilter (see below for a list of available IFilters)
- Add the .zip file type to the index list:
- Open the Registry Editor (Start > Run > regedit)
- Go to HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Shared Tools\Web Server Extensions\12.0\Search\Applications\\Gather\Search\Extensions\ExtensionList
- Add a new String Value
- Value name:
- Value data: zip
- Perform an iisreset
- Perform a Full Update on the Search content indexes
- Open a Command Prompt on the Indexing Server
- net stop spsearch
- net start spsearch
- cd "C:\Program Files\Common Files\Microsoft Shared\Web server extensions\12\BIN"
- stsadm.exe -o spsearch -action fullcrawlstop
- stsadm.exe -o spsearch -action fullcrawlstart
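If you repeat the Full Update often, the command sequence above can be wrapped in a small script. The following is only a sketch: the function names `full_update_commands` and `run_full_update` are made up for illustration, the stsadm path is the one quoted in the steps, and the script defaults to a dry run that only prints the commands.

```python
import subprocess

# Path taken from the steps above; adjust it if SharePoint lives elsewhere.
STSADM = (
    r"C:\Program Files\Common Files\Microsoft Shared"
    r"\Web server extensions\12\BIN\stsadm.exe"
)


def full_update_commands():
    """Return the Full Update commands from the steps above, in order."""
    return [
        ["net", "stop", "spsearch"],
        ["net", "start", "spsearch"],
        [STSADM, "-o", "spsearch", "-action", "fullcrawlstop"],
        [STSADM, "-o", "spsearch", "-action", "fullcrawlstart"],
    ]


def run_full_update(dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in full_update_commands():
        print(" ".join(cmd))
        if not dry_run:
            # check=True aborts the sequence at the first failing command.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_full_update(dry_run=True)
```

Run it from an elevated Command Prompt on the Indexing Server and pass `dry_run=False` once the printed commands look right.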
[Web Front End Server]
The zip icon registration is available out of the box.
Microsoft Office SharePoint Server 2007
[Indexing Server]
- Install the ZIP IFilter (see below for a list of available IFilters)
- Add the .zip file type to the index list:
- Go to Central Administration, then to the Shared Services Administration Web of the current SSP, go to Search Settings and then to File Types
- Add a new file type zip
- Perform an iisreset
- Perform a Full Update on the Search content indexes
- Open a Command Prompt on the Indexing Server
- net stop osearch
- net start osearch
- Go to Central Administration, then to the Shared Services Administration Web of the current SSP, go to Search Settings and start a full crawl of all locations containing ZIP files
[Web Front End Server]
The zip icon registration is available out of the box.
Available IFilters
IFilterShop ZIP IFilter
- requires a license
- 32 bit and 64 bit (applies to the [Indexing Server])
- Note: I haven't been able to get this one to work. After installation and configuration, every crawled ZIP item reports the following: Crawled (The filtering process could not load the item. This is possibly caused by an unrecognized item format or item corruption.)
Citeknet ZIP IFilter
- requires a license
- 32 bit and 64 bit (applies to the [Indexing Server])
- Currently version 2.1 Beta
- Works very well in the test setup. I haven't seen it in production or under stress tests.
What about PDF documents inside ZIP archives?
The ZIP IFilter will index all files in the archive using the corresponding IFilter, but if that one is an apartment-threaded IFilter (such as Adobe's PDF IFilter) you need to make the following adjustment:
[Indexing Server]
- Open the Registry Editor (Start > Run > regedit)
- Go to HKEY_CLASSES_ROOT\CLSID\{4C904448-74A9-11d0-AF6E-00C04FD8DC02}\InprocServer32
- Change the ThreadingModel key value
- Old value: Apartment
- New value: Both
- Go to HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\ContentIndex
- Change the DLLsToRegister key value
- Remove the entry corresponding to pdffilt.dll from the list to prevent the Adobe PDF IFilter from re-registering
- Restart the Search Service and perform a Full Update
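The two registry changes above could also be scripted. This is a minimal sketch, not a tested deployment script: it assumes that `DLLsToRegister` is a multi-string value (which Python's `winreg` returns as a list), the function names `drop_pdf_filter` and `apply_changes` are made up for illustration, and `apply_changes` only works on Windows from an elevated prompt. Export both keys before trying anything like this.

```python
import ntpath

# Key paths as given in the steps above.
CLSID_KEY = r"CLSID\{4C904448-74A9-11d0-AF6E-00C04FD8DC02}\InprocServer32"
CONTENT_INDEX_KEY = r"SYSTEM\CurrentControlSet\Control\ContentIndex"


def drop_pdf_filter(dlls):
    """Return the DLL list without any entry pointing at pdffilt.dll."""
    return [d for d in dlls if ntpath.basename(d).lower() != "pdffilt.dll"]


def apply_changes():
    """Apply both registry edits (Windows only, needs admin rights)."""
    import winreg  # imported here so the module still loads on other systems

    access = winreg.KEY_READ | winreg.KEY_SET_VALUE

    # Step 1: switch the threading model from Apartment to Both.
    with winreg.OpenKey(winreg.HKEY_CLASSES_ROOT, CLSID_KEY, 0, access) as key:
        winreg.SetValueEx(key, "ThreadingModel", 0, winreg.REG_SZ, "Both")

    # Step 2: remove pdffilt.dll so the filter does not re-register itself.
    with winreg.OpenKey(
        winreg.HKEY_LOCAL_MACHINE, CONTENT_INDEX_KEY, 0, access
    ) as key:
        dlls, value_type = winreg.QueryValueEx(key, "DLLsToRegister")
        if not isinstance(dlls, list):
            # Assumption violated: the value is not a multi-string list.
            raise TypeError("DLLsToRegister is not a multi-string value")
        winreg.SetValueEx(
            key, "DLLsToRegister", 0, value_type, drop_pdf_filter(dlls)
        )
```

The list filtering is kept in its own function so it can be checked on any machine before touching the registry.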
An excellent tool for getting an overview of the installed IFilters is Citeknet IFilter Explorer, which also shows the threading model of each one.
Conclusion
Using the above procedure for either WSS 3.0 or MOSS 2007, you can have your ZIP archives indexed by the SharePoint Search. The IFilter recursively indexes any ZIP archives contained inside the archive. All other files (.txt, .doc, .ppt, .pdf) are indexed as well, and if an IFilter exists for a file type it is used to extract information from the file. This way the search can index text inside PDF documents inside the ZIP archive.
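To make the recursive behaviour concrete, here is a small sketch of the idea using Python's standard `zipfile` module. It is not how any of the ZIP IFilters is implemented; the function name `walk_archive` and the `outer.zip/inner.txt` style of path are assumptions chosen for illustration.

```python
import io
import zipfile


def walk_archive(data, prefix=""):
    """Yield (path, content) for every file, descending into nested ZIPs."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries carry no content
            path = prefix + info.filename
            content = archive.read(info)
            if info.filename.lower().endswith(".zip"):
                # Nested archive: recurse, using its name as a path prefix.
                yield from walk_archive(content, path + "/")
            else:
                # A real indexer would hand the content to the filter
                # registered for this file type at this point.
                yield path, content
```

Calling `walk_archive` with the bytes of an archive returns every contained file, including those that sit in an archive inside the archive.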
Note that the search results will show confusing file names as shown below: