Comment

djo

Thinking out loud...

You can't just grep the files server-side. Some of the HTML might be generated at runtime, or stored in a database. There's also the false-positive problem.

You can't wget the site and grep the result after spidering, either. As you said, there's password protection, plus all the other problems associated with generated content (e.g. we have styles associated with the day of the week: mondayContent, tuesdayContent, etc.).

I think you need some kind of reference-counting scheme. But how to implement it without doing all the work by hand... not a clue. Possibly you could lash something up if your only IDE were Visual Studio. But I don't think any programmatic solution is going to "get":

$style = strtolower(date('l')) . 'Content'; // e.g. "mondayContent"
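
To make that concrete, here's a rough JavaScript sketch of a literal-match scan. The stylesheet, the template string, and the helper names (classSelectors, apparentlyUnused) are all made up for illustration, and the regex is nowhere near a real CSS parser. The point is only that a name assembled at runtime never appears literally in the source, so the scan reports it as unused.

```javascript
// Hypothetical stylesheet and template source, for illustration only.
const css = '.mondayContent { color: red; } .tuesdayContent { color: blue; } .header { margin: 0; }';
const template = "$style = $day . 'Content'; echo '<div class=\"header ' . $style . '\">';";

// Pull class names out of simple ".name" selectors (not a full CSS parser).
function classSelectors(cssText) {
  const names = new Set();
  for (const m of cssText.matchAll(/\.([A-Za-z_][\w-]*)/g)) names.add(m[1]);
  return [...names];
}

// Naive static check: a class counts as "used" only if its name
// appears literally somewhere in the source text.
function apparentlyUnused(cssText, sourceText) {
  return classSelectors(cssText).filter((name) => !sourceText.includes(name));
}

console.log(apparentlyUnused(css, template));
// → [ 'mondayContent', 'tuesdayContent' ]
```

Both day-of-week styles get flagged even though the template would emit one of them on every request, which is exactly the case a grep can't handle.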

You could bind the CSS files tighter to the content: instead of a site-wide stylesheet, each directory has its own stylesheet. That's more manageable, but it just works around the problem rather than solving it, removes a lot of the advantages of CSS, and introduces redundancy.

How about using JavaScript to report back which styles have been used in rendering the page? If 3 months have gone by and you haven't had a report that mondayContent has been used, it's probably safe to remove it. The problems with this are obvious (it only sees pages that people actually visit, with JavaScript enabled), and it's a total hack besides.
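
A minimal sketch of that reporting idea, in JavaScript. Everything named here is an assumption for illustration: the helper functions, the '/style-report' endpoint in the comment, and the choice to keep a "last seen" timestamp per class name. It's a rough outline of the hack, not a tested tool.

```javascript
// Browser side (assumed): collect the class names present in the rendered page.
// `doc` is normally `document`; it is a parameter so the function can be tried
// outside a browser with a stand-in object.
function collectUsedClasses(doc) {
  const used = new Set();
  for (const el of doc.querySelectorAll('[class]')) {
    for (const name of el.classList) used.add(name);
  }
  return [...used];
}

// Server side (assumed): remember when each class name was last reported.
// `lastSeen` is a Map from class name to timestamp (milliseconds).
function recordReport(lastSeen, classNames, when) {
  for (const name of classNames) lastSeen.set(name, when);
  return lastSeen;
}

// Class names defined in the stylesheet that have never been reported,
// or were last reported before the cutoff timestamp.
function staleClasses(defined, lastSeen, cutoff) {
  return defined.filter((name) => !lastSeen.has(name) || lastSeen.get(name) < cutoff);
}

// In a browser, the report could be sent with something like this
// (the '/style-report' URL is hypothetical):
// navigator.sendBeacon('/style-report', JSON.stringify(collectUsedClasses(document)));
```

With a cutoff of three months ago, staleClasses gives the list of removal candidates. It's still only as good as the traffic: a style used on a page nobody visited in that window looks exactly like a dead one.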

When you get into weird cases like "Site A depends on the stylesheet from site B", I think there's basically no chance.