One of the things I do at my job is clean up and beautify e-books that have been produced by a “meatgrinder”—the sort of automated conversion process that an outsourcer uses. My company has worked with a couple of conversion companies, and there are definite differences in the quality and markup philosophy of the files they produce, but one problem that appears to be chronic is that the EPUBs come back with CSS files containing tons of unused style declarations.
I’m talking thousands of lines, when two to three hundred will usually do.
This makes the files extremely tedious to troubleshoot and rework, so one of the first things I usually do if I know I’m going to be spending a considerable chunk of my day living in a particular EPUB is to cut down that stylesheet to what’s actually being used.
My method for doing this has until now been the most primitive possible: Search the book for all instances of class=".*?" and then scan the results window using my tender eyeballs, writing out a list of class names. The resulting list might look something like this (though here I’ve already added a few styles of my own, and renamed some of the existing ones):
Then I go through the CSS file and delete all the classes except those that are on my list. Then I rewrite 99 percent of them, and I add another couple, and I scrub down the HTML, and so on, and so on. But making a list is the first step, and it’s a doozy.
If you’ve ever used, say, a computer, you might think this is kind of stupid. Surely there’s a better way!
Yeah, well, I thought so, too. And I figured our “conversion partner” would have a roomful of programmers on staff to automate this, just as they’ve automated seemingly everything else in their workflow. So I made what I thought was a reasonable (and strongly worded) request that they degunk their CSS files before sending the EPUBs to us.
So, many lovingly handcrafted class lists later, while reading in Glenn Fleishman’s Take Control of BBEdit about a bunch of BBEdit features I never, ever use, I thought, Hmmmmmm. Let me think about this some more. And within about ten minutes I had a mostly automated process for pulling what I needed to know out of an EPUB. I’m sure somebody who’s handy with AppleScript or those Automator thingies could make it a completely automated process in another ten minutes, and if you’re that person, I hope you’ll do so and send me the doodad. In the meantime, though, I’m happy enough with my quick and dirty method.
To wit, by popular demand:
How to extract a list of classes used in an EPUB, via BBEdit
- Copy all the HTML files from the EPUB into a new folder. I then drop this into a BBEdit project window, because I do everything in projects, but you could work on the folder directly.
- Run the Text Factory ExtractClassesAndTags (4 KB) on the whole project/folder.
- Use Edit > Insert > File Contents… to concatenate all the HTML files into a single document.
- Run the Text Factory SortAndDedupe (4 KB) on the concatenated HTML file.
Ta da! You should now have a sorted list of all the elements used in the EPUB, with separate instances for each class or locally formatted element. Typical output: tagdump.txt (4 KB)
It is not pretty. But it is useful.
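If you don't have BBEdit handy, here is a rough Python sketch of the same idea, narrowed to class names only. It is not what the Text Factories do internally. It assumes the EPUB is an ordinary ZIP archive whose content files end in .html, .xhtml, or .htm, and that class attributes are quoted. The function names classes_in_markup and classes_in_epub are made up for this example.

```python
import re
import zipfile

# Matches class="..." or class='...'; the lookbehind skips attributes such as
# data-class. Assumes quoted attribute values, as in well-formed XHTML.
CLASS_ATTR = re.compile(
    r'''(?<![\w-])class\s*=\s*(["'])(.*?)\1''',
    re.IGNORECASE | re.DOTALL,
)


def classes_in_markup(markup):
    """Return the set of class names used in one HTML/XHTML string."""
    names = set()
    for match in CLASS_ATTR.finditer(markup):
        # One attribute may hold several space-separated class names.
        names.update(match.group(2).split())
    return names


def classes_in_epub(epub):
    """Return a sorted list of class names used in an EPUB's HTML files.

    `epub` may be a path or a binary file-like object.
    """
    names = set()
    with zipfile.ZipFile(epub) as archive:
        for member in archive.namelist():
            if member.lower().endswith((".html", ".xhtml", ".htm")):
                markup = archive.read(member).decode("utf-8", errors="replace")
                names |= classes_in_markup(markup)
    return sorted(names)
```

Calling classes_in_epub("book.epub") (a placeholder filename) should give you the list to check the stylesheet against. Because it is a plain regex search, it will also pick up anything that merely looks like a class attribute inside running text or comments, so treat the result as a starting point.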
What those text factories do
ExtractClassesAndTags contains six steps:
- Replace all
Our conversion partners never use IDs as hooks for CSS, and in some books, every single bleeding element has an ID assigned to it, so we definitely don’t want a list of those.
- Replace all
I don’t want a list of links.
- Replace all src=".*?" with nothing. I don’t want a list of images; it’s in the OPF already.
- Replace all
- Format Markup: document skeleton.
This command, which I was aware existed but had never thought about before reading Glenn Fleishman’s book, strips all the content out of the HTML, leaving only the markup.
- Format Markup: plain.
Another command I’d scorned before reading Glenn’s book, this puts every opening or closing tag on a line by itself. Who wants that?! Nobody, unless you’re about to sort them . . .
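For the curious, the gist of those steps can be approximated outside BBEdit. The Python below is only a sketch: it drops the noisy attributes and then keeps nothing but the tags, one per list item. The src pattern comes from the step above. Treating id and href the same way is my assumption, based on the explanations about IDs and links, and the remaining replace step is left out. The name tag_lines is invented for this example, and BBEdit's Format commands will not produce byte-identical output.

```python
import re

# Attributes to drop before listing tags. src is named in the steps above;
# id and href are assumptions inferred from the notes about IDs and links.
NOISY_ATTRS = re.compile(
    r'''\s+(?:id|href|src)\s*=\s*(["']).*?\1''',
    re.IGNORECASE | re.DOTALL,
)

# Any opening, closing, or self-closing tag. Comments and doctypes are
# skipped. Assumes no ">" appears inside an attribute value.
TAG = re.compile(r"</?[A-Za-z][^>]*>")


def tag_lines(markup):
    """Return every tag in `markup` as its own string, minus noisy attributes."""
    stripped = NOISY_ATTRS.sub("", markup)
    # Collapse internal whitespace so equivalent tags compare equal later.
    return [" ".join(tag.split()) for tag in TAG.findall(stripped)]
```

Printing "\n".join(tag_lines(markup)) gives the one-tag-per-line dump that the sorting step wants.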
SortAndDedupe has only two—predictable—steps:
- Sort Lines
- Process Duplicate Lines
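In Python terms, those two steps come down to something like the sketch below. I'm assuming the duplicate processing simply discards repeats and that blank lines are not wanted. The exact options set in the Text Factory may differ, and sort_and_dedupe is just a name chosen for the example.

```python
def sort_and_dedupe(lines):
    """Return the distinct non-blank lines, sorted."""
    # A set drops duplicates; stripping first makes "  <p>" and "<p>" equal.
    unique = {line.strip() for line in lines if line.strip()}
    return sorted(unique)
```

Python's default sort is by code point, so closing tags (which start with "</") land ahead of opening tags, and uppercase sorts before lowercase.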
So, that’s it. Rocket science it ain’t, but it sure will save me a ton of time.
Theoretically, these Text Factories should also be usable in TextWrangler, which is a stripped-down, free version of BBEdit. I got error messages when I tried them in TW 3.5, and I couldn’t figure out how to run them at all in TW 4.0, which was released yesterday. If you have any luck with that, or with automating this further, please let me know.