Degristling the sausage

One of the things I do at my job is clean up and beautify e-books that have been produced by a “meatgrinder”—the sort of automated conversion process that an outsourcer uses. My company has worked with a couple of conversion companies, and there are definite differences in the quality and markup philosophy of the files they produce, but one problem that appears to be chronic is that the EPUBs come back with CSS files containing tons of unused style declarations.

I’m talking thousands of lines, when two to three hundred will usually do.

This makes the files extremely tedious to troubleshoot and rework, so one of the first things I usually do if I know I’m going to be spending a considerable chunk of my day living in a particular EPUB is to cut down that stylesheet to what’s actually being used.

My method for doing this has until now been the most primitive possible: Search the book for all instances of class=".*?" and then scan the results window using my tender eyeballs, writing out a list of class names. The resulting list might look something like this (though here I’ve already added a few styles of my own, and renamed some of the existing ones):

Then I go through the CSS file and delete all the classes except those that are on my list. Then I rewrite 99 percent of them, and I add another couple, and I scrub down the HTML, and so on, and so on. But making a list is the first step, and it’s a doozy.
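For anyone who would rather script the list-making step than eyeball it, here is a minimal Python sketch of the same search. It assumes class attributes are always quoted, and the names `CLASS_ATTR` and `collect_classes` are made up for illustration; they are not part of any tool mentioned in this post.

```python
import re

# Matches class="..." or class='...' (quoted values only; an assumption).
CLASS_ATTR = re.compile(r'\bclass\s*=\s*(["\'])(.*?)\1', re.IGNORECASE | re.DOTALL)

def collect_classes(html_texts):
    """Return a sorted list of the distinct class names in the given HTML strings."""
    names = set()
    for text in html_texts:
        for match in CLASS_ATTR.finditer(text):
            # One class attribute can hold several space-separated names.
            names.update(match.group(2).split())
    return sorted(names)

sample = [
    '<p class="para indent">One</p><span class="ital">x</span>',
    "<p class='para'>Two</p>",
]
print(collect_classes(sample))  # ['indent', 'ital', 'para']
```

Because this is a plain regex rather than an HTML parser, it will also pick up the text `class="..."` if it happens to appear in running text or in an attribute such as `data-class`.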

If you’ve ever used, say, a computer, you might think this is kind of stupid. Surely there’s a better way!

Yeah, well, I thought so, too. And I figured our “conversion partner” would have a roomful of programmers on staff to automate this, just as they’ve automated seemingly everything else in their workflow. So I made what I thought was a reasonable (and strongly worded) request that they degunk their CSS files before sending the EPUBs to us.

Crickets.

So, many lovingly handcrafted class lists later, while reading in Glenn Fleishman’s Take Control of BBEdit about a bunch of BBEdit features I never, ever use, I thought, Hmmmmmm. Let me think about this some more. And within about ten minutes I had a mostly automated process for pulling what I needed to know out of an EPUB. I’m sure somebody who’s handy with AppleScript or those Automator thingies could make it a completely automated process in another ten minutes, and if you’re that person, I hope you’ll do so and send me the doodad. In the meantime, though, I’m happy enough with my quick and dirty method.

To wit, by popular demand:

How to extract a list of classes used in an EPUB, via BBEdit

Update, 12 February 2015: There’s a new version of all this! Please see Degristling the sausage: BBEdit 11 Edition. Or by all means read on, if you enjoy hearing the olds talk about walking through the snow to school, uphill both ways.

  1. Copy all the HTML files from the EPUB into a new folder. I then drop this into a BBEdit project window, because I do everything in projects, but you could work on the folder directly.
  2. Run the Text Factory ExtractClassesAndTags (4 KB) on the whole project/folder.
  3. Use Edit > Insert > File Contents… to concatenate all the HTML files into a single document.
  4. Run the Text Factory SortAndDedupe (4 KB) on the concatenated HTML file.

Ta da! You should now have a sorted list of all the elements used in the EPUB, with separate instances for each class or locally formatted element. Typical output: tagdump.txt (4 KB)

It is not pretty. But it is useful.
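The concatenation in step 3 can also be done outside BBEdit. What follows is a rough Python sketch, not a replacement for the Text Factories: it assumes the files are UTF-8 and end in `.html`, `.xhtml`, or `.htm`, and `concatenate_html` is a hypothetical helper name.

```python
from pathlib import Path

# File extensions treated as HTML (an assumption; adjust for your EPUBs).
HTML_SUFFIXES = {".html", ".xhtml", ".htm"}

def concatenate_html(folder):
    """Join the text of every HTML file under `folder`, in sorted path order."""
    paths = sorted(
        p for p in Path(folder).rglob("*")
        if p.is_file() and p.suffix.lower() in HTML_SUFFIXES
    )
    # Assumes UTF-8, which is typical for EPUB content but not guaranteed.
    return "\n".join(p.read_text(encoding="utf-8") for p in paths)
```

Calling `concatenate_html("OEBPS")` on an unzipped EPUB whose content lives in an `OEBPS` folder would return one long string to feed into the later steps.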

What those text factories do

ExtractClassesAndTags contains six steps:

  1. Replace all id=".*?" with nothing.
    Our conversion partners never use IDs as hooks for CSS, and in some books, every single bleeding element has an ID assigned to it, so we definitely don’t want a list of those.
  2. Replace all href=".*?" with nothing.
    I don’t want a list of links.
  3. Replace all src=".*?" with nothing.
    I don’t want a list of images; it’s in the OPF already.
  4. Replace all alt=".*?" with nothing.
    Ditto.
  5. Format Markup: document skeleton.
    This command, which I was aware existed but had never thought about before reading Glenn Fleishman’s book, strips all the content out of the HTML, leaving only the markup.
  6. Format Markup: plain.
    Another command I’d scorned before reading Glenn’s book, this puts every opening or closing tag on a line by itself. Who wants that?! Nobody, unless you’re about to sort them . . .
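The six steps above can be approximated in a few lines of Python. This is only a sketch of the idea, not what BBEdit does internally: it drops the four attributes with a regex and then keeps every tag, throwing away the text in between. The names `NOISE_ATTRS`, `TAG`, and `extract_tags` are invented for the example, and it assumes double-quoted attribute values.

```python
import re

# Steps 1-4: attributes that say nothing about styling.
NOISE_ATTRS = re.compile(r'\s+(?:id|href|src|alt)\s*=\s*"[^"]*"', re.IGNORECASE)
# Steps 5-6: anything between angle brackets counts as a tag here,
# which also catches doctype declarations and most comments.
TAG = re.compile(r'<[^<>]+>')

def extract_tags(html):
    """Return every tag in `html`, one list item per tag, content discarded."""
    stripped = NOISE_ATTRS.sub('', html)
    # Collapse whitespace inside each tag so identical tags compare equal.
    return [' '.join(tag.split()) for tag in TAG.findall(stripped)]

sample = (
    '<p class="para" id="p1">See <a href="ch2.html" class="xref">chapter 2</a>.</p>\n'
    '<img src="a.jpg" alt="A cat"/>'
)
for tag in extract_tags(sample):
    print(tag)
```

For the sample, this prints the opening and closing `p` and `a` tags with only their class attributes left, plus a bare `<img/>`.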

SortAndDedupe has only two—predictable—steps:

  1. Sort Lines
  2. Process Duplicate Lines
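In Python, those two steps collapse into a set and a sort. A small sketch, assuming that "Process Duplicate Lines" is set to delete the duplicates and that blank lines can be thrown away; `sort_and_dedupe` is a hypothetical name.

```python
def sort_and_dedupe(lines):
    """Return the distinct non-blank lines, sorted by character code."""
    cleaned = (line.strip() for line in lines)
    # A set drops duplicates; sorted() then orders what is left.
    return sorted({line for line in cleaned if line})

tags = ['<p class="para">', '</p>', '<p class="para">', '<a class="xref">', '</a>', '</p>', '']
print(sort_and_dedupe(tags))
# ['</a>', '</p>', '<a class="xref">', '<p class="para">']
```

Closing tags sort ahead of opening tags because `/` comes before letters in character-code order.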

So, that’s it. Rocket science it ain’t, but it sure will save me a ton of time.

Theoretically, these Text Factories should also be usable in TextWrangler, which is a stripped-down, free version of BBEdit. I got error messages when I tried them in TW 3.5, and I couldn’t figure out how to run them at all in TW 4.0, which was released yesterday. If you have any luck with that, or with automating this further, please let me know.

13 Responses

  1. Gabriele
    April 14, 2012 at 12:32 pm

    If your goal is to get a list of every CSS class used in a bunch of (X)HTML files, this little Python script will help.

    http://pastebin.com/RThYyCss

    Usage: from the CLI, type ./cssstylelist.py > dump.txt

    Hope this helps.

  2. Gabriele
    April 14, 2012 at 1:21 pm

    Exactly. The path global variable sets the folder where the script takes the files from. If you happen to have all the contents in OEBPS, the line should read path = "OEBPS/".

    I personally put all my scripts in the root directory, so I can easily cd into the project folder and execute the script by typing /cssstylelist.py. The script's directory is irrelevant; it's just a matter of prepending the full path to the script's file name.

  3. Gabriele
    April 14, 2012 at 1:50 pm

    That’s because you don’t have Python 3.x installed. The default version provided in Snow Leopard (or Lion) is Python 2.7. You can install the latest Python 3 package installer from http://www.python.org/download/.
