Degristling the sausage

One of the things I do at my job is clean up and beautify e-books that have been produced by a “meatgrinder”—the sort of automated conversion process that an outsourcer uses. My company has worked with a couple of conversion companies, and there are definite differences in the quality and markup philosophy of the files they produce, but one problem that appears to be chronic is that the EPUBs come back with CSS files containing tons of unused style declarations.

I’m talking thousands of lines, when two to three hundred will usually do.

This makes the files extremely tedious to troubleshoot and rework, so one of the first things I usually do if I know I’m going to be spending a considerable chunk of my day living in a particular EPUB is to cut down that stylesheet to what’s actually being used.

My method for doing this has until now been the most primitive possible: Search the book for all instances of class=".*?" and then scan the results window using my tender eyeballs, writing out a list of class names. The resulting list might look something like this (though here I’ve already added a few styles of my own, and renamed some of the existing ones):

Then I go through the CSS file and delete all the classes except those that are on my list. Then I rewrite 99 percent of them, and I add another couple, and I scrub down the HTML, and so on, and so on. But making a list is the first step, and it’s a doozy.

If you’ve ever used, say, a computer, you might think this is kind of stupid. Surely there’s a better way!

Yeah, well, I thought so, too. And I figured our “conversion partner” would have a roomful of programmers on staff to automate this, just as they’ve automated seemingly everything else in their workflow. So I made what I thought was a reasonable (and strongly worded) request that they degunk their CSS files before sending the EPUBs to us.

Crickets.

So, many lovingly handcrafted class lists later, while reading in Glenn Fleishman’s Take Control of BBEdit about a bunch of BBEdit features I never, ever use, I thought, Hmmmmmm. Let me think about this some more. And within about ten minutes I had a mostly automated process for pulling what I needed to know out of an EPUB. I’m sure somebody who’s handy with AppleScript or those Automator thingies could make it a completely automated process in another ten minutes, and if you’re that person, I hope you’ll do so and send me the doodad. In the meantime, though, I’m happy enough with my quick and dirty method.

To wit, by popular demand:

How to extract a list of classes used in an EPUB, via BBEdit

Update, 12 February 2015: There’s a new version of all this! Please see Degristling the sausage: BBEdit 11 Edition. Or by all means read on, if you enjoy hearing the olds talk about walking through the snow to school, uphill both ways.

  1. Copy all the HTML files from the EPUB into a new folder. I then drop this into a BBEdit project window, because I do everything in projects, but you could work on the folder directly.
  2. Run the Text Factory ExtractClassesAndTags (4 KB) on the whole project/folder.
  3. Use Edit > Insert > File Contents… to concatenate all the HTML files into a single document (a scripted alternative is sketched just after this list).
  4. Run the Text Factory SortAndDedupe (4 KB) on the concatenated HTML file.
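Step 3 is the only step that isn't a Text Factory; if you ever wanted to script that part too, concatenating the files takes only a few lines of Python (a sketch, not part of the Text Factories; the output filename and extensions here are assumptions):

    import glob

    # Sketch: glue all the (X)HTML files in this folder into one document,
    # so the later sort-and-dedupe pass has a single file to chew on.
    with open("everything.html", "w", encoding="utf-8") as out:
        for filename in sorted(glob.glob("*.html") + glob.glob("*.xhtml")):
            with open(filename, encoding="utf-8") as f:
                out.write(f.read() + "\n")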

Ta da! You should now have a sorted list of all the elements used in the EPUB, with separate instances for each class or locally formatted element. Typical output: tagdump.txt (4 KB)

It is not pretty. But it is useful.

What those text factories do

ExtractClassesAndTags contains six steps:

  1. Replace all id=".*?" with nothing.
    Our conversion partners never use IDs as hooks for CSS, and in some books, every single bleeding element has an ID assigned to it, so we definitely don’t want a list of those.
  2. Replace all href=".*?" with nothing.
    I don’t want a list of links.
  3. Replace all src=".*?" with nothing.
    I don’t want a list of images; it’s in the OPF already.
  4. Replace all alt=".*?" with nothing.
    Ditto.
  5. Format Markup: document skeleton.
    This command, which I was aware existed but had never thought about before reading Glenn Fleishman’s book, strips all the content out of the HTML, leaving only the markup.
  6. Format Markup: plain.
    Another command I’d scorned before reading Glenn’s book, this puts every opening or closing tag on a line by itself. Who wants that?! Nobody, unless you’re about to sort them . . .
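If you’re curious what those six steps boil down to outside of BBEdit, here’s a rough Python equivalent (a sketch only, not a drop-in for the Text Factory; the skeleton and plain steps are approximated by pulling out the tags and printing each one on its own line):

    import re
    import sys

    # Sketch: approximate the six ExtractClassesAndTags steps on one HTML file.
    html = open(sys.argv[1], encoding="utf-8").read()

    # Steps 1-4: throw away the attributes we don't want lists of.
    for attr in ("id", "href", "src", "alt"):
        html = re.sub(attr + r'=".*?"', "", html)

    # Steps 5-6: keep only the markup, one opening or closing tag per line.
    for tag in re.findall(r"<[^>]+>", html):
        print(tag.strip())

Either run it file by file or point it at the concatenated document; the combined result is the same.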

SortAndDedupe has only two—predictable—steps:

  1. Sort Lines
  2. Process Duplicate Lines
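In Python terms, that pair is just a sorted set (again a sketch, assuming a one-tag-per-line file like the one the previous steps produce):

    import sys

    # Sketch: BBEdit's Sort Lines plus Process Duplicate Lines, in one go.
    lines = open(sys.argv[1], encoding="utf-8").read().splitlines()
    for line in sorted(set(line.strip() for line in lines if line.strip())):
        print(line)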

So, that’s it. Rocket science it ain’t, but it sure will save me a ton of time.

Theoretically, these Text Factories should also be usable in TextWrangler, which is a stripped-down, free version of BBEdit. I got error messages when I tried them in TW 3.5, and I couldn’t figure out how to run them at all in TW 4.0, which was released yesterday. If you have any luck with that, or with automating this further, please let me know.

13 thoughts on “Degristling the sausage”

  1. Grazie, Gabriele!

    I’m guessing that the line
    path = "OEBPS/Text" # your mileage may vary
    means that if your OEBPS folder is flat (as all the ones our conversion company makes are), you should delete that /Text from the path?

    And then in Terminal, you should cd into the directory of the unzipped EPUB?

    Where does one place the script itself?

  2. Exactly. The path global variable sets the folder the script reads the files from. If you happen to have all the contents in OEBPS, the line should read path = "OEBPS/".

    I personally put all my scripts in the root directory, so I can easily cd into the project folder and execute the script by typing /cssstylelist.py. The script’s location is irrelevant; it’s just a matter of prepending the full path to the script file itself.

  3. Whee! That’s lovely. Thank you so much!

    But I like that my BBEdit method also pulls any local styling that’s on the HTML element itself, so I added
    list.append(i.get("style"))
    on line 30. Now it’s perfect. :-)
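For anyone reading along without the original script, the general shape of such a thing might be roughly this; it’s a guess at the idea, not Gabriele’s actual cssstylelist.py, and it assumes lxml and a flat OEBPS folder:

    import glob
    from lxml import html

    path = "OEBPS/"  # your mileage may vary

    # Sketch: walk every (X)HTML file and collect class names and inline styles.
    found = set()
    for filename in glob.glob(path + "*.html") + glob.glob(path + "*.xhtml"):
        tree = html.parse(filename)
        for i in tree.getroot().iter():
            if not isinstance(i.tag, str):  # skip comments and processing instructions
                continue
            if i.get("class"):
                found.update(i.get("class").split())
            if i.get("style"):
                found.add(i.get("style"))

    for item in sorted(found):
        print(item)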
