One of the things I do at my job is clean up and beautify e-books that have been produced by a “meatgrinder”—the sort of automated conversion process that an outsourcer uses. My company has worked with a couple of conversion companies, and there are definite differences in the quality and markup philosophy of the files they produce, but one problem that appears to be chronic is that the EPUBs come back with CSS files containing tons of unused style declarations.
I’m talking thousands of lines, when two to three hundred will usually do.
This makes the files extremely tedious to troubleshoot and rework, so one of the first things I usually do if I know I’m going to be spending a considerable chunk of my day living in a particular EPUB is to cut down that stylesheet to what’s actually being used.
My method for doing this has until now been the most primitive possible: Search the book for all instances of class=".*?" and then scan the results window using my tender eyeballs, writing out a list of class names. The resulting list might look something like this (though here I’ve already added a few styles of my own, and renamed some of the existing ones):
Then I go through the CSS file and delete all the classes except those that are on my list. Then I rewrite 99 percent of them, and I add another couple, and I scrub down the HTML, and so on, and so on. But making a list is the first step, and it’s a doozy.
If you’ve ever used, say, a computer, you might think this is kind of stupid. Surely there’s a better way!
Yeah, well, I thought so, too. And I figured our “conversion partner” would have a roomful of programmers on staff to automate this, just as they’ve automated seemingly everything else in their workflow. So I made what I thought was a reasonable (and strongly worded) request that they degunk their CSS files before sending the EPUBs to us.
Crickets.
So, many lovingly handcrafted class lists later, while reading in Glenn Fleishman’s Take Control of BBEdit about a bunch of BBEdit features I never, ever use, I thought, Hmmmmmm. Let me think about this some more. And within about ten minutes I had a mostly automated process for pulling what I needed to know out of an EPUB. I’m sure somebody who’s handy with AppleScript or those Automator thingies could make it a completely automated process in another ten minutes, and if you’re that person, I hope you’ll do so and send me the doodad. In the meantime, though, I’m happy enough with my quick and dirty method.
To wit, by popular demand:
How to extract a list of classes used in an EPUB, via BBEdit
Update, 12 February 2015: There’s a new version of all this! Please see Degristling the sausage: BBEdit 11 Edition. Or by all means read on, if you enjoy hearing the olds talk about walking through the snow to school, uphill both ways.
- Copy all the HTML files from the EPUB into a new folder. I then drop this into a BBEdit project window, because I do everything in projects, but you could work on the folder directly.
- Run the Text Factory ExtractClassesAndTags (4 KB) on the whole project/folder.
- Use Edit > Insert > File Contents… to concatenate all the HTML files into a single document.
- Run the Text Factory SortAndDedupe (4 KB) on the concatenated HTML file.
Ta da! You should now have a sorted list of all the elements used in the EPUB, with separate instances for each class or locally formatted element. Typical output: tagdump.txt (4 KB)
It is not pretty. But it is useful.
What those text factories do
ExtractClassesAndTags contains six steps:
- Replace all id=".*?" with nothing. Our conversion partners never use IDs as hooks for CSS, and in some books, every single bleeding element has an ID assigned to it, so we definitely don’t want a list of those.
- Replace all href=".*?" with nothing. I don’t want a list of links.
- Replace all src=".*?" with nothing. I don’t want a list of images; it’s in the OPF already.
- Replace all alt=".*?" with nothing. Ditto.
- Format Markup: document skeleton. This command, which I was aware existed but had never thought about before reading Glenn Fleishman’s book, strips all the content out of the HTML, leaving only the markup.
- Format Markup: plain. Another command I’d scorned before reading Glenn’s book, this puts every opening or closing tag on a line by itself. Who wants that?! Nobody, unless you’re about to sort them . . .
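For readers without BBEdit, the six steps above can be roughly approximated in Python. This is a minimal sketch, not what BBEdit does internally: the function name extract_tags and the sample string are made up for illustration, the attribute pattern requires leading whitespace (slightly stricter than the searches listed above), and treating every <...> run as a tag is a simplifying assumption that will misfire on a stray < in text.

```python
import re

# Steps 1-4: attributes to throw away before listing tags.
# Assumption: attribute values are always double-quoted, as in the post's searches.
NOISY_ATTRS = re.compile(r'\s+(?:id|href|src|alt)=".*?"')

# Steps 5-6, approximated: anything between < and > counts as a tag.
TAG = re.compile(r'<[^<>]+>')

def extract_tags(html):
    """Return each opening or closing tag as its own item, with text content dropped."""
    stripped = NOISY_ATTRS.sub("", html)
    # Skip comments, doctypes, and XML declarations.
    return [t for t in TAG.findall(stripped) if not t.startswith(("<!", "<?"))]

sample = '<p class="tx" id="p1">See <a href="ch2.html" class="xref">chapter 2</a>.</p>'
print("\n".join(extract_tags(sample)))
```

Each item in the returned list corresponds to one line of the tag dump, ready for sorting.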
SortAndDedupe has only two—predictable—steps:
- Sort Lines
- Process Duplicate Lines
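The same two steps are easy to sketch in Python. The function name sort_and_dedupe is hypothetical, and the sketch assumes that Process Duplicate Lines is set to delete duplicates and leave one copy of each line; Python's default sort order may also differ from BBEdit's.

```python
def sort_and_dedupe(lines):
    """Keep one copy of each non-blank line and return them sorted."""
    unique = {line.strip() for line in lines if line.strip()}
    return sorted(unique)

tags = ['<p class="tx">', '</p>', '<p class="tx">', '<h1 class="ct">', '</h1>', '</p>']
print("\n".join(sort_and_dedupe(tags)))
```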
So, that’s it. Rocket science it ain’t, but it sure will save me a ton of time.
Theoretically, these Text Factories should also be usable in TextWrangler, which is a stripped-down, free version of BBEdit. I got error messages when I tried them in TW 3.5, and I couldn’t figure out how to run them at all in TW 4.0, which was released yesterday. If you have any luck with that, or with automating this further, please let me know.
[…] and she was generous enough to share it with everyone in a post and her India, Ink. blog, “Degristling the Sausage.” These steps are based on using a BBEdit text editor workflow, but they can probably be adapted to […]
If your goal is returning a list of every CSS class used in a bunch of (x)html files, this little python script will help.
http://pastebin.com/RThYyCss
Usage: from the CLI, type ./cssstylelist.py > dump.txt
Hope this helps.
Grazie, Gabriele!
I’m guessing that the line
path = "OEBPS/Text" # your mileage may vary
means that if your OEBPS folder is flat (as all the ones our conversion company makes are), you should delete that /Text from the path?
And then in Terminal, you should cd into the directory of the unzipped EPUB?
Where does one place the script itself?
Exactly. The path global variable sets the folder where the script takes the files from. If you happen to have all the contents in OEBPS, the line should read path = "OEBPS/".
I personally put all my scripts in the root directory, so I can easily cd into the project folder and execute the script by typing /cssstylelist.py. The script directory is irrelevant; it’s just a matter of prepending the full path before the script file itself.
Hmm. All I get is
env: python3: No such file or directory
That’s because you don’t have python 3.x installed. The default version provided in Snow Leopard (or Lion) is Python 2.7. You can install the latest Python 3 package installer from http://www.python.org/download/.
Doh! Yeah, that would do it, I suppose…
Whee! That’s lovely. Thank you so much!
But I like that my BBEdit method also pulls any local styling that’s on the HTML element itself, so I added
list.append(i.get("style"))
on line 30. Now it’s perfect. :-)
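For anyone who cannot reach the pastebin link, here is a rough sketch of the general idea: list every class name, plus any inline style attribute, found in a folder of XHTML files. It is not Gabriele's script. The function name used_styles is made up, the OEBPS folder name is taken from the discussion above, and the sketch assumes the files are well-formed XML, so it will fail on files that use HTML named entities.

```python
import glob
import os
import xml.etree.ElementTree as ET

def used_styles(folder):
    """Collect class names and inline style values from .html/.xhtml files in folder."""
    found = set()
    for path in sorted(glob.glob(os.path.join(folder, "*.*html"))):
        for element in ET.parse(path).iter():
            # A class attribute may hold several space-separated names.
            found.update(element.get("class", "").split())
            if element.get("style"):
                found.add('style="%s"' % element.get("style"))
    return sorted(found)

if __name__ == "__main__":
    # Assumption: run from the unzipped EPUB, with a flat OEBPS folder.
    print("\n".join(used_styles("OEBPS")))
```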
For those who don’t have BBEdit and who do use HTML entities in their code rather than XML ones (e.g., vs.  ), see also this other script by @gabalese.
[…] CSS in her ebooks (which therefore allows her to get rid of unused CSS) and has shared her workflow on her blog. I highly recommend checking out her […]
[…] have it done by a professional conversion service, the result is often a great deal of pointless code. India, Ink. explains how you can systematically clean up your CSS file in such a case […]
[…] extraordinary, the account by India, Ink. of how they semi-automated the work of cleaning up the […]
[…] info: http://ink.indiamos.com/2012/04/11/degristling-the-sausage/ and […]