| Author |
Message |
bizzybody Post-Apocalyptic
Joined: 22 May 2008 Posts: 161
|
Posted: Mon Dec 29, 2008 2:55 am Post subject: Can someone write a utility to replace UTF-8 codes in HTML? |
|
|
I'm still searching for a UTF-8 / Unicode remover for HTML files.
Why? Because Palm OS doesn't grok Unicode and there's no converter from HTML to any sort of Palm document format that understands the UTF-8 HTML codes.
The result is that the codes get 'translated' as nothing, not even a space, or get replaced with ? or an empty box or some very odd-looking symbols.
Not even version 6.2 of Mobipocket Desktop Reader, which can import HTML and automatically convert it to the Mobipocket format, knows what to do with UTF-8 codes (strictly speaking they're HTML numeric character references rather than UTF-8 itself). They start with an ampersand & and an octothorpe ('pound sign' in the USA) #, then a decimal number, usually up to four digits, followed by a semicolon ;. Leading zeros may be left off and the codes still work.
I can't post a visual example of a code because, unlike the %nn character codes used in URLs, phpBB does understand UTF-8 and will display the codes as the character, even between BBcode 'code' commands. The following bit is actually UTF-8 codes 0321 and 0231. If the code command worked properly, you wouldn't see the two single characters, you'd see the full UTF-8 code strings.
Anyway, for a person who writes program code on a regular basis, it shouldn't be too bleeping difficult to create a little program that sucks in an HTML text file, scans for instances of &# then reads the numbers (with code logic as needed to work with/without leading zeros) then replaces the whole &#nnnn; string with the appropriate single ASCII/ANSI character from a two column table.
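The scan-and-replace idea described above could look something like the sketch below. It is only an illustration: the function name `replace_numeric_refs` and the contents of the `REPLACEMENTS` table are assumptions made up for this example, and it only handles decimal `&#nnnn;` codes, not hexadecimal `&#x...;` ones or named codes.

```python
import re

# Hypothetical two-column table: code number -> plain ASCII replacement.
# These entries are illustrative; a real table would need many more rows.
REPLACEMENTS = {
    160: " ",     # non-breaking space
    8211: "-",    # en dash
    8212: "--",   # em dash
    8216: "'",    # left single quote
    8217: "'",    # right single quote
    8220: '"',    # left double quote
    8221: '"',    # right double quote
    8230: "...",  # ellipsis
}

# "&#", one or more decimal digits, then ";"
NUMERIC_REF = re.compile(r"&#([0-9]+);")


def replace_numeric_refs(html, table=REPLACEMENTS, fallback=None):
    """Replace each &#nnnn; code with its entry in the table.

    Codes missing from the table are left untouched unless a
    fallback string is supplied.
    """
    def substitute(match):
        code = int(match.group(1))  # int() copes with leading zeros
        if code in table:
            return table[code]
        return match.group(0) if fallback is None else fallback

    return NUMERIC_REF.sub(substitute, html)
```

Reading a file, passing its text through `replace_numeric_refs`, and writing the result back out would cover the 'sucks in an HTML text file' part. Leaving unknown codes untouched by default makes it easy to spot what still needs adding to the table.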
This can be done with WordPad or MS Word versions prior to 2003* with tedious manual search and replace, but it's utterly silly that companies like Tealpoint and Mobipocket, after so many revisions to their software, have still failed to include such a feature in their converters!
What'd be a nifty/annoying extra feature for the converter would be to convert the entire document to UTF-8 codes, except for whatever header text must be there for a browser or other HTML/Unicode aware program to recognize the document. 'Course it'd bloat the file size by 600%, since every single character would need an additional six characters as a code, assuming it put the leading zeros on the two and three digit codes. ;)
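The 600% figure can be checked with a tiny sketch. The function name `encode_all` is made up for this example, and it deliberately ignores the problem of leaving tags and header text alone, so it should only be pointed at plain text.

```python
def encode_all(text):
    """Turn every character into a zero-padded &#nnnn; code."""
    # "&#" + at least four decimal digits + ";" = 7 characters per character
    return "".join(f"&#{ord(ch):04d};" for ch in text)


sample = "Hello"
encoded = encode_all(sample)

# Growth as a percentage of the original length.
growth_percent = (len(encoded) - len(sample)) * 100 // len(sample)
```

For plain ASCII text each character becomes seven, which is six extra per character, so the growth works out to the 600% mentioned above.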
Replacing 2000+ instances of the code for an en dash with a hyphen followed by a space will reduce the size of an HTML file quite a bit.
*Word 2003 refuses to open HTML files as plain text, not even changing the extension to .txt works. I tried to open an HTML e-book to manually replace the UTF-8 codes, Word said there was some 'error' in the HTML and refused to open it. I didn't care about what Microsoft thinks is an 'error' so I changed the extension and it gave me the same @#%^$ing 'error' message. |
|
Ratzmandious Taunt me not - I control your power bill

Joined: 01 Mar 2007 Posts: 647 Location: My orbital habitat
|
Posted: Mon Dec 29, 2008 4:35 am Post subject: |
|
|
I'll see what I can do Bizzy... as you say, not a major task, should be able to whip up a basic app today sometime.
Of course I can't guarantee that I won't be lazy... ;)
And unfortunately I'm back at work, so it'll have to be fit in around my *sigh* paying work.  _________________ "I'm not hostile! I'm just aggressively interactive!" |
|
Ratzmandious Taunt me not - I control your power bill

Joined: 01 Mar 2007 Posts: 647 Location: My orbital habitat
|
Posted: Mon Dec 29, 2008 4:48 am Post subject: |
|
|
Could you email me a sample file?
ratzmandious@yahoo.co.uk
You know... I just realised that I've gone and done it again... volunteered, that is.
When will I ever learn? (After I've two-thirds finished the application...) _________________ "I'm not hostile! I'm just aggressively interactive!" |
|
bizzybody Post-Apocalyptic
Joined: 22 May 2008 Posts: 161
|
Posted: Tue Dec 30, 2008 4:01 am Post subject: |
|
|
Thanks bunches. I'll see what I can dig up amongst my e-book collection. |
|
Ratzmandious Taunt me not - I control your power bill

Joined: 01 Mar 2007 Posts: 647 Location: My orbital habitat
|
Posted: Mon Jan 05, 2009 1:40 pm Post subject: |
|
|
Well I have an alpha version of the thing here...
http://ratzmandious.110mb.com/files/UTFStripper.zip
It scans a given file for the UTF tags, lists them in a grid to let you enter the appropriate replacement character, and saves the result as a new file.
Fairly basic system, but should work. Let me know what you think. _________________ "I'm not hostile! I'm just aggressively interactive!" |
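The 'scan and list' step described in the post above could be sketched roughly as follows. This is not the UTFStripper code, which isn't shown in the thread; the function name `list_codes` is an assumption for illustration, and only decimal `&#nnnn;` codes are counted.

```python
import re
from collections import Counter

# "&#", one or more decimal digits, then ";"
NUMERIC_REF = re.compile(r"&#([0-9]+);")


def list_codes(html):
    """Return (code number, occurrence count) rows, sorted by code number."""
    # int() folds &#0233; and &#233; into the same row.
    counts = Counter(int(m.group(1)) for m in NUMERIC_REF.finditer(html))
    return sorted(counts.items())


rows = list_codes("caf&#233; &#8211; na&#0239;ve &#8211; done")
for code, count in rows:
    print(f"&#{code}; appears {count} time(s)")
```

Each row is what a grid would show for one code, leaving the replacement character to be filled in by hand.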
|
bizzybody Post-Apocalyptic
Joined: 22 May 2008 Posts: 161
|
Posted: Mon Jan 05, 2009 6:07 pm Post subject: |
|
|
Will do. Thanks.
Been busy with the holidays and moving crystal dihydrogen monoxide. Too bad the stuff has no street value, I'd have a fortune sitting in my driveway...
For samples, check out some of the HTML files here, especially in the earlier CDs. Some of Baen's newer e-books don't use UTF-8 codes.
http://baencd.thefifthimperium.com/
The newer ones with a P0 number are revisions of the older ones where the Mobipocket versions have the Author Name and cover image correctly in them, and for some odd reason they changed their filename extensions to .mobi from .prc. I dunno *why* they've done that when Mobipocket's conversion software uses .prc |
|
Crossbow I know. It is a thing I do.
Joined: 19 Apr 2005 Posts: 156 Location: UK
|
Posted: Wed Jan 07, 2009 9:23 am Post subject: |
|
|
I knew there was a reason I stuck with .lit files  _________________ Optimism is the triumph of hope over experience |
|
vexxarr Site Admin

Joined: 10 Jan 2005 Posts: 723
|
Posted: Wed Jan 07, 2009 12:13 pm Post subject: H2O |
|
|
I was always told "Hydrogen Hydroxide" by my chemistry teacher! _________________ World Conquest is easy… It’s conquering the inhabitants that gets sticky. |
|
Ratzmandious Taunt me not - I control your power bill

Joined: 01 Mar 2007 Posts: 647 Location: My orbital habitat
|
Posted: Wed Jan 07, 2009 3:01 pm Post subject: |
|
|
*grumble, grumble* I want some more crystalline dihydrogen monoxide!
The stuff I ordered melted on me.
And I tend to use rtf for my ebooks, simply because just about everything can read it... _________________ "I'm not hostile! I'm just aggressively interactive!" |
|
bizzybody Post-Apocalyptic
Joined: 22 May 2008 Posts: 161
|
Posted: Wed Sep 30, 2009 11:04 pm Post subject: |
|
|
I found a simple and easy program to convert CHM, HTML, PDF, LIT and some variants of Palm PDB files directly to MobiPocket format.
It's called AutoKindle. It was originally written to fix MobiPocket Reader eBooks to work on the Amazon Kindle (they use almost identical formats); conversion from other formats has since been added.
Now it just needs to be able to convert from TealDoc to mobi.  |
|