vexxarrcommunity Forum Index
Can someone write a utility to replace UTF-8 codes in HTML?

 
bizzybody
Post-Apocalyptic


Joined: 22 May 2008
Posts: 161

Posted: Mon Dec 29, 2008 2:55 am    Post subject: Can someone write a utility to replace UTF-8 codes in HTML?

I'm still searching for a UTF-8 / Unicode remover for HTML files.

Why? Because Palm OS doesn't grok Unicode and there's no converter from HTML to any sort of Palm document format that understands the UTF-8 HTML codes.

The result is that the codes get 'translated' as nothing, not even a space, or get replaced with a ?, an empty box, or some very odd-looking symbols.

Not even version 6.2 of Mobipocket Desktop Reader, which can import HTML and automatically convert it to the Mobipocket format, knows what to do with UTF-8 codes. They start with an ampersand & and an octothorpe ('pound sign' in the USA) #, then a number (typically four digits), followed by a semicolon ;. Leading zeros may be left off and the codes still work.

I can't post a visual example of a code because, unlike the %nn escape codes used in URLs, phpBB does understand UTF-8 codes and will display them as the character, even between BBCode 'code' tags. The following bit is actually UTF-8 codes 0321 and 0231. If the code tag worked properly, you wouldn't see the two single characters; you'd see the full UTF-8 code strings.

Code:
Ł ç


Anyway, for a person who writes program code on a regular basis, it shouldn't be too bleeping difficult to create a little program that sucks in an HTML text file, scans for instances of &#, reads the digits that follow (with logic to handle codes with or without leading zeros), and replaces the whole &#nnnn; string with the appropriate single ASCII/ANSI character from a two-column table.
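The scan-and-replace idea above can be sketched in a few lines of Python. This is only a rough sketch, not anyone's actual tool: the table contents and the names `REPLACEMENTS` and `replace_numeric_refs` are made up for illustration, and the &#nnnn; strings are treated as decimal HTML numeric character references (the number is the character's Unicode code point). Codes that are not in the table are left untouched.

```python
import re

# Hypothetical two-column table: code number -> plain replacement text.
# Extend it with whatever codes turn up in the files being converted.
REPLACEMENTS = {
    160: " ",      # non-breaking space
    8211: "-",     # en dash
    8212: "--",    # em dash
    8216: "'",     # left single quotation mark
    8217: "'",     # right single quotation mark
    8220: '"',     # left double quotation mark
    8221: '"',     # right double quotation mark
    8230: "...",   # horizontal ellipsis
}

# Matches "&#" + decimal digits + ";" and captures the digits.
NUMERIC_REF = re.compile(r"&#([0-9]+);")


def replace_numeric_refs(html, table=REPLACEMENTS):
    """Replace every &#nnnn; found in the table; leave the rest alone."""
    def substitute(match):
        code = int(match.group(1))  # int() drops any leading zeros
        return table.get(code, match.group(0))
    return NUMERIC_REF.sub(substitute, html)
```

Reading the HTML file in as text, passing it through `replace_numeric_refs`, and writing the result to a new file would cover the whole job. Converting the digits with `int()` is what makes the leading zeros irrelevant.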

This can be done with WordPad or MS Word versions prior to 2003* with tedious manual search and replace, but it's utterly silly that companies like Tealpoint and Mobipocket, after so many revisions to their software, have still failed to include such a feature in their converters!

What'd be a nifty/annoying extra feature for the converter would be to convert the entire document to UTF-8 codes, except for whatever header text must be there for a browser or other HTML/Unicode aware program to recognize the document. 'Course it'd bloat the file size 600% due to every single character requiring an additional six characters in UTF-8. Of course, it'd put the leading zeros on the two and three digit codes. ;)
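For what it's worth, the reverse trick is just as short to sketch in Python. The `encode_all` name and the `width` parameter are made up for illustration, and the sketch assumes it is only fed body text, since running it over tags or the header would break the markup.

```python
def encode_all(text, width=4):
    """Turn every character into a zero-padded &#nnnn; code."""
    return "".join(f"&#{ord(ch):0{width}d};" for ch in text)


# Two characters in, fourteen out: each one grows from 1 to 7 characters.
print(encode_all("Hi"))  # &#0072;&#0105;
```

Each character becomes seven characters (six extra), which is where the 600% bloat comes from.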

Replacing 2000+ instances of the code for an en dash with a hyphen followed by a space will reduce the size of an HTML file quite a bit.

*Word 2003 refuses to open HTML files as plain text; not even changing the extension to .txt works. I tried to open an HTML e-book to manually replace the UTF-8 codes, but Word said there was some 'error' in the HTML and refused to open it. I didn't care about what Microsoft thinks is an 'error', so I changed the extension, and it gave me the same @#%^$ing 'error' message.
Ratzmandious
Taunt me not - I control your power bill


Joined: 01 Mar 2007
Posts: 647
Location: My orbital habitat

Posted: Mon Dec 29, 2008 4:35 am

I'll see what I can do Bizzy... as you say, not a major task, should be able to whip up a basic app today sometime.

Of course I can't guarantee that I won't be lazy... ;)

And unfortunately I'm back at work, so it'll have to be fit in around my *sigh* paying work. :roll:
_________________
"I'm not hostile! I'm just aggressively interactive!"
Ratzmandious
Taunt me not - I control your power bill


Joined: 01 Mar 2007
Posts: 647
Location: My orbital habitat

Posted: Mon Dec 29, 2008 4:48 am

Could you email me a sample file?

ratzmandious@yahoo.co.uk

You know... I just realised that I've gone and done it again... volunteered, that is.

When will I ever learn? (After I've two-thirds finished the application...)
bizzybody
Post-Apocalyptic


Joined: 22 May 2008
Posts: 161

Posted: Tue Dec 30, 2008 4:01 am

Thanks bunches. :) I'll see what I can dig up amongst my e-book collection.
Ratzmandious
Taunt me not - I control your power bill


Joined: 01 Mar 2007
Posts: 647
Location: My orbital habitat

Posted: Mon Jan 05, 2009 1:40 pm

Well I have an alpha version of the thing here...

http://ratzmandious.110mb.com/files/UTFStripper.zip

It scans a given file for the UTF tags, lists them in a grid to let you enter the appropriate replacement character for each, and saves the result as a new file.

Fairly basic system, but should work. Let me know what you think.
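The scanning half of that description might look something like this in Python. It is a guess at the general approach, not the actual UTFStripper code (which isn't shown in this thread); `list_numeric_refs` is a hypothetical name, and the grid is reduced to a plain list of rows.

```python
import re
from collections import Counter

# Matches "&#" + decimal digits + ";" and captures the digits.
NUMERIC_REF = re.compile(r"&#([0-9]+);")


def list_numeric_refs(html):
    """Return one (code, character, count) row per distinct code, sorted by code."""
    counts = Counter(int(m.group(1)) for m in NUMERIC_REF.finditer(html))
    rows = []
    for code, count in sorted(counts.items()):
        # chr() only accepts valid Unicode code points, so guard against junk.
        char = chr(code) if code <= 0x10FFFF else "?"
        rows.append((code, char, count))
    return rows
```

Each row gives the code, the character it stands for, and how often it occurs, which is enough for a person to decide on a replacement for each one.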
bizzybody
Post-Apocalyptic


Joined: 22 May 2008
Posts: 161

Posted: Mon Jan 05, 2009 6:07 pm

Will do. Thanks.

Been busy with the holidays and moving crystal dihydrogen monoxide. Too bad the stuff has no street value, I'd have a fortune sitting in my driveway...

For samples, check out some of the HTML files here, especially in the earlier CDs. Some of Baen's newer e-books don't use UTF-8 codes.

http://baencd.thefifthimperium.com/

The newer ones with a P0 number are revisions of the older ones, where the Mobipocket versions have the author name and cover image correctly in them. For some odd reason they also changed their filename extensions from .prc to .mobi. I dunno *why* they've done that when Mobipocket's conversion software uses .prc.
Crossbow
I know. It is a thing I do.


Joined: 19 Apr 2005
Posts: 156
Location: UK

Posted: Wed Jan 07, 2009 9:23 am

I knew there was a reason I stuck with .lit files :P
_________________
Optimism is the triumph of hope over experience
vexxarr
Site Admin


Joined: 10 Jan 2005
Posts: 723

Posted: Wed Jan 07, 2009 12:13 pm    Post subject: H2O

I was always told "Hydrogen Hydroxide" by my chemistry teacher!
_________________
World Conquest is easy… It’s conquering the inhabitants that gets sticky.
Ratzmandious
Taunt me not - I control your power bill


Joined: 01 Mar 2007
Posts: 647
Location: My orbital habitat

Posted: Wed Jan 07, 2009 3:01 pm

*grumble, grumble* I want some more crystalline dihydrogen monoxide!

The stuff I ordered melted on me.

:?

And I tend to use RTF for my ebooks, simply because just about everything can read it...
bizzybody
Post-Apocalyptic


Joined: 22 May 2008
Posts: 161

Posted: Wed Sep 30, 2009 11:04 pm

I found a simple and easy program to convert CHM, HTML, PDF, LIT and some variants of Palm PDB files directly to MobiPocket format.

It's called AutoKindle. It was originally written to fix MobiPocket Reader eBooks to work on the Amazon Kindle (they use almost identical formats); conversion from other formats has since been added.

Now it just needs to be able to convert from TealDoc to mobi. :)
All times are GMT - 5 Hours

