Convert Word Special Characters to Plain Text
by Gary Randolph
Clients often send me documents in Microsoft Word, and I have to convert them
into HTML. The Word curly quotes and other special characters are
HTML-incompatible, causing me a lot of work in manually scanning and retyping.
So I wrote this utility to do the conversions for me.
If you want to read about the C# ASP.NET coding behind the program, scroll on
down. Otherwise, feel free to use it yourself.
How It Works
This C# ASP.NET program is extremely simple. First, the input text is
converted into a char array. Then the code loops through the array. The
Microsoft Word special characters it handles (curly apostrophe, ellipsis, and
left and right curly quotes) are caught by the switch statement, and their
plain text equivalents are added to the output string. All other characters
are added to the output string as is. Finally, the output string is written to
the output textbox.
private void btnConvert_Click(object sender, System.EventArgs e)
{
    char[] charList = txtWord.Text.ToCharArray();
    char quote = (char)34; //double quote (code 34), used instead of escaping it as "\""
    string plainText = "";
    for (int counter = 0; counter < charList.Length; counter++)
    {
        int thisChar = Convert.ToInt32(charList[counter]);
        switch (thisChar)
        {
            case 8217: //curly apostrophe
                plainText += "'";
                break;
            case 8230: //ellipsis
                plainText += "...";
                break;
            case 8220: //left curly quote
                plainText += quote.ToString();
                break;
            case 8221: //right curly quote
                plainText += quote.ToString();
                break;
            default: //anything else is copied through unchanged
                plainText += charList[counter].ToString();
                break;
        }
    }
    txtConvert.Text = plainText;
}
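If you want to try the conversion outside of an ASP.NET page, the loop above can be pulled into a standalone method. The following is only a sketch under a few assumptions: the class name WordTextCleaner and the method name ToPlainText are made up for illustration and are not part of the original program, and a StringBuilder is swapped in for the string concatenation. It handles the same four character codes as the click handler.

```csharp
using System;
using System.Text;

// Hypothetical helper class; the name is an assumption, not part of the original page.
public static class WordTextCleaner
{
    // Takes the Word text as a string and returns the plain text version,
    // using the same four character codes as the switch in btnConvert_Click.
    public static string ToPlainText(string input)
    {
        // StringBuilder replaces the += concatenation used in the handler.
        StringBuilder plainText = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            switch ((int)c)
            {
                case 8217: // curly apostrophe
                    plainText.Append('\'');
                    break;
                case 8230: // ellipsis
                    plainText.Append("...");
                    break;
                case 8220: // left curly quote
                case 8221: // right curly quote
                    plainText.Append('"');
                    break;
                default:   // everything else passes through unchanged
                    plainText.Append(c);
                    break;
            }
        }
        return plainText.ToString();
    }
}
```

For example, WordTextCleaner.ToPlainText("\u201CIt\u2019s done\u2026\u201D") returns "It's done..." wrapped in straight double quotes. Any other character codes you come across can be added as further cases in the same way.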
About the only other thing worth mentioning is how I discovered which characters
were Word special characters. During development I added an if test to the loop
that caught all characters with a code over 255 and captured them in a listbox
(see the code below). Then I could just read them off the screen.
if (thisChar > 255)
{
    ListItem newItem = new ListItem();
    newItem.Text = thisChar.ToString() + " - " + charList[counter].ToString();
    newItem.Value = thisChar.ToString();
    lbChars.Items.Add(newItem);
}
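The same discovery step can also be tried without a listbox. The sketch below assumes a made-up class SpecialCharFinder with a method FindSpecialChars (neither is in the original program) and simply returns the "code - character" entries as a list of strings, using the same over-255 test as the snippet above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper class; the name is an assumption, not part of the original page.
public static class SpecialCharFinder
{
    // Returns one "code - character" entry for every character whose code
    // is over 255, in the order the characters appear in the input.
    public static List<string> FindSpecialChars(string input)
    {
        List<string> found = new List<string>();
        foreach (char c in input)
        {
            int code = (int)c;
            if (code > 255)
            {
                found.Add(code.ToString() + " - " + c.ToString());
            }
        }
        return found;
    }
}
```

Calling SpecialCharFinder.FindSpecialChars on a pasted Word paragraph gives one entry per special character, repeats included, so a character that shows up often will appear in the list many times.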
Copyright (c) 2007