Getting Only The Text Displayed On A Webpage Using C#

After months of looking at various ways to get only the text displayed in a web browser using C#, it all boiled down to a few simple lines of code.  I looked at several very robust open source .NET solutions, such as the HTML Agility Pack and Majestic 12.  However, for applications that only need the tag-free, HTML-free text from a web page, these solutions seemed like overkill, at least in my case.

Here are three very simple ways to get only the displayed text on a web page:

Method 1 – In Memory Cut and Paste

Use a WebBrowser control to load the web page, and then copy the text from the control.

Use the following code to download the web page:

// Requires: using System.Windows.Forms;
// Create the WebBrowser control
WebBrowser wb = new WebBrowser();
// Add an event handler to process the document when the download is completed
wb.DocumentCompleted +=
    new WebBrowserDocumentCompletedEventHandler(DisplayText);
// Download the web page (urlPath must be a System.Uri; for a string, call wb.Navigate(urlPath) instead)
wb.Url = urlPath;

Use the following event handler to process the downloaded web page text:

private void DisplayText(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    WebBrowser wb = (WebBrowser)sender;
    // Select everything on the page and copy it to the clipboard
    wb.Document.ExecCommand("SelectAll", false, null);
    wb.Document.ExecCommand("Copy", false, null);
    // CleanText is a helper that is not shown in this post; it tidies the copied text
    textResultsBox.Text = CleanText(Clipboard.GetText());
}
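The CleanText helper called in the handler is not listed in this post. As a rough idea of what such a helper might do, here is a minimal sketch that only normalises whitespace. The class name TextCleaner and the exact clean-up rules are assumptions for illustration, not the original implementation.

```csharp
using System.Text.RegularExpressions;

public static class TextCleaner
{
    // Hypothetical stand-in for the CleanText helper used in Method 1.
    // It only tidies whitespace; the real helper may do more or less.
    public static string CleanText(string raw)
    {
        if (string.IsNullOrEmpty(raw))
        {
            return string.Empty;
        }

        // Normalise Windows and old Mac line endings to "\n"
        string text = raw.Replace("\r\n", "\n").Replace('\r', '\n');

        // Collapse runs of spaces and tabs into a single space
        text = Regex.Replace(text, @"[ \t]+", " ");

        // Drop the single space left next to each line break
        text = Regex.Replace(text, @" ?\n ?", "\n");

        // Collapse three or more line breaks into one blank line
        text = Regex.Replace(text, @"\n{3,}", "\n\n");

        return text.Trim();
    }
}
```

With this sketch, TextCleaner.CleanText("  Hello   world \r\n\r\n\r\n\r\nBye  ") returns "Hello world", a blank line, and then "Bye".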

Method 2 – In Memory Selection Object

This is a second way of processing the downloaded web page text.  It seems to take slightly longer, though the difference is minimal.  However, it avoids the clipboard and its limitations, such as overwriting whatever the user last copied.

// Requires a reference to Microsoft.mshtml and: using mshtml;
private void DisplayText(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    // Get the WebBrowser control and its underlying IHTMLDocument2
    WebBrowser wb = (WebBrowser)sender;
    IHTMLDocument2 htmlDocument =
        wb.Document.DomDocument as IHTMLDocument2;
    // Select all the text on the page and create a selection object
    wb.Document.ExecCommand("SelectAll", false, null);
    IHTMLSelectionObject currentSelection = htmlDocument.selection;
    // Create a text range and send the range's text to your text box
    IHTMLTxtRange range = currentSelection.createRange() as IHTMLTxtRange;
    textResultsBox.Text = range.text;
}

Method 3 – The Elegant, Simple, Slower XmlDocument Approach

A good friend shared this example with me.  I am a huge fan of simple, and this example wins the simplicity contest hands down.  It was unfortunately very slow compared to the other two approaches.

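The XmlDocument object will load and process a page with only 3 simple lines of code, provided the page is well-formed XML (XHTML, for example); otherwise Load throws an XmlException:

// Requires: using System.Xml;
XmlDocument document = new XmlDocument();
// Load needs a full URL including the scheme; a bare "www..." string is treated as a local file path
document.Load("http://www.yourwebsite.com");
string allText = document.InnerText;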
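To see what InnerText actually returns without hitting a live site, here is a small sketch that runs the same idea against an in-memory string. The class and method names (XhtmlText, ExtractText) are made up for this example, and it assumes the markup is well-formed XML.

```csharp
using System.Xml;

public static class XhtmlText
{
    // Illustrative helper: returns the concatenated text of a well-formed
    // XHTML string. LoadXml throws XmlException if the markup is not
    // well-formed XML, which is common for real-world HTML.
    public static string ExtractText(string xhtml)
    {
        XmlDocument document = new XmlDocument();
        document.LoadXml(xhtml);

        // InnerText joins the text of all child nodes with no separator
        return document.InnerText;
    }
}
```

For example, XhtmlText.ExtractText("<html><body><h1>Title</h1><p>Body text</p></body></html>") returns "TitleBody text". The text of neighbouring elements runs together, so you may still want to post-process the result.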

There you have it!  Three simple ways to scrape only displayed text from web pages with no external “packages” involved.

Packages

I have recently used the WatiN web application testing package to get website text using C#. WatiN was not the easiest package to set up for website text retrieval from C#, as it required references to the WatiN core DLL, Microsoft.mshtml, and System.Windows.Forms, plus several additional classes included in my project. However, I still think it is worth mentioning, because I like the results it produces. The package is stable and very simple to use once you get it set up. In fact, the website text can be obtained using only 3 lines of code:

var browser = new MsHtmlBrowser();
browser.GoTo("www.YourURLHere.com");
commandLog.Text = browser.Text;

I have included a simple Visual Studio ASP.NET project for download here.


8 thoughts on “Getting Only The Text Displayed On A Webpage Using C#”

  1. Could you please elaborate on this line of code in Method 1?

    textResultsBox.Text = CleanText(Clipboard.GetText());

    How do I get the code for CleanText? What does this method do? Could you provide the code for that method?

  2. How would I go about using this in a console app? This is what I have come up with so far, but the problem I am having is that it's displaying all the text more than once. Here is my code:

    public class Program
    {
        private bool completed = false;
        private static WebBrowser wb;

        [STAThread]
        private static void Main(string[] args)
        {
            Program p = new Program();

            wb = new WebBrowser();
            wb.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(Displaytext);

            wb.Navigate("www.pcwatch.cc");

            while (!p.completed)
            {
                Application.DoEvents();
                Thread.Sleep(1);
            }

            Console.WriteLine();
            Console.ReadLine();

        }

        private static void Displaytext(object sender, WebBrowserDocumentCompletedEventArgs e)
        {
            WebBrowser wb = (WebBrowser)sender;
            wb.Document.ExecCommand("SelectAll", false, null);
            wb.Document.ExecCommand("Copy", false, null);
            Console.WriteLine(Clipboard.GetText().ToString());
        }

    }
    }

  3. First off I want to say superb blog! I had a quick question which I’d like to ask if you don’t mind. I was curious to know how you center yourself and clear your head before writing. I’ve had a tough time clearing my mind in getting my thoughts out. I do enjoy writing but it just seems like the first 10 to 15 minutes are usually lost just trying to figure out how to begin. Any ideas or hints? Appreciate it!

    • My biggest suggestion would be to iterate as many times as needed until you are happy with content. You must start anywhere to start making improvements from anywhere. Just get your scattered thoughts on paper, and then take a break, organize, and repeat until satisfied!
