How to strip HTML markup from string? - ASP.NET General

  1. #1

    How to strip HTML markup from string?

    Hello,

    I want to convert text with HTML markup to plain text. Is there a
    simple way to do it?

    I could certainly write my own function that simply strips everything
    between < and >. But if someone has already written something similar for
    .NET, I would prefer a cleverer solution that tries to retain the original
    layout, at least paragraphs, hyperlinks, etc. - something like what Outlook
    does when converting HTML to plain text.


    --
    Michal A. Valasek, Altair Communications, [url]http://www.altaircom.net[/url]
    Please do not reply to this e-mail, for contact see [url]http://www.rider.cz[/url]
    Keeping Freedom safe from Democracy


    Michal A. Valasek Guest

  2. #2

    Re: How to strip HTML markup from string?

    ' VBScript (classic ASP)
    Function stripHTML(strHTML)
        'Strips the HTML tags from strHTML

        Dim objRegExp, strOutput
        Set objRegExp = New RegExp

        objRegExp.IgnoreCase = True
        objRegExp.Global = True
        'Lazily match "<", any characters (including line feeds), then ">"
        objRegExp.Pattern = "<(.|\n)+?>"

        'Replace all HTML tag matches with the empty string
        strOutput = objRegExp.Replace(strHTML, "")

        'Encode any stray < and > that remain as &lt; and &gt;
        strOutput = Replace(strOutput, "<", "&lt;")
        strOutput = Replace(strOutput, ">", "&gt;")

        stripHTML = strOutput 'Return the value of strOutput

        Set objRegExp = Nothing
    End Function

    MS News (MS ILM) Guest

  3. #3

    RE: How to strip HTML markup from string?

    Hello Michal,

    The page on 4guysfromrolla.com (suggested by Ravikanth) and the RegEx approach (suggested by another developer) could work for you.

    However, there are some other issues. Even after you strip out all the HTML tags, you may be left with HTML-encoded
    strings (character entities), which you will also want to decode. These are easily handled with

    System.Web.HttpUtility.HtmlDecode()
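
    To illustrate the two steps together (strip the tags, then decode the entities), here is a minimal sketch. It is written in
    Python rather than .NET purely for illustration; html.unescape plays the role that HtmlDecode plays above, and the names
    TAG_RE and strip_and_decode are made up for this example.

```python
import html
import re

# Matches anything that looks like a tag; [^>] also matches line feeds,
# so tags that span several lines are covered.
TAG_RE = re.compile(r"<[^>]+>")


def strip_and_decode(markup: str) -> str:
    """Remove tag-like spans first, then decode character entities."""
    without_tags = TAG_RE.sub("", markup)
    return html.unescape(without_tags)


print(strip_and_decode("<p>Fish &amp; chips &lt;hot&gt;</p>"))
# prints: Fish & chips <hot>
```

    The order matters: if the entities were decoded first, the literal "<hot>" would look like a tag and be stripped as well.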

    And now, the long explanation of why this won't be good enough. There are still several unresolved issues (others have
    posted about these before):

    1) Rendered line feeds versus actual line feeds. In any HTML source, the line feeds in the markup are generally NOT the
    ones that are rendered. BR, P and other elements determine the position of text on the rendered page.

    2) What will you do with elements outside the BODY tag, and with text that is left over between the tags of elements
    such as OBJECT or SCRIPT?

    3) Complex pages that have multiple DIV, LAYER or SPAN elements, some of which are only displayed conditionally
    based on things such as browser version or client-side events.

    4) TABLEs. Even though the HTML source for a table is written left to right, rows and columns can be spanned,
    so two words that are rendered next to each other on the page may not be next to each other in the source code.
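
    For points 1 and 2, one possible approach is to walk the markup with a real parser instead of a regex. The sketch below is
    in Python, for illustration only, and uses the standard html.parser module. The class name, the helper html_to_text, the
    choice of which tags count as block-level or skipped, and the sample URL are all assumptions made for this example. It
    does not attempt points 3 and 4.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and turns block elements into line breaks."""

    BLOCK_TAGS = {"p", "div", "br", "tr", "li", "h1", "h2", "h3"}
    SKIP_TAGS = {"script", "style", "head"}  # content is never rendered

    def __init__(self):
        super().__init__(convert_charrefs=True)  # entities arrive decoded
        self.parts = []
        self.skip_depth = 0  # greater than 0 while inside a skipped element
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self.skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n")
        elif tag == "a":
            self.href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag == "p":
            self.parts.append("\n")
        elif tag == "a" and self.href:
            # Keep the hyperlink target next to the link text
            self.parts.append(f" <{self.href}>")
            self.href = None

    def handle_data(self, data):
        if not self.skip_depth:
            # Line feeds in the source are not rendered: collapse whitespace
            self.parts.append(re.sub(r"\s+", " ", data))


def html_to_text(markup: str) -> str:
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    lines = (line.strip() for line in "".join(parser.parts).split("\n"))
    return "\n".join(line for line in lines if line)


sample = """<html><head><title>T</title><script>var x = "<b>";</script></head>
<body><p>First
line</p><p>See <a href="http://www.example.com/">the site</a><br>now.</p></body></html>"""

print(html_to_text(sample))
```

    Here the line feed inside "First line" is collapsed to a space, while P and BR start new lines and the SCRIPT and TITLE
    text is dropped. A page with an unclosed HEAD or SCRIPT tag would defeat the skip counter, so treat this as a starting
    point only.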

    Basically, you need to decide, in advance, what you are looking for and what your end result is going to be. If you're just
    trying to parse a simple HTML page with a reasonably predictable format, then a simple regex will do the trick. If you are
    looking for specific elements containing some important text, then a regex plus a For...Next loop through the matches
    would be in order.
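
    As a rough illustration of the "regex plus a loop through the matches" idea, here is a sketch that pulls out the link text
    and target of every anchor element. It is in Python, for illustration only; the pattern, the name find_links and the choice
    of anchors as the "specific elements" are assumptions for this example, and the pattern only copes with quoted href values.

```python
import re

# Hypothetical target: the href value and inner markup of each <a> element.
LINK_RE = re.compile(
    r'<a\s[^>]*?href\s*=\s*["\']([^"\']*)["\'][^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)


def find_links(markup):
    """Return (label, url) pairs for every anchor found in markup."""
    links = []
    for match in LINK_RE.finditer(markup):
        url, inner = match.group(1), match.group(2)
        # The link text may itself contain tags such as <b>; strip them
        label = re.sub(r"<[^>]+>", "", inner).strip()
        links.append((label, url))
    return links


page = '<p><a href="http://www.example.com/">One</a> and <A class="x" HREF=\'/two\'><b>Two</b></A></p>'
for label, url in find_links(page):
    print(label, "->", url)
```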



    Best regards,
    Yanhong Huang
    Microsoft Online Partner Support

    Get Secure! - [url]www.microsoft.com/security[/url]
    This posting is provided "AS IS" with no warranties, and confers no rights.


    Yan-Hong Huang[MSFT] Guest
