How to strip HTML markup from string?


  1. #1

    How to strip HTML markup from string?

    Hello,

    I want to convert text with HTML markup to plain text. Is there a simple
    way to do it?

    I can certainly write my own function that simply strips everything between
    < and >. But if someone has already written something similar for .NET, I
    would prefer a cleverer solution that tries to retain the original layout,
    at least paragraphs, hyperlinks, etc., something like what Outlook does
    when converting HTML to plain text.


    --
    Michal A. Valasek, Altair Communications, http://www.altaircom.net
    Please do not reply to this e-mail, for contact see http://www.rider.cz
    Keeping Freedom safe from Democracy


    Michal A. Valasek Guest

  3. #2

    Re: How to strip HTML markup from string?

    Function stripHTML(strHTML)
        'Strips the HTML tags from strHTML

        Dim objRegExp, strOutput
        Set objRegExp = New RegExp

        objRegExp.IgnoreCase = True
        objRegExp.Global = True
        'Lazily match "<", any characters (including newlines), then ">"
        objRegExp.Pattern = "<(.|\n)+?>"

        'Replace all HTML tag matches with the empty string
        strOutput = objRegExp.Replace(strHTML, "")

        'Escape any remaining < and > as &lt; and &gt;
        strOutput = Replace(strOutput, "<", "&lt;")
        strOutput = Replace(strOutput, ">", "&gt;")

        stripHTML = strOutput 'Return the value of strOutput

        Set objRegExp = Nothing
    End Function
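
To see what the stripHTML function in this post does to real input, here is a rough port of the same logic to Python. It is only a sketch: the name strip_html and the sample strings are made up for illustration, and Python's re module stands in for the VBScript RegExp object. In .NET the corresponding call would be Regex.Replace.

```python
import re

# Same pattern as the VBScript version: "<", then one or more characters
# (including newlines), matched lazily up to the nearest ">".
TAG_PATTERN = re.compile(r"<(.|\n)+?>", re.IGNORECASE)


def strip_html(text: str) -> str:
    """Remove anything that looks like a tag, then escape stray brackets."""
    stripped = TAG_PATTERN.sub("", text)
    # Any "<" or ">" still present was not part of a complete tag.
    return stripped.replace("<", "&lt;").replace(">", "&gt;")


# A tag split across two lines is still removed; the lone ">" is escaped.
print(strip_html("<a\nhref='x.html'>link</a> 2 > 1"))

# Caveat: text between an unrelated "<" and ">" is treated as a tag.
print(strip_html("1 < 2 and 3 > 2"))
```

The second call shows the main weakness of the pattern: plain text that happens to contain both a "<" and a later ">" loses everything in between.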

    MS News (MS ILM) Guest

  4. #3

    RE: How to strip HTML markup from string?

    Hello Michal,

    The page on 4guysfromrolla.com (suggested by Ravikanth) and the regular expression (suggested by another developer) could both work for you.

    However, there are some other issues. Even after you strip out all the HTML tags, you may be left with HTML-encoded
    strings (character entities such as &amp; or &nbsp;), which you will also want to decode. These are easily handled with

    System.Web.HttpUtility.HtmlDecode()
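
As a rough illustration of the two steps together (strip the tags, then decode the entities), here is a minimal Python sketch. Python's html.unescape plays the role that HtmlDecode plays in .NET; the function name to_plain_text and the sample string are invented for the example.

```python
import html
import re

# Anything between "<" and the next ">" is treated as a tag.
TAG_PATTERN = re.compile(r"<[^>]+>")


def to_plain_text(markup: str) -> str:
    """Strip tags first, then decode character entities."""
    without_tags = TAG_PATTERN.sub("", markup)
    # Decoding must come second: otherwise an encoded "&lt;today&gt;"
    # would turn into "<today>" and be stripped as if it were a tag.
    return html.unescape(without_tags)


print(to_plain_text("<p>Fish &amp; chips &lt;today&gt;</p>"))
# prints: Fish & chips <today>
```

Note that &nbsp; decodes to a non-breaking space (U+00A0), not an ordinary space, so you may want to replace it afterwards if the output goes into plain-text e-mail.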

    And now, the long explanation of why this won't be good enough. There are still many unresolved issues (these were raised
    by others in earlier posts):

    1) Rendered line feeds versus actual line feeds. In any HTML source, the line feeds in the markup are generally NOT the
    ones that are rendered. BR, P and similar elements determine where text breaks on the rendered page.

    2) Content outside the BODY tag, and text left over inside elements such as OBJECT or SCRIPT. You need to decide what to
    do with it.

    3) Complex pages that have multiple DIV, LAYER or SPAN elements - some of which are only displayed conditionally
    based on things such as browser version or client-side events.

    4) TABLEs. Even though the HTML source for a table is entered in a left-to-right fashion, rows and columns can be spanned
    so you may not find two words which are rendered together on the page to be next to each other in the source code.
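
Issues 1 and 2 can be handled reasonably well with a real parser instead of a regex. The following is only a sketch in Python using the standard html.parser module; the class name, the tag lists and the "[url]" rendering of hyperlinks are all choices made for this example, not anything prescribed earlier in the thread. It does not attempt to solve issue 3 (conditionally displayed content) or issue 4 (spanned table cells).

```python
import re
from html.parser import HTMLParser

# Assumed for this sketch: tags that start a new line, and tags whose
# content is dropped entirely.
BLOCK_TAGS = {"p", "div", "br", "tr", "li", "h1", "h2", "h3"}
SKIPPED_TAGS = {"head", "script", "style", "object"}


class PlainTextExtractor(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)  # entities are decoded for us
        self.parts = []
        self.skip_depth = 0
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1
        elif self.skip_depth:
            return
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")
        elif tag == "a":
            self.href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth:
            return
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")
        elif tag == "a" and self.href:
            # Keep the link target next to the link text.
            self.parts.append(f" [{self.href}]")
            self.href = None

    def handle_data(self, data):
        if not self.skip_depth:
            # Source line feeds are not rendered, so collapse them to spaces.
            self.parts.append(re.sub(r"\s+", " ", data))

    def text(self):
        lines = (line.strip() for line in "".join(self.parts).split("\n"))
        return "\n".join(line for line in lines if line)


def html_to_text(markup: str) -> str:
    extractor = PlainTextExtractor()
    extractor.feed(markup)
    extractor.close()
    return extractor.text()
```

For example, feeding it a page whose body is `<p>First\nline<br>second line</p>` yields two output lines, "First line" and "second line": the source line feed disappears and the BR produces the break, which is the behaviour described in issue 1.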

    Basically, you need to decide, in advance, what you are looking for and what your end result is going to be. If you're just
    trying to parse a simple HTML page with a reasonably predictable format then a simple regex will do the trick. If you are
    looking for specific elements with some important text then a regex and running a for...next loop through the matches would
    be in order.
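
The "regex plus a loop over the matches" approach might look like the sketch below, which pulls hyperlinks out of a reasonably predictable page. It is written in Python for brevity; in .NET the same idea would use Regex.Matches with a For Each loop. The pattern and the name extract_links are assumptions for this example, and the pattern only copes with quoted href values.

```python
import re

# Captures the href value (group 1) and the link body (group 2).
LINK_PATTERN = re.compile(
    r'<a\s[^>]*?href\s*=\s*["\']([^"\']*)["\'][^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)


def extract_links(markup: str) -> list:
    """Return (label, url) pairs for each anchor found in the markup."""
    links = []
    for match in LINK_PATTERN.finditer(markup):
        url = match.group(1)
        # The link body may contain nested tags such as <b>; drop them.
        label = re.sub(r"<[^>]+>", "", match.group(2)).strip()
        links.append((label, url))
    return links


sample = (
    '<p><a href="http://www.altaircom.net">Altair <b>Communications</b></a>'
    " and <A HREF='http://www.rider.cz'>Rider</A></p>"
)
for label, url in extract_links(sample):
    print(label, "->", url)
```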



    Best regards,
    Yanhong Huang
    Microsoft Online Partner Support

    Get Secure! - www.microsoft.com/security
    This posting is provided "AS IS" with no warranties, and confers no rights.

    --------------------
    !From: "Michal A. Valasek" <news@altaircom.net>
    !Subject: How to strip HTML markup from string?
    !Date: Sat, 9 Aug 2003 04:48:20 +0200
    !Newsgroups: microsoft.public.dotnet.framework.aspnet


    Yan-Hong Huang[MSFT] Guest
