Is there any way to mark an object as "always in use" (specifically, in a C extension)? - Ruby

  1. #1

    Is there any way to mark an object as "always in use" (specifically, in a C extension)?

    Some background ...

    I have an application where there are many identical strings (the data consists of huge chunks of XML, with a lot of duplication in both the tag names and the CDATA content).

    I've written a tiny XML parser in C, because trying to load these documents using REXML ran all night and was still running the next day, presumably due to the size (hundreds of thousands of tags).

    Anyway, to reduce the memory used, given the repetitive nature of a lot of the data, I decided to store the strings as a (C coded) hash table of VALUE objects.

    Changes to the data are very, very few, so when this happens, I just create a new Ruby string, so the values in the hash table never change.
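The de-duplication idea described above can be illustrated with a minimal Ruby-level sketch. The poster's real table is hand-written C, so this is only an analogy: the `StringPool` class and its `intern` method are hypothetical names invented for this example. Each distinct string is stored once, and later lookups with equal content return the same object.

```ruby
# Hypothetical pool that keeps one frozen copy of each distinct string.
class StringPool
  def initialize
    @table = {}   # maps string content to the single shared copy
  end

  # Return the shared copy of +text+, adding it the first time it is seen.
  def intern(text)
    @table[text] ||= text.dup.freeze
  end

  # Number of distinct strings held in the pool.
  def size
    @table.size
  end
end

pool = StringPool.new
a = pool.intern("chapter")
b = pool.intern("chap" + "ter")   # equal content, built separately
c = pool.intern("title")

puts a.equal?(b)   # do both lookups share one object?
puts pool.size     # how many distinct strings are stored
```

Because the pooled strings are frozen, a change to the data has to produce a new string, which matches the "values never change" rule in the post.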

    Now, to my questions ...

    I found that when I played with particularly large documents, my code fell over with what looked like some kind of memory corruption. I eventually twigged to the fact that Ruby might be garbage collecting some of the strings I'd constructed, because my C code wasn't doing any rb_gc_mark() calls. That definitely seemed to be the story, because when I wrote a mark function that just went through the entire hash table and marked each value, the corruption disappeared.

    So, I guess my questions are: (1) is this likely to be what was really going wrong, or did adding the rb_gc_mark() calls fix the problem by pure luck and it's waiting to bite me again, further down the track; (2) is there some way I can mark all of those objects as always being in use, so that they'll never be considered for garbage collection; and more importantly (3) is there a better way to achieve this?

    Thanks in advance,

    Harry O.



    Harry Guest

  2. #2

    Hi,

    At Fri, 6 Feb 2004 10:29:43 +0900,
    Harry Ohlsen wrote in [ruby-talk:91665]: 

    Seems correct.
     

    You may want to use rb_gc_register_address()?
     

    It's a single big hash for the whole process, not one per instance,
    right?

    2 ways:
    1. rb_gc_register_address(),

    2. make the hash a hidden instance variable of some class, if
    one exists.
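The second option relies on reachability: anything referenced from a live class is never collected. In a C extension the variable can be hidden by setting it under a name without a leading `@`, which Ruby code cannot address. The sketch below is only a Ruby-level analogue of that idea, not the C API, and `XmlStrings` and `store` are hypothetical names. It uses the standard `weakref` library to check that a stored string survives a GC run.

```ruby
require "weakref"

# Hypothetical class that owns the shared table. The table lives in a
# class-level instance variable, so everything in it stays reachable for
# as long as the class itself exists.
class XmlStrings
  @table = {}

  # Store +text+ once and return the shared frozen copy.
  def self.store(text)
    @table[text] ||= text.dup.freeze
  end
end

# Keep only a weak reference here, so the table is the sole strong holder.
kept = WeakRef.new(XmlStrings.store("paragraph"))
GC.start
puts kept.weakref_alive?   # reports whether the string survived the GC run
```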

    --
    Nobu Nakada


    nobu.nokada@softhome.net Guest

  3. #3

    nobu.nokada@softhome.net wrote:
    > You may want to use rb_gc_register_address()?

    Thanks. I'll look that up!
     

    There's a single hash table, that's not registered as a Ruby object. However, the data stored in the table are all VALUE objects obtained from rb_str_new2().

    There is a single instance of an object implemented in C that Ruby *does* know about, and that object holds many references to the strings in the hash table. For one document, for example, there were 2.7 million references to around 87,000 strings, totalling just under 267,000 bytes of text. It's the ..._mark() function of that class where I currently mark all of the strings.

    While it doesn't *seem* to be taking a huge amount of time to do that each time, I'd like to try to avoid it, just to see whether that really is the case ... ie, see whether the actual time being used is significant. In any case, if it's easy to fix, it just doesn't make sense to keep marking them over and over.
     

    That might work, but as things are currently, Ruby doesn't know anything about the hash table. It's simply an implementation detail of the extension. However, if I can't work out how to get it working with rb_gc_register_address(), I'll see if I can do something along these lines.

    Thanks for the suggestions!

    Harry O.


    Harry Guest

  4. #4

    Hi,

    At Fri, 6 Feb 2004 11:49:59 +0900,
    Harry Ohlsen wrote in [ruby-talk:91669]: 
    > There's a single hash table, that's not registered as a Ruby
    > object. However, the data stored in the table are all VALUE
    > objects obtained from rb_str_new2().

    You mean struct st_table? If so, you may register a Hash
    instance to GC and use its tbl member directly.
     

    Ruby's current GC copes poorly with large numbers of live objects.
     

    Generational GC may help you, but it isn't incorporated yet.

    --
    Nobu Nakada


    nobu.nokada@softhome.net Guest

  5. #5

    nobu.nokada@softhome.net wrote:
    > You mean struct st_table?

    Sorry, I should have been more specific. The hash table is just some C code I wrote to implement one. It's only the data held in it that are Ruby objects (String). It's an interesting point, though. Maybe I could save myself some code by changing the C to use a Ruby hash.

    I've only learned enough about C extensions to get done what I needed. I plan to do some serious study when I get a chance. I must say, it was pretty easy to get started ... as I would expect from anything related to Ruby, of course!
     

    That definitely sounds simpler.
     

    I have a feeling this is why REXML had a problem loading the document, because it probably needs to create quite a few other (sub-)objects for each XML tag, hence it would *really* be working hard!
     

    I've seen mention of GGC a number of times on the list. Is there a plan to add it to Ruby 1.X.Y, or will we have to wait until version 2?

    Cheers,

    Harry O.



    Harry Guest

  6. #6

    At Fri, 06 Feb 2004 10:29:43 +0900 wrote Harry Ohlsen:
     

    Have you already tried xmlparser (a wrapper around
    expat)? It's quite fast. I use it for huge XML documents where
    rexml and nqxml are way too slow.

    Ralf.

    Ralf Guest

  7. #7

    Ralf Horstmann wrote:
     
    Back when I originally wrote it, I didn't have control of the box and
    hence couldn't get expat installed easily, so I didn't look any further
    at the time. However, I might give it a go by installing it in my own
    account. I also didn't have a lot of time to get this up and running
    back then.

    Since I already had some C code that did what I wanted (and nothing
    more), I figured it would be faster to wrap it ... plus, in the back of
    my mind, I'm sure I was thinking "what a great opportunity to learn how
    to do C extensions" :-).

    Nobu's suggestion worked fine, although I've not benchmarked yet to see
    whether the change has made a significant difference ... this thing
    takes quite a while to run, so it's hard to tell unless you think to
    look at the clock, or print some timestamps out, which is what I'll do
    when I get back to work on Monday.
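Printing timestamps around a long-running step can be done with a small helper like the one below. This is a sketch only: `timed` and the label are hypothetical names, and the block being timed is a stand-in for the real document load.

```ruby
# Time a block with the monotonic clock and print the elapsed seconds.
# Returns the block's result together with the elapsed time.
def timed(label)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  result = yield
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  puts format("%s: %.3f s", label, elapsed)
  [result, elapsed]
end

# Stand-in workload; replace the block with the real load step.
total, seconds = timed("sum") { (1..100_000).sum }
```

Running the same step before and after a change gives two figures that can be compared directly, without having to watch the clock.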
     
    Just out of interest, how large was your "huge"? Some of my documents
    are (literally) hundreds of megabytes.

    The other point I should make is that this application has to be able to
    make fairly arbitrary changes to the DOM, like moving whole subtrees
    around, and the changes are user-defined, hence I can't even use some
    kind of smart housekeeping, so event driven won't work for me.

    Cheers,

    Harry O.




    Harry Guest

  8. #8

    At Sat, 07 Feb 2004 07:46:33 +0900 wrote Harry Ohlsen:
     
    > Just out of interest, how large was your "huge"? Some of my documents
    > are (literally) hundreds of megabytes.

    I just checked and found it to be about 10 megabytes. So actually not that
    much data. But it was already enough to let rexml run for hours :-)

    Regards,
    Ralf.

    Ralf Guest

  9. #9

     
    > > Just out of interest, how large was your "huge"? Some of my documents
    > > are (literally) hundreds of megabytes.
    >
    > I just checked and found it to be about 10 megabytes. So actually not that
    > much data. But it was already enough to let rexml run for hours :-)

    There seems to be a problem/bug/whatever with the current version of
    REXML that makes large files take extra long to process. It reads the
    entire file in before it starts processing, which kills performance. Try
    adding this code to your program:


    module REXML
      class IOSource
        alias_method :_initialize, :initialize

        def initialize(arg, block_size=500)
          @er_source = @source = arg
          @to_utf = false
          # Read only up to the first '>' instead of the whole file.
          @line_break = '>'
          super @source.readline(@line_break)
          @line_break = encode('>')
        end
      end
    end

    That seems to fix the problem for other people.

    --
    Zachary P. Landau
    GPG: gpg --recv-key 0x24E5AD99 | http://kapheine.hypa.net/kapheine.asc


    Zachary Guest

  10. #10

    Zachary P. Landau wrote:
     
    Is this a new problem introduced in a recent version? If so, it's
    probably not the cause of the slowness I was seeing, because I tried it
    about five or six months ago.

    However, it's definitely worth knowing about that patch for the next
    time I want to do some XML processing, because REXML is just so nice to
    use that it would normally be my first choice!

    Cheers,

    Harry O.




    Harry Guest

  11. #11


    On Sun, Feb 08, 2004 at 09:02:38AM +0900, Harry Ohlsen wrote:
    > Is this a new problem introduced in a recent version? If so, it's
    > probably not the cause of the slowness I was seeing, because I tried it
    > about five or six months ago.

    The problem came with 1.8.1, so that wouldn't have been the cause.
    Your problem was probably just a huge file :P

    --
    Zachary P. Landau
    GPG: gpg --recv-key 0x24E5AD99 | http://kapheine.hypa.net/kapheine.asc


    Zachary Guest
