
Net::http.get has a 50K limit? - Ruby


  1. #1

    Net::http.get has a 50K limit?

    I'm trying to write a screen scraper and am getting what looks like a 50K
    limit on the data returned:

    require "net/http"

    begin
      Net::HTTP.start("www.washingtonpost.com", 80) { |http|
        response, = http.get("/wl/jobs/JS_JobSearch?TS=1012409733026")
        data = response.body
        puts data.length
      }
    rescue => err
      puts "Error: #{err}"
      exit
    end

    The last line prints 52166 (the file is considerably bigger). What did I
    do wrong?


    Meihua Guest

  2. #2

    Re: Net::http.get has a 50K limit?

    Just a small change required.

    |require "net/http"
    |begin
    |  Net::HTTP.start("www.washingtonpost.com", 80) { |http|
    |    response, = http.get("/wl/jobs/JS_JobSearch?TS=1012409733026")

    Replace the plain get with the block form, which streams the body straight
    into a file as it arrives:

    File.open("/some/file", "wb+") { |f|
      resp, = http.get("/wl/jobs/JS_JobSearch?TS=1012409733026", nil) { |gotit|
        f.print(gotit)
      }
    }


    |    data = response.body
         ^^^^^^^^^^^^^^^^^^^^

    When a block is given, the data is no longer part of the response object;
    this behaviour has changed from Ruby 1.6. (A small sketch of the changed
    behaviour follows after this post.)

    regs
    Vivek


    |    puts data.length
    |  }
    |rescue => err
    |  puts "Error: #{err}"
    |  exit
    |end
    |
    |The last line prints 52166 (the file is considerably bigger). What did I
    |do wrong?

    --

    Accept that some days you are the pigeon and some days the statue



    Vivek Guest
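
    To make the change in behaviour concrete, here is a small self-contained
    sketch, assuming Ruby 1.8 or later (the URL is the one from the thread and
    has long since changed, so treat it as illustrative): when a block is
    given, the body is handed to the block in fragments and is not kept as a
    string on the response object, so the old data = response.body pattern no
    longer applies.

    require "net/http"

    Net::HTTP.start("www.washingtonpost.com", 80) { |http|
      bytes = 0
      response = http.get("/wl/jobs/JS_JobSearch?TS=1012409733026", nil) { |chunk|
        bytes += chunk.length     # the body arrives here, chunk by chunk
      }
      puts bytes                  # total number of bytes actually received
      # response.body is not the page text here, because the block consumed it
    }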

  3. #3

    Re: Net::http.get has a 50K limit?

    I tried your suggestion (i.e. using a block and writing it to a file; see
    the copy below), but it still cuts the page short, just as before. The
    actual web page is ~58K but I'm only getting ~51K. Any more suggestions?

    require "net/http"

    # fetch with a block and stream the body into a file
    Net::HTTP.start("www.washingtonpost.com", 80) { |http|
      File.open('result.txt', 'wb+') { |f|
        resp, = http.get('/wl/jobs/JS_JobSearch?TS=1012409733026', nil) { |str|
          f.print(str)
        }
      }
    }


    "Vivek Nallur" <ernet.in> wrote in message
    news:CDACMUMBAI.CDACINDIA.COM... 



    Meihua Guest

  4. #4

    Re: Net::http.get has a 50K limit?


    "Meihua Liang" <net> schrieb im Newsbeitrag
    news:YXuSb.11834$.. 
    see 
    web 

    Did you verify with wget that the server actually serves the complete
    document? If not, that's what I'd do. (A rough Ruby version of that check
    is sketched after this post.)

    robert
     

    Robert Guest
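
    A rough Ruby version of that check, assuming Ruby 1.8.7 or later (the URL
    is the one from the thread and has long since changed): compare the
    Content-Length the server announces with the number of bytes actually
    received.

    require "net/http"

    Net::HTTP.start("www.washingtonpost.com", 80) { |http|
      response = http.get("/wl/jobs/JS_JobSearch?TS=1012409733026")
      announced = response["Content-Length"]   # may be nil, e.g. for chunked responses
      received  = response.body ? response.body.length : 0
      puts "HTTP #{response.code}"
      puts "Content-Length header: #{announced.inspect}"
      puts "Bytes received:        #{received}"
    }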

  5. #5

    Re: Net::http.get has a 50K limit?

    Yes, the server serves the complete document. The browser renders it
    nicely, and I also verified via wget, which gives the complete copy. I
    still don't know why Net::HTTP prematurely cuts off the document.

    meihua

    "Robert Klemme" <net> wrote in message
    news:bve0ff$qsaln$news.uni-berlin.de... 
    > see 
    > web 
    >
    > Did you verify with wget that the server actually serves the complete
    > doent? If not, that's what I'd do.
    >
    > robert



    Meihua Guest

  6. #6

    Re: Net::http.get has a 50K limit?

    Hi!

    Meihua Liang wrote:
     

    I played around with your script a bit and noticed something strange: when
    trying to fetch the file via telnet, it is also cut off early. However, as
    you said, wget correctly retrieves the whole document. Why? wget sends a
    User-Agent header field, and only then is the whole document served. So,
    adding a User-Agent header field to your request makes it work for me
    (with Ruby 1.8):

    response = http.get("/wl/jobs/JS_JobSearch?TS=1012409733026",
                        {"user-agent" => "blub"})

    This returns around 60000 bytes in response.body. (A complete version of
    the original script with this header added is sketched after this post.)
    When writing web spiders, you sometimes have to outsmart the web servers ;).

    Hth,
    Daniel
    Daniel Guest
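
    Putting the thread together, here is a minimal sketch of the original
    script with Daniel's User-Agent header added (the value "blub" is just the
    placeholder used above; any browser-like string should do, and
    "result.txt" is an example filename):

    require "net/http"

    begin
      Net::HTTP.start("www.washingtonpost.com", 80) { |http|
        # without a User-Agent header this server returns a truncated page
        headers = { "User-Agent" => "blub" }
        response = http.get("/wl/jobs/JS_JobSearch?TS=1012409733026", headers)
        File.open("result.txt", "wb") { |f| f.print(response.body) }
        puts response.body.length
      }
    rescue => err
      puts "Error: #{err}"
      exit
    end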
