Acessing files without directory structure information

Ask a Question related to UNIX Programming, Design and Development.

  1. #1

    Default Acessing files without directory structure information

    Hello,

    I am looking for a way to read a file in a directory (and its subdirectories)
    without knowing its contents.
    Practically, I need to choose a file randomly in a directory which may
    contains an awful lot, so I cannot build a representation of its contents,
    unless the time and memory needed to build it does not depend on the number
    or nature of the files.
    Are there any standard Unix routines that provide such functionality?

    Thank you,
    Sam
    --
    "So if you meet me, have some courtesy, have some sympathy, and some taste
    Use all your well-learned politesse, or I'll lay your soul to waste"

    - The Rolling Stones, "Sympathy for the Devil"

    Sam Zoghaib Guest

  2. Similar Questions and Discussions

    1. Directory Structure FOR CF (CFC's)
      My cfm pages cannot see my CFC's. CFC is new to the client. What does the directory structure look like? Do I need to create a separate folder for...
    2. Directory structure workaround
      Hi, For hit tracking purposes I've been asked if it is possible to turn our current URLs into something more friendly to analysis tools. The...
    3. Server directory structure
      I am setting up a ASP.NET intranet site. I will be using VS 2003 as development tool. I am not sure about what should be the best directory...
    4. [PHP] Linking to files outside the directory structure...?
      Hello, This is very much a possibility. first you 'fopen' the file and then you do an 'fpassthru' it will send the file to the client. You do...
    5. Backup directory structure
      Hi .. I copied the directory structure to CD .. The directory Structure looks like this .. ( This directory structure I used to create the DB...
  3. #2

    Default Re: Acessing files without directory structure information

    Sam Zoghaib <NOsam@zep.homedns.orgSPAM> writes:
    > I am looking for a way to read a file in a directory (and its subdirectories)
    > without knowing its contents.
    If you already knew its contents, why would you want to read it (again)?
    > Practically, I need to choose a file randomly in a directory which may
    > contains an awful lot, so I cannot build a representation of its contents,
    > unless the time and memory needed to build it does not depend on the number
    > or nature of the files.
    Huh?
    > Are there any standard Unix routines that provide such functionality?
    Probably. But your description of what you actually want to do is
    not quite clear.

    If what you want to do is:

    - given a (possibly large) directory, select a random file and read it
    - without any knowledge of how many files there are
    - with equal probability that any given file is chosen
    - without scanning the directory twice
    - assuming that the directory is not modified while you are in the
    process of selection.

    then:

    open directory;
    target = next file;
    do {
    nextfile = next file;
    taget = select from target or nextfile with probabilty 1/2;
    } while (more files exist);
    close directory;
    open target;

    Note: any design that requires directories with large number of
    files in them is probably not going to give you good performance.

    Cheers,
    --
    In order to understand recursion you must first understand recursion.
    Paul Pluzhnikov Guest

  4. #3

    Default Re: Acessing files without directory structure information

    Sam Zoghaib wrote:
    > Hello,
    >
    > I am looking for a way to read a file in a directory (and its subdirectories)
    > without knowing its contents.
    > Practically, I need to choose a file randomly in a directory which may
    > contains an awful lot, so I cannot build a representation of its contents,
    > unless the time and memory needed to build it does not depend on the number
    > or nature of the files.
    > Are there any standard Unix routines that provide such functionality?
    Yes, there are complete virus-creating libraries, with functions
    find_an_executable, infect and so on

    Lorinczy Zsigmond Guest

  5. #4

    Default Re: Acessing files without directory structure information

    Paul Pluzhnikov wrote in article <uvfu7qiwy.fsfYB8X@earthlink.net> on Sunday
    13 July 2003 05:02 in comp.unix.programmer:
    > If what you want to do is:
    >
    > - given a (possibly large) directory, select a random file and read it
    > - without any knowledge of how many files there are
    yes
    > - with equal probability that any given file is chosen
    yes
    > - without scanning the directory twice
    Actually, without scanning it even once. That's why I asked how I could make a
    routine which reads a directory whose contents you have no information about.
    > - assuming that the directory is not modified while you are in the
    > process of selection.
    Yes, though I forgot to mention it.
    > then:
    >
    > open directory;
    > target = next file;
    > do {
    > nextfile = next file;
    > taget = select from target or nextfile with probabilty 1/2;
    > } while (more files exist);
    > close directory;
    > open target;
    The problem here is that you are scanning the directory.
    > Note: any design that requires directories with large number of
    > files in them is probably not going to give you good performance.
    >
    Which is why I am looking for a routine that does not scan the directory.

    I was thinking of picking a small chunk of data in the directory (small enough
    to be sure it belongs to only one file), then retrieving what file it belongs
    to (without scanning the whole directory of course) to read it from start to
    end. I have no idea of how to do this though.

    Thanks,
    Sam
    --
    "One cannot be betrayed if one has no people"

    - Kobayashi in "The Usual Suspects", 1995

    Sam Zoghaib Guest

  6. #5

    Default Re: Acessing files without directory structure information

    Paul Pluzhnikov <ppluzhnikov@earthlink.net> writes:

    [...]
    > If what you want to do is:
    >
    > - given a (possibly large) directory, select a random file and read it
    > - without any knowledge of how many files there are
    > - with equal probability that any given file is chosen
    > - without scanning the directory twice
    > - assuming that the directory is not modified while you are in the
    > process of selection.
    > open directory;
    > target = next file;
    > do {
    > nextfile = next file;
    > taget = select from target or nextfile with probabilty 1/2;
    > } while (more files exist);
    > close directory;
    > open target;
    That would have a flat 50% chance of selecting the very last file. I
    think you meant: set target to the current file with 1/n probability,
    where n is the number of files seen so far, including this one.

    In that case, the last, n-th, file has a 1/n chance of being selected.
    The next-to-last file has a 1/(n-1) chance of being selected but 1/n
    of those 1/(n-1) times the last file will be selected later -- so the
    actual chance is:

    1/(n-1) - 1/n * 1/(n-1) = n/n(n-1) - 1/n(n-1) = n-1/n(n-1) = 1/n


    --
    ================================================== =============
    <erwin@andreasen.org> Herlev, Denmark
    <URL:http://www.andreasen.org/> <*>
    ================================================== =============

    Erwin S. Andreasen Guest

  7. #6

    Default Re: Acessing files without directory structure information

    Sam Zoghaib wrote:
    >
    > Paul Pluzhnikov wrote in article <uvfu7qiwy.fsfYB8X@earthlink.net> on Sunday
    > 13 July 2003 05:02 in comp.unix.programmer:
    >
    > > If what you want to do is:
    > >
    > > - given a (possibly large) directory, select a random file and read it
    > > - without any knowledge of how many files there are
    >
    > yes
    >
    > > - with equal probability that any given file is chosen
    >
    > yes
    >
    > > - without scanning the directory twice
    >
    > Actually, without scanning it even once. That's why I asked how I could make a
    > routine which reads a directory whose contents you have no information about.
    You will have to scan it at least once as users don't have low-level
    access to the directory "file" to access it randomly.

    > > - assuming that the directory is not modified while you are in the
    > > process of selection.
    >
    > Yes, though I forgot to mention it.
    >
    > > then:
    > >
    > > open directory;
    > > target = next file;
    > > do {
    > > nextfile = next file;
    > > taget = select from target or nextfile with probabilty 1/2;
    > > } while (more files exist);
    > > close directory;
    > > open target;
    >
    > The problem here is that you are scanning the directory.
    >
    > > Note: any design that requires directories with large number of
    > > files in them is probably not going to give you good performance.
    > >
    > Which is why I am looking for a routine that does not scan the directory.
    >
    > I was thinking of picking a small chunk of data in the directory (small enough
    > to be sure it belongs to only one file), then retrieving what file it belongs
    > to (without scanning the whole directory of course) to read it from start to
    > end. I have no idea of how to do this though.
    This will pick a random file from a directory, but you still have to
    scan the entire directory.

    #!/usr/bin/perl

    my $dir = shift or die "usage: $0 directory-name\n";

    opendir my $dh, $dir or die "Cannot open $dir: $!";

    my ( $n, $entry );
    rand( ++$n ) < 1 && ( $entry = $_ ) while $_ = readdir $dh;

    print "$entry\n";

    __END__


    John
    --
    use Perl;
    program
    fulfillment
    John W. Krahn Guest

  8. #7

    Default Re: Acessing files without directory structure information

    "Erwin S. Andreasen" <erwin@andreasen.com> writes:
    > > taget = select from target or nextfile with probabilty 1/2;
    > That would have a flat 50% chance of selecting the very last file. I
    > think you meant: set target to the current file with 1/n probability,
    > where n is the number of files seen so far, including this one.
    Oops. My mistake. You are correct.

    [Original algorithm from "The Practice of Programming" by Kernighan
    and Pike, p. 70]

    Cheers,
    --
    In order to understand recursion you must first understand recursion.
    Paul Pluzhnikov Guest

  9. #8

    Default Re: Acessing files without directory structure information

    Sam Zoghaib <NOsam@zep.homedns.orgSPAM> writes:
    > Actually, without scanning it even once. That's why I asked how I could make a
    > routine which reads a directory whose contents you have no information about.
    I do not believe it is possible to guarantee this:
    > > - with equal probability that any given file is chosen
    without scanning through the whole directory once.
    > I was thinking of picking a small chunk of data in the directory (small enough
    > to be sure it belongs to only one file), then retrieving what file it belongs
    > to (without scanning the whole directory of course) to read it from start to
    > end. I have no idea of how to do this though.
    Even if this were possible (which it isn't, since "directory data"
    does not contain any "file data" at all), this would violate the
    "equal probability of any given file", unless you also know that
    all files in the directory are the same size.

    Your requirements appear to be extremely strange. What is it that
    you are *actually* trying to achieve?

    Cheers,
    --
    In order to understand recursion you must first understand recursion.
    Paul Pluzhnikov Guest

  10. #9

    Default Re: Acessing files without directory structure information

    Paul Pluzhnikov wrote in article <uptkeqb8x.fsfYB8X@earthlink.net> on Monday
    14 July 2003 01:02 in comp.unix.programmer:

    > Even if this were possible (which it isn't, since "directory data"
    > does not contain any "file data" at all), this would violate the
    > "equal probability of any given file", unless you also know that
    > all files in the directory are the same size.
    You're right. Sorry for the mistake.
    > Your requirements appear to be extremely strange. What is it that
    > you are *actually* trying to achieve?
    I'm trying to make a program which can randomly read data from directories
    which can contains lots of files, in reasonable time and memory usage.
    Statistics may be an application.

    Actually, it may be possible to drop the "no information about the structure
    of the directory" condition, if I had a way to get the needed information
    (basically the list of files) very fast even for a large amount of files and
    can hold this information in little memory.

    I'm thinking, if I can represent each file with 2 integers (one for the
    directory it is in and one for the file itself) I could handle a very large
    number of files with little memory. However, it does not sove the time
    problem.

    Sam
    --
    "People sometimes ask me if it is a sin in the Church of Emacs to use vi.
    Using a free version of vi is not a sin; it's a penance".

    - Richard Stallman

    Sam Zoghaib Guest

  11. #10

    Default Re: Acessing files without directory structure information

    In article <bestf2$ht2$1@news-reader6.wanadoo.fr>,
    Sam Zoghaib <NOsam@zep.homedns.orgSPAM> wrote:
    >I'm trying to make a program which can randomly read data from directories
    >which can contains lots of files, in reasonable time and memory usage.
    >Statistics may be an application.
    Maybe if you explained why you need to do this, we might be able to suggest
    an alternative solution.

    --
    Barry Margolin, [email]barry.margolin@level3.com[/email]
    Level(3), Woburn, MA
    *** DON'T SEND TECHNICAL QUESTIONS DIRECTLY TO ME, post them to newsgroups.
    Please DON'T copy followups to me -- I'll assume it wasn't posted to the group.
    Barry Margolin Guest

  12. #11

    Default Re: Acessing files without directory structure information

    Sam Zoghaib wrote:
    >
    > Actually, it may be possible to drop the "no information about the structure
    > of the directory" condition, if I had a way to get the needed information
    > (basically the list of files) very fast even for a large amount of files and
    > can hold this information in little memory.
    A directory is just a file and you can open it for reading as raw data.
    The filenames are what you are looking for. They are strings and can
    be found heuristically. For example on a non-Linux systems, try doing
    'strings -t d /'. It will print an index containing the byte offsets
    of the filenames in the / directory.

    You could generate such an index and randomly select from it. Or you
    could randomly select an offset into the directory and write a function
    to find the closest string. Once found use the string to open whatever
    file it names.

    Note that how you find the strings would be filesystem specific.

    -- ced


    --
    Chuck Dillon
    Senior Software Engineer
    NimbleGen Systems Inc.

    Chuck Dillon Guest

  13. #12

    Default Re: Acessing files without directory structure information

    Paul Pluzhnikov wrote:
    >
    > If what you want to do is:
    >
    > - given a (possibly large) directory, select a random file and read it
    > - without any knowledge of how many files there are
    > - with equal probability that any given file is chosen
    > - without scanning the directory twice
    > - assuming that the directory is not modified while you are in the
    > process of selection.
    >
    > then:
    >
    > open directory;
    > target = next file;
    > do {
    > nextfile = next file;
    > taget = select from target or nextfile with probabilty 1/2;
    > } while (more files exist);
    > close directory;
    > open target;
    This will select a file "at random," but not with equal
    probability. It will choose the very last file half the time,
    the penultimate one-quarter of the time, the antepenultimate
    one-eighth of the time, and so on. The first and second are
    chosen with equal probability 1/(2**(N-1)) if there are N files
    altogether. If each execution of the algorithm takes one
    nanosecond and the directory contains 100 files, you should
    expect to wait about eight hundred times the current age of
    the Universe before either the first or second file will be
    chosen (don't hold your breath).

    One way to choose each file with probability 1/N and
    without (much) advance knowledge of N is to use a random
    number generator that produces N or more distinct values
    before repeating. It's not hard to implement a linear
    congruential generator with a period of four billion or
    so, so as long as the directory holds fewer than four
    billion files you'll be fine (this upper bound is the small
    amount of advance knowledge you need about the value of N).

    open directory
    target = next file;
    key = next random value;
    while (more files exist) {
    nextfile = next file;
    nextkey = next random value;
    if (nextkey < key) {
    target = nextfile;
    key = nextkey;
    }
    }
    close directory;
    open target;

    That is, you "attach" a different pseudo-random number to
    each file, and you choose the file whose attached number is
    smallest. If the generator is any good at all, the smallest
    of N consecutive values is as likely to be in one position
    as in another; the first, second, ..., Nth files are all
    equally likely to be associated with the minimum value.

    --
    [email]Eric.Sosman@sun.com[/email]
    Eric Sosman Guest

  14. #13

    Default Re: Acessing files without directory structure information

    Chuck Dillon wrote in article <3F12D037.2020601@nimblegen.com> on Monday 14
    July 2003 17:45 in comp.unix.programmer:

    > A directory is just a file and you can open it for reading as raw data.
    > The filenames are what you are looking for. They are strings and can
    > be found heuristically. For example on a non-Linux systems, try doing
    > 'strings -t d /'. It will print an index containing the byte offsets
    > of the filenames in the / directory.
    >
    > You could generate such an index and randomly select from it.
    That's what I'm gonna try to do.
    > Or you
    > could randomly select an offset into the directory and write a function
    > to find the closest string. Once found use the string to open whatever
    > file it names.
    >
    That's not possible, because this way, the bigger the file, the higher
    probability to be selected.

    Sam
    --
    "Giving the Linus Torvalds Award to the Free Software Foundation is a bit like
    giving the Han Solo Award to the Rebel Alliance"

    - Richard Stallman, August 1999

    Sam Zoghaib Guest

  15. #14

    Default Re: Acessing files without directory structure information

    In article <3F12D037.2020601@nimblegen.com>, Chuck Dillon wrote:
    > Sam Zoghaib wrote:
    >>
    >> Actually, it may be possible to drop the "no information about the structure
    >> of the directory" condition, if I had a way to get the needed information
    >> (basically the list of files) very fast even for a large amount of files and
    >> can hold this information in little memory.
    >
    > A directory is just a file and you can open it for reading as raw data.
    > The filenames are what you are looking for.
    Depending on your system and definition of "very fast", you could use
    C's readdir(), which is POSIX standardized. Peeking at the raw data
    may not be much of a speed improvement, at least for some systems,
    and is horribly non-portable.

    - Larry
    Larry Doolittle Guest

  16. #15

    Default Re: Acessing files without directory structure information

    In article <slrnbh5q86.teg.ldoolitt@recycle.lbl.gov>,
    Larry Doolittle <ldoolitt@recycle.lbl.gov> wrote:
    >Depending on your system and definition of "very fast", you could use
    >C's readdir(), which is POSIX standardized.
    The OP specifically said he doesn't want to scan the directory sequentially
    (because it's so big). How do you propose using readdir() to accomplish
    his goal?

    --
    Barry Margolin, [email]barry.margolin@level3.com[/email]
    Level(3), Woburn, MA
    *** DON'T SEND TECHNICAL QUESTIONS DIRECTLY TO ME, post them to newsgroups.
    Please DON'T copy followups to me -- I'll assume it wasn't posted to the group.
    Barry Margolin Guest

  17. #16

    Default Re: Acessing files without directory structure information

    Thanks to everybody for the useful replies.
    I have studied the different solutions proposed, and it seems that it's not
    really possible to address all requirements without scanning the directory
    once. Indeed, if I cannot scan the directory, I don't see how I'm gonna be
    able to pick among all subdirectories in equal probability. (determining
    after the files are picked which are directories and picking among them fails
    the equal probability requirement).

    So I guess I'll have to cope with a scan, and find a way to store references
    to files in the least memory I can.

    Thanks again for all your help.
    Sam
    --
    "Don't be afraid, I'm gonna give you the choice I never had..."

    - Lestat in "Interview with the Vampire" (Ann Rice, 1976)

    Sam Zoghaib Guest

  18. #17

    Default Re: Acessing files without directory structure information


    "Sam Zoghaib" <NOsam@zep.homedns.orgSPAM> wrote in message
    news:bf0mdp$8e4$1@news-reader3.wanadoo.fr...
    > So I guess I'll have to cope with a scan, and find a way to store
    references
    > to files in the least memory I can.
    Why do you need to store references in memory at all? Do you care about
    files other than the one you picked?

    DS


    David Schwartz Guest

Posting Permissions

  • You may not post new threads
  • You may post replies
  • You may not post attachments
  • You may not edit your posts

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139