Map out a directory hierarchy - PERL Beginners

  1. #1

    Map out a directory hierarchy

    I want to get a daily list of all the directories under a kind of
    large (by home standards) news hierarchy.

    I know a little about using File::Find but wonder if there is a better
    way.

    Here are the things one runs into with File::Find.

    If you run it looking for directories (-d), it still takes a really
    long time, and then it also returns all the stub names that don't
    actually contain files, like comp/os.

    I can think of a few ways to get down to the unique directories that
    actually have files at their end, like:

    comp/os/linux/misc

    But not without actually finding the numbered files in there.

    For example, if I set File::Find looking for /^\d+$/ then in my case
    that will have to be a full path to postings. The trouble there is
    that there are literally millions of numbered files under those paths.

    I was trying to think of something crazy like putting File::Find in a
    while loop that bails out (with last) as soon as a numbered file is
    found.

    Then I'd need some way to force a chdir, but not to that same path.

    Can someone help me with this? Understand that it's not really
    urgent, since I do know how to get it done the long way, though I'm
    sure this is a pretty sorry way of doing it. I used Cwd because I
    couldn't quite figure out how to use File::Find's no_chdir option.
    Cramming the duplicate paths into a hash as keys and letting them
    cancel is probably also a pretty hokey way of getting them down to one.

    #! /usr/bin/perl -w

    use File::Find;
    use Cwd;
    if(!$ARGV[0] || $ARGV[0] eq "help"){
        usage();
        exit;
    }else{
        @top_dir = @ARGV;
        @ARGV = ();
    }
    $file = "./uniq_dir_under_news";
    my ($our_dir, $absolute, $uniq_dirs, %uniq_dirs);
    find(\&wanted, @top_dir);

    open(FILE,">$file") or die "Can't open $file: $!";

    sub wanted {
        $our_dir = getcwd;
        if($_ =~ /^\d+/){
            ## This print is just to let me know its running
            print "$our_dir/$_\n";
            $uniq_dirs{$our_dir} = $_;
        }
    }
    foreach $key (keys %uniq_dirs){
        push @uniq_dirs,$key;
    }

    for(sort @uniq_dirs){
        print FILE "$_\n";
        print "$_\n";
    }
    close(FILE);

    As you may guess, this takes quite a while with 6.3 GB under
    /news.

    Harry Guest

  2. #2

    Re: Map out a directory hierarchy

    Harry Putnam wrote: 

    You are doing too much work as File::Find::find() already supplies the full
    path name.

    #!/usr/bin/perl
    use warnings;
    use strict;

    use File::Find;

    if ( !@ARGV or $ARGV[0] eq 'help' ) {
        usage();
        exit 0;
    }

    my @top_dir = splice @ARGV;
    my $file = './uniq_dir_under_news';

    my %uniq_dirs;
    find( sub {
        if ( /^\d/ ) {
            ## This print is just to let me know its running
            print "$File::Find::name\n";
        }
        $uniq_dirs{ $File::Find::dir }++;
    }, @top_dir );

    open FILE, '>', $file or die "Can't open $file: $!";

    for ( sort keys %uniq_dirs ) {
        print FILE "$_\n";
        print "$_\n";
    }

    close FILE;

    __END__



    John
    --
    use Perl;
    program
    fulfillment
    John Guest

  3. #3

    Re: Map out a directory hierarchy

    "John W. Krahn" writes:

    First ... thanks for the coding tips.

    About the full path name: well, actually no, it doesn't give full path
    names, or at least not absolute ones; that's why I used Cwd. For
    example, if given a directory as a relative address like
    ../test_news, it returns:

    test_news
    test_news/tmp
    test_news/tmp/tmp2

    That isn't going to work for other applications to access those files.

    I learned something with the `splice' you added and your format for
    finding uniq dirs..
    $uniq_dirs{ $File::Find::dir }++;
    And the cool way you worked the sort right into the
    hash read.

    Not quite sure I understand how (in this case)
    `@array2 = splice @array1' is better than `@array2 = @array1'.

    Is it just because the splice removes the elements from @ARGV
    and so needs no @ARGV = (); ?

    This new code, although easier to look at and probably more
    efficient, isn't really any faster, or at least not appreciably. It
    still goes to each and every numbered file.

    [...] snipped new code

    Also changing the file name regex from /^\d+$/ to /^\d/ will cause
    problems in some directories where there may be such things as 780~ or
    even 123.bak

    And finally I didn't really follow the different syntactical setup of
    the find() with sub. It appears to be the same thing as:


    find(\&wanted, @directories);

    sub wanted{ ... }

    Like the examples in perldoc File::Find, only more confusing.

    Harry Guest

  4. #4

    Re: Map out a directory hierarchy

    Harry Putnam wrote:
    >
    > First ... thanks for the coding tips.
    >
    > About the full path name: Well actually no, it doesn't give full path
    > names. Or at least not absolute names, that's why I used Cwd. For
    > example if given a directory that's in a relative address like
    > ./test_news. It returns:

    Ok, see below.
     

    Yes.
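    To spell out the splice point, here is a minimal sketch. The array
    @args is a made-up stand-in for @ARGV so that it runs without any
    command-line arguments; the other names are illustrative too.

```perl
use strict;
use warnings;

# Hypothetical stand-in for @ARGV.
my @args = ( 'news', 'test_news' );

# Plain assignment copies the list; the source array keeps its elements.
my @copied = @args;
my $left_after_copy = scalar @args;

# splice with no offset or length removes and returns every element,
# so no separate "@args = ();" is needed afterwards.
my @moved = splice @args;
my $left_after_splice = scalar @args;

print "after copy: $left_after_copy left, after splice: $left_after_splice left\n";
```

    Either form fills the new array with the same values; the only
    difference is whether the source array is emptied as a side effect.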
     

    In most file systems the file names are not stored in any particular order so
    in order to find every file of a certain type you have to look at every file
    in a directory to determine if it is the type you want.

     

    Your original example used /^\d+/ not /^\d+$/ and /^\d/ does the same thing as
    /^\d+/.
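    A small sketch of how the anchored and unanchored patterns differ,
    using made-up entry names (the two look-alikes are the ones mentioned
    earlier in the thread):

```perl
use strict;
use warnings;

# Made-up directory entries: one article number, two look-alikes, one subdirectory.
my @names = ( '780', '780~', '123.bak', 'misc' );

# /^\d/ (like /^\d+/) only requires the name to start with a digit.
my @loose = grep { /^\d/ } @names;

# /^\d+$/ requires the whole name to consist of digits.
my @strict = grep { /^\d+$/ } @names;

print "loose:  @loose\n";
print "strict: @strict\n";
```

    The loose pattern lets 780~ and 123.bak through, while the anchored
    one keeps only 780.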

     

    sub something { ... }
    find( \&something, @directories );

    And

    find( sub { ... }, @directories );

    Do the same thing but the first uses a reference to a named subroutine and the
    second uses a reference to an anonymous subroutine.
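    The two forms can be tried side by side without touching the file
    system. In this sketch call_for_each() is a hypothetical helper that
    stands in for find(): it just calls the code reference once per item
    with the item in $_. The other names are made up as well.

```perl
use strict;
use warnings;

# Hypothetical stand-in for find(): call $code once per item, item in $_.
sub call_for_each {
    my ( $code, @items ) = @_;
    $code->() for @items;
}

# Form 1: a named subroutine, passed by reference.
my @seen_named;
sub remember { push @seen_named, $_ }
call_for_each( \&remember, 'comp', 'os' );

# Form 2: an anonymous subroutine written in place.
my @seen_anon;
call_for_each( sub { push @seen_anon, $_ }, 'comp', 'os' );

print "named:     @seen_named\n";
print "anonymous: @seen_anon\n";
```

    Both calls collect the same items; the caller cannot tell which kind
    of reference it was handed.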



    #!/usr/bin/perl
    use warnings;
    use strict;

    use File::Spec;
    use File::Find;

    if ( !@ARGV or $ARGV[0] eq 'help' ) {
        usage();
        exit 0;
    }

    my @top_dir = map File::Spec->rel2abs($_), splice @ARGV;
    my $file = './uniq_dir_under_news';

    my %uniq_dirs;
    find( sub {
        if ( /^\d+$/ ) {
            ## This print is just to let me know its running
            print "$File::Find::name\n";
        }
        $uniq_dirs{ $File::Find::dir }++;
    }, @top_dir );

    open FILE, '>', $file or die "Can't open $file: $!";

    for ( sort keys %uniq_dirs ) {
        print FILE "$_\n";
        print "$_\n";
    }

    close FILE;

    __END__



    John
    --
    use Perl;
    program
    fulfillment
    John Guest

  5. #5

    I'm confused: where are the commas for map and sort


    The map and the sort statements are strange. Why don't they require a comma
    between the first and second arguments?

    Thanks,
    Siegfried

    Siegfried Guest

  6. #6

    Re: Map out a directory hierarchy

    Harry wrote:

    John replied: 

    So I guess there just isn't any tricky fast way to get just the
    directory names then eh? This is on ext3 fs but about the only real
    change I could make there would be to reiserfs or something and I'll
    assume that wouldn't really change the problem.

    >
    > Your original example used /^\d+/ not /^\d+$/ and /^\d/ does the
    > same thing as /^\d+/.

    Oh .. now I see where you got it. Because the actual program used
    /^\d+$/ for the reasons I listed. Must have been a foible or typo
    during conversion to mail message. I might not have cut and pasted
    all of it or changed the program after posting... or something.

    [...]
     

    OK, thanks for clearing that up...Is one better than the other in some
    way?
     

    [...]
     

    Is using File::Spec in this way, faster or more efficient than using
    Cwd like in the original? Or are you just showing another approach?

    Thanks for that too, I hadn't run into File::Spec as yet and now have
    a handy reference to its use when I need it.

    Those perldoc pages are terrible about showing enough examples. But
    mainly because I lack the expertise to understand the ones they do
    give.

    Well thanks for your usual patience and explanations...

    Harry Guest

  7. #7

    Re: I'm confused: where are the commas for map and sort

    On Sun, 10 Oct 2004 14:30:32 -0600, Siegfried Heintze wrote:

    They are not magic; they just use a bit of special syntax built into
    Perl, and a prototype lets your own subs use it too.

    Consider the following:
    ----------------------------------------------------------------------------------------
    sub iterate(&@)
    {
        my $code = shift;
        &$code for @_;    # $_ is set to each remaining argument in turn
    }

    my @data = ( "jack", "jill", "jenny" );

    iterate { print "Hello $_\n" } @data;
    ----------------------------------------------------------------------------------------

    The "iterate" function takes a CODE reference and a LIST
    of stuff (same as map and grep and sort). It will then blindly
    iterate over the list of stuff and invoke the code for each element.
    Note the prototype: (&@) means "a block, then a list". With (&) alone
    the call above would not compile, because of the extra arguments.

    The basic rule is, a BLOCK OF CODE does not require a comma
    after it, because it is self-enclosing (ie, it has braces).

    Consider then, that since 'map {xx} @data' treats {xx} much like
    a code reference, the same way 'my $prog = sub { some code }' does,
    you can also hand sort a named subroutine (again with no comma)

    sub sortfilter($$) {
        # stuff: compare $_[0] with $_[1]
    }

    sort sortfilter @data

    or even

    sub special_sort(&@)
    {
        my $sort_func = shift;
        sort $sort_func @_;
    }

    You should probably read more on anonymous subroutines and
    code blocks. (and even closures).

    Cheers.
    David
     
    David Guest

  8. #8

    Re: Map out a directory hierarchy

    Harry Putnam wrote:
    >
    > John replied:

    >
    > So I guess there just isn't any tricky fast way to get just the
    > directory names then eh? This is on ext3 fs but about the only real
    > change I could make there would be to reiserfs or something and I'll
    > assume that wouldn't really change the problem.

    A directory is just a special kind of file. Think of it as a text file with
    one file name per text line (in no particular order.) Now imagine that you
    had to find every line that contained the letter 'r'.

    >
    > OK, thanks for clearing that up...Is one better than the other in some
    > way?

    The first one creates an entry in the symbol table because it has a name while
    the second one doesn't. As to which is "better" ... it depends. :-)

    >
    > [...]

    >
    > Is using File::Spec in this way, faster or more efficient than using
    > Cwd like in the original? Or are you just showing another approach?

    In the original you called getcwd() inside the find() subroutine which would
    call it once for every directory entry in every directory and sub-directory.
    My example calls File::Spec->rel2abs() for every entry in @ARGV which should
    be a lot more efficient.
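    As a rough sketch of what rel2abs() does to each start directory: the
    base directory /news and the paths below are made up, and the base is
    passed explicitly only so the result does not depend on where the
    script is started (without it, rel2abs() uses the current working
    directory).

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical base directory for the illustration.
my $base = '/news';

# A relative name is resolved against the base ...
my $absolute = File::Spec->rel2abs( 'test_news', $base );

# ... while a name that is already absolute is passed through.
my $unchanged = File::Spec->rel2abs( '/var/spool/news', $base );

print "$absolute\n";
print "$unchanged\n";
```

    On a Unix-like system this gives /news/test_news and /var/spool/news.
    Because find() builds $File::Find::dir and $File::Find::name from the
    start directory it was given, making that one name absolute up front
    makes every reported path absolute as well.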


    John
    --
    use Perl;
    program
    fulfillment
    John Guest
