IT Resource Center Forums > HP-UX > general

Zip files 1.7 Million files

ssheri
Nov 2, 2009 16:23:41 GMT   

Hi,

I have a filesystem holding 1.7 million files, dating back to Jan 2005. I need to gzip all files up to Dec 2008, on a yearly basis, i.e. one zip file for 2005, one zip file for 2006, and the same for the rest of the years.

Can anyone help me with the commands for this task?

Your help is much appreciated.


OldSchool
Nov 2, 2009 17:47:55 GMT  5 pts

do the filenames have the date in them somewhere? or are you relying on the datestamps? or something else entirely?
Suraj K Sankari
Nov 2, 2009 17:51:37 GMT  5 pts

Hi,
First make a tar file, then compress it with the compress or gzip utility.

tar -cvf 2005.tar /directory_name
gzip 2005.tar

Suraj
Michael Steele
Nov 2, 2009 17:55:30 GMT  5 pts

cd dir
find . -atime 360 -exec ll {} \;

Verify your selection by listing everything captured by find

When ready

find . -atime 360 -exec gzip {} \;

This is for one year. 720 for two years, etc.

I'd also suggest using 'tar' after you gzip else you'll run out of space fast. Real, real fast. In fact, having another dir to work with would be good.

find . -name '*.gz' | pax -w -x ustar -f backup.tar
OldSchool
Nov 2, 2009 18:57:49 GMT  5 pts

"tar -cvf 2005.tar /directory_name
gzip 2005.tar"

that assumes the OP had the files already segregated into directories by year, which may or may not be the case.


"find . -atime 360 -exec gzip {} \;"

is probably closer to what the OP wants, but will result in one zip file for each original file found...which may be what they're after.

or you could take the above "find" and "mv" the file to a separate directory, then gzip each, then tar the results....or mv the file, tar the directory and gzip *that*.

ssheri needs to remember that there is no "create date" stored in unix filesystems. M. Steele is going after the "access time", which may be a good bet. See the "man" page for "find", in particular the "-atime", "-mtime" and "-ctime" options, to see which best fits.

Another option would be to create two reference files with appropriate dates, and use "-newer" together with its negation "! -newer" (there is no "-older" in standard find) to sort out what you want.

All of the above is why I originally asked if the date was somehow "buried" in the filename.

some additional information about the original data layout, and the desired results might help in providing more appropriate responses.
Steven Schweda
Nov 2, 2009 19:43:54 GMT  5 pts

> Another option would be to create two
> reference files [...]

This seems like a better scheme than any of
the "-<X>time" options.  Especially if
you're not running the job at 00:00 on 1
January.  "-atime" would seem to be the
least likely to get the desired result
(unless no one ever looks at these files).

> or you could take the above "find" and
> "mv" the file [...]

I'd vote for moving them to year-specific
directories that way, and then doing
something like:

tar cf - year_2005_dir | \
gzip -c > year_2005_dir.tar.gz

Creating an actual "tar" archive file, and
_then_ hitting it with gzip tends to require
more disk space, at least temporarily.

> find . -atime 360 -exec gzip {} \;
>
> This is for one year. 720 for two years,
> etc.

Around here, years are longer than 360 days.
Which calendar do you use?  (And which does
"find" use?)
Viktor Balogh
Nov 3, 2009 13:25:25 GMT  5 pts

If I wanted to separate the files based explicitly on the year, I would create a file list this way:

# find . -type f -exec ll {} + | awk '$8 == "2007"' | tee list_2007

This lists exactly the files from year 2007 (1st Jan -> 31st Dec) and also descends into subdirs. After that you could feed this list to gzip/tar or whatever you want...
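A rough sketch of how such a listing could be turned into a per-year directory: keep only the name column and hand each name to mv. The demo directory, the sample file names, and the use of `ls -l` in place of `ll` are assumptions for illustration, and names containing whitespace would break the awk step.

```shell
# Sketch only: works in a throwaway directory (all names are made up).
LC_ALL=C; export LC_ALL                 # keep the ls -l date columns predictable
demo=${TMPDIR:-/tmp}/list_demo.$$
mkdir -p "$demo/src" "$demo/2007" && cd "$demo/src" || exit 1

touch -t 200703151200 a_2007.dat
touch -t 200806011200 b_2008.dat

# Field 8 of ls -l is the year for files older than about six months;
# field 9 is the name, so names containing whitespace are not handled.
find . -type f -exec ls -l {} + |
    awk '$8 == "2007" { print $9 }' > "$demo/list_2007"

# One mv per listed name: slow for a million files, but the target can go last.
xargs -I {} mv {} "$demo/2007/" < "$demo/list_2007"
```

The year directory can then be streamed through tar and gzip as in the earlier replies.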
OldSchool
Nov 3, 2009 16:16:58 GMT    Unassigned

lots of options presented.....still waiting for "ssheri" to shed some light on the original directory layout and the desired output.

from what was originally stated, it could well be that the OP wants a gzip file for a given year that contains all the files for that year (as opposed to zipping a tar of those files).

If so, I don't think that option has been covered yet, and it might be a pain to implement.
ssheri
Nov 3, 2009 16:18:47 GMT    N/A: Question Author

Hi All,
Thanks for your quick responses. I hope I can explain my requirement in more detail.
=======================================

I have a filesystem which contains 1.7 million files. Files have been there from 2005 until today. My requirement is to tar and zip the files for each year separately, i.e. one tar/zip file each for 2005, 2006, 2007 and 2008. The files can be identified by their time stamp and there are no separate directories for each year. All files reside in a single directory.
======================================
OldSchool
Nov 3, 2009 18:37:58 GMT  10 pts

"I have a filesystem which contains 1.7 million files. Files have been there from 2005 until today. My requirement is to tar and zip the files for each year separately, i.e. one tar/zip file each for 2005, 2006, 2007 and 2008. The files can be identified by their time stamp and there are no separate directories for each year. All files reside in a single directory."

Ok, this could get ugly. Making the assumption that the files will be removed after archiving, then something like the following can be modified to work:

First, you need to realize that UNIX doesn't have / track a file timestamp related to the "creation time". It knows the following:

atime (File Access Time)
Access time shows the last time the data from a file was accessed - read by one of the Unix processes directly or through commands and scripts.

ctime (File Change Time)
ctime changes when you change a file's ownership or access permissions. It is also updated whenever the file's contents are updated, so it always reflects the most recent change of either kind.

mtime (File Modify Time)
Last modification time shows time of the last change to file's contents. It does not change with owner or permission changes, and is therefore used for tracking the actual changes to data of the file itself.

So...which one you look at depends on what you want. IF you can guarantee that the contents of the file, once written, were never modified, then the mtime option of find should be ok. Access time is useless for this if the file has ever been read after writing. Ctime *might* work.

If none of the above apply, then you're toast, as you've no way to locate files written in 2005.
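The difference between mtime and ctime can be seen with a throwaway file. This is only a sketch; the directory and file names are made up for the demonstration.

```shell
# Sketch: shows that chmod moves ctime but leaves mtime alone.
demo=${TMPDIR:-/tmp}/times_demo.$$
mkdir -p "$demo" && cd "$demo" || exit 1

touch -t 200506151200 old.dat     # set atime and mtime to 15 Jun 2005
chmod 600 old.dat                 # inode change: updates ctime only

# -mtime -1 and -ctime -1 match timestamps less than one day old.
by_mtime=$(find . -type f -mtime -1 -print)   # empty: contents date from 2005
by_ctime=$(find . -type f -ctime -1 -print)   # ./old.dat: inode changed just now

echo "recent by mtime: ${by_mtime:-none}"
echo "recent by ctime: ${by_ctime:-none}"
```

This is also why ctime is only a fallback: a chown, chmod, or restore run across the tree resets it for every file it touches.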

Let us say that mtime is workable in your case, and you are going to find those files in year 2005. I'd create two reference files representing the upper and lower limits of the times you wish to locate:

touch -a -m -t 200501010000.00 $HOME/first.ref
touch -a -m -t 200512312359.59 $HOME/last.ref

should get you everything between 01/01/2005 at 00:00 and 12/31/2005 at 23:59 and 59 seconds.


then use find to locate the relevant files and move them to a directory by themselves

mkdir /yourname/2005
cd /where_files_are

find . -xdev -type f -newer $HOME/first.ref -a ! -newer $HOME/last.ref -exec mv {} /yourname/2005/. \+

at that point, you should be able to tar the newly created directory and pipe that to gzip as noted in one of the posts above.

Note that the above has not been tested, you might want to substitute something harmless, like ls for the move until you get it sorted out.

repeat the above, after adjusting timestamps on the ref files, and creating the required directories.
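The repeat-per-year step could be sketched as a loop that only counts matches, so nothing is moved until the numbers look right. The demo directory and sample files are made up. One assumption worth flagging: because "-newer" is a strict comparison, this sketch puts the lower reference file at the last second of the previous year so that a file stamped exactly at midnight on 1 January is still included.

```shell
# Sketch: dry run that only counts matches per year; nothing is moved.
demo=${TMPDIR:-/tmp}/refs_demo.$$
mkdir -p "$demo/src" && cd "$demo/src" || exit 1

# Made-up sample files, two of them stamped exactly at midnight on 1 Jan.
touch -t 200501010000.00 edge_2005.dat
touch -t 200507041200.00 mid_2005.dat
touch -t 200601010000.00 edge_2006.dat

summary=""
for year in 2005 2006; do
    prev=$((year - 1))
    # Lower bound: last second of the previous year (strict -newer).
    # Upper bound: last second of this year.
    touch -t "${prev}12312359.59" "$demo/first.ref"
    touch -t "${year}12312359.59" "$demo/last.ref"

    count=$(( $(find . -xdev -type f -newer "$demo/first.ref" \
                     ! -newer "$demo/last.ref" -print | wc -l) ))
    echo "$year: $count file(s)"
    summary="${summary}${year}:${count} "
done
```

Keeping the reference files outside the directory being searched avoids them showing up in the results.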
Michael Steele
Nov 3, 2009 22:44:06 GMT  4 pts

And I have provided the procedure that I would use if I had the task to perform.
Dennis Handly
Nov 4, 2009 01:14:06 GMT  5 pts

>I have a filesystem which has got 1.7 million files.  ... All files are residing on a single directory.

I assume people have told you this is not a good idea?

>The files can be identified by their time stamp

Encoded in their name, or in the ll(1) output?
I have a case where they are encoded in their name.

>there are no separate directories for each year.

If the names include the year, the first thing to do is to create a subdirectory and move all of a year into it.

If they don't include the year, you can make a simple script to do that:
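last_year=""
# -o and -g drop the owner and group columns, so F6 is the year and F7 the name.
ll -trog | while read F1 F2 F3 F4 F5 F6 F7; do
   case $F6 in
   200[5-8]) ;;
   *)        continue;;   # skip the "total" line and anything outside 2005-2008
   esac
   if [ "$F6" != "$last_year" ]; then
      mkdir -p $F6
      last_year=$F6
   fi
   echo mv "$F7" $last_year   # dry run: remove the echo to really move files
done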

Once they are in a separate directory, you use the tar-gzip suggestions as Steven suggested.

If you don't want to include the directory name in the tarball, you can use -C:
tar cf - -C 2005 . |
gzip > year_2005.tar.gz
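If the originals are going to be removed afterwards, it may be worth checking that the archive lists as many files as the directory holds. A minimal sketch, with made-up sample files; it uses a subshell `cd` instead of `-C` only to stay portable across tar implementations, and it assumes tar lists directory entries with a trailing slash.

```shell
# Sketch: build a year archive, then compare file counts before any cleanup.
demo=${TMPDIR:-/tmp}/verify_demo.$$
mkdir -p "$demo/2005" && cd "$demo" || exit 1
touch -t 200503011200 2005/a.dat 2005/b.dat 2005/c.dat   # made-up sample files

# Stream tar into gzip so no uncompressed archive lands on disk.
(cd 2005 && tar cf - .) | gzip > year_2005.tar.gz

# Files on disk versus archive entries that are not directories.
on_disk=$(( $(find 2005 -type f -print | wc -l) ))
in_tar=$(( $(gzip -dc year_2005.tar.gz | tar tf - | grep -vc '/$') ))

if [ "$on_disk" -eq "$in_tar" ]; then
    echo "archive holds all $on_disk files"
else
    echo "mismatch: $on_disk on disk, $in_tar archived" >&2
fi
```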
OldSchool
Nov 4, 2009 14:36:13 GMT    Unassigned

perhaps the biggest problem here is that questions to the OP get a restatement of the original question, without additional information, and specific questions go unanswered.

there are a variety of answers posted, some of which may be more appropriate than others, depending on the exact goal, which isn't clear here.

I will add that if ssheri has any control over the creation of these files, going forward I'd encode the creation date in the filename somehow, as anything relying on the "timestamps" is not going to be a reliable method for determining when a file was created.
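As a sketch of what that could look like: the job stamps each output name with its run date, and a later cleanup selects a year by name pattern alone. The `export_` prefix, the `.dat` suffix, and the demo directory are made up for illustration.

```shell
# Sketch: date-stamped file names make year selection independent of timestamps.
demo=${TMPDIR:-/tmp}/name_demo.$$
mkdir -p "$demo" && cd "$demo" || exit 1

# What the scheduled job might do when it writes a file today.
stamp=$(date +%Y%m%d)
: > "export_${stamp}.dat"

# Two older samples; their timestamps are "now", only the names carry the date.
: > export_20050314.dat
: > export_20061102.dat

# Select everything from 2005 by name, whatever the mtime says.
matches=$(find . -type f -name 'export_2005????.dat' -print)
echo "$matches"
```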
ssheri
Nov 5, 2009 12:12:30 GMT    N/A: Question Author

Thanks a lot..

Your suggestions match my requirement.
The files are not modified after their arrival to the filesystem. These files are getting saved to the filesystem as a result of a scheduled job. Later these are not getting modified by any user.
ssheri
Nov 16, 2009 18:28:04 GMT    N/A: Question Author

Hi,

I have checked up using the options which "oldschool" provided.
=======================================
1. created reference files

2. ran find . -xdev -type f -newer $HOME/first.ref -a !-newer $HOME/last.ref -exec cp -p {} /yourname/2005/. \+

======================================

I have used cp instead of mv for testing. The test was only for 1 month of data. But I am getting an error when I execute it.

cp:./filename: not a directory, where filename is the last file which is supposed to be copied as per the reference file.

For example if I use
touch -a -m -t 200510010000.00 $HOME/first.ref
touch -a -m -t 200511302359.59 $HOME/last.ref

the above error is coming up with last file dated 20051130. I tried changing the date for touch and I am getting the same error for the last file created on the date as per last.ref.
OldSchool
Nov 16, 2009 21:25:37 GMT  5 pts

what happens with

find . -xdev -type f -newer $HOME/first.ref -a !-newer $HOME/last.ref -exec ls -l {} \+
ssheri
Nov 16, 2009 21:29:41 GMT    N/A: Question Author

It works fine. I am getting the error with the last file in a particular month as per the last.ref file when I do a cp instead of ll.
Dennis Handly
Nov 17, 2009 10:37:03 GMT    Unassigned

>2. find . -xdev -type f -newer $HOME/first.ref -a !-newer $HOME/last.ref -exec cp -p {} /yourname/2005/. \+

(I would forget about cp instead of mv for a million files.)

>-exec cp -p {} /yourname/2005/. \+

You can't do this. You can only put the {} last. (Unless GNU find does it?)
(Also leave off the stinkin' \ before the "+".)

So you'll need to write a script:
-exec cp_stuff.sh /yourname/2005 +

Inside cp_stuff.sh you have:
#!/usr/bin/sh
# swap first to last to make "find -exec +" happy.
target=$1
shift
cp -p "$@" "$target"

(The quotes are to make Steven happy with his stinkin' spaces. :-)
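The same argument swap can also be done inline with `sh -c`, without a separate script file. This is a sketch under stated assumptions: the demo directory, reference files, and sample names are made up, and the only requirement taken from the discussion above is that `{}` sits immediately before the `+`.

```shell
# Sketch: inline equivalent of the wrapper script, with {} last before +.
demo=${TMPDIR:-/tmp}/exec_demo.$$
mkdir -p "$demo/src" "$demo/2005" && cd "$demo/src" || exit 1
touch -t 200506151200 "one.dat" "two words.dat"   # made-up names, one with a space
touch -t 200412312359.59 "$demo/first.ref"
touch -t 200512312359.59 "$demo/last.ref"

# The inline script takes the target as $1, shifts it away, and hands cp the
# batch of file names that find substituted for {}.
find . -xdev -type f -newer "$demo/first.ref" ! -newer "$demo/last.ref" \
    -exec sh -c 'target=$1; shift; cp -p "$@" "$target"' sh "$demo/2005" {} +
```

The word `sh` after the quoted script fills `$0`, so the target directory arrives as `$1` and the file names follow it.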

>OldSchool: what happens with

Why fiddle with the find syntax when it is the -exec that is the problem?
Also this is what tusc is for.
OldSchool
Nov 17, 2009 15:11:00 GMT    Unassigned

"Why fiddle with the find syntax when it is the -exec that is the problem?
Also this is what tusc is for."

because it lost enough in translation that I couldn't tell that (from what the OP posted)...

Plus, based on the question, I doubt that the OP has the ability to interpret the output of tusc....
 
© 2009 Hewlett-Packard Development Company, L.P.