At4J Programmer's Guide

Karl Gustafsson

Revision History
Revision 1.0  2009-02-07

Table of Contents

1. Introduction
Please help!
License and Copyright
Support
2. Getting started
Requirements
Downloading At4J
Installing At4J
Running unit tests (optional)
3. Data compression
Utilities
Which compression method is best?
4. bzip2 compression
bzip2 utilities
Standalone bzip2 tools
5. gzip compression
Standalone gzip tools
6. LZMA compression
Standalone LZMA tools
7. Archives
Reading archives
Extracting entries
Creating archives
Determining the metadata for an entry
8. Tar
Character encoding in Tar files
Significant Tar features not supported by At4J
Reading Tar archives
Extracting entries from Tar archives
Creating Tar archives
Standalone Tar tools
9. Zip
Character encoding in Zip files
Significant Zip features not supported by At4J
Reading Zip archives
Extracting from Zip archives
Creating Zip archives
Adding support for unsupported features
Adding a new compression method
Adding a new external attribute type
Adding a new extra field type
Standalone Zip tools
Bibliography

List of Tables

3.1. Useful EntityFS classes
3.2. Compression methods
8.1. Tar entry objects

List of Examples

3.1. Compressing and decompressing with bzip2
3.2. Compressing and decompressing with bzip2 using EntityFS utilities
4.1. Compressing and decompressing with bzip2
4.2. Compressing and decompressing with bzip2 using several encoder threads
4.3. Compressing and decompressing with bzip2 using At4J readable and writable bzip2 files
5.1. Working with gzip using readable and writable gzip files
6.1. Compressing and decompressing with LZMA
6.2. Compressing and decompressing with LZMA using manual configuration
6.3. Compressing and decompressing with LZMA using At4J readable and writable LZMA files
7.1. Extracting files from a Zip file using the archive extractor
8.1. Reading data from a Tar archive
8.2. Reading a pax variable for an entry
8.3. Extracting Java source files from a Tar archive
8.4. Build a Tar archive using the Tar builder
8.5. Build a Tar archive using the Tar stream builder
9.1. Reading data from a Zip archive
9.2. Reading metadata from a Zip entry
9.3. Building a Zip archive
9.4. Build a Zip archive and set the compression level

Chapter 1. Introduction

At4J is a set of Java libraries for data compression and file archiving. It has support for reading and building Zip and Tar archives. Through third-party libraries, it has support for bzip2, gzip and LZMA data compression and decompression.

This book is the Programmer's guide. It is written for programmers who want to use At4J in their applications. It gives an overview of At4J's design and capabilities with examples and pointers to the API documentation, which serves as the reference documentation.

Please help!

Feedback and contributions from users are essential for making At4J a better library. Please share your thoughts and opinions on At4J with the rest of the community through the mailing lists.

License and Copyright

At4J is Copyright 2009 Karl Gustafsson. It is licensed under the Gnu Lesser General Public License, version 3. If you want other licensing terms, contact Holocene Software.

Support

Support for At4J can be found on the users mailing list. See the At4J site.

Chapter 2. Getting started

Requirements

At4J requires Java 5 or newer to run.

Downloading At4J

Get At4J from http://www.at4j.org.

At4J is distributed in two Zip archives. The binary archive contains everything necessary for using At4J. The source archive contains everything that the binary archive does, as well as unit test classes, test data and the complete At4J source code in an Eclipse workspace.

Installing At4J

Unzip the distribution into a directory.

The At4J Jar file and its dependencies are in the lib subdirectory of the distribution.

Running unit tests (optional)

The source distribution comes with a Schmant script for running all unit tests.

To run the unit tests, open a command window (terminal, cmd) and change directory to the At4J source distribution's build directory. Set the JAVA_HOME environment variable to point to a JDK 6 installation. (A JRE alone won't do.)

On Unix, run:

$ # For instance
$ export JAVA_HOME=/opt/java6
$ schmant.sh -p javaCmd=[Java command] run_unit_tests.js

where the Java command is the path to the java to use for running the tests.

On Windows, run:

> rem For instance. Note the absence of quotes in JAVA_HOME
> set JAVA_HOME=c:\Program Files\Java\jdk1.6.0_16
> schmant -p "javaCmd=[Java command]" run_unit_tests.js

where the Java command is the path to the java.exe to use for running the tests.

Chapter 3. Data compression

At4J has an implementation of bzip2 and provides other data compression algorithms through third party libraries. All compression methods use Java's streams metaphor—data is compressed by writing to an OutputStream and decompressed by reading from an InputStream. The example below shows how data is compressed and then decompressed using bzip2 compression.

Example 3.1. Compressing and decompressing with bzip2

String toCompress = "Compress me!";

// This will contain the compressed byte array
ByteArrayOutputStream bout = new ByteArrayOutputStream();

// Settings for the bzip2 compressor
BZip2OutputStreamSettings settings = new BZip2OutputStreamSettings().
  // Use four encoder threads to speed up compression
  setNumberOfEncoderThreads(4);

OutputStream out = new BZip2OutputStream(bout, settings);
try
{
  // Compress the data
  out.write(toCompress.getBytes());
}
finally
{
  out.close();
}

byte[] compressed = bout.toByteArray();

// This will print a long range of numbers starting with "[66, 90, 104, ..."
System.out.println(Arrays.toString(compressed));

// Decompress the data again
StringBuilder decompressed = new StringBuilder();
InputStream in = new BZip2InputStream(
  new ByteArrayInputStream(compressed));
try
{
  byte[] barr = new byte[64];
  int noRead = in.read(barr);
  while(noRead > 0)
  {
    decompressed.append(new String(barr, 0, noRead));
    
    noRead = in.read(barr);
  }
}
finally
{
  in.close();
}

// This will print "Compress me!"
System.out.println(decompressed.toString());


Utilities

EntityFS has some utility classes that make I/O programming less verbose. The example below does the same as the example above, but uses the StreamUtil class for reading data from the decompressing stream.

Example 3.2. Compressing and decompressing with bzip2 using EntityFS utilities

String toCompress = "Compress me!";

// This will contain the compressed byte array
ByteArrayOutputStream bout = new ByteArrayOutputStream();

OutputStream out = new BZip2OutputStream(bout);
try
{
  // Compress the data
  out.write(toCompress.getBytes());
}
finally
{
  out.close();
}

byte[] compressed = bout.toByteArray();

// This will print a long range of numbers starting with "[66, 90, 104, ..."
System.out.println(Arrays.toString(compressed));

// Decompress the data again. Use StreamUtil to read data.
byte[] decompressed = StreamUtil.readStreamFully(
  new BZip2InputStream(
    new ByteArrayInputStream(compressed)), 64);

// This will print "Compress me!"
System.out.println(new String(decompressed));


The following EntityFS classes are useful when working with files and streams:

Table 3.1. Useful EntityFS classes

Class        Description
Files        Support for reading from and writing to files.
StreamUtil   Support for reading from and writing to streams.

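The short sketch below illustrates these utilities. It is only meant to show the calling conventions, not a complete program, and it assumes an existing java.io.File f, as in the other examples in this guide.

// Adapt the File to EntityFS' ReadWritableFile interface
ReadWritableFile fa = new ReadWritableFileAdapter(f);

// Write a text to the file and read it back using Files
Files.writeText(fa, "Hello, At4J!");

// This will print "Hello, At4J!"
System.out.println(Files.readTextFile(fa));

// Read all bytes from a stream using StreamUtil, with a 32 byte read buffer
byte[] data = StreamUtil.readStreamFully(
  new ByteArrayInputStream("Hello again!".getBytes()), 32);

// This will print "Hello again!"
System.out.println(new String(data));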

Which compression method is best?

The answer is, of course: it depends. The performance characteristics of the different compression methods are investigated in the At4J test report. The table below summarizes the characteristics of the different compression methods:

Table 3.2. Compression methods

Method   Compression   Speed    Software support
gzip     fair          fast     ubiquitous
bzip2    good          slow     widespread
LZMA     very good     slower   scarce


Chapter 4. bzip2 compression

The bzip2 compression method was developed by Julian Seward in the late nineties. See the Wikipedia article on bzip2 and the bzip2 home page.

At4J provides a Java implementation of bzip2. Data is compressed with the BZip2OutputStream and decompressed with the BZip2InputStream.

Since bzip2 compression is a CPU-intensive task, the BZip2OutputStream supports using several parallel compression threads. An individual BZip2OutputStream may be told how many threads it can use, or several output streams may share a set of threads through a BZip2EncoderExecutorService.

The following example shows how data is written to a bzip2 output stream that writes to a file, and then read again from the file and decompressed.

Example 4.1. Compressing and decompressing with bzip2

// Data will be compressed to the File f

String toCompress = "Compress me!";

OutputStream os = new FileOutputStream(f);
try
{
  // Use the default compression settings (maximum compression)
  OutputStream bzos = new BZip2OutputStream(os);
  try
  {
    bzos.write(toCompress.getBytes());
  }
  finally
  {
    bzos.close();
  }      
}
finally
{
  // Calling close here may mean that close will be called several times on the
  // same stream. That is safe.
  os.close();
}

// Read the compressed data
InputStream is = new FileInputStream(f);
try
{
  InputStream bzis = new BZip2InputStream(is);
  try
  {
    // Use the EntityFS StreamUtil utility to make our job easier.
    // This will print "Compress me!"
    System.out.println(
      new String(
        StreamUtil.readStreamFully(bzis, 32)));
  }
  finally
  {
    bzis.close();
  }
}
finally
{
  // Calling close here may mean that close will be called several times on the
  // same stream. That is safe.
  is.close();
}


The next example shows how a set of encoder threads is shared between two bzip2 streams.

Example 4.2. Compressing and decompressing with bzip2 using several encoder threads

// Data will be compressed to the File objects f1 and f2

String toCompress1 = "Compress me!";
String toCompress2 = "Compress me too!";

// Create a BZip2EncoderExecutorService with four threads.
BZip2EncoderExecutorService executor =
  BZip2OutputStream.createExecutorService(4);

// A settings object containing the executor service
BZip2OutputStreamSettings settings = new BZip2OutputStreamSettings().
  setExecutorService(executor);

try
{
  OutputStream bzos1 = new BZip2OutputStream(
    new FileOutputStream(f1), settings);
  try
  {
    OutputStream bzos2 = new BZip2OutputStream(
      new FileOutputStream(f2), settings);
    try
    {
      bzos1.write(toCompress1.getBytes());
      bzos2.write(toCompress2.getBytes());
    }
    finally
    {
      bzos2.close();
    }
  }
  finally
  {
    bzos1.close();
  }
}
finally
{
  // Shut down the executor service
  executor.shutdown();
}

// Read the compressed data
InputStream bzis = new BZip2InputStream(new FileInputStream(f1));
try
{
  // Use the EntityFS StreamUtil utility to make our job easier.
  // This will print "Compress me!"
  System.out.println(
    new String(
      StreamUtil.readStreamFully(bzis, 32)));
}
finally
{
  bzis.close();
}

bzis = new BZip2InputStream(new FileInputStream(f2));
try
{
  // Use the EntityFS StreamUtil utility to make our job easier.
  // This will print "Compress me too!"
  System.out.println(
    new String(
      StreamUtil.readStreamFully(bzis, 32)));
}
finally
{
  bzis.close();
}


At4J also bundles the bzip2 library from the Apache Commons Compress project. It provides the BZip2CompressorInputStream for reading compressed data and the BZip2CompressorOutputStream for writing compressed data.

Note that the available method of the BZip2CompressorInputStream always returns 0.
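
The rough sketch below shows how the bundled streams can be used. The package name org.apache.commons.compress.compressors.bzip2 and the exact constructor signatures are assumptions here; check the bundled API documentation.

ByteArrayOutputStream bout = new ByteArrayOutputStream();

// Compress with the Commons Compress bzip2 output stream
OutputStream out = new BZip2CompressorOutputStream(bout);
try
{
  out.write("Compress me!".getBytes());
}
finally
{
  out.close();
}

// Decompress with the Commons Compress bzip2 input stream
InputStream in = new BZip2CompressorInputStream(
  new ByteArrayInputStream(bout.toByteArray()));
try
{
  // Since available() always returns 0, read until read() signals end of
  // stream instead of polling available(). StreamUtil does that for us.
  // This will print "Compress me!"
  System.out.println(new String(StreamUtil.readStreamFully(in, 64)));
}
finally
{
  in.close();
}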

bzip2 utilities

The BZip2ReadableFile and BZip2WritableFile objects can transparently decompress or compress bzip2 data that is read from or written to a file. They implement the ReadableFile and the WritableFile interfaces respectively and can be passed to all methods that use those interfaces.

The next example does the same as the example above, except that it uses the BZip2ReadableFile and BZip2WritableFile classes.

Example 4.3. Compressing and decompressing with bzip2 using At4J readable and writable bzip2 files

// Data will be compressed to the File f

String toCompress = "Compress me!";

// Wrap the File in a ReadWritableFileAdapter to make it a
// ReadWritableFile
ReadWritableFile fa = new ReadWritableFileAdapter(f);

// Write the data using the EntityFS utility class Files and a
// BZip2WritableFile. Use maximum compression (9).
BZip2WritableFileSettings writeSettings = new BZip2WritableFileSettings().
  setBlockSize(9);

Files.writeText(new BZip2WritableFile(fa, writeSettings), toCompress);

// Read the data, again using Files. The data is read from a
// BZip2ReadableFile.
// This will print out "Compress me!"
System.out.println(
  Files.readTextFile(
    new BZip2ReadableFile(fa)));


Standalone bzip2 tools

The BZip2 and BUnzip2 classes have runnable main methods that emulate the behavior of the bzip2 and bunzip2 commands. See their API documentation for details on how to use them.

Chapter 5. gzip compression

Table of Contents

Standalone gzip tools

gzip is a compression format that was created by Jean-Loup Gailly and Mark Adler in the early nineties. See the Wikipedia article on gzip.

gzip is supported through Java's standard GZIPOutputStream and GZIPInputStream classes in the java.util.zip package. They filter the data sent to or read from another stream to compress or decompress it, respectively.
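
The minimal sketch below compresses and decompresses a string with these standard streams, in the same way as the bzip2 examples in the previous chapters:

String toCompress = "Compress me!";

// Compress to a byte array
ByteArrayOutputStream bout = new ByteArrayOutputStream();
OutputStream out = new GZIPOutputStream(bout);
try
{
  out.write(toCompress.getBytes());
}
finally
{
  // Closing the stream finishes the gzip trailer
  out.close();
}

// Decompress the data again
InputStream in = new GZIPInputStream(
  new ByteArrayInputStream(bout.toByteArray()));
try
{
  // Use the EntityFS StreamUtil utility to make our job easier.
  // This will print "Compress me!"
  System.out.println(new String(StreamUtil.readStreamFully(in, 64)));
}
finally
{
  in.close();
}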

EntityFS provides the GZipReadableFile and GZipWritableFile classes. They can transparently decompress data read from a file and compress data written to a file.

Example 5.1. Working with gzip using readable and writable gzip files

// Data will be compressed to the File f

String toCompress = "Compress me!";

// Wrap the File in a ReadWritableFileAdapter to make it a
// ReadWritableFile
ReadWritableFile fa = new ReadWritableFileAdapter(f);

// Write the data using the EntityFS utility class Files and a
// GZipWritableFile with default compression settings.
Files.writeText(new GZipWritableFile(fa), toCompress);

// Read the data, again using Files. The data is read from a
// GZipReadableFile.
// This will print out "Compress me!"
System.out.println(
  Files.readTextFile(
    new GZipReadableFile(fa)));


Standalone gzip tools

The GZip and GUnzip classes have runnable main methods that emulate the behavior of the gzip and gunzip commands. See their API documentation for details on how to use them.

Chapter 6. LZMA compression

Table of Contents

Standalone LZMA tools

LZMA, the Lempel-Ziv-Markov chain algorithm, is a compression algorithm that has been under development since 1998. See the Wikipedia article on LZMA.

At4J uses Igor Pavlov's LZMA implementation from the LZMA SDK. It is built around a standalone encoder and a standalone decoder. The encoder reads data from an uncompressed stream and writes it to a compressed stream, and the decoder does the opposite. This pull method of processing data is quite unlike the push model employed by the Java streams API.

At4J provides stream implementations on top of the encoder and the decoder, in effect turning them inside out. To accomplish this, the encoder or the decoder is launched in a separate thread that is running as long as the stream writing to it or reading from it is open. The compressing stream implementation is LzmaOutputStream and the decompressing stream implementation is LzmaInputStream.

Clients are, of course, free to choose between using the LZMA SDK's Encoder and Decoder classes, or using At4J's LzmaOutputStream and LzmaInputStream.

Warning

LZMA does not seem to work well with the IBM JDK. See the At4J test results.

An LzmaOutputStream is configured using an LzmaOutputStreamSettings object. There are several configurable parameters. See the LzmaOutputStreamSettings documentation for details. By default, the output stream writes its configuration before the compressed data. By doing so, an LzmaInputStream reading from the file does not have to be configured manually; it just reads its configuration from the file header. If the compressed data does not contain the compression settings, the input stream can be configured using an LzmaInputStreamSettings object.

The example below shows how data is compressed by writing it to an LZMA output stream and then decompressed again by reading from an LZMA input stream.

Example 6.1. Compressing and decompressing with LZMA

// Data will be compressed to the File f

String toCompress = "Compress me!";

// Create a new LZMA output stream with the default settings. This will write
// the compression settings before the compressed data, so that the stream that
// will read the data later on does not have to be configured manually.
// This starts a new encoder thread.
OutputStream os = new LzmaOutputStream(new FileOutputStream(f));
try
{
    os.write(toCompress.getBytes());
}
finally
{
  // This closes the encoder thread.
  os.close();
}

// Read the compressed data
// This starts a new decoder thread.
InputStream is = new LzmaInputStream(new FileInputStream(f));
try
{
  // Use the EntityFS StreamUtil utility to make our job easier.
  // This will print "Compress me!"
  System.out.println(
    new String(
      StreamUtil.readStreamFully(is, 32)));
}
finally
{
  // This closes the decoder thread.
  is.close();
}


The example below writes LZMA compressed data to a file without writing the compression settings, and then reads the data again using a manually configured input stream.

Example 6.2. Compressing and decompressing with LZMA using manual configuration

// Data will be compressed to the File f

String toCompress = "Compress me!";

// Create the configuration for the output stream. Set two properties and use
// the default values for the other properties. 
LzmaOutputStreamSettings outSettings = new LzmaOutputStreamSettings().
  // Do not write the configuration to the file
  setWriteStreamProperties(false).
  // Use a dictionary size of 2^8 = 256 bytes
  setDictionarySizeExponent(8);
  
// Create a new LZMA output stream with the custom settings.
OutputStream os = new LzmaOutputStream(new FileOutputStream(f), outSettings);
try
{
    os.write(toCompress.getBytes());
}
finally
{
  os.close();
}

// Create the configuration for the input stream. Configure it using properties
// from the output stream configuration above.
LzmaInputStreamSettings inSettings = new LzmaInputStreamSettings().
  setProperties(outSettings.getProperties());
  
// Read the compressed data with a manually configured input stream.
InputStream is = new LzmaInputStream(new FileInputStream(f), inSettings);
try
{
  // Use the EntityFS StreamUtil utility to make our job easier.
  // This will print "Compress me!"
  System.out.println(
    new String(
      StreamUtil.readStreamFully(is, 32)));
}
finally
{
  // This closes the decoder thread.
  is.close();
}


The LzmaWritableFile and LzmaReadableFile objects can transparently compress data written to and decompress data read from a file.

The next example does the same as Example 6.1, “Compressing and decompressing with LZMA”, except that it uses the LzmaReadableFile and LzmaWritableFile classes.

Example 6.3. Compressing and decompressing with LZMA using At4J readable and writable LZMA files

// Data will be compressed to the File f

String toCompress = "Compress me!";

// Wrap the File in a ReadWritableFileAdapter to make it a
// ReadWritableFile
ReadWritableFile fa = new ReadWritableFileAdapter(f);

// Write the data using the EntityFS utility class Files and a
// LzmaWritableFile using its default configuration.
Files.writeText(new LzmaWritableFile(fa), toCompress);

// Read the data, again using Files. The data is read from an unconfigured
// LzmaReadableFile.
// This will print out "Compress me!"
System.out.println(
  Files.readTextFile(
    new LzmaReadableFile(fa)));


Standalone LZMA tools

The Lzma and UnLzma classes have runnable main methods that emulate the behavior of the lzma and unlzma commands. See their API documentation for details on how to use them.

Chapter 7. Archives

An archive is a collection of files and directories, stored as archive entries in a single file. The archive often also stores some kind of metadata for each entry, such as its owner or its latest modification time.

At4J supports reading and creating Zip and Tar archives.

Reading archives

A program can access the files and directories in an archive by creating an Archive object, such as a ZipFile or a TarFile, on the archive file. The archive contains a set of ArchiveEntry objects with data on each archive entry, each identified by its unique AbsoluteLocation in the archive. Entries may be ArchiveFileEntry, ArchiveDirectoryEntry or ArchiveSymbolicLinkEntry objects. An ArchiveFileEntry is a ReadableFile, and an ArchiveDirectoryEntry has a map of child entries with their names as keys. Archive objects implement a read-only Map<AbsoluteLocation, ArchiveEntry>, which makes it easy to access individual entries.

Entries in Zip and Tar archives differ in what kind of metadata they store, so each archive format has its own ArchiveEntry implementations. See Chapter 8, Tar and Chapter 9, Zip for examples.

Since an Archive object keeps the backing archive file open, it must be closed when the program is done using it.
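
The sketch below shows the map view in practice. It assumes a Zip archive in the File f (as in Example 7.1) and simply lists each entry's location and type before closing the archive:

ZipFile zf = new ZipFile(f);
try
{
  // The archive is a read-only map from locations to entries
  System.out.println("The archive contains " + zf.size() + " entries");
  for (Object location : zf.keySet())
  {
    Object entry = zf.get(location);
    String type = (entry instanceof ArchiveFileEntry) ? "file"
      : (entry instanceof ArchiveDirectoryEntry) ? "directory"
      : "other";
    System.out.println(location + " (" + type + ")");
  }
}
finally
{
  // Close the archive to release the backing file
  zf.close();
}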

Extracting entries

Entries can be extracted and copied to a FileSystem by manually traversing the entries in an Archive object or by using the ArchiveExtractor.

Example 7.1. Extracting files from a Zip file using the archive extractor

// Extract all entries from the Zip file f to the directory d

ZipFile zf = new ZipFile(f);
try
{
  // Extract to d
  new ArchiveExtractor(zf).extract(d);
}
finally
{
  // Close the Zip file to release all its resources
  zf.close();
}


The ArchiveExtractor's extraction process can be fine-tuned by giving it a custom ExtractSpecification object.

For some archive formats (Tar), there are customized archive extractors (TarExtractor) that may be faster than the ArchiveExtractor.

Creating archives

An archive file is created with an archive format-specific ArchiveBuilder object, for instance a ZipBuilder, to which entries are then added. See the following chapters for examples.

An archive builder may or may not be a StreamAddCapableArchiveBuilder. A stream add capable builder has methods for adding data read from an input stream as a file entry in the archive.

Determining the metadata for an entry

The metadata added to each entry is determined by its effective ArchiveEntrySettings object (ZipEntrySettings, TarEntrySettings, etc). Entry settings can be defined in three different scopes:

  1. Settings from rules specified for a single add operation (highest precedence). A rule is a settings object paired with a Filter<EntryToArchive> (EntityToArchiveFilter) filter to determine which entries it should be applied to.
  2. Settings from the archive builder's list of global rules.
  3. Default settings for each entry type (files or directories).

The archive builder arrives at the effective settings for each entry by:

  1. Combine the default file or directory settings with the settings from the first applicable global rule.
  2. Combine the settings created by the previous step with the settings from the second applicable global rule.
  3 – n-1. And so on, for each remaining applicable global rule.
  n. Combine the settings created by the previous step with the settings from the last applicable global rule.
  n+1. Combine the settings created by the previous step with the settings from the first applicable rule for the add operation.
  n+2 – n+m-1. And so on, for each remaining applicable rule for the add operation.
  n+m. Combine the settings created by the previous step with the settings from the last applicable rule for the add operation.

When combining settings object A with settings object B, a new settings object C is created that contains the property values from A, overridden by the values of the properties that are set in B.
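
As an illustration of how the settings are combined, the sketch below uses the Tar classes from the next chapter. Only the defaults and one global rule are configured; the variables tarFile (a RandomlyAccessibleFile) and src (a source directory) are assumed. A file named build.sh added by the builder gets the owner id from the default settings combined with the permission mode from the rule.

// Defaults: file entries are owned by uid 1000 and get the mode 0644
TarBuilderSettings settings = new TarBuilderSettings().
  setDefaultFileEntrySettings(
    new TarEntrySettings().
      setOwnerUid(1000).
      setEntityMode(UnixEntityMode.forCode(0644)));

TarBuilder builder = new TarBuilder(tarFile, settings);

// Global rule: script files are executable. The rule's settings object only
// sets the permission mode, so when it is combined with the defaults, the
// owner uid from the defaults is kept and only the mode is overridden.
builder.addRule(
  new ArchiveEntrySettingsRule<TarEntrySettings>(
    new TarEntrySettings().
      setEntityMode(UnixEntityMode.forCode(0755)),
    new NameGlobETAF("*.sh")));

// The effective settings for an added /build.sh file are uid = 1000 (from the
// defaults) and mode 0755 (from the rule); other files keep uid 1000 and 0644.
builder.addRecursively(src, AbsoluteLocation.ROOT_DIR);

// Close the builder to finish writing the archive
builder.close();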

Chapter 8. Tar

Tar is an ancient file format, originally used for making tape backups (Tape ARchives). A Tar file consists of a list of tar entries. Each entry has a header containing its metadata, followed by its data. The metadata contains, at least, the following data:

  • The entry's absolute location in the archive.
  • The Unix permission mode for the entry, such as 0755 or 0644.
  • The entry's owner user and group id:s.
  • The time of last modification of the entry.

There are four significant versions of the Tar file format:

Unix V7
The oldest Tar format. Path names and symbolic link targets are limited to 99 characters (plus the leading slash). Stores only the numerical user and group id:s for each entry. See V7TarEntryStrategy.
Posix.1-1988 (ustar)
Path names are limited to a theoretical maximum of 255 characters (often shorter), and symbolic link targets are limited to 99 characters. Stores the owner user and group names for each entry, in addition to the numerical user and group id:s. See UstarEntryStrategy.
Gnu Tar
Path names and link targets can be of any length. Stores the owner user and group names for each entry, in addition to the numerical user and group id:s. See GnuTarEntryStrategy.
Posix.1-2001 (pax)
Path names and link targets can be of any length. Supports an unlimited number of metadata variables for each entry. See PaxTarEntryStrategy.

Each format is backwards compatible with earlier formats.

The Tar file format does not support any kind of compression of its entries. However, the Tar file itself is often compressed using gzip or bzip2 compression.

For more information on the Tar file format, see the Wikipedia article on Tar and the Gnu Tar manual.

Character encoding in Tar files

There is no standard dictating which character encoding to use for a Tar entry's text metadata, such as its path. Unix Tar programs use the platform's default charset (often UTF-8 or ISO-8859-1), while Windows programs often use Codepage 437. Pax metadata variables are always encoded in UTF-8.

Significant Tar features not supported by At4J

The following significant Tar features are not supported:

  • Adding symbolic links when building a Tar archive.
  • Jörg Schilling's Star file format. At4J might be able to extract Star and Xstar archives fairly well (more testing needed!), but cannot create them.
  • Gnu Tar sparse files.

Reading Tar archives

A Tar archive is read by creating a TarFile object on the Tar file. The TarFile object contains a TarEntry object for each entry in the archive.

Example 8.1. Reading data from a Tar archive

// Read data from the Tar file f

// The UTF-8 charset
Charset utf8 = Charset.forName("utf8");

// Create the Tar file object. Text in the Tar file is encoded in UTF-8
TarFile tf = new TarFile(f, utf8);
try
{
  // Print out the names of the child entries of the directory entry /d
  TarDirectoryEntry d = (TarDirectoryEntry) tf.get(new AbsoluteLocation("/d"));
  System.out.println("Contents of /d: " + d.getChildEntries().keySet());

  // Print out the contents of the file /d/f
  TarFileEntry df = (TarFileEntry) d.getChildEntries().get("f");
  
  // Use the EntityFS utility class Files to read the text in the file.
  System.out.println(Files.readTextFile(df, utf8));
}
finally
{
  // Close the Tar archive to release all resources associated with it.
  tf.close();
}


To access file format version-specific data, the TarEntry objects can be cast to the types representing each Tar file format (see Table 8.1, “Tar entry objects”).

More sophisticated entry types inherit from their less sophisticated brethren; for instance, PaxFileEntry extends UstarFileEntry, which in turn extends TarFileEntry.

The root directory entry in the TarFile, i.e. the directory entry with the absolute location / in the archive, is never present in the Tar archive itself. It is always of the type TarDirectoryEntry.

The next example shows how a pax variable for an entry in a Posix.1-2001-compatible Tar archive is read:

Example 8.2. Reading a pax variable for an entry

// Parse the Tar archive in the file f
// The contents of this Tar archive are encoded in UTF-8. Most of its metadata
// is stored in pax variables, which are always encoded in UTF-8, so not
// knowing the archive's encoding beforehand would probably not have mattered.
TarFile tf = new TarFile(f, Charset.forName("utf8"));
try
{
  // The Tar entry for the file räksmörgås.txt (räksmörgås = shrimp sandwich)
  PaxFileEntry fe = (PaxFileEntry) tf.get(
    new AbsoluteLocation("/räksmörgås.txt"));
  
  // Print out all Pax variable names
  System.out.println("Pax variables: " + fe.getPaxVariables().keySet());
  
  // Print out the value of the ctime variable (file creation time)
  System.out.println("ctime: " + fe.getPaxVariables().get("ctime"));
}
finally
{
  // Close the Tar archive to release its associated resources
  tf.close();
}


Extracting entries from Tar archives

To extract entries from a Tar archive, use the TarExtractor. It extracts entries while parsing the archive, which makes it faster than the more generic ArchiveExtractor. The extraction process can be configured with a TarExtractSpecification object.

Example 8.3. Extracting Java source files from a Tar archive

// Extract XML and Java source files from the Tar archive in the file f to the
// target directory d. The archive is compressed using gzip.

// Use a GZipReadableFile to transparently decompress the file contents.
ReadableFile decompressedView = new GZipReadableFile(f);

TarExtractor te = new TarExtractor(decompressedView);

// Create a custom specification object
TarExtractSpecification spec = new TarExtractSpecification().
  //
  // Don't overwrite files
  setOverwriteStrategy(DontOverwriteAndLogWarning.INSTANCE).
  //
  // Only extract XML and Java source files.
  // Filter on data found in the Tar entry header. The filters used are from
  // the org.at4j.tar package and they implement EntityFS' ConvenientFilter
  // interface and the marker interface TarEntryHeaderDataFilter. Custom filters are
  // easy to implement.
  //
  // We choose to only extract files. Necessary parent directories will be
  // created automatically.
  //
  // Be sure to get the parentheses right when combining filters! 
  setFilter(
    TarFileEntryFilter.FILTER.and(
      new TarEntryNameGlobFilter("*.java").or(
      new TarEntryNameGlobFilter("*.xml")))).
  //
  // The archive is encoded using UTF-8.
  setFileNameCharset(Charset.forName("utf8"));

// Extract!
te.extract(d, spec);


Note

The example above uses the GZipReadableFile to transparently decompress the contents of the archive file before it is fed to the TarExtractor. There are corresponding implementations for bzip2 and LZMA compression in the BZip2ReadableFile and LzmaReadableFile classes, respectively, as well as WritableFile implementations for transparently compressing data written to a file using gzip, bzip2 or LZMA compression.

Creating Tar archives

There are two different classes for creating Tar archives: TarBuilder and TarStreamBuilder. TarBuilder is a StreamAddCapableArchiveBuilder, but it requires a RandomlyAccessibleFile to write to. TarStreamBuilder is not stream add capable, but it makes do with only a WritableFile to write data to[1].

Both Tar archive builders use a TarEntryStrategy object that determines which Tar file format version the created archive will be compatible with. The available strategies are V7TarEntryStrategy, UstarEntryStrategy, GnuTarEntryStrategy and PaxTarEntryStrategy. The default strategy is the GnuTarEntryStrategy.

The configurable metadata for each added Tar entry is represented by a TarEntrySettings object. The effective metadata for the entry is arrived at using the process described in the section called “Determining the metadata for an entry”.

Below is an example that shows how a Tar archive is built using the TarBuilder.

Example 8.4. Build a Tar archive using the Tar builder

// Build the Tar file "myArchive.tar" in the directory targetDir.
RandomlyAccessibleFile tarFile = Directories.newFile(targetDir, "myArchive.tar");

// Configure global Tar builder settings.
TarBuilderSettings settings = new TarBuilderSettings().
  //
  // Make files and directories owned by the user rmoore (1234), group bonds 
  // (4321).
  //
  // The settings object we create here will be combined with the default
  // settings, which means that we only have to set the properties that
  // we want to change from the default values. 
  setDefaultFileEntrySettings(
    new TarEntrySettings().
      setOwnerUid(1234).
      setOwnerUserName("rmoore").
      setOwnerGid(4321).
      setOwnerGroupName("bonds")).
  setDefaultDirectoryEntrySettings(
    new TarEntrySettings().
      setOwnerUid(1234).
      setOwnerUserName("rmoore").
      setOwnerGid(4321).
      setOwnerGroupName("bonds")).
  //
  // Use a Tar entry strategy that will create a Posix.1-2001-compatible
  // archive
  setEntryStrategy(
    // Encode file names using UTF-8
    new PaxTarEntryStrategy(Charset.forName("utf8")));

// Create the Tar builder
TarBuilder builder = new TarBuilder(tarFile, settings);

// Add a global rule that says that script files should be executable.
builder.addRule(
  new ArchiveEntrySettingsRule<TarEntrySettings>(
    //
    // The global rule's settings
    new TarEntrySettings().
      // 
      // The code is an octal value, the same as is used with the chmod command.
      setEntityMode(UnixEntityMode.forCode(0755)),
    //
    // The global rule's filter
    new NameGlobETAF("*.sh")));

// Add all files and directories from the src directory to the /source directory
// in the archive
builder.addRecursively(src, new AbsoluteLocation("/source"));

// Add the headlines from The Times online to indicate the build date...
// Open a stream
InputStream is = new URL("http://www.timesonline.co.uk/tol/feeds/rss/topstories.xml").
  openStream();
try
{
  builder.add(is, new AbsoluteLocation("/todays_news.xml"));
}
finally
{
  is.close();
}

// Close the builder to finish writing the archive.
builder.close();


The following example shows how a Tar archive is built and compressed using the TarStreamBuilder.

Example 8.5. Build a Tar archive using the Tar stream builder

// Build the Tar file "myArchive.tar.bz2" in the directory targetDir.
// Use a BZip2WritableFile to compress the archive while it is created.
WritableFile tarFile = new BZip2WritableFile(
  Directories.newFile(targetDir, "myArchive.tar.bz2"));

// Configure global Tar builder settings.
// Use the default Tar entry strategy (GnuTarEntryStrategy).
TarBuilderSettings settings = new TarBuilderSettings().
  //
  // Files are not world readable
  setDefaultFileEntrySettings(
    new TarEntrySettings().
      setEntityMode(UnixEntityMode.forCode(0640)));

// Create the Tar builder
TarStreamBuilder builder = new TarStreamBuilder(tarFile, settings);

// Add two files
builder.add(
  new NamedReadableFileAdapter(
    new CharSequenceReadableFile("The contents of this file are secret!"),
    "secret.txt"),
  AbsoluteLocation.ROOT_DIR);

builder.add(
  new NamedReadableFileAdapter(
    new CharSequenceReadableFile("The contents of this file are public!"),
    "public.txt"),
  AbsoluteLocation.ROOT_DIR,
  //
  // Use custom settings for this file
  new TarEntrySettings().
    setEntityMode(UnixEntityMode.forCode(0644)));

// Close the builder to finish the file.
builder.close();


Standalone Tar tools

The Tar class has a runnable main method that emulates the behavior of the tar command. See its API documentation for details on how to use it.



[1] This means that a program can give the Tar stream builder a transparently compressing writable file implementation such as GZipWritableFile, BZip2WritableFile or LzmaWritableFile to have the archive compressed while it is created.

Chapter 9. Zip

The Zip file format was originally developed in 1989 by Phil Katz for the PKZIP program. A Zip archive contains file and directory entries, where each file's data is compressed individually. The archive consists of a number of Zip entries, each containing metadata on the entry and, for file entries, the file data, followed by a central directory where some of the metadata for each entry is repeated.

The Zip specification allows for several different compression methods, even within the same Zip archive. The At4J implementation supports the following:

  • Stored (uncompressed)
  • Deflated (gzip compression)
  • bzip2 compression
  • LZMA compression

The Deflated and Stored methods are the most common and are widely supported by Zip software.

Each Zip entry has a metadata record associated with it. It contains data such as the entry's absolute location in the archive, its last modification time, its external file attributes and a comment. (See ZipEntry.) The format of the external file attributes is configurable in order to capture significant attributes from the file system containing the files that were added to the archive. Unix external file attributes, for instance, contain information on the entry's permission mode (the same mode used by the chmod command), such as 0644 or 0755.

The entry metadata can be, and often is, extended using extra fields that contain metadata that does not fit into the standard metadata record. This can for instance be timestamps with a higher precision than the timestamps in the standard record.

The Zip archive itself can also have a comment. It is often printed by the Zip program when the archive is being unzipped.

The Zip file format is specified in PKWARE Zip application note and in Info-Zip's Zip application note. See also the Wikipedia article on Zip.

Character encoding in Zip files

Neither PKWARE's nor Info-Zip's application notes specify which character encoding to use for encoding text metadata. Windows (and DOS) programs use Codepage 437 to encode file paths, and the platform's default charset (Codepage 1252 in Sweden, for instance) for other text metadata such as comments. Unix programs use the platform's default charset (often UTF-8 or ISO-8859-1) for all text data. The Unicode path extra field (UnicodePathExtraField) can be, but seldom is, used to store a UTF-8-encoded version of an entry's path.

Significant Zip features not supported by At4J

The following significant Zip features are not supported:

  • Adding symbolic links when building a Zip archive.
  • Zip archives split over several archive files.
  • Zip file signing.
  • Zip file encryption.
  • Some compression methods.
  • Some entry external file attribute formats.
  • Some entry extra fields.

Reading Zip archives

A Zip archive is read by creating a ZipFile object on the Zip file. The ZipFile object contains a ZipEntry object for each entry in the archive.

Example 9.1. Reading data from a Zip archive

// Read data from the Zip file f

// The UTF-8 charset
Charset utf8 = Charset.forName("utf8");

// Create the Zip file object. Text and file paths in the Zip file are encoded
// in UTF-8
ZipFile zf = new ZipFile(f, utf8, utf8);
try
{
  // Print out the names of the child entries of the directory entry /d
  ZipDirectoryEntry d = (ZipDirectoryEntry) zf.get(new AbsoluteLocation("/d"));
  System.out.println("Contents of /d: " + d.getChildEntries().keySet());

  // Print out the contents of the file /d/f
  ZipFileEntry df = (ZipFileEntry) d.getChildEntries().get("f");
  
  // Use the EntityFS utility class Files to read the text in the file.
  System.out.println(Files.readTextFile(df, utf8));
}
finally
{
  // Close the Zip archive to release all resources associated with it.
  zf.close();
}


External file attributes, compression method metadata and extra fields can be accessed through ZipEntry objects. External file attributes are represented by a ZipExternalFileAttributes-implementing object, compression method metadata by a ZipEntryCompressionMethod object and extra fields by a list of ZipEntryExtraField objects. Each extra field is represented by two objects since it occurs both in the Zip entry's metadata (the local header) and in the central directory at the end of the Zip file. The isInLocalHeader method of a ZipEntryExtraField object can be used to find out where it got its data from: the local header or the central directory.

Example 9.2. Reading metadata from a Zip entry

// Create a Zip archive object for the archive in the file f
// The Zip file metadata is encoded using the UTF-8 charset.
ZipFile zf = new ZipFile(f, Charsets.UTF8, Charsets.UTF8);

// Print out the archive comment
System.out.println(zf.getComment());

// Get the file entry /f1
ZipFileEntry zfe = (ZipFileEntry) zf.get(new AbsoluteLocation("/f1"));

// Print out its comment
System.out.println(zfe.getComment());

// Print out its compression method
System.out.println(zfe.getCompressionMethod().getName());

// Print out its Unix permissions mode
System.out.println(
  ((UnixExternalFileAttributes) zfe.getExternalFileAttributes()).
    getEntityMode());

// Print out the value of the last modification time from the extended timestamp
// extra field from the local file header. Format the data using a
// SimpleDateFormat object.
System.out.println(
  new SimpleDateFormat("yyyyMMdd").format(
    zfe.getExtraField(ExtendedTimestampExtraField.class, true).
      getLastModified()));


The ZipFile object uses a ZipFileParser object to parse the contents of the Zip file. It has a few extension points where additional functionality can be plugged in. See the section called “Adding support for unsupported features” below.

Extracting from Zip archives

Zip entries can be extracted using the ArchiveExtractor. There is no custom extractor for Zip archives.

Creating Zip archives

A Zip archive is created using a ZipBuilder object. It is configured with a ZipBuilderSettings object.

Each added entry is configured with a ZipEntrySettings object. It contains properties for the compression method to use, for the extra fields to add, for the entry comment and for how the external file attributes should be represented. The builder uses the strategy described in the section called “Determining the metadata for an entry” to arrive at the effective settings for each entry.

Below is an example that shows how a Zip archive is built using a ZipBuilder.

Example 9.3. Building a Zip archive

// Build the Zip file "myArchive.zip" in the directory targetDir.
RandomlyAccessibleFile zipFile = Directories.newFile(targetDir, "myArchive.zip");

// Configure the global Zip builder settings.

// Create a factory object for the external attributes metadata
ZipExternalFileAttributesFactory extAttrsFactory =
  new UnixExternalFileAttributesFactory(
    //
    // Set files to be world readable
    UnixEntityMode.forCode(0644),
    //
    // Set directories to be world executable
    UnixEntityMode.forCode(0755));

ZipBuilderSettings settings = new ZipBuilderSettings().
  //
  // Set the default file entry settings.
  setDefaultFileEntrySettings(
    new ZipEntrySettings().
      //
      // Use bzip2 compression for files entries.
      // NOTE: bzip2 is not supported by all Zip implementations!
      setCompressionMethod(BZip2CompressionMethod.INSTANCE).
      //
      // Use the external attributes factory created above
      setExternalFileAttributesFactory(extAttrsFactory).
      //
      // Add an extra field factory for creating the Unicode path extra field
      // that stores the entry's path name encoded in UTF-8.
      addExtraFieldFactory(UnicodePathExtraFieldFactory.INSTANCE)).
  //
  // Set the default directory entry settings.
  setDefaultDirectoryEntrySettings(
    new ZipEntrySettings().
      //
      // Use the external attributes factory created above.
      setExternalFileAttributesFactory(extAttrsFactory).
      //
      // An extra field factory for creating the Unicode path extra field.
      addExtraFieldFactory(UnicodePathExtraFieldFactory.INSTANCE)).
  //
  // Set a Zip file comment.
  setFileComment("This is myArchive.zip's comment.");

// Create the Zip builder
ZipBuilder zb = new ZipBuilder(zipFile, settings);

// Add a global rule that says that all script files (files ending with .sh)
// should be world executable.
zb.addRule(
  new ArchiveEntrySettingsRule<ZipEntrySettings>(
    new ZipEntrySettings().
      //
      // This object only has to contain the difference between the default file
      // settings and the settings for this rule due to the way in which
      // settings are combined.
      setExternalFileAttributesFactory(
        new UnixExternalFileAttributesFactory(
          //
          // Files are world executable.
          UnixEntityMode.forCode(0755),
          //
          // Directories are world executable. (No directories will be matched
          // by the rule's filter, though.)
          UnixEntityMode.forCode(0755))),
    //
    // The filter that determines which entries the rule will be applied to.
    FileETAF.FILTER.and(
      new NameGlobETAF("*.sh"))));

// Add the directory hierarchy under the directory src to the location /source
// in the archive.
zb.addRecursively(src, new AbsoluteLocation("/source"));

// Close the builder to finish writing the archive.
zb.close();


The shortcut method setCompressionLevel on the ZipBuilderSettings object can be used for setting the default compression level for files without having to create a new ZipEntryCompressionMethod object.

Example 9.4. Build a Zip archive and set the compression level

// Build the Zip file "myArchive.zip" in the directory targetDir. Use the best
// possible (deflate) compression.
RandomlyAccessibleFile zipFile = Directories.newFile(targetDir, "myArchive.zip");

// Configure the global Zip builder settings.

ZipBuilderSettings settings = new ZipBuilderSettings().
  //
  // Set maximum compression level for the default file compression method
  // (deflate)
  setCompressionLevel(CompressionLevel.BEST);

// Create the Zip builder
ZipBuilder zb = new ZipBuilder(zipFile, settings);

// Add the directory hierarchy under the directory src to the location /source
// in the archive.
zb.addRecursively(src, new AbsoluteLocation("/source"));

// Close the builder to finish writing the archive.
zb.close();


Adding support for unsupported features

It is possible to plug in support for new extra field types, new compression methods and new external attribute types in the ZipFile and ZipBuilder objects.

Feature implementations will have to work with raw, binary data read from and written to Zip files. They will probably find the number types in the org.at4j.support.lang package and perhaps the utilities in the org.at4j.support.util package useful.

Adding a new compression method

This is how to make ZipFile understand a new compression method:

  1. Implement a new ZipEntryCompressionMethod class.
  2. Implement a new ZipEntryCompressionMethodFactory class.
  3. Create a new ZipFileParser instance.
  4. Register the new compression method factory in the Zip file parser's compression method factory registry.

To use the new compression method with the ZipBuilder, use it with the ZipEntrySettings objects for the files that should be compressed using the new method, or with the default file settings objects if all files should be compressed using it.

Adding a new external attribute type

This is how to make ZipFile understand a new external attribute type:

  1. Implement a new ZipExternalFileAttributes class.
  2. Implement a new ZipExternalFileAttributesFactory class.
  3. Create a new ZipFileParser instance.
  4. Register the new external attributes factory in the Zip file parser's external attributes factory registry.

To use the new external attributes object with the ZipBuilder, use the factory with the ZipEntrySettings objects for the entries that should use the new attributes, or with the default file and directory settings objects if all entries should use them.

Adding a new extra field type

This is how to make ZipFile understand a new extra field type:

  1. Implement a new ZipEntryExtraField class.
  2. Implement a new ZipEntryExtraFieldParser class.
  3. Create a new ZipFileParser instance.
  4. Register the new extra field parser in the Zip file parser's extra field parser registry.

This is how to add entries using the new extra fields to a ZipBuilder:

  1. Implement a new ZipEntryExtraFieldFactory class.
  2. Use the new extra field factory with the ZipEntrySettings for the entries that should have the new extra fields, or with the default file and directory settings objects if all file and directory entries should have them.

Standalone Zip tools

The Zip and Unzip classes have runnable main methods that emulate the behavior of the zip and unzip commands. See their API documentation for details on how to use them.

Bibliography

[Wikipedia article on bzip2] Wikipedia article on bzip2.

[bzip2 home page] bzip2 home page.

[Apache Commons Compress site] Apache Commons Compress.

[Wikipedia article on gzip] Wikipedia article on gzip.

[Wikipedia article on LZMA] Wikipedia article on LZMA.

[LZMA SDK] LZMA SDK home page.

[Wikipedia article on Tar] Wikipedia article on Tar.

[Gnu Tar manual] Gnu Tar manual.

[PKWARE Zip application note] PKWARE Zip application note.

[Info-Zip's Zip application note] Info-Zip's Zip application note.

[Wikipedia article on Zip] Wikipedia article on Zip.

[At4J test report] At4J test report.