Chapter 8. Tar

Table of Contents

Character encoding in Tar files
Significant Tar features not supported by At4J
Reading Tar archives
Extracting entries from Tar archives
Creating Tar archives
Standalone Tar tools

Tar is an ancient file format, originally used for making tape backups (Tape ARchives). A Tar file consists of a list of Tar entries. Each entry has a header containing its metadata, followed by its data. The metadata contains, at least, the following data:

There are four significant versions of the Tar file format:

Unix V7
The oldest Tar format. Path names and symbolic link targets are limited to 99 characters (plus the leading slash). Stores only the numerical user and group IDs for each entry. See V7TarEntryStrategy.
Posix.1-1988 (ustar)
Path names are limited to a theoretical maximum of 255 characters (often shorter), and symbolic link targets are limited to 99 characters. Stores the owner user and group names for each entry, in addition to the numerical user and group IDs. See UstarEntryStrategy.
Gnu Tar
Path names and link targets can be of any length. Stores the owner user and group names for each entry, in addition to the numerical user and group IDs. See GnuTarEntryStrategy.
Posix.1-2001 (pax)
Path names and link targets can be of any length. Supports an unlimited number of metadata variables for each entry. See PaxTarEntryStrategy.

Each format is backwards compatible with earlier formats.
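The "often shorter" caveat on ustar path names comes from how the path is stored. The sketch below assumes the POSIX ustar header layout, in which a path is split at a slash into a 155-byte prefix field and a 100-byte name field. The class and method names are illustrative only and are not part of At4J.

```java
import java.nio.charset.StandardCharsets;

final class UstarPathSplitSketch {
    static final int NAME_FIELD_BYTES = 100;   // assumed size of the ustar name field
    static final int PREFIX_FIELD_BYTES = 155; // assumed size of the ustar prefix field

    /** Returns {prefix, name}, or null if the path cannot be stored in a ustar header. */
    static String[] split(String path) {
        if (byteLength(path) <= NAME_FIELD_BYTES) {
            // Short paths go in the name field alone.
            return new String[] {"", path};
        }
        // Try every slash as the split point. The slash itself is not stored.
        for (int i = path.indexOf('/'); i >= 0; i = path.indexOf('/', i + 1)) {
            String prefix = path.substring(0, i);
            String name = path.substring(i + 1);
            if (!prefix.isEmpty() && !name.isEmpty()
                    && byteLength(prefix) <= PREFIX_FIELD_BYTES
                    && byteLength(name) <= NAME_FIELD_BYTES) {
                return new String[] {prefix, name};
            }
        }
        return null;
    }

    private static int byteLength(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }
}
```

Under these assumptions, a 120-character directory name followed by a 120-character file name cannot be stored, even though the whole path is shorter than 255 characters, because no slash yields a name part of at most 100 bytes.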

The Tar file format does not support any kind of compression of its entries. However, the Tar file itself is often compressed using gzip or bzip2 compression.
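Because compression is applied to the archive as a whole, it can be layered on top of the Tar byte stream without the Tar format knowing about it. The following is a minimal sketch using only the JDK's java.util.zip classes; it does not use At4J, and the class name is made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class WholeArchiveGzipSketch {
    /** Compresses the complete Tar byte stream with gzip. */
    static byte[] compress(byte[] tarBytes) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (OutputStream gz = new GZIPOutputStream(sink)) {
            gz.write(tarBytes);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    /** Restores the original Tar byte stream from its gzip-compressed form. */
    static byte[] decompress(byte[] gzBytes) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (InputStream gz = new GZIPInputStream(new ByteArrayInputStream(gzBytes))) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = gz.read(buf)) != -1) {
                sink.write(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }
}
```

A reader has to decompress the stream before it can locate any Tar entry header, which is why the archive file is wrapped in a decompressing view before it is parsed.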

For more information on the Tar file format, see the Wikipedia article on Tar and the Gnu Tar manual.

Character encoding in Tar files

There is no standard dictating which character encoding to use for a Tar entry's text metadata, such as its path. Unix Tar programs use the platform's default charset (often UTF-8 or ISO-8859-1), while Windows programs often use Codepage 437. Pax metadata variables are always encoded in UTF-8.
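The effect of guessing the wrong charset can be shown with plain JDK string handling. The sketch below is not At4J code; it decodes the same header bytes with two different assumed charsets. The class name is illustrative, and the file name is written with Unicode escapes so that the source compiles regardless of the compiler's source encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class TarNameCharsetSketch {
    /** Decodes raw entry name bytes with the charset that the reader assumes. */
    static String decode(byte[] rawName, Charset assumed) {
        return new String(rawName, assumed);
    }

    public static void main(String[] args) {
        // "räksmörgås.txt" as written by a program that uses UTF-8
        String name = "r\u00e4ksm\u00f6rg\u00e5s.txt";
        byte[] written = name.getBytes(StandardCharsets.UTF_8);

        // Correct guess: the original name comes back.
        System.out.println(decode(written, StandardCharsets.UTF_8).equals(name));

        // Wrong guess: every non-ASCII character turns into two unrelated characters.
        System.out.println(decode(written, StandardCharsets.ISO_8859_1).equals(name));
    }
}
```

Since the archive itself does not record the charset, the reading program has to be told which one to use.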

Significant Tar features not supported by At4J

The following significant Tar features are not supported:

Reading Tar archives

A Tar archive is read by creating a TarFile object on the Tar file. The TarFile object contains a TarEntry object for each entry in the archive.

Example 8.1. Reading data from a Tar archive

// Read data from the Tar file f

// The UTF-8 charset
Charset utf8 = Charset.forName("utf8");

// Create the Tar file object. Text in the Tar file is encoded in UTF-8
TarFile tf = new TarFile(f, utf8);
try
{
  // Print out the names of the child entries of the directory entry /d
  TarDirectoryEntry d = (TarDirectoryEntry) tf.get(new AbsoluteLocation("/d"));
  System.out.println("Contents of /d: " + d.getChildEntries().keySet());

  // Print out the contents of the file /d/f
  TarFileEntry df = (TarFileEntry) d.getChildEntries().get("f");
  
  // Use the EntityFS utility class Files to read the text in the file.
  System.out.println(Files.readTextFile(df, utf8));
}
finally
{
  // Close the Tar archive to release all resources associated with it.
  tf.close();
}


To access file format version-specific data, the TarEntry objects can be cast to the types representing each Tar file format:


More sophisticated entry types inherit from their less sophisticated brethren, for instance PaxFileEntry → UstarFileEntry → TarFileEntry.
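Because of this inheritance chain, an unconditional cast to the most specific type fails with a ClassCastException when the archive was written in an older format. The sketch below shows a guarded downcast. It uses hypothetical stand-in classes that only mirror the inheritance chain named above; they are not the real At4J classes, and only the getPaxVariables method name is taken from the examples in this chapter.

```java
import java.util.Collections;
import java.util.Map;

// Hypothetical stand-ins that mirror the inheritance chain; NOT the real At4J types.
class TarFileEntryStandIn { }

class UstarFileEntryStandIn extends TarFileEntryStandIn { }

class PaxFileEntryStandIn extends UstarFileEntryStandIn {
    private final Map<String, String> paxVariables;

    PaxFileEntryStandIn(Map<String, String> paxVariables) {
        this.paxVariables = paxVariables;
    }

    Map<String, String> getPaxVariables() {
        return paxVariables;
    }
}

final class EntryDowncastSketch {
    /** Returns the entry's pax variables, or an empty map for older entry types. */
    static Map<String, String> paxVariablesOf(TarFileEntryStandIn entry) {
        if (entry instanceof PaxFileEntryStandIn) {
            return ((PaxFileEntryStandIn) entry).getPaxVariables();
        }
        return Collections.emptyMap();
    }
}
```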

The root directory entry in the TarFile, i.e. the directory entry with the absolute location / in the archive, is never present in the Tar archive itself. It is always of the type TarDirectoryEntry.

The next example shows how a pax variable for an entry in a Posix.1-2001-compatible Tar archive is read:

Example 8.2. Reading a pax variable for an entry

// Parse the Tar archive in the file f
// The contents of this Tar archive are encoded with UTF-8. Most of its metadata
// is stored in pax variables, which are always encoded in UTF-8, so if we did
// not know the archive's encoding beforehand, that would probably not matter.
TarFile tf = new TarFile(f, Charset.forName("utf8"));
try
{
  // The Tar entry for the file räksmörgås.txt (räksmörgås = shrimp sandwich)
  PaxFileEntry fe = (PaxFileEntry) tf.get(
    new AbsoluteLocation("/räksmörgås.txt"));
  
  // Print out all Pax variable names
  System.out.println("Pax variables: " + fe.getPaxVariables().keySet());
  
  // Print out the value of the ctime variable (file creation time)
  System.out.println("ctime: " + fe.getPaxVariables().get("ctime"));
}
finally
{
  // Close the Tar archive to release its associated resources
  tf.close();
}
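For readers curious about what a pax variable looks like on disk, the sketch below encodes and decodes records in the form defined for pax extended headers by POSIX.1-2001, "<length> <key>=<value>\n", where the decimal length counts the whole record in bytes, including the length digits themselves. This is an independent illustration and not At4J's implementation; the class and method names are made up.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

final class PaxRecordSketch {
    /** Encodes one record. The length prefix includes its own digits. */
    static byte[] encode(String key, String value) {
        byte[] payload = (" " + key + "=" + value + "\n").getBytes(StandardCharsets.UTF_8);
        // Find the length that equals payload size plus the number of its own digits.
        int length = payload.length;
        while (true) {
            int candidate = payload.length + String.valueOf(length).length();
            if (candidate == length) {
                break;
            }
            length = candidate;
        }
        byte[] digits = String.valueOf(length).getBytes(StandardCharsets.US_ASCII);
        byte[] record = new byte[length];
        System.arraycopy(digits, 0, record, 0, digits.length);
        System.arraycopy(payload, 0, record, digits.length, payload.length);
        return record;
    }

    /** Decodes a sequence of records into a map, preserving record order. */
    static Map<String, String> decode(byte[] data) {
        Map<String, String> vars = new LinkedHashMap<>();
        int pos = 0;
        while (pos < data.length) {
            int space = pos;
            while (space < data.length && data[space] != ' ') {
                space++;
            }
            if (space == data.length) {
                throw new IllegalArgumentException("Missing length field");
            }
            int length = Integer.parseInt(
                new String(data, pos, space - pos, StandardCharsets.US_ASCII));
            // Subtract the digits, the space and the trailing newline.
            int bodyLength = length - (space - pos) - 2;
            if (bodyLength < 0 || pos + length > data.length) {
                throw new IllegalArgumentException("Bad record length");
            }
            // Keys and values are always UTF-8, whatever the archive's own charset is.
            String body = new String(data, space + 1, bodyLength, StandardCharsets.UTF_8);
            int eq = body.indexOf('=');
            if (eq < 0) {
                throw new IllegalArgumentException("Missing '=' in record");
            }
            vars.put(body.substring(0, eq), body.substring(eq + 1));
            pos += length;
        }
        return vars;
    }
}
```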


Extracting entries from Tar archives

To extract entries from a Tar archive, use the TarExtractor. It extracts entries while parsing the archive, which makes it faster than the more generic ArchiveExtractor. The extraction process can be configured with a TarExtractSpecification object.

Example 8.3. Extracting XML and Java source files from a Tar archive

// Extract XML and Java source files from the Tar archive in the file f to the
// target directory d. The archive is compressed using gzip.

// Use a GZipReadableFile to transparently decompress the file contents.
ReadableFile decompressedView = new GZipReadableFile(f);

TarExtractor te = new TarExtractor(decompressedView);

// Create a custom specification object
TarExtractSpecification spec = new TarExtractSpecification().
  //
  // Don't overwrite files
  setOverwriteStrategy(DontOverwriteAndLogWarning.INSTANCE).
  //
  // Only extract XML and Java source files.
  // Filter on data found in the Tar entry header. The filters used are from
  // the org.at4j.tar package and they implement EntityFS' ConvenientFilter
  // interface and the marker interface TarEntryHeaderDataFilter. Custom filters are
  // easy to implement.
  //
  // We choose to only extract files. Necessary parent directories will be
  // created automatically.
  //
  // Be sure to get the parentheses right when combining filters! 
  setFilter(
    TarFileEntryFilter.FILTER.and(
      new TarEntryNameGlobFilter("*.java").or(
      new TarEntryNameGlobFilter("*.xml")))).
  //
  // The archive is encoded using UTF-8.
  setFileNameCharset(Charset.forName("utf8"));

// Extract!
te.extract(d, spec);
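The warning about parentheses in the example above can be illustrated with the JDK's Predicate interface, which combines conditions with and/or methods in much the same style. This is only an analogy: the Header class and the predicates below are hypothetical stand-ins, not the At4J filter classes.

```java
import java.util.function.Predicate;

final class FilterGroupingSketch {
    /** Minimal stand-in for the header data that a filter inspects. */
    static final class Header {
        final String name;
        final boolean directory;

        Header(String name, boolean directory) {
            this.name = name;
            this.directory = directory;
        }
    }

    static final Predicate<Header> IS_FILE = h -> !h.directory;
    static final Predicate<Header> IS_JAVA = h -> h.name.endsWith(".java");
    static final Predicate<Header> IS_XML = h -> h.name.endsWith(".xml");

    // Intended grouping: file AND (java OR xml)
    static final Predicate<Header> INTENDED = IS_FILE.and(IS_JAVA.or(IS_XML));

    // One misplaced parenthesis: (file AND java) OR xml.
    // This also accepts directories whose names end in .xml.
    static final Predicate<Header> MISGROUPED = IS_FILE.and(IS_JAVA).or(IS_XML);
}
```

With chained method calls, each call applies to everything built so far, so the grouping is decided by where the closing parenthesis of the inner expression is placed.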


Note

The example above uses the GZipReadableFile to transparently decompress the contents of the archive file before it is fed to the TarExtractor. There are corresponding implementations for bzip2 and LZMA compression in the BZip2ReadableFile and LzmaReadableFile classes, respectively, as well as WritableFile implementations for transparently compressing data written to a file using gzip, bzip2 or LZMA compression.

Creating Tar archives

There are two different classes for creating Tar archives: TarBuilder and TarStreamBuilder. TarBuilder is a StreamAddCapableArchiveBuilder, but it requires a RandomlyAccessibleFile to write to. TarStreamBuilder is not stream add capable, but it makes do with only a WritableFile to write data to[1].
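One plausible reason for this trade-off, stated here as an assumption since the text does not spell it out, is that a Tar entry header records the size of the entry before the data, while the length of a stream is only known once it has been read to the end. With random access, a writer can reserve the size field, copy the stream, and then go back to fill in the size. The sketch below shows that idea with a simplified "header" consisting only of a 12-byte size field; it is not At4J's implementation.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

final class SizeBackpatchSketch {
    static final int SIZE_FIELD_BYTES = 12; // 11 octal digits plus a NUL byte

    /** Writes a placeholder size, copies the stream, then patches the real size in. */
    static long writeEntry(RandomAccessFile out, InputStream data) throws IOException {
        long sizeFieldPos = out.getFilePointer();
        out.write(new byte[SIZE_FIELD_BYTES]); // placeholder, size still unknown

        byte[] buf = new byte[8192];
        long total = 0;
        int n;
        while ((n = data.read(buf)) != -1) {
            out.write(buf, 0, n);
            total += n;
        }

        long end = out.getFilePointer();
        out.seek(sizeFieldPos); // this step is impossible on a write-only stream
        out.write(String.format("%011o", total).getBytes(StandardCharsets.US_ASCII));
        out.write(0);
        out.seek(end);
        return total;
    }

    /** Writes one entry to a temporary file and returns the resulting bytes. */
    static byte[] demo(byte[] content) {
        try {
            File tmp = File.createTempFile("backpatch", ".bin");
            try {
                try (RandomAccessFile raf = new RandomAccessFile(tmp, "rw")) {
                    writeEntry(raf, new ByteArrayInputStream(content));
                }
                return Files.readAllBytes(tmp.toPath());
            } finally {
                tmp.delete();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```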

Both Tar archive builders use a TarEntryStrategy object that determines which Tar file format version the created archive will be compatible with. The available strategies are V7TarEntryStrategy, UstarEntryStrategy, GnuTarEntryStrategy and PaxTarEntryStrategy. The default strategy is the GnuTarEntryStrategy.

The configurable metadata for each added Tar entry is represented by a TarEntrySettings object. The effective metadata for the entry is arrived at using the process described in the section called “Determining the metadata for an entry”.
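The general idea of layering settings objects, where a property left unset falls through to the layer below, can be sketched as follows. The Settings class, its fields and the layering order are assumptions made for illustration; the authoritative rules are those in the referenced section, and this is not the TarEntrySettings implementation.

```java
final class LayeredSettingsSketch {
    /** Hypothetical settings holder. A null property means "not set". */
    static final class Settings {
        Integer ownerUid;
        String ownerUserName;
        Integer mode;

        Settings ownerUid(int value) {
            ownerUid = value;
            return this;
        }

        Settings ownerUserName(String value) {
            ownerUserName = value;
            return this;
        }

        Settings mode(int value) {
            mode = value;
            return this;
        }

        /** Returns a new object where every property set in override wins. */
        Settings overriddenBy(Settings override) {
            Settings result = new Settings();
            result.ownerUid = override.ownerUid != null ? override.ownerUid : ownerUid;
            result.ownerUserName =
                override.ownerUserName != null ? override.ownerUserName : ownerUserName;
            result.mode = override.mode != null ? override.mode : mode;
            return result;
        }
    }
}
```

For example, starting from defaults with owner root and mode 0644, applying an object that only sets the owner keeps mode 0644, and applying a further object that only sets mode 0755 keeps the owner from the previous step.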

Below is an example that shows how a Tar archive is built using the TarBuilder.

Example 8.4. Build a Tar archive using the Tar builder

// Build the Tar file "myArchive.tar" in the directory targetDir.
RandomlyAccessibleFile tarFile = Directories.newFile(targetDir, "myArchive.tar");

// Configure global Tar builder settings.
TarBuilderSettings settings = new TarBuilderSettings().
  //
  // Make files and directories owned by the user rmoore (1234), group bonds 
  // (4321).
  //
  // The settings object we create here will be combined with the default
  // settings, which means that we only have to set the properties that we
  // want to change from the default values.
  setDefaultFileEntrySettings(
    new TarEntrySettings().
      setOwnerUid(1234).
      setOwnerUserName("rmoore").
      setOwnerGid(4321).
      setOwnerGroupName("bonds")).
  setDefaultDirectoryEntrySettings(
    new TarEntrySettings().
      setOwnerUid(1234).
      setOwnerUserName("rmoore").
      setOwnerGid(4321).
      setOwnerGroupName("bonds")).
  //
  // Use a Tar entry strategy that will create a Posix.1-2001-compatible
  // archive
  setEntryStrategy(
    // Encode file names using UTF-8
    new PaxTarEntryStrategy(Charset.forName("utf8")));

// Create the Tar builder
TarBuilder builder = new TarBuilder(tarFile, settings);

// Add a global rule that says that script files should be executable.
builder.addRule(
  new ArchiveEntrySettingsRule<TarEntrySettings>(
    //
    // The global rule's settings
    new TarEntrySettings().
      // 
      // The code is an octal value, the same as is used with the chmod command.
      setEntityMode(UnixEntityMode.forCode(0755)),
    //
    // The global rule's filter
    new NameGlobETAF("*.sh")));

// Add all files and directories from the src directory to the /source directory
// in the archive
builder.addRecursively(src, new AbsoluteLocation("/source"));

// Add the headlines from The Times online to indicate the build date...
// Open a stream
InputStream is = new URL("http://www.timesonline.co.uk/tol/feeds/rss/topstories.xml").
  openStream();
try
{
  builder.add(is, new AbsoluteLocation("/todays_news.xml"));
}
finally
{
  is.close();
}

// Close the builder to finish writing the archive.
builder.close();
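The mode passed to UnixEntityMode.forCode in the example above is written as 0755 with a leading zero, which makes it an octal literal in Java. The sketch below, which is independent of At4J and uses an illustrative class name, converts the nine permission bits to the familiar rwx notation and shows what happens when the leading zero is forgotten.

```java
final class UnixModeSketch {
    /** Renders the lowest nine permission bits as, for example, "rwxr-xr-x". */
    static String toSymbolic(int mode) {
        char[] flags = {'r', 'w', 'x'};
        StringBuilder sb = new StringBuilder(9);
        // Bit 8 is owner read, bit 0 is others execute.
        for (int bit = 8; bit >= 0; bit--) {
            boolean set = (mode & (1 << bit)) != 0;
            sb.append(set ? flags[(8 - bit) % 3] : '-');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toSymbolic(0755)); // octal literal, as used with chmod
        System.out.println(toSymbolic(0640));
        System.out.println(toSymbolic(755));  // decimal 755: a different bit pattern
    }
}
```

The octal literal 0755 is the decimal value 493, so passing decimal 755 would silently produce unrelated permissions.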


The following example shows how a Tar archive is built and compressed using the TarStreamBuilder.

Example 8.5. Build a Tar archive using the Tar stream builder

// Build the Tar file "myArchive.tar.bz2" in the directory targetDir.
// Use a BZip2WritableFile to compress the archive while it is created.
WritableFile tarFile = new BZip2WritableFile(
  Directories.newFile(targetDir, "myArchive.tar.bz2"));

// Configure global Tar builder settings.
// Use the default Tar entry strategy (GnuTarEntryStrategy).
TarBuilderSettings settings = new TarBuilderSettings().
  //
  // Files are not world readable
  setDefaultFileEntrySettings(
    new TarEntrySettings().
      setEntityMode(UnixEntityMode.forCode(0640)));

// Create the Tar builder
TarStreamBuilder builder = new TarStreamBuilder(tarFile, settings);

// Add two files
builder.add(
  new NamedReadableFileAdapter(
    new CharSequenceReadableFile("The contents of this file are secret!"),
    "secret.txt"),
  AbsoluteLocation.ROOT_DIR);

builder.add(
  new NamedReadableFileAdapter(
    new CharSequenceReadableFile("The contents of this file are public!"),
    "public.txt"),
  AbsoluteLocation.ROOT_DIR,
  //
  // Use custom settings for this file
  new TarEntrySettings().
    setEntityMode(UnixEntityMode.forCode(0644)));

// Close the builder to finish the file.
builder.close();


Standalone Tar tools

The Tar class has a runnable main method that emulates the behavior of the tar command. See its API documentation for details on how to use it.



[1] This means that a program can give the Tar stream builder a transparently compressing writable file implementation such as GZipWritableFile, BZip2WritableFile or LzmaWritableFile to have the archive compressed while it is created.