Wednesday, March 26, 2008

Killer Wail

Recently I received an email complaining that Orca lost information when doing a Save Transformed As.

The sender was repackaging an msi under extreme time pressure, and Orca was losing files when doing the Save Transformed As. He subsequently downloaded and installed InstEd, and in short order had his transformed msi saved with all data intact.

Orca's "Save As" and "Save Transformed As" issues are well documented, but no less dangerous for that.

From the "Special Considerations when editing Databases" page in the Orca Help:

Embedded Streams and Storages
When a database is saved using the Save As… or Save Transformed As… command, embedded binary streams (such as embedded cabinet files) are not saved to the new database unless they are part of a data row. Embedded sub-storages (nested install files) are never saved to the new database.


But the question remains, why doesn't it save the sub-storages? From my previous post, you might remember that the _Streams table is an abstraction of the subset of the underlying OLE structured storage that represents all the binary fields in the database.

The critical thing to understand is that while all binary fields in regular tables (such as the Icon table) are backed in the _Streams table, not all rows in the _Streams table are represented in regular tables.

When Orca does a Save As or Save Transformed As operation, it creates a new database and copies the data, table by table and row by row, into the target database. But it doesn't copy the _Streams rows that aren't represented in regular tables. Nor does it copy OLE structured storage entities that aren't represented by regular tables. Therefore, this data never makes it into the target database.

So, while the newly created database contains all the persistent tables, it can be missing data that is critical to the msi.
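As a toy model of the behaviour described above, the sketch below uses plain Python dictionaries rather than the Windows Installer API. The database layout, the function names and the stream contents are all hypothetical; the only thing it aims to show is that a copy driven purely by table rows cannot reach a stream that no row points at.

```python
def stream_name(table, row_key):
    """Name of the stream backing a binary field: <table_name>.<row_key>."""
    return f"{table}.{row_key}"


def save_as_row_by_row(source):
    """Toy model of a table-by-table, row-by-row copy into a new database.

    `source` is a hypothetical dict with two keys:
      "tables":  {table_name: [row_key, ...]} for tables with a binary field
      "streams": {stream_name: bytes}
    Only streams that are reachable from a table row are copied.
    """
    target = {"tables": {}, "streams": {}}
    for table, row_keys in source["tables"].items():
        target["tables"][table] = list(row_keys)
        for key in row_keys:
            name = stream_name(table, key)
            if name in source["streams"]:
                target["streams"][name] = source["streams"][name]
    return target


source = {
    "tables": {"Binary": ["Icon"]},
    "streams": {
        "Binary.Icon": b"icon bytes",
        "cab1.cab": b"cab bytes",  # embedded cabinet, no backing table row
    },
}

saved = save_as_row_by_row(source)
lost = sorted(set(source["streams"]) - set(saved["streams"]))
print(lost)  # ['cab1.cab']
```

The embedded cabinet is silently dropped because nothing in the loop ever asks for it.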

Note that critical data can also be stored in underlying OLE storage entities that aren't in the _Streams table. For example, language transforms that are applied when an msi is installed are stored as OLE structured storage entities, but are not represented in the _Streams table.

So why would you ever use the Save As feature in Orca? Well, the only advantage is that it writes a fresh database, which means that the wasted space from many additions/deletions/edits gets trimmed out. But while the msi may be smaller, you have to be sure that all important data in the msi is represented in the persistent (regular) tables; otherwise it will be lost.

It seems to me that, given the risks associated with this command, it would have been better named "Compact Database As" rather than "Save As". Similarly, "Save Transformed As" should warn the user that it may lose information.

There is no way with Orca to perform a "Save Transformed As" without losing the information, unlike "Save As", where you can copy the source msi before editing and saving.

Could a tool compact the database and not lose the critical information? Well, yes, but only if it completely understood how the Windows Installer API uses the OLE structured storage. It could create a new database, copy all the tables, copy all the _Streams rows unique to the _Streams table, and then copy the remaining OLE structured storage entities, but the problem is that there is no documentation on how to tell which entities have already been copied when the tables were copied.

How does InstEd implement Save As and Save Transformed As? InstEd copies the underlying database file to the target, and then applies all changes that have been made since the underlying database file was last saved. This is equivalent to copying the underlying database to the target and then editing it. In this way, all the non-persistent table data is maintained, but you don't get a fresh database, with optimal space saving.

InstEd uses a similar mechanism for "Save Transformed As" to ensure that no data is lost.
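The copy-then-edit idea can be sketched with the same kind of toy model. This is not InstEd's code: the dictionary layout, `save_as_copy_then_edit` and `add_binary_row` are hypothetical names used only to illustrate why duplicating the whole container first keeps data that no table row refers to.

```python
import copy


def save_as_copy_then_edit(source, pending_changes):
    """Toy model: duplicate the whole container, then replay the edits.

    `pending_changes` is a list of callables, each of which mutates the
    database it is given. The source database is left untouched.
    """
    target = copy.deepcopy(source)
    for apply_change in pending_changes:
        apply_change(target)
    return target


def add_binary_row(db):
    """Hypothetical unsaved edit: add a row to the Binary table."""
    db["tables"]["Binary"].append("Banner")
    db["streams"]["Binary.Banner"] = b"banner bytes"


source = {
    "tables": {"Binary": ["Icon"]},
    "streams": {
        "Binary.Icon": b"icon bytes",
        "cab1.cab": b"cab bytes",  # embedded cabinet, no backing table row
    },
}

saved = save_as_copy_then_edit(source, [add_binary_row])
print(sorted(saved["streams"]))  # ['Binary.Banner', 'Binary.Icon', 'cab1.cab']
```

The trade-off is the one noted above: everything in the source survives, including any wasted space, so the result is not compacted.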

If you wanted to achieve equivalent behaviour to Orca's Save As, you could export all the tables, and import them to a new and empty database.

Monday, March 17, 2008

_Streams of Consciousness

Unlike the confused thoughts falling from my mind, the _Streams table is quite critical to a Windows Installer database.

In fact, it is the "location" of all the binary fields in the database.

The _Streams table, consisting of two fields 'Name' and 'Data', is an abstraction of the underlying OLE structured storage data streams. It provides access to the binary streams for the Windows Installer API.

All binary fields are stored in their own OLE stream in the database file, and the _Streams table is generated when an SQL request is made for a binary field. While the _Streams table is temporary and generated on request, changes to the table are persistent.

Every field in the msi database that contains binary data is represented in the _Streams table using the format <table_name>.<row_key>.

So, for a row in the Binary table with a Name field of 'Icon', the _Streams table would contain a row with a 'Name' field of 'Binary.Icon'.

This is a one way relationship. While all binary fields in tables are accessible via a row in the _Streams table, not all rows in the _Streams table represent another table row.

This is important to understand, since critical information is often stored in the _Streams table that is not accessible via regular tables. The most common example is the cab for a merge module (msm).

The installed files that a merge module contains are stored in an _Streams row with a name of 'MergeModule.CABinet' (case-sensitive). Note that there is no 'MergeModule' table with a row called 'CABinet'.
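To make the one-way relationship concrete, here is a small sketch in plain Python (not the Windows Installer API) that lists the stream names with no backing table row. The function name and the sample data are hypothetical, and it assumes the <table_name>.<row_key> naming described above is the only link between a row and its stream.

```python
def unbacked_streams(stream_names, binary_rows):
    """Return stream names that no table row accounts for.

    `binary_rows` maps a table name to the row keys of its binary fields,
    e.g. {"Binary": ["Icon"]}. Both arguments are illustrative inputs.
    """
    backed = {
        f"{table}.{key}"
        for table, keys in binary_rows.items()
        for key in keys
    }
    return sorted(name for name in stream_names if name not in backed)


streams = ["Binary.Icon", "MergeModule.CABinet"]
rows = {"Binary": ["Icon"]}
print(unbacked_streams(streams, rows))  # ['MergeModule.CABinet']
```

'MergeModule.CABinet' looks like a <table_name>.<row_key> name, but since there is no 'MergeModule' table it only exists as a _Streams row.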

Other types of binary data can be stored in the _Streams table without having a corresponding table. Any internal cab file can be stored in the _Streams table without requiring it to be represented in a normal table. So the Media table entry might be '#cab1', or '#cab1.cab', with no attending 'cab1' table.

Given the importance of the _Streams table, it is curious that other tools have not provided direct access to it. InstEd provides access to it, allowing quick access to merge module cabinet files, and a central place to access all binary fields.

However, editing an _Streams table row in InstEd that represents a row in another table (the <table_name>.<row_key> format) will jump to that row in the other table. This is to ensure that the user is well aware that the _Streams row is represented by another table row.

Have you considered what happens if there are two tables with a binary field, where one is called (for example) 'Binary', and the other 'Binary.Table'? Can you have a row in 'Binary' called 'Table.Value' and a row in 'Binary.Table' called 'Value'?

It turns out you can, but changing one field changes the other, since the binary field for both rows is backed in the _Streams table by a row called 'Binary.Table.Value'.
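The collision falls straight out of the naming scheme. The sketch below models the stream store as a Python dictionary, which is an assumption made purely for illustration, and shows both rows resolving to the same key.

```python
def stream_name(table, row_key):
    """Name of the stream backing a binary field: <table_name>.<row_key>."""
    return f"{table}.{row_key}"


name_a = stream_name("Binary", "Table.Value")
name_b = stream_name("Binary.Table", "Value")
print(name_a == name_b)  # True

# A dictionary standing in for the stream store: one name, one stream.
streams = {}
streams[name_a] = b"written via Binary"
streams[name_b] = b"written via Binary.Table"
print(streams[name_a])  # b'written via Binary.Table'
```

Because the separator is also a legal character in table names and row keys, the name alone cannot say which table a stream belongs to.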

Monday, March 10, 2008

The Danger of 'Special' Values

When defining the meaning of a data field, it can be tempting to specify certain values that the field may hold as special. Those values would signify a different meaning than all other values.

A great example of this is in the Windows Installer API, where the docs for MsiRecordSetInteger state "To set a record integer field to NULL_INTEGER, set iValue to MSI_NULL_INTEGER".

So the MSI_NULL_INTEGER constant:
#define MSI_NULL_INTEGER 0x80000000
has a different meaning than any other integer.

There can be good reasons for doing this. The Windows Installer team probably made a special value for NULL to obviate the need for every field in the msi database to have a boolean attached to it specifying whether the field is null or not. Obviously there is a large space saving advantage here.

But the dangers of doing this are obvious. What if someone wants to store 0x80000000 and not have it considered as a null value? This is exactly the reason why the docs for the LockPermissions table declare:
"You cannot specify GENERIC_READ in the Permission column. Attempting to do so will fail. Instead, you must specify a value such as KEY_READ or FILE_GENERIC_READ."

It just so happens that the value of the GENERIC_READ constant is, you guessed it, 0x80000000. Ouch.

So, while the docs are technically correct, it is worth noting that you can use GENERIC_READ, as long as it is combined with another bit flag. This is the reason that InstEd has included GenericRead in its Permissions column bit flag editor.
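The arithmetic is easy to check. The sketch below uses the constant values from the Windows headers; the helpers `to_int32` and `is_msi_null` are hypothetical names written for this illustration and are not part of the Windows Installer API.

```python
MSI_NULL_INTEGER = 0x80000000
GENERIC_READ = 0x80000000
GENERIC_WRITE = 0x40000000
KEY_READ = 0x00020019


def to_int32(value):
    """Interpret the low 32 bits of `value` as a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


def is_msi_null(value):
    """True if the 32-bit pattern is the one reserved for a null integer."""
    return (value & 0xFFFFFFFF) == MSI_NULL_INTEGER


print(to_int32(MSI_NULL_INTEGER))                 # -2147483648
print(is_msi_null(GENERIC_READ))                  # True
print(is_msi_null(GENERIC_READ | GENERIC_WRITE))  # False
print(is_msi_null(KEY_READ))                      # False
```

Only the bare 0x80000000 pattern is reserved, so setting any additional bit alongside GENERIC_READ produces a value that is stored as an ordinary integer.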

InstEd Permission bit flag editor

Friday, March 7, 2008

InstEd is released!

InstEd 1.5 has been released.

Why version 1.5? Well, it had a former incarnation in Camwood's appEditor.

It is now more stable and faster, and has more features, including:

  • Comparing msi files for visual differencing. It also allows comparing an msi against itself so that changes are tracked visually as they are made.
  • Providing an option to automatically update the File and MsiFileHash table details (Version, Language, hash values) for files in the File table from the source files.
  • Compiling with profile-guided optimization to provide greater performance.

So go on, InstEd it!