Wednesday, November 26, 2008

Lower your colours!

I will persist with this spelling of colours. Please forgive me if it offends, since I can't win either way.

appEditor's colour scheme was panned by some of the people kindly testing the product. Therefore InstEd was released with their preferred colour scheme*, which I agree is an improvement.

However, Pär Leeman kindly sent me his own preference.

If you want to make your own colour scheme, jump into this registry key:
HKEY_CURRENT_USER\Software\instedit.com\InstEd\Options
and edit the colour values.

The format of the values is

        Alpha  Red  Green  Blue
    0x   FF    FF    FF     FF


I suggest that you don't change the Alpha or you may get weird drawing effects.

You must change the values in the registry while InstEd is not running, otherwise the values will be written over when InstEd closes.

Here is the registry file for Pär's colour scheme:


-------------------------------------------------------------------------
Windows Registry Editor Version 5.00

[HKEY_CURRENT_USER\Software\instedit.com\InstEd\Options]
"COLOR_SELECTED_ITEM_BORDER"=dword:cf03a5fc
"COLOR_FOCUSED_END"=dword:ffffff03
"COLOR_FOCUSED_START"=dword:3cf7f74d
"COLOR_SELECTED_END"=dword:cf03a5fc
"COLOR_SELECTED_START"=dword:cff7fcfd
"COLOR_INVALID_END"=dword:ffff3d54
"COLOR_INVALID_START"=dword:ffff3d54
"COLOR_BASE_ONLY_END"=dword:ff999696
"COLOR_BASE_ONLY_START"=dword:ff999696
-------------------------------------------------------------------------


Feel free to post your own schemes.

Note: You can either export the Options key from regedit as a backup, or simply delete all the colour values, to have them revert to defaults.

* Many thanks to Carita Comty

Friday, November 21, 2008

Working with Cabinets in InstEd

The new version of InstEd provides a mechanism for building/rebuilding cabs in the msi.

Help documentation on this feature is now available here.

Wednesday, November 12, 2008

InstEd 1.5.3.7 Released!

Finally, the new version is available for download.

Release notes here.

Documentation on new features will follow shortly.

Tuesday, September 30, 2008

Tips and Tricks Part 4 - Moving component elements

While it would be great to have drag and drop of things like files and registry entries between components, in the meantime there is a nifty trick to move multiple elements between components.

The basic trick is to rename the source component to a temporary name, updating the references of only those elements that you wish to move. Then rename the temp component back to its original name, updating no references. This will leave the elements that you wish to move "orphaned", referencing an invalid component.

Then, rename the target component to the temporary name, updating no relationships. This will leave the target component's elements "orphaned", but the elements that are being moved will now be attached to the "temp" component.

The final step is to rename the "temp" component back to its original name (the target component), updating all relationships. This will result in the target component containing its original elements and the "moved" elements.

For example:


Start:
    SourceCmp      TargetCmp
      File1          File3
      File2          File4
      Reg1           Reg3
      Reg2           Reg4

Step 1 (rename SourceCmp to TempCmp, updating only the references
        of the elements to move, File2 and Reg2):
    ---> TempCmp   TargetCmp
      File1          File3
      File2          File4
      Reg1           Reg3
      Reg2           Reg4

Step 2 (rename TempCmp back to SourceCmp, updating no references):
    SourceCmp <--- TargetCmp
      File1          File3
      File2          File4
      Reg1           Reg3
      Reg2           Reg4

Step 3 (rename TargetCmp to TempCmp, updating no references):
    SourceCmp      TempCmp <---
      File1          File3
      File2          File4
      Reg1           Reg3
      Reg2           Reg4

Step 4 (rename TempCmp back to TargetCmp, updating all references):
    SourceCmp ---> TargetCmp
      File1          File3
                     File4
                     File2
      Reg1           Reg3
                     Reg4
                     Reg2



When only moving a few items in one table, it is probably easier to simply copy and paste the target component's name into the relevant fields. But when moving many entries, especially from disparate tables, this method can be quicker. It is probably easiest done from the Component (F10) or Feature (F11) tree views.

This method applies equally well to any parent child entity where the children reference the parent, such as Dialogs and Controls.

Wednesday, September 17, 2008

Is a Managed InstEd Manageable?

For the sake of curiosity, and also to explore the possibility of utilising some of the WPF functionality, I thought I would compile InstEd as a managed (C++/CLI) .NET executable.

The first issue that I encountered was that the default boost regular expression object library (.lib) could not be linked in successfully, due to incompatibilities between the compile time options used to build the lib and the compile time options required for building a C++/CLI exe.

So, after creating a new Visual C++ project for the regular expression library, and compiling that under C++/CLI as a dll, I managed to compile my C++/CLI exe utilising that regex dll. This problem may have been resolved in later releases of boost, and it may go away by utilising the tr1/regex library that Microsoft ships with Visual Studio 9.

Bearing in mind that I changed no code, and that InstEd utilises some cutting edge C++ techniques, I was very impressed that the C++/CLI compiler could make a working executable. It's a credit to the compiler team, I think.

However, even with a working executable, there were two significant issues. Firstly, the executable size went from 1.5MB to 4.5MB. That's a huge increase.

The second issue was that a quick performance check showed a doubling in the time taken to build the relationships for a large file.

While it is not entirely sensible to build a C++/CLI version of a native app for no good reason (i.e. not utilise the .NET framework), it does highlight that significant reworking of the code would be necessary to make it worth building as a managed executable. Most likely, any use of the .NET framework will have to happen in dll's, leaving the main executable completely native.

However, rest assured that for the moment, InstEd will continue to be built with minimum dependencies. Even the C++ runtime is built into the exe to avoid "dll hell".

Monday, August 18, 2008

Tips and Tricks Part 3 - Generating a new GUID

Fields which are designated to contain GUIDs (e.g. Component::ComponentId) will present a "New GUID" button when edited.

Additionally, all string fields can be replaced with a newly generated guid by pressing CTRL+G when the field is selected but the editor is not open. Most field editors also support CTRL+G.

Tips and Tricks Part 2 - Displaying the current file's path

Clicking on the filename tab will display the path of the loaded file in the status bar.

In the case of transforms, the status bar will display the transformation chain (base msi, and the sequence of transforms that have been applied). In this case, it will show the filenames only.

Alternatively, as with most applications, clicking File->Save As will ensure that the file dialog opens in the folder of the current file.

Friday, August 1, 2008

Tips and Tricks Part 1 - Edit tracking and undo

"Tricks and Tips" would just give the wrong impression.

InstEd supports viewing transforms (mst files) and highlights all the changes that the transform makes to the base msi.

Under the hood, this is implemented by loading all the tables of the base msi and all the tables of the transformed msi. Some magic within the underlying InstEd table structures then does diffing of each row/cell to determine whether and how they should be highlighted.

Eventually I got around to using the same code to diff between msi's. Instead of diffing the tables between a transform and a base msi (or chain of transforms applied to a base msi), InstEd can happily diff between two msi files. When an msi is open, select the menu option: Transform->Compare To...

This is a useful thing to do in many instances. For example, every new release of the InstEd msi is generated by copying the previous release, and then diffing it against the release before that so that I can see what fields I need to change (ProductCode, Upgrade table, etc).

(Note that InstEd won't cope very well (read "crash") with identically named tables that have different schemas... just yet.)

The next obvious step was to extend the diffing code so that an open msi that is being edited can be compared to itself.



This means that as edits are made to the file, they will be highlighted, based on the same logic as the transform diffing code. So, changed cells will be highlighted in green, deleted rows will be highlighted in yellow, tables that contain changes will be highlighted in green, and deleted tables will be highlighted in yellow.



This highlighting is great, but ultimately, the big advantage comes from the undo mechanism provided by this diffing code. The diffing code allows for "Revert to base...", which makes the selected item (cell or table) identical to that of the base msi.

Right clicking on a cell or table provides the "Revert to base..." option.


To revert one or more rows that have been deleted, copy the deleted rows (CTRL+SHIFT+C) and paste them back again (CTRL+SHIFT+V).

In this fashion a reasonably capable undo mechanism is available.

It's not the same as most undo implementations, which use a chain of undo/redo commands, however in some ways it is more flexible. For example, it can immediately "revert to base" any given change without having to undo all the subsequent changes in the chain. However, it will only "revert to base"; it can't revert to intermediate values (i.e. change a cell once, then change it again, and the first change is unrecoverable via this mechanism).

Note that another excellent undo mechanism is to close the file without saving.

If you think that this feature should be enabled by default (or selectable) so that when opening an msi, InstEd automatically diffs it against itself, post a comment.

Wednesday, July 23, 2008

Help - External Tools

The Online Help for InstEd has been extended with a section on External Tools.

Tuesday, June 24, 2008

TypeLib or not TypeLib

That is the question.
Whether tis nobler in the mind to suffer
the slings and arrows of outrageous fortune,
or just use the bloody Registry table?


It seems that Microsoft have effectively deprecated the TypeLib table.

From the msdn docs:

Installation package authors are strongly advised against using the TypeLib table. Instead, they should register type libraries by using the Registry table. Reasons for avoiding self registration include:

  • If an installation using the TypeLib table fails and must be rolled back, the rollback may not restore the computer to the same state that existed prior to the rollback. Type libraries registered prior to rollback may not be registered after rollback.
Additionally, if you compare (using InstEd of course) the _BadRegData tables from early darice.cub files with that in the latest Windows Server 2008 SDK, all the references to TypeLib subkeys have been removed. This means they no longer show up as ICE33 errors. Microsoft really want you to put TypeLib registration into the Registry table.

But why is this? Their justification is that registry keys created/overwritten by the RegisterTypeLibraries action cannot be restored to their previous state upon rollback. With the Registry table entries, the Windows Installer engine guarantees that registry keys modified during an installation are returned to their previous state upon rollback.

However the RegisterTypeLibraries action most likely calls the RegisterTypeLib api. The Windows Installer engine has no control over what that api does, and cannot therefore manage the registry keys so that they are restored upon rollback.

Hence Microsoft's strong suggestion to use the Registry table.

But a closer examination shows that the suggestion solves a problem only when things go wrong, and leaves a gaping problem when things go right.

Consider the automobile airbag. It's a fantastic device for saving the average driver's face of average attractiveness in the case of an accident. We are all grateful they exist. The Registry table is the Windows Installer airbag. It saves the registry from deformity in the case of an accident by ensuring that, upon rollback, the registry is restored to its state prior to the installation.

However, imagine that whenever the driver turned the ignition off, the steering wheel whacked them in the face, airbag safely tucked inside its protective casing.

This is the problem with declaring the typelib table unsafe. Sure, the airbag will save you when you have an accident, something most drivers don't set out to do. But the Windows Installer engine can't save the registry when everything goes well, the package is installed perfectly, and then is uninstalled for whatever reason, regardless of whether registry keys are set via the Registry or the TypeLib tables.

The Windows Installer engine only holds the registry state for the duration of the installation, allowing restoration only upon rollback. Upon successful installation and uninstallation, both the Registry table and the TypeLib table will result in keys/values being deleted, regardless of their state prior to the installation. It's the whack in the face for turning the ignition off.

Addressing the airbag issue, it seems that given they have the code for the RegisterTypeLib api (and code for rollback of the registry), they could replace it with a version that successfully restores the registry upon rollback, just like the Registry table does. This would provide all the same protection as the Registry table, and not require modification of all the existing msi's out there.

So, should you use the TypeLib table? Well, given that MS haven't done the work to allow successful rollback on the TypeLib table, if you have the tools to automatically generate the Registry table entries, then I can't see why not. But the extra work may not be worth it, considering that the majority of cases where things go as intended (install/uninstall) will suffer the same problems as the minority of cases where things have gone wrong (rollback).

The only caveat to this is that most conflict checking utilities will not look inside a typelib for conflicts between typelibs (e.g. this interface's typelib info is provided by this file, no this file, no this file). In this case, using the Registry table probably provides a more significant benefit than using the TypeLib table (note to software vendors, your packages will likely be used in enterprises where they are conflict managed, so help them out).

But this of course isn't Microsoft's justification, and their justification seems a little weak for such a strong recommendation.

Tuesday, June 10, 2008

What! Are you blind?

I didn't say it out loud, but I was thinking something similar.
Personification of inanimate objects like compilers is daft,
but it feels good. That's my excuse anyway.

Here's the situation.

Given this C++/CLI code:

template< typename T >
ref struct A
{
    ~A()
    {
    }
};

ref struct B
{
    A<int>^ a;
};

The linker generates this error:
error LNK2020: unresolved token (06000414) A<int>::Dispose

Knowing that in C++/CLI, the destructor syntax generates a Dispose function for ref classes/structs, the usual resolution is to ensure that A class has a destructor defined.

Which is where you say "What! Are you blind?". Because there it is. And no matter how many times you check it to make sure it exists and hit the recompile button, the linker still spits out the same error. I know. I tried a lot of times.

The solution? Change the member declaration in B to:
A<int> a;


The reason? There are conflicting behaviours between the C++ template compilation and the C++/CLI MSIL generator. By declaring a destructor, the MSIL generation insists that the class must have a Dispose method. But the C++ template compilation will not emit definitions of functions in template classes unless the function is actually used.

By using the handle syntax (^), but never manually calling a->Dispose(), the compiler will never emit the Dispose function. Removing the handle syntax, such that the "a" variable exists on the stack (at least notionally), means that behind the scenes the compiler generates the call to a->Dispose() when the variable goes out of scope.

In native C++, the template optimisation is exactly what most programmers want. Why have a function defined if it is never called?

However an MSIL class in an assembly must be fully defined, regardless of what functions are called.

Hence, unless you manually call Dispose, or use the stack based syntax, the compiler never generates the Dispose function for the linker to find, which is invalid for an MSIL class in an assembly.

Of course, the resulting error is useful because I realised that I hadn't managed the object properly. Unfortunately this can only be considered serendipitous, because should many objects of type A exist, but only one of them get Disposed properly (or even improperly), the unresolved external error won't show.

The unanswered question: it seems that there is something special about the Dispose function (or how it is generated with the destructor syntax), because the same is not true of an ordinary function. E.g.:

template< typename T >
ref struct A
{
    ~A()
    {
    }

    void f()
    {
    }
};

ref struct B
{
    A<int> a;
};

This code does not generate an unresolved external error for the f function. Further investigation with ILDasm is required.

Note: This post is relevant to Visual Studio 2005. I am not sure of how VS 2008 behaves.

Friday, June 6, 2008

InstEd 1.5.2.3 Released!

1.5.2.3 has been released. With huge performance improvements, the addition of forward and back navigation, and quite a few bug fixes, it is now better than ever, making packaging even more productive.

Release notes here.
Download here.

Thursday, May 29, 2008

Is Null Null or Not Null

One of the people on the AppDeploy forums recently asked for the option of having null fields be represented in the tables by an empty string instead of the string "<null>".

Another person responded by saying that that would be bad, because a null string is not the same as an empty string.

Well, usually that is correct, and can be quite an important distinction. In most databases, a null field is certainly different from an empty string. However in the case of Windows Installer databases, after storing an empty string in the database, calling MsiRecordIsNull on that field will return true. So a non-null string (empty) can be a null string, at least for Windows Installer.

Of course, I have already examined the issue of null integers being represented by a special number. Again, a non-null integer can be a null integer.

But, by the time we get to binary fields (the last remaining fundamental type), we find suddenly that null binary fields really are null, and are distinct from every other binary value.

That is, the only way to store a null binary field is to pass NULL to the MsiRecordSetStream function. Passing a path to an empty file results in a non-null binary field of 0 bytes.

Actually, it's not strictly true that passing NULL to the MsiRecordSetStream function is the only way to set a binary field to null. You can actually call MsiRecordSetInteger with the MSI_NULL_INTEGER value. This will happily set a binary field to null.

So null is not null but it is null.

As for the original request to use an empty string to display null values, I added the option to the next release (accessible only in the registry settings at the moment). In fact you can use any string you like to represent null now.

Unfortunately, having an empty string to represent null values makes scanning the rows a bit difficult (there is no grid to help out), and also is indistinguishable from a string of white space. Which is rare but probably not quite as rare as "<null>". Back to the problem of special values.

Friday, May 23, 2008

Orca eats bugs, again

Just to provide an update to the Vista SDK Orca bug, here is the Windows Installer Team Blog's description of the bug:

Updates that can be found in the Windows SDK for Windows Server 2008:
  • Orca crashes when a transform is generated and a row is deleted from the current table: Orca crashes if the user attempts to generate a new transform and deletes a row from an existing table. (Using Orca, select New Transform, then delete a row from a table.)

Of course it is more serious than that, because it can't even view a transform that deletes a row.

Just to be clear, the bug was not introduced as a special feature in the 2008 SDK version. It was fixed in that version.

Wednesday, May 14, 2008

Orca Eats Bugs

It's good to know that Orca is at least getting bug fixes.

Orca 4.0.5299.0 which comes with the Vista Platform SDK suffered a problem where deleting certain rows in a transform would cause it to crash.

The repeatable example (and it may not be confined to this) is a transform that deleted the only row in a table. You can reproduce it by starting Orca and opening "c:\program files\orca\orca.dat" (or equivalent). This is Orca's template msi. Perform a File->Save As to save the msi somewhere (not over orca.dat). Add a dummy row to the LockPermissions table (for example) and then File->Save.

Then select Transforms->New Transform, and delete the LockPermissions row. Bang, Orca beaches itself.

This came to my attention because Orca would crash when clicking the LockPermissions table when viewing a transform (built by another tool) that deleted the only row in the table. So it's not just that Orca can't generate these transforms, it can't even view them.

But, the good news is that the Orca version released in the Windows Server 2008 SDK seems to have fixed this problem. So if you must use Orca, definitely get the latest.

Of course, you will be far more productive with InstEd, at no extra cost (and a much smaller download).

Thursday, May 1, 2008

OpenMP utility code

The relationship building code in InstEd was running a bit slow on my very large test msi. It was taking 22 minutes to build the table relationships.

This was a bit long, so having noticed that Visual C++ 2005's compiler came with OpenMP support, I thought I would have a go at utilising the dual cores on my dev machine to speed this up.

The first task at hand was to determine whether the loops involved were suitable for breaking down into parallel partitions. Some were, and some weren't. My starting point was this loop:

// pseudocode
foreach( row in refing_table.rows() )
{
    // get rows that row references
    // store the relationships
}


This loop was suitable because except for storage of the relationships, the majority of the work didn't write any data anywhere, and was fairly computationally expensive.

Ideally, the OpenMP version would look like this:
// pseudocode
#pragma omp parallel
{
    #pragma omp for
    for( Rows::iterator row = refing_table.rows().begin();
         row != refing_table.rows().end();
         ++row )
    {
        // get rows that row references

        #pragma omp critical
        {
            // store the relationships
        }
    }
}


However the #pragma omp for directive has some major limitations, primarily that the iteration variable must be an integer type. If Rows were a type that had random access iterators then the loop could be changed to use integers for iteration. However, Rows is actually a std::list. So advancing the iterator from begin on each iteration could be expensive.

What I needed was a function that would split the rows() list into partitions suitable for each thread working on the parallel block. And it would be great if the calculation of the partition suitable for a given thread could happen inside the parallel block, thereby further utilising the OpenMP benefits.

(Naturally, splitting the rows() list would happen as an abstraction through iterator ranges, not by actually copying the list items.)

This would be a common problem for any C++ programmers wanting to utilise OpenMP in loops that iterated over STL containers.

So, in an effort to make the solution generic and easy to utilise, I wrote some code for just such a purpose. It in turn utilises the excellent Boost.Range library and concepts, which allow the code to work on any container that models the Range concept, including native C style arrays, and STL containers.

The first task was to write a function to evenly split a given range into partitions. After my first ugly, clumsy, and inefficient effort, Nathanael Rensen came up with an excellent algorithm.

///////////////////////////////////////////////
//
// The returned sub range is such that if this function is called
// for each partition [0,partition_count), the entire "range"
// will be covered by all returned sub ranges, and distributed
// amongst the partitions in the most even size distribution possible.
//
// The size parameter must specify the size of the range.
// This overload, accepting a size, is preferable where
// range.size() may be expensive.
//
template<typename Range>
inline boost::iterator_range< typename Range::iterator > split_range(
    const Range& range,
    int partition_count,
    int partition,
    int size )
{
    typename Range::iterator begin = boost::begin( range );
    typename Range::iterator end = boost::end( range );

    if( partition_count > 1 )
    {
        int remainder = size % partition_count;
        int quotient = size / partition_count;

        if( partition < remainder )
        {
            std::advance( begin, partition * ( 1 + quotient ) );
            end = begin;
            std::advance( end, quotient + 1 );
        }
        else
        {
            std::advance( begin, remainder + partition * quotient );
            end = begin;
            std::advance( end, quotient );
        }
    }

    return boost::make_iterator_range( begin, end );
}


///////////////////////////////////////////////
//
// The returned sub range is such that if this function is called
// for each partition [0,partition_count), the entire "range"
// will be covered by all returned sub ranges, and distributed
// amongst the partitions in the most even size distribution possible.
//
// Use this overload where range.size() is not expensive
// (i.e. Range::iterator models random_access_iterator )
//
template<typename Range>
inline boost::iterator_range< typename Range::iterator > split_range(
    const Range& range,
    int partition_count,
    int partition )
{
    return split_range( range, partition_count, partition, range.size() );
}


Having got the partitioning out of the way, the next part was allowing it to be easily used from within an omp parallel block. This turned out to be surprisingly easy.



///////////////////////////////////////////////
//
// This function should be called within a #pragma omp parallel
// block, and returns a sub_range of the input range.
//
// The returned sub range is such that if this function is called
// by each thread in the parallel thread group, the entirety of "range"
// will be covered by all threads, and distributed amongst the threads
// in the most even size distribution possible.
//
// The size parameter must specify the size of the range.
// This overload, accepting a size, is preferable where
// range.size() may be expensive.
//
template<typename Range>
inline boost::iterator_range< typename Range::iterator >
split_range_openmp(
    const Range& range,
    int size )
{
    int thread_count = omp_get_num_threads();
    int thread = omp_get_thread_num();

    return split_range( range, thread_count, thread, size );
}


///////////////////////////////////////////////
//
// This function should be called within a #pragma omp parallel
// block, and returns a sub_range of the input range.
//
// The returned sub range is such that if this function is called
// by each thread in the parallel thread group, the entirety of "range"
// will be covered by all threads, and distributed amongst the threads
// in the most even size distribution possible.
//
// Use this overload where range.size() is not expensive
// (i.e. Range::iterator models random_access_iterator )
//
template<typename Range>
inline boost::iterator_range< typename Range::iterator >
split_range_openmp( const Range& range )
{
    return split_range_openmp( range, range.size() );
}




Each thread operating on an OpenMP parallel block gets a thread number from 0 to thread_count - 1, which is perfect for the partitioning.

So, the usage is now as simple as:

#pragma omp parallel
{
    boost::iterator_range< Rows::iterator > range
        = split_range_openmp(
            refing_table.rows(),
            refing_table.rows().size() );

    for( Rows::iterator row = boost::begin( range );
         row != boost::end( range );
         ++row )
    {
        // get rows that row references

        #pragma omp critical
        {
            // store the relationships
        }
    }
}




And voila, the rows() list is kindly split into equal sections for each OpenMP thread to work on.

Well almost. Because the utility functions don't modify the input range, they accept it as a const reference. However, in the case of containers, this results in const_iterators being returned, which is incompatible with the stated return type using Range::iterator.

In order to work around this, you could either add non const versions of each of the utility functions, add a template parameter to each for the desired iterator type, or utilise make_iterator_range with non const iterators.

boost::iterator_range< Rows::iterator > range
    = split_range_openmp(
        boost::make_iterator_range(
            refing_table->rows().begin(),
            refing_table->rows().end() ),
        refing_table->rows().size() );


But there it is. Some utility functions to make it easy to utilise multiple threads on loops with non-integer iterators.

The full header file can be found here: split_range.hpp.

Performance
And what was the result of utilising OpenMP for this loop? Well, I actually made the biggest improvement by comparing the binary data of each field for relationships, instead of comparing their display strings which were built each time. This dropped the time from 22 minutes to 6:45. Adding the OpenMP loop dropped it again to 5:05, but raised the total processor time by about 2 minutes. This was probably due in part to the overhead of splitting the range, and in part to the overhead of the OpenMP code.

But, it's worth noting that there can often be big performance increases found before resorting to multithreading.

Update
I subsequently found this paper which discusses OpenMP and non-conformant (non-integer iterator) loops, and provides some alternative solutions.

Try this new InstEd instead

The next version of InstEd has been released.

Roll over to the InstEd web site to see the new features and download it.

I am sure you will find it the fastest Windows Installer Editor.

Monday, April 21, 2008

Can You Hitch in a Cab?

Sometimes when repackaging an msi for an enterprise it's necessary to add files to an installation using a transform. For example, you may wish to drop out a common file that has been captured by the original vendor with a custom component code, and install it using its proper merge module.

It would be great to be able to embed the cab into the transform, so that the transform is self-contained. Well there's good news, and bad news.

The good news is that you can embed the cab in the transform. The bad news is that the transform can't be used during installation. Which is pretty bad news.

So effectively, you can hitch in a cab, but it's illegal. At least in the msi world.

Here's the low down. You can embed a cab in the transform as long as the cab's binary field is listed in a regular table entry. For example, it must be listed in the Binary table. If the cab is listed only in the _Streams table, it won't get saved into the transform.

If you generate a transform file with an embedded cab, and apply it to an msi in InstEd, you will successfully be able to extract the cab again, proving that the transform contains the cab file. In fact, if you apply the transform to an msi, and perform a Save Transformed command, the resultant msi will install fine, including any files from the added cab.

However, if you apply the transform with the embedded cab to the msi during an installation:
msiexec /i msi_file TRANSFORMS=mst_file
then you get an error related to the installation not being able to find the cab.

My guess is that this is related to this little snippet in the msdn docs for MsiDatabaseApplyTransform:
The MsiDatabaseApplyTransform function delays transforming tables until it is necessary. Any tables to be added or dropped are processed immediately. However, changes to the existing table are delayed until the table is loaded or the database is committed.

When installing an msi, it seems that the OLE structured storage streams are extracted "before" the table that references the cab is transformed. This is surmised from the fact that the transform contains the stream for the cab (you can pull it out of the transform in InstEd), but it is not accessible during the installation process. So, a likely scenario is that the table that references the cab's binary data (and hence the underlying stream) is transformed after the streams are extracted.

Can you hitch in a cab and get away with it?
Could you force the table that references the cab to be transformed before the installation code extracts the streams? Well, I haven't tested it, and it would be unsupported, but you might be able to do so by adding a custom action to the transform that reads from the table that references the cab. This would force the table to be transformed.

If this custom action could be run early enough in the InstallExecuteSequence (or even the InstallUISequence) then perhaps the table would be transformed before the streams were extracted. But if it did work, it would be unsupported and could possibly break in future releases of msi.

Having said that, it would be nice if Microsoft did officially allow cabs to be embedded in transforms.

Tuesday, April 8, 2008

Care for a Date?

While file system time stamps are not the ultimate arbiter of whether files have been edited, they can be useful for determining at a glance whether files have changed.

Unfortunately, the Windows Installer API forces the Last Write Time timestamp to update just by opening a file in anything other than read only mode. Orca, and InstEd, don't (by default) open files in read only mode. Therefore just by opening a database file (not transforms) in Orca, even if no changes are made and a File->Save is never executed, the timestamp of the file will be changed when the file is closed.

This can be unfortunate when customising an msi installation, and trying to determine at a glance whether the source msi has been changed or whether all the changes have been kept in mst files (as they ideally should).

Why is this?
As previously discussed, the Windows Installer file format is based on OLE Structured Storage. The Structured Storage API provides a "transaction" mode, whereby changes to a file can be discarded. Coincidentally, the Windows Installer API also provides a transaction mode. It is almost certain that the Windows Installer transaction mode utilises the OLE Structured Storage transaction mode, rather than implementing transactions itself.

It is this transaction mode with which Orca, and InstEd, open files by default. Using this mode, the tool can make many changes to the file, and they get discarded unless specifically committed (File->Save). This allows fast saving (no need to save all the tables when saving, just call Commit), and easy discarding (simply don't call Commit).

But the underlying OLE Structured Storage transaction mode can, if required, save changes in "scratch" areas in the file, until such time as they are committed. Therefore, in transaction mode, the file must be opened with write permissions, even if commit is never called.

And at the NTFS file system level, as soon as a file handle that has been opened with write access is closed, the Last Write Time timestamp is updated.

The good news is that InstEd preserves the Last Write Time timestamp if no changes are made to the file. It does this by storing the timestamp when the file is opened, updating the stored timestamp whenever Save is called, and resetting the Last Write Time to the stored version whenever the file is closed.
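InstEd's actual implementation isn't published, but the open/save/close dance described above can be sketched in Python. The class name and structure here are illustrative only:

```python
import os

class TimestampPreservingFile:
    """Sketch of preserving a file's Last Write Time across an
    editing session, as described above. An illustration in
    Python, not InstEd's actual implementation."""

    def __init__(self, path):
        self.path = path
        # Store the timestamps as they are when the file is opened.
        st = os.stat(path)
        self._times = (st.st_atime, st.st_mtime)

    def save(self):
        # ... commit changes to the file here ...
        # After a genuine save, adopt the new Last Write Time.
        st = os.stat(self.path)
        self._times = (st.st_atime, st.st_mtime)

    def close(self):
        # Reset the timestamps to the stored values, so merely
        # opening and closing the file leaves the Last Write Time
        # untouched.
        os.utime(self.path, self._times)
```

If `save` is never called, `close` rolls the timestamps back to whatever they were at open time, which is exactly the behaviour Orca lacks.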

Transform files are never opened with write permissions until they are saved, and therefore don't suffer the same problem.

Wednesday, April 2, 2008

Killer Whales can be dangerous

Don't get me wrong, Orca has been the mainstay of anyone wanting to rapidly edit Windows Installer files for a long time. And does an excellent job. Mostly.

The problem is that there are a few nasty things that Orca does silently. So you won't even know the msi file being worked on has been corrupted. See my previous entry about the _Streams table.

One other danger is the Copy and Paste Rows functionality.

When a row is copied, its fields are placed as tab delimited strings onto the clipboard.
When multiple rows are copied, each row's string is separated by appropriate end of line characters.

However, if a string field in a row contains a tab, or an end of line character, then that row cannot be pasted back into the database.

Unfortunately, the user is not made aware when pasting rows that Orca has stopped pasting them because it has found an invalid number of tabs (fields) in the row for the table.

This becomes dangerous because an expected behaviour, copying and pasting rows, silently doesn't work.

InstEd resolves this problem by quoting fields that have tab and end of line characters, and escaping quotes within such an escaped field. This is compatible with Excel, so that pasting rows back into InstEd or Excel will result in correct behaviour.

Furthermore, it only quotes fields that contain tab or end of line characters, or that have a quote at the start or end of the field. This provides as much compatibility for copying from InstEd and pasting into Orca as is possible.

The upshot is that InstEd will always copy and paste rows correctly within itself, and with Excel, whereas Orca has the potential to (silently) lose information when pasting rows.
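InstEd's exact code isn't published, but the quoting rules described above can be sketched in a few lines of Python (the helper names are illustrative):

```python
def quote_field(field: str) -> str:
    """Quote one row field for tab-delimited clipboard text using the
    Excel-compatible scheme described above: only fields containing a
    tab or end-of-line character, or with a quote at either end, get
    wrapped in quotes, and embedded quotes are doubled."""
    needs_quoting = (
        "\t" in field
        or "\n" in field
        or "\r" in field
        or field.startswith('"')
        or field.endswith('"')
    )
    if needs_quoting:
        return '"' + field.replace('"', '""') + '"'
    return field

def encode_row(fields):
    # Each copied row becomes one tab-delimited line on the clipboard.
    return "\t".join(quote_field(f) for f in fields)
```

Because fields without special characters are left unquoted, rows copied from InstEd paste cleanly into Orca whenever Orca could have handled them anyway.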

Wednesday, March 26, 2008

Killer Wail

Recently I received an email complaining that Orca lost information when doing a Save Transformed.

The sender was working on repackaging an msi that had to be finished under extreme time pressure, and Orca was losing files when doing the Save Transformed As. He subsequently downloaded and installed InstEd, and in short order had his transformed msi saved with all data intact.

Orca's "Save As" and "Save Transformed As" issues are well documented, but no less dangerous for that.

From the "Special Considerations when editing Databases" page in the Orca Help:

Embedded Streams and Storages
When a database is saved using the Save As… or Save Transformed As… command, embedded binary streams (such as embedded cabinet files) are not saved to the new database unless they are part of a data row. Embedded sub-storages (nested install files) are never saved to the new database.


But the question remains, why doesn't it save the sub-storages? From my previous post, you might remember that the _Streams table is an abstraction of the subset of the underlying OLE structured storage that represents all the binary fields in the database.

The critical thing to understand is that while all binary fields in regular tables (such as the Icon table) are backed in the _Streams table, not all fields in the _Streams table are represented in regular tables.

When Orca does a Save As, or Save Transformed As, operation it creates a new database, and copies, table by table, row by row, the data into the target database. But, it doesn't copy all the _Streams rows that aren't represented in regular tables. Nor does it copy OLE structured storage entities that aren't represented by regular tables. Therefore, this data never makes it into the target database.

So, while the newly created database contains all the persistent tables, it can be missing data that is critical to the msi.

Note that critical data can be stored in the underlying OLE storage entities that aren't in the _Streams table. For example, language transforms that are applied when an msi is installed are stored as OLE structured storage entities, but are not represented in the _Streams table.

So why would you ever use the Save As feature in Orca? Well the only advantage is that it writes a fresh database, which means that the wasted space from many additions/deletions/edits gets trimmed out. But while the msi may be smaller, you have to be sure that all important data in the msi is represented in the persistent (regular) tables, otherwise it will get lost.

It seems to me that given the risks associated with this command, it would have been better named "Compact Database As", and not "Save As". And similarly, "Save Transformed As", should warn the user that it may lose information.

There is no way with Orca to perform a "Save Transformed As" without losing the information, unlike "Save As", where you can copy the source msi before editing and saving.

Could a tool compact the database and not lose the critical information? Well, yes, but only if it completely understood how the Windows Installer API uses the OLE structured storage. It could create a new database, copy all the tables, copy all the _Streams rows unique to the _Streams table, and then copy the remaining OLE structured storage entities, but the problem is that there is no documentation on how to tell which entities have already been copied when the tables were copied.

How does InstEd implement Save As and Save Transformed As? InstEd copies the underlying database file to the target, and then applies all changes that have been made since the underlying database file was last saved. This is equivalent to copying the underlying database to the target and then editing it. In this way, all the non-persistent table data is maintained, but you don't get a fresh database, with optimal space saving.

InstEd uses a similar mechanism for "Save Transformed As" to ensure that no data is lost.

If you wanted to achieve equivalent behaviour to Orca's Save As, you could export all the tables, and import them into a new and empty database.

Monday, March 17, 2008

_Streams of Consciousness

Unlike the confused thoughts falling from my mind, the _Streams table is quite critical to a Windows Installer database.

In fact, it is the "location" of all the binary fields in the database.

The _Streams table, consisting of two fields 'Name' and 'Data', is an abstraction of the underlying OLE structured storage data streams. It provides access to the binary streams for the Windows Installer API.

All binary fields are stored in their own OLE stream in the database file, and the _Streams table is generated when an SQL request is made for a binary field. While the _Streams table is temporary and generated on request, changes to the table are persistent.

Every field in the msi database that contains binary data is represented in the _Streams table using the format <table_name>.<row_key>.

So, for a row in the Binary table with a Name field of 'Icon', the _Streams table would contain a row with a 'Name' field of 'Binary.Icon'.

This is a one way relationship. While all binary fields in tables are accessible via a row in the _Streams table, not all rows in the _Streams table represent another table row.

This is important to understand, since critical information is often stored in the _Streams table that is not accessible via regular tables. The most common example is the cab for a merge module (msm).

The installed files that a merge module contains are stored in an _Streams row with a name of 'MergeModule.CABinet' (case-sensitive). Note that there is no 'MergeModule' table with a row called 'CABinet'.

Other types of binary data can be stored in the _Streams table without having a corresponding table. Any internal cab file can be stored in the _Streams table without requiring it to be represented in a normal table. So the Media table entry might be '#cab1', or '#cab1.cab', with no attending 'cab1' table.
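The Media table convention can be illustrated with a small helper. The function is hypothetical, but the leading-'#' rule comes from the Windows Installer documentation for the Media table's Cabinet column:

```python
def resolve_cabinet(cabinet: str):
    """Interpret a Media table Cabinet column value. A leading '#'
    marks an internal cabinet stored as a stream inside the msi
    (and hence visible in the _Streams table); otherwise the value
    names an external cab file shipped alongside the msi."""
    if cabinet.startswith("#"):
        return ("internal", cabinet[1:])  # name of the _Streams row
    return ("external", cabinet)          # file on disk next to the msi
```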

Given the importance of the _Streams table, it is curious that other tools have not provided direct access to it. InstEd provides access to it, allowing quick access to merge module cabinet files, and a central place to access all binary fields.

However editing an _Streams table row that represents a row in another table (the <table_name>.<row_key> format) will jump to that row in the other table. This is to ensure that the user is well aware that the _Streams row is represented by another table row.

Have you considered what happens if there are two tables with a binary field, where one is called (for example) 'Binary', and the other 'Binary.Table'? Can you have a row in 'Binary' called 'Table.Value' and a row in 'Binary.Table' called 'Value'?

It turns out you can, but changing one field, changes the other, since the binary field for both rows is backed in the _Streams table by a row called 'Binary.Table.Value'.
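A minimal sketch of the naming scheme makes it obvious why the two rows above collide:

```python
def streams_row_name(table: str, row_key: str) -> str:
    # The _Streams table identifies a binary field's stream as
    # "<table_name>.<row_key>". Nothing escapes the '.' separator,
    # which is what makes the collision described above possible.
    return table + "." + row_key
```

A row 'Table.Value' in table 'Binary' and a row 'Value' in table 'Binary.Table' both map to the stream name 'Binary.Table.Value', so they share one backing stream.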

Monday, March 10, 2008

The Danger of 'Special' Values

When defining the meaning of a data field, it can be tempting to specify certain values that the field may hold as special. Those values would signify a different meaning than all other values.

A great example of this is in the Windows Installer API, where the docs for MsiRecordSetInteger state "To set a record integer field to NULL_INTEGER, set iValue to MSI_NULL_INTEGER".

So the MSI_NULL_INTEGER constant:
#define MSI_NULL_INTEGER 0x80000000
has a different meaning than any other integer.

There can be good reasons for doing this. The Windows Installer team probably made a special value for NULL to obviate the need for every field in the msi database to have a boolean attached to it specifying whether the field is null or not. Obviously there is a large space saving advantage here.

But the dangers of doing this are obvious. What if someone wants to store 0x80000000 and not have it considered as a null value? This is exactly the reason why the docs for the LockPermissions table declares:
"You cannot specify GENERIC_READ in the Permission column. Attempting to do so will fail. Instead, you must specify a value such as KEY_READ or FILE_GENERIC_READ."

It just so happens that the value of the GENERIC_READ constant is, you guessed it, 0x80000000. Ouch.

So, while the docs are technically correct, it is worth noting that you can use GENERIC_READ, as long as it is combined with another bit flag. This is the reason that InstEd has included GenericRead in its Permissions column bit flag editor.
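The collision, and the combine-with-another-flag workaround, can be checked directly with the constants from the Windows SDK headers (values reproduced here; in msiquery.h, MSI_NULL_INTEGER is additionally cast to a signed int):

```python
# Constants as defined in the Windows SDK / Windows Installer headers.
MSI_NULL_INTEGER = 0x80000000  # msiquery.h: the "null" integer sentinel
GENERIC_READ     = 0x80000000  # winnt.h: generic read access right
KEY_READ         = 0x20019     # winnt.h: composite registry read mask

# On its own, GENERIC_READ is indistinguishable from a null field...
assert GENERIC_READ == MSI_NULL_INTEGER

# ...but OR'd with any other access bit it is no longer the sentinel,
# which is why GENERIC_READ combined with e.g. KEY_READ is usable in
# the LockPermissions Permission column.
assert (GENERIC_READ | KEY_READ) != MSI_NULL_INTEGER
```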

InstEd Permission bit flag editor

Friday, March 7, 2008

InstEd is released!

InstEd 1.5 has been released.

Why version 1.5? Well, it had a former incarnation in Camwood's appEditor.

It is now more stable, faster, and with more features, including:

  • Comparing msi files for visual differencing. It also allows comparing an msi against itself so that changes are tracked visually as they are made.
  • Providing an option to automatically update the File and MsiFileHash table details (Version, Language, hash values) for files in the File table from the source files.
  • Compiled with profile guided optimization to provide greater performance.

So go on, InstEd it!