Thursday, May 1, 2008

OpenMP utility code

The relationship building code in InstEd was running a bit slow on my very large test msi. It was taking 22 minutes to build the table relationships.

This was a bit long, so having noticed that Visual C++ 2005's compiler came with OpenMP support, I thought I would have a go at utilising the dual cores on my dev machine to speed this up.

The first task at hand was to determine whether the loops involved were suitable for breaking down into parallel partitions. Some were, and some weren't. My starting point was this loop :

//pseudocode
foreach( row in refing_table.rows() )
{

// get rows that row references
// store the relationships
}


This loop was suitable because except for storage of the relationships, the majority of the work didn't write any data anywhere, and was fairly computationally expensive.

Ideally, the OpenMP version would look like this:
#pragma omp parallel
{
#pragma omp for
//pseudocode
for( Rows::iterator row = refing_table.rows().begin();
row != refing_table.rows().end();
++
row )
{

// get rows that row references

#pragma critical
{
// store the relationships
}
}
}


However the #pragma omp for directive has some major limitations, primarily that the iteration variable must be an integer type. If Rows were a type that had random access iterators then the loop could be changed to use integers for iteration. However, Rows is actually a std::list. So advancing the iterator from begin on each iteration could be expensive.

What I needed was a function that would split the rows() list into partitions suitable for each thread working on the parallel block. And it would be great if the calculation of the partition suitable for a given thread could happen inside the parallel block, thereby further utilising the OpenMP benefits.

(Naturally, splitting the rows() list would happen as an abstraction through iterator ranges, not by actually copying the list items.)

This would be a common problem for any C++ programmers wanting to utilise OpenMP in loops that iterated over STL containers.

So, in an effort to make the solution generic and easy to utilise, I wrote some code for just such a purpose. It in turn utilises the excellent Boost.Range library and concepts, which allow the code to work on any container that models the Range concept, including native C style arrays, and STL containers.

The first task was to write a function to evenly split a given range into partitions. After my first ugly, clumsy, and inefficient effort, Nathanael Rensen came up with an excellent algorithm.

///////////////////////////////////////////////
//
// The returned sub range is such that if this function is called
// for each partition [0,partition_count), the entire "range"
// will be covered by all returned sub ranges, and distributed
// amongst the partitions in the most even size distribution possible.
//
// The size parameter must specify the size of the range.
// This overload, accepting a size, is preferable where
// range.size() may be expensive.
//
template<typename Range>
inline
boost::iterator_range< typename Range::iterator > split_range(
const
Range& range,
int
partition_count,
int
partition,
int
size )
{

Range::iterator begin = boost::begin( range );
Range::iterator end = boost::end( range );

if
( partition_count > 1 )
{

int
remainder = size % partition_count;
int
quotient = size / partition_count;

if
( partition < remainder )
{

std::advance( begin, partition * ( 1 + quotient ) );
end = begin;
std::advance( end, quotient + 1);
}

else

{

std::advance( begin, remainder + partition * quotient );
end = begin;
std::advance( end, quotient );
}
}


return
boost::make_iterator_range( begin, end );
}


///////////////////////////////////////////////
//
// The returned sub range is such that if this function is called
// for each partition [0,partition_count), the entire "range"
// will be covered by all returned sub ranges, and distributed
// amongst the partitions in the most even size distribution possible.
//
// Use this overload where range.size() is not expensive
// (i.e. Range::iterator models random_access_iterator )
//
template<typename Range>
inline
boost::iterator_range< typename Range::iterator > split_range(
const
Range& range,
int
partition_count,
int
partition )
{


return
split_range( range, partition_count, partition, range.size() );
}


Having got the partitioning out of the way, the next part was allowing it to be easily used from within an omp parallel block. This turned out to be surprisingly easy.



///////////////////////////////////////////////
//
// This function should be called within a #pragma omp parallel
// block, and returns a sub_range of the input range.
//
// The returned sub range is such that if this function is called
// by each thread in the parallel thread group, the entirety of "range"
// will be covered by all threads, and distributed amongst the threads
// in the most even size distribution possible.
//
// The size parameter must specify the size of the range.
// This overload, accepting a size, is preferable where
// range.size() may be expensive.
//
template<typename Range >
inline
boost::iterator_range< typename Range::iterator >
split_range_openmp(
const Range& range,
int size )
{

int
thread_count = omp_get_num_threads();
int
thread = omp_get_thread_num();


return
split_range( range, thread_count, thread, size );
}


///////////////////////////////////////////////
//
// This function should be called within a #pragma omp parallel
// block, and returns a sub_range of the input range.
//
// The returned sub range is such that if this function is called
// by each thread in the parallel thread group, the entirety of "range"
// will be covered by all threads, and distributed amongst the threads
// in the most even size() distribution possible.
//
// Use this overload where range.size() is not expensive
// (i.e. Range::iterator models random_access_iterator )
//
template<typename Range >
inline
boost::iterator_range< typename Range::iterator >
split_range_openmp( const Range& range )
{

return
split_range_openmp( range, range.size() );
}




Each thread operating on an OpenMP parallel block gets a thread number from 0 to thread_count - 1, which is perfect for the partitioning.

So, the usage is now as simple as:

#pragma omp parallel
{
boost::iterator_range< Rows::iterator > range
=
split_range_openmp(
refing_table.rows(),
refing_table.rows().size() );


for
( Rows::iterator row = boost::begin( range )
row != boost::end( range );
++
row )
{

// get rows that row references

#pragma critical
{
// store the relationships
}
}
}




And voila, the rows() list is kindly split into equal sections for each OpenMP thread to work on.

Well almost. Because the utility functions don't modify the input range, they accept it as a const reference. However, in the case of containers, this results in const_iterators being returned, which is incompatible with the stated return type using Range::iterator.

In order to work around this, you could either add non const versions of each of the utility functions, add a template parameter to each for the desired iterator type, or utilise make_iterator_range with non const iterators.

boost::iterator_range< Rows::iterator > range
=
split_range_openmp( boost::make_iterator_range(
refing_table->rows().begin(),
refing_table->rows().end() ),
refing_table->rows().size() );


But there it is. Some utility functions to make it easy to utilise multiple threads on loops with non-integer iterators.

The full header file can be found here: split_range.hpp.

Performance
And what was the result of utilising OpenMP for this loop? Well, I actually made the biggest improvement by comparing the binary data of each field for relationships, instead of comparing their display strings which were built each time. This dropped the time from 22 minutes to 6:45. Adding the OpenMP loop dropped it again to 5:05, but raised the total processor time by about 2 minutes. This was probably due in part to the overhead of splitting the range, and in part to the overhead of the OpenMP code.

But, it's worth noting that there can often be big performance increases found before resorting to multithreading.

Update
I subsequently found this paper which discusses OpenMP and non-conformant (non integer iterator)loops, and provides some alternative solutions.

2 comments:

Anonymous said...

Thanks for your efforts writing this all down, due to OpenMP 3.0 not being fully implemented yet I needed a way to parallelise an iterator-based list, and this howto perfectly fits my needs.

In order to get it working, I needed to change some things to the source though. First, there is a bogus "#ifndef" at the top of the file. Second, the "typedef" keyword is missing to define the "end" and "begin" fields (lines 37 and 38). Apart from those two issues, the code works as it should :)

maleadt

Neil said...

Thanks maleadt.

Glad you found the code useful.

You are right about the dodgy #ifndef. And also about the missing "typedef". But I suspect you mean "typename" rather than "typedef".