ned Productions - Niall's Programming Tips

During my recent work on Tornado, I've come across a number of things which I haven't found anywhere else on the internet - so I thought I'd post them here for others to use. It's a hotchpotch of stuff not really in much order or organisation, but I think it is stuff you won't find anywhere else!

Updated 21st November 2003 with improvements - you may find my page on "What is Computer Software?" interesting

Link to Niall's MySQL-driven web counter

1. Bugs in Microsoft software:

Tornado pushes everything to the max - how it goes about using the system is extremely taxing for it and as a result I have uncovered a large number of bugs in Qt (for most of which I have supplied source fixes which Trolltech then applied to the main distribution - thank you Trolltech!) and more interestingly quite a few bugs in Microsoft's products.

Since in order to report bugs to Microsoft you need to pay them several thousand of whatever currency to get your direct line to real developers, I am publishing my list of discovered bugs and how I worked around them in the hope it will help others. The list is in no particular order:

Microsoft Visual Studio C++ v6.0 & .NET 2002 (v7.0)
This isn't a bad little compiler - though Visual Studio .NET IDE 2002 is bug-ridden, it should be better by .NET 2003 (where the hell is the service pack???) However, both compilers corrupt memory (or the stack) when you make a copy of an object when:
- The object being copied has a default copy constructor
- The object being copied has no data members and is subclassed from an object with a virtual table (ie; it has virtual methods) and subclassed from another object with no data members (eg; a class used purely for namespace purposes)
Basically the default copy constructor code correctly copies the first inherited class' data and adjusts the virtual table correctly. However, unfortunately, it then goes on to alter the table for the class with no data members - which, because it has no data, means it overwrites something else lower down in memory. Result: crashes.

Solution: Add a dummy data member to either the class itself or the class it derives from which doesn't have any, it doesn't matter which. I would suggest adding it to the base class makes the problem go away permanently.
Duplex pipes are broken (ie; ones created by CreateNamedPipe())
I was completely stumped by the weird behaviour of my pipes - they'd work fine and then hang infrequently. It turns out that if one thread fills a duplex pipe's write buffer and then another thread tries writing more when the reading thread has not yet read any data, the second write hangs and so do any other subsequent writes. Best of all, the read doesn't return either! I found this bug reported for Windows NT 3.1 so it's clearly an oldie!

Solution: Break your duplex pipes into two separate pipes each going one direction. You get improved latency too I noticed because it seems you can only write once to a pipe on NT before it blocks.
OutputDebugString() isn't thread-safe
Yes, should your threads try outputting debug text at the same time, this call will hang! Perfect!

Solution: Stick a critical section around OutputDebugString().
Overlapped i/o reads continues even after closing the file handle
Arguably this isn't a bug, but it results in memory corruption as the overlapped i/o writes its data into what is now something else. It isn't mentioned very clearly, but you can use CancelIo() to cancel it before closing the file.

Unfortunately, CancelIo() only cancels the overlapped i/o started by the calling thread. So, for example, if your reader thread dies and you're trying to cleanup after it, you can close all the handles and free the memory but you can't stop it still corrupting memory. Thanks Microsoft!

Update: Yes, this frigging misfunctionality got me again. There was literally one exit route (caused by a thrown exception under rare circumstances) where CancelIo() wasn't getting called and yes, I was getting my heap corruption again.

Solution: Either find some way of ensuring each thread cancels its own i/o, or stick a Sleep(500) in before releasing the memory the overlapped i/o would write into, or indeed just don't free the memory ever. Your choice

2. Multithreaded programming tips

Most of the programs I write are multithreaded. The usual sagely advice regarding multithreading is to not use it except when absolutely necessary ie; there is a substantial performance advantage to be gained.

My problem with this advice is that it seems to me like the proverbial ostrich sticking their head into the ground. If you never practice something, you'll never get good at it and with future computers being certainly SMP multiprocessing and then NUMA it seems daft not to write your code today for what is obviously going to be the future, especially if it costs you little right now except a tiny bit more effort. Furthermore, a lot of techniques are used in computer software to fudge the "not everything happens at once" problem most computer programmers have. Things like sockets, events, messaging, data arriving - all these are asynchronous events which happen in parallel with one another and for each of these there is a programmatic convention for serialising them into a set of consecutive operations easier for the programmer to get their head around.

Now most of these conventions I'm fine with - anything making my life easier is good. But what I don't agree with is when use of these conventions makes a program considerably more complex and harder to maintain - a good example here is notification network sockets. These have an interrupt routine post a window message to an application so it can deal with a socket event synchronously - and invariably, I can't help seeing it as making things more complex, not less so.

There are many others. Something doing intensive i/o with a serial port for example. In this situation using overlapped i/o on Windows, you can have i/o happen in the background while you get on with running the GUI for example. But it's a much better design to have a separate thread do that and let the GUI alone to do its own thing.

People have this notion, and it's not helped by past-generation programmers, that writing multithreaded code is hard. I have written 30,000 line 20 thread programs and they work perfectly. Sure, the effort expended was a little bit harder but not hugely so - and the customer got a program capable of very low latency (for windows), high speed control program.

2.1 The key to success

The key to success for any program is discipline. There are many excellent but sloppy programmers just as dangerous as an awful programmer, and there are many technically poor programmers with the discipline to make them a very valuable member of a team.

Discipline is following rules. The rules are extremely simple, and if you follow them you will write code:

With less than one memory leak per 5,000 lines
With a negligible chance of causing memory corruption
With perhaps only one to two bugs per 1000 lines
Which is modular, maintainable and easy to debug
Which is fast, scales excellently with hardware, and lasts long into the future

And here they are:

Error checking is your friend
It's sad to see how little code has the good old assert() function in it. It costs almost nothing and yet permits sanity checks which will often give clue to an underlying problem if they are sprinkled around liberally. It's also distressing to see how little code checks the return code of an API for error condition - this is a particular problem on Windows, but Unix suffers from it plenty too.

I suggest you wrap every API call in a macro. Tornado uses TERRHWIN() for Win32 API calls and TERRHOS() for Unix ones. These macros test for the error condition and if true use FormatMessage() and strerror() respectively to decode a global error code into a string which they construct into an exception which also contains source file name and line number (on debug Win32 builds it also adds a stack backtrace), which is then thrown. Unlike older C++ compilers, there is no common popularly used compiler today which doesn't do a good job in generating efficient and correct exception handling code - most problems in this area nowadays stem from sloppy programming. If you want a library that already has all this done for you, choose my own: TnFOX.

Good starting points are checking parameters for exceeding maximums or minimums, if accessing a list index is exceeding the number of members in that list, if when finding an item that must be in a list fails etc. Nowadays with C++, assert() should be used for stuff likely to fail in debug sessions as obviously it won't check in release builds.

However, you should still sprinkle your code with runtime checks which throw an exception also in release builds shipped to the user. I use a special kind of exception which indicates if the failure of the check is fatal and also reports it to the user in a special way - with an accompanying stack backtrace for example.

You should fully test your exception handling. I've had a worker thread slowly corrupt memory randomly to test how well the application responds to an extreme environment - does it correctly save out as much data as it can, clean up after itself as best as it can etc. It's as an old boss once said how he tested his filing system - he stuck in a new hard drive, ran the FS and then struck the HD with a hammer to simulate a platter crash.

(See "Writing Exception Safe Code" in the TnFOX documentation)
Zero pointers throw exceptions
If I had a dime for every time someone says pointers are evil. Well sure, they are powerful, but they are not evil. And in the hands of a disciplined programmer, they can make your code (a) considerably more efficient and (b) catch more errors.

Yes, you read that last bit right. C++ references (using &) are all well and good, but they must point to an object and hence they are ripe for referencing an object which has been deleted. I only ever use references for an object whose existence I can guarantee, and pointers for everything else.

Pointers are good because they can be zeroed and any attempt to indirect through a zero pointer will throw an obvious exception on every platform I've ever used. The key to using pointers as a powerful and useful tool instead of corrupting data and leading to unstable code is to always zero a pointer when it doesn't point to anything.

That single, simple rule can remove perhaps 40% of all bugs seen in C or C++ code. It even works for assembler! In practice, it means always initialising a pointer with zero eg; MyClass *ptr=0; especially in class constructors eg;
```
class MyClass {
   MyOtherClass *myptr;
   MyClass() : myptr(0) {
   ...
```
You should also always zero a pointer after deletion. In fact, I use a macro called TDELETE(x) which is simply { delete x; x=0; } which is as simple as it gets.

Of all the rules here, if you follow this one you will see the greatest improvement in the quality of your code. You will find attempts to pass a freed block of data to some code or calling methods in a deleted instantiation will no longer result in memory corruption or changes "not happening" but a guaranteed fatal exception you can track down and fix.
Memory leakage
95% of memory problems with C-like languages are due to sloppy pointer discipline. C++ gives you a wonderful thing to look after your memory allocations - it's called a destructor, and you should use it wherever possible. The best form is to encapsulate and abstract your memory and pointer use in a class - preferably one someone else has already debugged for you eg; the STL's list, vector, array or string objects. If you must write it yourself, make it small and abstract as much of the realities underneath from the outside world - this makes fixing it when it's broken much easier. Needless to say, plenty of sanity checks here is a must.

For the 10% of memory allocation code you can't use a class for, use this class:
```
/*! \class TPtrHold
\brief A guaranteed deleted pointer

In code using exception handling, a major problem is allocating pointers
to objects and then correctly freeing them on exception. The solution
is this class which guarantees deletion of new-ed pointers stored inside
it.
\code
TPtrHold<qstring> temp;
TERRHM(temp=new QString);
\endcode
We recommend usage for any new-ed pointer which doesn't inherit QObject
(as obviously QObject manages auto-deletion for its children) or is
inside a Qt list with auto-deletion enabled.
*/
template<class type> class TPtrHold
{
   type *ptr;
public:
   TPtrHold(type *p=0) : ptr(p) { }
   ~TPtrHold() { free(); }
   operator type *() { return ptr; }
   operator const type *() const { return ptr; }
   type *&operator->() { return ptr; }
   type *&operator=(type *p) { ptr=p; return ptr; }
   void free() { delete ptr; ptr=0; }
};
```
Using this class is a must if you use exception handling with code which allocates pointers on the stack. At any time your code may be exited and thus the code deleting a pointer will never get called - using a destructor, this is fixed.

(Added 21st Nov 2003: Nowadays I rarely use that class except for static data, I use rollbacks instead. See the docs in TnFOX)

A lot of people use expensive tools like BoundsChecker or Purify to detect memory problems. This is all well and good, but I feel they are missing the fact that a great deal of these facilities are already provided free with your language and toolkit - irrespective of platform.

The new and delete operators are overloadable on newer compilers:
```
#define _TMEMDBG_NEW_ new(1, _tmemdbg_current_file_, __LINE__)
#define _TMEMDBG_MALLOC_(x) _malloc_dbg(x, 1, _tmemdbg_current_file_, __LINE__)

#define new _TMEMDBG_NEW_
#define malloc(x) _TMEMDBG_MALLOC_(x)
```
(You need to add a static const char *_tmemdbg_current_file_=__FILE__; to each .cpp file individually - this is because of how __FILE__ works). Indeed, MSVC6 and later come with a debug heap which already has this new new written for you but it's a trivial implementation on other platforms like GCC. Simply place the allocated pointer with filename and line number into a list (preferably accessed by hash table) and at the end of the program do a dump of the unfreed memory blocks showing where they were allocated. Voila, you have an industrial strength memory leak detection mechanism.

MSVC7 or later (or Visual Studio .NET <year> as it's called by MS) comes with extensive run time check ability. It can check for exceeding boundaries known to it (your array or list class should already be doing this), writing off the ends of buffers (you should already be using strncpy() - never strcpy() - or in C++, always a string class) and usefully for the transition to 64 bit code, container truncations not using a cast (eg; an int into a short, or worst of all especially on windows the common int = void *). What I find most useful about this new MSVC7 feature is testing my code to see how well it works in a machine three times slower than mine - because it should always run adequately at the third of the speed of whatever you're seeing as the developer with a high-end machine.

Lastly, if you're on Linux you should seriously check out valgrind which can debug more than just memory problems (eg; how well does your program use the processors cache?). I usually belt through any code just before release.
Lock or prevent all shared data access between threads
It's the first rule of multithreaded programming - synchronise access to all shared data. It would be better to say however that if you structure your code differently, you can reduce the access to shared data in the first place which results in (a) much faster paralleled execution and (b) less errors because you forgot to hold the lock and (c) less thread deadlocks.

It's hard to put into words how to structure your code differently. Certainly, you need to do many more deep copies of objects which comes with an obvious memory usage and speed penalty. But I think the essence is to keep objects within themselves as much as possible and try to make everything reentrant (ie; able to have multiple things running the same code but with different inputs eg; two instantiations of a class could run without locking because they are working on two separate sets of data). Obviously, static buffers must be avoided at all costs - the stack is your friend here!

Don't be afraid to split stuff off into another thread when it seems appropriate. The more practice you get writing multithreaded code, the better you will get at it. It's best to hold locks more than you need to at first, but if you hold a lock while waiting for another you will run into thread deadlock problems which can be solved two ways: (a) release the first while waiting on the second or (b) use a more global lock covering access to more data. If the task you want to perform in parallel is very small, consider using a threadpool instead of starting a whole dedicated thread. And yep, you guessed it - TnFOX has a FXThreadPool class.

Because of this deadlock problem, sticking a mutex on every little piece of data will usually lead to non-functional code. Furthermore, it results in very slow performance because all synchronisation code is expensive - it negates the effect of the processor's cache and stalls the entire memory subsystem on many architectures. No, the solution is to strike the right balance between how specific and global your locks are, and it's something you get much much better at with practice.

One point here is again the use of destructors to help you much like with freeing memory pointers above. If you hold any kind of synchronisation lock, hold it indirectly by instantiating a class - then when an exception or other form of strange exit happens, the lock is correctly freed. The Java and Win32 approach of try...finally is fine and useful, but by thinking a little more laterally you can get something more efficient in C++. (Added 21st Nov 2003: See TnFOX's FXMtxHold)

Lastly, at least once per large multithreaded project you will usually have to write a really difficult piece of multithreaded code - memorable for me have been the intelligent mutex (which permitted a configurable pattern of multiple users eg; define set A, B & C of possible user combinations, so only locks of say type A can execute in parallel - bit like a read/write mutex but more complex), a hardware write access lock which the user switched between threads and mostly recently, an optimised scalable fully recursive read/write mutex. My advice for these kinds of code is to keep the difficult bit as absolutely as small as possible. It's better to have 100 lines of extremely difficult code than 1000 lines of moderately difficult code especially as you can rip out and completely replace 100 lines much easier than 1000. And of course, reuse an existing implementation off the web long before writing your own - TnFOX comes with LGPLed source code you can study.
Multithreading efficiency
A well written multithreaded program will come within 10% and usually 5% the speed of a non-multithreaded program on a uniprocessor machine ie; not noticeable to the user. If it's done right, the unique benefits include a much more responsive program capable of working simultaneously on dozens of things at the same time and furthermore a much more stable program because fatal errors can often be localised to within a single thread, letting your program relaunch that thread or otherwise save data which might otherwise be lost. A multithreaded program is furthermore much more scalable for the future - your code will continue running faster with increased hardware into the future much more linearly than an older-style program.

Thread support varies greatly between platform. Windows uses the OS/2 model which is very flexible and powerful but also slower than POSIX can be which is considerably more basic. Really useful stuff Windows has which POSIX should have is thread message queues, Asynchronous Procedure Callbacks (APCs), inter-process synchronisation (POSIX defines optional extremely basic inter-process locks, but they are rarely implemented) and lastly the thing I miss the most is the ability to wait on more than one wait object at once. Simple things in Windows like having arbitrary code executed in another thread's context is very cumbersome in POSIX (usually involves signals), and I recommend from bitter experience you try and find a different way of structuring it so you don't have to.

I have three tips to dramatically improve multithreading efficiency apart from that design and structure have the biggest effect (but I've already covered that above):
1. Use spin counts on anything which can put a thread to sleep. Basically, putting a thread to sleep and waking it again is extremely expensive - a call to the kernel, lots of modifying inter-thread shared lists, updating the scheduler etc. A simple count of 1...4000 tries to get a lock before actually waiting on it makes a huge difference on a multiprocessor system and can make some difference on a uniprocessor one if the lock is being contended a lot (depends on that system's scheduler). Of course, if trying a lock is also expensive (some POSIX implementations), you'll want not to do this. If using Windows, take a good look at InitializeCriticalSectionAndSpinCount().
2. Mutexs are dirt simple to implement - however amazingly most POSIX implementations are still horrendously inefficient. Since they're the most used synchronisation primitive, you may consider rolling your own mutexs - on Windows, critical sections are 57 times quicker than Win32 mutexs so it should be obvious which to use! To roll your own mutexes, you'll need access to the hardware's synchronisation assembler op which can at the very minimum swap a memory location atomically though many architectures can also increment atomically or even exchange and compare at once. I suggest using the one I wrote for TnFOX (check the FXMutex class).
3. Use more thread local storage. Even experienced multithreaded programmers often forget how thread local storage can greatly improve an algorithm working with shared data. I used TLS in my RWMutex below to save having to maintain a list of threads currently owning a read lock and it not only made things more efficient, it made the design simpler too. You can encapsulate TLS for C++ using something like the code in the RWMutex source below.
4. Consider more intelligent synchronisation primitives. The most common complex primitive is the read-write mutex, though I listed some stranger ones above. RWMutexes are best suited to code which mostly reads from shared data but rarely writes to it - which, in my experience, is 85-90% of shared data use. Or, to put it another way, if you are writing just as much as reading your shared data structures consider separating the read-only parts out into a RWMutex protected index and keep the writeable parts individually locked by a normal mutex - thus you can use the index for quick look up and not hold up any other threads except when two want to write to the same data.
  
  POSIX has optional support for RWMutexes and Win32 & OS/2 just doesn't have them at all. I provide a working optimal algorithm with source below which can aid you in this.

3. A Read-Write Mutex

I provide here C++ source code for Tornado's implementation of a generic read write mutex and thread local storage (TLS) encapsulation for C++ using POSIX threads (Win32 users please consult TlsAlloc()). Tornado's kernel runs hundreds of threads all of which work simultaneously on the kernel namespace which is made up of tens of thousands of linked nodes, the entire tree of which is protected by one RWMutex. Most of the writeable stuff is linked off the node tree into separately protected items, so the nodes are effectively a giant index.

This implementation has gone through three incarnations to try and balance out the priority of readers and writers and to remove a few deadlocking bugs. As far as I know, the following implementation is bug free but I offer no warranties or liability if it is not. This source is completely free for use by anyone for any reason, so you get no ability to whinge if it doesn't work for you!

Basically, the below is as perfect a RWMutex I know how to make. It permits infinite recursion on any combination of read and write locks and also gives writers the priority - as soon as a write is requested, all new reader requests wait in a queue. It is extremely efficient, normally having only a 4x mutex overhead for reader lock claim and release. It keeps the amount of time holding the mutex protecting its private data as low as possible to reduce contention and avoids waiting on the kernel as much as possible - and when it wakes threads waiting for something else to finish, it does so as selectively as is portably possible for POSIX threads. It also deals with the classical problem of two or more threads already holding a read lock both requesting a write lock at the same time.

As it's copied directly from the Tornado source tree, it uses some classes you may not be familiar with: TMutex (an optimised implementation), TMtxHold (a class holding a mutex lock and thus guaranteeing its unlocking irrespective of how the function exits), TWaitCondition (an object upon which a thread can wait for a notification ie; it sleeps till another thread indicates it should wake - roughly corresponds to an event object in Win32 & OS/2 or a wait condition under POSIX threads). See Tornado's documentation if you want to know more.

<17th June 2003: source deleted - please see FXRWMutex class in FXThread.cpp in TnFOX)

ned Productions – Niall’s Programming Tips

Link to Niall's MySQL-driven web counter

1. Bugs in Microsoft software:

2. Multithreaded programming tips

2.1 The key to success

3. A Read-Write Mutex