Cache coherency explained
The Intel SA-110 StrongARM processor has a Harvard cache architecture: it uses separate instruction and data caches, and in addition its data cache is a write-back cache. This raises cache coherency problems in a multiprocessing configuration.
Caching explained
Most modern microprocessors contain some form of instruction and data caching, where a small but fast block of memory is used to speed up access to main memory. At its simplest level, a cache can pre-load blocks of memory from main memory so that the processor need not stall when performing a load. This is possible because (a) a processor usually accesses sequential memory locations and (b) loading several sequential memory locations at once is faster than loading each one individually. Hence when the processor accesses an uncached memory location, the cache reads a full cache line in the hope that the rest of it will be used (which it usually is).
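To make this concrete, here is a minimal C model of a direct-mapped cache satisfying a read. It is an illustrative sketch only; the sizes, names and behaviour are assumptions for the example, not a description of the StrongARM's actual hardware.

```c
#include <stdint.h>
#include <string.h>

#define LINE_SIZE 32                  /* bytes fetched per burst read */
#define NUM_LINES 512                 /* 512 x 32 bytes = 16KB cache  */

struct cache_line {
    uint32_t tag;                     /* which block of memory is held */
    int      valid;
    uint8_t  data[LINE_SIZE];
};

static struct cache_line cache[NUM_LINES];

/* Read one byte through the cache; on a miss, a whole line is
 * fetched in one burst, which is faster than LINE_SIZE single reads. */
uint8_t cache_read(const uint8_t *main_memory, uint32_t addr)
{
    uint32_t line   = (addr / LINE_SIZE) % NUM_LINES;
    uint32_t tag    = addr / (LINE_SIZE * NUM_LINES);
    uint32_t offset = addr % LINE_SIZE;

    if (!cache[line].valid || cache[line].tag != tag) {
        /* Miss: burst-load the whole line containing addr. */
        memcpy(cache[line].data, main_memory + (addr - offset), LINE_SIZE);
        cache[line].tag   = tag;
        cache[line].valid = 1;
    }
    return cache[line].data[offset];
}
```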
Again at its simplest level, a cache handles writes by being a write-through cache. Quite simply, when the processor performs a write, the relevant cache entry is updated and a write to main memory is issued. One could say that cache coherency has been maintained, i.e. the cache accurately reflects the contents of main memory.
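Continuing the illustrative model above, a write-through store might look as follows; again, this is a sketch of the general technique, not the StrongARM's hardware behaviour.

```c
/* Write-through: update the cached copy if present, and always
 * issue the write to main memory, so the two never disagree. */
void cache_write_through(uint8_t *main_memory, uint32_t addr, uint8_t value)
{
    uint32_t line   = (addr / LINE_SIZE) % NUM_LINES;
    uint32_t tag    = addr / (LINE_SIZE * NUM_LINES);
    uint32_t offset = addr % LINE_SIZE;

    if (cache[line].valid && cache[line].tag == tag)
        cache[line].data[offset] = value;   /* keep the cache coherent */

    main_memory[addr] = value;   /* the write always reaches memory */
}
```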
The problem is that writes to main memory are usually even slower than reads. Things can be improved through the use of write buffering, where writes are collated for burst writing (much as reads are), but given that main memory cannot be read from whilst it is being written to, one can see how the processor spends a lot of time stalled waiting on main memory. The solution is to minimise the number of writes actually made to main memory by performing the write only once the processor has definitely finished with that location. This is what a write-back cache does: it writes the location out to main memory some time after the actual write took place.
Write-back Caching explained
A write-back cache is, for obvious reasons, somewhat more complicated than a write-through cache. One must now maintain a cleanliness flag for each cache entry, where a clean entry accurately reflects the contents of main memory and a dirty entry does not. When the cache loads a cache line, its contents are marked as clean. When the processor changes a memory location, the change is made within the cache and that cache entry is marked as dirty. At some later stage, whenever the caching algorithm decides, clean cache entries are thrown away and dirty cache entries are written back to main memory. However, it is important to note that for as long as dirty cache entries exist for a memory location, the cache is not coherent with that memory location.
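The earlier model can be extended to show this, assuming struct cache_line gains an int dirty; member. As before, this is a sketch of the technique in general, not of the StrongARM's particular implementation.

```c
/* Write-back: a store only touches the cache and marks the line
 * dirty; main memory is updated when the line is finally evicted. */
void cache_write_back(uint8_t *main_memory, uint32_t addr, uint8_t value)
{
    uint32_t line   = (addr / LINE_SIZE) % NUM_LINES;
    uint32_t tag    = addr / (LINE_SIZE * NUM_LINES);
    uint32_t offset = addr % LINE_SIZE;

    if (!cache[line].valid || cache[line].tag != tag) {
        if (cache[line].valid && cache[line].dirty) {
            /* Evict: the old line is dirty, so write it back first. */
            uint32_t old_base = (cache[line].tag * NUM_LINES + line) * LINE_SIZE;
            memcpy(main_memory + old_base, cache[line].data, LINE_SIZE);
        }
        memcpy(cache[line].data, main_memory + (addr - offset), LINE_SIZE);
        cache[line].tag   = tag;
        cache[line].valid = 1;
        cache[line].dirty = 0;        /* a freshly loaded line is clean */
    }
    cache[line].data[offset] = value;
    cache[line].dirty = 1;            /* cache and memory now disagree  */
}
```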
Cache incoherency
Now imagine two processors of this type accessing a particular area of memory at the same time. The first processor holds most of that area within its data cache, including any changes it has made to the data. The second processor then accesses that area of memory, finds the original data (the changes not yet having been written back), and loads that stale data into its cache. Depending on the caching algorithm, the first processor may then write back its dirty cache entries; the second processor, knowing no better, may later write back its own changes to the data, and this obviously causes memory corruption.
Traditionally, special support hardware is used to maintain cache coherency: an update to a memory location by one processor invalidates that location in all other processors' caches. However, no such hardware support exists in our target hardware, hence a cache coherency solution must be implemented in software.
Software-based Cache Coherency
Having researched the topic of cache coherency (including talking with Dr. Rabhi), I found the options to be limited. Almost all software-based cache coherency protocols require some form of special hardware or compiler support. The only entirely software-based cache coherency protocol I could find which does not require compiler support is the fast selective invalidation scheme (Cheong and Veidenbaum, 1988), which unfortunately requires a write-through cache. The StrongARM's write-back cache can be manually forced to clean dirty entries immediately, like a write-through cache, but this is an expensive solution as the StrongARM was not designed to work this way.
Ultimately, this means that there is only one method of implementing cache coherency without any hardware support: disabling the data cache on each processor. However, the StrongARM processor is extremely slow when working with non-cached data, as its designers had two modes of data access in mind: I/O register access and memory access. Disabling the data cache causes all memory accesses to be made as though accessing I/O space; no use is made of burst reads or writes, and hence a huge number of cycles is wasted every time the processor accesses data.
Obviously, the solution is to enable every performance feature of the processor for as long as possible whilst retaining cache coherency. This is done as follows:
1. Mutually exclude a processor's cached use of an area from remote access
So long as an area of memory is guaranteed to be exclusively available to one processor, that processor can enable its data cache when accessing that area of memory. Using semaphores, areas of memory can be marked "in use" by a particular processor and hence exclusively allocated to that processor.
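A minimal sketch of such region locking follows, assuming a bare-metal environment. Here atomic_swap wraps the ARM SWP instruction; the names are illustrative, and the lock word itself must live in uncached memory so that every processor sees its current value.

```c
#include <stdint.h>

typedef volatile uint32_t region_lock_t;  /* must reside in uncached memory */

/* Atomically exchange *lock with value, returning the old value
 * (implemented here with the ARM SWP instruction). */
static inline uint32_t atomic_swap(region_lock_t *lock, uint32_t value)
{
    uint32_t old;
    __asm__ volatile("swp %0, %1, [%2]"
                     : "=&r"(old)
                     : "r"(value), "r"(lock)
                     : "memory");
    return old;
}

void region_acquire(region_lock_t *lock)
{
    while (atomic_swap(lock, 1) != 0)
        ;   /* spin until the current owner releases the region */
    /* This processor now owns the region and may cache it freely. */
}

void region_release(region_lock_t *lock)
{
    /* Any dirty cache lines for the region must be cleaned (written
     * back) before release, so the next owner sees the latest data. */
    *lock = 0;
}
```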
2. Prevent areas being accessed by multiple processors from being cached
The StrongARM's Memory Management Unit (MMU) allows the caching behaviour of memory to be specified at a 1MB, 64KB or 4KB granularity. Hence, one can mark all shared regions of memory as uncacheable but still write-bufferable. This prevents the processor from using its data cache on shared regions of memory, but writes to shared areas at least still permit burst writes.
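As an illustration, the following sketch marks one 1MB section as uncacheable but bufferable in an ARMv4-style first-level page table. The bit layout follows the ARM architecture manual for section descriptors (C is bit 3, B is bit 2), but the names are my own and the details should be checked against the SA-110 documentation.

```c
#include <stdint.h>

#define SECTION_DESC   0x2u          /* descriptor type: 1MB section        */
#define SECTION_BIT4   (1u << 4)     /* must be set on this MMU generation  */
#define SECTION_B      (1u << 2)     /* bufferable: burst writes allowed    */
#define SECTION_C      (1u << 3)     /* cacheable (deliberately left clear) */
#define SECTION_AP_RW  (0x3u << 10)  /* read/write access permissions       */

/* Map one 1MB section as shared: uncacheable but write-bufferable.
 * Each first-level entry covers 1MB, indexed by the top 12 address bits. */
void map_shared_section(uint32_t *page_table, uint32_t virt, uint32_t phys)
{
    page_table[virt >> 20] = (phys & 0xFFF00000u)
                           | SECTION_AP_RW
                           | SECTION_BIT4
                           | SECTION_B      /* write buffer enabled */
                           | SECTION_DESC;  /* note: C bit not set  */
}
```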
3. Implement software cache coherency
One severe problem with doing large volumes of work in this type of uncached shared memory is that burst reads are not permitted. If you are copying memory from one location to another (a common enough activity), the writes would be fast enough, but the reads would be extremely slow. A solution here is to implement a form of software-triggered cache coherency.
For example, if processor A wanted to rewrite 64KB of processor B's memory, it could send a message to processor B saying that it was about to do so. Processor B would then invalidate that area of memory within its cache, and processor A would write the 64KB. Should processor A or B then access that 64KB, both would see the correct contents.
The reason such a large amount as 64KB is necessary is that this scheme carries a certain amount of overhead, which becomes proportionately smaller as the amount of data increases. One must also, of course, ensure that processor B is happy to lose everything within that 64KB. However, if these stringent conditions are met, the data caches of both processors can be left enabled, with the resultant speed increase.
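Processor B's side of this handshake might look like the sketch below. The MCR encoding shown (c7, c6, 1: invalidate one data cache line by address) is believed correct for the SA-110 but should be verified against the processor manual, and send_ack() is a hypothetical stand-in for whatever message transport the system provides.

```c
#include <stdint.h>

#define CACHE_LINE 32u               /* SA-110 data cache line size    */

extern void send_ack(void);          /* hypothetical message transport */

/* Invalidate (discard without writing back) one data cache line. */
static inline void dcache_invalidate_line(uint32_t addr)
{
    __asm__ volatile("mcr p15, 0, %0, c7, c6, 1" : : "r"(addr));
}

/* Called on processor B when A announces it is about to rewrite
 * [base, base + len).  Every cached copy B holds of that range is
 * discarded, so B's next read fetches A's new data from memory. */
void on_rewrite_notice(uint32_t base, uint32_t len)
{
    uint32_t addr;
    for (addr = base & ~(CACHE_LINE - 1); addr < base + len; addr += CACHE_LINE)
        dcache_invalidate_line(addr);
    send_ack();                       /* tell A it may now start writing */
}
```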
Conclusions
Through the use of the recommendations within this document, it is hoped that the highest possible performance can be achieved in a multiprocessing StrongARM solution.