Modern compilers generally implement boolean values using types with much larger ranges than just 0 and 1 (such as bytes, or even 32-bit integers). Frugal programmers have historically used "packing" techniques to take several independent booleans and address them individually as bits inside these larger types. (For some well-known examples of this, the STL has std::bitset and the Boost library has boost::dynamic_bitset.)
Nstate is a relatively simple library for C++ which provides analogously dense packing for arrays of values known to range from 0 to radix - 1. It was implemented to support the Nocycle project, but can be applied in general contexts like any other array:
#include <cstdlib>    // for rand()
#include <iostream>
#include "Nstate.hpp" // Nstate library from Nocycle project

int main(int argc, char * const argv[]) {
    nocycle::NstateArray<3> narr (20); // array of length 20

    // fill the array with 20 random values 0..2
    for (int index = 0; index < 20; index++) {
        nocycle::Nstate<3> n = rand() % 3;
        narr[index] = n;
    }

    // now print the values
    for (int index = 0; index < 20; index++) {
        std::cout << "NstateArray<3>[" << index << "] = "
            << narr[index] << std::endl;
    }

    return 0;
}
Yet under the hood, all 20 tristate values occupy only a single 32-bit value. A conventional approach with a vector of unsigned char would take five times as much storage, and the more common packing strategy of putting each tristate into 2 bits would only be able to fit 16 tristates into 32 bits. Though reading from and writing to an NstateArray involves a little overhead, the space savings can be significant in certain applications.
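For the curious, the arithmetic behind those claims is easy to check. The following snippet is just a sanity check of the numbers, not part of the library:

#include <cstdint>

// 3^20 = 3,486,784,401, which still fits in a 32-bit unsigned integer
// (maximum 4,294,967,295), so 20 independent tristates can share one word.
static_assert(3486784401u <= UINT32_MAX, "20 base-3 digits fit in 32 bits");

// By contrast, a std::vector<unsigned char> spends 20 bytes (160 bits) on
// the same 20 tristates, and 2-bit packing fits only 16 tristates per
// 32-bit word (16 * 2 = 32 bits), so it needs two words to hold 20.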
The code for Nstate is currently hosted in a project on GitHub, and I welcome input or improvements to the library. It is released under the Boost Software License, which permits liberal inclusion in both open source and commercial software.
Background
I was trying to figure out how to efficiently store an array of tristates in a memory block. I knew I could use two bits per tristate, but a tristate actually needs slightly less than two bits to be represented. To understand the theoretical limit of how much memory this could save, one has to use logarithms:
- (bits for N tristates, packed as densely as possible) / (bits for N tristates at 2 bits each)
- = ceil(log2(3^N)) / (2 × N)   ...by definition
- ≈ (N × log2(3)) / (2 × N)   ...power rule of logarithms
- = log2(3) / 2   ...cancellation
- ≈ 79%   ...calculator
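To double-check that last figure with code rather than a calculator (this is just a verification snippet, not something the library provides):

#include <cmath>
#include <iostream>

int main() {
    // Ratio of ideal base-3 packing to 2-bits-per-tristate packing:
    // log2(3) / 2 is roughly 1.585 / 2.
    std::cout << std::log2(3.0) / 2.0 << std::endl; // prints 0.792481
    return 0;
}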
Though introducing code complexity to save 20% of memory might not usually be of much importance, the problem I was looking at used 2^16 * 2^16 tristates... which at 2 bits per tristate would be 512MB. Saving a fifth of that is nice, and could even be essential on hardware based on a single 512MB chip that needed some room for code as well as data!
Theory aside, it's a bit tricky to address 2^16 * 2^16 tristates sequentially in a 405MB buffer. Current computers aren't geared toward interpreting a multi-megabyte series of bits as one giant holistic number that they can easily divide by powers of 3 to get back a remainder. The only serious discussion I could find about this was in the Joel on Software Forum, where such compactness was discouraged on the grounds that "memory is cheap".
However, Mark Bessey provided a solution in the thread that packed 20 tristates into a common 4-byte unsigned number. This still wasted some available states, but how much worse would it be than the theoretical limit? Since plain 2-bit packing fits only 16 tristates into each 32-bit word while this scheme fits 20, it breaks down to:
(16/20) * 512MB = 409.6MB
So with the theoretical limit at 405MB or so, I was wasting 4-5MB beyond what's "possible". Yet this compromise made the performance tractable, so I decided to go with it. It seemed like a case ripe for generalization, so rather than solve the problem just for tristates I went ahead and did it for "nstates".
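To make the trick concrete, here is a minimal sketch of the base-3 packing idea in isolation. The function names pack_tristates and unpack_tristate are made up for illustration; they are not the interface Nstate actually exposes:

#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical helpers: treat each tristate as one base-3 digit of a
// single 32-bit number, so 20 of them share one word.
uint32_t pack_tristates(const std::array<unsigned, 20>& values) {
    uint32_t packed = 0;
    for (int index = 19; index >= 0; index--) {
        assert(values[index] < 3);
        packed = packed * 3 + values[index]; // accumulate base-3 digits
    }
    return packed;
}

unsigned unpack_tristate(uint32_t packed, int index) {
    for (int i = 0; i < index; i++)
        packed /= 3; // discard the lower-order digits
    return packed % 3; // the base-3 digit at 'index'
}

Writing a single element back follows the same idea: subtract the old digit scaled by its power of 3, then add the new one. NstateArray generalizes this from tristates to arbitrary radixes, spreading the digits across as many 32-bit units as needed.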
Implementation Notes
Having never had occasion to write my own C++ array class before, I used boost::dynamic_bitset as my model for getting ordinary-looking array accesses working.
Initially I had hoped that each NstateArray template instantiation would have a static power table member shared among all instances with that radix; however, a number of wrenches got thrown into that plan. So I ended up building a global power table map, which instances cache--not a particularly great solution. I considered dynamically allocating an array and mapping to that, so that each array instance would store only a pointer: faster to access, and cheaper to copy than a std::vector. But due to reference counting issues I thought I'd put the code out for peer review to see if an even better compiler-independent solution was out there...
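For reference, the table in question is small. Here is a sketch of what one entry of that global map might hold (make_power_table is a made-up name, and the caching described above is not shown):

#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical illustration: every power of 'radix' that fits in a 32-bit
// packed unit, so radix^i need not be recomputed on each element access.
std::vector<uint32_t> make_power_table(uint32_t radix) {
    assert(radix >= 2); // a radix of 0 or 1 would make no sense here
    std::vector<uint32_t> powers;
    uint64_t power = 1;
    while (power <= UINT32_MAX) {
        powers.push_back(static_cast<uint32_t>(power));
        power *= radix;
    }
    return powers; // for radix 3: 21 entries, 3^0 through 3^20
}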
There was a lot of code that I would have liked to put into the implementation file instead of the header file. Various things stood in the way of this--essentially, when you write templated classes, a lot of the code has to stay in the header file.
It would be nice to see this beefed up and improved by anyone who thinks it's worth doing. Iterators could be useful (or, maybe not?!!) Yet because I'm the only one using this right now (that I know of), I haven't spent a whole lot of time trying to make it suit anyone else's applications. Let me know if you have an idea... I'm happy to look at incorporating your comments!