What with the Mandelbrot Engine prototype sitting on the
corner of my desk, I guess I should describe the innards so
you guys know what you're getting into...
First, the cold water spray:
This thing is designed to do Mandelbrot calculations. It's
not a general-purpose array processor, and it can't be
contorted into being one. There are three main issues:
inter-processor communications, dynamic range of the numeric
operations, and data I/O.
The Mandelbrot calculations don't need any communication
between the processors, so that's exactly what we've got:
none. A "real" array processor should have some
communication, but there's no good way to pull it off.
Homework: sketch an
inter-processor communications method that will work for any
inter-processor communications method that will work for any
number of processors between 1 and 256, satisfy the needs of
algorithms you don't know about yet, and can be implemented
on a single-chip micro without affecting the performance of the
main routine that doesn't need it...
The internal math is fixed point, with 60 fractional bits.
That gives a dynamic range of -8.0 to +7.999, which is
ideally suited to Mandelbrot calculations, but not much
else. The routines support addition, complementation,
multiplication, and squaring, but not division. The multiply
routine is a 300 line macro that expands into about 3K of
code -- it's written without run-time loops! There are some
interesting side effects, because some of the routines check
for overflow and clamp the numbers to 3.9999...
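The arithmetic above can be sketched in C, modeling the 4.60
format with 64-bit host integers (the real engine does all of
this in loop-free 8052 assembly). The saturation behavior and
the escape test are my guesses at how the routines fit
together, not a transcription of the actual code:

```c
#include <stdint.h>

/* 4.60 fixed point: one sign bit, three integer bits, 60 fractional
   bits, so the range is -8.0 to just under +8.0.  Overflowing results
   clamp to +/-3.9999..., as the article describes. */
typedef int64_t fix;
#define FRAC_BITS 60
#define FIX_ONE   ((fix)1 << FRAC_BITS)
#define FIX_CLAMP (((fix)4 << FRAC_BITS) - 1)   /* 3.9999... */

static fix fix_mul(fix a, fix b)
{
    /* 64x64 -> 128-bit product, then drop the extra fractional bits.
       (gcc/clang __int128; the >> on a signed value is arithmetic
       on those compilers.) */
    __int128 p = ((__int128)a * b) >> FRAC_BITS;
    if (p >  FIX_CLAMP) return  FIX_CLAMP;      /* clamp on overflow */
    if (p < -FIX_CLAMP) return -FIX_CLAMP;
    return (fix)p;
}

static fix from_double(double d) { return (fix)(d * (double)FIX_ONE); }

/* z = z^2 + c until |z|^2 reaches 4 (a saturated square counts as
   escaped) or the iteration limit runs out. */
static unsigned iterate(fix cr, fix ci, unsigned max_iter)
{
    fix zr = 0, zi = 0;
    unsigned n;
    for (n = 0; n < max_iter; n++) {
        fix zr2 = fix_mul(zr, zr);
        fix zi2 = fix_mul(zi, zi);
        if (zr2 + zi2 >= FIX_CLAMP)
            break;                              /* escaped */
        fix t = zr2 - zi2 + cr;
        zi = fix_mul(zr, zi);
        zi = zi + zi + ci;                      /* 2*zr*zi + ci */
        zr = t;
    }
    return n;
}
```

Note that squaring before the escape test keeps every
intermediate inside the format: once |z|^2 < 4 is known, the
next z stays well short of the +/-8 limits.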
With 60 fractional bits you can zoom in pretty much to your
heart's content... we lose precision at about 1.0E-12, so
you'll lose interest first. The iteration counts can range
up to 64K-1, which ought to be enough for any picture you
can get in your sights. With 8-byte reals there isn't
enough RAM to hold all the values for Julia calculations...
which is why we need an 8052-style processor.
The results stream out of the array in daisy-chain order,
which means the AT has to wait for the slowest processor.
The big advantage of this method is that the communications
channel doesn't have to carry addressing information, so the
data rate isn't a limiting factor for any practical image.
We could have the AT poll the processors, but then the
effective serial data rate would be under 1/3 the actual
rate, which isn't a good idea either. We could have any
processor that's not ready send a zero count, but then the
AT has to maintain a map of which points are filled in,
which are pending, and so forth... this puts the AT smack
into the critical path, which was exactly the reason we
wanted an array processor in the first place!
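A toy model makes the daisy-chain trade-off concrete. The
`drain_time` helper and the numbers fed to it are invented for
illustration; the point is only that the host sees results in
chain order, so one slow processor stalls everything behind it:

```c
/* Toy model of the daisy-chain readout: the host receives results
   strictly in processor order, with no addressing information on
   the channel.  finish[i] is when processor i completes its point;
   xfer is the cost of shifting one result to the host. */
static double drain_time(const double finish[], int n, double xfer)
{
    double t = 0.0;                 /* host-side clock */
    for (int i = 0; i < n; i++) {
        if (finish[i] > t)
            t = finish[i];          /* stall until processor i is done */
        t += xfer;                  /* shift its result out */
    }
    return t;                       /* when the last result arrives */
}
```

With eight processors finishing at times 1, 2, 3, 100, 1, 1,
1, 1 and a unit transfer cost, the last result lands at 105:
the four fast processors behind the slow one sit idle until it
coughs up its count.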
Providing a hook so you can download an 8052 program is no
big to-do, so it'll be in the masked version. You'll need
to add a RAM and address latch per processor, which will
double the PC board footprint and expense. Basically, you
get to download code to any one or all processors, then
specify that the code be executed instead of the standard
Mandelbrot iteration, or run on a one-off basis to replace the
whole smash. Fair enough?
My feeling is that you're kidding yourself. If you don't
have the full suite of cross-assemblers and simulators you
don't have a chance of getting the code running. If you've
got all that, why do you want to bother hacking around with
our hardware? Just get some 8031s, EPROMs, RAMs, and roll
your own... and write it up for INK, of course!
Now, the good news:
This thing is a killer. Steve said that it takes about 3
minutes to compute the overall image with 64 processors.
What he didn't say is that much of that is spent drawing the
dots on the screen -- the Mandelbrot engine is waiting on
the AT!
The initial image is the whole Mandelbrot set, centered on
(-0.405,0) with a real axis size of 3.59 and aspect ratio of
about 1.33. The image has 19.7% black points (44154 pels)
with an average 6.6 iterations/point and computes in 2.8
minutes. If overhead in transmission and dot drawing
weren't a factor, it would take 1.9 minutes.
A better measure is to run the same scene with an iteration
count of 128. It takes 6.2 minutes to display 37667 Mset
points (16.8%), with an average of 15.0 iterations/point.
The communications and display overhead accounts for about
1.8 minutes of that time.
A LONG computation with this thing set to 1024 iterations is
maybe 45 minutes...
I don't have comparative times, but it greatly outruns my 8
MHz AT/10 MHz 80287 running the official IBM MSET program.
Benchmarking this thing is a problem because we don't have
an apples-to-apples comparison with a 286 running the same
code. Sometime Real Soon Now...
Ballpark performance: each iteration takes 5 to 7 ms per
processor (including ALL overhead). Divide by the number of
processors to get the average time per iteration for the
whole array. For 64 processors it's about 94 us, including
the data transmission and dot drawing times. Your mileage
may vary, but that's a good starting point.
Multiply by the number of points on the screen (224,000 for
an EGA) and multiply THAT by the average number of
iterations per point (which depends on the scene). Divide
by 60,000 to turn milliseconds into minutes and send us a check!
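Here's that recipe as a hypothetical helper (the function name
and midpoint figure are mine). Fed the initial-image numbers
from above -- 6 ms per iteration, 64 processors, an EGA screen,
6.6 iterations/point -- it lands at about 2.3 minutes, in the
right neighborhood of the measured 2.8 once transmission and
dot-drawing overhead are thrown in:

```c
/* Back-of-envelope time estimate for the Mandelbrot Engine:
   per-iteration cost is divided across the array, then scaled by
   the total iteration workload of the scene. */
static double estimate_minutes(double ms_per_iter, int nproc,
                               long points, double avg_iters)
{
    double total_ms = ms_per_iter / nproc * points * avg_iters;
    return total_ms / 60000.0;      /* milliseconds -> minutes */
}
```

Treat the result as a ballpark only: the 5-7 ms figure already
folds in overhead, and the average iteration count swings
wildly from scene to scene.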
Ed Nisley