The Atari ST(E) BLiTTER in brief by The Paranoid of Paradox 2012

     ____    ___    ___________      ___________
    /    \  /  /\  /__/__   __/_____/  ____/    \
   /     /\/  / / /  /\//  /__   __/  __/ /     /\
  /      \/  /_/_/  / //  /\_/  / /  /___/  /\  \/
 /_______/______/__/ //__/ //  / /______/__/ /\__\
 \_______\______\__\/ \__\//__/ /\______\__\/  \__\
                           \__\/
  The Atari ST(E) BLiTTER in brief
  
  A comprehensive overview
  
  by The Paranoid of Paradox 2012
  
  
  
  
  Table of Contents
  a.) Introduction
  b.) Register Overview
  c.) What to do with it
  d.) Common mistakes
  e.) Compatibility
  f.) Summary
  Appendices  
  
  
  a.) Introduction
  
    This little documentation covers the Atari STE BLiTTER, its
    registers and how to operate them. For the basics, it builds on
    the STE-FAQ by the same author from a couple of years ago;
    please refer to that document for a more detailed look at the
    registers and their internals.
    It should also be noted that the Atari STE BLiTTER is identical
    to the Atari ST BLiTTER regarding its registers and interface.
    Hardware-wise, it is different as it has been integrated into
    the COMBEL, a chip that hosts various functions.
    What actually is the BLiTTER for ? Having a look at Atari's own
    official documentation, the BLiTTER is just the Bit-Blt (BLiT)
    algorithm as specified by Newman and Sproull ("Principles of
    Interactive Computer Graphics", McGraw-Hill, 1979, Chapter 18).
    It is meant to copy graphical data, organized in bit maps, and
    to manipulate it by applying masks and halftones and by
    logically combining source and destination. The Atari STE
    BLiTTER is a slightly more advanced version of this set of
    algorithms since it has direct memory access and does not rely
    on an external bus master to feed it data.
    This little documentation is not a hardware reference manual but
    instead describes some techniques the BLiTTER can perform if
    programmed accurately.
    Thanks to various people like Ray of .tSCc., ultra of Cream^Orb,
    Kalms of TLB, No of Escape and Tobe of MJJ Prod for their help
    and support. Maximum thanks of course to my crewmates 505, Dan,
    RA and Zweckform of Paradox.
    Also, this documentation relies on the Atari ST/STE/TT Profi-Book
    by Jankowski, Rabich, Reschke (Sybex, 12th edition, 1992) and on
    The TOS 1.04 Update Book by Pauly (Data Becker).
    

    
  b.) Register overview
  
    The Atari STE BLiTTER is memory mapped and capable of direct
    memory access. The registers of the Atari STE BLiTTER are
    
    $FFFF8A00: Halftone Pattern 0  (16 Bits)
    $FFFF8A02: Halftone Pattern 1  (16 Bits)
    $FFFF8A04: Halftone Pattern 2  (16 Bits)
    ...        ...
    $FFFF8A1E: Halftone Pattern 15 (16 Bits)
    
    These registers make up a 16 x 16 pixels dither pattern to be
    used for masking source data on special request (see below).
    
    $FFFF8A20: Source X Increment (15 Bit - Bit 0 is unused)
    $FFFF8A22: Source Y Increment (15 Bit - Bit 0 is unused)
    $FFFF8A24: Source Address     (23 Bit - Bit 31..24, Bit 0 unused)
    
    This sets up everything related to source addressing. It sets the
    24-Bit base address where to read data from, how many bytes to
    add after each word copied (X Increment) and how many bytes to
    add after each line copied (Y Increment) to the source address.
    
    $FFFF8A28: ENDMASK 1 (16 Bits)
    $FFFF8A2A: ENDMASK 2 (16 Bits)
    $FFFF8A2C: ENDMASK 3 (16 Bits)
    
    These masks are overlaid with the destination words and only bits
    that feature a "1" are actually manipulated by the BLiTTER. All
    bits containing a "0" will be left untouched. ENDMASK 1 refers
    to the first word per line being copied, ENDMASK 3 to the last
    word in each line being copied while ENDMASK 2 affects all copies
    in between.
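
    The masking rule can be written down in a few lines. This is a
    minimal sketch in Python, not BLiTTER code; the function names
    apply_endmask and endmask_for are made up for illustration.

```python
def apply_endmask(new_word, old_dest, endmask):
    """Keep the BLiTTER result where the mask bit is 1 and the old
    destination contents where the mask bit is 0."""
    return (new_word & endmask) | (old_dest & ~endmask & 0xFFFF)


def endmask_for(index, x_count, endmasks):
    """Pick ENDMASK 1, 2 or 3 for word number `index` (0-based) of a line.
    `endmasks` is a tuple (ENDMASK 1, ENDMASK 2, ENDMASK 3)."""
    em1, em2, em3 = endmasks
    if index == 0:
        return em1          # first word of the line
    if index == x_count - 1:
        return em3          # last word of the line
    return em2              # everything in between
```

    With a mask of $00FF, for example, only the low byte of the
    destination word is replaced.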
    
    $FFFF8A2E: Destination X Increment (15 Bit - Bit 0 unused)
    $FFFF8A30: Destination Y Increment (15 Bit - Bit 0 unused)
    $FFFF8A32: Destination address     (23 Bit - Bit 31..24/0 unused)
    
    Much like the source address generation, this covers the target
    or destination addressing, containing increment per word being
    copied (X), per line (Y) and the target address to start with.
    
    $FFFF8A36: X Count (16 Bits)
    $FFFF8A38: Y Count (16 Bits)
    
    These two registers configure how many words (!) to copy per line
    (X) and how many lines to copy in total (Y). Please note that
    these values are 16 Bits wide and unsigned.
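
    To illustrate how increments and counts interact, here is a small
    Python sketch that lists the addresses touched by one BLiT. It
    assumes that the Y increment is added in place of the X increment
    after the last word of a line (this is my reading of the register
    descriptions above, so treat it as an assumption). The function
    name blit_addresses is hypothetical.

```python
def blit_addresses(start, x_inc, y_inc, x_count, y_count):
    """Return one list of word addresses per line.
    Assumption: after the last word of a line the Y increment is
    applied instead of the X increment."""
    addr = start & 0xFFFFFE            # bit 0 is unused, 24-bit address
    lines = []
    for _ in range(y_count):
        line = []
        for word in range(x_count):
            line.append(addr)
            step = y_inc if word == x_count - 1 else x_inc
            addr = (addr + step) & 0xFFFFFE
        lines.append(line)
    return lines
```

    For one bitplane of a 32 pixels wide block in low-res (8 bytes
    per set of bitplanes, 160 bytes per line), this would mean an X
    increment of 8 and a Y increment of 160 - 8 = 152.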
    
    $FFFF8A3A: HOP (8 Bits)
    $FFFF8A3B: OP  (8 Bits)
    
    HOP stands for "Halftone OPeration" mode and configures in what way
    halftone and data read from memory using source addressing are being
    overlaid. It contains values 0..3:
      0 = All bits are generated as "1"
      1 = All bits taken from halftone patterns
      2 = All bits taken from source
      3 = Source and halftone are AND combined
    The OP Register stands for "OPeration" mode and configures how 
    destination and source data are being overlaid. It contains values
    0..15:
      0 = Target Bits are all "0"
      1 = Target Bits are Source AND Target
      2 = Target Bits are Source AND NOT Target
      3 = Target Bits are Source
      4 = Target Bits are NOT Source AND Target
      5 = Target Bits are Target
      6 = Target Bits are Source XOR Target
      7 = Target Bits are Source OR Target
      8 = Target Bits are NOT Source AND NOT Target
      9 = Target Bits are NOT Source XOR Target
     10 = Target Bits are NOT Target
     11 = Target Bits are Source OR NOT Target
     12 = Target Bits are NOT Source
     13 = Target Bits are NOT Source OR Target
     14 = Target Bits are NOT Source OR NOT Target
     15 = Target Bits are all "1"
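
    Both stages can be modelled in a few lines of Python. This is
    only a sketch to experiment with the modes, the function names
    hop_combine and op_combine are made up. It relies on the
    observation that each bit of the OP value switches on one of the
    four possible AND terms of source and target.

```python
def hop_combine(hop, source, halftone):
    """Combine a source word and a halftone word according to HOP (0..3)."""
    if hop == 0:
        return 0xFFFF               # all bits "1"
    if hop == 1:
        return halftone             # halftone only
    if hop == 2:
        return source               # source only
    return source & halftone        # source AND halftone


def op_combine(op, src, dst):
    """Combine source and target word according to OP (0..15)."""
    s, d = src & 0xFFFF, dst & 0xFFFF
    ns, nd = s ^ 0xFFFF, d ^ 0xFFFF
    out = 0
    if op & 1:
        out |= s & d                # Source AND Target
    if op & 2:
        out |= s & nd               # Source AND NOT Target
    if op & 4:
        out |= ns & d               # NOT Source AND Target
    if op & 8:
        out |= ns & nd              # NOT Source AND NOT Target
    return out
```

    For example, op_combine(3, ...) returns the source unchanged and
    op_combine(7, ...) returns source OR target.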
     
    $FFFF8A3C: Misc. Register (8 Bits)
    
    This register is a bit structure of the following content:
      Bit 7    = BUSY Bit (Write: Start/Stop, Read: Status Busy/Idle)
      Bit 6    = HOG Mode (Write: HOG/BLiT mode, Read: Status)
      Bit 5    = Smudge Mode (Write: Smudge/Clean mode, Read: Status)
      Bit 3..0 = Line number of Halftone Pattern to start with
      
    $FFFF8A3D: Misc. Register (8 Bits)
    
    This register is a bit structure of the following content:
      Bit 7    = Force eXtra Source Read (FXSR)
      Bit 6    = No Final Source Read (NFSR)
      Bit 3..0 = Skew (Number of right shifts per copy)
      
    
    That's a lot of registers, some of which are totally confusing -
    what the heck is a smudge mode ? And what's the difference between
    HOG and BLiT-mode ? Please see the STE-FAQ, in which most of the
    registers are explained in much more detail.
    To sum it up briefly, all registers of the BLiTTER should be
    initialized by software before usage. There is no way of telling
    what state the BLiTTER is in when your software takes over, so
    every register should be set to a known value even if your
    software does not use it.
    Once the BLiTTER has been initialized, the minimum setup to get it
    going consists of writing source increments and address,
    destination increments and address, the OP-mode and the count
    registers and then setting the BUSY bit - for simple copies in
    OP-mode "3" aligned on word boundaries, that's sufficient.
    The BLiTTER modifies some of the registers internally, namely the
    source and destination address registers and the Y count. The X
    count is latched internally and restored after each line. For
    consecutive copies with incremental addresses, it's therefore
    sufficient to simply set a new number of lines in Y count and to
    set it to "busy" again.
    
    The BLiTTER is (unfortunately) not restricted to fixed execution
    times. Its effective execution times "per (16-Bit) copy" heavily
    depend on the chosen HOP- and OP-modes, in clock cycles:

          OP
    HOP    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
       0   4  8  8  4  8  8  8  8  8  8  8  8  4  8  8  4
       1   4  8  8  4  8  8  8  8  8  8  8  8  4  8  8  4
       2   4 12 12  8 12  8 12 12 12 12  8 12  8 12 12  4
       3   4 12 12  8 12  8 12 12 12 12  8 12  8 12 12  4
    
    (Actually, ENDMASKs differing from $FFFF are rumoured to also
     delay the BLiTTER, I have not yet verified this.)
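
    For a rough budget, the table can be turned into a tiny estimator.
    This Python sketch simply multiplies the per-copy value by the
    number of copies; it ignores bus sharing with the CPU, HOG mode
    and the rumoured ENDMASK penalty, so take the result as a lower
    bound at best. The name estimate_cycles is made up.

```python
# Cycles per 16-bit copy, indexed as CYCLES[HOP][OP], taken from the table.
_ROW_HALFTONE = [4, 8, 8, 4, 8, 8, 8, 8, 8, 8, 8, 8, 4, 8, 8, 4]
_ROW_SOURCE = [4, 12, 12, 8, 12, 8, 12, 12, 12, 12, 8, 12, 8, 12, 12, 4]
CYCLES = [_ROW_HALFTONE, _ROW_HALFTONE, _ROW_SOURCE, _ROW_SOURCE]


def estimate_cycles(hop, op, x_count, y_count):
    """Rough cycle estimate: per-copy cost times number of word copies."""
    return CYCLES[hop][op] * x_count * y_count
```

    A plain copy (HOP 2, OP 3) of 20 words by 200 lines would come out
    at 8 x 20 x 200 = 32000 cycles in this model.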

    To optimize code that writes to the BLiTTER often and regularly,
    it might be interesting to know which registers must be set by
    software, which are updated by the BLiTTER and which are sort of
    constant until software requires different settings:
    
      $FFFF8A00: Halftone pattern 0..15:   Constant
      $FFFF8A20: Source increment X and Y: Constant
      $FFFF8A24: Source address:           Updated by BLiTTER
      $FFFF8A28: ENDMASKs 1..3:            Constant
      $FFFF8A2E: Dest. increment X and Y:  Constant
      $FFFF8A32: Dest. address:            Updated by BLiTTER
      $FFFF8A36: X-Count:                  Latched by BLiTTER
      $FFFF8A38: Y-Count:                  Must be updated by CPU
      $FFFF8A3A: Setting-Registers:        Constant (except for BUSY)
      
    This implies that, for consecutive BLiT-operations, the minimum
    is to write the number of lines to Y-Count and to set the
    BUSY-Bit. The address registers are automatically updated by the
    BLiTTER internally (using the configured increments) and the
    X-Count is latched internally, so the BLiTTER automatically
    restores the value written to the register after completing each
    line.
    
    While this settles the descriptive part about the interfaces and
    some internals of the BLiTTER, what can we actually do with it ?
    

  c.) What to do with it ?
  
    So what can the BLiTTER do for you ? What does it do quickly and
    how can you gain maximum benefit in your software from it ?
    Here are a few techniques I have found useful, arranged roughly
    in order of difficulty ...
    
   1.) Saturated increments and decrements on word-based chunky buffers
   
    One of the simplest tricks that the BLiTTER can do for you is
    saturated increments and decrements, which it performs quite
    quickly on a chunky graphics buffer in which each pixel consists
    of a word instead of a byte. Commonly, chunky graphics on the
    Atari STE are pre-multiplied by 4 to have quicker conversions to
    the planar format that the Shifter requires so that pixel colour
    "1" is represented by the value "4", colour "2" by "8" and so on,
    up to colour "15", which is represented by the value "60".
    All this effect actually requires is a decently organized halftone
    pattern, the smudge mode and some logic ...

    The halftone pattern is just a 16 word table. When doing saturated
    decrements, each element holds its line number minus "1", for
    saturated increments, each element holds its line number plus "1"
    and, of course, both need to be overflow/underflow protected:
    
                    Decrement                Increment
    Halftone 0:   0 - 1: $0000              0 + 1: $0004
    Halftone 1:   1 - 1: $0000              1 + 1: $0008
    Halftone 2:   2 - 1: $0004              2 + 1: $000C
    Halftone 3:   3 - 1: $0008              3 + 1: $0010
       ...            ...                       ...
    Halftone 14: 14 - 1: $0034             14 + 1: $003C
    Halftone 15: 15 - 1: $0038             15 + 1: $003C
    
    Now the BLiTTER can be made to use the lowest 4 bits of the word
    it just read to look up a line of the halftone pattern and to use
    this one for the configured HOP-mode. While this has been intended
    to allow some sort of random-ish halftone pattern usage, it works
    fine for what is required: If the BLiTTER reads, for example, the
    colour value "7", it will look up line "7" in the Halftone pattern
    in smudge-Mode, which contains the value "6" in the decrement and
    the value "8" in the increment setup.
    While this sounds pretty easy at first, there's one obstacle to
    overcome still: The words of the chunky graphic the BLiTTER reads
    are pre-multiplied by 4 (and so are the entries in the halftone
    pattern). Of the lowest 4 bits that the BLiTTER reads, only bits
    2 and 3 are used in any way and bits 0 and 1 are always 0.
    But that's not a problem at all since the BLiTTER can do right
    shifts _BEFORE_ looking up the line of the halftone pattern by
    simply setting the "skew"-value to "2".
    Now the BLiTTER reads a word which contains the numbers 0 to 60
    in multiples of 4, shifts these down by 2 bits, making it range
    from 0 to 15 effectively, then looks up the line in the halftone
    pattern, reads the entry and copies it back.
    All that is required otherwise is to set the OP-register to "3"
    (Target = Source) and the HOP-register to "1" (Halftone only).
    
    Contrary to what various documents state, the BLiTTER applies
    smudge not only once per line but on every read it performs,
    which makes this quite a quick way of doing simple saturated
    increments and decrements.
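
    The whole effect can be tried out in a simulation first. The
    following Python sketch models only what matters here: the lookup
    of the halftone line via the low 4 bits of the skewed source word
    (smudge), with HOP = 1 and OP = 3. The names make_halftone and
    smudge_pass are made up for illustration.

```python
def make_halftone(delta):
    """Build the 16 halftone words for a saturated add of `delta`
    (+1 or -1) on colours 0..15, pre-multiplied by 4."""
    return [4 * min(15, max(0, colour + delta)) for colour in range(16)]


def smudge_pass(buffer, halftone, skew=2):
    """Model of one BLiT in smudge mode with HOP = 1 and OP = 3:
    every word is replaced by the halftone line selected by the
    low 4 bits of the word after shifting right by `skew`."""
    return [halftone[(word >> skew) & 15] for word in buffer]
```

    Running smudge_pass over the words 0, 4, 28 and 60 with the
    increment table yields 4, 8, 32 and 60, i.e. colour 15 stays
    saturated at 60.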
    
    Can this also be used for byte-based chunky buffers ? Yes, it
    actually can, it only needs 2 passes then, and it works the
    following way:
    Set up the halftone pattern registers as described above and set
    skew to "2" as above. However, set all ENDMASKs to "0x00FF" so
    that only the low byte of each word is affected. Now activate
    smudge-mode and let the BLiTTER run over your chunky buffer once.
    Re-configure the halftone pattern by shifting each entry by 8
    bits to the left, set skew to "10" and set all ENDMASKs to
    "0xFF00", meaning that only high bytes will be affected. Now
    make the BLiTTER run over your chunky buffer again. What did it
    do this time ?
    It reads a word and shifts it down by 10 bits, effectively
    shifting out the low byte completely and scaling the high byte
    down to range from 0 to 15. Then it uses the result to look up
    the line of the halftone pattern, which contains a value
    preshifted by 8 bits. Now the BLiTTER writes back the whole
    word but the ENDMASKs prevent it from overwriting the low byte
    generated in the previous pass.
    
    Sneaky, isn't it ?
    
    (Actually, you don't even need to reconfigure the halftone
     pattern when switching from odd to even bytes or back. All
     you need to do is store each halftone value in both bytes of
     its entry, like $0000, $0404, $0808, $0C0C, $1010, $1414,
     $1818... and the ENDMASKs will take care of protecting the
     byte that must not be overwritten)
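
    A two-pass run over a byte-based buffer can be simulated like
    this. It is a Python sketch under the same simplifications as
    before (smudge lookup, HOP = 1, OP = 3) plus the ENDMASK, and it
    uses the byte-doubled halftone entries so that the pattern does
    not need to be rewritten between passes. All names are made up.

```python
def make_byte_halftone(delta):
    """Halftone words holding the saturated result in BOTH bytes."""
    return [4 * min(15, max(0, c + delta)) * 0x0101 for c in range(16)]


def masked_smudge_pass(buffer, halftone, skew, endmask):
    """One pass: look up the halftone line by the low 4 bits of the
    skewed word, then write it back through the ENDMASK."""
    out = []
    for word in buffer:
        new = halftone[(word >> skew) & 15]
        out.append((new & endmask) | (word & ~endmask & 0xFFFF))
    return out


def add_to_byte_buffer(buffer, delta):
    """Pass 1 handles the low bytes, pass 2 the high bytes."""
    halftone = make_byte_halftone(delta)
    buffer = masked_smudge_pass(buffer, halftone, skew=2, endmask=0x00FF)
    return masked_smudge_pass(buffer, halftone, skew=10, endmask=0xFF00)
```

    The word $0418 (colours 1 and 6) becomes $081C (colours 2 and 7)
    after an increment.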


   2.) Making the BLiTTER "add" colour values in c2p based effects
  
    It was Kalms of The Black Lotus who wondered why no one actually
    thought of using the BLiTTER for simple maths on the STE - which
    seems to be quite common on the Amiga.
    The problem is that the BLiTTER in the STE does not have some sort
    of overflow treatment and therefore cannot be used for any kind of
    mathematics ... Or can it ?
    
    The trick is to separate the addition from the potential overflow
    that is involved, and c2p lends itself very well to this. How
    does it work then ?
    Let's consider a word per pixel again with only 8 colours being
    used effectively and let's consider that the BLiTTER is meant to
    handle 4 graphic objects that should be overlaid cleanly, for
    example 4 blobs. Now reorganize the graphic of your blob to use
    bits 13 to 11 instead of bits 4 to 2. Then make the BLiTTER copy
    the first blob using "Target = Source" and ENDMASKs set to all
    "0xFFFF". This will both clear the chunky buffer under the blob
    as well as produce the first blob. Copy the second blob with a
    skew value of "3" using the same graphic and the OP-register set
    to "Target = Source OR Target". Copy the third blob using a skew
    value of "6" and the fourth one using a skew value of "9", all
    using "Target = Source OR Target".
    The resulting bits of each word your code has generated look like
    this:
    
      15  14  13  12  11  10   9   8   7   6   5   4   3   2   1   0
      unused   |_______|   |_______|   |_______|   |_______|   unused
                Blob 0       Blob 1      Blob 2     Blob 3
                
    That didn't really "add" the colour values of each blob, did it ?
    No, not yet. It requires a customized c2p table to turn the 4
    pixel colours stored in this word into 1 final colour. It
    will cost some memory since each pixel uses 12 bits instead
    of 4, but it's still small enough: For double-pixel mode,
    it requires 4 x 4096 x 4 bytes = 65536 Bytes and is fairly easy
    to generate:
    - Loop over all potential colours of all 4 pixels
    - Perform saturated adds of all 4 pixel colours
    - Generate the c2p-table entry
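
    The loop above can be sketched as follows in Python. To keep it
    short, the table entry here is only the saturated sum (the final
    colour index 0..15); a real c2p table would store the planar bit
    pattern for that colour instead. The names overlay_word and
    build_sum_table are made up.

```python
def overlay_word(colours):
    """OR four 3-bit blob colours into one word the way the BLiTTER
    would: blob i uses the graphic at bits 13..11 skewed by 3 * i."""
    word = 0
    for i, colour in enumerate(colours):
        word |= ((colour & 7) << 11) >> (3 * i)
    return word


def build_sum_table():
    """One entry per 12-bit combination (word >> 2): the saturated
    sum of the four 3-bit colours packed in it."""
    table = [0] * 4096
    for index in range(4096):
        colours = [(index >> shift) & 7 for shift in (9, 6, 3, 0)]
        table[index] = min(15, sum(colours))
    return table
```

    Looking up build_sum_table()[overlay_word([3, 2, 1, 4]) >> 2]
    gives colour 10, four fully lit blobs saturate at 15.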
    
    Naturally, this routine is somewhat limited but 4 overlaid pixel
    colours can already be quite a lot, for example when displaying
    zooming blobs.
    

   3.) BLiTTER sprites
   
    3.a.) Where they make sense
      BLiTTER based sprites are slightly more complex than they first
      seem, depending on the required amount of flexibility. Hence,
      the more flexibility you need, the more complicated your BLiTTER
      routine will be and at the same time, the more speed gain you
      will observe.
      If you want to push around bulky sprites with many animation
      phases so that preshifting doesn't really work out, they make
      sense. If you want to generate sprites in realtime and therefore
      don't have the option to preshift, they make sense. If you need
      additional funky operations such as applying halftone patterns
      or logical operations, they make sense.
      
    3.b.) Where they do not make sense
      On small-scaled, non-masking 2 bitplane sprites commonly used for
      sprite record screens - The CPU can do that a lot faster. They
      also make no sense when the sprites your routine handles can be
      fully preshifted and do not really require to be fully masked.
      Also, on the Falcon030, the BLiTTER can barely keep up with the
      CPU in some operations; in many operations the CPU is faster
      and only in a small number of operations the BLiTTER turns out
      to be faster. For true-colour sprites, however, the BLiTTER
      hardly makes sense.
      
    3.c.) Positioning and sizes
      You know that the Atari STE shifter uses interleaved bitplanes of
      words. So to position your sprite horizontally, you need to take
      the x-coordinate, divide it by 16 to find out which word - i.e.
      which 16 bits in screen memory in this line - it is and then
      multiply by 8 in low-res to go to the correct set of words since
      4 words = 8 Bytes make up 16 pixels in 4 bitplanes (in mid-res,
      you multiply by 4 Bytes = 2 words and in high-res, you do not
      multiply at all). To get the correct line, you simply multiply
      the y coordinate of your sprite with the number of bytes per
      line and add that to the target address.
      Now you have calculated the target address for the correct X and
      Y coordinate, but your sprite would always start at the leftmost
      pixel of that word, so you need an additional offset telling at
      which bit your sprite needs to start - naturally, this is the X
      coordinate modulo 16 (or AND 15), and by this amount you need to
      shift your sprite. To sum it all up, you need to generate 3
      values from your X and Y coordinate:
      - Y coord * (Number of bytes per line)     -> Add to target address
      - (X coord / 16) * (width of bitplane set) -> Add to target address
      - (X coord & 15) -> Amount of right shifts for your sprite
      And since the BLiTTER is terribly good at shifting - unlike the
      plain 68000 - this should speed things up.
      Or shouldn't it ?
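
      The three values can be computed like this. A minimal Python
      sketch; the defaults of 160 bytes per line and 8 bytes per set
      of bitplanes are the low-res values, and the function name
      sprite_target is made up.

```python
def sprite_target(screen_base, x, y, bytes_per_line=160, plane_set_bytes=8):
    """Return (target address, skew) for a sprite at pixel (x, y).
    plane_set_bytes is 8 in low-res, 4 in mid-res and 2 in high-res
    (one word per 16 pixels and bitplane)."""
    address = screen_base + y * bytes_per_line + (x >> 4) * plane_set_bytes
    skew = x & 15           # number of right shifts for the sprite
    return address, skew
```

      For x = 37 and y = 10 in low-res, the offset is 10 * 160 +
      2 * 8 = 1616 bytes and the sprite needs 5 right shifts.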
      
    3.d.) Some nonsense about the BLiTTER's internals
      The SKEW, FXSR and NFSR bits in the configuration register at
      $FFFF8A3D are terribly confusing when first confronted with the
      BLiTTER and are still hard to explain. So let's figure out some
      of the BLiTTER's internals.
      The BLiTTER actually has a 32-Bit internal buffer of which
      usually only the lower 16 bits are used when using SKEW = 0:
      
      Bits  31 29 27 25 23 21 19 17  15 13 11  9  7  5  3  1
            |______________________| |______________________|
              Overlap/Skew buffer          Read/Write
              
      So what are the upper 16 bits actually good for ? They
      actually make up the skew. Instead of shifting bit by bit,
      the BLiTTER simply reads out the 16 bits (15 + SKEW)..SKEW,
      taking SKEW bits from the upper Overlap/Skew buffer. After
      each write, the BLiTTER automatically copies the lower 16
      bits into the upper ones and proceeds.
      This way, the BLiTTER appears to be shifting to the right (which
      it is not) and at the same time allows to "carry" over all bits
      being "shifted" out of the initial word being read into the next
      copy. For example, if we consider a SKEW value of 3, the BLiTTER
      reads the source word into the lower word using bits 15..0, then
      reads out bits 18..3 (filling the 3 upper bits with the garbage
      that the upper word of the BLiTTER internal buffer still features
      so make sure your ENDMASKs are set up decently!). Afterwards, it
      copies Bit 15..0 to Bit 31..16, then reads new data into the Bits
      15..0, but again reads out Bit 18..3, now containing the right-
      most 3 bits of your previously copied word in the 3 left-most
      bits! This way, the BLiTTER can "carry" up to 15 Bits without
      consuming any additional time at all.
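
      The buffer mechanism as described here can be modelled in a few
      lines of Python. This is a sketch of the description, not of
      the real silicon; the name skew_line is made up and the upper
      half of the buffer is assumed to start out as zero instead of
      garbage.

```python
def skew_line(words, skew, initial_upper=0):
    """Feed source words through a 32-bit buffer and return the words
    the BLiTTER would write (before ENDMASKs are applied)."""
    upper = initial_upper & 0xFFFF
    out = []
    for word in words:
        buf = (upper << 16) | (word & 0xFFFF)   # read into bits 15..0
        out.append((buf >> skew) & 0xFFFF)      # bits (15+SKEW)..SKEW
        upper = word & 0xFFFF                   # lower half moves up
    return out
```

      With a SKEW of 3, the words $FFFF and $0000 come out as $1FFF
      and $E000: the three bits shifted out of the first word show up
      at the top of the second one.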
      
      Back to the sprites. Why is it so important to know about the
      BLiTTER's internals for sprites when it's so obvious that the
      number of right-shifts calculated in chapter 3.c.) is to be
      written into the SKEW register ?
      Because it might not be sufficient.
      And here's where one of the two most complicated bits of the
      BLiTTER comes into play.
      
    3.e.) The No-Final-Source-Read Bit
      Now let's consider a sprite you want to copy to any possible
      x-coordinate. You write the x-coordinate modulo 16 to the SKEW
      register and sometimes, this doesn't really work out. Why ?
      Because the BLiTTER shifts out bits - as soon as SKEW is unequal
      to zero - and needs one additional write access to put out the
      final bits having been "carried" from the last word having been
      read. It does not need to _COPY_ another word, it only needs to
      _WRITE_ the final bits.
      And that's what NFSR is for. Here's what it does:
      On the last copy of each line, the BLiTTER does not read the
      source but instead only reads out the 16 bits from the current
      Overlap/Skew and Read/Write buffer as given by the SKEW value
      and writes them to memory, according to the target address,
      implying that the final set of "carry" bits are now being
      written. Since the lower 16 Bits still contain some "outdated"
      data - the BLiTTER has not performed a read access to refresh
      this buffer when NFSR is set - you will observe junk on the
      screen unless your ENDMASK3 is set up properly (hint!).
      Also, this final copy counts as one with regard to the X-Count
      register, even when NFSR is set, meaning that you need to set
      the X-Count register to one more word than you would if SKEW
      was zero. However, the source address is incremented one time
      less than the destination address because the last word
      being written has not been read, so you also need to adapt the
      Y increment for source addressing accordingly.
      
    3.f.) Not really needed for BLiTTER sprites but while we're at it
      So now that we have explained the NFSR bit, what does the
      Force-eXtra-Source-Read (FXSR) bit actually do ?
      Easy. Wouldn't it all be a bit limited if the BLiTTER couldn't
      do left-shifts ? Exactly. And that's what FXSR is for.
      When using FXSR, the BLiTTER will read one word into the top
      16 bits prior to anything else it does. By doing so, it has
      effectively shifted to the left by 16 pixels so that the
      number of effective left shifts is given by (16 - SKEW).
      However, FXSR also implies that the source address will be
      incremented once more than the destination address, so
      set up your Y increments accordingly. Unlike NFSR, it does not
      decrease the X Count register as it is treated as an extra
      read and not as a copy.
      Does it make any sense to have NFSR and FXSR set at the same
      time ? While it first sounds like it would not, it turns out
      to be useful quite often: Whenever a left-shift is needed,
      even if only by one bit, FXSR needs to be set, but that also
      causes the source address to be incremented once more, so
      that the first real copy is the first "write" but already the
      second "read". NFSR makes up for this on the final copy, where
      it prevents one read but performs one write - balancing reads
      and writes. However, this depends strongly on the size of the
      bitfield you want to copy.
      To make it more visible, let's consider the example from the
      official Atari BLiTTER documentation:
      
      NFSR + FXSR: Source  |...aaaaaaaaaaaaa|bbbbbbbbbbbbbbb.|
      SKEW = 15:   Dest    |..aaaaaaaaaaaaab|bbbbbbbbbbbbbb..|
      
      Let's consider each copy and its result. Before the first
      action is taken by the BLiTTER, source and target look like
      this:
      
      No action:   Source  |...aaaaaaaaaaaaa|bbbbbbbbbbbbbbb.|
      yet          Dest    |................|................|
      Blitbuffer:          |................|................|
      
      Now, starting the BLiTTER makes it "force extra source read":
      
      Extra src.   Source  |...aaaaaaaaaaaaa|bbbbbbbbbbbbbbb.|
      read         Dest    |................|................|
      Blitbuffer:          |...aaaaaaaaaaaaa|................|
                                  
      Having done that, the BLiTTER will now perform the first copy
      by reading the next word from the source:
      
      First read   Source  |...aaaaaaaaaaaaa|bbbbbbbbbbbbbbb.|
                   Dest    |................|................|
      Blitbuffer:          |...aaaaaaaaaaaaa|bbbbbbbbbbbbbbb.|
                                                
      Now, using a SKEW value of 15, it will write back data from
      the upper 16 bits mainly
      
      First write  Source  |...aaaaaaaaaaaaa|bbbbbbbbbbbbbbb.|
                   Dest    |..aaaaaaaaaaaaab|................|
      Blitbuffer:          |...aaaaaaaaaaaaa|bbbbbbbbbbbbbbb.|
      
      Internally, the BLiTTER updates the upper 16 bits
      
      Refreshing  Source   |...aaaaaaaaaaaaa|bbbbbbbbbbbbbbb.|
                  Dest     |..aaaaaaaaaaaaab|................|
      Blitbuffer:          |bbbbbbbbbbbbbbb.|bbbbbbbbbbbbbbb.|

      but will not READ any data due to the NFSR set. Instead, it
      will only read out the upper 16-Bits mainly (mind the SKEW)
      and write back the following value ...
      
      Writing    Source    |...aaaaaaaaaaaaa|bbbbbbbbbbbbbbb.|
      bad ENDMASK Dest     |..aaaaaaaaaaaaab|bbbbbbbbbbbbbb.b|
      Blitbuffer:          |bbbbbbbbbbbbbbb.|bbbbbbbbbbbbbbb.|
      
      but using a properly configured ENDMASK3, the lowest bit(s)
      should be masked out, therefore correctly generating
      
      Writing   Source     |...aaaaaaaaaaaaa|bbbbbbbbbbbbbbb.|
      good ENDM. Dest      |..aaaaaaaaaaaaab|bbbbbbbbbbbbbb..|
      Blitbuffer:          |bbbbbbbbbbbbbbb.|bbbbbbbbbbbbbbb.|
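
      The walkthrough above can be replayed in a simulation. The
      following Python sketch models one line of a BLiT with SKEW,
      FXSR, NFSR and ENDMASKs exactly as described in this chapter
      (it is a model of the description, not verified against
      hardware). The name blit_line is made up; in the example, "."
      is encoded as a 0 bit and "a"/"b" as 1 bits.

```python
def blit_line(src, x_count, skew, fxsr=False, nfsr=False,
              endmasks=(0xFFFF, 0xFFFF, 0xFFFF), dest=None):
    """Return the destination words of one line (OP = 3, HOP = 2)."""
    dest = list(dest) if dest is not None else [0] * x_count
    buf, pos = 0, 0
    if fxsr:                          # extra read before the first copy
        buf, pos = src[0] & 0xFFFF, 1
    for i in range(x_count):
        last = i == x_count - 1
        if last and nfsr:             # no read: low half keeps stale data
            fresh = buf & 0xFFFF
        else:
            fresh = src[pos] & 0xFFFF
            pos += 1
        buf = ((buf << 16) | fresh) & 0xFFFFFFFF
        word = (buf >> skew) & 0xFFFF
        mask = endmasks[0] if i == 0 else endmasks[2] if last else endmasks[1]
        dest[i] = (word & mask) | (dest[i] & ~mask & 0xFFFF)
    return dest
```

      Source |...aaaaaaaaaaaaa|bbbbbbbbbbbbbbb.| is $1FFF, $FFFE. With
      SKEW = 15, FXSR and NFSR set and all ENDMASKs at $FFFF, the
      model writes $3FFF, $FFFD (the "bad ENDMASK" case); with
      ENDMASK 3 = $FFFC it writes $3FFF, $FFFC.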

      
    3.g.) Back to BLiTTER sprites - How to mask them ?
      "Real" software sprites need to be masked, meaning that
      all bits "used" by the sprite (differing from a dedicated
      transparent colour, commonly zero) are cleared in the target
      buffer before the sprite is copied onto it, while all bits not
      used by the sprite (equal to the transparent colour) have no
      influence at all on the background the sprite is being
      copied over.
      This is usually done by using a so-called mask which is
      AND-combined with the background. The mask is the inverse of
      the sprite's silhouette, meaning that all bits of the mask are
      "0" where the sprite features a colour differing from the
      transparent colour and "1" where the sprite features the
      transparent colour.
      The mask can be generated prior to the sprite generation or
      stored along with the sprite data.
      Actually, the BLiTTER can generate the mask by itself but it
      requires as many passes as bitplanes are involved and it requires
      the transparent colour to be "0": You copy the sprite data of the
      first bitplane into a buffer using the "NOT Source" OP-mode and
      combine each further bitplane into the same buffer using the
      "NOT Source AND Target" OP-mode, so that a mask bit only stays
      "1" where all bitplanes are "0".
      The resulting buffer can be copied to the screen in AND-OP-mode
      for all bitplanes involved.
      (Actually, it can be used for transparent colours unequal to zero,
      but then additional operations might be required).
      There is a slightly restricted but much much faster way of getting
      a mask to be AND-combined with the background:
      If you can afford to restrict your sprite to 8 colours and can
      arrange the colours in a way that it only uses colours 8 to 15,
      you get a mask for free. In this case, bitplane 3 is set for
      every pixel differing from the transparent colour "0" and therefore
      represents a perfect mask. Simply copy it over the lower 3 bitplanes
      in the "NOT Source AND Target" OP-Mode to correctly mask your sprite.
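
      The bitplane 3 trick can be checked on a single set of 16
      pixels. This Python sketch clears the silhouette with "NOT
      Source AND Target" and then puts the sprite data on top with
      "Source OR Target"; the second step is my assumption of how the
      sprite is drawn afterwards, and the name draw_row is made up.

```python
def draw_row(screen, sprite):
    """screen, sprite: 4 bitplane words each (planes 0..3) for 16 pixels.
    The sprite only uses colour 0 (transparent) and colours 8..15,
    so its bitplane 3 is the silhouette."""
    silhouette = sprite[3]
    out = list(screen)
    for plane in range(3):
        out[plane] = (~silhouette & out[plane]) & 0xFFFF   # OP 4: mask
        out[plane] = out[plane] | sprite[plane]            # OP 7: draw
    out[3] = out[3] | sprite[3]                            # plane 3: OR only
    return out
```

      Background pixels outside the silhouette stay untouched, pixels
      inside take the sprite colour.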

      Small sprites can also be masked using the ENDMASK registers. To do
      so, the appropriate mask for a single line needs to be copied to
      the ENDMASK registers, then, using an X-Increment of 2, all words
      for this line for all bitplanes are being masked. Using a suitable
      negative Y-increment, the BLiTTER can automatically jump back after
      having manipulated this line. The procedure needs to be repeated for
      every line (Thanks to RA for this trick!).
    
    3.h.) Sprites of fixed horizontal size (multiples of 16)

      This is the easiest case as long as the sprite data is left-aligned
      and that is usually the case. The first task is to check whether a
      SKEW-value of 0 can be used or not by AND-combining the x-coordinate
      of the sprite with 15. If the result is 0, no shifting is needed and
      the sprite can be copied onto the screen bitplane by bitplane, using
      either a pregenerated mask or bitplane 3 as mask as described above
      (Actually, if the SKEW-value is 0, all bitplanes can be copied in
      one linear go, but the routine will have to be separated completely
      from the case when the SKEW-value differs from 0, which is kind of
      odd).
      In case the SKEW-value turns out to differ from 0, there are
      basically two more things that need to be done: Generate decent
      ENDMASKs and correct the number of words to be copied per line,
      including setting the NFSR Bit and correcting the Y increments.
      ENDMASK 1 is easy to generate: Simply shift $FFFF to the right by
      the SKEW-value (do not use ASR since it might shift in ONEs). Then
      invert all bits of ENDMASK 1 and use the result as ENDMASK 3.
      ENDMASK 2 should be all ones anyhow.
      Since your sprite's size in X is a multiple of 16, setting the NFSR
      bit is required since bits will be shifted out of the last "copy" as
      soon as the SKEW value exceeds zero. This also means that there is
      one more word to be copied per line, so add one to the X-Count
      register, but since the source is not being incremented after the
      last copy while the destination address is, correct source and
      destination Y increments accordingly.
      Then, copy the sprite data bitplane by bitplane (which is necessary
      now since the BLiTTER would mix up bitplane data by shifting if you
      tried to copy all bitplanes at once for a SKEW value other than 
      zero), using either a pregenerated mask or bitplane 3 if you were
      using the trick explained in the previous chapter.
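
      The register values described above can be modelled in a few
      lines. The following Python sketch is only an illustration and
      not taken from Atari's manual: the function name, the 160 byte
      screen line and the destination X increment of 8 (one bitplane
      of a 4-plane low resolution screen) are assumptions. The source
      Y increment is left out on purpose because it depends on how the
      sprite data is stored.

```python
def fixed_width_blit(x, width, line_bytes=160, dest_x_inc=8):
    """Model the BLiTTER setup for a sprite whose width is a multiple of 16."""
    assert width % 16 == 0 and width > 0
    skew = x & 15                        # shift inside the 16-pixel word
    words = width // 16                  # source words per line
    if skew == 0:
        endmask1 = endmask3 = 0xFFFF     # no shifting, copy full words
        x_count, nfsr = words, False
    else:
        endmask1 = 0xFFFF >> skew        # logical shift, zeros come in
        endmask3 = ~endmask1 & 0xFFFF    # inverted ENDMASK 1
        x_count, nfsr = words + 1, True  # one extra word written per line
    # Distance from the last word written in a line to the first word
    # of the next line (the X increment is not applied after the last word).
    dest_y_inc = line_bytes - (x_count - 1) * dest_x_inc
    return {"skew": skew, "endmask1": endmask1, "endmask2": 0xFFFF,
            "endmask3": endmask3, "x_count": x_count, "nfsr": nfsr,
            "dest_y_inc": dest_y_inc}
```

      For a 32 pixel wide sprite at x = 35 this yields a SKEW of 3,
      ENDMASK 1 = $1FFF, ENDMASK 3 = $E000 and an X-Count of 3.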
          
    3.i.) Sprites of random horizontal size
    
      Now this is tricky because generating ENDMASKs is quite confusing
      for this case. While ENDMASK2 can always be set to "All Ones" -
      it's only used if there are more than 32 bits to be BLiT and then
      it covers the middle which is copied fully always - ENDMASK1 and
      3 can be separated into 2 major cases:
      - Copy only affects the leftmost 16 bits
      - Copy affects more than the leftmost 16 bits
      However, it's not sufficient to simply check the width of the
      buffer to copy. If the data is being shifted, even a BLiT of 2
      bits width can require ENDMASK1 and ENDMASK3 to be set properly
      if being shifted by 15 bits! Similarly, setting or unsetting
      NFSR is equally difficult. If, for example, a block of 18 bits
      width is being copied with a shift count of up to 14, no NFSR
      is needed - 18 bits always require 2 words to be read which are
      being covered by ENDMASK1 and 3. But, if the shift count reaches
      15, one overlap bit shows up, 3 words need to be written while
      only 2 are being read, hence, NFSR is required.
      Actually, generating all this in realtime is far too complicated,
      so packing all information into 2 tables is the best way to
      deal with it:
        a.) A table for 0..15 shifts for less than 17 bits to be copied,
        b.) a table for 0..15 shifts for more than 16 bits to be copied.
      The table is attached to this document for everybody to use.
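
      To make the reasoning above easier to follow, here is a small
      Python sketch that derives the values for any width and shift.
      It is not the attached table, just a model of the rules given
      in this chapter. The function name is made up and the block is
      assumed to be left-aligned in its source words.

```python
def random_width_blit(width, skew):
    """Return ENDMASKs, X count and NFSR for a left-aligned block."""
    assert width > 0 and 0 <= skew <= 15
    src_words = (width + 15) // 16           # words read per line
    dst_words = (skew + width + 15) // 16    # words written per line
    left = 0xFFFF >> skew                    # bits from the start position on
    rest = (skew + width) % 16               # bits used in the last word
    right = 0xFFFF if rest == 0 else (0xFFFF << (16 - rest)) & 0xFFFF
    if dst_words == 1:
        # Only one word is written, so ENDMASK 1 has to cut both sides.
        endmask1, endmask3 = left & right, 0xFFFF
    else:
        endmask1, endmask3 = left, right
    return {"endmask1": endmask1, "endmask2": 0xFFFF, "endmask3": endmask3,
            "x_count": dst_words, "nfsr": dst_words > src_words}
```

      The two examples from the text come out as expected: 18 bits
      with a shift of 14 need 2 words and no NFSR, 18 bits with a
      shift of 15 need 3 words and NFSR.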
      

    3.j.) Speed issues: Hog- vs. Blit-Mode
      
      As you remember, the blit-register $FFFF8A3C incorporates the 
      bit 6 that decides whether the BLiTTER runs in "hog"-mode (Bit 6
      set) or in the so-called "blit"-mode (Bit 6 unset) also known as
      cooperative mode. What's the difference and when to use what ?
      In "hog"-mode, the BLiTTER will not release the bus after having
      been started until the copy is completely finished. While the
      BLiTTER will, of course, share the bus with the Shifter (incl.
      DMA-sound) in "hog"-mode, the CPU has no chance of accessing the
      bus and will be stalled for as long as the BLiTTER takes to
      complete the copy.
      In "blit"-mode, the BLiTTER will release the bus after 64 cycles
      and will reserve the bus again after 64 more cycles - which is why
      this mode is also nicknamed "cooperative" mode. This allows the 
      CPU to get some work done while the BLiTTER is busy but will,
      naturally, slow down CPU _AND_ BLiTTER visibly.
      Which one is better ?
      None, but each has its own advantages and disadvantages.
      In "hog"-mode, the BLiTTER will not waste cycles reserving the bus
      (up to 7 cycles) and therefore run as quickly as possible. Using the
      timing table quoted in the first chapter it is possible to calculate
      the time after which the BLiTTER will release the bus again. However,
      during this time, the CPU will not have any access to the bus, not
      even for highly critical interrupts, potentially making it miss
      some interrupts that occurred.
      In "blit"-mode the CPU will gain access after 64 cycles for 64
      cycles, but the BLiTTER will have to reserve the bus afterwards and
      therefore delay its copying by up to 7 cycles per turn. Nevertheless,
      the CPU will have a chance of reacting to interrupts with a delay of
      64 cycles but the interrupt service routine should finish in less
      than 64 cycles, otherwise it is potentially stalled by the BLiTTER
      again.
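
      The trade-off can be estimated with simple arithmetic. The
      Python sketch below is a rough model only: it uses the numbers
      quoted above (64 cycles per turn, up to 7 cycles to reserve the
      bus) and assumes 8 cycles per word for a plain copy. The
      function names are made up for this illustration.

```python
def hog_cycles(words, cycles_per_word=8):
    """Cycles the CPU is locked out while a hog-mode copy runs."""
    return words * cycles_per_word

def blit_mode_cycles(words, cycles_per_word=8, slice_cycles=64,
                     arbitration=7):
    """Worst-case elapsed cycles for the same copy in blit-mode."""
    busy = words * cycles_per_word
    slices = -(-busy // slice_cycles)        # ceiling division
    pauses = max(slices - 1, 0)              # CPU turns between slices
    return busy + pauses * (slice_cycles + arbitration)
```

      For 160 words this model gives 1280 cycles in "hog"-mode and
      up to 2629 cycles in "blit"-mode.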
      
      There is an easy way to speed up the BLiTTER in "blit"-mode.
      To understand how, it is necessary to know that the BLiTTER latches
      the register $FFFF8A3C internally and that any write access to this
      register will update the latch.
      To speed it up in "blit"-mode, do the following:
          or.b  #$80,$FFFF8A3C.w     ; start the BLiTTER
      loop:
          bset.b  #7,$FFFF8A3C.w     ; test busy bit, then (re)start
          nop                        ; BLiTTER will need a few cycles
          bne.s   loop               ; Loop if register showed "busy"
          
      The example above is officially advertised by Atari and it works
      the following way: The first "or" instruction starts the BLiTTER
      more or less immediately. The "bset" instruction will test (i.e.
      read) bit 7 of the register and set the Z-bit in the CCR of the
      680x0 processor accordingly, then set the bit again. This write
      updates the internal latch. If the BLiTTER was pausing, bit 7 of
      the internal latch was not set, so setting it makes the BLiTTER
      request bus access again right away. This update of the internal
      latch as well as the bus request does take some time so that the
      "nop" will be executed in any case. If bit 7 was set, the
      "bne.s" will branch to the loop label. If bit 7 was unset,
      the BLiTTER has finished the copy and setting the "busy"-bit
      again had no effect. The instruction in between the "bset.b" and
      the "bne.s" does not have to be a NOP of course. It can be any
      useful instruction that can be executed before the BLiTTER is
      assigned bus access, as long as it leaves the Z-bit untouched.
      The implementation above will speed up the BLiTTER to roughly 90%
      of "hog"-mode performance and still allow the CPU to react to
      interrupts within 64 cycles of delay.
      
      In the 1990 revision of Atari's BLiTTER manual, the code quoted
      above was revised as follows:
      
      loop:
          tas     $FFFF8A3C.w        ; (re)start the BLiTTER
          nop                        ; BLITTER will need a few cycles
          bmi.s   loop               ; Loop if register shows "busy"
      
      If an interrupt service however cannot be completed in 64 cycles
      and should not be stalled by the BLiTTER again, Atari states that
      it is possible to "stall" the BLiTTER by writing Bit 7 to zero -
      But Atari points out in the official documentation that it needs
      to be set to the original state before end-of-interrupt is reached:
      
        move.b  $FFFF8A3C.w,-(sp)     ; Write register to stack
        bclr.b  #7,$FFFF8A3C.w        ; unset busy bit
        nop                           ; BLiTTER might interfere here
        
        <Interrupt Service Routine body>
        
        move.b (sp)+,$FFFF8A3C.w      ; restore original register
        
      Unfortunately, this does not work at all and is not mentioned in
      other documentation regarding the BLiTTER I have read so far.
      This seems to originate from the internal state handling. The CPU
      reads a "1" in the BLiT-busy bit when reading out the register,
      even though internally it is a "0" because the BLiTTER is pausing
      for 64 cycles. Writing a "0" will not have any effect and will
      not stop the internal state handler from setting it back to "1"
      after 64 cycles. The BLiTTER will only write a "0" if it is done
      copying data (i.e. the line counter reaches "0"). There does not
      seem to be any other way of stalling the BLiTTER on request.

      So, if parts of the code need to be protected from potential BLiTTER
      interference, there is only one decent way of doing so: Use the
      BLiTTER in hog mode but only engage small segments at a time. Since
      the BLiTTER does not reset any of its registers except for the X Count
      register, the BLiTTER can easily be used for subsequent copies by
      just setting Y Count as desired (and Line Number if needed) and start
      the BLiTTER manually to engage the next copy.
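
      How small the segments need to be can be worked out up front.
      The Python sketch below splits a copy into a list of Y Count
      values so that no single hog-mode burst exceeds a given number
      of cycles. The function name, the limit of 2000 cycles and the
      8 cycles per word are assumptions for this illustration.

```python
def hog_segments(total_lines, x_count, cycles_per_word=8, max_stall=2000):
    """Split a copy into Y Count values that each stall the CPU briefly."""
    line_cycles = x_count * cycles_per_word
    lines_per_burst = max(max_stall // line_cycles, 1)
    segments = []
    remaining = total_lines
    while remaining > 0:
        burst = min(lines_per_burst, remaining)
        segments.append(burst)               # value to write to Y Count
        remaining -= burst
    return segments
```

      A copy of 200 lines with 20 words each is split into 16 bursts
      of 12 lines plus one burst of 8 lines under these assumptions.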

    3.k.) More speed issues: Wasting time?
  
      It might often not be possible to organize CPU code around BLiTTER
      operations, but if you do, you might gain a lot of speed by placing
      the right instructions directly in front of a BLiTTER operation.
      As the BLiTTER might require a couple of cycles before actually doing
      anything, the CPU usually has a chance of loading the next instruction
      before the BLiTTER hogs the bus. If this instruction requires no
      additional bus access, it will be carried out _in parallel_ to the
      BLiTTER operation, making the Atari ST/STE a dual-processor system
      for this short period.
      Naturally, the speed gain is higher if the instruction being carried
      out by the CPU is lengthy, ideally a multiplication or division. If
      using registers only (which means the instruction does not require
      additional bus cycles to fetch data), the CPU can carry out this
      instruction partially or completely in the "background" while the
      BLiTTER is busy. The code will have to look something like this:
      
        move.w    dividend,D5       ;load value to be divided
        move.w    divisor,D6        ;load value to divide by
        or.b      #$80,$FFFF8A3C.w  ;start the BLiTTER
        divs      D6,D5             ;will run in parallel!

      Again, thanks to RA for figuring this one out.

      Naturally, on a 68020 or any subsequent 680X0, you can also try to
      "lock" the CPU into accessing cache only and not require bus accesses
      while the BLiTTER is busy. To do so, you need to make sure the cache
      "locks" the lines containing the code you want to run while the
      BLiTTER is busy and then branch to exactly this memory region before
      the BLiTTER is granted bus access.
      On a 68020, you will run into the problem that the code being
      executed must not write any result data to memory because the 68020
      has no data cache and will therefore attempt to access the bus.
      On the 68030, you can buffer some data in its data cache, but then
      you must make sure the cache has enough "empty" space to store the
      data, otherwise, it will try to write back lines to memory which
      would stall the cache because the bus is granted to the BLiTTER.
      The easiest way to do so is to flush the data cache immediately
      before starting the BLiTTER.
      This, on the other hand, is not a suitable way on a 68040 or 68060
      as these CPUs have incredibly large caches (probably not by today's
      standards, but for their time) and simply flushing the whole
      data cache might not only stall the CPU for many cycles while the
      cache is updating itself but might also remove data from the cache
      that you wanted your code to operate on outside of the BLiTTER
      context.

    3.l.) Official Atari code
  
      As pointed out by ggn, the official Atari BLiTTER programming
      manual, June 17th 1987, The Atari Corporation, Sunnyvale,
      California, has sample BLiTTER code from page 17 on.
      The original document as well as its revision from Jan 25th 1990
      can be found under
      
        http://dev-docs.atariforge.org
        
      hosted by Atari Users Network!

      The sample code is also attached to this document at the very end.
      Using a text editor, it should be easy to extract it into an ASCII
      .S-file to load into any assembly editor.

  4.) BLiTTER modified and generated code

      Self-modifying and self-generating code are most certainly the queens
      of demo-effect fundamentals. With one exception though: If the code to
      be generated or modified becomes too large, the concept fails: The CPU
      takes too much time modifying/generating bulky code if it, afterwards, 
      needs to run the modified/generated code as well.
      Usually, the code is therefore only generated/modified for a subset of 
      necessary code, for example for a single line for which the code can 
      quickly be modified or generated, and this code is run multiple times. 
      If this is, however, no longer possible because subsets of the code 
      to be run multiple times can no longer be defined, the BLiTTER can 
      come in handy and generate/modify code.
      
      4.a.) BLiTTER modified code
      
      Let's say you have a routine to generate a rectangular graphics block 
      and it is completely unrolled and therefore bulky. Now data needs to 
      be filled in for a sequence of instructions that can be read from a 
      predefined block. Naturally, the code contains a bit of "overhead", 
      for example line break generation and similar, that would need to be 
      skipped when filling in data.
      No problem using X and Y increments wisely.
      All you need to do is set up a table with the data to be read by the
      BLiTTER and set X and Y source increments accordingly. Please note
      that the BLiTTER can't read bytes, so if you want to copy data in
      packages of less than 16 bits, you will have to arrange your source
      table so that each entry is padded to a 16-bit boundary.
      The target for the BLiTTER is the unrolled loop and the X and Y
      destination increments now need to be selected carefully. Because the
      680x0 instructions are minimally 16 bits wide, the BLiTTER can modify
      each instruction easily and by using appropriate ENDMASKs, it is also
      easy to ensure that only the desired bits of each instruction are
      being modified.
      An example: Let's say, you have groups of 8 instructions which are
      24 Bytes in total, every 2nd instruction is supposed to be modified,
      then an instruction block of 8 Bytes needs to be skipped. For such
      a routine, the destination X increment would be 6 Bytes and the
      destination Y increment would be 6 + 8 Bytes due to the fact that the
      BLiTTER does not increment X on the last copy of a line. The X counter
      would be set to 4 to go through the inner block, the Y counter needs
      to be set to cover the whole unrolled code.
      And there you go. BLiTTER modified code. Simple as that.
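
      To double-check increments like the ones in this example, the
      address stepping of the BLiTTER can be modelled in Python. This
      is only a sketch with a made-up function name. It lists the byte
      offsets (relative to the destination address) that would be
      written.

```python
def dest_offsets(x_count, y_count, x_inc, y_inc):
    """List the byte offsets the BLiTTER writes to, in order."""
    offsets, addr = [], 0
    for _ in range(y_count):
        for word in range(x_count):
            offsets.append(addr)
            # X increment between words, Y increment after the last one
            addr += x_inc if word < x_count - 1 else y_inc
    return offsets
```

      With an X count of 4, X increment 6 and Y increment 14, two
      lines touch the offsets 0, 6, 12, 18 and then 32, 38, 44, 50,
      so each group starts 24 + 8 = 32 bytes after the previous one.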
      
      4.b.) BLiTTER generated code
      
      Even though it's rather unlikely to generate code by the BLiTTER,
      there's still a fairly special case that might come in handy. Let's
      say you have a huge block of code in which some instructions need
      to be generated on short notice. A compact table defines which
      instruction out of 16 in total needs to be put into certain places
      and the target addresses are in some way regular, ideally in a
      rectangular block of the code to be modified. 
      The trick is now to load all potential 16 instructions into the
      halftone pattern, switch the BLiTTER to "halftone only" HOP mode
      and activate SMUDGE mode. Now set up the source and destination
      increments as required for reading out the table and writing into
      the code block and, in case the source table is premultiplied by
      a constant that is a power of 2, also set the SKEW value accordingly
      so that the data read in covers the range 0..15 after SKEWing.
      Using the SMUDGE mode will now make the BLiTTER read in a word from
      the source table, shift it down to the range 0..15 and use the result
      as an index for the halftone pattern. The line being indexed is then
      written to the target buffer.
      Even though this method is quite restricted, it makes it fairly easy
      to generate for example line breaks plus minor overhead code such as
      correcting address registers into large blocks of screen related code.
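
      The lookup performed by SMUDGE can be written down in a few
      lines of Python. This is a simplified model with hypothetical
      table contents: it only covers SKEW values up to 12, where all
      4 index bits come from the current source word. NOP ($4E71) and
      RTS ($4E75) merely stand in for whatever one-word instructions
      are really needed.

```python
NOP, RTS = 0x4E71, 0x4E75                # two real 68000 opcodes

def smudge_lookup(table, halftone, skew):
    """Model SMUDGE with HOP 'halftone only': each word picks a line."""
    assert len(halftone) == 16 and 0 <= skew <= 12
    return [halftone[(word >> skew) & 15] for word in table]

# Hypothetical pattern: line 0 holds NOP, line 1 holds RTS, the other
# 14 lines would hold further one-word instructions.
halftone = [NOP, RTS] + [NOP] * 14
# Source table premultiplied by 4, so the SKEW value is 2.
generated = smudge_lookup([0 * 4, 1 * 4, 0 * 4], halftone, skew=2)
```

      In this example "generated" ends up holding NOP, RTS, NOP, which
      is what the BLiTTER would write into the code block.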
      
      
  5.) Special Use in Chunky Graphics Mode
      
      5.a.) Speedup in 16 (or less) colours
      
      So far, the BLiTTER has mainly been used to manipulate bitplane
      graphics for which it had been designed in the first place. So how to 
      use the BLiTTER sensibly in chunky graphics mode when not doing 
      saturated adds, increments or decrements ?
      Often, it's not worth the effort: When copying data, meaning that
      HOP is set to "2" while OP is set to "3", the BLiTTER takes 8 cycles 
      per word (See table in the first section). When using the 68000 to 
      copy words using "movem.w", it takes 12 cycles + 4 per register being 
      addressed by the instruction, meaning that it takes 8 cycles per word 
      when using "movem.w" with 3 registers and a mere 5.5 cycles per word
      when using "movem.w" with 8 registers.
      There is, however, a simple trick to speed up the BLiTTER when a
      maximum of 16 colours per pixel is applied: Use the SMUDGE mode,
      a well configured halftone pattern, HOP mode "1" and OP mode "3".
      The halftone pattern needs to contain all 16 possible target values 
      (most likely values 0..60 in multiples of 4) and the SKEW value needs 
      to be set to the exact number of shifts necessary to scale every pixel 
      value read by the BLiTTER down to cover bits 0..3 (most likely 2).
      As a result, the BLiTTER will read a word, shift it down by the given 
      amount of bits, use the resulting bits 0..3 as a lookup index for the 
      halftone pattern and write the value from the halftone pattern to the 
      destination.
      Looking at the clock cycle table in the first section, the BLiTTER 
      requires a mere 4 cycles per copy then!
      Naturally, the CPU can still outspeed the BLiTTER if the chunky buffer 
      is byte based because the CPU will copy 2 bytes at once while the 
      BLiTTER, using the smudge mode, will only copy 1 byte effectively. 
      But, if manipulation of the data is needed while copying (such as 
      saturated incrementing/decrementing, shifting or application of 
      ENDMASKs) this concept is impossible to beat by the CPU!
      Again, it's not necessary to have a word-based chunky buffer for this
      concept. If separated into 2 individual runs, one referring to the
      high byte and one referring to the low byte (each affecting SKEW,
      halftone content and ENDMASKs), it can be applied to byte buffers
      just the same, introducing minor overhead though.

      5.b.) The holy grail (?) - BLiTTER C2P
      
      So, can the BLiTTER perform chunky-to-planar conversion ? It can. Can
      it convert faster than the CPU ? Not necessarily, but it does have its
      advantages as will be laid out at the end of this chapter.
      Actually, BLiTTER c2p is anything but complicated. Much like the 
      CPU-based c2p uses tables to convert chunky pixels into a fragment of 
      a planar word, the BLiTTER can use the halftone pattern. Again, the 
      colour of the chunky pixel just having been read is being used as an 
      index by employing the SMUDGE mode and an intelligent SKEW. 
      Let's go through it step by step.
      First of all, we need to prepare the halftone pattern. Since the 
      BLiTTER can only write 16 bits at a time, which is exactly the size of 
      one bitplane word, we need to cover all 4 bitplanes of the Atari ST(E) 
      in 4 separate runs. Naturally, for these separate runs, the halftone 
      pattern needs to be constructed individually:
      
      Index    Plane 0   Plane 1   Plane 2   Plane 3
        0       $0000     $0000     $0000     $0000
        1       $FFFF     $0000     $0000     $0000
        2       $0000     $FFFF     $0000     $0000
        3       $FFFF     $FFFF     $0000     $0000
        4       $0000     $0000     $FFFF     $0000
        5       $FFFF     $0000     $FFFF     $0000
        6       $0000     $FFFF     $FFFF     $0000
        7       $FFFF     $FFFF     $FFFF     $0000      
        8       $0000     $0000     $0000     $FFFF
        9       $FFFF     $0000     $0000     $FFFF
       10       $0000     $FFFF     $0000     $FFFF
       11       $FFFF     $FFFF     $0000     $FFFF
       12       $0000     $0000     $FFFF     $FFFF
       13       $FFFF     $0000     $FFFF     $FFFF
       14       $0000     $FFFF     $FFFF     $FFFF
       15       $FFFF     $FFFF     $FFFF     $FFFF      

      Now to make sure the right pixel is being used by the BLiTTER to be
      converted, the SKEW value needs to be set accordingly. For example, for
      a compact nibble buffer with 4 chunky pixels per word:
      
      Bit    15 14 13 12 11 10  9  8    7  6  5  4  3  2  1  0
      Pixel   D  D  D  D  C  C  C  C    B  B  B  B  A  A  A  A
             |_________| |_________|   |_________| |_________|
               leftmost     left          right     rightmost
                 pixel      pixel         pixel       pixel
      
      So to convert the leftmost pixel in the above example, we need to set 
      SKEW to 12, for the left pixel to 8, for the right pixel to 4 and for 
      the rightmost pixel to 0.
      To make sure the BLiTTER will not overwrite previously converted 
      pixels when BLiTTING subsequent runs, the ENDMASKs also need to be set 
      accordingly, using the popular double pixel mode:
      
      Leftmost pixel:   All 3 ENDMASKs to %11000000 00000000
      Left pixel:       All 3 ENDMASKs to %00110000 00000000
      Right pixel:      All 3 ENDMASKs to %00001100 00000000
      Rightmost pixel:  All 3 ENDMASKs to %00000011 00000000
      
      Obviously, this is not sufficient to convert all pixels in the target 
      word, so we need to schedule 4 additional runs using the SKEW values 
      from above but these ENDMASKs:

      Leftmost pixel:   All 3 ENDMASKs to %00000000 11000000
      Left pixel:       All 3 ENDMASKs to %00000000 00110000
      Right pixel:      All 3 ENDMASKs to %00000000 00001100
      Rightmost pixel:  All 3 ENDMASKs to %00000000 00000011
      
      And finally, we need to set HOP and OP accordingly. Since we want to 
      write back Halftone content only, HOP needs to be set to "1". Since we 
      want the 2 bits we set per run to replace existing content in these 2 
      bits, we need to set OP to "3".
      
      Now, all the CPU needs to do is to control how the BLiTTER works.
      Ideally, the BLiTTER steps through each of the 8 target pixel
      positions (changing SKEW and ENDMASKs) before switching to the next
      bitplane (loading the halftone pattern, resetting SKEW and ENDMASKs).
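
      The 8 runs per bitplane can be checked with a small Python
      model. It is a sketch under one assumption that the text does
      not spell out: two consecutive source words (8 nibble pixels)
      feed one destination word, the first using the high-byte
      ENDMASKs and the second the low-byte ENDMASKs. All names are
      made up for this illustration.

```python
SKEWS = (12, 8, 4, 0)                         # leftmost .. rightmost
MASKS_HI = (0xC000, 0x3000, 0x0C00, 0x0300)   # first source word
MASKS_LO = (0x00C0, 0x0030, 0x000C, 0x0003)   # second source word

def c2p_word(src_hi, src_lo):
    """Convert 8 nibble pixels (two source words) to 4 plane words."""
    planes = []
    for plane in range(4):
        # Halftone pattern for this plane: $FFFF where the bit is set.
        halftone = [0xFFFF if (i >> plane) & 1 else 0 for i in range(16)]
        dest = 0
        for src, masks in ((src_hi, MASKS_HI), (src_lo, MASKS_LO)):
            for skew, mask in zip(SKEWS, masks):
                index = (src >> skew) & 15    # SMUDGE lookup
                # OP 3 with an ENDMASK: replace only the masked bits
                dest = (dest & ~mask) | (halftone[index] & mask)
        planes.append(dest)
    return planes
```

      Feeding the pixels 1 to 8 (source words $1234 and $5678) gives
      $CCCC, $3C3C, $03FC and $0003 for planes 0 to 3.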
      
      Does it pay off ? Using a HOP-mode "1" and an OP-mode "3", the BLiTTER 
      needs 4 cycles for each BLiT. However, since the BLiTTER needs to write
      each bitplane individually, it requires 4 x 4 cycles for a single 
      chunky pixel, which makes 16 cycles per pixel.
      This is not necessarily faster than the CPU can do, even on a compact 
      nibble buffer where an additional shift of 2 bits and clever usage of 
      the c2p-table (which is then 256KB in size) are needed. Clever usage 
      of movem.w and unrolled loops can bring even this kind of c2p 
      conversion in the range of 15 cycles per pixel.
      But the BLiTTER c2p does have advantages:
      - No bulky table needed
      - Compact nibble buffers can easily be used
      - Scalable as 3/2/1 bitplanes can be handled using this algorithm, 
        which then also require 3/4, 1/2 or 1/4 of the initial runtime.
      - Dither patterns can easily be used through the halftone pattern
      - By loading ENDMASKs properly, the target buffer does not need to be
        aligned on a word boundary but can be shifted to any even position
      
      Of course, the BLiTTER c2p also comes with a set of disadvantages:
      - Limited to 16 colours per chunky pixel
      - Not necessarily faster than a CPU-based concept
      - No customisation possible which could increase performance
      - Max2p concept not applicable
      
      Well, to be honest, the BLiTTER "could" theoretically also manage 8 
      bits per chunky pixel, but it would need to run twice per chunky pixel 
      then, severely slowing it down while the CPU-based c2p concept only 
      needs a larger table and 2 movep instructions per 4 pixels, which will 
      make it slower, but not necessarily down to half the speed.
      

  6.) Maximum 2 Planar (max2p) and the BLiTTER
  
    What is the Maximum 2 Planar (max2p) or the recently updated Maximum 2
    Planar Extreme (m2px) algorithm and how does it benefit from using the
    BLiTTER ?
    The idea behind max2p (and m2px) is to use otherwise unused bits to
    store additional pixel information being interpreted while doing the
    chunky to planar conversion. A fairly conservative approach would be
    to add 2 "normal resolution" background pixels with a single chunky
    pixel, for example by locating the bitgroups like this:
    
      Bit   15 14 13 12 11 10  9  8  7  6  5  4  3  2  1  0
               |_________| |_________| |____________|
                left bgnd   right bgnd  chunky pixel
    
    In the example above, there is also 1 overflow bit reserved so that
    an algorithm generating the chunky pixels could even add 2 pixel
    values of 4 bits each without risking to potentially overwrite
    background information.
    And guess which chip is ideal to fill in either bits 11 to 14 or
    bits 10 to 7 or bits 6 to 2 without touching any other ? The
    BLiTTER by using SMUDGE, decently set up halftone patterns and
    decently set up ENDMASKs.
    A similar approach is to combine 2 chunky pixels with a different
    colour information, for example a 2 bit alpha layer per pixel:
    
      Bit   15 14 13 12 11 10  9  8  7  6  5  4  3  2  1  0
                  |_________| |___| |___| |_________|
                 left chunky  left  right right chunky
                             alpha  alpha
                             
    This way, 2 chunky pixels can easily be converted at once while
    at the same time, additional colour information can be handled
    completely for free. And which is the best way to modify the 2
    middle bits without touching either the left or right chunky pixel
    information ? Again, the BLiTTER by decently setting up SKEW, 
    SMUDGE, halftone pattern and endmasks.
    
    Maximum 2 Planar Extreme (m2px) goes one step beyond that. It exploits
    the fact that in almost all CPU-based c2p algorithms, the 1 byte
    static offset in indirect indexed addressing for the move/or when
    converting a pixel is unused:
    
      movem.w      (A0)+,d0-d3    ;fetch 4 chunky word sized pixels
      move.l  0(A1,d0.w),d0       ;convert leftmost pixel using table A1
      or.l    0(A2,d1.w),d0       ;convert left pixel using table A2
      or.l    0(A3,d2.w),d0       ;convert right pixel using table A3
      or.l    0(A4,d3.w),d0       ;convert rightmost pixel using table A4
      movep.l         d0,X(A5)    ;put to screen buffer
      
    Why waste precious addressing capabilities if you can actually put
    2 pixels of 3 bits colour depth in the static offsets ? Usually, c2p
    code is unrolled and even more often, it's being generated so that
    the BLiTTER can easily fill in the static offsets:
    
      movem.w         (A0)+,d0-d3 ;fetch some sort of 4 chunky pixels
      move.l  l1r1(A1,d0.w),d0    ;combine with background and convert
      or.l    l2r2(A2,d1.w),d0    ;combine with background and convert
      or.l    l3r3(A3,d2.w),d0    ;combine with background and convert
      or.l    l4r4(A4,d3.w),d0    ;combine with background and convert
      movep.l            d0,X(A5) ;put to screen buffer
      
    With l1, l2, l3 and l4 denoting a "left" background pixel while
    r1, r2, r3 and r4 refer to a "right" background pixel. If these are
    static, they can easily be placed in the generation of the code,
    otherwise, the BLiTTER will gladly update them easily, as the static
    offset is simply the last byte of the instruction and employs the
    following format:
    
      Bit   7  6  5  4  3  2  1  0
           |______| |______|
         left bgnd. right bgnd.
         
    This approach also leaves the chunky pixel(s) being read completely
    free to be used in any manner that suits the underlying effect, for
    example carrying 2 chunky pixels plus 2 bits of additional alpha
    layer information per pixel as explained above. Naturally, setting
    up the m2px-tables does involve some thinking then, but it's
    possible.
    
    How will BLiTTER code look when updating, for example, the background
    pixels in the generated m2px-code that looks a lot like the example
    code listed above:
    First of all, to speed things up, the background should exist in the
    compressed 2 x 3 bit format explained above. Then the ENDMASKs need
    to be set up to not touch any other part of the instructions, e.g.
    all 3 to $00FC. If the compressed format is not available, the
    BLiTTER can create it for you easily - it will merely require 2
    runs then, one to generate the left half using ENDMASKs $00E0, and
    one using ENDMASKs $001C. By cleverly setting up halftone patterns,
    SKEW and SMUDGE can easily be used to speed up the BLiTTER, too.
    Then, carefully count the distance in bytes between the words the
    BLiTTER needs to modify; this is the destination X increment.
    Also carefully count the distance in bytes from the last word the
    BLiTTER modifies in one line to the first word of the next line
    (skipping, for example, the place where the movep.l is located).
    This is going to be the destination Y increment. Source X and Y
    increments depend on the source, naturally. NFSR and FXSR are not
    required but after each line, the BLiTTER will require updates on
    destination and maybe also source address. But that's it.
    BLiTTER modified code.
    Easy as that.
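
    What the ENDMASKs do to the instruction can be modelled in Python.
    The sketch below packs two 3 bit background pixels into the
    displacement byte and merges them into the extension word the way
    a masked write in OP mode "3" would. The function names are made
    up, and $1000 is used as the extension word of an instruction
    indexing with d1.w and a displacement of 0.

```python
def pack_background(left, right):
    """Pack two 3-bit background pixels into the displacement byte."""
    assert 0 <= left <= 7 and 0 <= right <= 7
    # Bits 7-5 hold the left pixel, bits 4-2 the right pixel. The 68000
    # sign-extends this byte, so left >= 4 means a negative offset.
    return (left << 5) | (right << 2)

def patch_extension_word(ext_word, data, endmask=0x00FC):
    """Replace only the bits selected by the ENDMASK."""
    return (ext_word & ~endmask & 0xFFFF) | (data & endmask)

patched = patch_extension_word(0x1000, pack_background(5, 3))
left_only = patch_extension_word(patched, pack_background(1, 0), 0x00E0)
```

    Here "patched" becomes $10AC. Updating only the left pixel with
    ENDMASK $00E0 afterwards gives $102C and leaves the right pixel
    and the rest of the instruction alone.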
    
  
  7.) Extremely restricted fixed point increments for the BLiTTER
  
    As far as the explanation went up to now, the BLiTTER only handles
    integer increments and offsets and often enough, only even values
    as it is operating on 16 bit words all the time.
    What do people refer to if they actually talk about zooming or
    maybe even rotozooming a graphics block using the BLiTTER ?
    Actually, the BLiTTER can sort of handle fixed point increments,
    simply by shifting the dot. The BLiTTER can only advance a source
    or destination address by multiples of 2 bytes, but what if the
    next source pixel wasn't 2 bytes away but 8 ?
    Then, an increment of 2 would not advance the source pointer by
    1 pixel (or 2 pixels for a byte based buffer) but by 1/4 pixel,
    making the BLiTTER effectively operate on fixed point numbers.
    Naturally, this actually requires the source graphics block to
    contain every pixel in multiple instances, making it consume way
    more memory, still, the precision is very limited and not near,
    for example, 8.8 Bit fixed point which is fairly popular on the
    68000. Yet, it works well for small buffers.
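
    The effect is easy to see in a small Python model. It assumes a
    word based source buffer in which every pixel is stored 4 times
    in a row and reports which original pixel is read at each step.
    The function name and parameters are made up for this sketch.

```python
def sampled_pixels(count, x_inc_bytes, copies=4):
    """Pixel index read at each step when every pixel is stored
    `copies` times as a 16-bit word."""
    assert x_inc_bytes % 2 == 0               # BLiTTER steps are even
    indices, addr = [], 0
    for _ in range(count):
        indices.append((addr // 2) // copies) # word number -> pixel
        addr += x_inc_bytes
    return indices
```

    An increment of 2 reads every pixel 4 times (zoom factor 4), an
    increment of 8 reads every pixel once, and an increment of 6
    steps by 3/4 pixel and reads the pixels 0, 0, 1, 2, 3, 3.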

    
  8.) BLiTTER in true colour effects (thanks to Ray of .tSCc.)
  
    So far, we only used the BLiTTER for effects typically done on the
    STE, either using bitplanes directly or cleverly arranging c2p-
    effects to take advantage of the BLiTTER.
    Can it do anything useful on, for example, the Falcon in its true
    colour mode, too ?
    Yes, it actually can.
    Let's take a closer look at the Falcon's true colour pixel format:
    
      Bit    15 14 13 12 11 10  9  8  7  6  5  4  3  2  1  0
             |____________| |_______________| |____________|
                  Red             Green           Blue
                  
    Let's say, using just 4 bits for red, green and blue each would
    still be sufficient for the desired effect, the pixel format
    could be reduced to
    
      Bit    15 14 13 12 11 10  9  8  7  6  5  4  3  2  1  0
             |_________|    |_________|       |_________|
                Red            Green              Blue
                
    This would limit the number of colours available to 4096, but the
    format stays compact and all of these colours can still be shown on
    one screen.
    Naturally, these 4 bits can now easily be modified in any desired
    way using halftone patterns and the SMUDGE mode again, for example
    for fading down by 1 colour value. The halftone pattern could then
    be set up like this:
    
      Line  0 - %0000 0 0000 00 0000 0 =  0 - 1 =  0 (underflow protect)
      Line  1 - %0000 0 0000 00 0000 0 =  1 - 1 =  0
      Line  2 - %0001 0 0001 00 0001 0 =  2 - 1 =  1
      Line  3 - %0010 0 0010 00 0010 0 =  3 - 1 =  2
      ...
      Line 15 - %1110 0 1110 00 1110 0 = 15 - 1 = 14
      
    In each line, the original value in red, green and blue is decreased
    by one at the same time. But by using ENDMASKs in an intelligent way,
    the colour value not due for updating can easily be masked out:
    
       First BLiTTER run to decrease red   - Set ENDMASKs to $F000
      Second BLiTTER run to decrease green - Set ENDMASKs to $0780
       Third BLiTTER run to decrease blue  - Set ENDMASKs to $001E
    
    Also, SKEW values need to be set accordingly:
    
       First BLiTTER run to decrease red   - Set SKEW to 12
      Second BLiTTER run to decrease green - Set SKEW to 7
       Third BLiTTER run to decrease blue  - Set SKEW to 1
       
    And there you go. As each word costs merely 4 cycles to modify, it is
    still fairly quick (12 cycles for decreasing red, green and blue) and
    costs basically no memory for any table. It also allows modifying
    the channels individually, for example by fading down red, fading
    up green and keeping blue as it is.
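
    The halftone table and the three runs can be checked with a small
    Python model. It is a simplified sketch: it only models SMUDGE
    picking a halftone line from bits 3..0 of the skewed source word and
    the ENDMASK protecting all other bits. Source and destination are
    assumed to be the same word and the function names are made up for
    this example.

```python
# (ENDMASK, SKEW) per channel, as listed above
RED = (0xF000, 12)
GREEN = (0x0780, 7)
BLUE = (0x001E, 1)


def fade_halftone():
    """Build the 16 halftone words: line n holds max(n - 1, 0) per channel."""
    table = []
    for value in range(16):
        faded = max(value - 1, 0)          # underflow protection for line 0
        table.append((faded << 12) | (faded << 7) | (faded << 1))
    return table


def smudge_pass(pixel, halftone, endmask, skew):
    """One BLiTTER run over one word in this simplified model."""
    line = (pixel >> skew) & 0x0F          # SMUDGE looks at bits 3..0 only
    kept = pixel & ~endmask & 0xFFFF       # bits outside ENDMASK survive
    return (halftone[line] & endmask) | kept


def fade_pixel(pixel, halftone):
    """Three runs: red, then green, then blue."""
    for endmask, skew in (RED, GREEN, BLUE):
        pixel = smudge_pass(pixel, halftone, endmask, skew)
    return pixel


HALFTONE = fade_halftone()
print(["%04X" % word for word in HALFTONE])
print("%04X" % fade_pixel(0xF79E, HALFTONE))
```

    Fading only one channel simply means doing only one of the three
    runs, for example smudge_pass(pixel, HALFTONE, *RED).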

    The above concept not only allows fading down/up quickly using
    the BLiTTER but also leaves 4 bits free (bits 11, 6, 5 and 0) to
    potentially implement a 4-bit alpha layer, right ? This can also
    be filled easily by the BLiTTER, using SMUDGE, a decently set up
    halftone pattern and appropriate SKEWing.
    Even more obscure, you can use the segmented pixel format shown
    above for this. By using SMUDGE mode and appropriately set up
    halftone data, the BLiTTER can produce the segmented data easily
    and by using the right ENDMASKs, the BLiTTER will even fill the
    segmented bits of the alpha layer into the correct position.

    Other pixel modifications can be implemented similarly but as a rule of
    thumb, the BLiTTER will only outspeed the CPU if SMUDGE mode can be
    used. Just copying around true colour pixel data using the BLiTTER is
    usually not faster than doing it using the CPU.

  9.) Other applications than graphics
  
    Already in 1999, TAM of T.O.Y.S proposed writing a BLiTTER based
    MOD-player. It never came into existence and, thinking about it, the
    BLiTTER by itself would also be unable to re-scale the samples at a
    sufficient precision. However, spreading sampled data to simulate 4
    or 8 DMA-channels would be easy to do by simply using a source
    increment that differs from the destination increment.
    It would also be possible to have the DMA sound run over a given
    memory segment over and over again and update the sampled sound data
    using the BLiTTER directly after the DMA sound has played a certain
    part of it. This would ease streaming sampled sound data from a
    slowish source, such as a disk drive.
    

  d.) Common mistakes
  
    The BLiTTER is usually not that hard to get used to, but a few
    mistakes are made over and over again and, as the BLiTTER works
    independently of the CPU, it can become tricky to actually debug
    code that turns out not to work as planned. Here are a few:
    
    ? Starting the BLiTTER does something but it appears to be incomplete
      when the CPU tries to proceed operating on the data.
    ! Please note that in cooperative mode, the BLiTTER is not done yet when
      the CPU gets control over the bus again. Either use the trick proposed 
      by Atari to accelerate the BLiTTER or use it in HOG mode.
      
    ? Tried but it still seems corrupted, especially the interrupt services 
      are getting out of sync very often.
    ! Please note that in BLiT mode, the BLiTTER will always arbitrate for 
      the bus after 64 cycles of idle time, no matter what the CPU is doing. 
      It will interfere with the interrupt service routines. If the timing 
      of the interrupt service routines is critical, the BLiTTER should only 
      be used with great care and in HOG mode where appropriate.
      
    ? It seems to work, but after a while, the system seems to crash?
    ! Sounds like you're not writing the number of lines correctly. Please 
      note that this is one of those registers you need to update _ALWAYS_ 
      before starting the BLiTTER.

    ? Wah! My chunky graphics are corrupted by the BLiTTER!
    ! Please remember that the BLiTTER _ONLY_ operates on even addresses.
      Even if your chunky graphics are using byte sized pixels, they must be
      located at an even address if you want the BLiTTER to operate on
      them correctly, as it does not allow setting an odd address.
    
    ? Sprites work fine at certain positions but are screwed up at others
    ! Commonly, shifting sprites requires a lot of attention regarding the 
      NFSR and FXSR bits, especially if the SKEW-value is greater than 0.

    ? Tried. Still looks screwed up.
    ! Please also note that as soon as NFSR or FXSR are set, the Y 
      increments need to be revised. See explanation above.
      
    ? It seems to work, but there's defective data to the left or right of 
      the BLiTTER sprite every now and then.
    ! The BLiTTER never clears its internal data buffers so that if you're 
      using SKEW, NFSR and FXSR, outdated data remains in the buffer and 
      will be displayed if not masked out by the ENDMASKs.
     
    ? My SMUDGE-mode effect only seems to use a few lines of the halftone 
      pattern, but not all 16. Why ?
    ! Please note that the SMUDGE-mode selects the halftone-line by 
      evaluating bits 3..0. You need to make sure the value read by the 
      BLiTTER uses these 4 bits, if necessary, by shifting it down using the 
      SKEW-register. The BLiTTER, by the way, doesn't really care about bits 
      15 to 4 when using SMUDGE mode.
      
    ? I tried some of the Paradox demos on my PC with an emulator. They look
      badly screwed up and therefore, they suck. Learn how to code!
    ! It's hardly our fault if emulators are not complete regarding the
      BLiTTER emulation. Ask your favourite emulator programmer to improve 
      the BLiTTER emulation or run the demo on the original machine but 
      don't blame us.

    ? I read on de.wikipedia.org that you're irrelevant.
    ! Yes, we were surprised about that, too. If the so-called "expert on 
      the matter" - we'd rather call him a "has-been" - considers one of the 
      few groups still doing things irrelevant, it makes you wonder what 
      groups are left to be considered relevant.
      


  e.) Compatibility
  
    Programming the BLiTTER can massively accelerate your code, but what can
    you do to make sure the machine you program for has one ? Here's a list:
    
    - Atari 520/1040ST, STf, STm, STfm, Mega and Mega ST
      Commonly, these machines had no BLiTTER. Not even all Mega and Mega ST
      had one initially. Atari merely mentioned in the owner's manual that
      if the machine does not offer to activate a "Blitter" in the "Extras"
      menu of the desktop, it can be fitted with one at the local Atari
      dealer. Later Atari Mega and Mega ST models were shipped with one.
    
    - Atari STE and MegaSTE
      These machines had a BLiTTER. All of them did. As it's included in the 
      so-called COMBEL chip, it's present in all STE and MegaSTE.
      Please note that in the MegaSTE, BLiTTER timing is slightly different 
      as the BLiTTER requires more idle cycles before arbitrating the bus.
      This may become critical for extremely hard synchronized effects.
      As a result, starting the BLiTTER on a MegaSTE is delayed by one
      "nop" instruction of the CPU which needs to be taken into account.
      Some senseless techno babble:
        The BLiTTER requires 4 to 7 CPU cycles to arbitrate the bus on a
        regular ST or STE, leaving the CPU enough time to finish a bus
        access. On the MegaSTE, the CPU accesses the bus through a cache,
        which, unfortunately, Atari has never disclosed any internals of.
        However, the cache controller is likely to organize the cache in
        "lines" of 16 bytes each. To either read or write a complete line
        in one go, the cache needs to do 8 subsequent accesses, reading or
        writing 16 bits at once. This implies 8 bus cycles (minimally), so
        that it's likely that Atari simply changed the minimum "slot" which
        is granted to the CPU/cache from 4 cycles on the ST/STE to 8 cycles
        on the MegaSTE, making the BLiTTER require 8 to 11 cycles to
        arbitrate the bus.
      
    - Atari TT
      This model had no BLiTTER. Never.
      
    - Atari Falcon030
      The Falcon030 had one in all configurations and as a special bonus, 
      the BLiTTER in the Falcon can be switched to either 8MHz or 16MHz
      by writing a "1" to bit 2 of $FFFF8007 for 16MHz or clearing bit 2 of
      $FFFF8007 for 8MHz.
      
    - Atari compatibles such as MedusaT40, Hades040/060 or Milan
      None of these had a BLiTTER as they all were aimed at running more 
      advanced graphics cards which might have internal BLiTTER-like 
      accelerators but were not accessible through a memory mapped 
      interface.
      
    - Emulators:
      Most of the emulators focussed on GEM-compatibility such as GEMulator,
      WinSTon, TOS2WIN, STemulator etc. feature no BLiTTER-emulation, at 
      least not on a hardware level.
      SainT does not support FXSR and NFSR and halftone operations also seem 
      to be completely malfunctioning, otherwise, simple blits seem to work.
      The most popular emulator STEem features most parts of the BLiTTER, 
      however, Halftone-operations are restricted and SMUDGE-mode is not 
      supported at all.
      Hatari supports all of the BLiTTER features and the timing is 
      accurate, at least if you configure it to 8MHz CPU clock.
      
    How do you actually find out at runtime whether the machine is
    equipped with a BLiTTER ?  By calling XBIOS #64 (Blitmode).
    This function accepts one parameter, the mode:
    
      mode = -1: Do not alter XBIOS blit mode, just return current mode,
      mode =  0: Deactivate BLiTTER, return previous mode and
      mode =  1: Activate BLiTTER, return previous mode
      
    The return value merely uses bit 0 to report the XBIOS blit mode and
    bit 1 to report whether a BLiTTER is available in hardware:
    
      Blitmode = 0: No BLiTTER available and XBIOS is not using one,
      Blitmode = 1: No BLiTTER available but enabled by XBIOS (impossible),
      Blitmode = 2: BLiTTER available but disabled by XBIOS and
      Blitmode = 3: BLiTTER available and enabled by XBIOS.
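
    Decoding the return value boils down to testing two bits. A minimal
    Python sketch of the table above; the function name and the
    dictionary keys are made up for this example:

```python
def decode_blitmode(value):
    """Split the XBIOS #64 return value into its two documented bits."""
    return {
        "xbios_uses_blitter": bool(value & 0x01),  # bit 0: XBIOS blit mode
        "blitter_present": bool(value & 0x02),     # bit 1: hardware found
    }


for value in range(4):
    print(value, decode_blitmode(value))
```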
    
    Please note that this function does not exist in TOS revisions before
    TOS 1.02 so that calling XBIOS #64 will fail on TOS 1.00. Therefore,
    to really make sure,
    
      a.) check the word at address $00000002,
      b.) If that is $0100, the machine is equipped with TOS 1.00 and does
          not have XBIOS #64, so don't even try. It is also unlikely that
          the machine has a BLiTTER -> Set "No BLiTTER"
      c.) If that exceeds $0100, call XBIOS #64 and check bit 1 of the
          return value:
          If that is "1" -> Set "BLiTTER",
          if that is "0" -> Set "No BLiTTER"
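
    The decision logic of steps a.) to c.) can be sketched in Python as
    follows. The blitmode argument is a hypothetical stand-in for the
    real XBIOS #64 call and the two fake machines only exist to
    exercise the logic; on the real machine this would be a few lines of
    68000 code around a trap #14.

```python
def has_blitter(tos_version, blitmode):
    """Follow steps a.) to c.) above.

    tos_version: the word read from address $00000002.
    blitmode:    hypothetical stand-in for XBIOS #64, called with the mode.
    """
    if tos_version <= 0x0100:
        return False                  # TOS 1.00: no XBIOS #64, don't call it
    return bool(blitmode(-1) & 0x02)  # mode -1 only queries, bit 1 = hardware


def fake_ste(mode):
    """Pretend XBIOS of a machine with an enabled BLiTTER."""
    return 3


def fake_plain_st(mode):
    """Pretend XBIOS of a machine without a BLiTTER."""
    return 0


print(has_blitter(0x0106, fake_ste))
print(has_blitter(0x0104, fake_plain_st))
print(has_blitter(0x0100, fake_ste))
```

    Passing mode -1 keeps the current XBIOS blit mode untouched, so the
    check has no side effects.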
      

  f.) Summary
  
    So this is the BLiTTER. As you have read above, it can do a few funky
    tricks, it has a couple of restrictions, it can assist a few other 
    things and it's most certainly not restricted to merely copying bit-
    blocks around.
    In the golden era of Atari ST(E) demos, the BLiTTER was usually being
    ignored. STE demos mainly focussed on hardware scrolling and DMA
    sound and, if the BLiTTER was used at all, it was for a couple of
    sprites and that's about it. Now that there are more complex
    algorithms such as c2p, it was about time to set the record straight
    and show how the BLiTTER can support these new-school effects.
    Hopefully, this tutorial has motivated a few people to toy around with
    the BLiTTER a bit and perhaps add one or two new effects to the list
    below. If so, this tutorial has served its purpose.
    
    
    The Paranoid of Paradox 2012. 
     
  
    
    
  Appendix: Demo Resource List
  
    BLiTTER sprites:       Sonic in "Pacemaker"
    BLiTTER fade down:     Smearing 3D bobs in "Pacemaker"
    BLiTTER melt-o-vision: Northern Lights effect in "Pacemaker"
    
    Random Sized Sprites:  Jigsaw Puzzle Zoomer in "again"
    ENDMASK management:    Twister in "again"
    
    BLiTTER zooming:       Radial Blur in "Blue Period"
    Max2p:                 "Alternative Party Invitro" (incl. BLiTTER blur)
    M2px:                  Overlaid Tunnel in "Blue Period"
    
    BLiTTER modified code: Circle interference effect in "Blue Period"
    BLiTTER modified code: Zooming double-alpha layer scroll in "SV2011"
    

  Appendix: Original Atari BLiTTER sample code
  
    * (c) 1987 Atari Corporation
    *     All rights reserved
    *
    * BLiTTER BASE ADDRESS
    
    BLiTTER         equ     $FF8A00
    
    * BLiTTER REGISTER OFFSETS
    
    Halftone        equ     0
    Src_Xinc        equ     32
    Src_Yinc        equ     34
    Src_Addr        equ     36
    Endmask1        equ     40
    Endmask2        equ     42
    Endmask3        equ     44
    Dst_Xinc        equ     46
    Dst_Yinc        equ     48
    Dst_Addr        equ     50
    X_Count         equ     54
    Y_Count         equ     56
    HOP             equ     58
    OP              equ     59
    Line_Num        equ     60
    Skew            equ     61
    
    * BLiTTER REGISTER FLAGS
    
    fHOP_Source     equ     1
    fHOP_Halftone   equ     0
    
    fSkewFXSR       equ     7
    fSkewNFSR       equ     6
    
    fLineBusy       equ     7
    fLineHog        equ     6
    fLineSmudge     equ     5
    
    * BLiTTER REGISTER MASKS
    
    mHOP_Source     equ     $02
    mHOP_Halftone   equ     $01
    
    mSkewFXSR       equ     $80
    mSkewNFSR       equ     $40
    
    mLineBusy       equ     $80
    mLineHog        equ     $40
    mLineSmudge     equ     $20
    
    *                     E n D m A s K   d A t A
    *
    * These tables are referenced by PC relative instructions.  Thus,
    * the labels on these tables must remain within 128 bytes of the
    * referencing instructions forever.  Amen.
    *
    * 0: Destination  1: Source   <<< Invert right end mask data >>>
    
    lf_endmask:
            dc.w    $FFFF
    rt_endmask:
            dc.w    $7FFF
            dc.w    $3FFF
            dc.w    $1FFF
            dc.w    $0FFF
            dc.w    $07FF
            dc.w    $03FF
            dc.w    $01FF
            dc.w    $00FF
            dc.w    $007F
            dc.w    $003F
            dc.w    $001F
            dc.w    $000F
            dc.w    $0007
            dc.w    $0003
            dc.w    $0001
            dc.w    $0000
    
    * TiTLE:    BLiT_iT
    *
    * PuRPoSE:  Transfer a rectangular block of pixels located at an
    *           arbitrary X,Y position in the source memory form to
    *           another arbitrary X,Y position in the destination memory
    *           form using replace mode (boolean operator 3).
    *           The source and destination rectangles should not overlap.
    *
    * iN:
    *       a4     pointer to 34 byte input parameter block
    *
    * NoTe:  This routine must be executed in supervisor mode as access
    *        is made to hardware registers in the protected region of the
    *        memory map.
    *
    *
    *       I n P u T   p A r A m E t E r   B l O c K   o F f S e T s
    
    SRC_FORM        equ      0 ; Base address of source memory form        .l
    SRC_NXWD        equ      4 ; Offset between words in source plane      .w
    SRC_NXLN        equ      6 ; Source form width                         .w
    SRC_NXPL        equ      8 ; Offset between source planes              .w
    SRC_XMIN        equ     10 ; Source blit rectangle minimum X           .w
    SRC_YMIN        equ     12 ; Source blit rectangle minimum Y           .w
    
    DST_FORM        equ     14 ; Base address of destination memory form   .l
    DST_NXWD        equ     18 ; Offset between words in destination plane .w
    DST_NXLN        equ     20 ; Destination form width                    .w
    DST_NXPL        equ     22 ; Offset between destination planes         .w
    DST_XMIN        equ     24 ; Destination blit rectangle minimum X      .w
    DST_YMIN        equ     26 ; Destination blit rectangle minimum Y      .w
    
    WIDTH           equ     28 ; Width of the blit rectangle               .w
    HEIGHT          equ     30 ; Height of the blit rectangle              .w
    PLANES          equ     32 ; Number of planes to blit                  .w
    
    BLiT_iT:
    
            lea     BLiTTER,a5         ; a5-> BLiTTER register block
    
    *
    * Calculate Xmax coordinates from Xmin coordinates and width
    *
            move.w  WIDTH(a4),d6
            subq.w  #1,d6              ; d6<- width-1
            
            move.w  SRC_XMIN(a4),d0    ; d0<- src Xmin
            move.w  d0,d1
            add.w   d6,d1              ; d1<- src Xmax = src Xmin + width-1
            
            move.w  DST_XMIN(a4),d2    ; d2<- dst Xmin
            move.w  d2,d3
            add.w   d6,d3              ; d3<- dst Xmax = dst Xmin + width-1
            
    *
    * Endmasks are derived from destination Xmin mod 16 and destination
    * Xmax mod 16
    *
            moveq.l #$0F,d6            ; d6<- mod 16 mask
            
            move.w  d2,d4              ; d4<- DST_XMIN
            and.w   d6,d4              ; d4<- DST_XMIN mod 16
            add.w   d4,d4              ; d4<- offset into left end mask tbl
            move.w  lf_endmask(pc,d4.w),d4 ; d4<- left endmask
            
            move.w  d3,d5              ; d5<- DST_XMAX
            and.w   d6,d5              ; d5<- DST_XMAX mod 16
            add.w   d5,d5              ; d5<- offset into right end mask tbl
            move.w  rt_endmask(pc,d5.w),d5 ; d5<- inverted right end mask
            not.w   d5                 ; d5<- right end mask
    
    *
    * Skew value is (destination Xmin mod 16 - source Xmin mod 16)
    * & 0x000F.  Three discriminators are used to determine the
    * states of FXSR and NFSR flags:
    *
    *    bit 0           0: Source Xmin mod 16 =< Destination Xmin mod 16
    *                    1: Source Xmin mod 16 >  Destination Xmin mod 16
    *
    *    bit 1           0: SrcXmax/16-SrcXmin/16 <> DstXmax/16-DstXmin/16
    *                          Source span              Destination span
    *                    1: SrcXmax/16-SrcXmin/16 == DstXmax/16-DstXmin/16
    *
    *    bit 2           0: multiple word Destination span
    *                    1: single word Destination span
    *
    *    These flags form an offset into a skew flag table yielding
    *    correct FXSR and NFSR flag states for the given source and
    *    destination alignments
    *
    
            move.w  d2,d7              ; d7<- Dst Xmin
            and.w   d6,d7              ; d7<- Dst Xmin mod 16
            and.w   d0,d6              ; d6<- Src Xmin mod 16
            sub.w   d6,d7              ; d7<- Dst Xmin mod16-Src Xmin mod16
    *                                  ; if Sx&F > Dx&F then cy:1 else cy:0
            clr.w   d6                 ; d6<- initial skew flag table index
            addx.w  d6,d6              ; d6[bit0]<- intraword alignment flag
            
            lsr.w   #4,d0              ; d0<- word offset to src Xmin
            lsr.w   #4,d1              ; d1<- word offset to src Xmax
            sub.w   d0,d1              ; d1<- Src span - 1
            
            lsr.w   #4,d2              ; d2<- word offset to dst Xmin
            lsr.w   #4,d3              ; d3<- word offset to dst Xmax
            sub.w   d2,d3              ; d3<- Dst span - 1
            bne     set_endmasks       ; 2nd discriminator is one word dst
    * When destination spans a single word, both end masks are merged
    * into Endmask1.  The other end masks will be ignored by the BLiTTER
    
            and.w   d5,d4              ; d4<- single word end mask
            addq.w  #4,d6              ; d6[bit2]:1 => single word dst
            
    set_endmasks:
    
            move.w  d4,Endmask1(a5)    ; left end mask
            move.w  #$FFFF,Endmask2(a5) ; center end mask
            move.w  d5,Endmask3(a5)    ; right end mask
            
            cmp.w   d1,d3              ; the last discriminator is the
            bne     set_count          ; equality of src and dst spans
            
            addq.w  #2,d6              ; d6[bit1]:1 => equal spans
            
    set_count:
    
            move.w  d3,d4
            addq.w  #1,d4              ; d4<- number of words in dst line
            move.w  d4,X_Count(a5)     ; set value in BLiTTER
    
    * Calculate Source Starting address:
    *
    *   Source Form address              +
    *  (Source Ymin * Source Form Width) +
    * ((Source Xmin/16) * Source Xinc)
    
            move.l  SRC_FORM(a4),a0    ; a0-> start of Src form
            move.w  SRC_YMIN(a4),d4    ; d4<- offset in lines to Src Ymin
            move.w  SRC_NXLN(a4),d5    ; d5<- length of Src form line
            mulu    d5,d4              ; d4<- byte offset to (0, Ymin)
            add.l   d4,a0              ; a0-> (0, Ymin)
            
            move.w  SRC_NXWD(a4),d4    ; d4<- offset between consecutive
            move.w  d4,Src_Xinc(a5)    ;      words in Src plane
            
            mulu    d4,d0              ; d0<- offset to word containing Xmin
            add.l   d0,a0              ; a0-> 1st src word (Xmin, Ymin)
            
    * Src_Yinc is the offset in bytes from the last word of one Source
    * line to the first word of the next Source line
    
            mulu    d4,d1              ; d1<- width of src line in bytes
            sub.w   d1,d5              ; d5<- value added to pointer at end
            move.w  d5,Src_Yinc(a5)    ;      of line to reach start of next
    
    * Calculate Destination starting address
    
            move.l  DST_FORM(a4),a1    ; a1-> start of dst form
            move.w  DST_YMIN(a4),d4    ; d4<- offset in lines to dst Ymin
            move.w  DST_NXLN(a4),d5    ; d5<- width of dst form
            
            mulu    d5,d4              ; d4<- byte offset to (0, Ymin)
            add.l   d4,a1              ; a1-> dst (0, Ymin)
            
            move.w  DST_NXWD(a4),d4    ; d4<- offset between consecutive
            move.w  d4,Dst_Xinc(a5)    ;      word in dst plane
            
            mulu    d4,d2              ; d2<- DST_NXWD * (DST_XMIN/16)
            add.l   d2,a1              ; a1-> 1st dst word (Xmin, Ymin)
            
    * Calculate Destination Yinc
    
            mulu    d4,d3              ; d3<- width of dst line - DST_NXWD
            sub.w   d3,d5              ; d5<- value added to dst pointer at
            move.w  d5,Dst_Yinc(a5)    ;      end of line to reach next line
            
    * The low nibble of the difference in Source and Destination alignment
    * is the skew value.  Use the skew flag index to reference FXSR and NFSR
    * states in skew flag table.
    
            and.b   #$0F,d7                ; d7<- isolated skew count
            or.b    skew_flags(pc,d6.w),d7 ; d7<- necessary flags and skew
            move.b  d7,Skew(a5)            ; load Skew register
            
            move.b  #mHOP_Source,HOP(a5)   ; set HOP to source only
            move.b  #3,OP(a5)              ; set OP to "replace" mode
            
            lea     Line_Num(a5),a2        ; fast ref to Line_Num register
            move.b  #fLineBusy,d2          ; fast ref to LineBusy flag
            move.w  PLANES(a4),d7          ; d7<- plane counter
            bra     begin
    
    *
    *          T h E   s E t T i N g   O f   S k E w    F l A g S
    *
    *
    * QUALIFIERS   ACTIONS           BITBLT DIRECTION: LEFT -> RIGHT
    *
    * equal Sx&F>
    * spans Dx&F FXSR NFSR
    * 0     0     0    1 |..ssssssssssssss|ssssssssssssss..|
    *                    |......dddddddddd|dddddddddddddddd|dd..............|
    *
    * 0     1     1    0 |......ssssssssss|ssssssssssssssss|ss..............|
    *                    |..dddddddddddddd|dddddddddddddd..|
    *
    * 1     0     0    0 |..ssssssssssssss|ssssssssssssss..|
    *                    |...ddddddddddddd|ddddddddddddddd.|
    *
    * 1     1     1    1 |...sssssssssssss|sssssssssssssss.|
    *                    |..dddddddddddddd|dddddddddddddd..|
    
    
    skew_flags:
    
            dc.b    mSkewNFSR          ; Source span < Destination span
            dc.b    mSkewFXSR          ; Source span > Destination span
            dc.b    0                  ; Spans equal Shift Source right
            dc.b    mSkewNFSR+mSkewFXSR ; Spans equal Shift Source left
            
    * When Destination span is but a single word ...
    
            dc.b    0                  ; Implies a Source span of no words
            dc.b    mSkewFXSR          ; Source span of two words
            dc.b    0                  ; Skew flags aren't set if Source and
            dc.b    0                  ; Destination spans are both one word

    next_plane:

            move.l  a0,Src_Addr(a5)    ; load Source pointer to this plane
            move.l  a1,Dst_Addr(a5)    ; load Destination ptr to this plane
            move.w  HEIGHT(a4),Y_Count(a5) ; load the line count
            
            move.b  #mLineBusy,(a2)    ; <<< start the BLiTTER >>>
            
            add.w   SRC_NXPL(a4),a0    ; a0-> start of next src plane
            add.w   DST_NXPL(a4),a1    ; a1-> start of next dst plane
            
            
            
    * The BLiTTER is usually operated with the HOG flag cleared.
    * In this mode the BLiTTER and the ST's CPU share the bus equally,
    * each taking 64 bus cycles while the other is halted.  This mode
    * allows interrupts to be fielded by the CPU while an extensive
    * BitBlt is being processed by the BLiTTER.  There is a drawback in
    * that BitBlts in this shared mode may take twice as long as BitBlts
    * executed in hog mode.  Ninety percent of hog mode performance is
    * achieved while retaining robust interrupt handling via a method
    * of prematurely restarting the BLiTTER.  When control is returned
    * to the cpu by the BLiTTER, the cpu immediately resets the BUSY
    * flag, restarting the BLiTTER after just 7 bus cycles rather than
    * after the usual 64 cycles.  Interrupts pending will be serviced
    * before the restart code regains control.  If the BUSY flag is
    * reset when the Y_Count is zero, the flag will remain clear
    * indicating BLiTTER completion and the BLiTTER won't be restarted.
    *
    * (Interrupt service routines may explicitly halt the BLiTTER
    * during execution time critical sections by clearing the BUSY flag.
    * The original BUSY flag state must be restored however, before
    * termination of the interrupt service routine.)
    
    
    restart:
    
            bset.b  d2,(a2)            ; Restart BLiTTER and test the BUSY
            nop                        ; flag state.  The "nop" is executed
            bne     restart            ; prior to the BLiTTER restarting.
    *                                  ; Quit if the BUSY flag was clear.
    
    begin:  dbra    d7,next_plane
    
            rts
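
    The ENDMASK, X count and Skew calculations of BLiT_iT are easy to
    get wrong, so it can help to check them outside the machine first.
    The following Python sketch mirrors only that part of the listing
    above (no addresses, no increments); blit_setup and the dictionary
    keys are names made up for this example.

```python
FXSR = 0x80
NFSR = 0x40

# lf_endmask and the already inverted rt_endmask values of the listing
LEFT_ENDMASK = [0xFFFF >> n for n in range(16)]
RIGHT_ENDMASK = [~(0x7FFF >> n) & 0xFFFF for n in range(16)]

# skew_flags, indexed by bit 0: Sx&F > Dx&F, bit 1: equal spans,
# bit 2: single word destination
SKEW_FLAGS = [NFSR, FXSR, 0, NFSR | FXSR, 0, FXSR, 0, 0]


def blit_setup(src_xmin, dst_xmin, width):
    """Return ENDMASKs, X count and Skew register value for one line."""
    src_xmax = src_xmin + width - 1
    dst_xmax = dst_xmin + width - 1

    endmask1 = LEFT_ENDMASK[dst_xmin & 15]
    endmask3 = RIGHT_ENDMASK[dst_xmax & 15]

    shift = (dst_xmin & 15) - (src_xmin & 15)
    index = 1 if shift < 0 else 0            # the carry picked up by addx

    src_span = (src_xmax >> 4) - (src_xmin >> 4)
    dst_span = (dst_xmax >> 4) - (dst_xmin >> 4)
    if dst_span == 0:                        # single word destination
        endmask1 &= endmask3
        index += 4
    if src_span == dst_span:
        index += 2

    return {
        "endmask1": endmask1,
        "endmask2": 0xFFFF,
        "endmask3": endmask3,
        "x_count": dst_span + 1,
        "skew": (shift & 15) | SKEW_FLAGS[index],
    }


print(blit_setup(0, 4, 32))
```

    Copying 32 pixels from X = 0 to X = 4, for example, needs three
    destination words per line and the NFSR flag, as the source only
    spans two words.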
    




  Appendix: Sophisticated ENDMASK, NFSR/FXSR/Skew tables

    These two tables,
      bl_shifttab for copies exceeding 16 pixels per line and
      bl_shifttabs for copies up to 16 pixels per line
    can make getting started with the BLiTTER a lot easier.
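
    The rows of bl_shifttab listed below appear to follow a simple
    rule, so the table can also be generated instead of typed in. The
    Python sketch below states that rule as read off the first rows of
    the listing; the function names are made up for this example and
    any row not compared against the table should be treated as an
    assumption.

```python
NFSR = 0x40


def shifttab_row(shifts, size):
    """Return (ENDMASK 1, ENDMASK 3, SKEW, overflow) for one table row.

    shifts: 0..15, size: 1..16 (the table stores size 16 at index 0).
    """
    used = shifts + size                  # bits touched, counted from bit 15
    spill = used > 16                     # data runs into one more word
    last = used - 16 if spill else used   # number of bits set in ENDMASK 3
    endmask1 = 0xFFFF >> shifts
    endmask3 = (0xFFFF << (16 - last)) & 0xFFFF
    skew = shifts | (NFSR if spill else 0)
    overflow = 31 if spill else 15
    return endmask1, endmask3, skew, overflow


def shifttab_offset(shifts, size):
    """Byte offset of a row: shifts in the high nibble, size AND 15 low."""
    return ((shifts << 4) | (size & 15)) * 8


for size in (16, 1, 2):
    print(["%04X" % word for word in shifttab_row(1, size)])
```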
  
bl_shifttab:
;
; This fairly complex table is meant to be used with an unsigned word offset:
;   High nibble 0..15 = Number of shifts   AND 15
;   Low nibble  0..15 = Size of the sprite AND 15
;     premultiplied by 8 because each line consists of 8 Bytes
;
; Word 0: ENDMASK 1, only depends on the number of shifts
; Word 1: ENDMASK 3, depends on both number of shifts and size
; Word 2: SKEW,      depends primarily on number of shifts and combination of both
; Word 3: Overflow,  depends on size+shifts, multiple of 8 plus rounding upwards
;                    If desired, it can help calculating the increment from line
;                    to line before dividing by 16-pixels-per-BLiT depending on
;                    NFSR and FXSR as defined in the word at offset 4
;
                DC.W $FFFF,$FFFF,$00,15 ;    size 16/16 pixels ; 0 Shifts
                DC.W $FFFF,$8000,$00,15 ;    size  1/16 pixels
                DC.W $FFFF,$C000,$00,15 ;    size  2/16 pixels
                DC.W $FFFF,$E000,$00,15 ;    size  3/16 pixels
                DC.W $FFFF,$F000,$00,15 ;    size  4/16 pixels
                DC.W $FFFF,$F800,$00,15 ;    size  5/16 pixels
                DC.W $FFFF,$FC00,$00,15 ;    size  6/16 pixels
                DC.W $FFFF,$FE00,$00,15 ;    size  7/16 pixels
                DC.W $FFFF,$FF00,$00,15 ;    size  8/16 pixels
                DC.W $FFFF,$FF80,$00,15 ;    size  9/16 pixels
                DC.W $FFFF,$FFC0,$00,15 ;    size 10/16 pixels
                DC.W $FFFF,$FFE0,$00,15 ;    size 11/16 pixels
                DC.W $FFFF,$FFF0,$00,15 ;    size 12/16 pixels
                DC.W $FFFF,$FFF8,$00,15 ;    size 13/16 pixels
                DC.W $FFFF,$FFFC,$00,15 ;    size 14/16 pixels
                DC.W $FFFF,$FFFE,$00,15 ;    size 15/16 pixels

                DC.W $7FFF,$8000,$41,31 ;    size 16/16 pixels ; 1 Shifts
                DC.W $7FFF,$C000,$01,15 ;    size  1/16 pixels
                DC.W $7FFF,$E000,$01,15 ;    size  2/16 pixels
                DC.W $7FFF,$F000,$01,15 ;    size  3/16 pixels
                DC.W $7FFF,$F800,$01,15 ;    size  4/16 pixels
                DC.W $7FFF,$FC00,$01,15 ;    size  5/16 pixels
                DC.W $7FFF,$FE00,$01,15 ;    size  6/16 pixels
                DC.W $7FFF,$FF00,$01,15 ;    size  7/16 pixels
                DC.W $7FFF,$FF80,$01,15 ;    size  8/16 pixels
                DC.W $7FFF,$FFC0,$01,15 ;    size  9/16 pixels
                DC.W $7FFF,$FFE0,$01,15 ;    size 10/16 pixels
                DC.W $7FFF,$FFF0,$01,15 ;    size 11/16 pixels
                DC.W $7FFF,$FFF8,$01,15 ;    size 12/16 pixels
                DC.W $7FFF,$FFFC,$01,15 ;    size 13/16 pixels
                DC.W $7FFF,$FFFE,$01,15 ;    size 14/16 pixels
                DC.W $7FFF,$FFFF,$01,15 ;    size 15/16 pixels

                DC.W $3FFF,$C000,$42,31 ;    size 16/16 pixels ; 2 Shifts
                DC.W $3FFF,$E000,$02,15 ;    size  1/16 pixels
                DC.W $3FFF,$F000,$02,15 ;    size  2/16 pixels
                DC.W $3FFF,$F800,$02,15 ;    size  3/16 pixels
                DC.W $3FFF,$FC00,$02,15 ;    size  4/16 pixels
                DC.W $3FFF,$FE00,$02,15 ;    size  5/16 pixels
                DC.W $3FFF,$FF00,$02,15 ;    size  6/16 pixels
                DC.W $3FFF,$FF80,$02,15 ;    size  7/16 pixels
                DC.W $3FFF,$FFC0,$02,15 ;    size  8/16 pixels
                DC.W $3FFF,$FFE0,$02,15 ;    size  9/16 pixels
                DC.W $3FFF,$FFF0,$02,15 ;    size 10/16 pixels
                DC.W $3FFF,$FFF8,$02,15 ;    size 11/16 pixels
                DC.W $3FFF,$FFFC,$02,15 ;    size 12/16 pixels
                DC.W $3FFF,$FFFE,$02,15 ;    size 13/16 pixels
                DC.W $3FFF,$FFFF,$02,15 ;    size 14/16 pixels
                DC.W $3FFF,$8000,$42,31 ;    size 15/16 pixels

                DC.W $1FFF,$E000,$43,31 ;    size 16/16 pixels ; 3 Shifts
                DC.W $1FFF,$F000,$03,15 ;    size  1/16 pixels
                DC.W $1FFF,$F800,$03,15 ;    size  2/16 pixels
                DC.W $1FFF,$FC00,$03,15 ;    size  3/16 pixels
                DC.W $1FFF,$FE00,$03,15 ;    size  4/16 pixels
                DC.W $1FFF,$FF00,$03,15 ;    size  5/16 pixels
                DC.W $1FFF,$FF80,$03,15 ;    size  6/16 pixels
                DC.W $1FFF,$FFC0,$03,15 ;    size  7/16 pixels
                DC.W $1FFF,$FFE0,$03,15 ;    size  8/16 pixels
                DC.W $1FFF,$FFF0,$03,15 ;    size  9/16 pixels
                DC.W $1FFF,$FFF8,$03,15 ;    size 10/16 pixels
                DC.W $1FFF,$FFFC,$03,15 ;    size 11/16 pixels
                DC.W $1FFF,$FFFE,$03,15 ;    size 12/16 pixels
                DC.W $1FFF,$FFFF,$03,15 ;    size 13/16 pixels
                DC.W $1FFF,$8000,$43,31 ;    size 14/16 pixels
                DC.W $1FFF,$C000,$43,31 ;    size 15/16 pixels

                DC.W $0FFF,$F000,$44,31 ;    size 16/16 pixels ; 4 Shifts
                DC.W $0FFF,$F800,$04,15 ;    size  1/16 pixels
                DC.W $0FFF,$FC00,$04,15 ;    size  2/16 pixels
                DC.W $0FFF,$FE00,$04,15 ;    size  3/16 pixels
                DC.W $0FFF,$FF00,$04,15 ;    size  4/16 pixels
                DC.W $0FFF,$FF80,$04,15 ;    size  5/16 pixels
                DC.W $0FFF,$FFC0,$04,15 ;    size  6/16 pixels
                DC.W $0FFF,$FFE0,$04,15 ;    size  7/16 pixels
                DC.W $0FFF,$FFF0,$04,15 ;    size  8/16 pixels
                DC.W $0FFF,$FFF8,$04,15 ;    size  9/16 pixels
                DC.W $0FFF,$FFFC,$04,15 ;    size 10/16 pixels
                DC.W $0FFF,$FFFE,$04,15 ;    size 11/16 pixels
                DC.W $0FFF,$FFFF,$04,15 ;    size 12/16 pixels
                DC.W $0FFF,$8000,$44,31 ;    size 13/16 pixels
                DC.W $0FFF,$C000,$44,31 ;    size 14/16 pixels
                DC.W $0FFF,$E000,$44,31 ;    size 15/16 pixels

                DC.W $07FF,$F800,$45,31 ;    size 16/16 pixels ; 5 Shifts
                DC.W $07FF,$FC00,$05,15 ;    size  1/16 pixels
                DC.W $07FF,$FE00,$05,15 ;    size  2/16 pixels
                DC.W $07FF,$FF00,$05,15 ;    size  3/16 pixels
                DC.W $07FF,$FF80,$05,15 ;    size  4/16 pixels
                DC.W $07FF,$FFC0,$05,15 ;    size  5/16 pixels
                DC.W $07FF,$FFE0,$05,15 ;    size  6/16 pixels
                DC.W $07FF,$FFF0,$05,15 ;    size  7/16 pixels
                DC.W $07FF,$FFF8,$05,15 ;    size  8/16 pixels
                DC.W $07FF,$FFFC,$05,15 ;    size  9/16 pixels
                DC.W $07FF,$FFFE,$05,15 ;    size 10/16 pixels
                DC.W $07FF,$FFFF,$05,15 ;    size 11/16 pixels
                DC.W $07FF,$8000,$45,31 ;    size 12/16 pixels
                DC.W $07FF,$C000,$45,31 ;    size 13/16 pixels
                DC.W $07FF,$E000,$45,31 ;    size 14/16 pixels
                DC.W $07FF,$F000,$45,31 ;    size 15/16 pixels

                DC.W $03FF,$FC00,$46,31 ;    size 16/16 pixels ; 6 Shifts
                DC.W $03FF,$FE00,$06,15 ;    size  1/16 pixels
                DC.W $03FF,$FF00,$06,15 ;    size  2/16 pixels
                DC.W $03FF,$FF80,$06,15 ;    size  3/16 pixels
                DC.W $03FF,$FFC0,$06,15 ;    size  4/16 pixels
                DC.W $03FF,$FFE0,$06,15 ;    size  5/16 pixels
                DC.W $03FF,$FFF0,$06,15 ;    size  6/16 pixels
                DC.W $03FF,$FFF8,$06,15 ;    size  7/16 pixels
                DC.W $03FF,$FFFC,$06,15 ;    size  8/16 pixels
                DC.W $03FF,$FFFE,$06,15 ;    size  9/16 pixels
                DC.W $03FF,$FFFF,$06,15 ;    size 10/16 pixels
                DC.W $03FF,$8000,$46,31 ;    size 11/16 pixels
                DC.W $03FF,$C000,$46,31 ;    size 12/16 pixels
                DC.W $03FF,$E000,$46,31 ;    size 13/16 pixels
                DC.W $03FF,$F000,$46,31 ;    size 14/16 pixels
                DC.W $03FF,$F800,$46,31 ;    size 15/16 pixels

                DC.W $01FF,$FE00,$47,31 ;    size 16/16 pixels ; 7 Shifts
                DC.W $01FF,$FF00,$07,15 ;    size  1/16 pixels
                DC.W $01FF,$FF80,$07,15 ;    size  2/16 pixels
                DC.W $01FF,$FFC0,$07,15 ;    size  3/16 pixels
                DC.W $01FF,$FFE0,$07,15 ;    size  4/16 pixels
                DC.W $01FF,$FFF0,$07,15 ;    size  5/16 pixels
                DC.W $01FF,$FFF8,$07,15 ;    size  6/16 pixels
                DC.W $01FF,$FFFC,$07,15 ;    size  7/16 pixels
                DC.W $01FF,$FFFE,$07,15 ;    size  8/16 pixels
                DC.W $01FF,$FFFF,$07,15 ;    size  9/16 pixels
                DC.W $01FF,$8000,$47,31 ;    size 10/16 pixels
                DC.W $01FF,$C000,$47,31 ;    size 11/16 pixels
                DC.W $01FF,$E000,$47,31 ;    size 12/16 pixels
                DC.W $01FF,$F000,$47,31 ;    size 13/16 pixels
                DC.W $01FF,$F800,$47,31 ;    size 14/16 pixels
                DC.W $01FF,$FC00,$47,31 ;    size 15/16 pixels

                DC.W $FF,$FF00,$48,31 ;    size 16/16 pixels ; 8 Shifts
                DC.W $FF,$FF80,$08,15 ;    size  1/16 pixels
                DC.W $FF,$FFC0,$08,15 ;    size  2/16 pixels
                DC.W $FF,$FFE0,$08,15 ;    size  3/16 pixels
                DC.W $FF,$FFF0,$08,15 ;    size  4/16 pixels
                DC.W $FF,$FFF8,$08,15 ;    size  5/16 pixels
                DC.W $FF,$FFFC,$08,15 ;    size  6/16 pixels
                DC.W $FF,$FFFE,$08,15 ;    size  7/16 pixels
                DC.W $FF,$FFFF,$08,15 ;    size  8/16 pixels
                DC.W $FF,$8000,$48,31 ;    size  9/16 pixels
                DC.W $FF,$C000,$48,31 ;    size 10/16 pixels
                DC.W $FF,$E000,$48,31 ;    size 11/16 pixels
                DC.W $FF,$F000,$48,31 ;    size 12/16 pixels
                DC.W $FF,$F800,$48,31 ;    size 13/16 pixels
                DC.W $FF,$FC00,$48,31 ;    size 14/16 pixels
                DC.W $FF,$FE00,$48,31 ;    size 15/16 pixels

                DC.W $7F,$FF80,$49,31 ;    size 16/16 pixels ; 9 Shifts
                DC.W $7F,$FFC0,$09,15 ;    size  1/16 pixels
                DC.W $7F,$FFE0,$09,15 ;    size  2/16 pixels
                DC.W $7F,$FFF0,$09,15 ;    size  3/16 pixels
                DC.W $7F,$FFF8,$09,15 ;    size  4/16 pixels
                DC.W $7F,$FFFC,$09,15 ;    size  5/16 pixels
                DC.W $7F,$FFFE,$09,15 ;    size  6/16 pixels
                DC.W $7F,$FFFF,$09,15 ;    size  7/16 pixels
                DC.W $7F,$8000,$49,31 ;    size  8/16 pixels
                DC.W $7F,$C000,$49,31 ;    size  9/16 pixels
                DC.W $7F,$E000,$49,31 ;    size 10/16 pixels
                DC.W $7F,$F000,$49,31 ;    size 11/16 pixels
                DC.W $7F,$F800,$49,31 ;    size 12/16 pixels
                DC.W $7F,$FC00,$49,31 ;    size 13/16 pixels
                DC.W $7F,$FE00,$49,31 ;    size 14/16 pixels
                DC.W $7F,$FF00,$49,31 ;    size 15/16 pixels

                DC.W $3F,$FFC0,$4A,31 ;    size 16/16 pixels ; 10 Shifts
                DC.W $3F,$FFE0,$0A,15 ;    size  1/16 pixels
                DC.W $3F,$FFF0,$0A,15 ;    size  2/16 pixels
                DC.W $3F,$FFF8,$0A,15 ;    size  3/16 pixels
                DC.W $3F,$FFFC,$0A,15 ;    size  4/16 pixels
                DC.W $3F,$FFFE,$0A,15 ;    size  5/16 pixels
                DC.W $3F,$FFFF,$0A,15 ;    size  6/16 pixels
                DC.W $3F,$8000,$4A,31 ;    size  7/16 pixels
                DC.W $3F,$C000,$4A,31 ;    size  8/16 pixels
                DC.W $3F,$E000,$4A,31 ;    size  9/16 pixels
                DC.W $3F,$F000,$4A,31 ;    size 10/16 pixels
                DC.W $3F,$F800,$4A,31 ;    size 11/16 pixels
                DC.W $3F,$FC00,$4A,31 ;    size 12/16 pixels
                DC.W $3F,$FE00,$4A,31 ;    size 13/16 pixels
                DC.W $3F,$FF00,$4A,31 ;    size 14/16 pixels
                DC.W $3F,$FF80,$4A,31 ;    size 15/16 pixels

                DC.W $1F,$FFE0,$4B,31 ;    size 16/16 pixels ; 11 Shifts
                DC.W $1F,$FFF0,$0B,15 ;    size  1/16 pixels
                DC.W $1F,$FFF8,$0B,15 ;    size  2/16 pixels
                DC.W $1F,$FFFC,$0B,15 ;    size  3/16 pixels
                DC.W $1F,$FFFE,$0B,15 ;    size  4/16 pixels
                DC.W $1F,$FFFF,$0B,15 ;    size  5/16 pixels
                DC.W $1F,$8000,$4B,31 ;    size  6/16 pixels
                DC.W $1F,$C000,$4B,31 ;    size  7/16 pixels
                DC.W $1F,$E000,$4B,31 ;    size  8/16 pixels
                DC.W $1F,$F000,$4B,31 ;    size  9/16 pixels
                DC.W $1F,$F800,$4B,31 ;    size 10/16 pixels
                DC.W $1F,$FC00,$4B,31 ;    size 11/16 pixels
                DC.W $1F,$FE00,$4B,31 ;    size 12/16 pixels
                DC.W $1F,$FF00,$4B,31 ;    size 13/16 pixels
                DC.W $1F,$FF80,$4B,31 ;    size 14/16 pixels
                DC.W $1F,$FFC0,$4B,31 ;    size 15/16 pixels

                DC.W $0F,$FFF0,$4C,31 ;    size 16/16 pixels ; 12 Shifts
                DC.W $0F,$FFF8,$0C,15 ;    size  1/16 pixels
                DC.W $0F,$FFFC,$0C,15 ;    size  2/16 pixels
                DC.W $0F,$FFFE,$0C,15 ;    size  3/16 pixels
                DC.W $0F,$FFFF,$0C,15 ;    size  4/16 pixels
                DC.W $0F,$8000,$4C,31 ;    size  5/16 pixels
                DC.W $0F,$C000,$4C,31 ;    size  6/16 pixels
                DC.W $0F,$E000,$4C,31 ;    size  7/16 pixels
                DC.W $0F,$F000,$4C,31 ;    size  8/16 pixels
                DC.W $0F,$F800,$4C,31 ;    size  9/16 pixels
                DC.W $0F,$FC00,$4C,31 ;    size 10/16 pixels
                DC.W $0F,$FE00,$4C,31 ;    size 11/16 pixels
                DC.W $0F,$FF00,$4C,31 ;    size 12/16 pixels
                DC.W $0F,$FF80,$4C,31 ;    size 13/16 pixels
                DC.W $0F,$FFC0,$4C,31 ;    size 14/16 pixels
                DC.W $0F,$FFE0,$4C,31 ;    size 15/16 pixels

                DC.W $07,$FFF8,$4D,31 ;    size 16/16 pixels ; 13 Shifts
                DC.W $07,$FFFC,$0D,15 ;    size  1/16 pixels
                DC.W $07,$FFFE,$0D,15 ;    size  2/16 pixels
                DC.W $07,$FFFF,$0D,15 ;    size  3/16 pixels
                DC.W $07,$8000,$4D,31 ;    size  4/16 pixels
                DC.W $07,$C000,$4D,31 ;    size  5/16 pixels
                DC.W $07,$E000,$4D,31 ;    size  6/16 pixels
                DC.W $07,$F000,$4D,31 ;    size  7/16 pixels
                DC.W $07,$F800,$4D,31 ;    size  8/16 pixels
                DC.W $07,$FC00,$4D,31 ;    size  9/16 pixels
                DC.W $07,$FE00,$4D,31 ;    size 10/16 pixels
                DC.W $07,$FF00,$4D,31 ;    size 11/16 pixels
                DC.W $07,$FF80,$4D,31 ;    size 12/16 pixels
                DC.W $07,$FFC0,$4D,31 ;    size 13/16 pixels
                DC.W $07,$FFE0,$4D,31 ;    size 14/16 pixels
                DC.W $07,$FFF0,$4D,31 ;    size 15/16 pixels

                DC.W $03,$FFFC,$4E,31 ;    size 16/16 pixels ; 14 Shifts
                DC.W $03,$FFFE,$0E,15 ;    size  1/16 pixels
                DC.W $03,$FFFF,$0E,15 ;    size  2/16 pixels
                DC.W $03,$8000,$4E,31 ;    size  3/16 pixels
                DC.W $03,$C000,$4E,31 ;    size  4/16 pixels
                DC.W $03,$E000,$4E,31 ;    size  5/16 pixels
                DC.W $03,$F000,$4E,31 ;    size  6/16 pixels
                DC.W $03,$F800,$4E,31 ;    size  7/16 pixels
                DC.W $03,$FC00,$4E,31 ;    size  8/16 pixels
                DC.W $03,$FE00,$4E,31 ;    size  9/16 pixels
                DC.W $03,$FF00,$4E,31 ;    size 10/16 pixels
                DC.W $03,$FF80,$4E,31 ;    size 11/16 pixels
                DC.W $03,$FFC0,$4E,31 ;    size 12/16 pixels
                DC.W $03,$FFE0,$4E,31 ;    size 13/16 pixels
                DC.W $03,$FFF0,$4E,31 ;    size 14/16 pixels
                DC.W $03,$FFF8,$4E,31 ;    size 15/16 pixels

                DC.W $01,$FFFE,$4F,31 ;    size 16/16 pixels ; 15 Shifts
                DC.W $01,$FFFF,$0F,15 ;    size  1/16 pixels
                DC.W $01,$8000,$4F,31 ;    size  2/16 pixels
                DC.W $01,$C000,$4F,31 ;    size  3/16 pixels
                DC.W $01,$E000,$4F,31 ;    size  4/16 pixels
                DC.W $01,$F000,$4F,31 ;    size  5/16 pixels
                DC.W $01,$F800,$4F,31 ;    size  6/16 pixels
                DC.W $01,$FC00,$4F,31 ;    size  7/16 pixels
                DC.W $01,$FE00,$4F,31 ;    size  8/16 pixels
                DC.W $01,$FF00,$4F,31 ;    size  9/16 pixels
                DC.W $01,$FF80,$4F,31 ;    size 10/16 pixels
                DC.W $01,$FFC0,$4F,31 ;    size 11/16 pixels
                DC.W $01,$FFE0,$4F,31 ;    size 12/16 pixels
                DC.W $01,$FFF0,$4F,31 ;    size 13/16 pixels
                DC.W $01,$FFF8,$4F,31 ;    size 14/16 pixels
                DC.W $01,$FFFC,$4F,31 ;    size 15/16 pixels

bl_shifttabs:
;
; This table is meant to be used with an unsigned word offset:
;   High nibble 0..15 = Number of shifts   AND 15
;   Low nibble  0..15 = Size of the sprite AND 15 (a size of 16 uses slot 0)
;     premultiplied by 8 because each line consists of 8 bytes
;
; Word 0: ENDMASK 1, depends on both the number of shifts and the size
; Word 1: ENDMASK 3, zero unless shifts+size exceed 16 pixels
; Word 2: SKEW,      number of shifts, plus $40 (NFSR) when shifts+size
;                    exceed 16 pixels
; Word 3: Overflow,  depends on size+shifts, rounded upwards to a multiple
;                    of 16, minus 1 (15 or 31).
;                    If desired, it can help to calculate the increment from
;                    line to line before dividing by 16 pixels per BLiT,
;                    depending on NFSR and FXSR as stored in word 2
;                    (byte offset 4)
;
                DC.W $FFFF,$00,$00,15 ;    size 16/16 pixels ; 0 Shifts
                DC.W $8000,$00,$00,15 ;    size  1/16 pixels
                DC.W $C000,$00,$00,15 ;    size  2/16 pixels
                DC.W $E000,$00,$00,15 ;    size  3/16 pixels
                DC.W $F000,$00,$00,15 ;    size  4/16 pixels
                DC.W $F800,$00,$00,15 ;    size  5/16 pixels
                DC.W $FC00,$00,$00,15 ;    size  6/16 pixels
                DC.W $FE00,$00,$00,15 ;    size  7/16 pixels
                DC.W $FF00,$00,$00,15 ;    size  8/16 pixels
                DC.W $FF80,$00,$00,15 ;    size  9/16 pixels
                DC.W $FFC0,$00,$00,15 ;    size 10/16 pixels
                DC.W $FFE0,$00,$00,15 ;    size 11/16 pixels
                DC.W $FFF0,$00,$00,15 ;    size 12/16 pixels
                DC.W $FFF8,$00,$00,15 ;    size 13/16 pixels
                DC.W $FFFC,$00,$00,15 ;    size 14/16 pixels
                DC.W $FFFE,$00,$00,15 ;    size 15/16 pixels

                DC.W $7FFF,$8000,$41,31 ;    size 16/16 pixels ; 1 Shift
                DC.W $4000,$00,$01,15 ;    size  1/16 pixels
                DC.W $6000,$00,$01,15 ;    size  2/16 pixels
                DC.W $7000,$00,$01,15 ;    size  3/16 pixels
                DC.W $7800,$00,$01,15 ;    size  4/16 pixels
                DC.W $7C00,$00,$01,15 ;    size  5/16 pixels
                DC.W $7E00,$00,$01,15 ;    size  6/16 pixels
                DC.W $7F00,$00,$01,15 ;    size  7/16 pixels
                DC.W $7F80,$00,$01,15 ;    size  8/16 pixels
                DC.W $7FC0,$00,$01,15 ;    size  9/16 pixels
                DC.W $7FE0,$00,$01,15 ;    size 10/16 pixels
                DC.W $7FF0,$00,$01,15 ;    size 11/16 pixels
                DC.W $7FF8,$00,$01,15 ;    size 12/16 pixels
                DC.W $7FFC,$00,$01,15 ;    size 13/16 pixels
                DC.W $7FFE,$00,$01,15 ;    size 14/16 pixels
                DC.W $7FFF,$00,$01,15 ;    size 15/16 pixels

                DC.W $3FFF,$C000,$42,31 ; size 16/16 pixels ; 2 Shifts
                DC.W $2000,$00,$02,15 ;    size  1/16 pixels
                DC.W $3000,$00,$02,15 ;    size  2/16 pixels
                DC.W $3800,$00,$02,15 ;    size  3/16 pixels
                DC.W $3C00,$00,$02,15 ;    size  4/16 pixels
                DC.W $3E00,$00,$02,15 ;    size  5/16 pixels
                DC.W $3F00,$00,$02,15 ;    size  6/16 pixels
                DC.W $3F80,$00,$02,15 ;    size  7/16 pixels
                DC.W $3FC0,$00,$02,15 ;    size  8/16 pixels
                DC.W $3FE0,$00,$02,15 ;    size  9/16 pixels
                DC.W $3FF0,$00,$02,15 ;    size 10/16 pixels
                DC.W $3FF8,$00,$02,15 ;    size 11/16 pixels
                DC.W $3FFC,$00,$02,15 ;    size 12/16 pixels
                DC.W $3FFE,$00,$02,15 ;    size 13/16 pixels
                DC.W $3FFF,$00,$02,15 ;    size 14/16 pixels
                DC.W $3FFF,$8000,$42,31 ; size 15/16 pixels

                DC.W $1FFF,$E000,$43,31 ; size 16/16 pixels ; 3 Shifts
                DC.W $1000,$00,$03,15 ;    size  1/16 pixels
                DC.W $1800,$00,$03,15 ;    size  2/16 pixels
                DC.W $1C00,$00,$03,15 ;    size  3/16 pixels
                DC.W $1E00,$00,$03,15 ;    size  4/16 pixels
                DC.W $1F00,$00,$03,15 ;    size  5/16 pixels
                DC.W $1F80,$00,$03,15 ;    size  6/16 pixels
                DC.W $1FC0,$00,$03,15 ;    size  7/16 pixels
                DC.W $1FE0,$00,$03,15 ;    size  8/16 pixels
                DC.W $1FF0,$00,$03,15 ;    size  9/16 pixels
                DC.W $1FF8,$00,$03,15 ;    size 10/16 pixels
                DC.W $1FFC,$00,$03,15 ;    size 11/16 pixels
                DC.W $1FFE,$00,$03,15 ;    size 12/16 pixels
                DC.W $1FFF,$00,$03,15 ;    size 13/16 pixels
                DC.W $1FFF,$8000,$43,31 ; size 14/16 pixels
                DC.W $1FFF,$C000,$43,31 ; size 15/16 pixels

                DC.W $0FFF,$F000,$44,31 ; size 16/16 pixels ; 4 Shifts
                DC.W $0800,$00,$04,15 ;    size  1/16 pixels
                DC.W $0C00,$00,$04,15 ;    size  2/16 pixels
                DC.W $0E00,$00,$04,15 ;    size  3/16 pixels
                DC.W $0F00,$00,$04,15 ;    size  4/16 pixels
                DC.W $0F80,$00,$04,15 ;    size  5/16 pixels
                DC.W $0FC0,$00,$04,15 ;    size  6/16 pixels
                DC.W $0FE0,$00,$04,15 ;    size  7/16 pixels
                DC.W $0FF0,$00,$04,15 ;    size  8/16 pixels
                DC.W $0FF8,$00,$04,15 ;    size  9/16 pixels
                DC.W $0FFC,$00,$04,15 ;    size 10/16 pixels
                DC.W $0FFE,$00,$04,15 ;    size 11/16 pixels
                DC.W $0FFF,$00,$04,15 ;    size 12/16 pixels
                DC.W $0FFF,$8000,$44,31 ; size 13/16 pixels
                DC.W $0FFF,$C000,$44,31 ; size 14/16 pixels
                DC.W $0FFF,$E000,$44,31 ; size 15/16 pixels

                DC.W $07FF,$F800,$45,31 ; size 16/16 pixels ; 5 Shifts
                DC.W $0400,$00,$05,15 ;    size  1/16 pixels
                DC.W $0600,$00,$05,15 ;    size  2/16 pixels
                DC.W $0700,$00,$05,15 ;    size  3/16 pixels
                DC.W $0780,$00,$05,15 ;    size  4/16 pixels
                DC.W $07C0,$00,$05,15 ;    size  5/16 pixels
                DC.W $07E0,$00,$05,15 ;    size  6/16 pixels
                DC.W $07F0,$00,$05,15 ;    size  7/16 pixels
                DC.W $07F8,$00,$05,15 ;    size  8/16 pixels
                DC.W $07FC,$00,$05,15 ;    size  9/16 pixels
                DC.W $07FE,$00,$05,15 ;    size 10/16 pixels
                DC.W $07FF,$00,$05,15 ;    size 11/16 pixels
                DC.W $07FF,$8000,$45,31 ; size 12/16 pixels
                DC.W $07FF,$C000,$45,31 ; size 13/16 pixels
                DC.W $07FF,$E000,$45,31 ; size 14/16 pixels
                DC.W $07FF,$F000,$45,31 ; size 15/16 pixels

                DC.W $03FF,$FC00,$46,31 ; size 16/16 pixels ; 6 Shifts
                DC.W $0200,$00,$06,15 ;    size  1/16 pixels
                DC.W $0300,$00,$06,15 ;    size  2/16 pixels
                DC.W $0380,$00,$06,15 ;    size  3/16 pixels
                DC.W $03C0,$00,$06,15 ;    size  4/16 pixels
                DC.W $03E0,$00,$06,15 ;    size  5/16 pixels
                DC.W $03F0,$00,$06,15 ;    size  6/16 pixels
                DC.W $03F8,$00,$06,15 ;    size  7/16 pixels
                DC.W $03FC,$00,$06,15 ;    size  8/16 pixels
                DC.W $03FE,$00,$06,15 ;    size  9/16 pixels
                DC.W $03FF,$00,$06,15 ;    size 10/16 pixels
                DC.W $03FF,$8000,$46,31 ; size 11/16 pixels
                DC.W $03FF,$C000,$46,31 ; size 12/16 pixels
                DC.W $03FF,$E000,$46,31 ; size 13/16 pixels
                DC.W $03FF,$F000,$46,31 ; size 14/16 pixels
                DC.W $03FF,$F800,$46,31 ; size 15/16 pixels

                DC.W $01FF,$FE00,$47,31 ; size 16/16 pixels ; 7 Shifts
                DC.W $0100,$00,$07,15 ;    size  1/16 pixels
                DC.W $0180,$00,$07,15 ;    size  2/16 pixels
                DC.W $01C0,$00,$07,15 ;    size  3/16 pixels
                DC.W $01E0,$00,$07,15 ;    size  4/16 pixels
                DC.W $01F0,$00,$07,15 ;    size  5/16 pixels
                DC.W $01F8,$00,$07,15 ;    size  6/16 pixels
                DC.W $01FC,$00,$07,15 ;    size  7/16 pixels
                DC.W $01FE,$00,$07,15 ;    size  8/16 pixels
                DC.W $01FF,$00,$07,15 ;    size  9/16 pixels
                DC.W $01FF,$8000,$47,31 ; size 10/16 pixels
                DC.W $01FF,$C000,$47,31 ; size 11/16 pixels
                DC.W $01FF,$E000,$47,31 ; size 12/16 pixels
                DC.W $01FF,$F000,$47,31 ; size 13/16 pixels
                DC.W $01FF,$F800,$47,31 ; size 14/16 pixels
                DC.W $01FF,$FC00,$47,31 ; size 15/16 pixels

                DC.W $FF,$FF00,$48,31 ; size 16/16 pixels ; 8 Shifts
                DC.W $80,$00,$08,15 ;    size  1/16 pixels
                DC.W $C0,$00,$08,15 ;    size  2/16 pixels
                DC.W $E0,$00,$08,15 ;    size  3/16 pixels
                DC.W $F0,$00,$08,15 ;    size  4/16 pixels
                DC.W $F8,$00,$08,15 ;    size  5/16 pixels
                DC.W $FC,$00,$08,15 ;    size  6/16 pixels
                DC.W $FE,$00,$08,15 ;    size  7/16 pixels
                DC.W $FF,$00,$08,15 ;    size  8/16 pixels
                DC.W $FF,$8000,$48,31 ; size  9/16 pixels
                DC.W $FF,$C000,$48,31 ; size 10/16 pixels
                DC.W $FF,$E000,$48,31 ; size 11/16 pixels
                DC.W $FF,$F000,$48,31 ; size 12/16 pixels
                DC.W $FF,$F800,$48,31 ; size 13/16 pixels
                DC.W $FF,$FC00,$48,31 ; size 14/16 pixels
                DC.W $FF,$FE00,$48,31 ; size 15/16 pixels

                DC.W $7F,$FF80,$49,31 ; size 16/16 pixels ; 9 Shifts
                DC.W $40,$00,$09,15 ;    size  1/16 pixels
                DC.W $60,$00,$09,15 ;    size  2/16 pixels
                DC.W $70,$00,$09,15 ;    size  3/16 pixels
                DC.W $78,$00,$09,15 ;    size  4/16 pixels
                DC.W $7C,$00,$09,15 ;    size  5/16 pixels
                DC.W $7E,$00,$09,15 ;    size  6/16 pixels
                DC.W $7F,$00,$09,15 ;    size  7/16 pixels
                DC.W $7F,$8000,$49,31 ; size  8/16 pixels
                DC.W $7F,$C000,$49,31 ; size  9/16 pixels
                DC.W $7F,$E000,$49,31 ; size 10/16 pixels
                DC.W $7F,$F000,$49,31 ; size 11/16 pixels
                DC.W $7F,$F800,$49,31 ; size 12/16 pixels
                DC.W $7F,$FC00,$49,31 ; size 13/16 pixels
                DC.W $7F,$FE00,$49,31 ; size 14/16 pixels
                DC.W $7F,$FF00,$49,31 ; size 15/16 pixels

                DC.W $3F,$FFC0,$4A,31 ; size 16/16 pixels ; 10 Shifts
                DC.W $20,$00,$0A,15 ;    size  1/16 pixels
                DC.W $30,$00,$0A,15 ;    size  2/16 pixels
                DC.W $38,$00,$0A,15 ;    size  3/16 pixels
                DC.W $3C,$00,$0A,15 ;    size  4/16 pixels
                DC.W $3E,$00,$0A,15 ;    size  5/16 pixels
                DC.W $3F,$00,$0A,15 ;    size  6/16 pixels
                DC.W $3F,$8000,$4A,31 ; size  7/16 pixels
                DC.W $3F,$C000,$4A,31 ; size  8/16 pixels
                DC.W $3F,$E000,$4A,31 ; size  9/16 pixels
                DC.W $3F,$F000,$4A,31 ; size 10/16 pixels
                DC.W $3F,$F800,$4A,31 ; size 11/16 pixels
                DC.W $3F,$FC00,$4A,31 ; size 12/16 pixels
                DC.W $3F,$FE00,$4A,31 ; size 13/16 pixels
                DC.W $3F,$FF00,$4A,31 ; size 14/16 pixels
                DC.W $3F,$FF80,$4A,31 ; size 15/16 pixels

                DC.W $1F,$FFE0,$4B,31 ; size 16/16 pixels ; 11 Shifts
                DC.W $10,$00,$0B,15 ;    size  1/16 pixels
                DC.W $18,$00,$0B,15 ;    size  2/16 pixels
                DC.W $1C,$00,$0B,15 ;    size  3/16 pixels
                DC.W $1E,$00,$0B,15 ;    size  4/16 pixels
                DC.W $1F,$00,$0B,15 ;    size  5/16 pixels
                DC.W $1F,$8000,$4B,31 ; size  6/16 pixels
                DC.W $1F,$C000,$4B,31 ; size  7/16 pixels
                DC.W $1F,$E000,$4B,31 ; size  8/16 pixels
                DC.W $1F,$F000,$4B,31 ; size  9/16 pixels
                DC.W $1F,$F800,$4B,31 ; size 10/16 pixels
                DC.W $1F,$FC00,$4B,31 ; size 11/16 pixels
                DC.W $1F,$FE00,$4B,31 ; size 12/16 pixels
                DC.W $1F,$FF00,$4B,31 ; size 13/16 pixels
                DC.W $1F,$FF80,$4B,31 ; size 14/16 pixels
                DC.W $1F,$FFC0,$4B,31 ; size 15/16 pixels

                DC.W $0F,$FFF0,$4C,31 ; size 16/16 pixels ; 12 Shifts
                DC.W $08,$00,$0C,15 ;    size  1/16 pixels
                DC.W $0C,$00,$0C,15 ;    size  2/16 pixels
                DC.W $0E,$00,$0C,15 ;    size  3/16 pixels
                DC.W $0F,$00,$0C,15 ;    size  4/16 pixels
                DC.W $0F,$8000,$4C,31 ; size  5/16 pixels
                DC.W $0F,$C000,$4C,31 ; size  6/16 pixels
                DC.W $0F,$E000,$4C,31 ; size  7/16 pixels
                DC.W $0F,$F000,$4C,31 ; size  8/16 pixels
                DC.W $0F,$F800,$4C,31 ; size  9/16 pixels
                DC.W $0F,$FC00,$4C,31 ; size 10/16 pixels
                DC.W $0F,$FE00,$4C,31 ; size 11/16 pixels
                DC.W $0F,$FF00,$4C,31 ; size 12/16 pixels
                DC.W $0F,$FF80,$4C,31 ; size 13/16 pixels
                DC.W $0F,$FFC0,$4C,31 ; size 14/16 pixels
                DC.W $0F,$FFE0,$4C,31 ; size 15/16 pixels

                DC.W $07,$FFF8,$4D,31 ; size 16/16 pixels ; 13 Shifts
                DC.W $04,$00,$0D,15 ;    size  1/16 pixels
                DC.W $06,$00,$0D,15 ;    size  2/16 pixels
                DC.W $07,$00,$0D,15 ;    size  3/16 pixels
                DC.W $07,$8000,$4D,31 ; size  4/16 pixels
                DC.W $07,$C000,$4D,31 ; size  5/16 pixels
                DC.W $07,$E000,$4D,31 ; size  6/16 pixels
                DC.W $07,$F000,$4D,31 ; size  7/16 pixels
                DC.W $07,$F800,$4D,31 ; size  8/16 pixels
                DC.W $07,$FC00,$4D,31 ; size  9/16 pixels
                DC.W $07,$FE00,$4D,31 ; size 10/16 pixels
                DC.W $07,$FF00,$4D,31 ; size 11/16 pixels
                DC.W $07,$FF80,$4D,31 ; size 12/16 pixels
                DC.W $07,$FFC0,$4D,31 ; size 13/16 pixels
                DC.W $07,$FFE0,$4D,31 ; size 14/16 pixels
                DC.W $07,$FFF0,$4D,31 ; size 15/16 pixels

                DC.W $03,$FFFC,$4E,31 ; size 16/16 pixels ; 14 Shifts
                DC.W $02,$00,$0E,15 ;    size  1/16 pixels
                DC.W $03,$00,$0E,15 ;    size  2/16 pixels
                DC.W $03,$8000,$4E,31 ; size  3/16 pixels
                DC.W $03,$C000,$4E,31 ; size  4/16 pixels
                DC.W $03,$E000,$4E,31 ; size  5/16 pixels
                DC.W $03,$F000,$4E,31 ; size  6/16 pixels
                DC.W $03,$F800,$4E,31 ; size  7/16 pixels
                DC.W $03,$FC00,$4E,31 ; size  8/16 pixels
                DC.W $03,$FE00,$4E,31 ; size  9/16 pixels
                DC.W $03,$FF00,$4E,31 ; size 10/16 pixels
                DC.W $03,$FF80,$4E,31 ; size 11/16 pixels
                DC.W $03,$FFC0,$4E,31 ; size 12/16 pixels
                DC.W $03,$FFE0,$4E,31 ; size 13/16 pixels
                DC.W $03,$FFF0,$4E,31 ; size 14/16 pixels
                DC.W $03,$FFF8,$4E,31 ; size 15/16 pixels

                DC.W $01,$FFFE,$4F,31 ; size 16/16 pixels ; 15 Shifts
                DC.W $01,$00,$0F,15 ;    size  1/16 pixels
                DC.W $01,$8000,$4F,31 ; size  2/16 pixels
                DC.W $01,$C000,$4F,31 ; size  3/16 pixels
                DC.W $01,$E000,$4F,31 ; size  4/16 pixels
                DC.W $01,$F000,$4F,31 ; size  5/16 pixels
                DC.W $01,$F800,$4F,31 ; size  6/16 pixels
                DC.W $01,$FC00,$4F,31 ; size  7/16 pixels
                DC.W $01,$FE00,$4F,31 ; size  8/16 pixels
                DC.W $01,$FF00,$4F,31 ; size  9/16 pixels
                DC.W $01,$FF80,$4F,31 ; size 10/16 pixels
                DC.W $01,$FFC0,$4F,31 ; size 11/16 pixels
                DC.W $01,$FFE0,$4F,31 ; size 12/16 pixels
                DC.W $01,$FFF0,$4F,31 ; size 13/16 pixels
                DC.W $01,$FFF8,$4F,31 ; size 14/16 pixels
                DC.W $01,$FFFC,$4F,31 ; size 15/16 pixels
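
    The rows of bl_shifttabs follow a regular pattern, so they can be
    regenerated or cross-checked on a host machine. The C sketch below is
    not part of the original listing: the names ShiftRow, row_offset and
    make_row are made up for illustration, and the formula is inferred
    from the table values above, not taken from the author's own tools.
    It assumes a sprite of 1..16 pixels that is shifted right by 0..15
    pixels.

```c
#include <stdint.h>

/* One 8-byte line of bl_shifttabs (hypothetical host-side mirror). */
typedef struct {
    uint16_t endmask1;  /* word 0 */
    uint16_t endmask3;  /* word 1 */
    uint16_t skew;      /* word 2: shift count, plus $40 on spill */
    uint16_t overflow;  /* word 3: 15 or 31 */
} ShiftRow;

/* Byte offset into the table: high nibble = shifts, low nibble = size,
   multiplied by 8 because each line is 8 bytes long. */
static unsigned row_offset(unsigned shift, unsigned size)
{
    return (((shift & 15u) << 4) | (size & 15u)) * 8u;
}

/* Rebuild one table line from the shift count and the sprite size. */
static ShiftRow make_row(unsigned shift, unsigned size)
{
    ShiftRow row;
    unsigned k = shift & 15u;
    unsigned s = ((size - 1u) & 15u) + 1u;  /* 1..16, slot 0 means 16 */

    /* s pixels set at the left edge of a 32-bit window, then shifted. */
    uint32_t mask = (uint32_t)((0xFFFFu << (16u - s)) & 0xFFFFu) << 16;
    mask >>= k;

    int spill = (k + s) > 16u;              /* sprite reaches a 2nd word */

    row.endmask1 = (uint16_t)(mask >> 16);
    row.endmask3 = (uint16_t)(mask & 0xFFFFu);
    row.skew     = (uint16_t)(k | (spill ? 0x40u : 0u));
    row.overflow = (uint16_t)(spill ? 31u : 15u);
    return row;
}
```

    For example, make_row(1, 16) gives $7FFF,$8000,$41,31, which is the
    first line of the "1 Shift" group, and row_offset(1, 16) gives 128,
    the byte offset of that line from bl_shifttabs.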