Asm improvements for .64 project
--------------------------------
Asm improvements may speed up development of .64 related projects. The improvements are not
backward compatible with previous assemblers.

Main goals:

- speed up writing code
- allow debugging of particular code
- no 'stdlib' should be required when using special directives
- do not eat memory
- keep speed close to hand-written code

Intro
-----
We can divide the new functions into the following groups:

- function calls
- code flow control
- expression processing
- debug specials
- module generation/relocating

Some of these functions require "settings" (e.g. a memory block for the stack). All of these
functions need to be initialized by a special directive, which may take some parameters.
Special directives will start, as usual, with a dot "."

Because we don't force the use of any form of 'stdlib', some functions may require portions of
code to be included at some 'place'. For example: asm may generate a "module". For compiling
and running the module, you don't need any extra code. But for loading it, you may need some.
In this case, you can include the proper directive, which will 'insert' that code (which can,
for example, load/relocate the given module).

"Cost" is what CPU/memory we will eat. It is written in the form cycles[memory], i.e. CPU cycles
followed by bytes in brackets. I ignore page boundary penalties etc.

Special directives
------------------

    .execregs mem,mem,mem...

        allocates registers (zero page only) for temporary use during code execution
        if this directive is not used, some functions may not be available

        generated code has to respect that these registers may be damaged by some
        commands, calls, etc. So, they should be used only as temporary registers within
        a single block of instructions


Function calls
--------------
Function calls are here to allow "C" style function calls :) Their syntax is the following:

    fn_name (params)

where () is required, even if no parameter is passed. The "jsr" is omitted. The () are there for
syntax reasons.

A function is defined by the directives

name    .proc specifications

            and

        .endproc

Every function can take arguments. Arguments may be optional. A function will be able to 'test' how
many arguments were passed. Arguments may be 8 or 16 bits wide. The function decides where to 'put'
the arguments. Arguments may be stored in registers or in a memory location. If a function uses a
variable parameter count, then the A register is _always_ used for the 'parameter' count. A function
may preserve register values, so the caller can be sure the registers are not changed.

Specifications is a string, which consists of one or more specs. Specs are separated by spaces.

    preserve:regs       - function will preserve the given registers; regs is a combination of a/x/y
                          if ':' is not used, all registers are preserved
    8:dst:name          - 8 bit parameter declaration. dst can be
                            a x y   - register name
                            stack   - CPU stack used for passing (caller "pha")
                            mem     - memory reference (label/value)
                          name defines the parameter name and is not required (only for mem/stack)
    16:dst:name         - 16 bit parameter declaration. dst can be
                            a x y   - combination of 2 registers, first takes HI, second LOW (e.g. ax)
                            stack   - CPU stack used for passing (caller "pha")
                            mem     - memory reference (label/value) - 2 bytes are taken, HI/LO
                            mem,mem - HI/LO memory reference
                          name defines the parameter name and is not required (only for mem/stack)
    ?:dst:name          - all following parameters are optional
                          all following optional parameters need to have the SAME width (8/16)
                          dst is an 8 bit destination, where to store the number of parameters passed
    selfmodify          - function is self-modifying and may change registers -> tells asm not to
                          compute which registers are really used, and to assume ALL are used
    recursive           - function can be called recursively ('mem' is refused for value passing)
                          this keyword suppresses debug code generation (recursive call exception)
    savestack *         - try to use as little stack as possible
    rom *               - do not use any self-modifying code, we run from 'rom' (eprom, ...)
                          this requires the special directive ".execregs" - if not used -> error
    cleanstack *        - procedure is responsible for cleaning the stack (if not given, the caller cleans it)
    keepstack *         - opposite of cleanstack
    exported            - if function belongs to a module, it is exported (by its name)
    imported            - if function belongs to a module, it is imported


Flags marked with * can be inherited from module.
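
A minimal sketch (in Python, as it could look inside the assembler) of splitting such a
specification string into its parts. The names ProcSpec, Param and parse_proc_spec are
hypothetical and not part of the design; 'dst' values are kept as raw strings and not validated.

```python
from dataclasses import dataclass, field

# Keyword specs listed in the table above.
FLAGS = {"selfmodify", "recursive", "savestack", "rom",
         "cleanstack", "keepstack", "exported", "imported"}

@dataclass
class Param:
    width: int              # 8 or 16
    dst: str                # e.g. "a", "ax", "stack", a label, or "hi,lo"
    name: str = ""          # optional parameter name
    optional: bool = False  # True for parameters after a '?' spec

@dataclass
class ProcSpec:
    preserve: str = ""      # registers to preserve, e.g. "xy"
    params: list = field(default_factory=list)
    count_dst: str = ""     # where the number of passed parameters is stored
    flags: set = field(default_factory=set)

def parse_proc_spec(text):
    """Parse a space separated .proc specification string (hypothetical helper)."""
    spec = ProcSpec()
    optional = False
    for token in text.split():
        head, _, rest = token.partition(":")
        if head in FLAGS:
            spec.flags.add(head)
        elif head == "preserve":
            spec.preserve = rest or "axy"   # no ':' means all registers
        elif head == "?":
            optional = True                 # everything after this is optional
            spec.count_dst = rest.split(":")[0]
        elif head in ("8", "16"):
            dst, _, name = rest.partition(":")
            spec.params.append(Param(int(head), dst, name, optional))
        else:
            raise ValueError(f"unknown spec: {token}")
    return spec
```

For example, parse_proc_spec("preserve:xy 8:a 16:stack:ptr cleanstack") yields two parameters,
the preserve set "xy" and the flag 'cleanstack'.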

Once a procedure is defined, it is called "C" style. So, we can do:

    fn (19,label)

The function itself references a parameter by register name if it is stored in a register. When the
function overwrites the register value, the parameter is lost.

When a parameter is stored in memory, the function is responsible for allocating the proper memory
place with a .byte or .word directive, or for setting up proper self-modifying code. So, referencing
is also clear.

If a parameter is stored on the stack, the function HAS TO load the stack pointer into the 'x' (or 'y')
register (tsx) and then it can use

    tsx / tsx txa tay

    lda name,x
    lda name,y
    ldy name,x
    ldx name,y

name is defined as "$0100 + offset to parameter". In the 16 bit case, you reference the LO byte, so HI
is usually name+1, as you would expect.

When a parameter is stored on the stack, the function can reference it throughout its whole body.
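
A small Python sketch of how asm could compute these 'name' values. It assumes that x holds the
stack pointer (tsx), that the return address occupies $0101,x and $0102,x, and that the prolog has
pushed nothing else, so the first parameter byte sits at $0103,x. The function name and the
first_offset parameter are hypothetical.

```python
def stack_param_labels(params, first_offset=3):
    """Return {name: address} for stack parameters.

    params is a list of (name, width_in_bytes), lowest stack address first.
    first_offset grows if the prolog pushes saved registers (assumption).
    """
    labels = {}
    offset = first_offset
    for name, width in params:
        labels[name] = 0x0100 + offset   # 16 bit: this is the LO byte, HI is +1
        offset += width
    return labels
```

With an 8 bit 'count' followed by a 16 bit 'ptr', 'lda count,x' would then read $0103,x and
'lda ptr,x' / 'lda ptr+1,x' would read $0104,x / $0105,x.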

Function calls implementation
-----------------------------
Asm walks through the function and collects a list of all used registers. If the function calls
another function, asm descends into it. If the function calls indirectly or calls a plain label, asm
assumes ALL registers are used. If the selfmodify spec is used, asm also assumes all registers are used.
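
The walk described above can be sketched in Python like this. It is only an illustration: the
WRITES table is deliberately tiny and incomplete, "call" is a hypothetical pseudo-mnemonic standing
for a function call, and used_registers is an invented name.

```python
ALL = frozenset("axy")

# Registers written by a mnemonic; a small, incomplete table for illustration.
WRITES = {
    "lda": "a", "pla": "a", "txa": "a", "tya": "a", "adc": "a",
    "ldx": "x", "tax": "x", "tsx": "x", "inx": "x", "dex": "x",
    "ldy": "y", "tay": "y", "iny": "y", "dey": "y",
}

def used_registers(name, procs, selfmodify=frozenset(), _seen=None):
    """Return the registers touched by proc 'name' and everything it calls.

    procs maps a proc name to a list of (mnemonic, operand) pairs.
    """
    seen = set() if _seen is None else _seen
    if name in selfmodify:
        return ALL                        # selfmodify: assume everything is used
    if name in seen:
        return frozenset()                # already visited (recursion or shared callee)
    seen.add(name)
    used = set()
    for mnemonic, operand in procs[name]:
        if mnemonic == "call":
            if operand not in procs:      # indirect call or plain label
                return ALL
            used |= used_registers(operand, procs, selfmodify, seen)
        else:
            used |= set(WRITES.get(mnemonic, ""))
    return frozenset(used)
```

The result for the top-level function is the union over everything reachable from it, which is
what the prolog/epilog generator needs.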

If the function preserves registers which are used, the following code is placed in the prolog/epilog
(for each relevant register). If the preserved registers are not touched, no code is generated.

1. stack

        pha/pla
        tya pha/pla tay
        txa pha/pla tax

        cost 7[2] for A, or 11[4] for X or Y (saved through A)


2. stack + keep A register untouched

        sta execreg
        tya pha/pla tay
        txa pha/pla tax
        lda execreg

        cost like (1) + 6[4]

3. self-modifying

        sta mem/lda #$09
        stx mem/ldx #$00
        sty mem/ldy #$00

        costs 6[5] per register

1,2 are required if 'rom' is used
3   is used when we can self-modify and we need to keep A

We can't simply use some fixed memory location, because when one proc calls another, that memory
can be damaged. Also, code flow directives may cause problems.

There is always exactly one exit point. You can't use the "rts" instruction inside a function. You
always need to reach the '.endproc' directive. At this place, asm also generates the epilog. You may
use the "rts" instruction only as the _last_ one before '.endproc'.

Prolog/epilog can also contain 'debug' extensions. See debug for more details.

After the prolog, the function body is executed. Parameters are stored in the proper place, as defined
in the .proc directive. You just reference mem (which could be self-modifying code) or a register.

When we use the "stack" for passing parameters AND a variable parameter count is also used (for
optional params), the prolog contains one 'sta $xxxx' which stores the parameter count. It may also
shift the count left ('asl') if 16 bit parameters are used. It costs 4[3] or 8[5]. If we can't use
self-modifying code, one byte in zero page is used (see the .execregs directive). The epilog needs to
be a bit more complicated when the stack is used for parameters and the function is responsible for
cleaning the stack (see 'cleanstack').

Stack bytes when we call a function with ($88,$99) as params and the return address is $1234:

Stack

    $xx  <---- stack pointer
    $33  (return address - 1 is stored, LO byte)
    $12  (HI byte)
    $88
    $99

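Epilog - no optional params

    tsx
    txa
    tay

    inx     [generated, repeated for parameter bytes count]
    inx

    lda $0102,y  ; move the return address up over the parameters, HI byte first,
    sta $0102,x  ; so a one-byte move does not overwrite a byte before it is read
    lda $0101,y
    sta $0101,x

    txs          ; release the stack only after the copy, so an interrupt can't damage it

    ; <- preserve regs epilog code

    rts

    Cost: 30[18]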
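
To check that this epilog really leaves a usable return address, here is a small Python model of
the stack page. It is a sketch under assumptions: 'mem' models only $0100-$01ff, the helper names
are invented, and the stack is assumed not to sit at the very top of the page. The model copies
the HI byte first, which keeps the move safe even when only one parameter byte is removed.

```python
def callee_epilog(mem, s, nparams):
    """Model of the 'cleanstack' epilog.

    mem is a bytearray of the 256 stack-page bytes, s the stack pointer,
    nparams the number of parameter bytes to drop. Returns the new stack pointer.
    """
    y = s                        # tsx / txa / tay
    x = s + nparams              # inx repeated nparams times
    mem[x + 2] = mem[y + 2]      # lda $0102,y / sta $0102,x  (HI byte first)
    mem[x + 1] = mem[y + 1]      # lda $0101,y / sta $0101,x
    return x                     # txs

def rts(mem, s):
    """Model of rts: pull LO, then HI, and add 1. Returns (pc, new stack pointer)."""
    lo = mem[s + 1]
    hi = mem[s + 2]
    return ((hi << 8 | lo) + 1) & 0xFFFF, s + 2
```

Using the stack picture from above ($33 $12 $88 $99 with the stack pointer just below), the model
returns to $1234 and leaves the stack pointer where it was before the parameters were pushed.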


Epilog - optional params, or parameters of 4 bytes or more

    tsx
    txa
    tay
    clc
    adc #$04        [generated: parameters length in bytes]

    adc #$00        [self-modifying code, present only when stack is used for optional params]
    adc $execreg    [used instead of the line above in the version without self-modifying code]

    tax

    lda $0102,y     ; move the return address up over the parameters
    sta $0102,x
    lda $0101,y
    sta $0101,x

    txs             ; release the stack only after the copy, so an interrupt can't damage it

    ; <- preserve regs epilog code

    rts

    Cost: 32[20], with optionals 34[22] / 35[22]
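
The cost figures of both epilogs can be re-derived with a few lines of Python. The table holds the
standard 6502 cycle and byte counts for the addressing modes used here (page crossing ignored, as
everywhere in this text); the final rts is not counted, which is what matches the figures above.
The names COST, cost, SHORT and LONG are made up for this sketch.

```python
# (cycles, bytes) per instruction form; page-cross penalties ignored.
COST = {
    "tsx": (2, 1), "txs": (2, 1), "txa": (2, 1), "tax": (2, 1),
    "tay": (2, 1), "inx": (2, 1), "clc": (2, 1),
    "adc #": (2, 2), "adc zp": (3, 2),
    "lda abs,y": (4, 3), "sta abs,x": (5, 3),
}

def cost(seq):
    """Return the cost of an instruction sequence in the cycles[memory] notation."""
    cycles = sum(COST[op][0] for op in seq)
    size = sum(COST[op][1] for op in seq)
    return f"{cycles}[{size}]"

# Moving the two return address bytes.
COPY = ["lda abs,y", "sta abs,x"] * 2

# Epilog without optional params, two parameter bytes.
SHORT = ["tsx", "txa", "tay", "inx", "inx", "txs"] + COPY

# Epilog with a computed parameter length.
LONG = ["tsx", "txa", "tay", "clc", "adc #", "tax", "txs"] + COPY

print(cost(SHORT))               # short epilog
print(cost(LONG))                # long epilog
print(cost(LONG + ["adc #"]))    # optionals, self-modifying version
print(cost(LONG + ["adc zp"]))   # optionals, .execregs version
```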


Call epilog @ caller - caller is responsible for clearing the stack (shorter params, up to ~6 bytes)

    pla  [repeated for every parameter byte]
    pla
    ...

Call epilog @ caller - caller is responsible for clearing the stack (longer params)

    tsx
    txa
    clc
    adc #0   [generated: parameters length in bytes]
    tax
    txs

When the stack is used for parameters, recursive calls are possible. Just keep in mind, we have a 256 byte
stack :) This epilog handles the stack properly.


Who clears the stack? For functions with 1-3 parameter bytes per call, the caller seems to be the better
solution. On the other hand, if the function is called 50x in the program, it may cost hundreds of bytes.

So:

- when SPEED is important, the CALLER should clean the stack (or simply do not use stack params)
- when MEMORY is important and the function is called from many places, the FUNCTION should clean the stack
- system libraries should always use 'cleanstack' (disc, UI, ...); we do not care about 30 cycles here,
  but we do care that there will be 500 calls -> we may waste up to about a kB of memory, which hurts :)
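
The memory side of this trade-off is simple arithmetic. A Python sketch, using the byte sizes of
the epilogs shown earlier (rts not counted) and looking at code size only, not cycles; the
function names are invented for illustration.

```python
def caller_cleanup_bytes(param_bytes):
    """Bytes added at EVERY call site when the caller cleans the stack."""
    pla_form = param_bytes       # one 1-byte pla per parameter byte
    add_form = 7                 # tsx txa clc adc #n tax txs
    return min(pla_form, add_form)

def callee_cleanup_bytes(param_bytes, optional=False):
    """Bytes added ONCE, inside the function, when 'cleanstack' is used."""
    if optional:
        return 22                # long epilog plus one extra adc
    if param_bytes >= 4:
        return 20                # long epilog
    return 16 + param_bytes      # short epilog: one inx per parameter byte

def total_cleanup_bytes(calls, param_bytes):
    """Return (caller cleans, function cleans) total code size in bytes."""
    return calls * caller_cleanup_bytes(param_bytes), callee_cleanup_bytes(param_bytes)
```

For 500 calls with two parameter bytes this gives 1000 bytes when the caller cleans the stack and
18 bytes when the function does it, which is the "about a kB" mentioned above.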

Modules
-------
A module is defined by ".module". Everything that follows until ".endmodule" belongs to the module.
A module is a standalone binary block, which can be relocated.

    .module alias

      .moduleprop name value

    .endmodule

moduleprop sets a module property/behaviour. Available properties:

    name        module name
    copyright   usual thing :)
    address     module memory address - default: relocatable
    link        dynamic / fixed / table
    type        inline / loadable

When a module is located at a fixed address, a 'table' of functions is placed first (together with
the $6c jmp instruction - so the table can be used directly)

A module can export only functions. If some part is going to use a module, it doesn't need its sources;
it is enough to get the 'library' file. The module library file is generated automatically. A module
library looks like:

    .module XXX imported ... (other module declaration, required to proper usage)

    .proc name params imported
    .endproc

    .endmodule

The library file is always required. Referencing the 'module sources' will not work. Asm would not
have any clue whether the module should be assembled or not, etc.

When some code wants to use a module function, it simply uses "module.name" as its name. Like:

    tfx3.about()

Modules can use 3 types of linking.

1. direct address
2. indirect through table
3. dynamic calls

[1] generates

    jsr $xxxx

[2] generates

    jsr function_table[index]

    function_table is created during module load

[3] generates

    $fc $mmff  (3 bytes) where $mm is the module index (within the module) and $ff is the function index
    while loading, this is changed into a direct memory address. The control byte does not
    necessarily have to be $fc, but something....
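
What the loader does with [3] can be sketched in Python. Several things here are assumptions, not
part of the design: the byte order (control byte, module index, function index), the list of call
offsets (the loader needs some way to find the call sites, e.g. a table in the module header) and
all names used.

```python
JSR = 0x20   # 6502 opcode for 'jsr absolute'

def resolve_dynamic_calls(code, call_offsets, modules, control=0xFC):
    """Patch every 3-byte dynamic call in 'code' (a bytearray) in place.

    call_offsets lists where dynamic calls start (hypothetical relocation list).
    modules[m][f] is the load address of function f of imported module m.
    """
    for off in call_offsets:
        if code[off] != control:
            raise ValueError(f"no dynamic call at offset {off}")
        m, f = code[off + 1], code[off + 2]
        addr = modules[m][f]
        code[off] = JSR
        code[off + 1] = addr & 0xFF    # LO byte first, as the CPU expects
        code[off + 2] = addr >> 8
    return code
```

After patching, a dynamic call costs exactly the same as a direct 'jsr $xxxx'; the price is paid
once, at load time.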


When a module is assembled, it can be 'inline' or 'loadable'. If it is inline, it is like normal
asm. Asm generates a header. The header stores the name etc. and the dependent modules. When
this module is loaded, that header is required. So, assembling a module is like normal assembling,
it just generates more than 'simple' instructions. A module can be loaded "inline" into another block
of code, if that is of any use.

When a module is inline, all its functions can be used by any normal asm. A loadable module can be
referenced only from another module, unless its address is fixed.