
In-depth look at ports

Attentive readers of the introductory section on local memory may have noticed that, out of the five registers that constitute a port, only four were described. Now is the time to dive deeper into port operation and explain the roles of registers REP0 and REP1.

Inside a port

The ports are intended to be an efficient and flexible means of accessing VASYL's local memory. As such, they have been equipped with extra hardware to accelerate some of the most frequent operations. We have already seen memory pointer post-increment/decrement in action, but there is more.

Each of the two ports has its own DMA channel to the local memory and can make one access, either a read or a write, per system clock cycle. This gives a theoretical transfer rate of ~1 MB/s per port. The 6510 is too slow to saturate even one such channel, because in the best case it can write to the I/O region only once every four cycles (e.g. using a sequence of STA instructions). However, since display lists can also write to port registers, they can use ports to manipulate local memory at great speed. And since display lists are themselves located in that memory, this opens the possibility for display lists to manipulate both themselves and the data they use.

Before we get to that, let's have a closer look at registers REP0 and REP1.

REP0 and REP1

As you remember from the introductory chapter, writing a value to register PORT0 will transfer that value to a destination in local memory pointed to by ADR0L and ADR0H. For instance, this code

        LDA #<target
        STA VREG_ADR0L
        LDA #>target
        STA VREG_ADR0H
        LDA #$A0
        STA VREG_PORT0

will store value $a0 at location target in the current memory bank. Following the store, ADR0 will be increased by STEP0's contents (or decreased, depending on its sign - it's a value between -128 and +127).

        LDA #<target
        STA VREG_ADR0L
        LDA #>target
        STA VREG_ADR0H
        LDA #1
        STA VREG_STEP0
        LDA #$A0
        STA VREG_PORT0
        STA VREG_PORT0
        STA VREG_PORT0
        STA VREG_PORT0

will thus result in $a0 being stored in locations target, target+1, target+2, and target+3. Now, rather than repeating ourselves, we can use register REP0: a value stored there repeats the last write to PORT0 that many times, i.e.

        LDA #<target
        STA VREG_ADR0L
        LDA #>target
        STA VREG_ADR0H
        LDA #1
        STA VREG_STEP0
        LDA #$A0
        STA VREG_PORT0
        LDA #3
        STA VREG_REP0

will have exactly the same result as the previous example, because the initial write of $A0 to PORT0 will be repeated three times. What differs is speed: the three extra writes will be executed in the next three cycles, thus using the full DMA channel throughput of ~1 MB/s. Let's see how we could use it to rapidly clear an 8 KiB screen, and learn a few more things in the process.

        LDA #<screen
        STA VREG_ADR0L
        LDA #>screen
        STA VREG_ADR0H
        LDA #1
        STA VREG_STEP0
        LDX #8192 / 256  ; We will be clearing a page at a time.
        LDA #$00
        STA VREG_PORT0   ; First byte here...
        LDY #255         ; ...then remaining 255 bytes of the first page...
loop:
        STY VREG_REP0    ; ...and 256 on the following ones.
waitrep:
        LDY VREG_REP0    ; Is auto-repetition still ongoing?
        BNE waitrep      ; If so, let's wait for it to end.
        
        DEX
        BNE loop

A few things to note:

We want the routine to be accurate, so after clearing the very first byte with a write to PORT0, we only need to clear the remaining 255 bytes of the first page. That's why we write 255 to REP0 on the first pass through the loop, but 0 on subsequent ones (the waitrep loop exits only once Y has been loaded with 0).
A value of 0 written to REP0 means 256, i.e. “repeat the last action 256 times”.
As the sequence of writes progresses, the CPU is free to do other things. However, if it wants to kick off another repetition, it first needs to wait for the current one to finish. Since REP0 contains the number of bytes that remain to be written, all that needs to be done is to read REP0 repeatedly and check whether it has reached zero.
Each port has its dedicated DMA channel. This means that it does not interfere with other memory operations, display list execution, bitmap sequencer fetches, etc.
PORT0 and PORT1 are also fully independent, so each can execute auto-repetition at full speed. You could thus be clearing the screen using both of them simultaneously, doubling the performance to ~2 MB/s.
If both ports make memory access in the same cycle, PORT0 does it first.
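
As a minimal sketch of the last two points, the routine below clears two consecutive 256-byte pages at once, one per port. It assumes that both pages starting at screen may be overwritten, that neither port is busy, and that copy mode is off. The labels wait0 and wait1 are made up for this illustration.

        LDA #<screen
        STA VREG_ADR0L
        LDA #>screen
        STA VREG_ADR0H
        LDA #<(screen + 256)
        STA VREG_ADR1L
        LDA #>(screen + 256)
        STA VREG_ADR1H
        LDA #1
        STA VREG_STEP0
        STA VREG_STEP1
        LDA #$00
        STA VREG_PORT0   ; First byte of the first page...
        STA VREG_PORT1   ; ...and first byte of the second page.
        LDA #255
        STA VREG_REP0    ; Remaining 255 bytes of the first page.
        STA VREG_REP1    ; Remaining 255 bytes of the second page, running concurrently.
wait0:
        LDA VREG_REP0    ; Wait for PORT0 to finish...
        BNE wait0
wait1:
        LDA VREG_REP1    ; ...and then for PORT1.
        BNE wait1

Since the two repetitions are started only a few cycles apart, the second wait loop should fall through almost immediately.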

Copying data in local RAM

In the introductory chapter we also explained how to read from the location pointed to by a port: you just need to set the CTRL_PORT_READ_ENABLE bit in the CONTROL register and then read from PORT0 (or PORT1). This can obviously be combined with writing and used to copy data around. The following routine copies 256 bytes from location source to location target:

        LDA VREG_CONTROL
        ORA #CTRL_PORT_READ_ENABLE
        STA VREG_CONTROL

        LDA #<source
        STA VREG_ADR0L
        LDA #>source
        STA VREG_ADR0H
        LDA #<target
        STA VREG_ADR1L
        LDA #>target
        STA VREG_ADR1H
        LDA #1
        STA VREG_STEP0
        STA VREG_STEP1
        
        LDX #0
loop:
        LDA VREG_PORT0
        STA VREG_PORT1
        DEX
        BNE loop

While this approach is faster than copying data in C64 base memory (you can use the simplest addressing modes and don't have to worry about updating the source and destination addresses), it is still far below the theoretical throughput of the ports: even with unrolled loops the best we can do is around 8 cycles per byte (<125 KB/s).
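
To see where the figure of 8 cycles per byte comes from, here is a sketch of an unrolled copy, assuming the same port setup as in the routine above. Both instructions use absolute addressing, which takes 4 cycles each on the 6510.

        LDA VREG_PORT0   ; 4 cycles: read a byte via PORT0, ADR0 advances by STEP0
        STA VREG_PORT1   ; 4 cycles: write it via PORT1, ADR1 advances by STEP1
        LDA VREG_PORT0   ; every further pair costs another 8 cycles
        STA VREG_PORT1
        LDA VREG_PORT0
        STA VREG_PORT1   ; repeat the pair once per byte to be copied

Unrolling removes the DEX/BNE overhead of the loop, but the two I/O accesses per byte remain.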

As you might have already suspected, we can sidestep the CPU entirely and copy data directly using the ports' DMA channels. This mode is activated by setting CTRL_PORT_MODE_COPY in the CONTROL register. Here is the fast way to copy 256 bytes:

        LDA VREG_CONTROL
        ORA #CTRL_PORT_MODE_COPY
        STA VREG_CONTROL

        LDA #<source
        STA VREG_ADR0L
        LDA #>source
        STA VREG_ADR0H
        LDA #<target
        STA VREG_ADR1L
        LDA #>target
        STA VREG_ADR1H
        LDA #1
        STA VREG_STEP0
        STA VREG_STEP1
        
        LDA #0           ; As previously, "0" means "256".
        STA VREG_REP1    ; Write to REP1 kicks off hardware copy.

The last instruction of the above routine launches a memory transfer that will take exactly 256 cycles to complete, copying at ~1 MB/s. As previously, the 6510 is free to do other things while the data is being copied, but if it wants to modify any of the port registers again, it should wait for the transfer to complete:

waitcopy:
        LDA VREG_REP1    ; Is the transfer still ongoing?
        BNE waitcopy     ; If so, let's wait for it to end.

Once you're done copying and want to use ports for other purposes, remember to turn off copy mode in the CONTROL register:

        LDA VREG_CONTROL
        AND #~CTRL_PORT_MODE_MASK
        STA VREG_CONTROL

Final two comments:

Because of how DMA channels are allocated within a cycle, accelerated copying is only possible from PORT0 to PORT1, not the other way around.
STEP0 and STEP1 can naturally be different from each other, enabling you to reorganize your data as you copy it: change order, insert gaps, etc.
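
For instance, the second point could be used to spread 128 source bytes over every other byte of a 256-byte destination. This is only a sketch: it assumes that copy mode has already been enabled as shown earlier, and the label waitgap is invented for the example.

        LDA #<source
        STA VREG_ADR0L
        LDA #>source
        STA VREG_ADR0H
        LDA #<target
        STA VREG_ADR1L
        LDA #>target
        STA VREG_ADR1H
        LDA #1
        STA VREG_STEP0   ; Read consecutive source bytes...
        LDA #2
        STA VREG_STEP1   ; ...but write to every other destination byte.
        LDA #128
        STA VREG_REP1    ; Copy 128 bytes.
waitgap:
        LDA VREG_REP1    ; Wait for the transfer to end.
        BNE waitgap

The bytes at target+1, target+3 and so on are left untouched.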

For a comprehensive example demonstrating the use of all these features, please see demo_hirestext.s.

Accessing ports from a display list

Since display lists are free to write to VASYL registers, it should come as no surprise that they can also use the ports. All the operations described above can be performed using display list instructions, although because VASYL is not a general-purpose CPU, there are some important differences.

Let's first try to set up a simple memory clearing operation using PORT1:

    MOV   VREG_ADR1L, <buffer
    MOV   VREG_ADR1H, >buffer
    MOV   VREG_STEP1, 1
    MOV   VREG_PORT1, 0
    MOV   VREG_REP1, 49

This code will start clearing a total of 50 bytes starting from address buffer in the local memory. “Start” is an important word here because, as with the 6510, execution of the display list continues while the operation initiated by the write to VREG_REP1 progresses. This is a desirable feature, but what if we want to perform another operation using the same port? VASYL cannot read individual registers, so we cannot loop waiting for VREG_REP1 to reach zero.

Fortunately, there is an instruction specifically for this situation: WAITREP. It takes one argument, a value of 0 or 1 selecting the port, and all it does is pause display list execution until the operation controlled by that port completes. So if we wanted to clear 500 bytes of the buffer, we could do this:

    MOV   VREG_ADR1L, <buffer
    MOV   VREG_ADR1H, >buffer
    MOV   VREG_STEP1, 1
    MOV   VREG_PORT1, 0      ; clear the first byte
    MOV   VREG_REP1, 0       ; clear 256 more bytes
    WAITREP 1                ; wait for the operation started above to finish
    MOV   VREG_REP1, 243     ; now clear remaining 500 - 256 - 1 = 243 bytes...
    WAITREP 1                ; ...and wait until it completes
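
The step does not have to be 1 in a display list either. As a sketch, the fragment below writes $a0 to one byte in each of 25 rows of a buffer that is 40 bytes wide. The label column and the 40-byte row width are assumptions made for this example.

    MOV   VREG_ADR0L, <column
    MOV   VREG_ADR0H, >column
    MOV   VREG_STEP0, 40     ; advance one 40-byte row after each write
    MOV   VREG_PORT0, $a0    ; write the byte in the first row
    MOV   VREG_REP0, 24      ; repeat the write for the remaining 24 rows
    WAITREP 0                ; wait until port 0 is done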

How would we go about copying a buffer of 200 bytes and reversing it while doing so?

    MOV   VREG_ADR0L, <src
    MOV   VREG_ADR0H, >src
    MOV   VREG_ADR1L, <(dst + 199) ; point to the end of the destination
    MOV   VREG_ADR1H, >(dst + 199)
    MOV   VREG_STEP0, 1            ; walk from "src" up
    MOV   VREG_STEP1, -1           ; walk from end of "dst" down
    MOV   VREG_CONTROL, (1 << CONTROL_DLIST_ON_BIT) | CONTROL_PORT_MODE_COPY
    MOV   VREG_REP1, 200           ; start copying...
    WAITREP 1                      ; wait for it to end

Note that we don't have the comfort of the 6510's ORA instruction, so we cannot be selective about which bits we change in the CONTROL register. We can be pretty sure CONTROL_DLIST_ON_BIT should also be set, because the display list wouldn't be executing without it, but you may also need to set other bits, depending on what you are doing.
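
The same limitation applies when switching copy mode back off. A minimal sketch, assuming that the display list needs no CONTROL bits other than CONTROL_DLIST_ON_BIT:

    WAITREP 1                                        ; make sure the copy has finished
    MOV   VREG_CONTROL, (1 << CONTROL_DLIST_ON_BIT)  ; port mode bits cleared, display list still on

Any other bits that were set in CONTROL beforehand are cleared by this write, so add them to the expression as needed.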

One more VASYL instruction useful when working with ports is XFER. It reads from the indicated port and stores the value just read into a VIC-II or VASYL register. So

    XFER  $d020, (1)

will read a value from PORT1, and store it in VIC border color register. Here is a more complete example using PORT0.

    MOV   VREG_ADR0L, <colors
    MOV   VREG_ADR0H, >colors
    MOV   VREG_STEP0, 1
    SETA  25
    WAIT  51, 0
loop:
    XFER  $d020, (0)
    DELAYV 8
    DECA
    BRA   loop
    END
colors:
    .byte 2,3,4,5,9,8,7,6,5,15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,2,0

This display list uses values read from the colors array to change the border color every 8 lines.

Finally, since you can also use XFER to write to PORT registers, it is yet another way to transfer data using display lists. Please see demo_selfmod.s for an in-depth example.
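
As a rough illustration of that last remark, and assuming XFER accepts a port register as its destination in the same way as any other VASYL register, the fragment below would move two bytes from src to dst without using copy mode. The labels src and dst are placeholders.

    MOV   VREG_ADR0L, <src
    MOV   VREG_ADR0H, >src
    MOV   VREG_ADR1L, <dst
    MOV   VREG_ADR1H, >dst
    MOV   VREG_STEP0, 1
    MOV   VREG_STEP1, 1
    XFER  VREG_PORT1, (0)    ; read a byte via PORT0 and write it via PORT1
    XFER  VREG_PORT1, (0)    ; same again for the next byte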
