With video timing out of the way it is time to really start to learn about and implement the logic for the Acorn Electron.
From the schematic at the back of the Advanced User Guide (AUG), a few logical groups stand out as candidates to implement first.
The plan was to first implement the RAM, ROM, CPU, clock generation and the minimum amount of supporting ULA logic. After testing that, I would move on to video modes, keyboard, variable CPU clocking, cassette interface and anything else I’d overlooked.
I’ve shaded the main logical groupings on the schematic other than the central ULA.
The RAM in the top left has a dedicated address/data bus tied to the ULA.
The ROM and CPU on the other hand both share a common address and data bus with not just the ULA but also any additional hardware present on the expansion bus.
There’s a lot of logic internal to the ULA that I will be putting off implementing for now. That includes keyboard handling, cassette interfacing, video output and sound. Additional glue logic for these can be seen to the right of the ULA.
As the Replay board has dedicated hardware for sound and video, this glue logic can be ignored. Likewise, the cassette logic will be replaced with a Replay interface that allows loading of data from the on-board SD card instead. However, I intend for the ULA to optionally expose the same cassette interface. That would enable others to create an expansion board for the Replay, populated with the glue logic (highlighted in yellow), so that a physical cassette recorder and tapes can be used with the core.
The memory layout of the Electron is a combination of RAM, ROM and memory mapped I/O as noted on p183-200 of the AUG [1].
    -- Memory Layout (AUG p183-200)
    -- 0000-7FFF RAM    - Shared between system/user and video
    -- 8000-BFFF ROM    - Paged (initially basic)
    -- C000-FBFF ROM    - OS
    -- FC00-FCFF Fred   - Memory Mapped I/O (Expansions)
    -- FD00-FDFF Jim    - Memory Mapped I/O
    -- FE00-FEFF Sheila - Memory Mapped I/O (ULA)
    -- FF00-FFFF ROM    - OS
The ROM on the Electron houses the OS and BASIC language in a Hitachi HN613256, although the schematics suggest a Toshiba chip may have also been used.
There’s little to mention about the ROM itself. It shares the address and data bus with the ULA and CPU, and the ULA enables the ROM any time it detects the CPU trying to access it, i.e. bit 15 of the address is low for RAM and high for ROM (unless the address falls in the 0xFC, 0xFD or 0xFE memory mapped pages).
The Electron also supports replacement of the BASIC ROM by “paging” it out. The ULA only enables the main ROM for addresses in the range 0x8000-0xBFFF if the BASIC ROM is also “paged in”, i.e. selected. Otherwise it leaves the ROM inactive, allowing other hardware on the expansion bus to claim the address and provide data.
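The decode rules above can be sketched as a small behavioural model in Python. This is only an illustration and not the core's VHDL; the names `Region`, `decode` and `main_rom_enabled` are made up for the example, and the paging state is reduced to a single hypothetical `basic_paged_in` flag.

```python
from enum import Enum


class Region(Enum):
    RAM = "RAM"
    PAGED_ROM = "paged ROM (0x8000-0xBFFF)"
    OS_ROM = "OS ROM"
    FRED = "Fred"
    JIM = "Jim"
    SHEILA = "Sheila (ULA registers)"


def decode(addr: int) -> Region:
    """Classify a 16-bit CPU address using the memory layout in the text."""
    if not 0 <= addr <= 0xFFFF:
        raise ValueError("address must fit in 16 bits")
    if addr < 0x8000:  # bit 15 low means RAM
        return Region.RAM
    page = addr >> 8
    if page == 0xFC:
        return Region.FRED
    if page == 0xFD:
        return Region.JIM
    if page == 0xFE:
        return Region.SHEILA
    if addr < 0xC000:
        return Region.PAGED_ROM
    return Region.OS_ROM  # 0xC000-0xFBFF and 0xFF00-0xFFFF


def main_rom_enabled(addr: int, basic_paged_in: bool) -> bool:
    """True when the ULA would enable the on-board ROM (assumed simplification)."""
    region = decode(addr)
    if region is Region.OS_ROM:
        return True
    if region is Region.PAGED_ROM:
        return basic_paged_in  # otherwise left to the expansion bus
    return False
```

For example, `main_rom_enabled(0x8000, basic_paged_in=False)` is `False`, leaving the address free for expansion hardware to claim.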
Keyboard querying is a little odd at first. It’s done by paging in a pseudo keyboard “ROM” and placing suitable address sequences on the address bus. The ULA will then return the state of the four keyboard pins via the data bus.
As for RAM, it’s currently implemented using 65536 4-bit words of BRAM. This could be moved over to use the Replay’s DRAM, although doing so may be a little tricky due to clock timing (see CPU section).
The configuration is a little weird as I wanted to retain the ULA’s pinout as much as possible, which on the Electron comprises:
- RAM0, RAM1, RAM2, RAM3 - 4 bit data bus
- /WE - active low write enable
- /CAS - active low column address strobe
- /RAS - active low row address strobe
The four bit data bus means that in order for the ULA to read or write a byte (8 bits) it needs to do two complete read/write cycles: the first for the even bits of the byte and the second for the odd bits. Each cycle is further split, with the row latched first and then the column (and the data in the case of a write).
In all it will take roughly 8 clock cycles @16MHz to read or write a single byte of RAM.
In order to do this I implemented a rough version of the Row/Col addressing scheme the original RAM and ULA would use. I say rough as the exact timing isn’t matched, only the sequence of latching a row, then col across multiple clock ticks.
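To make the two-cycle nibble access concrete, here is a rough Python model. It is a sketch under assumptions, not the core's implementation: the even/odd split comes from the text, but the bit order within each nibble, the way a nibble address is formed (`byte_addr * 2 + half`) and the 8-bit row / 8-bit column split are all hypothetical choices made for the example.

```python
from typing import Tuple

NIBBLE_CELLS = 65536  # 64K 4-bit words hold 32 KB of bytes


def split_byte(value: int) -> Tuple[int, int]:
    """Return (even_bits, odd_bits) of a byte, each packed into 4 bits."""
    even = odd = 0
    for i in range(4):
        even |= ((value >> (2 * i)) & 1) << i
        odd |= ((value >> (2 * i + 1)) & 1) << i
    return even, odd


def join_nibbles(even: int, odd: int) -> int:
    """Inverse of split_byte."""
    value = 0
    for i in range(4):
        value |= ((even >> i) & 1) << (2 * i)
        value |= ((odd >> i) & 1) << (2 * i + 1)
    return value


class NibbleRam:
    """4-bit wide RAM addressed by row then column (hypothetical layout)."""

    def __init__(self) -> None:
        self.cells = [0] * NIBBLE_CELLS

    @staticmethod
    def row_col(byte_addr: int, half: int) -> Tuple[int, int]:
        nibble_addr = ((byte_addr & 0x7FFF) << 1) | half
        return nibble_addr >> 8, nibble_addr & 0xFF  # row first, then column

    def _index(self, byte_addr: int, half: int) -> int:
        row, col = self.row_col(byte_addr, half)
        return (row << 8) | col

    def write_byte(self, byte_addr: int, value: int) -> None:
        even, odd = split_byte(value & 0xFF)
        self.cells[self._index(byte_addr, 0)] = even  # first cycle
        self.cells[self._index(byte_addr, 1)] = odd   # second cycle

    def read_byte(self, byte_addr: int) -> int:
        even = self.cells[self._index(byte_addr, 0)]
        odd = self.cells[self._index(byte_addr, 1)]
        return join_nibbles(even, odd)
```

Each `read_byte` or `write_byte` call stands for the two 4-clock nibble cycles described above, which is where the figure of roughly 8 clocks per byte comes from.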
Whilst most FPGAs no longer include tri-state buffer support internally, the synthesis software will emulate it using logic instead. For this reason, I decided to implement the addr/data bus as bi-directional with each module using “Z” to “tri-state” as needed.
For example, the read register section that places data onto the CPU data bus:
    b_pd <= (others => 'Z') when i_n_w = '0' or i_addr(15 downto 8) /= x"FE" else
            '0' & isr_status when i_addr( 3 downto 0) = x"0" else
            cassette_data_shift when i_addr( 3 downto 0) = x"4" else
            (others => 'Z');
The ULA tri-states the bus if the CPU is writing, or is reading from any address other than the ULA registers (the read data output for those cases is handled elsewhere). Data is output to b_pd for a read of register 0 or 4; for any other register address the bus is tri-stated.
Elsewhere in the ULA there are several other outputs to b_pd which again are either tri-state or providing data, for example in the case of a RAM read/write.
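The idea of several drivers sharing b_pd can be modelled in a few lines of Python. This is a behavioural sketch only: `None` stands in for 'Z', `isr_status` is assumed to be 7 bits wide (since the VHDL prepends a '0' to it), and the `resolve` helper is a hypothetical stand-in that treats any two simultaneous drivers as an error, which is stricter than the real std_logic resolution rules.

```python
from typing import Iterable, Optional

Z = None  # high impedance: this driver is not driving the bus


def ula_register_read(n_w: int, addr: int, isr_status: int,
                      cassette_data_shift: int) -> Optional[int]:
    """Mirror of the b_pd assignment: drive only on reads of 0xFEx0 / 0xFEx4."""
    if n_w == 0 or (addr >> 8) != 0xFE:
        return Z  # CPU is writing, or the address is not a ULA register
    reg = addr & 0xF
    if reg == 0x0:
        return isr_status & 0x7F  # '0' & isr_status: top bit forced low
    if reg == 0x4:
        return cassette_data_shift & 0xFF
    return Z


def resolve(drivers: Iterable[Optional[int]]) -> Optional[int]:
    """Combine all drivers of the bus; at most one may be active."""
    active = [d for d in drivers if d is not Z]
    if not active:
        return Z
    if len(active) > 1:
        raise RuntimeError("bus contention: more than one driver active")
    return active[0]
```

With one driver returning `0xC8` and the others returning `Z`, `resolve` yields `0xC8`; the contention check is the kind of mistake this bus style makes easy to introduce.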
Whether that decision will come back to haunt me, time will tell.
The CPU is a 6502 for which a VHDL module already exists (thankfully). The Electron clocked it at 1MHz or 2MHz and in some instances stopped clocking it for a time.
The reason for this is the need to share RAM access with the ULA for graphics.
The Electron has 7 display modes (0-6)
    -- Modes:
    -- 0 - 640x256 two colour gfx, 80x32 text (20K)
    -- 1 - 320x256 four colour gfx, 40x32 text (20K)
    -- 2 - 160x256 sixteen colour gfx, 20x32 text (20K)
    -- 3 - 80x25 two colour text gfx
    -- 4 - 320x256 two colour gfx, 40x32 text (10K)
    -- 5 - 160x256 four colour gfx, 20x32 text (10K)
    -- 6 - 40x25 two colour text (8K)
The different modes not only place differing demands on RAM usage but also leave more or less CPU processing time. The reason is the combination of RAM access times and the required video pixel rate.
As summarised by Neville Maude in the October 1983 edition of Practical Computing:
In Modes 0, 1 and 2 the RAM access of the video part of the ULA is interleaved between the 6502A access. For 40μs out of 64 the processor is out of action, In Mode 3 the processor is running full speed on alternate lines. In Modes 4, 5, and 6 [the CPU] runs at 1MHz all the time it accesses Ram.
The “alternate lines” claim for mode 3 appears questionable. Mode 3 is 640x200 resolution with 1bpp and 2 blank lines after every 8 lines. This would use up the full bandwidth of the ULA/RAM to maintain. I can only think that the full speed comment was referring to the two blank lines between each set of 8 active vertical lines during which time the ULA does not need to access RAM leaving the CPU free to do so.
With a 51.95µs active display period, the ULA needs to output a pixel every 62.5ns, although only 40µs of the active display area (640 pixels) is actually used; the rest is just a black border region. The ULA itself is clocked at 16MHz, i.e. one clock every 62.5ns.
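The figures above follow from simple arithmetic, checked here in Python. The constant names are just labels for this example; the values are the ones quoted in the text.

```python
ULA_CLOCK_HZ = 16_000_000   # ULA / pixel clock
ACTIVE_LINE_NS = 51_950     # 51.95 us active display period
PIXELS_PER_LINE = 640       # widest mode

clock_period_ns = 1e9 / ULA_CLOCK_HZ             # one pixel per ULA clock
used_ns = PIXELS_PER_LINE * clock_period_ns      # time spent on real pixels
border_ns = ACTIVE_LINE_NS - used_ns             # left over for the border

print(clock_period_ns, used_ns / 1000, border_ns / 1000)
```

So 640 pixels at 62.5ns each account for 40µs, leaving 11.95µs of the active period for the border.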
From that it’s clear to see that in order to output video data in a 640 mode the ULA needs to have a pixel ready every single clock and it would output 8 pixels every 8 clocks.
As discussed in the RAM section, it takes two 4 cycle reads or 8 clocks @16MHz to obtain one byte of data, or 8 bits.
The only way the ULA could keep up with 1 pixel per clock when it’s able to read 1 byte every 8 clocks is if each bit represented a single pixel. This is why mode 0 with a 640 horizontal resolution is limited to two colours or 1 bit per pixel (bpp).
It follows that a 320 pixel resolution places 1/2 the demand on RAM access, with a single byte read every other 8 cycles leaving 8 spare cycles, or the option to read two bytes to increase the number of colours (bpp).
So why is there a 320 pixel mode with 1bpp if there’s room for 2bpp? Those 8 spare cycles are important. During that time the ULA is able to let the CPU take a turn accessing RAM. This allows the CPU to be clocked at 1MHz and read from RAM whilst the ULA is actively outputting video, but only during the modes that have a spare 8 cycles, e.g. modes 4, 5 and 6.
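The same reasoning can be applied to every mode with a short Python calculation. This is a back-of-the-envelope sketch: the widths and bits per pixel are inferred from the mode list above (two colours = 1bpp, four = 2bpp, sixteen = 4bpp, with modes 3 and 6 taken as 640 and 320 pixels wide), and `MODES`, `ram_clocks_per_slot` and `spare_clocks` are names invented for the example.

```python
from fractions import Fraction

# mode: (horizontal pixels, bits per pixel), inferred from the mode list
MODES = {
    0: (640, 1), 1: (320, 2), 2: (160, 4), 3: (640, 1),
    4: (320, 1), 5: (160, 2), 6: (320, 1),
}
CLOCKS_PER_BYTE = 8   # two 4-clock nibble reads at 16 MHz
SLOT_CLOCKS = 16      # one 1 MHz CPU period, in 16 MHz clocks


def ram_clocks_per_slot(mode: int) -> int:
    """16 MHz clocks of RAM access the ULA needs in every 16-clock slot."""
    width, bpp = MODES[mode]
    bits_needed = Fraction(width, 640) * bpp * SLOT_CLOCKS
    bytes_needed = bits_needed / 8
    return int(bytes_needed * CLOCKS_PER_BYTE)


def spare_clocks(mode: int) -> int:
    """Clocks left in each slot for the CPU to access RAM."""
    return SLOT_CLOCKS - ram_clocks_per_slot(mode)


for mode in MODES:
    print(mode, ram_clocks_per_slot(mode), spare_clocks(mode))
```

Under these assumptions modes 0 to 3 consume all 16 clocks, while modes 4, 5 and 6 leave 8 clocks spare, which matches the split described in the text.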
So what happens to the CPU during modes 0-3? To quote Neville, the CPU is “out of action”. The CPU is not clocked at all, preventing it from doing anything during the 40µs display region. In these modes it will only be clocked during the h/v sync regions, the 11.95µs used to draw the border either side of the 640 active region and for modes that have it, the 2 blank vertical lines every 8.
The ULA is not cruel though, and as long as the CPU promises to keep its grubby mitts off the RAM, it will clock it at 2MHz. This enables the CPU to read from ROM, memory mapped I/O (incl. ULA registers) and use the expansion port regardless of the display mode. It is only when the CPU attempts to access RAM that the ULA steps in and restricts it to 1MHz or 0MHz.
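Put together, the clock choice described so far looks roughly like the Python function below. It is a simplified model and not the ULA logic: `cpu_clock_mhz` and the `ula_fetching` flag (true during the 40µs in which the ULA is reading display data) are hypothetical, RAM access outside the display fetch is assumed to run at 1MHz, and both NMI handling and the transition logic between speeds are ignored.

```python
def cpu_clock_mhz(addr: int, mode: int, ula_fetching: bool) -> int:
    """Return the CPU clock rate (0, 1 or 2 MHz) for one access."""
    if not 0 <= mode <= 6:
        raise ValueError("mode must be 0-6")
    if addr >= 0x8000:
        return 2  # ROM, memory mapped I/O and expansion bus: full speed
    if mode in (4, 5, 6):
        return 1  # RAM shared with the ULA using the 8 spare cycles
    # Modes 0-3: the ULA needs every RAM cycle while fetching display data
    return 0 if ula_fetching else 1


print(cpu_clock_mhz(0xC000, 0, True))   # ROM access during display
print(cpu_clock_mhz(0x3000, 4, True))   # RAM access in mode 4
print(cpu_clock_mhz(0x3000, 0, True))   # RAM access in mode 0 during display
```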
There’s a little more to variable clocking than the above, such as transition logic. I might cover that in a future post.
There is one final twist to CPU timing. The ULA is in control of the 0, 1 or 2MHz CPU clock except when a non-maskable interrupt occurs. This type of interrupt is generated by high priority devices that may be using the expansion bus, and when it occurs it is the ULA that has to stop accessing RAM.
The CPU is still only clocked at 1MHz and receives 8 of the 16 cycles, which is enough for it to read one byte per CPU clock. However, this access is allowed even when the ULA itself requires all 16 cycles to keep up with video data. For this reason the ULA will end up outputting whatever the CPU read from RAM during its slot as video data. Electron users referred to this as display snow.
It would appear that the CPU could be clocked at 2MHz during an NMI although from what I’ve read this is not the case. Perhaps it was assumed that most NMIs would be generated in display modes where the CPU shared RAM access with the ULA and allowing 2MHz clocking would have resulted in snow for all display modes rather than just modes 0-3?
Going back to the ROM implementation, originally I implemented this in BRAM and later moved to DRAM when BRAM became too scarce. The move to DRAM brought with it a ULA clock synchronization issue. This is not an issue present on the Electron, but one born out of using DRAM on the Replay board.
This may not make much sense unless you’re familiar with the Replay Framework, but for those who are, the issue is that the framework uses a sys_clk, which is 32MHz in the case of my core. Use of the clock is usually gated by a 1 in 4 ena_sys flag, giving an effective clock rate of 8MHz.
Thus 4 sys clocks occur with ena_sys active every 4th clock. This is referred to in the framework as the system clock phase (cph) 0, 1, 2 and 3 with ena_sys occurring on phase 3.
The DRAM controller is set up such that it latches the addr/data on cph(0) and will have the resulting read (or write) complete just before cph(3).
This means every time ena_sys is high you can set up a read/write from DRAM and the result will be ready on the next ena_sys tick.
That’s all well and good, but I needed a 16MHz clock for the ULA so had to ignore ena_sys and instead generate a 1 in 2 clock for the ULA.
I chose to clock the ULA on cph(1) and cph(3). This keeps the ULA more synchronised with the ena_sys for DRAM access.
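The relationship between the phases and the resulting clock rates can be checked with a tiny Python model. The helper names are invented for the example; the 32MHz sys_clk, the ena_sys on phase 3 and the ULA ticks on phases 1 and 3 are taken from the text.

```python
SYS_CLK_HZ = 32_000_000
PHASES = (0, 1, 2, 3)  # cph(0)..cph(3), one sys_clk period each


def ena_sys(phase: int) -> bool:
    """Framework enable: active on 1 sys_clk in 4."""
    return phase == 3


def ena_ula(phase: int) -> bool:
    """ULA tick: active on 1 sys_clk in 2, sharing phase 3 with ena_sys."""
    return phase in (1, 3)


def enabled_rate_hz(enable) -> int:
    """Effective clock rate of logic gated by the given enable."""
    active = sum(1 for p in PHASES if enable(p))
    return SYS_CLK_HZ * active // len(PHASES)


print(enabled_rate_hz(ena_sys))  # 8 MHz
print(enabled_rate_hz(ena_ula))  # 16 MHz
```

Because every ena_sys tick coincides with a ULA tick, every other ULA clock lines up with a point where a DRAM request can be set up.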
So where’s the issue?
If the ULA starts clocking on cph(1), it will have missed the DRAM sampling that occurs on cph(0). The clock on cph(3) would then be presented with random data, the request would not actually be latched until after the next ULA clock, and yet another ULA clock would be needed before the result is valid.
This would be painful to deal with and inefficient.
If, however, the ULA starts clocking in sync with cph(3) of the system clock, then the first state can set up the read address and the DDR controller will sample it on cph(0). A ULA clock will occur on cph(1) and do nothing (at least DRAM access wise), and the controller will then present the result just before cph(3). Finally, on cph(3), the data can be made available to the CPU/ULA and, if needed, the next DDR read/write access set up.
With my current implementation, the DDR data made available to the core on cph(3) is passed to the ULA data bus via a process block. This means the data does not actually become visible to the ULA until the following ULA clock.
In the image above, the first white line shows cph(3) (i.e. cph_sys 8). The CPU (phi_out) clock occurs on cph(3) and the ULA is in clk_phase 0. Changes to the addr/data bus will be stable before cph(0) (i.e. cph_sys 1), at which point the DDR Valid line is asserted and a read cycle begins.
The DDR read takes four sys clocks to complete (or two ULA clocks), with the read data available by the following cph(3) on clk_phase 2. At this time the data (c8 in this case) is placed on the data_bus (indicated by the yellow line).
Thus the data from a completed DDR request started on clk_phase 0 will be visible to the ULA by clk_phase 3, and a new request could be made on clk_phase 4, with ample time to spare. All told, the ULA could make 2 DDR reads, or the core could perform 4 DDR reads, between 2MHz ticks of the CPU. Whilst a few tweaks may make it possible to perform 4 ULA based DDR accesses rather than 2, there’s already ample time spare even if RAM is moved over to DDR and access interleaved, so I see little need to optimise this further.
For the above to work, the ULA has to start being clocked on cph(3). To accomplish this, when power is applied and the system comes out of reset, the ULA is kept in the reset state a little longer until the system reaches phase 3. Thus the first tick it ever receives will be edge aligned with phase 3 of the system clock allowing a DRAM access every other ULA tick.
This is one of those cases where describing the problem ends up more complex than the solution itself.
    p_por : process(i_clk_sys, i_rst_sys, i_halt)
    begin
      if (i_rst_sys = '1' or i_halt = '1') then
        n_por <= '0';
      elsif rising_edge(i_clk_sys) then
        -- Delay bringing out of reset until after cph(1) to align even ula_clk
        -- cycles with cph(3). DDR can then be accessed every other even ula_clk cycle
        if (i_cph_sys(2) = '1') then
          n_por <= '1';
        end if;
      end if;
    end process;

    -- 16MHz from sys_clk / 2. ula_clk sync'd to cph(3) via por
    ula_clk <= i_cph_sys(1) or i_cph_sys(3) when n_por = '1' else '0';
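The effect of holding the ULA in reset until phase 3 can be checked with a small cycle-by-cycle simulation in Python. This is a behavioural model of the VHDL above and not a replacement for it: it assumes the phase advances on each rising edge of sys_clk, that n_por is a register updated on that same edge, and that reset may be released during any phase. `first_ula_tick_phase` is a name made up for the example.

```python
from typing import Optional


def first_ula_tick_phase(start_phase: int, cycles: int = 16) -> Optional[int]:
    """Phase during which ula_clk first goes high after reset is released."""
    n_por = False
    phase = start_phase % 4
    for _ in range(cycles):
        # Combinational: ula_clk <= cph(1) or cph(3) when n_por = '1'
        ula_clk = n_por and phase in (1, 3)
        if ula_clk:
            return phase
        # Rising edge at the end of this sys_clk period:
        # the process sees cph(2) = '1' and releases n_por.
        if phase == 2:
            n_por = True
        phase = (phase + 1) % 4
    return None


for start in range(4):
    print(start, first_ula_tick_phase(start))
```

Whichever phase the system reset happens to be released in, the first ULA tick in this model lands on phase 3, which is the alignment the DRAM access relies on.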
I’m not too keen on abusing the n_por signal for this purpose. In addition, generating a new ula_clk is bad practice: there are apparently subtle timing issues that can bite when deriving new clock domains in this way. Before implementing any other subsystem, I plan to switch everything over to a single system clock domain and use clock enables for the 16MHz, 2MHz and 1MHz clocks.
With RAM, ROM, CPU and a few parts of the ULA now wired up to allow the CPU to read/write from RAM, there’s enough to start testing. Although no work has yet been done on the ULA’s video output, the CPU should be able to boot and start running the OS. I’ll discuss this in the next blog post.
[1] Acorn Electron Advanced User Guide