interest in an asm version of micronucleus? #182
Comments
Thanks for your generous offer.
Since micronucleus doesn't use LTO, it is more difficult to optimize the C code, especially when restricted to pure C without inline asm. Some of the asm techniques I use, such as using the carry flag, just can't be done in C.
I forgot v-usb is such a mess of preprocessor code. I decided that it'll be easier to optimize and convert to asm if I clean it up and strip it down to only what is used by micronucleus first. Therefore I've forked your micronucleus-firmware. I've been testing with t85_default and t85_aggressive. When I'm finished with the cleanup, I'll push to my fork and make a pull request.
I think there are still optimizations in u-wire that haven't been back-ported to micronucleus.
Hi Ralph, nice that you are looking into a 100% ASM version! It certainly is a nice challenge. Micronucleus is a bit caught between the need to serve multiple goals: small size, reliability and configurability. This often leads to certain compromises. Although Micronucleus uses V-USB in a non-standard way (no interrupts), the goal was always to retain V-USB as much in its original form as possible to allow upgrading to newer versions. Basically only a RETI was swapped for a RET. Of course, it seems that V-USB has never been updated during the last years (support for the Tiny 0 and 1 series and also LGT is missing), so maybe keeping V-USB pristine does not need to be a priority anymore.

There are definitely some things in u-wire that could bring the size down a bit, e.g. flat code, foregoing the SRAM double buffering and multi-byte responses. My feeling is that things start to be a bit challenging when trying to retain all the configuration options. Maybe it would be wise to start with an ATtiny85-only version first? The protocol has not changed since the introduction of 2.0, so you can use any 2.x version as a starting point. Maybe it should be a separate firmware first and only be merged after there is some clarity how everything works out?

Generally, I cannot invest that much time into Micronucleus nowadays, but I will gladly assist if there are any questions. The most pressing items right now are to fix the documentation and investigate whether we could use CI to set up a proper release system.

Best Regards,
Thanks for the feedback Tim. Although it doesn't seem to be very popular, I prefer to do most of my embedded programming in asm, particularly on small 8-bit platforms like the 8051 and AVR. I find it much easier to write the asm code I want, rather than trying to cajole avr-gcc into generating the code that I want. I agree with your plan; that's pretty much what I've started doing. t85_default will be my main testing target, with t85_aggressive being secondary.

I shouldn't have any questions on the technical side, as low-speed USB 1.1 is a rather simple protocol when limited to EP0 control packets. Where I may have some questions is to tap your knowledge of how people are actually using micronucleus. Sometimes functionality that has no value to me is important to the average user.

Regarding CI, it's a good suggestion. I'll probably use GitHub Actions for CI on my fork. Just something simple to make sure everything builds cleanly.
I just got t85_aggressive down to 1342 bytes, below the next page boundary. I also improved the sync timing window from -1/+11c to -3/+7c per JK transition.
Congrats 🥇
Nice! Every byte is hard work at this point 👍
Making tweaks to the v-USB asm code is a bit trickier than it should be. In some places the cycle numbers in the comments are counted at the start of the instruction, and in others at the end. I think Christian started off using end-of-instruction counts (usbdrvasm12.inc) and switched to start-of-instruction in the other versions. Going with the start of the instruction makes more sense to me, since skip and branch instructions take 1 or 2 cycles.
I've added an m88p config, and finished converting asm12.inc to conform to the avr-libc calling convention. This makes it easy to run micronucleus on a USBasp. I've switched to testing with the 12MHz v-USB with an external xtal. No variation of the 16.5MHz version is completely stable on my PC, making it difficult to test changes I make to micronucleus. I suspect most USB hub controllers don't implement a PLL with a timing range of +/- 1.5% for low-speed devices as required by the spec. Tim, it looks like you figured this out several years ago:

In addition to the code size optimizations I'm doing, I think there is still room to improve reliability. I think OSCCAL tuning doesn't always pick the best value (sometimes being off by 1 step), and I think I can reduce the code size. In picoboot I've implemented automatic OSCCAL tuning in 20 instructions.
I just pushed an update to usbdrv.c that leaves out unused USB requests (ones mn should never see, or ignores). t85_default is now 1438 bytes and t85_aggressive is 1292.
I'm catching some big fish today. Since the configuration descriptor is now only 9 bytes, wLength no longer needs to be checked for any of the messages mn receives. t85_aggressive is now under the next page boundary (1280) at 1274 bytes.
👍
As you wish.
I removed more dead code from usbdrv.c, and got past the next page boundary for t85_default: 1398 bytes.
Oh indeed, I did not even remember that. I guess I was trying to corroborate the claims in the V-USB documentation regarding the stability of the individual implementations for different clock speeds. Not all of the claims seem to make sense and none are backed up with data. I am a bit surprised, however, that you see issues with 16.5 MHz. I always regarded this as a stable variant.
Keep in mind that some ATtinys have a non-monotonic encoding of OSCCAL. Therefore the linear search ("neighbourhood search") should not be skipped. Also, a timeout condition is handy to prevent deadlocks.
The reference to +/- 1.5% is correct (section 7.1.11, Data Signaling, p. 159). However I'm convinced that testing a low-speed device outside of +/- 0.25% timing is not part of the certification process, and the DPLL in many modern hubs doesn't have the range to vary by 1.5%. Hence the most stable v-usb devices are those that use an external xtal, such as the USBasp and USBtinyISP.
Today I started converting the usbdrv.c ProcessRx code to asm. t85_aggressive is now below 19 pages, at 1212 bytes. Next I'll work on BuildTx.
I found some easy size optimizations in the C code by using fixed registers for usbMsgPtr and usbMsgLen. Those two changes save a total of 36 bytes, so now t85_default is down to 1340 bytes.
I finished converting usbBuildTxBlock to asm, and now t85_aggressive is 1146 bytes.
👍 Yeah, it's just a lot of nested functions that can be flattened.
Today I decided to do some investigation into the reliability problems I'm having with micronucleus on a t85 at 16/16.5MHz. While it's well known that OSCCAL tuning on the t85 doesn't get much closer than +/-0.5%, what I found is that the frequency is only stable to within +/- 0.5% over short intervals (~1s). Here's the program and a sample output.
That seems to be a surprisingly strong effect. Since the trend is quite monotonic, could this be due to self-heating?
The resolution of the high/low count was only 16 cycles, as I was only sampling one phase with an 8-cycle sampling loop and then multiplying by 2. I updated it to sample both clock phases of the external clock. I'm using the 1kHz output of my Rigol DS1054Z, which is extremely stable and accurate to a few ppm. I also reduced the sampling window from 1s to 250ms, as I think it starts getting unreasonable if you were to retune OSCCAL more than 4x/s.
Since the resolution of the OSCCAL steps and the clock instability are cumulative, that means you can't count on much better than +/- 1% timing accuracy on a t85. p.s. I've even considered dithering the clock in order to improve the tuning accuracy, but the frequency changes too quickly (within a few clock cycles). I suspect the clock design latches OSCCAL to an internal R-2R DAC driving a VCO.
I just noticed something that might explain some of the unreliability I'm seeing while running on the internal RC oscillator: I haven't been testing with a dedicated bus; there's a low-speed mouse with an 8ms poll interval on the same bus. I think it's too limiting to require a dedicated USB bus for micronucleus, as many low-end PCs and laptops only have two. As well, you'd need to explain to most users how to determine which physical USB ports share one bus.

I think before I work on converting the main loop to asm, I'm going to write a new calibration function. What I use in picoboot is a proportional algorithm; a larger time difference results in a larger change to OSCCAL. I think I can get OSCCAL to the optimal value with 3-4 empty frames, and reliably discard timing from non-empty frames. p.s. the consequences are particularly bad if there is data in a frame during the first 5 stages of the binary search. That would put the optimal OSCCAL value outside of the adjustment range of the nearest-neighbor adjustment.
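The proportional idea can be sketched on the host side. This is not picoboot or micronucleus code; the function name, the 0.5%/step slope, and the 12 MHz target are illustrative assumptions:

```cpp
#include <cstdint>

// Hypothetical sketch of the proportional tuning idea: the further the
// measured SOF-to-SOF frame time is from the ideal cycle count, the larger
// the step applied to OSCCAL. All names and constants are illustrative.
constexpr int32_t kTargetCycles = 12000; // 1ms USB frame at 12 MHz

uint8_t tune_step(uint8_t osccal, int32_t measured_cycles) {
    // error > 0 means the oscillator runs fast, so OSCCAL must go down
    int32_t error = measured_cycles - kTargetCycles;
    // assume roughly 0.5% per OSCCAL step near the target, i.e. ~60 cycles
    int32_t step = -error / 60;
    int32_t next = osccal + step;
    if (next < 0) next = 0;          // clamp to the 8-bit register range
    if (next > 255) next = 255;
    return static_cast<uint8_t>(next);
}
```

A large error makes a correspondingly large jump, so a few empty frames are enough to converge, while a binary search always needs all 8 steps.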
These collisions are not as much of an issue as one might think. Only old (1.1?) USB hubs forward all traffic to every port. Anything more modern should not present this issue. AFAIR the fix was introduced by shay based on his observations on a 1.1 hub.
Today I did some testing with a t88 configured to tune OSCCAL to 12MHz, and it was failing to enumerate. I added some debug code to output the clock/1000, and found it was getting tuned to 12.5MHz. I think the tuning bug has been there for years, as I had tried to get the t88 working several years ago but had moved on to other projects after I couldn't get it working.
There were a couple of basic bugs in tuneOsccal that I found yesterday. Things like both scratch and shift being defined as r23 (doh!). The most serious bug took me a while to find. I created a basic realtime debug log with a 6mbps UART so I could log the counter values from OSCCAL. With the debug logging in place, I realized tuneOsccal was getting called dozens of times because mn tries to tune during bus reset, not after. That bug was fixed with an extra two instructions to wait for the end of reset.
Upgrade wasn't actually working; it was just reporting success but not writing anything because I had forgotten to enable the SELFPRGEN fuse. After enabling the fuse, I'm getting unreliable results flashing new code using the bootloader. Sometimes I get flash write error -1; other times I get no errors reported, but the code doesn't seem to have updated. I'm using OSCCAL_RESTORE_DEFAULT and OSCCAL_SLOW_PROGRAMMING.
Nice! Do you mean Mega88 instead of tiny88? Because there are some tiny88 development boards out there (TINY-MH88) that seem to work fine. I guess all the ATMega support was developed on boards with crystals.
No, it's a t88. They have the same monotonically increasing OSCCAL curve as the t84a. They don't support an xtal, so the MH88 boards must use an SMD can oscillator. According to the datasheet 12MHz is the top speed, so running at 16 would be a slight overclock. Unlike most tinies, it has hardware SPI and i2c.
https://www.microchip.com/wwwproducts/en/ATTINY88

I think dropping the speed to 8MHz and then pumping it back up to 12 each time the main loop runs could be creating harmonic oscillations in the clock, so I'm going to change it to only change speed when writing or erasing flash. I also may pick a sacrificial t88 to test flash operations with the RC oscillator at 12MHz to see if it works. I've seen posts from some paranoid old farts on AVRfreaks warning that you can brick your AVR doing this, but I've never seen anyone test it.
After modifying the SLOW_PROGRAMMING code to only slow the clock during erase/program, the stability seems better with the t88. Not quite perfect, as I got a recoverable error once during testing:
Upgrade isn't working for me though. Even though the code successfully uploads, the old bootloader doesn't get replaced. I think this is a bug introduced in the micronucleus-firmware fork, as the upgrade for 2.04 worked fine when I tried it on a t85. Since flashing and erasing with the mn tool works fine, I'll leave the upgrade debugging to someone else.
I decided to have some fun today and see what happens when modifying the flash at >12MHz. It turns out it may even be a good thing.
Interesting! One challenge I see is that you cannot really measure read margin using the user registers. So it would be possible that you write the cell successfully, but the charge stored on it is not sufficient to allow for 10-year data retention. Maybe an easy hack could be to write twice? Because it should be fair to assume that the charge stored per time is a constant if all other conditions are the same.
Don't forget charge is 1-(e^-x), so reducing the time by 1/3rd reduces the charge by only 24%. For retention the t88 spec is actually "20 years at 85°C / 100 years at 25°C". I suppose I could do an experiment to test retention by writing zeros to the page, then repeatedly verifying the page at high temperature. If I can find 26-30AWG wire with high-temperature insulation, then I could probably do the testing at 150-160°C. However, extrapolating the lifetime/temperature curve, I'd need to run the AVR at 200°C to get below 1 year expected lifetime. Given that's above the melting point of my solder, I don't see that experiment happening anytime soon. However, since I've already pushed it two orders of magnitude past its rated number of write cycles, I may try the repeated read test just to make sure it still holds the zeros for at least a few days. edit: According to a Micron technical note I found, the degradation due to elevated temperature is more exponential than I thought. 150°C unpowered overnight followed by a read test of < 24hrs may be good enough.
The writing mechanism most likely does not behave like an RC circuit. I don't know what kind of write mechanism is used in the AVR; it could be HCI or FN. Usually it is rate-limited by the electron transfer process and is operated at strong overbias, so that the gate charge does not really influence the rate. Maybe it is possible to vary the supply voltage to test for programming margin. Since they seem to have an internal charge pump, that may not work, though.
I'm trying to solve a small problem with the new OSCCAL tuner. At 8MHz, one USB bit-time is 6.67 cycles. The counting loop takes 10 cycles, so it's possible to miss a single high bit on D+. That means occasionally it could mistake the EOP SE0 as SOF. Thoughts?
Here's what I've come up with for a double-sampling loop.
hm... why do you want to run OSCCAL at 8 MHz? There is no V-USB implementation at that frequency. Maybe another idea to address the issue: how about repeating the fixed-point iteration for more cycles and also accumulating a fractional count of OSCCAL? Then the error would be smoothed out over time.
The target speed is not 8MHz; it's 12 on devices like the t88 and t84. They start off at 8MHz. The problem is actually more likely than I stated above. One low-speed bit (666.7ns) at 12MHz is 8 cycles, so at 8MHz it's only 5.333 cycles. The error from misinterpreting EOP as SOF could be more than an order of magnitude (if the subject EOP is in the first 1/10th of the frame).
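The bit-time arithmetic above can be double-checked with a small constexpr helper (illustrative only; the function name is not from the firmware):

```cpp
// One low-speed USB bit is 1/1.5MHz ≈ 666.67 ns.
constexpr double kBitNs = 1e9 / 1.5e6;

// CPU cycles spanned by one USB bit at a given clock in MHz.
constexpr double cycles_per_bit(double f_mhz) {
    return kBitNs * f_mhz / 1000.0;   // ns * (cycles per ns)
}
// cycles_per_bit(12.0) = 8.0 and cycles_per_bit(8.0) ≈ 5.33, so a
// 10-cycle sampling loop at 8 MHz spans almost two bit times and can
// step right over a single high bit on D+.
```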
This version fixed the bug with missed D+ pulses, but introduced a problem with frequent missed SOFs. The problem was that checking the USBMINUS bit of the two samples that were or'd together meant that only 3 SE0 samples in a row were guaranteed to be detected. I changed it to

I'd prefer to avoid the extra code in main to boost OSCCAL when
You could use the pin change interrupt to detect D+ pulses. Then you could use a fast loop to detect SOF on D-. btw, one thing that could prove a bit problematic with the current loops is a lack of timeout. There were some issues with micronucleus when people had noise on the USB pins even when the device was not connected to the host. In that case sometimes OSCCAL tuning is invoked. This was especially an issue with Olimex devices - no idea why. -> I see that you still have the escape via BRNE. Good :)
After doing some work on usbasp firmware, I'm back to hacking on mn. One of my ideas is optimizing or even eliminating the CRC16 code. A couple weeks ago I combined reading descriptors from flash with the CRC function. I've been thinking that it should be possible to calculate the CRC at compile time. It would mean breaking up the descriptors into 8-byte chunks, and appending the CRC to each chunk. That would add 12 bytes to descriptors, and save 42 bytes of the CRC code, for a net gain of 30 bytes. I've written a small C++ program to test the concept:
I tested it with gcc 5.4.0, which needs C++14 support enabled for constexpr functions. I confirmed that the constexpr function is evaluated at compile time:
This would have the added benefit of eliminating the runtime cost of calculating the CRC.
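The compile-time CRC idea can be sketched roughly like this. This is not the actual gencrc program; the descriptor chunk bytes are made up for illustration, but the CRC parameters are USB's data CRC16 (poly 0x8005, reflected 0xA001, init 0xFFFF, result inverted), and the function needs C++14 for loops in constexpr:

```cpp
#include <cstdint>
#include <cstddef>

// Bitwise USB CRC16 over a byte buffer, evaluable at compile time.
constexpr uint16_t usb_crc16(const uint8_t* data, size_t len) {
    uint16_t crc = 0xFFFF;                       // USB initial value
    for (size_t i = 0; i < len; ++i) {
        crc ^= data[i];                          // bytes enter LSB-first
        for (int b = 0; b < 8; ++b)
            crc = (crc & 1) ? (crc >> 1) ^ 0xA001 : crc >> 1;
    }
    return static_cast<uint16_t>(~crc);          // USB sends inverted CRC
}

// Example: CRC over one hypothetical 8-byte descriptor chunk, folded at
// compile time so it could be appended to the chunk in flash.
constexpr uint8_t chunk[8] = {0x12, 0x01, 0x10, 0x01, 0xFF, 0x00, 0x00, 0x08};
constexpr uint16_t chunk_crc = usb_crc16(chunk, sizeof(chunk));
static_assert(chunk_crc == usb_crc16(chunk, 8), "evaluated at compile time");
```

With the CRC stored alongside each 8-byte chunk, the device can transmit descriptor packets verbatim with no runtime CRC computation at all.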
I think I've really gone down the rabbit hole now. After so many days of reading usb code (both vUSB and usbtiny) I started writing my own. Here are the plans:
I already have a prototype generating a valid ACK packet (sync, PID, & EOP) in less than 40 asm instructions. Rx will need a bit more code, probably 50-60 instructions. Here's the core of the tx code:
That's a cool idea! How do you prevent bit stuffing? By only using one nibble of the payload? I guess it's a bit more tricky to avoid bit stuffing in the CRC. If it is possible to generate all CRC codes at compile time, one could add a bit of salt, I guess?
Sending inverted nibbles like PIDs is probably the easiest way of sending data. An 8/6 GCR-like encoding would be preferable if I can come up with a small decoding algorithm. In either case it would allow for some error checking on the device side. My reason for wanting more than 4 decoded data bytes per data packet is so I can send R1:R0, Z, and 1 byte for SPMCSR in each user control packet. It greatly simplifies the bootloader protocol, and is similar to a technique I used in my first picoboot bootloader that made it possible to fit in just 64 bytes.

The CRC isn't so hard, since all of the USB responses from the target never change. If mn had a flash read command for verify, then the CRC problem would be more complicated. In my mn fork, there are 6 different data packets the device needs to send: 3 for the device descriptor, 2 for the configuration descriptor, and one for the mn config. Looking at the USB traces, I noticed there was only 1 stuffed bit, and that was for the 0xFF device protocol field.

I think I can get rid of the mn config packet by using the class, subclass, and protocol fields of the device descriptor. Both the 2nd signature byte and total flash size can be encoded in just 3 bits. The upper nibble is always 9, and the lower nibble is the flash size in kB as 2^N. That's why the t88, t84, t85, & m8 all have 0x93 as their middle signature byte. They all have 2^3 = 8 kB of flash. Page size only needs a couple of bits, and user flash space available could be efficiently encoded as total flash size - bootloader size in pages.
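The signature-byte observation above amounts to a one-line decode. A minimal sketch (the function name is made up, and the rule only covers parts whose middle signature byte is 0x9N):

```cpp
#include <cstdint>

// For many AVRs the middle signature byte is 0x9N, where the flash size
// is 2^N kB: 0x93 -> 8 kB on the t88/t84/t85/m8, 0x94 -> 16 kB, etc.
constexpr unsigned flash_kb(uint8_t sig_middle) {
    return 1u << (sig_middle & 0x0F);   // low nibble is log2(size in kB)
}
static_assert(flash_kb(0x93) == 8,  "t85 family: 8 kB");
static_assert(flash_kb(0x94) == 16, "mega16 class: 16 kB");
```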
No need for a complex line coding, I think. You just need to insert two zeroes into each byte, e.g. at bits 7 and 0, so that no adjacent bytes will cause a long run. Using nibbles may be a bit easier to handle for block transfer, though. Transferring each byte address together with each cell to be written is a bit dangerous from a safety perspective, in my opinion, because it is much more difficult to perform a consistency check on the client side. It would easily be possible to brick the device. Maybe just transferring the block offset is safer?
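Whether a payload layout avoids stuffing can be checked offline. A sketch of such a check (the function name is illustrative): USB inserts a stuffed 0 after six consecutive 1s in the LSB-first bit stream, so a payload is stuff-free iff no run of six 1s appears anywhere, including runs that straddle byte boundaries.

```cpp
#include <cstdint>
#include <cstddef>

// Returns true if transmitting these bytes would trigger USB bit stuffing.
bool needs_stuffing(const uint8_t* data, size_t len) {
    int ones = 0;                               // current run of 1 bits
    for (size_t i = 0; i < len; ++i)
        for (int b = 0; b < 8; ++b) {           // USB sends LSB first
            if ((data[i] >> b) & 1) {
                if (++ones == 6) return true;   // six 1s => stuff bit
            } else {
                ones = 0;                       // a 0 resets the run
            }
        }
    return false;
}
```

Forcing bits 7 and 0 of every byte to 0, as suggested above, caps any run at six minus the two guaranteed zeros per byte boundary, so the check always passes for such payloads.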
I've got a prototype for build-time CRC generation working. After reviewing the enumeration transactions again, I noticed the 3rd chunk (2 bytes) of the device descriptor has a stuffed bit in the CRC (36671/0x8F3F). I've experimented with adding a surplus byte to the descriptor for salting the CRC. It's been enumerating fine on Windows 7 and on Linux with an extra byte at the end of the descriptor, and with bLength left as 18. It seems the USB drivers ignore the extra data. I suspect MacOS will work as well, but it would be good to have someone test it.
Nice! So the compiler automatically infers compile-time evaluation from the fact that only constants are used? Or does it require an additional keyword for the function declaration? How do you deal with cases where the CRC leads to bit stuffing? Asserts?
It might be possible to inline the CRCs with some constexpr and LTO wizardry, but I'm starting out simple with the gencrc program. I'll have it generate a header file that defines the CRCs, and then I'll add checks to make sure the CRCs don't need bit stuffing.

I've also been thinking of changing the descriptor structures to combine the device and configuration descriptors. The device descriptor is always the first descriptor sent after bus reset, and then the configuration descriptor. By making the enumeration code stateful, sequencing through the descriptor chunks for non-status IN requests, I think I can simplify and streamline the rx handling code.
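The stateful sequencing idea can be sketched as follows. This is a host-side illustration, not firmware: the host always requests the device descriptor first after bus reset, then the configuration descriptor, so non-status IN requests can simply walk a combined descriptor image chunk by chunk instead of decoding wValue. The struct and its layout are hypothetical.

```cpp
#include <cstdint>
#include <cstddef>

struct Enumerator {
    static constexpr size_t kChunk = 8;   // low-speed EP0 packet size
    const uint8_t* image;                 // device + config descriptors, concatenated
    size_t len;
    size_t pos = 0;                       // the only sequencing state

    // Hand back the bytes for the next IN data packet, advancing the state.
    size_t next_chunk(const uint8_t** out) {
        *out = image + pos;
        size_t n = len - pos;
        if (n > kChunk) n = kChunk;
        pos += n;
        return n;
    }
    void reset() { pos = 0; }             // on bus reset / new enumeration
};
```

An 18-byte device descriptor plus a 9-byte configuration descriptor (27 bytes total) would stream out as chunks of 8, 8, 8, and 3 bytes with no per-request decoding.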
ah ok, that is an offline calculation. I guess that's more than sufficient. I was just confused for a bit, because I thought this code would use constant function evaluation, but I did not notice any extra keywords.
Hi @nerdralph, it looks like we will finally get the repository cleaned up now. I believe we can release MN 2.5 soon. For the version after that, it would make a lot of sense to migrate some of your more advanced changes.
Although I haven't pushed any MN updates in the last few months, I've been mulling over some ideas about MN while I work on other projects. The latest idea has to do with the code to sync up with the incoming USB packet. I realized my first attempt to get single-cycle accuracy isn't any better than the original code (which is accurate to 2 cycles). When it comes to improving reliability on targets, single-cycle accuracy would reduce the sampling error from .3125% to .156% for a 12MHz (8 cycle/bit) implementation. Since the total error margin of 1/2 a bit over a ~80-bit packet is .625%, single-cycle accuracy should in theory make a material difference with targets running on the RC oscillator.
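The error-budget arithmetic above checks out with a one-line helper (illustrative, not firmware code):

```cpp
// Sync error as a percentage of the packet: cycles_err out of
// (cycles_per_bit * bits) total cycles.
constexpr double sync_error_pct(double cycles_err, double cycles_per_bit,
                                double bits) {
    return 100.0 * cycles_err / (cycles_per_bit * bits);
}
// sync_error_pct(2, 8, 80) = 0.3125 and sync_error_pct(1, 8, 80) = 0.15625,
// against a half-bit budget of 100 * 0.5 / 80 = 0.625 over the packet.
```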
I've written a few tiny bootloaders in AVR assembler, and am thinking of writing a USB bootloader. If there's interest in an optimized version of micronucleus in pure assembler for the avr25 architecture, I'd be willing to write it. I think I can get the t85/16.5MHz version down to 1kB.
If there's not interest in an asm branch for micronucleus, I may just copy it to use as a base, since it already has a working programming tool.