
interest in an asm version of micronucleus? #182

Open
nerdralph opened this issue Aug 20, 2020 · 71 comments

Comments

@nerdralph
Contributor

I've written a few tiny bootloaders in AVR assembler, and am thinking of writing a USB bootloader. If there's interest in an optimized version of micronucleus in pure assembler for the avr25 architecture, I'd be willing to write it. I think I can get the t85/16.5 MHz version down to 1kB.
If there's not interest in an asm branch for micronucleus, I may just copy it to use as a base, since it already has a working programming tool.

@nerdralph nerdralph changed the title interest in an asm version of micronucleus interest in an asm version of micronucleus? Aug 20, 2020
@ArminJo
Contributor

ArminJo commented Aug 21, 2020

Thanks for your generous offer.
But for me personally, assembler is too hard to maintain.
You are more than welcome, though, to look over the existing assembler code for optimizations, as well as the compiler-generated assembler, to find spots to optimize or maybe small routines to implement in assembler.

@nerdralph
Contributor Author

Since micronucleus doesn't use LTO, it is more difficult to optimize the C code, especially when restricted to pure C without inline asm. Some of the asm techniques I use, such as using the carry flag, just can't be done in C.
Therefore I'll go with plan B, and copy micronucleus to use as a starting point.

@nerdralph
Contributor Author

I forgot v-usb is such a mess of preprocessor code. I decided that it'll be easier to optimize and convert to asm if I clean it up and strip it down to only what is used by micronucleus first. Therefore I've forked your micronucleus-firmware.
So far I've removed everything related to USB_CFG_CHECK_CRC, FAST_CRC, IMPLEMENT_HALT, HAVE_INTRIN_ENDPOINT3 (&EP3_NUMBER), HAVE_INTRIN_ENDPOINT, SUPPRESS_INTR_CODE, & INTR_POLL_INTERVAL, IMPLEMENT_FN_WRITE, FN_WRITEOUT, and USB_CFG_HAVE_FLOWCONTROL.

I've been testing with t85_default and t85_aggressive. When I'm finished with the cleanup, I'll push to my fork and make a pull request.

@micronucleus micronucleus deleted a comment from ArminJo Aug 22, 2020
@nerdralph
Contributor Author

I think there are still optimizations in u-wire that haven't been back-ported to micronucleus.
https://github.com/cpldcpu/u-wire/tree/master/firmware
In particular, usbdrv/usbdrv.c has been significantly reduced, with the basic USB message processing happening in main.c now.

@cpldcpu
Member

cpldcpu commented Aug 23, 2020

Hi Ralph,

nice that you are looking into a 100% ASM version! It certainly is a nice challenge. Micronucleus is a bit caught between the need to serve multiple goals: Small size, reliability and configurability. This often leads to certain compromises.

Although Micronucleus uses V-USB in a non-standard way (no interrupts), the goal was always to retain V-USB in as close to its original form as possible, to allow upgrading to newer versions. Basically only a RETI was swapped for a RET. Of course, it seems that V-USB has not been updated in recent years (support for the Tiny 0 and 1 series and also LGT is missing), so maybe keeping V-USB pristine does not need to be a priority anymore.

There are definitely some things in u-wire that could bring the size down a bit, e.g. flat code, foregoing the SRAM double buffering and multi-byte responses.

My feeling is that things start to get a bit challenging when trying to retain all the configuration options. Maybe it would be wise to start with an ATtiny85-only version first? The protocol has not changed since the introduction of 2.0, so you can use any 2.x version as a starting point. Maybe it should be a separate firmware first and only be merged after there is some clarity about how everything works out?

Generally, I cannot invest that much time into Micronucleus nowadays, but I will gladly assist if there are any questions. The most pressing items right now are to fix the documentation and investigate whether we could use CI to set up a proper release system.

Best Regards,
Tim

@nerdralph
Contributor Author

Thanks for the feedback Tim,

Although it doesn't seem to be very popular, I prefer to do most of my embedded programming in asm, particularly on small 8-bit platforms like 8051 and AVR. I find it much easier to write the asm code I want, rather than trying to cajole avr-gcc into generating the code that I want.

I agree with your plan; that's pretty much what I've started doing. t85_default will be my main testing target, with t85_aggressive being secondary.

I shouldn't have any questions on the technical side, as low-speed USB 1.1 is a rather simple protocol when limited to EP0 control packets. Where I may have some questions is to tap your knowledge of how people are actually using micronucleus. Sometimes functionality that has no value to me is important to the average user.

Regarding CI, it's a good suggestion. I'll probably use GitHub Actions for CI on my fork. Just something simple to make sure everything builds cleanly.

@nerdralph
Contributor Author

I just got t85_aggressive down to 1342 bytes, below the next page boundary.
https://github.com/nerdralph/micronucleus-firmware/tree/asm

I also improved the sync timing window from -1/+11c to -3/+7c per JK transition.

@ArminJo
Contributor

ArminJo commented Aug 24, 2020

Congrats 🥇

@cpldcpu
Member

cpldcpu commented Aug 26, 2020

Nice! Every byte is hard work at this point 👍

@nerdralph
Contributor Author

Making tweaks to the V-USB asm code is a bit trickier than it should be. In some places the cycle numbers in the comments refer to the start of the instruction, and in others to the end. I think Christian started off using end-of-instruction counts (usbdrvasm12.inc) and switched to start-of-instruction counts in the other versions. Going with the start of the instruction makes more sense to me since skip and branch instructions take 1 or 2 cycles.
I've also noticed some mistakes in the cycle counts. The push on line 96 actually ends with cycle 12.
https://github.com/micronucleus/micronucleus/blob/master/firmware/usbdrv/usbdrvasm12.inc#L96
The cycle numbers for the next two push instructions are off by 2, and then from line 100 they are correct.

@nerdralph
Contributor Author

I've added an m88p config, and finished converting asm12.inc to conform to the avr-libc calling convention. This makes it easy to run micronucleus on a USBasp. I've switched to testing with the 12 MHz V-USB with an external xtal. No variation of the 16.5 MHz version is completely stable on my PC, making it difficult to test changes I make to micronucleus. I suspect most USB hub controllers don't implement a PLL with a timing range of +/- 1.5% for low-speed devices as required by the spec. Tim, it looks like you figured this out several years ago:
https://forums.obdev.at/viewtopic.php?t=8819

In addition to the code size optimizations I'm doing, I think there is still room to improve reliability. OSCCAL tuning doesn't always pick the best value (sometimes being off by 1 step), and I think I can reduce the code size. In picoboot I've implemented automatic OSCCAL tuning in 20 instructions.
https://github.com/nerdralph/picoboot-lib/blob/master/pb-lib.S#L107

@nerdralph
Contributor Author

I just pushed an update to usbdrv.c that leaves out unused USB requests (ones mn should never see, or ignores). t85_default is now 1438 bytes and t85_aggressive is 1292.

@nerdralph
Contributor Author

I'm catching some big fish today. Since the configuration descriptor is now only 9 bytes, wLength no longer needs to be checked for any of the messages mn receives. t85_aggressive is now under the next page boundary (1280) at 1274 bytes.

@ArminJo
Contributor

ArminJo commented Aug 27, 2020

👍
If you push the hex files to your repo, I can make some independent and intermediate tests for your t85_default and t85_aggressive versions.

@nerdralph
Contributor Author

> +1
> If you push the hex files to your repo, I can make some independent and intermediate tests for your t85_default and t85_aggressive versions.

As you wish.
https://github.com/nerdralph/micronucleus-firmware/tree/asm/firmware/releases

@nerdralph
Contributor Author

I removed more dead code from usbdrv.c, and got past the next page boundary for t85_default: 1398 bytes.

@cpldcpu
Member

cpldcpu commented Aug 28, 2020

> I suspect most USB hub controllers don't implement a PLL with a timing range of +/- 1.5% for low-speed devices as required by the spec. Tim, it looks like you figured this out several years ago:
> https://forums.obdev.at/viewtopic.php?t=8819

Oh indeed, I did not even remember that. I guess I was trying to corroborate the claims in the V-USB documentation regarding the stability of the individual implementations for different clock speeds. Not all of the claims seem to make sense and none are backed up with data.

I am a bit surprised, however, that you see issues with 16.5 MHz. I always regarded this as a stable variant.

> In addition to the code size optimizations I'm doing, I think there is still room to improve reliability. I think OSCCAL tuning doesn't always pick the best value (sometimes being off by 1 step), and I think I can reduce the code size. In picoboot I've implemented automatic OSCCAL tuning in 20 instructions.

Keep in mind that some ATtinys have a non-monotonic encoding of OSCCAL. Therefore the linear search ("neighbourhood search") should not be skipped. Also, a timeout condition is handy to prevent deadlocks.

@nerdralph
Contributor Author

> I suspect most USB hub controllers don't implement a PLL with a timing range of +/- 1.5% for low-speed devices as required by the spec. Tim, it looks like you figured this out several years ago:
> https://forums.obdev.at/viewtopic.php?t=8819
>
> Oh indeed, I did not even remember that. I guess I was trying to corroborate the claims in the V-USB documentation regarding the stability of the individual implementations for different clock speeds. Not all of the claims seem to make sense and none are backed up with data.
>
> I am a bit surprised, however, that you see issues with 16.5 MHz. I always regarded this as a stable variant.

The reference to +/- 1.5% is correct. Section 7.1.11 (data signaling), p. 159:
"For low-speed functions, the required data-rate when transmitting (TLDRATE) is 1.50 Mb/s ±1.5% (15,000 ppm)."
http://sdphca.ucsd.edu/lab_equip_manuals/usb_20.pdf

However I'm convinced that testing with a low-speed device outside of +/- 0.25% timing is not part of the certification process, and the DPLL in many modern hubs doesn't have the range to vary by 1.5%. Hence the most stable v-usb devices are those that use an external xtal, such as the USBasp and USBtinyISP.

@nerdralph
Contributor Author

Today I started converting the usbdrv.c ProcessRx code to asm. t85_aggressive is now below 19 pages, at 1212 bytes.
https://github.com/nerdralph/micronucleus-firmware/blob/asm/firmware/releases/t85_aggressive20200828.hex

next I'll work on BuildTx.

@nerdralph
Contributor Author

I found some easy size optimizations in the C code by using fixed registers for usbMsgPtr and usbMsgLen. Those two changes save a total of 36 bytes, so now t85_default is down to 1340 bytes.
https://github.com/nerdralph/micronucleus-firmware/blob/asm/firmware/releases/t85_default20200830.hex
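
The mechanism is the avr-gcc global register variable extension. Roughly, it looks like the sketch below (the register choices are only illustrative, not necessarily what the fork uses, and every file has to be built with the matching -ffixed-N flags):

#include <stdint.h>

/* Illustrative sketch: pin globals to fixed registers so avr-gcc never
   spills them to SRAM (no lds/sts).  Reserve the registers in every
   translation unit with -ffixed-3 -ffixed-4 -ffixed-5. */
register uint8_t        usbMsgLen asm("r3");    /* bytes left in the pending reply    */
register const uint8_t *usbMsgPtr asm("r4");    /* reply data pointer, lives in r4:r5 */

uint8_t usbNextReplyByte(void)                  /* hypothetical helper for illustration */
{
    if (usbMsgLen == 0)                         /* touching the globals is register-only */
        return 0;
    usbMsgLen--;
    return *usbMsgPtr++;
}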

@nerdralph
Contributor Author

I finished converting usbBuildTxBlock to asm, and now t85_aggressive is 1146 bytes.
https://github.com/nerdralph/micronucleus-firmware/blob/asm/firmware/releases/t85_aggressive20200830.hex

@cpldcpu
Member

cpldcpu commented Aug 31, 2020

👍 Yeah, it's just a lot of nested functions that can be flattened.

@nerdralph
Contributor Author

Today I decided to do some investigation into the reliability problems I'm having with micronucleus on a t85 at 16/16.5 MHz. While it's well known that OSCCAL tuning on the t85 doesn't get much closer than +/- 0.5%, what I found is that the frequency is only stable to within +/- 0.5% over short intervals (~1s). Here's the program and a sample output.
https://github.com/nerdralph/debugSerial/blob/master/examples/clockProfiler/clockProfiler.ino

bootup OSCCAL: 0x58
external clock signal detected
testing OSCCAL 0x54...
avg cycles/ms:15732, high-low:15776-15648
testing OSCCAL 0x55...
avg cycles/ms:15811, high-low:15904-15792
testing OSCCAL 0x56...
avg cycles/ms:15930, high-low:16016-15904
testing OSCCAL 0x57...
avg cycles/ms:16062, high-low:16176-16032
testing OSCCAL 0x58...
avg cycles/ms:16061, high-low:16160-16032
testing OSCCAL 0x59...
avg cycles/ms:16215, high-low:16288-16176
testing OSCCAL 0x5A...
avg cycles/ms:16348, high-low:16432-16288
testing OSCCAL 0x5B...
avg cycles/ms:16484, high-low:16576-16448
testing OSCCAL 0x5C...
avg cycles/ms:16537, high-low:16640-16512

@cpldcpu
Member

cpldcpu commented Sep 1, 2020

That seems to be a surprisingly strong effect. Since the trend is quite monotonic, could this be due to self-heating?
It could be quite interesting to monitor temperature and VDD using the internal ADC at the same time.
Maybe some information could be extracted to trigger re-sync?

@nerdralph
Contributor Author

nerdralph commented Sep 1, 2020

The resolution of the high/low count was only 16 cycles, as I was only sampling 1 phase with an 8-cycle sampling loop and then multiplying by 2. I updated it to sample both phases of the external clock. I'm using the 1kHz output of my Rigol DS1054Z, which is extremely stable and accurate to a few ppm. I also reduced the sampling window from 1s to 250ms, as I think it starts getting unreasonable if you were to retune OSCCAL more than 4x/s.
I tried it on a t13 (downclocked to 8M) as that's what I happened to have set up this morning. I'll try an actual t85 later, but I don't expect it to be much different since I've found the RC oscillators on both chips behave similarly.

testing OSCCAL 0x49
avg cycles/ms:7605, high-low: 7640-7592
testing OSCCAL 0x4A
avg cycles/ms:7690, high-low: 7728-7680
testing OSCCAL 0x4B
avg cycles/ms:7790, high-low: 7816-7768
testing OSCCAL 0x4C
avg cycles/ms:7883, high-low: 7912-7856
testing OSCCAL 0x4D
avg cycles/ms:7974, high-low: 8000-7952
testing OSCCAL 0x4E
avg cycles/ms:8068, high-low: 8096-8048
testing OSCCAL 0x4F
avg cycles/ms:8157, high-low: 8200-8144
testing OSCCAL 0x50
avg cycles/ms:8165, high-low: 8192-8128
testing OSCCAL 0x51
avg cycles/ms:8282, high-low: 8304-8240

Since the OSCCAL step size and the clock instability are cumulative, you can't count on much better than +/- 1% timing accuracy on a t85.

p.s. I've even considered dithering the clock in order to improve the tuning accuracy, but the frequency changes too quickly (within a few clock cycles). I suspect the clock design latches OSCCAL into an internal R-2R DAC driving a VCO.

@nerdralph
Contributor Author

nerdralph commented Sep 1, 2020

I just noticed something that might explain some of the unreliability I'm seeing while running on the internal RC oscillator:
https://github.com/micronucleus/micronucleus/blob/master/firmware/osccalASM.S#L153

I haven't been testing with a dedicated bus; there's a low-speed mouse with an 8 ms poll interval on the same bus. I think it's too limiting to require a dedicated USB bus for micronucleus, as many low-end PCs and laptops only have two. As well, you'd need to explain to most users how to determine which physical USB ports share one bus.

I think before I work on converting the main loop to asm, I'm going to write a new calibration function. What I use in picoboot is a proportional algorithm; a larger time difference results in a larger change to OSCCAL. I think I can get OSCCAL to the optimal value with 3-4 empty frames, and reliably discard timing from non-empty frames.

p.s. the consequences are particularly bad if there is data in a frame during the first 5 stages of the binary search. That would put the optimal OSCCAL value outside of the adjustment range of the nearest neighbor adjustment.
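
For illustration, the proportional idea looks roughly like the following in C (the names and scaling constants are made up for the sketch, not the picoboot or mn code; frames whose measurement is obviously bogus, e.g. a data packet mistaken for SOF, would be discarded before calling it):

#include <stdint.h>
#include <avr/io.h>

#define TARGET_COUNT   12000u   /* expected count for one 1 ms frame at the target clock (illustrative) */
#define COUNT_PER_STEP 60u      /* ~0.5% of TARGET_COUNT, roughly one OSCCAL step (illustrative) */

/* One proportional correction per measured SOF-to-SOF interval: a large
   frequency error produces a large OSCCAL change, so only a few empty
   frames are needed to converge instead of a full binary search. */
static void tuneOsccalProportional(uint16_t measured_count)
{
    int16_t error = (int16_t)measured_count - (int16_t)TARGET_COUNT;
    int16_t next  = (int16_t)OSCCAL - error / (int16_t)COUNT_PER_STEP;

    if (next < 0)   next = 0;      /* clamp; real code also has to respect the     */
    if (next > 255) next = 255;    /* discontinuous OSCCAL ranges on some parts     */
    OSCCAL = (uint8_t)next;        /* counted too many cycles => clock too fast => lower OSCCAL */
}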

@cpldcpu
Member

cpldcpu commented Sep 1, 2020

These collisions are not as much of an issue as one might think. Only old (1.1?) USB hubs forward all traffic to every port. Anything more modern should not present this issue. AFAIR the fix was introduced by shay based on his observations on a 1.1 hub.

@cpldcpu
Member

cpldcpu commented Sep 4, 2020

The A version seems to be a shrink with an updated oscillator. The old one from the non-A version looks quite different:

[image attachment: grafik]

@nerdralph
Contributor Author

Today I did some testing with a t88 configured to tune OSCCAL to 12 MHz, and it was failing to enumerate. I added some debug code to output the clock/1000, and found it was getting tuned to 12.5 MHz. I think the tuning bug has been there for years, as I had tried to get the t88 working several years ago but moved on to other projects after I couldn't get it working.
Unfortunately the new tuneOsccal code isn't working either, but I've only done a couple simple tests. I should have it debugged and working over the next 24 hours.

@nerdralph
Contributor Author

There were a couple of basic bugs in tuneOsccal that I found yesterday. Things like both scratch and shift being defined as r23 (doh!).

The most serious bug took me a while to find. I created a basic realtime debug log with a 6 Mbps UART so I could log the counter values from OSCCAL. With the debug logging in place, I realized tuneOsccal was getting called dozens of times because mn tries to tune during bus reset, not after. That bug was fixed with an extra two instructions to wait for the end of reset.
It's close to working correctly, with the last issue being a 0.5% difference in frame length from what I expected.

@nerdralph
Contributor Author

With 2 hours to spare in my 24hr target, I have success! I have a t88 running rock solid. I'm using a USBasp PCB since they are a cheap TQFP AVR USB development board. I use a red LED soldered to the pads intended for an 0805 polyfuse as a combined power LED and 3.35V "regulator", since it has a Vf of 1.65V at 5-10mA. Running at 3.35V means I don't need any zeners on UDM/UDP, and I use a 7.5k pullup to 5V on UDM. I've created a new config for it called t88_asp.
[image attachment: t88asp]

I've tried multiple connect/disconnects, --erase-only, flashing upgrade.hex, and have had zero errors so far. With LED blink it's 1322 bytes. I'll do a bit more testing with it tonight, and tomorrow I plan to test with the t84a.

@nerdralph
Contributor Author

Upgrade wasn't actually working; it was just reporting success but not writing anything because I had forgotten to enable the SELFPRGEN fuse. After enabling the fuse, I'm getting unreliable results flashing new code using the bootloader. Sometimes I get flash write error -1; other times I get no errors reported, but the code doesn't seem to have updated. I'm using OSCCAL_RESTORE_DEFAULT and OSCCAL_SLOW_PROGRAMMING.

@cpldcpu
Member

cpldcpu commented Sep 6, 2020

Nice! Do you mean Mega88 instead of tiny88? Because there are some tiny88 development boards out there (TINY-MH88) that seem to work fine. I guess all the ATmega support was developed on boards with crystals.

@nerdralph
Contributor Author

nerdralph commented Sep 6, 2020 via email

@nerdralph
Contributor Author

After modifying the SLOW_PROGRAMMING code to only slow the clock during erase/program, the stability seems better with the t88. Not quite perfect, as I got a recoverable error once during testing:

>> Eep! Connection to device lost during erase! Not to worry
>> This happens on some computers - reconnecting...
>> Reconnected! Continuing upload sequence...

Upgrade isn't working for me though. Even though the code successfully uploads, the old bootloader doesn't get replaced. I think this is a bug introduced in the micronucleus-firmware fork, as the upgrade for 2.04 worked fine when I tried it on a t85. Since flashing and erasing with the mn tool works fine, I'll leave the upgrade debugging to someone else.

@nerdralph
Contributor Author

I decided to have some fun today, and see what happens when modifying the flash at >12 MHz. It turns out it may even be a good thing.
http://nerdralph.blogspot.com/2020/09/flashing-avrs-at-high-speed.html

@cpldcpu
Member

cpldcpu commented Sep 6, 2020

Interesting!

One challenge I see is that you cannot really measure read margin using the user registers. So, it would be possible that you write the cell successfully, but the charge stored on it is not sufficient to allow for a 10 year data retention.

Maybe an easy hack could be to write twice? Because it should be fair to assume that the charge stored per time is a constant if all other conditions are the same.

@nerdralph
Contributor Author

nerdralph commented Sep 6, 2020

> One challenge I see is that you cannot really measure read margin using the user registers. So, it would be possible that you write the cell successfully, but the charge stored on it is not sufficient to allow for a 10 year data retention.

Don't forget charge is 1-(e^-x), so reducing the time by 1/3rd reduces the charge by only 24%. For retention the t88 spec is actually "20 years at 85°C / 100 years at 25°C". I suppose I could do an experiment to test retention by writing zeros to the page, then repeatedly verifying the page at high temperature. If I can find 26-30AWG wire with high-temperature insulation, then I could probably do the testing at 150-160C. However, extrapolating the lifetime/temperature curve, I'd need to run the AVR at 200C to get below 1 year of expected lifetime. Given that's above the melting point of my solder, I don't see that experiment happening anytime soon.

However, since I've already pushed it two orders of magnitude past its rated number of write cycles, I may try the repeated read test just to make sure it still holds the zeros for at least a few days.

edit: According to a Micron technical note I found, the degradation due to elevated temperature is more exponential than I thought. 150C unpowered overnight followed by a read test of < 24hrs may be good enough.

@cpldcpu
Member

cpldcpu commented Sep 6, 2020

> Don't forget charge is 1-(e^-x), so reducing the time by 1/3rd reduces the charge by only 24%

The writing mechanism most likely does not behave like an RC circuit. I don't know what kind of write mechanism is used in the AVR; it could be HCI or FN. Usually it is rate limited by the electron transfer process and is operated at strong overbias, so the gate charge does not really influence the rate.

Maybe it is possible to vary supply voltage to test for programming margin. Since they seem to have an internal charge pump, that may not work, though.

@nerdralph
Contributor Author

I'm trying to solve a small problem with the new OSCCAL tuner. At 8 MHz, one USB bit-time is 6.67 cycles. The counting loop takes 10 cycles, so it's possible to miss a single high bit on D+. That means occasionally it could mistake the EOP SE0 as SOF.
One idea I have is to bump OSCCAL to 255 at the start, and let it tune down. With OSCCAL at 255, the clock speed will be over 12 MHz. A slight downside is that it would require conditional compilation, since we don't want to do it on the tX5 and other devices with a discontinuous OSCCAL curve.
Another idea is to add 2 instructions to the loop (in & or) to take 2 samples, each 6 cycles apart. That guarantees at least one sample per USB bit-time.

Thoughts?

@nerdralph
Contributor Author

Here's what I've come up with for a double-sampling loop.

waitSOF:                                ; 12-cycle loop
    lsr shift
    in scratch, USBIN
    or scratch, r0
    adiw countL, fraction
    bst scratch, USBPLUS
    bld shift, 7
    in r0, USBIN
    sbrs scratch, USBMINUS
    tst shift
    brne waitSOF                        ; not idle last 7 loops

@cpldcpu
Member

cpldcpu commented Sep 9, 2020

hm... why do you want to run osccal at 8 MHz? There is no V-USB implementation at that frequency?

Maybe another idea to address the issue: how about repeating the fixed-point iteration for more cycles and also accumulating a fractional count of OSCCAL? Then the error would be smoothed out over time.

@nerdralph
Contributor Author

> hm... why do you want to run osccal at 8 MHz? There is no V-USB implementation at that frequency?
>
> Maybe another idea to address the issue: how about repeating the fixed-point iteration for more cycles and also accumulating a fractional count of OSCCAL? Then the error would be smoothed out over time.

The target speed is not 8 MHz, it's 12 on devices like the t88 and t84. They start off at 8 MHz. The problem is actually more likely than I stated above. 1 low-speed bit (666.7 ns) at 12 MHz is 8 cycles, so at 8 MHz it's only 5.33 cycles.

The error from misinterpreting EOP as SOF could be more than an order of magnitude (if the subject EOP is in the first 1/10th of the frame).

@nerdralph
Contributor Author

> Here's what I've come up with for a double-sampling loop.
>
> waitSOF:                                ; 12-cycle loop
>     lsr shift
>     in scratch, USBIN
>     or scratch, r0
>     adiw countL, fraction
>     bst scratch, USBPLUS
>     bld shift, 7
>     in r0, USBIN
>     sbrs scratch, USBMINUS
>     tst shift
>     brne waitSOF                        ; not idle last 7 loops

This version fixed the bug with missed D+ pulses, but introduced a problem with frequent missed SOFs. The problem was that checking the USBMINUS bit of the two samples that were OR'd together meant that only 3 SE0 samples in a row were guaranteed to be detected. I changed it to sbrs r0, USBMINUS so the check for SE0 happens once every loop (12 cycles). With this change, and with starting OSCCAL at 224 to ensure the initial speed is >12 MHz, the t88 I've been testing has been working perfectly. Dozens of re-enumerations, erase, and upload operations have worked error-free, even using a WRITE_SLEEP time of 3ms and fast-mode.

I'd prefer to avoid the extra code in main to boost OSCCAL when (OSCCAL_HAVE_XTAL == 0) && (F_CPU == 12000000), so I've been thinking about writing another version of the tuner that has a tight 6-cycle loop looking for SE0, and then checks PCIF to see if there was a D+ pulse. This would allow the tuner to start at 8 MHz and not risk missing the 2-bit SE0 for SOP.

@cpldcpu
Member

cpldcpu commented Sep 12, 2020

You could use the pin change interrupt to detect D+ pulses. Then you could use a fast loop to detect SOF in D-.

btw, one thing that could prove a bit problematic with the current loops is a lack of timeout. There were some issues with micronucleus when people had noise on the USB pins even when the device was not connected to the host. In that case sometimes osccal is invoked. This was especially an issue with Olimex devices - no idea why.

-> I see that you still have the escape via BRNE. Good :)

@nerdralph
Contributor Author

After doing some work on usbasp firmware, I'm back to hacking on mn. One of my ideas is optimizing or even eliminating the CRC16 code. A couple weeks ago I combined reading descriptors from flash with the CRC function.
https://github.com/nerdralph/micronucleus-firmware/blob/asm/firmware/usbdrv/usbdrvasm.S#L205

I've been thinking that it should be possible to calculate the CRC at compile time. It would mean breaking up the descriptors into 8-byte chunks, and appending the CRC to each chunk. That would add 12 bytes to descriptors, and save 42 bytes of the CRC code, for a net gain of 30 bytes.

I've written a small C++ program to test the concept:

#include <stdint.h>

int constexpr sum(const uint8_t arr[], int n)
{
    int sum = 0;
    for (int i = 0; i < n; sum += arr[i++]);
    return sum;
}

int main()
{
    uint8_t arr[] = {1, 2, 3};
    return sum(arr, sizeof(arr));
}

I tested it with gcc 5.4.0, which needs C++14 support enabled for constexpr functions.
avr-gcc -std=c++14 -mmcu=avr25 -Os -c const_sum.cpp

I confirmed that the constexpr function is evaluated at compile time:

00000000 <main>:
   0:   86 e0           ldi     r24, 0x06       ; 6
   2:   90 e0           ldi     r25, 0x00       ; 0
   4:   08 95           ret

This would have the added benefit of eliminating the runtime cost of calculating the CRC.
Instead of using a constexpr function, another possibility is to write a parser that reads the descriptor definitions, and then generates the segmented descriptors with the CRC added. The big problem is that although it is technically possible, I can't see a way of implementing this idea without an impact on code readability.
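
For reference, the constexpr route could look roughly like this (untested sketch, C++14; the chunk bytes below are placeholders, not the real descriptors):

#include <stdint.h>

// USB data CRC-16: poly 0x8005 sent LSB-first (reflected 0xA001), init 0xFFFF,
// complemented result.  With constexpr it can be evaluated entirely at build time.
constexpr uint16_t usb_crc16(const uint8_t *data, int n)
{
    uint16_t crc = 0xFFFF;
    for (int i = 0; i < n; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit)
            crc = (crc & 1) ? (uint16_t)((crc >> 1) ^ 0xA001) : (uint16_t)(crc >> 1);
    }
    return (uint16_t)~crc;
}

// One hypothetical 8-byte descriptor chunk with its CRC computed at compile time.
constexpr uint8_t chunk[8] = {0x12, 0x01, 0x10, 0x01, 0x00, 0x00, 0x00, 0x08};
constexpr uint16_t chunk_crc = usb_crc16(chunk, sizeof(chunk));
static_assert(chunk_crc != 0, "forces compile-time evaluation");

int main() { return chunk_crc & 0xFF; }

A static_assert could also reject any chunk whose CRC would itself require bit stuffing (six or more consecutive 1 bits).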

@nerdralph
Contributor Author

I think I've really gone down the rabbit hole now. After so many days of reading usb code (both vUSB and usbtiny) I started writing my own. Here's the plans:

  1. really small, probably half the size of vUSB. I plan to call it picoUSB.
  2. 7 cycles/bit so it will work at 10.5MHz, well within the OSCCAL tuning range of 8MHz AVRs
  3. to keep it really simple, no bit stuffing (none of the data will have more than 5 sequential "1" bits). Using device class 0 instead of 0xFF doesn't seem to break anything.
  4. if I can figure out a clean way of doing it, compile-time CRC calculation

I already have a prototype generating a valid ACK packet (sync, PID, & EOP) in less than 40 asm instructions. Rx will need a bit more code, probably 50-60 instructions.
It should be slightly more tolerant of timing errors, because 7 cycles permits sampling in the center of the bit instead of slightly off-center with 8 cycles per bit at 12 MHz.

Here's the core of the tx code:

; 7-cycle/bit tx loop, cnt = 2 * # of bits to tx
nextByte:
    breq txEoP
    ld data, X+
tx2bit:
    sbrs data, 0
    out USBPORT-2, mask                 ; even bits
    lpm
    lsr data
    subi cnt, -4
    sbrs data, 0
    out USBPORT-2, mask                 ; odd bits
    brhc nextByte                       ; subi modifies H
    lsr data
    nop
    rjmp tx2bit

@cpldcpu
Member

cpldcpu commented Sep 23, 2020

That's a cool idea! How do you prevent bit stuffing? By only using one nibble of the payload?

I guess it's a bit more tricky to avoid bit stuffing in the CRC. If it is possible to generate all CRC codes at compile time, one could add a bit of salt, I guess?

@nerdralph
Contributor Author

Sending inverted nibbles like PIDs is probably the easiest way of sending data. An 8/6 GCR-like encoding would be preferable if I can come up with a small decoding algorithm. In either case it would allow for some error checking on the device side. My reason for wanting more than 4 decoded data bytes per data packet is so I can send R1:R0, Z, and 1 byte for SPMCSR in each user control packet. It greatly simplifies the bootloader protocol, and is similar to a technique I used in my first picoboot bootloader that made it possible to fit in just 64 bytes.

The CRC isn't so hard, since the USB responses from the target never change. If mn had a flash read command for verify, then the CRC problem would be more complicated. In my mn fork, there are 6 different data packets the device needs to send: 3 for the device descriptor, 2 for the configuration descriptor, and one for the mn config. Looking at the USB traces, I noticed there was only 1 stuffed bit, and that was for the 0xFF device protocol field. I think I can get rid of the mn config packet by using the class, subclass, and protocol fields of the device descriptor. Both the 2nd signature byte and total flash size can be encoded in just 3 bits. The upper nibble is always 9, and the lower nibble is the flash size in kB as 2^N. That's why the t88, t84, t85, & m8 all have 0x93 as their middle signature byte. They all have 2^3 = 8 kB of flash. Page size only needs a couple of bits, and user flash space available could be efficiently encoded as total flash size minus bootloader size in pages.
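
Spelling out that encoding as a hypothetical decode helper (just to illustrate the bit packing, not code from the fork):

#include <stdint.h>

// Middle signature byte has the form 0x9N, where the flash size in kB is 2^N,
// e.g. 0x93 -> 8 kB on the t88, t84, t85, and m8.
constexpr unsigned flashSizeKB(uint8_t sigMiddle) { return 1u << (sigMiddle & 0x0F); }

static_assert(flashSizeKB(0x93) == 8, "0x93 middle signature byte means 8 kB of flash");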

@cpldcpu
Member

cpldcpu commented Sep 24, 2020

No need for a complex line coding, I think. You just need to insert two zeroes into each byte, e.g. at bits 7 and 0, so that no adjacent bytes will cause a long run. Using nibbles may be a bit easier to handle for block transfers, though.

Transferring each byte address together with each cell to be written is a bit dangerous from a safety perspective, in my opinion, because it is much more difficult to perform a consistency check on the client side. It would easily be possible to brick the device. Maybe just transferring the block offset is safer?

@nerdralph
Contributor Author

I've got a prototype for build-time CRC generation working.
https://github.com/nerdralph/micronucleus-firmware/blob/asm/firmware/gencrc.c
Next I'll move the descriptor definitions out of usbdrv.c and into their own file, with the appropriate CRC inlined after every 8 data bytes.

After reviewing the enumeration transactions again, I noticed the 3rd chunk (2 bytes) of the device descriptor has a stuffed bit in the CRC (36671/0x8F3F). I've experimented with adding a surplus byte to the descriptor for salting the CRC. It's been enumerating fine on Windows 7 and on Linux with an extra byte at the end of the descriptor, and with bLength left as 18. It seems the USB drivers ignore the extra data. I suspect MacOS will work as well, but it would be good to have someone test it.

@cpldcpu
Member

cpldcpu commented Oct 5, 2020

Nice! So the compiler automatically infers compile-time evaluation from the fact that only consts are used? Or does it require an additional keyword for the function declaration?

How do you deal with cases where the CRC leads to bit stuffing? Asserts?

@nerdralph
Contributor Author

> Nice! So the compiler automatically infers compile-time evaluation from the fact that only consts are used? Or does it require an additional keyword for the function declaration?
>
> How do you deal with cases where the CRC leads to bit stuffing? Asserts?

It might be possible to inline the CRCs with some constexpr and lto wizardry, but I'm starting out simple with the gencrc program. I'll have it generate a header file that defines the CRCs, and then I'll add checks for making sure the CRCs don't need bit stuffing.

I've also been thinking of changing the descriptor structures to combine the device and configuration descriptors. The device descriptor is always the first descriptor sent after bus reset, and then the configuration descriptor. By making the enumeration code stateful, sequencing through the descriptor chunks for non-status IN requests, I think I can simplify and streamline the rx handling code.

@cpldcpu
Member

cpldcpu commented Oct 7, 2020

Ah ok, that is an offline calculation. I guess that's more than sufficient. I was just confused for a bit, because I thought this code would use constant function evaluation, but I did not notice any extra keywords.

@cpldcpu
Member

cpldcpu commented Jan 31, 2021

Hi @nerdralph,

it looks like we will finally get the repository cleaned up now. I believe we can release MN 2.5 soon. For the version after that, it would make a lot of sense to migrate some of your more advanced changes.

@nerdralph
Contributor Author

Although I haven't pushed any MN updates in the last few months, I've been mulling over some ideas about MN while I work on other projects. The latest idea has to do with the code to sync up with the incoming USB packet. I realized my first attempt to get single-cycle accuracy isn't any better than the original code (which is accurate to 2 cycles). When it comes to improving reliability on targets, single-cycle accuracy would reduce the sampling error from 0.3125% to 0.156% for a 12 MHz (8 cycles/bit) implementation. Since the total error margin of half a bit over a ~80-bit packet is 0.625%, single-cycle accuracy should in theory make a material difference on targets running on the RC oscillator.
