Post by Rick C
Post by a***@math.uni.wroc.pl
I am using laptops to control test fixtures via a USB serial port. I'm looking at combining many test fixtures in one chassis, controlled over one serial port. The problem I'm concerned about is not the speed of the bus, which can range up to 10 Mbps. It's the interface to the serial port.
The messages are all short, around 15 characters. The master PC addresses a slave and the slave promptly replies. It seems this message-level handshake creates a bottleneck in every interface I've looked at.
FTDI has a high-speed USB cable that is likely limited by the 8 kHz polling rate. So the message and response pair would be limited to 4 kHz. Spread over 256 end points, that's only 16 message pairs a second to each target. That might be workable if there were no other delays.
While investigating other units, I found some Ethernet to serial devices and found some claim the serial port can run at up to 3.7 Mbps. But when I contacted them, they said each message has a 1 ms delay, so that's only 500 pairs per second, or maybe 2 pairs per second per channel. That's slow!
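As a rough check of the per-target arithmetic above, here is a minimal Python sketch. The function name and the "two slots per command/reply pair" model are assumptions made for illustration; the 8 kHz, 1 ms and 256-target figures come from the text.

```python
def pairs_per_target_per_second(slots_per_second, targets, slots_per_pair=2):
    """Command/reply pairs each target gets per second.

    Assumes (for illustration) that a command and its reply each
    consume one polling slot or one fixed per-message delay.
    """
    return slots_per_second / slots_per_pair / targets

# 8 kHz USB polling shared by 256 targets
usb_rate = pairs_per_target_per_second(8000, 256)
# 1 ms delay per message (1000 messages/s) shared by 256 targets
eth_rate = pairs_per_target_per_second(1000, 256)

print(f"USB: {usb_rate:.1f} pairs/s per target")       # USB: 15.6 pairs/s per target
print(f"Ethernet: {eth_rate:.2f} pairs/s per target")  # Ethernet: 1.95 pairs/s per target
```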
They have multi-port boxes, up to 16, so I've asked them if they will run with a larger aggregate rate, or if the delay on one port impacts all of them.
I've also found another vendor with a similar product, and I've asked about that too.
I'm surprised and disappointed the Ethernet devices have such delays. I would have expected them to work better given their rather high prices.
I could add a module to interface between the PC serial port and the 16 test fixtures. It would allow the test application on the PC to send messages to all 16 test fixtures in a row. The added module would receive the 16 responses on separate lines and stream them out to the PC port as one continuous message. This is a bit messier, since the 16 lines from this new module would need to be marked: they have to plug into the right test fixture each day.
Or, if I could devise a manner of assigning priority, the slaves could all manage the priority themselves and still share the receive bus to the serial port on the PC. Again, this would look like one long message to the port and the PC. The application program would see the individual messages and parse them separately. Many of the commands from the PC could actually be shortened to a single, broadcast command since the same tests are done on all targets in parallel. So using an RJ-45 connector, there would be the two pairs for the serial port, and two pairs for the priority daisy-chain.
I guess I'm thinking out loud here.
LOL, so now I'm leaning back toward the USB based FTDI RS-422 cable and a priority scheme so every target gets many more commands per second. I just ran the math, and this would be almost 20,000 bits per command. Try to run that at 8,000 times per second and a 100 Mbps Ethernet port won't keep up.
I've written to FTDI about the actual throughput I can expect with their cables. We'll see what they come back with.
I am not sure if you get that there are two issues: throughput and latency.
Of course I'm aware of it. That's the entirety of the problem.
Post by a***@math.uni.wroc.pl
If you wait for an answer before sending the next request you will be bounded
by latency.
Until I contacted the various vendors, I had no reason to expect their hardware to have such excessive latencies. Especially in the Ethernet converter, I would have expected better hardware. Being an FPGA sort of guy, I didn't even realize they would not implement the data path in an FPGA.
How do you know that data path is not in hardware? One question is
if hardware is able to operate with low latency. Another is if it
should. And frequently the answer to the second question is no, it should
not try to minimize latency. Namely, Ethernet has minimal packet
size which is about 60 characters. If you send each character in
separate packet, then there would be very bad utilization of media.
So, converter is expected to wait till there are enough characters
to transmit. Note that at 115200 bits/s delay of 1ms is roughly
11 characters, so not so big. At lower rates delay becomes less
significant and at higher rates people usually care more about
throughput than latency. And do not forget that Ethernet is
shared medium, even if convertor could manage to transmit with
lower latency within available Ethernet bandwidth, it could
do that only at cost of other users (possibly second convertor).
And from a bit different point of view: normally there will be
software in the path, giving you 0.1ms of latency on good modern
unloaded hardware and much more in worse conditions. Also,
Ethernet likes packets of about 1400 bytes size. On 10 Mbit/s
Ethernet this is a bit over 1 ms for transmission of packet.
If network is not dedicated to convertor such packets are likely
to appear from time to time and convertor has to wait till
such packet is fully transmitted and only then gets chance
to transmit. So, you should regularly expect delays of order
1ms. Of course, with 100 Mbit/s Ethernet or gigabit one
media delays are smaller, but serial convertors are frequently
deployed in legacy contexts where 10 Mbit/s matter.
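The two estimates above (how many characters fit in a 1 ms delay, and how long a large packet occupies a 10 Mbit/s link) can be checked with a small Python sketch. The function names are made up for illustration, and 10 bits per character assumes 8N1 framing (start bit, 8 data bits, stop bit); Ethernet framing overhead is ignored.

```python
def chars_in_delay(baud, delay_s, bits_per_char=10):
    """Character times that fit into a delay at a given serial rate (8N1 assumed)."""
    return baud * delay_s / bits_per_char

def frame_time_ms(frame_bytes, link_bits_per_s):
    """Time in ms to clock frame_bytes onto the link, ignoring framing overhead."""
    return frame_bytes * 8 / link_bits_per_s * 1e3

print(f"{chars_in_delay(115200, 1e-3):.1f} characters in 1 ms at 115200 bits/s")
print(f"{frame_time_ms(1400, 10e6):.2f} ms for 1400 bytes at 10 Mbit/s")
print(f"{frame_time_ms(1400, 100e6):.3f} ms for 1400 bytes at 100 Mbit/s")
```

So a converter that waits behind one large packet on a shared 10 Mbit/s segment sees a delay of order 1 ms, and about a tenth of that on 100 Mbit/s.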
Post by Rick C
I found one company that does use an FPGA for a USB to serial adapter, but I expect the PC side USB software may be problematic as well. It makes you wonder how they ever get audio to work over USB. I guess lots of buffering.
Audio is quite different than serial. Audio can be pre-scheduled
but in general you do not know when there will be traffic on
serial port.
Post by Rick C
Post by a***@math.uni.wroc.pl
OTOH if you fire several requests without waiting, then
you will be limited by throughput.
Yes, but the current protocol using a single target works with one command at a time. In ignorance of the many problems with serial port converters, I was planning to use the same protocol. I have several new ideas, including various ways to combine messages to multiple targets, into one message. Or... I could move the details of the various tests into the target FPGAs, so they receive a command to test function X, rather than the multiple commands to write and read various registers that manipulate the details being tested.
Concerns with this include the need to reload all the FPGAs any time they are updated with a new test feature or bug fix. That's probably 64 FPGAs. I could use one FPGA per test fixture, for a total of 16, but that makes the routing a bit more problematic. Even 16 is a PITA.
Also, I've relied on monitoring the command stream to spot bugs. That would require attaching a serial debugger of some sort to the interface to the UUT, and the internal test controller would be much harder to observe. Currently, that is controlled by commands as well.
Post by a***@math.uni.wroc.pl
With relatively cheap convertors
on Linux to handle 10000 roundtrips for 15 bytes messages I need
CH340 2Mb/s, waiting, 6.890s
That's 11.3 per target, per second. (128 targets)
Post by a***@math.uni.wroc.pl
CH340 2Mb/s, overlapped 1.058s
That's pretty close to 74 per target, per second.
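The per-target figures quoted here follow directly from the measured times. A minimal Python sketch of the conversion, with a function name chosen only for illustration and the 128-target count taken from the text:

```python
def per_target_rate(round_trips, elapsed_s, targets):
    """Round trips per second available to each target on a shared link."""
    return round_trips / elapsed_s / targets

waiting = per_target_rate(10000, 6.890, 128)     # CH340, waiting for each reply
overlapped = per_target_rate(10000, 1.058, 128)  # CH340, overlapped

print(f"waiting: {waiting:.1f}/s, overlapped: {overlapped:.1f}/s")
# waiting: 11.3/s, overlapped: 73.8/s
```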
I used to use the CH340 devices, but we had intermittent lockups of the serial port when testing all day long. I switched to FTDI and that went away. I think you told me you have no such problems. Maybe it's the CH340 Windows serial drivers.
Well, my use is rather light. Most is for debugging at say 9600 or
115200. And when plugged in convertor mostly sits idle. I previously
wrote that CH340 did not work at 921600. More testing showed that
it actually worked, but speed was significantly different, I had to
set my MCU to 847000 to communicate. This could be bug in Linux driver
(there is rather funky formula connecting speed to parameters
and it looks easy to get it wrong). Similarly, when CH340 was set to 576800
I had to set MCU to 541300. Even after matching speed at nominal
576800, 921600 and 1152000 test time was much (more than 10 times)
higher than for other rates (I only tested 1 character messages at those
rates, did not want to wait for full test). Also, 500000 was significantly
slower than 460800 (but "merely" 2 times slower for 1 character messages
and catching up with longer messages). Still, ATM CH340 looks
reasonably good.
Remark: I bought all my convertors from Chinese sellers. IIUC
FTDI chip is faked a lot, but others too. Still, I think they
show what is possible and illustrate some difficulties.
Post by Rick C
Post by a***@math.uni.wroc.pl
CP2104 2Mb/s, waiting, 2.514s
CP2104 2Mb/s, overlapped 1.214s
I don't know what the CP2104 is.
It is a chip by Silicon Laboratories. Datasheet gives contact address
in Austin, TX.
Post by Rick C
I'm not certain what "overlapped" means in this test. Did you just continue to send 15 byte messages with no delays 10,000 times?
No. My slave simply returns back each received character. There is
some software delay but it should be less than 2us. So even waiting
test has some overlap at character level. To get more overlap above
I cheated: my test program was sending 1 more character than it should.
So sent message was 16 bytes, read was 15. After reading 15 another
batch of 16 was sent and so on. In total there were 10000 more
characters sent than received. My hope was that OS would read
and buffer excess characters, but it seems that at least for
CP2104 they cause trouble. My current guess is that OS is
reading only when requested, but I did not investigate deeper...
Post by Rick C
Since you are in the mood for testing, what happens if you run overlapped, with 128 messages of 15 characters and wait for the replies before sending the next batch? Also, if you don't mind, can you try 20 character messages?
OK, I tried modified version of my test program. It first sends
k messages without reading anything, then goes to main loop where
after sending each message it reads one. At the end there is a tail loop
which reads last k messages without sending anything. So, there
are k + 1 messages in transit: after sending message k + i program
waits for answer to message i. In total there are 10000 messages.
Results are:
CH340      15 char message   20 char message
k = 0      6.869s            7.163s
k = 1      4.682s            1.320s
k = 2      0.992s            1.320s
k = 3      0.991s            1.319s
k = 4      0.991s            1.320s
k = 5      0.990s            1.319s
k = 8      0.992s            1.320s
k = 12     0.990s            1.320s
k = 20     0.992s            1.319s
k = 36     0.991s            1.321s
k = 128    0.991s            1.319s
CP2104     15 char message   20 char message
k = 0      2.508s            3.756s
k = 1      1.897s            1.993s
k = 2      1.668s            2.087s
k = 3      1.486s            1.887s
k = 4      1.457s            1.917s
k = 5      1.559s            1.877s
k = 8      1.455s            1.803s
k = 12     1.337s            1.501s
k = 20     1.123s            1.499s
k = 36     1.125s            1.502s
k = 128    reliably stalled; there were random stalls in other cases
FTDI232R, 2 Mbit/s
           15 char message   20 char message
k = 0      5.478s            3.755s
k = 1      4.929s            3.030s
k = 2      2.506s            3.339s
k = 3      2.459s            2.020s
k = 4      1.708s            1.061s
k = 5      1.671s            1.032s
k = 8      0.764s            1.021s
k = 12     0.772s            1.014s
k = 20     0.763s            1.009s
k = 36     0.758s            1.007s
k = 128    0.757s            1.008s
FTDI232R, 3 Mbit/s
           15 char message   20 char message
k = 0      8.216s            10.007s
k = 1      5.006s            4.344s
k = 2      3.338s            1.602s
k = 3      2.406s            1.444s
k = 4      1.766s            1.316s
k = 5      1.599s            1.673s
k = 8      1.040s            1.327s
k = 12     1.071s            1.312s
With k = 20, k = 36 and k = 128 communication stalled.
With PL2303HX at 2 Mbit/s I had a lot of transmission errors,
so did not test speed.
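For readers who want to reproduce the shape of the test described above (send k messages ahead, then send one and read one, then drain the last k), here is a minimal Python sketch. It is not the original test program: the function names are made up, and a local socket pair with an echo thread stands in for the serial port and the echoing MCU, so the timings it gives say nothing about real converters.

```python
import socket
import threading

def recv_exact(sock, n):
    """Read exactly n bytes from sock or raise if the peer closes early."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed before full reply arrived")
        buf.extend(chunk)
    return bytes(buf)

def windowed_round_trips(sock, total, k, msg):
    """Exchange `total` messages keeping up to k + 1 in transit; return replies read."""
    k = min(k, total)
    received = 0
    for _ in range(k):                 # prologue: send k messages, read nothing
        sock.sendall(msg)
    for _ in range(total - k):         # main loop: send one, then read one
        sock.sendall(msg)
        if recv_exact(sock, len(msg)) != msg:
            raise ValueError("reply does not match request")
        received += 1
    for _ in range(k):                 # tail: read the last k replies
        if recv_exact(sock, len(msg)) != msg:
            raise ValueError("reply does not match request")
        received += 1
    return received

def echo(sock):
    """Stand-in for the slave: return every received byte unchanged."""
    while True:
        data = sock.recv(4096)
        if not data:
            break
        sock.sendall(data)

def demo(total, k, msg=b"0123456789ABCDE"):
    """Run one windowed test over a local socket pair (hypothetical harness)."""
    master, slave = socket.socketpair()
    worker = threading.Thread(target=echo, args=(slave,), daemon=True)
    worker.start()
    try:
        return windowed_round_trips(master, total, k, msg)
    finally:
        master.close()                 # echo thread sees EOF and exits
        worker.join(timeout=2)
        slave.close()

print(demo(total=1000, k=8))           # number of replies read
```

With a real port the same loop would block forever if a single character were lost, which matches the stalls reported above; a production version would need read timeouts.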
Post by Rick C
Post by a***@math.uni.wroc.pl
The other end was STM32F030, which was simply replaying back
received characters.
Note: these results are not fully comparable. Apparently CH340
will silently drop excess characters, so for overlapped operation
I simply sent more characters than I read. OTOH CP2104 seems to
stall when its receive buffer overflows, so I limited overlap to
avoid stalls. Of course real application would need some way
to ensure that receive buffers do not overflow.
Wait, what? How would overlapped operation operate if you have to worry about lost characters???
I'm not sure what "stall" means. Did it send XOFF or something?
My program uses blocking system calls, it did not finish in reasonable
time. I did not investigate deeper. ATM I assume that OS/driver
is correct so that my program would get characters if convertor
delivered them. I also assume that MCU is fast enough to avoid
loss of any character (character processing should be less than
2us, at 2 Mbit/s I have 5us per character). In initial test
I have sent more characters than I wanted to receive, so loss of
some characters would not stop the program (OK, loss of more than
10000 would be too much). In this batch of tests I sent exactly
the number of characters that I wanted to receive, so loss of
any would cause infinite wait.
Post by Rick C
Any idea on what size of aggregated messages would prevent character loss? That's kind of important.
Each convertor has finite transmission and receive buffers.
According to datasheet CP2104 has a 576 character receive buffer.
For others I do not have numbers handy, but I would expect something
between 200 characters and a kilobyte. When characters arrive via
serial port they fill receive buffers. Driver/OS/user program have
to promptly read them. When doing first test my hope was that
OS/driver will read characters from convertor and store them
in system buffer. But then I saw stalls with CP2104. After I have
seen this my guess was that in my test I overflowed CP2104 receive
buffer (in my initial test I was sending 10000 characters more than
I received, so much more than receive buffer size). However I have
seen stalls with k = 18 and message size 15. And even with k = 0 and
message size 20. In both cases new test program guaranteed that amount
of data in transit was much smaller than stated buffer size.
So, at least for CP2104 there must be some other reason.
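To put numbers on "much smaller than stated buffer size", here is a small Python sketch of the worst-case bound. It assumes, pessimistically, that the host drains nothing until it reads, so all k + 1 messages in transit may sit in the convertor's receive buffer at once. The function names are made up for illustration; 576 is the CP2104 figure quoted above.

```python
def bytes_in_transit(k, msg_size):
    """Worst-case bytes outstanding when k + 1 messages are in transit."""
    return (k + 1) * msg_size

def max_window(buffer_bytes, msg_size):
    """Largest k such that k + 1 whole messages still fit in the buffer."""
    return buffer_bytes // msg_size - 1

CP2104_RX_BUFFER = 576  # characters, from the datasheet figure quoted above

print(bytes_in_transit(18, 15))          # 285, well under 576
print(bytes_in_transit(0, 20))           # 20
print(max_window(CP2104_RX_BUFFER, 15))  # 37
print(max_window(CP2104_RX_BUFFER, 20))  # 27
```

Both stalling cases sit far below the bound, which supports the conclusion that buffer overflow alone does not explain the CP2104 stalls.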
Post by Rick C
Post by a***@math.uni.wroc.pl
So, you should be easily able to handle 10000 round trips
per second provided there is enough overlap. For this
you need to ensure that only one device is transmitting to
PC. If you have several FPGA-s on a single board, coordinating
them should be easy. Of course, you need some free pins and
extra tracks. I would use single transceiver per board,
depending on coordination to ensure that only one FPGA
controls transceiver at given time. Anyway, this would
allow overlapped transmission to all devices on single
board. With multiple boards you would need some hardware
or software protocol to decide which board can transmit.
On hardware side a single pair of extra wires could
carry needed signals (that is your "priority daisy chain").
Yes, the test fixture boards have to be set up each day and to make it easy to connect, (and no backplane) I was planning to have two RJ-45 connectors on the front panel. A short jumper would string the RS-422 ports together.
My thinking, if the aggregated commands were needed, was to use the other pins for "handshake" lines to implement a priority chain for the replies. The master sets the flag when starting to transmit. The first board gives all the needed replies, then passes the flag on to the next board. When the last reply is received by the master, the flag is removed and the process is restarted.
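The flag-passing idea can be modelled in a few lines of Python to check that only one board ever drives the shared reply pair. This is a behavioural sketch only: the `Board` class, its field names and `run_reply_cycle` are hypothetical, and real hardware would need timing and fault handling that is not modelled here.

```python
from dataclasses import dataclass

@dataclass
class Board:
    replies: list          # replies this board owes the master, in order
    flag_in: bool = False  # priority flag seen on the daisy-chain input
    done: bool = False     # True once this board has sent all its replies

def run_reply_cycle(boards):
    """Return (bus_traffic, max_simultaneous_drivers) for one command cycle."""
    bus = []
    max_drivers = 0
    if boards:
        boards[0].flag_in = True          # master sets the flag
    while any(not b.done for b in boards):
        drivers = [i for i, b in enumerate(boards) if b.flag_in and not b.done]
        max_drivers = max(max_drivers, len(drivers))
        i = drivers[0]
        bus.extend(boards[i].replies)     # board i drives the shared reply pair
        boards[i].done = True
        if i + 1 < len(boards):
            boards[i + 1].flag_in = True  # pass the flag to the next board
    for b in boards:                      # master removes the flag; chain resets
        b.flag_in = False
        b.done = False
    return bus, max_drivers

chain = [Board([b"A1", b"A2"]), Board([b"B1"]), Board([b"C1"])]
traffic, drivers = run_reply_cycle(chain)
print(traffic, drivers)   # [b'A1', b'A2', b'B1', b'C1'] 1
```

The PC sees one continuous stream in chain order, so the application can split it back into individual replies by position.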
Post by a***@math.uni.wroc.pl
As others suggested you could use multiple convertors for
better overlap. My convertors are "full speed" USB, that
is they are half-duplex 12 Mb/s. USB has significant
protocol overhead, so probably two 2 Mb/s duplex serial
converters would saturate single USB bus. In desktops
it is normal to have several separate USB controllers
(buses), but that depends on specific motherboard.
Theoretically, when using "high speed" USB converters,
several could easily work from single USB port (provided
that you have enough places in hub(s)).
I've been shying away from USB because of the inherent speed issues with small messages. But with larger messages, hi-speed converters can work, I would hope. Maybe FTDI did not understand my question, but they said even on the hi-speed version, their devices use a polling rate of 1 ms. They call it "latency", but since it is adjustable, I think it is the same thing. I asked about the C232HD-EDHSP-0, which is a hi-speed device, but also mentioned the USB-RS422-WE-5000-BT, which is an RS-422, full-speed device. So maybe he got confused. They don't offer many hi-speed devices.
But the Ethernet implementations also have speed issues, likely because they are actually software based.
The issues are more fundamental: both in USB and Ethernet there
is per message/packet overhead. Low latency means sending data
soon after it is available, which means small packets/messages.
But due to overheads small packets are bad for throughput.
So designers have to choose what they value more and in both
cases the whole system is normally optimized for throughput.
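The throughput cost of small packets can be made concrete for Ethernet. The Python sketch below assumes each message travels in its own UDP/IPv4 datagram, which is an assumption for illustration (a real converter may well use TCP, with larger headers). The per-frame constants are the standard Ethernet ones: 8 bytes preamble and start delimiter, 14 bytes header, 4 bytes FCS, 12 bytes inter-frame gap, and a 46 byte minimum payload.

```python
ETH_OVERHEAD = 8 + 14 + 4 + 12   # preamble/SFD, MAC header, FCS, inter-frame gap
ETH_MIN_PAYLOAD = 46             # shorter payloads are padded to this size
IP_UDP_HEADERS = 20 + 8          # IPv4 + UDP headers (assumed transport)

def wire_bytes(user_bytes):
    """Bytes of link time consumed by one datagram carrying user_bytes."""
    payload = max(user_bytes + IP_UDP_HEADERS, ETH_MIN_PAYLOAD)
    return payload + ETH_OVERHEAD

def efficiency(user_bytes):
    """Fraction of link time spent on user data."""
    return user_bytes / wire_bytes(user_bytes)

for size in (1, 15, 1400):
    print(f"{size:5d} user bytes -> {wire_bytes(size):5d} on the wire, "
          f"{efficiency(size):.1%} efficient")
```

Under these assumptions a 15 character message costs 84 byte times on the wire, under 20% efficiency, while a 1400 byte payload is above 95%. That is the trade-off the paragraph above describes.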
Post by Rick C
Post by a***@math.uni.wroc.pl
An extra thing: there are reasonably cheap PC compatible
boards, supposedly they are cheaper and easier to buy
than Raspberry Pi (but I did not try to buy them). If you
need really large scale you could have a single such board
per batch of devices and run copy of your program there. And
a single laptop connecting to satellite board via ethernet
and collecting results.
Yeah, but more complexity. Maybe it doesn't need to run so fast. I've been working with the idea that it is not a hard thing to do, but I just keep finding more and more problems.
The one approach that seems to have the best chance at running very fast, is a PCIe board with 4 or 8 ports. I'd have to use an embedded PC, or at least a mini-tower or something. Many of these seem to have rather low end x86 CPUs. There's also the overhead of the PC OS, so maybe I need to do some testing before I worry with this further. I have one FTDI cable. I can use an embedded MCU board for the other end I suppose. It will give me a chance to get back into Mecrisp Forth. I wonder how fast the MSP430 UART will run?
MSP430G2553 theoretically allows setting quite high rates like 4 Mbit/s,
but it is not clear if it will run (if noise immunity is good enough).
AFAICS 1 Mbit/s is supposed to work. Other thing is software speed,
I think that software can handle 1 Mbit/s, but probably not more.
Post by Rick C
I might have an ARM board that runs Mecrisp, I can't recall.
--
Waldek Hebisch