[Beowulf] Pricing and Trading Networks: Down is Up, Left is Right
diep at xs4all.nl
Thu Feb 16 11:26:08 EST 2012
Yes very good article.
In fact it's even more clumsy than most guess.
For those who didn't get the problem in the article: there are 2
datafeeds, A and B, that ship 'market data' at most exchanges.
Basically all big exchanges work similarly there. The derivatives
exchanges especially have very similar protocols,
and that's the only spot where you really can make money now based
The market data is a list of the prices you can get something for (say
a Mellanox future, MLNX for short) and the prices it sells for.
The updates are incremental, however, while the datafeeds are raw
UDP, so not TCP.
TCP, as we know, is about the only protocol that has been fixed well,
so the raw format poses a big problem there.
On paper it is indeed possible to ask for retransmission on a
different channel, but during market surges that's gonna be too
slow of course.
You also cannot fall back on the other feed, as one of the two feeds,
A or B, is gonna be a lot faster than the other.
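A rough sketch of what merging the A and B feeds looks like, assuming every packet carries an increasing sequence number. The real wire formats differ per exchange, and `FeedArbiter` and its fields are made-up names for illustration only. The receiver takes whichever copy of a sequence number shows up first, drops the late duplicate, and holds back anything that arrives ahead of a gap:

```python
class FeedArbiter:
    """Merge two redundant feeds that carry the same sequence-numbered packets."""

    def __init__(self, first_seq=1):
        self.next_seq = first_seq        # next sequence number we can deliver
        self.held = {}                   # packets that arrived ahead of a gap
        self.wins = {"A": 0, "B": 0}     # which feed supplied each accepted packet

    def on_packet(self, feed, seq, payload):
        """Return the payloads that became deliverable, in sequence order."""
        if seq < self.next_seq or seq in self.held:
            return []                    # late duplicate from the slower feed
        self.held[seq] = payload
        self.wins[feed] += 1
        out = []
        while self.next_seq in self.held:
            out.append(self.held.pop(self.next_seq))
            self.next_seq += 1
        return out

    def missing(self):
        """Sequence numbers we are still waiting for (the current gap)."""
        if not self.held:
            return []
        return [s for s in range(self.next_seq, max(self.held)) if s not in self.held]


arb = FeedArbiter()
print(arb.on_packet("A", 1, "p1"))   # ['p1']
print(arb.on_packet("A", 3, "p3"))   # []  (packet 2 was lost on feed A)
print(arb.missing())                 # [2]
print(arb.on_packet("B", 1, "p1"))   # []  (duplicate, A already delivered it)
print(arb.on_packet("B", 2, "p2"))   # ['p2', 'p3']
```

The gap only closes when the slower feed finally delivers packet 2, which is exactly the waiting time you cannot afford during a surge.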
Even for the very slow IBM software, which you can publicly buy
if you google for it, they claim on their twitter a total processing
time for the 'market data' of around 7-11 us (microseconds).
The trading (so buying and selling yourself) happens with a TCP
Of course you can forget trading from platforms or a computer that's
not at the exchange itself. The latency of receiving data from a
datacenter with an ocean in between is, as we know, measured in
milliseconds, a factor 1000 slower than microseconds.
So it won't make you much of a profit to trade like that, except if you
use it to compare 2 different exchanges with each other and try to
from that. However, that basically requires you to be a billion
dollar company, as you need quite some infrastructure for that to be
Speaking of big cash being an advantage: some exchanges offer
a faster connection if you pay really big cash (10 gigabit versus 1
for cheap dollars). Most traders won't be able to pay for that big
So it's funny that different exchanges get mentioned here, as
you're only fast enough for 1 exchange with a local machine as a
But now the weirdest thing: I offered at different places to
write software that parses that market data in < 1 microsecond, but
no one wants to hire you, it seems. They just look for Java coders.
Morgan Chase is another example. Of all their job offers, which I
receive daily, 99% are Java jobs.
Java is of course TOO SLOW for working with trading data. Much better
is C/C++, and then assembler once you have already achieved that < 1
microsecond in C.
Note that most of the 'FPGAs' that are advertised cost millions,
and the only latency quote I saw there is 2 microseconds, which
also sounds a bit slow to me, but well.
Furthermore they usually hire FINANCIALS who happen to be able to
program. They are so far behind in this world, not seldom because
many of the IT managers I spoke with at financial companies
hardly have a high school degree, and they don't take any risk.
What's the risk of paying 1 programmer to make a faster framework for
We are speaking about companies worth hundreds of millions to billions
of dollars which don't take that risk.
Speed is of course everything. It's an NSA game now, and I keep
wondering why only a few hedge funds hire such people. The majority of
the traders over here really run so far behind. You would be very
shocked if you realized how much profit they do not make because of this.
Most importantly of course, now I'm gonna say something most on this
list understand and which to financial guys is nearly impossible to
Being very fast removes a worst case. The odds that you go bankrupt
during market surges are A LOT SMALLER, and you lose less during surges
that your algorithms did not predict.
Speaking of algorithms: 'algorithm' is too heavy a word for what the
financial world does. Gibberish is a better wording. But well, I say
that last of course as someone who has made pretty complex algorithms
for patterns in computer chess, which also took me 15 years to learn
how to do.
Please note that it's wishful thinking to guess that anything would
change in how trading happens at exchanges. If one nation modifies
something, traders just move to a different exchange.
Right now CME (Chicago) is the biggest derivatives market. It seems they
are trying to create a bigger one in Europe.
I'm pretty sure I don't betray any banking secrecy code if I call
them very clever. If they learn one day what a computer is, that is.
And as 0 of the traders will *ever* in their lives be busy 'improving'
the system, of course 0 politicians have any clue what happens over
there and what sort of NSA race for speed it has become.
Though I'm very good at that, I'm not sure whether I like it.
During a congressional hearing in the US a year or 2 ago,
one academic who clearly realized the problem stated that he wanted
a penalty directly after trading, as he had figured out that some
traders trade like 200 times a second in the same future Mellanox
this is called in traders terminology), to keep using the same
example. He wanted to 'solve' that problem by introducing a rule that
after trading in an instrument one would need to wait another 100
milliseconds to trade again in that instrument.
A guy from Tradeworx hammered that down, arguing that it would hurt
liquidity in the market.
Note such academic solutions do not solve the fundamental problem:
if someone goes first and is 1 picosecond faster,
he's the one allowed to buy that instrument at the price
it was offered for. That there is a delay afterwards doesn't solve
that fundamental problem.
Furthermore you kind of drive traders away, and exchanges don't like
In the meantime exchanges are upgrading their hardware and moving to
new datacenters. Some already migrated in past years. So any discussion
here is already totally outdated, as the datacenters got way faster.
FTSE for example announced that their total processing time has been
reduced to just above 100 microseconds and migrated to
More are to follow there.
On Feb 16, 2012, at 4:26 PM, Eugen Leitl wrote:
> Pricing and Trading Networks: Down is Up, Left is Right
> My introduction to enterprise networking was a little backward. I
> started out
> supporting trading floors, backend pricing systems, low-latency
> trading systems, etc... I got there because I'd been responsible
> for UNIX
> systems producing and consuming multicast data at several large
> Inevitably, the firm's network admin folks weren't up to speed on
> matters of
> performance tuning, multicast configuration and QoS, so that's where I
> focused my attention. One of these firms offered me a job with the
> "network" in the title, and I was off to the races.
> It amazes me how little I knew in those days. I was doing PIM and MSDP
> designs before the phrases "link state" and "distance vector" were
> in my
> vocabulary! I had no idea what was populating the unicast routing
> table of my
> switches, but I knew that the table was populated, and I knew what
> PIM was
> going to do with that data.
> More incredible is how my ignorance of "normal" ways of doing
> things (AVVID,
> SONA, Cisco Enterprise Architecture, multi-tier designs, etc...)
> gave me an
> advantage over folks who had been properly indoctrinated. My
> designs worked
> well for these applications, but looked crazy to the rest of the
> staff (whose underperforming traditional designs I was replacing).
> The trading floor is a weird place, with funny requirements. In
> this post I'm
> going to go over some of the things that make trading floor
> Redundant Application Flows
> The first thing to know about pricing systems is that you generally
> have two
> copies of any pricing data flowing through the environment at any
> Ideally, these two sets originate from different head-end systems, get
> transit from different wide area service providers, ride different
> infrastructure into opposite sides of your data center, and
> terminate on
> different NICs in the receiving servers.
> If you're getting data directly from an exchange, that data will
> probably be
> arriving as multicast flows. Redundant multicast flows. The same
> data arrives
> at your edge from two different sources, using two different multicast
> If you're buying data from a value-add aggregator (Reuters, Bloomberg,
> etc...), then it probably arrives via TCP from at least two different
> sources. The data may be duplicate copies (redundancy), or be
> among the flows with an N+1 load-sharing scheme.
> Losing One Packet Is Bad
> Most application flows have no problem with packet loss. High
> trading systems are not in this category.
> Think of the state of the pricing data like a spreadsheet. The rows
> represent securities -- something that traders buy and sell. The columns
> represent attributes of that security: bid price, ask price, daily high and
> low, last trade price, last trade exchange, etc...
> Our spreadsheet has around 100 columns and 200,000 rows. That's 20 million
> cells. Every message that rolls in from a multicast feed updates one of those
> cells. You just lost a packet. Which cell is wrong? Easy answer: All of them.
> If a trader can't trust his data, he can't trade.
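To put the spreadsheet picture in code: a minimal sketch, assuming each message carries a sequence number and updates exactly one cell. `PriceTable`, the field names and the prices are invented for illustration and are not any exchange's actual format. It shows why one lost packet taints the whole table rather than one cell:

```python
COLUMNS = 100
ROWS = 200_000
TOTAL_CELLS = COLUMNS * ROWS  # 20,000,000 cells


class PriceTable:
    """Toy model of the pricing 'spreadsheet' kept current by incremental updates."""

    def __init__(self):
        self.cells = {}       # (symbol, field) -> latest value
        self.last_seq = 0     # last sequence number applied
        self.trusted = True   # False once any update is known to be missing

    def apply(self, seq, symbol, field, value):
        """Apply one single-cell update; a skipped sequence number taints the table."""
        if seq <= self.last_seq:
            return            # duplicate or stale message, nothing new
        if seq != self.last_seq + 1:
            # We cannot tell which cell the lost message would have changed,
            # so no cell can be trusted until the gap is repaired.
            self.trusted = False
        self.last_seq = seq
        self.cells[(symbol, field)] = value


table = PriceTable()
table.apply(1, "MLNX", "bid", 35.10)
table.apply(2, "MLNX", "ask", 35.12)
table.apply(4, "CSCO", "bid", 20.05)   # update 3 never arrived
print(TOTAL_CELLS, table.trusted)      # prints: 20000000 False
```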
> These applications have repair mechanisms, but they're generally
> slow and/or
> clunky. Some of them even involve touch tone. Really:
> The Securities Industry Automation Corporation (SIAC) provides a
> retransmission capability for the output data from host systems.
> As part of
> this service, SIAC provides the AutoLink facility to assist vendors
> requesting retransmissions by submitting requests over a touch-tone
> Reconvergence Is Bad
> Because we've got two copies of the data coming in, there's no reason to fix
> a single failure. If something breaks, you can let it stay broken until the
> end of the day.
> What's that? You think it's worth fixing things with a dynamic routing
> protocol? Okay cool, route around the problem. Just so long as you can
> guarantee that "flow A" and "flow B" never traverse the same core
> router. Why
> am I paying for two copies of this data if you're going to push it
> through a
> single device? You just told me that the device is so fragile that
> you feel
> compelled to route around failures!
> Don't Cluster the Firewalls
> The same reason we don't let routing reconverge applies here. If
> there are
> two pricing firewalls, don't tell them about each other. Run them as
> standalone units. Put them in separate rooms, even. We can afford to lose
> half of a redundant feed. We cannot afford to lose both feeds, even for the
> few milliseconds required for the standby firewall to take over. Two
> (four firewalls) would be okay, just keep the "A" and "B" feeds
> Don't team the server NICs
> The flow-splitting logic applies all the way down to the servers.
> If they've
> got two NICs available for incoming pricing data, these NICs should be
> dedicated per-flow. Even if there are NICs-a-plenty, the teaming
> schemes are
> all bad news because like flows, application components are also
> It's okay to lose one. Getting one back? That's sometimes worse. Keep
> Recovery Can Kill You
> Most of these pricing systems include a mechanism for data
> receivers to
> request retransmission of lost data, but the recovery can be a
> problem. With
> few exceptions, the network applications in use on the trading
> floor don't do
> any sort of flow control. It's like they're trying to hurt you.
> Imagine a university lecture where a sleeping student wakes up,
> asks the
> lecturer to repeat the last 30 minutes, and the lecturer complies.
> kind of how these systems work.
> Except that the lecturer complies at wire speed, and the whole lecture hall
> full of students is compelled to continue taking notes. Why should every
> other receiver be penalized because one system screwed up? I've got trades to
> The following snapshot is from the Cisco CVD for trading systems.
> it shows
> how aggressive these systems can be. A nominal 5Mb/s trading
> regularly hits wire-speed (100Mb/s) in this case.
> The graph shows a small network when things are working right. A
> big trading
> backend at a large financial services firm can easily push that
> green line
> into the multi-gigabit range. Make things interesting by breaking
> stuff and
> you'll over-run even your best 10Gb/s switch buffers (6716 cards
> have 90MB
> per port) easily.
> Slow Servers Are Good
> Lots of networks run with clients deliberately connected at slower
> than their server. Maybe you have 10/100 ports in the wiring closet
> gigabit-attached servers. Pricing networks require exactly the
> opposite. The
> lecturer in my analogy isn't just a single lecturer. It's a team of
> lecturers. They all go into wire-speed mode when the sleeping
> student wakes
> How will you deliver multiple simultaneous gigabit-ish multicast
> streams to
> your access ports? You can't. I've fixed more than one trading
> system by
> setting server interfaces down to 100Mb/s or even 10Mb/s. Fast
> clients, slow
> servers is where you want to be.
> Slowing down the servers can turn N*1Gb/s worth of data into
> N*100Mb/s --
> something we can actually handle.
> Bad Apple Syndrome
> The sleeping student example is actually pretty common. It's
> amazing to see
> the impact that can arise from things like:
> a clock update on a workstation
> ripping a CD with iTunes
> briefly closing the lid on a laptop
> The trading floor is usually a population of Windows machines with
> sitting behind them. Keeping these things from killing each other is a
> daunting task. One bad apple will truly spoil the bunch.
> How Fast Is It?
> System performance is usually measured in terms of stuff per
> interval. That's
> meaningless on the trading floor. The opening bell at NYSE is like
> turning on
> a fire hose. The only metric that matters is the answer to this
> question: Did
> you spill even one drop of water?
> How close were you to the limit? Will you make it through
> tomorrow's trading
> day too?
> I read on twitter that Ben Bernanke got a bad piece of fish for
> dinner. How
> confident are you now? Performance of these systems is binary. You
> survived or you did not. There is no "system is running slow" in
> this world.
> Routing Is Upside Down
> While not unique to trading floors, we do lots of multicast here.
> is funny because it relies on routing traffic away from the source,
> than routing it toward the destination. Getting into and staying in
> mindset can be a challenge. I started out with no idea how routing
> worked, so
> had no problem getting into the multicast mindset :-)
> NACK not ACK
> Almost every network protocol relies on data receivers
> ACKnowledging their
> receipt of data. But not here. Pricing systems only speak up when
> goes missing.
> QoS Isn't The Answer
> QoS might seem like the answer to make sure that we get through the
> smoothly, but it's not. In fact, it can be counterproductive.
> QoS is about managed un-fairness... Choosing which packets to drop.
> pricing systems are usually deployed on dedicated systems with
> switches. Every packet is critical, and there's probably more of
> them than we
> can handle. There's nothing we can drop.
> Making matters worse, enabling QoS on many switching platforms
> reduces the
> buffers available to our critical pricing flows, because the buffers
> necessarily get carved so that they can be allocated to different
> kinds of
> traffic. It's counter intuitive, but 'no mls qos' is sometimes the
> thing to do.
> Load Balancing Ain't All It's Cracked Up To Be
> By default, CEF doesn't load balance multicast flows. CEF load
> balancing of
> multicast can be enabled and enhanced, but doesn't happen out of
> the box.
> We can get screwed on EtherChannel links too: Sometimes these quirky
> applications intermingle unicast data with the multicast stream.
> Perhaps a
> latecomer to the trading floor wants to start watching Cisco's
> stock price.
> Before he can begin, he needs all 100 cells associated with CSCO.
> This is
> sometimes called the "Initial Image." He ignores updates for CSCO until he's
> got that starting point loaded up.
> CSCO has updated 9000 times today, so the server unicasts the
> initial image:
> "Here are all 100 cells for CSCO as of update #9000: blah blah
> blah...". Then
> the price changes, and the server multicasts update #9001 to all
> If there's a load balanced path (either CEF or an aggregate link)
> between the
> server and client, then our new client could get update 9001
> before the initial image (unicast) shows up. The client will
> discard update
> 9001 because he's expecting a full record, not an update to a
> single cell.
> Next, the initial image shows up, and the client knows he's got
> through update #9000. Then update #9002 arrives. Hey, what happened
> to #9001?
> Post-mortem analysis of these kinds of incidents will boil down to the
> software folks saying:
> We put the messages on the wire in the correct order. They were
> by the network in the wrong order.
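The race described above can be replayed in a few lines. This is a sketch under the assumptions stated in the article (the client discards updates until it has the initial image, and updates are numbered consecutively); `SymbolView` and `replay` are hypothetical names, not part of any real pricing API:

```python
class SymbolView:
    """Client-side view of one symbol, following the behaviour described above."""

    def __init__(self):
        self.last_seq = None   # None until the initial image has arrived
        self.gaps = []         # (first_missing, last_missing) ranges seen so far

    def on_initial_image(self, seq):
        self.last_seq = seq    # image is current through update number `seq`
        return "image"

    def on_update(self, seq):
        if self.last_seq is None:
            return "discarded"             # no starting point yet, update ignored
        if seq <= self.last_seq:
            return "stale"                 # already covered by what we hold
        if seq > self.last_seq + 1:
            self.gaps.append((self.last_seq + 1, seq - 1))
            self.last_seq = seq
            return "gap"
        self.last_seq = seq
        return "applied"


def replay(arrivals):
    """Feed (kind, seq) messages to a fresh client in arrival order."""
    view = SymbolView()
    results = []
    for kind, seq in arrivals:
        handler = view.on_initial_image if kind == "image" else view.on_update
        results.append(handler(seq))
    return results, view.gaps


in_order = [("image", 9000), ("update", 9001), ("update", 9002)]
reordered = [("update", 9001), ("image", 9000), ("update", 9002)]
print(replay(in_order))    # (['image', 'applied', 'applied'], [])
print(replay(reordered))   # (['discarded', 'image', 'gap'], [(9001, 9001)])
```

The server sent the same three messages both times. Only the arrival order differs, and that alone is enough to make the client believe update 9001 was lost.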
> ARP Times Out
> NACK-based applications sit quietly until there's a problem. So
> quietly that
> they might forget the hardware address associated with their
> gateway or with
> a neighbor.
> No problem, right? ARP will figure it out... Eventually. Because
> these are
> generally UDP-based applications without flow control, the system
> fire off a single packet, then sit and wait like it might when
> talking TCP.
> No, these systems can suddenly kick off a whole bunch of UDP datagrams
> destined for a system they haven't talked to in hours.
> The lower layers in the IP stack need to hold onto these packets
> until the
> ARP resolution process is complete. But the packets keep rolling
> down the
> stack! The outstanding ARP queue is only 1 packet deep in many
> implementations. The queue overflows and data is lost. It's not
> strictly a
> network problem, but don't worry. Your phone will ring.
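A toy model of that ARP queue overflow, assuming a queue depth of 1 and assuming the newest datagram replaces the one being held. Which packet survives varies between IP stacks, and `Neighbor` is a made-up class rather than any real stack's API:

```python
from collections import deque


class Neighbor:
    """One next hop whose hardware address may have aged out of the ARP cache."""

    def __init__(self, queue_depth=1):
        self.mac = None                           # unresolved until an ARP reply
        self.pending = deque(maxlen=queue_depth)  # datagrams waiting on ARP
        self.dropped = 0
        self.sent = []

    def send(self, datagram):
        if self.mac is not None:
            self.sent.append(datagram)            # address known, goes straight out
            return
        if len(self.pending) == self.pending.maxlen:
            self.dropped += 1                     # oldest held datagram is pushed out
        self.pending.append(datagram)

    def arp_reply(self, mac):
        self.mac = mac
        self.sent.extend(self.pending)            # flush whatever survived the wait
        self.pending.clear()


peer = Neighbor()
for n in range(5):            # burst of 5 UDP datagrams, no flow control
    peer.send(n)
peer.arp_reply("00:11:22:33:44:55")
print(peer.sent, peer.dropped)   # prints: [4] 4
```

A TCP sender would have put one segment on the wire and waited, so the same one-deep queue would never have overflowed.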
> Losing Data Causes You to Lose Data
> There's a nasty failure mode underlying the NACK-based scheme. Lost
> data will
> be retransmitted. If you couldn't handle the data flow the first
> time around,
> why expect to handle wire speed retransmission of that data on top
> of the
> data that's coming in the next instant?
> If the data loss was caused by a Bad Apple receiver, then all his
> suffer the consequences. You may have many bad apples in a moment.
> One Bad
> Apple will spoil the bunch.
> If the data loss was caused by an overloaded network component,
> then you're
> rewarded by compounding increases in packet rate. The exchanges
> don't stop
> trading, and the data sources have a large queue of data to re-send.
> TCP applications slow down in the face of congestion. Pricing
> speed up.
> Packet Decodes Aren't Available
> Some of the wire formats you'll be dealing with are closed-source
> Others are published standards for which no WireShark decodes are
> available. Either way, you're pretty much on your own when it comes to
> Responding to Will's question about data sources: The streams come from the
> various exchanges (NASDAQ, NYSE, FTSE, etc...). Because each of these
> exchanges uses its own data format, there are usually some layers of
> processing required to get them into a common format for application
> consumption. This processing can happen at a value-add data
> (Reuters, Bloomberg, Activ), or it can be done in-house by the end
> Local processing has the advantage of lower latency because you
> don't have to
> have the data shipped from the exchange to a middleman before you
> see it.
> Other streams come from application components within the company.
> There are
> usually some layers of processing (between 2 and 12) between a
> pricing update
> first hitting your equipment, and when that update is consumed by a
> The processing can include format changes, addition of custom
> fields, delay
> engines (delayed data can be given away for free), vendor-switch
> systems (I
> don't trust data vendor "A", switch me to "B"), etc...
> Most of those layers are going to be multicast, and they're going
> to be the
> really dangerous ones, because the sources can clobber you with LAN
> rather than WAN speeds.
> As far as getting the data goes, you can move your servers into the
> exchange's facility for low-latency access (some exchanges actually
> the same length of fiber to each colocated customer, so that nobody
> can claim
> a latency disadvantage), you can provision your own point-to-point
> for data access, you can buy a fat local loop from a financial network
> provider like BT/Radianz (probably MPLS on the back end so that one
> loop can get you to all your pricing and clearing partners), or you
> can buy
> the data from a value-add aggregator like Reuters or Bloomberg.
> Responding to Will's question about SSM: I've never seen an SSM
> component. They may be out there, but they might not be a super
> good fit.
> Here's why: Everything in these setups is redundant, all the way
> down to
> software components. It's redundant in ways we're not used to
> seeing in
> enterprises. No load-balancer required here. The software components
> collaborate and share workload dynamically. If one ticker plant
> fails, his
> partner knows what update was successfully transmitted by the dead
> peer, and
> takes over from that point. Consuming systems don't know who the
> servers are,
> and don't care. A server could be replaced at any moment.
> In fact, it's not just downstream pricing data that's multicast.
> Many of
> these systems use a model where the clients don't know who the data
> are. Instead of sending requests to a server, they multicast their
> for data, and the servers multicast the replies back. Instead of:
> <handshake> hello server, nice to meet you. I'd like such-and-
> it's actually:
> hello? servers? I'd like such-and-such! I'm ready, so go ahead
> and send
> it whenever...
> Not knowing who your server is kind of runs counter to the SSM
> ideal. It
> could be done with a pool of servers, I've just never seen it.
> The exchanges are particularly slow-moving when it comes to changing things.
> The modern exchange feeds, particularly ones like the "touch tone" example I
> cited, are literally ticker-tape punch signals wrapped up in an IP
> The old school scheme was to have a ticker tape machine hooked to a
> from the exchange. Maybe you'd have two of them (A and B again).
> There would
> be a third one for retransmit. Ticker machine run out of paper?
> Call the
> exchange, and here's more-or-less what happens:
> Cut the chunk of paper containing the updates you missed out of
> spool of tape. Scissors are involved here.
> Grab a bit of header tape that says: "this is retransmit data
> for XYZ
> Tape these two pieces of paper together, and feed them through
> a reader
> that's attached to the "retransmit line"
> Every bank in New York will get the retransmits, but they'll
> know to
> ignore them.
> XYZ Bank clips the retransmit data out of the retransmit ticker
> and pastes it into place on the end where the machine ran out of
> These terms "tick" "line" and "retransmit", etc... all still apply
> modern IP based systems. I've read the developer guides for these
> systems (to
> write wireshark decodes), and it's like a trip back in time. Some
> of these
> systems are still so closely coupled to the paper-punch system that
> you get
> chads all over the floor and paper cuts all over your hands just
> from reading
> the API guide :-)
Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf