Wireshark to the rescue

Wireshark is a free and open source network protocol analyzer which can be really useful when analyzing a wide range of network related issues.

Recently it turned out to be a real life saver on the project I currently work on. A web service client application which had been developed by a consultant company in India was going to call a web service hosted on a test server by my company in Oslo. But no matter how much the developers working in India tried, it would not work and they would always get the following exception:

System.Net.WebException was caught
Message=”The underlying connection was closed: The connection was closed unexpectedly.”
Source=”System.Web.Services”
StackTrace:
at System.Web.Services.Protocols.WebClientProtocol.GetWebResponse(WebRequest request)
at System.Web.Services.Protocols.HttpWebClientProtocol.GetWebResponse(WebRequest request)
at System.Web.Services.Protocols.SoapHttpClientProtocol.Invoke(String methodName, Object[] parameters)

However, everything worked fine when we tested our service from external and internal networks in Oslo.

When tracing the incoming request from India in Wireshark we could see the following:

The request reached our server, but we were unable to send the “100 Continue” response back to the client. It was possible to reach our web server through a browser on the client machine, so there should be no firewalls blocking the communication. It seemed like the connection had been closed by the client.

Next we got the developers in India to try the same request in SoapUI, and then it worked! This made us think that the problem was in the client application and not at the infrastructure level. So we spent several hours trying to troubleshoot the client environment, without any success. Google gave us numerous reports (1, 2, 3) of other people experiencing the same issues, but the suggested solution neither didn’t work nor did they explain the exact reason for the problem. Most of the suggestions involved excluding KeepAlive from the HTTP header and to use HTTP version 1.0 instead of version 1.1.

The next step was to log the request by using Fiddler Web Debugger on the calling server in India and then try to replay the request. The first replay of the request failed, as expected:

HTTP/1.1 504 Fiddler – Receive Failure
Content-Type: text/html; charset=UTF-8
Connection: close
Timestamp: 22:17:14.207

[Fiddler] ReadResponse() failed: The server did not return a response for this request.

So there was no reply from our server. Next we tried to remove the HTTP KeepAlive header as suggested by some of the blog posts we found on Google, and then resubmitting the request in Fiddler:

FiddlerHeader

And now the request worked in Fiddler! Once the TCP connection was established, we could even replay the original request which failed, and it would work.

But why did this work?

Based on the test results in Fiddler we arrived at the conclusion that the problem was not in the client application, but rather at the infrastructure level.

So we installed Wireshark on the calling server and did some more tracing. Finally we could see what was causing us problems:

WireSharkFragmentationNeeded

A router is telling us that the size of our IP datagram is too big, and that it needs to be fragmented. This is communicated back to the calling server by the ICMP message shown in the picture above.

By inspecting the ICMP message in Wireshark we can find some more details:

WireSharkIPDetails

There are several interesting things to observe in the picture above:

  1. The problem occurs when the router with IP address 209.58.105.21 tries to forward the datagram to the next hop (this is a backbone router located in Mumbai)
  2. The router in the next hop accepts a datagram size of 1496 bytes, while we are sending 1500 bytes.
  3. The router at 209.58.105.21 sends an ICMP message back to the caller which says that fragmentation of the datagram is needed

By executing the “tracert” command on the remote server we could get some more information about where on the route the problem occurred:

[…]
3    26 ms    26 ms    26 ms  203.200.137.9.ill-chn.static.vsnl.net.in [203.200.137.9]
4    31 ms    31 ms    31 ms  59.165.191.41.man-static.vsnl.net.in [59.165.191.41]
5    66 ms    66 ms    66 ms  121.240.226.26.static-mumbai.vsnl.net.in [121.240.226.26]
6    70 ms    70 ms    70 ms  if-14-0-0-101.core1.MLV-Mumbai.as6453.net [209.58.105.21]
7   184 ms   172 ms   171 ms  if-11-3-2-0.tcore1.MLV-Mumbai.as6453.net [180.87.38.10]
8   174 ms   173 ms   194 ms  if-9-5.tcore1.WYN-Marseille.as6453.net [80.231.217.17]
9   175 ms   176 ms   175 ms  if-8-1600.tcore1.PYE-Paris.as6453.net [80.231.217.6]
10   191 ms   176 ms   229 ms  80.231.154.86
11   174 ms   174 ms   213 ms  prs-bb2-link.telia.net [213.155.131.10]
[…]

Conclusions

A white paper is available at Cisco which describes the behaviour which we could observe above. The router which requested fragmentation of the datagram did not do anything wrong, it just acted according to the protocol standards. The problem was that the OS and/or network drivers on the calling server did not act on the ICMP message and did not try to either use IP fragmentation or to reduce the MTU size to a lower value which wouldn’t require fragmentation.

According to the Cisco white paper it is a common problem that the ICMP message will be blocked by firewalls, but that was not the case for our scenario.

And what about the request we could get working in Fiddler by removing “Connection: Keep-Alive” from the header? It worked because the datagram would become small enough to not require fragmentation (<= 1496 bytes) when we removed this header.

Resources

Wireshark homepage: http://www.wireshark.org/

Resolve IP Fragmentation, MTU, MSS, and PMTUD Issues with GRE and IPSEC: http://www.cisco.com/en/US/tech/tk827/tk369/technologies_white_paper09186a00800d6979.shtml

LEAP Conference – day 3

A summary of day 3 of the LEAP conference in Redmond, Seattle

Sync: Why, what and how

With Lev Novik

IMG_1117
Why Sync?
  • Facilitates rich clients
    • Faster response, richer UX
  • Legacy applications can be migrated to use the Cloud as a data storage by using Sync
General Sync Challenges
  • Granularity of changes
  • Change (non-) Reflection
    • Using a timestamp. Use locking until synchronization is finished?
  • Conflicts
    • Not detecting conflicts will result in data loss
    • Complex algorithms for conflics detection exists, which don’t require storing the history of all changes
  • Loops
    • Multiple devices synchronizing data to multiple servers at the same time
    • Can result in duplicated data
  • Hierarchical data
    • The order of synchronization is important
    • Eg. one endpoints adds an item to a folder, while another endpoint deletes the entire folder
  • Item filtering
    • Optimization by syncing parts of the data more frequently
  • “Column” filtering
    • Select parts of the data
    • Challenge: Can’t do conflicts detection, since one of the endpoints don’t have the complete version of the data
  • Errors and interruptions
    • Not all conflicts can be solved automatically
      • Doing so will result in loss of data
      • Must wait for a human to resolve them
Microsoft Sync Framework
  • What does MS Sync Framework do?
    • Makes it easy to sync participating endpoints
      • Build in endpoints for
        • V1: File system, relational databases
        • V2: SQL Data Services, Live Mesh, ++
  • The Sync Session
    • Data stores implements a Sync Provider
    • The Sync application has a Sync Orchestrator which communicates with the endpoints’ sync providers
    • Synch Framework Runtime
      • Metadata
        • Versioning
      • Runtime
        • Algorithms to solve sync problems
      • Metadata Store
        • For those who can’t store the metadata themselves
      • Simple Provider Framework
        • Makes writing providers easy
How do customers use the sync framework?
  • Write sync applications
    • Implement synch orchestration
  • Write sync providers in order to support sync
    • Declare an object identifier
    • Declare versioning
    • Enumerate changes
Sync Participants
  • Sync endpoints
    • Stores metadata
    • Can be many kinds of devices, and the sync logic should not be implemented for each of them
  • Sync providers
    • Does most of the sync work
    • Operates on the endpoints’ meta data
  • Sync application
    • Has the Synch Orchestrator

The sync logic can be placed in different locations (eg. on the client or in a web service) for differenc scenarios.

Sync Framework on MSDN: http://msdn.microsoft.com/sync/

 

Visual Studio Team System: ALM as we do it at Microsoft

With Stephanie Cuthbertson

IMG_1120

Some facts about Microsoft Development
  • TFS usage at MS
    • VS 2008
      • 13 000 users
      • 2 570 000 work items
      • 40 100 000 source file
Planning and tracking
  • Feature planning and prioritizing in the development of VSTS 2010
    • Value props prioritizing
      • Voting and weighting/prioritizing of features in an Excel sheet
      • Work items are then imported to TFS
  • Generate MS Project GANTT from TFS
VS 2010 demo
  • Simple task editing integration with Excel and MS Project
  • Improved forecasting statistics and status reports
  • User requirement tracking
    • Can edit requirements through a web interface
      • Requires a separate (new) licence
    • Can link requirements to test cases
Branching
  • In the development of VSTS 2010, branching per feature is used
  • Feature must pass “Quality Gates” before merging into active branch
    • Feature complete
    • Test complete
    • All bugs fixed
    • Static code analysis
    • Localization testing
    • etc
Tracking and reporting in VSTS 2010
  • Better SharePoint integration
  • Web dashboard
    • Extensive statistics and analytics possibilities

Always Responsive Apps in a World of Public Safety

With Mario Szpuszta

IMG_1122

A case study for a system for ship tracking and tracing delivered to Frequentis.

Who is Frequentis AG?
  • Provides systems for
    • Air traffic
    • Ship tracking & tracing
    • Coordination systems for police offices
Terms
  • MCS- Maritime Communication System
    • Ship – Ship, Ship – Land, Land – Land
    • Usually hardware interface
  • CAD – Computer Aided Dispatching
    • Collaborative Incident Management
    • This is the kind of software made in this case study
  • TnT – Tracking and Tracing
    • CAD and MCS Solution from Frequentis
Tracking & Tracing Architecture
  • GUI in WPF
    • Several modules
    • Complex requirements
      • Lots of information and operations available for the users
    • Could not use CAB, Prism or similar frameworks since the GUI would then run in one process and one app domain. The entire system should not go down if one module crashes.
    • Each GUI module runs in a separate process. A separate shell was created in order to achieve this.
  • Communication with Maritime Communication System with .NET remoting
  • GUI communicate with the services through a message bus
  • Server
    • WCF service modules
    • Windows 2008 and SQL Server 2005
The Service Bus
  • Complex communication
    • Everyone communicate with everyone
  • Failure of one system may not affect others
  • Challenges
    • Not every harbour can pay for the required infrastructure, like huge server farms
    • Failure of single entity may not affect others
  • Classic architecture
    • Keep it simple
      • Lightweight
      • Reliable
    • Loosely couples
    • Many-to-many communication
  • Solution
    • Created custom Message Subscription Database
    • Use WCF Peer-to-Peer channel for communication
      • Issue: Max. 700 msg/sec limitation due to slow serialization
      • No Duplex-bindings, no MSMQ
        • Just leverage NetTcp-bindings
  • Tech-hints for WCF
    • NetDataContractSerializer will include assembly info – serialization will fail if endpoints have different assembly versions, even though the contracts are compatible
    • DataContractSerializer enables loosely coupling
Creating a responsive user interface
  • The application may never hang at any time
  • Encapsulate logic in “autonomous” tasks
  • Set of jobs executed based on commands
  • Core rule: Everything executed asynchronously
    • Thread pool with queue and queue manager
  • Commands, Jobs and Queues
    • Business logic encapsulated into Jobs (and ONLY there)
    • Commands executed autonomously without side effects
  • Results from Async Jobs
    • Modules implements INotify interface
      • Passed into the constructor of a job
      • Job calls back to module through INotify
  • Communication with other systems
    • Create yet another job
    • Job talks to IConnectionPoint
  • Tasks, Jobs – Tech Hint
    • CCR (Concurrency Coordination Runtime, originally from The Robotics Studio)
      • Simplified execution of concurrent tasks
      • Has now been released as a separate toolkit separated from Robotics
IMG_1125
WPF-based client
  • Why WPF?
    • Huge amount of information needed to be presented
    • Frequentis hired a separate UX-research team
      • Different alternative UX-stories were investigated
    • Advanced requirements for alternative visualizations of data
  • Presentation Model Pattern
    • Separate UI from code

 

Green Computing through Sharing

With Pat Helland

IMG_1126

Introduction
  • In 2006, 1,5% of the electricity in US was consumed by Data Centers
    • This is more than what is consumed by TVs
    • Projected to double every fifth year
  • Sharing resources vs. dedicated resources
    • Shared resources may not be available when you need them
    • Dedicated resources are expensive and have less utilization
  • Sharing through
    • Virtual machines
    • Cloud computing

 

The evolving landscape of data centers
  • Power Usage Effectiveness (PUE)
    • PUE = Total Facility Power / IT Equipment Power
    • Typical factor is 1.7
  • Power and cooling is expensive
    • Infrastructure and energy cost are both more expensive than the server cost
  • Redundancy
    • Represents more than 20% of the data center cost
    • All servers require
      • Dual power paths
      • Dual network
  • “Chicago Data Center”
    • Highly efficient data center with PUE = 1,2
    • Servers are located in isolated steel containers, each containing 2 000+ servers
      • Individual servers are never maintained

 

Over-Provisioning versus Over-Booking of Power
  • Power Provisioning
    • Total power consumption for a server is typically 200W
    • Power consumption typically peaks at about 90% for a data center
      • Theoretical max power consumption is seldom used, eg. because disk usage prevents 100% CPU utilization
      • This means that it is possible to add more servers than the theoretical max limit in order to utilize the available power
Services and Incentives
  • Amazon’s Server Oriented Architecture
    • One page request typically use over 150 services
  • Service Level Agreements (SLAs)
    • Example: 300ms response for 99.9% of requests with 500 requests per sec
What does this mean for developers?
  • Factories are more efficient than hand-crafted manufacturing