Asterisk in the Trenches: One small company's struggle to set up a reliable open-source phone system
- Company Background: Construction Capital Source
- Construction Lending Company
- Headquarters in Murray (~20 employees)
- Satellite offices in Boise, Denver and Phoenix (~4 employees each)
- Technical details
- Internet connectivity by office:
- SLC: 15 Mbps Utopia and a leftover T1
- Boise: ELI T1
- Denver: Comcast business cable, 8 Mbps down/768 kbps up
- Phoenix: Cox business cable, 6 Mbps down/768 kbps up
- IPsec VPNs connecting offices together
- At the start of the year there was only one technically skilled employee (Carl). Jordan joined a few months later.
Decision-making process
- SLC and Boise had separate T1-based Avaya systems (approximately $5,000 per site, plus the occasional cost of more phones)
- New offices and expansion of existing offices forced us to make a decision
- Bosses liked lower-cost proposition of Asterisk
- Linksys SPA-942s looked cool (also cheap through UTAUG deal)
- Abundant bandwidth (Utopia) and friendly local provider (Arrival Telecom) also influenced our decision
- Ignorance made us a little overconfident (how hard can it be, right? :)
- Single-server non-hardware solution
- No need for servers in each office, especially since there are no technical people there
- No need to mess with PRI hardware with good local provider
- Connection to provider over Utopia was extremely low latency (~5ms)
- Cheapest solution
- Simpler to manage only one server
- Downsides:
- Remote offices lose dialtone when internet goes down
- Increased overall latency at remote offices
- Cheaper non-T1 internet connections slightly less reliable
- QoS for remote offices is more complex
Initial Configuration
- Single Asterisk server in SLC office
- 3.0 GHz hyperthreaded P4
- 250 GB SATA drive
- 1 GB RAM
- AAH (Asterisk@Home) 2.7
- Quality of service issues
- Took a while to find firewall QoS package that worked
- Most firewalls can't prioritize VPN traffic
- Send one email attachment and phone calls would bomb
- For a long time we thought we were reaching the Asterisk server over its public IP, but we weren't: only the initial SIP request went out over the public IP, and all subsequent traffic went over the internal, non-prioritized IP
- Finally made the Asterisk server publicly accessible, which resolved the QoS issues
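To see why a single email attachment could ruin calls on a 768 kbps cable uplink, a back-of-the-envelope estimate helps. The sketch below is only illustrative: the codec (G.711 with 20 ms packets), the IPsec overhead figure, and the 5 MB attachment size are all assumptions, since the notes do not say what the phones or tunnels actually used.

```python
# Rough bandwidth arithmetic for VoIP over a small uplink.
# Codec, packetization, and tunnel overhead are ASSUMPTIONS for illustration.

def call_kbps(payload_bytes=160, packets_per_sec=50,
              header_bytes=40, tunnel_overhead_bytes=60):
    """One-way bandwidth of a single call, in kbps.

    payload_bytes=160 at 50 packets/s is G.711 with 20 ms packets.
    header_bytes=40 is IPv4 + UDP + RTP (20 + 8 + 12).
    tunnel_overhead_bytes=60 is a guess at IPsec tunnel-mode overhead.
    """
    bytes_per_packet = payload_bytes + header_bytes + tunnel_overhead_bytes
    return bytes_per_packet * 8 * packets_per_sec / 1000


def calls_that_fit(uplink_kbps, per_call_kbps):
    """How many simultaneous calls fit if nothing else uses the uplink."""
    return int(uplink_kbps // per_call_kbps)


def seconds_to_upload(megabytes, uplink_kbps):
    """Time an upload holds the uplink at full rate (1 MB = 10**6 bytes)."""
    return megabytes * 8000 / uplink_kbps


per_call = call_kbps()
print("kbps per call:", per_call)
print("calls on a 768 kbps uplink:", calls_that_fit(768, per_call))
print("seconds a 5 MB attachment fills the uplink:",
      round(seconds_to_upload(5, 768), 1))
```

Under these assumptions a handful of calls fit comfortably, but one unprioritized attachment can saturate the uplink for the better part of a minute, which is consistent with calls bombing until voice traffic could be prioritized.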
Initial Configuration (continued)
- Complex dialplan made it more difficult to add customizations that weren't intended by the AAH developers. Minor nitpicks:
- Couldn't enable call waiting by default without hack
- Couldn't enable announcement on direct-dialed extension
- Wanted users not to have to type password if accessing voicemail from own phone
- ...and more
- All these customizations were possible, of course, but the complexity of the dialplan made them risky to attempt: it was a "black box" system
- We wanted to understand the dialplan better and only install necessary features
- Even after resolving network QoS issues, we still were plagued by intermittent quality problems
- Had occasional DTMF complaints for a long time. We blamed careless callers at first, but eventually realized it was a problem with some of our DIDs.
- Occasional technical problems at our Internet and VoIP providers sometimes made it difficult to distinguish whose system was at fault.
- We initially thought that packet loss was a serious problem but later discovered that it was not a major contributor
- Had to learn how to interpret non-technical employee descriptions of call problems ("it sounds like I'm talking under water")
- Hard to get people to keep a detailed log of call issues--vague descriptions don't help much
- One good diagnostic was to dial in with Skype and listen to music on hold (MOH) all day long
- Under-promise and over-deliver--after numerous promises of "we've finally fixed it this time" we learned to keep our mouths shut
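One way to get past vague descriptions is to give users a fixed vocabulary of symptoms and record every complaint in the same shape. The following is a minimal sketch of such a record; the symptom names, their meanings, and the `CallIssue` class are hypothetical and not something we actually deployed.

```python
from dataclasses import dataclass
from datetime import datetime

# Hypothetical symptom vocabulary. The plain-language meanings are guesses
# at what users mean, e.g. "it sounds like I'm talking under water".
SYMPTOMS = {
    "choppy": "audio cuts in and out",
    "underwater": "garbled or warbling audio",
    "echo": "caller hears their own voice",
    "one_way": "one party cannot hear the other",
    "dropped": "call ended unexpectedly",
    "dtmf": "key presses not recognized",
}


@dataclass
class CallIssue:
    extension: str
    when: datetime
    symptoms: list
    remote_number: str = ""

    def __post_init__(self):
        # Reject free-form symptoms so the log stays comparable over time.
        unknown = [s for s in self.symptoms if s not in SYMPTOMS]
        if unknown:
            raise ValueError("unknown symptoms: " + ", ".join(unknown))

    def log_line(self):
        """One grep-friendly line per reported problem."""
        stamp = self.when.strftime("%Y-%m-%d %H:%M")
        remote = self.remote_number or "-"
        tags = ",".join(sorted(self.symptoms))
        return f"{stamp} ext={self.extension} remote={remote} symptoms={tags}"


issue = CallIssue("214", datetime(2006, 9, 1, 14, 5), ["underwater", "choppy"])
print(issue.log_line())
```

Lines in this form can be lined up against provider outage reports or server load graphs by timestamp, which is hard to do with "the phones were bad this morning".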
Round 2 - New and Improved Configuration
- Redundant Asterisk server in SLC office
- Two identical boxes using Heartbeat for failover
- 3.0 GHz hyperthreaded P4
- 250 GB SATA drive
- 1 GB RAM
- Ubuntu Dapper Drake with standard asterisk package
- Custom dialplan
- Completely automated server configuration with shell scripts (new server can be up and running in minutes)
- Discoveries
- Syncing registration file means failover occurs in 1-2 seconds and only current calls are dropped
- Rsyncing file systems between the redundant servers caused CPU spikes. We switched to a DRBD mirror for storing call recordings and voicemail, and everything runs well now
- Jitter buffer is very bad when your IAX trunk latency is low--after turning it off our elusive call quality issues disappeared (hat-tip to this post on asterisk-dev list).
- Do anything you can to offload non-essential tasks to other servers. For example, our call recording streams are mixed by another server as a nightly cron job.
- Call recording can greatly increase the stress on your server and requires extra attention.
- zttest is pretty much useless on non-hardware-based systems.
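Whether a jitter buffer helps or hurts depends on how much jitter the trunk really has, so it is worth measuring before deciding. The sketch below borrows the running interarrival jitter estimator from RFC 3550 (the RTP specification). Applying it to IAX traffic is an assumption on our part, and the timestamps in the example are made up; in practice they would come from a packet capture.

```python
def interarrival_jitter(send_ms, recv_ms):
    """Running jitter estimate in the style of RFC 3550, section 6.4.1.

    send_ms: sender timestamps of consecutive packets, in milliseconds.
    recv_ms: arrival times of the same packets, in milliseconds.
    Returns the smoothed jitter in milliseconds.
    """
    if len(send_ms) != len(recv_ms):
        raise ValueError("need one arrival time per sent packet")
    jitter = 0.0
    for i in range(1, len(send_ms)):
        # Difference between arrival spacing and send spacing.
        d = (recv_ms[i] - recv_ms[i - 1]) - (send_ms[i] - send_ms[i - 1])
        # Exponential smoothing with gain 1/16, as in the RFC.
        jitter += (abs(d) - jitter) / 16
    return jitter


# Hypothetical 20 ms packet stream with a constant 5 ms path delay.
steady = interarrival_jitter([0, 20, 40, 60], [5, 25, 45, 65])
# Same stream, but the third packet arrives 10 ms late.
bumpy = interarrival_jitter([0, 20, 40, 60], [5, 25, 55, 65])
print(steady, bumpy)
```

If the measured jitter stays at a few milliseconds, as one would expect on a ~5 ms link to the provider, a buffer sized in hundreds of milliseconds mostly adds delay.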
Possible Improvements
- A number of measures could be taken to further improve our system's reliability:
- Install non-hardware-based Asterisk server at each office ($650 × 3 offices for non-redundant servers, $1,300 × 3 for mirrored solution)
- Migrate to layer-2-based VoIP provider on Utopia
- Upgrade server to use hardware PRI ($1,700 for each box in the mirror, ongoing cost of T1)
- Install PRI-based Asterisk server in each office ($8,750 for non-redundant servers, $17,500 for mirrored solution, roughly $250/month per office for T1s)
- We are still looking for a better fax solution. A good T.37 or T.38 solution would be the final missing link in our VoIP solution.
- Other good ideas mentioned during presentation:
- Create a web form with check boxes to help users report call quality issues more conveniently
- There is a general consensus that Polycom phones are the best, mostly due to their superior speakerphones. In retrospect, we would have ordered Polycom over the Linksys SPA-942s, even though they are a little more difficult to administer.
- Completely eliminating the jitter buffer scared some people. A decent alternative may be to tweak its parameters. By default it is set to buffer one full second of packets, which is probably overkill for most situations.
- The more you can automate your installation, the easier it is to test a variety of alternatives, and the more quickly you can recover from disasters. Full automation increases administrator confidence, which makes troubleshooting more efficient and successful.
- One long ordeal I failed to mention was the nuisance of porting existing DIDs to our provider. This was a long and painful process that I wouldn't wish on anyone. Factor this in when taking the plunge.
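The per-office options listed under possible improvements are easier to compare as first-year totals. The sketch below tallies them under two assumptions: that there are three remote offices, and that the $8,750 and $17,500 PRI figures are totals across those offices rather than per-office prices. The SLC PRI upgrade is left out because its ongoing T1 cost is not given.

```python
# Figures are the ones quoted in the list of possible improvements.
# ASSUMPTIONS: three remote offices; PRI hardware figures are totals.
OFFICES = 3
T1_MONTHLY_PER_OFFICE = 250  # "roughly $250/month per office"

OPTIONS = {
    "software server per office": {
        "hardware": 650 * OFFICES, "monthly": 0},
    "mirrored software servers per office": {
        "hardware": 1300 * OFFICES, "monthly": 0},
    "PRI server per office": {
        "hardware": 8750, "monthly": T1_MONTHLY_PER_OFFICE * OFFICES},
    "mirrored PRI servers per office": {
        "hardware": 17500, "monthly": T1_MONTHLY_PER_OFFICE * OFFICES},
}


def first_year_cost(option):
    """Hardware plus twelve months of recurring T1 charges."""
    return option["hardware"] + 12 * option["monthly"]


for name, option in OPTIONS.items():
    print(f"{name}: ${first_year_cost(option):,}")
```

Read this way, the recurring T1 charges add about as much in the first year as the non-redundant PRI hardware itself.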