Per your request I have reviewed and evaluated your current IT infrastructure with emphasis on your servers, backup strategy, external security and network devices.  The results are presented in outline form with the most important items first.  Recommendations are highlighted in bold and represent my suggestions as to how to remedy the problem noted. 

I have identified a number of serious problems that should be dealt with as soon as possible (items 1-3).  I have mentioned all of these to management while I was onsite and have stressed the importance of correcting them as soon as possible.  Each of them has the potential to cause serious data loss and business interruption.   

Correcting items 4-6 will allow problems similar to those in 1-3 to be seen sooner.  It is possible that there are other pending failures that are not yet visible because the servers are not configured to report on the error.  That is the thrust of 4-6:  to identify hidden or newly developing problems before they cause a system failure.

Items 7-8 address external security issues.  You currently have a functioning firewall that is protecting your servers and workstations from outside attack.  Items 7 and 8 will answer the questions:

  • Am I keeping out everything that should be kept out?
  • Am I letting in everything that should be let in?
  • Am I keeping in everything that should be kept in?
  • Am I letting out everything that should be let out?

 At the same time the current mix of servers and services will be evaluated from the perspective of what would happen to the rest of the network if an Internet-facing service were to be compromised.

Items 9-10 address internal security and stability issues.  These include antivirus and operating system updates.

Item 11 will move the office from a peer-to-peer environment to a Windows domain environment.  In a nutshell, in the peer-to-peer network model every computer and every server is the final authority as to which of its resources are available and who gets to use them in which way.  In a domain environment all of those functions (and many more) are controlled by a single authority.

Peer-to-peer networking does not scale well: the more users, servers and shared resources that are involved, the more difficult the network becomes to manage.  Domain networks, however, scale to tens of thousands of resources and can be expanded to support almost any size of network.

Item 12 implements a secure remote access environment.  This can be done before 11 but it is easier to manage in a domain. 

Items 13-14 address items that will extend the life of the servers and workstations.  The end result is a more stable network that does not require as much “emergency” intervention due to configuration problems or hardware failure.

Item 15 addresses the life-cycle management of servers and workstations.  This allows for better budgeting and planning for replacement of hardware.

Item 16 can really be done at any time after 1-3 are taken care of. 

Items 17-21 address policies and procedures that will make your network easier to manage and troubleshoot.

Item 22 can be done any time.

 

I am available to come on site and begin correcting these problems immediately.  My schedule is beginning to fill up so I would like to get started as soon as possible, complete at least items 1, 2 and 3, and perhaps begin work on 4 and 5.  A lot of the work will need to be done off-hours so that servers can be restarted without disrupting your daily operations.  Please let me know as soon as possible when you would like me to start.

Prioritized list of recommendations

  1. BkUpFTPSvr1 is logging errors on the boot disk.  The good news is that you won’t lose your email or your web site when this drive dies but the bad news is you won’t be able to access either one because the server will not boot.  This is probably one of your oldest servers running some of your most mission-critical applications.  This is a recipe for disaster. 
    1. Fix or replace the failing drive immediately. 
    2. Your email server and your web server should not be on the same hardware anyway.  Web servers are often breached and it is possible that all of your email could be exposed if that happens.  Plan to separate these functions on to different hardware.  Neither needs to be particularly robust, but they do need to be backed up regularly.
  2. Backups.  While we are on the subject of backups:  you really don’t have any.  This is just as important as fixing the drive on the mail server.  It’s really a toss-up as to which should be done first.  You have an important server that may fail at any time.  If you have good backups then this failure is easier to recover from.  On the other hand if you fix the bad drive then the immediate need for a good backup system is lessened.  It’s really going to be your call but I’d fix the server first if it were me. 

 

The current backup strategy is a simple file copy of files on each of the servers to a folder on an external disk drive.  This would be a bad backup procedure in your home and it is a terrible way to back up data that your company depends on.  To make matters worse, one of the backup drives is experiencing drive issues and may fail.  The way the backups are structured you have one copy of everything.  If something happens to a file on your server and you don’t notice the same day then you may not be able to use your backup to recover from the problem.  At one time you had a robotic tape library driven by a good commercial backup program and tapes were being taken offsite.  I don’t know if you outgrew the tapes or if there is something wrong with the robotic library or if no one knew how to run it, but at some point someone decided to implement your current strategy and you have been playing with fire ever since.

    1. Fix the problem on the failing backup device.  That way you at least have a chance that you will have a working backup if the need arises.
    2. Implement a proper backup strategy that involves multiple generations of backups and offsite storage of backed up data.  The ideal would be a local backup to disk and the disk backup would be replicated to an offsite location and to tape.  Specific generations of the tape backup would also be rotated offsite.
    3. Implement a regular plan of testing any backup strategy devised.  A backup that is not tested is not a backup.
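
To illustrate what "multiple generations" means in practice, here is a minimal sketch of a grandfather-father-son retention rule.  It is not a substitute for a commercial backup product.  The function name and the retention counts (7 daily, 4 weekly, 12 monthly) are my assumptions for illustration, not measurements from your site.

```python
from datetime import date, timedelta

def generations_to_keep(backup_dates, today, daily=7, weekly=4, monthly=12):
    """Return the set of backup dates to retain.

    Keeps (all counts are illustrative assumptions):
      - the most recent `daily` backups
      - the newest backup in each of the most recent `weekly` ISO weeks
      - the newest backup in each of the most recent `monthly` months
    """
    dates = sorted((d for d in backup_dates if d <= today), reverse=True)
    keep = set(dates[:daily])

    newest_per_week = {}
    newest_per_month = {}
    for d in dates:  # newest first, so the first date seen per bucket wins
        newest_per_week.setdefault(d.isocalendar()[:2], d)   # (ISO year, week)
        newest_per_month.setdefault((d.year, d.month), d)

    keep.update(sorted(newest_per_week.values(), reverse=True)[:weekly])
    keep.update(sorted(newest_per_month.values(), reverse=True)[:monthly])
    return keep

# Example: 90 nightly backups ending on a hypothetical date.
example_today = date(2011, 6, 1)
nightly = [example_today - timedelta(days=n) for n in range(90)]
kept = generations_to_keep(nightly, example_today)
```

With a rule like this a file damaged three weeks ago can still be recovered from an older generation, which the current single-copy approach cannot do.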
  3. DFS and DFS2 are both experiencing drive issues.  The problem may be one or more drives or it may be a configuration or firmware issue with the RAID controller.  In any case it does not appear to be breaking anything right now but it should be fixed as soon as you have a good backup of one or both systems.
    1. Identify and correct the source of the errors on each of these servers.
    2. More specific recommendations cannot be provided without knowing the source of the errors.
  4. The first three items were chosen because there are indications of an impending failure.  Server logs were scanned for indications of failing drives.  Predictive failure is a function of the disk drive and the BIOS.  It was not possible to check the BIOS settings on the servers without rebooting them so it is unknown if the absence of warnings in the logs is due to an absence of problems or because the feature is not turned on in BIOS.
    1. On all servers verify that any predictive failure features are active. 
    2. If any servers receive BIOS updates, check again to see if the feature was included or modified in the update.
  5. Regular checking of the physical condition of disk drives is an important part of regular maintenance.  In most cases this cannot be done while the server is in use so it must be scheduled during the regular maintenance window.  This is particularly important on older drives (presumed to be present in your older servers) and on servers that do not have predictive drive failure activated.
    1. Designate a regular maintenance window for each server.
    2. Exercise hardware and look for signs of impending failure.
  6. General maintenance concerns.  Based on the outdated BIOS and Windows Updates it is probable that other internal maintenance has not been performed.  This would include chipsets, RAID controllers, and switch and router firmware, just to mention a few.
    1. Thoroughly evaluate the condition of every device that is in production.  Correct deficiencies where possible.  This would need to be done at a time when servers and other equipment can be rebooted as required.
  7. The current configuration of your firewall exposes the internal network unnecessarily.
    1. Redesign your firewall and server infrastructure so that services and files that do not need to be exposed to the Internet are not running on servers that are exposed to the Internet.
  8. Evaluate current firewall rules. 
    1. Firewall should be tested regularly for effectiveness of rule set both in what is allowed and what is denied.
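
The regular firewall test can be made repeatable by writing down what should and should not be reachable and comparing that with what actually is.  The sketch below is one possible approach, not a full penetration test.  The address (a reserved documentation address) and the port list are placeholders I made up, not your actual rule set, and the inbound checks would have to be run from outside your network.

```python
import socket

def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, etc.
        return False

def audit_rules(expected, probe=tcp_port_open):
    """Compare observed reachability with the expected rule set.

    `expected` maps (host, port) -> True if the port should be reachable.
    Returns a list of (host, port, should_be_open, is_open) mismatches.
    """
    mismatches = []
    for (host, port), should_be_open in sorted(expected.items()):
        is_open = probe(host, port)
        if is_open != should_be_open:
            mismatches.append((host, port, should_be_open, is_open))
    return mismatches

# Hypothetical expectations as seen from OUTSIDE the firewall.
EXPECTED_FROM_OUTSIDE = {
    ("192.0.2.10", 25): True,     # mail should be let in
    ("192.0.2.10", 80): True,     # web site should be let in
    ("192.0.2.10", 445): False,   # Windows file sharing should be kept out
    ("192.0.2.10", 3389): False,  # Remote Desktop should be kept out
}
```

Calling `audit_rules(EXPECTED_FROM_OUTSIDE)` returns an empty list when the firewall behaves as intended.  A second table run from inside the network would answer the "kept in" and "let out" questions.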
  9. Antivirus.  All computers and servers should have a valid, licensed, active copy of antivirus software.  You appear to have a license for Symantec Antivirus Corporate Edition.  BKUPFTPSVR1 is your current SAV server.
    1. Computers connected to BKUPFTPSVR1 with up-to-date AV software
      1. Acctng
      2. Cadfileadmn1
      3. Copyctrmgr
      4. Digitalsvr1
      5. Dispatch1
      6. Estimating
      7. Ftpsvr1
      8. Hrsandra
      9. Planroom
      10. Sales3dv
      11. Salesassist1es
      12. Ups
    2. Computers with no known antivirus
      1. All the rest
  10. Automatic Updates.  All computers should be configured to download and install updates from Microsoft on a regular basis.  Each computer and server can be configured to download and install updates individually or they can be configured to use an internal server which has downloaded each required update just once.  If the internal server method is used it is easy to produce a report on which computers are behind on their updates.  If they are each installing individually then finding the ones that are not up to date is more complicated and labor-intensive.  Internal server is recommended after domain is set up.  Manually configure all servers and workstations to automatically download and install in the meantime.  Unpatched servers and workstations potentially leave the entire network at risk.
    1. Servers current settings
      1. BKUPFTPSVR1 – auto download, manual install
      2. PlanRoom – no auto download
      3. CADSVR1 – auto download, autoinstall daily
      4. DigitalSVR1 – no auto download
      5. DFS2 – autoinstall daily
      6. DFS1 – autoinstall broken.  May be related to drive problems.
      7. FTPSVR1 – no auto download
      8. JobTrackSvr1 – no download
      9. Data1 – no download
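
Until an internal update server can produce this report for you, the list above can be kept as data and checked mechanically.  This is a small sketch using the settings exactly as I recorded them.  The rule that only a working automatic install counts as compliant is my assumption about the policy you will want.

```python
# Update settings as recorded during the site visit (from the list above).
REPORTED = {
    "BKUPFTPSVR1": "auto download, manual install",
    "PlanRoom": "no auto download",
    "CADSVR1": "auto download, autoinstall daily",
    "DigitalSVR1": "no auto download",
    "DFS2": "autoinstall daily",
    "DFS1": "autoinstall broken",
    "FTPSVR1": "no auto download",
    "JobTrackSvr1": "no download",
    "Data1": "no download",
}

def needs_attention(setting):
    """Assumed policy: compliant only if updates install automatically
    and the automatic install is not reported as broken."""
    s = setting.lower()
    return not ("autoinstall" in s and "broken" not in s)

def servers_needing_attention(reported):
    """Return the names of non-compliant servers, sorted case-insensitively."""
    return sorted(
        (name for name, setting in reported.items() if needs_attention(setting)),
        key=str.lower,
    )
```

Run against the settings above, only CADSVR1 and DFS2 pass; the other seven servers need to be reconfigured.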
  11. Your current peer-to-peer network design will not scale well.  You are already experiencing this in unnecessary support and configuration costs on the workstations and the servers.  In your current situation it would have been a trivial exercise to change the passwords and lock out terminated employees if your network infrastructure were properly designed.  A properly designed network will reduce your ongoing support costs, give you better security, more flexibility and more granular control of resources, and speed up setup and provisioning of new computers. 
    1. Design and implement a domain structure that will support all of your existing Windows servers and workstations.
    2. Identify and configure security structures to allow appropriate access to network resources.
  12. Implement a secure and manageable remote access process.  Your current remote access is a mixture of Windows Remote Desktop connections and unsupported and unmonitored evaluation copies of commercial software (technically a violation of the license agreement).   The fact that only the terminated employees knew about the latter almost proved very costly last week.  As mentioned in the section on peer-to-peer networking, turning off remote access should have been as simple as disabling a single account for each user.
  13. Implement regular and thorough maintenance on each workstation.  This will help prevent problems and often allow them to be corrected before they get bad enough to be noticed by the end-user. 
  14. All computers and servers should be protected by an appropriately sized UPS device with the proper console/control cable and monitoring software.  All of the servers are protected with some type of UPS but it is a hodge-podge of individual units stacked in the server room.  The rear of the stack is a tangled mass of power cables.  There is no way to tell if the servers are plugged into the appropriate unit or to see the serial numbers on most of the UPS devices.  At least one of the UPS units has a bad battery.  None of the units are being monitored and none of them can initiate a controlled shut-down of a server when the battery is almost drained.  The load on any one UPS ranges from 20% to 90% with at least one “no idea”.  It is certainly better than nothing but it is far from ideal.  I did not evaluate the current UPS configuration on any workstation.
    1. Verify that each server is plugged properly into an appropriately sized UPS.
    2. Verify that each UPS is properly cabled and configured to protect the server(s) that are plugged into it.
    3. Verify that each UPS and battery are functioning correctly and implement a regular testing plan to identify failing units before they let a server go down “hard”.
    4. Verify that the load is properly spread among the UPS units and that battery run time is within acceptable limits.
    5. Monitor UPS activity to identify possible power issues.
    6. Plan to replace the pile of stand-alone units with properly sized rack mount units as they wear out.
    7. Verify that each computer and sensitive electronic device in the building is protected by a properly sized UPS or surge suppressor.
    8. Verify that each computer is properly cabled to the UPS and that the UPS is properly configured to shut the computer down when the battery is almost dead.
    9. Verify that all existing UPS units are properly sized, have sufficient run time and that the batteries are functioning correctly.
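
The UPS checks above lend themselves to a simple inventory that is reviewed at each maintenance window.  The following is a sketch only.  The field names and the 80% load ceiling are assumptions I chose for illustration; the right ceiling depends on the run time you need from each unit.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class UpsUnit:
    label: str                      # hypothetical label written on the unit
    load_percent: Optional[float]   # None when the load cannot be read
    battery_ok: bool
    monitored: bool                 # control cable and shutdown software in place

def ups_findings(unit: UpsUnit, max_load_percent: float = 80.0) -> List[str]:
    """List the problems found on one UPS (empty list means no findings)."""
    findings = []
    if unit.load_percent is None:
        findings.append("load unknown")
    elif unit.load_percent > max_load_percent:
        findings.append(
            f"load {unit.load_percent:.0f}% exceeds {max_load_percent:.0f}%"
        )
    if not unit.battery_ok:
        findings.append("battery needs replacement")
    if not unit.monitored:
        findings.append("no monitoring or controlled shutdown")
    return findings
```

For example, a unit at 90% load with no control cable would be reported with two findings, while a unit whose load cannot be read is flagged as "load unknown" rather than silently passed.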
  15. Look at the entire life-cycle of a workstation or a server.
    1. Standardize server and workstation hardware for ease of support.  Build-it-yourself workstations can save money at first but the ongoing support costs generally eat up any one-time savings from using generic parts to build workstations in-house.
      1. Identify classes of users, for instance:  Office, CAD, Sales (you would know better than I would at this point).
      2. For each class, create a standard hardware and software configuration that will work for most users in the class.
    2. Workstations should be replaced every 3-5 years.  Replacement should be based on actual need, not seniority or any other unrelated criteria.  There should be an identifiable and quantifiable reason why a person needs a newer, faster computer and those people should get them.  Their computers are passed down to workers who don’t need the latest and greatest computer to do their job more effectively.  When computers either won’t do what you need them to do in a timely manner or are causing excessive support calls then they should be replaced.  If the new computer does not improve the bottom line, then you really don’t need it.  When the old computer is hurting the bottom line you need to get rid of it.
      1. Evaluate each computer in your business based on the following criteria.
        1. Does this computer currently help me do my job in an effective manner?
        2. What would it have to do differently in order to help me do my job in an effective manner?
        3. Is it more cost effective to fix my existing computer or to replace it with a new one?  This requires an understanding of what is required to fix the computer and the actual costs of fixing it or replacing it.  Ultimately this is a management decision but it is helpful for the end user to go through the process even if management comes to a different conclusion.
      2. Based on the previous answers and on the age of the computer, where does it fit into the 3-5 year replacement matrix? 
      3. Budget for a replacement or repair depending on the previous answers.  The 4 and 5 year users are actually using hand-me-down 3-year machines so you only need to budget for the reprovisioning  costs (if any).
    3. Consider virtualizing servers and/or virtual desktops as a means to cut down on hardware costs.
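
The evaluation questions in the workstation replacement item can be written down as a simple decision rule so that every computer is judged the same way.  This is only a sketch of that reasoning.  The function name, the outcome labels and the 5-year cutoff (the top of the 3-5 year window) are my assumptions, and the final call remains a management decision.

```python
def recommend(effective_now, age_years, repair_cost=None, replace_cost=None,
              max_age_years=5):
    """Suggest an action for one workstation.

    effective_now: does the computer currently help the user do the job
                   effectively?
    repair_cost / replace_cost: estimated costs, or None if not yet known.
    max_age_years: assumed cutoff beyond which repair is not considered.
    """
    if effective_now:
        return "keep"            # need, not age or seniority, drives replacement
    if age_years >= max_age_years:
        return "replace"         # past the replacement window
    if repair_cost is None or replace_cost is None:
        return "get estimates"   # cannot compare costs yet
    return "repair" if repair_cost < replace_cost else "replace"
```

For instance, `recommend(False, 2, repair_cost=150, replace_cost=900)` suggests a repair, while the same machine at five years old would be replaced.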
  16. Cameras.  Not all cameras are recording to disk.  Is this by design?
  17. I’m glad to have a reasonably comprehensive list of user names and passwords available during this evaluation, but the fact that they are stored on a network share in an unencrypted Word document scares me.  According to the documentation these passwords go back to at least 2/08 and with one exception they have all been correct.  That scares me as well.  There is no way to know who has a copy of the document and there is no way to control access to the document other than through the passwords that have apparently never changed.
    1. Change all passwords on a regular basis.
    2. Require complex passwords (it’s easy – I can teach you how to remember them).  Passwords with more access should be more complex.  Remote access passwords should be more complex.
    3. Forbid storing passwords in any unencrypted form.
    4. Require copies of all system passwords to be documented and stored in a secure manner.  At least two people in senior management should have access to those passwords in case of an emergency.
    5. Never share passwords
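
The idea that passwords with more access should be more complex can be expressed as a tiered rule.  The sketch below is illustrative only.  The tier names and minimum lengths are assumptions of mine, not an established standard, and once the domain is in place its built-in password policy would normally enforce rules like these for you.

```python
import string

# Assumed minimum lengths per tier; adjust to the policy you adopt.
MIN_LENGTH = {"standard": 10, "remote": 14, "admin": 16}

def password_problems(password, tier="standard"):
    """Return a list of reasons the password fails the assumed policy."""
    problems = []
    if len(password) < MIN_LENGTH[tier]:
        problems.append(f"shorter than {MIN_LENGTH[tier]} characters")
    classes_used = sum([
        any(c.islower() for c in password),
        any(c.isupper() for c in password),
        any(c.isdigit() for c in password),
        any(c in string.punctuation or c.isspace() for c in password),
    ])
    if classes_used < 3:
        problems.append("uses fewer than 3 of: lower, upper, digit, symbol")
    return problems
```

A password that passes at the standard tier can still be rejected for a remote access or administrator account simply because it is too short for that tier.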
  18. Software licensing
    1. Verify that all software is properly licensed.  Consider open licensing for Microsoft products.  It costs a little more but it is easier to manage in the long run.
    2. Maintain physical control of software license and install media. 
  19. Warranties.  Only two servers are still under warranty.  Two others are 4 months out.  It may be possible to renew. 
    1. Recommend keeping warranties active on all production servers.
    2. Recommend retiring servers that can no longer be put under warranty.
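
Warranty status is easy to recheck whenever the server list changes.  This sketch uses the warranty end dates from the server table at the end of this report and leaves out servers whose dates are unknown.  The 180-day renewal window and the as-of date in the usage note are assumptions for illustration; your vendor sets the real renewal terms.

```python
from datetime import date

# Warranty end dates from the server table in this report.
WARRANTY_END = {
    "CADSVR1": date(2013, 6, 19),
    "DATA01": date(2013, 9, 29),
    "DFS": date(2011, 2, 5),
    "DFS2": date(2011, 2, 5),
    "PLANROOM": date(2008, 4, 20),
    "SCO": date(2008, 3, 6),   # estimated in the report, not confirmed
}

def warranty_status(end, as_of, renewal_window_days=180):
    """Classify one warranty relative to the `as_of` date."""
    if end >= as_of:
        return "active"
    if (as_of - end).days <= renewal_window_days:
        return "lapsed recently, ask vendor about renewal"
    return "expired"

def warranty_report(as_of):
    """Return {server name: status} for every server with a known date."""
    return {name: warranty_status(end, as_of)
            for name, end in WARRANTY_END.items()}
```

Calling `warranty_report(date(2011, 6, 1))`, with that date being a hypothetical review date, shows two servers active, DFS and DFS2 as candidates for renewal, and the rest expired.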
  20. Synchronize all devices to an accurate time source to allow for log file correlation.  This will happen automatically when the domain is set up.
    1. FTPSVR1 is currently 12 hours fast.
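
Until the clocks are synchronized, log entries from different servers can only be compared after correcting for each server's known clock error.  Here is a minimal sketch of that correction using the 12-hour offset noted above.  The function names and the sample events in the usage note are made up for illustration.

```python
from datetime import datetime, timedelta

# Known clock errors; a positive value means the server's clock runs fast.
CLOCK_OFFSET = {"FTPSVR1": timedelta(hours=12)}

def corrected_time(host, logged):
    """Convert a timestamp logged on `host` to true time."""
    return logged - CLOCK_OFFSET.get(host, timedelta(0))

def merge_events(events):
    """Sort (host, logged_time, message) tuples by corrected time."""
    return sorted(events, key=lambda e: corrected_time(e[0], e[1]))
```

For example, an event that FTPSVR1 logged at 14:05 actually happened at 02:05, so it sorts before an event that another server logged at 02:10 on the same day.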
  21. Log files
    1. Server log files should be configured to allow data to be held until processed and recorded.  This time will depend on how often you process your logs.
    2. Log files should be processed on a regular basis.  Errors should be handled and warnings evaluated.
    3. Security logs should be evaluated carefully and often.
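
Regular log processing can start very simply: export each log, separate the errors from the warnings, and see which component is generating the most entries.  The sketch below assumes a CSV export with `Level`, `Source` and `Message` columns.  Those column names are my assumption and should be adjusted to match whatever your export actually produces.

```python
import csv
import io
from collections import Counter

def triage(csv_text):
    """Split exported log records into errors to handle and warnings to evaluate.

    Returns (errors, warnings, counts_by_source).  Informational records
    are ignored.
    """
    errors, warnings = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        level = row["Level"].strip().lower()
        if level in ("error", "critical"):
            errors.append(row)
        elif level == "warning":
            warnings.append(row)
    # Counting by source shows which component needs attention first.
    by_source = Counter(r["Source"] for r in errors + warnings)
    return errors, warnings, by_source
```

Running this on a schedule that is shorter than the log retention period ensures that nothing is overwritten before it has been reviewed.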
  22. KVM switch is not plugged into a surge suppressor or UPS.  Move during next maintenance window.

 

| Server | Purchased | Warranty | Note |
|---|---|---|---|
| BKUPFTPSVR1 | 2000-2003** | Unknown but probably expired. | Do first.  Probably one of your oldest servers.  Server is known to be experiencing disk problems.  Failure of this server would disable web site and email.  This is old hardware running extremely mission-critical applications.  This is a recipe for disaster. |
| CADSVR1 | 6/19/2010 | 6/19/2013 | |
| DATA01 | 9/29/2010 | 9/29/2013 | |
| DFS | 2/5/2008 | 2/5/2011 | Work on this after you have good backups.  Logs show problems with one of the volumes.  This may be a software problem with the RAID controller or it may be a physical problem.  At this time it does not seem to be affecting anything.  Recommend not making any changes until a stable backup system is in place. |
| DFS2 | 2/5/2008 | 2/5/2011 | Work on this after DFS is fixed.  Logs indicate problems with one of the drives.  It does not seem to be causing problems with the applications running on the server but should be addressed as soon as you can get a good backup. |
| DIGITALSVR1 | 2000-2003** | Unknown but probably expired. | |
| FTPSVR1 | 2000-2003** | Unknown but probably expired. | |
| JOBTRACKSVR1 | 2002-2008** | Unknown but probably expired. | |
| PLANROOM | 4/21/2005 | 4/20/2008 | |
| SCO | 3/7/2005*** | 3/6/2008** | |
| TeraServer 1 & 2 | | | Do second.  This is just barely less important than the first item.  Unit 2 is experiencing problems with one or more drives in the enclosure.  Since this is your only backup device it would be very important to fix this as soon as possible or move to a more robust backup strategy. |

** Date is based on the dates when the operating system installed on the server was in common use.  It should be noted that in at least one case it is known that an obsolete operating system was installed on a server at least three years later than would have been expected.  Accurate dates should be available in your depreciation system.  If the dates are significantly different than this estimate then the recommendations should be reconsidered in light of the more accurate dates.

*** Date is based on known purchase date of similar hardware.