Below are the Xilinx implementation properties I use when prioritizing timing over area. Settings not mentioned are either less important or best left at their default values.

"Exp." means experiment: for that property, it may be worth trying different settings.

Synthesis Options

Property | Setting | Comment
Optimization Goal | Speed |
Optimization Effort | High |
Keep Hierarchy | No | It's generally very beneficial to allow optimization across unit boundaries. If all inputs and outputs of all units are registered, though, this option can safely be set to Yes.
Netlist Hierarchy | Rebuilt | Makes netlist names appear as though hierarchy was preserved, which is great e.g. when adding signals with ChipScope Inserter.

HDL Options

Property | Setting | Comment
FSM Encoding Alg. | Exp. | Experimenting can really make a difference for timing problems in FSMs. But experiment with different encodings on one specific FSM at a time: use the fsm_encoding constraint in the VHDL rather than this global property.
FSM Style | LUT |
Shift Register Extr. | Yes | For delay chains (cascaded DFFs), don't use any resets, and XST will infer shift register macros, which are superior to DFFs in terms of both speed and area.
Resource Sharing | No |
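
The two HDL-level comments above (a per-FSM encoding constraint and a reset-free delay chain) can be sketched in one small entity. This is only an illustration under stated assumptions: the entity name, ports, state names, the chain length and the choice of "one-hot" are all made up for the example, and the encoding value is exactly the thing to experiment with.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical example entity: raises busy after start and
-- pulses done once the start pulse has travelled through the delay chain.
entity hdl_options_sketch is
  port (
    clk   : in  std_logic;
    rst   : in  std_logic;  -- synchronous reset, used by the FSM only
    start : in  std_logic;
    busy  : out std_logic;
    done  : out std_logic
  );
end entity hdl_options_sketch;

architecture rtl of hdl_options_sketch is

  constant DELAY : positive := 16;  -- assumed chain length (must be >= 2)

  type state_t is (IDLE, RUN, FINISH);
  signal state : state_t := IDLE;

  -- XST constraint applied to this FSM only, instead of the global
  -- "FSM Encoding Algorithm" property. Try other values here
  -- (e.g. "gray", "compact", "sequential") when chasing timing.
  attribute fsm_encoding : string;
  attribute fsm_encoding of state : signal is "one-hot";

  -- Delay chain with no reset, so it is a candidate for shift register
  -- inference when Shift Register Extraction is enabled.
  signal chain : std_logic_vector(DELAY - 1 downto 0) := (others => '0');

begin

  fsm : process (clk)
  begin
    if rising_edge(clk) then
      if rst = '1' then
        state <= IDLE;
      else
        case state is
          when IDLE =>
            if start = '1' then
              state <= RUN;
            end if;
          when RUN =>
            -- Only the last tap of the chain is read.
            if chain(DELAY - 1) = '1' then
              state <= FINISH;
            end if;
          when FINISH =>
            state <= IDLE;
        end case;
      end if;
    end if;
  end process fsm;

  busy <= '1' when state = RUN else '0';
  done <= '1' when state = FINISH else '0';

  -- Note: deliberately no reset branch in this process.
  delay_line : process (clk)
  begin
    if rising_edge(clk) then
      chain <= chain(DELAY - 2 downto 0) & start;
    end if;
  end process delay_line;

end architecture rtl;
```

Whether the chain actually ends up as a shift register macro is worth checking in the XST synthesis report; adding a reset to the delay_line process, or reading intermediate taps, is the usual reason it does not.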

Xilinx Specific Options

Property | Setting | Comment
Register Duplication | Yes |
Equivalent Reg. Removal | No/Exp. | Removing equivalent registers increases fan-out. "Yes" typically saves area at the cost of speed, but it can also shorten routing paths and thereby improve timing.
Register Balancing | Yes |
Move First/Last FF Stage | Yes | Although I like keeping registers to/from pads in the IOBs, for pure timing performance they should be allowed to be pushed into logic.
LUT Combining | Exp. |
Optimize Inst. Prim. | Yes/Exp. | Although instantiated primitives are already relatively placed and routed, allowing optimization across the primitive<->user logic border typically gives better timing.
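
When experimenting with Equivalent Reg. Removal set to Yes, it can be useful to protect a few hand-duplicated registers from being merged again. The sketch below assumes the XST equivalent_register_removal constraint applied per signal; the entity, port and signal names are hypothetical, and the two enable registers are duplicated by hand purely to illustrate the fan-out argument.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical example: one enable, registered twice on purpose so that
-- each copy drives only half of the load.
entity dup_reg_sketch is
  port (
    clk   : in  std_logic;
    en_in : in  std_logic;
    a     : in  std_logic_vector(31 downto 0);
    b     : in  std_logic_vector(31 downto 0);
    qa    : out std_logic_vector(31 downto 0);
    qb    : out std_logic_vector(31 downto 0)
  );
end entity dup_reg_sketch;

architecture rtl of dup_reg_sketch is

  -- Two logically identical registers.
  signal en_a : std_logic := '0';
  signal en_b : std_logic := '0';

  -- Ask XST not to merge these two registers, even if the global
  -- Equivalent Register Removal property is set to Yes.
  attribute equivalent_register_removal : string;
  attribute equivalent_register_removal of en_a : signal is "no";
  attribute equivalent_register_removal of en_b : signal is "no";

begin

  process (clk)
  begin
    if rising_edge(clk) then
      en_a <= en_in;
      en_b <= en_in;

      -- Each data register bank is enabled by its own copy.
      if en_a = '1' then
        qa <= a;
      end if;
      if en_b = '1' then
        qb <= b;
      end if;
    end if;
  end process;

end architecture rtl;
```

This only constrains synthesis. MAP has its own Equivalent Reg. Removal property, so whether the duplicates survive implementation should be checked in the MAP report.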

MAP Properties

Property | Setting | Comment
Placer Effort Level | High |
Placer Extra Effort | Normal |
Comb. Logic Opt. | Yes |
Register Duplication | Yes | Allows PAR to add registers to decrease fan-out.
Global Opt. | Yes | One of the most important MAP settings for improving timing. In my experience, most designs improve timing by 20-30%. It comes at a significant cost in runtime, though.
Equivalent Reg. Removal | Yes | See the comment under Xilinx Specific Options above.
Allow Logic Opt. Across Hierarchy | Yes |
Maximum Compression | No/Exp. | Packs the design as densely as possible. Usually saves area, but at a cost to timing.
LUT Combining | No |
MAP Slice Logic into Unused BRAMs | No | I generally don't want those 1.3 ns for the BRAM DI/ADDR delay in my timing budget, no...

Place & Route Properties

Property | Setting | Comment
PAR Effort Level | High |
Extra Effort | Normal |

MAP and PAR both have a multi-threading option, which is off by default. If you're working on a machine with multiple CPUs or a multicore CPU, make sure to enable these options to reduce runtime.