# Complete Parameter Guide for arun()

The following parameters can be passed to the `arun()` method. They are organized by their primary usage context and functionality.

## Core Parameters

```python
await crawler.arun(
    url="https://example.com",     # Required: URL to crawl
    verbose=True,                  # Enable detailed logging
    cache_mode=CacheMode.ENABLED,  # Control cache behavior
    warmup=True                    # Whether to run warmup check
)
```

## Cache Control

```python
from crawl4ai import CacheMode

await crawler.arun(
    cache_mode=CacheMode.ENABLED,  # Normal caching (read/write)

    # Other cache modes:
    # cache_mode=CacheMode.DISABLED    # No caching at all
    # cache_mode=CacheMode.READ_ONLY   # Only read from cache
    # cache_mode=CacheMode.WRITE_ONLY  # Only write to cache
    # cache_mode=CacheMode.BYPASS      # Skip cache for this operation
)
```

## Content Processing Parameters

### Text Processing

```python
await crawler.arun(
    word_count_threshold=10,                 # Minimum words per content block
    image_description_min_word_threshold=5,  # Minimum words for image descriptions
    only_text=False,                         # Extract only text content
    excluded_tags=['form', 'nav'],           # HTML tags to exclude
    keep_data_attributes=False,              # Preserve data-* attributes
)
```

### Content Selection

```python
await crawler.arun(
    css_selector=".main-content",  # CSS selector for content extraction
    remove_forms=True,             # Remove all form elements
    remove_overlay_elements=True,  # Remove popups/modals/overlays
)
```

### Link Handling

```python
await crawler.arun(
    exclude_external_links=True,          # Remove external links
    exclude_social_media_links=True,      # Remove social media links
    exclude_external_images=True,         # Remove external images
    exclude_domains=["ads.example.com"],  # Specific domains to exclude
    social_media_domains=[                # Additional social media domains
        "facebook.com",
        "twitter.com",
        "instagram.com"
    ]
)
```

## Browser Control Parameters

### Basic Browser Settings

```python
await crawler.arun(
    headless=True,              # Run browser in headless mode
    browser_type="chromium",    # Browser engine: "chromium", "firefox", "webkit"
    page_timeout=60000,         # Page load timeout in milliseconds
    user_agent="custom-agent",  # Custom user agent
)
```

### Navigation and Waiting

```python
await crawler.arun(
    wait_for="css:.dynamic-content",  # Wait for element/condition
    delay_before_return_html=2.0,     # Wait before returning HTML (seconds)
)
```

### JavaScript Execution

```python
await crawler.arun(
    js_code=[  # JavaScript to execute (string or list)
        "window.scrollTo(0, document.body.scrollHeight);",
        "document.querySelector('.load-more').click();"
    ],
    js_only=False,  # Only execute JavaScript without reloading page
)
```

### Anti-Bot Features

```python
await crawler.arun(
    magic=True,              # Enable all anti-detection features
    simulate_user=True,      # Simulate human behavior
    override_navigator=True  # Override navigator properties
)
```

### Session Management

```python
await crawler.arun(
    session_id="my_session",  # Session identifier for persistent browsing
)
```

### Screenshot Options

```python
await crawler.arun(
    screenshot=True,          # Take page screenshot
    screenshot_wait_for=2.0,  # Wait before screenshot (seconds)
)
```

### Proxy Configuration

```python
await crawler.arun(
    proxy="http://proxy.example.com:8080",  # Simple proxy URL
    proxy_config={                          # Advanced proxy settings
        "server": "http://proxy.example.com:8080",
        "username": "user",
        "password": "pass"
    }
)
```
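Several of the browser-control parameters above are designed to work together: a first `arun()` call with a `session_id` keeps the page alive, and follow-up calls with the same `session_id` plus `js_only=True` run JavaScript in that already-loaded page instead of re-navigating. A minimal sketch of this pattern (the URL and CSS selectors are placeholders):

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode

async def main():
    async with AsyncWebCrawler() as crawler:
        # First call: load the page and keep it alive under a session id
        await crawler.arun(
            url="https://example.com/products",  # placeholder URL
            session_id="product_session",
            cache_mode=CacheMode.BYPASS,         # fresh load for this session
        )
        # Second call: reuse the same page; execute JS without re-navigating
        result = await crawler.arun(
            url="https://example.com/products",
            session_id="product_session",
            js_code="document.querySelector('.load-more').click();",
            js_only=True,                        # skip navigation, just run the JS
            wait_for="css:.item:nth-child(20)",  # placeholder selector for new items
            cache_mode=CacheMode.BYPASS,
        )
        print(len(result.html))

asyncio.run(main())
```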
## Content Extraction Parameters

### Extraction Strategy

```python
await crawler.arun(
    extraction_strategy=LLMExtractionStrategy(
        provider="ollama/llama2",
        schema=MySchema.schema(),
        instruction="Extract specific data"
    )
)
```

### Chunking Strategy

```python
await crawler.arun(
    chunking_strategy=RegexChunking(
        patterns=[r'\n\n', r'\.\s+']
    )
)
```

### HTML to Text Options

```python
await crawler.arun(
    html2text={
        "ignore_links": False,
        "ignore_images": False,
        "escape_dot": False,
        "body_width": 0,
        "protect_links": True,
        "unicode_snob": True
    }
)
```

## Debug Options

```python
await crawler.arun(
    log_console=True,  # Log browser console messages
)
```

## Parameter Interactions and Notes

1. **Cache and Performance Setup**
   ```python
   # Optimal caching for repeated crawls
   await crawler.arun(
       cache_mode=CacheMode.ENABLED,
       word_count_threshold=10,
       process_iframes=False
   )
   ```

2. **Dynamic Content Handling**
   ```python
   # Handle lazy-loaded content
   await crawler.arun(
       js_code="window.scrollTo(0, document.body.scrollHeight);",
       wait_for="css:.lazy-content",
       delay_before_return_html=2.0,
       cache_mode=CacheMode.WRITE_ONLY  # Cache results after dynamic load
   )
   ```

3. **Content Extraction Pipeline**
   ```python
   # Complete extraction setup
   await crawler.arun(
       css_selector=".main-content",
       word_count_threshold=20,
       extraction_strategy=my_strategy,
       chunking_strategy=my_chunking,
       process_iframes=True,
       remove_overlay_elements=True,
       cache_mode=CacheMode.ENABLED
   )
   ```

## Best Practices

1. **Performance Optimization**
   ```python
   await crawler.arun(
       cache_mode=CacheMode.ENABLED,  # Use full caching
       word_count_threshold=10,       # Filter out noise
       process_iframes=False          # Skip iframes if not needed
   )
   ```

2. **Reliable Scraping**
   ```python
   await crawler.arun(
       magic=True,                      # Enable anti-detection
       delay_before_return_html=1.0,    # Wait for dynamic content
       page_timeout=60000,              # Longer timeout for slow pages
       cache_mode=CacheMode.WRITE_ONLY  # Cache results after successful crawl
   )
   ```

3. **Clean Content**
   ```python
   await crawler.arun(
       remove_overlay_elements=True,    # Remove popups
       excluded_tags=['nav', 'aside'],  # Remove unnecessary elements
       keep_data_attributes=False,      # Remove data attributes
       cache_mode=CacheMode.ENABLED     # Use cache for faster processing
   )
   ```
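Putting the pieces together, here is a self-contained sketch that combines content selection, cleaning, and schema-based LLM extraction in one call. The `Product` class, URL, provider string, and instruction are illustrative placeholders; swap in whatever matches your setup:

```python
import asyncio
from pydantic import BaseModel
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.extraction_strategy import LLMExtractionStrategy

class Product(BaseModel):
    """Hypothetical schema describing the data we want back."""
    name: str
    price: str

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/products",  # placeholder URL
            css_selector=".main-content",        # narrow extraction to main content
            word_count_threshold=10,             # drop tiny text blocks
            remove_overlay_elements=True,        # strip popups/modals first
            extraction_strategy=LLMExtractionStrategy(
                provider="ollama/llama2",        # any supported provider string
                schema=Product.schema(),
                instruction="Extract each product's name and price",
            ),
            cache_mode=CacheMode.ENABLED,
        )
        print(result.extracted_content)          # JSON string of extracted items

asyncio.run(main())
```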