===== First Steps with Apache Sqoop 2 - v1.99 =====

Apache Sqoop (see => http://sqoop.apache.org/ ) transfers data between Hadoop and a relational database.

Architecture:

{{ :hadoop:sqoop_v01.png?500 |Apache Sqoop 1.99 overview }}

Version 2 (v1.99) implements a client/server architecture.

==== Installing Apache Sqoop ====

Prerequisites:
  * Hadoop installed on the server, at least as a client
  * see for example this Hadoop environment: [[nosql:hadoop_first_setup|Installing and configuring a Hadoop 2.4 test environment]]

Steps:
  * Download Sqoop from [[http://mirror.netcologne.de/apache.org/sqoop/1.99.3/sqoop-1.99.3-bin-hadoop200.tar.gz|sqoop-1.99.3-bin-hadoop200.tar]] for Hadoop 2.x or newer
  * Unpack it to /opt/hadoop/product
  * Set the environment
  * Download the Oracle JDBC driver from [[http://www.oracle.com/technetwork/database/features/jdbc/jdbc-drivers-12c-download-1958347.html|Oracle Database 12c Release 1 JDBC Driver Downloads]]
  * Copy the JDBC driver into the Sqoop server lib directory
  * Test

In detail:

<code>
# unpack
cd /opt/hadoop/product
wget http://mirror.netcologne.de/apache.org/sqoop/1.99.3/sqoop-1.99.3-bin-hadoop200.tar.gz
tar xfvz sqoop-1.99.3-bin-hadoop200.tar.gz
rm sqoop-1.99.3-bin-hadoop200.tar.gz

# create a symlink
ln -s sqoop-1.99.3-bin-hadoop200 sqoop

# set the environment
vi ~/.bashrc
..
# Sqoop
export SQOOP_HOME=/opt/hadoop/product/sqoop
export PATH=$SQOOP_HOME/bin:$PATH
..
. ~/.bashrc

# copy the Oracle JDBC driver into the server lib directory
cp ~/ojdbc7.jar $SQOOP_HOME/server/lib/

# create the lib directory
mkdir $SQOOP_HOME/lib

# first test of the client
sqoop.sh client

Sqoop home directory: /opt/hadoop/product/sqoop
Sqoop Shell: Type 'help' or '\h' for help.

sqoop:000> show version
client version:
  Sqoop 1.99.3 revision 2404393160301df16a94716a3034e31b03e27b0b
  Compiled by mengweid on Fri Oct 18 14:15:53 EDT 2013
sqoop:000>
</code>

=== Configuring Apache Sqoop ===

To give the server access to the Hadoop libraries, adjust the common.loader path in the file catalina.properties in $SQOOP_HOME/server/conf and add **ALL** Hadoop jar directories to the path!

<code>
vi $SQOOP_HOME/server/conf/catalina.properties

common.loader=${catalina.base}/lib,${catalina.base}/lib/*.jar,${catalina.home}/lib,${catalina.home}/lib/*.jar,${catalina.home}/../lib/*.jar,/opt/hadoop/product/hadoop/share/hadoop/common/*.jar,/opt/hadoop/product/hadoop/share/hadoop/common/lib/*.jar,/opt/hadoop/product/hadoop/share/hadoop/hdfs/*.jar,/opt/hadoop/product/hadoop/share/hadoop/hdfs/lib/*.jar,/opt/hadoop/product/hadoop/share/hadoop/mapreduce/*.jar,/opt/hadoop/product/hadoop/share/hadoop/mapreduce/lib/*.jar,/opt/hadoop/product/hadoop/share/hadoop/tools/*.jar,/opt/hadoop/product/hadoop/share/hadoop/tools/lib/*.jar,/opt/hadoop/product/hadoop/share/hadoop/yarn/*.jar,/opt/hadoop/product/hadoop/share/hadoop/yarn/lib/*.jar
</code>

The Sqoop server configuration lives under **$SQOOP_HOME/server/conf** in the files **sqoop_bootstrap.properties** and **sqoop.properties**. In **sqoop.properties** the Hadoop configuration directory must be set:

<code>
vi $SQOOP_HOME/server/conf/sqoop.properties

# Hadoop configuration directory
org.apache.sqoop.submission.engine.mapreduce.configuration.directory=/opt/hadoop/product/hadoop/etc/hadoop/
</code>

The ports can be set in the file **$SQOOP_HOME/server/bin/setenv.sh** via **SQOOP_HTTP_PORT** and **SQOOP_ADMIN_PORT**; the defaults are 12000 and 12001.
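A minimal sketch of such a port change in setenv.sh; the variable names and default values are the ones mentioned above, the export form is an assumption:

<code>
# sketch of $SQOOP_HOME/server/bin/setenv.sh - the values shown are just the defaults
export SQOOP_HTTP_PORT=12000
export SQOOP_ADMIN_PORT=12001
</code>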
Start the server:

<code>
$SQOOP_HOME/bin/sqoop.sh server start
</code>

The URL http://localhost:12000/sqoop/version can be used to check whether the server is running and which version it reports:

<code>
curl http://localhost:12000/sqoop/version
</code>

Stop the server:

<code>
$SQOOP_HOME/bin/sqoop.sh server stop
</code>

=== Testing the server and fixing the first possible errors ===

Connect to the server with the client and run a test:

<code>
$SQOOP_HOME/bin/sqoop.sh client

Sqoop Shell: Type 'help' or '\h' for help.

# connect to the server
sqoop:000> set server --host nosqldb01 --port 12000 --webapp sqoop
Server is set successfully

sqoop:000> show version

# query the server
sqoop:000> show version --all
client version:
  Sqoop 1.99.3 revision 2404393160301df16a94716a3034e31b03e27b0b
  Compiled by mengweid on Fri Oct 18 14:15:53 EDT 2013
Exception has occurred during processing command
Exception: com.sun.jersey.api.client.UniformInterfaceException
Message: GET http://nosqldb01:12000/sqoop/version returned a response status of 404 Not Found
</code>

If this error occurs, there is probably a configuration problem. Analyze the log files under $SQOOP_HOME/server/logs:

<code>
vi localhost.2014-08-09.log
...
Caused by: org.apache.sqoop.common.SqoopException: MAPREDUCE_0002:Failure on submission engine initialization
- Invalid Hadoop configuration directory (not a directory or permission issues): /etc/hadoop/conf/
...
</code>

Solution:

<code>
cd $SQOOP_HOME/server/conf
vi sqoop.properties

# Hadoop configuration directory
org.apache.sqoop.submission.engine.mapreduce.configuration.directory=/opt/hadoop/product/hadoop/etc/hadoop/
</code>

Error when stopping the server:

<code>
Aug 09, 2014 2:44:41 PM org.apache.catalina.startup.ClassLoaderFactory validateFile
WARNING: Problem with directory [/opt/hadoop/product/sqoop-1.99.3-bin-hadoop200/lib], exists: [false], isDirectory: [false], canRead: [false]

# solution
mkdir /opt/hadoop/product/sqoop-1.99.3-bin-hadoop200/lib
</code>

==== Repository database ====

In a default installation the repository database is located under $SQOOP_HOME/@BASEDIR@/repository/db

==== Copying the first data between the database and Hadoop ====

The table **SYSTEM.AUD$** from the Oracle database is to be imported into the Hadoop cluster as a text file.
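Before creating the connection and job objects it can be worth a quick check that the source table is reachable with the planned credentials. A hypothetical pre-check from the shell, assuming an Oracle client with sqlplus is available on this host (connection data as used below, the password is a placeholder):

<code>
# hypothetical pre-check: count the rows of SYSTEM.AUD$ with the same account
# that the Sqoop connection object will use ("your_password" is a placeholder)
sqlplus -s system/your_password@//10.10.10.1:1521/gpi <<'EOF'
SELECT COUNT(*) FROM SYSTEM.AUD$;
EOF
</code>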
Procedure:
  * Start the client and connect to the Sqoop server
  * Create a connection object
  * Define a job object
  * Start the job and import the data into Hadoop

**Start the client and connect to the Sqoop server**

<code>
$SQOOP_HOME/bin/sqoop.sh client

sqoop:000> set server --host nosqldb01 --port 12000 --webapp sqoop
</code>

**Create a connection object**

The following information is needed for the connection object:
  * Driver class - oracle.jdbc.driver.OracleDriver
  * JDBC URL - jdbc:oracle:thin:@//10.10.10.1:1521/gpi
  * Username and password

<code>
sqoop:000> create connection --cid 1
Creating connection for connector with id 1
Please fill following values to create new connection object
Name: GPI DB

Connection configuration

JDBC Driver Class: oracle.jdbc.driver.OracleDriver
JDBC Connection String: jdbc:oracle:thin:@//10.10.10.1:1521/gpi
Username: system
Password: ******
JDBC Connection Properties:
There are currently 0 values in the map:
entry#

Security related configuration options

Max connections: 10
New connection was successfully created with validation status FINE and persistent id 1

sqoop:000> show connection --xid 1
1 connection(s) to show:
Connection with id 2 and name GPI DB (Enabled: true, Created by hadoop at 8/9/14 5:27 PM, Updated by hadoop at 8/9/14 5:27 PM)
Using Connector id 1
  Connection configuration
    JDBC Driver Class: oracle.jdbc.driver.OracleDriver
    JDBC Connection String: jdbc:oracle:thin:@//10.10.10.1:1521/gpi
    Username: system
    Password:
    JDBC Connection Properties:
  Security related configuration options
    Max connections: 10
</code>

**Define a job object**

The following information is needed for the job object:
  * Name and schema of the table
  * Target location in HDFS

<code>
hdfs dfs -mkdir /auditVault
hdfs dfs -chmod 777 /auditVault
</code>

<code>
sqoop:000> create job --xid 1 --type import
Creating job for connection with id 1
Please fill following values to create new job object
Name: GPI AUD Import

Database configuration

Schema name: SYSTEM
Table name: AUD$
Table SQL:
Table column names:
Partition column name: USERID
Nulls in partition column:
Boundary query:

Output configuration

Storage type:
  0 : HDFS
Choose: 0
Output format:
  0 : TEXT_FILE
  1 : SEQUENCE_FILE
Choose: 1
Compression format:
  0 : NONE
  1 : DEFAULT
  2 : DEFLATE
  3 : GZIP
  4 : BZIP2
  5 : LZO
  6 : LZ4
  7 : SNAPPY
Choose: 0
Output directory: /auditVault/GPI_AUD

Throttling resources

Extractors:
Loaders:
New job was successfully created with validation status FINE and persistent id 1

sqoop:000> show job
+----+----------------+--------+-----------+---------+
| Id | Name           | Type   | Connector | Enabled |
+----+----------------+--------+-----------+---------+
| 1  | GPI AUD Import | IMPORT | 1         | true    |
+----+----------------+--------+-----------+---------+
</code>

**Start the job and import the data into Hadoop**

The **submission start** syntax is not accepted here; the job has to be started with **start job** instead:

<code>
sqoop:000> submission start --jid 1
Exception has occurred during processing command
Unknown command: No such property: start for class: groovysh_evaluate

sqoop:000> start job --jid 1
Exception has occurred during processing command
Exception: org.apache.sqoop.common.SqoopException
Message: CLIENT_0001:Server has returned exception

# make the error message visible:
sqoop:000> set option --name verbose --value true
sqoop:000> start job --jid 1
...
Caused by: Exception: java.lang.Throwable
Message: GENERIC_JDBC_CONNECTOR_0005:No column is found to partition data
Stack trace:
...
</code>
<code>
sqoop:000> update job --jid 1
# keep all the old values except:
#   Partition column name: TIMESTAMP#
#   Output directory: /auditVault/gpi_aud

sqoop:000> start job --jid 1
2014-08-09 17:56:17 CEST: FAILURE_ON_SUBMIT
Exception: java.io.FileNotFoundException: File does not exist: hdfs://nosqldb01:9000/opt/hadoop/product/sqoop-1.99.3-bin-hadoop200/server/webapps/sqoop/WEB-INF/lib/sqoop-common-1.99.3.jar
</code>

The following change to the Hadoop configuration fixes this behavior:

<code>
vi $HADOOP_HOME/etc/hadoop/mapred-site.xml

<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
</code>

Next test:

<code>
sqoop:000> start job --jid 1
2014-08-09 18:01:18 CEST: FAILURE_ON_SUBMIT
Exception: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.SafeModeException):
Cannot create directory /tmp/hadoop-yarn/staging/hadoop/.staging. Name node is in safe mode.
The reported blocks 10 has reached the threshold 0.9990 of total blocks 10.
The number of live datanodes 1 has reached the minimum number 0. In safe mode extension

# solution:
hdfs dfsadmin -safemode leave
</code>

Next test:

<code>
sqoop:000> start job --jid 1
Submission details
Job ID: 1
Server URL: http://nosqldb01:12000/sqoop/
Created by: hadoop
Creation date: 2014-08-09 18:06:13 CEST
Lastly updated by: hadoop
External ID: job_1407600059752_0001
        http://nosqldb01:8088/proxy/application_1407600059752_0001/
2014-08-09 18:06:13 CEST: BOOTING  - Progress is not available
</code>

What is happening:

<code>
sqoop:000> start job --jid 1
Submission details
Job ID: 1
Server URL: http://nosqldb01:12000/sqoop/
Created by: hadoop
Creation date: 2014-08-11 21:17:42 CEST
Lastly updated by: hadoop
External ID: job_1407784150673_0001
        http://nosqldb01:8088/proxy/application_1407784150673_0001/
Connector schema: Schema{name=AUD$,columns=[
        Decimal{name=SESSIONID,nullable=null,precision=null,scale=null},
        Decimal{name=ENTRYID,nullable=null,precision=null,scale=null},
        Decimal{name=STATEMENT,nullable=null,precision=null,scale=null},
        Date{name=TIMESTAMP#,nullable=null,fraction=null,timezone=null},
        Text{name=USERID,nullable=null,size=null},
        Text{name=USERHOST,nullable=null,size=null},
        Text{name=TERMINAL,nullable=null,size=null},
        Decimal{name=ACTION#,nullable=null,precision=null,scale=null},
        Decimal{name=RETURNCODE,nullable=null,precision=null,scale=null},
        Text{name=OBJ$CREATOR,nullable=null,size=null},
        Text{name=OBJ$NAME,nullable=null,size=null},
        Text{name=AUTH$PRIVILEGES,nullable=null,size=null},
        Text{name=AUTH$GRANTEE,nullable=null,size=null},
        Text{name=NEW$OWNER,nullable=null,size=null},
        Text{name=NEW$NAME,nullable=null,size=null},
        Text{name=SES$ACTIONS,nullable=null,size=null},
        Decimal{name=SES$TID,nullable=null,precision=null,scale=null},
        Decimal{name=LOGOFF$LREAD,nullable=null,precision=null,scale=null},
        Decimal{name=LOGOFF$PREAD,nullable=null,precision=null,scale=null},
        Decimal{name=LOGOFF$LWRITE,nullable=null,precision=null,scale=null},
        Decimal{name=LOGOFF$DEAD,nullable=null,precision=null,scale=null},
        Date{name=LOGOFF$TIME,nullable=null,fraction=null,timezone=null},
        Text{name=COMMENT$TEXT,nullable=null,size=null},
        Text{name=CLIENTID,nullable=null,size=null},
        Text{name=SPARE1,nullable=null,size=null},
        Decimal{name=SPARE2,nullable=null,precision=null,scale=null},
        Binary{name=OBJ$LABEL,nullable=null,size=null},
        Binary{name=SES$LABEL,nullable=null,size=null},
        Decimal{name=PRIV$USED,nullable=null,precision=null,scale=null},
        Decimal{name=SESSIONCPU,nullable=null,precision=null,scale=null},
        Date{name=NTIMESTAMP#,nullable=null,fraction=null,timezone=null},
        Decimal{name=PROXY$SID,nullable=null,precision=null,scale=null},
        Text{name=USER$GUID,nullable=null,size=null},
        Decimal{name=INSTANCE#,nullable=null,precision=null,scale=null},
        Text{name=PROCESS#,nullable=null,size=null},
        Binary{name=XID,nullable=null,size=null},
        Text{name=AUDITID,nullable=null,size=null},
        Decimal{name=SCN,nullable=null,precision=null,scale=null},
        Decimal{name=DBID,nullable=null,precision=null,scale=null},
        FloatingPoint{name=SQLBIND,nullable=null,byteSize=null},
        FloatingPoint{name=SQLTEXT,nullable=null,byteSize=null},
        Text{name=OBJ$EDITION,nullable=null,size=null}]}
2014-08-11 21:17:42 CEST: BOOTING  - Progress is not available
</code>

Status in YARN:

<code>
yarn application -list
14/08/11 21:21:33 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
14/08/11 21:21:33 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):1
                Application-Id        Application-Name    Application-Type    User    Queue    State    Final-State    Progress    Tracking-URL
application_1407784150673_0001   Sqoop: GPI AUD Import           MAPREDUCE  hadoop  default  RUNNING      UNDEFINED         14%    http://nosqldb01:35645
</code>

Unfortunately, the load always aborts at the end:

<code>
Caused by: org.apache.sqoop.common.SqoopException: GENERIC_JDBC_CONNECTOR_0001:Unable to get a connection
        at org.apache.sqoop.connector.jdbc.GenericJdbcExecutor.(GenericJdbcExecutor.java:51)
        at org.apache.sqoop.connector.jdbc.GenericJdbcImportExtractor.extract(GenericJdbcImportExtractor.java:42)
        at org.apache.sqoop.connector.jdbc.GenericJdbcImportExtractor.extract(GenericJdbcImportExtractor.java:31)
        at org.apache.sqoop.job.mr.SqoopMapper.run(SqoopMapper.java:96)
        ... 7 more
Caused by: java.sql.SQLRecoverableException: No more data to read from socket
</code>

The cause of this final error is still open and needs further investigation.

More success with: [[nosql:oracle_hadoop_big_data_appl_erste_schritte##daten_aus_oracle_mit_sqoop_2_-_19_in_das_hdfs_exportieren|First steps with the Oracle Big Data Appliance]]

==== Using OraOop ====

OraOop is a special plugin for Sqoop that provides faster access to Oracle's RDBMS by using custom protocols that are not publicly available. Quest Software partnered with Oracle to obtain those protocols, implemented them, and created OraOop.

See => [[https://github.com/QuestSoftwareTCD/OracleSQOOPconnector|Quest(r) Data Connector for Oracle and Hadoop]]

See => http://blog.cloudera.com/wp-content/uploads/2011/01/oraoopuserguide-With-OraHive.pdf

Apparently it has not been maintained for the last two years.

==== Sqoop Client API ====

The client API can also be used to access the server directly, for example to integrate Sqoop into your own programs.

=> http://sqoop.apache.org/docs/1.99.2/ClientAPI.html

==== Sources ====

  * http://sqoop.apache.org/docs/1.99.3/Sqoop5MinutesDemo.html

Web:
  * http://hadooped.blogspot.de/2013/06/apache-sqoop-part-3-data-transfer.html
  * http://hadooped.blogspot.de/2013/05/apache-sqoop-for-data-integration.html